
CFP-2023-05-17: Hubble Flow Logs #25508

Closed
AwesomePatrol opened this issue May 17, 2023 · 17 comments · Fixed by #28873

@AwesomePatrol
Contributor

Cilium Feature Proposal

Is your feature request related to a problem?

Without Hubble Timescape, it is impossible to get information about past Hubble flows (beyond the history stored in the ring buffer).

Describe the feature you'd like

I want an option to output information on flows for a period of time (possibly indefinitely).

(Optional) Describe your proposed solution

I can't link to the doc, so I will copy its content into the first comment.

@AwesomePatrol AwesomePatrol added the kind/feature This introduces new functionality. label May 17, 2023
@AwesomePatrol
Contributor Author

AwesomePatrol commented May 17, 2023

Motivation

Many users of k8s are interested in collecting information about traffic patterns from certain workloads for security and audit purposes. They would like to utilize an existing observability stack (for example Loki+Grafana) instead of introducing and maintaining a new tool such as Hubble Timescape.

Proposal

Add a new CRD that will be watched by cilium-agent. While end is in the future, the agent will process Hubble flows: those that match the filter will be stripped of all information outside the field mask and logged to a file. Most observability tools can be configured to ingest such a file.
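
A minimal sketch of this per-flow pipeline (the map-based Flow type and the matches callback are illustrative stand-ins, not actual Cilium APIs):

import (
	"encoding/json"
	"fmt"
	"io"
	"time"
)

// Flow stands in for the real Hubble flow message.
type Flow map[string]any

// logFlow sketches the agent-side work per flow while end is in the
// future: drop non-matching flows, keep only the fields in the mask,
// and append one JSON line to the output file.
func logFlow(f Flow, fieldMask []string, end time.Time, matches func(Flow) bool, out io.Writer) error {
	if time.Now().After(end) {
		return nil // the logging window has expired
	}
	if !matches(f) {
		return nil // rejected by the allowlist/denylist filter
	}
	masked := Flow{} // dotted subfields omitted for brevity
	for _, field := range fieldMask {
		if v, ok := f[field]; ok {
			masked[field] = v
		}
	}
	line, err := json.Marshal(masked)
	if err != nil {
		return err
	}
	_, err = fmt.Fprintln(out, string(line))
	return err
}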

Definition

# Not a real CRD
kind: flowlogging.cilium.io
spec:
  properties:
    fieldmask:
      description: "List of field names that will be kept in the log output"
      type: array
      items:
        description: "Field name that will be kept in the output. Subfields are separated by dot"
        type: string
    filter:
      description: "Filter"
      type: object
      properties:
        allowlist:
          description: "Allowlist filter generated by CLI. Empty matches all. Flow passes the filter if any of the entries match. Only Flows matching Allowlist, but not matching Denylist will be logged"
          type: array
          items:
            description: "Allowlist entry"
            type: string
        denylist:
          description: "Denylist filter generated by CLI. Empty matches none. Flow passes the filter if none of the entries match. Only Flows matching Allowlist, but not matching Denylist will be logged"
          type: array
          items:
            description: "Denylist entry"
            type: string
    end:
      description: "End denotes the timestamp when logging will stop"
      type: string

Examples

Example of collecting Flow logs for a specific user-project.

  • until 15:04:05 on 2023-05-10 (PDT)
  • from all pods in a user-project namespace
  • but exclude kube-dns traffic in both directions
  • keep only 4 fields: time, src, dst, verdict (forward/drop)
spec:
  fieldmask:
    - "time"
    - "source.pod_name"
    - "destination.pod_name"
    - "verdict"
  filter:
    allowlist:
      - '{"source_pod":["user-project/"]}'
      - '{"destination_pod":["user-project/"]}'
    denylist:
      - '{"source_label":["k8s:k8s-app=kube-dns"]}'
      - '{"destination_label":["k8s:k8s-app=kube-dns"]}'
  end: "2023-05-10T15:04:05-07:00"

Example of collecting Flow logs for an entire organization.

  • until 15:04:05 on 2023-05-10 (PDT)
  • from all pods in the cluster
  • keep all information
spec:
  fieldmask:
  filter:
  end: "2023-05-10T15:04:05-07:00"

Example of stopping Flow logs collection:

  • end is set to a timestamp in the past
spec:
  fieldmask:
  filter:
  end: "1970-01-01T00:00:00"

@rolinh rolinh added the sig/hubble Impacts hubble server or relay label May 17, 2023
@EItanya

EItanya commented May 24, 2023

Thinking out loud, I wonder if this would already be possible with something like https://github.com/cilium/hubble-otel. You could send all of the flows through OTel and do all the filtering/storage that way.

@AwesomePatrol
Contributor Author

Thanks for pointing it out to me. This may indeed cover my use case. However, as it is a separate component, I wonder if it wastes too many resources by requiring every flow to go through gRPC. I need to investigate it further and hope to have results next week.

@EItanya

EItanya commented May 25, 2023

It totally depends on your use case. OTel Collectors can run as a DaemonSet, so you could potentially send everything via a Unix domain socket and then filter using OTel processors.

@AwesomePatrol
Contributor Author

Unfortunately there are too many issues with the OTel approach:

  1. The repository is not maintained: most instructions no longer work, and the code dependencies haven't been updated in almost 2 years
  2. Documentation is almost non-existent: I can read the flags from the code, but I have no idea how to use them in the configuration (see How do i set hubble.tls.enable to false? hubble-otel#94)
  3. Resource consumption makes it impossible to use as a DaemonSet: gRPC -> logs, with only 15 flows/s, consumed 27 CPU and 121 MB of memory (maybe with Add new experiments to reduce Hubble traffic #25647 I could run it as a sidecar to hubble-relay)
  4. Traces in Jaeger are very hit or miss

@EItanya

EItanya commented Jun 1, 2023

That's too bad.

Given that the CFP here would require engineering work either way, I think that work would be better directed at fixing up the OTel approach rather than adding this directly into Hubble. That would immediately fix issues 1 and 2 outlined above.

Also, if there were a robust OTel approach available, it would allow consumers of this data to plug it into their existing data pipelines very easily, rather than needing to spin up new tooling.

As far as resource consumption goes, the logic to handle the flows needs to happen somewhere, whether it's in the OTel pod or somewhere else, so I think optimizing it within the pipeline makes sense either way.

@AwesomePatrol
Contributor Author

CFP here would require engineering work either way

We already have very similar functionality, so I would say that over 60% of the work for this is already done.

The logic to handle the flows needs to happen somewhere, whether it's in the oTEL pod or somewhere else

That consumption is still surprising to me, as other components (Hubble CLI/Relay) use ~20 MB. It seems to be mainly caused by OTel boilerplate, as I see a similar collector using 110 MB while doing pretty much nothing.

I think optimizing it within the pipeline makes sense either way.

I see that it is possible to run hubble-otel in standalone mode, which gives me hope that I can use it without introducing the OpenTelemetry stack (just yet).

I started working on getting it working again. Let's see if I need to become a maintainer to change it quickly to suit my needs 😄

@AwesomePatrol
Contributor Author

AwesomePatrol commented Jun 5, 2023

As discussed on slack, https://github.com/cilium/hubble-otel is unfortunately abandoned and there is no willing maintainer to take care of it.

About the proposal: instead of having JSON embedded in YAML, I think it is better to just use flow.Filter directly.

The Go struct will look like this:

type FlowLogging struct {
  // List of field names that will be kept in the log output
  FieldMask         []string
  // Filters specifies flows to include in the logging output. Flow passes the
  // filter if any of the entries match. Only Flows matching the filter, but
  // not matching exclude-filter will be logged. The check is disabled when
  // empty.
  Filters           []flow.Filter
  ExcludeFilters    []flow.Filter
  // Expires denotes the timestamp when logging will stop.
  Expires           metav1.Time
}
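
The allow/deny semantics spelled out in the comments could be sketched as follows (matchEntry stands in for the real flow.Filter matching logic and is not an existing function):

func passes(f *flow.Flow, filters, excludeFilters []flow.Filter) bool {
	// Empty Filters matches all flows; empty ExcludeFilters matches none.
	return matchAny(f, filters, true) && !matchAny(f, excludeFilters, false)
}

func matchAny(f *flow.Flow, entries []flow.Filter, emptyResult bool) bool {
	if len(entries) == 0 {
		return emptyResult
	}
	for _, e := range entries {
		if matchEntry(f, e) { // hypothetical per-entry match
			return true
		}
	}
	return false
}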

And the first example:

spec:
  fieldMask:
    - "time"
    - "source.pod_name"
    - "destination.pod_name"
    - "verdict"
  filters:
    - source_pod:
      - "user-project/"
    - destination_pod:
      - "user-project/"
  excludeFilters:
    - source_label:
      - "k8s:k8s-app=kube-dns"
    - destination_label:
      - "k8s:k8s-app=kube-dns"
  expires: "2023-05-10T15:04:05-07:00"

The latest definition is part of #26646

@dylandreimerink dylandreimerink added the sig/agent Cilium agent related. label Jun 5, 2023
@kaworu
Member

kaworu commented Jun 5, 2023

Thanks for opening this CFP @AwesomePatrol; I agree that export configuration through a CRD would be very useful.

First, note that although currently undocumented, there is already some facility to export Hubble flows to a file (see here and here). While it doesn't implement filtering, field masking, or a deadline (as the CFP suggests), it does implement file rotation and compression.

About the CFP: is the goal to have a single well-defined target output file and expect exactly one CR to drive the export? If yes, what happens when I submit two FlowLogging CRs? If not, shouldn't the target output file be part of the CRD? (But then what about two CRs defining the same output file? Are the filters etc. ORed?)

@AwesomePatrol
Contributor Author

there is already some facility to export Hubble flows

Are ConfigMap changes applied without a cilium-agent restart? If yes, adding support for field masking or filtering would likely cover my use case.

is the goal to have a single well-defined target output file and expect exactly one CR to drive the export?

An idea was to add a webhook limiting CRs to one per cluster (with a well-known name). Otherwise, we would probably need another CR holding a list of FlowLogging CRs to apply, which could fail to reconcile when there are conflicts, etc. I think that is too complicated, since more complex filtering can be applied to logs that were already collected. The one-CR-per-cluster approach also solves the problem of write amplification (when multiple requests log the same data to different files).

what about two CRs defining the same output file? Are the filters etc. ORed?

I think we discussed it previously. Combining filters is quite complicated. For example:

  1. first filter: allow: {src_pod: ["nsA/podA"]}, deny: {dst_pod: ["nsB/"]}
  2. second filter: allow: {dst_pod: ["nsB/podB"]}

Combining these filters would result in the second request giving no results, as everything it matches would be denied by the first one.

Another solution to the above would be creating an intermediate CR, FlowLogRequest, which would define the filters, field mask, etc. A controller (cilium-operator?) would then create FlowLogging CRs with a unique name (and export path) and a deadline set to the correct value (now + duration).

@AwesomePatrol
Contributor Author

Are ConfigMap changes applied without cilium-agent restart?

I experimented with the configuration update. I created a small proof of concept that reloads the Hubble subsystem when the ConfigMap changes in AwesomePatrol@b6c4d25

It works, but the main pain point is that it can take a long time for changes to take effect (up to 2m for ConfigMap propagation; the Hubble restart itself is quick). However, I believe that hubble-relay already covers the use cases where someone wants this information quickly. I am not sure how much work it would require to completely clean up all resources and track config changes. Probably less than adding a new CRD.

is the goal to have a single well-defined target output file and expect exactly one CR to drive the export? what about two CRs defining the same output file?

I think the output file path could be inferred from the object's name, as that neatly prevents two objects from specifying the same output. If an object is recreated, it simply overwrites the previous file. Alternatively, it could use a UUID in its path.
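
For illustration, the name-to-path mapping could be as simple as this (the directory and extension are assumptions, not an agreed layout):

import "path/filepath"

// exportPath gives each CR its own output file, so two objects can never
// write to the same path; recreating a CR overwrites the previous file.
func exportPath(crName string) string {
	return filepath.Join("/var/run/cilium/hubble/flowlogs", crName+".log")
}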

Are the filters etc. ORed?

There is no need to combine filters when multiple CRs are present. They can work in independent threads, outputting to different files.

end vs duration

If we make this CRD immutable, we can have both (with end perhaps taking precedence over duration): end <- CR.CreateTime + duration. I think it would be more user-friendly. What do you think?
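
In code, that precedence rule would look roughly like this (a sketch; parameter names are assumptions):

import "time"

// effectiveEnd resolves when logging stops: an explicit end wins,
// otherwise it is derived from the CR creation time plus duration.
func effectiveEnd(end *time.Time, created time.Time, duration time.Duration) time.Time {
	if end != nil {
		return *end
	}
	return created.Add(duration)
}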

@kaworu
Member

kaworu commented Jun 13, 2023

I agree that a CRD approach is more flexible than the ConfigMap, even if the ConfigMap reload worked flawlessly.

I think that the output file path could be inferred from its name as it neatly prevents having two objects specify the same output. And in the case that an object is recreated it just overwrites the previous file.

Sounds reasonable to me.

There is no need to combine filters when multiple CRs are present. They can work in independent threads outputting to different files.

This sounds great, but please keep in mind that serializing protobuf structs to JSON is currently resource-hungry. When multiple CRs select a given flow to be exported into their respective files, the implementation should encode it only once.
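
A sketch of the encode-once idea, assuming a hypothetical per-CR Exporter type: the flow is marshaled lazily on the first match, and the same bytes are written to every other matching CR's file. Note that this assumes all matching CRs want the same encoding; per-CR field masks complicate it.

import (
	"encoding/json"
	"io"
)

type Exporter struct {
	Matches func(f any) bool // compiled from the CR's filters
	Out     io.Writer        // the CR's output file
}

func writeOnce(f any, exporters []Exporter) error {
	var encoded []byte
	for _, e := range exporters {
		if !e.Matches(f) {
			continue
		}
		if encoded == nil { // first matching exporter pays the cost
			b, err := json.Marshal(f)
			if err != nil {
				return err
			}
			encoded = append(b, '\n')
		}
		if _, err := e.Out.Write(encoded); err != nil {
			return err
		}
	}
	return nil
}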

If we make this CRD immutable, we can have both (maybe end taking precedence over duration): end <- CR.CreateTime + duration. I think it would be more user-friendly. What do you think?

👍

@AwesomePatrol
Contributor Author

Sounds reasonable to me

I will start the initial implementation then.

the implementation should encode it once

This could work for filters, but not for the field mask. The initial idea is to skip the masked-out parts when writing the serialized object, but this will likely be complex. We could consider adding support for lightweight output formats (LTSV or logfmt), which could make field masking even cheaper (by implementing it as part of the serialization process).
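
For example, applying the mask during serialization and emitting logfmt directly could look like the sketch below (it operates on flattened key/value pairs; logfmt output is only a possibility discussed here, not an existing Hubble format):

import (
	"fmt"
	"io"
	"strings"
)

// logfmtLine writes only the masked fields as key="value" pairs, so
// masking happens as part of serialization rather than as a separate
// copy-and-strip pass over the full flow.
func logfmtLine(fields map[string]string, mask []string, out io.Writer) error {
	var b strings.Builder
	for _, key := range mask {
		if v, ok := fields[key]; ok {
			if b.Len() > 0 {
				b.WriteByte(' ')
			}
			fmt.Fprintf(&b, "%s=%q", key, v)
		}
	}
	_, err := fmt.Fprintln(out, b.String())
	return err
}

With the first example's field mask, each line would look like: time="..." source.pod_name="..." destination.pod_name="..." verdict="...".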

@SirNexus

@AwesomePatrol how is this work progressing?

@AwesomePatrol
Contributor Author

@AwesomePatrol how is this work progressing?

I was offline for a week (not sure how to nicely communicate that on GitHub).

I am expanding the Hubble exporter to support field masks and filters in #26379.
I also have a draft of the FlowLogging CRD in https://github.com/AwesomePatrol/cilium/tree/flow-logging-crd, but I was distracted by #26367, so I hope that will be resolved soon.

Hopefully I will get everything review-ready this week.

@chancez
Contributor

chancez commented Aug 3, 2023

@AwesomePatrol what is the purpose of the end field, exactly? I'm struggling to understand the use case for it.

@AwesomePatrol
Contributor Author

It was renamed to Expires. I will update it later.

The use case is that you want to capture flows over a period of time across multiple clusters. Synchronization between clusters isn't instant, so it is best to set an expiration, as it can be exactly the same on all of them. The same goes for nodes in a single cluster.

For example, it is 15:00. I want to capture some logs and start debugging in 30m, so I set Expires: 15:30. Even if one of the clusters was slower to get the CiliumFlowLogging created, or some nodes in the same cluster got notified only at 15:05, I know that I can download all the logs I need at 15:31 and no more logs will be produced.

Moreover, even with filters and a field mask, having it run indefinitely can be costly, so I think it is better to have it expire at some point instead of relying on the user to delete the CR and hoping that all nodes reconcile it quickly enough.

For having it running all the time, I think implementing some simple storage solution could be better (maybe dumping into an SQLite DB on the node and querying it with regular hubble observe, or having hubble-relay dump it somewhere).

marqc added a commit to marqc/cilium that referenced this issue Nov 2, 2023
Plugin will be used to generate yaml annotation in structs generated
from proto to be able to unmarshall FlowFilter from YAML for cilium#25508.

Signed-off-by: Marek Chodor <mchodor@google.com>
marqc added a commit to marqc/cilium that referenced this issue Nov 2, 2023
Yaml annotations are needed to unmarshall FlowFilter structs for flow
logging feature cilium#25508.

Signed-off-by: Marek Chodor <mchodor@google.com>