Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: add several new flags to control Cilium's datapath events notifications #30063

Merged
merged 1 commit into from Mar 7, 2024

Conversation

mvisonneau
Copy link
Contributor

@mvisonneau mvisonneau commented Dec 28, 2023

This commit introduces three new configuration flags for the Cilium agent, allowing users to choose the bpf event types they want to expose to Cilium monitor and Hubble.

  • --bpf-events-drop-enabled Expose 'drop' events for Cilium monitor and/or Hubble (default true)
  • --bpf-events-policy-verdict-enabled Expose 'policy verdict' events for Cilium monitor and/or Hubble (default true)
  • --bpf-events-trace-enabled Expose 'trace' events for Cilium monitor and/or Hubble (default true)

The default values for these flags remain set to true, not changing the current behaviour.

In our case, we found particularly useful to disable the TraceNotification in order to reduce the CPU overhead on some of our nodes when Hubble is enabled as we were mostly interested into dropped packets.

agent: add several new flags to control Cilium's datapath events notifications

@mvisonneau mvisonneau requested review from a team as code owners December 28, 2023 10:30
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 28, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Dec 28, 2023
Copy link
Member

@asauber asauber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment on the semantics, otherwise LGTM.

Others may have more thoughts, as I'm not very familiar with the runtime behavior of the Envoy config.

daemon/cmd/daemon_main.go Outdated Show resolved Hide resolved
Copy link
Member

@asauber asauber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks for the updates.

Copy link
Member

@meyskens meyskens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@meyskens
Copy link
Member

meyskens commented Jan 5, 2024

You might need to run make -C Documentation update-cmdref (again) so that CI is happy :)

@mvisonneau
Copy link
Contributor Author

indeed, thanks! 🙈

@mvisonneau
Copy link
Contributor Author

Just realizing that I missed out on @EricMountain's draft PR #30031

@joestringer, are you happy with the way it looks or would you prefer if we pursue a more generic command-line option?

@kaworu kaworu added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay dont-merge/blocked Another PR must be merged before this one. labels Jan 8, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jan 8, 2024
@kaworu kaworu requested review from kaworu and a team January 8, 2024 16:43
@joestringer
Copy link
Member

Thanks for the PR. My gut feeling is it'd be simpler to have a single notify flag that allows defining which notifications to enable (with a default that enables all notification types). I see that @kaworu has expressed interest in this PR and I think he could have some useful input here as well.

Copy link
Member

@kaworu kaworu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @mvisonneau!

For context, This PR will introduce rate-limiting for events, and one of the potential follow-up is per-event type rate-limiting.

This PR is currently missing configMap / Helm values corresponding to the new flags, and I'd to find a way to expose cleanly enable / disable toggles (introduced by this PR) and rate-limiting configuration. How about something like this:

-- control events generated by the Cilium datapath exposed to cilium monitor and hubble.
events:
  drop:
    enabled: true
  policyVerdict:
    enabled: true
  trace:
    enabled: true

That way we have a per event-type section. The global rate-limiting PR can introduce values under the events section, and the per event-type rate-limiting follow-up under the drop / policyVerdict / trace sections. What do you think @joestringer?

@joestringer
Copy link
Member

@kaworu I like that proposal. This should enable us to subsequently add more specific configurations for each event type.

@mvisonneau What do you think about the Helm configuration proposed by @kaworu above?

One more thing, this PR describes the change as modifying the behaviour for endpoint event notifications, but I think that's incorrect. I believe this impacts all datapath event notifications (via daemon.GetOptions). Please double-check and consider renaming + changing the release note and PR description to reflect this.

@julianwiedmann
Copy link
Member

The PR already picked up conflicts that need to be resolved :/

@julianwiedmann julianwiedmann removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 29, 2024
@mvisonneau
Copy link
Contributor Author

argh indeed, I just rebased, hope CI will still pass! 🤞

@julianwiedmann
Copy link
Member

/test

@julianwiedmann julianwiedmann removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Jan 29, 2024
@ti-mo ti-mo added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Feb 7, 2024
@ti-mo
Copy link
Contributor

ti-mo commented Feb 7, 2024

This needs another rebase, sorry.

@mvisonneau
Copy link
Contributor Author

apologies for the delay, it should now be rebased! :)

@joestringer
Copy link
Member

/test

@mvisonneau
Copy link
Contributor Author

mvisonneau commented Feb 22, 2024

should I assume that the failing tests are flakes or is there something else required on my end? 🤔

@tommyp1ckles
Copy link
Contributor

@mvisonneau ipsec upgrade can be flakey, and seems unlikely to be related to these changes.

For the Travis failure, it looks like a similar flake issue/fix was closed recently . Perhaps try rebasing to ensure your branch has those changes?

@mvisonneau
Copy link
Contributor Author

thanks for your guidance @tommyp1ckles, I've once again rebased! 🤞

@ldelossa
Copy link
Contributor

ldelossa commented Mar 1, 2024

/test

@joestringer
Copy link
Member

Heads up that the main blocker on this PR is getting the required jobs to pass. The L4LB test has been retriggered five times over the past couple of weeks and seems to be consistently failing in this way:

curl: (28) Failed to connect to 10.0.0.2 port 80 after 135242 ms: Connection timed out
+[18:53:45] echo 'Failed 1'
Failed 1
+[18:53:45] exit -1
Error: Process completed with exit code 255.

It's not clear to me why this PR would be triggering such a failure, but it's also hard for us to ignore a consistent signal from CI that the tests are failing and then merge the PR. It may be worth looking around in the issues to see if there are similar failures reported by others, and worst case rebase again to see if it was a problem already fixed by another PR in the main branch.

And yes I recognize that asking to rebase again is a bit tedious, unfortunately it looks like this PR may have been adversely impacted by the elevated background failure rate in the tree since late Jan / early feb (via the CI dashboard). We're gradually improving some of the individual workflows that have been failing more often, but a couple of the workflows still seem to have elevated failure rates:

image

@mvisonneau
Copy link
Contributor Author

👋 hey @joestringer, thanks for the feedback! I've just stumbled upon #31167, going to attempt rebasing as well as I am also unsure where to start looking otherwise 🙈

…ions

This commit introduces three new configuration flags for the Cilium agent, allowing users to choose the bpf event types they want to expose to Cilium monitor and Hubble.
 - `--bpf-events-drop-enabled`                                   Expose 'drop' events for Cilium monitor and/or Hubble (default true)
 - `--bpf-events-policy-verdict-enabled`                         Expose 'policy verdict' events for Cilium monitor and/or Hubble (default true)
 - `--bpf-events-trace-enabled`                                  Expose 'trace' events for Cilium monitor and/or Hubble (default true)

The default values for these flags remain set to `true`, not changing the current behaviour.

In our case, we found particularly useful to disable the TraceNotification in order to reduce the CPU overhead on some of our nodes when Hubble is enabled as we were mostly interested into dropped packets.

Signed-off-by: Maxime Visonneau <maxime.visonneau@gmail.com>
@hemanthmalla
Copy link
Member

/test

@mvisonneau
Copy link
Contributor Author

seems like it did the trick! 🙌

@aanm aanm removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Mar 7, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 7, 2024
@joestringer joestringer added this pull request to the merge queue Mar 7, 2024
@joestringer
Copy link
Member

@mvisonneau thanks for the contribution!

Merged via the queue into cilium:main with commit 5c2ed45 Mar 7, 2024
62 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet