Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WiP: Do not generate events in eBPF if they are going to be dropped in the filter anyway #27885

Conversation

marqc
Copy link
Contributor

@marqc marqc commented Sep 1, 2023

Do not generate events in eBPF if they are going to be dropped in the filter anyway when the hubble-monitor-events flag is in use

Please ensure your pull request adheres to the following guidelines:

  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.

Do not generate events in bpf programs when they are going to be dropped by the hubble-monitor-events filter in the cilium agent.

The original idea was to skip some types of flows when the system cannot keep up with processing all monitored events #24828. In a typical setup it happens when there is a very high rate of trace events and we are not interested in them at all (and we are only interested in eg. drop or policy verdict events).

Actually we can provide a better solution, by disabling event generation in bpf programs altogether.

I did a simple test case where a heavy number of flows were generated and:
a) no filter was provided
b) current filter implementation was set to only "drop" events
c) this implementation with the same setting of only "drop" events

The results are as follows:

Test case A B C
cilium-agent cpu usage 2.9 vCPU 1.07 vCPU 0.06 vCPU
hubble_events_lost ~250k/s ~41k/s 0/s

With the current approach still the filtered event types may be lost in perf-event-ring-buffer and actually under a heavy traffic they do. Disabling event generation reduces resources usage and leaves enough capacity to process all events of the required event type.

The Trace, Drop and NetworkPolicy events won't be generated when they are filtered out by hubble-monitor-events.

cc: @epk

@marqc marqc requested a review from a team as a code owner September 1, 2023 13:37
@marqc marqc requested a review from gandro September 1, 2023 13:37
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 1, 2023
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Sep 1, 2023
@marqc marqc force-pushed the optimize-event-generation-for-hubble-monitor-events branch from 201b986 to 91e9ee1 Compare September 1, 2023 13:41
Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

Overall, this is a nice improvement and I think filtering in BPF is the right way to do it. However, with this change, a hubble flag (which in the past mainly only affected Hubble itself), will now affect the output of cilium monitor.

This could be surprising to users. We should at least clearly document this more clearly in the description of hubble-monitor-events.

@gandro gandro added kind/performance There is a performance impact of this. area/monitor Impacts monitoring, access logging, flow logging, visibility of datapath traffic. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay labels Sep 4, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 4, 2023
@marqc
Copy link
Contributor Author

marqc commented Sep 4, 2023

Thanks for the PR!

Overall, this is a nice improvement and I think filtering in BPF is the right way to do it. However, with this change, a hubble flag (which in the past mainly only affected Hubble itself), will now affect the output of cilium monitor.

This could be surprising to users. We should at least clearly document this more clearly in the description of hubble-monitor-events.

Yeah. I'm not sure about the API so I'm eager to hear your suggestions. Generally I don't really like the idea of the hubble-monitor-events flag itself, but I understand that sometimes it might be necessary. Maybe this should not be bound to the existing flag, but I should rather introduce a new flag (or flags) for enabling/disabling event generation. There are already flags for some event types (debug). What do you think?

@gandro
Copy link
Member

gandro commented Sep 4, 2023

Yeah. I'm not sure about the API so I'm eager to hear your suggestions. Generally I don't really like the idea of the hubble-monitor-events flag itself, but I understand that sometimes it might be necessary. Maybe this should not be bound to the existing flag, but I should rather introduce a new flag (or flags) for enabling/disabling event generation. There are already flags for some event types (debug). What do you think?

Yeah, I'm not really sure how to best approach this either. The nice thing is that users can manually enable the disabled events in cilium monitor again at run-time, without affecting Hubble, so for troubleshooting there is an escape hatch.

Therefore I personally am fine with hubble-monitor-events affecting cilium monitor, as long as we document it. Another option would be to rename the flag to enabled-monitor-events or similar, to make it a bit more clear that it is a global flag that affects both Hubble and cilium monitor.

Only half-related side-note:We have discussed in the past if it maybe even makes sense to disable trace notifications by default, and mainly rely on policy verdict events (we could also issue them if there is no policy to get observability for the default-allow behavior too). Basically have hubble-monitor-events (or whatever we would call the flag) become opt-in instead of the opt-out that it currently is. But that is a larger discussion and does not need to be addressed by this PR.

@marqc marqc force-pushed the optimize-event-generation-for-hubble-monitor-events branch from 91e9ee1 to e35b5c4 Compare September 4, 2023 13:25
@marqc
Copy link
Contributor Author

marqc commented Sep 4, 2023

Thanks @gandro for your reply. I changed the docs for now.

Copy link
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I'll add the needs-discussion label for now to give other members of @cilium/sig-hubble a chance to also chime in. But if there are no other objections by the end of the week, I think we can go ahead as is.

@gandro gandro added the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Sep 4, 2023
Copy link
Contributor

@michi-covalent michi-covalent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow very cool i didn't even know there were flags to enable/disable individual monitor event types 🥰

@chancez
Copy link
Contributor

chancez commented Sep 5, 2023

With this change, are the other changes in #24828 even needed anymore? Could we get rid of the monitorFilter type and related logic if the data path is no longer generating these events?

@epk
Copy link
Contributor

epk commented Sep 5, 2023

With this change, are the other changes in #24828 even needed anymore? Could we get rid of the monitorFilter type and related logic if the data path is no longer generating these events?

I think this PR will be helpful for us but I would prefer if this event filtering was totally separate from the hubble filtering.

For example: we tap into the monitor socket through a different daemonset where we don't care about the CPU/Memory usage. (the reasoning behind #24828 was to help tame the CPU used by the hubble subsystem so the CNI responsibilities remain snappy)

Also cilium monitor is helpful to debug issues on demand

@learnitall
Copy link
Contributor

This is amazing 😲. It's a great way to give users the ability to tune things as they need. IMO, I think we should follow through with what @gandro was mentioning about changing the flag name and making it opt-in. The opt-in is especially important because a common theme I've heard is how folks will enable hubble and be surprised by the increased resource usage. If users need more events from hubble, then I think it makes sense to have users go through the process of opting-in, as it gives them the opportunity to read in the docs the performance trade-offs, rather than having to discover the trade-off at run time. If users don't need more events from hubble, then they'd get the benefits of lower resource usage and we could potentially save them time figuring out how to tune hubble for their needs.

@marqc, if you have time, do you mind adding a note in Documentation/operations/performance/tuning.rst about this change? (Here's an example of a similar change #26979.) I can also do this as a follow-up after this PR is merged if that'd be helpful.

@marqc
Copy link
Contributor Author

marqc commented Sep 6, 2023

The MonitorFilter configured with hubble-monitor-events covers some more event types than only drop, trace and policy-verdict so I don't find this change as justification to remove this monitor entirely. I'm rather thinking of changing the API and introducing new flags to enable/disable BPF events.
There is currently a config flag debug-verbose: datapath used for enableng DebugEvent which enables Debug and Capture event types in bpf. There is enable-recorder which enables RecCapture event type. There is a flag trace-sock that enables TraceSock event type.

Current state

To summarize we have the following event types and steering flags:

event flag
drop - (always enabled)
debug debug-verbose: datapath
capture debug-verbose: datapath
trace - (always enabled)
policy-verdict - (always enabled)
rec-capture enable-recorder
trace-sock trace-sock

Possible options

1. Separate flag for each type

Add a steering boolean flag for each of the event types that are currently always on: enable-drop-event, enable-trace-event, enable-policy-verdict-event.

Pros:

  • backwards compatible

Cons:

  • complicated setup, a lot of different flags

2. Replace all existing flags with a new list type flag to cover all event types

cilium-monitor-events: drop,policy-verdict

Pros:

  • simple setup with a single flag

Cons:

  • losing backwards compatibility
  • debug and capture event types were tied by feature to debug-verbose flag, we will lose this affiliation

@gandro @chancez @learnitall Please let me know what you think

@gandro
Copy link
Member

gandro commented Sep 6, 2023

Great writeup, thanks! I personally have a slight preference for option two, but I'm fine with either solutions.

A note on Option 1: Yes, separate flags are cumbersome to use in the configmap, but that could be "simplified" via a ciliumMonitorEvents Helm value that is a list of enabled types, that the Helm template then unrolls into "true" or "false" entries for each flag. The end-user experience would then bi similar to option 2.

A note on Option 2: Since debug-verbose is arguably mainly a developer setting that doesn't really work well with production setups, I'm less worried about deprecating and eventually removing those values for the flag.

@@ -1314,11 +1315,16 @@ func initEnv(vp *viper.Viper) {
bpf.CheckOrMountFS(option.Config.BPFRoot)
cgroups.CheckOrMountCgrpFS(option.Config.CGroupRoot)

monitorEvents := option.Config.HubbleMonitorEvents
var enableEventGeneration = func(eventType string) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we default to true if hubble is disabled?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marqc please resolve this discussion if you have adressed it in your current PR. The discussion is blocking the merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on changes agreed on in the general discussion.

@learnitall
Copy link
Contributor

+1 for option two, that sounds great!

@lmb
Copy link
Contributor

lmb commented Sep 25, 2023

/test

@marqc marqc changed the title Do not generate events in eBPF if they are going to be dropped in the filter anyway WiP: Do not generate events in eBPF if they are going to be dropped in the filter anyway Sep 25, 2023
@marqc marqc marked this pull request as draft September 25, 2023 14:08
@marqc marqc force-pushed the optimize-event-generation-for-hubble-monitor-events branch from e35b5c4 to 2f821ad Compare September 25, 2023 14:47
@marqc marqc force-pushed the optimize-event-generation-for-hubble-monitor-events branch from 2f821ad to 710b39f Compare September 27, 2023 08:46
Add monitor-events option allowing to select which event types are to be
generated.

Signed-off-by: Marek Chodor <mchodor@google.com>
@marqc marqc force-pushed the optimize-event-generation-for-hubble-monitor-events branch from 710b39f to 68e0cb9 Compare October 3, 2023 13:02
@rolinh
Copy link
Member

rolinh commented Oct 6, 2023

Just chiming in: this is great work and I also favor option 2.

@chancez
Copy link
Contributor

chancez commented Oct 6, 2023

I also like option 2 overall.

Copy link

github-actions bot commented Nov 6, 2023

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Nov 6, 2023
Copy link

This pull request has not seen any activity since it was marked stale.
Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/monitor Impacts monitoring, access logging, flow logging, visibility of datapath traffic. dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. kind/community-contribution This was a contribution made by a community member. kind/performance There is a performance impact of this. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/hubble Impacts hubble server or relay stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants