bpf: Fix monitor aggregation for 'from-network' #12559
Ultimately here if you set |
oops, thanks for the fix!
hmm, it seems like there's no datapath aggregation on egress at all? |
Right, typically we have the most information on egress because we have already gone through the entire datapath and decided how to forward the traffic. For most of the Hubble use cases we've looked at so far, gathering only egress monitor events is sufficient for the visibility we need. Furthermore, on egress we will already have gone through the CT table, which allows us to filter monitor events based on the number of flows rather than the number of packets. We also store a timestamp of the last monitor event on egress for that CT entry, and avoid sending extra events for the flow if we already sent one recently. So I guess EDIT: Well, I assumed that |
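To make the per-flow rate limiting described above concrete, here is a minimal C sketch. The struct, the field names, and the 5-second interval are all hypothetical stand-ins chosen for illustration; they are not Cilium's actual CT entry layout or configured interval.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a conntrack (CT) entry. The field names are
 * assumptions for illustration, not Cilium's actual struct layout. */
struct ct_entry_sketch {
	uint64_t last_report_ns; /* time of the last monitor event */
	bool reported;           /* whether any event was sent for this flow */
};

/* Hypothetical minimum interval between monitor events for one flow. */
#define REPORT_INTERVAL_NS (5ULL * 1000000000ULL)

/* Returns true if a monitor event should be emitted for this packet and,
 * when it does, records the event time in the CT entry. */
static bool should_emit_event(struct ct_entry_sketch *ct, uint64_t now_ns)
{
	if (ct->reported &&
	    now_ns - ct->last_report_ns < REPORT_INTERVAL_NS)
		return false; /* reported recently: aggregate this packet away */

	ct->last_report_ns = now_ns;
	ct->reported = true;
	return true;
}
```

With this shape, the number of events scales with the number of flows and the elapsed time rather than with the number of packets, which is the property the comment above relies on.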
test-me-please |
Previously, we did not take into account 'from-network' sources in the monitor aggregation logic check in `send_trace_notify()`, which was fine because we rarely ever sent such events (limited to ipsec for instance). However, since commit c470e28 we also use this in bpf_host, which suddenly means that any and all traffic from the network will trigger monitor events, flooding the monitor output.

Fixes: 7a4b0be ("bpf: Add MonitorAggregation option")
Fixes: c470e28 ("Adds TRACE_TO_NETWORK obs label and trace pkts in to-netdev prog.")
Fixes: cilium#12555
Signed-off-by: Joe Stringer <joe@cilium.io>
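As a rough illustration of the kind of check the commit message describes, the sketch below drops events from ingress-side observation points once aggregation is enabled. The enum names, their values, and the function name are hypothetical, and the assumption that aggregation suppresses all "from-*" sources is inferred from the discussion in this thread; this is not the actual body of `send_trace_notify()`.

```c
#include <stdbool.h>

/* Hypothetical observation points; names and values are illustrative only. */
enum obs_point_sketch {
	OBS_TO_LXC,
	OBS_TO_NETWORK,
	OBS_FROM_LXC,
	OBS_FROM_HOST,
	OBS_FROM_NETWORK,
};

/* Hypothetical aggregation levels: none, or aggregate receive-side events. */
enum aggregation_sketch {
	AGG_NONE,
	AGG_RX,
};

/* Returns true if a trace event should be emitted for this observation
 * point at the given aggregation level. */
static bool emit_for_obs_point(enum aggregation_sketch level,
			       enum obs_point_sketch obs)
{
	if (level >= AGG_RX) {
		switch (obs) {
		case OBS_FROM_LXC:
		case OBS_FROM_HOST:
		case OBS_FROM_NETWORK: /* the source this PR adds to the check */
			return false;
		default:
			break;
		}
	}
	return true;
}
```

Without the `OBS_FROM_NETWORK` case, every packet arriving from the network would fall through to `return true`, which matches the flooding behavior reported in the linked issue.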
61837fc to abae0f6
test-me-please |
Nice find!
BPF checkpatch is complaining about my use of |
retest-gke |
I just finished running this fix in a loop, a hundred times, to validate it fixes the flake we were seeing 🎉 Should we nevertheless reopen #12555 to (1) track and address the memory issue (huge spike in memory consumption seen after failure on both Jenkins and locally) and (2) discuss whether we can and should reduce the verbosity level for that test (currently |
This is only problematic in v1.8 due to c470e28, but the bug goes back to Cilium v1.2. May as well backport to all supported branches.