Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpf: Fix monitor aggregation for 'from-network' #12559

Merged
merged 1 commit into from Jul 17, 2020

Conversation

joestringer
Copy link
Member

@joestringer joestringer commented Jul 16, 2020

Previously, we did not take into account 'from-network' sources in the
monitor aggregation logic check in send_trace_notify(), which was fine
because we rarely ever sent such events (limited to ipsec for instance).
However, since commit c470e28 we also use this in bpf_host which
suddenly means that any and all traffic from the network will trigger
monitor events, flooding the monitor output.

Fixes: 7a4b0be ("bpf: Add MonitorAggregation option")
Fixes: c470e28 ("Adds TRACE_TO_NETWORK obs label and trace pkts in to-netdev prog.")
Fixes: #12555


This is only problematic in v1.8 due to c470e28 , but the bug goes back to Cilium v1.2. May as well backport to all supported branches.

@joestringer joestringer added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Jul 16, 2020
@joestringer joestringer requested review from pchaigno, Weil0ng and a team July 16, 2020 22:27
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.8.2 Jul 16, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.7.7 Jul 16, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.6.11 Jul 16, 2020
@joestringer
Copy link
Member Author

Ultimately here if you set monitorAggregationLevel to None, you would absolutely get a flood of monitor messages just because any packets reaching the host would be logged. This fixes it by ensuring that monitorAggregationLevel of low or above does not generate events for such packets. We likely still need to look closer at to-network events and the other aspects of the linked issue #12555 (comment) but I believe this will address the core problem where we generated 2.2GB of verbose monitor output in a single test run which triggered a test failure (+CI timeout).

Copy link
Contributor

@Weil0ng Weil0ng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, thanks for the fix!

@Weil0ng
Copy link
Contributor

Weil0ng commented Jul 16, 2020

We likely still need to look closer at to-network events

hmm, it seems like there's no datapath aggregation on egress at all?

@joestringer
Copy link
Member Author

joestringer commented Jul 16, 2020

hmm, it seems like there's no datapath aggregation on egress at all?

Right, typically we have the most information on egress because we went through the entire datapath already and we've made a decision about how to forward the traffic. For most of the hubble use cases we've looked at so far, only gathering egress monitor events is sufficient for the visibility we need.

Furthermore, on egress we will have already gone through the CT table which allows us to filter monitor events based on the number of flows rather than the number of packets; and additionally we store a timestamp of the last monitor event on egress for that CT entry, and will avoid sending extra events for the flow if we already sent an event recently.

So I guess to-network should actually be fine from the above logic.

EDIT: Well, I assumed that to-network trace calls make use of this same infrastructure but at a glance it seems like maybe this is not the case. So to-network may still be noisy. Filed #12561 to follow up.

@joestringer
Copy link
Member Author

test-me-please

Previously, we did not take into account 'from-network' sources in the
monitor aggregation logic check in `send_trace_notify()`, which was fine
because we rarely ever sent such events (limited to ipsec for instance).
However, since commit c470e28 we also use this in bpf_host which
suddenly means that any and all traffic from the network will trigger
monitor events, flooding the monitor output.

Fixes: 7a4b0be ("bpf: Add MonitorAggregation option")
Fixes: c470e28 ("Adds TRACE_TO_NETWORK obs label and trace pkts in to-netdev prog.")
Fixes: cilium#12555

Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer joestringer force-pushed the submit/from-network-aggregation branch from 61837fc to abae0f6 Compare July 16, 2020 22:46
@joestringer
Copy link
Member Author

test-me-please

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find!

@joestringer
Copy link
Member Author

BPF checkpatch is complaining about my use of commit xxxxx in the commit message but at the bottom of the message there's a line that satisfies this so for the future commit log reader, all of the information is there. We can ignore it.

@coveralls
Copy link

coveralls commented Jul 16, 2020

Coverage Status

Coverage increased (+0.002%) to 37.035% when pulling abae0f6 on joestringer:submit/from-network-aggregation into f55ec90 on cilium:master.

@brb
Copy link
Member

brb commented Jul 17, 2020

retest-gke

@borkmann borkmann merged commit 037bef5 into cilium:master Jul 17, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.8 in 1.8.2 Jul 17, 2020
@joestringer joestringer deleted the submit/from-network-aggregation branch July 17, 2020 17:27
@pchaigno
Copy link
Member

I just finished running this fix in a loop, a hundred times, to validate it fixes the flake we were seeing 🎉

Should we nevertheless reopen #12555 to (1) track and address the memory issue (huge spike in memory consumption seen after failure on both Jenkins and locally) and (2) discuss whether we can and should reduce the verbosity level for that test (currently monitor -vv)?

@joestringer
Copy link
Member Author

@pchaigno I considered this, but I think that the intent will be to move all of this CI monitor framework over to Hubble via #12416 & followups, at which point all of that code should be replaced.

@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.8 to Backport done to v1.8 in 1.8.2 Jul 20, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.7 in 1.7.7 Jul 21, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.7 to Backport done to v1.7 in 1.7.7 Jul 22, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.6 in 1.6.11 Jul 30, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.6 in 1.6.11 Jul 30, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.6 to Backport done to v1.6 in 1.6.11 Aug 3, 2020
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.6 to Backport done to v1.6 in 1.6.11 Aug 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
No open projects
1.6.11
Backport done to v1.6
1.7.7
Backport done to v1.7
1.8.2
Backport done to v1.8
9 participants