hubble: Fix Race in Hubble Consumer #15967
I was looking through the hubble code today to see how the BPF perf map gets turned into the hubble perf ring buffer, and whether a direct read might be possible (it seems it is not). In the course of doing that, though, I ran into this, which I think is a legitimate bug.
Good catch! Thanks for fixing it!
When the hubble event consumer channel is full there is a possible race condition. The code was structured such that the consumer could lose track of the number of lost events, because it was updating the count without holding a lock. Additionally, it could reach a state in which it reset the count and sent a recovery notification more than once. Finally, there was a log message for dropped events, but it was logged at "debug" level rather than at the warning level it should be.

Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
test-me-please
Please triage the 5 failing checks and if they're all truly unrelated, briefly describe why we should be able to ignore them.
All test failures were either network or DNS failures, and in one weird case a success that was supposed to be a failure. The likelihood that any of these failures was caused by this PR is as close to nil as I can imagine.