New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hubble/recorder: Refactor service implementation to fix multiple races #16472
Conversation
test-me-please |
lgtm. would be good to get another review on 07e4f36 |
Thanks @michi-covalent! I know this is not a very nice change to review. I'm also happy to walk through the changes via Zoom with someone if interested. |
Converting back into draft. The code review via Zoom revealed we don't have to drain the queue anymore when stopping. |
07e4f36
to
c310ed3
Compare
As discussed offline, I pushed a new commit (5d512ba179bd4d1286fc87c527a78a3353bf0d7c, second one in the history) which replaces two previous commits around the handling of |
test-me-please Travis hit #11560 - restarting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well structured PR was nice to review and the inline comments help a great deal understanding the control flow. Thanks!
Two small nits inline.
@tklauser I addressed your feedback (and them some) in a separate commit fe4a09d19b35b234d0f9d4f670654cbeea3b4fa0 |
fe4a09d
to
4a88bee
Compare
test-me-please |
test-runtime Edit: Hit was looks like a transient error https://jenkins.cilium.io/job/Cilium-PR-Runtime-4.9/5082/ |
test-1.21-4.9 Edit: Hit was looks like a timeout https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/829 Edit 2: Somehow didn't get triggered. Re-triggering below. |
test-1.21-4.9 |
Need to rebase yet again to pull in #16646 |
4a88bee
to
1bb0073
Compare
test-me-please Edit: This is a bug fix, so probably not part of the merge freeze, rebased to pull in latest master. |
This commit splits the sink's `done chan error` channel (which only allows a single consumer to wait on the sink to finish) into a `chan struct{}` channel and a `lastError error` variable. This enables us to signal that the sink has finished by closing the channel instead of sending a value over it. Closing the channel allows multiple go routines to block on this event via `<-s.done`. The final error value can then be retrieved via `s.err()`. This pattern is very similar to how `context.Context` works. This commit does not yet make use of this functionality. The changes enabled by this will follow in a subsequent commit. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Previously, we waited for the recording queue to drain when the client requested a stop. However, because the client has no visibility into the queue (and indeed doesn't even know if there are queued records when they issue a stop request), this does not provide any value to the client. Therefore, this PR changes the semantics of a stop request by immediately initiating a shutdown, instead of waiting for the queue to drain. This ensures that the resulting recording more closely matches the observed statistics at the time when the client issued a stop request. It also simplifies the code a bit. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit changes the interface of sink.Dispatch from an explicit `RegisterSink`+`UnregisterSink` pair to a `StartSink` call which will unregister itself when it stops due to an error, an expired context, or an explicit stop request. This commit does not introduce any functional changes, it is purely a refactoring. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Since we have introduced the `Handle.Done` channel, we do not have toV signal the shutdown of the sink by closing the statistics channel anymore. Instead, consumers can now wait on the `Handle.Done` channel getting closed. While there are not many benefits in this version of the code, it will make the select statement in a subsequent commit much more readable. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit fixes a concurrency issue in the implementation of the Hubble Recorder API. Before this commit, we were sending responses to the client from both the main `Record` function, as well as the `watchRecording` function which was spawned in a separate go routine. However, sending to a grpc.ServerStream from multiple go routines is _not_ safe: https://pkg.go.dev/google.golang.org/grpc#ServerStream It is however safe to have one go routine receive from, and another go routine send to the stream. Therefore, this commit restructures the Hubble Recorder API in such a way that only the `Record` stub ever sends back messages to the client. Receiving is done in a separate go routine which forwards all received messages into a channel, allowing us to select on incoming responses. In addition, this commit hopefully also makes the logic a bit more easier to read, as it tries to separate the cleanup of resources and communicating with the client a bit more explicitly. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit documents what fields are now protected by the mutex in `type sink` and updates a two usages accordingly, by moving channel operations out of the critical section. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
1bb0073
to
2c65520
Compare
test-me-please Seems potentially related to #16928 (due to the |
I'm marking this ready-to-merge. Reasoning: This is fixing a bug in a released Cilium version and thus should be exempt from the zero-flake policy. It's almost 2 months old at this point and has hit various Jenkins flakes over its lifetime, but always different ones and all looked unrelated to this PR, as the pcap recorder (which this PR is touching) is not enabled in neither Jenkins CI nor our conformance tests. |
This PR fixes multiple concurrency issues in the implementation of the
Hubble Recorder API. The main one being that we were sending responses
to the client from both the main
Record
function, as well as thewatchRecording
function which was spawned in a separate go routine.However, sending to a grpc.ServerStream from multiple go routines is
not safe: https://pkg.go.dev/google.golang.org/grpc#ServerStream
It is however safe to have one go routine receive from, and another go routine
send to the stream.
Therefore, this commit restructures the Hubble Recorder API in such a
way that only the
Record
stub ever sends back messages to the client.Receiving is done in a separate go routine which forwards all received
messages into a channel, allowing us to select on incoming responses.
In addition, this commit hopefully also makes the logic a bit more
easier to read, as it tries to separate the cleanup of resources and
communicating with the client a bit more explicitly. It also drastically
simplifies the implementation of a follow-up PR which extends the
API with stop conditions (#16473).
I apologize for the size, but I had to restructure quite a bit to fix
the underlying issue. I highly recommended to review per commit.