Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hubble/recorder: Refactor service implementation to fix multiple races #16472

Merged
merged 6 commits into from Jul 28, 2021

Conversation

gandro
Copy link
Member

@gandro gandro commented Jun 8, 2021

This PR fixes multiple concurrency issues in the implementation of the
Hubble Recorder API. The main one being that we were sending responses
to the client from both the main Record function, as well as the
watchRecording function which was spawned in a separate go routine.

However, sending to a grpc.ServerStream from multiple go routines is
not safe: https://pkg.go.dev/google.golang.org/grpc#ServerStream
It is however safe to have one go routine receive from, and another go routine
send to the stream.

Therefore, this commit restructures the Hubble Recorder API in such a
way that only the Record stub ever sends back messages to the client.
Receiving is done in a separate go routine which forwards all received
messages into a channel, allowing us to select on incoming responses.

In addition, this commit hopefully also makes the logic a bit more
easier to read, as it tries to separate the cleanup of resources and
communicating with the client a bit more explicitly. It also drastically
simplifies the implementation of a follow-up PR which extends the
API with stop conditions (#16473).

I apologize for the size, but I had to restructure quite a bit to fix
the underlying issue. I highly recommended to review per commit.

@gandro gandro added release-note/bug This PR fixes an issue in a previous release of Cilium. sig/hubble Impacts hubble server or relay needs-backport/1.10 labels Jun 8, 2021
@gandro gandro requested review from a team and glibsm June 8, 2021 15:03
@gandro
Copy link
Member Author

gandro commented Jun 8, 2021

test-me-please

@michi-covalent
Copy link
Contributor

lgtm. would be good to get another review on 07e4f36

@gandro
Copy link
Member Author

gandro commented Jun 9, 2021

Thanks @michi-covalent! I know this is not a very nice change to review. I'm also happy to walk through the changes via Zoom with someone if interested.

@gandro gandro marked this pull request as draft June 10, 2021 17:17
@gandro
Copy link
Member Author

gandro commented Jun 10, 2021

Converting back into draft. The code review via Zoom revealed we don't have to drain the queue anymore when stopping.

@gandro gandro force-pushed the pr/gandro/hubble-recorder-refactor branch from 07e4f36 to c310ed3 Compare June 14, 2021 11:30
@gandro gandro marked this pull request as ready for review June 14, 2021 12:10
@gandro
Copy link
Member Author

gandro commented Jun 14, 2021

As discussed offline, I pushed a new commit (5d512ba179bd4d1286fc87c527a78a3353bf0d7c, second one in the history) which replaces two previous commits around the handling of s.queue

@gandro
Copy link
Member Author

gandro commented Jun 14, 2021

test-me-please

Travis hit #11560 - restarting

Copy link
Contributor

@michi-covalent michi-covalent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well structured PR was nice to review and the inline comments help a great deal understanding the control flow. Thanks!

Two small nits inline.

pkg/hubble/recorder/sink/sink.go Outdated Show resolved Hide resolved
pkg/hubble/recorder/sink/sink.go Outdated Show resolved Hide resolved
@gandro
Copy link
Member Author

gandro commented Jun 15, 2021

@tklauser I addressed your feedback (and them some) in a separate commit fe4a09d19b35b234d0f9d4f670654cbeea3b4fa0

@gandro gandro force-pushed the pr/gandro/hubble-recorder-refactor branch from fe4a09d to 4a88bee Compare June 24, 2021 10:16
@gandro
Copy link
Member Author

gandro commented Jun 24, 2021

test-me-please

@gandro
Copy link
Member Author

gandro commented Jun 29, 2021

test-runtime

Edit: Hit was looks like a transient error https://jenkins.cilium.io/job/Cilium-PR-Runtime-4.9/5082/

@gandro
Copy link
Member Author

gandro commented Jun 29, 2021

test-1.21-4.9

Edit: Hit was looks like a timeout https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/829

Edit 2: Somehow didn't get triggered. Re-triggering below.

@gandro
Copy link
Member Author

gandro commented Jun 29, 2021

test-1.21-4.9

@gandro
Copy link
Member Author

gandro commented Jun 29, 2021

Need to rebase yet again to pull in #16646

@gandro gandro force-pushed the pr/gandro/hubble-recorder-refactor branch from 4a88bee to 1bb0073 Compare June 29, 2021 17:15
@gandro
Copy link
Member Author

gandro commented Jun 30, 2021

test-me-please

Edit: This is a bug fix, so probably not part of the merge freeze, rebased to pull in latest master.

This commit splits the sink's `done chan error` channel (which only
allows a single consumer to wait on the sink to finish) into a `chan
struct{}` channel and a `lastError error` variable.

This enables us to signal that the sink has finished by closing the
channel instead of sending a value over it. Closing the channel allows
multiple go routines to block on this event via `<-s.done`. The final
error value can then be retrieved via `s.err()`. This pattern is very
similar to how `context.Context` works.

This commit does not yet make use of this functionality. The changes
enabled by this will follow in a subsequent commit.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Previously, we waited for the recording queue to drain when the client
requested a stop. However, because the client has no visibility into the
queue (and indeed doesn't even know if there are queued records when
they issue a stop request), this does not provide any value to the
client.

Therefore, this PR changes the semantics of a stop request by
immediately initiating a shutdown, instead of waiting for the queue to
drain. This ensures that the resulting recording more closely matches
the observed statistics at the time when the client issued a stop
request. It also simplifies the code a bit.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit changes the interface of sink.Dispatch from an explicit
`RegisterSink`+`UnregisterSink` pair to a `StartSink` call which will
unregister itself when it stops due to an error, an expired context, or
an explicit stop request.

This commit does not introduce any functional changes, it is purely a
refactoring.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Since we have introduced the `Handle.Done` channel, we do not have toV
signal the shutdown of the sink by closing the statistics channel
anymore. Instead, consumers can now wait on the `Handle.Done` channel
getting closed.

While there are not many benefits in this version of the code, it will
make the select statement in a subsequent commit much more readable.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit fixes a concurrency issue in the implementation of the
Hubble Recorder API. Before this commit, we were sending responses to
the client from both the main `Record` function, as well as the
`watchRecording` function which was spawned in a separate go routine.

However, sending to a grpc.ServerStream from multiple go routines is
_not_ safe: https://pkg.go.dev/google.golang.org/grpc#ServerStream It is
however safe to have one go routine receive from, and another go routine
send to the stream.

Therefore, this commit restructures the Hubble Recorder API in such a
way that only the `Record` stub ever sends back messages to the client.
Receiving is done in a separate go routine which forwards all received
messages into a channel, allowing us to select on incoming responses.

In addition, this commit hopefully also makes the logic a bit more
easier to read, as it tries to separate the cleanup of resources and
communicating with the client a bit more explicitly.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit documents what fields are now protected by the mutex in
`type sink` and updates a two usages accordingly, by moving channel
operations out of the critical section.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/hubble-recorder-refactor branch from 1bb0073 to 2c65520 Compare July 19, 2021 09:17
@gandro
Copy link
Member Author

gandro commented Jul 19, 2021

test-me-please

Edit: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.19-kernel-5.4/696/testReport/junit/Suite-k8s-1/19/K8sDatapathConfig_Encapsulation_Check_connectivity_with_VXLAN_encapsulation/

Seems potentially related to #16928 (due to the failed to find plugin \"cilium-cni\" in path [/opt/cni/bin]" errors)
Slack thread: https://cilium.slack.com/archives/C7PE7V806/p1626960016449700

@gandro
Copy link
Member Author

gandro commented Jul 27, 2021

I'm marking this ready-to-merge. Reasoning: This is fixing a bug in a released Cilium version and thus should be exempt from the zero-flake policy. It's almost 2 months old at this point and has hit various Jenkins flakes over its lifetime, but always different ones and all looked unrelated to this PR, as the pcap recorder (which this PR is touching) is not enabled in neither Jenkins CI nor our conformance tests.

@gandro gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 27, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from master in 1.10.4 Jul 27, 2021
@brb brb merged commit 6f9b875 into cilium:master Jul 28, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.4 Jul 28, 2021
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.4 Sep 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/hubble Impacts hubble server or relay
Projects
No open projects
1.10.4
Backport done to v1.10
Development

Successfully merging this pull request may close these issues.

None yet

7 participants