hubble/recorder: Refactor service implementation to fix multiple races #16472

gandro · 2021-06-08T15:03:43Z

This PR fixes multiple concurrency issues in the implementation of the
Hubble Recorder API. The main one is that we were sending responses
to the client from both the main Record function and the
watchRecording function, which was spawned in a separate goroutine.

However, sending to a grpc.ServerStream from multiple goroutines is
not safe: https://pkg.go.dev/google.golang.org/grpc#ServerStream
It is, however, safe to have one goroutine receive from the stream
while another goroutine sends to it.

Therefore, this commit restructures the Hubble Recorder API in such a
way that only the Record stub ever sends messages back to the client.
Receiving is done in a separate goroutine which forwards all received
messages into a channel, allowing us to select on incoming responses.
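
A minimal sketch of the single-sender pattern described above. It is not the code from this PR: the `stream` interface, `fakeStream`, and the `record` function are hypothetical stand-ins so the example runs without gRPC. Only `record` ever calls `Send`, while a helper goroutine only calls `Recv` and forwards into a channel.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// stream is a hypothetical stand-in for a bidirectional gRPC server stream.
type stream interface {
	Send(msg string) error
	Recv() (string, error)
}

// fakeStream is an in-memory stream used only for this illustration.
type fakeStream struct {
	in  chan string // messages "from the client"
	out []string    // messages "sent to the client"
}

func (f *fakeStream) Send(msg string) error {
	f.out = append(f.out, msg)
	return nil
}

func (f *fakeStream) Recv() (string, error) {
	msg, ok := <-f.in
	if !ok {
		return "", io.EOF
	}
	return msg, nil
}

// record is the only function that ever calls Send on the stream.
func record(s stream, stats <-chan string) error {
	reqs := make(chan string)
	errs := make(chan error, 1)
	done := make(chan struct{})
	defer close(done)

	// The receiving goroutine only calls Recv and forwards into a channel.
	go func() {
		for {
			msg, err := s.Recv()
			if err != nil {
				errs <- err // buffered, never blocks
				return
			}
			select {
			case reqs <- msg:
			case <-done: // record has returned, stop forwarding
				return
			}
		}
	}()

	for {
		select {
		case msg := <-reqs:
			if msg == "stop" {
				return s.Send("stopped")
			}
		case st := <-stats:
			if err := s.Send(st); err != nil {
				return err
			}
		case err := <-errs:
			if errors.Is(err, io.EOF) {
				return nil
			}
			return err
		}
	}
}

// demo drives record with one statistics update followed by a stop request.
func demo() []string {
	fs := &fakeStream{in: make(chan string)}
	stats := make(chan string)
	result := make(chan error, 1)
	go func() { result <- record(fs, stats) }()

	stats <- "stats: 1 packet" // forwarded to the client by record
	fs.in <- "stop"            // picked up by the receiving goroutine
	<-result                   // record has returned; fs.out is safe to read
	close(fs.in)               // lets the receiving goroutine exit
	return fs.out
}

func main() {
	fmt.Println(demo())
}
```

Because every `Send` happens in `record`, no additional locking around the stream is needed in this sketch.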

In addition, this commit hopefully also makes the logic a bit
easier to read, as it tries to separate the cleanup of resources from
communicating with the client a bit more explicitly. It also drastically
simplifies the implementation of a follow-up PR which extends the
API with stop conditions (#16473).

I apologize for the size, but I had to restructure quite a bit to fix
the underlying issue. I highly recommend reviewing per commit.

gandro · 2021-06-08T15:19:12Z

test-me-please

michi-covalent · 2021-06-09T04:56:52Z

lgtm. would be good to get another review on 07e4f36

gandro · 2021-06-09T08:06:18Z

Thanks @michi-covalent! I know this is not a very nice change to review. I'm also happy to walk through the changes via Zoom with someone if interested.

gandro · 2021-06-10T17:17:57Z

Converting back into draft. The code review via Zoom revealed we don't have to drain the queue anymore when stopping.

gandro · 2021-06-14T12:12:24Z

As discussed offline, I pushed a new commit (5d512ba179bd4d1286fc87c527a78a3353bf0d7c, second one in the history) which replaces two previous commits around the handling of s.queue

gandro · 2021-06-14T15:52:48Z

test-me-please

Travis hit #11560 - restarting

michi-covalent

tklauser

Well-structured PR that was nice to review, and the inline comments help a great deal in understanding the control flow. Thanks!

Two small nits inline.

pkg/hubble/recorder/sink/sink.go

gandro · 2021-06-15T12:19:51Z

@tklauser I addressed your feedback (and then some) in a separate commit fe4a09d19b35b234d0f9d4f670654cbeea3b4fa0

gandro · 2021-06-24T12:57:07Z

test-me-please

gandro · 2021-06-29T13:08:01Z

test-runtime

Edit: Hit what looks like a transient error https://jenkins.cilium.io/job/Cilium-PR-Runtime-4.9/5082/

gandro · 2021-06-29T13:13:38Z

test-1.21-4.9

Edit: Hit what looks like a timeout https://jenkins.cilium.io/job/Cilium-PR-K8s-1.21-kernel-4.9/829

Edit 2: Somehow didn't get triggered. Re-triggering below.

gandro · 2021-06-29T14:39:58Z

test-1.21-4.9

gandro · 2021-06-29T17:14:30Z

Need to rebase yet again to pull in #16646

gandro · 2021-06-30T08:18:35Z

test-me-please

Edit: This is a bug fix, so probably not part of the merge freeze, rebased to pull in latest master.

This commit splits the sink's `done chan error` channel (which only allows a single consumer to wait on the sink to finish) into a `chan struct{}` channel and a `lastError error` variable. This enables us to signal that the sink has finished by closing the channel instead of sending a value over it. Closing the channel allows multiple goroutines to block on this event via `<-s.done`. The final error value can then be retrieved via `s.err()`. This pattern is very similar to how `context.Context` works.

This commit does not yet make use of this functionality. The changes enabled by this will follow in a subsequent commit.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
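
The close-to-broadcast pattern from this commit message could look roughly like the sketch below. The field and method names mirror the ones mentioned above (`done`, `lastError`, `err()`), but the surrounding type, the `shutdown` helper, and the `waitAll` demo are assumptions made for illustration, not the actual sink implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// sink is a hypothetical, stripped-down version of the type discussed above.
type sink struct {
	mu        sync.Mutex
	lastError error         // protected by mu
	done      chan struct{} // closed when the sink has finished
}

func newSink() *sink {
	return &sink{done: make(chan struct{})}
}

// shutdown stores the final error and then closes done.
// It must be called exactly once (closing a closed channel panics).
func (s *sink) shutdown(err error) {
	s.mu.Lock()
	s.lastError = err
	s.mu.Unlock()
	close(s.done)
}

// err returns the final error; it is meaningful once done is closed.
func (s *sink) err() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.lastError
}

// waitAll starts n waiters that all block on the same done channel.
func waitAll(n int) []error {
	s := newSink()
	results := make([]error, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			<-s.done // every waiter is released by the single close
			results[i] = s.err()
		}(i)
	}
	s.shutdown(errors.New("sink failed"))
	wg.Wait()
	return results
}

func main() {
	fmt.Println(waitAll(3))
}
```

With a `chan error`, only one of the waiters would have received the value; closing a `chan struct{}` releases all of them.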

Previously, we waited for the recording queue to drain when the client requested a stop. However, because the client has no visibility into the queue (and indeed doesn't even know if there are queued records when they issue a stop request), this does not provide any value to the client.

Therefore, this PR changes the semantics of a stop request by immediately initiating a shutdown, instead of waiting for the queue to drain. This ensures that the resulting recording more closely matches the observed statistics at the time when the client issued a stop request. It also simplifies the code a bit.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

This commit changes the interface of sink.Dispatch from an explicit `RegisterSink`+`UnregisterSink` pair to a `StartSink` call which will unregister itself when it stops due to an error, an expired context, or an explicit stop request.

This commit does not introduce any functional changes; it is purely a refactoring.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
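
To illustrate the self-unregistering idea, here is a rough sketch. The names `Dispatch`, `StartSink`, and `Handle.Done` come from the commit messages, but their signatures and everything else here (the `Stop` method, the `Count` helper, the map-based registry) are assumptions and do not reflect the real `pkg/hubble/recorder/sink` API. Error handling is left out; only context expiry and explicit stop are modelled.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Handle is a hypothetical handle returned to the caller of StartSink.
type Handle struct {
	Done chan struct{} // closed once the sink has stopped and unregistered
	stop chan struct{}
	once sync.Once
}

// Stop requests an explicit shutdown; it is safe to call more than once.
func (h *Handle) Stop() { h.once.Do(func() { close(h.stop) }) }

// Dispatch keeps track of the currently running sinks.
type Dispatch struct {
	mu    sync.Mutex
	sinks map[*Handle]struct{} // protected by mu
}

func NewDispatch() *Dispatch {
	return &Dispatch{sinks: make(map[*Handle]struct{})}
}

// StartSink registers a sink and unregisters it again when it stops,
// so callers need no separate unregister call.
func (d *Dispatch) StartSink(ctx context.Context) *Handle {
	h := &Handle{Done: make(chan struct{}), stop: make(chan struct{})}
	d.mu.Lock()
	d.sinks[h] = struct{}{}
	d.mu.Unlock()

	go func() {
		defer close(h.Done) // runs after the unregister below
		select {
		case <-ctx.Done(): // expired or cancelled context
		case <-h.stop: // explicit stop request
		}
		d.mu.Lock()
		delete(d.sinks, h)
		d.mu.Unlock()
	}()
	return h
}

// Count returns the number of registered sinks.
func (d *Dispatch) Count() int {
	d.mu.Lock()
	defer d.mu.Unlock()
	return len(d.sinks)
}

// demo stops one sink via its context and one via Stop.
func demo() [3]int {
	d := NewDispatch()
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	h1 := d.StartSink(ctx)
	h2 := d.StartSink(context.Background())

	var counts [3]int
	counts[0] = d.Count()
	cancel()
	<-h1.Done
	counts[1] = d.Count()
	h2.Stop()
	<-h2.Done
	counts[2] = d.Count()
	return counts
}

func main() {
	fmt.Println(demo())
}
```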

Since we have introduced the `Handle.Done` channel, we do not have to signal the shutdown of the sink by closing the statistics channel anymore. Instead, consumers can now wait on the `Handle.Done` channel getting closed.

While there are not many benefits in this version of the code, it will make the select statement in a subsequent commit much more readable.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
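
As a small illustration of why a dedicated done channel reads well in a `select`, consider the sketch below. The `consume` function and the integer statistics are made up for this example; the real consumer in the recorder service looks different.

```go
package main

import "fmt"

// consume adds up statistics updates until done is closed.
// Shutdown is signalled by done, not by closing the stats channel.
func consume(stats <-chan int, done <-chan struct{}) int {
	total := 0
	for {
		select {
		case n := <-stats:
			total += n
		case <-done:
			return total
		}
	}
}

// demo sends three updates and then signals shutdown.
func demo() int {
	stats := make(chan int) // unbuffered: each send completes before the next
	done := make(chan struct{})
	result := make(chan int, 1)
	go func() { result <- consume(stats, done) }()

	for _, n := range []int{1, 2, 3} {
		stats <- n
	}
	close(done)
	return <-result
}

func main() {
	fmt.Println(demo())
}
```

Each event gets its own `case`, so there is no need to check the second return value of a channel receive to detect shutdown.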

This commit fixes a concurrency issue in the implementation of the Hubble Recorder API. Before this commit, we were sending responses to the client from both the main `Record` function and the `watchRecording` function, which was spawned in a separate goroutine.

However, sending to a grpc.ServerStream from multiple goroutines is _not_ safe: https://pkg.go.dev/google.golang.org/grpc#ServerStream
It is, however, safe to have one goroutine receive from the stream while another goroutine sends to it.

Therefore, this commit restructures the Hubble Recorder API in such a way that only the `Record` stub ever sends messages back to the client. Receiving is done in a separate goroutine which forwards all received messages into a channel, allowing us to select on incoming responses.

In addition, this commit hopefully also makes the logic a bit easier to read, as it tries to separate the cleanup of resources from communicating with the client a bit more explicitly.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

This commit documents which fields are now protected by the mutex in `type sink` and updates two usages accordingly, by moving channel operations out of the critical section.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
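
The idea of keeping channel operations outside the critical section can be sketched as follows. The struct layout, the `stats` type, and the non-blocking send are assumptions chosen for a compact example; they are not taken from the actual `type sink`.

```go
package main

import (
	"fmt"
	"sync"
)

// stats is a hypothetical snapshot of the sink's counters.
type stats struct {
	packets int
	bytes   int
}

type sink struct {
	mu sync.Mutex
	// The fields below are protected by mu.
	packets int
	bytes   int

	// updates is not protected by mu; channels are safe for concurrent use.
	updates chan stats
}

// record updates the counters under the lock, then publishes a snapshot
// after the lock has been released.
func (s *sink) record(n int) {
	s.mu.Lock()
	s.packets++
	s.bytes += n
	snap := stats{packets: s.packets, bytes: s.bytes}
	s.mu.Unlock()

	// The send happens outside the critical section, so a slow or absent
	// receiver cannot hold up other goroutines waiting for mu.
	select {
	case s.updates <- snap:
	default: // assumption for this sketch: drop the update if nobody is ready
	}
}

// snapshot returns the current counters.
func (s *sink) snapshot() stats {
	s.mu.Lock()
	defer s.mu.Unlock()
	return stats{packets: s.packets, bytes: s.bytes}
}

// demo records two packets while only one update fits into the channel.
func demo() (first, final stats) {
	s := &sink{updates: make(chan stats, 1)}
	s.record(100)
	s.record(50) // buffer is full, so this update is dropped
	return <-s.updates, s.snapshot()
}

func main() {
	first, final := demo()
	fmt.Println(first, final)
}
```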

gandro · 2021-07-19T09:30:21Z

test-me-please

Edit: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.19-kernel-5.4/696/testReport/junit/Suite-k8s-1/19/K8sDatapathConfig_Encapsulation_Check_connectivity_with_VXLAN_encapsulation/

Seems potentially related to #16928 (due to the "failed to find plugin \"cilium-cni\" in path [/opt/cni/bin]" errors)
Slack thread: https://cilium.slack.com/archives/C7PE7V806/p1626960016449700

gandro · 2021-07-27T11:36:27Z

I'm marking this ready-to-merge. Reasoning: This is fixing a bug in a released Cilium version and thus should be exempt from the zero-flake policy. It's almost 2 months old at this point and has hit various Jenkins flakes over its lifetime, but always different ones, and all looked unrelated to this PR, as the pcap recorder (which this PR is touching) is enabled in neither Jenkins CI nor our conformance tests.

gandro added release-note/bug, sig/hubble, needs-backport/1.10 labels Jun 8, 2021

gandro requested review from a team and glibsm June 8, 2021 15:03

maintainer-s-little-helper bot assigned glibsm Jun 8, 2021

gandro mentioned this pull request Jun 8, 2021

hubble/recorder: Extend the API to allow stopping a recording automatically #16473

Merged

gandro marked this pull request as draft June 10, 2021 17:17

gandro force-pushed the pr/gandro/hubble-recorder-refactor branch from 07e4f36 to c310ed3 Compare June 14, 2021 11:30

gandro marked this pull request as ready for review June 14, 2021 12:10

gandro mentioned this pull request Jun 14, 2021

CI: Travis: ENISuite.TestNodeManagerManyNodes fails #11560

Closed

gandro requested review from michi-covalent and tklauser and removed request for glibsm June 14, 2021 16:05

maintainer-s-little-helper bot assigned michi-covalent and tklauser Jun 14, 2021

michi-covalent approved these changes Jun 15, 2021

maintainer-s-little-helper bot unassigned michi-covalent Jun 15, 2021

tklauser approved these changes Jun 15, 2021

maintainer-s-little-helper bot unassigned tklauser Jun 15, 2021

gandro requested a review from tklauser June 15, 2021 12:19

maintainer-s-little-helper bot assigned tklauser Jun 15, 2021

gandro force-pushed the pr/gandro/hubble-recorder-refactor branch from fe4a09d to 4a88bee Compare June 24, 2021 10:16

gandro force-pushed the pr/gandro/hubble-recorder-refactor branch from 4a88bee to 1bb0073 Compare June 29, 2021 17:15

gandro added dont-merge/merge-freeze and removed dont-merge/merge-freeze labels Jul 19, 2021

gandro added 6 commits July 19, 2021 11:16

hubble/recorder: Be more explicit about mutex

2c65520

gandro force-pushed the pr/gandro/hubble-recorder-refactor branch from 1bb0073 to 2c65520 Compare July 19, 2021 09:17

gandro added the ready-to-merge label Jul 27, 2021

maintainer-s-little-helper bot added this to Needs backport from master in 1.10.4 Jul 27, 2021

brb merged commit 6f9b875 into cilium:master Jul 28, 2021

pchaigno mentioned this pull request Jul 28, 2021

v1.10 backports 2021-07-28 #17011

Merged

pchaigno added backport-pending/1.10 and removed needs-backport/1.10 labels Jul 28, 2021

maintainer-s-little-helper bot moved this from Needs backport from master to Backport pending to v1.10 in 1.10.4 Jul 28, 2021

joestringer added backport-done/1.10 and removed backport-pending/1.10 labels Sep 1, 2021

maintainer-s-little-helper bot moved this from Backport pending to v1.10 to Backport done to v1.10 in 1.10.4 Sep 1, 2021

joestringer mentioned this pull request Sep 1, 2021

Prepare for release v1.10.4 #17287

Merged
