release-26.1: kvcoord: avoid concurrent restarts of active rangefeeds#163325
release-26.1: kvcoord: avoid concurrent restarts of active rangefeeds#163325stevendanna wants to merge 1 commit intocockroachdb:release-26.1from
Conversation
There is a concurrency bug in MuxRangeFeed:
0. Goroutine A starts a goroutine B to read events from the stream.
1. Goroutine A adds a new streamID to the muxer's map of active rangefeeds.
2. Goroutine A Sends a new RangeFeedRequest to the stream in a retry loop
3. Concurrenly,
(a) The Send() returns an error; and
(b) despite the error, the rangefeed was registered and then returns an
error event, or Recv() fails and restarts all rangefeeds
4. Then it may be the scase that goroutine B resets the transport to
nil.
5. Goroutine A, also in a retry loop, attempts to access this now-nil transport.
There are many ways to solve the NPE itself, but here we try to resolve the more
fundamental race: we have two goroutines attempting to restart the same stream.
The best way to solve this would be to refactor this code, but here we first
attempt a somewhat backportable fix:
1. Each restart attempt generates a new streamID. 2. Goroutine's A and B delete
the streamID from the map before attempting to restart it, and only attempt a
restart if they found their ID.
The included test failed with a NPE before the fix.
Fixes cockroachdb#157997
Fixes cockroachdb#161822
Fixes cockroachdb#161479
Fixes-26.1 cockroachdb#161663
Fixes-26.1 cockroachdb#160442
Release note: None
8ad29f3 to
4d04568
Compare
|
Thanks for opening a backport. Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate. |
|
I'll defer to @stevendanna and @wenyihu6 on whether/when we're comfortable backporting. It's nice to be able to link test failures to a pull request, even if we decide not to merge. |
|
@arulajmani @wenyihu6 This has baked for a little bit now. What do we think? DRPC is technically preview in 26.1 so there could be a few people who try it and are thus a bit more exposed to this bug. |
|
Is this bug more likely with DRPC - I'm missing the part on why this relates to DRPC (I saw some comments around this in the issue but can't spot why based on the fix)? I feel comfortable with merging the backport since it mainly affects rangefeed stream management, and the new logic feels low risk and won't cause anything silent event loss issues, and the potential downside seems small. Wdyt? |
I do not have a reason why, but we were only ever able to reproduce it on DRPC. |
I developed a reason why. Looking at the dRPC code, it appears they do a context cancellation check after flushing their send buffer and then return that context cancellation error to the user. gRPC doesn't appear to do anything similar. So I think in context cancellation cases we are more likely to hit this case. |
|
Should we merge this now that it's been baking for some time @stevendanna ? |
Backport 1/1 commits from #161955 on behalf of @stevendanna.
There is a concurrency bug in MuxRangeFeed:
Goroutine A starts a goroutine B to read events from the stream.
Goroutine A adds a new streamID to the muxer's map of active rangefeeds.
Goroutine A Sends a new RangeFeedRequest to the stream in a retry loop
Concurrenly,
(a) The Send() returns an error; and
(b) despite the error, the rangefeed was registered and then returns an
error event, or Recv() fails and restarts all rangefeeds
Then it may be the scase that goroutine B resets the transport to nil.
Goroutine A, also in a retry loop, attempts to access this now-nil transport.
There are many ways to solve the NPE itself, but here we try to resolve the more fundamental race: we have two goroutines attempting to restart the same stream.
The best way to solve this would be to refactor this code, but here we first attempt a somewhat backportable fix:
The included test failed with a NPE before the fix.
Fixes #163276
Release note: None
Release justification: Fix for node-crashing bug.