
release-26.1: kvcoord: avoid concurrent restarts of active rangefeeds#163325

Open
stevendanna wants to merge 1 commit into cockroachdb:release-26.1 from stevendanna:blathers/backport-release-26.1-161955

Conversation

@stevendanna
Collaborator

@stevendanna stevendanna commented Feb 11, 2026

Backport 1/1 commits from #161955 on behalf of @stevendanna.


There is a concurrency bug in MuxRangeFeed:

  1. Goroutine A starts a goroutine B to read events from the stream.

  2. Goroutine A adds a new streamID to the muxer's map of active rangefeeds.

  3. Goroutine A sends a new RangeFeedRequest to the stream in a retry loop.

  4. Concurrently,

    (a) The Send() returns an error; and

    (b) despite the error, the rangefeed was registered and then returns an
    error event, or Recv() fails and restarts all rangefeeds

  5. It may then be the case that goroutine B resets the transport to nil.

  6. Goroutine A, also in a retry loop, attempts to access this now-nil transport.

There are many ways to solve the NPE itself, but here we try to resolve the more fundamental race: we have two goroutines attempting to restart the same stream.

The best way to solve this would be to refactor this code, but here we first attempt a somewhat backportable fix:

  1. Each restart attempt generates a new streamID.

  2. Goroutines A and B delete the streamID from the map before attempting to restart it, and only attempt a restart if they found their ID.

The included test failed with an NPE before the fix.

Fixes #163276

Release note: None


Release justification: Fix for node-crashing bug.

There is a concurrency bug in MuxRangeFeed:

0. Goroutine A starts a goroutine B to read events from the stream.
1. Goroutine A adds a new streamID to the muxer's map of active rangefeeds.
2. Goroutine A sends a new RangeFeedRequest to the stream in a retry loop.
3. Concurrently,

    (a) The Send() returns an error; and

    (b) despite the error, the rangefeed was registered and then returns an
    error event, or Recv() fails and restarts all rangefeeds

4. It may then be the case that goroutine B resets the transport to nil.

5. Goroutine A, also in a retry loop, attempts to access this now-nil transport.

There are many ways to solve the NPE itself, but here we try to resolve the more
fundamental race: we have two goroutines attempting to restart the same stream.

The best way to solve this would be to refactor this code, but here we first
attempt a somewhat backportable fix:

1. Each restart attempt generates a new streamID.
2. Goroutines A and B delete the streamID from the map before attempting to restart it, and only attempt a restart if they found their ID.

The included test failed with an NPE before the fix.

Fixes cockroachdb#157997
Fixes cockroachdb#161822
Fixes cockroachdb#161479
Fixes-26.1 cockroachdb#161663
Fixes-26.1 cockroachdb#160442

Release note: None
@stevendanna stevendanna force-pushed the blathers/backport-release-26.1-161955 branch from 8ad29f3 to 4d04568 Compare February 11, 2026 08:54
@stevendanna stevendanna requested a review from a team as a code owner February 11, 2026 08:54
@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Feb 11, 2026
@blathers-crl

blathers-crl bot commented Feb 11, 2026

Thanks for opening a backport.

Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate.

@blathers-crl blathers-crl bot added backport Label PR's that are backports to older release branches T-kv KV Team labels Feb 11, 2026
@cockroach-teamcity
Member

This change is Reviewable

@tbg
Member

tbg commented Feb 11, 2026

I'll defer to @stevendanna and @wenyihu6 on whether/when we're comfortable backporting. It's nice to be able to link test failures to a pull request, even if we decide not to merge.

@stevendanna
Collaborator Author

@arulajmani @wenyihu6 This has baked for a little bit now. What do we think? DRPC is technically preview in 26.1, so there could be a few people who try it and are thus a bit more exposed to this bug.

@wenyihu6
Contributor

Is this bug more likely with DRPC? I'm missing why this relates to DRPC (I saw some comments around this in the issue but can't spot the connection based on the fix).

I feel comfortable merging the backport since it mainly affects rangefeed stream management, the new logic feels low risk and won't cause any silent event-loss issues, and the potential downside seems small. Wdyt?

@stevendanna
Collaborator Author

Is this bug more likely with DRPC

I do not have a reason why, but we were only ever able to reproduce it on DRPC.

@stevendanna
Collaborator Author

I do not have a reason why

I developed a reason why. Looking at the dRPC code, it appears it does a context cancellation check after flushing its send buffer and then returns that context cancellation error to the user. gRPC doesn't appear to do anything similar. So I think we are more likely to hit this case when contexts are cancelled.

@wenyihu6
Contributor

wenyihu6 commented Mar 2, 2026

Should we merge this now that it's been baking for some time @stevendanna ?

@stevendanna stevendanna linked an issue Mar 9, 2026 that may be closed by this pull request



Development

Successfully merging this pull request may close these issues.

kv/kvserver: TestMergeQueue failed [NPE in rangefeed client]

4 participants