drpcmanager: fix race between manageReader and stream creation #39

Merged
shubhamdhama merged 1 commit into cockroachdb:stream-multiplexing from shubhamdhama:make-stream-sync
Mar 25, 2026

Conversation

@shubhamdhama
@shubhamdhama shubhamdhama commented Mar 23, 2026

pdone.Send() was firing before m.newStream() completed, allowing
manageReader to process the next frame before sbuf.Set() registered the
stream. Back-to-back invokes could deadlock because the second invoke would
hit the KindInvoke case in manageReader with curr still nil, sending to
m.pkts with no receiver. There is no receiver because the first
NewServerStream has already returned and the next one hasn't been called
yet. The same applies
when curr is not nil and a new stream replaces it.

This scenario is unlikely but possible. The main benefit of this fix is
simplicity: it removes the goto-again retry loop by making manageReader
wait for stream registration before proceeding. The cost is a tiny bit of
added synchrony during stream creation.

With pdone gated on m.newStream(), curr is guaranteed to be set when
manageReader reads the next frame. The default case no longer needs to
wait and retry; a non-invoke first frame is now a protocol error.

TestRandomized_Server is disabled because it sends packets with stream IDs
greater than the client's current stream ID, which is invalid. Fixing it is
deferred because the upcoming stream-multiplexing changes will likely
require further changes to this test; it should be re-enabled before
merging to main. Similarly, TestRandomized_Client is also
disabled.

@shubhamdhama
Author

shubhamdhama commented Mar 23, 2026

  • There is one test in random_test that would fail because the test itself violates the wire protocol. I'll fix that tomorrow.
    SKIPPED the test

Copilot AI left a comment

Pull request overview

This PR fixes a potential race in drpcmanager.Manager where manageReader could advance past an invoke before NewServerStream fully registered the stream, leading to rare deadlock scenarios on back-to-back invokes.

Changes:

  • Gate manageReader progression on stream registration by waiting on pdone after forwarding invoke packets.
  • Remove the retry loop in manageReader and treat non-invoke first frames for a new stream as a protocol error.
  • Add a suite of manageReader-focused tests covering monotonicity, packet assembly, and stream/old-stream handling.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| staticcheck.conf | Disables ST1003 in staticcheck configuration. |
| drpcmanager/manager_test.go | Adds extensive manageReader tests and helper functions. |
| drpcmanager/manager.go | Adjusts invoke handling synchronization and replaces retry logic with a protocol error path. |

@cthumuluru-crdb cthumuluru-crdb left a comment

Changes look good to me but I couldn't tell if all the tests are relevant to this change?

@shubhamdhama
Author

> Changes look good to me but I couldn't tell if all the tests are relevant to this change?

No, they are not. Most of the tests were added for the next PR to ensure the refactors are safe, and they helped me catch this bug. They should pass even without my next PR, so I kept them here. But I think I will remove the tests that are not relevant to the current PR.

@shubhamdhama shubhamdhama changed the base branch from main to stream-multiplexing March 24, 2026 14:06
@shubhamdhama shubhamdhama force-pushed the make-stream-sync branch 3 times, most recently from 1b5b3c2 to 136d0cd Compare March 25, 2026 06:17

@suj-krishnan suj-krishnan left a comment

LGTM!

@cthumuluru-crdb cthumuluru-crdb self-requested a review March 25, 2026 08:04

@cthumuluru-crdb cthumuluru-crdb left a comment

LGTM!

pdone.Send() was firing before m.newStream() completed, allowing
manageReader to process the next packet before sbuf.Set() registered the
stream. Back-to-back invokes could deadlock because the second invoke would
hit the KindInvoke case in manageReader with curr still nil, sending to
m.pkts with no receiver. There is no receiver because the first
NewServerStream has already returned and the next one hasn't been called
yet. The same applies
when curr is not nil and a new stream replaces it.

This scenario is unlikely but possible. The main benefit of this fix is
simplicity: it removes the goto-again retry loop by making manageReader
wait for stream registration before proceeding. The cost is a tiny bit of
added synchrony during stream creation.

With pdone gated on m.newStream(), curr is guaranteed to be set when
manageReader reads the next packet. The default case no longer needs to
wait and retry; a non-invoke first packet is now a protocol error.

TestRandomized_Server is disabled because it sends packets with stream IDs
greater than the client's current stream ID, which is invalid. Fixing it is
deferred because the upcoming stream-multiplexing changes will likely
require further changes to this test; it should be re-enabled before
merging to main. Similarly, TestRandomized_Client is also
disabled.
@shubhamdhama
Author

Note that I have to run make test and make lint manually to confirm the tests pass, because GitHub Actions does not run on non-main branches.

@shubhamdhama shubhamdhama merged commit 0fd92af into cockroachdb:stream-multiplexing Mar 25, 2026

4 participants