drpcmanager: fix race between manageReader and stream creation#39
Conversation
|
b8091d4 to
3ab0f13
Compare
There was a problem hiding this comment.
Pull request overview
This PR fixes a potential race in drpcmanager.Manager where manageReader could advance past an invoke before NewServerStream fully registered the stream, leading to rare deadlock scenarios on back-to-back invokes.
Changes:
- Gate
manageReaderprogression on stream registration by waiting onpdoneafter forwarding invoke packets. - Remove the retry loop in
manageReaderand treat non-invoke first frames for a new stream as a protocol error. - Add a suite of
manageReader-focused tests covering monotonicity, packet assembly, and stream/old-stream handling.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| staticcheck.conf | Disables ST1003 in staticcheck configuration. |
| drpcmanager/manager_test.go | Adds extensive manageReader tests and helper functions. |
| drpcmanager/manager.go | Adjusts invoke handling synchronization and replaces retry logic with a protocol error path. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cthumuluru-crdb
left a comment
There was a problem hiding this comment.
Changes look good to me but I couldn't tell if all the tests are relevant to this change?
No they are not. Most of the tests were added for the next PR to ensure the refactors are safe. They helped me catch this. These tests should work even before my next PR so I kept them here. But I think I will remove the tests that are not relevant to current PR. |
1b5b3c2 to
136d0cd
Compare
pdone.Send() was firing before m.newStream() completed, allowing manageReader to process the next packet before sbuf.Set() registered the stream. Back-to-back invokes could deadlock because the second invoke would hit the KindInvoke case in manageReader with curr still nil, sending to m.pkts with no receiver. No receiver because the first NewServerStream already returned and the next one hasn't been called yet. The same applies when curr is not nil and a new stream replaces it. This scenario is unlikely but possible. The main benefit of this fix is simplicity: it removes the goto-again retry loop by making manageReader wait for stream registration before proceeding. The cost is a tiny bit of added synchrony during stream creation. With pdone gated on m.newStream(), curr is guaranteed to be set when manageReader reads the next packet. The default case no longer needs to wait and retry, a non-invoke first packet is now a protocol error. TestRandomized_Server is disabled because it sends packets with stream IDs greater than the client's current stream ID, which is invalid. Fixing it is deferred because the upcoming stream-multiplexing changes will likely require further changes to this test; it should be re-enabled before merging to main. In the similar fashion TestRandomized_Client is also disabled.
136d0cd to
ea71b00
Compare
|
Note that I have to run the |
pdone.Send() was firing before m.newStream() completed, allowing
manageReader to process the next frame before sbuf.Set() registered the
stream. Back-to-back invokes could deadlock because the second invoke would
hit the KindInvoke case in manageReader with curr still nil, sending to
m.pkts with no receiver. No receiver because the first NewServerStream
already returned and the next one hasn't been called yet. The same applies
when curr is not nil and a new stream replaces it.
This scenario is unlikely but possible. The main benefit of this fix is
simplicity: it removes the goto-again retry loop by making manageReader
wait for stream registration before proceeding. The cost is a tiny bit of
added synchrony during stream creation.
With pdone gated on m.newStream(), curr is guaranteed to be set when
manageReader reads the next frame. The default case no longer needs to
wait and retry, a non-invoke first frame is now a protocol error.
TestRandomized_Server is disabled because it sends packets with stream IDs
greater than the client's current stream ID, which is invalid. Fixing it is
deferred because the upcoming stream-multiplexing changes will likely
require further changes to this test; it should be re-enabled before
merging to main. In the similar fashion TestRandomized_Client is also
disabled.