Fix late selective_channel retry after EndRPC#3359
Conversation
c2c7d03 to
e27a7e9
Compare
|
@chenBright hello, please help review |
| _main_cntl->SetFailed(ECANCELED, | ||
| "SelectiveChannel balancer is unavailable"); | ||
| return -1; | ||
| } |
There was a problem hiding this comment.
The problem seems unresolved. The process still crashes even if the balancer is deleted by main_cntl after the if statement.
There was a problem hiding this comment.
@chenBright thanks for review, The previous null check was not enough because it did not keep the balancer alive after _lb.get().
I updated the patch so Sender holds its own intrusive_ptr<SharedLoadBalancer> captured from SelectiveChannel at creation time, and IssueRPC() uses that retained reference. This prevents main_cntl->_lb.reset() in EndRPC() from freeing the balancer while IssueRPC() is still using it.
The FLAGS_ENDING_RPC guard is kept to prevent late SubDone from re-entering retry / backup after the main RPC starts ending.
e27a7e9 to
531f2d7
Compare
There was a problem hiding this comment.
Pull request overview
This PR addresses a SelectiveChannel race where late SubDone::Run() callbacks could re-enter retry/backup logic after the main RPC has entered Controller::EndRPC(), potentially operating on partially torn-down controller state (including a previously observed null balancer path).
Changes:
- Introduces an “ending RPC” controller flag set at the beginning of
Controller::EndRPC()and ignores lateOnVersionedRPCReturned()callbacks once teardown has begun. - Makes
schan::Senderhold its own reference to the load balancer and adds a defensive null check before selecting a sub-channel. - Adds a regression test that forces the main RPC to time out while delayed sub-calls finish later, validating late-callback behavior under retry/backup configuration.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
test/brpc_channel_unittest.cpp |
Adds a regression test exercising late sub-call completion after main timeout with retry + backup enabled. |
src/brpc/selective_channel.cpp |
Passes the balancer into schan::Sender and adds a defensive null check before using it. |
src/brpc/controller.h |
Adds FLAGS_ENDING_RPC and a helper accessor to expose “ending RPC” state. |
src/brpc/controller.cpp |
Sets the “ending RPC” flag early in EndRPC() and short-circuits late callbacks in OnVersionedRPCReturned(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (!initialized()) { | ||
| cntl->SetFailed(EINVAL, "SelectiveChannel=%p is not initialized yet", | ||
| this); | ||
| } | ||
| schan::Sender* sndr = new schan::Sender(cntl, request, response, user_done); | ||
| schan::Sender* sndr = |
There was a problem hiding this comment.
Addressed. SelectiveChannel::CallMethod() now returns immediately when the channel is not initialized, and runs done in-place if provided, matching the behavior of PartitionChannel.
This avoids allocating Sender or dispatching into the inner channel with an unavailable balancer. I also added a regression test covering both sync and async uninitialized calls.
Ignore late selective_channel SubDone callbacks once the main RPC enters EndRPC, and let Sender hold the selective balancer while issuing sub-RPCs. Return early when SelectiveChannel is called before Init, preserving the intended EINVAL result. Add regression tests covering timeout plus delayed sub-call completion and uninitialized SelectiveChannel calls.
531f2d7 to
4217a8a
Compare
What problem does this PR solve?
Issue Number: #3358
Problem Summary:
This fixes a
selective_channelrace where a lateSubDone::Run()may re-enterretry/backup after the main RPC has already entered
EndRPC().The crash report in #3358 shows that a late sub-call callback can still flow into:
Controller::OnVersionedRPCReturned()Controller::IssueRPC()schan::Sender::IssueRPC()after the main controller has already started tearing down its state.
That leaves
selective_channelvulnerable to retrying on partially torn-downstate, including the previously observed null balancer path.
What is changed and the side effects?
Changed:
EndRPC()SubDonecallbacks once the main RPC is already endingschan::Sender::IssueRPC()This keeps the retry/backup state machine from re-entering after teardown has
started, and also preserves a hard guard at the
selective_channelboundary.Test:
Added a regression test that:
SelectiveChannelThis reproduces the late callback window and verifies it no longer re-enters
retry/backup after timeout.
Also re-ran related selective/backup request tests.
Side effects:
Performance effects:
Breaking backward compatibility:
Check List: