fix(replication): fix cancel replication race #2196
Conversation
The bug: One connection calls replica start and a second calls replica stop. In this flow Stop first resets the state mask with state_mask_.store(0), then Start sets the state mask with state_mask_.store(R_ENABLED), continues to greet, and creates the main replication fiber; only then does Stop run cntx_.Cancel(), which is later reset inside the main replication fiber. As a result the main replication fiber never cancels, and the connection calling Stop is deadlocked waiting to join the main replication fiber. The fix: switch the order of cntx_.Cancel() and state_mask_.store(0) in the Stop function. Signed-off-by: adi_holden <adi@dragonflydb.io>
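For illustration, a minimal sketch of the reordering proposed here. state_mask_, cntx_ and R_ENABLED come from the discussion; the fiber member name sync_fb_ is a hypothetical placeholder:

```cpp
// Hedged sketch of the originally proposed fix: cancel the context before
// clearing the enabled bit in Stop(). sync_fb_ is an assumed name for the
// main replication fiber handle.
void Replica::Stop() {
  cntx_.Cancel();        // previously ran after the store below
  state_mask_.store(0);  // clear R_ENABLED so the main fiber loop exits
  sync_fb_.Join();       // join the main replication fiber
}
```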
I still don't understand... How do we miss the context cancellation if it happens after setting enabled = false? The only lines I know of that discard a cancellation are these:

while (state_mask_.load() & R_ENABLED) {
  // Discard all previous errors and set default error handler.
  cntx_.Reset([this](const GenericError& ge) { this->DefaultErrorHandler(ge); });

and no matter in what order you place them, if we cross here it'll be problematic either way... (It's what I realize now 😅) Maybe we should really run Stop on the replica's native thread? I remember we did that some time ago 🤔
Here it is! 😄 #1058 Those changes are not safe...
@dranikpg as I wrote in the description above, setting enabled = false happens while replica start is being called (before we store R_ENABLED), and the context cancel from the stop flow is called after the main replication fiber is created.
Are you saying that we should call this in Stop: if (sock_) { ... I can try this as well; although I am pretty sure that my change fixes the problem, this looks more robust.
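A rough sketch of that alternative, assuming a helio-style socket with a Shutdown() call; the sock_ check comes from the comment above, while sync_fb_ and the exact shutdown API are assumptions:

```cpp
// Hedged sketch: shut the socket down from Stop() so a blocked connect/read
// in the replication fiber returns early. The socket may not exist yet (see
// the follow-up comment), hence the null check.
void Replica::Stop() {
  cntx_.Cancel();
  state_mask_.store(0);
  if (sock_) {
    sock_->Shutdown(SHUT_RDWR);  // unblock a pending connect/read
  }
  sync_fb_.Join();
}
```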
@adiholden, based on your PR description I created this sketch (the problematic flow). Is it correct?

flowchart TD
A[Start] --> T[Time:T0]
T --> E["state |=Enabled"]
E --> F[FiberDispatch]
B[Stop] --> R["state=0"]
R --> T2[Time:T0]
T2 --> T3[Time:T1]
T3 --> CC["Context.Cancel()"]
CC --> D[JoinFiber/Deadlock]
T <--> T2
T4[Time:T1] <--> T3
FS[FiberStart] --> W["While (state & enabled)"]
W --> T4
T4 --> ER["if context.error() then continue"]
ER --> W
Guys, please take into account that for state transitions under management commands, the simplest and most robust approach is to use mutexes/spinlocks together with a state variable, to benefit from the simplicity of transactional semantics. Commands that do not require performance do not need the extra complexity of atomics. Delegating to a designated thread is also a valid approach.
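As a self-contained illustration of this suggestion (the class, state names and members are illustrative, not the actual Dragonfly code), a mutex-guarded state variable makes the start/stop transitions trivially race-free:

```cpp
#include <mutex>

// Toy sketch of mutex-guarded state transitions for management commands.
enum class ReplState { kIdle, kConnecting, kStreaming, kStopped };

class ReplicaCtrl {
 public:
  bool Start() {
    std::lock_guard lk(mu_);
    if (state_ != ReplState::kIdle)  // reject a concurrent/second start
      return false;
    state_ = ReplState::kConnecting;
    return true;
  }

  void Stop() {
    std::lock_guard lk(mu_);
    if (state_ == ReplState::kIdle)  // nothing to cancel
      return;
    state_ = ReplState::kStopped;    // observed under the same lock
  }

 private:
  std::mutex mu_;
  ReplState state_ = ReplState::kIdle;
};
```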
@romange the flow you are describing also shows the problem of setting the state before cancelling the context, although it's not the exact flow I saw in my prints when the test failed.
As for the suggested fix above: because today we allow running stop before replica start finishes, the socket may not be initialized when running stop, as it is initialized in start when we run connect and auth. We have this comment in the code: // We proceed connecting below without the lock to allow interrupting the replica immediately.
@adiholden fixed and updated the PR description. It's an editable flowchart using the mermaid syntax.
@romange OK, after changing the flow to not allow running stop and start together, the cancel replication pytest of course fails, as this no longer allows cancelling replication. Why do we need to be able to cancel replication? I mean fast cancel, i.e. cancelling it before we return from replicaof.
Because the socket might not be initialized, we could just store the proactor on creation (I suggested it in the comments of Roy's PR).
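A hedged sketch of that idea, assuming helio's util::ProactorBase; the constructor signature and member names are illustrative:

```cpp
namespace util { class ProactorBase; }

// Sketch: remember the owning proactor when the Replica is created, so
// Stop() can hop to the replica's thread even before sock_ exists.
class Replica {
 public:
  explicit Replica(util::ProactorBase* proactor) : proactor_(proactor) {}

  void Stop();  // can dispatch onto proactor_ from any thread

 private:
  util::ProactorBase* proactor_;  // valid from construction, unlike sock_
};
```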
Yes, but those operations can happen very quickly regardless of order. See the code again; this is theoretically possible:

while (state_mask_.load() & R_ENABLED) {
  // What if
  //   mask = 0
  //   cntx.Cancel()
  // happens here - regardless of order we will proceed
  cntx_.Reset([this](const GenericError& ge) { this->DefaultErrorHandler(ge); });
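A toy, self-contained illustration of this window (not Dragonfly code): if both Stop() operations land between the fiber's state check and cntx_.Reset(), the cancellation is lost regardless of their order. Only state_mask_/R_ENABLED come from the discussion; the Context struct is a stand-in:

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>

constexpr uint32_t R_ENABLED = 1;

// Stand-in for the real context: Reset() discards a pending cancellation.
struct Context {
  std::atomic_bool cancelled{false};
  void Cancel() { cancelled = true; }
  void Reset() { cancelled = false; }
};

int main() {
  std::atomic<uint32_t> state_mask{R_ENABLED};
  Context cntx;

  // Fiber side: passes the check while the mask is still R_ENABLED.
  bool enabled = (state_mask.load() & R_ENABLED) != 0;

  // Stop() side: both lines run inside this window; order does not matter.
  state_mask.store(0);
  cntx.Cancel();

  if (enabled) {
    cntx.Reset();  // fiber continues and wipes the cancellation
  }
  std::printf("cancelled after Reset: %d\n", cntx.cancelled.load());
  return 0;
}
```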
We do not need "fast cancel". It was designed by Roey because we allowed it, so he wanted to cover this use-case. If we do not allow concurrent updates (which I think is the right approach), we can just check that sending both commands concurrently works in a consistent manner (i.e. if we had replication before, it will be cancelled, and if not, the cancellation command is ignored or an error is returned).
It is not hard to fix! Not having fast cancel means that if the master endpoint is invalid, Dragonfly will be unresponsive until the connection times out. That's not cool. We already supported it before; Roy just removed the hop because he changed socket management.
Was there a specific product use-case that benefited from "fast cancel"?
Signed-off-by: adi_holden <adi@dragonflydb.io>
We have a connection timeout when connecting to the master in Start. If you notice that you've chosen the wrong endpoint for some reason, there is no way to stop this operation manually; you have to wait for it to time out.
The bug: One connection calls replica start and a second calls replica stop. In this flow Stop first resets the state mask with state_mask_.store(0), then Start sets the state mask with state_mask_.store(R_ENABLED), continues to greet, and creates the main replication fiber; only then does Stop run cntx_.Cancel(), which is later reset inside the main replication fiber. As a result the main replication fiber never cancels, and the connection calling Stop is deadlocked waiting to join the main replication fiber.
The fix: run cntx_.Cancel() and state_mask_.store(0) in the replica thread.
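A hedged sketch of what this could look like, assuming a helio-style proactor with a blocking Await() and an illustrative sync_fb_ member. Since fibers on one proactor interleave only at suspension points, running both operations on the replica's thread closes the window between the fiber's state check and cntx_.Reset():

```cpp
// Hedged sketch of the final approach: execute the state change and the
// cancellation on the replica's own proactor thread. Member names other than
// state_mask_ and cntx_ are assumptions.
void Replica::Stop() {
  proactor_->Await([this] {
    state_mask_.store(0);  // clear R_ENABLED
    cntx_.Cancel();        // seen by the fiber before its next Reset
  });
  sync_fb_.Join();         // the main replication fiber is now guaranteed to exit
}
```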