fix(replication): fix cancel replication race #2196

Merged: 2 commits into main from fix_replica_stop, Nov 21, 2023

Conversation

adiholden (Collaborator) commented Nov 20, 2023

The bug: one connection calls replica Start while a second calls replica Stop. In this interleaving, Stop first resets the state mask with state_mask_.store(0); Start then sets the state mask with state_mask_.store(R_ENABLED), continues to greet the master and creates the main replication fiber; only then does Stop run cntx_.Cancel(), which is later reset inside the main replication fiber. As a result the main replication fiber never cancels, and the connection calling Stop deadlocks waiting to join the main replication fiber.

The fix: run cntx_.Cancel() and state_mask_.store(0) on the replica thread.
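
A minimal sketch of that idea, using the snippet discussed in the conversation below (the rest of Stop, including joining the main replication fiber, is elided and unchanged):

// Inside Replica::Stop(): clear the enabled flag and cancel the context
// on the replica's own proactor thread, so the pair cannot interleave with
// Start() or with the main replication fiber running on that thread.
if (sock_) {
  sock_->proactor()->Await([this] {
    state_mask_.store(0);  // Specifically clears R_ENABLED.
    cntx_.Cancel();        // Context is fully responsible for cleanup.
  });
}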

flowchart TD
    A[Start] --> T[Time:T0]
    T --> E["state |=Enabled"]
    E --> F[FiberDispatch]
    B[Stop] --> R["state=0"]
    R --> T2[Time:T0]
    T2 --> T3[Time:T1]
    T3 --> CC["Context.Cancel()"]
    CC --> D[JoinFiber/Deadlock]
    T <--> T2
    T4[Time:T1] <--> T3
    FS[FiberStart] --> W["While (state & enabled)"]
    W --> T4
    T4 --> ER["if context.error() then continue"]
    ER --> W

The bug: One connection calls replica start and a second calls replica
stop. In this flow stop first resets the state mask with
state_mask_.store(0), Start sets the state mask with
state_mask_.store(R_ENABLED), continues to greet and create the main
replication fiber, and then stop runs cntx_.Cancel(), which will later
be reset inside the main replication fiber. In this flow the main
replication fiber does not cancel and the connection calling Stop is
deadlocked waiting to join the main replication fiber.

The fix: Switch the order between cntx_.Cancel() and state_mask_.store(0)
in the Stop function.

Signed-off-by: adi_holden <adi@dragonflydb.io>
dranikpg (Contributor) commented Nov 20, 2023

I still don't understand... How do we miss the context cancellation if it happens after setting enabled = false?

The only lines I know of that discard cancellation are those:

while (state_mask_.load() & R_ENABLED) {
    // Discard all previous errors and set default error handler.
    cntx_.Reset([this](const GenericError& ge) { this->DefaultErrorHandler(ge); });

and no matter in what order you place them, if we cross here it'll be problematic either way... (That's what I realize now 😅)

Maybe we should really run Stop on the replica's native thread? I remember we did some time ago 🤔

dranikpg (Contributor) commented Nov 20, 2023

Maybe we should really run Stop on the replica's native thread? I remember we did some time ago 🤔

Here it is! 😄 #1058

Those changes are not safe...

#1058 (comment)

adiholden (Collaborator, Author) commented:

@dranikpg as I wrote in the description above, setting enabled = false happens while replica Start is running (before Start stores R_ENABLED), and the context cancel from the Stop flow happens after the main replication fiber has been created.
When this happens we break out of the flow fibers, continue in the while loop of the main replication fiber, and reset the context.
Switching the order of state_mask_.store(0) and cntx_.Cancel() does solve the problem: if the context cancel from Stop happens before we exit Start (before the main replication fiber is created), Start returns on the cancel. If it happens after the main replication fiber is created, state_mask_.store(0) is not overwritten, and once we see the cancelled context we exit the while loop.
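
To make the ordering argument concrete, a rough sketch of the reordered Stop (only the two lines under discussion; everything else in Stop, including joining the fiber, is elided and unchanged):

// Sketch of the reordering described above (the first commit of this PR).
void Replica::Stop() {
  cntx_.Cancel();        // If Start() is still in its connect/greet phase, it observes
                         // the cancellation and returns before creating the fiber.
  state_mask_.store(0);  // If the fiber was already created, Start() has already stored
                         // R_ENABLED, so this store(0) is not overwritten and the
                         // while (state_mask_.load() & R_ENABLED) loop exits.
  // ... wait for the main replication fiber as before ...
}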

adiholden (Collaborator, Author) commented:

Are you saying that we should do this in Stop:

if (sock_) {
  sock_->proactor()->Await([this] {
    state_mask_ = 0;  // Specifically ~R_ENABLED.
    cntx_.Cancel();   // Context is fully responsible for cleanup.
  });
}

I can try this as well; although I am pretty sure that my change fixes the problem, this looks more robust.

romange (Collaborator) commented Nov 21, 2023

@adiholden, based on your PR description I created this sketch (the problematic flow). Is it correct?

flowchart TD
    A[Start] --> T[Time:T0]
    T --> E["state |=Enabled"]
    E --> F[FiberDispatch]
    B[Stop] --> R["state=0"]
    R --> T2[Time:T0]
    T2 --> T3[Time:T1]
    T3 --> CC["Context.Cancel()"]
    CC --> D[JoinFiber/Deadlock]
    T <--> T2
    T4[Time:T1] <--> T3
    FS[FiberStart] --> W["While (state & enabled)"]
    W --> T4
    T4 --> ER["if context.error() then continue"]
    ER --> W

romange (Collaborator) commented Nov 21, 2023

Guys, please take into account that for state transitions under management commands, the simplest and most robust approach is to use mutexes/spinlocks together with a state variable, to benefit from the simplicity of transactional semantics. Commands that do not require performance do not need the extra complexity of atomics. Delegating to a designated thread is also a valid approach.
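
For illustration only, a small self-contained sketch of that mutex-plus-state-variable pattern (the class, enum and method names are made up for this sketch and are not Dragonfly's actual replica code):

#include <mutex>

// Toy model: the state transition itself happens under a mutex, so a
// concurrent Start/Stop pair always resolves in one consistent order.
class ReplicaCtl {
 public:
  bool Start() {
    {
      std::lock_guard lk(mu_);
      if (state_ != State::kIdle)
        return false;  // a Start or Stop already took effect
      state_ = State::kEnabled;
    }
    // Connect, greet the master and spawn the replication fiber outside
    // the lock, so a concurrent Stop can still interrupt the attempt.
    return true;
  }

  bool Stop() {
    std::lock_guard lk(mu_);
    if (state_ != State::kEnabled)
      return false;  // nothing is running, the command is a no-op
    state_ = State::kStopped;
    // ... cancel the context and join the fiber ...
    return true;
  }

 private:
  enum class State { kIdle, kEnabled, kStopped };
  std::mutex mu_;
  State state_ = State::kIdle;
};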

adiholden (Collaborator, Author) commented:

@romange the flow you are describing also shows the problem of setting the state before cancelling the context, although it is not exactly the one I saw in my prints when the test failed.
In the failure I saw, Context.Cancel() happened after the FiberStart of the main replication fiber; in this fiber we have a while loop that resets the context at the beginning of each iteration. In my failure the error was caught in the middle of the loop, when the flow fibers were already running; we exit those fibers and continue the while loop from the top, since the loop condition only checks the Enabled state. Then we reset the context inside the while loop and start replication again.

adiholden (Collaborator, Author) commented:

As for the following suggested fix:

if (sock_) {
  sock_->proactor()->Await([this] {
    state_mask_ = 0;  // Specifically ~R_ENABLED.
    cntx_.Cancel();   // Context is fully responsible for cleanup.
  });
}

Because today we allow running Stop before replica Start finishes, the socket may not be initialized when Stop runs, as it is initialized in Start when we run connect and auth. We have this comment in the code: // We proceed connecting below without the lock to allow interrupting the replica immediately.
I don't think it is so critical to allow Stop before replica Start finishes, so I can disallow running Stop and Start together, making it possible to use the reset as suggested above.
If you think there is a real reason to allow Stop and Start to run together, let me know @dranikpg

romange (Collaborator) commented Nov 21, 2023

@adiholden I fixed and updated the PR description. It's an editable flowchart using the mermaid syntax.

adiholden (Collaborator, Author) commented:

@romange OK, after changing the flow to not allow running Stop and Start together, the cancel replication pytest of course fails, as this no longer allows cancelling replication. Why do we need to be able to cancel replication? I mean a fast cancel, cancelling it before we return from replicaof..

dranikpg (Contributor) commented Nov 21, 2023

Because the socket might not be initialized, we could just store the proactor on creation (I suggested it in the comments of Roy's PR).

Switching the order of state_mask_.store(0) and cntx_.Cancel() does solve the problem as if cancel context from Stop is called before we exit the Start

Yes, but those operations can happen very quickly regardless of order. See the code again, this is theoretically possible:

while (state_mask_.load() & R_ENABLED) {
    // What if
    // mask = 0
    // cntx.Cancel
    // happens here - regardless of order we will proceed
    cntx_.Reset([this](const GenericError& ge) { this->DefaultErrorHandler(ge); });
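
A sketch of the "store the proactor on creation" idea from above (the member name proactor_, the constructor shape and the proactor type are assumptions for illustration, not the actual class): capture the owning proactor when the Replica is created, so Stop can hop to the replica thread even before sock_ exists.

// Hypothetical illustration: remember the owning proactor at construction
// instead of reaching it through sock_, which may not be created yet.
class Replica {
 public:
  explicit Replica(ProactorBase* proactor) : proactor_(proactor) {}

  void Stop() {
    // Safe even if Start() has not opened the socket yet.
    proactor_->Await([this] {
      state_mask_.store(0);  // Specifically clears R_ENABLED.
      cntx_.Cancel();        // Context is fully responsible for cleanup.
    });
    // ... join the main replication fiber ...
  }

 private:
  ProactorBase* proactor_;
  // state_mask_, cntx_, sock_, fibers, etc. as in the real class.
};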

romange (Collaborator) commented Nov 21, 2023

We do not need "fast cancel". This has been designed by Roey because we allowed it so he wanted to cover this use-case. If we do not allow concurrent updates (which I think is the right approach) we can just check that sending both commands concurrently works in consistent manner (i.e. if we had replication before, it will be cancelled, and if not, then the cancellation command is ignored or an error is returned)

dranikpg (Contributor) commented Nov 21, 2023

We do not need "fast cancel

It is not hard to fix! Not having fast cancel means that if the master endpoint is invalid, Dragonfly will be unresponsive until the connection timed out. That's not cool

We already supported it before, Roy just removed the hop because he changed socket management

romange (Collaborator) commented Nov 21, 2023

Was there a specific product use-case that benefited from "fast cancel"?
Can you explain when and why Dragonfly would be unresponsive?

Signed-off-by: adi_holden <adi@dragonflydb.io>
dranikpg (Contributor) commented:

Can you explain when and why Dragonfly would be unresponsive?

We have a connection timeout when connecting to the master in Start. If you notice that you've chosen the wrong endpoint for some reason, there is no way to stop this operation manually; you have to wait for it to time out.

adiholden merged commit c10dac4 into main on Nov 21, 2023 (10 checks passed).
adiholden deleted the fix_replica_stop branch on November 21, 2023 at 10:47.