fix(replication): fix cancel replication race #2196

Merged: 2 commits into main from fix_replica_stop, Nov 21, 2023

Conversation

adiholden (Collaborator) commented Nov 20, 2023

The bug: one connection calls replica Start while a second calls replica Stop. In this interleaving, Stop first resets the state mask with state_mask_.store(0); Start then sets the state mask with state_mask_.store(R_ENABLED), continues to greet the master and creates the main replication fiber; only then does Stop run cntx_.Cancel(), which is later reset inside the main replication fiber. As a result the main replication fiber never cancels, and the connection calling Stop deadlocks waiting to join the main replication fiber.

The fix: run cntx_.Cancel() and state_mask_.store(0) on the replica thread.
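
A minimal sketch of that idea, using the snippet discussed in the conversation below (the rest of Stop, including joining the main replication fiber, is elided and unchanged):

// Inside Replica::Stop(): clear the enabled flag and cancel the context
// on the replica's own proactor thread, so the pair cannot interleave with
// Start() or with the main replication fiber running on that thread.
if (sock_) {
  sock_->proactor()->Await([this] {
    state_mask_.store(0);  // Specifically clears R_ENABLED.
    cntx_.Cancel();        // Context is fully responsible for cleanup.
  });
}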

flowchart TD
    A[Start] --> T[Time:T0]
    T --> E["state |=Enabled"]
    E --> F[FiberDispatch]
    B[Stop] --> R["state=0"]
    R --> T2[Time:T0]
    T2 --> T3[Time:T1]
    T3 --> CC["Context.Cancel()"]
    CC --> D[JoinFiber/Deadlock]
    T <--> T2
    T4[Time:T1] <--> T3
    FS[FiberStart] --> W["While (state & enabled)"]
    W --> T4
    T4 --> ER["if context.error() then continue"]
    ER --> W

The bug: One connection calls replica start and a second calls replica
stop. In this flow stop first resets the state mask with
state_mask_.store(0), Start sets the state mask with
state_mask_.store(R_ENABLED), continues to greet and create the main
replication fiber, and then stop runs cntx_.Cancel(), which will later
be reset inside the main replication fiber. In this flow the main
replication fiber does not cancel and the connection calling Stop is
deadlocked waiting to join the main replication fiber.

The fix: Switch the order between cntx_.Cancel() and state_mask_.store(0)
in the Stop function.

Signed-off-by: adi_holden <adi@dragonflydb.io>
dranikpg (Contributor) commented Nov 20, 2023

I still don't understand... How do we miss the context cancellation if it happens after setting enabled = false?

The only lines I know of that discard cancellation are those:

while (state_mask_.load() & R_ENABLED) {
    // Discard all previous errors and set default error handler.
    cntx_.Reset([this](const GenericError& ge) { this->DefaultErrorHandler(ge); });

and no matter in what order you place them, if we cross here it'll be problematic either way... (That's what I realize now 😅)

Maybe we should really run Stop on the replica's native thread? I remember we did some time ago 🤔

dranikpg (Contributor) commented Nov 20, 2023

Maybe we should really run Stop on the replica's native thread? I remember we did some time ago 🤔

Here it is! 😄 #1058

Those changes are not safe...

#1058 (comment)

adiholden (Collaborator, Author) commented:

@dranikpg as I wrote in the description above, setting enabled = false happens while replica Start is running (before Start stores R_ENABLED), and the context cancel from the Stop flow happens after the main replication fiber has been created.
When this happens we break out of the flow fibers, continue in the while loop of the main replication fiber, and reset the context.
Switching the order of state_mask_.store(0) and cntx_.Cancel() does solve the problem: if the context cancel from Stop happens before we exit Start (before the main replication fiber is created), Start returns on the cancel. If it happens after the main replication fiber is created, state_mask_.store(0) is not overwritten, and once we see the cancelled context we exit the while loop.
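
To make the ordering argument concrete, a rough sketch of the reordered Stop (only the two lines under discussion; everything else in Stop, including joining the fiber, is elided and unchanged):

// Sketch of the reordering described above (the first commit of this PR).
void Replica::Stop() {
  cntx_.Cancel();        // If Start() is still in its connect/greet phase, it observes
                         // the cancellation and returns before creating the fiber.
  state_mask_.store(0);  // If the fiber was already created, Start() has already stored
                         // R_ENABLED, so this store(0) is not overwritten and the
                         // while (state_mask_.load() & R_ENABLED) loop exits.
  // ... wait for the main replication fiber as before ...
}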

adiholden (Collaborator, Author) commented:

Are you saying that we should do this in Stop:

if (sock_) {
  sock_->proactor()->Await([this] {
    state_mask_ = 0;  // Specifically ~R_ENABLED.
    cntx_.Cancel();   // Context is fully responsible for cleanup.
  });
}

I can try this as well; although I am pretty sure that my change fixes the problem, this looks more robust.

romange (Collaborator) commented Nov 21, 2023

@adiholden, based on your PR description I created this sketch (the problematic flow). Is it correct?

flowchart TD
    A[Start] --> T[Time:T0]
    T --> E["state |=Enabled"]
    E --> F[FiberDispatch]
    B[Stop] --> R["state=0"]
    R --> T2[Time:T0]
    T2 --> T3[Time:T1]
    T3 --> CC["Context.Cancel()"]
    CC --> D[JoinFiber/Deadlock]
    T <--> T2
    T4[Time:T1] <--> T3
    FS[FiberStart] --> W["While (state & enabled)"]
    W --> T4
    T4 --> ER["if context.error() then continue"]
    ER --> W

romange (Collaborator) commented Nov 21, 2023

Guys, please take into account that for state transitions under management commands, the simplest and most robust approach is to use mutexes/spinlocks together with a state variable, to benefit from the simplicity of transactional semantics. Commands that do not require performance do not need the extra complexity of atomics. Delegating to a designated thread is also a valid approach.
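
For illustration only, a small self-contained sketch of that mutex-plus-state-variable pattern (the class, enum and method names are made up for this sketch and are not Dragonfly's actual replica code):

#include <mutex>

// Toy model: the state transition itself happens under a mutex, so a
// concurrent Start/Stop pair always resolves in one consistent order.
class ReplicaCtl {
 public:
  bool Start() {
    {
      std::lock_guard lk(mu_);
      if (state_ != State::kIdle)
        return false;  // a Start or Stop already took effect
      state_ = State::kEnabled;
    }
    // Connect, greet the master and spawn the replication fiber outside
    // the lock, so a concurrent Stop can still interrupt the attempt.
    return true;
  }

  bool Stop() {
    std::lock_guard lk(mu_);
    if (state_ != State::kEnabled)
      return false;  // nothing is running, the command is a no-op
    state_ = State::kStopped;
    // ... cancel the context and join the fiber ...
    return true;
  }

 private:
  enum class State { kIdle, kEnabled, kStopped };
  std::mutex mu_;
  State state_ = State::kIdle;
};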

adiholden (Collaborator, Author) commented:

@romange the flow you are describing also shows the problem of setting the state before cancelling the context, although it is not exactly the one I saw in my prints when the test failed.
In the failure I saw, Context.Cancel() happened after the FiberStart of the main replication fiber; in this fiber we have a while loop that resets the context at the beginning of each iteration. In my failure the error was caught in the middle of the loop, when the flow fibers were already running; we exit those fibers and continue the while loop from the top, since the loop condition only checks the Enabled state. Then we reset the context inside the while loop and start replication again.

adiholden (Collaborator, Author) commented:

As for the following suggested fix:

if (sock_) {
  sock_->proactor()->Await([this] {
    state_mask_ = 0;  // Specifically ~R_ENABLED.
    cntx_.Cancel();   // Context is fully responsible for cleanup.
  });
}

Because today we allow running Stop before replica Start finishes, the socket may not be initialized when Stop runs, as it is initialized in Start when we run connect and auth. We have this comment in the code: // We proceed connecting below without the lock to allow interrupting the replica immediately.
I don't think it is so critical to allow Stop before replica Start finishes, so I can disallow running Stop and Start together, making it possible to use the reset as suggested above.
If you think there is a real reason to allow Stop and Start to run together, let me know @dranikpg

romange (Collaborator) commented Nov 21, 2023

@adiholden I fixed and updated the PR description. It's an editable flowchart using the mermaid syntax.

adiholden (Collaborator, Author) commented:

@romange OK, after changing the flow to not allow running Stop and Start together, the cancel replication pytest of course fails, as this no longer allows cancelling replication. Why do we need to be able to cancel replication? I mean a fast cancel, cancelling it before we return from replicaof..

dranikpg (Contributor) commented Nov 21, 2023

Because the socket might not be initialized, we could just store the proactor on creation (I suggested it in the comments of Roy's PR).

Switching the order of state_mask_.store(0) and cntx_.Cancel() does solve the problem as if cancel context from Stop is called before we exit the Start

Yes, but those operations can happen very quickly regardless of order. See the code again, this is theoretically possible:

while (state_mask_.load() & R_ENABLED) {
    // What if
    // mask = 0
    // cntx.Cancel
    // happens here - regardless of order we will proceed
    cntx_.Reset([this](const GenericError& ge) { this->DefaultErrorHandler(ge); });
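
A sketch of the "store the proactor on creation" idea from above (the member name proactor_, the constructor shape and the proactor type are assumptions for illustration, not the actual class): capture the owning proactor when the Replica is created, so Stop can hop to the replica thread even before sock_ exists.

// Hypothetical illustration: remember the owning proactor at construction
// instead of reaching it through sock_, which may not be created yet.
class Replica {
 public:
  explicit Replica(ProactorBase* proactor) : proactor_(proactor) {}

  void Stop() {
    // Safe even if Start() has not opened the socket yet.
    proactor_->Await([this] {
      state_mask_.store(0);  // Specifically clears R_ENABLED.
      cntx_.Cancel();        // Context is fully responsible for cleanup.
    });
    // ... join the main replication fiber ...
  }

 private:
  ProactorBase* proactor_;
  // state_mask_, cntx_, sock_, fibers, etc. as in the real class.
};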

romange (Collaborator) commented Nov 21, 2023

We do not need "fast cancel". This has been designed by Roey because we allowed it so he wanted to cover this use-case. If we do not allow concurrent updates (which I think is the right approach) we can just check that sending both commands concurrently works in consistent manner (i.e. if we had replication before, it will be cancelled, and if not, then the cancellation command is ignored or an error is returned)

dranikpg (Contributor) commented Nov 21, 2023

We do not need "fast cancel

It is not hard to fix! Not having fast cancel means that if the master endpoint is invalid, Dragonfly will be unresponsive until the connection timed out. That's not cool

We already supported it before, Roy just removed the hop because he changed socket management

romange (Collaborator) commented Nov 21, 2023

Was there a specific product use-case that benefited from "fast cancel"?
Can you explain when and why Dragonfly would be unresponsive?

Signed-off-by: adi_holden <adi@dragonflydb.io>
dranikpg (Contributor) commented:

Can you explain when and why Dragonfly would be unresponsive?

We have a connection timeout when connecting to the master in Start. If you notice that you've chosen the wrong endpoint for some reason, there is no way to stop this operation manually; you have to wait for it to time out.

adiholden merged commit c10dac4 into main on Nov 21, 2023 (10 checks passed).
adiholden deleted the fix_replica_stop branch on November 21, 2023 at 10:47.