Not seeing expected improvement in throughput of RaftCluster.ReplicateAsync method when cluster minority is inaccessible #233
Comments
It is a measurement of latency, not throughput. By calling |
Hi @sakno I can see that we messed up the terms a bit 😊 From the wording in the changelog:
we were hoping to also see an improvement in latency (in case of an inaccessible cluster minority). Are you saying that is not the case? Or are you saying we should just call We have a single writer in our system (and we do prefer latency over throughput).
Not exactly, it's an improvement in throughput from the client's perspective. However, the underlying state machine commits changes earlier. A small recap:
|
If you want to improve latency in the presence of unavailable nodes, you can use the following techniques:
|
Our |
The leader exposes |
Average of
(Linux ARM + .NET8 + DotNext.Net.Cluster 5.4.0) |
How many nodes are disconnected? 920 ms with 1 disconnected node?
A single disconnected node (out of a cluster of 6 in total).
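For context on why one unreachable node out of six should not, in principle, slow down a commit, here is a minimal sketch of the majority arithmetic in Raft. It is an idealized model, not DotNext code: the function names `majority` and `commit_latency_ms` and the acknowledgement times are hypothetical, and an unreachable follower is modeled as an infinite acknowledgement time.

```python
import math


def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def commit_latency_ms(follower_ack_ms: list[float], cluster_size: int) -> float:
    """Idealized commit latency for one log entry.

    The leader's own write counts as one vote, so it needs
    majority - 1 follower acknowledgements. The entry commits as soon
    as that many followers have answered, i.e. at the ack time of the
    (majority - 1)-th fastest follower.
    """
    needed = majority(cluster_size) - 1
    if needed == 0:
        return 0.0
    return sorted(follower_ack_ms)[needed - 1]


# Hypothetical ack times (ms) for the 5 followers of a 6-node cluster.
healthy = [5.0, 7.0, 9.0, 12.0, 15.0]
one_down = [5.0, 7.0, 9.0, 12.0, math.inf]  # one follower unreachable

print(majority(6))                     # quorum size for 6 nodes
print(commit_latency_ms(healthy, 6))   # third-fastest follower decides
print(commit_latency_ms(one_down, 6))  # unchanged in this model
```

In this model the quorum for 6 nodes is 4, so the leader needs 3 follower acknowledgements and the slowest (or missing) follower never sits on the commit path. A large jump in measured time after a single disconnect therefore points at something other than quorum arithmetic.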
Very suspicious, |
One more way to investigate the issue. There is A group of metrics is |
Example before disconnecting
Example after disconnecting
And approximately 1.5 minutes later the message changes to
|
Omg, I found the root cause. It's a trivial one-line fix.
Awesome! :)
Could you check |
Also, the upcoming release introduces new |
@sakno the "snapshot installation" messages are also gone 👍
I'll prepare a new release today
Release 5.5.0 has been published.
This plot shows the timing (in ms) of RaftCluster.ReplicateAsync - at the vertical green line 1 node is disconnected (out of a cluster of 6 nodes in total):
(Linux ARM + .NET6 + DotNext.Net.Cluster 4.14.1)
From the change log of DotNext.Net.Cluster 4.15.0:
This made us hope that we would no longer see these kinds of longer timings in case of inaccessible cluster minority. However, we see a pretty similar plot - at the green line 1 node is disconnected (out of a cluster of 6 nodes):
(Linux ARM + .NET8 + DotNext.Net.Cluster 5.3.0)
Did we have wrong expectations, or are we doing something wrong?