
Not seeing expected improvement in throughput of RaftCluster.ReplicateAsync method when cluster minority is inaccessible #233

Closed
LarsWithCA opened this issue Apr 19, 2024 · 20 comments
Labels: enhancement (New feature or request)

@LarsWithCA

This plot shows the timing (in ms) of RaftCluster.ReplicateAsync; at the vertical green line, 1 node is disconnected (out of a cluster of 6 nodes in total):
(Linux ARM + .NET6 + DotNext.Net.Cluster 4.14.1)

[Plot: RaftCluster.ReplicateAsync timing in ms; longer timings appear after the green line where one node is disconnected]

From the change log of DotNext.Net.Cluster 4.15.0:

Raft performance: improved throughput of IRaftCluster.ReplicateAsync method when cluster minority is not accessible (faulty node). Now the leader waits for replication from majority of nodes instead of all nodes

This made us hope that we would no longer see these kinds of longer timings when a cluster minority is inaccessible. However, we see a pretty similar plot; at the green line, 1 node is disconnected (out of a cluster of 6 nodes):
(Linux ARM + .NET8 + DotNext.Net.Cluster 5.3.0)

[Plot: RaftCluster.ReplicateAsync timing in ms on 5.3.0; a similar pattern of longer timings after the green line]

Did we have wrong expectations, or are we doing something wrong?

@sakno sakno self-assigned this Apr 19, 2024
@sakno sakno added the enhancement New feature or request label Apr 19, 2024
@sakno
Collaborator

sakno commented Apr 20, 2024

(in ms) of RaftCluster.ReplicateAsync

It is a measurement of latency, not throughput. By calling ForceReplicationAsync on every write, you choose better latency at the cost of lower throughput. With ForceReplicationAsync you don't need to wait for the next heartbeat round, which is why latency improves. However, if you want good throughput, you need to accumulate as many uncommitted log entries as possible and wait for replication. That can be done without an explicit call to ForceReplicationAsync. If you have just one writer, prefer latency over throughput.
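
For illustration, a minimal sketch of the two patterns, assuming a single writer running on the leader node; the exact ReplicateAsync/AppendAsync overloads are assumptions and may differ between DotNext.Net.Cluster versions:

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using DotNext.Net.Cluster.Consensus.Raft;

static class WritePatterns
{
    // Latency-oriented (single writer): replicate each entry right away
    // instead of waiting for the next heartbeat round.
    public static async Task WriteAsync(IRaftCluster cluster, IRaftLogEntry entry, CancellationToken token)
        => await cluster.ReplicateAsync(entry, token);

    // Throughput-oriented: append a batch of uncommitted entries to the local
    // WAL and let the regular heartbeat round replicate and commit them,
    // without an explicit ForceReplicationAsync call.
    public static async Task WriteBatchAsync(IRaftCluster cluster, IReadOnlyList<IRaftLogEntry> batch, CancellationToken token)
    {
        foreach (var entry in batch)
            await cluster.AuditTrail.AppendAsync(entry, token);
    }
}
```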

@LarsWithCA
Author

Hi @sakno

I can see that we messed up the terms a bit 😊

From the wording in the changelog:

Now the leader waits for replication from majority of nodes instead of all nodes.

we were hoping to also see an improvement in latency (in the case of an inaccessible cluster minority). Are you saying that is not the case? Or are you saying we should just call ForceReplicationAsync instead?

We have a single writer in our system (and we do prefer latency over throughput).

@sakno
Collaborator

sakno commented Apr 23, 2024

we were hoping to also see an improvement in latency

Not exactly, it's an improvement in throughput from the client's perspective. However, the underlying state machine commits changes earlier. A small recap:

  1. ForceReplicationAsync resumes when all nodes have replicated (or some of them are detected as unavailable)
  2. AppendAsync from the WAL completes before ForceReplicationAsync because only a majority is needed to mark log entries as committed
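
A sketch of that distinction (the AppendAsync/WaitForCommitAsync overloads are assumptions and may vary by version): waiting for the commit of a known index resumes once the majority has replicated, while ForceReplicationAsync tracks the whole replication round.

```csharp
using System.Threading;
using System.Threading.Tasks;
using DotNext.Net.Cluster.Consensus.Raft;

static class CommitVsFullRound
{
    public static async Task AppendAndWaitForCommitAsync(IRaftCluster cluster, IRaftLogEntry entry, CancellationToken token)
    {
        // Append to the local WAL; the returned index identifies the entry.
        long index = await cluster.AuditTrail.AppendAsync(entry, token);

        // Resumes once the majority has replicated up to this index (commit),
        // potentially earlier than ForceReplicationAsync would resume.
        await cluster.AuditTrail.WaitForCommitAsync(index, token);
    }
}
```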

@sakno
Collaborator

sakno commented Apr 23, 2024

If you want to improve latency in the presence of unavailable nodes, you can use the following techniques:

  1. Reduce connect timeout
  2. Enable automatic failure detection so the leader can remove unhealthy nodes automatically from the list of cluster members
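
For the first point, a hedged sketch only: the configuration key names below ("connectTimeout", "requestTimeout") are assumptions based on the property names discussed in this thread and may differ depending on the hosting model; the automatic failure detection hook is omitted because its setup is specific to the host.

```csharp
using System.Collections.Generic;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Hosting;

// Keep the connect timeout short so an unreachable member fails fast,
// while the request timeout bounds a full round-trip to a healthy member.
var raftSettings = new Dictionary<string, string?>
{
    ["connectTimeout"] = "00:00:00.010", // 10 ms
    ["requestTimeout"] = "00:00:01",     // 1 s
};

var host = Host.CreateDefaultBuilder()
    .ConfigureAppConfiguration(config => config.AddInMemoryCollection(raftSettings))
    // ... the rest of the cluster/web host setup goes here ...
    .Build();
```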

@LarsWithCA
Author

Reduce connect timeout

Our ConnectTimeout is only 10ms. Did you mean another timeout? We have 'RequestTimeout' set to 1 second.

@sakno
Collaborator

sakno commented Apr 24, 2024

The leader exposes the broadcast-time counter in milliseconds. You can measure it with dotnet-counters. What's the average value of this counter with and without unavailable nodes?
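
For example, something along these lines (process id is a placeholder; both Raft meter groups are listed so that broadcast-time shows up regardless of which group publishes it):

```sh
dotnet-counters monitor --process-id <pid> \
  --counters DotNext.Net.Cluster.Consensus.Raft.Server,DotNext.Net.Cluster.Consensus.Raft.Client
```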

@LarsWithCA
Author

Average of broadcast-time before/after disconnecting a node:

  • before: 5ms
  • after: 920ms

(Linux ARM + .NET8 + DotNext.Net.Cluster 5.4.0)

@sakno
Collaborator

sakno commented May 2, 2024

How many nodes are disconnected? 920 ms with 1 disconnected node?

@LarsWithCA
Author

A single disconnected node (out of a cluster of 6 in total).

@sakno
Collaborator

sakno commented May 2, 2024

Very suspicious: the jump from 5 ms to 920 ms doesn't match a 10 ms connection timeout.

@sakno
Collaborator

sakno commented May 2, 2024

One more way to investigate the issue: there is a response-time counter that shows the response time for each node individually. Could you share this value for each node? Every counter has a tag indicating the IP address of the node (you can replace the IP addresses with anything else; it's needed just to distinguish the values).

The metrics group is DotNext.Net.Cluster.Consensus.Raft.Client, not DotNext.Net.Cluster.Consensus.Raft.Server.
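
For example, a possible invocation to watch the per-node response-time percentiles (the dotnext.raft.client.address tag distinguishes the peers):

```sh
dotnet-counters monitor --process-id <pid> --counters DotNext.Net.Cluster.Consensus.Raft.Client
```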

@LarsWithCA
Author

LarsWithCA commented May 2, 2024

Example before disconnecting 202:

[dotnext.raft.client.address=232;dotnext.raft.client.message=AppendEntries;Percentile=50] | 6.375
[dotnext.raft.client.address=202;dotnext.raft.client.message=AppendEntries;Percentile=50] | 1.296875
[dotnext.raft.client.address=149;dotnext.raft.client.message=AppendEntries;Percentile=50] | 10.375
[dotnext.raft.client.address=154;dotnext.raft.client.message=AppendEntries;Percentile=50] | 7.671875
[dotnext.raft.client.address=31;dotnext.raft.client.message=AppendEntries;Percentile=50] | 6.765625

Example after disconnecting 202:

[dotnext.raft.client.address=232;dotnext.raft.client.message=AppendEntries;Percentile=50] | 1.41796875
[dotnext.raft.client.address=202;dotnext.raft.client.message=AppendEntries;Percentile=50] | 996
[dotnext.raft.client.address=149;dotnext.raft.client.message=AppendEntries;Percentile=50] | 1.990234375
[dotnext.raft.client.address=154;dotnext.raft.client.message=AppendEntries;Percentile=50] | 2.3828125
[dotnext.raft.client.address=31;dotnext.raft.client.message=AppendEntries;Percentile=50] | 7.375

And approx. 1.5 minutes later, the message changes to InstallSnapshot for 202:

[dotnext.raft.client.address=232;dotnext.raft.client.message=AppendEntries;Percentile=50] | 7.3125
[dotnext.raft.client.address=202;dotnext.raft.client.message=InstallSnapshot;Percentile=50] | 1000
[dotnext.raft.client.address=149;dotnext.raft.client.message=AppendEntries;Percentile=50] | 4.171875
[dotnext.raft.client.address=154;dotnext.raft.client.message=AppendEntries;Percentile=50] | 4.484375
[dotnext.raft.client.address=31;dotnext.raft.client.message=AppendEntries;Percentile=50] | 4.6328125

@sakno
Collaborator

sakno commented May 2, 2024

It seems like ConnectTimeout is not equal to 10 ms; it's equal to 1000 ms (the same as RequestTimeout). How do you set ConnectTimeout in the code?

Omg, I found the root cause. It's a trivial, one-line fix.

sakno added a commit that referenced this issue May 2, 2024
@LarsWithCA
Author

Awesome! :)

@sakno
Collaborator

sakno commented May 2, 2024

Could you check the develop branch? It contains both fixes: the incorrect usage of ConnectTimeout and the accidental snapshot installation.

@sakno
Collaborator

sakno commented May 2, 2024

Also, the upcoming release introduces a new WaitForLeadershipAsync method that waits for the local node to be elected as the cluster leader and returns a leadership token. It is very convenient if your code relies on the LeaderChanged event to determine which node allows writes.
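
A purely hypothetical usage sketch, since the exact signature isn't shown here; the single-CancellationToken overload and the CancellationToken-shaped leadership token are assumptions:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using DotNext.Net.Cluster.Consensus.Raft;

static class LeadershipLoop
{
    public static async Task RunAsLeaderAsync(IRaftCluster cluster, CancellationToken appToken)
    {
        // Wait until the local node is elected leader; the returned token is
        // assumed to be cancelled when leadership is lost.
        CancellationToken leadershipToken = await cluster.WaitForLeadershipAsync(appToken);

        while (!leadershipToken.IsCancellationRequested && !appToken.IsCancellationRequested)
        {
            // ... append/replicate entries here while this node is still the leader ...
            await Task.Delay(TimeSpan.FromMilliseconds(100), appToken);
        }
    }
}
```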

@LarsWithCA
Author

The issue seems to be fixed, i.e. the time of RaftCluster.ReplicateAsync immediately goes back to something very low (after disconnecting one node):
[Plot: RaftCluster.ReplicateAsync timing drops back to a few ms immediately after the node is disconnected]

This is great, thanks a lot!
I'll keep my cluster running and get back to you tomorrow regarding the accidental snapshot installation.

@LarsWithCA
Author

@sakno the "snapshot installation" messages are also gone 👍

@sakno
Collaborator

sakno commented May 3, 2024

I'll prepare a new release today

@sakno
Collaborator

sakno commented May 5, 2024

Release 5.5.0 has been published.
