Short-lived leader election reported when starting a Raft cluster #168

Closed
freddyrios opened this issue Jun 2, 2023 · 24 comments
Labels: bug (Something isn't working)

freddyrios commented Jun 2, 2023

The issue can be reproduced with the example project by building it locally and adding a cluster.bat with the following lines to the output folder:

START /B .\RaftNode.exe tcp 3262 node2 > node2.log
START /B .\RaftNode.exe tcp 3263 node3 > node3.log
START /B .\RaftNode.exe tcp 3264 node4 > node4.log

Run: del -r node* && .\cluster.bat

I can reproduce it within 2-6 attempts on a Windows x64 machine. The issue was originally found on a Raspberry Pi (ARM, Linux).

The leader prints this in its log:

New cluster leader is elected. Leader address is 127.0.0.1:3260
Term of local cluster member is 1. Election timeout 00:00:00.1590000
Consensus cannot be reached
Term of local cluster member is 1. Election timeout 00:00:00.1590000
New cluster leader is elected. Leader address is 127.0.0.1:3267
Term of local cluster member is 2. Election timeout 00:00:00.1590000
Accepting value 500
Accepting value 1000
Accepting value 1500
Accepting value 2000
...

The other nodes print the following (and the leader additionally prints the "Saving value" messages):

New cluster leader is elected. Leader address is 127.0.0.1:3267
Term of local cluster member is 2. Election timeout 00:00:00.1700000
Accepting value 500
Accepting value 1000
Accepting value 1500
Accepting value 2000
...

When done with the run, the RaftNode processes need to be killed via Task Manager since they are running in the background.

Originally posted by @freddyrios in #167 (reply in thread)

sakno added this to Opened in Cluster via automation Jun 2, 2023

sakno commented Jun 2, 2023

Leader address is 127.0.0.1:3260

How is that possible if there are only nodes 3262, 3263, and 3264?

@freddyrios

@sakno sorry for the noise, I pasted the log messages from the 10-node test run because they looked the same and forgot about it. Here are the actual entries for the 3-node run:

New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 2. Election timeout 00:00:00.2510000
Accepting value 500
Accepting value 1000
Accepting value 1500
Accepting value 2000

New cluster leader is elected. Leader address is 127.0.0.1:3263
Term of local cluster member is 1. Election timeout 00:00:00.1820000
Consensus cannot be reached
Term of local cluster member is 1. Election timeout 00:00:00.1820000
New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 2. Election timeout 00:00:00.1820000
Accepting value 500
Accepting value 1000
Accepting value 1500
Accepting value 2000

New cluster leader is elected. Leader address is 127.0.0.1:3264
Term of local cluster member is 2. Election timeout 00:00:00.2560000
Saving value 500 generated by the leader node
Accepting value 500
Saving value 1000 generated by the leader node
Accepting value 1000
Saving value 1500 generated by the leader node
Accepting value 1500
Saving value 2000 generated by the leader node
Accepting value 2000


sakno commented Jun 3, 2023

So the problem is with node 3263, right? It was elected as leader, but that fact was not reflected by the other nodes.


freddyrios commented Jun 3, 2023 via email


sakno commented Jun 3, 2023

In theory, it's possible (and correct from the Raft perspective), but it should be a very rare situation. I'll take a look.

sakno self-assigned this Jun 3, 2023
sakno added the enhancement (New feature or request) label Jun 3, 2023

sakno commented Jun 3, 2023

Maybe related: PR #170


sakno commented Jun 4, 2023

Can you turn on Debug severity and attach logs for each node separately?
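For reference, a minimal sketch of raising the log severity to Debug with Microsoft.Extensions.Logging (it assumes the Microsoft.Extensions.Logging.Console package; how the resulting ILoggerFactory is handed to the example's cluster node is an assumption and may differ from the actual RaftNode wiring):

```csharp
using Microsoft.Extensions.Logging;

// Minimal sketch: create a logger factory with Debug severity and a console sink.
// A file sink can be plugged in via a third-party provider instead of AddConsole().
using var loggerFactory = LoggerFactory.Create(builder => builder
    .SetMinimumLevel(LogLevel.Debug)   // Debug severity, as requested above
    .AddConsole());

// Hypothetical usage: pass 'loggerFactory' to wherever the example builds its node,
// so the DotNext.Net.Cluster.Consensus.Raft.RaftCluster log entries become visible.
var logger = loggerFactory.CreateLogger("RaftNode");
logger.LogDebug("Debug logging enabled");
```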

@freddyrios

Switched to file logging that also shows timestamps, and added an extra log message just before calling StartAsync.
node4.log
node2.log
node3.log

@freddyrios

Forgot to add that I set logging to Trace and did not get anything extra.


sakno commented Jun 5, 2023

The example code uses Console.WriteLine instead of the logging infrastructure. It seems the logs were not written at all (or were written to another sink, such as a file instead of stdout).


freddyrios commented Jun 5, 2023

It turned out a lot of the logging is only enabled for HTTP.

Here are the details of one of the runs after enabling it for TCP.

node2.log
node3.log
node4.log

Some things that I found interesting:

  1. term 0 is skipped, as nodes transitioning to leader change to term 1
  2. at 41.8886 node 3 logs that, in term 1, node 2 rejected the vote but node 4 approved it
  3. node 2 accepts the loss by downgrading to follower at 41.9006, while node 3 says it became a leader at 41.9008
  4. node 4's first Raft log entry is at 41.8900, which is 0.0014 s after it supposedly voted for node 3
  5. at 41.8900 node 4 says twice that it downgrades to follower in term 1 and that "Election timeout is refreshed", but then incorrectly transitions to candidate immediately
  6. later node 4 starts and wins a vote in term 2 as normal.
  7. there are still some funny artifacts at this point though, for example:
    1. node 4 reports node 3's rejection 29 ms after node 2 approved the vote
    2. (maybe expected) node 3 reports a leader change with no leader for term 1, when we have already entered the term 2 election
    3. node 3 reports "Leader state reverted with error" ... a task was canceled.


sakno commented Jun 5, 2023

term 0 is skipped, as nodes transitioning to leader change to term 1

This is fine. Term 0 is the initial term for every fresh node. The transition to Candidate increases the term.
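In other words (a tiny illustration, not dotNext code): every node boots at term 0, and the first node whose election timer fires increments its term before requesting votes, so the first successful election is reported as term 1.

```csharp
using System;

// Illustration only: a fresh node's term starts at 0 and is bumped when it
// becomes a candidate, so the first election in a fresh cluster runs in term 1.
long currentTerm = 0;

long MoveToCandidate() => ++currentTerm;

Console.WriteLine($"First election runs in term {MoveToCandidate()}"); // term 1
```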

@freddyrios

at 41.8900 node 4 says twice that it downgrades to follower in term 1 and that "Election timeout is refreshed", but then incorrectly transitions to candidate immediately

I missed that it says "Member is downgrading" vs. "Member is downgraded", so it's not a duplicate of that message. The timeout message does show up twice, though.

sakno added a commit that referenced this issue Jun 5, 2023

sakno commented Jun 5, 2023

I found a way to repro the issue with an isolated unit test (see the commit above). The test fails and reports that some nodes in the cluster see different leaders immediately after start.

@freddyrios

I also tracked down some of it on my side.

At least in the last capture I made, the election timer triggers 35 ms before the node transitions to Candidate. In between those two steps, the node can still vote for others.

sakno added the bug (Something isn't working) label and removed the enhancement (New feature or request) label Jun 5, 2023
sakno added a commit that referenced this issue Jun 5, 2023

sakno commented Jun 5, 2023

I forgot to add ColdStart=false to the configuration. Now the issue is not reproducible. You can try the same thing using the TcpTransportTests.ConcurrentElection() test.

@freddyrios

At least in the last capture I made, the election timer triggers 35 ms before the node transitions to Candidate. In between those two steps, the node can still vote for others.

I did not have time to post the details yesterday.

It seems the root cause is that, while in the process of becoming a candidate, the node can still grant votes, and when it does, it still becomes a candidate anyway.

One potential solution is, in RaftCluster.MoveToCandidateState, after taking the transitionSync lock, to abort becoming a candidate if FollowerState.Refresh was called after the timeout was set. PreVote already seems to work similarly there. The lock would prevent votes while this check runs, which should remove the ambiguity between voting and becoming a candidate. Additionally, FollowerState.Refresh and the related caller logic seem to run under the lock too, so this could help avoid similar issues in other places that expect the timeout to have been refreshed if the node was still a follower.

This is the capture from the node that I used to track it down:

2023-06-05 16:24:35.8528|INFO|Program|calling StartAsync
2023-06-05 16:24:36.1754|INFO|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|timer timed out, transitioning to candidate state!
2023-06-05 16:24:36.1840|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member is downgrading to follower state with term 1
2023-06-05 16:24:36.1840|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Election timeout is refreshed
2023-06-05 16:24:36.1840|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Member is downgraded to follower state with term 1
2023-06-05 16:24:36.2010|INFO|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Transition to Candidate state has started with term 1
2023-06-05 16:24:36.2010|DEBUG|DotNext.Net.Cluster.Consensus.Raft.RaftCluster|Voting is started with timeout 258 and term 2

The "timer timed out" log entry is an entry I added to FollowerState.Track when it exits the loop and sets timedOut = true.

Here is my interpretation of it:

  1. at 36.1754 FollowerState.Track times out and begins the move to candidate state
  2. The move to candidate state involves these actions: queue the move to candidate in the ThreadPool, the work is picked up and PreVote runs (at least one other node needs to pre-vote), take the transitionSync lock, check FollowerState { IsExpired: true }, log "Transition to Candidate state has started with term 1", increment the term and actually become a candidate, start voting
  3. notice that between "timer timed out" and "transition to Candidate has started" at 36.2010 there are 25 ms (35 ms was a mistake in my message yesterday)
  4. The log of the short lived leader (not included above) logs at 36.1953 that the problem node granted it the vote.
  5. The other messages at 36.1840 must be when the problem node was granting the above vote
  6. The above means the problem node was granting votes when it had already decided to become a candidate. When granting votes the node is supposed to reset its timer, which it tries to do via FollowerState.Refresh, but that is just ignored as the timeout had already triggered (a simplified model of this race is sketched below)
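A simplified model of that race (illustrative only, not the actual dotNext types; the member names just mirror the discussion):

```csharp
using System;

var state = new FollowerModel();

state.TimerFired();            // 36.1754: the election timeout fires and the transition is queued
state.GrantVoteAndRefresh();   // 36.1840: a vote is granted to another candidate; the refresh is ignored
state.MoveToCandidateState();  // 36.2010: the node still becomes a candidate

sealed class FollowerModel
{
    bool expired;                 // set when the election timer fires
    bool refreshedAfterTimeout;   // a refresh requested after the timer already fired

    public void TimerFired() => expired = true;

    public void GrantVoteAndRefresh()
    {
        // Granting a vote is supposed to reset the election timer, but once the
        // timer has fired the refresh request no longer restarts anything.
        if (expired) refreshedAfterTimeout = true;
    }

    public void MoveToCandidateState()
    {
        // Pre-fix behavior: only 'expired' is consulted, so the node becomes a
        // candidate even though it just voted for another node.
        Console.WriteLine(expired
            ? $"Becoming candidate (voted for another node after timeout: {refreshedAfterTimeout})"
            : "Staying follower");
    }
}
```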


sakno commented Jun 6, 2023

One potential solution is, in RaftCluster.MoveToCandidateState, after taking the transitionSync lock, to abort becoming a candidate if FollowerState.Refresh was called after the timeout was set. PreVote already seems to work similarly there. The lock would prevent votes while this check runs, which should remove the ambiguity between voting and becoming a candidate. Additionally, FollowerState.Refresh and the related caller logic seem to run under the lock too, so this could help avoid similar issues in other places that expect the timeout to have been refreshed if the node was still a follower.

I'll check this out. However, I can tell that this part of the algorithm has remained unchanged for a long time. Also, I recommend checking the develop branch because of #170, which is probably a source of competing log entries.

@freddyrios

FYI I pulled the latest develop after my last message and reproduced it again.

@freddyrios

By the way, the issue has been present at least since version 4.4.1, which is the one we have been running, so it makes sense that it relates to code that has not changed over time. We had some issues on our side that made it harder to notice.


sakno commented Jun 7, 2023

Fixed as follows: check whether a refresh has been requested after taking the lock inside MoveToCandidateState. If so, resume the Follower state instead of moving to Candidate.
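A sketch of that check (illustrative only, not the actual dotNext implementation; the field and method names are assumptions based on the discussion above):

```csharp
using System;

var node = new NodeModel();

node.OnElectionTimeout();      // the election timer fires, the transition is queued
node.GrantVote();              // a vote is granted in the meantime -> refresh requested
node.MoveToCandidateState();   // with the check in place, the node resumes Follower state

sealed class NodeModel
{
    readonly object transitionSync = new();
    bool timedOut, refreshRequested;
    long term;

    public void OnElectionTimeout() => timedOut = true;

    public void GrantVote()
    {
        lock (transitionSync)
        {
            if (timedOut) refreshRequested = true;   // refresh requested after the timeout fired
        }
    }

    public void MoveToCandidateState()
    {
        lock (transitionSync)
        {
            if (refreshRequested)
            {
                // The fix: a vote was granted while the transition was pending,
                // so resume Follower state with a fresh election timeout.
                timedOut = refreshRequested = false;
                Console.WriteLine("Resuming follower state");
                return;
            }

            term++;                                  // unchanged path: become a candidate
            Console.WriteLine($"Becoming candidate in term {term}");
        }
    }
}
```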

@freddyrios

Works great, thanks!

I ran the example reproduction at least 15 times, and in most cases it properly elects the leader in term 1. The other case was when all nodes became candidates close to each other, rejected each other's votes, and then elected a leader in term 2 (as expected in Raft).


sakno commented Jun 7, 2023

Perfect, I'll publish a new release today.


sakno commented Jun 7, 2023

Fixed in 4.12.2
