This repository has been archived by the owner on Dec 17, 2018. It is now read-only.

It takes too long for a node to catch up if it's seriously behind #13

Open
allengeorge opened this issue Nov 21, 2013 · 0 comments

@allengeorge
Owner

Configuration: 3-machine cluster. Machines 1 and 2 were left running for a long period with a NOPCommand applied every second, while Machine 3 was left offline. Eventually there was a backlog of over 7000 entries for Machine 3 to apply. When I started Machine 3, I observed that it was not catching up quickly. I traced this to two factors:

  1. On receiving a negative AppendEntriesReply, the leader does not send a new AppendEntries immediately. Instead, it waits for the next heartbeat timeout. On KayVee the heartbeats are sent at multi-second intervals, so every failed attempt costs a full heartbeat interval and the backlog takes a very long time to clear.
  2. The leader rolls back its prefix one index position at a time. Perhaps the optimization described in the Raft paper would be useful, in which the follower reports information about its log entries so the leader can skip back further in a single step.
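To get a rough sense of how the two factors compound, here is a small Python simulation. It is only a sketch under stated assumptions, not the KayVee or libraft code: `Reply`, `follower_handle`, `rounds_to_find_match`, and `catch_up_wait_seconds` are hypothetical names, the follower is assumed to be merely missing entries (no conflicting terms), the leader is assumed to be at index 7000 with the follower at index 0, and the 2-second heartbeat interval is an illustrative stand-in for "multi-second". The hint used here (the follower's last log index) is a simpler variant than the one in the Raft paper, which has the follower report the conflicting term and the first index it stores for that term.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    success: bool
    last_log_index: int  # hypothetical hint field: follower's last log index


def follower_handle(follower_last_index: int, prev_index: int) -> Reply:
    """Accept if the follower has an entry at prev_index.

    Simplification: terms are assumed to match, so the follower only
    ever rejects because it is missing entries.
    """
    return Reply(success=prev_index <= follower_last_index,
                 last_log_index=follower_last_index)


def rounds_to_find_match(leader_last_index: int,
                         follower_last_index: int,
                         use_hint: bool) -> int:
    """Count AppendEntries round trips until the follower accepts a prefix."""
    next_index = leader_last_index + 1
    rounds = 0
    while True:
        rounds += 1
        reply = follower_handle(follower_last_index, next_index - 1)
        if reply.success:
            return rounds
        if use_hint:
            # Jump straight to just past the follower's last entry.
            next_index = min(next_index - 1, reply.last_log_index + 1)
        else:
            # Factor 2: roll the prefix back one position per rejection.
            next_index -= 1


def catch_up_wait_seconds(rounds: int,
                          heartbeat_interval_s: float,
                          retry_immediately: bool) -> float:
    """Time spent waiting between attempts (network latency ignored)."""
    if retry_immediately:
        return 0.0
    # Factor 1: every retry after the first waits for a heartbeat timeout.
    return (rounds - 1) * heartbeat_interval_s


slow_rounds = rounds_to_find_match(7000, 0, use_hint=False)
fast_rounds = rounds_to_find_match(7000, 0, use_hint=True)
slow_wait = catch_up_wait_seconds(slow_rounds, 2.0, retry_immediately=False)
fast_wait = catch_up_wait_seconds(fast_rounds, 2.0, retry_immediately=False)
```

Under these assumptions the one-step rollback needs 7001 round trips and, with heartbeat-paced retries, about 14000 seconds of waiting, while the hinted variant needs 2 round trips. Retrying immediately after a negative reply removes the waiting term even without the hint, so either change alone should help, and the two combine.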

It's also possible that this will be mitigated through the use of snapshots.
