It takes too long for a node to catch up if it's seriously behind #13

allengeorge · 2013-11-21T16:26:03Z

Configuration: 3-machine cluster. Machines 1 and 2 ran for a long time with a NOPCommand applied every second, while Machine 3 stayed offline. Eventually there was a backlog of over 7000 entries for Machine 3 to apply. When Machine 3 started up, it caught up very slowly. I traced this to two factors:

1. On receiving a negative AppendEntriesReply, the leader does not send a new AppendEntries immediately. Instead, it waits for the next heartbeat timeout. On KayVee, heartbeats are sent at multi-second intervals, so clearing the backlog can take a very long time.
2. The leader rolls back its prefix one index position at a time. The optimization described in the Raft paper, where the follower reports information about its log entries, might be useful here.
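To make the cost of the two factors concrete, here is a rough sketch that counts how many AppendEntries probes the leader needs before its prefix matches the follower's log. It is not the project's code: the class `CatchUpSketch`, the `Reply` type, and the `followerLastIndex` hint are all hypothetical names, and the hint is a simplified variant of the Raft paper's optimization (the follower reports only its last log index). It also assumes the follower's log is a clean prefix of the leader's, which matches a machine that was simply offline. The 2-second heartbeat in `main` is an assumed value for illustration only.

```java
/** Hypothetical sketch; none of these names come from the project. */
final class CatchUpSketch {

    /** Follower's answer to one AppendEntries probe. */
    static final class Reply {
        final boolean applied;        // true if the prefix check passed
        final long followerLastIndex; // assumed hint: last index the follower holds

        Reply(boolean applied, long followerLastIndex) {
            this.applied = applied;
            this.followerLastIndex = followerLastIndex;
        }
    }

    /** Follower holds entries 1..followerLastIndex; index 0 stands for the empty log. */
    static Reply probe(long followerLastIndex, long prevLogIndex) {
        return new Reply(prevLogIndex <= followerLastIndex, followerLastIndex);
    }

    /** Probes needed when the leader rolls back one index per rejection. */
    static long probesWithDecrement(long leaderLastIndex, long followerLastIndex) {
        long nextIndex = leaderLastIndex + 1;
        long probes = 0;
        while (true) {
            probes++;
            if (probe(followerLastIndex, nextIndex - 1).applied) {
                return probes;
            }
            nextIndex--; // one position at a time
        }
    }

    /** Probes needed when the leader jumps using the follower's reported last index. */
    static long probesWithHint(long leaderLastIndex, long followerLastIndex) {
        long nextIndex = leaderLastIndex + 1;
        long probes = 0;
        while (true) {
            probes++;
            Reply reply = probe(followerLastIndex, nextIndex - 1);
            if (reply.applied) {
                return probes;
            }
            // Jump straight past the follower's last entry instead of stepping back by one.
            nextIndex = Math.min(nextIndex - 1, reply.followerLastIndex + 1);
        }
    }

    /** Time spent waiting if every retry after a rejection is delayed until the next heartbeat. */
    static long heartbeatWaitMillis(long probes, long heartbeatIntervalMillis) {
        return (probes - 1) * heartbeatIntervalMillis;
    }

    public static void main(String[] args) {
        long slow = probesWithDecrement(7000, 0);
        long fast = probesWithHint(7000, 0);
        long assumedHeartbeatMillis = 2000; // assumption, not a measured KayVee value
        System.out.println("decrement: " + slow + " probes, waiting "
                + heartbeatWaitMillis(slow, assumedHeartbeatMillis) + " ms");
        System.out.println("hint: " + fast + " probes, waiting "
                + heartbeatWaitMillis(fast, assumedHeartbeatMillis) + " ms");
    }
}
```

Under these assumptions, a follower 7000 entries behind needs 7001 probes with one-at-a-time rollback and 2 with the hint. Resending immediately after a rejection removes the heartbeat wait in either case, so the two fixes are complementary.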

It's also possible that this will be mitigated through the use of snapshots.

ghost assigned allengeorge Nov 21, 2013

allengeorge mentioned this issue Nov 21, 2013

Send new AppendEntries with updated prefix immediately on receiving an unapplied AppendEntriesReply #15

Open

allengeorge removed this from the 0.2.0 Release milestone Feb 25, 2014

allengeorge added this to the 0.2.1 Release milestone Mar 25, 2014
