Raft Distributed Consensus Protocol #1

benbjohnson · 2013-08-02T00:27:43Z

Overview

The Raft distributed consensus protocol allows a collection of processes to maintain consistency even in the face of multiple node failures. The two main tenets of the protocol are leader election and log replication.

This visualization will lay out the problem of distributed consensus, followed by a general overview of leader election and log replication. It will then go into the details of leader election using best-case (Single Candidate) and worst-case (Split Vote) scenarios, and then the details of log replication using best-case (Network OK) and worst-case (Network Partitions) scenarios. Finally, it will conclude with additional resources for learning more.

URL

http://thesecretlivesofdata.com/raft/

Frames

- [x] What is Distributed Consensus?
- [x] Overview
  - [x] States (Follower, Candidate, Leader)
- [x] Leader Election
  - [x] Election Timeout
  - [x] Candidacy
  - [x] Leadership & heartbeat timeout.
  - [x] Re-election
  - [x] Split Vote
- [ ] Log Replication
  - [x] Complex state machine example.
  - [x] Commitment rules
  - [x] Network Partitions
  - [x] Client reads.
- [x] Conclusion & Additional Resources

benbjohnson · 2013-12-06T18:05:12Z

Added the "Overview" frame. Still need to add navigation so users can skip ahead.

benbjohnson · 2013-12-06T18:08:36Z

@vanstee Here's the GitHub Issue for Raft. I'll post to this ticket as I update the visualization so feel free to add yourself as a watcher. Otherwise I'll ping you when I get it done.

I added an "Overview" section that starts right after the "What is Distributed Consensus" section I sent you earlier.

philips · 2013-12-06T19:12:40Z

Looks really great! But, I want arrow keys!

benbjohnson · 2013-12-06T19:18:01Z

@philips Good call. I added #3 to track it.

ongardie · 2013-12-07T01:03:46Z

Looks awesome. A few bits that'd be great to add:

- Replying to the client
- How clients find the leader
- Larger, more aggressive state machines
- Terms
- Checking whether the candidate's log is up-to-date when granting vote
- Full commitment rule
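
For the "up-to-date" check in the list above, here is a minimal Python sketch of the comparison as the Raft paper describes it: the log whose last entry has the later term is more up-to-date, and if the last terms are equal the longer log wins. The `LogPosition` type and the function name are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogPosition:
    """Term and index of the last entry in a server's log (hypothetical helper type)."""
    last_term: int
    last_index: int


def candidate_log_is_up_to_date(candidate: LogPosition, voter: LogPosition) -> bool:
    """Return True if the candidate's log is at least as up-to-date as the voter's."""
    if candidate.last_term != voter.last_term:
        # The log ending in the later term is more up-to-date.
        return candidate.last_term > voter.last_term
    # Same last term: the longer (or equally long) log qualifies.
    return candidate.last_index >= voter.last_index
```

A voter would only grant its vote when this returns True, in addition to the usual term and "have I already voted" checks, which are left out of the sketch.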

benbjohnson · 2013-12-07T20:48:22Z

@ongardie Thanks, Diego. I added those items to the description. I'll go through the paper again and flesh out some more details for each section as well.

benbjohnson · 2013-12-10T15:07:49Z

I added a few navigation features:

- Left / right arrow keys when the buttons are visible.
- Top-level menu drop down.
- Deep linking.
- Replay (aka "back" button).

There are some timing bugs that came from adding snapshots for replay. I'll get those worked out though.

/cc: @philips @andybons

andybons · 2013-12-10T15:23:44Z

👍

benbjohnson · 2013-12-11T17:05:50Z

@ongardie I got the start of an actual simulation going in the Leader Election frame:

http://thesecretlivesofdata.com/raft/#election

ongardie · 2013-12-12T02:29:31Z

Nice, that's slick :)

benbjohnson · 2013-12-27T19:46:50Z

Lots of refactoring to the underlying code and to the playback.js library. Here's the "Leader Election" section:

http://thesecretlivesofdata.com/raft/#election

It's more simulation-like now but I think I still need to make the Split Vote section clearer.

Comments welcome. :)

stig · 2013-12-27T20:19:52Z

I found a bug (I think) in the slides. When the follower animation finishes it automatically continues with the candidate animation, but without moving the text to say "candidate". This confused me a bit when I skipped to the next slide.

ongardie · 2013-12-31T00:45:07Z

Really cool, @benbjohnson. The leader election section is definitely shaping up now.

One last thing to mention would be that a server updates its term when it hears a message with a newer term. The animations show it, but it's worth pointing out.

Another thing I'm wondering about is the use of a 4-node cluster to show a split vote. The problem is that we don't expect people to ever run 4-node clusters (since their availability is no better than 3-node clusters), and the definition of majority is less obvious in 4-node clusters (we mean 3 servers but noobs may assume only 2).

In general, you need either 3 servers to time out simultaneously for a split vote, or 2 servers and 1 crash or network anomaly. I agree that showing an example with a 3-node cluster is a bit confusing because there just aren't enough servers around to show the pattern, but I don't think switching to a 4-node cluster is best.

How about switching to either a 5-node cluster with 3 candidates, or a 5-node cluster with 2 candidates and 1 crash (this would be identical to the current scenario except with an extra grayed out node)? It may also be illustrative to show how you can't get a split vote with only 2 candidates if everything is going well (one will necessarily get a majority).
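
The majority arithmetic behind these cluster sizes can be sketched in a few lines of Python. This is only an illustration under the assumption that every vote count includes the candidate's own vote; the function names are made up here and do not come from the visualization's code.

```python
from typing import Dict, Optional


def majority(cluster_size: int) -> int:
    """Smallest number of servers that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many servers can fail while a majority is still reachable."""
    return cluster_size - majority(cluster_size)


def election_winner(votes: Dict[str, int], cluster_size: int) -> Optional[str]:
    """Return the candidate holding a majority of the cluster, or None on a split vote."""
    needed = majority(cluster_size)
    for candidate, count in votes.items():
        if count >= needed:
            return candidate
    return None
```

With these definitions, `majority(4)` is 3 and `tolerated_failures(4)` is 1, the same as for a 3-node cluster. In a 5-node cluster, `election_winner({"A": 2, "B": 2}, 5)` returns `None` (two candidates and one crashed server), while `election_winner({"A": 3, "B": 2}, 5)` returns `"A"` because with all five votes cast one of two candidates must reach 3.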

benbjohnson · 2014-01-07T17:04:01Z

@stig I think I fixed what you're talking about. Can you check it out and tell me if it's correct now?

@ongardie I pushed out an initial cut without the 5-node split vote configuration. I mostly wanted to get a first version out. I'll come back and fix it after the next round of feedback from people.

The first draft of the visualization is released: https://twitter.com/secretlivesdata/status/420600602498838529

benbjohnson · 2014-01-07T17:04:30Z

NOTE: I closed the issue for this release but feel free to continue to add additional comments.

seguer · 2014-01-07T22:26:17Z

Does the Raft protocol outline how clients are meant to find and then interact with leaders? For example, if a client is sending messages to a node that was leader, but becomes a follower after an election, the client needs to know where to send future messages.

ebroder · 2014-01-08T13:44:19Z

This is really awesome! I've sent this around to the rest of my team.

I don't know how you're scoping this particular visualization, but I would have enjoyed more of a discussion on why distributed consensus algorithms are difficult. In the awesome future where everyone just accepts that Raft is the way and the light, that might not be needed, but in the meantime I think there's potentially value in explaining why you should use Raft instead of developing your own (undoubtedly wrong) consensus algorithm.

benbjohnson · 2014-01-08T22:29:32Z

@seguer Each node has the current leader that it knows about. If a client sends a request to a follower then it will be denied and it will be notified what the current leader is. I was debating whether to include this point in the visualization because I don't want to overload people with information but I've heard that question from a few people so I'll add it in.
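
A rough Python sketch of that redirect, to make the flow concrete. Everything here (`Node`, `Response`, `leader_hint`, `send`) is a hypothetical name for illustration, and the sketch is deliberately simplified: a real leader would replicate the entry to a majority before acknowledging the client.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Response:
    ok: bool
    leader_hint: Optional[str] = None  # set when a non-leader rejects the request


class Node:
    def __init__(self, name: str, known_leader: Optional[str]):
        self.name = name
        self.known_leader = known_leader  # the current leader this node knows about
        self.log = []

    def handle_request(self, command: str) -> Response:
        if self.known_leader != self.name:
            # Not the leader: deny the request and report the known leader.
            return Response(ok=False, leader_hint=self.known_leader)
        self.log.append(command)  # simplified: replication is omitted
        return Response(ok=True)


def send(cluster: Dict[str, Node], first_try: str, command: str, max_hops: int = 3) -> str:
    """Send a command, following leader hints; return the node that accepted it."""
    target = first_try
    for _ in range(max_hops):
        reply = cluster[target].handle_request(command)
        if reply.ok:
            return target
        if reply.leader_hint is None:
            break  # this node does not know of a leader yet
        target = reply.leader_hint
    raise RuntimeError("no leader found")
```

For example, with `cluster = {n: Node(n, "B") for n in "ABC"}`, calling `send(cluster, "A", "SET 5")` is first denied by A, then retried against B, which accepts it.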

@ebroder Thanks! The scope for the visualization was fuzzy. I was mostly trying to explore what a semi-interactive visualization of a distributed system would look like so I went with what I knew best (Raft). One of the hardest parts has been trying to figure out what to put in and what to keep out. Too much information and it's confusing but too little and people don't get a full understanding.

I think the understanding behind why distributed consensus is difficult is a more general problem. I'd like to visualize some other consensus algorithms (Paxos, ZAB) so I may leave these bigger questions as a separate visualization entirely.

TheWinch · 2014-06-23T15:25:43Z

One of the cleanest presentations on distributed consensus I've seen so far! I have 2 questions, however:

- In the split brain scenario, how can node B become a leader of the minority, since it never receives a majority of voters for itself? In my understanding it should just be stuck in election phase, shouldn't it?
- Likewise, when the healing occurs, how does B know about the "higher election term"? In your example the terms are the same in both partitions ("2") so maybe a bit of explanation is needed here.

I'd like to contribute to the Paxos visualization since I know the algorithm quite well, and I'm also working on a ZAB explanation. How can I help?

benbjohnson · 2014-07-01T04:32:02Z

@TheWinch To answer your questions:

> In the split brain scenario, how can node B become a leader of the minority, since it never receives a majority of voters for itself? In my understanding it should just be stuck in election phase, shouldn't it?

Node B is the leader before the split happens so it stays the leader. However, it can't communicate to a quorum so it's ineffective and can't commit any log entries. You can also implement Raft so that the leader times out if it doesn't hear from a quorum for an election timeout.
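
Both behaviours can be sketched as a toy Python class: an isolated leader that cannot commit, plus the optional step-down when no quorum has been heard from within an election timeout. The class, its method names, and the 5-node cluster in the usage note are assumptions made for this example, not code from a real Raft implementation.

```python
class Leader:
    """Toy leader that tracks when each follower last acknowledged a heartbeat."""

    def __init__(self, cluster_size: int, election_timeout: float):
        self.cluster_size = cluster_size
        self.election_timeout = election_timeout
        self.last_ack = {}  # follower name -> time of last acknowledgement
        self.is_leader = True

    def quorum(self) -> int:
        return self.cluster_size // 2 + 1

    def record_ack(self, follower: str, now: float) -> None:
        self.last_ack[follower] = now

    def can_commit(self, acks_for_entry: int) -> bool:
        # The leader's own copy of the entry counts toward the majority.
        return acks_for_entry + 1 >= self.quorum()

    def check_quorum(self, now: float) -> None:
        """Optional behaviour: step down if no quorum was heard within one election timeout."""
        recent = sum(1 for t in self.last_ack.values()
                     if now - t <= self.election_timeout)
        if recent + 1 < self.quorum():
            self.is_leader = False
```

In a 5-node cluster where the leader can still reach only one follower, `can_commit(1)` is False because 2 copies are fewer than the quorum of 3, and `check_quorum` flips `is_leader` to False once the other followers' acknowledgements are older than the election timeout.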

> Likewise, when the healing occurs, how does B know about the "higher election term"? In your example the terms are the same in both partitions ("2") so maybe a bit of explanation is needed here.

You're right. That could have been clearer. When it receives an AppendEntries request from the new leader or when it tries to send an AppendEntries request and receives a response from any node in the new election term it will know about the new term and step down.
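
As a minimal sketch of that rule in Python (the `Server` class and `observe_term` method are hypothetical names used only here): any request or response carrying a higher term makes the receiver adopt that term and become a follower, while messages from an older term are treated as stale.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    current_term: int
    state: str  # "follower", "candidate", or "leader"

    def observe_term(self, message_term: int) -> bool:
        """Apply the term rule to any incoming request or response.

        Returns True if the message should be processed, False if it is stale.
        """
        if message_term > self.current_term:
            # A newer term exists: adopt it and step down.
            self.current_term = message_term
            self.state = "follower"
            return True
        return message_term == self.current_term
```

For instance, an old leader `Server("B", 1, "leader")` that sees term 2 in an AppendEntries request or response ends up as a follower in term 2. The rule only triggers if the new leader's term really is higher than the old leader's.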

I had a lot of fun building the Raft implementation but it was really time-consuming. I'm swamped at the moment so I don't have the bandwidth to do any additional implementations right now. I'll let you know if I start another one in the future.

erikbgithub · 2014-09-05T13:38:40Z

Sorry if another person asked this before. I'm reading this while working, so I don't have the time to get into it as deeply as I'd like.

The question is this: What happens if I have an equal split between 8 nodes? Inside their subnet they could get a majority (Is that enough?). But neither can achieve a global majority (Do they even know how many nodes they are globally? I haven't seen a request that collects all the nodes). What happens now when two clients send different requests to the two leaders?

benbjohnson · 2014-09-05T14:13:48Z

@erikb85

> What happens if I have an equal split between 8 nodes? Inside their subnet they could get a majority (Is that enough?). But neither can achieve a global majority (Do they even know how many nodes they are globally?

Every node knows what nodes exist in the cluster. The 4-4 split would mean that neither group could achieve cluster majority so no leader could be elected and the cluster would be unavailable.
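
A small Python sketch of that check, assuming the configured cluster membership is exactly the sum of the partition sizes (the function name is made up for this example):

```python
from typing import List


def partitions_with_quorum(partition_sizes: List[int]) -> List[bool]:
    """For each partition, report whether it holds a majority of the whole cluster."""
    cluster_size = sum(partition_sizes)  # every node knows the full membership
    needed = cluster_size // 2 + 1
    return [size >= needed for size in partition_sizes]
```

`partitions_with_quorum([4, 4])` gives `[False, False]`, since a majority of 8 is 5, so neither side can elect a leader. With an odd cluster, `partitions_with_quorum([2, 3])` gives `[False, True]`, so one side stays available.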

> What happens now when two clients send different requests to the two leaders?

Write requests to the old leader would not be able to replicate to a majority so they would hang until the leader steps down and then an error would be returned. The new leader would be able to complete the write request since it can connect to a majority.

If you want consistent reads, you'll need to send your read through the leader. That's a fairly nuanced topic. There are details in the Raft paper and @aphyr did a great write-up of all this on his blog:

http://aphyr.com/posts/316-call-me-maybe-etcd-and-consul

GauravBuche · 2015-03-05T00:06:20Z

Excellent!

It makes things very clear. The way you have covered essential scenarios is amazing.

Thanks!

lusitania · 2015-10-20T06:38:21Z

Some feedback:

- "Leader election" is missing the back button.
- A "skip animation" option would be nice if you wanted to navigate to a particular point in the presentation.
- Also, animations don't continue in the background.
- The nav links aren't working when selecting them backwards, i.e. replication -> election.
- I tried to review the last slide (4 nodes) in election but still don't get it entirely (can't go back to check again, too annoyed to skip through all slides again).
- The continue button is visible but not functional while the animation proceeds (partition scenario).
- The partition isn't visible in a replay.
- Client notification differs between the single and multi leader/client scenarios: in the first scenario client confirmation is sent prior to commit, in the partition scenario it is sent post commit. I assume the latter is correct (although I've seen both schemes used in practice).

xuanyuanking · 2015-11-04T06:00:26Z

What a wonderful job! It makes the algorithm clear and easy to understand. Do we have other algorithms to show? I want to join in. :)

davidxia · 2016-01-07T19:28:45Z

It'd be nice to be able to step back. Great visualization overall!

zimuabc · 2016-04-26T07:06:59Z

Wonderful job! It would be even better to also show the Paxos algorithm this way 👍

thomasjoel · 2017-10-22T23:40:31Z

Great animation! As others have mentioned, it would be nice to have navigation. Especially in the Leader Election section, some of the slides went by too fast and I would have liked to go back and review them.

jeven2016 · 2018-01-10T08:49:48Z

Very impressive animation, really great. Could I download it from somewhere?

ulbrich · 2018-06-03T06:06:08Z

Thanks for your great work, I love it! Maybe add deep linking by changing the hash in the URL between lessons/chapters. This would make it easier to continue reading after the browser has to reload the page, and it would make citation in blog posts easier.

Savemech · 2018-06-13T16:15:29Z

That is fantastic work!
I'd love to have an option to step back too (reset the current step, and go one presentation step back).

kartikjena3 · 2018-10-30T07:33:10Z

Great

rickchen1979 · 2018-10-31T08:44:29Z

Need a back button, thanks!!

shudo1219 · 2019-03-20T08:01:16Z

Such great work! It would be perfect if the leader election and log replication chapters had a back button.
I have some questions about the following sentences:
A client sends a change to the leader, the change is appended to the leader's log.
Then the change is sent to the followers on the next heartbeat.

Why on the next heartbeat? If a client changes the system, do we have to wait for the next heartbeat to finish the change? How does the system keep performance high?

jenna-h · 2019-04-15T22:24:45Z

The back button isn't available throughout the entire tutorial (it disappears when we get to the section on Leader Election). Additionally, I would appreciate a bit more explanation as to how nodes keep track of the current election term.

SJGe · 2019-05-01T17:29:18Z

Great work! Thanks!
Need a back button on the last pages.

paulvidal · 2019-07-02T20:30:10Z

Just wanted to say you are a hero! Kudos to you for this AMAZING visualisation :) Really waiting for the next ones

pckeyan · 2020-03-02T02:43:22Z

Great visualization! Kudos to the team for spreading knowledge. BTW, which tool is used for the presentation? I am also looking for a similar tool. Thanks in advance.

maximveksler · 2020-04-09T20:11:26Z

Thank you.

monotypical · 2020-06-13T15:07:02Z

Thank you for the very helpful explanation. I thought I'd let you know that the link to the Raft paper in the conclusion is broken, but the paper is available at https://raft.github.io/raft.pdf

rgidwani-splunk · 2020-06-15T17:40:49Z

this is awesome! thank you for your work

benbjohnson · 2020-06-15T18:33:18Z

@monotypical Thanks for letting me know. The link's been updated. 👍

JeidiPadron · 2020-11-25T09:16:24Z

Congratulations, excellent initiative. I suggest putting a menu on the left side, because I wanted to revisit some views and it was impossible without restarting the process (maybe the back button can work, but I cannot use it). It is the best resource I have seen for understanding Raft. Thank you very much.

ThisIsNSH · 2021-06-11T16:14:52Z

@benbjohnson great work 👏

AliAzlanAziz · 2021-07-04T13:14:32Z

@benbjohnson great work bro, you made it easy for us. But maybe you forgot to add the "go back to previous page" feature on the pages of the "Leader Election" chapter and after.

anuragrana · 2021-09-27T11:12:57Z

Great work. Extremely easy to understand.
Please add a back button as well.

iamatulkumar · 2022-01-17T11:38:20Z

Great work

Cachetian · 2022-03-19T05:02:10Z

Awesome👍 easy to understand, open to share, thank you!

ohhh-yang · 2022-06-17T11:55:31Z

After reading the Raft lecture, I found it so confusing. But this animation really helped me comprehend the whole flow of Raft! Brilliant!
Thanks to the developer!

HugoRoss · 2022-12-17T13:12:20Z

World class how you break down a rather complex algorithm into well explained (and well understandable) smaller steps. Really enjoy this documentation/tutorial, great teaching. Like @ongardie, I wonder how the client knows who the leader is. Does it just communicate with any node, and the node then replies with who the current leader is, and then the client resends its request to the leader?

HugoRoss · 2022-12-17T13:27:43Z

In a scenario where A is the leader and B and C are followers, and a client now sends a message to A, say "x = 8", and the following happens:

- A adds uncommitted "x = 8" and notifies B and C.
- B and C add uncommitted "x = 8" and notify A that they got the instruction/message/transaction.
- A commits "x = 8" and notifies B and C, but right then a network split occurs and A cannot reach B and C anymore.

Does A then roll back "x = 8" to uncommitted?

ohhh-yang · 2022-12-17T13:34:44Z

I've been preparing for an exam these days, so sorry that I don't have enough time to think about such questions. I'll reflect on it after my exam. Thanks!

alibttb · 2023-03-23T12:29:58Z

Great demonstration, adding an option to go to a specific step would be great.

satya-bodapati · 2023-08-01T09:25:45Z

Back button to previous step please!

m0d0nne11 · 2023-09-04T16:03:35Z

Nice work. And I like that suggestion re: having a Back button, and please arrange for that Back button as well as the Continue button to have one static, predictable location (like top-right corner) rather than having them appear at different locations on the rendered page that depend on other page characteristics/variables. Thx!

amit-rastogi · 2024-01-05T14:20:39Z

Just went through https://thesecretlivesofdata.com/raft/. Thanks for creating this, it gives a very clear understanding of the Raft protocol, covering various scenarios for leader election and WAL replication 💯

shadow1999k · 2024-04-09T11:14:23Z

The Raft explanation was excellent. Thanks! Looking forward to new concepts being implemented with The Secret Lives of Data :))

ghost assigned benbjohnson Aug 2, 2013

benbjohnson closed this as completed Jan 7, 2014

jaseemabid mentioned this issue Dec 15, 2015

RAFT: Follower's term is not incremented for request_vote RPC #20

Closed
