
Cluster Membership changes Single Server Change Algorithm

daviszhen edited this page Feb 16, 2020 · 1 revision

In Chapter 4 of his dissertation "CONSENSUS: BRIDGING THEORY AND PRACTICE", the author of Raft proposed an algorithm that changes cluster membership one server at a time. Unfortunately, the algorithm originally contained a serious bug, which the author later fixed; see the Raft-dev Google Group thread below for the discussion of the bug. This page describes the algorithm after the fix.

Table of Contents

  • AddServer RPC
  • RemoveServer RPC
  • Discussion of the Author's Bug Fix

//The logic below follows the Raft dissertation, section 4.1.

If RPC is AddServer RPC:

If state.role != leader:
	Reply NOT_LEADER;

If catching up newServer failed:
	Reply TIMEOUT;

Wait until the previous configuration entry in the log is committed;

//Bug fix (proposed solution): "a leader appends a no-op entry to the log when it is elected"
If the leader has not yet committed an entry in currentTerm:
	Append a no-op entry to its log;
	Wait for this entry to be committed (in Cold);

//At this point the leader has committed an entry in currentTerm
Append the Cnew entry to its log;
Wait for this entry to be committed (in Cnew);

//Cnew is committed.
Reply OK;

If RPC is RemoveServer RPC:

If state.role != leader:
	Reply NOT_LEADER;

Wait until the previous configuration entry in the log is committed;

//Bug fix (proposed solution): "a leader appends a no-op entry to the log when it is elected"
If the leader has not yet committed an entry in currentTerm:
	Append a no-op entry to its log;
	Wait for this entry to be committed (in Cold);

//At this point the leader has committed an entry in currentTerm
Append the Cnew entry to its log;
Wait for this entry to be committed (in Cnew);

//Cnew is committed.
Reply OK;
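
The AddServer and RemoveServer pseudocode above differ only in the catch-up step, so both can be sketched as one function. The following is a toy in-memory model, not a real Raft implementation: `Leader`, `change_membership`, and the entry shape are illustrative assumptions, and `wait_committed` merely advances a commit index where a real leader would replicate to a quorum.

```python
NOT_LEADER, TIMEOUT, OK = "NOT_LEADER", "TIMEOUT", "OK"

class Leader:
    def __init__(self, term, config):
        self.role = "leader"
        self.current_term = term
        # Index 0 holds the current configuration Cold (term 0 for simplicity).
        self.log = [{"term": 0, "kind": "config", "config": config}]
        self.commit_index = 0  # index of the last committed entry

    def committed_in_current_term(self):
        # True iff some committed entry carries the current term.
        return any(e["term"] == self.current_term
                   for e in self.log[:self.commit_index + 1])

    def append(self, kind, config=None):
        self.log.append({"term": self.current_term, "kind": kind, "config": config})
        return len(self.log) - 1

    def wait_committed(self, index):
        # Stand-in for replicating the entry to a quorum of the latest config.
        self.commit_index = max(self.commit_index, index)

def change_membership(leader, new_config, caught_up=True):
    if leader.role != "leader":
        return NOT_LEADER
    if not caught_up:  # AddServer only: the joining server failed to catch up
        return TIMEOUT
    # Wait until the previous configuration entry in the log is committed.
    prev = max(i for i, e in enumerate(leader.log) if e["kind"] == "config")
    leader.wait_committed(prev)
    # The fix: commit an entry of the current term before appending Cnew.
    if not leader.committed_in_current_term():
        leader.wait_committed(leader.append("no-op"))           # committed in Cold
    leader.wait_committed(leader.append("config", new_config))  # committed in Cnew
    return OK

ldr = Leader(term=2, config={"S1", "S2", "S3", "S4"})
assert change_membership(ldr, {"S1", "S2", "S3", "S4", "S5"}) == OK
assert ldr.log[1]["kind"] == "no-op"  # the fix fired before Cnew was appended
```

A second change on the same leader skips the no-op step, since an entry of the current term is already committed.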

Discussion of the Author's Bug Fix

https://groups.google.com/forum/#!topic/RAFT-dev/t4xj6dJTP6E

Hi raft-dev, Unfortunately, I need to announce a bug in the dissertation version of membership changes (the single-server changes, not joint consensus). The bug is potentially severe, but the fix I'm proposing is easy to implement. When Huanchen Zhang and Brandon Amos were working on a class project at CMU to formalize single-server membership changes, Huanchen found the bug and the counter-example below by hand. They contacted me over email on May 14th, and I chose to keep this quiet for a while until we had agreed upon a solution to propose to the list. After several incorrect and/or ugly attempts, I came up with the solution proposed below. I apologize for keeping this information from you for so long.

Recapping the single-server membership change algorithm in my dissertation: A leader may only create a configuration entry in its log if the prior one is committed and the new one differs from the prior one by at most one server (i.e., adding or removing a single server at a time). Each server always uses the latest configuration in its log, regardless of whether that entry is committed, for counting votes and determining commitment. The full details are in Chapter 4 of my dissertation: https://github.com/ongardie/dissertation The Raft paper "In search of an understandable consensus algorithm" and its extended version present an earlier, different membership change algorithm called joint consensus. The joint consensus algorithm is not affected by this bug (see paragraph under Scope below).
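
The second rule of the recap, using the latest configuration in the log whether or not it is committed, amounts to a scan from the tail of the log. The sketch below is illustrative only; the entry dictionaries are an assumed shape, not a real Raft API.

```python
def effective_config(log):
    # Scan backwards for the most recent configuration entry,
    # committed or not; that is the one used for votes and commitment.
    for entry in reversed(log):
        if entry["kind"] == "config":
            return entry["config"]
    return None

log = [{"kind": "config", "config": {"S1", "S2", "S3", "S4"}},
       {"kind": "no-op"},
       {"kind": "config", "config": {"S1", "S2", "S3", "S4", "S5"}}]
assert effective_config(log) == {"S1", "S2", "S3", "S4", "S5"}
```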

The bug: It's essential to Raft that if one decision (vote or commitment) is made in one quorum (majority), a subsequent quorum will contain at least one server that's aware of the decision, and this is needed even across membership changes. My dissertation shows how if two configurations differ by at most one server, a majority from the first and a majority from the second will have at least one server in common. Within a single term, a single leader can easily ensure that one configuration and the next differ by at most one server. This bug shows up across term boundaries. It's possible for two concurrent, competing changes across term boundaries to have quorums that don't overlap with each other, causing a safety violation (split brain).
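
The overlap property can be checked by brute force. The sketch below (plain Python, illustrative only) confirms that every majority of a configuration C intersects every majority of a configuration differing from C by one server, while two competing changes D and E, which differ from each other by two servers, can produce disjoint majorities:

```python
from itertools import combinations

def majorities(config):
    # All minimal majorities (quorums) of a configuration.
    need = len(config) // 2 + 1
    return [set(m) for m in combinations(sorted(config), need)]

def always_overlap(a, b):
    # True iff every majority of a shares a server with every majority of b.
    return all(qa & qb for qa in majorities(a) for qb in majorities(b))

C = {"S1", "S2", "S3", "S4"}
D = C | {"S5"}          # one add:   {S1,S2,S3,S4,S5}
E = C - {"S1"}          # one remove: {S2,S3,S4}

assert always_overlap(C, D)       # one change at a time: quorums overlap
assert always_overlap(C, E)
assert not always_overlap(D, E)   # two competing changes: quorums can miss
```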

Below I've reproduced Huanchen's counter-examples, starting from a 4-server cluster. The three are similar, and I include all three for completeness, but you might get the idea after one or two.

Counter-example 1 with one add and one remove: This involves an attempt to add a server to a 4-server cluster and a concurrent attempt to remove a server from the same 4-server cluster. 0. Initial state: S1 (L1): [C] S2 (F1): [C] S3 (F1): [C] S4 (F1): [C] S5 (F1): [] where C is the configuration {S1,S2,S3,S4}.

  1. S1 catches up S5 and appends configuration D to its log (D adds S5): S1 (L1): [C, D] S2 (F1): [C] S3 (F1): [C] S4 (F1): [C] S5 (F1): [C] where D is the configuration {S1,S2,S3,S4,S5}.
  2. S1 replicates D to S5 and goes offline for a while: S1 (L1): [C, D] X S2 (F1): [C] S3 (F1): [C] S4 (F1): [C] S5 (F1): [C, D]
  3. S2 starts an election and becomes leader in term 2 with votes from {S2, S3, S4} (S2's configuration is still C = {S1,S2,S3,S4}; the majority is {S2,S3,S4}, whose logs are as up to date as S2's): S1 (L1): [C, D] X S2 (L2): [C] S3 (F2): [C] S4 (F2): [C] S5 (F1): [C, D]
  4. S2 appends configuration E (E removes S1): S1 (L1): [C, D] X S2 (L2): [C, E] S3 (F2): [C] S4 (F2): [C] S5 (F1): [C, D] where E is the configuration {S2, S3, S4}.
  5. S2 replicates configuration E to S3 and marks it committed (S2's configuration is E = {S2,S3,S4}; the majority is {S2,S3}): S1 (L1): [C, D] X S2 (L2): [C, E] S3 (F2): [C, E] S4 (F2): [C] S5 (F1): [C, D]
  6. S1 starts an election and becomes leader in term 3 with votes from {S1, S4, S5} (S1's configuration is D = {S1,S2,S3,S4,S5}; the majority is {S1,S4,S5}; S2 and S3 have more up-to-date logs and will not vote for S1): S1 (L3): [C, D] S2 (L2): [C, E] S3 (F2): [C, E] S4 (F3): [C] S5 (F3): [C, D] Note that S1 does not have the committed entry E in its log.
  7. S1 replicates D to everyone else, overwriting a committed entry (E): S1 (L3): [C, D] S2 (L2): [C, D] S3 (F2): [C, D] S4 (F3): [C, D] S5 (F3): [C, D]

Counter-example 2 with two adds: This involves an attempt to add a server to a 4-server cluster and a concurrent attempt to add a different server to the same 4-server cluster. 0. Initial state: S1 (L1): [C] S2 (F1): [C] S3 (F1): [C] S4 (F1): [C] S5 (F1): [] S6 (F1): [] where C is the configuration {S1, S2, S3, S4}

  1. S1 catches up S5 and appends configuration D to its log: S1 (L1): [C, D] S2 (F1): [C] S3 (F1): [C] S4 (F1): [C] S5 (F1): [C] S6 (F1): [] where D is the configuration {S1, S2, S3, S4, S5}
  2. S1 replicates D to S5 and goes offline for a while: S1 (L1): [C, D] X S2 (F1): [C] S3 (F1): [C] S4 (F1): [C] S5 (F1): [C, D] S6 (F1): []
  3. S2 starts an election and becomes leader in term 2 with votes from {S2, S3, S4}: S1 (L1): [C, D] X S2 (L2): [C] S3 (F2): [C] S4 (F2): [C] S5 (F1): [C, D] S6 (F1): []
  4. S2 catches up S6 and appends configuration E to its log: S1 (L1): [C, D] X S2 (L2): [C, E] S3 (F2): [C] S4 (F2): [C] S5 (F1): [C, D] S6 (F1): [C] where E is the configuration {S1, S2, S3, S4, S6}.
  5. S2 replicates E to S3 and S6, and marks it committed: S1 (L1): [C, D] X S2 (L2): [C, E] S3 (F2): [C, E] S4 (F2): [C] S5 (F1): [C, D] S6 (F1): [C, E]
  6. S1 starts an election and becomes leader in term 3 with votes from {S1, S4, S5}: S1 (L3): [C, D] S2 (L2): [C, E] S3 (F2): [C, E] S4 (F3): [C] S5 (F3): [C, D] S6 (F1): [C, E] Note that S1 does not have the committed entry E in its log.
  7. S1 replicates D to everyone else, overwriting a committed entry (E): S1 (L3): [C, D] S2 (L2): [C, D] S3 (F2): [C, D] S4 (F3): [C, D] S5 (F3): [C, D] S6 (F1): [C, D]

Counter-example 3 with two removes: This involves an attempt to remove a server from a 4-server cluster and a concurrent attempt to remove a different server from the same 4-server cluster. 0. Initial state: S1 (L1): [C] S2 (F1): [C] S3 (F1): [C] S4 (F1): [C] where C is the configuration {S1, S2, S3, S4}

  1. S1 appends configuration D to its log and goes offline for a while: S1 (L1): [C, D] X S2 (F1): [C] S3 (F1): [C] S4 (F1): [C] where D is the configuration {S1, S2, S3}
  2. S2 starts an election and becomes leader in term 2 with votes from {S2, S3, S4}: S1 (L1): [C, D] X S2 (L2): [C] S3 (F2): [C] S4 (F2): [C]
  3. S2 appends configuration E to its log: S1 (L1): [C, D] X S2 (L2): [C, E] S3 (F2): [C] S4 (F2): [C] where E is the configuration {S1, S2, S4}
  4. S2 replicates E to S4, and marks it committed: S1 (L1): [C, D] X S2 (L2): [C, E] S3 (F2): [C] S4 (F2): [C, E]
  5. S1 starts an election and becomes leader in term 3 with votes from {S1, S3}: S1 (L3): [C, D] S2 (L2): [C, E] S3 (F3): [C] S4 (F2): [C, E] Note that S1 does not have the committed entry E in its log.
  6. S1 replicates D to everyone else, overwriting the committed entry E: S1 (L3): [C, D] S2 (L2): [C, D] S3 (F3): [C, D] S4 (F2): [C, D]
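
All three counter-examples end the same way: the quorum that committed E and the quorum that later elects S1 under D never intersect. A quick sanity check of the quorums transcribed from the steps above (the sets are copied from the examples, not derived by the code):

```python
# (quorum that committed E, quorum that elected S1 under D) per counter-example
cases = {
    "example 1": ({"S2", "S3"},       {"S1", "S4", "S5"}),
    "example 2": ({"S2", "S3", "S6"}, {"S1", "S4", "S5"}),
    "example 3": ({"S2", "S4"},       {"S1", "S3"}),
}
for name, (commit_quorum, vote_quorum) in cases.items():
    # No shared server means the new leader can miss the committed entry E.
    assert not (commit_quorum & vote_quorum), name
```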

Scope and severity: This affects Raft implementations that use single-server membership changes when:

  1. Starting from an even-sized cluster,
  2. Multiple different changes are requested concurrently, and
  3. Leadership is lost.

In this event, data loss and permanent split brain could occur. This does not affect Raft implementations that use joint consensus, since:

  1. Any two competing joint configurations have the old configuration in common and therefore overlap with themselves and their predecessor, and
  2. Any two competing simple configurations have the new configuration in common and therefore overlap with themselves and their predecessor.

(I can go into more detail on this upon request, but it may not be necessary.)

Proposed solution: The solution I'm proposing is exactly like the dissertation describes except that a leader may not append a new configuration entry until it has committed an entry from its current term. In a typical Raft implementation, a leader appends a no-op entry to the log when it is elected. This change would mean rejecting or delaying membership change requests until the no-op entry is committed.

Once a leader has committed an entry in its current term, it knows it has the latest committed configuration, and no existing uncommitted configurations from prior terms can be committed anymore (the servers that store the current leader's entry won't vote for the servers that have the uncommitted configuration). So then it is safe for the leader to create a new configuration (that differs by at most one server) and begin replicating it.
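
In miniature, the mechanism here is Raft's election restriction: a voter grants its vote only if the candidate's last (term, index) pair is at least as up to date as its own. Once the no-op from the current term is committed on a majority of Cold, every member of that majority has a newer last term than a stale candidate, so the vote is denied. The comparison below follows the paper's rule; the sample values are illustrative and mirror counter-example 1.

```python
def log_up_to_date(candidate_last, voter_last):
    # (term, index) tuples; Python's tuple ordering matches Raft's rule:
    # higher last term wins, ties on term broken by longer log.
    return candidate_last >= voter_last

s1_last = (1, 2)       # S1's log [C, D]: last entry from term 1, index 2
voter_last = (2, 2)    # a Cold majority member after the term-2 no-op commits
assert not log_up_to_date(s1_last, voter_last)  # S1's RequestVote is rejected
```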

In John Ousterhout's words, which tend to be better than mine, "The overall requirement is that a leader must not begin replicating a new configuration entry if there is an older configuration entry that is incompletely replicated (it is stored on at least one server, is not currently committed, but could become committed in the future). This ensures that if two configurations "compete", they differ by at most one server and hence have overlapping consensuses. In the algorithm from the dissertation, leaders were careful to make sure there were no incomplete configuration entries from the current term, but the algorithm did not properly handle incomplete configuration entries from previous leaders (which might not even be visible to the current leader). The new approach fixes that by ensuring that any such entries, if they exist, cannot be committed in the future, hence cannot be used for making decisions."

Safety argument: We don't yet have a formal safety proof for the correctness of membership changes (of either type) in Raft. I've made an attempt at a safety argument for why the proposed solution works, but it's a little ugly. It extends the safety argument in the paper/dissertation. The only part that's different (as far as I can tell) for membership changes is Step 2, which claims the existence of "the voter". The voter is the server that is part of the quorum that commits the entry in term T and the quorum that votes for the leader of term U. With a static configuration, it's easy to see the voter must exist (overlap of two majorities of the same set). With membership changes, it's more difficult; this aims to show it for the dissertation algorithm patched with the solution proposed above. Those interested can find the safety argument here: https://gist.github.com/ongardie/a11f32b70581e20d6bcd . I don't have any current plans to flesh this out into a full proof, but I'd be happy to discuss merits of the argument and ways to simplify it.

Closing remarks: Let's use this thread to discuss the issue with respect to the single-server change algorithm described in my dissertation and the proposed solution(s). If you want to discuss how this affects other membership change algorithms, including joint consensus or whatever else you folks might have come up with, please keep those to separate clearly-labeled threads to avoid confusion. I hope this bug hasn't affected anyone yet, and the events needed to cause it seem pretty unlikely. Still, this can cause data loss and split brain, so it should be taken seriously. If you have an implementation that's affected, you should file a bug report right away and get this patched up soon. Even if the solution I've proposed is ultimately superseded by something better, it won't do any harm to add this if-statement now. I also plan to start a list of errata for my dissertation, so that new people reading it are aware of this issue. Your feedback is welcome, and I'm looking forward to the discussion. -Diego