raft: deadlock during PreVote migration process #8501

Closed
irfansharif opened this issue Sep 5, 2017 · 5 comments

@irfansharif (Contributor) commented Sep 5, 2017

This was first uncovered in the context of cockroachdb/cockroach#18151. For some background: we were running a migration from a v1.0.x cockroachdb/cockroach binary to our v1.1-rc binary. The difference between the two revisions was that the earlier one had raft groups configured with PreVote: false and CheckQuorum: true, whereas the new v1.1-rc revision had PreVote: true and CheckQuorum: false. The "migration" here was a rolling restart of the cluster from the old revision to the new, so it was entirely likely that, for a given raft group, some replicas would be running the new revision with PreVote enabled and some would not.
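
For reference, a hedged sketch of the two configurations in play is below (PreVote and CheckQuorum are actual raft.Config fields; the import path, the configs helper, and the remaining values are illustrative assumptions, not CockroachDB's production setup):

package main

import (
	"math"

	"github.com/coreos/etcd/raft"
)

// configs sketches the before/after raft settings described above; only the
// PreVote/CheckQuorum flips matter for this issue.
func configs(storage *raft.MemoryStorage) (oldCfg, newCfg raft.Config) {
	base := raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   math.MaxUint64,
		MaxInflightMsgs: 256,
	}

	oldCfg = base // v1.0.x: CheckQuorum on, PreVote off
	oldCfg.CheckQuorum = true
	oldCfg.PreVote = false

	newCfg = base // v1.1-rc: PreVote on, CheckQuorum off
	newCfg.CheckQuorum = false
	newCfg.PreVote = true

	return oldCfg, newCfg
}

func main() {
	_, _ = configs(raft.NewMemoryStorage())
}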

Here's the relevant summary of a raft group that got wedged during this migration process:
[image: status summary of the two replicas, showing each replica's current term and last index]

Note that, of the two replicas shown, the first has a higher current term than the second but a lower last index. At this point the following manifests:

  • The second replica cannot grant its vote to the first because the first replica's log has a lower last index, so the first replica cannot assume leadership.
  • Messages from the second replica, when received by the first, are silently dropped because they come from a replica with a lower term.
  • Both replicas have PreVote enabled at this point, so neither advances its current term (consequently the second replica cannot overtake the first to break the deadlock).

Now, as for how we could get into a state where a replica with a lower last index ends up with a higher term: remember that this was a rolling restart, so some replicas were still running without PreVote. Such a replica, when calling an election, advances its current term. That advanced term is persisted (necessarily so), and when the node is then restarted with PreVote enabled, we end up deadlocked as described above.
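
To make the bullets above concrete, here is a small self-contained model of the wedged state (the replica/message types and their methods are hypothetical, written only to illustrate the reasoning; this is not the etcd raft code, and log up-to-dateness is reduced to a last-index comparison for brevity):

package main

import "fmt"

type replica struct {
	term    uint64
	lastIdx uint64
	preVote bool
}

type message struct {
	term       uint64
	lastLogIdx uint64
	isVoteReq  bool
}

// step mirrors the two rules that wedge the group:
//  1. messages carrying a lower term are silently dropped;
//  2. a (pre-)vote is granted only if the candidate's log is at least as
//     up-to-date as the voter's.
func (r *replica) step(m message) string {
	switch {
	case m.term < r.term:
		return "dropped: lower term"
	case m.isVoteReq && m.lastLogIdx < r.lastIdx:
		return "rejected: log not up-to-date"
	}
	return "processed"
}

// campaignTerm models calling an election: without PreVote a replica bumps
// (and persists) its term on every attempt; with PreVote the term is left
// untouched until a quorum of pre-votes has been gathered.
func (r *replica) campaignTerm() uint64 {
	if !r.preVote {
		r.term++
	}
	return r.term
}

func main() {
	// First replica: higher term, shorter log (it kept campaigning without
	// PreVote before the restart). Second replica: lower term, longer log.
	// Both now run with PreVote enabled.
	r1 := replica{term: 7, lastIdx: 10, preVote: true}
	r2 := replica{term: 5, lastIdx: 12, preVote: true}

	// r1 asks r2 for a (pre-)vote: rejected, its log is behind.
	fmt.Println(r2.step(message{term: r1.campaignTerm(), lastLogIdx: r1.lastIdx, isVoteReq: true}))
	// Anything r2 sends back to r1 carries a lower term and is dropped.
	fmt.Println(r1.step(message{term: r2.campaignTerm(), lastLogIdx: r2.lastIdx}))
	// With PreVote enabled on both, neither replica ever advances its term,
	// so this state never changes: deadlock.
}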

Here's a test demonstrating this:

func TestPreVoteMigration(t *testing.T) {
	n1 := newTestRaft(1, []uint64{1, 2, 3}, 10, 1, NewMemoryStorage())
	n2 := newTestRaft(2, []uint64{1, 2, 3}, 10, 1, NewMemoryStorage())
	n3 := newTestRaft(3, []uint64{1, 2, 3}, 10, 1, NewMemoryStorage())

	n1.becomeFollower(1, None)
	n2.becomeFollower(1, None)
	n3.becomeFollower(1, None)

	n1.preVote = true
	n2.preVote = true
	// We intentionally do not enable PreVote for n3, in order to simulate a
	// rolling restart where a raft group can have a mix of replicas with
	// PreVote enabled and replicas without.

	nt := newNetwork(n1, n2, n3)
	nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgHup})
	if n1.state != StateLeader {
		t.Errorf("node 1 state: %s, want %s", n1.state, StateLeader)
	}

	if n2.state != StateFollower {
		t.Errorf("node 2 state: %s, want %s", n2.state, StateFollower)
	}

	if n3.state != StateFollower {
		t.Errorf("node 3 state: %s, want %s", n3.state, StateFollower)
	}

	// Cause a network partition to isolate n3.
	nt.cut(1, 3)
	nt.cut(2, 3)

	nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgProp, Entries: []pb.Entry{{Data: []byte("some data")}}})
	nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgProp, Entries: []pb.Entry{{Data: []byte("some data")}}})
	nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgProp, Entries: []pb.Entry{{Data: []byte("some data")}}})

	// Check whether the term values are expected:
	//  n1.Term == 2
	//  n2.Term == 2
	//  n3.Term == 2
	if n1.Term != 2 {
		t.Errorf("node 1 term: %d, want %d", n1.Term, 2)
	}

	if n2.Term != 2 {
		t.Errorf("node 2 term: %d, want %d", n2.Term, 2)
	}

	if n3.Term != 2 {
		t.Errorf("node 3 term: %d, want %d", n3.Term, 2)
	}

	nt.send(pb.Message{From: 3, To: 3, Type: pb.MsgHup})
	// Check state and term for n3:
	//  n3.Term == 3
	//  n3.state == StateCandidate
	if n3.Term != 3 {
		t.Errorf("node 3 term: %d, want %d", n3.Term, 3)
	}
	if n3.state != StateCandidate {
		t.Errorf("node 3 state: %s, want %s", n3.state, StateCandidate)
	}

	// Check state for n1, n2:
	//  n1.state == StateLeader
	//  n2.state == StateFollower
	if n1.state != StateLeader {
		t.Errorf("node 1 state: %s, want %s", n1.state, StateLeader)
	}
	if n2.state != StateFollower {
		t.Errorf("node 2 state: %s, want %s", n2.state, StateFollower)
	}
	prevLeaderTerm := n1.Term

	n1.logger.Info("going to bring back peer 3 with prevote enabled, kill peer 2")

	// Enable prevote on n3.
	n3.preVote = true

	// Recover the network, then immediately isolate n2 (currently the
	// leader) to emulate n2 crashing.
	nt.recover()
	nt.cut(2, 1)
	nt.cut(2, 3)

	// Call for elections from both n1 and n3.
	nt.send(pb.Message{From: 3, To: 3, Type: pb.MsgHup})
	nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgHup})

	// n1's term should have advanced: the entries proposed earlier were
	// persisted on n1 and n2 while n3 was partitioned away, so only n1 can
	// become leader, and it has to do so at a later term.
	if n1.Term == prevLeaderTerm {
		t.Errorf("node 1 term: %d, expected to advance", prevLeaderTerm)
	}

	// Do we have a leader? n1 has the longer log, so it should end up as
	// leader with n3 as its follower.
	if n1.state != StateLeader || n3.state != StateFollower {
		t.Errorf("no leader")
	}
}

+cc @bdarnell @xiang90.

irfansharif changed the title from "raft: deadlock during {CheckQuorum,PreVote} migration process" to "raft: deadlock during PreVote migration process" on Sep 5, 2017

@xiang90 (Contributor) commented Sep 5, 2017

@irfansharif @bdarnell @siddontang @es-chow

Someone probably needs to own the pre-vote feature and make it stable. I do not have the bandwidth to own this in the near future, but if no one else is going to own it, I will put this item on my schedule and make it stable.

@bdarnell (Contributor) commented Sep 6, 2017

I agree, and I apologize for (mostly) implementing this without getting it into a working state. We'll work on fixing this.

@siddontang (Contributor)

We will also do more tests to make it more stable.

/cc @javaforfun

@irfansharif (Contributor, Author) commented Sep 8, 2017

Originally wrote the following on #8525 but posting here instead:


This addresses the specific deadlock observed in #8501, but given that we're looking to stabilize PreVote overall, I was wondering what your thoughts were on the following proposal:

The current Pre-Vote implementation works such that a PreCandidate sending out PreVote messages annotates Message.Term with the "next" term, i.e. the term it would call an election for should the PreVote phase pass with quorum. I tried retrofitting this implementation idea into the original TLA+ spec by @ongardie (ongardie/raft.tla), but it proved slightly difficult to get right. I then tried instead annotating Message.Term with the PreCandidate's current term and having the receiver account for this on the other end (i.e. PreVotes are votes for a term in the immediate future), and things were slightly cleaner to work out. You can find my implementation at irfansharif/raft.tla, irfansharif/raft.tla@22b0581 specifically. I have a WIP branch exploring this approach here for PreVotes, but nothing ready as yet.
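
For concreteness, the difference between the two schemes comes down to which term is put on the wire (a hedged sketch with made-up function names; neither is the literal etcd raft code):

package prevote

// Current scheme: the outgoing MsgPreVote is annotated with the "next" term,
// i.e. the term the PreCandidate would campaign at if the pre-vote succeeds.
func wireTermCurrentScheme(currentTerm uint64) uint64 { return currentTerm + 1 }

// Proposed scheme: the outgoing MsgPreVote carries the PreCandidate's current
// term as-is; the receiver knows a pre-vote is a vote for the term in the
// immediate future and adds one on its side.
func wireTermProposedScheme(currentTerm uint64) uint64 { return currentTerm }

func electionTermAtReceiver(wireTerm uint64) uint64 { return wireTerm + 1 }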

As it pertains to #8501, consider how this addresses it: messages from higher terms always cause us to revert to the follower state. See irfansharif/raft.tla@22b0581#L520-L522, which is from the original spec (this was the part I had trouble retrofitting the current implementation into). So for the "rogue" candidate whose term was advanced before PreVote was enabled, it would just work out of the box. Message handling at the receiver is also simplified, as there is less of a case-by-case basis. Right now, if Message.Term > raft.Term and PreVote is enabled, we let the message fall through to be handled below; this patch introduces a negative response instead, but only by comparing logs when the message is specifically a PreVote. I think the alternative is slightly cleaner in this respect, but I have nothing concrete at hand to present as yet.
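
A hedged sketch of the receiver-side handling under the proposed scheme (a hypothetical helper, not the etcd raft or TLA+ source; log up-to-dateness is again reduced to a last-index comparison):

package prevote

// handlePreVoteProposed illustrates why the proposal resolves the deadlock:
// because the MsgPreVote carries the sender's true current term, the ordinary
// term rules from the original Raft spec apply before anything PreVote-specific.
func handlePreVoteProposed(localTerm, msgTerm, localLastIdx, msgLastIdx uint64) (grant, becomeFollower bool) {
	switch {
	case msgTerm > localTerm:
		// A message from a strictly higher term reverts us to follower at that
		// term (original-spec behaviour); the usual log comparison then decides
		// the vote. This is what un-wedges the "rogue" replica whose term was
		// advanced before PreVote was enabled.
		return msgLastIdx >= localLastIdx, true
	case msgTerm < localTerm:
		// Stale sender: reject without changing our own state.
		return false, false
	default:
		// Same term: this is a pre-vote for term msgTerm+1; grant it only if
		// the sender's log is at least as up-to-date as ours.
		return msgLastIdx >= localLastIdx, false
	}
}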

NB: with the TLA+ spec extended with PreVotes, I did not complete a full model-checking run. I tried constraining the search space and tried random searching, but after a roughly 5-hour run with no failures (albeit on my tiny 4-core machine) I am convinced it's OK.

As an aside, I'll be on vacation through the upcoming week (so will Ben). I will be back in school right after but I'll try submitting this by/before then.

@bdarnell (Contributor)

When I initially implemented prevote, I tried it both ways (the current implementation with MsgPreVote containing r.Term+1, and Irfan's suggestion that it contains the current term and the recipient adds one). Both solutions are awkward in places; I think that making this change would just move complexity around without being much of a net win.

In particular, I don't think making the switch in how we send pre-vote terms would simplify the fix for this deadlock. Sending the true term instead of the future term would avoid the special case when the terms only differ by one, but if there is a greater difference we'd still need a special case to handle MsgPreVote with a term in the past.
