raft: deadlock during PreVote migration process #8501

Closed
irfansharif opened this issue Sep 5, 2017 · 5 comments

@irfansharif (Contributor) commented Sep 5, 2017

This was first uncovered in the context of cockroachdb/cockroach#18151. For some background: we were running a migration from a v1.0.x cockroachdb/cockroach binary to our v1.1-rc binary. The difference between the two revisions was that the earlier one had raft groups configured with PreVote: false and CheckQuorum: true, whereas the new v1.1-rc revision had PreVote: true and CheckQuorum: false. The "migration" here was a rolling restart of the cluster from the old revision to the new, so it was entirely likely that, for a given raft group, some replicas would be running the new revision with PreVote enabled and some would not.
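
For reference, a hedged sketch of the two configurations in play is below (PreVote and CheckQuorum are actual raft.Config fields; the import path, the configs helper, and the remaining values are illustrative assumptions, not CockroachDB's production setup):

package main

import (
	"math"

	"github.com/coreos/etcd/raft"
)

// configs sketches the before/after raft settings described above; only the
// PreVote/CheckQuorum flips matter for this issue.
func configs(storage *raft.MemoryStorage) (oldCfg, newCfg raft.Config) {
	base := raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   math.MaxUint64,
		MaxInflightMsgs: 256,
	}

	oldCfg = base // v1.0.x: CheckQuorum on, PreVote off
	oldCfg.CheckQuorum = true
	oldCfg.PreVote = false

	newCfg = base // v1.1-rc: PreVote on, CheckQuorum off
	newCfg.CheckQuorum = false
	newCfg.PreVote = true

	return oldCfg, newCfg
}

func main() {
	_, _ = configs(raft.NewMemoryStorage())
}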

Here's the relevant summary of a raft group that got wedged during this migration process:
[image: status summary of the two replicas, showing each replica's current term and last index]

Note that, of the two replicas shown, the first has a higher current term than the second but a lower last index. At this point the following manifests:

  • The second replica cannot grant its vote to the first because the first replica's log has a lower last index, so the first replica cannot assume leadership.
  • Messages from the second replica, when received by the first, are silently dropped because they come from a replica with a lower term.
  • Both replicas have PreVote enabled at this point, so neither advances its current term (consequently the second replica cannot overtake the first to break the deadlock).

Now, as for how we could get into a state where a replica with a lower last index ends up with a higher term: remember that this was a rolling restart, so some replicas were still running without PreVote. Such a replica, when calling an election, advances its current term. That advanced term is persisted (necessarily so), and when the node is then restarted with PreVote enabled, we end up deadlocked as described above.
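
To make the bullets above concrete, here is a small self-contained model of the wedged state (the replica/message types and their methods are hypothetical, written only to illustrate the reasoning; this is not the etcd raft code, and log up-to-dateness is reduced to a last-index comparison for brevity):

package main

import "fmt"

type replica struct {
	term    uint64
	lastIdx uint64
	preVote bool
}

type message struct {
	term       uint64
	lastLogIdx uint64
	isVoteReq  bool
}

// step mirrors the two rules that wedge the group:
//  1. messages carrying a lower term are silently dropped;
//  2. a (pre-)vote is granted only if the candidate's log is at least as
//     up-to-date as the voter's.
func (r *replica) step(m message) string {
	switch {
	case m.term < r.term:
		return "dropped: lower term"
	case m.isVoteReq && m.lastLogIdx < r.lastIdx:
		return "rejected: log not up-to-date"
	}
	return "processed"
}

// campaignTerm models calling an election: without PreVote a replica bumps
// (and persists) its term on every attempt; with PreVote the term is left
// untouched until a quorum of pre-votes has been gathered.
func (r *replica) campaignTerm() uint64 {
	if !r.preVote {
		r.term++
	}
	return r.term
}

func main() {
	// First replica: higher term, shorter log (it kept campaigning without
	// PreVote before the restart). Second replica: lower term, longer log.
	// Both now run with PreVote enabled.
	r1 := replica{term: 7, lastIdx: 10, preVote: true}
	r2 := replica{term: 5, lastIdx: 12, preVote: true}

	// r1 asks r2 for a (pre-)vote: rejected, its log is behind.
	fmt.Println(r2.step(message{term: r1.campaignTerm(), lastLogIdx: r1.lastIdx, isVoteReq: true}))
	// Anything r2 sends back to r1 carries a lower term and is dropped.
	fmt.Println(r1.step(message{term: r2.campaignTerm(), lastLogIdx: r2.lastIdx}))
	// With PreVote enabled on both, neither replica ever advances its term,
	// so this state never changes: deadlock.
}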

Here's a test demonstrating this:

func TestPreVoteMigration(t *testing.T) {
	n1 := newTestRaft(1, []uint64{1, 2, 3}, 10, 1, NewMemoryStorage())
	n2 := newTestRaft(2, []uint64{1, 2, 3}, 10, 1, NewMemoryStorage())
	n3 := newTestRaft(3, []uint64{1, 2, 3}, 10, 1, NewMemoryStorage())

	n1.becomeFollower(1, None)
	n2.becomeFollower(1, None)
	n3.becomeFollower(1, None)

	n1.preVote = true
	n2.preVote = true
	// We intentionally do not enable PreVote for n3, in order to simulate a
	// rolling restart where a raft group can have a mix of replicas with
	// PreVote enabled and replicas without.

	nt := newNetwork(n1, n2, n3)
	nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgHup})
	if n1.state != StateLeader {
		t.Errorf("node 1 state: %s, want %s", n1.state, StateLeader)
	}

	if n2.state != StateFollower {
		t.Errorf("node 2 state: %s, want %s", n2.state, StateFollower)
	}

	if n3.state != StateFollower {
		t.Errorf("node 3 state: %s, want %s", n3.state, StateFollower)
	}

	// Cause a network partition to isolate n3.
	nt.cut(1, 3)
	nt.cut(2, 3)

	nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgProp, Entries: []pb.Entry{{Data: []byte("some data")}}})
	nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgProp, Entries: []pb.Entry{{Data: []byte("some data")}}})
	nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgProp, Entries: []pb.Entry{{Data: []byte("some data")}}})

	// Check whether the term values are expected:
	//  n1.Term == 2
	//  n2.Term == 2
	//  n3.Term == 2
	if n1.Term != 2 {
		t.Errorf("node 1 term: %d, want %d", n1.Term, 2)
	}

	if n2.Term != 2 {
		t.Errorf("node 2 term: %d, want %d", n2.Term, 2)
	}

	if n3.Term != 2 {
		t.Errorf("node 3 term: %d, want %d", n3.Term, 2)
	}

	nt.send(pb.Message{From: 3, To: 3, Type: pb.MsgHup})
	// Check state and term for n3:
	//  n3.Term == 3
	//  n3.state == StateCandidate
	if n3.Term != 3 {
		t.Errorf("node 3 term: %d, want %d", n3.Term, 3)
	}
	if n3.state != StateCandidate {
		t.Errorf("node 3 state: %s, want %s", n3.state, StateCandidate)
	}

	// Check state for n1, n2:
	//  n1.state == StateLeader
	//  n2.state == StateFollower
	if n1.state != StateLeader {
		t.Errorf("node 1 state: %s, want %s", n1.state, StateLeader)
	}
	if n2.state != StateFollower {
		t.Errorf("node 2 state: %s, want %s", n2.state, StateFollower)
	}
	prevLeaderTerm := n1.Term

	n1.logger.Info("going to bring back peer 3 with prevote enabled, kill peer 2")

	// Enable prevote on n3.
	n3.preVote = true

	// Recover the network, then immediately isolate n2 (currently the
	// leader) to emulate n2 crashing.
	nt.recover()
	nt.cut(2, 1)
	nt.cut(2, 3)

	// Call for elections from both n1 and n3.
	nt.send(pb.Message{From: 3, To: 3, Type: pb.MsgHup})
	nt.send(pb.Message{From: 1, To: 1, Type: pb.MsgHup})

	// n1's term should have advanced: the entries proposed earlier were
	// persisted on n1 and n2 while n3 was partitioned away, so only n1 can
	// become leader, and it has to do so at a later term.
	if n1.Term == prevLeaderTerm {
		t.Errorf("node 1 term: %d, expected to advance", prevLeaderTerm)
	}

	// Do we have a leader? n1 has the longer log, so it should end up as
	// leader with n3 as its follower.
	if n1.state != StateLeader || n3.state != StateFollower {
		t.Errorf("no leader")
	}
}

+cc @bdarnell @xiang90.

irfansharif changed the title from "raft: deadlock during {CheckQuorum,PreVote} migration process" to "raft: deadlock during PreVote migration process" on Sep 5, 2017

@xiang90 (Contributor) commented Sep 5, 2017

@irfansharif @bdarnell @siddontang @es-chow

Someone probably needs to own the pre-vote feature and make it stable. I do not have the bandwidth to own this in the near future, but if no one else is going to own it, I will put this item on my schedule and make it stable.

@bdarnell (Contributor) commented Sep 6, 2017

I agree, and I apologize for (mostly) implementing this without getting it into a working state. We'll work on fixing this.

@siddontang (Contributor)

We will also do more tests to make it more stable.

/cc @javaforfun

@irfansharif (Contributor, Author) commented Sep 8, 2017

Originally wrote the following on #8525 but posting here instead:


This addresses the specific deadlock observed in #8501, but given that we're looking to stabilize PreVote overall, I was wondering what your thoughts were on the following proposal:

The current Pre-Vote implementation works such that a PreCandidate sending out PreVote messages annotates Message.Term with the "next" term, i.e. the term it would call an election for should the PreVote phase pass with quorum. I tried retrofitting this implementation idea into the original TLA+ spec by @ongardie (ongardie/raft.tla), but it proved slightly difficult to get right. I then tried instead annotating Message.Term with the PreCandidate's current term and having the receiver account for this on the other end (i.e. PreVotes are votes for a term in the immediate future), and things were slightly cleaner to work out. You can find my implementation at irfansharif/raft.tla, irfansharif/raft.tla@22b0581 specifically. I have a WIP branch exploring this approach here for PreVotes, but nothing ready as yet.
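
For concreteness, the difference between the two schemes comes down to which term is put on the wire (a hedged sketch with made-up function names; neither is the literal etcd raft code):

package prevote

// Current scheme: the outgoing MsgPreVote is annotated with the "next" term,
// i.e. the term the PreCandidate would campaign at if the pre-vote succeeds.
func wireTermCurrentScheme(currentTerm uint64) uint64 { return currentTerm + 1 }

// Proposed scheme: the outgoing MsgPreVote carries the PreCandidate's current
// term as-is; the receiver knows a pre-vote is a vote for the term in the
// immediate future and adds one on its side.
func wireTermProposedScheme(currentTerm uint64) uint64 { return currentTerm }

func electionTermAtReceiver(wireTerm uint64) uint64 { return wireTerm + 1 }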

As it pertains to #8501, consider how this addresses it: messages from higher terms always cause us to revert to the follower state. See irfansharif/raft.tla@22b0581#L520-L522, which is from the original spec (this was the part I had trouble retrofitting the current implementation into). So for the "rogue" candidate whose term was advanced before PreVote was enabled, it would just work out of the box. Message handling at the receiver is also simplified, as there is less of a case-by-case basis. Right now, if Message.Term > raft.Term and PreVote is enabled, we let the message fall through to be handled below; this patch introduces a negative response instead, but only by comparing logs when the message is specifically a PreVote. I think the alternative is slightly cleaner in this respect, but I have nothing concrete at hand to present as yet.
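
A hedged sketch of the receiver-side handling under the proposed scheme (a hypothetical helper, not the etcd raft or TLA+ source; log up-to-dateness is again reduced to a last-index comparison):

package prevote

// handlePreVoteProposed illustrates why the proposal resolves the deadlock:
// because the MsgPreVote carries the sender's true current term, the ordinary
// term rules from the original Raft spec apply before anything PreVote-specific.
func handlePreVoteProposed(localTerm, msgTerm, localLastIdx, msgLastIdx uint64) (grant, becomeFollower bool) {
	switch {
	case msgTerm > localTerm:
		// A message from a strictly higher term reverts us to follower at that
		// term (original-spec behaviour); the usual log comparison then decides
		// the vote. This is what un-wedges the "rogue" replica whose term was
		// advanced before PreVote was enabled.
		return msgLastIdx >= localLastIdx, true
	case msgTerm < localTerm:
		// Stale sender: reject without changing our own state.
		return false, false
	default:
		// Same term: this is a pre-vote for term msgTerm+1; grant it only if
		// the sender's log is at least as up-to-date as ours.
		return msgLastIdx >= localLastIdx, false
	}
}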

NB: with the TLA+ spec extended with PreVotes, I did not complete a full model-checking run. I tried constraining the search space and tried random searching, but after a roughly 5-hour run with no failures (albeit on my tiny 4-core machine) I am convinced it's OK.

As an aside, I'll be on vacation through the upcoming week (so will Ben). I will be back in school right after but I'll try submitting this by/before then.

@bdarnell (Contributor)

When I initially implemented prevote, I tried it both ways (the current implementation with MsgPreVote containing r.Term+1, and Irfan's suggestion that it contains the current term and the recipient adds one). Both solutions are awkward in places; I think that making this change would just move complexity around without being much of a net win.

In particular, I don't think making the switch in how we send pre-vote terms would simplify the fix for this deadlock. Sending the true term instead of the future term would avoid the special case when the terms only differ by one, but if there is a greater difference we'd still need a special case to handle MsgPreVote with a term in the past.
