New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: introduce/fix TestNodeWithSmallerTermCanCompleteElection #8288

Merged
merged 1 commit into from Jul 25, 2017

Conversation

4 participants
@irfansharif
Copy link
Contributor

irfansharif commented Jul 20, 2017

Fixes #8243.

TestNodeWithSmallerTermCanCompleteElection tests the scenario where a
node that has been partitioned away (and fallen behind) rejoins the
cluster at about the same time the leader node gets partitioned away.
Previously the cluster would come to a standstill when run with PreVote
enabled.

When responding to Msg{Pre,}Vote messages we now include the term from
the message, not the local term. To see why consider the case where a
single node was previously partitioned away and it's local term is now
of date. If we include the local term (recall that for pre-votes we
don't update the local term), the (pre-)campaigning node on the other
end will proceed to ignore the message (it ignores all out of date
messages).
The term in the original message and current local term are the same in
the case of regular votes, but different for pre-votes.

NB: Had to change TestRecvMsgVote to include pb.Message.Term when
sending MsgVote messages. The new sanity checks on MsgVoteResp
(m.Term != 0) would panic with the old test as raft.Term would be equal
to 0 when responding with MsgVoteResp messages.

+cc @bdarnell @xiang90.

} else {
return nil
case m.Type == pb.MsgPreVote || m.Type == pb.MsgPreVoteResp:
// If we receive a MsgPreVote from a node with a lower term number,

This comment has been minimized.

@bdarnell

bdarnell Jul 20, 2017

Member

Why is it important that we reject MsgPreVote instead of dropping it? We just drop MsgVote messages with older terms.

This comment has been minimized.

@irfansharif

irfansharif Jul 20, 2017

Contributor

It isn't, it can be dropped as well. Rejecting it just puts the slower (previously partitioned) node into the candidate follower state (which would happen anyway when the faster node sends a MsgPreVote message with a higher term).
As an aside, why do we just drop MsgVote messages with older terms instead of rejecting them directly?

This comment has been minimized.

@bdarnell

bdarnell Jul 20, 2017

Member

How does rejecting a message lead to a transition from pre-candidate to candidate?

As for why we drop instead of reject MsgVotes with low terms, I don't think there's any deep reason for it. It's just easier to drop since we drop all other messages with lower terms.

This comment has been minimized.

@irfansharif

irfansharif Jul 20, 2017

Contributor

Mistyped. Meant transition back to follower state.

// If we receive a MsgPreVote from a node with a lower term number,
// we reject it. For a MsgPreVoteResp we simply pass it to our
// stepFunc. For a stale MsgPreVoteResp (think of late responses,
// we've already transitioned to a leader or follower state) this

This comment has been minimized.

@bdarnell

bdarnell Jul 20, 2017

Member

But if we've transitioned to follower and then back to pre-candidate, there's no other Term check in StepCandidate. We could accept a MsgPreVoteResp for term T as a valid pre-vote in term T+1.

This comment has been minimized.

@irfansharif

irfansharif Jul 20, 2017

Contributor

there's no other Term check in StepCandidate

Isn't this then an existing problem for MsgVoteResp? We could accept granted MsgVoteResps from earlier elections? I see nothing in raft.go: *raft.poll that does this check.

This comment has been minimized.

@bdarnell

bdarnell Jul 20, 2017

Member

No, because they're dropped by this block (the term check at line 713). Step handles all messages from other terms, so that the stepState methods can safely assume that the message is for the current term (with the exception of preVotes, which may have a higher term).

This comment has been minimized.

@irfansharif

irfansharif Jul 20, 2017

Contributor

I see. Short of augmenting pb.Message (or repurposing pb.Message.Term, hacky & single purpose for me but might be the only option) I see no solution to this (unless I'm missing something). Say I have a three node cluster (A, B & C) where there is an asymmetric partition between A & C in that it C can receive incoming messages but cannot deliver outgoing ones to A. Consider the following:

  • A campaigns, send out a MsgPreVote to C
  • C responds with MsgPreVoteResp, doesn't deliver (yet)
  • A returns back to follower state (B wins election)
    ...
  • Timeout, A campaigns again (later term), sends out MsgPreVote to C
  • C responds with another MsgPreVoteResp, doesn't deliver (yet)
    [partition heals]
  • A receives two MsgPreVoteResp messagess, cannot distinguish which one is in response to which election. (This is because C doesn't change terms when it receives MsgPreVote messages, pb.Message.Term in each MsgPreVoteResp message is simply C's term when it received said message.)

Ignored is the fact that in this specific example with the three node cluster necessarily C's term has increased between the two messages. Point stands where we still have no good way to determine what election cycle each MsgPreVoteResp message is from.

This comment has been minimized.

@bdarnell

bdarnell Jul 20, 2017

Member

It's true that there's an ambiguity in pre-votes - because they don't increment the term, two consecutive pre-vote election cycles look identical. But it's OK if there are (rare) false positives here - all that will happen is that we'll continue to a full election, which is slightly disruptive but will guarantee that only an up-to-date node can become the new leader.

This comment has been minimized.

@irfansharif

irfansharif Jul 20, 2017

Contributor

To clarify, you're OK with granted MsgPreVoteResp messages from earlier election cycles count towards the total # votes needed to pass the PreVote phase? (With a comment to this effect of course.)

This comment has been minimized.

@bdarnell

bdarnell Jul 20, 2017

Member

Yeah. There are other possibilities for false positives too (for example, a node may grant pre-vote requests to two different nodes in the same term, but only one of them can win the real election).

@bdarnell

This comment has been minimized.

Copy link
Member

bdarnell commented Jul 20, 2017

When we send the MsgPreVoteResp at the end of r.Step, we use r.Term (which is out of date). This is the problem: the PreVoteResp is dropped because it is out of date when the (pre-)campaigning node receives it. Vote responses should use the term from the message, not the node's local term (these are the same for regular votes, but different for pre-votes).

@irfansharif irfansharif force-pushed the irfansharif:pre-vote branch from 18c6006 to 7a03156 Jul 20, 2017

@irfansharif

This comment has been minimized.

Copy link
Contributor

irfansharif commented Jul 20, 2017

PTAL.

// the message (it ignores all out of date messages).
// The term in the original message and current local term are the
// same in the case of regular votes, but different for pre-votes.
if m.Type == pb.MsgVote || m.Type == pb.MsgPreVote {

This comment has been minimized.

@bdarnell

bdarnell Jul 20, 2017

Member

This condition is redundant, we're already in case pb.MsgVote, pb.MsgPreVote:.

This comment has been minimized.

@irfansharif

irfansharif Jul 20, 2017

Contributor

whoops.

@bdarnell bdarnell requested a review from xiang90 Jul 20, 2017

@bdarnell bdarnell added the area/raft label Jul 20, 2017

@irfansharif irfansharif force-pushed the irfansharif:pre-vote branch 2 times, most recently from 9c5bfaf to 90cbb14 Jul 20, 2017

raft: introduce/fix TestNodeWithSmallerTermCanCompleteElection
TestNodeWithSmallerTermCanCompleteElection tests the scenario where a
node that has been partitioned away (and fallen behind) rejoins the
cluster at about the same time the leader node gets partitioned away.
Previously the cluster would come to a standstill when run with PreVote
enabled.

When responding to Msg{Pre,}Vote messages we now include the term from
the message, not the local term. To see why consider the case where a
single node was previously partitioned away and it's local term is now
of date. If we include the local term (recall that for pre-votes we
don't update the local term), the (pre-)campaigning node on the other
end will proceed to ignore the message (it ignores all out of date
messages).
The term in the original message and current local term are the same in
the case of regular votes, but different for pre-votes.

NB: Had to change TestRecvMsgVote to include pb.Message.Term when
sending MsgVote messages. The new sanity checks on MsgVoteResp
(m.Term != 0) would panic with the old test as raft.Term would be equal
to 0 when responding with MsgVoteResp messages.

@irfansharif irfansharif force-pushed the irfansharif:pre-vote branch from 90cbb14 to a92ceee Jul 21, 2017

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Jul 21, 2017

Codecov Report

❗️ No coverage uploaded for pull request base (master@a64d15e). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #8288   +/-   ##
=========================================
  Coverage          ?   76.34%           
=========================================
  Files             ?      345           
  Lines             ?    27074           
  Branches          ?        0           
=========================================
  Hits              ?    20670           
  Misses            ?     4920           
  Partials          ?     1484
Impacted Files Coverage Δ
raft/raft.go 92.58% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a64d15e...a92ceee. Read the comment docs.

@xiang90

This comment has been minimized.

Copy link
Contributor

xiang90 commented Jul 25, 2017

LGTM. Thanks for fixing this!

@xiang90 xiang90 merged commit 93826f2 into etcd-io:master Jul 25, 2017

5 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
jenkins-cov Build finished.
Details
jenkins-ppc64le Build finished.
Details
jenkins-proxy-ci Build finished.
Details
semaphoreci The build passed on Semaphore.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment