Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: introduce/fix TestNodeWithSmallerTermCanCompleteElection #8288

Merged
merged 1 commit into from
Jul 25, 2017

Conversation

irfansharif
Copy link
Contributor

@irfansharif irfansharif commented Jul 20, 2017

Fixes #8243.

TestNodeWithSmallerTermCanCompleteElection tests the scenario where a
node that has been partitioned away (and fallen behind) rejoins the
cluster at about the same time the leader node gets partitioned away.
Previously the cluster would come to a standstill when run with PreVote
enabled.

When responding to Msg{Pre,}Vote messages we now include the term from
the message, not the local term. To see why consider the case where a
single node was previously partitioned away and it's local term is now
of date. If we include the local term (recall that for pre-votes we
don't update the local term), the (pre-)campaigning node on the other
end will proceed to ignore the message (it ignores all out of date
messages).
The term in the original message and current local term are the same in
the case of regular votes, but different for pre-votes.

NB: Had to change TestRecvMsgVote to include pb.Message.Term when
sending MsgVote messages. The new sanity checks on MsgVoteResp
(m.Term != 0) would panic with the old test as raft.Term would be equal
to 0 when responding with MsgVoteResp messages.

+cc @bdarnell @xiang90.

raft/raft.go Outdated
} else {
return nil
case m.Type == pb.MsgPreVote || m.Type == pb.MsgPreVoteResp:
// If we receive a MsgPreVote from a node with a lower term number,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it important that we reject MsgPreVote instead of dropping it? We just drop MsgVote messages with older terms.

Copy link
Contributor Author

@irfansharif irfansharif Jul 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't, it can be dropped as well. Rejecting it just puts the slower (previously partitioned) node into the candidate follower state (which would happen anyway when the faster node sends a MsgPreVote message with a higher term).
As an aside, why do we just drop MsgVote messages with older terms instead of rejecting them directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does rejecting a message lead to a transition from pre-candidate to candidate?

As for why we drop instead of reject MsgVotes with low terms, I don't think there's any deep reason for it. It's just easier to drop since we drop all other messages with lower terms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mistyped. Meant transition back to follower state.

raft/raft.go Outdated
// If we receive a MsgPreVote from a node with a lower term number,
// we reject it. For a MsgPreVoteResp we simply pass it to our
// stepFunc. For a stale MsgPreVoteResp (think of late responses,
// we've already transitioned to a leader or follower state) this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we've transitioned to follower and then back to pre-candidate, there's no other Term check in StepCandidate. We could accept a MsgPreVoteResp for term T as a valid pre-vote in term T+1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no other Term check in StepCandidate

Isn't this then an existing problem for MsgVoteResp? We could accept granted MsgVoteResps from earlier elections? I see nothing in raft.go: *raft.poll that does this check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because they're dropped by this block (the term check at line 713). Step handles all messages from other terms, so that the stepState methods can safely assume that the message is for the current term (with the exception of preVotes, which may have a higher term).

Copy link
Contributor Author

@irfansharif irfansharif Jul 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Short of augmenting pb.Message (or repurposing pb.Message.Term, hacky & single purpose for me but might be the only option) I see no solution to this (unless I'm missing something). Say I have a three node cluster (A, B & C) where there is an asymmetric partition between A & C in that it C can receive incoming messages but cannot deliver outgoing ones to A. Consider the following:

  • A campaigns, send out a MsgPreVote to C
  • C responds with MsgPreVoteResp, doesn't deliver (yet)
  • A returns back to follower state (B wins election)
    ...
  • Timeout, A campaigns again (later term), sends out MsgPreVote to C
  • C responds with another MsgPreVoteResp, doesn't deliver (yet)
    [partition heals]
  • A receives two MsgPreVoteResp messagess, cannot distinguish which one is in response to which election. (This is because C doesn't change terms when it receives MsgPreVote messages, pb.Message.Term in each MsgPreVoteResp message is simply C's term when it received said message.)

Ignored is the fact that in this specific example with the three node cluster necessarily C's term has increased between the two messages. Point stands where we still have no good way to determine what election cycle each MsgPreVoteResp message is from.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that there's an ambiguity in pre-votes - because they don't increment the term, two consecutive pre-vote election cycles look identical. But it's OK if there are (rare) false positives here - all that will happen is that we'll continue to a full election, which is slightly disruptive but will guarantee that only an up-to-date node can become the new leader.

Copy link
Contributor Author

@irfansharif irfansharif Jul 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, you're OK with granted MsgPreVoteResp messages from earlier election cycles count towards the total # votes needed to pass the PreVote phase? (With a comment to this effect of course.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. There are other possibilities for false positives too (for example, a node may grant pre-vote requests to two different nodes in the same term, but only one of them can win the real election).

@bdarnell
Copy link
Contributor

When we send the MsgPreVoteResp at the end of r.Step, we use r.Term (which is out of date). This is the problem: the PreVoteResp is dropped because it is out of date when the (pre-)campaigning node receives it. Vote responses should use the term from the message, not the node's local term (these are the same for regular votes, but different for pre-votes).

@irfansharif
Copy link
Contributor Author

PTAL.

raft/raft.go Outdated
// the message (it ignores all out of date messages).
// The term in the original message and current local term are the
// same in the case of regular votes, but different for pre-votes.
if m.Type == pb.MsgVote || m.Type == pb.MsgPreVote {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is redundant, we're already in case pb.MsgVote, pb.MsgPreVote:.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops.

@bdarnell bdarnell requested a review from xiang90 July 20, 2017 21:23
@irfansharif irfansharif force-pushed the pre-vote branch 2 times, most recently from 9c5bfaf to 90cbb14 Compare July 21, 2017 01:32
TestNodeWithSmallerTermCanCompleteElection tests the scenario where a
node that has been partitioned away (and fallen behind) rejoins the
cluster at about the same time the leader node gets partitioned away.
Previously the cluster would come to a standstill when run with PreVote
enabled.

When responding to Msg{Pre,}Vote messages we now include the term from
the message, not the local term. To see why consider the case where a
single node was previously partitioned away and it's local term is now
of date. If we include the local term (recall that for pre-votes we
don't update the local term), the (pre-)campaigning node on the other
end will proceed to ignore the message (it ignores all out of date
messages).
The term in the original message and current local term are the same in
the case of regular votes, but different for pre-votes.

NB: Had to change TestRecvMsgVote to include pb.Message.Term when
sending MsgVote messages. The new sanity checks on MsgVoteResp
(m.Term != 0) would panic with the old test as raft.Term would be equal
to 0 when responding with MsgVoteResp messages.
@codecov-io
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (master@a64d15e). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #8288   +/-   ##
=========================================
  Coverage          ?   76.34%           
=========================================
  Files             ?      345           
  Lines             ?    27074           
  Branches          ?        0           
=========================================
  Hits              ?    20670           
  Misses            ?     4920           
  Partials          ?     1484
Impacted Files Coverage Δ
raft/raft.go 92.58% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a64d15e...a92ceee. Read the comment docs.

@xiang90
Copy link
Contributor

xiang90 commented Jul 25, 2017

LGTM. Thanks for fixing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants