raft: introduce/fix TestNodeWithSmallerTermCanCompleteElection #8288
irfansharif referenced this pull request Jul 20, 2017
Closed storage: re-enable Raft PreVote RPC #16950
raft/raft.go
    } else {
    return nil
    case m.Type == pb.MsgPreVote || m.Type == pb.MsgPreVoteResp:
    // If we receive a MsgPreVote from a node with a lower term number,
bdarnell
Jul 20, 2017
Member
Why is it important that we reject MsgPreVote instead of dropping it? We just drop MsgVote messages with older terms.
irfansharif
Jul 20, 2017
Contributor
It isn't, it can be dropped as well. Rejecting it just puts the slower (previously partitioned) node into the candidate follower state (which would happen anyway when the faster node sends a MsgPreVote message with a higher term).
As an aside, why do we just drop MsgVote messages with older terms instead of rejecting them directly?
bdarnell
Jul 20, 2017
Member
How does rejecting a message lead to a transition from pre-candidate to candidate?
As for why we drop instead of reject MsgVotes with low terms, I don't think there's any deep reason for it. It's just easier to drop since we drop all other messages with lower terms.
raft/raft.go
    // If we receive a MsgPreVote from a node with a lower term number,
    // we reject it. For a MsgPreVoteResp we simply pass it to our
    // stepFunc. For a stale MsgPreVoteResp (think of late responses,
    // we've already transitioned to a leader or follower state) this
bdarnell
Jul 20, 2017
Member
But if we've transitioned to follower and then back to pre-candidate, there's no other Term check in StepCandidate. We could accept a MsgPreVoteResp for term T as a valid pre-vote in term T+1.
irfansharif
Jul 20, 2017
Contributor
"there's no other Term check in StepCandidate"
Isn't this then an existing problem for MsgVoteResp? We could accept granted MsgVoteResps from earlier elections? I see nothing in raft.go: *raft.poll that does this check.
bdarnell
Jul 20, 2017
Member
No, because they're dropped by this block (the term check at line 713). Step handles all messages from other terms, so that the stepState methods can safely assume that the message is for the current term (with the exception of preVotes, which may have a higher term).
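The term filtering described in this comment can be sketched roughly as follows. This is a simplified illustration, not the code in raft/raft.go: the `Message` struct, the `Action` values, and the `classify` function are hypothetical stand-ins, and the sketch assumes that pre-vote traffic with a higher term is passed through without changing the local term.

```go
package main

import "fmt"

// MessageType and Message are simplified stand-ins for the pb types.
type MessageType int

const (
	MsgApp MessageType = iota
	MsgVote
	MsgVoteResp
	MsgPreVote
	MsgPreVoteResp
)

type Message struct {
	Type MessageType
	Term uint64
}

// Action is what the (hypothetical) term filter decides for a message.
type Action int

const (
	Drop     Action = iota // stale term: ignore the message
	StepDown               // newer term: adopt it, become follower, then handle
	Handle                 // pass to the state handler, local term unchanged
)

// classify sketches the term check that runs before any state-specific
// handler, so that those handlers only see current-term messages.
func classify(localTerm uint64, m Message) Action {
	switch {
	case m.Term < localTerm:
		return Drop
	case m.Term > localTerm:
		// Assumption: pre-vote traffic never changes the local term.
		if m.Type == MsgPreVote || m.Type == MsgPreVoteResp {
			return Handle
		}
		return StepDown
	default:
		return Handle
	}
}

func main() {
	const localTerm = 5
	fmt.Println(classify(localTerm, Message{Type: MsgVoteResp, Term: 4}) == Drop)
	fmt.Println(classify(localTerm, Message{Type: MsgPreVote, Term: 6}) == Handle)
	fmt.Println(classify(localTerm, Message{Type: MsgApp, Term: 6}) == StepDown)
}
```

Under these assumptions a granted MsgVoteResp from an earlier election never reaches the candidate's handler, because its term is lower than the local term.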
irfansharif
Jul 20, 2017
Contributor
I see. Short of augmenting pb.Message (or repurposing pb.Message.Term, hacky & single purpose for me but might be the only option) I see no solution to this (unless I'm missing something). Say I have a three node cluster (A, B & C) where there is an asymmetric partition between A & C in that C can receive incoming messages but cannot deliver outgoing ones to A. Consider the following:
- A campaigns, sends out a MsgPreVote to C
- C responds with MsgPreVoteResp, doesn't deliver (yet)
- A returns back to follower state (B wins election)
- ...
- Timeout, A campaigns again (later term), sends out MsgPreVote to C
- C responds with another MsgPreVoteResp, doesn't deliver (yet)
- [partition heals]
- A receives two MsgPreVoteResp messages, cannot distinguish which one is in response to which election. (This is because C doesn't change terms when it receives MsgPreVote messages, pb.Message.Term in each MsgPreVoteResp message is simply C's term when it received said message.)
Ignored is the fact that in this specific example with the three node cluster C's term has necessarily increased between the two messages. The point stands: we still have no good way to determine which election cycle each MsgPreVoteResp message is from.
bdarnell
Jul 20, 2017
Member
It's true that there's an ambiguity in pre-votes - because they don't increment the term, two consecutive pre-vote election cycles look identical. But it's OK if there are (rare) false positives here - all that will happen is that we'll continue to a full election, which is slightly disruptive but will guarantee that only an up-to-date node can become the new leader.
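A minimal sketch of this ambiguity, under stated assumptions: the `node` and `preVoteResp` types below are hypothetical, the response is assumed to carry the term being pre-campaigned for (local term + 1), and log up-to-date checks are left out. Because pre-campaigning does not bump the local term, a grant delayed from one cycle is indistinguishable from a grant in the next.

```go
package main

import "fmt"

// preVoteResp is a hypothetical, simplified pre-vote response.
type preVoteResp struct {
	From    uint64
	Term    uint64
	Granted bool
}

// node tracks one node's pre-vote tally.
type node struct {
	id     uint64
	term   uint64 // local term; NOT incremented by pre-campaigning
	quorum int
	votes  map[uint64]bool
}

// preCampaign starts a new pre-vote cycle: the tally is reset and the
// node counts its own vote. The local term stays the same.
func (n *node) preCampaign() {
	n.votes = map[uint64]bool{n.id: true}
}

// handle records a response and reports whether the pre-vote phase has
// reached quorum. Nothing in the response identifies the cycle it
// belongs to, only the term being campaigned for.
func (n *node) handle(r preVoteResp) bool {
	if r.Term != n.term+1 {
		return false // not for the term we are pre-campaigning for
	}
	n.votes[r.From] = r.Granted
	granted := 0
	for _, ok := range n.votes {
		if ok {
			granted++
		}
	}
	return granted >= n.quorum
}

func main() {
	a := &node{id: 1, term: 4, quorum: 2}

	a.preCampaign() // cycle 1
	delayed := preVoteResp{From: 3, Term: 5, Granted: true}

	a.preCampaign() // cycle 1 timed out; cycle 2 starts at the same term

	// The grant from cycle 1 arrives now and is counted in cycle 2.
	fmt.Println(a.handle(delayed)) // prints: true
}
```

The cost of such a false positive is a full election that may not have been necessary, which matches the trade-off described in the comment above.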
irfansharif
Jul 20, 2017
Contributor
To clarify, you're OK with granted MsgPreVoteResp messages from earlier election cycles counting towards the total # votes needed to pass the PreVote phase? (With a comment to this effect of course.)
bdarnell
Jul 20, 2017
Member
Yeah. There are other possibilities for false positives too (for example, a node may grant pre-vote requests to two different nodes in the same term, but only one of them can win the real election).
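The second kind of false positive can be illustrated with a small sketch. The `voter` type here is hypothetical and assumes every candidate's log is up to date: a pre-vote grant records nothing, so two nodes can both be granted one, whereas a real vote is recorded and only the first candidate gets it.

```go
package main

import "fmt"

// none marks "has not voted in this term".
const none uint64 = 0

// voter is a hypothetical, simplified view of one node's vote state
// within a single term.
type voter struct {
	votedFor uint64
}

// grantPreVote grants a pre-vote without recording anything, so any
// number of candidates can receive one in the same term.
func (v *voter) grantPreVote(candidate uint64) bool {
	return true
}

// grantVote grants a real vote to at most one candidate per term.
func (v *voter) grantVote(candidate uint64) bool {
	if v.votedFor != none && v.votedFor != candidate {
		return false
	}
	v.votedFor = candidate
	return true
}

func main() {
	v := &voter{}
	fmt.Println(v.grantPreVote(1), v.grantPreVote(2)) // prints: true true
	fmt.Println(v.grantVote(1), v.grantVote(2))       // prints: true false
}
```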
bdarnell
Jul 20, 2017
Member
When we send the MsgPreVoteResp at the end of r.Step, we use r.Term (which is out of date). This is the problem: the PreVoteResp is dropped because it is out of date when the (pre-)campaigning node receives it. Vote responses should use the term from the message, not the node's local term (these are the same for regular votes, but different for pre-votes).
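The difference between the two choices of term can be sketched as below. This is an illustration only: `Message`, `respond`, and `ignoredBy` are hypothetical simplifications, and the example assumes the pre-vote request carries the term being campaigned for (one above the campaigner's local term) while the responder is behind.

```go
package main

import "fmt"

// MessageType and Message are simplified stand-ins for the pb types.
type MessageType int

const (
	MsgVote MessageType = iota
	MsgVoteResp
	MsgPreVote
	MsgPreVoteResp
)

type Message struct {
	Type     MessageType
	From, To uint64
	Term     uint64
	Reject   bool
}

// respType maps a (pre-)vote request type to its response type.
func respType(t MessageType) MessageType {
	if t == MsgPreVote {
		return MsgPreVoteResp
	}
	return MsgVoteResp
}

// respond builds the reply to a (pre-)vote request. With useRequestTerm
// set, the reply carries the term from the request; otherwise it carries
// the responder's local term.
func respond(localTerm uint64, req Message, useRequestTerm bool) Message {
	term := localTerm
	if useRequestTerm {
		term = req.Term
	}
	return Message{Type: respType(req.Type), From: req.To, To: req.From, Term: term}
}

// ignoredBy reports whether a node at localTerm would discard m as
// out of date.
func ignoredBy(localTerm uint64, m Message) bool {
	return m.Term < localTerm
}

func main() {
	// Node 1 is at term 5 and pre-campaigns for term 6. Node 3 was
	// partitioned away and is still at term 3.
	req := Message{Type: MsgPreVote, From: 1, To: 3, Term: 6}

	withLocalTerm := respond(3, req, false)
	withRequestTerm := respond(3, req, true)

	fmt.Println(ignoredBy(5, withLocalTerm), ignoredBy(5, withRequestTerm)) // prints: true false
}
```

For regular votes the responder has already adopted the request's term by the time it replies, so both choices give the same result; the two only diverge for pre-votes.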
PTAL.
raft/raft.go
    // the message (it ignores all out of date messages).
    // The term in the original message and current local term are the
    // same in the case of regular votes, but different for pre-votes.
    if m.Type == pb.MsgVote || m.Type == pb.MsgPreVote {
bdarnell
Jul 20, 2017
Member
This condition is redundant, we're already in case pb.MsgVote, pb.MsgPreVote:.
bdarnell requested a review from xiang90 Jul 20, 2017
bdarnell added the Raft label Jul 20, 2017
codecov-io
Jul 21, 2017
Codecov Report
No coverage uploaded for pull request base (master@a64d15e).
The diff coverage is 100%.
@@ Coverage Diff @@
## master #8288 +/- ##
=========================================
Coverage ? 76.34%
=========================================
Files ? 345
Lines ? 27074
Branches ? 0
=========================================
Hits ? 20670
Misses ? 4920
Partials ? 1484
| Impacted Files | Coverage Δ | |
|---|---|---|
| raft/raft.go | 92.58% <100%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a64d15e...a92ceee.
LGTM. Thanks for fixing this!
irfansharif commented Jul 20, 2017 (edited Jul 21, 2017)
Fixes #8243.
TestNodeWithSmallerTermCanCompleteElection tests the scenario where a
node that has been partitioned away (and fallen behind) rejoins the
cluster at about the same time the leader node gets partitioned away.
Previously the cluster would come to a standstill when run with PreVote
enabled.
When responding to Msg{Pre,}Vote messages we now include the term from
the message, not the local term. To see why, consider the case where a
single node was previously partitioned away and its local term is now
out of date. If we include the local term (recall that for pre-votes we
don't update the local term), the (pre-)campaigning node on the other
end will proceed to ignore the message (it ignores all out of date
messages).
The term in the original message and the current local term are the same
in the case of regular votes, but different for pre-votes.
NB: Had to change TestRecvMsgVote to include pb.Message.Term when
sending MsgVote messages. The new sanity checks on MsgVoteResp
(m.Term != 0) would panic with the old test, as raft.Term would be equal
to 0 when responding with MsgVoteResp messages.
+cc @bdarnell @xiang90.