-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
raft: introduce/fix TestNodeWithSmallerTermCanCompleteElection #8288
Conversation
raft/raft.go
Outdated
} else { | ||
return nil | ||
case m.Type == pb.MsgPreVote || m.Type == pb.MsgPreVoteResp: | ||
// If we receive a MsgPreVote from a node with a lower term number, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it important that we reject MsgPreVote instead of dropping it? We just drop MsgVote messages with older terms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't, it can be dropped as well. Rejecting it just puts the slower (previously partitioned) node into the candidate follower state (which would happen anyway when the faster node sends a MsgPreVote message with a higher term).
As an aside, why do we just drop MsgVote messages with older terms instead of rejecting them directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does rejecting a message lead to a transition from pre-candidate to candidate?
As for why we drop instead of reject MsgVotes with low terms, I don't think there's any deep reason for it. It's just easier to drop since we drop all other messages with lower terms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mistyped. Meant transition back to follower state.
raft/raft.go
Outdated
// If we receive a MsgPreVote from a node with a lower term number, | ||
// we reject it. For a MsgPreVoteResp we simply pass it to our | ||
// stepFunc. For a stale MsgPreVoteResp (think of late responses, | ||
// we've already transitioned to a leader or follower state) this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if we've transitioned to follower and then back to pre-candidate, there's no other Term check in StepCandidate. We could accept a MsgPreVoteResp for term T as a valid pre-vote in term T+1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's no other Term check in StepCandidate
Isn't this then an existing problem for MsgVoteResp? We could accept granted MsgVoteResps from earlier elections? I see nothing in raft.go: *raft.poll
that does this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because they're dropped by this block (the term check at line 713). Step
handles all messages from other terms, so that the stepState
methods can safely assume that the message is for the current term (with the exception of preVotes, which may have a higher term).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Short of augmenting pb.Message
(or repurposing pb.Message.Term
, hacky & single purpose for me but might be the only option) I see no solution to this (unless I'm missing something). Say I have a three node cluster (A, B & C) where there is an asymmetric partition between A & C in that it C can receive incoming messages but cannot deliver outgoing ones to A. Consider the following:
- A campaigns, send out a MsgPreVote to C
- C responds with MsgPreVoteResp, doesn't deliver (yet)
- A returns back to follower state (B wins election)
... - Timeout, A campaigns again (later term), sends out MsgPreVote to C
- C responds with another MsgPreVoteResp, doesn't deliver (yet)
[partition heals] - A receives two MsgPreVoteResp messagess, cannot distinguish which one is in response to which election. (This is because C doesn't change terms when it receives MsgPreVote messages, pb.Message.Term in each MsgPreVoteResp message is simply C's term when it received said message.)
Ignored is the fact that in this specific example with the three node cluster necessarily C's term has increased between the two messages. Point stands where we still have no good way to determine what election cycle each MsgPreVoteResp message is from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's true that there's an ambiguity in pre-votes - because they don't increment the term, two consecutive pre-vote election cycles look identical. But it's OK if there are (rare) false positives here - all that will happen is that we'll continue to a full election, which is slightly disruptive but will guarantee that only an up-to-date node can become the new leader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, you're OK with granted MsgPreVoteResp messages from earlier election cycles count towards the total # votes needed to pass the PreVote phase? (With a comment to this effect of course.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. There are other possibilities for false positives too (for example, a node may grant pre-vote requests to two different nodes in the same term, but only one of them can win the real election).
When we send the MsgPreVoteResp at the end of r.Step, we use r.Term (which is out of date). This is the problem: the PreVoteResp is dropped because it is out of date when the (pre-)campaigning node receives it. Vote responses should use the term from the message, not the node's local term (these are the same for regular votes, but different for pre-votes). |
PTAL. |
raft/raft.go
Outdated
// the message (it ignores all out of date messages). | ||
// The term in the original message and current local term are the | ||
// same in the case of regular votes, but different for pre-votes. | ||
if m.Type == pb.MsgVote || m.Type == pb.MsgPreVote { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition is redundant, we're already in case pb.MsgVote, pb.MsgPreVote:
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whoops.
9c5bfaf
to
90cbb14
Compare
TestNodeWithSmallerTermCanCompleteElection tests the scenario where a node that has been partitioned away (and fallen behind) rejoins the cluster at about the same time the leader node gets partitioned away. Previously the cluster would come to a standstill when run with PreVote enabled. When responding to Msg{Pre,}Vote messages we now include the term from the message, not the local term. To see why consider the case where a single node was previously partitioned away and it's local term is now of date. If we include the local term (recall that for pre-votes we don't update the local term), the (pre-)campaigning node on the other end will proceed to ignore the message (it ignores all out of date messages). The term in the original message and current local term are the same in the case of regular votes, but different for pre-votes. NB: Had to change TestRecvMsgVote to include pb.Message.Term when sending MsgVote messages. The new sanity checks on MsgVoteResp (m.Term != 0) would panic with the old test as raft.Term would be equal to 0 when responding with MsgVoteResp messages.
Codecov Report
@@ Coverage Diff @@
## master #8288 +/- ##
=========================================
Coverage ? 76.34%
=========================================
Files ? 345
Lines ? 27074
Branches ? 0
=========================================
Hits ? 20670
Misses ? 4920
Partials ? 1484
Continue to review full report at Codecov.
|
LGTM. Thanks for fixing this! |
Fixes #8243.
TestNodeWithSmallerTermCanCompleteElection
tests the scenario where anode that has been partitioned away (and fallen behind) rejoins the
cluster at about the same time the leader node gets partitioned away.
Previously the cluster would come to a standstill when run with PreVote
enabled.
When responding to
Msg{Pre,}Vote
messages we now include the term fromthe message, not the local term. To see why consider the case where a
single node was previously partitioned away and it's local term is now
of date. If we include the local term (recall that for pre-votes we
don't update the local term), the (pre-)campaigning node on the other
end will proceed to ignore the message (it ignores all out of date
messages).
The term in the original message and current local term are the same in
the case of regular votes, but different for pre-votes.
NB: Had to change
TestRecvMsgVote
to includepb.Message.Term
whensending MsgVote messages. The new sanity checks on
MsgVoteResp
(
m.Term != 0
) would panic with the old test asraft.Term
would be equalto 0 when responding with
MsgVoteResp
messages.+cc @bdarnell @xiang90.