raft: advance commit index safely#122690
Conversation
|
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. I was unable to automatically find a reviewer. You can try CCing one of the following members:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
lyang24
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/raft/raft.go line 897 at r1 (raw file):
} // the leader's log is consistent with itself. r.accTerm = r.Term
let me know if this field fits better in the raftLog struct
pkg/raft/raft_test.go line 1158 at r1 (raw file):
sm := newTestRaft(1, 5, 1, storage) // need to simulate the follower is consistent with leader. sm.accTerm = tt.accTerm
need to assign to simulate follower in sync otherwise follower accTerm is 0.
In this case commit index will not moved by heartbeat unless the follow received entry or snapshot
|
Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
lyang24
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained
pkg/raft/rawnode_test.go line 744 at r2 (raw file):
rawNode, err := NewRawNode(cfg) rawNode.raft.accTerm = 1
This passes the tests but i am not sure if this is the right thing to do ideally when a node become leader the accterm will be assigned as the current term of the leader, but it seems like we do not have election here?
pav-kv
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @lyang24)
pkg/raft/raft.go line 897 at r1 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
let me know if this field fits better in the raftLog struct
It does, but including it into raftLog requires a bit larger refactoring. How about a TODO in the accTerm field comment?
pkg/raft/raft_test.go line 1158 at r1 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
need to assign to simulate follower in sync otherwise follower accTerm is 0.
In this case commit index will not moved by heartbeat unless the follow received entry or snapshot
This should not be needed when you make the newRaft func initialize accTerm to the last entry term.
pkg/raft/rawnode_test.go line 744 at r2 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
This passes the tests but i am not sure if this is the right thing to do ideally when a node become leader the accterm will be assigned as the current term of the leader, but it seems like we do not have election here?
This should not be needed when you make the newRaft func initialize accTerm to the last entry term.
lyang24
left a comment
There was a problem hiding this comment.
Thanks for the review, I have addressed the feedbacks please take another look.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @pav-kv)
pkg/raft/raft.go line 897 at r1 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
It does, but including it into
raftLogrequires a bit larger refactoring. How about a TODO in theaccTermfield comment?
added todo on the field
pkg/raft/raft.go line 348 at r2 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
A more elaborate comment would be good, e.g.:
// accTerm is the term of the leader whose append was accepted into the log // last. The log is a prefix of the accTerm's leader log. // // Invariant: raftLog.lastTerm() <= accTerm <= Term // // NB: the log can be partially or fully compacted. When we say "log" above, // we logically include all the entries that were the pre-image of a snapshot, // as well as the entries that are still physically in the log.
Done.
pkg/raft/raft.go line 349 at r2 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
We should initialize
accTermto be last entry term, somewhere innewRaftfunc. If we don't, then a restarted node won't be able to advance commit index immediately, it will need an append message first, to getappTermup-to-date.It would also be good to document the invariant (see above).
Done.
pkg/raft/raft.go line 1715 at r2 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
Rename
leaderTermtoaccTermin this comment? (here and below)
Done.
pav-kv
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained
|
LGTM, but let's have a second review. |
| lastID := r.raftLog.lastEntryID() | ||
| // Invariant: accTerm needs to be initialized as the term of last raft entry | ||
| // to make sure restarted nodes can advance commit index immediately. | ||
| r.accTerm = lastID.term |
There was a problem hiding this comment.
Note for reviewers. There is a case when the commit index won't be able to advance by a heartbeat message right after a restart, specifically when the lastID.term < HardState.Term. This is okay though, because the next accepted MsgApp message from the Term leader (which is guaranteed, because this follower is behind the leader on the log) will bump accTerm to the leader's term, and advance the commit index.
For complete snappiness, we need #122446, i.e. persist accTerm to the HardState - this will make the "log is consistent with the leader" signal survive restarts. We need this accTerm persisted for other reasons too.
| r.logger.Infof("%x [commit: %d] restored snapshot [index: %d, term: %d]", | ||
| r.id, r.raftLog.committed, sindex, sterm) | ||
| // log is now consistent with leader | ||
| r.accTerm = sterm |
There was a problem hiding this comment.
I think this needs to be m.Term, same as in the append case. We're saying here that the state of the log/snapshot is now consistent with the leader.
I suspect , but semantically we should better use sterm == m.Term (because the leader always sends a snapshot including at least one of its entries)m.Term (in case the snapshot is sent by someone else, or is outdated).
There was a problem hiding this comment.
agreed it should be the term of the raft message instead of term on the snapshot.
in case the snapshot is sent by someone else, or is outdated
just curious from reading the raft paper if its outdated or its not sent from a quorum leader the restore will fail right?
There was a problem hiding this comment.
If the snapshot was sent by the leader with an outdated Term (or on behalf of it), the snapshot will be ignored.
However, we can't generally assume that the snapshot (index, term) has the same Term. Say, the leader's Term is 2, and log entries terms are [1 1 1 2 2 2], and the leader has only applied/committed the first 3 entries. Then the snapshot will be for (index=3, term=1) @ Term=2.
nvb
left a comment
There was a problem hiding this comment.
Reviewed 1 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @lyang24 and @pav-kv)
pkg/raft/raft.go line 434 at r3 (raw file):
which is guaranteed, because this follower is behind the leader on the log
Why is the follower guaranteed to be behind the leader on the log?
pkg/raft/raft.go line 349 at r4 (raw file):
lead uint64 // accTerm is the term of the leader whose append was accepted into the log // last. The log is a prefix of the accTerm's leader log.
We should clarify that a rejected MsgApp does not update accTerm.
pkg/raft/raft.go line 1729 at r4 (raw file):
// the log unconditionally, but it can only be done if m.Term > accTerm. if m.Term == r.accTerm { r.raftLog.commitTo(min(m.Commit, r.raftLog.lastIndex()))
@pav-kv This is going to avoid a panic in raftLog.commitTo after a follower's loss of durability which causes its raft log to regress across a restart. That's a good thing. But what is going to happen once the follower doesn't crash? Will the leader ever recognize that the follower's log has lost entries (which from its perspective, it just acknowledged) and go back to re-append those?
pav-kv
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @lyang24 and @nvanbenschoten)
pkg/raft/raft.go line 434 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
which is guaranteed, because this follower is behind the leader on the log
Why is the follower guaranteed to be behind the leader on the log?
This just describes a scenario in which lastID.term < HardState.Term at restart. This means the leader at HardState.Term (or any higher term) hasn't replicated the log to this follower all the way up to the Term entries, so we will soon see an append.
pkg/raft/raft.go line 1729 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
@pav-kv This is going to avoid a panic in
raftLog.commitToafter a follower's loss of durability which causes its raft log to regress across a restart. That's a good thing. But what is going to happen once the follower doesn't crash? Will the leader ever recognize that the follower's log has lost entries (which from its perspective, it just acknowledged) and go back to re-append those?
I think this depends on the leader implementation. At some point it will see a rejected MsgApp at Match index, so it will see an urge to regress Match. Regressing Match means potentially uncommitting entries, so it's not safe. I think it should panic or put the range to a brick state. It may also enter a busy loop in which it does not regress Match, and sends appends post-Match which never bring the follower to an up-to-date state. It will only heal when the leader changes, but there is still risk of uncommitting entries somewhere else (explicitly, or, worse, implicitly).
Maybe there are some cases when regressing Match can be survived. Particularly, if a regressed Match does not cause the commit index regression (because there is still a quorum of Match-es at >= commit index), then maybe we're good. But we need to be extremely cautious here.
All-in-all, loss of durability is Byzantine, we can't make Raft fully tolerant to that, or correctly detect it in all cases.
pav-kv
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @lyang24 and @nvanbenschoten)
pkg/raft/raft.go line 1729 at r4 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
I think this depends on the leader implementation. At some point it will see a rejected MsgApp at
Matchindex, so it will see an urge to regressMatch. RegressingMatchmeans potentially uncommitting entries, so it's not safe. I think it should panic or put the range to a brick state. It may also enter a busy loop in which it does not regressMatch, and sends appends post-Matchwhich never bring the follower to an up-to-date state. It will only heal when the leader changes, but there is still risk of uncommitting entries somewhere else (explicitly, or, worse, implicitly).Maybe there are some cases when regressing
Matchcan be survived. Particularly, if a regressedMatchdoes not cause the commit index regression (because there is still a quorum ofMatch-es at >= commit index), then maybe we're good. But we need to be extremely cautious here.All-in-all, loss of durability is Byzantine, we can't make Raft fully tolerant to that, or correctly detect it in all cases.
Maybe one way to put things: if Match regresses (durability is lost), this should be equivalent to a dead node. If making this node equivalent to dead doesn't cause LoQ, we're good. After making it "dead", we should reconfigure the Range, and maybe somehow re-add this replica with another ID. The fact that it's got most of the log already can probably be utilized to quickly catch it up.
124225: raft: add match index safety check on followers r=pav-kv a=pav-kv This PR adds a safety check which ensures that a follower's log state does not contradict to the leader's `Match` index. If the leader has a non-zero `Match` for the follower, it believes that the follower's log a) is consistent with this leader, and b) has all entries up to the `Match` index persisted. If the follower's `Term <= leader.Term` and (b) turns out to not be the case, this means the follower has lost part of the durable state (for example, fsync didn't work properly in the system, and the node got restarted). This is a safety violation, so this follower panics to avoid spreading the harm. In the future, it would be better to "quarantine" this follower, and surface this information to the operator without crashing the node. Touches #122690 Epic: none Release note: none Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
pav-kv
left a comment
There was a problem hiding this comment.
@lyang24 the blocking PR has merged. Please address Nathan's comments, and it should be safe to merge this one.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @lyang24)
pkg/raft/raft.go line 1729 at r4 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
Added #124225 to retain the panic behaviour if the follower breaks its durability promises.
The PR has merged. It should be safe to make the change now.
Hi I have rebased to master will merge when tests passes and will work on the follow up prs. |
|
bors r+ |
|
🕐 Waiting for PR status (GitHub check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set. |
|
bors cancel |
|
@lyang24 there are some outstanding comments from @nvanbenschoten, could you please address them first? @nvanbenschoten Do you want to take a final pass / LGTM? |
nvb
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @lyang24 and @pav-kv)
pkg/raft/raft.go line 434 at r3 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
This just describes a scenario in which
lastID.term < HardState.Termat restart. This means the leader atHardState.Term(or any higher term) hasn't replicated the log to this follower all the way up to theTermentries, so we will soon see an append.
How does that relate to "make sure restarted nodes can advance commit index immediately"? Is the point that you're making that in the node restart case, lastID.term will be correct, so in that case this allows it to advance its commit index immediately instead of waiting for ... something?
What is that "something" by the way? Are there cases where the restarted follower is caught up enough to not need to accept any new MsgApps?
If the term changed, the follower cannot be caught up on its log because a term change is followed by an empty entry in the log. So that is enough to require a MsgApp which will eventually set accTerm and will eventually inform the follower of the commit index.
What if the term did not change and the leader has not proposed anything new since the follower restarted. Will accTerm ever be set to the leader's current term? Is this also relying on the empty entry in the log to be from the leader's current term, such that if we're in this state, lastID.term must be the leader's current term?
pav-kv
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @lyang24)
pkg/raft/raft.go line 434 at r3 (raw file):
What is that "something" by the way?
A necessary condition to advance commit index is:
a) the entry at this index is committed (so the leader who sent the message has learned about it)
b) our log ends at least at this index
c) our log matches the leader's up to this index
(a) and (b) are obvious. A sufficient condition for (c) is leader.Term == accTerm, meaning our log is entirely a prefix of the leader's. Weaker conditions are possible here, if the follower has more info about the leader's log (e.g. the leader includes a sample of index,term pairs in the request), but let's ignore those.
Are there cases where the restarted follower is caught up enough to not need to accept any new MsgApps?
If we persist accTerm across restarts, we don't need need any new MsgApps. Since we don't yet persist it, what is the safe and the best value we can initialize it to? We need accTerm to be such that our log is a prefix of the accTerm leader. The easy way to guarantee this is to initialize accTerm to lastEntry.term. If we're lucky and lastEntry.term == leader.Term, we initialized to exactly the pre-restart accTerm, otherwise we're at a conservatively low value, so our commit index will start moving only at first MsgApp.
If the term changed, the follower cannot be caught up on its log because a term change is followed by an empty entry in the log. So that is enough to require a MsgApp which will eventually set
accTermand will eventually inform the follower of the commit index.
Yes.
What if the term did not change and the leader has not proposed anything new since the follower restarted. Will
accTermever be set to the leader's current term? Is this also relying on the empty entry in the log to be from the leader's current term, such that if we're in this state,lastID.termmust be the leader's current term?
Today, this does rely on the dummy entry (more broadly, on the fact that a leader will append at least one entry). But generally, this relies on the fact that a leader pings follower until it knows that the follower's accTerm == Term (as if there was an invisible entry at the start of the leader's term).
pav-kv
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @lyang24)
pkg/raft/raft.go line 434 at r3 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
What is that "something" by the way?
A necessary condition to advance commit index is:
a) the entry at this index is committed (so the leader who sent the message has learned about it)
b) our log ends at least at this index
c) our log matches the leader's up to this index(a) and (b) are obvious. A sufficient condition for (c) is
leader.Term == accTerm, meaning our log is entirely a prefix of the leader's. Weaker conditions are possible here, if the follower has more info about the leader's log (e.g. the leader includes a sample of index,term pairs in the request), but let's ignore those.Are there cases where the restarted follower is caught up enough to not need to accept any new MsgApps?
If we persist
accTermacross restarts, we don't need need any new MsgApps. Since we don't yet persist it, what is the safe and the best value we can initialize it to? We needaccTermto be such that our log is a prefix of theaccTermleader. The easy way to guarantee this is to initializeaccTermtolastEntry.term. If we're lucky andlastEntry.term == leader.Term, we initialized to exactly the pre-restartaccTerm, otherwise we're at a conservatively low value, so our commit index will start moving only at first MsgApp.If the term changed, the follower cannot be caught up on its log because a term change is followed by an empty entry in the log. So that is enough to require a MsgApp which will eventually set
accTermand will eventually inform the follower of the commit index.Yes.
What if the term did not change and the leader has not proposed anything new since the follower restarted. Will
accTermever be set to the leader's current term? Is this also relying on the empty entry in the log to be from the leader's current term, such that if we're in this state,lastID.termmust be the leader's current term?Today, this does rely on the dummy entry (more broadly, on the fact that a leader will append at least one entry). But generally, this relies on the fact that a leader pings follower until it knows that the follower's
accTerm == Term(as if there was an invisible entry at the start of the leader's term).
So "make sure restarted nodes can advance commit index immediately" is not entirely accurate here. Without accTerm persistence, it is best effort (but, crucially, still correct).
nvb
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @lyang24 and @pav-kv)
pkg/raft/raft.go line 434 at r3 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
So "make sure restarted nodes can advance commit index immediately" is not entirely accurate here. Without
accTermpersistence, it is best effort (but, crucially, still correct).
Thanks for explaining, this makes sense. I think what I'm asking for then is more words in this comment about why we bother setting r.accTerm = lastID.term on restart, what cases that helps with, and what cases it does not.
I'd encourage more commentary around the call to r.raftLog.commitTo as well. Specifically, it would be useful to reason through why dropping a heartbeat's commit index contribution in either of the following two cases still ensures that the follower eventually learns of the latest commit index.
- the case where
m.Term != r.accTerm - the case where
r.raftLog.lastIndex() < m.Commit
794d435 to
f3b4906
Compare
pav-kv
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/raft/raft.go line 434 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Thanks for explaining, this makes sense. I think what I'm asking for then is more words in this comment about why we bother setting
r.accTerm = lastID.termon restart, what cases that helps with, and what cases it does not.I'd encourage more commentary around the call to
r.raftLog.commitToas well. Specifically, it would be useful to reason through why dropping a heartbeat's commit index contribution in either of the following two cases still ensures that the follower eventually learns of the latest commit index.
- the case where
m.Term != r.accTerm- the case where
r.raftLog.lastIndex() < m.Commit
Done
pkg/raft/raft.go line 349 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We should clarify that a rejected MsgApp does not update
accTerm.
Done
nvb
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r6, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @lyang24 and @pav-kv)
pkg/raft/raft.go line 452 at r6 (raw file):
// higher accTerm (ideally, matching the current leader Term) gives us more // information about the log, and then allows bumping its commit index sooner // than the next MsgApp arrives.
"sooner than when the ..."?
pkg/raft/raft.go line 1760 at r6 (raw file):
The m.Term leader indicates to us that
This can be interpreted a few different ways. Consider "The m.Term leader is indicating to us through this heartbeat message that ..."
This change makes the commit index advancement in handleHeartbeat safe. Previously, a follower would attempt to update the commit index to whichever was sent in the MsgHeartbeat message. Out-of-bound indices would crash the node. It is always safe to advance a commit index if the follower's log is "in sync" with the leader, i.e. when its log is guaranteed to be a prefix of the leader's log. This is always true if the term of last entry in the log matches the leader's term, otherwise this guarantee is established when the first MsgApp append message from the leader succeeds. At the moment, the leader will never send a commit index that exceeds the follower's log size. However, this may change in future. This change is a defence-in-depth. The newly added accTerm field will be used for other safety checks in the future, for example to establish that overriding a suffix of entries in raftLog is safe. Epic: None Release note: None
lyang24
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pav-kv)
pkg/raft/raft.go line 452 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"sooner than when the ..."?
Done
pkg/raft/raft.go line 1760 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The m.Term leader indicates to us that
This can be interpreted a few different ways. Consider "The m.Term leader is indicating to us through this heartbeat message that ..."
done
|
bors r+ |
This change makes the commit index advancement in handleHeartbeat safe. Previously, a follower would attempt to update the commit index to whichever was sent in the MsgHeartbeat message. Out-of-bound indices would crash the node.
It is always safe to advance a commit index if the follower's log is "in sync" with the leader, i.e. when its log is guaranteed to be a prefix of the leader's log. This is always true if the term of last entry in the log matches the leader's term, otherwise this guarantee is established when the first MsgApp append message from the leader succeeds.
At the moment, the leader will never send a commit index that exceeds the follower's log size. However, this may change in future. This change is a defence-in-depth.
The newly added accTerm field will be used for other safety checks in the future, for example to establish that overriding a suffix of entries in raftLog is safe.
Part of #122100