raft: use last accepted term for log writes#127424
Conversation
df876b2 to
b26e6a2
Compare
8deb73e to
4b57f5a
Compare
4b57f5a to
71946ca
Compare
| m.Index = last.index | ||
| m.LogTerm = last.term | ||
| m.Index = rd.Entries[ln-1].Index | ||
| m.LogTerm = r.raftLog.accTerm() |
There was a problem hiding this comment.
@nvanbenschoten Could use m.Term field instead of m.LogTerm, if we don't want to overload the meaning of m.LogTerm. In MsgApp messages, m.LogTerm is the "entry term", while in this protocol it carries the "last accepted / leader term".
Maybe using m.Term for this would be more semantically consistent - m.Term is usually the leader sending a MsgApp, and MsgStorageAppend is essentially MsgApp forwarded to the storage on behalf of the latest leader with whom our log is consistent.
WDYT?
|
@sumeerbhola The logic in |
| // the time that the MsgStorageAppend was sent to the time that the response | ||
| // is received, indicating that no new leader has overwritten the log. | ||
| // | ||
| // However, this replaces a correctness problem with a liveness problem. If we |
There was a problem hiding this comment.
We would still have this liveness problem if accTerm could be bumped without having new entries. But today, accTerm is either bumped on a snapshot (when we clear the unstable log), or on appending a non-empty slice of entries. In the former case, we don't need the last accepted term logic because there are no pending unstable entries. In the latter case, we send messages because len(Entries) > 0.
nvb
left a comment
There was a problem hiding this comment.
Reviewed 10 of 10 files at r1, 4 of 4 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)
pkg/raft/raft.go line 1045 at r1 (raw file):
case m.Term == 0: // local message case m.Type == pb.MsgStorageAppendResp:
Should we stop setting Term on MsgStorageAppendResp messages?
EDIT: next commit 😄
pkg/raft/rawnode.go line 312 at r1 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
@nvanbenschoten Could use
m.Termfield instead ofm.LogTerm, if we don't want to overload the meaning ofm.LogTerm. InMsgAppmessages,m.LogTermis the "entry term", while in this protocol it carries the "last accepted / leader term".Maybe using
m.Termfor this would be more semantically consistent -m.Termis usually the leader sending aMsgApp, andMsgStorageAppendis essentiallyMsgAppforwarded to the storage on behalf of the latest leader with whom our log is consistent.WDYT?
Agreed on switching to m.Term for the reason you mentioned — LogTerm is a reference to the term of a specific log entry, while Term is a reference to the context of the log append.
pav-kv
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)
pkg/raft/rawnode.go line 312 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Agreed on switching to
m.Termfor the reason you mentioned —LogTermis a reference to the term of a specific log entry, whileTermis a reference to the context of the log append.
I realized we need accTerm both in MsgStorageAppend and Resp, for RAC. But MsgStorageAppend already uses the Term field to carry HardState.Term.
A couple of workarounds:
- Use
LogTerm, but it overrides its meaning, as we discussed. - Only use
TerminMsgStorageAppendResp, and parse it out fromMsgStorageAppend.Responses. This is also a bit inconsistent:Termis used for different things in the message and its reply.
Both approaches have a similar flaw, but maybe (1) is a bit more consistent, at least in scope of the message/reply pair; and easier to use. How about sticking to it for now?
Longer-term, the MsgStorageAppend/Resp could be replaced by a structured Ready-like API. I previously played with this idea here, but there is no urgency to pursue this now.
sumeerbhola
left a comment
There was a problem hiding this comment.
to avoid the need of searching through the message
Responses).
can you point me to that code? I don't quite understand.
Regarding outdated admissions, we are tracking individual entries for admission, and we can easily change waitingForAdmissionState to wipeout stale entries in waitingForAdmissionState.add. So we have multiple options regarding how to ignore outdated admissions. We could definitely also do something like you outlined here.
I believe we partially discussed in https://cockroachlabs.slack.com/archives/C06UFBJ743F/p1720029427647799?thread_ts=1720023953.470729&cid=C06UFBJ743F
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
b07ee8c to
a1b903a
Compare
pav-kv
left a comment
There was a problem hiding this comment.
can you point me to that code?
We don't have this code yet. The 2 messages are: MsgStorageAppend and MsgStorageAppendResp. To implement the no-op for admissions, we need to extract the leader Term from MsgStorageAppend (so that we can attach it to the admission metadata, and, when it comes back admitted, we can compare it with the one tracked in waitingForAdmissionState).
We could extract this term by scanning MsgStorageAppend.Responses field and finding the MsgStorageAppendResp (which contains the term). Or we could attach the term to the MsgStorageAppend directly to avoid this unergonomic scan. Added the last commit to this extent.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
nvb
left a comment
There was a problem hiding this comment.
Reviewed 2 of 7 files at r4, 4 of 4 files at r5, 1 of 1 files at r6, 3 of 3 files at r7, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @pav-kv)
pkg/raft/rawnode.go line 224 at r7 (raw file):
if ln := len(rd.Entries); ln != 0 { m.LogTerm = r.raftLog.accTerm() m.Index = rd.Entries[ln-1].Index
nit: assign these fields in the same order that you do down below.
This commit modifies the log write protocol to take advantage of the last accepted term tracked in raftLog/unstable. Epic: none Release note: none
Epic: none Release note: none
Epic: none Release note: none
Epic: none Release note: none
a1b903a to
4beb96b
Compare
pav-kv
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @pav-kv)
pkg/raft/rawnode.go line 224 at r7 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: assign these fields in the same order that you do down below.
done
|
bors r=nvanbenschoten |
This PR modifies the log write protocol to take advantage of the last accepted term tracked in
raftLog/unstable.Log entry writes are ordered by the last accepted term (
unstable.term). Entries in the log can be overwritten only if this term is bumped (when we accept a log write from a higher term leader). Ifunstable.termis unchanged when we receive an acknowledgement from log storage, then the acknowledged entries don't have any overwrites in the write queue, and it is safe to remove them fromunstable.Part of #124440