New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tla-plus: add spec for raft leader election #123189
base: master
Are you sure you want to change the base?
tla-plus: add spec for raft leader election #123189
Conversation
This commit updates the ParallelCommit spec with a .cfg config file which mirrors its Toolbox .launch file. This allows the spec to be checked with the command-line version of the TLC model checker. As a result, this allows the spec to play well with the VSCode TLA+ extension, which is more popular and easier to use than the original eclipse-based TLA+ Toolbox. Epic: None Relase note: None
Experience reportBackgroundWe started working with TLA+ and formal model checking at Cockroach Labs back in 2019. @tbg and @petermattis kicked this off with some background conversations and then this email thread. We had Hillel Wayne come and host a 1-week intensive workshop on TLA+ and PlusCal (an imperative language which transpiles to TLA+), much of which was derived from his excellent Practical TLA+: Planning Driven Development book. That was critical and near-necessary for adoption of TLA+ at CRL, because the available documentation on the topic online is severely lacking (unless you enjoy reading 100-page white papers full of mathematical notation). By the end of the workshop, we had a couple of working TLA+ specs, an understanding of how formal methods might contribute to the development of CockroachDB in the future, and desires for it to play a role. We haven't done much with TLA+ since 2021, but that desire is still there. Since we're starting to work with the Raft algorithm directly again and have desires to extend it to support leader leases (design in progress and not ready for broader consumption quite yet), it made sense to start thinking about formal verification again. ExperienceTo get reacquainted with TLA+ and PlusCal, I started by skimming the Practical TLA+ book again. It's really a great resource, especially compared to everything else written on the topic, so I would suggest that those who want to work with TLA+ familiarize themselves with the book (so that you don't need to read Specifying Systems). The other resources I found useful were:
In the past, we had worked with the Eclipse-based TLA+ Toolkit. I decided to avoid that this time, as it's fairly clunky and hides configuration behind a GUI. Instead, I opted for the VSCode TLA+ extension, which is slim but works well and is built directly on top of the official TLA+ command line tools. I would recommend this to others who want to work with TLA+/PlusCal. ResultsAs detailed above, my goal was to write a spec to verify the correctness (specifically the lease disjointness invariant safety property) of our proposed leader-lease extension to Raft. To do so, I started by writing this simpler spec of the existing Raft leader election algorithm. The process took around 10 hours. That is to be expected, as there is a large barrier to entry to writing these specifications which authors should be prepared for. Don't get discouraged though. The initial investment may be more than one might expect, but the ease of extension once a spec is written is equally as surprising in the other direction. The other thing to be aware of is that the process of writing the spec is half the battle. While TLA+ does support exhaustive model checking with TLC and it works just as well as promised, I have found that the process of writing the spec is the most valuable part of the process. Doing so forces one to sit and think about an algorithm, often revealing bugs before the model checker even gets to run. It's a really valuable learning experience which doesn't have many direct equivalents. With that all said, here is what I learned while writing this spec (again, without starting the extension to add leader-leases): PreVote bugI believe that there is a bug in the existing PreVote algorithm in For example, I think the following order of operations is allowed:
The consequences for this bug are not overly severe, because a pre-vote election is a best-effort attempt to prevent a disruptive leader election. Bugs in pre-vote do not compromise safety (i.e. leader exclusivity). Still, we should fix this. Leader-lease overlapping quorumsIn the leader-lease design, we rely on overlapping quorums between followers that support a leader and followers that are calling an election to ensure that if a leader has support from a quorum of followers, it will not be replaced until after that support is withdrawn. Up until now, I had been thinking that the overlapping quorums to consider were the pre-vote election and the normal election. We would then build safety off of this pre-vote election never succeeding if a leader still had support from the followers that had elected it in the previous term. Writing this spec revealed a flaw in this approach. Immediately a leader is elected, its followers who elected it do not know that it won the election. As a result, we can't have them refuse to vote for anyone else at a future term or we would lose liveness. Followers should only start "protecting" the leader once they know that it has won the election. This reveals the flaw in the previous assumption that the pre-vote and vote quorums are the ones we care about. Instead, we care about the vote quorum and some post-election "heartbeat" quorum. A leader needs to reach out to each follower and confirm that it knows about the successful election and has not voted in a new election before it can be sure it has that follower's support. This is interesting for a few reasons:
Lead and LeadEpoch require persistenceThe initial leader-lease design doc mentioned that a new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly questions to make up for my gaps in understanding of TLA+ and Raft.
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @miraradeva, @nvanbenschoten, and @pav-kv)
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 15 at r2 (raw file):
(* RaftLeader is a specification for the Raft leader election protocol, as *) (* described in section 3.4 of the Raft thesis[^1]. The spec models multiple *) (* terms, node restarts, and the Pre-Vote election phase extension. *)
was this needed since https://github.com/ongardie/raft.tla/blob/master/raft.tla omits pre-vote?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 32 at r2 (raw file):
term |-> 0, vote |-> 0, lead |-> 0,
basic question: is this lead field in the thesis, and what does it correspond to there? The persistent state is only currentTerm, votedFor (which we keep in HardState) and the log, and I don't see it in the volatile state either. I am guessing this is to define the invariant below.
From a modeling perspective, it isn't clear yet which of these fields are persistent. I'm going to assume this will get clarified later. Edit: I see the restart()
macro now.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 38 at r2 (raw file):
]]; \* In-flight messages, by destination. network = [i \in Nodes |-> <<>>];
basic modeling clarification: this is a sequence, since it represents the order of delivery to this node, yes?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 56 at r2 (raw file):
RaftTerms == 0..Terms RaftVotes == Nodes \union {0} RaftLead == Nodes \union {0}
presumably a 0 is no leader?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 66 at r2 (raw file):
RaftMsgInvariant == /\ \A i \in Nodes : \A j \in 1..(Len(network[i])-1) :
why the -1? Since sequences are 1 indexed, I was expecting the last index is Len(network[i])
. I suppose there is something special about the last message that will get clarified later.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 73 at r2 (raw file):
TypeInvariants == /\ RaftStateInvariant /\ RaftMsgInvariant
PlusCal question: is this needed, given all the invariants need to hold and we've already defined both individually?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 83 at r2 (raw file):
raft[i].term = raft[j].term /\ raft[i].lead /= 0 /\ raft[j].lead /= 0 => raft[i].lead = raft[j].lead end define;
so no liveness properties yet?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 91 at r2 (raw file):
macro become_follower(term, lead) begin if raft[self].term /= term then
should this be raft[self].term < term
. Or presumably this is an accurate model of what the spec says since the caller has already verified that raft[self].term <= term
.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 106 at r2 (raw file):
raft[self].state := StatePreCandidate || raft[self].lead := 0 || raft[self].voteResps := {self};
so does voteResps
model both PreVoteResp
and VoteResp
? Is there a way/need to distinguish between the two?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 224 at r2 (raw file):
\****************************\ fair process node \in Nodes
PlusCal question: would we need strong fairness if we wanted to add any liveness properties?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 239 at r2 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
Do we need to allow out-of-order messages here? E.g. by making
network[self]
be a set from which we get a random message.
I had a similar question. Wondering whether it was done this way since there is non-determinism across processes and within processes that are adding to the sequence, to make the sequence sufficiently non-deterministic. It seems it would not allow for a stray late message to show up, like in the PreVote bug
example (though in that case it is possible n1 never received another message between steps 3 and 8, since n2 did not need n1's vote to become leader, so perhaps it will get tickled despite this ordering).
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 242 at r2 (raw file):
if msg.term > raft[self].term then if msg.type \in {MsgVote, MsgPreVote} /\ raft[self].lead /= 0 then
if raft[self].lead = 0
then it seems we would want to do the same become_follower(msg.term, 0)
, but we don't do this here and instead do it in the later else. The structure looks a bit odd. Is this because it is following the spec listed at the end of the thesis. Section 3.4 is a lot of words, so there could be many translations.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 248 at r2 (raw file):
\* either \* NON-DETERMINISM! */ \* \* Reject vote or pre-vote if recently heard from leader.
It can reject a (non-pre) vote for a higher term if it has recently heard from the leader?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 262 at r2 (raw file):
if msg.type = MsgApp then become_follower(msg.term, msg.from); else
Would MsgAppResp
fall through to here? Presumably that can't happen with a higher term?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 273 at r2 (raw file):
end if; else \* TODO(nvanbenschoten): is not having this check a bug?
couldn't it get a stray old pre-vote response at any time?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 281 at r2 (raw file):
RecvMessage: if msg.type \in {MsgVote, MsgPreVote} then \* We can vote if this is a repeat of a vote we've already cast...
Looks like for MsgVote
and MsgPreVote
we can only be here if msg.term >= raft[self].term
.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 306 at r2 (raw file):
\* end if; \* or \* \* The candidate's log is not up-to-date.
so this will require modeling the log, which is not done yet?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 325 at r2 (raw file):
elsif msg.type \in {MsgVoteResp, MsgPreVoteResp} then if ~msg.reject /\ ((raft[self].state = StateCandidate /\ msg.type = MsgVoteResp) \/ (raft[self].state = StatePreCandidate /\ msg.type = MsgPreVoteResp)) then
so it stays in states StateCandidate
or StatePreCandidate
on a rejection? Which means CanCampaign
will not be true? But if someone else (pre)campaigns we will execute become_follower
, and can potentially campaign again.
just curious if this is intentional?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 348 at r2 (raw file):
elsif msg.type = MsgApp then if msg.term < raft[self].term then \* TODO: will this violate leases?
can you elaborate on this TODO -- I don't see the relationship with leases?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 357 at r2 (raw file):
if ~msg.reject /\ raft[self].state = StateLeader then raft[self].appResps := raft[self].appResps \union {msg.from} || raft[self].appended := (raft[self].appResps \union {msg.from}) \in RaftQuorum;
does appended
really mean committed
?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 359 at r2 (raw file):
raft[self].appended := (raft[self].appResps \union {msg.from}) \in RaftQuorum; if raft[self].appended then Appended:
what is the purpose of this Appended
label?
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 372 at r2 (raw file):
end algorithm; *) \* BEGIN TRANSLATION (chksum(pcal) = "6779ecf7" /\ chksum(tla) = "6dcdeccf") VARIABLES raft, network, pc
Is the following the TLA+ translation of PlusCal, so can ignore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @miraradeva, @nvanbenschoten, and @pav-kv)
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 248 at r2 (raw file):
Previously, sumeerbhola wrote…
It can reject a (non-pre) vote for a higher term if it has recently heard from the leader?
Never mind, I see the following in raft.go
// If a server receives a RequestVote request within the minimum election timeout
// of hearing from a current leader, it does not update its term or grant its vote
This commit adds a new TLA+ specification (expressed in PlusCal) called RaftLeader. RaftLeader is a specification for the Raft leader election protocol, as described in section 3.4 of the Raft thesis[^1]. The spec models multiple terms, node restarts, and the Pre-Vote election phase extension. The central safety property of leader election is exclusivity. In Raft, this is defined as the Election Safety Property: "at most one leader can be elected in a given term". This property is defined as an invariant of the spec. My goal is to use this spec to verify the correctness of leader-leases. We can start with this simpler spec and then extend it to include notions of store liveness and of leases associated with leaders. From there, we can define the lease disjointness invariant and prove that it holds. [^1]: https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf Epic: None Release note: None
2c11ba1
to
dd79a9a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dismissed @sumeerbhola from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @miraradeva, @pav-kv, and @sumeerbhola)
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 15 at r2 (raw file):
Previously, sumeerbhola wrote…
was this needed since https://github.com/ongardie/raft.tla/blob/master/raft.tla omits pre-vote?
It was useful (not needed) for all of the following reasons:
- raft.tla is all of raft, so it's a larger spec than necessary.
- raft.tla does not have any invariants. It's a spec, but there's nothing to check. https://github.com/Vanlightly/raft-tlaplus/blob/main/specifications/standard-raft/Raft.tla improves on this.
- raft.tla omits pre-vote
- I want to extend this spec to include leader-leases, and would prefer to do so in PlusCal. That's both because PlusCal is easier to write, it's easier to read, and it's easier to make it more closely mirror our specific etcd/raft implementation.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 32 at r2 (raw file):
is this lead field in the thesis, and what does it correspond to there?
You're right, it's not explicitly in the raft thesis/paper. It is implied though, just like a state
(Leader, Follower, Candidate) field is implied. And both of these are in our implementation.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 38 at r2 (raw file):
Previously, sumeerbhola wrote…
basic modeling clarification: this is a sequence, since it represents the order of delivery to this node, yes?
Yes, see Pavel's question above. We can do out-of-order delivery, and we do in #123523, but it makes model checking exponentially slower. It's an easy switch once we're ready for it.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 56 at r2 (raw file):
Previously, sumeerbhola wrote…
presumably a 0 is no leader?
Yes, 0 is no leader. This is used to define vote
and lead
as optional.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 66 at r2 (raw file):
Previously, sumeerbhola wrote…
why the -1? Since sequences are 1 indexed, I was expecting the last index is
Len(network[i])
. I suppose there is something special about the last message that will get clarified later.
Just a bug. Fixed while adding out-of-order delivery.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 73 at r2 (raw file):
Previously, sumeerbhola wrote…
PlusCal question: is this needed, given all the invariants need to hold and we've already defined both individually?
In PlusCal / TLA+, you need to then tell the spec which invariants to check at each step. So this packages these two invariants under a single operator which we then plug into the .cfg file.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 83 at r2 (raw file):
Previously, sumeerbhola wrote…
so no liveness properties yet?
Not yet. I'm not sure that we will add any. I've found that they're harder to express in TLA+, often requiring a spec to be stripped down further first. Raft also doesn't have a guarantee of liveness (leader elections can continue to hang) and relies on randomness in the election timeouts to get close.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 91 at r2 (raw file):
Previously, sumeerbhola wrote…
should this be
raft[self].term < term
. Or presumably this is an accurate model of what the spec says since the caller has already verified thatraft[self].term <= term
.
This is just trying to mirror the etcd/raft code 1:1.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 106 at r2 (raw file):
Previously, sumeerbhola wrote…
so does
voteResps
model bothPreVoteResp
andVoteResp
? Is there a way/need to distinguish between the two?
That's correct. We don't need to distinguish between the two, we can tell the difference based on state. This is also how etcd/raft works.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 224 at r2 (raw file):
Previously, sumeerbhola wrote…
PlusCal question: would we need strong fairness if we wanted to add any liveness properties?
I think even that would not be enough to guarantee that a leader election eventually succeeds. But I don't understand this part of TLA+ to say for sure.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 242 at r2 (raw file):
Previously, sumeerbhola wrote…
if
raft[self].lead = 0
then it seems we would want to do the samebecome_follower(msg.term, 0)
, but we don't do this here and instead do it in the later else. The structure looks a bit odd. Is this because it is following the spec listed at the end of the thesis. Section 3.4 is a lot of words, so there could be many translations.
This structure is a bit odd. It was needed to avoid adding another label when the non-determinism below is enabled. It should be equivalent to the etcd/raft logic.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 262 at r2 (raw file):
Previously, sumeerbhola wrote…
Would
MsgAppResp
fall through to here? Presumably that can't happen with a higher term?
Yes, it would fall through. And it can actually happen, see
Line 1055 in b4ad399
// We have received messages from a leader at a lower term. It is possible |
This is a case we will need to handle with leader leases using a graceful stepdown.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 273 at r2 (raw file):
Previously, sumeerbhola wrote…
couldn't it get a stray old pre-vote response at any time?
This bug has been captured in #123297. The key to this bug is that we send PreVote requests at Term+1, so stray old pre-votes may have m.Term=r.Term
(and hit this branch) and not m.Term < r.Term
and hit the discard branch above.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 281 at r2 (raw file):
Previously, sumeerbhola wrote…
Looks like for
MsgVote
andMsgPreVote
we can only be here ifmsg.term >= raft[self].term
.
Right, this is here to allow a replay of the same vote for the same term. It's also a direct translation from etcd/raft and probably more than we strictly need for this spec.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 306 at r2 (raw file):
Previously, sumeerbhola wrote…
so this will require modeling the log, which is not done yet?
Either modeling it or treating it as a form of non-determinism. If we aren't adding log-related invariants to this spec (to do so, we'd need a larger spec, which may make it unfeasible to model check), then non-determinism allows us to explore the larger state space more easily.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 325 at r2 (raw file):
Previously, sumeerbhola wrote…
so it stays in states
StateCandidate
orStatePreCandidate
on a rejection? Which meansCanCampaign
will not be true? But if someone else (pre)campaigns we will executebecome_follower
, and can potentially campaign again.
just curious if this is intentional?
Correct, it won't re-campaign if it loses the election. We could expand the spec to detect when an election is definitely lost and re-campaign.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 348 at r2 (raw file):
Previously, sumeerbhola wrote…
can you elaborate on this TODO -- I don't see the relationship with leases?
This was related to the discussion above about this code. Currently, if a candidate moves to a later term and then eventually hears from the leader, it will inform the leader about this and encourage it to step down. The rationale is that this will allow the quorum to move to the later term and "pick up" this replica who was otherwise stranded at a large term. This is mostly avoided by pre-vote, but is still possible with it.
We will need to handle this with leases, but we'll need to do so without invalidating any future-time reads that the leaseholder has already served.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 357 at r2 (raw file):
Previously, sumeerbhola wrote…
does
appended
really meancommitted
?
Yes, renamed.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 359 at r2 (raw file):
Previously, sumeerbhola wrote…
what is the purpose of this
Appended
label?
Just for debugging purposes, so I can see it in the state space coverage explorer.
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 372 at r2 (raw file):
Previously, sumeerbhola wrote…
Is the following the TLA+ translation of PlusCal, so can ignore?
Yes, it's the translation. Can be ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @miraradeva, @nvanbenschoten, @pav-kv, and @sumeerbhola)
docs/tla-plus/LeaderLeases/RaftLeader.tla
line 241 at r4 (raw file):
toSend = <<>>; msg = [type |-> 0]; begin TickLoop:
Wouldn't this be better represented as nodes "responding" to messages they receive rather than sitting in a TickLoop
? It seems like from a modeling perspective this doesn't add any additional value and it makes it unclear whether the TickLoop is really necessary. There are only two "triggers" here. Either CanCampaign
or ReceiveMsg
. I'm not very familiar with TLA+ so maybe there is some value in having this done this way. This is related to my comment in the doc as well.
Feel free to ignore this if this is the best way to represent this.
123827: tla-plus: add .cfg file for ParallelCommits spec r=nvanbenschoten a=nvanbenschoten NOTE: this is just extracting the first commit from #123189 out into a separate PR so that we can get it merged. This commit updates the ParallelCommit spec with a .cfg config file which mirrors its Toolbox .launch file. This allows the spec to be checked with the command-line version of the TLC model checker. As a result, this allows the spec to play well with the VSCode TLA+ extension, which is more popular and easier to use than the original eclipse-based TLA+ Toolbox. Epic: None Relase note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
This commit adds a new TLA+ specification (expressed in PlusCal) called
RaftLeader
.RaftLeader
is a specification for the Raft leader election protocol, as described in section 3.4 of the Raft thesis1. The spec models multiple terms, node restarts, and the Pre-Vote election phase extension.The central safety property of leader election is exclusivity. In Raft, this is defined as the Election Safety Property: "at most one leader can be elected in a given term". This property is defined as an invariant of the spec.
My goal is to use this spec to verify the correctness of leader-leases. We can start with this simpler spec and then extend it to include notions of store liveness and of leases associated with leaders. From there, we can define the lease disjointness invariant and prove that it holds.
Epic: None
Release note: None
Footnotes
https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf ↩