# Learning TLA+ by Examples

Richard Tang

January 28, 2025

# Contents

| Ι  | Intro              | oduction               | 4         |
|----|--------------------|------------------------|-----------|
| 1  | Motiva             | ation                  | 5         |
|    | 1.1 Ca             | atching Problems Early | 5         |
|    |                    | he Generalized Problem | 6         |
|    |                    | LA+                    | 7         |
|    |                    | bout This Book         | 7         |
|    |                    | ow to Use This Book    | 8         |
| 2  | TLA+               | Primer                 | 9         |
|    | 2.1 De             | esign Intent           | 9         |
|    |                    | equirement             | 10        |
|    | 2.3 Sr             | Dec                    | 10        |
|    | -                  | afety                  | 11        |
|    |                    | veness                 | 11        |
|    | 2.6 M              | odel Checker           | 12        |
|    |                    |                        |           |
| II | Exa                | mples with TLA+        | 14        |
| 3  | Blinki             | ng LED                 | <b>15</b> |
|    | 3.1 Re             | equirement             | 15        |
|    |                    | •                      | 15        |
|    | -                  | afety                  | 17        |
|    |                    | veness                 | 17        |
|    | 3.5 M              | odel Checking          | 17        |
|    |                    | imitation              | 18        |
| 4  | Simple             | e Gossip Protocol      | 19        |
|    |                    |                        |           |
|    | 4.1 Re             | equirement             | 19        |
|    |                    | equirement             | 19<br>20  |
|    | 4.2 S <sub>I</sub> |                        |           |

Contents 2

| 5 | Sim  | ple Scheduler                          | 23              |
|---|------|----------------------------------------|-----------------|
|   | 5.1  | Requirement                            | 23              |
|   | 5.2  | Spec                                   | 23              |
|   | 5.3  | Safety                                 | 25              |
|   | 5.4  | Liveness                               | 25              |
| 6 | Solo | ective Retransmit                      | 27              |
| U | 6.1  |                                        | 21<br>29        |
|   | 6.2  |                                        | $\frac{23}{30}$ |
|   | 6.3  | 1                                      | 34              |
|   | 0.5  |                                        | 34              |
|   | 6.4  |                                        | 34              |
|   | -    |                                        | -               |
|   | 6.5  | Liveness                               | 34              |
| 7 | Raf  |                                        | 35              |
|   | 7.1  | Requirement                            | 35              |
|   | 7.2  | Spec                                   | 36              |
|   | 7.3  | Model Reduction                        | 40              |
|   |      | 7.3.1 Modeling Messages as a Set       | 40              |
|   |      | 7.3.2 Limit Term Divergence            | 40              |
|   |      |                                        | 41              |
|   |      |                                        | 42              |
|   |      | 7.3.5 Prune Messages with Stale Terms  | 42              |
|   |      | 7.3.6 Enable Symmetry                  | 43              |
|   | 7.4  |                                        | 43              |
|   | 7.5  | Liveness                               | 43              |
|   |      |                                        |                 |
| Π | I I  | Examples with PlusCal                  | 45              |
| 8 | SPS  | SC Lockless Queue                      | 47              |
| Ü | 8.1  | -                                      | 47              |
|   | 8.2  |                                        | 48              |
|   | 8.3  | 1                                      | 49              |
|   | 8.4  |                                        | 49              |
|   | 8.5  |                                        | 50              |
|   | 0.0  | Comiguration                           | 50              |
| 9 |      | ······································ | 51              |
|   | 9.1  | 8                                      | 51              |
|   | 9.2  | 1                                      | 52              |
|   | 9.3  |                                        | 54              |
|   | 94   | Liveness                               | 55              |

| Contents  | 3 |
|-----------|---|
| Contonios |   |

| ΙV        | Reference                | <b>56</b> |
|-----------|--------------------------|-----------|
| 10        | Data Structure           | 57        |
|           | 10.1 Set                 | 57        |
|           | 10.2 Tuple               | 57        |
| 11        | Fairness and Liveness    | 59        |
|           | 11.1 Liveness            | 59        |
|           | 11.2 Weak Fairness       | 60        |
|           | 11.3 Strong Fairness     | 61        |
| 12        | General Tips             | 63        |
|           | 12.1 Model Checker Debug | 63        |
|           | 12.1.1 Dead Lock         | 63        |
|           | 12.1.2 Live Lock         | 63        |
|           | 12.2 Model Refinement    | 64        |
| 13        | Abstraction Guideline    | 65        |
| <b>14</b> | Reference                | 66        |

# Part I Introduction

# Motivation

### 1.1 Catching Problems Early

Years ago, I worked on a propietary low power processor in an embedded system. The processor ran microcode featuring a custom instruction set. To enter a low power state, a set of (possibly hundreds) instructions were executed. These instructions progressively puts the system in lower power state. For example: turn off IP A, then turn off IP B, then turn off the power island to the IPs. To save cost and power, the low power processor had very limited debuggability support.

An experienced reader may start to notice some redflags.

If the microcode attempts to access the memory interface when the power island has been shut off, processor would hang. Since the power island has been shut off, the physical hardware debug port is also unavailable, leaving the developer with *no way* of live debugging related problem. At this point the developer needs to siphon through (possibly hundreds) of instructions to catch invariant violation *manually*.

As one can imagine, maintaining the microcode was very expensive. Fortunately, the propietary low power processor only had a handful of instructions, I created an emulator for this propietary processor to verify the microcode prior to deploying it on target. The emulator models the processor states as a state graph, with executed instruction transitions the state machine to the next state. At every state all the invariants are evaluated to ensure none have been violated. Some of the invariants included:

- Accessing memory interface after power off leads to a hang
- Accessing certain register in certain chip revision leads to a hang
- Verify IPs are shut off in the allowed order

The verification algorithm was implemented using a depth-first search algorithm, providing 100% microcode coverage before deploy on target.

To generalize, we can model arbitrary system as a set of states with a collection of invariant that must be upheld at all times. The complexity of the such arbitrarily system generally grows quadratically as the number of states grow linearly (eg. in a N state system, adding state N+1 may introduce N transitions into the new state). There are many engineering problems that exhibit a large number of states, such as lockless or waitfree data structure, distributed algorithms, OS scheduler, and more. As the number of states grow, the problem becomes more mentally taxing for designer to reason about.

How does a designer reason about the correctness of a proposed solution?

### 1.2 The Generalized Problem

Fast forward to now: I stumbled across TLA+, a formalized solution of what I was looking for.

The Turing Award winner Leslie Lamport invented the TLA+ 1999, but TLA+ didn't appear to have caught on until the 2010's. My personal opinion is TLA+ was invented ahead of its time, and the problem complexity finally caught up in the past decade or so, to allow TLA+ to visibly demonstrate its strength.

We are at a point in the technology curve where vertical scaling is no longer practical, with CPU speed plateau'd in the past decade or so. The industry is exploring horizontal scaling solution, such as hardware vendor focusing adding more CPU cores, or software vendors buying many low end hardware instead of a few high end hardware. This shifts the technology complexity from hardware to software, demanding software solution to maximize concurrent hardware resource utilization. One slight problem:

Humans are good at high level reasoning, but not so good at reasoning about many things happening at the same time.

The average cognitive load of of a person is between 5 to 9 items. It is hard to enumerate all possible scenario in one's mind to ensure the design accommodates all the edge cases.

Consider a low power processor execution hundreds of instruction, how does the designer ensure the invariants are upheld after every instruction?

Consider a distributed system. The system is a cluster of independently operating entities and need to somehow collectively offer the correct system be-

haviour, while any one of the machines may receive instructions out of order, crash, recover, etc.

Consider a single producer multiple consumer lockless queue. The consumers may reserve an index in the queue in certain order, but may release them in different order. What if one reader is really slow, and another reader is super fast and possibly lapse the slow reader?

Consider an OS scheduler with locks. Assume all the processes have the same priority. Can a process starve the other processes by repeatedly acquire and release the lock? How do we ensure scheduling is fair?

The anti-pattern is to keep bandaiding the design until user stops filing bug reports. This is never ideal. Per Murphy's law, anything that can go wrong will go wrong, and a hard to reproduce bug will come in at the most inconvinient time. How do we make sure the solution is correct by design? To solve this problem, we must rely on tools to do the reasoning for us.

### $1.3 \quad TLA +$

TLA+ is a system specification language, with the intent to describe the system with implementation details removed. TLA+ allows designer to describe the system as a sequence of states. The designer can expresses transition condition from one state to another, describe invariants that must hold true in every state and liveness properties that the overall system should converge to. The key innovation of TLA+ is once the system is modeled as a finite state machine, the states can be exhaustively explored (via breath-first-search) to ensure certain properties are held through out the entire state space (either per state or a sequence of states).

### 1.4 About This Book

This book was initially a set of notes I took as I was learning TLA+. To my surprise, there's not as much material on TLA+ as I had assumed for such a critical tool in the designer toolbox. I decided to formalize the notes into this short book, which I hope the readers find this helpful in their journey to master TLA+.

The intent for the book is to teach reader how to write TLA+ spec for their design to provide confidence in *design correctness*. The content to this book is appropriate for software designer, hardware designer, system architect, and such.

As for the readers: general computing science knowledge is required. One doesn't need to be expert at a particular language to understand this book; TLA+ is effectively its own language. This book is example driven and will go through designs such as lockless queue, simple task scheduler, consensus algorithm, etc. Reader will likely enjoy a deeper insight if she has some familiarity with these topics.

### 1.5 How to Use This Book

This book was designed to be used as a reference, providing examples and references using TLA+.

Examples are split into two categories: A set of examples written using TLA+, and another set of examples written using PlusCal, the C-like syntax that transpiles down to TLA+. I believe they are useful for different use cases. The differences will be highlighted in their respective sections. All examples will follow a similar layout, covering the expected design process (eg. requirement, spec, safety and liveness properties).

All examples in this book will be described using TLA+ mathemetical notation. Converting between Mathmetical and ASCII notation is assumed straightforward due to the one-to-one mapping. Readers are encouraged to consult Table 8 in [1] as needed.

Finally, there will be part that language reference portion that that discuss a few topics deserving extra attention. The intent is to be using this section of the book as a *reference*.

# TLA+ Primer

### 2.1 Design Intent

The key insight into TLA+ is modelling a system as a state machine. A simple digital clock can be represented by two variables, hour and minute and the number of possible states in a digital clock is 24\*60=1440. For example, 10:01 is the next state 10:00 can transition to. Extrapolating further, Asssume an arbitrarily system described by N variables, each variable having K possible values such arbitrary system can have up to  $N^K$  state.

For every specification, designer can specify safety proerty (or invariants) that must be true in every states. For example, in any state of the digital clock hour must be between 0 to 23, or formally described as  $hour \in 0..23$ . Similarly, minute must have value between 0 to 59, or  $minute \in 0..59$ . Examples invariants of a system include: Only one thread has exclusive access to a critical region, all variables in the system are within allowable value, resource allocation manager never allocates more than available resources.

Designer can also specify *liveness* property. These are properties to be satisfied by a *sequence of state*. One liveness property for the digital clock could be when the clock is 10:00, it will eventually become 11:00 (10:00 leads to 11:00). Example liveness property include: a distributed system eventually converges, the scheduler eventually schedules every tasks in the task queue, the resource allocation manager fairly allocates resources.

A TLA+ Spec can be checked by TLC, the model checker. TLC uses breath-first search algorithm to explore all states in the state machine and ensure safety and liveness properties are upheld.

A TLA+ Spec describes the system using temporal logic. The syntax may appear unfamiliar if one hasn't seen it before, but like any other programming

language an initiated reader should become familiarized quickly. In this book I will use

# 2.2 Requirement

In this example, we will specify a *digital clock*. The digital clock has a few simple requirements:

- Two variables to represent state: hour and minute
- The clock increment one minute at a time
- The clock wraps around at midnight (ie. 23:59 transitions to 00:00)

# 2.3 Spec

The *Init* state of such system can be described as:

```
Init \stackrel{\triangle}{=} \\ \wedge hour = 0 \\ \wedge minute = 0
```

 $\stackrel{\triangle}{=}$  is the defines equal symbol and  $\wedge$  is the logical and symbol. The above TLA+ syntax can be read as Init state is defined as both hour and minute are both 0.

The spec also always include a *Next* definition, an *action formula* describing how the system transition from one state to another. Action formula contains *primed* variables what happens to the variable in its next state. The *Next* action for the digital clock can be defined as:

```
NextHour \triangleq \\ \land minute = 59 \\ \land hour' = (hour + 1)\%24 \\ \land minute' = 0 \\ NextMinute \triangleq \\ \land minute \neq 59 \\ \land hour' = hour \\ \land minute' = minute + 1 \\ Next \triangleq \\ \lor NextMinute \\ \lor NextHour
```

Here's a breakdown of what the spec does:

• Next can take NextMinute or NextHour

- Next takes NextMinute when minute is not 59, next hour is hour, next minute is minute + 1.
- Next takes NextHour when minute is 59, next hour is (hour + 1) modulus 24, next minute set to 0

Technically it's possible for *Next* to take both *NextMinute* and *NextHour*. This is not possible in this definition as *NextHour* and *NextMinute* are defined in a *mutually exlusively* fashion.

Finally, the spec itself is formally defined as:

```
vars \triangleq \langle hour, minute \rangle
Spec \triangleq \\ \land Init \\ \land \Box [Next]_{vars}
```

 $\Box[Next]_{vars}$  deserves some special attention:

- vars is defined to be all variables in the spec. Different combination of these variables constitute the states of the system (eg. 23:59 and 00:00 are both states in the system).
- $\square[Next]_{vars}$  is a box-action formula, where Next is an action and vars is a state function.
- operator asserts the formula is always true for every step in the behaviour.
- And steps in the behaviour is defined as  $[Next]_{vars}$ , where Next describe the action and vars capturing all variables representing the state.

# 2.4 Safety

Safety property describes invariant that must hold true in every state of system. A common invariant is *type safety* checks. In a digital clock, hour can only be in value between 0 to 23, and minute can only be value of 0 to 59:

```
Type\_OK \triangleq \\ \land hour \in 0 ... 23 \\ \land minute \in 0 ... 59
```

### 2.5 Liveness

Liveness property verifies certain behavioural across a sequence of state. One liveness property can be confirming the clock wraps around correctly at midnight (which involves multiple states):

```
Liveness \triangleq \land hour = 23 \land minute = 59 <math>\rightsquigarrow hour = 0 \land minute = 0
```

→ is the *leads to* operator, suggesting something is eventually true. TLA+ provides a set of formulas that can be used to describe liveness property.

To verify liveness, we need to modify the spec slightly to enable *fairness* to prevent *stuttering*. In plain terms, fairness ensure *something* always happen in every step, allowing the states to transition. Without fairness the spec is allowed to *do nothing* as next step, this means liveness condition may fail because the spec permits the system to do nothing in perpetuity as next state. Fairness will be covered in more detailed in later chapter.

```
Spec \triangleq \\ \wedge Init \\ \wedge \Box [Next]_{vars} \\ \wedge WF_{vars}(Next)
```

 $WF_{vars}(Next)$  is the fairness qualifier.

### 2.6 Model Checker

The TLA+ spec can be verified using TLC model checker. The TLC model checker runs the spec and verifies all configured safety and liveness properties are satisfied during execution. To run TLC, we need two things:

- clock.tla the spec itself
- $\bullet$  clock.cfg the corresponding configuration file

For reference, clock.tla spec is listed below:

```
EXTENDS Naturals

VARIABLES hour, minute

vars \triangleq \langle hour, minute \rangle

Type\_OK \triangleq

\land hour \in 0 ... 23

\land minute \in 0 ... 59

Liveness \triangleq

\land hour = 23 \land minute = 59 \rightsquigarrow hour = 0 \land minute = 0

Init \triangleq

\land hour = 0

\land minute = 0

NextMinute \triangleq

\land minute = 59

\land hour' = (hour + 1)\%24
```

The corresponding clock.cfg is listed below:

```
SPECIFICATION Spec
INVARIANTS Type_OK
PROPERTIES Liveness
```

Now run TLC and one should see something like this:

```
Model checking completed. No error has been found.
...
The depth of the complete state graph search is 1440.
```

# 

# Blinking LED

Let's start with a trivial specification of a blinking LED. The intent of this example is to demonstrate the core functionalities of TLA+ specification language.

TODO: briefly talk about tla+ and model checker here.

# 3.1 Requirement

The LED is represented by a boolean variable that can be either 0 or 1.

... that's it.

# 3.2 Spec

The specification language may appear alienating as it is mathematically motivated based on propositional logic. Despite the (possibly) daunting syntax, designer only need to be familiar with a handful of key operators to start realizing value using TLA+. This chapter will attempt to describe the example in exhaustive detail to reduce the learning curve.

The following describe the core portion of the blinking LED spec.

```
VARIABLES b

vars \triangleq \langle b \rangle

Init \triangleq

\wedge b = 0

On \triangleq

\wedge b' = 1

Off \triangleq

\wedge b = 1

\wedge b' = 0
```

```
 \begin{array}{c} Next \; \stackrel{\triangle}{=} \\ \; \; \vee \; Off \\ \; \; \vee \; On \\ Spec \; \stackrel{\triangle}{=} \\ \; \wedge \; Init \\ \; \wedge \; \Box [Next]_{vars} \end{array}
```

- $\stackrel{\triangle}{=}$  is the defines equal operator
- $\wedge$  and  $\vee$  are the AND and OR operator. The effect of these operator follow the natural definition in English:
  - -C  $\triangleq$   $A \land B$ : C is true iff A and B are true -C  $\triangleq$   $A \lor B$ : C is true iff A or B is true
- The ' operator represents the next state. b' represent b's next state.
- VARIABLES keyword defines a list of variables for the spec. In this case the spec defines a variable b which can be either 0 or 1
- vars is typically defined as a shorthand to refer to all variables in the spec.

With the above definition, we can revisit the Action definitions: *Init* defines the initial system state, where b is set to 0.

Next requires more elaboration. TLA+ specifies the system as a collection of states with transitions between them. In a simplified sense, the state is described as a collection of ANDs (eg. system is in state C if both A and B are true), the ORs then describe the states the system can possibly be in (eg. system can be in state C OR D). Revisiting the example, the blinking LED has two states:

- $On \triangleq b = 0 \land b' = 1$ : b switches on
- Off  $\stackrel{\triangle}{=} = 1 \wedge b' = 0$ : b switches off

The system's Next state is defined to be one of these states: Next  $\triangleq$  On  $\vee$  Off.

 $\Box[Next]_{vars}$  is a **Box-Action Formula**, where *Next* is an action and *vars* is a state function. The formula is true iff every successive pair of steps in behaviour is a  $[Next]_{vars}$ . Finally *Spec* is conjunction between *Init* and  $\Box[Next]_{vars}$ . Note all TLA+ specification follows very similar template. There are situation we will need to provide *fairness* description - this will be covered later.

In short: this specification describes a two-state state machine where b toggles between 0 and 1.

Note that b can technically be *anything*. b can be  $0,\,1,\,-42,\,a$  dinosaur, etc. TLA+ specifies values of b which are valid in the system.

### 3.3 Safety

The spec so far only defines the possible states - but the *power* of TLA+ lies in its *properties* description. Safety properties are invariants that must hold true in *every* state. An invariant in the blinking LED example is:

```
TypeOK \stackrel{\Delta}{=} b \in \{0, 1\}
```

This states the only valid value of b is 0 or 1. If b is ever set to anything else, the spec is invalid.

Some example safety properties include: Only a single thread have exclusive access to critical section, number of concurrent reads cannot exceed data available to be read, etc.

### 3.4 Liveness

While safety properties describe invariant that must be upheld in every state, *Liveness* describe properties of a sequence of states. In the blinking LED example, a liveness property can be the if b is 0, it eventually becomes 1, and vice versa. This is described below:

```
Liveness \stackrel{\triangle}{=} \\ \wedge b = 0 \leadsto b = 1 \\ \wedge b = 1 \leadsto b = 0
```

It is the author's opinion liveness describes the *design essense* behind the spec. The key characteristic of a system is described by its *behaviour* across a series of states. Does a distribute algorithm eventually converge to a working state? Does a resource manager fairly allocate resources in all scenarios? Does a scheduler ensure all tasks are eventually scheduled? These are behaviours that are *cannot* be concluded by looking at a single state, but across a *sequence of state*. Liveness allows designer to express and verify these properties.

# 3.5 Model Checking

Since the blinking LED is trivially specified, the full specification is included below. For subsequent chapters only snippet will be included. Please refer to the accompanied material for full spec source.

— MODULE blinking

```
TODO: install toolchain
TODO: commandline
TODO: using TLC
The following is the content of blinking.tla:
```

```
EXTENDS Naturals Variables b vars \triangleq \langle b \rangle TypeOK \triangleq \land b \in \{0, 1\}
```

```
\begin{array}{l} Liveness \ \stackrel{\triangle}{=} \\ \qquad \land b = 0 \leadsto b = 1 \\ \qquad \land b = 1 \leadsto b = 0 \\ Init \ \stackrel{\triangle}{=} \\ \qquad \land b = 0 \\ Next \ \stackrel{\triangle}{=} \\ \qquad \lor \land b = 0 \\ \qquad \land b' = 1 \\ \qquad \lor \land b = 1 \\ \qquad \land b' = 0 \\ Spec \ \stackrel{\triangle}{=} \\ \qquad \land Init \\ \land \ \Box [Next]_{vars} \\ \land \ WF_{vars}(Next) \end{array}
```

The following is the content of *blinking.cfg*:

```
SPECIFICATION Spec
INVARIANTS TypeOK
PROPERTIES Liveness
```

# 3.6 Limitation

Since TLA+ exhaustively explores all possible state, a linear growth of variables leads to TLC (temporal logic checker) execution time grows *exponentially*. This means the specification must be scoped correctly to limit the state space.

Similarly, if you want to verify concurrent psuedo code implementation in PlusCal, you can likely at most verify 10s of lines of code.

# Simple Gossip Protocol

This section the author's notes on a simple gossip protocol by Andrew Hewler: https://ahelwer.ca/post/2023-11-01-tla-finite-monotonic/

# 4.1 Requirement

In a distributed system, a cluster of nodes collectively provide a serivce. A distributed database may have a collection of 10s to 100s of nodes working together to offer the service in a geo diverse fashion to be immnue to partial outage. The nodes often have requirements to know about each other. In the context of distributed database, a node may need to know the key range another of its peers. The cluster needs a way to communicate this information. One such mechanism is the gossip protocol.

Gossip protocols are used to communicate cluster information in a distributed fashion, (unsurprisingly) in a distributed system. Without gossip protocol, nodes in a cluster learns about its neighbours by contacting a centralized server. This introduces a single failure point in the system. As the name suggest, gossip protocol relies on nodes to gossip with each other. The nodes in the cluster periodically selects a set of neighbors to exchange what it knows about the cluster. The recency information is part of the gossip message itself, allowing the node and the peer its talking to quickly decide who has the latest information on a node, and converge to it. Assume a N node cluster and each interal a node selects k neighbours to gossip with, the total amount of gossip propagation time is described logrithmticly below:

$$propagation\_time = \log_k N * gossip\_interval$$
 (4.1)

With the total number of messages exchanged:

$$messages\_exchanged = \log_k N * k$$
 (4.2)

Now let's look at how a simple gossip protocol can be described by TLA+.

# 4.2 Spec

In gossip protocol, every node needs to remember all its peers current version. This can be represented as a two dimension array:

```
Init \stackrel{\Delta}{=} counter = [n \in Node \mapsto [o \in Node \mapsto 0]]
```

This defines counter a collection of nodes, where each node also contains a collection of nodes initialized to 0. The nodes can bump to a new version:

```
Increment(n) \triangleq counter' = [counter \ \ \text{Except} \ ![n][n] = @+1]
```

Notice increment only bumps node's version. This change needs to be gossiped across the cluster:

```
Gossip(n, o) \triangleq \\ \text{LET } Max(a, b) \triangleq \text{ IF } a > b \text{ THEN } a \text{ ELSE } b \\ \text{IN } counter' = [\\ counter \text{ EXCEPT } ![o] = [\\ nn \in Node \mapsto \\ Max(counter[n][nn], \ counter[o][nn]) \\ ]
```

A few things to unpack here:

- n, o are the two nodes exchanging gossip. o is the node to be updated and n is the neighbor o gossips with.
- LET..IN allows local definition under LET used under IN. In this case Max is a local macro defined to return maximum between a and b.
- counter' (or referred to as counter *prime*) is what the variable will be in the next state. TLA+ doesn't provide a way to update a variable in a collection, so the convention is to assign a new array to the variable.
- $counter\ EXCEPT![o] = [...]$  return counter with counter[o] defined in the bracket.
- where [...] is a collection of nodes with with counter set to the max between the current node and neighbour.

Finally, the actual spec:

```
Next \stackrel{\triangle}{=} \lor \exists n \in Node : Increment(n)
```

```
\vee \exists n, o \in Node : Gossip(n, o)
```

Next supports two possible next steps describe using disjunctions. The first is bumping the version of a random node, the second is select a pair of nodes to gossip. Note the *existential qualifier* on both, which basically states there exists a node n in nodes, or there exists a pair of nodes n, o in nodes, respectively.

There's a minor problem with the definition above. Gossip protocol, like many converging protocols, have a *monotonic increasing* requirement. On failures, the protocol bumps the version, which increases monotonically. Since TLA+ spec models the system as a graph, a monotonic increasing version number means the state graph is *infinitely large*. To put the specification back into finite space, we can normalize the state:

```
GarbageCollect \triangleq \\ \text{LET } SetMin(s) \triangleq \text{CHOOSE } e \in s : \forall \ o \in s : e \leq o\text{IN} \\ \text{LET } Transpose \triangleq SetMin(\{counter[n][o] : n, \ o \in Node\})\text{IN} \\ \wedge counter' = [\\ n \in Node \mapsto [\\ o \in Node \mapsto counter[n][o] - Transpose \\ ]\\ \cap \land \text{UNCHANGED } converge
```

Garbage Collect substracts every version value with set minimum. To limit range of version value, the increment function is now updated to:

```
 Increment(n) \triangleq \\  \land \neg converge \\  \land counter[n][n] < Divergence \\  \land S!Increment(n) \\  \land \text{UNCHANGED} \ converge
```

Finally, the *Next* is updated to the follow:

```
Next \triangleq 
 \lor \exists n \in Node : Increment(n)
 \lor \exists n, o \in Node : Gossip(n, o)
 \lor Converge
 \lor GarbageCollect
```

Note *Garbage Collect* is a now part of possible state transition. We will discuss *Converge* later, as it is related to liveness check. Lastly:

```
Fairness \stackrel{\triangle}{=} \forall n, o \in Node : WF_{vars}(Gossip(n, o))
Spec \stackrel{\triangle}{=} \land Init
```

```
 \land \Box [Next]_{vars} \\ \land Fairness
```

The Fairness definition ensures Gossip runs between every pair of nodes gossip.

# 4.3 Safety

In every state, counter[n][o] next must be larger than counter[n][o] current:

### 4.4 Liveness

For liveness we want to check the version value across all nodes eventually converge. *Next* is updated to set Converge to true, which triggers the liveness condition and ensure all pair of nodes eventually have the same information.

```
\begin{array}{ll} Convergence & \triangleq \forall \, n, \, o \in Node : counter[n] = counter[o] \\ Liveness & \triangleq converge \leadsto S! \, Convergence \\ Converge & \triangleq \\ & \land \, converge' = \texttt{TRUE} \\ & \land \, \texttt{UNCHANGED} \, \, counter \\ Next & \triangleq \\ & \lor \, \exists \, n \in Node : Increment(n) \\ & \lor \, \exists \, n, \, o \in Node : Gossip(n, \, o) \\ & \lor \, Converge \\ & \lor \, GarbageCollect \end{array}
```

# Simple Scheduler

### 5.1 Requirement

In this section we will define a spec for a simple task scheduler. The task sechdeuler has the following requirements:

- Supporting N execution context (ie. CPUs)
- Supporting T number of tasks
- Tasks have identical priority and are scheduled coopertively
- System has a single global lock
- Any task can attempt to acquire the lock, Any task attempting to acquire the lock are gauranteed to be scheduled.
- If multiple tasks attempt to grab the lock, the tasks will be scheduled in lock request order.

# 5.2 Spec

We will model scheduler using the following variables:

```
 \begin{array}{l} Init \ \stackrel{\triangle}{=} \\ \  \  \, \land \  cpus = [i \in 0 \ldots N-1 \mapsto ``"] \\ \  \  \, \land \  ready\_q = S2T(Tasks) \\ \  \  \, \land \  blocked\_q = \langle \rangle \\ \  \  \, \land \  lock\_owner = \  ``" \end{array}
```

A few things to note:

- The system has N executing context, represented as number of CPUs. When a task is running, cpus[k] is set to taskName. When CPU is idle, cpus[k] is set to an empty string.
- $ready\_q$  and  $blocked\_q$  are initialized as  $ordered\ tuple$ , due to the cooperative scheduling requirement.
- $\bullet$  S2T is a macro that converts a set into a ordered tuple. This is to accommodate the fact it appears I cannot define tuple in .cfg file.
- Finally, the single system lock is represented as *lock\_owner*.

A task can be in three possible state: Ready, Blocked and Running. The *Next* box-action fomula will define a Ready and Running action, and the implementation will include related lock contention handling.

```
- Module scheduler -
MoveToReady(k) \triangleq
     \land cpus[k] \neq ""
     \land lock\_owner \neq cpus[k]
     \land ready\_q' = Append(ready\_q, cpus[k])
     \wedge cpus' = [cpus \ EXCEPT \ ![k] = ""]
     \land UNCHANGED \langle lock\_owner, blocked\_q \rangle
Lock(k) \triangleq
       lock is empty
       \lor \land cpus[k] \neq ````
           \land lock\_owner = ""
           \wedge lock\_owner' = cpus[k]
           \land UNCHANGED \langle ready\_q, cpus, blocked\_q \rangle
        someone else has the lock
          \land cpus[k] \neq ```
           \land \ lock\_owner \neq ""
           \land lock\_owner \neq cpus[k] cannot double lock
           \land blocked\_q' = Append(blocked\_q, cpus[k])
           \land cpus' = [cpus \ \text{except} \ ![k] = ""]
           \land UNCHANGED \langle ready\_q, lock\_owner \rangle
Unlock(k) \triangleq
     \land \ cpus[k] \neq ""
     \land lock\_owner = cpus[k]
     \land \mathit{lock\_owner'} = ````
     \land cpus' = [cpus \ \text{EXCEPT} \ ![k] = ""]
     \wedge ready\_q' = ready\_q \circ blocked\_q \circ \langle cpus[k] \rangle
     \land blocked\_q' = \langle \rangle
Running \triangleq
    \exists k \in \text{DOMAIN } cpus:
         \land cpus[k] \neq "
```

```
 \land \lor MoveToReady(k) \\ \lor Lock(k) \\ \lor Unlock(k)
```

# 5.3 Safety

### 5.4 Liveness

I believe this is the most important part of cooperative scheduler design. While the scheduler can't *force* a task to relinquish a lock (the scheduler doesn't dictate when the task is *done*), the scheduler can ensure scheduling fairness by scheduling the next lock requester intsead of the task that just relinquished the lock.

```
Liveness \triangleq \\ \forall \ t \in Tasks: \\ \text{LET} \\ b \triangleq \{x \in \text{DOMAIN } blocked\_q: blocked\_q[x] = t\} \\ \text{IN} \\ \land b \neq \{\} \leadsto b = \{\}
```

The formula defines set b to be either an empty set or a set of one task. Assume a set of  $\{"p0", "p1", "p2\}$ . Possible value of b include:  $\{\}, \{"p0"\}, \{"p1"\}$  and  $\{"p2"\}$ . The fomula then states an non empty set of b leads to an empty set of b. In other words:

If a task ever becomes blocked, it will eventually become unblocked.

However, when we actually run the model checker, we will find the liveness property is *violated*. The failure scenario is basically one task holding onto the lock in one CPU, while the scheduler repeatedly schedule/deschedule a separate task in another CPU. While this is perfectly allowed, the model checker detects a possible path for the the system to trap in a local state and fail the liveness property.

Perhaps not surprisingly, if you construct similar liveness property to verify a task is *eventually* always scheduled, it will also fail. The model checker will provide a counter case where a task is never scheduled because another task is repeatedly acquire/release the global lock.

```
\begin{array}{l} L \triangleq \\ \forall \, t \in \mathit{Tasks}: \\ \forall \, n \in 0 \ldots (N-1): \\ \mathrm{WF}_{\mathit{vars}}(\mathit{HoldingLock}(t) \wedge \mathit{Unlock}(n)) \\ \mathit{Spec} \triangleq \\ \wedge \mathit{Init} \\ \wedge \square[\mathit{Next}]_{\mathit{vars}} \\ \wedge \mathrm{WF}_{\mathit{vars}}(\mathit{Next}) \\ \wedge L \end{array}
```

Fairness ensures that we are never stuck in a repeated state.

# Selective Retransmit

Assume client device that playsback a video stream. Structurelly, a video is composed of frames, frames are then segmented into packets to stream across a network. The client device recombines the packets into a frame, then sequence the frame to playback the video.

However, network is not-deterministic. Depending on the route the packets take to get to the client, they may arrive out-of-order. The client may need to maintain a receive buffer for the packets, re-order the packets back into sequence before pushing the packets down to decoding engine.

The network may also drop packets if any of the switches along the way gets busy. In the case of a packet drop, the client has a few options. The client can either discard the frame and let the decoding engine downstream to deal with it (which may ressult in visible artifacts during playback). The client can request the whole frame to be re-sent, which results in additional bandwidth consumption. The client can selectively request the missing packet to be retransmited, which will minimize additional bandwidth consumption, but increases implementation complexity.

In this chapter, we will implement a simple selective retransmit algorithm.



Since packets may arrive out-of-order, the server stamps the packets with sequence number to allow the client to order the packets as they arrive. Once the client has a set of ordered packets, it moves the packets from the receive buffer into the decoding engine to be displayed.

The video packets are often sent via unreliable channel to minimize network overhead and latency. The client sends acknowledgement back to the server to acknowledge the received packet. This indicates to the server it can send more video data to the client. Acknowledgements are not latency sensitive in nature, and take up a very small proportion of bandwidth, so they are transported through reliable channel.

The following illustrates packet reorder handling:



The following illustrates packet loss handling:



There are other design considerations. The server is allowed to send up to W packets before getting an acknowledgement, this reduces latency perceived by the user. The client also doesn't need to acknowledge all the packets, since the server assumes once an acknowledgement of packet N is received, then all packets prior to N have also been received.

# 6.1 Requirement

With the above description, we are now ready to provide a more formal description for our requirement:

- Client is the receiver that displays the video stream
- Server is the sender that sends the video stream
- Server always sends the packets in-order
- Client may receive the packets out-of-order

- Client may never receive some packet due to loss
- Server can send up to W packets before an acknowledgement is received
- Per packet sequence number is represented by fixed number of bytes in the network header, the sequence number will eventually wrap around once it hits the maximum representable value. This upper limit will be represented by the constant N
- Client will remove the packets from the receive buffer once the sequence number of the received packets are contiguous. The client will also send an acknowledgement back to the Server with the most recent sequence number
- Server to Client data packets are assumed unreliable due to bulk bandwidth
- Client to Server control packets are assumed to be reliable

We are now ready to implement the Spec.

# 6.2 Spec

The following is the skeleton of the Spec:

```
Init \stackrel{\triangle}{=}
     \land network = \{\}
     \land server\_tx = 0
     \land server\_tx\_limit = W
     \land server\_tx\_ack = 0
     \wedge client_rx = 0
     \land client\_buffer = \{\}
     \wedge lost = 0
Next \triangleq
     \vee Send
     \vee \exists p \in network :
      Receive(p)
     \lor ClientRetransmitRequest
     \lor ClientAcknowledgement
     \vee \exists p \in network :
      \wedge p.dst = "client"
      \wedge Drop(p)
```

The server is represented by three variables:

• tx+1 represents the sequence number to be used in the next packet

- tx\_liimit represents the highest sequence number server can send without waiting for an acknowledgement
- tx\_ack represents the most recent acknowledged sequence number

The client is represented by two variables:

- client\_rx is the most recently acknowledged sequence number
- client\_buffer is the receive buffer holding all the packets waiting to be re-ordered prior to being acknowleged

The allowed actions include packet *Receive*, which the existential quantifier also has the side effect of re-ordering. *ClientRetransmitRequest* detects and sends retransmit request. *ClientAcknowledgement* sends acknowledgement. Finally, data packets may be dropped.

Before we start defining the actions, let us define some helper functions:

```
MinS(s) \triangleq
Choose x \in s : \forall y \in s : x \leq y
MaxS(s) \triangleq
CHOOSE x \in s : \forall y \in s : x \geq y
MaxIndex \triangleq
     LET
          upper \stackrel{\Delta}{=} \{x \in client\_buffer : x > N - W\}
          lower \triangleq \{x \in client\_buffer : x < W\}
          maxv \stackrel{\triangle}{=} \text{IF } upper \neq \{\} \land lower \neq \{\}
           THEN
                MaxS(lower)
           ELSE
                MaxS(client\_buffer)
     IN
          maxv
MinIndex \triangleq
     LET
          upper \stackrel{\triangle}{=} \{x \in client\_buffer : x > N - W\}
          lower \triangleq \{x \in client\_buffer : x < W\}
          minv \triangleq \text{IF } upper \neq \{\} \land lower \neq \{\}
           THEN
               MinS(upper)
           ELSE
                MinS(client\_buffer)
     IN
          minv
```

```
egin{aligned} Range & \triangleq \\ & \text{IF } \textit{MaxIndex} \geq \textit{MinIndex} \\ & \text{THEN} \\ & \textit{MaxIndex} - \textit{MinIndex} + 1 \\ & \text{ELSE} \\ & \textit{MaxIndex} + 1 + N - \textit{MinIndex} \end{aligned}
```

At any moment the system allows a window of packets to be unacknowledged. Both the client and server are aware of the window size, represented by W. By looking at packets in its receive buffer and its most acknowledged sequence number, the client can determine which packets were lost. There's actually some nuisance to implement this.

Since the system does not allow more than W unacknowledged packets, the client can assume the window of packet in its receiver buffer must have sequence number  $s \in client\_rx..client\_rx + W$ . Since the sequence number has a ceiling, the window of packets may wrap around the boundary. This introduces some complication around determining the minimum and maximum in the window of packet. The functions defined above calcluates the range, maximum and minimum value in the window accounting for wraparound.

Now we can look at how the client acknowledgement logic:

```
\begin{aligned} & \text{MergeReady} \triangleq \\ & \land client\_buffer \neq \{\} \\ & \land (client\_rx + 1)\%N = MinIndex \\ & \land Range = Cardinality(client\_buffer) \end{aligned} \quad \text{contiguous with previous ack} \\ & \land Range = Cardinality(client\_buffer) \end{aligned} \quad \text{combined is contiguous} 
\begin{aligned} & ClientAcknowledgement \triangleq \\ & \land client\_buffer \neq \{\} \\ & \land MergeReady \\ & \land client\_buffer' = \{\} \\ & \land client\_rx' = MaxIndex \\ & \land network' = AddMessage([dst \mapsto \text{"server"}, type \mapsto \text{"ack"}, ack \mapsto MaxIndex], network) \\ & \land \text{UNCHANGED} \ \langle server\_tx, server\_tx\_ack, server\_tx\_limit, lost \rangle \end{aligned}
```

The client only acknowledges when it has a contiguous sequence of packets that follows its most recently acknowledged packet. When MergeReady is true, the client sends the acknowledgement back to the server.

```
\begin{aligned} & ClientReceive(pp) \; \triangleq \\ & \wedge \; network' = RemoveMessage(pp, \; network) \\ & \wedge \; client\_buffer' = client\_buffer \cup \{pp.seq\} \\ & \wedge \; \text{UNCHANGED} \; \langle server\_tx, \; client\_rx, \; server\_tx\_ack, \; server\_tx\_limit, \; lost \rangle \\ & Missing \; \triangleq \end{aligned}
```

```
LET
          full\_seq \triangleq
               If MaxIndex \ge client\_rx + 1
                THEN
                    \{x \in client\_rx + 1 \dots MaxIndex : TRUE\}
                ELSE
                     \{x \in 0 ... MaxIndex : TRUE\} \cup \{x \in client\_rx + 1 ... N - 1 : TRUE\}
          all\_client\_msgs \stackrel{\triangle}{=} \{m \in network : m.dst = "client"\}
          all\_client\_seqs \triangleq \{m.seq : m \in all\_client\_msgs\}

network\_missing \triangleq full\_seq \setminus all\_client\_seqs
          client\_missing \triangleq full\_seg \setminus client\_buffer
          to\_request \triangleq network\_missing \cap client\_missing
    IN
          to\_reguest
ClientRetransmitRequest \stackrel{\Delta}{=}
     \land \neg MergeReady
     \land client\_buffer \neq \{\}
     \land Missing \neq \{\}
     \land network' = AddMessage([dst \mapsto "server",
     type \mapsto "retransmit",
     seq \mapsto \text{CHOOSE } x \in Missing : \text{TRUE}],
       network)
     \land UNCHANGED \langle server\_tx, server\_tx\_limit, client\_rx, client\_buffer, server\_tx\_ack, lost \rangle
```

ClientReceive moves the a packet from network into client receive buffer. The only reason why this is done as a separate step is to make debugging easier.

Missing returns a set of missing missing sequence number. This is done by cross checking the client receive buffer and the outstanding network packet targeting the client. In theory, the client doesn't know if a gap in its receive buffer means the packet is lost or will arrive soon. Practically, the client will assume a packet is lost after some configurable timeout and request a retransmit.

Let us take a look at server related definitions:

```
RemoveStaleAck(ack, msgs) \triangleq \\ LET \\ acks \triangleq \{(ack-k+N)\%N: k \in 1 ... W\} \\ IN \\ \{m \in msgs: \neg(m.dst = \text{"server"} \land m.type = \text{"ack"} \land m.ack \in acks)\} \\ ServerReceive(pp) \triangleq \\ \lor \land pp.type = \text{"ack"} \\ \land server\_tx\_ack' = pp.ack \\ \land server\_tx\_limit' = (pp.ack + W)\%N \\ \end{cases}
```

```
 \land network' = RemoveStaleAck(pp.ack, RemoveMessage(pp, network)) 
 \land UNCHANGED \ \langle server\_tx, client\_rx, client\_buffer, lost \rangle 
 \lor \land pp.type = \text{"retransmit"} 
 \land network' = AddMessage([dst \mapsto \text{"client"}, seq \mapsto pp.seq], 
 RemoveMessage(pp, network)) 
 \land lost' = lost - 1 
 \land UNCHANGED \ \langle server\_tx, server\_tx\_limit, client\_rx, client\_buffer, server\_tx\_ack \rangle
```

When the server receives an acknowledgement for sequence number K, it assumes K-1 and prior were all received by the client. *RemoveStaleAck* is an model optimization to drop all acknowledgements with sequence number less than K. Note that sequence number k, k-N, k-2\*N, are all represented as k, and in theory the client may not be able to differentiate between them. Practically, N is sized large enough to represent a few seconds worth of data, so the system can safely assume a sequence number k is for the most recent N packets.

Upon receiving an acknowledgement from the client, the server bumps the server\_tx\_limit allowing it to send more data. The server can also receive a retransmit request and send the requested data. *lost* is configurable to determine how many packets can be dropped at the same time.

### 6.3 Model Reduction

### 6.3.1 Removing Stale Acknowlegement

TODO: W \* 2 ; N TODO: stale ack removal TODO: describe network empty TODO: retransmit is assumed to be reliable, because making it unreliable doesn't make sense.

# 6.4 Safety

### 6.5 Liveness

Given the system may randomly drop packets, one possible liveness condition is to verify packets of all sequence number are received by the client at some point. This can be described as for all possible sequence number value k, k is eventually exists in client receive buffer.

```
 \begin{array}{l} \textit{Liveness} \; \stackrel{\triangle}{=} \\ \forall \, i \in 0 \; ... \; N-1 : \\ \textit{client\_buffer} = \{\} \leadsto \exists \, j \in \textit{client\_buffer} : i = j \end{array}
```

# Raft Consensus Protocol

# 7.1 Requirement

Raft is a consensus algorithm that enables a cluster of nodes to agree on a collective state even in the presence of failures. An application of Raft is database replication protocol. With a replication factor of 3 (which means the data is replicated across 3 nodes) and hard drive failure rate of 0.81% per year, the possibility of the total failure where the entire replication group goes down is  $1-0.0081^3=99.9999\%$  uptime [4].

Since the purpose of the book is to learn TLA+, this chapter will only implement a portion of the Raft protocol, namely leader election. For a full description of the Raft protocol, please refer to the original paper [5].

We will briefly describe Raft and its leader election process below:

- A Raft cluster have N nodes, the cluster work collective as a *system* to offer some service
- Each node can be in one of three possible states: Follower, Candidate, Leader
- During normal operations, a cluster of N nodes have a single leader and N-1 followers
- The leader handles all the client interactions. Requests sent to followers will be redirected to the leader
- The leader regularly sends heartbeat to the follower, indicate its alive
- If a follower fails to receive heartbeat from the leader after timeout, it will become a candidate, vote for itself, and campaign to be leader
- A candidate collect the majority of the vote becomes the leader

- If multiple candidates are campaigning and a split vote happen, candidates will eventually declare election timeout and start new round of election
- The cluster can have multiple leaders due to unfavourable network conditions, but the leaders must be on different terms
- A newly elected leader will send a heartbeat to other nodes to establish leadership
- All request and responses include the sender's term, allowing the receiver to react accordingly

The protocol also included description about log synchronization, state recovery, and more. Many details are ommitted in chis chapter to reduce modeling cost. The N nodes in the cluster operate *independently* following the above hueristics. Hopefully this highlights the complexity around verifying the the correctness of the protocol.

The following illustrate the state diagram of one node in the cluster:



### **7.2** Spec

The following implements the skeleton portion of the leader election protocol:

```
 \begin{array}{l} Init \ \triangleq \\  \  \, \wedge \, state = [s \in Servers \mapsto \text{``Follower''}] \\  \  \, \wedge \, messages = \{\} \\  \  \, \wedge \, voted\_for = [s \in Servers \mapsto \text{``''}] \\  \  \, \wedge \, vote\_granted = [s \in Servers \mapsto \{\}] \\  \  \, \wedge \, vote\_requested = [s \in Servers \mapsto 0] \\  \  \, \wedge \, term = [s \in Servers \mapsto 0] \\ \\ Request VoteSet(i) \ \triangleq \ \{ \end{array}
```

```
[fSrc \mapsto i, fDst \mapsto s, fType \mapsto \text{``RequestVoteReq''}, fTerm \mapsto term[i]]
          : s \in Servers \setminus \{i\}
}
Campaign(i) \triangleq
     \land vote\_requested[i] = 0
     \land vote\_requested' = [vote\_requested \ \texttt{EXCEPT} \ ![i] = 1]
     \land messages' = messages \cup RequestVoteSet(i)
     \land UNCHANGED \langle state, term, vote\_granted, voted\_for \rangle
KeepAliveSet(i) \triangleq \{
    [fSrc \mapsto i, fDst \mapsto s, fType \mapsto \text{``AppendEntryReq''}, fTerm \mapsto term[i]]
          : s \in Servers \setminus \{i\}
}
Leader(i) \triangleq
     \wedge state[i] = \text{``Leader''}
     \land messages' = messages \cup KeepAliveSet(i)
     \land UNCHANGED \langle state, voted\_for, term, vote\_granted, vote\_requested <math>\rangle
BecomeLeader(i) \triangleq
     \land Cardinality(vote\_granted[i]) > Cardinality(Servers) \div 2
     \land state' = [state \ EXCEPT \ ![i] = "Leader"]
     \land UNCHANGED \langle messages, voted\_for, term, vote\_granted, vote\_requested <math>\rangle
Candidate(i) \triangleq
     \wedge state[i] = \text{``Candidate''}
     \land \lor Campaign(i)
        \vee BecomeLeader(i)
        \vee Timeout(i)
Follower(i) \triangleq
     \land state[i] = "Follower"
     \land Timeout(i)
Receive(msg) \triangleq
     \lor \land msg.fType = \text{``AppendEntryReq''}
          \land AppendEntryReq(msg)
     \lor \land msg.fType = \text{``AppendEntryResp''}
          \land AppendEntryResp(msg)
     \lor \land msg.fType = "RequestVoteReq"
          \land RequestVoteReq(msg)
     \lor \land msg.fType = "RequestVoteResp"
          \land RequestVoteResp(msq)
Next \triangleq
     \vee \exists i \in Servers :
           \vee Leader(i)
```

```
\lor Candidate(i)
\lor Follower(i)
\lor \exists msg \in messages : Receive(msg)
```

- Next either picks a server to make progress, or picks a message in the message pool to process. Message processing is done by Receive, handling is state agnostic
- message is defined to be a set that holds a collection of functions, where each function is a message with source, destination, type, and more specified
- *voted\_for* tracks who a given node previously voted for. This prevents a node from voting more than once
- vote\_granted tracks how many votes a candidate has received
- vote\_requested tracks if a node has already issued request vote to its peers
- Follower either Receive or Timeout and campaign to be a leader
- Candidate campaigns to be a leader, and becomes one if it has enough vote. Failing to collect enough votes, Candidate start a new election on a new term. It can also receive a request with a higher term and transition to be a Follower.
- Leader will establish its leadership by sending AppendEntryReq to all its peers

The Spec implements four messages AppendEntry request/response, RequestVote request/response. Handling for all messages are fairly similar in structure. In this chapter we will look at *RequestVoteReq* only. Readers are encouraged to check the remaining definition as an exercise:

```
fTerm \mapsto t,
                                       fSuccess \mapsto 1,
                                       RemoveMessage(msq, messages))
  \(\triangle \) UNCHANGED \(\state, \text{ term}, \text{ vote_granted}, \text{ vote_requested}, \text{ establish_leadership}\)
already voted someone else
\vee \wedge t = term[i]
  \land voted\_for[i] \neq j
  \land \ voted\_for[i] \neq ```'
  \land messages' = AddMessage([fSrc \mapsto i,
                                       fDst \mapsto j,
                                       fType \mapsto "RequestVoteResp",
                                       fTerm \mapsto t,
                                       fSuccess \mapsto 0,
                                       RemoveMessage(msq, messages))
    ∧ UNCHANGED ⟨state, voted_for, term, vote_granted, vote_requested, establish_leadership⟩
   \wedge t < term[i]
    \land messages' = AddMessage([fSrc \mapsto i,
                                       fType \mapsto "RequestVoteResp",
                                       fTerm \mapsto term[i],
                                       fSuccess \mapsto 0,
                                       RemoveMessage(msg, messages))
    ∧ UNCHANGED ⟨state, voted_for, term, vote_granted, vote_requested, establish_leadership⟩
revert back to follower
\vee \wedge t > term[i]
   \wedge state' = [state \ EXCEPT \ ![i] = "Follower"]
   \land term' = [term \ EXCEPT \ ![i] = t]
    \land voted\_for' = [voted\_for \ EXCEPT \ ![i] = j]
    \land vote\_granted' = [vote\_granted \ EXCEPT \ ![i] = \{\}]
    \land vote\_requested' = [vote\_requested \ EXCEPT \ ![i] = 0]
    \land establish\_leadership' = [establish\_leadership \ EXCEPT \ ![i] = 0]
    \land messages' = AddMessage([fSrc \mapsto i,
                                       fDst \mapsto j
                                       fType \mapsto "RequestVoteResp",
                                       fTerm \mapsto t,
                                       fSuccess \mapsto 1,
                                       RemoveMessage(msg, messages))
```

The handling is split into three cases:

- If received request is on a higher term, processing node grants vote and becomes a Follower
- If received request is on a lower term, processing node ignores request
- If received request is on the same term, processing node only grants vote if it hasn't voted, or has had voted for the same requester prior

### 7.3 Model Reduction

The model checker will run the spec as defined, but due to the expoential growth of states it is unlikely to complete in a reasonable amount of time. We need to simplify the model and possibly trades off some correctness. Careful consideration must go into finding the right balance between maximizing model correctness and minimizing model checker runtime.

The main strategy is to *bound* the state graph. The following describe a set of optimization implemented for this example.

#### 7.3.1 Modeling Messages as a Set

In the original Raft TLA+ Spec [6], messages are modeled as an *unordered map* to track the count of each message. It is possible for a sender to repeatedly send the same message (eg. keepalive), and grow the message count in an unbounded fashion.

messages in this example has been implemented as a set, which effectively limits message instance count to one. It is still possible for messages to grow unboundedly because of the monotonically increasing term value. Further changes are described below.

### 7.3.2 Limit Term Divergence

It is possible for a node to *never* make progress. Such case can occur when a node is partitioned off while the rest of the cluster elects new leader and move onto newer terms. Many of the interesting behaviours of Raft are how it addresses these cases. In a cluster of nodes with mixed terms, the nodes with older term will eventually converge onto newer terms when they are contacted by new leader. This converging behaviour will happen whether the stale node is either 1 or N terms away from the current leader, and the former is much less costly to simulate than the latter because the reduced number of states.

We can include *LimitDivergence* as a conjunction in *Timeout*:

```
\begin{array}{l} \textit{LimitDivergence}(i) \ \stackrel{\triangle}{=} \\ \textit{values} \ \stackrel{\triangle}{=} \ \{\textit{term}[s] : s \in \textit{Servers}\} \\ \textit{max\_v} \ \stackrel{\triangle}{=} \ \textit{CHOOSE} \ x \in \textit{values} : \forall \ y \in \textit{values} : x \geq y \\ \textit{min\_v} \ \stackrel{\triangle}{=} \ \textit{CHOOSE} \ x \in \textit{values} : \forall \ y \in \textit{values} : x \leq y \\ \text{IN} \\ & \lor \land \textit{term}[i] \neq \textit{max\_v} \\ & \lor \land \textit{term}[i] = \textit{max\_v} \\ & \land \textit{term}[i] - \textit{min\_v} < \textit{MaxDiff} \\ \end{array}
```

#### 7.3.3 Normalize Cluster Term

However, term *itself* can grow unbounded. This is a key tenent converging protocols rely on, an monotonically increasing counter. We want to *normalize* the range of terms in the cluster so the minimum value resets back to 0. This provides an upper bound to the state graph.

```
Normalize \triangleq
     LET
           \begin{array}{ll} values \; \stackrel{\triangle}{=} \; \{term[s]: s \in Servers\} \\ max\_v \; \stackrel{\triangle}{=} \; \text{CHOOSE} \; x \in values: } \forall \; y \in values: x \geq y \end{array}
           min_v \triangleq \text{CHOOSE } x \in values : \forall y \in values : x < y
     IN
            \wedge max_v = MaxTerm
            \land term' = [s \in Servers \mapsto term[s] - min\_v]
            \land messages' = \{\}
            \(\triangle \) UNCHANGED \(\state, voted_for, vote_granted, vote_requested, establish_leadership\)
Next \triangleq
      \lor \land \forall i \in Servers : term[i] \neq MaxTerm
          \land \lor \exists i \in Servers :
                     \vee Leader(i)
                     \vee Candidate(i)
                     \vee Follower(i)
              \vee \exists msg \in messages : Receive(msg)
      \lor \land \exists i \in Servers : term[i] = MaxTerm
          \land Normalize
```

The implementation ensures only the state machine only moves forward when none of the nodes is on *MaxTerm*. If any of the node is on *MaxTerm*, the cluster terms are normalized.

Another caveat here is in the initial implementation I didn't update messsages. This led to liveness property violation as the messages had terms disagreeing with the system state. To simplify the spec I simply cleared all messages. This indirectly verifies a portion of the packet loss handling in the spec as well.

#### 7.3.4 Sending Request as a Batch

The send requests were initially implemented using the existential quantifier. This introduces many interleaving states. This was replaced with a universal quantifier so the set of messages are only sent once. The implementation no longer tracks if the responses were received, since the spec should handle packet loss scenarios as well.

```
RequestVoteSet(i) \triangleq \{
    [fSrc \mapsto i, fDst \mapsto s, fType \mapsto \text{``RequestVoteReq''}, fTerm \mapsto term[i]]
          : s \in Servers \setminus \{i\}
}
Campaign(i) \triangleq
     \land vote\_requested[i] = 0
     \land vote\_requested' = [vote\_requested \ EXCEPT \ ![i] = 1]
     \land messages' = messages \cup RequestVoteSet(i)
     \land UNCHANGED \langle state, term, vote\_granted, voted\_for, establish\_leadership <math>\rangle
KeepAliveSet(i) \stackrel{\Delta}{=} \{
     [fSrc \mapsto i, fDst \mapsto s, fType \mapsto \text{``AppendEntryReq''}, fTerm \mapsto term[i]]
          : s \in Servers \setminus \{i\}
}
Leader(i) \triangleq
     \wedge state[i] = \text{``Leader''}
     \land establish\_leadership[i] = 0
     \land establish\_leadership' = [establish\_leadership \ Except \ ![i] = 1]
     \land messages' = messages \cup KeepAliveSet(i)
     \land UNCHANGED \langle state, voted\_for, term, vote\_granted, vote\_requested <math>\rangle
```

### 7.3.5 Prune Messages with Stale Terms

When a node's term advances, all messages targeted to this node with older terms are discarded. Keeping messages with stale terms allows the model checker to verify the node correctly discards them, but can exponentially grow the state machine. To simplify the model, we can prune stale messages as we add a new message:

```
 \begin{array}{c} AddMessage(to\_add,\ msgs) \ \triangleq \\ \text{LET} \\ pruned \ \triangleq \ \{msg \in msgs: \\ \neg (msg.fDst = to\_add.fDst \land msg.fTerm < to\_add.fTerm)\} \\ \text{IN} \end{array}
```

```
pruned \cup \{to\_add\}
```

 $RemoveMessage(to\_remove, msgs) \stackrel{\Delta}{=}$ 

### 7.3.6 Enable Symmetry

Since the behaviour is symmetric between nodes, we can enable symmetry to speed up model checker runtime:

 $Perms \stackrel{\triangle}{=} Permutations(Servers)$ 

### 7.4 Safety

One of the goals for the protocol is to ensure the cluster only have one leader. It is possible for the clusters to have mulitple leaders due to unfavourable network connections. For example, a leader node is partitioned off and a new leader is elected. However, even when the cluster have multiple leaders, they *must* be on different terms. The leader with the highest term is effectively the *true leader*. This invariant can be implemented like so:

```
 \begin{array}{ll} \textit{LeaderUniqueTerm} & \triangleq \\ \forall \, s1, \, s2 \in \textit{Servers}: \\ & (\textit{state}[s1] = \text{``Leader''} \land \textit{state}[s2] = \text{``Leader''} \land s1 \neq s2) \Rightarrow (\textit{term}[s1] \neq \textit{term}[s2]) \end{array}
```

For every pair of nodes, they cannot both be Leaders and have the same term.

### 7.5 Liveness

In any failure recovery scenario, the nodes in the cluster converges to a higher term value either voluntary or involuntarily. For example:

- A node timed out and starts a new election on a new term
- A partitioned follower receives heartbeat from a new leader on a new term
- A candidate receiving a request vote from another candidate on a higher term

In any case, a node's term number always increase. This can be described as below:

```
Converge \triangleq

\forall s \in Servers :

term[s] = 0 \rightarrow term[s] = MaxTerm - MaxDiff
```

Instead of MaxTerm, we use MaxTerm-MaxDiff to ensure the liveness property is always upheld even after Normalization. However, running the spec against TLC now will encounter a set of stuttering issues. We also need to update the fairness description to ensure all possible actions are called when the enabling conditions are eventually always true:

```
 \begin{array}{l} Liveness \ \triangleq \\ \land \forall \ i \in Servers: \\ \land \ \text{WF}_{vars}(Leader(i)) \\ \land \ \text{WF}_{vars}(Candidate(i)) \\ \land \ \text{WF}_{vars}(Follower(i)) \\ \land \ \text{WF}_{vars}(\exists \ msg \in messages: Receive(msg)) \end{array}
```

# Part III Examples with PlusCal

PlusCal is a C-like syntax that allows designer to describe their *Spec* in a more programming language like fashion. I'm of the opinion that these are suitable to describe concurrent algorithm, where the code execution between multiple contexts may interleave in any way imaginable. While it certainly is possible to express these in TLA+ directly, I find it to be error prone, somewhat comparable to writing in Assembly instead of C. In this section we will describe a few PlusCal example.

# SPSC Lockless Queue

A single producer single consumer (SPSC) lockless queue is a data exchange queue between a producer and a consumer. The SPSC lockfree queue enables data exchange between producer and consumer without the use of a lock, allowing both producer and consumer to make progress in all scenarios.

An example application of SPSC queue is a data exchange interface between ASIC and the CPU in a driver implementation.

A real implementation need to account for memory ordering effects specific to the architecture. For example, ARM has weak memory ordering model where read/write may appear out-of-order between CPUs. In this chapter we will assume *logical* execution order where each command is perceived issued sequentially (even across CPUs) to focus the discussion on describing the system using TLA+.

### 8.1 Requirement

The following describes the SPSC queue requirements:

- Two executing context, reader and writer
- Writer advances wtpr after writes
- Reader advances rtpr after reads
- If rtpr equals wptr, queue is empty
- If (wtpr + 1) % N equals rptr, queue is full

The following is an example of a SPSC queue:



Since reader and writer execute in different context, the instructions in read and write can interleave in *any* way imaginable:

- Reader empty check can happen just as writer is writing data
- Writer full check can happen just as reader is reading data
- Reading and writing can occur concurrently

The key observations is the index held by write pointer is reserved by the writer. Similarly, index held by the read pointer is reserved by the reader. The only exception is when read pointer equals to write pointer, then the queue is empty. Given the possible ways the reader and writer execution can interleave, we can use TLA+ to verify the design.

### 8.2 Spec

TLA+ specification can be writen using its native formal specification language, or a C-like syntax called PlusCal (which transpiles down to itse native form). In this example, I chose to implement the specification using PlusCal, since the content to be verified is psuedo implementation. While it is possible specify SPSC in native TLA+, I find the approach more error prone as each line is effectively an individual state to be modeled.

The following is a snippet of the *Spec* written in PlusCal:

```
procedure reader()
begin
r\_chk\_empty:
                    if rptr = wptr then
r_early_ret:
                    return;
       end if;
                    assert buffer[rptr] \neq 0;
r\_read\_buf:
             buffer[rptr] := 0;
r\_cs:
r\_upd\_rtpr:
                   rptr := (rptr + 1)\%N;
       return;
end procedure;
procedure writer()
begin
w\_chk\_full:
                   if (wptr + 1)\%N = rptr then
w\_early\_ret:
                       return;
```

Note some lines start with label (eg. r\_chk\_empty). All the actions associated with the label is assumed executed atomically. This is reflected in the generated TLA+ code:

```
r\_chk\_empty(self) \triangleq \land pc[self] = "r\_chk\_empty"
 \land IF rptr = wptr
 THEN \land pc' = [pc \text{ EXCEPT } ![self] = "r\_early\_ret"]
 ELSE \land pc' = [pc \text{ EXCEPT } ![self] = "r\_read\_buf"]
 \land UNCHANGED \langle rptr, wptr, buffer, stack \rangle
```

### 8.3 Safety

Some safety requirement we can enforce include:

```
\begin{array}{l} \textit{MUTEX} \; \triangleq \\ \; \neg((\textit{pc}[\textit{WRITER}] = \text{``w\_cs''}) \land (\textit{pc}[\textit{READER}] = \text{``r\_cs''}) \land \textit{rptr} = \textit{wptr}) \\ \textit{Inv\_Basics} \; \triangleq \\ \; \land ((\textit{written} \cup \textit{writing}) \cup \textit{unused}) = \textit{all} \\ \; \land \textit{reading} \subseteq \textit{written} \qquad \qquad \text{reading is a subset of written} \\ \; \land \forall i \in \textit{unused} : \textit{buffer}[i] = 0 \\ \; \land \lor \textit{Cardinality}(\textit{to\_be\_read}) + 1 = \textit{Cardinality}(\textit{reading}) \\ \; \lor \textit{Cardinality}(\textit{to\_be\_read}) = \textit{Cardinality}(\textit{reading}) + 1 \\ \; \lor \textit{Cardinality}(\textit{to\_be\_read}) = \textit{Cardinality}(\textit{reading}) \\ \; \land \textit{MUTEX} \end{array}
```

### 8.4 Liveness

### 8.5 Configuration

# SPMC Lockless Queue

As the name suggest, a SPMC lockless queue supports a single producer multiple consumer usage topology.

In previously described SPSC, the reader assumes write index is reserved, and the writer assumes read index is reserved. While the relationship between read and write context remains true in SPMC, the complication is now readers compete to reserve an index during read. Some design considerations include:

- Readers must have a way to reserve an index to read
- Readers may complete the read in order different from when they reserve the indices
- Readers can perform the reads independently

A primitive in modern CPU architecture is the compare-and-swap (CAS) instruction. To use CAS, issuer needs to specify a memory location, a current value and a new value. Only when value in the memory location matches the current value, the new value will be writen. The instruction ensure this is done atomically. If multiple CPUs issue CAS at the same time to the same memory location, only one CPU will win as the other CPU's current value check will fail. Many of the concurrent algorithms are built ontop of this primitive.

### 9.1 Design

Some of the design limitation include:

- A resource is exclusively updated by one owner, and read by one or more readers
- For shared resource that can be updated by multiple owners, the CPU can gaurantees exclusive update from a single resource

For the design:

- SPMC is implemented as a circular queue with size of N
- The status of indvidual index is represented as an status array of size N
- The status of each index is either UNUSED, WRITTEN, or READING
- Each reader maintain its own read pointer
- A *outstanding* counter is incremented by the writer when write is complete, and decremented by the reader when it reserves a read

Whenever the write finishes a write, it increments *outstanding* to indicate some buffer is ready to read.

To read, a reader performs a two-step reservation:

- The reader decrements *outstanding*. A successful decrement means the reader is *gauranteed* a read index.
- After successful decrement of outstanding, the reader walk its read pointer
  until it successfully reserve the next available index to read. A This is done
  by attempting to CAS update an index from WRITTEN to READING.
  If the update fails, then the index was already reserved by another reader.

There may be more than one approach in implementing SPMC, the above description is what we will implement in this chapter.

### 9.2 Spec

The following is the core reader implementation:

```
procedure reader( )
variable
    i = self;
begin
r_chk_empty:
    if outstanding \neq 0 then
        outstanding := outstanding - 1;
    else
    r_early_ret:
        return;
    end if;
r_try_lock:
    if status[rptr[i]] = WRITTEN then
        status[rptr[i]] := READING;
    else
```

```
 r_{-}retry: \\ rptr[i] := (rptr[i] + 1)\%N; \\ \mathbf{goto} \ r_{-}try_{-}lock; \\ \mathbf{end if }; \\ r_{-}data_{-}chk: \\ \mathbf{assert} \ buffer[rptr[i]] = rptr[i] + 1000; \\ r_{-}read_{-}buf: \\ buffer[rptr[i]] := 0; \\ r_{-}unlock: \\ status[rptr[i]] := UNUSED; \\ r_{-}done: \\ \mathbf{return}; \\ \mathbf{end procedure} ;
```

The reader performs a non-zero check on outstanding. If outstanding is zero, queue is empty and reader early returns.

If outstanding is K, then at most K readers can reserve a index to read. If system has M readers, then M-K readers will fail to reserve a read index. The readers now compete to reserve a read. More specifically:

- Reader loads outstanding, stores that onto local variable A
- Reader early returns if A is zero
- Reader calls rv = CAS(A, A-1)
- If rv is non-success, go back to the top

If rv is non-success, another reader has won the reservation. The current reader can attempt to reserve again if outstanding is non-zero.

If rv is success, the reader is gauranteed a read. However, readers may need to compete to reserve an available index. This is done by attempting to CAS update the index status from WRITTEN to READING. Failing the CAS indicate another reader has reserved this index, and the reader will bump the read pointer and try to reserve the next index.

Now let us take a look at the writer implementation:

```
procedure writer( ) begin
w\_chk\_full:
   if outstanding = N-1 then
       w_early_ret:
       return;
   end if;
   w_chk_st:
```

```
 \begin{aligned} &\textbf{if } status[wptr] \neq UNUSED \textbf{ then } \\ &w\_early\_ret2: \\ &\textbf{return;} \\ &\textbf{end if ;} \\ &w\_write\_buf: \\ &buffer[wptr] := wptr + 1000; \\ &w\_mark\_written: \\ &status[wptr] := WRITTEN; \\ &w\_inc\_wptr: \\ &wptr := (wptr + 1)\%N; \\ &w\_inc: \\ &outstanding := outstanding + 1; \\ &w\_done: \\ &\textbf{return;} \\ &\textbf{end procedure ;} \end{aligned}
```

The writer first checks outstanding, and early return if queue is full. After fullness check, the writer then checks if the current index is UNUSED. This is to account for the edge case where a reader has performed the reservation first step to decrement outstanding but haven't done the actual read. If both checks pass, then writer now has an UNUSED index it can write to.

The SPMC algorithm can be tricky yo get right. There's plenty of possible pitfall in the design. For example:

- Readers can lapse each other
- A reader can attempt to lapse writer
- A reader can starve the other readers
- A slow reader can block the system

In all of the above case, the system *correctness* need to be maintained.

### 9.3 Safety

When a reader reserves an index to read, the reader must have exclusive access:

Similarly, reader and writer cannot operate on the same index at the same time:

 $ExclusiveReadWrite \triangleq$ 

```
\forall \, x \in READERS: \\ (pc[x] = \text{``r\_read\_buf''} \land pc[WRITER] = \text{``w\_write\_buf''}) \Rightarrow (rptr[x] \neq wptr)
```

### 9.4 Liveness

All indices must be used as the system runs. The following verifies all unused indices are eventually used, and all used indicies are eventually unused:

```
 \begin{array}{l} \textit{Liveness} \; \stackrel{\triangle}{=} \\ \; \land \forall \, k \in 0 \ldots N-1 : \\ \; \textit{buffer}[k] = 0 \leadsto \textit{buffer}[k] \neq 0 \\ \; \land \forall \, k \in 0 \ldots N-1 : \\ \; \textit{buffer}[k] \neq 0 \leadsto \textit{buffer}[k] = 0 \end{array}
```

The following describes a more subtle scenario. We need to ensure the system remains functional even if readers complete out-of-order. The following describe such scenario, where two non-contiguous indicies have been reserved for reading. In this case we expect the indicies to eventually be re-used. This means the system remains functional after such scenario.

```
 \begin{array}{l} \textit{Liveness2} \triangleq \\ \forall \, k \in 0 \ldots N-3: \\ \land (\textit{status}[k] = \textit{READING} \land \textit{status}[k+1] = \textit{UNUSED} \land \textit{status}[k+2] = \textit{READING}) \\ \rightsquigarrow (\textit{status}[k] = \textit{WRITTEN}) \\ \land (\textit{status}[k] = \textit{READING} \land \textit{status}[k+1] = \textit{UNUSED} \land \textit{status}[k+2] = \textit{READING}) \\ \rightsquigarrow (\textit{status}[k+2] = \textit{WRITTEN}) \\ \end{array}
```

# Part IV

# Reference

## **Data Structure**

Like other languages, TLA+ provides its data structure. I assume the readers are already familiar with common data structure, and this chapter will only focus on the TLA+ language semantics.

### 10.1 Set

This is the most common data structure used in TLA+ spec. The following is a few examples on how a set can be used:

```
\begin{array}{l} a \stackrel{\triangle}{=} \{0,1,2\} \\ b \stackrel{\triangle}{=} \{2,3,4\} \\ c \stackrel{\triangle}{=} a \cup b \\ d \stackrel{\triangle}{=} a \cap b \\ e \stackrel{\triangle}{=} \exists x \in c : x > 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x > 5 \\ g \stackrel{\triangle}{=} \forall x \in c : x < 3 \\ h \stackrel{\triangle}{=} \forall x \in c : x < 5 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel{\triangle}{=} \exists x \in c : x < 3 \\ f \stackrel
```

### 10.2 Tuple

```
\begin{array}{lll} A & \stackrel{\triangle}{=} & \langle 0,\, 1,\, 2 \rangle \\ B & \stackrel{\triangle}{=} & \langle 2,\, 3,\, 4 \rangle \\ C & \stackrel{\triangle}{=} & A \circ B & \text{tuple: } 0,\, 1,\, 2,\, 2,\, 3,\, 4 \\ D & \stackrel{\triangle}{=} & Len(C) & 6 \\ E & \stackrel{\triangle}{=} & \forall\, x \in 1 \ldots Len(C) : C[x] \neq 10 & \text{TRUE - every C[x] is not } 10 \\ F & \stackrel{\triangle}{=} & \exists\, x \in 1 \ldots Len(C) : C[x] = 2 & \text{TRUE - there exists a C[x] that is } 2 \end{array}
```

 $G \; \stackrel{\Delta}{=} \; \big\{ x \in 1 \; . \; . \; Len(C) : C[x] = 2 \big\} \quad \{3,\,4\} \; \text{- when index is 3 or 4, C[x]} = 2$ 

### Fairness and Liveness

For rigorous definition and proof, please refer to (TODO: citations). This chapter focus on the application aspect of liveness and fairness and define an elevator spec that goes up and down.



### 11.1 Liveness

Consider the following elevator Spec:

```
- MODULE elevator -
EXTENDS Integers
VARIABLES a
vars \triangleq \langle a \rangle
TOP
BOTTOM \triangleq 1
Init \triangleq
      \wedge a = BOTTOM
      \land \ a \neq \mathit{TOP}
      \wedge a' = a + 1
Down \triangleq
      \land a \neq BOTTOM
      \wedge a' = a - 1
Spec \triangleq
   \wedge Init
   \wedge \Box [Up \vee Down]_a
```

The building has a set of floors and the elevator can go either up or down. The elevator keeps going up until it's the top floor, or keep going down until it's the bottom floor. TLC will pass the Spec as is.

Let's introduce a liveness property. The elevator should always at least go to the second floor:

```
Liveness \stackrel{\triangle}{=} \\ \land a = 1 \leadsto a = 2
```

Running the *Spec* against TLC will report a violation:

```
Error: Temporal properties were violated.
Error: The following behavior constitutes a counter-example:
State 1: <Initial predicate>
a = 1
State 2: Stuttering
```

Since the *Spec* permits *suttering*, the state machine is allowed to perpetually stay on 1F and *never* go to 2F. This can be fixed by introducing fairness description.

### 11.2 Weak Fairness

Weak fairness is defined as:

$$\Diamond \Box (ENABLED\langle A \rangle_v) \Rightarrow \Box \Diamond \langle A \rangle_v \tag{11.1}$$

 $ENABLED\langle A \rangle$  represents conditions required for action A. The above translates to: if conditions required for action A to occur is eventually always true, then action A will always eventually happen.

Without weak fairness defined, the elevator may stutter at 1F and never go to 2F. Weak fairness states that if the conditions of an action is eventually always true (ie. elevator decides to stay on 1F but can go up), the elevator always eventually go up.

```
\begin{array}{l} Spec \;\; \stackrel{\triangle}{=} \\ \;\; \wedge \; Init \\ \;\; \wedge \; \Box [Down \vee Up]_a \\ \;\; \wedge \; \mathrm{WF}_a(Down) \\ \;\; \wedge \; \mathrm{WF}_a(Up) \end{array}
```

Running the spec against TLC passes again. What if we want to verify the elevator eventually always goes to the top, not just to 2F? Let's modify the Liveness property again:

 $Liveness \triangleq$ 

```
\land a = BOTTOM \leadsto a = TOP
```

TLC now reports the following violation:

```
Error: Temporal properties were violated.
Error: The following behavior constitutes a counter-example:
State 1: <Initial predicate>
a = 1
State 2: <Up line 10, col 5 to line 11, col 17 of module elevator>
a = 2
Back to state 1: <Down line 13, col 5 to line 14, col 17 of module elevator>
```

TLC identified a case where the elevator is perpetually stuck going between 1F and 2F, but never go to 3F. Weak fairness is no longer enough, because the the elevator is not stuck on 2F repeatedly, but stuck going between 1F and 2F. This is where we need strong fairness.

### 11.3 Strong Fairness

Strong fairness is defined as:

$$\Box \Diamond (ENABLED\langle A \rangle_v) \Rightarrow \Box \Diamond \langle A \rangle_v \tag{11.2}$$

The difference between weak and strong fairness is the *eventually always* vs. *always eventually*.

In weak fairness, once the state machine is stuck in a state forever, the state machine always transition to a possible next state permitted by the *Spec* (eg. if the elevator is stuck on 1F but can go to 2F, it will). With strong fairness, the elevator doesn't need to be stuck on 2F to go to 3F. If the elevator *always* eventually makes it to 2F, it always eventually go to 3F.

Intuitively we are tempted to enable strong fairness like so:

```
Spec \triangleq \\ \land Init \\ \land \Box [Up \lor Down]_a \\ \land WF_a(Down) \\ \land SF_a(UP)
```

However, TLC still reports the same violation. What's going on?

If we take a closer look at the enabling condition for Up, it only requires current floor to be not the top floor. When the elevator is stuck in a loop going Up and Down between 1F and 2F indefinitely, strong fairness for Up is already satisfied. What we really want is strong fairness on Up for every floor, instead

of any floor except top floor. So if elevator makes to 2F once, it will always eventually go to 3F. If elevator makes to 3F once, it will always eventually go to 4F, so on and so forth. The following is the change required:

```
 \begin{split} Spec & \stackrel{\triangle}{=} \\ & \wedge Init \\ & \wedge \square [\mathit{Up} \vee \mathit{Down}]_a \\ & \wedge \mathrm{WF}_a(\mathit{Down}) \\ & \wedge \forall f \in \mathit{BOTTOM} \ .. \ \mathit{TOP} - 1 : \\ & \wedge \mathrm{WF}_a(\mathit{Up} \wedge f = a) \end{split}
```

Once again with this change TLC will pass.

# General Tips

### 12.1 Model Checker Debug

Debugging in TLC is a bit different than debugging with normal programs. A step in the model checker is really a state transition. Even if the model checker completes, it's still worthwhile dump and audit the states just to make sure the Spec is defined correctly.

```
tlc elevator -dump out > /dev/null && cat out.dump | head -n5
State 1:
a = 1
State 2:
a = 2
```

You may want to grep the output to look for state being set to certain value to confirm the Spec is working as intended.

#### 12.1.1 Dead Lock

Deadlock typically happens when the model checker ran out of things to do. This is typically a result of an incomplete Spec definition, where certain edge cases were not accounted for. The model checker typically provides a fairly comprehensive backtrace leading up to the dead lock to simplify debug.

#### 12.1.2 Live Lock

Livelock happens when the model checker identifies a case where the liveness property is violated. An example is the elevator stuck going between two floors instead of keep going to the top floor, or the system is stuck dropping and retransmit the same packet.

These are typically fixed by providing additional fairness description to the Spec, telling the model checker how continue in the case of a live lock.

For a detail fairness description please refer to Chapter 11.

#### 12.2 Model Refinement

This is, in some sense, the *art* associated with writing model checker verifiable TLA+ spec. Model checking is only valuable if it can be verified within a reasonable amount of time. Since the model complexity grows exponentially, there's little value in attempting to hyper-optimize the details. Designer should focus on optimizing the broad stroke, such as removing features that are harder to get wrong, and focus the model on the bits that have the highest return on investment.

One useful way of trimming out low value portion of the Spec is to audit the state dump. Even in the case of a non-terminating run, a partial state dump can help identify low value details we can remove from the Spec.

One key value of TLA+ is it highlights all the corner cases in the system. Even if you end up removing certain details from the Spec, it still likely highlights to you certain condition you were unaware of.

As a broad stroke principle: when the Spec has millions or higher more states, it is unlikely to terminate within a few seconds. From first principles if you can find one failure case in a million states, you can also likely reduce the scope of the model to reproduce the failure in much fewer states.

# Abstraction Guideline

Reference

# Bibliography

- $[1] \ https://lamport.azurewebsites.net/tla/book.html$
- [2] Srikumar Subramanian https://sriku.org/posts/fairness-in-tlaplus/, 2015
- [3] Richard M. Murray, Nok Wongpiromsarn Linear Temporal Logic, Lecture 3, 2012
- [4] https://www.backblaze.com/blog/cloud-storage-durability/
- $[5] \ https://raft.github.io/raft.pdf$
- [6] https://github.com/ongardie/raft.tla