# A model of speculative evaluation

CRAIG DISSELKOEN, University of California San Diego and Mozilla Research Internship RADHA JAGADEESAN, DePaul University ALAN JEFFREY, Mozilla Research JAMES RIELY, DePaul University

This paper studies information flow caused by speculation mechanisms in hardware and software. The Spectre attack shows that there are practical information flow attacks which use an interaction of dynamic security checks, speculative evaluation and cache timing. Previous formal models of program execution have not been designed to model speculative evaluation, and so do not capture attacks such as Spectre. In this paper, we propose a model based on pomsets which is designed to model speculative evaluation. The model provides a compositional semantics for a simple shared-memory concurrent language, which captures features such as data and control dependencies, relaxed memory and transactions. We provide models for existing information flow attacks based on speculative evaluation and transactions, and new information flow attacks on compiler optimizations. The new attacks are experimentally validated against gcc and clang. A simple temporal logic provides reasoning principals, including composition and proofs using invariants.

### 1 INTRODUCTION

This paper studies information flow caused by speculation mechanisms in hardware and software. Information flow from high (hi) to low (lo) provides a formal foundation for end-to-end security. Informally, a program is secure if there is no observable dependency of lo-observables on hi-inputs. The precise formalization of this intutive idea has been the topic of extensive research (e.g., see [34] for a detailed survey till 2006), and can be generalized to account for a variety of language features and observables such as non-determinisim [38], concurrency [35], reactivity [29], and probability [15]. The static and dynamic enforcement of these definitions in general purpose languages [28] has also been studied extensively and has influenced language design and implementation.

A key parameter in the definitions cited above is the notion of the *observational power* of the attacker model. Whereas the classical input-output behavior is often an adequate foundation, it has long been known [7, 23] that side-channels that leak information arise from other observables such as execution time and power consumption.

The focus of this paper is the development of a compositional semantic model of executions of programs to explicate side channel attacks that arise from speculation. In particular, we model conditionals such that it is possible for the behaviour of a conditional (if (M) { C } else { D }) to depend on the behaviour of D even when the condition M is true.

Our study addresses several sources of speculation in the concurrent execution of programs.

• Pipelined microprocessors use heuristics to predict the path of a program, for example by guessing memory dependencies, or the outcome of conditionals. Execution proceeds along

Authors' addresses: Craig Disselkoen, University of California San Diego, Mozilla Research Internship, cdisselk@cs.ucsd.edu; Radha Jagadeesan, DePaul University, rjagadeesan@cs.depaul.edu; Alan Jeffrey, Mozilla Research, ajeffrey@mozilla.com; James Riely, DePaul University, jriely@cs.depaul.edu.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

© 2018 Copyright held by the owner/author(s). XXXX-XXXX/2018/7-ART

https://doi.org/10.1145/nnnnnnn.nnnnnnn

the predicted path until it is possible to validate the prediction; the execution is committed if the prediction is correct; otherwise it is rolled back. The speculation does not affect the observable input-output behavior of the program, thus ensuring correctness with respect to the usual intended semantics of the program. However, since there are clear timing differences between the cases where the prediction is and is not successful, it raises the concern that the predictive mechanisms themselves could cause information leaks via timing side-channels; a fear that is realized with devastating impact by the Spectre family of attacks [21].

• Several modern microprocessors [10] support transactions to aid in the design of correct and efficient concurrent programs. Transactions are executed optimistically; they are committed if there are no conflicts, and aborted otherwise. All memory effects of an aborted transaction are rolled back; so there is no way for a concurrent observer to detect an aborted transaction. However, the thread of the aborted transaction, via abort-handler code, gets notified of the cancellation of the transaction.

Thus, when a transaction of a thread aborts, the thread can deduce plausible information about specific memory accesses of a concurrently executing transaction. In combination with the techniques used in Spectre, this potential has been exploited recently to accelerate and scale up the efficacy of the attacker in the Spectre family of attacks [12].

• Compiler optimizations may depend on both branches of a conditional, for example one that replaces (if (*M*) { *C* } else { *C* }) by *C*. In particular, (if (*r*) { *x* := 1 } else { *x* := 1 }) can be optimized to (*x* := 1), which does not have a control dependency from *r* to the assignment to *x*. In contrast, the program (if (*r*) { *x* := 1 } else { *x* := 2 }) cannot be optimized, so the control dependency cannot be removed.

This would be fine if there were no program constructs which could observe control dependencies, but unfortunately the relaxed memory models for Java [27], C++ [8] and LLVM [39] allow for such observations. This means there is the possibility for information flows caused by optimizing compilers, which we investigate in this paper.

Whereas information flows caused by speculation in hardware, or by transactional memory are known [12, 21], these attacks on compiler optimizations, and the relaxed memory models that justify them, are new. In this paper we provide both a formal model for such attacks, and experimental evidence about their practicality.

Information flow attacks motivate the main technical development of this paper. Our model is based on *partially ordered multisets* [14, 32] ("pomsets"), whose labels are given by read and write actions. These can be visualized as a graph where the edges indicate dependencies, for example (r:=x; y:=1; z:=r+1) has an execution modelled by the pomset:



The edge from (R x 1) to (W z 2) indicates a data dependency. The novel aspect of the model is that events have *preconditions* which may be false. These are used in giving the semantics of conditionals, for example (if (x) { y:= 1; z:= 1 } else { y:= 2; z:= 1 }) has an execution:



Th edges from (Rx1) to (Wy1) and (Wy2) indicate control dependencies. The presence of a crossed out (Wy2) indicates an event with an unsatisfiable precondition,

The novel contributions of this paper are:

- a model of program execution that includes speculation (§2),
- examples showing how the model can be applied, including information flow attacks on hardware, optimizing compilers, and transactional memory (§3),
- experimental evidence about how practical it is to mount the new class of attacks (§4), and
- a temporal logic which supports compositional proof (§5).

Readers who wish to focus on the impact of the model can skip past §2 on first reading, and refer back to it when needed.

#### 2 MODEL

The model used in this paper is one of sets of pomsets with event labels of the form  $(\phi \mid a)$ , where  $\phi$  is the event's precondition (such as M = v) and a is the event's action (such as  $W \times v$ ). For example the semantics of the program (x := M) includes the case where M is v, which is written to x, and is captured by the one-event pomset:

$$(M = v \mid W x v)$$

We make few requirements of the logic of preconditions, save that it has includes equalities between expressions, is closed under substitution, and supports a notion of implication.

For example, the set of pomsets [r:=y; x:=r+1] contains:

$$Ry1 \longrightarrow Wx2$$

The semantics is defined compositionally. First, [x:=r+1] contains the pomset:

$$(r = 1 \mid Wx2)$$

Next, we perform the substitution of r with 1 in every precondition, to get that [x:=r+1][1/r] contains the pomset:

$$\boxed{1 = 1 \mid W x 2}$$

and since (1 = 1) is a tautology, we elide it:

This substitution is performed in defining [r:=y; x:=r+1], which contains the pomset:

$$(Ry1) \rightarrow (Wx2)$$

as required. There is an ordering (Ry1) < (Wx2) because the precondition (r=1) depends on r. If the precondition was independent of r then there would be no ordering, for example [r:=y; x:=r+1-r] contains the pomset:

$$\begin{bmatrix} R y 1 \end{bmatrix} \begin{bmatrix} W x 1 \end{bmatrix}$$

since the precondition (r + 1 - r = 1) is independent of r.

The main novelty of our semantics, since it is designed to model speculative evaluation, is in modelling conditionals. In most sets-of-pomsets semantics, a pomset in  $[if(M) \{C\}]$  else [D] would either be given by a pomset in [C] or a pomset in [D]. To model speculative evaluation, we need to allow a pomset in  $[if(M) \{C\}]$  else [D] to be given by both a pomset in [C] and a pomset in [D]. For example,  $[if(M) \{x:=1\}]$  else [D] contains the pomset:

$$M \neq 0 \mid Wx1$$
 
$$M = 0 \mid Wx2$$

that is we have recorded behaviour from both branches of execution. Moreover, an action which is performed on both sides of the conditional can be merged, producing only one event in the resulting pomset. The precondition of the merged event is the disjunction of the preconditions of the original events. For example  $[if(M) \{x:=1; y:=3\}]$  else  $\{x:=2; y:=3\}$  contains the pomset:

$$M \neq 0 \mid Wx1$$
  $M = 0 \mid Wx2$   $(M \neq 0) \lor (M = 0) \mid Wy3$ 

and since  $(M \neq 0) \lor (M = 0)$  is a tautology, this is:

$$M \neq 0 \mid Wx1 ) \quad M = 0 \mid Wx2 ) \quad Wy3$$

Combining this model of conditionals with the model of memory using substitutions gives that  $[if(z) \{x:=1; y:=3\}]$  contains the pomset:

$$\begin{array}{c|c} \hline (Rz1) & \hline & 1 \neq 0 \mid Wx1 \end{array}$$
 
$$\begin{array}{c|c} \hline & 1 = 0 \mid Wx2 \end{array}$$
 
$$\begin{array}{c|c} \hline & Wy3 \end{array}$$

and since  $(1 \neq 0)$  is a tautology and (1 = 0) is unsatisfiable, this is:

$$Rz1 \longrightarrow Wx1 \longrightarrow Wy3$$

Similarly,  $[if(z) \{x:=1; y:=3\}$ else  $\{x:=2; y:=3\} ]$  contains the pomset:

$$\begin{array}{c|c} \hline (Rz0) & \hline (Wy3) \\ \hline \end{array}$$

Note that this semantics captures control dependencies such as (Rz 0) < (Wx 1), independencies such as  $(Rz 0) \ne (Wy 3)$ , and failed speculations such as the crossed out (Wx 1).

In summary, the features we need of the underlying data model are:

- actions, which may read or write to memory locations, and
- preconditions, which form a logic closed under substitution.

from which we can define the operations used in defining the semantics of programs, which include:

- prefixing  $a \to \mathcal{P}$ , which adds an event with precondition  $\phi$  and action a to pomsets in  $\mathcal{P}$ ,
- substitution  $\mathcal{P}[M/x]$ , which performs a substitution on every precondition in  $\mathcal{P}$ , and
- concurrency  $\mathcal{P}_1 \parallel \mathcal{P}_2$ , which unions pomsets from  $\mathcal{P}_1$  and  $\mathcal{P}_2$ , allowing events to be merged.

We make data models precise in §2.1, define pomsets in §2.2, and operations on sets of pomsets in §2.3, which are used to give a compositional semantics for a simple imperative language.

#### 2.1 Data models

A data model consists of:

- a set of *memory locations* X, ranged over by x and y,
- a set of registers  $\mathcal{R}$ , ranged over by r and s,
- a set of *values* V, ranged over by v and w,
- a set of *expressions*  $\mathcal{E}$ , ranged over by M and N,
- a set of *logical formulae*  $\Phi$ , ranged over by  $\phi$  and  $\psi$ , and
- a set of actions  $\mathcal{A}$ , ranged over by a and b,

such that:

- values include at least the constants 0 and 1,
- expressions include at least registers and values,

- expressions are closed under substitutions of the form M[N/r],
- formulae include at least true, false, and equalities of the form (M = N) and (x = N),
- formulae are closed under negation, conjunction, disjunction,
- formulae are closed under substitutions of the form  $\phi[x/r]$  or  $\phi[N/x]$ ,
- there is a relation ⊨ between formulae, and
- there are partial functions R and W :  $\mathcal{A} \to (X \times V)$ ,

We shall say a reads v from x whenever R(a) = (x, v), and a writes v to x whenever W(a) = (x, v), and We shall say  $\phi$  implies  $\psi$  whenever  $\phi \models \psi$ ,  $\phi$  is a tautology whenever true  $\models \phi$ ,  $\phi$  is unsatisfiable whenever  $\phi \models$  false, and  $\phi$  is independent of x whenever  $\phi \models \phi[v/x] \models \phi$  for every v. In examples, the actions are of the form (R x v), which reads v from v, and (W x v), which writes v to v. Going forward, we assume a given data model, though some examples in §3 make use of particular  $\mathcal{A}$ .

# 2.2 3-valued pomsets

Recall the definition of a pomset from [14]:

*Definition 2.1.* A pomset  $(E, \leq, \lambda)$  with alphabet  $\Sigma$  is a partial order  $(E, \leq)$  together with  $\lambda : E \to \Sigma$ .

Going forward, we fix the alphabet  $\Sigma = (\Phi \times \mathcal{A})$ . We will write  $(\phi \mid a)$  for the pair  $(\phi, a)$ , elide  $\phi$  when  $\phi$  is a tautology, and write  $\alpha$  when  $\phi$  is unsatisfiable. We lift terminology from logical formulae and actions to events, for example if  $\lambda(e) = (\phi \mid a)$  and then we say e is unsatisfiable whenever  $\phi$  is unsatisfiable, e writes v to v whenever v writes v to v and so forth. We visualize a pomset as a graph where the nodes are drawn from v, each node v is labelled with v and an edge v and v corresponds to an ordering v and v are corresponds to an ordering v are corresponds to v and v are corresponds to v are corresponds to v and v are corresp



is a visualization of the pomset where:

$$0 \le 1$$
  $0 \le 2$   $\lambda(0) = (true, R x 1)$   $\lambda(1) = (false, W y 0)$   $\lambda(2) = (true, W y 1)$ 

We are building a compositional semantics of shared memory concurrency, which means we require a notion of when a read has a matching write. This is a property we require of closed programs, but *not* of open programs. For example a program whose semantics includes:



may be put in parallel with another program which writes 0 to x. If the program is closed with respect to x though, such an execution cannot exist, so we need each read of x to have a matching write. This is captured by defining when e reads x from d [3]. A preliminary definition (which, as we shall see, needs to be strengthened) is:

- d < e.
- if *e* is satisfiable, then *d* is a tautology,
- d writes v to x, and e reads v from x, and
- there is no d < c < e such that c writes to x.

In diagrams, for readability we often highlight the reads-from edges, for example:



Unfortunately by itself, this is not enough. The problem is the final clause saying that there does not exist an x-blocking event c between d and e. Unfortunately, concurrency can turn events that were not x-blockers into an x-blocker, even if the new thread does not mention x. To see this, consider a program with execution:



If this is placed in parallel with:

then a possible execution is:



and now the (W x 2) event is an x-blocker, so (R x 1) cannot read from (W x 1).

There are a number of ways this can be addressed, for example in models such as [6] the readsfrom relation is taken as a primitive. In this paper, we propose 3-valued pomsets as a solution. These are pomsets in which, in addition to positive statements (d < e) (interpreted as e depends on d), we also have negative statements  $(d \nmid e)$  (interpreted as e cannot depend on d).

Definition 2.2. A 3-valued poset  $(E, \leq, \downarrow)$  is a poset  $(E, \leq)$  together with  $\not \subseteq (E \times E)$  such that:

- if  $d \le e$  then  $e \nmid d$ ,
- if  $d \le e$  and  $d \nmid e$  then d = e,
- if  $c \ge d \leqslant e$  or  $c \leqslant d \ge e$  then  $c \leqslant e$ .

Definition 2.3. A 3-valued pomset  $(E, \leq, \downarrow, \lambda)$  is a 3-valued poset  $(E, \leq, \downarrow)$  and a pomset  $(E, \leq, \lambda)$ .

In diagrams, we visualize  $(e \nmid d)$  as a dashed arrow from d to e (note the flip of direction). We wrefer to edges introduced by (d < e) as *strong edges* and by  $(e \nmid d)$  as *weak edges*, for example:

$$(\mathbf{W} \times \mathbf{0}) \rightarrow (\mathbf{W} \times \mathbf{1}) \rightarrow (\mathbf{R} \times \mathbf{1}) \rightarrow (\mathbf{W} \times \mathbf{2})$$

Structures similar to 3-valued pomsets have come up in many guises, for example rough sets [30] or ultrametrics over  $\{0, \frac{1}{2}, 1\}$ . They correspond to axioms A1–A3 of Lamport's *system executions* [22]. They are the notion of pomset given by interpreting  $d \le e$  in a 3-valued logic [36].

We strengthen the definition of reads-from to require not just that no blocker exists, but that any candidate blocker must either have  $d \nmid c$  or  $c \nmid e$ . This ensures that any further concurrency cannot turn a non-blocker into a blocker.

*Definition 2.4.* In a 3-valued pomset, *e can read x from d* whenever:

- d < e,
- if *e* is satisfiable, then *d* is a tautology,
- d writes v to x, and e reads v from x, and
- if *c* writes to *x* then either  $d \leqslant c$  or  $c \leqslant e$ .

*Definition 2.5.* A 3-valued pomset is *x*-closed if, for every  $e \in E$ :

- *e* is independent of *x*, and
- if e reads from x, then there is a d such that e can read x fom d.

In our previous example, in order for (R x 1) to read from (W x 1), we either need  $(W x 1) \neq (W x 2)$  or  $(W x 2) \neq (W x 1)$ , for example:



then putting this in parallel as before results in:



but this is *not* a valid 3-valued pomset, since (W x 2) < (R x 1) but also  $(W x 2) \nleq (R x 1)$ , which is a contradiction.

The definitions as they stand allow cycles in weak edges. This is necessary for examples such as  $(x:=y-1; x:=1 \mid | y:=x-1; y:=1)$  which has execution:



However, in order to model fencing or transactions, we need to ban executions such as:



The problem here is the weak cycle between (W x 0) and (W x 1), which according to Definition 2.4, allows both (R x 0) and (R x 1), even though one of them must be a stale value. This can be addressed

by requiring ≰ to form a *per-location* partial order. This is a form of partial coherence, and can be strengthened to total coherence by requiring *₹* to be a per-location total order.

Definition 2.6. A 3-valued pomset is partially (resp. totally) x-coherent if, when restricted to events which write to x,  $\leq$  forms a partial (resp. total) order.

### 2.3 Sets of 3-valued pomsets

Our model of programs is going to be sets of 3-valued pomsets. In this section we define the operations on pomsets which are used in giving the semantics. These are operations such as prefixing, parallel composition, restriction, and so on; they are familiar from models of concurrency such as [9], but adapted here to the setting of speculative evaluation.

Definition 2.7. Let  $P_0 \in (\mathcal{P}_1 \parallel \mathcal{P}_2)$  whenever there are  $P_1 \in \mathcal{P}_1$  and  $P_2 \in \mathcal{P}_2$  such that:

- $E_0 = E_1 \cup E_2$ ,
- if  $e \leq_1 d$  or  $e \leq_2 d$  then  $e \leq_0 d$ ,
- if  $e \not\downarrow_1 d$  or  $e \not\downarrow_2 d$  then  $e \not\downarrow_0 d$ ,
- if  $\lambda_0(e) = (\phi_0 \mid a)$  then either:
  - $-\lambda_1(e) = (\phi_1 \mid a)$  and  $\lambda_2(e) = (\phi_2 \mid a)$  and  $\phi_0$  implies  $\phi_1 \vee \phi_2$ ,
  - $λ_1(e) = (φ_1 | a)$  and  $e ∉ E_2$  and  $φ_0$  implies  $φ_1$ , or
  - $λ_2(e) = (φ_2 | a)$  and  $e ∉ E_1$  and  $φ_0$  implies  $φ_2$ .

We use  $\mathcal{P}_1 \parallel \mathcal{P}_2$  in defining the semantics of conditionals and concurrency. It contains the union of pomsets from  $\mathcal{P}_1$  and  $\mathcal{P}_2$ , allowing overlap as long as they agree on actions. For example, if  $\mathcal{P}_1$ and  $\mathcal{P}_2$  contain:

$$\overbrace{\phi \mid a} \rightarrow \underbrace{\psi_1 \mid b} \qquad \qquad \underbrace{\psi_2 \mid b} \rightarrow \underbrace{\chi \mid c}$$

then  $\mathcal{P}_1 \parallel \mathcal{P}_2$  contains:

*Definition 2.8.* Let  $a \to \mathcal{P}$  be the set  $\mathcal{P}'$  where  $P' \in \mathcal{P}'$  whenever there is  $P \in \mathcal{P}$  such that:

- $\bullet \ E' = E \cup \{c\}.$
- if  $d \le e$  then  $d \le' e$ ,
- if  $d \not < e$  then  $d \not <' e$ ,
- $\lambda'(c) = (\phi, a)$ , and
- if  $\lambda(e) = (\psi \mid b)$  then:

$$-\lambda'(e) = (\psi' \mid b) \text{ then.}$$

$$-\lambda'(e) = (\psi' \mid b),$$

$$-\psi' \text{ implies } \begin{cases} \psi[v/x] & \text{if } a \text{ reads } v \text{ from } x \text{ and } c <' e \\ \psi[v/x] \text{ and } \psi \text{ if } a \text{ reads } v \text{ from } x \end{cases}$$

$$[\text{INDEPENDENT READ}]$$

$$\downarrow \psi \qquad \text{otherwise} \qquad [\text{NON-READ}]$$

- if a and b both write to y, then  $c \not\geq' e$ .

Prefixing is used to define the semantics of reads and writes, and adds a new event c with action a. The tricky part of the definition is the named cases, which places requirements on read dependencies. If a reads v from x, we have to decide whether e depends on c for some e with old precondition  $\psi$  and new precondition  $\psi'$ . The first case [DEPENDENT READ] is that the dependency exists, in which case  $\psi'$  just has to imply  $\psi[v/x]$ . The more interesting case is [INDEPENDENT READ], in which case  $\psi'$  has to imply  $\psi[v/x]$  and  $\psi$ . This corresponds to a case where e can be performed with or without c. In particular, if  $\psi$  is independent of x then we can pick  $\psi'$  to be  $\psi$ , and the independent read case will apply. For example, if  $\mathcal{P}$  contains:

$$(\psi \mid b) \rightarrow (\chi \mid c)$$

where a reads v from x and writes w to y, b writes to y, and  $\psi$  is independent of x, then  $a \to \mathcal{P}$  contains:

$$\overbrace{\phi \mid a} \rightarrow \underbrace{\psi \mid b} \rightarrow \underbrace{\chi[\vec{v}/\vec{x}] \mid c}$$

Definition 2.9. Let  $\mathcal{P}[M/x]$  be the set  $\mathcal{P}'$  where  $P' \in \mathcal{P}'$  whenever there is  $P \in \mathcal{P}$  such that:

- $\bullet$  E' = E,
- if  $d \le e$  then  $d \le' e$ , and
- if  $d \leqslant e$  then  $d \leqslant' e$ , and
- if  $\lambda(e) = (\psi \mid a)$  then  $\lambda'(e) = (\psi[M/x] \mid a)$ .

and similarly for  $\mathcal{P}[x/r]$ .

*Definition 2.10.* Let  $(\phi \mid \mathcal{P})$  be the subset of  $\mathcal{P}$  such that  $P \in \mathcal{P}$  whenever:

• if  $\lambda(e) = (\psi \mid a)$  then  $\phi$  implies  $\psi$ .

*Definition 2.11.* Let  $(vx \cdot P)$  be the subset of P such that  $P \in P$  whenever P is x-closed and partially x-coherent.

We can use the operations defined above to give the semantics of a simple concurrent imperative programming language.

Definition 2.12.

A write generates a write event that may be visible to other threads. A read may see a thread-local value, or it may generate a read event that must be justified by another thread. In the latter case, occurrences of r are replaced with x (rather than v) to ensure that dependencies are tracked properly.

We have completed the formal definition of our model of speculative evaluation, and now turn to examples of this model in use.

### 3 EXAMPLES

In this section, we shall start off by giving some basic examples, and then show how three different information flow attacks can be modeled. We cover Spectre in §3.7, new attacks on compiler optimizations in §3.9–3.10, and attacks on transactions in §3.12.

# 3.1 Sequential memory accesses

In the semantics of memory, there are two very different ways memory can be accessed: sequentially or concurrently. These are modelled differently, since hardware and compilers give very different

guarantees about their behaviour. In this section, we discuss the sequential semantics, and leave the concurrent semantics to §3.2.

Consider the program (x := 0; y := x + 1). One execution of this program is where the write to y uses the sequential value of x, which is 0:

$$(\mathbf{W} x 0) (\mathbf{W} y 1)$$

To see how this execution is modelled, we first expand out the syntax sugar to get the program (x:=0; r:=x; y:=r+1; skip) Now [skip] is just  $\{\emptyset\}$ , and [y:=r+1; skip] includes:

$$(r + 1 = 1 \mid (Wy1) \rightarrow [skip][1/y])$$

which contains the pomset:

$$\boxed{r+1=1\mid \mathsf{W}\,y\,1}$$

expressing that this program can write 1 to y, as long as the precondition (r + 1 = 1) is satisfied. Now [r:=x; y:=r+1; skip] has two cases, the sequential case (which does not introduce a read action) and the concurrent case (which does). For the moment, we are interested in the sequential case, which is:

$$[y := r + 1; skip][x/r]$$

which contains the pomset:

$$x + 1 = 1 \mid Wy1$$

In this pomset, the precondition is (x + 1 = 1), which specifies a property of the thread-local value of x. Finally [x:=0; r:=x; y:=r+1; skip] includes:

$$(0 = 0 \mid (\mathbf{W} \times 0)) \rightarrow [r := x; y := r + 1; skip][0/x])$$

which contains the pomset:

$$\boxed{0 = 0 \mid W \times 0} \qquad \boxed{0 + 1 = 1 \mid W \times 1}$$

all of whose preconditions are tautologies, so this has the expected behaviour:

$$(\mathbf{W} x 0) (\mathbf{W} y 1)$$

There is no dependency between (Wx 0) and (Wy 1), since (0 + 1 = 1) is independent of x.

This example demonstrates how preconditions capture the sequential semantics of memory. In an execution containing an event with label  $(\phi \mid a)$ , one way the precondition  $\phi$  can be discharged is by an assignment x := M, which performs a substitution [M/x]. This is a variant of the Hoare semantics for assignment, where if C has precondition  $\phi$  then x := M; C has precondition  $\phi[M/x]$ .

# 3.2 Concurrent memory accesses

We now turn to the case of concurrent accesses to memory. Consider a concurrent version of the program from §3.1:  $(x:=1 \mid | y:=x+1)$ . One execution of this program is where the write to y performs a concurrent read of x:

$$(Wx1) \longrightarrow (Rx1) \longrightarrow (Wy2)$$

To see how this execution is modelled, we first expand out the syntax sugar to get the program (x:=1; skip || r:=x; y:=r+1; skip). As before, [y:=r+1; skip] includes:

$$(r + 1 = 2 \mid (W y 2) \rightarrow [skip][2/y])$$

which contains the pomset:

$$(r+1=2\mid Wy2)$$

As before, [r:=x; y:=r+1; skip] has two cases. We are now interested in the concurrent case, which includes:

$$(Rx1) \rightarrow [y:=r+1; skip][x/r]$$

which contains the pomset:

$$(Rx1) \rightarrow (Wy2)$$

Note that (R x 1) reads 1 from x, and while (x + 1 = 2)[1/x] is a tautlogy, (x + 1 = 2) is not, and so there is a dependency (R x 1) < (W y 2).

Now, [x:=1; skip] includes the pomset:

$$(\mathbf{W} \mathbf{x} \mathbf{1})$$

and so [x:=1; skip || r:=x; y:=r+1; skip] includes:

as expected, including a reads-from dependency (W x 1) < (R x 1).

This example demonstrates how read and write events capture the concurrent semantics of memory. In an execution containing an event with label (Rxv), if the execution is x-closed, then there must be an event it reads from, for example one labelled (Wxv).

## 3.3 Independent writes

Consider an example with two independent writes (x := 1; y := 2). This has semantics including:

$$(W x 1) \rightarrow (W y 2) \rightarrow \{\emptyset\}$$

One of the executions this contains is:

$$(\mathbf{W} x 1) \rightarrow (\mathbf{W} y 2)$$

but it also contains:

$$(\mathbf{W} \mathbf{x} \mathbf{1})$$
  $(\mathbf{W} \mathbf{y} \mathbf{2})$ 

and:

$$(Wx1) \leftarrow (Wy2)$$

since there is no requirement that (W x 1) < (W y 2).

Thus, the semantics of (x := 1; y := 2) is the same as the semantics of (y := 2; x := 1).

#### 3.4 Independent reads and writes

Whereas write prefixing introduces weak dependencies on events which write to the same location, read prefixing introduces strong dependencies on preconditions which depend on the location being read. For example in §3.2 we saw that the program (y:=x+1) includes the pomset:

$$(R x 1) \rightarrow (W y 2)$$

but since (x + 1 = 2) depends on x, we have the requirement that  $(Rx 1) \le (Wy 2)$ .

This is in contrast to the program (r := x; y := r + 2 - r). Since (x + 2 - x = 2) is independent of x (at least for integer arithmetic) this contains:

$$(Rx1)$$
  $(Wy2)$ 

and so the semantics of (r := x; y := r + 2 - r) is the same as the semantics of (y := 2; r := x).

Note this this example shows that we are not just dealing with a syntactic notion of dependency, which is common in hardware models of memory. In syntactic dependency, since r occurs free in (y := r + 2 - r), there would be a dependency between (r := x) and (y := r + 2 - r). In contrast, this model is based on logical implication, which can be interpreted semantically.

# 3.5 Control dependencies

Conditionals introduce control dependencies, for example consider the program:

$$r := z$$
; if  $(r) \{ x := 1 \}$  else  $\{ y := 2 \}$ 

This includes executions in which the false branch is taken:



and ones where the true branch is taken:

$$(Rz1) \rightarrow (Wx1)$$
  $(Wx2)$ 

In both cases, we record the actions in the branch that was not taken. This is a novel feature of this model, and is intended to capture speculative evaluation. In §3.7 we will show how this model captures Spectre-like information flow attacks, once the attacker is provided with the ability to observe such speculations.

To see how these executions are modelled, consider the semantics of [x:=1; skip], which contains any pomset of the form:

$$\phi \mid Wx1$$

in particular it contains:

$$r \neq 0 \mid Wx1$$

Similarly [y:=2; skip] contains:

$$(r = 0 \mid Wy2)$$

and so  $[if(r) \{x:=1; skip\}]$  else  $\{y:=2; skip\}]$  contains:

$$r \neq 0 \mid Wx1$$
  $r = 0 \mid Wy2$ 

Now, the semantics of concurrent read performs substitutions, for example:

which gives the required pomset:

$$(Rz0)$$
  $(Wy2)$ 

Note that the precondition r = 0 is dependent on r, and so there is a dependency (Rz 0) < (Wy 2), modelling the control dependency introduced by the conditional.

### 3.6 Control independencies

In most models of control dependencies, the dependency relation is syntactic, based on whether the action occurs inside syntactically inside a conditional. In contrast, the notion in this model is semantic: if an action can occur on both sides of a conditional, there is no control dependency. Consider a variant of the example from §3.5:

$$r := z$$
; if  $(r) \{ x := 1 \}$  else  $\{ x := 1 \}$ 

This has the expected execution in which the control dependencies exist:

$$(Rz0)$$
  $(Wx1)$ 

but it also has an execution in which the two writes of 1 to x are merged, resulting in no dependency:

$$\begin{bmatrix} Rz0 \end{bmatrix} \begin{bmatrix} Wx1 \end{bmatrix}$$

To see how this arises, consider the definition of  $[if(r) \{x := 1; skip\}]$  else  $\{x := 1; skip\}$ 

$$\mathcal{P}_1 \parallel \mathcal{P}_2 \quad \text{where} \quad \mathcal{P}_1 = (r \neq 0 \mid \llbracket x \colon = 1; \, \mathsf{skip} \rrbracket) \quad \text{and} \quad \mathcal{P}_2 = (r = 0 \mid \llbracket x \colon = 1; \, \mathsf{skip} \rrbracket)$$

Now, one pomset in  $\mathcal{P}_1$  is:

$$r \neq 0 \mid Wx1$$

that is  $P_1$  where:

$$E_1 = \{e\}$$
  $\lambda_1(e) = (r \neq 0, W x 1)$ 

and similarly, one pomset in  $\mathcal{P}_2$  is:

$$r = 0 \mid Wx1$$

that is  $P_2$  where:

$$E_2 = \{e\}$$
  $\lambda_2(e) = (r = 0, Wx 1)$ 

Crucuially, in the definition of  $\mathcal{P}_1 \parallel \mathcal{P}_2$  there is *no* requirement that  $E_1$  and  $E_2$  are disjoint, and in this case they overlap at e. As a result, one pomset in  $\mathcal{P}_1 \sqcup \mathcal{P}_2$  is  $P_0$  where:

$$E_0 = \{e\}$$
  $\lambda_0(e) = (r \neq 0 \lor r = 0, W x 1)$ 

that is:

Note that this pomset has no precondition dependent on r, since  $(r \neq 0 \lor r = 0)$  does not depend on r, which is why we end up with an execution without a control dependency:

$$(Rz0)$$
  $(Wx1)$ 

This semantics captures compiler optimizations which may, for example merge code executed on both branches of a conditional, or hoist constant assignments out of loops.

We can now see the counterintuitive behavior of conditionals in the presence of control dependencies. There are programs such as if (z) { x := 1 } else { x := 1 } executions in which (W x 1) is independent of (R z 1):

$$Rz1$$
  $Wx1$ 

while programs such as if (z) { x := 1 } else { y := 2 } only have executions in which (W x 1) is dependent on (R z 1):

$$(Rz1)$$
  $(Wx1)$   $(Wx2)$ 

so these programs have different dependency relations, depending on conditional branches that were not taken. In §3.9 we shall see that this has security implications, since relaxed memory can observe dependency. The attack is similar to Spectre, so we shall take a detour to see how Spectre can be modeled in this setting.

### 3.7 Spectre

We give a simplified model of Spectre attacks, ignoring the details of timing. In this model, we extend programs with the ability to tell whether a memory location has been touched (in practice this is implemented using timing attacks on the cache). For example, we can model Spectre by:

```
var a; if (canRead(SECRET)) { a[SECRET]:=1} else if (touched a[0]) { x:=0} else if (touched a[1]) { x:=1}
```

This is a low-security program, which is attempting to discover the value of a high-security variable SECRET. The low-security program is allowed to attempt to escalate its privileges by checking that it is allowed to read a high-security variable:

```
if (canRead(SECRET)) \{\cdots code allowed to read SECRET\cdots\} else \{\cdots\}
```

In this case, canRead(SECRET) is false, so the fallback code is executed. Unfortunately, the escalated code is speculatively evaluated, which allows information to leak by testing for which memory locations have been touched.

We model the touched test by introducing a new action (Tx) and defining:

$$\llbracket \mathsf{if}(\mathsf{touched}\, x) \, \{\, C\, \} \, \mathsf{else} \, \{\, D\, \} \rrbracket \quad = \quad ((\mathsf{T}\, x) \, \rightarrow \, \llbracket C \rrbracket) \, \cup \, \llbracket D \rrbracket$$

The additional requirement we need to add for x-closure is:

• if  $\lambda(e) = (\phi \mid \mathsf{T} x)$  then there is  $d \not \geq e$  where d reads or writes x.

Note that there is no requirement that *d* be satisfiable, and indeed Spectre has the execution:

$$\begin{array}{c}
R \text{ SECRET 1} \longrightarrow Walt 1 \longrightarrow Ta[1] \longrightarrow Wx1
\end{array}$$

Putting this in parallel with a high-security write to SECRET gives:

but due the requirement of *a*-closure we do *not* have:

Thus, the attacker has managed to leak the value of a high-security location to a low-security one: if (W x 1) is observed, the SECRET must have been 1.

This shows how our model of speculation can express (very abstract, untimed) Spectre attacks.

# 3.8 Relaxed memory

In §3.9 we present an information flow attack on relaxed memory, similar to Spectre in that it relies on speculative evaluation. Unlike Spectre it does not depend on timing attacks, but instead is based on the sensitivity of relaxed memory to data dependencies. For this reason, we present a simple model of relaxed memory, which is strong enough to capture this attack. The model includes concurrent memory accesses, which can introduce concurrent reads-from. Since we are allowing events to be partially ordered, this gives a simple model of relaxed memory, for example an independent read independent write (IRIW) example is:

$$x := 0$$
;  $x := x + 1 \mid | y := 0$ ;  $y := y + 1 \mid | if(x) \{ r := y \} | | if(y) \{ s := x \}$ 

which includes the execution:



This model does not introduce thin-air reads (TAR), for example the TAR pit is  $(x := y \mid \mid y := x)$  but an attempt to produce a value from thin air fails, for the usual reason of producing a cycle in  $\leq$ , as shown on the left below:



This cycle can be broken by removing a dependency, for example  $0 (x := y \mid | r := x; y := r + 1 - r)$  has the execution on the right above. Note that  $(Rx 1) \not \leq (Wy 1)$ , so this does not introduce a cycle.

Although it is not the primary focus of this paper, our model may be an attractive model of relaxed memory. Many prior models either permit thin-air executions that our model forbids or forbid desirable executions that our model permits.

In §5, we develop a logic which allows us to prove that our semantics forbids thin air examples that are permitted by prior speculative models [16, 19, 27].

Our model passes all of the causality test cases [33]. Significantly, this includes test case 9, which is forbidden by [18], one of the few models that disallows the thin air example from §5. We present this test case in the appendix, where we also discuss the thread inlining examples from [27].

Batty et al. [5] showed that the thin-air problem has no per-candidate-execution solution for C++. This result does not apply to our model, as the semantics of a conditional can depend on the semantics of both branches.

# 3.9 Information flow attacks on relaxed memory

Consider an attacker program, again using security checks to try to learn a SECRET. Whereas SPECTRE uses hardware capabilities, which have to be modeled by adding extra capabilities to the language, this new attacker works by exploiting relaxed memory which can result in unexpected information flows. The attacker program is:

```
\begin{array}{l} \text{var} \, x \! : \! = \! 0 \, ; \\ y \! : \! = \! x \, \mid \mid \, \text{if} \, (y \! = \! 0) \, \{ \, x \! : \! = \! 1 \, \} \\ & \text{else if} \, (\text{canRead(SECRET)}) \, \{ \, x \! : \! = \! \text{SECRET} \, \} \\ & \text{else} \, \{ \, x \! : \! = \! 1 \, ; \, z \! : \! = \! 1 \, \} \end{array}
```

In the case where SECRET is 2, this has many executions, one of which is:



but there are no executions which exhibit (W z 1), since any attempt to do so produces a cycle:



In the case where SECRET is 1, there is an execution:



Note that in this case, there is no dependency from (R y 1) to (W x 1), which is what makes this execution possible. Thus, if the attacker sees an execution with (W z 1), they can conclude that SECRET is 1, which is an information flow attack.

This attack is not just an artifact of the model, since the same behavior can be exhibited by compiler optimizations. Consider the program fragment:

$$if(y = 0) \{x := 1\} else if(canRead(SECRET)) \{x := SECRET\} else \{x := 1; z := 1\}$$

Now, in the case where SECRET is a constant 1, the compiler can inline it:

$$if(y = 0) \{x := 1\} else if(canRead(SECRET)) \{x := 1\} else \{x := 1; z := 1\}$$

and lift the assignment to *x* out of the if statement:

$$x:=1$$
; if  $(y=0)$  { } else if (canRead(SECRET)) { } else {  $z:=1$  }

After these optimizations, a sequentially consistent execution exhibits (W z 1). We discuss the practicality of this attack further in §4.

This approach can be generalized to detect information flows in arbitrary code. If we replace the code fragment above with:

$$if(y = 0) \{ x := P(0) \} else \{ x := P(1); z := 1 \}$$

then (Wz1) is possible only if P is independent of its input. Thus, the conditional is able to capture multiple executions, as in [4].

# 3.10 Dead store elimination

A common compiler optimization is *dead store elimination*, in which writes are omitted if they will be overwritten by a subsequent write later in the same thread. We can model eliminated writes by ones with an unsatisfiable precondition. For example, one execution of  $(x:=1; x:=2) \mid | (r:=x)$  is:

$$\begin{array}{c} W \times 2 \\ \hline \end{array} \longrightarrow \begin{array}{c} R \times 2 \\ \hline \end{array}$$

Recall that for any satisfiable e, if e reads x from y then d is a tautology. This means that, although we can eliminate (W x 1) we cannot eliminate (W x 2).

One heuristic that a compiler might adopt is to only eliminate writes that are guaranteed to be followed by another write to the same variable. This can be formalized by saying that d is eliminatable if there is a  $e \nmid d$  such that e is a tautology and d writes to every location e writes to. A model of dead store elimination is one where, in every pomset, every eliminatable event is unsatisfiable. This simple model includes the examples above.

Note that if dead store elimination is *always* performed, then there is an information flow attack similar to the one in §3.9. Consider the program:

In the case that SECRET is 0, there is an execution:

where  $\phi$  is (¬canRead(SECRET)), which is not a tautology, and so the (W x 1) event is not eliminated. In the case that SECRET is not 0, the matching execution is:

$$(R \times 2)$$
  $(W \times 2)$ 

Now the (W x 2) event is a guaranteed write, so the (W x 1) is eliminated, and so cannot be read. In the case that the attacker can rely on dead store elimination taking place, this is an information flow: if the attacker observes x to be 1, then they know SECRET is 0. We return to this attack in §4.

#### 3.11 Release/acquire synchronization

In relaxed memory models, synchronization actions act as memory fences: that is, they are a barrier to reordering memory accesses. In this section, we present a simple model of release/acquire fencing. In §3.12, we show that this can be scaled up to a model of transactional memory.

We assume there are sets Rel and Acq  $\subseteq \mathcal{A}$ . We say that a is a *release action* if  $a \in \text{Rel}$  and a is an *acquire action* if  $a \in \text{Acq}$ . In a pomset, a release event is one labelled with a release action, and an acquire event is one labelled by an acquire action. To give the semantics of fences, we add extra constraints to definition 2.8 of prefixing (recalling that c is the event being introduced):

- $c \le e$  whenever c is an acquire event or e is a release event
- if *c* is an acquire event then *e* is independent of *x*, for every *x*.

The first constraint ensures that events are ordered before a release and after an acquire. The second constraint ensures that thread-local reads do not cross acquire fences.

In examples, we will use releasing writes and acquiring reads:

- (Rel xv), a release action that writes v to x, and
- (Acq x v), an acquire action that reads v from x.

The semantics of programs with releasing write and acquiring read are the same as for regular write and read and, but with Rel xv replacing W xv and Acq xv replacing R xv.

To see the need for the first constraint on prefixing, consider the program:

$$var x := 0$$
;  $var f := 0$ ;  $(x := 1; rel f := 1 || acq r := f; s := x)$ 

This has an execution:



but not:

since  $(Wx0) \ngeq (Wx1) < (Rx0)$ , so this pomset does not satisfy the requirements to be *x*-closed. If we replace the release with a plain write, then the outcome (Acq f 1) and (Rx0) is possible:

since no order is required between (W x 1) and (W f 1). Symmetrically, if we replace the acquire of the original program with a plain read, then the outcome (R f 1) and (R x 0) is possible.

To see the need for the second constraint on prefixing, consider the program:

$$(x:=1; rel f:=1; acq r:=f; y:=x) \mid\mid (acq s:=f; x:=2; rel f:=2;)$$

whose semantics includes execution:



This execution exists because [y:=x] includes  $(x=1 \mid Wy1)$  and the precondition x=1 is fulfilled by the preceding write x:=1. In implementation term, this execution is reading 1 from x in a "stale cache." The alternative execution that attempts to read 1 from the x in "main memory," has an explicit (Rx1) between (Acq f 2) and (Wy1), and thus will fail to be x-closed.

To prevent thread-local writes from crossing release/acquire pairs, we require that pomsets in the semantics of acquire have no free locations. This corresponds to the idea that acquires flush the read cache, and therefore reads must reload values from main memory after an acquire.

#### 3.12 Transactions

We present a model of transactional memory [24] that is sufficient to capture PRIME+ABORT attacks [12]. We make several simplifying assumptions: transactions are serializable, strongly isolated, and only abort due to cache conflicts. To model the latter, we assume that the set of locations X is partitioned into *cache sets*.

The action  $(B v) \in Acq$  represents the begin of a transaction with id v and  $(C v) \in Rel$  represents the corresponding commit. We model a language in which transactions have explicit identifiers (which we elide in examples) abort handlers (which we elide when they are empty):

The semantics of a transaction has two cases: a successful case (the transaction body) and an aborted case (the transaction body and the recovery code, and marks the body as being unsatisfiable). For

example, two executions of (begin; x := 1; x := 2; commit; onabort  $\{y := 1\}$ ) are:



At top level, we require that pomsets be serializable, as defined below.

Definition 3.1. We say that event c matches b if  $\lambda(c) = (C v)$  and  $\lambda(b) = (B v)$ . We say that begin event b begins e if  $b \le e$  and there is no intervening matching commit; in this case e belongs to b. We say that commit event c commits e if  $e \le c$  and there is no intervening matching begin.

Definition 3.2. A pomset is serializable if:

- (1) no two begins have the same id,
- (2) every commit follows the matching begin,
- (3)  $\leq$  totally orders tautological begins and commits,
- (4) if *b* begins *e*, but not *d*, and  $d \le e$  then  $d \le b$ ,
- (5) if *c* ends *e*, but not *d*, and  $e \le d$  then  $c \le d$ ,
- (6) if *e* and *d* belong to *b* and read the same location, then both read the same value, and
- (7) if e belongs to b, then e implies some matching c that ends e.

In discussion, we identify transactions by their unique begin event. A transaction that does not abort is *successful*. Conditions 1-5 ensure serializability of committed transactions. Conditions 4-6 also ensure strong isolation for non-transactional events [13]. Condition 7 ensures that all events in aborted transactions are unsatisfiable. For example Conditions 5 and 7 rule out executions (which violate strong isolation and atomicity):



In order to model PRIME+ABORT, we need a mechanism for modeling *why* a transaction aborts, as this can be used as a back channel. We model a simple form of concurrent transaction, which aborts when it encounters a memory conflict—this is similar to the treatment of touched in §3.7.

Definition 3.3. A commit event c matching b aborts due to memory conflict if there is some e ended by c, and some tautologous  $b \nmid d \nmid c$  that does not belong to b such that e and d touch locations in the same cache set.

PRIME+ABORT requires an honest agent whose cache-set access depends upon a secret. If a[0] and a[1] belong to separate cache sets, then such an honest agent is:

$$a[SECRET] := 1$$

The attack relies on discovery of some y which belongs to the cache-set of a[1]. Then the program

begin; 
$$y := 0$$
;  $r :=$ commit; onabort  $\{x := 1\}$ 

can write 1 to *x* if the SECRET is 1, in which case the following execution is possible.



If the attacker knows that commits only abort due to memory conflicts, then this attack is an information flow, since the memory conflict only happens when the SECRET is 1.

#### 4 EXPERIMENTS

One theme of this paper is that optimizations not typically part of formal abstractions can result in information-flow leaks. This is typified by the Spectre attack, which leverages speculative execution, a hardware optimization. §3.9 and §3.10 presented other attacks along the same line, which leverage relaxed memory models and dead store elimination respectively. These attacks may result, not from hardware optimizations, but from common *compiler* optimizations. These attacks also, unlike Spectre, do not rely on timing side channels, or indeed timers of any kind, bypassing many common Spectre mitigations [?].

In this section we present concrete implementations of the attacks outlined in §3.9 and §3.10, in both cases leveraging compiler optimizations to construct an information flow attack. The attacker model for these attacks (detailed in §4.1) is currently unrealistic in a real-world sense, rendering these attacks proof-of-concepts rather than immediately exploitable vulnerabilities. However, we believe the novelty of their general mechanisms may lead to interesting discussion; and with much more development, these attacks may evolve into genuine threats against real-world targets such as JIT compilers. We demonstrate the efficacy of both of our concrete proof-of-concept attacks against the clang and gcc C compilers.

All of our experiments are performed on a {describe machine} with clang version {clang version} and gcc version {gcc version}.

#### 4.1 Attacker model

In our attacker model, we assume that there is a SECRET which an attacker wishes to learn; for instance, SECRET may be a cryptographic key hardcoded into the application. This SECRET is known at compile time, but may not be accessed except behind a security check. Since the attacker is running with low security privileges, the security check always fails, so the attacker can only access SECRET in dead code. The attacker has no capabilities other than writing and executing code — in particular the attacker may not disassemble the compiler or libraries to learn the SECRET directly; may not examine the internal state of the compiler; may not access timers of any kind; and may not leverage hardware side channels. The attacker's goal is to learn the value of the SECRET.

As a hypothetical concrete example, suppose there is a library which contains a hardcoded SECRET such as an API or signing key, which cannot be accessed directly, only through a function guarded by a security check:

```
private static uint SECRET = 0x1234;
public uint get_secret() {
  if (canRead(SECRET)) { return SECRET; }
  else { return 0; }
}
```

As noted above, this is not necessarily a realistic attacker model, since in most cases secrets are only known at run time rather than compile time, which means that the attacks presented in this section are more of theoretical interest than practical concern. However, the mechanism of the attack is novel and could potentially be applied in other contexts. For instance, many real-world contexts allow attackers (untrusted or third-party entities) to write code in a scripting language which is then compiled alongside and integrated into a larger application, often using a just-in-time (JIT) compiler. JavaScript code from third-party websites running in a browser is a common example of this. Our attack gives an attacker similar capabilities against a compiler, except it considers the simpler setting of using C code against a C compiler. One could imagine a similar attack using JavaScript against browser JIT compilers, where the compiler may have access to interesting secrets

```
SECRET == 0
                    SECRET == 1
 mov f(%rip), %eax
                     mov f(%rip), %eax
 mov $1, x(%rip)
                     mov y(%rip), %eax
  test %eax, %eax
                      mov $1, x(%rip)
  ie label1
                      test %eax, %eax
  mov %0, x(%rip)
                      sete %eax
label1:
                      ret
 mov y(%rip), %eax
  test %eax, %eax
  sete %eax
  ret
```

Fig. 1. (Simplified) x86 assembly output from gcc for the main thread of the load-store reordering attack. In particular, note that the order between mov 1, x(rip) and mov y(rip), %eax is different in the two cases. The call to canRead(SECRET) has been inlined; we implemented canRead(x) as return f; where volatile bool f = false;. Thus, gcc preserves the read of f even when its value is unused, as in the case on the right.

in the browser itself, and may be able to optimize based on those secrets. We plan to explore JavaScript attacks of this type as future work.

## 4.2 Load-store reordering attack

We begin by examining the attack in §3.9 in more detail, subject to the attacker model given above. In particular, we show that by exploiting compiler optimizations which perform load-store reordering, an attacker can learn the value of a compile-time SECRET despite only being allowed to use it inside dead code, that is, code that can never be executed at runtime. This attack was tested and works against gcc version {gcc version}.

The form of the attack presented in §3.9 works in theory, but in practice, just because a compiler is *allowed* to perform a load-store reordering doesn't mean that it *will*. We found that gcc and clang chose to read y into a register first (before writing to x), regardless of the value of SECRET. However, we did find a related pattern in which gcc will emit a different ordering of the read of y and the write of x depending on the value of a SECRET:

```
\begin{array}{l} \text{var} \, x \! : \! = \! 0 \, ; \\ y \! : \! = \! x \mid \mid x \! : \! = \! 1 \, ; \\ \text{if} \, (\text{canRead(SECRET)}) \, \{ \, x \! : \! = \! \text{SECRET} \, ; \, \} \\ \text{if} \, (y > 0) \, \{ \, \text{return} \, 0 \, \} \\ \text{else} \, \{ \, \text{return} \, 1 \, \} \end{array}
```

Figure 1 shows the assembly output of gcc in the cases where SECRET is 0 and 1 respectively. In the case that SECRET is 1, gcc removes the if statement entirely, and moves the read of y above the write of x. However, when SECRET is 0, the if statement must remain intact, and gcc does not move the read of y. This means that if SECRET is 1, the second thread will always read y = 0 and always return 1. However, if SECRET is 0, it is possible that the first thread may observe x = 1 and write y := 1 in time for the second thread to observe y == 1 and thus return 0. In this way, we leverage compiler load-store reordering to learn the value of a compile-time SECRET.

We extend this attack to leak a secret consisting of an arbitrary number N of bits. To do this, we simply compile N copies of the test function, each performing a boolean test on a single bit of the secret. The function used for reading the kth bit is as follows (for  $N \le 64$ ):

```
 \begin{array}{l} \text{var} \, x \! : \! = \! 0 \, ; \\ y \! : \! = \! x \mid \mid x \! : \! = \! 1 \, ; \\ \text{if} \, (\text{canRead}(\text{SECRET})) \, \{ x \! : \! = \! (\text{SECRET \& (1 << k)}) \, ? \, 1 \, : \, \emptyset ; \, \} \\ \text{if} \, (y \! > \! 0) \, \{ \, \text{return} \, 0 \, \} \\ \text{else} \, \{ \, \text{return} \, 1 \, \} \\ \end{array}
```

Following the same analysis as above, this function will always return 1 if the appropriate bit of SECRET is 1, but may return 0 if the appropriate bit of SECRET is 0. The extension of the attack to the general case with truly arbitrary N is straightforward; SECRET becomes an array of 64-bit values, and we use  $k \neq 64$  and  $1 \ll (k \& 63)$  as the array index and bitmask respectively.

We make three additional tweaks to improve the reliability so that the attacker can confidently infer the value of SECRET based on the observed return values of the function. First, rather than performing y:=x only once in the first thread, we perform y:=x continuously in a loop. This maximizes the probability that, once x:=1 occurs in the second thread, y will be immediately assigned 1 by the first thread and the second thread will be able to read y:=1.

Second, we wish to lengthen the timing window between x := 1 and the read of y in the second thread (in the case where the appropriate bit of SECRET is 0 and the read of y remains below x := 1). However, we wish to do this in a way that does not block the reordering of the read of y upwards in the case where the appropriate bit of SECRET is 1. We do this by inserting many copies of the line

```
if (canRead(SECRET)) { x := (SECRET & (1 << k)) ? 1 : 0; }</pre>
```

instead of just one. In the case where the appropriate bit of SECRET is 0, this results in many calls to canRead(SECRET) and many conditional jumps, which in practice creates a timing window for the first thread to perform y := x. However, in the case where the appropriate bit of SECRET is 1, all of these inserted lines can be removed just as a single copy could be. In practice, we found that inserting too many copies of the line prevents gcc from reordering the read of y above the write to x as desired; inserting 30 copies was sufficient to create a timing window while still allowing the desired reordering.

Finally, we redundantly execute the entire attack several times, noting the return value of the function in each case. We note that if any of the redundant runs produces a return value of 0 for a particular bit position, we can be certain that the corresponding bit of SECRET must be 0, as it implies the read of y was not reordered upwards in that particular function. On the other hand, the more runs that produce a return value of 1 for a particular bit position, the more certain we can be that the read of y was reordered above the x:=1 assignment, and the appropriate bit of SECRET is 1.

Figure 2 gives the performance results for this attack against gcc version {ver}. The attack can sustain hundreds of thousands of bits per second leaked with near-perfect accuracy, or millions of bits per second with error rates of a few percent. This means that an attacker can leak a 2048-bit secret with near-perfect accuracy in under 10 ms. Note that this bandwidth assumes that all copies of the attack function are already compiled; the cost of compilation is not included here.

# 4.3 Dead store elimination attack

In this section we return to the attack in §3.10 based on dead store elimination. We show that in our attacker model (given in §4.1), the attacker is able to exploit dead store elimination to again learn the value of a compile-time SECRET despite only being allowed to use it inside dead code, that is, code that can never be executed at runtime. This attack is even more efficient than the attack on

| Redundancy | Bandwidth (bits/s) | Bitwise Acc | Per-run Acc   |  |
|------------|--------------------|-------------|---------------|--|
| 1          | 3.17 million       | 90.13%      | 0.0%          |  |
| 2          | 1.62 million       | 96.77%      | 0.7%          |  |
| 3          | 1.07 million       | 98.84%      | 3.9%<br>13.5% |  |
| 4          | 812 thousand       | 99.55%      |               |  |
| 5          | 652 thousand       | 99.83%      | 34.0%         |  |
| 7          | 466 thousand       | 99.97%      | 71.8%         |  |
| 10         | 322 thousand       | 99.998%     | 96.6%         |  |
| 15         | 216 thousand       | 100.00%     | 100.0%        |  |

Fig. 2. Performance results for the load-store reordering attack when leaking a 2048-bit secret. 'Redundancy' is the number of redundant runs performed for error correction; more redundant runs improves accuracy but reduces bandwidth. 'Bandwidth' is the number of bits leaked per second after accounting for any error correction. 'Bitwise Accuracy' is the percentage of bits that were correct, while 'Per-run Accuracy' is the percentage of full 2048-bit secrets that were correct in all bit positions. {Note: numbers are not final (collected on Craig's machine inside a VM), but give an idea of where we stand.}

load-store reordering, and further, we were able to demonstrate its effectiveness against both gcc and clang.

We start from the simple form of the attack presented in  $\S 3.10$ , and extend it to leak a secret consisting of an arbitrary number N of bits. As we did in the load-store reordering attack, we again compile N copies of the test function, each performing a boolean test on a single bit of the secret. The function used for reading the kth bit is as follows (for N <= 64):

```
 \begin{array}{l} {\rm var}\,x\!:=\!0\,;\\ r\!:=\!x\mid\mid x\!:=\!1\,;\\ &{\rm if}\,({\rm canRead}({\rm SECRET}))\,\{\,{\rm if}\,({\rm SECRET}\,\,\&\,\,(1\,<\!<\,k)\,\,!\!=\!0)\,\{\,x\!:=\!2\,\}\,\}\\ &{\rm else}\,\{\,x\!:=\!2\,\} \end{array}
```

Then, we test each function in turn, each time noting the value of r observed by the 'listening' thread. If the appropriate bit of SECRET is 1, the x:=2 assignment is guaranteed to happen, so the compiler can eliminate the x:=1 assignment as a dead store and we will observe r:=2; however, if the appropriate bit of SECRET is 0, the x:=1 assignment cannot be eliminated, and we will observe r:=1 with some probability. The extension of the attack to the general case with truly arbitrary N is straightforward and proceeds exactly as it did for the attack on load-store reordering.

We make three additional tweaks to improve the reliability so that the attacker can confidently infer the value of SECRET based on the observed values of r. These three tweaks strongly resemble the reliability tweaks we made to the load-store reordering attack and differ only in a few details.

First, rather than simply observing x with r:=x in the 'listening' thread, we continuously load x in a loop until a nonzero value is observed — i.e., we perform do  $\{r:=x\}$  while  $\{r:=0\}$ . This remedies the case where r:=x could observe a value of x from 'before' either of the two possible writes performed by the other thread.

Second, we insert additional time-consuming computation immediately following the x:=1 operation in the 'main' thread. This lengthens the timing window in which x has the value 1, increasing the likelihood that the 'listening' thread will be able to observe x:=1 (unless the x:=1 write was eliminated, of course). Inserting this computation can be done without interfering with the dead store elimination process itself, so that the compiler will continue to eliminate the x:=1 write if and only if the appropriate bit of SECRET was 1. For gcc, we have a fair amount of freedom with the time-consuming computation — for instance, we can use an arbitrarily long loop. In fact, we can perform a further optimization by monitoring the value of the variable r (written to by

| Redundancy | Bandwidth (bits/s) | Bitwise Acc | Per-run Acc |  |
|------------|--------------------|-------------|-------------|--|
| 1          | 1.40 million       | 99.92%      | 66.8%       |  |
| 2          | 702 thousand       | 99.999%     | 97.5%       |  |
| 3          | 470 thousand       | 99.99999%   | 99.9%       |  |
| 4          | 352 thousand       | 100.0%      | 100.0%      |  |

Fig. 3. Performance results for the dead store elimination attack on clang when leaking a 2048-bit secret. Terms are the same as defined in the caption for Figure 2. {Note: numbers are not final (collected on Craig's Mac with Apple Clang), but give an idea of where we stand.}

| Stall amount | 10           | 20           | 50           | 100          | 200          | 500          |
|--------------|--------------|--------------|--------------|--------------|--------------|--------------|
| Redundancy 1 | 2.75 million | 2.55 million | 2.27 million | 1.83 million | 1.34 million | 728 thousand |
|              | 97.13%       | 99.72%       | 99.98%       | 99.99%       | 99.99%       | 99.99%       |
| Redundancy 2 | 1.40 million | 1.30 million | 1.14 million | 902 thousand | 663 thousand | 382 thousand |
|              | 99.01%       | 99.99%       | 100.0%       | 100.0%       | 100.0%       | 100.0%       |
| Redundancy 3 | 926 thousand | 870 thousand | 763 thousand | 610 thousand | 443 thousand | 246 thousand |
|              | 99.44%       | 100.0%       | 100.0%       | 100.0%       | 100.0%       | 100.0%       |
| Redundancy 4 | 694 thousand | 645 thousand | 571 thousand | 451 thousand | 330 thousand | 186 thousand |
|              | 99.71%       | 100.0%       | 100.0%       | 100.0%       | 100.0%       | 100.0%       |

Fig. 4. Performance results for the dead store elimination attack on gcc when leaking a 2048-bit secret. Rows give different values of 'redundancy' (as defined in previous figures), while columns give amounts of stall time immediately following the x := 1 write (as measured in loop iterations). Each table cell gives the leak bandwidth in bits/sec, followed by the bitwise accuracy. {Note: numbers are not final (collected on Craig's machine inside a VM), but give an idea of where we stand.}

the listening thread) and breaking out of the loop early if we see that the listening thread has already observed x == 1. However, with clang, we cannot use a loop at all — the time-consuming computation must be branch-free, and furthermore must not consist of too many instructions. This is because clang's dead store elimination pass operates only within basic blocks, and uses a heuristic to stop scanning the basic block early if it is too large. Nonetheless, we find that even with these restrictions, we are able to construct a reliable and fast attack against both clang and gcc.

Finally, we redundantly execute the entire attack several times, noting the final value of r (the first observed nonzero value of x) in each case. We note that if any of the redundant runs produces r=1 for a particular bit position, we can be certain that the corresponding bit of SECRET must be 0, as it implies that the x:=1 write was not eliminated in that particular function. On the other hand, the more runs that observe r=2 in a particular bit position despite our other reliability-increasing measures taken above, the more certain we can be that the x:=1 write was eliminated in that function, and the appropriate bit of SECRET is 1.

Performance results for the dead store elimination attack against clang are given in Figure 3. This attack is faster than the load-store reordering attack against gcc for any given desired accuracy. {however, compared to the load-store reordering attack we don't have an even-faster-but-even-less-accurate setting. This could be accomplished by reducing the amount of 'time-consuming computation' which for clang we have been holding constant near the max value clang will tolerate. Really, we could make this a 2D table like Figure 4, we'll just have a strict maximum on how far to the right you can go.}

Figure 4 shows the performance results for the dead store elimination attack against gcc. Unlike the other attacks we've discussed, our implementation of this attack has two important "knobs" which trade off reliability vs. performance, rather than only one. First, we have the length of time

which the writing thread attempts to "stall" immediately after the x := 1 write (which is configurable in this case because gcc will perform the dead store elimination even with a loop here). Second, as in the other attacks, we have the number of entire redundant runs of the attack that are performed before the attacker reaches her conclusion. Increased reliability can be achieved by adjusting either of these knobs, and they each have (different) effects on the overall performance of the attack. Although Redundancy 1 never reaches perfect accuracy (probably because of rare catastrophic events like OS interrupts), once we move to Redundancy 2, increasing the "stall" time is much more advantageous than further redundancy. Perfect accuracy (in our tests) can be achieved with a bandwidth of over 1.1 million bits per second at Redundancy 2, making this our most efficient and accurate attack by a considerable margin. In particular, this means that this attack can leak a 2048-bit cryptographic key with perfect accuracy (in our tests) in under 2 ms.

### 5 LOGIC

In this section, we develop sufficient logical infrastructure to prove that our semantics disallows thin air executions. We present a variant of the TAR-pit example from §3.8 which poses difficulties under many speculative semantics.

We adapt past linear temporal logic (PLTL) [25] to pomsets by dropping the previous instant operator and adopting strict versions of the temporal operators. The atoms of our logic are write and read events. Given an pomset P and event e, define:

$$P, e \models Wxv$$
, if  $\lambda(e) = (\text{true}, Wxv)$   
 $P, e \models Rxv$ , if  $\lambda(e) = (\text{true}, Rxv)$   
 $P, e \models \phi \land \psi$ , if  $P, e \models \phi$  and  $P, e \models \psi$   
 $P, e \models \text{true}$   
 $P, e \models \neg \phi$ , if  $P, e \not\models \phi$   
 $P, e \models \Box^{-1}\phi$ , if  $(\forall d \leq e, d \neq e) P, d \models \phi$ 

Define  $P \models \phi$  if  $(\forall e \in E) P$ ,  $e \models \phi$  and  $\mathcal{P} \models \phi$  if  $(\forall P \in \mathcal{P}) P \models \phi$ .

Let  $\diamondsuit^{-1}\phi$  be defined as  $\neg(\Box^{-1}\neg\phi)$ . In addition, let false,  $\lor$  and  $\Rightarrow$  be defined in the standard way. The past operators do not include the current instant, and thus they do *not* satisfy the rule  $\Box^{-1}\phi \Rightarrow \diamondsuit^{-1}\phi$ . However, they do satisfy:

$$(\phi \Rightarrow \diamondsuit^{-1}\phi) \Rightarrow \neg \phi$$
 (Coinduction)  
$$(\Box^{-1}\phi \Rightarrow \phi) \Rightarrow \phi$$
 (Induction)

Note that  $P \models \phi \land \Box^{-1} \phi$  whenever  $P \models \phi$ .

We now present two proof rules. The first rule captures the semantics of local variables. Define  $closed(x) = (Rxv \Rightarrow \diamondsuit^{-1}Wxv)$ . Although this definition does not mention intervening writes, it is sufficient for our example. It is straightforward to establish that following rule is sound:

$$\frac{\phi \text{ is independent of } x \qquad P \models \mathsf{closed}(x) \Rightarrow \phi}{vx \cdot P \models \phi} \tag{Closing } x)$$

The second rule describes composition, in the style of Abadi and Lamport [1]. To simplify the presentation, we present the special case with a single invariant. In order to state the theorem, we generalize the satisfaction relation to include environment assumptions. Let  $\mathsf{Models}(\phi) = \{\mathcal{P} \mid \mathcal{P} \models \phi\}$  be the set of pomsets that satisfy  $\phi$ . We say that  $\phi$  is prefix closed if  $\mathsf{Models}(\phi)$  is  $\mathsf{prefix}\text{-closed}^1$ . Define  $\phi$ ,  $\mathcal{P} \models \psi$  if  $\mathsf{Models}(\phi) \parallel \mathcal{P} \models \psi$ .

 $<sup>^{1}</sup>P'$  is a prefix of P if  $E'\subseteq E$ ,  $e\in E'$  and  $d\le e$  imply  $d\in E'$ , and  $(\lambda',\le',\swarrow')$  coincide with  $(\lambda,\le,\swarrow)$  for elements of E'.

Proposition 5.1. Let  $\phi$  be prefix-closed. Let  $\mathcal{P}_1, \mathcal{P}_2$  be augmentation-closed<sup>2</sup> Then:

$$\frac{\phi, \mathcal{P}_1 \models \phi \qquad \phi, \mathcal{P}_2 \models \phi}{\mathcal{P}_1 \parallel \mathcal{P}_2 \models \phi}$$
 (Composition)

PROOF SKETCH. We will show that all prefixes in the prefix closures of  $\mathcal{P}_1 \parallel \mathcal{P}_2$  satisfy the required property. Proof proceeds by induction on prefixes of  $P \in \mathcal{P}_1 \parallel \mathcal{P}_2$ .

The case for empty prefix follows from assumption that  $\phi$  is prefix closed.

For the inductive case, consider  $P \in P_1 \parallel P_2$  where  $P_i \in \mathcal{P}_i$ . Since  $\mathcal{P}_1$  and  $\mathcal{P}_2$  are augmentation closed, we can assume that the restriction of P to the events of  $P_i$  coincides with  $P_i$ , for i = 1, 2. Consider a prefix P' derived by removing a maximal element e from P. Suppose e comes from  $P_1$  (the other case is symmetric). Since  $P_2$  is a prefix of P' and  $P' \models \phi$  by induction hypothesis, we deduce that  $P_2 \models \phi$ . Since  $P_1 \in \mathcal{P}_1$ , by assumption  $\phi$ ,  $\mathcal{P}_1 \models \phi$  we deduce that  $P \models \phi$ .

We now turn the conditional TAR-pit program, which is a variant of [26, Figure 8]:

$$var x := 0$$
;  $var y := 0$ ;  $var z := 0$ ;  $(y := x || if(z) \{x := 1\} else \{x := y; a := y\} || z := 1)$ 

This program is allowed to write 1 to a under many speculative memory models [16, 19, 27], even though the read of 1 from y in the else branch of the second thread arises out of thin air. In contrast, we prove the formula  $\neg \diamondsuit^{-1}(W a 1)$  holds for the models of this program in our semantics. We start with the following invariant, which holds for each of the three threads, and thus, by composition, for the aggregate program:

$$[\diamondsuit^{-1}(Wy1) \Rightarrow \diamondsuit^{-1}(Rx1)] \land [\diamondsuit^{-1}(Wa1) \Rightarrow (\diamondsuit^{-1}(Ry1) \land \Box^{-1}(Wx1 \Rightarrow \diamondsuit^{-1}(Ry1)))]$$

Closing y, we have,  $\lozenge^{-1}(Ry1) \Rightarrow \lozenge^{-1}(Wy1)$  which we substitute into the left conjunct to get:

$$\diamondsuit^{-1}(Ry1) \Rightarrow \diamondsuit^{-1}(Rx1)$$

which in turn we sustitute into the right conject to get:

$$\Diamond^{-1}(W a 1) \Rightarrow (\Diamond^{-1}(R x 1) \wedge \Box^{-1}(W x 1 \Rightarrow \Diamond^{-1}(R x 1)))$$

Closing x, we can replace  $\diamondsuit^{-1}(R x 1)$  with  $\diamondsuit^{-1}(W x 1)$ :

$$\diamondsuit^{-1}(\mathsf{W}\,a\,1) \Rightarrow (\diamondsuit^{-1}(\mathsf{W}\,x\,1) \land \Box^{-1}(\mathsf{W}\,x\,1 \Rightarrow \diamondsuit^{-1}(\mathsf{W}\,x\,1)))$$

Applying coinduction to the right conjunct, we have:

$$\diamondsuit^{-1}(W \ a \ 1) \Rightarrow (\diamondsuit^{-1}(W \ x \ 1) \land \Box^{-1}(\neg W \ x \ 1))$$

Simplifying, we have, as required:

$$\Diamond^{-1}(W a 1) \Rightarrow false$$

### 6 CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a model of speculative evaluation, and showed that it captures non-trivial properties of speculations produced by hardware, compiler optimizations, and transactions. These properties include information flow attacks: in the case of hardware and transactions this is modelling known attacks [12, 21], but in the case of compiler optimizations the attacks are new, and were discovered as a direct result of developing the model. We have experimentally validated that the attacks can be carried out against gcc and clang, though only against secrets known at compile time.

The model of relaxed memory used in this paper is deliberately simplified, compared for example to C11 [6, 8]. In particular our model of reads-from is strong, and could be weakened by replacing

 $<sup>\</sup>overline{{}^2P'}$  is an augmentation of P if E'=E,  $e\leq d$  implies  $e\leq'd$ ,  $e\nmid d$  implies  $e\not\leq'd$ , and if  $\lambda(e)=(\psi\mid b)$  then  $\lambda'(e)=(\psi'\mid b)$  where  $\psi'$  implies  $\psi$ .

the requirement d < e in Definition 2.4 by  $e \nleq d$ . It remains to be seen how this impacts the model, in particular the logical formulation of x-closure in §5 as  $((Rxv) \Rightarrow \diamondsuit^{-1}(Wxv))$  would no longer be sound. The model is also not considering coherence, though we speculate it can be added by requiring that for each  $x, \nleq$  form a total order when restricted to events that write to x.

The design space for transactions is very rich [13]. We have only presented one design choice, and it remains to be seen how other design choices could be adopted. For example we have chosen to pun between commits which are aborted due to transaction failure, and ones which are aborted for other reasons such as failed speculation.

Plotkin and Pratt [32] studied full-abstraction for pomset models. It would interesting to extend their results to 3-valued pomsets.

One interesting feature of this model is that (in the language of [31]) it is a *per-candidate execution model*, in that the correctness of an execution only requires looking at that one execution, not at others. This is explicit in memory models such as [17, 20] in which "alternative futures" are explored, in a style reminiscent of Abramsky's bisimulation as a testing equivalence [2]. Models of information flow are similar, in that they require comparing different runs to test for the presence of dependencies [11]. In contrast, the model presented here explicitly captures dependency in the pomset order, and models multiple runs by giving the semantics of if in terms of a concurrent semantics of both branches. In the parlance of information flow [4], the humble conditional suffices to construct a composition operator to detect information flow in the presence of speculation.

#### REFERENCES

- [1] Martín Abadi and Leslie Lamport. 1993. Composing Specifications. ACM Trans. Program. Lang. Syst. 15, 1 (Jan. 1993), 73–132. https://doi.org/10.1145/151646.151649
- [2] Samson Abramsky. 1987. Observation equivalence as a testing equivalence. *Theoretical Computer Science* 53, 2 (1987), 225 241. https://doi.org/10.1016/0304-3975(87)90065-X
- [3] Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory. ACM Trans. Program. Lang. Syst. 36, 2, Article 7 (July 2014), 74 pages. https://doi.org/10.1145/2627752
- [4] Gilles Barthe, Pedro R. D'Argenio, and Tamara Rezk. 2004. Secure Information Flow by Self-Composition. In Proceedings of the 17th IEEE Workshop on Computer Security Foundations (CSFW '04). IEEE Computer Society, Washington, DC, USA, 100–114. https://doi.org/10.1109/CSFW.2004.17
- [5] Mark Batty, Kayvan Memarian, Kyndylan Nienhuis, Jean Pichon-Pharabod, and Peter Sewell. 2015. The Problem of Programming Language Concurrency Semantics. In Proc. European Symp. on Programming. 283–307.
- [6] Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. 2011. Mathematizing C++ Concurrency. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '11). ACM, New York, NY, USA, 55-66. https://doi.org/10.1145/1926385.1926394
- [7] Arnab Kumar Biswas, Dipak Ghosal, and Shishir Nagaraja. 2017. A Survey of Timing Channels and Countermeasures. ACM Comput. Surv. 50, 1, Article 6 (March 2017), 39 pages. https://doi.org/10.1145/3023872
- [8] Hans-J. Boehm and Sarita V. Adve. 2008. Foundations of the C++ Concurrency Memory Model. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). ACM, New York, NY, USA, 68–78. https://doi.org/10.1145/1375581.1375591
- [9] S. D. Brookes, C. A. R. Hoare, and A. W. Roscoe. 1984. A Theory of Communicating Sequential Processes. J. ACM 31, 3 (June 1984), 560–599. https://doi.org/10.1145/828.833
- [10] Nathan Chong, Tyler Sorensen, and John Wickerson. 2018. The semantics of transactions and weak memory in x86, Power, ARM, and C++. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, Philadelphia, PA, USA, June 18-22, 2018. 211–225. https://doi.org/10.1145/3192366.3192373
- [11] Michael R. Clarkson and Fred B. Schneider. 2010. Hyperproperties. J. Comput. Secur. 18, 6 (Sept. 2010), 1157–1210. http://dl.acm.org/citation.cfm?id=1891823.1891830
- [12] Craig Disselkoen, David Kohlbrenner, Leo Porter, and Dean M. Tullsen. 2017. Prime+Abort: A Timer-Free High-Precision L3 Cache Attack using Intel TSX. In 26th USENIX Security Symposium, USENIX Security 2017, Vancouver, BC, Canada, August 16-18, 2017., Engin Kirda and Thomas Ristenpart (Eds.). USENIX Association, 51-67. https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/disselkoen

- [13] Brijesh Dongol, Radha Jagadeesan, and James Riely. 2018. Transactions in relaxed memory architectures. PACMPL 2, POPL (2018), 18:1–18:29. https://doi.org/10.1145/3158106
- [14] Jay L. Gischer. 1988. The equational theory of pomsets. Theoretical Computer Science 61, 2 (1988), 199–224. https://doi.org/10.1016/0304-3975(88)90124-7
- [15] James W. Gray, III. 1992. Toward a Mathematical Foundation for Information Flow Security. J. Comput. Secur. 1, 3-4 (May 1992), 255–294. http://dl.acm.org/citation.cfm?id=2699806.2699811
- [16] Radha Jagadeesan, Corin Pitcher, and James Riely. 2010. Generative Operational Semantics for Relaxed Memory Models. In Programming Languages and Systems, 19th European Symposium on Programming, ESOP 2010, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2010, Paphos, Cyprus, March 20-28, 2010. Proceedings (Lecture Notes in Computer Science), Andrew D. Gordon (Ed.), Vol. 6012. Springer, 307–326. https://doi.org/10.1007/978-3-642-11957-6\_17
- [17] Radha Jagadeesan, Corin Pitcher, and James Riely. 2010. Generative Operational Semantics for Relaxed Memory Models. In Proceedings of the 19th European Conference on Programming Languages and Systems (ESOP'10). Springer-Verlag, Berlin, Heidelberg, 307–326. https://doi.org/10.1007/978-3-642-11957-6\_17
- [18] A. Jeffrey and J. Riely. 2016. On Thin Air Reads Towards an Event Structures Model of Relaxed Memory. In Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science, LICS '16, New York, NY, USA, July 5-8, 2016, M. Grohe, E. Koskinen, and N. Shankar (Eds.). ACM, 759–767. https://doi.org/10.1145/2933575.2934536
- [19] J. Kang, C.-K. Hur, O. Lahav, V. Vafeiadis, and D. Dreyer. 2017. A promising semantics for relaxed-memory concurrency. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL 2017, Paris, France, January 18-20, 2017, Giuseppe Castagna and Andrew D. Gordon (Eds.). ACM, 175–189. https://doi.org/10.1145/3009837
- [20] Jeehoon Kang, Chung-Kil Hur, Ori Lahav, Viktor Vafeiadis, and Derek Dreyer. 2017. A Promising Semantics for Relaxed-memory Concurrency. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL 2017). ACM, New York, NY, USA, 175–189. https://doi.org/10.1145/3009837.3009850
- [21] Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. 2018. Spectre Attacks: Exploiting Speculative Execution. CoRR abs/1801.01203 (2018). arXiv:1801.01203 http://arxiv.org/abs/1801.01203
- [22] Leslie Lamport. 1986. On Interprocess Communication. Part I: Basic Formalism. Distributed Computing 1, 2 (1986), 77–85. https://doi.org/10.1007/BF01786227
- [23] Butler W. Lampson. 1973. A Note on the Confinement Problem. *Commun. ACM* 16, 10 (Oct. 1973), 613–615. https://doi.org/10.1145/362375.362389
- [24] Jim Larus and Ravi Rajwar. 2007. Transactional Memory (Synthesis Lectures on Computer Architecture). Morgan & Claypool Publishers.
- [25] Orna Lichtenstein, Amir Pnueli, and Lenore D. Zuck. 1985. The Glory of the Past. In *Proceedings of the Conference on Logic of Programs*. Springer-Verlag, London, UK, UK, 196–218. http://dl.acm.org/citation.cfm?id=648065.747612
- [26] A. Lochbihler. 2013. Making the Java memory model safe. ACM Trans. Program. Lang. Syst. 35, 4 (2013), 12:1–12:65. https://doi.org/10.1145/2518191
- [27] Jeremy Manson, William Pugh, and Sarita V. Adve. 2005. The Java Memory Model. SIGPLAN Not. 40, 1 (Jan. 2005), 378–391. https://doi.org/10.1145/1047659.1040336
- [28] Andrew C. Myers. 1999. JFlow: practical mostly-static information flow control. In 26th ACM Symp. on Principles of Programming Languages (POPL). 228–241. http://www.cs.cornell.edu/andru/papers/popl99/popl99.pdf
- [29] Kevin R. O'Neill, Michael R. Clarkson, and Stephen Chong. 2006. Information-Flow Security for Interactive Programs. In Proceedings of the 19th IEEE Workshop on Computer Security Foundations (CSFW '06). IEEE Computer Society, Washington, DC, USA, 190–201. https://doi.org/10.1109/CSFW.2006.16
- [30] Zdzisław Pawlak. 1982. Rough sets. International Journal of Computer & Information Sciences 11, 5 (01 Oct 1982), 341–356. https://doi.org/10.1007/BF01001956
- [31] Jean Pichon-Pharabod and Peter Sewell. 2016. A Concurrency Semantics for Relaxed Atomics That Permits Optimisation and Avoids Thin-air Executions. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '16). ACM, New York, NY, USA, 622–633. https://doi.org/10.1145/2837614.2837616
- [32] Gordon Plotkin and Vaughan Pratt. 1997. Teams Can See Pomsets (Preliminary Version). In *Proceedings of the DIMACS Workshop on Partial Order Methods in Verification (POMIV '96)*. AMS Press, Inc., New York, NY, USA, 117–128. http://dl.acm.org/citation.cfm?id=266557.266600
- [33] W. Pugh. 2004. Causality Test Cases. http://www.cs.umd.edu/~pugh/java/memoryModel/CausalityTestCases.html.
- [34] A. Sabelfeld and A. C. Myers. 2006. Language-based Information-flow Security. *IEEE J.Sel. A. Commun.* 21, 1 (Sept. 2006), 5–19. https://doi.org/10.1109/JSAC.2002.806121
- [35] Geoffrey Smith and Dennis Volpano. 1998. Secure Information Flow in a Multi-threaded Imperative Language. In Proceedings of the 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '98). ACM, New York, NY, USA, 355–364. https://doi.org/10.1145/268946.268975

- [36] Alasdair Urquhart. 1986. *Many-valued Logic*. Springer Netherlands, Dordrecht, 71–116. https://doi.org/10.1007/978-94-009-5203-4\_2
- [37] Jaroslav Ševčík. 2008. Program Transformations in Weak Memory Models. PhD thesis. Laboratory for Foundations of Computer Science, University of Edinburgh.
- [38] J. Todd Wittbold and Dale M. Johnson. 1990. Information Flow in Nondeterministic Systems. In *IEEE Symposium on Security and Privacy*.
- [39] Jianzhou Zhao, Santosh Nagarakatte, Milo M. K. Martin, and Steve Zdancewic. 2012. Formalizing the LLVM intermediate representation for verified program transformations. In Proceedings of the 39th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2012, Philadelphia, Pennsylvania, USA, January 22-28, 2012, John Field and Michael Hicks (Eds.). ACM, 427-440. https://doi.org/10.1145/2103656.2103709

#### A MEMORY MODEL EXAMPLES

Pugh [33] developed a set of twenty causality test cases in the process of revising the Java Memory Model (JMM) [27]. Using hand calculation, we have confirmed that our model gives the desired result for all twenty cases, unrolling loops as necessary. Our model also gives the desired results for all of the examples in Batty et al. [5, §4] and all but one in Ševčík [37, §5.3]: redundant-write-after-read-elimination—this counterexample applies to any sensible non-coherent semantics. Our model agrees with the JMM on the "surprising and controversial behaviors" of Manson et al. [27, §8], and thus fails to validate thread inlining.

In this section, we discuss three of the causality test cases and the thread inlining from [27]. In presenting the examples, we unroll loops, correct typos and simplify the code.

# A.1 Causality test case 8

Test case 8 asks whether:

$$var x := 0$$
;  $var y := 0$ ;  $(if (x < 2) \{ y := 1 \} || x := y)$ 

may read 1 for both x and y. This behavior is allowed, since "interthread analysis could determine that x and y are always either 0 or 1." This breaks the dependency between the read of x and the write to y in the first thread, allowing the write to be moved earlier.

The semantics of TC8 includes



Where we require (W x 0) < (R x 1) but not (R x 1) < (W y 1). To see why this execution exists, consider the left thread with syntax sugar removed:

$$r := x; if(r < 2) \{ y := 1 \}$$

[if (r < 2) { y := 1}] includes  $(r < 2 \mid Wy1)$ . Thus, by definition 2.12, [r := x; if (r < 2) { y := 1}] includes  $(Rx1) \rightarrow (r < 2 \mid Wy1)[x/r]$  which simplifies to  $(Rx1) \rightarrow (x < 2 \mid Wy1)$ , which, by definition 2.8, includes:

$$Rx1$$
  $x < 2 | Wy1$ 

Here we have used the *non-ordering read* clause of definition 2.8: " $\psi'$  implies  $\psi[v/x] \wedge \psi$ , if a reads v from x," where a = (R x 1),  $\psi = \psi' = (x < 2)$ . We can use this case since x < 2 implies  $1 < 2 \wedge x < 2$ . Prefixing with (W x 0) allows us to discharge the assumption x < 2, arriving at:

$$(\mathbf{W} x 0) \longrightarrow (\mathbf{R} x 1) \quad (\mathbf{W} y 1)$$

Here we have used the *ordering read* clause of 2.8: " $\psi'$  implies  $\psi[v/x]$ , if a reads v from x and c < e," where a = (Wx0),  $\psi = (x < 2)$  and  $\psi' = \text{true}$ . As long as require (Wx0) < (Rx1), we can use this case since true implies 0 < 2.

# A.2 Causality test case 9

Test case 9 asks whether:

$$var x := 0$$
;  $var y := 0$ ; (if  $(x < 2) \{ y := 1 \} || x := y || y := 2$ ;)

may read 1 for both x and y. This behavior is also allowed. This is "similar to test case 8, except that x is not always 0 or 1. However, a compiler might determine that the read of x by thread 1 will never see the write by thread 3 (perhaps because thread 3 will be scheduled after thread 1)"

Reasoning as for test case 8, the semantics of test case 9 includes:



Thus, with respect to the introduction of new threads, our model appears to be more robust than the event structures semantics of [18], which fails on this test case.

# A.3 Causality test case 14

Test case 14 asks whether:

$$var a := 0$$
;  $var b := 0$ ;  $var y := 0$ ;  $(if(a) \{b := 1\} else \{y := 1\} || while  $(y+b == 0) \{skip\} a := 1)$$ 

may read 1 for a and b, yet 0 for y. Here a and b are regular variables and y is volatile, which is equivalent to release/acquire in this example. This behavior is also disallowed, since "in all sequentially consistent executions, [the read of a gets 0] and the program is correctly synchronized. Since the program is correctly synchronized in all SC executions, no non-SC behaviors are allowed." Unrolling the loop once, we have:

$$var a := 0$$
;  $var b := 0$ ;  $var y := 0$ ;  $(if (a) \{ b := 1 \} else \{ y := 1 \} || if (y \lor b) \{ a := 1 \})$ 

We argue that any execution with (R a 1), (R b 1), and (R y 0) must be cyclic. The closure requirements require that (W a 1) < (R a 1) and (R b 1) < (R b 1). Ignoring initialization, least ordered execution that includes all of these actions is:



where the read of a is ordering for (W b 1) but not (W y 1), and the read of b is ordering for (W a 1) but the read of y is not. (W y 1) is crossed out, since its precondition must imply  $(\neg a)[1/a]$ , which is equivalent to false. To avoid order from (R y 0) to (W a 1), we have strengthened the predicate on (W a 1) from ( $y \lor b$ ) to ( $y = 0 \land b = 1$ ). Note that we cannot use this trick symmetrically to remove the order from (R b 1) to (W a 1), since b = 1 does not follow from the initialization of b.

### A.4 Thread inlining

One property one could ask of a model of shared memory is thread inlining: any execution of  $[\![P];Q]\!]$  is an execution of  $[\![P];Q]\!]$ . This is *not* a goal of our model, and indeed is not satisfied, due to the different semantics of concurrent and sequential memory accesses. We demonstrate this by considering an example from the Java Memory Model [27], which shows that Java does not satisfy thread inlining either.

The lack of thread inlining is related to the different dependency relations introduced by sequential and concurrent access. Recall from  $\S 3.1$  that the program  $(x := \emptyset; y := x+1;)$  has

execution:

$$(\mathbf{W} x 0) (\mathbf{W} y 1)$$

but that (x := 1; || y := x+1;) has:

That is, in the sequential case there is no dependency from the write of *x* to the write of *y*, but in the concurrent case there is such a dependency.

This can be used to construct a counter-example to thread inlining, based on [27, Ex 11]:

$$x := 0$$
; if  $(x == 1) \{ z := 1; \}$  else  $\{ x := 1; \} || y := x; || x := y;$ 

This has no execution containing (W z 1). Any attempt to build such an execution results in a cycle:



Inlining the thread (y := x) gives [27, Ex 12]:

$$x := 0$$
; if  $(x == 1) \{ z := 1; \}$  else  $\{ x := 1; \} y := x; || x := y;$  with execution:



To see why this execution exists, consider the program fragment:

if 
$$(x == 1) \{ z := 1; \}$$
else  $\{ x := 1; \} y := x; \}$ 

Removing the syntax sugar, this is:

Now,  $[z := 1; r_2 := x; y := r_2; skip]$  includes pomset:

$$(r_1 = 1 \mid Wz1) \quad (r_1 = x = 1 \mid Wy1)$$

and  $[x := 1; r_3 := x; y := r_3; skip]$  includes pomset:

$$\boxed{r_1 \neq 1 \mid \mathsf{W} \, x \, 1} \quad \boxed{r_1 \neq 1 \mid \mathsf{W} \, y \, 1}$$

so [if  $(r_1 = 1)$  { z := 1;  $r_2 := x$ ;  $y := r_2$ ; skip } else { x := 1;  $r_3 := x$ ;  $y := r_3$ ; skip }]] includes:

$$r_1 = 1 \mid Wz1$$
  $r_1 \neq 1 \mid Wx1$   $r_1 = 1 \mid Wx1$ 

which means  $[if(r_1 = 1) \{z := 1; r_2 := x; y := r_2; skip\} else \{x := 1; r_3 := x; y := r_3; skip\}][x/r_1]$  includes:

$$(x = 1 \mid Wz1)$$
  $(x \neq 1 \mid Wx1)$   $(x = x = 1) \lor (x \neq 1)) \mid Wy1$ 

Now  $(x = x = 1) \lor (x \ne 1)$  is a tautology, so this is just:

and so  $[r_1:=x; if (r_1=1) \{z:=1; r_2:=x; y:=r_2; skip \} else \{x:=1; r_3:=x; y:=r_3; skip \}]$  includes:

$$\begin{array}{c|c}
\hline
(Rx1) & \hline
\end{array}$$

$$\begin{array}{c|c}
1 = 1 \mid Wz1
\end{array}$$

$$\begin{array}{c|c}
1 \neq 1 \mid Wx1
\end{array}$$

$$\begin{array}{c|c}
Wy1
\end{array}$$

which simplifies to:



as required. The rest of the example is straightforward, and shows that our semantics agrees with the JMM in not supporting thread inlining.