#### 7 8 9 10

11 12

13

#### 14 15 16 17

18

19

#### 20 21 22 23 24 25

#### 26 27 28 29 30 31

#### 32 33 34 35 36

### 37 38

# 39

40

## 41 42

## 43 44 45 46

#### 47 48 49

#### Let go of that which does not serve you: Semantic Independence in a Model of Relaxed Memory

#### ANONYMOUS AUTHOR(S)

We define a relaxed memory model, building on simple and traditional mathematical foundations. We show that the model (1) satisfies the DRF-SC criterion, (2) compiles to TSO and ARMV8 without fences on relaxed operations, (3) supports all reasonable sequential compiler optimizations, and (4) supports compositional reasoning for temporal safety properties.

CCS Concepts: • Theory of computation → Parallel computing models; Abstraction;

Additional Key Words and Phrases: Relaxed Memory Models, Hardware Transactional Memory

#### **ACM Reference Format:**

Anonymous Author(s). 2019. Let go of that which does not serve you: Semantic Independence in a Model of 

#### 1 INTRODUCTION

Manson et al. [2005] identify the central problem in the design of software relaxed memory models: "The memory model must strike a balance between ease-of-use for programmers and implementation flexibility for system designers." This paper provides a solution.

There are two dimensions to "implementation flexibility":

- The model should be realizable on modern hardware with minimal synchronization. A canonical example is that the relaxed atomics of C11 (or the plain variables of Java) should not require any extra synchronization.
- The model should facilitate compiler transformations. Ideally, the model should support the known transformations used to optimize synchronization free single threaded code. A canonical example is the commuting of independent statements.

There are also two dimensions to "ease-of-use":

- The data race free-sequentially consistent (DRF-SC) criterion [Adve and Hill 1990, 1993] permits the programmer to forget about relaxed memory for correctly synchronized programs.
- All programs, including those with data races, should support compositional reasoning on temporal safety properties [Abadi and Lamport 1993; Misra and Chandy 1981; Pnueli 1985; Stark 1985]. The simplest form of such a composition principle is:

$$\frac{\phi, \mathcal{P}_1 \models \phi \qquad \phi, \mathcal{P}_2 \models \phi}{\mathcal{P}_1 \parallel \mathcal{P}_2 \models \phi}$$

where  $\phi$  is a temporal safety property on read and write actions, and  $\phi$ ,  $\mathcal{P}_1 \models \phi$  indicates that  $\mathcal{P}_1$  satisfies  $\phi$  if the environment satisfies  $\phi$ . This cyclical compositional proof rule permits us to reason separately about individual threads.

We illustrate the last criterion with two examples. First, consider the well-known "Out Of Thin Air" (OOTA) litmus test, with all variables initialized to 0:

$$(y:=x \parallel x:=y) \tag{1}$$

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license.

© 2019 Copyright held by the owner/author(s).

XXXX-XXXX/2019/7-ART

https://doi.org/10.1145/nnnnnnnnnnnnn

Informally, both threads satisfy the invariant that conjoins "A write of 1 to x requires a prior read of 1 from y" and "A write of 1 to y requires a prior read of 1 from x". If composition holds, the full program satisfies the invariant. Since the variable declaration closes the program from other writes to x, y, we deduce the conjunction of "A write of 1 to x requires a prior write of 1 to x" and "A write of 1 to y requires a prior write of 1 from x". Thus, we deduce that "A write of 1 to x requires a prior write of 1 to x", and consequently "there is no write of 1 to x". Composability of safety properties provides an *objectively falsifiable* measurement of OOTA in a memory model.

Now we consider the "type unsafety" example, from [Lochbihler 2013, Figure 8]:

$$x:=y \parallel r:=x; \text{ if } (b)\{s:=\text{new C()}\} \text{ else } \{r:=\text{new D()}\}; y:=r; \parallel b:=1$$
 (2)

Anon.

Prima facie, the allocation operation can pick the same address for the two new objects, because only one of them occurs in any one execution. This causes type safety to break for racy programs in the JMM, forcing Lochbihler to include the type information in the address itself. Informally, all threads satisfy the invariant that conjoins "a creation of a C object is preceded by a read of b as 1" and "a creation of a D object is preceded by a read of b as 0". If composition holds, the full program satisfies the invariant. Thus, composability of safety properties facilitates the proof of a temporal invariant that forbids "pointer forging" within a single execution<.

None of the extant memory models validate both "implementation flexibility" and "ease-of-use". Models that validate "ease-of-use" include Sequential Consistency (SC) Lamport [1979], Dolan et al. [2018], Lahav et al. [2017], Jeffrey and Riely [2016], RC11 model[Lahav et al. 2017] and Boehm [2018]. However, *all* of them invalidate reordering of independent statements. [Boehm 2018; Dolan et al. 2018; Jeffrey and Riely 2016] forbid breaking of the program order from reads to writes and thus require extra fences after read actions in hardware implementations. The results of Boehm and Demsky [2014] show that the RC11 model[Lahav et al. 2017] forces a dependency or a fence between a relaxed atomic read and a subsequent relaxed atomic write.

As quoted at the beginning of this section, the JMM was designed with these two aims. However, the JMM and the related models of Jagadeesan et al. [2010] and Kang et al. [2017] invalidate compositional reasoning. In §7 we formalize a variant of Lochbihler's type unsafety example which demonstrates this.

Our approach builds on pomsets [Gischer 1988; Plotkin and Pratt 1997], enriched with sequential preconditions [Hoare 1969]. This simple combination of traditional features provides a conceptual foundation for the notion of time in our model of relaxed memory, even while eschewing the now familiar paraphernalia associated with models of relaxed memory, such as mutliple relations and thread ids.

- The acylicity of the pomset provides an axiomatic basis for a global notion of time, as envisioned in chapter 3.3 of Alglave [2010]. The novelty in our framework is that sequential evolution proceeds silently, *without* advancing global time. Thus, we capture the essence of multi-copy atomicity (MCA) in a software memory model, whereas traditionally MCA is explored in hardware memory models. Tso (see, e.g. [Sewell et al. 2010]) and recent architectures, such as ARMV8 (see, e.g. [Pulte et al. 2018]), are MCA, but not older architectures, such as POWER (see, e.g. [Sarkar et al. 2011]) or ARMV7 (see, e.g. [Alglave et al. 2009]).
- The implicit use of Hoare logic in our model enables us to extract and incorporate all the independence information available in sequential code. Intuitively, this facilitates the *implicit* and *dynamic* discovery of permissible program transformations. Thus, the model fully reaps the benefits of viewing a memory model in terms of (sequential) program transformations, eg. see [Demange et al. 2013; Ferreira et al. 2010; Lahav and Vafeiadis 2016; Saraswat et al. 2007], without explicitly being formalized as such.

We show the following.

- The model satisfies the DRF-SC criterion (§4).
- The model compiles to ARMV8 and TSO *without* extra synchronization for raw variables (§5). However, a semantic version of MCA in our model means that a compilation to ARMV7 or POWER requires extra synchronization.
- The model validates a variety of single threaded transformations including reordering of independent statements and roach motel laws for synchronization (§6). We provide two forms of abstract evidence that our model supports as many sequential transformations as can be expected. First, we show a precise relationship with standard Hoare-triples for sequential code. Second, we prove a "full abstraction" style completeness theorem; the inability of our model to transform a top-level thread *C* into *D* is *always* justified by a parallel observer who can see an interaction with *D* that is not supported by *C*.
- The model validates compositional reasoning on safety properties (§7). Thus, it supports the infrastructure required to prove a realistic type safety theorem that includes racy programs without requiring that that the type information is included in the address itself.

To the reader interested in models that forbid load buffering, we provide a way to adapt our model to forbid the relaxing of the program order from reads to writes, thus modeling [Boehm 2018; Dolan et al. 2018]. Our new contributions for such a reader are an approach to validating data-sensitive compiler optimizations and compositional reasoning of temporal properties.

We give an informal introduction to the model in §2 before presenting the precise formalities in §3. §4 proves the DRF theorem, whereas §5 provides a compilation into ARMV8 and TSO. Single threaded optimizations, and the associated completeness theorems are addressed in §6. §7 describes a temporal logic, and a compositional proof principle for proving safety properties. We end with a discussion of related work in §8.

#### 2 AN INTRODUCTION TO THE MODEL

In this section, we define the model, provide some intuitions about the semantics, and work through a series of illustrative examples. We present precise details of the semantics in §3.

Let  $\mathcal{A}$  be a set of *actions*, that includes actions of the form (Rxv), which *reads* value v from location x,  $(R^{acq}xv)$ , which is an *acquire* that reads v from x, (Wxv), which writes v to x, and  $(W^{rel}xv)$ , which is a *release* that writes v to x. Actions that neither acquire nor release are *relaxed*; other actions are *synchronizations*.

The actions listed above are *external*. Each external action has a corresponding *internal* action, denoted by prefixing  $\tau$ . Internal actions also read and write locations, just as external actions do, but are not used to model communication between threads, so we do not record their value.

Two actions *conflict* if one writes a location and the other either reads or writes the same location. Let  $\Phi$  be a set of logical formulae, such as (x=1) or (r=s+1), where r and s are register names. Formulae are *open*, in that occurrences of register names and memory locations are subject to substitutions of the form  $\phi[x/r]$  and  $\phi[N/x]$ , where x is a memory location, r is a register and N is an memory-location-free expression. Actions are not subject to substitution.

Our model is based on partially ordered multisets (pomsets) [Gischer 1988].

#### **Definition 2.1.** A (memory model) pomset is a tuple $(E, \leq, \lambda)$ , such that

- *E* is a set of *states*,
- $\lambda : E \to (\Phi \times \mathcal{A})$  is a *labeling*, from which we derive  $\lambda_{\Phi} : E \to \Phi$  and  $\lambda_{\mathcal{A}} : E \to \mathcal{A}$ ,
- $\leq \subseteq (E \times E)$  is a partial order, and
- if  $d \le e$  then  $\lambda_{\Phi}(e)$  implies  $\lambda_{\Phi}(d)$ .

In the parlance of chapter 3.3 of Alglave [2010],  $\leq$  is a "global happens-before" relation. The restriction of < to conflicting writes is called the coherence order, co.

We refer to "memory model pomsets" as "pomsets". We write pairs in  $(\Phi \times \mathcal{A})$  as  $(\phi \mid a)$ . We elide  $\phi$  when it is a tautology. We write d < e when  $d \le e$  and  $d \ne e$ .

We give the semantics of a program as a set of pomsets, where each pomset corresponds to a single execution. We visualize a pomset as a directed graph where the nodes are drawn from E, each node e is labeled with  $\lambda(e)$ , and order is drawn as an edge. We elide events with unsatisfiable preconditions. As discussed below, we visualize order using arrows that indicate the reason that the order arises. Although we use multiple arrow, we emphasize that they are all part of the same  $\leq$  relation.

Dependencies. A write event e has two kinds of dependencies. First, it may have an external dependency on a read d from another thread, which is reflected in the order d < e. Second, it may have an internal dependency on a preceding event in the same thread, which is reflected in the formula  $\lambda_{\Phi}(e)$ . In order for the write event to "fire," both these dependencies need to be satisfied. The external dependency on a read is satisfied by globally visible writes. The internal dependency can only be fulfilled by a preceding actions of the same thread.

*Intrathread dependencies.* Within a single thread, < captures the notion of dependency, indicating that the related events cannot appear in the opposite order.

For example, semantics of y := r contains the following pomset.

$$\overbrace{r = 1 \mid W y 1}$$
(3)

that can be viewed as expressing the Hoare triple  $\{r=1\}$  y:=r  $\{y=1\}$ . The Hoare triple  $\{x=1\}$  r:=x; y:=r  $\{y=1\}$  expresses the effect of prefixing an assignment to r. The assignment to r causes a substitution in the precondition, changing the label to  $(x=1 \mid Wy1)$ . The effect of the read is realized in the semantics in two different ways:

$$(Rx) \rightarrow (x = 1 \mid Wy1)$$
 and  $(Rx1) \rightarrow (Wy1)$ 

In the pomset on the left, the read is *internal*, and therefore the formula on the write actions must be discharged by a preceding action in the same thread. In this execution, we have included an *internal action*, which we visualize in grey.

In the pomset on the right, the read is *external*. An external read may discharge the precondition on a event that is ordered after: by introducing order  $(R x 1) \rightarrow (x = 1 \mid W y 1)$ , we may substitute 1 for x in the precondition, resulting in 1 = 1, which is a tautology and therefore elided.

Consider prefixing a write to the above program: x := 1; r := x; y := r. We again have two cases, depending on whether the write to x is *internal* or *external*:

$$(Wx)$$
  $(Rx) \longrightarrow (Wy1)$  and  $(Wx1)$   $(Rx) \longrightarrow (Wy1)$ 

This pomset is derived discharging the internal dependency x = 1 from the left pomset above in the semantics of r := x; y := r. This feature is reminiscent of the *read-from internal* (rfi) in hardware memory models, where writes from a thread are forwarded to a read in the same thread without global publication.

The Hoare triple corresponding to the above pomset is  $\{\text{true}\}\ x:=1;\ r:=x;\ y:=r\ \{y=1\land x=1\}.$  This triple reflects the store forwarding optimization in compilers.

As seen above, preconditions can be weakened as a result of prefixing. As in [Disselkoen et al. 2019], weakening of preconditions<sup>1</sup> can also happen by merging actions. Consider the program if

<sup>&</sup>lt;sup>1</sup>Formula  $\psi$  is weaker than a formula  $\phi$  if  $\phi$  implies  $\psi$ .

 $(r)\{y:=1\}$  else  $\{y:=1\}$ . We derive:

$$\frac{\{r=0\}\ y\!:=\!1\ \{y=1\},\quad \{r\neq 0\}\ y\!:=\!1\ \{y=1\}}{\{\mathsf{true}\}\ \mathsf{if}(r)\{y\!:=\!1\}\,\mathsf{else}\,\{y\!:=\!1\}\ \{y=1\}}$$

This is reflected in our semantics by a pomset with a single action whose precondition is true

that is derived as follows. The two branches of the conditional contain pomsets as follows:

THEN: 
$$(r = 0 \mid Wy1)$$
 ELSE:  $(r \neq 0 \mid Wy1)$ 

In the semantics of the conditional, we take the (not necessarily disjoint) union of the events, that permits events with the same actions to be identified by combining their preconditions with a disjunction. In this case, this yields the event  $(r \neq 0 \lor r = 0 \mid W y \mid 1)$ , permitting us to derive the desired pomset in the semantics of if  $(r)\{y:=1\}$  else  $\{y:=1\}$ .

Within a single thread, < also orders *conflicting* events. We refer order arising from conflict as "weak" and visualize it using a dashed arrow " $\rightarrow$ ". This is in contrast to the "strong" order arising from dependency, which we visualize using a solid arrow " $\rightarrow$ ". For example, the semantics of x := 0; x := 1 includes the pomset:

$$(\mathbf{W} \times \mathbf{0}) \rightarrow (\mathbf{W} \times \mathbf{1}) \tag{4}$$

The weak edge  $(Wx0) \rightarrow (Wx1)$  is required since x:=0 precedes x:=1 in thread order. This notion of conflict is stable under substitution (since actions are not subject to substitution).

Interthread dependencies. Between threads, < includes reads-from, which always goes from writes to reads across threads. We consider reads-from to be a strong order, which we visualize with a thick arrow " $\rightarrow$ ". indicates that one thread has *read-from* another. Since new threads may introduce order from write to read, we define our semantics to be closed with respect to augmentation of  $\leq$ .

At top level, we expect that every read event is *fulfilled* by a matching write. We define fulfillment so that it is monotone with respect to addition of new threads.

**Definition 2.2.** We say d fulfills e on x if d externally writes v to x, e externally reads v from x,

- d < e, and
- if an event *c* writes to *x* then either  $c \le d$  or  $e \le c$ .

A pomset is x-closed if every external read on x is fulfilled, and every event is independent of x. A pomset is top-level if it is x-closed for every location x.

For example, combining (3) and (4), the semantics of x := 0;  $x := 1 \parallel r := x$ ; y := r includes:

$$(W \times 0) \rightarrow (W \times 1) \longrightarrow (R \times 1) \longrightarrow (W \times 1)$$
(5)

We often highlight the required strong edge from a write to a read action that it fulfills, as above. Between threads, < also defines a partial order on conflicting actions; we consider these to be a weak order, visualized "- $\rightarrow$ ". Because of the requirements of fulfillment, any writes to x that are read must be totally ordered. As an example consider the following program and execution:

$$x:=3 \parallel if(x = 1)\{r:=x\} \parallel x:=1 \parallel x:=2$$
 $(x:=3) \parallel if(x = 1)\{r:=x\} \parallel x:=1 \parallel x:=2$ 
 $(x:=3) \parallel if(x = 1)\{r:=x\} \parallel x:=1 \parallel x:=2$ 
 $(x:=3) \parallel if(x = 1)\{r:=x\} \parallel x:=1 \parallel x:=2$ 

The restriction of < to conflicting events on a single location is called the *extended coherence order*, eco. Since < is a partial order, so is eco.

 *Synchronization.* The model ensures that events are ordered before a release and after an acquire. This is a strong order which is visualized as a solid arrow, like order derived from dependency. As an example, consider the program:

$$x := 0$$
;  $x := 1$ ;  $y^{rel} := 1 \parallel r := y^{acq}$ ;  $s := x$ 

which allows:



Since  $(Wx 0) \le (Wx 1) < (Rx 0)$ , this pomset does not satisfy the requirements to fulfill the read. If we replace the release with a plain write, then the outcome  $(R^{acq}y 1)$  and (Rx 0) is possible:

since no order is required between (W x 1) and (W y 1). Symmetrically, if we replace the acquire of the original program with a plain read, then the outcome (R y 1) and (R x 0) is possible.

*Impact of internal reads.* The ability to remove of the program order from a write to a read of the same thread as discussed earlier causes great expressivity in terms of permitted executions. To see this, consider the example 3.6 from [Podkopaev et al. 2019]:

$$r := x; y^{\text{rel}} := 1; z := y \parallel x := z$$

$$Rx1 \longrightarrow W^{\text{rel}} y 1 \longrightarrow Wz 1$$

$$Wx1 \longleftarrow Rz 1$$
(6)

Internal actions do not need to be fulfilled by a matching write. As a result, there is no order from  $(W^{rel}\ y\ 1)$  to  $(W\ z\ 1)$ . This behavior is allowed by ARMV8. Were the internal action made external, then there would be order due to the requirements of fulfillment; this behavior would be disallowed due to the evident cycle:

$$\begin{array}{c}
(Rx1) \longrightarrow (W^{\text{rel}}y1) \longrightarrow (Ry1) \longrightarrow (Wz1) \\
(Wx1) \longleftarrow (Rz1)
\end{array}$$

The explicit use of internal actions facilitate the translation to ARMV8 (§5) and the definitions required in the proof of DRF-SC (§4).

The dependency calculation for (Wz1) in this execution proceeds as follows. Desugaring, the thread is  $(r:=x; y^{\text{rel}}:=1; s:=y; z:=s)$ . The semantics of z:=s includes a pomset with event  $(s=1 \mid Wz1)$ . The precondition is discharged internally by the write of y=1 without an externally visible dependency to an explicit read: Prefixing s:=y transforms the precondition to y=1. As in the discussion of execution (3), the order from  $(W^{\text{rel}}y1)$  allows the precondition y=1 to be weakened to y=1, which is a tautology. Note that both write prefixing and read prefixing enable precondition weakening. Only read prefixing imposes order; there is no order from the prefixed write to any subsequent action.

In defining *data races* (and thus in the proof of DRF-SC) it is important to position internal actions relative to synchronization actions. For example, the following execution is consider data-race free.

$$x := 1; r := x; y^{\text{rel}} := 1; \parallel s := y^{\text{acq}}; x := 2$$



However, if we commute the read of x and the release of y, the resulting program has a data race.

Invalidation of thread inlining. The behavior in execution (6) is not possible if we split the first thread in two: r := x;  $y^{\text{rel}} := 1 \parallel z := y$ , since that would make the read explicit, thus imposing the order shown in gray. This shows that, like the JMM, our model invalidates thread inlining.

Impact of internal writes. In order to validate redundant write removal in §6, we include internal writes in addition to internal reads. Internal writes cannot be used to fulfill an external read, but they can be used to satisfy the precondition on an internal read. In addition, internal writes prevent prior external writes from being used to fulfill a read. Consider the program:

$$x := 1; x := 2; y^{rel} := 1 \parallel if(y^{acq})\{r := x\}$$

Both of the following executions violate the definition of fulfillment:

The left execution is not allowed since  $(Wx 1) < (Wx \bot) < (Rx 1)$  and  $(Wx \bot)$  writes x. The right execution is not allowed since  $(Wx \perp)$  is an internal write, not an external one. In this example, if the second write to x is internal, then there is no way to fulfill the read of x by the second thread. Of course, there are always executions where the second write is external, in which case the second thread can read 2 for x.

Multi-copy atomicity. Many of the standard litmus tests for MCA are variants of IRIW (Independent Reads of Independent Writes), such as IRIW with control dependencies between the reads:

$$x:=0; x:=1 \parallel if(x)\{r:=y\}$$
  
  $\parallel y:=0; y:=1 \parallel if(y)\{s:=x\}$ 



Since our semantics does not enforce control dependencies between reads, this execution is acyclic and is allowed. If the first read in each thread is acquiring, there is order between the reads.

$$x:=0; x:=1 \parallel if(x^{acq})\{r:=y\}$$
  
 $\parallel y:=0; y:=1 \parallel if(y^{acq})\{s:=x\}$ 

The resulting execution is disallowed due to the cycle.



Another class of non-MCA behavior is characterized by the following litmus test.

$$if(x)\{y:=0\}; y:=1 \parallel if(y)\{z:=0\}; z:=1 \parallel if(z)\{x:=0\}; x:=1$$

This execution is disallowed due to the evident cycle.

 *Load buffering and thin air.* The program  $(y:=x \parallel s:=y; x:=1)$  has top level executions that result in the final outcome x=y=1, such as:



In §7 we provide machinery to prove that this outcome is impossible if there is order from read to write in both threads. This order can be achieved by replacing the second thread (s:=y; x:=1) with  $(s:=y^{acq}; x:=1)$  or  $(if(y)\{x:=1\})$  or (x:=y).

A more interesting example is the following variant of (2):

$$(y:=x \parallel if(z)\{x:=1\} else\{x:=y; a:=y\} \parallel z:=0; z:=1)$$
 (7)

This program is allowed to write 1 to a under many speculative memory models [Jagadeesan et al. 2010; Kang et al. 2017; Manson et al. 2005], even though the read of 1 from y in the else branch of the second thread arises out of thin air. Lochbihler [2013] argues that such executions compromise type safety unless object allocation partitions memory by type. In our model, the attempted execution is:



This is forbidden by the evident cycle.

*Pointers.* Our language allows address calculations. To model this, we allow locations to have the form [n], where n is a natural number. Because we do not enforce order between reads, there is some danger that address calculations could allow thin air behavior. To see how our model addresses this, assume that we have the following locations and initial values:

$$[0] = 0$$
  $[1] = 2$   $[2] = 1$   $x = 0$   $y = 0$ 

Consider the program  $(x := [y] \parallel y := [x])$  with attempted execution:



Although there is no order enforced between the reads, the read-to-write order induced by the semantics is sufficient to prohibit this thin-air behavior.

The dependency calculation in this example is interesting. Desugaring, the first thread is r:=y; s:=[r]; x:=s. In isolation, the write action is  $(s=1 \mid Wx1)$ . Following the discussion of execution (3), s:=[r] causes the precondition to become [r]=1; subsequently prefixing with  $(r=2 \mid R \mid 2 \mid 1)$ , this can be weakened to 1=1 by the substitution of 1 for [2]. But the precondition is also constrained by the last clause of the Definition 2.1: if  $d \le e$  then  $\lambda_{\Phi}(e)$  implies  $\lambda_{\Phi}(d)$ . Thus the we arrive at the label  $(r=2 \mid Wx1)$ . Subsequently, r:=y transforms the precondition to y=2 and prefixing R y 2 allows this to be weakened to z=2.

*Blockers.* Fulfillment is often defined in terms of the absence of *blocker*, rather than the presence of order to every conflicting event. In this definition, the last bullet of Definition 2.2 becomes:

• there is no d < c < e such that c writes to x.

This definition is not preserved by parallel composition. To make the consequences clear, we include location declarations in our language, written  $\operatorname{var} x$ ; C. Fulfillment with blockers violates scope extrusion [Milner 1999], in that we can find programs C and D such that the semantics  $\operatorname{var} x$ ;  $(C \parallel D)$  is different from the semantics of  $(\operatorname{var} x; C) \parallel D$ , even if D does not mention x. To see this, consider the program:

$$r := a^{\operatorname{acq}}; \ x := 1; \ b := {\operatorname{rel}} \ 1 \ \| \ r := c^{\operatorname{acq}}; \ x := 2; \ d := {\operatorname{rel}} \ 1 \ \| \ r := e^{\operatorname{acq}}; \ s := x; \ f := {\operatorname{rel}} \ 1$$

$$(\operatorname{R}^{\operatorname{acq}} a \ 1) \longrightarrow (\operatorname{W} x \ 1) \longrightarrow (\operatorname{W}^{\operatorname{rel}} b \ 1) \qquad (\operatorname{R}^{\operatorname{acq}} e \ 1) \longrightarrow (\operatorname{R} x \ 1) \longrightarrow (\operatorname{W}^{\operatorname{rel}} f \ 1)$$

In this execution, the read is fulfilled by the first thread; the second thread is not blocking. However, when placed in parallel with (c := b; d := e), the second thread may become a blocker:



Our definition of fulfillment requires that the read be ordered with respect to both writes:

This makes augmentation in the reverse order impossible.

Read from unexecuted branch (RFUB). Boehm [2018] analyzes programs in which register state from an unexecuted branch can effect the outcome of an execution. Boehm's RFUB1 example can be written in our language as:

$$y:=x \parallel r:=y; \text{ if } (r \neq 1)\{z:=1; r:=1; x:=r\} \text{ else } \{x:=r\}$$
 (8)

Boehm's concern arises from the similarity of this program with the OOTA litmus test that replaces the second thread with r:=y; x:=r. Should an unexecuted conditional be allowed to change the outcome of the program? In our semantics the answer is "yes". First note that the program writes 1 to x in both branches of the conditional. Further, the writes to x and x in the then-branch of the conditional are independent. Therefore, it is sensible for a compiler to hoist the write to x out of the conditional.

To analyze the example formally, we use combinators on pomsets that are defined in the next section. These are prefixing  $(\Rightarrow)$  and composition  $(\parallel)$ . We also include events with unsatisfiable conditions in our drawings; we indicate unsatisfiability by crossing out the event. The execution in question is:

$$\left\| \begin{array}{c} (Rx1) \\ \downarrow \\ (Wy1) \end{array} \right\| Ry1 \Rightarrow \left( \begin{array}{c} (r \neq 1 \mid Wz1) \\ (r \neq 1 \mid Wx1) \end{array} \right) \left( \begin{array}{c} (r \neq 1 \mid Wz1) \\ (r \neq 1 \mid Wx1) \end{array} \right)$$

With an internal read of y, we have:

$$\begin{array}{c|c}
(R x 1) & y \neq 1 \mid W z 1 \\
\hline
(W y 1) & W x 1
\end{array}$$

but the precondition  $y \neq 1$  cannot be satisfied locally. With an external read of (R y 1), we can discharge the precondition, but in this case, the predicate becomes  $1 \neq 1$ .



Causality test cases. Pugh [2004] developed a set of twenty causality test cases in the process of revising the Java Memory Model (JMM) [Manson et al. 2005]. Using hand calculation, we have confirmed that our model gives the desired result for all twenty cases, unrolling loops as necessary. Our model also gives the desired results for all of the examples in Batty et al. [2015, §4] and Ševčík [2008, §5.3]. Our model agrees with the JMM on the "surprising and controversial behaviors" of Manson et al. [2005, §8]. §D develops some of these examples.

#### 3 THE SEMANTICS OF A SIMPLE CONCURRENT LANGUAGE

In the previous section, we described our semantics informally. In this section we firm things up. In §3.1, we describe data models. In §3.2 we formalize some definitions for sets of pomsets. In both cases, the details are tedious and mundane. The exciting bits are in §3.3, where we describe combinators for sets of pomsets, and §3.4, where we use the combinators to define the semantics of a simple concurrent language.

#### 3.1 Data models

442

443

445

447

449

451

453

455

457

459

461

462

463 464

465

466

467

469

470

471

472

473

474

475

476

477

478

479

480

481

482

483

484

485

486 487

488

489

490

A data model consists of:

- a set of values V, ranged over by v, w, u and  $\ell$ ,
- a set of registers  $\mathcal{R}$ , ranged over by r and s,
- a set of expressions  $\mathcal{E}$ , ranged over by M, N, L and K,
- a set of memory locations X, ranged over by x and y,
- a set of logical formulae  $\Phi$ , ranged over by  $\phi$  and  $\psi$ , and
- a set of actions  $\mathcal{A}$ , ranged over by a and b.

Let  $\sigma$  range over substitutions of the form  $\phi[x/r]$  or  $\phi[N/x]$ .

We require that data models satisfy the following:

- the sets of values, registers and memory locations are disjoint,
- values include at least the constants 0 and 1,
- expressions include at least registers and values,
- memory locations have the form [v],
- expressions do *not* include memory locations or the operator [M],
- formulae include at least true, false, and equalities of the form (M = v),
- formulae are closed under negation, conjunction, disjunction, and substitution<sup>2</sup>, and
- there is a relation ⊨ between formulae.

We say that  $\phi$  is *independent of x* whenever  $\phi \models \phi[v/x] \models \phi$  for every v, and that  $\phi$  is *dependent on x* otherwise. We say that  $\phi$  is *location independent* if it is independent of every location.

We say that  $\phi$  implies  $\psi$  whenever  $\phi \vDash \psi$ , that  $\phi$  is a *tautology* whenever true  $\vDash \phi$ , that  $\phi$  is *unsatisfiable* whenever  $\phi \vDash$  false.

For the actions of a data model, we require that

• there are partial functions Reads and Writes :  $\mathcal{A} \to (\mathcal{X} \times \mathcal{V}_{\perp})$ ,

<sup>&</sup>lt;sup>2</sup>Since formulae are closed under substitutions of the form  $\phi[x/r]$ , they must include equalities of the form  $(\mathbb{M} = v)$  and  $(\mathbb{M} = x)$ , where  $\mathbb{M}$  is an *extended expression* that includes memory locations. We elide the details. By composition of the closure conditions, formulae must also be closed under that substitutions of the form  $\phi[M/r] = \phi[x/r][M/x]$ .

- 491 492 493 494
- 494 495 496
- 497 498 499
- 501 502 503 504 505
- 506 507 508 509 510
- 511 512 513 514
- 515516517518
- 519 520 521 522
- 523 524 525 526 527
- 529 530 531 532

- 533534535536
- 536537538

539

- there are sets Release and Acquire  $\subseteq \mathcal{A}$ , and
- there is a function internalize :  $(X \times \mathcal{A}) \to \mathcal{A}$  that satisfies the restrictions given below.

We say that a reads v from x whenever Reads(a) = (x, v), and that a writes v to x whenever Writes(a) = (x, v). We say that a reads from x whenever Reads(a) = (x, v), and that a writes to x whenever Writes(a) = (x, v), for some v (possibly  $\perp$ ).

Actions that read or write values are *external*, actions that read or write  $\bot$  are *internal*.

We say that a is an acquire if  $a \in Acquire$ , and that a is a release if  $a \in Release$ .

We require that internalize satisfy the following:

- internalize(a) reads  $\perp$  from x exactly when a reads from x,
- internalize(a) writes  $\perp$  from x exactly when a writes from x,
- internalize(a) is an acquire exactly when a is an acquire, and
- internalize(*a*) is a release exactly when *a* is a release.

As noted in §2, our example language includes acquiring read ( $R^{acq} x v$ ), relaxed read (R x v), releasing write ( $W^{rel} x v$ ), and relaxed write (W x v). For each external action, we also define a corresponding internal action ( $R^{acq} x \perp$ ), ( $R x \perp$ ), ( $W^{rel} x \perp$ ), and ( $W x \perp$ ). In pictures, we draw internal actions grayed out, rather than using  $\perp$ .

We also include acquire-release fences of the form (Facqrel).

#### 3.2 The semantic domain

Recall Definition 2.1 of memory model pomsets.

We lift terminology from logical formulae and actions to events. For example, we say that e is unsatisfiable when  $\lambda_{\Phi}(e)$  is unsatisfiable, and that e is an acquire when  $\lambda_{\pi}(e)$  is an acquire.

We give the semantics of programs as sets of pomsets. Each pomset  $P \in [\![C]\!]$  will represent a single execution of C.

We expect the sets of pomsets given by the semantics to be closed with respect to *isomorphism*, *augmentation* and *implication*.

**Definition 3.1.** P' is an *isomorphism* of P if there is a bijection  $f: E \to E'$  such that  $\lambda(e) = \lambda'(f(e))$ , and  $e \le d$  iff  $f(e) \le f(d)$ .

```
P' is an augmentation of P if E' = E, \lambda' = \lambda, and \leq' \supseteq \leq.

P' implies P if E' = E, \leq' = \leq, \lambda'_{\mathcal{A}} = \lambda_{\mathcal{A}}, and \lambda'_{\Phi}(e) implies \lambda_{\Phi}(e) for all e \in E.
```

Each  $P \in [\![C]\!]$  as a *completed* execution. So, we do not expect  $[\![C]\!]$  to be prefixed closed. However, implication closure in a memory-model pomset does give something similar: any event e can be given an unsatisfiable precondition, which means that every event ordered after e must also be unsatisfiable, as per Definition 2.1. Since unsatisfiable events are ignored by our model, this provides a kind of prefix closure.

#### 3.3 Combinators

We give the semantics using combinators over sets of pomsets, defined below. Using  $\mathcal P$  to range over sets of pomsets, these are:

- substitution  $\mathcal{P}\sigma$ , which applies the substitution to every precondition,
- restriction  $vx \cdot P$ , which internalizes x for pomsets that are x-closed,
- composition  $\mathcal{P}^1 \parallel \mathcal{P}^2$ , which unions pomsets, allowing events to be merged, and
- prefixing  $(\phi \mid a) \Rightarrow \mathcal{P}$ , which adds an event c with label  $(\phi \mid a)$  to pomsets in  $\mathcal{P}$ , ordering c before any e whose predicate depends on the value read by a. We elide  $\phi$  when it is a tautology.

These operations are similar to those from models of concurrency such as [Brookes et al. 1984].

We also define two filtering operations:

• guarding  $\phi \triangleright \mathcal{P}$ , which keeps pomsets where all preconditions imply  $\phi$ , and

• independency filtering  $X \triangleright \mathcal{P}$ , which keeps pomsets where all preconditions are location independent.

The definitions of substitution, restriction and the filtering operations are straightforward<sup>3</sup>:

**Definition 3.2.** Let  $\mathcal{P}\sigma$  be the set  $\mathcal{P}'$  where  $P' \in \mathcal{P}'$  whenever there is  $P \in \mathcal{P}$  such that: E' = E,  $\leq' = \leq$ , and  $\lambda'(e) = (\psi \sigma \mid a)$  when  $\lambda(e) = (\psi \mid a)$ .

Let  $(vx \cdot P)$  be the set P' where  $P' \in P'$  whenever there is  $P \in P$  such that P is x-closed and:  $E' = E, \le' = \le$ , and  $\lambda'(e) = (\psi \mid \text{internalize}(a)) \text{ when } \lambda(e) = (\psi \mid a)$ .

Let  $(\phi \triangleright \mathcal{P})$  be the subset of  $\mathcal{P}$  such that  $P \in \mathcal{P}$  whenever  $\phi$  implies  $\lambda_{\Phi}(e)$ , for every  $e \in E$ .

Let  $(X \triangleright P)$  be the subset of P such that  $P \in P$  whenever  $\lambda_{\Phi}(e)$  is location independent, for every  $e \in E$ .

**Definition 3.3.** Let  $P' \in (\mathcal{P}^1 \parallel \mathcal{P}^2)$  whenever there are  $P^1 \in \mathcal{P}^1$  and  $P^2 \in \mathcal{P}^2$  such that:

- $\bullet \ E' = E^1 \cup E^2.$
- $\leq' \supseteq \leq^1 \cup \leq^2$ .

541

542

543

545

547

549

551

553

554

555

561

562

563

568

569 570

571

572

573

574

575

576

577

578

579

580

581

582

583

584 585

586

587

588

- $\begin{array}{l} -\ \lambda_{\mathcal{A}}'(e) = \lambda_{\mathcal{A}}^1(e) = \lambda_{\mathcal{A}}^2(e) \ \text{and} \ \lambda_{\Phi}'(e) \ \text{implies} \ \lambda_{\Phi}^1(e) \vee \lambda_{\Phi}^2(e), \\ -\ \lambda_{\mathcal{A}}'(e) = \lambda_{\mathcal{A}}^1(e), \ \ e \notin E^2 \ \ \text{and} \ \lambda_{\Phi}'(e) \ \text{implies} \ \lambda_{\Phi}^1(e), \ \ \text{or} \\ -\ \lambda_{\mathcal{A}}'(e) = \lambda_{\mathcal{A}}^2(e), \ \ e \notin E^1 \ \ \text{and} \ \lambda_{\Phi}'(e) \ \text{implies} \ \lambda_{\Phi}^2(e). \end{array}$

Composition is used in giving the semantics for conditionals and concurrency.  $\mathcal{P}^1 \parallel \mathcal{P}^2$  contains the union of pomsets from  $\mathcal{P}^1$  and  $\mathcal{P}^2$ , allowing overlap as long as they agree on actions. For example, if  $\mathcal{P}^{\bar{1}}$  and  $\mathcal{P}^2$  contain:

then  $\mathcal{P}^1 \parallel \mathcal{P}^2$  contains:

$$\boxed{\phi \mid a} \rightarrow \boxed{\psi^1 \lor \psi^2 \mid b} \rightarrow \boxed{\chi \mid c}$$

**Definition 3.4.** Let  $(\phi \mid a) \Rightarrow \mathcal{P}$  be the set  $\mathcal{P}'$  where  $P' \in \mathcal{P}'$  whenever there is  $P \in \mathcal{P}$  such that:

- (1)  $E' = E \cup \{c\},\$
- $(2) \leq' \supseteq \leq,$
- (3)  $\lambda'_{\mathcal{A}}(c) = a$ ,  $\lambda'_{\mathcal{A}}(e) = \lambda_{\mathcal{A}}(e)$ , and  $\lambda'_{\Phi}(c)$  implies  $\phi$ ,
- (4) if *a* is not a read then  $\lambda'_{\Phi}(e)$  implies  $\lambda_{\Phi}(e)$ ,
- (5) if *a* internally reads from *x* then both
- (5a)  $\lambda'_{\Phi}(e)$  implies  $\lambda_{\Phi}(e)$ , and
- (5b) if *e* is a write then either c < 'e or  $\lambda_{\Phi}(e)$  is independent of x,
- (6) if a externally reads v from x then both
- (6a)  $\lambda'_{\Phi}(e)$  implies  $\lambda_{\Phi}(e)[v/x]$ , and
- (6b) if *e* is a write then either c <' e or  $\lambda'_{\Phi}(e)$  implies  $\lambda_{\Phi}(e)$ ,
- (7) if *a* and  $\lambda_{\mathcal{A}}(e)$  are external actions in conflict, then  $c \leq e$ ,
- (8) if *a* is an acquire or  $\lambda_{\mathcal{A}}(e)$  is a release then c < 'e, and
- (9) if *a* is an acquire then  $\lambda_{\Phi}(e)$  is location independent.

<sup>&</sup>lt;sup>3</sup>We have chosen the definition of restriction for its simplicity. It is worth noting, however, that our definition does not support renaming of variables. In particular  $(vx \cdot \mathcal{P} \parallel (vx \cdot Q))$  is generally not the same as  $(vx \cdot \mathcal{P} \parallel (vy \cdot Q[y/x]))$ . To support renaming,  $(vx \cdot P)$  would need to either remove or relabel events that mention x.

  $(\phi \mid a) \Rightarrow \mathcal{P}$  adds a new event c with label  $(\phi \mid a)$  to each pomset in  $\mathcal{P}$ . As in the definition of parallel composition, the definition allows the new event to overlap with events in  $\mathcal{P}$  as long as they agree on the action. Overlapping of synchronization events is disallowed by item 8.

If c writes to a location that is also written by some e in  $\mathcal{P}$ , item 7 introduces order between them:  $c \leq e$ . This ensures that these writes cannot be given the reverse order in an augmentation.

If c reads from a location that occurs in the predicate of e, then prefixing introduces order from c to e. whose predicate depends on x. For example, if  $\mathcal{P}$  contains ( $x > 0 \mid Wz1$ ) then (R x1)  $\Rightarrow \mathcal{P}$  contains:

$$(Rx1) \rightarrow (1 > 0 \mid Wz1)$$
 and  $(Rx1)$   $(x > 0 \mid Wz1)$ 

In order to weaken the predicate on (W z 1), item 6b requires that we include the order from (R x 1) to (W z 1), but not if the predicate remains x > 0 (which trivially implies 1 > 0).

Item 8 ensures that events are ordered before a release and after an acquire.

Item 9 filters the executions of  $\mathcal{P}$ , ensuring that thread-local reads do not cross acquire actions. This prevents bad executions like the following, which violate DRF-SC.

$$x:=1; a^{\text{rel}}:=1; \text{ if } (b^{\text{acq}})\{r:=x; y:=r\} \parallel \text{ if } (a^{\text{acq}})\{x:=2; b^{\text{rel}}:=1\}$$

$$\underbrace{(Wx1)}_{\text{Racq }a1} \underbrace{(R^{\text{acq}}b1)}_{\text{W}x2} \underbrace{(R^{\text{acq}}b1)}_{\text{W}rel} \underbrace{(R^{\text{acq}b1)}_{\text{W}rel} \underbrace{(R^{\text{acq}}b1)}_{\text{W}rel} \underbrace{(R^{\text{acq}b1)}_{\text{$$

In item 9, we do not require that  $\psi'$  is independent of every y; were we to require this, the definition would not be augment closed.

The following lemma is immediate from the definitions.

Lemma 3.5. All combinators are monotone with respect to subset order.

#### 3.4 Semantics of programs

We consider a simple shared-memory concurrent language, with statements defined as follows.

$$C, D := \text{skip} \mid C \mid D \mid \text{var}x; C \mid \text{fence}; C \mid \text{if}(M)\{C\} \text{else}\{D\}$$
  
 $\mid r := M; C \mid r := [L]^{\text{acq}}; C \mid r := [L]; C \mid [L]^{\text{rel}} := M; C \mid [L] := M; C$ 

We use common syntax sugar, such as *extended expressions*, which include memory locations. For example, if the extended expression  $\mathbb{M}$  includes a single occurrence of x, then  $y := \mathbb{M}$ ; C is shorthand for r := x;  $y := \mathbb{M}[r/x]$ ; C. Each occurrence of x in an extended expression corresponds to an independent read. We also write  $if(M)\{C^1\}$  else  $\{C^2\}$ ; D as shorthand for  $if(M)\{C^1\}$  else  $\{C^2\}$ ; D and var x := M; C as shorthand for var x; x := M; C.

The semantics of programs is as follows<sup>4</sup>:

<sup>&</sup>lt;sup>4</sup>Fork parallelism is easily expressible in this framework:  $[\![C \ ]\!] D]\!] = [\![C]\!] \parallel \mathcal{X} \rhd [\![D]\!].$ 

The semantics of relaxed reads is the union of two sets. The first set adds a read action to each pomset in  $[\![C]\!]$ ; the second adds an internal action. Whereas read actions have values that are important in the semantics, the values of internal actions are ignored. Internal reads are used in the definition of data races in §4. Acquiring reads are never internal. The rule for relaxed writes is similar.

The write rule uses the substitution [M/x] with precondition M = v, rather than using [v/x] directly. To see the need for this, consider  $[if(s = r)\{z := 1\}]$ , which includes  $(s = r \mid Wz \mid 1)$ . Therefore  $[s := x; if(s = r)\{z := 1\}]$  includes  $(x = r \mid Wz \mid 1)$  and  $[x := r; s := x; if(s = r)\{z := 1\}]$  includes  $(r = r \mid Wz \mid 1)$  which is independent of r. If we took the semantics of write to use [v/x], then we would end up with pomsets of the form  $(v = r \mid Wz \mid 1)$  which depend on r.

The role of internal writes in the semantics is to facilitate dead store. Consider the semantics of y:=0; y:=1. It includes not only:

$$\boxed{ Wy0}$$
 ---- $\boxed{ Wy1}$ 

but also:

$$[\overline{W}y]$$
 ----  $\rightarrow \overline{W}y1$ 

The latter pomset is effectively the semantics of y := 1, thus validating the dead store transformation that eliminates the first write.

Prefixing does not necessarily induce a dependency, even for read actions where the read is used. To see that this is desirable, consider the semantics of y:=0; r:=y; if  $(r \le 1)\{x:=2\}$ . To begin, note that  $[if(r \le 1)\{x:=2\}]$  includes  $(r \le 1 \mid Wx2)$  which depends on r. Then  $[r:=y; if(r \le 1)\{x:=2\}]$  includes

$$(Ry) \rightarrow (y \le 1 \mid Wx2)$$

Prefixing with a write to y,  $[y:=0; r:=y; if (r \le 1)\{x:=2\}]$  discharges the precondition of the write to x, yielding

which simplifies to:

$$(\mathbf{W} y \mathbf{0})$$
  $(\mathbf{R} y) \rightarrow (\mathbf{W} x \mathbf{2})$ 

Here the thread-local value of y discharges the predicate.

#### 4 DATA RACE FREE BEHAVIORS ARE SEQUENTIALLY CONSISTENT

In this section, we prove the SC-DRF theorem, which states that any program that lacks data races under the SC semantics must only have executions that are compatible with SC executions. We present the result for programs of the form  $\vec{x} := \vec{0}$ ; fence; C, where C is restriction-free. Thus, all memory locations are initialized to 0, initialization happens-before the execution of any command, and internal actions only arise from (intra-thread) implicit reads.

Define the relation **eco** so that  $(e, d) \in \text{eco}$  if  $e \leq d$  and e and d conflict.

The program  $x:=1 \parallel x:=2$  is considered to have an SC data race, but x:=1; x:=2 does not. In our semantics the only difference between these is that x:=1; x:=2 enforces weak order between the writes. Note also that the [x:=1; a:=y] = [a:=y; x:=1], yet these two must be distinguished in SC, as per the load-buffering and store-buffering litmus tests.

688 689

694

696

698

700

701

702

704

706

707 708

709

710

711

712

713 714

715

716

717

718 719

720

721

722

723 724

725

726

727

728

729

731 732

733

734

735

In order to define SC executions and SC data races, it is necessary to augment our semantics to record program order. We extend the definitions in §3.4 with  $po \subseteq E \times E$ , defined as follows:

- po ' = po when  $\mathcal{P}' = \mathcal{P}\sigma$  or  $\mathcal{P}' = \phi \triangleright \mathcal{P}$
- po ' = po $|_{E'}$  when  $\mathcal{P}' = vx \cdot \mathcal{P}$
- po ' = po<sup>1</sup>  $\cup$  po<sup>2</sup> when  $\mathcal{P}' = \mathcal{P}^1 \parallel \mathcal{P}^2$
- po ' = po  $\cup \{(c, e) \mid e \in E\}$  when  $\mathcal{P}' = a \Rightarrow \mathcal{P}$  and  $E' = E \cup \{c\}$

Define the relation rf so that  $(e, d) \in \text{rf}$  if e writes x, d reads x, and for any e that writes e either e or e or

```
sw = [Release]; (rf \setminus po); [Acquire]

hb = (po \cup sw)^+
```

Note that our semantics guarantees that  $sw \subseteq <$ .

A pomset has a *data race* if there are events *e* and *d* such that

- e and d are unordered by hb,
- $\lambda_{\Phi}(e)$  and  $\lambda_{\Phi}(d)$  are tautologies, and
- $\lambda_{\mathcal{A}}(e)$  and  $\lambda_{\mathcal{A}}(d)$  conflict.

**Definition 4.1.** Let  $[\![C]\!]_{SC}$  be the subset of  $[\![C]\!]$  such that  $P \in [\![C]\!]_{SC}$  whenever P is a top-level pomset and  $< \cup$  po is acyclic.

We argue that this definition is sufficient to capture sequential consistency: Any total order that linearizes the acyclic relation is consistent with strong order (<) and the program order (po). Since  $\leq$  contains eco, only the last write to a location is read in such a total order.

We only consider *generators*, which are top-level pomsets that are minimal with respect to augmentation and implication. Since we are considering finite programs without loops, the pomsets in the semantics of threads are finite. Thus, there are no infinite descending chains of augmentations.

We prove the following theorem in §B.

**Theorem 4.2.** *Let P be a generator for C.* 

- If P does not have a data race,  $P \in [\![C]\!]_{SC}$ .
- If P has a data race, then there exists  $P' \in [\![C]\!]_{SC}$  that has a data race.

A key step of this proof is an analysis of the closure properties of the semantics. In order to perform this fine grained analysis of dependency, we describe a variant of the semantics using modal pomsets defined below.

**Definition 4.3.** A *modal pomset* is a tuple  $(E, \leq, \leq, \lambda)$ , such that

- $(E, \leq, \lambda)$  is a pomset, and
- $\trianglelefteq \subseteq \subseteq$  is a partial order.

These are the pomset equivalent of the *modal transition systems* of Larsen and Thomsen [1988], Huth et al. [2001] call  $\lhd$  a "must transition" and < a "may transition". In anticipation, we have used the terms "strong order" and "weak order" respectively, drawing  $\lhd$  as a solid arrow " $\rightarrow$ " and < dashed " $\rightarrow$ " in the pictures following Lamport's [1986] notation. The intuitive temporal meaning of  $e \lhd d$  is that e *must* strictly precede d, whereas  $e \lhd d$  is intended to connote that e may precede d.

<sup>&</sup>lt;sup>5</sup>For simplicity, the definition of sw does not include release sequences or fences. For example, we consider  $(W^{rel} \ x \ 1) \Rightarrow (W \ x \ 2) \parallel (R^{acq} \ x \ 2)$  to be racy, whereas this pomset is race-free using release sequences. If we include release sequences and fences, then hb relates more events and thus there are fewer races. Our results hold under either definition.

The semantics of programs in the modal model proceeds as before, mutatis mutandis, with details described in the appendix. This finer analysis of necessary and possible dependency allows us to establish the existence of pomsets in the semantics as we search for sequential witnesses to data races. We provide some illustrative examples below.

For the program  $(y:=0; r:=y; x:=1) \parallel (x:=0; s:=x; y:=1) :$ 



For the program  $(y := 0; x := 1; r := y) \parallel (x := 0; y := 1; s := x) :$ 



For the program  $(x := 1) \parallel (x := 0)$ :



#### 5 EFFICIENT IMPLEMENTATION ON ARMV8

In this section, we consider the fragment of our language without restriction. For simplicity, we allow release and acquire synchronization but ban fences. We assume that all memory locations are initialized to 0 and parallel-composition occurs only at top level. We take the set of memory locations to be finite. In other words, we assume that programs have the form

$$x_1 := 0; \dots; x_m := 0; (C^1 \parallel \dots \parallel C^n)$$

where  $C^1, \ldots, C^n$  do not include composition, restriction or fence operations.

Our language can be translated to ARM using 1dr for relaxed read, 1dar for acquiring read, str for relaxed write, and stlr for releasing write. Relative to the ARM specification, we have removed loops and read-modify-write (RMW) operations, in addition to fences<sup>6</sup>.

We show that any ARM-consistent execution graph for this sublanguage can be considered an execution of our semantics. Due to space limitations, we do not include a full description ARM consistency in the main text . Here we provide a birds eye view of the details, drawing on the intuitions gleaned from [Pulte et al. 2018]. Interested readers should see §C for further details.

An ARM execution graph G defines many relations, including program order (po), reads-from (rf), coherence (co) and several dependency orders. From these are derived:

- poloc, which is the subrelation of po that only relates actions on the same location,
- ob, which is required to be acyclic (EXTERNAL), and
- eco, with the requirement that poloc ∪ eco be acyclic (SC-PER-LOC).

Given an execution graph G, we say that e is an *internal read* if  $e \in \text{codomain}(p \circ \cap rf)$ .

The ob order is an acyclic global order on events, agreed upon by all threads, reflecting the progress of time in an ARMV8 execution. The cross thread component of the ordering is induced by the ordering on conflicting actions on the same location from different threads. The intra

<sup>&</sup>lt;sup>6</sup>We leave out fences for simplicity. Following Podkopaev et al. [2019], our fence instruction can be translated to dmb.sy, since it has release-acquire semantics. Acquire fences map to dmb.ld, and release fences to dmb.sy — dmb.st does not provide order to prior reads.

thread component of the ordering is induced by barrier ordering and data ordering. Notably, these dependencies are determined syntactically. In particular, ob may not necessarily include the intra thread component of poloc ordering.

This motivates the translation of an ARMV8 execution into our setting. In our setting, the progress of time is given by <. We accommodate intra-thread reordering by internal read actions, thus excusing us from the obligation of placing them on the global <-timeline.

Formally, from *G* we construct a candidate pomset *P* as follows:

 $\bullet$   $E = \mathsf{E}$ ,

- $\lambda_{\mathcal{A}}(e) = \tau \text{lab}(e)$ , if *e* is a relaxed internal read,
- $\lambda_{\mathcal{A}}(e) = \text{lab}(e)$ , if e is not a relaxed internal read,
- $\lambda_{\Phi}(e) = \text{true}$ ,
- $\leq = (ob \cup eco)^*$ , where \* denotes reflexive and transitive closure.

**Theorem 5.1.** If G is ARM consistent, the constructed candidate satisfies the requirements for a top-level memory model pomset.

Any <-ordering imposed in our model is enforced by ARMV8, since our notion of semantic dependency is more permissive than ARMV8's syntactic dependency. So, the heart of the proof is showing the acyclicity of  $(ob \cup eco)^*$  for the events under consideration. Since the cross thread portion of eco (coe, fre, rfe) is included in ob, this result is really about the influence of eco  $\cap$  po. Our translation of ARMV8's rfi as silent internal actions removes them from order considerations. Consequently, we only have to consider the suborder of ob derived without ever using rfi for the following key property demonstrated in §C.

LEMMA 5.2. Let e, d be distinct events and  $d'(\xrightarrow{ob} \cap \xrightarrow{po}) d((\xrightarrow{eco} \cap \xrightarrow{po}) \setminus \xrightarrow{rfi}) e(\xrightarrow{ob} \cap \xrightarrow{po}) e'$ . Then  $d'\xrightarrow{ob} e'$ .

*Remark 5.3 (Proof for TSO).* The proof for compilation into TsO is very similar. In particular the facts listed above hold for TsO, where ob is replaced by (the transitive closure of) the propagation relation defined for TsO [Alglave et al. 2014].

#### 6 SINGLE THREADED OPTIMIZATIONS

As we have seen already, our model *invalidates* thread inlining. We argue that our model is fully flexible with respect to single threaded optimizations; concretely by validating several single threaded optimizations including reordering of independent statements; and abstractly, by proving a full abstraction theorem for single threads without synchronization and connecting to traditional Hoare logic for sequential programs.

Let sat( $\mathcal{P}$ ) be the set  $\mathcal{P}'$  where  $P' \in \mathcal{P}'$  whenever there is  $P \in \mathcal{P}$  such that:  $E' = \{e \in E \mid \lambda_{\Phi}(e) \text{ is satisfiable and } \lambda_{\mathcal{P}}(e) \text{ is external}\}, \leq' = \leq_{E'}, \text{ and } \lambda' = \lambda_{E'}.$ 

By Lemma 3.5, if  $[\![C]\!] \supseteq [\![D]\!]$  then C can be transformed into D in any program context. Since unsatisfiable events are ignored in our architecture model, it is an immediate corollary that it is sufficient to show that  $\mathsf{sat}[\![C]\!] \supseteq \mathsf{sat}[\![D]\!]$ .

#### 6.1 Validating single threaded optimizations

We follow the terminology and presentation of section 7.1 of Dolan et al. [2018], to maintain a clear comparison with models that enforce extra ordering.

Certain transformations involving adjacent operations on the same location are permissible.

Lemma 6.1. Suppose M = 0 is a tautology and L = K is a tautology. Then the following hold.

$$\operatorname{sat}[\inf(M)\{C\} \operatorname{else}\{D\}] \supseteq \operatorname{sat}[D]$$
 (Dead Code)

PROOF. For dead code elimination, the result is immediate from the semantics.

 For dead store, note that no event in C can depend on the first store, therefore it may be taken with precondition false.

For the others, the proof proceeds by observing that the semantics allows us to use an implicit action on the lefthand side to mimic the right hand side. For example, taking [L] = [K] = x, the argument for redundant load is as follows.

$$[r:=x; s:=x; C] \supseteq [s:=x; C[x/r]]$$

$$\supseteq [C][x/r][x/s] \cup \bigcup_{v} (Rxv) \Rightarrow [C][x/r][x/s]$$

$$= [C][r/s][x/r] \cup \bigcup_{v} (Rxv) \Rightarrow [C][r/s][x/r]$$

$$= [r:=x; C[r/s]]$$

Certain reorderings involving adjacent operations on the distinct locations are also permissible.

Lemma 6.2. Suppose  $L \neq K$  is a tautology. Then the following reorderings hold.

$$[r := [L]; s := [K]; C] = [s := [K]; r := [L]; C]$$
 if  $r \notin \operatorname{fn}(K)$  and  $s \notin \operatorname{fn}(L)$  (R-R) 
$$[r := [L]; [K] := N; C] = [[K] := N; r := [L]; C]$$
 if  $r \notin \operatorname{fn}(K) \cup \operatorname{fn}(N)$  (R-W) 
$$[[L] := M; [K] := N; C] = [[K] := N; [L] := M; C]$$
 (W-W)

PROOF. The proof of both of the above lemmas follows from the semantics of prefixing where the only enforced < relationships come from conflict on x or release actions in C.

Roach-motel reorderings that increase the scope of synchronization are valid.

Lemma 6.3. Suppose  $L \neq K$  is a tautology. Then the following reorderings hold.

Proof. The proof follows from noticing that the pomsets in the semantics of the right hand sides are augmentations of a pomset on the left hand side.

Compiler optimizations. Reordering and peephole optimizations can be combined to describe common compiler optimizations. We illustrate using common subexpression elimination, following Dolan et al. [2018]: Consider the command (r:=x\*2; C; s:=x\*2) where C is independent of r. Reordering yields (C; r:=x\*2; s:=x\*2), followed by redundant load to yield (C; r:=x\*2; s:=r).

Similarly, the treatment of loop-invariant code motion, dead-store elimination and constant propagation from Dolan et al. [2018] follow.

Since our model is more generous about permitted reorderings, we permit optimizations that they forbid. Consider: (r:=x; y:=z; x:=r). Reordering, permitted by us, but forbidden by them, yields (r:=x; x:=r; y:=z), followed by the valid elimination of redundant load (r:=x; x:=r; y:=z).

#### 6.2 Full abstraction for synchronization free threads

Our semantics is complete for reasoning about full-thread optimizations of synchronization free programs.

In the rest of this section, we only consider commands C, D that are restriction-free, composition-free (ie. single threaded), and synchronization-free (ie. no acquire, release, fence).

In order to develop the proof, we first make the semantics insensitive to reads that are never used. Let read( $\mathcal{P}$ ) be the smallest augmentation closed set containing  $\mathcal{P}$  that also satisfies closure under the inclusion of useless reads: if  $P \in \text{read}(\mathcal{P})$ , then  $P' \in \text{read}(\mathcal{P})$ , where  $E' = E \cup \{e\}$ ,  $e \notin E$ ,  $\leq' = \leq$ ,  $\lambda'|_E = \lambda$ , and  $\lambda'(e) = \mathbb{R} x v$  for any x and v.

Let readsat( $\mathcal{P}$ ) = read(sat( $\mathcal{P}$ )).

This closure permits us to describe a normal form for the top-level pomsets that arise in single threaded and synchronization free code.

#### **Definition 6.4.** *P* is in normal form if:

- If d < e and d is a write, then e is a write or read on the same variable.
- If d < e and e is a read, then d is a write on the same variable.
- If e, e' have the same read action label, then there exists a write d on the same location such that  $e \le d \le e'$ .

In a normal form pomset, the successors of a write event (resp. the predecessors of a read) are related in the coherence order to the event. Any two events with the same read label are separated by a write in the eco order.

It suffices to consider normal forms when distinguishing single threaded, synchronization free code at the top level. The normal forms determine the full semantics by the closure properties of the semantics.

LEMMA 6.5. Every top-level pomset of readsat  $[\![C]\!]$  is a top-level pomset of readsat  $[\![D]\!]$  if and only if every top-level normal form pomset of readsat  $[\![C]\!]$  is a top-level normal form pomset of readsat  $[\![D]\!]$ .

PROOF. The proof proceeds by induction on structure. The key case is prefixing. The only edges <-edges out of writes and into read actions enforced by prefixing are eco edges. Read prefixing permits the reuse of events to ensure that distinct read events not separated by eco have distinct read labels.

We develop testers for top-level pomsets in normal form. We follow Plotkin and Pratt [1997], albeit in a concrete form appropriate to our setting.

Let *P* be a top-level pomset in normal form with events  $e_1, \ldots, e_n$ . For all *i*, we assume a new location  $b_i$ . Let  $v_{fr}$  be a fresh value that does not occur in *P*.

Let  $\vec{d}$  be all the predecessors of  $e_i$  in <. Let  $\vec{c}$  be the corresponding sub vector of  $b_i$ 's.

• If  $e_i$  has label R xv, we define a program  $\mathsf{test}_i^P$  as follows:

$$\mathsf{test}_i^P = (\mathsf{if}(\vec{c} = \vec{1})\{x := v; \, \mathsf{fence}; \, b_i := 1\})$$

• If  $e_i$  has label W x v, we define a program test<sup>P</sup> as follows:

$$\mathsf{test}_i^P = (\mathsf{if}(\vec{c} = \vec{1}) \{ \mathsf{if}(x = v) \{ x = v_\mathsf{fr}; \, \mathsf{fence}; \, b_i \text{:=} 1 \} \})$$

 $\mathsf{test}_i^P$  is as expected, matching the reads (resp. writes) in P by writes (resp. reads). The fresh value is used to flush out prior writes and reset to see fresh writes.

**Definition 6.6** (Tester for P). The tester context for P,  $test^P[-]$  is

$$b := b_1 \wedge \cdots \wedge b_n \parallel \mathsf{test}_1^P \parallel \cdots \parallel \mathsf{test}_n^P \parallel [-]$$

The constraints of the normal form ensure that if the labels of any two events are the same, then one is a predecessor of the other.

Lemma 6.7.  $[test^P[C]]$  has a point that sets b to 1 iff  $P \in readsat[C]$ .

PROOF. It suffices to prove that there is a pomset that sets b to 1 in readsat  $[b:=b_1 \wedge \cdots \wedge b_n \mid test_1^P \mid \cdots \mid test_n^P] \mid P'$  iff P is an augmentation of P'.

We prove by induction on < that  $test_i^P[C]$  sets  $b_i$  to 1 iff the event  $e_i$  is enabled. Result follows.  $\Box$ 

**Theorem 6.8.** Let  $C_1$  and  $C_2$  be restriction-free, composition-free and synchronization free. Then all top-level pomsets of readsat  $[C_1]$  are also pomsets of readsat  $[C_2]$  if and only if for all parallel contexts D, all top-level pomsets of readsat  $[C_1]$  [D] are also pomsets of readsat  $[C_2]$  [D]

PROOF. The forward implication follows from the compositionality of the semantics.

For the converse, pick a top level pomset in normal form in  $P \in \text{readsat}[C_1] \setminus \text{readsat}[C_2]$ . The required context is given by the tester context  $\text{test}^P[-]$ .

The full abstraction theorem applies only to full threads. The reason for this limited statement is that our impoverished language does not permit parallel composition to be used freely as the continuation in arbitrary sequential contexts. A full abstraction theorem that applies also to commands can be achieved with these richer distinguishing contexts.

#### 6.3 Relationship to Hoare logic

In order to provide useful information about sequential commands in sequential contexts, we establish a relationship with Hoare triples for a sequential language.

This invariant captured in the following theorem underlies the definitions in the semantics of sequential commands. In this theorem, we consider Hoare triples over formulas as per as per §3.1.

**Theorem 6.9.** Let C be a command without either synchronization or parallelism.

The following Hoare triple is valid,

$$\{\phi \wedge y_1 = v_1 \wedge \cdots \wedge y_n = v_n\} C \{x = w\}$$

if and only if readsat  $[\![C]\!]$  contains an execution that includes

$$\underbrace{\vec{\psi} \mid \mathbf{R} \, \vec{y} \, \vec{v}}_{e:(\phi \mid \mathbf{W} \, \mathbf{x} \, \mathbf{w})}$$

where  $R \vec{y} \vec{v}$  are all the events that precede e and  $W x \vec{u}$  are all the writes on x.

PROOF. The proof proceeds by structural induction on the command.

We illustrate with a few examples. {true} x := 1; x := 2 {x = 2} holds. In the semantics, this is mirrored by the execution:

$$(W x 1) \rightarrow (W x 2)$$

The two pomsets corresponding to the triple  $\{x = 1\}$  r := x; if  $\{y = 1\}$   $\{y = 1\}$  illustrate the two ways to satisfy preconditions in the above theorem.

Internal: 
$$x = 1 \mid Wy1$$
 External:  $x = 1 \mid Wy1$ 

{true} r:=x; if  $(r=1)\{y:=1\}$  else  $\{y:=1\}$  { $y=1\}$  holds. In the semantics, this is mirrored by the execution:

$$\boxed{ Wy1 } \boxed{ Rxv }$$

 In contrast, in a semantics that forbids load buffering, where the best that one can prove is  $\{x = v\}$  r:=x; if  $\{x = 1\}$  else  $\{y:=1\}$   $\{y = 1\}$ .

Let C = r := x; if  $(r = 1)\{y := 1\}$ ; s := u; if  $(s = 1)\{v := 1\}$ . Then:  $\{x = 1\}$  C  $\{y = 1\}$  and  $\{u = 1\}$  C  $\{v = 1\}$  hold. In the semantics, this is mirrored by the execution:



Consider a fragment from example (8).

$$D: r:=y$$
; if  $(r \neq 42)\{a:=1; s:=42\}; x:=42$ 

Then, {true} D {x = 42} and { $y \ne 42$ } D {a = 1}. Thus, the traditional sequential semantics *does not* attribute any cause for the write of 42 to x. In the semantics, this is mirrored by the execution:



#### 7 INVARIANT REASONING IN TEMPORAL LOGIC

In this section, we develop sufficient logical infrastructure to prove that our semantics disallows thin air executions. In particular, we provide a proof that thin air execution (7) from §2 is impossible.

We adapt past linear temporal logic (PLTL) [Lichtenstein et al. 1985] to pomsets by dropping the previous instant operator and adopting strict versions of the temporal operators. The atoms of our logic are write and read events. Given a pomset P and event e, define:

$$P, e \models Wxv, \text{ if } \lambda(e) = (\text{true}, Wxv)$$
  
 $P, e \models Rxv, \text{ if } \lambda(e) = (\text{true}, Rxv)$   
 $P, e \models \phi \land \psi, \text{ if } P, e \models \phi \text{ and } P, e \models \psi$   
 $P, e \models \text{true}$   
 $P, e \models \neg \phi, \text{ if } P, e \not\models \phi$   
 $P, e \models \Box^{-1}\phi, \text{ if } (\forall d < e) P, d \models \phi$ 

**Definition 7.1.** Define  $P \models \phi$  if  $(\forall e \in E)$   $P, e \models \phi$  and  $P \models \phi$  if  $(\forall P \in P)$   $P \models \phi$ .

Thus,  $P \models \phi \land \Box^{-1}\phi$  whenever  $P \models \phi$ ; a fact that relies crucially on the universal quantification over all events of a pomset in the above definition.

Define the other connectives standardly following DeMorgan laws.

$$\begin{array}{lll} \diamondsuit^{-1}\phi & = & \neg(\Box^{-1}\neg\phi) \\ \text{false} & = & \neg(\text{true}) \\ \phi \lor \psi & = & \neg(\neg(\phi) \land \neg(\psi)) \\ \phi \Rightarrow \psi & = & \neg(\phi) \lor \psi \end{array}$$

The past operators do not include the current instant, and thus they do *not* satisfy the rule  $\Box^{-1}\phi \Rightarrow \Diamond^{-1}\phi$ . because the roots — the minimal elements of a pomset — always validate  $\Box^{-1}\phi$  and always invalidate  $\Diamond^{-1}\phi$ . However, they do satisfy:

LEMMA 7.2. Given an pomset P.

$$P \models (\phi \Rightarrow \diamondsuit^{-1}\phi) \Rightarrow \neg \phi$$
 (Coinduction)  
$$P \models (\Box^{-1}\phi \Rightarrow \phi) \Rightarrow \phi$$
 (Induction)

PROOF. We prove that any node in a pomset satisfies these formulas. The proof for both rules proceeds by induction on the length of the maximal path from a root to a node.

We now present two proof rules for programs.

 *Proof rules for programs.* The first rule captures the semantics of local variables. Define  $\operatorname{closed}(x) = (\operatorname{R} x \, v \Rightarrow \diamondsuit^{-1} \operatorname{W} x \, v)$ . Although this definition does not mention intervening writes, it is sufficient for our example. It is straightforward to establish that following rule is sound:

$$\frac{\phi \text{ is independent of } x \qquad P \models \mathsf{closed}(x) \Rightarrow \phi}{vx \cdot P \models \phi} \tag{Closing } x)$$

The second rule describes composition, in the style of Abadi and Lamport [1993]. To simplify the presentation, we consider the special case with a single invariant. In order to state the theorem, we generalize the satisfaction relation to include environment assumptions. Let  $\mathsf{Models}(\phi) = \{\mathcal{P} \mid \mathcal{P} \models \phi\}$  be the set of pomsets that satisfy  $\phi$ . We say that  $\phi$  is prefix closed if  $\mathsf{Models}(\phi)$  is  $\mathsf{prefix}\text{-closed}^7$ . Define  $\phi$ ,  $\mathcal{P} \models \psi$  if  $\mathsf{Models}(\phi) \parallel \mathcal{P} \models \psi$ .

**Proposition 7.3.** Let  $\phi$  be prefix-closed. Let  $\mathcal{P}_1, \mathcal{P}_2$  be augmentation-closed. Then:

$$\frac{\phi,\mathcal{P}_1 \models \phi \qquad \phi,\mathcal{P}_2 \models \phi}{\mathcal{P}_1 \parallel \mathcal{P}_2 \models \phi} \tag{Composition}$$

PROOF SKETCH. We will show that all prefixes in the prefix closures of  $\mathcal{P}_1 \parallel \mathcal{P}_2$  satisfy the required property. Proof proceeds by induction on prefixes of  $P \in \mathcal{P}_1 \parallel \mathcal{P}_2$ .

The case for empty prefix follows from assumption that  $\phi$  is prefix closed.

For the inductive case, consider  $P \in P_1 \parallel P_2$  where  $P_i \in \mathcal{P}_i$ . Since  $\mathcal{P}_1$  and  $\mathcal{P}_2$  are augmentation closed, we can assume that the restriction of P to the events of  $P_i$  coincides with  $P_i$ , for i = 1, 2. Consider a prefix P' derived by removing a maximal element e from P. Suppose e comes from  $P_1$  (the other case is symmetric). Since  $P_2$  is a prefix of P' and  $P' \models \phi$  by induction hypothesis, we deduce that  $P_2 \models \phi$ . Since  $P_1 \in \mathcal{P}_1$ , by assumption  $\phi$ ,  $\mathcal{P}_1 \models \phi$  we deduce that  $P \models \phi$ .

A more general principle, in the spirit of Abadi and Lamport [1993] can be proved. We chose the simple case of temporal invariants to illustrate the idea in a simple form. Even this simple version has interesting consequences.

Example 7.4. We now turn the conditional TAR-pit program, given in (7):

$$var x := 0$$
;  $var y := 0$ ;  $var z := 0$ ;  $(y := x \parallel if(z)\{x := 1\} \text{ else } \{x := y; a := y\} \parallel z := 1)$  (9)

We prove the formula  $\neg \diamondsuit^{-1}(W \ a \ 1)$  holds for the models of this program in our semantics. We start with the following invariant, which holds for each of the three threads, and thus, by composition, for the aggregate program:

$$[\diamondsuit^{-1}(\mathsf{W}\,y\,1) \Rightarrow \diamondsuit^{-1}(\mathsf{R}\,x\,1)] \land [\diamondsuit^{-1}(\mathsf{W}\,a\,1) \Rightarrow (\diamondsuit^{-1}(\mathsf{R}\,y\,1) \land \Box^{-1}(\mathsf{W}\,x\,1 \Rightarrow \diamondsuit^{-1}(\mathsf{R}\,y\,1)))]$$

Closing y, we have,  $\diamondsuit^{-1}(Ry1) \Rightarrow \diamondsuit^{-1}(Wy1)$  which we substitute into the left conjunct to get:

$$\diamondsuit^{-1}(R y 1) \Rightarrow \diamondsuit^{-1}(R x 1)$$

which in turn we substitute into the right conjunct to get:

$$\diamondsuit^{-1}(\mathsf{W}\,a\,1) \Rightarrow (\diamondsuit^{-1}(\mathsf{R}\,x\,1) \wedge \Box^{-1}(\mathsf{W}\,x\,1 \Rightarrow \diamondsuit^{-1}(\mathsf{R}\,x\,1)))$$

 $<sup>^{7}</sup>P'$  is a prefix of P if  $E'\subseteq E,\,e\in E'$  and  $d\le e$  imply  $d\in E'$ , and  $(\lambda',\,\leq')$  coincide with  $(\lambda,\,\leq)$  for elements of E'.

Closing x, we can replace  $\diamondsuit^{-1}(R x 1)$  with  $\diamondsuit^{-1}(W x 1)$ :

$$\diamondsuit^{-1}(W a 1) \Rightarrow (\diamondsuit^{-1}(W x 1) \wedge \Box^{-1}(W x 1) \Rightarrow \diamondsuit^{-1}(W x 1)))$$

Applying coinduction to the right conjunct, we have:

$$\diamondsuit^{-1}(W \ a \ 1) \Rightarrow (\diamondsuit^{-1}(W \ x \ 1) \land \Box^{-1}(\neg W \ x \ 1))$$

Simplifying, we have, as required:

$$\diamondsuit^{-1}(W a 1) \Rightarrow false$$

Example 7.5 (Jeffrey and Riely [2016]; Svendsen et al. [2018]). We show that the following program never writes 2 to z.

$$var x := 0$$
;  $var y := 0$ ;  $var z := 0$ ;  $(y := x + 1; z := y \parallel x := y)$ 

We start with the following invariant, which holds for each of the two threads, and thus, by composition, for the aggregate program:

$$\begin{split} &[(\diamondsuit^{-1}(\mathsf{W}\,y\,v_1) \wedge \diamondsuit^{-1}(\mathsf{W}\,y\,v_2)) \Rightarrow ((v_1 == v_2) \vee (v_1 == 0) \vee (v_2 == 0))] \wedge \\ &[\mathsf{W}\,x\,1 \Rightarrow (\diamondsuit^{-1}(\mathsf{R}\,y\,1))] \wedge \\ &[\mathsf{W}\,y\,2 \Rightarrow (\diamondsuit^{-1}(\mathsf{R}\,x\,1))] \wedge \\ &[\mathsf{W}\,z\,2 \Rightarrow (\diamondsuit^{-1}(\mathsf{R}\,y\,2))] \end{split}$$

Closing both x, y and substituting, we deduce:

$$Wz2 \Rightarrow (\diamondsuit^{-1}(Wy2) \land \diamondsuit^{-1}(Wy2))$$

Using the first conjunct, we deduce the required conclusion  $Wz = 2 \Rightarrow false$ .

#### 8 COMPARISON TO RELATED WORK

A memory consistency model for a shared-memory multiprocessor defines the values that a read may return. There has been extensive research on hardware and software memory models. Podkopaev et al. [2019] provided formal foundations to bridge the gap between the two. Alglave [2010] provided a historical and conceptual perspective on hardware memory models.

This paper focusses on software memory models. For a survey and history of the Java Memory Model, see [Lochbihler 2013]; for a survey and history of C11, see [Batty 2015].

This paper follows a line of work on relaxed memory models in the framework of true concurrency [Cenciarelli et al. 2007; Jeffrey and Riely 2016; Pichon-Pharabod and Sewell 2016].

We have already discussed related work in context earlier in the paper. Here, we discuss in more detail the relationship with the most closely related work.

#### 8.1 Memory models via program transformations

We discuss prior approaches to defining memory models using program transformations in chronological order. The general flavor in these approaches is to consider SC executions, albeit with the threads subject to sequential program transformation.

Saraswat et al. [2007] aimed to describe a JMM-like model this way. DRF-sc holds, but Sevcik [2011] discovered that it permits OOTA behavior.

Ferreira et al. [2010] described a relaxed memory model as being parametrized by available program transformations. They showed DRF-SC, but do not study OOTA, relate to concrete models, or demonstrate compilation results.

Demange et al. [2013] developed BMM, a model whose only permitted reordering is of a relaxed write with a following relaxed read. The paper develops an axiomatic and studies effects on

compilation. By design, it is designed as a restriction of the JMM that invalidates several compiler optimizations.

Lahav and Vafeiadis [2016] characterized TSO as being derived by considering Write-Read (WR) reordering and Read-After-Write (RAW) elimination. They also showed that the release acquires of C11 are less expressive than considering WR, RAW and thread-inlining. Our paper is inspired by their implicit challenge: "Some memory models can be defined via transformations. But there is more to weak memory than transformations."

#### 8.2 Modeling the invalidation of load buffering

In this subsection, we argue that our paper is relevant even to the reader who desires a semantics that invalidates "load buffering," or "reads from the future." There are two related lines of research that suggest to prohibit the reordering of loads with following stores.

• Dolan et al. [2018] imposed this restriction in order to "bound data races in space and time." Consider the following program:



In their model, one can deduce that w must be assigned 1, despite the past race on y, concurrent race on z and future race on x.

• In the C11 family of models, eg. see [Boehm and Demsky 2014; Lahav et al. 2017; Vafeiadis and Narayan 2013], load buffering is forbidden to avoid OOTA. The elegant discussion in Boehm [2018] elucidates the issues via a series of "OOTA" like examples. In particular, to rule out example 8.

Our model can easily capture models that forbid load buffering. It necessitates two changes.

• Modify the semantic rule for read to remove the possibility of an internal read:

$$\llbracket r := [L]; C \rrbracket = \bigcup_{x = \lceil w \rceil} (L = w) \triangleright \bigcup_{v} (Rxv) \Rightarrow \llbracket C \rrbracket [x/r]$$

• Modify item 6b in the definition of prefixing to require order from a read to subsequent write. Item 6b' becomes: if e is a write then c <' e. In 6b' we removed the "or or  $\lambda'_{\Phi}(e)$  implies  $\lambda_{\Phi}(e)$ ."

While Dolan et al. [2018] already describes DRF and the compilation to ARMV8, our approach yields the following new insights.

- Compositional treatment of temporal invariants from §7 holds mutatis mutandis, since all executions of the restricted model are already accounted for in the general model.
- With regards to single-threaded optimizations in §6, our approach provides different methods to prove optimizations, which complement the extant methods of Dolan et al. [2018] with data sensitivity; for example, our methods provides a simple proof of the transformation of if  $(M)\{x:=1; C\}$  else  $\{x:=1; D\}$  to x:=1; if  $(M)\{C\}$  else  $\{D\}$ .

#### 8.3 Revisiting OOTA

Batty et al. [2015] describe the problem of thin-air executions. Boehm [2018] revisits community discussions on "OOTA" like examples. We classify the models on the basis of their behavior on three examples: (1), (7) and (8).

- RC11[Lahav et al. 2017], JMM [Manson et al. 2005] and the related models of Jagadeesan et al. [2010] and Kang et al. [2017] forbid only (1).
- The model of this paper forbids (1) and (7).
- Forbidding load buffering as per [Boehm 2018; Dolan et al. 2018] forbids all three examples. RC11[Lahav et al. 2017] also forbids all three examples.

One of our novel contributions to this line of research is the desideratum of compositional reasoning of safety properties. The presence of such compositional reasoning suffices to remove the undesirable effects of OOTA, eg. type safety. And while OOTA in a model is an elusive creature, and hence hard to prove or disprove, the support of a model for a compositional proof principle is objectively provable or falsifiable.

#### 8.4 Relationship to models of speculation

Disselkoen et al. [2019] is a close formal cousin to our approach. In the spectrum from micro architectures to architectures to software, whereas Disselkoen et al. [2019] lies between the micro and architectural levels, we aim to be as abstract as possible to bridge the gap to programming languages. This is reflected in the several differences in the formal model.

We consider a single global acyclic order in contrast to the two studied in Disselkoen et al. [2019]. We do not track false branches of conditionals, as seen in our monotonicity axiom, whereby any event with a precondition false can only have false successors in <, whereas in that paper, speculation on false branches can lead to observable effects. We are also less accurate about tracking reads, as reflected in the removal of program order between reads and reads.

Disselkoen et al. [2019] does not study the properties as a memory model. We conjecture that our proofs of DRF and compositional reasoning can be carried over to that setting. Disselkoen et al. [2019] also preserves some po between reads, whereas we do not; so, we certainly support more compiler optimizations of single threaded code.

#### 9 CONCLUSIONS

We have defined a relaxed memory model, using standard tools from the semantics and demonstrated that it satisfies the fundamental criterion for software memory models, namely "the memory model must strike a balance between ease-of-use for programmers and implementation flexibility for system designers."

One key lesson we draw is that any software relaxed memory model should support the cyclical composition proof rule of Abadi and Lamport [1993] for temporal properties. We posit that all the benefits of OOTA, such as typing guarantees, can be derived from this principle.

We conclude with a suggestion for the consideration of programming language designers. The semantics of a programming language should support a clear and executable prescription of dependency in a program. Such a specification provides a nice interface between the language user and the compiler writer, and a clear bridge from the compiler writer to architectures.

Our current paper is to be viewed as a step towards this goal. We will explore developing an executable specification based on these ideas in future work.

#### **REFERENCES**

Martín Abadi and Leslie Lamport. 1993. Composing Specifications. ACM Trans. Program. Lang. Syst. 15, 1 (Jan. 1993), 73–132. https://doi.org/10.1145/151646.151649

Sarita V. Adve and Mark D. Hill. 1990. Weak Ordering - A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, Seattle, WA, USA, June 1990. ACM, 2–14. https://doi.org/10.1145/325164.325100

Sarita V. Adve and Mark D. Hill. 1993. A Unified Formalization of Four Shared-Memory Models. *IEEE Trans. Parallel Distrib.* Syst. 4, 6 (1993), 613–624. https://doi.org/10.1109/71.242161

J. Alglave. 2010. A shared memory poetics. PhD thesis. Université Paris 7 and INRIA.

Jade Alglave, Anthony C. J. Fox, Samin Ishtiaq, Magnus O. Myreen, Susmit Sarkar, Peter Sewell, and Francesco Zappa
Nardelli. 2009. The semantics of power and ARM multiprocessor machine code. In *Proceedings of the POPL 2009 Workshop on Declarative Aspects of Multicore Programming, DAMP 2009, Savannah, GA, USA, January 20, 2009.* 13–24. https://doi.org/10.1145/1481839.1481842

- Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory. ACM Trans. Program. Lang. Syst. 36, 2, Article 7 (July 2014), 74 pages. https://doi.org/10.1145/2627752
- Mark Batty, Kayvan Memarian, Kyndylan Nienhuis, Jean Pichon-Pharabod, and Peter Sewell. 2015. The Problem of Programming Language Concurrency Semantics. In *Proc. European Symp. on Programming*. 283–307.
- Mark John Batty. 2015. *The C11 and C++11 concurrency model*. Ph.D. Dissertation. University of Cambridge, UK. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.708458
- Hans Boehm. 2018. 1217R0: Out-of-thin-air, revisited, again. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1217r0.html.
- Hans-J. Boehm and Brian Demsky. 2014. Outlawing Ghosts: Avoiding Out-of-thin-air Results. In Proceedings of the
   Workshop on Memory Systems Performance and Correctness (MSPC '14). ACM, New York, NY, USA, Article 7, 6 pages.
   https://doi.org/10.1145/2618128.2618134
- S. D. Brookes, C. A. R. Hoare, and A. W. Roscoe. 1984. A Theory of Communicating Sequential Processes. *J. ACM* 31, 3 (June 1984), 560–599. https://doi.org/10.1145/828.833
- Pietro Cenciarelli, Alexander Knapp, and Eleonora Sibilio. 2007. The Java Memory Model: Operationally, Denotationally,
  Axiomatically. In Programming Languages and Systems, 16th European Symposium on Programming, ESOP 2007, Held as
  Part of the Joint European Conferences on Theory and Practics of Software, ETAPS 2007, Braga, Portugal, March 24 April 1,
  2007, Proceedings. 331–346. https://doi.org/10.1007/978-3-540-71316-6\_23
- Delphine Demange, Vincent Laporte, Lei Zhao, Suresh Jagannathan, David Pichardie, and Jan Vitek. 2013. Plan B: a buffered memory model for Java. In *The 40th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages*, POPL '13, Rome, Italy January 23 25, 2013. 329–342. https://doi.org/10.1145/2429069.2429110
  - C. Disselkoen, R. Jagadeesan, A. Jeffrey, and J. Riely. 2019. The Code That Never Ran: Modeling Attacks on Speculative Evaluation. In 2019 2019 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, Los Alamitos, CA, USA, 930–947. https://doi.org/10.1109/SP.2019.00047
- Stephen Dolan, KC Sivaramakrishnan, and Anil Madhavapeddy. 2018. Bounding Data Races in Space and Time. In *Proceedings*of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018). ACM, New
  York, NY, USA, 242–255. https://doi.org/10.1145/3192366.3192421
  - Rodrigo Ferreira, Xinyu Feng, and Zhong Shao. 2010. Parameterized Memory Models and Concurrent Separation Logic. In Programming Languages and Systems, 19th European Symposium on Programming, ESOP 2010, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2010, Paphos, Cyprus, March 20-28, 2010. Proceedings. 267–286. https://doi.org/10.1007/978-3-642-11957-6\_15
- Jay L. Gischer. 1988. The equational theory of pomsets. *Theoretical Computer Science* 61, 2 (1988), 199–224. https://doi.org/10.1016/0304-3975(88)90124-7
- 1256 C. A. R. Hoare. 1969. An Axiomatic Basis for Computer Programming. Commun. ACM 12, 10 (Oct. 1969), 576–580.
   1257 https://doi.org/10.1145/363235.363259
- Michael Huth, Radha Jagadeesan, and David A. Schmidt. 2001. Modal Transition Systems: A Foundation for Three-Valued
  Program Analysis. In Programming Languages and Systems, 10th European Symposium on Programming, ESOP 2001 Held
  as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2001 Genova, Italy, April 2-6, 2001,
  Proceedings. 155–169. https://doi.org/10.1007/3-540-45309-1\_11 Full version in Mathematical Structures in Computer
  Science 2004(14), 469âĂŤ505.
- Radha Jagadeesan, Corin Pitcher, and James Riely. 2010. Generative Operational Semantics for Relaxed Memory Models. In
   Programming Languages and Systems, 19th European Symposium on Programming, ESOP 2010, Held as Part of the Joint
   European Conferences on Theory and Practice of Software, ETAPS 2010, Paphos, Cyprus, March 20-28, 2010. Proceedings
   (Lecture Notes in Computer Science), Andrew D. Gordon (Ed.), Vol. 6012. Springer, 307–326. https://doi.org/10.1007/978-3-642-11957-6\_17
- A. Jeffrey and J. Riely. 2016. On Thin Air Reads Towards an Event Structures Model of Relaxed Memory. In *Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science, LICS '16, New York, NY, USA, July 5-8, 2016, M.* Grohe, E. Koskinen, and N. Shankar (Eds.). ACM, 759–767. https://doi.org/10.1145/2933575.2934536
  - J. Kang, C.-K. Hur, O. Lahav, V. Vafeiadis, and D. Dreyer. 2017. A promising semantics for relaxed-memory concurrency. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL 2017, Paris, France, January 18-20, 2017, Giuseppe Castagna and Andrew D. Gordon (Eds.). ACM, 175–189. https://doi.org/10.1145/3009837
- Ori Lahav and Viktor Vafeiadis. 2016. Explaining Relaxed Memory Models with Program Transformations. In FM 2016:
  Formal Methods 21st International Symposium, Limassol, Cyprus, November 9-11, 2016, Proceedings. 479–495. https://doi.org/10.1007/978-3-319-48989-6\_29

1247

1248

1251

1252

1253

1269

1270

- Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. 2017. Repairing sequential consistency in C/C++11. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017, Albert Cohen and Martin T. Vechev (Eds.). ACM, 618-632. https://doi.org/10.1145/3062341.3062352
- L. Lamport. 1979. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. *IEEE Trans. Comput.* 28, 9 (Sept. 1979), 690–691. https://doi.org/10.1109/TC.1979.1675439
- Leslie Lamport. 1986. On Interprocess Communication. Part I: Basic Formalism. Distributed Computing 1, 2 (1986), 77–85.
   https://doi.org/10.1007/BF01786227
- Kim Guldstrand Larsen and Bent Thomsen. 1988. A Modal Process Logic. In Proceedings of the Third Annual Symposium on Logic in Computer Science (LICS '88), Edinburgh, Scotland, UK, July 5-8, 1988. IEEE Computer Society, 203–210. https://doi.org/10.1109/LICS.1988.5119
- Orna Lichtenstein, Amir Pnueli, and Lenore D. Zuck. 1985. The Glory of the Past. In *Proceedings of the Conference on Logic of Programs*. Springer-Verlag, London, UK, UK, 196–218. http://dl.acm.org/citation.cfm?id=648065.747612
- 1286 A. Lochbihler. 2013. Making the Java memory model safe. ACM Trans. Program. Lang. Syst. 35, 4 (2013), 12:1–12:65. https://doi.org/10.1145/2518191
- Jeremy Manson, William Pugh, and Sarita V. Adve. 2005. The Java Memory Model. SIGPLAN Not. 40, 1 (Jan. 2005), 378–391. https://doi.org/10.1145/1047659.1040336
- Robin Milner. 1999. Communicating and Mobile Systems: The  $\pi$ -calculus. Cambridge University Press, New York, NY, USA.
- 1290 J. Misra and K. M. Chandy. 1981. Proofs of Networks of Processes. IEEE Trans. Softw. Eng. 7, 4 (July 1981), 417-426.
- Jean Pichon-Pharabod and Peter Sewell. 2016. A Concurrency Semantics for Relaxed Atomics That Permits Optimisation and Avoids Thin-air Executions. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '16). ACM, New York, NY, USA, 622–633. https://doi.org/10.1145/2837614.2837616
- Gordon Plotkin and Vaughan Pratt. 1997. Teams Can See Pomsets (Preliminary Version). In *Proceedings of the DIMACS Workshop on Partial Order Methods in Verification (POMIV '96)*. AMS Press, Inc., New York, NY, USA, 117–128. http://dl.acm.org/citation.cfm?id=266557.266600
- Amir Pnueli. 1985. In Transition From Global to Modular Temporal Reasoning about Programs. In *Logics and Models of Concurrent Systems*, Krzysztof R. Apt (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 123–144.
- Anton Podkopaev, Ori Lahav, and Viktor Vafeiadis. 2019. Bridging the gap between programming languages and hardware weak memory models. *PACMPL* 3, POPL (2019), 69:1–69:31. https://dl.acm.org/citation.cfm?id=3290382
- 1299 W. Pugh. 2004. Causality Test Cases. http://www.cs.umd.edu/~pugh/java/memoryModel/CausalityTestCases.html.
- Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. 2018. Simplifying ARM concurrency: multicopy-atomic axiomatic and operational models for ARMv8. *PACMPL* 2, POPL (2018), 19:1–19:29. https://doi.org/10. 1145/3158107
- Vijay A. Saraswat, Radha Jagadeesan, Maged Michael, and Christoph von Praun. 2007. A Theory of Memory Models. In

  Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '07). ACM,

  New York, NY, USA, 161–172. https://doi.org/10.1145/1229428.1229469
- Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and Derek Williams. 2011. Understanding POWER multiprocessors.
   In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2011,
   San Jose, CA, USA, June 4-8, 2011. 175–186. https://doi.org/10.1145/1993498.1993520
- Sevcik. 2011. OOTA in the PPoPP memory model. Personal Communication.

- Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. 2010. x86-TSO: a rigorous and usable programmer's model for x86 multiprocessors. *Commun. ACM* 53, 7 (2010), 89–97. https://doi.org/10.1145/ 1785414.1785443
- Eugene W. Stark. 1985. A proof technique for rely/guarantee properties. In Foundations of Software Technology and Theoretical
   Computer Science, S. N. Maheshwari (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 369–391.
- Kasper Svendsen, Jean Pichon-Pharabod, Marko Doko, Ori Lahav, and Viktor Vafeiadis. 2018. A Separation Logic for a Promising Semantics. In Programming Languages and Systems 27th European Symposium on Programming, ESOP 2018, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2018, Thessaloniki, Greece, April 14-20, 2018, Proceedings (Lecture Notes in Computer Science), Vol. 10801. Springer, 357–384. https://doi.org/10.1007/978-3-319-89884-1\_13
- Viktor Vafeiadis and Chinmay Narayan. 2013. Relaxed separation logic: a program logic for C11 concurrency. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA 2013, part of SPLASH 2013, Indianapolis, IN, USA, October 26-31, 2013. 867-884. https://doi.org/10.1145/2509136.
   2509532
- Jaroslav Ševčík. 2008. Program Transformations in Weak Memory Models. PhD thesis. Laboratory for Foundations of
   Computer Science, University of Edinburgh.

#### A MODAL POMSETS

1324 1325

1326

1327

1328 1329

1331

1333

1337

1339 1340

1341

1342 1343

1345

1346

1347

1348

1349

1350

1351

1352

1353

1354

1355

1356 1357

1358

1359 1360

1361

1362

1363

1364

1365

1366

1367

1368

1369

1370

1371 1372 In order to perform a sharper analysis of dependency, we present an alternate semantics using modal pomsets defined below. Modal pomsets make a formal distinction between strong order and weak order.

**Definition A.1.** A modal (memory model) pomset is a tuple  $(E, \leq, \leq, \lambda)$ , such that

- $(E, \leq, \lambda)$  is a (memory model) pomset, and
- $\trianglelefteq \subseteq \subseteq$  is a partial order.

We write  $d \triangleleft e$  when  $d \unlhd e$  and  $d \neq e$ , and similarly for  $\leq$ . Thus,  $(\subseteq \cup eco)^* \subseteq \leq$ .

We list out a few observations to illustrate the relationship between modal pomsets and pomsets. We are given a modal pomset,  $(E, \leq, \leq, \lambda)$ . Then:

- $(E, \leq, \lambda)$  is a pomset with the same reads-from relation.
- Let eco be the restriction of  $\leq$  to conflicting actions on the same location. Then,  $(E, \leq, (\leq \cup eco)^*, \lambda)$  is a modal pomset, and  $(\leq \cup eco)^* \subseteq \leq$ .

*Changes to definitions.* The definition of the semantics of programs using modal pomset largely follows the one using pomsets. We sketch the changes to definitions below.

- We say that d fulfills e on x if d writes v to x, e reads v from x,
  - $-d \triangleleft e$ , and
  - if an event *c* writes to *x* then either  $c \le d$  or  $e \le c$ .
- Augmentation has to include  $\triangleleft$ . i.e P' is an *augmentation* of P if E' = E,  $\lambda' = \lambda$ ,  $\leq' \supseteq \leq$ , and  $\leq' \supseteq \leq$ .
- The definitions of substitution, restriction and the filtering operations stay the same, with ≤ carried over unchanged. For example, substitution is defined as follows:
  - Let  $\mathcal{P}\sigma$  be the set  $\mathcal{P}'$  where  $P' \in \mathcal{P}'$  whenever there is  $P \in \mathcal{P}$  such that:  $E' = E, \leq' = \leq$ ,  $\leq' = \leq$ , and  $\lambda'(e) = (\psi \sigma \mid a)$  when  $\lambda(e) = (\psi \mid a)$ .
- In composition, we require  $\trianglelefteq' \supseteq \trianglelefteq^1 \cup \trianglelefteq^2$
- The changes to the definition 3.4 of prefixing are as follows. The key changes are that synchronization and dependency enforce *¬* whereas coherence only enforces *¬*.
  - ⊴′ ⊇ ⊴.
  - Item 6b changes to: if *e* is a write then either  $c \triangleleft' e$  or  $\lambda'_{\Phi}(e)$  implies  $\lambda_{\Phi}(e)$ .
  - Item 8 changes to: if *a* is an acquire or  $\lambda_{\mathcal{A}}(e)$  is a release then  $c \triangleleft' e$ .

We use  $[\![C]\!]_M$  to stand for the modal pomset semantics of C.

#### A.1 Generators.

Modal pomsets provide a characterization of generators from section 4.

Recall that *generators* in the pomset semantics are pomsets that are minimal with respect to augmentation and implication. These generators are induced by pomsets that are minimal with respect to augmentation and implication in the modal pomset semantics in the following sense.

 $(E, \leq, \lambda)$  is a generator for  $[\![C]\!]$  if there exists  $(E, \lhd, \leq, \lambda) \in [\![C]\!]_M$  minimal wrt augmentation and implication, and  $\leq = (\trianglelefteq \cup eco)^*$ .

Furthermore, any strong order that is outside of program order must be induced by a reads-from. In the two-thread case, we can state the latter property as follows: suppose e and d are not related by program order and  $e \lhd d$ ; then there exist d' that reads-from e' such that  $e \xrightarrow{po} e'$ ,  $d' \xrightarrow{po} d$  and  $e \lhd e' \lhd d' \lhd d$ .

#### A.2 Closure properties

 The fine grain analysis of dependency in the modal semantics allows us to establish some closure properties of the semantics of programs.

We consider programs of the form  $\vec{x} := \vec{0}$ ; fence; C, where C is restriction-free. Thus, all memory locations are initialized to 0, initialization happens-before the execution of any command,

We say that  $P' = P|_{E'}$  when  $E' \subseteq E$ ,  $\lambda' = \lambda|_{E'}$ , and  $\leq' = \leq|_{E'}$ .

**Definition A.2.** Let (*P* after e) = { $d \in E \mid e \le d$ } be the set of events that follow e in *P*.

The semantics of read is "input"-enabled, since it permits the read of any visible value. Thus, any racy read in a program can be replaced by a read of a earlier value (wrt eco), even while the races with existing independent writes are maintained. A canonical example to keep in mind for this lemma is the program:

$$(y:=0; r:=y; x:=1) \parallel (x:=0; s:=x; y:=1)$$

with both registers getting value 1 via the execution:



The lemma constructs the execution:



LEMMA A.3. Let  $P \in [\![C]\!]_M$  be a top level pomset. Let  $e \in P$  read from write event d on x,  $\neg(d \xrightarrow{hb} e)$ . Then, there exists  $P' \in [\![C]\!]_M$  such that:

- e' reads from x, with matching write event d', such that d'  $\xrightarrow{\text{eco}}$  d in P'
- The restriction of  $\leq$  in P to  $E_P \setminus (P \text{ after } e)$  agrees with the the restriction of  $\leq$  in P' to  $E_{P'} \setminus (P \text{ after } e)$  in P'.
- The restriction of  $\leq$  in P to  $E_P \setminus (P \text{ after } e)$  agrees with the restriction of  $\leq$  in P' to  $E_{P'} \setminus (P \text{ after } e)$  in P'.

PROOF. The form of C ensures that there is always a write to x that is related by  $\xrightarrow{hb}$  to any read. Thus, there is at least one other write than can satisfy the read recorded as e.

The key observation behind the proof is that change in a prefixing read action can only affect the events that are dependent, ie. in the  $\triangleleft$  order to the read action.

In the following lemma, invert the **eco** relationship between a read and a write. A canonical example to keep in mind for this lemma is the program:

$$(y:=0; x:=1; r:=y) \parallel (x:=0; y:=1; s:=x)$$

with both registers getting value 0 via the execution:



The lemma constructs the execution:



LEMMA A.4. Let  $P \in [\![C]\!]_M$  be a top-level pomset. Let  $d \in P$  be a write on x. Let  $e \in P$  read from x such that  $e \xrightarrow{eco} d$  and  $\neg (e \lhd d)$ . Then, there exists  $P' \in [\![C]\!]_M$  such that:

- $e' \in P' \setminus P$  reads from x, with matching write d.
- The restriction of  $\leq$  in P to  $E_P \setminus (P \text{ after } e)$  agrees with the the restriction of  $\leq$  in P' to  $E_{P'} \setminus (P \text{ after } e)$ .

Proof. The proof proceeds similar to the above proof; in this case, replace the value read in e to come from d.

Any new event d' in P' after e' reading from x cannot have a matching write event  $d'' \xrightarrow{eco} d$  since that implies  $d' \xrightarrow{eco} d$  and a eco cycle  $d < e' \xrightarrow{eco} d$ . Thus, the above lemma can be iterated if the new pomset is has any further reads that precede d in eco, so we can finally derive a pomset with no reads and writes satisfying the hypothesis of the lemma.

The eco order between writes that are not related by < can be reversed. A canonical example to keep in mind for this lemma is the program:

$$(x := 1) \parallel (x := 0)$$

$$(x := 1) \parallel (x := 0)$$

The lemma constructs the execution:

$$(Wx1) \leftarrow (Wx0)$$

LEMMA A.5. Let  $P \in [C]_M$  be a top level pomset. Let d, e be a writes to x such that:

- d ≤ ε
- for all writes c to x such that  $d \le c \le e$ , it is the case that  $\neg(c \triangleleft e)$  and  $\neg(c \stackrel{po}{\longrightarrow} e)$

Then, there exists  $P' \in [\![C]\!]_M$  such that  $E_P = E_{P'}, \leq_P = \leq_{P'}$ , and  $e \leq d$  in P'.

PROOF. We show how to interchange e,d adjacent in  $\leq$ , ie. we assume that  $\neg(\exists c)\ d \leq c \leq e$ . The full proof follows by induction.

Since  $[\![C]\!]$  is augmentation closed, it suffices to show that we can build P' while satisfying the constraints between  $\lhd$ ,  $\leq$ . We list the changes below.

- $e \le d$  in P'
- Forall reads c matched to e, change from  $d \le c$  in P to  $c \le d$  in P'
- Forall reads *c* matched to *d*, change from  $c \le e$  in *P* to  $e \le c$  in *P'*

**B** PROOF OF DRF

In this section of the appendix, we develop a proof of DRF for modal pomsets. By the results in the earlier section, it yields DRF for the pomset semantics, since the races are identical in both models.

In the rest of this section, we assume that P is a generator for  $[\![C]\!]_M$ .

We prove:

**DRF1:** If *P* does not have a race,  $P \in [C]_{MSC}$ .

**DRF2:** If *P* has a race, then there exists  $P' \in \operatorname{closed}[\![C]\!]_{\mathbf{M}}$  such that  $P' \in [\![C]\!]_{\mathbf{MSC}}$  and has a race.

, Vol. 1, No. 1, Article . Publication date: July 2019.

- $e'_i \xrightarrow{\text{eco}} e'_{i+1}$ . In this case, there is a race.
- $e'_i < e'_{i+1}$ . In this case,  $e'_i$  is a write and  $e'_{i+1}$  is a conflicting read, so there is a race.

*Proof of DRF2.* We define a size |P| as follows: size(P) is the number of events in P. Since we are considering loop free programs, there is an  $P \in [\![C]\!]_{MSC}$  with maximum size, which we identify as size(C).

We prove by induction on size(C) - size(P') that given (P, P') such that:

- P' is a prefix of some  $P' \in [\![C]\!]_{MSC}$
- P' is a prefix of P under all of  $\xrightarrow{po}$ ,  $\leq$ ,  $\leq$
- P has a race

1471

1472

1473

1484

1490

1491 1492

1493

1494

1495

1496 1497

1498

1499

1500

1501 1502

1503

1504

1505

1506

1507

1508

1509 1510

1511

1512

1513

1514

1515

1516

1517

1518 1519 there exists  $P' \in [\![C]\!]_M$  that demonstrates the race.

The required theorem follows by setting P' to be the empty pomset.

For the base case, P' = |P|. In this case, P is the required witness.

Otherwise, consider a maximal sequential prefix, extending P', wrt to all of po, eco,  $\triangleleft$ . If it strictly contains P', result follows from induction hypothesis.

If not, P' is already maximal. Consider the set of all events in  $P \setminus P'$  that are minimal wrt hb. In particular, these events will also be minimal wrt po.

If one of these events, say e is a write, we proceed as follows. Using hb-minimality of e, we deduce po minimality of e. Using the generator properties, we deduce that e is  $\triangleleft$ -minimal . Using lemma A.4, we build  $P_1$  from P without changing P' to ensure that there are is no read  $d \in P_1 \setminus P'$  such that  $d \xrightarrow{\text{eco}} e$ . Using lemma A.5, we build  $P_2$  from  $P_1$  without changing P' to ensure that there are is no write  $d \in P_2 \setminus P'$  such that  $d \xrightarrow{\text{eco}} e$ . Thus, e is eco-minimal in  $P_2 \setminus P'$ . Result follows from induction hypothesis by considering  $(P_2, P_1')$  where  $P_1'$  is got from P' by adding e.

So, we can assume that all events in  $P \setminus P'$ , say  $e_0, \ldots, e_n$  that are minimal wrt hb are reads, and we have events  $e'_0, e'_1, \ldots, e'_n, e_0$  such that:

$$e_i \xrightarrow{po} e'_i$$

$$e'_i \text{ (eco } \cup \lhd \text{)} e_{(i+1) \mod n}$$

Let d be the matching write for  $e_{(i+1) \mod n}$ . If  $d_i \in P'$ bEv, then by  $\operatorname{eco}$  prefix closure of P',  $d \xrightarrow{\operatorname{eco}} e'_i$  and  $e_{(i+1) \mod n} \operatorname{eco} e'_i$ , which is a contradiction to  $\operatorname{eco}$  being a partial order per location. So, we can assume that  $e'_i \lhd e_{(i+1) \mod n}$ .

We proceed as follows. We use lemma A.3 on the pomset P and read  $e_{(i+1) \mod n}$  and write  $e'_i$  to construct  $P_1$  that changes the value read in  $e_j$  to a value from P'.  $P'_1$  is derived adding the modified read yielded by lemma A.3 to P'. Result follows by induction hypothesis since  $P'_1$  is a prefix of  $P_1$  under all of  $\xrightarrow{po}$ , <, eco,  $P_1$  has a race, and size( $P'_1$ ) = size(P') + 1.

#### C PROOF OF COMPILATION FOR ARMV8

In this section, we develop the proof of correctness of compilation to ARMV8. In order to ease readability, we reproduce the definitions from the main text.

Given a relation R, R? denotes reflexive closure, R+ denotes transitive closure and R\* denotes reflexive and transitive closure. Given relations R and S, R; S denotes composition.

The ARMv8 model is described using the following relations.

- [R], [W], [Acquire], [Release]: identity on reads, writes, acquires and releases.
- [X]: relates any two events that touch the same location.

• po: program order.

1521

1523

1525

1527

1528

1529

1530

1531

1533

1534

1535

1536 1537

1538 1539

1540

1541

1542 1543

1544

1545

1546

1547

1548

1549

1550

1551

1552 1553

1554 1555

1556

1557 1558

1559

1560

1561

1562

1564

1565

1566

1567 1568

- data, ctrl, addr: data, control and address dependencies.
- rf: reads-from. rf <sup>-1</sup> relates each read to a matching write on the same location.
- co: coherence, which is a total order on the writes to a single location.
- fr = co; rf  $^{-1}$ : from-read, which relates reads to subsequent writes.

For any relation, the cross-thread subrelation is denoted by appending e; the intra-thread subrelation is denoted by appending i. For example, rfe = rf \ po and rfi = rf \ po. The subrelation restriction attention to actions on the same location is given by appending loc. For example, poloc = po \(\inftig(X)\).

The ARMv8 model defines the following relations. In our presentation, we have elided rules concerning fences and RMW operations.

```
eco = rf \cup fr \cup co(Extended coherence)obs = rfe \cup fre \cup coe(Observed externally)dob = (addr \cup data); rfi^2 \cup (ctrl \cup data); [W]; coi^2 \cup addr; po; [W](Dependency order)<math>bob = [Acquire]; po \cup po; [Release]; coi^2(Barrier order)ob = (obs \cup dob \cup bob)^+(Acyclic order)
```

**Definition C.1.** An RMW-free and fence-free execution is *ARM-consistent* if

```
codomain(rf) = domain(Reads)(rf-completeness)For every location x, co totally orders the writes of x(co-totality)poloc \cup rf \cup fr \cup co is acyclic(sc-per-loc)ob is acyclic(external)
```

Given an execution graph G, we say that e is an *internal read* if  $e \in \text{codomain}(\text{po} \cap \text{rf})$ . We are going to translate internal reads of execution graphs into internal reads of the semantics.

From *G* we construct a candidate pomset *P* as follows:

- E = E,
- $\lambda_{\mathcal{A}}(e) = \tau \mathsf{lab}(e)$ , if *e* is a relaxed internal read,
- $\lambda_{\mathcal{A}}(e) = lab(e)$ , if *e* is not a relaxed internal read,
- $\lambda_{\Phi}(e) = \text{true}$ ,
- $\leq$  = ob, and
- $\leq$  =  $(ob \cup eco)^*$

To reemphasize, in this candidate pomset, ob is calculated by considering the definition of ob without rfi, ie.:

```
dob = (addr \cup data); \cup (ctrl \cup data); [W]; coi^{?} \cup addr; po; [W]  (Dependency order)

bob = [Acquire]; po \cup po; [Release]; coi^{?}  (Barrier order)

ob = (obs \cup dob \cup bob)^{+}  (Acyclic order)
```

We show that *P* is a top-level pomset, reasoning as follows.

- $\leq$  is a partial order. This holds since *G* ar is acyclic.
- $\lambda_{\Phi}(e)$  implies  $\lambda_{\Phi}(d)$  whenever  $d \leq e$ . Trivial, since every formula is true.
- *e* is location independent. Trivial, since every formula is true.
- If *e* reads *v* from *x*, then there is some *d* such that
  - -d < e,
- d writes v to x, and
- if *c* writes to *x* then either  $c \le d$  or  $e \le c$ .

1572

1573

1577

1578

1582

1590

1591

1592

1593

1594

1595

1596 1597

1598

1599

1600

1601 1602

1603

1604

1606 1607

1608 1609

1610

1611

1612

1613

1614

1615

#### C.1 Proof that $(ob \cup eco)^*$ is irreflexive.

1570 Proof of lemma 5.2. Let e, d be distinct events and d' ( $\xrightarrow{\text{po}} \cap \xrightarrow{\text{po}}$ ) d (( $\xrightarrow{\text{eco}} \cap \xrightarrow{\text{po}}$ )\  $\xrightarrow{\text{rfi}}$ ) e ( $\xrightarrow{\text{ob}}$ ) e'.

PROOF. If d' is an acquire, or e is an release, or e' is a release, result is immediate.

We next consider the case where e is a read. In this case, d is a write. Since  $d((\stackrel{\text{eco}}{\longrightarrow} \cap \stackrel{\text{po}}{\longrightarrow}) \setminus \stackrel{\text{rfi}}{\longrightarrow}) e$ , there is a write  $d_1$  such that  $d \stackrel{\text{coe}}{\longrightarrow} d_1 \stackrel{\text{rfe}}{\longrightarrow} e'$ . So,  $d \stackrel{\text{ob}}{\longrightarrow} e$  and result follows in this case.

So, it suffices to prove the following assuming that d' is not an acquire and e' is not a release and e is not a release or a read and e, d are distinct.

- If  $d'(\xrightarrow{ob} \cap \xrightarrow{po}) d(\xrightarrow{eco} \cap \xrightarrow{po})e$  then  $d'\xrightarrow{ob} e$ .
- If  $d \stackrel{\text{eco}}{(} \cap \stackrel{\text{po}}{)} e \stackrel{\text{po}}{(} \cap \stackrel{\text{po}}{)} e'$  then  $d \stackrel{\text{ob}}{\longrightarrow} e'$ .

We first prove that if  $d'(\xrightarrow{\mathrm{ob}} \cap \xrightarrow{\mathrm{po}}) d(\xrightarrow{\mathrm{eco}} \cap \xrightarrow{\mathrm{po}}) e$  then  $d'\xrightarrow{\mathrm{ob}} e$ . Proof proceeds by cases on the witness for  $d'(\xrightarrow{\mathrm{ob}} \cap \xrightarrow{\mathrm{po}}) d$ .

• If  $d' \xrightarrow{\text{bob}} d$ , then:

$$d'$$
 ([Acquire]; po  $\cup$  po; [Release]; coi<sup>?</sup>)  $d$ 

Since d' is not an acquire,  $d'(po; [Release]; coi^?)d$ , so d is a write. Since e is not a read,  $d \xrightarrow{coi} e$ . Thus, result follows.

• If  $d' \xrightarrow{\text{dob}} d$ , then:

$$d'$$
 ((ctrl  $\cup$  data); [W]; coi<sup>?</sup>  $\cup$  addr; po; [W]  $d$ 

So, d is a write. Since e is also a write, we deduce that

$$d'$$
 ((ctrl  $\cup$  data); [W]; coi<sup>?</sup>  $\cup$  addr; po; [W]  $e$ 

We next prove that if  $d \xrightarrow{\text{eco}} \cap \xrightarrow{\text{po}} e \xrightarrow{\text{ob}} \cap \xrightarrow{\text{po}} e'$  then  $d \xrightarrow{\text{ob}} e'$ , under the assumptions that e' is not a release and e is not a release or a read and e, d are distinct.

Proof proceeds by cases on the witness for  $e(\stackrel{\text{ob}}{\longrightarrow} \cap \stackrel{\text{po}}{\longrightarrow})e'$ .

• If  $e \xrightarrow{\text{bob}} e'$ , then:

$$e$$
 ([Acquire]; po  $\cup$  po; [Release]; coi<sup>?</sup>)  $e'$ 

Since e is not a read,  $e(po; [Release]; coi^?)e'$ . Result follows since  $d \xrightarrow{po} e$ .

• If  $e \xrightarrow{\text{dob}} e'$ , then e is a read.

LEMMA C.2. If  $d \xrightarrow{ob} e$  then  $\neg (e \xrightarrow{eco} d)$ .

PROOF. Proof by contradiction. Let

$$e \xrightarrow{ob} e' \xrightarrow{eco} d' \xrightarrow{ob} d \xrightarrow{ob} c \xrightarrow{ob} c' \xrightarrow{ob} e$$

where  $e' \xrightarrow{po} d'$ .

By lemma 5.2, if  $e \neq e'$ , we deduce  $e \xrightarrow{\text{ob}} d'$ , and thus  $e \xrightarrow{\text{ob}} d$ . If  $d \neq d'$ , we deduce  $e' \xrightarrow{\text{ob}} d$  and thus  $e \xrightarrow{\text{ob}} d$ .

Thus, if  $e \neq e'$  or  $d \neq d'$ , then there is a cycle  $e \xrightarrow{ob} d \xrightarrow{ob} c \xrightarrow{ob} c' \xrightarrow{ob} e$ .

So we can assume that e' = e, d' = d and

$$e \xrightarrow{\text{eco}} d \xrightarrow{\text{ob}} c \xrightarrow{\text{ob}} c' \xrightarrow{\text{eco}} e$$

where all of e, d, c, c' access the same location and at least one of e, d is a write, at least one of e, c' is a write, and at least one of d, c is a write.

We reason by cases.

• If c' is a write or both (e, d) are writes. We deduce that  $d \xrightarrow{\text{eco}} c' \xrightarrow{\text{eco}} e$  and thus  $d \xrightarrow{\text{eco}} e$ .

• c' is a read. e is a write. d is a read. In this case c is a write. From  $c \xrightarrow{\text{ob}} e$ , we deduce  $c \xrightarrow{\text{eco}} e$ . Combining with  $d \xrightarrow{\text{eco}} c$ , we deduce that  $d \xrightarrow{\text{eco}} e$ .

In either case, there is a contradiction  $e \xrightarrow{\text{eco}} d \xrightarrow{\text{eco}} e$ .

Lemma C.3.  $(ob \cup eco)^*$  is irrreflexive.

PROOF. The simple case that ob; eco is irreflexive is proved above. The full proof by contradiction. Let  $n \ge 1$  be the minimum such that:

$$\begin{array}{c} e_0^0 \stackrel{\mathrm{ob}}{\longrightarrow} e_1^0 \stackrel{\mathrm{eco}}{\longrightarrow} d_0^0 \stackrel{\mathrm{ob}}{\longrightarrow} d_1^0 \\ (\stackrel{\mathrm{eco}}{\longrightarrow} \cap \stackrel{\mathrm{ob}}{\longrightarrow}) \ e_0^1 \stackrel{\mathrm{ob}}{\longrightarrow} e_1^1 \stackrel{\mathrm{eco}}{\longrightarrow} d_0^1 \stackrel{\mathrm{ob}}{\longrightarrow} d_1^1 \\ (\stackrel{\mathrm{eco}}{\longrightarrow} \cap \stackrel{\mathrm{ob}}{\longrightarrow}) \ \dots \\ & \dots \ d_1^n \\ (\stackrel{\mathrm{eco}}{\longrightarrow} \cap \stackrel{\mathrm{ob}}{\longrightarrow}) \ e_0^0 \end{array}$$

where for all i, we have:

$$e_0^i \xrightarrow{\text{po}} e_1^i (\xrightarrow{\text{eco}} \cap \xrightarrow{\text{po}}) d_0^i \xrightarrow{\text{po}} d_1^i$$

and

$$\neg (d_1^i \xrightarrow{po} (e_0^{(i+1)} \mod n$$

For any i, if  $e_0^i \neq e_1^i$  or  $d_0^i \xrightarrow{\text{po}} d_1^i$ , via lemma 5.2, we deduce that  $e_0^i \xrightarrow{\text{ob}} d_1^i$ , contradicting minimality of n.

So, we can assume that  $n \ge 1$  is such that:

$$e^{0} \xrightarrow{\text{eco}} d^{0}$$

$$(\xrightarrow{\text{eco}} \cap \xrightarrow{\text{ob}}) e^{1} \xrightarrow{\text{eco}} d^{1}$$

$$(\xrightarrow{\text{eco}} \cap \xrightarrow{\text{ob}}) \dots$$

$$\dots d^{n}$$

$$(\xrightarrow{\text{eco}} \cap \xrightarrow{\text{ob}}) e^{0}$$

which is a contradiction since it is a cycle in  $\xrightarrow{\text{eco}}$  and since at least one of  $e^i, d^i$  is a write for all i.

#### **D** CAUSALITY TEST CASES

In this section, we discuss three of the causality test cases and the thread inlining from [Manson et al. 2005]. In presenting the examples, we unroll loops, correct typos and simplify the code.

#### D.1 Causality test case 8

Test case 8 asks whether:

$$x := 0; y := 0; (if(x < 2)\{y := 1\} \parallel x := y)$$

may read 1 for both x and y. This behavior is allowed, since "interthread analysis could determine that x and y are always either 0 or 1." This breaks the dependency between the read of x and the write to y in the first thread, allowing the write to be moved earlier.

The semantics of TC8 includes



Where we require (W x 0) < (R x 1) but not (R x 1) < (W y 1). To see why this execution exists, consider the left thread with syntax sugar removed:

$$r:=x$$
; if  $(r < 2)\{y:=1\}$ 

[if  $(r < 2)\{y := 1\}$ ] includes  $(r < 2 \mid Wy1)$ . Thus, by the definition of prefixing read, [r := x; if  $(r < 2)\{y := 1\}$ ] includes  $(Rx1) \Rightarrow (r < 2 \mid Wy1)[x/r]$  which simplifies to  $(Rx1) \Rightarrow (x < 2 \mid Wy1)$ , which, by Definition 3.4, includes:

Prefixing with (W x 0) allows us to discharge the assumption x < 2, arriving at:

$$(\mathbf{W} x 0) \quad (\mathbf{R} x 1) \quad (\mathbf{W} y 1)$$

#### D.2 Causality test case 9

Test case 9 asks whether:

$$x := 0; y := 0; (if(x < 2){y := 1} \parallel x := y \parallel y := 2;)$$

may read 1 for both x and y. This behavior is also allowed. This is "similar to test case 8, except that x is not always 0 or 1. However, a compiler might determine that the read of x by thread 1 will never see the write by thread 3 (perhaps because thread 3 will be scheduled after thread 1)"

Reasoning as for test case 8, the semantics of test case 9 includes:



Thus, with respect to the introduction of new threads, our model appears to be more robust than the event structures semantics of [Jeffrey and Riely 2016], which fails on this test case.

#### D.3 Causality test case 14

Test case 14 asks whether:

$$a := 0$$
;  $b := 0$ ;  $y := 0$ ; (if(a){ $b := 1$ } else{ $y := 1$ } || while ( $y + b == 0$ ){skip}  $a := 1$ )

may read 1 for a and b, yet 0 for y. Here a and b are regular variables and y is volatile, which is equivalent to release/acquire in this example. This behavior is also disallowed, since "in all sequentially consistent executions, [the read of a gets 0] and the program is correctly synchronized. Since the program is correctly synchronized in all SC executions, no non-SC behaviors are allowed."

Unrolling the loop once, we have:

$$a := 0$$
;  $b := 0$ ;  $y := 0$ ; (if(a){ $b := 1$ } else{ $y := 1$ }  $\|$  if( $y \lor b$ ){ $a := 1$ })

We argue that any execution with (R a 1), (R b 1), and (R y 0) must be cyclic. The closure requirements require that (W a 1) < (R a 1) and (R b 1) < (R b 1). Ignoring initialization, least ordered execution that includes all of these actions is:



where the read of a is ordering for (W b 1) but not (W y 1), and the read of b is ordering for (W a 1) but the read of y is not. (W y 1) is crossed out, since its precondition must imply  $(\neg a)[1/a]$ , which is equivalent to false. To avoid order from (R y 0) to (W a 1), we have strengthened the predicate on (W a 1) from ( $y \lor b$ ) to ( $y = 0 \land b = 1$ ). Note that we cannot use this trick symmetrically to remove the order from (R b 1) to (W a 1), since b = 1 does not follow from the initialization of b.

#### D.4 Thread inlining

 One property one could ask of a model of shared memory is thread inlining: any execution of  $[\![P;Q]\!]$  is an execution of  $[\![P|M]\!]$ . This is *not* a goal of our model, and indeed is not satisfied, due to the different semantics of concurrent and sequential memory accesses. We demonstrate this by considering an example from the Java Memory Model [Manson et al. 2005], which shows that Java does not satisfy thread inlining either.

The lack of thread inlining is related to the different dependency relations introduced by sequential and concurrent access. The program (x := 0; y := x + 1) has execution:

but  $(x := 1 \parallel y := x + 1)$  has:

That is, in the sequential case there is no dependency from the write of x to the write of y, but in the concurrent case there is such a dependency.

This can be used to construct a counter-example to thread inlining, based on [Manson et al. 2005, Ex 11]:

$$(x:=0; if(x){z:=1} else{x:=1}) || (y:=x) || (x:=y)$$

This has no execution containing (W z 1). Any attempt to build such an execution results in a cycle:



Inlining the thread (y := x) gives [Manson et al. 2005, Ex 12]:

$$(x := 0; if(x) \{z := 1\} else \{x := 1\}; y := x) || (x := y)$$

with execution:



To see why this execution exists, consider the program fragment:

$$if(x){z:=1}else{x:=1}; y:=x$$

Desugaring, we have:

$$r_1:=x$$
; if  $(r_1)\{z:=1; r_2:=x; y:=r_2\}$  else  $\{x:=1; r_3:=x; y:=r_3\}$ 

Now,  $[z := 1; r_2 := x; y := r_2]$  includes pomset:

$$\begin{array}{c|c}
r_1 = 1 \mid Wz1 \\
\hline
r_1 = x = 1 \mid Wy1
\end{array}$$

and  $[x := 1; r_3 := x; y := r_3]$  includes pomset:

$$rac{r_1 \neq 1 \mid W \times 1}{r_1 \neq 1 \mid W \times 1}$$

so  $[if(r_1 = 1)\{z := 1; r_2 := x; y := r_2\} else\{x := 1; r_3 := x; y := r_3\}]$  includes:

$$\begin{array}{c|c}
\hline
r_1 = 1 \mid Wz1
\end{array}
\quad
\begin{array}{c|c}
\hline
r_1 \neq 1 \mid Wx1
\end{array}$$

$$(r_1 = x = 1) \lor (r_1 \neq 1) \mid Wy1$$

which means  $[if(r_1 = 1) \{z := 1; r_2 := x; y := r_2\}$ else  $\{x := 1; r_3 := x; y := r_3\} ][x/r_1]$  includes:

$$x = 1 \mid Wz 1$$
  $x \neq 1 \mid Wx 1$   $x \neq 1 \mid Wx 1$ 

Now  $(x = x = 1) \lor (x \ne 1)$  is a tautology, so this is just:

$$x = 1 \mid Wz1$$
  $x \neq 1 \mid Wx1$   $Wy1$ 

and so  $[r_1:=x; if(r_1=1)\{z:=1; r_2:=x; y:=r_2\} else\{x:=1; r_3:=x; y:=r_3\}]$  includes:

which simplifies to:

$$(Rx1) \rightarrow (Wz1)$$
  $(Wy1)$ 

as required. The rest of the example is straightforward, and shows that our semantics agrees with the JMM in not supporting thread inlining.