# **Understanding The Leaky Semicolon**

Compositional Semantic Dependencies for Relaxed-Memory Concurrency

### ANONYMOUS AUTHOR(S)

Program logics and semantics tell us that when executing  $(S_1; S_2)$  starting in state  $s_0$ , we execute  $S_1$  in  $s_0$  to arrive at  $s_1$ , then execute  $S_2$  in  $s_1$  to arrive at the final state  $s_2$ . This is, of course, an abstraction. Processors execute instructions out of order, due to pipelines and caches, and compilers reorder programs even more dramatically. All of this reordering is meant to be unobservable in single-threaded code, but is observable in multi-threaded code. A formal attempt to understand the resulting mess is known as a "relaxed memory model." The relaxed memory models that have been proposed to date either fail to address sequential composition directly, or overly restrict processors and compilers.

To support sequential composition while targeting modern hardware, we propose using preconditions and families of predicate transformers. When composing  $(S_1; S_2)$ , the predicate transformers used to validate the preconditions of events in  $S_2$  are chosen based on the semantic dependencies from events in  $S_1$  to events in  $S_2$ . We apply this approach to two existing memory models: "Modular Relaxed Dependencies" for C11 and "Pomsets with Preconditions."

CCS Concepts: • Theory of computation → Parallel computing models; *Preconditions*.

Additional Key Words and Phrases: Concurrency, Relaxed Memory Models, Multi-Copy Atomicity, ARMv8, Pomsets, Preconditions, Temporal Safety Properties, Thin-Air Reads, Compiler Optimizations

### **ACM Reference Format:**

Anonymous Author(s). 2021. Understanding The Leaky Semicolon: Compositional Semantic Dependencies for Relaxed-Memory Concurrency. Proc. ACM Program. Lang. 0, OOPSLA, Article 0 (October 2021), 42 pages.

### 1 INTRODUCTION

Sequentiality is a leaky abstraction [Spolsky 2002]. For example, sequentiality tells us that when executing  $(r_1 := x; y := r_2)$ , the assignment  $r_1 := x$  is executed before  $y := r_2$ . Thus, one might reasonably expect that the final value of  $r_1$  is independent of the initial value of  $r_2$ . In most modern languages, however, this fails to hold when the program is run concurrently with (s := y; x := s), which copies y to x.

In certain cases it is possible to ban concurrent access using separation [O'Hearn 2007], or to accept inefficiency in order to obtain sequential consistency [Marino et al. 2015]. When these approaches are not available, however, we are left with an enormous gap in our understanding of one of the most basic elements of computing: the humble semicolon. Existing approaches either

- don't bother tracking dependencies, allowing "thin air" executions [Batty 2015],
- use syntactic dependencies [Alglave et al. 2014; Kavanagh and Brookes 2018], disabling compiler optimizations,
- use complex non-compositional operational models [Chakraborty and Vafeiadis 2019a; Cho et al. 2021; Jagadeesan et al. 2010; Kang et al. 2017; Lee et al. 2020; Manson et al. 2005],
- use prefixing [Jagadeesan et al. 2020] or continuation-passing [Paviotti et al. 2020] rather than a direct presentation of sequential composition.

In this paper, we show that families of predicate transformers can be used to calculate semantic dependencies in a way that is compositional and direct: compositional in that the denotation of  $(S_1; S_2)$  can be computed from the denotation of  $S_1$  and the denotation of  $S_2$ , and direct in that these can be calculated independently. Our contributions:

- We define a denotational model for calculating dependencies.
- We apply the model to the PWP semantics of [Jagadeesan et al. 2020] and to the MRD-C11 semantics of [Paviotti et al. 2020].
- We provide a tool to execute litmus tests for both models.
- We prove DRF-SC for PWP; MRD-C11 inherits the result from RC11.
- We prove a lowering result for PWP.
- We extend the model to include many features.

### 2 OVERVIEW

This paper is about the interaction of two of the fundamental building blocks of computing: sequential composition and mutable state. One would like to think that these are well-worn topics, where every issue has been settled, but this is not the case.

#### 2.1 Sequential Composition

Introductory programmers are taught sequential abstraction: that the program  $S_1; S_2$  executes  $S_1$  before  $S_2$ . Since the late 1960s, we've been able to explain this using logic [Hoare 1969]. In Dijkstra's [1975] formulation, we think of programs as predicate transformers, where predicates describe the state of memory in the system. In the calculus of weakest preconditions, programs map postconditions to preconditions. We recall the definition of  $wp_S(\psi)$  for loop-free code below (where r and s range over thread-local registers, and M and N range over side-effect-free expressions).

$$\begin{aligned}
\text{(D1)}\quad & wp_{\text{skip}}(\psi) = \psi \\
\text{(D2)}\quad & wp_{r:=M}(\psi) = \psi[M/r] \\
\text{(D3)}\quad & wp_{S_1;S_2}(\psi) = wp_{S_1}(wp_{S_2}(\psi)) \\
\text{(D4)}\quad & wp_{\text{if}(M)\{S_1\}\text{else}\{S_2\}}(\psi) = ((M \neq 0) \Rightarrow wp_{S_1}(\psi)) \land ((M=0) \Rightarrow wp_{S_2}(\psi))
\end{aligned}$$
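Rules D1 to D4 can be exercised mechanically. The following Python sketch is illustrative only: it represents predicates and expressions as functions from a register state (a dict) to a value, and the tuple-based statement encoding and the names `wp` and `holds_everywhere` are assumptions made for this example rather than part of the formal development.

```python
from itertools import product

# Statements are tuples (an encoding assumed for this sketch):
#   ("skip",), ("assign", r, M), ("seq", S1, S2), ("if", M, S1, S2)
# Expressions M and predicates psi are functions from a state dict to a value.

def wp(stmt, psi):
    """Weakest precondition of loop-free code, following D1-D4."""
    kind = stmt[0]
    if kind == "skip":                      # D1
        return psi
    if kind == "assign":                    # D2: psi[M/r]
        _, r, M = stmt
        return lambda s: psi({**s, r: M(s)})
    if kind == "seq":                       # D3
        _, s1, s2 = stmt
        return wp(s1, wp(s2, psi))
    if kind == "if":                        # D4
        _, M, s1, s2 = stmt
        p1, p2 = wp(s1, psi), wp(s2, psi)
        return lambda s: p1(s) if M(s) != 0 else p2(s)
    raise ValueError(kind)

def holds_everywhere(pred, regs, values=range(-2, 3)):
    """Brute-force validity check over a small finite domain."""
    return all(pred(dict(zip(regs, vs)))
               for vs in product(values, repeat=len(regs)))

# r := s + 1; if (r) {t := 1} else {t := 0}
prog = ("seq",
        ("assign", "r", lambda s: s["s"] + 1),
        ("if", lambda s: s["r"],
               ("assign", "t", lambda s: 1),
               ("assign", "t", lambda s: 0)))

post = lambda s: s["t"] == 1                # postcondition: t = 1
pre = wp(prog, post)                        # equivalent to s + 1 != 0
```

In this encoding, substitution $\psi[M/r]$ becomes evaluation of $\psi$ in an updated state, which is why D3 is plain function composition.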

For this language, the Hoare triple  $\{\phi\}\ S\ \{\psi\}$  holds exactly when  $\phi \Rightarrow wp_S(\psi)$ . This is an elegant explanation of sequential computation in a sequential context. Note that D2 is sound because a read from a thread-local register must be fulfilled by a preceding write in the same thread. In a concurrent context, with shared variables (x, y, z), the obvious generalization

$$\text{(D2b)}\quad wp_{x:=M}(\psi) = \psi[M/x] \qquad\qquad \text{(D2c)}\quad wp_{r:=x}(\psi) = \psi[x/r]$$

is unsound! In particular, a read from a shared memory location may be fulfilled by a write in another thread, invalidating D2c. (We assume that expressions do not include shared variables.)
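A small experiment makes the failure concrete. The sketch below is an assumption-laden illustration, not the paper's semantics: it enumerates plain interleavings of two threads over a memory that is assumed to start at 0, and the names `interleavings` and `run` are hypothetical. For the thread `x := 1; r := x`, rules D2b and D2c predict that the postcondition `r = 1` always holds, yet a concurrent `x := 2` can fulfill the read.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(trace):
    """Execute ("write", x, v) and ("read", r, x) steps; return the registers."""
    mem, regs = {"x": 0}, {}
    for op, dst, src in trace:
        if op == "write":
            mem[dst] = src
        else:
            regs[dst] = mem[src]
    return regs

thread1 = [("write", "x", 1), ("read", "r", "x")]   # x := 1; r := x
thread2 = [("write", "x", 2)]                        # x := 2

# Values of r when thread1 runs alone, and when it races with thread2.
alone = {run(t)["r"] for t in interleavings(thread1, [])}
racy = {run(t)["r"] for t in interleavings(thread1, thread2)}
```

Run alone, the thread can only read its own write; in the racy case the read may also observe the other thread's write, which the substitution in D2c cannot account for.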

Existing approaches to sequential composition in the concurrent context either assume exclusive access, as in concurrent separation logic [O'Hearn 2007], or abandon the logical approach altogether, as in the pomset model of Kavanagh and Brookes [2018], which enforces syntactic dependencies. This leaves open the question of how to apply logic to racy programs without overconstraining the implementation. To understand the solution, one must first understand the constraints imposed by hardware and compilers.

 For single-threaded programs, memory can be thought of as you might expect: programs write to, and read from, memory references. This can be thought of as a total order of reads and writes (black arrows), where each read has a matching *fulfilling* write (green arrows), for example:

$$x := 0; x := 1; y := 2; r := y; s := x$$

$$(Wx0) \longrightarrow (Wx1) \longrightarrow (Wy2) \longrightarrow (Ry2) \longrightarrow (Rx1)$$

This model naturally extends to the case of shared-memory concurrency, leading to a *sequentially consistent* semantics [Lamport 1979], in which *program order* inside a thread implies a total *causal order* between read and write events, for example (where  $;$  has higher precedence than  $\parallel$ ):

$$x := 0; x := 1; y := 2 \parallel r := y; s := x$$

$$(Wx0) \longrightarrow (Wx1) \longrightarrow (Wy2) \longrightarrow (Ry2) \longrightarrow (Rx1)$$
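The set of sequentially consistent outcomes for this two-thread program can be enumerated directly. The following sketch assumes that both locations initially hold 0 and uses a hypothetical helper `sc_outcomes`; it is meant only to show which final values of `(r, s)` a total order consistent with program order can produce.

```python
from itertools import combinations

def sc_outcomes(t1, t2, init):
    """Final (r, s) of every sequentially consistent interleaving of t1 and t2."""
    n, results = len(t1) + len(t2), set()
    for slots in combinations(range(n), len(t1)):   # positions taken by t1
        i1, i2 = iter(t1), iter(t2)
        mem, regs = dict(init), {}
        for pos in range(n):
            op, dst, src = next(i1) if pos in slots else next(i2)
            if op == "W":
                mem[dst] = src          # write value src to location dst
            else:
                regs[dst] = mem[src]    # read location src into register dst
        results.add((regs["r"], regs["s"]))
    return results

writer = [("W", "x", 0), ("W", "x", 1), ("W", "y", 2)]   # x := 0; x := 1; y := 2
reader = [("R", "r", "y"), ("R", "s", "x")]              # r := y; s := x
outcomes = sc_outcomes(writer, reader, {"x": 0, "y": 0})
```

Under these assumptions, whenever the reader sees `y = 2` it must also see `x = 1`; the outcome `(2, 0)` never appears.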

Unfortunately, this model does not compile efficiently to commodity hardware, resulting in a 37–73% increase in CPU time on Arm8 [Liu et al. 2019] and, hence, in power consumption. Developers of software and compilers have therefore been faced with a difficult trade-off, between an elegant model of memory, and its impact on resource usage (such as size of data centers, electricity bills and carbon footprint). Unsurprisingly, many have chosen to prioritize efficiency over elegance.

This has led to *relaxed memory models*, in which the requirement of sequential consistency is weakened to only apply *per-location* and not globally over the whole program. This allows executions that are inconsistent with program order, such as:

$$x := 0; x := 1; y := 2 \parallel r := y; s := x$$

$$(Wx0) \qquad (Ry2) \qquad (Rx0)$$

In such models, the causal order between events is important, and includes control and data dependencies, to avoid paradoxical "out of thin air" examples such as:

$$r := x \; ; \; \text{if}(r)\{y := 1\} \parallel s := y \; ; \; x := s$$

This candidate execution forms a cycle in causal order, so is disallowed, but this depends crucially on the control dependency from (Rx1) to (Wy1), and the data dependency from (Ry1) to (Wx1). If either is missing, then this execution is acyclic and hence allowed. For example dropping the control dependency results in:

$$r := x ; y := 1 \parallel s := y ; x := s$$

$$(Rx1) \qquad (Ry1) \qquad (Wx1)$$
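The acyclicity requirement can be checked with an ordinary graph search. The sketch below is a minimal illustration under assumed names (`has_cycle`, and string labels for events); it encodes the reads-from, data-dependency and control-dependency edges of the thin-air candidate and tests for a cycle with and without the control dependency.

```python
def has_cycle(events, edges):
    """Depth-first search for a cycle in the directed graph (events, edges)."""
    succ = {e: [] for e in events}
    for a, b in edges:
        succ[a].append(b)
    state = {e: "new" for e in events}     # new -> active -> done

    def visit(e):
        state[e] = "active"
        for nxt in succ[e]:
            if state[nxt] == "active":     # back edge: cycle found
                return True
            if state[nxt] == "new" and visit(nxt):
                return True
        state[e] = "done"
        return False

    return any(state[e] == "new" and visit(e) for e in events)

events = ["Rx1", "Wy1", "Ry1", "Wx1"]
reads_from = [("Wy1", "Ry1"), ("Wx1", "Rx1")]   # each read is fulfilled by a write
data_dep = [("Ry1", "Wx1")]                     # x := s depends on s := y
ctrl_dep = [("Rx1", "Wy1")]                     # y := 1 is guarded by if (r)

with_ctrl = has_cycle(events, reads_from + data_dep + ctrl_dep)
without_ctrl = has_cycle(events, reads_from + data_dep)
```

With the control dependency the four events form a cycle and the execution is rejected; without it the same events are acyclic.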

While syntactic dependency calculation suffices for hardware models, it is not preserved by common compiler optimizations. For example, if we calculate control dependencies syntactically, then there is a dependency from (Rx1) to (Wy1), and therefore a cycle in the candidate execution:

$$r := x \; ; \; \text{if}(r)\{y := 1\} \text{ else } \{y := 1\} \parallel s := y \; ; \; x := s$$

A compiler may lift the assignment y := 1 out of the conditional, thus removing the dependency.

To address this, Jagadeesan et al. [2020] introduced *Pomsets with Preconditions*, where events are labeled with logical formulae. Nontrivial preconditions are introduced by store actions (modeling data dependencies) and conditionals (modeling control dependencies):

$$if(s<1)\{z:=r*s\}$$

$$(s<1) \land (r*s)=0 \mid Wz0$$

Preconditions are discharged by being ordered after a read (we assume the usual precedence for logical operators— $\neg$ ,  $\land$ ,  $\lor$ ,  $\Rightarrow$ ):

$$r := x; s := y; \text{if}(s<1)\{z := r*s\} \qquad (\dagger)$$

$$(Rx0) \qquad (Ry0) \longrightarrow ((0=s) \Rightarrow (s<1) \land (r*s)=0 \mid Wz0)$$

Note that there is dependency order from (Ry0) to (Wz0) so the precondition for (Wz0) only has to be satisfied assuming the hypothesis (0=s). There is no matching order from (Rx0) to (Wz0) which is why we do not assume the hypothesis (0=r). Nonetheless, the precondition on (Wz0) is a tautology, and so can be elided in the diagram:

$$(Rx0) \qquad (Ry0) \longrightarrow (Wz0)$$
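The claim that the precondition on (Wz0) becomes a tautology only under the hypothesis (0=s) can be checked by brute force. The sketch below assumes a small integer domain for registers and uses hypothetical helpers `is_tautology` and `implies`; it is a sanity check, not a decision procedure.

```python
from itertools import product

DOMAIN = range(-3, 4)   # small value range, assumed for this sketch

def is_tautology(formula):
    """True if formula(r, s) holds for every r, s in DOMAIN."""
    return all(formula(r, s) for r, s in product(DOMAIN, repeat=2))

def implies(p, q):
    return (not p) or q

# Precondition of (Wz0) in isolation: (s<1) and (r*s)=0
bare = lambda r, s: s < 1 and r * s == 0
# Ordered after (Ry0): assume the hypothesis 0=s
after_ry0 = lambda r, s: implies(s == 0, bare(r, s))
# Ordered after (Rx0) only: assume the hypothesis 0=r
after_rx0 = lambda r, s: implies(r == 0, bare(r, s))
```

Only `after_ry0` is valid on this domain: knowing `r = 0` makes the product zero but does not establish `s < 1`, so order from (Rx0) alone would not discharge the precondition.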

### 2.3 Predicate Transformers For Relaxed Memory

Pomsets with Preconditions show how the logical approach to sequential dependency calculation can be mixed into a relaxed memory model. However, Jagadeesan et al. do not provide a model of sequential composition. Instead, their model uses *prefixing*, which requires that the model is built from right to left: events are prepended one at a time, with perfect knowledge of the future. This makes reasoning about sequential program fragments difficult. For example, Jagadeesan et al. state the equivalence that allows reordering of independent writes as follows,

$$[x := M; y := N; S] = [y := N; x := M; S] \quad \text{if } x \neq y$$

where S is the entire future computation! By formalizing sequential composition, we can show:

$$[x := M; y := N] = [y := N; x := M] \quad \text{if } x \neq y$$

Then the equivalence holds in any context.

Predicate transformers are a good fit for logical models of dependency calculation, since both are concerned with preconditions and how they are transformed by sequential composition. Our first attempt is to associate a predicate transformer with each pomset. We visualize this in diagrams by showing how  $\psi$  is transformed, for example:

$$\begin{array}{ccc}
r:=x & s:=y & \text{if}(s<1)\{z:=r*s\} \\
(0=r)\Rightarrow \psi & (0=s)\Rightarrow \psi & ((s<1) \land (r*s)=0 \mid Wz0) \cdots \psi[r*s/z]
\end{array}$$

The predicate transformer from the write matches Dijkstra's D2b. For the reads, however, D2c defines the transformer of r := x to be  $\psi[x/r]$ , which is equivalent to  $(x=r) \Rightarrow \psi$  under the assumption that registers are assigned at most once. Instead, we use  $(0=r) \Rightarrow \psi$ , reflecting the fact that 0 may come from a concurrent write. The obligation to find a matching write is moved from the sequential semantics of *substitution* and *implication* to the concurrent semantics of *fulfillment*.

For a sequentially consistent semantics, sequential composition is straightforward: we apply each predicate transformer to the preconditions of subsequent events, composing the predicate transformers. (In subsequent diagrams, we only show predicate transformers for reads.)

$$r := x; s := y; if(s<1)\{z := r*s\}$$

$$\boxed{(0=r) \Rightarrow (0=s) \Rightarrow \psi} \leftarrow (Rx0) \rightarrow (Ry0) \rightarrow ((0=r) \Rightarrow (0=s) \Rightarrow (s<1) \land (r*s)=0 \mid Wz0)$$

This model works for the sequentially consistent case, but needs to be weakened for the relaxed case. The key observation of this paper is that rather than working with one predicate transformer, we should work with a *family* of predicate transformers, indexed by sets of events.

For example, for single-event pomsets, there are two predicate transformers, since there are two subsets of any one-element set. The *independent* transformer is indexed by the empty set, whereas the *dependent* transformer is indexed by the singleton. We visualize this by including more than one transformed predicate, with an edge leading to the dependent one. For example:

$$r := x \qquad \qquad s := y$$

$$\boxed{\psi} \quad (Rx0) \cdots \boxed{(0=r) \Rightarrow \psi} \qquad\qquad \boxed{\psi} \quad (Ry0) \cdots \boxed{(0=s) \Rightarrow \psi}$$

The model of sequential composition then picks which predicate transformer to apply to an event's precondition by picking the one indexed by all the events before it in causal order.

For example, we can recover the expected semantics for (†) by choosing the predicate transformer which is independent of (Rx0) but dependent on (Ry0), which is the transformer which maps  $\psi$  to (0=s)  $\Rightarrow \psi$ .

$$r := x \; ; \; s := y \; ; \; \text{if}(s<1)\{z := r*s\}$$

$$\begin{array}{c}
\boxed{\psi} \qquad \boxed{(0=r)\Rightarrow \psi} \qquad \boxed{(0=s)\Rightarrow \psi} \qquad \boxed{(0=r)\Rightarrow (0=s)\Rightarrow \psi} \\
(Rx0) \qquad (Ry0) \longrightarrow ((0=s)\Rightarrow (s<1) \land (r*s)=0 \mid Wz0)
\end{array}$$

As a sanity check, we can see that sequential composition is associative in this case, since it does not matter whether we associate to the left, with intermediate step:

$$r := x \; ; \; s := y$$

$$\boxed{\psi} \qquad \boxed{(0=r) \Rightarrow \psi} \leftarrow (Rx0) \rightarrow \boxed{(0=r) \Rightarrow (0=s) \Rightarrow \psi} \leftarrow (Ry0) \rightarrow \boxed{(0=s) \Rightarrow \psi}$$

or to the right, with intermediate step:

$$s := y \; ; \; \text{if}(s<1)\{z := r*s\}$$

$$\boxed{\psi} \qquad \boxed{(0=s) \Rightarrow \psi} \leftarrow (Ry0) \rightarrow ((0=s) \Rightarrow (s<1) \land (r*s)=0 \mid Wz0)$$

This is an instance of the general result that sequential composition forms a monoid.
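The selection of a transformer by causal predecessors can be sketched in a few lines. The code below is a simplified model of this overview section only: families are functions from a set of events to a transformer, the independent transformer of a read is taken to be the identity, and all names (`read_family`, `seq`, `equivalent`) are assumptions made for illustration.

```python
from itertools import product

def read_family(event, reg, value):
    """Family for a single read event: dependent if event is in D, else independent."""
    def family(D):
        if event in D:   # dependent: (value = reg) => psi
            return lambda psi: (lambda env: env[reg] != value or psi(env))
        return lambda psi: psi
    return family

def seq(f1, f2):
    """Compose families pointwise: tau^D is tau1^D applied after tau2^D."""
    return lambda D: (lambda psi: f1(D)(f2(D)(psi)))

def equivalent(p, q, regs=("r", "s"), values=range(-2, 3)):
    """Compare two formulae on every valuation of a small finite domain."""
    envs = (dict(zip(regs, vs)) for vs in product(values, repeat=len(regs)))
    return all(p(env) == q(env) for env in envs)

rx0 = read_family("Rx0", "r", 0)      # r := x, reading 0
ry0 = read_family("Ry0", "s", 0)      # s := y, reading 0
skip = lambda D: (lambda psi: psi)    # unit of composition

# Precondition of (Wz0) before composition: (s<1) and (r*s)=0
kappa = lambda env: env["s"] < 1 and env["r"] * env["s"] == 0

# (Wz0) is ordered after (Ry0) only, so pick the transformer indexed by {Ry0}.
chosen = seq(rx0, ry0)({"Ry0"})(kappa)
expected = lambda env: env["s"] != 0 or kappa(env)   # (0=s) => kappa
```

Because `seq` is pointwise function composition with `skip` as unit, associativity and the unit laws hold for every index set in this simplified setting.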

#### 2.4 Related Work

Marino et al. [2015] argue that the "silently shifting semicolon" is sufficiently problematic for programmers that concurrent languages should guarantee sequential abstraction, despite the performance penalties. In this paper, we take the opposite approach. We have attempted to find the most intellectually tractable model that encompasses all of the messiness of relaxed memory.

There are few prior studies of relaxed memory that include sequential composition and/or precise calculation of semantic dependencies. Paviotti et al. [2020] give a denotational semantics, calculating dependencies using event structures rather than logic. They give the semantics of sequential composition in continuation passing style, whereas we give it in direct style. Kavanagh and Brookes [2018] define a semantics using pomsets without preconditions. Instead, their model uses syntactic dependencies, thus invalidating many compiler optimizations. They also require a fence after every relaxed read on Arm8. Pichon-Pharabod and Sewell [2016] use event structures to calculate dependencies, combined with an operational semantics that incorporates program transformations. This approach seems to require whole-program analysis.

Other studies of relaxed memory can be categorized by their approach to dependency calculation. Hardware models use syntactic dependencies [Alglave et al. 2014]. Many software models do not bother with dependencies at all [Batty et al. 2011; Cox 2016; Watt et al. 2020, 2019]. Others have strong dependencies that disallow compiler optimizations and efficient implementation, typically requiring fences for every relaxed read on Arm [Boehm and Demsky 2014; Dolan et al. 2018; Jeffrey and Riely 2016; Lahav et al. 2017; Lamport 1979]. Many of the most prominent models are operational, whole-program models based on speculative execution [Chakraborty and Vafeiadis 2019a; Cho et al. 2021; Jagadeesan et al. 2010; Kang et al. 2017; Lee et al. 2020; Manson et al. 2005].

Jagadeesan et al. [2020] note that all of the speculative models listed above, including [Kang et al. 2017], fail to validate compositional reasoning for temporal properties (see their examples OOTA4 and OOTA5, and [Lochbihler 2013, Fig. 8]). The difference with our model can be understood in terms of the valid program transformations. The speculative models allow reads to be introduced, with subsequent case analysis on the value read. Effectively, this can turn one read into two, with different conditional branches taken for the two copies of the read. Our model invalidates this transformation. In return, our model enjoys compositionality for temporal safety properties.

We provide a detailed comparison with [Jagadeesan et al. 2020] in §F.

#### 2.5 Contributions

We show how predicate transformers [Dijkstra 1975] can be added to pomsets with preconditions [Jagadeesan et al. 2020] to create a compositional semantics for sequential composition.

- §3 presents the basic model, which satisfies many desiderata, but not all.
- §A shows two approaches for efficient implementation on Arm. The first uses a suboptimal lowering for acquiring reads. The second uses an optimal lowering, but requires a nontrivial change to the definition of sequential composition.
- §4 generalizes the basic semantics of read and write to validate compiler optimizations.

Because it is closely related, we expect that the memory-model results of [Jagadeesan et al. 2020] apply to our model, including compositional reasoning for temporal safety properties and local DRF-sc as in [Cho et al. 2021; Dolan et al. 2018; Dongol et al. 2019].

### 3 THE BASIC MODEL

After some preliminaries ( $\S 3.1-3.2$ ), we define the basic model and establish some basic properties ( $\S 3.3$  and Figure 2). We then explain the model using examples ( $\S 3.4-3.12$ ). We encourage readers to skim the definitions and then skip to  $\S 3.4$ , coming back as needed.

#### 3.1 Preliminaries

The syntax is built from

- a set of *values* V, ranged over by v, w,  $\ell$ , k,
- a set of registers  $\mathcal{R}$ , ranged over by r, s,
- a set of *expressions*  $\mathcal{M}$ , ranged over by M, N, L.

*Memory references* are tagged values, written  $[\ell]$ . Let  $\mathcal{X}$  be the set of memory references, ranged over by x, y, z. We require that

- values and registers are disjoint,
- values include at least the constants 0 and 1,
- expressions include at least registers and values,
- expressions do *not* include references: M[N/x] = M.

We model the following language.

$$\mu, \nu ::= \mathsf{rlx} \mid \mathsf{rel} \mid \mathsf{acq} \mid \mathsf{sc}$$

$$S ::= r := M \mid r := [L]^{\mu} \mid [L]^{\mu} := M \mid F^{\mu} \mid \text{skip} \mid S_1; S_2 \mid \text{if}(M)\{S_1\} \text{ else } \{S_2\} \mid S_1 \mid S_2$$

Access modes,  $\mu$ , are relaxed (rlx), release (rel), acquire (acq), and sequentially consistent (sc). Let expressions (r := M) only affect thread-local state and thus do not have a mode. Reads  $(r := [L]^{\mu})$  support rlx, acq, sc. Writes  $([L]^{\mu} := M)$  support rlx, rel, sc. Fences  $(F^{\mu})$  support rel, acq, sc.

Commands, aka statements, S, include memory accesses at a given mode, as well as the usual structural constructs. Following [Ferreira et al. 1996],  $\parallel$  denotes parallel composition, preserving thread state on the left after a join. In examples and sublanguages without join, we use the symmetric  $\parallel$  operator.

We use common syntactic sugar, such as *extended expressions*,  $\mathbb{M}$ , which include memory locations. For example, if  $\mathbb{M}$  includes a single occurrence of x, then  $y := \mathbb{M}$ ; S is shorthand for r := x;  $y := \mathbb{M}[r/x]$ ; S. Each occurrence of x in an extended expression corresponds to a separate read. We also write if  $(M)\{S\}$  as shorthand for if  $(M)\{S\}$  else  $\{skip\}$ .

Throughout §1-A we require that

• each register is assigned at most once in a program.

In §4, we drop this restriction, requiring instead that

• there are registers  $S_{\mathcal{E}} = \{s_e \mid e \in \mathcal{E}\}$  that do not appear in programs:  $S[N/s_e] = S$ .

The semantics is built from the following.

- a set of *events*  $\mathcal{E}$ , ranged over by e, d, c, and subsets ranged over by E, D, C,
- a set of *logical formulae*  $\Phi$ , ranged over by  $\phi$ ,  $\psi$ ,  $\theta$ ,
- a set of actions  $\mathcal{A}$ , ranged over by a, b,
- a family of *quiescence symbols*  $Q_x$ , indexed by location.

We require that

- formulae include tt, ff,  $Q_x$ , and the equalities (M=N) and (x=M),
- formulae are closed under  $\neg$ ,  $\land$ ,  $\lor$ ,  $\Rightarrow$ , and substitutions [M/r], [M/x],  $[\phi/Q_x]$
- there is a relation  $\models$  between formulae, capturing entailment,
- $\models$  has the expected semantics for =,  $\neg$ ,  $\land$ ,  $\lor$ ,  $\Rightarrow$  and substitutions [M/r], [M/x],  $[\phi/Q_x]$ ,
- there are three binary relations over  $\mathcal{A} \times \mathcal{A}$ : matches, blocks, and delays,
- there are two subsets of  $\mathcal{A}$ , distinguishing read and release actions.

Logical formulae include equations over registers and memory references, such as (r=s+1) and (x=1). We use expressions as formulae, coercing M to  $M\neq 0$ .

We write  $\phi \equiv \psi$  when  $\phi \models \psi$  and  $\psi \models \phi$ . We say  $\phi$  is a *tautology* if  $\mathsf{tt} \models \phi$ . We say  $\phi$  is *unsatisfiable* if  $\phi \models \mathsf{ff}$ , and *satisfiable* otherwise.

### 3.2 Actions in This Paper

In this paper, we let actions be reads and writes and fences:

$$a, b := W^{\mu}xv \mid R^{\mu}xv \mid F^{\mu}$$

We use shorthand when referring to actions. In definitions, we drop elements of actions that are existentially quantified. In examples, we drop elements of actions, using defaults. Let  $\sqsubseteq$  be the smallest order over access and fence modes such that  $\mathsf{rlx} \sqsubseteq \mathsf{rel} \sqsubseteq \mathsf{sc}$  and  $\mathsf{rlx} \sqsubseteq \mathsf{acq} \sqsubseteq \mathsf{sc}$ . We write  $(W^{\sqsupseteq \mathsf{rel}})$  to stand for either  $(W^{\mathsf{rel}})$  or  $(W^{\mathsf{sc}})$ , and similarly for the other actions and modes.

*Definition 3.1.* Actions (R) are *read* actions. Actions  $(W^{\sqsupseteq \mathsf{rel}})$  and  $(F^{\sqsupseteq \mathsf{rel}})$  are *release* actions. We say *a matches b* if a = (Wxv) and b = (Rxv).

We say *a blocks b* if a = (Wx) and b = (Rx), regardless of value.

Let  $\bowtie_{\mathsf{co}}$  capture write-write, read-write, and write-read coherence:  $\bowtie_{\mathsf{co}} = \{(Wx, Wx), (Rx, Wx), (Wx, Rx)\}$ .

Let  $\bowtie_{\mathsf{sync}}$  capture conflict due to synchronization:  $\bowtie_{\mathsf{sync}} = \{(a, W^{\sqsupseteq \mathsf{rel}}), (a, F^{\sqsupseteq \mathsf{rel}}), (R, F^{\sqsupseteq \mathsf{acq}}), (R^{\sqsupseteq \mathsf{acq}}, a), (F^{\sqsupseteq \mathsf{acq}}, a), (W^{\sqsupseteq \mathsf{rel}}, W), (W^{\sqsupseteq \mathsf{rel}}, Wx)\}$ .

Let  $\bowtie_{\mathsf{sc}}$  capture conflict due to sc access:  $\bowtie_{\mathsf{sc}} = \{(W^{\mathsf{sc}}, W^{\mathsf{sc}}), (R^{\mathsf{sc}}, W^{\mathsf{sc}}), (R^{\mathsf{sc}}, R^{\mathsf{sc}})\}$ .

We say *a delays b* if  $a \bowtie_{\mathsf{co}} b$  or  $a \bowtie_{\mathsf{sync}} b$  or  $a \bowtie_{\mathsf{sc}} b$ .
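The relations of Definition 3.1 are simple enough to transcribe as executable predicates. The sketch below is a hedged reading of the definition as printed: the `Action` record, the function names, and the encoding of the mode order as a set of pairs are all assumptions of this example, and the synchronization and sc conflict sets follow the listed pairs literally.

```python
from dataclasses import dataclass
from typing import Optional

# rlx below rel below sc, and rlx below acq below sc; rel and acq are incomparable.
BELOW = {("rlx", "rlx"), ("rel", "rel"), ("acq", "acq"), ("sc", "sc"),
         ("rlx", "rel"), ("rlx", "acq"), ("rlx", "sc"),
         ("rel", "sc"), ("acq", "sc")}

def at_least(mode, bound):
    """True if bound is below or equal to mode in the mode order."""
    return (bound, mode) in BELOW

@dataclass(frozen=True)
class Action:
    kind: str                   # "W", "R", or "F"
    mode: str = "rlx"
    loc: Optional[str] = None   # None for fences
    val: Optional[int] = None

def matches(a, b):
    return (a.kind, b.kind) == ("W", "R") and a.loc == b.loc and a.val == b.val

def blocks(a, b):
    return (a.kind, b.kind) == ("W", "R") and a.loc == b.loc

def coherence(a, b):
    pairs = {("W", "W"), ("R", "W"), ("W", "R")}
    return (a.kind, b.kind) in pairs and a.loc == b.loc

def sync(a, b):
    release_b = b.kind in ("W", "F") and at_least(b.mode, "rel")
    acquire_a = a.kind in ("R", "F") and at_least(a.mode, "acq")
    read_then_acq_fence = (a.kind == "R" and b.kind == "F"
                           and at_least(b.mode, "acq"))
    rel_write_then_write = (a.kind == "W" and at_least(a.mode, "rel")
                            and b.kind == "W")
    return release_b or acquire_a or read_then_acq_fence or rel_write_then_write

def sc_conflict(a, b):
    pairs = {("W", "W"), ("R", "W"), ("R", "R")}
    return a.mode == "sc" and b.mode == "sc" and (a.kind, b.kind) in pairs

def delays(a, b):
    return coherence(a, b) or sync(a, b) or sc_conflict(a, b)
```

For instance, a relaxed read of x does not delay a relaxed write of y, whereas an acquiring read delays every later action.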

### 3.3 The Semantic Domain

*Predicate transformers* are functions on formulae that preserve logical structure, providing a natural model of sequential composition. The definition comes from Dijkstra [1975]:

*Definition 3.2.* A *predicate transformer* is a function  $\tau : \Phi \to \Phi$  such that

- (x1) if  $\phi \models \psi$ , then  $\tau(\phi) \models \tau(\psi)$ ,
- (x2)  $\tau(\psi_1 \land \psi_2) \equiv \tau(\psi_1) \land \tau(\psi_2)$ ,
- (x3)  $\tau(\psi_1 \lor \psi_2) \equiv \tau(\psi_1) \lor \tau(\psi_2)$ .

We consistently use  $\psi$  as the parameter of predicate transformers. Note that substitutions ( $\psi[M/r]$  and  $\psi[M/x]$ ) and implications on the right ( $\phi \Rightarrow \psi$ ) are predicate transformers.
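The three conditions can be tested on finite samples. The sketch below is illustrative and makes several assumptions: formulae are Python predicates over a two-register environment with values in a small range, entailment is checked by enumeration, and the helper names are hypothetical. It checks a substitution and a right implication, and contrasts them with negation, which fails monotonicity.

```python
from itertools import product

REGS = ("r", "s")
ENVS = [dict(zip(REGS, vs)) for vs in product(range(3), repeat=len(REGS))]

def entails(p, q):
    return all(q(env) for env in ENVS if p(env))

def equiv(p, q):
    return entails(p, q) and entails(q, p)

def conj(p, q): return lambda env: p(env) and q(env)
def disj(p, q): return lambda env: p(env) or q(env)

def is_predicate_transformer(tau, samples):
    """Check (x1)-(x3) on every pair drawn from a finite list of sample formulae."""
    for p, q in product(samples, repeat=2):
        if entails(p, q) and not entails(tau(p), tau(q)):      # (x1)
            return False
        if not equiv(tau(conj(p, q)), conj(tau(p), tau(q))):   # (x2)
            return False
        if not equiv(tau(disj(p, q)), disj(tau(p), tau(q))):   # (x3)
            return False
    return True

samples = [lambda e: True, lambda e: False,
           lambda e: e["r"] == 0, lambda e: e["s"] == 1,
           lambda e: e["r"] < e["s"]]

subst = lambda psi: (lambda e: psi({**e, "r": e["s"]}))     # psi[s/r]
impl = lambda psi: (lambda e: e["s"] != 0 or psi(e))        # (s=0) => psi
neg = lambda psi: (lambda e: not psi(e))                    # not monotone
```

Passing this check on samples is evidence rather than proof; it is useful mainly for catching a candidate, such as `neg`, that violates (x1).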

As discussed in §1, predicate transformers suffice for sequentially consistent models, but not relaxed models, where dependency calculation is crucial. For dependency calculation, we use a *family* of predicate transformers, indexed by sets of events. In sequential composition, we will use  $\tau^{\downarrow e}$  as the predicate transformer applied to event e where  $d \in (\downarrow e)$  if d < e.

*Definition 3.3.* A *family of predicate transformers* over E consists of a predicate transformer  $\tau^D$  for each  $D \subseteq \mathcal{E}$ , such that if  $C \cap E \subseteq D$  then  $\tau^C(\psi) \models \tau^D(\psi)$ .

We write  $\tau(\psi)$  as an abbreviation of  $\tau^E(\psi)$ .
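The monotonicity condition of Definition 3.3 can be checked by enumeration for the read transformers given as rules (R4a) and (R4b) in Figure 1. This sketch assumes a single read event of value `v` into register `r` from location `x`, a small finite domain, and hypothetical helper names; it is a consistency check only.

```python
from itertools import chain, combinations, product

def subsets(xs):
    xs = list(xs)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(xs, n) for n in range(len(xs) + 1))]

ENVS = [{"r": r, "x": x} for r, x in product(range(3), repeat=2)]

def entails(p, q):
    return all(q(env) for env in ENVS if p(env))

def read_family(e, v):
    """tau^D for a read event e of value v into register r from location x."""
    def tau(D, psi):
        if e in D:    # dependent case (R4a): (v=r) => psi
            return lambda env: env["r"] != v or psi(env)
        # independent case (R4b): (v=r or x=r) => psi
        return lambda env: not (env["r"] == v or env["x"] == env["r"]) or psi(env)
    return tau

def is_family(tau, E, universe, samples):
    """Definition 3.3: if C intersected with E is within D, tau^C(psi) entails tau^D(psi)."""
    return all(entails(tau(C, psi), tau(D, psi))
               for C in subsets(universe) for D in subsets(universe)
               if (C & E) <= D
               for psi in samples)

samples = [lambda env: env["r"] == 1, lambda env: env["x"] == 0, lambda env: False]
E = frozenset({"e"})
good = read_family("e", 1)
```

The independent transformer assumes more hypotheses and is therefore the stronger formula, which is exactly what the condition requires when moving from a smaller index set to a larger one.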

*Definition 3.4.* A pomset with predicate transformers over  $\mathcal{A}$  is a tuple  $(E, \lambda, \kappa, \tau, \checkmark, \mathsf{rf}, \leq)$  where

- (M1)  $E \subseteq \mathcal{E}$  is a set of events,
- (M2)  $\lambda : E \to \mathcal{A}$  defines a label for each event,
- (M3)  $\kappa : E \to \Phi$  defines a *precondition* for each event,
- (M4)  $\tau: 2^{\mathcal{E}} \to \Phi \to \Phi$  is a family of predicate transformers over E,
- (M5)  $\checkmark : \Phi$  is a termination condition, such that
  - (M5a)  $\checkmark \models \tau(\mathsf{tt})$ ,
- (M6)  $\leq\ \subseteq E \times E$  is a partial order capturing causality,
- (M7)  $\mathsf{rf} \subseteq E \times E$  is an injective relation capturing *reads-from*, such that
  - (M7a) if  $d \xrightarrow{\mathsf{rf}} e$  then  $\lambda(d)$  matches  $\lambda(e)$ ,
  - (M7b) if  $d \xrightarrow{\mathsf{rf}} e$  and  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \le d$  or  $e \le c$ ,
  - (M7c) if  $d \xrightarrow{\mathsf{rf}} e$  then  $d \le e$ .

A pomset is *complete* if

- (c2) if  $\lambda(e)$  is a read then there is some  $d \xrightarrow{\text{rf}} e$ ,
- (c3)  $\kappa(e)$  is a tautology (for every  $e \in E$ ),
- (c5)  $\checkmark$  is a tautology.
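As a rough illustration of the completeness conditions, the following sketch stores a pomset as a small record and checks (c2), (c3) and (c5). The `Pomset` class, its field names, and the finite-domain `tautology` check are assumptions of this example; the family of predicate transformers is omitted because completeness does not mention it.

```python
from dataclasses import dataclass, field
from itertools import product

ENVS = [dict(zip(("r", "s"), vs)) for vs in product(range(3), repeat=2)]

def tautology(phi):
    return all(phi(env) for env in ENVS)

@dataclass
class Pomset:
    labels: dict                 # event -> (kind, location, value)
    pre: dict                    # event -> precondition (env -> bool)
    term: object                 # termination condition (env -> bool)
    rf: set = field(default_factory=set)      # pairs (writer, reader)
    order: set = field(default_factory=set)   # pairs (d, e) with d <= e

    def is_complete(self):
        reads = [e for e, (kind, _, _) in self.labels.items() if kind == "R"]
        fulfilled = {e for (_, e) in self.rf}
        return (all(e in fulfilled for e in reads)                # (c2)
                and all(tautology(k) for k in self.pre.values())  # (c3)
                and tautology(self.term))                         # (c5)

tt = lambda env: True
labels = {"w": ("W", "x", 1), "e": ("R", "x", 1)}

# x := 1 in one thread, r := x in another, with the read fulfilled by the write.
complete = Pomset(labels, {"w": tt, "e": tt}, tt,
                  rf={("w", "e")}, order={("w", "e")})
# Same events, but the read has no fulfilling write: (c2) fails.
unfulfilled = Pomset(labels, {"w": tt, "e": tt}, tt)
```

A pomset whose preconditions still mention an undischarged hypothesis fails (c3) in the same way that `unfulfilled` fails (c2).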

Preconditions do two things:

- they determine which events can appear in a pomset, and which ones can appear together,
- they help to calculate dependencies.

The empty set of events is used for code that does not run in a given execution. To get a complete pomset for  $[\text{if}(\mathsf{ff})\{x := 1\}]$ , one must take the empty set for  $[x := 1]$ .

```
If P \in SKIP then E = \emptyset and \tau^D(\psi) \equiv \psi.

If P \in SEQ(\mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)
  (s1)  E = (E_1 \cup E_2),
  (s2)  \lambda = (\lambda_1 \cup \lambda_2),
  (s3a) if e \in E_1 \setminus E_2 then \kappa(e) \equiv \kappa_1(e),
  (s3b) if e \in E_2 \setminus E_1 then \kappa(e) \equiv \kappa_2'(e) \land \checkmark_1(e),
  (s3c) if e \in E_1 \cap E_2 then \kappa(e) \equiv (\kappa_1(e) \lor \kappa_2'(e)) \land \checkmark_1(e),
  (s4)  \tau^D(\psi) \equiv \tau_1^D(\tau_2^D(\psi)),
  (s5)  \checkmark \equiv \checkmark_1 \land \tau_1(\checkmark_2),
  (s6)  \leq \supseteq \leq_1 \cup \leq_2,
  where \kappa_2'(e) = \tau_1(\kappa_2(e)) if \lambda(e) is a read, otherwise \kappa_2'(e) = \tau_1^{\downarrow e}(\kappa_2(e)), where \downarrow e = \{c \mid c < e\};
  where \checkmark_1(e) = \checkmark_1 if \lambda(e) is a release, otherwise \checkmark_1(e) = tt.

If P \in IF(\phi, \mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)
  (I1)  E = (E_1 \cup E_2),
  (I2)  \lambda = (\lambda_1 \cup \lambda_2),
  (I3a) if e \in E_1 \setminus E_2 then \kappa(e) \equiv \phi \land \kappa_1(e),
  (I3b) if e \in E_2 \setminus E_1 then \kappa(e) \equiv \neg\phi \land \kappa_2(e),
  (I3c) if e \in E_1 \cap E_2 then \kappa(e) \equiv (\phi \land \kappa_1(e)) \lor (\neg\phi \land \kappa_2(e)),
  (I4)  \tau^D(\psi) \equiv (\phi \land \tau_1^D(\psi)) \lor (\neg\phi \land \tau_2^D(\psi)),
  (I5)  \checkmark \equiv (\phi \land \checkmark_1) \lor (\neg\phi \land \checkmark_2),
  (I6)  \leq \supseteq \leq_1 \cup \leq_2.

If P \in LET(r, M) then E = \emptyset and \tau^D(\psi) \equiv \psi[M/r].

If P \in READ(r, x, \mu) then (\exists v \in \mathcal{V})
  (R1)  if d, e \in E then d = e,
  (R2)  \lambda(e) = R^{\mu} x v,
  (R3)  \kappa(e) \equiv Q_x,
  (R4a) if E \neq \emptyset and (E \cap D) \neq \emptyset then \tau^D(\psi) \equiv v = r \Rightarrow \psi,
  (R4b) if E \neq \emptyset and (E \cap D) = \emptyset then \tau^D(\psi) \equiv (v=r \lor x=r) \Rightarrow \psi,
  (R4c) if E = \emptyset then \tau^D(\psi) \equiv \psi,
  (R5a) if E \neq \emptyset or \mu \sqsubseteq rlx then \checkmark \equiv tt,
  (R5b) if E = \emptyset and \mu \sqsupseteq acq then \checkmark \equiv ff.

If P \in WRITE(x, M, \mu) then (\exists v \in \mathcal{V})
  (w1)  if d, e \in E then d = e,
  (w2)  \lambda(e) = W^{\mu} x v,
  (w3)  \kappa(e) \equiv M = v,
  (w4a) if E \neq \emptyset then \tau^D(\psi) \equiv \psi[M/x][M=v/Q_x],
  (w4b) if E = \emptyset then \tau^D(\psi) \equiv \psi[M/x][ff/Q_x],
  (w5a) if E \neq \emptyset then \checkmark \equiv M = v,
  (w5b) if E = \emptyset then \checkmark \equiv ff.

[r := M]_1 = LET(r, M)
[r := x^{\mu}]_1 = READ(r, x, \mu)
[x^{\mu} := M]_1 = WRITE(x, M, \mu)
[skip]_1 = SKIP
[S_1; S_2]_1 = SEQ([S_1]_1, [S_2]_1)
[if(M)\{S_1\}else\{S_2\}]_1 = IF(M \neq 0, [S_1]_1, [S_2]_1)
```

Fig. 1. Sequential Semantics

Idea behind  $Q_x$ :

- the most recent prior write to x must be in the pomset in order to read x...
- this is similar to release and termination: all prior writes must be in the pomset in order to release...
- terminology: "prior" means sequentially before, which is different from ≤, which means "ordered before".
- (c3) requires tautologies, which means that all variables must be initialized sequentially in order to get rid of  $Q_x$ .

Sanity check:


```
r := y; \text{ if } (r)\{x := 1\}; s := x
(Ry1) \bullet (1 = r \Rightarrow r \neq 0 \mid Wx1) \bullet (1 = r \Rightarrow (r \neq 0 \land Q_x) \mid Rx2)
(Ry0) \quad (0 = r \Rightarrow \tau(Q_x) \mid Rx2)
```


```
Suppose R_1 \subseteq E_1 \times E_1 and R_2 \subseteq E_2 \times E_2.
We say R extends R_1 and R_2 if R \supseteq (R_1 \cup R_2) and R \cap (E_1 \times E_1) = R_1 and R \cap (E_2 \times E_2) = R_2.

If P \in SKIP then E = \emptyset and \tau^D(\psi) \equiv \psi.

If P \in PAR(\mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)
  (P1) E = (E_1 \uplus E_2),
  (P2) \lambda = (\lambda_1 \cup \lambda_2),
  (P3a) if e \in E_1 then \kappa(e) \equiv \kappa_1(e),
  (P3b) if e \in E_2 then \kappa(e) \equiv \kappa_2(e),
  (P4) \tau^D(\psi) \equiv \tau_1^D(\psi),
  (P5) \checkmark \equiv \checkmark_1 \land \checkmark_2,
  (P6) \leq extends \leq_1 and \leq_2,
  (P7a) rf extends rf_1 and rf_2,
  (P7b) if d \in E_1, e \in E_2 and d \xrightarrow{rf} e then d \leq e,
  (P7c) if d \in E_1, e \in E_2 and e \xrightarrow{rf} d then e \leq d.

If P \in SEQ(\mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)
  let \checkmark_1(e) = \checkmark_1 if \lambda_2(e) is a release and \checkmark_1(e) = tt otherwise,
  let \kappa_2'(e) = \tau_1^{\downarrow e}(\kappa_2(e)), where \downarrow e = \{c \mid c < e\}
  (s1) E = (E_1 \cup E_2),
  (s2) \lambda = (\lambda_1 \cup \lambda_2),
  (s3a) if e \in E_1 \setminus E_2 then \kappa(e) \equiv \kappa_1(e),
  (s3b) if e \in E_2 \setminus E_1 then \kappa(e) \equiv \kappa_2'(e) \wedge \checkmark_1(e),
  (s3c) if e \in E_1 \cap E_2 then \kappa(e) \equiv (\kappa_1(e) \vee \kappa_2'(e)) \wedge \checkmark_1(e),
  (s4) \tau^D(\psi) \equiv \tau_1^D(\tau_2^D(\psi)),
  (s5) \checkmark \equiv \checkmark_1 \land \tau_1(\checkmark_2),
  (s6a) \leq extends \leq_1 and \leq_2,
  (s6b) if \lambda_1(d) delays \lambda_2(e) then d \leq e,
  (s7a) rf extends rf_1 and rf_2,
  (s7b) if d \in E_1, e \in E_2 and d \xrightarrow{rf} e then d \leq e.

If P \in IF(\phi, \mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)
  (I1) E = (E_1 \cup E_2),
  (I2) \lambda = (\lambda_1 \cup \lambda_2),
  (I3a) if e \in E_1 \setminus E_2 then \kappa(e) \equiv \phi \wedge \kappa_1(e),
  (I3b) if e \in E_2 \setminus E_1 then \kappa(e) \equiv \neg \phi \wedge \kappa_2(e),
  (I3c) if e \in E_1 \cap E_2 then \kappa(e) \equiv (\phi \wedge \kappa_1(e)) \vee (\neg \phi \wedge \kappa_2(e)),
  (I4) \tau^D(\psi) \equiv (\phi \wedge \tau_1^D(\psi)) \vee (\neg \phi \wedge \tau_2^D(\psi)),
  (I5) \checkmark \equiv (\phi \land \checkmark_1) \lor (\neg \phi \land \checkmark_2),
  (I6a) \leq extends \leq_1 and \leq_2,
  (I6b) \leq \subseteq (\leq_1 \cup \leq_2),
  (I7a) rf extends rf_1 and rf_2,
  (I7b) rf \subseteq (rf_1 \cup rf_2).

If P \in LET(r, M) then E = \emptyset and \tau^D(\psi) \equiv \psi[M/r].

If P \in READ(r, x, \mu) then (\exists v \in \mathcal{V})
  (R1) if d, e \in E then d = e,
  (R2) \lambda(e) = R^{\mu} x v,
  (R3) \kappa(e) \equiv Q_x,
  (R4a) if E \neq \emptyset and (E \cap D) \neq \emptyset then \tau^D(\psi) \equiv v = r \Rightarrow \psi,
  (R4b) if E \neq \emptyset and (E \cap D) = \emptyset then \tau^D(\psi) \equiv (v=r \lor x=r) \Rightarrow \psi,
  (R4c) if E = \emptyset then \tau^D(\psi) \equiv \psi,
  (R5a) if E \neq \emptyset or \mu \sqsubseteq rlx then \checkmark \equiv tt,
  (R5b) if E = \emptyset and \mu \sqsupseteq acq then \checkmark \equiv ff.

If P \in WRITE(x, M, \mu) then (\exists v \in \mathcal{V})
  (w1) if d, e \in E then d = e,
  (w2) \lambda(e) = W^{\mu} x v,
  (w3) \kappa(e) \equiv M = v,
  (w4a) if E \neq \emptyset then \tau^D(\psi) \equiv \psi[M/x][M=v/Q_x],
  (w4b) if E = \emptyset then \tau^D(\psi) \equiv \psi[M/x][ff/Q_x],
  (w5a) if E \neq \emptyset then \checkmark \equiv M = v,
  (w5b) if E = \emptyset then \checkmark \equiv ff.

If P \in FENCE(\mu) then
  (F1) if d, e \in E then d = e,
  (F2) \lambda(e) = \mathsf{F}^{\mu},
  (F3) \kappa(e) \equiv tt,
  (F4) \tau^D(\psi) \equiv \psi,
  (F5a) if E \neq \emptyset then \checkmark \equiv tt,
  (F5b) if E = \emptyset then \checkmark \equiv ff.

[r := M]_1 = LET(r, M)
[r := x^{\mu}]_1 = READ(r, x, \mu)
[x^{\mu} := M]_1 = WRITE(x, M, \mu)
[skip]_1 = SKIP
[S_1 \parallel S_2]_1 = PAR([S_1]_1, [S_2]_1)
[S_1; S_2]_1 = SEQ([S_1]_1, [S_2]_1)
```

Fig. 2. Semantics of Programs
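
The *extends* condition of Figure 2 (used in P6, P7a, s6a, s7a, I6a and I7a) can be read operationally. The following is a minimal sketch, assuming relations are represented as Python sets of pairs. The function name `extends` and the event names are ours and purely illustrative; they are not part of the formal development.

```python
def extends(R, R1, R2, E1, E2):
    """Return True if R extends R1 and R2 in the sense of Fig. 2.

    R must contain R1 and R2, and must add no pairs inside E1 x E1
    or inside E2 x E2; only pairs that cross between the sides may be new.
    """
    inside1 = {(d, e) for (d, e) in R if d in E1 and e in E1}
    inside2 = {(d, e) for (d, e) in R if d in E2 and e in E2}
    return (R1 | R2) <= R and inside1 == R1 and inside2 == R2


# Hypothetical events: a and b on the left, c on the right.
E1, E2 = {"a", "b"}, {"c"}
R1 = {("a", "a"), ("b", "b"), ("a", "b")}
R2 = {("c", "c")}

# Adding only cross edges is allowed.
ok = extends(R1 | R2 | {("a", "c"), ("b", "c")}, R1, R2, E1, E2)
# Adding an edge inside E1 is not.
bad = extends(R1 | R2 | {("b", "a")}, R1, R2, E1, E2)
```

The sketch checks only the *extends* condition; it does not check that the result is a partial order.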

 

```
where \tau(\psi) = (r \neq 0 \land \psi[1/x][ff/Q_x]) \lor (r=0 \land \psi)
```

We give the semantics of programs $[\![\cdot]\!]_1$ in Figure 2.

Let P range over pomsets, and  $\mathcal{P}$  over sets of pomsets.

The model has seven components, which can be daunting at first glance. To aid the reader, we use consistent numbering throughout. For example, item 6 always refers to the order relation.

The core of the model is a pomset, which includes a set of events (M1), a labeling (M2), and an order (M6). We also include the *reads-from* relation explicitly in the model (M7).

On top of this basic structure, M3–M5 add a layer of logic. For each pomset, M5 provides a termination condition. For each event in a pomset, M3 provides a precondition. For each set of events in a pomset, M4 provides a predicate transformer. Sequential dependency is calculated by  $\kappa_2'$  in the semantics of sequential composition.

Before discussing the details of the model, we note that the semantics satisfies the expected monoid laws and is closed with respect to *augmentation*. Augments include more order and stronger formulae; in examples, we typically consider pomsets that are augment-minimal. One intuitive reading of augment closure is that adding order can only cause preconditions to weaken.

```
LEMMA 3.5. [This is out of date.]
(a) \mathcal{P} = (\mathcal{P} \parallel SKIP) = (\mathcal{P}; SKIP) = (SKIP; \mathcal{P}).
(b) (\mathcal{P}_1 \parallel \mathcal{P}_2) \parallel \mathcal{P}_3 = \mathcal{P}_1 \parallel (\mathcal{P}_2 \parallel \mathcal{P}_3).
(c) (\mathcal{P}_1; \mathcal{P}_2); \mathcal{P}_3 = \mathcal{P}_1; (\mathcal{P}_2; \mathcal{P}_3).
(d) if (\phi) \{\mathcal{P}_1\} else \{\mathcal{P}_2\} = if (\phi) \{\mathcal{P}_1\}; if (\neg \phi) \{\mathcal{P}_2\} = if (\neg \phi) \{\mathcal{P}_2\}; if (\phi) \{\mathcal{P}_1\}.
(e) if (\phi) \{\mathcal{P}_1\} else \{\mathcal{P}_2\} = \mathcal{P}_1 if \phi is a tautology.
(f) if (\phi) \{ if (\psi) \{\mathcal{P}_1\} \} = if (\phi \land \psi) \{\mathcal{P}_1\}.
(g) if (\phi) \{\mathcal{P}_1; \mathcal{P}_3\} else \{\mathcal{P}_2; \mathcal{P}_3\} \supseteq if (\phi) \{\mathcal{P}_1\} else \{\mathcal{P}_2\}; \mathcal{P}_3.
(h) if (\phi) \{\mathcal{P}_1; \mathcal{P}_2\} else \{\mathcal{P}_1; \mathcal{P}_3\} \supseteq \mathcal{P}_1; if (\phi) \{\mathcal{P}_2\} else \{\mathcal{P}_3\}.
(i) if (\phi) \{\mathcal{P}_1\} else \{\mathcal{P}_1\} \supseteq \mathcal{P}_1.
```

PROOF. Straightforward calculation.

- (a) requires M5a for the termination condition in ($\mathcal{P}$; SKIP).
- (c) requires both conjunction closure (x2, for the termination condition) and disjunction closure (x3, for the predicate transformers themselves).
- (d) requires that s6b not impose order when $\kappa_1(d) \wedge \kappa_2(e)$ is unsatisfiable, which in turn requires that $\kappa$ calculates *weakest* preconditions, rather than simple preconditions (see §3.9).
- (e) requires M3a.

In §4.5, we refine the semantics to validate the reverse inclusions for (g), (h), and (i). $\Box$

Definition 3.6.  $P_2$  is an augment of  $P_1$  if

```
(1) E_2 = E_1,
(2) \lambda_2(e) = \lambda_1(e),
(3) \kappa_2(e) \equiv \kappa_1(e),
(4) \tau_2^D(\psi) \equiv \tau_1^D(\psi),
(5) \checkmark_2 \equiv \checkmark_1,
(6) \leq_2 \supseteq \leq_1,
(7) rf_2 \supseteq rf_1.
```

LEMMA 3.7. If  $P_1 \in [S]_1$  and  $P_2$  augments  $P_1$  then  $P_2 \in [S]_1$ .

PROOF. Induction on the definition of  $[\cdot]_1$ .

### 3.4 Pomsets

We first explain the core of the model, ignoring the logic (rules 3–5). We defer discussion of IF to §3.7. Reads, writes, and fences map to pomsets with at most one event; skip maps to the empty pomset. These definitions are straightforward. Note only that $[x := 1]_1$ can write any value v; the fact that v must be 1 is captured in the logic (see §3.5).

The structural rules combine pomsets: Parallel composition is disjoint union, inheriting labeling, order and rf from the two sides. Any rf edges added between the two sides must also be added to the order (P7b and P7c). Sequential composition is similar, with two changes: s1 does not require disjointness (see §3.5), and s6b may require order (see example PUB, below).

Note that reads-from implies order.


LEMMA 3.8. For any P in the range of $[\![\cdot]\!]_1$, $d \xrightarrow{rf} e$ implies $d \le e$.

Proof. Induction on the definition of  $[\![\cdot]\!]_1$ , using P7b, P7c, and S7b.

In top-level pomsets, every read must have a matching write in rf (c2). Together with M7a and M7b, the lemma guarantees that reads are *fulfilled* at top-level, as in [Jagadeesan et al. 2020, §2.7].<sup>1</sup>

From Definition 3.1, recall that *a delays b* if $a \bowtie_{co} b$ or $a \bowtie_{sync} b$ or $a \bowtie_{sc} b$. s6b guarantees that sequential order is enforced between conflicting accesses of the same location ($\bowtie_{co}$), into a release and out of an acquire ($\bowtie_{sync}$), and between SC accesses ($\bowtie_{sc}$). Combined with the fulfillment requirements (M7a, M7b and Lemma 3.8), these ensure coherence, publication, subscription and other idioms. For example, consider the following:<sup>2</sup>

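$$\begin{array}{c} x := 0;\ x := 1;\ y^{\text{rel}} := 1 \parallel r := y^{\text{acq}};\ s := x \\ \hline (\mathsf{W}x0) \rightarrow (\mathsf{W}x1) \rightarrow (\mathsf{W}^{\text{rel}}y1) \rightarrow (\mathsf{R}^{\text{acq}}y1) \rightarrow (\mathsf{R}x0) \rightarrow (\mathsf{W}x1) \end{array} \tag{PUB}$$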

The execution is disallowed due to the cycle. All of the order shown is required at top-level: The intra-thread order comes from s6b:  $(Wx0) \rightarrow (Wx1)$  is required by  $\bowtie_{co}$ .  $(Wx1) \rightarrow (W^{rel}y1)$  and  $(R^{acq}y1) \rightarrow (Rx0)$  are required by  $\bowtie_{sync}$ . The cross-thread order is required by fulfillment: c2 requires that all top-level reads are in the image of  $\stackrel{rf}{\longrightarrow}$ . M7a ensures that  $(W^{rel}y1) \stackrel{rf}{\longrightarrow} (R^{acq}y1)$ , and Lemma 3.8 subsequently ensures that  $(W^{rel}y1) \leq (R^{acq}y1)$ . The *antidependency*  $(Rx0) \rightarrow (Wx1)$  is required by M7b. (Alternatively, we could have  $(Wx1) \rightarrow (Wx0)$ , again resulting in a cycle.)
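
The cycle argument for PUB can be replayed mechanically. The following is a minimal sketch, assuming the order is given as a set of directed edges between event names; the helper `has_cycle` and the string names for events are ours, chosen only for illustration.

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph given as a set of (src, dst) pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in graph}

    def visit(node):
        # Depth-first search; meeting a GREY node means we closed a cycle.
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in graph)


pub_edges = {
    ("Wx0", "Wx1"),          # s6b, conflicting writes to x
    ("Wx1", "Wrel_y1"),      # s6b, into a release
    ("Wrel_y1", "Racq_y1"),  # reads-from, then Lemma 3.8
    ("Racq_y1", "Rx0"),      # s6b, out of an acquire
    ("Rx0", "Wx1"),          # antidependency, M7b
}

pub_cyclic = has_cycle(pub_edges)
without_antidependency = has_cycle(pub_edges - {("Rx0", "Wx1")})
```

Dropping the antidependency edge breaks the cycle, which is why M7b is essential to ruling out this execution.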

The semantics gives the expected results for store buffering and load buffering, as well as litmus tests involving fences and SC access. The model of coherence is weaker than C11, in order to support common subexpression elimination, and stronger than Java, in order to support local reasoning about data races. See [Jagadeesan et al. 2020, §3.1] for a discussion.

In the structural rules SEQ and IF, we say that  $d \in E_1$  and  $e \in E_2$  coalesce if d = e.

s1 allows *mumbling* [Brookes 1996] by coalescing events. For example, $[x := 1; x := 1]_1$ includes the singleton pomset $(\mathsf{W}x1)$. From this it is easy to see that $[x := 1; x := 1]_1 \supseteq [x := 1]_1$ is a valid refinement. It is equally clear that $[x := 1]_1 \supseteq [x := 1; x := 1]_1$ is not a valid refinement, since the latter includes a two-element pomset, but the former does not.<sup>3</sup>

### 3.5 Termination

In top-level pomsets, c5 requires that $\checkmark$ is a tautology, capturing termination. Terminated pomsets are often called *complete*, whereas nonterminated pomsets are *incomplete*.

In §A.3, it is possible for rf to contradict  $\leq$ . In this case, we use a dotted arrow for rf:  $d \mapsto e$  indicates that  $e \leq d$ . These are distinguished by the context:  $[-] \parallel r := x$ ; x := 2; s := x; if  $(r=s)\{z := 1\}$ .

<sup>&</sup>lt;sup>1</sup>The basic model would be the same if we move rf from the model itself to be existentially quantified in the definition of top-level pomset, along with M7a and M7b. This was the approach of Jagadeesan et al. We include rf explicitly for use in A.3, where we introduce a variant semantics  $\|\cdot\|_2^f$  for which Lemma 3.8 fails.

<sup>&</sup>lt;sup>2</sup>We use different colors for arrows representing order:

<sup>•</sup>  $d \rightarrow e$  arises from control/data/address dependency (s3, definition of  $\kappa'_2(d)$ ),

<sup>•</sup>  $d \rightarrow e$  arises from  $\bowtie_{sync}$  or  $\bowtie_{sc}$  (s6b),

<sup>•</sup>  $d \rightarrow e$  arises from  $\bowtie_{co}$  (s6b),

<sup>•</sup>  $d \rightarrow e$  arises from reads-from (M7a),

<sup>•</sup>  $d \rightarrow e$  arises from blocking (M7b).

 Ignoring predicate transformers, the structural rules, P5 and S5, take  $\checkmark$  to be  $\checkmark_1 \land \checkmark_2$ . This is as expected: the program terminates if both subprograms terminate.

The interesting rules are READ, FENCE, and WRITE.

In *READ*, there is no restriction on $\checkmark$ for relaxed reads. From this, it is easy to see that $[r := x]_1 \supseteq [skip]_1$ is a valid refinement (where the default mode is rlx).

In *FENCE*, instead, F5 ensures that all fences are included at top-level. This also ensures  $\llbracket \mathsf{F}^{\mu} \rrbracket_1 \not\supseteq \llbracket \mathsf{if}(M) \{ \mathsf{F}^{\mu} \} \rrbracket_1$ , since  $\llbracket \mathsf{if}(M) \{ \mathsf{F}^{\mu} \} \rrbracket_1$  includes the empty set with termination condition  $\neg M$ , but  $\llbracket \mathsf{F}^{\mu} \rrbracket_1$  can only include the empty set with termination condition ff.

In *WRITE*, w5b is similar. In addition, w5a ensures that top-level pomsets do not include bogus writes. Suppose $P \in [x:=1]_1$. As we noted above, P can include $(1=v \mid \mathsf{W}xv)$, for any value v. At top-level, however, w5a requires that $\checkmark$ implies 1=v. In this case, M3a would filter the pomset, since preconditions must be satisfiable. However, unsatisfiable writes can become satisfiable via merging:

$$\begin{array}{ccc} x := 1 & x := 2 & \text{if}(M)\{x := 3\} \\ \hline (\mathsf{W}x1) & (2=3 \mid \mathsf{W}x3) & (M \mid \mathsf{W}x3) \end{array}$$

By merging, the semantics allows the following:

$$\begin{array}{c} x := 1;\ x := 2;\ \text{if}(M)\{x := 3\} \\ \hline (\mathsf{W}x1) \rightarrow (M \mid \mathsf{W}x3) \end{array}$$

This pomset is incomplete, however, since $\checkmark \equiv 2=3$.

### 3.6 Data Dependencies, Preconditions, and Predicate Transformers

In top-level pomsets, c3 requires that every precondition  $\kappa(e)$  is a tautology.

Preconditions are discharged during sequential composition by applying predicate transformers  $\tau_1$  from the left to preconditions  $\kappa_2(e)$  on the right. The specific rules are s3b and s3c, which use the transformed predicate  $\kappa_2'(e) = \tau_1^{\downarrow e}(\kappa_2(e))$ , where  $\downarrow e = \{c \mid c < e\}$  is the set of events that precede e in causal order. We call  $\downarrow e$  the *dependent set* for e. Then  $E \setminus (\downarrow e)$  is the *independent set*.
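
To make the dependent and independent sets concrete, the following is a minimal sketch, assuming a strict order represented as a transitively closed set of pairs. The function name `dependent_set` and the event names are hypothetical and serve only as illustration.

```python
def dependent_set(order, e):
    """Return {c | c < e} for a strict order given as a set of (c, e) pairs.

    The order is assumed to be transitively closed already.
    """
    return {c for (c, target) in order if target == e}


# Hypothetical events: c and d come from the left pomset, e from the right.
E1 = {"c", "d"}
order = {("d", "e")}  # d < e; c is unordered with respect to e

down_e = dependent_set(order, "e")  # dependent set for e
independent = E1 - down_e           # independent set for e
```

In this example the precondition of e would be transformed by the family member indexed by `{"d"}`, so the read or write at c contributes only its independent transformer.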

Before looking at the details, it is useful to have a high-level view of how nontrivial preconditions and predicate transformers are introduced. (We discuss address dependencies in §4.3.)

Preconditions are introduced in:

- (s3) for release actions,
- (I3) for control dependencies,
- (w3) for data dependencies on writes.

Predicate transformers are introduced in:

- (R4a) for reads in the dependent set,
- (R4b) for reads in the independent set,
- (w4) for writes.

The rules track dependencies. We discuss data dependencies (w3) here and control dependencies (I3) in §3.7. Unless otherwise noted, we assume pomsets are *complete* and *augment-minimal*. We do not discuss s3 further. It simply ensures that all writes are present before a release, even for incomplete pomsets (see §3.5).

A simple example of a data dependency is a pomset  $P \in [r := x; y := r]$ . If P is complete, it must have two events. Then SEQ requires that there are  $P_1 \in [r := x]$  and  $P_2 \in [y := r]$  of the form:

$$\begin{array}{cc} r := x & y := r \\ \hline \boxed{(x=r \lor v=r) \Rightarrow \psi} \quad (\mathsf{R}xv)^{d} \rightarrow \boxed{v=r \Rightarrow \psi} & \boxed{\psi[r/y]} \quad (r=w \mid \mathsf{W}yw)^{e} \rightarrow \boxed{\psi[r/y]} \end{array} \tag{\ddagger}$$

First we consider the case that v = w. For example if v = w = 1, we have:

$$\boxed{(x=r \lor 1=r) \Rightarrow \psi} \quad (\mathsf{R}x1)^{d} \rightarrow \boxed{1=r \Rightarrow \psi} \qquad\qquad \boxed{\psi[r/y]} \quad (r=1 \mid \mathsf{W}y1)^{e} \rightarrow \boxed{\psi[r/y]}$$


For the read, the dependent transformer $\tau_1^{\{d\}}$ is $1=r\Rightarrow \psi$; the independent transformer $\tau_1^{\emptyset}$ is $(x=r\vee 1=r)\Rightarrow \psi$. These are determined by R4a and R4b, respectively. For the write, both $\tau_2^{\{e\}}$ and $\tau_2^{\emptyset}$ are $\psi[r/y]$, as determined by w4. Combining these into a single pomset, we have:

$$\begin{array}{c} r := x;\ y := r \\ \hline \boxed{(x=r \lor 1=r) \Rightarrow \psi[r/y]} \quad (\mathsf{R}x1)^{d} \rightarrow \boxed{1=r \Rightarrow \psi[r/y]} \quad (\phi \mid \mathsf{W}y1)^{e} \end{array}$$

By s4, predicate transformers are determined by composition; thus $\tau^D(\psi)$ is $\tau^D_1(\tau^D_2(\psi))$. Since the transformer does not depend on whether the write is included, we do not draw dependencies for the write in the diagram.

Turning to the precondition  $\phi$  on the write, recall that in order for e to participate in a top-level pomset, the precondition  $\phi$  must be a tautology at top-level. There are two possibilities.

- If  $d \le e$  then we apply the dependent transformer and  $\phi = (1=r \Rightarrow r=1)$ , a tautology.
- If  $d \not\leq e$  then we apply the independent transformer and  $\phi = ((x=r \lor 1=r) \Rightarrow r=1)$ . Under the assumption that r is bound, this is logically equivalent to (x=1). (We make this more precise in §4.1.)

Eliding transformers, the two outcomes are:

$$\begin{array}{cc} r := x;\ y := r & r := x;\ y := r \\ \hline (\mathsf{R}x1)^{d} \rightarrow (\mathsf{W}y1)^{e} & (\mathsf{R}x1)^{d} \quad (x=1 \mid \mathsf{W}y1)^{e} \end{array}$$

The independent case on the right can only participate in a top-level pomset if the precondition (x=1) is discharged. To do so, we must prepend a pomset  $P_0$  that writes 1 to x:

$$\begin{array}{cc} x := 1 & x := 1;\ r := x;\ y := r \\ \hline \boxed{\psi[1/x]} \quad (1=1 \mid \mathsf{W}x1)^{c} \rightarrow \boxed{\psi[1/x]} & (1=1 \mid \mathsf{W}x1)^{c} \quad (\mathsf{R}x1)^{d} \quad (1=1 \mid \mathsf{W}y1)^{e} \end{array}$$

Here we apply the predicate transformer  $\tau_0^{\emptyset}$  to (x=1), resulting in the tautology (1=1).

Now suppose that  $v \neq w$  in (‡). Again there are two possibilities, where we take v = 0 and w = 1:

$$\begin{array}{cc} r := x;\ y := r & r := x;\ y := r \\ \hline (\mathsf{R}x0)^{d} \rightarrow (0=r \Rightarrow r=1 \mid \mathsf{W}y1)^{e} & (\mathsf{R}x0)^{d} \quad ((x=r \lor 0=r) \Rightarrow r=1 \mid \mathsf{W}y1)^{e} \end{array}$$

Assuming that r is bound, both preconditions on e are unsatisfiable.
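
The four preconditions computed in this subsection can be checked by brute force. The sketch below is illustrative only: it assumes a small finite value domain `V`, and it reads "r is bound" as universal quantification over r. Both choices, and all helper names, are assumptions of the sketch rather than part of the semantics.

```python
V = range(4)  # small value domain, an assumption of this sketch


def implies(a, b):
    return (not a) or b


def closed(pred, x):
    """Treat r as bound: the formula must hold for every value of r."""
    return all(pred(r, x) for r in V)


# Case v = w = 1: dependent and independent preconditions on the write.
dep_11 = lambda r, x: implies(1 == r, r == 1)
ind_11 = lambda r, x: implies(x == r or 1 == r, r == 1)

# Case v = 0, w = 1.
dep_01 = lambda r, x: implies(0 == r, r == 1)
ind_01 = lambda r, x: implies(x == r or 0 == r, r == 1)

dep_11_is_tautology = all(closed(dep_11, x) for x in V)
ind_11_is_x_equals_1 = all(closed(ind_11, x) == (x == 1) for x in V)
dep_01_is_satisfiable = any(closed(dep_01, x) for x in V)
ind_01_is_satisfiable = any(closed(ind_01, x) for x in V)
```

Over this domain, the dependent precondition for v = w = 1 is a tautology, the independent one coincides with x = 1, and both preconditions for v = 0, w = 1 fail at r = 0.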

If a write is independent of a read, then clearly no order is imposed between them. For example, the precondition of e is a tautology in:

$$\begin{array}{c} r := x;\ y := 1 \\ \hline \boxed{(x=r \lor 0=r) \Rightarrow \psi[1/y]} \quad (\mathsf{R}x0)^{d} \quad ((x=r \lor 0=r) \Rightarrow 1=1 \mid \mathsf{W}y1)^{e} \end{array}$$

### 3.7 Control Dependencies

In $IF(\phi, \mathcal{P}_1, \mathcal{P}_2)$, the predicate transformer (I4) is $(\phi \wedge \tau_1^D(\psi)) \vee (\neg \phi \wedge \tau_2^D(\psi))$, which is the disjunctive equivalent of Dijkstra's conjunctive formulation: $(\phi \Rightarrow \tau_1^D(\psi)) \wedge (\neg \phi \Rightarrow \tau_2^D(\psi))$.
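
The equivalence of the disjunctive and conjunctive formulations is a propositional fact, and can be confirmed with a truth table. The following sketch treats $\phi$, $\tau_1^D(\psi)$ and $\tau_2^D(\psi)$ as opaque booleans; the function names are ours.

```python
from itertools import product


def disjunctive(phi, t1, t2):
    """(phi and t1) or (not phi and t2), the form used in I4."""
    return (phi and t1) or ((not phi) and t2)


def conjunctive(phi, t1, t2):
    """(phi => t1) and (not phi => t2), Dijkstra's form."""
    return ((not phi) or t1) and (phi or t2)


forms_agree = all(
    disjunctive(p, a, b) == conjunctive(p, a, b)
    for p, a, b in product([False, True], repeat=3)
)
```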

This semantics validates dead code elimination: if $M \neq 0$ is a tautology then $[\![\text{if}(M)\{S_1\} \text{ else } \{S_2\}]\!] \supseteq [\![S_1]\!]$. The reverse inclusion does not hold.

For events from $E_1$, I3a requires $\phi \wedge \kappa_1(e)$. For events from $E_2$, I3b requires $\neg \phi \wedge \kappa_2(e)$. For coalescing events in $E_1 \cap E_2$, I3c requires $(\phi \wedge \kappa_1(e)) \vee (\neg \phi \wedge \kappa_2(e))$. This semantics allows common code to be lifted out of a conditional, validating the transformation $[\![\text{if}(M)\{S\} \text{ else } \{S\}]\!] = [\![S]\!]$. The use of *extends* in I6a and I7a ensures that no new order is introduced between events in $E_1 \cap E_2$ when coalescing; see §A.3.

By allowing events to coalesce, I3c ensures that control dependencies are calculated semantically. For example, consider $P \in [\![\text{if}(r=1)\{y:=r\} \text{ else } \{y:=1\}]\!]$, which is built from $P_1 \in [\![y:=r]\!]$ and $P_2 \in [\![y:=1]\!]$ such as:

$$\begin{array}{ccc} y := r & y := 1 & \text{if}(r=1)\{y := r\} \text{ else } \{y := 1\} \\ \hline (r=1 \mid \mathsf{W}y1)^{e} & (1=1 \mid \mathsf{W}y1)^{e} & ((r=1 \Rightarrow r=1) \land (r \neq 1 \Rightarrow 1=1) \mid \mathsf{W}y1)^{e} \end{array}$$

Here, the precondition in the combined pomset is a tautology, independent of r.

Control dependencies are eliminated in the same way as data dependencies. For example:

$$\begin{array}{cc} r := x & \text{if}(r=1)\{y := 1\} \\ \hline \boxed{(x=r \lor v=r) \Rightarrow \psi} \quad (\mathsf{R}xv)^{d} \rightarrow \boxed{v=r \Rightarrow \psi} & \boxed{r=1 \Rightarrow \psi[1/y]} \quad (r=1 \mid \mathsf{W}y1)^{e} \rightarrow \boxed{\psi[1/y]} \end{array}$$

Reasoning as we did for (‡) in §3.6, there are two possibilities:

$$\begin{array}{cc} r := x;\ \text{if}(r=1)\{y := 1\} & r := x;\ \text{if}(r=1)\{y := 1\} \\ \hline (\mathsf{R}x1)^{d} \rightarrow (\mathsf{W}y1)^{e} & (\mathsf{R}x1)^{d} \quad (x=1 \mid \mathsf{W}y1)^{e} \end{array}$$

As another example, consider JMM causality test case 1 [Pugh 2004]:

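$$\begin{array}{c} x := 0;\ (r := x;\ \text{if}(r \ge 0)\{y := 1\} \parallel x := y) \\ \hline (\mathsf{W}x0) \rightarrow (\mathsf{R}x1) \quad (\phi \mid \mathsf{W}y1) \quad (\mathsf{W}x1) \end{array}$$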

The precondition  $\phi$  is  $((1=r \lor x=r) \Rightarrow r \ge 0) [0/x]$  which is  $((1=r \lor 0=r) \Rightarrow r \ge 0)$  which is a tautology.
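
The role of the initializing write in this test case can be seen by evaluating the precondition before and after the substitution $[0/x]$. The sketch below is a brute-force check over a small integer range; the range and the helper names are assumptions made only for illustration.

```python
VALUES = range(-8, 9)  # small integer range, an assumption of this sketch


def implies(a, b):
    return (not a) or b


def precondition(r, x):
    """(1 = r or x = r) => r >= 0, the independent-case precondition."""
    return implies(1 == r or x == r, r >= 0)


# After prepending x := 0, the transformer substitutes 0 for x.
tautology_after_init = all(precondition(r, 0) for r in VALUES)

# Without the initializing write, x is unconstrained.
tautology_without_init = all(
    precondition(r, x) for r in VALUES for x in VALUES
)
```

Without the substitution, the formula fails whenever x and r are equal and negative, so the write of y cannot be made independent of the read in that case.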

### 3.8 Reordering Transformations

The semantics validates many peephole optimizations. Most apply only to relaxed access.

$$\begin{aligned}
[\![r := x; s := y]\!]_1 &= [\![s := y; r := x]\!]_1 && \text{if } r \neq s \\
[\![x := M; y := N]\!]_1 &= [\![y := N; x := M]\!]_1 && \text{if } x \neq y \\
[\![x := M; s := y]\!]_1 &= [\![s := y; x := M]\!]_1 && \text{if } x \neq y \text{ and } s \notin \text{id}(M)
\end{aligned}$$

Here id(S) is the set of locations and registers that occur in S. Using augmentation closure, the semantics also validates roach-motel reorderings [Sevčík 2008]. For example, on read/write pairs:

$$[\![x^{\mu} := M; s := y]\!]_{1} \supseteq [\![s := y; x^{\mu} := M]\!]_{1} \qquad \text{if } x \neq y \text{ and } s \notin \text{id}(M)$$
$$[\![x := M; s := y^{\mu}]\!]_{1} \supseteq [\![s := y^{\mu}; x := M]\!]_{1} \qquad \text{if } x \neq y \text{ and } s \notin \text{id}(M)$$

### 3.9 Conditional and Coherence

[This is out of date.]

LEMMA 3.9.

$$\begin{aligned}
\text{if}(\phi)\{\mathcal{P}_1\} \text{ else } \{\mathcal{P}_2\} &\supseteq \text{if}(\phi)\{\mathcal{P}_1\};\ \text{if}(\neg\phi)\{\mathcal{P}_2\} \\
\text{if}(\phi)\{\mathcal{P}_1\} \text{ else } \{\mathcal{P}_2\} &\supseteq \text{if}(\neg\phi)\{\mathcal{P}_2\};\ \text{if}(\phi)\{\mathcal{P}_1\}
\end{aligned}$$

Reverse direction does not hold, due to s6b.

(s6b) if $\lambda_1(d)$ delays $\lambda_2(e)$ then $d \le e$.

An alternate phrasing might be attractive:

(s6b') if $\lambda_1(d)$ delays $\lambda_2(e)$ and $\kappa(d) \wedge \kappa(e)$ is satisfiable then $d \leq e$.

But s6b' is incompatible with the ability to strengthen preconditions using augment closure. Consider the following.

$$\begin{array}{cccc} \text{if}(r)\{x:=2\} & x:=1 & x:=2 & \text{if}(!r)\{x:=1\} \\ \hline (r\neq 0 \mid \mathsf{W}x2) & (\mathsf{W}x1) & (\mathsf{W}x2) & (r=0 \mid \mathsf{W}x1) \end{array}$$


Augmenting the middle preconditions and then using sequential composition, we have:

$$\begin{array}{ccc} \text{if}(r)\{x:=2\} & x:=1;\ x:=2 & \text{if}(!r)\{x:=1\} \\ \hline (r\neq 0\mid \mathsf{W}x2) & (r\neq 0\mid \mathsf{W}x1) \quad (r=0\mid \mathsf{W}x2) & (r=0\mid \mathsf{W}x1) \end{array}$$

Note that s6b' does not require any order between the two writes of the middle pomset. Merging left and right, we have:

$$\begin{array}{c} \text{if}(r)\{x := 2\};\ x := 1;\ x := 2;\ \text{if}(!r)\{x := 1\} \\ \hline (\mathsf{W}x2) \rightarrow (\mathsf{W}x1) \end{array}$$

As shown by the following single-threaded code, allowing this outcome would violate DRF-SC.

$$y := 1; r := y; if(r)\{x := 2\}; x := 1; x := 2; if(!r)\{x := 1\}$$

$$(Wy1) \rightarrow (Ry1) \qquad (Wx2) \rightarrow (Wx1)$$

To validate the reverse direction of Lemma 3.9, it may be tempting to define the semantics using *weakest* preconditions, rather than preconditions. But in this case the notion of program refinement could not be simple set inclusion—for example, in general we would *not* have  $\mathcal{P}_1 \supseteq if(\phi)\{\mathcal{P}_1\}$ .

As a result, we leave Lemma 3.9 as an inequation. The equational form may be valid using some notion of *observational* or *contextual* refinement, but we do not pursue that here.

### 3.10 Associativity and Skolemization

 The predicate transformers we have chosen for R4a and R4b are different from the ones used traditionally, which are written using substitution PWP. Attempting to write R4a in this style we would have:

(R4a') if $(E \cap D) \neq \emptyset$ then $\tau^D(\psi) \equiv \psi[v/r]$,

Recall that R4c says that  $\psi$  must be independent of r in order to appear in a top-level pomset: if  $E=\emptyset$  then  $\tau^D(\psi)\equiv \psi$ . This choice for R4c is forced by Definition 3.3, which states that the predicate transformer for a small subset of E must imply the transformer for a larger subset.

Sadly, this definition fails associativity.

Consider the following, eliding transformers for the writes ("!" represents logical negation):

Coalescing the writes and associating to the right, we have the following, since  $(r=0 \lor r\neq 0) \equiv tt$ :

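$$\begin{array}{ccc} r := y & x := !r;\ x := !!r;\ x := 0 & r := y;\ (x := !r;\ x := !!r;\ x := 0) \\ \hline (\mathsf{R}y1) & (\mathsf{W}x1) \rightarrow (\mathsf{W}x0) & (\mathsf{R}y1) \quad (\mathsf{W}x1) \rightarrow (\mathsf{W}x0) \end{array}$$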

The precondition of (Wx1) is a tautology. Associating to the left and then coalescing, instead:

$$\begin{array}{ccc} r := y;\ x := !r & x := !!r;\ x := 0 & (r := y;\ x := !r);\ (x := !!r;\ x := 0) \\ \hline (\mathsf{R}y1) \quad ((y=r \lor 1=r) \Rightarrow r=0 \mid \mathsf{W}x1) & (r \neq 0 \mid \mathsf{W}x1) \rightarrow (\mathsf{W}x0) & (\mathsf{R}y1) \quad (\phi \mid \mathsf{W}x1) \rightarrow (\mathsf{W}x0) \end{array}$$

where  $\phi = ((y=r \lor 1=r) \Rightarrow r=0) \lor (r\neq 0)$ . The precondition  $\phi$  is not a tautology. In a top-level pomset, this forces dependency order from (Ry1) to (Wx1).

Our solution is to Skolemize, replacing uses of $\psi[v/r]$ by $(r=v) \Rightarrow \psi$, for uniquely chosen r. The proof of associativity requires that predicate transformers distribute through disjunction (Definition 3.2). The attempt to define predicate transformers using substitution fails for R4c because the predicate transformer $\tau(\psi) = (\forall r)\psi$ does not distribute through disjunction: $\tau(\psi_1 \vee \psi_2) = (\forall r)(\psi_1 \vee \psi_2) \neq ((\forall r)(\psi_1)) \vee ((\forall r)(\psi_2)) = \tau(\psi_1) \vee \tau(\psi_2)$. Since $\tau(\psi) = (\forall r)\psi$ does not distribute through disjunction, we use $\tau(\psi) = \psi$ instead (which trivially distributes through disjunction). This change means we cannot use substitution, since $\psi$ does not imply $\psi[v/r]$. Fortunately, Skolemizing solves this problem, since $\psi$ implies $(r=v) \Rightarrow \psi$.
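The three claims in this paragraph can be checked by brute force over a small value domain. The following is only an illustrative sketch: the domain `VALUES`, the helper names, and the choice $\psi_1 = (r=0)$, $\psi_2 = (r\neq 0)$ are assumptions made for the example and are not part of the formal development.

```python
from itertools import product

VALUES = range(3)  # assumed small value domain, for illustration only

def valid(phi, names=("r",)):
    """True if phi holds for every assignment of `names` over VALUES."""
    return all(phi(dict(zip(names, vs)))
               for vs in product(VALUES, repeat=len(names)))

def forall_r(psi):
    """tau(psi) = (forall r) psi."""
    return lambda env: all(psi({**env, "r": v}) for v in VALUES)

def skolem(v, psi):
    """Skolemized form: (r = v) => psi."""
    return lambda env: (env["r"] != v) or psi(env)

def subst(v, psi):
    """Substitution: psi[v/r]."""
    return lambda env: psi({**env, "r": v})

psi1 = lambda env: env["r"] == 0
psi2 = lambda env: env["r"] != 0
disj = lambda env: psi1(env) or psi2(env)

# (forall r)(psi1 or psi2) is valid ...
lhs = valid(forall_r(disj))
# ... but ((forall r) psi1) or ((forall r) psi2) is not.
rhs = valid(lambda env: forall_r(psi1)(env) or forall_r(psi2)(env))

# psi implies (r = v) => psi for every v tried here.
implies_skolem = all(
    valid(lambda env, v=v, p=p: (not p(env)) or skolem(v, p)(env))
    for v in VALUES for p in (psi1, psi2)
)

# psi does not, in general, imply psi[v/r]: take psi1 and v = 1.
implies_subst = valid(lambda env: (not psi1(env)) or subst(1, psi1)(env))

print(lhs, rhs, implies_skolem, implies_subst)  # True False True False
```

The identity transformer is used for the empty case precisely because it sits on the right side of both checks: it distributes through disjunction, and it implies the Skolemized form.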

### 3.11 Comparison with Weakest Preconditions

We compare traditional transformers to the dependent-case transformers of Figure 2.

Because of augment closure, we are not interested in isolating the *weakest* precondition. Thus we think of transformers as Hoare triples. In addition, all programs in our language are strongly normalizing, so we need not distinguish strong and weak correctness. In this setting, the Hoare triple  $\{\phi\}$  S  $\{\psi\}$  holds exactly when  $\phi \Rightarrow wp_S(\psi)$ .

Hoare triples do not distinguish thread-local variables from shared variables. Thus, the assignment rule applies to all types of storage. The rules can be written as on the left below:

$$\begin{aligned} wp_{x:=M}(\psi) &= \psi[M/x] \\ wp_{r:=M}(\psi) &= \psi[M/r] \\ wp_{r:=x}(\psi) &= x = r \Rightarrow \psi \end{aligned} \qquad\qquad \begin{aligned} \tau_{x:=M}(\psi) &= \psi[M/x] \\ \tau_{r:=M}(\psi) &= \psi[M/r] \\ \tau_{r:=x}(\psi) &= v = r \Rightarrow \psi \quad \text{where } \lambda(e) = \mathsf{R} x v \end{aligned}$$

Here we have chosen an alternative formulation for the read rule, which is equivalent to the more traditional  $\psi[x/r]$ , as long as registers are assigned at most once in a program. Our predicate transformers for the dependent case are shown on the right above. Only the read rule differs from the traditional one.

For programs where every register is bound and every read is fulfilled, our dependent transformers are the same as the traditional ones. Thus, when comparing to weakest preconditions, let us only consider totally-ordered executions of our semantics where every read could be fulfilled by prepending some writes. For example, we ignore pomsets of x := 2; x := x that read 1 for x.

For example, let  $S_i$  be defined:

$$S_1 = s := x; x := s + r \qquad\qquad S_2 = x := t; S_1 \qquad\qquad S_3 = t := 2; r := 5; S_2$$

The following pomset appears in the semantics of  $S_2$ . A pomset for  $S_3$  can be derived by substituting [2/t, 5/r]. A pomset for  $S_1$  can be derived by eliminating the initial write.

$$x := t \; ; \; s := x \; ; \; x := s + r$$

$$(t = 2 \mid Wx2) \longrightarrow (Rx2) \longrightarrow (2 = s \Rightarrow (s + r) = 7 \mid Wx7) \cdots > (2 = s \Rightarrow \psi[s + r/x])$$

The predicate transformers are:

$$\begin{split} wp_{S_1}(\psi) &= x = s \Rightarrow \psi[s + r/x] \\ wp_{S_2}(\psi) &= t = s \Rightarrow \psi[s + r/x] \\ wp_{S_3}(\psi) &= 2 = s \Rightarrow \psi[s + r/x] \\ \tau_{S_2}(\psi) &= 2 = s \Rightarrow \psi[s + r/x] \\ \tau_{S_3}(\psi) &= 2 = s \Rightarrow \psi[s + 5/x] \\ \end{split}$$
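The traditional rules on the left can be exercised mechanically. The sketch below encodes predicates as Python functions over states and composes the $wp$ rules for $S_1$, $S_2$ and $S_3$; the encoding, the helper names, the finite domain, and the postcondition $x = 7$ are all assumptions chosen for illustration.

```python
from itertools import product

VALUES = range(8)  # assumed finite domain, for illustration only
NAMES = ("x", "s", "r", "t")

def assign(var, expr):
    """wp for var := expr, i.e. psi[expr/var], realised by updating the state."""
    return lambda psi: (lambda st: psi({**st, var: expr(st)}))

def read(reg, loc):
    """Alternative read rule from the text: (loc = reg) => psi."""
    return lambda psi: (lambda st: st[loc] != st[reg] or psi(st))

def seq(*wps):
    """wp of S1; ...; Sn: apply the component transformers right to left."""
    def wp(psi):
        for w in reversed(wps):
            psi = w(psi)
        return psi
    return wp

def valid(phi):
    return all(phi(dict(zip(NAMES, vs)))
               for vs in product(VALUES, repeat=len(NAMES)))

S1 = seq(read("s", "x"), assign("x", lambda st: st["s"] + st["r"]))
S2 = seq(assign("x", lambda st: st["t"]), S1)
S3 = seq(assign("t", lambda st: 2), assign("r", lambda st: 5), S2)

post = lambda st: st["x"] == 7  # hypothetical postcondition psi

# For S3 the precondition reduces to (2 = s) => (s + 5 = 7), which is valid.
s3_valid = valid(S3(post))
# For S2 it is (t = s) => (s + r = 7), which depends on t and r.
s2_valid = valid(S2(post))
print(s3_valid, s2_valid)  # True False
```

Every register is assigned at most once in these programs, which is the condition under which the text treats the alternative read rule as equivalent to $\psi[x/r]$.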

#### 3.12 Substitutions

In *READ*, it is also possible to collapse *x* and *r* via substitution:

(R4a') if $(E \cap D) \neq \emptyset$ then $\tau^D(\psi) \equiv v = r \Rightarrow \psi[r/x]$,

(R4b') if $E \neq \emptyset$ and $(E \cap D) = \emptyset$ then $\tau^D(\psi) \equiv (v = r \vee x = r) \Rightarrow \psi[r/x]$,

(R4c') if $E = \emptyset$ then $\tau^D(\psi) \equiv \psi[r/x]$,

Perhaps surprisingly, this semantics is incomparable with that of Figure 2. Consider the following:

$$\text{if}(r \land s \text{ even})\{y := 1\}; \text{ if}(r \land s)\{z := 1\}$$

$$(r \land s \text{ even} \mid Wy1) \qquad (r \land s \mid Wz1)$$


Prepending (s:=x), we get the same result regardless of whether we substitute [s/x], since x does not occur in either precondition. Here we show the independent case:

$$s:=x; \text{ if } (r \land s \text{ even})\{y:=1\}; \text{ if } (r \land s)\{z:=1\}$$

$$((2=s \lor x=s) \Rightarrow (r \land s \text{ even}) \mid \mathsf{W}y1) \qquad ((2=s \lor x=s) \Rightarrow (r \land s) \mid \mathsf{W}z1)$$

Since the preconditions mention x, prepending (r := x), we now get different results depending on whether we perform the substitution. Without any substitution, we have:

$$r:=x\;;\;s:=x\;;\;\mathrm{if}(r\wedge s\;\mathrm{even})\{y:=1\}\;;\;\mathrm{if}(r\wedge s)\{z:=1\}$$
 
$$(1=r\Rightarrow (2=s\vee x=s)\Rightarrow (r\wedge s\;\mathrm{even})\;|\;\mathrm{W}\,y1) \qquad (1=r\Rightarrow (2=s\vee x=s)\Rightarrow (r\wedge s)\;|\;\mathrm{W}\,z1)$$

Prepending (x := 0), which substitutes [0/x], the precondition of (Wy1) becomes  $(1=r \Rightarrow (2=s \lor 0=s) \Rightarrow (r \land s \text{ even}))$ , which is a tautology, whereas the precondition of Wz1 becomes  $(1=r \Rightarrow (2=s \lor 0=s) \Rightarrow (r \land s))$ , which is not. In order to be top-level, (Wz1) must be dependency ordered after (Rx2); in this case the precondition becomes  $(1=r \Rightarrow 2=s \Rightarrow (r \land s))$ , which is a tautology.

$$(\mathsf{W}x0) \qquad (\mathsf{R}x1) \qquad (\mathsf{R}x2) \qquad (\mathsf{W}y1) \qquad (\mathsf{W}z1)$$

The situation reverses with the substitution $[r/x]$:

$$r := x \; ; \; s := x \; ; \; \text{if} \; (r \land s \; \text{even}) \; \{y := 1\}; \; \text{if} \; (r \land s) \; \{z := 1\}$$

$$(Rx1) \quad (1 = r \Rightarrow (2 = s \lor r = s) \Rightarrow (r \land s \; \text{even}) \; | \; Wy1) \quad (1 = r \Rightarrow (2 = s \lor r = s) \Rightarrow (r \land s) \; | \; Wz1)$$

Prepending (x := 0):

$$(\mathsf{W}x0) \qquad (\mathsf{R}x1) \qquad (\mathsf{R}x2) \qquad (\mathsf{W}y1) \qquad (\mathsf{W}z1)$$

The dependency has changed from  $(Rx2) \rightarrow (Wz1)$  to  $(Rx2) \rightarrow (Wy1)$ . The resulting sets of pomsets are incomparable.
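The four tautology claims behind this reversal can be checked by enumeration. The sketch below is illustrative only: it assumes a value domain of $\{0,1,2\}$, and it reads "$r \land s \text{ even}$" as "r is nonzero and s is even" and "$r \land s$" as "r and s are both nonzero", which is an interpretation rather than something the text states.

```python
from itertools import product

VALUES = range(3)  # assumed value domain {0, 1, 2}

def implies(a, b):
    return (not a) or b

def even(n):
    return n % 2 == 0

def tautology(phi):
    return all(phi(r, s) for r, s in product(VALUES, repeat=2))

# Without substitution, after prepending x := 0 (x is replaced by 0).
wy_plain = lambda r, s: implies(r == 1, implies(s == 2 or s == 0, bool(r) and even(s)))
wz_plain = lambda r, s: implies(r == 1, implies(s == 2 or s == 0, bool(r) and bool(s)))

# With the substitution [r/x], the disjunct x = s has already become r = s.
wy_subst = lambda r, s: implies(r == 1, implies(s == 2 or s == r, bool(r) and even(s)))
wz_subst = lambda r, s: implies(r == 1, implies(s == 2 or s == r, bool(r) and bool(s)))

results = {
    "Wy1 plain": tautology(wy_plain),
    "Wz1 plain": tautology(wz_plain),
    "Wy1 [r/x]": tautology(wy_subst),
    "Wz1 [r/x]": tautology(wz_subst),
}
print(results)
```

Under these assumptions the independent precondition is a tautology for (Wy1) only without the substitution, and for (Wz1) only with it, so each variant forces a dependency that the other does not.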

Thinking in terms of hardware, the difference is whether reads update the cache, thus clobbering preceding writes. With [r/x], reads clobber the cache, whereas without the substitution, they do not. Since most caches work this way, the model with [r/x] is likely preferred for modeling hardware. However, this substitution only makes sense in a model with read-read coherence and read-read dependencies, which we will see is not the case for Arm. By leaving out the substitution, we also ensure that downgraded reads are fulfilled by preceding writes, not reads.

### 4 REFINEMENTS AND ADDITIONAL FEATURES

In the paper so far, we have assumed that registers are assigned at most once. We have done this primarily for readability. In the first subsection below, we drop this assumption, instead using substitution to rename registers. We use the set  $S_{\mathcal{E}} = \{s_e \mid e \in \mathcal{E}\}$ . By assumption (§3.1), these registers do not appear in programs:  $S[N/s_e] = S$ . The resulting semantics satisfies redundant read elimination.

In the rest of this section we consider several orthogonal features: address calculation, if-closure, read-modify-write operations, and access elimination.

These extensions preserve all of the valid transformations discussed thus far. We state the extensions with respect to the base semantics of Figure 2, but they apply equally to the variants described in §A.


### 4.1 Register Recycling and Redundant Read Elimination

JMM Test Case 2 [Pugh 2004] states the following execution should be allowed "since redundant read elimination could result in simplification of r=s to true, allowing y:=1 to be moved early."

$$r := x; s := x; \text{ if}(r=s)\{y := 1\} \parallel x := y$$

$${}^{d}(\mathsf{R}x1) \qquad {}^{e}(\mathsf{W}y1) \qquad (\mathsf{R}y1) \qquad (\mathsf{W}x1)$$

This execution is not allowed by the semantics  $[\cdot]_1$  of Figure 2: the precondition of e in the independent case is

$$(1=r \lor x=r) \Rightarrow (1=s \lor r=s) \Rightarrow (r=s), \tag{*}$$

which is equivalent to  $(x=r) \Rightarrow (1=s) \Rightarrow (r=s)$ , which is not a tautology, and thus  $[\cdot]_1$  requires order from d to e.

This execution is allowed, however, if we rename registers using a map from event names to register names. By using this renaming, coalesced events must choose the same register name. In the above example, the precondition of e in the independent case becomes

$$(1=s_e \lor x=s_e) \Rightarrow (1=s_e \lor s_e=s_e) \Rightarrow (s_e=s_e), \tag{**}$$

which is a tautology. In (\*\*), the first read resolves the nondeterminism in both the first and the second read. Given the choice of event names, the outcome of the second read is predetermined! In (\*), the second read remains nondeterministic, even in the case that the events are destined to coalesce.
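The difference between (\*) and (\*\*) can be confirmed by enumeration. This is a minimal sketch over an assumed three-element value domain; the function names are hypothetical.

```python
from itertools import product

VALUES = range(3)  # assumed value domain, for illustration only

def implies(a, b):
    return (not a) or b

def star(r, s, x):
    """(*): (1=r or x=r) => (1=s or r=s) => (r=s)."""
    return implies(r == 1 or x == r, implies(s == 1 or r == s, r == s))

def star_simplified(r, s, x):
    """The simplified form given in the text: (x=r) => (1=s) => (r=s)."""
    return implies(x == r, implies(s == 1, r == s))

def star_star(se, x):
    """(**): both registers renamed to s_e."""
    return implies(se == 1 or x == se, implies(se == 1 or se == se, se == se))

triples = list(product(VALUES, repeat=3))
star_taut = all(star(*t) for t in triples)
same_as_simplified = all(star(*t) == star_simplified(*t) for t in triples)
counterexample = next(t for t in triples if not star(*t))  # (r, s, x)
star_star_taut = all(star_star(se, x) for se, x in product(VALUES, repeat=2))

print(star_taut, same_as_simplified, counterexample, star_star_taut)
```

The counterexample found for (\*) has $x = r$, $s = 1$ and $r \neq s$, which is exactly the case that the renaming to $s_e$ rules out.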

Definition 4.1. Let  $[\cdot]_3$  be defined as in Figure 2, changing R4 of READ:

(R4a) if  $(E \cap D) \neq \emptyset$  then  $\tau^D(\psi) \equiv v = s_e \Rightarrow \psi[s_e/r]$ ,

(R4b) if $E \neq \emptyset$ and $(E \cap D) = \emptyset$ then $\tau^D(\psi) \equiv (v = s_e \lor x = s_e) \Rightarrow \psi[s_e/r]$,

(R4c) if  $E = \emptyset$  then  $(\forall s) \tau^D(\psi) \equiv \psi[s/r]$ .

With this semantics, it is straightforward to see that redundant load elimination is sound:

$$[r := x^{\mu}; s := x^{\mu}]_{3} \supseteq [r := x^{\mu}; s := r]_{3}$$

#### 4.2 Register Consistency

We would like:

(x4') $\tau(\mathsf{ff}) \equiv \mathsf{ff}$.

(M3a')  $\kappa(e)$  is satisfiable.

(s6b") if  $\lambda_1(d)$  delays  $\lambda_2(e)$  and  $\kappa_1(d) \wedge \kappa_2(e)$  is satisfiable then  $d \leq e$ .

Define

$$\theta_{\lambda} = \bigwedge_{\{(e,v) \in (E \times \mathcal{V}) \mid \lambda(e) = (\mathsf{R}v)\}} (s_e = v) \qquad \text{where } E = \text{dom}(\lambda)$$

We say that  $\phi$  is  $\lambda$ -consistent if  $\phi \wedge \theta_{\lambda}$  is satisfiable. We say that it is  $\lambda$ -inconsistent otherwise.
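As a rough illustration of $\lambda$-consistency, the sketch below builds $\theta_\lambda$ from a labelling and tests satisfiability of $\phi \wedge \theta_\lambda$ by enumeration. The encoding of $\lambda$ as a dictionary, the finite domain, and the example formulas are assumptions made for this example only.

```python
from itertools import product

VALUES = range(3)  # assumed finite value domain, for illustration only

def theta(labels):
    """Build theta_lambda: the conjunction of (s_e = v) over read events e.

    `labels` is a hypothetical encoding of lambda: event name -> (kind, value),
    where kind is "R" for reads and "W" for writes.
    """
    reads = [(e, v) for e, (kind, v) in labels.items() if kind == "R"]
    return lambda env: all(env["s_" + e] == v for e, v in reads)

def satisfiable(phi, names):
    return any(phi(dict(zip(names, vs)))
               for vs in product(VALUES, repeat=len(names)))

def consistent(phi, labels, names):
    """phi is lambda-consistent if phi and theta_lambda are jointly satisfiable."""
    t = theta(labels)
    return satisfiable(lambda env: phi(env) and t(env), names)

labels = {"d": ("R", 1), "e": ("W", 1)}  # one read of 1, one write of 1
names = ["s_d", "r"]

phi_bad = lambda env: env["s_d"] == 2                    # contradicts s_d = 1
phi_ok = lambda env: env["s_d"] == 1 or env["r"] == 0

bad_sat = satisfiable(phi_bad, names)
bad_consistent = consistent(phi_bad, labels, names)
ok_consistent = consistent(phi_ok, labels, names)
print(bad_sat, bad_consistent, ok_consistent)  # True False True
```

The first formula is satisfiable in isolation but $\lambda$-inconsistent, which is the gap between (M3a') and (M3a).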

*Definition 4.2.* A  $\lambda$ -predicate transformer is a function  $\tau:\Phi\to\Phi$  such that

- (x1) if $\phi \models \psi$, then $\tau(\phi) \models \tau(\psi)$,
- (x2) $\tau(\psi_1 \wedge \psi_2) \equiv \tau(\psi_1) \wedge \tau(\psi_2)$,
- (x3) $\tau(\psi_1 \vee \psi_2) \equiv \tau(\psi_1) \vee \tau(\psi_2)$,
- (x4) if $\psi$ is $\lambda$-inconsistent then $\tau(\psi)$ is $\lambda$-inconsistent.

A family of $\lambda$-predicate transformers over $E$ consists of a $\lambda$-predicate transformer $\tau^D$ for each $D \subseteq \mathcal{E}$, such that if $C \cap E \subseteq D$ then $\tau^C(\psi) \models \tau^D(\psi)$.

- (M4)  $\tau: 2^{\mathcal{E}} \to \Phi \to \Phi$  is a family of  $\lambda$ -predicate transformers,
- (M3a)  $\kappa(e)$  is  $\lambda$ -consistent.
- (s6b') if  $\lambda_1(d)$  delays  $\lambda_2(e)$  and  $\kappa_1(d) \wedge \kappa_2(e)$  is  $\lambda$ -consistent then  $d \leq e$ .


#### 4.3 Address Calculation

Inevitably, address calculation complicates the definitions of WRITE and READ.

Definition 4.3. Let $[\cdot]_4$ be defined as in Figure 2, changing WRITE and READ:

If $P \in WRITE(L, M, \mu)$ then $(\exists \ell \in \mathcal{V})$ $(\exists v \in \mathcal{V})$

- (w1) if $d, e \in E$ then $d = e$,
- (w2) $\lambda(e) = \mathsf{W}^{\mu}[\ell]v$,
- (w3) $\kappa(e) \equiv L = \ell \wedge M = v$,
- (w4a) if $E \neq \emptyset$ then $\tau^D(\psi) \equiv (L=\ell) \Rightarrow \psi[M/[\ell]]$,
- (w4b) if $E = \emptyset$ then $(\forall k)\ \tau^D(\psi) \equiv (L=k) \Rightarrow \psi[M/[k]]$,
- (w5a) if $E \neq \emptyset$ then $\checkmark \equiv L = \ell \land M = v$,
- (w5b) if $E = \emptyset$ then $\checkmark \equiv \text{ff}$.

If $P \in READ(r, L, \mu)$ then $(\exists \ell \in \mathcal{V})$ $(\exists v \in \mathcal{V})$

- (R1) if $d, e \in E$ then $d = e$,
- (R2) $\lambda(e) = \mathsf{R}^{\mu}[\ell]v$,
- (R3) $\kappa(e) \equiv L = \ell$,
- (R4a) $(\forall e \in E \cap D)\ \tau^D(\psi) \equiv (L=\ell \Rightarrow v=s_e) \Rightarrow \psi[s_e/r]$,
- (R4b) $(\forall e \in E \setminus D)\ \tau^D(\psi) \equiv ((L=\ell \Rightarrow v=s_e) \vee (L=\ell \Rightarrow [\ell]=s_e)) \Rightarrow \psi[s_e/r]$,
- (R4c) $(\forall s)$ if $E = \emptyset$ then $\tau^D(\psi) \equiv \psi[s/r]$,
- (R5) if $E = \emptyset$ and $\mu \neq \text{rlx}$ then $\checkmark \equiv \text{ff}$.

The combination of read-read independency (Definition A.3) and address calculation is somewhat delicate. Consider the following program, from PWP (§5), where initially x = 0, y = 0, [0] = 0, [1] = 2, and [2] = 1. It should only be possible to read 0, disallowing the attempted execution below:

$$r := y; s := [r]; x := s \parallel r := x; s := [r]; y := s$$

$$(\mathsf{R}y2) \qquad (\mathsf{R}[2]1) \qquad (\mathsf{W}x1) \qquad\qquad (\mathsf{R}x1) \qquad (\mathsf{R}[1]2) \qquad (\mathsf{W}y2)$$

This execution would become possible, however, if we were to replace $(L=\ell \Rightarrow v=s_e)$ by $(v=s_e)$ in R4a. In this case, (Ry2) would not necessarily be dependency ordered before (Wx1).

### 4.4 Read-Modify-Write Operations

From the data model, we require an additional binary relation over  $\mathcal{A} \times \mathcal{A}$ : overlaps. For the actions in this paper, we say a overlaps b if they access the same location.

RMW operations are formalized by adding a relation $\xrightarrow{\mathsf{rmw}} \subseteq E \times E$ that relates the read of a successful RMW to the succeeding write.

Definition 4.4. Extend the definition of a pomset as follows.

- (M10) $\mathsf{rmw} : E \rightarrow E$ is a partial function capturing read-modify-write atomicity, such that
  - (M10a) if $d \xrightarrow{\mathsf{rmw}} e$ then $\lambda(e)$ blocks $\lambda(d)$,
  - (M10b) if $d \xrightarrow{\mathsf{rmw}} e$ then $d \leq e$,
  - (M10c) if $\lambda(c)$ overlaps $\lambda(d)$ then
    - (i) if $d \xrightarrow{\mathsf{rmw}} e$ then $c \leq e$ implies $c \leq d$,
    - (ii) if $d \xrightarrow{\mathsf{rmw}} e$ then $d \leq c$ implies $e \leq c$.

Extend the definition of par, if, seq to include:

- (P10) (s10) (I10) $\mathsf{rmw} = (\mathsf{rmw}_1 \cup \mathsf{rmw}_2)$,

To define specific operations, we extend the syntax:

$$S := \cdots \mid r := \mathsf{CAS}^{\mu,\nu}([L], M, N) \mid r := \mathsf{FADD}^{\mu,\nu}([L], M) \mid r := \mathsf{EXCHG}^{\mu,\nu}([L], M)$$

We require that r does not occur in L. The corresponding semantic functions are as follows.

Proc. ACM Program. Lang., Vol. 0, No. OOPSLA, Article 0. Publication date: October 2021.

 Definition 4.5. Let READ' be defined as for READ, adding the constraint:

(R4d) if $(E \cap D) = \emptyset$ then $\tau^D(\psi) \models \psi$.

If $P \in FADD(r, L, M, \mu, \nu)$ then $(\exists P_1 \in SEQ(READ'(r, L, \mu), WRITE(L, r+M, \nu)))$

(U1) if $\lambda_1(e)$ is a write then there is a read $\lambda_1(d)$ such that $\kappa(e) \models \kappa(d)$ and $d \xrightarrow{\mathsf{rmw}} e$.

If $P \in EXCHG(r, L, M, \mu, \nu)$ then $(\exists P_1 \in SEQ(READ'(r, L, \mu), WRITE(L, M, \nu)))$

(U1) if $\lambda_1(e)$ is a write then there is a read $\lambda_1(d)$ such that $\kappa(e) \models \kappa(d)$ and $d \xrightarrow{\mathsf{rmw}} e$.

If $P \in CAS(r, L, M, N, \mu, \nu)$ then $(\exists P_1 \in SEQ(READ'(r, L, \mu), IF(r=M, WRITE(L, N, \nu), SKIP)))$

(U1) if $\lambda_1(e)$ is a write then there is a read $\lambda_1(d)$ such that $\kappa(e) \models \kappa(d)$ and $d \xrightarrow{\mathsf{rmw}} e$.

This definition ensures atomicity and supports lowering to Arm load/store exclusive operations. See PWP for examples.

One subtlety of the definition is that we use *READ'* rather than *READ*. Thus, for RMW operations, the independent case for a read is the same as the empty case. To see why this should be, consider the relaxed variant of the CDRF example from [Lee et al. 2020], using *READ* rather than *READ'*.

$$\begin{aligned} x := 0; (&r := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1); \text{if}(!r)\{\text{if}(y)\{x := 0\}\} \\ \parallel\ &r := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1); \text{if}(!r)\{y := 1\}) \end{aligned}$$

[Figure: execution in which both FADD reads observe the write of 0.]

A write should only be visible to one FADD instruction, but here the write of 0 is visible to two. This is allowed because no order is required from (Rx0) to (Wy1) in the last thread. To see why, consider the independent transformers of the last thread and initializer:

$$x := 0 \qquad\qquad \text{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1) \qquad\qquad \text{if}(!r)\{y := 1\}$$

$$\boxed{\psi[0/x]} \qquad\qquad \boxed{(0=r \lor x=r) \Rightarrow \psi[1/x]} \quad \boxed{\mathsf{R}x0} \xrightarrow{\mathsf{rmw}} \boxed{\mathsf{W}x1} \qquad\qquad \boxed{\psi[1/y]} \quad \boxed{r=0 \mid \mathsf{W}y1}$$

After sequencing, the precondition of (Wy1) is a tautology:  $(0=r \lor 0=r) \Rightarrow r=0$ .

By including R4d, *READ'* constrains the independent predicate transformer of the FADD:

$$x := 0 \qquad\qquad \text{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1) \qquad\qquad \text{if}(!r)\{y := 1\}$$

$$\psi[0/x] \quad \boxed{\mathsf{W}x0} \qquad\qquad \boxed{\psi[1/x]} \quad \boxed{\mathsf{R}x0} \quad \boxed{\mathsf{W}x1} \qquad\qquad \boxed{\psi[1/y]} \quad \boxed{r=0 \mid \mathsf{W}y1}$$

After sequencing, the precondition of (Wy1) is r=0, which is *not* a tautology. This forces any top-level pomset to include dependency order from (Rx0) to (Wy1).
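The contrast between *READ* and *READ'* can be replayed on this example. The sketch below is an illustration under stated assumptions: predicates are Python functions over a state with `r` and `x`, the value domain is finite, the independent *READ'* transformer is modelled as the identity (following the displayed $\psi[1/x]$), and the FADD write is omitted because the precondition of (Wy1) mentions only `r`.

```python
VALUES = range(3)  # assumed value domain, for illustration only

def implies(a, b):
    return (not a) or b

def write_x(value, psi):
    """Transformer for x := value, i.e. psi[value/x]."""
    return lambda env: psi({**env, "x": value})

def read_indep(v, psi):
    """Independent READ transformer: (v = r or x = r) => psi."""
    return lambda env: implies(env["r"] == v or env["x"] == env["r"], psi(env))

def read_prime_indep(v, psi):
    """Independent READ' transformer, modelled here as the identity."""
    return psi

def tautology(phi):
    return all(phi({"r": r, "x": x}) for r in VALUES for x in VALUES)

kappa_wy1 = lambda env: env["r"] == 0  # precondition of (Wy1) before sequencing

# Sequence x := 0 in front of the FADD read, then the conditional write to y.
with_read = write_x(0, read_indep(0, kappa_wy1))
with_read_prime = write_x(0, read_prime_indep(0, kappa_wy1))

read_taut = tautology(with_read)
read_prime_taut = tautology(with_read_prime)
print(read_taut, read_prime_taut)  # True False
```

With *READ* the precondition collapses to a tautology, so no order is forced; with *READ'* it remains $r=0$, so a top-level pomset must order (Rx0) before (Wy1).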

### 4.5 If-Closure

In order to model sequential composition, we must allow inconsistent predicates in a single pomset, unlike PWP. For example, if S = (x := 1), then  $[\![\cdot]\!]_1$  does *not* allow:

$$\text{if}(M)\{x := 1\}; S; \text{if}(\neg M)\{x := 1\}$$

$$(\mathsf{W}x1) \rightarrow (\mathsf{W}x1)$$

However, if  $S = (if(\neg M)\{x := 1\}; if(M)\{x := 1\})$ , then it *does* allow the execution. Looking at the initial program:

The difficulty is that the middle action can coalesce either with the right action, or the left, but not both. Thus, we are stuck with some non-tautological precondition. Our solution is to allow a pomset to contain many events for a single action, as long as the events have disjoint preconditions.


Definition 4.6 allows the execution, by splitting the middle command:

Coalescing events gives the desired result.

This is not simply a theoretical question; it is observable. For example,  $[\cdot]_1$  does not allow the following, since it must add order in the first thread from the read of y to one of the writes to x.

Definition 4.6. Let  $[\cdot]_5$  be defined as in Figure 2, changing WRITE and READ:

If $P \in WRITE(x, M, \mu)$ then $(\exists v : E \to \mathcal{V})$ $(\exists \theta : E \to \Phi)$

- (w1) if $\theta_d \wedge \theta_e$ is satisfiable then $d = e$,
- (w2) $\lambda(e) = \mathsf{W}^{\mu} x v_e$,
- (w3) $\kappa(e) \equiv \theta_e \wedge M = v_e$,
- (w4) $\tau^D(\psi) \equiv \theta_e \Rightarrow \psi[M/x]$,
- (w5) $\checkmark \equiv \theta_e \Rightarrow M = v_e$.

If $P \in READ(r, x, \mu)$ then $(\exists v : E \to \mathcal{V})$ $(\exists \theta : E \to \Phi)$

- (R1) if $\theta_d \wedge \theta_e$ is satisfiable then $d = e$,
- (R2) $\lambda(e) = \mathsf{R}^{\mu} x v_e$,
- (R3) $\kappa(e) \equiv \theta_e$,
- (R4a) $(\forall e \in E \cap D)\ \tau^D(\psi) \equiv \theta_e \Rightarrow v_e = s_e \Rightarrow \psi[s_e/r]$,
- (R4b) $(\forall e \in E \setminus D)\ \tau^D(\psi) \equiv \theta_e \Rightarrow (v_e = s_e \lor x = s_e) \Rightarrow \psi[s_e/r]$,
- (R4c) $(\forall s)\ \tau^D(\psi) \equiv (\bigwedge_{e \in E} \neg \theta_e) \Rightarrow \psi[s/r]$,
- (R5) if $E = \emptyset$ and $\mu \neq \text{rlx}$ then $\checkmark \equiv \text{ff}$.

#### 4.6 Combining Address Calculation and If-Closure

Definition 4.3 is naive with respect to merging events. Consider the following example:

Merging, we have:

$$\text{if}(M)\{[r] := 0; [0] := !r\} \text{ else } \{[r] := 0; [0] := !r\}$$

$${}^{c}(r = 1 \mid W[1]0) \qquad {}^{d}(r = 0 \lor r = 1 \mid W[0]0) \rightarrow {}^{e}(r = 0 \mid W[0]1)$$

The precondition of W[0]0 is a tautology; however, this is not possible for ([r] := 0; [0] := !r) alone, using Definition 4.3.

Definition 4.7 enables this execution using if-closure. Under this semantics, we have:

$$[r] := 0 \qquad\qquad\qquad\qquad [0] := !r$$

$${}^{c}(r=1 \mid W[1]0) \quad {}^{d}(r=0 \mid W[0]0) \qquad\qquad {}^{d}(r=1 \mid W[0]0) \quad {}^{e}(r=0 \mid W[0]1)$$

Sequencing and merging:

$$[r] := 0; [0] := !r$$

$${}^{c}(r=1 \mid W[1]0) \quad {}^{d}(r=0 \lor r=1 \mid W[0]0) \quad {}^{e}(r=0 \mid W[0]1)$$


The precondition of (W[0]0) is a tautology, as required.

Definition 4.7. Let $[\cdot]_6$ be defined as in Figure 2, changing WRITE and READ:

If $P \in WRITE(L, M, \mu)$ then $(\exists \ell : E \to \mathcal{V})$ $(\exists v : E \to \mathcal{V})$ $(\exists \theta : E \to \Phi)$

- (w1) if $\theta_d \wedge \theta_e$ is satisfiable then $d = e$,
- (w2) $\lambda(e) = \mathsf{W}^{\mu}[\ell] v_e$,
- (w3) $\kappa(e) \equiv \theta_e \wedge L = \ell_e \wedge M = v_e$,
- (w4a) $\tau^D(\psi) \equiv \theta_e \Rightarrow (L = \ell) \Rightarrow \psi[M/[\ell]]$,
- (w4b) $(\forall k)\ \tau^D(\psi) \equiv (\bigwedge_{e \in E} \neg \theta_e) \Rightarrow (L = k) \Rightarrow \psi[M/[k]]$,
- (w5a) $\checkmark \equiv \theta_e \Rightarrow L = \ell_e \wedge M = v_e$,
- (w5b) $\checkmark \equiv \bigvee_{e \in E} \theta_e$.

If $P \in READ(r, L, \mu)$ then $(\exists \ell : E \to \mathcal{V})$ $(\exists v : E \to \mathcal{V})$ $(\exists \theta : E \to \Phi)$

- (R1) if $\theta_d \wedge \theta_e$ is satisfiable then $d = e$,
- (R2) $\lambda(e) = \mathsf{R}^{\mu}[\ell]v_e$,
- (R3) $\kappa(e) \equiv \theta_e \wedge L = \ell_e$,
- (R4a) $(\forall e \in E \cap D)\ \tau^D(\psi) \equiv \theta_e \Rightarrow (L = \ell_e \Rightarrow v_e = s_e) \Rightarrow \psi[s_e/r]$,
- (R4b) $(\forall e \in E \setminus D)\ \tau^D(\psi) \equiv \theta_e \Rightarrow ((L = \ell_e \Rightarrow v_e = s_e) \lor (L = \ell_e \Rightarrow [\ell] = s_e)) \Rightarrow \psi[s_e/r]$,
- (R4c) $(\forall s)\ \tau^D(\psi) \equiv (\bigwedge_{e \in E} \neg \theta_e) \Rightarrow \psi[s/r]$,
- (R5) if $E = \emptyset$ and $\mu \neq \text{rlx}$ then $\checkmark \equiv \text{ff}$.

#### 5 MRD-C11

Restrict the syntax to top-level parallel composition.

Definition 5.1. A pomset with program order is a tuple  $(P, \mathsf{m}, \mathsf{po})$ , where  $P = (E, \lambda, \kappa, \tau, \checkmark, \le)$  is a pomset with predicate transformers and

- (M8) $\mathsf{m} : (E \to E)$ is a function capturing merging, such that
  - (M8a) $\le \subseteq (R \times R)$, where $R = \{e \mid \mathsf{m}(e) = e\}$ is the set of real (rather than phantom) events,
- (M9) $\mathsf{po} \subseteq (S \times S)$ is a partial order capturing program order, where $S = \{e \mid \forall d.\ \mathsf{m}(d) = e \Rightarrow d = e\}$ is the set of simple (rather than compound) events.

Lots of fiddly details. Intuitively, (1) compute the preconditions and order for $R$ as before, (2) create new events for the merged ones, and compute preconditions for events outside $R$ by applying all of the dependent transformers of the preceding *S*.

[Incomplete] Rules for computing preconditions during sequential composition:

• if $e \in E_1$ then $\kappa(e) = \kappa_1(e)$,

• if $e \in E_2 \cap R$ then $\kappa(e)$ computed as before, restricted to R,

• if $e \in E_2 \setminus R$ then $\kappa(e) \equiv \tau_1^S(\kappa_1(e))$.

po can have cycles when interpreted on merged events. For example:

$$r_1 := x; r_2 := y; s_1 := x; s_2 := y$$

[Figure: pomsets over events $(\mathsf{R}x1)$ and $(\mathsf{R}y2)$.]


For TC2, we have:



Idea: merging reads can only make a difference if there is a race.





Redundant read after read elimination example from [Paviotti et al. 2020, §6.4] to work out with merge. [Sevčík and Aspinall 2008, Fig. 5]. Take r = s.



A more dramatic problem [Paviotti et al. 2020, §6.3]. Version with control dependencies is DRF.


#### 6 FUTURE WORK

 This paper is the first to present a direct denotational semantics for sequential composition in a relaxed memory model which can be efficiently compiled to modern CPUs. There is, as usual, more research to be done.

 $\begin{bmatrix} R y1 \end{bmatrix}$   $\begin{bmatrix} 1=r \Rightarrow (r\neq 0 \land Q_x) \mid Rx1 \end{bmatrix} \rightarrow \begin{bmatrix} Wz1 \end{bmatrix}$ 

We have not treated loops in this model, though we expect that the usual approach of showing continuity for all the semantic operations with respect to set inclusion would go through. Paviotti et al. [2020] use step-indexing to account for loops; a similar approach could be applied here.

In §A.2 we presented a compilation strategy to Armv8 for a simplified model, but one that introduces fences on acquiring reads. These fences are not required in §A.3, at the cost of added model complexity. It would be illuminating to measure the performance penalty of these fences.

An earlier version of this paper has been mechanized in Agda; it would be reassuring to update the mechanization to bring it in line with the current state.

We don't handle access elimination.

#### REFERENCES

- Jade Alglave, Will Deacon, Richard Grisenthwaite, Antoine Hacquard, and Luc Maranget. 2021. Armed Cats: Formal Concurrency Modelling at Arm. *TOPLAS* (2021). To Appear.
- Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory. ACM Trans. Program. Lang. Syst. 36, 2, Article 7 (July 2014), 74 pages. https://doi.org/10.1145/2627752
- Mark Batty. 2015. The C11 and C++11 concurrency model. Ph.D. Dissertation. University of Cambridge, UK.
- Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. 2011. Mathematizing C++ Concurrency. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Austin, Texas, USA) (POPL '11). ACM, New York, NY, USA, 55–66. https://doi.org/10.1145/1926385.1926394
- Hans-J. Boehm and Brian Demsky. 2014. Outlawing Ghosts: Avoiding Out-of-thin-air Results. In *Proceedings of the Workshop on Memory Systems Performance and Correctness* (Edinburgh, United Kingdom) (MSPC '14). ACM, New York, NY, USA, Article 7, 6 pages. https://doi.org/10.1145/2618128.2618134
- Stephen D. Brookes. 1996. Full Abstraction for a Shared-Variable Parallel Language. *Inf. Comput.* 127, 2 (1996), 145–163. https://doi.org/10.1006/inco.1996.0056
- Soham Chakraborty and Viktor Vafeiadis. 2019a. Grounding thin-air reads with event structures. *PACMPL* 3, POPL (2019), 70:1–70:28. https://doi.org/10.1145/3290383
- Soham Chakraborty and Viktor Vafeiadis. 2019b. Grounding thin-air reads with event structures: Technical Appendix. (2019). http://plv.mpi-sws.org/weakest/appendix.pdf
- Minki Cho, Sung-Hwan Lee, Chung-Kil Hur, and Ori Lahav. 2021. Modular Data-Race-Freedom Guarantees in the Promising Semantics. Proc. ACM Program. Lang. 2, PLDI. To Appear.
- Russ Cox. 2016. Go's Memory Model. http://nil.csail.mit.edu/6.824/2016/notes/gomem.pdf.


- Edsger W. Dijkstra. 1975. Guarded Commands, Nondeterminacy and Formal Derivation of Programs. *Commun. ACM* 18, 8 (1975), 453–457. https://doi.org/10.1145/360933.360975
- Stephen Dolan, KC Sivaramakrishnan, and Anil Madhavapeddy. 2018. Bounding Data Races in Space and Time. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (Philadelphia, PA, USA) (PLDI 2018). ACM, New York, NY, USA, 242–255. https://doi.org/10.1145/3192366.3192421
  - Brijesh Dongol, Radha Jagadeesan, and James Riely. 2019. Modular transactions: bounding mixed races in space and time. In *Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019, Washington, DC, USA, February 16-20, 2019*, Jeffrey K. Hollingsworth and Idit Keidar (Eds.). ACM, 82–93. https://doi.org/10.1145/3293883.3295708
  - William Ferreira, Matthew Hennessy, and Alan Jeffrey. 1996. A Theory of Weak Bisimulation for Core CML. In Proceedings of the 1996 ACM SIGPLAN International Conference on Functional Programming, ICFP 1996, Philadelphia, Pennsylvania, USA, May 24-26, 1996, Robert Harper and Richard L. Wexelblat (Eds.). ACM, 201–212. https://doi.org/10.1145/232627.232649
- C.A.R. Hoare. 1969. An Axiomatic Basis for Computer Programming. Commun. ACM 12, 10 (Oct. 1969), 576–580. https://doi.org/10.1145/363235.363259
- Radha Jagadeesan, Alan Jeffrey, and James Riely. 2020. Pomsets with preconditions: a simple model of relaxed memory.
   Proc. ACM Program. Lang. 4, OOPSLA (2020), 194:1–194:30. https://doi.org/10.1145/3428262
- Radha Jagadeesan, Corin Pitcher, and James Riely. 2010. Generative Operational Semantics for Relaxed Memory Models.

  In Programming Languages and Systems, 19th European Symposium on Programming, ESOP 2010, Paphos, Cyprus, March
  20-28, 2010. Proceedings (Lecture Notes in Computer Science, Vol. 6012), Andrew D. Gordon (Ed.). Springer, 307–326. https://doi.org/10.1007/978-3-642-11957-6\_17
- Alan Jeffrey and James Riely. 2016. On Thin Air Reads Towards an Event Structures Model of Relaxed Memory. In *Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science, LICS '16, New York, NY, USA, July 5-8, 2016*, M. Grohe, E. Koskinen, and N. Shankar (Eds.). ACM, 759–767. https://doi.org/10.1145/2933575.2934536
  - Jeehoon Kang, Chung-Kil Hur, Ori Lahav, Viktor Vafeiadis, and Derek Dreyer. 2017. A promising semantics for relaxed-memory concurrency. In *Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL 2017, Paris, France, January 18-20, 2017*, Giuseppe Castagna and Andrew D. Gordon (Eds.). ACM, 175–189. http://dl.acm.org/citation.cfm?id=3009850
- Ryan Kavanagh and Stephen Brookes. 2018. A denotational account of C11-style memory. CoRR abs/1804.04214 (2018). arXiv:1804.04214 http://arxiv.org/abs/1804.04214
- Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. 2017. Repairing sequential consistency in C/C++11. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017, Albert Cohen and Martin T. Vechev (Eds.). ACM, 618-632. https://doi. org/10.1145/3062341.3062352
- Leslie Lamport. 1979. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. *IEEE Trans. Comput.* 28, 9 (Sept. 1979), 690–691. https://doi.org/10.1109/TC.1979.1675439
- Sung-Hwan Lee, Minki Cho, Anton Podkopaev, Soham Chakraborty, Chung-Kil Hur, Ori Lahav, and Viktor Vafeiadis.
   2020. Promising 2.0: global optimizations in relaxed memory concurrency. In Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020,
   Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 362-376. https://doi.org/10.1145/3385412.3386010
- Lun Liu, Todd Millstein, and Madanlal Musuvathi. 2019. Accelerating Sequential Consistency for Java with Speculative Compilation. In *Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation* (Phoenix, AZ, USA) (*PLDI 2019*). ACM, New York, NY, USA, 16–30. https://doi.org/10.1145/3314221.3314611

- Andreas Lochbihler. 2013. Making the Java memory model safe. ACM Trans. Program. Lang. Syst. 35, 4 (2013), 12:1–12:65. https://doi.org/10.1145/2518191
- Jeremy Manson, William Pugh, and Sarita V. Adve. 2005. The Java Memory Model. SIGPLAN Not. 40, 1 (Jan. 2005), 378–391. https://doi.org/10.1145/1047659.1040336
- Daniel Marino, Todd D. Millstein, Madanlal Musuvathi, Satish Narayanasamy, and Abhayendra Singh. 2015. The Silently Shifting Semicolon. In 1st Summit on Advances in Programming Languages, SNAPL 2015, May 3-6, 2015, Asilomar, California, USA (LIPIcs, Vol. 32), Thomas Ball, Rastislav Bodík, Shriram Krishnamurthi, Benjamin S. Lerner, and Greg Morrisett (Eds.). Schloss Dagstuhl Leibniz-Zentrum für Informatik, 177–189. https://doi.org/10.4230/LIPIcs.SNAPL.2015.177
  - Peter O'Hearn. 2007. Resources, Concurrency, and Local Reasoning. Theor. Comput. Sci. 375, 1-3 (April 2007), 271–307. https://doi.org/10.1016/j.tcs.2006.12.035
- Marco Paviotti, Simon Cooksey, Anouk Paradis, Daniel Wright, Scott Owens, and Mark Batty. 2020. Modular Relaxed Dependencies in Weak Memory Concurrency. In Programming Languages and Systems 29th European Symposium on Programming, ESOP 2020, Dublin, Ireland, April 25-30, 2020, Proceedings (Lecture Notes in Computer Science, Vol. 12075), Peter Müller (Ed.). Springer, 599–625. https://doi.org/10.1007/978-3-030-44914-8\_22
  - Jean Pichon-Pharabod and Peter Sewell. 2016. A Concurrency Semantics for Relaxed Atomics That Permits Optimisation and Avoids Thin-air Executions. In *Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages* (St. Petersburg, FL, USA) (*POPL '16*). ACM, New York, NY, USA, 622–633. https://doi.org/10.1145/2837614.2837616
  - Anton Podkopaev, Ori Lahav, and Viktor Vafeiadis. 2019. Bridging the gap between programming languages and hardware weak memory models. *Proc. ACM Program. Lang.* 3, POPL (2019), 69:1–69:31. https://doi.org/10.1145/3290382
- William Pugh. 2004. Causality Test Cases. https://perma.cc/PJT9-XS8Z
- Jaroslav Sevčík. 2008. Program Transformations in Weak Memory Models. PhD thesis. Laboratory for Foundations of Computer Science, University of Edinburgh.
  - Jaroslav Sevčík and David Aspinall. 2008. On Validity of Program Transformations in the Java Memory Model. In ECOOP 2008 - Object-Oriented Programming, 22nd European Conference, Paphos, Cyprus, July 7-11, 2008, Proceedings (Lecture Notes in Computer Science, Vol. 5142), Jan Vitek (Ed.). Springer, 27–51. https://doi.org/10.1007/978-3-540-70592-5\_3
  - Joel Spolsky. 2002. The Law of Leaky Abstractions. https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/.
  - Conrad Watt, Christopher Pulte, Anton Podkopaev, Guillaume Barbier, Stephen Dolan, Shaked Flur, Jean Pichon-Pharabod, and Shu-yu Guo. 2020. Repairing and mechanising the JavaScript relaxed memory model. In *Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020*, Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 346–361. https://doi.org/10.1145/3385412.3385973
  - Conrad Watt, Andreas Rossberg, and Jean Pichon-Pharabod. 2019. Weakening WebAssembly. *Proc. ACM Program. Lang.* 3, OOPSLA (2019), 133:1–133:28. https://doi.org/10.1145/3360559

### A ARM

For simplicity, we restrict to top level parallel composition and ignore fences<sup>4</sup>.

#### A.1 Arm executions

*Definition A.1.* An *Arm8 execution graph, G*, is a tuple  $(E, \lambda, poloc, lob)$  such that

- (A1)  $E \subseteq \mathcal{E}$  is a set of events,
- (A2)  $\lambda: E \to \mathcal{A}$  defines a label for each event,
- (A3) poloc:  $E \times E$ , is a per-thread, per-location total order, capturing per-location program order,
- (A4) lob :  $E \times E$, is a per-thread partial order capturing *locally-ordered-before*, such that
  - (A4a) poloc  $\cup$  lob is acyclic.
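
As an illustration only, the following Python sketch represents the components of Definition A.1 as plain sets of pairs and checks (A4a) by depth-first search. The names `ExecutionGraph`, `is_acyclic` and `satisfies_a4a`, the string event identifiers, and the three-event example are hypothetical and not part of the formal development.

```python
from dataclasses import dataclass


def is_acyclic(nodes, edges):
    """Return True if the directed graph (nodes, edges) contains no cycle."""
    succ = {n: set() for n in nodes}
    for a, b in edges:
        succ[a].add(b)
    state = {n: "new" for n in nodes}  # new -> active -> done

    def visit(n):
        state[n] = "active"
        for m in succ[n]:
            if state[m] == "active":  # back edge: a cycle exists
                return False
            if state[m] == "new" and not visit(m):
                return False
        state[n] = "done"
        return True

    return all(state[n] != "new" or visit(n) for n in nodes)


@dataclass
class ExecutionGraph:
    events: set  # (A1) event identifiers
    label: dict  # (A2) event -> action label
    poloc: set   # (A3) pairs (d, e): per-location program order
    lob: set     # (A4) pairs (d, e): locally-ordered-before

    def satisfies_a4a(self):
        """(A4a): the union of poloc and lob is acyclic."""
        return is_acyclic(self.events, self.poloc | self.lob)


# Hypothetical single thread: a write of x, a read of x, a write of y.
labels = {"a": "Wx1", "b": "Rx1", "c": "Wy1"}
good = ExecutionGraph(set(labels), labels, {("a", "b")}, {("b", "c")})
bad = ExecutionGraph(set(labels), labels, {("a", "b")}, {("b", "a")})
```

Here `good` satisfies (A4a), whereas `bad` does not, since its lob edge closes a cycle with poloc.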

The definition of lob is complex. Comparing with our definition of sequential composition, it is sufficient to note that lob includes

- (L1) read-write dependencies, required by s3,
- (L2) synchronization delay of  $\bowtie_{\mathsf{sync}}$, required by s6b,
- (L3) sc access delay of  $\bowtie_{\mathsf{sc}}$, required by s6b,
- (L4) write-write and read-to-write coherence delay of $\bowtie_{\mathsf{co}}$, required by s6b,

<sup>4</sup>Fences are not actions in Arm8, which complicates the theorem statements.

and that lob does not include

- (L5) read-read control dependencies, required by s3,
- (L6) write-to-read order of rf, required by s7b,
- (L7) write-to-read coherence delay of  $\bowtie_{\mathsf{co}}$, required by s6b.

Definition A.2. Execution G is (co, rf, gcb)-valid, under External Global Consistency (EGC) if

- (A5)  $co: E \times E$, is a per-location total order on writes, capturing coherence,
- (A6) rf :  $E \times E$, is a surjective and injective relation on reads, capturing *reads-from*, such that
  - (A6a) if  $d \xrightarrow{rf} e$  then  $\lambda(d)$  matches  $\lambda(e)$,
  - (A6b) poloc  $\cup$  co  $\cup$  rf  $\cup$  fr is acyclic, where  $e \xrightarrow{fr} c$  if  $e \xleftarrow{rf} d \xrightarrow{co} c$, for some d,
- (A7)  $gcb \supseteq (co \cup rf)$  is a linear order such that
  - (A7a) if  $d \xrightarrow{rf} e$  and  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \xrightarrow{gcb} d$  or  $e \xrightarrow{gcb} c$,
  - (A7b) if  $e \xrightarrow{lob} c$  then either  $e \xrightarrow{gcb} c$  or  $(\exists d) d \xrightarrow{rf} e$  and  $d \xrightarrow{poloc} e$  but not  $d \xrightarrow{lob} c$.

Execution G is (co, rf, cb)-valid under External Consistency (EC) if

- (A5) and (A6), as for EGC,
- (A8)  $cb \supseteq (co \cup lob)$  is a linear order such that if  $d \xrightarrow{rf} e$  then either
  - (A8a)  $d \stackrel{\mathsf{cb}}{\longrightarrow} e$  and if  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \stackrel{\mathsf{cb}}{\longrightarrow} d$  or  $e \stackrel{\mathsf{cb}}{\longrightarrow} c$, or
  - (A8b)  $d \stackrel{\mathsf{cb}}{\longleftarrow} e$  and  $d \stackrel{\mathsf{poloc}}{\longrightarrow} e$  and  $(\not\exists c) \lambda(c)$  blocks  $\lambda(e)$  and  $d \stackrel{\mathsf{poloc}}{\longrightarrow} c \stackrel{\mathsf{poloc}}{\longrightarrow} e$.

Alglave et al. [2021] show that EGC and EC are both equivalent to the standard definition of Arm8. They explain EGC and EC using the following example, which is allowed by Arm8.<sup>5</sup>

$$x := 1; r := x; y := r \parallel r := y^{\text{acq}}; s := x$$

$$(Wx1) \xrightarrow{\text{rf}} (Rx1) \xrightarrow{\text{lob}} (Wy1) \xrightarrow{\text{rf}} (R^{\text{acq}}y1) \xrightarrow{\text{lob}} (Rx0)$$

EGC drops lob-order in the first thread using A7b, since (Wx1) is not lob-ordered before (Wy1).

EC drops rf-order in the first thread using A8b.
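
To make the EGC reading of this example concrete, the following Python sketch checks (A7), (A7a) and (A7b) against a candidate linear order. It is a sketch under stated assumptions rather than a definitive checker: we assume that *blocks* relates any other write to the location of a read, we add a hypothetical initializing write `i` of x so that (Rx0) has a fulfilling write, and the function names `from_reads` and `egc_valid` are ours.

```python
def from_reads(rf, co):
    """fr as in (A6b): e -fr-> c when d -rf-> e and d -co-> c for some d."""
    return {(e, c) for (d, e) in rf for (d2, c) in co if d == d2}


def egc_valid(gcb, loc, writes, rf, co, poloc, lob):
    """Check (A7), (A7a), (A7b) for a candidate linear order gcb (a list)."""
    pos = {ev: i for i, ev in enumerate(gcb)}

    def before(a, b):
        return pos[a] < pos[b]

    # (A7): gcb contains co and rf.
    if not all(before(a, b) for (a, b) in co | rf):
        return False
    # (A7a): a blocker c of read e lies before the write d or after e.
    # Assumption: c blocks e when c is another write to e's location.
    for (d, e) in rf:
        for c in writes:
            if c != d and loc[c] == loc[e]:
                if not (before(c, d) or before(e, c)):
                    return False
    # (A7b): lob order is kept unless e is fulfilled by a poloc-earlier
    # write d that is not itself lob-ordered before c.
    for (e, c) in lob:
        excused = any((d, e) in poloc and (d, c) not in lob
                      for (d, e2) in rf if e2 == e)
        if not (before(e, c) or excused):
            return False
    return True


# i is an assumed initializing write of x; a..e follow the diagram:
# a = Wx1, b = Rx1, c = Wy1, d = R^acq y1, e = Rx0.
loc = {"i": "x", "a": "x", "b": "x", "c": "y", "d": "y", "e": "x"}
writes = {"i", "a", "c"}
co = {("i", "a")}
rf = {("a", "b"), ("c", "d"), ("i", "e")}
poloc = {("a", "b")}
lob = {("b", "c"), ("d", "e")}

fr = from_reads(rf, co)
relaxed_order = ["i", "c", "d", "e", "a", "b"]
program_order = ["i", "a", "b", "c", "d", "e"]
```

Under these assumptions `relaxed_order` is accepted: the lob edge from (Rx1) to (Wy1) is excused by (A7b), which lets (Wy1) precede (Wx1). The order `program_order` is rejected by (A7a), because (Wx1) blocks (Rx0) and falls between the initializing write and the read.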

#### A.2 Arm Compilation 1

We do not distinguish control dependencies from other dependencies, and therefore L5 forces us to drop all dependencies between reads. To achieve this, we modify the definition of  $\kappa'_2$  in Figure 2.

*Definition A.3.* Let  $[\![\cdot]\!]_2$  be defined as in Figure 2, replacing the definition of  $\kappa'_2$  with:

$$\kappa_2'(e) = \begin{cases} \tau_1(\kappa_2(e)) & \text{if } \lambda(e) \text{ is a read} \\ \tau_1^{\downarrow e}(\kappa_2(e)) & \text{otherwise, where } \downarrow e = \{c \mid c < e\} \end{cases}$$

Even with this small change, the optimal lowering for Arm8 is unsound for our semantics. The optimal lowering maps relaxed access to ldr/str and non-relaxed access to ldar/stlr [Podkopaev et al. 2019]. In this section, we consider a suboptimal strategy, which lowers non-relaxed reads to (dmb.sy; ldar). Significantly, we retain the optimal lowering for relaxed access. In the next section we recover the optimal lowering by adopting an alternative semantics for s7b and s6b.

<sup>5</sup>We have changed an address dependency in the first thread to a data dependency.

 To see why the optimal lowering fails, consider the following attempted execution, where the final values of both x and y are 2.

$$x := 2; r := x^{\text{acq}}; y := r - 1 \parallel y := 2; x^{\text{rel}} := 1$$

$$(gcb)$$

$$(\leq)$$

$$(R^{\text{acq}}x2) \longrightarrow (Wy1) \longrightarrow (Wy2) \longrightarrow (W^{\text{rel}}x1)$$

This attempted execution is allowed by Arm8, but disallowed by our semantics.

If the read of x in the execution above is changed from acquiring to relaxed, then our semantics allows the gcb execution, using the independent case for the read and satisfying the precondition of (Wy1) by prepending (Wx2). It may be tempting, therefore, to adopt a strategy of *downgrading* acquires in certain cases. Unfortunately, it is not possible to do this locally without invalidating important idioms such as publication. For example, consider that  $(R^{ra}x1)$  is *not* possible for the second thread in the following attempted execution, due to publication of (Wx2) via y:

$$x := x + 1; \ y^{\text{rel}} := 1 \parallel x := 1; \ \text{if} (y^{\text{acq}} \& x^{\text{acq}}) \{s := z\} \parallel z := 1; \ x^{\text{rel}} := 1$$
$$(W^{\text{rel}}y1) \longrightarrow (R^{\text{acq}}y1) \longrightarrow \cdots$$

Instead, if the read of x is relaxed, then the publication via y fails, and (Rx1) in the second thread is possible.

$$(Rx1)$$
  $(Wx2)$   $(Wx1)$   $(Rx1)$   $(Rx1)$   $(Rx0)$   $(Rx1)$   $(Rx0)$   $(Rx1)$ 

Using the suboptimal lowering for acquiring reads, our semantics is sound for Arm. The proof uses the characterization of Arm using EGC.

THEOREM A.4. Suppose  $G_1$  is  $(co_1, rf_1, gcb_1)$ -valid for S under the suboptimal lowering that maps non-relaxed reads to (dmb.sy; ldar). Then there is a top-level pomset  $P_2 \in [S]_2$  such that  $E_2 = E_1$ ,  $\lambda_2 = \lambda_1$ ,  $rf_2 = rf_1$ , and  $\leq_2 = gcb_1$ .

PROOF. First, we establish some lemmas about Arm8.

LEMMA A.5. Suppose G is (co, rf, gcb)-valid. Then  $gcb \supseteq fr$ .

PROOF. Using the definition of fr from A6b, we have $e \xleftarrow{rf} d \xrightarrow{co} c$, for some d, and therefore $\lambda(c)$ blocks $\lambda(e)$. Applying A7a, we have either $c \xrightarrow{gcb} d$ or $e \xrightarrow{gcb} c$. Since gcb includes co, we have $d \xrightarrow{gcb} c$, and therefore it must be that $e \xrightarrow{gcb} c$.

LEMMA A.6. Suppose G is (co, rf, gcb)-valid and  $c \xrightarrow{poloc} e$, where  $\lambda(c)$  blocks  $\lambda(e)$. Then  $c \xrightarrow{gcb} e$.

PROOF. By way of contradiction, assume  $e \xrightarrow{gcb} c$. If  $c \xrightarrow{rf} e$  then by A7 we must also have  $c \xrightarrow{gcb} e$, contradicting the assumption that gcb is a total order. Otherwise there is some  $d \neq c$  such that  $d \xrightarrow{rf} e$, and therefore  $d \xrightarrow{gcb} e$. By transitivity,  $d \xrightarrow{gcb} c$. By the definition of fr, we have  $e \xrightarrow{fr} c$. But this contradicts A6b, since  $c \xrightarrow{poloc} e$.

We show that all the order required in the pomset is also required by Arm8. M7b holds since  $gcb_1$  is consistent with  $co_1$  and  $fr_1$. As noted above, lob includes the order required by s3 and s6b. We need only show that the order removed from A7b can also be removed from the pomset. In order for A7b to remove order from e to c, we must have  $d \xrightarrow{rf} e$  and  $d \xrightarrow{poloc} e$  but not  $d \xrightarrow{lob} c$. Because of our suboptimal lowering, it must be that e is a relaxed read; otherwise the dmb.sy would require  $d \xrightarrow{lob} c$. Thus we know that s6b does not require order from e to c. By chaining R4b and W5, any dependence on the read can be satisfied without introducing order in s3.

#### A.3 Arm Compilation 2

We can achieve optimal lowering for Arm by weakening the semantics of sequential composition slightly. In particular, we must lose Lemma 3.8, which states that  $d \stackrel{\text{rf}}{\longrightarrow} e$  implies  $d \leq e$ . Revisiting the example in the last subsection, we essentially mimic the EC characterization:

$$x := 2; r := x^{\operatorname{acq}}; y := r - 1 \parallel y := 2; x^{\operatorname{rel}} := 1$$

$$(Wx2) \mapsto (R^{\text{acq}}x2) \mapsto (Wy1) \mapsto (Wy2) \mapsto (W^{\text{rel}}x1)$$

Here the rf relation *contradicts* order! We have both  $(Wx2) \cdots \rightarrow (R^{acq}x2)$  and  $(Wx2) \stackrel{\mathsf{cb}}{\longleftarrow} (R^{acq}x2)$ .

The change to the semantics is small: we weaken the relationship between rf and  $\leq$  in s7b. Rather than ensuring that there is no *global* blocker for a sequentially fulfilled read (s7b), we require only that there is no *thread-local* blocker (s7b<sup>rf</sup>). This change both allows and requires us to weaken the definition of *delays* to drop write-to-read order from  $\bowtie_{co}$ .

Definition A.7. Let  $[\![\cdot]\!]_2^{\mathsf{rf}}$  be defined as for  $[\![\cdot]\!]_2$  in Definition A.3/Figure 2, changing s7b and s6b:

- $(\mathsf{s7b^{\mathsf{rf}}})$  if  $\lambda_1(c)$  blocks  $\lambda_2(e)$  then  $d \xrightarrow{\mathsf{rf}} e$  implies  $c \le d$,
- $(\mathsf{s6b^{\mathsf{rf}}})$  if  $\lambda_1(d)$  delays'  $\lambda_2(e)$  then  $d \le e$,

where *delays'* replaces  $\bowtie_{co}$  in Definition 3.1 of *delays* by  $\bowtie_{lws} = \{(Wx, Wx), (Rx, Wx)\}.$ 

• TODO: I think this should order W->R if there is no rf the other way

The acronym lws is adopted from Arm8. It stands for Local Write Successor.
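
The difference between the two delay relations can be sketched in Python as follows. This is illustrative only: Definition 3.1 is not restated here, so we assume, from items (L4) and (L7) above, that $\bowtie_{co}$ relates same-location pairs (Wx, Wx), (Rx, Wx) and (Wx, Rx). Actions are modelled as hypothetical `(kind, location)` pairs, and the function names are ours.

```python
def delays_co(a, b):
    """Assumed coherence delay: same location, and not read-then-read."""
    (kind_a, loc_a), (kind_b, loc_b) = a, b
    return loc_a == loc_b and "W" in (kind_a, kind_b)


def delays_lws(a, b):
    """Local write successor: only (Wx, Wx) and (Rx, Wx)."""
    (kind_a, loc_a), (kind_b, loc_b) = a, b
    return loc_a == loc_b and kind_b == "W"


write_x, read_x, write_y = ("W", "x"), ("R", "x"), ("W", "y")
dropped = [
    (a, b)
    for a in (write_x, read_x, write_y)
    for b in (write_x, read_x, write_y)
    if delays_co(a, b) and not delays_lws(a, b)
]
```

In this model the only pair in `dropped` is the write-to-read pair on x, which is exactly the order that the weakened definition gives up.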

With the weakening of $\mathsf{s7b}^{\mathsf{rf}}$, we must be careful not to allow spurious pairs to be added to the rf relation. The use of *extends* in 17a does this, ensuring that new rf is not introduced between events in  $E_1 \cap E_2$  when coalescing. This is necessary to ensure that  $[\![\mathsf{if}(b)\{r:=x \parallel x:=1\}\ \mathsf{else}\ \{r:=x; x:=1\}]\!]$  does not include [xx] taking rf from the left and [x] from the right.

We emphasize that Lemma 3.8 fails for  $[\![\cdot]\!]_2^{rf}$ , since  $d \xrightarrow{rf} e$  may not imply  $d \le e$  when d and e come from different sides of a sequential composition. This means that rf must be verified during pomset construction, rather than post-hoc. The following lemma gives a post-hoc verification technique for rf, using program order (po).

Lemma A.8. Any P in the image of  $[\![\cdot]\!]_2^{\mathsf{rf}}$  is top-level iff for every  $d \xrightarrow{\mathsf{rf}} e$  either

- external fulfillment:  $d \le e$  and if  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \le d$  or  $e \le c$ , or
- internal fulfillment:  $d \xrightarrow{po} e$  and  $(\not\exists c) \lambda(c)$  blocks  $\lambda(e)$  and  $d \xrightarrow{po} c \xrightarrow{po} e$ .
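
The two cases of Lemma A.8 can be phrased as a small check over an explicit rf relation. The Python sketch below is illustrative only: the function `rf_is_justified` is ours, *blocks* is assumed to relate a write to any read of the same location, and the linear order `chain` is our reading of the execution discussed at the start of this subsection, extended so that (Wx1) precedes (Wx2).

```python
def rf_is_justified(rf, events, order, po, blocks):
    """Check that each rf edge is externally or internally fulfilled."""

    def le(a, b):  # reflexive pomset order
        return a == b or (a, b) in order

    def external(d, e):
        return le(d, e) and all(
            le(c, d) or le(e, c) for c in events if blocks(c, e))

    def internal(d, e):
        return (d, e) in po and not any(
            blocks(c, e) and (d, c) in po and (c, e) in po for c in events)

    return all(external(d, e) or internal(d, e) for (d, e) in rf)


# Events of x := 2; r := x^acq; y := r - 1 || y := 2; x^rel := 1
kind = {"Wx2": ("W", "x"), "Rx2": ("R", "x"), "Wy1": ("W", "y"),
        "Wy2": ("W", "y"), "Wx1": ("W", "x")}


def blocks(c, e):
    # Assumption: a write blocks a read of the same location.
    return (kind[c][0] == "W" and kind[e][0] == "R"
            and kind[c][1] == kind[e][1])


chain = ["Rx2", "Wy1", "Wy2", "Wx1", "Wx2"]  # assumed linear order
order = {(a, b) for i, a in enumerate(chain) for b in chain[i + 1:]}
po = {("Wx2", "Rx2"), ("Rx2", "Wy1"), ("Wx2", "Wy1"), ("Wy2", "Wx1")}

internal_ok = rf_is_justified({("Wx2", "Rx2")}, set(kind), order, po, blocks)
cross_thread = rf_is_justified({("Wx1", "Rx2")}, set(kind), order, po, blocks)
```

Here the edge from (Wx2) to the acquiring read is accepted by internal fulfillment even though the read precedes the write in the order. Reading from (Wx1) instead is rejected: the write is neither order-before nor po-before the read.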

THEOREM A.9. Suppose  $G_1$  is EC-valid for S via  $(co_1, rf_1, cb_1)$  and that  $cb_1 \supseteq fr_1$ . Then there is a top-level pomset  $P_2 \in [S]_2^{rf}$  such that  $E_2 = E_1$ ,  $\lambda_2 = \lambda_1$ ,  $rf_2 = rf_1$ , and  $\leq_2 = cb_1$ .

PROOF. We show that all the order required in the pomset is also required by Arm8. M7b holds since  $cb_1$  is consistent with  $co_1$  and  $fr_1$ .  $s7b^{rf}$  follows from A8b. As noted above, lob includes the order required by s3 and  $s6b^{rf}$ .

The generality of Theorem A.9 is not limited by the assumption that  $cb_1 \supseteq fr_1$ :

LEMMA A.10. Suppose G is EC-valid via (co, rf, cb). Then there is a permutation cb' of cb such that G is EC-valid via (co, rf, cb') and cb'  $\supseteq$  fr, where fr is defined in A6b.

PROOF. We show that any cb order that contradicts fr is incidental.

By definition of fr,  $e \stackrel{\mathsf{rf}}{\longleftarrow} d \stackrel{\mathsf{co}}{\longrightarrow} c$, for some d. Since  $\mathsf{cb} \supseteq \mathsf{co}$, we know that  $d \stackrel{\mathsf{cb}}{\longrightarrow} c$.

If A8a applies to  $d \xrightarrow{rf} e$, then  $e \xrightarrow{cb} c$, since it cannot be that  $c \xrightarrow{co} d$.

<sup>6</sup>It is obvious how to enhance the semantics of most operators to define po. When combining pomsets using the conditional, the obvious definition of po may result in cycles, since po-ordered events may coalesce. In this case we include a separate pomset for each way of breaking these po cycles.

 Suppose A8b applies to  $d \xrightarrow{rf} e$  and c is from a different thread. Because it is a different thread, we cannot have  $e \xrightarrow{lob} c$ , and thus the order in cb is incidental.

Suppose A8b applies to  $d \xrightarrow{rf} e$  and c is from the same thread. Since  $d \xrightarrow{co} c$, it cannot be that  $c \xrightarrow{poloc} d$, using A6b. It also cannot be that  $d \xrightarrow{poloc} c \xrightarrow{poloc} e$. It must be that  $e \xrightarrow{poloc} c$. By A4a, we cannot have  $e \xrightarrow{lob} c$, and thus the order in cb is incidental.

### **B** LOCAL DATA RACE FREEDOM AND SEQUENTIAL CONSISTENCY

We adapt Dolan et al.'s [2018] notion of Local Data Race Freedom (LDRF) to our setting.

The result requires that locations are properly initialized. We assume a sufficient condition: that programs have the form " $x_1 := v_1$ ;  $\cdots x_n := v_n$ ; S" where every location mentioned in S is some  $x_i$ .

We make two further restrictions to simplify the exposition. To simplify the definition of *happens-before*, we ban fences and RMWs. To simplify the proof, we assume there are no local declarations of the form (var x; S).

To state the theorem, we require several technical definitions. The reader unfamiliar with [Dolan et al. 2018] may prefer to skip to the examples in the proof sketch, referring back as needed.

Data Race. Data races are defined using program order (po), not pomset order ( $\leq$ ). In ??, for example, (Rx0) has an x-race with (Wx1), but not (Wx0), which is po-before it.

It is obvious how to enhance the semantics of prefixing and most other operators to define po. When combining pomsets using the conditional, the obvious definition may result in cycles, since po-ordered reads may coalesce—see the discussion of ?? in §??. In this case we include a separate pomset for each way of breaking these cycles.

Because we ignore the features of §??, we can adopt the simplest definition of *synchronizes-with* (sw): Let  $d \xrightarrow{sw} e$  exactly when d fulfills e, d is a release, e is an acquire, and  $\neg (d \xrightarrow{po} e)$ .

Let  $hb = (po \cup sw)^+$  be the *happens-before* relation. In ??, for example, (Wx1) happens-before (Rx0), but this fails if either ra access is relaxed.

Let  $L \subseteq X$  be a set of locations. We say that d has an L-race with e (notation  $d \stackrel{L}{\leadsto} e$ ) when at least one is relaxed, they *conflict* (Def. ??) at some location in L, and they are unordered by hb: neither  $d \stackrel{\text{hb}}{\longrightarrow} e$  nor  $e \stackrel{\text{hb}}{\longrightarrow} d$ .
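
As a minimal sketch of these definitions, the following Python code computes $hb = (po \cup sw)^+$ and the resulting L-races. The definition of *conflict* is not restated in this appendix, so we assume the usual one (same location, at least one write). The `Event` record, the function names, and the message-passing program are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Event:
    name: str
    kind: str      # "R" or "W"
    loc: str
    relaxed: bool


def transitive_closure(pairs):
    """Smallest transitive relation containing pairs."""
    closure = set(pairs)
    while True:
        step = {(a, d) for (a, b), (c, d) in product(closure, repeat=2)
                if b == c}
        if step <= closure:
            return closure
        closure |= step


def l_races(events, po, sw, L):
    """Pairs of event names that have an L-race."""
    hb = transitive_closure(po | sw)       # happens-before
    return {
        (d.name, e.name)
        for d, e in product(events, repeat=2)
        if d.name < e.name                 # report each pair once
        and (d.relaxed or e.relaxed)       # at least one is relaxed
        and d.loc == e.loc and d.loc in L  # same location, within L
        and "W" in (d.kind, e.kind)        # assumed notion of conflict
        and (d, e) not in hb and (e, d) not in hb
    }


# Hypothetical message-passing program:
#   x := 1; y^rel := 1  ||  r := y^acq; s := x
wx = Event("Wx1", "W", "x", True)
wy = Event("Wy1", "W", "y", False)
ry = Event("Ry1", "R", "y", False)
rx = Event("Rx", "R", "x", True)
events = [wx, wy, ry, rx]
po = {(wx, wy), (ry, rx)}
with_sync = l_races(events, po, {(wy, ry)}, {"x", "y"})
without_sync = l_races(events, po, set(), {"x", "y"})
```

When the release write fulfills the acquiring read, the sw edge orders the accesses to x and `with_sync` is empty. Without it, the two relaxed accesses to x are unordered by hb and are reported as a race.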

*Generators.* We say that P' generates P if either P augments P' or P implies P'. For example, the unordered pomset (Rx1) (Wy1) generates the ordered pomset  $(Rx1) \rightarrow (r = 1 \mid Wy1)$ .

We say that P is *generation-minimal* in  $\mathcal{P}$  if  $P \in \mathcal{P}$  and there is no  $P \neq P' \in \mathcal{P}$  that generates P.

Let gen $[S] = \{P \in [S] \mid P \text{ is top-level (Def. ??)} \text{ and generation-minimal in } [S] \}.$ 

*Extensions.* We say that P' *S-extends* P if  $P \neq P' \in \text{gen}[S]$  and P is a downset of P'.

Similarity. We say that P' is e-similar to P if they differ at most in (1) pomset order adjacent to e and (2) the value associated with event e, if it is a read. Formally: E' = E,  $\kappa' = \kappa$ ,  $\leq '|_{E \setminus \{e\}} = \leq |_{E \setminus \{e\}}$ , if e is not a read then  $\lambda' = \lambda$ , and if e is a read then  $\lambda'|_{E \setminus \{e\}} = \lambda|_{E \setminus \{e\}}$  and  $\lambda'(e) = \lambda(e)[v'/v]$ , for some v', v.

Stability. We say that P is L-stable in S if (1)  $P \in \text{gen}[S]$, (2) P is po-convex (nothing missing in program order), (3) there is no S-extension of P with a *crossing* L-race: that is, there is no $d \in E$, no $P'$ that S-extends $P$, and no $e \in E' \setminus E$ such that $d \stackrel{L}{\leadsto} e$. The empty pomset is L-stable.

Sequentiality. Let  $<_L = <_L \cup po$ , where  $<_L$  is the restriction of < to events that access locations in L. We say that P' is L-sequential after P if P' is po-convex and  $<_L$  is acyclic in  $E' \setminus E$ .

Theorem B.1. Let P be L-stable in S. Let P' be a S-extension of P that is L-sequential after P. Let P'' be a S-extension of P' that is po-convex, such that no subset of E'' satisfies these criteria. Then either (1) P'' is L-sequential after P or (2) there is some S-extension P''' of P' and some  $e \in (E'' \setminus E')$  such that (a) P''' is e-similar to P'', (b) P''' is L-sequential after P, and (c)  $d \stackrel{L}{\leadsto} e$, for some  $d \in E'''$.

The theorem provides an inductive characterization of *Sequential Consistency for Local Data-Race Freedom (SC-LDRF)*: Any extension of a *L*-stable pomset is either *L*-sequential, or is *e*-similar to a *L*-sequential extension that includes a race involving *e*.

PROOF SKETCH. In order to develop a technique to find P''' from P'', we analyze pomset order in generation-minimal top-level pomsets. First, we note that  $\leq_*$  (the transitive reduction of  $\leq$) can be decomposed into three disjoint relations. Let  $ppo = (\leq_* \cap po)$  denote *preserved* program order, as required by prefixing (Def. ??). The other two relations are cross-thread subsets of  $(\leq_* \setminus po)$, as required by fulfillment (Def. ??): rfe orders writes before reads, satisfying fulfillment requirement ??; xw orders read and write accesses before writes, satisfying requirement ??. (Within a thread, ?? and ?? follow from prefixing requirement ??, which is included in ppo.)

Using this decomposition, we can show the following.

LEMMA B.2. Suppose  $P'' \in gen[S]$  has a read e that is maximal in (ppo  $\cup$  rfe) and such that every po-following read is also  $\leq$-following ( $e \xrightarrow{\text{po}} d$  implies  $e \leq d$, for every read d). Further, suppose there is an e-similar P''' that satisfies the requirements of fulfillment. Then  $P''' \in gen[S]$.

The proof of the lemma follows an inductive construction of gen[S], starting from a large set with little order, and pruning the set as order is added: We begin with all pomsets generated by the semantics without imposing the requirements of fulfillment (including only ppo). We then prune reads which cannot be fulfilled, starting with those that are minimally ordered. This proof is simplified by precluding local declarations.

We can prove a similar result for  $(po \cup rfe)$ -maximal read and write accesses.

Turning to the proof of the theorem, if P'' is L-sequential after P, then the result follows from (1). Otherwise, there must be a  $\leq_L$  cycle in P'' involving all of the actions in  $(E'' \setminus E')$ : If there were no such cycle, then P'' would be L-sequential; if there were elements outside the cycle, then there would be a subset of E'' that satisfies these criteria.

If there is a (po  $\cup$  rfe)-maximal access, we select one of these as e. If e is a write, we reverse the outgoing order in xw; the ability to reverse this order witnesses the race. If e is a read, we switch its fulfilling write to a "newer" one, updating xw; the ability to switch witnesses the race. For example, for P'' on the left below, we choose the P''' on the right; e is the read of x, which races with (Wx1).



It is important that e be (po  $\cup$  rfe)-maximal, not just (ppo  $\cup$  rfe)-maximal. The latter criterion would allow us to choose e to be the read of g, but then there would be no e-similar pomset: if an execution reads 0 for g then there is no read of g, due to the conditional.

 If there is no (po  $\cup$  rfe)-maximal access, then all cross-thread order must be from rfe. In this case, we select a (ppo  $\cup$  rfe)-maximal read, switching its fulfilling write to an "older" one. As an example, consider the following; once again, e is the read of x, which races with (Wx1).



This example requires (Wx0). Proper initialization ensures the existence of such "older" writes.

The premises of the theorem allow us to avoid the complications caused by "mixed races" in [Dongol et al. 2019]. In the left pomset below, P'' is not an extension of P', since P' is not a downset of P''. When considering this pomset, we must perform the decomposition on the right.



This affects the inductive order in which we move across pomsets, but does not affect the set of pomsets that are considered. This simplification is enabled by denotational reasoning.

In our language, past races are always resolved at a stable point, as in co3. As another example, consider the following, which is disallowed here, but allowed by Java [Dolan et al. 2018, Ex. 2]. We include an SC fence here to mimic the behavior of volatiles in the JMM.

$$(x := 1; y^{ra} := 1) \parallel (x := 2; F^{sc}; if(y^{ra})\{r := x; s := x\})$$

The highlighted events are L-stable. The order from (Rx1) to (Wx2) is required by fulfillment, causing the cycle. If the fence is removed, there would be no order from (Wx2) to  $(R^{acq}y1)$ , the highlighted events would no longer be L-stable, and the execution would be allowed. This more relaxed notion of "past" is not expressible using Dolan et al.'s synchronization primitives.

The notion of "future" is also richer here. Consider [Dolan et al. 2018, Ex. 3]:

$$(r := 1; [r] := 42; s := [r]; x^{ra} := r) \parallel (r := x; [r] := 7)$$

$$(\text{W[1]}42) \rightarrow (\text{R[1]}7) \rightarrow (\text{W}^{rel}x1) \rightarrow (\text{R}x1) \rightarrow (\text{W[1]}7) \tag{FUTURE}$$

There is no interesting stable point here. The execution is disallowed because of a read from the causal future. If we changed  $x^{ra}$  to  $x^{rlx}$ , then there would be no order from (R[1]7) to (W<sup>rlx</sup>x1), and the execution would be allowed. The distinction between "causal future" and "temporal future" is not expressible in Dolan et al.'s operational semantics.

Our definition of *L*-sequentiality does not quite correspond to SC executions, since actions may be elided by read/write elimination (§??). However, for any properly initialized *L*-sequential pomset that uses elimination, there is a larger *L*-sequential pomset that does not use elimination. This can be shown inductively: in the inductive step, writes that are introduced can be ignored by existing reads, and reads that are introduced can be fulfilled, for some value, by some preceding write.

### C DOWNSET CLOSURE

 We would like the semantics to be closed with respect to downsets. Downsets include a subset of initial events, similar to prefixes for strings.

Definition C.1.  $P_2$  is a downset of  $P_1$  if

- (1)  $E_2 \subseteq E_1$,
- (2)  $(\forall e \in E_2) \lambda_2(e) = \lambda_1(e)$,
- (3)  $(\forall e \in E_2) \kappa_2(e) \equiv \kappa_1(e)$,
- (4)  $(\forall e \in E_2) \tau_2^D(e) \equiv \tau_1^D(e)$,
- (5)  $\checkmark_2 = \checkmark_1$,
- (6a)  $(\forall d \in E_2) \ (\forall e \in E_2) \ d \leq_2 e \text{ iff } d \leq_1 e$,
- (6b)  $(\forall d \in E_1)$   $(\forall e \in E_2)$  if  $d \leq_1 e$  then  $d \in E_2$,
- (7)  $(\forall d \in E_2)$   $(\forall e \in E_2)$  d rf<sub>2</sub> e iff d rf<sub>1</sub> e.
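
The order-theoretic part of this definition, conditions (1), (6a) and (6b), can be sketched in Python as below. The sketch deliberately ignores labels, preconditions, predicate transformers, termination and rf. The names `is_downset` and `reflexive_chain` are hypothetical, and orders are given as explicit sets of pairs.

```python
def is_downset(E2, leq2, E1, leq1):
    """Check conditions (1), (6a) and (6b) of the downset definition."""
    if not E2 <= E1:  # (1)
        return False
    for d in E2:      # (6a): the two orders agree on E2
        for e in E2:
            if ((d, e) in leq2) != ((d, e) in leq1):
                return False
    # (6b): E2 is closed under predecessors in P1.
    return all(d in E2 for (d, e) in leq1 if e in E2)


def reflexive_chain(*events):
    """Reflexive total order on events, in the order given."""
    return {(a, b) for i, a in enumerate(events) for b in events[i:]}


E1, leq1 = {"a", "b", "c"}, reflexive_chain("a", "b", "c")
prefix = is_downset({"a", "b"}, reflexive_chain("a", "b"), E1, leq1)
suffix = is_downset({"b", "c"}, reflexive_chain("b", "c"), E1, leq1)
```

For the three-event chain, the initial segment {a, b} passes all three checks, while {b, c} fails (6b) because it omits the predecessor a of b.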

Downset closure fails for two reasons. The key property is that the empty set transformer should behave the same as the independent transformer.

First, downset closure fails for Definition A.3, because it does not enforce read-read dependencies. Consider

$$r := x;\ \text{if}(!r)\{s := y\}$$

The semantics of this program includes the singleton pomset (Rx0), but not the singleton pomset (Ry0). To get (Rx0), we combine:

$$r := x \qquad \text{if}(!r)\{s := y\}$$

$$(Rx0) \qquad \emptyset$$

Attempting to get (Ry0), we instead get:

$$r := x \qquad \text{if}(!r)\{s := y\}$$

$$\emptyset \qquad \qquad (r=0 \mid R y0)$$

Since r appears only once in the program, this pomset cannot contribute to a top-level pomset.

Second, the semantics is not downset closed because the independency reasoning of R4b is only applicable for pomsets where the ignored read is present! Revisiting JMM causality test case 1 from the end of §3.7:

The precondition of (Wy1) is a tautology.

Taking the empty set for the read, however, the precondition of (Wy1) is not a tautology:

$$x := 0; r := x; \text{if}(r \ge 0)\{y := 1\}; z := r$$

$$(r \ge 0 \mid Wy1) \qquad (r = 1 \mid Wz1)$$

Proc. ACM Program. Lang., Vol. 0, No. OOPSLA, Article 0. Publication date: October 2021.

(The second issue goes away if one allows general access elimination to merge (Wx0) and (Rx0), as in §??.)

$$x := 0; r := x; \text{if}(r \ge 0)\{y := 1\}; z := r$$

$$((0 = r \lor 0 = r) \Rightarrow r \ge 0 \mid Wy1) \qquad (r = 1 \mid Wz1)$$

### D COMMENTS ON CASE ANALYSIS, ETC

Case analysis gives very weak results when combined with thread inlining. See [Chakraborty and Vafeiadis 2019b, §B.1]. These happen by performing transformations that: (1) introduce conditionals, (2) inline two threads on both sides of the introduced conditional, (3) choose different orders for the two threads for the two sides of the conditional.

Case analysis gives very weak results when combined with read introduction. See [Cho et al. 2021]. These happen by performing transformations that: (1) introduce reads, (2) introduce conditionals, (3) choose different values for the reads on the two sides of the conditional.

The fact that the semantics is not verifiable a posteriori is something it shares with WEAKESTMO, where the justification relation must be built inductively.

WEAKESTMO admits FADD, but PS does not. PS admits CohCYC, but WEAKESTMO does not.

### E ADDITIONAL EXAMPLES

#### E.1 Arm

The following execution is allowed by Arm.

$$x := 1; y^{\text{rel}} := 1 \parallel r := y; y := 2; s := y^{\text{acq}}; t := x$$

$$(Wx1) \mapsto (W^{\text{rel}}y1) \mapsto (Ry1) \mapsto (Wy2) \mapsto (R^{\text{acq}}y2) \mapsto (Rx0)$$

#### E.2 RMWs

It is not possible for two RMWs to see the same write.

$$x := 0; (\mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1) \parallel \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1))$$

$$(\mathsf{R}x0) \xrightarrow{\mathsf{rmw}} (\mathsf{W}x1) \xrightarrow{\mathsf{rmw}} (\mathsf{W}x1) \tag{RMW0}$$

The gray arrow is required by the RMW atomicity axioms.

Lee et al. [2020] introduce PS2.0 to refine the treatment of RMWs in the promising semantics (PS). Their examples have the expected results here, with far less work. First they recall that PS requires quantification over multiple futures in order to disallow executions such as CDRF:

$$r := \mathsf{FADD}^{\mathsf{acq},\mathsf{rel}}(x,1) \; ; \; \mathsf{if}(r=0) \{ y := 1 \} \parallel r := \mathsf{FADD}^{\mathsf{acq},\mathsf{rel}}(x,1) \; ; \; \mathsf{if}(r=0) \{ \mathsf{if}(y) \{ x := 0 \} \} \tag{CDRF}$$

[Figure: candidate execution of CDRF, containing a cycle.]


This execution is clearly impossible, due to the cycle above. In this diagram, we have not drawn order adjacent to the writes of the RMWs, since this is not necessary to produce the cycle. If CDRF is allowed, then DRF-RA fails.

PS does not support global value range analysis, as modeled by GA+E below. Our semantics permits GA+E:

$$x := 0; (r := \mathsf{CAS}^{\mathsf{rlx},\mathsf{rlx}}(x, 0, 1); \mathsf{if}(r < 10)\{y := 1\} \parallel x := 42; x := y) \tag{GA+E}$$

PS also does not support register promotion, as modeled by RP below. Our semantics permits RP:

$$r := x; s := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(z,r); y := s+1 \parallel x := y \tag{RP}$$

$$(\mathsf{R}x1) \qquad (\mathsf{W}y1) \qquad (\mathsf{R}y1) \qquad (\mathsf{W}x1)$$

The following examples are from "Modular Data-Race-Freedom Guarantees in the Promising Semantics", to appear in PLDI 2021.

CDRF shows that our semantics is not too permissive for ra-RMWs. But what about rlx-RMWs? The following execution is allowed by Arm8 and PS2.0, but disallowed by PS2.1.

$$r := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1) \; ; \; y := 1 \parallel r := y \; ; \; s := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,r)$$

$$(\mathsf{R}x1) \qquad (\mathsf{R}y1) \qquad (\mathsf{R}x0) \qquad (\mathsf{R}\mathsf{W}-\mathsf{W})$$

$$(\mathsf{W}x2) \qquad (\mathsf{W}x1)$$

Is this $\{z\}$-DRF-RA?

$$if(y)\{x := z\} \text{ else } \{x := 1\} \parallel r := x; z := 1; y := r$$

$$Ry1 \longrightarrow Rx1 \longrightarrow Wy1 \longrightarrow Wy1$$
(NAIVE-LDRF-RA-FAIL)

Interpreting  $\{z\}$  as ra:



Our semantics already disallows LDRF-FAIL-PS, which is similar to OOTA4.

$$if(x) \{FADD(w, 1); y := 1; z := 1\} || if(!z) \{x := 1\} else \{if(!FADD(w, 1)) \{x := y\}\}$$

(LDRF-FAIL-PS)

$$y := x \parallel r := y; \text{ if } (b)\{x := r; z := r\} \text{ else } \{x := 1\} \parallel b := 1$$

$$(Rx1) \qquad (Ry1) \qquad (Ry1) \qquad (Rb1) \qquad (Wb1)$$


 *Example E.1.* This definition ensures atomicity, disallowing executions such as [Podkopaev et al. 2019, Ex. 3.2]:

$$x := 0; \mathsf{INC}^{\mathsf{rlx},\mathsf{rlx}}(x) \parallel x := 2; r := x$$

$$(Wx0) \longrightarrow (Rx0) \longrightarrow (Wx2) \longrightarrow (Wx1) \longrightarrow (Rx1)$$

By M10c(i), since  $(Wx2) \rightarrow (Wx1)$ , it must be that  $(Wx2) \rightarrow (Rx0)$ , creating a cycle.

Example E.2. Two successful RMWs cannot see the same write:

$$x := 0; (\mathsf{INC}^{\mathsf{rlx},\mathsf{rlx}}(x) \parallel \mathsf{INC}^{\mathsf{rlx},\mathsf{rlx}}(x))$$

$$(\mathsf{W}x0) \qquad a{:}\mathsf{R}x0 \qquad b{:}\mathsf{W}x1 \qquad c{:}\mathsf{R}x0 \qquad d{:}\mathsf{W}x1$$

The order from read-to-write is required by fulfillment. Applying M10c(i) of the second RMW to  $a \rightarrow d$, we have  $a \rightarrow c$. Subsequently applying M10c(ii) of the first RMW, we have  $b \rightarrow c$, creating a cycle.

*Example E.3.* By using two actions rather than one, the definition allows examples such as the following, which is allowed by Arm8 [Podkopaev et al. 2019, Ex. 3.10]:

$$r := z; s := \mathsf{INC}^{\mathsf{rlx},\mathsf{rel}}(x); y := s+1 \parallel r := y; z := r$$

$$(Rz1) \qquad (Wy1) \qquad (Ry1) \qquad (Wz1)$$

A similar example, also allowed by Arm8 [Chakraborty and Vafeiadis 2019a, Fig. 6]:

$$r := z; s := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,r); y := s+1 \parallel r := y; z := r$$

This is allowed by WEAKESTMO, but not PS.

Example E.4. Consider the CDRF example from [Lee et al. 2020]:

$$r := INC^{\text{acq,rel}}(x); \text{ if } (r=0)\{y := 1\}$$

$$\parallel r := INC^{\text{acq,rel}}(x); \text{ if } (r=0)\{\text{if } (y)\{x := 0\}\}$$

$$\mathbb{R}^{\text{acq}}x0 \qquad \mathbb{W}^{\text{rel}}x1 \qquad \mathbb{R}y1 \qquad \mathbb{R}$$

Example E.5. Consider this example from [Lee et al. 2020, §C]:


#### E.3 Coherence

 The following execution is disallowed by fulfillment.

$$x := 1; r := x \parallel x := 2; s := x$$

(COH)

Our model is more coherent than Java, which permits the following:

$$r := x; x := 1 \parallel s := x; x := 2$$

$$(TC16)$$

We also forbid the following, which Java allows:

$$x := 1; y^{ra} := 1 \parallel x := 2; z^{ra} := 1 \parallel r := z^{ra}; r := y^{ra}; r := x; r := x$$

$$(co3)$$

The following outcome is allowed by the promising semantics [Kang et al. 2017], but by neither WEAKESTMO [Chakraborty and Vafeiadis 2019a, Fig. 3] nor our semantics, due to the cycle:

$$x := 2; \text{ if } (x \neq 2) \{y := 1\} \parallel x := 1; r := x; \text{ if } (y) \{x := 3\}$$


Since reads are not ordered by intra-thread coherence, we allow the following unintuitive behavior. C11 includes read-read coherence between relaxed atomics in order to forbid this:

$$x := 1; x := 2 \parallel y := x; z := x$$

$$(\mathsf{W}x1) \longrightarrow (\mathsf{W}x2) \qquad (\mathsf{R}x2) \longrightarrow (\mathsf{W}y2) \qquad (\mathsf{R}x1) \longrightarrow (\mathsf{W}z1)$$

Here, the reader sees 2 then 1, although they are written in the reverse order. This behavior is allowed by Java in order to validate CSE without requiring aliasing analysis.

#### E.4 MCA

$$if(z)\{x := 0\}; x := 1 || if(x)\{y := 0\}; y := 1 || if(y)\{z := 0\}; z := 1$$

$$Rz1 \longrightarrow Wx0 \longrightarrow Wx1 \longrightarrow Rx1 \longrightarrow Wy0 \longrightarrow Wy1 \longrightarrow Ry1 \longrightarrow Wz0 \longrightarrow Wz1$$

$$x := 0; x := 1 || y := x || r := y^{ra}; s := x$$

$$Wx0 \longrightarrow Wx1 \longrightarrow Rx1 \longrightarrow Wy1 \longrightarrow R^{acq}y1 \longrightarrow Rx0$$
(MCA2)

These candidate executions are invalid, due to cycles.

#### E.5 IRIW

Status of IRIW is unclear in our model, since we allow everything allowed by power...

$$x := 1 \parallel r := x^{\mathsf{ra}}; s := y \parallel y := 1 \parallel s := y^{\mathsf{ra}}; r := x$$

$$(\mathsf{W}x1) \qquad (\mathsf{R}^{\mathsf{ra}}x1) \longrightarrow (\mathsf{R}y0) \qquad (\mathsf{W}y1) \qquad (\mathsf{R}^{\mathsf{ra}}y1) \longrightarrow (\mathsf{R}x0)$$

### F DIFFERENCES WITH "POMSETS WITH PRECONDITIONS"

We compare the model of this paper (PWT) with that of [Jagadeesan et al. 2020] (PWP).


Substitution. PWP uses substitution rather than Skolemizing. Indeed, our use of Skolemization is motivated by disjunction closure for predicate transformers, which do not appear in PWP. In Figure 2, we gave the semantics of read for nonempty pomsets as:

(R4a) if $(E \cap D) \neq \emptyset$ then $\tau^D(\psi) \equiv v = r \Rightarrow \psi$,

(R4b) if $(E \cap D) = \emptyset$ then $\tau^D(\psi) \equiv (v = r \lor x = r) \Rightarrow \psi$.

In PWP, the definition is roughly as follows:

(R4a') if $(E \cap D) \neq \emptyset$ then $\tau^D(\psi) \equiv \psi[v/r][v/x]$,

(R4b') if $(E \cap D) = \emptyset$ then $\tau^D(\psi) \equiv \psi[v/r][v/x] \wedge \psi[x/r]$.

The use of conjunction in R4b' causes disjunction closure to fail because the predicate transformer  $\tau(\psi) = \psi' \wedge \psi''$  does not distribute through disjunction, even assuming that the prime operations do:  $\tau(\psi_1 \vee \psi_2) = (\psi_1' \vee \psi_2') \wedge (\psi_1'' \vee \psi_2'') \neq (\psi_1' \wedge \psi_1'') \vee (\psi_2' \wedge \psi_2'') = \tau(\psi_1) \vee \tau(\psi_2)$ . See also §3.10.
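The failure of disjunction closure can be checked by brute force on a small instance. The following is a minimal sketch, not the paper's formalization: predicates are modeled as Python functions over a state with fields `x` and `r`, the value `V = 1` and the three-element domain are arbitrary assumptions, and the names `tau_pwp`, `tau_pwt`, and `distributes` are hypothetical helpers introduced for illustration.

```python
from itertools import product

V = 1                       # the value read (v in R4b / R4b'); arbitrary choice
DOMAIN = range(3)           # small finite domain for x and r (an assumption)
STATES = [{"x": x, "r": r} for x, r in product(DOMAIN, DOMAIN)]

def tau_pwp(psi):
    """R4b' (conjunction): psi[v/r][v/x] and psi[x/r]."""
    def result(s):
        dependent = psi({"x": V, "r": V})               # psi[v/r][v/x]
        independent = psi({"x": s["x"], "r": s["x"]})   # psi[x/r]
        return dependent and independent
    return result

def tau_pwt(psi):
    """R4b (implication): (v = r or x = r) implies psi."""
    def result(s):
        antecedent = (V == s["r"]) or (s["x"] == s["r"])
        return (not antecedent) or psi(s)
    return result

def distributes(tau, psi1, psi2):
    """Does tau(psi1 or psi2) agree with tau(psi1) or tau(psi2) on every state?"""
    lhs = tau(lambda s: psi1(s) or psi2(s))
    rhs1, rhs2 = tau(psi1), tau(psi2)
    return all(lhs(s) == (rhs1(s) or rhs2(s)) for s in STATES)

# Two predicates whose disjunction is trivially true.
psi1 = lambda s: s["r"] == 1
psi2 = lambda s: s["r"] != 1

print("R4b' distributes over disjunction:", distributes(tau_pwp, psi1, psi2))
print("R4b  distributes over disjunction:", distributes(tau_pwt, psi1, psi2))
```

For this choice of predicates, the conjunctive transformer maps the disjunction to true but maps the two disjuncts to `x == 1` and false respectively, so the two sides differ whenever `x` is not 1. The implication form keeps a single copy of the postcondition, which is why it distributes.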

The substitutions collapse x and r, allowing local invariant reasoning (LIR), as required by causality test case 1, discussed at the end of §3.7. Without Skolemizing it is necessary to substitute $[x/r]$, since the reverse substitution $[r/x]$ is useless when r is bound; compare with §3.12. As discussed below (Downset closure), including this substitution affects the interaction of LIR and downset closure.

Removing the substitution of [x/r] in the independent case has a technical advantage: we no longer require *extended* expressions (which include memory references), since substitutions no longer introduce memory references.

The substitution [x/r] does not work with Skolemization, even for the dependent case, since we lose the unique marker for each read. In effect, this forces all reads of a location to see the same values. Using this definition, consider the following:

$$r := x; s := x; if(r < s) \{ y := 1 \}$$

$$(Rx1) \qquad (Rx2) \rightarrow (1 = x \Rightarrow 2 = x \Rightarrow x < x \mid Wy1)$$

Although the execution seems reasonable, the precondition on the write is not a tautology.

*Downset Closure*. PWP enforces downset closure in the prefixing rule. Even without this, downset closure would be different for the two semantics, due to the use of substitution in PWP. Consider the final pomset in the last example of §C under the semantics of this paper, which elides the middle read event:

$$x := 0; r := x; if(r \ge 0) \{y := 1\}$$

$$(Wx0) \qquad (r \ge 0 \mid Wy1)$$

In PWP, the substitution [x/r] is performed by the middle read regardless of whether it is included in the pomset. With the subsequent substitution of [0/x] by the preceding write, we have [x/r][0/x], which is [0/r][0/x], resulting in:
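The substitution steps just described can be sketched as follows, assuming the precondition of (Wy1) starts as $r \ge 0$, as in the program shown above:

$$(r \ge 0)[x/r][0/x] \;=\; (x \ge 0)[0/x] \;=\; (0 \ge 0)$$

Under this reading, the precondition of (Wy1) in PWP is a tautology even when the read event is elided, whereas in the pomset of this paper it remains $r \ge 0$.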

Consistency. PWP imposes consistency, which requires that for every pomset P,  $\bigwedge_e \kappa(e)$  is satisfiable. Associativity requires that we allow pomsets with inconsistent preconditions. Consider a variant of the example from §4.5.

$$\begin{array}{cccc} \text{if}(M)\{x:=1\} & \text{if}(!M)\{x:=1\} & \text{if}(M)\{y:=1\} & \text{if}(!M)\{y:=1\} \\ (M\mid \mathsf{W}x1) & (\neg M\mid \mathsf{W}x1) & (M\mid \mathsf{W}y1) & (\neg M\mid \mathsf{W}y1) \end{array}$$

<sup>7</sup> $(\psi_1 \vee \psi_2)' = (\psi_1' \vee \psi_2')$ and $(\psi_1 \vee \psi_2)'' = (\psi_1'' \vee \psi_2'')$.


Associating left and right, we have:

$$\begin{array}{cc} \text{if}(M)\{x:=1\}; \text{if}(!M)\{x:=1\} & \text{if}(M)\{y:=1\}; \text{if}(!M)\{y:=1\} \\ (\mathsf{W}x1) & (\mathsf{W}y1) \end{array}$$

Associating into the middle, instead, we require:

$$\begin{array}{ccc} \text{if}(M)\{x:=1\} & \text{if}(!M)\{x:=1\}; \text{if}(M)\{y:=1\} & \text{if}(!M)\{y:=1\} \\ (M\mid \mathsf{W}x1) & (\neg M\mid \mathsf{W}x1) \quad (M\mid \mathsf{W}y1) & (\neg M\mid \mathsf{W}y1) \end{array}$$

Joining left and right, we have:

$$\begin{array}{c} \text{if}(M)\{x := 1\}; \text{if}(!M)\{x := 1\}; \text{if}(M)\{y := 1\}; \text{if}(!M)\{y := 1\} \\ (\mathsf{W}x1) \qquad (\mathsf{W}y1) \end{array}$$
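The consistency requirement discussed above (that the conjunction of all preconditions in a pomset be satisfiable) can be illustrated with a small brute-force check. This is a sketch under simplifying assumptions: preconditions are modeled as Python functions over a Boolean environment containing only `M`, and `satisfiable` is a hypothetical helper, not part of either semantics.

```python
from itertools import product

def satisfiable(preconditions, variables):
    """True if some Boolean valuation satisfies every precondition at once."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(kappa(env) for kappa in preconditions):
            return True
    return False

M = lambda env: env["M"]
not_M = lambda env: not env["M"]
M_or_not_M = lambda env: M(env) or not_M(env)

# Intermediate pomset when associating into the middle:
# (not M | Wx1) together with (M | Wy1).
middle = [not_M, M]

# Final pomset after joining left and right: both preconditions
# have been merged into M or not M.
final = [M_or_not_M, M_or_not_M]

print("middle pomset consistent:", satisfiable(middle, ["M"]))
print("final pomset consistent:", satisfiable(final, ["M"]))
```

The intermediate pomset is inconsistent although the final one is not, which is the reason a semantics that imposes consistency at every step cannot build the final pomset with this association.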

Causal Strengthening. PWP imposes causal strengthening, which requires for every pomset P, if  $d \le e$  then  $\kappa(e) \models \kappa(d)$ . Associativity requires that we allow pomsets without causal strengthening. Consider the following.

$$\begin{array}{ccc} \text{if}(M)\{r:=x\} & y:=r & \text{if}(!M)\{s:=x\} \\ (M \mid \mathsf{R}x1) & (r=1 \mid \mathsf{W}y1) & (\neg M \mid \mathsf{R}x1) \end{array}$$

Associating left, with causal strengthening:

$$if(M)\{r := x\}; y := r \qquad if(!M)\{s := x\}$$

$$(M \mid Rx1) \rightarrow (M \mid Wy1) \qquad (\neg M \mid Rx1)$$

Finally, merging:

$$\text{if}(M)\{r := x\}; y := r; \text{if}(!M)\{s := x\}$$

$$(Rx1) \longrightarrow (M \mid Wy1)$$

Instead, associating right:

$$\begin{array}{cc} \text{if}(M)\{r := x\} & y := r; \text{if}(!M)\{s := x\} \\ (M \mid \mathsf{R}x1) & (r = 1 \mid \mathsf{W}y1) \quad (\neg M \mid \mathsf{R}x1) \end{array}$$

Merging:

$$\text{if}(M)\{r := x\}; y := r; \text{if}(!M)\{s := x\}$$

$$(Rx1) \qquad (Wy1)$$

With causal strengthening, the precondition of Wy1 depends upon how we associate. This is not an issue in PWP, which always associates to the right.
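The causal strengthening condition (if $d \le e$ then $\kappa(e) \models \kappa(d)$) can likewise be checked by brute force on small pomsets. The sketch below is illustrative only: events are plain strings, preconditions are Python functions over a Boolean environment, and the trivially true precondition for Wy1 in `without_cs` is our assumption about what remains once the read has discharged $r = 1$.

```python
from itertools import product

def valuations(variables):
    """Enumerate all Boolean assignments to the given variable names."""
    for values in product([False, True], repeat=len(variables)):
        yield dict(zip(variables, values))

def entails(p, q, variables):
    """p |= q: every valuation satisfying p also satisfies q."""
    return all(q(env) for env in valuations(variables) if p(env))

def causally_strengthened(kappa, order, variables):
    """Check that kappa[e] |= kappa[d] for each listed pair d <= e."""
    return all(entails(kappa[e], kappa[d], variables) for d, e in order)

order = [("Rx1", "Wy1")]    # the read is ordered before the write

# Assumption: without strengthening, Wy1 keeps a trivially true precondition.
without_cs = {"Rx1": lambda env: env["M"], "Wy1": lambda env: True}

# With strengthening, Wy1 inherits M from the read it depends on.
with_cs = {"Rx1": lambda env: env["M"], "Wy1": lambda env: env["M"]}

print("without strengthening:", causally_strengthened(without_cs, order, ["M"]))
print("with strengthening:", causally_strengthened(with_cs, order, ["M"]))
```

The first pomset fails the check because a true precondition does not entail M. Enforcing the check at each composition step is what makes the final precondition of Wy1 sensitive to association order.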

One use of causal strengthening is to ensure that address dependencies do not introduce thin air reads. Associating to the right, the intermediate state of the example in §4.3 is:

$$s := [r]; x := s$$

$$(r=2 \mid R[2]1) \longrightarrow ((r=2 \Rightarrow 1=s) \Rightarrow s=1 \mid Wx1)$$

In PWP, we have, instead:

$$s := [r]; x := s$$

$$(r=2 \mid R[2]1) \longrightarrow (r=2 \land [2]=1 \mid Wx1)$$

Without causal strengthening, the precondition of (Wx1) would be simply [2]=1. The treatment in this paper, using implication rather than conjunction, is more precise.


Internal Acquiring Reads. The proof of compilation to Arm in PWP assumes that all internal reads can be eliminated. However, this is not the case for acquiring reads. For example, PWP disallows the following execution, where the final value of x is 2 and the final value of y is 2. This execution is allowed by Arm8 and TSO.

$$x := 2; r := x^{\mathsf{acq}}; s := y \parallel y := 2; x^{\mathsf{rel}} := 1$$

$$(Wx2) \longrightarrow (Ry0) \longrightarrow (Wy2) \longrightarrow (W^{\operatorname{rel}}x1)$$

We discussed two approaches to this problem in §A.

Redundant Read Elimination. Contrary to the claim, redundant read elimination fails for PWP. We discussed redundant read elimination in §4.1. Consider JMM Causality Test Case 2, which we discussed there.

$$r := x; s := x; \mathsf{if}(r=s)\{y := 1\} \parallel x := y$$

$$Rx1 \qquad Rx1 \qquad Ry1 \qquad Wx1$$

Under the semantics of PWP, we have

$$r:=x\;;\;s:=x\;;\;\mathsf{if}(r=s)\{y:=1\}$$
 
$$\boxed{\mathsf{R}x1}\quad \boxed{\mathsf{1}=1 \land \mathsf{1}=x \land x=1 \land x=x \mid \mathsf{W}y1}$$

The precondition of (Wy1) is *not* a tautology, and therefore redundant read elimination fails. (It is a tautology in r := x; s := r; if  $(r = s) \{ y := 1 \}$ .) PWP(§3.1) incorrectly stated that the precondition of (Wy1) was  $1 = 1 \land x = x$ .
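The difference between the two preconditions can be confirmed with a brute-force tautology check. This is a minimal sketch: `x` stands for the symbolic contents of the location, the domain `range(4)` is an arbitrary assumption, and `is_tautology` is a hypothetical helper.

```python
def is_tautology(pred, domain=range(4)):
    """True if pred holds for every candidate value of x in the domain."""
    return all(pred(x) for x in domain)

# Precondition computed under the semantics of PWP, as displayed above.
pwp_actual = lambda x: (1 == 1) and (1 == x) and (x == 1) and (x == x)

# Precondition as stated in PWP(§3.1).
pwp_stated = lambda x: (1 == 1) and (x == x)

print("computed precondition is a tautology:", is_tautology(pwp_actual))
print("stated precondition is a tautology:", is_tautology(pwp_stated))
```

The computed precondition fails for any value of `x` other than 1, because of the mixed conjuncts `1 = x` and `x = 1`. The stated precondition omits them.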

Parallel Composition. In PWP( $\S2.4$ ), parallel composition is defined allowing coalescing of events. Here we have forbidden coalescing. This difference appears to be arbitrary. In PWP, however, there is a mistake in the handling of termination actions. The predicates should be joined using  $\land$ , not  $\lor$ .

Read-Modify-Write Actions. In PWP, the atomicity axiom M10c erroneously applies only to overlapping writes, not overlapping reads. The difficulty can be seen in Example E.2.

In addition, PWP uses *READ* instead of *READ'* when calculating dependencies for RMWs. For a discussion, see the example at the end of §4.4.

*Data Race Freedom.* The definition of data race is wrong in PWP. It should require that at least one action is relaxed.

Note that the definition of L-stable applies in the case that conflicting writes are totally ordered. This gives a result more in the spirit of [Dolan et al. 2018]. In particular, this special case of the theorem clarifies the discussion of the PAST example in PWP;


#### G A NOTE ON MIXED-MODE DATA RACES

In preparing this paper, we came across the following example, which appears to invalidate Theorem 4.1 of [Dongol et al. 2019].

$$x := 1; y^{\text{rel}} := 1; r := x^{\text{acq}} \parallel \text{if}(y^{\text{acq}}) \{x^{\text{rel}} := 2\}$$

$$\boxed{Wx1 \longrightarrow W^{\text{rel}}y1} \qquad \boxed{\mathbb{R}^{\text{acq}}x1} \qquad \boxed{\mathbb{R}^{\text{acq}}y1 \longrightarrow W^{\text{rel}}x2}$$

$$\boxed{\mathbb{R}^{\text{acq}}y1 \longrightarrow W^{\text{rel}}x2}$$

The program is data-race free. The two executions shown are the only top-level executions that include  $(W^{rel}x2)$ .

Theorem 4.1 of [Dongol et al. 2019] is stated by extending execution sequences. In the terminology of [Dongol et al. 2019], a read is L-weak if it is sequentially stale. Let  $\rho = (\mathsf{W} x 1)(\mathsf{W}^\mathsf{rel} y 1)(\mathsf{R}^\mathsf{acq} y 1)(\mathsf{W}^\mathsf{rel} x 2)$  be a sequence and  $\alpha = (\mathsf{R}^\mathsf{acq} x 1)$ .  $\rho$  is L-sequential and  $\alpha$  is L-weak in  $\rho\alpha$ . But there is no execution of this program that includes a data race, contradicting the theorem. The error seems to be in Lemma A.4 of [Dongol et al. 2019], which states that if  $\alpha$  is L-weak after an L-sequential  $\rho$ , then  $\alpha$  must be in a data race. That is clearly false here, since ( $\mathsf{R}^\mathsf{acq} x 1$ ) is stale, but the program is data race free.

In proving the SC-LDRF result in PWP(§8), we noted that our proof technique is more robust than that of [Dongol et al. 2019], because it limits the prefixes that must be considered. In (¶), the induction hypothesis requires that we add $(\mathsf{R}^{\mathsf{acq}}x1)$ before $(\mathsf{W}^{\mathsf{rel}}x2)$ since $(\mathsf{R}^{\mathsf{acq}}x1) \dashrightarrow (\mathsf{W}^{\mathsf{rel}}x2)$. In particular,

$$Wx1 \longrightarrow W^{rel}y1 \longrightarrow W^{rel}x2$$

is not a downset of (¶), because $(\mathsf{R}^{\mathsf{acq}}x1) \dashrightarrow (\mathsf{W}^{\mathsf{rel}}x2)$. As we noted in PWP(§8), this affects the inductive order in which we move across pomsets, but does not affect the set of pomsets that are considered. In particular,



is a downset of  $(\P)$ .