1 2

# ANONYMOUS AUTHOR(S)

Program logics and semantics tell us that when executing  $(S_1; S_2)$  starting in state  $s_0$ , we execute  $S_1$  in  $s_0$  to arrive at  $s_1$ , then execute  $S_2$  in  $s_1$  to arrive at the final state  $s_2$ . This is, of course, an abstraction. Processors execute instructions out of order, due to pipelines and caches, and compilers reorder programs even more dramatically. All of this reordering is meant to be unobservable in single-threaded code, but is observable in multi-threaded code. A formal attempt to understand the resulting mess is known as a "relaxed memory model." The relaxed memory models that have been proposed to date either fail to address sequential composition directly, overly restrict processors and compilers, or permit nonsense thin-air behaviors which are unobservable in practice.

To support sequential composition while targeting modern hardware, we propose using preconditions and families of predicate transformers. When composing  $(S_1; S_2)$ , the predicate transformers used to validate the preconditions of events in  $S_2$  are chosen based on the semantic dependencies from events in  $S_1$  to events in  $S_2$ . We apply this approach to two existing memory models: "Modular Relaxed Dependencies" for C11 and "Pomsets with Preconditions."

CCS Concepts: • Theory of computation  $\rightarrow$  Parallel computing models; *Preconditions*.

Additional Key Words and Phrases: Concurrency, Relaxed Memory Models, Multi-Copy Atomicity, ARMv8, Pomsets, Preconditions, Temporal Safety Properties, Thin-Air Reads, Compiler Optimizations

#### **ACM Reference Format:**

Anonymous Author(s). 2022. The Leaky Semicolon: Compositional Semantic Dependencies for Relaxed-Memory Concurrency. *Proc. ACM Program. Lang.* 0, POPL, Article 0 (January 2022), 66 pages.

### 1 INTRODUCTION

Sequentiality is a leaky abstraction [Spolsky 2002]. For example, sequentiality tells us that when executing  $(r_1 := x; y := r_2)$ , the assignment  $r_1 := x$  is executed before  $y := r_2$ . Thus, one might reasonably expect that the final value of  $r_1$  is independent of the initial value of  $r_2$ . In most modern languages, however, this fails to hold when the program is run concurrently with (s := y; x := s), which copies y to x.

In certain cases it is possible to ban concurrent access using separation [O'Hearn 2007], or to accept inefficient implementation in order to obtain sequential consistency [Marino et al. 2015]. When these approaches are not available, however, we are left with an enormous gap in our understanding of one of the most basic elements of computing: the humble semicolon. Until recently, existing approaches either

- did not bother tracking dependencies, allowing "thin air" executions as in C and C++ [Batty et al. 2015],
- tracked dependencies conservatively, using syntax, requiring inefficient implementation of relaxed access [Boehm and Demsky 2014; Kavanagh and Brookes 2018; Lahav et al. 2017;

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

© 2022 Copyright held by the owner/author(s).

2475-1421/2022/1-ART0

https://doi.org/

0:2 Anon.

Vafeiadis and Narayan 2013]— a non-starter for safe languages like Java, and an unacceptable cost for low-level languages like C,

• computed dependencies using non-compositional operational models over alternate worlds [Chakraborty and Vafeiadis 2019; Cho et al. 2021; Jagadeesan et al. 2010; Kang et al. 2017; Lee et al. 2020; Manson et al. 2005]—these models validate many compiler optimizations, but fail to validate temporal safety properties (see §A.1).

Recently, two denotational models have been proposed that compute sequential dependencies semantically. Paviotti et al. [2020] defined Modular Relaxed Dependencies (MRD-c11), which use event structures to calculate dependencies for c11, targeting the Intermediate Memory Model (IMM) [Podkopaev et al. 2019]. Jagadeesan et al. [2020] defined Pomsets with Preconditions (PwP), which use preconditions and logic to calculate dependencies for a Java-like language targeting multicopyatomic (MCA) hardware, such as Arm8 [Pulte et al. 2018]. However, neither paper treated sequential composition as a first-class citizen. MRD-c11 encoded sequential composition using continuation-passing, and PwP used prefixing, adding one event at a time on the left. In both cases, adding an event requires perfect knowledge of the future.

In this paper, we show that PwP can be extended with *families of predicate transformers* (PwT) to calculate sequential dependencies in a way that is *compositional* and *direct: compositional* in that the denotation of  $(S_1; S_2)$  can be computed from the denotation of  $S_1$  and the denotation of  $S_2$ , and *direct* in that these can be calculated independently. The definition is associative: the denotation of  $((S_1; S_2); S_3)$  is the same as the denotation of  $((S_1; (S_2); S_3))$ . It also validates expected laws concerning the interaction of sequencing and conditional execution.

To manage complexity, we have layered the definitions. After an overview and discussion of related work, we define sequential dependencies in §4. We then add concurrency. In §5, we define PwT-mca, which provides a Java-like model for mca hardware, similar to that of Jagadeesan et al. [2020]; §6 summarizes the results for this model. In §7, we define PwT-c11, which models c11, adapting the approach of Paviotti et al. [2020]; §8 describes a tool for automatic evaluation of litmus tests. In §9, we extend the semantics to include additional features, such as address calculation and RMWs.

#### 2 OVERVIEW

 This paper is about the interaction of two of the fundamental building blocks of computing: sequential composition and mutable state. One would like to think that these are well-worn topics, where every issue has been settled, but this is not the case.

# 2.1 Sequential Composition

Introductory programmers are taught sequential abstraction: that the program  $S_1$ ;  $S_2$  executes  $S_1$  before  $S_2$ . Since the late 1960s, we've been able to explain this using logic [Hoare 1969]. In Dijkstra's [1975] formulation, we think of programs as predicate transformers, where predicates describe the state of memory in the system. In the calculus of weakest preconditions, programs map postconditions to preconditions. We recall the definition of  $wp_S(\psi)$  for loop-free code below (where r-s range over thread-local registers and M-N range over side-effect-free expressions).

```
\begin{array}{ll} \text{(D1)} & wp_{\text{skip}}(\psi) = \psi \\ \text{(D2)} & wp_{r:=M}(\psi) = \psi[M/r] \\ \text{(D3)} & wp_{S_1;S_2}(\psi) = wp_{S_1}(wp_{S_2}(\psi)) \\ \text{(D4)} & wp_{\text{if}(M)\{S_1\} \, \text{else}\,\{S_2\}}(\psi) = ((M \neq 0) \Rightarrow wp_{S_1}(\psi)) \wedge ((M = 0) \Rightarrow wp_{S_2}(\psi)) \end{array}
```

For this language, the Hoare triple  $\{\phi\}$  S  $\{\psi\}$  holds exactly when  $\phi \Rightarrow wp_S(\psi)$ . This is an elegant explanation of sequential computation in a sequential context. Note that D2 is sound because a

read from a thread-local register must be fulfilled by a preceding write in the same thread. In a concurrent context, with shared variables (x-z), the obvious generalization

(D2a) 
$$wp_{x:=M}(\psi) = \psi[M/x]$$
  
(D2b)  $wp_{r:=x}(\psi) = \psi[x/r]$ 

is unsound! In particular, a read from a shared memory location may be fulfilled by a write in another thread, invalidating D2b. (We assume that expressions do *not* include shared variables.)

In this paper we answer the following question: what does sequential composition mean in a concurrent context? An acceptable answer must satisfy several desiderata:

- (1) it should not impose too much order, overconstraining the implementation,
- (2) it should not impose too little order, allowing bogus executions, and
- (3) it should be compositional and direct, as described in §1.

Memory models differ in how they navigate between desiderata 1 and 2. In one direction there are both more valid compiler optimizations and also more potentially dubious executions, in the other direction, less of both. To understand the tradeoffs, one must first understand the underlying hardware and compilers.

# 2.2 Memory Models

For single-threaded programs, memory can be thought of as you might expect: programs write to, and read from, memory references. This can be thought of as a total order of reads and writes (pink arrows  $\longrightarrow$ ), where each read has a matching *fulfilling* write (green arrows  $\longrightarrow$ ), for example:

$$x := 0; x := 1; y := 2; r := y; s := x$$
 $(wx0) + (wy2) + (Ry2) + (Rx1)$ 

This model naturally extends to the case of shared-memory concurrency, leading to a *sequentially consistent* semantics [Lamport 1979], in which *program order* inside a thread implies a total *causal order* between read and write events, for example (where; has higher precedence than ||):

$$x := 0; x := 1; y := 2 \parallel r := y; s := x$$
 $(x := 0; x := 1; y := 2 \parallel r := y; s := x)$ 
 $(x := 0; x := 1; y := 2 \parallel r := y; s := x)$ 

Unfortunately, this model does not compile efficiently to commodity hardware, resulting in a 37–73% increase in CPU time on Arm8 [Liu et al. 2019] and, hence, in power consumption. Developers of software and compilers have therefore been faced with a difficult trade-off, between an elegant model of memory, and its impact on resource usage (such as size of data centers, electricity bills and carbon footprint). Unsurprisingly, many have chosen to prioritize efficiency over elegance.

This has led to *relaxed memory models*, in which the requirement of sequential consistency is weakened to only apply *per-location* and not globally over the whole program. This allows executions that are inconsistent with program order, such as:



0:4 Anon.

In such models, the causal order between events is important, and includes control and data dependencies, to avoid paradoxical "out of thin air" examples such as:

$$r := x$$
; if  $(r)\{y := 1\} \parallel s := y$ ;  $x := s$ 

$$(Rx1) \qquad (Ry1) \qquad (Wx1)$$

This candidate execution forms a cycle in causal order, so is disallowed, but this depends crucially on the control dependency from (Rx1) to (Wy1), and the data dependency from (Ry1) to (Wx1). If either is missing, then this execution is acyclic and hence allowed. For example dropping the control dependency results in:

$$r := x ; y := 1 \parallel s := y ; x := s$$

$$(Rx1) \qquad (Ry1) \qquad (Wx1)$$

While syntactic dependency calculation suffices for hardware models, it is not preserved by common compiler optimizations. For example, if we calculate control dependencies syntactically, then there is a dependency from (Rx1) to (Wy1), and therefore a cycle in, the candidate execution:

$$r := x$$
; if  $(r)\{y := 1\}$  else  $\{y := 1\} \parallel s := y$ ;  $x := s$ 

A compiler may lift the assignment y := 1 out of the conditional, thus removing the dependency.

To address this, Jagadeesan et al. [2020] introduced *Pomsets with Preconditions* (PwP), where events are labeled with logical formulae. Nontrivial preconditions are introduced by store actions (modeling data dependencies) and conditionals (modeling control dependencies):

$$if(s<1)\{z:=r*s\}$$

$$(s<1) \land (r*s)=0 \mid Wz0$$

Preconditions are discharged by being ordered after a read (we assume the usual precedence for logical operators— $\neg$ ,  $\land$ ,  $\lor$ ,  $\Rightarrow$ ):

$$r := x; s := y; \text{ if } (s<1)\{z := r*s\}$$

$$(Rx0) \qquad (Ry0) \longrightarrow ((0=s) \Rightarrow (s<1) \land (r*s)=0 \mid Wz0)$$

Note that there is dependency order from (Ry0) to (Wz0) so the precondition for (Wz0) only has to be satisfied assuming the hypothesis (0=s). There is no matching order from (Rx0) to (Wz0) which is why we do not assume the hypothesis (0=r). Nonetheless, the precondition on (Wz0) is a tautology, and so can be elided in the diagram:

$$(Rx0)$$
  $(Ry0)$   $\longrightarrow$   $(Wz0)$ 

### 2.3 Predicate Transformers For Relaxed Memory

Pomsets with Preconditions show how the logical approach to sequential dependency calculation can be mixed into a relaxed memory model. However, Jagadeesan et al. do not provide a model of sequential composition. Instead, their model uses *prefixing*, which requires that the model is built from right to left: events are prepended one at a time, with perfect knowledge of the future. This makes reasoning about sequential program fragments difficult. For example, Jagadeesan et al. state the equivalence allowing reordering independent writes as follows,

$$[x := M; y := N; S] = [y := N; x := M; S]$$
 if  $x \neq y$ 

where S is the entire future computation! By formalizing sequential composition, we can show:

$$[x := M; y := N] = [y := N; x := M]$$
 if  $x \neq y$ 

Then the equivalence holds in any context.

 Predicate transformers are a good fit for logical models of dependency calculation, since both are concerned with preconditions and how they are transformed by sequential composition. Our first attempt is to associate a predicate transformer with each pomset. We visualize this in diagrams by showing how  $\psi$  is transformed, for example:

$$r := x \qquad \qquad s := y \qquad \qquad \text{if}(s < 1) \{z := r * s\}$$

$$(8x0) \cdots (0=r) \Rightarrow \psi \qquad (8y0) \cdots (0=s) \Rightarrow \psi \qquad ((s < 1) \land (r * s) = 0 \mid Wz0) \cdots \psi [r * s/z]$$

The predicate transformer from the write matches Dijkstra's D2a. For the reads, however, D2b defines the transformer of r := x to be  $\psi[x/r]$ , which is equivalent to  $(x=r) \Rightarrow \psi$  under the assumption that registers are assigned at most once. Instead, we use  $(0=r) \Rightarrow \psi$ , reflecting the fact that 0 may come from a concurrent write. The obligation to find a matching write is moved from the sequential semantics of *substitution* and *implication* to the concurrent semantics of *fulfillment*.

For a sequentially consistent semantics, sequential composition is straightforward: we apply each predicate transformer to the preconditions of subsequent events, composing the predicate transformers. (In subsequent diagrams, we only show predicate transformers for reads.)

$$r := x; s := y; \text{ if } (s<1)\{z := r*s\}$$

$$(0=r) \Rightarrow (0=s) \Rightarrow \psi \quad (Rx0) \rightarrow (Ry0) \rightarrow (0=r) \Rightarrow (0=s) \Rightarrow (s<1) \land (r*s)=0 \mid Wz0$$

This model works for the sequentially consistent case, but needs to be weakened for the relaxed case. The key observation of this paper is that rather than working with one predicate transformer, we should work with a *family* of predicate transformers, indexed by sets of events.

For example, for single-event pomsets, there are two predicate transformers, since there are two subsets of any one-element set. The *independent* transformer is indexed by the empty set, whereas the *dependent* transformer is indexed by the singleton. We visualize this by including more than one transformed predicate, with an edge leading to the dependent one. For example:

The model of sequential composition then picks which predicate transformer to apply to an event's precondition by picking the one indexed by all the events before it in causal order.

For example, we can recover the expected semantics for (†) by choosing the predicate transformer which is independent of (Rx0) but dependent on (Ry0), which is the transformer which maps  $\psi$  to (0=s)  $\Rightarrow \psi$ .

$$r := x \; ; \; s := y \; ; \; \text{if} \; (s < 1) \; \{z := r * s\}$$

$$(8x0) \dots (8y0) \longrightarrow (0 = s) \Rightarrow (s < 1) \land (r * s) = 0 \mid Wz0$$

$$(0 = r) \Rightarrow \psi \qquad (0 = r) \Rightarrow (0 = s) \Rightarrow \psi$$

$$(0 = r) \Rightarrow \psi \qquad (0 = s) \Rightarrow \psi$$

In the diagram, the dotted lines indicate set inclusion into the index of the transformer-family. As a sanity check, we can see that sequential composition is associative in this case, since it does not matter whether we associate to the left, with intermediate step:

$$r := x \; ; \; s := y$$

$$\psi \qquad (0=r) \Rightarrow \psi \quad (0=r) \Rightarrow (0=s) \Rightarrow \psi \quad (0=s) \Rightarrow \psi$$

0:6 Anon.

or to the right, with intermediate step:

$$s := y \; ; \; \mathsf{if}(s<1) \{z := r*s\}$$

$$\psi \qquad (0=s) \Rightarrow \psi \iff (\mathbb{R} y0) \implies (0=s) \Rightarrow (s<1) \land (r*s)=0 \mid \mathsf{W} z0$$

This is an instance of the general result that sequential composition forms a monoid.

#### 3 RELATED WORK

 Marino et al. [2015] argue that the "silently shifting semicolon" is sufficiently problematic for programmers that concurrent languages should guarantee sequential abstraction, despite the performance penalties. In this paper, we take the opposite approach. We have attempted to find the most intellectually tractable model that encompasses all of the messiness of relaxed memory.

There are few prior studies of relaxed memory that include sequential composition and/or precise calculation of semantic dependencies. Jagadeesan et al. [2020] give a denotational semantics, using prefixing rather than sequential compositions. Paviotti et al. [2020] give a denotational semantics, calculating dependencies using event structures rather than logic. They give the semantics of sequential composition in continuation passing style, whereas we give it in direct style. This paper provides a general technique for computing sequential dependencies and applies it to these two approaches. We provide a detailed comparison with [Jagadeesan et al. 2020] in §A.2.

Kavanagh and Brookes [2018] define a semantics using pomsets without preconditions. Instead, their model uses syntactic dependencies, thus invalidating many compiler optimizations. They also require a fence after every relaxed read on Arm8. Pichon-Pharabod and Sewell [2016] use event structures to calculate dependencies, combined with an operational semantics that incorporates program transformations. This approach seems to require whole-program analysis.

Other studies of relaxed memory can be categorized by their approach to dependency calculation. Hardware models use syntactic dependencies [Alglave et al. 2014]. Many software models do not bother with dependencies at all [Batty et al. 2011; Cox 2016; Watt et al. 2020, 2019]. Others have strong dependencies that disallow compiler optimizations and efficient implementation, typically requiring fences for every relaxed read on Arm [Boehm and Demsky 2014; Dolan et al. 2018; Jeffrey and Riely 2016; Lahav et al. 2017; Lamport 1979]. Many of the most prominent models are operational, whole-program models based on speculative execution [Chakraborty and Vafeiadis 2019; Cho et al. 2021; Jagadeesan et al. 2010; Kang et al. 2017; Lee et al. 2020; Manson et al. 2005]. We provide a detailed comparison with these approaches in §A.1.

Other work in relaxed memory has shown that tooling is especially useful to researchers, architects, and language specifiers, enabling them to build intuitions experimentally [Alglave et al. 2014; Batty et al. 2011; Cooksey et al. 2019; Paviotti et al. 2020]. Unfortunately, it is not obvious that tools can be built for all thin-air free models, the calculation of Pichon-Pharabod and Sewell [2016] does not have a termination proof for an arbitrary input, and the enormous state space for the operational models of Kang et al. [2017] and Chakraborty and Vafeiadis [2019] is a daunting prospect for a tool builder – and as yet no tool exists for automatically evaluating these models. We describe a tool, PwTer, for automatically evaluating PwT in §8.

#### 4 SEQUENTIAL SEMANTICS

After some preliminaries (§4.1–4.2), we define the basic model and establish some basic properties (§4.3 and Fig. 1). We then explain the model using examples (§4.4–4.9). We encourage readers to skim the definitions and then skip to §4.4, coming back as needed.

#### 4.1 Preliminaries

295 296

297

298

300

301

302

303

304

305

306

307

309

310 311

312

313

314

315

316

318

319

320

321

322

324

325

326

328 329

330

331

332

333

334

335

336

337

338

339

340

341

342 343 The syntax is built from

- a set of values V, ranged over by v, w,  $\ell$ , k,
- a set of registers  $\mathcal{R}$ , ranged over by r, s,
- a set of *expressions*  $\mathcal{M}$ , ranged over by M, N, L.

*Memory references* are tagged values, written  $[\ell]$ . Let  $\mathcal{X}$  be the set of memory references, ranged over by x, y, z. We require that

- values and registers are disjoint,
- values include at least the constants 0 and 1,
- expressions include at least registers and values,
- expressions do *not* include references: M[N/x] = M.

We model the following language.

$$\mu, \nu := \mathsf{rlx} \mid \mathsf{rel} \mid \mathsf{acq} \mid \mathsf{sc}$$

```
S := r := M \mid r := [L]^{\mu} \mid [L]^{\mu} := M \mid \mathsf{F}^{\mu} \mid \mathsf{skip} \mid S_1; S_2 \mid \mathsf{if}(M) \{S_1\} \, \mathsf{else} \, \{S_2\} \mid S_1 \not \!\!\!\!+ S_2 \mid S_1 \not \!\!\!\!+ S_2 \mid S_2 \mid S_1 \not \!\!\!\!+ S_2 \mid S_2 \mid S_1 \not \!\!\!+ S_2 \mid S_
```

Access modes,  $\mu$ , are relaxed (rlx), release (rel), acquire (acq), and sequentially consistent (sc). Let expressions (r := M) only affect thread-local state and thus do not have a mode. Reads ( $r := [L]^{\mu}$ ) support rlx, acq, sc. Writes ( $[L]^{\mu} := r$ ) support rlx, rel, sc. Fences (F $^{\mu}$ ) support rel, acq, sc. In examples, the default mode for reads and writes is rlx—we systematically drop the annotation.

Commands, aka statements, S, include memory accesses at a given mode, as well as the usual structural constructs. Following [Ferreira et al. 1996], # denotes parallel composition, preserving thread state on the right after a join. In examples and sublanguages without join, we use the symmetric  $\|$  operator.

We use common syntax sugar, such as *extended expressions*,  $\mathbb{M}$ , which include memory locations. For example, if  $\mathbb{M}$  includes a single occurrence of x, then  $y := \mathbb{M}$ ; S is shorthand for r := x;  $y := \mathbb{M}[r/x]$ ; S. Each occurrence of x in an extended expression corresponds to an separate read. We also write if  $(M)\{S\}$  as shorthand for if  $(M)\{S\}$  else  $\{skip\}$ .

Throughout  $\S1-8$  we require that

each register is assigned at most once in a program.

In §9, we drop this restriction, requiring instead that

• there are registers  $S_{\mathcal{E}} = \{s_e \mid e \in \mathcal{E}\}\$ , that do not appear in programs:  $S[N/s_e] = S$ .

The semantics is built from the following.

- a set of events  $\mathcal{E}$ , ranged over by e, d, c, and subsets ranged over by E, D, C,
- a set of *logical formulae*  $\Phi$ , ranged over by  $\phi$ ,  $\psi$ ,  $\theta$ ,
- a set of actions  $\mathcal{A}$ , ranged over by a, b,
- a family of *quiescence symbols*  $Q_x$ , indexed by location.

We require that

- formulae include tt, ff,  $Q_x$ , and the equalities (M=N) and (x=M),
- formulae are closed under  $\neg$ ,  $\land$ ,  $\lor$ ,  $\Rightarrow$ , and substitutions [M/r], [M/x],  $[\phi/Q_x]$ ,
- there is a relation  $\models$  between formulae, capturing entailment,
- $\models$  has the expected semantics for =,  $\neg$ ,  $\land$ ,  $\lor$ ,  $\Rightarrow$  and substitutions [M/r], [M/x],  $[\phi/Q_x]$ ,
- there is a subset of  $\mathcal{A}$ , distinguishing *read* actions,
- there are four binary relations over  $\mathcal{A} \times \mathcal{A}$ : delays and matches  $\subseteq blocks \subseteq overlaps$ .

Logical formulae include equations over registers and memory references, such as (r=s+1) and (x=1). We use expressions as formulae, coercing M to  $M\neq 0$ .

0:8 Anon.

We write  $\phi \equiv \psi$  when  $\phi \models \psi$  and  $\psi \models \phi$ . We say  $\phi$  is a *tautology* if  $\mathsf{tt} \models \phi$ . We say  $\phi$  is *unsatisfiable* if  $\phi \models \mathsf{ff}$ , and *satisfiable* otherwise.

# 4.2 Actions in This Paper

 In this paper, we let actions be reads and writes and fences:

```
a, b ::= W^{\mu}xv \mid R^{\mu}xv \mid F^{\mu}
```

We use shorthand when referring to actions. In definitions, we drop elements of actions that are existentially quantified. In examples, we drop elements of actions, using defaults. Let  $\sqsubseteq$  be the smallest order over access and fence modes such that  $r|x \sqsubseteq re| \sqsubseteq sc$  and  $r|x \sqsubseteq acq \sqsubseteq sc$ . We write  $(W^{\exists rel})$  to stand for either  $(W^{rel})$  or  $(W^{sc})$ , and similarly for the other actions and modes.

```
Definition 4.1. Actions (R) are read actions.
```

We say a matches b if a = (Wxv) and b = (Rxv).

We say a blocks b if a = (Wx) and b = (Rx), regardless of value.

We say a overlaps b if they access the same location, regardless of whether they read or write.

Let  $\bowtie_{co}$  capture write-write, read-write coherence:  $\bowtie_{co} = \{(Wx, Wx), (Rx, Wx), (Wx, Rx)\}.$ 

Let  $\ltimes_{\mathsf{sync}}$  capture conflict due to synchronization:  $^1 \ltimes_{\mathsf{sync}} = \{(a, \mathsf{W}^{\exists \mathsf{rel}}), \ (a, \mathsf{F}^{\exists \mathsf{rel}}), \ (\mathsf{R}, \mathsf{F}^{\exists \mathsf{acq}}), \ (\mathsf{R}^{\exists \mathsf{acq}}, b), \ (\mathsf{F}^{\exists \mathsf{rel}}, b), \ (\mathsf{F}^{\exists \mathsf{rel}}, b), \ (\mathsf{W}^{\exists \mathsf{rel}}, b)$ 

Let  $\bowtie_{sc}$  capture conflict due to sc access:  $\bowtie_{sc} = \{(W^{sc}, W^{sc}), (R^{sc}, W^{sc}), (W^{sc}, R^{sc}), (R^{sc}, R^{sc})\}$ . We say a delays b if  $a \bowtie_{co} b$  or  $a \bowtie_{sc} b$ .

### 4.3 PwT: Pomsets with Predicate Transformers

*Predicate transformers* are functions on formulae that preserve logical structure, providing a natural model of sequential composition. The definition comes from Dijkstra [1975]:

```
Definition 4.2. A predicate transformer is a function \tau:\Phi\to\Phi such that
```

```
(x1) \tau(\psi_1 \wedge \psi_2) \equiv \tau(\psi_1) \wedge \tau(\psi_2), (x3) if \phi \models \psi, then \tau(\phi) \models \tau(\psi). (x2) \tau(\psi_1 \vee \psi_2) \equiv \tau(\psi_1) \vee \tau(\psi_2),
```

We consistently use  $\psi$  as the parameter of predicate transformers. Note that substitutions ( $\psi[M/r]$  and  $\psi[M/x]$ ) and implications on the right ( $\phi \Rightarrow \psi$ ) are predicate transformers.

As discussed in §1, predicate transformers suffice for sequentially consistent models, but not relaxed models, where dependency calculation is crucial. For dependency calculation, we use a *family* of predicate transformers, indexed by sets of events. In sequential composition, we will use  $\tau^{\downarrow e}$  as the predicate transformer applied to event e where  $d \in (\downarrow e)$  if d < e.

Definition 4.3. A family of predicate transformers over E consists of a predicate transformer  $\tau^D$  for each  $D \subseteq \mathcal{E}$ , such that if  $C \cap E \subseteq D$  then  $\tau^C(\psi) \models \tau^D(\psi)$ .

We write  $\tau(\psi)$  as an abbreviation of  $\tau^E(\psi)$ .

In a family of predicate transformers, the transformer of a smaller set must entail the transformer of a larger set. Thus bigger sets are *better* and  $\tau(\psi)$ —the transformer of the biggest set—is the *best*. (Note that the definition is written to be insensitive to events outside E.)

In sequential composition, adding more order can only increase the size of  $\downarrow e$ . Following Def. 4.3, the larger  $\downarrow e$  is, the better, at least in terms of satisfying preconditions. Thus more order means weaker preconditions.

*Definition 4.4.* A pomset with predicate transformers (PwT) is a tuple  $(E, \lambda, \kappa, \tau, \checkmark, \lt)$  where

<sup>&</sup>lt;sup>1</sup>Symmetry would suggest that we include  $(Rx, R^{\supseteq acq}x)$ , but this is not sound for Arm8.

```
(M1) E \subseteq \mathcal{E} is a set of events,
393
           (M2) \lambda : E \to \mathcal{A} defines a label for each event.
394
           (M3) \kappa : E \to \Phi defines a precondition for each event,
395
           (M4) \tau: 2^{\mathcal{E}} \to \Phi \to \Phi is a family of predicate transformers over E,
396
           (M5) \checkmark: \Phi is a termination condition, such that
397
                (M5a) \checkmark \models \tau(tt),
398
           (M6) < \subseteq E \times E, is a strict partial order capturing causality,
399
400
        A PwT is complete if
401
```

(c3)  $\kappa(e)$  is a tautology (for every  $e \in E$ ), (c5)  $\checkmark$  is a tautology.

Let P range over pomsets, and P over sets of pomsets. We give the semantics of programs  $[\cdot]$  in Fig. 1. The model has 6 components, which can be daunting at first glance. To aid the reader, we use consistent numbering throughout. For example, item 6 always refers to the order relation.

The core of the model is a pomset, which includes a set of events (M1), a labeling (M2), and an order (M6). As usual, we write  $d \le e$  to mean d < e or d = e. On top of this basic structure, M3-M5 add a layer of logic. For each pomset, M5 provides a termination condition. For each event in a pomset, M3 provides a precondition. For each set of events in a pomset, M4 provides a predicate transformer. Sequential dependency is calculated by  $\kappa'_2$  in the semantics of sequential composition.

Before discussing the details of the model, we note that the semantics satisfies the expected monoid laws and is closed with respect to *augmentation*. Augments include more order and stronger formulae; in examples, we typically consider pomsets that are augment-minimal. One intuitive reading of augment closure is that adding order can only cause preconditions to weaken.

```
Lemma 4.5. (a) \mathcal{P} = (\mathcal{P}; SKIP) = (SKIP; \mathcal{P}).
(b) (\mathcal{P}_1; \mathcal{P}_2); \mathcal{P}_3 = \mathcal{P}_1; (\mathcal{P}_2; \mathcal{P}_3).
```

 PROOF. Straightforward calculation. (a) requires M5a for the termination condition in ( $\mathcal{P}$ ; *SKIP*). (b) requires both conjunction closure (x1, for the termination condition) and disjunction closure (x2, for the predicate transformers themselves). (b) also requires that s6 enforce projection as well as inclusion (see the definition of *respects*).

```
Lemma 4.6. (d) if (\phi) {if (\psi) {\mathcal{P}}} = if (\phi \land \psi) {\mathcal{P}}. (e) if (\phi) {\mathcal{P}_1; \mathcal{P}_3} else {\mathcal{P}_2; \mathcal{P}_3} \supseteq if (\phi) {\mathcal{P}_1} else {\mathcal{P}_2}; \mathcal{P}_3. (f) if (\phi) {\mathcal{P}_1; \mathcal{P}_2} else {\mathcal{P}_1; \mathcal{P}_3} \supseteq \mathcal{P}_1; if (\phi) {\mathcal{P}_2} else {\mathcal{P}_3}. (g) if (\phi) {\mathcal{P}_1} else {\mathcal{P}_2} \supseteq \mathcal{P}_1 if \phi is a tautology. (h) if (\phi) {\mathcal{P}_1} else {\mathcal{P}_2} \supseteq if (\phi) {\mathcal{P}_1}; if (\neg \phi) {\mathcal{P}_2}. (j) if (\phi) {\mathcal{P}_1} else {\mathcal{P}_2} \supseteq if (\neg \phi) {\mathcal{P}_2}; if (\phi) {\mathcal{P}_1}. Proof. Straightforward calculation.
```

In §9.5, we refine the semantics to validate the reverse inclusions for (e), (f), and (g). In §9.2, we refine the semantics to validate the reverse inclusion for (h). For Fig. 1, (i) and (j) can be strengthened to equations, rather than inclusions. However, the reverse direction does not hold for PwT-mcA (§5.1) nor for PwT-po (§7). For further discussion, see §A.9.

*Definition 4.7.*  $P_2$  is an *augment* of  $P_1$  if all fields are equal except, perhaps, the order, where we require  $\leq_2 \geq \leq_1$ .

```
LEMMA 4.8. If P_1 \in [S] and P_2 augments P_1 then P_2 \in [S]. PROOF. Induction on the definition of [\cdot].
```

0:10 Anon.

```
Suppose R_i is a relation in E_i \times E_i. We say R respects R_i if R \supseteq R_i and R \cap (E_i \times E_i) = R_i.
442
           If P \in SKIP then E = \emptyset and \tau^D(\psi) \equiv \psi and \checkmark \equiv tt.
443
444
           If P \in SEQ(\mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) \ (\exists P_2 \in \mathcal{P}_2)
445
           let \kappa'_2(e) = \tau_1^{\downarrow e}(\kappa_2(e)), where \downarrow e = \{c \mid c < e\},
446
                                                                                                     (s4) \tau^D(\psi) \equiv \tau_1^D(\tau_2^D(\psi)),
               (s1) E = (E_1 \cup E_2),
447
                                                                                                      (s5) \ \checkmark \equiv \checkmark_1 \land \overset{\cdot}{\tau}_1(\checkmark_2),
               (s2) \lambda = (\lambda_1 \cup \lambda_2),
448
              (s3a) if e \in E_1 \setminus E_2 then \kappa(e) \equiv \kappa_1(e),
                                                                                                      (s6) < respects <_1 and <_2.
449
              (s3b) if e \in E_2 \setminus E_1 then \kappa(e) \equiv \kappa_2'(e),
450
              (s3c) if e \in E_1 \cap E_2 then \kappa(e) \equiv \kappa_1(e) \vee \kappa_2'(e),
451
           If P \in IF(\phi, \mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)
452
                                                                                                      (I4) \tau^D(\psi) \equiv (\phi \wedge \tau_1^D(\psi)) \vee (\neg \phi \wedge \tau_2^D(\psi)),
                (11) E = (E_1 \cup E_2),
453
                                                                                                      (15) \checkmark \equiv (\phi \land \checkmark_1) \lor (\neg \phi \land \checkmark_2),
                (12) \lambda = (\lambda_1 \cup \lambda_2),
454
              (13a) if e \in E_1 \setminus E_2 then \kappa(e) \equiv \phi \wedge \kappa_1(e),
                                                                                                      (16) < respects <_1 and <_2.
455
              (13b) if e \in E_2 \setminus E_1 then \kappa(e) \equiv \neg \phi \wedge \kappa_2(e),
456
              (13c) if e \in E_1 \cap E_2 then \kappa(e) \equiv (\phi \wedge \kappa_1(e)) \vee (\neg \phi \wedge \kappa_2(e)),
457
           If P \in LET(r, M) then E = \emptyset and \tau^D(\psi) \equiv \psi[M/r] and \checkmark \equiv tt.
458
           If P \in WRITE(x, M, \mu) then (\exists v \in V)
                                                                                                  (w5a) if E \neq \emptyset then \sqrt{\ } \equiv M = v,
              (w1) |E| \leq 1,
461
              (w2) \lambda(e) = W^{\mu}xv,
                                                                                                  (w5b) if E = \emptyset then \checkmark \equiv ff.
              (w3) \kappa(e) \equiv M = v,
            (w4a) if E \neq \emptyset then \tau^D(\psi) \equiv \psi[M/x][M=v/Q_x],
            (w4b) if E = \emptyset then \tau^D(\psi) \equiv \psi[M/x][ff/Q_x],
           If P \in READ(r, x, \mu) then (\exists v \in \mathcal{V})
                                                                                                    (R4c) if E = \emptyset then \tau^D(\psi) \equiv \psi,
               (R1) |E| \leq 1,
467
               (R2) \lambda(e) = R^{\mu} x v,
                                                                                                    (R5a) if E \neq \emptyset or \mu \sqsubseteq \mathsf{rlx} then \checkmark \equiv \mathsf{tt},
                                                                                                    (R5b) if E = \emptyset and \mu \supseteq \text{acq then } \checkmark \equiv \text{ff.}
               (R3) \kappa(e) \equiv Q_x,
469
             (R4a) if e \in E \cap D then \tau^D(\psi) \equiv v = r \Rightarrow \psi.
470
             (R4b) if e \in E \setminus D then \tau^D(\psi) \equiv (v=r \vee x=r) \Rightarrow \psi,
471
472
                          \llbracket r := M \rrbracket = LET(r, M)
                                                                                                                      [skip] = SKIP
473
                          \llbracket r := x^{\mu} \rrbracket = READ(r, x, \mu)
                                                                                                                    [S_1; S_2] = SEQ([S_1], [S_2])
475
                        [x^{\mu} := M] = WRITE(x, M, \mu)
                                                                                       [\inf(M)\{S_1\} \text{ else } \{S_2\}] = IF(M \neq 0, [S_1], [S_2])
```

Fig. 1. PwT Semantics

## 4.4 Pomsets and Complete Pomsets

476 477

478 479 480

481

482

483

484

485

486

487

488

489 490 Ignoring the logic, the definitions are straightforward. Reads and writes map to pomsets with at most one event. skip maps to the empty pomset. Note only that [x := 1] can write any value v; the fact that v must be 1 is captured in the logic.

The structural rules combine pomsets: SEQ and IF perform a union, inheriting labeling and order from the two sides. We say that  $d \in E_1$  and  $e \in E_2$  coalesce if d = e.

As a trivial consequence of using union rather than disjoint union, s1 validates *mumbling* [Brookes 1996] by coalescing events. For example [x := 1; x := 1] includes the singleton pomset  $(w_{x1})$ . From this it is easy to see that  $[x := 1; x := 1] \supseteq [x := 1]$  is a valid refinement. It is equally obvious that

 $[x := 1] \not\supseteq [x := 1; x := 1]$  is not a valid refinement, since the latter includes a two-element pomset, but the former does not.<sup>2</sup>

In complete pomsets, c5 requires that  $\checkmark$  is a tautology, capturing termination. In *WRITE*, w5b ensures that all writes are included in complete pomsets. This also ensures  $[x := 1] \not\supseteq [if(M) \{x := 1\}]$ , since  $[if(M)\{x := 1\}]$  includes the empty set with termination condition  $\neg M$ , but [x := 1] can only include the empty set with termination condition ff.

In addition, w5a ensures that complete pomsets do not include bogus writes. Suppose  $P \in [x:=1]$ . As we noted above, P can include  $(1=v \mid Wxv)$ , for any value v. In complete pomsets, however, w5a requires that  $\checkmark$  implies 1=v. In this case, M3a would filter the pomset, since preconditions must be satisfiable. However, unsatisfiable writes can be become satisfiable via merging:

$$x := 1$$
  $x := 2$  if  $(M)\{x := 3\}$   $(Wx1)$   $(2=3 \mid Wx3)$   $(M \mid Wx3)$ 

By merging, the semantics allows the following:

$$x := 1; x := 2; if(M)\{x := 3\}$$

$$(Wx1) \qquad M \mid Wx3$$

This pomset is incomplete, however, since  $\sqrt{\ } \equiv 2=3$ .

In *READ*,  $\checkmark$  depends on the mode. R5b ensures that all acquiring reads are included in complete pomsets. Instead R5a states that relaxed reads are optional:  $\checkmark$  is alway true for relaxed reads. From this, it is easy to see that  $[r := x] \supseteq [skip]$  is a valid refinement (where the default mode is rlx).

Ignoring predicate transformers, the SEQ rule s5 takes  $\checkmark$  to be  $\checkmark_1 \land \checkmark_2$ . This is as expected: the program terminates if both subprograms terminate.

In  $IF(\phi, \mathcal{P}_1, \mathcal{P}_2)$ , the termination condition (15) is  $(\phi \land \sqrt{1}) \lor (\neg \phi \land \sqrt{2})$ : the program terminates as long as the "true" branch terminates. Thus  $[if(tt)\{x:=1\}]$  contains a complete pomset with exactly one event: (Wx1). To construct this pomset, we take the singleton from the left and the empty set from the right. This is a general principle: for code that contributes no events at top-level, use the empty set.

# 4.5 Preconditions, Predicate Transformers, and Data Dependencies

Preconditions are used to calculate dependencies. They also determine which events can appear in a pomset. In a complete pomset, c3 requires that every precondition  $\kappa(e)$  is a tautology. Using w3, [x := 2] cannot include a complete pomset with event (Wx3), since 2=3 is not a tautology. The symbols  $Q_x$  that occur in R3 and w4 serve similar purpose. We defer discussion of these until §4.8.

Preconditions are discharged during sequential composition by applying predicate transformers  $\tau_1$  from the left to preconditions  $\kappa_2(e)$  on the right. The specific rules are s3b and s3c, which use the transformed predicate  $\kappa_2'(e) = \tau_1^{\downarrow e}(\kappa_2(e))$ , where  $\downarrow e = \{c \mid c < e\}$  is the set of events that precede e in causal order. We call  $\downarrow e$  the *dependent set* for e. Then  $E \setminus (\downarrow e)$  is the *independent set*.

Before looking at the details, it is useful to have a high-level view of how nontrivial preconditions and predicate transformers are introduced. (We discuss address dependencies in §9.4.)

Preconditions are introduced in:

Predicate transformers are introduced in:

- (13) for control dependencies,
- (R4a) for reads in the dependent set,
- (w3) for data dependencies on writes.
- (R4b) for reads in the independent set,
- (w4) for writes.

The rules track dependencies. We discuss data dependencies (w3) here and control dependencies (13) in §4.6. Unless otherwise noted, we assume pomsets are *complete* and *augment-minimal*.

<sup>&</sup>lt;sup>2</sup>These are distinguished by the context:  $[-] \parallel r := x$ ; x := 2; s := x; if  $(r = s)\{z := 1\}$ .

0:12 Anon.

A simple example of a data dependency is a pomset  $P \in [r := x ; y := r]$ . If P is complete, it must have two events. Then SEQ requires that there are  $P_1 \in [r := x]$  and  $P_2 \in [y := r]$  of the form:

$$\begin{array}{c} r:=x & y:=r \\ \hline (v=r\vee x=r)\Rightarrow \psi \quad (\exists xv)^{d} \\ \hline \psi[r/y] \quad (r=w\mid Wyw)^{e} \\ \hline \end{array} \\ \psi[r/y] \end{array}$$

First we consider the case that v = w. For example if v = w = 1, we have:

$$\boxed{ (1 = r \vee x = r) \Rightarrow \psi \ \left[ \mathbb{R}x1 \right]^d \cdot * \left[ 1 = r \Rightarrow \psi \right] } \boxed{ \psi[r/y] \ \left[ r = 1 \mid \mathsf{W}y1 \right]^e \cdot * \left[ \psi[r/y] \right] }$$

For the read, the dependent transformer  $\tau_1^{\{d\}}$  is  $1=r \Rightarrow \psi$ ; the independent transformer  $\tau_1^{\emptyset}$  is  $(1=r \lor x=r) \Rightarrow \psi$ . These are determined by R4a and R4b, respectively. For the write, both  $\tau_2^{\{e\}}$  and  $\tau_2^{\emptyset}$  are  $\psi[r/y]$ , as are determined by W4. Combining these into a single pomset, we have:

$$\begin{array}{c} r:=x\;;\;y:=r\\ \hline (1=r\vee x=r)\Rightarrow \psi[r/y] \end{array} \overbrace{\left(\mathbb{R}x1\right)^d} \cdot * \boxed{1=r\Rightarrow \psi[r/y]} \quad \left(\phi\mid \mathsf{W}y1\right)^e \end{array}$$

By s4, predicate transformers are determined by composition; thus  $\tau^D(\psi)$  is  $\tau^D_1(\tau^D_2(\psi))$ . Since the transformer does not depend on whether the write is included, we do not draw dependencies for the write in the diagram.

Turning to the precondition  $\phi$  on the write, recall that in order for e to participate in a top-level pomset, the precondition  $\phi$  must be a tautology at top-level. There are two possibilities.

- If d < e then we apply the dependent transformer and  $\phi \equiv (1=r \Rightarrow r=1)$ , a tautology.
- If  $d \not< e$  then we apply the independent transformer and  $\phi \equiv ((1=r \lor x=r) \Rightarrow r=1)$ . Under the assumption that r is bound, this is logically equivalent to (x=1). (We make this more precise in §9.2.)

Eliding transformers, the two outcomes are:

$$r := x ; y := r$$
  $r := x ; y := r$   $(\mathbb{R}x1)^d (\mathbb{R}x1)^d (\mathbb{R}x1)^d (\mathbb{R}x1)^d$ 

The independent case on the right can only participate in a top-level pomset if the precondition (x=1) is discharged. To do so, we must prepend a pomset  $P_0$  that writes 1 to x:

$$x := 1 \qquad \qquad x := 1; \ r := x; \ y := r$$

$$\psi[1/x] \left(1 = 1 \mid Wx1\right)^{c} \rightarrow \psi[1/x] \qquad \left(1 = 1 \mid Wx1\right)^{c} \left(Rx1\right)^{d} \left(1 = 1 \mid Wy1\right)^{e}$$

Here we apply the predicate transformer  $\tau_0^0$  to (x=1), resulting in the tautology (1=1). Now suppose that  $v \neq w$  in  $(\dagger \dagger)$ . Again there are two possibilities. Taking v=0 and w=1:

$$r := x \; ; \; y := r$$

$$(Rx0)^{d} \longrightarrow (0=r \Rightarrow r=1 \mid Wy1)^{e}$$

$$(Rx0)^{d} \longrightarrow (0=r \lor x=r) \Rightarrow r=1 \mid Wy1)^{e}$$

Assuming that r is bound, both preconditions on e are unsatisfiable.

If a write is independent of a read, then clearly no order is imposed between them. For example, the precondition of e is a tautology in:

$$\begin{array}{c} r:=x\;;\;y:=1\\ \hline \left(0=r\vee x=r\right)\Rightarrow\psi[r/y]\end{array} \otimes \left(\mathbb{R}x0\right)^{d} \\ & \bullet \left[0=r\Rightarrow\psi[r/y]\right] \\ \end{array} \otimes \left((0=r\vee x=r)\Rightarrow 1=1\mid \mathsf{W}y1\right)^{e}$$

# 4.6 Control Dependencies

589 590

591

592

593

594

595

596

597

598

601 602 603

604

605

608 609

610

612 613

614

615

616

617

618

619 620 621

622

623

624

625 626

627

628

629

630

631 632

633 634

635

636 637 In  $IF(\phi, \mathcal{P}_1, \mathcal{P}_2)$ , the predicate transformer (14) is  $(\phi \wedge \tau_1^D(\psi)) \vee (\neg \phi \wedge \tau_2^D(\psi))$ , which is the disjunctive equivalent of Dijkstra's conjunctive formulation:  $(\phi \Rightarrow \tau_1^D(\psi)) \land (\neg \phi \Rightarrow \tau_2^D(\psi))$ .

This semantics validates dead code elimination: if  $M \neq 0$  is a tautology then  $[if(M) \in S_1]$  else  $\{S_2\}$   $\supseteq$   $S_1$ . The reverse inclusion does not hold.

For events from  $E_1$ , 13a requires  $\phi \wedge \kappa_1(e)$ . For events from  $E_2$ , 13b requires  $\neg \phi \wedge \kappa_2(e)$ . For coalescing events in  $E_1 \cap E_2$ , 13c requires  $(\phi \wedge \kappa_1(e)) \vee (\neg \phi \wedge \kappa_2(e))$ . This semantics allows common code to be lifted out of a conditional, validating the transformation  $\|if(M)\{S\}\| \ge \|S\|$ .

By allowing events to coalesce, 13c ensures that control dependencies are calculated semantically. For example, consider  $P \in [if(r=1)\{y:=r\} \text{ else } \{y:=1\}]$ , which is build from  $P_1 \in [y:=r]$  and  $P_2 \in [y := 1]$  such as:

$$y := r \qquad \qquad y := 1 \qquad \qquad \text{if}(r=1)\{y := r\} \text{ else } \{y := 1\}$$

$$(r=1 \mid Wy1)^{e} \qquad \qquad (r=1 \Rightarrow r=1) \land (r \neq 1 \Rightarrow 1=1) \mid Wy1)^{e}$$

Here, the precondition in the combined pomset is a tautology, independent of r.

Control dependencies are eliminated in the same way as data dependencies. For example:

$$r := x \qquad \qquad \text{if } (r=1) \{y := 1\}$$

$$(v=r \lor x=r) \Rightarrow \psi \qquad \text{Rx} v \xrightarrow{d} v=r \Rightarrow \psi \qquad \qquad \tau_2^{\emptyset}(\psi) \qquad (r=1 \mid Wyw) \xrightarrow{e} \tau_2^{\{e\}}(\psi)$$
where  $\tau_2^{\emptyset}(\psi) \equiv \tau_2^{\{e\}}(\psi) \equiv (r=1 \land \psi[1/y]) \lor (r\neq 1 \land \psi)$ . As for  $(\dagger^{\dagger})$ , there are two possibilities:

$$r := x; \text{ if } (r=1)\{y := 1\}$$

$$(Rx1)^{d} \longrightarrow (1=r \Rightarrow r=1 \mid Wy1)^{e}$$

$$r := x; \text{ if } (r=1)\{y := 1\}$$

$$(Rx1)^{d} \longrightarrow (1=r \lor x=r) \Rightarrow r=1 \mid Wy1)^{e}$$

### 4.7 A Refinement: No Dependencies into Reads

To avoid stalling the CPU pipeline unnecessarily, hardware does not enforce control dependencies between reads. To support if-closure (§9.5), software models must not distinguish control dependencies from other dependencies. Thus, we are forced to drop all dependencies into reads. To achieve this, we modify the definition of  $\kappa_2'$  in Fig. 1.

$$\kappa_2'(e) = \begin{cases} \tau_1(\kappa_2(e)) & \text{if } \lambda(e) \text{ is a read} \\ \tau_1^{\downarrow e}(\kappa_2(e)) & \text{otherwise, where } \downarrow e = \{c \mid c < e\} \end{cases}$$

Thus reads always use the "best" transformer,  $\tau_1$ . In order for non-reads to get a good transformer, they need to add order.

Throughout the remainder of the paper, we use this definition. (The lack of dependencies into reads is one of the factors complicating downset closure; see §A.8 for a discussion.)

## A Subtlety: Local Invariant Reasoning

Two aspects of Fig. 1 that are fairly obscure: First, R4b introduces locations into formula, in order to track the local state of memory. This is necessary to support local invariant reasoning as in JMM Causality Test Case 1 (TC1) [Pugh 2004]. Second, R3 introduces symbols  $Q_x$  to ensure that local state is tracked accurately. Consider TC1:

$$x := 0; (r := x; if(r \ge 0) \{y := 1\} \parallel x := y)$$

$$(TC1)$$

In order to allow this execution, the precondition  $\phi$  must be a tautology. Using R4b and w4a, the precondition is  $((1=r \lor x=r) \Rightarrow r \ge 0) [0/x]$  which is  $((1=r \lor 0=r) \Rightarrow r \ge 0)$  which is indeed a tautology. 0:14 Anon.

Intuitively, R4b says that, to be independent of the read action, subsequent preconditions must be tautological under both  $\lfloor v/r \rfloor$  and  $\lfloor x/r \rfloor$ . Here v is the value read, and x tracks the "local state" of the variable. This idea is borrowed from [Jagadeesan et al. 2020], which includes further examples. (See §4.9 for a discussion of Skolemization.) [Todo: Local invariant reasoning requires that we track the state of variables in the logic, not just registers. This is one reason we use predicate transformers rather than simple postconditions.]

 $Q_x$  only occurs positively in the preconditions of read actions. Thus, per §4.7, it does not add dependencies. Instead, the purpose of  $Q_x$  is to filter out pomsets with stale reads. Recall that all events in a pomset must have tautological preconditions in a complete pomset.  $Q_x$  strengthens the preconditions of reads. Consider the following example from [Paviotti et al. 2020, §6.3]:

Note that the two branches of the conditional are the same after the first assignment in the else branch. Without  $Q_x$ , the precondition  $\phi$  is tt, which is a tautology, and the execution is allowed, resulting in a violation of DRF-SC. To construct this pomset, we have chosen the empty pomset for [x := 0]. The constraints on complete pomsets do not filter out this pomset, since x := 0 is in the false-branch of the conditional. The problem here is that we have forgotten the local state of x in the false-branch of the execution. Nonetheless, we are using the subsequent read.

With  $Q_x$ , the precondition of  $\phi$  is ff. Intuitively,  $Q_x$  requires that the most recent prior write to x must be in the pomset in order to read x.

 $Q_x$  also guarantees initialization in complete pomsets: (c3) requires tautologies, which means that all variables must be initialized sequentially in order to get rid of  $Q_x$ .

The use of  $Q_x$  is not too restrictive. As a sanity check, consider the following pomsets,

$$r:=y\;;\;\mathrm{if}(r)\{x:=1\}\;;\;s:=x$$
 
$$(\exists r,y) + (\exists r,y) + (\exists$$

where  $\tau_1(\psi) = (r \neq 0 \land \psi[1/x][1=1/Q_x]) \lor (r=0 \land \psi)$  and  $\tau_2(\psi) = (r \neq 0 \land \psi[1/x][ff/Q_x]) \lor (r=0 \land \psi)$  are the predicate transformers for the conditional writes, using w4a and w4b respectively. The executions are complete since all preconditions are tautologies.

# 4.9 Associativity and Skolemization

 The predicate transformers we have chosen for R4a and R4b are different from the ones used traditionally, which are written using substitution. Attempting to write R4a and R4b in this style we would have (as in [Jagadeesan et al. 2020]):

```
(R4a') if E \neq \emptyset and (E \cap D) \neq \emptyset then \tau^D(\psi) \equiv \psi[v/r], (R4b') if E \neq \emptyset and (E \cap D) = \emptyset then \tau^D(\psi) \equiv \psi[v/r] \wedge \psi[x/r]. Sadly, this definition fails associativity.
```

Consider the following, eliding transformers for the writes ("!" represents logical negation):

$$r:=y \hspace{1cm} x:=!r \hspace{1cm} x:=!!r \\ \hline \left[\psi[1/r] \wedge \psi[y/r]\right] \hspace{1cm} \left[\mathbb{R}\,y1\right] \rightarrow \left[\psi[1/r]\right] \hspace{1cm} \left[r=0 \mid \mathsf{W}x1\right] \hspace{1cm} \left[r\neq 0 \mid \mathsf{W}x1\right]$$

Coalescing the writes and associating to the right, we have the following, since  $(r=0 \lor r\neq 0) \equiv tt$ :

$$r:=y$$
  $x:=!r; x:=!!r$   $r:=y; (x:=!r; x:=!!r)$   $(\mathbb{R}y1)$   $(\mathbb{R}y1)$   $(\mathbb{R}y1)$ 

The precondition of (Wx1) is a tautology. Associating to the left and the coalescing, instead:

$$r := y \; ; \; x := ! \; r \qquad \qquad x := ! ! \; r \qquad \qquad (r := y \; ; \; x := ! ! \; r) \; ; \; x := ! ! \; r$$
 
$$(\mathbb{R} \; y1) \quad \widehat{(\mathbb{R} \; y1)} \quad \widehat{(\mathbb{R} \; y1)}$$

The precondition  $\phi \equiv (1=0 \land \psi=0) \lor (1\neq 0 \land \psi\neq 0)$ , which is equivalent to  $\psi\neq 0$ , which is not a tautology. This pomset can never be complete due to the bogus write of (x := !r), which will show up in the termination condition (1=0). Nevertheless, this is a problem, since associativity must hold for incomplete pomsets.

Our solution is to Skolemize, replacing substitution by implication, with uniquely chosen registers. Using Fig. 1, we compute a tautology in the example above:  $\phi \equiv ((1=r \lor y=r) \Rightarrow r=0) \lor (r\neq 0)$ .

The proof of associativity requires that predicate transformers distribute through disjunction (Def. 4.2). The attempt to define predicate transformers using substitution fails for R4c because the predicate transformer  $\tau(\psi) = (\forall r)\psi$  does not distribute through disjunction:  $\tau(\psi_1 \lor \psi_2) =$  $(\forall r)(\psi_1 \vee \psi_2) \neq ((\forall r)(\psi_1)) \vee ((\forall r)(\psi_2)) = \tau(\psi_1) \vee \tau(\psi_2)$ . Since  $\tau(\psi) = (\forall r)\psi$  does not distribute through disjunction, we use  $\tau(\psi) = \psi$  instead (which trivially distributes through disjunction). This change means we cannot use substitution, since  $\psi$  does not imply  $\psi[v/r]$ . Fortunately, Skolemizing solves this problem, since  $\psi$  implies  $(r=v) \Rightarrow \psi$ .

# 5 PwT-MCA: POMSETS WITH PREDICATE TRANSFORMERS FOR MCA

We derive a model of concurrent computation by adding parallel composition and reads-from to Fig. 1. To model coherence and synchronization, we add delay to the rule for sequential composition. For MCA architectures, it is sufficient to encode delay in the pomset order. The resulting model, PwT-mcA<sub>1</sub>, supports optimal lowering for relaxed access on Arm8, but requires extra synchronization for acquiring reads.

A variant, PwT-McA2, supports optimal lowering for all access modes on Arm8. To achieve this, PwT-MCA2 drops the global requirement that reads-from implies pomset order (M7c). The models are the same, except for *internal reads*, where a thread reads its own write.

The lowering proofs can be found in §B. The proofs use recent alternative characterizations of Arm8 [Alglave et al. 2021].

### 5.1 PwT-MCA1

687 688

689

690

691

692 693

694

695 696

697

698

699

700

701

702

703 704

705 706

707

708 709

710

711

712

713

714

715

716

717 718

719

720

721

722

723

724

725

726

727 728

729

730

731

732

733

734 735 We define PwT-MCA<sub>1</sub> by extending Def. 4.4 and Fig. 1. The definition uses several relations over actions-matches, blocks and delays-as well a distinguished set of read actions; see §4.2.

```
Definition 5.1. A PwT-MCA<sub>1</sub> is a PwT (Def. 4.4) equipped with a relation rf such that
(M7) rf \subseteq E \times E is an injective relation capturing reads-from, such that
```

(M7a) if  $d \stackrel{\mathsf{rf}}{\longrightarrow} e$  then  $\lambda(d)$  matches  $\lambda(e)$ ,

(M7b) if  $d \xrightarrow{rf} e$  and  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \le d$  or  $e \le c$ ,

(M7c) if  $d \xrightarrow{rf} e$  then d < e.

A PwT-McA is complete if

```
(c7) if \lambda(e) is a read then there is some d \xrightarrow{rf} e.
(c3) (c5) as in Def. 4.4,
```

If  $P \in PAR(\mathcal{P}_1, \mathcal{P}_2)$  then  $(\exists P_1 \in \mathcal{P}_1)$   $(\exists P_2 \in \mathcal{P}_2)$ 

(P1) 
$$E = (E_1 \uplus E_2),$$
 (P4)  $\tau^D(\psi) \equiv \tau_2^D(\psi),$  (P2)  $\lambda = (\lambda_1 \cup \lambda_2),$  (P5)  $\checkmark \equiv \checkmark_1 \land \checkmark_2,$ 

$$(P5) \ \lambda = (\lambda_1 \cup \lambda_2), \tag{P5} \ \forall \equiv \forall_1 \land \forall_2$$

(P3a) if 
$$e \in E_1$$
 then  $\kappa(e) \equiv \kappa_1(e)$ , (P6) < respects <1 and <2,

(P3b) if 
$$e \in E_2$$
 then  $\kappa(e) \equiv \kappa_2(e)$ , (P7) rf  $\supseteq (\mathsf{rf}_1 \cup \mathsf{rf}_2)$ .

If  $P \in SEQ(\mathcal{P}_1, \mathcal{P}_2)$  then  $(\exists P_1 \in \mathcal{P}_1) \ (\exists P_2 \in \mathcal{P}_2)$ 

0:16 Anon.

```
(s1) (s2) (s3) (s4) (s5) (s6) as in Fig. 1, (s7) rf \supseteq (rf<sub>1</sub> \cup rf<sub>2</sub>). (s6a) if \lambda_1(d) delays \lambda_2(e) then d \le e, If P \in IF(\phi, \mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2) (11) (12) (13) (14) (15) (16) as in Fig. 1, (17) rf \supseteq (rf<sub>1</sub> \cup rf<sub>2</sub>).
```

Let  $[S_1 + S_2] = PAR([S_1], [S_2])$ . We write  $[\cdot]_{mca1}$  for the semantic function when it is unclear from context.

In complete pomsets, rf must pair every read with a matching write (c7). The requirements M7a, M7b, and M7c guarantee that reads are *fulfilled*, as in [Jagadeesan et al. 2020, §2.7].

The semantic rules are mostly straightforward: Parallel composition is disjoint union, and all constructs respect reads-from. The monoid laws (Lemma 4.5) extend to parallel composition, with skip as right unit only due to the asymmetry of P4.

Only s6a requires explanation. From Def. 4.1, recall that a delays b if  $a \bowtie_{co} b$  or  $a \bowtie_{sync} b$  or  $a \bowtie_{sc} b$ . s6a guarantees that sequential order is enforced between conflicting accesses of the same location ( $\bowtie_{co}$ ), into a release and out of an acquire ( $\bowtie_{sync}$ ), and between SC accesses ( $\bowtie_{sc}$ ). Combined with the fulfillment requirements (M7a, M7b and M7c), these ensure coherence, publication, subscription and other idioms. For example, consider the following:<sup>3</sup>

$$x := 0; x := 1; y^{\text{rel}} := 1 \parallel r := y^{\text{acq}}; s := x$$

$$(\text{PUB})$$

The execution is disallowed due to the cycle. All of the order shown is required at top-level: The intra-thread order comes from s6a:  $(Wx0) \rightarrow (Wx1)$  is required by  $\bowtie_{co}$ .  $(Wx1) \rightarrow (W^{rel}y1)$  and  $(R^{acq}y1) \rightarrow (Rx0)$  are required by  $\bowtie_{sync}$ . The cross-thread order is required by fulfillment: c7 requires that all top-level reads are in the image of  $\stackrel{rf}{\longrightarrow}$ . M7a ensures that  $(W^{rel}y1) \stackrel{rf}{\longrightarrow} (R^{acq}y1)$ , and M7c subsequently ensures that  $(W^{rel}y1) < (R^{acq}y1)$ . The *antidependency*  $(Rx0) \rightarrow (Wx1)$  is required by M7b. (Alternatively, we could have  $(Wx1) \rightarrow (Wx0)$ , again resulting in a cycle.)

The semantics gives the expected results for store buffering and load buffering, as well as litmus tests involving fences and SC access. The model of coherence is weaker than C11, in order to support common subexpression elimination, and stronger than Java, in order to support local reasoning about data races. For further examples, see §D and [Jagadeesan et al. 2020, §3.1].

Lemmas 4.5 and 4.6 hold for PwT-MCA<sub>1</sub>. For further discussion of items (i) and (j), see §A.9.

#### 5.2 PwT-MCA2

 Lowering PwT-MCA1 to Arm8 requires a full fence after every acquiring read. To see why, consider the following attempted execution, where the final values of both x and y are 2.

$$x := 2; r := x^{\text{acq}}; y := r - 1 \parallel y := 2; x^{\text{rel}} := 1$$
 $(Wx2) \cdot (R^{\text{acq}}x2) \cdot (Wy1) \cdot (Wy2) \cdot (W^{\text{rel}}x1)$ 

The execution is allowed by Arm8, but disallowed by PwT-McA<sub>1</sub>, due to the cycle.

Arm8 allows the execution because the read of x is internal to the thread. This aspect of Arm8 semantics is difficult to model locally. To capture this, we found it necessary to drop M7c and

•  $d \rightarrow e$  arises from  $\bowtie_{co}$  (s6a),

- $d \rightarrow e$  arises from reads-from (M7a),
- $d \rightarrow e$  arises from  $\bowtie_{SVRC}$  or  $\bowtie_{SC}$  (s6a),
- $d \rightarrow e$  arises from blocking (M7b).
- $d \rightarrow e$  arises from control/data/address dependency (s3, definition of  $\kappa'_2(d)$ ),

In PwT-MCA<sub>2</sub>, it is possible for rf to contradict <. In this case, we use a dotted arrow for rf:  $d \cdot \cdot \cdot \rangle$  e indicates that e < d.

<sup>&</sup>lt;sup>3</sup>We use different colors for arrows representing order:

relax s6a, adding local constraints on rf to *PAR*, *SEQ* and *IF*. Rather than ensuring that there is no *global* blocker for a sequentially fulfilled read (M7c), we require only that there is no *thread-local* blocker (s6b). For PwT-MCA<sub>2</sub>, internal reads don't necessarily contribute to order, and thus the above execution is allowed.

Definition 5.2. A PwT-мсA<sub>2</sub> is a PwT (Def. 4.4) equipped with an injective relation rf that satisfies requirements м7a and м7b of Def. 5.1.

A PwT-MCA<sub>2</sub> is *complete* if it satisfies c3, c5, and c7—this is the same as for PwT-MCA<sub>1</sub>. If  $P \in PAR(\mathcal{P}_1, \mathcal{P}_2)$  then  $(\exists P_1 \in \mathcal{P}_1)$   $(\exists P_2 \in \mathcal{P}_2)$ 

```
(P1) (P2) (P3) (P4) (P5) (P6) (P7) as in Def. 5.1, (P6b) if d \in E_1, e \in E_2 and e \xrightarrow{rf} d then e < d, (P6a) if d \in E_1, e \in E_2 and d \xrightarrow{rf} e then d < e, (P7a) rf_i = rf \cap (E_i \times E_i), for i \in \{1, 2\}.
```

If  $P \in SEQ(\mathcal{P}_1, \mathcal{P}_2)$  then  $(\exists P_1 \in \mathcal{P}_1) \ (\exists P_2 \in \mathcal{P}_2)$ 

```
(s1) (s2) (s3) (s4) (s5) (s6) (s7) as in Def. 5.1, (s6b) if \lambda_1(c) blocks \lambda_2(e) and d \xrightarrow{\text{rf}} e (s6a) if \lambda_1(d) delays \lambda_2(e) then either d \xrightarrow{\text{rf}} e then c \le d, or d \le e, (s7a) \text{rf}_i = \text{rf} \cap (E_i \times E_i), for i \in \{1, 2\}.
```

```
If P \in IF(\phi, \mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2) (11) (12) (13) (14) (15) (16) (17) as in Def. 5.1, (17a) \mathsf{rf}_i = \mathsf{rf} \cap (E_i \times E_i), for i \in \{1, 2\}.
```

A PwT-MCA<sub>2</sub> need not satisfy requirement M7c, and thus we may have  $d \stackrel{\text{rf}}{\longrightarrow} e$  and e < d. [Todo: Example using s6a and s6b. To make space, Lemma 5.3 could move to appendix.]

With the weakening of \$6a, we must be careful not to allow spurious pairs to be added to the rf relation. Thus we add P7a, \$7a, and I7a. For example, I7a ensures that  $[if(b)\{r:=x \mid x:=1\}]$  does not include  $[Rx1] \times [Wx1]$ , taking rf from the left and < from the right.

As a consequence of dropping M7c, sequential rf must be validated during pomset construction, rather than post-hoc. In §7, we show how to construct program order (po) for complete pomsets using phantom events ( $\pi$ ). Using this construction, the following lemma gives a post-hoc verification technique for rf.

Lemma 5.3. If  $P \in [S]_{mca2}$  is complete, then for every  $d \xrightarrow{rf} e$  either

- external fulfillment: d < e and if  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \le d$  or  $e \le c$ , or
- internal fulfillment:  $(\exists d' \in \pi^{-1}(d))$   $(\exists e' \in \pi^{-1}(e))$   $d' \stackrel{\text{po}}{\dots} e'$  and  $(\not\exists c')$   $\kappa(c)$  is a tautology and  $\lambda(c)$  blocks  $\lambda(e)$  and  $d' \stackrel{\text{po}}{\dots} e'$ .

These mimic the external consistency requirements of Arm8 [Alglave et al. 2021].

### 6 PwT-MCA RESULTS

PwP [Jagadeesan et al. 2020] is a novel memory model, intended to serve as a semantic basis for a Java-like language, where all access is safe. PwT-mcA generalizes PwP, making several small but significant changes. As a result, we have had to re-prove most of the theorems from PwP.

In §B, we show that PwT-mcA<sub>1</sub> supports the optimal lowering of relaxed accesses to Arm8 and that PwT-mcA<sub>2</sub> supports the optimal lowering of *all* accesses to Arm8. The proofs are based on two recent characterizations of Arm8 [Alglave et al. 2021]. For PwT-mcA<sub>1</sub>, we use *External Global Consistency*. For PwT-mcA<sub>2</sub>, we use *External Consistency*.

In §C, we prove sequential consistency for local-data-race-free programs. The proof uses *program order*, which we construct for c11 in §7. The same construction works for PwT-mca. (This proof assumes there are no RMW operations.)

0:18 Anon.

The semantics validates many peephole optimizations, such as the standard reorderings on relaxed access:

Here id(S) is the set of locations and registers that occur in S. Using augmentation closure, the semantics also validates roach-motel reorderings [Sevčík 2008]. For example, on read/write pairs:

$$[x^{\mu} := M; s := y] \supseteq [s := y; x^{\mu} := M]$$
 if  $x \neq y$  and  $s \notin id(M)$ 
$$[x := M; s := y^{\mu}] \supseteq [s := y^{\mu}; x := M]$$
 if  $x \neq y$  and  $s \notin id(M)$ 

Notably, the semantics does *not* validate read introduction. When combined with case analysis (§9.5), read introduction can break temporal reasoning. This combination is allowed by speculative operational models. See §A.1 for a discussion.

Prop. 6.1 of [Jagadeesan et al. 2020] establishes a compositional principle for proving that programs validate formula in past-time temporal logic. The principal is based entirely on the pomset order relation. It's proof, and all of the no-thin-air examples in [Jagadeesan et al. 2020, §6] hold equally for the models described here.

#### 7 PwT-C11: POMSETS WITH PREDICATE TRANSFORMERS FOR C11

PwT can be used to generate semantic dependencies to prohibit thin-air executions of c11, while preserving optimal lowering for relaxed access. We follow the approach of Paviotti et al. [2020], using our semantics to generate c11 candidate executions with a dependency relation, then applying the rules of Rc11 [Lahav et al. 2017]. The No-Thin-Air axiom of Rc11 is overly restrictive, requiring that rf  $\cup$  po be acyclic. Instead, we require that rf  $\cup$  < is acyclic. This is a more precise categorisation of thin-air behavior, and it allows aggressive compiler optimizations that would be erroneously forbidden by Rc11's original No-Thin-Air axiom.

The chief difficulty is instrumenting our semantics to generate program order, for use in the various axioms of c11.

Definition 7.1. A PwT-po is a PwT (Def. 4.4) equipped with relations  $\pi$  and po such that

```
(M8) \pi: (E \to E) is an idempotent function capturing merging, such that let R = \{e \mid \pi(e) = e\} be real events, let \overline{R} = (E \setminus R) be phantom events, let S = \{e \mid \forall d. \ \pi(d) = e \Rightarrow d = e\} be simple events, let \overline{S} = (E \setminus S) be compound events, (M8a) \lambda(e) = \lambda(\pi(e)), (M8b) if e \in \overline{S} then \kappa(e) \models \bigvee_{\{c \in \overline{R} \mid \pi(c) = e\}} \kappa(c).
```

(M9)  $po \subseteq (S \times S)$  is a partial order capturing program order.

A PwT-po is complete if

```
(c3) if e \in R then \kappa(e) is a tautology, (c5) \checkmark is a tautology.
```

Since  $\pi$  is idempotent, we have  $\pi(\pi(e)) = \pi(e)$ . Equivalently, we could require  $\pi(e) \in R$ .

We use  $\pi$  to partition events E in two ways: we distinguish *real* events R from *phantom* events  $\overline{R}$ ; we distinguish *simple* events R from *compound* events R. From idempotency, it follows that all phantom events are simple ( $\overline{R} \subseteq S$ ) and all compound events are real ( $\overline{S} \subseteq R$ ). In addition, all phantom events map to compound events (if  $e \in \overline{R}$  then  $\pi(e) \in \overline{S}$ ).

Lemma 7.2. If P is a PwT then there is a PwT-po P'' that conservatively extends it.

PROOF. The proof strategy is as follows: We extend the semantics of Fig. 1 with po. The obvious definition gives us a preorder rather than a partial order. To get a partial order, we replay the semantics without merging to get an *unmerged* pomset P'; the construction also produces the map  $\pi$ . We then construct P'' as the union of P and P', using the dependency relation from P.

We extend the semantics with po as follows. For pomsets with at most one event, po is the identity. For sequential composition,  $po = po_1 \cup po_2 \cup E_1 \times E_2$ . For the conditional,  $po = po_1 \cup po_2$ . By construction, po is a pre-order, which may include cycles due to coalescing. For example:

To find an acyclic po', we replay the construction of P to get P'. When building P', we require disjoint union in s1 and s1:  $E' = E'_1 \uplus E'_2$ . If and event is unmerged in P (i.e.  $e \in E_1 \uplus E_2$ ) then we choose the same event name for E' in P'. If an event is merged in P (i.e.  $e \in E_1 \cap E_2$ ) then we choose fresh event names— $e'_1$  and  $e'_2$ —and extend  $\pi$  accordingly:  $\pi(e'_1) = \pi(e'_2) = e$ . In P', we take  $\leq' = \mathsf{po'}$ .

To arrive at P'', we take (1)  $E'' = E \cup E'$ , (2)  $\lambda'' = \lambda \cup \lambda'$ , (3a) if  $e \in E$  then  $\kappa''(e) = \kappa(e)$ , (3b) if  $e \in E' \setminus E$  then  $\kappa''(e) = \kappa'(e)$ , (4)  $\tau''^D = \tau^{(\pi^{-1}(D))}$ , (5)  $\checkmark'' = \checkmark$ , (6) d < '' e exactly when  $\pi(d) < \pi(e)$ , (7) po'' = po', and (8)  $\pi''$  is the constructed merge function.

Definition 7.3. For a PwT-Po, let extract(P) be the projection of P onto the set  $\{e \in E_1 \mid e \text{ is simple and } \kappa_1(e) \text{ is a tautology}\}.$ 

By definition, extract(P) includes the simple events of P whose preconditions are tautologies. These are already in program order, as per item 7 of the proof. The dependency order is derived from the real events using  $\pi$ , as per item 6.

The following lemma shows that if P is *complete*, then extract(P) includes at least one simple event for every compound event in P.

LEMMA 7.4. If P is a complete PwT-PO with compound event e, then there is a phantom event  $c \in \pi^{-1}(e)$  such that  $\kappa(c)$  is a tautology.

A pomset in the image of extract is a candidate execution.

 As an example, consider Java Causality Test Case 6. Taking w = 0 and v = 1, the PwT-Po on the left below produces the candidate execution on the right. In diagrams, we visualize po using a dotted arrow  $\rightarrow$ , and  $\pi$  using a double arrow  $\rightarrow$ .

```
y := w; r := y; if(r=0)\{x := 1\}; if(r=1)\{x := 1\} y := 0; r := y; if(r=0)\{x := 1\}; if(r=1)\{x := 1\} y := 0; r := y; if(r=0)\{x := 1\}; if(r=1)\{x := 1\} y := 0; r := y; if(r=0)\{x := 1\}; if(r=1)\{x := 1\} y := 0; r := y; if(r=0)\{x := 1\}; if(r=1)\{x := 1\} y := 0; r := y; if(r=0)\{x := 1\}; if(r=1)\{x := 1\} y := 0; r := y; if(r=0)\{x := 1\}; if(r=1)\{x := 1\} y := 0; r := y; if(r=0)\{x := 1\}; if(r=1)\{x := 1\}
```

We write  $[\![\cdot]\!]^{po}$  for the semantic function defined by applying the construction of Lemma 7.2 to the base semantics of 1.

The dependency calculation of  $[\![\cdot]\!]^{po}$  is sufficient for c11; however, it ignores synchronization and coherence completely.

if(r){x:=1}; if(s){x:=2}; if(!r){x:=1}  

$$\frac{r\neq 0 \lor r=0 \mid Wx1}{(r\neq 0 \mid Wx1)} \lor (r\neq 0 \mid Wx1)$$
(‡)

0:20 Anon.

Adding a pair of reads to complete the pomset, we can extract the following candidate execution.

$$r := y \; ; \; s := z \; ; \; \text{if}(r)\{x := 1\} \; ; \; \text{if}(s)\{x := 2\} \; ; \; \text{if}(!\,r)\{x := 1\} \; \\ (R\,y1) \dots (R\,z1) \dots (W\,x1) \dots (W\,x2) \qquad \qquad (R\,y0) \dots (R\,z1) \dots (W\,x2) \dots (W\,x1)$$

It is somewhat surprising that the writes are independent of both reads!

In PwT-McA, delay stops the merge in (‡).

if(r){x:=1}; if(s){x:=2}; if(!r){x:=1}  

$$r\neq 0 \mid Wx1$$
  $s\neq 0 \mid Wx2$   $r=0 \mid Wx1$ 

It is possible to mimic this in c11, without introducing extra dependencies: one can filter executions post-hoc using the relation  $\sqsubseteq$ , defined as follows:

$$\pi(d) \sqsubseteq \pi(e)$$
 if  $d \stackrel{\text{po}}{\longleftrightarrow} e$  and  $\lambda(d)$  delays  $\lambda(e)$ .

In (‡), we have both  $d \sqsubseteq e$  and  $e \sqsubseteq d$ . To rule out this execution, it suffices to require that  $\sqsubseteq$  is a partial order.

Program (‡) shows that the definition of semantic dependency is up for debate in c11, and the International Standard Organisation's C++ concurrency subgroup acknowledges that semantic dependency (sdep) would address the Out-of-Thin-Air problem: *Prohibiting executions that have cycles in* rf ∪ sdep *can therefore be expected to prohibit Out-of-Thin-Air behaviors* [McKenney et al. 2016]. PwT-c11 resolves program structure into a dependency relation—not a complex state—that is precise and easily adjusted. As refinements are made to c11, PwT-c11 can accommodate these and test them automatically.

# 8 PwTer: AUTOMATIC LITMUS TEST EVALUATOR

PwTer automatically and exhaustively calculates the allowed outcomes of litmus tests for the PwT, PwT-po, and PwT-c11 models. It is built in OCaml, and uses Z3 [De Moura and Bjørner 2008] to judge the truth of predicates constructed by the models. PwTer obviates the need for error-prone hand evaluation.

PwTer allows several modes of evaluation: it can evaluate the rules of Fig. 1, implementing PwT; it can generate program order according to §7, implementing PwT-po; and similar to Mrd-c11 [Paviotti et al. 2020], it can construct C11-style pre-executions and filter them according to the rules of Rc11 as described in §7. Finally, PwTer also allows us to toggle the complete check of 4.4, providing an interface for understanding how fragments of code might compose by exposing preconditions and termination conditions that are not yet tautologies. We show PwTer in action in Fig. 2. PwTer will be made open source upon publication.

### 9 REFINEMENTS AND ADDITIONAL FEATURES

In the paper so far, we have assumed that registers are assigned at most once. We have done this primarily for readability. In the first subsection below, we drop this assumption, instead using substitution to rename registers. We use the set  $S_{\mathcal{E}} = \{s_e \mid e \in \mathcal{E}\}$ . By assumption (§4.1), these registers do not appear in programs:  $S[N/s_e] = S$ . The resulting semantics satisfies redundant read elimination.

Our approach to register recycling allows us to define a criterion for eliminating certain types of useless pomsets (§9.2).

In the remainder of this section we consider several mostly-orthogonal features: address calculation, if-closure, fences, and read-modify-write operations. Address calculation and if-closure do have some interaction, and we spell out the combined semantics in §9.6.

```
pomsets-with-predicate-transformers — -zsh — 80x24
$ cat data/tests/jctc/jctc1.lit
name=JCTC1
values={0,1}
comment "Should be allowed"
     := x;
  if (r1 ≥
            0) {
    else { skip }
  ||| {
  r2 := y;
  x := r2
allow (r1 = 1 86 r2 = 1) [] "Allowed, since interthread compiler analysis could
determine that x and y are always non-negative, allowing simplification of r1 \geqslant
0 to true, and allowing write y = 1 to be moved early."
  ./pomsets.exe --check --complete data/tests/jctc/jctc1.lit
Allowed, since interthread compiler analysis could determine that x and y are al
 vays non-negative, allowing simplification of r1 \geqslant 0 to true, and allowing writ
    = 1 to be moved early. (pass)
```

Fig. 2. Example output of PwTer, validating TC1 [Pugh 2004].

It is worth pointing out that address calculation and if-closure only affect the semantics of read and write. Fences introduce new trivial actions. RMWs require more infrastructure in order to ensure atomicity while compiling to Arm8 using load-exclusive and store-exclusive.

These extensions preserve all of the program transformation discussed thus far, and apply equally to the various semantics we have discussed: PwT, PwT-mcA<sub>1</sub>, PwT-mcA<sub>2</sub>, and PwT-c11. The results discussed in §6 also apply equally, with the exception of RMWs: we have not proven DRF-sc or Arm8 lowering for RMWs.

### 9.1 Register Recycling and Redundant Read Elimination

981 982

983

984

985

987

988

1000 1001 1002

1003

1004

1005

1006

1007

1008

1009 1010

1011

1012

1013

1014

1015 1016 1017

1018

1019

1020

1021

1022

1023

1024

1026

1027

1028 1029 JMM Test Case 2 [Pugh 2004] states the following execution should be allowed "since redundant read elimination could result in simplification of r=s to true, allowing y:=1 to be moved early."

$$r := x; s := x; if(r=s)\{y := 1\} \parallel x := y$$

$$\stackrel{d}{(\mathbb{R}x1)} \stackrel{(\mathbb{W}y1)}{\longrightarrow} \mathbb{R}y1 \longrightarrow \mathbb{W}x1$$

$$(TC2)$$

Under the semantics of Fig. 1, the precondition of *e* in the independent case is

$$(1=r \lor x=r) \Rightarrow (1=s \lor r=s) \Rightarrow (r=s), \tag{*}$$

which is equivalent to  $(x=r) \Rightarrow (1=s) \Rightarrow (r=s)$ , which is not a tautology, and thus Fig. 1 requires order from d to e in order to complete the pomset.

This execution is allowed, however, if we rename registers using a map from event names to register names. By using this renaming, coalesced events must choose the same register name. In the above example, the precondition of e in the independent case becomes

$$(1=s_e \lor x=s_e) \Rightarrow (1=s_e \lor s_e=s_e) \Rightarrow (s_e=s_e), \tag{**}$$

which is a tautology. In (\*\*), the first read resolves the nondeterminism in both the first and the second read. Given the choice of event names, the outcome of the second read is predetermined!

0:22 Anon.

In (\*), the second read remains nondeterministic, even in the case that the events are destined to coalesce.

*Definition 9.1.* Let  $[\cdot]$  be defined as in Fig. 1, changing R4 of *READ*:

- (R4a) if  $e \in E \cap D$  then  $\tau^D(\psi) \equiv v = s_e \Rightarrow \psi[s_e/r]$ ,
- (R4b) if  $e \in E \setminus D$  then  $\tau^D(\psi) \equiv (v = s_e \lor x = s_e) \Rightarrow \psi[s_e/r]$ ,
- (R4c) if  $E = \emptyset$  then  $(\forall s) \tau^D(\psi) \equiv \psi[s/r]$ .

1030

1031 1032

1033

1034

1035

1037

1043

1045 1046

1047

1049

1050

1051

1052

1053

1054

1055

1056

1057 1058

1059

1060

1061 1062

1063

1064 1065 1066

1067

1068

1069

1070

1071

1072

1073

1074 1075

1076

1077 1078 With this semantics, it is straightforward to see that redundant load elimination is sound:

$$[r := x^{\mu}; s := x^{\mu}] \supseteq [r := x^{\mu}; s := r]$$

As a further example, consider [Sevčík and Aspinall 2008, Fig. 5], referenced in [Paviotti et al. 2020, §6.4]. Consider the case where the reads are merged, both seeing 1:

$$r := y$$
; if  $(r=1)\{s := y; x := s\}$  else  $\{x := 1\}$   $(Ry1)$   $(\phi \mid Wx1)$ 

In order to independent of both reads, we take the precondition  $\phi$  to be:

$$(1=r \lor y=r) \Rightarrow [r=1 \land ((1=s \lor y=s) \Rightarrow s=1)] \lor [r\neq 1]$$

Then collapsing r and s and substituting the initial value of y (say 0), we have a tautology:

$$(1=r \lor 0=r) \Rightarrow [r=1 \land ((1=r \lor 0=r) \Rightarrow r=1)] \lor [r\neq 1]$$

# 9.2 Register Consistency

If a precondition is false, you can be pretty sure it's useless. In this subsection, we develop a criterion for eliminating such useless pomsets.

To achieve this, we would like to bolt a requirement into the definition of pomsets in order to weed out the useless ones. Something like this:

(M3a')  $\kappa(e)$  is satisfiable.

For associativity, (M3a') would in turn require

$$(x4') \tau(ff) \equiv ff.$$

Dijkstra [1975] requires exactly x4'. Problem solved! Unfortunately, our transformer for read actions (R4a) does not obey x4', since ff is not equivalent to  $v=r \Rightarrow$  ff.

In this subsection, we refine these requirements into ones that do hold. The main insight is to pull values for registers from the labels of pomset itself. Thus, we define  $\theta_{\lambda}$  to capture the *register* state of a pomset.

*Definition 9.2.* Let  $\theta_{\lambda} = \bigwedge_{\{(e,v) \in (E \times V) \mid \lambda(e) = (Rv)\}} (s_e = v)$  where  $E = \text{dom}(\lambda)$ . We say that  $\phi$  is  $\lambda$ -consistent if  $\phi \wedge \theta_{\lambda}$  is satisfiable. We say that it is  $\lambda$ -inconsistent otherwise.

Using this, we define the constraint on predicate transformers that we want. We also need to update the definition of predicate transformer families to carry the labeling.

*Definition 9.3.* A  $\lambda$ -predicate transformer is a function  $\tau: \Phi \to \Phi$  such that

- (x3) (x1) (x2) as in Def. 4.2,
- (x4) if  $\psi$  is  $\lambda$ -inconsistent then  $\tau(\psi)$  is  $\lambda$ -inconsistent.

A family of  $\lambda$ -predicate transformers over consists of a  $\lambda$ -predicate transformer  $\tau^D$  for each  $D \subseteq$  $\mathcal{E}$ , such that if  $C \cap E \subseteq D$  then  $\tau^{C}(\psi) \models \tau^{D}(\psi)$ .

```
(M4) \tau: 2^{\mathcal{E}} \to \Phi \to \Phi is a family of \lambda-predicate transformers,
```

Given these definitions, we can add the following requirement to the model, which enables us to prune pomsets that include  $\lambda$ -inconsistent preconditions and termination conditions.

```
(M3a) \kappa(e) is \lambda-consistent, (M5b) \sqrt{} is \lambda-consistent.
```

With this modification, dead-code elimination (Lemma 4.6h) can be changed from an inclusion to an equation:

$$if(\phi)\{\mathcal{P}_1\}$$
 else  $\{\mathcal{P}_2\} = \mathcal{P}_1$  if  $\phi$  is a tautology.

# 9.3 Fences and Read-Modify-Write Operations

 The semantics of fences is straightforward. Let  $\llbracket \mathsf{F}^{\mu} \rrbracket = \mathit{FENCE}(\mu)$ , where if  $P \in \mathit{FENCE}(\mu)$  then

(F1) 
$$|E| \le 1$$
, (F3)  $\kappa(e) \equiv \text{tt}$ , (F5a) if  $E \neq \emptyset$  then  $\sqrt{=}$  tt, (F2)  $\lambda(e) = F^{\mu}$ , (F4)  $\tau^{D}(\psi) \equiv \psi$ , (F5b) if  $E = \emptyset$  then  $\sqrt{=}$  ff.

This semantics is identical to that of [Jagadeesan et al. 2020]; see there for examples.

RMW operations are more complex. To support RMWs, we add a relation  $\subseteq E \times E$  that relates the read of a successful RMW to the succeeding write.

Definition 9.4. Extend the definition of a pomset as follows.

```
(M10) rmw : E \rightarrow E is a partial function capturing read-modify-write atomicity, such that (M10a) if d \xrightarrow{\text{rmw}} e then \lambda(e) blocks \lambda(d), (M10b) if d \xrightarrow{\text{rmw}} e then d < e,
```

(M10c) if  $\lambda(c)$  overlaps  $\lambda(d)$  and  $d \xrightarrow{rmv} e$  then c < e implies  $c \le d$  and d < c implies  $e \le c$ .

Extend the definition of SEQ, IF and PAR to include:

```
(s10) (I10) (P10) rmw = (rmw_1 \cup rmw_2),
```

To define specific operations, we extend the syntax:

```
S := \cdots \mid r := \mathsf{CAS}^{\mu,\nu}([L], M, N) \mid r := \mathsf{FADD}^{\mu,\nu}([L], M) \mid r := \mathsf{EXCHG}^{\mu,\nu}([L], M)
```

We require that *r* does not occur in *L*. The corresponding semantic functions are as follows.

Definition 9.5. Let READ' be defined as for READ, adding the constraint:

```
(R4d) if (E \cap D) = \emptyset then \tau^D(\psi) \equiv \psi.
```

If  $P \in FADD(r, x, M, \mu, \nu)$  then  $P \in SEQ(READ'(r, x, \mu), WRITE(x, r+M, \nu))$  and

If  $P \in EXCHG(r, x, M, \mu, \nu)$  then  $P \in SEQ(READ'(r, x, \mu), WRITE(x, M, \nu))$  and

If  $P \in CAS(r, x, M, N, \mu, \nu)$  then

 $P \in SEQ(READ'(r, x, \mu), IF(r=M, WRITE(x, N, \nu), SKIP))$  and

(U10) if  $\lambda(e)$  is a write then there is a read  $\lambda(d)$  such that  $\kappa(e) \models \kappa(d)$  and  $k \mapsto e$ .

```
[\![r := \mathsf{CAS}^{\mu,\nu}(x,M,N)]\!] = CAS(r, x, M, N, \mu, \nu)
[\![r := \mathsf{FADD}^{\mu,\nu}(x,M)]\!] = FADD(r, x, M, \mu, \nu)
[\![r := \mathsf{EXCHG}^{\mu,\nu}(x,M)]\!] = EXCHG(r, x, M, \mu, \nu)
```

This definition ensures atomicity and supports lowering to Arm load/store exclusive operations. See [Jagadeesan et al. 2020] for examples.

One subtlety of the definition is that we use *READ'* rather than *READ*. Thus, for RMW operations, the independent case for a read is the same as the empty case. To see why this should be, consider the relaxed variant of the CDRF example from [Lee et al. 2020], using *READ* rather than *READ'*.

```
x := 0; (r := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1); \mathsf{if}(!r)\{\mathsf{if}(y)\{x := 0\}\} \parallel r := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1); \mathsf{if}(!r)\{y := 1\})

(\mathsf{W}x0) \longrightarrow (\mathsf{R}x0)^{\mathsf{rmw}} (\mathsf{W}x1) \qquad (\mathsf{R}y1) \longrightarrow (\mathsf{R}x0)^{\mathsf{rmw}} (\mathsf{W}x1) \qquad (\mathsf{W}y1)
```

0:24 Anon.

A write should only be visible to one FADD instruction, but here the write of 0 is visible to two. This is allowed because no order is required from (Rx0) to (Wy1) in the last thread. To see why, consider the independent transformers of the last thread and initializer:

$$x := 0 \qquad \qquad \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1) \qquad \qquad \mathsf{if}(!r)\{y := 1\}$$

$$\boxed{\psi[0/x]} \boxed{\mathsf{W}x0} \qquad \boxed{(0=r \lor x=r) \Rightarrow \psi[1/x]} \boxed{\mathsf{R}x0}^{\mathsf{rmw}} \boxed{\mathsf{W}x1} \qquad \boxed{\psi[1/y]} \boxed{r=0 \mid \mathsf{W}y1}$$

After sequencing, the precondition of (Wy1) is a tautology:  $(0=r \lor 0=r) \Rightarrow r=0$ .

By including R4d, READ' constrains the independent predicate transformer of the FADD:

$$x := 0 \qquad \qquad \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1) \qquad \qquad \mathsf{if}(!\,r)\{y := 1\}$$

$$\psi[0/x] \quad \boxed{\psi[1/x]} \quad \boxed{(\mathsf{R}x0)}^{\mathsf{rmw}} (\mathsf{W}x1) \qquad \qquad \boxed{\psi[1/y]} \quad \boxed{r=0 \mid \mathsf{W}y1}$$

After sequencing, the precondition of (Wy1) is r=0, which is not a tautology. This forces any top-level pomset to include dependency order from (Rx0) to (Wy1).

#### 9.4 Address Calculation

1128

1129

1130 1131

1132

1133 1134

1135

1136

1137

1138

1139 1140

1141

1142 1143

1144

1145

1149

1151

1153

1155

1156

1157

1158

1159

1160

1161 1162

1163

1164

1165

1166

1167 1168 1169

1170

1171

1172

1173

1174 1175 1176 [Todo: We develop a flat memory model, which does not deal with provenance [Lee et al. 2018].]

Inevitably, address calculation complicates the definitions of WRITE and READ. We assume a family of aliasing symbols  $A_n$ , indexed by value. Let  $\phi$  be a formula that may include the pseudovalue \*. Then let  $[\phi/A_{*\neq \ell}]$  be the substitution that, for every  $k \neq \ell$ , replaces  $A_k$  by  $\phi[k/*]$ .

```
Definition 9.6. If P \in WRITE(L, M, \mu) then (\exists \ell \in V) (\exists v \in V)
(w1) if |E| \le 1,
                                                                      (w5a) if E \neq \emptyset then \sqrt{} \equiv L = \ell \land M = v,
```

(w2) 
$$\lambda(e) = W^{\mu}[\ell]v$$
, (w5b) if  $E = \emptyset$  then  $\sqrt{=}$  ff.

(w3)  $\kappa(e) \equiv L = \ell \wedge M = v$ ,

(w4a) if  $E \neq \emptyset$  then  $\tau^D(\psi) \equiv (L=\ell) \Rightarrow \psi[M/[\ell]][M=v/Q_{\lceil \ell \rceil}][tt/A_{\ell}][(L\neq * \land A_*)/A_{*\neq \ell}],$ 

(w4b) if  $E = \emptyset$  then  $(\forall k) \tau^D(\psi) \equiv (L=k) \Rightarrow \psi[M/[k]][ff/Q_{[k]}][ff/A_*],$ 

If  $P \in READ(r, L, \mu)$  then  $(\exists \ell \in \mathcal{V})$   $(\exists v \in \mathcal{V})$ 

```
(R4c) if E = \emptyset then (\forall s) \tau^D(\psi) \equiv \psi[s/r],
  (R1) if |E| \leq 1,
                                                                                            (R5a) if E \neq \emptyset or \mu \sqsubseteq rlx then \checkmark \equiv tt.
  (R2) \lambda(e) = \mathsf{R}^{\mu}[\ell] v
                                                                                           (R5b) if E = \emptyset and \mu \supseteq \text{acq then } \checkmark \equiv \text{ff.}
  (R3) \kappa(e) \equiv L = \ell \wedge Q_{[\ell]},
(R4a) if e \in E \cap D then \tau^D(\psi) \equiv (L=\ell \Rightarrow v=s_e) \Rightarrow \psi[s_e/r],
```

(R4b) if  $e \in E \setminus D$  then  $\tau^D(\psi) \equiv ((L=\ell \Rightarrow v=s_e) \vee ((L=\ell \land A_\ell) \Rightarrow [\ell]=s_e)) \Rightarrow \psi[s_e/r]$ ,

The combination of read-read independency (§4.7) and address calculation is somewhat delicate. Consider the following program, from [Jagadeesan et al. 2020,  $\S 5$ ], where initially x=0, y=0, [0]=0, [1]=2, and [2]=1. It should only be possible to read 0, disallowing the attempted execution below:

$$r := x; s := [r]; y := s \parallel r := y; s := [r]; x := s$$

$$(Rx1) \qquad \qquad (Ry2) \qquad \qquad (Ry2) \qquad \qquad (ADDR1)$$

This execution would become possible, however, if we were to replace  $(L=\ell \Rightarrow v=s_e)$  by  $(v=s_e)$  in R4a. In this case, (Ry2) would not necessarily be dependency ordered before (Wx1).

TC12 describes another potential OOTA behavior. Initially, x=y=0, [a+0]=1, and [a+1]=2:

$$y := x \parallel r := y; [a+r] := 0; s := [a+0]; x := s;$$

$$(Rx1) \longrightarrow (Ry1) \longrightarrow (W[a+1]0) \qquad (R[a+0]1) \qquad \phi \mid Wx1 \qquad (Tc12)$$

Building the precondition  $\phi$  from right to left (without A):

$$\phi_1 = s = 1$$
 (x := s)  

$$\phi_2 = (1 = s \lor [a + 0] = s) \Rightarrow s = 1$$
 (Prepending s := [a + 0])  

$$\phi_3 = ((1 = s \lor [a + 0] = s) \Rightarrow s = 1)[0/[a + 1]]$$
 (Prepending [a + r] := 0)  

$$\phi_4 = (r = 1 \lor y = 1) \Rightarrow (1 = s \lor [a + 0] = s) \Rightarrow s = 1$$
 (Prepending r := y)  

$$\phi_5 = (r = 1 \lor y = 1) \Rightarrow (1 = s \lor 1 = s)) \Rightarrow s = 1$$
 (Prepending [a + 0] := 1)

Adding A fixes that:

$$y := x \parallel r := y ; [a+r] := 0 ; s := [a+0] ; x := s ;$$

$$(Rx1) \longrightarrow (Ry1) \longrightarrow (R[a+1]0) \qquad (R[a+0]1) \longrightarrow (\phi \mid Wx1)$$

A compiler should be able to refine the second thread to:

$$r := y$$
; if  $(r=0)\{[a+0] := 0$ ;  $x := 0\}$  else if  $(r=1)\{[a+1] := 0$ ;  $x := 1\}$ 

Building the precondition  $\phi$  from right to left (assuming that  $x, y \notin \{[a+0], [a+1]\}$  and  $x\neq y$ ):

$$\phi_1 = s = 1 \qquad (x := s)$$

$$\phi_2 = (1 = s \lor (A_{a+0} \Rightarrow [a+0] = s)) \Rightarrow s = 1 \qquad (Prepending s := [a+0])$$

$$\phi_3 = (1 = s \lor ((0 \neq r \land A_{a+0}) \Rightarrow [a+0] = s)) \Rightarrow s = 1 \qquad (Prepending [a+r] := 0)$$

$$\phi_4 = (r = 1) \Rightarrow (1 = s \lor ((0 \neq r \land A_{a+0}) \Rightarrow [a+0] = s)) \Rightarrow s = 1 \qquad (Prepending r := y)$$

$$\phi_5 = (r = 1) \Rightarrow (1 = s \lor (0 \neq r \land tt) \Rightarrow 1 = s)) \Rightarrow s = 1 \qquad (Prepending [a+0] := 1)$$

#### 9.5 If-Closure

In order to model sequential composition, we must allow inconsistent predicates in a single pomset, unlike PwP [Jagadeesan et al. 2020]. For example, if S = (x := 1), then the semantics Fig. 1 does *not* allow:

if(M){x:=1}; S; if(
$$\neg M$$
){x:=1}  

$$(\mathbb{W}x1) \rightarrow (\mathbb{W}x1)$$

However, if  $S = (if(\neg M)\{x := 1\}; if(M)\{x := 1\})$ , then it *does* allow the execution. Looking at the initial program:

The difficulty is that the middle action can coalesce either with the right action, or the left, but not both. Thus, we are stuck with some non-tautological precondition. Our solution is to allow a pomset to contain many events for a single action, as long as the events have disjoint preconditions.

Def. 9.7 allows the execution, by splitting the middle command:

Coalescing events gives the desired result.

 0:26 Anon.

This is not simply a theoretical question; it is observable. For example, the semantics of Fig. 1 does not allow the following, since it must add order in the first thread from the read of y to one of the writes to x.

*Definition 9.7.* Let  $E \subseteq \mathcal{E}$  and  $\theta : E \to \Phi$  and  $\Omega \in \Phi$ . We say that  $\theta$  partitions  $\Omega$  if (1) if  $\theta_e \wedge \theta_d$  is satisfiable then e = d, (2)  $\Omega \equiv \bigvee_{e \in E} \theta_e$ .

If  $P \in WRITE(x, M, \mu)$  then  $(\exists v : E \to V)$   $(\exists \theta : E \to \Phi)$   $(\exists \Omega \in \Phi)$ 

(w1) 
$$\theta$$
 partitions  $\Omega$ ,   
(w2)  $\lambda(e) = W^{\mu}xv_e$ ,   
(w4)  $\tau^D(\psi) \equiv \bigwedge_{e \in E}(\theta_e \Rightarrow \psi[M/x][M=v_e/Q_x])$   
 $\wedge \neg \Omega \Rightarrow \psi[M/x][ff/Q_x]$ 

$$(w3) \kappa(e) \equiv \theta_e \wedge M = v_e, \qquad (w5) \checkmark \equiv \Omega \wedge \bigwedge_{e \in E} (\theta_e \Rightarrow M = v_e).$$

If 
$$P \in READ(r, x, \mu)$$
 then  $(\exists v : E \to V) (\exists \theta : E \to \Phi) (\exists \Omega \in \Phi)$ 

(R1)  $\theta$  partitions  $\Omega$ , (R5a) if  $\mu \sqsubseteq \text{rlx then } \sqrt{\equiv} \text{ tt}$ ,

(R2) 
$$\lambda(e) = R^{\mu} x v_e$$
 (R5b) if  $\mu \supseteq acq then \checkmark \equiv \Omega$ .

(R3)  $\kappa(e) \equiv \theta_e \wedge Q_x$ ,

(R4) 
$$(\forall s)\tau^{D}(\psi) \equiv \bigwedge_{e \in E \cap D} (\theta_{e} \Rightarrow v_{e} = s_{e} \Rightarrow \psi[s_{e}/r])$$
  
 $\wedge \bigwedge_{e \in E \setminus D} (\theta_{e} \Rightarrow (v_{e} = s_{e} \lor x = s_{e}) \Rightarrow \psi[s_{e}/r])$   
 $\wedge \neg \Omega \Rightarrow \psi[s/r]$ 

The definition allows multiple events to represent a single action, each with a disjoint precondition. The predicate transformers are derived from those defined for the conditional.

This modification validates Lemma 4.6e, f, and g as equations.

We show how to combine address calculation and if-closure in §9.6.

# 9.6 Combining Address Calculation and If-Closure

Def. 9.6 is naive with respect to merging events. Consider the following example:

Merging, we have:

if(M){[r]:=0; [0]:=!r}else{[r]:=0; [0]:=!r}  

$$c$$
 (r=1 | W[1]0)  $d$  (r=0 \ r=1 | W[0]0)  $e$  (r=0 | W[0]1)

The precondition of W [0] 0 is a tautology; however, this is not possible for ([r] := 0; [0] := !r) alone, using Def. 9.6.

Def. 9.8, enables this execution using if-closure. Under this semantics, we have:

$$[r] := 0 \qquad [0] := !r$$

$${}^{c}(r=1 \mid W[1]0) \stackrel{d}{(r=0 \mid W[0]0)} \stackrel{e}{(r=1 \mid W[0]0)} \stackrel{e}{(r=0 \mid W[0]1)}$$
and merging:

Sequencing and merging:

$$[r] := 0; [0] := !r$$

$${}^{c}(r=1 \mid W[1]0) \xrightarrow{d} (r=0 \lor r=1 \mid W[0]0) \xrightarrow{e} (r=0 \mid W[0]1)$$

The precondition of (W[0]0) is a tautology, as required

```
[Todo: Add A_n.]
1275
1276
                  Definition 9.8. If P \in WRITE(L, M, \mu) then (\exists \ell : E \to \mathcal{V}) (\exists v : E \to \mathcal{V}) (\exists \theta : E \to \Phi) (\exists \Omega \in \Phi)
                                                                                                                     (w5) \checkmark \equiv \Omega \land \bigwedge_{e \in F} (\theta_e \Rightarrow L = \ell_e \land M = v_e).
                (w1) \theta partitions \Omega,
1278
                (w2) \lambda(e) = W^{\mu}[\ell]v_e,
                (w3) \kappa(e) \equiv \theta_e \wedge L = \ell_e \wedge M = v_e,
1280
                (w4) (\forall k) \tau^D(\psi) \equiv \bigwedge_{e \in E} (\theta_e \Rightarrow (L = \ell_e) \Rightarrow \psi[M/[\ell_e]][M = v_e/Q_{\lceil \ell_e \rceil}])
                                                   \wedge \neg \Omega \Rightarrow (L=k) \Rightarrow \psi[M/[k]][ff/Q_{[k]}]
1282
1284
            If P \in READ(r, L, \mu) then (\exists \ell : E \to V) (\exists v : E \to V) (\exists \theta : E \to \Phi) (\exists \Omega \in \Phi)
1285
                                                                                                                    (R5a) if \mu \sqsubseteq rlx then \checkmark \equiv tt,
                 (R1) \theta partitions \Omega,
1286
                 (R2) \lambda(e) = \mathsf{R}^{\mu} [\ell] v_e
                                                                                                                   (R5b) if \mu \supseteq \text{acg then } \checkmark \equiv \Omega.
                 (R3) \kappa(e) \equiv \theta_e \wedge L = \ell_e \wedge Q_{\lceil \ell \rceil},
                 (R4) \ (\forall s)\tau^{D}(\psi) \equiv \bigwedge_{e \in E \cap D} (\theta_{e} \Rightarrow (L = \ell_{e} \Rightarrow v_{e} = s_{e}) \Rightarrow \psi[s_{e}/r])
                                                   \land \land_{e \in E \setminus D} (\theta_e \Rightarrow ((L = \ell_e \Rightarrow v_e = s_e) \lor (L = \ell_e \Rightarrow [\ell] = s_e)) \Rightarrow \psi[s_e/r])
                                                   \wedge \neg \Omega \Rightarrow \psi[s/r].
```

#### 10 CONCLUSIONS

1292

1293

1294

1295

1296

1297

1298

1299

1300

1301

1302

1303

1304

1305

1306

1307

1308

1310 1311

1312

1313

1314

1315

1316

1317

1318

1319

1320

1321

1322

This paper is the first to present a direct denotational semantics for sequential composition in a relaxed memory model that can be efficiently compiled to modern CPUs. We extract from this model a semantic dependency relation and use it to build PwT-c11, a solution to the Out-of-Thin-Air problem in c11, and PwT-mca, a model for Java-like languages.

We have not treated loops in this model, though we expect that the usual approach of showing continuity for all the semantic operations with respect to set inclusion would go through. Paviotti et al. [2020] use step-indexing to account for loops; a similar approach could be applied here.

PwT-MCA does not validate access elimination: store-forwarding and dead-write-removal are unsound. We expect that these can be validated by allowing events with different labels to merge.

PwT-MCA<sub>1</sub> is a simpler model than PwT-MCA<sub>2</sub>, but requires fences on acquiring reads for Arm8. It would be illuminating to find out what the performance penalty is for these fences.

PwT does not validate read introduction, whereas speculative operational semantics do. Recent work shows a tension between read introduction and compositional reasoning for temporal safety properties (see §A.1). Nonetheless, read introduction is ubiquitous is some compilers. It would be interesting to know if there is a performance penalty for banning read introduction.

#### **REFERENCES**

Jade Alglave, Will Deacon, Richard Grisenthwaite, Antoine Hacquard, and Luc Maranget. 2021. Armed Cats: Formal Concurrency Modelling at Arm. ACM Trans. Program. Lang. Syst. 43, 2, Article 8 (July 2021), 54 pages. https://doi.org/10.1145/3458926

Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory. ACM Trans. Program. Lang. Syst. 36, 2, Article 7 (July 2014), 74 pages. https://doi.org/10.1145/2627752
 Mark Batty. 2015. The C11 and C++11 concurrency model. Ph.D. Dissertation. University of Cambridge, UK.

Mark Batty, Kayvan Memarian, Kyndylan Nienhuis, Jean Pichon-Pharabod, and Peter Sewell. 2015. The Problem of Programming Language Concurrency Semantics. In Programming Languages and Systems - 24th European Symposium on Programming, ESOP 2015, London, UK, April 11-18, 2015. Proceedings (Lecture Notes in Computer Science, Vol. 9032), Jan Vitek (Ed.). Springer, 283–307. https://doi.org/10.1007/978-3-662-46669-8\_12

Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. 2011. Mathematizing C++ Concurrency. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (Austin, Texas, USA) (POPL '11). ACM, New York, NY, USA, 55–66. https://doi.org/10.1145/1926385.1926394

Proc. ACM Program. Lang., Vol. 0, No. POPL, Article 0. Publication date: January 2022.

Hans-J. Boehm. 2019. Out-of-thin-air, Revisited, Again (Revision 2). https://wg21.link/p1217.

0:28 Anon.

Hans-J. Boehm and Brian Demsky. 2014. Outlawing Ghosts: Avoiding Out-of-thin-air Results. In *Proceedings of the Work-shop on Memory Systems Performance and Correctness* (Edinburgh, United Kingdom) (MSPC '14). ACM, New York, NY, USA, Article 7, 6 pages. https://doi.org/10.1145/2618128.2618134

- Stephen D. Brookes. 1996. Full Abstraction for a Shared-Variable Parallel Language. *Inf. Comput.* 127, 2 (1996), 145–163. https://doi.org/10.1006/inco.1996.0056
- Pietro Cenciarelli, Alexander Knapp, and Eleonora Sibilio. 2007. The Java Memory Model: Operationally, Denotationally,
  Axiomatically. In Programming Languages and Systems, 16th European Symposium on Programming, ESOP 2007, Braga,
  Portugal, March 24 April 1, 2007, Proceedings (Lecture Notes in Computer Science, Vol. 4421), Rocco De Nicola (Ed.).
  Springer, 331–346. https://doi.org/10.1007/978-3-540-71316-6\_23
- Soham Chakraborty and Viktor Vafeiadis. 2017. Formalizing the concurrency semantics of an LLVM fragment. In Proceedings of the 2017 International Symposium on Code Generation and Optimization, CGO 2017, Austin, TX, USA, February 4-8, 2017, Vijay Janapa Reddi, Aaron Smith, and Lingjia Tang (Eds.). ACM, 100–110. http://dl.acm.org/citation.cfm?id= 3049844
- Soham Chakraborty and Viktor Vafeiadis. 2019. Grounding thin-air reads with event structures. *PACMPL* 3, POPL (2019), 70:1–70:28. https://doi.org/10.1145/3290383
- Minki Cho, Sung-Hwan Lee, Chung-Kil Hur, and Ori Lahav. 2021. Modular Data-Race-Freedom Guarantees in the Promising Semantics. *Proc. ACM Program. Lang.* 2, PLDI. To Appear.
- Simon Cooksey, Sarah Harris, Mark Batty, Radu Grigore, and Mikoláš Janota. 2019. PrideMM: Second Order Model Checking for Memory Consistency Models. In *Formal Methods. FM 2019 International Workshops*. Springer International Publishing, 507–525.
  - Russ Cox. 2016. Go's Memory Model. http://nil.csail.mit.edu/6.824/2016/notes/gomem.pdf.

1347

1353

1355

1361

1362

1363

1364

1365

1368

1369

1370

1371 1372

- Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In *Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems* (Budapest, Hungary) (*TACAS'08/ETAPS'08*). Springer-Verlag, Berlin, Heidelberg, 337–340.
- Edsger W. Dijkstra. 1975. Guarded Commands, Nondeterminacy and Formal Derivation of Programs. *Commun. ACM* 18, 8 (1975), 453–457. https://doi.org/10.1145/360933.360975
  - Stephen Dolan, KC Sivaramakrishnan, and Anil Madhavapeddy. 2018. Bounding Data Races in Space and Time. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (Philadelphia, PA, USA) (PLDI 2018). ACM, New York, NY, USA, 242–255. https://doi.org/10.1145/3192366.3192421
- Brijesh Dongol, Radha Jagadeesan, and James Riely. 2019. Modular transactions: bounding mixed races in space and time. In *Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019, Washington, DC, USA, February 16-20, 2019,* Jeffrey K. Hollingsworth and Idit Keidar (Eds.). ACM, 82–93. https://doi.org/10.1145/3293883.3295708
  - William Ferreira, Matthew Hennessy, and Alan Jeffrey. 1996. A Theory of Weak Bisimulation for Core CML. In *Proceedings of the 1996 ACM SIGPLAN International Conference on Functional Programming, ICFP 1996, Philadelphia, Pennsylvania, USA, May 24-26, 1996*, Robert Harper and Richard L. Wexelblat (Eds.). ACM, 201–212. https://doi.org/10.1145/232627.232649
  - Shaked Flur, Kathryn E. Gray, Christopher Pulte, Susmit Sarkar, Ali Sezgin, Luc Maranget, Will Deacon, and Peter Sewell. 2016. Modelling the ARMv8 architecture, operationally: concurrency and ISA. In *Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20 22, 2016*, Rastislav Bodík and Rupak Majumdar (Eds.). ACM, 608–621. https://doi.org/10.1145/2837614.2837615
- 1357 C.A.R. Hoare. 1969. An Axiomatic Basis for Computer Programming. *Commun. ACM* 12, 10 (Oct. 1969), 576–580. https://doi.org/10.1145/363235.363259
- Radha Jagadeesan, Alan Jeffrey, and James Riely. 2020. Pomsets with preconditions: a simple model of relaxed memory. Proc. ACM Program. Lang. 4, OOPSLA (2020), 194:1–194:30. https://doi.org/10.1145/3428262
  - Radha Jagadeesan, Corin Pitcher, and James Riely. 2010. Generative Operational Semantics for Relaxed Memory Models. In Programming Languages and Systems, 19th European Symposium on Programming, ESOP 2010, Paphos, Cyprus, March 20-28, 2010. Proceedings (Lecture Notes in Computer Science, Vol. 6012), Andrew D. Gordon (Ed.). Springer, 307–326. https://doi.org/10.1007/978-3-642-11957-6\_17
  - Alan Jeffrey and James Riely. 2016. On Thin Air Reads Towards an Event Structures Model of Relaxed Memory. In *Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science, LICS '16, New York, NY, USA, July 5-8, 2016*, M. Grohe, E. Koskinen, and N. Shankar (Eds.). ACM, 759–767. https://doi.org/10.1145/2933575.2934536
- Alan Jeffrey and James Riely. 2019. On Thin Air Reads: Towards an Event Structures Model of Relaxed Memory. Logical Methods in Computer Science 15, 1 (2019), 25 pages. https://doi.org/10.23638/LMCS-15(1:33)2019
  - Jeehoon Kang, Chung-Kil Hur, Ori Lahav, Viktor Vafeiadis, and Derek Dreyer. 2017. A promising semantics for relaxed-memory concurrency. In *Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL 2017, Paris, France, January 18-20, 2017*, Giuseppe Castagna and Andrew D. Gordon (Eds.). ACM, 175–189. http://dl.acm.org/citation.cfm?id=3009850

Ryan Kavanagh and Stephen Brookes. 2018. A denotational account of C11-style memory. CoRR abs/1804.04214 (2018). arXiv:1804.04214 http://arxiv.org/abs/1804.04214

- Ori Lahav and Viktor Vafeiadis. 2016. Explaining Relaxed Memory Models with Program Transformations. In FM 2016:

  Formal Methods 21st International Symposium, Limassol, Cyprus, November 9-11, 2016, Proceedings (Lecture Notes in Computer Science, Vol. 9995), John S. Fitzgerald, Constance L. Heitmeyer, Stefania Gnesi, and Anna Philippou (Eds.).

  Springer, 479–495. https://doi.org/10.1007/978-3-319-48989-6\_29
- Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. 2017. Repairing sequential consistency in C/C++11. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017, Albert Cohen and Martin T. Vechev (Eds.). ACM, 618-632. https://doi.org/10.1145/3062341.3062352
- Leslie Lamport. 1979. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. *IEEE Trans. Comput.* 28, 9 (Sept. 1979), 690–691. https://doi.org/10.1109/TC.1979.1675439
- Juneyoung Lee, Chung-Kil Hur, Ralf Jung, Zhengyang Liu, John Regehr, and Nuno P. Lopes. 2018. Reconciling high-level
   optimizations and low-level code in LLVM. *Proc. ACM Program. Lang.* 2, OOPSLA (2018), 125:1–125:28. https://doi.org/
   10.1145/3276495
- Juneyoung Lee, Yoonseung Kim, Youngju Song, Chung-Kil Hur, Sanjoy Das, David Majnemer, John Regehr, and Nuno P. Lopes. 2017. Taming undefined behavior in LLVM. In *Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017*, Albert Cohen and Martin T. Vechev (Eds.). ACM, 633–647. https://doi.org/10.1145/3062341.3062343
- Sung-Hwan Lee, Minki Cho, Anton Podkopaev, Soham Chakraborty, Chung-Kil Hur, Ori Lahav, and Viktor Vafeiadis.
   2020. Promising 2.0: global optimizations in relaxed memory concurrency. In Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020, Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 362–376. https://doi.org/10.1145/3385412.3386010
- Lun Liu, Todd Millstein, and Madanlal Musuvathi. 2019. Accelerating Sequential Consistency for Java with Speculative Compilation. In *Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation* (Phoenix, AZ, USA) (*PLDI 2019*). ACM, New York, NY, USA, 16–30. https://doi.org/10.1145/3314221.3314611
- Nuno Lopes. 2016. RFC: Killing undef and spreading poison. https://lists.llvm.org/pipermail/llvm-dev/2016-October/106182.html.
- Jeremy Manson, William Pugh, and Sarita V. Adve. 2005. The Java Memory Model. SIGPLAN Not. 40, 1 (Jan. 2005), 378–391. https://doi.org/10.1145/1047659.1040336
  - Daniel Marino, Todd D. Millstein, Madanlal Musuvathi, Satish Narayanasamy, and Abhayendra Singh. 2015. The Silently Shifting Semicolon. In 1st Summit on Advances in Programming Languages, SNAPL 2015, May 3-6, 2015, Asilomar, California, USA (LIPIcs, Vol. 32), Thomas Ball, Rastislav Bodík, Shriram Krishnamurthi, Benjamin S. Lerner, and Greg Morrisett (Eds.). Schloss Dagstuhl Leibniz-Zentrum für Informatik, 177–189. https://doi.org/10.4230/LIPIcs.SNAPL.2015.177
- Paul E. McKenney, Alan Jeffrey, Ali Sezgin, and Tony Tye. 2016. Out-of-Thin-Air Execution is vacuous. http://wg21.link/p0422.
- Peter O'Hearn. 2007. Resources, Concurrency, and Local Reasoning. *Theor. Comput. Sci.* 375, 1-3 (April 2007), 271–307. https://doi.org/10.1016/j.tcs.2006.12.035
  - Marco Paviotti, Simon Cooksey, Anouk Paradis, Daniel Wright, Scott Owens, and Mark Batty. 2020. Modular Relaxed Dependencies in Weak Memory Concurrency. In Programming Languages and Systems 29th European Symposium on Programming, ESOP 2020, Dublin, Ireland, April 25-30, 2020, Proceedings (Lecture Notes in Computer Science, Vol. 12075), Peter Müller (Ed.). Springer, 599-625. https://doi.org/10.1007/978-3-030-44914-8 22
  - Jean Pichon-Pharabod and Peter Sewell. 2016. A Concurrency Semantics for Relaxed Atomics That Permits Optimisation and Avoids Thin-air Executions. In *Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages* (St. Petersburg, FL, USA) (*POPL '16*). ACM, New York, NY, USA, 622–633. https://doi.org/10. 1145/2837614.2837616
- Anton Podkopaev, Ori Lahav, and Viktor Vafeiadis. 2019. Bridging the gap between programming languages and hardware weak memory models. *Proc. ACM Program. Lang.* 3, POPL (2019), 69:1–69:31. https://doi.org/10.1145/3290382
- William Pugh. 1999. Fixing the Java Memory Model. In Proceedings of the ACM 1999 Conference on Java Grande, JAVA
  '99, San Francisco, CA, USA, June 12-14, 1999, Geoffrey C. Fox, Klaus E. Schauser, and Marc Snir (Eds.). ACM, 89–98.
  https://doi.org/10.1145/304065.304106
  - William Pugh. 2004. Causality Test Cases. https://perma.cc/PJT9-XS8Z

1398

1399

1400

1405

1406

1407

1408

1409

1410

1421

- Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. 2018. Simplifying ARM concurrency: multicopy-atomic axiomatic and operational models for ARMv8. *PACMPL* 2, POPL (2018), 19:1–19:29. https://doi.org/10.1145/3158107
- Jaroslav Sevčík. 2008. Program Transformations in Weak Memory Models. PhD thesis. Laboratory for Foundations of
   Computer Science, University of Edinburgh.

0:30 Anon.

Jaroslav Sevčík and David Aspinall. 2008. On Validity of Program Transformations in the Java Memory Model. In ECOOP
2008 - Object-Oriented Programming, 22nd European Conference, Paphos, Cyprus, July 7-11, 2008, Proceedings (Lecture
Notes in Computer Science, Vol. 5142), Jan Vitek (Ed.). Springer, 27–51. https://doi.org/10.1007/978-3-540-70592-5\_3
Joel Spolsky. 2002. The Law of Leaky Abstractions. https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/.

Kasper Svendsen, Jean Pichon-Pharabod, Marko Doko, Ori Lahav, and Viktor Vafeiadis. 2018. A Separation Logic for a Promising Semantics. In *Programming Languages and Systems - 27th European Symposium on Programming, ESOP 2018, Thessaloniki, Greece, April 14-20, 2018, Proceedings (Lecture Notes in Computer Science, Vol. 10801).* Springer, 357–384. https://doi.org/10.1007/978-3-319-89884-1\_13

Viktor Vafeiadis and Chinmay Narayan. 2013. Relaxed separation logic: a program logic for C11 concurrency. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA 2013, part of SPLASH 2013, Indianapolis, IN, USA, October 26-31, 2013. 867–884. https://doi.org/10.1145/2509136. 2509532

Conrad Watt, Christopher Pulte, Anton Podkopaev, Guillaume Barbier, Stephen Dolan, Shaked Flur, Jean Pichon-Pharabod, and Shu-yu Guo. 2020. Repairing and mechanising the JavaScript relaxed memory model. In *Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020*, Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 346–361. https://doi.org/10.1145/3385412.3385973

Conrad Watt, Andreas Rossberg, and Jean Pichon-Pharabod. 2019. Weakening WebAssembly. *Proc. ACM Program. Lang.* 3, OOPSLA (2019), 133:1–133:28. https://doi.org/10.1145/3360559

#### A DISCUSSION

# A.1 Comparison to "Promising Semantics" [POPL 2017]

Recently, Cho et al. [2021] showed that certain combinations of compiler optimizations are inconsistent with local DRF guarantees. All of the examples that prove inconsistency have the same shape: they combine read introduction and case analysis (aka, if-closure). Effectively, this turns one read into two, where different conditional branches can be taken for the two copies of the read. This is reminiscent of the type of *bait and switch* behavior noted by Jagadeesan et al. [2020]: the promising semantics (PS) [Kang et al. 2017] and related models [Chakraborty and Vafeiadis 2019; Jagadeesan et al. 2010; Manson et al. 2005], fail to validate compositional reasoning of temporal properties. Consider example OOTA4 from [Jagadeesan et al. 2020]:

Under all variants of PwT, this outcome is disallowed, due to the cycle involving x and y.<sup>4</sup> Under Ps, this outcome is allowed by baiting with the else branch, then switching to the then branch, based on a coin flip (b).

Cho et al. [2021] introduce more complex examples to show that the promising semantics fails LDRF-SC.<sup>5</sup> Here is one, dubbed LDRF-FAIL-PS. if (x) {FADD(w, 1); y := 1; z := 1} || if (z){if (! FADD(w, 1)){x := y} else {x := 1}

Again, all variants of PwT disallow the outcome due to the cycle involving x and y. It is allowed by Ps by baiting the second thread with x := 1 in the else branch, then switching to the then branch. This shows some some structural resemblance to OOTA4, with z replacing b.

<sup>&</sup>lt;sup>4</sup>All of the reads in OOTA4 are cross-thread, so there is no difference between PwT-mcA<sub>1</sub> and PwT-mcA<sub>2</sub>. For PwT-c11, there is a cycle in rf  $\cup$  <.

<sup>&</sup>lt;sup>5</sup>Cho et al. [2021] show that by restricting RMW-store reorderings, one can establish LDRF-SC for PS. We speculate that no such restriction is required for PwT. (We did not treat RMWs in our proof of LDRF-SC.)

Cho et al. argue that the outcome of LDRF-FAIL-PS is inevitable due to compiler optimizations. The examples crucially involve the following sequence of operations:

read introduction.

 • if introduction, branching on the read just introduced.

We believe this combination of optimizations is unsound. This is obviously the case in c11: read introduction may cause undefined behavior (UB), due to the possible introduction of a data race.

The situation is more delicate in LLVM. The short version of the story is that load-hoisting followed by case analysis is unsound in LLVM, without freeze. This happens because:

- read introduction may result in the undefined value **undef**, due to the possible introduction of a data race [Chakraborty and Vafeiadis 2017], and
- branching on an undefined value in LLVM results in UB.

LLVM delays UB using the undefined value. This allows LLVM to perform optimizations such as load hoisting, where  $if(C)\{r:=x\}$  is rewritten to s:=x; r:=C?s:r. Despite this, other optimizations regularly performed by LLVM are unsound [Lee et al. 2017]. An example is loop switching, where  $while(C_1)\{if(C_2)\{S_1\} else\{S_2\}\}$  is rewritten to  $if(C_2)\{while(C_1)\{S_1\}\}$  else  $\{while(C_1)\{S_2\}\}$ . Freeze was introduced in LLVM in order to make such optimizations sound by allowing branch on frozen **undef** to give nondeterministic choice rather than UB. In the RFC for freeze, Lopes [2016] says: "Note that having branch on poison not trigger UB has its own problems. We believe this is a good tradeoff." LDRF-FAIL-PS demonstrates a concrete problem with this tradeoff. Other compilers, such as Compcert, are more conservative [Lee et al. 2017, §9].

Thus, the difference between PS and PwT can be understood in terms of the valid program transformations. PS allows reads to be introduced, with subsequent case analysis on the value read. PwT validates case analysis, but invalidates read introduction.

Allowing executions such as OOTA4 and LDRF-FAIL-PS also invalidates compositional reasoning for temporal safety properties (see §6).

These differences highlight the subtle tensions between compiler optimizations and program logics that are revealed by relaxed memory models. It is not possible to have everything one wants. Thus, one is forced to choose which optimizations and reasoning principles are most important. Finally, we note that it is possible that PS is properly weaker than PwT.

# A.2 Comparison to "Pomsets with Preconditions" [OOPSLA 2020]

PwT-mca is closely related to PwP model of [Jagadeesan et al. 2020]. The major difference is that PwT-mca supports sequential composition. In the remainder of this section, we discuss other differences. We also point out some errors in [Jagadeesan et al. 2020], all of which have been confirmed by the authors.

*Substitution*. PwP uses substitution rather than Skolemizing. Indeed our use of Skolemization is motivated by disjunction closure for predicate transformers, which do not appear in PwP. In Fig. 1, we gave the semantics of read for nonempty pomsets as:

```
1511 (R4a) if (E \cap D) \neq \emptyset then \tau^D(\psi) \equiv v = r \Rightarrow \psi,

1512 (R4b) if (E \cap D) = \emptyset then \tau^D(\psi) \equiv (v = r \lor x = r) \Rightarrow \psi.
```

In PwP, the definition is roughly as follows:

```
1514 (R4a') if (E \cap D) \neq \emptyset then \tau^D(\psi) \equiv \psi[v/r][v/x],

1515 (R4b') if (E \cap D) = \emptyset then \tau^D(\psi) \equiv \psi[v/r][v/x] \wedge \psi[x/r]
```

<sup>&</sup>lt;sup>6</sup>Another example is the tension between load hoisting—forbidden in c11 but allowed by LLVM—and common subexpression elimination over an acquiring lock—allowed by c11 but forbidden by LLVM [Chakraborty and Vafeiadis 2017].

0:32 Anon.

The use of conjunction in R4b' causes disjunction closure to fail because the predicate transformer  $\tau(\psi) = \psi' \wedge \psi''$  does not distribute through disjunction, even assuming that the prime operations do:  $\tau(\psi_1 \vee \psi_2) = (\psi'_1 \vee \psi'_2) \wedge (\psi''_1 \vee \psi''_2) \neq (\psi'_1 \wedge \psi''_1) \vee (\psi'_2 \wedge \psi''_2) = \tau(\psi_1) \vee \tau(\psi_2)$ . See also §4.9.

The substitutions collapse x and r, allowing local invariant reasoning (LIR), as required by JMM causality test case 1, discussed in §4.8. Without Skolemizing it is necessary to substitute  $\lfloor x/r \rfloor$ , since the reverse substitution  $\lfloor r/x \rfloor$  is useless when r is bound—compare with §A.7. As discussed below (Downset closure), including this substitution affects the interaction of LIR and downset closure.

Removing the substitution of [x/r] in the independent case has a technical advantage: we no longer require *extended* expressions (which include memory references), since substitutions no longer introduce memory references.

The substitution [x/r] does not work with Skolemization, even for the dependent case, since we lose the unique marker for each read. In effect, this forces all reads of a location to see the same values. Using this definition, consider the following:

$$r := x; s := x; if(r < s) \{ y := 1 \}$$

$$(Rx1) \qquad (Rx2) \rightarrow (1 = x \Rightarrow 2 = x \Rightarrow x < x \mid Wy1)$$

Although the execution seems reasonable, the precondition on the write is not a tautology.

Downset Closure. PwP enforces downset closure in the prefixing rule. Even without this, downset closure would be different for the two semantics, due to the use of substitution in PwP. Consider the final pomset in the last example of §A.8 under the semantics of this paper, which elides the middle read event:

In PwP, the substitution [x/r] is performed by the middle read regardless of whether it is included in the pomset, with the subsequent substitution of [0/x] by the preceding write, we have [x/r][0/x], which is [0/r][0/x], resulting in:

Consistency. PwP imposes consistency, which requires that for every pomset P,  $\bigwedge_e \kappa(e)$  is satisfiable. Associativity requires that we allow pomsets with inconsistent preconditions. Consider a variant of the example from §9.5.

$$\begin{array}{lll} \text{if}(M)\{x:=1\} & \text{if}(!M)\{x:=1\} & \text{if}(M)\{y:=1\} \\ \hline (M\mid \mathsf{W}x1) & \hline (\neg M\mid \mathsf{W}x1) & \hline (M\mid \mathsf{W}y1) & \hline (\neg M\mid \mathsf{W}y1) \\ \end{array}$$

Associating left and right, we have:

$$if(M)\{x:=1\}; if(!M)\{x:=1\}$$
 
$$if(M)\{y:=1\}; if(!M)\{y:=1\}$$
 
$$\boxed{Wy1}$$

Associating into the middle, instead, we require:

$$\begin{array}{ll} \text{if}(M)\{x:=1\} & \text{if}(!M)\{x:=1\} & \text{if}(!M)\{y:=1\} \\ \hline (M\mid \mathsf{W}x1) & \hline (\neg M\mid \mathsf{W}x1) & \hline (\neg M\mid \mathsf{W}y1) & \hline (\neg M\mid \mathsf{W}y1) \\ \end{array}$$

 $<sup>7 = \</sup>frac{1}{(\psi_1 \vee \psi_2)' = (\psi_1' \vee \psi_2')}$  and  $(\psi_1 \vee \psi_2)'' = (\psi_1'' \vee \psi_2'')$ .

Joining left and right, we have:

$$if(M)\{x := 1\}; if(!M)\{x := 1\}; if(M)\{y := 1\}; if(!M)\{y := 1\}$$

$$(Wx1) (Wy1)$$

CAUSAL STRENGTHENING. PwP imposes causal strengthening, which requires for every pomset P, if d < e then  $\kappa(e) \models \kappa(d)$ . Associativity requires that we allow pomsets without causal strengthening. Consider the following.

$$\begin{array}{ccc} \text{if}(M)\{r:=x\} & y:=r & \text{if}(!M)\{s:=x\} \\ \hline (M \mid \mathsf{R}x1) & \hline (r=1 \mid \mathsf{W}y1) & \hline (\neg M \mid \mathsf{R}x1) \\ \end{array}$$

Associating left, with causal strengthening:

$$if(M)\{r := x\}; y := r$$
 
$$if(!M)\{s := x\}$$
 
$$(M \mid Rx1) \rightarrow (M \mid Wy1)$$
 
$$(\neg M \mid Rx1)$$

Finally, merging:

if(
$$M$$
){ $r:=x$ };  $y:=r$ ; if(! $M$ ){ $s:=x$ }
$$Rx1 \longrightarrow M \mid Wy1$$

Instead, associating right:

$$\begin{array}{ccc} \text{if}(M)\{r:=x\} & y:=r; \text{if}(!M)\{s:=x\} \\ \hline (M\mid \mathsf{R}x1) & \hline (r=1\mid \mathsf{W}y1) & \lnot M\mid \mathsf{R}x1 \end{array}$$

Merging:

if(M){
$$r := x$$
};  $y := r$ ; if(!M){ $s := x$ }

(Rx1) (Wy1)

With causal strengthening, the precondition of Wy1 depends upon how we associate. This is not an issue in PwP, which always associates to the right.

One use of causal strengthening is to ensure that address dependencies do not introduce thin air reads. Associating to the right, the intermediate state of ADDR2 (§9.4) is:

$$s := [r]; x := s$$

$$(r=2 \mid R[2]1) \longrightarrow ((r=2 \Rightarrow 1=s) \Rightarrow s=1 \mid Wx1)$$

In PwP, we have, instead:

$$s := \llbracket r \rrbracket \; ; \; x := s$$
 
$$(r=2 \mid \mathsf{R} \; [2] \, 1) \longrightarrow (r=2 \land [2] = 1 \mid \mathsf{W} x \, 1)$$

Without causal strengthening, the precondition of (Wx1) would be simply [2]=1. The treatment in this paper, using implication rather than conjunction, is more precise.

Internal Acquiring Reads. The proof of compilation to Arm in PwP assumes that all internal reads can be eliminated. However, this is not the case for acquiring reads. For example, PwP disallows the following execution, where the final values of x is 2 and the final value of y is 2. This execution is allowed by Arm8 and Tso.

$$x := 2; r := x^{\operatorname{acq}}; s := y \parallel y := 2; x^{\operatorname{rel}} := 1$$

$$(Wx2) \longrightarrow (Ry0) \longrightarrow (Wy2) \longrightarrow (W^{\operatorname{rel}}x1)$$

We discuss two approaches to this problem in §B.

0:34 Anon.

*Redundant Read Elimination.* Contrary to the claim, redundant read elimination fails for PwP. We discuss redundant read elimination in §9.1. Consider JMM Causality Test Case 2, which we describe there.

$$r:=x$$
;  $s:=x$ ; if  $(r=s)\{y:=1\} \parallel x:=y$ 

$$\begin{array}{c|c} \hline (Rx1) & Ry1 & Wx1 \\ \hline \end{array}$$

Under the semantics of PwP, we have

$$\begin{array}{c} r:=x\;;\;s:=x\;;\;\mathrm{if}(r=s)\{y:=1\}\\\\ \boxed{(\mathsf{R}x1)}\quad \boxed{(\mathsf{R}x1)}\quad \boxed{(\mathsf{1}=1\wedge \mathsf{1}=x\wedge x=1\wedge x=x\mid \mathsf{W}\,y1)} \end{array}$$

The precondition of (Wy1) is *not* a tautology, and therefore redundant read elimination fails. (It is a tautology in r:=x; s:=r; if  $(r=s)\{y:=1\}$ .) PwP(§3.1) incorrectly stated that the precondition of (Wy1) was  $1=1 \land x=x$ .

# A.3 Comparison with Sequential Predicate Transformers

We compare traditional transformers to the dependent-case transformers of Fig. 1.

All programs in our language are strongly normalizing, so we need not distinguish strong and weak correctness. In this setting, the Hoare triple  $\{\phi\}$  S  $\{\psi\}$  holds exactly when  $\phi \Rightarrow wp_S(\psi)$ .

Hoare triples do not distinguish thread-local variables from shared variables. Thus, the assignment rule applies to all types of storage. The rules can be written as on the left below:

$$\begin{split} wp_{x:=M}(\psi) &= \psi[M/x] \\ wp_{r:=M}(\psi) &= \psi[M/r] \\ wp_{r:=M}(\psi) &= \psi[M/r] \\ wp_{r:=X}(\psi) &= x = r \Rightarrow \psi \end{split} \qquad \begin{aligned} \tau_{x:=M}(\psi) &= \psi[M/x] \\ \tau_{r:=M}(\psi) &= \psi[M/r] \\ \tau_{r:=X}(\psi) &= v = r \Rightarrow \psi \end{aligned} \qquad \text{where } \lambda(e) = \mathsf{R} x v \in \mathcal{N}(e)$$

Here we have chosen an alternative formulation for the read rule, which is equivalent to the more traditional  $\psi[x/r]$ , as long as registers are assigned at most once in a program. Our predicate transformers for the dependent case are shown on the right above. Only the read rule differs from the traditional one.

For programs where every register is bound and every read is fulfilled, our dependent transformers are the same as the traditional ones. Thus, when comparing to weakest preconditions, let us only consider totally-ordered executions of our semantics where every read could be fulfilled by prepending some writes. For example, we ignore pomsets of x := 2; x := x that read 1 for x.

For example, let  $S_i$  be defined:

$$S_1 = s := x$$
;  $x := s + r$   $S_2 = x := t$ ;  $S_1$   $S_3 = t := 2$ ;  $r := 5$ ;  $S_2$ 

The following pomset appears in the semantics of  $S_2$ . A pomset for  $S_3$  can be derived by substituting [2/t, 5/r]. A pomset for  $S_1$  can be derived by eliminating the initial write.

$$x := t \; ; \; s := x \; ; \; x := s + r$$

$$(t = 2 \mid \mathsf{W} x 2) \longrightarrow (\mathsf{R} x 2) \longrightarrow (2 = s \Rightarrow (s + r) = 7 \mid \mathsf{W} x 7) \cdots \geqslant 2 = s \Rightarrow \psi[s + r/x]$$

The predicate transformers are:

$$\begin{split} wp_{S_1}(\psi) &= x = s \Rightarrow \psi[s + r/x] \\ wp_{S_2}(\psi) &= t = s \Rightarrow \psi[s + r/x] \\ wp_{S_3}(\psi) &= 2 = s \Rightarrow \psi[s + s/x] \\ \end{split}$$
 
$$\begin{aligned} \tau_{S_1}(\psi) &= 2 = s \Rightarrow \psi[s + r/x] \\ \tau_{S_3}(\psi) &= 2 = s \Rightarrow \psi[s + s/x] \\ \tau_{S_3}(\psi) &= 2 = s \Rightarrow \psi[s + s/x] \end{aligned}$$

# A.4 Combining Address Calculation and If-Closure

# A.5 The Need for Respect

 In Fig. 1, we choose the weakest precondition. Because of this, associativity requires that s6 is (< respects <1 and <2) rather than (<  $\supseteq$  (<1  $\cup$  <2)). Consider (r:=x; y:=M; skip). Associating to the left, we might have:

$$P_{12} = \left( \mathbb{R} x \right)^d \left( \phi \mid \mathbb{W} y \right)^e \qquad \qquad P_3 = \emptyset \qquad \qquad P = \left( \mathbb{R} x \right)^d \left( \phi \mid \mathbb{W} y \right)^e$$

When building  $P_{12}$ , we would have had  $\downarrow e = \emptyset$ , and thus  $\phi$  must have been constructed using the independent transformer R4b. Attempting to repeat this, associating to the right:

$$P_{1} = \left( \operatorname{Rx} \right)^{d} \qquad \qquad P_{23} = \left( \phi' \mid \operatorname{Wy} \right)^{e} \qquad \qquad P' = \left( \operatorname{Rx} \right)^{d} \left( \phi' \mid \operatorname{Wy} \right)^{e}$$

In P', however,  $\downarrow e = \{d\}$ , and thus  $\phi'$  must be constructed using the dependent transformer R4a. Since  $((v=r \lor x=r) \Rightarrow \psi) \not\equiv (v=r \Rightarrow \psi)$ , associativity fails.

If we allow stronger preconditions, as in [Jagadeesan et al. 2020], then we could use inclusion rather than *respects*. To arrive at this semantics, one would replace every occurrence of  $\equiv$  in Fig. 1 with  $\models$ . Then (< respects  $<_1$  and  $<_2$ ) can be replaced by ( $< \supseteq (<_1 \cup <_2)$ ).

#### A.6 Write Substitutions

[Todo: Discuss.]

Alan example of why substitute M/x rather than v/x in the write rule:

$$r := y$$
;  $x := r$ ;  $s := x$ ;  $z := s$   
 $(Ry1) \bullet (Wx1)$   $(Rx1)$   $(Wz1)$ 

s := x ; z := s

We lost the order from Ry1 to Wz1.

$$(Rx1) \quad (x=1 \mid Wz1)$$

$$x := r; \ s := x; \ z := s$$

$$(Wx1) \quad (Rx1) \quad (1=1 \mid Wz1)$$

$$(Wx1) \quad (Rx1) \quad (r=1 \mid Wz1)$$

#### A.7 Read Substitutions

In READ, it is also possible to collapse x and r via substitution:

(R4a') if 
$$(E \cap D) \neq \emptyset$$
 then  $\tau^D(\psi) \equiv v = r \Rightarrow \psi[r/x]$ ,

(R4b') if 
$$E \neq \emptyset$$
 and  $(E \cap D) = \emptyset$  then  $\tau^D(\psi) \equiv (v=r \lor x=r) \Rightarrow \psi[r/x]$ ,

(R4c') if 
$$E = \emptyset$$
 then  $\tau^D(\psi) \equiv \psi[r/x]$ ,

Perhaps surprisingly, this semantics is incomparable with that of Fig. 1. Consider the following:

Prepending (s:=x), we get the same result regardless of whether we substitute [s/x], since x does not occur in either precondition. Here we show the independent case:

$$\begin{split} s := x \; ; \; & \text{if}(r \land s \; \text{even})\{y := 1\}; \; & \text{if}(r \land s)\{z := 1\} \\ \hline (2 = s \lor x = s) \Rightarrow & (r \land s \; \text{even}) \mid \forall y \, 1 \end{split}$$

Proc. ACM Program. Lang., Vol. 0, No. POPL, Article 0. Publication date: January 2022.

0:36 Anon.

Since the preconditions mention x, prepending (r := x), we now get different results depending on whether we perform the substitution. Without any substitution, we have:

$$r:=x\;;\;s:=x\;;\;\mathrm{if}(r\wedge s\;\mathrm{even})\{y:=1\}\;;\;\mathrm{if}(r\wedge s)\{z:=1\}$$
 
$$(Rx1) \quad (1=r\Rightarrow (2=s\vee x=s)\Rightarrow (r\wedge s\;\mathrm{even})\;|\;\mathrm{W}y1) \quad (1=r\Rightarrow (2=s\vee x=s)\Rightarrow (r\wedge s)\;|\;\mathrm{W}z1)$$

Prepending (x := 0), which substitutes [0/x], the precondition of (Wy1) becomes  $(1=r \Rightarrow (2=s \lor 0=s) \Rightarrow (r \land s \text{ even}))$ , which is a tautology, whereas the precondition of Wz1 becomes  $(1=r \Rightarrow (2=s \lor 0=s) \Rightarrow (r \land s))$ , which is not. In order to be top-level, (Wz1) must be dependency ordered after (Rx2); in this case the precondition becomes  $(1=r \Rightarrow 2=s \Rightarrow (r \land s))$ , which is a tautology.

$$(Wx0)$$
  $(Rx1)$   $(Rx2)$   $(Wy1)$   $(Wz1)$ 

The situation reverses with the substitution [r/x]:

$$r := x \; ; \; s := x \; ; \; \text{if} \; (r \land s \; \text{even}) \; \{y := 1\}; \; \text{if} \; (r \land s) \; \{z := 1\}$$
 
$$(z := x) \Rightarrow (z := x) \Rightarrow (r \land s \; \text{even}) \; | \; \forall y \; 1$$
 
$$(z := x) \Rightarrow (r \land s) \; | \; \forall x \; 1$$

Prepending (x := 0):

$$(Wx0)$$
  $(Rx1)$   $(Rx2)$   $(Wy1)$   $(Wz1)$ 

The dependency has changed from  $(Rx2) \rightarrow (Wz1)$  to  $(Rx2) \rightarrow (Wy1)$ . The resulting sets of pomsets are incomparable.

Thinking in terms of hardware, the difference is whether reads update the cache, thus clobbering preceding writes. With [r/x], reads clobber the cache, whereas without the substitution, they do not. Since most caches work this way, the model with [r/x] is likely preferred for modeling hardware. However, this substitution only makes sense in a model with read-read coherence and read-read dependencies, which is not the case for Arm8.

#### A.8 Downset Closure

We would like the semantics to be closed with respect to *downsets*. Downsets include a subset of initial events, similar to *prefixes* for strings.

Definition A.1.  $P_2$  is an downset of  $P_1$  if

- $(1) E_2 \subseteq E_1, \qquad (5) \sqrt{2} \models \sqrt{1},$
- (2)  $(\forall e \in E_2) \ \lambda_2(e) = \lambda_1(e)$ , (6a)  $(\forall d \in E_2) \ (\forall e \in E_2) \ d <_2 e \text{ iff } d <_1 e$ ,
- (3)  $(\forall e \in E_2) \kappa_2(e) \equiv \kappa_1(e)$ , (6b)  $(\forall d \in E_1) (\forall e \in E_2) \text{ if } d <_1 e \text{ then } d \in E_2$ ,
- $(4) \ (\forall e \in E_2) \ \tau_2^D(e) \equiv \tau_1^D(e), \qquad \qquad (7) \ (\forall d \in E_2) \ (\forall e \in E_2) \ d \ \text{rf}_2 \ e \ \text{iff} \ d \ \text{rf}_1 \ e.$

Downset closure fails due to for two reasons. The key property is that the empty set transformer should behave the same as the independent transformer.

First, downset closure fails for read-read independency §4.7. Consider

$$r := x; \text{ if } (!r)\{s := y\}$$

$$(Rx0) \qquad (Ry0)$$

The semantics of this program includes the singleton pomset (Rx0), but not the singleton pomset (Ry0). To get (Rx0), we combine:

$$r := x \qquad \qquad \text{if}(!r)\{s := y\}$$

$$(Rx0) \qquad \qquad \emptyset$$

Attempting to get (Ry0), we instead get:

$$r := x$$

$$\emptyset$$

$$if(!r)\{s := y\}$$

$$(r=0 \mid Ry0)$$

Since *r* appears only once in the program, this pomset cannot contribute to a top-level pomset.

Second, the semantics is not downset closed because the independency reasoning of R4b is only applicable for pomsets where the ignored read is present! Revisiting JMM causality test case 1 from the end of §4.6:

$$x := 0 \qquad r := x \qquad \text{if } (r \geqslant 0) \{y := 1\}; z := r$$

$$\boxed{\mathbb{W}x0} \qquad \boxed{\mathbb{R}x1} \qquad \boxed{r \geqslant 0 \mid \mathbb{W}y1} \qquad \boxed{r = 1 \mid \mathbb{W}z1}$$

$$\psi[0/x] \qquad \qquad x := 0; r := x; \text{ if } (r \geqslant 0) \{y := 1\}; z := r$$

$$\boxed{\mathbb{W}x0} \rightarrow \boxed{\mathbb{R}x1} \qquad \boxed{(1 = r \lor 0 = r) \Rightarrow r \geqslant 0 \mid \mathbb{W}y1} \qquad \boxed{1 = r \Rightarrow r = 1 \mid \mathbb{W}z1}$$

The precondition of (Wy1) is a tautology.

Taking the empty set for the read, however, the precondition of (Wy1) is not a tautology:

$$x := 0; r := x; if(r \ge 0) \{y := 1\}; z := r$$
 $(wx0)$ 
 $(r \ge 0 \mid wy1)$ 
 $(r=1 \mid wz1)$ 

One way to deal with the second issue would be to allow general access elimination to merge (Wx0) and (Rx0):

$$x := 0; r := x; if(r \ge 0) \{ y := 1 \}; z := r$$

$$(0 = r \lor 0 = r) \Rightarrow r \ge 0 \mid Wy1 ) \qquad (r = 1 \mid Wz1)$$

We leave the elaboration of this idea to future work.

# A.9 Logical Encoding of Delay for PwT-MCA

In this subsection, we develop a logical encoding of delay, which can replace \$6a in PwT-mcA<sub>1</sub>. It is not obvious how to repeat this trick for PwT-mcA<sub>2</sub>, due to thread-local reads-from and thread-local blockers (\$6a and \$6b in Def. 5.2).

As motivation, recall that we stated Lemma 4.6(i)–(j) as inclusions:

- (i)  $if(\phi)\{\mathcal{P}_1\}$  else  $\{\mathcal{P}_2\} \supseteq if(\phi)\{\mathcal{P}_1\}$ ;  $if(\neg \phi)\{\mathcal{P}_2\}$ .
- (j) if  $(\phi)\{\mathcal{P}_1\}$  else  $\{\mathcal{P}_2\} \supseteq if(\neg \phi)\{\mathcal{P}_2\}$ ; if  $(\phi)\{\mathcal{P}_1\}$ .

PwT-MCA does not satisfy the reverse inclusion. The culprit is delay, which introduces order regardless of whether preconditions are disjoint. As an example,  $[if(r)\{x := 1\} else\{x := 2\}]$  has an execution with  $(r=0 \mid Wx2) \rightarrow (r\neq 0 \mid Wx1)$ , (using augmentation), whereas  $[if(r)\{x := 1\}; if(!r)\{x := 2\}]$  has no such execution.

[Todo: Put an example for PwT-Po.]

 0:38 Anon.

In order to validate the reverse inclusions, we could require that s6a not impose order when  $\kappa_1(d) \wedge \kappa_2(e)$  is unsatisfiable. Thus, following on §9.2, we would also like this:

(s6b') if  $\lambda_1(d)$  delays  $\lambda_2(e)$  and  $\kappa_1(d) \wedge \kappa_2'(e)$  is  $\lambda$ -consistent then  $d \leq e$ .

However, (s6b') fails associativity. Example where  $\theta_{\lambda} = (r=0)$ 

$$r:=y \qquad \qquad \text{if}(r\parallel s)\{x:=1\} \qquad \qquad \text{if}(!s)\{x:=2\}$$
 
$$(Ry0) \qquad \qquad (r\neq 0 \lor s\neq 0 \mid Wx1) \qquad \qquad (s=0 \mid Wx2)$$

Associating right, order is required since  $((r \neq 0 \lor s \neq 0) \land s = 0)$  is satisfiable (take r = 1 and s = 0):

$$r := y \qquad \qquad \text{if}(r \parallel s)\{x := 1\}; \text{if}(!s)\{x := 2\}$$

$$(r \neq 0 \lor s \neq 0 \mid Wx1) \longrightarrow (s = 0 \mid Wx2)$$

$$r := y; \text{if}(r \parallel s)\{x := 1\}; \text{if}(!s)\{x := 2\}$$

$$(Ry0) \longrightarrow (r = 0 \Rightarrow (r \neq 0 \lor s \neq 0) \mid Wx1) \longrightarrow (s = 0 \mid Wx2)$$

Associating left, order is not required between the writes since  $(s \neq 0 \land s = 0)$  is unsatisfiable:

$$r := y; \text{ if } (r \parallel s)\{x := 1\} \qquad \text{ if } (!s)\{x := 2\}$$

$$(Ry0) \rightarrow (r=0 \Rightarrow (r\neq 0 \lor s\neq 0) \mid Wx1) \qquad (s=0 \mid Wx2)$$

$$r := y; \text{ if } (r \parallel s)\{x := 1\}; \text{ if } (!s)\{x := 2\}$$

$$(Ry0) \rightarrow (r=0 \Rightarrow (r\neq 0 \lor s\neq 0) \mid Wx1) \qquad (s=0 \mid Wx2)$$

This motivates the logic-based presentation of delay.

In the data model, we require additional symbols:  $Q_{sc}$ ,  $Q_{rc}^{x}$ , and  $Q_{wc}^{x}$ . We refer to these collectively as quiescence symbols.

We update the Def. 4.4 of complete pomset to substitute true for every quiescence symbol (notation [tt/Q]:

```
Definition A.2. A PwT is complete if
(c3) \kappa(e) [tt/Q] is a tautology,
                                                       (c5) \sqrt{[tt/Q]} is a tautology.
```

We define some helper notation:

$$\begin{array}{lll} \textit{Definition A.3.} & \text{Let } Q_{ro}^* = \bigwedge_y Q_{ro}^y, \text{ and similarly for } Q_{wo}^*. \\ \text{Let formulae } Q_{\mu}^{Sx}, Q_{\mu}^{Lx}, \text{ and } Q_{\mu}^F \text{ be defined:} \\ & Q_{rlx}^{Sx} = Q_{ro}^x \wedge Q_{wo}^x & Q_{rlx}^{Lx} = Q_{wo}^x & Q_{rel}^F = Q_{ro}^* \wedge Q_{wo}^* \\ & Q_{rel}^{Sx} = Q_{ro}^* \wedge Q_{wo}^* & Q_{acq}^L = Q_{wo}^x & Q_{acq}^F = Q_{ro}^* \\ & Q_{sc}^{Sx} = Q_{ro}^* \wedge Q_{wo}^* \wedge Q_{sc} & Q_{sc}^L = Q_{wo}^x \wedge Q_{sc} & Q_{sc}^F = Q_{ro}^* \wedge Q_{wo}^* \wedge Q_{sc} \end{array}$$

Let  $[\phi/Q_{ro}^*]$  substitute  $\phi$  for every  $Q_{ro}^y$ , and similarly for  $Q_{wo}^*$ . Let substitutions  $[\phi/Q_{\mu}^{Sx}]$ ,  $[\phi/Q_{\mu}^{Lx}]$ , and  $[\phi/Q_{\mu}^{F}]$  be defined:

```
 \begin{split} [\phi/Q_{\text{rlx}}^{\text{S}x}] &= [\phi/Q_{\text{wo}}^{x}] & [\phi/Q_{\text{rlx}}^{\text{L}x}] = [\phi/Q_{\text{ro}}^{x}] & [\phi/Q_{\text{rel}}^{\text{F}}] = [\phi/Q_{\text{wo}}^{*}] \\ [\phi/Q_{\text{rel}}^{\text{S}x}] &= [\phi/Q_{\text{wo}}^{x}] & [\phi/Q_{\text{acq}}^{\text{L}x}] = [\phi/Q_{\text{ro}}^{*}, \phi/Q_{\text{wo}}^{*}] & [\phi/Q_{\text{acq}}^{\text{F}}] = [\phi/Q_{\text{ro}}^{*}, \phi/Q_{\text{wo}}^{*}] \\ [\phi/Q_{\text{S}c}^{\text{S}x}] &= [\phi/Q_{\text{wo}}^{x}, \phi/Q_{\text{sc}}] & [\phi/Q_{\text{S}c}^{\text{L}x}] = [\phi/Q_{\text{ro}}^{*}, \phi/Q_{\text{wo}}^{*}, \phi/Q_{\text{sc}}] & [\phi/Q_{\text{S}c}^{\text{F}}] = [\phi/Q_{\text{ro}}^{*}, \phi/Q_{\text{wo}}^{*}, \phi/Q_{\text{sc}}] \\ \end{split}
```

Update the following rules from Fig. 1. (The change is similar for address calculation and ifclosure.)

[Todo: This is buggy. Need to enforce order for coherence/synchronization/dependency into a write and coherence/synchronization, but not dependency, into reads. Lack of read-read dependency is bad here. Note that the write rules should mention D—see the agda version of write.]



Fig. 3. The Effect of Quiescence for Each Access Mode

```
1882 (W3) \kappa(e) \equiv M = v \wedge Q_{\mu}^{Sx},

1884 (W4a) if E \neq \emptyset then \tau^{D}(\psi) \equiv \psi[M/x][(M = v \wedge Q_{\mu}^{Sx})/Q_{\mu}^{Sx}],

1885 (W4b) if E = \emptyset then \tau^{D}(\psi) \equiv \psi[M/x][ff/Q_{\mu}^{Sx}],

1886 (R3) \kappa(e) \equiv Q_{\mu}^{Lx},

1887 (R4a) if e \in E \cap D then \tau^{D}(\psi) \equiv v = r \Rightarrow \psi,

1888 (R4b) if e \in E \setminus D then \tau^{D}(\psi) \equiv (v = r \vee x = r) \Rightarrow \psi[ff/Q_{\mu}^{Lx}],

1890 (R4c) if E = \emptyset then \tau^{D}(\psi) \equiv \psi[ff/Q_{\mu}^{Lx}],

1891 (F4a) if E \neq \emptyset then \tau^{D}(\psi) \equiv \psi,

1892 (F4b) if E = \emptyset then \tau^{D}(\psi) \equiv \psi[ff/Q_{\mu}^{Fx}].
```

1882

1893

1894

1895

1896

1897

1898

1899

1900

1901

1902

1904

1905

1910 1911 The quiescence formulae indicate what must precede an event. For example, all preceding accesses must be ordered before a releasing write, whereas only writes on x must be ordered before a releasing read on x.

The quiescence substitutions update quiescence symbols in subsequent code. For subsequent independent code, w³ and x³ substitute false. In complete pomsets, we substitute true for . For example, we substitute ff for  $Q_{\rm rel}^{\rm Sx}$  in the independent case for a releasing write; this ensures that subsequent writes to x follow the releasing write in top-level pomsets. Similarly, we substitute ff for  $Q_{\rm acq}^{\rm Lx}$  in the independent case for an acquiring write; this ensures that all subsequent accesses follow the acquiring read in top-level pomsets.

Fig. 3 shows the effect of quiescence for each access mode.

Example A.4. The definition enforces publication. Consider:



Since  $Q_{wo}^*[ff/Q_{wo}^x]$  is ff, we must introduce order to get a satisfiable precondition for (Wy1).

0:40 Anon.

Example A.5. The definition enforces subscription. Consider:

$$r := y^{\operatorname{acq}} \qquad r := x$$

$$(1 = r \lor y = r) \Rightarrow \psi[\operatorname{ff}/Q_{\operatorname{ro}}^{*}][\operatorname{ff}/Q_{\operatorname{wo}}^{*}] \qquad (1 = r \lor x = r) \Rightarrow \psi[\operatorname{ff}/Q_{\operatorname{ro}}^{x}]$$

$$Q_{\operatorname{wo}}^{y} \mid R^{\operatorname{acq}} yv) \vdots \qquad Q_{\operatorname{wo}}^{x} \mid Rxv$$

$$v = r \Rightarrow \psi \quad \vdots \qquad v = r \Rightarrow \psi$$

Since  $Q_{wo}^{x}[ff/Q_{wo}^{*}]$  is ff, we must introduce order to get a satisfiable precondition for (Wy1).

*Example A.6.* Even in its logical form, s6b' is incompatible with the ability to strengthen preconditions using augment closure, which is allowed in [Jagadeesan et al. 2020]. Consider the following.

$$\begin{array}{lll} \text{if}(r)\{x:=2\} & x:=1 & x:=2 & \text{if}(!r)\{x:=1\} \\ \hline (r\neq 0 \mid \forall x2) & (\forall x1) & (\forall x2) & \hline \end{array}$$

If r=0 then x is 1, 2, 1. If  $r\neq 0$  then x is 2, 1, 2. Augmenting the middle preconditions and then using sequential composition, we have:

$$\begin{array}{ll} \text{if}(r)\{x:=2\} & x:=1; \ x:=2 & \text{if}(!r)\{x:=1\} \\ \hline (r\neq 0\mid Wx2) & (r\neq 0\mid Wx1) & (r=0\mid Wx1) \end{array}$$

Note that 86b' does not require any order between the two writes of the middle pomset. Merging left and right, we have:

if(r){x := 2}; x := 1; x := 2; if(!r){x := 1}  

$$(Wx2) \rightarrow (Wx1)$$

As shown by the following single-threaded code, allowing this outcome would violate DRF-SC.

$$y := 1; r := y; if(r)\{x := 2\}; x := 1; x := 2; if(!r)\{x := 1\}$$

$$(Wy1) \rightarrow (Ry1) \qquad (Wx2) \rightarrow (Wx1)$$

This is one reason that we use weakest preconditions, rather than preconditions.

The same problem does not occur due to if-closure (at least not for complete pomsets, where you need to have termination being a tautology, so you can't arbitrarily choose to partition  $\Omega \neq tt$ :

$$\begin{array}{c} \text{if}(r)\{x:=2\} \\ \hline (r\neq 0\mid Wx2) \\ \hline (r=0\mid Wx1) \\ \hline (r\neq 0\mid Wx2) \\ \hline \end{array} \qquad \begin{array}{c} \text{if}(!r)\{x:=1\} \\ \hline (r=0\mid Wx1) \\ \hline (r\neq 0\mid Wx2) \\ \hline \end{array}$$

Merging left and right, we have

if 
$$(r)\{x := 2\}$$
;  $x := 1$ ;  $x := 2$ ; if  $(!r)\{x := 1\}$ 

$$(wx2) \leftarrow (r=0 \mid wx1) \qquad (r\neq 0 \mid wx2) \leftarrow (wx1)$$

# A.10 Is Coherence/Delay Compatible with If-closure and Dead-Write-Removal?

This paper has a "High-level abstraction for efficient computation"

With if-closure, the following equation should hold:

$$[if(r)\{x := 2\}; x := 1; x := 2; if(!r)\{x := 1\}; x := 3]$$

$$=[if(!r)\{x := 1\}; x := 2; x := 1; if(r)\{x := 2\}; x := 3]$$

Using dead write removal, these can be refined, respectively, to:

1962 1963

1961

```
[x := 1; x := 2; x := 3]
 \neq [x := 2; x := 1; x := 3]
```

1964 1965

What has become of coherence?

1966 1967

1968

### **B** LOWERING PwT-MCA TO ARM

<sup>1969</sup> For simp

For simplicity, we restrict to top-level parallel composition.

1971 1972

1973

1980 1981

1982

1983

1984

1985

1986

1987 1988

1989

1990

1991

1992 1993

1994

1995

1996

1997

1998

1999

2000

2001

2002

2004

2005

2006

2007

### **B.1** Arm executions

Our description of Arm8 follows Alglave et al. [2021], adapting the notation to our setting.

*Definition B.1.* An *Arm8 execution graph, G*, is tuple  $(E, \lambda, poloc, lob)$  such that

- (A1)  $E \subseteq \mathcal{E}$  is a set of events,
- (A2)  $\lambda: E \to \mathcal{A}$  defines a label for each event,
- (A3)  $poloc \subseteq E \times E$ , is a per-thread, per-location total order, capturing *per-location program order*,
- (A4)  $lob \subseteq E \times E$ , is a per-thread partial order capturing *locally-ordered-before*, such that (A4a)  $poloc \cup lob$  is acyclic.

The definition of lob is complex. Comparing with our definition of sequential composition, it is sufficient to note that lob includes

- (L1) read-write dependencies, required by \$3,
- (L2) synchronization delay of  $\ltimes_{sync}$ , required by s6a,
- (L3) sc access delay of ⋈<sub>sc</sub>, required by s6a,
- (L4) write-write and read-to-write coherence delay of ⋈<sub>CO</sub>, required by S6a,

and that lob does not include

- (L5) read-read control dependencies, required by \$3,
- (L6) write-to-read order of rf, required by M7c,
- (L7) write-to-read coherence delay of  $\bowtie_{co}$ , required by s6a.

Definition B.2. Execution G is (co, rf, gcb)-valid, under External Global Consistency (EGC) if

- (A5)  $co \subseteq E \times E$ , is a per-location total order on writes, capturing *coherence*,
- (A6) rf  $\subseteq E \times E$ , is a relation, capturing *reads-from*, such that
  - (A6a) rf is surjective and injective relation on  $\{e \in E \mid \lambda(e) \text{ is a read}\}\$ ,
  - (A6b) if  $d \xrightarrow{rf} e$  then  $\lambda(d)$  matches  $\lambda(e)$ ,
  - (A6c) poloc  $\cup$  co  $\cup$  rf  $\cup$  fr is acyclic, where  $e \xrightarrow{fr} c$  if  $e \xleftarrow{rf} d \xrightarrow{co} c$ , for some d,
- (A7)  $gcb \supseteq (co \cup rf)$  is a linear order such that
  - (A7a) if  $d \xrightarrow{rf} e$  and  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \xrightarrow{gcb} d$  or  $e \xrightarrow{gcb} c$ ,
  - (A7b) if  $e \xrightarrow{lob} c$  then either  $e \xrightarrow{gcb} c$  or  $(\exists d) d \xrightarrow{rf} e$  and  $d \xrightarrow{poloc} e$  but not  $d \xrightarrow{lob} c$ .

Execution *G* is (co, rf, cb)-valid under External Consistency (EC) if

- (A5) and (A6), as for EGC,
- (A8) cb  $\supseteq$  (co  $\cup$  lob) is a linear order such that if  $d \xrightarrow{rf} e$  then either
  - (A8a)  $d \stackrel{\mathsf{cb}}{\longrightarrow} e$  and if  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \stackrel{\mathsf{cb}}{\longrightarrow} d$  or  $e \stackrel{\mathsf{cb}}{\longrightarrow} c$ , or
- (A8b)  $d \stackrel{\mathsf{cb}}{\longleftarrow} e$  and  $d \stackrel{\mathsf{poloc}}{\longrightarrow} e$  and  $(\not\exists c) \lambda(c)$  blocks  $\lambda(e)$  and  $d \stackrel{\mathsf{poloc}}{\longrightarrow} c \stackrel{\mathsf{poloc}}{\longrightarrow} e$ .

2008

0:42 Anon.

Alglave et al. [2021] show that EGC and EC are both equivalent to the standard definition of Arm8. They explain EGC and EC using the following example, which is allowed by Arm8.<sup>8</sup>

$$x := 1; r := x; y := r \parallel 1 := y^{\text{acq}}; s := x$$

$$(Wx1) \xrightarrow{\text{rf}} (Rx1) \xrightarrow{\text{lob}} (Wy1) \xrightarrow{\text{rf}} (R^{\text{acq}}y1) \xrightarrow{\text{lob}} (Rx0)$$

EGC drops lob-order in the first thread using A7b, since (Wx1) is not lob-ordered before (Wy1).

EC drops rf-order in the first thread using A8b.

$$(cb)$$

# B.2 Lowering PwT-MCA1 to Arm

 The optimal lowering for Arm8 is unsound for PwT-mcA<sub>1</sub>. The optimal lowering maps relaxed access to ldr/str and non-relaxed access to ldar/stlr [Podkopaev et al. 2019]. In this section, we consider a suboptimal strategy, which lowers non-relaxed reads to (dmb.sy; ldar). Significantly, we retain the optimal lowering for relaxed access. In the next section we recover the optimal lowering by adopting an alternative semantics for M7c and S6a.

To see why the optimal lowering fails, consider the following attempted execution, where the final values of both x and y are 2.

$$x := 2$$
;  $r := x^{\operatorname{acq}}$ ;  $y := r - 1 \parallel y := 2$ ;  $x^{\operatorname{rel}} := 1$ 

$$(\operatorname{gcb})$$

$$(\vee x2) \longrightarrow (R^{acq}x2) \longrightarrow (Wy1) \longrightarrow (Wy2) \longrightarrow (W^{rel}x1)$$

This attempted execution is allowed by Arm8, but disallowed by our semantics.

If the read of x in the execution above is changed from acquiring to relaxed, then our semantics allows the gcb execution, using the independent case for the read and satisfying the precondition of (Wy1) by prepending (Wx2). It may be tempting, therefore, to adopt a strategy of *downgrading* acquires in certain cases. Unfortunately, it is not possible to do this locally without invalidating important idioms such as publication. For example, consider that  $(R^{ra}x1)$  is *not* possible for the second thread in the following attempted execution, due to publication of (Wx2) via y:

$$x := x + 1; y^{\text{rel}} := 1 \parallel x := 1; \text{ if } (y^{\text{acq}} \& x^{\text{acq}}) \{s := z\} \parallel z := 1; x^{\text{rel}} := 1$$
 $(x_1) \leftarrow (x_2) \leftarrow (x_1) \leftarrow (x_2) \leftarrow (x_2)$ 

Instead, if the read of x is relaxed, then the publication via y fails, and (Rx1) in the second thread is possible.

Using the suboptimal lowering for acquiring reads, our semantics is sound for Arm. The proof uses the characterization of Arm using EGC.

<sup>&</sup>lt;sup>8</sup>We have changed an address dependency in the first thread to a data dependency.

Theorem B.3. Suppose  $G_1$  is  $(co_1, rf_1, gcb_1)$ -valid for S under the suboptimal lowering that maps non-relaxed reads to (dmb.sy; ldar). Then there is a top-level point  $P_2 \in [S]$  such that  $E_2 = E_1$ ,  $\lambda_2 = \lambda_1$ ,  $rf_2 = rf_1$ , and  $\leq_2 = gcb_1$ .

PROOF. First, we establish some lemmas about Arm8.

 LEMMA B.4. Suppose G is (co, rf, gcb)-valid. Then  $gcb \supseteq fr$ .

PROOF. Using the definition of fr from A6c, we have  $e \overset{\text{rf}}{\longleftrightarrow} d \overset{\text{co}}{\longleftrightarrow} c$ , and therefore  $\lambda(c)$  blocks  $\lambda(e)$ . Applying A7a, we have that either  $c \overset{\text{gcb}}{\longleftrightarrow} d$  or  $e \overset{\text{gcb}}{\longleftrightarrow} c$ . Since gcb includes co, we have  $d \overset{\text{gcb}}{\longleftrightarrow} c$ , and therefore it must be that  $e \overset{\text{gcb}}{\longleftrightarrow} c$ .

LEMMA B.5. Suppose G is (co, rf, gcb)-valid and c  $\stackrel{\text{poloc}}{\longrightarrow}$  e, where  $\lambda(c)$  blocks  $\lambda(e)$ . Then  $c \stackrel{\text{gcb}}{\longrightarrow}$  e.

PROOF. By way of contradiction, assume  $e \xrightarrow{\text{gcb}} c$ . If  $c \xrightarrow{\text{rf}} e$  then by A7 we must also have  $c \xrightarrow{\text{gcb}} e$ , contradicting the assumption that gcb is a total order. Otherwise that there is some  $d \neq c$  such that  $d \xrightarrow{\text{rf}} e$ , and therefore  $d \xrightarrow{\text{gcb}} e$ . By transitivity,  $d \xrightarrow{\text{gcb}} c$ . By the definition of fr, we have  $e \xrightarrow{\text{fr}} c$ . But this contradicts A6c, since  $c \xrightarrow{\text{poloc}} e$ .

We show that all the order required in the pomset is also required by Arm8. M7b holds since  $cb_1$  is consistent with  $co_1$  and  $fr_1$ . As noted above, lob includes the order required by s3 and s6a. We need only show that the order removed from A7b can also be removed from the pomset. In order for A7b to remove order from e to c, we must have  $d \xrightarrow{rf} e$  and  $d \xrightarrow{poloc} e$  but not  $d \xrightarrow{lob} c$ . Because of our suboptimal lowering, it must be that e is a relaxed read; otherwise the dmb.sy would require  $d \xrightarrow{lob} c$ . Thus we know that s6a does not require order from e to c. By chaining R4b and W4, any dependence on the read can by satisfied without introducing order in s3.

### B.3 Lowering PwT-MCA2 to Arm

We can achieve optimal lowering for Arm by weakening the semantics of sequential composition slightly. In particular, we must lose M7c, which states that  $d \stackrel{\text{rf}}{\longrightarrow} e$  implies d < e. Revisiting the example in the last subsection, we essentially mimic the EC characterization:

$$x := 2; r := x^{\operatorname{acq}}; y := r - 1 \parallel y := 2; x^{\operatorname{rel}} := 1$$

$$(\operatorname{cb})$$

$$(\operatorname{cb})$$

Here the rf relation *contradicts* order! We have both  $(Wx2) \cdots \rightarrow (R^{acq}x2)$  and  $(Wx2) \stackrel{\mathsf{cb}}{\longleftarrow} (R^{acq}x2)$ . We first show that EC-validity is unchanged if we assume  $\mathsf{cb} \supseteq \mathsf{fr}$ :

LEMMA B.6. Suppose G is EC-valid via (co, rf, cb). Then there a permutation cb' of cb such that G is EC-valid via (co, rf, cb') and cb'  $\supseteq$  fr, where fr is defined in A6c.

PROOF. Suppose  $e \xrightarrow{fr} c$ . By definition of fr,  $e \xleftarrow{rf} d \xrightarrow{co} c$ , for some d. We show that either (1)  $e \xrightarrow{cb} c$ , or (2)  $e \xrightarrow{cb} e$  and we can reverse the order in cb' to satisfy the requirements.

If A8a applies to  $d \xrightarrow{rf} e$ , then  $e \xrightarrow{cb} c$ , since it cannot be that  $c \xrightarrow{co} d$ .

Suppose A8b applies to  $d \xrightarrow{rf} e$  and c is from a different thread than e. Because it is a different thread, we cannot have  $e \xrightarrow{lob} c$ , and therefore we can choose  $c \xrightarrow{cb} e$  in cb'.

Suppose A8b applies to  $d \xrightarrow{rf} e$  and c is from the same thread as e. Applying A6c to  $e \xrightarrow{fr} c$ , it cannot be that  $c \xrightarrow{poloc} e$ . Since poloc is a per-thread-and-per-location total order, it must be that  $e \xrightarrow{poloc} c$ . Applying A4a, we cannot have  $e \xrightarrow{lob} c$ , and therefore we can choose  $e \xrightarrow{cb} e$  in cb'.  $\Box$ 

0:44 Anon.

Here is a contradictory non-example illustrating the last case of the proof:

$$x := 2$$
;  $r := x \parallel x := 1$ 

$$c : Wx2$$

$$d : Wx1$$

THEOREM B.7. Suppose  $G_1$  is EC-valid for S via  $(co_1, rf_1, cb_1)$  and that  $cb_1 \supseteq fr_1$ . Then there is a top-level pomset  $P_2 \in [S]$  such that  $E_2 = E_1$ ,  $\lambda_2 = \lambda_1$ ,  $rf_2 = rf_1$ , and  $\leq_2 = cb_1$ .

PROOF. We show that all the order required in the pomset is also required by Arm8. M7b holds since  $cb_1$  is consistent with  $co_1$  and  $fr_1$ . s6b follows from A8b. As noted above, lob includes the order required by s3 and s6a.

#### C LDRF-SC FOR PwT-MCA

 In this appendix, we establish a DRF-SC for PwT-MCA<sub>2</sub>. We prove an *external* result, where the notion of *data-race* is independent of the semantics itself. Since every PwT-MCA<sub>2</sub> is also a PwT-MCA<sub>1</sub>, the result also applies there. Our result is also *local*. Using Dolan et al.'s [2018] notion of *Local Data Race Freedom (LDRF)*.

We do not address PwT-c11. The internal DRF-sc result for c11 [Batty 2015] does not rely on dependencies and thus applies to PwT-c11. In internal DRF-sc, data-races are defined using the semantics of the language itself. Using the notion of dependency defined here, it should be possible to prove an stronger external result for c11, similar to that of [Lahav et al. 2017]—we leave this as future work.

Jagadeesan et al. [2020] prove LDRF-SC for Pomsets with Preconditions (PwP). PwT-MCA generalizes PwP to account for sequential composition. Most of the machinery of LDRF-SC, however, has little to do with sequential semantics. Thus, we have borrowed heavily from the text of [Jagadeesan et al. 2020]; indeed, we have copied directly from the LATEX source, which is publicly available. We indicate substantial changes or additions using a change-bar on the right.

There are several changes:

- PwP imposes several conditions that we have dropped: *consistency, causal strengthening, downset closure* (see §A.2).
- PwP allows preconditions that are stronger than the weakest precondition.
- PwP imposes M7c (rf implies <) and thus is similar to PwT-MCA<sub>1</sub>. PwT-MCA<sub>2</sub> is a weaker model that is new to this paper.
- PwP did not provide an accurate account of program order for merged actions. We use Lemma 7.2 to correct this deficiency.

The first two items require us to define gen differently, below.

The result requires that locations are properly initialized. We assume a sufficient condition: that programs have the form " $x_1 := v_1$ ;  $\cdots x_n := v_n$ ; S" where every location mentioned in S is some  $x_i$ . To simplify the definition of *happens-before*, we ban fences and RMWs.

To state the theorem, we require several technical definitions. The reader unfamiliar with [Dolan et al. 2018] may prefer to skip to the examples in the proof sketch, referring back as needed.

*Program Order.* Let  $[\![\cdot]\!]_{mca2}^{po}$  be defined by applying the construction of Lemma 7.2 to  $[\![\cdot]\!]_{mca2}$ . We consider only *complete* pomsets. For these, we derive program order on compound events as follows. By Lemma 7.4, if there is a compound event e, then there is a phantom event  $c \in \pi^{-1}(e)$  such that  $\kappa(c)$  is a tautology. If there is exactly one tautology, we identify e with c in program order. If there is more than one tautology, Lemma C.1, below, shows that it suffices to pick an arbitrary

one—we identify e with the  $c \in \pi^{-1}(e)$  that is minimal in program order. For example, consider JMM causality test case 2, with an added write to z:

$$r := x; z := 1; s := x; if(r=s)\{y := 1\} \parallel x := y$$

$$(\ddagger \ddagger)$$

$$Rx1 \longrightarrow Rx0 \longrightarrow Rx1 \longrightarrow Rx1$$

$$(\ddagger \ddagger)$$

Data Race. Data races are defined using program order (po), not pomset order (<).

Because we ban fences and RMWs, we can adopt the simplest definition of *synchronizes-with* (sw): Let  $d \stackrel{\text{sw}}{\longrightarrow} e$  exactly when d fulfills e, d is a release, e is an acquire, and  $\neg (d \stackrel{\text{po}}{\longrightarrow} e)$ .

Let  $hb = (po \cup sw)^+$  be the *happens-before* relation.

Let  $L \subseteq X$  be a set of locations. We say that d has an L-race with e (notation  $d \stackrel{L}{\leadsto} e$ ) when (1) at least one is relaxed, (2) at least one is a write, (3) they access the same location in L, and (4) they are unordered by hb: neither  $d \stackrel{hb}{\Longrightarrow} e$  nor  $e \stackrel{hb}{\Longrightarrow} d$ .

*Generators.* We say that  $P' \in \nabla(\mathcal{P})$  if there is some  $P \in \mathcal{P}$  such that P is *complete* (Def. 5.1) and P' is a *downset* of P (Def. A.1).

Let P be augmentation-minimal in  $\mathcal{P}$  if  $P \in \mathcal{P}$  and there is no  $P \neq P' \in \mathcal{P}$  such that P augments P'. Let  $\text{gen}[S] = \{P \in \nabla [S]_{\text{mca}2}^{\text{po}} \mid P \text{ is augmentation-minimal in } \nabla [S]_{\text{mca}2}^{\text{po}} \}$ .

*Extensions.* We say that P' *S-extends* P if  $P \neq P' \in \text{gen}[S]$  and P is a downset of P'.

Similarity. We say that P' is e-similar to P if they differ at most in (1) pomset order adjacent to e, (2) the value associated with event e, if it is a read, and (3) the addition and removal of read events po-after e.

Stability. We say that P is L-stable in S if (1)  $P \in \text{gen}[S]$ , (2) P is po-convex (nothing missing in program order), (3) there is no S-extension of P with a *crossing* L-race: that is, there is no  $G \in E$ , no  $G \in E$  and no  $G \in E$  such that  $G \in E$  and  $G \in E$  such that  $G \in E$  and  $G \in E$  such that  $G \in E$  such th

Sequentiality. Let  $<_L = <_L \cup po$ , where  $<_L$  is the restriction of < to events that access locations in L. We say that P' is L-sequential after P if (1) P' is po-convex, (2)  $<_L$  is acyclic in  $E' \setminus E$ .

*Simplicity.* We say that P' is L-simple after P if all of the events in  $E' \setminus E$  that access locations in L are simple (Def. 7.1).

LEMMA C.1. Suppose  $P' \in gen[S]$  and P is L-sequential after P. Let P'' be the restriction of P' that is L-simple after P (throwing out compound L-events after P). Then  $P'' \in gen[S]$ .

As a negative example, note that  $(\ddagger\ddagger)$  is not L-sequential—in fact there is no execution of the program that results in the simple events of  $(\ddagger\ddagger)$ : without merging the reads, there would be a dependency  $(Rx1) \rightarrow (Wy1)$ . L-sequential executions of this code must read 0 for x:

```
r:=x; z:=1; s:=x; if (r=s)\{y:=1\} \parallel x:=y
(Rx0)
(Wx1)
(Rx0)
(Wy1)
(Rx1)
```

Theorem C.2. Let P be L-stable in S. Let P' be a S-extension of P that is L-sequential after P. Let P'' be a S-extension of P' that is po-convex, such that no subset of E'' satisfies these criteria. Then either (1) P'' is L-sequential and L-simple after P or (2) there is some S-extension P''' of P' and some  $e \in (E'' \setminus E')$  such that (a) P''' is e-similar to P'', (b) P''' is e-sequential and e-simple after e-simple after e-simple after e-simple e-simple

0:46 Anon.

The theorem provides an inductive characterization of *Sequential Consistency for Local Data-Race Freedom (SC-LDRF)*: Any extension of a *L*-stable pomset is either *L*-sequential, or is *e*-similar to a *L*-sequential extension that includes a race involving *e*.

PROOF SKETCH. We show *L*-sequentiality. *L*-simplicity then follows from Lemma C.1.

In order to develop a technique to find P''' from P'', we analyze pomset order in generation-minimal top-level pomsets. First, we note that  $<_*$  (the transitive reduction <) can be decomposed into three disjoint relations. Let  $ppo = (<_* \cap po)$  denote *preserved* program order, as required by sequential composition and conditional. The other two relations are cross-thread subsets of  $(<_* \setminus po)$ : rfe (reads-from-external) orders writes before reads, satisfying P6; cae (coherence-after-external) orders read and write accesses before writes, satisfying M7b. (Within a thread, S6 induces order that is included in ppo.)

Using this decomposition, we can show the following.

 LEMMA C.3. Suppose  $P'' \in gen[S]$  has an external read  $d \xrightarrow{rf''} e$  that is maximal in  $(ppo \cup rfe)$ . Further suppose that there another write d' that could fulfill e. Then there exists an e-similar P''' with  $d' \xrightarrow{rf'''} e$  such that  $P''' \in gen[S]$ .

The proof of the lemma follows an inductive construction of gen[S], starting from a large set with little order, and pruning the set as order is added: We begin with all pomsets generated by the semantics without imposing the requirements of fulfillment (including only ppo). We then prune reads which cannot be fulfilled, starting with those that are minimally ordered.

We can prove a similar result for (po  $\cup$  rfe)-maximal read and write accesses.

Turning to the proof of the theorem, if P'' is L-sequential after P, then the result follows from (1). Otherwise, there must be a  $\leq_L$  cycle in P'' involving all of the actions in  $(E'' \setminus E')$ : If there were no such cycle, then P'' would be L-sequential; if there were elements outside the cycle, then there would be a subset of E'' that satisfies these criteria.

If there is a (po  $\cup$  rfe)-maximal access, we select one of these as e. If e is a write, we reverse the outgoing order in cae; the ability to reverse this order witnesses the race. If e is a read, we switch its fulfilling write to a "newer" one, updating cae; the ability to switch witnesses the race. For example, for P'' on the left below, we choose the P''' on the right; e is the read of x, which races with (Wx1).



It is important that e be (po  $\cup$  rfe)-maximal, not just (ppo  $\cup$  rfe)-maximal. The latter criterion would allow us to choose e to be the read of g, but then there would be no e-similar pomset: if an execution reads 0 for g then there is no read of g, due to the conditional.

In the above argument, it is unimportant whether e reads-from an internal or an external write; thus the argument applies to PwT-MCA<sub>2</sub> and PwT-MCA<sub>1</sub> as it does for PwT-MCA<sub>1</sub>.

If there is no  $(po \cup rfe)$ -maximal access, then all cross-thread order must be from rfe. In this case, we select a  $(ppo \cup rfe)$ -maximal read, switching its fulfilling write to an "older" one. If there are

several of these, we choose one that is po-minimal. As an example, consider the following; once again, e is the read of x, which races with (Wx1).

$$x := 0; y := 0; (r := x; y := 1 \parallel s := y; x := s)$$

$$P \qquad P' \setminus P \qquad P'' \setminus P'$$

$$P \qquad P'' \setminus P \qquad P''' \setminus P'$$

$$Wy0 \qquad Rx1 \qquad Wy1$$

$$Wx0 \qquad Ry1 \qquad Wx1$$

This example requires (Wx0). Proper initialization ensures the existence of such "older" writes.  $\Box$ 

#### D PwT-MCA: ADDITIONAL EXAMPLES

This appendix includes additional examples. They all apply equally to PwT-MCA<sub>1</sub> and PwT-MCA<sub>2</sub>. Many of these are taken directly from [Jagadeesan et al. 2020]; see there for further discussion.

## D.1 Buffering

 Store buffering is allowed, as required by TSO.

Load buffering is allowed, as required by Arm8.

$$r := y; x := 1 \parallel r := x; y := 1$$

$$Ry1 \longrightarrow Rx1 \longrightarrow Wy1$$
(LB)

# D.2 Thin-Air

Thin air is disallowed. [Pugh 2004, TC4]:

$$y := x \parallel r := y; x := r$$

$$(Rx1) \longrightarrow (Ry1) \longrightarrow (Ry1) \longrightarrow (Wx1)$$

The control variant ([Pugh 2004, TC13]) is also disallowed:

[Jagadeesan et al. 2020, §2]

$$y := x \parallel r := y; \text{ if } (r)\{x := r; z := r\} \text{ else } \{x := 2\}$$

$$(Rx1) \longrightarrow (Ry1) \longrightarrow (Wz1)$$

$$(OOTA3)$$

[Jeffrey and Riely 2019, §8] and [Jagadeesan et al. 2020, §6]:

$$y := x \parallel r := y; \text{ if } (b)\{x := r; z := r\} \text{ else } \{x := 1\} \parallel b := 1$$

$$(\text{N}x1) \qquad (\text{W}y1) \qquad (\text{W}x1) \qquad (\text{W}b1) \qquad (\text{OOTA4})$$

[Svendsen et al. 2018, RNG] is disallowed since there is no write to fulfill (Ry1).

$$(y := x+1 \parallel x := y)$$

$$(Rx1) \qquad (Ry2) \qquad (Ry1) \qquad (OOTA6)$$

0:48 Anon.

OOTA7 is allowed by PS, but not WEAKESTMO [Chakraborty and Vafeiadis 2019, Fig. 3]:

$$x := 2$$
; if  $(x \neq 2)\{y := 1\} \parallel x := 1$ ;  $r := x$ ; if  $(y)\{x := 3\}$ 

(OOTA7)

OOTA4 is similar to TC5 [Pugh 2004]:

$$y := x \parallel x := y \parallel z := 0; z := 1 \parallel x := z$$

$$(Rx1) \longrightarrow (Ry1) \longrightarrow (Wx1) \qquad (Wz0) \longrightarrow (Wz0) \longrightarrow (Wx0)$$

The justification for forbidding this execution states:

values are not allowed to come out of thin air, even if there are other executions in which the thin-air value would have been written to that variable by some not out-of-thin air means.

OOTA4 is an interesting border case, since it is allowed by speculative models (§A.1).

[Todo: What's the point?] We presented two examples of thin-air behavior involving address calculation in §9.4. The justification for TC12 states:

Since no other thread accesses [either [0] or [1]], the code for [the second] thread should be equivalent to:

$$r := y$$
;  $[r] := 0$ ;  $(s := if(r=0)\{0\} else\{1\})$ ;  $x := s$ ;

With this code, it is clear that this is the same situation as test 4.

[Jagadeesan et al. 2020, §6]:

Boehm's [2019] RFUB example presents another potential form of OOTA behavior. Our analysis shows that there is no OOTA behavior in RFUB, only a false dependency:

$$[r := y; x := r] \not\supseteq [r := y; if(r \neq 1) \{z := 1; r := 1\}; x := r]$$
 (RFUB)

The left command is half of OOTA3 (y := x). The right command is dubbed Rfub, for *Register assignment From an Unexecuted Branch*. Boehm observes that in the context  $x := y \parallel [-]$ , these programs have different behaviors. Yet the OOTA example on the left never writes 1. Why should the unexecuted branch change that? Because of the conditional, the write to x in Rfub is independent of the read from y. It useful to considering the Hoare logic formulas satisfied by the two threads above: we have  $\{tt\}$  Rfub  $\{x=1\}$  for the right thread of Rfub, but not  $\{tt\}$  OOTA3  $\{x=1\}$  for the right thread of OOTA3. The change in the thread from OOTA3 to Rfub is not a valid refinement under Hoare logic; thus, it is expected that Rfub may have additional behaviors.

**RFUB New Constructor:** 

$$y := x \parallel r := y$$
; if  $(r=\text{null})\{r := \text{new C}()\}; x := r; r.f()$  (RFUB-NC)

This is similar to:

$$y := x \parallel r := y$$
; if  $(r=0)\{r := random()\}; x := r$ ; if  $(r)\{z := 1\}$ 

And different from the following, which is similar to TC18:

$$y := x \parallel r := y$$
; if  $(r=0)\{r := 1\}$ ;  $x := r$ ; if  $(r)\{z := 1\}$ 

Proc. ACM Program. Lang., Vol. 0, No. POPL, Article 0. Publication date: January 2022.

#### D.3 Coherence

The following execution is disallowed by fulfillment (M7a and M7b). It is also disallowed by c11 and Java.

$$x := 1; r := x \parallel x := 2; s := x$$
(COH)

M7b requires that we order one write with respect to the other, either before the write or after the read (and therefore after the write). Suppose we pick 1 before 2, as shown. This satisfies M7b for (Rx2). But to satisfy the requirement for (Rx1) we must have either (Wx2) < (Wx1) or (Rx1) < (Wx2). Either way, we have a cycle.

Our model is more coherent than Java, which permits the following:

$$r := x; x := 1 \parallel s := x; x := 2$$

$$(TC16)$$

We also forbid the following, which Java allows:

$$x := 1; y^{\text{rel}} := 1 \parallel x := 2; z^{\text{rel}} := 1 \parallel r := z^{\text{ra}}; r := y^{\text{ra}}; r := x; r := x$$

$$(\text{co3})$$

The following outcome is allowed by the promising semantics [Kang et al. 2017], but not in WEAKESTMO [Chakraborty and Vafeiadis 2019, Fig. 3]. We disallow it:

$$x := 2$$
; if  $(x \neq 2) \{ y := 1 \} \parallel x := 1$ ;  $r := x$ ; if  $(y) \{ x := 3 \}$ 

(COH-CYC)

c11 includes read-read coherence between relaxed atomics in order to forbid the following. We do not order reads by intra-thread coherence, and this allow the following:

$$x := 1; x := 2 \parallel y := x; z := x$$

$$(x_1) \longrightarrow (x_2) \longrightarrow (x$$

Here, the reader sees 2 then 1, although they are written in the reverse order.

We also allow the following, similar execution:

$$x := 1; x := 2 \parallel r_1 := x; r_2 := x; r_3 := x;$$

$$(Wx1) \rightarrow (Rx2) \rightarrow (Rx1) \rightarrow (Rx2)$$

Pugh [1999, §2.3] presented the following example to show that Java's original memory model required alias analysis to validate common subexpression elimination (CSE).

$$r_1 := x$$
;  $r_2 := z$ ;  $r_3 := x$ ; if  $(r_3 \le 1) \{ y = r_2 \}$ 

Coalescing the two read of x is obviously allowed if  $z\neq x$ . But if z=x, coalescing is only permitted because we do not include read-read pairs in  $\bowtie_{co}$  (§4.2):

$$\bowtie_{co} = \{(\mathsf{W}x, \mathsf{W}x), (\mathsf{R}x, \mathsf{W}x), (\mathsf{W}x, \mathsf{R}x)\}\$$

c11 has read-read coherence, and therefore CSE is only valid up to alias analysis in c11.

 0:50 Anon.

#### D.4 MCA

 Here are a few litmus tests that distinguish MCA architectures from non-MCA architectures. MCA1 is an example of write subsumption [Pulte et al. 2018, §3]:

$$if(z)\{x := 0\}; x := 1 \parallel if(x)\{y := 0\}; y := 1 \parallel if(y)\{z := 0\}; z := 1$$

$$(MCA1)$$

Two thread variant:

$$if(x)\{y := 0\}; y := 1 \parallel if(y)\{x := 0\}; x := 1$$

$$(Rx1) \leftarrow (Wy0) \rightarrow (Wy1) \rightarrow (Ry1) \rightarrow (Wx0) \rightarrow (Wx1)$$

IRIW is allowed if all accesses are relaxed, but not if the initial reads are acquiring:

$$x := 1 \parallel r := x^{\mathsf{ra}}; \ s := y \parallel y := 1 \parallel s := y^{\mathsf{ra}}; \ r := x$$

$$(\mathsf{IRIW})$$

$$(\mathsf{R}^{\mathsf{ra}}x1) \longrightarrow (\mathsf{R}y0) \longrightarrow (\mathsf{W}y1) \longrightarrow (\mathsf{R}x0)$$

MCA2 is a simplified version of IRIW

$$x := 0; x := 1 \parallel y := x \parallel r := y^{ra}; s := x$$

$$(Wx0) \longrightarrow (Wx1) \longrightarrow (Rx1) \longrightarrow (Ry1) \longrightarrow (Rx0)$$

[Flur et al. 2016] and [Lahav and Vafeiadis 2016, Fig. 4] discuss the following, which is not valid in Arm8, although it was valid under some earlier sketches of the model:

$$r := x \; ; \; x := 1 \parallel y := x \parallel x := y$$

$$(MCA3)$$

$$Rx1 \longrightarrow (Rx1) \longrightarrow (Ry1) \longrightarrow (Ry1) \longrightarrow (Ry1)$$

These candidate executions are invalid, due to cycles.

## D.5 Detour

The following example [Podkopaev et al. 2019, Ex. 3.7] is disallowed by IMM by including a detour relation. It is also disallowed by PS.

$$x := z-1; \ y := x \parallel x := 1 \parallel z := y$$

$$(Rz1) \leftarrow (Wx0) \qquad (Rx1) \leftarrow (Wy1) \rightarrow (Ry1) \leftarrow (Wz1)$$

# D.6 Read-Read Dependencies and Java Final Field Semantics Versus If-Closure

One might worry that the lack of read-read dependencies could cause DRF-SC to fail. For example, the following execution has a control dependency between the reads of the last thread, but this order is not enforced, neither by our model, nor Arm8.

$$z := 1; y^{\text{rel}} := 1 \parallel r := y^{\text{acq}}; x^{\text{rel}} := 1 \parallel \text{if}(x) \{s := z\}$$

$$(Wz1) \longrightarrow (R^{\text{rel}}y1) \longrightarrow (Rx1) \longrightarrow (Rx1)$$

Proc. ACM Program. Lang., Vol. 0, No. POPL, Article 0. Publication date: January 2022.

If the first read of the last thread is acquiring, then the execution is disallowed, since acquiring reads are ordered with respect to the reads that follow.

$$z := 1; \ y^{\text{rel}} := 1 \parallel r := y^{\text{acq}}; \ x^{\text{rel}} := 1 \parallel \text{if}(x^{\text{acq}}) \{s := z\}$$

$$(Wz1) \longrightarrow (R^{\text{acq}}y1) \longrightarrow (R^{\text{acq}}x1) \longrightarrow (R^{\text{acq}}x1) \longrightarrow (Rz0)$$

Arm8 enforces address dependencies between reads, but not control dependencies. To support case-analysis (AKA if-closure), we drop all dependencies between reads. This, in turn, invalidates Java's final field semantics.

$$(r := 1; [r] := 0; [r] := 1; x^{\text{rel}} := r) \parallel (r := x^{\text{acq}}; s := [r])$$

$$(\text{Addr2})$$

$$(\text{W[1]0}) \longrightarrow (\text{W[1]1}) \longrightarrow (\text{R^{ra}}x1) \longrightarrow (\text{R[1]0})$$

The acquire annotation is required to ensure publication. If address dependencies were enforced between reads then the acquire annotation could be dropped. However, the compiler would need to track address dependencies in order to ensure that case analysis did not convert them to control dependencies.

# D.7 Local Invariant Reasoning and Value Range Analysis

We have already seen TC1 in §4.8, TC2 in §9.1 and TC6 in §7. Here is the complete program for TC6:

$$y := 0; (r := y; if(r=0)\{x := 1\}; if(r=1)\{x := 1\}) \parallel (if(x=1)\{y := 1\})$$

$$(Wy0) \qquad (Ry1) \qquad (Ry1) \qquad (Wx1)$$

$$\phi = (1=r \lor 0=r) \Rightarrow (r=0 \lor r=1)$$
(TC6)

## [Todo: Discuss.]

 Here are some additional examples:

$$y := 0; (r := y; x := 1 + r * r - r) \parallel (y := x)$$

$$(Wy0) \qquad (Ry1) \qquad (Ry1) \qquad (Wx1)$$

$$\phi = (1 = r \lor 0 = r) \Rightarrow 1 + r * r - r = 1$$

$$(TC8)$$

$$x := 0; (r := x; if(r \ge 0) \{y := 1\} \parallel x := y \parallel x := -2)$$

$$(TC9)$$

$$x := 1; a^{ra} := 1; if(z^{ra})\{y := x\} \parallel if(a^{ra})\{x := 2; z^{ra} := 1\}$$
 $(INTERNAL1)$ 
 $r := x; y^{ra} := 1; s := y; z := s \parallel x := z$ 

$$r := x; \ y^{\text{re}} := 1; \ s := y; \ z := s \parallel x := z$$

$$(\text{INTERNAL2})$$

$$(\text{INTERNAL2})$$

Java Causality Test Case 18 asks that we justify the following execution:

0:52 Anon.

Before we prefix x := 0, the precondition of Wy1 is:

$$\phi \equiv (1=r \lor x=r) \Rightarrow ([r=0 \land ((1=s \lor 1=s) \Rightarrow s=1)] \lor [r\neq 0 \land ((1=s \lor x=s) \Rightarrow s=1)])$$

2503 Simplifying:

$$\phi \equiv (1=r \lor x=r) \Rightarrow (r=0 \lor [r\neq 0 \land ((1=s \lor x=s) \Rightarrow s=1)])$$

Prefixing x := 0:

$$\phi \equiv (1=r \lor 0=r) \Rightarrow (r=0 \lor [r\neq 0 \land ((1=s \lor 0=s) \Rightarrow s=1)])$$

Drilling into the interesting part:

$$\phi \equiv 1 = r \Rightarrow ((1 = s \lor 0 = s) \Rightarrow s = 1)$$

This is not a tautology. But we get one by coalescing s and r:

$$\phi \equiv 1 = r \Rightarrow ((1 = r \lor 0 = r) \Rightarrow r = 1)$$

TC20 splits the first thread of TC18:

$$x := 0; (x := y \implies r := x; if(r=0)\{x := 1\}); s := x; y := s$$

$$(x := y \implies r := x; if(r=0)\{x := 1\}); s := x; y := s$$

$$(x := y \implies r := x; if(r=0)\{x := 1\}); s := x; y := s$$

$$(x := y \implies r := x; if(r=0)\{x := 1\}); s := x; y := s$$

$$(x := y \implies r := x; if(r=0)\{x := 1\}); s := x; y := s$$

$$(x := y \implies r := x; if(r=0)\{x := 1\}); s := x; y := s$$

Because we take register state from the right, the example is the same as for TC18 above.

TC17 replaces the condition r=0 by  $r\neq 1$  in TC18:

$$\phi \equiv (1=r \lor x=r) \Rightarrow (\lceil r \neq 1 \land ((1=s \lor 1=s) \Rightarrow s=1) \rceil \lor \lceil r=1 \land ((1=s \lor x=s) \Rightarrow s=1) \rceil)$$

Simplifying and prefixing x := 0:

$$\phi \equiv (1=r \lor 0=r) \Rightarrow (r \neq 1 \lor [r=1 \land ((1=s \lor 0=s) \Rightarrow s=1)])$$

Again, we have:

$$\phi \equiv 1 = r \Rightarrow ((1 = s \lor 0 = s) \Rightarrow s = 1)$$

which is not a tautology. But we get one by coalescing s and r.

TC19 makes the same change for TC20, and follows for the same reason.

### D.8 Commuting release and acquire

[Todo: Discuss.]

RA example. This is impossible, since Rx1 unfulfilled.

$$x := 1; a^{\text{rel}} := 1; r := b^{\text{acq}}; s := x; y := r + s \parallel r := a^{\text{acq}}; x := 2; b^{\text{rel}} := 10$$
 $wx1 \longrightarrow w^{\text{rel}} a1 \longrightarrow x \longrightarrow wy11$ 
 $x := 1; a^{\text{rel}} := 1; r := b^{\text{acq}}; s := x; y := r + s \parallel r := a^{\text{acq}}; x := 2; b^{\text{rel}} := 10$ 
 $wx1 \longrightarrow w^{\text{rel}} a1 \longrightarrow wy11$ 
 $wx2 \longrightarrow w^{\text{rel}} b10$ 

If you swap the release and acquire, then it is impossible for the second thread to get in the middle.

$$x := 1; r := b^{\text{acq}}; a^{\text{rel}} := 1; \parallel r := a^{\text{acq}}; x := 2; b^{\text{rel}} := 10$$

$$(Wx1) \longrightarrow (W^{\text{rel}}a1) \longrightarrow (Wx2) \longrightarrow (W^{\text{rel}}b10)$$

Proc. ACM Program. Lang., Vol. 0, No. POPL, Article 0. Publication date: January 2022.

In this case, the following execution is possible:

$$x := 1$$
;  $r := b^{\text{acq}}$ ;  $a^{\text{rel}} := 1$ ;  $s := x$ ;  $y := r + s \parallel r := a^{\text{acq}}$ ;  $x := 2$ ;  $b^{\text{rel}} := 10$ 
 $(wx1) \quad (x^{\text{acq}} b \cdot 10) \quad (x^{\text{rel}} a \cdot 1) \quad (x^{$ 

But not:

$$x := 1; r := b^{\text{acq}}; a^{\text{rel}} := 1; s := x; y := r + s \parallel r := a^{\text{acq}}; x := 2; b^{\text{rel}} := 10$$
 $wx1 \qquad \text{Racq } b10 \qquad \text{Rx1} \qquad \text{Wy11}$ 
 $x := 1; r := b^{\text{acq}}; a^{\text{rel}} := 10$ 

# D.9 Sevcik examples

[Todo: Discuss.]

Cenciarelli et al. [2007, §7] example. (I incorrectly credit Sevčík and Aspinall [2008].)

if 
$$(x \land y)\{z := 1\} \parallel \text{if}(z)\{x := 1; y := 1\} \text{ else } \{y := 1; x := 1\}$$

$$(x \land y)\{z := 1\} \parallel \text{if}(z)\{x := 1; y := 1\} \text{ else } \{y := 1; x := 1\}$$

$$(x \land y)\{z := 1\} \parallel \text{if}(z)\{x := 1; y := 1\} \text{ else } \{y := 1; x := 1\}$$

Examples from [Sevčík and Aspinall 2008, §4.1] are interesting: Redundant write after read elimination:

```
|| lock m2; x=1; unlock m2
|| lock m1; x=2; unlock m1
```

|| lock m1; lock m2; r1=x; [x=r1;] r2=x; unlock m2; unlock m1 // [bracketed line removed]

Even without the write, r1 and r2 must see the same values, whereas JMM allows different values for the reads when the write is missing.

Redundant read after read elimination:

```
|| y=x
|| r2=y; if (r2==1){[r3=y]; x=r3}else{x=1} // [r3=r2]
```

Interesting case is left Wx1. Initially has predicate  $r_3 = 1$ . With read rule, we have y = 1. In read prefixing, we don't weaken. Instead we weaken with the read into r2.

```
\begin{split} &\text{if}(r_2 = 1)\{r_3 := y \; ; \; x := r_3\} & \text{if}(r_2 \neq 1)\{x := 1\} \\ &\underbrace{(r_2 = 1 \mid \mathsf{R} \; y_1)} & \underbrace{(r_2 = 1 \land y = 1 \mid \mathsf{W} \; x_1)} & \underbrace{(r_2 \neq 1 \mid \mathsf{W} \; x_1)} \\ & \text{if}(r_2 = 1)\{r_3 := y \; ; \; x := r_3\} \; \text{else} \; \{x := 1\} \\ &\underbrace{(r_2 = 1 \mid \mathsf{R} \; y_1)} & \underbrace{((r_2 = 1 \land y = 1) \lor (r_2 \neq 1) \mid \mathsf{W} \; x_1)} \\ & r_2 := y \; ; \; \text{if}(r_2 = 1)\{r_3 := y \; ; \; x := r_3\} \; \text{else} \; \{x := 1\} \\ & \underbrace{(\mathsf{R} \; y_1)} & \underbrace{(\mathsf{R} \; y_1)} & \underbrace{((y = 1 \land y = 1) \lor (y \neq 1) \mid \mathsf{W} \; x_1)} \end{split}
```

To ignore the second read, we use the "delay" trick that we used for JMM TC1, but this is fulfilled by a read rather than a write. In any case, the execution with x = y = 1 is allowed.

Roach Motel—all reads 1 impossible, but passible after swapping r1=x and lock m

0:54 Anon.

```
|| lock m; x=1; unlock m
2598
       || lock m; x=2; unlock m
2599
       || r1=x; lock m; r2=z; if(r1==2){y=1}else{y=r2}; unlock m
2600
2601
       || z=y
2602
       So Question is whether you can read all 1 in
2603
       || lock m; x=1; unlock m
       || lock m; x=2; unlock m
2605
       | | lock m; r1=x; r2=z; if(r1==2){y=1}else{y=r2}; unlock m
       || z=y
2607
       In any execution, we must have 1 before 2, or 2 before 1.
2608
2609
            • If thread sees 2, then read x is 2.
2610
            • If thread sees 1, then read x is 1.
2611
                                            if(r_1=2)\{y:=1\} else\{y:=r_2\}
                                               r_1=2 \lor (r_1 \neq 2 \land r_2=1) \mid \mathsf{W} \, \mathsf{y} \, \mathsf{1}
2613
                                   r_1 := x; r_2 := z; if (r_1 = 2) \{ y := 1 \} else \{ y := r_2 \}
2615
                                                  (Rx1) (Rz1) (Wy1)
2616
2617
               So impossible for y and z to be 1.
2618
         Irrelevant Read Introduction (can I read 1 for both y and z?)
2619
       || r=z; if(!r){if(x){y=1}}else{[s=x;]y=r}
2620
       | | x=1; z=y
2621
2622
2623
                                 if(!r){if(x){y:=1}} if(r){s:=x; y:=r}
2624
                                 r=0 \mid Rx1 \rightarrow r=0 \mid Wy1 r\neq 0 \mid Rx1 r=1 \mid Wy1
2625
2626
                                         if(!r)\{if(x)\{y := 1\}\} else\{y := r\}
2627
                                                (Rx1) \rightarrow (r=0 \lor r=1 \mid Wy1)
2628
                                 z := 0; r := z; if (!r) \{ if(x) \{ y := 1 \} \} else \{ y := r \}
2629
2630
                                        (Wz0) (Rz1) (Rx1) (0=0 \lor 0=1 \mid Wy1)
2631
```

If z is initialized to 2, rather than 0, then the dependencies remain and both are disallowed. This relies crucially on the fact that par takes order from both sides.

 $if(!r){if(x){y := 1}} if(r){y := r}$ 

 $r=0 \mid Rx1 \rightarrow r=0 \mid Wy1$ 

 $if(!r){if(x){y := 1}} else{y := r}$ 

 $r=0 \mid Rx1 \rightarrow r=1 \mid Wy1$ 

z := 0; r := z; if(!r){if(x){y := 1}} else{y := r}

 $(\mathsf{W}z0)$   $(\mathsf{R}z1)$   $(\mathsf{R}x1) \rightarrow (0=0 \lor 0=1 \mid \mathsf{W}y1)$ 

# D.10 SC Access

[Todo: Discuss.]

2645 2646

2632

2633

2634

2635 2636

2637

2638

2639 2640

2641

2642 2643

2644

[Dolan et al. 2018, §8.2]:

$$r := y; x^{\text{sc}} := 1; s := x \parallel x^{\text{sc}} := 2; y := 1$$

$$(\text{Sc1})$$

$$(\text{Sc1})$$

Watt et al. [2020, §3.1]:

$$x^{\text{sc}} := 1; \ r := y^{\text{sc}} \parallel y^{\text{sc}} := 1; \ y^{\text{sc}} := 2; \ x := 2; \ s := x^{\text{sc}}$$

$$(\text{sc2})$$

### D.11 Fences

[Todo: Discuss.]

$$x := 0; x := 1; \mathsf{F}^{\mathsf{rel}}; y := 1 \parallel r := y; \mathsf{F}^{\mathsf{acq}}; s := x$$

$$(\mathsf{PUB2})$$

[Lahav et al. 2017, Fig. 5]:

[Lahav et al. 2017, Fig. 6]

$$x := 1; z^{\mathsf{ra}} := 1; \parallel r := z^{\mathsf{acq}}; \mathsf{F}^{\mathsf{sc}}; r := y \parallel y := 1; \mathsf{F}^{\mathsf{sc}}; r := x$$

$$(\mathsf{Sc4})$$

Here are several examples mixing fencing with release/acquire:

$$x := 1; y^{ra} := 1 \parallel r := y^{acq}; s := x$$

$$Wx1 \longrightarrow \mathbb{R}^{acq} y1 \longrightarrow \mathbb{R}^{x0}$$

$$x := 1; F^{rel}; y := 1 \parallel r := y^{acq}; s := x$$

$$Wx1 \longrightarrow \mathbb{R}^{y1} \longrightarrow \mathbb{R}^{x0}$$

$$x := 1; y^{ra} := 1 \parallel r := y; F^{acq}; s := x$$

$$Wx1 \longrightarrow \mathbb{R}^{y1} \longrightarrow \mathbb{R}^{x0}$$

$$x := 1; F^{rel}; y := 1 \parallel r := y; F^{acq}; s := x$$

$$x := 1; F^{rel}; y := 1 \parallel r := y; F^{acq}; s := x$$

$$x := 1; F^{rel}; y := 1 \parallel r := y; F^{acq}; s := x$$

$$x := 1; F^{rel}; y := 1 \parallel r := y; F^{acq}; s := x$$

$$x := 1; F^{rel}; y := 1 \parallel r := y; F^{acq}; s := x$$

[Podkopaev et al. 2019, §D]: The following execution graph is not consistent in the promise-free declarative model of [Kang et al. 2017]. Nevertheless, its mapping to POWER (obtained by simply replacing Fsc with Fsync) is POWER-consistent and po  $\cup$  rf is acyclic (so it is Strong-POWER-consistent). Note that, using promises, the promising semantics allows this behavior.

$$r := z$$
;  $F^{sc}$ ;  $x := 1 \parallel x := 2$ ;  $F^{sc}$ ;  $y := 1 \parallel r := y$ ;  $z := 1$ 

Allowed behavior on POWER... Is there a dependency in the last thread? If so, this is a problem.

0:56 Anon.

[Podkopaev et al. 2019, §8]: To establish the correctness of compilation of the promising semantics to POWER, Kang et al. [2017] followed the approach of Lahav and Vafeiadis [2016]. This approach reduces compilation correctness to POWER to (i) the correctness of compilation to the POWER model strengthened with po  $\cup$  rf acyclicity; and (ii) the soundness of local reorderings of memory accesses. To establish (i), Kang et al. [2017] wrongly argued that the strengthened POWER-consistency of mapped promise-free execution graphs imply the promise-free consistency of the source execution graphs. This is not the case due to SC fences, which have relatively strong semantics in the promise-free declarative model (see [Podkopaev et al. 2018, Appendix D] for a counter example). Nevertheless, our proof shows that the compilation claim of Kang et al. [2017] is correct.

#### D.12 RMWs

 If RMWs simply use the same semantics as read and write, then we allow LDRF-PF-FAIL, which is used to show failure of LDRF-sc for the promising semantics in [Cho et al. 2021].

$$y := 0$$
; if  $(y)$ {if  $(!CAS(x, 0, 1))$ {if  $(z)$ { $x := 2$ }}}  $|| y := 1$ ; if  $(1 \neq CAS(x, 0, 3))$ { $z := 1$ }

 $(UX)$ 
 $(UX$ 

To disallow this, we need to retain the dependency  $(Rx2) \rightarrow (Wz1)$ . For this, we need to avoid the substitution for x. This is why we use READ' instead of READ in the independent case for RMWs.

It is not possible for two RMWs to see the same write.

$$x := 0; (FADD^{rlx,rlx}(x,1) \parallel FADD^{rlx,rlx}(x,1))$$

$$(RX0) \xrightarrow{rmw} (Wx1) \xrightarrow{rmw} (Wx1)$$

$$(RX0) \xrightarrow{rmw} (Wx1)$$

The gray arrow is required the RMW atomicity axioms.

Lee et al. [2020] introduce PS2.0 to refine the treatment of RMWs in the promising semantics (PS). Their examples have the expected results here, with far less work. First they recall that PS requires quantification over multiple futures in order to disallow executions such as CDRF. (We showed the relaxed variant (CDRF-RLX) in §9.3.)

$$r := \mathsf{FADD}^{\mathsf{acq},\mathsf{rel}}(x,1) \; ; \; \mathsf{if}(r=0) \{ y := 1 \} \parallel r := \mathsf{FADD}^{\mathsf{acq},\mathsf{rel}}(x,1) \; ; \; \mathsf{if}(r=0) \{ \mathsf{if}(y) \{ x := 0 \} \}$$

$$(\mathsf{R}^{\mathsf{acq}} x 0) \qquad (\mathsf{W} y 1) \qquad (\mathsf{R}^{\mathsf{acq}} x 0) \qquad (\mathsf{W} y 1) \qquad (\mathsf{W} y 1)$$

$$(\mathsf{W}^{\mathsf{rel}} x 1) \qquad (\mathsf{W}^{\mathsf{rel}} x 1) \qquad (\mathsf{CDRF})$$

This execution is clearly impossible, due to the cycle above. In this diagram, we have not drawn order adjacent to the writes of the RMWS, since this is not necessary to produce the cycle. If CDRF is allowed then DRF-RA fails.

PS does not support global value range analysis, as modeled by GA+E below. Our semantics permits GA+E:

$$x := 0$$
;  $(r := CAS^{r|x,r|x}(x, 0, 1); if (r < 10) \{y := 1\} || x := 42; x := y)$ 

$$(GA+E)$$

PS also does not support register promotion, as modeled by RP below. Our semantics permits RP:

$$r := x ; s := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(z,r) ; y := s+1 \parallel x := y$$

$$(\mathsf{R}x1) \qquad (\mathsf{R}y1) \qquad (\mathsf{R}y1) \qquad (\mathsf{R}y1)$$

Example D.1. Recall M10c: if  $\lambda(c)$  overlaps  $\lambda(d)$  and  $d \xrightarrow{rmv} e$  then (1) c < e implies  $c \le d$  and (2) d < c implies  $e \le c$ .

This definition ensures atomicity, disallowing executions such as [Podkopaev et al. 2019, Ex. 3.2]:

$$x := 0$$
; INC<sup>r|x,r|x</sup>  $(x) \parallel x := 2$ ;  $r := x$ 

$$\boxed{Wx0} \qquad \boxed{Rx0} \qquad \boxed{Wx1} \qquad \boxed{Rx1}$$

By 1, since  $(Wx2) \rightarrow (Wx1)$ , it must be that  $(Wx2) \rightarrow (Rx0)$ , creating a cycle.

Example D.2. Two successful RMWs cannot see the same write:

The order from read-to-write is required by fulfillment. Apply 1 of the second RMW to  $a \to d$ , we have that  $a \to c$ . Subsequently applying 2 of the first RMW, we have  $b \to c$ , creating a cycle.

*Example D.3.* By using two actions rather than one, the definition allows examples such as the following, which is allowed by Arm8 [Podkopaev et al. 2019, Ex. 3.10]:

$$r := z$$
;  $s := INC^{rlx,rel}(x)$ ;  $y := s+1 \parallel r := y$ ;  $z := r$ 

$$(x) \xrightarrow{rel} (x) \xrightarrow{rel} (x) \xrightarrow{rel} (x)$$

$$(x) \xrightarrow{rel} (x) \xrightarrow{rel} (x)$$

A similar example, also allowed by Arm8 [Chakraborty and Vafeiadis 2019, Fig. 6]:

$$r := z$$
;  $s := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,r)$ ;  $y := s+1 \parallel r := y$ ;  $z := r$ 

$$(\mathsf{R}z1) \qquad (\mathsf{W}y1) \qquad (\mathsf{W}y1) \qquad (\mathsf{W}z1)$$

This is allowed by WEAKESTMO, but not PS.

Example D.4. Consider the CDRF example from [Lee et al. 2020]:

$$r := \mathsf{INC}^{\mathsf{acq},\mathsf{rel}}(x) \; ; \; \mathsf{if}(r=0) \{ y := 1 \}$$

$$\parallel \; r := \mathsf{INC}^{\mathsf{acq},\mathsf{rel}}(x) \; ; \; \mathsf{if}(r=0) \{ \mathsf{if}(y) \{ x := 0 \} \}$$

$$\mathsf{R}^{\mathsf{acq}}(x) \qquad \mathsf{W}^{\mathsf{rel}}(x) \qquad \mathsf{R}^{\mathsf{y1}} \qquad \mathsf{W}^{\mathsf{y2}}(x) \qquad \mathsf{R}^{\mathsf{y3}}(x) \qquad \mathsf{R}^{\mathsf{y4}}(x) \qquad \mathsf{R}^{\mathsf{y4}}(x)$$

0:58 Anon.

Example D.5. Consider this example from [Lee et al. 2020, §C]:

$$r := \mathsf{CAS^{rlx,rlx}}(x,0,1) \; ; \; \mathsf{if}(r \leq 1) \{ y := 1 \}$$
 
$$\parallel \; r := \mathsf{CAS^{rlx,rlx}}(x,0,2) \; ; \; \mathsf{if}(r = 0) \{ \mathsf{if}(y) \{ x := 0 \} \}$$
 
$$\boxed{\mathsf{Rx0}} \qquad \boxed{\mathsf{Rx0}} \qquad \boxed{\mathsf{Rx0}} \qquad \boxed{\mathsf{Ry1}} \qquad \boxed{\mathsf{Wx0}}$$

#### D.13 More RMW

 These following examples are from [Cho et al. 2021].

CDRF shows that PwT semantics is not too permissive for ra-RMws. But what about rlx-RMws. The following execution is allowed by Arm8, and Ps2.0, but disallowed by Ps2.1.



If this  $\{z\}$ -DRF-RA?

$$if(y)\{x := z\} else\{x := 1\} \parallel r := x; z := 1; y := r$$

$$Ry1 \longrightarrow Rx1 \longrightarrow Wy1 \longrightarrow Wy1$$
(NAIVE-LDRF-RA-FAIL)

Interpreting  $\{z\}$  as ra:



## D.14 Fences and RMW

#### [Todo: Discuss.]

[Podkopaev et al. 2019, Remark 2, After example 3.1]: Aim: allow the splitting of release writes and RMWs into release fences followed by relaxed operations. In RC11 [Lahav et al. 2017], as well as in C/C++11 [Batty et al. 2011], this rather intuitive transformation, as we found out, is actually unsound.

$$y := 1; \ x^{\mathsf{ra}} := 1 \parallel \mathsf{INC}^{\mathsf{ra},\mathsf{ra}}(x); \ x := 3 \parallel r := x^{\mathsf{acq}}; \ s := y$$

$$(\mathsf{W}y1) \leftarrow (\mathsf{W}^{\mathsf{rel}}x1) \rightarrow (\mathsf{R}^{\mathsf{acq}}x1)^{\mathsf{rmw}}(\mathsf{W}^{\mathsf{rel}}x2) \rightarrow (\mathsf{W}x3) \rightarrow (\mathsf{R}^{\mathsf{acq}}x3) \leftarrow (\mathsf{R}y0)$$

(R)C11 disallows the annotated behavior, due in particular to the release sequence formed from the release exclusive write to x in the second thread to its subsequent relaxed write. However, if we split the increment to fencerel; a := FADDacq,rlx(x, 1) (which intuitively may seem stronger), the release sequence will no longer exist, and the annotated behavior will be allowed. IMM overcomes

this problem by strengthening sw in a way that ensures a synchronization edge for the transformed program as well

$$y := 1; x^{\mathsf{ra}} := 1 \parallel \mathsf{F}^{\mathsf{rel}}; \ \mathsf{INC}^{\mathsf{ra},\mathsf{rlx}}(x); x := 3 \parallel r := x^{\mathsf{acq}}; s := y$$

We seem to disallow both of these out of the box.

In the case of a relaxed read in the RMW, the outcome is allowed in both cases:

$$y := 1; x^{ra} := 1 \parallel \mathsf{INC}^{\mathsf{rlx},\mathsf{ra}}(x); x := 3 \parallel r := x^{\mathsf{acq}}; s := y$$

$$\boxed{\mathsf{W}y1} \qquad \boxed{\mathsf{W}^{\mathsf{rel}}x1} \qquad \boxed{\mathsf{R}x1} \qquad \boxed{\mathsf{W}^{\mathsf{rel}}x2} \qquad \boxed{\mathsf{W}x3} \qquad \boxed{\mathsf{R}^{\mathsf{acq}}x3} \qquad \boxed{\mathsf{R}y0}$$

$$y := 1; x^{\mathsf{ra}} := 1 \parallel \mathsf{F}^{\mathsf{rel}}; \mathsf{INC}^{\mathsf{rlx},\mathsf{rlx}}(x); x := 3 \parallel r := x^{\mathsf{acq}}; s := y$$

$$\boxed{\mathsf{W}y1} \qquad \boxed{\mathsf{F}^{\mathsf{rel}}} \qquad \boxed{\mathsf{R}x1} \qquad \boxed{\mathsf{W}x2} \qquad \boxed{\mathsf{W}x3} \qquad \boxed{\mathsf{R}^{\mathsf{acq}}x3} \qquad \boxed{\mathsf{R}y0}$$

#### **E NOT FOR PUBLICATION**

## E.1 Recent discussion on JMM/JDK-dev

Raffaello Giulietti: "JEP 188: Java Memory Model Update" [1], the JMM wiki [2] and the jmm-dev mailing list [3] seem quite inactive. (The latter point explains why I'm posting to this list instead.)

The introduction of j.l.i.VarHandle [4] brought more access modes to Java, but in a narrative and informal way. A paper by Bender & Palsberg [5], addressing the formalization of the concurrent access modes, has been published in 2019 but I'm not sure if it caught the attention of the OpenJDK community.

So what is the current thinking for progressing the JMM spec?

Hans Boehm: I think it's safe to say that it has been slow going, not just for Java, but for other languages as well.

In my view, the core problem, shared by pretty much all of them, is that we don't have an established way to give well-defined semantics to potentially racing unordered accesses, like ordinary variable accesses in Java, or memory\_order\_relaxed accesses in C and C++. That's particularly essential with the traditional Java language-based-security model, since we can't just give up on racing accesses to ordinary variables.

I'm aware of a number of proposed solutions. But I don't think we currently have enough confidence that they

- Are correct, and don't have issues similar to the older models,
- Don't have unintended consequences, particularly for compilation, and
- Are sufficiently comprehensible by programmers to actually be useful.

[Correctness] is hard because the models have gotten complex enough that reviewers are scarce. (A problem that I gather you're familiar with.) The authors are commonly experts at formally analyzing the models, but it's hard to analyze whether the model conflicts with some well-known, but perhaps not well-written-down compilation technique.

Probably even more controversially, I think we've realized that existing compiler technology can compile such racing code in ways that some of us are not 100% sure should really be allowed. Demonstrably unexecuted code can affect the semantics in ways that strike me as scary. (See <a href="https://doi.org/10.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/jtms.1007/j

0:60 Anon.

//wg21.link/p1217 for a down-to-assembly C++ version; [if I understand correctly], Lochbihler and others earlier came up with some closely related observations for Java.)

It might be possible to do what we've involuntarily done for C++: Punt the hard cases for now, and define what the model is for programs without racing ordinary accesses.

[p1217 is [Boehm 2019].]

# Andrew Haley:

- > Probably even more controversially, I think we've realized that
- > existing compiler technology can compile such racing code in ways
- > that some of us are not 100\% sure should really be allowed.

This implies, does it not, that the problem is not formalization as such, but that we don't really understand what the language is supposed to mean? That's always been my problem with OOTA: I'm unsure whether the problem is due to the inadequacy of formal models, in which case the formalists can fix their own problem, or something we all have to pay attention to.

Hans Boehm: In some sense, I'm not sure either. The p1217 examples bother me in that they seem to violate some global programming rules ("if x is only ever null or refers to an object properly constructed by the same thread, then x should never appear to refer to an incompletely constructed object"). And there seems to be disagreement about whether the currently allowed behavior is "correct"

On the other hand, in practice the weirdness doesn't seem to break things. If you ask people advocating the current behavior, the answer will be that it doesn't matter because nobody writes code that way. If you ask people trying to analyzer or verify code, they'll probably be unhappy. And I haven't been able to convince myself that you cannot get yourself into these situations just by linking components together, each of which does something perfectly reasonable.

And there are very common code patterns (like the standard implementation of reentrant locks used by all Java implementations) that break if you allow general OOTA behavior. Which at least means that you can't currently formally verify such code. The theorem you'd be trying to prove is false with respect to the part of the language spec we know how to formalize.

It's a mess.

# Andrew Haley:

- > Demonstrably unexecuted code can affect the semantics in ways that strike me
- > as scary. (See wg21.link/p1217 for a down-to-assembly C++ version; IIUC, Lochbihler
- > and others earlier came up with some closely related observations for Java.)

Looking again at p1217, it seems to me that enforcing load-store ordering would have severe effects on compilers, at least without new optimization techniques. We hoist loads before loops and sink stores after them. When it all works out, there are no memory accesses in the loop. A load-store barrier in a loop would have the effect of forcing succeeding stores out to memory, and forcing preceding loads to reload from memory. It's not hard to imagine that this would cause an order-of-margnitude performance reduction in common cases.

I suppose one could argue that such optimizations would continue to be valid, so only those stores which would have been emitted anyway would be affected. But that's not how compilers work, as far as I know. In our IR for C2, memory accesses are not pinned in any way, so the only way to make unrelated accesses execute in any particular order is to add a dependency between all loads and stores.

Hans Boehm: I think it would be a fairly pervasive change to optimizers. It has also become clear in WG21, the C++ committee, that there is not enough support for requiring this. In that case, Ou

and Demsky have a paper saying that the overhead is likely to be on the order of 1% or less. For Java if it were applied everywhere, it would probably be appreciably higher.

On the other hand, it's a bit harder than that to come up with examples where the generated x86 code has to be worse. Moving loads earlier in the code, or delaying stores, as you suggest, would still be fine. The only issue is with delaying loads past stores, which seems less common, though it can certainly be beneficial for reducing live ranges, probably some vectorization etc.

But it seems unlikely that such a restriction will be applied even to C++ memory\_order\_relaxed, much less Java ordinary variables.

Doug Lea: My stance in the less formal account (http://gee.cs.oswego.edu/dl/html/j9mm.html) as well as Shuyang Liu et al's ongoing formalization (see links from http://compilers.cs.ucla.edu/people/) is that the most you want to say about racy Java programs is that they are typesafe. As in: you can't see a String when expecting an int. Even this looser constraint is challenging to specify, prove, and extend. But it is a path for Java that might not apply to languages like C that are not guaranteed typesafe anyway, and so enter Undefined Behavior territory (as opposed to possibly-unexpected but still typesafe behavior).

Han Boehm: But this now breaks some common idioms, right? In particular, I think a bunch of code assumes that racing assignments of equivalent primitive values or immutable objects to the same field are OK.

If, in 2004, our view of language-based security had been the same as it is now, then I completely agree that this would have been the right approach. But I think doing it now would require significant user code changes. Which might still be the best way forward ...

#### E.2 A Note on Mixed-Mode Data Races

 In preparing this paper, we came across the following example, which appears to invalidate Theorem 4.1 of [Dongol et al. 2019].

$$x := 1; y^{\text{rel}} := 1; r := x^{\text{acq}} \parallel \text{if}(y^{\text{acq}}) \{x^{\text{rel}} := 2\}$$

$$\boxed{Wx1 \longrightarrow W^{\text{rel}}y1} \qquad \boxed{R^{\text{acq}}y1 \longrightarrow W^{\text{rel}}x2}$$

$$\boxed{\mathbb{R}^{\text{acq}}y1 \longrightarrow W^{\text{rel}}x2}$$

The program is data-race free. The two executions shown are the only top-level executions that include  $(W^{rel}x^2)$ .

Theorem 4.1 of [Dongol et al. 2019] is stated by extending execution sequences. In the terminology of [Dongol et al. 2019], a read is L-weak if it is sequentially stale. Let  $\rho = (Wx1)(W^{\rm rel}y1)$  ( $R^{\rm acq}y1)(W^{\rm rel}x2)$  be a sequence and  $\alpha = (R^{\rm acq}x1)$ .  $\rho$  is L-sequential and  $\alpha$  is L-weak in  $\rho\alpha$ . But there is no execution of this program that includes a data race, contradicting the theorem. The error seems to be in Lemma A.4 of [Dongol et al. 2019], which states that if  $\alpha$  is L-weak after an L-sequential  $\rho$ , then  $\alpha$  must be in a data race. That is clearly false here, since ( $R^{\rm acq}x1$ ) is stale, but the program is data race free.

In proving the SC-LDRF result in [Jagadeesan et al. 2020, §8], we noted that our proof technique is more robust than that of [Dongol et al. 2019], because it limits the prefixes that must be

0:62 Anon.

considered. In (¶), the induction hypothesis requires that we add ( $\mathbb{R}^{acq}x1$ ) before ( $\mathbb{W}^{rel}x2$ ) since ( $\mathbb{R}^{acq}x1$ )  $\rightarrow$  ( $\mathbb{W}^{rel}x2$ ). In particular,

$$(Wx1) \rightarrow (W^{rel}y1)$$
  $(R^{acq}y1) \rightarrow (W^{rel}x2)$ 

is not a downset of (¶), because  $(R^{acq}x1) \rightarrow (W^{rel}x2)$ . As noted in [Jagadeesan et al. 2020, §8], this affects the inductive order in which we move across pomsets, but does not affect the set of pomsets that are considered. In particular,



is a downset of  $(\P)$ .

## F OLD NOTES

## F.1 More optimizations

Sound to strengthen the annotation on an action from rlx to ra, and from ra to sc.

From [Manson et al. 2005]:

- synchronization on thread local objects can be ignored or removed altogether (the caveat
  to this is the fact that invocations of methods like wait and notify have to obey the correct
  semantics for example, even if the lock is thread local, it must be acquired when performing a wait),
- volatile fields of thread local objects can be treated as normal fields.
- redundant synchronization (e.g., when a synchronized method is called from another synchronized method on the same object) can be ignored or removed,

Counterexample for first two:

```
y=1; x^AR=1; r=X^AR; z=1
```

If you see z = 1 you must see y = 1

It would be nice if we could get at these with a strength reducing result: synchronization actions can be replaced by relaxed actions in some cases. Then the rules for relaxed read elimination and relaxed write elimination can be used to get rid of them.

## F.2 Examples for semicolon semantics

- Parallel asymmetric: state result for *joint free* programs.
- Subsumption can be allowed on registers only
- We build substitutions
- Ignore substitutions when considering semantic equality.

Value for r in  $(r=1 \mid Wz1)$  from (Wx1):

$$x := 1 \parallel x := 1; r := x; y := r; z := r$$
 $(Wx1) \rightarrow (Wy1) \qquad (Wz1)$ 

Value for r in  $(r=1 \mid Wz1)$  from (Wx1):

$$x := 2 \parallel x := 1; r := x; if(r>0) \{y := 1\}; if(r>0) \{z := 1\}$$

$$(Wx2) \qquad (Wx1) \qquad (Wy1) \qquad (Wz1)$$

Note that this also contains pomset where value for r in  $(r=1 \mid Wy1)$  also comes from (Wx1):

So our semantics will calculate the least ordered version. Then rely on augmentation to get the others.

It is also possible that the read is necessary to give a value for r:

$$x := 2 \parallel x := 0; r := x; \text{ if } (r > 0) \{ y := 1 \}; \text{ if } (r > 0) \{ z := 1 \}$$

$$\boxed{Wx2} \qquad \boxed{Wx0} \qquad \boxed{Rx2} \qquad \boxed{Wy1} \qquad \boxed{Wz1}$$

$$x := 0; r := x \qquad \qquad \text{if } (r > 0) \{ y := 1 \}; \text{ if } (r > 0) \{ z := 1 \}$$

$$\boxed{Wx0} (rRx2 \mid \langle 0/x, 0/r \rangle) \qquad \qquad \boxed{(r > 0 \mid Wy1) (r > 0 \mid Wz1)}$$

$$\boxed{(0/x)} \qquad \qquad \boxed{(r > 0 \mid \langle 1/y, 1/z \rangle)}$$

Dependency on two reads:

Don't need to worry about confusing reads:

$$r := x; s := x; if(s < 0) \{z := 1\}$$

$$rRx1 \quad (sRx2) \leftarrow (Wz1)$$

$$r := x; s := x \qquad z := s$$

$$rRx1 \quad (sRx2) \qquad (s < 0 \mid Wz1)$$

$$(x/r, x/s) \qquad (s < 0 \mid \langle 1/z \rangle)$$

But we also have

$$r := x$$
;  $s := x$ ; if  $(s < 0)\{z := 1\}$ 

$$(rRx1) (sRx2) (Wz1)$$

0:64 Anon.

3088 
$$r := x \qquad s := x; \text{ if } (s < 0) \{z := 1\}$$
3089 
$$(x < 0) | (x < 0) | (x < 0) | (x < 0) | (x < 0) |$$
3091 
$$x < 0 | (x < 0) | (x < 0) | (x < 0) |$$

Dependency on two reads (No dependency here):

$$r := x; s := x; if(r=s)\{z := 1\}$$

$$rRx1 \quad (sRx2) \quad (Wz1)$$

$$r := x \quad s := x; if(r=s)\{z := 1\}$$

$$rRx1 \quad (sRx2) \quad (r=x \mid Wz1)$$

$$r=x \mid \langle x/r \rangle$$

$$r=x \mid \langle x/s, 1/z \rangle$$

Another example:

Value for r in  $(r < s \mid Wz1)$  from (Wx0):

$$x := 0$$
;  $r := x$ ;  $s := y$ ; if  $(r < s) \{z := 1\}$ 

$$(Wx0) (rRx1) (sRy2) \rightarrow (Wz1)$$

Contrary to submission, reverse subsumption not okay.

$$x := 1 \qquad x := 2$$

$$rRx1 \qquad sRx2$$

$$(1/x) \qquad (1/x)$$

## F.3 Playing around with 5a and 4b

If we do this, then swap 4b and 4c, In definition 2.10, take 1-4b of def 2.8, rather than all of it. Another

```
r := x; \ s := x; \ if(r > 0) \{y := 1\}; \ if(s > 0) \{z := 1\}
r := x; \ if(r > 0) \{y := 1\}; \ s := x; \ if(s > 0) \{z := 1\}
rRx1 \quad (sRx2) \quad (Wy1) \quad (Wz1)
s := x; \ r := x; \ if(r > 0) \{y := 1\}; \ if(s > 0) \{z := 1\}
s := x; \ if(s > 0) \{z := 1\}; \ r := x; \ if(r > 0) \{y := 1\}
rRx1 \quad (sRx2) \quad (Wy1) \quad (Wz1)
rRx1 \quad (sRx2) \quad (Wy1) \quad (Wz1)
s := x; \ if(r > 0) \{y := 1\}; \ if(s > 0) \{z := 1\}
sRx2 \quad (r > 0) \{y := 1\}; \ if(s > 0) \{z := 1\}
```

3137
3138 r:=x; s:=x rRx1 = x rRx2 = x rRx1 = x rRx2 = x rRx1 = x rRx2 = x

Idea to get rid of 4b and change 5a to the following:

5a. if *e* writes then either  $\kappa'(e)$  implies  $\kappa(e)$ , or some c < e reads *v* from *x* and  $\kappa'(e)$  implies  $\kappa(e)[v/x]$ ,

Need to get rid of 4b because it is sensitive to order of reads.

This change seems sound, because of consistency. But it also fails to validate read reordering on same variable, due to consistency.

Without 4b, we still do not allow:

$$r:=x$$
;  $s:=x$ ;  $y:=r$ ;  $z:=r$   
 $Rx1$   $Rx2$   $Wy1$   $Wz2$ 

The following is not a pomset (consistency):

$$y := r; z := r$$

$$(r=1 \mid Wy1) \quad (r=2 \mid Wz2)$$

Without 4b, we still do not allow:

$$r := x; s := x; y := r; z := s; if(r=s){a := 1};$$
  
 $(Rx1)$   $(Rx2)$   $(Wy1)$   $(Wz2)$   $(x=x | Wa1)$ 

The following is not a pomset (consistency):

$$y := r; z := s; if(r=s){a := 1};$$
  
 $(r=1 \mid Wy1) (s=2 \mid Wz2) (r=s \mid Wa1)$ 

We do allow:

$$r := x$$
;  $s := x$ ; if  $(r=s)\{a := 1\}$ ;  
 $(Rx1)$   $(Rx2)$   $(x=x | Wa1)$ 

And also

$$r_1 := x \; ; \; r_2 := x \; ; \; r_3 := x \; ; \; \text{if} \; (r_3 \le 1) \{ y := 1 \} ;$$

$$(Rx0) \quad (Rx2) \quad (Rx1) \quad (1 \le 1 \mid Wy1)$$

But we cannot wait forever to satisfy a precondition. This is not a pomset:

$$\begin{array}{c}
r := x; \ s := x; \ y := r; \ z := s \\
\hline
(Rx3) \ (Rx4) \ (x=1 \mid Wy1) \ (x=2 \mid Wz2)
\end{array}$$

Note that reads that we delay must all be consistent.

Also note that we cannot have:

$$r := x; a := r; s := x; b := s; y := r; z := s$$
 $Rx3$ 
 $Rx4$ 
 $x=1 \mid Wy1$ 
 $x=1 \mid Wz2$ 
 $Wa3$ 
 $Wb4$ 

Proc. ACM Program. Lang., Vol. 0, No. POPL, Article 0. Publication date: January 2022.

0:66 Anon.

Because the following is not a pomset:

$$b := s; y := r; z := s$$

$$\underbrace{(r=1 \mid Wy1)}_{s=4 \mid Wb4}$$

But we can have the following, since there is no order the reads:

$$r_1 := x$$
;  $s_1 := x$ ;  $r_2 := x$ ;  $s_2 := x$ ;  $y := r_2$ ;  $z := s_2$ 

$$(Rx1) (Rx2) (Rx3) (Rx4) (Wy1) (Wz2)$$

Because this is indistinguishable from:

$$r_1 := x$$
;  $s_1 := x$ ;  $r_2 := x$ ;  $s_2 := x$ ;  $y := r_2$ ;  $z := s_2$   
 $(Rx3)$   $(Rx4)$   $(Rx1)$   $(Rx2)$   $(Wy1)$   $(Wz2)$ 

which is the same as:

But we can have:

$$p := x ; r := x ; s := x ; y := r ; z := s$$

$$(Rx1) (Rx3) (Rx4) (x=1 | Wy1) (x=1 | Wz1)$$

Reads can only swap when their values are interchangeable in the following program.

## F.4 Alan comments

```
x=s; y=r; z=3s+2r
```

x=s; y=r; z1=s; if(r odd){ z2=1} // using 1 and 3 as the reads

Proc. ACM Program. Lang., Vol. 0, No. POPL, Article 0. Publication date: January 2022.