# Sequential Composition for Relaxed Memory: Pomsets with Predicate Transformers\*

ALAN JEFFREY, Roblox, USA JAMES RIELY, DePaul University, USA

This paper presents the first compositional definition of sequential composition that applies to a relaxed memory model weak enough to allow efficient implementation on Arm. We extend the denotational model of pomsets with preconditions with predicate transformers. Previous work has shown that pomsets with preconditions are a model of concurrent composition, and that predicate transformers are a model of sequential composition. This paper show how they can be combined.

CCS Concepts: • Theory of computation → Parallel computing models; Preconditions.

Additional Key Words and Phrases: Concurrency, Relaxed Memory Models, Multi-Copy Atomicity, ARMv8, Pomsets, Preconditions, Temporal Safety Properties, Thin-Air Reads, Compiler Optimizations

#### **ACM Reference Format:**

Alan Jeffrey and James Riely. 2020. Sequential Composition for Relaxed Memory: Pomsets with Predicate Transformers. *Proc. ACM Program. Lang.* 4, OOPSLA, Article 194 (November 2020), 45 pages. https://doi.org/10.1145/3428262

#### 1 INTRODUCTION

This paper is about the interaction of two of the fundamental building blocks of computing: memory and sequential composition. One would like to think that these are well-worn topics, where every issue has been settled, but this is sadly not the case.

# 1.1 Memory

For single-threaded programs, memory can be thought of as you might expect: programs write to, and read from, memory references. This can be thought of as a total order of reads and writes, where each read has a matching *fulfilling* write, for example:

$$x := 0; x := 1; y := 2; r := y; s := x$$

$$(Wx0) \longrightarrow (Wx1) \longrightarrow (Ry2) \longrightarrow (Rx1)$$

(In examples, *r*–*s* range over thread-local registers and *x*-*z* range over shared memory references.)

Authors' addresses: Alan Jeffrey, Roblox, Chicago, USA; James Riely, DePaul University, Chicago, USA.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

© 2020 Copyright held by the owner/author(s).

2475-1421/2020/11-ART194

https://doi.org/10.1145/3428262

 $<sup>^*\</sup>mbox{Riely}$  was supported by the National Science Foundation under grant No. CCR-1617175.

This model naturally extends to the case of shared-memory concurrency, leading to a *sequentially consistent* semantics [Lamport 1979], in which *program order* inside a thread implies a total *causal order* between read and write events, for example:

$$x := 0; x := 1; y := 2 \parallel r := y; s := x$$

$$(Wx0) \longrightarrow (Wx1) \longrightarrow (Ry2) \longrightarrow (Rx1)$$

Unfortunately, this model does not compile efficiently to commodity hardware, resulting in a 37–73% increase in CPU time on ARM [Liu et al. 2019] and, hence, in power consumption. Developers of software and compilers have therefore been faced with a difficult trade-off, between an elegant model of memory, and its impact on resource usage (such as size of data centers, electricity bills and carbon footprint). Unsurprisingly, many have chosen to prioritize efficiency over elegance.

This has led to *relaxed memory models*, in which the requirement of sequential consistency is weakened to only apply *per-location* and not globally over the whole program. This allows executions which are inconsistent with program order, such as:

$$x := 0; x := 1; y := 2 \parallel r := y; s := x$$

$$(Wx0) \qquad (Wy2) \qquad (Ry2) \qquad (Rx0)$$

In such models, the causal order between events is important, and includes control and data dependencies, to avoid paradoxical "out of thin air" examples such as:

$$r := x$$
; if  $(r)\{y := 1\} \parallel s := y$ ;  $x := s$ 

This candidate execution forms a cycle in causal order, so is disallowed, but this depends crucially on the control dependency from (Rx1) to (Wy1), and the data dependency from (Ry1) to (Wx1). If either is missing, then this execution is acyclic and hence allowed. For example dropping the control dependency results in:

$$r := x; y := 1 \parallel s := y; x := s$$

$$(Rx1) \qquad (Wy1) \qquad (Wx1)$$

Unfortunately, while a simple syntactic approach to dependency calculation suffices for hardware models, it is not preserved by common compiler optimizations. For example, if we calculate control dependencies syntactically, then there is a dependency from (Rx1) to (Wy1), and therefore a cycle in, the candidate execution:

$$r := x$$
; if  $(r)\{y := 1\}$  else  $\{y := 1\}$   $\parallel s := y$ ;  $x := s$ 

An optimizing compiler might lift the assignment y := 1 out of the conditional, thus removing the control dependency.

Prominent solutions to the problem of dependency calculation include:

- syntactic methods used in hardware models such as ARM or x86-TSO [Alglave et al. 2014],
- speculative execution methods (which give a semantics based on multiple executions of the same program) such as the Java Memory Model [Manson et al. 2005] and related models [Chakraborty and Vafeiadis 2019; Jagadeesan et al. 2010a; Kang et al. 2017],

- rewriting methods, which give an operational model up to syntactic rewrites, such as [Pichon-Pharabod and Sewell 2016], and
- logical methods, such as the pomsets with preconditions model of [Jagadeesan et al. 2020].

In this paper, we will focus on logical models, as those are compositional, and align well with existing models of sequential composition. The heart of the model of [Jagadeesan et al. 2020] is to add logical preconditions to events, which are introduced by store actions (modeling data dependencies) and conditionals (modeling control dependencies):

$$if(s<1)\{z:=r*s\}$$

$$(s<1) \land (r*s)=0 \mid Wz0$$

Preconditions are discharged by being ordered after a read:

$$r:=x; s:=y; \text{if}(s<1)\{z:=r*s\}$$

$$(Rx0) \qquad (Ry0) \longrightarrow ((s=0) \Rightarrow (s<1) \land (r*s)=0 \mid Wz0)$$

Note that there is dependency order from (Ry0) to (Wz0) so the precondition for (Wz0) only has to be satisfied assuming the hypothesis (s=0). There is no matching order from (Rx0) to (Wz0) which is why we do not assume the hypothesis (r=0). Nonetheless, the precondition on (Wz0) is a tautology, and so can be elided in the diagram:

While existing models of relaxed memory have detailed treatments of parallel composition, they often give sequential composition little attention, either ignoring it altogether, or treating it operationally with its usual small-step semantics. This paper investigates how existing models of sequential composition interact with relaxed memory.

# Sequential composition

Our approach follows that of weakest precondition semantics of Dijkstra [1975], which provides an alternative characterization of Hoare logic [Hoare 1969] by mapping postconditions to preconditions. We recall the definition of  $wp_s(\psi)$  for loop-free code below.

- $\bullet \ wp_{\rm skip}(\psi) = \psi$
- $wp_{\mathsf{abort}}(\psi) = \mathsf{ff}$
- $wp_{r:=M}(\psi) = \psi[M/r]$
- $wp_{S_1:S_2}(\psi) = wp_{S_1}(wp_{S_2}(\psi))$

• 
$$wp_{\text{if}(M)\{S_1\} \text{ else } \{S_2\}}(\psi) = ((M \neq 0) \Rightarrow wp_{S_1}(\psi)) \land ((M=0) \Rightarrow wp_{S_2}(\psi))$$

The rule we are most interested in is the one for sequential composition, which maps sequential composition of programs to function composition of predicate transformers.

Predicate transformers are a good fit to logical models of dependency calculation, since both are concerned with preconditions, and how they are transformed by sequential composition. Our first attempt is to associate a predicate transformer with each pomset. We visualize this in diagrams by showing how  $\psi$  is transformed, for example:

$$r := x \qquad \qquad s := y \qquad \qquad \text{if } (s < 1) \{z := r * s\}$$

$$(s < 1) \land (r * s) = 0 \mid \forall z \neq 0\}$$

$$(s < 1) \land (r * s) = 0 \mid \forall z \neq 0$$

$$(s < 1) \land (r * s) = 0 \mid \forall z \neq 0$$

In the rightmost program above, the write to z affects the shared store, not the local state of the thread, therefore we assign it the identity transformer.

For the sequentially consistent semantics, sequential composition is straightforward: we apply each predicate transformer to the preconditions of subsequent events, and compose the predicate transformers:

$$r := x; s := y; if(s<1)\{z := r*s\}$$

$$(x=0) \rightarrow (x=0) \Rightarrow (x=0) \Rightarrow (x=0) \land (x*s) = 0 \mid Wz0$$

$$(x=0) \Rightarrow (x=0) \Rightarrow \psi$$

This model works for the sequentially consistent case, but needs to be weakened for the relaxed case. The key observation of this paper is that rather than working with one predicate transformer, we should work with a *family* of predicate transformers, indexed by sets of events.

For example, for single-event pomsets, there are two predicate transformers, since there are two subsets of any one-element set. The *independent* transformer is indexed by the empty set, whereas the *dependent* transformer is indexed by the singleton. We visualize this by including more than one transformed predicate, with an edge leading to the dependent one. For example:

$$r:=x \qquad \qquad s:=y \qquad \qquad \text{if}(s<1)\{z:=r*s\}$$

$$\psi \qquad \qquad \psi \qquad \qquad \psi$$

$$(s<1) \land (r*s)=0 \mid Wz0$$

$$(r=0) \Rightarrow \psi \nmid \dot{}$$

$$(s=0) \Rightarrow \psi \nmid \dot{}$$

The model of sequential composition then picks which predicate transformer to apply to an event's precondition by picking the one indexed by all the events before it in causal order.

For example, we can recover the expected semantics for the above example by choosing the predicate transformer which is independent of (Rx0) but dependent on (Ry0), which is the transformer which maps  $\psi$  to  $(s=0) \Rightarrow \psi$ .

$$r := x; s := y; if(s<1)\{z := r*s\}$$

$$(r=0) \Rightarrow (s=0) \Rightarrow \psi$$

$$(x=0) \Rightarrow (s=0) \Rightarrow (s<1) \land (r*s)=0 \mid Wz0$$

$$(r=0) \Rightarrow \psi$$

$$(s=0) \Rightarrow \psi$$

As a sanity check, we can see that sequential composition is associative in this case, since it does not matter whether we associate to the left, with intermediate step:

$$r := x; s := y$$

$$(r=0) \Rightarrow (s=0) \Rightarrow \psi$$

$$(r=0) \Rightarrow \psi \quad \forall (s=0) \Rightarrow \psi$$

$$(r=0) \Rightarrow \psi \quad (s=0) \Rightarrow \psi$$

or to the right, with intermediate step:

This is an instance of a general result that sequential composition forms a monoid, as one would hope.

#### 1.3 Contributions

In this paper, we show how pomsets with preconditions [Jagadeesan et al. 2020] can be combined with predicate transformers [Dijkstra 1975] to create a compositional semantics for sequential composition.

- §3 presents the basic model, with few features required of the logic of preconditions, but a resulting lack of fidelity to exiting models,
- §4 adds a model of *quiescence* to the logic, required to model coherence (accessing *x* has a precondition that *x* is quiescent) and synchronization (a releasing write requires all locations to be quiescent),
- §5 adds the features required for efficient compilation to modern architectures: downgrading some synchronized accesses to relaxed, and removing read-read dependencies, and
- §6 show how to address common litmus tests.

The definitions in this paper have been formalized in Agda.

Because it is closely related, we expect that the memory-model results of [Jagadeesan et al. 2020] apply to our model, including compositional reasoning for temporal safety properties and local DRF-sc as in [Dolan et al. 2018; Dongol et al. 2019]. In §5, we provide an alternative proof strategy for efficient compilation to ARM8, which improves upon that of [Jagadeesan et al. 2020] by using a recent alternative characterization of ARM8.

#### 2 RELATED WORK

- Models with strong dependencies that disallow compiler optimizations and efficient implementation on ARM8: [Boehm and Demsky 2014; Dolan et al. 2018; Jeffrey and Riely 2016; Kavanagh and Brookes 2018; Lahav et al. 2017,?; Lamport 1979].
- Models with weak dependencies that allow some forms of thin-air reasoning: [Chakraborty and Vafeiadis 2019; Jagadeesan et al. 2010b; Kang et al. 2017; Manson et al. 2005]
- Models that do not calculate dependencies: [Batty et al. 2011; Cox 2016; Watt et al. 2020, 2019].

Like us, Kavanagh and Brookes [2018] use pomsets to create a compositional account of sequential composition. However, their model requires a fence after every read on ARM8, and uses syntactic dependencies, thus invalidating many compiler optimizations.

#### 3 MODEL

In this section, we present the mathematical preliminaries for the model (which can be skipped on first reading). We then present the model incrementally, starting with a model built using *partially ordered multisets* (*pomsets*) [Gischer 1988; Plotkin and Pratt 1996], and then adding preconditions and finally predicate transformers.

In later sections, we will discuss extensions to the logic, and to the semantics of load, store and thread initialization, in order to model relaxed memory more faithfully. We stress that these features do *not* change any of the structures of the language: conditionals, parallel composition, and sequential composition are as defined in this section.

#### 3.1 Preliminaries

The syntax is built from

- a set of values V, ranged over by  $v, w, \ell, k$ ,
- a set of *registers*  $\mathcal{R}$ , ranged over by r, s,
- a set of expressions  $\mathcal{M}$ , ranged over by M, N, L.

*Memory references* are tagged values, written [ $\ell$ ]. Let X be the set of memory references, ranged over by x, y, z.

We require that

- values and registers are disjoint,
- values include at least the constants 0 and 1,
- expressions include at least registers and values,
- expressions do *not* include references: M[N/x] = M.

We model the following language.

$$\mu ::= rlx \mid ra \mid sc$$

$$S ::= abort \mid skip \mid r := M \mid r := [L]^{\mu} \mid [L]^{\mu} := M$$

$$\mid S_1 \nleftrightarrow S_2 \mid S_1; S_2 \mid if(M)\{S_1\} else\{S_2\}$$

*Memory modes*,  $\mu$ , are relaxed (rlx), release-acquire (ra), and sequentially consistent (sc). Relaxed mode is the default; we regularly elide it from examples. ra/sc accesses are collectively known as *synchronized accesses*.

Commands, aka statements, S, include memory accesses at a given mode, as well as the usual structural constructs. Following [Ferreira et al. 1996], # denotes parallel composition, preserving thread state on the right after a join. In examples and sublanguages without join, we use the symmetric  $\|$  operator.

The semantics is built from the following.

- a set of events  $\mathcal{E}$ , ranged over by e, d, c, b,
- a set of actions  $\mathcal{A}$ , ranged over by a,
- a set of logical formulae  $\Phi$ , ranged over by  $\phi$ ,  $\psi$ ,  $\theta$ .

Subsets of  $\mathcal{E}$  are ranged over by E, D, C, B.

We require that:

- actions include writes (Wxv) and reads (Rxv),
- formulae include equalities (M=N) and (x=M),
- formulae include quiescence symbols  $Q_{sc}$ ,  $Q_{ro}^x$ ,  $Q_{wo}^x$  (used in §4), the downgrade symbols  $\downarrow^x$  (used in §5.1), and the write symbol W (used in §5.2),
- formulae are closed under negation, conjunction, disjunction, and substitutions [M/r], [M/x], and  $[\phi/s]$  for each symbol s,
- there is an entailment relation \= between formulae,
- $\models$  has the expected semantics for =,  $\neg$ ,  $\land$ ,  $\lor$ ,  $\Rightarrow$  and substitution.

Logical formulae include equations over registers, such as (r=s+1). For use in §6.1, we also include equations over memory references, such as (x=1). Formulae are subject to substitutions; actions are not. We use expressions as formulae, coercing M to  $M\neq 0$ . Equations have precedence over logical operators; thus  $r=v \Rightarrow s>w$  is read  $(r=v) \Rightarrow (s>w)$ . As usual, implication associates to the right; thus  $\phi \Rightarrow \psi \Rightarrow \theta$  is read  $\phi \Rightarrow (\psi \Rightarrow \theta)$ .

We say  $\phi$  implies  $\psi$  if  $\phi \models \psi$ . We say  $\phi$  is a tautology if  $\mathsf{tt} \models \phi$ . We say  $\phi$  is unsatisfiable if  $\phi \models \mathsf{ff}$ . Throughout §3–5 we additionally require that

• each register is assigned at most once in a program.

In §6, we drop this restriction, requiring instead that

- there are registers  $S_{\mathcal{E}} = \{s_e \mid e \in \mathcal{E}\},\$
- registers  $S_{\mathcal{E}}$  do not appear in programs:  $S[N/s_e] = S$ .

Definition 3.1. Reorderability relations. Coherence-after and synchronization.

$$\begin{split} & \ltimes_{\mathsf{ca}} = \{ (\mathsf{W} x, \mathsf{W} y) \mid x \neq y \} \cup \{ (\mathsf{R} x, \mathsf{W} y) \mid x \neq y \} \cup \{ (\mathsf{W} x, \mathsf{R} y) \} \ \cup \ \{ (\mathsf{R} x, \mathsf{R} y) \} \\ & \ltimes_{\mathsf{sync}} = \{ (\mathsf{W}^{\mu}, \mathsf{R}^{\nu}) \mid \mu \neq \mathsf{sc} \vee \nu \neq \mathsf{sc} \} \cup \{ (\mathsf{W}^{\mu}, \mathsf{W}^{\mathsf{rlx}}) \} \cup \{ (\mathsf{F}^{\mathsf{rel}}, \mathsf{R}^{\nu}) \} \\ & \cup \{ (\mathsf{R}^{\mathsf{rlx}}, \mathsf{W}^{\mathsf{rlx}}) \} \cup \{ (\mathsf{R}^{\mathsf{rlx}}, \mathsf{R}^{\nu}) \} \cup \{ (\mathsf{W}^{\mu}, \mathsf{F}^{\mathsf{acq}}) \} \cup \{ (\mathsf{F}^{\mathsf{rel}}, \mathsf{F}^{\mathsf{acq}}) \} \\ & \ltimes = \ltimes_{\mathsf{sync}} \cap \ltimes_{\mathsf{ca}} \end{split}$$

#### 3.2 Model

*Definition 3.2.* A pomset with predicate transformers over  $\mathcal{A}$  is a tuple  $(E, \lambda, \kappa, \tau, \mathsf{rf}, \leq)$  where

- (1)  $E \subset \mathcal{E}$  is a set of events,
- (2)  $\lambda : E \to \mathcal{A}$  defines a *label* for each event,
- (3)  $\kappa : E \to \Phi$  defines a *precondition* for each event,
- (4)  $\tau: 2^{\mathcal{E}} \to \Phi \to \Phi$  defines a predicate transformer for each set of events,
- (5) rf  $\subseteq \{(d, e) \mid \lambda(e) = (\mathsf{W}xv) \text{ and } \lambda(d) = (\mathsf{R}xv)\}$ , is a relation capturing reads-from,
- (6)  $\leq \subseteq (E \times E)$ , is a partial order capturing *causality*.

*Definition 3.3.* If  $P \in PAR(\mathcal{P}_1, \mathcal{P}_2)$  then  $(\exists P_1 \in \mathcal{P}_1)$   $(\exists P_2 \in \mathcal{P}_2)$ 

(1)  $E = (E_1 \cup E_2)$ ,

(5)  $E_1$  and  $E_2$  are disjoint.

(2)  $\lambda = (\lambda_1 \cup \lambda_2)$ ,

- (6) if  $e \in E_1$  then  $\kappa(e)$  implies  $\kappa_1(e)$ ,
- $(3) \leq \supseteq (\leq_1 \cup \leq_2),$
- (7) if  $e \in E_2$  then  $\kappa(e)$  implies  $\kappa_2(e)$ ,
- (4)  $(\mathsf{rf}_1 \cup \mathsf{rf}_2) \subseteq \mathsf{rf} \subseteq (\mathsf{rf}_1 \cup \mathsf{rf}_2 \cup \leq),$
- (8)  $\tau^D(\psi)$  implies  $\tau_2^D(\psi)$ .

If  $P \in IF(\phi, \mathcal{P}_1, \mathcal{P}_2)$  then  $(\exists P_1 \in \mathcal{P}_1) \ (\exists P_2 \in \mathcal{P}_2)$ 

(1-4) as for PAR,

- (4)  $rf = (rf_1 \cup rf_2),$
- (5) if  $e \in E_1 \setminus E_2$  then  $\kappa(e)$  implies  $\phi \wedge \kappa_1(e)$ ,
- (6) if  $e \in E_2 \setminus E_1$  then  $\kappa(e)$  implies  $\neg \phi \wedge \kappa_2(e)$ ,
- (7) if  $e \in E_1 \cap E_2$  then  $\kappa(e)$  implies  $(\phi \Rightarrow \kappa_1(e)) \wedge (\neg \phi \Rightarrow \kappa_2(e))$ ,
- (8)  $\tau^D(\psi)$  implies  $(\phi \Rightarrow \tau_1^D(\psi)) \land (\neg \phi \Rightarrow \tau_2^D(\psi))$ .

If  $P \in SEO(\mathcal{P}_1, \mathcal{P}_2)$  then  $(\exists P_1 \in \mathcal{P}_1)$   $(\exists P_2 \in \mathcal{P}_2)$ 

(1-4) as for PAR,

- (4)  $\mathsf{rf} \subseteq (\mathsf{rf}_1 \cup \mathsf{rf}_2 \cup (E_1 \times E_2)),$
- (5) if  $c \in E_1$  and  $(d, e) \in (rf \cap (E_1 \times E_2))$  then either  $\lambda(c) \ltimes_{\mathsf{ca}} \lambda(d)$  or  $c \leq d$ ,
- (6) if  $e \in E_1 \setminus E_2$  then  $\kappa(e)$  implies  $\kappa_1(e)$ ,
- (7) if  $e \in E_2 \setminus E_1$  then  $\kappa(e)$  implies  $\kappa'_2(e)$ ,
- (8) if  $e \in E_1 \cap E_2$  then  $\kappa(e)$  implies  $\kappa_1(e) \vee \kappa_2'(e)$ , where  $\kappa_2'(e) = \tau_1^{\downarrow e}(\kappa_2(e))$ , where  $\downarrow e = \{c \mid c < e\}$  if  $\lambda(e)$  is a write, and  $\downarrow e = E_1$ , otherwise,
- (9)  $\tau^D(\psi)$  implies  $\tau_1^D(\tau_2^D(\psi))$ ,
- (10) if  $d \in E_1$  and  $e \in E_2$  then either  $d \le e$  or  $\lambda_1(d) \ltimes \lambda_2(e)$ .

If  $P \in LET(r, M)$  then  $E = \emptyset$  and  $\tau^D(\psi)$  implies  $\psi[M/r]$ .

If  $P \in SKIP$  then  $E = \emptyset$  and  $\tau^D(\psi)$  implies  $\psi$ .

If  $P \in FENCE(\mu)$  then

- (1)  $E \neq \emptyset$  and if  $d, e \in E$  then d = e,
- (2)  $\lambda(e) = F^{\mu}$ ,
- (3)  $\tau^D(\psi)$  implies  $\psi$ .

```
If P \in STORE(x, M, \mu) then (\exists v \in V)

(1) E \neq \emptyset and if d, e \in E then d = e,

(2) \lambda(e) = \mathbb{W} x v,

(3) \kappa(e) implies M = v,

(4) (\forall e \in E \cap D) \tau^D(\psi) implies \psi \wedge M = v,

(5) (\forall e \in E \setminus D) \tau^D(\psi) implies \psi.

If P \in LOAD(r, x, \mu) then (\exists v \in V)

(1) E \neq \emptyset and if d, e \in E then d = e,

(2) \lambda(e) = \mathbb{R} x v

(3) \kappa(e) implies t,

(4) (\forall e \in E \cap D) \tau^D(\psi) implies v = r \Rightarrow \psi,

(5) (\forall e \in E \setminus D) \tau^D(\psi) implies \psi.
```

Full versions:

If  $P \in STORE(x, M, \mu)$  then  $(\exists v : E \to V)$   $(\exists \theta : E \to \Phi)$ 

- (1)  $\bigvee_{e \in E} \theta_e$  is a tautology and if  $\theta_d \wedge \theta_e$  is satisfiable then d = e,
- (2)  $\lambda(e) = Wxv_e$ ,
- (3)  $\kappa(e)$  implies  $\theta_e \wedge M = v_e$ ,
- (4)  $(\forall e \in E \cap D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow (\psi[M/x] \land M = v_e)$ ,
- (5)  $(\forall e \in E \setminus C) \tau^C(\psi)$  implies  $\theta_e \Rightarrow \psi[M/x]$ .

If  $P \in LOAD(r, x, \mu)$  then  $(\exists v : E \to V)$   $(\exists \theta : E \to \Phi)$ 

- (1)  $\bigvee_{e \in E} \theta_e$  is a tautology and if  $\theta_d \wedge \theta_e$  is satisfiable then d = e,
- (2)  $\lambda(e) = Rxv_e$
- (3)  $\kappa(e)$  implies  $\theta_e$ ,
- (4)  $(\forall e \in E \cap D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow v_e = s_e \Rightarrow \psi[s_e/r]$ ,
- (5)  $(\forall e \in E \setminus C) \ \tau^C(\psi) \text{ implies } \theta_e \Rightarrow (v_e = s_e \lor x = s_e) \Rightarrow \psi[s_e/r].$

#### 3.3 Pomsets

We first consider a fragment of our language that can be modeled using simple pomsets. This captures read and write actions which may be reordered, but as we shall see does *not* capture control or data dependencies.

*Definition 3.4.* A *pomset* over  $\mathcal{A}$  is a tuple  $(E, \leq, \lambda)$  where

- $E \subset \mathcal{E}$  is a set of *events*,
- $\leq \subseteq (E \times E)$  is the *causality* partial order,
- $\lambda: E \to \mathcal{A}$  is a labeling.

Let P range over pomsets, and  $\mathcal{P}$  over sets of pomsets. Let Pom be the set of all pomsets.

We lift terminology from actions to events. For example, we say that e writes x if  $\lambda(e)$  writes x. We also drop quantifiers when clear from context, such as  $(\forall e \in E)(\forall x \in X)$ .

Definition 3.5. Action (Wxv) matches (Rxw) when v = w. Action (Wxv) blocks (Rxw), for any v, w.

A read event e is *fulfilled* if there is a  $d \le e$  which matches it and, for any e which can block e, either e or e if e or e if e is a e if e in e is a e in e in

We introduce reorderability [Mazurkiewicz 1995] in order to provide examples with coherence in this subsection. In §4 we show that coherence can be encoded in the logic, making reorderability unnecessary.

*Definition 3.6.* Actions *a* and *b* are *reorderable* ( $a \bowtie b$ ) if either both are reads or they are accesses to different locations. Formally  $\bowtie = \{(Rxv, Ryw)\} \cup \{(Rxv, Wyw), (Wxv, Ryw), (Wxv, Wyw) \mid x \neq y\}.$ 

Actions that are not reorderable are in conflict.

We can now define a model of processes given as sets of pomsets sufficient to give the semantics for a fragment of our language without control or data dependencies.

*Definition 3.7.* If  $P \in NIL$  then  $E = \emptyset$ .

If  $P \in PAR(\mathcal{P}_1, \mathcal{P}_2)$  then  $(\exists P_1 \in \mathcal{P}_1)$   $(\exists P_2 \in \mathcal{P}_2)$ 

- (1)  $E = (E_1 \cup E_2)$ ,
- (2) if  $d \leq_1 e$  then  $d \leq e$ ,
- (3) if  $d \leq_2 e$  then  $d \leq e$ ,
- (4) if  $e \in E_1$  then  $\lambda(e) = \lambda_1(e)$ ,
- (5) if  $e \in E_2$  then  $\lambda(e) = \lambda_2(e)$ ,
- (6)  $E_1$  and  $E_2$  are disjoint.

If  $P \in (a \to \mathcal{P}_2)$  then  $(\exists E_1) \ (\exists P_2 \in \mathcal{P}_2)$ 

- (1)  $E = (E_1 \cup E_2)$ ,
- (2) if  $d, e \in E_1$  then d = e,
- (3) if  $d \leq_2 e$  then  $d \leq e$ ,
- (4) if  $e \in E_1$  then  $\lambda(e) = a$ ,
- (5) if  $e \in E_2$  then  $\lambda(e) = \lambda_2(e)$ ,
- (6) if  $d \in E_1$ ,  $e \in E_2$  then either  $d \le e$  or  $a \bowtie \lambda_2(e)$ .

If  $P \in TOP(\mathcal{P})$  then  $(\exists P_1 \in \mathcal{P})$ 

- (1)  $E = E_1$ ,
- (2)  $\lambda(e) = \lambda_1(e)$ ,
- (3) if  $d \leq_1 e$  then  $d \leq e$ ,
- (4) if  $\lambda_1(e)$  is a read then e is fulfilled (Def 3.5).

Definition 3.8. For a language fragment, the semantics is:

In this semantics, both skip and 0 map to the empty pomset. Parallel composition is disjoint union, inheriting labeling and order from the two sides. Prefixing may add a new action (on the left) to an existing pomset (on the right), inheriting labeling and order from the right.

It is worth noting that if  $\bowtie$  is taken to be the empty relation, then top-level pomsets of Def 3.4 correspond to sequentially consistent executions up to mumbling [Brookes 1996].

*Example 3.9.* Mumbling is allowed, since there is no requirement that left and right be disjoint in the definition of prefixing. Both of the pomsets below are allowed.

$$x := 1; x := 1$$
  $x := 1; x := 1$ 

$$(Wx1) \rightarrow (Wx1)$$

In the left pomset, the order between the events is enforced by clause 6, since the actions are in conflict.

*Example 3.10.* Although this model enforces coherence, it is very weak. For example, it makes no distinction between synchronizing and relaxed access, thus allowing:

$$x := 0$$
;  $x := 1$ ;  $y^{ra} := 1 \parallel r := y^{ra}$ ;  $s := x$ 
 $was y = 0$ 
 $was y = 0$ 

We show how to enforce the intended semantics, where (Wy1) publishes (Wx1) in Ex 4.3.

In diagrams, we use different shapes and colors for arrows and events. These are included only to help the reader understand why order is included. We adopt the following conventions (dependency and synchronization order will appear later in the paper):

- relaxed accesses are blue, with a single border,
- synchronized accesses are red, with a double border,
- $e \rightarrow d$  arises from fulfillment, where e matches d,
- $e \rightarrow d$  arises either from fulfillment, where e blocks d, or from prefixing, where e was prefixed before d and their actions conflict,
- $e \rightarrow d$  arises from control/data/address dependency,
- $e \rightarrow d$  arises from synchronized access.

Definition 3.11.  $\mathcal{P}_1$  refines  $\mathcal{P}_2$  if  $\mathcal{P}_1 \subseteq \mathcal{P}_2$ .

*Example 3.12.* Ex 3.9 shows that [x := 1] refines [x := 1; x := 1].

#### 3.4 Pomsets with Preconditions

The previous section modeled a language fragment without conditionals (and hence no control dependencies) or expressions (and hence no data dependencies). We now address this, by adopting a *pomsets with preconditions* model similar to [Jagadeesan et al. 2020].

Definition 3.13. A point with preconditions is a point (Def 3.4) together with  $\kappa: E \to \Phi$ .

*Definition 3.14.* Let  $\lceil \phi/Q \rceil$  substitute all quiescence symbols by  $\phi$ .

We can now define a model of processes given as sets of pomsets with preconditions sufficient to give the semantics for a fragment of our language where every use of sequential composition is either  $(x^{\mu} := M; S)$  or  $(r := x^{\mu}; S)$ .

```
Definition 3.15. If P \in NIL then E = \emptyset.

If P \in PAR(\mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)

1-6) as for PAR in Def 3.7,

(7) if e \in E_1 then \kappa(e) implies \kappa_1(e),

(8) if e \in E_2 then \kappa(e) implies \kappa_2(e).

If P \in IF(\phi, \mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)

1-5) as for PAR in Def 3.7 (ignoring disjointness),

(6) if e \in E_1 \setminus E_2 then \kappa(e) implies \phi \wedge \kappa_1(e),

(7) if e \in E_2 \setminus E_1 then \kappa(e) implies \neg \phi \wedge \kappa_2(e),

(8) if e \in E_1 \cap E_2 then \kappa(e) implies (\phi \Rightarrow \kappa_1(e)) \wedge (\neg \phi \Rightarrow \kappa_2(e)).

If P \in ST(x, M, \mathcal{P}_2) then (\exists P_2 \in \mathcal{P}_2) (\exists v \in \mathcal{V})

1-6) as for (\mathbb{V}xv) \to \mathcal{P}_2 in Def 3.7,

(7) if e \in E_1 \setminus E_2 then \kappa(e) implies M=v,
```

(8) if  $e \in E_2 \setminus E_1$  then  $\kappa(e)$  implies  $\kappa_2(e)$ ,

(9) if  $e \in E_1 \cap E_2$  then  $\kappa(e)$  implies  $M=v \vee \kappa_2(e)$ . If  $P \in LD(r, x, \mathcal{P}_2)$  then  $(\exists P_2 \in \mathcal{P}_2)$   $(\exists v \in \mathcal{V})$ 1-6) as for  $(\mathbb{R}xv) \to \mathcal{P}_2$  in Def 3.7, (7) if  $e \in E_2 \setminus E_1$  then either  $\kappa(e)$  implies  $r=v \Rightarrow \kappa_2(e)$  and  $(\exists d \in E_1)$  d < e, or  $\kappa(e)$  implies  $\kappa_2(e)$ .

If  $P \in TOP(\mathcal{P})$  then  $(\exists P_1 \in \mathcal{P})$ 

- 1-4) as for TOP in Def 3.7,
  - (5) if  $\lambda_1(e)$  is a write,  $\kappa_1(e)[tt/Q][tt/W]$  is a tautology,
  - (6) if  $\lambda_1(e)$  is a read,  $\kappa_1(e)$  [ff/W] is a tautology.

Let PomPre be the set of all pomsets with preconditions. The function  $TOP: 2^{PomPre} \rightarrow 2^{Pom}$  embeds sets of pomsets with preconditions into sets of pomsets. It also substitutes formulae for quiescence and write symbols, for use in §4-5. In these "top-level" pomsets, every read is fulfilled and every precondition is a tautology.

Definition 3.16. For a language fragment, the semantics is:

*Example 3.17.* A simple example of a data dependency is a pomset  $P \in [r := x; y := r]$ , for which there must be an  $v \in \mathcal{V}$  and  $P' \in [y := r]$  such as the following, where v = 1:

$$y := r$$

$$r = 1 \mid W y 1$$

The value chosen for the read may be different from that chosen for the write:

$$r := x; y := r$$

$$(Rx0) \longrightarrow (r=0 \Rightarrow r=1 \mid Wy1)$$

In this case, the pomset's preconditions depend on a bound register, so cannot contribute to a top-level pomset.

If the values chosen for read and write are compatible, then we have two cases: the independent case, which again cannot be part of a top-level pomset,

$$r := x; y := r$$

$$(Rx1) \qquad (r=1 \mid Wy1)$$

and the dependent case:

Since  $r=1 \Rightarrow r=1$  is a tautology, this can be part of a top-level pomset.

*Example 3.18.* Control dependencies are similar, for example for any  $P \in [r := x; if(r)\{y := 1\}]$ , there must be an  $v \in V$  and  $P' \in [if(r)\{y := 1\}]$  such as:

$$if(r)\{y := 1\}$$

$$r \neq 0 \mid Wy1$$

The rest of the reasoning is the same as Ex 3.17.

*Example 3.19.* A simple example of an independency is a pomset  $P \in [r:=x; y:=1]$ , for which there must be:

$$y := 1$$

$$(1=1 \mid W y 1)$$

In this case it doesn't matter what value the read chooses:

$$r := x; y := 1$$

$$(Rx0) \qquad (1=1 \mid Wy1)$$

*Example 3.20.* Consider  $P \in [[if(r=1)\{y:=r\}]]$  else  $\{y:=1\}$ , so there must be  $P_1 \in [[y:=r]]$ , and  $P_2 \in [[y:=1]]$ , such as:

$$y := r$$

$$(r=1 \mid W y 1)$$

$$y := 1$$

$$(1=1 \mid W y 1)$$

Since there is no requirement for disjointness in the semantics of conditionals, we can consider the case where the event *coalesces* from the two pomsets, in which case:

if 
$$(r=1)\{y := r\}$$
 else  $\{y := 1\}$   
 $(r=1 \Rightarrow r=1) \land (r \neq 1 \Rightarrow 1=1) \mid Wy1$ 

Here, the precondition is a tautology, independent of r.

# 3.5 Pomsets with Predicate Transformers

Having reviewed the work we are building on, we now turn to the contribution of this paper, which is a model of *pomsets with predicate transformers*. *Predicate transformers* are functions on formulae which preserve logical structure, providing a natural model of sequential composition.

*Definition 3.21.* A predicate transformer is a function  $\tau:\Phi\to\Phi$  such that

- $\tau(ff)$  is ff,
- $\tau(\psi_1 \wedge \psi_2)$  is  $\tau(\psi_1) \wedge \tau(\psi_2)$ ,
- $\tau(\psi_1 \vee \psi_2)$  is  $\tau(\psi_1) \vee \tau(\psi_2)$ ,
- if  $\phi$  implies  $\psi$ , then  $\tau(\phi)$  implies  $\tau(\psi)$ .

Note that substitutions  $(\tau(\psi) = \psi[M/r])$  and implications on the right  $(\tau(\psi) = \phi \Rightarrow \psi)$  are predicate transformers.

As discussed in §1, predicate transformers suffice for sequentially consistent models, but not relaxed models, where dependency calculation is crucial. For dependency calculation, we use a *family* of predicate transformers, indexed by sets of events. We use  $\tau^D$  as the predicate transformer applied to any event e where if  $d \in D$  then d < e.

Definition 3.22. A family of predicate transformers for E consists of a predicate transformer  $\tau^D$  for each  $D \subseteq \mathcal{E}$ , such that if  $C \cap E \subseteq D$  then  $\tau^C(\psi)$  implies  $\tau^D(\psi)$ .

Definition 3.23. A pomset with predicate transformers is a pomset with preconditions (Def 3.15), together with a family of predicate transformers for E.

Definition 3.24. If  $P \in ABORT$  then  $E = \emptyset$  and

•  $\tau^D(\psi)$  implies ff.

If  $P \in SKIP$  then  $E = \emptyset$  and

•  $\tau^D(\psi)$  implies  $\psi$ .

```
If P \in LET(r, M) then E = \emptyset and
      • \tau^D(\psi) implies \psi[M/r].
If P \in IF(\phi, \mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)
 1-8) as for IF in Def 3.15,
   (9) \tau^D(\psi) implies (\phi \Rightarrow \tau_1^D(\psi)) \land (\neg \phi \Rightarrow \tau_2^D(\psi)).
If P \in PAR(\mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)
 1-8) as for PAR in Def 3.15,
   (9) \tau^D(\psi) implies \tau_2^D(\psi),
 (10) \tau^D(s) implies \tau_1^{\tilde{D}}(s), for every quiescence symbol s.
If P \in SEQ(\mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) \ (\exists P_2 \in \mathcal{P}_2)
 1-5) as for PAR in Def 3.7 (ignoring disjointness),
   (6) if e \in E_1 \setminus E_2 then \kappa(e) implies \kappa_1(e),
   (7) if e \in E_2 \setminus E_1 then \kappa(e) implies \kappa'_2(e),
   (8) if e \in E_1 \cap E_2 then \kappa(e) implies \kappa_1(e) \vee \kappa_2'(e),
         where \kappa_2'(e) = \tau_1^C(\kappa_2(e)), where C = \{c \mid c < e\},
   (9) \tau^D(\psi) implies \tau_1^D(\tau_2^D(\psi)).
If P \in STORE(x, M, \mu) then (\exists v \in \mathcal{V})
   S1) if d, e \in E then d = e,
   S2) \lambda(e) = Wxv,
   S3) \kappa(e) implies M=v,
   S4) \tau^D(\psi) implies \psi \wedge M=v,
   S5) \tau^{C}(\psi) implies \psi,
         where D \cap E \neq \emptyset and C \cap E = \emptyset.
If P \in LOAD(r, x, \mu) then (\exists v \in \mathcal{V})
  L1) if d, e \in E then d = e,
  L2) \lambda(e) = Rxv,
  L3) \kappa(e) implies tt,
  L4) \tau^D(\psi) implies v=r \Rightarrow \psi,
  L5) \tau^{C}(\psi) implies \psi,
         where D \cap E \neq \emptyset and C \cap E = \emptyset,
If P \in TOP(\mathcal{P}) then (\exists P_1 \in \mathcal{P})
 1-6) as in Def 3.15,
    (7) \tau^{E_1}(s) implies s, for every quiescence symbol s.
```

Definition 3.25. The semantics of commands is:

Most of these definitions are straightforward adaptations of §3.4, but the treatment of sequential composition is new. This uses the usual rule for composition of predicate transformers (but preserving the indexing set). For the pomset, we take the union of their events, preserving actions, but crucially in cases 7 and 8 we apply a predicate transformer  $\tau_1^C$  from the left-hand side to a precondition  $\kappa_2(e)$  from the right-hand side to build the precondition  $\kappa_2'(e)$ . The indexing set C for the predicate transformer is  $\{c \mid c < e\}$ , so can depend on the causal order.

Example 3.26. For read to write dependency, consider:

Putting these together without order, we calculate the precondition  $\kappa(e)$  as  $\tau_1^C(\kappa_2(e))$ , where C is  $\{c \mid c < e\}$ , which is  $\emptyset$ . Since  $\tau_1^{\emptyset}(\psi)$  is  $\psi$ , this gives that  $\kappa(e)$  is  $\kappa_2(e)$ , which is r=1. This gives the pomsaet with predicate transformers:

$$r := x \; ; \; y := r$$

$$\stackrel{d}{\underset{\vdash}{\left(\mathbb{R}x1\right)}} \stackrel{e}{\underset{\vdash}{\left(r=1\mid \mathsf{W}y1\right)}} \stackrel{e}{\underset{\vdash}{\left(r=1\wedge\psi\right)}} \stackrel{r=1\wedge\psi}{\underset{\vdash}{\left(r=1\wedge\psi\right)}} \stackrel{\psi}{\underset{\vdash}{\left(r=1\wedge\psi\right)}}$$

This pomset's preconditions depend on a bound register, so cannot contribute to a top-level pomset. Putting them together with order, we calculate the precondition  $\kappa(e)$  as  $\tau_1^C(\kappa_2(e))$ , where C is  $\{c \mid c < e\}$ , which is  $\{d\}$ . Since  $\tau_1^{\{d\}}(\psi)$  is  $(1=r \Rightarrow \psi)$ , this gives that  $\kappa(e)$  is  $(1=r \Rightarrow \kappa_2(e))$ , which is  $(1=r \Rightarrow r=1)$ . This gives the pomset with predicate transformers:

This pomset's preconditions do not depend on a bound register, so can contribute to a top-level pomset.

Example 3.27. If the read and write choose different values:

Putting these together with order, we have the following, which cannot be part of a top-level pomset:

$$r := x \; ; \; y := r$$

$$\stackrel{d}{\underset{\longrightarrow}{\left(\mathbb{R}x1\right)}} \underbrace{\mathbb{R}x1} \stackrel{e}{\underset{\longrightarrow}{\left(1=r\Rightarrow r=2\mid Wy2\right)}} \underbrace{\mathbb{I}=r\Rightarrow (r=2\land \psi)} \qquad \boxed{r=2\land \psi} \qquad \boxed{\psi}$$

*Example 3.28.* S4 includes *M=v* to ensure that spurious merges do not go undetected. Consider the following.

$$x := 1$$

$$x := 2$$

$$(1=1 \mid \mathsf{W}x1) \cdots \Rightarrow 1=1 \land \psi \qquad (2=1 \mid \mathsf{W}x1) \cdots \Rightarrow 2=1 \land \psi \qquad (\psi)$$

Merging the actions, since 2=1 is unsatisfiable, we have:

This pomset cannot be part of a top-level pomset, since  $\tau^E(s) = \text{ff}$  for every quiescence symbol s. This is what we would hope: that the program x := 1; x := 2 should only be top-level if there is a (Wx2) event.

*Example 3.29.* The predicate transformer we have chosen for L4 is different from the one used traditionally, which is written using substitution. Substitution is also used in [Jagadeesan et al. 2020]. Attempting to write the predicate transformers in this style we have:

- L4)  $\tau^D(\psi)$  implies  $\psi[v/r]$ ,
- L5)  $\tau^C(\psi)$  implies  $(\forall r)\psi$ .

This phrasing of L5 says that  $\psi$  must be independent of r in order to appear in a top-level pomset. This choice for L5 is forced by Def 3.22, which states that the predicate transformer for a small subset of E must imply the transformer for a larger subset.

Sadly, this definition fails associativity.

Consider the following, eliding transformers:

$$r:=y$$
  $x:=!r$   $x:=!!r$   $x:=0$   $(\mathbb{R}y1)$   $(r=0 \mid Wx1)$   $(r\neq 0 \mid Wx1)$   $(Wx0)$ 

Associating to the right and merging:

$$r := y \qquad \qquad x := !r; \ x := !!r; \ x := 0$$

$$(Ry1) \qquad \qquad (r=0 \lor r\neq 0 \mid Wx1) \rightarrow (Wx0)$$

The precondition of (Wx1) is a tautology, thus we have:

$$r := y ; x := !r ; x := !!r ; x := 0$$

$$(Ry1) \qquad (Wx1) \rightarrow (Wx0)$$

If, instead, we associate to the left:

$$r:=y\;;\;x:=!r$$
  $x:=!!r\;;\;x:=0$  
$$(Ry1) \qquad (1=0\mid Wx1) \qquad (r\neq 0\mid Wx1) \rightarrow (Wx0)$$

Sequencing and merging:

$$r := y ; x := !r ; x := !!r ; x := 0$$

$$(Ry1) \qquad (1=0 \lor r \neq 0 \mid Wx1) \rightarrow (Wx0)$$

In this case, the precondition of (Wx1) is not a tautology, forcing a dependency  $(Ry1) \rightarrow (Wx1)$ . Our solution is to Skolemize. We have proven associativity of Def 3.24 in Agda. The proof re-

quires that predicate transformers distribute through disjunction (Def 3.21). Since universal quantification does not distribute through disjunction, the attempt to define predicate transformers using substitution fails (in particular for L5.)

# 3.6 The Road Ahead

The final semantic functions for load, store, and thread initialization are given in Fig 1, at the end of the paper. In §4–6, we explain this definition by looking at its constituent parts, building on Def 3.24. In §4, we add *quiescence*, which encodes coherence, release-acquire access, and SC access. In §5, we add peculiarities that are necessary for efficient implementation on ARM8. In §6, we discuss other features such as invariant reasoning, case analysis and register recycling.

The final definitions of load and store are quite complex, due to the inherent complexities of relaxed memory. The core of Def 3.24, modeling sequential composition, parallel composition, and conditionals, is stable, remaining unchanged in later sections. The messiness of relaxed memory is quarantined to the rules for load and store, rather than permeating the entire semantics.

#### 4 OUIESCENCE

We introduce *quiescence*, which captures *coherence* and *synchronized access*. Recall from §3.1 that formulae include symbols  $Q_{sc}$ ,  $Q_{ro}^x$ , and  $Q_{wo}^x$ . We refer to these collectively as *quiescence symbols*. In this section, we will show how these logical symbols can be used to capture coherence and synchronization. This illustrates a feature of our model, which is that many features of weak memory can be cautured in the logic, not in the pomset model itself.

# 4.1 Coherence (co)

In the logic, the quiescence symbols are just uninterpreted formula, but the semantics uses them as preconditions, to ensure appropriate causal order. For example, *write-write coherence* enforces order between writes to the same location in the same thread. We model this by adding the precondition  $(Q_{ro}^x \wedge Q_{wo}^x)$  to events that write to x, for example:

$$x := 1; x := 2$$

$$(1=1 \land Q_{ro}^x \land Q_{wo}^x \mid Wx1) \rightarrow (2=2 \land Q_{ro}^x \land Q_{wo}^x \mid Wx2)$$

These symbols are left alone in the dependent case, but in the independent case we substitute ff for  $Q_{wo}^x$ :

$$\boxed{1=1 \land Q_{\mathsf{ro}}^{x} \land Q_{\mathsf{wo}}^{x} \mid \mathsf{W}x1} \qquad \boxed{2=2 \land Q_{\mathsf{ro}}^{x} \land \mathsf{ff} \mid \mathsf{W}x2}$$

This substitution is part of the predicate transformer:

We treat read-write and write-read coherence similarly:

$$r := x$$

$$Q_{\text{wo}}^{x} \mid \mathsf{R}x1 \rangle \cdots \rangle r = 1 \Rightarrow \psi \qquad \left[ \psi \left[ \mathsf{ff} / Q_{\mathsf{ro}}^{x} \right] \right]$$

In this model, there is no read-read coherence, but to restore it we would identify  $Q_{ro}^x$  with  $Q_{wo}^x$ . When threads are initialized, we substitute every quiesence symbol with tt, so at top level there are no remaining quiescence symbols, for example:

$$x := 1; x := 2 \parallel r := x$$

$$\underbrace{1 = 1 \land \text{tt} \land \text{tt} \mid \text{W}x1}_{\text{tt} \mid \text{R}x1} - \cdots - \underbrace{2 = 2 \land \text{tt} \land \text{tt} \mid \text{W}x2}_{\text{tt}}$$

Definition 4.1 (co). Update Def 3.24:

- S3)  $\kappa(e)$  implies  $Q_{ro}^x \wedge Q_{wo}^x \wedge M=v$ ,
- L3)  $\kappa(e)$  implies  $Q_{wo}^{x}$ ,
- S5)  $\tau^{C}(\psi)$  implies  $\psi[ff/Q_{w_0}^x]$ ,
- L5)  $\tau^{C}(\psi)$  implies  $\psi[ff/Q_{r_0}^x]$ .

Example 4.2. Def 4.1 enforces coherence. Consider:



Simplifying, we have:

$$\begin{array}{c|c} Q_{\mathsf{ro}}^{x} \wedge Q_{\mathsf{wo}}^{x} \mid \mathsf{W}x1 ) & \qquad & Q_{\mathsf{ro}}^{x} \wedge Q_{\mathsf{wo}}^{x} \mid \mathsf{W}x2 ) \\ \hline \psi [\mathsf{ff}/Q_{\mathsf{wo}}^{x}] & \psi & \qquad & \psi [\mathsf{ff}/Q_{\mathsf{wo}}^{x}] \\ \end{array}$$

If we attempt to put these together unordered, the precondition of (Wx2) becomes unsatisfiable:

$$x := 1; x := 2$$

$$Q_{ro}^{x} \wedge Q_{wo}^{x} \mid Wx1$$

$$\psi[ff/Q_{wo}^{x}] \qquad \psi[ff/Q_{wo}^{x}] \qquad \psi[ff/Q_{wo}^{x}]$$

In order to get a satisfiable precondition for (Wx2), we must introduce order:

# Synchronized Access (SYNC)

Example 4.3. The publication idiom requires that we disallow the execution below, which is allowed by Def 4.1.

$$x := 0; x := 1; y^{ra} := 1 \parallel r := y^{ra}; s := x$$

$$(Wx0) \longrightarrow (Wx1) \qquad (Ry1) \longrightarrow (Rx0)$$

We disallow this by introducing order  $(Wx1) \rightarrow (Wy1)$  and  $(Ry1) \rightarrow (Rx0)$ .

$$(Wx0) \rightarrow (Wx1) \rightarrow (Ry1) \rightarrow (Rx0)$$

Definition 4.4. Let  $Q_{ro}^* = \bigwedge_y Q_{ro}^y$ , and similarly for  $Q_{wo}^*$ . Let formulae  $Q_{\mu}^{Sx}$  and  $Q_{\mu}^{Lx}$  be defined:

$$\begin{array}{lll} Q_{r|x}^{Sx} = Q_{ro}^x \wedge Q_{wo}^x & Q_{r|x}^{Lx} = Q_{wo}^x \\ Q_{ra}^{Sx} = Q_{ro}^* \wedge Q_{wo}^* & Q_{ra}^{Lx} = Q_{wo}^x \\ Q_{sc}^{Sx} = Q_{ro}^* \wedge Q_{wo}^* \wedge Q_{sc} & Q_{sc}^{Lx} = Q_{wo}^x \wedge Q_{sc} \end{array}$$

Let  $[\phi/Q_{ro}^*]$  substitute  $\phi$  for every  $Q_{ro}^y$ , and similarly for  $Q_{wo}^*$ . Let substitutions  $[\phi/Q_{\mu}^{Sx}]$  and  $[\phi/Q_{\mu}^{Lx}]$  be defined:

$$\begin{split} [\phi/Q_{\text{rlx}}^{\text{S}x}] &= [\phi/Q_{\text{wo}}^x] \\ [\phi/Q_{\text{ra}}^{\text{S}x}] &= [\phi/Q_{\text{wo}}^x] \\ [\phi/Q_{\text{ra}}^{\text{S}x}] &= [\phi/Q_{\text{wo}}^x] \\ [\phi/Q_{\text{sc}}^{\text{S}x}] &= [\phi/Q_{\text{wo}}^x, \phi/Q_{\text{sc}}] \\ [\phi/Q_{\text{sc}}^{\text{L}x}] &= [\phi/Q_{\text{ro}}^x, \phi/Q_{\text{wo}}^x] \\ [\phi/Q_{\text{sc}}^{\text{L}x}] &= [\phi/Q_{\text{ro}}^x, \phi/Q_{\text{wo}}^x, \phi/Q_{\text{sc}}] \end{split}$$

Definition 4.5 (CO/SYNC). Update Def 3.24 to:

- S3)  $\kappa(e)$  implies  $Q_{\mu}^{Sx} \wedge M=v$ , L3)  $\kappa(e)$  implies  $Q_{\mu}^{Lx}$ ,

- S5)  $\tau^C(\psi)$  implies  $\psi[\text{ff}/Q_{\mu}^{Sx}]$ , L5)  $\tau^C(\psi)$  implies  $\psi[\text{ff}/Q_{\mu}^{Lx}]$ .

The quiescence formulae indicate what must precede an event. For example, all preceding accesses must be ordered before a releasing write, whereas only writes on x must be ordered before a releasing read on *x*.

The quiescence substitutions update quiescence symbols in subsequent code. For subsequent independent code, S5 and L5 substitute false. In top-level pomsets, TOP substitutes true (Def 3.15). For example, we substitute ff for  $Q_{ra}^{Sx}$  in the independent case for a releasing write; this ensures that subsequent writes to x follow the releasing write in top-level pomsets. Similarly, we substitute ff for  $Q_{ra}^{Lx}$  in the independent case for an acquiring write; this ensures that all subsequent accesses follow the acquiring read in top-level pomsets.

Example 4.6. The following pomsets show the effect of quiescence for each access mode.



Example 4.7. Def 4.1 enforces publication. Consider:

$$x := 1$$

$$1 = 1 \land Q_{\text{rlx}}^{Sx} \mid Wx1$$

$$1 = 1 \land \psi \land \cdots$$

$$\psi[\text{ff}/Q_{\text{rlx}}^{Sy}]$$

$$\psi[\text{ff}/Q_{\text{rlx}}^{Sy}]$$

$$\psi[\text{ff}/Q_{\text{rlx}}^{Sy}]$$

Since  $Q_{ra}^{Sy}[ff/Q_{rlx}^{Sx}]$  is ff, composing these without order simplifies to:

In order to get a satisfiable precondition for (Wy1), we must introduce order:



Example 4.8. Def 4.1 enforces subscription. Consider:



Since  $Q_{rlx}^{Lx}[ff/Q_{ra}^{Ly}]$  is ff, composing these without order simplifies to:

$$r := y^{\mathsf{rd}} \; ; \; s := x$$

$$\psi[\mathsf{ff}/\mathsf{Q}_{\mathsf{ra}}^{\mathsf{L}y}][\mathsf{ff}/\mathsf{Q}_{\mathsf{rk}}^{\mathsf{L}x}] \qquad \qquad (\mathsf{ff} \mid \mathsf{R}x1)$$

$$r := y^{\mathsf{rd}} \; ; \; s := x$$

$$\psi[\mathsf{ff}/\mathsf{Q}_{\mathsf{ra}}^{\mathsf{L}x}] \qquad (\mathsf{ff} \mid \mathsf{R}x1)$$

$$r := y^{\mathsf{rd}} \; ; \; s := x$$

$$\psi[\mathsf{ff}/\mathsf{Q}_{\mathsf{ra}}^{\mathsf{L}x}] \qquad (\mathsf{ff} \mid \mathsf{R}x1)$$

$$r := y^{\mathsf{rd}} \; ; \; s := x$$

$$\psi[\mathsf{ff}/\mathsf{Q}_{\mathsf{ra}}^{\mathsf{L}x}] \qquad (\mathsf{ff} \mid \mathsf{R}x1)$$

In order to get a satisfiable precondition for (Rx1), we must introduce order:

$$\begin{array}{c|c} \hline \mathbb{Q}_{\mathsf{ra}}^{\mathsf{L}y} \mid \mathsf{R}\,y1 \\ \hline & \psi[\mathsf{ff}/\mathbb{Q}_{\mathsf{ra}}^{\mathsf{L}y}][\mathsf{ff}/\mathbb{Q}_{\mathsf{rlx}}^{\mathsf{L}x}] \\ \hline & \\ \hline &$$

# 5 EFFICIENT IMPLEMENTATION ON ARMV8

We discuss ARM8 using *external global completion* (EGC) [Alglave 2020] [Arm Limited 2020, §B2.3.6] which is very close to our model.

Example 5.1. Bad example:

$$r := x$$
;  $x := 2$ ;  $s := x$ ;  $y := s-1 \parallel r := y$ ;  $x := r$ 

$$(Rx1) \rightarrow (Wx2) \rightarrow (Rx2) \rightarrow (Wy1) \rightarrow (Ry1) \rightarrow (Wx1)$$

Example 5.2. Should not wipe out local state only when there is a release-acquire pair, as in the definition of [Jagadeesan et al. 2020], which allows the following execution. Even though there is a blocker (Wx3), here we are using (Wx0) to justify the lack of dependence (Rx4)  $\rightarrow$  (Ww4).

$$x := 0$$
; if  $(x) \{ y := 2 \}$ ; if  $(z^{ra}) \{ s := x \}$ ; if  $(s \text{ even}) \{ w := 4 \} \} \| x := 2 \}$ ; if  $(y^{ra}) \{ x := 3 \} z^{ra} := 1 \}$ ;  $x := w$ 

$$(Wx0)$$
  $(Rx2)$   $(Wy1)$   $(Rx4)$   $(Ww4)$   $(Wx2)$   $(Ry1)$   $(Rw4)$   $(Wx4)$ 

(The read-clobbering semantics from §B.4 does not change this.)

It does seem to me like a non-local blocker for program-order must include a read of x or a release (to get order from the write) and an acquire (to get the order to the read).

What if an acquire preceded by a release or non-local read on x wiped the local state of x? The problem with this kind of solution (and P6 in ooplsa specifically) is that it destroys roach-motel.

# 5.1 Downgraded Reads (DGR)

*Example 5.3.* The following example is from Podkopaev [Podkopaev 2020]. It is allowed by ARM8, but disallowed by Def 4.5. The coherence order between the writes can be witnessed by a separate thread, which we have elided.

$$x := 2; r := x^{ra}; y := 1 \parallel y := 2; x^{ra} := 1$$
 $(wx2) \longrightarrow (wy1) \longrightarrow (wy2) \longrightarrow (wx1)$ 

Under EGC, this is explained by dropping the order  $(Rx2) \rightarrow (Wy1)$ , because (Rx2) is fulfilled by a relaxed write in the same thread.

$$(Wx2)$$
  $(Wy1)$   $- (Wy2)$   $(Wx1)$ 

More generally, this can be understood as a compiler optimization that downgrades a read from ra to rlx when it can be fulfilled by a relaxed write in the same thread.

To model such downgraded reads, we use the uninterpreted symbols  $\downarrow^x$ .

*Definition 5.4.* Let  $[\phi/\downarrow^*]$  substitute  $\phi$  for every  $\downarrow^y$ . Let formula  $\downarrow_{\mu}^{x}$  and substitution  $[\mu/\downarrow^{x}]$  be defined:

$$\downarrow^x_{\mu} = \begin{cases} \downarrow^x & \text{if } \mu \neq \text{rlx} \\ \text{tt} & \text{otherwise} \end{cases} \qquad [\mu/\downarrow^x] = \begin{cases} [\text{tt}/\downarrow^x] & \text{if } \mu = \text{rlx} \\ [\text{tt}/\downarrow^x][\text{ff}/\downarrow^x] & \text{otherwise} \end{cases}$$

Definition 5.5 (CO/SYNC/DGR). Update Def 4.5 to:

- S4)  $\tau^D(\psi)$  implies  $\psi[\mu/\downarrow^x] \land M=v$ ,
- S5)  $\tau^{C}(\psi)$  implies  $\psi[\mu/\downarrow^{x}][ff/Q_{\mu}^{Sx}]$ , L5)  $\tau^{C}(\psi)$  implies  $\downarrow^{x}_{\mu} \wedge \psi[ff/Q_{\mu}^{Lx}]$ .

Load actions that require downgrading introduce  $\downarrow^x$ . Relaxed stores on x substitute true for  $\downarrow^x$ , whereas synchronizing stores substitute false for  $\downarrow^x$ .

Example 5.6. Revisiting Ex 5.3 and eliding irrelevant transformers:

Associating right:

$$\begin{array}{c} x := 2 \\ (\mathbb{W}x2) \nearrow \psi[\mathsf{tt}/\downarrow^x] \end{array} \qquad \begin{array}{c} r := x^\mathsf{ra}\,;\; y := 1 \\ (\mathbb{R}x2) & (\mathbb{I}^x \mid \mathbb{W}y1) \end{array}$$

Composing, we have, as desired:

$$x := 2$$
;  $r := x^{ra}$ ;  $y := 1$ 

$$\boxed{Wx2} \longrightarrow \boxed{Rx2} \qquad \boxed{Wy1}$$

Example 5.7. One might worry that our model is too permissive for sc access, but ARM8 itself allows some very counterintuitive results for sc access. In the following execution we elide the initializing write (Wy0).

if 
$$(x)\{x := 2\}$$
;  $r := x^{sc}$ ;  $s := y^{sc} \parallel y^{sc} := 2$ ;  $x^{sc} := 1$ 
 $(x)\{x := 2\}$ ;  $x := x^{sc}$ ;  $x := y^{sc} \parallel y^{sc}$ ;  $x := y^{sc} \parallel y^{sc}$ ;  $x := y^{sc} \parallel y^{sc}$ ;  $y := y^{sc}$ ;  $y :=$ 

Under EGC, this is explained by dropping the order  $(Rx2) \rightarrow (Ry0)$ , because (Rx2) is fulfilled by a relaxed write in the same thread.

#### Removing Read-Read dependencies (RRD) 5.2

Example 5.8. The following execution is allowed by ARM8, but disallowed by Def 4.5.

$$x := 0; x := 1; y^{ra} := 1 \parallel r := y; if(r)\{s := x\}$$

$$(Wx0) \rightarrow (Wx1) \rightarrow (Wy1) \rightarrow (Ry1) \rightarrow (Rx0)$$

Under EGC, this is explained by dropping the order  $(Ry1) \rightarrow (Rx0)$ , because ARM8 does not include control dependencies between reads in the locally-ordered-before relation.

$$(Wx0) \rightarrow (Wx1) \rightarrow (Ry1) \rightarrow (Rx0)$$

Since we do not distinguish control dependencies from other dependencies, we are forced to drop all dependencies between reads. In order to do so, we use the uninterpreted symbol W.

Definition 5.9 (RRD). Update Def 3.24 to:

- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ ,
- L5)  $\tau^C(\psi)$  implies  $(v=r \vee W) \Rightarrow \psi$ , when  $E \neq \emptyset$ ,
- L6)  $\tau^B(\psi)$  implies  $\psi$ , when  $E = \emptyset$ .

L4 is the dependent case, L5 the independent case, and L6 the empty case. Recall the substitutions performed by TOP (Def 3.15). For writes, L5 is the same as L6 (since W is tt). For reads, L5 is the same as L4 (since W is ff).

One reading of L5 is that when satisfying a precondition  $\phi$  it is safe to use the value of the read, as long as the action is not a write. If the pomset is empty, however, there is no value v to use. Therefore the best we can do is to emulate skip, as in L6.

*Example 5.10.* Revisiting Ex 5.8 and eliding irrelevant transformers:

$$r := y \qquad \text{if}(r)\{s := x\}$$

$$(R y1) | (1=r \lor W) \Rightarrow \psi | \qquad (r \mid Rx0)$$

Composing sequentially:

$$r := y; \text{ if } (r) \{s := x\}$$

$$(\exists x \ y \ 1) \qquad (\exists x \ y \ 1)$$

At top-level, *TOP* yields:

$$(Ry1) (1=r \lor ff) \Rightarrow r \mid Rx0$$

The precondition of (Rx0) is a tautology, as required.

#### 5.3 Full semantics for ARM

Def 5.11 combines all of the features of §4–5.

*Definition 5.11 (co/sync/dgr/RRD).* Let  $[\phi/Q]$  be the substitution  $[\phi/Q_{ro}^*][\phi/Q_{wo}^*][\phi/Q_{sc}]$ . Update Def 3.24 to:

- S3)  $\kappa(e)$  implies  $Q_{\mu}^{Sx} \wedge M=v$ , L3)  $\kappa(e)$  implies  $Q_{\mu}^{Lx}$ .
- S4)  $\tau^D(\psi)$  implies  $\psi[\mu/\downarrow^x] \land M=v$
- S5)  $\tau^{C}(\psi)$  implies  $\psi[\mu/\downarrow^{x}][ff/Q_{\mu}^{Sx}]$
- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ ,
- L5)  $\tau^{C}(\psi)$  implies  $\downarrow^{x}_{\mu} \wedge ((v=r \vee W) \Rightarrow \psi[ff/Q_{\mu}^{Lx}]),$
- L6)  $\tau^B(\psi)$  implies  $\downarrow^x_{\mu} \land \psi$ , when  $E = \emptyset$ .

Example 5.12. RRD does not adversely affect subscription (Ex 4.8).

$$r := y^{ra}$$

$$S := x$$

$$\boxed{\mathbb{Q}_{ra}^{Ly} \mid \mathbb{R} y 1} \quad \boxed{\downarrow^{y} \land (\mathbb{W} \Rightarrow \psi[ff/\mathbb{Q}_{ra}^{Lx}])}$$

$$\boxed{\mathbb{Q}_{ra}^{Ly} \mid \mathbb{W} x 0}$$

Since  $Q_{ra}^{Ly}[ff/Q_{ra}^{Lx}]$  is false, we have:

$$r := y^{ra}; s := x$$

$$\boxed{\mathbb{Q}_{ra}^{Ly} \mid \mathbb{R} y 1} \qquad (\downarrow^{y} \land (\mathbb{W} \Rightarrow \mathsf{ff}) \mid \mathbb{W} x 0$$

Applying TOP yields:

$$\boxed{\text{tt} \mid \mathsf{R} \, \mathsf{y} \, \mathsf{1}} \qquad \left( \downarrow^{y} \land (\mathsf{f} \mathsf{f} \Rightarrow \mathsf{f} \mathsf{f}) \mid \mathsf{W} x \, \mathsf{0} \right)$$

Although (ff  $\Rightarrow$  ff) is a tautology,  $\downarrow^y$  is not. Thus, the pomset can only contribute to a top-level pomset if the thread is preceded by a relaxed write to y, allowing (Ry1) to be downgraded to a relaxed read.

Every ARM8 execution is allowed by Def 5.11. The proof of this fact is simplified by the recent characterization of ARM8 in terms of *external global completion* (EGC) [Arm Limited 2020, §B2.3.6]. Under EGC, an ARM8 execution is a linearization of per-location program order and a subset of local-order. Every such linearization is also a valid pomset under Def 5.11.

#### 6 OTHER FEATURES

# 6.1 Local Invariant Reasoning (LIR)

Example 6.1. JMM causality Test Case 1 [Pugh 2004] states the following execution should be allowed "since interthread compiler analysis could determine that x and y are always non-negative, allowing simplification of  $r \ge 0$  to true, and allowing write y := 1 to be moved early."

$$x := 0$$
; fork  $\{(r := x; if(r \ge 0) \{ y := 1 \} || x := y) \}$ 

$$(wx0) = (-1)^{-1} (x := y) + (-1)^{-1} (x := y)$$

Under the definitions given thus far, the precondition on (Wy1) can only be satisfied by the read of x, disallowing this execution.

In order to allow such executions, we include memory references in formula, resulting in:

$$(\mathbb{W}x0) \longrightarrow (\mathbb{R}x1) \qquad (0 \ge 0 \mid \mathbb{W}y1) \longrightarrow (\mathbb{R}y1) \longrightarrow (\mathbb{W}x1)$$

Definition 6.2 (LIR). Update Def 3.24 to:

- S4)  $\tau^D(\psi)$  implies  $\psi[M/x] \wedge M=v$ ,
- S5)  $\tau^C(\psi)$  implies  $\psi[M/x]$ ,
- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ ,
- L5)  $\tau^C(\psi)$  implies  $(v=r \lor x=r) \Rightarrow \psi$ , when  $E \neq \emptyset$ ,
- L6)  $\tau^B(\psi)$  implies  $\psi$ , when  $E = \emptyset$ .

L5 introduces memory references. It states that to be independent of the read, we must establish both  $\psi[v/r]$  and  $\psi[x/r]$ . If a precondition holds in both circumstances, S5 allows a local write to satisfy the precondition without introducing dependence. As in Def 5.9, we include L6 to provide a predicate transformer for the empty pomset.

Example 6.3. Revisiting Ex 6.1 and eliding irrelevant transformers:

If  $P \in STORE(x, M, \mu)$  then  $(\exists v : E \to V)$   $(\exists \theta : E \to \Phi)$ 

- S1) if  $\theta_d \wedge \theta_e$  is satisfiable then d = e,
- S2)  $\lambda(e) = Wxv_e$ ,
- S3)  $\kappa(e)$  implies  $\theta_e \wedge Q_{\mu}^{Sx} \wedge M = v_e$ , S4)  $(\forall e \in E \cap D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow (\psi[M/x][\mu/\downarrow^x] \wedge M = v_e)$ ,
- S5)  $(\forall e \in E \setminus D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow \psi[M/x][\mu/\downarrow^x][ff/Q_{\mu}^{Sx}].$

If  $P \in LOAD(r, x, \mu)$  then  $(\exists v : E \to V)$   $(\exists \theta : E \to \Phi)$ 

- L1) if  $\theta_d \wedge \theta_e$  is satisfiable then d = e,
- L2)  $\lambda(e) = \mathsf{R} x v_e$ ,
- L3)  $\kappa(e)$  implies  $\theta_e \wedge Q_u^{Lx}$ ,
- L4)  $(\forall e \in E \cap D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow (v_e = s_e) \Rightarrow \psi[s_e/r]$ ,
- L5)  $(\forall e \in E \setminus D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow \downarrow_{\mu}^x \land (\langle v_e = s_e \lor (\mathsf{W} \land x = s_e) \rangle \Rightarrow \psi[s_e/r][\mathsf{ff}/\mathsf{Q}_{\mu}^{\mathsf{L}x}]),$
- L6)  $(\forall s) \ \tau^D(\psi) \text{ implies } (\not\exists e \in E \mid \theta_e) \Rightarrow (\downarrow_{\mu}^x \land \psi[s/r][\mathsf{ff}/\mathsf{Q}_{\mu}^{\mathsf{L}x}]).$

Fig. 1. Full Semantics of Loads and Stores (See Def 4.4 for  $Q_{\mu}^{Sx}$ ,  $Q_{\mu}^{Lx}$  and Def 5.4 for  $\downarrow_{\mu}^{x}$ ,  $[\mu/\downarrow^{x}]$ )

$$x := 0; r := x; if(r \ge 0) \{y := 1\}$$

$$(Wx0) \rightarrow (Rx1) \quad ((1=r \lor 0=r) \Rightarrow r \ge 0 \mid Wy1)$$

The precondition of (Wy1) is a tautology, as required.

If L5 required only that  $x=r \Rightarrow \psi$ , then the following execution would be allowed:

$$x := 0; r := x; if(r \ge 0) \{y := 1\}$$

$$(\forall x 0) \rightarrow (Rx-1) \quad (\forall y 1)$$

But this would violate the expected local invariant: that all values seen for x are nonnegative.

It is worth emphasizing that this reasoning is local, and therefore unaffected by the introduction of additional threads, as in Test Case 9 [Pugh 2004].

Some care is necessary when combining LIR Def 6.2 and RRD Def 5.9. The proper form for L5 is:

L5) 
$$\tau^C(\psi)$$
 implies  $(v=r \lor (W \land x=r)) \Rightarrow \psi$ .

When W is true, this is unchanged from L5 in Def 6.2. When W is false, it is the same as L4 in Def 6.2.

One must also be careful when combing LIR with RMW operations, such as CAS. For these operations, LIR is unsound. Thus reads in RMWs should always use L6 for the independent case.

# Register Recycling (ALPHA)

The semantics considered thus far assume that each register is assigned at most once in a program. We relax this by renaming.

Example 6.4. JMM causality Test Case 2 [Pugh 2004] states the following execution should be allowed "since redundant read elimination could result in simplification of r=s to true, allowing y := 1 to be moved early."

$$r := x$$
;  $s := x$ ; if  $(r = s)\{y := 1\} \parallel x := y$ 
 $(Rx1) \leftarrow (Wy1) \rightarrow (Ry1) \rightarrow (Wx1)$ 

This execution is not allowed under Def 6.2, since the precondition of (Wy1) in the independent case is

$$(r=1 \lor r=x) \Rightarrow (s=1 \lor s=r) \Rightarrow (r=s),$$

which is not a tautology. Our solution is to rename registers using the set  $S_{\mathcal{E}} = \{s_e \mid e \in \mathcal{E}\}$ , which are banned from source programs, as per §3.1. This allows us to resolve nondeterminism in loads when merging, resulting in:

$$(Rx1)$$
  $(Wy1)$   $(Ry1)$   $(Wx1)$ 

Definition 6.5 (ALPHA). Update Def 3.24 to:

- L4)  $\tau^D(\psi)$  implies  $v=s_e \Rightarrow \psi[s_e/r]$ ,
- L5)  $(\forall s) \tau^C(\psi)$  implies  $\psi[s/r]$ .

*Example 6.6.* Revisiting Ex 6.4 and choosing  $s_e = r$ :

$$r := x$$

$$e \left( \mathbb{R}x1 \right)$$

$$(1=r \lor x=r) \Rightarrow \psi[r/r]$$

$$(1=r \lor x=r) \Rightarrow \psi[r/s]$$

Coalescing and composing:

The precondition of (Wy1) is a tautology, as required.

# 6.3 If-Closure (IF)

Example 6.7. If S = (x := 1), then Def 3.24 does not allow:

$$if(M)\{x := 1\}; S; if(\neg M)\{x := 1\}$$

$$(Wx1) \rightarrow (Wx1)$$

However, if  $S = (if(\neg M)\{x := 1\}; if(M)\{x := 1\})$ , then it *does* allow the execution. Looking at the initial program:

The difficulty is that the middle action can coalesce either with the right action, or the left, but not both. Thus, we are stuck with some non-tautological precondition. Our solution is to allow a pomset to contain many events for a single action, as long as the events have disjoint preconditions.

This is not simply a theoretical question; it is observable. For example, Def 3.24 does not allow the following.

$$r := y$$
; if( $r$ ){ $x := 1$ };  $x := 1$ ; if( $\neg r$ ){ $x := 1$ };  $z := r$ 

$$|| if(x){x := 0}; if(x){y := 1}}$$

$$|| Ry1 | Wx1 | Wx1 | Wz1$$

$$|| Rx1 | Wx0 | Rx1 | Wy1 |$$

```
Definition 6.8 (ALPHA/IF). Update Def 3.24 to: If P \in STORE(x, M, \mu) then (\exists v : E \to V) (\exists \theta : E \to \Phi) S1) if \theta_d \wedge \theta_e is satisfiable then d = e, S2) \lambda(e) = Wxv_e, S3) \kappa(e) implies \theta_e \wedge M = v, S4) (\forall e \in E \cap D) \tau^D(\psi) implies \theta_e \Rightarrow (\psi \wedge M = v), S5) \tau^C(\psi) implies (\nexists e \in E \cap C \mid \theta_e) \Rightarrow \psi, If P \in LOAD(r, x, \mu) then (\exists v : E \to V) (\exists \theta : E \to \Phi) L1) if \theta_d \wedge \theta_e is satisfiable then d = e, L2) \lambda(e) = Rxv_e, L3) \kappa(e) implies \theta_e. L4) (\forall e \in E \cap D) \tau^D(\psi) implies \theta_e \Rightarrow v_e = s_e \Rightarrow \psi[s_e/r], L5) (\forall s) \tau^C(\psi) implies (\nexists e) \in E \mid \theta_e) \Rightarrow \psi[s/r].
```

Example 6.9. Revisiting Ex 6.7, we can split the middle command:

Coalescing events gives the desired result.

These examples show that we must allow inconsistent predicates in a single pomset, unlike [Jagadeesan et al. 2020].

# 7 CONCLUSIONS

We have presented the first model of relaxed memory that treats sequential composition as a first-class citizen. The model builds directly on [Jagadeesan et al. 2020].

For sequential composition, parallel composition and the conditional, we believe that the definition is *natural*, even *canonical*. For stores and loads, instead, the definition in Fig 1 is a Frankenstein's monster of features. This complexity is *essential*, however, not just an accident of our poor choices. Relaxed memory models must please many audiences: compiler writers want one thing, hardware designers another, and programmers yet another still. The result is inevitably full of compromise.

Given that *complexity* cannot be eliminated from relaxed memory models, the best one can do is attempt to understand its causes. We have broken the problem into seven manageable pieces, discussed throughout §4–6. Def 5.11 summarizes all the features necessary for efficient implementation on ARM8. We discuss address calculation, read-modify-write operations and fences in the appendix.

Logic is the thread that sews these features together.

#### REFERENCES

Jade Alglave. 2020. This commit adds three alternative formulations of the Arm model, both for non-mixed and mixed size accesses. https://github.com/herd/herdtools7/commit/685ee4.

Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory. ACM Trans. Program. Lang. Syst. 36, 2, Article 7 (July 2014), 74 pages. https://doi.org/10.1145/2627752Arm Limited. 2020. Arm Architecture Reference Manual: Armv8, for Armv8-A Architecture Profile (Issue F.c). https://developer.arm.com/documentation/ddi0487/latest.

Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. 2011. Mathematizing C++ Concurrency. In *Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages* (Austin, Texas, USA) (*POPL '11*). ACM, New York, NY, USA, 55–66. https://doi.org/10.1145/1926385.1926394

- Hans-J. Boehm and Brian Demsky. 2014. Outlawing Ghosts: Avoiding Out-of-thin-air Results. In Proceedings of the Workshop on Memory Systems Performance and Correctness (Edinburgh, United Kingdom) (MSPC '14). ACM, New York, NY, USA, Article 7, 6 pages. https://doi.org/10.1145/2618128.2618134
- Stephen D. Brookes. 1996. Full Abstraction for a Shared-Variable Parallel Language. *Inf. Comput.* 127, 2 (1996), 145–163. https://doi.org/10.1006/inco.1996.0056
- Soham Chakraborty and Viktor Vafeiadis. 2019. Grounding thin-air reads with event structures. *PACMPL* 3, POPL (2019), 70:1–70:28. https://doi.org/10.1145/3290383
- Russ Cox. 2016. Go's Memory Model. http://nil.csail.mit.edu/6.824/2016/notes/gomem.pdf.
- Edsger W. Dijkstra. 1975. Guarded Commands, Nondeterminacy and Formal Derivation of Programs. *Commun. ACM* 18, 8 (1975), 453–457. https://doi.org/10.1145/360933.360975
- Stephen Dolan, KC Sivaramakrishnan, and Anil Madhavapeddy. 2018. Bounding Data Races in Space and Time. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (Philadelphia, PA, USA) (PLDI 2018). ACM, New York, NY, USA, 242–255. https://doi.org/10.1145/3192366.3192421
- Brijesh Dongol, Radha Jagadeesan, and James Riely. 2019. Modular transactions: bounding mixed races in space and time. In *Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019, Washington, DC, USA, February 16-20, 2019*, Jeffrey K. Hollingsworth and Idit Keidar (Eds.). ACM, 82–93. https://doi.org/10.1145/3293883.3295708
- William Ferreira, Matthew Hennessy, and Alan Jeffrey. 1996. A Theory of Weak Bisimulation for Core CML. In Proceedings of the 1996 ACM SIGPLAN International Conference on Functional Programming, ICFP 1996, Philadelphia, Pennsylvania, USA, May 24-26, 1996, Robert Harper and Richard L. Wexelblat (Eds.). ACM, 201–212. https://doi.org/10.1145/232627.232649
- Jay L. Gischer. 1988. The equational theory of pomsets. Theoretical Computer Science 61, 2 (1988), 199–224. https://doi.org/ 10.1016/0304-3975(88)90124-7
- C.A.R. Hoare. 1969. An Axiomatic Basis for Computer Programming. Commun. ACM 12, 10 (Oct. 1969), 576–580. https://doi.org/10.1145/363235.363259
- Radha Jagadeesan, Alan Jeffrey, and James Riely. 2020. Pomsets with preconditions: a simple model of relaxed memory. Proc. ACM Program. Lang. 4, OOPSLA (2020), 194:1–194:30. https://doi.org/10.1145/3428262
- Radha Jagadeesan, Corin Pitcher, and James Riely. 2010a. Generative Operational Semantics for Relaxed Memory Models. In *Proceedings of the 19th European Conference on Programming Languages and Systems* (Paphos, Cyprus) (ESOP'10). Springer-Verlag, Berlin, Heidelberg, 307–326. https://doi.org/10.1007/978-3-642-11957-6\_17
- Radha Jagadeesan, Corin Pitcher, and James Riely. 2010b. Generative Operational Semantics for Relaxed Memory Models. In Programming Languages and Systems, 19th European Symposium on Programming, ESOP 2010, Paphos, Cyprus, March 20-28, 2010. Proceedings (Lecture Notes in Computer Science, Vol. 6012), Andrew D. Gordon (Ed.). Springer, 307–326. https://doi.org/10.1007/978-3-642-11957-6 17
- Alan Jeffrey and James Riely. 2016. On Thin Air Reads Towards an Event Structures Model of Relaxed Memory. In *Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science, LICS '16, New York, NY, USA, July 5-8, 2016*, M. Grohe, E. Koskinen, and N. Shankar (Eds.). ACM, 759–767. https://doi.org/10.1145/2933575.2934536
- Jeehoon Kang, Chung-Kil Hur, Ori Lahav, Viktor Vafeiadis, and Derek Dreyer. 2017. A promising semantics for relaxed-memory concurrency. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL 2017, Paris, France, January 18-20, 2017, Giuseppe Castagna and Andrew D. Gordon (Eds.). ACM, 175–189. http://dl.acm.org/citation.cfm?id=3009850
- Ryan Kavanagh and Stephen Brookes. 2018. A denotational account of C11-style memory. CoRR abs/1804.04214 (2018). arXiv:1804.04214 http://arxiv.org/abs/1804.04214
- Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. 2017. Repairing sequential consistency in C/C++11. In *Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017*, Albert Cohen and Martin T. Vechev (Eds.). ACM, 618–632. https://doi.org/10.1145/3062341.3062352
- Leslie Lamport. 1979. How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs. *IEEE Trans. Comput.* 28, 9 (Sept. 1979), 690–691. https://doi.org/10.1109/TC.1979.1675439
- Sung-Hwan Lee, Minki Cho, Anton Podkopaev, Soham Chakraborty, Chung-Kil Hur, Ori Lahav, and Viktor Vafeiadis. 2020. Promising 2.0: global optimizations in relaxed memory concurrency. In *Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020, Alastair F. Donaldson and Emina Torlak (Eds.).* ACM, 362–376. https://doi.org/10.1145/3385412.3386010
- Lun Liu, Todd Millstein, and Madanlal Musuvathi. 2019. Accelerating Sequential Consistency for Java with Speculative Compilation. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (Phoenix, AZ, USA) (PLDI 2019). ACM, New York, NY, USA, 16–30. https://doi.org/10.1145/3314221.3314611
- Jeremy Manson, William Pugh, and Sarita V. Adve. 2005. The Java Memory Model. SIGPLAN Not. 40, 1 (Jan. 2005), 378–391. https://doi.org/10.1145/1047659.1040336

```
If P \in STORE(L, M, \mu) then (\exists \ell : E \to \mathcal{V}) (\exists v : E \to \mathcal{V}) (\exists \theta : E \to \Phi)

S1) if \theta_d \wedge \theta_e is satisfiable then d = e,

S2) \lambda(e) = \mathbb{W}[\ell_e]v_e,

S3) \kappa(e) implies \theta_e \wedge \mathbb{Q}_{\mu}^{\mathsf{S}[\ell_e]} \wedge L = \ell_e \wedge M = v_e,

S4) (\forall e \in E \cap D) \tau^D(\psi) implies \theta_e \Rightarrow (\psi[M/[\ell_e]][\mu/\downarrow^{[\ell_e]}] \wedge L = \ell_e \wedge M = v_e),

S5) (\forall e \in E \setminus D) \tau^D(\psi) implies \theta_e \Rightarrow \psi[M/[\ell_e]][\mu/\downarrow^{[\ell_e]}][\mathsf{ff}/\mathbb{Q}_{\mu}^{\mathsf{S}[\ell_e]}],

S6) (\forall k) \tau^D(\psi) implies (\exists e \in E \mid \theta_e) \Rightarrow (L = k) \Rightarrow \psi[M/[k]][\mu/\downarrow^{[k]}][\mathsf{ff}/\mathbb{Q}_{\mu}^{\mathsf{S}[k]}].

If P \in LOAD(r, L, \mu) then (\exists \ell : E \to \mathcal{V}) (\exists v : E \to \mathcal{V}) (\exists \theta : E \to \Phi)

L1) if \theta_d \wedge \theta_e is satisfiable then d = e,

L2) \lambda(e) = \mathbb{R}[\ell_e]v_e,

L3) \kappa(e) implies \theta_e \wedge \mathbb{Q}_{\mu}^{\mathsf{L}[\ell_e]} \wedge L = \ell_e,

L4) (\forall e \in E \cap D) \tau^D(\psi) implies \theta_e \Rightarrow (L = \ell_e \Rightarrow v_e = s_e) \Rightarrow \psi[s_e/r],

L5) (\forall e \in E \setminus D) \tau^D(\psi) implies \theta_e \Rightarrow \int_{\mu}^{[\ell_e]} \wedge ((L = \ell_e \Rightarrow v_e = s_e) \vee (\mathbb{W} \wedge (L = \ell_e \Rightarrow \ell_e))) \Rightarrow \psi[s_e/r][\mathsf{ff}/\mathbb{Q}_{\mu}^{\mathsf{L}[\ell_e]}]),

L6) (\forall k)(\forall s) \tau^D(\psi) implies (\exists e \in E \mid \theta_e) \Rightarrow (L = k) \Rightarrow (\downarrow_{\mu}^{[k]} \wedge \psi[s/r][\mathsf{ff}/\mathbb{Q}_{\mu}^{\mathsf{L}[k]}]).
```

Fig. 2. Full Semantics with Address Calculation (See Def 4.4 for  $Q_u^{Sx}$ ,  $Q_u^{Lx}$  and Def 5.4 for  $\downarrow_u^x$ ,  $[\mu/\downarrow^x]$ )

Antoni W. Mazurkiewicz. 1995. Introduction to Trace Theory. In *The Book of Traces*, Volker Diekert and Grzegorz Rozenberg (Eds.). World Scientific, 3–41. https://doi.org/10.1142/9789814261456\_0001

Jean Pichon-Pharabod and Peter Sewell. 2016. A Concurrency Semantics for Relaxed Atomics That Permits Optimisation and Avoids Thin-air Executions. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (St. Petersburg, FL, USA) (POPL '16). ACM, New York, NY, USA, 622–633. https://doi.org/10. 1145/2837614.2837616

Gordon D. Plotkin and Vaughan R. Pratt. 1996. Teams can see pomsets. In Partial Order Methods in Verification, Proceedings of a DIMACS Workshop, Princeton, New Jersey, USA, July 24-26, 1996 (DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Vol. 29), Doron A. Peled, Vaughan R. Pratt, and Gerard J. Holzmann (Eds.). DIMACS/AMS, 117–128. https://doi.org/10.1090/dimacs/029/07

Anton Podkopaev. 2020. Private correspondence.

Anton Podkopaev, Ori Lahav, and Viktor Vafeiadis. 2019. Bridging the gap between programming languages and hardware weak memory models. *Proc. ACM Program. Lang.* 3, POPL (2019), 69:1–69:31. https://doi.org/10.1145/3290382 William Pugh. 2004. Causality Test Cases. https://perma.cc/PJT9-XS8Z

Conrad Watt, Christopher Pulte, Anton Podkopaev, Guillaume Barbier, Stephen Dolan, Shaked Flur, Jean Pichon-Pharabod, and Shu-yu Guo. 2020. Repairing and mechanising the JavaScript relaxed memory model. In *Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020*, Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 346–361. https://doi.org/10.1145/3385412.3385973

Conrad Watt, Andreas Rossberg, and Jean Pichon-Pharabod. 2019. Weakening WebAssembly. *Proc. ACM Program. Lang.* 3, OOPSLA (2019), 133:1–133:28. https://doi.org/10.1145/3360559

# A ADDITIONAL FEATURES

#### A.1 Address Calculation (ADDR)

Fig 2 describes the full semantics with address calculation.

Definition A.1 (ADDR). Update Def 3.24 to existentially quantify over  $\ell$  in STORE and LOAD:

- S2)  $\lambda(e) = W[\ell]v$ ,
- L2)  $\lambda(e) = \mathbb{R}[\ell]v$ .
- S3)  $\kappa(e)$  implies  $L=\ell \wedge M=v$ ,
- L3)  $\kappa(e)$  implies  $L=\ell$ .
- L4)  $\tau^D(\psi)$  implies  $(L=\ell \Rightarrow v=r) \Rightarrow \psi$ ,

L5)  $\tau^{C}(\psi)$  implies  $\psi$ .

*Example A.2.* Def A.1 is naive with respect to merging events. Consider the following example from [Jagadeesan et al. 2020]:

$$[r] := 0 \; ; \; [0] := !r \\ {}^{c} (r = 1 \mid W[1]0)^{d} (r = 1 \mid W[0]0) \\ {}^{d} (r = 0 \mid W[0]0)^{e} (r = 0 \mid W[0]1)$$

Merging, we have:

$$\begin{split} & \text{if}(M)\{[r] := 0\,;\, [0] := !r\}\, \text{else}\, \{[r] := 0\,;\, [0] := !r\} \\ & \quad ^{c} \left(r = 1 \mid \mathsf{W}[1]0\right)^{d} \left(r = 0 \lor r = 1 \mid \mathsf{W}[0]0\right)^{-\frac{p}{2}} \left(r = 0 \mid \mathsf{W}[0]1\right) \end{aligned}$$

The precondition of W[0]0 is a tautology; however, this is not possible for ([r]:=0; [0]:=!r) alone, using Def A.1. The full semantics, given in Fig 2, enables this execution using if-closure. The individual commands have the pomsets:

$$[r] := 0$$
  $[0] := !r$   $^{c} (r=1 \mid W[1]0)$   $^{d} (r=0 \mid W[0]0)$   $^{e} (r=0 \mid W[0]1)$ 

Sequencing and merging, we have:

$$[r] := 0; [0] := !r$$
 
$${}^{c}(r=1 \mid W[1]0) \stackrel{d}{=} (r=0 \lor r=1 \mid W[0]0) \stackrel{e}{\to} (r=0 \mid W[0]1)$$

The precondition of (W[0]0) is a tautology, as required.

*Example A.3.* The combination of read-read independency and address calculation (ADDR/RRD) is somewhat delicate. Combing Def 5.9 and Def A.1 we have:

L4) 
$$\tau^D(\psi)$$
 implies  $(L=\ell \Rightarrow v=r) \Rightarrow \psi$ ,  
L5)  $\tau^C(\psi)$  implies  $((L=\ell \Rightarrow v=r) \lor \mathsf{W}) \Rightarrow \psi$ .

If we replace the use of  $(L=\ell \Rightarrow v=r)$  by (v=r), thin air reads are possible. The subsection of §C on Causal Strengthening discusses this example using the semantics of [Jagadeesan et al. 2020].

Consider the following program, from [Jagadeesan et al. 2020, §5], where initially x = 0, y = 0, [0] = 0, [1] = 2, and [2] = 1. It should only be possible to read 0, disallowing the attempted execution below:

Looking at the left thread:

$$r := y \qquad \qquad s := [r] \qquad \qquad x := s$$

$$(2=r \lor W) \Rightarrow \psi \qquad \qquad [\psi]$$

$$(r=2 \Rightarrow 1=s) \lor W) \Rightarrow \psi$$

$$(r=2 \mid R[2]1) \cdot \cdot \cdot \cdot \qquad \qquad [v=1 \mid Wx1]$$

$$(r=2 \Rightarrow 1=s) \Rightarrow \psi \Leftrightarrow \qquad [\psi]$$

Composing, we have:

$$r := y; s := [r]; x := s$$

$$(2=r \lor W) \Rightarrow r=2 \mid R[2]1$$

$$(2=r \lor W) \Rightarrow (r=2 \Rightarrow 1=s) \Rightarrow s=1 \mid Wx1$$

Substituting for W:

$$(2=r \lor ff) \Rightarrow r=2 \mid R[2]1$$

$$(2=r \lor tt) \Rightarrow (r=2 \Rightarrow 1=s) \Rightarrow s=1 \mid Wx1$$

Which is:

$$\begin{array}{c}
(R y2) & (2=r \Rightarrow r=2 \mid R[2]1) \\
((r=2 \Rightarrow 1=s) \Rightarrow s=1 \mid Wx1)
\end{array}$$

The precondition of (R[2]1) is a tautology, but the precondition of (Wx1) is not. This forces a dependency:

$$r := y; s := [r]; x := s$$

$$(R y2) \qquad (2=r \Rightarrow r=2 \mid R[2]1)$$

$$(2=r \Rightarrow (r=2 \Rightarrow 1=s) \Rightarrow s=1 \mid Wx1)$$

All the preconditions are now tautologies.

# A.2 Merging different labels

We combine access and fence modes into a single order:

$$\begin{picture}(20,10) \put(0,0){\line(1,0){100}} \put(0,0){\line(1,0){10$$

Definition A.4. Define  $\prec : \mathcal{A} \times \mathcal{A} \to 2^{\mathcal{A}}$  as follows. If  $a_0 \in a_1 \prec a_2$ , then  $a_1$  and  $a_2$  can coalesce, resulting in  $a_0$ . This is useful for replacing (x := 1; x := 2) by (x := 2).

$$Rxv \prec Rxv = \{Rxv\}$$

$$Wxv \prec Wxw = \{Wxw\}$$

$$Rxv \prec Wxv = \{Wxv\}$$

$$Rxv \prec Wxv = \{Wxv\}$$

$$F^{\mu} \prec F^{\nu} = \{F^{\mu \sqcup \nu}\}$$

$$a \prec b = \emptyset, \text{ otherwise}$$

- (1) if  $e \in E_1 \setminus E_2$  then  $\lambda(e) = \lambda_1(e)$ ,
- (2) if  $e \in E_2 \setminus E_1$  then  $\lambda(e) = \lambda_2(e)$ ,
- (3) if  $e \in E_1 \cap E_2$  then  $\lambda(e) \in \lambda_1(e) \prec \lambda_2(e)$ , the first has no rf,

# A.3 Read-Modify-Write Operations (RMW)

RMW operations are formalized by adding a relation  $\subseteq E \times E$  that relates the read of a successful RMW to the succeeding write. The definition of a pomset requires the following, where two actions *overlap* if they touch the same location:

- (1) rmw  $\subseteq$   $\leq$  is a relation capturing *read-modify-write atomicity*, such that for any c, d,  $e \in E$ , where  $\lambda(c)$  and  $\lambda(d)$  touch the same location:
  - if  $d \xrightarrow{\mathsf{rmw}} e$  and  $c \le e$  then  $c \le d$ ,
  - if  $d \xrightarrow{\mathsf{rmw}} e$  and  $d \le c$  then  $e \le c$ .

Extend the definition of par, if, seq to include:

• rmw  $\supseteq$  (rmw<sub>1</sub>  $\cup$  rmw<sub>2</sub>),

*Example A.5.* This definition ensures atomicity, disallowing executions such as [Podkopaev et al. 2019, Ex. 3.2]:

$$x := 0$$
;  $s := \text{FADD}^{\text{rlx},\text{rlx}}(x, 1) \parallel x := 2$ ;  $s := x$ 

$$\begin{array}{c} Wx0 \longrightarrow Rx0 \longrightarrow Wx2 \longrightarrow Wx1 \longrightarrow Rx1 \end{array}$$

By ??, since  $(Wx2) \rightarrow (Wx1)$ , it must be that  $(Wx2) \rightarrow (Rx0)$ , creating a cycle.

Example A.6. Two successful RMWs cannot see the same write:

$$x := 0$$
; (FADD<sup>rlx,rlx</sup>(x, 1) || FADD<sup>rlx,rlx</sup>(x, 1))

The order from read-to-write is required by fulfillment. Apply ?? to  $a \rightarrow d$ , we have that  $a \rightarrow c$ . Subsequently applying ??, we have  $b \rightarrow c$ , creating a cycle.

*Example A.7.* By using two actions rather than one, the definition allows examples such as the following, which is allowed by ARM8 [Podkopaev et al. 2019, Ex. 3.10]:

$$r := y$$
;  $z := r \parallel r := z$ ;  $x := 0$ ;  $s := \mathsf{FADD^{rlx,ra}}(x,1)$ ;  $y := s+1$ 

$$\boxed{ (\mathsf{R}\,y1) \xrightarrow{\mathsf{F}(\mathsf{W}\,z1)} \qquad (\mathsf{R}\,z1) \qquad (\mathsf{W}\,x0) \xrightarrow{\mathsf{F}(\mathsf{R}\,y0)} \qquad (\mathsf{W}\,x1) \qquad (\mathsf{W}\,y1) }$$

*Example A.8.* For RMW operations, the independent case for a read should be the same as the empty case. To see why, consider the semantics of local invariant reasoning (LIR) from Def 6.2:

- S4)  $\tau^D(\psi)$  implies  $\psi[M/x] \wedge M=v$ ,
- S5)  $\tau^C(\psi)$  implies  $\psi[M/x]$ ,
- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ ,
- L5)  $\tau^{C}(\psi)$  implies  $(v=r \lor x=r) \Rightarrow \psi$ , when  $E \neq \emptyset$ ,
- L6)  $\tau^B(\psi)$  implies  $\psi$ , when  $E = \emptyset$ .

Consider the relaxed variant of the CDRF example from [Lee et al. 2020], using a semantics for FADD that simply composes the rules for load and store above.

$$x := 0; (r := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1); \mathsf{if}(!r) \{ \mathsf{if}(y) \{ x := 0 \} \} \quad \|$$

$$r := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1); \mathsf{if}(!r) \{ y := 1 \} )$$

$$(\mathsf{W}x0) \longrightarrow (\mathsf{R}x0) \xrightarrow{\mathsf{rmw}} (\mathsf{W}x1) \xrightarrow{\mathsf{R}} (\mathsf{W}x0) \longrightarrow (\mathsf{R}x0) \xrightarrow{\mathsf{rmw}} (\mathsf{W}x1) \xrightarrow{\mathsf{W}} (\mathsf{W}y1)$$

Looking at the independent transformers of the second thread and initializer, we have:

After sequencing, the precondition of (Wy1) is a tautology:  $(0=r \lor 0=r) \Rightarrow r=0$ .

Here, local invariant reasoning is using the initializing write to x to justify the independence of the write to y. But this write is made unavailable by the first thread's successful RMW.

As a result, we disallow the use of L5 when treating the read event in an RMW. [Todo: write out the rules.]

# A.4 Fence Operations (FENCE)

Syntactic fences  $F^{\nu}$  have corresponding actions:  $(F^{\nu})$ . The syntactic fence mode  $(\nu := rel \mid acq \mid fsc)$  is either release, acquire, or sequentially-consistent.

Formalizing this,  $Q_{\text{rel}}^{Sx}$  substitutes for  $Q_{\text{wo}}^*$  in addition to  $Q_{\text{wo}}^x$ , as in  $Q_{\text{ra}}^{Sx}$ .  $Q_{\text{acq}}^{Lx}$  requires  $Q_{\text{ro}}^*$  in addition to  $Q_{\text{wo}}^x$ , as in  $Q_{\text{ra}}^{Cx}$ .

Definition A.9. Extend Def 4.4 and Def 5.4.

$$\begin{aligned} Q_{\text{rel}}^{Fx} &= Q_{\text{ro}}^* \wedge Q_{\text{wo}}^* & Q_{\text{acq}}^{Fx} &= Q_{\text{wo}}^x \wedge Q_{\text{ro}}^* \\ [\phi/Q_{\text{rel}}^{Fx}] &= [\phi/Q_{\text{wo}}^*] & [\phi/Q_{\text{acq}}^{Fx}] &= [\phi/Q_{\text{ro}}^*, \phi/Q_{\text{wo}}^*] \end{aligned}$$
 
$$Q_{\text{fsc}}^{Fx} &= Q_{\text{ro}}^* \wedge Q_{\text{wo}}^* \wedge Q_{\text{sc}} \\ [\phi/Q_{\text{fsc}}^{Fx}] &= [\phi/Q_{\text{ro}}^*, \phi/Q_{\text{wo}}^*, \phi/Q_{\text{sc}}] \end{aligned}$$

If  $P \in FENCE(\mu)$  then  $(\exists X \subseteq X)$ 

- (2) if  $d, e \in E$  then d = e,
- (3)  $\lambda(e) = F^{\mu} X$ ,
- (4)  $\kappa(e)$  implies  $\bigwedge_{x \in (X \setminus X)} Q_{\mu}^{\mathsf{F}x}$ ,
- (5)  $\tau^D(\psi)$  implies  $\psi$ , where  $D \cap E \neq \emptyset$ ,
- (6)  $\tau^C(\psi)$  implies  $\bigwedge_{x \in X} \downarrow^x \land \psi[\mathsf{ff}/\mathsf{Q}^{\mathsf{F}x}]$ , where  $C \cap E = \emptyset$ .

Example A.10. Extend Ex 4.6.



# A.5 Fence Actions with Downgrading Reads (FENCE/DGR)

Example A.11. Revisiting Ex 5.6 using fences:

What we want is this:

$$Wx2 \longrightarrow Rx2$$
  $F^{acq} \longrightarrow Wy1$ 

Let acquiring fence actions include a set:  $(F^{\nu}X)$ . The set is nondeterministically chosen by the semantics. Idea is to downgrade the reads in X and fence everything else.

# A.6 Access Elimination

For reads, get rid of ff/Q in L6.

For writes, change the label rules of sequential composition to:

- (1) if  $e \in E_1 \setminus E_2$  then  $\lambda(e) = \lambda_1(e)$ ,
- (2) if  $e \in E_2 \setminus E_1$  then  $\lambda(e) = \lambda_2(e)$ ,
- (3) if  $e \in E_1 \cap E_2$  then  $\lambda(e) \in \lambda_1(e) \prec \lambda_2(e)$ .

Definition A.12.

$$Rxv \prec Rxv = \{Rxv\}$$
 $Wxv \prec Wxw = \{Wxw\}$ 
 $F^{\mu} \prec F^{\nu} = \{F^{\mu \sqcup \nu}\}$ 
 $a \prec b = \emptyset$ , otherwise

# A.7 Extended Access Modes

We can enrich read and write actions to use fence modes. The resulting order is:

$$rlx \rightarrow ra \stackrel{acq}{\Rightarrow} \stackrel{sc}{sc} \stackrel{fsc}{\Rightarrow} fsc$$

We write  $\mu \sqsubseteq \nu$  for this order. Let  $\mu \sqcup \nu$  denote the least upper bound of  $\mu$  and  $\nu$ .

Reads allow all annotations but rel. Writes allow all annotations but acq. Fences allow only the three annotations rel, acq and fsc.

Definition A.13.

$$\mathsf{R}^{\mu}xv \prec \mathsf{R}^{\nu}xv = \{\mathsf{R}^{\mu \sqcup \nu}xv\}$$

$$\mathsf{W}^{\mu}xv \prec \mathsf{W}^{\nu}xw = \{\mathsf{W}^{\mu \sqcup \nu}xw\}$$

$$\mathsf{F}^{\mu} \prec \mathsf{F}^{\nu} = \{\mathsf{F}^{\mu \sqcup \nu}\}$$

$$\mathsf{F}^{\mu} \prec \mathsf{R}^{\nu}xv = \mathsf{R}^{\mu}xv \prec \mathsf{F}^{\nu} = \{\mathsf{R}^{\mu \sqcup \nu}xv\}$$

$$\mathsf{F}^{\mu} \prec \mathsf{W}^{\nu}xw = \mathsf{W}^{\mu}xw \prec \mathsf{F}^{\nu} = \{\mathsf{W}^{\mu \sqcup \nu}xw\}$$

$$a \prec b = \emptyset, \text{ otherwise}$$

# **B DISCUSSION**

#### **B.1** Closure properties

We would like the semantics to be closed with respect to augments and downsets.

Augments include more order and stronger formulae; in examples, we typically consider pomsets that are augment-minimal. One intuitive reading of augment closure is that adding order can only cause preconditions to weaken.

Definition B.1.  $P_2$  is an augment of  $P_1$  if

- (1)  $E_2 = E_1$ ,
- (2)  $\lambda_2(e) = \lambda_1(e)$ ,
- (3)  $\kappa_2(e)$  implies  $\kappa_1(e)$ ,
- (4)  $\tau_2^D(e)$  implies  $\tau_1^D(e)$ ,
- (5) if  $d \leq_2 e$  then  $d \leq_1 e$ .

PROPOSITION B.2. If  $P_1 \in [S]$  and  $P_2$  augments  $P_1$  then  $P_2 \in [S]$ .

Downsets include a subset of initial events, similar to prefixes for strings.

Definition B.3.  $P_2$  is an downset of  $P_1$  if

- (1)  $E_2 \subseteq E_1$ ,
- (2)  $(\forall e \in E_2) \lambda_2(e) = \lambda_1(e)$ ,
- (3)  $(\forall e \in E_2) \ \kappa_2(e) = \kappa_1(e)$ ,
- (4)  $(\forall e \in E_2) \ \tau_2^D(e) = \tau_1^D(e),$
- (5)  $(\forall d \in E_2)$   $(\forall e \in E_2)$   $d \leq_2 e$  if and only if  $d \leq_1 e$ ,
- (6)  $(\forall d \in E_1)$   $(\forall e \in E_2)$  if  $d \leq_1 e$  then  $d \in E_2$ .

Downset closure fails due to RRD and LIR. The key property is that the empty set transformer should behave the same as the independent transformer.

Example B.4. For RRD, Def 5.9 states:

- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ ,
- L5)  $\tau^{C}(\psi)$  implies  $(v=r \vee W) \Rightarrow \psi$ ,
- L6)  $\tau^B(\psi)$  implies  $\psi$ , when  $E = \emptyset$ .

This semantics is not downset closed due to the lack of read-read dependencies. In both cases, for subsequent writes, L5 is the same as L6. For subsequent reads, L5 is the same as L4. Consider

The semantics of this program includes the singleton pomset (Rx0), but not the singleton pomset (Ry0). To get (Rx0), we combine:

$$r := x \qquad \qquad \text{if}(!r)\{s := y\}$$

$$(Rx0) \qquad \qquad \emptyset$$

Attempting to get (Ry0), we instead get:

$$r := x \qquad \text{if}(!r)\{s := y\}$$

$$\emptyset \qquad \qquad (r=0 \mid R y0)$$

Since r appears only once in the program, this pomset cannot contribute to a top-level pomset.

*Example B.5.* For LIR, Def 6.2 states:

- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ ,
- L5)  $\tau^{C}(\psi)$  implies  $(v=r \lor x=r) \Rightarrow \psi$ , when  $E \neq \emptyset$ ,
- L6)  $\tau^B(\psi)$  implies  $\psi$ , when  $E = \emptyset$ .

This semantics is not downset closed: The independency reasoning of L5 is only applicable for pomsets where the ignored read is present! Revisiting Ex 6.3

$$x := 0 \qquad r := x \qquad \text{if } (r \ge 0) \{y := 1\}$$

$$\boxed{\mathbb{R}x1} \qquad \qquad r \ge 0 \mid \mathbb{W}y1$$

$$\psi[0/x] \qquad \qquad (1 = r \lor x = r) \Rightarrow \psi$$

$$x := 0; \ r := x; \ \text{if } (r \ge 0) \{y := 1\}$$

$$\boxed{\mathbb{R}x1} \qquad \qquad (1 = r \lor 0 = r) \Rightarrow r \ge 0 \mid \mathbb{W}y1$$

The precondition of (Wy1) is a tautology.

Taking the empty set for the read, however, we have:

$$x := 0; r := x; if(r \ge 0) \{y := 1\}$$

$$(wx0) \qquad (r \ge 0 \mid wy1)$$

The precondition of (Wy1) is not a tautology.

# **B.2** Comparison with Weakest Preconditions

We compare traditional transformers to the dependent-case transformers of Def 6.2; thus we consider only totally ordered executions. Because we only consider the dependent case, we drop the superscript E on  $\tau^E$  throughout this section. We also assume that each register appears at most once in a program, as we did throughout §3–5.

Because of augment closure, we are not interested in isolating the *weakest* precondition. Thus we think of transformers as Hoare triples. In addition, all programs in our language are strongly normalizing, so we need not distinguish strong and weak correctness. In this setting, the Hoare triple  $\{\phi\}$  S  $\{\psi\}$  holds exactly when  $\phi \Rightarrow wp_S(\psi)$ .

Hoare triples do not distinguish thread-local variables from shared variables. Thus, the assignment rule applies to all types of storage. The rules can be written as follows:

$$\begin{aligned} wp_{x:=M}(\psi) &= \psi[M/x] \\ wp_{r:=M}(\psi) &= \psi[M/r] \\ wp_{r:=x}(\psi) &= x{=}r \Rightarrow \psi \end{aligned}$$

Here we have chosen an alternative formulation for the read rule, which is equivalent the more traditional  $\psi[x/r]$ , as long as registers occur at most once in a program. In Def 6.2, the transformers for the dependent case are as follows:

$$\tau_{x:=M}(\psi) = \psi[M/x]$$

$$\tau_{r:=M}(\psi) = \psi[M/r]$$

$$\tau_{r:=x}(\psi) = v = r \Rightarrow \psi$$
 where  $\lambda(e) = Rxv$ 

Only the read rule differs from the traditional one.

For programs where every register is bound and every read is fulfilled, our dependent transformers are the same as the traditional ones. In our semantics, thus, we only consider totally-ordered executions where every read could be fulfilled by prepending some writes. For example, we ignore pomsets of x := 2; r := x that read 1 for x.

```
If P \in STORE(x, M, \mu) then (\exists v : E \to V) (\exists \theta : E \to \Phi)
```

- S1) if  $\theta_d \wedge \theta_e$  is satisfiable then d = e,
- S2)  $\lambda(e) = W^{\mu} x v_e$ ,
- S3)  $\kappa(e)$  implies  $\theta_e \wedge M=v_e$ ,
- S4)  $(\forall e \in E \cap D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow (\psi[M/x][\mu/\downarrow^x] \land M=v_e)$ ,
- S5)  $(\forall e \in E \setminus D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow \psi[M/x][\mu/\downarrow^x][ff/Q]$ .
- S6)  $\tau^D(\psi)$  implies  $(\nexists e \in E \mid \theta_e) \Rightarrow \psi[M/x][\mu/\downarrow^x][ff/Q]$ .

If  $P \in LOAD(r, x, \mu)$  then  $(\exists v : E \to V)$   $(\exists \theta : E \to \Phi)$   $(\exists v \in \{\mu, r | x\})$ 

- L1) if  $\theta_d \wedge \theta_e$  is satisfiable then d = e,
- L2)  $\lambda(e) = R^{\nu} x v_e$
- L3)  $\kappa(e)$  implies  $\theta_e$ ,
- L4)  $(\forall e \in E \cap D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow v_e = s_e \Rightarrow \psi[s_e/r]$ ,
- L5)  $(\forall e \in E \setminus D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow \downarrow_{u,v}^x \land (\langle v_e = s_e \lor (W \land x = s_e) \rangle \Rightarrow \psi[s_e/r][ff/Q]),$
- L6)  $(\forall s) \tau^D(\psi)$  implies  $(\nexists e \in E \mid \theta_e) \Rightarrow (\downarrow_{u,v}^x \land \psi[s/r][ff/Q]).$

Fig. 3. Simplified Quiescence Semantics w/o Address Calculation (See Def B.9 for  $\downarrow_{\mu,\nu}^x$ ,  $[\mu/\downarrow^x]$ )

For example, let  $S_i$  be defined:

$$S_1 = s := x; x := s+r$$
  
 $S_2 = x := t; S_1$   
 $S_3 = t := 2; r := 5; S_2$ 

The following pomset appears in the semantics of  $S_2$ . A pomset for  $S_3$  can be derived by substituting [2/t, 5/r]. A pomset for  $S_1$  can be derived by eliminating the initial write.

$$x := t ; s := x ; x := s+r$$

$$\underbrace{(t=2 \mid Wx2)}_{2=s} \rightarrow \underbrace{(Rx2)}_{(s+r)=7} + \underbrace{(s+r)=7 \mid Wx7}_{(s+r)=7}$$

The predicate transformers are:

$$\begin{split} wp_{S_1}(\psi) &= x = s \Rightarrow \psi[s + r/x] \\ wp_{S_2}(\psi) &= t = s \Rightarrow \psi[s + r/x] \\ wp_{S_3}(\psi) &= 2 = s \Rightarrow \psi[s + s/x] \\ \end{split} \qquad \begin{array}{l} \tau_{S_1}(\psi) &= 2 = s \Rightarrow \psi[s + r/x] \\ \tau_{S_3}(\psi) &= 2 = s \Rightarrow \psi[s + s/x] \\ \tau_{S_3}(\psi) &= 2 = s \Rightarrow \psi[s + s/x] \\ \end{array}$$

# B.3 Coherence/Synchronization via Reordering

In §4.2, we encoded coherence and synchronized access using quiescence symbols. Building on the language with fork-join, it is possible to model these using reorderability (§3.3), rather than encoding them in the logic. With synchronization, the relationship becomes asymmetric.

To capture completion, we use a single quiescence symbol: Q.

Update actions to include access modes:  $(W^{\mu}xv)$  and  $(R^{\mu}xv)$ . Reorderability of two sequential actions is determined, in part, by the modes of the two actions, capturing synchronization:

|                 | $2^{ m nd}$      |                 |          |          |     |     |  |  |
|-----------------|------------------|-----------------|----------|----------|-----|-----|--|--|
| 1 <sup>st</sup> | R <sup>rlx</sup> | R <sup>ra</sup> | Rsc      | Wrlx     | Wra | Wsc |  |  |
| Rrlx            | <b>√</b>         | <b>√</b>        | <b>√</b> | <b>✓</b> | Х   | X   |  |  |
| $R^{ra}$        | X                | X               | X        | X        | X   | X   |  |  |
| $R^{sc}$        | X                | X               | X        | X        | X   | X   |  |  |
| Wrl×            | 1                | <b>√</b>        | 1        | 1        | Х   | Х   |  |  |
| $W^{ra}$        | 1                | ✓               | 1        | 1        | X   | X   |  |  |
| $W^{sc}$        | 1                | 1               | X        | 1        | X   | X   |  |  |

It seems that fences generally do not commute, except for road-motel.

|                 | $2^{\mathrm{nd}}$ |                 |          |          |     |     |                  |          |                  |  |
|-----------------|-------------------|-----------------|----------|----------|-----|-----|------------------|----------|------------------|--|
| 1 <sup>st</sup> | R <sup>rlx</sup>  | R <sup>ra</sup> | Rsc      | Wrlx     | Wra | Wsc | F <sup>rel</sup> | Facq     | F <sup>fsc</sup> |  |
| Rrlx            | 1                 | 1               | <b>✓</b> | <b>/</b> | Х   | Х   | Х                | Х        | Х                |  |
| $R^{ra}$        | X                 | X               | X        | X        | X   | X   | X                | X        | X                |  |
| $R^sc$          | X                 | X               | X        | X        | X   | X   | X                | X        | X                |  |
| Wrlx            | 1                 | 1               | <b>✓</b> | 1        | X   | Х   | Х                | <b>✓</b> | Х                |  |
| $W^{ra}$        | 1                 | 1               | 1        | 1        | X   | X   | X                | ✓        | X                |  |
| $W^{sc}$        | 1                 | ✓               | X        | ✓        | X   | X   | X                | ✓        | X                |  |
| Frel            | 1                 | <b>√</b>        | <b>√</b> | X        | Х   | X   | X                | <b>✓</b> | Х                |  |
| $F^{acq}$       | X                 | X               | X        | X        | X   | X   | X                | X        | X                |  |
| $F^fsc$         | X                 | X               | X        | X        | X   | X   | X                | X        | X                |  |

*Definition B.6.* Including coherence, *reorderability* is defined:

$$\begin{split} & \ltimes_{\mathsf{co}} = \{ (\mathsf{W} x, \mathsf{R} y) \mid x \neq y \} \cup \{ (\mathsf{W} x, \mathsf{W} y) \mid x \neq y \} \\ & \cup \{ (\mathsf{R} x, \mathsf{W} y) \mid x \neq y \} \cup \{ (\mathsf{R} x, \mathsf{R} y) \} \\ & \ltimes_{\mathsf{sync}} = \{ (\mathsf{W}^{\mu}, \mathsf{R}^{\nu}) \mid \mu \neq \mathsf{sc} \lor \nu \neq \mathsf{sc} \} \cup \{ (\mathsf{W}^{\mu}, \mathsf{W}^{\mathsf{rlx}} \} \\ & \cup \{ (\mathsf{R}^{\mu}, \mathsf{W}^{\nu}) \mid \mu = \mathsf{rlx} \land \nu = \mathsf{rlx} \} \cup \{ (\mathsf{R}^{\mathsf{rlx}}, \mathsf{R}^{\nu}) \} \\ & \cup \{ (\mathsf{F}^{\mathsf{rel}}, \mathsf{F}^{\mathsf{acq}}) \} \cup \{ (\mathsf{F}^{\mathsf{rel}}, \mathsf{R}^{\nu}) \} \cup \{ (\mathsf{W}^{\mu}, \mathsf{F}^{\mathsf{acq}}) \} \\ & \ltimes = \ltimes_{\mathsf{sync}} \cap \ltimes_{\mathsf{co}} \end{split}$$

Here is the version with generalized modes for read and write:

$$\begin{split} & \ltimes_{\mathsf{sync}} = \{(\mathsf{W}^{\mu},\mathsf{R}^{\nu}) \mid \mu \not\supseteq \mathsf{sc} \vee \nu \not\supseteq \mathsf{sc}\} \cup \{(\mathsf{W}^{\mu},\mathsf{W}^{\mathsf{rlx}} \mid \mu \not\supseteq \mathsf{rel}\} \\ & \cup \{(\mathsf{R}^{\mu},\mathsf{W}^{\nu}) \mid \mu = \mathsf{rlx} \wedge \nu = \mathsf{rlx}\} \cup \{(\mathsf{R}^{\mathsf{rlx}},\mathsf{R}^{\nu}) \mid \nu \not\supseteq \mathsf{acq}\} \\ & \cup \{(\mathsf{F}^{\mathsf{rel}},\mathsf{F}^{\mathsf{acq}})\} \cup \{(\mathsf{F}^{\mathsf{rel}},\mathsf{R}^{\nu})\} \cup \{(\mathsf{W}^{\mu},\mathsf{F}^{\mathsf{acq}})\} \end{split}$$

|                  |                  |          |      |          | $2^{nd}$         |                  |     |                  |     |                  |
|------------------|------------------|----------|------|----------|------------------|------------------|-----|------------------|-----|------------------|
| 1 <sup>st</sup>  | R <sup>rlx</sup> | Rra      | Racq | Rsc      | R <sup>fsc</sup> | W <sup>rl×</sup> | Wra | W <sup>rel</sup> | Wsc | W <sup>fsc</sup> |
| R <sup>rlx</sup> | 1                | <b>✓</b> | X    | <b>✓</b> | Х                | 1                | Х   | Х                | Х   | X                |
| $R^{ra}$         | X                | X        | ×    | X        | X                | X                | X   | X                | X   | X                |
| $R^{acq}$        | X                | X        | ×    | X        | X                | X                | X   | X                | X   | X                |
| Rsc              | X                | X        | X    | X        | X                | X                | X   | X                | X   | X                |
| R <sup>fsc</sup> | X                | X        | X    | X        | X                | X                | X   | X                | X   | X                |
| W <sup>rlx</sup> | 1                | 1        | 1    | ✓        | X                | 1                | X   | X                | X   | X                |
| Wra              | 1                | 1        | 1    | ✓        | X                | 1                | X   | X                | X   | X                |
| W <sup>rel</sup> | 1                | 1        | 1    | ✓        | X                | X                | X   | X                | X   | X                |
| Wsc              | ✓                | ✓        | 1    | X        | X                | <b>/</b>         | X   | X                | X   | X                |
| $W^fsc$          | X                | X        | X    | X        | X                | X                | X   | X                | X   | X                |

*Definition B.7.* If  $P \in SEQ(\mathcal{P}_1, \mathcal{P}_2)$  then  $(\exists P_1 \in \mathcal{P}_1)$   $(\exists P_2 \in \mathcal{P}_2)$ 

- 1-9) as for SEQ in Def 3.24,
- (10) if  $d \in E_1$  and  $e \in E_2$  either  $d \le e$  or  $a \ltimes \lambda_2(e)$ .

Update the read and write rules of Def 3.24 to:

- S2)  $\lambda(e) = W^{\mu}xv$ ,
- S3)  $\kappa(e)$  implies M=v,
- S4)  $\tau^D(\psi)$  implies  $\psi \wedge M=v$ ,
- S5)  $\tau^{C}(\psi)$  implies  $\psi[ff/Q]$ ,
- L2)  $\lambda(e) = R^{\mu} x v$ ,
- L3)  $\kappa(e)$  implies tt.
- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ ,
- L5)  $\tau^{C}(\psi)$  implies  $\psi[ff/Q]$ .

Example B.8. The logic of quiescence is greatly simplified. Compare the following to Ex 4.6.

$$x^{\mu} := M \qquad \qquad r := x^{\mu}$$

$$\psi[\mathsf{ff}/\mathsf{Q}] \qquad \qquad \psi[\mathsf{ff}/\mathsf{Q}]$$

$$\mathbb{Q} \land M = v \mid \mathsf{W}^{\mu} x v \rangle \qquad \qquad \mathbb{R}^{\mu} x v \rangle \qquad \qquad v = r \Rightarrow \psi \, \langle$$

Fig 3 shows the resulting full semantics of read and write, without address calculation. Fig 4 shows the resulting full semantics with address calculation.

*Definition B.9.* Let substitution  $[\mu/\downarrow^x]$  be defined:

$$[\mu/\downarrow^x] = \begin{cases} [\mathsf{tt}/\downarrow^x] & \text{if } \mu = \mathsf{rlx} \\ [\mathsf{ff}/\downarrow^*] & \text{otherwise} \end{cases}$$

Let formula  $\downarrow_{\mu,\nu}^x$  be defined:

$$\downarrow_{\mu,\nu}^{x} = \begin{cases} \mathsf{tt} & \text{if } \mu = \mathsf{rlx} \text{ or } \mu = \nu \\ \downarrow^{x} & \text{otherwise} \end{cases}$$

#### **B.4** Substitutions

Recall the load rules from §6.1:

- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ ,
- L5)  $\tau^{C}(\psi)$  implies  $(v=r \lor x=r) \Rightarrow \psi$ , when  $E \neq \emptyset$ ,
- L6)  $\tau^B(\psi)$  implies  $\psi$ , when  $E = \emptyset$ .

It is also possible to collapse x and r when doing a load:

- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi[r/x]$ ,
- L5)  $\tau^C(\psi)$  implies  $(v=r \lor x=r) \Rightarrow \psi[r/x]$ , when  $E \neq \emptyset$ .
- L6)  $\tau^B(\psi)$  implies  $\psi[r/x]$ , when  $E = \emptyset$ .

Perhaps surprisingly, these two semantics are incomparable. Consider the following:

if 
$$(r \land s \text{ even})\{y := 1\}$$
; if  $(r \land s)\{z := 1\}$ 

$$\underbrace{(r \land s \text{ even} \mid Wy1)}_{(r \land s \mid Wz1)}$$

If  $P \in STORE(L, M, \mu)$  then  $(\exists \ell : E \to V)$   $(\exists v : E \to V)$   $(\exists \theta : E \to \Phi)$ 

- S1) if  $\theta_d \wedge \theta_e$  is satisfiable then d = e,
- S2)  $\lambda(e) = W^{\mu}[\ell_e]v_e$ ,
- S3)  $\kappa(e)$  implies  $\theta_e \wedge L = \ell_e \wedge M = v_e$ ,
- S4)  $(\forall e \in E \cap D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow \psi[M/[\ell_e]][\mu/\downarrow^{[\ell_e]}] \land M=v_e \land L=\ell_e$ ,
- S5)  $(\forall e \in E \setminus D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow \psi[M/[\ell_e]][\mu/\downarrow^{[\ell_e]}][ff/Q],$
- S6)  $(\forall k) \ \tau^D(\psi) \text{ implies } (\not\exists e \in E \mid \theta_e) \Rightarrow (L=k) \Rightarrow \psi[M/[k]][\mu/\downarrow^{[k]}][ff/Q].$

If  $P \in LOAD(r, L, \mu)$  then  $(\exists \ell : E \to \mathcal{V})$   $(\exists v : E \to \mathcal{V})$   $(\exists \theta : E \to \Phi)$   $(\exists v \in \{\mu, r | x\})$ 

- L1) if  $\theta_d \wedge \theta_e$  is satisfiable then d = e,
- L2)  $\lambda(e) = \mathsf{R}^{\nu}[\ell_e]v_e$ ,
- L3)  $\kappa(e)$  implies  $\theta_e \wedge L = \ell_e$ ,
- L4)  $(\forall e \in E \cap D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow (L = \ell_e \Rightarrow v_e = s_e) \Rightarrow \psi[s_e/r]$ ,
- L5)  $(\forall e \in E \setminus D) \ \tau^D(\psi) \text{ implies } \theta_e \Rightarrow \downarrow_{\mu,\nu}^{\lceil \ell_e \rceil} \land (\langle (L = \ell_e \Rightarrow v_e = s_e) \lor (W \land (L = \ell_e \Rightarrow \lceil \ell_e \rceil = s_e))) \Rightarrow \psi[s_e/r][\text{ff}/Q]),$
- L6)  $(\forall k)(\forall s) \tau^D(\psi)$  implies  $(\nexists e \in E \mid \theta_e) \Rightarrow (L=k) \Rightarrow (\downarrow_{uv}^{[k]} \land \psi[s/r][ff/Q]).$

Fig. 4. Simplified Quiescence Semantics with Address Calculation (See Def B.9 for  $\downarrow_{\mu,\nu}^x$ ,  $[\mu/\downarrow^x]$ )

Prepending (s:=x), we get the same result regardless of whether we substitute [s/x], since x does not occur in either precondition. Here we show the independent case:

$$\begin{aligned} s := x \;;\; & \text{if}(r \land s \; \text{even})\{y := 1\};\; & \text{if}(r \land s)\{z := 1\}\\ & \underbrace{(2 = s \lor x = s) \Rightarrow (r \land s \; \text{even}) \mid \forall y \, 1}_{} \\ & \underbrace{(2 = s \lor x = s) \Rightarrow (r \land s) \mid \forall z \, 1}_{} \end{aligned}$$

Prepending (r := x), we now get different results since the preconditions mention x. Without substitution:

$$r:=x$$
; if  $(r \land s \text{ even})\{y:=1\}$ ; if  $(r \land s)\{z:=1\}$   

$$(Rx1) \longrightarrow (1=r \Rightarrow (2=s \lor x=s) \Rightarrow (r \land s \text{ even}) \mid Wy1)$$

$$(Rx2) \longrightarrow (1=r \Rightarrow (2=s \lor x=s) \Rightarrow (r \land s) \mid Wz1)$$

Prepending (x := 0), which substitutes [0/x], the precondition of (Wy1) becomes  $(1=r \Rightarrow (2=s \lor 0=s) \Rightarrow (r \land s \text{ even}))$ , which is a tautology, whereas the precondition of Wz1 becomes  $(1=r \Rightarrow (2=s \lor 0=s) \Rightarrow (r \land s))$ , which is not. In order to be top-level, Wz1 must depend on Rx2; in this case the precondition becomes  $(1=r \Rightarrow 2=s \Rightarrow (r \land s))$ , which is a tautology.

$$(Wx0)$$
  $(Rx1)$   $(Rx2)$   $(Wy1)$   $(Wz1)$ 

The situation reverses with the substitution  $\lceil r/x \rceil$ :

$$r:=x$$
;  $s:=x$ ; if  $(r \land s \text{ even})\{y:=1\}$ ; if  $(r \land s)\{z:=1\}$ 

$$(Rx1) \longrightarrow (1=r \Rightarrow (2=s \lor r=s) \Rightarrow (r \land s \text{ even}) \mid Wy1)$$

$$(Rx2) \longrightarrow (1=r \Rightarrow (2=s \lor r=s) \Rightarrow (r \land s) \mid Wz1)$$

Prepending (x := 0):

$$(Wx0)$$
  $(Rx1)$   $(Rx2)$   $(Wy1)$   $(Wz1)$ 

The dependency has changed from  $(Rx2) \rightarrow (Wz1)$  to  $(Rx2) \rightarrow (Wy1)$ . The resulting sets of pomsets are incomparable.

Thinking in terms of hardware, the difference is whether reads update the cache, thus clobbering preceding writes. With [r/x], reads clobber the cache, whereas without the substitution, they do not. Since most caches work this way, the model with [r/x] is likely preferred for modeling hardware. In a software model, however, we see no reason to prefer one of these over the other.

# C DIFFERENCES WITH OOPSLA

Substitution. [Jagadeesan et al. 2020] uses substitution rather than Skolemizing. Indeed our use of Skolemization is motivated by disjunction closure for predicate transformers, which do not appear in [Jagadeesan et al. 2020]; see §3.5.

In §6.1, we give the semantics of load for nonempty pomsets as:

- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ ,
- L5)  $\tau^C(\psi)$  implies  $(v=r \lor x=r) \Rightarrow \psi$ .

In [Jagadeesan et al. 2020], the definition is roughly as follows:

- L4)  $\tau^D(\psi)$  implies  $\psi[v/r][v/x]$ ,
- L5)  $\tau^{C}(\psi)$  implies  $\psi[v/r][v/x] \wedge \psi[x/r]$ .

These substitutions collapse x and r, allowing local invariant reasoning, as in §6.1. Without Skolemizing it is necessary to substitute [x/r], since the reverse substitution [r/x] is useless when r is bound—compare with §8.4.

Including this substitution affects the interaction of local invariant reasoning and downset closure—see the following discussion of *downset closure*.

Removing the substitution of [x/r] in the independent case has a technical advantage: we no longer require *extended* expressions (which include memory references), since substitutions no longer introduce memory references.

The substitution [x/r] does not work with Skolemization, even for the dependent case, since we lose the unique marker for each read. In effect, this forces the reads to the same values. To be concrete, the candidate definition would modify L4 to be:

L4) 
$$\tau^D(\psi)$$
 implies  $v=x \Rightarrow \psi[x/r]$ .

Using this definition, consider the following:

$$r := x; s := x; if(r < s) \{ y := 1 \}$$

$$(Rx1) \qquad (Rx2) \rightarrow (1 = x \Rightarrow 2 = x \Rightarrow x < x \mid Wy1)$$

Although the execution seems reasonable, the precondition on the write is not a tautology.

Downset closure. [Jagadeesan et al. 2020] enforces downset closure in the prefixing rule. Even without this, downset closure would be different for the two semantics, due to the use of substitution in [Jagadeesan et al. 2020]. Consider the final pomset of Ex B.5, under the semantics of this paper, which elides the middle read event:

$$x := 0; r := x; if(r \ge 0) \{y := 1\}$$

$$(Wx0) \qquad (r \ge 0 \mid Wy1)$$

In [Jagadeesan et al. 2020], the substitution [x/r] is performed by the middle read regardless of whether it is included in the pomset, with the subsequent substitution of [0/x] by the preceding write, we have [x/r][0/x], which is [0/r][0/x], resulting in:

$$(\mathbb{W}x0)$$
  $(0\geq 0 \mid \mathbb{W}y1)$ 

Consistency. [Jagadeesan et al. 2020] imposes consistency, which requires that for every pomset P,  $\bigwedge_e \kappa(e)$  is satisfiable. Associativity requires that we allow pomsets with inconsistent preconditions. Consider a variant of Ex 6.7 from §6.3.

$$\begin{array}{ll} \text{if}(M)\{x:=1\} & \text{if}(!M)\{x:=1\} & \text{if}(M)\{y:=1\} \\ \hline \begin{pmatrix} M \mid \forall x1 \end{pmatrix} & \begin{pmatrix} \neg M \mid \forall x1 \end{pmatrix} & \begin{pmatrix} M \mid \forall y1 \end{pmatrix} & \begin{pmatrix} \neg M \mid \forall y1 \end{pmatrix} \\ \end{array}$$

Associating left and right, we have:

$$\label{eq:formula} \begin{split} \text{if}(M)\{x\coloneqq 1\}; & \text{if}(!M)\{x\coloneqq 1\} \\ & & \text{$(Wx1)$} \end{split}$$

Associating into the middle, instead, we require:

$$\begin{array}{ll} \text{if}(M)\{x:=1\} & \text{if}(!M)\{y:=1\} \\ \hline (M\mid \forall x)1 & \hline (M\mid \forall x)1 & \hline (M\mid \forall y)1 & \hline (M\mid \forall y)1 \\ \end{array}$$

Joining left and right, we have:

$$\begin{split} \text{if}(M)\{x := 1\}; & \text{if}(!M)\{x := 1\}; & \text{if}(M)\{y := 1\}; & \text{if}(!M)\{y := 1\}\\ & \boxed{\mathbb{W}x1} \quad \boxed{\mathbb{W}y1} \end{split}$$

Causal Strengthening. Causal Strengthening [Jagadeesan et al. 2020] imposes causal strengthening, which requires for every pomset P, if  $d \le e$  then  $\kappa(e)$  implies  $\kappa(d)$ . Associativity requires that we allow pomsets without causal strengthening. Consider the following.

$$\begin{aligned} &\text{if}(M)\{r := x\} & & & & & & & & & \\ & & & & & & & \\ & & & & & & \\ & & & & & & \\ & & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & & \\ & & \\ & & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & & \\ & &$$

Associating left, with causal strengthening:

$$if(M)\{r := x\}; y := r$$
 
$$if(!M)\{s := x\}$$
 
$$(M \mid Rx1) \rightarrow (M \mid Wy1)$$
 
$$(\neg M \mid Rx1)$$

Finally, merging:

if(M){
$$r := x$$
};  $y := r$ ; if(!M){ $s := x$ }
$$(Rx1) \rightarrow (M \mid Wy1)$$

Instead, associating right:

$$\begin{array}{ll} \text{if}(M)\{r := x\} & y := r; \text{ if}(!M)\{s := x\} \\ \hline (M \mid \mathsf{R}x1) & r = 1 \mid \mathsf{W}y1) & \neg M \mid \mathsf{R}x1 \\ \end{array}$$

Merging:

$$if(M)\{r := x\}; y := r; if(!M)\{s := x\}$$

$$(Rx1) + (Wy1)$$

With causal strengthening, the precondition of Wy1 depends upon how we associate. This is not an issue in [Jagadeesan et al. 2020], which always associates to the right.

One use of causal strengthening is to ensure that address dependencies do not introduce thin air reads. Associating to the right, the intermediate state of Ex A.3 is:

$$s := [r]; x := s$$

$$(r=2 \mid R[2]1) \longrightarrow (r=2 \Rightarrow 1=s) \Rightarrow s=1 \mid Wx1)$$

In [Jagadeesan et al. 2020], we have, instead:

$$s := [r]; x := s$$

$$(r=2 \mid R[2]1) \longrightarrow (r=2 \land [2]=1 \mid Wx1)$$

Without causal strengthening, the precondition of (Wx1) would be simply [2]=1. The treatment in this paper, using implication rather than conjunction, is more precise.

Parallel Composition. In [Jagadeesan et al. 2020, §2.4], parallel composition is defined allowing coalescing of events. Here we have forbidden coalescing. This difference appears to be arbitrary. In [Jagadeesan et al. 2020], however, there is a mistake in the handling of termination actions. The predicates should be joined using  $\land$ , not  $\lor$ .

*Read-Modify-Write Actions.* In [Jagadeesan et al. 2020], the atomicity axioms ??/?? erroneously applies only to overlapping writes, not overlapping reads. The difficulty can be seen in Ex A.6.

[Jagadeesan et al. 2020] does not specify the calculation of dependency for RMWs, as discussed in Ex A.8.

Downgrading Internal Acquiring Reads. Shortly after publication, Podkopaev [2020] noticed a short-coming of the implementation on ARM8 in [Jagadeesan et al. 2020, §7]. The proof given there assumes that all internal reads can be dropped. However, this is not the case for acquiring reds. For example, [Jagadeesan et al. 2020] disallows the following execution, which is allowed by ARM8 and Tso.

$$x := 2; r := x^{ra}; s := y \parallel y := 2; x^{ra} := 1$$
 $x := 2; r := x^{ra}; s := y \parallel y := 2; x^{ra} := 1$ 
 $x := 2; r := x^{ra}; s := y \parallel y := 2; x^{ra} := 1$ 

The solution we have adopted is to allow an acquiring read to be downgraded to a relaxed read when it is preceded (sequentially) by a relaxed write that could fulfill it. This solution allows executions that are not allowed under ARM8 since we do not insist that the local relaxed write is actually read from. This may seem counterintuitive, but we don't see a local way to be more precise.

As a result, we use a different proof strategy for ARM8 implementation, which does not rely on read elimination. The proof idea uses a recent alternative characterization of ARM8 [Alglave 2020; Arm Limited 2020].

Redundant Read Elimination. Contrary to the claim, redundant read elimination fails for [Jagadeesan et al. 2020]. We discussed redundant read elimination in §6.2. Consider JMM Causality Test Case 2, which we discussed there.

$$r := x$$
;  $s := x$ ; if  $(r = s)\{y := 1\} \parallel x := y$ 

$$(Rx1) \qquad (Ry1) \qquad (Ry1) \qquad (Wx1)$$

Under the semantics of [Jagadeesan et al. 2020], we have

$$r:=x\;;\;s:=x\;;\;\mathrm{if}(r=s)\{y:=1\}$$
 
$$\boxed{\mathsf{R}x1}\quad \boxed{\mathsf{1}=1\land 1=x\land x=1\land x=x\mid \mathsf{W}y1}$$

The precondition of (Wy1) is *not* a tautology, and therefore redundant read elimination fails. (It is a tautology in r := x; s := r; if  $(r = s) \{ y := 1 \}$ .) In [Jagadeesan et al. 2020, §3.1], we incorrectly stated that the precondition of (Wy1) was  $1 = 1 \land x = x$ .

#### **D** MORE STUFF

#### D.1 A Note on Mixed-Mode Data Races

In preparing this paper, we came across the following example, which appears to invalidate Theorem 4.1 of [Dongol et al. 2019].

$$x := 1; y^{ra} := 1; r := x^{ra} \parallel if(y^{ra}) \{x^{ra} := 2\}$$

$$\boxed{Wx1} \qquad \boxed{Ry1} \qquad \boxed{Wx2} \qquad (*)$$

$$\boxed{Wx1} \qquad \boxed{Ry1} \qquad \boxed{Ry1} \qquad \boxed{Wx2} \qquad (†)$$

The program is data-race free. The two executions shown are the only top-level executions that include (Wx2).

Theorem 4.1 of [Dongol et al. 2019] is stated by extending execution sequences. In the terminology of [Dongol et al. 2019], a read is L-weak if it is sequentially stale. Let  $\rho = (Wx1)(Wy1)(Ry1)(Wx2)$  be a sequence and  $\alpha = (Rx1)$ .  $\rho$  is L-sequential and  $\alpha$  is L-weak in  $\rho\alpha$ . But there is no execution of this program that includes a data race, contradicting the theorem. The error seems to be in Lemma A.4 of [Dongol et al. 2019], which states that if  $\alpha$  is L-weak after an L-sequential  $\rho$ , then  $\alpha$  must be in a data race. That is clearly false here, since (Rx1) is stale, but the program is data race free.

In proving the SC-LDRF result in [Jagadeesan et al. 2020, §8], we noted that our proof technique is more robust than that of [Dongol et al. 2019], because it limits the prefixes that must be considered. In (\*), the induction hypothesis requires that we add (Rx1) before (Rx1) -  $\Rightarrow$  (Rx1) -  $\Rightarrow$  (Rx1). In particular,

$$(Wx1) \rightarrow (Wy1)$$
  $(Ry1) \rightarrow (Wx2)$ 

is not a downset of (\*), because  $(Rx1) \rightarrow (Wx2)$ . As we noted in [Jagadeesan et al. 2020, §8], this affects the inductive order in which we move across pomsets, but does not affect the set of pomsets that are considered. In particular,

$$(Wx1) \rightarrow (Wy1)$$

is a downset of (\*).

# D.2 Downgraded Reads

We allow downgrades in executions that are not be allowed by ARM8.

ARM8 disallows this because the acquiring read is fulfilled by an external write.

$$x := z; r := x^{ra}; y := 1 \parallel z := y$$

$$\boxed{Rz1} \qquad \boxed{Wy1} \qquad \boxed{Wy1} \qquad \boxed{Wz1}$$

ARM8 disallows this because data and control dependencies change acquiring read is fulfilled by an external write.

# D.3 If Closure and Address Dependencies

An optimization (p/q are registers):

$$r := [p]; s := [q]$$

vs

$$r := [p]; if(p=q)\{s := r\} else\{s := [q]\}$$

$$r := \text{new}; [r] := 42; s := [r]; x := r \parallel r := x; [r] := 7$$

If closure is at odds with Java Final field semantics.

Do sequencing and if commute?

# D.4 About ARM

Hypothesis: gcb cannot contradict (poloc minus RxR).

# D.5 Using Independency for Coherence

It is also possible to use independency only to capture coherence, but the results are less interesting. In the logic, we remove the symbols  $Q^x_{wo}$  and  $Q^x_{ro}$ . Previously, we had given the semantics of ra access using  $Q^*_{wo}$  and  $Q^*_{ro}$ , which were encoded using  $Q^x_{wo}$  and  $Q^x_{ro}$ . With these gone, we introduce the quiescence symbol  $Q^*$  and  $Q_{acq}$ . Thus, the only quiescence symbols required are  $Q^*$ ,  $Q_{acq}$  and  $Q_{sc}$ . Fig 5 shows the difference with the semantics of §4.2.

*Definition D.1.* Let formulae  $Q_{\mu}^{S}$  and  $Q_{\mu}^{L}$  be defined:

$$\begin{aligned} Q_{r|x}^S &= Q_{acq} & Q_{r|x}^L &= Q_{acq} \\ Q_{ra}^S &= Q_{acq} \wedge Q^* & Q_{ra}^L &= Q_{acq} \\ Q_{sc}^S &= Q_{acq} \wedge Q^* \wedge Q_{sc} & Q_{sc}^L &= Q_{acq} \wedge Q_{sc} \end{aligned}$$

Let substitutions  $[\phi/Q_{ij}^S]$  and  $[\phi/Q_{ij}^L]$  be defined:

$$\begin{aligned} [\phi/Q_{\text{rlx}}^{\text{S}}] &= [\phi/Q^*] & [\phi/Q_{\text{rlx}}^{\text{L}}] &= [\phi/Q^*] \\ [\phi/Q_{\text{ra}}^{\text{S}}] &= [\phi/Q^*] & [\phi/Q_{\text{ra}}^{\text{L}}] &= [\phi/Q^*, \phi/Q_{\text{acq}}] \\ [\phi/Q_{\text{sc}}^{\text{S}}] &= [\phi/Q^*, \phi/Q_{\text{sc}}] & [\phi/Q_{\text{sc}}^{\text{L}}] &= [\phi/Q^*, \phi/Q_{\text{acq}}, \phi/Q_{\text{sc}}] \end{aligned}$$

Definition D.2. Update Def 3.24 to:

- S3)  $\kappa(e)$  implies  $Q_{\mu}^{S} \wedge M=v$ ,
- L3)  $\kappa(e)$  implies  $Q_{\mu}^{L}$ ,
- S4)  $\tau^D(\psi)$  implies  $\psi \wedge M=v$ ,
- S5)  $\tau^C(\psi)$  implies  $\psi[ff/Q_{\mu}^S]$ ,
- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ ,
- L5)  $\tau^C(\psi)$  implies  $\psi[ff/Q_{\mu}^L]$ .

The most interesting examples in Fig 5b concern ra access. Every independent transformer substitutes [ff/Q\*]. Q\* is a precondition for any releasing write e, ensuring that all preceding events must are ordered before e. Conversely,  $Q_{acq}$  is a precondition of every event. The independent transformer for any acquiring read e substitutes [ff/ $Q_{acq}$ ], ensuring that all following events must be ordered after e.

Item 10 of Def B.7 ensures coherence. This definition is incompatible with asynchronous fork parallelism of Def ??, where we expect executions such as:

fork 
$$\{r := x\}$$
;  $x := 1$ 

$$\boxed{\mathbb{R}x1} \blacktriangleleft \boxed{\mathbb{W}x1}$$



Fig. 5. Quiescence Examples for Coherence

Item 10 would require  $(Rx1) \rightarrow (Wx1)$ , forbidding this.