# A DISCUSSION

## A.1 Comparison to "Promising Semantics" [POPL 2017]

Recently, Cho et al. [2021] showed that certain combinations of compiler optimizations are inconsistent with local DRF guarantees. All of the examples that prove inconsistency have the same shape: they combine read-introduction and case analysis (aka, if-introduction). Effectively, this turns one read into two, where different conditional branches can be taken for the two copies of the read. This is reminiscent of the type of *bait and switch* behavior noted by Jagadeesan et al. [2020]: the promising semantics (PS) [Kang et al. 2017] and related models [Chakraborty and Vafeiadis 2019; Jagadeesan et al. 2010; Manson et al. 2005] fail to validate compositional reasoning about temporal properties. Consider example OOTA4 from [Jagadeesan et al. 2020]:

$$y := x \parallel r := y; \text{ if } (b)\{x := r; z := r\} \text{ else } \{x := 1\} \parallel b := 1$$

$$(Rx1) \qquad (Ry1) \qquad (Wz1) \qquad (Wb1)$$

Under all variants of PwT, this outcome is disallowed, due to the cycle involving x and y.<sup>1</sup> Under PS, this outcome is allowed by baiting with the else branch, then switching to the then branch, based on a coin flip (b).

Cho et al. [2021] introduce more complex examples to show that the promising semantics fails LDRF-SC.<sup>2</sup> Here is one, dubbed LDRF-FAIL-PS.

$$if(x) \{FADD(w, 1); y := 1; z := 1\} \parallel if(z) \{if(!FADD(w, 1)) \{x := y\}\} else \{x := 1\}$$

$$(Rw1) \qquad (Ww2) \qquad (Wy1) \qquad (Rz1) \qquad (Rw0) \qquad (Ry1) \qquad (Wx1)$$

Again, all variants of PwT disallow the outcome due to the cycle involving x and y. It is allowed by PS by baiting the second thread with x := 1 in the else branch, then switching to the then branch. This shows some structural resemblance to OOTA4, with z replacing b.

Cho et al. argue that the outcome of LDRF-FAIL-PS is inevitable due to compiler optimizations. The examples crucially involve the following sequence of operations:

- read-introduction,
- if-introduction, branching on the read just introduced.

We believe this combination of optimizations is unsound. This is obviously the case in C11: read-introduction may cause undefined behavior (UB), due to the possible introduction of a data race.

The situation is more delicate in LLVM. The short version of the story is that load-hoisting followed by case analysis is unsound in LLVM, without freeze. This happens because:

- read-introduction may result in the undefined value undef, due to the possible introduction
  of a data race [Chakraborty and Vafeiadis 2017], and
- branching on an undefined value in LLVM results in UB.

LLVM delays UB using the undefined value. This allows LLVM to perform optimizations such as load hoisting, where $\text{if}(C)\{r:=x\}$ is rewritten to $s:=x;\ r:=C?s:r$. Despite this, other optimizations regularly performed by LLVM are unsound [Lee et al. 2017]. An example is loop switching, where $\text{while}(C_1)\{\text{if}(C_2)\{S_1\} \text{ else } \{S_2\}\}$ is rewritten to $\text{if}(C_2)\{\text{while}(C_1)\{S_1\}\} \text{ else } \{\text{while}(C_1)\{S_2\}\}$. Freeze was introduced in LLVM in order to make such optimizations sound by allowing a branch on frozen **undef** to give nondeterministic choice rather than UB. In the RFC for freeze, Lopes [2016] says: "Note that having branch on poison not trigger UB has its own problems. We believe this is a good tradeoff." LDRF-FAIL-PS demonstrates a concrete problem with this tradeoff. Other compilers, such as CompCert, are more conservative [Lee et al. 2017, §9].

<sup>1</sup>All of the reads in OOTA4 are cross-thread, so there is no difference between PwT-MCA$_1$ and PwT-MCA$_2$. For PwT-C11, there is a cycle in $\mathsf{rf} \cup {<}$.

<sup>2</sup>Cho et al. [2021] show that by restricting RMW-store reorderings, one can establish LDRF-SC for PS. We speculate that no such restriction is required for PwT. (We did not treat RMWs in our proof of LDRF-SC.)

Thus, the difference between PS and PwT can be understood in terms of the valid program transformations. PS allows reads to be introduced, with subsequent case analysis on the value read. PwT validates case analysis, but invalidates read-introduction.

Allowing executions such as OOTA4 and LDRF-FAIL-PS also invalidates compositional reasoning for temporal safety properties (see §5).

These differences highlight the subtle tensions between compiler optimizations and program logics that are revealed by relaxed memory models. It is not possible to have everything one wants. Thus, one is forced to choose which optimizations and reasoning principles are most important.<sup>3</sup>

Finally, we note that it is possible that PS is properly weaker than PwT.

## A.2 Comparison to "Pomsets with Preconditions" [OOPSLA 2020]

PwT-MCA is closely related to the PwP model of Jagadeesan et al. [2020]. The major difference is that PwT-MCA supports sequential composition. In the remainder of this section, we discuss other differences. We also point out some errors in [Jagadeesan et al. 2020], all of which have been confirmed by the authors.

*Substitution*. PwP uses substitution rather than Skolemizing. Indeed our use of Skolemization is motivated by disjunction closure for predicate transformers, which do not appear in PwP. In Fig. 1, we gave the semantics of read for nonempty pomsets as:

```
(R4a) if (E \cap D) \neq \emptyset then \tau^D(\psi) \equiv v = r \Rightarrow \psi,
(R4b) if (E \cap D) = \emptyset then \tau^D(\psi) \equiv (v = r \lor x = r) \Rightarrow \psi.
```

In PwP, the definition is roughly as follows:

```
(R4a') if (E \cap D) \neq \emptyset then \tau^D(\psi) \equiv \psi[v/r][v/x],

(R4b') if (E \cap D) = \emptyset then \tau^D(\psi) \equiv \psi[v/r][v/x] \wedge \psi[x/r]
```

The use of conjunction in R4b' causes disjunction closure to fail because the predicate transformer $\tau(\psi) = \psi' \wedge \psi''$ does not distribute through disjunction, even assuming that the prime operations do:<sup>4</sup> $\tau(\psi_1 \vee \psi_2) = (\psi_1' \vee \psi_2') \wedge (\psi_1'' \vee \psi_2'') \neq (\psi_1' \wedge \psi_1'') \vee (\psi_2' \wedge \psi_2'') = \tau(\psi_1) \vee \tau(\psi_2)$. See also §3.9.
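The failure of distribution can be checked by brute force. The following is a minimal sketch, not part of the formal development: it models the truth values of $\psi_1'$, $\psi_1''$, $\psi_2'$, $\psi_2''$ as four independent booleans (an assumption made only for illustration) and enumerates the valuations on which the two sides differ. The function names are hypothetical.

```python
from itertools import product

def tau_of_disjunction(p1, p1_dd, p2, p2_dd):
    # tau(psi1 v psi2), assuming the prime operations distribute over disjunction:
    # (psi1' v psi2') ^ (psi1'' v psi2'')
    return (p1 or p2) and (p1_dd or p2_dd)

def disjunction_of_taus(p1, p1_dd, p2, p2_dd):
    # tau(psi1) v tau(psi2) = (psi1' ^ psi1'') v (psi2' ^ psi2'')
    return (p1 and p1_dd) or (p2 and p2_dd)

# Valuations (psi1', psi1'', psi2', psi2'') on which the two sides disagree.
counterexamples = [
    v for v in product([False, True], repeat=4)
    if tau_of_disjunction(*v) != disjunction_of_taus(*v)
]

for v in counterexamples:
    print(v)
```

Each counterexample makes one primed component true for $\psi_1$ and the other true for $\psi_2$, so the conjunction of disjunctions holds while neither $\tau(\psi_1)$ nor $\tau(\psi_2)$ does.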

The substitutions collapse x and r, allowing local invariant reasoning (LIR), as required by JMM causality test case 1, discussed in §3.8. Without Skolemizing it is necessary to substitute [x/r], since the reverse substitution [r/x] is useless when r is bound—compare with §A.7. As discussed below (Downset closure), including this substitution affects the interaction of LIR and downset closure.

Removing the substitution of [x/r] in the independent case has a technical advantage: we no longer require *extended* expressions (which include memory references), since substitutions no longer introduce memory references.

<sup>3</sup>Another example is the tension between load hoisting (forbidden in C11 but allowed by LLVM) and common subexpression elimination over an acquiring lock (allowed by C11 but forbidden by LLVM) [Chakraborty and Vafeiadis 2017].

<sup>4</sup>$(\psi_{1}\vee\psi_{2})'=(\psi_{1}'\vee\psi_{2}')$ and $(\psi_{1}\vee\psi_{2})''=(\psi_{1}''\vee\psi_{2}'')$.

The substitution [x/r] does not work with Skolemization, even for the dependent case, since we lose the unique marker for each read. In effect, this forces all reads of a location to see the same values. Using this definition, consider the following:

$$r := x;\ s := x;\ \text{if}(r < s)\{y := 1\}$$

$$(Rx1) \qquad (Rx2) \rightarrow (1 = x \Rightarrow 2 = x \Rightarrow x < x \mid Wy1)$$

Although the execution seems reasonable, the precondition on the write is not a tautology.

*Downset closure*. PwP enforces downset closure in the prefixing rule. Even without this, downset closure would be different for the two semantics, due to the use of substitution in PwP. Consider the final pomset in the last example of §A.8 under the semantics of this paper, which elides the middle read event:

$$x := 0; r := x; if(r \ge 0) \{y := 1\}$$

$$(Wx0) \qquad (r \ge 0 \mid Wy1)$$

In PwP, the substitution [x/r] is performed by the middle read regardless of whether it is included in the pomset. With the subsequent substitution of [0/x] by the preceding write, we have [x/r][0/x], which is [0/r][0/x], resulting in:

$$(\mathsf{W}x0) \qquad (0 \ge 0 \mid \mathsf{W}y1)$$
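The composition of the two substitutions can be traced mechanically. The sketch below is only an illustration: it represents the precondition $r \ge 0$ as a list of tokens (a hypothetical encoding, not the representation used in either paper) and applies $[x/r]$ followed by $[0/x]$.

```python
def subst(tokens, new, old):
    """Apply the substitution [new/old]: replace every occurrence of `old` by `new`."""
    return [new if t == old else t for t in tokens]

# Precondition of (Wy1) before any read or write is prepended: r >= 0
psi = ["r", ">=", "0"]

# The middle read performs [x/r]; the preceding write then performs [0/x].
step1 = subst(psi, "x", "r")    # x >= 0
step2 = subst(step1, "0", "x")  # 0 >= 0

# The same result as the simultaneous substitution [0/r][0/x].
simultaneous = [{"r": "0", "x": "0"}.get(t, t) for t in psi]

print(" ".join(step1))
print(" ".join(step2))
```

After both steps the register has disappeared from the precondition, which is why PwP obtains $0 \ge 0$ even when the read event itself is elided.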

Consistency. PwP imposes consistency, which requires that for every pomset P,  $\bigwedge_e \kappa(e)$  is satisfiable. Associativity requires that we allow pomsets with inconsistent preconditions. Consider a variant of the example from §8.3.

Associating left and right, we have:

Associating into the middle, instead, we require:

Joining left and right, we have:

$$\text{if}(M)\{x := 1\};\ \text{if}(!M)\{x := 1\};\ \text{if}(M)\{y := 1\};\ \text{if}(!M)\{y := 1\}$$

$$(\mathsf{W}x1) \qquad (\mathsf{W}y1)$$

Causal Strengthening. PwP imposes causal strengthening, which requires for every pomset P, if d < e then  $\kappa(e) \models \kappa(d)$ . Associativity requires that we allow pomsets without causal strengthening. Consider the following.

$$\text{if}(M)\{r:=x\} \qquad y:=r \qquad \text{if}(!M)\{s:=x\}$$

$$(M \mid Rx1) \qquad (r=1 \mid Wy1) \qquad (\neg M \mid Rx1)$$

Associating left, with causal strengthening:

$$if(M)\{r := x\}; y := r \qquad if(!M)\{s := x\}$$

$$(M \mid Rx1) \rightarrow (M \mid Wy1) \qquad (\neg M \mid Rx1)$$


Finally, merging:

$$\text{if}(M)\{r := x\};\ y := r;\ \text{if}(!M)\{s := x\}$$

$$(Rx1) \rightarrow (M \mid Wy1)$$

Instead, associating right:

$$\text{if}(M)\{r:=x\} \qquad y:=r;\ \text{if}(!M)\{s:=x\}$$

$$(M \mid \mathsf{R}x1) \qquad (r=1 \mid \mathsf{W}y1) \qquad (\lnot M \mid \mathsf{R}x1)$$

Merging:

$$if(M)\{r:=x\}; y:=r; if(!M)\{s:=x\}$$

$$(Rx1) \rightarrow (Wy1)$$

With causal strengthening, the precondition of Wy1 depends upon how we associate. This is not an issue in PwP, which always associates to the right.

One use of causal strengthening is to ensure that address dependencies do not introduce thin air reads. Associating to the right, the intermediate state of ADDR2 (§8.4) is:

$$s := [r] ; x := s$$

$$(r=2 \mid R[2]1) \longrightarrow ((r=2 \Rightarrow 1=s) \Rightarrow s=1 \mid Wx1)$$

In PwP, we have, instead:

$$s := [r];\ x := s$$

$$(r=2 \mid \mathsf{R}[2]1) \longrightarrow (r=2 \land [2]=1 \mid \mathsf{W}x1)$$

Without causal strengthening, the precondition of (Wx1) would be simply [2]=1. The treatment in this paper, using implication rather than conjunction, is more precise.

*Internal Acquiring Reads.* The proof of compilation to Arm in PwP assumes that all internal reads can be eliminated. However, this is not the case for acquiring reads. For example, PwP disallows the following execution, where the final value of x is 2 and the final value of y is 2. This execution is allowed by Arm8 and TSO.

$$x := 2;\ r := x^{acq};\ s := y \parallel y := 2;\ x^{rel} := 1$$

$$(Wx2) \longrightarrow (Ry0) \longrightarrow (Wy2) \longrightarrow (W^{rel}x1)$$

We discuss two approaches to this problem in §B.

*Redundant Read Elimination.* Contrary to the claim, redundant read elimination fails for PwP. We discuss redundant read elimination in §8.1. Consider JMM Causality Test Case 2, which we describe there.

$$r := x;\ s := x;\ \text{if}(r = s)\{y := 1\} \parallel x := y$$

$$(Rx1) \qquad (Wy1) \qquad (Ry1) \qquad (Wx1)$$

Under the semantics of PwP, we have

$$r := x; s := x; if(r=s)\{y := 1\}$$

$$(Rx1) \quad (Rx1) \quad (1=1 \land 1=x \land x=1 \land x=x \mid Wy1)$$

Proc. ACM Program. Lang., Vol. 0, No. POPL, Article 0. Publication date: January 2022.

The precondition of (Wy1) is *not* a tautology, and therefore redundant read elimination fails. (It is a tautology in r:=x; s:=r; if  $(r=s)\{y:=1\}$.) PwP (§3.1) incorrectly stated that the precondition of (Wy1) was  $1=1 \land x=x$.
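The difference between the two formulas is easy to confirm by enumeration. The following is a minimal sketch, assuming a small finite range of values for $x$ (the range and the helper name `is_tautology` are illustrative choices, not part of either paper).

```python
# Assumption: a small finite domain is enough to exhibit a falsifying value of x.
DOMAIN = range(-2, 3)

def is_tautology(pred, domain=DOMAIN):
    """True if `pred` holds for every value of x in the (finite) domain."""
    return all(pred(x) for x in domain)

# Precondition of (Wy1) computed under the PwP semantics, as displayed above.
pwp_actual = lambda x: (1 == 1) and (1 == x) and (x == 1) and (x == x)

# Precondition stated in PwP (section 3.1).
pwp_stated = lambda x: (1 == 1) and (x == x)

print("actual is tautology:", is_tautology(pwp_actual))
print("stated is tautology:", is_tautology(pwp_stated))
```

Any value of $x$ other than 1 falsifies the conjunct $1 = x$, so the first formula cannot be discharged without ordering the write after a read.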


*Termination Conditions and Parallel Composition.* In PwP (§2.4), parallel composition is defined allowing coalescing of events. Here we have forbidden coalescing. This difference appears to be arbitrary. In PwP, however, there is a mistake in the handling of termination actions. The predicates should be joined using  $\land$, not  $\lor$. Here we have used termination conditions rather than termination actions so that termination is handled separately.


*Read-Modify-Write Actions.* In PwP, the atomicity axiom M10c erroneously applies only to overlapping writes, not overlapping reads. The difficulty can be seen in Example D.2.


In addition, PwP uses READ instead of READ' when calculating dependencies for RMWs. For a discussion, see the example at the end of §8.2.


*Data Race Freedom.* The definition of data race is wrong in PwP. It should require that at least one action is relaxed.


Note that the definition of L-stable applies in the case that conflicting writes are totally ordered. This gives a result more in the spirit of [Dolan et al. 2018]. In particular, this special case of the theorem clarifies the discussion of the PAST example in PwP.


*Augmentation of Preconditions.* PwP allows arbitrary augmentation of preconditions. Here we are more conservative, only allowing augmentation of preconditions in the semantics of primitive actions, as in §8.3. As discussed in §A.9, allowing arbitrary augmentation causes associativity to fail when encoding delay logically.

## A.3 Register Consistency

[Todo: Explain why we cannot require that either  $\kappa(e)$  or  $\checkmark$  are  $\lambda$ -consistent.]

In addition to the three criteria of Def. 3.2, Dijkstra [1975] requires:

```
(x4') \tau(ff) \equiv ff.
```

Unfortunately, our transformer for read actions (R4a) does not obey x4', since ff is not equivalent to  $v=r \Rightarrow$  ff.

In this subsection, we refine this requirement to one that does hold. The main insight is to pull values for registers from the actions of the pomset itself. Thus, we define  $\theta_{\lambda}$  to capture the *register state* of a pomset.

*Definition A.1.* Let $\theta_{\lambda} = \bigwedge_{\{(e,v) \in (E \times \mathcal{V}) \mid \lambda(e) = (\mathsf{R}v)\}} (s_e = v)$, where $E = \text{dom}(\lambda)$. We say that $\phi$ is $\lambda$-consistent if $\phi \wedge \theta_{\lambda}$ is satisfiable. We say that it is $\lambda$-inconsistent otherwise.

Using this, we define the constraint on predicate transformers that we want. We also need to update the definition of predicate transformer families to carry the labeling.

*Definition A.2.* A  $\lambda$ -predicate transformer is a function  $\tau: \Phi \to \Phi$  such that

```
(x1) (x2) (x3) as in Def. 3.2,
```

(x4) if  $\psi$  is  $\lambda$ -inconsistent then  $\tau(\psi)$  is  $\lambda$ -inconsistent.

A family of  $\lambda$ -predicate transformers over consists of a  $\lambda$ -predicate transformer  $\tau^D$  for each  $D \subseteq \mathcal{E}$ , such that if  $C \cap E \subseteq D$  then  $\tau^C(\psi) \models \tau^D(\psi)$ .

```
(M4) \tau: 2^{\mathcal{E}} \to \Phi \to \Phi is a family of \lambda-predicate transformers,
```
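To see why the refined requirement (x4) is satisfiable where (x4') is not, consider the following minimal sketch. It assumes a pomset with a single read event whose value is $v = 1$, identifies the Skolem register $s_e$ with the register $r$ of R4a, and ranges $r$ over a small finite domain. These choices, and all function names, are illustrative assumptions rather than part of the formal development.

```python
V = 1                  # value read by the single read event (assumed)
DOMAIN = range(0, 4)   # assumed finite range for the register r

def theta(r):
    # Register state theta_lambda for a pomset with one read of value V.
    return r == V

def tau(psi):
    # Dependent read transformer R4a: tau(psi) = (v = r) => psi
    return lambda r: (not (V == r)) or psi(r)

def satisfiable(phi):
    return any(phi(r) for r in DOMAIN)

def lam_consistent(phi):
    # phi is lambda-consistent if phi ^ theta_lambda is satisfiable.
    return satisfiable(lambda r: phi(r) and theta(r))

ff = lambda r: False

# (x4') fails: tau(ff) is satisfiable (take any r != V), so it is not equivalent to ff.
print("tau(ff) satisfiable:", satisfiable(tau(ff)))

# (x4) holds in this instance: ff is lambda-inconsistent, and so is tau(ff).
print("ff lambda-consistent:", lam_consistent(ff))
print("tau(ff) lambda-consistent:", lam_consistent(tau(ff)))
```

The register state rules out exactly the valuations that made $v = r \Rightarrow \text{ff}$ satisfiable.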


## A.4 Comparison with Sequential Predicate Transformers

We compare traditional transformers to the dependent-case transformers of Fig. 1.

All programs in our language are strongly normalizing, so we need not distinguish strong and weak correctness. In this setting, the Hoare triple  $\{\phi\}$  S  $\{\psi\}$  holds exactly when  $\phi \Rightarrow wp_S(\psi)$ .

Hoare triples do not distinguish thread-local variables from shared variables. Thus, the assignment rule applies to all types of storage. The rules can be written as on the left below:

$$\begin{aligned} wp_{x:=M}(\psi) &= \psi [M/x] & \tau_{x:=M}(\psi) &= \psi [M/x] \\ wp_{r:=M}(\psi) &= \psi [M/r] & \tau_{r:=M}(\psi) &= \psi [M/r] \\ wp_{r:=x}(\psi) &= x = r \Rightarrow \psi & \tau_{r:=x}(\psi) &= v = r \Rightarrow \psi & \text{where } \lambda(e) &= \mathsf{R} x v \end{aligned}$$

Here we have chosen an alternative formulation for the read rule, which is equivalent to the more traditional  $\psi[x/r]$ , as long as registers are assigned at most once in a program. Our predicate transformers for the dependent case are shown on the right above. Only the read rule differs from the traditional one.
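The equivalence of the two read rules is an instance of the one-point rule. The sketch below checks it on a few sample postconditions, under the assumption that a register assigned at most once can be modeled by treating $r$ as universally quantified over a finite domain. The domain, the sample predicates, and the function names are hypothetical.

```python
DOMAIN = range(-3, 4)  # assumed finite range for x and r

def trad_wp(psi):
    # Traditional rule: wp_{r:=x}(psi) = psi[x/r]
    return lambda x: psi(x, x)

def alt_wp(psi):
    # Alternative rule: x = r => psi, with r fresh (universally quantified here).
    return lambda x: all((x != r) or psi(x, r) for r in DOMAIN)

# Sample postconditions over the location value x and the register r.
samples = [
    lambda x, r: r >= 0,
    lambda x, r: x + r == 2,
    lambda x, r: r != x,
]

agree = all(trad_wp(p)(x) == alt_wp(p)(x) for p in samples for x in DOMAIN)
print("rules agree on all samples:", agree)
```

The only valuation of $r$ that does not satisfy the implication vacuously is $r = x$, which is exactly the valuation the substitution selects.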

For programs where every register is bound and every read is fulfilled, our dependent transformers are the same as the traditional ones. Thus, when comparing to weakest preconditions, let us only consider totally-ordered executions of our semantics where every read could be fulfilled by prepending some writes. For example, we ignore pomsets of x := 2; x := x that read 1 for x.

For example, let  $S_i$  be defined:

$$S_1 = s := x;\ x := s + r \qquad S_2 = x := t;\ S_1 \qquad S_3 = t := 2;\ r := 5;\ S_2$$

The following pomset appears in the semantics of  $S_2$ . A pomset for  $S_3$  can be derived by substituting [2/t, 5/r]. A pomset for  $S_1$  can be derived by eliminating the initial write.

$$x := t \; ; \; s := x \; ; \; x := s + r$$

$$(t = 2 \mid \mathsf{W}x2) \longrightarrow (\mathsf{R}x2) \longrightarrow (2 = s \Rightarrow (s + r) = 7 \mid \mathsf{W}x7) \cdots\triangleright\ 2 = s \Rightarrow \psi[s + r/x]$$

The predicate transformers are:

$$\begin{aligned} wp_{S_1}(\psi) &= x = s \Rightarrow \psi[s + r/x] \\ wp_{S_2}(\psi) &= t = s \Rightarrow \psi[s + r/x] \\ wp_{S_3}(\psi) &= 2 = s \Rightarrow \psi[s + r/x] \\ \tau_{S_2}(\psi) &= 2 = s \Rightarrow \psi[s + r/x] \\ \tau_{S_3}(\psi) &= 2 = s \Rightarrow \psi[s + r/x] \end{aligned}$$

## A.5 The Need for Respect

In Fig. 1, we choose the weakest precondition. Because of this, associativity requires that s6 is ($<$ respects $<_1$ and $<_2$) rather than ($< \supseteq (<_1 \cup <_2)$). Consider (r:=x; y:=M; skip). Associating to the left, we might have:

When building  $P_{12}$ , the dependent set of e would be the empty set, and thus  $\phi$  must have been constructed using the independent transformer R4b. Attempting to repeat this, associating to the right:

$$P_{1} = \left( \operatorname{Rx} \right)^{d} \qquad \qquad P_{23} = \left( \phi' \middle| \operatorname{Wy} \right)^{e} \qquad \qquad P' = \left( \operatorname{Rx} \right)^{d} \left( \phi' \middle| \operatorname{Wy} \right)^{e}$$

In P', however, now the dependent set of e is the singleton  $\{d\}$ ; thus  $\phi'$  must be constructed using the dependent transformer R4a. Since  $((v=r \lor x=r) \Rightarrow \psi) \not\equiv (v=r \Rightarrow \psi)$ , associativity fails.
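The inequivalence of the two transformers can be witnessed concretely. The following sketch fixes $v = 1$ and the sample postcondition $\psi = (r \ge 1)$, both chosen only for illustration, and searches a small domain for valuations of $(x, r)$ on which R4a and R4b disagree.

```python
from itertools import product

V = 1                  # value read (assumed)
DOMAIN = range(0, 3)   # assumed finite range for x and r

psi = lambda x, r: r >= 1   # sample postcondition (assumed)

# Dependent transformer R4a: (v = r) => psi
dependent = lambda x, r: (not (V == r)) or psi(x, r)

# Independent transformer R4b: (v = r or x = r) => psi
independent = lambda x, r: (not (V == r or x == r)) or psi(x, r)

# Valuations (x, r) on which the two formulas differ.
witnesses = [
    (x, r) for x, r in product(DOMAIN, repeat=2)
    if dependent(x, r) != independent(x, r)
]
print(witnesses)
```

In this instance the dependent formula is a tautology, while the independent one fails when the register may instead take the current value of $x$, so the two associations yield different preconditions.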

If we allow stronger preconditions, as in [Jagadeesan et al. 2020], then we could use inclusion rather than *respects*. To arrive at this semantics, one would replace every occurrence of  $\equiv$  in Fig. 1 with  $\models$ . Then (< respects  $<_1$  and  $<_2$ ) can be replaced by ( $< \supseteq (<_1 \cup <_2)$ ).

## A.6 Write Substitutions

[Todo: Discuss.]

An example of why we substitute $[M/x]$ rather than $[v/x]$ in the write rule:

$$r := y;\ x := r;\ s := x;\ z := s$$

$$(Ry1) \rightarrow (Wx1) \quad (Rx1) \quad (Wz1)$$

With $[v/x]$, we lose the order from (Ry1) to (Wz1).

## A.7 Read Substitutions

In *READ*, it is also possible to collapse x and r via substitution:

```
(R4a') if (E \cap D) \neq \emptyset then \tau^D(\psi) \equiv v = r \Rightarrow \psi[r/x],

(R4b') if E \neq \emptyset and (E \cap D) = \emptyset then \tau^D(\psi) \equiv (v = r \vee x = r) \Rightarrow \psi[r/x],

(R4c') if E = \emptyset then \tau^D(\psi) \equiv \psi[r/x],
```

Perhaps surprisingly, this semantics is incomparable with that of Fig. 1. Consider the following:

$$\text{if}(r \wedge s \text{ even})\{y := 1\};\ \text{if}(r \wedge s)\{z := 1\}$$

$$(r \wedge s \text{ even} \mid \mathsf{W}y1) \qquad (r \wedge s \mid \mathsf{W}z1)$$

Prepending (s:=x), we get the same result regardless of whether we substitute [s/x], since x does not occur in either precondition. Here we show the independent case:

Since the preconditions mention x, prepending (r := x), we now get different results depending on whether we perform the substitution. Without any substitution, we have:

$$r := x;\ s := x;\ \text{if}(r \land s \text{ even})\{y := 1\};\ \text{if}(r \land s)\{z := 1\}$$

Prepending (x := 0), which substitutes [0/x], the precondition of (Wy1) becomes  $(1=r \Rightarrow (2=s \lor 0=s) \Rightarrow (r \land s \text{ even}))$ , which is a tautology, whereas the precondition of Wz1 becomes  $(1=r \Rightarrow (2=s \lor 0=s) \Rightarrow (r \land s))$ , which is not. In order to be top-level, (Wz1) must be dependency ordered after (Rx2); in this case the precondition becomes  $(1=r \Rightarrow 2=s \Rightarrow (r \land s))$ , which is a tautology.

$$(\mathsf{W}x0) \qquad (\mathsf{R}x1) \qquad (\mathsf{R}x2) \qquad (\mathsf{W}y1) \qquad (\mathsf{W}z1)$$

The situation reverses with the substitution [r/x]:

$$r := x \; ; \; \text{if} \; (r \land s \; \text{even}) \{ y := 1 \}; \; \text{if} \; (r \land s) \{ z := 1 \}$$

$$Rx2 \qquad \boxed{1 = r \Rightarrow (2 = s \lor r = s) \Rightarrow (r \land s \; \text{even}) \; | \; \mathsf{W}y1 } \qquad \boxed{1 = r \Rightarrow (2 = s \lor r = s) \Rightarrow (r \land s) \; | \; \mathsf{W}z1}$$


Prepending (x := 0):

$$(Wx0) \qquad (Rx1) \qquad (Rx2) \qquad (Wy1) \qquad (Wz1)$$

The dependency has changed from  $(Rx2) \rightarrow (Wz1)$  to  $(Rx2) \rightarrow (Wy1)$ . The resulting sets of pomsets are incomparable.
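The tautology claims behind this comparison can be checked by enumeration. The sketch below is only illustrative and rests on an assumed reading of the guards: $(r \land s \text{ even})$ is taken to mean "$r$ is nonzero and $s$ is even", and $(r \land s)$ to mean "$r$ and $s$ are both nonzero". The finite domain and all names are likewise assumptions.

```python
from itertools import product

DOMAIN = range(0, 4)  # assumed finite range for r and s

def implies(a, b):
    return (not a) or b

def tautology(f):
    return all(f(r, s) for r, s in product(DOMAIN, repeat=2))

# Assumed reading of the two guards.
p_y = lambda r, s: r != 0 and s % 2 == 0   # guard of (Wy1)
p_z = lambda r, s: r != 0 and s != 0       # guard of (Wz1)

results = {
    # Without substitution, after prepending x := 0 (which substitutes [0/x]).
    "plain, Wy1 independent": tautology(
        lambda r, s: implies(r == 1, implies(s == 2 or s == 0, p_y(r, s)))),
    "plain, Wz1 independent": tautology(
        lambda r, s: implies(r == 1, implies(s == 2 or s == 0, p_z(r, s)))),
    "plain, Wz1 dependent": tautology(
        lambda r, s: implies(r == 1, implies(s == 2, p_z(r, s)))),
    # With the substitution [r/x].
    "[r/x], Wy1 independent": tautology(
        lambda r, s: implies(r == 1, implies(s == 2 or r == s, p_y(r, s)))),
    "[r/x], Wz1 independent": tautology(
        lambda r, s: implies(r == 1, implies(s == 2 or r == s, p_z(r, s)))),
    "[r/x], Wy1 dependent": tautology(
        lambda r, s: implies(r == 1, implies(s == 2, p_y(r, s)))),
}

for name, ok in results.items():
    print(f"{name}: {ok}")
```

Under this reading, the write that needs a dependency on (Rx2) is (Wz1) without the substitution and (Wy1) with it, matching the change in order described above.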

Thinking in terms of hardware, the difference is whether reads update the cache, thus clobbering preceding writes. With [r/x], reads clobber the cache, whereas without the substitution, they do not. Since most caches work this way, the model with [r/x] is likely preferred for modeling hardware. However, this substitution only makes sense in a model with read-read coherence and read-read dependencies, which is not the case for Arm8.

## A.8 Downset Closure

We would like the semantics to be closed with respect to *downsets*. Downsets include a subset of initial events, similar to *prefixes* for strings.

Definition A.3. $P_2$ is a downset of $P_1$ if

- (1) $E_2 \subseteq E_1$,
- (2) $(\forall e \in E_2)\ \lambda_2(e) = \lambda_1(e)$,
- (3) $(\forall e \in E_2)\ \kappa_2(e) \equiv \kappa_1(e)$,
- (4) $(\forall e \in E_2)\ \tau_2^D(e) \equiv \tau_1^D(e)$,
- (5) $\checkmark_2 \models \checkmark_1$,
- (6a) $(\forall d \in E_2)\ (\forall e \in E_2)\ d <_2 e \text{ iff } d <_1 e$,
- (6b) $(\forall d \in E_1)\ (\forall e \in E_2) \text{ if } d <_1 e \text{ then } d \in E_2$,
- (7) $(\forall d \in E_2)\ (\forall e \in E_2)\ d \ \mathsf{rf}_2 \ e \text{ iff } d \ \mathsf{rf}_1 \ e$.
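The order-related conditions of the definition can be sketched as a small check on finite sets. This is a partial illustration only: it covers conditions (1), (6a), and (6b), and omits labels, preconditions, transformers, termination, and reads-from. The function name and the encoding of orders as sets of pairs are assumptions.

```python
def is_order_downset(events2, order2, events1, order1):
    """Check conditions (1), (6a), and (6b) for a candidate downset.

    Orders are given as sets of (d, e) pairs meaning d < e.
    """
    # (1) E2 is a subset of E1.
    subset = events2 <= events1
    # (6a) The order on E2 is the restriction of the order on E1.
    restricted = {(d, e) for (d, e) in order1 if d in events2 and e in events2}
    same_order = order2 == restricted
    # (6b) E2 is downward closed: every predecessor of an E2 event is in E2.
    down_closed = all(d in events2 for (d, e) in order1 if e in events2)
    return subset and same_order and down_closed

# A three-event chain a < b < c (transitively closed).
events1 = {"a", "b", "c"}
order1 = {("a", "b"), ("b", "c"), ("a", "c")}

prefix_ok = is_order_downset({"a", "b"}, {("a", "b")}, events1, order1)
gap_bad = is_order_downset({"a", "c"}, {("a", "c")}, events1, order1)

print("{a, b} is a downset:", prefix_ok)
print("{a, c} is a downset:", gap_bad)
```

The second candidate fails (6b) because it keeps $c$ while dropping its predecessor $b$, which is the sense in which downsets resemble prefixes.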

Downset closure fails for two reasons. The key property is that the empty set transformer should behave the same as the independent transformer.

First, downset closure fails for read-read independency (§3.7). Consider:

$$r := x;\ \text{if}(!r)\{s := y\}$$

The semantics of this program includes the singleton pomset (Rx0), but not the singleton pomset (Ry0). To get (Rx0), we combine:

$$r := x \qquad \text{if}(!r)\{s := y\}$$

$$(Rx0) \qquad \emptyset$$

Attempting to get (Ry0), we instead get:

$$r := x \qquad \text{if}(!r)\{s := y\}$$

$$(r=0 \mid Ry0)$$

Since *r* appears only once in the program, this pomset cannot contribute to a top-level pomset.

Second, the semantics is not downset closed because the independency reasoning of R4b is only applicable for pomsets where the ignored read is present! Revisiting JMM causality test case 1 from the end of §3.6:

$$x := 0 \qquad r := x \qquad \text{if } (r \ge 0) \{y := 1\}; z := r$$

$$(\mathsf{R}x1) \qquad (r \ge 0 \mid \mathsf{W}y1) \qquad (r = 1 \mid \mathsf{W}z1)$$

$$\psi[0/x] \qquad (1 = r \lor x = r) \Rightarrow \psi$$

$$x := 0; r := x; \text{if } (r \ge 0) \{y := 1\}; z := r$$

$$(\mathsf{W}x0) \qquad (\mathsf{R}x1) \qquad ((1 = r \lor 0 = r) \Rightarrow r \ge 0 \mid \mathsf{W}y1) \qquad (1 = r \Rightarrow r = 1 \mid \mathsf{W}z1)$$

The precondition of (Wy1) is a tautology.

 Taking the empty set for the read, however, the precondition of (Wy1) is not a tautology:

One way to deal with the second issue would be to allow general access elimination to merge (Wx0) and (Rx0):

$$x := 0; r := x; if(r \ge 0) \{ y := 1 \}; z := r$$

$$((0 = r \lor 0 = r) \Rightarrow r \ge 0 \mid Wy1) \qquad (r = 1 \mid Wz1)$$

We leave the elaboration of this idea to future work.

## A.9 Logical Encoding of Delay for PwT-MCA

[Todo: Remove this section?]

In this subsection, we develop a logical encoding of delay, which can replace 16a in PwT-MCA<sub>1</sub>. It is not obvious how to repeat this trick for PwT-MCA<sub>2</sub>, due to thread-local reads-from and thread-local blockers (s6a and s6b in Def. 4.2).

As motivation, recall that we stated Lemma 3.6(g) using inclusions:

$$\text{(g)} \quad [\text{if}(\neg \phi)\{S_2\}; \text{if}(\phi)\{S_1\}] \subseteq [\text{if}(\phi)\{S_1\}] = [\text{if}(\phi)\{S_1\}; \text{if}(\neg \phi)\{S_2\}].$$

PwT-MCA does not satisfy the reverse inclusion. The culprit is delay, which introduces order regardless of whether preconditions are disjoint. As an example, $[\text{if}(r)\{x := 1\} \text{ else } \{x := 2\}]$ has an execution with $(r=0 \mid Wx2) \rightarrow (r\neq 0 \mid Wx1)$ (using augmentation), whereas $[\text{if}(r)\{x := 1\}; \text{if}(!r)\{x := 2\}]$ has no such execution.

[Todo: What is the story here for PwT-Po?]

In order to validate the reverse inclusions, we could require that 16a not impose order when  $\kappa_1(d) \wedge \kappa_2(e)$  is unsatisfiable. Thus, following on §A.3, we would also like this:

(s6b') if  $\lambda_1(d)$  delays  $\lambda_2(e)$  and  $\kappa_1(d) \wedge \kappa_2'(e)$  is  $\lambda$ -consistent then  $d \leq e$ .

However, (s6b') fails associativity. Consider the following example, where $\theta_{\lambda} = (r=0)$:

$$r := y \qquad \qquad \text{if}(r \parallel s)\{x := 1\} \qquad \qquad \text{if}(!s)\{x := 2\}$$
 
$$(R y0) \qquad \qquad (r \neq 0 \lor s \neq 0 \mid Wx1) \qquad \qquad (s = 0 \mid Wx2)$$

Associating right, order is required since  $((r \neq 0 \lor s \neq 0) \land s = 0)$  is satisfiable (take r = 1 and s = 0):

$$r := y \qquad \qquad \text{if}(r \parallel s)\{x := 1\}; \text{ if}(!s)\{x := 2\}$$

$$(r \neq 0 \lor s \neq 0 \mid Wx1) \longrightarrow (s = 0 \mid Wx2)$$

$$r := y; \text{ if}(r \parallel s)\{x := 1\}; \text{ if}(!s)\{x := 2\}$$

$$(Ry0) \longrightarrow (r = 0 \Rightarrow (r \neq 0 \lor s \neq 0) \mid Wx1) \longrightarrow (s = 0 \mid Wx2)$$

Associating left, order is not required between the writes since  $(s\neq 0 \land s=0)$  is unsatisfiable:

$$r := y;\ \text{if}(r \parallel s)\{x := 1\} \qquad \text{if}(!s)\{x := 2\}$$

$$(Ry0) \rightarrow (r=0 \Rightarrow (r\neq 0 \lor s\neq 0) \mid Wx1) \qquad (s=0 \mid Wx2)$$

$$r := y;\ \text{if}(r \parallel s)\{x := 1\};\ \text{if}(!s)\{x := 2\}$$

$$(Ry0) \rightarrow (r=0 \Rightarrow (r\neq 0 \lor s\neq 0) \mid Wx1) \qquad (s=0 \mid Wx2)$$

This motivates the logic-based presentation of delay.
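The two satisfiability checks that drive this example can be reproduced by enumeration. The sketch below assumes a small finite range for $r$ and $s$, and models $\lambda$-consistency by conjoining the register state $\theta_\lambda = (r = 0)$ only when the read (Ry0) is already part of the left pomset. The names are hypothetical.

```python
from itertools import product

DOMAIN = range(0, 3)  # assumed finite range for r and s

def satisfiable(f):
    return any(f(r, s) for r, s in product(DOMAIN, repeat=2))

# Preconditions of the two writes before the read is prepended.
k_wx1 = lambda r, s: r != 0 or s != 0
k_wx2 = lambda r, s: s == 0

# Associating right: the two writes are combined first. No read is in scope,
# so the register state is trivially true.
order_required_right = satisfiable(lambda r, s: k_wx1(r, s) and k_wx2(r, s))

# Associating left: (Ry0) is combined with the first write, whose precondition
# becomes r = 0 => (r != 0 or s != 0); the register state is r = 0.
k_wx1_left = lambda r, s: (r != 0) or k_wx1(r, s)
theta = lambda r, s: r == 0
order_required_left = satisfiable(
    lambda r, s: k_wx1_left(r, s) and k_wx2(r, s) and theta(r, s))

print("order required (associating right):", order_required_right)
print("order required (associating left):", order_required_left)
```

The same pair of writes is ordered under one association and unordered under the other, which is the associativity failure described above.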


In the data model, we require additional symbols:  $Q_{sc}$ ,  $Q_{ro}^x$ , and  $Q_{wo}^x$ . We refer to these collectively as *quiescence symbols*.

We update Def. 3.4 of complete pomsets to substitute true for every quiescence symbol (notation [tt/Q]):

Definition A.4. A PwT is complete if (c3) $\kappa(e)[\text{tt}/Q]$ is a tautology, and (c5) $\checkmark[\text{tt}/Q]$ is a tautology.

We define some helper notation:

Definition A.5. Let  $Q_{ro}^* = \bigwedge_y Q_{ro}^y$ , and similarly for  $Q_{wo}^*$ . Let formulae  $Q_{\mu}^{Sx}$ ,  $Q_{\mu}^{Lx}$ , and  $Q_{\mu}^F$  be defined:

$$\begin{array}{lll} Q_{rlx}^{Sx} = Q_{ro}^x \wedge Q_{wo}^x & Q_{rlx}^{Lx} = Q_{wo}^x & Q_{rel}^F = Q_{ro}^* \wedge Q_{wo}^* \\ Q_{rel}^{Sx} = Q_{ro}^* \wedge Q_{wo}^* & Q_{acq}^{Lx} = Q_{wo}^x & Q_{acq}^F = Q_{ro}^* \wedge Q_{wo}^* \\ Q_{sc}^{Sx} = Q_{ro}^* \wedge Q_{wo}^* \wedge Q_{sc} & Q_{sc}^{Lx} = Q_{wo}^x \wedge Q_{sc} & Q_{sc}^F = Q_{ro}^* \wedge Q_{wo}^* \wedge Q_{sc} \end{array}$$

Let  $[\phi/Q_{ro}^*]$  substitute  $\phi$  for every  $Q_{ro}^y$ , and similarly for  $Q_{wo}^*$ . Let substitutions  $[\phi/Q_{\mu}^{Sx}]$ ,  $[\phi/Q_{\mu}^{Lx}]$ , and  $[\phi/Q_{\mu}^{F}]$  be defined:

Update the following rules from Fig. 1. (The change is similar for address calculation and if-introduction.)

[Todo: This is buggy. Need to enforce order for coherence/synchronization/dependency into a write and coherence/synchronization, but not dependency, into reads. Lack of read-read dependency is bad here. Note that the write rules should mention D—see the agda version of write.]

The quiescence formulae indicate what must precede an event. For example, all preceding accesses must be ordered before a releasing write, whereas only writes on x must be ordered before a relaxed read on x.

The quiescence substitutions update quiescence symbols in subsequent code. For subsequent independent code, w³ and x³ substitute false. In complete pomsets, we substitute true for every quiescence symbol. For example, we substitute ff for  $Q_{rel}^{Sx}$  in the independent case for a releasing write; this ensures that subsequent writes to x follow the releasing write in top-level pomsets. Similarly, we substitute ff for  $Q_{acq}^{Lx}$  in the independent case for an acquiring read; this ensures that all subsequent accesses follow the acquiring read in top-level pomsets.

Fig. 1 shows the effect of quiescence for each access mode.



Fig. 1. The Effect of Quiescence for Each Access Mode

Example A.6. The definition enforces publication. Consider:



Since  $Q_{wo}^*[ff/Q_{wo}^x]$  is ff, we must introduce order to get a satisfiable precondition for (Wy1).

*Example A.7.* The definition enforces subscription. Consider:



Since  $Q_{wo}^x[ff/Q_{wo}^*]$  is ff, we must introduce order to get a satisfiable precondition for (Wy1).

*Example A.8.* Even in its logical form, s6b' is incompatible with the ability to strengthen preconditions using augment closure, which is allowed in [Jagadeesan et al. 2020]. Consider the following.



If r=0 then x is 1, 2, 1. If  $r\neq 0$  then x is 2, 1, 2. Augmenting the middle preconditions and then using sequential composition, we have:



Note that s6b' does not require any order between the two writes of the middle pomset. Merging left and right, we have:

$$if(r)\{x := 2\}; x := 1; x := 2; if(!r)\{x := 1\}$$

$$\underbrace{Wx2} \longrightarrow \underbrace{Wx1}$$

As shown by the following single-threaded code, allowing this outcome would violate DRF-SC.

$$y := 1; r := y; if(r)\{x := 2\}; x := 1; x := 2; if(!r)\{x := 1\}$$

$$(Wy1) \longrightarrow (Ry1) \qquad (Wx2) \longrightarrow (Wx1)$$

This is one reason that we use weakest preconditions, rather than preconditions.

The same problem does not occur due to if-introduction (at least not for complete pomsets, where termination must be a tautology, so you cannot arbitrarily choose to partition  $\Omega \neq tt$):



Merging left and right, we have

$$if(r)\{x := 2\}; x := 1; x := 2; if(!r)\{x := 1\}$$

$$(Wx2) \longrightarrow (r=0) Wx1$$

$$(r\neq 0) Wx2 \longrightarrow (Wx1)$$

# A.10 Is Coherence/Delay Compatible with If-Introduction and Dead-Write-Removal?

[Todo: Flesh this out.]

 With if-introduction, the following equation should hold:

$$[\![if(r)\{x := 2\}; x := 1; x := 2; if(!r)\{x := 1\}; x := 3]\!]$$

$$= [\![if(!r)\{x := 1\}; x := 2; x := 1; if(r)\{x := 2\}; x := 3]\!]$$

Using dead write removal, these can be refined, respectively, to:

$$[\![x := 1; x := 2; x := 3]\!] \neq [\![x := 2; x := 1; x := 3]\!]$$

What has become of coherence?
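As a sanity check on the equation above, the following Python sketch lists the values written to x by each program, viewed purely as sequential code. This trace view is a simplification and not the pomset semantics; the helper names are hypothetical. Under the stated assumption that no read of x intervenes, it checks that the two programs perform the same writes for each value of r, and that deleting one overwritten write can yield either 1, 2, 3 or 2, 1, 3.

```python
from typing import List, Set, Tuple


def writes_left(r: int) -> List[int]:
    """Writes to x by: if(r){x:=2}; x:=1; x:=2; if(!r){x:=1}; x:=3."""
    trace = [2] if r else []
    trace += [1, 2]
    if not r:
        trace.append(1)
    return trace + [3]


def writes_right(r: int) -> List[int]:
    """Writes to x by: if(!r){x:=1}; x:=2; x:=1; if(r){x:=2}; x:=3."""
    trace = [] if r else [1]
    trace += [2, 1]
    if r:
        trace.append(2)
    return trace + [3]


def drop_one_dead_write(trace: List[int]) -> Set[Tuple[int, ...]]:
    """All traces obtained by deleting one write that a later write overwrites.

    Assumption: no read of x intervenes, so every write but the last is dead.
    """
    return {tuple(trace[:i] + trace[i + 1:]) for i in range(len(trace) - 1)}


for r in (0, 1):
    refined = drop_one_dead_write(writes_left(r))
    print(r, writes_left(r) == writes_right(r),
          (1, 2, 3) in refined, (2, 1, 3) in refined)
```

Both refinements are reachable from the same trace, which is the tension that the question above points at.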

### B LOWERING PwT-MCA TO ARM

For simplicity, we restrict to top-level parallel composition.

#### **B.1** Arm executions

Our description of Arm8 follows Alglave et al. [2021], adapting the notation to our setting.

*Definition B.1.* An *Arm8 execution graph*, *G*, is a tuple  $(E, \lambda, poloc, lob)$  such that

- (A1)  $E \subseteq \mathcal{E}$  is a set of events,
- (A2)  $\lambda: E \to \mathcal{A}$  defines a label for each event,
- (A3)  $poloc \subseteq E \times E$  is a per-thread, per-location total order, capturing per-location program order,
- (A4)  $lob \subseteq E \times E$  is a per-thread partial order capturing *locally-ordered-before*, such that
  - (A4a)  $poloc \cup lob$  is acyclic.
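Condition A4a is an acyclicity requirement on the union of two relations. A minimal Python sketch of such a check follows, using depth-first search; the function name and the abstract events a, b, c are hypothetical, and relations are represented as sets of pairs.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def is_acyclic(events: Iterable[Hashable], *relations: Set[Edge]) -> bool:
    """Return True when the union of the given relations has no cycle.

    Assumption: every endpoint of every edge occurs in `events`.
    """
    succ: Dict[Hashable, Set[Hashable]] = {e: set() for e in events}
    for rel in relations:
        for a, b in rel:
            succ[a].add(b)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    colour = {e: WHITE for e in succ}

    def visit(e: Hashable) -> bool:
        colour[e] = GREY
        for n in succ[e]:
            if colour[n] == GREY:  # back edge: a cycle exists
                return False
            if colour[n] == WHITE and not visit(n):
                return False
        colour[e] = BLACK
        return True

    return all(colour[e] != WHITE or visit(e) for e in succ)


poloc = {("a", "b")}
lob = {("b", "c")}
print(is_acyclic(["a", "b", "c"], poloc, lob))
print(is_acyclic(["a", "b", "c"], poloc, lob | {("c", "a")}))
```

The second call adds a hypothetical lob edge from c back to a, which closes a cycle and therefore violates A4a.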

The definition of lob is complex. Comparing with our definition of sequential composition, it is sufficient to note that lob includes

- (L1) read-write dependencies, required by s3,
- (L2) synchronization delay of ⋈<sub>sync</sub>, required by s6a,
- (L3) sc access delay of ⋈<sub>sc</sub>, required by s6a,
- (L4) write-write and read-to-write coherence delay of ⋈<sub>co</sub>, required by s6a,

and that lob does not include

- (L5) read-read control dependencies, required by s3,
- (L6) write-to-read order of rf, required by M7c,
- (L7) write-to-read coherence delay of ⋈<sub>co</sub>, required by s6a.

Definition B.2. Execution G is (co, rf, gcb)-valid, under External Global Consistency (EGC) if

- (A5)  $co \subseteq E \times E$ , is a per-location total order on writes, capturing coherence,
- (A6) rf  $\subseteq E \times E$ , is a relation, capturing reads-from, such that
  - (A6a) rf is a surjective and injective relation on  $\{e \in E \mid \lambda(e) \text{ is a read}\}$ ,
  - (A6b) if  $d \stackrel{\mathsf{rf}}{\longrightarrow} e$  then  $\lambda(d)$  matches  $\lambda(e)$ ,
  - (A6c) poloc  $\cup$  co  $\cup$  rf  $\cup$  fr is acyclic, where  $e \xrightarrow{fr} c$  if  $e \xleftarrow{rf} d \xrightarrow{co} c$ , for some d,
- (A7)  $gcb \supseteq (co \cup rf)$  is a linear order such that
  - (A7a) if  $d \xrightarrow{rf} e$  and  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \xrightarrow{gcb} d$  or  $e \xrightarrow{gcb} c$ ,
  - (A7b) if  $e \xrightarrow{lob} c$  then either  $e \xrightarrow{gcb} c$  or  $(\exists d) d \xrightarrow{rf} e$  and  $d \xrightarrow{poloc} e$  but not  $d \xrightarrow{lob} c$ .

Execution *G* is (co, rf, cb)-valid under External Consistency (EC) if

- (A5) and (A6), as for EGC,
- (A8)  $cb \supseteq (co \cup lob)$  is a linear order such that if  $d \xrightarrow{rf} e$  then either
  - (A8a)  $d \xrightarrow{cb} e$  and if  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \xrightarrow{cb} d$  or  $e \xrightarrow{cb} c$ , or
  - (A8b)  $d \xleftarrow{cb} e$  and  $d \xrightarrow{poloc} e$  and  $(\nexists c)$   $\lambda(c)$  blocks  $\lambda(e)$  and  $d \xrightarrow{poloc} c \xrightarrow{poloc} e$ .

Alglave et al. [2021] show that EGC and EC are both equivalent to the standard definition of Arm8. They explain EGC and EC using the following example, which is allowed by Arm8.<sup>5</sup>

$$x := 1; r := x; y := r \parallel r := y^{\text{acq}}; s := x$$

$$(Wx1) \xrightarrow{\text{rf}} (Rx1) \xrightarrow{\text{lob}} (Wy1) \xrightarrow{\text{rf}} (R^{\text{acq}}y1) \xrightarrow{\text{lob}} (Rx0)$$

EGC drops lob-order in the first thread using A7b, since (Wx1) is not lob-ordered before (Wy1).

$$(Wx1) \longrightarrow (Rx1) \qquad (Wy1) \longrightarrow (Rx0) \qquad (gcb)$$

EC drops rf-order in the first thread using A8b.

$$(cb)$$
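The EGC reading of this example can be checked mechanically. The Python sketch below derives fr from rf and co as in A6c and tests whether a candidate linear order contains a relation. The initial write Wx0, the event names, and the particular gcb order are assumptions made for illustration; A7a and A7b themselves are not implemented.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def from_reads(rf: Set[Edge], co: Set[Edge]) -> Set[Edge]:
    """fr relates e to c when some d has d -rf-> e and d -co-> c (A6c)."""
    return {(e, c) for d, e in rf for d2, c in co if d == d2}


def contains(order: List[str], rel: Set[Edge]) -> bool:
    """True when the linear order places a before b for every (a, b) in rel."""
    pos = {e: i for i, e in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in rel)


# x := 1; r := x; y := r  ||  r := y^acq; s := x
# Assumption: an initial write Wx0 fulfills the stale read Rx0.
co = {("Wx0", "Wx1")}
rf = {("Wx1", "Rx1"), ("Wy1", "Racq_y1"), ("Wx0", "Rx0")}
lob = {("Rx1", "Wy1"), ("Racq_y1", "Rx0")}
fr = from_reads(rf, co)

# Hypothetical gcb candidate that ignores the first thread's lob edge.
gcb = ["Wx0", "Wy1", "Racq_y1", "Rx0", "Wx1", "Rx1"]
print(fr)
print(contains(gcb, co | rf | fr))
print(contains(gcb, lob))
```

The candidate order contains co, rf and fr but not lob: the edge from (Rx1) to (Wy1) is the one that A7b allows EGC to drop.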

### B.2 Lowering PwT-MCA1 to Arm

The optimal lowering for Arm8 is unsound for PwT-MCA<sub>1</sub>. The optimal lowering maps relaxed access to ldr/str and non-relaxed access to ldar/stlr [Podkopaev et al. 2019]. In this section, we consider a suboptimal strategy, which lowers non-relaxed reads to (dmb.sy; ldar). Significantly, we retain the optimal lowering for relaxed access. In the next section we recover the optimal lowering by adopting an alternative semantics for M7c and s6a.

<sup>&</sup>lt;sup>5</sup>We have changed an address dependency in the first thread to a data dependency.

To see why the optimal lowering fails, consider the following attempted execution, where the final values of both x and y are 2.

$$x := 2; r := x^{\operatorname{acq}}; y := r - 1 \parallel y := 2; x^{\operatorname{rel}} := 1$$

$$(\operatorname{gcb})$$

$$(<)$$

$$(R^{acq}x2) \longrightarrow (Wy1) \longrightarrow (Wy2) \longrightarrow (W^{rel}x1)$$

This attempted execution is allowed by Arm8, but disallowed by our semantics.

 If the read of x in the execution above is changed from acquiring to relaxed, then our semantics allows the gcb execution, using the independent case for the read and satisfying the precondition of (Wy1) by prepending (Wx2). It may be tempting, therefore, to adopt a strategy of *downgrading* acquires in certain cases. Unfortunately, it is not possible to do this locally without invalidating important idioms such as publication. For example, consider that  $(R^{ra}x1)$  is *not* possible for the second thread in the following attempted execution, due to publication of (Wx2) via y:

$$x := x + 1; \ y^{\text{rel}} := 1 \parallel x := 1; \ \text{if} (y^{\text{acq}} \& x^{\text{acq}}) \{s := z\} \parallel z := 1; \ x^{\text{rel}} := 1$$

Instead, if the read of x is relaxed, then the publication via y fails, and (Rx1) in the second thread is possible.

$$(Rx1)$$
  $(Wx2)$   $(Wx1)$   $(Rx1)$   $(Rx1)$   $(Rx0)$   $(Wz1)$   $(Wz1)$   $(Wz1)$ 

Using the suboptimal lowering for acquiring reads, our semantics is sound for Arm. The proof uses the characterization of Arm using EGC.

Theorem B.3. Suppose  $G_1$  is  $(co_1, rf_1, gcb_1)$ -valid for S under the suboptimal lowering that maps non-relaxed reads to (dmb.sy; ldar). Then there is a top-level pomset  $P_2 \in [S]$  such that  $E_2 = E_1$ ,  $\lambda_2 = \lambda_1$ ,  $rf_2 = rf_1$ , and  $\leq_2 = gcb_1$ .

PROOF. First, we establish some lemmas about Arm8.

LEMMA B.4. Suppose G is (co, rf, gcb)-valid. Then  $gcb \supseteq fr$ .

PROOF. Using the definition of fr from A6c, we have  $e \xleftarrow{rf} d \xrightarrow{co} c$ , for some d. Since  $gcb \supseteq co$ , we have  $d \xrightarrow{gcb} c$ . Because c is a write to the location read by e,  $\lambda(c)$  blocks  $\lambda(e)$ , and A7a requires either  $c \xrightarrow{gcb} d$  or  $e \xrightarrow{gcb} c$ . The first contradicts  $d \xrightarrow{gcb} c$ , since gcb is a linear order; therefore  $e \xrightarrow{gcb} c$ .

LEMMA B.5. Suppose G is (co, rf, gcb)-valid and c e, where  $\lambda(c)$  blocks  $\lambda(e)$ . Then c e e. Proof. By way of contradiction, assume e e e. If e then by A7 we must also have e e such that e e and therefore e e by transitivity, e e e by the definition of e we have e e by the definition of e e. By the definition of e e by the definition of e e by the definition of e e by the definition of e e.

We show that all the order required in the pomset is also required by Arm8. M7b holds since  $gcb_1$  is consistent with  $co_1$  and  $fr_1$  (Lemma B.4). As noted above, lob includes the order required by s3 and s6a. We need only show that the order removed by A7b can also be removed from the pomset. In order for A7b to remove order from e to c, we must have  $d \xrightarrow{rf} e$  and  $d \xrightarrow{poloc} e$  but not  $d \xrightarrow{lob} c$ . Because of our suboptimal lowering, it must be that e is a relaxed read; otherwise the dmb.sy would require  $d \xrightarrow{lob} c$ . Thus we know that s6a does not require order from e to c. By chaining R4b and W4, any dependence on the read can be satisfied without introducing order in s3.

### B.3 Lowering PwT-MCA2 to Arm

 We can achieve optimal lowering for Arm by weakening the semantics of sequential composition slightly. In particular, we must lose M7c, which states that  $d \xrightarrow{rf} e$  implies d < e. Revisiting the example in the last subsection, we essentially mimic the EC characterization:

$$x := 2; r := x^{\operatorname{acq}}; y := r - 1 \parallel y := 2; x^{\operatorname{rel}} := 1$$

$$(\operatorname{W} x2) \longrightarrow (\operatorname{W} y1) \longrightarrow (\operatorname{W} y2) \longrightarrow (\operatorname{W}^{\operatorname{rel}} x1)$$

$$(\operatorname{cb})$$

Here the rf relation *contradicts* order! We have both  $(Wx2) \cdots \rightarrow (R^{acq}x2)$  and  $(Wx2) \stackrel{cb}{\longleftarrow} (R^{acq}x2)$ . We first show that EC-validity is unchanged if we assume  $cb \supseteq fr$ :

LEMMA B.6. Suppose G is EC-valid via (co, rf, cb). Then there is a permutation cb' of cb such that G is EC-valid via (co, rf, cb') and cb'  $\supseteq$  fr, where fr is defined in A6c.

PROOF. Suppose  $e \xrightarrow{fr} c$ . By definition of fr,  $e \xleftarrow{rf} d \xrightarrow{co} c$ , for some d. We show that either (1)  $e \xrightarrow{cb} c$ , or (2)  $c \xrightarrow{cb} e$  and we can reverse the order in cb' to satisfy the requirements.

If A8a applies to  $d \xrightarrow{\mathsf{rf}} e$ , then  $e \xrightarrow{\mathsf{cb}} c$ , since it cannot be that  $c \xrightarrow{\mathsf{co}} d$ .

Suppose A8b applies to  $d \xrightarrow{rf} e$  and c is from a different thread than e. Because it is a different thread, we cannot have  $c \xrightarrow{lob} e$ , and therefore we can choose  $e \xrightarrow{cb'} c$  in cb'.

Suppose A8b applies to  $d \xrightarrow{rf} e$  and c is from the same thread as e. Applying A6c to  $e \xrightarrow{fr} c$ , it cannot be that  $c \xrightarrow{poloc} e$ . Since poloc is a per-thread-and-per-location total order, it must be that  $e \xrightarrow{poloc} c$ . Applying A4a, we cannot have  $c \xrightarrow{lob} e$ , and therefore we can choose  $e \xrightarrow{cb'} c$  in cb'.

Here is a contradictory non-example illustrating the last case of the proof:

$$x := 2; r := x \parallel x := 1$$

$$c : Wx2$$

$$d : Wx1$$

THEOREM B.7. Suppose  $G_1$  is EC-valid for S via  $(co_1, rf_1, cb_1)$  and that  $cb_1 \supseteq fr_1$ . Then there is a top-level pomset  $P_2 \in [S]$  such that  $E_2 = E_1$ ,  $\lambda_2 = \lambda_1$ ,  $rf_2 = rf_1$ , and  $\leq_2 = cb_1$ .

PROOF. We show that all the order required in the pomset is also required by Arm8. M7b holds since  $cb_1$  is consistent with  $co_1$  and  $fr_1$ . s6b follows from A8b. As noted above, lob includes the order required by s3 and s6a.

### C LDRF-SC FOR PwT-MCA

#### [Todo: Remove this section?]

In this appendix, we establish a DRF-sc result for PwT-MCA<sub>2</sub>. We prove an *external* result, where the notion of *data-race* is independent of the semantics itself. Since every PwT-MCA<sub>2</sub> is also a PwT-MCA<sub>1</sub>, the result also applies there. Our result is also *local*, using Dolan et al.'s [2018] notion of *Local Data Race Freedom (LDRF)*.

We do not address PwT-C11. The internal DRF-sc result for C11 [Batty 2015] does not rely on dependencies and thus applies to PwT-C11. In internal DRF-sc, data-races are defined using the semantics of the language itself. Using the notion of dependency defined here, it should be possible to prove a stronger external result for C11, similar to that of [Lahav et al. 2017]; we leave this as future work.

Jagadeesan et al. [2020] prove LDRF-sc for Pomsets with Preconditions (PwP). PwT-MCA generalizes PwP to account for sequential composition. Most of the machinery of LDRF-sc, however, has little to do with sequential semantics. Thus, we have borrowed heavily from the text of [Jagadeesan et al. 2020]; indeed, we have copied directly from the LaTeX source, which is publicly available. We indicate substantial changes or additions using a change-bar on the right.

There are several changes:

- PwP imposes several conditions that we have dropped: *consistency, causal strengthening, downset closure* (see §A.2).
- PwP allows preconditions that are stronger than the weakest precondition.
- PwP imposes M7c (rf implies <) and thus is similar to PwT-MCA<sub>1</sub>. PwT-MCA<sub>2</sub> is a weaker model that is new to this paper.
- PwP did not provide an accurate account of program order for merged actions. We use Lemma 6.2 to correct this deficiency.

The first two items require us to define gen differently, below.

The result requires that locations are properly initialized. We assume a sufficient condition: that programs have the form " $x_1 := v_1$ ;  $\cdots x_n := v_n$ ; S" where every location mentioned in S is some  $x_i$ . To simplify the definition of *happens-before*, we ban fences and RMWs.

To state the theorem, we require several technical definitions. The reader unfamiliar with [Dolan et al. 2018] may prefer to skip to the examples in the proof sketch, referring back as needed.

*Program Order.* Let  $[\![\cdot]\!]_{mca2}^{po}$  be defined by applying the construction of Lemma 6.2 to  $[\![\cdot]\!]_{mca2}$ . We consider only *complete* pomsets. For these, we derive program order on compound events as follows. By Lemma 6.4, if there is a compound event e, then there is a phantom event  $c \in \pi^{-1}(e)$  such that  $\kappa(c)$  is a tautology. If there is exactly one tautology, we identify e with c in program order. If there is more than one tautology, Lemma C.1, below, shows that it suffices to pick an arbitrary one.



Data Race. Data races are defined using program order (po), not pomset order (<).

Because we ban fences and RMWs, we can adopt the simplest definition of *synchronizes-with* (sw): Let  $d \xrightarrow{sw} e$  exactly when d fulfills e, d is a release, e is an acquire, and  $\neg(d \xrightarrow{po} e)$ .

Let  $hb = (po \cup sw)^+$  be the *happens-before* relation.

Let  $L \subseteq X$  be a set of locations. We say that d has an L-race with e (notation  $d \stackrel{L}{\leadsto} e$ ) when (1) at least one is relaxed, (2) at least one is a write, (3) they access the same location in L, and (4) they are unordered by hb: neither  $d \stackrel{hb}{\Longrightarrow} e$  nor  $e \stackrel{hb}{\Longrightarrow} d$ .
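The definitions of hb and of an L-race can be read as a small executable check. The Python sketch below is illustrative only: the `Event` record, the helper names, and the message-passing example are assumptions, and po and sw are supplied by hand rather than derived from a program. We also assume that an event does not race with itself.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class Event:
    name: str
    kind: str      # "R" or "W"
    loc: str
    relaxed: bool


Edge = Tuple[Event, Event]


def transitive_closure(edges: Set[Edge]) -> Set[Edge]:
    """Smallest transitive relation containing `edges`."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def l_race(d: Event, e: Event, hb: Set[Edge], locs: Set[str]) -> bool:
    return (
        d != e                                     # assumption: distinct events
        and (d.relaxed or e.relaxed)               # (1) at least one is relaxed
        and "W" in (d.kind, e.kind)                # (2) at least one is a write
        and d.loc == e.loc and d.loc in locs       # (3) same location in L
        and (d, e) not in hb and (e, d) not in hb  # (4) unordered by hb
    )


# Hypothetical example: x := 1; y^rel := 1  ||  r := y^acq; s := x
wx = Event("Wx1", "W", "x", True)
wy = Event("Wrel_y1", "W", "y", False)
ry = Event("Racq_y", "R", "y", False)
rx = Event("Rx", "R", "x", True)
po = {(wx, wy), (ry, rx)}

hb_synced = transitive_closure(po | {(wy, ry)})  # the acquire is fulfilled by the release
hb_unsynced = transitive_closure(po)             # the acquire is fulfilled elsewhere
print(l_race(wx, rx, hb_synced, {"x"}))
print(l_race(wx, rx, hb_unsynced, {"x"}))
```

With the sw edge present, the accesses to x are ordered by hb and do not race; without it, they do.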

*Generators.* We say that  $P' \in \nabla(\mathcal{P})$  if there is some  $P \in \mathcal{P}$  such that P is *complete* (Def. 4.1) and P' is a *downset* of P (Def. A.3).

Let P be augmentation-minimal in  $\mathcal{P}$  if  $P \in \mathcal{P}$  and there is no  $P \neq P' \in \mathcal{P}$  such that P augments P'. Let  $gen[\![S]\!] = \{P \in \nabla[\![S]\!]^{po}_{mca2} \mid P \text{ is augmentation-minimal in } \nabla[\![S]\!]^{po}_{mca2} \}$ .

*Extensions.* We say that P' *S-extends* P if  $P \neq P' \in \text{gen}[S]$  and P is a downset of P'.

Similarity. We say that P' is e-similar to P if they differ at most in (1) pomset order adjacent to e, (2) the value associated with event e, if it is a read, and (3) the addition and removal of read events po-after e.

Stability. We say that P is L-stable in S if (1)  $P \in \text{gen}[S]$ , (2) P is po-convex (nothing missing in program order), (3) there is no S-extension of P with a *crossing* L-race: that is, there is no  $d \in E$ , no  $P'$  S-extending P, and no  $e \in E' \setminus E$  such that  $d \stackrel{L}{\leadsto} e$ . The empty pomset is L-stable.

 *Sequentiality.* Let  $\leq_L = \leq_L \cup po$ , where  $\leq_L$  is the restriction of  $\leq$  to events that access locations in L. We say that P' is L-sequential after P if (1) P' is po-convex, (2)  $\leq_L$  is acyclic in  $E' \setminus E$ .

*Simplicity.* We say that P' is L-simple after P if all of the events in  $E' \setminus E$  that access locations in L are simple (Def. 6.1).

LEMMA C.1. Suppose  $P' \in gen[S]$  and P' is L-sequential after P. Let P'' be the restriction of P' that is L-simple after P (throwing out compound L-events after P). Then  $P'' \in gen[S]$ .

As a negative example, note that  $(\ddagger\ddagger)$  is not *L*-sequential—in fact there is no execution of the program that results in the simple events of  $(\ddagger\ddagger)$ : without merging the reads, there would be a dependency  $(Rx1) \rightarrow (Wy1)$ . *L*-sequential executions of this code must read 0 for x:

$$r:=x; z:=1; s:=x; if (r=s)\{y:=1\} \parallel x:=y$$

Theorem C.2. Let P be L-stable in S. Let P' be a S-extension of P that is L-sequential after P. Let P'' be a S-extension of P' that is po-convex, such that no subset of E'' satisfies these criteria. Then either (1) P'' is L-sequential and L-simple after P or (2) there is some S-extension P''' of P' and some  $e \in (E'' \setminus E')$  such that (a) P''' is e-similar to P'', (b) P''' is L-sequential and L-simple after P, and (c)  $d \stackrel{L}{\leadsto} e$  for some event d.

The theorem provides an inductive characterization of *Sequential Consistency for Local Data-Race Freedom (SC-LDRF)*: Any extension of an *L*-stable pomset is either *L*-sequential, or is *e*-similar to an *L*-sequential extension that includes a race involving *e*.

PROOF SKETCH. We show L-sequentiality. L-simplicity then follows from Lemma C.1.

In order to develop a technique to find P''' from P'', we analyze pomset order in generation-minimal top-level pomsets. First, we note that  $<_*$  (the transitive reduction of <) can be decomposed into three disjoint relations. Let  $ppo = (<_* \cap po)$  denote *preserved* program order, as required by sequential composition and conditional. The other two relations are cross-thread subsets of  $(<_* \setminus po)$ : rfe (reads-from-external) orders writes before reads, satisfying P6a and P6b; cae (coherence-after-external) orders read and write accesses before writes, satisfying M7b. (Within a thread, s6a and s6b induce order that is included in ppo.)

Using this decomposition, we can show the following.

LEMMA C.3. Suppose  $P'' \in gen[S]$  has an external read  $d \xrightarrow{rf''} e$  that is maximal in (ppo  $\cup$  rfe). Further suppose that there is another write d' that could fulfill e. Then there exists an e-similar P''' with  $d' \xrightarrow{rf'''} e$  such that  $P''' \in gen[S]$ .

The proof of the lemma follows an inductive construction of gen[S], starting from a large set with little order, and pruning the set as order is added: We begin with all pomsets generated by the semantics without imposing the requirements of fulfillment (including only ppo). We then prune reads which cannot be fulfilled, starting with those that are minimally ordered.

We can prove a similar result for (po  $\cup$  rfe)-maximal read and write accesses.

Turning to the proof of the theorem, if P'' is L-sequential after P, then the result follows from (1). Otherwise, there must be a  $\leq_L$  cycle in P'' involving all of the actions in  $(E'' \setminus E')$ : If there were no such cycle, then P'' would be L-sequential; if there were elements outside the cycle, then there would be a subset of E'' that satisfies these criteria.

If there is a (po  $\cup$  rfe)-maximal access, we select one of these as e. If e is a write, we reverse the outgoing order in cae; the ability to reverse this order witnesses the race. If e is a read, we switch its fulfilling write to a "newer" one, updating cae; the ability to switch witnesses the race. For example, for P'' on the left below, we choose the P''' on the right; e is the read of x, which races with (Wx1).



It is important that e be (po  $\cup$  rfe)-maximal, not just (ppo  $\cup$  rfe)-maximal. The latter criterion would allow us to choose e to be the read of g, but then there would be no e-similar pomset: if an execution reads 0 for g then there is no read of g, due to the conditional.

In the above argument, it is unimportant whether e reads-from an internal or an external write; thus the argument applies to PwT-MCA<sub>2</sub> as it does for PwT-MCA<sub>1</sub>.

If there is no ( $po \cup rfe$ )-maximal access, then all cross-thread order must be from rfe. In this case, we select a ( $ppo \cup rfe$ )-maximal read, switching its fulfilling write to an "older" one. If there are several of these, we choose one that is po-minimal. As an example, consider the following; once again, e is the read of x, which races with (Wx1).



This example requires (Wx0). Proper initialization ensures the existence of such "older" writes.  $\Box$ 

#### D PwT-MCA: ADDITIONAL EXAMPLES

This appendix includes additional examples. They all apply equally to PwT-MCA<sub>1</sub> and PwT-MCA<sub>2</sub>. Many of these are taken directly from [Jagadeesan et al. 2020]; see there for further discussion.

# D.1 Buffering

Store buffering is allowed, as required by TSO.

$$x := 0; \ y := 0; \ (x := 1; \ r := y \parallel y := 1; \ r := x)$$

$$(SB)$$
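To see why SB is a weak-memory litmus test, it helps to enumerate its sequentially consistent executions. The Python sketch below interleaves the two threads over a single shared memory. The register names r1 and r2 are introduced here for clarity (the program above reuses r), and we assume that the outcome of interest is the one in which both reads return 0.

```python
from itertools import combinations
from typing import Set, Tuple


def sc_outcomes() -> Set[Tuple[int, int]]:
    """All (r1, r2) results of SB under interleaving (SC) semantics."""
    t1 = [("W", "x", 1), ("R", "y", "r1")]   # x := 1; r1 := y
    t2 = [("W", "y", 1), ("R", "x", "r2")]   # y := 1; r2 := x
    n = len(t1) + len(t2)
    outcomes = set()
    # Choose which positions of the interleaving belong to thread 1.
    for slots in combinations(range(n), len(t1)):
        i1, i2 = iter(t1), iter(t2)
        mem = {"x": 0, "y": 0}               # x := 0; y := 0
        regs = {}
        for pos in range(n):
            kind, loc, arg = next(i1) if pos in slots else next(i2)
            if kind == "W":
                mem[loc] = arg
            else:
                regs[arg] = mem[loc]
        outcomes.add((regs["r1"], regs["r2"]))
    return outcomes


print(sorted(sc_outcomes()))
print((0, 0) in sc_outcomes())
```

No interleaving produces (0, 0), so a model that allows this outcome is strictly weaker than sequential consistency.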

Load buffering is allowed, as required by Arm8.

$$r := y \; ; \; x := 1 \parallel r := x \; ; \; y := 1$$

$$(LB)$$

#### D.2 Thin-Air

 Thin air is disallowed. [Pugh 2004, TC4]:

$$y := x \parallel r := y; x := r$$

$$(Rx1) \longrightarrow (Ry1) \longrightarrow (Wx1)$$
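The thin-air outcome is ruled out by a cycle in the required order. A minimal Python sketch of this check follows, using the standard-library `graphlib` module (Python 3.9 or later). The edge list is our reading of the example and is an assumption: data dependencies from each read to the write that uses its value, plus reads-from edges from each write to the read it fulfills.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Iterable, Set, Tuple


def has_cycle(edges: Iterable[Tuple[str, str]]) -> bool:
    """True when the directed graph given by `edges` contains a cycle."""
    preds: Dict[str, Set[str]] = {}
    for before, after in edges:
        preds.setdefault(before, set())
        preds.setdefault(after, set()).add(before)
    try:
        # static_order raises CycleError when no topological order exists.
        list(TopologicalSorter(preds).static_order())
    except CycleError:
        return True
    return False


# y := x  ||  r := y; x := r, with every access reading or writing 1.
dependencies = [("Rx1", "Wy1"), ("Ry1", "Wx1")]
reads_from = [("Wy1", "Ry1"), ("Wx1", "Rx1")]

print(has_cycle(dependencies + reads_from))
print(has_cycle(dependencies + reads_from[:1]))
```

Dropping either reads-from edge breaks the cycle, which corresponds to that read being fulfilled by some other write.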

The control variant ([Pugh 2004, TC13]) is also disallowed:

$$\text{if}(x)\{y:=1\} \parallel \text{if}(y)\{x:=1\}$$

$$(Rx1) \qquad (Wy1) \qquad (Wx1)$$

[Jagadeesan et al. 2020, §2]

$$y := x \parallel r := y; \text{ if } (r)\{x := r; z := r\} \text{ else } \{x := 2\}$$

$$(\text{OOTA3})$$

[Jeffrey and Riely 2019, §8] and [Jagadeesan et al. 2020, §6]:

$$y := x \parallel r := y; \text{ if } (b)\{x := r; z := r\} \text{ else } \{x := 1\} \parallel b := 1$$

$$(Rx1) \longrightarrow (Ry1) \longrightarrow (Wz1) \longrightarrow (Rb1) \longleftarrow (Wb1)$$

[Svendsen et al. 2018, RNG] is disallowed since there is no write to fulfill (Ry1).

$$(y := x+1 \parallel x := y)$$

$$(Rx1) \longrightarrow (Wy2) \qquad (Ry1) \longrightarrow (Wx1)$$

OOTA7 is allowed by PS, but not WEAKESTMO [Chakraborty and Vafeiadis 2019, Fig. 3]:

$$x := 2; \text{ if } (x \neq 2) \{y := 1\} \parallel x := 1; r := x; \text{ if } (y) \{x := 3\}$$

$$(\text{OOTA7})$$

OOTA4 is similar to TC5 [Pugh 2004]:

$$y := x \parallel x := y \parallel z := 0; z := 1 \parallel x := z$$

$$(TC5)$$

The justification for forbidding this execution states:

values are not allowed to come out of thin air, even if there are other executions in which the thin-air value would have been written to that variable by some not out-of-thin air means.

OOTA4 is an interesting border case, since it is allowed by speculative models (§A.1).

[Todo: What's the point?] We presented two examples of thin-air behavior involving address calculation in §8.4. The justification for TC12 states:

Since no other thread accesses [either [0] or [1]], the code for [the second] thread should be equivalent to:

$$r := y; [r] := 0; (s := if(r=0)\{0\} else\{1\}); x := s;$$

With this code, it is clear that this is the same situation as test 4.

[Jagadeesan et al. 2020, §6]:

Boehm's [2019] RFUB example presents another potential form of OOTA behavior. Our analysis shows that there is no OOTA behavior in RFUB, only a false dependency:

$$[r := y; x := r] \not\supseteq [r := y; if(r \neq 1) \{z := 1; r := 1\}; x := r]$$
 (RFUB)

The left command is half of OOTA3 (y := x). The right command is dubbed Rfub, for *Register assignment From an Unexecuted Branch*. Boehm observes that in the context  $x := y \parallel [-]$ , these programs have different behaviors. Yet the OOTA example on the left never writes 1. Why should the unexecuted branch change that? Because of the conditional, the write to x in Rfub is independent of the read from y. It is useful to consider the Hoare logic formulas satisfied by the two threads above: we have  $\{tt\}$  Rfub  $\{x=1\}$  for the right thread of Rfub, but not  $\{tt\}$  OOTA3  $\{x=1\}$  for the right thread of OOTA3. The change in the thread from OOTA3 to Rfub is not a valid refinement under Hoare logic; thus, it is expected that Rfub may have additional behaviors.

RFUB New Constructor:

$$y := x \parallel r := y; if (r = \text{null}) \{r := \text{new C}()\}; x := r; r.f() \qquad (\text{RFUB-NC})$$

This is similar to:

$$y := x \parallel r := y; if (r=0)\{r := random()\}; x := r; if (r)\{z := 1\}$$

And different from the following, which is similar to TC18:

$$y := x \parallel r := y; if (r=0)\{r := 1\}; x := r; if (r)\{z := 1\}$$

#### D.3 Coherence

The following execution is disallowed by fulfillment (M7a and M7b). It is also disallowed by C11 and Java.

$$x := 1; r := x \parallel x := 2; s := x$$

$$(COH)$$

M7b requires that we order one write with respect to the other, either before the write or after the read (and therefore after the write). Suppose we pick 1 before 2, as shown. This satisfies M7b for (Rx2). But to satisfy the requirement for (Rx1) we must have either (Wx2) < (Wx1) or (Rx1) < (Wx2). Either way, we have a cycle.
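The argument above can be confirmed by brute force. The Python sketch below enumerates every linear order of the four events and keeps those that satisfy an M7b-style condition. It rests on assumptions that we state explicitly: a fulfilling write precedes the read it fulfills, a write precedes a later read of the same location in the same thread (the (Wx, Rx) case of coherence delay), and it suffices to search linear orders since any partial order meeting these constraints has a linear extension that also meets them. Event names are hypothetical.

```python
from itertools import permutations
from typing import Iterable, List, Sequence, Tuple

Edge = Tuple[str, str]


def linear_orders(events: Sequence[str], before: Iterable[Edge],
                  rf: Iterable[Edge], writes: Iterable[str]) -> List[Tuple[str, ...]]:
    """Linear orders that respect `before` and `rf` and the M7b-style check."""
    before, rf, writes = list(before), list(rf), list(writes)
    found = []
    for order in permutations(events):
        pos = {e: i for i, e in enumerate(order)}
        if any(pos[a] > pos[b] for a, b in before + rf):
            continue
        # Every other write c is before the fulfilling write d or after the read e.
        if all(pos[c] < pos[d] or pos[e] < pos[c]
               for d, e in rf for c in writes if c != d):
            found.append(order)
    return found


# x := 1; r := x  ||  x := 2; s := x
events = ["t1.W1", "t1.R", "t2.W2", "t2.R"]
writes = ["t1.W1", "t2.W2"]
same_thread = [("t1.W1", "t1.R"), ("t2.W2", "t2.R")]

crossed = [("t2.W2", "t1.R"), ("t1.W1", "t2.R")]  # each read sees the other thread's write
local = [("t1.W1", "t1.R"), ("t2.W2", "t2.R")]    # each read sees its own thread's write

print(len(linear_orders(events, same_thread, crossed, writes)))
print(len(linear_orders(events, same_thread, local, writes)))
```

The crossed outcome admits no order, matching the cycle described above, while the outcome in which each thread reads its own write remains possible.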

Our model is more coherent than Java, which permits the following:

$$r := x; x := 1 \parallel s := x; x := 2$$

$$(TC16)$$

We also forbid the following, which Java allows:

$$x := 1; \ y^{\text{rel}} := 1 \parallel x := 2; \ z^{\text{rel}} := 1 \parallel r := z^{\text{ra}}; \ r := y^{\text{ra}}; \ r := x; \ r := x$$

$$(\text{Co3})$$

The following outcome is allowed by the promising semantics [Kang et al. 2017], but not in WEAKESTMO [Chakraborty and Vafeiadis 2019, Fig. 3]. We disallow it:

$$x := 2; if (x \neq 2) \{y := 1\} \parallel x := 1; r := x; if (y) \{x := 3\}$$

(COH-CYC)

C11 includes read-read coherence between relaxed atomics in order to forbid the following. We do not order reads by intra-thread coherence, and thus allow the following:

$$x := 1; x := 2 \parallel y := x; z := x$$

$$(co2)$$

$$(Wx1) \longrightarrow (Rx2) \longrightarrow (Rx1) \longrightarrow (Wz1)$$

Here, the reader sees 2 then 1, although they are written in the reverse order.

We also allow the following, similar execution:

$$x := 1; x := 2 \parallel r_1 := x; r_2 := x; r_3 := x;$$

$$(Wx1) \longrightarrow (Rx2) \longrightarrow (Rx1) \longrightarrow (Rx2)$$

Pugh [1999, §2.3] presented the following example to show that Java's original memory model required alias analysis to validate common subexpression elimination (CSE).

$$r_1 := x; r_2 := z; r_3 := x; if (r_3 \le 1) \{ y := r_2 \}$$

Coalescing the two reads of x is obviously allowed if  $z\neq x$ . But if z=x, coalescing is only permitted because we do not include read-read pairs in  $\bowtie_{co}$  (§3.2):

$$\bowtie_{co} = \{(\mathsf{W}x, \mathsf{W}x), (\mathsf{R}x, \mathsf{W}x), (\mathsf{W}x, \mathsf{R}x)\}$$

C11 has read-read coherence, and therefore CSE is only valid up to alias analysis in C11.
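The role of  $\bowtie_{co}$  in this example can be made concrete with a small Python sketch. The encoding of an access as a (kind, location) pair and the function name are assumptions made for illustration; the set of pairs is the one displayed above.

```python
from typing import Tuple

Access = Tuple[str, str]  # (kind, location), with kind "R" or "W"

# The coherence-delay relation on access kinds: no ("R", "R") pair.
CO_DELAY = {("W", "W"), ("R", "W"), ("W", "R")}


def coherence_delays(first: Access, second: Access) -> bool:
    """True when coherence requires `first` to stay ordered before `second`."""
    (k1, loc1), (k2, loc2) = first, second
    return loc1 == loc2 and (k1, k2) in CO_DELAY


# r1 := x; r2 := z; r3 := x, in the aliased case z = x: three reads of x.
reads = [("R", "x"), ("R", "x"), ("R", "x")]
print(any(coherence_delays(a, b)
          for i, a in enumerate(reads) for b in reads[i + 1:]))
print(coherence_delays(("W", "x"), ("R", "x")))
```

Since no pair of reads is delayed, the first and third reads can be coalesced without consulting an alias analysis.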

#### D.4 RA

 Our model is closer to strong RA (SRA) [Lahav and Boker 2020; Lahav et al. 2016], than RA, as in C11 and RC11. For example, RC11 allows the following, which we disallow:

$$x := 2; y^{\text{rel}} := 1; r := y \parallel y := 2; x^{\text{rel}} := 1; s := x$$
 $(SRA)$ 

#### D.5 MCA

Here are a few litmus tests that distinguish MCA architectures from non-MCA architectures. MCA1 is an example of write subsumption [Pulte et al. 2018, §3]:

$$\begin{array}{c} \text{if}(z)\{x:=0\}; \ x:=1 \parallel \text{if}(x)\{y:=0\}; \ y:=1 \parallel \text{if}(y)\{z:=0\}; \ z:=1 \\ \hline (Rz1) \longleftarrow (Wx1) \longrightarrow (Rx1) \longrightarrow (Wy1) \longrightarrow (Ry1) \longrightarrow (Wz0) \longrightarrow (Wz1) \\ \end{array}$$

Two thread variant:

$$if(x)\{y := 0\}; y := 1 \parallel if(y)\{x := 0\}; x := 1$$

$$(Rx1) \leftarrow (Wy0) \rightarrow (Wy1) \rightarrow (Ry1) \rightarrow (Wx0) \rightarrow (Wx1)$$

IRIW is allowed if all accesses are relaxed, but not if the initial reads are acquiring:

$$x := 1 \parallel r := x^{\mathsf{ra}}; \ s := y \parallel y := 1 \parallel s := y^{\mathsf{ra}}; \ r := x$$

$$(\mathsf{R}^{\mathsf{ra}} x 1) \longrightarrow (\mathsf{R} y 0) \longrightarrow (\mathsf{R}^{\mathsf{ra}} y 1) \longrightarrow (\mathsf{R} x 0)$$

$$(\mathsf{IRIW})$$

MCA2 is a simplified version of IRIW:

$$x := 0; x := 1 \parallel y := x \parallel r := y^{\mathsf{ra}}; s := x$$

$$(\mathsf{W}x0) \longrightarrow (\mathsf{R}x1) \longrightarrow (\mathsf{R}y1) \longrightarrow (\mathsf{R}x0)$$

[Flur et al. 2016] and [Lahav and Vafeiadis 2016, Fig. 4] discuss the following, which is not valid in Arm8, although it was valid under some earlier sketches of the model:

$$r := x; x := 1 \parallel y := x \parallel x := y$$

$$(Rx1) \qquad (Rx1) \qquad (Ry1) \qquad e : (Wx1)$$

$$(\text{MCA3})$$

These candidate executions are invalid, due to cycles.

#### D.6 Detour

 The following example [Podkopaev et al. 2019, Ex. 3.7] is disallowed by IMM by including a detour relation. It is also disallowed by PS.

$$x := z-1; y := x \parallel x := 1 \parallel z := y$$

$$(Rz1) \qquad (Ry1) \qquad (Ry1) \qquad (Wz1)$$

$$(Wx1) \qquad (Wz1) \qquad (Wz1)$$

# D.7 Read-Read Dependencies and Java Final Field Semantics Versus If-Closure

One might worry that the lack of read-read dependencies could cause DRF-sc to fail. For example, the following execution has a control dependency between the reads of the last thread, but this order is enforced neither by our model nor by Arm8.

$$z := 1; y^{\text{rel}} := 1 \parallel r := y^{\text{acq}}; x^{\text{rel}} := 1 \parallel \text{if}(x) \{ s := z \}$$

$$(Wz1) \longrightarrow (R^{\text{acq}}y1) \longrightarrow (Rx1) \qquad (Rz0)$$

If the first read of the last thread is acquiring, then the execution is disallowed, since acquiring reads are ordered with respect to the reads that follow.

$$z := 1; y^{\text{rel}} := 1 \parallel r := y^{\text{acq}}; x^{\text{rel}} := 1 \parallel \text{if}(x^{\text{acq}}) \{ s := z \}$$

$$(Wz1) \longrightarrow (R^{\text{acq}}y1) \longrightarrow (R^{\text{acq}}x1) \longrightarrow (Rz0)$$

Arm8 enforces address dependencies between reads, but not control dependencies. To support case analysis (AKA if-closure), we drop all dependencies between reads. This, in turn, invalidates Java's final field semantics.

$$(r := 1; [r] := 0; [r] := 1; x^{\text{rel}} := r) \parallel (r := x^{\text{acq}}; s := [r])$$
 $(\text{ADDR2})$ 

The acquire annotation is required to ensure publication. If address dependencies were enforced between reads then the acquire annotation could be dropped. However, the compiler would need to track address dependencies in order to ensure that case analysis did not convert them to control dependencies.

### D.8 Local Invariant Reasoning and Value Range Analysis

We have already seen TC1 in §3.8, TC2 in §8.1 and TC6 in §6. Here is the complete program for TC6:

$$y := 0; (r := y; if(r=0)\{x := 1\}; if(r=1)\{x := 1\}) \parallel (if(x=1)\{y := 1\})$$

$$(Wy0) \qquad (Ry1) \qquad (Ry1) \qquad (Wx1) \qquad (Wx1)$$

$$\phi = (1=r \lor 0=r) \Rightarrow (r=0 \lor r=1)$$

[Todo: Discuss.]

 Here are some additional examples:

$$y := 0; (r := y; x := 1 + r * r - r) \parallel (y := x)$$

$$(Wy0) \qquad (Ry1) \qquad (Ry1) \qquad (Wx1)$$

$$\phi = (1 = r \lor 0 = r) \Rightarrow 1 + r * r - r = 1$$

$$(TC8)$$
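The precondition for TC8 can be checked by brute force. The following Python sketch is only an illustration: the function name `tc8_precondition` and the finite range of candidate values for r are assumptions made for the example.

```python
def tc8_precondition(r):
    """Evaluate (1 = r or 0 = r) => (1 + r*r - r = 1) for a concrete value of r."""
    premise = (r == 1) or (r == 0)
    conclusion = (1 + r * r - r == 1)
    return (not premise) or conclusion

# Assumption: a small finite range of integers stands in for all values.
holds_everywhere = all(tc8_precondition(r) for r in range(-5, 6))

# Outside the premise the written value is not 1, e.g. r = 2 gives 1 + 4 - 2.
value_at_two = 1 + 2 * 2 - 2
```

The implication holds for every candidate because the premise restricts r to 0 or 1, and both give the value 1.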

$$x := 0; (r := x; if(r \ge 0) \{y := 1\} \parallel x := y \parallel x := -2)$$

$$(Wx0) \longrightarrow (Rx1) \qquad (0 \ge 0 \mid Wy1) \longrightarrow (Ry1) \longrightarrow (Wx-2)$$

$$(TC9)$$

$$x := 1; a^{ra} := 1; if(z^{ra})\{y := x\} \parallel if(a^{ra})\{x := 2; z^{ra} := 1\}$$

$$(Wx1) \qquad (R^{acq}a1) \qquad (Wx2) \qquad (R^{acq}b1) \qquad (Rx1)$$

$$(INTERNAL1)$$

$$r := x; y^{ra} := 1; s := y; z := s \parallel x := z$$

$$(INTERNAL2)$$

Java Causality Test Case 18 asks that we justify the following execution:

$$x := 0; (x := y \parallel r := x; if(r=0)\{x := 1\}; s := x; y := s)$$

$$(x_0) \qquad (x_1) \qquad (x_1) \qquad (x_1) \qquad (x_2) \qquad (x_3) \qquad (x_4) \qquad$$

Before we prefix x := 0, the precondition of Wy1 is:

$$\phi \equiv (1=r \lor x=r) \Rightarrow ([r=0 \land ((1=s \lor 1=s) \Rightarrow s=1)] \lor [r\neq 0 \land ((1=s \lor x=s) \Rightarrow s=1)])$$

Simplifying:

$$\phi \equiv (1=r \lor x=r) \Rightarrow (r=0 \lor [r\neq 0 \land ((1=s \lor x=s) \Rightarrow s=1)])$$

Prefixing x := 0:

$$\phi \equiv (1=r \lor 0=r) \Rightarrow (r=0 \lor [r\neq 0 \land ((1=s \lor 0=s) \Rightarrow s=1)])$$

Drilling into the interesting part:

$$\phi \equiv 1 = r \Rightarrow ((1 = s \lor 0 = s) \Rightarrow s = 1)$$

This is not a tautology. But we get one by coalescing s and r:

$$(Wx0) \qquad (Ry1) \qquad (Rx1) \qquad (\phi \mid Wy1)$$

$$\phi \equiv 1 = r \Rightarrow ((1 = r \lor 0 = r) \Rightarrow r = 1)$$
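A brute-force check makes the difference between the two formulas concrete. This Python sketch is illustrative only: the helper names and the finite value domain are assumptions, and the formulas are the "interesting part" of the precondition before and after coalescing s and r.

```python
from itertools import product

def implies(p, q):
    return (not p) or q

def phi_split(r, s):
    # 1 = r  =>  ((1 = s or 0 = s)  =>  s = 1)
    return implies(r == 1, implies(s == 1 or s == 0, s == 1))

def phi_coalesced(r):
    # Same formula with s replaced by r.
    return implies(r == 1, implies(r == 1 or r == 0, r == 1))

# Assumption: a small finite range of integers stands in for all values.
DOMAIN = range(-2, 4)

counterexamples = [(r, s) for r, s in product(DOMAIN, repeat=2)
                   if not phi_split(r, s)]
coalesced_is_tautology = all(phi_coalesced(r) for r in DOMAIN)
```

The only falsifying assignment of the split formula in this domain is r = 1, s = 0, which is exactly the case ruled out once the two reads are coalesced.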


TC20 splits the first thread of TC18:

$$x := 0; (x := y \# r := x; if(r=0)\{x := 1\}); s := x; y := s$$
 $(TC20)$ 

Because we take register state from the right, the example is the same as for TC18 above.

TC17 replaces the condition r=0 by  $r\neq 1$  in TC18:

$$\phi \equiv (1 = r \lor x = r) \Rightarrow ([r \neq 1 \land ((1 = s \lor 1 = s) \Rightarrow s = 1)] \lor [r = 1 \land ((1 = s \lor x = s) \Rightarrow s = 1)])$$

Simplifying and prefixing x := 0:

$$\phi \equiv (1=r \lor 0=r) \Rightarrow (r \neq 1 \lor [r=1 \land ((1=s \lor 0=s) \Rightarrow s=1)])$$

Again, we have:

$$\phi \equiv 1 = r \Rightarrow ((1 = s \lor 0 = s) \Rightarrow s = 1)$$

which is not a tautology. But we get one by coalescing s and r.

TC19 makes the same change to TC20, and is justified for the same reason.

# D.9 Commuting release and acquire

[Todo: Discuss.]

RA example. This is impossible, since Rx1 is unfulfilled.

$$x := 1; a^{\text{rel}} := 1; r := b^{\text{acq}}; s := x; y := r + s \parallel r := a^{\text{acq}}; x := 2; b^{\text{rel}} := 10$$

$$(Wx1) \leftarrow (W^{\text{rel}}a1) \leftarrow (Rx1) \leftarrow (Wy11)$$

$$(R^{\text{acq}}a1) \leftarrow (Wx2) \leftarrow (W^{\text{rel}}b10)$$

If you swap the release and acquire, then it is impossible for the second thread to get in the middle.

$$x := 1; r := b^{\text{acq}}; a^{\text{rel}} := 1; \parallel r := a^{\text{acq}}; x := 2; b^{\text{rel}} := 10$$

$$(Wx1) \qquad (W^{\text{rel}}a1) \qquad (Wx2) \qquad (W^{\text{rel}}b10)$$

$$(Wx1) \qquad (Wx2) \qquad (Wx2)$$

In this case, the following execution is possible:

But not:

$$x := 1; r := b^{\text{acq}}; a^{\text{rel}} := 1; s := x; y := r + s \parallel r := a^{\text{acq}}; x := 2; b^{\text{rel}} := 10$$

$$(Wx1) \qquad (R^{\text{acq}}b10) \qquad (Wy11)$$

$$x := 1; r := b^{\text{acq}}; a^{\text{rel}} := 10$$

$$(Wx1) \qquad (R^{\text{acq}}b10) \qquad (Rx1) \qquad (Wy11)$$

Proc. ACM Program. Lang., Vol. 0, No. POPL, Article 0. Publication date: January 2022.

### D.10 Sevcik examples

[Todo: Discuss.]


Cenciarelli et al. [2007, §7] example. (I incorrectly credit Sevčík and Aspinall [2008].)

$$if(x \land y)\{z := 1\} \parallel if(z)\{x := 1; y := 1\} else\{y := 1; x := 1\}$$

Examples from [Sevčík and Aspinall 2008, §4.1] are interesting. Redundant write after read elimination:

```
|| lock m2; x=1; unlock m2
|| lock m1; x=2; unlock m1
|| lock m1; lock m2; r1=x; [x=r1;] r2=x; unlock m2; unlock m1 // [bracketed line removed]
```

Even without the write, r1 and r2 must see the same values, whereas JMM allows different values for the reads when the write is missing.

Redundant read after read elimination:

```
|| y=x
|| r2=y; if (r2==1){[r3=y]; x=r3}else{x=1} // [r3=r2]
```

The interesting case is the left Wx1. Initially it has the predicate  $r_3 = 1$ . With the read rule, we have y = 1. In read prefixing, we don't weaken. Instead we weaken with the read into r2.

```
if(r_2=1)\{r_3 := y; x := r_3\}    if(r_2\neq 1)\{x := 1\}
r_2=1 \mid Ry1    r_2=1 \land y=1 \mid Wx1    r_2\neq 1 \mid Wx1

if(r_2=1)\{r_3 := y; x := r_3\} else\{x := 1\}
r_2=1 \mid Ry1    (r_2=1 \land y=1) \lor (r_2\neq 1) \mid Wx1

r_2 := y; if(r_2=1)\{r_3 := y; x := r_3\} else\{x := 1\}
(Ry1)    (Ry1)    (y=1 \land y=1) \lor (y\neq 1) \mid (Wx1)
```

To ignore the second read, we use the "delay" trick that we used for JMM TC1, but this is fulfilled by a read rather than a write. In any case, the execution with x = y = 1 is allowed.

Roach Motel: reading 1 everywhere is impossible, but becomes possible after swapping r1=x and lock m.

```
|| lock m; x=1; unlock m
|| lock m; x=2; unlock m
|| r1=x; lock m; r2=z; if(r1==2){y=1}else{y=r2}; unlock m
|| z=y
```

So the question is whether you can read all 1 in

```
|| lock m; x=1; unlock m
|| lock m; x=2; unlock m
|| lock m; r1=x; r2=z; if(r1==2){y=1}else{y=r2}; unlock m
|| z=y
```

In any execution, we must have 1 before 2, or 2 before 1.

• If thread sees 2, then read x is 2.


• If thread sees 1, then read x is 1.


```
if(r_1=2)\{y := 1\} else\{y := r_2\}
r_1=2 \lor (r_1\neq 2 \land r_2=1) \mid Wy1

r_1 := x; r_2 := z; if(r_1=2)\{y := 1\} else\{y := r_2\}
(Rx1)    (Rz1)    (Wy1)
```

So it is impossible for both y and z to be 1.

Irrelevant Read Introduction (can I read 1 for both y and z?)

```
|| r=z; if(!r){if(x){y=1}}else{[s=x;]y=r}
|| x=1; z=y

if(!r){if(x){y := 1}}    if(r){s := x; y := r}
r=0 Rx1    r=0 Wy1    r\neq 0 Rx1    r=1 Wy1

if(!r){if(x){y := 1}} else{y := r}
Rx1 \rightarrow r=0 \lor r=1 \mid Wy1

z := 0; r := z; if(!r){if(x){y := 1}} else{y := r}
(Wz0) (Rz1) (Rx1) (Rz1) (Rz1) (Rz1)

if(!r){if(x){y := 1}}    if(r){y := r}
(r=0 \mid Rx1) \rightarrow (r=0 \mid Wy1)    (r=1 \mid Wy1)

if(!r){if(x){y := 1}} else{y := r}
r=0 \mid Rx1 \rightarrow r=0 \lor r=1 \mid Wy1

z := 0; r := z; if(!r){if(x){y := 1}} else{y := r}
```

If z is initialized to 2, rather than 0, then the dependencies remain and both are disallowed. This relies crucially on the fact that par takes order from both sides.

(Wz0) (Rz1) (Rx1)  $(0=0 \lor 0=1)$  (Ry1)

### D.11 SC Access

[Todo: Discuss.]

[Todo: Volatile read = full fence followed by acquire; Volatile write = release followed by full fence. But this is not enough on POWER to guarantee that an all-volatile program has only SC executions. On POWER, release-acquire is implemented with lwsync.]

https://bugs.openjdk.java.net/browse/JDK-8262877

```
volatile int x, y;

Thread 1: x = 2; r1 = y   // 0
Thread 2: y = 1
Thread 3: r2 = y; x = 1   // 1
Thread 4: r3 = x; r4 = x  // 1,2
```


The state  $(r_1,r_2,r_3,r_4) = (0,1,1,2)$  is forbidden, as it violates sequential consistency. (You can show this by constructing the synchronization order that leads to this result and observing that it is cyclic.)
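The claim can also be checked by enumerating every sequentially consistent interleaving of the four threads. The following Python sketch is an illustration under stated assumptions: the tuple encoding of operations, the helper names, and the initial values x = y = 0 (the Java default for int fields) are not taken from the bug report.

```python
def interleavings(threads):
    """Yield every interleaving of the per-thread operation lists."""
    if all(not t for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail

def run(schedule):
    """Execute one interleaving against a single shared memory (SC semantics)."""
    mem = {"x": 0, "y": 0}   # assumption: both fields start at 0
    regs = {}
    for op, loc, arg in schedule:
        if op == "W":
            mem[loc] = arg        # arg is the value written
        else:
            regs[arg] = mem[loc]  # arg is the destination register
    return (regs["r1"], regs["r2"], regs["r3"], regs["r4"])

THREADS = [
    [("W", "x", 2), ("R", "y", "r1")],      # Thread 1: x = 2; r1 = y
    [("W", "y", 1)],                        # Thread 2: y = 1
    [("R", "y", "r2"), ("W", "x", 1)],      # Thread 3: r2 = y; x = 1
    [("R", "x", "r3"), ("R", "x", "r4")],   # Thread 4: r3 = x; r4 = x
]

sc_outcomes = {run(s) for s in interleavings(THREADS)}
forbidden_reachable = (0, 1, 1, 2) in sc_outcomes
```

The enumeration never produces (0, 1, 1, 2): r1 = 0 and r2 = 1 force x = 2 to precede x = 1, so the fourth thread cannot observe 1 and then 2.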

PPC is one (the only?) platform that runs into the SC violation with the current barrier placement. The current placement seems to be:

Violation of SC-DRF from [Watt et al. 2020, Fig. 9]:

 Thread 1:  $x^{sc} = 1$ 

 Thread 2:  $x^{sc} = 2$ ;  $r = x^{sc}$ ; if (r==1) then s=x

The program is DRF. It should not be possible to have r==1, s==2.

[Dolan et al. 2018, §8.2]:

$$r := y; x^{\text{sc}} := 1; s := x \parallel x^{\text{sc}} := 2; y := 1$$

$$(\text{Sc1})$$

Watt et al. [2020, §3.1]:

$$x^{\text{sc}} := 1; \ r := y^{\text{sc}} \parallel y^{\text{sc}} := 1; \ y^{\text{sc}} := 2; \ x := 2; \ s := x^{\text{sc}}$$

$$(\text{Sc2})$$

#### D.12 Fences

[Todo: Discuss.]

$$x := 0; x := 1; \mathsf{F}^{\mathsf{rel}}; y := 1 \parallel r := y; \mathsf{F}^{\mathsf{acq}}; s := x$$

$$(\mathsf{W}x0) \longrightarrow (\mathsf{W}x1) \longrightarrow (\mathsf{F}^{\mathsf{rel}}) \longrightarrow (\mathsf{W}y1) \longrightarrow (\mathsf{R}y1) \longrightarrow (\mathsf{F}^{\mathsf{acq}}) \longrightarrow (\mathsf{R}x0)$$

$$(\mathsf{PUB}2)$$

[Lahav et al. 2017, Fig. 5]:

$$x := 1 \parallel r := x; F^{sc}; r := y \parallel y := 1; F^{sc}; r := x$$

$$(Sc3)$$

[Lahav et al. 2017, Fig. 6]

$$x := 1; z^{\mathsf{ra}} := 1; \parallel r := z^{\mathsf{acq}}; \mathsf{F}^{\mathsf{sc}}; r := y \parallel y := 1; \mathsf{F}^{\mathsf{sc}}; r := x$$

$$(\mathsf{W}x1) \longrightarrow (\mathsf{W}^{\mathsf{rel}}z1) \longrightarrow (\mathsf{R}^{\mathsf{acq}}z1) \longrightarrow (\mathsf{F}^{\mathsf{sc}}) \longrightarrow (\mathsf{R}\,y0) \longrightarrow (\mathsf{W}\,y1) \longrightarrow (\mathsf{R}\,x0)$$

$$(\mathsf{Sc}4)$$

Here are several examples mixing fencing with release/acquire:

$$x := 1; y^{ra} := 1 \parallel r := y^{acq}; s := x$$

$$(Wx1) \longrightarrow (R^{acq}y1) \longrightarrow (Rx0)$$

$$x := 1; F^{rel}; y := 1 \parallel r := y^{acq}; s := x$$

$$(Wx1) \leftarrow (F^{rel}) \leftarrow (Wy1) \rightarrow (Ry1) \leftarrow (Rx0)$$

$$x := 1; y^{\text{ra}} := 1 \parallel r := y; F^{\text{acq}}; s := x$$

$$(Wx1) \longrightarrow (Ry1) \longrightarrow (F^{\text{acq}}) \longrightarrow (Rx0)$$

$$x := 1; F^{\text{rel}}; y := 1 \parallel r := y; F^{\text{acq}}; s := x$$

$$(Wx1) \qquad (Ry1) \qquad (Rx0)$$


[Podkopaev et al. 2019, §D]: The following execution graph is not consistent in the promise-free declarative model of [Kang et al. 2017]. Nevertheless, its mapping to POWER (obtained by simply replacing Fsc with Fsync) is POWER-consistent and po  $\cup$  rf is acyclic (so it is Strong-POWER-consistent). Note that, using promises, the promising semantics allows this behavior.

$$r := z; F^{sc}; x := 1 \parallel x := 2; F^{sc}; y := 1 \parallel r := y; z := 1$$

$$Rz1 \longrightarrow F^{sc} \longrightarrow Wx1 \longrightarrow Wx2 \longrightarrow F^{sc} \longrightarrow Wy1 \longrightarrow Ry1 \longrightarrow Wz1$$

Allowed behavior on POWER... Is there a dependency in the last thread? If so, this is a problem.

[Podkopaev et al. 2019, §8]: To establish the correctness of compilation of the promising semantics to POWER, Kang et al. [2017] followed the approach of Lahav and Vafeiadis [2016]. This approach reduces compilation correctness to POWER to (i) the correctness of compilation to the POWER model strengthened with po  $\cup$  rf acyclicity; and (ii) the soundness of local reorderings of memory accesses. To establish (i), Kang et al. [2017] wrongly argued that the strengthened POWER-consistency of mapped promise-free execution graphs imply the promise-free consistency of the source execution graphs. This is not the case due to SC fences, which have relatively strong semantics in the promise-free declarative model (see [Podkopaev et al. 2018, Appendix D] for a counter example). Nevertheless, our proof shows that the compilation claim of Kang et al. [2017] is correct.

#### D.13 RMWs

 If RMWs simply use the same semantics as read and write, then we allow LDRF-PF-FAIL, which is used to show failure of LDRF-sc for the promising semantics in [Cho et al. 2021].

$$y := 0; \text{if}(y)\{\text{if}(!\mathsf{CAS}(x, 0, 1))\{\text{if}(z)\{x := 2\}\}\} \parallel y := 1; \text{if}(1 \neq \mathsf{CAS}(x, 0, 3))\{z := 1\}$$

(LDRF-PF-FAIL)

To disallow this, we need to retain the dependency  $(Rx2) \rightarrow (Wz1)$ . For this, we need to avoid the substitution for x. This is why we use READ' instead of READ in the independent case for RMWs.

It is not possible for two RMWs to see the same write.

$$x := 0; (\mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1) \parallel \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1))$$

$$(\mathsf{R}x0) \qquad (\mathsf{R}x0) \qquad (\mathsf{R}x0) \qquad (\mathsf{R}x0)$$

The gray arrow is required by the RMW atomicity axioms.

Lee et al. [2020] introduce PS2.0 to refine the treatment of RMWs in the promising semantics (PS). Their examples have the expected results here, with far less work. First they recall that PS requires quantification over multiple futures in order to disallow executions such as CDRF. (We showed the relaxed variant (CDRF-RLX) in §8.2.)

$$r := \mathsf{FADD}^{\mathsf{acq},\mathsf{rel}}(x,1) \; ; \; \mathsf{if}(r=0) \{ y := 1 \} \parallel r := \mathsf{FADD}^{\mathsf{acq},\mathsf{rel}}(x,1) \; ; \; \mathsf{if}(r=0) \{ \mathsf{if}(y) \{ x := 0 \} \}$$

$$(\mathsf{R}^{\mathsf{acq}}x0) \qquad (\mathsf{W}^{\mathsf{rel}}x1) \qquad (\mathsf{W}y1) \qquad (\mathsf{R}^{\mathsf{acq}}x0) \qquad (\mathsf{W}^{\mathsf{rel}}x1) \qquad (\mathsf{R}y1) \qquad (\mathsf{W}x0)$$

This execution is clearly impossible, due to the cycle above. In this diagram, we have not drawn order adjacent to the writes of the RMWs, since this is not necessary to produce the cycle. If CDRF is allowed then DRF-RA fails.

Ps does not support global value range analysis, as modeled by GA+E below. Our semantics permits GA+E:

$$x := 0; (r := \mathsf{CAS}^{\mathsf{rlx},\mathsf{rlx}}(x, 0, 1); if (r < 10) \{y := 1\} \parallel x := 42; x := y)$$

(GA+E)

PS also does not support register promotion, as modeled by RP below. Our semantics permits RP:

$$r := x ; s := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(z,r) ; y := s+1 \parallel x := y$$

$$(\mathsf{R}x1) \qquad (\mathsf{W}y1) \qquad (\mathsf{R}y1) \qquad (\mathsf{R}y1)$$

Example D.1. Recall M10c: if  $\lambda(c)$  overlaps  $\lambda(d)$  and  $d \xrightarrow{rmw} e$  then (1) c < e implies  $c \le d$  and (2) d < c implies  $e \le c$ .

This definition ensures atomicity, disallowing executions such as [Podkopaev et al. 2019, Ex. 3.2]:

$$x := 0; \mathsf{INC}^{\mathsf{rlx},\mathsf{rlx}}(x) \parallel x := 2; r := x$$

$$(Wx0) \longrightarrow (Rx0) \longrightarrow (Wx1)$$

$$(Wx2) \longrightarrow (Rx1)$$

By (1), since  $(Wx2) \rightarrow (Wx1)$ , it must be that  $(Wx2) \rightarrow (Rx0)$ , creating a cycle.

Example D.2. Two successful RMWs cannot see the same write:

$$x := 0; (\mathsf{INC}^{\mathsf{rlx},\mathsf{rlx}}(x) \parallel \mathsf{INC}^{\mathsf{rlx},\mathsf{rlx}}(x))$$

$$(Wx0) \longrightarrow a:Rx0 \longrightarrow b:Wx1 \longrightarrow c:Rx0 \longrightarrow d:Wx1$$

The order from read to write is required by fulfillment. Applying (1) of the second RMW to  $a \to d$ , we have that  $a \to c$ . Subsequently applying (2) of the first RMW, we have  $b \to c$ , creating a cycle.
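The argument can be replayed mechanically by closing the order under transitivity and the two clauses of M10c, then looking for a reflexive pair. The Python sketch below is an illustration only: the event names, the encoding of the order as a set of strict pairs, and the function `saturate` are assumptions, and overlap is taken for granted because every action touches x.

```python
from itertools import product

def saturate(events, order, rmw):
    """Close a set of strict pairs (u, v), read u < v, under transitivity and M10c."""
    order = set(order)
    changed = True
    while changed:
        changed = False
        new = set()
        # Transitivity: u < v and v < w give u < w.
        for (u, v), (v2, w) in product(order, repeat=2):
            if v == v2:
                new.add((u, w))
        # M10c for each RMW pair (d, e) and every other event c.
        for d, e in rmw:
            for c in events:
                if c in (d, e):
                    continue
                if (c, e) in order:
                    new.add((c, d))  # clause (1): c < e implies c <= d
                if (d, c) in order:
                    new.add((e, c))  # clause (2): d < c implies e <= c
        if not new <= order:
            order |= new
            changed = True
    return order

# Example D.2: w0 = Wx0, a = Rx0, b = Wx1, c = Rx0, d = Wx1.
events = ["w0", "a", "b", "c", "d"]
order = {("w0", "a"), ("w0", "c"),   # both reads are fulfilled by Wx0
         ("a", "b"), ("c", "d"),     # read before write within each RMW (assumed)
         ("a", "d"), ("c", "b")}     # read-to-write order required by fulfillment
closed = saturate(events, order, rmw=[("a", "b"), ("c", "d")])
has_cycle = any((u, u) in closed for u in events)

# Control: a single RMW reading Wx0 yields no cycle.
single = saturate(["w0", "a", "b"], {("w0", "a"), ("a", "b")}, rmw=[("a", "b")])
single_has_cycle = any((u, u) in single for u in ["w0", "a", "b"])
```

The saturation reaches a reflexive pair for the two-RMW execution and none for the single-RMW control, matching the derivation in the text.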

*Example D.3.* By using two actions rather than one, the definition allows examples such as the following, which is allowed by Arm8 [Podkopaev et al. 2019, Ex. 3.10]:

$$r := z; s := \mathsf{INC}^{\mathsf{rlx},\mathsf{rel}}(x); y := s+1 \parallel r := y; z := r$$

$$(Rz1) \qquad (Wy1) \qquad (Ry1) \qquad (Wz1)$$

A similar example, also allowed by Arm8 [Chakraborty and Vafeiadis 2019, Fig. 6]:

$$r := z; s := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,r); y := s+1 \parallel r := y; z := r$$

This is allowed by WEAKESTMO, but not PS.


Example D.4. Consider the CDRF example from [Lee et al. 2020]:

$$r := \mathsf{INC}^{\mathsf{acq},\mathsf{rel}}(x) \, ; \, \mathsf{if}(r=0) \{ y := 1 \}$$

$$\parallel r := \mathsf{INC}^{\mathsf{acq},\mathsf{rel}}(x) \, ; \, \mathsf{if}(r=0) \{ \mathsf{if}(y) \{ x := 0 \} \}$$

Example D.5. Consider this example from [Lee et al. 2020, §C]:

$$r := \mathsf{CAS}^{\mathsf{rlx},\mathsf{rlx}}(x,0,1) \; ; \; \mathsf{if}(r \leq 1) \{ y := 1 \}$$
 
$$\parallel r := \mathsf{CAS}^{\mathsf{rlx},\mathsf{rlx}}(x,0,2) \; ; \; \mathsf{if}(r = 0) \{ \mathsf{if}(y) \{ x := 0 \} \}$$
 
$$\boxed{\mathsf{Rx0}} \qquad \boxed{\mathsf{Wx1}} \qquad \boxed{\mathsf{Wy1}} \qquad \boxed{\mathsf{Rx0}} \qquad \boxed{\mathsf{Ry1}} \qquad \boxed{\mathsf{Wx0}}$$

#### D.14 More RMW

The following examples are from [Cho et al. 2021].

CDRF shows that PwT semantics is not too permissive for ra-RMWs. But what about rlx-RMWs? The following execution is allowed by Arm8 and Ps2.0, but disallowed by Ps2.1.

$$r := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,1) \; ; \; y := 1 \parallel r := y \; ; \; s := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(x,r)$$

$$(\mathsf{R}x1) \qquad \qquad (\mathsf{R}y1) \qquad \qquad (\mathsf{R}x0) \qquad \qquad (\mathsf{R}\mathsf{M}\mathsf{W}-\mathsf{W})$$

$$(\mathsf{W}x2) \qquad \qquad (\mathsf{W}x1)$$

Is this  $\{z\}$ -DRF-RA?

$$if(y)\{x := z\} else\{x := 1\} \parallel r := x; z := 1; y := r$$

$$(NAIVE-LDRF-RA-FAIL)$$

Interpreting  $\{z\}$  as ra:



### D.15 Fences and RMW

[Todo: Discuss.]

[Podkopaev et al. 2019, Remark 2, After example 3.1]: Aim: allow the splitting of release writes and RMWs into release fences followed by relaxed operations. In RC11 [Lahav et al. 2017], as well as in C/C++11 [Batty et al. 2011], this rather intuitive transformation, as we found out, is actually unsound.

$$y := 1; x^{\mathsf{ra}} := 1 \parallel \mathsf{INC}^{\mathsf{ra},\mathsf{ra}}(x); x := 3 \parallel r := x^{\mathsf{acq}}; s := y$$

$$(\mathsf{W}y1) \leftarrow (\mathsf{W}^{\mathsf{rel}}x1) \longrightarrow (\mathsf{R}^{\mathsf{acq}}x1) \longrightarrow (\mathsf{W}^{\mathsf{rel}}x2) \rightarrow (\mathsf{W}x3) \longrightarrow (\mathsf{R}^{\mathsf{acq}}x3) \rightarrow (\mathsf{R}y0)$$

(R)C11 disallows the annotated behavior, due in particular to the release sequence formed from the release exclusive write to x in the second thread to its subsequent relaxed write. However, if we split the increment to  $\text{fence}^{\text{rel}}; a := \text{FADD}^{\text{acq},\text{rlx}}(x, 1)$  (which intuitively may seem stronger), the release sequence will no longer exist, and the annotated behavior will be allowed. IMM overcomes this problem by strengthening sw in a way that ensures a synchronization edge for the transformed program as well.

$$y := 1; x^{\mathsf{ra}} := 1 \parallel \mathsf{F}^{\mathsf{rel}}; \mathsf{INC}^{\mathsf{ra,rlx}}(x); x := 3 \parallel r := x^{\mathsf{acq}}; s := y$$

$$(\mathsf{W}y1) \qquad (\mathsf{F}^{\mathsf{rel}}x1) \qquad (\mathsf{R}^{\mathsf{acq}}x3) \qquad (\mathsf{R}y0)$$

We seem to disallow both of these out of the box.

In the case of a relaxed read in the RMW, the outcome is allowed in both cases:

$$y := 1; x^{\mathsf{ra}} := 1 \parallel \mathsf{INC}^{\mathsf{rlx,ra}}(x); x := 3 \parallel r := x^{\mathsf{acq}}; s := y$$

$$(\mathsf{W}y1) \leftarrow (\mathsf{W}^{\mathsf{rel}}x1) \rightarrow (\mathsf{R}x1)^{\mathsf{rmw}}(\mathsf{W}^{\mathsf{rel}}x2) \rightarrow (\mathsf{W}x3) \rightarrow (\mathsf{R}^{\mathsf{acq}}x3) \leftarrow (\mathsf{R}y0)$$

$$y := 1; x^{\mathsf{ra}} := 1 \parallel \mathsf{F}^{\mathsf{rel}}; \mathsf{INC}^{\mathsf{rlx,rlx}}(x); x := 3 \parallel r := x^{\mathsf{acq}}; s := y$$

$$(\mathsf{W}y1) \leftarrow (\mathsf{W}^{\mathsf{rel}}x1) \qquad (\mathsf{F}^{\mathsf{rel}}) \leftarrow (\mathsf{R}x1)^{\mathsf{rel}}(\mathsf{W}x2) \rightarrow (\mathsf{W}x3) \rightarrow (\mathsf{R}^{\mathsf{acq}}x3) \leftarrow (\mathsf{R}y0)$$

#### **E NOT FOR PUBLICATION**

# E.1 Recent discussion on JMM/JDK-dev

Raffaello Giulietti: "JEP 188: Java Memory Model Update" [1], the JMM wiki [2] and the jmm-dev mailing list [3] seem quite inactive. (The latter point explains why I'm posting to this list instead.)

The introduction of j.l.i.VarHandle [4] brought more access modes to Java, but in a narrative and informal way. A paper by Bender & Palsberg [5], addressing the formalization of the concurrent access modes, has been published in 2019 but I'm not sure if it caught the attention of the OpenJDK community.

So what is the current thinking for progressing the JMM spec?

Hans Boehm: I think it's safe to say that it has been slow going, not just for Java, but for other languages as well.

In my view, the core problem, shared by pretty much all of them, is that we don't have an established way to give well-defined semantics to potentially racing unordered accesses, like ordinary variable accesses in Java, or memory\_order\_relaxed accesses in C and C++. That's particularly essential with the traditional Java language-based-security model, since we can't just give up on racing accesses to ordinary variables.

I'm aware of a number of proposed solutions. But I don't think we currently have enough confidence that they

- Are correct, and don't have issues similar to the older models,
- Don't have unintended consequences, particularly for compilation, and
- Are sufficiently comprehensible by programmers to actually be useful.

[Correctness] is hard because the models have gotten complex enough that reviewers are scarce. (A problem that I gather you're familiar with.) The authors are commonly experts at formally analyzing the models, but it's hard to analyze whether the model conflicts with some well-known, but perhaps not well-written-down compilation technique.

Probably even more controversially, I think we've realized that existing compiler technology can compile such racing code in ways that some of us are not 100% sure should really be allowed. Demonstrably unexecuted code can affect the semantics in ways that strike me as scary. (See https://wg21.link/p1217 for a down-to-assembly C++ version; [if I understand correctly], Lochbihler and others earlier came up with some closely related observations for Java.)

It might be possible to do what we've involuntarily done for C++: Punt the hard cases for now, and define what the model is for programs without racing ordinary accesses.

[p1217 is [Boehm 2019].]

Andrew Haley:

> Probably even more controversially, I think we've realized that existing compiler technology can compile such racing code in ways that some of us are not 100% sure should really be allowed.

This implies, does it not, that the problem is not formalization as such, but that we don't really understand what the language is supposed to mean? That's always been my problem with OOTA: I'm unsure whether the problem is due to the inadequacy of formal models, in which case the formalists can fix their own problem, or something we all have to pay attention to.

Hans Boehm: In some sense, I'm not sure either. The p1217 examples bother me in that they seem to violate some global programming rules ("if x is only ever null or refers to an object properly constructed by the same thread, then x should never appear to refer to an incompletely constructed object"). And there seems to be disagreement about whether the currently allowed behavior is "correct."

On the other hand, in practice the weirdness doesn't seem to break things. If you ask people advocating the current behavior, the answer will be that it doesn't matter because nobody writes code that way. If you ask people trying to analyze or verify code, they'll probably be unhappy. And I haven't been able to convince myself that you cannot get yourself into these situations just by linking components together, each of which does something perfectly reasonable.

And there are very common code patterns (like the standard implementation of reentrant locks used by all Java implementations) that break if you allow general OOTA behavior. Which at least means that you can't currently formally verify such code. The theorem you'd be trying to prove is false with respect to the part of the language spec we know how to formalize.

It's a mess.

Andrew Haley:

> Demonstrably unexecuted code can affect the semantics in ways that strike me as scary. (See wg21.link/p1217 for a down-to-assembly C++ version; IIUC, Lochbihler and others earlier came up with some closely related observations for Java.)

Looking again at p1217, it seems to me that enforcing load-store ordering would have severe effects on compilers, at least without new optimization techniques. We hoist loads before loops and sink stores after them. When it all works out, there are no memory accesses in the loop. A load-store barrier in a loop would have the effect of forcing succeeding stores out to memory, and forcing preceding loads to reload from memory. It's not hard to imagine that this would cause an order-of-magnitude performance reduction in common cases.

I suppose one could argue that such optimizations would continue to be valid, so only those stores which would have been emitted anyway would be affected. But that's not how compilers work, as far as I know. In our IR for C2, memory accesses are not pinned in any way, so the only way to make unrelated accesses execute in any particular order is to add a dependency between all loads and stores.

Hans Boehm: I think it would be a fairly pervasive change to optimizers. It has also become clear in WG21, the C++ committee, that there is not enough support for requiring this. In that case, Ou and Demsky have a paper saying that the overhead is likely to be on the order of 1% or less. For Java if it were applied everywhere, it would probably be appreciably higher.

On the other hand, it's a bit harder than that to come up with examples where the generated x86 code has to be worse. Moving loads earlier in the code, or delaying stores, as you suggest, would still be fine. The only issue is with delaying loads past stores, which seems less common, though it can certainly be beneficial for reducing live ranges, probably some vectorization etc.

But it seems unlikely that such a restriction will be applied even to C++ memory\_order\_relaxed, much less Java ordinary variables.

Doug Lea: My stance in the less formal account (http://gee.cs.oswego.edu/dl/html/j9mm.html) as well as Shuyang Liu et al's ongoing formalization (see links from http://compilers.cs.ucla.edu/people/) is that the most you want to say about racy Java programs is that they are typesafe. As in: you can't see a String when expecting an int. Even this looser constraint is challenging to specify, prove, and extend. But it is a path for Java that might not apply to languages like C that are not guaranteed typesafe anyway, and so enter Undefined Behavior territory (as opposed to possibly-unexpected but still typesafe behavior).

Hans Boehm: But this now breaks some common idioms, right? In particular, I think a bunch of code assumes that racing assignments of equivalent primitive values or immutable objects to the same field are OK.

If, in 2004, our view of language-based security had been the same as it is now, then I completely agree that this would have been the right approach. But I think doing it now would require significant user code changes. Which might still be the best way forward ...

### E.2 A Note on Mixed-Mode Data Races

 In preparing this paper, we came across the following example, which appears to invalidate Theorem 4.1 of [Dongol et al. 2019].

$$x := 1; y^{\text{rel}} := 1; r := x^{\text{acq}} \parallel \text{if}(y^{\text{acq}}) \{x^{\text{rel}} := 2\}$$

$$\boxed{Wx1 \longrightarrow W^{\text{rel}}y1} \qquad \boxed{\mathbb{R}^{\text{acq}}y1 \longrightarrow W^{\text{rel}}x2}$$

$$\boxed{\mathbb{R}^{\text{acq}}y1 \longrightarrow W^{\text{rel}}x2}$$

The program is data-race free. The two executions shown are the only top-level executions that include  $(W^{\rm rel}x2)$ .

Theorem 4.1 of [Dongol et al. 2019] is stated by extending execution sequences. In the terminology of [Dongol et al. 2019], a read is L-weak if it is sequentially stale. Let  $\rho = (Wx1)(W^{\rm rel}y1)(R^{\rm acq}y1)(W^{\rm rel}x2)$  be a sequence and  $\alpha = (R^{\rm acq}x1)$ .  $\rho$  is L-sequential and  $\alpha$  is L-weak in  $\rho\alpha$ . But there is no execution of this program that includes a data race, contradicting the theorem. The error seems to be in Lemma A.4 of [Dongol et al. 2019], which states that if  $\alpha$  is L-weak after an L-sequential  $\rho$ , then  $\alpha$  must be in a data race. That is clearly false here, since  $(R^{\rm acq}x1)$  is stale, but the program is data race free.


In proving the SC-LDRF result in [Jagadeesan et al. 2020, §8], we noted that our proof technique is more robust than that of [Dongol et al. 2019], because it limits the prefixes that must be considered. In (¶), the induction hypothesis requires that we add  $(R^{acq}x1)$  before  $(W^{rel}x2)$  since  $(R^{acq}x1) \rightarrow (W^{rel}x2)$ . In particular,

$$(Wx1) \longrightarrow (W^{\text{rel}}y1) \longrightarrow (R^{\text{acq}}y1) \longrightarrow (W^{\text{rel}}x2)$$

is not a downset of (¶), because $(R^{\text{acq}}x1) \rightarrow (W^{\text{rel}}x2)$. As noted in [Jagadeesan et al. 2020, §8], this affects the inductive order in which we move across pomsets, but does not affect the set of pomsets that are considered. In particular,

$$(Wx1) \longrightarrow (W^{\text{rel}}y1) \longrightarrow (R^{\text{acq}}y1)$$

is a downset of (¶).
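
The downset argument can be checked mechanically. The sketch below encodes only the order edges stated in the text for (¶); the event names are ad hoc strings of our choosing, and (¶) may carry further order that is not recoverable here.

```python
# Order edges of (¶) as stated in the text; names are hypothetical labels.
ORDER = {
    ("Wx1", "Wrel_y1"),
    ("Wrel_y1", "Racq_y1"),
    ("Racq_y1", "Wrel_x2"),
    ("Racq_x1", "Wrel_x2"),
}

def is_downset(subset, order):
    """True if every direct predecessor of an event in `subset` is in `subset`.

    Closure under direct predecessors implies closure under all predecessors.
    """
    return all(a in subset for (a, b) in order if b in subset)

rho = {"Wx1", "Wrel_y1", "Racq_y1", "Wrel_x2"}   # the sequence used against Theorem 4.1
prefix = {"Wx1", "Wrel_y1", "Racq_y1"}           # the three-event prefix

print(is_downset(rho, ORDER), is_downset(prefix, ORDER))
```

The four-event set fails the check because the predecessor `Racq_x1` of `Wrel_x2` is missing, while the three-event prefix passes.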

#### F OLD NOTES

## F.1 More optimizations

Sound to strengthen the annotation on an action from rlx to ra, and from ra to sc.

From [Manson et al. 2005]:

- synchronization on thread local objects can be ignored or removed altogether (the caveat to this is the fact that invocations of methods like wait and notify have to obey the correct semantics; for example, even if the lock is thread local, it must be acquired when performing a wait),
- volatile fields of thread local objects can be treated as normal fields,
- redundant synchronization (e.g., when a synchronized method is called from another synchronized method on the same object) can be ignored or removed.

Counterexample for first two:

```
y=1; x^AR=1; r=x^AR; z=1
```

If you see z = 1, you must see y = 1.

It would be nice if we could get at these with a strength-reducing result: synchronization actions can be replaced by relaxed actions in some cases. Then the rules for relaxed read elimination and relaxed write elimination can be used to get rid of them.

## F.2 Examples for semicolon semantics

- Parallel asymmetric: state result for *joint free* programs.
- Subsumption can be allowed on registers only
- We build substitutions
- Ignore substitutions when considering semantic equality.

Value for r in  $(r=1 \mid Wz1)$  from (Wx1):

$$x := 1 \parallel x := 1; r := x; y := r; z := r$$

$$(Wx1) \qquad (Wx1) \qquad (Wx1) \qquad (Wy1)$$

Value for r in  $(r=1 \mid Wz1)$  from (Wx1):

$$x := 2 \parallel x := 1; r := x; if(r>0) \{y := 1\}; if(r>0) \{z := 1\}$$

$$(Wx2) \qquad (Wx1) \qquad (Rx2) \qquad (Wy1) \qquad (Wz1)$$

Note that this also contains a pomset where the value for r in $(r=1 \mid Wy1)$ also comes from (Wx1):

$$x := 2 \parallel x := 1; r := x; if(r>0)\{y := 1\}; if(r>0)\{z := 1\}$$

$$(Wx2) \qquad (Wx1) \qquad (Rx2) \qquad (Wy1) \qquad (Wz1)$$

So our semantics will calculate the least ordered version. Then rely on augmentation to get the others.

It is also possible that the read is necessary to give a value for *r*:

Dependency on two reads:

$$r := x; s := y; \text{if } (r < s) \{z := 1\}$$

Don't need to worry about confusing reads:

But we also have

$$r := x; s := x; \text{if } (s<0) \{z := 1\}$$

$$(rRx1) (sRx2) (Wz1)$$

$$r := x \qquad s := x; \text{ if } (s < 0) \{z := 1\}$$

$$(x/r) \qquad (x < 0) (x/s, 1/z)$$

Part of the state of

Dependency on two reads (No dependency here):

$$r := x; s := x; if(r=s)\{z := 1\}$$

$$rRx1 \quad sRx2 \quad Wz1$$

$$r := x \quad s := x; if(r=s)\{z := 1\}$$

$$rRx1 \quad sRx2 \quad r=x \quad Wz1$$

$$r=x \quad (x/r) \quad r=x \quad (x/s, 1/z)$$

Another example:

Value for r in  $(r < s \mid Wz1)$  from (Wx0):

$$x := 0; r := x; s := y; \text{if } (r < s) \{z := 1\}$$

$$(Wx0) (rRx1) (sRy2) \rightarrow (Wz1)$$

Contrary to submission, reverse subsumption not okay.

$$x := 1 \qquad x := 2$$

$$rRx1 \qquad sRx2$$

$$(1/x) \qquad (\langle \rangle)$$

## F.3 Playing around with 5a and 4b

If we do this, then swap 4b and 4c. In Definition 2.10, take 1-4b of Definition 2.8, rather than all of it. Another

```
r := x; s := x; if(r > 0) {y := 1}; if(s > 0) {z := 1}
r := x; if(r > 0) {y := 1}; s := x; if(s > 0) {z := 1}
rRx1  (sRx2)  (Wy1)  (Wz1)

s := x; r := x; if(r > 0) {y := 1}; if(s > 0) {z := 1}
s := x; if(s > 0) {z := 1}; r := x; if(r > 0) {y := 1}
rRx1  (sRx2)  (Wy1)  (Wz1)

s := x; if(r > 0) {y := 1}; if(s > 0) {z := 1}
sRx2  (r > 0)  (yy1)  (Wz1)
```

$$r := x; s := x$$

$$rRx1 \quad sRx2$$

$$(x/r, x/s)$$

Idea to get rid of 4b and change 5a to the following:

5a. if *e* writes then either  $\kappa'(e)$  implies  $\kappa(e)$ , or some c < e reads *v* from *x* and  $\kappa'(e)$  implies  $\kappa(e)[v/x]$ ,
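
As a hedged illustration of the second disjunct, consider a worked instance of our own choosing (it is not taken from the notes): the program $r := x; y := r$, with $e = (Wy1)$ and, after the substitution $(x/r)$, the precondition $\kappa(e) = (x=1)$. If some $c < e$ reads $1$ from $x$, then

$$\kappa(e) = (x = 1), \qquad c = (Rx1) < e, \qquad \kappa(e)[1/x] = (1 = 1).$$

Since $(1=1)$ is a tautology, every $\kappa'(e)$ implies it, so the proposed 5a is satisfied by this read. A read $c' = (Rx2)$ would instead give $\kappa(e)[2/x] = (2=1)$, which only an unsatisfiable $\kappa'(e)$ implies.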

Need to get rid of 4b because it is sensitive to order of reads.

This change seems sound, because of consistency. But it also fails to validate read reordering on same variable, due to consistency.

Without 4b, we still do not allow:

$$r := x; s := x; y := r; z := r$$

$$(Rx1) \quad (Rx2) \quad (Wy1) \quad (Wz2)$$

The following is not a pomset (consistency):

$$y := r; z := r$$

$$(r=1 \mid Wy1) \quad (r=2 \mid Wz2)$$

Without 4b, we still do not allow:

$$r := x; s := x; y := r; z := s; \text{if } (r=s) \{a := 1\};$$

$$(Rx1) \quad (Rx2) \quad (Wy1) \quad (Wz2) \quad (x=x) \quad (x=x)$$

The following is not a pomset (consistency):

$$y := r; z := s; \text{if } (r=s) \{a := 1\};$$

$$(r=1 \mid Wy1) \quad (s=2 \mid Wz2) \quad (r=s \mid Wa1)$$
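
The two consistency failures above can be reproduced with a brute-force check. The sketch assumes that consistency means the preconditions of all events are jointly satisfiable by some register valuation; that reading, the small value domain, and the encoding of preconditions as Python predicates are our assumptions for illustration.

```python
from itertools import product

def jointly_satisfiable(preconditions, registers, domain=range(5)):
    """Search a small domain for a valuation satisfying every precondition."""
    for values in product(domain, repeat=len(registers)):
        env = dict(zip(registers, values))
        if all(p(env) for p in preconditions):
            return True
    return False

# (r=1 | Wy1) (r=2 | Wz2)
ex1 = [lambda e: e["r"] == 1, lambda e: e["r"] == 2]
# (r=1 | Wy1) (s=2 | Wz2) (r=s | Wa1)
ex2 = [lambda e: e["r"] == 1, lambda e: e["s"] == 2, lambda e: e["r"] == e["s"]]
# (r=s | Wa1) on its own
ex3 = [lambda e: e["r"] == e["s"]]

for name, pre in [("ex1", ex1), ("ex2", ex2), ("ex3", ex3)]:
    print(name, jointly_satisfiable(pre, ["r", "s"]))
```

Under this reading the first two candidates are rejected because no valuation satisfies all preconditions at once, while the single precondition $(r=s \mid Wa1)$ is satisfiable.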

We do allow:

$$r := x; s := x; \text{if } (r=s) \{a := 1\};$$

$$(Rx1) \quad (Rx2) \quad (x=x) \quad (x=x)$$

And also

$$r_1 := x; r_2 := x; r_3 := x; \text{if } (r_3 \le 1) \{y := 1\};$$

$$(Rx0) \quad (Rx2) \quad (Rx1) \quad (1 \le 1 \mid Wy1)$$

But we cannot wait forever to satisfy a precondition. This is not a pomset:

$$r := x; s := x; y := r; z := s$$

$$(Rx3) \quad (Rx4) \quad (x=1) \quad (Wy1) \quad (x=2) \quad (Wz2)$$

Note that reads that we delay must all be consistent.

Also note that we cannot have:

Because the following is not a pomset:

$$b := s; y := r; z := s$$

$$(s=4 \mid Wb4) \qquad (r=1 \mid Wy1)$$

But we can have the following, since there is no order on the reads:

$$r_1 := x; s_1 := x; r_2 := x; s_2 := x; y := r_2; z := s_2$$

$$(Rx1) (Rx2) (Rx3) (Rx4) (Wy1) (Wz2)$$

Because this is indistinguishable from:

$$r_1 := x; s_1 := x; r_2 := x; s_2 := x; y := r_2; z := s_2$$

$$(Rx3) \quad (Rx4) \quad (Rx1) \quad (Rx2) \quad (Wy1) \quad (Wz2)$$

which is the same as:

But we can have:

$$p := x \; ; \; r := x \; ; \; s := x \; ; \; y := r \; ; \; z := s$$

$$(Rx1) \quad (Rx3) \quad (Rx4) \quad (x=1) \quad (Wy1) \quad (x=1) \quad (Wz1)$$

Reads can only swap when their values are interchangeable in the following program.

## F.4 Alan comments

```
x=s; v=r; z=3s+2r

x=s; y=r; z1=s; if(r odd){ z2=1} // using 1 and 3 as the reads
```

### **REFERENCES**

Jade Alglave, Will Deacon, Richard Grisenthwaite, Antoine Hacquard, and Luc Maranget. 2021. Armed Cats: Formal Concurrency Modelling at Arm. ACM Trans. Program. Lang. Syst. 43, 2, Article 8 (July 2021), 54 pages. https://doi.org/10.1145/3458926

Mark Batty. 2015. The C11 and C++11 concurrency model. Ph.D. Dissertation. University of Cambridge, UK.

Hans-J. Boehm. 2019. Out-of-thin-air, Revisited, Again (Revision 2). https://wg21.link/p1217.

Pietro Cenciarelli, Alexander Knapp, and Eleonora Sibilio. 2007. The Java Memory Model: Operationally, Denotationally, Axiomatically. In *Programming Languages and Systems, 16th European Symposium on Programming, ESOP 2007, Braga, Portugal, March 24 - April 1, 2007, Proceedings (Lecture Notes in Computer Science, Vol. 4421), Rocco De Nicola (Ed.).* Springer, 331–346. https://doi.org/10.1007/978-3-540-71316-6\_23

Soham Chakraborty and Viktor Vafeiadis. 2017. Formalizing the concurrency semantics of an LLVM fragment. In *Proceedings of the 2017 International Symposium on Code Generation and Optimization, CGO 2017, Austin, TX, USA, February 4-8, 2017*, Vijay Janapa Reddi, Aaron Smith, and Lingjia Tang (Eds.). ACM, 100–110. http://dl.acm.org/citation.cfm?id=3049844

Soham Chakraborty and Viktor Vafeiadis. 2019. Grounding thin-air reads with event structures. *PACMPL* 3, POPL (2019), 70:1–70:28. https://doi.org/10.1145/3290383

Minki Cho, Sung-Hwan Lee, Chung-Kil Hur, and Ori Lahav. 2021. Modular Data-Race-Freedom Guarantees in the Promising Semantics. *Proc. ACM Program. Lang.* 2, PLDI. To Appear.

Edsger W. Dijkstra. 1975. Guarded Commands, Nondeterminacy and Formal Derivation of Programs. Commun. ACM 18, 8 (1975), 453–457. https://doi.org/10.1145/360933.360975

Stephen Dolan, KC Sivaramakrishnan, and Anil Madhavapeddy. 2018. Bounding Data Races in Space and Time. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (Philadelphia, PA, USA) (PLDI 2018). ACM, New York, NY, USA, 242–255. https://doi.org/10.1145/3192366.3192421

Proc. ACM Program. Lang., Vol. 0, No. POPL, Article 0. Publication date: January 2022.

Brijesh Dongol, Radha Jagadeesan, and James Riely. 2019. Modular transactions: bounding mixed races in space and time. In Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019, Washington, DC, USA, February 16-20, 2019, Jeffrey K. Hollingsworth and Idit Keidar (Eds.). ACM, 82–93. https://doi.org/10.1145/3293883.3295708

Shaked Flur, Kathryn E. Gray, Christopher Pulte, Susmit Sarkar, Ali Sezgin, Luc Maranget, Will Deacon, and Peter Sewell. 2016. Modelling the ARMv8 architecture, operationally: concurrency and ISA. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20-22, 2016, Rastislav Bodík and Rupak Majumdar (Eds.). ACM, 608–621. https://doi.org/10.1145/2837614.2837615

Radha Jagadeesan, Alan Jeffrey, and James Riely. 2020. Pomsets with preconditions: a simple model of relaxed memory. Proc. ACM Program. Lang. 4, OOPSLA (2020), 194:1–194:30. https://doi.org/10.1145/3428262

Radha Jagadeesan, Corin Pitcher, and James Riely. 2010. Generative Operational Semantics for Relaxed Memory Models. In Programming Languages and Systems, 19th European Symposium on Programming, ESOP 2010, Paphos, Cyprus, March 20-28, 2010. Proceedings (Lecture Notes in Computer Science, Vol. 6012), Andrew D. Gordon (Ed.). Springer, 307–326. https://doi.org/10.1007/978-3-642-11957-6\_17

Alan Jeffrey and James Riely. 2019. On Thin Air Reads: Towards an Event Structures Model of Relaxed Memory. Logical Methods in Computer Science 15, 1 (2019), 25 pages. https://doi.org/10.23638/LMCS-15(1:33)2019

Jeehoon Kang, Chung-Kil Hur, Ori Lahav, Viktor Vafeiadis, and Derek Dreyer. 2017. A promising semantics for relaxed-memory concurrency. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL 2017, Paris, France, January 18-20, 2017, Giuseppe Castagna and Andrew D. Gordon (Eds.). ACM, 175–189. http://dl.acm.org/citation.cfm?id=3009850

Ori Lahav and Udi Boker. 2020. Decidable verification under a causally consistent shared memory. In Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020, Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 211–226. https://doi.org/10.1145/3385412.3385966

Ori Lahav, Nick Giannarakis, and Viktor Vafeiadis. 2016. Taming release-acquire consistency. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20-22, 2016, Rastislav Bodík and Rupak Majumdar (Eds.). ACM, 649–662. https://doi.org/10.1145/2837614.2837643

Ori Lahav and Viktor Vafeiadis. 2016. Explaining Relaxed Memory Models with Program Transformations. In FM 2016: Formal Methods - 21st International Symposium, Limassol, Cyprus, November 9-11, 2016, Proceedings (Lecture Notes in Computer Science, Vol. 9995), John S. Fitzgerald, Constance L. Heitmeyer, Stefania Gnesi, and Anna Philippou (Eds.). Springer, 479–495. https://doi.org/10.1007/978-3-319-48989-6\_29

Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. 2017. Repairing sequential consistency in C/C++11. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017, Albert Cohen and Martin T. Vechev (Eds.). ACM, 618–632. https://doi.org/10.1145/3062341.3062352

Juneyoung Lee, Yoonseung Kim, Youngju Song, Chung-Kil Hur, Sanjoy Das, David Majnemer, John Regehr, and Nuno P. Lopes. 2017. Taming undefined behavior in LLVM. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017, Albert Cohen and Martin T. Vechev (Eds.). ACM, 633–647. https://doi.org/10.1145/3062341.3062343

Sung-Hwan Lee, Minki Cho, Anton Podkopaev, Soham Chakraborty, Chung-Kil Hur, Ori Lahav, and Viktor Vafeiadis. 2020. Promising 2.0: global optimizations in relaxed memory concurrency. In Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020, Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 362–376. https://doi.org/10.1145/3385412.3386010

Nuno Lopes. 2016. RFC: Killing undef and spreading poison. https://lists.llvm.org/pipermail/llvm-dev/2016-October/

Jeremy Manson, William Pugh, and Sarita V. Adve. 2005. The Java Memory Model. SIGPLAN Not. 40, 1 (Jan. 2005), 378–391. https://doi.org/10.1145/1047659.1040336

Anton Podkopaev, Ori Lahav, and Viktor Vafeiadis. 2019. Bridging the gap between programming languages and hardware weak memory models. Proc. ACM Program. Lang. 3, POPL (2019), 69:1–69:31. https://doi.org/10.1145/3290382

William Pugh. 1999. Fixing the Java Memory Model. In Proceedings of the ACM 1999 Conference on Java Grande, JAVA '99, San Francisco, CA, USA, June 12-14, 1999, Geoffrey C. Fox, Klaus E. Schauser, and Marc Snir (Eds.). ACM, 89–98. https://doi.org/10.1145/304065.304106

William Pugh. 2004. Causality Test Cases. https://perma.cc/PJT9-XS8Z

Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. 2018. Simplifying ARM concurrency: multicopy-atomic axiomatic and operational models for ARMv8. PACMPL 2, POPL (2018), 19:1–19:29. https://doi.org/10.1145/3158107

Jaroslav Sevčík and David Aspinall. 2008. On Validity of Program Transformations in the Java Memory Model. In ECOOP 2008 - Object-Oriented Programming, 22nd European Conference, Paphos, Cyprus, July 7-11, 2008, Proceedings (Lecture Notes in Computer Science, Vol. 5142), Jan Vitek (Ed.). Springer, 27–51. https://doi.org/10.1007/978-3-540-70592-5\_3

Kasper Svendsen, Jean Pichon-Pharabod, Marko Doko, Ori Lahav, and Viktor Vafeiadis. 2018. A Separation Logic for a Promising Semantics. In Programming Languages and Systems - 27th European Symposium on Programming, ESOP 2018, Thessaloniki, Greece, April 14-20, 2018, Proceedings (Lecture Notes in Computer Science, Vol. 10801). Springer, 357–384. https://doi.org/10.1007/978-3-319-89884-1\_13

Conrad Watt, Christopher Pulte, Anton Podkopaev, Guillaume Barbier, Stephen Dolan, Shaked Flur, Jean Pichon-Pharabod, and Shu-yu Guo. 2020. Repairing and mechanising the JavaScript relaxed memory model. In *Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020*, Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 346–361. https://doi.org/10.1145/3385412.3385973