## Appendix A. Discussion

#### A.1. Comparison with Weakest Preconditions

We compare traditional transformers to the dependent-case transformers of Def 48; thus we consider only totally ordered executions. Because we only consider the dependent case, we drop the superscript E on  $\tau^E$  throughout this section. We also assume that each register appears at most once in a program, as we did throughout §2–4.

We are not interested in isolating the *weakest* precondition. Thus we think of transformers as Hoare triples. In addition, all programs in our language are strongly normalizing, so we need not distinguish strong and weak correctness. In this setting, the Hoare triple  $\{\phi\}$  S  $\{\psi\}$  holds exactly when  $\phi \Rightarrow wp_S(\psi)$ .

Hoare triples do not distinguish thread-local variables from shared variables. Thus, the assignment rule applies to all types of storage. The rules can be written as follows:

$$\begin{split} & wp_{x := M}(\psi) = \psi[M/x] \\ & wp_{r := M}(\psi) = \psi[M/r] \\ & wp_{r := x}(\psi) = x = r \Rightarrow \psi \end{split}$$

Here we have chosen an alternative formulation for the read rule, which is equivalent to the more traditional  $\psi[x/r]$ , as long as registers occur at most once in a program. In Def 48, the transformers for the dependent case are as follows:

$$\tau_{x:=M}(\psi) = \psi[M/x]$$

$$\tau_{r:=M}(\psi) = \psi[M/r]$$

$$\tau_{r:=x}(\psi) = v = r \Rightarrow \psi \qquad \text{where } \lambda(e) = \mathsf{R} x v$$

Only the read rule differs from the traditional one.

For programs where every register is bound and every read is fulfilled, our dependent transformers are the same as the traditional ones. Thus, in this comparison we consider only totally ordered executions in which every read could be fulfilled by prepending some writes. For example, we ignore pomsets of  $x := 2; r := x$  that read 1 for x.

For example, let  $S_i$  be defined:

$$\begin{split} S_1 &= s := x; x := s + r \\ S_2 &= x := t; S_1 \\ S_3 &= t := 2; r := 5; S_2 \end{split}$$

The following pomset appears in the semantics of  $S_2$ . A pomset for  $S_3$  can be derived by substituting [2/t, 5/r]. A pomset for  $S_1$  can be derived by eliminating the initial write.

$$\begin{array}{c} x \coloneqq t; s \coloneqq x; x \coloneqq s + r \\ (t = 2 \mid \mathsf{W} x 2) \longrightarrow (2 = s \Rightarrow (s + r) = 7 \mid \mathsf{W} x 7) \\ 2 = s \Rightarrow \psi[s + r/x] \end{array}$$

The predicate transformers are:

$$\begin{aligned} wp_{S_1}(\psi) &= x{=}s\Rightarrow \psi[s{+}r/x] & \tau_{S_1}(\psi) &= 2{=}s\Rightarrow \psi[s{+}r/x] \\ wp_{S_2}(\psi) &= t{=}s\Rightarrow \psi[s{+}r/x] & \tau_{S_2}(\psi) &= 2{=}s\Rightarrow \psi[s{+}r/x] \\ wp_{S_3}(\psi) &= 2{=}s\Rightarrow \psi[s{+}5/x] & \tau_{S_3}(\psi) &= 2{=}s\Rightarrow \psi[s{+}5/x] \end{aligned}$$
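The weakest preconditions of  $S_1$ ,  $S_2$  and  $S_3$  can be derived step by step from the three assignment rules. The derivation below is a sketch: it assumes the standard composition rule  $wp_{S;T}(\psi) = wp_S(wp_T(\psi))$ , which is not stated explicitly in this section, and it assumes that  $\psi$  mentions neither  $r$  nor  $t$ .

$$\begin{aligned} wp_{S_1}(\psi) &= wp_{s:=x}(wp_{x:=s+r}(\psi)) = wp_{s:=x}(\psi[s{+}r/x]) = x{=}s \Rightarrow \psi[s{+}r/x] \\ wp_{S_2}(\psi) &= wp_{x:=t}(wp_{S_1}(\psi)) = (x{=}s \Rightarrow \psi[s{+}r/x])[t/x] = t{=}s \Rightarrow \psi[s{+}r/x] \\ wp_{S_3}(\psi) &= wp_{t:=2}(wp_{r:=5}(wp_{S_2}(\psi))) = (t{=}s \Rightarrow \psi[s{+}r/x])[5/r][2/t] = 2{=}s \Rightarrow \psi[s{+}5/x] \end{aligned}$$

In the second line, the substitution  $[t/x]$  leaves  $\psi[s{+}r/x]$  unchanged because  $x$  no longer occurs in it.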

#### A.2. Closure properties

**Def 58.**  $P_2$  is an augment of  $P_1$  if

- 1)  $E_2 = E_1$ ,
- 2)  $\lambda_2(e) = \lambda_1(e)$ ,
- 3)  $\kappa_2(e)$  implies  $\kappa_1(e)$ ,
- 4)  $\tau_2^D(e)$  implies  $\tau_1^D(e)$ ,
- 5) if  $d \leq_2 e$  then  $d \leq_1 e$ .

**Def 59.**  $P_2$  is a downset of  $P_1$  if

- 1)  $E_2 \subseteq E_1$ ,
- 2)  $(\forall e \in E_2) \ \lambda_2(e) = \lambda_1(e),$
- 3)  $(\forall e \in E_2)$   $\kappa_2(e) = \kappa_1(e)$ ,
- 4)  $(\forall e \in E_2)$   $\tau_2^D(e) = \tau_1^D(e)$ ,
- 5)  $(\forall d \in E_2)$   $(\forall e \in E_2)$   $d \leq_2 e$  if and only if  $d \leq_1 e$ ,
- 6)  $(\forall d \in E_1)$   $(\forall e \in E_2)$  if  $d \leq_1 e$  then  $d \in E_2$ .
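The structural part of Def 59 can be checked mechanically on finite pomsets. The following is a minimal sketch, not part of the formal development: the `Pomset` class, `is_downset`, and `restrict` are hypothetical names, preconditions are compared syntactically as strings, the order is assumed to be given as a transitively closed set of strict pairs, and condition 4 on predicate transformers is omitted.

```python
from dataclasses import dataclass


@dataclass
class Pomset:
    """Hypothetical finite pomset; `order` holds strict pairs (d, e) with d < e."""
    events: set
    label: dict   # event -> action, e.g. "Wx1"
    pre: dict     # event -> precondition, compared syntactically here
    order: set    # assumed to be transitively closed


def is_downset(p2: Pomset, p1: Pomset) -> bool:
    """Check conditions 1-3, 5 and 6 of Def 59 (condition 4 is omitted)."""
    if not p2.events <= p1.events:                                  # condition 1
        return False
    for e in p2.events:
        if p2.label[e] != p1.label[e] or p2.pre[e] != p1.pre[e]:    # conditions 2, 3
            return False
    restricted = {(d, e) for (d, e) in p1.order
                  if d in p2.events and e in p2.events}
    if p2.order != restricted:                                      # condition 5
        return False
    # condition 6: every predecessor (in p1) of a kept event is itself kept
    return all(d in p2.events for (d, e) in p1.order if e in p2.events)


def restrict(p: Pomset, keep: set) -> Pomset:
    """Restrict p to the events in `keep`, without checking downward closure."""
    return Pomset(
        events=set(keep),
        label={e: p.label[e] for e in keep},
        pre={e: p.pre[e] for e in keep},
        order={(d, e) for (d, e) in p.order if d in keep and e in keep},
    )


# Three totally ordered events a < b < c.
p1 = Pomset(
    events={"a", "b", "c"},
    label={"a": "Wx1", "b": "Rx1", "c": "Wy1"},
    pre={"a": "tt", "b": "tt", "c": "r=1"},
    order={("a", "b"), ("b", "c"), ("a", "c")},
)

print(is_downset(restrict(p1, {"a", "b"}), p1))  # a prefix of the order
print(is_downset(restrict(p1, {"b", "c"}), p1))  # drops a, which precedes b
```

The second restriction fails only on condition 6, which is what distinguishes a downset from an arbitrary sub-pomset.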

**Prop 60.** Suppose  $P_1 \in \llbracket S \rrbracket$ .

- 1) If  $P_2$  is an augment of  $P_1$  then  $P_2 \in \llbracket S \rrbracket$ .
- 2) If  $P_2$  is a downset of  $P_1$  then  $P_2 \in \llbracket S \rrbracket$ .

In examples, we typically consider pomsets that are augment-minimal.

#### A.3. Completed Pomsets and Fork

It is sometimes useful to distinguish terminated or completed executions from partial executions. For example, in  $\llbracket x:=1; y:=1 \rrbracket$ , we expect completed executions to include two write actions. Note that this is different from being downset-maximal.

$$x := 0; x := 1 \parallel r := x; s := x; if(s) \{ y := 1 \}$$

(Pomset diagrams (1) and (2) are not reproduced here.)

(1) is a downset of (2), but both are completed.

For pomsets with predicate transformers, we identify completion with quiescence.

**Def 61.** A pomset with predicate transformers P is completed if, for every quiescence symbol s,  $\tau^{E}(s)$  implies s.

For example, there are no pomsets in  $\llbracket \mathtt{abort} \rrbracket$  that are completed, whereas the augment-minimal pomset of  $\llbracket \mathtt{skip} \rrbracket$  is completed.

While this definition is sensible for single *threads*, it is less satisfying for thread groups. To see why, consider that in  $\llbracket fork \{S\} \rrbracket$ :

- by T3, quiescence symbols and the symbol W have been substituted out of preconditions  $\kappa(e)$ ,
- by F4, every predicate transformer  $\tau^D$  is the identity.

Every pomset in  $\llbracket fork \{G\} \rrbracket$  is completed, by definition. As a result, in general,  $\llbracket fork \{S\} \rrbracket \neq \llbracket S \rrbracket$ .

The fork operation is asynchronous: in  $\llbracket S_1; fork\{G\}; S_2 \rrbracket$ , the threads in  $\llbracket G \rrbracket$  run concurrently with  $\llbracket S_1; S_2 \rrbracket$ .

$$r := x; \operatorname{fork} \{ x := 1 \}$$

In fact, perhaps surprisingly,  $\llbracket r:=x; fork\{x:=1\} \rrbracket = \llbracket fork\{x := 1\}; r := x \rrbracket$ . Order between the threads can be enforced using synchronization. For example, the "backwards" read above is forbidden in:

$$r\!:=\!x;z^{\mathsf{ra}}\!:=\!1;\mathsf{fork}\{\mathsf{if}(z^{\mathsf{ra}})\{x\!:=\!1\}\}$$

#### A.4. Fork-Join

In this subsection, we model a variant of our language that removes the asynchronous fork operation and adds a synchronous fork-join.

$$S ::= \mathtt{abort} \mid \mathtt{skip} \mid r := M \mid r := [L]^{\mu} \mid [L]^{\mu} := M \mid \mathtt{fork}\{G\}; \mathtt{join} \mid S_1; S_2 \mid \mathtt{if}(M)\{S_1\}\mathtt{else}\{S_2\}$$

In  $(S_1; fork\{G\}; join; S_2)$ ,  $S_1$  must complete before  $G$  begins, and threads in  $G$  must complete before  $S_2$  begins. Thus  $(fork\{r := x\}; join)$  acts like a full fence. As modeled here, however, if  $G$  is empty, no order is imposed between  $S_1$  and  $S_2$ . Thus  $\llbracket fork\{skip\}; join \rrbracket = \llbracket skip \rrbracket$ .

To model fork-join, we give the semantics of thread groups using pomsets with preconditions and termination.

**Def 62.** A pomset with preconditions and termination is a pomset with preconditions (Def 12) together with a termination predicate (notation  $\checkmark$ ).

**Def 63.** If  $P \in THRD(\mathcal{P})$  then  $(\exists P_1 \in \mathcal{P})$

- 1-2) as for THRD in Def 45,
- T3)  $\kappa(e)$  implies  $\kappa_1(e)[tt/Q][tt/W]$  if  $\lambda_1(e)$  is a write,  $\kappa(e)$  implies  $\kappa_1(e)[tt/Q][ff/W]$  otherwise.
- T4) if  $\checkmark$  then P is completed (Def 61).

If  $P \in (\mathcal{P}_1 \parallel \mathcal{P}_2)$  then  $(\exists P_1 \in \mathcal{P}_1)$   $(\exists P_2 \in \mathcal{P}_2)$

- 1-8) as for  $\parallel$  in Def 12,
- 9)  $\checkmark$  implies  $\checkmark_1 \land \checkmark_2$ .

If  $P \in FORKJOIN(\mathcal{P})$  then  $(\exists P_1 \in \mathcal{P})$ 

- 1-2) as for FORK in Def 21,
- F3)  $\kappa(e)$  implies  $Q_{ro}^* \wedge Q_{wo}^* \wedge Q_{sc} \wedge \kappa_1(e)$ ,
- F4)  $\tau^D(\psi)$  implies  $\psi$ , if  $D=E$  and  $\checkmark_1$ ,
- F5)  $\tau^D(\psi)$  implies  $\psi[ff/Q]$ , otherwise.

**Def 64.** Update Def 24 to include:

$$\llbracket fork\{G\}; join \rrbracket = FORKJOIN \llbracket G \rrbracket$$

We embed pomsets with predicate transformers into pomsets with preconditions and termination using completion. The rules for thread groups keep track of the termination predicate. As noted in §A.3, every pomset in  $\llbracket fork\{G\} \rrbracket$  is completed. In contrast, a pomset in  $\llbracket fork\{G\}; join \rrbracket$  is completed only if every thread in G is completed.

Top-level thread groups do not need quiescence symbols; thus, THRD removes all quiescence symbols by substitution. However,  $FORKJOIN(\mathcal{P})$  adds every possible quiescence symbol as a precondition to the events of  $\mathcal{P}$ . For example, the preconditions of  $\llbracket S \parallel 0 \rrbracket$  do not contain quiescence symbols. In contrast, the preconditions of  $\llbracket fork\{S \parallel 0\}; join \rrbracket$  are saturated with them. As a result, in completed top-level pomsets of  $\llbracket S_1; fork\{G\}; join \rrbracket$ , all of the events from  $\llbracket S_1 \rrbracket$  must precede those of  $\llbracket G \rrbracket$ .

A similar thing happens with predicate transformers. Thread groups in  $\llbracket S \parallel 0 \rrbracket$  do not contain predicate transformers. In contrast, all of the independent predicate transformers of  $\llbracket fork\{S \parallel 0\}; join \rrbracket$  take  $\psi$  to  $\psi[ff/Q]$ . As a result, in completed top-level pomsets of  $\llbracket fork\{G\}; join; S_2 \rrbracket$ , all of the events from  $\llbracket G \rrbracket$  must precede those of  $\llbracket S_2 \rrbracket$ .

#### A.5. Using Independency for Coherence

In §3.2, we encoded coherence and synchronized access using quiescence symbols. Building on the language with fork-join, we model coherence using independency (§2.2), rather than encoding it in the logic.

**Def 65.** If  $P \in (\mathcal{P}_1; \mathcal{P}_2)$  then  $(\exists P_1 \in \mathcal{P}_1)$   $(\exists P_2 \in \mathcal{P}_2)$

- 1-8) as for  $;$  in Def 23,
- 9) if  $d \in E_1$  and  $e \in E_2$  then either  $d < e$  or  $\lambda_1(d) \leftrightarrow \lambda_2(e)$ .

In the logic, we remove the symbols  $Q_{wo}^x$  and  $Q_{ro}^x$ . Previously, we had given the semantics of ra access using  $Q_{wo}^*$  and  $Q_{ro}^*$ , which were encoded using  $Q_{wo}^x$  and  $Q_{ro}^x$ . With these gone, we introduce the quiescence symbols  $Q^*$  and  $Q_{\mathsf{acq}}$ . Thus, the only quiescence symbols required are  $Q^*$ ,  $Q_{\mathsf{acq}}$  and  $Q_{\mathsf{sc}}$ . Fig 2 shows the difference with the semantics of §3.2.

**Def 66.** Let formulae  $Q_{\mu}^{S}$  and  $Q_{\mu}^{L}$  be defined:

$$\begin{aligned} Q_{rlx}^S &= Q_{acq} & Q_{rlx}^L &= Q_{acq} \\ Q_{ra}^S &= Q_{acq} \wedge Q^* & Q_{ra}^L &= Q_{acq} \\ Q_{sc}^S &= Q_{acq} \wedge Q^* \wedge Q_{sc} & Q_{sc}^L &= Q_{acq} \wedge Q_{sc} \end{aligned}$$

Let substitutions  $[\phi/Q_{\mu}^{S}]$  and  $[\phi/Q_{\mu}^{L}]$  be defined:

$$\begin{split} [\phi/Q_{\text{rlx}}^{\text{S}}] &= [\phi/Q^*] & [\phi/Q_{\text{rlx}}^{\text{L}}] = [\phi/Q^*] \\ [\phi/Q_{\text{ra}}^{\text{S}}] &= [\phi/Q^*] & [\phi/Q_{\text{ra}}^{\text{L}}] = [\phi/Q^*, \phi/Q_{\text{acq}}] \\ [\phi/Q_{\text{sc}}^{\text{S}}] &= [\phi/Q^*, \phi/Q_{\text{sc}}] & [\phi/Q_{\text{sc}}^{\text{L}}] = [\phi/Q^*, \phi/Q_{\text{acq}}, \phi/Q_{\text{sc}}] \end{split}$$

**Def 67.** Update Def 23 and 63 to:

- S3)  $\kappa(e)$  implies  $M=v \wedge Q_{\mu}^{S}$ ,
- L3)  $\kappa(e)$  implies  $Q_{\mu}^{L}$ ,
- F3)  $\kappa(e)$  implies  $Q^{*} \wedge Q_{acq} \wedge Q_{sc} \wedge \kappa_{1}(e)$ ,
- S4)  $\tau^D(\psi)$  implies  $\psi[(Q^* \wedge M=v)/Q^*]$ ,
- S5)  $\tau^C(\psi)$  implies  $\psi[ff/Q_{\mu}^{S}]$ ,
- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ ,
- L5)  $\tau^C(\psi)$  implies  $\psi[ff/Q_{\mu}^L]$ .



Figure 2: Quiescence Examples. (a) Quiescence Examples (§3.2). (b) Quiescence Examples (§A.5).

The most interesting examples in Fig 2b concern ra access. Every independent transformer substitutes  $[ff/Q^*]$ .  $Q^*$  is a precondition for any releasing write e, ensuring that all preceding events must be ordered before e. Conversely,  $Q_{acq}$  is a precondition of every event. The independent transformer for any acquiring read e substitutes  $[ff/Q_{acq}]$ , ensuring that all following events must be ordered after e.

Sequential composition ensures coherence. Note that this definition is incorrect if one allows fork parallelism, since T3 substitutes tt for every quiescence symbol. Preconditions of augment-minimal pomsets in  $\llbracket fork\{S\} \rrbracket$  contain no quiescence symbols. In contrast, preconditions of augment-minimal pomsets in  $\llbracket fork\{S\}; join \rrbracket$  are saturated with quiescence symbols.

As before, the substitution in S4 ensures that left merges are not quiescent (Ex 31).

#### A.6. Substitutions

Recall the load rules from §5.1:

- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ ,
- L5)  $\tau^C(\psi)$  implies  $(v=r \lor x=r) \Rightarrow \psi$ , when  $E \neq \emptyset$ ,
- L6)  $\tau^B(\psi)$  implies  $\psi$ , when  $E = \emptyset$ .

It is also possible to collapse x and r when doing a load:

- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi[r/x]$ ,
- L5)  $\tau^C(\psi)$  implies  $(v=r \lor x=r) \Rightarrow \psi[r/x]$ , when  $E \neq \emptyset$ .
- L6)  $\tau^B(\psi)$  implies  $\psi[r/x]$ , when  $E = \emptyset$ .

Perhaps surprisingly, these two semantics are incomparable. Consider the following:

$$\begin{array}{c} \mathtt{if}(r \wedge s \text{ even}) \{y := 1\}; \mathtt{if}(r \wedge s) \{z := 1\} \\ (r \wedge s \text{ even} \mid \mathsf{W}y1) \qquad (r \wedge s \mid \mathsf{W}z1) \end{array}$$

Prepending  $(s := x)$ , we get the same result regardless of whether we substitute [s/x], since x does not occur in either precondition. Here we show the independent case:

$$\begin{array}{c} s := x; \mathtt{if}(r \wedge s \text{ even}) \{ y := 1 \}; \mathtt{if}(r \wedge s) \{ z := 1 \} \\ ((2 = s \vee x = s) \Rightarrow (r \wedge s \text{ even}) \mid \mathsf{W}y1) \qquad ((2 = s \vee x = s) \Rightarrow (r \wedge s) \mid \mathsf{W}z1) \end{array}$$

Prepending  $(r := x)$ , we now get different results since the preconditions mention x. Without substitution:

$$\begin{array}{c} r := x; s := x; \mathtt{if}(r \land s \text{ even}) \{y := 1\}; \mathtt{if}(r \land s) \{z := 1\} \\ (\mathsf{R}x1) \\ (1 = r \Rightarrow (2 = s \lor x = s) \Rightarrow (r \land s \text{ even}) \mid \mathsf{W}y1) \\ (1 = r \Rightarrow (2 = s \lor x = s) \Rightarrow (r \land s) \mid \mathsf{W}z1) \end{array}$$

Prepending (x := 0), which substitutes [0/x], the precondition of  $(\mathsf{W}y1)$  becomes  $(1 = r \Rightarrow (2 = s \lor 0 = s) \Rightarrow (r \land s \text{ even}))$ , which is a tautology, whereas the precondition of  $\mathsf{W}z1$  becomes  $(1 = r \Rightarrow (2 = s \lor 0 = s) \Rightarrow (r \land s))$ , which is not. In order to be top-level,  $\mathsf{W}z1$  must depend on  $\mathsf{R}x2$ ; in this case the precondition becomes  $(1 = r \Rightarrow 2 = s \Rightarrow (r \land s))$ , which is a tautology.



The situation reverses with the substitution [r/x]:

$$\begin{array}{c} r := x; s := x; \mathtt{if}(r \land s \text{ even}) \{y := 1\}; \mathtt{if}(r \land s) \{z := 1\} \\ (\mathsf{R}x1) \\ (1 = r \Rightarrow (2 = s \lor r = s) \Rightarrow (r \land s \text{ even}) \mid \mathsf{W}y1) \end{array}$$

Prepending (x := 0):

$$(\mathsf{W}x0) \qquad (\mathsf{R}x1) \qquad (\mathsf{R}x2) \qquad (\mathsf{W}y1) \qquad (\mathsf{W}z1)$$

The dependency has changed from  $(Rx2) \rightarrow (Wz1)$  to  $(Rx2) \rightarrow (Wy1)$ . The resulting sets of pomsets are incomparable.
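The tautology claims in this subsection can be checked by brute force over a small domain. The sketch below makes several assumptions that are not in the text: registers range over `range(4)`, the guard "r ∧ s" is read as "both nonzero", "r ∧ s even" is read as "r nonzero and s even", and the precondition of (Wz1) under  $[r/x]$  is taken by analogy with the one shown for (Wy1). All function names are illustrative.

```python
from itertools import product

VALUES = range(4)  # assumed finite domain for registers r and s


def implies(a: bool, b: bool) -> bool:
    return (not a) or b


def even(n: int) -> bool:
    return n % 2 == 0


def is_tautology(formula) -> bool:
    """True if formula(r, s) holds for every r, s in VALUES."""
    return all(formula(r, s) for r, s in product(VALUES, repeat=2))


def precondition(read_s, body):
    """Build the predicate 1=r => read_s(r, s) => body(r, s)."""
    return lambda r, s: implies(r == 1, implies(read_s(r, s), body(r, s)))


def y_body(r, s):  # guard of y := 1, read as: r nonzero and s even
    return bool(r) and even(s)


def z_body(r, s):  # guard of z := 1, read as: r and s both nonzero
    return bool(r) and bool(s)


def plain_indep(r, s):  # no substitution, after x := 0: x=s has become 0=s
    return s == 2 or s == 0


def subst_indep(r, s):  # with [r/x]: x=s has become r=s
    return s == 2 or s == r


def dependent(r, s):  # depending on (Rx2) drops the second disjunct
    return s == 2


results = {
    "no substitution, Wy1 independent": is_tautology(precondition(plain_indep, y_body)),
    "no substitution, Wz1 independent": is_tautology(precondition(plain_indep, z_body)),
    "no substitution, Wz1 after Rx2": is_tautology(precondition(dependent, z_body)),
    "[r/x], Wy1 independent": is_tautology(precondition(subst_indep, y_body)),
    "[r/x], Wz1 independent": is_tautology(precondition(subst_indep, z_body)),
    "[r/x], Wy1 after Rx2": is_tautology(precondition(dependent, y_body)),
}
for name, ok in results.items():
    print(f"{name}: {'tautology' if ok else 'not a tautology'}")
```

Under these assumptions the independent precondition fails for (Wz1) without the substitution and for (Wy1) with it, which matches the change of dependency described above.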

Thinking in terms of hardware, the difference is whether reads update the cache, thus clobbering preceding writes. With  $[r/x]$ , reads clobber the cache, whereas without the substitution, they do not. Since most caches work this way, the model with  $[r/x]$  is likely preferred for modeling hardware. In a software model, however, we see no reason to prefer one of these over the other.

## Appendix B. Differences with OOPSLA

**Substitution.** [4] uses substitution rather than Skolemizing. Indeed our use of Skolemization is motivated by disjunction closure for predicate transformers, which do not appear in [4]; see §2.4.

In §5.1, we give the semantics of load for nonempty pomsets as:

- L4)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ ,
- L5)  $\tau^C(\psi)$  implies  $(v=r \lor x=r) \Rightarrow \psi$ .

In [4], the definition is roughly as follows:

- L4)  $\tau^D(\psi)$  implies  $\psi[v/r][v/x]$ ,
- L5)  $\tau^C(\psi)$  implies  $\psi[v/r][v/x] \wedge \psi[x/r]$ .

These substitutions collapse x and r, allowing local invariant reasoning, as in §5.1. Without Skolemizing it is necessary to substitute [x/r], since the reverse substitution [r/x] is useless when r is bound.

Removing the substitution of  $[x/r]$  in the independent case has a small technical advantage: we no longer require *extended* expressions (which include memory references), since substitutions no longer introduce memory references.

The substitution [x/r] does not work with Skolemization, even for the dependent case, since we lose the unique marker for each read. In effect, this forces the reads to the same values. To be concrete, the candidate definition would modify L4 to be:

- L4)  $\tau^D(\psi)$  implies  $v=x \Rightarrow \psi[x/r]$ .

Using this definition, consider the following:

$$r := x; s := x; \text{if} (r < s) \{ y := 1 \}$$
 
$$(Rx1) \qquad (Rx2) \rightarrow (1 = x \Rightarrow 2 = x \Rightarrow x < x \mid Wy1)$$

Although the execution seems reasonable, the precondition on the write is not a tautology.

**Consistency.** [4] imposes *consistency*, which requires that for every pomset P,  $\bigwedge_e \kappa(e)$  is satisfiable. Associativity requires that we allow pomsets with inconsistent preconditions. Consider a variant of Ex 53 from §5.3.

$$\begin{array}{cccc} \mathtt{if}(M)\{x := 1\} & \mathtt{if}(!M)\{x := 1\} & \mathtt{if}(M)\{y := 1\} & \mathtt{if}(!M)\{y := 1\} \\ (M \mid \mathsf{W}x1) & (\neg M \mid \mathsf{W}x1) & (M \mid \mathsf{W}y1) & (\neg M \mid \mathsf{W}y1) \end{array}$$

Associating left and right, we have:

$$\begin{array}{cc} \mathtt{if}(M)\{x := 1\}; \mathtt{if}(!M)\{x := 1\} & \mathtt{if}(M)\{y := 1\}; \mathtt{if}(!M)\{y := 1\} \\ \boxed{\mathsf{W}x1} & \boxed{\mathsf{W}y1} \end{array}$$

Associating into the middle, instead, we require:

$$\begin{split} &\mathbf{if}(M)\{x \coloneqq 1\} &\quad \mathbf{if}(!M)\{x \coloneqq 1\}; \mathbf{if}(M)\{y \coloneqq 1\} &\quad \mathbf{if}(!M)\{y \coloneqq 1\} \\ &\left(M \mid \mathsf{W}x1\right) &\quad \left(\neg M \mid \mathsf{W}x1\right) &\quad \left(M \mid \mathsf{W}y1\right) &\quad \left(\neg M \mid \mathsf{W}y1\right) \end{split}$$

Joining left and right, we have:

$$\begin{split} \mathbf{if}(M) &\{x \!:=\! 1\}\,; \, \mathbf{if}(!M) &\{x \!:=\! 1\}\,; \, \mathbf{if}(M) &\{y \!:=\! 1\}\,; \, \mathbf{if}(!M) &\{y \!:=\! 1\}\\ & \left( \mathbf{W} x \mathbf{1} \right) \quad \left( \mathbf{W} y \mathbf{1} \right) \end{split}$$
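The consistency requirement of [4] can be tested directly on this example. The following is a minimal sketch under the assumption that  $M$  can be treated as a single Boolean; `consistent`, `middle`, and `merged` are illustrative names. It checks whether the conjunction of the preconditions of a pomset is satisfiable.

```python
def consistent(preconditions, valuations=(False, True)) -> bool:
    """True if some valuation of M satisfies every precondition at once."""
    return any(all(k(m) for k in preconditions) for m in valuations)


# Middle component: if(!M){x:=1}; if(M){y:=1}
middle = [
    lambda m: not m,  # (not M | Wx1)
    lambda m: m,      # (M | Wy1)
]

# Fully merged program: each write is assumed to carry M or not M,
# which is why the text shows it without a precondition.
merged = [
    lambda m: m or not m,  # (Wx1)
    lambda m: m or not m,  # (Wy1)
]

print(consistent(middle))
print(consistent(merged))
```

The middle component is inconsistent even though the merged pomset is consistent, which is why requiring consistency of every intermediate pomset conflicts with associativity.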

**Causal Strengthening.** [4] imposes *causal strengthening*, which requires for every pomset P, if  $d \le e$  then  $\kappa(e)$  implies  $\kappa(d)$ . Associativity requires that we allow pomsets without causal strengthening. Consider the following.

$$\begin{array}{ccc} \mathtt{if}(M)\{r := x\,\} & y := r & \mathtt{if}(!M)\{s := x\,\} \\ \hline \begin{pmatrix} M \mid \mathsf{R}x1 \end{pmatrix} & \begin{pmatrix} r = 1 \mid \mathsf{W}y1 \end{pmatrix} & \begin{pmatrix} \neg M \mid \mathsf{R}x1 \end{pmatrix} \\ \end{array}$$

Associating left, with causal strengthening:

$$\begin{array}{cc} \mathtt{if}(M)\{r := x\}; y := r & \mathtt{if}(!M)\{s := x\} \\ (M \mid \mathsf{R}x1) \rightarrow (M \mid \mathsf{W}y1) & (\neg M \mid \mathsf{R}x1) \end{array}$$

Finally, merging:

Instead, associating right:

$$\begin{array}{cc} \mathtt{if}(M)\{r:=x\} & y:=r;\mathtt{if}(!M)\{s:=x\} \\ \hline (M \mid \mathsf{R}x1) & (r=1 \mid \mathsf{W}y1) \qquad (\neg M \mid \mathsf{R}x1) \end{array}$$

Merging:

$$\mathtt{if}(M)\{r:=x\};y:=r;\mathtt{if}(!M)\{s:=x\}$$

With causal strengthening, the precondition of Wy1 depends upon how we associate. This is not an issue in [4], which always associates to the right.
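The causal strengthening condition itself is easy to state as a check over a finite set of environments. The sketch below is illustrative only: it assumes  $M$  is a Boolean and  $r$  ranges over `range(3)`, and it compares two hypothetical pomsets, one that orders (M | Rx1) before the unmodified (r=1 | Wy1), and the strengthened one with (M | Wy1).

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    return (not a) or b


def causally_strengthened(pre, order, envs) -> bool:
    """True if kappa(e) implies kappa(d) for every ordered pair d <= e."""
    return all(
        implies(pre[e](env), pre[d](env))
        for (d, e) in order
        for env in envs
    )


# Assumed environments: M is a Boolean, r ranges over 0..2.
envs = [{"M": m, "r": r} for m, r in product((False, True), range(3))]
order = {("Rx1", "Wy1")}

unstrengthened = {
    "Rx1": lambda env: env["M"],       # (M | Rx1)
    "Wy1": lambda env: env["r"] == 1,  # (r=1 | Wy1)
}
strengthened = {
    "Rx1": lambda env: env["M"],       # (M | Rx1)
    "Wy1": lambda env: env["M"],       # (M | Wy1)
}

print(causally_strengthened(unstrengthened, order, envs))
print(causally_strengthened(strengthened, order, envs))
```

Only the second pomset satisfies the condition, since  $r=1$  does not imply  $M$ .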

**Causal Strengthening and Address Dependencies.** In order to guarantee that address calculation does not introduce thin-air executions, the predicate transformer for address calculation must be chosen carefully. Combining Def 43 and Def 56, we have:

- L4)  $\tau^D(\psi)$  implies  $(L=\ell \Rightarrow v=r) \Rightarrow \psi$ ,
- L5)  $\tau^C(\psi)$  implies  $((L=\ell \Rightarrow v=r) \lor \mathsf{W}) \Rightarrow \psi$ .

Consider the following program, from [4, §5], where initially  $x=0,\ y=0,\ [0]=0,\ [1]=2,$  and [2]=1. It should only be possible to read 0, disallowing the attempted execution below:

Looking at the left thread:



Composing, we have:

$$\begin{array}{c} r := y; s := [r]; x := s \\ (\mathsf{R}y2) \qquad ((2 = r \vee \mathsf{W}) \Rightarrow r = 2 \mid \mathsf{R}[2]1) \\ ((2 = r \vee \mathsf{W}) \Rightarrow (r = 2 \Rightarrow 1 = s) \Rightarrow s = 1 \mid \mathsf{W}x1) \end{array}$$

Substituting for W as in T3 (tt in the precondition of a write, ff otherwise):

$$\begin{array}{c} r := y; s := [r]; x := s \\ ((2 = r \lor \mathsf{ff}) \Rightarrow r = 2 \mid \mathsf{R}[2]1) \\ ((2 = r \lor \mathsf{tt}) \Rightarrow (r = 2 \Rightarrow 1 = s) \Rightarrow s = 1 \mid \mathsf{W}x1) \end{array}$$

The precondition of (R[2]1) is a tautology, but the precondition of (Wx1) is not. This forces a dependency:

$$\begin{array}{c} r := y; s := [r]; x := s \\ ((2 = r \lor \mathsf{ff}) \Rightarrow r = 2 \mid \mathsf{R}[2]1) \\ (2 = r \Rightarrow (r = 2 \Rightarrow 1 = s) \Rightarrow s = 1 \mid \mathsf{W}x1) \end{array}$$

All the preconditions are now tautologies.
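These claims can be checked by enumeration. The sketch below assumes that registers range over `range(4)` and follows T3 of Def 63 in substituting tt for W in the precondition of a write and ff in the precondition of a read; the function names are illustrative.

```python
from itertools import product

VALUES = range(4)  # assumed finite domain for registers r and s


def implies(a: bool, b: bool) -> bool:
    return (not a) or b


def is_tautology(formula) -> bool:
    """True if formula(r, s) holds for every r, s in VALUES."""
    return all(formula(r, s) for r, s in product(VALUES, repeat=2))


def read_pre(w: bool):
    """(2=r or W) => r=2, the precondition of R[2]1 with Ry2 independent."""
    return lambda r, s: implies(r == 2 or w, r == 2)


def write_pre(w: bool):
    """(2=r or W) => (r=2 => 1=s) => s=1, the precondition of Wx1."""
    return lambda r, s: implies(r == 2 or w,
                                implies(implies(r == 2, s == 1), s == 1))


def write_pre_dependent(r, s):
    """2=r => (r=2 => 1=s) => s=1, once Wx1 depends on Ry2."""
    return implies(r == 2, implies(implies(r == 2, s == 1), s == 1))


read_ok = is_tautology(read_pre(False))     # T3: ff for W in a read
write_ok = is_tautology(write_pre(True))    # T3: tt for W in a write
write_dep_ok = is_tautology(write_pre_dependent)

print(read_ok, write_ok, write_dep_ok)
```

A falsifying assignment for the independent write precondition is  $r=0$ ,  $s=0$ : the inner implication holds vacuously, but  $s=1$  does not.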

**Parallel Composition.** In [4, §2.4], parallel composition is defined allowing coalescing of events. Here we have forbidden coalescing. This difference appears to be arbitrary. In [4], however, there is a mistake in the handling of termination actions. The predicates should be joined using  $\land$ , not  $\lor$ .

**Internal Acquiring Reads.** Shortly after publication, Podkopaev [5] noticed a shortcoming of the implementation on ARM8 in [4, §7]. The proof given there assumes that all internal reads can be dropped. However, this is not the case for acquiring reads. For example, [4] disallows the following execution, which is allowed by ARM8 and TSO.

$$x := 2; r := x^{\mathsf{ra}}; s := y \quad || \quad y := 2; x^{\mathsf{ra}} := 1$$

$$(\mathsf{W}x2) \quad (\mathsf{R}x2) \quad (\mathsf{R}y0) \quad \parallel \quad (\mathsf{W}y2) \quad (\mathsf{W}x1)$$

The solution we have adopted is to allow an acquiring read to be downgraded to a relaxed read when it is preceded (sequentially) by a relaxed write that could fulfill it. This solution allows executions that are not allowed under ARM8 since we do not insist that the local relaxed write is actually read from. This may seem counterintuitive, but we don't see a local way to be more precise.

As a result, we use a different proof strategy for the ARM8 implementation, which does not rely on read elimination. The proof idea uses a recent alternative characterization of ARM8 [1, 2].

**Redundant Read Elimination.** Contrary to the claim, redundant read elimination fails for [4]. We discussed redundant read elimination in §5.2. Consider JMM Causality Test Case 2, which we discussed there.

Under the semantics of [4], we have

$$\begin{array}{c} r \coloneqq x; s \coloneqq x; \text{if} (r \negthinspace = \negthinspace s) \{ y \negthinspace \coloneqq \negthinspace 1 \, \} \\ \hline (\mathsf{R} x \negthinspace 1) & (1 \negthinspace = \negthinspace 1 \land 1 \negthinspace = \negthinspace x \land x \negthinspace = \negthinspace 1 \land x \negthinspace = \negthinspace x \mid \mathsf{W} y \negthinspace 1) \end{array}$$

The precondition of (Wy1) is *not* a tautology, and therefore redundant read elimination fails. (It is a tautology in r:=x; s:=r; if  $(r=s)\{y:=1\}$ .) In [4, §3.1], we incorrectly stated that the precondition of (Wy1) was  $1=1 \land x=x$ .
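The difference between the two preconditions is visible with a one-variable enumeration. This is a small sketch assuming  $x$  ranges over `range(4)`; `actual` and `claimed` are illustrative names for the formula derived above and the one stated in [4, §3.1].

```python
VALUES = range(4)  # assumed finite domain for the location x


def is_tautology(formula) -> bool:
    """True if formula(x) holds for every x in VALUES."""
    return all(formula(x) for x in VALUES)


def actual(x) -> bool:
    # 1=1 and 1=x and x=1 and x=x
    return 1 == 1 and 1 == x and x == 1 and x == x


def claimed(x) -> bool:
    # 1=1 and x=x
    return 1 == 1 and x == x


print(is_tautology(actual))
print(is_tautology(claimed))
```

Any value of  $x$  other than 1 falsifies the first formula, whereas the second holds for every value.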

**A Note on Mixed Races.** In preparing this paper, we came across the following example, which appears to invalidate Theorem 4.1 of [3].

$$x := 1; y^{\mathsf{ra}} := 1; r := x^{\mathsf{ra}} \parallel \mathsf{if}(y^{\mathsf{ra}}) \{ x^{\mathsf{ra}} := 2 \}$$

$$\boxed{\mathsf{W}x1} \qquad \boxed{\mathsf{R}y1} \qquad \boxed{\mathsf{W}y2} \qquad \qquad (3)$$

$$\boxed{\mathsf{W}x1} \qquad \boxed{\mathsf{R}y1} \qquad \boxed{\mathsf{W}x2} \qquad \qquad (4)$$

The program is data-race free. The two executions shown are the only top-level executions that include (Wx2).

Theorem 4.1 of [3] is stated by extending execution sequences. In the terminology of [3], a read is L-weak if it is sequentially stale. Let  $\rho = (Wx1)(Wy1)(Ry1)(Wx2)$  be a sequence and  $\alpha = (Rx1)$ .  $\rho$  is L-sequential and  $\alpha$  is L-weak in  $\rho\alpha$ . But there is no execution of this program that includes a data race, contradicting the theorem. The error seems to be in Lemma A.4 of [3], which states that if  $\alpha$  is L-weak after an L-sequential  $\rho$ , then  $\alpha$  must be in a data race. That is clearly false here, since (Rx1) is stale, but the program is data race free.

In proving the SC-LDRF result in [4, §8], we noted that our proof technique is more robust than that of [3], because it limits the prefixes that must be considered. In (3), the induction hypothesis requires that we add (Rx1) before (Wx2) since  $(Rx1) \rightarrow (Wx2)$ . In particular,



is not a downset of (3), because  $(Rx1) \rightarrow (Wx2)$ . As we noted in [4, §8], this affects the inductive order in which we move across pomsets, but does not affect the set of pomsets that are considered. In particular,



is a downset of (3).

#### References

- [1] J. Alglave. This commit adds three alternative formulations of the arm model, both for non-mixed and mixed size accesses. https://github.com/herd/herdtools7/commit/685ee4, June 2020.
- [2] Arm Limited. Arm architecture reference manual: Armv8, for Armv8-A architecture profile (issue F.c). https://developer.arm.com/documentation/ddi0487/latest, July 2020.
- [3] B. Dongol, R. Jagadeesan, and J. Riely. Modular transactions: bounding mixed races in space and time. In J. K. Hollingsworth and I. Keidar, editors, *Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2019, Washington, DC, USA, February 16-20, 2019*, pages 82–93. ACM, 2019. doi: 10.1145/3293883.3295708. URL https://doi.org/10.1145/3293883.3295708.
- [4] R. Jagadeesan, A. Jeffrey, and J. Riely. Pomsets with preconditions: a simple model of relaxed memory. *Proc. ACM Program. Lang.*, 4(OOPSLA):194:1–194:30, 2020. doi: 10.1145/3428262. URL https://doi.org/10.1145/3428262.
- [5] A. Podkopaev. Private correspondence, Nov. 2020.