# -

## ANONYMOUS AUTHOR(S)

#### **ACM Reference Format:**

Anonymous Author(s). 2022. A Unified Memory Model for Heterogenous Systems. *Proc. ACM Program. Lang.* 0, POPL, Article 0 (January 2022), 24 pages.

A Unified Memory Model for Heterogenous Systems

#### 1 MODEL

#### 1.1 Preliminaries

The syntax is built from

- a set of values V, ranged over by  $v, w, \ell, k$ ,
- a set of registers  $\mathcal{R}$ , ranged over by r, s,
- a set of expressions  $\mathcal{M}$ , ranged over by M, N, L,
- a set of *thread ids*  $\mathcal{T}$ , ranged over by  $\alpha$ ,  $\gamma$ .

*Memory references* are tagged values, written [ $\ell$ ]. Let X be the set of memory references, ranged over by x, y, z. We require that:

- values and registers are disjoint,
- values include at least the constants 0 and 1,
- expressions include at least registers and values,
- references do not appear in expressions: M[N/x] = M,
- thread ids include the *top-level* id **0**.

We model the following language.

```
\mu, \nu := \text{wk} \mid \text{rlx} \mid \text{rel} \mid \text{acq} \mid \text{ra} \mid \text{sc} \sigma, \rho := \text{cta} \mid \text{gpu} \mid \text{sys}
S := \text{skip} \mid r := M \mid r := [L]^{\mu}_{\sigma} \mid [L]^{\mu}_{\sigma} := M \mid F^{\mu}_{\sigma} \mid \text{if}(M)\{S_1\} \text{ else } \{S_2\} \mid S_1; S_2 \mid S_1 \mid S_2 \mid r := \text{CAS}^{\mu, \nu}_{\sigma}([L], M, N) \mid r := \text{FADD}^{\mu, \nu}_{\sigma}([L], M) \mid r := \text{EXCHG}^{\mu, \nu}_{\sigma}([L], M)
```

Access modes,  $\mu$ , are weak (wk), relaxed (rlx), release (rel), acquire (acq), release-acquire (ra), and sequentially consistent (sc). Let expressions (r:=M) only affect thread-local state and thus do not have a mode. Reads  $(r:=[L]^{\mu}_{\sigma})$  support wk, rlx, acq, sc. Writes  $([L]^{\mu}_{\sigma}:=r)$  support wk, rlx, rel, sc. Fences  $(F^{\mu}_{\sigma})$  support rel, acq, ra, sc. In the atomic update operations,  $\mu$  is a read and  $\nu$  is a write; we require that r does not occur in L.

*Scopes*,  $\sigma$ , are thread group (cta), processor (gpu) and system (sys).

Commands, aka statements, S, include memory accesses at a given mode, as well as the usual structural constructs. Following [Ferreira et al. 1996], # denotes parallel composition. If  $(S_1 \not \# S_2)$  is executed with thread ID  $\alpha$ , then  $S_1$  runs with ID  $\gamma$  and  $S_2$  continues under ID  $\alpha$ . Top level programs run with thread ID  $\Omega$ . In examples, we usually drop thread IDs. We use the symmetric # operator when there is no continuation after the parallel composition.

We use common syntax sugar, such as *extended expressions*,  $\mathbb{M}$ , which include memory locations. For example, if  $\mathbb{M}$  includes a single occurrence of x, then  $y := \mathbb{M}$ ; S is shorthand for r := x;

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

© 2022 Copyright held by the owner/author(s).

2475-1421/2022/1-ART0

https://doi.org/

0:2 Anon.

y := M[r/x]; S. Each occurrence of x in an extended expression corresponds to an separate read. We also write if (M){S} as shorthand for if (M){S} else {skip}.

The semantics is built from the following.

- a set of events  $\mathcal{E}$ , ranged over by e, d, c, and subsets ranged over by E, D, C,
- a set of logical formulae  $\Phi$ , ranged over by  $\phi$ ,  $\psi$ ,  $\theta$ ,
- a set of actions  $\mathcal{A}$ , ranged over by a, b,
- a family of *quiescence symbols*  $Q_x$ , indexed by location.

## We require that

50

51 52

53

54

55

56

57

58

61

65

66

67

69

70 71

72

73

74 75

76

77 78

79

80 81

83

84

85

86

87

88

89

90

91

92

93

94

95 96

97 98

- registers include  $S_{\mathcal{E}} = \{s_e \mid e \in \mathcal{E}\}$  which do not appear in commands:  $S[N/s_e] = S$ ,
- formulae include tt, ff,  $Q_x$ , and the equalities (M=N) and (x=M),
- formulae are closed under negation, conjunction, disjunction, and substitutions [M/r], [M/x],  $[\phi/Q_x]$ ,
- there is a relation  $\models$  between formulae, capturing entailment,
- $\models$  has the expected semantics for =,  $\neg$ ,  $\land$ ,  $\lor$ ,  $\Rightarrow$  and substitution.

We relax the first assumption in examples, assuming that each register is assigned at most once.

Logical formulae include equations over registers, such as (r=s+1). For LIR, we also include equations over memory references, such as (x=1). Formulae are subject to substitutions; actions are not. We use expressions as formulae, coercing M to  $M\neq 0$ . Equations have precedence over logical operators; thus  $r=v \Rightarrow s>w$  is read  $(r=v) \Rightarrow (s>w)$ . As usual, implication associates to the right; thus  $\phi \Rightarrow \psi \Rightarrow \theta$  is read  $\phi \Rightarrow (\psi \Rightarrow \theta)$ .

We say  $\phi$  is a tautology if tt  $\models \phi$ . We say  $\phi$  is unsatisfiable if  $\phi \models$  ff.

We also require that there is a subset of actions, distinguishing *read* actions. We require several binary relations between actions, detailed in the next subsection: *sync-delays*, *co-delays*, *strongly-matches*, *matches*, *blocks*, *overlaps*, *strongly-overlaps*, *strongly-fences*. We require that

 $matches \subseteq blocks \subseteq overlaps \supseteq strongly-overlaps$  $strongly-matches \subseteq strongly-overlaps \cup strongly-fences$ 

### 1.2 Actions

We combine access and fence modes into a single order:  $wk \to rlx \Rightarrow rel \Rightarrow ra \to sc$ . We write  $\mu \sqsubseteq \nu$  for this order. Let  $\mu \sqcup \nu$  denote the least upper bound of  $\mu$  and  $\nu$ .

Let actions be reads, writes and fences:

$$a, b := \alpha W^{\mu}_{\sigma} x v \mid \alpha R^{\mu}_{\sigma} x v \mid \alpha F^{\mu}_{\sigma}$$

In examples, we systematically drop the default mode rlx and the default scope sys. In definitions, we drop elements of actions that are existentially quantified. We write  $(\alpha A_{\sigma}^{\mu} x)$  to stand for an *access*: either  $(\alpha W_{\sigma}^{\mu} x)$  or  $(\alpha R_{\sigma}^{\mu} x)$ . We write  $(W^{\square rel})$  to stand for either  $(W^{rel})$  or  $(W^{sc})$ , and similarly for other actions and modes.

We say a matches b if a = (Wxv) and b = (Rxv).

We say a blocks b if a = (Wx) and b = (Rx), regardless of value.

We say a overlaps b if a = (Ax) and b = (Ax), regardless of access type or value.

We say a co-delays b if  $(a, b) \in \{(Wx, Wx), (Rx, Wx), (Wx, Rx)\} \cup \{(A^{sc}, A^{sc})\}.$ 

We say a sync-delays b if  $(a, b) \in \{(a, W^{\supseteq rel}), (a, F^{\supseteq rel}), (R, F^{\supseteq acq}), (R^{\supseteq acq}, b), (F^{\supseteq acq}, b), (F^{\supseteq rel}, W), (W^{\supseteq rel}x, Wx)\}.$ 

Actions (R) are read actions.

<sup>&</sup>lt;sup>1</sup>For PTX, one could additionally include  $(Rx, R^{\supseteq acq}x)$ , but this is not sound for Arm or IMM.

99

100

101

102

103

104 105

106

107

108

109

110111

112

113

114

115

124

126

128

130

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146 147 *Definition 1.1.* We assume two equivalences:  $(=_{gpu}) \subseteq (\mathcal{T} \times \mathcal{T})$  partitions threads by *processor*, and  $(=_{cta}) \subseteq (=_{gpu})$  refines the processor partitioning into *thread groups*.

We say  $(\alpha_1 A_{\sigma_1}^{\mu_1} x)$  strongly-overlaps  $(\alpha_2 A_{\sigma_2}^{\mu_2} x)$  when either

```
(1) \alpha_1 = \alpha_2, or (2b) if \sigma_1 = \text{cta or } \sigma_2 = \text{cta then } \alpha_1 =_{\text{cta}} \alpha_2,
```

(2a) 
$$\mu_1, \mu_2 \supseteq \mathsf{rlx},$$
 (2c) if  $\sigma_1 = \mathsf{gpu} \text{ or } \sigma_2 = \mathsf{gpu} \text{ then } \alpha_1 = \mathsf{gpu} \alpha_2.$ 

We say  $(\alpha_1 \mathsf{F}_{\sigma_1}^{\mu_1})$  strongly-fences  $(\alpha_2 \mathsf{F}_{\sigma_2}^{\mu_2})$  when  $\mu_1 = \mu_2 = \mathsf{sc}$  and either (1) or (2) apply, from the definition of strongly-overlaps.

We say a strongly-matches b when a is a release, b is an acquire, and either a strongly-overlaps b or a strongly-fences b. [Todo: This looks wrong.]

Note that for a CPUs, all action have scope sys and mode rlx or greater. For this subset of actions, *strongly-overlaps* is the same as *overlaps* and *strongly-fences* applies to any pair of sc fences.

#### 1.3 Pomsets with Predicate Transformers

The semantics here includes all the features of [Jeffrey et al. 2021, §9]: Register Recycling, Register Consistency, Fences, and RMWs. We account for Address Calculation and If-Closure in §2. We have proposals to account for Dead Store Elimination, Store Forwarding, and Monotonicity in §13.

```
Definition 1.2. Let \lambda: E \to \mathcal{A}. Then we define \theta_{\lambda} = \bigwedge_{\{(e,v) \in (E \times \mathcal{V}) | \lambda(e) = (\mathsf{R}v)\}} (s_e = v).
```

We say that  $\phi$  is  $\lambda$ -consistent if  $\phi \wedge \theta_{\lambda}$  is satisfiable. We say that it is  $\lambda$ -inconsistent otherwise.

*Definition 1.3.* A  $\lambda$ -predicate transformer is a function  $\tau: \Phi \to \Phi$  such that

```
(x1) \tau(\psi_1 \wedge \psi_2) \equiv \tau(\psi_1) \wedge \tau(\psi_2), (x4) if \psi is \lambda-inconsistent then \tau(\psi) is \lambda-
```

(x2) 
$$\tau(\psi_1 \vee \psi_2) \equiv \tau(\psi_1) \vee \tau(\psi_2)$$
, inconsistent.

(x3) if 
$$\phi \models \psi$$
, then  $\tau(\phi) \models \tau(\psi)$ ,

Definition 1.4. A family of λ-predicate transformers consists of a λ-predicate transformer  $\tau^D$  for each  $D \subseteq \mathcal{E}$ , such that if  $C \cap E \subseteq D$  then  $\tau^C(\psi) \models \tau^D(\psi)$ .

We write  $\tau(\psi)$  as an abbreviation of  $\tau^{E}(\psi)$ .

*Definition 1.5.* A pomset with predicate transformers is a tuple  $(E, \lambda, \kappa, \tau, \checkmark, \leq, \leq, \sqsubseteq, rmw)$  where

```
(M1) E \subseteq \mathcal{E} is a set of events,
```

- (M2)  $\lambda : E \to \mathcal{A}$  defines a *label* for each event,
- (M3)  $\kappa : E \to \Phi$  defines a *precondition* for each event, such that (M3a)  $\kappa(e)$  is  $\lambda$ -consistent,
- (M4)  $\tau: 2^{\mathcal{E}} \to \Phi \to \Phi$  is a family of  $\lambda$ -predicate transformers,
- (M5)  $\checkmark$ :  $\Phi$  is a termination condition, such that

```
(M5a) \checkmark \models \tau(tt),
```

- (M6)  $\leq$ : (E × E) is a partial order capturing dependency,
- $(M7) \le (E \times E)$  is a partial order capturing synchronization,
- (M8)  $\sqsubseteq$ : ( $E \times E$ ) is a partial order capturing *per-location order*, such that

(M8a) if  $\lambda(d)$  overlaps  $\lambda(e)$  then  $d \leq e$  implies  $d \sqsubseteq e$ ,

(M9)  $rmw : E \rightarrow E$  is a partial function capturing read-modify-write *atomicity*, such that

```
(M9a) if d \xrightarrow{\text{rmw}} e then \lambda(e) blocks \lambda(d),
```

- (M9b) if  $d \xrightarrow{\mathsf{rmw}} e$  then  $d \leq e$  and  $d \sqsubseteq e$ ,
- (M9c) if  $\lambda(c)$  overlaps  $\lambda(d)$  then
  - (i) if  $d \xrightarrow{\mathsf{rmw}} e$  then  $c \le e$  implies  $c \le d$ ,  $c \le e$  implies  $c \le d$ ,  $c \subseteq e$  implies  $c \subseteq d$ ,
  - (ii) if  $d \xrightarrow{\mathsf{rmw}} e$  then  $d \le c$  implies  $e \le c$ ,  $d \le c$  implies  $e \le c$ ,  $d \subseteq c$  implies  $e \subseteq c$ .

A pomset is a *candidate* if there is an injective relation rf :  $E \times E$ , capturing *reads-from*, such that (c2a) if  $d \xrightarrow{rf} e$  then  $\lambda(d)$  matches  $\lambda(e)$ ,

0:4 Anon.

(c6) if  $d \xrightarrow{\mathsf{rf}} e$  then  $d \leq e$ ,

149

151

153

155

156

157

161

163

164

167

173

174

175

176177

178

179

180

181

182

191

192

193

194

195 196

- (c7a) if  $d' \le d \xrightarrow{\text{rf}} e \le e'$  and  $\lambda(d')$  strongly-matches  $\lambda(e')$  then  $d' \le e'$ ,
- (c7b) if  $\lambda(d)$  strongly-fences  $\lambda(e)$  then either  $d \le e$  or  $e \le d$ , [Todo: Is this right?]
- (c8a) if  $d \stackrel{\mathsf{rf}}{\longrightarrow} e$  then  $d \sqsubseteq e$ ,
- (c8b) if  $d \xrightarrow{rf} e$  and  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \subseteq d$  or  $e \subseteq c$ , where  $d' \subseteq e'$  when  $e' \subseteq d'$  implies d' = e' and  $\lambda(d')$  strongly-overlaps  $\lambda(e')$  implies  $d' \subseteq e'$ .

A candidate pomset with rf is complete if

- (c2b) if  $\lambda(e)$  is a read then there is some  $d \xrightarrow{\text{rf}} e$ ,
  - (c3)  $\kappa(e)$  is a tautology (for every  $e \in E$ ),
  - (c5)  $\checkmark$  is a tautology.

Note that for the IMM model, C8b is equivalent to:<sup>2</sup>

if 
$$d \xrightarrow{\mathsf{rf}} e$$
 and  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \sqsubseteq d$  or  $e \sqsubseteq c$ .

Let P range over pomsets, and  $\mathcal{P}$  over sets of pomsets.

We drop quantifiers when clear from context, such as  $(\forall e \in E)(\forall x \in X)$ . We write d < e when  $d \le e$  and  $d \ne e$ , and similarly for  $\triangleleft$  and  $\square$ . We sometimes use projection functions—for example, if  $\lambda(e) = \alpha \mathsf{W}_{\sigma}^{\mu} x v$  then  $\lambda_{\mathsf{thrd}}(e) = \alpha$ ,  $\lambda_{\mathsf{mode}}(e) = \mu$ ,  $\lambda_{\mathsf{scope}}(e) = \sigma$ ,  $\lambda_{\mathsf{loc}}(e) = x$ ,  $\lambda_{\mathsf{val}}(e) = v$ .

The semantic functions are defined in Fig. 1.

In diagrams, we use different shapes and colors for arrows and events. These are included only to help the reader understand why order is included. We adopt the following conventions:

- $d \rightarrow e$  arises from control/data/address dependency (s3, definition of  $\kappa'_2(d)$ ),
- $d \rightarrow e$  arises from sync-delays (s7a),
- $d \rightarrow e$  arises from co-delays (s8a),
- $d \rightarrow e$  arises from matching (c6), (c7a) and (c8a),
- $d \rightarrow e$  arises from strong fencing (c7b),
- $d \rightarrow e$  arises from blocking (c8b).

$$\forall \lambda(c) = (\mathsf{W} x) \text{ either } c \sqsubseteq d \text{ or } e \sqsubseteq c$$

If no accesses are morally strong with each other, weak fulfillment degenerates to

$$\not\exists \lambda(c) = (\mathsf{W} x) \text{ both } d \sqsubseteq c \text{ and } c \sqsubseteq e$$

Note that the difference between strong and weak fulfillment is limited to  $\sqsubseteq$ . We sometimes write  $\trianglerighteq$  for strong fulfillment and  $\trianglerighteq$  for weak fulfillment.

<sup>&</sup>lt;sup>2</sup>If all accesses are morally strong with each other, weak fulfillment degenerates to

```
If P \in SKIP then E = \emptyset and and \tau^D(\psi) \equiv \psi and \checkmark \equiv tt.
197
198
           If P \in PAR(\mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) \ (\exists P_2 \in \mathcal{P}_2)
199
               (P1) E = (E_1 \uplus E_2),
                                                                                                        (P5) \checkmark \equiv \checkmark_1 \land \checkmark_2,
200
               (P2) \lambda = (\lambda_1 \cup \lambda_2),
                                                                                                        (P6) \trianglelefteq \supseteq (\trianglelefteq_1 \cup \trianglelefteq_2),
201
              (P3a) if e \in E_1 then \kappa(e) \equiv \kappa_1(e),
                                                                                                        (P7) \leq \supseteq (\leq_1 \cup \leq_2),
202
             (P3b) if e \in E_2 then \kappa(e) \equiv \kappa_2(e),
                                                                                                        (P8) \sqsubseteq \supseteq (\sqsubseteq_1 \cup \sqsubseteq_2),
203
               (P4) \boldsymbol{\tau}^D(\psi) \equiv \boldsymbol{\tau}_2^D(\psi),
                                                                                                        (P9) rmw = (rmw_1 \cup rmw_2).
204
           If P \in SEO(\mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)
205
                                                                                                         \begin{array}{ll} (\mathrm{s4}) \ \ \tau^D(\psi) \equiv \tau^D_1(\tau^D_2(\psi)), \\ (\mathrm{s5}) \ \ \checkmark \equiv \checkmark_1 \land \tau_1(\checkmark_2), \end{array} 
                (s1) E = (E_1 \cup E_2),
206
                (s2) (s6) (s7) (s8) (s9) as for PAR,
207
              (s3a) if e \in E_1 \setminus E_2 then \kappa(e) \equiv \kappa_1(e),
                                                                                                       (s7a) if \lambda_1(d) sync-delays \lambda_2(e) then d \leq e,
208
              (s3b) if e \in E_2 \setminus E_1 then \kappa(e) \equiv \kappa_2'(e),
                                                                                                      (s8a) if \lambda_1(d) co-delays \lambda_2(e) then d \sqsubseteq e,
209
              (s3c) if e \in E_1 \cap E_2 then \kappa(e) \equiv \kappa_1(e) \vee \kappa_2'(e),
210
           where \kappa_2'(e) = \tau_1(\kappa_2(e)) if \lambda(e) is a read—otherwise \kappa_2'(e) = \tau_1^{\downarrow e}(\kappa_2(e)), where \downarrow e = \{c \mid c \triangleleft e\}.
211
           If P \in IF(\phi, \mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) \ (\exists P_2 \in \mathcal{P}_2)
                (11) E = (E_1 \cup E_2),
                                                                                                       (13c) if e \in E_1 \cap E_2
213
                                                                                                                  then \kappa(e) \equiv (\phi \wedge \kappa_1(e)) \vee (\neg \phi \wedge \kappa_2(e)),
                (12) (16) (17) (18) (19) as for PAR,
214
                                                                                                         (14) \ \tau^D(\psi) \equiv (\phi \wedge \tau_1^D(\psi)) \vee (\neg \phi \wedge \tau_2^D(\psi)),
              (13a) if e \in E_1 \setminus E_2 then \kappa(e) \equiv \phi \wedge \kappa_1(e),
              (13b) if e \in E_2 \setminus E_1 then \kappa(e) \equiv \neg \phi \wedge \kappa_2(e),
                                                                                                         (15) \checkmark \equiv (\phi \land \checkmark_1) \lor (\neg \phi \land \checkmark_2).
           If P \in LET(r, M) then E = \emptyset and and \tau^D(\psi) \equiv \psi[M/r] and \checkmark \equiv tt.
218
           If P \in FENCE(\mu, \sigma)_{\alpha} then
219
                                                                                                        (F4) \tau^D(\psi) \equiv \psi,
               (F1) |E| \leq 1,
               (F2) \lambda(e) = \alpha \mathsf{F}^{\mu}_{\sigma}
                                                                                                      (F5a) if E \neq \emptyset then \sqrt{\ } \equiv tt,
221
                                                                                                      (F5b) if E = \emptyset then \checkmark \equiv ff.
               (F3) \kappa(e) \equiv tt,
222
           If P \in WRITE(x, M, \mu, \sigma)_{\alpha} then (\exists v \in \mathcal{V})
223
                                                                                                     (w4b) if E = \emptyset then \tau^D(\psi) \equiv \psi[M/x][ff/Q_x],
              (w1) |E| \leq 1,
224
              (w2) \lambda(e) = \alpha W_{\sigma}^{\mu} x v,
                                                                                                     (w5a) if E \neq \emptyset then \sqrt{\ } \equiv M = v.
225
              (w3) \kappa(e) \equiv M = v,
                                                                                                     (w5b) if E = \emptyset then \checkmark \equiv ff,
226
            (w4a) if E \neq \emptyset then \tau^D(\psi) \equiv \psi[M/x][M=v/Q_x],
227
228
           If P \in READ(r, x, \mu, \sigma)_{\alpha} then (\exists v \in \mathcal{V})
229
               (R1) |E| \leq 1,
                                                                                                      (R4b) if e \in E \setminus D then
230
               (R2) \lambda(e) = \alpha R_{\sigma}^{\mu} x v,
                                                                                                                  \tau^D(\psi) \equiv (v = s_e \lor x = s_e) \Rightarrow \psi[s_e/r],
231
                                                                                                      (R4c) if E = \emptyset then (\forall s) \tau^D(\psi) \equiv \psi[s/r],
               (R3) \kappa(e) \equiv Q_x,
232
             (R4a) if e \in E \cap D then
                                                                                                      (R5a) if E \neq \emptyset or \mu \sqsubseteq \text{rlx then } \checkmark \equiv \text{tt.}
233
                         \tau^D(\psi) \equiv v = s_e \Rightarrow \psi[s_e/r],
                                                                                                      (R5b) if E = \emptyset and \mu \supseteq acq then \checkmark \equiv ff.
234
           Let READ' be defined as for READ, adding the constraint:
235
             (R4d) if (E \cap D) = \emptyset then \tau^D(\psi) \equiv \psi.
236
           If P \in FADD(r, x, M, \mu, \nu, \sigma)_{\alpha} then P \in SEQ(READ'(r, x, \mu, \sigma)_{\alpha}, WRITE(x, r+M, \nu, \sigma)_{\alpha}) and
237
           If P \in EXCHG(r, x, M, \mu, \nu, \sigma)_{\alpha} then P \in SEQ(READ'(r, x, \mu, \sigma)_{\alpha}, WRITE(x, M, \nu, \sigma)_{\alpha}) and
238
           If P \in CAS(r, x, M, N, \mu, \nu, \sigma)_{\alpha} then
239
               P \in SEQ(READ'(r, x, \mu, \sigma)_{\alpha}, IF(r=M, WRITE(x, N, \nu, \sigma)_{\alpha}, SKIP)) and
240
241
               (U9) if \lambda(e) is a write then there is a read \lambda(d) such that \kappa(e) \models \kappa(d) and k \mapsto e.
242
```

Fig. 1. Semantic Functions

243

244245

0:6 Anon.

#### 2 ADDRESS CALCULATION AND IF-CLOSURE

### 2.1 Address Calculation

246247

248

262 263

267

269

270

271

272

273

275

276277

278

279

280

281

282

283

284

285 286

287

288

289

290

291

292

293 294

```
Definition 2.1. If P \in WRITE(L, M, \mu, \sigma)_{\alpha} then (\exists \ell \in \mathcal{V}) (\exists v \in \mathcal{V})
249
              (w1) if |E| \leq 1,
                                                                                                      (w5a) if E \neq \emptyset then \checkmark \equiv L = \ell \land M = v,
250
              (w2) \lambda(e) = \alpha W^{\mu}_{\sigma}[\ell]v,
                                                                                                      (w5b) if E = \emptyset then \checkmark \equiv ff.
251
              (w3) \kappa(e) \equiv L = \ell \wedge M = v,
252
            (w4a) if E \neq \emptyset then \tau^D(\psi) \equiv (L=\ell) \Rightarrow \psi[M/[\ell]][M=v/Q_{[\ell]}],
253
            (w4b) if E = \emptyset then (\forall k) \tau^D(\psi) \equiv (L=k) \Rightarrow \psi[M/[k]][ff/Q_{[k]}],
254
           If P \in READ(r, L, \mu, \sigma)_{\alpha} then (\exists \ell \in \mathcal{V}) (\exists v \in \mathcal{V})
255
256
                                                                                                        (R4c) if E = \emptyset then (\forall s) \tau^D(\psi) \equiv \psi[s/r],
               (R1) if |E| \leq 1,
257
               (R2) \lambda(e) = \alpha R_{\sigma}^{\mu}[\ell]v
                                                                                                        (R5a) if E \neq \emptyset or \mu \sqsubseteq \mathsf{rlx} then \checkmark \equiv \mathsf{tt}.
258
               (R3) \kappa(e) \equiv L = \ell \wedge Q_{[\ell]},
                                                                                                        (R5b) if E = \emptyset and \mu \supseteq acq then \checkmark \equiv ff.
             (R4a) if e \in E \cap D then \tau^D(\psi) \equiv (L=\ell \Rightarrow v=s_e) \Rightarrow \psi[s_e/r],
259
             (R4b) if e \in E \setminus D then \tau^D(\psi) \equiv ((L=\ell \Rightarrow v=s_e) \vee (L=\ell \Rightarrow [\ell]=s_e)) \Rightarrow \psi[s_e/r],
260
261
```

### 2.2 If-closure

*Definition 2.2.* Let  $E \subseteq \mathcal{E}$  and  $\theta : E \to \Phi$  and  $\Omega \in \Phi$ . We say that  $\theta$  *partitions*  $\Omega$  if (1) if  $\theta_e \wedge \theta_d$  is satisfiable then e = d, (2)  $\Omega \equiv \bigvee_{e \in E} \theta_e$ .

```
If P \in WRITE(x, M, \mu, \sigma)_{\alpha} then (\exists v : E \to V) (\exists \theta : E \to \Phi) (\exists \Omega \in \{tt, ff\})

(w1) \theta partitions \Omega, (w4) \tau^{D}(\psi) \equiv \bigwedge_{e \in E} (\theta_{e} \Rightarrow \psi[M/x][M=v_{e}/Q_{x}])

(w2) \lambda(e) = \alpha W_{\sigma}^{\mu} x v_{e}, \wedge \neg \Omega \Rightarrow \psi[M/x][ff/Q_{x}]

(w3) \kappa(e) \equiv \theta_{e} \wedge M=v_{e}, (w5) \checkmark \equiv \Omega \wedge \bigwedge_{e \in E} (\theta_{e} \Rightarrow M=v_{e}).
```

If  $P \in READ(r, x, \mu, \sigma)_{\alpha}$  then  $(\exists v : E \to V)$   $(\exists \theta : E \to \Phi)$   $(\exists \Omega \in \{\mathsf{tt}, \mathsf{ff}\})$ 

(R1)  $\theta$  partitions  $\Omega$ , (R5a) if  $\mu \sqsubseteq \text{rlx then } \sqrt{\equiv} \text{ tt}$ ,

(R2)  $\lambda(e) = \alpha R_{\sigma}^{\mu} x v_e$  (R5b) if  $\mu \supseteq \text{acq then } \checkmark \equiv \Omega$ .

(R3)  $\kappa(e) \equiv \theta_e \wedge Q_x$ ,

 $(R4) (\forall s) \tau^{D}(\psi) \equiv \bigwedge_{e \in E \cap D} (\theta_{e} \Rightarrow v_{e} = s_{e} \Rightarrow \psi[s_{e}/r])$   $\wedge \bigwedge_{e \in E \setminus D} (\theta_{e} \Rightarrow (v_{e} = s_{e} \lor x = s_{e}) \Rightarrow \psi[s_{e}/r])$   $\wedge \neg \Omega \Rightarrow \psi[s/r]$ 

#### 2.3 Address Calculation and If-closure

 $\begin{aligned} & \text{Definition 2.3. If } P \in \textit{WRITE}(L, M, \mu, \sigma)_{\alpha} \text{ then } (\exists \ell : E \to \mathcal{V}) \ (\exists v : E \to \mathcal{V}) \ (\exists \theta : E \to \Phi) \\ & (\exists \Omega \in \{\mathsf{tt}, \mathsf{ff}\}) \\ & (\mathsf{w1}) \ \theta \text{ partitions } \Omega, & (\mathsf{w5}) \ \checkmark \equiv \Omega \land \bigwedge_{e \in E} (\theta_e \Rightarrow L = \ell_e \land M = v_e). \\ & (\mathsf{w2}) \ \lambda(e) = \alpha \mathsf{W}_{\sigma}^{\mathsf{L}}[\ell] \mathsf{v}_e, & (\mathsf{w3}) \ \kappa(e) \equiv \theta_e \land L = \ell_e \land M = v_e, \\ & (\mathsf{w4}) \ (\forall k) \tau^D(\psi) \equiv \bigwedge_{e \in E} (\theta_e \Rightarrow (L = \ell) \Rightarrow \psi[M/x][M = v_e/Q_x]) \\ & \qquad \qquad \land \neg \Omega \Rightarrow (L = k) \Rightarrow \psi[M/x][\mathsf{ff}/Q_x] \end{aligned}$  If  $P \in \mathit{READ}(r, L, \mu, \sigma)_{\alpha}$  then  $(\exists \ell : E \to \mathcal{V}) \ (\exists v : E \to \mathcal{V}) \ (\exists \theta : E \to \Phi) \ (\exists \Omega \in \{\mathsf{tt}, \mathsf{ff}\}) \\ & (\mathsf{R1}) \ \theta \text{ partitions } \Omega, & (\mathsf{R5a}) \ \text{if } \mu \sqsubseteq \mathsf{rlx then} \ \checkmark \equiv \mathsf{tt}, \\ & (\mathsf{R2}) \ \lambda(e) = \alpha \mathsf{R}_{\sigma}^{\mathsf{L}}[\ell] v_e & (\mathsf{R5b}) \ \text{if } \mu \sqsubseteq \mathsf{acq then} \ \checkmark \equiv \Omega. \\ & (\mathsf{R3}) \ \kappa(e) \equiv \theta_e \land L = \ell_e \land \mathsf{Q}_{\mathsf{L}}[\ell], \end{aligned}$ 

 $\begin{array}{l} (\text{R4}) \ (\forall s) \tau^D(\psi) \equiv \bigwedge_{e \in E \cap D} (\theta_e \Rightarrow (L = \ell_e \Rightarrow v_e = s_e) \Rightarrow \psi[s_e/r]) \\ & \wedge \bigwedge_{e \in E \setminus D} (\theta_e \Rightarrow ((L = \ell_e \Rightarrow v_e = s_e) \vee (L = \ell_e \Rightarrow [\ell] = s_e)) \Rightarrow \psi[s_e/r]) \\ & \wedge \neg \Omega \Rightarrow \psi[s/r], \end{array}$ 

## 3 EXAMPLE FROM JAM PAPER

From [Bender and Palsberg 2019, §3.3]. With partial coherence/weak fulfillment you need to be careful that RMWs are totally ordered (if that's a property you want). May not come for free.

From [Bender and Palsberg 2019, §B]: "Here we demonstrate that it is possible to construct a program that is only forbidden due to the total coherence order"

$$r := x; x := 1 \parallel r := x^{\text{acq}}; y := 1 \parallel r := y^{\text{acq}}; x := 2$$

$$(\text{TOTAL-CO})$$

$$Rx2 \longrightarrow Wx1 \longrightarrow R^{\text{acq}}x1 \longrightarrow Wy1 \longrightarrow R^{\text{acq}}y1 \longrightarrow Wx2$$

$$(X \text{Arm8})$$

$$rf$$

$$coe$$

$$(XArm8)$$

### 4 IRIW

 Status of IRIW is unclear in our model, since we allow everything allowed by power...

$$x := 1 \parallel r := x^{ra}; \ s := y \parallel y := 1 \parallel s := y^{ra}; \ r := x$$

$$(Ry0) \longrightarrow (Ry1) \longrightarrow (Rx0)$$

#### 5 SYNC EXAMPLES

The first of these is seen in hardware. All are allowed by PTX. Showing rf that is not included in the order using a dotted arrow.

$$x := 1; \ y^{\text{rel}} := 1 \parallel r := y^{\text{acq}}; \ z_{\text{sys}} := r \not\Vdash_{\gamma} r := z_{\text{sys}}^{\text{acq}}; \ s := x$$

$$\alpha w_{x1} \rightarrow \alpha w^{\text{rel}} y_{1} \rightarrow \alpha w_{\text{sys}} y_{1} \rightarrow \alpha w_{\text{sys}} y_{1} \rightarrow \gamma w_{\text{sys}} y_$$

$$x := 1; y^{\text{rel}} := 1 \parallel r := y^{\text{acq}}; z := r \parallel r := z^{\text{acq}}; s := x$$

$$(\leq)$$

$$(\leq)$$

$$x := 1; \ y^{\text{rel}} := 1 \parallel r := y; \ z^{\text{rel}} := r \parallel r := z^{\text{acq}}; \ s := x$$

$$(\forall x_1) \rightarrow (\forall x_1) \rightarrow (\forall x_1) \rightarrow (\forall x_2) \rightarrow (\forall x_3) \rightarrow (\forall x_4) \rightarrow$$

To get publication using fences we need an additional closure property for rf on sync order:

$$x := 1; F^{\text{rel}}; y := 1 \parallel r := y; F^{\text{acq}}; s := x$$

$$(\forall x1) \rightarrow (F^{\text{rel}}) \rightarrow (Wy1) \cdots \rightarrow (Ry1) \rightarrow (F^{\text{acq}}) \rightarrow (Rx0)$$

Previous def of candidate requires:

(c7a) if  $d \xrightarrow{\text{rf}} e$  and  $\lambda(d)$  strongly-matches  $\lambda(e)$  then  $d \leq e$ .

This is not good enough for fences. A possible fix is the following closure condition:

0:8 Anon.

(c7a') if  $d' \le d \xrightarrow{\text{rf}} e \le e'$  and  $\lambda(d')$  strongly-matches  $\lambda(e')$  then  $d' \le e'$ .

With that we have the following, using  $\Rightarrow$  for edges induced by closure when  $d' \neq d$  or  $e' \neq e$ :

$$x := 1$$
;  $F^{\text{rel}}$ ;  $y := 1 \parallel r := y$ ;  $F^{\text{acq}}$ ;  $s := x$ 

$$(\leq)$$

This seems to work for the above examples, but it could be too strong in general.

- One possibility is to restrict to preceding and following things in the same thread: (c7a'') if  $d' \leq_{po} d \xrightarrow{rf} e \leq_{po} e'$  and  $\lambda(d')$  strongly-matches  $\lambda(e')$  then  $d' \leq e'$ . where  $\leq_{po}$  is the obvious restriction of  $\leq$  to actions on the same thread.
- With either (c7a') or (c7a") is it too strong to require ≤ that be transitive? In particular:
  - if we restrict to ≤po, the closure condition (c7a") could add order between actions on the same thread via cross-thread reads.
  - How does transitivity interact with scopes?

### Anton proposes:

344 345

346 347

348 349

350 351 352

353 354

355

357

358

359

360

363

364

365

367

369 370

371

372

373374375

376

377

378

379 380

385 386

387

388

389 390

391 392

```
(M9b') if d \xrightarrow{\text{rmw}} e then d \sqsubseteq e,
(c7a''') if d' \le d \left( \xrightarrow{\text{rf}}; \left( \xrightarrow{\text{rmw}}; \xrightarrow{\text{rf}} \right)^* \right) e \le e' and \lambda(d') strongly-matches \lambda(e') then d' \le e'.
```

The following behavior is allowed by Arm, IMM, and C11, but forbidden by PTX. PTX forbids it since acquire reads work as fences for po-previous reads from the same location (symmetrically to release writes for po-latter writes to the same location in IMM, C11, and PTX).

$$x := 1; \ y^{\text{rel}} := 1 \parallel r := y; \ y := 2; \ s := y^{\text{acq}}; \ t := x$$

$$(\leq)$$

To allow this on for IMM, we need to drop  $(Rx, R^{\supseteq acq}x)$  from sync-delays.

The following is allowed by c11, but not IMM or PTX. The goal here is to construct a cycle  $a \xrightarrow{rf} b \xrightarrow{hb} c \xrightarrow{rf} d \xrightarrow{hb} a$  where rf will be included in synch-relation. In relational notation, the cycle has the following form:

 $r := x^{\text{acq}}$ ; INC(y) || INC(y) || INC(y);  $z^{\text{rel}} := 1 || s := z^{\text{acq}}$ ; INC(w) || INC(w) || INC(w);  $x^{\text{rel}} := 1$  $R^{acq}x1$ Ry1 $R^{acq}z1$ (Ry2)Rw1Rw2rmv rmw rmw Rw0 Wy2Ww2Wy3Rw3Ww1Wy1 $W^{rel}z1$  $W^{rel}x1$ 

#### 6 RELATING IMM AND PTX

 It looks like we cannot prove compilation correctness from IMM to PTX. (In this email I assume that all threads are in the same CTA, so any relation is a morally strong one if it is applicable.) The problem is in the LB-data-rel example:

$$r := x ; y := r \parallel s := y ; x^{\text{rel}} := 1$$

$$Rx1 \xrightarrow{\text{data}} Wy1 \xrightarrow{\text{rfe}} Ry1 \xrightarrow{\text{bob}} W^{\text{rel}}x1$$

$$Rx1 \xrightarrow{\text{Wy1}} Ry1 \xrightarrow{\text{Wrel}}x1$$

$$(\leq)$$

IMM forbids it, but PTX allows it. The point is that IMM mixes dependencies and release/acquire-induced po-order in its NoOOTA axiom, whereas PTX doesn't — release/acquire are only used to have coherence.

The problem is related to the one we have already discussed in the context of the C++ model – if you don't have acquire reads in the program, then you can erase release annotations from writes. In this regard, PTX is closer to PL memory models than to hardware ones.

AFAIU for the same reason we won't be able to show compilation correctness from the Pomset model to PTX even directly, if the Pomset model mixes release/acquire induced order with dependencies in the same causality relation.

The previous example in the section shows that IMM's acquires are stronger than PTX for this pattern. The next example shows that acquiring reads in PTX are stronger than in IMM for a different pattern. Thus the acquires in PTX and IMM are incomparable.

The following behavior is allowed by IMM and C11, but forbidden by PTX. PTX forbids it since acquire reads work as fences for po-previous reads from the same location (symmetrically to release writes for po-latter writes to the same location in IMM, C11, and PTX).

$$x := 1; y^{\text{rel}} := 1 \parallel r := y; y := 2; s := y^{\text{acq}}; t := x$$

$$(Wx1) \longrightarrow (Ry1) \longrightarrow (Ry2) \longrightarrow (Rx0)$$

#### 7 THIN AIR

Need ≤ to prevent thin air on rlx:

$$y := x \parallel x := y$$

$$Rx1 \longrightarrow Ry1 \longrightarrow Wx1$$

$$\begin{array}{ccc}
\hline
(\mathbf{R}x1) \longrightarrow (\mathbf{W}y1) & (\mathbf{R}y1) \longrightarrow (\mathbf{W}x1)
\end{array}$$

$$\begin{array}{c} Rx1 \\ \hline \\ Ry1 \\ \hline \\ Ry1 \\ \hline \\ Wx1 \\ \hline \end{array}$$

0:10 Anon.

#### 8 IMM EXAMPLES

Disallowed by IMM:

$$x := 2; y^{\text{rel}} := 1 \parallel r := y^{\text{acq}}; x := 1$$

$$(\text{PUB-REL-ACQ-COE})$$

$$(\text{W}x2) \xrightarrow{\text{bob}} (\text{W}^{\text{rel}}y1) \xrightarrow{\text{rfe}} (\text{R}^{\text{acq}}y1) \xrightarrow{\text{bob}} (\text{W}x1)$$

$$(\text{SIMM})$$

$$(\text{SIMM})$$

$$(\text{SIMM})$$

$$(\text{SIMM})$$

Allowed by IMM, but not by Power/ARMv7/ARMv8/TSO:

$$x := 2; y^{\text{rel}} := 1 \parallel r := y; x := 1$$

$$(\text{PUB-REL-RLX-COE})$$

$$(\text{W}x2) \xrightarrow{\text{bob}} \text{W}^{\text{rel}}y1 \xrightarrow{\text{rfe}} \text{R}y1 \xrightarrow{\text{data}} \text{W}x1$$

$$(\text{M}x2) \xrightarrow{\text{W}^{\text{rel}}y1} \text{R}y1 \xrightarrow{\text{W}x1}$$

$$(\text{M}x2) \xrightarrow{\text{W}^{\text{rel}}y1} \text{R}y1 \xrightarrow{\text{W}x1}$$

$$(\text{S})$$

$$(\text{W}x2) \xrightarrow{\text{W}^{\text{rel}}y1} \text{R}y1 \xrightarrow{\text{W}x1}$$

$$(\text{S})$$

Example from talk:

$$r := x \; ; \; x := 1 \parallel y := x \parallel x := y \qquad \qquad \text{(ARM7-WEAK)}$$

$$\boxed{\text{R}x1} \qquad \qquad \boxed{\text{R}x1} \qquad \qquad \boxed{\text{R}y1} \qquad e : \text{W}x1 \qquad \qquad (\leq)$$

$$\boxed{\text{R}x1} \qquad \qquad \boxed{\text{R}x1} \qquad \boxed{\text{R}x1} \qquad \boxed{\text{W}y1} \qquad e : \text{W}x1 \qquad \qquad (\leq)$$

$$\boxed{\text{R}x1} \qquad \qquad \boxed{\text{R}x1} \qquad \boxed{\text{R}x1} \qquad \boxed{\text{W}y1} \qquad e : \text{W}x1 \qquad \qquad (\underline{\text{S}})$$

$$\boxed{\text{R}x1} \qquad \qquad \boxed{\text{R}x1} \qquad \boxed{\text{W}y1} \qquad e : \text{W}x1 \qquad \qquad (\underline{\text{S}})$$

## 9 PTX EXAMPLES

Based on [Lustig et al. 2019; NVIDIA 2020].

In examples, all threads in different ctas.

(Rx0) must be forbidden. Before fulfilling the read:

$$x_{\mathsf{cta}}^{\mathsf{wk}} := 0 \; ; \; x_{\mathsf{cta}}^{\mathsf{wk}} := 1 \; ; \; y^{\mathsf{rel}} := 1 \; || \; r := y^{\mathsf{acq}} \; ; \; s := x_{\mathsf{cta}}^{\mathsf{wk}}$$
 
$$(\mathsf{PUB1}_{\mathsf{SYS}})$$

$$( \leq = \leq )$$

 $(Wx1) \subseteq (Rx)$  is required by c8b, enforcing publication.

(Rx0) must be allowed:

$$x_{\mathsf{cta}}^{\mathsf{wk}} := 0 \; ; \; x_{\mathsf{cta}}^{\mathsf{wk}} := 1 \; ; \; y_{\mathsf{cta}}^{\mathsf{rel}} := 1 \; || \; r := y_{\mathsf{cta}}^{\mathsf{acq}} \; ; \; s := x_{\mathsf{cta}}^{\mathsf{wk}} \tag{PUB1}_{\mathsf{CTA}} )$$

$$\begin{array}{c|c}
\hline
W_{\mathsf{cta}}^{\mathsf{wk}} x 0 \longrightarrow & W_{\mathsf{cta}}^{\mathsf{wk}} x 1
\end{array}
\qquad \left(W_{\mathsf{cta}}^{\mathsf{rel}} y 1\right) \qquad \left(\mathsf{R}_{\mathsf{cta}}^{\mathsf{acq}} y 1\right) \qquad \left(\mathsf{R}_{\mathsf{cta}}^{\mathsf{wk}} x\right)$$

We do not have  $(W^{rel}y1) \le (R^{acq}y1)$  since c7a only requires order for things that are morally strong.

Another example that may be of interest (nothing morally strong). Can this (Rx0)?

$$x := 0; x := 1 \parallel y := x \parallel if(y) \{r := x\}$$

PTX allows TC16 for events that are not mutually strong (TC16<sub>WK</sub>), but disallows it when events are mutually strong (TC16<sub>SYS</sub>). Note that  $\leq$  imposes no requirements here. Fulfillment imposes no order. This example shows that c8b cannot be strengthened to replace  $\sqsubseteq$  with  $\sqsubseteq$ .

$$r := x_{\text{cta}}^{\text{wk}}; x_{\text{cta}}^{\text{wk}} := 1 \parallel s := x_{\text{cta}}^{\text{wk}}; x_{\text{cta}}^{\text{wk}} := 2$$
 (TC16<sub>wK</sub>)

$$\begin{array}{c|c}
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\
 & & \\$$

$$r := x$$
;  $x := 1 \parallel s := x$ ;  $x := 2$  (TC16<sub>SYS</sub>)

$$\begin{array}{c} (\boxtimes x2) & (\boxtimes x1) & (\boxtimes x2) \\ (\boxtimes x2) & (\boxtimes x2) & (\boxtimes x2) \end{array}$$

$$(Rx2)$$
  $(Wx1)$   $(Rx1)$   $(Wx2)$ 

About Release-Acquire semantics. Anton confirms that the following example is allowed in C11, but disallowed in the IMM. It is apparently allowed in C11 with the intention to allow releasing writes to be downgraded to relaxed in the case that only fulfill relaxed reads.

$$r := x ; y^{\text{rel}} := 1 \parallel s := y ; x^{\text{rel}} := 1$$
 (LB-REL)

Another example from Anton. This is allowed in PTX because it does not include synchronization in the no-tar axiom, only in coherence and causality.

$$r := x ; y := r \parallel s := y ; x^{\text{rel}} := 1$$
 (LB-DATA-REL)

0:12 Anon.

#### 10 RFI EXAMPLES

Bad example:

$$(8x1)^{\text{mw}} (Wx2) \quad (Rx2) \quad (Wy1) \quad (Ry1) \quad (Wx1)$$

$$r := x; x := 2; s := x; y := s-1 \parallel r := y; x := r$$

$$(\boxtimes)$$

$$(\boxtimes)$$

$$\begin{array}{c} (Rx1) \longrightarrow (Rx2) \\ \hline (Wx2) \longrightarrow (Rx2) \\ \hline (Wy1) \longrightarrow (Ry1) \\ \hline (Wx1) \\ \hline \end{array}$$

Anton example 1 (Allowed by ARM) [rfi-coe-coe]

Wx2

$$x := 2; r := x^{\operatorname{acq}}; y := 1 \parallel y := 2; x^{\operatorname{rel}} := 1$$

$$(\operatorname{RFI-COE-COE})$$

$$(\operatorname{W}x2) \xrightarrow{\operatorname{rfi}} (\operatorname{Racq} x2) \xrightarrow{\operatorname{bob}} (\operatorname{W}y1) \xrightarrow{\operatorname{coe}} (\operatorname{W}y2) \xrightarrow{\operatorname{bob}} (\operatorname{Wrel} x1)$$

$$(\operatorname{W}x2) \xrightarrow{\operatorname{Racq}} x2 \xrightarrow{\operatorname{W}y1} (\operatorname{W}y2) \xrightarrow{\operatorname{Wrel}} x1$$

$$(\operatorname{S})$$

$$(\operatorname{W}x2) \xrightarrow{\operatorname{Racq}} x2 \xrightarrow{\operatorname{W}y1} (\operatorname{W}y2) \xrightarrow{\operatorname{Wrel}} x1$$

$$(\operatorname{S})$$

 $W^{rel}x1$ 

(⊑)

Internal reads survive acquires [rfi-acq-coe-coe] (where SC read = LDAR)

 $R^{acq}x^2$ 

Wy1

$$x := 2; s := z^{\text{sc}}; r := x^{\text{sc}}; y := 1 \parallel y := 2; x^{\text{rel}} := 1$$

$$(\text{RFI-ACQ-COE-COE})$$

$$(\text{W}x2) \xrightarrow{\text{rfi}} (\text{RSc}x2) \xrightarrow{\text{bob}} (\text{W}y1) \xrightarrow{\text{coe}} (\text{W}y2) \xrightarrow{\text{bob}} (\text{W}rel}x1)$$

$$(\checkmark \text{Arm8})$$

**₩** ₩ y 2

 And release-acquire pairs [rfi-ra-coe-coe] (where acquiring read = LDAPR)

$$x := 2$$
;  $w^{\text{rel}} := 1$ ;  $s := z^{\text{acq}}$ ;  $r := x^{\text{acq}}$ ;  $y := 1$  (RFI-RA-COE-COE2)  
  $\parallel y := 2$ ;  $x^{\text{rel}} := 1 \parallel r := w$ ;  $z := 1$ ;

But not if either acquire is strengthened to SC (where SC read = LDAR). The execution is also disallowed if an external thread places order between the ra accesses:

$$x := 2$$
;  $w^{\text{rel}} := 1$ ;  $s := z^{\text{acq}}$ ;  $r := x^{\text{acq}}$ ;  $y := 1$  (RFI-RA-DATA-COE-COE)  
  $\parallel y := 2$ ;  $x^{\text{rel}} := 1 \parallel r := w$ ;  $z := r$ ;

To allow this, weaken ra to rlx when read fulfilled by relaxed write of same thread (don't need to allow this when the write is part of an RMW).

$$x := 2; r := x^{\text{acq}}; y := 1 \parallel y := 2; x^{\text{rel}} := 1$$
 $x := 2; r := x^{\text{acq}}; y := 1 \parallel y := 2; x^{\text{rel}} := 1$ 
 $x := 2; r := x^{\text{acq}}; y := 1 \parallel y := 2; x^{\text{rel}} := 1$ 
 $x := 2; r := x^{\text{acq}}; y := 1 \parallel y := 2; x^{\text{rel}} := 1$ 
 $x := 2; r := x^{\text{acq}}; y := 1 \parallel y := 2; x^{\text{rel}} := 1$ 

RF variant [rfi-rfe-coe]:

$$x := 2; r := x^{\text{acq}}; y := 1 \parallel s := y; x^{\text{rel}} := 1$$
 (RFI-RFE-COE)

 $(x := 2; r := x^{\text{acq}}; y := 1 \parallel s := y; x^{\text{rel}} := 1$  (RFI-RFE-COE)

 $(x := 2; r := x^{\text{acq}}; y := 1 \parallel s := y; x^{\text{rel}} := 1$  (RFI-RFE-COE)

Tso variant [rfi-fre-coe2]:

$$x := 2; r := x^{\text{acq}}; s := y \parallel y := 2; x^{\text{rel}} := 1$$
 (RFI-COE-COE2)

$$\begin{array}{c} Wx2 \\ \\ \hline Wx2 \\ \hline \end{array} \xrightarrow{\text{rfi}} \begin{array}{c} R^{\text{acq}}x2 \\ \hline \end{array} \xrightarrow{\text{bob}} \begin{array}{c} Ry0 \\ \hline \end{array} \xrightarrow{\text{fre}} \begin{array}{c} Wy2 \\ \hline \end{array} \xrightarrow{\text{bob}} \begin{array}{c} Wx1 \\ \hline \end{array}$$
 ( $\checkmark$ Arm8)

Note that TSO does not order W to R in local order, even in poloc. Nonetheless, TSO disallows the following because of local visibility in first thread.

$$x := 2$$
;  $r := x \parallel x := 1$ ;  $s := x$ 

Wx2

From Wx1

Rx2

(XTSO)

[Higham and Kawash 2000] describe TSO as a linearization of partial order including:

0:14 Anon.

```
poloc
```

 • lws = po; [W]

•  $d \stackrel{\text{po}}{\cdots} e$  when  $c \stackrel{\text{rfe}}{\longrightarrow} d \stackrel{\text{po}}{\cdots} e$ 

[Alglave et al. 2020] describe TSO as linearization of partial order satisfying internal visibility and including

- [W]; po; [W]
- $d \xrightarrow{\text{po}} e \text{ when } c \xrightarrow{\text{rfe}} d \xrightarrow{\text{po}} e, \text{ from (range(rfe)} * \_)$
- [R]; po; [W], from (rfi^-1; lob)

Ignoring fences and RMWs:

Double FRE variant [rfi-fre-fre]:

$$x := 2; r := x^{\text{acq}}; s := y \parallel y := 2; F; r := x$$

$$(RFI-FRE-FRE)$$

$$(Wx2) \xrightarrow{\text{rfi}} (R^{\text{acq}}x2) \xrightarrow{\text{bob}} (Ry0) \xrightarrow{\text{fre}} (Wy2) \xrightarrow{\text{bob}} (Rx0)$$

$$(\checkmark \text{Arm8})$$

It does not seem possible to do this only with rfe. ARM disallows this [data-rfi-rfe-rfe]:

$$x := z; r := x^{\text{acq}}; y := 1 \parallel z := y$$

$$(DATA-RFI-RFE-RFE)$$

$$(XArm8)$$

It also disallows [ctrl-rfi-rfe-rfe]:

$$if(z)\{\}; x := 1; r := x^{acq}; y := 1 \parallel z := y$$

$$(CTRL-RFI-RFE-RFE)$$

$$(Rz1) \xrightarrow{rfi} (R^{acq}x1) \xrightarrow{bob} (Wy1) \xrightarrow{rfe} (Wy1) \xrightarrow{data} (Wz1)$$

$$(XArm8)$$

ARM allows some counterintuitive results for SC access [ctrl-rfi-fre-rfe]:

if 
$$(x)$$
 {};  $x := 2$ ;  $r := x^{sc}$ ;  $s := y^{sc} \parallel y^{sc} := 2$ ;  $x^{sc} := 1$  (CTRL-RFI-FRE-RFE)

$$(x) = (x) + (x) +$$

Not possible with coe [ctrl-rfi-coe-rfe]:

if(x){}; 
$$x := 2$$
;  $r := x^{sc}$ ;  $y^{sc} := 1 \parallel y^{sc} := 2$ ;  $x^{sc} := 1$  (CTRL-RFI-COE-RFE)

Ctrl

Wx2

The Wsc y1

Coe

Wsc y2

Wsc y2

Wsc y1

The Wsc y2

Wsc y1

The Wsc y2

The Wsc y2

Wsc y2

The Wsc

This is not allowed with a data dependency instead of a control dependency [data-rfi-fre-rfe]:

$$x := x+1; r := x^{sc}; s := y^{sc} \parallel y^{sc} := 1; x^{sc} := 1$$
 (DATA-RFI-FRE-RFE)

(XArm8)

#### 11 SC EXAMPLES

 Example 11.1. Consider IRIW with all ra access:

$$x^{\text{rel}} := 1 \parallel r := x^{\text{acq}}; \ s := y^{\text{acq}} \parallel y^{\text{rel}} := 1 \parallel r := y^{\text{acq}}; \ s := x^{\text{acq}}$$

$$(\text{IRIW-ACQ-ACQ})$$

$$(\text{POWER,c11})$$

$$(\text{POWER,c11})$$

We allow this execution:

IRIW-ACQ-SC1, is allowed by trailing-sync compilation to power [Lahav et al. 2017, §1].

$$x^{\text{sc}} := 1 \parallel r := x^{\text{acq}}; \ s := y^{\text{sc}} \parallel y^{\text{sc}} := 1 \parallel r := y^{\text{acq}}; \ s := x^{\text{sc}}$$

$$(\text{IRIW-ACQ-SC1})$$

$$(\text{POWER}, \text{C11})$$

To model this it is convenient that synchronization is not included in dependency order:

- add sc bullet to def of ⊆ in c8b,
- add SC access to sync-delays.

This correctly forbids the all sc version:

$$x^{\text{sc}} := 1 \parallel r := x^{\text{sc}}; \ s := y^{\text{sc}} \parallel y^{\text{sc}} := 1 \parallel r := y^{\text{sc}}; \ s := x^{\text{sc}}$$

$$(\text{IRIW-SC-SC})$$

$$(\text{Sc} \ y1) \longrightarrow (\text{Rsc} \ y1) \longrightarrow (\text{Rsc} \ y1) \longrightarrow (\text{Rsc} \ y1) \longrightarrow (\text{Sc} \ y1) \longrightarrow (\text{Rsc} \ y1) \longrightarrow (\text{$$

*Example 11.2.* Thin air with an SC antidependency:

$$y^{\text{sc}} := x \parallel y^{\text{sc}} := 2 \parallel x := y - 1$$

$$Rx1 \longrightarrow W^{\text{sc}}y1 \longrightarrow Ry2 \longrightarrow Wx1$$

0:16 Anon.

IRIW-ACQ-SC2 is allowed by trailing-sync compilation to power [Lahav et al. 2017, §1].

$$x^{\text{sc}} := 1 \parallel r := x^{\text{acq}}; \ s := y^{\text{sc}} \parallel y^{\text{sc}} := 1 \parallel r := y^{\text{acq}}; \ s := x^{\text{sc}}$$

$$(\text{IRIW-ACQ-SC2})$$

$$(\text{VPOWER,RC11})$$

This example is hard to get right for power because it must be allowed with ra reads, but disallowed with sc reads. This seems unsolvable: To allow the version with ra, we would need to weaken the order between the reads in each thread for the ra case, and that would break publication.

Leading sync is also unsound in c11 with RMW [Lahav et al. 2017, §2.1].

$$x^{\text{sc}} := 1; \ y^{\text{rel}} := 1 \parallel \text{FADD}^{\text{sc,sc}}(y, 1); \ s := y \parallel y^{\text{sc}} := 3; \ s := x^{\text{sc}}$$

$$(z6.\text{U})$$

$$W^{\text{sc}}(y1) \longrightarrow (\mathbb{R}^{\text{sc}}(y1)) \longrightarrow (\mathbb{R}^{\text{sc}}(y1)$$

Leading sync is also unsound in c11 with SC fences [Lahav et al. 2017, §A.1].

$$x := 2; F^{\text{sc}}; r := y \parallel y^{\text{sc}} := 1 \parallel r := y^{\text{acq}}; x^{\text{rel}} := 1; s := x \parallel r := x^{\text{sc}}$$

$$(\text{RSYNC+RSC})$$

$$(\text{Wx2}) \leftarrow (\text{RSYNC+RSC})$$

$$(\text{RSYNC+RSC}) \leftarrow (\text{RSYNC+RSC})$$

Fulfillment of (Rx2) requires that either  $(W^{rel}x1) \to (Wx2)$  or  $(Rx2) \to (W^{rel}x1)$ . It's interesting that in the pomset,  $(R^{sc}x1)$  is not needed to get a cycle.

There is a long discussion of this in [Bender and Palsberg 2019, §5.2, Fig. 17], where they also discuss this example:

$$x^{\text{sc}} := 1; \ x := 2 \parallel y^{\text{sc}} := 1; \ y := 2 \parallel r := x^{\text{acq}}; \ s := y^{\text{sc}} \parallel r := y^{\text{acq}}; \ s := x^{\text{sc}}$$
 (IRIW-SC-RLX-ACQ)

(\*\vec{\mathbb{R}^{\text{sc}}} y\_1 \ \vec{\mathbb{W}} x\_2 \ \vec{\mathbb{R}^{\text{acq}}} x\_2 \ \vec{\mathbb{R}^{\text{sc}}} y\_0 \ (\*\vec{\mathbb{R}^{\text{sc}}} x\_0)

[Lahav et al. 2017, §A.2] claims that Arm8 allows this [RWC+acq+sc], but herd7 rejects it. Reason: they are citing the flowing/pop model [Flur et al. 2016] rather than [Pulte et al. 2018].

$$x^{\text{sc}} := 1 \parallel r := x; F^{\text{acq}}; s := y^{\text{sc}} \parallel y^{\text{sc}} := 1; r := x^{\text{sc}}$$

(RWC+ACQ+SC)

(RWC+ACQ+SC)

(XArm8)

## 12 TWO ORDER IDEA

 The two order idea from OOPSLA talk is:

• Require:  $d \sqsubseteq e$  when  $d \unlhd e$  and they conflict

This does not work for the IMM or ARMv7, but it may work for Power, TSO, ARMv8. That would be nice. Let's write  $\sqsubseteq$  for this notion, with strong fulfillment.

With this there is a cycle in ARM7-WEAK (weak/strong fulfillment not relevant here):

Anton says: ARM7-WEAK is forbidden by Power, TSO, ARMv8, but allowed by ARMv7. Maybe it isn't that important to support it anymore.

There is also a cycle in Pub-rel-rlx-coe. Anton says: I checked Power/ARMv7 models in this regard. They disallow the behavior (as well as ARMv8 and TSO), so we can in principle strengthen IMM to forbid it as well. For that, we may add axiom to IMM forbidding cycles in co  $\cup([W]; rfe^?; ([R^{acq}] \cup po; [FW^{rel}]); ar^*; [W])$ . This works if we have acquire/release accesses on the path since they are compiled with fences to Power.

## 13 DEAD STORE ELIMINATION, STORE FORWARDING, AND MONOTONICITY

We validate "monotonicity" by updating the rules for read, write and fence to include  $(\exists \nu \supseteq \mu)$ :

(R2) 
$$\lambda(e) = \alpha R_{\sigma}^{\nu} x v$$
, (W2)  $\lambda(e) = \alpha W_{\sigma}^{\nu} x v$ , (F2)  $\lambda(e) = \alpha F_{\sigma}^{\nu}$ .

One could do the same for scopes.

[Todo: The rest of this is very sketchy. It is difficult to get merging with alternate worlds not messing things up. Any kind of disjointness requirement imperils associativity.]

The semantics already validates:

- $[x := M; x := M] \supseteq [x := M]$
- $\llbracket s := x ; r := x \rrbracket \supseteq \llbracket s := x ; r := s \rrbracket$
- $\llbracket r := x \rrbracket \supseteq \llbracket skip \rrbracket$

It does not validate:

- $[x := M; x := N] \supseteq [x := N]$
- $[x := M; r := x] \supseteq [x := M; r := M]$

The semantics of Fig. 1 validates elimination of irrelevant relaxed reads and redundant reads. Fig. 1 also validates elimination of writes of the same value. However, Fig. 1 does not validate general write elimination, where, for example, (x := 1; x := 2) is refined to x := 2. Nor does it validate store forwarding, where, for example, (x := 1; r := x) is refined to (x := 1; r := 1).

Elimination can be justified in pomset by *merging* actions with different labels. A list of safe merges can be found in [Chakraborty and Vafeiadis 2017, §E] and [Kang 2019, §7.1]. For examples of unsafe merges and reorderings, see [Chakraborty and Vafeiadis 2017, §D]. See also [Chakraborty and Vafeiadis 2019, §6.2]

Read-read and fence-fence merges can be handled by "monotonicity": allowing actions to put down stronger modes in the model. Then they can merge on the nose.

Sad: read elimination can't be done the nice way using  $\tau^D(\psi) \equiv x = r \Rightarrow \psi$  for R4c because there may be a release-acquire pair between the read and the matching write.

Let merge :  $\mathcal{A} \times \mathcal{A} \to \mathcal{A}$  be a partial function defined as follows.

$$\mathsf{merge}(a,\ b) = \begin{cases} a & \text{if } a = b \text{ or } a = (\alpha \mathsf{W}^{\mu}_{\sigma} x v) \text{ and } b = (\alpha \mathsf{R}^{\nu}_{\sigma} x v) \\ b & \text{if } a = b \text{ or } a = (\alpha \mathsf{W}^{\mu}_{\sigma} x v) \text{ and } b = (\alpha \mathsf{W}^{\nu}_{\sigma} x w) \\ \mathsf{undefined} & \mathsf{otherwise} \end{cases}$$

(If we have "monotonicity" then we can require  $\mu = \nu$ .)

If  $a_0 = \mathsf{merge}(a_1, a_2)$ , then  $a_1$  and  $a_2$  can coalesce, resulting in  $a_0$ . This allows optimizations such as (x := 1; x := 2) to (x := 2) and (x := 1; r := x) to (x := 1; r := 1). For associativity of sequential composition, it is important that merge always take an upper bound on the modes of the two actions. For example, it would invalidate associativity to allow  $(\mathsf{W} x v) = \mathsf{merge}(\mathsf{W} x v, \mathsf{R}^{\mathsf{acq}} x v)$ , although this is considered safe.

```
Then we can replace s2-s3 in Fig. 1 by: (s2a) if e \in E_1 \setminus E_2 then \lambda(e) = \lambda_1(e),
```

0:18 Anon.

```
(s2b) if e \in E_2 \setminus E_1 then \lambda(e) = \lambda_2(e),
834
             (s2c) if e \in E_1 \cap E_2 then \lambda(e) = \text{merge}(\lambda_1(e), \lambda_2(e)),
835
             (s3a) if e \in E_1 \setminus E_2 then \kappa(e) \equiv \kappa_1(e),
836
             (s3b) if e \in E_2 \setminus E_1 then \kappa(e) \equiv \kappa_2'(e),
837
             (s3c) if e \in E_1 \cap E_2 then either
838
                            • \lambda_1(e) = \lambda(e) = \lambda_2(e) and \kappa(e) \equiv \kappa_1(e) \vee \kappa_2'(e),
839
                            • \lambda_1(e) = \lambda(e) \neq \lambda_2(e) and \kappa_2'(e) \equiv \kappa(e) \equiv \kappa_1(e) (write-read),
840
                            • \lambda_1(e) \neq \lambda(e) = \lambda_2(e) and \kappa_1(e) \equiv \kappa(e) \equiv \kappa_2'(e) (write-write).
841
              Full merge: if (M) {x := 1}; x := 2 can become x := 2.
```

Partial merge: x := 1; if  $(M)\{x := 2\}$  can become if  $(M)\{x := 2\}$  else  $\{x := 1\}$ .

To get associativity, you need the ability to merge with multiple events.

$$x := 1; \text{ if } (M)\{x := 2\}$$

$$(\neg M \mid Wx1) (M \mid Wx2)$$

$$(\neg M \mid Wx2)$$

This is asymmetric. We don't expect to merge all three events in the following:

$$\begin{array}{ll} \text{if}(!M)\{x := 2\} & x := 1; \text{ if}(M)\{x := 2\} \\ \hline \begin{pmatrix} \neg M \mid \forall x 2 \end{pmatrix} & \begin{pmatrix} \neg M \mid \forall x 1 \end{pmatrix} \begin{pmatrix} M \mid \forall x 2 \end{pmatrix} \\ \end{array}$$

We could have a lot merging:

Full merge: x := 1; if  $(M)\{r := x\}$  can become x := 1; if  $(M)\{r := 1\}$ .

Partial merge: if  $(M)\{x := 1\}$ ; r := x can become if  $(M)\{x := 1\}$ ;  $r := 1\}$  else  $\{r := x\}$ .

I don't think we need multi-merge for write-read. Reads only affect the world via the predicate transformer. Any conditional surrounding a read is baked into the predicate transformer, and so does not to persist in the preconditions of the actions themselves after the merge. Consider r := 1; x := 2; if  $(M)\{r := x\}$ . This can safely transform to r := 1; x := 2; if  $(M)\{r := 2\}$ .

In the example below, the reads should *not* merge. Although the second read can merge with the write.

$$if(!M)\{x := 1\}; if(M)\{r := x\} \qquad if(!M)\{s := x\}$$
$$(\neg M \mid Wx1) (M \mid Rx1) \qquad (\neg M \mid Rx1)$$

Another example:

$$x := 1; if(M)\{r := x\}$$
  $if(!M)\{s := x\}$   $(Mx1)$   $(M | Rx1)$ 

Another example:

$$x:=1$$
 if  $(M)\{r:=x\}$ ; if  $(!M)\{s:=x\}$   $(\mathbb{R}x1)$ 

Idea for multi-merge. Use  $E_1' \subseteq E_1$ , with a surjective function  $\pi: E_1 \to E_1'$  that shows how writes merge.

- Require that  $(\forall d \in E'_1) \pi(d) = d$ .
- Require that if  $c \in (E_1 \setminus E_1')$  then  $\pi(c) \in E_2$ —and therefore  $\pi(c) \in (E_1' \cap E_2)$ .
- Take  $E = E'_1 \cup E_2$ .

Require that the writes that coalesce have disjoint preconditions.

Proc. ACM Program. Lang., Vol. 0, No. POPL, Article 0. Publication date: January 2022.

```
• if \pi(c) = \pi(c') then \kappa_1(c) \wedge \kappa_1(c') is unsatisfiable
```

Then each of them has to merge into the same write  $e \in E_2$  using the merge function and combining the predicates as specified above.

```
(s2a) if e \in E'_1 \setminus E_2 then \lambda(e) = \lambda_1(e),
```

(s2b) if 
$$e \in E_2 \setminus E'_1$$
 then  $\lambda(e) = \lambda_2(e)$ ,

(s2c) if 
$$e \in (E'_1 \cap E_2)$$
 and  $c \in E_1$  and  $\pi(c) = e$  then  $\lambda(e) = \text{merge}(\lambda_1(c), \lambda_2(e))$ ,

- (s3a) if  $e \in E'_1 \setminus E_2$  then  $\kappa(e) \equiv \kappa_1(e)$ ,
- (s3b) if  $e \in E_2 \setminus E_1'$  then  $\kappa(e) \equiv \kappa_2'(e)$ ,
- (s3c) if  $e \in (E'_1 \cap E_2)$  then
  - $\kappa(e) \equiv \kappa_2'(e) \vee \bigvee_{c \in C} \kappa_1(c)$ , where  $C = \{c \in E_1 \mid \pi(c) = e \text{ and } \lambda_1(c) = \lambda_2(e)\}$ ,
  - if  $\pi(c) = e$  and  $\lambda_1(c) = \lambda(e) \neq \lambda_2(e)$  then  $\kappa_2'(c) \equiv \kappa(e)$  (write-read),
  - if  $\pi(c) = e$  and  $\lambda_1(c) \neq \lambda(e) = \lambda_2(e)$  then  $\kappa_1(c) \equiv \kappa(e)$  (write-write).

### 14 OLD NOTES

883 884

885

886

887

888

889

890

897

898

902

906

908

910

911

912

913 914

916

918

920

922 923

924

925

926

927

928 929

930 931 Goal is to capture POWER, not ARM-v7. See §14.3, below. So cannot use IMM out of the box. Introduce weak order  $\sqsubseteq^3$ .

*Definition 14.1 (2.1).* A (*memory order*) *pomset* is a tuple  $(E, \leq, \sqsubseteq, \lambda, \xrightarrow{rmw})$ :

- E is a set of states,
- $\leq \subseteq (E \times E)$  and  $\sqsubseteq \subseteq (E \times E)$  are partial orders,
- $\lambda : E \to (\Phi \times \mathcal{A})$  is a *labeling*, from which we derive functions  $\kappa : E \to \Phi$  and  $\lambda : E \to \mathcal{A}$ ,
- if  $d (\leq \cup \sqsubseteq) e$  then  $\kappa(e)$  implies  $\kappa(d)$ , and
- $\bigwedge_{e} \kappa(e)$  is satisfiable.

### Additional stuff:

- if  $d (\leq ; \sqsubseteq)$  e then  $d \neq e$ , and
- if  $d (\leq ; \sqsubseteq ; \leq) e$ , d is SC, and e is SC, then  $d \leq e$ .

## RMW:

- If  $d \xrightarrow{\mathsf{rmw}} e$ , then  $d \leq e$ .
- If  $\exists x. c$  and e write  $x, c \sqsubseteq e$ , and  $d \xrightarrow{\mathsf{rmw}} e$ , then  $c \sqsubseteq d$ .
- If  $\exists x. c$  and e write  $x, d \sqsubseteq c$ , and  $d \xrightarrow{\mathsf{rmw}} e$ , then  $e \sqsubseteq c$ .

Update the definitions to use  $\sqsubseteq$  instead of  $\leq$  in two places:

- the last item defining fulfillment, and
- item 5b defining prefixing.

Definition 14.2 (2.4). We say d fulfills e on x if

- d writes v to x,
- e reads v from x,
- $d \leq e$ , and
- if *c* writes to *x* then either  $c \sqsubseteq d$  or  $e \sqsubseteq c$ .

*Definition 14.3 (2.10).* Let  $(\phi \mid a) \Rightarrow \mathcal{P}$  be the set  $\nabla \mathcal{P}'$  where  $P' \in \mathcal{P}'$  when there is some  $P \in \mathcal{P}$  that satisfies items 1-4 of Definition 2.8 such that:

- 5a. if *e* writes then either d <' e or  $\kappa'(e)$  implies  $\kappa(e)$ ,
- 5b. if d and e are actions in conflict, then  $d \sqsubset' e$ ,

This does not hold, for example, in [x := 1; x := 2].

<sup>&</sup>lt;sup>3</sup>Note we can not require

<sup>•</sup> if  $d (\leq \cup \sqsubseteq) e$  then  $\sigma(d)$  subsumes  $\sigma(e)$ .

0:20 Anon.

- 5c. if d is an acquire or e is a release, then d <' e,
- 5d. if d is an SC write and e is an SC read, then d <' e,
- 5e. if d reads, and e is an acquiring fence, then d < 'e, and
- 5f. if *d* is a releasing fence, and *e* writes, then d <' e.

Weak order is only required to relate actions on the same location. In augmentation minimal pomsets, it is a subset of eco (only relates writes that are read). The irreflexivity requirement in the definition is thus comparable to requiring that  $\leq \cup \sqsubseteq_x$  is a partial order, for every x. It is *not* the case that  $\leq \cup \sqsubseteq$  is a partial order.

Note that we have a kind of semi-transitivity here, but only per variable.

- If  $c \leq_x d \sqsubseteq_x e$  then  $c \sqsubseteq_x e$ .
- If  $c \sqsubseteq_x d \leq_x e$  then  $c \sqsubseteq_x e$ .

With the requirements of fulfillment, we have that  $d \le e$  implies  $d \sqsubseteq e$  when the actions conflict—there is a caveat for unread writes, where no order is forced.

Here are some examples of the main text. To better visualize, we use different arrowheads for strong and weak order. We use a single color for strong order, and separate colors for each variable in weak order.

## 14.1 Many Examples

Here is fencing behavior mixing with release/acquire:

$$x := 1; y^{\text{ra}} := 1 \parallel r := y^{\text{acq}}; s := x$$

$$(Wx1) \leftarrow (W^{\text{rel}}y1) \rightarrow (Rx0)$$

$$x := 1$$
;  $F^{rel}$ ;  $y := 1 \parallel r := y^{acq}$ ;  $s := x$   
 $(Wx1) \longrightarrow (Ry1) \longrightarrow (Rx0)$ 

$$x := 1; y^{ra} := 1 \parallel r := y; F^{acq}; s := x$$

$$(Wx1) \qquad (Ry1) \qquad (Rx0)$$

$$x := 1; \ \mathsf{F}^{\mathsf{rel}}; \ y := 1 \parallel r := y; \ \mathsf{F}^{\mathsf{acq}}; \ s := x$$
 
$$(\mathsf{W} x 1) \leftarrow (\mathsf{F}^{\mathsf{rel}}) \leftarrow (\mathsf{W} y 1) \rightarrow (\mathsf{R} y 1) \leftarrow (\mathsf{F}^{\mathsf{acq}}) \leftarrow (\mathsf{R} x 0)$$

Coherence C does not allow:

$$x := 1; \ x := 2 \parallel y := x; \ z := x$$

$$(\mathbb{W}x1 \rightarrow \mathbb{W}x2) \rightarrow (\mathbb{R}x2) \rightarrow (\mathbb{W}y2) \rightarrow (\mathbb{R}x1) \rightarrow (\mathbb{W}z1)$$

Coherence Java allows:

$$x := 1; y^{ra} := 1 \parallel x := 2; z^{ra} := 1 \parallel r := y^{sc}; r := z^{sc}; r := x; r := x$$

$$\begin{array}{c} Wx1 \\ \hline Wx2 \\ \hline \end{array} \qquad \begin{array}{c} R^{ra}y1 \\ \hline \end{array} \qquad \begin{array}{c} Rx2 \\ \hline \end{array}$$

Store buffering

$$x := 0; y := 0; (x^{ra} := 1; y := r \parallel y^{ra} := 1; x := r)$$
 $(wx0) \rightarrow (w^{rel}x1) \rightarrow (Rx0)$ 
 $(wy0) \rightarrow (w^{rel}y1) \rightarrow (Rx0)$ 

**IRIW** 

$$x := 0$$
;  $x := 1 || r := x^{acq}$ ;  $s := y$   
 $|| y := 0$ ;  $y := 1 || r := y^{acq}$ ;  $s := x$ 



Three variable MCA example Allowed.

$$\begin{split} & \text{if}(x)\{y := 0\}; \ y := 1 \\ & \text{if}(y)\{z := 0\}; \ z := 1 \\ & \text{if}(z)\{x := 0\}; \ x := 1 \end{split}$$

Two variable MCA example Allowed.

$$if(x)\{y := 0\}; y := 1 || if(y)\{x := 0\}; x := 1$$

$$(\mathbb{R}x1) + (\mathbb{W}y0) \cdot (\mathbb{W}y1) + (\mathbb{R}y1) + (\mathbb{W}x0) \cdot (\mathbb{W}x1)$$

[Lahav et al. 2017, Fig. 5]

$$x := 1 \parallel r := x \; ; \; \mathsf{F^{sc}} \; ; \; r := y \parallel y := 1 \; ; \; \mathsf{F^{sc}} \; ; \; r := x$$

[Lahav et al. 2017, Fig. 6]

$$x := 1; z^{ra} := 1; \parallel r := z^{acq}; F^{sc}; r := y \parallel y := 1; F^{sc}; r := x$$

$$\underbrace{\mathbb{W}x1} \underbrace{\mathbb{W}^{rel}z1} \underbrace{\mathbb{F}^{sc}} \underbrace{\mathbb{R}y0} \underbrace{\mathbb{R}y0} \underbrace{\mathbb{F}^{sc}} \underbrace{\mathbb{R}x0}$$

PSC is part of AR [Podkopaev et al. 2019, Ex. 3.9]:

$$r := y$$
;  $F^{sc}$ ;  $r := z \parallel z := 1$ ;  $F^{sc}$ ;  $r := x \parallel if(x \neq 0)\{y := 1\}$ 

$$Ry1 \longrightarrow Rz0 \longrightarrow Wz1 \longrightarrow Rx1 \longrightarrow Rx1 \longrightarrow Wy1$$

Detour Example [Podkopaev et al. 2019, Ex. 3.7]:

$$x := z-1; \ y := x \parallel x := 1 \parallel z := y$$

$$(Rz1) \leftarrow (Wx0) \qquad (Ry1) \leftarrow (Wy1) \rightarrow (Ry1) \leftarrow (Wz1)$$

Proc. ACM Program. Lang., Vol. 0, No. POPL, Article 0. Publication date: January 2022.

0:22 Anon.

## 14.2 Another fencing example

 [Podkopaev et al. 2019, §D]: The following execution graph is not consistent in the promise-free declarative model of [Kang et al. 2017]. Nevertheless, its mapping to POWER (obtained by simply replacing Fsc with Fsync) is POWER-consistent and po  $\cup$  rf is acyclic (so it is Strong-POWER-consistent). Note that, using promises, the promising semantics allows this behavior.

$$r := z \; ; \; \mathsf{F^{sc}} \; ; \; x := 1 \parallel x := 2 \; ; \; \mathsf{F^{sc}} \; ; \; y := 1 \parallel r := y \; ; \; z := 1$$

Allowed behavior on POWER... Is there a dependency in the last thread? If so, this is a problem.

[Podkopaev et al. 2019, §8]: To establish the correctness of compilation of the promising semantics to POWER, Kang et al. [2017] followed the approach of Lahav and Vafeiadis [2016]. This approach reduces compilation correctness to POWER to (i) the correctness of compilation to the POWER model strengthened with po Urf acyclicity; and (ii) the soundness of local reorderings of memory accesses. To establish (i), Kang et al. [2017] wrongly argued that the strengthened POWER-consistency of mapped promise-free execution graphs imply the promise-free consistency of the source execution graphs. This is not the case due to SC fences, which have relatively strong semantics in the promise-free declarative model (see [Podkopaev et al. 2018, Appendix D] for a counter example). Nevertheless, our proof shows that the compilation claim of Kang et al. [2017] is correct.

#### 14.3 Power versus ARM7

[Lahav and Vafeiadis 2016, §5]: Characterizing ppo of power:

$$[RU]; (deps \cup poloc)^+; [WU] \subseteq ppo$$
 (PPO lower)  

$$ppo \cap po_{imm} \subseteq (deps \cup poloc)^+$$
 (PPO upper)

 $R_{\text{imm}}$  denotes the relation consisting of all immediate R-edges, i.e., pairs  $(a, b) \in R$  such that for every  $c, (c, b) \in R$  implies  $(c, a) \in R^{?}$ , and  $(a, c) \in R$  implies  $(b, c) \in R^{?}$ .

[Podkopaev et al. 2019, After example 3.6]: Note that we do not include fri in ppo since it is not preserved in ARMv7 [Alglave et al. 2014] (unlike in x86-TSO, POWER, and ARMv8). Thus, as ARMv7 (as well as the Flowing and POP models of ARM in [Flur et al. 2016]), IMM allows the weak behavior from [Lahav and Vafeiadis 2016, §6].

[Lahav and Vafeiadis 2016, §6]: Consider the program in Fig. 4.

$$r := x \; ; \; x := 1 \parallel y := x \parallel x := y$$

$$\begin{array}{c} (\mathsf{R}x1) & (\mathsf{W}y1) & (\mathsf{R}y1) & (\mathsf{W}x1) \\ \end{array}$$

Note that no reorderings or eliminations can be applied to this program. In the second and the third threads, reordering is forbidden because of the dependency between the load and the subsequent store. On the first thread, there is no dependency, but since the load and the store access the same location, their reordering is generally unsound, as it allows the load to read from the (originally subsequent) store. Moreover, this program cannot return a=1 under a (po  $\cup$  rf)-acyclic model, because the only instance of the constant 1 in the program occurs after the load of x in the first thread. Nevertheless, this behavior is allowed under both the axiomatic ARMv7 model of Alglave et al. [4] and the ARMv8 Flowing and POP models of Flur et al. [12]

The axiomatic ARMv7 model [4] is the same as the Power model presented in Section 5, with the only difference being the definition of ppo (preserved program order). In particular, this model does not satisfy (ppo-lower-bound) because

$$[RU]$$
; poloc;  $[WU] \nsubseteq ppo$ .

Hence, the first thread's program order in the example above is not included in ppo, and there is no happens-before cycle. For the same reason, our proof for Power does not carry over to ARM.

In the ARMv8 Flowing model [12], consider the topology where the first two threads share a queue and the third thread is separate. The following execution is possible: (1) the first thread issues a load request from x and immediately commits the x := 1 store; (2) the second thread then issues a load request from x, which gets satisfied by the x := 1 store, and then (3) issues a store to y := 1; (4) the store to y gets reordered with the x-accesses, and flows to the third thread; (5) the third thread then loads y = 1, and also issues a store x := 1, which flows to the memory; (6) the load of x flows to the next level and gets satisfied by the x := 1 store of the third thread; and (7) finally the x := 1 store of the first thread also flows to the next level. The POP model is strictly weaker than the Flowing model, and thus also allows this outcome.

### 14.4 Fences and RMW

[Podkopaev et al. 2019, Remark 2, After example 3.1]: Aim: allow the splitting of release writes and RMWs into release fences followed by relaxed operations. In RC11 [Lahav et al. 2017], as well as in C/C++11 [Batty et al. 2011], this rather intuitive transformation, as we found out, is actually unsound.

$$y := 1; \ x^{\mathsf{ra}} := 1 \parallel \mathsf{INC}^{\mathsf{ra},\mathsf{ra}}(x); \ x := 3 \parallel r := x^{\mathsf{acq}}; \ s := y$$

$$(\mathsf{W} y 1) \leftarrow (\mathsf{W}^{\mathsf{rel}} x 1) \rightarrow (\mathsf{R}^{\mathsf{acq}} x 1) \leftarrow (\mathsf{W}^{\mathsf{rel}} x 2) \rightarrow (\mathsf{R}^{\mathsf{acq}} x 3) \leftarrow (\mathsf{R} y 0)$$

(R)C11 disallows the annotated behavior, due in particular to the release sequence formed from the release exclusive write to x in the second thread to its subsequent relaxed write. However, if we split the increment to fencerel; a := FADDacq,rlx(x, 1) (which intuitively may seem stronger), the release sequence will no longer exist, and the annotated behavior will be allowed. IMM overcomes this problem by strengthening sw in a way that ensures a synchronization edge for the transformed program as well

$$y := 1$$
;  $x^{\text{ra}} := 1 \parallel \mathsf{F}^{\text{rel}}$ ;  $\mathsf{INC}^{\text{ra,rlx}}(x)$ ;  $x := 3 \parallel r := x^{\text{acq}}$ ;  $s := y$ 
 $\mathsf{W}y1 \longrightarrow \mathsf{W}^{\text{rel}}x1 \longrightarrow \mathsf{R}^{\text{acq}}x1 \longrightarrow \mathsf{W}x2 \longrightarrow \mathsf{W}x3 \longrightarrow \mathsf{R}^{\text{acq}}x3 \longrightarrow \mathsf{R}y0$ 

We seem to disallow both of these out of the box.

In the case of a relaxed read in the RMW, the outcome is allowed in both cases:

$$y := 1; x^{\mathsf{ra}} := 1 \parallel \mathsf{INC}^{\mathsf{rlx},\mathsf{ra}}(x); x := 3 \parallel r := x^{\mathsf{acq}}; s := y$$

$$(\mathsf{W}y1) + (\mathsf{W}^{\mathsf{rel}}x1) + (\mathsf{R}x1)^{\mathsf{rnw}}(\mathsf{W}^{\mathsf{rel}}x2) + (\mathsf{W}x3) + (\mathsf{R}^{\mathsf{acq}}x3) + (\mathsf{R}y0)$$

$$y := 1; x^{\mathsf{ra}} := 1 \parallel \mathsf{F}^{\mathsf{rel}}; \mathsf{INC}^{\mathsf{rlx},\mathsf{rlx}}(x); x := 3 \parallel r := x^{\mathsf{acq}}; s := y$$

$$(\mathsf{W}y1) + (\mathsf{W}^{\mathsf{rel}}x1) + (\mathsf{R}x1)^{\mathsf{rnw}}(\mathsf{W}x2) + (\mathsf{W}x3) + (\mathsf{R}^{\mathsf{acq}}x3) + (\mathsf{R}y0)$$

0:24 Anon.

#### REFERENCES

1128

1138

1139

1141

1153

1155

1157

1159

1161

1162

- Jade Alglave, Will Deacon, Richard Grisenthwaite, Antoine Hacquard, and Luc Maranget. 2020. Armed cats: Formal Concurrency Modelling at Arm. Draft., 49 pages.
- John Bender and Jens Palsberg. 2019. A formalization of Java's concurrent access modes. *Proc. ACM Program. Lang.* 3, OOPSLA (2019), 142:1–142:28. https://doi.org/10.1145/3360568
- Soham Chakraborty and Viktor Vafeiadis. 2017. Formalizing the concurrency semantics of an LLVM fragment. In *Proceedings of the 2017 International Symposium on Code Generation and Optimization, CGO 2017, Austin, TX, USA, February 4-8, 2017*, Vijay Janapa Reddi, Aaron Smith, and Lingjia Tang (Eds.). ACM, 100–110. http://dl.acm.org/citation.cfm?id=
- Soham Chakraborty and Viktor Vafeiadis. 2019. Grounding thin-air reads with event structures. *PACMPL* 3, POPL (2019), 70:1–70:28. https://doi.org/10.1145/3290383
  - William Ferreira, Matthew Hennessy, and Alan Jeffrey. 1996. A Theory of Weak Bisimulation for Core CML. In *Proceedings of the 1996 ACM SIGPLAN International Conference on Functional Programming, ICFP 1996, Philadelphia, Pennsylvania, USA, May 24-26, 1996*, Robert Harper and Richard L. Wexelblat (Eds.). ACM, 201–212. https://doi.org/10.1145/232627.232649
  - Shaked Flur, Kathryn E. Gray, Christopher Pulte, Susmit Sarkar, Ali Sezgin, Luc Maranget, Will Deacon, and Peter Sewell. 2016. Modelling the ARMv8 architecture, operationally: concurrency and ISA. In *Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20 22, 2016*, Rastislav Bodík and Rupak Majumdar (Eds.). ACM, 608–621. https://doi.org/10.1145/2837614.2837615
- Lisa Higham and Jalal Kawash. 2000. Memory Consistency and Process Coordination for SPARC Multiprocessors. In High
   Performance Computing HiPC 2000, 7th International Conference, Bangalore, India, December 17-20, 2000, Proceedings
   (Lecture Notes in Computer Science, Vol. 1970), Mateo Valero, Viktor K. Prasanna, and Sriram Vajapeyam (Eds.). Springer,
   355–366. https://doi.org/10.1007/3-540-44467-X\_32
- Alan Jeffrey, James Riely, Mark Batty, Simon Cooksey, Ilya Kaysin, and Anton Podkopaev. 2021. The Leaky Semicolon: Compositional Semantic Dependencies for Relaxed-Memory Concurrency. https://github.com/chicago-relaxedmemory/seqcomp.
- Jeehoon Kang. 2019. Reconciling Low-Level Features of C with Compiler Optimizations. Ph.D. Dissertation. Seoul National University, Seoul, South Korea. https://sf.snu.ac.kr/jeehoon.kang/thesis/
  - Ori Lahav and Viktor Vafeiadis. 2016. Explaining Relaxed Memory Models with Program Transformations. In FM 2016: Formal Methods 21st International Symposium, Limassol, Cyprus, November 9-11, 2016, Proceedings (Lecture Notes in Computer Science, Vol. 9995), John S. Fitzgerald, Constance L. Heitmeyer, Stefania Gnesi, and Anna Philippou (Eds.). Springer, 479–495. https://doi.org/10.1007/978-3-319-48989-6\_29
    - Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. 2017. Repairing sequential consistency in C/C++11. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017, Albert Cohen and Martin T. Vechev (Eds.). ACM, 618–632. https://doi.org/10.1145/3062341.3062352
    - Daniel Lustig, Sameer Sahasrabuddhe, and Olivier Giroux. 2019. A Formal Analysis of the NVIDIA PTX Memory Consistency Model. In *Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2019, Providence, RI, USA, April 13-17, 2019*, Iris Bahar, Maurice Herlihy, Emmett Witchel, and Alvin R. Lebeck (Eds.). ACM, 257–270. https://doi.org/10.1145/3297858.3304043
  - NVIDIA. 2020. Parallel Thread Execution ISA Version 7.1. https://docs.nvidia.com/cuda/parallel-thread-execution/index. html#memory-consistency-model.
- Anton Podkopaev, Ori Lahav, and Viktor Vafeiadis. 2019. Bridging the gap between programming languages and hardware weak memory models. *Proc. ACM Program. Lang.* 3, POPL (2019), 69:1–69:31. https://doi.org/10.1145/3290382
- Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. 2018. Simplifying ARM concurrency: multicopy-atomic axiomatic and operational models for ARMv8. *PACMPL* 2, POPL (2018), 19:1–19:29. https://doi.org/10.1145/3158107