## 1 MODEL

### 1.1 Preliminaries

The syntax is built from

- a set of *thread ids*  $\mathcal{T}$ , ranged over by  $\alpha$ ,  $\gamma$ ,
- a set of values  $\mathcal{V}$ , ranged over by v, w,  $\ell$ , k,
- a set of registers  $\mathcal{R}$ , ranged over by r, s,
- a set of *expressions*  $\mathcal{M}$ , ranged over by M, N, L.

*Memory references* are tagged values, written  $[\ell]$ . Let  $\mathcal{X}$  be the set of memory references, ranged over by x, y, z.

We require that

- values and registers are disjoint,
- values include at least the constants 0 and 1,
- expressions include at least registers and values,
- expressions do *not* include references: M[N/x] = M,
- there are registers  $S_{\mathcal{E}} = \{s_e \mid e \in \mathcal{E}\}$ ,
- registers  $S_{\mathcal{E}}$  do not appear in programs:  $S[N/s_e] = S$ .

As an alternative to the last assumption, we sometimes assume that each register is assigned at most once.<sup>1</sup> We model the following language.

$$\begin{split} \sigma &::= \mathsf{cta} \mid \mathsf{gpu} \mid \mathsf{sys} \\ \mu &::= \mathsf{wk} \mid \mathsf{rlx} \mid \mathsf{ra} \mid \mathsf{sc} \\ \nu &::= \mathsf{rel} \mid \mathsf{acq} \mid \mathsf{fsc} \\ S &::= \mathsf{skip} \mid r := M \mid r := [L]^{\mu}_{\sigma} \mid [L]^{\mu}_{\sigma} := M \mid \mathsf{F}^{\nu}_{\sigma} \\ &\mid S_{1} \mathbin{{}_{\gamma}{\parallel\!\!\to}} S_{2} \mid S_{1}; S_{2} \mid \mathsf{if}(M)\{S_{1}\} \, \mathsf{else}\,\{S_{2}\} \end{split}$$

*Scopes*,  $\sigma$ , are thread group (cta), processor (gpu) and system (sys).

Access modes,  $\mu$ , are weak (wk), relaxed (rlx), release-acquire (ra), and sequentially consistent (sc). Relaxed mode is the default; we regularly elide it from examples. ra/sc accesses are collectively known as *synchronized accesses*.

Fence modes,  $\nu$ , are release (rel), acquire (acq), and sequentially consistent (fsc).

Commands, aka statements, S, include memory accesses at a given mode, as well as the usual structural constructs. Following [Ferreira et al. 1996],  $\mathbin{{}_{\gamma}{\parallel\!\!\to}}$  denotes parallel composition. If  $(S_1 \mathbin{{}_{\gamma}{\parallel\!\!\to}} S_2)$  is executed with id  $\alpha$ , then  $S_1$  runs with id  $\gamma$  and  $S_2$  continues under id  $\alpha$ . Top-level programs run with thread id 0. In examples, we usually drop thread ids. We use the symmetric  $\parallel$  operator when there is no continuation after the parallel composition.

The semantics is built from the following.

- a set of *events*  $\mathcal{E}$ , ranged over by e, d, c, b,
- a set of actions  $\mathcal{A}$ , ranged over by a,
- a set of *logical formulae*  $\Phi$ , ranged over by  $\phi$ ,  $\psi$ ,  $\theta$ .

Subsets of  $\mathcal{E}$  are ranged over by E, D, C, B.

We require that:

- actions include writes  $(\alpha W_{\sigma}^{\mu} x v)$ , reads  $(\alpha R_{\sigma}^{\mu} x v)$ , and fences  $(F_{\sigma}^{\nu})$ ,
- formulae include equalities (M=N) and (x=M),
- formulae include the write symbol W, and the downgrade symbols  $\downarrow^x$ ,
- formulae are closed under negation, conjunction, disjunction, and substitutions [M/r], [M/x], and  $[\phi/s]$  for each symbol s,
- there is an entailment relation  $\models$  between formulae,
- $\models$  has the expected semantics for =,  $\neg$ ,  $\land$ ,  $\lor$ ,  $\Rightarrow$  and substitution.

<sup>1</sup>We make this assumption when discussing any semantics of load  $(r := [L]_{\sigma}^{\mu})$  that does not include the substitution  $[s_e/r]$ .

Logical formulae include equations over registers, such as (r=s+1). For LIR, we also include equations over memory references, such as (x=1). Formulae are subject to substitutions; actions are not. We use expressions as formulae, coercing M to  $M\neq 0$ . Equations have precedence over logical operators; thus  $r=v \Rightarrow s>w$  is read  $(r=v) \Rightarrow (s>w)$ . As usual, implication associates to the right; thus  $\phi \Rightarrow \psi \Rightarrow \theta$  is read  $\phi \Rightarrow (\psi \Rightarrow \theta)$ .

We say  $\phi$  implies  $\psi$  if  $\phi \models \psi$ . We say  $\phi$  is a tautology if  $\mathsf{tt} \models \phi$ . We say  $\phi$  is unsatisfiable if  $\phi \models \mathsf{ff}$ .

### 1.2 Label Relations

We combine access and fence modes into a single order:

$$\mathsf{wk} \to \mathsf{rlx} \to \mathsf{ra} \to \mathsf{sc} \qquad \mathsf{acq} \to \mathsf{fsc} \qquad \mathsf{rel} \to \mathsf{fsc}$$

We write  $\mu \sqsubseteq \nu$  for this order. Let  $\mu \sqcup \nu$  denote the least upper bound of  $\mu$  and  $\nu$ .

*Definition 1.1.* Define  $\prec : \mathcal{A} \times \mathcal{A} \to 2^{\mathcal{A}}$  as follows. If  $a_0 \in a_1 \prec a_2$ , then  $a_1$  and  $a_2$  can coalesce, resulting in  $a_0$ . This allows optimizations such as rewriting (x := 1; x := 2) to (x := 2), and (x := 1; x := x) to (x := 1; x := 1).

$$R^{\mu}xv \prec R^{\nu}xv = \{R^{\mu \sqcup \nu}xv\}$$

$$W^{\mu}xv \prec W^{\nu}xw = \{W^{\mu \sqcup \nu}xw\}$$

$$W^{\nu}xv \prec R^{rlx}xv = \{W^{\nu}xv\}$$

$$W^{\nu}xv \prec R^{\mu \neq rlx}xv = \{W^{sc}xv\}$$

$$F^{\mu} \prec F^{\nu} = \{F^{\mu \sqcup \nu}\}$$

$$a \prec b = \emptyset, \text{ otherwise}$$
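
The coalescing cases can be exercised with a small executable sketch. The encoding below (`Action`, `COVERS`, `join`, `coalesce`) is hypothetical and ours, not part of the model; it assumes the mode order is exactly the one drawn in Section 1.2, with acq ⊑ fsc and rel ⊑ fsc as the only order among fence modes.

```python
from typing import NamedTuple, Optional, Set

# Hypothetical encoding of the mode order from Section 1.2 (covering pairs only):
# wk ⊑ rlx ⊑ ra ⊑ sc, acq ⊑ fsc, rel ⊑ fsc.
COVERS = {("wk", "rlx"), ("rlx", "ra"), ("ra", "sc"), ("acq", "fsc"), ("rel", "fsc")}
MODES = {m for pair in COVERS for m in pair}


def leq(a: str, b: str) -> bool:
    """Reflexive-transitive closure of COVERS, i.e. the order ⊑ on modes."""
    return a == b or any(leq(c, b) for (x, c) in COVERS if x == a)


def join(a: str, b: str) -> str:
    """Least upper bound a ⊔ b; raises if the two modes have none."""
    ubs = [m for m in MODES if leq(a, m) and leq(b, m)]
    least = [m for m in ubs if all(leq(m, u) for u in ubs)]
    if not least:
        raise ValueError(f"no least upper bound for {a} and {b}")
    return least[0]


class Action(NamedTuple):
    kind: str                  # "R", "W" or "F"
    mode: str                  # access mode or fence mode
    loc: Optional[str] = None  # location x (None for fences)
    val: Optional[int] = None  # value v (None for fences)


def coalesce(a1: Action, a2: Action) -> Set[Action]:
    """a1 ≺ a2 from Definition 1.1: the set of actions a1 and a2 may merge into."""
    same_loc = a1.loc == a2.loc
    if a1.kind == a2.kind == "R" and same_loc and a1.val == a2.val:
        return {Action("R", join(a1.mode, a2.mode), a1.loc, a1.val)}
    if a1.kind == a2.kind == "W" and same_loc:
        # The later write's value survives: (x := 1; x := 2) becomes (x := 2).
        return {Action("W", join(a1.mode, a2.mode), a1.loc, a2.val)}
    if a1.kind == "W" and a2.kind == "R" and same_loc and a1.val == a2.val:
        # A read of the value just written is absorbed into the write.
        return {a1} if a2.mode == "rlx" else {Action("W", "sc", a1.loc, a1.val)}
    if a1.kind == a2.kind == "F":
        return {Action("F", join(a1.mode, a2.mode))}
    return set()
```

An empty result means the two actions cannot coalesce. The definition only ever joins access modes with access modes and fence modes with fence modes, which is why `join` may simply raise when no least upper bound exists.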

Definition 1.2. Reorderability relations.

$$\begin{split} & \ltimes_{\mathsf{co}} = \{ (\mathsf{W} x, \mathsf{W} y), \ (\mathsf{R} x, \mathsf{W} y), \ (\mathsf{W} x, \mathsf{R} y) \mid x \neq y \} \ \cup \ \{ (\mathsf{R} x, \mathsf{R} y) \} \\ & \ltimes_{\mathsf{sync}} = \{ (\mathsf{W}^{\mu}, \mathsf{R}^{\nu}) \mid \mu \neq \mathsf{sc} \lor \nu \neq \mathsf{sc} \} \cup \{ (\mathsf{W}^{\mu}, \mathsf{W}^{\mathsf{rlx}}) \} \cup \{ (\mathsf{F}^{\mathsf{rel}}, \mathsf{R}^{\nu}) \} \\ & \cup \{ (\mathsf{R}^{\mathsf{rlx}}, \mathsf{W}^{\mathsf{rlx}}) \} \cup \{ (\mathsf{R}^{\mathsf{rlx}}, \mathsf{R}^{\nu}) \} \cup \{ (\mathsf{W}^{\mu}, \mathsf{F}^{\mathsf{acq}}) \} \cup \{ (\mathsf{F}^{\mathsf{rel}}, \mathsf{F}^{\mathsf{acq}}) \} \\ & \ltimes = \ltimes_{\mathsf{sync}} \cap \ltimes_{\mathsf{co}} \end{split}$$

Tabular version of  $\ltimes_{\mathsf{sync}}$  (rows give the first action, columns the second; ✓ marks pairs that need not be ordered):

| 1st / 2nd   | $R^{rlx}$ | $R^{ra}$ | $R^{sc}$ | $W^{rlx}$ | $W^{ra}$ | $W^{sc}$ | $F^{rel}$ | $F^{acq}$ | $F^{fsc}$ |
|-------------|-----------|----------|----------|-----------|----------|----------|-----------|-----------|-----------|
| $R^{rlx}$   | ✓         | ✓        | ✓        | ✓         | X        | X        | X         | X         | X         |
| $R^{ra}$    | X         | X        | X        | X         | X        | X        | X         | X         | X         |
| $R^{sc}$    | X         | X        | X        | X         | X        | X        | X         | X         | X         |
| $W^{rlx}$   | ✓         | ✓        | ✓        | ✓         | X        | X        | X         | ✓         | X         |
| $W^{ra}$    | ✓         | ✓        | ✓        | ✓         | X        | X        | X         | ✓         | X         |
| $W^{sc}$    | ✓         | ✓        | X        | ✓         | X        | X        | X         | ✓         | X         |
| $F^{rel}$   | ✓         | ✓        | ✓        | X         | X        | X        | X         | ✓         | X         |
| $F^{acq}$   | X         | X        | X        | X         | X        | X        | X         | X         | X         |
| $F^{fsc}$   | X         | X        | X        | X         | X        | X        | X         | X         | X         |
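
The table can be regenerated from Definition 1.2 as a cross-check. In this hypothetical sketch a label is just a (kind, mode) pair; `sync_reorderable` transcribes the seven clauses of $\ltimes_{\mathsf{sync}}$ and `co_reorderable` the location test of $\ltimes_{\mathsf{co}}$. Fences fall outside `co_reorderable` because the definition of $\ltimes_{\mathsf{co}}$ only mentions reads and writes.

```python
from typing import Tuple

Label = Tuple[str, str]  # hypothetical encoding: (kind, mode), e.g. ("W", "ra")


def sync_reorderable(a: Label, b: Label) -> bool:
    """(a, b) ∈ ⋉_sync: a followed by b need not be ordered by ≤."""
    (k1, m1), (k2, m2) = a, b
    return (
        (k1 == "W" and k2 == "R" and (m1 != "sc" or m2 != "sc"))
        or (k1 == "W" and k2 == "W" and m2 == "rlx")
        or (k1 == "F" and m1 == "rel" and k2 == "R")
        or (k1 == "R" and m1 == "rlx" and k2 == "W" and m2 == "rlx")
        or (k1 == "R" and m1 == "rlx" and k2 == "R")
        or (k1 == "W" and k2 == "F" and m2 == "acq")
        or (k1 == "F" and m1 == "rel" and k2 == "F" and m2 == "acq")
    )


def co_reorderable(k1: str, x: str, k2: str, y: str) -> bool:
    """(k1 x, k2 y) ∈ ⋉_co: two reads always qualify; otherwise locations must differ."""
    if k1 == "R" and k2 == "R":
        return True
    return k1 in ("R", "W") and k2 in ("R", "W") and x != y


COLUMNS = [("R", "rlx"), ("R", "ra"), ("R", "sc"), ("W", "rlx"), ("W", "ra"),
           ("W", "sc"), ("F", "rel"), ("F", "acq"), ("F", "fsc")]


def table_row(first: Label) -> str:
    """One row of the table, in the column order used above."""
    return " ".join("✓" if sync_reorderable(first, col) else "X" for col in COLUMNS)
```

For example, `table_row(("W", "sc"))` reproduces the $W^{sc}$ row, including the single X against $R^{sc}$ among the reads.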

Proc. ACM Program. Lang., Vol. 1, No. 1, Article . Publication date: March 2021.

Action (Wxv) matches (Rxw) when v = w. Action (Wxv) blocks (Rxw), for any v, w. Actions (W $^{\mu\neq rlx}$ ) and (F $^{\nu\neq acq}$ ) are release actions.
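
The three notions can be written as predicates. This is a minimal sketch under our own hypothetical encoding of actions (`Act` is not part of the model). Read literally, "$W^{\mu\neq rlx}$" also makes a weak (wk) write a release; the code follows the text literally on that point.

```python
from typing import NamedTuple, Optional


class Act(NamedTuple):
    kind: str                  # "R", "W" or "F"
    mode: str                  # access or fence mode
    loc: Optional[str] = None  # location (None for fences)
    val: Optional[int] = None  # value (None for fences)


def matches(w: Act, r: Act) -> bool:
    """(W x v) matches (R x w) when v = w."""
    return w.kind == "W" and r.kind == "R" and w.loc == r.loc and w.val == r.val


def blocks(w: Act, r: Act) -> bool:
    """(W x v) blocks (R x w) for any v, w: only the kinds and the location matter."""
    return w.kind == "W" and r.kind == "R" and w.loc == r.loc


def is_release(a: Act) -> bool:
    """W^mu with mu != rlx, or F^nu with nu != acq."""
    return (a.kind == "W" and a.mode != "rlx") or (a.kind == "F" and a.mode != "acq")
```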

### 1.3 Pomsets with Predicate Transformers

*Definition 1.3.* A predicate transformer is a function  $\tau:\Phi\to\Phi$  such that

- (1)  $\tau(ff)$  is ff,
- (2)  $\tau(\psi_1 \wedge \psi_2)$  is  $\tau(\psi_1) \wedge \tau(\psi_2)$ ,
- (3)  $\tau(\psi_1 \vee \psi_2)$  is  $\tau(\psi_1) \vee \tau(\psi_2)$ ,
- (4) if  $\phi$  implies  $\psi$ , then  $\tau(\phi)$  implies  $\tau(\psi)$ .
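
To make Definition 1.3 concrete, here is a finite sketch. As a simplification of our own (not part of the model), formulae range over just two registers r and s with values in {0, 1}, each formula is identified with the set of states satisfying it, and "implies" is read as set inclusion. Under that reading a substitution $\psi[M/r]$ is a preimage, so it preserves ff, $\wedge$ and $\vee$.

```python
from itertools import chain, combinations, product

# Hypothetical finite model: a state is a pair (r, s) with values in {0, 1};
# a formula is the frozenset of states that satisfy it.
STATES = list(product((0, 1), repeat=2))
FF, TT = frozenset(), frozenset(STATES)


def subst_r(expr):
    """The transformer psi |-> psi[M/r], with M given as a function of the state."""
    return lambda psi: frozenset(st for st in STATES if (expr(st), st[1]) in psi)


def all_formulae():
    subsets = chain.from_iterable(combinations(STATES, k) for k in range(len(STATES) + 1))
    return [frozenset(c) for c in subsets]


def is_predicate_transformer(tau) -> bool:
    """Check conditions (1)-(4) of Definition 1.3 over every pair of formulae."""
    if tau(FF) != FF:
        return False
    for p, q in product(all_formulae(), repeat=2):
        if tau(p & q) != tau(p) & tau(q) or tau(p | q) != tau(p) | tau(q):
            return False
        if p <= q and not tau(p) <= tau(q):
            return False
    return True
```

For instance, `subst_r(lambda st: st[1])` models LET(r, s) and maps the formula r = 1 to s = 1, while the constant transformer `lambda psi: TT` fails condition (1).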

Definition 1.4. A family of predicate transformers for E consists of a predicate transformer  $\tau^D$  for each  $D \subseteq \mathcal{E}$ , such that if  $C \cap E \subseteq D$  then  $\tau^C(\psi)$  implies  $\tau^D(\psi)$ .

*Definition 1.5.* A *pomset with predicate transformers* is a tuple  $(E, \lambda, \kappa, \tau, \checkmark, \trianglelefteq, \leq, \sqsubseteq, \mathsf{rf}, \mathsf{rmw})$  where

- (1)  $E \subset \mathcal{E}$  is a set of events,
- (2)  $\lambda : E \to \mathcal{A}$  defines a *label* for each event,
- (3)  $\kappa : E \to \Phi$  defines a *precondition* for each event,
- (4)  $\tau: 2^{\mathcal{E}} \to \Phi \to \Phi$  defines a predicate transformer for each set of events,
- (5)  $\checkmark \in \Phi$  defines a termination condition,
- (6)  ${\trianglelefteq} \subseteq (E \times E)$  is a partial order capturing dependency,
- (7)  ${\leq} \subseteq (E \times E)$  is a partial order capturing synchronization,
- (8)  ${\sqsubseteq} \subseteq (E \times E)$  is a partial order capturing *per-location order*,
- (9)  $\mathsf{rf} \subseteq (E \times E)$  is a relation capturing *reads-from*,
- (10)  $\mathsf{rmw} \subseteq (E \times E)$  is a relation capturing *read-modify-write atomicity*,

subject to the following constraints:

- (8a) if  $\lambda(d)$  and  $\lambda(e)$  access the same location then  $d \leq e$  implies  $d \sqsubseteq e$ ,
- (9a) if  $(d, e) \in \mathsf{rf}$  and  $(c, e) \in \mathsf{rf}$  then  $d = c$ ,
- (9b) if  $(d, e) \in \mathsf{rf}$  then  $\lambda(d)$  matches  $\lambda(e)$ ,
- (9c) if  $(d, e) \in \mathsf{rf}$  and  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \mathrel{\dot\sqsubseteq} d$  or  $e \mathrel{\dot\sqsubseteq} c$ ,
- (9d) if  $(d, e) \in \mathsf{rf}$  then  $d \trianglelefteq e$  and  $d \sqsubseteq e$ ,
- (9e) if  $(d, e) \in \mathsf{rf}$  and  $\lambda(d)$  is morally strong with  $\lambda(e)$  then  $d \leq e$ ,
- (10a) if  $(d, e) \in \mathsf{rmw}$  then  $d \leq e$  and  $d \sqsubseteq e$ ,
- (10b) if  $\lambda(c)$  and  $\lambda(d)$  access the same location then
  - if  $d \xrightarrow{\mathsf{rmw}} e$  then  $c \trianglelefteq e$  implies  $c \trianglelefteq d$ ,  $c \leq e$  implies  $c \leq d$ , and  $c \sqsubseteq e$  implies  $c \sqsubseteq d$ ,
  - if  $d \xrightarrow{\mathsf{rmw}} e$  then  $d \trianglelefteq c$  implies  $e \trianglelefteq c$ ,  $d \leq c$  implies  $e \leq c$ , and  $d \sqsubseteq c$  implies  $e \sqsubseteq c$ ,

where  $d \mathrel{\dot\sqsubseteq} e$  is  $d \sqsubseteq e$  if  $\lambda(d)$  is morally strong with  $\lambda(e)$ ; otherwise  $d \mathrel{\dot\sqsubseteq} e$  is  $e \not\sqsubset d$ .

*Definition 1.6.* A pomset is *top-level* if for every  $e \in E$ :

- (1)  $\kappa(e)$  is a tautology,
- (2) if  $\lambda(e) = (R)$  then  $e \in \text{codom}(rf)$ .

Let P range over pomsets, and  $\mathcal P$  over sets of pomsets. Let Pom be the set of all pomsets.

We lift terminology from actions to events. For example, we say that e writes x if  $\lambda(e)$  writes x. We also drop quantifiers when clear from context, such as  $(\forall e \in E)(\forall x \in X)$ . We write d < e when  $d \le e$  and  $d \ne e$ , and similarly for  $\triangleleft$  and  $\square$ .

Definition 1.7.  $\mathcal{P}_1$  refines  $\mathcal{P}_2$  if  $\mathcal{P}_1 \subseteq \mathcal{P}_2$ .

### 1.4 Semantics

*Definition 1.8.* If  $P \in \mathcal{P}_1 \mathbin{\parallel\!\!\to} \mathcal{P}_2$  then  $(\exists P_1 \in \mathcal{P}_1)\ (\exists P_2 \in \mathcal{P}_2)$

- (1)  $E = (E_1 \cup E_2)$ ,  ${\trianglelefteq} \supseteq ({\trianglelefteq_1} \cup {\trianglelefteq_2})$ ,  ${\leq} \supseteq ({\leq_1} \cup {\leq_2})$ ,  ${\sqsubseteq} \supseteq ({\sqsubseteq_1} \cup {\sqsubseteq_2})$ ,  $\mathsf{rf} \supseteq (\mathsf{rf}_1 \cup \mathsf{rf}_2)$ ,  $\mathsf{rmw} = (\mathsf{rmw}_1 \cup \mathsf{rmw}_2)$ ,
- (2)  $\lambda = (\lambda_1 \cup \lambda_2)$ ,
- (3) if  $e \in E_1$  then  $\kappa(e)$  implies  $\kappa_1(e)$ ,
- (4) if  $e \in E_2$  then  $\kappa(e)$  implies  $\kappa_2(e)$ ,
- (5)  $\tau^D(\psi)$  implies  $\tau_2^D(\psi)$ ,
- (6)  $E_1$  and  $E_2$  are disjoint,
- (7)  $\checkmark$  implies  $\checkmark_1 \land \checkmark_2$ .

If  $P \in \mathcal{P}_1; \mathcal{P}_2$  then  $(\exists P_1 \in \mathcal{P}_1)\ (\exists P_2 \in \mathcal{P}_2)$

- (1) as for parallel composition,
- (2) if  $d \in E_1$  and  $e \in E_2$  then either  $d \leq e$  or  $\lambda_1(d) \ltimes_{\mathsf{sync}} \lambda_2(e)$ ,
- (3) if  $d \in E_1$  and  $e \in E_2$  then either  $d \sqsubseteq e$  or  $\lambda_1(d) \ltimes_{\mathsf{co}} \lambda_2(e)$ ,
- (4) if  $e \in E_1 \setminus E_2$  then  $\lambda(e) = \lambda_1(e)$ ,
- (5) if  $e \in E_2 \setminus E_1$  then  $\lambda(e) = \lambda_2(e)$ ,
- (6) if  $e \in E_1 \cap E_2$  then  $\lambda(e) \in \lambda_1(e) \prec \lambda_2(e)$ ,
- (7) if  $e \in E_2$  and  $\lambda(e)$  is a release then  $\kappa(e)$  implies  $\checkmark_1$ ,
- (8) if  $e \in E_1 \setminus E_2$  then  $\kappa(e)$  implies  $\kappa_1(e)$ ,
- (9) if  $e \in E_2 \setminus E_1$  then  $\kappa(e)$  implies  $\kappa'_2(e)$ ,
- (10) if  $e \in E_1 \cap E_2$  then  $\kappa(e)$  implies  $\kappa_1(e) \lor \kappa'_2(e)$ ,
- (11)  $\tau^D(\psi)$  implies  $\tau_1^D(\tau_2^D(\psi))$ ,
- (12)  $\checkmark$  implies  $\checkmark_1 \land \checkmark_2$ ,

where  $\kappa'_2(e) = \tau_1^{\downarrow e}(\kappa_2(e))$ , with  $\downarrow e = \{c \mid c < e\}$  if  $\lambda(e)$  is a write, and  $\downarrow e = E_1$  otherwise.

If  $P \in IF(\phi, \mathcal{P}_1, \mathcal{P}_2)$  then  $(\exists P_1 \in \mathcal{P}_1)\ (\exists P_2 \in \mathcal{P}_2)$

- (1-2) as for parallel composition,
- (3) if  $e \in E_1 \setminus E_2$  then  $\kappa(e)$  implies  $\phi \wedge \kappa_1(e)$ ,
- (4) if  $e \in E_2 \setminus E_1$  then  $\kappa(e)$  implies  $\neg \phi \wedge \kappa_2(e)$ ,
- (5) if  $e \in E_1 \cap E_2$  then  $\kappa(e)$  implies  $(\phi \Rightarrow \kappa_1(e)) \wedge (\neg \phi \Rightarrow \kappa_2(e))$ ,
- (6)  $\tau^D(\psi)$  implies  $(\phi \Rightarrow \tau_1^D(\psi)) \land (\neg \phi \Rightarrow \tau_2^D(\psi))$ ,
- (7)  $\checkmark$  implies  $(\phi \Rightarrow \checkmark_1) \land (\neg \phi \Rightarrow \checkmark_2)$ .

If  $P \in LET(r, M)$  then  $E = \emptyset$  and  $\tau^D(\psi)$  implies  $\psi[M/r]$ .

If  $P \in SKIP$  then  $E = \emptyset$  and  $\tau^D(\psi)$  implies  $\psi$ .

If  $P \in FENCE(\mu, \sigma)_{\alpha}$  then

- (1) if  $d, e \in E$  then  $d = e$ ,
- (2)  $\lambda(e) = \mathsf{F}^{\mu}_{\sigma}$ ,
- (3)  $\tau^D(\psi)$  implies  $\psi$ ,
- (4)  $\checkmark$  implies  $E \neq \emptyset$ .

If  $P \in STORE(x, M, \mu, \sigma)_{\alpha}$  then  $(\exists v \in \mathcal{V})$

- (1) if  $d, e \in E$  then  $d = e$ ,
- (2)  $\lambda(e) = \alpha W_{\sigma}^{\mu} x v$ ,
- (3)  $\kappa(e)$  implies  $M=v$ ,
- (4)  $\tau^D(\psi)$  implies  $\psi$ ,
- (5)  $\checkmark$  implies  $M=v$ .

If  $P \in LOAD(r, x, \mu, \sigma)_{\alpha}$  then  $(\exists v \in \mathcal{V})$

- (1) if  $d, e \in E$  then  $d = e$ ,
- (2)  $\lambda(e) = \alpha R_{\sigma}^{\mu} x v$ ,
- (3)  $\tau^D(\psi)$  implies  $v=r \Rightarrow \psi$ , if  $(E \cap D) \neq \emptyset$ ,
- (4)  $\tau^D(\psi)$  implies  $\psi$ , if  $(E \cap D) = \emptyset$ .
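
To see how the transformers thread through sequential composition, consider `r := y; s := r` where the load reads 1. The sketch below uses a hypothetical finite model (two registers over {0, 1}, formulae as sets of states), takes the simple LOAD rule rather than the full one, and reads each "implies" in Definition 1.8 as an equality. It shows that $\tau^D(s = 1)$ is a tautology when the read event is in D, and is $r = 1$ otherwise.

```python
from itertools import product

# Hypothetical finite model: a state is a pair (r, s) of register values in {0, 1};
# a formula is the frozenset of states satisfying it.
R, S = 0, 1
STATES = frozenset(product((0, 1), repeat=2))


def let(reg, expr):
    """LET(reg, M): tau^D(psi) = psi[M/reg], for every D."""
    def tau(D, psi):
        def updated(st):
            new = list(st)
            new[reg] = expr(st)
            return tuple(new)
        return frozenset(st for st in STATES if updated(st) in psi)
    return tau


def load(event, reg, v):
    """Simple LOAD rule: (v = reg => psi) if the read event is in D, else psi."""
    def tau(D, psi):
        if event not in D:
            return psi
        return frozenset(st for st in STATES if st[reg] != v or st in psi)
    return tau


def seq(tau1, tau2):
    """Item (11) read as an equality: tau^D(psi) = tau1^D(tau2^D(psi))."""
    return lambda D, psi: tau1(D, tau2(D, psi))


# r := y; s := r, with the read event "e" returning 1 (names are illustrative).
prog = seq(load("e", R, 1), let(S, lambda st: st[R]))
S_IS_1 = frozenset(st for st in STATES if st[S] == 1)
```

In the full semantics the load would also substitute the fresh register $s_e$ for r; the simple rule is enough to show the dependency on D.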

$$\begin{aligned} \llbracket x^{\mu}_{\sigma} := M \rrbracket_{\alpha} &= STORE(x, M, \mu, \sigma)_{\alpha} & \llbracket \mathsf{skip} \rrbracket_{\alpha} &= SKIP \\ \llbracket r := x^{\mu}_{\sigma} \rrbracket_{\alpha} &= LOAD(r, x, \mu, \sigma)_{\alpha} & \llbracket S_{1} \mathbin{{}_{\gamma}{\parallel\!\!\to}} S_{2} \rrbracket_{\alpha} &= \llbracket S_{1} \rrbracket_{\gamma} \mathbin{\parallel\!\!\to} \llbracket S_{2} \rrbracket_{\alpha} \\ \llbracket r := M \rrbracket_{\alpha} &= LET(r, M) & \llbracket S_{1}; S_{2} \rrbracket_{\alpha} &= \llbracket S_{1} \rrbracket_{\alpha}; \llbracket S_{2} \rrbracket_{\alpha} \\ \llbracket \mathsf{F}_{\sigma}^{\mu} \rrbracket_{\alpha} &= FENCE(\mu, \sigma)_{\alpha} & \llbracket \mathsf{if}(M)\{S_{1}\}\,\mathsf{else}\,\{S_{2}\} \rrbracket_{\alpha} &= IF(M, \llbracket S_{1} \rrbracket_{\alpha}, \llbracket S_{2} \rrbracket_{\alpha}) \end{aligned}$$

Full versions (everything but address calculation):

If  $P \in STORE(x, M, \mu, \sigma)_{\alpha}$  then  $(\exists v : E \to V) (\exists \theta : E \to \Phi)$ 

- (1) if  $\theta_d \wedge \theta_e$  is satisfiable then d = e,
- (2)  $\lambda(e) = \alpha W_{\sigma}^{\mu} x v_e$ ,
- (3)  $\kappa(e)$  implies  $\theta_e \wedge M = v_e$ ,
- (4)  $\tau^D(\psi)$  implies  $\theta_e \Rightarrow \psi[M/x]$ ,
- (5)  $\checkmark$  implies  $\bigvee_{e \in E} \theta_e$ .

If  $P \in LOAD(r, x, \mu, \sigma)_{\alpha}$  then  $(\exists v : E \to V) (\exists \theta : E \to \Phi)$ 

- (1) if  $\theta_d \wedge \theta_e$  is satisfiable then d = e,
- (2)  $\lambda(e) = \alpha R_{\sigma}^{\mu} x v_e$ ,
- (3)  $\kappa(e)$  implies  $\theta_e$ ,
- (4)  $(\forall e \in E \cap D) \tau^D(\psi)$  implies  $\theta_e \Rightarrow v_e = s_e \Rightarrow \psi[s_e/r]$ ,
- (5)  $(\forall e \in E \setminus D) \ \tau^D(\psi) \text{ implies } \theta_e \Rightarrow (v_e = s_e \lor x = s_e) \Rightarrow \psi[s_e/r],$
- (6)  $(\forall s) \tau^D(\psi)$  implies  $(\bigwedge_{e \in E} \neg \theta_e) \Rightarrow \psi[s/r]$ .

### 1.5 Fulfillment

*Definition 1.9.* Define  $\mathrel{\dot\sqsubseteq}$  as follows.

$$d \mathrel{\dot\sqsubseteq} e \quad \text{when} \quad \begin{cases} d \sqsubseteq e & \text{if } d \text{ is morally strong with } e, \\ e \not\sqsubset d & \text{otherwise.} \end{cases}$$

A read event *e* is *strongly fulfilled* if there is a  $d \stackrel{\mathsf{rf}}{\longrightarrow} e$  such that, for any c that can block e, either  $c \sqsubseteq d$  or  $e \sqsubseteq c$ .

A read event *e* is *weakly fulfilled* if there is a  $d \stackrel{\mathsf{rf}}{\longrightarrow} e$  such that, for any c that can block e, either  $c \mathrel{\dot\sqsubseteq} d$  or  $e \mathrel{\dot\sqsubseteq} c$ .

If all accesses are morally strong with each other,  $\dot\sqsubseteq$  coincides with  $\sqsubseteq$ , so weak fulfillment degenerates to strong fulfillment:

$$\forall c.\ \lambda(c) = (\mathsf{W} x) \Rightarrow (c \sqsubseteq d \lor e \sqsubseteq c)$$

If no accesses are morally strong with each other, weak fulfillment degenerates to

$$\nexists c.\ \lambda(c) = (\mathsf{W} x) \land d \sqsubset c \land c \sqsubset e$$

Note that strong and weak fulfillment differ only in whether  $\sqsubseteq$  or  $\dot\sqsubseteq$  is used. We sometimes write  $\sqsubseteq$  to indicate strong fulfillment and  $\dot\sqsubseteq$  to indicate weak fulfillment.
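
Both notions can be checked mechanically on small examples. In this hypothetical sketch, `order` is the strict part of $\sqsubseteq$, given as a transitively closed set of pairs; "morally strong" is passed in as a predicate because this section does not define it; and `blockers` lists the writes to the read's location. The example has writes w0 (Wx0) and w1 (Wx1) and a read rd (Rx0) that reads from w0, with w1 unordered relative to rd.

```python
def sq(order, a, b):
    """a ⊑ b, where `order` is the strict part of ⊑ (transitively closed)."""
    return a == b or (a, b) in order


def sq_dot(order, strong, a, b):
    """a ⊑̇ b from Definition 1.9: a ⊑ b if morally strong, otherwise not (b ⊏ a)."""
    return sq(order, a, b) if strong(a, b) else (b, a) not in order


def strongly_fulfilled(order, d, e, blockers):
    """Every blocking write c satisfies c ⊑ d or e ⊑ c."""
    return all(sq(order, c, d) or sq(order, e, c) for c in blockers)


def weakly_fulfilled(order, strong, d, e, blockers):
    """Every blocking write c satisfies c ⊑̇ d or e ⊑̇ c."""
    return all(sq_dot(order, strong, c, d) or sq_dot(order, strong, e, c)
               for c in blockers)


def never_strong(a, b):
    return False


def always_strong(a, b):
    return True


ORDER = {("w0", "w1"), ("w0", "rd")}  # w0 ⊏ w1 and w0 ⊏ rd; w1 and rd unordered
BLOCKERS = ["w0", "w1"]
```

Here the read is weakly but not strongly fulfilled when nothing is morally strong, and adding w1 ⊏ rd rules it out even weakly, matching the degenerate form above.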

In diagrams, we use different shapes and colors for arrows and events. These are included only to help the reader understand why order is included. We adopt the following conventions:

- one arrow style marks order arising from reads-from (rf),
- another marks order arising from fulfillment,
- another marks order arising from control/data/address dependency,
- a double arrow ( $\implies$ ) marks order arising from synchronized access.

## 2 NOTES

GPU stuff:

- Vulkan/Alloy
- OpenCL
- AMD
- PTX
- Matthew Sinclair/Sarita Adve: "Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems", and Sinclair's thesis

## 3 ANTON'S RECENT EXAMPLES RELATING IMM AND PTX

It looks like we cannot prove compilation correctness from IMM to PTX. (In this email I assume that all threads are in the same CTA, so any relation is a morally strong one if it is applicable.) The problem is in the LB-data-rel example:

$$r := x ; y := r \parallel s := y ; x^{ra} := 1$$

$$Rx1 \xrightarrow{\text{data}} Wy1 \xrightarrow{\text{rfe}} Ry1 \xrightarrow{\text{bob}} W^{ra}x1$$

IMM forbids it, but PTX allows it. The point is that IMM mixes dependencies and release/acquire-induced po-order in its NoOOTA axiom, whereas PTX doesn't — release/acquire are only used to have coherence.

The problem is related to the one we have already discussed in the context of the C++ model – if you don't have acquire reads in the program, then you can erase release annotations from writes. In this regard, PTX is closer to PL memory models than to hardware ones.

AFAIU for the same reason we won't be able to show compilation correctness from the Pomset model to PTX even directly, if the Pomset model mixes release/acquire induced order with dependencies in the same causality relation.

Another oddity: PTX includes the bob edge below; IMM does not.

$$x^{ra} := 1 \parallel r := x ; x := 1 ; s := x^{ra}$$

$$Rx1 \rightarrow Wx1 \stackrel{\text{rfi}}{\longrightarrow} R^{ra}x1$$
bob

## 4 THIN AIR

Need  $\leq$  to prevent thin air on rlx:

$$y := x \parallel x := y$$

$$Rx1 \longrightarrow Wy1 \longrightarrow Ry1 \longrightarrow Wx1 \qquad (\leq)$$
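
Because the order must be a partial order, this execution has no pomset once both reads-from edges and both data dependencies are required to be in it: the four edges form a cycle. A hypothetical helper (event names are ours) makes the check explicit.

```python
from collections import defaultdict


def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as (src, dst) pairs."""
    succ = defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
    state = {}  # absent = unvisited, "active" = on the DFS stack, "done" = finished

    def visit(node):
        state[node] = "active"
        for nxt in succ[node]:
            if state.get(nxt) == "active":
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    nodes = {n for edge in edges for n in edge}
    return any(n not in state and visit(n) for n in nodes)


# y := x || x := y, with both reads returning 1.
THIN_AIR = [
    ("Rx1", "Wy1"),  # data dependency in y := x
    ("Wy1", "Ry1"),  # reads-from
    ("Ry1", "Wx1"),  # data dependency in x := y
    ("Wx1", "Rx1"),  # reads-from
]
```

Dropping the second data dependency, as in `y := x || x := 1`, leaves an acyclic graph, so that load-buffering outcome is not ruled out by this check.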

## 5 IMM EXAMPLES

Interpreting this definition for the IMM:

- No wk, default is rlx
- All threads in same cta (only one scope)
- Actions are morally strong when both are ra/sc, mimicking happens-before
- Strong fulfillment may do the right thing

Disallowed by IMM:

$$x := 2; y^{ra} := 1 \parallel r := y^{ra}; x := 1$$
 (PUB-REL-ACQ-COE)

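$\mathsf{W}x2 \xrightarrow{\mathsf{bob}} \mathsf{W}^{\mathsf{ra}}y1 \xrightarrow{\mathsf{rfe}} \mathsf{R}^{\mathsf{ra}}y1 \xrightarrow{\mathsf{bob}} \mathsf{W}x1$  (✗IMM)

$\mathsf{W}x2 \longrightarrow \mathsf{W}^{\mathsf{ra}}y1 \longrightarrow \mathsf{R}^{\mathsf{ra}}y1 \longrightarrow \mathsf{W}x1$  ( $\trianglelefteq = \leq$ )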

Allowed by IMM, but not by Power/ARMv7/ARMv8/TSO:

$$x := 2; y^{ra} := 1 \parallel r := y; x := r$$
 (PUB-REL-RLX-COE)

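$\mathsf{W}x2 \xrightarrow{\mathsf{bob}} \mathsf{W}^{\mathsf{ra}}y1 \xrightarrow{\mathsf{rfe}} \mathsf{R}y1 \xrightarrow{\mathsf{data}} \mathsf{W}x1$  ( $\checkmark$ IMM)

$\mathsf{W}x2 \xrightarrow{\mathsf{bob}} \mathsf{W}^{\mathsf{ra}}y1 \longrightarrow \mathsf{R}y1 \longrightarrow \mathsf{W}x1$  ( $\le$ )

$\mathsf{W}x2 \longrightarrow \mathsf{W}^{\mathsf{ra}}y1 \longrightarrow \mathsf{R}y1 \longrightarrow \mathsf{W}x1$  ( $\le$ )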

Example from talk:

## 6 TWO ORDER IDEA

The two order idea from OOPSLA talk is:

- Require:  $d \sqsubseteq e$  when  $d \trianglelefteq e$  and they conflict

This does not work for the IMM or ARMv7, but it may work for Power, TSO, ARMv8. That would be nice. Let's write  $\sqsubseteq$  for this notion, with strong fulfillment.

With this there is a cycle in ARM7-WEAK (weak/strong fulfillment not relevant here):

Anton says: ARM7-WEAK is forbidden by Power, TSO, ARMv8, but allowed by ARMv7. Maybe it isn't that important to support it anymore.

There is also a cycle in Pub-rel-rlx-coe. Anton says: I checked the Power/ARMv7 models in this regard. They disallow the behavior (as do ARMv8 and TSO), so we can in principle strengthen IMM to forbid it as well. For that, we may add an axiom to IMM forbidding cycles in  $co \cup ([W]; rfe^?; ([R^{acq}] \cup po; [FW^{rel}]); ar^*; [W])$ . This works if we have acquire/release accesses on the path, since they are compiled to Power with fences.

## 7 PTX EXAMPLES

Based on [Lustig et al. 2019; NVIDIA 2020].

PTX requires weak fulfillment.

Default scope is cta. In examples, all threads are in different ctas.

Default mode is wk.

(Rx0) must be forbidden. Before fulfilling the read:

$$x := 0; x := 1; y_{\text{sys}}^{\text{ra}} := 1 \parallel r := y_{\text{sys}}^{\text{ra}}; s := x$$
 (PUB1<sub>SYS</sub>)

 $(\mathsf{W} x 1) \sqsubseteq (\mathsf{R} x)$  is required by M7, enforcing publication.

(Rx0) must be allowed:

$$x := 0; x := 1; y^{ra} := 1 \parallel r := y^{ra}; s := x$$
 (PUB1<sub>CTA</sub>)

$$(\mathsf{W}x0) \rightarrow (\mathsf{W}x1) \qquad (\mathsf{W}^{\mathsf{ra}}y1) \qquad (\mathsf{R}x)$$

We do not have  $(W^{ra}y1) \le (R^{ra}y1)$  since F3 only requires order for things that are morally strong. Another example that may be of interest (nothing morally strong): can this produce (Rx0)?

$$x := 0; x := 1 \parallel y := x \parallel if(y)\{r := x\}$$

PTX allows TC16 for events that are not morally strong with each other (TC16<sub>WK</sub>), but disallows it when they are (TC16<sub>SYS</sub>). Note that  $\leq$  imposes no requirements here, and fulfillment imposes no order. This example shows that F3C cannot be strengthened to require that  $d \sqsubseteq e$ .

$$r := x ; x := 1 \parallel s := x ; x := 2$$
 (TC16<sub>WK</sub>)

$$(\mathsf{R}x2) \rightarrow (\mathsf{W}x1) \qquad (\mathsf{R}x1) \rightarrow (\mathsf{W}x2)$$

$$r := x_{sys}^{rlx}; x_{sys}^{rlx} := 1 \parallel s := x_{sys}^{rlx}; x_{sys}^{rlx} := 2$$
 (TC16<sub>SYS</sub>)

$$(\mathsf{R}_{\mathsf{sys}}^{\mathsf{rlx}} x2) \qquad (\mathsf{W}_{\mathsf{sys}}^{\mathsf{rlx}} x1) \qquad (\mathsf{R}_{\mathsf{sys}}^{\mathsf{rlx}} x1) \qquad (\mathsf{W}_{\mathsf{sys}}^{\mathsf{rlx}} x2) \qquad (\trianglelefteq = \leq)$$

About release-acquire semantics. Anton confirms that the following example is allowed in C11 but disallowed in IMM. It is apparently allowed in C11 with the intention of allowing releasing writes to be downgraded to relaxed when they only fulfill relaxed reads.

$$r := x_{\mathsf{sys}}^{\mathsf{rlx}}; \ y_{\mathsf{sys}}^{\mathsf{ra}} := 1 \ \parallel \ s := y_{\mathsf{sys}}^{\mathsf{rlx}}; \ x_{\mathsf{sys}}^{\mathsf{ra}} := 1 \tag{\texttt{LB-Rel}}$$

Another example from Anton. This is allowed in PTX because it does not include synchronization in the no-tar axiom, only in coherence and causality.

$$r := x_{\mathsf{sys}}^{\mathsf{rlx}}; \ y_{\mathsf{sys}}^{\mathsf{rlx}} := r \parallel s := y_{\mathsf{sys}}^{\mathsf{rlx}}; \ x_{\mathsf{sys}}^{\mathsf{ra}} := 1 \tag{\texttt{LB-DATA-REL}}$$
 
$$(\trianglelefteq = \leq)$$

## 8 RFI EXAMPLES

Bad example:

$$r := \mathsf{EXCHG}(x,2) \; ; \; s := x \; ; \; y := s-1 \; \| \; r := y \; ; \; x := r$$

$$(\checkmark \mathsf{ARM8})$$


Anton example 1 (Allowed by ARM) [rfi-coe-coe]

$$x := 2; r := x^{ra}; y := 1 \parallel y := 2; x^{ra} := 1$$
 (RFI-COE-COE)

$$(\checkmark ARM8)$$

$$(\mathsf{W}x2) \qquad (\mathsf{R}^{\mathsf{ra}}x2) \longrightarrow (\mathsf{W}y1) \qquad (\mathsf{W}y2) \longrightarrow (\mathsf{W}^{\mathsf{ra}}x1) \qquad (\leq)$$

Internal reads survive acquires [rfi-acq-coe-coe] (where SC read = LDAR)

$$x := 2; s := z^{sc}; r := x^{sc}; y := 1 \parallel y := 2; x^{ra} := 1$$
 (RFI-ACQ-COE-COE)

$$\mathsf{W}x2 \xrightarrow{\mathsf{rfi}} \mathsf{R}^{\mathsf{sc}}x2 \xrightarrow{\mathsf{bob}} \mathsf{W}y1 \xrightarrow{\mathsf{coe}} \mathsf{W}y2 \xrightarrow{\mathsf{bob}} \mathsf{W}^{\mathsf{ra}}x1 \xrightarrow{\mathsf{coe}} \mathsf{W}x2 \qquad \mathsf{R}^{\mathsf{sc}}z0 \xrightarrow{\mathsf{bob}} \mathsf{R}^{\mathsf{sc}}x2$$

And release-acquire pairs [rfi-ra-coe-coe] (where acquiring read = LDAPR)

$$x := 2; w^{ra} := 1; s := z^{ra}; r := x^{ra}; y := 1 \parallel y := 2; x^{ra} := 1 \parallel w := r; r := 1$$
 (RFI-RA-COE-COE2)

But not if either acquire is strengthened to SC (where SC read = LDAR). The execution is also disallowed if an external thread places order between the ra accesses:

$$x := 2; w^{ra} := 1; s := z^{ra}; r := x^{ra}; y := 1 \parallel y := 2; x^{ra} := 1 \parallel w := r; r := z$$
 (RFI-RA-DATA-COE-COE)

To allow this, weaken ra to rlx when the read is fulfilled by a relaxed write of the same thread (we need not allow this when the write is part of an RMW).

$$x := 2; r := x^{ra}; y := 1 \parallel y := 2; x^{ra} := 1$$
 $(\mathsf{W}x2) \longrightarrow (\mathsf{R}x2) \qquad (\mathsf{W}y1) \longrightarrow (\mathsf{W}y2) \longrightarrow (\mathsf{W}^{\mathsf{ra}}x1)$ 

RF variant [rfi-rfe-coe]:

$$x := 2; r := x^{ra}; y := 1 || s := y; x^{ra} := 1$$
 (RFI-RFE-COE)

$$(\checkmark ARM8)$$

TSO variant [rfi-fre-coe]:

$$x := 2; r := x^{ra}; s := y \parallel y := 2; x^{ra} := 1$$
 (RFI-FRE-COE)

$$(\checkmark ARM8)$$

Note that TSO does not order W to R in local order, even in poloc. Nonetheless, TSO disallows the following because of local visibility in the first thread.

$$x := 2; r := x \parallel x := 1; s := x$$

[Higham and Kawash 2000] describe TSO as a linearization of a partial order including:

- poloc,
- lws = po; [W],
- $d \stackrel{\text{po}}{\longrightarrow} e$  when  $c \stackrel{\text{rfe}}{\longrightarrow} d \stackrel{\text{po}}{\longrightarrow} e$ .

[Alglave et al. 2020] describe TSO as a linearization of a partial order satisfying internal visibility and including

- [W]; po; [W],
- $d \stackrel{\text{po}}{\longrightarrow} e$  when  $c \stackrel{\text{rfe}}{\longrightarrow} d \stackrel{\text{po}}{\longrightarrow} e$ , from `range(rfe) * _`,
- [R]; po; [W], from `rfi^-1; lob`.

Ignoring fences and RMWs:

Double FRE variant [rfi-fre-fre]:

$$x := 2; r := x^{ra}; s := y \parallel y := 2; F; r := x$$
 (RFI-FRE-FRE)

$$(\text{W}x2) \xrightarrow{\text{rf}} (\text{R}^{\text{ra}}x2) \xrightarrow{\text{bob}} (\text{R}y0) \xrightarrow{\text{fre}} (\text{W}y2) \xrightarrow{\text{bob}} (\text{R}x0) \xrightarrow{\text{fre}} (\text{W}x2)$$

It does not seem possible to do this only with rfe. ARM disallows this [data-rfi-rfe-rfe]:

$$x := z; r := x^{ra}; y := 1 \parallel z := y$$

$$(DATA-RFI-RFE-RFE)$$

(✗ARM8)

It also disallows [ctrl-rfi-rfe-rfe]:

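$$\mathsf{if}(z)\{\}; x := 1; r := x^{ra}; y := 1 \parallel z := y$$
 (CTRL-RFI-RFE-RFE)

$$\mathsf{R}z1 \xrightarrow{\mathsf{ctrl}} \mathsf{W}x1 \qquad \mathsf{W}y1 \xrightarrow{\mathsf{rfe}} \mathsf{R}y1 \xrightarrow{\mathsf{data}} \mathsf{W}z1$$

(✗ARM8)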

ARM allows some counterintuitive results for SC access [ctrl-rfi-fre-rfe]:

$$\mathsf{if}(x)\{\}; x := 2; r := x^{sc}; s := y^{sc} \parallel y^{sc} := 2; x^{sc} := 1$$
 (CTRL-RFI-FRE-RFE)

Not possible with coe [ctrl-rfi-coe-rfe]:

This is not allowed with a data dependency instead of a control dependency [data-rfi-fre-rfe]:

$$x := x+1; \ r := x^{\text{sc}}; \ s := y^{\text{sc}} \parallel y^{\text{sc}} := 1; \ x^{\text{sc}} := 1$$

$$(\text{DATA-RFI-FRE-RFE})$$

(✗ARM8)

## 9 SC EXAMPLES

IRIW-ACQ-SC is allowed by trailing-sync compilation to Power [Lahav et al. 2017, §1].

$$x^{\text{sc}} := 1 \parallel y^{\text{sc}} := 1 \parallel r := x^{\text{ra}}; \ s := y^{\text{sc}} \parallel r := y^{\text{ra}}; \ s := x^{\text{sc}}$$

$$(\text{IRIW-ACQ-SC})$$

$$(\text{Power, RC11})$$

Leading sync is also unsound in C11 with RMW [Lahav et al. 2017, §2.1].

$$x^{\text{sc}} := 1; \ y^{\text{ra}} := 1 \parallel \text{FADD}^{\text{sc,sc}}(y, 1); \ s := y \parallel y^{\text{sc}} := 3; \ s := x^{\text{sc}}$$

$$(\text{Z6.U})$$

$$\text{W}^{\text{sc}} x 1 \longrightarrow \text{W}^{\text{ra}} y 1 \longrightarrow \text{R}^{\text{sc}} y 1 \xrightarrow{\text{rmw}} \text{W}^{\text{sc}} y 2 \longrightarrow \text{R}^{\text{sc}} x 0 \qquad (\text{Power, RC11})$$

Leading sync is also unsound in C11 with SC fences [Lahav et al. 2017, §A.1].

$$x := 2; F^{sc}; r := y \parallel y^{sc} := 1 \parallel r := y^{ra}; x^{ra} := 1; s := x \parallel r := x^{sc}$$

$$(RSYNC+RSC)$$


Fulfillment of (Rx2) requires that either  $(W^{ra}x1) \rightarrow (Wx2)$  or  $(Rx2) \rightarrow (W^{ra}x1)$ . It is interesting that in the pomset,  $(R^{sc}x1)$  is not needed to get a cycle.

There is a long discussion of this in [Bender and Palsberg 2019, §5.2, Fig. 17], where they also discuss this example:

$$x^{\text{sc}} := 1; x := 2 \parallel y^{\text{sc}} := 1; y := 2 \parallel r := x^{\text{ra}}; s := y^{\text{sc}} \parallel r := y^{\text{ra}}; s := x^{\text{sc}}$$
 (IRIW-SC-RLX-ACQ)  $(\checkmark \text{RC11})$ 

[Lahav et al. 2017, §A.2] claim that ARM8 allows this [RWC+acq+sc], but herd7 rejects it. Reason: they cite the Flowing/POP model [Flur et al. 2016] rather than [Pulte et al. 2018].

$$x^{\text{sc}} := 1 \parallel r := x; F^{\text{acq}}; s := y^{\text{sc}} \parallel y^{\text{sc}} := 1; r := x^{\text{sc}}$$

$$(\text{RWC+ACQ+SC})$$


## 10 RMWS

From [Bender and Palsberg 2019, §3.3]. With partial coherence/weak fulfillment you need to be careful that RMWs are totally ordered (if that's a property you want). May not come for free.


## 11 EXAMPLE FROM JAM PAPER

From [Bender and Palsberg 2019, §B]: "Here we demonstrate that it is possible to construct a program that is only forbidden due to the total coherence order"

$$r := x; x := 1 \parallel r := x^{ra}; y := 1 \parallel r := y^{ra}; x := 2$$

$$(TOTAL-CO)$$

$$Rx2 \longrightarrow Wx1 \longrightarrow R^{acq}x1 \longrightarrow Wy1 \longrightarrow R^{acq}y1 \longrightarrow Wx2$$

$$(\times\,\text{ARM8})$$


#### 12 OLD MODEL

Orders and relations in the model:

- ⊴ is the old ≤ (without the coherence constraints from F4 and P5B). It provides the NO-TAR axiom.
- ≤ is the *happens-before* suborder, which includes an rf edge only when the write and read are morally strong. It serves as a cross-location transitive kernel for the per-location order.
- ⊑ is a per-location order that relates morally strong and po-loc accesses. It includes ≤ for morally strong accesses and provides the SC-PER-LOC axiom.

Write  $d \triangle e$  if d and e conflict (i.e., they access the same location and form a read/write or write/write pair).

Write  $d \blacktriangle e$  if d and e conflict and are morally strong.
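To make the two relations concrete, here is a small Python sketch. The conflict check follows the text directly. The morally-strong check is an assumption: it follows our reading of the PTX notion [NVIDIA 2020; Lustig et al. 2019] (same thread, or both accesses non-weak with scopes that include each other's thread) and ignores PTX's proxy and overlap conditions. The `Access` record and its fields are hypothetical.

```python
from dataclasses import dataclass


# Hypothetical encoding of a memory access (field names are illustrative).
@dataclass(frozen=True)
class Access:
    thread: int
    cta: int        # thread group of the issuing thread (unique within a gpu)
    gpu: int        # processor of the issuing thread
    loc: str
    is_write: bool
    mode: str       # "wk", "rlx", "ra" or "sc"
    scope: str      # "cta", "gpu" or "sys"


def conflicts(d: Access, e: Access) -> bool:
    """d △ e: same location, and at least one of the two is a write."""
    return d.loc == e.loc and (d.is_write or e.is_write)


def scope_includes(a: Access, other: Access) -> bool:
    """Does the scope declared by `a` include the thread that issued `other`?"""
    if a.scope == "cta":
        return a.gpu == other.gpu and a.cta == other.cta
    if a.scope == "gpu":
        return a.gpu == other.gpu
    return True  # "sys" includes every thread


def morally_strong(d: Access, e: Access) -> bool:
    """Simplified, PTX-inspired check (proxies and partial overlap ignored)."""
    if d.thread == e.thread:
        return True
    both_strong = d.mode != "wk" and e.mode != "wk"
    return both_strong and scope_includes(d, e) and scope_includes(e, d)


def strongly_conflicts(d: Access, e: Access) -> bool:
    """d ▲ e: conflicting and morally strong."""
    return conflicts(d, e) and morally_strong(d, e)


w = Access(thread=0, cta=0, gpu=0, loc="x", is_write=True, mode="rlx", scope="gpu")
r = Access(thread=1, cta=1, gpu=0, loc="x", is_write=False, mode="rlx", scope="cta")
print(conflicts(w, r), morally_strong(w, r))  # True False: r's cta scope excludes thread 0
```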

*Definition 12.1.* A pomset with preconditions is a tuple  $(E, \lambda, \trianglelefteq, \leq, \sqsubseteq)$  where

- (M1) E is a set of events
- (M2)  $\lambda : E \to (\Phi \times \mathcal{A})$  is a *labeling* from which we derive functions
  - $\kappa : E \to \Phi$  (formulae)
  - $\lambda: E \to \mathcal{A}$  (actions)
- (M3)  $\trianglelefteq \subseteq E \times E$,  $\leq \subseteq E \times E$, and  $\sqsubseteq \subseteq E \times E$  are partial orders
- (M4)  $\bigwedge_{e} \kappa(e)$  is satisfiable (consistency)
- (M5) if  $d \le e$  then  $\kappa(e)$  implies  $\kappa(d)$  (causal strengthening)
- (M6) if  $d \le e$  then  $d \trianglelefteq e$
- (M7) if  $d \le e$  and d conflicts with e then  $d \sqsubseteq e$
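The order-related conditions of Definition 12.1 can be checked mechanically on a finite candidate. The following Python sketch is illustrative only: the relations are hypothetical sets of pairs, M4 and M5 are skipped because they need a solver for the formulae, and M6 is read as ≤ ⊆ ⊴ (happens-before as a suborder of ⊴), which is our reading of the text.

```python
from itertools import product


def is_partial_order(events, rel):
    """Reflexive, antisymmetric and transitive on `events` (rel is a set of pairs)."""
    if any((e, e) not in rel for e in events):
        return False
    if any((b, a) in rel for (a, b) in rel if a != b):
        return False
    return all((a, c) in rel for (a, b), (b2, c) in product(rel, rel) if b == b2)


def check_order_axioms(events, dep, hb, loc_order, conflicts):
    """Check (M3), (M6) and (M7) for dep = ⊴, hb = ≤ and loc_order = ⊑."""
    m3 = all(is_partial_order(events, r) for r in (dep, hb, loc_order))
    m6 = hb <= dep  # every d ≤ e is also d ⊴ e
    m7 = all((d, e) in loc_order for (d, e) in hb if conflicts(d, e))
    return m3 and m6 and m7


events = {"a", "b"}
refl = {(e, e) for e in events}
ab = refl | {("a", "b")}
print(check_order_axioms(events, ab, ab, ab, lambda d, e: True))    # True
print(check_order_axioms(events, refl, ab, ab, lambda d, e: True))  # False: M6 fails
```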

Definition 12.2 (Strong fulfillment). We say  $\lambda(d) = (Wxv)$  fulfills  $\lambda(e) = (Rxv)$  if

- (F3A) *d* ⊲ *e*
- (F3B) d < e if d is morally strong with e
- (F3C)  $d \sqsubseteq e$  (if d is not morally strong with e)

- (F4)  $\forall \lambda(c) = (\mathsf{W}x..)$, either  $c \sqsubseteq d$  or  $e \sqsubseteq c$

Definition 12.3 (Weak fulfillment). We say  $\lambda(d) = (Wxv)$  fulfills  $\lambda(e) = (Rxv)$  if

- (F3A) *d* ⊲ *e*
- (F3B) d < e if d is morally strong with e
- (F3C)  $e \not\sqsubseteq d$  (if d is not morally strong with e)
- (F4)  $\forall \lambda(c) = (\mathsf{W}x..)$, either  $c \subseteq d$  or  $e \subseteq c$, where

$$d \subseteq e \quad\text{when}\quad \begin{cases} d \sqsubseteq e & \text{if } d \text{ is morally strong with } e \\ e \not\sqsubset d & \text{otherwise} \end{cases}$$
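The two definitions can be compared side by side on small examples. The following Python sketch is a hedged illustration, not the authors' tooling: `fulfills` is a hypothetical name, the relations are passed as the strict parts of ⊴, ≤ and ⊑ (transitively closed sets of pairs), the morally-strong predicate `ms` is supplied by the caller, and F4 skips c = d.

```python
def fulfills(d, e, writes, dep, hb, loc, ms, weak):
    """Does write d fulfill read e (same location and value assumed)?

    dep, hb, loc -- strict parts of ⊴, ≤ and ⊑, as transitively closed sets of pairs
    writes       -- every write c to the same location (d itself is skipped in F4)
    ms(a, b)     -- hypothetical predicate: is a morally strong with b?
    weak         -- True for weak fulfillment (Def. 12.3), False for strong (Def. 12.2)
    """

    def sq(a, b):  # a ⊑ b
        return a == b or (a, b) in loc

    def weak_rel(a, b):  # relation used in the weak (F4)
        return sq(a, b) if ms(a, b) else (b, a) not in loc  # not b ⊏ a

    if (d, e) not in dep:  # (F3A) d ⊲ e
        return False
    if ms(d, e):
        if (d, e) not in hb:  # (F3B) d < e
            return False
    elif weak:
        if sq(e, d):  # (F3C), weak: e ⋢ d
            return False
    elif not sq(d, e):  # (F3C), strong: d ⊑ e
        return False
    rel = weak_rel if weak else sq
    return all(rel(c, d) or rel(e, c) for c in writes if c != d)  # (F4)


def never_ms(a, b):
    return False


# A write w and a read r that are not morally strong and are unordered by ⊑:
print(fulfills("w", "r", ["w"], {("w", "r")}, set(), set(), never_ms, weak=False))  # False
print(fulfills("w", "r", ["w"], {("w", "r")}, set(), set(), never_ms, weak=True))   # True
```

Strong fulfillment rejects the unordered pair in F3C, while weak fulfillment only forbids r ⊑ w, so it accepts.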

If all accesses are morally strong with each other, weak fulfillment degenerates to

- (F3) d < e
- (F4)  $\forall \lambda(c) = (\mathbf{W}x..)$  either  $c \sqsubseteq d$  or  $e \sqsubseteq c$

If no accesses are morally strong with each other, weak fulfillment degenerates to

- (F3)  $e \not\sqsubseteq d$
- (F4)  $\not\exists \lambda(c) = (\mathsf{W}x..)$  both  $d \sqsubset c$  and  $c \sqsubset e$
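The second form of F4 follows by unfolding the "otherwise" branch of the weak-fulfillment relation, since no pair of accesses is morally strong (here c ranges over writes with $\lambda(c) = (\mathsf{W}x..)$):

$$
\begin{aligned}
& \forall c.\; c \subseteq d \;\lor\; e \subseteq c \\
\iff\; & \forall c.\; d \not\sqsubset c \;\lor\; c \not\sqsubset e \\
\iff\; & \neg\exists c.\; d \sqsubset c \;\land\; c \sqsubset e
\end{aligned}
$$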

Note that strong and weak fulfillment differ only in F3C and F4, that is, in how  $\sqsubseteq$  constrains accesses that are not morally strong with each other. We sometimes write  $\sqsubseteq$  for strong fulfillment and  $\subseteq$  for weak fulfillment.

Prefixing is as in OOPSLA, using ≤ for order everywhere except P5B, which has ⊑.

*Definition 12.4.* Let  $P' \in (\phi \mid a) \Rightarrow \mathcal{P}$  when  $(\exists P \in \mathcal{P}) \ (\forall e \in E)$:

- (P1)  $E' = E \cup \{d\}$
- (P2)  $\trianglelefteq' \supseteq \trianglelefteq$,  $\leq' \supseteq \leq$, and  $\sqsubseteq' \supseteq \sqsubseteq$
- (P3A)  $\lambda'(e) = \lambda(e)$
- (P3B)  $\lambda'(d) = a$
- (P4A)  $\kappa'(d)$  implies  $\phi \wedge (d \notin E \vee \kappa(d))$
- (P4B) if  $d \neq (R..)$  then e = d or  $\kappa'(e)$  implies  $\kappa(e)$
- (P4C) if d = (Rxv) then e = d or  $\kappa'(e)$  implies  $\kappa(e)[v/x]$
- (P5A) if d = (R..), e = (W..) then e = d or  $\kappa'(e)$  implies  $\kappa(e)$  or  $d \leq' e$
- (P5B) if d conflicts with e then  $d \sqsubseteq' e$
- (P5C) if d is an acquire or e is a release then  $d \leq' e$
- (P5D) if *d* is an SC write and *e* is an SC read then  $d \leq' e$
- (P5E) if *d* reads, and *e* is an acquiring fence, then  $d \le' e$
- (P5F) if *d* is a releasing fence, and *e* writes, then  $d \le' e$
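The ordering obligations P5B to P5F can be read as a function from the prefixed action d and an existing action e to the orders that must relate them. A hedged Python sketch follows. The `Action` record is hypothetical; P5A is omitted because it depends on the formulae; and treating ra/sc reads (writes) as acquires (releases), and fsc fences as both acquiring and releasing, are assumptions based on the mode descriptions in Section 1.1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    kind: str                  # "R", "W" or "F" (fence)
    mode: str                  # access mode ("rlx", "ra", "sc", ...) or fence mode ("rel", "acq", "fsc")
    loc: Optional[str] = None  # None for fences


def is_acquire(a: Action) -> bool:  # assumption: ra and sc reads acquire
    return a.kind == "R" and a.mode in ("ra", "sc")


def is_release(a: Action) -> bool:  # assumption: ra and sc writes release
    return a.kind == "W" and a.mode in ("ra", "sc")


def required_edges(d: Action, e: Action) -> set:
    """Orders that must contain (d, e) when d is prefixed before e (P5B-P5F).

    "loc" stands for ⊑' and "hb" for ≤'; P5A is omitted.
    """
    edges = set()
    accesses = d.kind != "F" and e.kind != "F"
    if accesses and d.loc == e.loc and "W" in (d.kind, e.kind):
        edges.add("loc")  # (P5B) conflicting accesses
    if is_acquire(d) or is_release(e):
        edges.add("hb")  # (P5C)
    if (d.kind, d.mode, e.kind, e.mode) == ("W", "sc", "R", "sc"):
        edges.add("hb")  # (P5D) SC write before SC read
    if d.kind == "R" and e.kind == "F" and e.mode in ("acq", "fsc"):
        edges.add("hb")  # (P5E) read before acquiring fence
    if d.kind == "F" and d.mode in ("rel", "fsc") and e.kind == "W":
        edges.add("hb")  # (P5F) releasing fence before write
    return edges


print(required_edges(Action("R", "ra", "x"), Action("W", "rlx", "y")))  # {'hb'}
print(required_edges(Action("W", "rlx", "x"), Action("R", "rlx", "x")))  # {'loc'}
```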

#### REFERENCES

- Jade Alglave, Will Deacon, Richard Grisenthwaite, Antoine Hacquard, and Luc Maranget. 2020. Armed cats: Formal Concurrency Modelling at Arm. Draft, 49 pages.

- John Bender and Jens Palsberg. 2019. A formalization of Java's concurrent access modes. Proc. ACM Program. Lang. 3, OOPSLA (2019), 142:1–142:28. https://doi.org/10.1145/3360568
- William Ferreira, Matthew Hennessy, and Alan Jeffrey. 1996. A Theory of Weak Bisimulation for Core CML. In *Proceedings of the 1996 ACM SIGPLAN International Conference on Functional Programming, ICFP 1996, Philadelphia, Pennsylvania, USA, May 24-26, 1996*, Robert Harper and Richard L. Wexelblat (Eds.). ACM, 201–212. https://doi.org/10.1145/232627.232649
- Shaked Flur, Kathryn E. Gray, Christopher Pulte, Susmit Sarkar, Ali Sezgin, Luc Maranget, Will Deacon, and Peter Sewell. 2016. Modelling the ARMv8 architecture, operationally: concurrency and ISA. In *Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20-22, 2016*, Rastislav Bodík and Rupak Majumdar (Eds.). ACM, 608–621. https://doi.org/10.1145/2837614.2837615
- Lisa Higham and Jalal Kawash. 2000. Memory Consistency and Process Coordination for SPARC Multiprocessors. In High Performance Computing HiPC 2000, 7th International Conference, Bangalore, India, December 17-20, 2000, Proceedings (Lecture Notes in Computer Science, Vol. 1970), Mateo Valero, Viktor K. Prasanna, and Sriram Vajapeyam (Eds.). Springer, 355–366. https://doi.org/10.1007/3-540-44467-X\_32
- Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. 2017. Repairing sequential consistency in C/C++11. In *Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017*, Albert Cohen and Martin T. Vechev (Eds.). ACM, 618–632. https://doi.org/10.1145/3062341.3062352
- Daniel Lustig, Sameer Sahasrabuddhe, and Olivier Giroux. 2019. A Formal Analysis of the NVIDIA PTX Memory Consistency Model. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2019, Providence, RI, USA, April 13-17, 2019, Iris Bahar, Maurice Herlihy, Emmett Witchel, and Alvin R. Lebeck (Eds.). ACM, 257–270. https://doi.org/10.1145/3297858.3304043
- NVIDIA. 2020. Parallel Thread Execution ISA Version 7.1. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#memory-consistency-model.
- Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. 2018. Simplifying ARM concurrency: multicopy-atomic axiomatic and operational models for ARMv8. *PACMPL* 2, POPL (2018), 19:1–19:29. https://doi.org/10.1145/3158107