#### 1 SYNC EXAMPLES

The first of these is observed in hardware. All are allowed by PTX. Dashed arrows show rf edges that are not included in the order.

$$x := 1; \ y^{\mathsf{ra}} := 1 \parallel r := y^{\mathsf{ra}}; \ z_{\mathsf{sys}} := r \parallel_{\gamma} r := z_{\mathsf{sys}}^{\mathsf{ra}}; \ s := x$$

$$(\alpha \mathsf{W} x \mathsf{1}) \longrightarrow (\alpha \mathsf{W}^{\mathsf{ra}} y \mathsf{1}) \longrightarrow (\alpha \mathsf{R}^{\mathsf{ra}} y \mathsf{1}) \longrightarrow (\alpha \mathsf{W}_{\mathsf{sys}} z \mathsf{1}) \longrightarrow (\gamma \mathsf{R}^{\mathsf{ra}}_{\mathsf{sys}} z \mathsf{1}) \longrightarrow (\gamma \mathsf{R} x \mathsf{0})$$

$$x := 1; y^{\mathsf{ra}} := 1 \parallel r := y^{\mathsf{ra}}; z := r \parallel r := z^{\mathsf{ra}}; s := x$$

$$(\mathsf{W} x 1) \longrightarrow (\mathsf{W}^{\mathsf{ra}} y 1) \longrightarrow (\mathsf{R}^{\mathsf{ra}} y 1) \longrightarrow (\mathsf{R} x 0)$$

$$(\leq)$$

$$x := 1; \ y^{\mathsf{ra}} := 1 \parallel r := y; \ z^{\mathsf{ra}} := r \parallel r := z^{\mathsf{ra}}; \ s := x$$

$$(\mathsf{W}x1) \longrightarrow (\mathsf{W}^{\mathsf{ra}}y1) \longrightarrow (\mathsf{R}y1) \longrightarrow (\mathsf{R}x0)$$

$$(\leq)$$

To get publication using fences we need an additional closure property for rf on sync order:

$$x := 1; F^{\text{rel}}; y := 1 \parallel r := y; F^{\text{acq}}; s := x$$

$$(\leq)$$

The current definition of candidate requires:

(c4) if  $d \xrightarrow{\text{rf}} e$  and  $\lambda(d)$  strongly-matches  $\lambda(e)$  then  $d \leq e$ .

This is not good enough for fences. A possible fix is the following closure condition:

(c4') if  $d' \le d \xrightarrow{\mathsf{rf}} e \le e'$  and  $\lambda(d')$  strongly-matches  $\lambda(e')$  then  $d' \le e'$ .

With that we have the following, using  $\Rightarrow$  for edges induced by closure when  $d' \neq d$  or  $e' \neq e$ :

$$x := 1; F^{\text{rel}}; y := 1 \parallel r := y; F^{\text{acq}}; s := x$$

$$(\leq)$$

This seems to work for the above examples, but it could be too strong in general.
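The effect of (c4') on the fence example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event names, the `SYNC` edges (read off synchronization-delays), and the choice that the release fence and acquire fence are the only strongly-matching pair are all hypothetical encodings, and the sketch assumes $\leq$ is kept transitively closed, which the notes below still question.

```python
from itertools import product

# Events of: x := 1; F^rel; y := 1  ||  r := y; F^acq; s := x
EVENTS = ["Wx1", "Frel", "Wy1", "Ry1", "Facq", "Rx"]

# Per-thread synchronization order, as induced by synchronization-delays.
SYNC = {("Wx1", "Frel"), ("Frel", "Wy1"), ("Ry1", "Facq"), ("Facq", "Rx")}

# Reads-from: the plain read of y sees the plain write of y.
RF = {("Wy1", "Ry1")}

# Assumption: only the release fence / acquire fence pair strongly-matches.
STRONGLY_MATCHES = {("Frel", "Facq")}


def reflexive_transitive_closure(events, edges):
    """Return the reflexive-transitive closure of edges over events."""
    closure = set(edges) | {(e, e) for e in events}
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def close_c4_prime(events, sync, rf, strongly_matches):
    """Add d' <= e' whenever d' <= d -rf-> e <= e' and d' strongly-matches e'."""
    order = reflexive_transitive_closure(events, sync)
    changed = True
    while changed:
        changed = False
        for (d, e) in rf:
            for d1, e1 in product(events, repeat=2):
                if ((d1, d) in order and (e, e1) in order
                        and (d1, e1) in strongly_matches
                        and (d1, e1) not in order):
                    order = reflexive_transitive_closure(events, order | {(d1, e1)})
                    changed = True
    return order


order = close_c4_prime(EVENTS, SYNC, RF, STRONGLY_MATCHES)
print(("Frel", "Facq") in order)  # edge induced by the closure condition
print(("Wx1", "Rx") in order)     # publication of x follows by transitivity
print(("Wy1", "Ry1") in order)    # the rf edge itself is not added to the order
```

In this encoding the closure adds only the fence-to-fence edge; the order between the write and the read of x then follows from transitivity, which is exactly the step whose strength is in doubt.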

- One possibility is to restrict to preceding and following events in the same thread:

  (c4'') if  $d' \leq_{po} d \xrightarrow{rf} e \leq_{po} e'$  and  $\lambda(d')$  strongly-matches  $\lambda(e')$  then  $d' \leq e'$,

  where  $\leq_{po}$  is the obvious restriction of  $\leq$  to actions on the same thread.

- With either (c4') or (c4''), is it too strong to require that  $\leq$  be transitive? In particular:
  - if we restrict to  $\leq_{po}$ , the closure condition (c4'') could add order between actions on the same thread via cross-thread reads.
  - How does transitivity interact with scopes?

Anton proposes:

(M9b') if  $d \xrightarrow{\mathsf{rmw}} e$  then  $d \sqsubseteq e$ ,

(c4''') if  $d' \le d \; (\xrightarrow{\mathsf{rf}} ; (\xrightarrow{\mathsf{rmw}} ; \xrightarrow{\mathsf{rf}})^{*}) \; e \le e'$  and  $\lambda(d')$  strongly-matches  $\lambda(e')$  then  $d' \le e'$ .

The following behavior is allowed by Arm, IMM, and C11, but forbidden by PTX. PTX forbids it since acquire reads work as fences for po-previous reads from the same location (symmetrically to release writes for po-later writes to the same location in IMM, C11, and PTX).

$$x := 1; y^{ra} := 1 \parallel r := y; y := 2; s := y^{ra}; t := x$$

$$(\leq)$$

To allow this for IMM, we need to drop  $(Rx, R^{\sqsupseteq ra}x)$  from synchronization-delays.


The following is allowed by C11, but not by IMM or PTX. The goal here is to construct a cycle  $a \xrightarrow{rf} b \xrightarrow{hb} c \xrightarrow{rf} d \xrightarrow{hb} a$  where rf will be included in the synchronization relation. In relational notation, the cycle has the following form:

$$(rmw; (rfe; rmw)^2; ppo; [W^{ra}]; rfe; [R^{ra}]; ppo)^2$$

$$r := x^{ra}; \mathsf{INC}(y) \parallel \mathsf{INC}(y) \parallel \mathsf{INC}(y); z^{ra} := 1 \parallel s := z^{ra}; \mathsf{INC}(w) \parallel \mathsf{INC}(w) \parallel \mathsf{INC}(w); x^{ra} := 1$$



#### 2 MODEL

#### 2.1 Preliminaries

The syntax is built from

- a set of *values* V, ranged over by v, w,  $\ell$ , k,
- a set of registers R, ranged over by r, s,
- a set of expressions  $\mathcal{M}$ , ranged over by M, N, L,
- a set of *thread ids*  $\mathcal{T}$ , ranged over by  $\alpha$ ,  $\gamma$ .

*Memory references* are tagged values, written  $[\ell]$ . Let  $\mathcal{X}$  be the set of memory references, ranged over by x, y, z.

We require that

- values and registers are disjoint,
- values include at least the constants 0 and 1,
- expressions include at least registers and values,
- expressions do *not* include references: M[N/x] = M,
- there are registers  $S_{\mathcal{E}} = \{s_e \mid e \in \mathcal{E}\}$ ,
- registers  $S_{\mathcal{E}}$  do not appear in programs:  $S[N/s_e] = S$ .

As an alternative to the last assumption, we sometimes assume each register is assigned at most once.<sup>1</sup> We model the following language.

$$\mu := \mathsf{wk} \mid \mathsf{rlx} \mid \mathsf{ra} \mid \mathsf{sc} \qquad \nu := \mathsf{acq} \mid \mathsf{rel} \mid \mathsf{fsc} \qquad \sigma, \rho := \mathsf{cta} \mid \mathsf{gpu} \mid \mathsf{sys}$$
 
$$S := \mathsf{skip} \mid r := M \mid r := [L]^{\mu}_{\sigma} \mid [L]^{\mu}_{\sigma} := M \mid \mathsf{F}^{\nu}_{\sigma} \mid \mathsf{if}(M)\{S_1\} \, \mathsf{else} \, \{S_2\} \mid S_1; S_2 \mid S_1 \parallel_{\gamma} S_2 \mid r := \mathsf{CAS}^{\mu_1,\mu_2}_{\sigma}([L],M,N) \mid r := \mathsf{FADD}^{\mu_1,\mu_2}_{\sigma}([L],M) \mid r := \mathsf{EXCHG}^{\mu_1,\mu_2}_{\sigma}([L],M)$$

*Access modes*,  $\mu$ , are weak (wk), relaxed (rlx), release-acquire (ra), and sequentially consistent (sc). ra/sc accesses are collectively known as *synchronized accesses*.

*Fence modes*,  $\nu$ , are acquire (acq), release (rel), and acquire-release (fsc).

*Scopes*,  $\sigma$ , are thread group (cta), processor (gpu) and system (sys).

*Commands*, aka *statements*, S, include memory accesses at a given mode, as well as the usual structural constructs. Following [Ferreira et al. 1996],  $\|$  denotes parallel composition. If  $(S_1 \|_{\gamma} S_2)$  is executed with thread ID  $\alpha$ , then  $S_2$  runs with ID  $\gamma$  and  $S_1$  continues under ID  $\alpha$ . Top level programs run with thread ID 0. In examples, we usually drop thread IDs. We use the symmetric  $\|$  operator when there is no continuation after the parallel composition.

The semantics is built from the following.

- a set of events  $\mathcal{E}$ , ranged over by e, d, c, b,
- a set of actions  $\mathcal{A}$ , ranged over by a,
- a set of logical formulae  $\Phi$ , ranged over by  $\phi$ ,  $\psi$ ,  $\theta$ .

<sup>1</sup>We make this assumption when discussing any semantics of load  $(r := [L]_{\sigma}^{\mu})$  that does not include the substitution  $[s_e/r]$ .

Subsets of  $\mathcal{E}$  are ranged over by E, D, C, B.

We require that:

- formulae include equalities (M=N) and (x=M),
- formulae are closed under negation, conjunction, disjunction, and substitutions [M/r], [M/x],
- there is a relation ⊨ between formulae, capturing entailment,
- $\models$  has the expected semantics for =,  $\neg$ ,  $\land$ ,  $\lor$ ,  $\Rightarrow$  and substitution.

Logical formulae include equations over registers, such as (r=s+1). For LIR, we also include equations over memory references, such as (x=1). Formulae are subject to substitutions; actions are not. We use expressions as formulae, coercing M to  $M\neq 0$ . Equations have precedence over logical operators; thus  $r=v \Rightarrow s>w$  is read  $(r=v) \Rightarrow (s>w)$ . As usual, implication associates to the right; thus  $\phi \Rightarrow \psi \Rightarrow \theta$  is read  $\phi \Rightarrow (\psi \Rightarrow \theta)$ .

We say  $\phi$  is a tautology if tt  $\models \phi$ . We say  $\phi$  is unsatisfiable if  $\phi \models \text{ff}$ .

We require several binary relations between actions, detailed in the next subsection: *overlaps*, *matches*, *strongly-matches*, *blocks*, *strongly-blocks*, *synchronization-delays* and *coherence-delays*. We also require subsets of actions distinguishing *read* and *release* actions, and an operator merge :  $\mathcal{A} \times \mathcal{A} \to 2^{\mathcal{A}}$ .

### 2.2 Actions

We combine access and fence modes into a single order:

$$\mathsf{wk} \to \mathsf{rlx} \to \mathsf{ra} \to \mathsf{sc} \qquad \qquad \mathsf{acq} \to \mathsf{fsc} \leftarrow \mathsf{rel}$$

We write  $\mu \sqsubseteq \nu$  for this order. Let  $\mu \sqcup \nu$  denote the least upper bound of  $\mu$  and  $\nu$ .

Let actions be reads, writes and fences:

$$a, b := \alpha W^{\mu}_{\sigma} x v \mid \alpha R^{\mu}_{\sigma} x v \mid \alpha F^{\nu}_{\sigma}$$

In definitions, we drop elements of actions that are existentially quantified. In examples, we drop elements of actions, using defaults. We write  $(\alpha A^{\mu}_{\sigma})$  to stand for  $(\alpha W^{\mu}_{\sigma})$ ,  $(\alpha R^{\mu}_{\sigma})$ , or  $(\alpha F^{\mu}_{\sigma})$ . We write  $(W^{\sqsupseteq ra})$  to stand for either  $(W^{ra})$  or  $(W^{sc})$ , and similarly for other actions and modes.

We say a matches b if a = (Wxv) and b = (Rxv).

We say a blocks b if a = (Wx) and b = (Rx), regardless of value.

We say *a overlaps b* if they access the same location.

We say a coherence-delays b if  $(a,b) \in \{(\mathsf{W} x, \mathsf{W} x), \; (\mathsf{R} x, \mathsf{W} x), \; (\mathsf{W} x, \mathsf{R} x), \; (\mathsf{A}^{\mathsf{sc}}, \mathsf{A}^{\mathsf{sc}})\}.$ 

We say a synchronization-delays b if  $(a, b) \in \{(a, W^{\sqsupseteq ra}), (a, F^{\sqsupseteq rel}), (R, F^{\sqsupseteq acq}), (R^{\sqsupseteq ra}, b), (F^{\sqsupseteq acq}, b), (F^{\sqsupseteq rel}, W), (W^{\sqsupseteq ra}x, Wx)\}$ .<sup>2</sup>

Let  $(W^{\sqsupseteq ra})$  and  $(F^{\sqsupseteq rel})$  be *release* actions. Actions (R) are *read* actions.

Let merge :  $\mathcal{A} \times \mathcal{A} \to 2^{\mathcal{A}}$  be defined as follows. Let merge( $\mathsf{R}^{\mu}xv$ ,  $\mathsf{R}^{\nu}xv$ ) = { $\mathsf{R}^{\mu\sqcup\nu}xv$ }, merge( $\mathsf{W}^{\mu}xv$ ,  $\mathsf{W}^{\nu}xw$ ) = { $\mathsf{W}^{\mu\sqcup\nu}xw$ }, merge( $\mathsf{W}^{\mu}xv$ ,  $\mathsf{R}^{\nu\sqsubseteq\mathsf{rlx}}xv$ ) = { $\mathsf{W}^{\mu\sqcup\nu}xv$ }, merge( $\mathsf{F}^{\mu}$ ,  $\mathsf{F}^{\nu}$ ) = { $\mathsf{F}^{\mu\sqcup\nu}$ }, and merge(a, b) =  $\emptyset$ , otherwise.

If  $a_0 \in \mathsf{merge}(a_1, a_2)$ , then  $a_1$  and  $a_2$  can coalesce, resulting in  $a_0$ . This allows optimizations such as (x := 1; x := 2) to (x := 2) and (x := 1; r := x) to (x := 1; r := 1). For associativity of sequential composition, it is important that merge always take an upper bound on the modes of the two actions. For example, it would invalidate associativity to allow  $(Wxv) \in \mathsf{merge}(Wxv, \, \mathsf{R}^{\mathsf{ra}}xv)$ , although this is considered safe.<sup>3</sup>

<sup>2</sup>For PTX, one can additionally include  $(Rx, R^{\sqsupseteq ra}x)$ .
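The mode order and merge can be made concrete with a small Python sketch. The `Action` record, the `COVERS` table, and the helper names are hypothetical encodings chosen for illustration; thread ids and scopes are omitted, and the covering pairs assume that both acq and rel sit directly below fsc.

```python
from dataclasses import dataclass
from typing import Optional

# Covering pairs of the mode order: wk -> rlx -> ra -> sc, and acq, rel -> fsc.
COVERS = {("wk", "rlx"), ("rlx", "ra"), ("ra", "sc"), ("acq", "fsc"), ("rel", "fsc")}
MODES = ["wk", "rlx", "ra", "sc", "acq", "rel", "fsc"]


def below(m, n):
    """True when m is below or equal to n in the closure of COVERS."""
    return m == n or any(a == m and below(b, n) for (a, b) in COVERS)


def join(m, n):
    """Least upper bound of m and n, or None if no common upper bound exists."""
    uppers = [u for u in MODES if below(m, u) and below(n, u)]
    least = [u for u in uppers if all(below(u, v) for v in uppers)]
    return least[0] if least else None


@dataclass(frozen=True)
class Action:
    kind: str                  # "R", "W" or "F"
    mode: str
    loc: Optional[str] = None  # None for fences
    val: Optional[int] = None  # None for fences


def merge(a, b):
    """Return the set of actions that a followed by b may coalesce into."""
    if a.kind == "R" and b.kind == "R" and (a.loc, a.val) == (b.loc, b.val):
        return {Action("R", join(a.mode, b.mode), a.loc, a.val)}
    if a.kind == "W" and b.kind == "W" and a.loc == b.loc:
        return {Action("W", join(a.mode, b.mode), a.loc, b.val)}  # later value wins
    if (a.kind == "W" and b.kind == "R" and (a.loc, a.val) == (b.loc, b.val)
            and below(b.mode, "rlx")):
        return {Action("W", join(a.mode, b.mode), a.loc, a.val)}
    if a.kind == "F" and b.kind == "F":
        return {Action("F", join(a.mode, b.mode))}
    return set()


print(join("acq", "rel"))
print(merge(Action("W", "rlx", "x", 1), Action("W", "ra", "x", 2)))
print(merge(Action("W", "rlx", "x", 1), Action("R", "ra", "x", 1)))
```

The last call returns the empty set: a write is never merged with an ra read in this sketch, which mirrors the associativity caveat above.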

Default mode should be RLX, default scope SYS.

Definition 2.1. When modeling IMM, we ban access mode wk; the default access mode is rlx. We also ban scopes cta and gpu; the only allowed scope is sys. We assume there is only one thread ID ( $|\mathcal{T}| = 1$ ), which we elide. Let strongly-blocks be  $\mathcal{A} \times \mathcal{A}$ . We say a strongly-matches b if (1) a overlaps b and (2) neither has mode rlx.

Definition 2.2. When modeling PTX, the default access mode is wk. The default scope is cta. We assume two equivalences:  $(=_{gpu}) \subseteq (\mathcal{T} \times \mathcal{T})$  partitions threads by *processor*, and  $(=_{cta}) \subseteq (=_{gpu})$  refines the processor partitioning into *thread groups*. We say  $(\alpha A_{\sigma}^{\mu})$  *strongly-blocks*  $(\gamma A_{\rho}^{\nu})$  when either (1)  $\alpha = \gamma$  or (2) all of the following hold:

- (2a)  $\mu, \nu \neq \mathsf{wk}$ ,

- (2b) if  $\sigma = \text{cta or } \rho = \text{cta then } \alpha =_{\text{cta}} \gamma$ ,
- (2c) if  $\sigma = \text{gpu or } \rho = \text{gpu then } \alpha =_{\text{gpu}} \gamma$ ,
- (2d) if either action is an access then they overlap.

We say a strongly-matches b if (1) a strongly-blocks b and (2) either (2a) they have the same thread or (2b) a is a release and b is an acquire.
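The PTX conditions for strongly-blocks can be read as a predicate over two labelled actions. The sketch below is one literal reading of clauses (1) and (2a) to (2d); the `Action` record, the `LAYOUT` table mapping thread ids to (gpu, cta) pairs, and all helper names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    thread: int
    kind: str                  # "R", "W" or "F"
    mode: str                  # "wk", "rlx", "ra", "sc", "acq", "rel" or "fsc"
    scope: str                 # "cta", "gpu" or "sys"
    loc: Optional[str] = None  # None for fences


# Hypothetical machine layout: thread id -> (gpu id, cta id within that gpu).
LAYOUT = {0: (0, 0), 1: (0, 0), 2: (0, 1), 3: (1, 0)}


def same_gpu(t1, t2):
    return LAYOUT[t1][0] == LAYOUT[t2][0]


def same_cta(t1, t2):
    # (=cta) refines (=gpu): same gpu id and same cta id.
    return LAYOUT[t1] == LAYOUT[t2]


def overlaps(a, b):
    return a.loc is not None and a.loc == b.loc


def strongly_blocks(a, b):
    if a.thread == b.thread:                                              # (1)
        return True
    if a.mode == "wk" or b.mode == "wk":                                  # (2a)
        return False
    if "cta" in (a.scope, b.scope) and not same_cta(a.thread, b.thread):  # (2b)
        return False
    if "gpu" in (a.scope, b.scope) and not same_gpu(a.thread, b.thread):  # (2c)
        return False
    if (a.kind != "F" or b.kind != "F") and not overlaps(a, b):           # (2d)
        return False
    return True


w_gpu = Action(0, "W", "ra", "gpu", "x")
print(strongly_blocks(w_gpu, Action(2, "R", "ra", "gpu", "x")))  # same gpu, different cta
print(strongly_blocks(w_gpu, Action(3, "R", "ra", "gpu", "x")))  # different gpu
```

Read literally, (2d) means a fence and an access on different threads never strongly-block each other, since a fence has no location to overlap on.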

## 2.3 Pomsets with Predicate Transformers

*Definition 2.3.* A predicate transformer is a function  $\tau: \Phi \to \Phi$  such that

- (1)  $\tau(ff)$  is ff,
- (2)  $\tau(\psi_1 \wedge \psi_2)$  is  $\tau(\psi_1) \wedge \tau(\psi_2)$ ,
- (3)  $\tau(\psi_1 \vee \psi_2)$  is  $\tau(\psi_1) \vee \tau(\psi_2)$ ,
- (4) if  $\phi \models \psi$ , then  $\tau(\phi) \models \tau(\psi)$ .
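As a quick sanity check of these conditions, consider the substitution transformer  $\tau(\psi) = \psi[M/r]$ , the shape that appears in the LET clause of the semantics. Assuming that  $\models$  treats substitution in the expected way, each condition can be checked directly:

$$\begin{aligned}
\tau(\mathsf{ff}) &= \mathsf{ff}[M/r] = \mathsf{ff}, \\
\tau(\psi_1 \wedge \psi_2) &= (\psi_1 \wedge \psi_2)[M/r] = \psi_1[M/r] \wedge \psi_2[M/r] = \tau(\psi_1) \wedge \tau(\psi_2), \\
\tau(\psi_1 \vee \psi_2) &= (\psi_1 \vee \psi_2)[M/r] = \psi_1[M/r] \vee \psi_2[M/r] = \tau(\psi_1) \vee \tau(\psi_2), \\
\phi \models \psi &\implies \phi[M/r] \models \psi[M/r].
\end{aligned}$$

By contrast, negation,  $\tau(\psi) = \neg\psi$ , is not a predicate transformer:  $\tau(\mathsf{ff}) = \neg\mathsf{ff}$  is a tautology rather than ff, so (1) fails.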

Definition 2.4. A family of predicate transformers for E consists of a predicate transformer  $\tau^D$  for each  $D \subseteq \mathcal{E}$ , such that if  $C \cap E \subseteq D$  then  $\tau^C(\psi) \models \tau^D(\psi)$ .

*Definition 2.5.* A *pomset with predicate transformers* is a tuple  $(E, \lambda, \kappa, \tau, \checkmark, \trianglelefteq, \leq, \sqsubseteq, \mathsf{rmw})$  where

- (M1)  $E \subset \mathcal{E}$  is a set of events,
- (M2)  $\lambda : E \to \mathcal{A}$  defines a *label* for each event,
- (M3)  $\kappa : E \to \Phi$  defines a *precondition* for each event,
- (M4)  $\tau: 2^{\mathcal{E}} \to \Phi \to \Phi$  is a family of predicate transformers over E,
- (M5)  $\checkmark$ :  $\Phi$  defines a termination condition,
- (M6)  $\trianglelefteq : (E \times E)$  is a partial order capturing *dependency*,
- (M7)  $\leq : (E \times E)$  is a partial order capturing *synchronization*,
- (M8)  $\sqsubseteq : (E \times E)$  is a partial order capturing *per-location order*, such that
  - (M8a) if  $\lambda(d)$  overlaps  $\lambda(e)$  then  $d \leq e$  implies  $d \sqsubseteq e$ ,
- (M9)  $\mathsf{rmw} : E \rightarrow E$  is a partial function capturing *read-modify-write atomicity*, such that
  - (M9a) if  $d \xrightarrow{\mathsf{rmw}} e$  then  $\lambda(e)$  blocks  $\lambda(d)$ ,
  - (M9b) if  $d \xrightarrow{\mathsf{rmw}} e$  then  $d \leq e$  and  $d \sqsubseteq e$ ,
  - (M9c) if  $\lambda(c)$  overlaps  $\lambda(d)$  then
    - (i) if  $d \xrightarrow{\mathsf{rmw}} e$  then  $c \trianglelefteq e$  implies  $c \trianglelefteq d$ ,  $c \leq e$  implies  $c \leq d$ ,  $c \sqsubseteq e$  implies  $c \sqsubseteq d$ ,
    - (ii) if  $d \xrightarrow{\mathsf{rmw}} e$  then  $d \trianglelefteq c$  implies  $e \trianglelefteq c$ ,  $d \leq c$  implies  $e \leq c$ ,  $d \sqsubseteq c$  implies  $e \sqsubseteq c$ .

A pomset is a *candidate* if there is an injective relation rf :  $E \times E$ , capturing *reads-from*, such that

- (c1) if  $d \xrightarrow{rf} e$  then  $\lambda(d)$  matches  $\lambda(e)$ ,
- (c2) if  $d \xrightarrow{rf} e$  and  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \sqsubseteq d$  or  $e \sqsubseteq c$ , where  $d' \sqsubseteq e'$  when (c2a)  $e' \not\sqsubseteq d'$  and (c2b) if  $\lambda(d')$  strongly-blocks  $\lambda(e')$  then  $d' \sqsubseteq e'$ ,
- (c3) if  $d \xrightarrow{rf} e$  then  $d \le e$  and  $d \sqsubseteq e$ ,
- (c4) if  $d \xrightarrow{rf} e$  and  $\lambda(d)$  strongly-matches  $\lambda(e)$  then  $d \leq e$ ,
- (c5) if  $\lambda(d)$  strongly-blocks  $\lambda(e)$  and both are sc fences then either  $d \le e$  or  $e \le d$ .

<sup>3</sup>A list of safe merge operations can be found in [Chakraborty and Vafeiadis 2017, §E] and [Kang 2019, §7.1]. For examples of unsafe merges and reorderings, see [Chakraborty and Vafeiadis 2017, §D].

A pomset is top-level if

- (T1) ✓ is a tautology,
- (T2)  $\kappa(e)$  is a tautology (for every  $e \in E$ ),
- (T3) if  $\lambda(e)$  is a read then there is some  $d \stackrel{\mathsf{rf}}{\longrightarrow} e$ .

Note that for the IMM model, c2 is equivalent to:

if  $d \xrightarrow{\mathsf{rf}} e$  and  $\lambda(c)$  blocks  $\lambda(e)$  then either  $c \sqsubseteq d$  or  $e \sqsubseteq c$ .
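The IMM reading of (c2) is easy to check mechanically on a small candidate. The sketch below uses a hypothetical encoding (labels as tuples, orders as sets of pairs) for the program (x := 1; x := 2 ∥ r := x) where the read sees 1; it assumes the per-location order is reflexive and transitive.

```python
from itertools import product

# Candidate for: x := 1; x := 2  ||  r := x, where the read sees 1.
LABELS = {"w1": ("W", "x", 1), "w2": ("W", "x", 2), "r": ("R", "x", 1)}
RF = {("w1", "r")}


def closure(events, edges):
    """Reflexive-transitive closure, standing in for the per-location order."""
    order = set(edges) | {(e, e) for e in events}
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(order), repeat=2):
            if b == c and (a, d) not in order:
                order.add((a, d))
                changed = True
    return order


def blocks(a, b):
    """a blocks b if a = (W x) and b = (R x), regardless of value."""
    return a[0] == "W" and b[0] == "R" and a[1] == b[1]


def satisfies_c2(labels, rf, per_loc):
    """Every blocking write c is ordered before d or after e."""
    return all(
        (c, d) in per_loc or (e, c) in per_loc
        for (d, e) in rf
        for c in labels
        if blocks(labels[c], labels[e])
    )


# w1 before w2 comes from coherence-delays; w1 before r comes from (c3).
without_fr = closure(LABELS, {("w1", "w2"), ("w1", "r")})
with_fr = closure(LABELS, {("w1", "w2"), ("w1", "r"), ("r", "w2")})

print(satisfies_c2(LABELS, RF, without_fr))
print(satisfies_c2(LABELS, RF, with_fr))
```

The first candidate fails because the write of 2 is unordered with respect to the read; adding the read-before-write edge repairs it.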

Let P range over pomsets, and  $\mathcal{P}$  over sets of pomsets.

We lift terminology from actions to events. For example, we say that e writes x if  $\lambda(e)$  writes x. We also drop quantifiers when clear from context, such as  $(\forall e \in E)(\forall x \in \mathcal{X})$ . We write d < e when  $d \le e$  and  $d \ne e$ , and similarly for  $\triangleleft$  and  $\sqsubset$ .

Definition 2.6.  $\mathcal{P}_1$  refines  $\mathcal{P}_2$  if  $\mathcal{P}_1 \subseteq \mathcal{P}_2$ .

#### 2.4 Semantics

```
Definition 2.7. If P \in SKIP then E = \emptyset and \tau^D(\psi) \models \psi.
If P \in PAR(\mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)
   (P1) E = (E_1 \cup E_2), \trianglelefteq \supseteq (\trianglelefteq_1 \cup \trianglelefteq_2), \leq \supseteq (\leq_1 \cup \leq_2), \sqsubseteq \supseteq (\sqsubseteq_1 \cup \sqsubseteq_2), \mathsf{rmw} = (\mathsf{rmw}_1 \cup \mathsf{rmw}_2),
   (P2) \lambda = (\lambda_1 \cup \lambda_2),
  (P3a) if e \in E_1 then \kappa(e) \models \kappa_1(e),
  (P3b) if e \in E_2 then \kappa(e) \models \kappa_2(e),
   (P4) \tau^D(\psi) \models \tau_1^D(\psi),
   (P5) \checkmark \models \checkmark_1 \land \checkmark_2,
   (P6) E_1 and E_2 are disjoint.
If P \in SEQ(\mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)
   (s1) as in P1,
  (s2a) if e \in E_1 \setminus E_2 then \lambda(e) = \lambda_1(e),
  (s2b) if e \in E_2 \setminus E_1 then \lambda(e) = \lambda_2(e),
  (s2c) if e \in E_1 \cap E_2 then \lambda(e) \in \text{merge}(\lambda_1(e), \lambda_2(e)),
  (s3a) if e \in E_1 \setminus E_2 then \kappa(e) \models \kappa_1(e),
  (s3b) if e \in E_2 \setminus E_1 then \kappa(e) \models \kappa_2'(e),
  (s3c) if e \in E_1 \cap E_2 then \kappa(e) \models \kappa_1(e) \vee \kappa_2'(e), where \kappa_2'(e) = \tau_1^{\downarrow e}(\kappa_2(e)),
        where \downarrow e = \{c \mid c \triangleleft e\} if \lambda(e) is a write, and \downarrow e = E_1, otherwise,
  (s3d) if \lambda_2(e) is a release then \kappa(e) \models \checkmark_1,
   (s4) \tau^{D}(\psi) \models \tau_{1}^{D}(\tau_{2}^{D}(\psi)),
   (s5) \checkmark \models \checkmark_1 \land \tau_1^{E_1}(\checkmark_2),
  (s6a) if \lambda_1(d) synchronization-delays \lambda_2(e) then d \leq e,
  (s6b) if \lambda_1(d) coherence-delays \lambda_2(e) then d \sqsubseteq e.
If P \in IF(\phi, \mathcal{P}_1, \mathcal{P}_2) then (\exists P_1 \in \mathcal{P}_1) (\exists P_2 \in \mathcal{P}_2)
   (I1) as in P1,
   (I2) \lambda = (\lambda_1 \cup \lambda_2),
  (I3a) if e \in E_1 \setminus E_2 then \kappa(e) \models \phi \land \kappa_1(e),
  (I3b) if e \in E_2 \setminus E_1 then \kappa(e) \models \neg \phi \land \kappa_2(e),
  (I3c) if e \in E_1 \cap E_2 then \kappa(e) \models (\phi \land \kappa_1(e)) \lor (\neg \phi \land \kappa_2(e)),
   (I4) \tau^D(\psi) \models (\phi \wedge \tau_1^D(\psi)) \vee (\neg \phi \wedge \tau_2^D(\psi)),
   (I5) \checkmark \models (\phi \land \checkmark_1) \lor (\neg \phi \land \checkmark_2).
If P \in LET(r, M) then E = \emptyset and \tau^D(\psi) \models \psi[M/r].
If P \in FENCE(\mu, \sigma)_{\alpha} then
   (F1) if d, e \in E then d = e,
   (F2) \lambda(e) = \alpha F_{\sigma}^{\mu},
   (F4) \tau^D(\psi) \models \psi,
   (F5) if E = \emptyset then \checkmark \models ff.
If P \in READ(r, x, \mu, \sigma)_{\alpha} then (\exists v \in \mathcal{V})
   (R1) if d, e \in E then d = e,
   (R2) \lambda(e) = \alpha R_{\sigma}^{\mu} x v,
  (R4a) if (E \cap D) \neq \emptyset then \tau^D(\psi) \models v = s_e \Rightarrow \psi[s_e/r],
  (R4b) if E \neq \emptyset and (E \cap D) = \emptyset then \tau^D(\psi) \models (v = s_e \lor x = s_e) \Rightarrow \psi[s_e/r],
  (R4c) if E = \emptyset then (\forall s) \tau^D(\psi) \models \psi[s/r],
   (R5) if E = \emptyset and \mu \sqsupseteq \text{ra} then \checkmark \models \text{ff}.
If P \in WRITE(x, M, \mu, \sigma)_{\alpha} then (\exists v \in V)
   (w1) if d, e \in E then d = e,
   (w2) \lambda(e) = \alpha W_{\sigma}^{\mu} x v,
   (w3) \kappa(e) \models M=v,
   (w4) \tau^D(\psi) \models \psi,
  (w5a) if E = \emptyset then \checkmark \models ff,
  (w5b) if E \neq \emptyset then \checkmark \models M=v.

[\![ r := M ]\!]_{\alpha} = LET(r, M)
[\![ r := x^{\mu}_{\sigma} ]\!]_{\alpha} = READ(r, x, \mu, \sigma)_{\alpha}
[\![ x^{\mu}_{\sigma} := M ]\!]_{\alpha} = WRITE(x, M, \mu, \sigma)_{\alpha}
[\![ \mathsf{F}^{\nu}_{\sigma} ]\!]_{\alpha} = FENCE(\nu, \sigma)_{\alpha}
[\![ \mathsf{skip} ]\!]_{\alpha} = SKIP
[\![ S_1 \parallel_{\gamma} S_2 ]\!]_{\alpha} = PAR([\![ S_1 ]\!]_{\gamma}, [\![ S_2 ]\!]_{\alpha})
[\![ S_1; S_2 ]\!]_{\alpha} = SEQ([\![ S_1 ]\!]_{\alpha}, [\![ S_2 ]\!]_{\alpha})
[\![ \mathsf{if}(M)\{S_1\} \text{ else } \{S_2\} ]\!]_{\alpha} = IF(M \neq 0, [\![ S_1 ]\!]_{\alpha}, [\![ S_2 ]\!]_{\alpha})
```

In diagrams, we use different shapes and colors for arrows and events. These are included only to help the reader understand why order is included. We adopt the following conventions:

- $e \rightarrow d$  arises from control/data/address dependency (s3),
- $e \rightarrow d$  arises from synchronization-delays (s6a),
- $e \rightarrow d$  arises from coherence-delays (s6b),
- $e \rightarrow d$  arises from blocking (c2),
- $e \rightarrow d$  arises from matching (c3) (c4).

Definition 2.8. Address Calculation.

```
If P \in WRITE(L, M, \mu, \sigma)_{\alpha} then (\exists \ell \in \mathcal{V}) (\exists v \in \mathcal{V})
  (w1) if d, e \in E then d = e,
  (w2) \lambda(e) = \alpha W_{\sigma}^{\mu}[\ell]v,
  (w3) \kappa(e) \models L=\ell \land M=v,
 (w4a) if E \neq \emptyset then \tau^{D}(\psi) \models (L=\ell) \Rightarrow \psi[M/[\ell]],
 (w4b) if E = \emptyset then (\forall k) \tau^{D}(\psi) \models (L=k) \Rightarrow \psi[M/[k]],
 (w5a) if E \neq \emptyset then \checkmark \models L=\ell \land M=v,
 (w5b) if E = \emptyset then \checkmark \models \text{ff}.
```

```
If P \in READ(r, L, \mu, \sigma)_{\alpha} then (\exists \ell \in \mathcal{V}) (\exists v \in \mathcal{V})
   (R1) if d, e \in E then d = e,
   (R2) \lambda(e) = \alpha R_{\sigma}^{\mu}[\ell]v
   (R3) \kappa(e) \models L = \ell,
 (R4a) (\forall e \in E \cap D) \tau^D(\psi) \models (L=\ell \Rightarrow v=s_e) \Rightarrow \psi[s_e/r],
(R4b) (\forall e \in E \setminus D) \tau^D(\psi) \models ((L=\ell \Rightarrow v=s_e) \lor (L=\ell \Rightarrow [\ell]=s_e)) \Rightarrow \psi[s_e/r],
 (R4c) (\forall s) if E = \emptyset then \tau^D(\psi) \models \psi[s/r],
  (R5) if E = \emptyset and \mu \neq \text{rlx then } \checkmark \models \text{ff.}
     Definition 2.9. If-closure
If P \in WRITE(x, M, \mu, \sigma)_{\alpha} then (\exists v : E \to V) (\exists \theta : E \to \Phi)
 (w1) if \theta_d \wedge \theta_e is satisfiable then d = e,
 (w2) \lambda(e) = \alpha W_{\sigma}^{\mu} x v_e,
 (w3) \kappa(e) \models \theta_e \land M = v_e,
 (w4) \tau^D(\psi) \models \theta_e \Rightarrow \psi[M/x],
 (w5) \checkmark \models \theta_e \Rightarrow M = v_e,
If P \in READ(r, x, \mu, \sigma)_{\alpha} then (\exists v : E \to V) (\exists \theta : E \to \Phi)
   (R1) if \theta_d \wedge \theta_e is satisfiable then d = e,
  (R2) \lambda(e) = \alpha R_{\sigma}^{\mu} x v_e
   (R3) \kappa(e) \models \theta_e,
 (R4a) (\forall e \in E \cap D) \tau^D(\psi) \models \theta_e \Rightarrow v_e = s_e \Rightarrow \psi[s_e/r],
(R4b) (\forall e \in E \setminus D) \tau^D(\psi) \models \theta_e \Rightarrow (v_e = s_e \lor x = s_e) \Rightarrow \psi[s_e/r],
 (R4c) (\forall s) \tau^D(\psi) \models (\bigwedge_{e \in E} \neg \theta_e) \Rightarrow \psi[s/r],
  (R5) if E = \emptyset and \mu \neq \text{rlx then } \checkmark \models \text{ff.}
     Definition 2.10. Both.
If P \in WRITE(L, M, \mu, \sigma)_{\alpha} then (\exists \ell : E \to V) (\exists v : E \to V) (\exists \theta : E \to \Phi)
  (w1) if \theta_d \wedge \theta_e is satisfiable then d = e,
  (w2) \lambda(e) = \alpha W_{\sigma}^{\mu}[\ell_e] v_e,
  (w3) \kappa(e) \models \theta_e \land L = \ell_e \land M = v_e,
 (w4a) \tau^D(\psi) \models \theta_e \Rightarrow (L=\ell_e) \Rightarrow \psi[M/[\ell_e]],
 (w4b) (\forall k) \tau^D(\psi) \models (\bigwedge_{e \in E} \neg \theta_e) \Rightarrow (L=k) \Rightarrow \psi[M/[k]],
 (w5a) \checkmark \models \theta_e \Rightarrow L = \ell_e \land M = v_e,
 (w5b) \checkmark \models \bigvee_{e \in E} \theta_e.
If P \in READ(r, L, \mu, \sigma)_{\alpha} then (\exists \ell : E \to \mathcal{V}) (\exists v : E \to \mathcal{V}) (\exists \theta : E \to \Phi)
   (R1) if \theta_d \wedge \theta_e is satisfiable then d = e,
   (R2) \lambda(e) = \alpha R_{\sigma}^{\mu}[\ell_e] v_e,
   (R3) \kappa(e) \models \theta_e \land L = \ell_e,
  (R4a) (\forall e \in E \cap D) \tau^D(\psi) \models \theta_e \Rightarrow (L = \ell_e \Rightarrow v_e = s_e) \Rightarrow \psi[s_e/r],
  (R4b) (\forall e \in E \setminus D) \tau^D(\psi) \models \theta_e \Rightarrow ((L=\ell_e \Rightarrow v_e=s_e) \lor (L=\ell_e \Rightarrow [\ell_e]=s_e)) \Rightarrow \psi[s_e/r],
  (R4c) (\forall s) \tau^D(\psi) \models (\bigwedge_{e \in E} \neg \theta_e) \Rightarrow \psi[s/r],
   (R5) if E = \emptyset and \mu \neq \text{rlx} then \checkmark \models \text{ff}.
     Definition 2.11. Let READ' be defined as for READ, adding the constraint:
(R4d) if (E \cap D) = \emptyset then \tau^D(\psi) \models \psi.
If P \in FADD(r, x, M, \mu_1, \mu_2) then (\exists P_1 \in SEQ(READ'(r, x, \mu_1), WRITE(x, r+M, \mu_2)))
  (U1) if \lambda_1(e) is a write then there is a read \lambda_1(d) such that \kappa(e) \models \kappa(d) and d \xrightarrow{\mathsf{rmw}} e.
If P \in EXCHG(r, x, M, \mu_1, \mu_2) then (\exists P_1 \in SEQ(READ'(r, x, \mu_1), WRITE(x, M, \mu_2)))
  (U1) if \lambda_1(e) is a write then there is a read \lambda_1(d) such that \kappa(e) \models \kappa(d) and d \xrightarrow{\mathsf{rmw}} e.
If P \in CAS(r, x, M, N, \mu_1, \mu_2) then (\exists P_1 \in SEQ(READ'(r, x, \mu_1), IF(r=M, WRITE(x, N, \mu_2), SKIP)))
  (U1) if \lambda_1(e) is a write then there is a read \lambda_1(d) such that \kappa(e) \models \kappa(d) and d \xrightarrow{\mathsf{rmw}} e.
```

Example 2.12. Consider IRIW with all ra access:

$$x^{\text{ra}} := 1 \parallel r := x^{\text{ra}}; \ s := y^{\text{ra}} \parallel y^{\text{ra}} := 1 \parallel r := y^{\text{ra}}; \ s := x^{\text{ra}}$$

$$(\text{IRIW-ACQ-ACQ})$$

$$(\text{POWER}, \text{C11})$$

We allow this execution:

IRIW-ACQ-SC is allowed by trailing-sync compilation to Power [Lahav et al. 2017, §1].

$$x^{\text{sc}} := 1 \parallel r := x^{\text{ra}}; \ s := y^{\text{sc}} \parallel y^{\text{sc}} := 1 \parallel r := y^{\text{ra}}; \ s := x^{\text{sc}}$$

$$(\text{IRIW-ACQ-SC})$$

$$(\text{POWER}, \text{C11})$$

To model this it is convenient that synchronization is not included in dependency order:

- add sc bullet to def of ⊑ in c2,
- add SC access to synchronization-delays.

$$(\mathsf{W}^{\mathsf{sc}} x1) \longrightarrow (\mathsf{R}^{\mathsf{ra}} x1) \longrightarrow (\mathsf{R}^{\mathsf{sc}} y0) \qquad (\mathsf{W}^{\mathsf{sc}} y1) \longrightarrow (\mathsf{R}^{\mathsf{ra}} y1) \longrightarrow (\mathsf{R}^{\mathsf{sc}} x0)$$

This correctly forbids the all sc version:

$$x^{\text{sc}} := 1 \parallel r := x^{\text{sc}}; \ s := y^{\text{sc}} \parallel y^{\text{sc}} := 1 \parallel r := y^{\text{sc}}; \ s := x^{\text{sc}}$$

$$(\text{IRIW-SC-SC})$$

$$(\mathsf{W}^{\mathsf{sc}} y1) \longrightarrow (\mathsf{R}^{\mathsf{sc}} y1)$$

Example 2.13. Thin air with an SC antidependency:

## 2.5 Fulfillment

[This is old.]

*Definition 2.14.* Define  $\sqsubseteq$  as follows.

$$d \sqsubseteq e \text{ when } \begin{cases} d \sqsubseteq e & \text{if } d \text{ is morally strong with } e \\ e \not\sqsubseteq d & \text{otherwise} \end{cases}$$

A read event *e* is *strongly fulfilled* if there is a  $d \xrightarrow{rf} e$  and for any *c* that can block *e*, either  $c \sqsubseteq d$  or  $e \sqsubseteq c$ .

A read event *e* is *weakly fulfilled* if there is a  $d \xrightarrow{rf} e$  and for any *c* that can block *e*, either  $c \sqsubseteq d$  or  $e \sqsubseteq c$ .

Proc. ACM Program. Lang., Vol. 1, No. 1, Article . Publication date: April 2021.

If all accesses are morally strong with each other, weak fulfillment degenerates to

$$\forall \lambda(c) = (\mathsf{W} x) \text{ either } c \sqsubseteq d \text{ or } e \sqsubseteq c$$

If no accesses are morally strong with each other, weak fulfillment degenerates to

$$\not\exists \lambda(c) = (\mathsf{W} x) \text{ both } d \sqsubset c \text{ and } c \sqsubset e$$

Note that the difference between strong and weak fulfillment is limited to  $\sqsubseteq$ . We sometimes write  $\sqsubseteq$  for strong fulfillment and  $\sqsubseteq$  for weak fulfillment.

#### 3 NOTES

GPU stuff:

- Vulkan/Alloy
- OpenCL
- AMD PTX
- Matthew Sinclair/Sarita Adve: "Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems", and his thesis

## 4 RELATING IMM AND PTX

It looks like we cannot prove compilation correctness from IMM to PTX. (In this email I assume that all threads are in the same CTA, so any relation is a morally strong one if it is applicable.) The problem is in the LB-data-rel example:

$$r := x; y := r \parallel s := y; x^{ra} := 1$$

$$\boxed{Rx1} \xrightarrow{\text{data}} \boxed{Wy1} \xrightarrow{\text{rfe}} \boxed{Ry1} \xrightarrow{\text{bob}} \boxed{W^{ra}x1}$$


IMM forbids it, but PTX allows it. The point is that IMM mixes dependencies and release/acquire-induced po-order in its NoOOTA axiom, whereas PTX doesn't — release/acquire are only used to have coherence.
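The difference can be illustrated with a plain cycle check over the edges of the LB-data-rel execution. This is a deliberately crude sketch, not either model's actual axiom: the edge table is a hypothetical encoding of the diagram above plus the rfe edge from the release write back to the read of x, and the PTX side is approximated by simply leaving the bob edge out of the acyclicity check, following the reading given here.

```python
# Edges of the LB-data-rel execution, tagged with the relation they come from.
EDGES = {
    ("Rx1", "Wy1"): "data",
    ("Wy1", "Ry1"): "rfe",
    ("Ry1", "Wrax1"): "bob",
    ("Wrax1", "Rx1"): "rfe",
}


def has_cycle(edges):
    """Depth-first search for a directed cycle in a set of (src, dst) pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    white, grey, black = 0, 1, 2
    colour = {node: white for node in graph}

    def visit(node):
        colour[node] = grey
        for succ in graph[node]:
            if colour[succ] == grey or (colour[succ] == white and visit(succ)):
                return True
        colour[node] = black
        return False

    return any(colour[node] == white and visit(node) for node in graph)


# IMM-style check: dependencies, rfe and bob all take part in one relation.
imm_edges = set(EDGES)
# Assumed PTX-style check: the release-induced bob edge is left out.
ptx_edges = {edge for edge, kind in EDGES.items() if kind != "bob"}

print(has_cycle(imm_edges))
print(has_cycle(ptx_edges))
```

With the bob edge included the four events form a cycle, so the execution is rejected; without it the remaining edges are acyclic and the execution survives.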

The problem is related to the one we have already discussed in the context of the C++ model – if you don't have acquire reads in the program, then you can erase release annotations from writes. In this regard, PTX is closer to PL memory models than to hardware ones.

AFAIU for the same reason we won't be able to show compilation correctness from the Pomset model to PTX even directly, if the Pomset model mixes release/acquire induced order with dependencies in the same causality relation.

The previous example in the section shows that IMM's acquires are stronger than PTX for this pattern. The next example shows that acquiring reads in PTX are stronger than in IMM for a different pattern. Thus the acquires in PTX and IMM are incomparable.

The following behavior is allowed by IMM and C11, but forbidden by PTX. PTX forbids it since acquire reads work as fences for po-previous reads from the same location (symmetrically to release writes for po-later writes to the same location in IMM, C11, and PTX).

## 5 THIN AIR

Need  $\leq$  to prevent thin air on rlx:

$$y := x \parallel x := y$$

$$(\mathsf{R}x1) \longrightarrow (\mathsf{W}y1) \qquad (\mathsf{R}y1) \longrightarrow (\mathsf{W}x1)$$

## 6 IMM EXAMPLES

Interpreting this definition for the IMM:

- No wk, default is rlx
- All threads in same cta (only one scope)
- Actions are morally strong when both are ra/sc, mimicking happens-before
- Strong fulfillment may do the right thing

Disallowed by IMM:

$$x := 2; y^{ra} := 1 \parallel r := y^{ra}; x := 1$$

$$(PUB-REL-ACQ-COE)$$

$$(Wx2) \xrightarrow{bob} (W^{ra}y1) \xrightarrow{rfe} (R^{ra}y1) \xrightarrow{bob} (Wx1)$$

$$(\times \text{IMM})$$

Allowed by IMM, but not by Power/ARMv7/ARMv8/TSO:

$$x := 2; y^{ra} := 1 \parallel r := y; x := 1 \tag{PUB-REL-RLX-COE}$$

$$(Wx2) \xrightarrow{bob} (W^{ra}y1) \xrightarrow{rfe} (Ry1) \xrightarrow{data} (Wx1)$$

Example from talk:

## 7 TWO ORDER IDEA

The two order idea from the OOPSLA talk is:

- Require:  $d \sqsubseteq e$  when  $d \le e$  and they conflict

This does not work for the IMM or ARMv7, but it may work for Power, TSO, ARMv8. That would be nice. Let's write ⊑ for this notion, with strong fulfillment.

With this there is a cycle in ARM7-WEAK (weak/strong fulfillment not relevant here):

Anton says: ARM7-WEAK is forbidden by Power, TSO, ARMv8, but allowed by ARMv7. Maybe it isn't that important to support it anymore.

There is also a cycle in PUB-REL-RLX-COE. Anton says: I checked Power/ARMv7 models in this regard. They disallow the behavior (as well as ARMv8 and TSO), so we can in principle strengthen IMM to forbid it as well. For that, we may add an axiom to IMM forbidding cycles in  $co \cup ([W]; rfe^?; ([R^{acq}] \cup po; [FW^{rel}]); ar^*; [W])$ . This works if we have acquire/release accesses on the path, since they are compiled with fences to Power.

## 8 PTX EXAMPLES

Based on [Lustig et al. 2019; NVIDIA 2020].

PTX requires weak fulfillment.

Default scope is cta. In the examples, all threads are in different ctas.

Default mode is wk.

(Rx0) must be forbidden. Before fulfilling the read:

$$x := 0; x := 1; y_{\mathsf{sys}}^{\mathsf{ra}} := 1 \parallel r := y_{\mathsf{sys}}^{\mathsf{ra}}; s := x \tag{PUB1$_{\mathsf{SYS}}$}$$

$$\boxed{\mathsf{W}x0} \qquad \boxed{\mathsf{W}x1} \qquad \boxed{\mathsf{W}_{\mathsf{sys}}^{\mathsf{ra}}y1} \qquad \boxed{\mathsf{R}_{\mathsf{sys}}^{\mathsf{ra}}y1} \qquad \boxed{\mathsf{R}x}$$

$$\boxed{( \leq )}$$

 $(\mathsf{W} x 1) \sqsubseteq (\mathsf{R} x)$  is required by M7, enforcing publication.

(Rx0) must be allowed:

$$x := 0; x := 1; y^{\mathsf{ra}} := 1 \parallel r := y^{\mathsf{ra}}; s := x \tag{PUB1$_{\mathsf{CTA}}$}$$

$$(\mathsf{W}x0) \qquad (\mathsf{W}x1) \qquad (\mathsf{W}^{\mathsf{ra}}y1) \qquad (\mathsf{R}^{\mathsf{ra}}y1) \qquad (\mathsf{R}x0)$$

We do not have  $(W^{ra}y1) \le (R^{ra}y1)$  since F3 only requires order for things that are morally strong.

Another example that may be of interest (nothing morally strong). Can the read in the third thread be (Rx0)?

$$x := 0; x := 1 \parallel y := x \parallel if(y)\{r := x\}$$

PTX allows TC16 for events that are not morally strong ( $TC16_{WK}$ ), but disallows it when events are morally strong ( $TC16_{SYS}$ ). Note that  $\leq$  imposes no requirements here. Fulfillment imposes no order. This example shows that F3C cannot be strengthened to require that  $d \sqsubseteq e$ .

$$r := x ; x := 1 \parallel s := x ; x := 2 \tag{TC16$_{\mathsf{WK}}$}$$

$$(\mathsf{R}x2) \longrightarrow (\mathsf{W}x1) \qquad (\mathsf{R}x1) \longrightarrow (\mathsf{W}x2)$$
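The claim that F3C cannot be strengthened to  $d \sqsubseteq e$  can be checked mechanically on the weak TC16 execution. The following is a sketch under stated assumptions: nothing is morally strong, the only per-location order is the read-before-write order within each thread, F3 is read as "the read is not ordered before the write", and F4 as "no write to x lies strictly between the write and the read". The helper names are hypothetical.

```python
from itertools import product


def transitive_closure(edges):
    """Smallest transitive relation containing the given set of pairs."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def is_strict_partial_order(edges):
    """True if the transitive closure of the edges is irreflexive (no cycle)."""
    return all(a != b for a, b in transitive_closure(edges))


# Per-location order within each thread: read x, then write x.
poloc = {("Rx2", "Wx1"), ("Rx1", "Wx2")}
# Reads-from pairs (write, read) of the TC16 outcome.
rf = {("Wx2", "Rx2"), ("Wx1", "Rx1")}
writes = {"Wx1", "Wx2"}


def weak_ok(order, rf, writes):
    below = transitive_closure(order)
    for d, e in rf:
        if (e, d) in below:                  # F3 (assumed reading): e not before d
            return False
        for c in writes:                     # F4 (assumed reading): no write between
            if (d, c) in below and (c, e) in below:
                return False
    return True


def strong_ok(order, rf):
    # Strengthened F3C: every rf pair is added to the per-location order.
    return is_strict_partial_order(order | rf)


print(weak_ok(poloc, rf, writes), strong_ok(poloc, rf))
```

Under these assumptions the weak conditions hold, while adding the two rf pairs to the per-location order closes a cycle through all four events, so the strengthened condition would forbid an execution that PTX allows.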

$$r := x_{\text{sys}}^{\text{rlx}}; x_{\text{sys}}^{\text{rlx}} := 1 \parallel s := x_{\text{sys}}^{\text{rlx}}; x_{\text{sys}}^{\text{rlx}} := 2 \tag{TC16$_{\mathsf{SYS}}$}$$

$$(\mathsf{R}_{\mathsf{sys}}^{\mathsf{rlx}}x2) \longrightarrow (\mathsf{W}_{\mathsf{sys}}^{\mathsf{rlx}}x1) \qquad (\mathsf{R}_{\mathsf{sys}}^{\mathsf{rlx}}x1) \longrightarrow (\mathsf{W}_{\mathsf{sys}}^{\mathsf{rlx}}x2)$$

About release-acquire semantics. Anton confirms that the following example is allowed in C11, but disallowed in the IMM. It is apparently allowed in C11 with the intention of letting releasing writes be downgraded to relaxed when they only fulfill relaxed reads.

$$r := x_{\mathsf{sys}}^{\mathsf{rlx}}; \ y_{\mathsf{sys}}^{\mathsf{ra}} := 1 \parallel s := y_{\mathsf{sys}}^{\mathsf{rlx}}; \ x_{\mathsf{sys}}^{\mathsf{ra}} := 1 \tag{LB-REL}$$

$$(\mathsf{R}_{\mathsf{sys}}^{\mathsf{rlx}} x 1) \longrightarrow (\mathsf{R}_{\mathsf{sys}}^{\mathsf{rlx}} y 1) \longrightarrow (\mathsf{R}_{\mathsf{sys}}^{\mathsf{rlx}} x 1)$$

Another example from Anton. This is allowed in PTX because it does not include synchronization in the no-tar axiom, only in coherence and causality.

$$r := x_{\mathsf{sys}}^{\mathsf{rlx}}; \ y_{\mathsf{sys}}^{\mathsf{rlx}} := r \parallel s := y_{\mathsf{sys}}^{\mathsf{rlx}}; \ x_{\mathsf{sys}}^{\mathsf{ra}} := 1 \tag{\texttt{LB-DATA-REL}}$$

## 9 RFI EXAMPLES

Bad example:

$$r := \mathsf{EXCHG}(x, 2) \; ; \; s := x \; ; \; y := s - 1 \parallel r := y \; ; \; x := r$$

$$(\checkmark Arm8)$$

$$r := x ; x := 2 ; s := x ; y := s-1 \parallel r := y ; x := r$$

$$( \boxtimes )$$

$$(Rx1) \qquad (Wx2) \qquad (Rx2) \qquad (Wy1) \qquad (Ry1) \qquad (Wx1)$$

Anton example 1 (Allowed by ARM) [rfi-coe-coe]

$$x := 2; r := x^{ra}; y := 1 \parallel y := 2; x^{ra} := 1$$

$$(RFI-COE-COE)$$

$$(Wx2) \xrightarrow{\text{rfi}} (R^{ra}x2) \xrightarrow{\text{bob}} (Wy1) \xrightarrow{\text{coe}} (Wy2) \xrightarrow{\text{bob}} (W^{ra}x1)$$

Internal reads survive acquires [rfi-acq-coe-coe] (where SC read = LDAR)

$$x := 2; s := z^{sc}; r := x^{sc}; y := 1 \parallel y := 2; x^{ra} := 1 \tag{RFI-ACQ-COE-COE}$$

And release-acquire pairs [rfi-ra-coe-coe] (where acquiring read = LDAPR)

$$x := 2; w^{ra} := 1; s := z^{ra}; r := x^{ra}; y := 1 \parallel y := 2; x^{ra} := 1 \parallel r := w; z := 1$$

$$Wx2 \xrightarrow{\text{rfi}} W^{ra}w1 \xrightarrow{\text{Rfi}} R^{ra}x2 \xrightarrow{\text{bob}} Wy1 \xrightarrow{\text{coe}} Wy2 \xrightarrow{\text{bob}} W^{ra}x1$$

$$(\checkmark \text{Arm8})$$

But not if either acquire is strengthened to SC (where SC read = LDAR). The execution is also disallowed if an external thread places order between the ra accesses:

$$x := 2; w^{ra} := 1; s := z^{ra}; r := x^{ra}; y := 1 \parallel y := 2; x^{ra} := 1 \parallel r := w; z := r \tag{RFI-RA-DATA-COE-COE}$$

$$Wx2 \xrightarrow{bob} W^{ra}w1 \qquad R^{ra}z1 \qquad Wy1 \xrightarrow{coe} Wy2 \xrightarrow{bob} W^{ra}x1$$

To allow this, weaken ra to rlx when the read is fulfilled by a relaxed write of the same thread (we do not need to allow this when the write is part of an RMW).

$$x := 2; r := x^{ra}; y := 1 \parallel y := 2; x^{ra} := 1$$

$$(Wx2) \rightarrow (Rx2) \qquad (Wy1) \rightarrow (Wy2) \rightarrow (W^{ra}x1)$$

RF variant [rfi-rfe-coe]:

$$x := 2; r := x^{ra}; y := 1 \parallel s := y; x^{ra} := 1 \tag{RFI-RFE-COE}$$

$$(Wx2) \xrightarrow{\text{rfi}} (R^{ra}x2) \xrightarrow{\text{bob}} (Wy1) \xrightarrow{\text{rfe}} (Ry1) \xrightarrow{\text{bob}} (W^{ra}x1)$$

TSO variant [rfi-fre-coe]:

$$x := 2; r := x^{ra}; s := y \parallel y := 2; x^{ra} := 1$$

$$(RFI-FRE-COE)$$

$$(Wx2) \xrightarrow{\text{rfi}} (R^{ra}x2) \xrightarrow{\text{bob}} (Ry0) \xrightarrow{\text{fre}} (Wy2) \xrightarrow{\text{bob}} (W^{ra}x1)$$

$$(\checkmark Arm8)$$

$$(\checkmark TSO)$$

Note that TSO does not order W to R in local order, even in poloc. Nonetheless, TSO disallows the following because of local visibility in the first thread.

$$x := 2; r := x \parallel x := 1; s := x$$

$$(Wx2) \xrightarrow{\text{rfe}} (Rx2) \qquad (Wx1) \xrightarrow{\text{rfe}} (Rx1)$$
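A brute-force exploration can be used to sanity-check the TSO claim above. This is a minimal sketch, assuming the textbook operational view of TSO (one FIFO store buffer per thread, reads forwarded from the thread's own buffer, a single location x initialised to 0) and assuming the outcome in question is r = 1, s = 2, where each thread reads the other's write. The function name and state encoding are hypothetical.

```python
def tso_outcomes():
    """Enumerate final (r, s) values of  x:=2; r:=x || x:=1; s:=x  under store buffering."""
    # Per-thread programs over the single location x: ("W", value) or ("R", None).
    progs = ((("W", 2), ("R", None)), (("W", 1), ("R", None)))
    outcomes = set()
    seen = set()

    def explore(mem, bufs, pcs, regs):
        state = (mem, bufs, pcs, regs)
        if state in seen:
            return
        seen.add(state)
        if all(pcs[t] == len(progs[t]) for t in range(2)):
            outcomes.add(regs)               # both threads finished; registers are final
            return
        for t in range(2):
            if bufs[t]:                      # flush the oldest buffered store to memory
                flushed = tuple(bufs[i][1:] if i == t else bufs[i] for i in range(2))
                explore(bufs[t][0], flushed, pcs, regs)
            if pcs[t] < len(progs[t]):       # execute the next instruction of thread t
                kind, arg = progs[t][pcs[t]]
                new_pcs = tuple(pcs[i] + 1 if i == t else pcs[i] for i in range(2))
                if kind == "W":
                    grown = tuple(bufs[i] + (arg,) if i == t else bufs[i] for i in range(2))
                    explore(mem, grown, new_pcs, regs)
                else:
                    # Local visibility: read the newest own buffered store, else memory.
                    value = bufs[t][-1] if bufs[t] else mem
                    new_regs = tuple(value if i == t else regs[i] for i in range(2))
                    explore(mem, bufs, new_pcs, new_regs)

    explore(0, ((), ()), (0, 0), (None, None))
    return outcomes


results = tso_outcomes()
print(sorted(results), (1, 2) in results)
```

In this model a thread can only read the other thread's value after its own store has been flushed and then overwritten in memory, which cannot happen in both directions at once.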

[Higham and Kawash 2000] describe TSO as a linearization of a partial order including:

- poloc
- lws = po; [W]
- $d \xrightarrow{po} e$  when  $c \xrightarrow{rfe} d \xrightarrow{po} e$

[Alglave et al. 2020] describe TSO as linearization of partial order satisfying internal visibility and including

- [W]; po; [W]
- $d \xrightarrow{po} e$  when  $c \xrightarrow{rfe} d \xrightarrow{po} e$ , from (range(rfe) \* \_)
- [R]; po; [W], from (rfi^-1; lob)

Ignoring fences and RMWs:

Double FRE variant [rfi-fre-fre]:

$$x := 2; r := x^{ra}; s := y \parallel y := 2; F; r := x$$

$$(RFI-FRE-FRE)$$

$$(Wx2) \xrightarrow{rfi} (R^{ra}x2) \xrightarrow{bob} (Ry0) \xrightarrow{fre} (Wy2) \xrightarrow{bob} (Rx0)$$

$$(\checkmark Arm8)$$

It does not seem possible to do this only with rfe. ARM disallows this [data-rfi-rfe-rfe]:

$$x := z; r := x^{ra}; y := 1 \parallel z := y$$

$$(DATA-RFI-RFE-RFE)$$

$$(XArm8)$$

It also disallows [ctrl-rfi-rfe-rfe]:

$$\mathsf{if}(z)\{\}; x := 1; r := x^{ra}; y := 1 \parallel z := y \tag{CTRL-RFI-RFE-RFE}$$

ARM allows some counterintuitive results for SC access [ctrl-rfi-fre-rfe]:

$$\mathsf{if}(x)\{\}; x := 2; r := x^{sc}; s := y^{sc} \parallel y^{sc} := 2; x^{sc} := 1 \tag{CTRL-RFI-FRE-RFE}$$

Not possible with coe [ctrl-rfi-coe-rfe]:

$$\mathsf{if}(x)\{\}; x := 2; r := x^{sc}; y^{sc} := 1 \parallel y^{sc} := 2; x^{sc} := 1 \tag{CTRL-RFI-COE-RFE}$$

$$(XArm8)$$

This is not allowed with a data dependency instead of a control dependency [data-rfi-fre-rfe]:

$$x := x+1; \ r := x^{\text{sc}}; \ s := y^{\text{sc}} \parallel y^{\text{sc}} := 1; \ x^{\text{sc}} := 1$$

$$(\text{DATA-RFI-FRE-RFE})$$

$$(\mathsf{R}x1) \xrightarrow{\text{data}} (\mathsf{W}x2) \xrightarrow{\text{rfi}} (\mathsf{R}^{\mathsf{sc}}x2) \xrightarrow{\text{bob}} (\mathsf{R}^{\mathsf{sc}}y0) \xrightarrow{\text{fre}} (\mathsf{W}^{\mathsf{sc}}y1) \xrightarrow{\text{bob}} (\mathsf{W}^{\mathsf{sc}}x1)$$

$$(\text{XArm8})$$

## 10 SC EXAMPLES

IRIW-ACQ-SC is allowed by trailing-sync compilation to Power [Lahav et al. 2017, §1].

$$x^{\text{sc}} := 1 \parallel r := x^{\text{ra}}; \ s := y^{\text{sc}} \parallel y^{\text{sc}} := 1 \parallel r := y^{\text{ra}}; \ s := x^{\text{sc}}$$

$$(\text{IRIW-ACQ-SC})$$

$$(\text{POWER,RC11})$$

This example is hard to get right for Power because it must be allowed with ra reads, but disallowed with sc reads. This seems unsolvable: to allow the version with ra, we would need to weaken the order between the reads in each thread for the ra case, and that would break publication.

Leading sync is also unsound in C11 with RMW [Lahav et al. 2017, §2.1].

$$x^{\text{sc}} := 1; \ y^{\text{ra}} := 1 \parallel \text{FADD}^{\text{sc,sc}}(y, 1); \ s := y \parallel y^{\text{sc}} := 3; \ s := x^{\text{sc}}$$

$$(\text{Z6.U})$$

$$(\text{W}^{\text{sc}}x1) \longrightarrow (\text{R}^{\text{sc}}y1) \xrightarrow{\text{rmw}} (\text{W}^{\text{sc}}y2) \qquad (\text{R}y3) \longrightarrow (\text{R}^{\text{sc}}x0)$$

Leading sync is also unsound in C11 with SC fences [Lahav et al. 2017, §A.1].

$$x := 2; \mathsf{F}^{\mathsf{sc}}; r := y \parallel y^{\mathsf{sc}} := 1 \parallel r := y^{\mathsf{ra}}; x^{\mathsf{ra}} := 1; s := x \parallel r := x^{\mathsf{sc}}$$

$$(\mathsf{RSYNC} + \mathsf{RSC})$$

$$(\mathsf{W}x2) \longrightarrow (\mathsf{F}^{\mathsf{sc}}) \longrightarrow (\mathsf{R}y0) \longrightarrow (\mathsf{W}^{\mathsf{ra}}y1) \longrightarrow (\mathsf{R}x2) \longrightarrow (\mathsf{R}x2)$$

Fulfillment of (Rx2) requires that either  $(W^{ra}x1) \rightarrow (Wx2)$  or  $(Rx2) \rightarrow (W^{ra}x1)$ . It's interesting that in the pomset,  $(R^{sc}x1)$  is not needed to get a cycle.

There is a long discussion of this in [Bender and Palsberg 2019, §5.2, Fig. 17], where they also discuss this example:

$$x^{\text{sc}} := 1; \ x := 2 \parallel y^{\text{sc}} := 1; \ y := 2 \parallel r := x^{\text{ra}}; \ s := y^{\text{sc}} \parallel r := y^{\text{ra}}; \ s := x^{\text{sc}}$$
 (IRIW-SC-RLX-ACQ)

$$(\checkmark \text{RC11})$$

[Lahav et al. 2017, §A.2] claims that Arm8 allows this [RWC+acq+sc], but herd7 rejects it. Reason: they are citing the flowing/pop model [Flur et al. 2016] rather than [Pulte et al. 2018].

$$x^{\text{sc}} := 1 \parallel r := x; \text{ } \text{F}^{\text{acq}}; \text{ } s := y^{\text{sc}} \parallel y^{\text{sc}} := 1; \text{ } r := x^{\text{sc}}$$

$$\text{(RWC+ACQ+SC)}$$

$$\text{(XArm8)}$$

## 11 ADDITIONAL RMW EXAMPLES

It is not possible for two RMWs to see the same write.

$$x := 0; (FADD^{rlx,rlx}(x,1) \parallel FADD^{rlx,rlx}(x,1))$$

$$(Rx0) \xrightarrow{rmw} (Wx1) \qquad (Rx0) \xrightarrow{rmw} (Wx1)$$

The gray arrow is required by the RMW atomicity axioms.
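The atomicity requirement can be illustrated by enumerating interleavings. This is only a sketch under a strong simplifying assumption: it uses a sequentially consistent interleaving model rather than the pomset semantics, and it treats each FADD as a read step followed by a write step. It compares schedules where the two steps of an RMW are adjacent (atomic) with schedules where they may be split. All names are hypothetical.

```python
from itertools import permutations

# Each thread performs FADD(x, 1), modelled as a read step and a write step.
STEPS = [(0, "R"), (0, "W"), (1, "R"), (1, "W")]


def interleavings():
    """All schedules in which each thread's read precedes its own write."""
    for perm in permutations(STEPS):
        if all(perm.index((t, "R")) < perm.index((t, "W")) for t in (0, 1)):
            yield perm


def is_atomic(perm):
    """True if each thread's write immediately follows its read."""
    return all(perm.index((t, "W")) == perm.index((t, "R")) + 1 for t in (0, 1))


def reads_seen(perm):
    """Run the schedule from x = 0 and return the values read by threads 0 and 1."""
    x = 0
    read = {}
    for t, kind in perm:
        if kind == "R":
            read[t] = x
        else:
            x = read[t] + 1
    return (read[0], read[1])


atomic = {reads_seen(p) for p in interleavings() if is_atomic(p)}
split = {reads_seen(p) for p in interleavings()}
print(sorted(atomic), sorted(split))
```

Only the split schedules let both threads read 0, which corresponds to two RMWs seeing the same write.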

Lee et al. [2020] introduce PS2.0 to refine the treatment of RMWs in the promising semantics (PS). Their examples have the expected results here, with far less work. First they recall that PS requires quantification over multiple futures in order to disallow executions such as CDRF:

$$r := \mathsf{FADD}^{\mathsf{ra},\mathsf{ra}}(x,1) \; ; \; \mathsf{if}(r=0) \{ y := 1 \} \; \| \; r := \mathsf{FADD}^{\mathsf{ra},\mathsf{ra}}(x,1) \; ; \; \mathsf{if}(r=0) \{ \mathsf{if}(y) \{ x := 0 \} \} \tag{CDRF}$$

$$(\mathsf{W}^{\mathsf{ra}}x1) \qquad (\mathsf{W}^{\mathsf{ra}}x1)$$

This execution is clearly impossible, due to the cycle above. In this diagram, we have not drawn order adjacent to the writes of the RMWs, since this is not necessary to produce the cycle. If CDRF is allowed then DRF-RA fails.

PS does not support global value range analysis, as modeled by GA+E below. Our semantics permits GA+E:

$$x := 0; (r := \mathsf{CAS}^{\mathsf{rlx},\mathsf{rlx}}(x, 0, 1); \mathsf{if}(r < 10) \{y := 1\} \parallel x := 42; x := y) \tag{GA+E}$$

PS also does not support register promotion, as modeled by RP below. Our semantics permits RP:

$$r := x ; s := \mathsf{FADD}^{\mathsf{rlx},\mathsf{rlx}}(z,r) ; y := s+1 \parallel x := y$$

$$(\mathsf{R}x1) \qquad (\mathsf{R}y1) \qquad (\mathsf{R}y1) \qquad (\mathsf{R}y1)$$

The following examples are from "Modular Data-Race-Freedom Guarantees in the Promising Semantics", to appear in PLDI21.

CDRF shows that our semantics is not too permissive for ra-RMWs. But what about rlx-RMWs? The following execution is allowed by Arm8 and PS2.0, but disallowed by PS2.1.

Is this  $\{z\}$ -DRF-RA?

$$if(y)\{x := z\} else\{x := 1\} \parallel r := x; z := 1; y := r \tag{NAIVE-LDRF-RA-FAIL}$$

$$Ry1 \longrightarrow Rx1 \longrightarrow Wy1 \longrightarrow Wy1$$

Interpreting  $\{z\}$  as ra:

$$Ry1 \qquad R^{ra}z1 \qquad Wx1 \qquad Rx1 \qquad W^{ra}z1 \qquad Wy1$$

Our semantics already disallows LDRF-FAIL-PS, which is similar to OOTA4.

$$\mathsf{if}(x)\{\mathsf{FADD}(w, 1); y := 1; z := 1\} \parallel \mathsf{if}(!z)\{x := 1\} \ \mathsf{else} \ \{\mathsf{if}(!\mathsf{FADD}(w, 1))\{x := y\}\} \tag{LDRF-FAIL-PS}$$

$$y := x \parallel r := y; \text{ if } (b)\{x := r; z := r\} \text{ else } \{x := 1\} \parallel b := 1$$

$$(OOTA4)$$

If RMWs simply use the same semantics as read and write, then we allow LDRF-PF-FAIL, which is used to show failure of LDRF-SC.

$$y := 0; \mathsf{if}(y)\{\mathsf{if}(!\mathsf{CAS}(x, 0, 1))\{\mathsf{if}(z)\{x := 2\}\}\} \parallel y := 1; \mathsf{if}(1 \neq \mathsf{CAS}(x, 0, 3))\{z := 1\} \tag{LDRF-PF-FAIL}$$

$$(\mathsf{R}y1) \qquad (\mathsf{R}x0) \qquad (\mathsf{R}x1) \qquad (\mathsf{R}x1) \qquad (\mathsf{R}x2) \qquad (\mathsf{R}x2) \qquad (\mathsf{R}x2)$$

To disallow this, we need to retain the dependency  $(Rx2) \rightarrow (Wz1)$ . For this, we need to avoid the substitution for x. This is clearer in the LICS semantics. You just use L6 rather than L5 for the independent case on RMWs.

## 12 EXAMPLE FROM JAM PAPER

From [Bender and Palsberg 2019, §3.3]. With partial coherence/weak fulfillment you need to be careful that RMWs are totally ordered (if that's a property you want). May not come for free.

From [Bender and Palsberg 2019, §B]: "Here we demonstrate that it is possible to construct a program that is only forbidden due to the total coherence order"



## 13 OLD MODEL

$$\begin{array}{lllll} \mu := \mathsf{wk} & (\mathsf{Weak}) & \sigma := \mathsf{cta} & (\mathsf{Thread\ group}) \\ & | \mathsf{rlx} & (\mathsf{Relaxed}) & | \mathsf{gpu} & (\mathsf{Processor}) \\ & | \mathsf{ra} & (\mathsf{Release/Acquire}) & | \mathsf{sys} & (\mathsf{System}) \\ & | \mathsf{sc} & (\mathsf{Sequentially\ Consistent}) \end{array}$$

Orders/Relations in model

- ⊴ is the old ≤ (without coherence stuff from F4 and P5B). This provides the NO-TAR axiom.
- ≤ is the *happens-before* suborder, which only includes rf between accesses that are morally strong. This serves as a cross-location transitive kernel for the per-location order.
- ⊑ is a per-location order that relates morally strong and poloc accesses.
   This includes ≤ for morally strong accesses.
   This provides the SC-PER-LOC axiom.

Write  $d \triangle e$  if they conflict (i.e., read/write or write/write, same location).

Write  $d \triangleq e$  if they conflict and are morally strong.

*Definition 13.1.* A pomset with preconditions is a tuple  $(E, \lambda, \trianglelefteq, \leq, \sqsubseteq)$  where

- (M1) E is a set of events
- (M2)  $\lambda: E \to (\Phi \times \mathcal{A})$  is a *labeling* from which we derive functions
  - $\kappa : E \to \Phi$  (formulae)
  - $\lambda: E \to \mathcal{A}$  (actions)
- (M3)  $\trianglelefteq \subseteq (E \times E)$ ,  $\leq \subseteq (E \times E)$ , and  $\sqsubseteq \subseteq (E \times E)$  are partial orders
- (M4)  $\bigwedge_e \kappa(e)$  is satisfiable (consistency)
- (M5) if  $d \le e$  then  $\kappa(e)$  implies  $\kappa(d)$  (causal strengthening)
- (M6) if  $d \le e$  then  $d \trianglelefteq e$
- (M7) if  $d \le e$  and d conflicts with e then  $d \sqsubseteq e$

Definition 13.2 (Strong fulfillment). We say  $\lambda(d) = (Wxv)$  fulfills  $\lambda(e) = (Rxv)$  if

- (F3A) *d* ⊲ *e*
- (F3B) d < e if d is morally strong with e
- (F3C)  $d \sqsubseteq e$  (if d is not morally strong with e)
- (F4)  $\forall \lambda(c) = (\mathbf{W}x..)$  either  $c \sqsubseteq d$  or  $e \sqsubseteq c$

Definition 13.3 (Weak fulfillment). We say  $\lambda(d) = (\mathsf{W} x v)$  fulfills  $\lambda(e) = (\mathsf{R} x v)$  if

- (F3A) *d* ⊲ *e*
- (F3B) d < e if d is morally strong with e
- (F3C)  $e \not\sqsubseteq d$  (if d is not morally strong with e)
- (F4)  $\forall \lambda(c) = (\mathsf{W}x..)$  either  $c \sqsubseteq d$  or  $e \sqsubseteq c$ , where

$$d \subseteq e \text{ when } \begin{cases} d \subseteq e & \text{if } d \text{ is morally strong with } e \\ e \not\sqsubset d & \text{otherwise} \end{cases}$$

If all accesses are morally strong with each other, weak fulfillment degenerates to

- (F3) d < e
- (F4)  $\forall \lambda(c) = (\mathbf{W}x..)$  either  $c \sqsubseteq d$  or  $e \sqsubseteq c$

If no accesses are morally strong with each other, weak fulfillment degenerates to

- (F3)  $e \not\sqsubseteq d$
- (F4)  $\not\exists \lambda(c) = (\mathsf{W}x..)$  both  $d \sqsubseteq c$  and  $c \sqsubseteq e$

Note that the difference between strong and weak fulfillment is limited to  $\sqsubseteq$ . We sometimes write  $\sqsubseteq$  for strong fulfillment and  $\sqsubseteq$  for weak fulfillment.

Prefixing is as in OOPSLA, using ≤ for order everywhere except P5B, which has ⊑.

Definition 13.4. Let  $P' \in (\phi \mid a) \Rightarrow \mathcal{P}$  when  $(\exists P \in \mathcal{P}) \ (\forall e \in E)$ 

- (P1)  $E' = E \cup \{d\}$
- (P2)  $\trianglelefteq' \supseteq \trianglelefteq$ ,  $\leq' \supseteq \leq$ , and  $\sqsubseteq' \supseteq \sqsubseteq$
- (P3A)  $\lambda'(e) = \lambda(e)$
- (P3B)  $\lambda'(d) = a$
- (P4A)  $\kappa'(d)$  implies  $\phi \wedge (d \notin E \vee \kappa(d))$
- (P4B) if  $d \neq (R...)$  then e = d or  $\kappa'(e)$  implies  $\kappa(e)$
- (P4C) if d = (Rxv) then e = d or  $\kappa'(e)$  implies  $\kappa(e)[v/x]$
- (P5A) if d = (R..), e = (W..) then e = d or  $\kappa'(e)$  implies  $\kappa(e)$  or  $d \le e$
- (P5B) if d conflicts with e then  $d \sqsubseteq' e$
- (P5C) if d is an acquire or e is a release then  $d \leq' e$
- (P5D) if *d* is an SC write and *e* is an SC read then  $d \leq' e$
- (P5E) if d reads, and e is an acquiring fence, then  $d \leq' e$
- (P5F) if *d* is a releasing fence, and *e* writes, then  $d \le' e$

## REFERENCES

- Jade Alglave, Will Deacon, Richard Grisenthwaite, Antoine Hacquard, and Luc Maranget. 2020. Armed cats: Formal Concurrency Modelling at Arm. Draft., 49 pages.
- John Bender and Jens Palsberg. 2019. A formalization of Java's concurrent access modes. *Proc. ACM Program. Lang.* 3, OOPSLA (2019), 142:1–142:28. https://doi.org/10.1145/3360568
- Soham Chakraborty and Viktor Vafeiadis. 2017. Formalizing the concurrency semantics of an LLVM fragment. In Proceedings of the 2017 International Symposium on Code Generation and Optimization, CGO 2017, Austin, TX, USA, February 4-8, 2017, Vijay Janapa Reddi, Aaron Smith, and Lingjia Tang (Eds.). ACM, 100–110. http://dl.acm.org/citation.cfm?id=3049844
- William Ferreira, Matthew Hennessy, and Alan Jeffrey. 1996. A Theory of Weak Bisimulation for Core CML. In Proceedings of the 1996 ACM SIGPLAN International Conference on Functional Programming, ICFP 1996, Philadelphia, Pennsylvania, USA, May 24-26, 1996, Robert Harper and Richard L. Wexelblat (Eds.). ACM, 201–212. https://doi.org/10.1145/232627.232649
- Shaked Flur, Kathryn E. Gray, Christopher Pulte, Susmit Sarkar, Ali Sezgin, Luc Maranget, Will Deacon, and Peter Sewell. 2016. Modelling the ARMv8 architecture, operationally: concurrency and ISA. In *Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20-22, 2016*, Rastislav Bodík and Rupak Majumdar (Eds.). ACM, 608–621. https://doi.org/10.1145/2837614.2837615
- Lisa Higham and Jalal Kawash. 2000. Memory Consistency and Process Coordination for SPARC Multiprocessors. In High Performance Computing HiPC 2000, 7th International Conference, Bangalore, India, December 17-20, 2000, Proceedings (Lecture Notes in Computer Science, Vol. 1970), Mateo Valero, Viktor K. Prasanna, and Sriram Vajapeyam (Eds.). Springer, 355–366. https://doi.org/10.1007/3-540-44467-X\_32
- Jeehoon Kang. 2019. Reconciling Low-Level Features of C with Compiler Optimizations. Ph.D. Dissertation. Seoul National University, Seoul, South Korea. https://sf.snu.ac.kr/jeehoon.kang/thesis/
- Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. 2017. Repairing sequential consistency in C/C++11. In *Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017*, Albert Cohen and Martin T. Vechev (Eds.). ACM, 618–632. https://doi.org/10.1145/3062341.3062352
- Sung-Hwan Lee, Minki Cho, Anton Podkopaev, Soham Chakraborty, Chung-Kil Hur, Ori Lahav, and Viktor Vafeiadis. 2020. Promising 2.0: global optimizations in relaxed memory concurrency. In *Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, PLDI 2020, London, UK, June 15-20, 2020, Alastair F. Donaldson and Emina Torlak (Eds.).* ACM, 362–376. https://doi.org/10.1145/3385412.3386010
- Daniel Lustig, Sameer Sahasrabuddhe, and Olivier Giroux. 2019. A Formal Analysis of the NVIDIA PTX Memory Consistency Model. In *Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2019, Providence, RI, USA, April 13-17, 2019*, Iris Bahar, Maurice Herlihy, Emmett Witchel, and Alvin R. Lebeck (Eds.). ACM, 257–270. https://doi.org/10.1145/3297858.3304043
- NVIDIA. 2020. Parallel Thread Execution ISA Version 7.1. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#memory-consistency-model.
- Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. 2018. Simplifying ARM concurrency: multicopy-atomic axiomatic and operational models for ARMv8. *PACMPL* 2, POPL (2018), 19:1–19:29. https://doi.org/10.1145/3158107