## CS 4110

# Programming Languages & Logics

Lecture 32 Shared-Memory Parallelism

#### IMP with Parallel Composition

Here's a simple model of shared-memory parallelism: let's extend IMP with a new a parallel composition command.

#### **IMP** with Parallel Composition

Here's a simple model of shared-memory parallelism: let's extend IMP with a new a parallel composition command.

```
\begin{array}{lll} a & ::= & x \mid n \mid a_1 + a_2 \\ b & ::= & \mathsf{true} \mid \mathsf{false} \mid a_1 < a_2 \\ c & ::= & \mathsf{skip} \mid x := a \mid c_1; c_2 \mid \mathsf{if} \ b \ \mathsf{then} \ c_1 \ \mathsf{else} \ c_2 \mid \mathsf{while} \ b \ \mathsf{do} \ c \\ & & \mid c_1 \mid \mid c_2 \end{array}
```

And add small-step operational semantics rules for  $c_1 \parallel c_2$  that interleave the execution of  $c_1$  and  $c_2$ :

$$\frac{\langle \sigma, c_1 \rangle \rightarrow \langle \sigma', c_1' \rangle}{\langle \sigma, c_1 \mid\mid c_2 \rangle \rightarrow \langle \sigma', c_1' \mid\mid c_2 \rangle}$$

And add small-step operational semantics rules for  $c_1 \parallel c_2$  that interleave the execution of  $c_1$  and  $c_2$ :

$$\begin{split} \frac{\langle \sigma, c_1 \rangle \rightarrow \langle \sigma', c_1' \rangle}{\langle \sigma, c_1 \mid\mid c_2 \rangle \rightarrow \langle \sigma', c_1' \mid\mid c_2 \rangle} \\ \frac{\langle \sigma, c_2 \rangle \rightarrow \langle \sigma.' c_2' \rangle}{\langle \sigma, c_1 \mid\mid c_2 \rangle \rightarrow \langle \sigma', c_1 \mid\mid c_2' \rangle} \end{split}$$

And add small-step operational semantics rules for  $c_1 \parallel c_2$  that interleave the execution of  $c_1$  and  $c_2$ :

$$\frac{\langle \sigma, c_1 \rangle \to \langle \sigma', c_1' \rangle}{\langle \sigma, c_1 \mid\mid c_2 \rangle \to \langle \sigma', c_1' \mid\mid c_2 \rangle}$$

$$\frac{\langle \sigma, c_2 \rangle \to \langle \sigma.' c_2' \rangle}{\langle \sigma, c_1 \mid\mid c_2 \rangle \to \langle \sigma', c_1 \mid\mid c_2' \rangle}$$

$$\langle \sigma, \mathsf{skip} \mid\mid \mathsf{skip} \rangle \to \langle \sigma, \mathsf{skip} \rangle$$

And add small-step operational semantics rules for  $c_1 \parallel c_2$  that interleave the execution of  $c_1$  and  $c_2$ :

$$egin{aligned} &\langle \sigma, c_1 
angle 
ightarrow \langle \sigma', c_1' 
angle \ & \langle \sigma, c_1 \mid \mid c_2 
angle 
ightarrow \langle \sigma', c_1' \mid \mid c_2 
angle \ & \langle \sigma, c_2 
angle 
ightarrow \langle \sigma.' c_2' 
angle \ & \langle \sigma, c_1 \mid \mid c_2 
angle 
ightarrow \langle \sigma', c_1 \mid \mid c_2' 
angle \ & \langle \sigma, ext{skip} \mid \mid ext{skip} 
angle 
ightarrow \langle \sigma, ext{skip} 
angle \end{aligned}$$

The rules allow either sub-command to take a step; two sub-commands can interleave read and write operations involving the same store.

#### Parallel Bank Account

What happens if we deposit into a bank account twice under parallel composition?

```
\begin{aligned} & \mathsf{bal} := 0; \\ & (\mathsf{bal} := \mathsf{bal} + 21.0 \quad || \quad \mathsf{bal} := \mathsf{bal} + 21.0) \end{aligned}
```

#### Synchronization

Languages have synchronization constructs that control the interactions between threads.

Many languages have mutual exclusion, a.k.a. locking:

#### Synchronization

Languages have synchronization constructs that control the interactions between threads.

Many languages have mutual exclusion, a.k.a. locking:

```
\begin{aligned} & \textbf{lock } \textit{l}; \\ & \textbf{bal} := \textbf{bal} + 21.0; \\ & \textbf{unlock } \textit{l} \end{aligned}
```

A well-behaved alternative is transactional memory:

```
 \begin{aligned} & \textbf{transaction} \ \{ \\ & \textbf{bal} := \textbf{bal} + 21.0 \\ \} \end{aligned}
```

### Reasoning About Shared Memory

This program reads and writes two shared variables from two different "threads":

$$\mathbf{x} := 0; \mathbf{y} := 0;$$
  
 $(\mathbf{y} := 1; \mathsf{tmp1} := \mathbf{x}) \mid |$   
 $(\mathbf{x} := 1; \mathsf{tmp2} := \mathbf{y})$ 

What can tmp1 and tmp2 be afterward?

### **Ordering Operations**



The *happens before* relation is a partial order on events in a program execution.

See also Lamport, 1978: "Time, Clocks and the Ordering of Events in a Distributed System."

The *happens before* relation is a partial order on events in a program execution.

Operation *a* happens before *b*, written  $a \rightarrow b$ , iff:

- a and b belong to the same thread and a comes before b in a single-threaded execution, or
- a sends an inter-thread message that b receives.

See also Lamport, 1978: "Time, Clocks and the Ordering of Events in a Distributed System."

The *happens before* relation is a partial order on events in a program execution.

Operation a happens before b, written  $a \rightarrow b$ , iff:

- a and b belong to the same thread and a comes before b in a single-threaded execution, or
- *a* sends an inter-thread message that *b* receives.

(Also add transitivity: if  $a \rightarrow b$  and  $b \rightarrow c$ , then  $a \rightarrow c$ .)

See also Lamport, 1978: "Time, Clocks and the Ordering of Events in a Distributed System."

In modern multithreaded programming, messages are sent and received at *synchronization* events:

- unlock  $l \rightarrow lock l$
- fork  $t \rightarrow$  first operation in thread t
- last operation in thread  $t o ext{join } t$

Which executions of a multi-threaded program are possible?

Which executions of a multi-threaded program are possible?

Model an execution as a total order  $a \rightarrow_e b$  on the same set of events. For example:

$$y := 1 \longrightarrow tmp1 := x \longrightarrow x := 1 \longrightarrow tmp2 := y$$

Which executions of a multi-threaded program are possible?

Model an execution as a total order  $a \rightarrow_e b$  on the same set of events. For example:

$$y := 1 \longrightarrow tmp1 := x \longrightarrow x := 1 \longrightarrow tmp2 := y$$

Then ask: is  $\rightarrow \subseteq \rightarrow_e$ ? If so, then we say that  $\rightarrow_e$  is a sequentially consistent execution.

Which executions of a multi-threaded program are possible?

Model an execution as a total order  $a \rightarrow_e b$  on the same set of events. For example:

$$y := 1$$
  $\longrightarrow$   $tmp1 := x$   $\longrightarrow$   $x := 1$   $\longrightarrow$   $tmp2 := y$ 

Then ask: is  $\rightarrow \subseteq \rightarrow_e$ ? If so, then we say that  $\rightarrow_e$  is a sequentially consistent execution.

Intuitively,  $\rightarrow_e$  is an *interleaving* that obeys  $\rightarrow$ .

To see what a parallel program can do, we can enumerate all the SC executions and "run" them:

•  $y := 1 \rightarrow \mathsf{tmp1} := \mathsf{x} \rightarrow \mathsf{x} := 1 \rightarrow \mathsf{tmp2} := \mathsf{y}$ 

To see what a parallel program can do, we can enumerate all the SC executions and "run" them:

To see what a parallel program can do, we can enumerate all the SC executions and "run" them:

To see what a parallel program can do, we can enumerate all the SC executions and "run" them:

- $\mathsf{y} := 1 \to \mathsf{x} := 1 \to \mathsf{tmp1} := \mathsf{x} \to \mathsf{tmp2} := \mathsf{y}$  $\Longrightarrow \mathsf{tmp1} \mapsto 1, \mathsf{tmp2} \mapsto 1$

Enumerating SC executions gets old fast, but lets us produce the set of possible final stores,  $\sigma$ :

```
 \begin{cases} \mathsf{tmp1} \mapsto 0, \mathsf{tmp2} \mapsto 1 \\ \mathsf{tmp1} \mapsto 1, \mathsf{tmp2} \mapsto 1 \\ \mathsf{tmp1} \mapsto 1, \mathsf{tmp2} \mapsto 0 \end{cases}
```

So no sequentially consitent execution makes both tmp1 and tmp2 equal to zero.

### That Same Program, in C

```
volatile int x, y, tmp1, tmp2;
// Thread 0: write x and read y.
void *t0(void *arg) {
  x = 1:
  tmp1 = y;
  return 0;
// Thread 1, the opposite: write y and read x.
void *t1(void *arg) {
  y = 1;
  tmp2 = x;
  return 0;
```

### That Same Program, in C

```
void main() {
 x = y = tmp1 = tmp2 = 0;
  // Launch both threads.
  pthread t thread0, thread1;
  pthread create(&thread0, NULL, t0, NULL);
 pthread create(&thread1, NULL, t1, NULL);
  // Wait for both threads to finish.
 pthread join(thread0, NULL);
  pthread join(thread1, NULL);
 printf("%d %d\n", tmp1, tmp2);
```

No real parallel machine enforces sequential consistency!

No real parallel machine enforces sequential consistency!

There are many reasons and/or excuses:

- Per-processor caching lets each CPU read values that other processors can't see yet.
- Private write buffers are critical for good performance with coherent caches.
- Lots of "obvious" compiler optimizations violate sequential consistency.

See also Boehm, 2005: "Threads cannot be implemented as a library."

Every machine (and every programming language) as a memory model. Memory models describe the set of legal executions.

Every machine (and every programming language) as a memory model. Memory models describe the set of legal executions.

Sequential consistency is the *strongest* memory model out there: it allows the fewest different executions.

Real machines and languages have weaker memory models:

$$SC \ge x86 \ge ARM \ge C/C++ \ge DRF0$$

#### **Data Races**

#### A data race occurs when:

- There are two events a and b that are unordered in the happens-before relation ( $a \rightarrow b$  and  $b \rightarrow a$ ),
- both events access the same shared variable, and
- one or both of a and b is a write.

#### **Data Races**

#### A data race occurs when:

- There are two events a and b that are unordered in the happens-before relation (a → b and b → a),
- both events access the same shared variable, and
- one or both of a and b is a write.

Our little example has *two* data races: one on x and one on y.

#### Data Races & Memory Models

Languages have recently agreed on one critical property:

data race free ⇒ sequentially consistent

As long as you avoid data races, you get sequential consistency on *any* machine in Java, C, and C++.

(In jargon: the DRF implies SC theorem.)

#### Data Races & Memory Models

Languages have recently agreed on one critical property:

data race free ⇒ sequentially consistent

As long as you avoid data races, you get sequential consistency on *any* machine in Java, C, and C++.

(In jargon: the DRF implies SC theorem.)

Languages still disagree about what happens when you *do* have a race. In C and C++, races allow undefined behavior.

#### Race-Free Programming

\$ ./a.out

Data race detection is an active field of research.

\$ cc -g -fsanitize=thread simple race.c

#0 Thread2(void\*) simple race.cc:13

One called ThreadSanitizer is included with recent Clang and GCC compilers:

```
WARNING: ThreadSanitizer: data race (pid=26327)
Write of size 4 at 0x7f89554701d0 by thread T1:
    #0 Thread1(void*) simple_race.cc:8
Previous write of size 4 at 0x7f89554701d0 by thread T
```