



# Loop Parallelism I

Cosmin E. Oancea

Department of Computer Science (DIKU) University of Copenhagen

September 2022 PMPH Lecture Slides



| Course | rgan | 172 | ition |
|--------|------|-----|-------|
| Course | . 5  | 120 |       |

| W | HARDWARE        | SOFTWARE          |                | LAB/CUDA              |
|---|-----------------|-------------------|----------------|-----------------------|
| 1 | Trends          |                   | List HOM       | Intro & Simple        |
|   | Vector Machine  | $\longleftarrow$  | (Map-Reduce)   | Map Programming       |
| 2 | In Order        | $\longrightarrow$ | VLIW Instr     | Scan &                |
|   | Processor       | $\longleftarrow$  | Scheduling     | Reduce                |
| 3 | Cache           |                   | Loop           | Sparse Vect           |
|   | Coherence       |                   | Parallelism I  | Matrix Mult           |
| 4 | Interconnection |                   | Case Studies & | Transpose & Matrix    |
|   | Networks        |                   | Optimizations  | Matrix Mult           |
| 5 | Memory          |                   | Optimising     | Sorting & Profiling & |
|   | Consistency     |                   | Locality       | Mem Optimizations     |
| 6 | OoO, Spec       |                   | Thread-Level   | Project               |
|   | Processor       |                   | Speculation    | Work                  |
|   |                 |                   |                |                       |

Three narative threads: the path to complex & good design:

- Design Space tradeoffs, constraints, common case, trends.
- Reasoning: from simple to complex, Applying Concepts.

### **Motivation**

- + So far we reasoned about how to parallelize a known algorithm
- using a clean, functional approach, e.g., flattening,
- which provides work and depth guarantees,
- but does NOT account for locality of reference.

Why do we have to look at imperative loops?



### **Motivation**

- + So far we reasoned about how to parallelize a known algorithm
- using a clean, functional approach, e.g., flattening,
- which provides work and depth guarantees,
- but does NOT account for locality of reference.

### Why do we have to look at imperative loops?

- A lot of legacy sequential imperative code, C++/Java/Fortran.
- Need to parallelize the implementation of unknown algorithm,
- Need to optimize parallelism, e.g., locality of reference requires subscript analysis.



Direction-Vector Analysis

- - Brute-Force Nearest Neighbor Case Study
  - Matrix Multiplication Case Study



### Problem Statement

### Three Loop Examples DO i = 2, N DO i = 1, NDO i = 2, N DO j = 1, N DO j = 2, N DO j = 1, NA[j,i] = A[j,i] ... A[j,i] = A[j-1,i-1]...A[i,j] = A[i-1,j+1]...B[j,i] = B[j-1,i]...**ENDDO** ENDDO **ENDDO** ENDDO ENDDO **ENDDO**

Iterations are ordered lexicographically, i.e., in the order they occur in the sequential execution, e.g.,  $\vec{k} = (i=2, j=4) < \vec{l} = (i=3, j=3)$ .

- Which of the three loop nests is amenable to parallelization?
- Loop interchange is one of the most simple and useful code transformations, e.g., used to enhance locality of reference, parallel-loop granularity, and even to "create" parallelism.
- In which loop nest is it safe to interchange the loops?



# **Definition of a Dependency**

### Load-Store Classification of Dependencies

| True Dependency (RAW) | Anti Dependency (WAR) | Output dependency (WAW) |
|-----------------------|-----------------------|-------------------------|
| S1 X =                | S1 = X                | S1 X =                  |
| S2 = X                | S2 X =                | S2 X =                  |

**Def. Loop Dependence:** There is a dependence from statement *S*1 to S2 in a loop nest iff  $\exists$  iterations  $\vec{k}$ ,  $\vec{l}$  such that:



# Definition of a Dependency

### Load-Store Classification of Dependencies

```
True Dependency (RAW)
                                                   Output dependency (WAW)
                         Anti Dependency (WAR)
S1
                         S1
                               .. = X
                                                        X = \dots
     X = ...
                                                   S1
S2 .. = X
                         S2
                               X = ...
                                                        X = \dots
```

**Def. Loop Dependence:** There is a dependence from statement *S*1 to S2 in a loop nest iff  $\exists$  iterations  $\vec{k}$ ,  $\vec{l}$  such that:

- 1.  $\vec{k} < \vec{l}$  or  $\vec{k} = \vec{l}$  and  $\exists$  an execution path from statement S1 to statement S2 such that:
- 2. S1 accesses memory location M on iteration k, and
- 3. S2 accesses memory location M on iteration I, and
- 4. one of these accesses is a write.

We say that S1 is the source and S2 is the sink of the dependence, because S1 executes before S2 in the sequential program execution. Dependence depicted with an arrow pointing from source to sink.



# **Definition of a Dependency**

### Load-Store Classification of Dependencies

```
True Dependency (RAW) Anti Dependency (WAR) Output dependency (WAW) S1 \quad X = \dots \qquad S1 \quad \dots = X \qquad S1 \quad X = \dots  S2 \quad \dots = X \qquad S2 \quad X = \dots
```

**Def. Loop Dependence:** There is a dependence from statement S1 to S2 in a loop nest *iff*  $\exists$  iterations  $\vec{k}$ ,  $\vec{l}$  such that:

- 1.  $\vec{k} < \vec{l}$  or  $\vec{k} = \vec{l}$  and  $\exists$  an execution path from statement S1 to statement S2 such that:
- 2. S1 accesses memory location M on iteration  $\vec{k}$ , and
- 3. S2 accesses memory location M on iteration I, and
- 4. one of these accesses is a write.

We say that S1 is the source and S2 is the sink of the dependence, because S1 executes before S2 in the sequential program execution. Dependence depicted with an arrow pointing from source to sink. We are most interested in cross iteration dependencies, i.e.,  $\vec{k} < \vec{l}$ .



# **Loop-Nest Dependencies**

Lexicographic ordering, e.g.,  $\vec{k} = (i=2, j=4) < \vec{l} = (i=3, j=3)$ .

## Three Loop Examples

```
D i = 1, N D 0 i = 2, N D 0 j = 2, N
DO i = 1, N
                                                  DO i = 2, N
                                                 DO j = 1, N
   A[j,i] = A[j,i] \dots A[j,i] = A[j-1,i-1] \dots A[i,j] = A[i-1,j+1] \dots
 ENDDO
                        B[j,i] = B[j-1,i]...
                                              ENDDO
ENDDO
                       ENDDO ENDDO
                                                  ENDDO
```



# **Loop-Nest Dependencies**

Lexicographic ordering, e.g.,  $\vec{k} = (i=2, j=4) < \vec{l} = (i=3, j=3)$ .

# Three Loop Examples

# Leftmost Loop





How can I summarize this information?

# Aggregate Dependencies via Direction Vectors

# Write the Direction Vectors for Each Loop:

Dependencies depicted via an edge *from* the stmt that executes first in the loop nest, i.e., *the source*, *to* the one that executes later, *the sink*.

**Def. Dependence Direction:** Assume  $\exists$  a dependence from S1 in iteration  $\vec{k}$  to S2 in  $\vec{l}$  ( $\vec{k} \leq \vec{l}$ ). Dependence-direction vector  $\vec{D}(\vec{k}, \vec{l})$ :

- 1.  $\vec{D}(\vec{k}, \vec{l})_m = \text{"<" if } \vec{k}_m < \vec{l}_m$
- 2.  $\vec{D}(\vec{k}, \vec{l})_m = \text{"=" if } \vec{k}_m = \vec{l}_m$
- 3.  $\vec{D}(\vec{k}, \vec{l})_m = ">" \text{ if } \vec{k}_m > \vec{l}_m.$

If the source is a write and the sink a read then RAW dependency, iff the source is a read then WAR, if both are writes then WAW.



# How to Compute the Direction Vectors?

- $\bullet$  For any two statements S1 and S2 that may access the same array A (and one of the accesses is a write),
- in two symbolic iterations  $I^1 \equiv (i_1^1, \dots, i_m^1)$  and  $I^2 = (i_1^2, \dots, i_m^2)$ (such that  $I^1 < I^2$ )
- on indices  $A[e_1^1, \ldots, e_n^1]$  and  $A[e_1^2, \ldots, e_n^2]$ , respectively,
- then the direction vectors may be derived from the equations

$$\left\{egin{aligned} e_1^1 &= e_1^2 \ \cdots \ e_n^1 &= e_n^2 \end{aligned}
ight.$$

(The system of equations models the definition of a dependency: both accesses need to refer to the same memory location!)

### Direction Vectors/Matrix for Three Loops

```
DO i = 1, N
   DO j = 1, N
  A[j,i]=A[j,i]..
S1
   ENDDO
 ENDDO
```



### Direction Vectors/Matrix for Three Loops

```
DO i = 2, N
 DO i = 1, N
                            DO j = 2, N
   DO j = 1, N
                        S1 A[j,i]=A[j-1,i]...
S1 A[j,i]=A[j,i]..
                        S2
                           B[j,i]=B[j-1,i-1]...
   ENDDO
                            ENDDO
 ENDDO
                          ENDDO
For S1→S1:
    (j1,i1)=(j2,i2)
   i1 = i2 \& j1 = j2
Direction matrix:
S1→S1: [=,=]
```



### Direction Vectors/Matrix for Three Loops

```
DO i = 2, N
  DO i = 1. N
                                                         D0 i = 2. N
                            DO j = 2, N
   DO j = 1, N
                                                          DO j = 1, N
                         S1 A[j,i]=A[j-1,i]...
S1 A[j,i]=A[j,i]..
                                                       S1 A[i,j]=A[i-1,j+1].
                          S2 B[j,i]=B[j-1,i-1]...
    ENDDO
                                                            ENDDO
                              ENDDO
  ENDDO
                                                         ENDDO
                            ENDDO
For S1→S1:
                                                       For S1→S1:
                         S1 \rightarrow S1: (j1,i1) = (j2-1,i2)
    (i1,i1)=(i2,i2)
                                  i1 = i2 & j1 < j2
    i1 = i2 \& j1 = j2
                         S2 \rightarrow S2: (j1,i1)=(j2-1,i2-1)
                                  i1 < i2 & j1 < j2
Direction matrix:
                          S1→S1: [=,<]
S1→S1: [=.=]
                          S2→S2: [<,<]
```



```
Direction Vectors/Matrix for Three Loops
                           DO i = 2, N
  DO i = 1. N
                                                         DO i = 2, N
                             DO j = 2, N
                                                         DO j = 1, N
   DO j = 1, N
                         S1 A[j,i]=A[j-1,i]...
S1 A[j,i]=A[j,i]..
                                                       S1 A[i,j]=A[i-1,j+1].
                         S2 B[j,i]=B[j-1,i-1]...
    ENDDO
                                                           ENDDO
                             ENDDO
  ENDDO
                                                         ENDDO
                           ENDDO
For S1→S1:
                                                       For S1→S1:
                         S1 \rightarrow S1: (j1,i1) = (j2-1,i2)
    (i1,i1)=(i2,i2)
                                                           (i1,i1) = (i2-1,i2+1)
                                 i1 = i2 & j1 < j2
    i1 = i2 \& j1 = j2
                                                           i1 < i2 & j1 > j2
                         S2 \rightarrow S2: (j1,i1)=(j2-1,i2-1)
                                 i1 < i2 & j1 < j2
Direction matrix:
                                                       Direction matrix:
                         S1→S1: [=,<]
                                                       S1→S1: [<,>]
S1→S1: [=.=]
                         S2→S2: [<,<]
```

**Th. Parallelism:** A loop in a loop nest is parallel *iff* all its directions are either = or there exists an outer loop whose corresp. direction is <. A direction vector cannot have > as the first non-= symbol, as that would mean that I depend on something in the future.

```
Direction Vectors/Matrix for Three Loops
                           DO i = 2, N
  DO i = 1. N
                                                         DO i = 2, N
                            DO j = 2, N
                                                         DO j = 1, N
   DO j = 1, N
                         S1 A[j,i]=A[j-1,i]...
S1 A[j,i]=A[j,i]..
                                                       S1 A[i,j]=A[i-1,j+1].
                         S2 B[j,i]=B[j-1,i-1]...
    ENDDO
                                                           ENDDO
                             ENDDO
  ENDDO
                                                         ENDDO
                           ENDDO
For S1→S1:
                                                       For S1→S1:
                         S1 \rightarrow S1: (j1,i1) = (j2-1,i2)
    (i1,i1)=(i2,i2)
                                                           (i1, j1) = (i1-1, j2+1)
                                 i1 = i2 & j1 < j2
    i1 = i2 \& j1 = j2
                                                           i1 < i2 & j1 > j2
                         S2 \rightarrow S2: (j1,i1)=(j2-1,i2-1)
                                 i1 < i2 & j1 < j2
Direction matrix:
                                                       Direction matrix:
                         S1→S1: [=,<]
S1→S1: [=.=]
                                                       S1→S1: [<,>]
                         S2→S2: [<,<]
```

**Th. Loop Interchange:** A column permutation of the loops in a loop nest is legal iff permuting the direction matrix in the same way does NOT result in a > direction as the leftmost non-= direction in a row.

```
Direction Vectors/Matrix for Three Loops
 DO i = 1, N DO i = 2, N
                                                 D0 i = 2, N
   DO j = 1, N DO j = 2, N
                                     DO j = 1, N
S1 A[j,i]=A[j,i]... S1 A[j,i]=A[j-1,i]... S1 A[i,j]=A[i-1,j+1]...
   ENDDO
                      S2 B[j,i]=B[j-1,i-1]...
                                                   ENDDO
                        ENDDO ENDDO
  ENDDO
                                                 ENDDO
For S1\rightarrowS1: j1 = j2 For S1\rightarrowS1: j1 = j2-1
                                                    For S1 \rightarrow S1: i1 = i2-1
           i1 = i2
                                 i1 = i2
                                                           j1 = j2+1
                                                (i2, j2)-(i1, j1)=[<,>]
(i2,j2)-(i1,j1)=
                (i2,j2)-(i1,j1)=[=,<]
[=,=]
                      For S2 \rightarrow S2: i1 = i2-1
                                 i1 = i2-1
                        (i2, j2)-(i1, j1)=[<,<]
```

Interchange is safe for the first and second nests, but not for the third!

e.g., 
$$[=,<]$$
  $\rightarrow$   $[<,=]$  (for the second loop nest)  $[<,<]$ 



```
Direction Vectors/Matrix for Three Loops
 DO i = 1, N DO i = 2, N
                                                 D0 i = 2, N
   DO j = 1, N DO j = 2, N
                                      D0 i = 1, N
S1 A[j,i]=A[j,i]... S1 A[j,i]=A[j-1,i]... S1 A[i,j]=A[i-1,j+1]...
    ENDDO
                      S2 B[j,i]=B[j-1,i-1]...
                                                    ENDDO
                        ENDDO ENDDO
  ENDDO
                                                  ENDDO
                                                    For S1 \rightarrow S1: i1 = i2-1
For S1 \rightarrow S1: j1 = j2 For S1 \rightarrow S1: j1 = j2-1
           i1 = i2
                                  i1 = i2
                                                            j1 = j2+1
                                                 (i2, j2)-(i1, j1)=[<,>]
(i2,j2)-(i1,j1)=
                     (i2,j2)-(i1,j1)=[=,<]
[=,=]
                      For S2 \rightarrow S2: i1 = i2-1
                                  i1 = i2-1
                        (i2, j2)-(i1, j1)=[<,<]
```

Interchange is safe for the first and second nests, but not for the third!

e.g., 
$$[=,<]$$
  $\rightarrow$   $[<,=]$  (for the second loop nest)  $[<,<]$ 

After interchange, loop j of the second loop nest is parallel.

Corollary: A parallel loop can be always interchanged inwards.

C. Oancea: Loop Parallelism Sept 2022.

# **Dependency Graph and Loop Distribution**

**Def. Dependency Graph:** edges from the source of the dependency, i.e., early iteration, to the sink, i.e., later iteration.

**Th. Loop Distribution:** Statements that are in a dependence cycle remain in one (sequential) loop. The others are distributed to separate loops in graph order. A loop with no dependency cycle is parallel!

### Vectorization Example: Remember Vector Machines?

```
DO i = 3, N
S1 A[i] = B[i-2] \dots
S2 B[i] = B[i-1] \dots
  ENDDO
For S2\rightarrow S1: i1 = i2-2, [<]
For S2 \rightarrow S2: i1 = i2-1, [<]
```



# **Dependency Graph and Loop Distribution**

**Def. Dependency Graph:** edges from the source of the dependency, i.e., early iteration, to the sink, i.e., later iteration.

**Th. Loop Distribution:** Statements that are in a dependence cycle remain in one (sequential) loop. The others are distributed to separate loops in graph order. A loop with no dependency cycle is parallel!



**Corollary:** It is always legal to distribute a parallel loop; but requires array expansion for local variables or if output dependencies are present.



# **Loop Distribution May Require Array Expansion**

```
float tmp;
for(int i=2; i<N; i++) {
  tmp = 2*B[i-2];
  A[i] = tmp;
 B[i] = tmp+B[i-1];
```



# **Loop Distribution May Require Array Expansion**

```
float tmp[N];
                                     for(int i=2; i<N; i++) {
float tmp;
                                       tmp[i] = 2*B[i-2];
for(int i=2; i<N; i++) {
                                       B[i] = tmp[i] + B[i-1];
  tmp = 2*B[i-2];
  A[i] = tmp;
 B[i] = tmp+B[i-1];
                                     forall(int i=2; i<N; i++) {
                                       A[i] = tmp[i];
```

No matter where tmp is declared (inside or outside the loop) it needs to be expanded into an array in order to perform loop distribution.

If tmp is declared outside the loop then it requires **privatization**,



# Loop Distribution May Require Array Expansion

No matter where tmp is declared (inside or outside the loop) it needs to be expanded into an array in order to perform loop distribution.

If tmp is declared outside the loop then it requires **privatization**, because it actually causes frequent WAW dependencies. However its value is written before being used within the same iteration. Hence it is semantically equivalent to a locally declared variable, which will remove the output (WAW) dependency.

Distribution requires array expansion of the scalar tmp.

# False Dependencies (WAR/WAW)

- Cross-Iteration Anti Dependencies (WAR) correspond to a read from the array as it was before the loop  $\Rightarrow$  can be eliminated by reading from a copy of the array.
- Cross-Iteration WAW Dependencies (WAW): If they correspond to the case in which every read from a scalar or array location is covered by a previous same-iteration write ⇒ can be eliminated **privatization** (renaming), which semantically moves the declaration of the variable (scalar or array) inside the loop.
- Direction-vectors reasoning is limited to relatively simple loop nests, e.g., difficult to reason about privatization in such a way.



# **Example: Eliminating WAR Dependencies**

```
Anti Dependency (WAR)
                                and
                                        Rewritten with Parallel Loops
 float tmp = A[1];
 for (i=0; i<N-1; i++)
S1 A[i] = A[i+1];
 A[N-1] = tmp;
//S1 \rightarrow S1: i1+1=i2, [<] WAR
```



# **Example: Eliminating WAR Dependencies**

```
Anti Dependency (WAR)
                                        Rewritten with Parallel Loops
                                and
                                   // Solution: copy A into A'
                                    // and use A' for the reads!
                                   float Acopy[N];
                                    #pragma omp parallel for
  float tmp = A[1];
                                      for(int i=0; i<N; i++) {
  for (i=0: i<N-1: i++)
                                        Acopy[i] = A[i];
S1 A[i] = A[i+1];
  A[N-1] = tmp;
                                      tmp = A[1]:
                                    #pragma omp parallel for private(i)
//S1 \rightarrow S1: i1+1=i2, [<] WAR
                                      for (i=0; i<N-1; i++) {
                                        A[i] = Acopy[i+1];
                                      A[N-1] = tmp;
```

For OpenMP code, run the g++ compiler with the -fopenmp flag. You may play with the degree of parallelism by setting the OMP\_NUM\_THREADS enviornment variable, e.g., to run on 4 cores use: \$ export OMP\_NUM\_THREADS=4.



# **Eliminating WAW Dependencies by Privatization**

Assumption: array A is not used after the target loop.

```
int A[M]:
for(i=0; i<N; i++){
  for(int j=0, j<M; j++)</pre>
      A[i] = (4*i+4*i) % M;
  for(int k=0; k<N; k++)
      X[i,k] = X[i,k-1] *
                A[A[(2*i+k)\%M]\%M];
```



# **Eliminating WAW Dependencies by Privatization**

Assumption: array A is not used after the target loop.

```
// The write to A[j] causes multiple WAWs,
```

```
int A[M]:
for(i=0; i<N; i++){
  for(int j=0, j<M; j++)</pre>
      A[i] = (4*i+4*i) % M;
  for(int k=0; k<N; k++)
      X[i,k] = X[i,k-1] *
                A[A[(2*i+k)\%M]\%M];
```



# **Eliminating WAW Dependencies by Privatization**

Assumption: array A is not used after the target loop.

```
// The write to A[j] causes multiple WAWs,
                                 // but A is fully written in the inner loop
                                 #pragma omp parallel
int A[M]:
for(i=0; i<N; i++){
                                   int A[M];
 for(int j=0, j<M; j++)
                                #pragma omp for
      A[j] = (4*i+4*j) \% M;
                                   for(int i=0; i<N; i++){
 for(int k=0; k<N; k++)
                                     for(int j=0, j<M; j++)
      X[i,k] = X[i,k-1] *
                                         A[j] = (4*i+4*j) % M;
               A[A[(2*i+k)\%M]\%M]:
                                     for(int k=0; k<N; k++)
                                         X[i,k]=X[i,k-1] *
                                                A[A[(2*i+k)]/M]/M]:
```

The declaration of int A[M] inside the OMP parallel region, will create an array A for every thread (and visible/accessible inside the thread context). Alternatively you may expand A with a new dimension of number-of-processor size, and use explicit indexing.



# Reduction is Typically Easy To Recognize

If all the statements in which a scalar variable x appears are of the form  $x \oplus = exp$ , where x does not appear in exp and  $\oplus$  is associative (and commutative) then the cross-iteration RAWs on x can be resolved by:

- privatizing x initialized with the neutral element,
- computing the per-processor partial values of x,
- reducing the xs across processors and with the initial value.



# Scan and Segmented Scan Are Difficult!

Compilers cannot recognize and parallelize even simple scans:

- they raise a cross-iteration true dependency (RAW),
- they appear in a multitude of forms,
- hence they are difficult to analyze.

```
// What kind of scans are these?
1. A[O] = B[O]:
  for(i=1; i<N; i++) {
    A[i] = A[i-1] + B[i];
2. acc = 0:
  for(i=0; i<N; i++){
     acc = acc ^ i:
    A[i] = acc;
3. for(j=0; j<M; j++)
     A[0,j] = B[0,j];
  for(i=1; i<N; i++) {
     for(j=0; j<M; j++)
       A[i,j] = A[i-1,j] + B[i,j];
```



# Scan and Segmented Scan Are Difficult!

Compilers cannot recognize and parallelize even simple scans:

- they raise a cross-iteration true dependency (RAW),
- they appear in a multitude of forms,
- hence they are difficult to analyze.

```
// What kind of scans are these?
1. A[O] = B[O]:
   for(i=1; i<N; i++) {
                                     1. let A = scan(+) 0 B
     A[i] = A[i-1] + B[i];
                                     2. let A = scan (\hat{}) 0 (iota N)
2. acc = 0:
   for(i=0; i<N; i++){
                                     3. let A = scan (\ a b -> map2 (+) a b)
     acc = acc ^ i:
                                                      (replicate M 0.0) B
     A[i] = acc;
                                        let A = transpose <|</pre>
3. for(j=0; j<M; j++)
                                                map (scanInc (+) 0.0) <|
     A[0,j] = B[0,j];
                                                transpose B
   for(i=1; i<N; i++) {
     for(j=0; j<M; j++)
       A[i,j] = A[i-1,j] + B[i,j];
```

- Optimizing Temporal Locality by Block Tiling
  - Brute-Force Nearest Neighbor Case Study
  - Matrix Multiplication Case Study



ENDDO

# **Brute-Force Nearest Neighbor**

For simplicity we assume a 1-dimensional space:

- N reference points;
- M queries, i.e., for each query we need to find its nearest neighbor w.r.t. the reference points.

```
DOALL i = 0, M-1, 1 // In parallel
  float query = queries[i];
  float nn_dist = Infinity;
  float nn ind = -1:
  // Sequential over ref. points
  DO j = 0, N-1, 1
    ref_pt = points[j];
    float dist = fabs(query-ref_pt);
    IF dist < nn dist
      nn_ind = j;
      nn dist = dist:
    ENDIF
  ENDDO
  nns[i] = nn ind:
```

### Strategy:

- We execute the queries in parallel;
- We sequentialize the computation of each query; in principle it is a map-reduce, but we do not exploit it.
- Why?



## **Brute-Force Nearest Neighbor**

For simplicity we assume a 1-dimensional space:

- N reference points;
- M queries, i.e., for each query we need to find its nearest neighbor w.r.t. the reference points.

```
DOALL i = 0, M-1, 1 // In parallel
  float query = queries[i];
  float nn_dist = Infinity;
  float nn ind = -1:
  // Sequential over ref. points
  DO j = 0, N-1, 1
    ref_pt = points[j];
    float dist = fabs(query-ref_pt);
    IF dist < nn dist
      nn_ind = j;
      nn dist = dist:
    ENDIF
  ENDDO
  nns[i] = nn ind:
```

#### Strategy:

- We execute the queries in parallel;
- We sequentialize the computation of each query; in principle it is a map-reduce, but we do not exploit it.
- Why?

#### **ENDDO**

Please note that the queries access the ref points in the same order, so there is potential for exploiting temporal local

#### **Loop Strip Mining: A Simple Transformation**

Strip mining breaks a loop into two-nested loops, as shown below:

```
// Original Code
                                       // After Strip Mining
DO i = 0, N, 1 // stride 1
                                       DO ii = 0, N, T // stride T
 loop_body(i)
                                         DO i = ii, MIN(ii+T-1,N), 1
ENDDO
                                           loop_body(i)
                                         ENDDO
                                       ENDDO
```

#### When is it safe to strip-mine a loop?

Strip Mining is always safe because the observant student will surely notice that the transformed loop nest executes the same instructions the original loop and in the same order.



# Strip Mining: Forming the CUDA Grid and Block

```
DOALL ii = 0, M-1, B // CUDA Grid
  DOALL i = ii, MIN(ii+B-1,M-1), 1 //Block
    // local thread id within Block
    int loc_tid = i - ii;
    float query = queries[i];
    float nn_dist = Infinity;
    float nn_ind = -1;
    // Sequential over ref. points
    DO i = 0, N-1, 1
      ref_pt = points[j];
      float dist = fabs(query-ref_pt);
      IF dist < nn_dist</pre>
        nn_ind = j;
        nn_dist = dist;
      ENDIF
    ENDDO
    nns[i] = nn_ind;
  ENDDO // END CUDA BLOCK
ENDDO // END CUDA GRID
```

- Semantically, blockIdx.x would be ii/B,
- Semantically, threadIdx.x would be i ii;
- Harware caching will exploit some temporal locality across threads, but we can do better by using shared memory as a staging buffer!

## Strip Mining: Blocking the Sequential Loop

```
DOALL ii = 0, M-1, B // CUDA Grid
  DOALL i = ii, MIN(ii+B-1,M-1), 1 // CUDA Block
    // local thread id within Block
    int loc tid = i - ii:
    float query = queries[i];
    float nn_dist = Infinity;
    float nn ind = -1:
    // Sequential over ref. points
    DO jj = 0, N-1, B
      // What slice of the points array is
      // used by the loop below? What to do?
      DO j = jj, MIN(jj+B-1, N-1), 1
        ref_pt = points[j];
        float dist = fabs(query-ref_pt);
        IF dist < nn dist
          nn_ind = j;
          nn dist = dist:
        ENDIF
    ENDDO ENDDO
    nns[i] = nn ind:
  ENDDO
ENDDO
```

- Strip mine the sequential loop
   by block size B;
- Notice that all B threads of the inner block, access in the innermost loop the slice points[jj: jj+B-1: 1], which has at most B consecutive elements.
- Solution: the B threads of the CUDA block collectively copy the B elements into a shared-memory buffer, and then read from there in the innermost loop (instead of acessing the points array which is held in global memory).
- Essentially we use shared memory as an explicitly programmable cache to reduce the number of global-memory accesses by a factor of B.

## Shared Memory as an Explicitly-Programmed Cache

```
DOALL ii = 0, M-1, B // CUDA Grid
  DOALL i = ii, MIN(ii+B-1,M-1), 1 // CUDA Block
    // local thread id within Block
          loc_tid = i - ii;
    int
    float query = queries[i];
    float nn_dist = Infinity;
    float nn_ind = -1;
    // Sequential over ref. points
    DO jj = 0, N-1, B
      float buffer[B]:
      collectiveCopy(buffer, points, jj, B);
      // buffer[loc_tid] = points[jj+loc_tid];
      svncthreads():
      DO j = jj, MIN(jj+B-1, N-1), 1
        ref_pt = buffer[loc_tid]; //points[j]
        float dist = fabs(query-ref_pt);
        IF dist < nn_dist</pre>
          nn_ind = j;
          nn dist = dist:
        ENDIF
    ENDDO __syncthreads(); ENDDO
    nns[i] = nn_ind;
  ENDDO
ENDDO
```

- Essentially we use shared memory as an explicitly programmable cache to reduce the number of global-memory accesses by a factor of B.
- Each of the B threads copies one element from global to shared memory to fill in the buffer. Please note that the access to global memory is coalesced.
- Each thread than reads in the innermost loop B elements from the shared-memory buffer. The number of accesses to global memory is thus reduced with a factor of B!

### Approximate CUDA Code

```
// CPU code
int B = 256;
int dimx = (M + B - 1) / B;
// dimx = ceil( ((float)M) / B );
dim3 block(B,1,1), grid(dimx,1,1);
unsigned long int elapsed;
struct timeval t_start,t_end,t_diff;
gettimeofday(&t_start, NULL);
 nearestNeighb<B><<<grid, block>>>
       (queries, points, nns, M, N);
gettimeofday(&t_end, NULL);
timeval_subtract(&t_diff,
                 &t end.&t start):
elapsed=(t_diff.tv_sec*1e6 +
        t_diff.tv_usec);
double flops = 3.0 * M * N;
double gigaFlops=(flops*1.0e-3f) /
                 elapsed;
```

```
template <int B> // KERNEL
__global__ void nearestNeighb( ... ) {
  __shared__ float shmem[B];
  int ii = blockIdx.x * B; //blockDim.x==B
  int tidx = threadIdx.x, i = ii+tidx;
  float nn_dist = INFINITY;
  int
        nn ind = -1:
  float query = (i<M)? queries[i] : 0.0;</pre>
  for(int jj=0; jj<N; jj+=B) {</pre>
    shmem[tidx] = (jj+tidx<N) ?</pre>
                points[jj+tidx] : INFINITY;
    syncthreads():
    for(int j=0; j<B; j++) {
    //if(jj+j < N && i < M) {
        float ref_pt = shmem[k];
        float dist = fabs(querry - ref_pt);
        if(dist < nn_dist) {</pre>
            nn_ind = jj + j;
            nn_dist = dist;
    } __syncthreads();
```

} if ( i < M ) nns[i] = nn\_dist;</pre>

## Matrix Multiplication: Loop Strip Mining

```
DOALL i = 0, M-1, 1 // Parallel
 DOALL j = 0, N-1, 1 // Parallel
   float tmp = 0.0
   DO k = 0, U-1, 1 // Reduction
     tmp += A[i,k]*B[k,i]
   ENDDO
   C[i,j] = tmp;
 ENDDO
ENDDO
```

←Matrix Multiplication. Matrices:

- input A has M rows and U columns
- input B has U rows and N columns
- result C has M rows and N columns

Loops of indices i and j are parallel (can be proved by direction vectors).

Accesses to A and B invariant to loops i and  $j \Rightarrow Block Tiling to$ optimize locality of reference!



## Matrix Multiplication: Loop Strip Mining

```
DOALL i = 0, M-1, 1 // Parallel
 DOALL j = 0, N-1, 1 // Parallel
   float tmp = 0.0
   DO k = 0, U-1, 1 // Reduction
     tmp += A[i,k]*B[k,j]
   ENDDO
   C[i,j] = tmp;
 ENDDO
ENDDO
```

←Matrix Multiplication. Matrices:

- input A has M rows and U columns
- input B has U rows and N columns
- result C has M rows and N columns

Loops of indices i and j are parallel (can be proved by direction vectors).

Accesses to A and B invariant to loops i and  $j \Rightarrow Block Tiling to$ optimize locality of reference!

First step: Strip Mining, always safe since the transformed loop executes the same instructions in the same order as the original loop:

```
DO i = 0, N-1, 1 // stride 1
 loop_body(i)
ENDDO
```

```
DO ii = 0, N-1, T // stride T
 DO i = ii, MIN(ii+T-1,N-1), 1
   loop_body(i)
 ENDDO
ENDDO
```

## Matrix Multiplication: Loop Interchange

After strip mining all loops with a tile of size T:

```
DOALL ii = 0, M-1, T
 DOALL i = ii, MIN(ii+T-1,M-1), 1 // loop
   DOALL jj = 1, N, T // interchange. Why Safe?
     DOALL j = jj, MIN(jj+T-1,N-1), 1
       float tmp = 0.0
       DO kk = 0. U-1. T
         DO k = kk, MIN(kk+T-1,U-1), 1
           tmp += A[i,k]*B[k,j]
       ENDDO ENDDO
       C[i,j] = tmp;
ENDDO ENDDO ENDDO ENDDO
```



#### Matrix Multiplication: Loop Interchange

After strip mining all loops with a tile of size T:

```
DOALL ii = 0, M-1, T
 DOALL i = ii, MIN(ii+T-1,M-1), 1 // loop
   DOALL jj = 1, N, T
                      // interchange. Why Safe?
     DOALL j = jj, MIN(jj+T-1,N-1), 1
       float tmp = 0.0
       DO kk = 0. U-1. T
         DO k = kk, MIN(kk+T-1,U-1), 1
           tmp += A[i,k]*B[k,j]
       ENDDO ENDDO
       C[i,j] = tmp;
ENDDO ENDDO ENDDO ENDDO
```

The second step is to apply loop interchange between the loops of indices i and jj. This is safe because loop i is parallel, hence it can always be interchanged inwards!



## Matrix Multiplication: Summarizing Read Subscripts

After loop interchange we have a grid shape, as in CUDA:

```
DOALL ii = 0, M-1, T
                                      // grid.y
 DOALL jj = 0, N-1, T
                                       // grid.x
   DOALL i = ii, MIN(ii+T-1,M-1), 1 // block.v
      DOALL j = jj, MIN(jj+T-1,N-1), 1// block.x
       float tmp = 0.0
       DO kk = 0, U-1, T
         DO k = kk, MIN(kk+T-1,U-1), 1
           tmp += A[i,k]*B[k,j]
         ENDDO
       ENDDO
       C[i,j] = tmp;
ENDDO ENDDO ENDDO
```

The third step is to summarize the subscripts of A and B read inside the loop of index k, for fixed ii, jj and kk (x:y denotes [x...y]):



## Matrix Multiplication: Summarizing Read Subscripts

After loop interchange we have a grid shape, as in CUDA:

The third step is to summarize the subscripts of A and B read inside the loop of index k, for fixed ii, jj and kk (x:y denotes [x...y]):

- A subscripts [ii : MIN(ii+T-1,M), kk : MIN(kk+T-1,U)]
- B subscripts [kk : MIN(kk+T-1,U), jj : MIN(jj+T-1,N)]
- Summaries have size at most  $T^2$  & independent on i, j, and k  $\rightleftharpoons$  CUDA-block threads cooperatively copy-in data to shared mem!

## Block Tiled Matrix Multiplication CUDA Kernel

Shared memory padded with zeros to remove the branch from loop k!

```
DOALL ii = 0, M-1, T // grid.v
 DOALL jj = 0, N-1, T // grid.x
   DOALL i = ii, MIN(ii+T-1,M-1), 1
      DOALL j = jj, MIN(jj+T-1,N-1), 1
        float tmp = 0.0
        DO kk = 0, U-1, T
         //we would like to copy
         //to shared memory here
         //& use it inside loop k
          DO k = kk, MIN(kk+T-1,U-1), 1
            tmp += A[i,k]*B[k,j]
          ENDDO
        ENDDO
        C[i,j] = tmp;
ENDDO ENDDO ENDDO ENDDO
```



## Block Tiled Matrix Multiplication CUDA Kernel

Shared memory padded with zeros to remove the branch from loop k!

```
__global__ void matMultTiledKer( ... ) {
                                        __shared__ T Ash[T][T], Bsh[T][T];
                                        int ii = blockIdx.y * T; //blockDim.x==T
DOALL ii = 0, M-1, T // grid.y
                                        int jj = blockIdx.x * T; //blockDim.y==T
  DOALL jj = 0, N-1, T // grid.x
                                        int tidy = threadIdx.y, i = tidy+ii;
    DOALL i = ii, MIN(ii+T-1,M-1), 1
                                        int tidx = threadIdx.x, j = tidx+jj;
      DOALL j = jj, MIN(jj+T-1,N-1), 1
                                        float tmp = 0.0;
        float tmp = 0.0
        DO kk = 0, U-1, T
                                        for(int kk=0; kk<U; kk+=T) {</pre>
          //we would like to copy
                                           Ash[tidy][tidx] = (i < M & kk+tidx < U)?
          //to shared memory here
                                                         A[i*U + (kk+tidx)] : 0.0:
          //& use it inside loop k
                                          Bsh[tidy][tidx] = (j<N && kk_tidy<U) ?</pre>
          DO k = kk, MIN(kk+T-1,U-1), 1
                                                         B[(kk+tidy)*N + j] : 0.0;
            tmp += A[i,k]*B[k,j]
          ENDDO
                                          __syncthreads();
                                          for(int k=0; k<T; k++) {
        ENDDO
        C[i,j] = tmp;
                                             tmp += Ash[tidy][k] * Bsh[k][tidx]
                                           } __syncthreads();
ENDDO ENDDO ENDDO ENDDO
                                         } if (i < M && j < N) C[i * N + j] = tmp;
```

A global memory access amortized by (T-1) shared memory accesses.

### Measuring GFlops Performance

Sequential matrix multiplication  $\sim 2 \times M \times N \times U$  floating point operations. What is the GFlops performance of our implementation?

```
// CPU code
                                      template <int T> // KERNEL
int dimy = ceil( ((float)M) / T );
                                      __global__ void matMultTiledKer( ... ) {
int dimx = ceil( ((float)N) / T );
                                        __shared__ float Ash[T][T], Bsh[T][T];
dim3 block(T,T,1), grid(dimx,dimy,1);
                                        int ii = blockIdx.y * T; //blockDim.x==T
                                        int jj = blockIdx.x * T; //blockDim.y==T
                                        int tidy = threadIdx.y, i = tidy+ii;
unsigned long int elapsed;
struct timeval t_start,t_end,t_diff;
                                        int tidx = threadIdx.x, j = tidx+jj;
gettimeofday(&t_start, NULL);
                                        float tmp = 0.0;
  // ignoring generic shared mem problems
  matMultTiledKer<T><<<grid, block>>>
                                        for(int kk=0; kk<U; kk+=T) {</pre>
                                          Ash[tidy][tidx] = (i \le M \&\& kk+tidx \le U)?
            (d_A, d_B, d_C, U, M, N);
                                                         A[i*U + (kk+tidx)] : 0.0:
gettimeofday(&t_end, NULL);
                                          Bsh[tidy][tidx] = (j < N & kk_tidy < U) ?
timeval_subtract(&t_diff,
                                                         B[(kk+tidy)*N + j] : 0.0;
                 &t_end,&t_start);
                                          __syncthreads();
elapsed=(t_diff.tv_sec*1e6 +
                                          for(int k=0; k<T; k++) {
        t diff.tv usec):
                                            tmp += Ash[tidy][k] * Bsh[k][tidx
double flops = 2.0 * M * N * U;
                                          } __syncthreads();
double gigaFlops=(flops*1.0e-3f) /
                                        } if (i<M && j<N) C[i*N + j] = tmp;
                 elapsed;
```

- - Brute-Force Nearest Neighbor Case Study
  - Matrix Multiplication Case Study

Coalesced Accesses: Matrix Transposition Case Study



## **Matrix Transposition: Motivation**

```
A' = transpose(A)
// Non-Coalesced Memory Access
                                    DOALL i = 0 to N-1 // parallel
// Transposition to coalesce it \Rightarrow
                                      float accum = 0
DOALL i = 0 to N-1 // parallel
                                      DO j = 0, 63 // sequential
  float accum = 0
  DO j = 0, 63 // sequential
                                        float tmpA = A'[i, i]
                                        accum = accum*accum + tmpA*tmpA
    float tmpA = A[i, j]
                                        B'[i, i] = accum
    accum = accum*accum + tmpA*tmpA
                                      ENDDO
    B[i,j] = accum
  ENDDO
                                    ENDDO
                                    B = transpose(B')
ENDDO
```

The transformed program performs about three-times more accesses to global memory than the original.

But exhibits only coalesced accesses!

Coalesced Access: a (half) warp accesses in a SIMD instruction consecutive memory (word) locations.



## Transposition: Strip Mining, Interchange & Kernel

```
for(ii=0; ii<rowsA; ii+=T) {</pre>
//Both loops are parallel
//Strip mining & interchange⇒
                                          for(jj=0; jj<colsA; jj+=T) {</pre>
for(i = 0; i < rowsA; i++) {</pre>
                                             for(i=ii; i<min(ii+T,rowsA); i++) {</pre>
  for(j = 0; j < colsA; j++) {</pre>
                                               for(j=jj; j<min(jj+T,colsA); j++) {</pre>
    trA[j*rowsA+i] = A[i*colsA+j];
                                                 trA[j*rowsA+i] = A[i*colsA+j];
```



## Transposition: Strip Mining, Interchange & Kernel

```
//Both loops are parallel
//Strip mining & interchange⇒
for(i = 0; i < rowsA; i++) {</pre>
 for(j = 0; j < colsA; j++) {</pre>
   trA[j*rowsA+i] = A[i*colsA+j];
} }
__global__ void matTranspose(
       float* A, float* trA,
        int rowsA, int colsA ) {
 shared float tile[T][T+1]:
 int tidx = threadIdx.x;
  int tidy = threadIdx.y;
 int i = blockIdx.x*T + tidx;
 int i = blockIdx.y*T + tidy;
  if( j < colsA && i < rowsA )
   tile[tidy][tidx] = A[i*colsA+j];
 __syncthreads();
  i = blockIdx.y*T + threadIdx.x;
  j = blockIdx.x*T + threadIdx.y;
  if( j < colsA && i < rowsA )
   trA[j*rowsA+i] = tile[tidx][tidy];
```

- for(ii=0; ii<rowsA; ii+=T) {
   for(jj=0; jj<colsA; jj+=T) {
   for(i=ii; i<min(ii+T,rowsA); i++) {
   for(j=jj; j<min(jj+T,colsA); j++) {
   trA[j\*rowsA+i] = A[i\*colsA+j];
   } } }</pre>
  - Trick is to write the element of the symmetric thread in the same block.
  - What is the problem?



## Transposition: Strip Mining, Interchange & Kernel

```
//Both loops are parallel
                                     for(ii=0; ii<rowsA; ii+=T) {</pre>
//Strip mining & interchange⇒
                                       for(jj=0; jj<colsA; jj+=T) {</pre>
for(i = 0; i < rowsA; i++) {</pre>
  for(j = 0; j < colsA; j++) {</pre>
    trA[j*rowsA+i] = A[i*colsA+j];
} }
__global__ void matTranspose(
        float* A, float* trA,
        int rowsA, int colsA ) {
  __shared__ float tile[T][T+1];
  int tidx = threadIdx.x:
  int tidy = threadIdx.y;
  int j = blockIdx.x*T + tidx;
  int i = blockIdx.y*T + tidy;
  if( j < colsA && i < rowsA )
    tile[tidy][tidx] = A[i*colsA+j];
  __syncthreads();
  i = blockIdx.y*T + threadIdx.x;
  j = blockIdx.x*T + threadIdx.y;
  if( j < colsA && i < rowsA )
    trA[j*rowsA+i] = tile[tidx][tidy];
```

- Trick is to write the element of the symmetric thread in the same block.
- What is the problem?

for(i=ii; i<min(ii+T,rowsA); i++) {</pre>

for(j=jj; j<min(jj+T,colsA); j++) {</pre>

trA[j\*rowsA+i] = A[i\*colsA+j];

- Number of shared memory banks typically 16 or 32.
- T is also either 16 or  $32 \Rightarrow$
- 16 consecutive threads will read the same memory bank at the same time.
- cSolution; striller[T] [Tt21];

#### **Lessons Learned So Far**

- Tiled transposition is about  $2\times$  faster than the naive version,
- but the motivating example runs much faster than that when transposition coalesces accesses to arrays A and B. Why?



#### Lessons Learned So Far

- Tiled transposition is about  $2\times$  faster than the naive version,
- but the motivating example runs much faster than that when transposition coalesces accesses to arrays A and B. Why?
- Better to eliminate rather than hide latency. Impact of hardware multi-threading limited by the amount of available resources!



## Constant (Read-Only) Memory in CUDA

- 64KB of \_\_constant\_\_ memory on device (global/slow), cached in each multiprocessor, e.g., 8KB (fast).
- May reduce the required memory bandwidth:
  - if found in cache, then no extra traffic,
  - if a (half) warp accesses the same location and misses in cache ⇒ only one request is sent and the result is broadcast back to all,
  - serialized accesses if a warp of threads read different locations!
  - latency can range from one to hundreds of cycles.
- Best Use: when an entire block accesses the same location in the same SIMD instruction: even on a miss, the first warp brings the data in cache @ minimal traffic, the rest find it in cache.

```
// C in __constant__ memory: Bad!
// C in __constant__ memory: Good!
                                   D0 i = 1, N, 1 // grid
DO i = 1, N, 1 // grid
                                     DO j = 1, M, 1 // block(s)
 DO j = 1, M, 1 // block(s)
                                       A[i,j] = A[i,j] % C[j]
   A[i,j] = A[i,j] % C[i]
                                   ENDDO
ENDDO
                                   // Either global memory or loop interchange
```

36 / 36