# Overlapping compute & I/O





# How do **GPU** ↔ **Host copies** happen? Does it use the processor cores (SMs)?



### CPU code:

```
gpu_memcpy_host_device(cpu_buf, gpu_buf);
my_kernel<<<blocks, threads>>>(args...););
gpu_memcpy_device_host(gpu_buf, cpu_buf);
```

# How do **GPU** ↔ **Host copies** happen? Does it use the processor cores (SMs)?



No!
Dedicated DMA
can run fully in parallel
with kernel computation

### How does it synchronize these operations?



# How can we **overlap** implicitly-ordered operations?



# Multiple command queues ("streams") allow parallelism & overlapping



# Overlapping compute & data movement is essential for performance

**Recurs** at all **levels** of memory hierarchy and in many **different accelerators** 

Common design patterns (double buffering, async DMA...)

# 6.5894 Accelerated Computing Lecture 6: Energy & Specialization

Jonathan Ragan-Kelley Hiir

# Throughput processor eliminates most control overhead



# Throughput processor eliminates most control overhead

> most silicon goes to useful work

What about energy?



# Off-chip DRAM (GBs)

LPDDR

640 pJ / word (32-bits)

Moving data on-chip: ->

3.2 pJ / word-mm

On-chip SRAM (MBs)

Local SRAM (KBs)

ALU (32-bit FMA) **50 pJ** / word

5 pJ / word

**1.2 pJ / FLOP** 

50C (~100mm<sup>2</sup>)

Data from Bill Dally
14nm foundry process





Data from Bill Dally
14nm foundry process

### **Operation Energy**

### Data movement

Cost for loading a 32-bit word

Load from DRAM 640 pJ

Load from large SRAM 50 pJ

Move 10mm across chip 32 pJ

Load from local SRAM 5 pJ

Compute

**64-bit FMA** 5 pJ

**32-bit FMA** 1.2 pJ

**16-bit IMUL** 0.26 pJ

**8-bit IADD** 0.01 pJ

100X

Data from Bill Dally
14nm foundry process

# Data supply dwarfs computation for primitive ALU operations

| Operation   | Data Supply Overhead |
|-------------|----------------------|
| 64-bit FMA  | 72%                  |
| 32-bit FMA  | 84%                  |
| 16-bit FMA  | 91%                  |
| 8-bit IMAC  | 96%                  |
| 32-bit IADD | 99%                  |

To be efficient, we need to reduce data supply cost

# Amortize data-supply with more complex instructions



e.g., AES, video encode/decode, DSP, texture filtering, . . .

"ASIC-in-an-Instruction"

### CISC: x86 memory operands



### Complex memory operations

12fetch \*base, width,
 height, stride





Qualcomm Hexagon DSP / NPU

### Complex memory operations

12fetch \*base, width,
 height, stride





Qualcomm Hexagon DSP / NPU

#### **NVIDIA H100**

**Tensor Memory Accelerator** 

cp.async

→ cp.async.bulk.tensor



### Texture mapping

### One instruction:

- computes many addresses
- loads
- blends results



mip-map texels: level d+1



mip-map texels: level d

### "Tri-linear" filtering



$$lerp(t, v_1, v_2) = v_1 + t(v_2 - v_1)$$

Bilinear resampling: four texel reads 3 lerps (3 mul + 6 add)

Trilinear resampling:
eight texel reads
7 lerps (7 mul + 14 add)

# The quintessential complex instruction: Matrix block multiply accumulate



**Examples:** ARM SME **AVX VNNI** Intel AMX Google TPU Neural Engine Hexagon NPU

. . .

# **Tensor Cores MMA on NVIDIA GPUs**

### PTX:

mma.sync.aligned.m16n8k8

### Hardware (SASS):

HMMA.1688.F32.TF32





### MMA operands packed into registers

|   | U   | 1   | 2   | 3   | 4   | 5   | 6   | 1   |
|---|-----|-----|-----|-----|-----|-----|-----|-----|
|   | 8   | 9   | 10  | 11  | 12  | 13  | 14  | 15  |
|   | 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  |
|   | 24  | 25  | 26  | 27  | 28  | 29  | 30  | 31  |
|   | 32  | 33  | 34  | 35  | 36  | 37  | 38  | 39  |
|   | 40  | 41  | 42  | 43  | 44  | 45  | 46  | 47  |
|   | 48  | 49  | 50  | 51  | 52  | 53  | 54  | 55  |
|   | 56  | 57  | 58  | 59  | 60  | 61  | 62  | 63  |
| U | 64  | 65  | 66  | 67  | 68  | 69  | 70  | 71  |
|   | 72  | 73  | 74  | 75  | 76  | 77  | 78  | 79  |
|   | 80  | 81  | 82  | 83  | 84  | 85  | 86  | 87  |
|   | 88  | 89  | 90  | 91  | 92  | 93  | 94  | 95  |
|   | 96  | 97  | 98  | 99  | 100 | 101 | 102 | 103 |
|   | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 |
|   | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 |
|   | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 |
|   |     |     |     |     |     |     |     |     |

| 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   |
|-----|-----|-----|-----|-----|-----|-----|-----|
| 8   | 9   | 10  | 11  | 12  | 13  | 14  | 15  |
| 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  |
| 24  | 25  | 26  | 27  | 28  | 29  | 30  | 31  |
| 32  | 33  | 34  | 35  | 36  | 37  | 38  | 39  |
| 40  | 41  | 42  | 43  | 44  | 45  | 46  | 47  |
| 48  | 49  | 50  | 51  | 52  | 53  | 54  | 55  |
| 56  | 57  | 58  | 59  | 60  | 61  | 62  | 63  |
| 64  | 65  | 66  | 67  | 68  | 69  | 70  | 71  |
| 72  | 73  | 74  | 75  | 76  | 77  | 78  | 79  |
| 80  | 81  | 82  | 83  | 84  | 85  | 86  | 87  |
| 88  | 89  | 90  | 91  | 92  | 93  | 94  | 95  |
| 96  | 97  | 98  | 99  | 100 | 101 | 102 | 103 |
| 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 |
| 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 |
| 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 |

|   |   | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  |
|---|---|----|----|----|----|----|----|----|----|
|   |   | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
|   |   | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| V | R | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|   | D | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
|   |   | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
|   |   | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 |
|   |   | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |
|   |   |    |    |    |    |    |    |    |    |

#### Ampere TF32 16x8x8 Tensor Core 'A' Matrix Layout

#### Logical Layout

| 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   |
|-----|-----|-----|-----|-----|-----|-----|-----|
| 8   | 9   | 10  | 11  | 12  | 13  | 14  | 15  |
| 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  |
| 24  | 25  | 26  | 27  | 28  | 29  | 30  | 31  |
| 32  | 33  | 34  | 35  | 36  | 37  | 38  | 39  |
| 40  | 41  | 42  | 43  | 44  | 45  | 46  | 47  |
| 48  | 49  | 50  | 51  | 52  | 53  | 54  | 55  |
| 56  | 57  | 58  | 59  | 60  | 61  | 62  | 63  |
| 64  | 65  | 66  | 67  | 68  | 69  | 70  | 71  |
| 72  | 73  | 74  | 75  | 76  | 77  | 78  | 79  |
| 80  | 81  | 82  | 83  | 84  | 85  | 86  | 87  |
| 88  | 89  | 90  | 91  | 92  | 93  | 94  | 95  |
| 96  | 97  | 98  | 99  | 100 | 101 | 102 | 103 |
| 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 |
| 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 |
| 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 |

#### Register Layout



### Ampere TF32 16x8x8 Tensor Core 'B' Matrix Layout

#### **Logical Layout**

| 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  |
|----|----|----|----|----|----|----|----|
| 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
| 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
| 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
| 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 |
| 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 |
| 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 |

#### Register Layout



### Ampere TF32 16x8x8 Tensor Core 'C' Matrix Layout

#### **Logical Layout**

| 0   | 1   | 2   | 3   | 4   | 5   | 6   | 7   |
|-----|-----|-----|-----|-----|-----|-----|-----|
| 8   | 9   | 10  | 11  | 12  | 13  | 14  | 15  |
| 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  |
| 24  | 25  | 26  | 27  | 28  | 29  | 30  | 31  |
| 32  | 33  | 34  | 35  | 36  | 37  | 38  | 39  |
| 40  | 41  | 42  | 43  | 44  | 45  | 46  | 47  |
| 48  | 49  | 50  | 51  | 52  | 53  | 54  | 55  |
| 56  | 57  | 58  | 59  | 60  | 61  | 62  | 63  |
| 64  | 65  | 66  | 67  | 68  | 69  | 70  | 71  |
| 72  | 73  | 74  | 75  | 76  | 77  | 78  | 79  |
| 80  | 81  | 82  | 83  | 84  | 85  | 86  | 87  |
| 88  | 89  | 90  | 91  | 92  | 93  | 94  | 95  |
| 96  | 97  | 98  | 99  | 100 | 101 | 102 | 103 |
| 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 |
| 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 |
| 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 |

#### Register Layout



# Tensor Cores MMA on NVIDIA GPUs

1 instruction10 register operands2048 FLOPs / 8 cycles



# MMA instructions amortize energy overhead

| Instruction     | <b>Ops</b> | Control & data overhead |
|-----------------|------------|-------------------------|
| HMMA<br>(fp16)  | 128        | 19%                     |
| IMMA<br>(int 8) | 1024       | 12%                     |

**NVIDIA Tensor Cores** 

data from Bill Dally

H100

Vector: 67 TF

MMA: 990 TF

Vector < 7% peak

## Matrix multiply has high <u>arithmetic intensity</u>

$$+=n$$

**FLOPS:** 2*n*<sup>3</sup>

Entries:  $3n^2$ 

> ratio grows with n

$$\frac{\text{arithmetic intensity} = \frac{\text{work (ops)}}{\text{data (bytes)}}$$

**Arithmetic intensity** is what makes matrix multiply special

All "acceleratable computations" have high arithmetic intensity

# Questions?