### 6.894 Accelerated Computing Lecture 5: Memory, Overlapping Compute & I/O

Jonathan Ragan-Kelley IIII

#### L1 SRAM

128 KB per-SM  $(\times 48 \text{ SMs} = 6 \text{ MB})$ 

128 bytes / cycle / SM

1 warp-wide ld/st
(Per-core: 1 every 4 cycles)

 $\times 48 \text{ SMs} \times 2.18 \text{GHz} = 13.4 \text{ TB/s}$ 



#### L1 SRAM: also used as explicit scratchpad

Scratchpad Address Space

Each block (SM) only sees its own scratchpad

### Tradeoffs: scratchpads vs. caches

- + No tag / lookup overhead
- + Predictable / controllable
- + No need for coherence
- + Read >1 line / cycle

- Programming effort
- Software depends on size

#### How is the L1 SRAM built? How can it deliver 128 B / cycle? Parallelism!



128 KB

32 parallel "banks"

(addresses striped % 32 × 4 Bytes)

### gather

32 × 4 Bytes / cycle



32 × 4 Bytes / cycle scatter

### How can we handle gather/scatter in the L1 SRAM?



### How can we handle gather/scatter in the L1 SRAM? All-to-all crossbar!



#### L1 bank conflicts





high **bandwidth**, limited **capacity** 

high clocks & wide interface

**aggregate** large transactions for DRAM

streaming

access

large-scale **reuse** 

shared per-SM, **not coherent** 

high **bandwidth** via **banking** 

cache or **scratchpad** 



high **bandwidth**, limited **capacity** 

high clocks & wide interface

**aggregate** large transactions for DRAM

streaming

access

large-scale **reuse** 

shared per-SM, **not coherent** 

high **bandwidth** via **banking** 

cache or scratchpad

### Register file: parallel banks per-lane



no r[3] (or r[i]), only r3
explicit shfl instruction

## Data can be shuffled between lanes using the L1 crossbar





high **bandwidth**, limited **capacity** 

high clocks & wide interface

**aggregate** large transactions for DRAM

streaming

access

large-scale **reuse** 

shared per-SM, not coherent
high bandwidth via banking

cache or **scratchpad** 

banked per-lane

no dynamic indexing, either across or within lanes

"infinite" BW

# Overlapping compute & I/O

# **Key facets of a processor:**Control, Compute, Memory



Goal: fully utilize both resources

#### Use compute & memory in parallel



Goal: fully utilize both resources

Approach 1: "CPU-style" wide issue, out-of-order parallelism within instruction stream

Approach 2: "GPU-style" multithreading parallelism across instruction streams

### Both strategies work for common workloads



```
ld
       r3, mem[r0+r2]
ld
       r4, mem[r1+r2]
            r3, mem[r0+r2]
     ld
mul
add
     ld
            r4, mem[r1+r2]
add:
          ld
     mul
                 r3, mem[r0+r2]
blt
     add
                 r4, mem[r1+r2]
st
     add:
          mul
                 r3, r3, r4
                 r5, r5, r3
          add
          addi
                 r2, r2, 4
     st
                 r2, $400, LOOP
          blt
```

# What about workloads like matrix multiplication?

```
lots of
        reuse
core loop
              compute a bunch
load a
bunch
```

```
// load into scratchpad
for i,j:
  load A,B global → scratch

for each microtile:
  // load into registers
```

```
// load into registers
for i,k:
  load A scratch → reg
for k,j:
  load B scratch → reg
```

```
// compute!
for i,j,k:
  compute C += A*B
```



### What about a throughput processor (GPU)?

- single issue
- in-order
- "RISC"



more manual & explicit management of overlapping

> recurs at many levels

### Problem 1: load / store instructions are asynchronous & long-latency

Task 1

Runnable

Task 1

Solution 1: ILP hoist loads early to avoid blocking



Solution 2: multithreading switch from tasks

Task 2

blocked on memory to one ready to compute





### Problem 2: load / store instructions waste issue slots

Solution: bulk load / store instructions e.g., "vectorized" ld / st

```
1d.f32
1d.f32
1d.f32
1d.f32
fma
fma
fma
fma
fma
fma
fma
fma
fma
...
```

# Problem 3: overlapping compute with loading to the scratchpad

# Solution: asynchronous fetch & double-buffering



### Implementation approach: warp specialization

```
if threadIdx.y < 4:
    // load into scratchpad
    for i,j:
        load next A,B → scratch
else:
    // compute!
    for i,j,k:
        compute C += A*B
sync & swap buffers...</pre>
```