



### Recap

- Storage technologies and trends
- Locality of reference
- Caching in the memory hierarchy

## Recap: Traditional Bus Structure Connecting CPU and Memory

- A bus is a collection of parallel wires that carry address, data, and control signals.
- Buses are typically shared by multiple devices.



## Recap: I/O Bus



### Recap: Disk Access Time

- Average time to access some target sector approximated by:
  - Taccess = Tavg seek + Tavg rotation + Tavg transfer
- Seek time (Tavg seek)
  - Time to position heads over cylinder containing target sector.
  - Typical Tavg seek is 3-9 ms
- Rotational latency (Tavg rotation)
  - Time waiting for first bit of target sector to pass under r/w head.
  - Tavg rotation =  $1/2 \times 1/RPMs \times 60 sec/1 min$
  - Typical rotational rate = 7,200 RPM
- Transfer time (Tavg transfer)
  - Time to read the bits in the target sector.
  - Tavg transfer =  $1/RPM \times 1/(avg \# sectors/track) \times 60 secs/1 min.$

### Recap: Disk Access Time Example

#### Given:

- Rotational rate = 7,200 RPM
- Average seek time = 9 ms.
- Avg # sectors/track = 400.

#### Derived:

- Tavg rotation =  $1/2 \times (60 \text{ secs}/7200 \text{ RPM}) \times 1000 \text{ ms/sec} \approx 4 \text{ ms}$
- Tavg transfer =  $(60 \text{ secs}/7200 \text{ RPM}) \times 1/(400 \text{ sectors/track}) \times 1000 \text{ ms/sec} \approx 0.02 \text{ ms}$
- Taccess = 9 ms + 4 ms + 0.02 ms = 13.02 ms
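
As a minimal sketch of the arithmetic above, the following C functions evaluate the access-time formula. The function names are illustrative and not part of the lecture. The slides round the rotational latency to 4 ms; the unrounded value is about 4.17 ms.

```c
/* Average rotational latency in ms: half of one revolution. */
double avg_rotation_ms(double rpm) {
    return 0.5 * (60.0 / rpm) * 1000.0;
}

/* Average transfer time in ms: one sector's share of one revolution. */
double avg_transfer_ms(double rpm, double sectors_per_track) {
    return (60.0 / rpm) * (1.0 / sectors_per_track) * 1000.0;
}

/* Taccess = Tavg seek + Tavg rotation + Tavg transfer (all in ms). */
double access_time_ms(double seek_ms, double rpm, double sectors_per_track) {
    return seek_ms + avg_rotation_ms(rpm)
                   + avg_transfer_ms(rpm, sectors_per_track);
}
```

For the example disk, `access_time_ms(9.0, 7200.0, 400.0)` comes to roughly 13.19 ms without rounding, of which the transfer time is only about 0.02 ms.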

#### Important points:

- Access time dominated by seek time and rotational latency.
- First bit in a sector is the most expensive, the rest are free.
- SRAM access time is about 4 ns/doubleword, DRAM about 60 ns
  - Disk is about 40,000 times slower than SRAM,
  - 2,500 times slower than DRAM.

### Recap: Solid State Disks (SSDs)



- Pages: 512 B to 4 KB, Blocks: 32 to 128 pages
- Data read/written in units of pages.
- Page can be written only after its block has been erased
- A block wears out after about 100,000 repeated writes.

## Recap: SSD Tradeoffs vs Rotating Disks

#### Advantages

- No moving parts: faster, less power, more rugged

#### Disadvantages

- Have the potential to wear out
  - Mitigated by "wear leveling logic" in flash translation layer
  - E.g. Intel SSD 730 guarantees 128 petabytes (128 x 10<sup>15</sup> bytes) of writes before it wears out
- In 2015, about 30 times more expensive per byte

#### Applications

- MP3 players, smart phones, laptops
- Beginning to appear in desktops and servers

### Recap: The CPU-Memory Gap

The gap widens between DRAM, disk, and CPU speeds.



Figure series: disk seek time, SSD access time, DRAM access time, SRAM access time, CPU cycle time, effective CPU cycle time.

Until 2003, DRAM and disk access times were decreasing more slowly than the cycle time of a processor.

Today, with the introduction of multiple cores, this performance gap is now more and more a function of throughput, with multiple processor cores issuing requests to the DRAM and disk in parallel.

## Recap: Locality

- Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently.

#### Temporal locality:

 Recently referenced items are likely to be referenced again in the near future



#### Spatial locality:

 Items with nearby addresses tend to be referenced close together in time



### Recap: Locality Example

```
sum = 0;
for (i = 0; i < n; i++)
  sum += a[i];
return sum;
```

#### Data references

- Reference array elements in succession (stride-1 reference pattern): **spatial locality**
- Reference variable sum each iteration: **temporal locality**

#### Instruction references

- Reference instructions in sequence: **spatial locality**
- Cycle through loop repeatedly: **temporal locality**

### Plan for Today

- Cache memory organization and operation
- Performance impact of caches
  - The memory mountain
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

**Disclaimer:** Slides for this lecture were borrowed from Randal E. Bryant and David R. O'Hallaron's CMU 15-213 class.

## Example Memory Hierarchy



## Examples of Caching in the Mem. Hierarchy

| Cache Type           | What is Cached?      | Where is it Cached? | Latency (cycles) | Managed By       |
|----------------------|----------------------|---------------------|------------------|------------------|
| Registers            | 4-8 byte words       | CPU core            | 0                | Compiler         |
| TLB                  | Address translations | On-Chip TLB         | 0                | Hardware MMU     |
| L1 cache             | 64-byte blocks       | On-Chip L1          | 4                | Hardware         |
| L2 cache             | 64-byte blocks       | On-Chip L2          | 10               | Hardware         |
| Virtual Memory       | 4-KB pages           | Main memory         | 100              | Hardware + OS    |
| Buffer cache         | Parts of files       | Main memory         | 100              | OS               |
| Disk cache           | Disk sectors         | Disk controller     | 100,000          | Disk firmware    |
| Network buffer cache | Parts of files       | Local disk          | 10,000,000       | NFS client       |
| Browser cache        | Web pages            | Local disk          | 10,000,000       | Web browser      |
| Web cache            | Web pages            | Remote server disks | 1,000,000,000    | Web proxy server |

### Cache Memories

- Cache memories are small, fast SRAM-based memories managed automatically in hardware
  - Hold frequently accessed blocks of main memory
- CPU looks first for data in cache
- Typical system structure:



### General Cache Concepts



### General Cache Concepts: Hit



### General Cache Concepts: Miss



### Types of Cache Misses

#### Cold (compulsory) miss

Cold misses occur because the cache is empty.

#### Conflict miss

- Most caches limit blocks at level k+1 to a small subset (sometimes a singleton)
  of the block positions at level k.
  - E.g. Block i at level k+1 must be placed in block (i mod 4) at level k.
- Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
  - E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
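
The conflict-miss example above can be sketched in a few lines of C, assuming a level-k cache with 4 block slots and the (i mod 4) placement rule. The function name `count_misses` and the constant `NUM_SLOTS` are illustrative, and block numbers are assumed to be non-negative.

```c
#include <stddef.h>

#define NUM_SLOTS 4   /* level-k cache holds 4 blocks, as in the (i mod 4) example */

/* Count misses when block i may only be placed in slot (i mod NUM_SLOTS). */
int count_misses(const int *blocks, size_t n) {
    int slot[NUM_SLOTS];
    int misses = 0;

    for (int s = 0; s < NUM_SLOTS; s++)
        slot[s] = -1;                  /* -1 marks an empty slot */

    for (size_t t = 0; t < n; t++) {
        int s = blocks[t] % NUM_SLOTS;
        if (slot[s] != blocks[t]) {    /* cold or conflict miss */
            misses++;
            slot[s] = blocks[t];       /* evict whatever was in the slot */
        }
    }
    return misses;
}
```

With the reference stream 0, 8, 0, 8, 0, 8 every access misses, because blocks 0 and 8 both map to slot 0 and keep evicting each other even though three slots stay empty. The stream 0, 1, 0, 1, 0, 1 takes only the two cold misses.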

#### Capacity miss

 Occurs when the set of active cache blocks (working set) is larger than the cache.

## General Cache Organization (S, E, B)



### Cache Read



### Example: Direct Mapped Cache (E = 1)

Direct mapped: One line per set

Assume: cache block size 8 bytes




If tag doesn't match: old line is evicted and replaced

### Direct-Mapped Cache Simulation

| t=1 | s=2 | b=1 |
|-----|-----|-----|
| X   | XX  | X   |

M=16 bytes (4-bit addresses), B=2 bytes/block, S=4 sets, E=1 Blocks/set



Address trace (reads, one byte per read):

| Address | Binary | Result |
|---------|--------|--------|
| 0 | [0 <u>00</u> 0]<sub>2</sub> | miss |
| 1 | [0 <u>00</u> 1]<sub>2</sub> | hit  |
| 7 | [0 <u>11</u> 1]<sub>2</sub> | miss |
| 8 | [1 <u>00</u> 0]<sub>2</sub> | miss |
| 0 | [0 <u>00</u> 0]<sub>2</sub> | miss |
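
The trace above can be reproduced with a small simulator. This is a sketch under the slide's parameters (4-bit addresses, t=1, s=2, b=1); the type `line_t` and the function `simulate_direct_mapped` are illustrative names, not part of the lecture code.

```c
#include <stdbool.h>

#define B_BITS 1                  /* b = 1: 2 bytes per block */
#define S_BITS 2                  /* s = 2: 4 sets */
#define NUM_SETS (1 << S_BITS)

typedef struct {
    bool valid;
    unsigned tag;
} line_t;

/* Simulate one-byte reads on a direct-mapped cache (E = 1).
 * hit[i] is set to true if trace[i] hits, false if it misses. */
void simulate_direct_mapped(const unsigned *trace, int n, bool *hit) {
    line_t cache[NUM_SETS] = {0};     /* all lines start invalid */

    for (int i = 0; i < n; i++) {
        unsigned set = (trace[i] >> B_BITS) & (NUM_SETS - 1);
        unsigned tag = trace[i] >> (B_BITS + S_BITS);
        line_t *line = &cache[set];

        hit[i] = line->valid && line->tag == tag;
        if (!hit[i]) {                /* miss: evict and replace the line */
            line->valid = true;
            line->tag = tag;
        }
    }
}
```

Addresses 0 and 8 share set 0 but carry different tags, so the final read of address 0 misses again: a conflict miss.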

### E-way Set Associative Cache (Here: E = 2)

E = 2: Two lines per set

Assume: cache block size 8 bytes




#### No match:

- One line in set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), ...

### 2-Way Set Associative Cache Simulation

M=16 bytes (4-bit addresses), B=2 bytes/block, S=2 sets, E=2 blocks/set

Address trace (reads, one byte per read):

|       | V | Tag | Block  |
|-------|---|-----|--------|
| Set 0 | 1 | 00  | M[0-1] |
|       | 1 | 10  | M[8-9] |
|       |   |     |        |
| Set 1 | 1 | 01  | M[6-7] |
|       | 0 |     |        |
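
The following sketch simulates this 2-way configuration (t=2, s=1, b=1) with LRU replacement. It assumes the same five reads as the direct-mapped simulation (0, 1, 7, 8, 0), which is an assumption since the trace itself is not listed here; that trace does reproduce the cache contents in the table above. The names `line_t` and `simulate_two_way` are illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

#define B_BITS 1                  /* b = 1: 2 bytes per block */
#define S_BITS 1                  /* s = 1: 2 sets */
#define NUM_SETS (1 << S_BITS)
#define WAYS 2                    /* E = 2 lines per set */

typedef struct {
    bool valid;
    unsigned tag;
    unsigned last_used;           /* timestamp used for LRU */
} line_t;

/* Simulate one-byte reads on a 2-way set associative cache with LRU. */
void simulate_two_way(const unsigned *trace, int n, bool *hit) {
    line_t cache[NUM_SETS][WAYS] = {0};

    for (int i = 0; i < n; i++) {
        unsigned set = (trace[i] >> B_BITS) & (NUM_SETS - 1);
        unsigned tag = trace[i] >> (B_BITS + S_BITS);
        line_t *target = NULL;

        /* Look for a valid line with a matching tag. */
        for (int w = 0; w < WAYS; w++) {
            line_t *line = &cache[set][w];
            if (line->valid && line->tag == tag)
                target = line;
        }
        hit[i] = (target != NULL);

        if (!hit[i]) {
            /* Pick a victim: first empty line, else least recently used. */
            target = &cache[set][0];
            for (int w = 1; w < WAYS; w++) {
                line_t *line = &cache[set][w];
                if (target->valid &&
                    (!line->valid || line->last_used < target->last_used))
                    target = line;
            }
            target->valid = true;
            target->tag = tag;
        }
        target->last_used = (unsigned)i + 1;
    }
}
```

Under that assumed trace, the last read of address 0 now hits: blocks M[0-1] and M[8-9] can share set 0, so the conflict seen in the direct-mapped cache disappears.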

# Question Break

### What about writes?

- Multiple copies of data exist:
  - L1, L2, L3, Main Memory, Disk
- What to do on a write-hit?
  - Write-through (write immediately to memory)
  - Write-back (defer write to memory until replacement of line)
    - Need a dirty bit (line different from memory or not)
- What to do on a write-miss?
  - Write-allocate (load into cache, update line in cache)
    - Good if more writes to the location follow
  - No-write-allocate (writes straight to memory, does not load into cache)
- Typical
  - Write-through + No-write-allocate
  - Write-back + Write-allocate

### Intel Core i7 Cache Hierarchy

#### Processor package



#### L1 i-cache and d-cache:

- 32 KB, 8-way
- Access: 4 cycles

#### L2 unified cache:

- 256 KB, 8-way
- Access: 10 cycles

#### L3 unified cache:

- 8 MB, 16-way
- Access: 40-75 cycles

Block size: 64 bytes for all caches.
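
Given the (S, E, B) organization, the number of sets follows from the cache size, assuming the quoted sizes count data bytes only so that C = S x E x B. A minimal sketch, with the function name `num_sets` chosen for illustration:

```c
/* Number of sets S for a cache of size_bytes data bytes,
 * with `ways` lines per set (E) and `block_bytes` per line (B). */
unsigned long num_sets(unsigned long size_bytes,
                       unsigned long ways,
                       unsigned long block_bytes) {
    return size_bytes / (ways * block_bytes);
}
```

Under that assumption the figures above give 64 sets for each L1 cache, 512 sets for L2, and 8192 sets for L3.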

### Cache Performance Metrics

#### Miss Rate

- Fraction of memory references not found in cache (misses / accesses)
   = 1 - hit rate
- Typical numbers (in percentages):
  - 3-10% for L1
  - can be quite small (e.g., < 1%) for L2, depending on size, etc.

#### Hit Time

- Time to deliver a line in the cache to the processor
  - includes time to determine whether the line is in the cache
- Typical numbers:
  - 4 clock cycles for L1
  - 10 clock cycles for L2

#### Miss Penalty

- Additional time required because of a miss
  - typically 50-200 cycles for main memory (Trend: increasing!)

### Let's think about those numbers

- Huge difference between a hit and a miss
  - Could be 100x, if just L1 and main memory
- Would you believe 99% hits is twice as good as 97%?
  - Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
  - Average access time:
    - 97% hits: 1 cycle + 0.03 \* 100 cycles = 4 cycles
    - 99% hits: 1 cycle + 0.01 \* 100 cycles = 2 cycles
- This is why "miss rate" is used instead of "hit rate"
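
The average access time used in this comparison is hit time plus miss rate times miss penalty. A minimal sketch, with an illustrative function name:

```c
/* Average access time in cycles:
 * hit_time + miss_rate * miss_penalty. */
double avg_access_cycles(double hit_time, double miss_rate,
                         double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

Calling it with `(1.0, 0.03, 100.0)` and `(1.0, 0.01, 100.0)` reproduces the 4-cycle and 2-cycle figures above: a two-point change in miss rate halves the average access time.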

### Writing Cache Friendly Code

- Make the common case go fast
  - Focus on the inner loops of the core functions
- Minimize the misses in the inner loops
  - Repeated references to variables are good (temporal locality)
  - Stride-1 reference patterns are good (spatial locality)

**Key idea:** Our qualitative notion of locality is quantified through our understanding of cache memories

### Lecture Plan

- Cache organization and operation
- Performance impact of caches
  - The memory mountain
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

### The Memory Mountain

- Read throughput (read bandwidth)
  - Number of bytes read from memory per second (MB/s)

- **Memory mountain:** Measured read throughput as a function of spatial and temporal locality.
  - Compact way to characterize memory system performance.

### Memory Mountain Test Function

```
long data[MAXELEMS]; /* Global array to traverse */

/* test - Iterate over first "elems" elements of
 *        array "data" with stride of "stride",
 *        using 4x4 loop unrolling.
 */
int test(int elems, int stride) {
    long i, sx2=stride*2, sx3=stride*3, sx4=stride*4;
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    long length = elems, limit = length - sx4;

    /* Combine 4 elements at a time */
    for (i = 0; i < limit; i += sx4) {
        acc0 = acc0 + data[i];
        acc1 = acc1 + data[i+stride];
        acc2 = acc2 + data[i+sx2];
        acc3 = acc3 + data[i+sx3];
    }

    /* Finish any remaining elements */
    for (; i < length; i++) {
        acc0 = acc0 + data[i];
    }
    return ((acc0 + acc1) + (acc2 + acc3));
}
```

Call test() with many combinations of elems and stride.

For each elems and stride:

1. Call test() once to warm up the caches.
2. Call test() again and measure the read throughput (MB/s).

mountain/mountain.c

### The Memory Mountain



Core i7 Haswell, 2.1 GHz: 32 KB L1 d-cache, 256 KB L2 cache, 8 MB L3 cache, 64 B block size

### Lecture Plan

- Cache organization and operation
- Performance impact of caches
  - The memory mountain
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

### Matrix Multiplication Example

- Description:
  - Multiply N x N matrices
  - Matrix elements are doubles (8 bytes)
  - O(N³) total operations
  - N reads per source element
  - N values summed per destination
    - but may be able to hold in register

```
/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
```

matmult/mm.c

# Miss Rate Analysis for Matrix Multiply

#### Assume

- Block size = 32B (big enough for four doubles)
- Matrix dimension (N) is very large
  - Approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows

#### Analysis Method:

Look at access pattern of inner loop



# Layout of C Arrays in Memory (review)

- C arrays allocated in row-major order
  - each row in contiguous memory locations
- Stepping through columns in one row:

```
for (i = 0; i < N; i++)
sum += a[0][i];
```

- accesses successive elements
- if block size (B) > sizeof(a<sub>ij</sub>) bytes, exploit spatial locality: miss rate = sizeof(a<sub>ij</sub>) / B

- Stepping through rows in one column:

```
for (i = 0; i < n; i++)
sum += a[i][0];
```

- accesses distant elements
- no spatial locality! miss rate = 1 (i.e. 100%)
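
These two miss rates can be checked with a tiny model that remembers only the most recently touched block, which matches the assumption that the cache cannot hold multiple rows. This is a sketch: the function `miss_rate` is illustrative, and the array is assumed to start on a block boundary.

```c
#include <stddef.h>

/* Fraction of `count` accesses that touch a different block than the
 * previous access. Accesses start at byte offset `start` and advance
 * by `step_bytes` each time. */
double miss_rate(size_t start, size_t step_bytes, size_t count,
                 size_t block_bytes) {
    size_t misses = 0;
    size_t last_block = (size_t)-1;    /* no block cached yet */

    for (size_t i = 0; i < count; i++) {
        size_t block = (start + i * step_bytes) / block_bytes;
        if (block != last_block) {
            misses++;
            last_block = block;
        }
    }
    return (double)misses / (double)count;
}
```

For a 1024 x 1024 array of doubles and 32-byte blocks, walking a row uses a step of 8 bytes and gives a miss rate of 8/32 = 0.25, while walking a column uses a step of 1024 * 8 bytes and misses on every access.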

### Matrix Multiplication (ijk)

```
/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
```

matmult/mm.c

Inner loop: A (i,\*) row-wise, B (\*,j) column-wise, C (i,j) fixed

Misses per inner loop iteration:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 0.25     | 1.0      | 0.0      |

### Matrix Multiplication (jik)

```
/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
```

matmult/mm.c

Inner loop: A (i,\*) row-wise, B (\*,j) column-wise, C (i,j) fixed

Misses per inner loop iteration:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 0.25     | 1.0      | 0.0      |

### Matrix Multiplication (kij)

```
/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}
```

matmult/mm.c

Inner loop: A (i,k) fixed, B (k,\*) row-wise, C (i,\*) row-wise

Misses per inner loop iteration:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 0.0      | 0.25     | 0.25     |

### Matrix Multiplication (ikj)

```
/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}
```

matmult/mm.c

Inner loop: A (i,k) fixed, B (k,\*) row-wise, C (i,\*) row-wise

Misses per inner loop iteration:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 0.0      | 0.25     | 0.25     |

### Matrix Multiplication (jki)

```
/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}
```

matmult/mm.c

Inner loop: A (\*,k) column-wise, B (k,j) fixed, C (\*,j) column-wise

#### Misses per inner loop iteration:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 1.0      | 0.0      | 1.0      |

### Matrix Multiplication (kji)

```
/* kji */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}
```

matmult/mm.c

Inner loop: A (\*,k) column-wise, B (k,j) fixed, C (\*,j) column-wise

#### Misses per inner loop iteration:

| <u>A</u> | <u>B</u> | <u>C</u> |
|----------|----------|----------|
| 1.0      | 0.0      | 1.0      |

### Summary of Matrix Multiplication

#### ijk (& jik):

```
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
```

- 2 loads, 0 stores
- misses/iter = 1.25

#### kij (& ikj):

```
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}
```

- 2 loads, 1 store
- misses/iter = 0.5

#### jki (& kji):

```
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}
```

- 2 loads, 1 store
- misses/iter = 2.0
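
All loop orders compute the same product; only the memory access pattern changes. The sketch below makes the three representative orders self-contained, assuming row-major n x n matrices stored in flat arrays and, for kij and jki, a c that the caller has zeroed. The function names are illustrative.

```c
/* ijk: A row-wise, B column-wise, C fixed in the inner loop. */
void mm_ijk(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}

/* kij: A fixed, B and C row-wise. Assumes c starts out zeroed. */
void mm_kij(const double *a, const double *b, double *c, int n) {
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            double r = a[i*n + k];
            for (int j = 0; j < n; j++)
                c[i*n + j] += r * b[k*n + j];
        }
}

/* jki: B fixed, A and C column-wise. Assumes c starts out zeroed. */
void mm_jki(const double *a, const double *b, double *c, int n) {
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double r = b[k*n + j];
            for (int i = 0; i < n; i++)
                c[i*n + j] += a[i*n + k] * r;
        }
}
```

Timing these three functions on large matrices is one way to observe the effect of the misses-per-iteration figures above on a given machine.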

# Core i7 Matrix Multiply Performance



# Question Break

### Lecture Plan

- Cache organization and operation
- Performance impact of caches
  - The memory mountain
  - Rearranging loops to improve spatial locality
  - Using blocking to improve temporal locality

### Example: Matrix Multiplication

```
c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}
```



### Cache Miss Analysis

- Assume
  - Matrix elements are doubles
  - Cache block = 8 doubles
  - Cache size C << n (much smaller than n)

- First iteration:
  - n/8 + n = 9n/8 misses

 Afterwards in cache: (schematic)



### Cache Miss Analysis

#### Assume

- Matrix elements are doubles
- Cache block = 8 doubles
- Cache size C << n (much smaller than n)

#### Second iteration:

- Again: n/8 + n = 9n/8 misses



#### Total misses:

- $9n/8 \times n^2 = (9/8) \times n^3$

### Blocked Matrix Multiplication

matmult/bmm.c
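
A minimal sketch of blocked multiplication, assuming row-major n x n matrices in flat arrays, a block size that divides n evenly, and a c that the caller has zeroed. The function name `bmm` and the parameter `bsize` are illustrative and may differ from what matmult/bmm.c contains.

```c
/* Blocked multiply: c += a * b for n x n row-major matrices.
 * Assumes n is a multiple of bsize and c is zero-initialized. */
void bmm(const double *a, const double *b, double *c, int n, int bsize) {
    for (int i = 0; i < n; i += bsize)
        for (int j = 0; j < n; j += bsize)
            for (int k = 0; k < n; k += bsize)
                /* bsize x bsize mini matrix multiplication */
                for (int i1 = i; i1 < i + bsize; i1++)
                    for (int j1 = j; j1 < j + bsize; j1++)
                        for (int k1 = k; k1 < k + bsize; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}
```

The three outer loops walk over blocks; the three inner loops multiply one pair of blocks that, if `bsize` is chosen well, stay in the cache while they are reused.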



### Cache Miss Analysis

#### Assume

- Cache block = 8 doubles
- Cache size C << n (much smaller than n)
- Three blocks fit into cache: 3B<sup>2</sup> < C

#### First (block) iteration:

- B<sup>2</sup>/8 misses for each block
- $2n/B * B^2/8 = nB/4$  (omitting matrix c)
- Afterwards in cache (schematic)



### Cache Miss Analysis

- Assume:
  - Cache block = 8 doubles
  - Cache size C << n (much smaller than n)
  - Three blocks fit into cache: 3B<sup>2</sup> < C

- Second (block) iteration:
  - Same as first iteration
  - $2n/B * B^2/8 = nB/4$



- Total misses:
  - $nB/4 * (n/B)^2 = n^3/(4B)$

### Blocking Summary

- **No blocking:** (9/8) \* n<sup>3</sup>
- **Blocking:** 1/(4B) \* n<sup>3</sup>

- Suggest largest possible block size B, but limit 3B<sup>2</sup> < C!

- Reason for dramatic difference:
  - Matrix multiplication has inherent temporal locality:
    - Input data: 3n², computation 2n³
    - Every array element is used O(n) times!
  - But program has to be written properly
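
To get a feel for the two estimates, the sketch below evaluates both formulas from the summary. The function names are illustrative, and the sample values n = 1024 and B = 32 in the note that follows are hypothetical, not measurements from the lecture.

```c
/* Estimated misses without blocking: (9/8) * n^3. */
double misses_unblocked(double n) {
    return 9.0 / 8.0 * n * n * n;
}

/* Estimated misses with B x B blocks: n^3 / (4B). */
double misses_blocked(double n, double bsize) {
    return n * n * n / (4.0 * bsize);
}
```

For n = 1024 and B = 32, the estimates are about 1.2 billion misses without blocking and about 8.4 million with blocking. The ratio is 4.5 * B, here 144, which is why a larger B helps as long as 3B<sup>2</sup> < C still holds.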

# Naïve vs. Blocked Matrix Multiplication

- **Naïve multiplication:** ≈ 1,020,000 cache misses
- **Blocked multiplication:** ≈ 90,000 cache misses

### Recap

- Cache memories can have significant performance impact
- You can write your programs to exploit this!
  - Focus on the inner loops, where bulk of computations and memory accesses occur.
  - Try to maximize spatial locality by reading data objects sequentially with stride 1.
  - Try to maximize temporal locality by using a data object as often as possible once it's read from memory.

Next time: Debugging and Design