### cache accesses and C code (1)

```
int scaleFactor;
int scaleByFactor(int value) {
    return value * scaleFactor;
}

scaleByFactor:
    movl scaleFactor, %eax
    imull %edi, %eax
    ret
```

exercise: what data cache accesses does this function do?

answer:

4-byte read of scaleFactor (by the `movl`)

8-byte read of return address (by the `ret`)

#### possible scaleFactor use

```
for (int i = 0; i < size; ++i) {
    array[i] = scaleByFactor(array[i]);
}
```

### misses and code (2)

```
scaleByFactor:
   movl scaleFactor, %eax
   imull %edi, %eax
   ret
```

suppose each time this is called in the loop:

return address located at address 0x7ffffffe43b8

scaleFactor located at address 0x6bc3a0

with direct-mapped 32KB cache w/64 B blocks, what is their:

|        | return address | scaleFactor |
|--------|----------------|-------------|
| tag    |                |             |
| index  |                |             |
| offset |                |             |

answer (32KB / 64 B per block = 512 sets, so 9 index bits and 6 offset bits):

|        | return address | scaleFactor |
|--------|----------------|-------------|
| tag    | 0xfffffffc     | 0xd7        |
| index  | 0x10e          | 0x10e       |
| offset | 0x38           | 0x20        |
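The tag/index/offset split can be checked mechanically. The following is a minimal sketch, assuming a direct-mapped cache whose size and block size are powers of two; the names `addr_parts`, `log2_exact`, and `split_address` are made up for this illustration and are not from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Pieces of an address as seen by a direct-mapped cache. */
struct addr_parts {
    uint64_t tag;
    uint64_t index;
    uint64_t offset;
};

/* log2 of a power of two (hypothetical helper). */
static unsigned log2_exact(uint64_t x) {
    assert(x != 0 && (x & (x - 1)) == 0);  /* must be a power of two */
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        bits += 1;
    }
    return bits;
}

/* Split addr for a direct-mapped cache of cache_bytes with block_bytes blocks. */
struct addr_parts split_address(uint64_t addr, uint64_t cache_bytes,
                                uint64_t block_bytes) {
    uint64_t num_sets = cache_bytes / block_bytes;   /* one block per set */
    unsigned offset_bits = log2_exact(block_bytes);
    unsigned index_bits = log2_exact(num_sets);
    struct addr_parts parts;
    parts.offset = addr & (block_bytes - 1);          /* low offset_bits bits */
    parts.index = (addr >> offset_bits) & (num_sets - 1);
    parts.tag = addr >> (offset_bits + index_bits);   /* everything above */
    return parts;
}
```

Calling `split_address(0x6bc3a0, 32 * 1024, 64)` and `split_address(0x7ffffffe43b8, 32 * 1024, 64)` gives the two columns of the table for comparison.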

#### conflict miss coincidences?

obviously I set that up to have the same index

have to use exactly the right amount of stack space...

but one of the reasons we'll want something better than direct-mapped cache

### C and cache misses (warmup 1)

```
int array[4];
...
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
odd_sum += array[1];
even_sum += array[2];
odd_sum += array[3];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 1-set direct-mapped cache with 8B blocks?

#### some possibilities



Q1: how do cache blocks correspond to array elements? not enough information provided!

#### aside: alignment

compilers and malloc/new implementations usually try to align values

align = make address be multiple of something

most important reason: don't cross cache block boundaries
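A small sketch of what "aligned" and "crossing a cache block boundary" mean in terms of addresses. The function names and the example addresses are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of alignment (alignment must be a power of two). */
bool is_aligned(uintptr_t addr, uintptr_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr & (alignment - 1)) == 0;
}

/* True if the `size` bytes starting at addr touch more than one cache block. */
bool crosses_block_boundary(uintptr_t addr, uintptr_t size, uintptr_t block_size) {
    assert(size > 0 && block_size > 0);
    uintptr_t first_block = addr / block_size;             /* block of first byte */
    uintptr_t last_block = (addr + size - 1) / block_size; /* block of last byte */
    return first_block != last_block;
}
```

With 8-byte blocks, a 4-byte value at a (hypothetical) address 0x1004 stays inside one block, while the same value at 0x1006 is not 4-byte aligned and spans two blocks, so reading it would need two cache accesses.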

### C and cache misses (warmup 2)

```
int array[4];
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
even_sum += array[2];
odd_sum += array[1];
odd_sum += array[3];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

Assume array[0] at beginning of cache block.

How many data cache misses on a 1-set direct-mapped cache with 8B blocks?

### C and cache misses (warmup 3)

```
int array[8];
...
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
odd_sum += array[1];
even_sum += array[2];
odd_sum += array[3];
even_sum += array[4];
odd_sum += array[5];
even_sum += array[6];
odd_sum += array[7];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny), and array[0] at beginning of cache block.

How many data cache misses on a **2**-set direct-mapped cache with 8B blocks?
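One way to check answers to the warmup exercises is to replay the accesses through a tiny simulator. This is a sketch, assuming 4-byte `int`s, an initially empty cache, and a caller-chosen (hypothetical) byte address for `array[0]`; the name `count_misses` is invented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SETS 64

/*
 * Count misses for a sequence of int-array reads on a direct-mapped cache.
 * indices[k] is the array index read by access k; array_start is the assumed
 * byte address of array[0] (a multiple of 4). The cache starts empty.
 */
int count_misses(const int *indices, size_t count, unsigned long array_start,
                 unsigned long block_bytes, unsigned long num_sets) {
    assert(num_sets >= 1 && num_sets <= MAX_SETS && block_bytes >= 4);
    bool valid[MAX_SETS] = {false};
    unsigned long stored_block[MAX_SETS] = {0};
    int misses = 0;
    for (size_t k = 0; k < count; k += 1) {
        unsigned long addr = array_start + 4UL * (unsigned long)indices[k];
        unsigned long block = addr / block_bytes;  /* block number in memory */
        unsigned long set = block % num_sets;      /* set this block maps to */
        if (!valid[set] || stored_block[set] != block) {
            misses += 1;                           /* load block, evicting old one */
            valid[set] = true;
            stored_block[set] = block;
        }
    }
    return misses;
}
```

For warmup 1, pass the indices `{0, 1, 2, 3}` with `block_bytes = 8` and `num_sets = 1`, once with `array_start = 0` and once with `array_start = 4`, to see how the answer depends on alignment.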

### arrays and cache misses (1)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2) {
    even_sum += array[i + 0];
    odd_sum += array[i + 1];
}
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 2KB direct-mapped cache with 16B cache blocks?

## arrays and cache misses (2)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 2KB direct-mapped cache with 16B cache blocks?

### arrays and cache misses (2b)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 4KB direct-mapped cache with 16B cache blocks?

### simulated misses: BST lookups



(simulated 16KB direct-mapped data cache; excluding BST setup)

#### actual misses: BST lookups



(actual 32KB more complex data cache) (only one set of measurements + other things on machine + excluding initial load)

#### simulated misses: matrix multiplies



(simulated 16KB direct-mapped data cache; excluding initial load)

#### actual misses: matrix multiplies



(actual 32KB more complex data cache; excluding matrix initial load) (only one set of measurements + other things on machine)

#### misses with skipping

```
int array1[512]; int array2[512];
...
for (int i = 0; i < 512; i += 1) {
    sum += array1[i] * array2[i];
}
```

Assume everything but array1, array2 is kept in registers (and the compiler does not do anything funny).

About how many data cache misses on a 2KB direct-mapped cache with 16B cache blocks?

Hint: depends on relative placement of array1, array2

### best/worst case

#### array1[i] and array2[i] always different sets:

= distance from array1 to array2 not multiple of # sets $\times$ bytes/set

2 misses every 4 i

blocks of 4 array1[X] values loaded, then used 4 times before loading next block (and same for array2[X])

#### array1[i] and array2[i] same sets:

= distance from array1 to array2 is multiple of # sets $\times$ bytes/set

2 misses every i

block of 4 array1[X] values loaded, one value used from it, then block of 4 array2[X] values replaces it, one value used from it, ...
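The two cases can be reproduced with a short simulation. This is a sketch under stated assumptions: 4-byte `int`s, `array1` at address 0, `array2` at a byte offset `distance` chosen by the caller, and an initially empty 2KB direct-mapped cache with 16B blocks. The names `access_cache` and `dot_product_misses` are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define CACHE_BYTES 2048UL
#define BLOCK_BYTES 16UL
#define NUM_SETS (CACHE_BYTES / BLOCK_BYTES)   /* 128 sets, one block each */

static bool valid[NUM_SETS];
static unsigned long stored_block[NUM_SETS];

/* Access one address; returns 1 on a miss, 0 on a hit. */
static int access_cache(unsigned long addr) {
    unsigned long block = addr / BLOCK_BYTES;
    unsigned long set = block % NUM_SETS;
    if (valid[set] && stored_block[set] == block) {
        return 0;
    }
    valid[set] = true;              /* miss: replace whatever was in the set */
    stored_block[set] = block;
    return 1;
}

/*
 * Misses for: for (i = 0; i < 512; i += 1) sum += array1[i] * array2[i];
 * with array1 at address 0 and array2 at byte address `distance`.
 */
int dot_product_misses(unsigned long distance) {
    assert(distance % 4 == 0 && distance >= 2048);  /* arrays must not overlap */
    for (unsigned long s = 0; s < NUM_SETS; s += 1) {
        valid[s] = false;                           /* start with an empty cache */
    }
    int misses = 0;
    for (unsigned long i = 0; i < 512; i += 1) {
        misses += access_cache(4 * i);              /* array1[i] */
        misses += access_cache(distance + 4 * i);   /* array2[i] */
    }
    return misses;
}
```

A distance of 2048 bytes (array2 placed directly after array1) is a multiple of the cache size, so it lands in the "same sets" case; a distance such as 3072 bytes lands in the "different sets" case.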

#### worst case in practice?

two rows of matrix?

often sizeof(row) bytes apart

if the row size is multiple of number of sets $\times$ bytes per block, oops!

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
|-------|-------|-----|-------|-------|-----|-------|
| 0     | 0     |     |       | 0     |     |       |
| 1     | 0     |     |       | 0     |     |       |

multiple places to put values with same index

avoid misses from two active values using same set ("conflict misses")

each row is a set (set 0, set 1); each valid/tag/value group of columns is a way (way 0, way 1)

$m=8$ bit addresses

$S=2=2^s$ sets, $s=1$ (set) index bits

$B=2=2^b$ byte block size, $b=1$ (block) offset bits

$t=m-(s+b)=6$ tag bits

cache contents after the first access (0x00):

| index | valid | tag    | value                  | valid | tag | value |
|-------|-------|--------|------------------------|-------|-----|-------|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 0     |     |       |
| 1     | 0     |        |                        | 0     |     |       |

| address (binary) | hex | tag    | index | offset | result |
|------------------|-----|--------|-------|--------|--------|
| 00000000         | 00  | 000000 | 0     | 0      | miss   |
| 00000001         | 01  | 000000 | 0     | 1      | hit    |
| 01100011         | 63  | 011000 | 1     | 1      | miss   |
| 01100001         | 61  | 011000 | 0     | 1      | miss   |
| 01100010         | 62  | 011000 | 1     | 0      | hit    |
| 00000000         | 00  | 000000 | 0     | 0      | hit    |
| 01100100         | 64  | 011001 | 0     | 0      | miss   |

cache contents just before the last access (0x64):

| index | valid | tag    | value                  | valid | tag    | value                  |
|-------|-------|--------|------------------------|-------|--------|------------------------|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 1     | 011000 | mem[0x60]<br>mem[0x61] |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     |        |                        |

the last access (0x64) needs to replace a block in set 0!

### cache operation (associative)



#### associative lookup possibilities

none of the blocks for the index are valid

none of the valid blocks for the index match the tag

something else is stored there

one of the blocks for the index is valid and matches the tag

### replacement policies

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag    | value                  | valid | tag    | value                  | LRU |
|-------|-------|--------|------------------------|-------|--------|------------------------|-----|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 1     | 011000 | mem[0x60]<br>mem[0x61] | 1   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     |        |                        | 1   |

| address (hex) | result |
|---------------|--------|
| 00000000 (00) | miss   |
| 00000001 (01) | hit    |
| 01100011 (63) | miss   |
| 01100001 (61) | miss   |
| 01100010 (62) | hit    |
| 00000000 (00) | hit    |
| 01100100 (64) | miss   |

how to decide where to insert 0x64?

LRU column: track which block was read least recently

updated on every access
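The trace above can be replayed with a small simulator. This is a sketch, assuming the same 2-set, 2-way, 2-byte-block cache, starting empty, with one LRU value per set naming the least recently used way (which works only because there are exactly two ways). The struct and function names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SETS 2
#define NUM_WAYS 2
#define BLOCK_BYTES 2

struct way {
    bool valid;
    unsigned tag;
};

struct set {
    struct way ways[NUM_WAYS];
    int lru;   /* which way was used least recently (the one to replace) */
};

/*
 * Run a sequence of one-byte reads through an initially empty cache.
 * hit[k] is set to true if access k hits. Returns the number of misses.
 */
int simulate_lru(const unsigned char *addresses, size_t count, bool *hit) {
    assert(addresses != NULL && hit != NULL);
    struct set sets[NUM_SETS] = {0};
    int misses = 0;
    for (size_t k = 0; k < count; k += 1) {
        unsigned block = addresses[k] / BLOCK_BYTES;
        struct set *s = &sets[block % NUM_SETS];   /* index bits pick the set */
        unsigned tag = block / NUM_SETS;           /* remaining bits are the tag */
        int found = -1;
        for (int w = 0; w < NUM_WAYS; w += 1) {
            if (s->ways[w].valid && s->ways[w].tag == tag) {
                found = w;
            }
        }
        hit[k] = (found >= 0);
        if (found < 0) {
            misses += 1;
            found = s->lru;                        /* replace least recently used way */
            s->ways[found].valid = true;
            s->ways[found].tag = tag;
        }
        s->lru = 1 - found;                        /* the other way is now least recent */
    }
    return misses;
}
```

Feeding it the addresses `{0x00, 0x01, 0x63, 0x61, 0x62, 0x00, 0x64}` reproduces the result column; on the last access the simulator replaces way 1 of set 0, matching the LRU value of 1 shown for that set.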

### example replacement policies

least recently used
- take advantage of temporal locality
- at least $\lceil \log_2(E!) \rceil$ bits per set for E-way cache (need to store order of all blocks)

approximations of least recently used
- implementing least recently used is expensive
- really just need "avoid recently used": much faster/simpler
- good approximations: E to 2E bits

first-in, first-out
- counter per set: where to replace next

(pseudo-)random
- no extra information!
- actually works pretty well in practice
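To get a feel for the $\lceil \log_2(E!) \rceil$ bound on exact LRU state, here is a minimal sketch that evaluates it with integer arithmetic only. The function name `lru_bits_per_set` is an assumption for this example.

```c
#include <assert.h>

/* Smallest number of bits that can distinguish E! orderings: ceil(log2(E!)). */
unsigned lru_bits_per_set(unsigned ways) {
    assert(ways >= 1 && ways <= 20);       /* 20! still fits in 64 bits */
    unsigned long long orderings = 1;
    for (unsigned w = 2; w <= ways; w += 1) {
        orderings *= w;                    /* orderings = ways! */
    }
    unsigned bits = 0;
    while ((1ULL << bits) < orderings) {   /* smallest bits with 2^bits >= ways! */
        bits += 1;
    }
    return bits;
}
```

For a 2-way cache this gives 1 bit per set (the single LRU bit used in the earlier tables); for 8 ways it gives 16 bits, which sits at the upper end of the "E to 2E bits" range quoted for approximations.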


### C and cache misses (warmup 4)

```
int array[8];
...
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
even_sum += array[2];
even_sum += array[4];
even_sum += array[6];
odd_sum += array[1];
odd_sum += array[3];
odd_sum += array[5];
odd_sum += array[7];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a **2**-set direct-mapped cache with 8B blocks?

### associativity terminology

direct-mapped: one block per set

E-way set associative: E blocks per set (E ways in the cache)

fully associative: one set total (everything in one set)

### Tag-Index-Offset formulas

| symbol                    | meaning                           |
|---------------------------|-----------------------------------|
| $m$                       | memory address bits               |
| $E$                       | number of blocks per set ("ways") |
| $S = 2^s$                 | number of sets                    |
| $s$                       | (set) index bits                  |
| $B = 2^b$                 | block size                        |
| $b$                       | (block) offset bits               |
| $t = m - (s+b)$           | tag bits                          |
| $C = B \times S \times E$ | cache size (excluding metadata)   |
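The formulas can be turned around to go from a cache description (size, block size, ways) to the bit counts. A minimal sketch follows, assuming all sizes are powers of two; `cache_geometry`, `log2_exact`, and `cache_geometry_for` are names invented here.

```c
#include <assert.h>

struct cache_geometry {
    unsigned long sets;     /* S */
    unsigned index_bits;    /* s */
    unsigned offset_bits;   /* b */
    unsigned tag_bits;      /* t */
};

/* log2 of a power of two (hypothetical helper). */
static unsigned log2_exact(unsigned long x) {
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        bits += 1;
    }
    return bits;
}

/* Derive S, s, b, t from m (address_bits), C (cache_bytes), B, and E (ways). */
struct cache_geometry cache_geometry_for(unsigned address_bits,
                                         unsigned long cache_bytes,
                                         unsigned long block_bytes,
                                         unsigned long ways) {
    struct cache_geometry g;
    assert(ways >= 1 && cache_bytes % (block_bytes * ways) == 0);
    g.sets = cache_bytes / (block_bytes * ways);   /* from C = B * S * E */
    g.index_bits = log2_exact(g.sets);             /* S = 2^s */
    g.offset_bits = log2_exact(block_bytes);       /* B = 2^b */
    assert(address_bits >= g.index_bits + g.offset_bits);
    g.tag_bits = address_bits - (g.index_bits + g.offset_bits);  /* t = m - (s+b) */
    return g;
}
```

For the running example (8-bit addresses, 8 bytes of data, 2-byte blocks, 2 ways) this yields 2 sets, 1 index bit, 1 offset bit, and 6 tag bits.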

## misses with skipping

```
int array1[512]; int array2[512];
...
for (int i = 0; i < 512; i += 1) {
    sum += array1[i] * array2[i];
}
```

Assume everything but array1, array2 is kept in registers (and the compiler does not do anything funny).

About how many data cache misses on a 2KB direct-mapped cache with 16B cache blocks?

Hint: depends on relative placement of array1, array2

How about on a two-way set associative cache?

# arrays and cache misses (2)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 2KB direct-mapped cache with 16B cache blocks?

Would a set-associative cache be better?

## handling writes

what about writing to the cache?

two decision points:

if the value is not in cache, do we add it?
- if yes: need to load rest of block
- if no: missing out on locality?

if value is in cache, when do we update next level?
- if immediately: extra writing
- if later: need to remember to do so

#### allocate on write?

processor writes less than whole cache block

block not yet in cache

two options:

#### write-allocate

fetch rest of cache block, replace written part (then follow write-through or write-back policy)

#### write-no-allocate

don't use cache at all (send write to memory *instead*)

guess: not read soon?

#### option 1: write-through

#### option 2: write-back

### writeback policy

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag    | value                    | dirty | valid | tag    | value                    | dirty | LRU |
|-------|-------|--------|--------------------------|-------|-------|--------|--------------------------|-------|-----|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01]   | 0     | 1     | 011000 | mem[0x60]*<br>mem[0x61]* | 1     | 1   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63]   | 0     | 0     |        |                          |       | 0   |

\* = changed value!

1 = dirty (different than memory); needs to be written if evicted

2-way set associative, LRU, writeback (with write-allocate)

before the write:

| index | valid | tag    | value                  | dirty | valid | tag    | value                    | dirty | LRU |
|-------|-------|--------|------------------------|-------|-------|--------|--------------------------|-------|-----|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 0     | 1     | 011000 | mem[0x60]*<br>mem[0x61]* | 1     | 1   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     | 0     |        |                          |       | 0   |

writing 0xFF into address 0x04?

index 0, tag 000001

step 1: find least recently used block

step 2: possibly writeback old block (here the old block is dirty, so it is written back)

step 3a: read in new block (to get mem[0x05])

step 3b: update LRU information

after the write:

| index | valid | tag    | value                  | dirty | valid | tag    | value             | dirty | LRU |
|-------|-------|--------|------------------------|-------|-------|--------|-------------------|-------|-----|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 0     | 1     | 000001 | 0xFF<br>mem[0x05] | 1     | 0   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     | 0     |        |                   |       | 0   |

2-way set associative, LRU, writeback (with write-no-allocate)

| index | valid | tag    | value                  | dirty | valid | tag    | value                    | dirty | LRU |
|-------|-------|--------|------------------------|-------|-------|--------|--------------------------|-------|-----|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 0     | 1     | 011000 | mem[0x60]*<br>mem[0x61]* | 1     | 1   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     | 0     |        |                          |       | 0   |

writing 0xFF into address 0x04?

step 1: is it in cache yet?

step 2: no, just send it to memory
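The two write-miss behaviours can be compared side by side in code. This is a sketch only, assuming the same 2-set, 2-way, 2-byte-block write-back cache and counting transfers to the next level rather than storing data; `struct cache`, `write_byte`, and the counter fields are invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SETS 2
#define NUM_WAYS 2
#define BLOCK_BYTES 2

struct line {
    bool valid;
    bool dirty;               /* true = different than memory */
    unsigned tag;
};

struct cache {
    struct line lines[NUM_SETS][NUM_WAYS];
    int lru[NUM_SETS];        /* least recently used way in each set */
    int memory_reads;         /* blocks read from the next level */
    int memory_writes;        /* writes sent to the next level */
};

/* Write one byte at addr into a write-back cache. */
void write_byte(struct cache *c, unsigned addr, bool write_allocate) {
    assert(c != NULL);
    unsigned block = addr / BLOCK_BYTES;
    unsigned set = block % NUM_SETS;
    unsigned tag = block / NUM_SETS;
    for (int w = 0; w < NUM_WAYS; w += 1) {
        struct line *l = &c->lines[set][w];
        if (l->valid && l->tag == tag) {   /* write hit: update cache only */
            l->dirty = true;
            c->lru[set] = 1 - w;
            return;
        }
    }
    if (!write_allocate) {                 /* write-no-allocate miss */
        c->memory_writes += 1;             /* send the write to memory instead */
        return;
    }
    int victim = c->lru[set];              /* step 1: find least recently used block */
    struct line *l = &c->lines[set][victim];
    if (l->valid && l->dirty) {
        c->memory_writes += 1;             /* step 2: write back old block */
    }
    c->memory_reads += 1;                  /* step 3a: read in rest of new block */
    l->valid = true;
    l->tag = tag;
    l->dirty = true;                       /* written byte differs from memory */
    c->lru[set] = 1 - victim;              /* step 3b: update LRU information */
}
```

Starting from the table's state and writing to 0x04, the write-allocate path costs one write (the dirty victim) plus one read, and leaves a new dirty block in the cache; the write-no-allocate path costs one write and leaves the cache unchanged.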

# exercise (1)

2-way set associative, LRU, write-allocate, writeback

| index | valid | tag    | value                  | dirty | valid | tag    | value                  | dirty | LRU |
|-------|-------|--------|------------------------|-------|-------|--------|------------------------|-------|-----|
| 0     | 1     | 001100 | mem[0x30]<br>mem[0x31] | 0     | 1     | 010000 | mem[0x40]<br>mem[0x41] | * 1   | 0   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     | 1     | 001100 | mem[0x32]<br>mem[0x33] | * 1   | 1   |

for each of the following accesses, performed alone, would it require (a) reading a value from memory (or next level of cache) and (b) writing a value to the memory (or next level of cache)?

writing 1 byte to 0x33

reading 1 byte from 0x52

reading 1 byte from 0x50

# exercise (2)

2-way set associative, LRU, write-no-allocate, write-through

| index | valid | tag    | value                  | valid | tag    | value                  | LRU |
|-------|-------|--------|------------------------|-------|--------|------------------------|-----|
| 0     | 1     | 001100 | mem[0x30]<br>mem[0x31] | 1     | 010000 | mem[0x40]<br>mem[0x41] | 0   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 1     | 001100 | mem[0x32]<br>mem[0x33] | 1   |

for each of the following accesses, performed alone, would it require (a) reading a value from memory and (b) writing a value to the memory?

writing 1 byte to 0x33

reading 1 byte from 0x52

reading 1 byte from 0x50

#### fast writes



write appears to complete immediately when placed in buffer

memory can be much slower

#### cache miss types

common to categorize misses: roughly "cause" of miss, assuming cache block size fixed

compulsory (or cold): first time accessing something
- adding more sets or blocks/set wouldn't change

conflict: sets aren't big/flexible enough
- a fully-associative (1-set) cache of the same size would have done better

capacity: cache was not big enough

coherence: from sync'ing cache with other caches
- only issue with multiple cores

## making any cache look bad

1. access enough blocks to fill the cache
2. access an additional block, replacing something
3. access last block replaced
4. access last block replaced
5. access last block replaced

...

but — typical real programs have locality

### cache optimizations

(assuming typical locality + keeping cache size constant if possible...)

|                        | miss rate | hit time | miss penalty |
|------------------------|-----------|----------|--------------|
| increase cache size    | better    | worse    | worse?       |
| increase associativity | better    | worse    |              |
| increase block size    | depends   | worse    | worse        |
| add secondary cache    |           |          | better       |
| write-allocate         | better    |          |              |
| writeback              |           |          |              |
| LRU replacement        | better    |          | worse?       |
| prefetching            | better    |          |              |

prefetching = guess what program will use, access in advance

average time = hit time + miss rate  $\times$  miss penalty
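The formula is easy to evaluate for candidate designs. A minimal sketch, with the function name `average_access_time` and all numbers below being made up for illustration:

```c
#include <assert.h>

/* average time = hit time + miss rate * miss penalty (times in cycles). */
double average_access_time(double hit_time, double miss_rate, double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

With a hypothetical 2-cycle hit time, 25% miss rate, and 100-cycle miss penalty, the average is 27 cycles. An optimization that raises the hit time to 3 cycles only pays off if it lowers the miss rate by more than one percentage point at that penalty.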

### cache optimizations by miss type

(assuming other listed parameters remain constant)

|                        | capacity     | conflict     | compulsory   |
|------------------------|--------------|--------------|--------------|
| increase cache size    | fewer misses | fewer misses | -            |
| increase associativity | -            | fewer misses | -            |
| increase block size    | more misses? | more misses? | fewer misses |
| LRU replacement        | -            | fewer misses | -            |
| prefetching            | -            | -            | fewer misses |

#### another view



# two-level page table lookup



#### cache accesses and multi-level PTs

four-level page tables — five cache accesses per program memory access

L1 cache hits — typically a couple cycles each?

so add 8 cycles to each program memory access?

not acceptable

#### program memory active sets



(figure: address space diagram with labels 0xFFFF FFFF FFFF, 0xFFFF 8000 0000 0000, 0x7F..., 0x0000 0000 0040 0000)

small areas of memory active at a time

one or two pages in each area?

### page table entries and locality

page table entries have excellent temporal locality

typically one or two pages of the stack active

typically one or two pages of code active

typically one or two pages of heap/globals active

each page contains whole functions, arrays, stack frames, etc.

the set of page table entries needed at any one time is very small

called a **TLB** (translation lookaside buffer)

very small cache of page table entries

| L1 cache                    | TLB                            |
|-----------------------------|--------------------------------|
| physical addresses          | virtual page numbers           |
| bytes from memory           | page table entries             |
| tens of bytes per block     | one page table entry per block |
| usually thousands of blocks | usually tens of entries        |

only caches the page table lookup itself

(generally) just entries from the last-level page tables

not much spatial locality between page table entries (they're used for kilobytes of data already) (and if spatial locality, maybe use larger page size?)

few active page table entries at a time: enables highly associative cache designs

### TLB and multi-level page tables

TLB caches valid last-level page table entries

doesn't matter which last-level page table

means TLB output can be used directly to form address

### TLB and two-level lookup



### exercise: TLB access pattern (setup)

4-entry, 2-way TLB, LRU replacement policy, initially empty

4096 byte pages

how many index bits?

TLB index of virtual address 0x12345?

### exercise: TLB access pattern

4-entry, 2-way TLB, LRU replacement policy, initially empty

4096 byte pages

| type  | virtual    | physical |
|-------|------------|----------|
| read  | 0x440030   | 0x554030 |
| write | 0x440034   | 0x554034 |
| read  | 0x7FFFE008 | 0x556008 |
| read  | 0x7FFFE000 | 0x556000 |
| read  | 0x7FFFDFF8 | 0x5F8FF8 |
| read  | 0x664080   | 0x5F9080 |
| read  | 0x440038   | 0x554038 |
| write | 0x7FFFDFF0 | 0x5F8FF0 |

which are TLB hits? which are TLB misses? final contents of TLB?
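To check an answer to this exercise, the access pattern can be replayed through a small TLB model. This is a sketch, assuming the set index is taken from the low bits of the virtual page number and that a miss simply installs the translation given in the table (standing in for the page table walk); the struct and function names are invented here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_BYTES 4096UL
#define TLB_SETS 2          /* 4 entries / 2 ways */
#define TLB_WAYS 2

struct tlb_entry {
    bool valid;
    unsigned long vpn;      /* virtual page number (tag and index together) */
    unsigned long ppn;      /* physical page number from the page table entry */
};

struct tlb {
    struct tlb_entry entries[TLB_SETS][TLB_WAYS];
    int lru[TLB_SETS];      /* least recently used way in each set */
};

/*
 * Look up one access. `physical_addr` stands in for the result of the page
 * table walk that a miss would do. Returns true on a TLB hit.
 */
bool tlb_access(struct tlb *t, unsigned long virtual_addr,
                unsigned long physical_addr) {
    assert(t != NULL);
    unsigned long vpn = virtual_addr / PAGE_BYTES;   /* drop the page offset */
    unsigned long set = vpn % TLB_SETS;              /* low VPN bits pick the set */
    for (int w = 0; w < TLB_WAYS; w += 1) {
        struct tlb_entry *e = &t->entries[set][w];
        if (e->valid && e->vpn == vpn) {
            t->lru[set] = 1 - w;                     /* other way is now least recent */
            return true;
        }
    }
    int victim = t->lru[set];                        /* miss: replace LRU entry */
    t->entries[set][victim].valid = true;
    t->entries[set][victim].vpn = vpn;
    t->entries[set][victim].ppn = physical_addr / PAGE_BYTES;
    t->lru[set] = 1 - victim;
    return false;
}
```

Start from a zero-initialized `struct tlb`, call `tlb_access` once per row of the table in order, and inspect `entries` afterwards for the final contents.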

# backup slides

# arrays and cache misses (3)

```
int sum; int array[1024]; // 4KB array
for (int i = 8; i < 1016; i += 1) {
    int local_sum = 0;
    for (int j = i - 8; j < i + 8; j += 1) {
        local_sum += array[i] * (j - i);
    }
    sum += (local_sum - array[i]);
}
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 2KB direct-mapped cache with 16B cache blocks?