## last time (1)

locality — temporal and spatial

temporal: same thing again soon

spatial: nearby thing soon

natural properties of programs

some taken advantage of by compiler (register allocation)

#### direct-mapped caches

divide memory, cache into blocks

block size and number of 'rows' in cache are always powers of two

one place to put each block of memory in the cache

## last time (2)

#### direct-mapped cache lookup

```
divide address into tag / (set) index / (block) offset
b-bit block offset: where in 2^b-byte block is byte?
s-bit set index: which of 2^s rows of cache to use?
tag: which block from memory is stored here?
(could store whole block address instead of tag; tag just saves space)
```

#### instruction v data caches

#### alignment and C code

want to avoid splitting things across blocks

better start at beginning of block (= multiple of block size)

## anonymous feedback

"The quiz this week was absurdly ambiguous and difficult to understand."

"The quiz for this week was incredibly ambiguous for questions 4 and 5. In DSA, we learned about two ways to resolve collisions in hash tables (separate chaining and probing). Depending on which method is, the locality answers in those questions will be greatly effected."

## quiz Q1 (1)

"contains a public encryption key used by the web browser to encrypt, among other things, the path of the web page being requested"

```
TLS protocol: client and server agree on symmetric keys
use key in certificate here!
most commonly: to sign one-time key share for server
```

then use symmetric keys to encrypt rest of connection

```
why?
symmetric encryption faster
forward secrecy — server can't decrypt old connections retrospectively
```

## quiz Q1 (2)

"is verified primarily web browser contacting certificate authority"

certificate contains signature that can be checked with just CA public key

avoids scaling problem of every browser contacting CA every time

(yes, it is true that checking revocation information might involve contacting the CA, but this is not required, is often skipped if the CA is not available, and/or is done indirectly)

## quiz Q2

 $A \rightarrow B$ : A's key share (from secret  $S_A$ )

 $B \rightarrow A$ : B's key share (from secret  $S_B$ )

"using the key shares sent by A and B as well as their own secret value and key share, compute their own copy of the symmetric encryption key A and B are using"

can't compute A's or B's secret from their key shares alone

typical attack: replace one of the key shares

scenario that 'works': replace A's key share with attacker's for B and replace B's key share with attacker's for A

## quiz Q4

#### hashtable:

spatial: not really, spread out (maybe except *rare* collisions)

temporal: yes, for the duplicates being eliminated (same entry accessed again)

#### input array:

spatial: yes, most likely iterating through sequentially

temporal: no, each element used essentially once (maybe once to hash, once to equality compare; still not much)

### quiz Q5

hashtable with better hash function (fewer collisions)

improves locality (either kind): no

better spread means less spatial locality if probing sequentially

less traversing of lists/probing of entries that are also traversed for other values

reduces accesses: yes, less traversing lists/probing for collisions

## cache operation (read)



## C and cache misses (warmup 1)

```
int array[4];
...
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
odd_sum += array[1];
even_sum += array[2];
odd_sum += array[3];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 1-set direct-mapped cache with 8B blocks?

## some possibilities

Q1: how do cache blocks correspond to array elements? not enough information provided!

if array[0] starts at beginning of a cache block: array split across two cache blocks

| memory access        | cache contents afterwards |
|----------------------|---------------------------|
| _                    | (empty)                   |
| read array[0] (miss) | {array[0], array[1]}      |
| read array[1] (hit)  | {array[0], array[1]}      |
| read array[2] (miss) | {array[2], array[3]}      |
| read array[3] (hit)  | {array[2], array[3]}      |

if array[0] starts right in the middle of a cache block: array split across three cache blocks

(in the tables, \*\*\*\* and ++++ stand for whatever is in memory just before and just after the array)

| memory access        | cache contents afterwards |
|----------------------|---------------------------|
| _                    | (empty)                   |
| read array[0] (miss) | {****, array[0]}          |
| read array[1] (miss) | {array[1], array[2]}      |
| read array[2] (hit)  | {array[1], array[2]}      |
| read array[3] (miss) | {array[3], ++++}          |



if array[0] starts at an odd place in a cache block, need to read two cache blocks to get most array elements

| memory access                 | cache contents afterwards                        |
|-------------------------------|--------------------------------------------------|
| _                             | (empty)                                          |
| read array[0] byte 0 (miss)   | { ****, array[0] byte 0 }                        |
| read array[0] byte 1-3 (miss) | { array[0] byte 1-3, array[1], array[2] byte 0 } |
| read array[1] (hit)           | { array[0] byte 1-3, array[1], array[2] byte 0 } |
| read array[2] byte 0 (hit)    | { array[0] byte 1-3, array[1], array[2] byte 0 } |
| read array[2] byte 1-3 (miss) | { array[2] byte 1-3, array[3], ++++ }            |
| read array[3] (hit)           | { array[2] byte 1-3, array[3], ++++ }            |

### aside: alignment

compilers and malloc/new implementations usually try to align values

align = make address be multiple of something

most important reason: don't cross cache block boundaries

## C and cache misses (warmup 2)

```
int array[4];
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
even_sum += array[2];
odd_sum += array[1];
odd_sum += array[3];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

Assume array[0] at beginning of cache block.

How many data cache misses on a 1-set direct-mapped cache with 8B blocks?



| memory access        | cache contents afterwards |
|----------------------|---------------------------|
|                      | (empty)                   |
| read array[0] (miss) | {array[0], array[1]}      |
| read array[2] (miss) | {array[2], array[3]}      |
| read array[1] (miss) | {array[0], array[1]}      |
| read array[3] (miss) | {array[2], array[3]}      |

## C and cache misses (warmup 3)

```
int array[8];
...
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
odd_sum += array[1];
even_sum += array[2];
odd_sum += array[3];
even_sum += array[4];
odd_sum += array[5];
even_sum += array[6];
odd_sum += array[7];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny), and array[0] at beginning of cache block.

How many data cache misses on a **2**-set direct-mapped cache with 8B blocks?





memory layout: {array[0], array[1]} and {array[4], array[5]} are cache blocks with index 0; {array[2], array[3]} and {array[6], array[7]} are cache blocks with index 1

| memory access        | set 0 afterwards     | set 1 afterwards     |
|----------------------|----------------------|----------------------|
| _                    | (empty)              | (empty)              |
| read array[0] (miss) | {array[0], array[1]} | (empty)              |
| read array[1] (hit)  | {array[0], array[1]} | (empty)              |
| read array[2] (miss) | {array[0], array[1]} | {array[2], array[3]} |
| read array[3] (hit)  | {array[0], array[1]} | {array[2], array[3]} |
| read array[4] (miss) | {array[4], array[5]} | {array[2], array[3]} |
| read array[5] (hit)  | {array[4], array[5]} | {array[2], array[3]} |
| read array[6] (miss) | {array[4], array[5]} | {array[6], array[7]} |
| read array[7] (hit)  | {array[4], array[5]} | {array[6], array[7]} |

observation: what happens in set 0 doesn't affect set 1

when evaluating set 0 accesses, can ignore non-set 0 accesses/content



## C and cache misses (warmup 4)

```
int array[8];
int even_sum = 0, odd_sum = 0;
even_sum += array[0];
even_sum += array[2];
even_sum += array[4];
even_sum += array[6];
odd_sum += array[1];
odd_sum += array[3];
odd_sum += array[5];
odd_sum += array[7];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a **2**-set direct-mapped cache with 8B blocks?

| memory access        | set 0 afterwards     | set 1 afterwards     |
|----------------------|----------------------|----------------------|
| _                    | (empty)              | (empty)              |
| read array[0] (miss) | {array[0], array[1]} | (empty)              |
| read array[2] (miss) | {array[0], array[1]} | {array[2], array[3]} |
| read array[4] (miss) | {array[4], array[5]} | {array[2], array[3]} |
| read array[6] (miss) | {array[4], array[5]} | {array[6], array[7]} |
| read array[1] (miss) | {array[0], array[1]} | {array[6], array[7]} |
| read array[3] (miss) | {array[0], array[1]} | {array[2], array[3]} |
| read array[5] (miss) | {array[4], array[5]} | {array[2], array[3]} |
| read array[7] (miss) | {array[4], array[5]} | {array[6], array[7]} |





### cache size

cache size = amount of *data* in cache

not included: metadata (tags, valid bits, etc.)

## arrays and cache misses (1)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2) {
    even_sum += array[i + 0];
    odd_sum += array[i + 1];
}
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 2KB direct-mapped cache with 16B cache blocks?

## arrays and cache misses (2)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 2KB direct-mapped cache with 16B cache blocks?

## arrays and cache misses (2b)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 4KB direct-mapped cache with 16B cache blocks?

## arrays and cache misses (3)

```
int sum; int array[1024]; // 4KB array
for (int i = 8; i < 1016; i += 1) {
    int local_sum = 0;
    for (int j = i - 8; j < i + 8; j += 1) {
        local_sum += array[j] * (j - i);
    }
    sum += (local_sum - array[i]);
}
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 2KB direct-mapped cache with 16B cache blocks?

# simulated misses: BST lookups



(simulated 16KB direct-mapped data cache; excluding BST setup)

#### actual misses: BST lookups



(actual 32KB more complex data cache)
(only one set of measurements + other things on machine + excluding initial load)

#### simulated misses: matrix multiplies



(simulated 16KB direct-mapped data cache; excluding initial load)

#### actual misses: matrix multiplies



(actual 32KB more complex data cache; excluding matrix initial load)
(only one set of measurements + other things on machine)

#### misses with skipping

```
int array1[512]; int array2[512];
...
for (int i = 0; i < 512; i += 1) {
    sum += array1[i] * array2[i];
}
```

Assume everything but array1, array2 is kept in registers (and the compiler does not do anything funny).

About how many *data cache misses* on a 2KB direct-mapped cache with 16B cache blocks?

Hint: depends on relative placement of array1, array2

#### best/worst case

array1[i] and array2[i] always different sets:

= distance from array1 to array2 not multiple of # sets  $\times$  bytes/set

2 misses every 4 i

blocks of 4 array1[X] values loaded, then used 4 times before loading next block (and same for array2[X])

array1[i] and array2[i] same sets:

= distance from array1 to array2 is multiple of # sets  $\times$  bytes/set

2 misses every i

block of 4 array1[X] values loaded, one value used from it, then block of 4 array2[X] values replaces it, one value used from it, ...

#### worst case in practice?

two rows of matrix?

often sizeof(row) bytes apart

if the row size is multiple of number of sets  $\times$  bytes per block, oops!

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
|-------|-------|-----|-------|-------|-----|-------|
| 0     | 0     |     |       | 0     |     |       |
| 1     | 0     |     |       | 0     |     |       |

multiple places to put values with same index

avoid misses from two active values using same set ("conflict misses")

each row is a set (set 0, set 1); the first valid/tag/value group is way 0, the second is way 1

$m = 8$ bit addresses

$S = 2 = 2^s$ sets, $s = 1$ (set) index bits

$B = 2 = 2^b$ byte block size, $b = 1$ (block) offset bits

$t = m - (s + b) = 6$ tag bits

| index | valid | tag    | value                  | valid | tag    | value                  |
|-------|-------|--------|------------------------|-------|--------|------------------------|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 1     | 011000 | mem[0x60]<br>mem[0x61] |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     |        |                        |

(cache contents after the first six accesses below)

| address (hex) | result                                  |
|---------------|-----------------------------------------|
| 00000000 (00) | miss                                    |
| 00000001 (01) | hit                                     |
| 01100011 (63) | miss                                    |
| 01100001 (61) | miss                                    |
| 01100010 (62) | hit                                     |
| 00000000 (00) | hit                                     |
| 01100100 (64) | miss: needs to replace block in set 0!  |

#### cache operation (associative)




#### associative lookup possibilities

none of the blocks for the index are valid

none of the valid blocks for the index match the tag: something else is stored there

one of the blocks for the index is valid and matches the tag

#### replacement policies

| index | valid | tag    | value                  | valid | tag    | value                  |
|-------|-------|--------|------------------------|-------|--------|------------------------|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 1     | 011000 | mem[0x60]<br>mem[0x61] |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     |        |                        |

| address (hex) | result |
|---------------|--------|
| 00000000 (00) | miss   |
| 00000001 (01) | hit    |
| 01100011 (63) | miss   |
| 01100001 (61) | miss   |
| 01100010 (62) | hit    |
| 00000000 (00) | hit    |
| 01100100 (64) | miss   |

how to decide where to insert 0x64?

#### replacement policies

| index | valid | tag    | value                  | valid | tag    | value                  | LRU |
|-------|-------|--------|------------------------|-------|--------|------------------------|-----|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 1     | 011000 | mem[0x60]<br>mem[0x61] | 1   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     |        |                        | 1   |

| address (hex) | result |
|---------------|--------|
| 00000000 (00) | miss   |
| 00000001 (01) | hit    |
| 01100011 (63) | miss   |
| 01100001 (61) | miss   |
| 01100010 (62) | hit    |
| 00000000 (00) | hit    |
| 01100100 (64) | miss   |

track which block was read least recently

updated on every access

#### example replacement policies

```
least recently used
     take advantage of temporal locality
     at least ⌈log2(E!)⌉ bits per set for E-way cache
           (need to store order of all blocks)
approximations of least recently used
     implementing least recently used is expensive
     really just need "avoid recently used": much faster/simpler
     good approximations: E to 2E bits
first-in, first-out
     counter per set: where to replace next
(pseudo-)random
     no extra information!
     actually works pretty well in practice
```

#### associativity terminology

direct-mapped: one block per set

$E$-way set associative: $E$ blocks per set, $E$ ways in the cache

fully associative: one set total (everything in one set)

# simulated misses: BST lookups



#### simulated misses: matrix multiplies



# logistics

Prof Skadron here Thursday

# backup slides

# **Tag-Index-Offset formulas**

| symbol                    | meaning                           |
|---------------------------|-----------------------------------|
| $m$                       | memory address bits               |
| $E$                       | number of blocks per set ("ways") |
| $S = 2^s$                 | number of sets                    |
| $s$                       | (set) index bits                  |
| $B = 2^b$                 | block size                        |
| $b$                       | (block) offset bits               |
| $t = m - (s + b)$         | tag bits                          |
| $C = B \times S \times E$ | cache size (excluding metadata)   |

#### Tag-Index-Offset exercise

| symbol                    | meaning                           |
|---------------------------|-----------------------------------|
| $m$                       | memory address bits (Y86-64: 64)  |
| $E$                       | number of blocks per set ("ways") |
| $S = 2^s$                 | number of sets                    |
| $s$                       | (set) index bits                  |
| $B = 2^b$                 | block size                        |
| $b$                       | (block) offset bits               |
| $t = m - (s + b)$         | tag bits                          |
| $C = B \times S \times E$ | cache size (excluding metadata)   |

#### My desktop:

L1 Data Cache: 32 KB, 8 blocks/set, 64 byte blocks

L2 Cache: 256 KB, 4 blocks/set, 64 byte blocks

L3 Cache: 8 MB, 16 blocks/set, 64 byte blocks

Divide the address 0x34567 into tag, index, offset for each cache.

#### T-I-O exercise: L1

| quantity           | value for L1                                              |
|--------------------|-----------------------------------------------------------|
| block size (given) | $B = 64$ bytes                                            |
|                    | $B = 2^b$ ($b$: block offset bits)                        |
| block offset bits  | $b = 6$                                                   |
| blocks/set (given) | $E = 8$                                                   |
| cache size (given) | $C = 32\text{KB} = E \times B \times S$                   |
|                    | $S = \frac{C}{B \times E}$ ($S$: number of sets)          |
| number of sets     | $S = \frac{32\text{KB}}{64\text{ bytes} \times 8} = 64$   |

| quantity           | value for L1                                            |
|--------------------|---------------------------------------------------------|
| block size (given) | $B = 64 \text{Byte}$                                    |
|                    | $B = 2^b$ (b: block offset bits)                        |
| block offset bits  | $b = 6$                                                 |
| blocks/set (given) | $E = 8$                                                 |
| cache size (given) | $C = 32 \text{KB} = E \times B \times S$                |
|                    | $S = \frac{C}{B \times E}$ (S: number of sets)          |
| number of sets     | $S = \frac{32 \text{KB}}{64 \text{Byte} \times 8} = 64$ |
|                    | $S = 2^s$ (s: set index bits)                           |
| set index bits     | $s = \log_2(64) = 6$                                    |
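The same derivation can be sketched in C to check all three caches at once. This is a minimal sketch, not part of the original slides: the struct `cache_geometry` and the functions `log2_exact` and `geometry` are hypothetical names, and it assumes cache size, block size, and the resulting number of sets are all powers of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical struct for illustration; comments give the slide's symbols. */
struct cache_geometry {
    uint64_t sets;        /* S */
    unsigned index_bits;  /* s */
    unsigned offset_bits; /* b */
};

/* log2 of a value that is assumed to be a power of two. */
static unsigned log2_exact(uint64_t x) {
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}

/* Apply S = C / (B * E), s = log2(S), b = log2(B). */
struct cache_geometry geometry(uint64_t cache_bytes, uint64_t block_bytes,
                               uint64_t ways) {
    struct cache_geometry g;
    g.sets = cache_bytes / (block_bytes * ways);
    g.index_bits = log2_exact(g.sets);
    g.offset_bits = log2_exact(block_bytes);
    return g;
}
```

For the L1 cache above, `geometry(32 * 1024, 64, 8)` gives 64 sets, 6 index bits, and 6 offset bits, matching the table.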

#### T-I-O results

|                   | L1         | L2   | L3   |
|-------------------|------------|------|------|
| sets              | 64         | 1024 | 8192 |
| block offset bits | 6          | 6    | 6    |
| set index bits    | 6          | 10   | 13   |
| tag bits          | (the rest) |      |      |

```
0x34567 = 0011 0100 0101 0110 0111 (binary)

bits 0-5 (all offsets): 100111 = 0x27

L1:
    bits 6-11 (L1 set): 01 0101 = 0x15
    bits 12- (L1 tag): 0x34
L2:
    bits 6-15 (set for L2): 01 0001 0101 = 0x115
    bits 16- (L2 tag): 0x3
L3:
    bits 6-18 (set for L3): 0 1101 0001 0101 = 0xD15
    bits 19- (L3 tag): 0x0
```
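The bit slicing for 0x34567 can be reproduced with shifts and masks. The following is a sketch under stated assumptions rather than anything from the slides: `struct tio` and `split_address` are hypothetical names, and the index and offset widths are passed in from the results table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the three fields of a split address. */
struct tio {
    uint64_t tag, index, offset;
};

/* Split addr into tag / set index / block offset using the given widths. */
struct tio split_address(uint64_t addr, unsigned index_bits,
                         unsigned offset_bits) {
    assert(index_bits + offset_bits < 64); /* keeps every shift well-defined */
    struct tio r;
    r.offset = addr & ((UINT64_C(1) << offset_bits) - 1);
    r.index = (addr >> offset_bits) & ((UINT64_C(1) << index_bits) - 1);
    r.tag = addr >> (offset_bits + index_bits);
    return r;
}
```

Calling `split_address(0x34567, 6, 6)` yields tag 0x34, index 0x15, offset 0x27 for L1; with 10 and 13 index bits it yields index 0x115 with tag 0x3 (L2) and index 0xD15 with tag 0x0 (L3).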

# misses with skipping

```
int array1[512];
int array2[512];
...
for (int i = 0; i < 512; i += 1)
    sum += array1[i] * array2[i];
```

Assume everything but array1, array2 is kept in registers (and the compiler does not do anything funny).

About how many data cache misses on a 2KB direct-mapped cache with 16B cache blocks?

Hint: depends on relative placement of array1, array2

How about on a two-way set associative cache?
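One way to explore the hint about placement is a small simulation. This is a sketch only: `struct cache_sim`, `sim_access`, and `dot_product_misses` are hypothetical names, and it assumes 4-byte `int`, LRU replacement, an initially empty cache, and array start addresses chosen by the caller. It stores whole block numbers instead of tags, which is equivalent for counting misses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model for illustration: tracks only which blocks are cached. */
struct cache_sim {
    unsigned sets, ways, block_bytes;
    uint64_t *blocks;      /* sets * ways block numbers, most recent first */
    unsigned char *valid;  /* 1 if the matching slot holds a block */
};

/* Access one address; returns 1 on a miss, 0 on a hit. LRU replacement. */
static int sim_access(struct cache_sim *c, uint64_t addr) {
    uint64_t block = addr / c->block_bytes;
    size_t base = (size_t)(block % c->sets) * c->ways;
    unsigned i = 0;
    while (i < c->ways && !(c->valid[base + i] && c->blocks[base + i] == block))
        i++;
    int miss = (i == c->ways);
    if (miss)
        i = c->ways - 1;               /* reuse the least recently used slot */
    for (; i > 0; i--) {               /* shift newer entries back one slot */
        c->blocks[base + i] = c->blocks[base + i - 1];
        c->valid[base + i] = c->valid[base + i - 1];
    }
    c->blocks[base] = block;           /* accessed block becomes most recent */
    c->valid[base] = 1;
    return miss;
}

/* Misses for the loop above on a 2KB cache with 16B blocks. */
unsigned long dot_product_misses(unsigned ways, uint64_t array1_addr,
                                 uint64_t array2_addr) {
    struct cache_sim c = { 2048 / (16 * ways), ways, 16, NULL, NULL };
    c.blocks = calloc((size_t)c.sets * ways, sizeof *c.blocks);
    c.valid = calloc((size_t)c.sets * ways, 1);
    assert(c.blocks && c.valid);
    unsigned long misses = 0;
    for (int i = 0; i < 512; i++) {
        misses += sim_access(&c, array1_addr + 4 * (uint64_t)i);
        misses += sim_access(&c, array2_addr + 4 * (uint64_t)i);
    }
    free(c.blocks);
    free(c.valid);
    return misses;
}
```

Under these assumptions, placing array2 directly after array1 (addresses 0 and 2048) makes every access miss in the direct-mapped cache, since `array1[i]` and `array2[i]` always share a row. Shifting array2 by one block (address 2064), or using two ways, leaves only one miss per block.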

# arrays and cache misses (2)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on an initially empty 2KB direct-mapped cache with 16B cache blocks? Would a set-associative cache be better?
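The direct-mapped part of this question can be checked by replaying the two loops against a table of cached block numbers. This is a rough sketch with assumptions not stated on the slide: the function name `even_odd_misses` is hypothetical, the array is assumed to start at address 0 (so it is block-aligned), and `int` is assumed to be 4 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Count misses for the even/odd loops on a 2KB direct-mapped cache
   with 16B blocks. Assumes the array starts at address 0. */
unsigned even_odd_misses(void) {
    enum { BLOCK_BYTES = 16, ROWS = 2048 / BLOCK_BYTES };
    int64_t cached[ROWS];              /* block number held by each row */
    assert(sizeof(int) == 4);          /* the slide's 4KB array implies this */
    for (int r = 0; r < ROWS; r++)
        cached[r] = -1;                /* -1 marks an empty row */

    unsigned misses = 0;
    for (int pass = 0; pass < 2; pass++) {      /* pass 0: even, pass 1: odd */
        for (int i = 0; i < 1024; i += 2) {
            uint64_t addr = sizeof(int) * (uint64_t)(i + pass);
            int64_t block = (int64_t)(addr / BLOCK_BYTES);
            int row = (int)(block % ROWS);
            if (cached[row] != block) {         /* wrong or missing block */
                misses++;
                cached[row] = block;
            }
        }
    }
    return misses;
}
```

Each pass touches all 256 blocks of the array, but the cache holds only 128, so the second pass finds none of the early blocks still cached: 256 misses per pass, 512 in total.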

#### inclusive versus exclusive

L2 inclusive of L1

everything in L1 cache duplicated in L2

adding to L1 also adds to L2

#### L2 exclusive of L1

L2 contains different data than L1

adding to L1 must remove from L2

probably evicting from L1 adds to L2

exclusive policy: avoid duplicated data

sometimes called *victim cache* (contains cache eviction victims)

makes less sense with multicore





## **Tag-Index-Offset formulas (direct-mapped)**

(formulas derivable from prior slides)

| symbol            | meaning                       |
|-------------------|-------------------------------|
| $S = 2^s$         | number of sets                |
| $s$               | (set) index bits              |
| $B = 2^b$         | block size                    |
| $b$               | (block) offset bits           |
| $m$               | memory address bits           |
| $t = m - (s + b)$ | tag bits                      |
| $C = B \times S$  | cache size (if direct-mapped) |
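As a worked instance of these formulas, take the 2KB direct-mapped cache with 16B blocks from the array exercises. The address width $m = 64$ is an assumption here, borrowed from the Y86-64 figure quoted with the exercise formulas:

$$
\begin{aligned}
B &= 16 = 2^4 &&\Rightarrow b = 4 \\
S &= C / B = 2048 / 16 = 128 = 2^7 &&\Rightarrow s = 7 \\
t &= m - (s + b) = 64 - (7 + 4) = 53
\end{aligned}
$$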