"Kevin Skadron 64KB direct-mapped cache w/ 32-bit VA 64R blocks

AIJ=BUJ+CIJ

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag | value | valid | tag | value |
|-------|-------|-----|-------|-------|-----|-------|
| 0     | 0     |     |       | 0     |     |       |
| 1     | 0     |     |       | 0     |     |       |

multiple places to put values with the same index: avoids conflict misses



| index | valid | tag | value | valid | tag | value |
|-------|-------|-----|-------|-------|-----|-------|
| 0     | 0     |     | set 0 | 0     |     |       |
| 1     | 0     |     | set 1 | 0     |     |       |

| index | valid | tag | value | valid | tag | value |
|-------|-------|-----|-------|-------|-----|-------|
| 0     | 0     |     | way 0 | 0     |     | way 1 |
| 1     | 0     |     | way 0 | 0     |     | way 1 |

address bits: tag, index, offset

| index | valid | tag | value | valid | tag | value |
|-------|-------|-----|-------|-------|-----|-------|
| 0     | 0     |     |       | 0     |     |       |
| 1     | 0     |     |       | 0     |     |       |

$m=8$ bit addresses

$S=2=2^s$ sets; $s=1$ (set) index bits

$B=2=2^b$ byte block size; $b=1$ (block) offset bits

$t=m-(s+b)=6$ tag bits
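The split above can be sketched in a few lines of Python. This is only an illustration: the function name `split_address` is made up here, and the sample addresses are the ones used in the access trace on the following slides.

```python
def split_address(addr, s, b):
    """Split an address into (tag, index, offset).

    s = number of (set) index bits, b = number of (block) offset bits;
    whatever is left above them is the tag.
    """
    offset = addr & ((1 << b) - 1)
    index = (addr >> b) & ((1 << s) - 1)
    tag = addr >> (s + b)
    return tag, index, offset


# the example cache: m = 8, s = 1, b = 1, so t = 6 tag bits
for addr in (0x00, 0x01, 0x63, 0x61, 0x62, 0x64):
    tag, index, offset = split_address(addr, s=1, b=1)
    print(f"0x{addr:02x}: tag={tag:06b} index={index} offset={offset}")
```

For instance, 0x63 is 01100011 in binary, so it splits into tag 011000, index 1, offset 1.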

| index | valid | tag    | value                  | valid | tag | value |
|-------|-------|--------|------------------------|-------|-----|-------|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 0     |     |       |
| 1     | 0     |        |                        | 0     |     |       |

| address (binary) | hex | result |
|------------------|-----|--------|
| 000000 00        | 00  | miss   |
| 000000 01        | 01  |        |
| 011000 11        | 63  |        |
| 011000 01        | 61  |        |
| 011000 10        | 62  |        |
| 000000 00        | 00  |        |
| 011001 00        | 64  |        |

| index | valid | tag    | value                  | valid | tag | value |
|-------|-------|--------|------------------------|-------|-----|-------|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 0     |     |       |
| 1     | 0     |        |                        | 0     |     |       |

| address (binary) | hex | result |
|------------------|-----|--------|
| 000000 00        | 00  | miss   |
| 000000 01        | 01  | hit    |
| 011000 11        | 63  |        |
| 011000 01        | 61  |        |
| 011000 10        | 62  |        |
| 000000 00        | 00  |        |
| 011001 00        | 64  |        |

| index | valid | tag    | value                  | valid | tag | value |
|-------|-------|--------|------------------------|-------|-----|-------|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 0     |     |       |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     |     |       |

| address (binary) | hex | result |
|------------------|-----|--------|
| 000000 00        | 00  | miss   |
| 000000 01        | 01  | hit    |
| 011000 11        | 63  | miss   |
| 011000 01        | 61  |        |
| 011000 10        | 62  |        |
| 000000 00        | 00  |        |
| 011001 00        | 64  |        |

| index | valid | tag    | value                  | valid | tag    | value                  |
|-------|-------|--------|------------------------|-------|--------|------------------------|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 1     | 011000 | mem[0x60]<br>mem[0x61] |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     |        |                        |

| address (binary) | hex | result |
|------------------|-----|--------|
| 000000 00        | 00  | miss   |
| 000000 01        | 01  | hit    |
| 011000 11        | 63  | miss   |
| 011000 01        | 61  | miss   |
| 011000 10        | 62  |        |
| 000000 00        | 00  |        |
| 011001 00        | 64  |        |



| index | valid | tag    | value                  | valid | tag    | value                  |
|-------|-------|--------|------------------------|-------|--------|------------------------|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 1     | 011000 | mem[0x60]<br>mem[0x61] |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     |        |                        |

| address (binary) | hex | result |
|------------------|-----|--------|
| 000000 00        | 00  | miss   |
| 000000 01        | 01  | hit    |
| 011000 11        | 63  | miss   |
| 011000 01        | 61  | miss   |
| 011000 10        | 62  | hit    |
| 000000 00        | 00  |        |
| 011001 00        | 64  |        |

| index | valid | tag    | value                  | valid | tag    | value                  |
|-------|-------|--------|------------------------|-------|--------|------------------------|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 1     | 011000 | mem[0x60]<br>mem[0x61] |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     |        |                        |

| address (binary) | hex | result |
|------------------|-----|--------|
| 000000 00        | 00  | miss   |
| 000000 01        | 01  | hit    |
| 011000 11        | 63  | miss   |
| 011000 01        | 61  | miss   |
| 011000 10        | 62  | hit    |
| 000000 00        | 00  | hit    |
| 011001 00        | 64  |        |

| index | valid | tag    | value                  | valid | tag    | value                  |
|-------|-------|--------|------------------------|-------|--------|------------------------|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 1     | 011000 | mem[0x60]<br>mem[0x61] |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     |        |                        |

| address (binary) | hex | result                                      |
|------------------|-----|---------------------------------------------|
| 000000 00        | 00  | miss                                        |
| 000000 01        | 01  | hit                                         |
| 011000 11        | 63  | miss                                        |
| 011000 01        | 61  | miss                                        |
| 011000 10        | 62  | hit                                         |
| 000000 00        | 00  | hit                                         |
| 011001 00        | 64  | miss (needs to replace block in set 0!)     |

| index | valid | tag    | value                  | valid | tag    | value                  |
|-------|-------|--------|------------------------|-------|--------|------------------------|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 1     | 011000 | mem[0x60]<br>mem[0x61] |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     |        |                        |

LRU: in set 0, the block holding mem[0x60] and mem[0x61] is the least recently used

| address (binary) | hex | result |
|------------------|-----|--------|
| 000000 00        | 00  | miss   |
| 000000 01        | 01  | hit    |
| 011000 11        | 63  | miss   |
| 011000 01        | 61  | miss   |
| 011000 10        | 62  | hit    |
| 000000 00        | 00  | hit    |
| 011001 00        | 64  | miss   |
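The walkthrough above can be replayed with a small simulator. This is a minimal sketch, not the slides' own code: the function name `simulate` and the list-per-set representation (least recently used tag first) are choices made here for illustration.

```python
def simulate(addresses, s=1, b=1, ways=2):
    """Return 'hit' or 'miss' for each address in a set-associative LRU cache.

    s = index bits, b = block offset bits, ways = blocks per set.
    Each set is a list of tags ordered from least to most recently used.
    """
    sets = [[] for _ in range(1 << s)]
    results = []
    for addr in addresses:
        index = (addr >> b) & ((1 << s) - 1)
        tag = addr >> (s + b)
        lines = sets[index]
        if tag in lines:
            lines.remove(tag)          # will be re-appended as most recent
            results.append("hit")
        else:
            if len(lines) == ways:
                lines.pop(0)           # set is full: evict the LRU block
            results.append("miss")
        lines.append(tag)
    return results


trace = [0x00, 0x01, 0x63, 0x61, 0x62, 0x00, 0x64]
print(simulate(trace))
# ['miss', 'hit', 'miss', 'miss', 'hit', 'hit', 'miss']
```

The last access (0x64) maps to set 0, which is full, so the simulator evicts the tag 011000 block (mem[0x60], mem[0x61]), matching the LRU marking in the table.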

## cache operation (associative)



### associative lookup possibilities

none of the blocks for the index are valid

none of the valid blocks for the index match the tag: something else is stored there

one of the blocks for the index is valid and matches the tag

### handling writes

what about writing to the cache?

two decision points:

if the value is not in cache, do we add it?

if yes: need to load rest of block

if no: missing out on locality?

if value is in cache, when do we update next level?

if immediately: extra writing

if later: need to remember to do so

#### allocate on write?

processor writes less than whole cache block

block not yet in cache

two options:

#### write-allocate

fetch rest of cache block, replace written part (then follow write-through or write-back policy)

#### write-no-allocate

don't use cache at all (send write to memory *instead*)

guess: not read soon?

option 1: write-through





option 2: write-back





## writeback policy

changed value!

2-way set associative, 2 byte blocks, 2 sets

| index | valid | tag    | value                  | dirty | valid | tag    | value                    | dirty | LRU |
|-------|-------|--------|------------------------|-------|-------|--------|--------------------------|-------|-----|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 0     | 1     | 011000 | mem[0x60]*<br>mem[0x61]* | 1     | 1   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     | 0     |        |                          |       | 0   |

1 = dirty (different than memory): needs to be written if evicted

2-way set associative, LRU, writeback

| index | valid | tag    | value                  | dirty | valid | tag    | value                    | dirty | LRU |
|-------|-------|--------|------------------------|-------|-------|--------|--------------------------|-------|-----|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 0     | 1     | 011000 | mem[0x60]*<br>mem[0x61]* | 1     | 1   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     | 0     |        |                          |       | 0   |

writing 0xFF into address 0x04?

index 0, tag 000001

step 1: find least recently used block

step 2: possibly writeback old block

2-way set associative, LRU, writeback

| index | valid | tag    | value                  | dirty | valid | tag    | value             | dirty | LRU |
|-------|-------|--------|------------------------|-------|-------|--------|-------------------|-------|-----|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 0     | 1     | 000001 | 0xFF<br>mem[0x05] | 1     | 0   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     | 0     |        |                   |       | 0   |

step 3a: read in new block (to get mem[0x05])

step 3b: update LRU information

2-way set associative, LRU, writeback, write-no-allocate

| index | valid | tag    | value                  | dirty | valid | tag    | value                    | dirty | LRU |
|-------|-------|--------|------------------------|-------|-------|--------|--------------------------|-------|-----|
| 0     | 1     | 000000 | mem[0x00]<br>mem[0x01] | 0     | 1     | 011000 | mem[0x60]*<br>mem[0x61]* | 1     | 1   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     | 0     |        |                          |       | 0   |

writing 0xFF into address 0x04?

step 1: is it in cache yet?

step 2: no, just send it to memory

# exercise (1)

2-way set associative, LRU, write-allocate, writeback

| index | valid | tag    | value                  | dirty | valid | tag    | value                    | dirty | LRU |
|-------|-------|--------|------------------------|-------|-------|--------|--------------------------|-------|-----|
| 0     | 1     | 001100 | mem[0x30]<br>mem[0x31] | 0     | 1     | 010000 | mem[0x40]*<br>mem[0x41]* | 1     | 0   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     | 1     | 001100 | mem[0x32]*<br>mem[0x33]* | 1     | 1   |

for each of the following accesses, performed alone, would it require (a) reading a value from memory (or next level of cache) and (b) writing a value to the memory (or next level of cache)?

writing 1 byte to 0x33

reading 1 byte from 0x52

reading 1 byte from 0x50

2-way set associative, LRU, write-allocate, writeback

| index | valid | tag    | value                  | dirty | valid | tag    | value                    | dirty | LRU |
|-------|-------|--------|------------------------|-------|-------|--------|--------------------------|-------|-----|
| 0     | 1     | 001100 | mem[0x30]<br>mem[0x31] | 0     | 1     | 010000 | mem[0x40]*<br>mem[0x41]* | 1     | 0   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     | 1     | 001100 | mem[0x32]*<br>mem[0x33]* | 1     | 1   |

writing 1 byte to 0x33: (set 1, offset 1) no read or write

reading 1 byte from 0x52: (set 1, offset 0) **write** back 0x32-0x33; **read** 0x52-0x53

2-way set associative, LRU, write-allocate, writeback

state after reading 1 byte from 0x52 (tag 010100, set 1); the new block is clean and way 0 is now the LRU block in set 1:

| index | valid | tag    | value                  | dirty | valid | tag    | value                    | dirty | LRU |
|-------|-------|--------|------------------------|-------|-------|--------|--------------------------|-------|-----|
| 0     | 1     | 001100 | mem[0x30]<br>mem[0x31] | 0     | 1     | 010000 | mem[0x40]*<br>mem[0x41]* | 1     | 0   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     | 1     | 010100 | mem[0x52]<br>mem[0x53]   | 0     | 0   |

reading 1 byte from 0x50: (set 0, offset 0) replace 0x30-0x31 (no write back); **read** 0x50-0x51

2-way set associative, LRU, write-allocate, writeback

state after reading 1 byte from 0x50 (tag 010100, set 0), starting again from the original state; way 1 is now the LRU block in set 0:

| index | valid | tag    | value                  | dirty | valid | tag    | value                    | dirty | LRU |
|-------|-------|--------|------------------------|-------|-------|--------|--------------------------|-------|-----|
| 0     | 1     | 010100 | mem[0x50]<br>mem[0x51] | 0     | 1     | 010000 | mem[0x40]*<br>mem[0x41]* | 1     | 1   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 0     | 1     | 001100 | mem[0x32]*<br>mem[0x33]* | 1     | 1   |
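The answers to exercise (1) can be checked mechanically. The sketch below hard-codes the exercise's starting state; the names `INITIAL` and `traffic` are invented for this illustration, and it assumes every way is valid, as in the exercise.

```python
# Starting state of exercise (1): per set, the ways as (tag, dirty)
# plus the index of the least recently used way.
INITIAL = {
    0: {"ways": [(0b001100, False), (0b010000, True)], "lru": 0},
    1: {"ways": [(0b011000, False), (0b001100, True)], "lru": 1},
}


def traffic(addr, sets=INITIAL):
    """Return (reads_memory, writes_memory) for one access performed alone.

    With write-allocate + writeback the answer is the same for a read
    and a write: a hit touches only the cache, a miss fetches the block
    and writes back the LRU victim only if it is dirty.
    """
    index = (addr >> 1) & 1          # b = 1 offset bit, s = 1 index bit
    tag = addr >> 2
    entry = sets[index]
    if any(t == tag for t, _dirty in entry["ways"]):
        return (False, False)        # hit
    _victim_tag, victim_dirty = entry["ways"][entry["lru"]]
    return (True, victim_dirty)      # miss


for addr in (0x33, 0x52, 0x50):
    print(hex(addr), traffic(addr))
```

Only the access to 0x52 needs a write to memory, because it is the only one whose LRU victim is dirty.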

# exercise (2)

2-way set associative, LRU, write-no-allocate, write-through

| index | valid | tag    | value                  | valid | tag    | value                  | LRU |
|-------|-------|--------|------------------------|-------|--------|------------------------|-----|
| 0     | 1     | 001100 | mem[0x30]<br>mem[0x31] | 1     | 010000 | mem[0x40]<br>mem[0x41] | 0   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 1     | 001100 | mem[0x32]<br>mem[0x33] | 1   |

for each of the following accesses, performed alone, would it require (a) reading a value from memory and (b) writing a value to the memory?

writing 1 byte to 0x33

reading 1 byte from 0x52

reading 1 byte from 0x50

2-way set associative, LRU, write-no-allocate, write-through

| index | valid | tag    | value                  | valid | tag    | value                  | LRU |
|-------|-------|--------|------------------------|-------|--------|------------------------|-----|
| 0     | 1     | 001100 | mem[0x30]<br>mem[0x31] | 1     | 010000 | mem[0x40]<br>mem[0x41] | 0   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 1     | 001100 | mem[0x32]<br>mem[0x33] | 1   |

writing 1 byte to 0x33: (set 1, offset 1) write-through 0x33 modification

reading 1 byte from 0x52: (set 1, offset 0) replace 0x32-0x33; **read** 0x52-0x53

2-way set associative, LRU, write-no-allocate, write-through

state after reading 1 byte from 0x52 (tag 010100, set 1); way 0 is now the LRU block in set 1:

| index | valid | tag    | value                  | valid | tag    | value                  | LRU |
|-------|-------|--------|------------------------|-------|--------|------------------------|-----|
| 0     | 1     | 001100 | mem[0x30]<br>mem[0x31] | 1     | 010000 | mem[0x40]<br>mem[0x41] | 0   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 1     | 010100 | mem[0x52]<br>mem[0x53] | 0   |

reading 1 byte from 0x50: (set 0, offset 0) replace 0x30-0x31; **read** 0x50-0x51

2-way set associative, LRU, write-no-allocate, write-through

state after reading 1 byte from 0x50 (tag 010100, set 0), starting again from the original state; way 1 is now the LRU block in set 0:

| index | valid | tag    | value                  | valid | tag    | value                  | LRU |
|-------|-------|--------|------------------------|-------|--------|------------------------|-----|
| 0     | 1     | 010100 | mem[0x50]<br>mem[0x51] | 1     | 010000 | mem[0x40]<br>mem[0x41] | 1   |
| 1     | 1     | 011000 | mem[0x62]<br>mem[0x63] | 1     | 001100 | mem[0x32]<br>mem[0x33] | 1   |
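For comparison with exercise (1), here is the same kind of check for write-no-allocate with write-through. It is a sketch under the exercise's assumptions; `TAGS` and `traffic` are names invented for this illustration.

```python
# Tags resident in each set at the start of exercise (2).
TAGS = {0: {0b001100, 0b010000}, 1: {0b011000, 0b001100}}


def traffic(addr, is_write):
    """Return (reads_memory, writes_memory) for one access performed alone."""
    index = (addr >> 1) & 1          # b = 1 offset bit, s = 1 index bit
    tag = addr >> 2
    if is_write:
        # write-through: every write is sent to memory;
        # write-no-allocate: a write miss never fetches the block
        return (False, True)
    hit = tag in TAGS[index]
    # read miss: fetch the block; the evicted block is never dirty
    # under write-through, so no write back is needed
    return (not hit, False)


print(traffic(0x33, is_write=True))
print(traffic(0x52, is_write=False))
print(traffic(0x50, is_write=False))
```

Note that no dirty bits are needed here: with write-through, memory always holds the current value, so evictions are free.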

#### fast writes



write appears to complete immediately when placed in buffer

memory can be much slower

#### cache miss types

4 C's

common to categorize misses:

roughly "cause" of miss, assuming cache block size fixed

compulsory (or cold): first time accessing something; adding more sets or blocks/set wouldn't change

conflict: sets aren't big/flexible enough; a fully-associative (1-set) cache of the same size would have done better

capacity: cache was not big enough

coherence: from sync'ing cache with other caches; only an issue with multiple cores

# making any cache look bad

1. access enough blocks to fill the cache
2. access an additional block, replacing something
3. access last block replaced
4. access last block replaced
5. access last block replaced

...

but — typical real programs have locality



#### cache optimizations

(assuming typical locality + keeping cache size constant if possible...)

|                        | miss rate | hit time | miss penalty |
|------------------------|-----------|----------|--------------|
| increase cache size    | better    | worse    |              |
| increase associativity | better    | worse    | worse?       |
| increase block size    | depends   | worse    | worse        |
| add secondary cache    |           |          | better       |
| write-allocate         | better    |          |              |
| writeback              |           |          |              |
| LRU replacement        | better    |          | worse?       |
| prefetching            | better    |          |              |

prefetching = guess what program will use, access in advance

average time = hit time + miss rate $\times$ miss penalty
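To see how the three columns trade off, the average-time formula can be evaluated directly. The numbers below are purely hypothetical (they are not from the slides) and are chosen only to show a lower miss rate outweighing a slower hit.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """average time = hit time + miss rate * miss penalty (all times in cycles)."""
    return hit_time + miss_rate * miss_penalty


# hypothetical baseline: 2-cycle hits, 10% misses, 100-cycle miss penalty
baseline = average_access_time(hit_time=2, miss_rate=0.10, miss_penalty=100)

# hypothetical larger cache: slower hits (3 cycles) but half the misses
larger = average_access_time(hit_time=3, miss_rate=0.05, miss_penalty=100)

print(f"baseline: {baseline:.1f} cycles, larger cache: {larger:.1f} cycles")
```

With these made-up numbers the larger cache comes out ahead (about 8 cycles versus about 12), even though its hit time is worse.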

#### cache optimizations by miss type



#### another view




# two-level page table lookup



#### cache accesses and multi-level PTs

four-level page tables — five cache accesses per program memory access

L1 cache hits — typically a couple cycles each?

so add 8 cycles to each program memory access?

not acceptable

#### program memory active sets



(figure: address space diagram; labeled addresses, top to bottom: 0xFFFF FFFF FFFF, 0xFFFF 8000 0000 0000, 0x7F..., 0x0000 0000 0040 0000)

small areas of memory active at a time

one or two pages in each area?

#### page table entries and locality

page table entries have excellent temporal locality

typically one or two pages of the stack active

typically one or two pages of code active

typically one or two pages of heap/globals active

each page contains whole functions, arrays, stack frames, etc.

needed page table entries are very small

called a **TLB** (translation lookaside buffer)

very small cache of page table entries

| L1 cache                    | TLB                            |
|-----------------------------|--------------------------------|
| physical addresses          | virtual page numbers           |
| bytes from memory           | page table entries             |
| tens of bytes per block     | one page table entry per block |
| usually thousands of blocks | usually tens of entries        |

only caches the page table lookup itself

(generally) just entries from the last-level page table

not much spatial locality between page table entries (they're used for kilobytes of data already)
(and if spatial locality, maybe use larger page size?)

few active page table entries at a time: enables highly associative cache designs

#### TLB and multi-level page tables

TLB caches valid last-level page table entries

doesn't matter which last-level page table

means TLB output can be used directly to form address

# TLB and two-level lookup



# address splitting for TLBs (1)

my desktop:

4KB ( $2^{12}$  byte) pages; 48-bit virtual address

64-entry, 4-way L1 data TLB

TLB index bits?

TLB tag bits?

# address splitting for TLBs (1)

my desktop:

4KB ( $2^{12}$  byte) pages; 48-bit virtual address

64-entry, 4-way L1 data TLB

TLB index bits?

$64/4 = 16$ sets: 4 bits

TLB tag bits?

$48-12=36$ bit virtual page number; $36-4=32$ bit TLB tag

# address splitting for TLBs (2)

my desktop:

4KB ( $2^{12}$  byte) pages; 48-bit virtual address

1536-entry  $(3 \cdot 2^9)$ , 12-way L2 TLB

TLB index bits?

TLB tag bits?

# address splitting for TLBs (2)

my desktop:

4KB ( $2^{12}$  byte) pages; 48-bit virtual address

1536-entry  $(3 \cdot 2^9)$ , 12-way L2 TLB

TLB index bits?

$1536/12 = 128$ sets: 7 bits

TLB tag bits?

$48-12=36$ bit virtual page number; $36-7=29$ bit TLB tag
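Both TLB splits follow the same arithmetic, which can be written as a short helper. The function name `tlb_split` is made up for this sketch; the defaults (48-bit virtual addresses, 12 page offset bits) are the values stated on the slide.

```python
def tlb_split(entries, ways, va_bits=48, page_offset_bits=12):
    """Return (number of sets, index bits, tag bits) for a set-associative TLB."""
    sets = entries // ways
    index_bits = sets.bit_length() - 1        # log2 for a power of two
    assert 1 << index_bits == sets, "number of sets must be a power of two"
    vpn_bits = va_bits - page_offset_bits     # virtual page number width
    return sets, index_bits, vpn_bits - index_bits


print(tlb_split(64, 4))      # L1 data TLB -> (16, 4, 32)
print(tlb_split(1536, 12))   # L2 TLB      -> (128, 7, 29)
```

Unlike a data cache, there are no block offset bits here: the page offset is dropped before the TLB is consulted, and each entry holds one page table entry.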

# exercise: TLB access pattern (setup)

4-entry, 2-way TLB, LRU replacement policy, initially empty

4096 byte pages

how many index bits?

TLB index of virtual address 0x12345?

# exercise: TLB access pattern

4-entry, 2-way TLB, LRU replacement policy, initially empty

4096 byte pages

| type  | virtual    | physical |
|-------|------------|----------|
| read  | 0x440030   | 0x554030 |
| write | 0x440034   | 0x554034 |
| read  | 0x7FFFE008 | 0x556008 |
| read  | 0x7FFFE000 | 0x556000 |
| read  | 0x7FFFDFF8 | 0x5F8FF8 |
| read  | 0x664080   | 0x5F9080 |
| read  | 0x440038   | 0x554038 |
| write | 0x7FFFDFF0 | 0x5F8FF0 |

which are TLB hits? which are TLB misses? final contents of TLB?

## exercise: TLB access pattern

4-entry, 2-way TLB, LRU replacement policy, initially empty

4096 byte pages

(set 0 and set 1 columns: VPNs of PTEs held in TLB after the access)

| type  | virtual    | physical | result | set 0          | set 1   |
|-------|------------|----------|--------|----------------|---------|
| read  | 0x440030   | 0x554030 | miss   | 0x440          |         |
| write | 0x440034   | 0x554034 | hit    | 0x440          |         |
| read  | 0x7FFFE008 | 0x556008 | miss   | 0x440, 0x7FFFE |         |
| read  | 0x7FFFE000 | 0x556000 | hit    | 0x440, 0x7FFFE |         |
| read  | 0x7FFFDFF8 | 0x5F8FF8 | miss   | 0x440, 0x7FFFE | 0x7FFFD |
| read  | 0x664080   | 0x5F9080 | miss   | 0x664, 0x7FFFE | 0x7FFFD |
| read  | 0x440038   | 0x554038 | miss   | 0x664, 0x440   | 0x7FFFD |
| write | 0x7FFFDFF0 | 0x5F8FF0 | hit    | 0x664, 0x440   | 0x7FFFD |

which are TLB hits? which are TLB misses? final contents of TLB?
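The access pattern can be replayed with a short simulation. This is a sketch for illustration: `simulate_tlb` is a name invented here, and each set is modeled as a list of virtual page numbers ordered from least to most recently used.

```python
def simulate_tlb(virtual_addresses, entries=4, ways=2, page_bits=12):
    """Return (results, sets) for an LRU set-associative TLB, initially empty."""
    num_sets = entries // ways
    sets = [[] for _ in range(num_sets)]      # VPNs per set, LRU first
    results = []
    for va in virtual_addresses:
        vpn = va >> page_bits                 # drop the page offset
        lines = sets[vpn % num_sets]          # low VPN bits select the set
        if vpn in lines:
            lines.remove(vpn)
            results.append("hit")
        else:
            if len(lines) == ways:
                lines.pop(0)                  # evict the LRU entry
            results.append("miss")
        lines.append(vpn)                     # now most recently used
    return results, sets


trace = [0x440030, 0x440034, 0x7FFFE008, 0x7FFFE000,
         0x7FFFDFF8, 0x664080, 0x440038, 0x7FFFDFF0]
results, sets = simulate_tlb(trace)
print(results)
print([[hex(vpn) for vpn in s] for s in sets])
```

With 4 entries and 2 ways there are 2 sets, so there is 1 index bit; for the setup question, virtual address 0x12345 has VPN 0x12, whose low bit is 0, so it maps to set 0.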

## exercise: TLB access pattern

4-entry, 2-way TLB, LRU replacement policy, initially empty

4096 byte pages

| set idx | V | tag                     | physical page | write? | user? | ... | LRU? |
|---------|---|-------------------------|---------------|--------|-------|-----|------|
| 0       | 1 | 0x00220 (0x00440 >> 1)  | 0x554         | 1      | 1     | ... | no   |
| 0       | 1 | 0x00332 (0x00664 >> 1)  | 0x5F9         | 1      | 1     | ... | yes  |
| 1       | 1 | 0x3FFFE (0x7FFFD >> 1)  | 0x5F8         | 1      | 1     | ... | no   |
| 1       | 0 |                         |               |        |       | ... | yes  |

which are TLB hits? which are TLB misses? final contents of TLB?


## changing page tables

what happens to TLB when page table base pointer is changed? e.g. context switch

most entries in TLB refer to things from wrong process

oops: read from the wrong process's stack?

option 1: invalidate all TLB entries

side effect on "change page table base register" instruction

option 2: TLB entries contain process ID

set by OS (special register)

checked by TLB in addition to TLB tag, valid bit
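The two options can be contrasted with a toy model. This is a sketch, not a description of any real MMU: the class `TaggedTLB` and its methods are hypothetical, the TLB is treated as fully associative for brevity, and `current_pid` stands in for the special register set by the OS.

```python
class TaggedTLB:
    """Toy TLB whose entries carry a process ID (option 2)."""

    def __init__(self):
        self.entries = {}        # (pid, vpn) -> physical page number
        self.current_pid = 0     # stand-in for the OS-set special register

    def insert(self, vpn, ppn):
        self.entries[(self.current_pid, vpn)] = ppn

    def lookup(self, vpn):
        """Return the physical page number, or None on a TLB miss."""
        return self.entries.get((self.current_pid, vpn))

    def flush(self):
        """Option 1: invalidate every entry (e.g. on a context switch)."""
        self.entries.clear()


tlb = TaggedTLB()
tlb.current_pid = 1
tlb.insert(0x7FFFE, 0x556)       # process 1 touches its stack page

tlb.current_pid = 2              # context switch, no flush needed
print(tlb.lookup(0x7FFFE))       # None: process 2 cannot use the entry

tlb.current_pid = 1              # switch back
print(hex(tlb.lookup(0x7FFFE)))  # 0x556: the entry survived the switch
```

The benefit of option 2 shows in the last line: entries for the first process are still usable after switching back, whereas option 1 would have discarded them.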

## editing page tables

what happens to TLB when OS changes a page table entry?

most common choice: has to be handled in software

invalid to valid — nothing needed

TLB doesn't contain invalid entries

MMU will check memory again

valid to invalid: OS needs to tell processor to invalidate it

special instruction (x86: invlpg)

valid to other valid — OS needs to tell processor to invalidate it

# backup slides

### inclusive versus exclusive



#### L2 exclusive of L1

L2 contains different data than L1

adding to L1 must remove from L2

probably evicting from L1 adds to L2



#### L2 inclusive of L1

everything in L1 cache duplicated in L2

adding to L1 also adds to L2

exclusive policy:

avoid duplicated data

sometimes called *victim cache* (contains cache eviction victims)

makes less sense with multicore


# **Tag-Index-Offset formulas (direct-mapped)**

(formulas derivable from prior slides)

$S = 2^s$: number of sets

$s$: (set) index bits

$B = 2^b$: block size

$b$: (block) offset bits

$m$: memory address bits

$t = m - (s + b)$: tag bits

$C = B \times S$: cache size (if direct-mapped)

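As a worked instance of these formulas, take an assumed example (chosen here for illustration) of a 64KB direct-mapped cache with 64-byte blocks and 32-bit addresses, so $C = 2^{16}$, $B = 2^{6}$, and $m = 32$:

$$
\begin{aligned}
b &= \log_2 B = 6 \\
S &= C / B = 2^{16} / 2^{6} = 2^{10} \\
s &= \log_2 S = 10 \\
t &= m - (s + b) = 32 - (10 + 6) = 16
\end{aligned}
$$

So each address splits into a 16-bit tag, a 10-bit index, and a 6-bit block offset.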