### changelog

27 Feb 2024: quiz Q4: correct swapped 1001/0000 in representing 0x90 as binary and resulting error in overall PTE value on slide

#### last time

#### multi-level page tables

each level: single page table index = part of virtual page number

use virtual page number from MSB to LSB

check valid bits at each level

later levels omitted if there would be no valid entries

### quiz Q1B

OS needed to allocate whole pages of heap

allocated 0x500-0x5ff (virtual page 5) and 0x600-0x6ff (virtual page 6)

child: got private copy of page 0x3 (0x300-0x3ff), 0x4, 0xe

parent: got private copy page 0xe

B: write to page 0x5 (0x500) — need private copy

D: can avoid copying 0xe twice by recognizing it's no longer shared

0x123450 = virtual page 0x12, page offset 0x3450

 $0xA0000 + 0x12 \times 8$  bytes per PTE = 0xA0090

```
0x903450 = physical page 0x90, page offset 0x3450
page table entry for this:
     24 unused bits (must be 0) = 0
     24-bit physical page = 0x90
     11 unused bits (must be 0) = 0
     user-mode-accessible bit = 1
     readable bit = 1
     writeable bit =?
     executable bit =?
     valid bit = 1
0000 ... 0000 0000 0000 0000 0000 1001 0000 0000 0000 0001 1??1
                0x 90 00 1 (9 or B or D or F)
```
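The arithmetic above can be checked with a short C sketch. It assumes the layout listed on the slide (16-bit page offset, 8-byte page table entries, flag bits user/readable/writeable/executable/valid in bits 4 down to 0, physical page number in bits 16 to 39). The names `pte_address` and `make_pte` are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed from the slide: 16-bit page offsets and 8-byte page table entries. */
enum { PAGE_OFFSET_BITS = 16, PTE_BYTES = 8, PPN_SHIFT = 16 };

/* Address of the PTE for vaddr in a single-level table starting at table_base. */
uint64_t pte_address(uint64_t table_base, uint64_t vaddr) {
    uint64_t vpn = vaddr >> PAGE_OFFSET_BITS;   /* virtual page number */
    return table_base + vpn * PTE_BYTES;
}

/* Packs a PTE: physical page number in bits 16-39, flags in bits 4..0. */
uint64_t make_pte(uint64_t ppn, int user, int readable, int writeable,
                  int executable, int valid) {
    assert(ppn < (UINT64_C(1) << 24));          /* physical page number is 24 bits */
    return (ppn << PPN_SHIFT)
         | ((uint64_t)(user != 0) << 4)
         | ((uint64_t)(readable != 0) << 3)
         | ((uint64_t)(writeable != 0) << 2)
         | ((uint64_t)(executable != 0) << 1)
         | (uint64_t)(valid != 0);
}
```

With these definitions, `pte_address(0xA0000, 0x123450)` is `0xA0090`, and `make_pte(0x90, 1, 1, 0, 0, 1)` is `0x900019`, the first of the four possible values listed above.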

#### 0x00123456789ABC



## lab this week

























approx. register location via chip-architect.org (Hans de Vries)

# the place of cache (1)



# memory hierarchy goals

```
performance of the fastest (smallest) memory
hide 100x latency difference? 99+% hit (= value found in cache) rate
capacity of the largest (slowest) memory
```

## memory hierarchy assumptions

#### temporal locality

"if a value is accessed now, it will be accessed again soon"

caches should keep recently accessed values

#### spatial locality

"if a value is accessed now, adjacent values will be accessed soon"

caches should store adjacent values at the same time

natural properties of programs — think about loops

## locality examples

```
double computeMean(int length, double *values) {
    double total = 0.0;
    for (int i = 0; i < length; ++i) {
        total += values[i];
    }
    return total / length;
}
```

temporal locality: machine code of the loop

spatial locality: machine code of most consecutive instructions

temporal locality: total, i, length accessed repeatedly

spatial locality: values[i+1] accessed after values[i]



## hierarchy and instruction/data caches

typically separate data and instruction caches for L1

(almost) never going to read instructions as data or vice-versa

avoids instructions evicting data and vice-versa

can optimize instruction cache for different access pattern

easier to build fast caches that handle fewer accesses at a time



decision: divide memory into two-byte blocks

put exactly one of these blocks in the cache

#### Cache

| value |
|-------|
| 00 00 |

#### Memory

| addresses   | bytes |
|-------------|-------|
| 00000-00001 | 00 11 |
| 00010-00011 | 22 33 |
| 00100-00101 | 55 55 |
| 00110-00111 | 66 77 |
| 01000-01001 | 88 99 |
| 01010-01011 | AA BB |
| 01100-01101 | CC DD |
| 01110-01111 | EE FF |
| 10000-10001 | F0 F1 |
| •••         | •••   |

read byte at 01011?


read byte at 01011? invalid, fetch




| Cache (valid, tag, value) | Memory addresses | Memory bytes |
|---------------------------|------------------|--------------|
| 1 0101 AA BB    | 00000-00001 | 00 11 |
|                 | 00010-00011 | 22 33 |
|                 | 00100-00101 | 55 55 |
|                 | 00110-00111 | 66 77 |
|                 | 01000-01001 | 88 99 |
|                 | 01010-01011 | AA BB |
|                 | 01100-01101 | CC DD |
|                 | 01110-01111 | EE FF |
|                 | 10000-10001 | F0 F1 |
|                 | •••         | •••   |


exactly one place for each address

spread out what can go in a block






## building a (direct-mapped) cache

read byte at 01011? invalid, fetch




| Cache index | valid | tag | value | Memory addresses | Memory bytes |
|-------------|-------|-----|-------|------------------|--------------|
| 00          | 0     | 00  | 00 00 | 00000-00001      | 00 11        |
| 01          | 1     | 01  | AA BB | 00010-00011      | 22 33        |
| 10          | 0     | 00  | 00 00 | 00100-00101      | 55 55        |
| 11          | 0     | 00  | 00 00 | 00110-00111      | 66 77        |
|             |       |     |       | 01000-01001      | 88 99        |
|             |       |     |       | 01010-01011      | AA BB        |
|             |       |     |       | 01100-01101      | CC DD        |
|             |       |     |       | 01110-01111      | EE FF        |
|             |       |     |       | 10000-10001      | F0 F1        |

cache block: 2 bytes

direct-mapped
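The lookup these slides step through can be sketched in C. This is only an illustrative model: the names `cache_line` and `cache_read` are made up here, and it assumes the 5-bit addresses, 2-byte blocks, and 4 sets of the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK_SIZE = 2, NUM_SETS = 4 };

typedef struct {
    bool valid;
    uint8_t tag;
    uint8_t value[BLOCK_SIZE];
} cache_line;

/* Reads one byte through a direct-mapped cache.
 * Sets *hit to whether the block was already cached; on a miss the
 * whole block is fetched from memory into the one set it can use. */
uint8_t cache_read(cache_line cache[NUM_SETS], const uint8_t *memory,
                   unsigned addr, bool *hit) {
    unsigned offset = addr % BLOCK_SIZE;               /* which byte in the block */
    unsigned index = (addr / BLOCK_SIZE) % NUM_SETS;   /* which set */
    unsigned tag = addr / (BLOCK_SIZE * NUM_SETS);     /* whatever is left over */
    cache_line *line = &cache[index];

    assert(hit != NULL);
    *hit = line->valid && line->tag == tag;
    if (!*hit) {
        line->valid = true;
        line->tag = (uint8_t)tag;
        memcpy(line->value, &memory[addr - offset], BLOCK_SIZE);
    }
    return line->value[offset];
}
```

Starting from an empty cache, reading address `01011` misses and fills set `01` with tag `01` and value `AA BB`; reading `01010` afterwards hits.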

## terminology

```
row = set
```

preview: change how much is in a row

address 001111 (stores value 0xFF)

| cache                 | tag | index | offset |
|-----------------------|-----|-------|--------|
| 2 byte blocks, 4 sets | 001 | 11    | 1      |
| 2 byte blocks, 8 sets | 00  | 111   | 1      |
| 4 byte blocks, 2 sets | 001 | 1     | 11     |

- offset: $2 = 2^1$ byte block, 1 bit to say which byte; $4 = 2^2$ byte block, 2 bits to say which byte
- index: $2^2 = 4$ sets, 2 bits to index set; $2^1 = 2$ sets, 1 bit to index set
- tag: whatever is left over

2 byte blocks, 4 sets

| index | valid | tag | value |
|-------|-------|-----|-------|
| 00    | 1     | 000 | 00 11 |
| 01    | 1     | 001 | AA BB |
| 10    | 0     |     |       |
| 11    | 1     | 001 | EE FF |

4 byte blocks, 2 sets

| index | valid | tag | value       |
|-------|-------|-----|-------------|
| 0     | 1     | 000 | 00 11 22 33 |
| 1     | 1     | 001 | CC DD EE FF |

2 byte blocks, 8 sets

| index | valid | tag | value |
|-------|-------|-----|-------|
| 000   | 1     | 00  | 00 11 |
| 001   | 1     | 01  | F1 F2 |
| 010   | 0     |     |       |
| 011   | 0     |     |       |
| 100   | 0     |     |       |
| 101   | 1     | 00  | AA BB |
| 110   | 0     |     |       |
| 111   | 1     | 00  | EE FF |
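Splitting an address into tag, index, and offset is just shifting and masking. The sketch below is one way to write it; the `tio` struct and `split_address` function are hypothetical names, and it assumes `unsigned` is at least 32 bits wide.

```c
#include <assert.h>

typedef struct {
    unsigned tag, index, offset;
} tio;

/* Splits addr into tag/index/offset given the number of offset and index bits. */
tio split_address(unsigned addr, unsigned offset_bits, unsigned index_bits) {
    assert(offset_bits + index_bits < 32);   /* keep the shifts well-defined */
    tio parts;
    parts.offset = addr & ((1u << offset_bits) - 1);                 /* low bits */
    parts.index = (addr >> offset_bits) & ((1u << index_bits) - 1);  /* middle bits */
    parts.tag = addr >> (offset_bits + index_bits);                  /* whatever is left */
    return parts;
}
```

For address `001111`, `split_address(0x0F, 1, 2)` gives tag `001`, index `11`, offset `1` (2 byte blocks, 4 sets), and `split_address(0x0F, 2, 1)` gives tag `001`, index `1`, offset `11` (4 byte blocks, 2 sets).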

#### cache size

cache size = amount of data in cache

not including metadata (tags, valid bits, etc.)

## Tag-Index-Offset formulas (direct-mapped)

(formulas derivable from prior slides)

- $S = 2^s$: number of sets
- $s$: (set) index bits
- $B = 2^b$: block size
- $b$: (block) offset bits
- $m$: memory address bits
- $t = m - (s + b)$: tag bits
- $C = B \times S$: cache size (if direct-mapped)
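As a rough check of these formulas, the sketch below derives the bit counts from the number of sets, the block size, and the address width. The names `log2_exact`, `tio_bits`, and `tio_from_geometry` are invented for this example, and it assumes both sizes are powers of two.

```c
#include <assert.h>

/* Returns log2(x) for a power of two x. */
unsigned log2_exact(unsigned long x) {
    assert(x != 0 && (x & (x - 1)) == 0);   /* must be a power of two */
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        bits++;
    }
    return bits;
}

typedef struct {
    unsigned tag_bits, index_bits, offset_bits;
    unsigned long cache_bytes;
} tio_bits;

/* Applies s = log2(S), b = log2(B), t = m - (s + b), C = B * S. */
tio_bits tio_from_geometry(unsigned long num_sets, unsigned long block_bytes,
                           unsigned address_bits) {
    tio_bits r;
    r.index_bits = log2_exact(num_sets);
    r.offset_bits = log2_exact(block_bytes);
    assert(address_bits >= r.index_bits + r.offset_bits);
    r.tag_bits = address_bits - (r.index_bits + r.offset_bits);
    r.cache_bytes = block_bytes * num_sets;   /* direct-mapped only */
    return r;
}
```

For a cache with 4 sets and 2-byte blocks on 6-bit addresses, `tio_from_geometry(4, 2, 6)` yields 3 tag bits, 2 index bits, 1 offset bit, and 8 bytes of data.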

#### TIO: exercise

64-byte blocks, 128 set cache

stores  $64 \times 128 = 8192$  bytes (of data)

if addresses 32-bits, then how many tag/index/offset bits?

which bytes are stored in the same block as byte from 0x1037?

- A. byte from 0x1011
- B. byte from 0x1021
- C. byte from 0x1035
- D. byte from 0x1041
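One way to check the second question is to compare addresses with their offset bits cleared. This is a minimal sketch; `same_block` is a hypothetical helper and it assumes the block size is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when both byte addresses fall in the same block_bytes-sized block. */
bool same_block(uint32_t a, uint32_t b, uint32_t block_bytes) {
    assert(block_bytes != 0 && (block_bytes & (block_bytes - 1)) == 0);
    uint32_t mask = ~(block_bytes - 1);   /* clears the offset bits */
    return (a & mask) == (b & mask);
}
```

With 64-byte blocks the low 6 bits are the offset, so 0x1037 lies in the block that covers 0x1000 through 0x103F.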

# backup slides

## arrays and cache misses (1)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2) {
    even_sum += array[i + 0];
    odd_sum += array[i + 1];
}
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 2KB direct-mapped cache with 16B cache blocks?

## arrays and cache misses (2)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 2KB direct-mapped cache with 16B cache blocks?

## arrays and cache misses (2b)

```
int array[1024]; // 4KB array
int even_sum = 0, odd_sum = 0;
for (int i = 0; i < 1024; i += 2)
    even_sum += array[i + 0];
for (int i = 0; i < 1024; i += 2)
    odd_sum += array[i + 1];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many data cache misses on initially empty 4KB direct-mapped cache with 16B cache blocks?
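These three questions can be checked by simulating only the tags of a direct-mapped cache. The sketch below makes assumptions the slides do not state: `array` starts at address 0 (so it is aligned to the cache size), `int` is 4 bytes, and only accesses to `array` reach the cache. All type and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK_BYTES = 16, MAX_SETS = 256 };

typedef struct {
    unsigned num_sets;
    bool valid[MAX_SETS];
    uint32_t tag[MAX_SETS];
    unsigned misses;
} dm_cache;

void dm_init(dm_cache *c, unsigned cache_bytes) {
    memset(c, 0, sizeof *c);                 /* initially empty cache */
    c->num_sets = cache_bytes / BLOCK_BYTES;
    assert(c->num_sets > 0 && c->num_sets <= MAX_SETS);
}

/* Simulates one access; counts a miss when the set holds a different block. */
void dm_access(dm_cache *c, uint32_t addr) {
    uint32_t block = addr / BLOCK_BYTES;
    unsigned set = block % c->num_sets;
    uint32_t tag = block / c->num_sets;
    if (!c->valid[set] || c->tag[set] != tag) {
        c->valid[set] = true;
        c->tag[set] = tag;
        c->misses++;
    }
}

/* Version (1): one loop touching array[i] and array[i + 1]. */
unsigned misses_one_loop(unsigned cache_bytes) {
    dm_cache c;
    dm_init(&c, cache_bytes);
    for (int i = 0; i < 1024; i += 2) {
        dm_access(&c, 4u * (unsigned)(i + 0));
        dm_access(&c, 4u * (unsigned)(i + 1));
    }
    return c.misses;
}

/* Versions (2) and (2b): even elements first, then odd elements. */
unsigned misses_two_loops(unsigned cache_bytes) {
    dm_cache c;
    dm_init(&c, cache_bytes);
    for (int i = 0; i < 1024; i += 2)
        dm_access(&c, 4u * (unsigned)(i + 0));
    for (int i = 0; i < 1024; i += 2)
        dm_access(&c, 4u * (unsigned)(i + 1));
    return c.misses;
}
```

Under those assumptions, `misses_one_loop(2048)` is 256 (one miss per 16B block), `misses_two_loops(2048)` is 512 (the second pass finds every block evicted), and `misses_two_loops(4096)` is 256 (the whole array fits, so the second pass always hits).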

### inclusive versus exclusive

#### L2 inclusive of L1

everything in L1 cache duplicated in L2

adding to L1 also adds to L2

#### L2 exclusive of L1

L2 contains different data than L1

adding to L1 must remove from L2

probably evicting from L1 adds to L2

exclusive policy:

- avoid duplicated data
- sometimes called *victim cache* (contains cache eviction victims)
- makes less sense with multicore

## Tag-Index-Offset formulas (direct-mapped)

(formulas derivable from prior slides)

- $S = 2^s$: number of sets
- $s$: (set) index bits
- $B = 2^b$: block size
- $b$: (block) offset bits
- $m$: memory address bits
- $t = m - (s + b)$: tag bits
- $C = B \times S$: cache size (if direct-mapped)

### cache organization and miss rate

depends on program; one example:

SPEC CPU2000 benchmarks, 64B block size

LRU replacement policies

data cache miss rates:

| Cache size | direct-mapped | 2-way  | 8-way   | fully assoc. |
|------------|---------------|--------|---------|--------------|
| 1KB        | 8.63%         | 6.97%  | 5.63%   | 5.34%        |
| 2KB        | 5.71%         | 4.23%  | 3.30%   | 3.05%        |
| 4KB        | 3.70%         | 2.60%  | 2.03%   | 1.90%        |
| 16KB       | 1.59%         | 0.86%  | 0.56%   | 0.50%        |
| 64KB       | 0.66%         | 0.37%  | 0.10%   | 0.001%       |
| 128KB      | 0.27%         | 0.001% | 0.0006% | 0.0006%      |

## exercise (1)

initial cache: 64-byte blocks, 64 sets, 8 ways/set

If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)

- A. quadrupling the block size (256-byte blocks, 64 sets, 8 ways/set)
- B. quadrupling the number of sets
- C. quadrupling the number of ways/set

## exercise (2)

initial cache: 64-byte blocks, 8 ways/set, 64KB cache

If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)

- A. quadrupling the block size (256-byte block, 8 ways/set, 64KB cache)
- B. quadrupling the number of ways/set
- C. quadrupling the cache size

## exercise (3)

initial cache: 64-byte blocks, 8 ways/set, 64KB cache

If we leave the other parameters listed above unchanged, which will probably reduce the number of conflict misses in a typical program? (Multiple may be correct.)

- A. quadrupling the block size (256-byte block, 8 ways/set, 64KB cache)
- B. quadrupling the number of ways/set
- C. quadrupling the cache size

### prefetching

seems like we can't really improve cold misses...

have to have a miss to bring value into the cache?

solution: don't require miss: 'prefetch' the value before it's accessed

remaining problem: how do we know what to fetch?

### common access patterns

suppose recently accessed 16B cache blocks are at: 0x48010, 0x48020, 0x48030, 0x48040

guess what's accessed next

common pattern with instruction fetches and array accesses

### prefetching idea

look for sequential accesses

bring in guess at next-to-be-accessed value

if right: no cache miss (even if never accessed before)

if wrong: possibly evicted something else — could cause more misses

fortunately, sequential access guesses almost always right
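To make the idea concrete, here is a toy model of a sequential prefetcher. It is a sketch under strong assumptions that are not from the slides: the cache never evicts anything (so only cold misses show up), and the prefetcher fetches the next block whenever the current access is to the block right after the previous one. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK_BYTES = 16, MAX_BLOCKS = 64 };

typedef struct {
    uint32_t blocks[MAX_BLOCKS];
    size_t count;
} block_set;

static bool contains(const block_set *s, uint32_t block) {
    for (size_t i = 0; i < s->count; i++)
        if (s->blocks[i] == block)
            return true;
    return false;
}

static void add(block_set *s, uint32_t block) {
    if (!contains(s, block)) {
        assert(s->count < MAX_BLOCKS);   /* toy model: no eviction */
        s->blocks[s->count++] = block;
    }
}

/* Counts misses for a trace of byte addresses, optionally prefetching
 * block + 1 after two accesses to consecutive blocks. */
unsigned count_misses(const uint32_t *addrs, size_t n, bool prefetch) {
    block_set cached = { .count = 0 };
    unsigned misses = 0;
    bool have_prev = false;
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t block = addrs[i] / BLOCK_BYTES;
        if (!contains(&cached, block)) {
            misses++;
            add(&cached, block);
        }
        if (prefetch && have_prev && block == prev + 1)
            add(&cached, block + 1);     /* guess at next-to-be-accessed block */
        prev = block;
        have_prev = true;
    }
    return misses;
}
```

For the trace 0x48010, 0x48020, 0x48030, 0x48040, 0x48050, this model has 5 misses without prefetching and 2 with it: the first two accesses establish the pattern and the rest are already in the cache.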

### quiz exercise solution

| cache block        | set index |
|--------------------|-----------|
| array[0], array[1] | 0         |
| array[2], array[3] | 1         |
| array[4], array[5] | 0         |
| array[6], array[7] | 1         |

| memory access        | set 0 afterwards     | set 1 afterwards     |
|----------------------|----------------------|----------------------|
| (initial state)      | (empty)              | (empty)              |
| read array[0] (miss) | {array[0], array[1]} | (empty)              |
| read array[3] (miss) | {array[0], array[1]} | {array[2], array[3]} |
| read array[6] (miss) | {array[0], array[1]} | {array[6], array[7]} |
| read array[1] (hit)  | {array[0], array[1]} | {array[6], array[7]} |
| read array[4] (miss) | {array[4], array[5]} | {array[6], array[7]} |
| read array[7] (hit)  | {array[4], array[5]} | {array[6], array[7]} |
| read array[2] (miss) | {array[4], array[5]} | {array[2], array[3]} |
| read array[5] (hit)  | {array[4], array[5]} | {array[2], array[3]} |

#### not the quiz problem

each pair array[0] and array[1], array[2] and array[3], array[4] and array[5], array[6] and array[7], ... is one cache block

if 1-set 2-way cache instead of 2-set 1-way cache:

| memory access        | single set with 2-ways, LRU first          |
|----------------------|--------------------------------------------|
| (initial state)      | (empty), (empty)                           |
| read array[0] (miss) | (empty), {array[0], array[1]}              |
| read array[3] (miss) | {array[0], array[1]}, {array[2], array[3]} |
| read array[6] (miss) | {array[2], array[3]}, {array[6], array[7]} |
| read array[1] (miss) | {array[6], array[7]}, {array[0], array[1]} |
| read array[4] (miss) | {array[0], array[1]}, {array[4], array[5]} |
| read array[7] (miss) | {array[4], array[5]}, {array[6], array[7]} |
| read array[2] (miss) | {array[6], array[7]}, {array[2], array[3]} |
| read array[5] (miss) | {array[2], array[3]}, {array[4], array[5]} |
| read array[8] (miss) | {array[4], array[5]}, {array[8], array[9]} |
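Traces like these are easy to get wrong by hand, so a small simulator can help. The sketch below counts misses for a set-associative cache with LRU replacement. It assumes the array starts at a block boundary and takes the trace as array indices; `count_misses` and its parameters are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_SETS = 8, MAX_WAYS = 4 };

/* Counts misses for accesses to array[indices[i]] in a cache with num_sets
 * sets and `ways` blocks per set (ways = 1 is direct-mapped), LRU replacement. */
unsigned count_misses(const int *indices, size_t n, unsigned elems_per_block,
                      unsigned num_sets, unsigned ways) {
    assert(elems_per_block > 0);
    assert(num_sets > 0 && num_sets <= MAX_SETS);
    assert(ways > 0 && ways <= MAX_WAYS);

    /* blocks[s][0] is least recently used; the last used slot is most recent */
    unsigned blocks[MAX_SETS][MAX_WAYS];
    unsigned used[MAX_SETS] = {0};
    unsigned misses = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned block = (unsigned)indices[i] / elems_per_block;
        unsigned s = block % num_sets;
        unsigned pos = 0;
        while (pos < used[s] && blocks[s][pos] != block)
            pos++;
        if (pos == used[s]) {            /* miss */
            misses++;
            if (used[s] == ways)
                pos = 0;                 /* evict the LRU block in slot 0 */
            else
                used[s]++;               /* fill an empty way */
        }
        /* move the accessed block to the most-recently-used end */
        for (unsigned j = pos; j + 1 < used[s]; j++)
            blocks[s][j] = blocks[s][j + 1];
        blocks[s][used[s] - 1] = block;
    }
    return misses;
}
```

For the accesses array[0], [3], [6], [1], [4], [7], [2], [5] with two elements per block, the 2-set 1-way cache misses 5 times while the 1-set 2-way cache misses all 8 times, matching the two tables above.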

## C and cache misses (4)

```
typedef struct {
    int a_value, b_value;
    int other_values[6];
} item;
item items[5];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 5; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 5; ++i)
    b_sum += items[i].b_value;
```

Assume everything but items is kept in registers (and the compiler does not do anything funny).

## C and cache misses (4, rewrite)

```
int array[40];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 40; i += 8)
    a_sum += array[i];
for (int i = 1; i < 40; i += 8)
    b_sum += array[i];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny) and array starts at beginning of cache block.

How many *data cache misses* on a 2-way set associative 128B cache with 16B cache blocks and LRU replacement?

## C and cache misses (4, solution pt 1)

```
ints 4 byte → array[0 to 3] and array[16 to 19] in same cache set
     64B = 16 ints stored per way
     4 sets total
accessing 0, 8, 16, 24, 32, 1, 9, 17, 25, 33
0 (set 0), 8 (set 2), 16 (set 0), 24 (set 2), 32 (set 0)
1 (set 0), 9 (set 2), 17 (set 0), 25 (set 2), 33 (set 0)
```

## C and cache misses (4, solution pt 2)

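| access          | set 0 after (LRU first)            | result |
|-----------------|------------------------------------|--------|
| (initial state) | (empty), (empty)                   |        |
| array[0]        | (empty), array[0 to 3]             | miss   |
| array[16]       | array[0 to 3], array[16 to 19]     | miss   |
| array[32]       | array[16 to 19], array[32 to 35]   | miss   |
| array[1]        | array[32 to 35], array[0 to 3]     | miss   |
| array[17]       | array[0 to 3], array[16 to 19]     | miss   |
| array[33]       | array[16 to 19], array[32 to 35]   | miss   |

6 misses for set 0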

## C and cache misses (4, solution pt 3)

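| access          | set 2 after (LRU first)          | result |
|-----------------|----------------------------------|--------|
| (initial state) | (empty), (empty)                 |        |
| array[8]        | (empty), array[8 to 11]          | miss   |
| array[24]       | array[8 to 11], array[24 to 27]  | miss   |
| array[9]        | array[24 to 27], array[8 to 11]  | hit    |
| array[25]       | array[8 to 11], array[24 to 27]  | hit    |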

## C and cache misses (3)

```
typedef struct {
    int a_value, b_value;
    int other_values[10];
} item;
item items[5];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 5; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 5; ++i)
    b_sum += items[i].b_value;
```

observation: 12 ints in struct: only first two used

equivalent to accessing array[0], array[12], array[24], etc.

...then accessing array[1], array[13], array[25], etc.

## C and cache misses (3, rewritten?)

```
int array[60];
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 60; i += 12)
    a_sum += array[i];
for (int i = 1; i < 60; i += 12)
    b_sum += array[i];
```

Assume everything but array is kept in registers (and the compiler does not do anything funny) and array starts at beginning of cache block.

How many *data cache misses* on a 128B two-way set associative cache with 16B cache blocks and LRU replacement?

observation 1: first loop has 5 misses: first accesses to blocks

observation 2: array[0] and array[1], array[12] and array[13], etc. in same cache block

## C and cache misses (3, solution)

ints 4 byte $\rightarrow$ array[0 to 3] and array[16 to 19] in same cache set

64B = 16 ints stored per way

4 sets total

accessing array indices 0, 12, 24, 36, 48, 1, 13, 25, 37, 49

0 (set 0, array[0 to 3]), 12 (set 3), 24 (set 2), 36 (set 1), 48 (set 0)

each set used at most twice

no replacement needed

so accesses to 1, 13, 25, 37, 49 all hit: set 0 contains block with array[0 to 3], set 3 contains block with array[12 to 15], etc.
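The "no replacement needed" argument can be checked mechanically: count how many distinct blocks the trace sends to each set and compare the largest count with the number of ways. The sketch below does that. The function name and the fixed-size bounds are assumptions for illustration, and it assumes the array starts at a block boundary.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_SETS = 16, MAX_TRACKED = 16 };

/* Largest number of distinct blocks that the trace maps to any single set.
 * If this is at most the number of ways, nothing is ever replaced, so only
 * the first access to each block misses. */
unsigned max_blocks_in_one_set(const int *indices, size_t n,
                               unsigned elems_per_block, unsigned num_sets) {
    assert(elems_per_block > 0 && num_sets > 0 && num_sets <= MAX_SETS);
    unsigned seen[MAX_SETS][MAX_TRACKED];
    unsigned count[MAX_SETS] = {0};
    unsigned worst = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned block = (unsigned)indices[i] / elems_per_block;
        unsigned s = block % num_sets;
        unsigned j = 0;
        while (j < count[s] && seen[s][j] != block)
            j++;
        if (j == count[s]) {             /* first time this block is seen */
            assert(count[s] < MAX_TRACKED);
            seen[s][count[s]++] = block;
        }
        if (count[s] > worst)
            worst = count[s];
    }
    return worst;
}
```

For indices 0, 12, 24, 36, 48, 1, 13, 25, 37, 49 with 4 ints per block and 4 sets, the result is 2, which fits in a 2-way cache, so only the 5 first accesses miss. For the stride-8 version (0, 8, 16, 24, 32, ...) the result is 3, so replacement does happen there.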

## C and cache misses (3)

```
typedef struct {
    int a_value, b_value;
    int boring_values[126];
} item;
item items[8]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 8; ++i)
    a_sum += items[i].a_value;
for (int i = 0; i < 8; ++i)
    b_sum += items[i].b_value;
```

Assume everything but items is kept in registers (and the compiler does not do anything funny).

How many data cache misses on a 2KB direct-mapped cache with 16B cache blocks?

## C and cache misses (3, rewritten?)

```
int array[1024]; // 4 KB array
int a_sum = 0, b_sum = 0;
for (int i = 0; i < 1024; i += 128)
    a_sum += array[i];
for (int i = 1; i < 1024; i += 128)
    b_sum += array[i];
```


2KB direct-mapped cache with 16B blocks —

set 0: address 0 to 15, (0 to 15) + 2KB, (0 to 15) + 4KB, ...

```
set 1: address 16 to 31, (16 to 31) + 2KB, (16 to 31) + 4KB, ...
```

...

set 127: address 2032 to 2047, (2032 to 2047) + 2KB, ...

2KB direct-mapped cache with 16B blocks —

set 0: address 0 to 15, (0 to 15) + 2KB, (0 to 15) + 4KB, ...

```
set 1: address 16 to 31, (16 to 31) + 2KB, (16 to 31) + 4KB, ...
```

...

set 127: address 2032 to 2047, (2032 to 2047) + 2KB, ...

2KB direct-mapped cache with 16B blocks —

```
set 0: address 0 to 15, (0 to 15) + 2KB, (0 to 15) + 4KB, ... block at 0: array[0] through array[3]
```

```
set 1: address 16 to 31, (16 to 31) + 2KB, (16 to 31) + 4KB, ... block at 16: array[4] through array[7]
```

...

```
set 127: address 2032 to 2047, (2032 to 2047) + 2KB, ... block at 2032: array[508] through array[511]
```

2KB direct-mapped cache with 16B blocks:

```
set 0: address 0 to 15, (0 to 15) + 2KB, (0 to 15) + 4KB, ...
    block at 0: array[0] through array[3]
    block at 0+2KB: array[512] through array[515]
set 1: address 16 to 31, (16 to 31) + 2KB, (16 to 31) + 4KB, ...
    block at 16: array[4] through array[7]
    block at 16+2KB: array[516] through array[519]
...
set 127: address 2032 to 2047, (2032 to 2047) + 2KB, ...
    block at 2032: array[508] through array[511]
    block at 2032+2KB: array[1020] through array[1023]
```
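The set layout above is enough to replay the two `items` loops by hand, or with a tiny simulator. The sketch below assumes `items` starts at address 0, 4-byte ints, and that only the `a_value` and `b_value` loads touch the cache; `dm_cache`, `dm_access` and `count_item_misses` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 2KB direct-mapped cache with 16B blocks: 128 sets, one block per set. */
enum { CACHE_BYTES = 2048, BLOCK_BYTES = 16, NUM_SETS = CACHE_BYTES / BLOCK_BYTES };

typedef struct {
    int valid[NUM_SETS];
    uint64_t tag[NUM_SETS];
    unsigned misses;
} dm_cache;

/* Look up one address; on a miss, replace whatever the set held. */
static void dm_access(dm_cache *cache, uint64_t address) {
    assert(cache != NULL);
    uint64_t block = address / BLOCK_BYTES;
    unsigned set = (unsigned)(block % NUM_SETS);
    uint64_t tag = block / NUM_SETS;
    if (!cache->valid[set] || cache->tag[set] != tag) {
        cache->misses += 1;
        cache->valid[set] = 1;
        cache->tag[set] = tag;
    }
}

/* Replay both loops. ASSUMPTION: items[] starts at address 0, so
   items[i].a_value is at i*512 and items[i].b_value is at i*512 + 4. */
static unsigned count_item_misses(void) {
    const uint64_t item_bytes = 128 * 4; /* 2 + 126 ints of 4 bytes each */
    dm_cache cache;
    memset(&cache, 0, sizeof cache);
    for (int i = 0; i < 8; ++i)
        dm_access(&cache, i * item_bytes + 0);
    for (int i = 0; i < 8; ++i)
        dm_access(&cache, i * item_bytes + 4);
    return cache.misses;
}
```

With these assumptions `count_item_misses()` evaluates to 16: `items[i]` and `items[i+4]` are exactly 2KB apart, so they share a set and evict each other before the second loop can reuse the block.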

2KB 2-way set associative cache with 16B blocks: block addresses

- set 0: address 0, 0 + 2KB, 0 + 4KB, ...
  - block at 0: array[0] through array[3]
  - block at 0+1KB: array[256] through array[259]
  - block at 0+2KB: array[512] through array[515]
  - ...
- set 1: address 16, 16 + 2KB, 16 + 4KB, ...
  - address 16: array[4] through array[7]
- ...
- set 63: address 1008, 2032 + 2KB, 2032 + 4KB, ...
  - address 1008: array[252] through array[255]
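The difference between the direct-mapped and 2-way layouts comes down to how many sets the same 2KB is divided into. A minimal sketch of the mapping, assuming byte addresses measured from the start of the array; `set_index` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Set that a byte address maps to, for a cache of cache_bytes total
   with `ways` blocks per set and block_bytes per block. */
static uint64_t set_index(uint64_t address, uint64_t cache_bytes,
                          uint64_t ways, uint64_t block_bytes) {
    assert(ways > 0 && block_bytes > 0);
    assert(cache_bytes % (ways * block_bytes) == 0);
    uint64_t sets = cache_bytes / (ways * block_bytes); /* 128 direct-mapped, 64 for 2-way */
    return (address / block_bytes) % sets;
}
```

For the 2-way cache, `set_index(1024, 2048, 2, 16)` is 0, so the block at 0+1KB shares set 0 with the block at 0; in the direct-mapped cache the same address gives `set_index(1024, 2048, 1, 16)`, which is 64.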

### misses with skipping

```
int array1[512]; int array2[512];
...
for (int i = 0; i < 512; i += 1)
    sum += array1[i] * array2[i];
```

Assume everything but array1, array2 is kept in registers (and the compiler does not do anything funny).

About how many data cache misses on a 2KB direct-mapped cache with 16B cache blocks?

Hint: depends on relative placement of array1, array2

### best/worst case

array1[i] and array2[i] always different sets:

- = distance from array1 to array2 not multiple of # sets $\times$ bytes/set
- 2 misses every 4 i
- blocks of 4 array1[X] values loaded, then used 4 times before loading next block (and same for array2[X])

array1[i] and array2[i] same sets:

- = distance from array1 to array2 is multiple of # sets $\times$ bytes/set
- 2 misses every i
- block of 4 array1[X] values loaded, one value used from it, then block of 4 array2[X] values replaces it, one value used from it, ...
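Both cases can be reproduced by replaying the loop against a small direct-mapped cache model. This is a sketch under stated assumptions: 4-byte ints, a 2KB cache with 16B blocks (128 sets), and array start addresses supplied by the caller; `count_dot_misses` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK_BYTES = 16, NUM_SETS = 128, ELEMENTS = 512 };

/* Misses for: for (i = 0; i < 512; i++) sum += array1[i] * array2[i];
   on an initially empty direct-mapped cache. */
static unsigned count_dot_misses(uint64_t array1_address, uint64_t array2_address) {
    int valid[NUM_SETS];
    uint64_t tags[NUM_SETS];
    unsigned misses = 0;
    assert(array1_address % 4 == 0 && array2_address % 4 == 0);
    memset(valid, 0, sizeof valid);
    for (int i = 0; i < ELEMENTS; ++i) {
        /* each iteration reads array1[i], then array2[i] */
        uint64_t addresses[2] = { array1_address + 4u * i, array2_address + 4u * i };
        for (int a = 0; a < 2; ++a) {
            uint64_t block = addresses[a] / BLOCK_BYTES;
            unsigned set = (unsigned)(block % NUM_SETS);
            uint64_t tag = block / NUM_SETS;
            if (!valid[set] || tags[set] != tag) {
                ++misses;
                valid[set] = 1;
                tags[set] = tag;
            }
        }
    }
    return misses;
}
```

With `array2` placed exactly 2KB after `array1`, `count_dot_misses(0, 2048)` evaluates to 1024 (2 misses every i); shifting `array2` by one extra block, `count_dot_misses(0, 2048 + 16)`, gives 256 (2 misses every 4 i). Note that two 512-int arrays declared back to back may well end up exactly 2KB apart.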

#### worst case in practice?

two rows of matrix?

often sizeof(row) bytes apart

if the row size is multiple of number of sets  $\times$  bytes per block, oops!

# arrays and cache misses (3)

```
int sum; int array[1024]; // 4KB array
for (int i = 8; i < 1016; i += 1) {
    int local_sum = 0;
    for (int j = i - 8; j < i + 8; j += 1) {
        local_sum += array[j] * (j - i);
    }
    sum += (local_sum - array[i]);
}
```

Assume everything but array is kept in registers (and the compiler does not do anything funny).

How many *data cache misses* on initially empty 2KB direct-mapped cache with 16B cache blocks?

### Tag-Index-Offset exercise

| symbol | meaning |
|--------|---------|
| $m$ | memory address bits (Y86-64: 64) |
| $E$ | number of blocks per set ("ways") |
| $S = 2^s$ | number of sets |
| $s$ | (set) index bits |
| $B = 2^b$ | block size |
| $b$ | (block) offset bits |
| $t = m - (s + b)$ | tag bits |
| $C = B \times S \times E$ | cache size (excluding metadata) |

#### My desktop:

L1 Data Cache: 32 KB, 8 blocks/set, 64 byte blocks

L2 Cache: 256 KB, 4 blocks/set, 64 byte blocks

L3 Cache: 8 MB, 16 blocks/set, 64 byte blocks

Divide the address 0x34567 into tag, index, offset for each cache.

| quantity           | value for L1                                           |
|--------------------|--------------------------------------------------------|
| block size (given) | B=64Byte                                               |
|                    | $B=2^b$ ( $b$ : block offset bits)                     |
| block offset bits  | b = 6                                                  |
| blocks/set (given) | E = 8                                                  |
| cache size (given) | $C = 32KB = E \times B \times S$                       |
|                    | $S = \frac{C}{B \times E} (S: \text{ number of sets})$ |
| number of sets     | $S = \frac{32KB}{64Byte \times 8} = 64$                |
|                    | $S=2^s$ (s: set index bits)                            |
| set index bits     | $s = \log_2(64) = 6$                                   |

### T-I-O results

|                   | L1         | L2   | L3   |
|-------------------|------------|------|------|
| sets              | 64         | 1024 | 8192 |
| block offset bits | 6          | 6    | 6    |
| set index bits    | 6          | 10   | 13   |
| tag bits          | (the rest) |      |      |

```
0x34567 = 0011 0100 0101 0110 0111

bits 0-5 (all offsets): 100111 = 0x27

L1:
    bits 6-11 (L1 set): 01 0101 = 0x15
    bits 12- (L1 tag): 0x34
L2:
    bits 6-15 (set for L2): 01 0001 0101 = 0x115
    bits 16- (L2 tag): 0x3
L3:
    bits 6-18 (set for L3): 0 1101 0001 0101 = 0xD15
    bits 19- (L3 tag): 0x0
```
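The same split can be done with shifts and masks once the number of sets is known. A minimal sketch, assuming power-of-two block sizes and set counts as in the exercise; the `tio` struct and the `log2_exact` and `split_address` helpers are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t tag, index, offset;
} tio;

/* log2 of an exact power of two (block size or number of sets). */
static unsigned log2_exact(uint64_t x) {
    assert(x != 0 && (x & (x - 1)) == 0);
    unsigned bits = 0;
    while (x > 1) {
        x >>= 1;
        ++bits;
    }
    return bits;
}

/* Split an address using S = C / (B * E), b = log2(B), s = log2(S). */
static tio split_address(uint64_t address, uint64_t cache_bytes,
                         uint64_t ways, uint64_t block_bytes) {
    uint64_t sets = cache_bytes / (block_bytes * ways);
    unsigned b = log2_exact(block_bytes);
    unsigned s = log2_exact(sets);
    tio result;
    result.offset = address & (block_bytes - 1); /* low b bits */
    result.index = (address >> b) & (sets - 1);  /* next s bits */
    result.tag = address >> (b + s);             /* everything above */
    return result;
}
```

For example, `split_address(0x34567, 32 * 1024, 8, 64)` describes the L1 case and yields offset 0x27, index 0x15, tag 0x34.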

# cache operation (associative)




# backup slides — cache performance

### cache miss types

common to categorize misses: roughly "cause" of miss, assuming cache block size fixed

- compulsory (or cold): first time accessing something
  - adding more sets or blocks/set wouldn't change
- conflict: sets aren't big/flexible enough
  - a fully-associative (1-set) cache of the same size would have done better
- capacity: cache was not big enough
- coherence: from sync'ing cache with other caches
  - only issue with multiple cores

### making any cache look bad

- 1. access enough blocks, to fill the cache
- 2. access an additional block, replacing something
- 3. access last block replaced
- 4. access last block replaced
- 5. access last block replaced

...

but — typical real programs have locality

### cache optimizations

(assuming typical locality + keeping cache size constant if possible...)

|                        | miss rate | hit time | miss penalty |
|------------------------|-----------|----------|--------------|
| increase cache size    | better    | worse    |              |
| increase associativity | better    | worse    | worse?       |
| increase block size    | depends   | worse    | worse        |
| add secondary cache    |           |          | better       |
| write-allocate         | better    |          |              |
| writeback              |           |          |              |
| LRU replacement        | better    |          | worse?       |
| prefetching            | better    |          |              |

prefetching = guess what program will use, access in advance

average time = hit time + miss rate $\times$ miss penalty

# cache optimizations by miss type

(assuming other listed parameters remain constant)

|                        | capacity     | conflict     | compulsory   |
|------------------------|--------------|--------------|--------------|
| increase cache size    | fewer misses | fewer misses | _            |
| increase associativity | _            | fewer misses | _            |
| increase block size    | more misses? | more misses? | fewer misses |
| LRU replacement        | _            | fewer misses | _            |
| prefetching            |              |              | fewer misses |

### average memory access time

$$\text{AMAT} = \text{hit time} + \text{miss penalty} \times \text{miss rate}$$

or

$$\text{AMAT} = \text{hit time} \times \text{hit rate} + \text{miss time} \times \text{miss rate}$$

effective speed of memory

### AMAT exercise (1)

- 90% cache hit rate
- hit time is 2 cycles
- 30 cycle miss penalty

what is the average memory access time?

- 5 cycles

suppose we could increase the hit rate by increasing the cache's size, but it would increase the hit time to 3 cycles

how much do we have to increase the hit rate for this to not increase AMAT?
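One way to work both parts, writing $m$ for the new miss rate. This sketch assumes the 30 cycle miss penalty stays the same when the cache grows, which the exercise does not state explicitly.

$$\begin{aligned}
\text{AMAT} &= \text{hit time} + \text{miss rate} \times \text{miss penalty} = 2 + 0.10 \times 30 = 5 \text{ cycles} \\
3 + m \times 30 \le 5 \;&\Rightarrow\; m \le \tfrac{2}{30} \approx 6.7\% \;\Rightarrow\; \text{hit rate} \ge 93.3\%
\end{aligned}$$

So under that assumption the hit rate would need to rise by at least about 3.3 percentage points just to break even.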

#### exercise: AMAT and multi-level caches

suppose we have L1 cache with

```
    3 cycle hit time
    90% hit rate
and an L2 cache with
    10 cycle hit time
    80% hit rate (for accesses that make it this far)
    (assume all accesses come via this L1)
and main memory has a 100 cycle access time
assume when there's a cache miss, the next level access starts after the hit time
    e.g. an access that misses in L1 and hits in L2 will take 10+3 cycles
what is the average memory access time for the L1 cache?
```
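A multi-level AMAT can be computed by treating the next level's AMAT as this level's miss penalty. The following is a sketch under the exercise's stated assumption that a miss adds the next level's time on top of the hit time; `amat` and `two_level_amat` are hypothetical helper names.

```c
#include <assert.h>

/* AMAT of one level: hit time + miss rate * (time to service a miss). */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}

/* L1 backed by L2 backed by main memory: the L1 miss penalty is the L2 AMAT. */
static double two_level_amat(double l1_hit_time, double l1_miss_rate,
                             double l2_hit_time, double l2_miss_rate,
                             double memory_time) {
    double l2_amat = amat(l2_hit_time, l2_miss_rate, memory_time);
    return amat(l1_hit_time, l1_miss_rate, l2_amat);
}
```

With the numbers above, `two_level_amat(3, 0.10, 10, 0.20, 100)` works out to 3 + 0.10 × (10 + 0.20 × 100) = 6 cycles, up to floating-point rounding.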

# approximate miss analysis

very tedious to precisely count cache misses

- even more tedious when we take advanced cache optimizations into account

instead, approximations:

- good or bad temporal/spatial locality
  - good temporal locality: value stays in cache
  - good spatial locality: use all parts of cache block
- with nested loops: what does inner loop use?
  - intuition: values used in inner loop loaded into cache once (that is, once each time the inner loop is run)
  - ...if they can all fit in the cache

# locality exercise (1)

```
/* version 1 */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[i] * C[i * N + j];
/* version 2 */
for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
        A[i] += B[i] * C[i * N + j];
```

exercise: which has better temporal locality in A? in B? in C? how about spatial locality?

# exercise: miss estimating (1)

```
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[j] * C[i * N + j];
```

Assume: 4 array elements per block, N very large, nothing in cache at beginning.

Example: N/4 estimated misses for A accesses:

A[i] should always be a hit on all but the first iteration of the innermost loop. On that first iteration, A[i] should be a hit about 3/4 of the time (it is in the same block as A[i-1] that often).

Exercise: estimate # of misses for B, C

### a note on matrix storage

```
A: N × N matrix
represent as array
makes dynamic sizes easier:
float A_2d_array[N][N];
float *A_flat = malloc(N * N * sizeof(float));
A_flat[i * N + j] === A_2d_array[i][j]
```

### convention re: rows/columns

going to call the first index rows

 $A_{i,j}$  is A row i, column j

rows are stored together

this is an arbitrary choice

```
array[0*5 + 0] array[0*5 + 1] array[0*5 + 2] array[0*5 + 3] array[0*5 + 4]
array[1*5 + 0] array[1*5 + 1] array[1*5 + 2] array[1*5 + 3] array[1*5 + 4]
array[2*5 + 0] array[2*5 + 1] array[2*5 + 2] array[2*5 + 3] array[2*5 + 4]
array[3*5 + 0] array[3*5 + 1] array[3*5 + 2] array[3*5 + 3] array[3*5 + 4]
array[4*5 + 0] array[4*5 + 1] array[4*5 + 2] array[4*5 + 3] array[4*5 + 4]
```

if array starts on cache block: first cache block = first elements all together in one row!

second cache block:

- 1 from row 0
- 3 from row 1

generally: cache blocks contain data from 1 or 2 rows

→ better performance from reusing rows
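The "1 from row 0, 3 from row 1" split follows directly from the row-major index arithmetic. A minimal sketch, assuming 4 elements per cache block and an array that starts on a block boundary; `block_elements_in_row` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: 4 array elements fit in one cache block. */
enum { ELEMENTS_PER_BLOCK = 4 };

/* How many elements of cache block `block` belong to matrix row `row`,
   for a row-major matrix with `columns` elements per row. */
static size_t block_elements_in_row(size_t block, size_t columns, size_t row) {
    assert(columns > 0);
    size_t count = 0;
    size_t first = block * ELEMENTS_PER_BLOCK;
    for (size_t element = first; element < first + ELEMENTS_PER_BLOCK; ++element) {
        if (element / columns == row) /* row of flat index = index / columns */
            ++count;
    }
    return count;
}
```

For the 5-column example, `block_elements_in_row(1, 5, 0)` is 1 and `block_elements_in_row(1, 5, 1)` is 3, matching the second cache block described above.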


### matrix multiply

$$C_{ij} = \sum_{k=1}^{n} A_{ik} \times B_{kj}$$
```
/* version 1: inner loop is k, middle is j */
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
      C[i*N+j] += A[i * N + k] * B[k * N + j];

/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      C[i*N+j] += A[i * N + k] * B[k * N + j];
```

# loop orders and locality

loop body:  $C_{ij} += A_{ik}B_{kj}$ 

kij order:  $C_{ij}$ ,  $B_{kj}$  have spatial locality

kij order:  $A_{ik}$  has temporal locality

... better than ...

ijk order:  $A_{ik}$  has spatial locality

ijk order:  $C_{ij}$  has temporal locality
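Both loop orders perform the same additions into each $C_{ij}$, in the same k order, so only the memory access pattern differs. A small sketch that makes this checkable; the function names are hypothetical, and `C` is assumed to be zero-initialized by the caller.

```c
#include <assert.h>
#include <string.h>

/* ijk order: inner loop walks a row of A and a column of B. */
static void matmul_ijk(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* kij order: inner loop walks a row of B and a row of C. */
static void matmul_kij(int n, const double *A, const double *B, double *C) {
    assert(n > 0);
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Returns 1 if both orders give identical results on a small test matrix. */
static int loop_orders_agree(void) {
    enum { N = 4 };
    double A[N * N], B[N * N], C1[N * N], C2[N * N];
    for (int x = 0; x < N * N; ++x) {
        A[x] = x + 1;     /* arbitrary test data */
        B[x] = 2 * x - 3;
        C1[x] = 0.0;
        C2[x] = 0.0;
    }
    matmul_ijk(N, A, B, C1);
    matmul_kij(N, A, B, C2);
    return memcmp(C1, C2, sizeof C1) == 0;
}
```

Because the results match, any timing difference between the two functions on large `n` can be attributed to locality rather than to the arithmetic.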


#### which is better?

$$C_{ij} = \sum_{k=1}^{n} A_{ik} \times B_{kj}$$

```
/* version 1: inner loop is k, middle is j */
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
      C[i*N+j] += A[i * N + k] * B[k * N + j];
/* version 2: outer loop is k, middle is i */
for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      C[i*N+j] += A[i * N + k] * B[k * N + j];
```

exercise: Which version has better spatial/temporal locality for...







for all i: for all j: for all k: $C_{ij} += A_{ik} \times B_{kj}$

- if N large:
  - using $C_{ij}$ many times per load into cache
  - using $A_{ik}$ once per load into cache (but using $A_{i,k+1}$ right after)
  - using $B_{kj}$ once per load into cache
- looking only at innermost loop:
  - good spatial locality in A (rows stored together = reuse cache blocks)
  - bad spatial locality in B (use each cache block once)
  - no useful spatial locality in C
- looking only at innermost loop:
  - temporal locality in C
  - bad temporal locality in everything else (everything accessed exactly once)
- looking only at innermost loop:
  - row of A (elements used once)
  - column of B (elements used once)
  - single element of C (used many times)
- looking only at two innermost loops together:
  - some temporal locality in A (column reused)
  - some temporal locality in B (row reused)
  - some temporal locality in C (row reused)

for all k: for all i: for all j: $C_{ij} += A_{ik} \times B_{kj}$

- if N large:
  - using $C_{ij}$ once per load into cache (but using $C_{i,j+1}$ right after)
  - using $B_{kj}$ once per load into cache (but using $B_{k,j+1}$ right after)
  - using $A_{ik}$ many times per load into cache
- looking only at innermost loop:
  - spatial locality in B, C (use most of loaded B, C cache blocks)
  - no useful spatial locality in A (rest of A's cache block wasted)
- looking only at innermost loop:
  - temporal locality in A
  - no temporal locality in B, C (B, C values used exactly once)
- looking only at innermost loop:
  - processing one element of A (use many times)
  - row of B (each element used once)
  - row of C (each element used once)
- looking only at two innermost loops together:
  - good temporal locality in A (column reused)
  - good temporal locality in B (row reused)
  - bad temporal locality in C (nothing reused)


# performance (with A=B)



# alternate view 1: cycles/instruction



## alternate view 2: cycles/operation



## counting misses: version 1

```
for (int i = 0; i < N; ++i)
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < N; ++k)
       C[i * N + j] += A[i * N + k] * B[k * N + j];
if N really large
     assumption: can't get close to storing N values in cache at once
for A: about N ÷ block size misses per k-loop
     total misses: N^3 ÷ block size
for B: about N misses per k-loop
     total misses: N^3
for C: about 1 ÷ block size miss per k-loop
     total misses: N^2 ÷ block size
```

## counting misses: version 2

```
for (int k = 0; k < N; ++k)
  for (int i = 0; i < N; ++i)
     for (int j = 0; j < N; ++j)
       C[i * N + j] += A[i * N + k] * B[k * N + j];
for A: about 1 miss per j-loop
     total misses: N^2
for B: about N ÷ block size misses per j-loop
     total misses: N^3 ÷ block size
for C: about N ÷ block size misses per j-loop
     total misses: N^3 ÷ block size
```
```
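Adding up the per-array estimates for each version gives a rough total to compare. The sketch below only evaluates those approximations, with "block size" measured in array elements and N assumed too large for N values to stay cached; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Version 1 (ijk): A ~ N^3/block, B ~ N^3, C ~ N^2/block. */
static uint64_t estimated_misses_ijk(uint64_t n, uint64_t block_elements) {
    assert(block_elements > 0);
    uint64_t a_misses = n * n * n / block_elements;
    uint64_t b_misses = n * n * n;
    uint64_t c_misses = n * n / block_elements;
    return a_misses + b_misses + c_misses;
}

/* Version 2 (kij): A ~ N^2, B ~ N^3/block, C ~ N^3/block. */
static uint64_t estimated_misses_kij(uint64_t n, uint64_t block_elements) {
    assert(block_elements > 0);
    uint64_t a_misses = n * n;
    uint64_t b_misses = n * n * n / block_elements;
    uint64_t c_misses = n * n * n / block_elements;
    return a_misses + b_misses + c_misses;
}
```

For N = 1024 and 4 elements per block the estimates are about 1.34 billion misses for ijk and about 0.54 billion for kij; the $N^3$ term for B dominates version 1.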

#### exercise: miss estimating (2)

assuming: 4 elements per block

assuming: cache not close to big enough to hold 1K elements

estimate: approximately how many misses for A, B?

## L1 misses (with A=B)



## L1 miss detail (1)



# L1 miss detail (2)



#### addresses

```
B[k*114+j]     is at 10 0000 0000 0100
B[k*114+j+1]   is at 10 0000 0000 1000
B[(k+2)*114+j] is at 10 0011 1001 0100
B[(k+3)*114+j] is at 10 0101 0101 1100
...
B[(k+9)*114+j] is at 11 0000 0000 1100
```

test system L1 cache: 6 index bits, 6 block offset bits

#### conflict misses

```
powers of two: lower order bits unchanged
B[k*93+j] and B[(k+11)*93+j]:
    1023 elements apart (4092 bytes; 63.9 cache blocks)
64 sets in L1 cache: usually maps to same set
B[k*93+(j+1)] will not be cached (next i loop)
    even if in same block as B[k*93+j]
```

how to fix? improve spatial locality (maybe even if it requires copying)
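The "usually maps to same set" claim can be checked by computing set indexes directly. A sketch assuming 4-byte elements, B starting at a block-aligned address 0, and the test system's 6 offset bits and 6 index bits; `l1_set` and `same_set_count` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Test system L1: 6 block offset bits, 6 set index bits.
   ASSUMPTION: 4-byte matrix elements. */
enum { OFFSET_BITS = 6, INDEX_BITS = 6, ELEMENT_BYTES = 4 };

/* Set index = bits 6-11 of the address. */
static unsigned l1_set(uint64_t address) {
    return (unsigned)((address >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1));
}

/* Among j = 0 .. j_count-1, count how often B[j] and
   B[rows_apart*row_stride + j] fall in the same set (B at address 0). */
static unsigned same_set_count(unsigned row_stride, unsigned rows_apart,
                               unsigned j_count) {
    assert(row_stride > 0);
    unsigned same = 0;
    for (unsigned j = 0; j < j_count; ++j) {
        uint64_t first = (uint64_t)j * ELEMENT_BYTES;
        uint64_t second = ((uint64_t)rows_apart * row_stride + j) * ELEMENT_BYTES;
        if (l1_set(first) == l1_set(second))
            ++same;
    }
    return same;
}
```

With N = 93 and rows 11 apart, `same_set_count(93, 11, 16)` is 15: of the 16 elements in B's first block, all but one collide with the element 11 rows below.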

# locality exercise (2)

```
/* version 2 */
for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i] += B[i] * C[i * N + j];
/* version 3 */
for (int ii = 0; ii < N; ii += 32)
    for (int jj = 0; jj < N; jj += 32)
        for (int i = ii; i < ii + 32; ++i)
            for (int j = jj; j < jj + 32; ++j)
                A[i] += B[i] * C[i * N + j];
```

exercise: which has better temporal locality in A? in B? in C? how about spatial locality?

#### a transformation

```
for (int k = 0; k < N; k += 1)
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      C[i*N+j] += A[i*N+k] * B[k*N+j];

for (int kk = 0; kk < N; kk += 2)
  for (int k = kk; k < kk + 2; ++k)
    for (int i = 0; i < N; ++i)
      for (int j = 0; j < N; ++j)
        C[i*N+j] += A[i*N+k] * B[k*N+j];

split the loop over k: should be exactly the same
    (assuming even N)
```


#### simple blocking

```
for (int kk = 0; kk < N; kk += 2)
  /* was here: for (int k = kk; k < kk + 2; ++k) */
    for (int i = 0; i < N; ++i)
      for (int j = 0; j < N; ++j)
        /* load Aik, Aik+1 into cache and process: */
        for (int k = kk; k < kk + 2; ++k)
          C[i*N+j] += A[i*N+k] * B[k*N+j];
now reorder split loop: same calculations
now handle B_{ij} for k+1 right after B_{ij} for k
(previously: B_{i,j+1} for k right after B_{ij} for k)
```


## simple blocking - expanded

```
for (int kk = 0; kk < N; kk += 2) {
  for (int i = 0; i < N; ++i) {
    for (int j = 0; j < N; ++j) {
      /* process a "block" of 2 k values: */
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
    }
  }
}
```

Temporal locality in  $C_{ij}$ s

More spatial locality in  $A_{ik}$ 

Still have good spatial locality in  $B_{kj}$ ,  $C_{ij}$ 
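Blocking over k only reorders work, so the blocked loop should give the same matrix as the plain kij loop. A minimal sketch that checks this on a small case, assuming even N and a zero-initialized `C`; the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Plain kij order. */
static void matmul_kij(int n, const double *A, const double *B, double *C) {
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Same computation, processing a "block" of 2 k values per (i, j). */
static void matmul_blocked_k2(int n, const double *A, const double *B, double *C) {
    assert(n % 2 == 0); /* the simple version assumes even N */
    for (int kk = 0; kk < n; kk += 2)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                C[i * n + j] += A[i * n + kk + 0] * B[(kk + 0) * n + j];
                C[i * n + j] += A[i * n + kk + 1] * B[(kk + 1) * n + j];
            }
}

/* Returns 1 if the blocked and unblocked loops agree on a small test matrix. */
static int blocking_matches_kij(void) {
    enum { N = 4 };
    double A[N * N], B[N * N], C1[N * N], C2[N * N];
    for (int x = 0; x < N * N; ++x) {
        A[x] = x + 1;     /* arbitrary test data */
        B[x] = 2 * x - 3;
        C1[x] = 0.0;
        C2[x] = 0.0;
    }
    matmul_kij(N, A, B, C1);
    matmul_blocked_k2(N, A, B, C2);
    return memcmp(C1, C2, sizeof C1) == 0;
}
```

Each $C_{ij}$ still receives its products in increasing k order, which is why the comparison can be exact rather than approximate.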

```
for (int kk = 0; kk < N; kk += 2)
  for (int i = 0; i < N; i += 1)
    for (int j = 0; j < N; ++j) {
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
    }
```

access pattern for A:

```
A[0*N+0], A[0*N+1], A[0*N+0], A[0*N+1] ...(repeats N times)
A[1*N+0], A[1*N+1], A[1*N+0], A[1*N+1] ...(repeats N times)
...
A[(N-1)*N+0], A[(N-1)*N+1], A[(N-1)*N+0], A[(N-1)*N+1] ...
A[0*N+2], A[0*N+3], A[0*N+2], A[0*N+3] ...
...
```

likely cache misses: only first iterations of j loop

how many cache misses per iteration? usually one: A[0\*N+0] and A[0\*N+1] usually in same cache block

about $\frac{N}{2} \cdot N$ misses total

```
for (int kk = 0; kk < N; kk += 2)
  for (int i = 0; i < N; i += 1)
    for (int j = 0; j < N; ++j) {
      C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
      C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
    }
```

access pattern for B:

```
B[0*N+0], B[1*N+0], ...B[0*N+(N-1)], B[1*N+(N-1)]
B[2*N+0], B[3*N+0], ...B[2*N+(N-1)], B[3*N+(N-1)]
B[4*N+0], B[5*N+0], ...B[4*N+(N-1)], B[5*N+(N-1)]
...
B[0*N+0], B[1*N+0], ...B[0*N+(N-1)], B[1*N+(N-1)]
...
```

likely cache misses: any access, each time

how many cache misses per iteration? equal to # cache blocks in 2 rows

about $\frac{N}{2} \cdot N \cdot \frac{2N}{\text{block size}} = N^3 \div \text{block size}$ misses

# simple blocking - counting misses

```
for (int kk = 0; kk < N; kk += 2)
  for (int i = 0; i < N; i += 1)
    for (int j = 0; j < N; ++j) {
        C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
        C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
    }
```

$\frac{N}{2} \cdot N$ j-loop executions and (assuming $N$ large):

about 1 miss from A per j-loop: $N^2/2$ total misses (before blocking: $N^2$)

about $2N \div \text{block size}$ misses from B per j-loop: $N^3 \div \text{block size}$ total misses (same as before blocking)

about $N \div \text{block size}$ misses from C per j-loop: $N^3 \div (2 \cdot \text{block size})$ total misses

#### improvement in read misses



# simple blocking (2)

same thing for i in addition to k?

```
for (int kk = 0; kk < N; kk += 2) {
  for (int ii = 0; ii < N; ii += 2) {
    for (int j = 0; j < N; ++j) {
      /* process a "block": */
      for (int k = kk; k < kk + 2; ++k)
        for (int i = ii; i < ii + 2; ++i)
            C[i*N+j] += A[i*N+k] * B[k*N+j];
    }
  }
}
```

## simple blocking - locality

```
for (int k = 0; k < N; k += 2) {
  for (int i = 0; i < N; i += 2) {
    /* load a block around Aik */
    for (int j = 0; j < N; ++j) {
       /* process a "block": */
       C_{i+0,j} += A_{i+0,k+0} * B_{k+0,j}
       C_{i+0,j} += A_{i+0,k+1} * B_{k+1,j}
       C_{i+1,j} += A_{i+1,k+0} * B_{k+0,j}
       C_{i+1,j} += A_{i+1,k+1} * B_{k+1,j}
    }
  }
}
```

now: more temporal locality in B

previously: access $B_{kj}$, then don't use it again for a long time

## simple blocking - counting misses for A

```
for (int k = 0; k < N; k += 2)
  for (int i = 0; i < N; i += 2)
    for (int j = 0; j < N; ++j) {
      C_{i+0,j} += A_{i+0,k+0} * B_{k+0,j}
      C_{i+0,j} += A_{i+0,k+1} * B_{k+1,j}
      C_{i+1,j} += A_{i+1,k+0} * B_{k+0,j}
      C_{i+1,j} += A_{i+1,k+1} * B_{k+1,j}
    }
```

$\frac{N}{2} \cdot \frac{N}{2}$ iterations of $j$ loop

likely 2 misses per loop with A (2 cache blocks)

total misses: $\frac{N^2}{2}$ (same as only blocking in K)

# simple blocking - counting misses for B

```
for (int k = 0; k < N; k += 2)
  for (int i = 0; i < N; i += 2)
    for (int j = 0; j < N; ++j) {
      C_{i+0,j} += A_{i+0,k+0} * B_{k+0,j}
      C_{i+0,j} += A_{i+0,k+1} * B_{k+1,j}
      C_{i+1,j} += A_{i+1,k+0} * B_{k+0,j}
      C_{i+1,j} += A_{i+1,k+1} * B_{k+1,j}
    }
```

$\frac{N}{2} \cdot \frac{N}{2}$ iterations of $j$ loop

likely $2 \div \text{block size}$ misses per iteration with B

total misses: $\frac{N^3}{2 \cdot \text{block size}}$ (before: $\frac{N^3}{\text{block size}}$)

# simple blocking - counting misses for C

    for (int k = 0; k < N; k += 2)
      for (int i = 0; i < N; i += 2)
        for (int j = 0; j < N; ++j) {
          C_{i+0,j} += A_{i+0,k+0} * B_{k+0,j}
          C_{i+0,j} += A_{i+0,k+1} * B_{k+1,j}
          C_{i+1,j} += A_{i+1,k+0} * B_{k+0,j}
          C_{i+1,j} += A_{i+1,k+1} * B_{k+1,j}
        }

$\frac{N}{2} \cdot \frac{N}{2}$ iterations of $j$ loop

likely $\frac{2}{\text{block size}}$ misses per iteration with C

total misses: $\frac{N^3}{2 \cdot \text{block size}}$ (same as blocking only in K)

# simple blocking - counting misses (total)

    for (int k = 0; k < N; k += 2)
      for (int i = 0; i < N; i += 2)
        for (int j = 0; j < N; ++j) {
          C_{i+0,j} += A_{i+0,k+0} * B_{k+0,j}
          C_{i+0,j} += A_{i+0,k+1} * B_{k+1,j}
          C_{i+1,j} += A_{i+1,k+0} * B_{k+0,j}
          C_{i+1,j} += A_{i+1,k+1} * B_{k+1,j}
        }

before: A: $\frac{N^2}{2}$; B: $\frac{N^3}{1 \cdot \text{block size}}$; C: $\frac{N^3}{1 \cdot \text{block size}}$

after: A: $\frac{N^2}{2}$; B: $\frac{N^3}{2 \cdot \text{block size}}$; C: $\frac{N^3}{2 \cdot \text{block size}}$

## generalizing: divide and conquer

```
partial_matrixmultiply(float *A, float *B, float *C,
               int startI, int endI, ...) {
  for (int i = startI; i < endI; ++i) {
    for (int j = startJ; j < endJ; ++j) {
      for (int k = startK; k < endK; ++k) {
matrix_multiply(float *A, float *B, float *C, int N) {
  for (int ii = 0; ii < N; ii += BLOCK_I)
    for (int jj = 0; jj < N; jj += BLOCK_J)
      for (int kk = 0; kk < N; kk += BLOCK_K)
         /* do everything for segment of A, B, C
            that fits in cache! */
         partial_matmul(A, B, C,
```
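The slide's code is elided, so here is a minimal complete sketch of the same pattern, assuming row-major N×N `float` matrices. The names `partial_matmul_block`, `blocked_matrix_multiply`, and `min_int`, and the block size of 48, are illustrative assumptions rather than the lecture's own code.

```c
#include <assert.h>

/* Illustrative block sizes (assumptions); pick real values by measuring. */
enum { BLOCK_I = 48, BLOCK_J = 48, BLOCK_K = 48 };

static int min_int(int a, int b) { return a < b ? a : b; }

/* Hypothetical helper: C += A * B restricted to the index ranges
   [startI,endI) x [startJ,endJ) x [startK,endK).
   Matrices are N x N, stored row-major. */
static void partial_matmul_block(const float *A, const float *B, float *C,
                                 int N,
                                 int startI, int endI,
                                 int startJ, int endJ,
                                 int startK, int endK) {
    for (int i = startI; i < endI; ++i)
        for (int j = startJ; j < endJ; ++j)
            for (int k = startK; k < endK; ++k)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
}

/* C += A * B, one "matrix block" at a time.
   min_int() handles N that is not a multiple of the block size. */
void blocked_matrix_multiply(const float *A, const float *B, float *C, int N) {
    assert(N >= 0);
    for (int ii = 0; ii < N; ii += BLOCK_I)
        for (int jj = 0; jj < N; jj += BLOCK_J)
            for (int kk = 0; kk < N; kk += BLOCK_K)
                partial_matmul_block(A, B, C, N,
                                     ii, min_int(ii + BLOCK_I, N),
                                     jj, min_int(jj + BLOCK_J, N),
                                     kk, min_int(kk + BLOCK_K, N));
}
```

The caller is expected to zero `C` first if a plain product (rather than an accumulation) is wanted.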


#### array usage: matrix block

$C_{ij}$ += $A_{ik} \cdot B_{kj}$

inner loops work on "matrix block" of A, B, C rather than rows of some, little blocks of others

blocks fit into cache (b/c we choose I, K, J)

now (versus loop ordering example): some spatial locality in A, B, and C; some temporal locality in A, B, and C

$C_{ij}$ calculation uses strips from A, B: K calculations for one cache miss; good temporal locality!

$A_{ik}$ used with entire strip of B: J calculations for one cache miss; good temporal locality!

(approx.) KIJ fully cached calculations for KI + IJ + KJ values that need to be loaded per "matrix block" (assuming everything stays in cache)

#### cache blocking efficiency

for each of  $N^3/IJK$  matrix blocks:

load $I \times K$ elements of $A_{ik}$: $\approx IK \div \text{block size}$ misses per matrix block; $\approx N^3/(J \cdot \text{blocksize})$ misses total

load  $K \times J$  elements of  $B_{kj}$ :  $\approx N^3/(I \cdot \text{blocksize})$  misses total

load  $I \times J$  elements of  $C_{ij}$ :  $\approx N^3/(K \cdot \text{blocksize})$  misses total

bigger blocks — more work per load!

catch: IK + KJ + IJ elements must fit in cache; otherwise estimates above don't work
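The three totals above can be checked numerically. This is a sketch under the slide's own assumptions (the working set stays in cache, N large); the struct and function names are made up for illustration, and `block_elems` means array elements per cache block.

```c
#include <assert.h>

/* Estimated misses per array for an N x N blocked multiply, using the
   approximation above: (number of matrix blocks) x (cache blocks
   loaded per matrix block). */
struct miss_estimate { double a, b, c; };

struct miss_estimate estimate_blocked_misses(double N, double I, double J,
                                             double K, double block_elems) {
    assert(I > 0 && J > 0 && K > 0 && block_elems > 0);
    double matrix_blocks = (N * N * N) / (I * J * K);
    struct miss_estimate e;
    e.a = matrix_blocks * (I * K) / block_elems; /* = N^3 / (J * block) */
    e.b = matrix_blocks * (K * J) / block_elems; /* = N^3 / (I * block) */
    e.c = matrix_blocks * (I * J) / block_elems; /* = N^3 / (K * block) */
    return e;
}
```

Growing J shrinks only the estimate for A, growing I only the one for B, and growing K only the one for C, which is why all three dimensions get blocked.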

#### cache blocking rule of thumb

fill most of the cache with useful data

and do as much work as possible from that

example: my desktop 32KB L1 cache

$I = J = K = 48$ uses $48^2 \times 3$ elements, or 27KB.

assumption: conflict misses aren't important
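The 27KB figure works out if each element is 4 bytes (a `float`), which is an assumption here since the slide only says "elements". A small sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched by one matrix block: I*K of A, K*J of B, I*J of C. */
size_t working_set_bytes(size_t I, size_t J, size_t K, size_t elem_size) {
    return (I * K + K * J + I * J) * elem_size;
}

/* Largest square block size (I = J = K) whose working set fits in
   cache_bytes. Ignores conflict misses, like the rule of thumb. */
size_t largest_square_block(size_t cache_bytes, size_t elem_size) {
    assert(elem_size > 0);
    size_t b = 0;
    while (working_set_bytes(b + 1, b + 1, b + 1, elem_size) <= cache_bytes)
        ++b;
    return b;
}
```

`working_set_bytes(48, 48, 48, 4)` is 27648 bytes, which is exactly 27KB. For a 32KB cache the largest square block that fits by this count is 52, so 48 leaves a little room for everything else the program touches.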

#### systematic approach

```
for (int k = 0; k < N; ++k) {
  for (int i = 0; i < N; ++i) {
    /* A_{ik} loaded once in this loop: */
    for (int j = 0; j < N; ++j)
      /* C_{ij}, B_{kj} loaded each iteration (if N big): */
      C[i*N+j] += A[i*N+k] * B[k*N+j];
  }
}
```

values from $A_{ik}$ used N times per load

values from $B_{kj}$ used 1 time per load, but good spatial locality, so a cache block of $B_{kj}$ values is used together

values from $C_{ij}$ used 1 time per load, but good spatial locality, so a cache block of $C_{ij}$ values is used together

## exercise: miss estimating (3)

assuming: 4 elements per block

assuming: cache not close to big enough to hold 1K elements, but big enough to hold 500 or so

estimate: approximately how many misses for A, B?

#### loop ordering compromises

loop ordering forces compromises:

```
for k: for i: for j: c[i,j] += a[i,k] * b[j,k]
```

perfect temporal locality in a[i,k]

bad temporal locality for c[i,j], b[j,k]

perfect spatial locality in c[i,j]

bad spatial locality in b[j,k], a[i,k]

cache blocking: work on blocks rather than rows/columns; have some temporal, spatial locality in everything

#### cache blocking pattern

no perfect loop order? work on rectangular matrix blocks

size amount used in inner loops based on cache size

in practice: test performance to determine 'size' of blocks

# backup slides

#### cache organization and miss rate

depends on program; one example:

SPEC CPU2000 benchmarks, 64B block size

LRU replacement policies

data cache miss rates:

| Cache size | direct-mapped | 2-way  | 8-way   | fully assoc. |
|------------|---------------|--------|---------|--------------|
| 1KB        | 8.63%         | 6.97%  | 5.63%   | 5.34%        |
| 2KB        | 5.71%         | 4.23%  | 3.30%   | 3.05%        |
| 4KB        | 3.70%         | 2.60%  | 2.03%   | 1.90%        |
| 16KB       | 1.59%         | 0.86%  | 0.56%   | 0.50%        |
| 64KB       | 0.66%         | 0.37%  | 0.10%   | 0.001%       |
| 128KB      | 0.27%         | 0.001% | 0.0006% | 0.0006%      |

# exercise (1)

initial cache: 64-byte blocks, 64 sets, 8 ways/set

If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)

- A. quadrupling the block size (256-byte blocks, 64 sets, 8 ways/set)
- B. quadrupling the number of sets
- C. quadrupling the number of ways/set

# exercise (2)

initial cache: 64-byte blocks, 8 ways/set, 64KB cache

If we leave the other parameters listed above unchanged, which will probably reduce the number of capacity misses in a typical program? (Multiple may be correct.)

- A. quadrupling the block size (256-byte block, 8 ways/set, 64KB cache)
- B. quadrupling the number of ways/set
- C. quadrupling the cache size

# exercise (3)

initial cache: 64-byte blocks, 8 ways/set, 64KB cache

If we leave the other parameters listed above unchanged, which will probably reduce the number of conflict misses in a typical program? (Multiple may be correct.)

- A. quadrupling the block size (256-byte block, 8 ways/set, 64KB cache)
- B. quadrupling the number of ways/set
- C. quadrupling the cache size

### prefetching

seems like we can't really improve cold misses...

have to have a miss to bring value into the cache?

solution: don't require miss: 'prefetch' the value before it's accessed

remaining problem: how do we know what to fetch?

#### common access patterns

suppose recently accessed 16B cache blocks are at: 0x48010, 0x48020, 0x48030, 0x48040

guess what's accessed next

common pattern with instruction fetches and array accesses

### prefetching idea

look for sequential accesses

bring in guess at next-to-be-accessed value

if right: no cache miss (even if never accessed before)

if wrong: possibly evicted something else — could cause more misses

fortunately, sequential access guesses almost always right
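A toy version of "look for sequential accesses" can be written in a few lines. This is only an illustrative sketch, not how any particular processor's prefetcher works; the function name and interface are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy sequential prefetch predictor. Given the most recently accessed
   block addresses (oldest first), return 1 and store a guess in *next
   if they were consecutive blocks; return 0 (no guess) otherwise. */
int predict_next_block(const uint64_t *recent, size_t count,
                       uint64_t block_size, uint64_t *next) {
    assert(next != NULL && block_size > 0);
    if (count < 2)
        return 0;                 /* not enough history to see a pattern */
    for (size_t i = 1; i < count; ++i)
        if (recent[i] - recent[i - 1] != block_size)
            return 0;             /* not sequential: make no guess */
    *next = recent[count - 1] + block_size;
    return 1;
}
```

With the 16B blocks at 0x48010, 0x48020, 0x48030, 0x48040 from the earlier slide, the guess is 0x48050.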

### array usage: *ijk* order







looking only at two innermost loops together: good spatial locality in A; poor spatial locality in B

### array usage: kij order







$C_{ij} += A_{ik} \times B_{kj}$

looking only at two innermost loops together: poor spatial locality in A; good spatial locality in B; good spatial locality in C

# simple blocking - with 3?

    for (int kk = 0; kk < N; kk += 3)
      for (int i = 0; i < N; i += 1)
        for (int j = 0; j < N; ++j) {
          C[i*N+j] += A[i*N+kk+0] * B[(kk+0)*N+j];
          C[i*N+j] += A[i*N+kk+1] * B[(kk+1)*N+j];
          C[i*N+j] += A[i*N+kk+2] * B[(kk+2)*N+j];
        }

$\frac{N}{3} \cdot N$ j-loop iterations, and (assuming $N$ large):

about 1 miss from A per j-loop iteration: $N^2/3$ total misses (before blocking: $N^2$)

about $3N \div \text{block size}$ misses from B per j-loop iteration: $N^3 \div \text{block size}$ total misses (same as before)

#### more than 3?

can we just keep doing this, increasing from 3 to some large X? ...

assumption: X values from A would stay in cache

X too large: cache not big enough

assumption: X blocks from B would help with spatial locality

X too large: evicted from cache before next iteration
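To experiment with different X, the unrolled lines can be replaced by a short inner loop over k. A minimal sketch, assuming row-major N×N `float` matrices; the function name `matmul_block_k` is made up for illustration.

```c
#include <assert.h>

/* C += A * B with only the k loop blocked, X values of k per block.
   X is a runtime parameter so different sizes can be compared. */
void matmul_block_k(const float *A, const float *B, float *C, int N, int X) {
    assert(X > 0);
    for (int kk = 0; kk < N; kk += X) {
        int k_end = kk + X < N ? kk + X : N;   /* last block may be short */
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                for (int k = kk; k < k_end; ++k)
                    C[i * N + j] += A[i * N + k] * B[k * N + j];
    }
}
```

Every X performs the same multiply-adds, so timing this function for a range of X is one way to see where the two assumptions above stop holding on a given machine.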





```
for each kk:
  for each i:
    for each j:
      for k=kk,kk+1:
        C_{ij} += A_{ik} · B_{kj}
```

within innermost loop: good spatial locality in A; bad locality in B; good temporal locality in C

loop over j: better spatial locality over A than before; still good temporal locality for A

but probably not more misses: cache needs to keep two cache blocks for next iter instead of one (probably has the space left over!)

right now: only really care about keeping 4 cache blocks in j loop

have more than 4 cache blocks? increasing kk increment would use more of them

### keeping values in cache

can't explicitly ensure values are kept in cache

...but reusing values *effectively* does this: cache will try to keep recently used values

cache optimization ideas: choose what's in the cache

- for thinking about it: load values explicitly
- for implementing it: access only values we want loaded















### changing page tables

what happens to TLB when page table base pointer is changed? e.g. context switch

most entries in TLB refer to things from wrong process

oops: read from the wrong process's stack?

option 1: invalidate all TLB entries

- side effect on "change page table base register" instruction

option 2: TLB entries contain process ID

- set by OS (special register)
- checked by TLB in addition to TLB tag, valid bit
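The two options can be compared with a toy software model of a TLB. This is a sketch only: the 4-entry fully associative layout, the field names, and the function names are all assumptions for illustration, not a description of real hardware.

```c
#include <assert.h>
#include <stdint.h>

enum { TLB_ENTRIES = 4 };   /* tiny and fully associative, for illustration */

struct tlb_entry { int valid; uint32_t pid; uint64_t vpn; uint64_t ppn; };
struct tlb { struct tlb_entry e[TLB_ENTRIES]; uint32_t current_pid; };

/* Option 1: invalidate everything when the page table base changes. */
void tlb_flush_all(struct tlb *t) {
    for (int i = 0; i < TLB_ENTRIES; ++i)
        t->e[i].valid = 0;
}

/* Option 2: the OS only updates the "current process ID" register. */
void tlb_set_pid(struct tlb *t, uint32_t pid) { t->current_pid = pid; }

/* Fill a slot, tagging the entry with the current process ID. */
void tlb_insert(struct tlb *t, int slot, uint64_t vpn, uint64_t ppn) {
    assert(slot >= 0 && slot < TLB_ENTRIES);
    t->e[slot] = (struct tlb_entry){1, t->current_pid, vpn, ppn};
}

/* Hit only if the entry is valid, the VPN matches, and the entry
   belongs to the current process. Returns 1 and sets *ppn on a hit. */
int tlb_lookup(const struct tlb *t, uint64_t vpn, uint64_t *ppn) {
    for (int i = 0; i < TLB_ENTRIES; ++i)
        if (t->e[i].valid && t->e[i].vpn == vpn &&
            t->e[i].pid == t->current_pid) {
            *ppn = t->e[i].ppn;
            return 1;
        }
    return 0;
}
```

With option 2, an entry inserted by process 1 misses while process 2 runs and hits again once process 1 is switched back in; with option 1 it would have to be reloaded from the page table.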

### editing page tables

what happens to TLB when OS changes a page table entry?

most common choice: has to be handled in software

invalid to valid: nothing needed

- TLB doesn't contain invalid entries
- MMU will check memory again

valid to invalid: OS needs to tell processor to invalidate it

- special instruction (x86: invlpg)

valid to other valid: OS needs to tell processor to invalidate it

# address splitting for TLBs (1)

my desktop:

4KB ($2^{12}$ byte) pages; 48-bit virtual address

64-entry, 4-way L1 data TLB

TLB index bits? 64/4 = 16 sets: 4 bits

TLB tag bits? 48-12 = 36 bit virtual page number; 36-4 = 32 bit TLB tag

# address splitting for TLBs (2)

my desktop:

4KB ($2^{12}$ byte) pages; 48-bit virtual address

1536-entry ($3 \cdot 2^9$), 12-way L2 TLB

TLB index bits? 1536/12 = 128 sets: 7 bits

TLB tag bits? 48-12 = 36 bit virtual page number; 36-7 = 29 bit TLB tag
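The same arithmetic as a small function, as a sketch; the names `tlb_split`, `log2_exact`, and `split_vpn` are hypothetical, and the number of sets is assumed to be a power of two.

```c
#include <assert.h>

struct tlb_split { unsigned index_bits; unsigned tag_bits; };

/* Number of bits needed to index n sets; n must be a power of two. */
static unsigned log2_exact(unsigned n) {
    assert(n > 0 && (n & (n - 1)) == 0);
    unsigned bits = 0;
    while (n > 1) { n >>= 1; ++bits; }
    return bits;
}

/* Split the virtual page number into TLB index and TLB tag.
   The page offset is not part of either. */
struct tlb_split split_vpn(unsigned va_bits, unsigned page_offset_bits,
                           unsigned entries, unsigned ways) {
    assert(ways > 0 && entries % ways == 0);
    unsigned vpn_bits = va_bits - page_offset_bits;
    struct tlb_split s;
    s.index_bits = log2_exact(entries / ways);   /* sets = entries / ways */
    s.tag_bits = vpn_bits - s.index_bits;
    return s;
}
```

`split_vpn(48, 12, 64, 4)` gives 4 index bits and a 32-bit tag; `split_vpn(48, 12, 1536, 12)` gives 7 index bits and a 29-bit tag, matching the two slides.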











- 9-bit virtual address
- 6-bit physical address
- 8-byte pages $\rightarrow$ 3-bit page offset (bottom)
- 9-bit VA: 6 bit VPN + 3 bit PO
- 6-bit PA: 3 bit PPN + 3 bit PO
- 1 page page tables w/ 1 byte entry $\rightarrow$ 8 entry PTs
- 8 entry page tables $\rightarrow$ 3-bit VPN parts
- 9-bit VA: 3 bit VPN part 1; 3 bit VPN part 2
- page table entry (either level): valid?, PPN

9-bit virtual addresses, 6-bit physical; 8 byte pages, 1 byte PTE

page tables 1 page; PTE: 3 bit PPN (MSB), 1 valid bit, 4 unused

page table base register 0x20; translate virtual address 0x129

| physical addresses | bytes       | physical addresses | bytes       |
|--------------------|-------------|--------------------|-------------|
| 0x00-3             | 00 11 22 33 | 0x20-3             | 00 91 72 13 |
| 0x04-7             | 44 55 66 77 | 0x24-7             | F4 A5 36 07 |
| 0x08-B             | 88 99 AA BB | 0x28-B             | 89 9A AB BC |
| 0x0C-F             | CC DD EE FF | 0x2C-F             | CD DE EF F0 |
| 0x10-3             | 1A 2A 3A 4A | 0x30-3             | BA 0A BA 0A |
| 0x14-7             | 1B 2B 3B 4B | 0x34-7             | DB 0B DB 0B |
| 0x18-B             | 1C 2C 3C 4C | 0x38-B             | EC 0C EC 0C |
| 0x1C-F             | 1C 2C 3C 4C | 0x3C-F             | AC DC DC 0C |

```
0x129 = 1 0010 1001 = 100 101 001
PTE 1 addr: 0x20 + 0x4 × 1 = 0x24
PTE 1 value: 0xF4 = 1111 0100
PPN 111, valid 1
PTE 2 addr: 111 000 + 101 × 1 = 0x3D
PTE 2 value: 0xDC
PPN 110; valid 1
M[110 001 (0x31)] = 0x0A
```
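The lookup steps above can be sketched as a function for this toy machine. The name `translate` and the convention of returning -1 for an invalid entry are assumptions for illustration; `mem` is the 64 bytes of physical memory from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Two-level page table walk for the toy machine: 9-bit VA = 3-bit VPN
   part 1, 3-bit VPN part 2, 3-bit page offset. PTEs are 1 byte with the
   PPN in the top 3 bits and the valid bit just below it (bit 4).
   Returns the 6-bit physical address, or -1 if a PTE is invalid. */
int translate(const uint8_t mem[64], uint8_t ptbr, uint16_t va) {
    assert(va < 512 && ptbr < 64);
    unsigned vpn1 = (va >> 6) & 7;
    unsigned vpn2 = (va >> 3) & 7;
    unsigned offset = va & 7;

    /* level 1: base register + index * PTE size (1 byte) */
    uint8_t pte1 = mem[(ptbr + vpn1) & 63];
    if (!((pte1 >> 4) & 1))
        return -1;

    /* level 2: table starts at PPN * page size (8 bytes) */
    unsigned table2 = (unsigned)(pte1 >> 5) << 3;
    uint8_t pte2 = mem[table2 + vpn2];
    if (!((pte2 >> 4) & 1))
        return -1;

    /* final address: PPN from second-level PTE, then page offset */
    return (int)(((unsigned)(pte2 >> 5) << 3) | offset);
}
```

With the memory contents above, `translate(mem, 0x20, 0x129)` follows the same path as the slide (PTE at 0x24, then PTE at 0x3D) and returns 0x31.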

9-bit virtual addresses, 6-bit physical; 8 byte pages, 1 byte PTE page tables 1 page; PTE: 3 bit PPN (MSB), 1 valid bit, 4 unused; page table base register 0x08; translate virtual address 0x0FB

```
physical bytes
addresses
0 \times 00 - 3 | 00 \ 11 \ 22 \ 33
                          0x20-3|D0 D1 D2 D3
0 \times 04 - 7 | 44 55 66 77
                          0x24-7D4 D5 D6 D7
                          0x28-Bl89 9A AB BC
0x08-Bl88 99 AA BB
0x0C-FCC DD EE FF
                          0x2C-FCD DE EF F0
                          0x30-3|BA 0A BA 0A
0 \times 10 - 3 | 1A 2A 3A 4A
                          0x34-7DB 0B DB 0B
0 \times 14 - 7 | 1B 2B 3B 4B
0x18-Bl1C 2C 3C 4C
                          0x38-BIEC 0C EC 0C
0x1C-F|1C 2C 3C 4C
                          0x3C-FIFC 0C FC 0C
```

9-bit virtual addresses, 6-bit physical; 8 byte pages, 1 byte PTE page tables 1 page; PTE: 3 bit PPN (MSB), 1 valid bit, 4 unused;

page table base register  $0 \times 08$ ; translate virtual address  $0 \times 0FB$ 

```
physical   bytes          physical   bytes
addresses                 addresses
0x00-3|00 11 22 33   0x20-3|D0 D1 D2 D3   0x0FB = 011 111 011
0x04-7|44 55 66 77   0x24-7|D4 D5 D6 D7   (PTE 1 addr: 0x08 + PTE size times 011 (3))
0x08-B|88 99 AA BB   0x28-B|89 9A AB BC   PTE 1: 0xBB at 0x0B
0x0C-F|CC DD EE FF   0x2C-F|CD DE EF F0   PTE 1: PPN 101 (5) valid 1
0x10-3|1A 2A 3A 4A   0x30-3|BA 0A BA 0A   PTE 2: 0xF0 at 0x2F
0x14-7|1B 2B 3B 4B   0x34-7|DB 0B DB 0B   PTE 2: PPN 111 (7) valid 1
0x18-B|1C 2C 3C 4C   0x38-B|EC 0C EC 0C   111 011 = 0x3B → 0x0C
0x1C-F|1C 2C 3C 4C   0x3C-F|FC 0C FC 0C
```
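
The same two-level lookup can be written as a short program. This is only a sketch under the exercise's assumptions (1-byte PTEs, 8-byte pages, valid bit directly below the 3-bit PPN); the names `MEM`, `PageFault`, and `translate` are made up for illustration and do not come from the slides.

```python
# Physical memory from the exercise: 64 bytes, addresses 0x00-0x3F.
MEM = bytes.fromhex(
    "00112233" "44556677" "8899AABB" "CCDDEEFF"   # 0x00-0x0F
    "1A2A3A4A" "1B2B3B4B" "1C2C3C4C" "1C2C3C4C"   # 0x10-0x1F
    "D0D1D2D3" "D4D5D6D7" "899AABBC" "CDDEEFF0"   # 0x20-0x2F
    "BA0ABA0A" "DB0BDB0B" "EC0CEC0C" "FC0CFC0C"   # 0x30-0x3F
)
PAGE_SIZE = 8   # bytes per page, so 3 page-offset bits
PTE_SIZE = 1    # bytes per page table entry


class PageFault(Exception):
    """Raised when a PTE has valid bit 0 (hypothetical name)."""


def translate(vaddr, ptbr):
    """Walk both levels; return (physical address, byte stored there)."""
    vpn1 = (vaddr >> 6) & 0b111    # top 3 bits: first-level index
    vpn2 = (vaddr >> 3) & 0b111    # middle 3 bits: second-level index
    offset = vaddr & 0b111         # low 3 bits: page offset

    table = ptbr
    for level, index in enumerate((vpn1, vpn2), start=1):
        pte = MEM[table + index * PTE_SIZE]
        ppn = pte >> 5             # 3-bit PPN in the most significant bits
        valid = (pte >> 4) & 1     # assumed: valid bit right below the PPN
        if not valid:
            raise PageFault(f"level {level} PTE {pte:#04x} is invalid")
        table = ppn * PAGE_SIZE    # start of the next table (or the data page)

    paddr = table + offset
    return paddr, MEM[paddr]


paddr, value = translate(0x0FB, ptbr=0x08)
print(f"{paddr:#04x} -> {value:#04x}")   # 0x3b -> 0x0c
```

The loop body is the same at both levels: only the table base and the index change, which is why adding more levels does not change the shape of the lookup.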

9-bit virtual addresses, 6-bit physical; 8 byte pages, 1 byte PTE

page tables 1 page; PTE: 3 bit PPN (MSB), 1 valid bit, 4 unused

page table base register 0x10; translate virtual address 0x109

```
physical   bytes          physical   bytes
addresses                 addresses
0x00-3|00 11 22 33        0x20-3|D0 D1 D2 D3
0x04-7|44 55 66 77        0x24-7|D4 D5 D6 D7
0x08-B|88 99 AA BB        0x28-B|89 9A AB BC
0x0C-F|CC DD EE FF        0x2C-F|CD DE EF F0
0x10-3|1A 2A 5A 4A        0x30-3|BA 0A BA 0A
0x14-7|1B 2B 3B 4B        0x34-7|DB 0B DB 0B
0x18-B|1C 2C 3C 4C        0x38-B|EC 0C EC 0C
0x1C-F|1C 2C 3C 4C        0x3C-F|FC 0C FC 0C
```

```
physical   bytes          physical   bytes
addresses                 addresses
0x00-3|00 11 22 33   0x20-3|D0 D1 D2 D3   0x109 = 100 001 001
0x04-7|44 55 66 77   0x24-7|D4 D5 D6 D7   (PTE 1 at: 0x10 + PTE size times 4 (100))
0x08-B|88 99 AA BB   0x28-B|89 9A AB BC   PTE 1: 0x1B at 0x14
0x0C-F|CC DD EE FF   0x2C-F|CD DE EF F0   PTE 1: PPN 000 (0) valid 1
0x10-3|1A 2A 5A 4A   0x30-3|BA 0A BA 0A   (second table at: 0 (000) times page size = 0x00)
0x14-7|1B 2B 3B 4B   0x34-7|DB 0B DB 0B   PTE 2: 0x11 at 0x01
0x18-B|1C 2C 3C 4C   0x38-B|EC 0C EC 0C   PTE 2: PPN 000 (0) valid 1
0x1C-F|1C 2C 3C 4C   0x3C-F|FC 0C FC 0C   000 001 = 0x01 → 0x11
```

9-bit virtual addresses, 6-bit physical; 8 byte pages, 1 byte PTE

page tables 1 page; PTE: 3 bit PPN (MSB), 1 valid bit, 4 unused

page table base register 0x08; translate virtual address 0x00B

```
physical   bytes          physical   bytes
addresses                 addresses
0x00-3|00 11 22 33   0x20-3|D0 D1 D2 D3   0x00B = 000 001 011
0x04-7|44 55 66 77   0x24-7|D4 D5 D6 D7   PTE 1: 0x88 at 0x08
0x08-B|88 99 AA BB   0x28-B|89 9A AB BC   PTE 1: PPN 100 (4) valid 0
0x0C-F|CC DD EE FF   0x2C-F|CD DE EF F0   page fault!
0x10-3|1A 2A 3A 4A   0x30-3|BA 0A BA 0A
0x14-7|1B 2B 3B 4B   0x34-7|DB 0B DB 0B
0x18-B|1C 2C 3C 4C   0x38-B|EC 0C EC 0C
0x1C-F|1C 2C 3C 4C   0x3C-F|FC 0C FC 0C
```
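
Why the lookup stops at the first level: a small sketch of decoding a 1-byte PTE, assuming the valid bit sits directly below the 3-bit PPN (this matches the worked values in these exercises). `decode_pte` is a hypothetical helper name.

```python
def decode_pte(pte):
    """Split a 1-byte PTE into (ppn, valid): 3-bit PPN, 1 valid bit, 4 unused."""
    ppn = (pte >> 5) & 0b111   # bits 7-5
    valid = (pte >> 4) & 1     # bit 4 (assumed position)
    return ppn, bool(valid)


for pte in (0x88, 0xBB):
    ppn, valid = decode_pte(pte)
    print(f"{pte:#04x} = {pte:08b}: PPN {ppn:03b} ({ppn}), valid {int(valid)}")
# 0x88 = 10001000: PPN 100 (4), valid 0
# 0xbb = 10111011: PPN 101 (5), valid 1
```

With valid 0, the PPN field is meaningless, so no second-level lookup happens and the access faults.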

9-bit virtual addresses, 6-bit physical; 8 byte pages, 1 byte PTE

page tables 1 page; PTE: 3 bit PPN (MSB), 1 valid bit, 4 unused

page table base register 0x08; translate virtual address 0x1CB

```
physical   bytes          physical   bytes
addresses                 addresses
0x00-3|00 11 22 33        0x20-3|D0 D1 D2 D3
0x04-7|44 55 66 77        0x24-7|D4 D5 D6 D7
0x08-B|88 99 AA BB        0x28-B|89 9A AB BC
0x0C-F|CC DD EE FF        0x2C-F|CD DE EF F0
0x10-3|1A 2A 3A 4A        0x30-3|BA 0A BA 0A
0x14-7|1B 2B 3B 4B        0x34-7|DB 0B DB 0B
0x18-B|1C 2C 3C 4C        0x38-B|EC 0C EC 0C
0x1C-F|1C 2C 3C 4C        0x3C-F|FC 0C FC 0C
```

```
physical   bytes          physical   bytes
addresses                 addresses
0x00-3|00 11 22 33   0x20-3|D0 D1 D2 D3   0x1CB = 111 001 011
0x04-7|44 55 66 77   0x24-7|D4 D5 D6 D7   PTE 1: 0xFF at 0x0F
0x08-B|88 99 AA BB   0x28-B|89 9A AB BC   PTE 1: PPN 111 (7) valid 1
0x0C-F|CC DD EE FF   0x2C-F|CD DE EF F0   PTE 2: 0x0C at 0x39
0x10-3|1A 2A 3A 4A   0x30-3|BA 0A BA 0A   PTE 2: PPN 000 (0) valid 0
0x14-7|1B 2B 3B 4B   0x34-7|DB 0B DB 0B   page fault!
0x18-B|1C 2C 3C 4C   0x38-B|EC 0C EC 0C
0x1C-F|1C 2C 3C 4C   0x3C-F|FC 0C FC 0C
```
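
Where the two PTE addresses 0x0F and 0x39 come from: a sketch that splits an address into bit fields and then applies "table base + index × PTE size" at each level. `split_fields` is a hypothetical helper, and the PPN of 7 for the second table is copied from PTE 1 (0xFF) in the frame above rather than read from memory.

```python
def split_fields(addr, widths):
    """Split addr into bit fields, most significant first; widths are in bits."""
    fields = []
    shift = sum(widths)
    for width in widths:
        shift -= width
        fields.append((addr >> shift) & ((1 << width) - 1))
    return fields


vpn1, vpn2, offset = split_fields(0x1CB, (3, 3, 3))
print(f"{vpn1:03b} {vpn2:03b} {offset:03b}")    # 111 001 011

PTBR, PAGE_SIZE, PTE_SIZE = 0x08, 8, 1
pte1_addr = PTBR + vpn1 * PTE_SIZE              # first-level table starts at PTBR
pte2_addr = 7 * PAGE_SIZE + vpn2 * PTE_SIZE     # second table is physical page 7
print(f"{pte1_addr:#04x} {pte2_addr:#04x}")     # 0x0f 0x39
```

Passing the field widths as a parameter means the same helper also covers layouts with a different offset width, such as a 3/3/4 split.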

10-bit virtual addresses, 6-bit physical; 16 byte pages, 2 byte PTE

page tables 1 page; PTE 1st byte: (MSB) 2-bit PPN, valid bit; rest unused

page table base register 0x10; translate virtual address 0x376

```
physical   bytes          physical   bytes
addresses                 addresses
0x00-3|00 11 22 33        0x20-3|D0 E1 D2 D3
0x04-7|44 55 66 77        0x24-7|D4 E5 D6 E7
0x08-B|88 99 AA BB        0x28-B|89 9A AB BC
0x0C-F|CC DD EE FF        0x2C-F|CD DE EF F0
0x10-3|1A 2A 3A 4A        0x30-3|BA 0A BA 0A
0x14-7|1B 2B 3B 4B        0x34-7|DB 0B DB 0B
0x18-B|1C 2C 3C 4C        0x38-B|EC 0C EC 0C
0x1C-F|AC BC DC EC        0x3C-F|FC 0C FC 0C
```

```
physical   bytes          physical   bytes
addresses                 addresses
0x00-3|00 11 22 33   0x20-3|D0 E1 D2 D3   0x376 = 110 111 0110
0x04-7|44 55 66 77   0x24-7|D4 E5 D6 E7   PTE 1: 0x10 + 6 × 2 = 0x1C: AC BC
0x08-B|88 99 AA BB   0x28-B|89 9A AB BC   PTE 1: PPN 10 valid 1
0x0C-F|CC DD EE FF   0x2C-F|CD DE EF F0   PTE 2: 0x20 + 7 × 2 = 0x2E: EF F0
0x10-3|1A 2A 3A 4A   0x30-3|BA 0A BA 0A   PTE 2: PPN 11 valid 1
0x14-7|1B 2B 3B 4B   0x34-7|DB 0B DB 0B   11 0110 = 0x36 → DB
0x18-B|1C 2C 3C 4C   0x38-B|EC 0C EC 0C
0x1C-F|AC BC DC EC   0x3C-F|FC 0C FC 0C
```
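
What changes with 2-byte PTEs and 16-byte pages: the index is multiplied by 2, only the first PTE byte carries the PPN and valid bit, and the offset is 4 bits. A minimal sketch under those assumptions (valid bit assumed to sit directly below the 2-bit PPN); `MEM` and `translate` are illustrative names, and returning `None` on an invalid entry is just one possible way to signal a fault.

```python
# Physical memory for the 10-bit exercise, addresses 0x00-0x3F.
MEM = bytes.fromhex(
    "00112233" "44556677" "8899AABB" "CCDDEEFF"   # 0x00-0x0F
    "1A2A3A4A" "1B2B3B4B" "1C2C3C4C" "ACBCDCEC"   # 0x10-0x1F
    "D0E1D2D3" "D4E5D6E7" "899AABBC" "CDDEEFF0"   # 0x20-0x2F
    "BA0ABA0A" "DB0BDB0B" "EC0CEC0C" "FC0CFC0C"   # 0x30-0x3F
)
PAGE_SIZE = 16  # bytes per page, so 4 page-offset bits
PTE_SIZE = 2    # bytes per PTE; 8 PTEs fill one 16-byte page


def translate(vaddr, ptbr):
    """Return (physical address, byte) or None if a PTE is invalid."""
    vpn1 = (vaddr >> 7) & 0b111    # top 3 bits
    vpn2 = (vaddr >> 4) & 0b111    # next 3 bits
    offset = vaddr & 0b1111        # low 4 bits

    table = ptbr
    for index in (vpn1, vpn2):
        first_byte = MEM[table + index * PTE_SIZE]   # second byte is unused
        ppn = first_byte >> 6                        # 2-bit PPN (MSB)
        valid = (first_byte >> 5) & 1                # assumed: bit below the PPN
        if not valid:
            return None
        table = ppn * PAGE_SIZE

    paddr = table + offset
    return paddr, MEM[paddr]


paddr, value = translate(0x376, ptbr=0x10)
print(f"{paddr:#04x} -> {value:#04x}")   # 0x36 -> 0xdb
```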
