### last time

buses and direct memory access devices can access memory directly to free CPU time

out-of-order execution idea — do work as available

register renaming + extra physical registers
unique physical register rename for each version
each physical register only written once
opportunity to convert complex instr. to multiple simpler ones

instruction queue + issuing
track which physical registers have values ready
run instructions from queue if their inputs are ready
set of "execution units" that can accept instructions

## anonymous feedback (1)

"It is kind of late for this semester, but in the future it would be nice to know if an online submission has an autograder set up or if the submission will be graded manually at a later date (or if there will be an autograder, but it is not live at the current moment)..."

I want to be clearer re: automatic testing next semester, but... also am concerned about students not doing testing on their local machines

(which seems important for actually debugging anything...)

## anonymous feedback (2)

"I am a little worried about the out-of-order HW in relation to the quiz we will have next week. On weeks where the quiz due on Tuesday and the HW due on Wednesday are on the same topic, I really find that the quiz helps me gauge my understanding of the HW material..."

most of what's covered on the OOO homework is from last week... considered making HW due later, but I don't think it would be good to be less forthcoming in final review/have homework due during finals period

## quiz Q1

## quiz Q4D

```
say %r11 initially in %x11 renaming might look like (depending on free regs):
```

```
mov (%rax), %r9
add %r9, %r10
mov (%rbx), %r9 | mov (%x??), %x20
add %r9, %r11 | add %x20, %x11 -> %x21
mov (%rcx), %r9 | ...
add %r9, %r12 | ...
xor %r10, %r11 | add %x??, %x21 -> %x24
```

## quiz Q5B

```
add
     %x10, %x19 -> %x20
sub %x20, %x21 -> %x22
xor %x18, %x22 -> %x24
imul %x18, %x18 -> %x25
to run xor, need to run sub to get %x22
to run sub, need to run add to get %x20
therefore, xor must be computed after sub
```



## execution units AKA functional units (1)

where actual work of instruction is done

e.g. the actual ALU, or data cache

sometimes pipelined:

(here: 1 op/cycle; 3 cycle latency)







## execution units AKA functional units (1)

where actual work of instruction is done

e.g. the actual ALU, or data cache

sometimes pipelined:

(here: 1 op/cycle; 3 cycle latency)



exercise: how long to compute  $A \times (B \times (C \times D))$ ?

## execution units AKA functional units (2)

where actual work of instruction is done

e.g. the actual ALU, or data cache

sometimes unpipelined:



#### instruction queue

|    | • • • • • • • • • • • • • • • • • • • |
|----|---------------------------------------|
| #  | instruction                           |
| 1  | add %x01, %x02 → %x03                 |
| 2  | imul %x04, %x05 → %x06                |
| 3  | imul %x03, %x07 → %x08                |
| 4  | cmp %x03, %x08 → %x09.cc              |
| 5  | jle %x09.cc,                          |
| 6  | add %x01, %x03 → %x11                 |
| 7  | imul %x04, %x06 → %x12                |
| 8  | imul %x03, %x08 → %x13                |
| 9  | cmp %x11, %x13 → %x14.cc              |
| 10 | jle %x14.cc,                          |
|    |                                       |

execution unit

ALU 1 (add, cmp, jxx) ALU 2 (add, cmp, jxx)

ALU 3 (mul) start ALU 3 (mul) end

| status  |                                                                                                                                                 |
|---------|-------------------------------------------------------------------------------------------------------------------------------------------------|
| ready   |                                                                                                                                                 |
| ready   |                                                                                                                                                 |
| pending |                                                                                                                                                 |
| ready   |                                                                                                                                                 |
| ready   |                                                                                                                                                 |
| pending |                                                                                                                                                 |
| ready   |                                                                                                                                                 |
| pending |                                                                                                                                                 |
|         |                                                                                                                                                 |
|         | ready ready pending ready pending ready pending |

#### instruction queue

| #  | instruction                     |
|----|---------------------------------|
| 1  | add %x01, %x02 → %x03           |
| 2  | imul %x04, %x05 → %x06          |
| 3  | imul %x03, %x07 → %x08          |
| 4  | cmp %x03, %x08 → %x09.cc        |
| 5  | jle %x09.cc,                    |
| 6  | add %x01, %x03 → %x11           |
| 7  | $ imul\ %x04$ , $%x06 \to %x12$ |
| 8  | imul %x03, %x08 → %x13          |
| 9  | cmp %x11, %x13 → %x14.cc        |
| 10 | jle %x14.cc,                    |
|    |                                 |

execution unit

ALU 1 (add, cmp, jxx) ALU 2 (add, cmp, jxx) ALU 3 (mul) start ALU 3 (mul) end

| status  |
|---------|
| ready   |
| ready   |
| pending |
| ready   |
| ready   |
| pending |
| ready   |
| pending |
|         |
|         |

#### instruction queue

|    | •                                    |
|----|--------------------------------------|
| #  | instruction                          |
| 1  | add %x01, %x02 → %x03                |
| 2  | imul %x04, %x05 → %x06               |
|    | imul %x03, %x07 → %x08               |
| 4  | cmp %x03, %x08 → %x09.cc             |
| 5  | jle %x09.cc,                         |
| 6  | add %x01, %x03 → %x11                |
| 7  | imul %x04, %x06 → %x12               |
| 8  | imul %x03, %x08 → %x13               |
| 9  | cmp %x11, %x13 $\rightarrow$ %x14.cc |
| 10 | jle %x14.cc,                         |
| •  |                                      |

execution unit cycle# 1
ALU 1 (add, cmp, jxx) 1
ALU 2 (add, cmp, jxx) ALU 3 (mul) start 2
ALU 3 (mul) end

|      | •       |   |
|------|---------|---|
| reg  | status  |   |
| %x01 | ready   |   |
| %x02 | ready   |   |
| %x03 | pending |   |
| %x04 | ready   |   |
| %x05 | ready   |   |
| %x06 | pending |   |
| %x07 | ready   |   |
| %x08 | pending |   |
| %x09 | pending |   |
| %x10 | pending |   |
| %x11 | pending |   |
| %x12 | pending |   |
| %x13 | pending |   |
| %x14 | pending |   |
| •••  | "       |   |
|      |         | _ |

|     | • • • • • • • • • • • • • • • • • • •    |
|-----|------------------------------------------|
| #   | instruction                              |
| ⋉   | $add %x01, %x02 \rightarrow %x03$        |
| 2×< | imul %x04, %x05 → %x06                   |
| 3   | imul %x03, %x07 → %x08                   |
| 4   | cmp $%x03$ , $%x08 \rightarrow %x09$ .cc |
| 5   | jle %x09.cc,                             |
| 6   | add %x01, %x03 → %x11                    |
| 7   | imul %x04, %x06 → %x12                   |
| 8   | imul %x03, %x08 → %x13                   |
| 9   | cmp %x11, %x13 $\rightarrow$ %x14.cc     |
| 10  | jle %x14.cc,                             |
|     |                                          |

| execution unit        | cycle# 1 | 2 |   |
|-----------------------|----------|---|---|
| ALU 1 (add, cmp, jxx) | 1        | 6 |   |
| ALU 2 (add, cmp, jxx) | _        | _ |   |
| ALU 3 (mul) start     | 2        | 3 |   |
| ALU 3 (mul) end       |          | 2 | 3 |

| reg  | status          |
|------|-----------------|
| %x01 | ready           |
| %x02 | ready           |
| %x03 | pending ready   |
| %x04 | ready           |
| %x05 | ready           |
| %x06 | pending (still) |
| %x07 | ready           |
| %x08 | pending         |
| %x09 | pending         |
| %x10 | pending         |
| %x11 | pending         |
| %x12 | pending         |
| %x13 | pending         |
| %x14 | pending         |
| •••  | "               |

|           | · •                               |
|-----------|-----------------------------------|
| #         | instruction                       |
| $\bowtie$ | add %x01, %x02 → %x03             |
| 2×        | <pre>imul %x04, %x05 → %x06</pre> |
| 3≪        | <pre>imul %x03, %x07 → %x08</pre> |
| 4         | cmp %x03, %x08 → %x09.cc          |
| 5         | jle %x09.cc,                      |
| 6≪        | add %x01, %x03 → %x11             |
| 7         | imul %x04, %x06 → %x12            |
| 8         | imul %x03, %x08 → %x13            |
| 9         | cmp %x11, %x13 → %x14.cc          |
| 10        | jle %x14.cc,                      |
| •         |                                   |

| cycle# 1 | 2                       | 3     |                           |
|----------|-------------------------|-------|---------------------------|
| 1        | 6                       | _     |                           |
| _        | _                       | _     |                           |
| 2        | 3                       | 7     |                           |
|          | 2                       | 3     | 7                         |
|          | cycle# 1<br>1<br>-<br>2 | 1 6 - | 1 6 -<br><br>2 3 <b>7</b> |

|      | •               |
|------|-----------------|
| reg  | status          |
| %x01 | ready           |
| %x02 | ready           |
| %x03 | pending ready   |
| %x04 | ready           |
| %x05 | ready           |
| %x06 | pending ready   |
| %x07 | ready           |
| %x08 | pending (still) |
| %x09 | pending         |
| %x10 | pending         |
| %x11 | pending ready   |
| %x12 | pending         |
| %x13 | pending         |
| %x14 | pending         |
| •••  |                 |

### instruction queue

|        | • • • • • • • • • • • • • • • • • • • |
|--------|---------------------------------------|
| #      | instruction                           |
| ×      | add %x01, %x02 → %x03                 |
| 2×<    | imul %x04, %x05 → %x06                |
| 3≪     | imul %x03, %x07 → %x08                |
| 4~     | <u>cmp %x03, %x08 → %x09.cc</u>       |
|        | jle %x09.cc,                          |
| 6≪     | add %x01, %x03 → %x11                 |
| $\sim$ | imul %x04, %x06 → %x12                |
| 8      | imul %x03, %x08 → %x13                |
| 9      | cmp %x11, %x13 → %x14.cc              |
| 10     | jle %x14.cc,                          |
|        |                                       |

| execution unit        | cycle# 1 | 2 | 3 | 4 |  |
|-----------------------|----------|---|---|---|--|
| ALU 1 (add, cmp, jxx) | 1        | 6 | _ | 4 |  |
| ALU 2 (add, cmp, jxx) | _        | _ | _ | _ |  |
| ALU 3 (mul) start     | 2        | 3 | 7 | 8 |  |
| ALU 3 (mul) end       |          | 2 | 3 | 7 |  |

| status          |
|-----------------|
| ready           |
| ready           |
| pending ready   |
| ready           |
| ready           |
| pending ready   |
| ready           |
| pending ready   |
| pending ready   |
| pending         |
| pending ready   |
| pending (still) |
| pending         |
| pending         |
| "               |
|                 |

8

| struction                             |
|---------------------------------------|
| $d %x01, %x02 \rightarrow %x03$       |
| ul %x04, %x05 → %x06                  |
| ul %x03, %x07 → %x08                  |
| $p \%x03, \%x08 \rightarrow \%x09.cc$ |
| e %x09.cc,                            |
| $d \%x01, \%x03 \rightarrow \%x11$    |
| ul %x04, %x06 → %x12                  |
| ul %x03, %x08 → %x13                  |
| p %x11, %x13 → %x14.cc                |
| e %x14.cc,                            |
|                                       |

| execution unit        | cycle# 1 | 2 | 3 | 4 | 5 |
|-----------------------|----------|---|---|---|---|
| ALU 1 (add, cmp, jxx) | 1        | 6 | _ | 4 | 5 |
| ALU 2 (add, cmp, jxx) | _        | _ | _ | _ | _ |
| ALU 3 (mul) start     | 2        | 3 | 7 | 8 | _ |
| ALU 3 (mul) end       |          | 2 | 3 | 7 | 8 |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | pending ready            |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | ready                    |
| %x08 | pending ready            |
| %x09 | pending ready            |
| %x10 | pending                  |
| %x11 | pending ready            |
| %x12 | pending ready            |
| %x13 | pending (still)          |
| %x14 | pending                  |
| •••  | "                        |
|      |                          |

|            | · •                           |
|------------|-------------------------------|
| #          | instruction                   |
| $\bowtie$  | add %x01, %x02 → %x03         |
| 2<         | imul %x04, %x05 → %x06        |
| 3≪         | imul %x03, %x07 → %x08        |
| 4><        | cmp %x03, %x08 → %x09.cc      |
| 5≪         | jle %x09.cc,                  |
| 6≪         | <u>add %x01, %x03 → %x11</u>  |
| ~          | imul %x04, %x06 → %x12        |
| <b>≫</b> < | <u>imul %x03, %x08</u> → %x13 |
| 9          | cmp %x11, %x13 → %x14.cc      |
| 10         | jle %x14.cc,                  |
|            |                               |

| execution unit        | cycle# 1 | 2 | 3 | 4 | 5 |
|-----------------------|----------|---|---|---|---|
| ALU 1 (add, cmp, jxx) | 1        | 6 | _ | 4 | 5 |
| ALU 2 (add, cmp, jxx) | _        | _ | _ | _ | _ |
| ALU 3 (mul) start     | 2        | 3 | 7 | 8 | _ |
| ALU 3 (mul) end       |          | 2 | 3 | 7 | 8 |

|      | ,                        |
|------|--------------------------|
| reg  | status                   |
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | <del>pending</del> ready |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | ready                    |
| %x08 | <del>pending</del> ready |
| %x09 | <del>pending</del> ready |
| %x10 | pending                  |
| %x11 | <del>pending</del> ready |
| %x12 | <del>pending</del> ready |
| %x13 | pending ready            |
| %x14 | pending                  |
| •••  | "                        |
|      |                          |

| # instruction    add %x01, %x02 + %x03     mul %x04, %x05 + %x06     mul %x03, %x07 + %x08     cmp %x03, %x08 + %x09.66     fle %x09.cc,     dd %x01, %x03 + %x11     mul %x04, %x06 + %x12     mul %x03, %x08 + %x13     cmp %x11, %x13 + %x14.66     jle %x14.cc,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |           | •                                       |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|-----------------------------------------|
| > tmul %x04, %x05 → %x06  > imul %x03, %x07 → %x08  > cmp %x03, %x08 → %x09.cc    te %x09.cc,   dd %x01, %x03 → %x11   mul %x04, %x06 → %x12   mul %x03, %x08 → %x13   cmp %x11, %x13 → %x14.cc                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | #         | instruction                             |
| <pre>3</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | $\bowtie$ | add %x01, %x02 → %x03                   |
| <pre></pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | 2×        | <pre>imul %x04, %x05 → %x06</pre>       |
| <pre>5</pre>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | 3≪        | <pre>imul %x03, %x07 → %x08</pre>       |
| 6       add %x01, %x03 → %x11         7       imul %x04, %x06 → %x12         8       imul %x03, %x98 → %x13         9       cmp %x11, %x13 → %x14.cc                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | 4         | $cmp \%x03, \%x08 \rightarrow \%x09.cc$ |
| <pre> /* imul %x04, %x06 -&gt; %x12 /* imul %x03, %x08 -&gt; %x13 /* cmp %x11, %x13 -&gt; %x14.cc /* imul %x04, %x06 -&gt; %x12 /* imul %x03, %x08 -&gt; %x13 /* imul %x03, %x08 -&gt; %x13 /* imul %x03, %x08 -&gt; %x13 /* imul %x03, %x08 -&gt; %x14 /* imul %x08, %x08 -&gt; %x14 /* imul %x08,</pre> | 5×        | jle %x09.cc,                            |
| 8× imul %x03, %x08 → %x13<br>9× cmp %x11, %x13 → %x14.cc                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | 6×        | add %x01, %x03 → %x11                   |
| 9< cmp %x11, %x13 → %x14.cc                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | $\sim$    | <pre>imul %x04, %x96 → %x12</pre>       |
| , , ,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | 8<        | <pre>imul %x03, %x08 → %x13</pre>       |
| 10 jle %x14.cc,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | 9≪        | <u>cmp %x11, %x13 → %x14.cc</u>         |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | 10        | jle %x14.cc,                            |

| cycle# 1 | 2                       | 3     | 4                  | 5                      |
|----------|-------------------------|-------|--------------------|------------------------|
| 1        | 6                       | _     | 4                  | 5                      |
| _        | _                       | _     | _                  | _                      |
| 2        | 3                       | 7     | 8                  | _                      |
|          | 2                       | 3     | 7                  | 8                      |
|          | cycle# 1<br>1<br>-<br>2 | 1 6 - | 1 6 -<br><br>2 3 7 | 1 6 - 4<br><br>2 3 7 8 |

| reg        | status                   |
|------------|--------------------------|
| %x01       | ready                    |
| %x02       | ready                    |
| %x03       | pending ready            |
| %x04       | ready                    |
| %x05       | ready                    |
| %x06       | pending ready            |
| %x07       | ready                    |
| %x08       | <del>pending</del> ready |
| %x09       | pending ready            |
| %x10       | pending                  |
| %x11       | <del>pending</del> ready |
| %x12       | pending ready            |
| %x13       | pending ready            |
| %x14       | pending ready            |
| <b>5</b> . | "                        |
|            | •                        |

|      | • • • • • • • • • • • • • • • • • • • |
|------|---------------------------------------|
| #    | instruction                           |
| 1×   | add %x01, %x02 → %x03                 |
| 2×<  | imul %x04, %x05 → %x06                |
| 3≪   | imul %x03, %x97 → %x08                |
| 4    | <u>cmp %x03, %x08 → %x09.cc</u>       |
| 5×   | jle %x09.cc,                          |
| 6<   | add $%x01$ , $%x03 \rightarrow %x11$  |
| 7×   | imul %x04, %x96 → %x12                |
| 8<   | <u>imul %x03, %x08</u> → %x13         |
| 9×   | <u>cmp %x11, %x13 → %x14.ec</u>       |
| 128< | jle %x14.cc,                          |
|      |                                       |

| execution unit        | cycle# 1 | 2 | 3 | 4 | 5 |
|-----------------------|----------|---|---|---|---|
| ALU 1 (add, cmp, jxx) | 1        | 6 | _ | 4 | 5 |
| ALU 2 (add, cmp, jxx) | _        | - | _ | _ | _ |
| ALU 3 (mul) start     | 2        | 3 | 7 | 8 | _ |
| ALU 3 (mul) end       |          | 2 | 3 | 7 | 8 |

| reg        | status                   |
|------------|--------------------------|
| %x01       | ready                    |
| %x02       | ready                    |
| %x03       | pending ready            |
| %x04       | ready                    |
| %x05       | ready                    |
| %x06       | pending ready            |
| %x07       | ready                    |
| %x08       | pending ready            |
| %x09       | pending ready            |
| %x10       | pending                  |
| %x11       | <del>pending</del> ready |
| %x12       | pending ready            |
| %x13       | pending ready            |
| %x14       | pending ready            |
| <u>6</u> . | <b>7</b>                 |
| 9 1        | Ò                        |

### **000** limitations

can't always find instructions to run

plenty of instructions, but all depend on unfinished ones programmer can adjust program to help this

need to track all uncommitted instructions

can only go so far ahead

e.g. Intel Skylake: 224-entry reorder buffer, 168 physical registers

branch misprediction has a big cost (relative to pipelined)

e.g. Intel Skylake: up to approx. 16 cycles (v. 2 for simple pipelined  $\mathsf{CPU}$ )

### 000 limitations

### can't always find instructions to run

plenty of instructions, but all depend on unfinished ones programmer can adjust program to help this

### need to track all uncommitted instructions

can only go so far ahead

e.g. Intel Skylake: 224-entry reorder buffer, 168 physical registers

branch misprediction has a big cost (relative to pipelined)

e.g. Intel Skylake: up to approx. 16 cycles (v. 2 for simple pipelined  $\mathsf{CPU}$ )

## some performance examples

```
example1:
    movq $10000000000, %rax
loop1:
    addq %rbx, %rcx
    decq %rax
    jge loop1
    ret
```

about 30B instructions my desktop: approx 2.65 sec

```
example2:
    movq $10000000000, %rax
loop2:
    addq %rbx, %rcx
    addq %r8, %r9
    decq %rax
    jge loop2
    ret
```

about 40B instructions my desktop: approx 2.65 sec

## some performance examples

```
example1:
    movq $10000000000, %rax
loop1:
    addq %rbx, %rcx
    decq %rax
    jge loop1
    ret
```

about 30B instructions my desktop: approx 2.65 sec

```
example2:
    movq $10000000000, %rax
loop2:
    addq %rbx, %rcx
    addq %r8, %r9
    decq %rax
    jge loop2
    ret
```

about 40B instructions my desktop: approx 2.65 sec

## data flow model and limits (1)



# data flow model and limits (1)



each yellow box = instruction

arrows = dependences

instructions only executed when dependencies ready

### reassociation

with pipelined, 5-cycle latency multiplier; how long does each take to compute?

$$((a \times b) \times c) \times d$$

$$(a \times b) \times (c \times d)$$

imulq %rbx, %rax
imulq %rcx, %rdx
imulq %rdx, %rax

### reassociation

with pipelined, 5-cycle latency multiplier; how long does each take to compute?





 $\operatorname{arch} \to \operatorname{phys} \operatorname{reg}$  for new instrs

| arch. | phys. |
|-------|-------|
| reg   | reg   |
| %rax  | %x12  |
| %rcx  | %x17  |
| %rbx  | %x13  |
| %rdx  | %x07  |
| •••   | •••   |

#### free list

| %x19 |  |
|------|--|
| %x23 |  |
| ••   |  |
| •••  |  |

 $arch \rightarrow phys reg$  for new instrs

| arch. | phys. |  |  |
|-------|-------|--|--|
| reg   | reg   |  |  |
| %rax  | %x12  |  |  |
| %rcx  | %x17  |  |  |
| %rbx  | %x13  |  |  |
| %rdx  | %x07  |  |  |
| •••   | •••   |  |  |

### free list

| %x19 |  |
|------|--|
| %x23 |  |
| •••  |  |
| •••  |  |

### reorder buffer (ROB)

| instr<br>num. | PC     | dest. | reg    | done? | mispred? /<br>except? |
|---------------|--------|-------|--------|-------|-----------------------|
| 14            | 0x1233 | %rbx  | / %x23 |       |                       |
| 15            | 0x1239 | %rax  | / %x30 |       |                       |
| 16            | 0x1242 | %rcx  | / %x31 |       |                       |
| 17            | 0x1244 | %rcx  | / %x32 |       |                       |
| 18            | 0x1248 | %rdx  | / %x34 |       |                       |
| 19            | 0x1249 | %rax  | / %x38 |       |                       |
| 20            | 0x1254 | PC    |        |       |                       |
| 21            | 0x1260 | %rcx  | / %x17 |       |                       |
|               |        |       |        |       |                       |
| 31            | 0x129f | %rax  | / %x12 |       |                       |
|               |        |       |        |       |                       |
|               |        |       |        |       |                       |

reorder buffer contains instructions started, but not fully finished new entries created on rename (not enough space? stall rename stage)

 $arch \rightarrow phys reg$  for new instrs

| arch. | phys. |  |  |
|-------|-------|--|--|
| reg   | reg   |  |  |
| %rax  | %x12  |  |  |
| %rcx  | %x17  |  |  |
| %rbx  | %x13  |  |  |
| %rdx  | %x07  |  |  |
| •••   | •••   |  |  |

#### free list

| %x19 |  |
|------|--|
| %x23 |  |
| •••  |  |
| •••  |  |

### reorder buffer (ROB)

| remove    |             | str<br>um. | dest.    | reg    | done? | mispred?<br>except? |
|-----------|-------------|------------|----------|--------|-------|---------------------|
| here -    | <b>→</b> 14 | 4 0×12     | 233 %rbx | / %x23 |       |                     |
| on commit | 15          | 5 0x12     | 39 %rax  | / %x30 |       |                     |
|           | 16          | 5 0x12     | 242 %rcx | / %x31 |       |                     |
|           | 17          | 7 0×12     | 244 %rcx | / %x32 |       |                     |
|           | 18          | 3 0x12     | 48 %rdx  | / %x34 |       |                     |
|           | 19          | 0×12       | 249 %rax | / %x38 |       |                     |
|           | 20          | 0×12       | 254 PC   |        |       |                     |
|           | 2:          | 1 0×12     | 260 %rcx | / %x17 |       |                     |
|           |             | •••        |          |        |       |                     |
| add here  | 3:          | 1 0×12     | 9f%rax   | / %x12 |       |                     |
| -         | <b>→</b>    |            |          |        |       |                     |
| on rename |             |            | _        |        |       |                     |
| _         |             |            |          |        |       |                     |

place newly started instruction at end of buffer remember at least its destination register (both architectural and physical versions)

 $arch \rightarrow phys reg$  for new instrs

| 101 11011 1110113 |       |  |  |  |  |
|-------------------|-------|--|--|--|--|
| arch.             | phys. |  |  |  |  |
| reg               | reg   |  |  |  |  |
| %rax              | %x12  |  |  |  |  |
| 0/rcv             | 0/v17 |  |  |  |  |

%x13

%x07 %x19

•••

free list

%rbx

%rdx

| <del>%x19</del> |  |
|-----------------|--|
| %x23            |  |
| •••             |  |
|                 |  |

reorder buffer (ROB)

|           |               |        | ,           |       | ,                   |
|-----------|---------------|--------|-------------|-------|---------------------|
| remove    | instr<br>num. | PC     | dest. reg   | done? | mispred?<br>except? |
| here →    | 14            | 0x1233 | %rbx / %x23 |       |                     |
| on commit | 15            | 0x1239 | %rax / %x30 |       |                     |
|           | 16            | 0x1242 | %rcx / %x31 |       |                     |
|           | 17            | 0x1244 | %rcx / %x32 |       |                     |
|           | 18            | 0x1248 | %rdx / %x34 |       |                     |
|           | 19            | 0x1249 | %rax / %x38 |       |                     |
|           | 20            | 0x1254 | PC          |       |                     |
|           | 21            | 0x1260 | %rcx / %x17 |       |                     |
|           |               |        | •••         |       |                     |
| add here  | 31            | 0x129f | %rax / %x12 |       |                     |
| -         | 32            | 0x1230 | %rdx / %x19 |       |                     |
| on rename |               |        |             |       |                     |

next renamed instruction goes in next slot, etc.

 $\operatorname{arch} \to \operatorname{phys} \operatorname{reg}$  for new instrs

| arch. | phys.                |
|-------|----------------------|
| reg   | reg                  |
| %rax  | %x12                 |
| %rcx  | %x17                 |
| %rbx  | %x13                 |
| %rdx  | <del>%x07</del> %x19 |
| •••   |                      |

### free list

| %x19 |  |
|------|--|
| %x23 |  |
| •••  |  |
| •••  |  |

### reorder buffer (ROB)

|           |               |               |        |       | ,      | `     | ,                   |   |
|-----------|---------------|---------------|--------|-------|--------|-------|---------------------|---|
| remove    |               | instr<br>num. | PC     | dest. | reg    | done? | mispred?<br>except? | / |
| here      | $\rightarrow$ | 14            | 0x1233 | %rbx  | / %x23 |       |                     |   |
| on commit |               | 15            | 0x1239 | %rax  | / %x30 |       |                     |   |
|           |               | 16            | 0x1242 | %rcx  | / %x31 |       |                     |   |
|           |               | 17            | 0x1244 | %rcx  | / %x32 |       |                     |   |
|           |               | 18            | 0x1248 | %rdx  | / %x34 |       |                     |   |
|           |               | 19            | 0x1249 | %rax  | / %x38 |       |                     |   |
|           |               | 20            | 0x1254 | PC    |        |       |                     |   |
|           |               | 21            | 0x1260 | %rcx  | / %x17 |       |                     |   |
|           |               |               |        |       |        |       |                     |   |
|           |               | 31            | 0x129f | %rax  | / %x12 |       |                     |   |
| add here  |               | 32            | 0x1230 | %rdx  | / %x19 |       |                     |   |
| add ficic | -             |               |        |       |        |       |                     |   |
| on rename |               |               | I      | l     |        | 1     |                     | - |

## reorder buffer: on commit

 $\operatorname{arch} \to \operatorname{phys.} \operatorname{reg}$  for new instrs

| arch. | phys.                |  |  |  |
|-------|----------------------|--|--|--|
| reg   | reg                  |  |  |  |
| %rax  | %x12                 |  |  |  |
| %rcx  | %x17                 |  |  |  |
| %rbx  | %x13                 |  |  |  |
| %rdx  | <del>%x07</del> %x19 |  |  |  |
| •••   | •••                  |  |  |  |

#### free list

| %x19 |  |
|------|--|
| %x13 |  |
| •••  |  |
| •••  |  |

### reorder buffer (ROB)

remove here → on commit

| instr<br>num. | PC     | dest. | reg    | done? | mispred?<br>except? |
|---------------|--------|-------|--------|-------|---------------------|
| 14            | 0x1233 | %rbx  | / %x24 |       |                     |
| 15            | 0x1239 | %rax  | / %x30 |       |                     |
| 16            | 0x1242 | %rcx  | / %x31 |       |                     |
| 17            | 0x1244 | %rcx  | / %x32 |       |                     |
| 18            | 0x1248 | %rdx  | / %x34 |       |                     |
| 19            | 0x1249 | %rax  | / %x38 |       |                     |
| 20            | 0x1254 | PC    |        |       |                     |
| 21            | 0x1260 | %rcx  | / %x17 |       |                     |
|               |        |       |        |       |                     |
| 31            | 0x129f | %rax  | / %x12 |       |                     |
|               |        |       |        |       |                     |
|               |        |       |        |       |                     |

 $arch \rightarrow phys. reg$  for new instrs

| arch. | phys.                |  |
|-------|----------------------|--|
| reg   | reg                  |  |
| %rax  | %x12                 |  |
| %rcx  | %x17                 |  |
| %rbx  | %x13                 |  |
| %rdx  | <del>%x07</del> %x19 |  |
| •••   | •••                  |  |

#### free list

| %x19 |  |
|------|--|
| %x13 |  |
| •••  |  |
| •••  |  |

#### reorder buffer (ROB)

| remove    | instr<br>num. | PC     | dest. reg   | done? | mispred<br>except? |
|-----------|---------------|--------|-------------|-------|--------------------|
| here →    | 14            | 0x1233 | %rbx / %x24 |       |                    |
| on commit | 15            | 0x1239 | %rax / %x30 |       |                    |
|           | 16            | 0x1242 | %rcx / %x31 | ✓     |                    |
|           | 17            | 0x1244 | %rcx / %x32 |       |                    |
|           | 18            | 0x1248 | %rdx / %x34 | ✓     |                    |
|           | 19            | 0x1249 | %rax / %x38 | ✓     |                    |
|           | 20            | 0x1254 | PC          |       |                    |
|           | 21            | 0x1260 | %rcx / %x17 |       |                    |
|           |               |        | •••         |       |                    |
|           | 31            | 0x129f | %rax / %x12 |       | ✓                  |
|           |               |        |             |       |                    |

instructions marked done in reorder buffer when computed but not removed ('committed') yet

 $arch \rightarrow phys. reg$ reorder buffer (ROB) for new instrs mispred? / arch. phys. instr done? except? PC dest. reg remove num. reg reg here  $\longrightarrow$  14 0x1233 %rbx / %x24  $\operatorname{arch} \to \operatorname{phys} \operatorname{reg}$ %rax %x12 15 0x1239 %rax / %x30 on commit %rcx %x17 for committed 16 0x1242 %rcx / %x31 %rbx %x13 17 0x1244 %rcx / %x32 arch. phys. <del>%x07</del> %x19 %rdx 18 0x1248 %rdx / %x34 reg reg ••• 19 0x1249 %rax / %x38 %x30 %rax 20 0x1254 PC %rcx %x28 free list 21 0x1260 %rcx / %x17 %x23 %rbx %x 19 %rdx %x21 31 0x129f%rax / %x12 %x13 commit stage tracks architectural to physical register map for committed instructions

 $arch \rightarrow phys. reg$ reorder buffer (ROB) for new instrs mispred? / arch. phys. instr done? except? PC dest. reg remove num. reg reg here  $\longrightarrow$  14 0x1233 %rbx / %x24  $\operatorname{arch} \to \operatorname{phys} \operatorname{reg}$ %rax %x12 15 0x1239 %rax / %x30 on commit %rcx %x17 for committed 16 0x1242 %rcx / %x31 %rbx %x13 17 0x1244 %rcx / %x32 arch. phys. <del>%x07</del> %x19 %rdx 18 0x1248 %rdx / %x34 reg reg ••• 19 0x1249 %rax / %x38 %x30 %rax 20 0x1254 PC %rcx %x28 free list 21 0x1260 %rcx / %x17 %x23 %x24 %rbx %x 19 %rdx %x21 31 0x129f%rax / %x12 %x13 32 0x1230 %rdx / %x19 when next-to-commit instruction is done %x23 update this register map and free register list and remove instr. from reorder buffer

 $arch \rightarrow phys. reg$ reorder buffer (ROB) for new instrs arch. phys. instr done? except? mispred? / PC dest. reg num. reg reg  $\begin{array}{c} {\sf arch} \to {\sf phys} \ {\sf reg} \ \ {\sf remove} \ {\sf here} \\ {\sf for} \ {\sf committed} \end{array}$ %rax %x12 15 0x1239 %rax / %x30 %rcx %x17 16 0x1242 %rcx / %x31 %rbx %x13 17 0x1244%rcx / %x32 arch. phys. <del>%x07</del> %x19 %rdx 18 0x1248 %rdx / %x34 reg reg ••• 19 0x1249 %rax / %x38 %x30 %rax 20 0x1254 PC %rcx %x28 free list 21 0x1260 %rcx / %x17 <del>%x23</del> %x24 %rbx %x 19 %rdx %x21 0x129f%rax / %x12 31 %x13 32 0x1230\%rdx / \%x19 when next-to-commit instruction is done %x23 update this register map and free register list and remove instr. from reorder buffer

 $\operatorname{arch} \to \operatorname{phys} \operatorname{reg}$  for new instrs

| arch. | phys. |
|-------|-------|
| reg   | reg   |
| %rax  | %x12  |
| %rcx  | %x17  |
| %rbx  | %x13  |
| %rdx  | %x19  |
| •••   | •••   |

free list

| <del>%x19</del> |  |
|-----------------|--|
| %x13            |  |
| •••             |  |
| •••             |  |

 $\begin{array}{c} {\sf arch} \to {\sf phys} \ {\sf reg} \\ {\sf for} \ {\sf committed} \end{array}$ 

| arch. | phys.                |
|-------|----------------------|
| reg   | reg                  |
| %rax  | <del>%x30</del> %x38 |
| %rcx  | <del>%x31</del> %x32 |
| %rbx  | <del>%x23</del> %x24 |
| %rdx  | <del>%x21</del> %x34 |
|       | •••                  |

reorder buffer (ROB)

| instr<br>num. | PC     | dest. reg   | done?    | mispred? /<br>except? |
|---------------|--------|-------------|----------|-----------------------|
| 14            | 0x1233 | %rbx / %x24 | V        |                       |
| 15            | 0×1239 | %rax / %x30 | <b>V</b> |                       |
| 16            | 0×1242 | %rex / %x31 | · ·      |                       |
| 17            | 0x1244 | %rcx / %x32 | <b>V</b> |                       |
| 18            | 0×1248 | %rdx / %x34 | ·        |                       |
| 19            | 0×1249 | %rax / %x38 | V        |                       |
| 20            | 0x1254 | PC          | <b>√</b> | <b>√</b>              |
| 21            | 0x1260 | %rcx / %x17 |          |                       |
|               | •••    | •••         |          |                       |
| 31            | 0x129f | %rax / %x12 | <b>√</b> |                       |
| 32            | 0x1230 | %rdx / %x19 |          |                       |
|               |        |             |          |                       |

 $arch \rightarrow phys reg$  for new instrs

| arch. | phys. |
|-------|-------|
| reg   | reg   |
| %rax  | %x12  |
| %rcx  | %x17  |
| %rbx  | %x13  |
| %rdx  | %x19  |
| •••   | •••   |

 $\operatorname{arch} \to \operatorname{phys} \operatorname{reg}$  for committed

| arch. | phys.                |
|-------|----------------------|
| reg   | reg                  |
| %rax  | <del>%x30</del> %x38 |
| %rcx  | <del>%x31</del> %x32 |
| %rbx  | <del>%x23</del> %x24 |
| %rdx  | <del>%x21</del> %x34 |
|       | •••                  |

reorder buffer (ROB)

|          |               |        |             | `            | ,                  |
|----------|---------------|--------|-------------|--------------|--------------------|
|          | instr<br>num. | PC     | dest. reg   | done?        | mispred? , except? |
|          | 14            | 0×1233 | %rbx / %x24 | V            |                    |
|          | 15            |        | %rax / %x30 | <del>,</del> |                    |
|          | 16            | 0x1242 | %rex / %x31 | √            |                    |
|          | 17            | 0×1244 | %rcx / %x32 | √ ·          |                    |
|          | 18            | 0×1248 | %rdx / %x34 | V            |                    |
|          | 19            | 0x1249 | %rax / %x38 | <b>√</b>     |                    |
| <b>\</b> | 20            | 0x1254 | PC          | ✓            | ✓                  |
|          | 21            | 0x1260 | %rcx / %x17 |              |                    |
|          |               | •••    | •••         |              |                    |
|          | 31            | 0x129f | %rax / %x12 | ✓            |                    |
|          | 32            | 0x1230 | %rdx / %x19 |              |                    |
|          |               |        |             |              |                    |

free list

| %x19 |  |
|------|--|
| %x13 |  |
| •••  |  |
| •••  |  |

when committing a mispredicted instruction...

this is where we undo mispredicted instructions





#### better? alternatives

can take snapshots of register map on each branch don't need to reconstruct the table (but how to efficiently store them)

can reconstruct register map before we commit the branch instruction

need to let reorder buffer be accessed even more?

can track more/different information in reorder buffer

## Intel Skylake OOO design

- 2015 Intel design codename 'Skylake'
- 94-entry instruction queue-equivalent
- 168 physical integer registers
- 168 physical floating point registers
- 4 ALU functional units but some can handle more/different types of operations than others
- 2 load functional units but pipelined: supports multiple pending cache misses in parallel
- 1 store functional unit
- 224-entry reorder buffer determines how far ahead branch mispredictions, etc. can happen

#### check\_passphrase

```
int check_passphrase(const char *versus) {
    int i = 0;
    while (passphrase[i] == versus[i] &&
           passphrase[i]) {
        i += 1:
    return (passphrase[i] == versus[i]);
number of iterations = number matching characters
leaks information about passphrase, oops!
```

# exploiting check\_passphrase (1)

| guess | measured time |
|-------|---------------|
| aaaa  | $100 \pm 5$   |
| baaa  | $103 \pm 4$   |
| caaa  | $102 \pm 6$   |
| daaa  | $111 \pm 5$   |
| eaaa  | $99 \pm 6$    |
| faaa  | $101 \pm 7$   |
| gaaa  | $104 \pm 4$   |
|       | •••           |

# exploiting check\_passphrase (2)

| guess | measured time |
|-------|---------------|
| daaa  | $102 \pm 5$   |
| dbaa  | $99 \pm 4$    |
| dcaa  | $104 \pm 4$   |
| ddaa  | $100 \pm 6$   |
| deaa  | $102 \pm 4$   |
| dfaa  | $109 \pm 7$   |
| dgaa  | $103 \pm 4$   |
|       |               |

#### timing and cryptography

lots of asymmetric cryptography uses big-integer math

example: multiplying 500+ bit numbers together

how do you implement that?

#### big integer multiplcation

say we have two 64-bit integers x, y and want to 128-bit product, but our multiply instruction only does 64-bit products

one way to multiply:

divide 
$$x$$
,  $y$  into 32-bit parts:  $x = x_1 \cdot 2^{32} + x_0$  and  $y = y_1 \cdot 2^{32} + y_0$  then  $xy = x_1y_12^{64} + x_1y_0 \cdot 2^{32} + x_0y_1 \cdot 2^{32} + x_0y_0$ 

#### big integer multiplcation

say we have two 64-bit integers x, y and want to 128-bit product, but our multiply instruction only does 64-bit products

one way to multiply:

divide 
$$x$$
,  $y$  into 32-bit parts:  $x=x_1\cdot 2^{32}+x_0$  and  $y=y_1\cdot 2^{32}+y_0$  then  $xy=x_1y_12^{64}+x_1y_0\cdot 2^{32}+x_0y_1\cdot 2^{32}+x_0y_0$ 

can extend this idea to arbitrarily large numbers

number of smaller multiplies depends on size of numbers!

### big integers and cryptography

naive multiplication idea: number of steps depends on size of numbers

problem: sometimes the value of the number is a secret e.g. part of the private key

oops! revealed through timing

# big integer timing attacks in practice (1)

early versions of OpenSSL (TLS implementation)had timing attack Brumley and Boneh, "Remote Timing Attacks are Practical" (Usenix Security '03)

attacker could figure out bits of private key from timing

why? variable-time mulitplication and modulus operations got faster/slower depending on how input was related to private key

# big integer timing attacks in practice (2)



(a) The zero-one gap  $T_g - T_{g_{hi}}$  indicates that we can distinguish between bits that are 0 and 1 of the RSA factor q for 3 different randomly-generated keys. For clarity, bits of q that are 1 are omitted, as the x-axis can be used for reference for this case.

#### browsers and website leakage

web browsers run code from untrusted webpages

one goal: can't tell what other webpages you visit

## some webpage leakage (1)

```
...as you can see <a href="here">here</a>, <a href="here">here</a>, and <a href="here">here</a> ...
```

#### convenient feature 1: browser marks visited links

```
<script>
var the_color = window.getComputedStyle(
    document.querySelector('a[href=~"foo.com"]')
).color
if (color == ...) { ... }
</script>
```

convenient feature 2: scripts can query current color of something

## some webpage leakage (1)

```
...as you can see <a href="here">here</a>, <a href="here">here</a>, and <a href="here">here</a> ...
```

#### convenient feature 1: browser marks visited links

```
<script>
var the_color = window.getComputedStyle(
    document.querySelector('a[href=~"foo.com"]')
).color
if (color == ...) { ... }
</script>
```

#### convenient feature 2: scripts can query current color of something

fix 1: getComputedStyle lies about the color

fix 2: limited styling options for visited links

### some webpage leakage (2)

one idea: script in webpage times loop that writes big array

variation in timing depends on other things running on machine

### some webpage leakage (2)

one idea: script in webpage times loop that writes big array

variation in timing depends on other things running on machine



Figure 3: Example loop-counting traces collected over 15 seconds. Darker shades indicate smaller counter values and lower instruction throughput.

turns out, other webpages create distinct "signatures"

Figure from Cook et al, "There's Always a Bigger Fish: Clarifying Analysis o Machine-Learning-Assisted Side-Channel Attack" (ISCA '22)

# inferring cache accesses (1)

suppose I time accesses to array of chars: reading array[0]: 3 cycles

```
reading array[64]: 4 cycles
reading array[128]: 4 cycles
reading array[192]: 20 cycles
reading array[256]: 4 cycles
reading array[288]: 4 cycles
```

what could cause this difference? array[192] not in some cache, but others were

# inferring cache accesses (2)

```
some psuedocode:
char array[CACHE SIZE];
AccessAllOf(array);
*other address += 1;
TimeAccessingArray();
suppose during these accesses I discover that array [128] is
slower to access
probably because *other_address loaded into cache + evicted
what do we know about other_address? (select all that apply)
 A. same cache tag B. same cache index C. same cache offset
 D. diff. cache tag E. diff. cache index F. diff. cache offset
```

# some complications (1)

```
caches often use physical, not virtual addresses
(and need to know about physical address to compare index bits)
(but can infer physical addresses with measurements/asking OS)
(and often OS allocates contiguous physical addresses esp. w/'large pages')
```

storing/processing timings evicts things in the cache (but can compare timing with/without access of interest to check for this)

processor "pre-fetching" may load things into cache before access is timed

(but can arrange accesses to avoid triggering prefetcher and make sure to measure with memory barriers)

some L3 caches use a simple hash function to select index instead of index bits

# exercise: inferring cache accesses (1)

```
char *array;
array = AllocateAlignedPhysicalMemory(CACHE SIZE);
LoadIntoCache(array, CACHE_SIZE);
if (mystery) {
    *pointer = 1;
if (TimeAccessTo(&array[index]) > THRESHOLD) {
    /* pointer accessed */
suppose pointer is 0x1000188
and cache (of interest) is direct-mapped, 32768 (2^{15}) byte, 64-byte
blocks
what array index should we check?
```

## exercise: inferring cache accesses (2)

```
char *other_array = ...;
char *array;
array = AllocateAlignedPhysicalMemory(CACHE_SIZE);
LoadIntoCache(array, CACHE_SIZE);
other_array[mystery] += 1;
for (int i = 0; i < CACHE_SIZE; i += BLOCK_SIZE) {
   if (TimeAccessTo(&array[i]) > THRESHOLD) {
      /* found something interesting */
   }
}
```

other\_array at 0x200400, and interesting index is i=0x800, then what was mystery?

# exercise: inferring cache accesses (2)

```
char *array;
posix_memalign(&array, CACHE_SIZE, CACHE_SIZE);
LoadIntoCache(array, CACHE_SIZE);
if (mystery) {
    *pointer = 1;
   (TimeAccessTo(&array[index1]) > THRESHOLD ||
    TimeAccessTo(&array[index2]) > THRESHOLD) {
    /* pointer accessed */
pointer is 0 \times 1000188
cache is 2-way, 32768 (2^{15}) byte, 64-byte blocks, ???? replacement
what array indexes should we check?
```

#### PRIME+PROBE

name in literature: PRIME + PROBE

PRIME: fill cache (or part of it) with values

do thing that uses cache

PROBE: access those values again and see if it's slow

(one of several ways to measure how cache is used)

coined in attacks on AES encryption

#### example: AES (1)

from Osvik, Shamir, and Tromer, "Cache Attacks and Countermeasures: the Case of AES" (2004)

early AES implementation used lookup table

goal: detect index into lookup table index depended on key + data being encrypted

tricks they did to make this work

vary data being encrypted
subtract average time to look for what changes
lots of measurements

# example: AES (2)

from Osvik, Shamir, and Tromer, "Cache Attacks and Countermeasures: the Case of AES" (2004)



Fig. 5. Prime+Probe attack using 30,000 encryption calls on a 2GHz Athlon 64, attacking Linux 2.6.11 dm-crypt. The horizontal axis is the evicted cache set (i.e.,  $\langle y \rangle$  plus an offset due to the table's location) and the vertical axis is  $p_0$ . Left: raw timings (lighter is slower). Right: after subtraction of the average timing of the cache set. The bright diagonal reveals the high nibble of  $p_0 = 0$ x00.

#### reading a value

```
char *array;
posix_memalign(&array, CACHE_SIZE, CACHE_SIZE);
AccessAllOf(array);
other_array[mystery * BLOCK_SIZE] += 1;
for (int i = 0; i < CACHE_SIZE; i += BLOCK_SIZE) {</pre>
    if (CheckIfSlowToAccess(&array[i])) {
with 32KB direct-mapped cache
suppose we find out that array [0x400] is slow to access
and other array starts at address 0x100000
what was mystery?
```

### revisiting an earlier example (1)

```
char *array;
posix_memalign(&array, CACHE_SIZE, CACHE_SIZE);
LoadIntoCache(array, CACHE_SIZE);
if (mystery) {
    *pointer += 1;
}
if (TimeAccessTo(&array[index]) > THRESHOLD) {
    /* pointer accessed */
}
```

what if mystery is false but branch mispredicted?

#### revisiting an earlier example (2)

```
cycle # 0 1 2 3 4 5 6 7 8 9 10 11
movq mystery, %rax
                 FDRIEEEWC
test %rax, %rax
                  F D R I F W C
jz skip (mispred.)
                                IEWC
                    F D R
mov pointer, %rax
                    F D R I E E E W
mov (%rax), %r8
                      F D R
                                  I E W
add $1, %r8
                      F D R
mov %r8, %rax
                        F D R
...
skip: ...
```

## avoiding/triggering this problem

```
if (something false) {
    access *pointer;
}
```

what can we do to make access more/less likely to happen?

## reading a value without really reading it

```
char *array;
posix_memalign(&array, CACHE_SIZE, CACHE_SIZE);
AccessAllOf(array);
if (something false) {
    other array[mystery * BLOCK SIZE] += 1;
for (int i = 0; i < CACHE SIZE; i += BLOCK SIZE) {</pre>
    if (CheckIfSlowToAccess(&array[i])) {
```

if branch mispredicted, cache access may still happen can find the value of mystery

# seeing past a segfault? (1)

```
Prime();
if (something false) {
    triggerSegfault();
    Use(*pointer);
Probe();
could cache access for *pointer still happen?
yes, if:
    branch for if statement mispredicted, and
    *pointer starts before segfault detected
```

# seeing past a segfault? (2)

```
operations in virtual memory lookup:
    translate virtual to physical address
    check if access is permitted by permission bits
Intel processors: looks like these were separate steps, so...
Prime();
if (@2something false@) {
    int value = @3ReadMemoryMarkedNonReadbleInPageTable();@
    access other array[value @4* ...@];
Probe();
```

```
// %rcx = kernel address
   // %rbx = array to load from to cause eviction
   xor %rax, %rax // rax <- 0
retry:
   // rax <- memory[kernel address] (segfaults)</pre>
       // but check for segfault done out-of-order on Intel
   movb (%rcx), %al
   // rax <- memory[kernel address] * 4096 [speculated]</pre>
   shl $0xC, %rax
   iz retrv
                   // not-taken branch
   // access array[memory[kernel address] * 4096]
   mov (%rbx, %rax), %rbx
```

```
// %rcx = ke | space out accesses by 4096 | viction | viction | avoid triggering prefetcher
retry:
    // rax <- memory[kernel address] (seqfaults)</pre>
         // but check for segfault done out-of-order on Intel
    movb (%rcx), %al
    // rax <- memory[kernel address] * 4096 [speculated]</pre>
    shl $0xC, %rax
    iz retrv
                       // not-taken branch
    // access array[memory[kernel address] * 4096]
    mov (%rbx, %rax), %rbx
```

```
// %rcx repeat access if zero
apparently value of zero speculatively read
when real value not yet available
        when real value not yet available
retry:
    // rax <- memory[kernel address] (seafaults)
         // but check for segfault done out-of-order on Intel
    movb (%rcx), %al
    // rax <- memory[kernel address] * 4096 [speculated]</pre>
    shl $0xC, %rax
    jz retry
                       // not-taken branch
    // access array[memory[kernel address] * 4096]
    mov (%rbx, %rax), %rbx
```

```
// %rcx access cache to allow measurement later
// %rbx in paper not with FLUSH+RELOAD instead of PRIME+PROBE technique
     // rax <- memory[kernel address] (segfaults)
          // but check for segfault done out-of-order on Intel
     movb (%rcx), %al
     // rax <- memory[kernel address] * 4096 [speculated]</pre>
     shl $0xC, %rax
     iz retrv
                        // not-taken branch
     // access array[memory[kernel address] * 4096]
     mov (%rbx, %rax), %rbx
```

from Lipp et al, "Meltdown: Reading Kernel Memory from User Space"

segfault actually happens eventually

```
option 1: okay, just start a new process every time
option 2: way of suppressing exception (transactional memory support)
   // rax <- memory[kernel address] (segfaults)</pre>
       // but check for segfault done out-of-order on Intel
   movb (%rcx), %al
   // rax <- memory[kernel address] * 4096 [speculated]</pre>
   shl $0xC, %rax
   jz retry
                       // not-taken branch
   // access array[memory[kernel address] * 4096]
   mov (%rbx, %rax), %rbx
```

#### Meltdown fix

HW: permissions check done with/before physical address lookup was already done by AMD, ARM apparently? now done by Intel

SW: separate page tables for kernel and user space don't have sensitive kernel memory pointed to by page table when user-mode code running unfortunate performance problem exceptions start with code that switches page tables

## reading a value without really reading it

```
char *array;
posix_memalign(&array, CACHE_SIZE, CACHE_SIZE);
AccessAllOf(array);
if (something false) {
    other_array[mystery * BLOCK_SIZE] += 1;
}
for (int i = 0; i < CACHE_SIZE; i += BLOCK_SIZE) {
    if (CheckIfSlowToAccess(&array[i])) {
        ...
    }
}</pre>
```

if branch mispredicted, cache access may still happen can find the value of mystery

## mistraining branch predictor?

```
if (something) {
     CodeToRunSpeculatively()
how can we have 'something' be false, but predicted as true
run lots of times with something true
then do actually run with something false
```

# contrived(?) vulnerable code (1)

```
suppose this C code is run with extra privileges
    (e.g. in system call handler, library called from JavaScript in webpage,
    etc.)
assume x chosen by attacker
(example from original Spectre paper)
if (x < array1_size)</pre>
           y = array2[array1[x] * 4096];
```

# the out-of-bounds access (1)

```
char array1[...];
int secret;
y = array2[array1[x] * 4096];
suppose array1 is at 0x1000000 and
secret is at 0x103F0003:
what x do we choose to make array1[x] access first byte of
secret?
```

# the out-of-bounds access (2)

```
char array1[...];
int secret;
y = array2[array1[x] * 4096];
suppose our cache has 64-byte blocks and 8192 sets
and array2[0] is stored in cache set 0
if the above evicts something in cache set 128,
then what do we know about array1[x]?
```

# the out-of-bounds access (2)

```
char array1[...];
int secret;
y = array2[array1[x] * 4096];
suppose our cache has 64-byte blocks and 8192 sets
and array2[0] is stored in cache set 0
if the above evicts something in cache set 128,
then what do we know about array1[x]?
    is 2 or 254
```

# exploit with contrived(?) code

```
/* in kernel: */
int systemCallHandler(int x) {
    if (x < array1_size)</pre>
        v = array2[array1[x] * 4096];
    return y;
/* exploiting code */
    /* step 1: mistrain branch predictor */
for (a lot) {
    systemCallHandler(0 /* less than array1_size */);
    /* step 2: evict from cache using misprediction */
Prime():
systemCallHandler(targetAddress - array1Address);
int evictedSet = ProbeAndFindEviction();
int targetValue = (evictedSet - array2StartSet) / setsPer4K;
```

## really contrived?

```
char *array1; char *array2;
if (x < array1_size)
    y = array2[array1[x] * 4096];

times 4096 shifts so we can get lower bits of target value
    so all bits effect what cache block is used</pre>
```

## really contrived?

```
char *array1; char *array2;
if (x < array1_size)</pre>
    y = array2[array1[x] * 4096];
times 4096 shifts so we can get lower bits of target value
     so all bits effect what cache block is used
int *array1; int *array2;
if (x < array1 size)</pre>
    v = array2[array1[x]];
will still get upper bits of array1[x] (can tell from cache set)
can still read arbitrary memory!
     want memory at 0x10000?
    upper bits of 4-byte integer at 0x3FFFE
```

#### bounds check in kernel

```
void SomeSystemCallHandler(int index) {
    if (index > some_table_size)
        return ERROR;
    int x = table[some_table];
    switch (other_table[x].foo) {
        ...
    }
}
```

### context: Java script

```
JavaScript: scripts in webpages
for performance, compiled to assembly, run in browser
not supposed to be access arbitrary browser memory
example JavaScript code from paper:
if (index < simpleByteArray.length) {</pre>
    index = simpleByteArray[index | 0];
    index = (((index * 4096)|0) & (32*1024*1024-1))|0;
    localJunk ^= probeTable[index|0]|0;
}
web page runs a lot to train branch predictor
then does run with out-of-bounds index
examines what's evicted by probeTable access
```

### other misprediction

```
so far: talking about mispredicting direction of branch
what about mispredicting target of branch in, e.g.:
// possibly from C code like:
// (*function pointer)();
imp *%rax
// possibly from C code like:
// switch(rcx) { ... }
imp *(%rax,%rcx,8)
```

## an idea for predicting indirect jumps

for jmps like jmp \*%rax predict target with cache:

|                                         | 0                                |
|-----------------------------------------|----------------------------------|
| bottom 12 bits of jmp address           | last seen target                 |
| 0x0-0x7                                 | 0x200000                         |
| 0x8-0xF                                 | 0×440004                         |
| 0×10-0×18                               | 0×4CD894                         |
| 0×18-0×20                               | 0×510194                         |
| 0×20-0×28                               | 0x4FF194                         |
|                                         |                                  |
| 0xFF8-0xFFF                             | 0x3F8403                         |
| 0×10-0×18<br>0×18-0×20<br>0×20-0×28<br> | 0x4CD894<br>0x510194<br>0x4FF194 |

Intel Haswell CPU did something similar to this uses bits of last several jumps, not just last one

can mistrain this branch predictor

## using mispredicted jump

- 1: find some kernel function with jmp \*%rax
- 2: mistrain branch target predictor for it to jump to chosen code use code at address that conflicts in "recent jumps cache"
- 3: have chosen code be attack code (e.g. array access) either write special code OR find suitable instructions (e.g. array access) in existing kernel code

### **Spectre variants**

```
showed Spectre variant 1 (array bounds), 2 (indirect jump) from original paper
```

#### other possible variations:

could cause other things to be mispredicted

prediction of where functions return to?

values instead of which code is executed?

could use side-channel other than data cache changes

instruction cache

cache of pending stores not yet committed

contention for resources on multi-threaded CPU core

branch prediction changes

---

# some Linux kernel mitigations (1)

```
replace array [x] with
array[x & ComputeMask(x, size)]
...where ComputeMask() returns
    0 if x > size
    0xFFFF...F if x < size
...and ComputeMask() does not use jumps:
mov x, %r8
mov size, %r9
cmp %r9, %r8
sbb %rax, %rax // sbb = subtract with borrow
    // either 0 or -1
```

# some Linux kernel mitigations (2)

for indirect branches:

#### with hardware help:

separate indirect (computed) branch prediction for kernel v user mode other branch predictor isolation changes

#### without hardware help:

transform jmp \*(%rax), etc. into code that will only predicted to jump to safe locations (by writing assembly very carefully)

## only safe prediction

```
as replacement for jmp *(%rax)
code from Intel's "Retpoline: A Branch Target Injection
Mitigation"
        call load label
                              /* <-- want prediction to go here
    capture_ret_spec:
        pause
        lfence
        imp capture ret spec
    load label:
        mov %rax, (%rsp)
        ret
```

# backup slides

# backup slides

# backup slides



 $\operatorname{arch} \to \operatorname{phys} \operatorname{reg}$  for new instrs

| arch. | ch. phys. |  |
|-------|-----------|--|
| reg   | reg       |  |
| %rax  | %x12      |  |
| %rcx  | %x17      |  |
| %rbx  | %x13      |  |
| %rdx  | %x07      |  |
| •••   | •••       |  |

#### free list

| %x19 |  |
|------|--|
| %x23 |  |
| •••  |  |
| •••  |  |

 $\operatorname{arch} \to \operatorname{phys} \operatorname{reg}$  for new instrs

| arch. | phys. |
|-------|-------|
| reg   | reg   |
| %rax  | %x12  |
| %rcx  | %x17  |
| %rbx  | %x13  |
| %rdx  | %x07  |
| •••   | •••   |

#### free list

| %x19 |  |
|------|--|
| %x23 |  |
| •••  |  |
| •••  |  |

#### reorder buffer (ROB)

| instr<br>num. | PC     | dest. | reg    | done? | mispred? / except? |
|---------------|--------|-------|--------|-------|--------------------|
| 14            | 0x1233 | %rbx  | / %x23 |       |                    |
| 15            | 0x1239 | %rax  | / %x30 |       |                    |
| 16            | 0x1242 | %rcx  | / %x31 |       |                    |
| 17            | 0x1244 | %rcx  | / %x32 |       |                    |
| 18            | 0x1248 | %rdx  | / %x34 |       |                    |
| 19            | 0x1249 | %rax  | / %x38 |       |                    |
| 20            | 0x1254 | PC    |        |       |                    |
| 21            | 0x1260 | %rcx  | / %x17 |       |                    |
|               | ••     |       |        |       |                    |
| 31            | 0x129f | %rax  | / %x12 |       |                    |
|               |        |       |        |       |                    |
|               |        |       |        |       |                    |

reorder buffer contains instructions started, but not fully finished new entries created on rename (not enough space? stall rename stage)

 $arch \rightarrow phys reg$  for new instrs

| arch. | phys. |  |
|-------|-------|--|
| reg   | reg   |  |
| %rax  | %x12  |  |
| %rcx  | %x17  |  |
| %rbx  | %x13  |  |
| %rdx  | %x07  |  |
| •••   | •••   |  |

free list

%x19 %x23 ... reorder buffer (ROB)

|           |               |        | · ·         |       | •                   |
|-----------|---------------|--------|-------------|-------|---------------------|
| remove    | instr<br>num. | PC     | dest. reg   | done? | mispred?<br>except? |
| here →    | 14            | 0x1233 | %rbx / %x23 |       |                     |
| on commit | 15            | 0x1239 | %rax / %x30 |       |                     |
|           | 16            | 0x1242 | %rcx / %x31 |       |                     |
|           | 17            | 0x1244 | %rcx / %x32 |       |                     |
|           | 18            | 0x1248 | %rdx / %x34 |       |                     |
|           | 19            | 0x1249 | %rax / %x38 |       |                     |
|           | 20            | 0x1254 | PC          |       |                     |
|           | 21            | 0x1260 | %rcx / %x17 |       |                     |
|           |               |        | •••         |       |                     |
| add here  | 31            | 0x129f | %rax / %x12 |       |                     |
| →         |               |        |             |       |                     |
| on rename |               |        |             |       |                     |
|           | - £           |        |             |       |                     |

place newly started instruction at end of buffer remember at least its destination register (both architectural and physical versions)

 $arch \rightarrow phys reg$  for new instrs

| arch. | phys.     |
|-------|-----------|
| reg   | reg       |
| %rax  | %x12      |
| %rcx  | %x17      |
| %rbx  | %x13      |
| %rdx  | %x07 %x19 |
|       |           |

#### free list

| %x19 |  |
|------|--|
| %x23 |  |
| •••  |  |
| •••  |  |

reorder buffer (ROB)

| remove    | instr<br>num. | PC     | dest. reg   | done? | mispred?<br>except? |
|-----------|---------------|--------|-------------|-------|---------------------|
| here →    | 14            | 0x1233 | %rbx / %x23 |       |                     |
| on commit | 15            | 0x1239 | %rax / %x30 |       |                     |
|           | 16            | 0x1242 | %rcx / %x31 |       |                     |
|           | 17            | 0x1244 | %rcx / %x32 |       |                     |
|           | 18            | 0x1248 | %rdx / %x34 |       |                     |
|           | 19            | 0x1249 | %rax / %x38 |       |                     |
|           | 20            | 0x1254 | PC          |       |                     |
|           | 21            | 0x1260 | %rcx / %x17 |       |                     |
|           |               |        | ••          |       |                     |
| add here  | 31            | 0x129f | %rax / %x12 |       |                     |
| -         | 32            | 0x1230 | %rdx / %x19 |       |                     |
| on rename |               |        |             |       |                     |

next renamed instruction goes in next slot, etc.

### reorder buffer: on rename

 $\operatorname{arch} \to \operatorname{phys} \operatorname{reg}$  for new instrs

| arch. | phys.                |  |
|-------|----------------------|--|
| reg   | reg                  |  |
| %rax  | %x12                 |  |
| %rcx  | %x17                 |  |
| %rbx  | %x13                 |  |
| %rdx  | <del>%x07</del> %x19 |  |
| •••   | •••                  |  |

#### free list

| %x19 |  |
|------|--|
| %x23 |  |
| •••  |  |
| •••  |  |

#### reorder buffer (ROB)



 $\operatorname{arch} \to \operatorname{phys.} \operatorname{reg}$  for new instrs

| arch. | phys.                |
|-------|----------------------|
| reg   | reg                  |
| %rax  | %x12                 |
| %rcx  | %x17                 |
| %rbx  | %x13                 |
| %rdx  | <del>%x07</del> %x19 |
| •••   |                      |

#### free list

| %x19 |  |
|------|--|
| %x13 |  |
| •••  |  |
| •••  |  |

#### reorder buffer (ROB)

remove here → on commit

| instr<br>num. | PC     | dest. | reg    | done? | mispred?<br>except? |
|---------------|--------|-------|--------|-------|---------------------|
| 14            | 0x1233 | %rbx  | / %x24 |       |                     |
| 15            | 0x1239 | %rax  | / %x30 |       |                     |
| 16            | 0x1242 | %rcx  | / %x31 |       |                     |
| 17            | 0x1244 | %rcx  | / %x32 |       |                     |
| 18            | 0x1248 | %rdx  | / %x34 |       |                     |
| 19            | 0x1249 | %rax  | / %x38 |       |                     |
| 20            | 0x1254 | PC    |        |       |                     |
| 21            | 0x1260 | %rcx  | / %x17 |       |                     |
|               |        | •••   |        |       |                     |
| 31            | 0x129f | %rax  | / %x12 |       |                     |
|               |        |       |        |       |                     |
|               |        |       |        |       |                     |

 $arch \rightarrow phys. reg$  for new instrs

| arch. | phys.                |
|-------|----------------------|
| reg   | reg                  |
| %rax  | %x12                 |
| %rcx  | %x17                 |
| %rbx  | %x13                 |
| %rdx  | <del>%x07</del> %x19 |
| •••   | •••                  |

#### free list

| %x19 |  |
|------|--|
| %x13 |  |
| •••  |  |
| •••  |  |

### reorder buffer (ROB)

|           |            |              |        |       |        | •                                                | ,                   |
|-----------|------------|--------------|--------|-------|--------|--------------------------------------------------|---------------------|
| remove    |            | nstr<br>num. | PC     | dest. | reg    | done?                                            | mispred?<br>except? |
| here      | <b>→</b> [ | 14           | 0x1233 | %rbx  | / %x24 |                                                  |                     |
| on commit | [          | 15           | 0x1239 | %rax  | / %x30 |                                                  |                     |
|           | [          | 16           | 0x1242 | %rcx  | / %x31 | ✓                                                |                     |
|           | [          | 17           | 0x1244 | %rcx  | / %x32 |                                                  |                     |
|           | [          | 18           | 0x1248 | %rdx  | / %x34 | ✓                                                |                     |
|           | [          | 19           | 0x1249 | %rax  | / %x38 | ✓                                                |                     |
|           | 2          | 20           | 0x1254 | PC    |        |                                                  |                     |
|           | 2          | 21           | 0x1260 | %rcx  | / %x17 |                                                  |                     |
|           | -          |              |        |       |        |                                                  |                     |
|           | 3          | 31           | 0x129f | %rax  | / %x12 |                                                  | ✓                   |
|           |            |              |        |       |        |                                                  |                     |
|           |            |              |        |       |        | <del>                                     </del> |                     |

instructions marked done in reorder buffer when computed but not removed ('committed') yet

arch  $\rightarrow$  phys. reg reorder buffer (ROB) for new instrs mispred? / arch. phys. instr done? except? PC dest. reg remove num. reg reg here  $\longrightarrow$  14 0x1233 %rbx / %x24  $\operatorname{arch} \to \operatorname{phys} \operatorname{reg}$ %rax %x12 15 0x1239 %rax / %x30 on commit %rcx %x17 for committed 16 0x1242 %rcx / %x31 %rbx %x13 17 0x1244 %rcx / %x32 arch. phys. <del>%x07</del> %x19 %rdx 18 0x1248 %rdx / %x34 reg reg ••• 19 0x1249 %rax / %x38 %x30 %rax 20 0x1254 PC %rcx %x28 free list 21 0x1260 %rcx / %x17 %x23 %rbx %x 19 %rdx %x21 31 0x129f%rax / %x12 %x13 commit stage tracks architectural to physical register map for committed instructions

arch  $\rightarrow$  phys. reg reorder buffer (ROB) for new instrs mispred? / arch. phys. instr done? except? PC dest. reg remove num. reg reg here  $\longrightarrow$  14 0x1233 %rbx / %x24  $\operatorname{arch} \to \operatorname{phys} \operatorname{reg}$ %rax %x12 15 0x1239 %rax / %x30 on commit %rcx %x17 for committed 16 0x1242 %rcx / %x31 %rbx %x13 17 0x1244 %rcx / %x32 arch. phys. <del>%x07</del> %x19 %rdx 18 0x1248 %rdx / %x34 reg reg ••• 19 0x1249 %rax / %x38 %x30 %rax 20 0x1254 PC %rcx %x28 free list 21 0x1260 %rcx / %x17 %x23 %x24 %rbx %x 19 %rdx %x21 31 0x129f%rax / %x12 %x13 32 0x1230 %rdx / %x19 when next-to-commit instruction is done %x23 update this register map and free register list and remove instr. from reorder buffer

arch  $\rightarrow$  phys. reg reorder buffer (ROB) for new instrs arch. phys. instr done? except? mispred? / PC dest. reg num. reg reg  $\begin{array}{c} {\sf arch} \to {\sf phys} \ {\sf reg} \ \ {\sf remove} \ {\sf here} \\ {\sf for} \ {\sf committed} \end{array}$ %rax %x12 15 0x1239 %rax / %x30 %rcx %x17 16 0x1242 %rcx / %x31 %rbx %x13 17 0x1244 %rcx / %x32 arch. phys. <del>%x07</del> %x19 %rdx 18 0x1248 %rdx / %x34 reg reg ••• 19 0x1249 %rax / %x38 %x30 %rax 20 0x1254 PC %rcx %x28 free list 21 0x1260 %rcx / %x17 <del>%x23</del> %x24 %rbx %x 19 %rdx %x21 0x129f%rax / %x12 31 %x13 32 0x1230\%rdx / \%x19 when next-to-commit instruction is done %x23 update this register map and free register list and remove instr. from reorder buffer

 $\operatorname{arch} \to \operatorname{phys} \operatorname{reg}$  for new instrs

| arch. | phys. |
|-------|-------|
| reg   | reg   |
| %rax  | %x12  |
| %rcx  | %x17  |
| %rbx  | %x13  |
| %rdx  | %x19  |
| •••   | •••   |

#### free list

| <del>%x19</del> |  |
|-----------------|--|
| %x13            |  |
| •••             |  |
| •••             |  |

 $\begin{array}{c} {\sf arch} \to {\sf phys} \ {\sf reg} \\ {\sf for} \ {\sf committed} \end{array}$ 

| arch. | phys.                |
|-------|----------------------|
| reg   | reg                  |
| %rax  | <del>%x30</del> %x38 |
| %rcx  | <del>%x31</del> %x32 |
| %rbx  | <del>%x23</del> %x24 |
| %rdx  | <del>%x21</del> %x34 |
| •••   | •••                  |

#### reorder buffer (ROB)

| instr<br>num. | PC     | dest. reg   | done?    | mispred? /<br>except? |
|---------------|--------|-------------|----------|-----------------------|
| 14            | 0x1233 | %rbx / %x24 | V        |                       |
| 15            | 0x1239 | %rax / %x30 | V        |                       |
| 16            | 0×1242 | %rex / %x31 | V        |                       |
| 17            | 0×1244 | %rcx / %x32 | · ·      |                       |
| 18            | 0×1248 | %rdx / %x34 | · ·      |                       |
| 19            | 0x1249 | %rax / %x38 | <b>√</b> |                       |
| 20            | 0x1254 | PC          | <b>√</b> | <b>√</b>              |
| 21            | 0x1260 | %rcx / %x17 |          |                       |
|               | •••    | •••         |          |                       |
| 31            | 0x129f | %rax / %x12 | <b>√</b> |                       |
| 32            | 0x1230 | %rdx / %x19 |          |                       |
|               |        |             |          |                       |

 $arch \rightarrow phys reg$  for new instrs

| arch. | phys. |
|-------|-------|
| reg   | reg   |
| %rax  | %x12  |
| %rcx  | %x17  |
| %rbx  | %x13  |
| %rdx  | %x19  |
| •••   | •••   |

 $\operatorname{arch} \to \operatorname{phys} \operatorname{reg}$  for committed

| arch. | phys.                |  |
|-------|----------------------|--|
| reg   | reg                  |  |
| %rax  | <del>%x30</del> %x38 |  |
| %rcx  | <del>%x31</del> %x32 |  |
| %rbx  | <del>%x23</del> %x24 |  |
| %rdx  | <del>%x21</del> %x34 |  |
|       | •••                  |  |

reorder buffer (ROB)

|          |               |        |             | -        | ,                     |
|----------|---------------|--------|-------------|----------|-----------------------|
|          | instr<br>num. | PC     | dest. reg   | done?    | mispred? /<br>except? |
|          | 14            | 0x1233 | %rbx / %x24 | <b>√</b> |                       |
|          | 15            | 0×1239 | %rax / %x30 | V        |                       |
|          | 16            | 0x1242 | %rex / %x31 | <b>√</b> |                       |
|          | 17            | 0×1244 | %rex / %x32 | ·        |                       |
|          |               | 0×1248 | %rdx / %x34 | V        |                       |
|          | 19            | 0×1249 | %rax / %x38 | <b>√</b> |                       |
| <b>-</b> | 20            | 0x1254 | PC          | <b>√</b> | <b>√</b>              |
|          | 21            | 0x1260 | %rcx / %x17 |          |                       |
|          |               | •••    |             |          |                       |
|          | 31            | 0x129f | %rax / %x12 | ✓        |                       |
|          | 32            | 0x1230 | %rdx / %x19 |          |                       |

free list

| <del>%x19</del> |  |
|-----------------|--|
| %x13            |  |
| •••             |  |
| •••             |  |

when committing a mispredicted instruction...

this is where we undo mispredicted instructions





#### better? alternatives

can take snapshots of register map on each branch don't need to reconstruct the table (but how to efficiently store them)

can reconstruct register map before we commit the branch instruction

need to let reorder buffer be accessed even more?

can track more/different information in reorder buffer





free regs for new instrs

| X19<br>X23 | arch. | phys. |
|------------|-------|-------|
|            | reg   | reg   |
|            | RAX   | X15   |
|            | RCX   | X17   |
|            | RBX   | X13   |
|            | RBX   | X07   |
|            | •••   |       |







free regs for new instrs for complete instrs

| X19 |
|-----|
| X23 |
|     |

| arch. | phys. |
|-------|-------|
| reg   | reg   |
| RAX   | X15   |
| RCX   | X17   |
| RBX   | X13   |
| RBX   | X07   |
| •••   | •••   |

| arch. | phys.             |
|-------|-------------------|
| reg   | reg               |
| RAX   | X21               |
| RCX   | <del>X2</del> X32 |
| RBX   | X48               |
| RDX   | X37               |
|       | •••               |

| instr<br>num. | PC     | dest. reg | done? | except? |
|---------------|--------|-----------|-------|---------|
|               |        |           |       |         |
| 17            | 0x1244 | RCX / X32 | V     |         |
| 18            | 0x1248 | RDX / X34 |       |         |
| 19            | 0x1249 | RAX / X38 | ✓     |         |
| 20            | 0x1254 | R8 / X05  |       |         |
| 21            | 0x1260 | R8 / X06  |       |         |
|               |        |           |       |         |



free regs for new instrs for o

X19 X23 ...

| arch. | phys. |
|-------|-------|
| reg   | reg   |
| RAX   | X15   |
| RCX   | X17   |
| RBX   | X13   |
| RBX   | X07   |
|       | •••   |

for complete instrs

| arch. | phys.             |
|-------|-------------------|
| reg   | reg               |
| RAX   | X21               |
| RCX   | <del>X2</del> X32 |
| RBX   | X48               |
| RDX   | X37               |
|       |                   |

| instr<br>num. | PC                               | dest. reg                                                 | done?                                                                                                               | except?                                                                                                                                                                                                                     |
|---------------|----------------------------------|-----------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|               |                                  | <b></b>                                                   |                                                                                                                     |                                                                                                                                                                                                                             |
| 17            | <del>0x1244</del>                | RCX / X32                                                 | <b>√</b>                                                                                                            |                                                                                                                                                                                                                             |
| 18            | 0x1248                           | RDX / X34                                                 |                                                                                                                     |                                                                                                                                                                                                                             |
| 19            | 0x1249                           | RAX / X38                                                 | ✓                                                                                                                   |                                                                                                                                                                                                                             |
| 20            | 0x1254                           | R8 / X05                                                  | ✓                                                                                                                   | <b>√</b>                                                                                                                                                                                                                    |
| 21            | 0x1260                           | R8 / X06                                                  |                                                                                                                     |                                                                                                                                                                                                                             |
|               | •••                              |                                                           |                                                                                                                     |                                                                                                                                                                                                                             |
|               | num.<br><br>17<br>18<br>19<br>20 | num. PC 17 0×1244 18 0×1248 19 0×1249 20 0×1254 21 0×1260 | num. PC dest. reg 17 0×1244 RCX / X32 18 0×1248 RDX / X34 19 0×1249 RAX / X38 20 0×1254 R8 / X05 21 0×1260 R8 / X06 | num.     PC     dest. reg     done?             17     0x1244 RCX / X32     √       18     0x1248 RDX / X34     √       19     0x1249 RAX / X38     √       20     0x1254 R8 / X05     √       21     0x1260 R8 / X06     _ |



free regs for new instrs for complete instrs

| X19 |
|-----|
| X23 |
|     |

| arch. | phys. |  |
|-------|-------|--|
| reg   | reg   |  |
| RAX   | X15   |  |
| RCX   | X17   |  |
| RBX   | X13   |  |
| RBX   | X07   |  |
| •••   | •••   |  |

| arch. | phys.             |  |
|-------|-------------------|--|
| reg   | reg               |  |
| RAX   | X21 X38           |  |
| RCX   | <del>X2</del> X32 |  |
| RBX   | X48               |  |
| RDX   | X37 X34           |  |
|       |                   |  |

|   | instr<br>num. | PC                | dest. reg | done?          | except? |
|---|---------------|-------------------|-----------|----------------|---------|
| V |               |                   |           |                |         |
|   | 17            | <del>0x1244</del> | RCX / X32 | <b>√</b>       |         |
|   | 18            | 0x1248            | RDX-/X34  | <b>√</b> ····· |         |
|   | 19            | 0x·1249           | RAX-/X38  | <b>√</b> ····· |         |
|   | 20            | 0x1254            | R8 / X05  | √              | √       |
|   | 21            | 0x1260            | R8 / X06  |                |         |
|   |               | •••               |           |                |         |









free regs for new instrs

| X19 |
|-----|
| X23 |
|     |

| arch. | phys. |  |
|-------|-------|--|
| reg   | reg   |  |
| RAX   | X15   |  |
| RCX   | X17   |  |
| RBX   | X13   |  |
| RBX   | X07   |  |
| •••   |       |  |

for complete instrs

| arch. | phys.              |  |
|-------|--------------------|--|
| reg   | reg                |  |
| RAX   | <del>X21</del> X38 |  |
| RCX   | <del>X2</del> X32  |  |
| RBX   | X48                |  |
| RDX   | <del>X37</del> X34 |  |
|       |                    |  |

|   | instr<br>num. | PC     | dest. reg | done?    | except?  |
|---|---------------|--------|-----------|----------|----------|
| ¥ |               |        |           |          |          |
|   | 17            | 0x1244 | RCX / X32 | √        |          |
|   | 18            | 0x1248 | RDX / X34 | <u>√</u> |          |
|   | 19            |        | RAX / X38 | √        |          |
|   | 20            | 0x1254 | R8 / X05  | <b>√</b> | <b>√</b> |
|   | 21            | 0x1260 | R8 / X06  |          |          |
|   |               | •••    |           |          |          |

### handling memory accesses?

one idea:

list of done + uncommitted loads+stores

execute load early + double-check on commit have data cache watch for changes to addresses on list if changed, treat like branch misprediction

loads check list of stores so you read back own values actually finish store on commit maybe treat like branch misprediction if conflict?

### the open-source BROOM pipeline











#### better data-flow



#### better data-flow



#### better data-flow

