### static branch prediction

```
forward (target > PC) not taken; backward taken
intuition: loops:
LOOP: ...
      je LOOP

LOOP: ...
      jne SKIP_LOOP
      jmp LOOP
SKIP_LOOP:
```

### exercise: static prediction

```
.global foo
foo:
   xor %eax, %eax // eax <- 0
foo_loop_top:
   test $0x1, %edi
   je foo_loop_bottom // if (edi & 1 == 0) goto foo_loop_bottom
   add %edi, %eax
foo_loop_bottom:
   dec %edi // edi = edi - 1
   jg foo_loop_top // if (edi > 0) goto foo_loop_top
   ret
suppose %edi = 3 (initially)
and using forward-not-taken, backwards-taken strategy:
how many mispredictions for je? for jg?
```
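One way to check an answer to this exercise is to replay the loop in software. The sketch below is only an illustration: it assumes %edi is decremented once per iteration just before the `jg`, and the line numbers in `BRANCHES` are hypothetical stand-ins for instruction addresses.

```python
def predict_taken(branch_line, target_line):
    """Static rule: backward branches (target before the branch) are predicted taken."""
    return target_line < branch_line

# Hypothetical line numbers standing in for addresses: (branch, target).
BRANCHES = {"je": (3, 5), "jg": (6, 2)}

def count_mispredictions(edi):
    """Replay the foo loop and count static mispredictions per branch."""
    wrong = {"je": 0, "jg": 0}
    eax = 0
    while True:
        je_taken = (edi & 1) == 0          # je skips the add when edi is even
        if je_taken != predict_taken(*BRANCHES["je"]):
            wrong["je"] += 1
        if not je_taken:
            eax += edi
        edi -= 1                           # assumed decrement before jg
        jg_taken = edi > 0                 # jg loops while edi > 0
        if jg_taken != predict_taken(*BRANCHES["jg"]):
            wrong["jg"] += 1
        if not jg_taken:
            return wrong, eax
```

Calling `count_mispredictions(3)` replays the three iterations of the exercise and reports the per-branch counts along with the final value of `eax`.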













[figure: table of predictions indexed by hashed PC; the selected entry is the prediction sent to the fetch stage]







#### collisions?

two branches could have the same hashed PC; nothing in the table tells us about this

versus direct-mapped cache: had *tag bits* to tell

is it worth it?

adding tag bits makes the table *much* larger and/or slower

but does anything go wrong when there's a collision?

#### collision results

- possibility 1: both branches usually taken; no actual conflict, prediction is better(!)
- possibility 2: both branches usually not taken; no actual conflict, prediction is better(!)
- possibility 3: one branch taken, one not taken; performance probably worse
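The three possibilities can be tried out with a small simulation. This is a minimal sketch, assuming a tagless table of 1-bit entries and using `pc % table_size` as a stand-in for the hash; the branch addresses are made up for illustration.

```python
def mispredictions(trace, table_size):
    """Count mispredictions of a tagless table of 1-bit (last-outcome) entries."""
    table = [False] * table_size          # every entry starts as "not taken"
    wrong = 0
    for pc, taken in trace:
        slot = pc % table_size            # stand-in hash; no tag is stored or checked
        if table[slot] != taken:
            wrong += 1
        table[slot] = taken               # remember only the most recent outcome
    return wrong

# Two branches whose addresses collide in a 16-entry table but not in a 64-entry one.
opposite = [(0x10, True), (0x20, False)] * 10   # possibility 3
both_taken = [(0x10, True), (0x20, True)] * 10  # possibility 1

opposite_shared = mispredictions(opposite, 16)
opposite_separate = mispredictions(opposite, 64)
both_taken_shared = mispredictions(both_taken, 16)
both_taken_separate = mispredictions(both_taken, 64)
```

In this trace the shared entry is harmful only when the two branches disagree; when they agree, the second branch benefits from the first one having already trained the entry.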

# 1-bit predictor for loops

predicts first and last iteration wrong

example: branch to beginning — but same for branch from beginning to end

everything else correct

#### exercise

```
use 1-bit predictor on this loop
    executed in outer loop (not shown) many, many times
what is the conditional branch misprediction rate?
int i = 0;
while (true) {
  if (i % 3 == 0) goto next;
next:
  i += 1;
  if (i == 50) break;
}
```
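A short simulation can be used to check an answer. The sketch below makes several assumptions that the exercise does not state: each conditional branch has its own predictor entry (no collisions), the `while (true)` back-edge is an unconditional jump that is not counted, and the rate is measured in steady state after one warm-up pass. The branch names `mod3` and `exit` are invented labels.

```python
def loop_trace():
    """Outcomes of the two conditional branches for one full run of the loop."""
    trace = []
    i = 0
    while True:
        trace.append(("mod3", i % 3 == 0))   # if (i % 3 == 0) goto next;
        i += 1
        done = (i == 50)
        trace.append(("exit", done))         # if (i == 50) break;
        if done:
            return trace

def count_mispredictions(trace, last_outcome):
    """1-bit predictor: predict whatever the branch did last time."""
    wrong = 0
    for branch, taken in trace:
        if last_outcome.get(branch, False) != taken:
            wrong += 1
        last_outcome[branch] = taken
    return wrong

state = {}
trace = loop_trace()
count_mispredictions(trace, state)               # warm-up pass (outer loop, earlier run)
steady_wrong = count_mispredictions(trace, state)
steady_rate = steady_wrong / len(trace)
```

Under these assumptions the steady-state pass has 100 conditional branch executions, and `steady_rate` is the misprediction rate the exercise asks about.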

# beyond local 1-bit predictor

can predict using more historical info

- whether taken the last several times; example: taken 3 out of the last 4 times → predict taken
- recent pattern of outcomes; example: if the last few are T, N, T, N, T, N, the next is probably T; also makes two branches hashing to the same entry not so bad
- outcomes of the last N conditional jumps ("global history"): takes into account conditional jumps in surrounding code; example: loops with if statements will have regular patterns
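To make the pattern idea concrete, here is a minimal sketch of a predictor that remembers which outcome last followed each recent-history pattern. It is an illustration only, not the design of any particular processor; the class name and the use of a Python dict in place of a fixed-size hardware table are assumptions.

```python
class HistoryPredictor:
    """Predict the outcome that last followed the current history pattern."""

    def __init__(self, history_bits=2):
        assert history_bits >= 1
        self.history = (False,) * history_bits   # most recent outcomes, oldest first
        self.table = {}                          # history pattern -> last outcome seen

    def predict(self):
        return self.table.get(self.history, False)

    def update(self, taken):
        self.table[self.history] = taken
        self.history = self.history[1:] + (taken,)

def mispredictions(predictor, outcomes):
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong

pattern = [True, False, False] * 20              # T, N, N repeating
predictor = HistoryPredictor(history_bits=2)
warmup_wrong = mispredictions(predictor, pattern)
steady_wrong = mispredictions(predictor, pattern)
```

With two bits of history, every pattern in the repeating T, N, N sequence is followed by a unique outcome, so once the table is trained the predictor stops mispredicting on this trace.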

# predicting ret: ministack of return addresses

predicting ret: ministack in processor registers

push on ministack on call; pop on ret

ministack overflows? discard oldest, mispredict it later

| stack in memory     |
|---------------------|
| baz saved registers |
| baz return address  |
| bar saved registers |
| bar return address  |
| foo local variables |
| foo saved registers |
| foo return address  |
| foo saved registers |

| (partial?) stack in CPU registers |
|-----------------------------------|
| baz return address                |
| bar return address                |
| foo return address                |

# 4-entry return address stack

4-entry return address stack in CPU



return address from call

on call: increment index, save return address in that slot

on ret: read prediction from index, decrement index
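The call/ret rules above can be sketched as a small circular buffer. This is an illustrative model, assuming the index simply wraps around so that an overflow overwrites the oldest entry; the class and method names are invented for the example.

```python
class ReturnAddressStack:
    """Fixed-size circular stack of predicted return addresses."""

    def __init__(self, size=4):
        self.slots = [None] * size
        self.index = 0

    def on_call(self, return_address):
        # increment index (wrapping), save return address in that slot
        self.index = (self.index + 1) % len(self.slots)
        self.slots[self.index] = return_address

    def on_ret(self):
        # read prediction from index, then decrement index (wrapping)
        prediction = self.slots[self.index]
        self.index = (self.index - 1) % len(self.slots)
        return prediction

ras = ReturnAddressStack(size=4)
for address in [1, 2, 3, 4, 5]:          # five nested calls overflow four slots
    ras.on_call(address)
predictions = [ras.on_ret() for _ in range(5)]
```

The four most recent calls are predicted correctly. The fifth return should go to address 1, but that slot was overwritten on overflow, so the prediction is stale and would be a misprediction.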

# 1-cycle fetch?

assumption so far:

1 cycle to fetch instruction + identify if jmp, etc.

often not really practical

especially if:

- complex machine code format
- many pipeline stages
- more complex instruction cache
- (future idea) fetching 2+ instructions/cycle

# branch target buffer

what if we can't decode LABEL from machine code for jmp LABEL or jle LABEL fast?

will happen in more complex pipelines

what if we can't decode that there's a RET, CALL, etc. fast?

# BTB: cache for branch targets

| idx  | valid | tag   | ofst | type | target   | (more info?) |
|------|-------|-------|------|------|----------|--------------|
| 0x00 | 1     | 0x400 | 5    | Jxx  | 0x3FFFF3 | •••          |
| 0x01 | 1     | 0x401 | C    | JMP  | 0x401035 |              |
| 0x02 | 0     |       |      |      |          |              |
| 0x03 | 1     | 0x400 | 9    | RET  |          | •••          |
| •••  | •••   | •••   | •••  | •••  | •••      | •••          |
| 0xFF | 1     | 0x3FF | 8    | CALL | 0x404033 | •••          |

| valid |     |
|-------|-----|
| 1     | ••• |
| 0     |     |
| 0     | ••• |
| 0     |     |
| •••   |     |
| 0     | ••• |

0x3FFFF3: movq %rax, %rsi

0x3FFFF7: pushq %rbx

0x3FFFF8: call 0x404033

0x400001: popq %rbx

0x400003: cmpq %rbx, %rax

0x400005: jle 0x3FFFF3

•••

0x400031: ret
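The table lookup can be sketched in a few lines. The field widths here (4-bit offset, 8-bit index, remaining bits as tag) are inferred from the example rows rather than stated on the slide, and this sketch looks up an exact instruction address, which is a simplification.

```python
OFFSET_BITS = 4   # assumed: low 4 bits of the PC
INDEX_BITS = 8    # assumed: next 8 bits select one of 256 entries

def split_pc(pc):
    """Split an address into (tag, index, offset) fields."""
    offset = pc & ((1 << OFFSET_BITS) - 1)
    index = (pc >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = pc >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

class BranchTargetBuffer:
    """Direct-mapped cache from branch address to (type, target)."""

    def __init__(self):
        self.entries = [None] * (1 << INDEX_BITS)   # None means valid = 0

    def install(self, pc, kind, target):
        tag, index, offset = split_pc(pc)
        self.entries[index] = {"tag": tag, "offset": offset,
                               "kind": kind, "target": target}

    def lookup(self, pc):
        tag, index, offset = split_pc(pc)
        entry = self.entries[index]
        if entry is None or entry["tag"] != tag or entry["offset"] != offset:
            return None                              # miss: nothing known about this PC
        return entry["kind"], entry["target"]

btb = BranchTargetBuffer()
btb.install(0x400005, "Jxx", 0x3FFFF3)
btb.install(0x3FFFF8, "CALL", 0x404033)
```

Unlike the tagless prediction table, the tag comparison here means a different branch that maps to the same index is reported as a miss instead of silently reusing the wrong target.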


## indirect branch prediction

for instructions like: jmp \*%rax or jmp \*(%rax, %rcx, 8)

simple idea: record what happened last time, predict the same

#### Intel Haswell extension:

lookup based on hash of *last several jump addresses*

allows different predictions based on context of jmp \*%rax

really useful for polymorphism
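The difference between the two ideas can be illustrated with a toy model. This is not a description of Haswell's actual hardware: the history length, the use of a tuple of recent targets as the table key (in place of a hash), and the addresses are all assumptions made for the example.

```python
class LastTargetPredictor:
    """Simple idea: predict the same target as last time for this jump."""

    def __init__(self):
        self.last = {}

    def predict(self, pc):
        return self.last.get(pc)

    def update(self, pc, target):
        self.last[pc] = target

class PathHistoryPredictor:
    """Key the table on the jump's PC plus the last several jump targets."""

    def __init__(self, history_length=2):
        self.history = (0,) * history_length
        self.table = {}

    def predict(self, pc):
        return self.table.get((pc, self.history))

    def update(self, pc, target):
        self.table[(pc, self.history)] = target
        self.history = self.history[1:] + (target,)

def mispredictions(predictor, trace):
    wrong = 0
    for pc, target in trace:
        if predictor.predict(pc) != target:
            wrong += 1
        predictor.update(pc, target)
    return wrong

# One indirect jump (hypothetical address) whose target alternates with context.
trace = [(0x1000, 0xA000), (0x1000, 0xB000)] * 20
last_wrong = mispredictions(LastTargetPredictor(), trace)
path_wrong = mispredictions(PathHistoryPredictor(history_length=2), trace)
```

On this alternating trace the last-target predictor is always one step behind, while the history-keyed table only mispredicts until each history pattern has been seen once.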

# missing topic: connecting processors + devices

talked about how individual processors work

but no place to communicate with I/O devices, other CPUs

how do we do that?

# individual computers are networks

individual computers are (kinda) networks of...

processors, memories, I/O devices

so what topology (layout) do those networks have?

## the "bus"



# example: 80386 signal pins

| name      | purpose                | category |
|-----------|------------------------|----------|
| CLK2      | clock for bus          | timing   |
| W/R#      | write or read?         | metadata |
| D/C#      | data or control?       | metadata |
| M/IO#     | memory or I/O?         | metadata |
| INTR      | interrupt request      | metadata |
| ...       | other metadata signals | metadata |
| BE0#-BE3# | (4) byte enable        | address  |
| A2-A31    | (30) address bits      | address  |
| D0-D31    | (32) data signals      | data     |

# example: AMD EPYC (1 socket)



Fig. 21. Single-socket AMD EPYC<sup>TM</sup> system (SP3). Figure from Burd et al., "'Zeppelin': An SoC for Multichip Architectures" (IEEE JSSC Vol. 54, No. 1)

## example: Intel Skylake-SP




# extra trips to CPU










## instruction-level parallelism

with pipelining: ran multiple instructions at once

but started/finished at most one at a time

and one slow instruction can slow everything down

we can often do better

## beyond pipelining: multiple issue

start more than one instruction/cycle

multiple parallel pipelines; many-input/output register file

#### hazard handling much more complex


# beyond pipelining: out-of-order

find later instructions to do instead of stalling

lists of available instructions in pipeline registers; take any instruction with available values

provide illusion that work is still done in order; much more complicated hazard handling logic

```
stages per instruction (cycle alignment not preserved; cycles run 0 to 11):
mov 0(%rbx), %r8    F D R I E M M M W C
sub %r8, %r9        F D R I E W C
add %r10, %r11      F D R I E W C
xor %r12, %r13      F D R I E W C
```

•••

### interlude: real CPUs

modern CPUs:

execute multiple instructions at once

execute instructions out of order — whenever values available

### out-of-order and hazards

out-of-order execution makes hazards harder to handle

#### problems for forwarding:

value in last stage may not be most up-to-date

older value may be written back before newer value?

#### problems for branch prediction:

mispredicted instructions may complete execution before squashing

#### which instructions to dispatch?

how to quickly find instructions that are ready?


# read-after-write examples (1)

```
cycle #           0 1 2 3 4 5 6 7 8
addq %r10, %r8    F D E M W
addq %r11, %r8      F D E M W
addq %r12, %r8        F D E M W
```

normal pipeline: two options for %r8? choose the one from *earliest stage*, because it's from the most recent instruction

out-of-order execution: %r8 from the earliest stage might be from a *delayed instruction*, so we can't use the same forwarding logic

## register version tracking

goal: track different versions of registers

out-of-order execution: may compute versions at different times

only forward the correct version

strategy for doing this: preprocess instructions to represent version info

makes forwarding, etc. lookup easier

# rewriting hazard examples (1)

```
addq %r10, %r8 | addq %r10, %r8_v1 → %r8_v2
addq %r11, %r8 | addq %r11, %r8_v2 → %r8_v3
addq %r12, %r8 | addq %r12, %r8_v3 → %r8_v4
```

read different version than the one written

represent with three-argument pseudo-instructions

forwarding a value? must match version exactly

for now: version numbers

later: something simpler to implement
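As a rough illustration of the version-number idea, the sketch below rewrites two-operand instructions into three-argument pseudo-instructions. The tuple format and the `_vN` naming are assumptions for this example, and it tags every register with a version, including source-only registers.

```python
from collections import defaultdict

def add_versions(instructions):
    """Rewrite (op, src, dst) into 'op src_vA, dst_vB -> dst_vC' strings."""
    version = defaultdict(lambda: 1)       # every register starts at version 1
    rewritten = []
    for op, src, dst in instructions:
        src_name = f"{src}_v{version[src]}"
        old_dst = f"{dst}_v{version[dst]}"  # version read by this instruction
        version[dst] += 1
        new_dst = f"{dst}_v{version[dst]}"  # new version written by this instruction
        rewritten.append(f"{op} {src_name}, {old_dst} -> {new_dst}")
    return rewritten

program = [("addq", "%r10", "%r8"),
           ("addq", "%r11", "%r8"),
           ("addq", "%r12", "%r8")]
versioned = add_versions(program)
```

Each write creates a fresh version, so a forwarded value can be matched to exactly the version an instruction reads.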

```
stages per instruction (cycle alignment not preserved; cycles run 0 to 8):
addq %r10, %r8         F D E M W
rmmovq %r8, (%rax)     F D E M W
rmmovq %r8, 8(%rax)    F D E M W
irmovq $100, %r8       F D E M W
addq %r13, %r8         F D E M W
```

```
stages per instruction (cycle alignment not preserved; cycles run 0 to 8):
addq %r10, %r8         F D E M W
rmmovq %r8, (%rax)     F D E M W
rrmovq %r11, %r8       F D E M W
rmmovq %r8, 8(%rax)    F D E M W
irmovq $100, %r8       F D E M W
addq %r13, %r8         F D E M W
```

out-of-order execution: if we don't do something, newest value could be overwritten!


two instructions that haven't been started could need *different versions* of %r8!



## keeping multiple versions

for write-after-write problem: need to keep copies of multiple versions

both the new version and the old version needed by delayed instructions

for read-after-write problem: need to distinguish different versions

solution: have lots of extra registers

...and assign each version a new 'real' register

called register renaming

### register renaming

rename architectural registers to physical registers

different physical register for each version of architectural register

track which physical registers are ready

compare physical register numbers to do forwarding
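A minimal sketch of the renaming step, assuming instructions are given as `(op, sources, destination)` tuples and that free physical registers are handed out in list order. The register numbers are taken from the example state used in this section; the function and variable names are invented for illustration.

```python
def rename(instructions, reg_map, free_regs):
    """Rename architectural registers to physical ones, one new register per write."""
    reg_map = dict(reg_map)                # arch -> phys mapping for the next instruction
    free = list(free_regs)                 # available physical registers
    renamed = []
    for op, sources, dest in instructions:
        phys_sources = [reg_map[s] for s in sources]   # read current versions first
        phys_dest = free.pop(0)                        # new version gets a new register
        reg_map[dest] = phys_dest
        renamed.append((op, phys_sources, phys_dest))
    return renamed, reg_map, free

initial_map = {"%rax": "%x04", "%rcx": "%x09", "%r8": "%x13", "%r9": "%x17",
               "%r10": "%x19", "%r11": "%x07", "%r12": "%x05"}
initial_free = ["%x18", "%x20", "%x21", "%x23", "%x24"]

program = [("add", ["%r10", "%r8"], "%r8"),
           ("add", ["%r11", "%r8"], "%r8"),
           ("add", ["%r12", "%r8"], "%r8")]
renamed, final_map, remaining_free = rename(program, initial_map, initial_free)
```

Looking up the sources before updating the map matters: an instruction like `add %r10, %r8` must read the old version of `%r8` and write a new one. This sketch never returns registers to the free list, which a real design does as instructions commit.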





branch prediction needs to happen before instructions are decoded; done with cache-like tables of information about recent branches



register renaming done here; stage needs to keep mapping from architectural to physical names



instruction queue holds pending renamed instructions; combined with register-ready info to *issue* instructions (issue = start executing)



read from much larger register file and handle forwarding; register file: typically read 6+ registers at a time (extra data paths/wires for forwarding not shown)



many execution units actually do math or memory load/store; some may have multiple pipeline stages; some may take variable time (data cache, integer divide, ...)



writeback results to physical registers; register file: typically supports writing 3+ registers at a time



new commit (sometimes *retire*) stage finalizes instruction; figures out when physical registers can be reused again



commit stage also handles branch misprediction; reorder buffer tracks enough information to undo mispredicted instrs.

[figure: out-of-order pipeline timing diagram for the instructions below, cycles 0 to 11; per-cycle stage alignment not recoverable. Annotations: issue instructions (to "execution units") when operands ready; commit instructions in order, waiting until next complete]

```
addq %r01, %r05
addq %r02, %r05
addq %r03, %r04
cmpq %r04, %r08
jne ...
addq %r01, %r05
addq %r02, %r05
addq %r03, %r04
cmpq %r04, %r08
```



### register renaming

rename architectural registers to physical registers

architectural = part of instruction set architecture

different name for each version of architectural register

### register renaming state

| original      | renamed               |
|---------------|-----------------------|
| add %r10, %r8 | add %x19, %x13 → %x18 |
| add %r11, %r8 | add %x07, %x18 → %x20 |
| add %r12, %r8 | add %x05, %x20 → %x21 |

arch → phys register map: table for architectural (external) and physical (internal) name (for next instr. to process)

| arch | phys (before) | phys (after renaming)                                |
|------|---------------|------------------------------------------------------|
| %rax | %x04          | %x04                                                 |
| %rcx | %x09          | %x09                                                 |
| •••  | •••           | •••                                                  |
| %r8  | %x13          | <del>%x13</del> <del>%x18</del> <del>%x20</del> %x21 |
| %r9  | %x17          | %x17                                                 |
| %r10 | %x19          | %x19                                                 |
| %r11 | %x07          | %x07                                                 |
| %r12 | %x05          | %x05                                                 |
| •••  | •••           | •••                                                  |

free regs: list of available physical registers, added to as instructions finish

| free regs (before)          | free regs (after renaming)                                     |
|-----------------------------|----------------------------------------------------------------|
| %x18 %x20 %x21 %x23 %x24 ••• | <del>%x18</del> <del>%x20</del> <del>%x21</del> %x23 %x24 ••• |

| original             | renamed                           |
|----------------------|-----------------------------------|
| addq %r10, %r8       | addq %x19, %x13 → %x18            |
| rmmovq %r8, (%rax)   | rmmovq %x18, (%x04) → (memory)    |
| subq %r8, %r11       | subq %x18, %x07 → %x20            |
| mrmovq 8(%r11), %r11 | mrmovq 8(%x20), (memory) → %x21   |
| irmovq \$100, %r8    | irmovq \$100 → %x23               |
| addq %r11, %r8       | addq %x21, %x23 → %x24            |

arch → phys register map

| arch | phys (before) | phys (after renaming)                                |
|------|---------------|------------------------------------------------------|
| %rax | %x04          | %x04                                                 |
| %rcx | %x09          | %x09                                                 |
| •••  | •••           | •••                                                  |
| %r8  | %x13          | <del>%x13</del> <del>%x18</del> <del>%x23</del> %x24 |
| %r9  | %x17          | %x17                                                 |
| %r10 | %x19          | %x19                                                 |
| %r11 | %x07          | <del>%x07</del> <del>%x20</del> %x21                 |
| %r12 | %x05          | %x05                                                 |
| %r13 | %x02          | %x02                                                 |

| free regs (before)           | free regs (after renaming)                                                      |
|------------------------------|---------------------------------------------------------------------------------|
| %x18 %x20 %x21 %x23 %x24 ••• | <del>%x18</del> <del>%x20</del> <del>%x21</del> <del>%x23</del> <del>%x24</del> ••• |

for rmmovq %r8, (%rax) followed by mrmovq 8(%r11), %r11: could be that %rax = 8+%r11, so the load could happen before the value is written!

possible data hazard! not handled via register renaming

- option 1: run loads + stores in order
- option 2: compare load/store addresses

### register renaming exercise

original: addq %r8, %r9; movq \$100, %r10; subq %r10, %r8; xorq %r8, %r9; andq %rax, %r9

arch → phys register map:

| %rax | %x04 |
|------|------|
| %rcx | %x09 |
| •••  | •••  |
| %r8  | %x13 |
| %r9  | %x17 |
| %r10 | %x19 |
| %r11 | %x29 |
| %r12 | %x05 |
| %r13 | %x02 |
| •••  | •••  |

free regs %x18 %x20 %x21 %x23 %x24 ....

renamed



#### instruction queue

| instruction                           |
|---------------------------------------|
| addq %x01, %x05 → %x06                |
| addq %x02, %x06 → %x07                |
| addq %x03, %x07 → %x08                |
| cmpq %x04, %x08 → %x09.cc             |
| jne %x09.cc, ...                      |
| addq %x01, %x08 → %x10                |
| addq %x02, %x10 → %x11                |
| addq %x03, %x11 → %x12                |
| cmpq %x04, %x12 → %x13.cc             |
|                                       |

#### scoreboard

| reg  | status  |
|------|---------|
| %x01 | ready   |
| %x02 | ready   |
| %x03 | ready   |
| %x04 | ready   |
| %x05 | ready   |
| %x06 | pending |
| %x07 | pending |
| %x08 | pending |
| %x09 | pending |
| %x10 | pending |
| %x11 | pending |
| %x12 | pending |
| %x13 | pending |
| •••  |         |
|      |         |

execution units: ALU 1, ALU 2

#### instruction queue

| instruction                           |
|---------------------------------------|
| addq %x01, %x05 → %x06                |
| addq %x02, %x06 $\rightarrow$ %x07    |
| addq %x03, %x07 $\rightarrow$ %x08    |
| cmpq %x04, %x08 → %x09.cc             |
| jne %x09.cc,                          |
| addq %x01, %x08 $\rightarrow$ %x10    |
| addq %x02, %x10 $\rightarrow$ %x11    |
| addq %x03, %x11 $\rightarrow$ %x12    |
| cmpq %x04, %x12 $\rightarrow$ %x13.cc |
|                                       |

execution unit cycle# 1 ALU 1 ALU 2

|      | 1       |
|------|---------|
| reg  | status  |
| %x01 | ready   |
| %x02 | ready   |
| %x03 | ready   |
| %x04 | ready   |
| %x05 | ready   |
| %x06 | pending |
| %x07 | pending |
| %x08 | pending |
| %x09 | pending |
| %x10 | pending |
| %x11 | pending |
| %x12 | pending |
| %x13 | pending |
| •••  |         |

#### instruction queue

| ,                                         |
|-------------------------------------------|
| instruction                               |
| addq %x01, %x05 → %x06                    |
| addq %x02, %x06 → %x07                    |
| addq %x03, %x07 $\rightarrow$ %x08        |
| cmpq %x04, %x08 → %x09.cc                 |
| jne %x09.cc,                              |
| addq %x01, %x08 $\rightarrow$ %x10        |
| addq $%x02$ , $%x10 \rightarrow %x11$     |
| addq %x03, %x11 $\rightarrow$ %x12        |
| cmpq $%x04$ , $%x12 \rightarrow %x13$ .cc |
|                                           |

#### scoreboard

| reg  | status  |
|------|---------|
| %x01 | ready   |
| %x02 | ready   |
| %x03 | ready   |
| %x04 | ready   |
| %x05 | ready   |
| %x06 | pending |
| %x07 | pending |
| %x08 | pending |
| %x09 | pending |
| %x10 | pending |
| %x11 | pending |
| %x12 | pending |
| %x13 | pending |
| •••  |         |

execution unit cycle# 1 ALU 1 ALU 2

#### instruction queue

| # | instruction                               |
|---|-------------------------------------------|
| 1 | addq %x01, %x05 → %x06                    |
| 2 | addq $%x02$ , $%x06 \rightarrow %x07$     |
| 3 | addq %x03, %x07 $\rightarrow$ %x08        |
| 4 | cmpq $%x04$ , $%x08 \rightarrow %x09$ .cc |
| 5 | jne %x09.cc,                              |
| 6 | addq %x01, %x08 $\rightarrow$ %x10        |
| 7 | addq $%x02$ , $%x10 \rightarrow %x11$     |
| 8 | addq %x03, %x11 $\rightarrow$ %x12        |
| 9 | cmpq $%x04$ , $%x12 \rightarrow %x13$ .cc |
|   |                                           |

execution unit cycle# 1 ALU 1 ALU 2

| reg  | status        |
|------|---------------|
| %x01 | ready         |
| %x02 | ready         |
| %x03 | ready         |
| %x04 | ready         |
| %x05 | ready         |
| %x06 | pending ready |
| %x07 | pending       |
| %x08 | pending       |
| %x09 | pending       |
| %x10 | pending       |
| %x11 | pending       |
| %x12 | pending       |
| %x13 | pending       |
| •••  |               |

#### instruction queue

| #         | instruction                               |
|-----------|-------------------------------------------|
| $\bowtie$ | addq %x01, %x05 → %x06                    |
| 2         | addq %x02, %x06 → %x07                    |
| 3         | addq %x03, %x07 → %x08                    |
| 4         | cmpq %x04, %x08 $\rightarrow$ %x09.cc     |
| 5         | jne %x09.cc,                              |
| 6         | addq %x01, %x08 $ ightarrow$ %x10         |
| 7         | addq %x02, %x10 $\rightarrow$ %x11        |
| 8         | addq %x03, %x11 $\rightarrow$ %x12        |
| 9         | cmpq $%x04$ , $%x12 \rightarrow %x13$ .cc |
|           |                                           |

execution unit cycle# 1 2
ALU 1 1 2
ALU 2 —

#### scoreboard

| reg  | status        |
|------|---------------|
| %x01 | ready         |
| %x02 | ready         |
| %x03 | ready         |
| %x04 | ready         |
| %x05 | ready         |
| %x06 | pending ready |
| %x07 | pending ready |
| %x08 | pending       |
| %x09 | pending       |
| %x10 | pending       |
| %x11 | pending       |
| %x12 | pending       |
| %x13 | pending       |
| •••  |               |

•••

#### instruction queue

| #            | instruction               |
|--------------|---------------------------|
| <del>1</del> | addq %x01, %x05 → %x06    |
| <del>2</del> | addq %x02, %x06 → %x07    |
| 3            | addq %x03, %x07 → %x08    |
| 4            | cmpq %x04, %x08 → %x09.cc |
| 5            | jne %x09.cc,              |
| 6            | addq %x01, %x08 → %x10    |
| 7            | addq %x02, %x10 → %x11    |
| 8            | addq %x03, %x11 → %x12    |
| 9            | cmpq %x04, %x12 → %x13.cc |

| execution unit | cycle# 1 | 2 | 3 |
|----------------|----------|---|---|
| ALU 1          | 1        | 2 | 3 |
| ALU 2          | —        | — | — |

#### scoreboard

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | ready                    |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | <del>pending</del> ready |
| %x08 | <del>pending</del> ready |
| %x09 | pending                  |
| %x10 | pending                  |
| %x11 | pending                  |
| %x12 | pending                  |
| %x13 | pending                  |
| •••  |                          |

#### instruction queue

| #            | instruction               |
|--------------|---------------------------|
| <del>1</del> | addq %x01, %x05 → %x06    |
| <del>2</del> | addq %x02, %x06 → %x07    |
| <del>3</del> | addq %x03, %x07 → %x08    |
| 4            | cmpq %x04, %x08 → %x09.cc |
| 5            | jne %x09.cc,              |
| 6            | addq %x01, %x08 → %x10    |
| 7            | addq %x02, %x10 → %x11    |
| 8            | addq %x03, %x11 → %x12    |
| 9            | cmpq %x04, %x12 → %x13.cc |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | ready                    |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | <del>pending</del> ready |
| %x08 | <del>pending</del> ready |
| %x09 | pending                  |
| %x10 | pending                  |
| %x11 | pending                  |
| %x12 | pending                  |
| %x13 | pending                  |
| •••  |                          |

#### instruction queue

| #            | instruction               |
|--------------|---------------------------|
| <del>1</del> | addq %x01, %x05 → %x06    |
| <del>2</del> | addq %x02, %x06 → %x07    |
| <del>3</del> | addq %x03, %x07 → %x08    |
| 4            | cmpq %x04, %x08 → %x09.cc |
| 5            | jne %x09.cc,              |
| 6            | addq %x01, %x08 → %x10    |
| 7            | addq %x02, %x10 → %x11    |
| 8            | addq %x03, %x11 → %x12    |
| 9            | cmpq %x04, %x12 → %x13.cc |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | ready                    |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | <del>pending</del> ready |
| %x08 | <del>pending</del> ready |
| %x09 | <del>pending</del> ready |
| %x10 | <del>pending</del> ready |
| %x11 | pending                  |
| %x12 | pending                  |
| %x13 | pending                  |
| •••  |                          |

#### instruction queue

| #            | instruction               |
|--------------|---------------------------|
| <del>1</del> | addq %x01, %x05 → %x06    |
| <del>2</del> | addq %x02, %x06 → %x07    |
| <del>3</del> | addq %x03, %x07 → %x08    |
| <del>4</del> | cmpq %x04, %x08 → %x09.cc |
| 5            | jne %x09.cc,              |
| <del>6</del> | addq %x01, %x08 → %x10    |
| 7            | addq %x02, %x10 → %x11    |
| 8            | addq %x03, %x11 → %x12    |
| 9            | cmpq %x04, %x12 → %x13.cc |

| execution unit | cycle# 1 | 2 | 3 | 4 |
|----------------|----------|---|---|---|
| ALU 1          | 1        | 2 | 3 | 4 |
| ALU 2          | —        | — | — | 6 |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | ready                    |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | <del>pending</del> ready |
| %x08 | <del>pending</del> ready |
| %x09 | <del>pending</del> ready |
| %x10 | <del>pending</del> ready |
| %x11 | pending                  |
| %x12 | pending                  |
| %x13 | pending                  |
| •••  |                          |

#### instruction queue

| #            | instruction               |
|--------------|---------------------------|
| <del>1</del> | addq %x01, %x05 → %x06    |
| <del>2</del> | addq %x02, %x06 → %x07    |
| <del>3</del> | addq %x03, %x07 → %x08    |
| <del>4</del> | cmpq %x04, %x08 → %x09.cc |
| <del>5</del> | jne %x09.cc,              |
| <del>6</del> | addq %x01, %x08 → %x10    |
| <del>7</del> | addq %x02, %x10 → %x11    |
| 8            | addq %x03, %x11 → %x12    |
| 9            | cmpq %x04, %x12 → %x13.cc |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | ready                    |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | <del>pending</del> ready |
| %x08 | <del>pending</del> ready |
| %x09 | <del>pending</del> ready |
| %x10 | <del>pending</del> ready |
| %x11 | pending                  |
| %x12 | pending                  |
| %x13 | pending                  |
| •••  |                          |

| execution unit | cycle# 1 | 2 | 3 | 4 | 5 |
|----------------|----------|---|---|---|---|
| ALU 1          | 1        | 2 | 3 | 4 | 5 |
| ALU 2          | —        | — | — | 6 | 7 |

#### instruction queue

| #            | instruction               |
|--------------|---------------------------|
| <del>1</del> | addq %x01, %x05 → %x06    |
| <del>2</del> | addq %x02, %x06 → %x07    |
| <del>3</del> | addq %x03, %x07 → %x08    |
| <del>4</del> | cmpq %x04, %x08 → %x09.cc |
| <del>5</del> | jne %x09.cc,              |
| <del>6</del> | addq %x01, %x08 → %x10    |
| <del>7</del> | addq %x02, %x10 → %x11    |
| <del>8</del> | addq %x03, %x11 → %x12    |
| 9            | cmpq %x04, %x12 → %x13.cc |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | ready                    |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | <del>pending</del> ready |
| %x08 | <del>pending</del> ready |
| %x09 | <del>pending</del> ready |
| %x10 | <del>pending</del> ready |
| %x11 | <del>pending</del> ready |
| %x12 | pending                  |
| %x13 | pending                  |
| •••  |                          |

#### instruction queue

| #            | instruction               |
|--------------|---------------------------|
| <del>1</del> | addq %x01, %x05 → %x06    |
| <del>2</del> | addq %x02, %x06 → %x07    |
| <del>3</del> | addq %x03, %x07 → %x08    |
| <del>4</del> | cmpq %x04, %x08 → %x09.cc |
| <del>5</del> | jne %x09.cc,              |
| <del>6</del> | addq %x01, %x08 → %x10    |
| <del>7</del> | addq %x02, %x10 → %x11    |
| <del>8</del> | addq %x03, %x11 → %x12    |
| <del>9</del> | cmpq %x04, %x12 → %x13.cc |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | ready                    |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | <del>pending</del> ready |
| %x08 | <del>pending</del> ready |
| %x09 | <del>pending</del> ready |
| %x10 | <del>pending</del> ready |
| %x11 | <del>pending</del> ready |
| %x12 | <del>pending</del> ready |
| %x13 | pending                  |
| •••  |                          |

| execution unit | cycle# 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|----------------|----------|---|---|---|---|---|---|
| ALU 1          | 1        | 2 | 3 | 4 | 5 | 8 | 9 |
| ALU 2          | —        | — | — | 6 | 7 | — | — |

#### instruction queue

| #            | instruction               |
|--------------|---------------------------|
| <del>1</del> | addq %x01, %x05 → %x06    |
| <del>2</del> | addq %x02, %x06 → %x07    |
| <del>3</del> | addq %x03, %x07 → %x08    |
| <del>4</del> | cmpq %x04, %x08 → %x09.cc |
| <del>5</del> | jne %x09.cc,              |
| <del>6</del> | addq %x01, %x08 → %x10    |
| <del>7</del> | addq %x02, %x10 → %x11    |
| <del>8</del> | addq %x03, %x11 → %x12    |
| <del>9</del> | cmpq %x04, %x12 → %x13.cc |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | ready                    |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | <del>pending</del> ready |
| %x08 | <del>pending</del> ready |
| %x09 | <del>pending</del> ready |
| %x10 | <del>pending</del> ready |
| %x11 | <del>pending</del> ready |
| %x12 | <del>pending</del> ready |
| %x13 | <del>pending</del> ready |
| •••  |                          |
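
The issue order in the walkthrough above can be reproduced with a small simulation. This is a minimal sketch rather than a model of real hardware: it assumes two identical ALUs, a one-cycle latency for every instruction, and an oldest-first choice among ready instructions. The tuple encoding and the `schedule` function are hypothetical names introduced here, and condition codes are treated as ordinary registers named `%x09.cc` and `%x13.cc`.

```python
# Hypothetical encoding of the example: (number, source registers, destination).
INSTRUCTIONS = [
    (1, ["%x01", "%x05"], "%x06"),
    (2, ["%x02", "%x06"], "%x07"),
    (3, ["%x03", "%x07"], "%x08"),
    (4, ["%x04", "%x08"], "%x09.cc"),
    (5, ["%x09.cc"], None),            # jne: reads condition codes, writes no register
    (6, ["%x01", "%x08"], "%x10"),
    (7, ["%x02", "%x10"], "%x11"),
    (8, ["%x03", "%x11"], "%x12"),
    (9, ["%x04", "%x12"], "%x13.cc"),
]


def schedule(instructions, ready, num_alus=2):
    """Return one list of issued instruction numbers per cycle."""
    ready = set(ready)          # scoreboard: registers whose values are available
    queue = list(instructions)  # instruction queue, oldest first
    cycles = []
    while queue:
        # Choose the oldest instructions whose sources are all ready.
        issued = [ins for ins in queue if all(s in ready for s in ins[1])][:num_alus]
        if not issued:
            raise RuntimeError("no instruction can issue; a source is never produced")
        for ins in issued:
            queue.remove(ins)
        # Results only become visible to instructions chosen in later cycles.
        ready.update(ins[2] for ins in issued if ins[2] is not None)
        cycles.append([ins[0] for ins in issued])
    return cycles


if __name__ == "__main__":
    initial = ["%x01", "%x02", "%x03", "%x04", "%x05"]
    for cycle, issued in enumerate(schedule(INSTRUCTIONS, initial), start=1):
        print(cycle, issued)
```

Under these assumptions the sketch issues 1, 2, and 3 alone, then 4 with 6, then 5 with 7, then 8, then 9, which matches the execution unit table for cycles 1 through 7.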

#### instruction queue

| # | instruction                        |
|---|------------------------------------|
| 1 | mrmovq (%x04) → %x06               |
| 2 | mrmovq (%x05) → %x07               |
| 3 | addq %x01, %x02 → %x08             |
| 4 | addq %x01, %x06 → %x09             |
| 5 | addq %x01, %x07 → %x10             |

| reg  | status |
|------|--------|
| %x01 | ready  |
| %x02 | ready  |
| %x03 | ready  |
| %x04 | ready  |
| %x05 | ready  |
| %x06 |        |
| %x07 |        |
| %x08 |        |
| %x09 |        |
| %x10 |        |
| •••  |        |

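| execution unit | cycle# 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|----------------|----------|---|---|---|---|---|---|
| ALU            |          |   |   |   |   |   |   |
| data cache     |          |   |   |   |   |   |   |

assume 1 cycle/access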

### register renaming: missing pieces

what about "hidden" inputs like %rsp, condition codes?

one solution: translate to instructions with additional register parameters

making %rsp an explicit parameter

turning hidden condition codes into operands!

bonus: can also translate complex instructions to simpler ones
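
One way to picture this translation is as a mapping from each instruction to operations that list every register they read and write, so the renamer can treat %rsp and the condition codes like any other register. The sketch below is illustrative only: the micro-op names (such as `store`), the `(operation, inputs, outputs)` tuple layout, and the `make_explicit` function are assumptions made for this example, not a description of a real decoder.

```python
def make_explicit(opcode, operands):
    """Return micro-ops as (operation, inputs, outputs) with no hidden operands."""
    if opcode == "pushq":
        # pushq src implicitly reads and writes %rsp; split it into two simpler ops.
        (src,) = operands
        return [
            ("subq", ["$8", "%rsp"], ["%rsp"]),   # %rsp <- %rsp - 8
            ("store", [src, "%rsp"], []),          # memory[%rsp] <- src
        ]
    if opcode == "cmpq":
        # cmpq a, b implicitly writes the condition codes.
        return [("cmpq", list(operands), ["cc"])]
    if opcode == "jne":
        # jne target implicitly reads the condition codes.
        return [("jne", ["cc"] + list(operands), [])]
    raise ValueError(f"opcode not covered by this sketch: {opcode}")


if __name__ == "__main__":
    for op in make_explicit("pushq", ["%rax"]):
        print(op)
```

The `pushq` case also shows the bonus point: one complex instruction becomes two simpler operations, each with explicit inputs and outputs.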







### execution units AKA functional units (1)

where actual work of instruction is done

e.g. the actual ALU, or data cache

sometimes pipelined:

(here: 1 op/cycle; 3 cycle latency)






exercise: how long to compute  $A \times (B \times (C \times D))$ ?

### execution units AKA functional units (2)

where actual work of instruction is done

e.g. the actual ALU, or data cache

sometimes unpipelined:



#### instruction queue

| #  | instruction                       |
|----|-----------------------------------|
| 1  | add %x01, %x02 → %x03             |
| 2  | imul %x04, %x05 → %x06            |
| 3  | imul %x03, %x07 → %x08            |
| 4  | cmp %x03, %x08 → %x09.cc          |
| 5  | jle %x09.cc,                      |
| 6  | add %x01, %x03 → %x11             |
| 7  | imul %x04, %x06 → %x12            |
| 8  | imul %x03, %x08 → %x13            |
| 9  | cmp %x11, %x13 → %x14.cc          |
| 10 | jle %x14.cc,                      |

| execution unit        |
|-----------------------|
| ALU 1 (add, cmp, jxx) |
| ALU 2 (add, cmp, jxx) |
| ALU 3 (mul) start     |
| ALU 3 (mul) end       |

| reg  | status  |
|------|---------|
| %x01 | ready   |
| %x02 | ready   |
| %x03 | pending |
| %x04 | ready   |
| %x05 | ready   |
| %x06 | pending |
| %x07 | ready   |
| %x08 | pending |
| %x09 | pending |
| %x10 | pending |
| %x11 | pending |
| %x12 | pending |
| %x13 | pending |
| %x14 | pending |
|      |         |

| #  | instruction              |
|----|--------------------------|
| 1  | add %x01, %x02 → %x03    |
| 2  | imul %x04, %x05 → %x06   |
| 3  | imul %x03, %x07 → %x08   |
| 4  | cmp %x03, %x08 → %x09.cc |
| 5  | jle %x09.cc,             |
| 6  | add %x01, %x03 → %x11    |
| 7  | imul %x04, %x06 → %x12   |
| 8  | imul %x03, %x08 → %x13   |
| 9  | cmp %x11, %x13 → %x14.cc |
| 10 | jle %x14.cc,             |

| execution unit        | cycle# 1 |
|-----------------------|----------|
| ALU 1 (add, cmp, jxx) | 1        |
| ALU 2 (add, cmp, jxx) | _        |
| ALU 3 (mul) start     | 2        |
| ALU 3 (mul) end       |          |

| reg  | status  |
|------|---------|
| %x01 | ready   |
| %x02 | ready   |
| %x03 | pending |
| %x04 | ready   |
| %x05 | ready   |
| %x06 | pending |
| %x07 | ready   |
| %x08 | pending |
| %x09 | pending |
| %x10 | pending |
| %x11 | pending |
| %x12 | pending |
| %x13 | pending |
| %x14 | pending |
| •••  |         |
|      |         |

| #            | instruction              |
|--------------|--------------------------|
| <del>1</del> | add %x01, %x02 → %x03    |
| <del>2</del> | imul %x04, %x05 → %x06   |
| 3            | imul %x03, %x07 → %x08   |
| 4            | cmp %x03, %x08 → %x09.cc |
| 5            | jle %x09.cc,             |
| 6            | add %x01, %x03 → %x11    |
| 7            | imul %x04, %x06 → %x12   |
| 8            | imul %x03, %x08 → %x13   |
| 9            | cmp %x11, %x13 → %x14.cc |
| 10           | jle %x14.cc,             |

| execution unit        | cycle# 1 | 2 |
|-----------------------|----------|---|
| ALU 1 (add, cmp, jxx) | 1        | 6 |
| ALU 2 (add, cmp, jxx) | _        | _ |
| ALU 3 (mul) start     | 2        | 3 |
| ALU 3 (mul) end       |          | 2 |

| reg  | status          |
|------|-----------------|
| %x01 | ready           |
| %x02 | ready           |
| %x03 | <del>pending</del> ready |
| %x04 | ready           |
| %x05 | ready           |
| %x06 | pending (still) |
| %x07 | ready           |
| %x08 | pending         |
| %x09 | pending         |
| %x10 | pending         |
| %x11 | pending         |
| %x12 | pending         |
| %x13 | pending         |
| %x14 | pending         |
| •••  |                 |

| #            | instruction              |
|--------------|--------------------------|
| <del>1</del> | add %x01, %x02 → %x03    |
| <del>2</del> | imul %x04, %x05 → %x06   |
| <del>3</del> | imul %x03, %x07 → %x08   |
| 4            | cmp %x03, %x08 → %x09.cc |
| 5            | jle %x09.cc,             |
| <del>6</del> | add %x01, %x03 → %x11    |
| 7            | imul %x04, %x06 → %x12   |
| 8            | imul %x03, %x08 → %x13   |
| 9            | cmp %x11, %x13 → %x14.cc |
| 10           | jle %x14.cc,             |

| execution unit        | cycle# 1 | 2 | 3 |
|-----------------------|----------|---|---|
| ALU 1 (add, cmp, jxx) | 1        | 6 | _ |
| ALU 2 (add, cmp, jxx) | _        | _ | _ |
| ALU 3 (mul) start     | 2        | 3 | 7 |
| ALU 3 (mul) end       |          | 2 | 3 |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | <del>pending</del> ready |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | ready                    |
| %x08 | pending (still)          |
| %x09 | pending                  |
| %x10 | pending                  |
| %x11 | <del>pending</del> ready |
| %x12 | pending                  |
| %x13 | pending                  |
| %x14 | pending                  |
| •••  |                          |

| #            | instruction              |
|--------------|--------------------------|
| <del>1</del> | add %x01, %x02 → %x03    |
| <del>2</del> | imul %x04, %x05 → %x06   |
| <del>3</del> | imul %x03, %x07 → %x08   |
| <del>4</del> | cmp %x03, %x08 → %x09.cc |
| 5            | jle %x09.cc,             |
| <del>6</del> | add %x01, %x03 → %x11    |
| <del>7</del> | imul %x04, %x06 → %x12   |
| 8            | imul %x03, %x08 → %x13   |
| 9            | cmp %x11, %x13 → %x14.cc |
| 10           | jle %x14.cc,             |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | <del>pending</del> ready |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | ready                    |
| %x08 | <del>pending</del> ready |
| %x09 | <del>pending</del> ready |
| %x10 | pending                  |
| %x11 | <del>pending</del> ready |
| %x12 | pending (still)          |
| %x13 | pending                  |
| %x14 | pending                  |
| •••  |                          |

| execution unit        | cycle# 1 | 2 | 3 | 4 | 5 |
|-----------------------|----------|---|---|---|---|
| ALU 1 (add, cmp, jxx) | 1        | 6 | _ | 4 |   |
| ALU 2 (add, cmp, jxx) | _        | _ | _ | _ |   |
| ALU 3 (mul) start     | 2        | 3 | 7 | 8 |   |
| ALU 3 (mul) end       |          | 2 | 3 | 7 | 8 |

| #            | instruction              |
|--------------|--------------------------|
| <del>1</del> | add %x01, %x02 → %x03    |
| <del>2</del> | imul %x04, %x05 → %x06   |
| <del>3</del> | imul %x03, %x07 → %x08   |
| <del>4</del> | cmp %x03, %x08 → %x09.cc |
| <del>5</del> | jle %x09.cc,             |
| <del>6</del> | add %x01, %x03 → %x11    |
| <del>7</del> | imul %x04, %x06 → %x12   |
| 8            | imul %x03, %x08 → %x13   |
| 9            | cmp %x11, %x13 → %x14.cc |
| 10           | jle %x14.cc,             |

| execution unit        | cycle# 1 | 2 | 3 | 4 | 5 |
|-----------------------|----------|---|---|---|---|
| ALU 1 (add, cmp, jxx) | 1        | 6 | _ | 4 | 5 |
| ALU 2 (add, cmp, jxx) | _        | _ | _ | _ | _ |
| ALU 3 (mul) start     | 2        | 3 | 7 | 8 | _ |
| ALU 3 (mul) end       |          | 2 | 3 | 7 | 8 |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | <del>pending</del> ready |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | ready                    |
| %x08 | <del>pending</del> ready |
| %x09 | <del>pending</del> ready |
| %x10 | pending                  |
| %x11 | <del>pending</del> ready |
| %x12 | <del>pending</del> ready |
| %x13 | pending (still)          |
| %x14 | pending                  |
| •••  |                          |

| #            | instruction              |
|--------------|--------------------------|
| <del>1</del> | add %x01, %x02 → %x03    |
| <del>2</del> | imul %x04, %x05 → %x06   |
| <del>3</del> | imul %x03, %x07 → %x08   |
| <del>4</del> | cmp %x03, %x08 → %x09.cc |
| <del>5</del> | jle %x09.cc,             |
| <del>6</del> | add %x01, %x03 → %x11    |
| <del>7</del> | imul %x04, %x06 → %x12   |
| <del>8</del> | imul %x03, %x08 → %x13   |
| 9            | cmp %x11, %x13 → %x14.cc |
| 10           | jle %x14.cc,             |

| execution unit        | cycle# 1 | 2 | 3 | 4 | 5 |
|-----------------------|----------|---|---|---|---|
| ALU 1 (add, cmp, jxx) | 1        | 6 | _ | 4 | 5 |
| ALU 2 (add, cmp, jxx) | _        | _ | _ | _ | _ |
| ALU 3 (mul) start     | 2        | 3 | 7 | 8 | _ |
| ALU 3 (mul) end       |          | 2 | 3 | 7 | 8 |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | <del>pending</del> ready |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | ready                    |
| %x08 | <del>pending</del> ready |
| %x09 | <del>pending</del> ready |
| %x10 | pending                  |
| %x11 | <del>pending</del> ready |
| %x12 | <del>pending</del> ready |
| %x13 | <del>pending</del> ready |
| %x14 | pending                  |
| •••  |                          |

| #            | instruction              |
|--------------|--------------------------|
| <del>1</del> | add %x01, %x02 → %x03    |
| <del>2</del> | imul %x04, %x05 → %x06   |
| <del>3</del> | imul %x03, %x07 → %x08   |
| <del>4</del> | cmp %x03, %x08 → %x09.cc |
| <del>5</del> | jle %x09.cc,             |
| <del>6</del> | add %x01, %x03 → %x11    |
| <del>7</del> | imul %x04, %x06 → %x12   |
| <del>8</del> | imul %x03, %x08 → %x13   |
| <del>9</del> | cmp %x11, %x13 → %x14.cc |
| 10           | jle %x14.cc,             |

| execution unit        | cycle# 1 | 2 | 3 | 4 | 5 | 6 |
|-----------------------|----------|---|---|---|---|---|
| ALU 1 (add, cmp, jxx) | 1        | 6 | _ | 4 | 5 | 9 |
| ALU 2 (add, cmp, jxx) | _        | _ | _ | _ | _ | _ |
| ALU 3 (mul) start     | 2        | 3 | 7 | 8 | _ | _ |
| ALU 3 (mul) end       |          | 2 | 3 | 7 | 8 | _ |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | <del>pending</del> ready |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | ready                    |
| %x08 | <del>pending</del> ready |
| %x09 | <del>pending</del> ready |
| %x10 | pending                  |
| %x11 | <del>pending</del> ready |
| %x12 | <del>pending</del> ready |
| %x13 | <del>pending</del> ready |
| %x14 | <del>pending</del> ready |
| •••  |                          |

| #             | instruction              |
|---------------|--------------------------|
| <del>1</del>  | add %x01, %x02 → %x03    |
| <del>2</del>  | imul %x04, %x05 → %x06   |
| <del>3</del>  | imul %x03, %x07 → %x08   |
| <del>4</del>  | cmp %x03, %x08 → %x09.cc |
| <del>5</del>  | jle %x09.cc,             |
| <del>6</del>  | add %x01, %x03 → %x11    |
| <del>7</del>  | imul %x04, %x06 → %x12   |
| <del>8</del>  | imul %x03, %x08 → %x13   |
| <del>9</del>  | cmp %x11, %x13 → %x14.cc |
| <del>10</del> | jle %x14.cc,             |

| execution unit        | cycle# 1 | 2 | 3 | 4 | 5 | 6 | 7  |
|-----------------------|----------|---|---|---|---|---|----|
| ALU 1 (add, cmp, jxx) | 1        | 6 | _ | 4 | 5 | 9 | 10 |
| ALU 2 (add, cmp, jxx) | _        | _ | _ | _ | _ | _ | _  |
| ALU 3 (mul) start     | 2        | 3 | 7 | 8 | _ | _ | _  |
| ALU 3 (mul) end       |          | 2 | 3 | 7 | 8 | _ | _  |

| reg  | status                   |
|------|--------------------------|
| %x01 | ready                    |
| %x02 | ready                    |
| %x03 | <del>pending</del> ready |
| %x04 | ready                    |
| %x05 | ready                    |
| %x06 | <del>pending</del> ready |
| %x07 | ready                    |
| %x08 | <del>pending</del> ready |
| %x09 | <del>pending</del> ready |
| %x10 | pending                  |
| %x11 | <del>pending</del> ready |
| %x12 | <del>pending</del> ready |
| %x13 | <del>pending</del> ready |
| %x14 | <del>pending</del> ready |
| •••  |                          |

### OOO limitations

can't always find instructions to run

plenty of instructions, but all depend on unfinished ones

programmer can adjust program to help this

need to track all uncommitted instructions

can only go so far ahead

e.g. Intel Skylake: 224-entry reorder buffer, 168 physical registers

branch misprediction has a big cost (relative to pipelined)

e.g. Intel Skylake: approx. 16 cycles (vs. 2 for pipehw2 CPU)

## some performance examples

```
example1:
    movq $10000000000, %rax
loop1:
    addq %rbx, %rcx
    decq %rax
    jge loop1
    ret
```

about 30B instructions

my desktop: approx. 2.65 sec

```
example2:
    movq $10000000000, %rax
loop2:
    addq %rbx, %rcx
    addq %r8, %r9
    decq %rax
    jge loop2
    ret
```

about 40B instructions

my desktop: approx. 2.65 sec

## data flow model and limits (1)



### reassociation

with pipelined, 5-cycle latency multiplier; how long does each take to compute?

$$((a \times b) \times c) \times d$$

$$(a \times b) \times (c \times d)$$

imulq %rbx, %rax
imulq %rcx, %rdx
imulq %rdx, %rax
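
The two orderings can be compared with a short calculation. This is a sketch under stated assumptions, not a model of a specific CPU: there is a single pipelined multiplier that accepts one new multiply per cycle in program order, results are usable the cycle after they finish, and the inputs a, b, c, d are available before cycle 1. The `finish_times` function and the operation names are hypothetical.

```python
def finish_times(ops, latency=5):
    """ops: list of (name, inputs) in program order.

    Inputs not produced by an earlier op are assumed ready before cycle 1.
    Returns {name: cycle in which the result is produced}.
    """
    done = {}
    next_start = 1  # earliest cycle the multiplier can accept a new op
    for name, inputs in ops:
        operands_ready = max((done.get(i, 0) for i in inputs), default=0)
        start = max(next_start, operands_ready + 1)
        done[name] = start + latency - 1
        next_start = start + 1          # pipelined: one new op per cycle
    return done


# ((a * b) * c) * d: every multiply waits for the previous one
chain = [("ab", ["a", "b"]), ("abc", ["ab", "c"]), ("abcd", ["abc", "d"])]
# (a * b) * (c * d): the first two multiplies overlap in the pipeline
balanced = [("ab", ["a", "b"]), ("cd", ["c", "d"]), ("abcd", ["ab", "cd"])]

if __name__ == "__main__":
    print("chain:", finish_times(chain)["abcd"], "cycles")
    print("balanced:", finish_times(balanced)["abcd"], "cycles")
```

With these assumptions the chained form finishes in cycle 15 and the reassociated form in cycle 11, because `a * b` and `c * d` overlap in the multiplier. The three `imulq` instructions above correspond to the reassociated form.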

## Intel Skylake OOO design

- 2015 Intel design, codename 'Skylake'
- 94-entry instruction queue-equivalent
- 168 physical integer registers
- 168 physical floating point registers
- 4 ALU functional units
  - but some can handle more/different types of operations than others
- 2 load functional units
  - but pipelined: supports multiple pending cache misses in parallel
- 1 store functional unit
- 224-entry reorder buffer
  - determines how far ahead branch mispredictions, etc. can happen

# backup slides



phys → arch. reg for new instrs

| arch. reg | phys. reg |
|-----------|-----------|
| %rax      | %x12      |
| %rcx      | %x17      |
| %rbx      | %x13      |
| %rdx      | %x07      |
| •••       | •••       |

#### free list

%x19, %x23, ...

phys → arch. reg

| arch. reg | phys. reg |
|-----------|-----------|
| %rax      | %x12      |
| %rcx      | %x17      |
| %rbx      | %x13      |
| %rdx      | %x07      |
| •••       | •••       |

#### free list

%x19, %x23, ...

#### reorder buffer (ROB)

| instr<br>num. | PC     | dest. reg   | done? | mispred? /<br>except? |
|---------------|--------|-------------|-------|-----------------------|
| 14            | 0x1233 | %rbx / %x23 |       |                       |
| 15            | 0x1239 | %rax / %x30 |       |                       |
| 16            | 0x1242 | %rcx / %x31 |       |                       |
| 17            | 0x1244 | %rcx / %x32 |       |                       |
| 18            | 0x1248 | %rdx / %x34 |       |                       |
| 19            | 0x1249 | %rax / %x38 |       |                       |
| 20            | 0x1254 | PC          |       |                       |
| 21            | 0x1260 | %rcx / %x17 |       |                       |
|               | •••    | •••         |       |                       |
| 31            | 0x129f | %rax / %x12 |       |                       |
|               |        |             |       |                       |
|               |        |             |       |                       |

reorder buffer contains instructions started, but not fully finished

new entries created on rename (not enough space? stall rename stage)


place newly started instruction at end of buffer

remember at least its destination register (both architectural and physical versions)

phys → arch. reg for new instrs

| arch. reg | phys. reg            |
|-----------|----------------------|
| %rax      | %x12                 |
| %rcx      | %x17                 |
| %rbx      | %x13                 |
| %rdx      | <del>%x07</del> %x19 |
| •••       | •••                  |

#### free list

| free phys. reg |
|----------------|
| %x19           |
| %x23           |
| •••            |

#### reorder buffer (ROB)

|                         | instr num. | PC     | dest. reg   | done? | mispred? / except? |
|-------------------------|------------|--------|-------------|-------|--------------------|
| remove here on commit → | 14         | 0x1233 | %rbx / %x23 |       |                    |
|                         | 15         | 0x1239 | %rax / %x30 |       |                    |
|                         | 16         | 0x1242 | %rcx / %x31 |       |                    |
|                         | 17         | 0x1244 | %rcx / %x32 |       |                    |
|                         | 18         | 0x1248 | %rdx / %x34 |       |                    |
|                         | 19         | 0x1249 | %rax / %x38 |       |                    |
|                         | 20         | 0x1254 | PC          |       |                    |
|                         | 21         | 0x1260 | %rcx / %x17 |       |                    |
|                         | •••        | •••    | •••         |       |                    |
|                         | 31         | 0x129f | %rax / %x12 |       |                    |
| add here on rename →    | 32         | 0x1230 | %rdx / %x19 |       |                    |

next renamed instruction goes in next slot, etc.
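
The rename step above can be sketched in C. This is only an illustration under stated assumptions: the struct layout, the sizes (`NUM_ARCH`, `ROB_SIZE`, `FREE_MAX`), and the function name `rename_instr` are hypothetical, and a real rename stage also renames source operands, which is omitted here.

```c
#define NUM_ARCH 16   /* number of architectural registers (assumed) */
#define ROB_SIZE 8    /* small reorder buffer, for illustration only */
#define FREE_MAX 64   /* capacity of the free list (assumed) */

struct rob_entry {
    unsigned pc;
    int arch_dest;      /* architectural destination register */
    int phys_dest;      /* physical register allocated for it */
    int done;
    int mispredicted;
};

struct rename_state {
    int map[NUM_ARCH];              /* arch -> phys map for new instrs */
    int free_list[FREE_MAX];        /* queue of free physical registers */
    int free_head, free_count;
    struct rob_entry rob[ROB_SIZE]; /* circular buffer */
    int rob_head, rob_count;        /* rob_head = next entry to commit */
};

/* Rename one instruction that writes arch_dest.
   Returns the ROB slot used, or -1 if the rename stage must stall. */
int rename_instr(struct rename_state *s, unsigned pc, int arch_dest) {
    if (s->rob_count == ROB_SIZE || s->free_count == 0)
        return -1;                              /* not enough space: stall */

    /* take the next register off the free list */
    int phys = s->free_list[s->free_head];
    s->free_head = (s->free_head + 1) % FREE_MAX;
    s->free_count--;

    /* later instructions reading arch_dest now use the new register */
    s->map[arch_dest] = phys;

    /* new entry goes in the next slot at the end of the buffer */
    int slot = (s->rob_head + s->rob_count) % ROB_SIZE;
    s->rob[slot] = (struct rob_entry){ pc, arch_dest, phys, 0, 0 };
    s->rob_count++;
    return slot;
}
```

With the free list holding %x19 and %x23 as in the tables, renaming an instruction that writes %rdx would map %rdx to %x19 and record the pair %rdx / %x19 in the new ROB entry.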


phys → arch. reg for new instrs

| arch. reg | phys. reg            |
|-----------|----------------------|
| %rax      | %x12                 |
| %rcx      | %x17                 |
| %rbx      | %x13                 |
| %rdx      | <del>%x07</del> %x19 |
| •••       | •••                  |

#### free list

| free phys. reg  |
|-----------------|
| <del>%x19</del> |
| %x13            |
| •••             |

#### reorder buffer (ROB)

|                         | instr num. | PC     | dest. reg   | done? | mispred? / except? |
|-------------------------|------------|--------|-------------|-------|--------------------|
| remove here on commit → | 14         | 0x1233 | %rbx / %x24 |       |                    |
|                         | 15         | 0x1239 | %rax / %x30 |       |                    |
|                         | 16         | 0x1242 | %rcx / %x31 |       |                    |
|                         | 17         | 0x1244 | %rcx / %x32 |       |                    |
|                         | 18         | 0x1248 | %rdx / %x34 |       |                    |
|                         | 19         | 0x1249 | %rax / %x38 |       |                    |
|                         | 20         | 0x1254 | PC          |       |                    |
|                         | 21         | 0x1260 | %rcx / %x17 |       |                    |
|                         | •••        | •••    | •••         |       |                    |
|                         | 31         | 0x129f | %rax / %x12 |       |                    |

phys → arch. reg for new instrs

| arch. reg | phys. reg            |
|-----------|----------------------|
| %rax      | %x12                 |
| %rcx      | %x17                 |
| %rbx      | %x13                 |
| %rdx      | <del>%x07</del> %x19 |
| •••       | •••                  |

#### free list

| free phys. reg  |
|-----------------|
| <del>%x19</del> |
| %x13            |
| •••             |

#### reorder buffer (ROB)

|                         | instr num. | PC     | dest. reg   | done? | mispred? / except? |
|-------------------------|------------|--------|-------------|-------|--------------------|
| remove here on commit → | 14         | 0x1233 | %rbx / %x24 |       |                    |
|                         | 15         | 0x1239 | %rax / %x30 |       |                    |
|                         | 16         | 0x1242 | %rcx / %x31 | ✓     |                    |
|                         | 17         | 0x1244 | %rcx / %x32 |       |                    |
|                         | 18         | 0x1248 | %rdx / %x34 | ✓     |                    |
|                         | 19         | 0x1249 | %rax / %x38 | ✓     |                    |
|                         | 20         | 0x1254 | PC          |       |                    |
|                         | 21         | 0x1260 | %rcx / %x17 |       |                    |
|                         | •••        | •••    | •••         |       |                    |
|                         | 31         | 0x129f | %rax / %x12 | ✓     |                    |

instructions marked done in reorder buffer when computed but not removed ('committed') yet

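phys → arch. reg for new instrs

| arch. reg | phys. reg            |
|-----------|----------------------|
| %rax      | %x12                 |
| %rcx      | %x17                 |
| %rbx      | %x13                 |
| %rdx      | <del>%x07</del> %x19 |
| •••       | •••                  |

phys → arch. reg for committed

| arch. reg | phys. reg |
|-----------|-----------|
| %rax      | %x30      |
| %rcx      | %x28      |
| %rbx      | %x23      |
| %rdx      | %x21      |
| •••       | •••       |

#### free list

| free phys. reg  |
|-----------------|
| <del>%x19</del> |
| %x13            |
| •••             |

#### reorder buffer (ROB)

|                         | instr num. | PC     | dest. reg   | done? | mispred? / except? |
|-------------------------|------------|--------|-------------|-------|--------------------|
| remove here on commit → | 14         | 0x1233 | %rbx / %x24 |       |                    |
|                         | 15         | 0x1239 | %rax / %x30 |       |                    |
|                         | 16         | 0x1242 | %rcx / %x31 |       |                    |
|                         | 17         | 0x1244 | %rcx / %x32 |       |                    |
|                         | 18         | 0x1248 | %rdx / %x34 |       |                    |
|                         | 19         | 0x1249 | %rax / %x38 |       |                    |
|                         | 20         | 0x1254 | PC          |       |                    |
|                         | 21         | 0x1260 | %rcx / %x17 |       |                    |
|                         | •••        | •••    | •••         |       |                    |
|                         | 31         | 0x129f | %rax / %x12 |       |                    |

commit stage tracks architectural to physical register map for committed instructions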

phys → arch. reg for committed

| arch. reg | phys. reg            |
|-----------|----------------------|
| %rax      | %x30                 |
| %rcx      | %x28                 |
| %rbx      | <del>%x23</del> %x24 |
| %rdx      | %x21                 |
| •••       | •••                  |

#### free list

| free phys. reg  |
|-----------------|
| <del>%x19</del> |
| %x13            |
| •••             |
| %x23            |

#### reorder buffer (ROB)

|                         | instr num. | PC     | dest. reg   | done? | mispred? / except? |
|-------------------------|------------|--------|-------------|-------|--------------------|
| remove here on commit → | 15         | 0x1239 | %rax / %x30 |       |                    |
|                         | 16         | 0x1242 | %rcx / %x31 |       |                    |
|                         | 17         | 0x1244 | %rcx / %x32 |       |                    |
|                         | 18         | 0x1248 | %rdx / %x34 |       |                    |
|                         | 19         | 0x1249 | %rax / %x38 |       |                    |
|                         | 20         | 0x1254 | PC          |       |                    |
|                         | 21         | 0x1260 | %rcx / %x17 |       |                    |
|                         | •••        | •••    | •••         |       |                    |
|                         | 31         | 0x129f | %rax / %x12 |       |                    |
|                         | 32         | 0x1230 | %rdx / %x19 |       |                    |

when next-to-commit instruction is done: update this register map and free register list, and remove instr. from reorder buffer

## reorder buffer: commit mispredict (one way)

phys → arch. reg for new instrs

| arch. reg | phys. reg |
|-----------|-----------|
| %rax      | %x12      |
| %rcx      | %x17      |
| %rbx      | %x13      |
| %rdx      | %x19      |
| •••       | •••       |

free list

| free phys. reg  |
|-----------------|
| <del>%x19</del> |
| %x13            |
| •••             |

phys → arch. reg for committed

| arch. reg | phys. reg            |
|-----------|----------------------|
| %rax      | <del>%x30</del> %x38 |
| %rcx      | <del>%x31</del> %x32 |
| %rbx      | <del>%x23</del> %x24 |
| %rdx      | <del>%x21</del> %x34 |
| •••       | •••                  |

reorder buffer (ROB)

| instr num. | PC     | dest. reg   | done? | mispred? / except? |
|------------|--------|-------------|-------|--------------------|
| 14         | 0x1233 | %rbx / %x24 | ✓     |                    |
| 15         | 0x1239 | %rax / %x30 | ✓     |                    |
| 16         | 0x1242 | %rcx / %x31 | ✓     |                    |
| 17         | 0x1244 | %rcx / %x32 | ✓     |                    |
| 18         | 0x1248 | %rdx / %x34 | ✓     |                    |
| 19         | 0x1249 | %rax / %x38 | ✓     |                    |
| 20         | 0x1254 | PC          | ✓     | ✓                  |
| 21         | 0x1260 | %rcx / %x17 |       |                    |
| •••        | •••    | •••         |       |                    |
| 31         | 0x129f | %rax / %x12 | ✓     |                    |
| 32         | 0x1230 | %rdx / %x19 |       |                    |

when committing a mispredicted instruction...

this is where we undo mispredicted instructions
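
A rough C sketch of this commit-time recovery follows. It assumes one possible design, not necessarily the one on the slides: the names (`struct core`, `commit_one`, `spec_map`, `committed_map`) and sizes are hypothetical, a destination of -1 stands for an instruction with no register destination (such as the branch whose destination is shown as PC), and squashed instructions simply hand their physical registers back to the free list.

```c
#include <string.h>

#define NUM_ARCH 4    /* architectural registers tracked (assumed) */
#define ROB_SIZE 8    /* small reorder buffer, for illustration only */
#define FREE_MAX 16   /* capacity of the free list (assumed) */

struct rob_entry { int arch_dest, phys_dest, done, mispredicted; };

struct core {
    int spec_map[NUM_ARCH];       /* arch -> phys, for new instrs */
    int committed_map[NUM_ARCH];  /* arch -> phys, for committed instrs */
    int free_list[FREE_MAX];
    int free_count;
    struct rob_entry rob[ROB_SIZE];
    int rob_head, rob_count;      /* rob_head = next entry to commit */
};

/* Try to commit the oldest instruction.
   Returns 1 if it committed normally, 0 if nothing could commit,
   -1 if it committed and was a misprediction (younger instrs undone). */
int commit_one(struct core *c) {
    if (c->rob_count == 0)
        return 0;
    struct rob_entry *e = &c->rob[c->rob_head];
    if (!e->done)
        return 0;

    if (e->arch_dest >= 0) {
        /* the previously committed physical register is no longer needed */
        c->free_list[c->free_count++] = c->committed_map[e->arch_dest];
        c->committed_map[e->arch_dest] = e->phys_dest;
    }
    c->rob_head = (c->rob_head + 1) % ROB_SIZE;
    c->rob_count--;

    if (!e->mispredicted)
        return 1;

    /* undo everything younger than the mispredicted instruction */
    while (c->rob_count > 0) {
        struct rob_entry *y = &c->rob[c->rob_head];
        if (y->arch_dest >= 0)
            c->free_list[c->free_count++] = y->phys_dest;
        c->rob_head = (c->rob_head + 1) % ROB_SIZE;
        c->rob_count--;
    }
    /* new instructions restart from the committed register map */
    memcpy(c->spec_map, c->committed_map, sizeof c->spec_map);
    return -1;
}
```

The cost of this approach is that recovery only starts once the mispredicted instruction reaches the head of the reorder buffer.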


#### better? alternatives

can take snapshots of register map on each branch

- don't need to reconstruct the table
- (but how to efficiently store them)

can reconstruct register map before we commit the branch instruction

- need to let reorder buffer be accessed even more?

can track more/different information in reorder buffer












free regs

| free phys. reg |
|----------------|
| X19            |
| X23            |
| •••            |

for new instrs

| arch. reg | phys. reg |
|-----------|-----------|
| RAX       | X15       |
| RCX       | X17       |
| RBX       | X13       |
| RDX       | X07       |
| •••       | •••       |

for complete instrs

| arch. reg | phys. reg         |
|-----------|-------------------|
| RAX       | X21               |
| RCX       | <del>X2</del> X32 |
| RBX       | X48               |
| RDX       | X37               |
| •••       | •••               |

| instr num. | PC     | dest. reg | done? | except? |
|------------|--------|-----------|-------|---------|
| 17         | 0x1244 | RCX / X32 | ✓     |         |
| 18         | 0x1248 | RDX / X34 |       |         |
| 19         | 0x1249 | RAX / X38 | ✓     |         |
| 20         | 0x1254 | R8 / X05  |       |         |
| 21         | 0x1260 | R8 / X06  |       |         |
| •••        | •••    | •••       |       |         |



free regs

| free phys. reg |
|----------------|
| X19            |
| X23            |
| •••            |

for new instrs

| arch. reg | phys. reg |
|-----------|-----------|
| RAX       | X15       |
| RCX       | X17       |
| RBX       | X13       |
| RDX       | X07       |
| •••       | •••       |

for complete instrs

| arch. reg | phys. reg         |
|-----------|-------------------|
| RAX       | X21               |
| RCX       | <del>X2</del> X32 |
| RBX       | X48               |
| RDX       | X37               |
| •••       | •••               |

| instr num.    | PC                | dest. reg | done? | except? |
|---------------|-------------------|-----------|-------|---------|
| <del>17</del> | <del>0x1244</del> | RCX / X32 | ✓     |         |
| 18            | 0x1248            | RDX / X34 |       |         |
| 19            | 0x1249            | RAX / X38 | ✓     |         |
| 20            | 0x1254            | R8 / X05  | ✓     | ✓       |
| 21            | 0x1260            | R8 / X06  |       |         |
| •••           | •••               | •••       |       |         |



free regs

| free phys. reg |
|----------------|
| X19            |
| X23            |
| •••            |

for new instrs

| arch. reg | phys. reg |
|-----------|-----------|
| RAX       | X15       |
| RCX       | X17       |
| RBX       | X13       |
| RDX       | X07       |
| •••       | •••       |

for complete instrs

| arch. reg | phys. reg          |
|-----------|--------------------|
| RAX       | <del>X21</del> X38 |
| RCX       | <del>X2</del> X32  |
| RBX       | X48                |
| RDX       | <del>X37</del> X34 |
| •••       | •••                |

| instr num. | PC     | dest. reg | done? | except? |
|------------|--------|-----------|-------|---------|
| 17         | 0x1244 | RCX / X32 | ✓     |         |
| 18         | 0x1248 | RDX / X34 | ✓     |         |
| 19         | 0x1249 | RAX / X38 | ✓     |         |
| 20         | 0x1254 | R8 / X05  | ✓     | ✓       |
| 21         | 0x1260 | R8 / X06  |       |         |
| •••        | •••    | •••       |       |         |










### handling memory accesses?

one idea:

list of done + uncommitted loads + stores

execute load early + double-check on commit

- have data cache watch for changes to addresses on list
- if changed, treat like branch misprediction

loads check list of stores so you read back own values

- actually finish store on commit
- maybe treat like branch misprediction if conflict?

### the open-source BROOM pipeline











#### better data-flow




### beyond 1-bit predictor

devote more space to storing history

main goal: rare exceptions don't immediately change prediction

example: branch taken 99% of the time

1-bit predictor: wrong about 2% of the time

1% when branch not taken

1% of taken branches right after branch not taken

new predictor: wrong about 1% of the time

1% when branch not taken

### 2-bit saturating counter



branch always taken: value increases to 'strongest' taken value


branch almost always taken, then not taken once: still predicted as taken

# example

| address  | instruction    |
|----------|----------------|
| 0x40041B | movq $4,%rax   |
| 0x400423 | •••            |
| 0x400429 | decq %rax      |
| 0x40042A | jnz 0x400423   |
| 0x40042B | •••            |

| iter. | table before | prediction | outcome   | table after |
|-------|--------------|------------|-----------|-------------|
| 1     | 01           | not taken  | taken     | 10          |
| 2     | 10           | taken      | taken     | 11          |
| 3     | 11           | taken      | taken     | 11          |
| 4     | 11           | taken      | not taken | 10          |
| 1     | 10           | taken      | taken     | 11          |
| 2     | 11           | taken      | taken     | 11          |
| 3     | 11           | taken      | taken     | 11          |
| 4     | 11           | taken      | not taken | 10          |
| 1     | 10           | taken      | taken     | 11          |
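
The table can be reproduced with a few lines of C. This is a minimal sketch: the function names are made up for illustration, and the loop branch is modeled as taken on every iteration except the last, as in the trace above.

```c
/* 2-bit saturating counter: 00 and 01 predict not taken, 10 and 11 predict taken. */
int predict_taken(unsigned counter) {
    return counter >= 2;
}

/* Move one step toward 11 on taken, toward 00 on not taken, without wrapping. */
unsigned update_counter(unsigned counter, int taken) {
    if (taken)
        return counter < 3 ? counter + 1 : 3;
    return counter > 0 ? counter - 1 : 0;
}

/* Run a loop-closing branch for `loops` executions of a loop with `iters`
   iterations each. Returns the number of mispredictions; *counter carries
   the table entry across calls. */
int count_mispredictions(unsigned *counter, int iters, int loops) {
    int wrong = 0;
    for (int l = 0; l < loops; l++) {
        for (int i = 1; i <= iters; i++) {
            int taken = (i < iters);   /* last iteration falls through */
            if (predict_taken(*counter) != taken)
                wrong++;
            *counter = update_counter(*counter, taken);
        }
    }
    return wrong;
}
```

Starting from 01 with 4 iterations, the first pass through the loop mispredicts twice (iterations 1 and 4) and every later pass mispredicts only the final, not-taken iteration.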

### generalizing saturating counters

2-bit counter: ignore one exception to taken/not taken

3-bit counter: ignore more exceptions

 $000 \leftrightarrow 001 \leftrightarrow 010 \leftrightarrow 011 \leftrightarrow 100 \leftrightarrow 101 \leftrightarrow 110 \leftrightarrow 111$ 

000-011: not taken

100-111: taken

#### exercise

use 2-bit predictor on this loop, executed in outer loop (not shown) many, many times

what is the conditional branch misprediction rate?

```
int i = 0;
while (true) {
  if (i % 3 == 0) goto next;
next:
  i += 1;
  if (i == 50) break;
}
```

### branch patterns

```
i = 4;
do {
     i -= 1;
} while (i != 0);
```

typical pattern for jump to top of do-while above: TTTN TTTN TTTN TTTN... (T = taken, N = not taken)

goal: take advantage of recent pattern to make predictions

- just saw 'NTTTNT'? predict T next
- 'TNTTTN'? predict T next
- 'TTNTTT'? predict N next














#### recent pattern to prediction?

easy cases:

```
just saw TTTTTT: predict T
just saw NNNNNN: predict N
just saw TNTNTN: predict T
hard cases:
    predict T? loop with many iterations
    (NTTTTTTTNTTTTTTTTTT...)
    predict T? if statement mostly taken
    (TTTTNTTNTTTTTTTTTTTT...)
    predict N? loop with 5 iterations
    (NTTTTNTTTTNTTTTNTTTTNTT...)
```





















### history of history

actual outcome from commit(?) stage



| iter. | branch to pat. tbl before | pat. to counter before | predict   | actual    | counter after | branch to pat. tbl after |
|-------|---------------------------|------------------------|-----------|-----------|---------------|--------------------------|
| 1     | TTTN                      | 01                     | not taken | taken     | 10            | TTNT                     |
| 2     | TTNT                      | 01                     | not taken | taken     | 10            | TNTT                     |
| 3     | TNTT                      | 11                     | taken     | taken     | 11            | NTTT                     |
| 4     | NTTT                      | 01                     | not taken | not taken | 00            | TTTN                     |
| 1     | TTTN                      | 10                     | taken     | taken     | 11            | TTNT                     |

prediction to fetch stage
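
The two-level lookup (branch to recent pattern, pattern to counter) might look like the following C sketch. The table sizes, the modulo hash in `branch_index`, and all names are assumptions for illustration; 4 bits of history are kept per branch entry, with T stored as 1 and N as 0.

```c
#define HIST_BITS 4          /* outcomes remembered per branch (assumed) */
#define BRANCH_ENTRIES 16    /* size of the branch-to-pattern table (assumed) */

struct local_predictor {
    unsigned char history[BRANCH_ENTRIES];   /* branch -> recent pattern */
    unsigned char counters[1 << HIST_BITS];  /* pattern -> 2-bit counter */
};

/* Hypothetical hash of the branch address into the first table. */
static unsigned branch_index(unsigned long pc) {
    return (unsigned)(pc % BRANCH_ENTRIES);
}

/* Predict taken (1) or not taken (0) for the branch at pc. */
int local_predict(const struct local_predictor *p, unsigned long pc) {
    unsigned char pattern = p->history[branch_index(pc)];
    return p->counters[pattern] >= 2;
}

/* Record the actual outcome: adjust the counter chosen by the old pattern,
   then shift the outcome into the branch's history. */
void local_update(struct local_predictor *p, unsigned long pc, int taken) {
    unsigned char *h = &p->history[branch_index(pc)];
    unsigned char *c = &p->counters[*h];
    if (taken && *c < 3)
        (*c)++;
    if (!taken && *c > 0)
        (*c)--;
    *h = (unsigned char)(((*h << 1) | (taken ? 1 : 0)) & ((1 << HIST_BITS) - 1));
}
```

If every counter starts at 01 and the history starts at TTTN, the TTTN loop pattern is mispredicted three times on the first pass and not at all afterwards, since each of the four patterns is always followed by the same outcome.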

### local patterns and collisions (1)

```
i = 10000;
do {
    p = malloc(...);
    if (p == NULL) goto error; // BRANCH 1
    ...
} while (i-- != 0); // BRANCH 2
```

what if branch 1 and branch 2 hash to same table entry?

pattern: TNTNTNTNTNTNTNTNT...

actually no problem to predict!

### local patterns and collisions (2)

```
i = 10000;
do {
    if (i % 2 == 0) goto skip; // BRANCH 1
    ...
    p = malloc(...);
    if (p == NULL) goto error; // BRANCH 2
skip: ...
} while (i-- != 0); // BRANCH 3
```

what if branch 1 and branch 2 and branch 3 hash to same table entry?

pattern: TT NNT TT NNT TT NNT... (even i: branches 1 and 3 taken; odd i: branches 1 and 2 not taken, branch 3 taken)

also no problem to predict!

## local patterns and collisions (3)

```
i = 10000;
do {
    if (A) goto one; // BRANCH 1
one:
    if (B) goto two; // BRANCH 2
two:
    if (A || B) goto three; // BRANCH 3
    if (A && B) goto three; // BRANCH 4
three:
    ... // changes A, B
} while (i-- != 0);
```

what if branch 1-4 hash to same table entry?

better for prediction of branch 3 and 4

#### global history predictor: idea

one predictor idea: ignore the PC

just record taken/not-taken pattern for all branches

lookup in big table like for local patterns
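
As a sketch of this idea in C: a single history register shared by all branches selects a 2-bit counter, and the PC is never consulted. The 4-bit history length and the names are assumptions chosen for illustration.

```c
#define GHIST_BITS 4   /* number of recent outcomes remembered (assumed) */

struct global_predictor {
    unsigned history;                         /* last GHIST_BITS outcomes, T = 1 */
    unsigned char counters[1 << GHIST_BITS];  /* pattern -> 2-bit counter */
};

/* Predict the next conditional branch, whichever branch it is. */
int global_predict(const struct global_predictor *g) {
    return g->counters[g->history] >= 2;
}

/* Update the counter for the current pattern, then shift in the outcome. */
void global_update(struct global_predictor *g, int taken) {
    unsigned char *c = &g->counters[g->history];
    if (taken && *c < 3)
        (*c)++;
    if (!taken && *c > 0)
        (*c)--;
    g->history = ((g->history << 1) | (taken ? 1u : 0u)) & ((1u << GHIST_BITS) - 1);
}
```

For a combined outcome stream that repeats T, T, N, N, T, every 4-outcome window is followed by a unique next outcome, so this sketch stops mispredicting after a short warm-up.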


# global history predictor (1)



pattern table (excerpt):

| pattern | counter |
|---------|---------|
| •••     | •••     |
| TNNT    | 10      |
| TNTN    | 11      |
| •••     | •••     |
| TTTN    | 10      |
| •••     | •••     |

| iter./branch | history before | counter before | predict | outcome   | counter after | history after |
|--------------|----------------|----------------|---------|-----------|---------------|---------------|
| 0/mod 2      | NTTT           | 10             | taken   | taken     | 11            | TTTT          |
| 0/loop       | TTTT           |                |         | taken     |               | TTTT          |
| 1/mod 2      | TTTT           |                |         | not taken |               | TTTN          |
| 1/error      | TTTN           |                |         | not taken |               | TTNN          |
| 1/loop       | TNNT           |                |         | taken     |               | NNTT          |
| 2/mod 2      | NNTT           |                |         | taken     |               | NTTT          |
| 2/loop       | TTTT           |                |         | taken     |               | TTTT          |

(figure: global history register selects a 2-bit counter from the pattern table; prediction goes to fetch stage; outcome from commit(?) stage updates the history)

#### correlating predictor

global history and local info good together

one idea: combine history register + PC ("gshare")
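
One way to combine the two, sketched in C: XOR the global history into the branch address bits and use the result as the table index, which is the gshare scheme described by McFarling. The table size `TABLE_BITS` and the function name are assumptions for illustration.

```c
#define TABLE_BITS 10   /* log2 of the counter table size (assumed) */

/* Index into a table of 2-bit counters from both the PC and the global history. */
unsigned gshare_index(unsigned long pc, unsigned history) {
    return (unsigned)((pc ^ history) & ((1u << TABLE_BITS) - 1));
}
```

The same branch reaches different counters under different recent histories, and two branches that share a history usually reach different counters because their addresses differ.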



#### mixing predictors

different predictors good at different times

one idea: have two predictors, + predictor to predict which is right
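
A minimal C sketch of the "predictor of predictors": a 2-bit counter that drifts toward whichever component predictor was right when the two disagreed. The names are hypothetical, and a real design would keep a table of such choosers indexed by branch address rather than a single one.

```c
/* 2-bit chooser: 0 or 1 means trust predictor A, 2 or 3 means trust predictor B. */
struct chooser {
    unsigned char counter;
};

/* Pick one of the two component predictions (each 1 = taken, 0 = not taken). */
int choose(const struct chooser *c, int pred_a, int pred_b) {
    return c->counter >= 2 ? pred_b : pred_a;
}

/* Train the chooser only when exactly one component was right. */
void chooser_update(struct chooser *c, int pred_a, int pred_b, int taken) {
    int a_ok = (pred_a == taken);
    int b_ok = (pred_b == taken);
    if (b_ok && !a_ok && c->counter < 3)
        c->counter++;
    if (a_ok && !b_ok && c->counter > 0)
        c->counter--;
}
```

When both components agree the chooser is left alone, so it only learns from the cases where the choice actually matters.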



### loop count predictors (1)

```
for (int i = 0; i < 64; ++i) ...
```

can we predict this perfectly with predictors we've seen?

yes — local or global history with 64 entries

but this is very important — more efficient way?

### loop count predictors (2)

loop count predictor idea: look for NNNNNNT+repeat (or TTTTTN+repeat)

track for each possible loop branch:

- how many repeated Ns (or Ts) so far
- how many repeated Ns (or Ts) last time before one T (or N)
- something to indicate this pattern is useful?
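
The per-branch state listed above could be sketched in C as follows. This is a guess at one workable policy, not a description of any real CPU: the field names are made up, only the TTTTTN form is handled, and the "useful" indicator is simply whether the last two run lengths matched.

```c
/* State tracked for one possible loop branch (TTT...TN repeating). */
struct loop_entry {
    int current_run;  /* repeated Ts so far */
    int last_run;     /* repeated Ts last time, before the one N */
    int confident;    /* set when the last two run lengths were equal */
};

/* Predict taken until the run is as long as it was last time. */
int loop_predict(const struct loop_entry *e) {
    if (e->confident && e->current_run == e->last_run)
        return 0;   /* expect the loop to exit now */
    return 1;
}

/* Record the actual outcome of the branch. */
void loop_update(struct loop_entry *e, int taken) {
    if (taken) {
        e->current_run++;
        return;
    }
    e->confident = (e->current_run == e->last_run);
    e->last_run = e->current_run;
    e->current_run = 0;
}
```

Assuming the 64-iteration loop compiles to a backward branch taken 63 times and then not taken once, this sketch mispredicts the exit on the first two executions of the loop and predicts every branch correctly from the third execution on, using three small integers instead of 64 history entries.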

known to be used on Intel

#### benchmark results

from 1993 paper
(not representative of modern workloads?)
rate for conditional branches on benchmark
variable table sizes

#### 2-bit ctr + local history

from McFarling, "Combining Branch Predictors" (1993)



## 2-bit (bimodal) + local + global hist

from McFarling, "Combining Branch Predictors" (1993)



## global + hash(global+PC) (gshare/gselect)

from McFarling, "Combining Branch Predictors" (1993)



#### real BP?

details of modern CPU's branch predictors often not public but...

#### Google Project Zero blog post with reverse engineered details

```
https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html
for RE'd BTB size
```

https://xania.org/201602/haswell-and-ivy-btb

#### reverse engineering Haswell BPs

#### branch target buffer

- 4-way, 4096 entries
- ignores bottom 4 bits of PC?
- hashes PC to index by shifting + XOR
- seems to store 32 bit offset from PC (not all 48+ bits of virtual addr)

#### indirect branch predictor

- like the global history + PC predictor we showed, but...
- uses history of recent branch addresses instead of taken/not taken
- keeps some info about last 29 branches

what about conditional branches??? loops???

couldn't find a reasonable source