# exercise: throughput/latency (1)

```
cycle #                          0 1 2 3 4 5 6 7 8
0x100: add %r8, %r9              F D E M W
0x108: mov 0x1234(%r10), %r11      F D E M W
0x110: ...
...
```

suppose cycle time is 500 ps

exercise: latency of one instruction?

A. 100 ps B. 500 ps C. 2000 ps D. 2500 ps E. something else

exercise: throughput overall?

A. 1 instr/100 ps B. 1 instr/500 ps C. 1 instr/2000ps D. 1 instr/2500 ps

E. something else

# exercise: throughput/latency (2)

```
cycle #                        0  1  2  3  4  5  6  7  8  9  10
0x100: add %r8, %r9            F1 F2 D1 D2 E1 E2 M1 M2 W1 W2
0x108: mov 0x1234(%r10), %r11     F1 F2 D1 D2 E1 E2 M1 M2 W1 W2
0x110: ...
```

double number of pipeline stages (to 10) + decrease cycle time from 500 ps to 250 ps — throughput?

A. 1 instr/100 ps B. 1 instr/250 ps C. 1 instr/1000ps D. 1 instr/5000 ps

E. something else
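The two exercises above come down to the same arithmetic, sketched here under the assumption of an ideal pipeline (one instruction enters every cycle, no stalls). The function name `pipeline_timing` is illustrative and not part of the slides.

```python
def pipeline_timing(stages: int, cycle_time_ps: int) -> tuple[int, int]:
    """Return (latency_ps, ps_per_instruction) for an ideal pipeline.

    Assumes one instruction enters the pipeline every cycle and nothing stalls.
    """
    latency_ps = stages * cycle_time_ps   # one instruction visits every stage
    ps_per_instruction = cycle_time_ps    # one instruction finishes per cycle
    return latency_ps, ps_per_instruction


for stages, cycle in [(5, 500), (10, 250)]:
    latency, per_instr = pipeline_timing(stages, cycle)
    print(f"{stages} stages @ {cycle} ps: latency {latency} ps, "
          f"throughput 1 instr/{per_instr} ps")
```

Doubling the stage count while halving the cycle time leaves latency unchanged and doubles throughput, at least in this idealized model.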













### diminishing returns: uneven split

Can we split up some logic (e.g. adder) arbitrarily?

Probably not...



#### a data hazard

```
// initially %r8 = 800,
// %r9 = 900, etc.
addq %r8, %r9 // R8 + R9 -> R9
addq %r9, %r8 // R9 + R8 -> R8
addq ...
addq ...
```



|       | fetch | fetch/decode |    | decode/execute |       |    | execute/memory |    | memory/writeback |    |
|-------|-------|--------------|----|----------------|-------|----|----------------|----|------------------|----|
| cycle | PC    | rA           | rB | R[rA]          | R[rB] | rB | sum            | rB | sum              | rB |
| 0     | 0x0   |              |    |                |       |    |                |    |                  |    |
| 1     | 0x2   | 8            | 9  |                |       |    |                |    |                  |    |
| 2     |       | 9            | 8  | 800            | 900   | 9  |                |    |                  |    |
| 3     |       |              |    | 900            | 800   | 8  | 1700           | 9  |                  |    |
| 4     |       |              |    |                |       |    | 1700           | 8  | 1700             | 9  |
| 5     |       |              |    |                |       |    |                |    | 1700             | 8  |

In cycle 3 the decode/execute register holds R[%r9] = 900 for the second addq, but it should be 1700: the first addq has not yet written %r9 back.

#### data hazard

```
addq %r8, %r9 // (1)
addq %r9, %r8 // (2)
```

| step# | pipeline implementation | ISA specification   |
|-------|-------------------------|---------------------|
| 1     | read r8, r9 for (1)     | read r8, r9 for (1) |
| 2     | read r9, r8 for (2)     | write r9 for (1)    |
| 3     | write r9 for (1)        | read r9, r8 for (2) |
| 4     | write r8 for (2)        | write r8 for (2)    |

pipeline reads older value...

instead of value ISA says was just written
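The difference between the two step orders can be checked with a small sketch. It assumes the initial values %r8 = 800 and %r9 = 900 used elsewhere in these slides; the function names and the dictionary standing in for the register file are illustrative only.

```python
def run_isa_order(regs):
    """Execute both addqs one at a time, as the ISA specifies."""
    r = dict(regs)
    r["r9"] = r["r8"] + r["r9"]   # (1) addq %r8, %r9
    r["r8"] = r["r9"] + r["r8"]   # (2) addq %r9, %r8 sees the new r9
    return r


def run_pipeline_order(regs):
    """Follow the pipeline's step order: both reads happen before either write."""
    r = dict(regs)
    a1, b1 = r["r8"], r["r9"]     # step 1: read r8, r9 for (1)
    a2, b2 = r["r9"], r["r8"]     # step 2: read r9, r8 for (2); r9 is stale
    r["r9"] = a1 + b1             # step 3: write r9 for (1)
    r["r8"] = a2 + b2             # step 4: write r8 for (2)
    return r


initial = {"r8": 800, "r9": 900}
print("ISA order:     ", run_isa_order(initial))
print("pipeline order:", run_pipeline_order(initial))
```

Both orders agree on %r9, but the pipeline order computes %r8 from the old %r9.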

### data hazard compiler solution

```
addq %r8, %r9
nop
nop
addq %r9, %r8
```

one solution: change the ISA

all addqs take effect three instructions later
(assuming can read register value while it is being written back)

make it compiler's job

problem: recompile every time processor changes?

# stalling/nop pipeline diagram (1)

```
cycle #          0 1 2 3 4 5 6 7 8
addq %r8, %r9    F D E M W
nop                F D E M W
nop                  F D E M W
addq %r9, %r8          F D E M W
```

assumption:

if writing register value, register file will return that value for reads

not actually the way the register file worked in the single-cycle CPU
(e.g. can read old %r9 while writing new %r9)

# stalling/nop pipeline diagram (2)



if we didn't modify the register file, we'd need an extra cycle

#### data hazard hardware solution

```
addq %r8, %r9
// hardware inserts: nop
// hardware inserts: nop
addq %r9, %r8
```

how about having the hardware add the nops?

called stalling

extra logic:

sometimes don't change PC

sometimes put do-nothing values in pipeline registers
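How many nops the hardware must insert can be sketched with a small model. This is not the hardware's actual logic. It assumes the five-stage pipeline above, no forwarding, and a register file that returns a value being written in the same cycle. The function name `stall_cycles` and the `(sources, destination)` encoding are made up for this example.

```python
def stall_cycles(program):
    """Return the number of stall cycles inserted before each instruction.

    program: list of (sources, destination) tuples of register names.
    An instruction may decode in the cycle its producer is in writeback.
    """
    decode_cycle = []      # cycle in which each instruction completes decode
    last_writer = {}       # register -> index of the latest instruction writing it
    stalls = []
    for index, (sources, dest) in enumerate(program):
        earliest = decode_cycle[-1] + 1 if decode_cycle else 1
        ready = earliest
        for reg in sources:
            if reg in last_writer:
                # producer's writeback is three cycles after its decode (D, E, M, W)
                writeback = decode_cycle[last_writer[reg]] + 3
                ready = max(ready, writeback)
        stalls.append(ready - earliest)
        decode_cycle.append(ready)
        if dest is not None:
            last_writer[dest] = index
    return stalls


# addq %r8, %r9 followed directly by addq %r9, %r8
print(stall_cycles([(("r8", "r9"), "r9"), (("r9", "r8"), "r8")]))
```

With an unrelated instruction between the two addqs, the same model asks for one stall cycle instead of two.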

### opportunity

```
// initially %r8 = 800,
// %r9 = 900, etc.
0x0: addq %r8, %r9
0x2: addq %r9, %r8
...
```

|       | fetch | fetch/decode |    | decode/execute       |       |    | execute/memory |    | memory/writeback |    |
|-------|-------|--------------|----|----------------------|-------|----|----------------|----|------------------|----|
| cycle | PC    | rA           | rB | R[rA]                | R[rB] | rB | sum            | rB | sum              | rB |
| 0     | 0x0   |              |    |                      |       |    |                |    |                  |    |
| 1     | 0x2   | 8            | 9  |                      |       |    |                |    |                  |    |
| 2     |       | 9            | 8  | 800                  | 900   | 9  |                |    |                  |    |
| 3     |       |              |    | 900 (should be 1700) | 800   | 8  | 1700           | 9  |                  |    |
| 4     |       |              |    |                      |       |    | 1700           | 8  | 1700             | 9  |
| 5     |       |              |    |                      |       |    |                |    | 1700             | 8  |

### exploiting the opportunity



### opportunity 2

```
// initially %r8 = 800,
// %r9 = 900, etc.
0x0: addq %r8, %r9
0x2: nop
0x3: addq %r9, %r8
```

|       | fetch | fetch/decode |    | decode/execute       |       |    | execute/memory |    | memory/writeback |    |
|-------|-------|--------------|----|----------------------|-------|----|----------------|----|------------------|----|
| cycle | PC    | rA           | rB | R[rA]                | R[rB] | rB | sum            | rB | sum              | rB |
| 0     | 0x0   |              |    |                      |       |    |                |    |                  |    |
| 1     | 0x2   | 8            | 9  |                      |       |    |                |    |                  |    |
| 2     | 0x3   |              |    | 800                  | 900   | 9  |                |    |                  |    |
| 3     |       | 9            | 8  |                      |       |    | 1700           | 9  |                  |    |
| 4     |       |              |    | 900 (should be 1700) | 800   | 8  |                |    | 1700             | 9  |
| 5     |       |              |    |                      |       |    | 1700           | 8  |                  |    |
| 6     |       |              |    |                      |       |    |                |    | 1700             | 8  |

# exercise: forwarding paths

| cycle #       | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---------------|---|---|---|---|---|---|---|---|---|
| addq %r8, %r9 | F | D | E | M | W |   |   |   |   |

in subq, %r8 is \_\_\_\_\_ addq.

in xorq, %r9 is \_\_\_\_\_ addq.

in andq, %r9 is \_\_\_\_\_ addq.

in andq, %r9 is \_\_\_\_\_ xorq.

A: not forwarded from

B-D: forwarded to decode from {execute, memory, writeback} stage of

### unsolved problem



combine stalling and forwarding to resolve hazard

assumption in diagram: hazard detected in subq's decode stage (since easier than detecting it in fetch stage)

### solvable problem



### why can't we...



clock cycle needs to be long enough
to go through data cache AND
to go through math circuits!
(which we were trying to avoid by putting them in separate stages)

### hazards versus dependencies

dependency — X needs result of instruction Y?

has potential for being messed up by pipeline
(since part of X may run before Y finishes)

hazard — will it not work in some pipeline?

before extra work is done to "resolve" hazards
multiple kinds: so far, data hazards

```
addq %rax, %rbx
subq %rax, %rcx
movq $100, %rcx
addq %rcx, %r10
addq %rbx, %r10
```
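The dependencies in the listing above can be enumerated mechanically. The sketch below only finds read-after-write dependencies on the most recent writer of each register. Whether each one becomes a hazard depends on the pipeline, which this code does not model. The `(sources, destination)` encoding and the function name are assumptions made for this example.

```python
def raw_dependencies(program):
    """Return (consumer_index, producer_index, register) triples."""
    last_writer = {}   # register -> index of the latest instruction writing it
    found = []
    for index, (sources, dest) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                found.append((index, last_writer[reg], reg))
        if dest is not None:
            last_writer[dest] = index
    return found


PROGRAM = [
    (("rax", "rbx"), "rbx"),   # 0: addq %rax, %rbx
    (("rax", "rcx"), "rcx"),   # 1: subq %rax, %rcx
    ((), "rcx"),               # 2: movq $100, %rcx
    (("rcx", "r10"), "r10"),   # 3: addq %rcx, %r10
    (("rbx", "r10"), "r10"),   # 4: addq %rbx, %r10
]

for consumer, producer, reg in raw_dependencies(PROGRAM):
    print(f"instruction {consumer} needs %{reg} from instruction {producer}")
```

Instruction 4 depends on instruction 0, four instructions earlier. In a short pipeline that distance may mean no hazard at all, even though the dependency exists.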

## pipeline with different hazards

example: 4-stage pipeline: fetch/decode/execute+memory/writeback

```
                  // 4 stage   // 5 stage
addq %rax, %r8    //           // W
subq %rax, %r9    // W         // M
xorq %rax, %r10   // EM        // E
andq %r8, %r11    // D         // D
```

addq/andq is a hazard with 5-stage pipeline

addq/andq is **not** a hazard with 4-stage pipeline

more hazards with more pipeline stages

split execute into two stages: F/D/E1/E2/M/W

result only available near end of second execute stage

where does forwarding, stalls occur?

| cycle #              | 0 | 1 | 2  | 3  | 4 | 5 | 6 | 7 | 8 |
|----------------------|---|---|----|----|---|---|---|---|---|
| (1) addq %rcx, %r9   | F | D | E1 | E2 | M | W |   |   |   |
| (2) addq %r9, %rbx   |   |   |    |    |   |   |   |   |   |
| (3) addq %rax, %r9   |   |   |    |    |   |   |   |   |   |
| (4) movq %r9, (%rbx) |   |   |    |    |   |   |   |   |   |
| (5) movq %rcx, %r9   |   |   |    |    |   |   |   |   |   |

without stalls (hazard unresolved):

| cycle #          | 0 | 1 | 2  | 3  | 4  | 5  | 6  | 7 | 8 |
|------------------|---|---|----|----|----|----|----|---|---|
| addq %rcx, %r9   | F | D | E1 | E2 | M  | W  |    |   |   |
| addq %r9, %rbx   |   | F | D  | E1 | E2 | M  | W  |   |   |
| addq %rax, %r9   |   |   | F  | D  | E1 | E2 | M  | W |   |
| movq %r9, (%rbx) |   |   |    | F  | D  | E1 | E2 | M | W |

with stalls:

| cycle #          | 0 | 1 | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9 | 10 |
|------------------|---|---|----|----|----|----|----|----|----|---|----|
| addq %rcx, %r9   | F | D | E1 | E2 | M  | W  |    |    |    |   |    |
| addq %r9, %rbx   |   | F | D  | D  | E1 | E2 | M  | W  |    |   |    |
| addq %rax, %r9   |   |   | F  | F  | D  | E1 | E2 | M  | W  |   |    |
| movq %r9, (%rbx) |   |   |    |    | F  | D  | E1 | E2 | M  | W |    |
| movq %rcx, %r9   |   |   |    |    |    | F  | D  | E1 | E2 | M | W  |

#### control hazard

0x00: cmpq %r8, %r9

0x08: je 0xFFFF

0x10: addq %r10, %r11

|       | fetch | fetch→decode |    | decode→execute |       | execute→writeback |
|-------|-------|--------------|----|----------------|-------|-------------------|
| cycle | PC    | rA           | rB | R[rA]          | R[rB] | result            |
| 0     | 0x0   |              |    |                |       |                   |
| 1     | 0x8   | 8            | 9  |                |       |                   |
| 2     | ???   |              |    | 800            | 900   |                   |
| 3     | ???   |              |    |                |       | less than         |

???: 0xFFFF if R[8] = R[9]; 0x10 otherwise

```
       cmpq %r8, %r9
       jne LABEL             // not taken
       xorq %r10, %r11
       movq %r11, 0(%r12)

                     cycle #  0 1 2 3 4 5 6 7 8
cmpq %r8, %r9                 F D E M W
jne LABEL                       F D E M W
(do nothing)                      F D E M W
(do nothing)                        F D E M W
xorq %r10, %r11                       F D E M W
movq %r11, 0(%r12)                      F D E M
```

cycle 2 (cmpq's E): compare sets flags

cycle 3 (jne's E): compute if jump goes to LABEL

cycle 4 (xorq's F): use computed result

#### making guesses

```
       cmpq %r8, %r9
       jne LABEL
       xorq %r10, %r11
       movq %r11, 0(%r12)
       ...

LABEL: addq %r8, %r9
       imul %r13, %r14
```

speculate (guess): jne won't go to LABEL

right: 2 cycles faster!; wrong: undo guess before too late

# jXX: speculating right (1)

```
       cmpq %r8, %r9
       jne LABEL
       xorq %r10, %r11
       movq %r11, 0(%r12)
       ...
LABEL: addq %r8, %r9
       imul %r13, %r14
       ...

                     cycle #  0 1 2 3 4 5 6 7 8
cmpq %r8, %r9                 F D E M W
jne LABEL                       F D E M W
xorq %r10, %r11                   F D E M W
movq %r11, 0(%r12)                  F D E M W
```

## jXX: speculating wrong

```
                     cycle #  0 1 2 3 4 5 6 7 8
cmpq %r8, %r9                 F D E M W
jne LABEL                       F D E M W
xorq %r10, %r11                   F D           instruction "squashed"
(inserted nop)                        E M W
movq %r11, 0(%r12)                  F           instruction "squashed"
(inserted nop)                        D E M W
LABEL: addq %r8, %r9                  F D E M W
imul %r13, %r14                         F D E M
```

### "squashed" instructions

on misprediction need to undo partially executed instructions

mostly: remove from pipeline registers

more complicated pipelines: replace written values in cache/registers/etc.

# performance

#### hypothetical instruction mix

| kind          | portion | cycles (predict not-taken) | cycles (stall) |
|---------------|---------|----------------------------|----------------|
| taken jXX     | 3%      | 3                          | 3              |
| non-taken jXX | 5%      | 1                          | 3              |
| others        | 92%     | 1*                         | 1*             |
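The average cycles per instruction for this mix is a weighted sum. The sketch below takes a taken jXX as 3 cycles under both strategies and a non-taken jXX as 1 cycle (predict not-taken) or 3 cycles (stall). It treats "others" as exactly 1 cycle, so it ignores whatever the asterisk in the table qualifies. The names `MIX` and `average_cycles` are illustrative.

```python
MIX = [
    # (kind, portion, cycles with predict-not-taken, cycles with stalling)
    ("taken jXX", 0.03, 3, 3),
    ("non-taken jXX", 0.05, 1, 3),
    ("others", 0.92, 1, 1),
]


def average_cycles(mix, column):
    """Weighted average of the given cycles column (2 = predict, 3 = stall)."""
    return sum(row[1] * row[column] for row in mix)


print(f"predict not-taken: {average_cycles(MIX, 2):.2f} cycles/instruction")
print(f"stall:             {average_cycles(MIX, 3):.2f} cycles/instruction")
```

Under these assumptions the gap between the two strategies comes entirely from the 5% of instructions that are non-taken jumps.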

#### static branch prediction

forward (target > PC) not taken; backward taken

intuition: loops:

    LOOP: ...
          ...
          je LOOP

    LOOP: ...
          jne SKIP_LOOP
          ...
          jmp LOOP
    SKIP_LOOP:

### exercise: static prediction

```
.global foo
foo:
    xor %eax, %eax         // eax <- 0
foo_loop_top:
    test $0x1, %edi
    je foo_loop_bottom     // if ((edi & 1) == 0) goto foo_loop_bottom
    add %edi, %eax
foo_loop_bottom:
    jg foo_loop_top        // if (edi > 0) goto foo_loop_top
    ret
```

suppose %edi = 3 (initially)
and using forward-not-taken, backwards-taken strategy:

how many mispredictions for je? for jg?











#### example










#### collisions?

two branches could have same hashed PC

nothing in table tells us about this

versus direct-mapped cache: had *tag bits* to tell

is it worth it?

adding tag bits makes table *much* larger and/or slower

but does anything go wrong when there's a collision?

#### collision results

possibility 1: both branches usually taken

no actual conflict: prediction is better(!)

possibility 2: both branches usually not taken

no actual conflict: prediction is better(!)

possibility 3: one branch taken, one not taken

performance probably worse

## 1-bit predictor for loops

predicts first and last iteration wrong

example: branch to beginning — but same for branch from beginning to end

everything else correct

use 1-bit predictor on this loop

executed in outer loop (not shown) many, many times

what is the conditional jump misprediction rate?

```
int i = 0;
while (true) {
    if (i % 3 == 0)
        goto next;
    ...
next:
    i += 1;
    if (i == 50)
        break;
}
```

| i = | branch | pred | outcome | correct? |
|-----|--------|------|---------|----------|
| 0   | mod 3  | ???  | T       | ???      |
| 1   | == 50  | ???  | F       | ???      |
| 1   | mod 3  | T    | F       | ✗        |
| 2   | == 50  | F    | F       | ✓        |
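One way to check an answer to this exercise is to simulate it. The sketch below assumes each of the two conditional jumps has its own 1-bit entry (no collisions) and reports the last of several executions of the loop, so the arbitrary starting state has washed out. The function names are illustrative.

```python
def loop_branch_outcomes():
    """Yield (branch_name, taken) for one full execution of the loop."""
    i = 0
    while True:
        yield ("mod 3", i % 3 == 0)      # if (i % 3 == 0) goto next;
        i += 1
        done = (i == 50)
        yield ("== 50", done)            # if (i == 50) break;
        if done:
            break


def one_bit_mispredictions(runs=10):
    """Return (mispredictions, branches) for the last of `runs` executions."""
    last_outcome = {"mod 3": False, "== 50": False}   # arbitrary initial state
    wrong = total = 0
    for _ in range(runs):
        wrong = total = 0
        for name, taken in loop_branch_outcomes():
            if last_outcome[name] != taken:   # prediction = previous outcome
                wrong += 1
            last_outcome[name] = taken
            total += 1
    return wrong, total


wrong, total = one_bit_mispredictions()
print(f"{wrong} mispredictions out of {total} conditional jumps")
```

Almost all of the mispredictions come from the `i % 3` branch, whose outcome changes twice in every three iterations.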

### beyond local 1-bit predictor

can predict using more historical info

whether taken last several times

example: taken 3 out of 4 last times → predict taken

example: if last few are T, N, T, N, T, N; next is probably T

makes two branches hashing to same entry not so bad

outcomes of last N conditional jumps ("global history")

take into account conditional jumps in surrounding code

example: loops with if statements will have regular patterns

## predicting ret: ministack of return addresses

predicting ret: ministack in processor registers

push on ministack on call; pop on ret

ministack overflows? discard oldest, mispredict it later

stack in memory:

| baz saved registers |
|---------------------|
| baz return address  |
| bar saved registers |
| bar return address  |
| foo local variables |
| foo saved registers |
| foo return address  |
| foo saved registers |

(partial?) stack in CPU registers:

| baz return address |
|--------------------|
| bar return address |
| foo return address |

## 4-entry return address stack

4-entry return address stack in CPU



next saved return address from call

on call: increment index, save return address in that slot

on ret: read prediction from index, decrement index
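The call/ret rule above can be sketched as a small circular buffer. The class and method names are illustrative. The wrap-around on overflow follows the "discard oldest, mispredict it later" behavior described earlier.

```python
class ReturnAddressStack:
    """Fixed-size return address predictor (a sketch, not a real design)."""

    def __init__(self, entries=4):
        self.slots = [None] * entries
        self.index = 0

    def on_call(self, return_address):
        # increment index (wrapping), then save the return address there;
        # once the stack is full this overwrites the oldest entry
        self.index = (self.index + 1) % len(self.slots)
        self.slots[self.index] = return_address

    def predict_ret(self):
        # read the prediction from the current slot, then decrement index
        prediction = self.slots[self.index]
        self.index = (self.index - 1) % len(self.slots)
        return prediction


ras = ReturnAddressStack()
for address in [1, 2, 3, 4, 5]:          # five nested calls overflow four slots
    ras.on_call(address)
print([ras.predict_ret() for _ in range(5)])
```

The four most recent returns are predicted correctly. The fifth prediction is stale, because the entry for the outermost call was overwritten.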

## branch target buffer

what if we can't decode LABEL from machine code for jmp LABEL or jle LABEL fast?

will happen in more complex pipelines

what if we can't decode that there's a RET, CALL, etc. fast?

## BTB: cache for branch targets

| idx  | valid | tag   | ofst | type | target   | (more info?) |
|------|-------|-------|------|------|----------|--------------|
| 0x00 | 1     | 0x400 | 5    | Jxx  | 0x3FFFF3 | ...          |
| 0x01 | 1     | 0x401 | C    | JMP  | 0x401035 | ...          |
| 0x02 | 0     |       |      |      |          |              |
| 0x03 | 1     | 0x400 | 9    | RET  |          | ...          |
| ...  | ...   | ...   | ...  | ...  | ...      | ...          |
| 0xFF | 1     | 0x3FF | 8    | CALL | 0x404033 | ...          |

| valid |     |
|-------|-----|
| 1     | ... |
| 0     | ... |
| 0     | ... |
| 0     | ... |
| ...   | ... |
| 0     | ... |

0x3FFFF3: movq %rax, %rsi

0x3FFFF7: pushq %rbx

0x3FFFF8: call 0x404033

0x400001: popq %rbx

0x400003: cmpq %rbx, %rax

0x400005: jle 0x3FFFF3

...

0x400031: ret

...
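A lookup in a table like this can be sketched as follows. The 12-bit tag / 8-bit index / 4-bit offset split is an assumption inferred from the example rows (for instance, 0x400005 maps to tag 0x400, idx 0x00, ofst 5). The class, its method names, and the one-entry-per-index organization are illustrative only.

```python
def split_address(pc):
    """Split a PC into (tag, index, offset) using the assumed 12/8/4-bit layout."""
    return pc >> 12, (pc >> 4) & 0xFF, pc & 0xF


class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}          # index -> (tag, offset, kind, target)

    def insert(self, pc, kind, target=None):
        tag, index, offset = split_address(pc)
        self.entries[index] = (tag, offset, kind, target)   # evicts any old entry

    def lookup(self, pc):
        """Return (kind, target) if a valid entry matches this PC, else None."""
        tag, index, offset = split_address(pc)
        entry = self.entries.get(index)
        if entry is None or entry[0] != tag or entry[1] != offset:
            return None
        return entry[2], entry[3]


btb = BranchTargetBuffer()
btb.insert(0x400005, "Jxx", 0x3FFFF3)    # jle 0x3FFFF3
btb.insert(0x3FFFF8, "CALL", 0x404033)   # call 0x404033
print(btb.lookup(0x400005))
print(btb.lookup(0x400001))              # popq: not a branch, so no entry matches
```

Unlike the untagged direction predictor discussed earlier, the tag check here means a different address that maps to the same index gets no prediction rather than a wrong target.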

#### indirect branch prediction

```
jmp *%rax or jmp *(%rax, %rcx, 8)
```

BTB can provide a prediction

but can do better with more context

example: predict based on other recent computed jumps

good for polymorphic method calls

table lookup with Hash(last few jmps) instead of Hash(this jmp)

### beyond 1-bit predictor

devote more space to storing history

main goal: rare exceptions don't immediately change prediction

example: branch taken 99% of the time

1-bit predictor: wrong about 2% of the time

1% when branch not taken

1% of taken branches right after branch not taken

new predictor: wrong about 1% of the time

1% when branch not taken

## 2-bit saturating counter

branch always taken: value increases to 'strongest' taken value

branch almost always taken, then not taken once: still predicted as taken

0x40041B: movq $4, %rax

0x400423: ...

decq %rax

0x40042A: jnz 0x400423

0x40042B: ...

| iter. | table before | prediction | outcome   | table after |
|-------|--------------|------------|-----------|-------------|
| 1     | 01           | not taken  | taken     | 10          |
| 2     | 10           | taken      | taken     | 11          |
| 3     | 11           | taken      | taken     | 11          |
| 4     | 11           | taken      | not taken | 10          |
| 1     | 10           | taken      | taken     | 11          |
| 2     | 11           | taken      | taken     | 11          |
| 3     | 11           | taken      | taken     | 11          |
| 4     | 11           | taken      | not taken | 10          |
| 1     | 10           | taken      | taken     | 11          |
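The table above can be reproduced with a few lines of simulation. This sketch assumes the counter starts at 01 and that values 10 and 11 predict taken. The function name `simulate_two_bit` is illustrative.

```python
def simulate_two_bit(outcomes, counter=0b01):
    """Return rows of (before, predicted_taken, taken, after) for one branch."""
    rows = []
    for taken in outcomes:
        before = counter
        predicted_taken = counter >= 0b10          # 10 and 11 predict taken
        if taken:
            counter = min(counter + 1, 0b11)       # saturate at strongest taken
        else:
            counter = max(counter - 1, 0b00)       # saturate at strongest not taken
        rows.append((format(before, "02b"), predicted_taken, taken,
                     format(counter, "02b")))
    return rows


T, N = True, False
for before, predicted, taken, after in simulate_two_bit([T, T, T, N] * 2 + [T]):
    print(before,
          "taken    " if predicted else "not taken",
          "taken    " if taken else "not taken",
          after)
```

After the first pass the only misprediction left is the loop exit, because a single not-taken outcome moves the counter from 11 to 10 and the prediction stays taken.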

## generalizing saturating counters

2-bit counter: ignore one exception to taken/not taken

3-bit counter: ignore more exceptions

 $000 \leftrightarrow 001 \leftrightarrow 010 \leftrightarrow 011 \leftrightarrow 100 \leftrightarrow 101 \leftrightarrow 110 \leftrightarrow 111$ 

000-011: not taken

100-111: taken
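A sketch of the same idea for any counter width, assuming the taken/not-taken split is at the midpoint as in the 3-bit ranges above. `nbit_predict` and `nbit_update` are hypothetical names.

```c
#include <stdbool.h>

/* Predict taken when the counter is in the upper half of its range
   (for bits = 3: 100-111 taken, 000-011 not taken). */
bool nbit_predict(unsigned counter, unsigned bits) {
    return counter >= (1u << (bits - 1));
}

/* Step toward the outcome, saturating at 0 and at 2^bits - 1. */
unsigned nbit_update(unsigned counter, unsigned bits, bool taken) {
    unsigned max = (1u << bits) - 1;
    if (taken)
        return counter < max ? counter + 1 : max;
    return counter > 0 ? counter - 1 : 0;
}
```

Starting from 111, a 3-bit counter still predicts taken after three not-taken outcomes in a row (111, 110, 101, 100) and only switches on the fourth.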

```
// use 2-bit predictor on this loop
//     executed in outer loop (not shown) many, many times
// what is the conditional branch misprediction rate?
int i = 0;
while (true) {
  if (i % 3 == 0) goto next;
next:
  i += 1;
  if (i == 50) break;
}
```

## branch patterns

```
i = 4;
do {
     i -= 1;
} while (i != 0);
```

typical pattern for jump to top of do-while above:

TTTN TTTN TTTN TTTN... (T = taken, N = not taken)

goal: take advantage of recent pattern to make predictions

just saw 'NTTTNT'? predict T next

'TNTTTN'? predict T; 'TTNTTT'? predict N next














## recent pattern to prediction?

```
easy cases:
just saw TTTTTT: predict T
just saw NNNNNN: predict N
just saw TNTNTN: predict T
hard cases:
    predict T? loop with many iterations
    (NTTTTTTTNTTTTTTTTTT...)
    predict T? if statement mostly taken
    (TTTTNTTNTTTTTTTTTTTT...)
    predict N? loop with 5 iterations
    (NTTTTNTTTTNTTTTNTTTTNTT...)
```





















# history of history

actual outcome from commit(?) stage



| iter. | branch to pat. tbl before | pat. to counter before | predict | actual | pat. to counter after | branch to pat. tbl after |
|-------|---------------------------|------------------------|---------|--------|-----------------------|--------------------------|
| 1     | TTTN       | 01      | not taken | taken  | 10      | TTNT      |
| 2     | TTNT       | 01      | not taken | taken  | 10      | TNTT      |
| 3     | TNTT       | 11      | taken     | taken  | 11      | NTTT      |
| 4     | NTTT       | 01      | not taken | not taken | 00   | TTTN      |
| 1     | TTTN       | 10      | taken     | taken  | 11      | TTNT      |

prediction to fetch stage
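A minimal sketch of the two tables, assuming 4 bits of history per branch and a 256-entry branch-to-pattern table indexed by the low bits of the PC. The sizes and the names `LocalPredictor`, `local_predict`, `local_update` and `local_mispredicts` are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_BITS 4
#define PATTERN_ENTRIES (1u << HISTORY_BITS)
#define BRANCH_ENTRIES 256                  /* assumed table size */

typedef struct {
    uint8_t history[BRANCH_ENTRIES];        /* branch -> recent pattern (1 = taken) */
    uint8_t counter[PATTERN_ENTRIES];       /* pattern -> 2-bit saturating counter */
} LocalPredictor;

static unsigned branch_index(uint64_t pc) {
    return (unsigned)(pc % BRANCH_ENTRIES); /* hypothetical hash of the PC */
}

bool local_predict(const LocalPredictor *p, uint64_t pc) {
    return p->counter[p->history[branch_index(pc)]] >= 2;
}

void local_update(LocalPredictor *p, uint64_t pc, bool taken) {
    unsigned b = branch_index(pc);
    uint8_t pattern = p->history[b];
    uint8_t *c = &p->counter[pattern];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    /* shift the new outcome into this branch's pattern */
    p->history[b] = (uint8_t)(((pattern << 1) | (taken ? 1 : 0)) & (PATTERN_ENTRIES - 1));
}

/* Run the TTTN loop branch `rounds` times and count mispredictions. */
int local_mispredicts(LocalPredictor *p, uint64_t pc, int rounds) {
    int wrong = 0;
    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < 4; i++) {
            bool actual = (i != 3);
            if (local_predict(p, pc) != actual) wrong++;
            local_update(p, pc, actual);
        }
    }
    return wrong;
}
```

Once the counters for TTTN, TTNT and TNTT have warmed up, every outcome of the TTTN loop is predicted correctly, including the final not-taken one.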

# local patterns and collisions (1)

```
i = 10000;
do {
    p = malloc(...);
    if (p == NULL) goto error; // BRANCH 1
    ...
} while (i-- != 0); // BRANCH 2
```

what if branch 1 and branch 2 hash to same table entry?

pattern: TNTNTNTNTNTNTNTNT...

actually no problem to predict!

# local patterns and collisions (2)

```
i = 10000;
do {
    if (i % 2 == 0) goto skip; // BRANCH 1
        ...
    p = malloc(...);
    if (p == NULL) goto error; // BRANCH 2
skip: ...
} while (i-- != 0); // BRANCH 3
```

what if branch 1 and branch 2 and branch 3 hash to same table entry?

pattern (even i: TT; odd i: NNT): TTNNT TTNNT TTNNT TTNNT...

also no problem to predict!

# local patterns and collisions (3)

```
i = 10000;
do {
    if (A) goto one;           // BRANCH 1
one:
    if (B) goto two;           // BRANCH 2
two:
    if (A || B) goto three;    // BRANCH 3
    if (A && B) goto three;    // BRANCH 4
three:
    ...                        // changes A, B
} while (i-- != 0);
```

what if branch 1-4 hash to same table entry?

better for prediction of branch 3 and 4

### global history predictor: idea

one predictor idea: ignore the PC

just record taken/not-taken pattern for all branches

lookup in big table like for local patterns
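A sketch of this idea, assuming a 4-bit history register and 2-bit counters. `GlobalPredictor`, `global_predict` and `global_update` are names made up for this example; note that neither function takes the PC.

```c
#include <stdbool.h>
#include <stdint.h>

#define GHIST_BITS 4

typedef struct {
    unsigned history;                        /* one register shared by all branches */
    uint8_t counter[1u << GHIST_BITS];       /* pattern -> 2-bit saturating counter */
} GlobalPredictor;

bool global_predict(const GlobalPredictor *g) {
    return g->counter[g->history] >= 2;
}

void global_update(GlobalPredictor *g, bool taken) {
    uint8_t *c = &g->counter[g->history];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
    /* every branch, whatever its address, shifts into the same history */
    g->history = ((g->history << 1) | (taken ? 1u : 0u)) & ((1u << GHIST_BITS) - 1);
}
```

Feeding it a repeating sequence such as T T N N T (three different branches interleaved) leads to correct predictions once the counters warm up, because each 4-outcome window in that sequence is followed by only one possible outcome.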

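#### global history predictor (1)

branch history register: updated with outcome from commit(?) stage

history pattern selects a counter; counter gives prediction to fetch stage

| pat  | counter |
|------|---------|
| NNNN | 00      |
| NNNT | 00      |
| NTTT | 10      |
| TNNN | 01      |
| TNNT | 10      |
| TNTN | 11      |
| TTTN | 10      |
| TTTT | 11      |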




| iter./branch | history before | counter before | predict | outcome   | counter after | history after |
|--------------|----------------|----------------|---------|-----------|---------------|---------------|
| 0/mod 2      | NTTT           | 10             | taken   | taken     | 11            | TTTT          |
| 0/loop       | TTTT           |                |         | taken     |               | TTTT          |
| 1/mod 2      | TTTT           |                |         | not taken |               | TTTN          |
| 1/error      | TTTN           |                |         | not taken |               | TTNN          |
| 1/loop       | TTNN           |                |         | taken     |               | TNNT          |
| 2/mod 2      | TNNT           |                |         | taken     |               | NNTT          |
| 2/loop       | NNTT           |                |         | taken     |               | NTTT          |


### correlating predictor

global history and local info good together

one idea: combine history register + PC ("gshare")
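One common way to combine the two is to XOR the history register with low bits of the PC to form the counter-table index. The sketch below assumes a 4096-entry table; `gshare_index` and `TABLE_BITS` are illustrative names.

```c
#include <stdint.h>

#define TABLE_BITS 12     /* assumed: 4096-entry counter table */

/* Index into the counter table from both the branch address and the
   global taken/not-taken history. */
unsigned gshare_index(uint64_t pc, unsigned history) {
    unsigned mask = (1u << TABLE_BITS) - 1;
    return ((unsigned)pc ^ history) & mask;
}
```

Two branches with the same recent history now usually use different counters, and one branch uses different counters for different histories.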



### mixing predictors

different predictors good at different times

one idea: have two predictors, + predictor to predict which is right
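A sketch of the "predictor to predict which is right", assuming it is itself a 2-bit saturating counter. `Chooser`, `combined_predict` and `chooser_update` are hypothetical names; the two underlying predictions are passed in as plain booleans.

```c
#include <stdbool.h>

typedef struct {
    unsigned choice;      /* 0..3; values >= 2 mean "trust predictor B" */
} Chooser;

bool combined_predict(const Chooser *c, bool pred_a, bool pred_b) {
    return c->choice >= 2 ? pred_b : pred_a;
}

/* Only move the chooser when exactly one of the predictors was right. */
void chooser_update(Chooser *c, bool pred_a, bool pred_b, bool actual) {
    bool a_right = (pred_a == actual);
    bool b_right = (pred_b == actual);
    if (b_right && !a_right && c->choice < 3) c->choice++;
    if (a_right && !b_right && c->choice > 0) c->choice--;
}
```

When both predictors agree, the chooser has nothing to learn, so it stays put.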



### loop count predictors (1)

```
for (int i = 0; i < 64; ++i)
```

can we predict this perfectly with predictors we've seen?

yes — local or global history with 64 entries

but this is very important — more efficient way?

## loop count predictors (2)

loop count predictor idea: look for NNNNNNT+repeat (or TTTTTN+repeat)

track for each possible loop branch:

how many repeated Ns (or Ts) so far

how many repeated Ns (or Ts) last time before one T (or N)

something to indicate this pattern is useful?

known to be used on Intel
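A sketch of the three tracked fields for the TTTTTN case, assuming "useful" means the last two runs had the same length. `LoopEntry`, `loop_predict` and `loop_update` are made-up names, and this is not a description of how Intel's predictor actually works.

```c
#include <stdbool.h>

typedef struct {
    unsigned current_run;   /* how many repeated Ts so far */
    unsigned last_run;      /* how many repeated Ts last time before one N */
    bool confident;         /* assumption: set when the last two runs matched */
} LoopEntry;

/* Predict taken until the current run reaches the previous run's length. */
bool loop_predict(const LoopEntry *e) {
    if (e->confident && e->current_run == e->last_run)
        return false;       /* expect the loop to exit now */
    return true;
}

void loop_update(LoopEntry *e, bool taken) {
    if (taken) {
        e->current_run++;
        return;
    }
    /* run ended: remember its length and whether it matched the previous one */
    e->confident = (e->current_run == e->last_run);
    e->last_run = e->current_run;
    e->current_run = 0;
}
```

For a loop branch that is taken 63 times and then not taken, the entry mispredicts the exit on the first two passes and then predicts every outcome, using three small fields instead of 64 history entries.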

### benchmark results

from 1993 paper
(not representative of modern workloads?)
rate for conditional branches on benchmark
variable table sizes

### 2-bit ctr + local history

from McFarling, "Combining Branch Predictors" (1993)



# 2-bit (bimodal) + local + global hist

from McFarling, "Combining Branch Predictors" (1993)



# global + hash(global+PC) (gshare/gselect)

from McFarling, "Combining Branch Predictors" (1993)



#### real BP?

details of modern CPU's branch predictors often not public but...

Google Project Zero blog post with reverse engineered details:

https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html

for RE'd BTB size:

https://xania.org/201602/haswell-and-ivy-btb

### reverse engineering Haswell BPs

#### branch target buffer

4-way, 4096 entries

ignores bottom 4 bits of PC?

hashes PC to index by shifting + XOR

seems to store 32 bit offset from PC (not all 48+ bits of virtual addr)

#### indirect branch predictor

like the global history + PC predictor we showed, but...

uses history of recent branch addresses instead of taken/not taken

keeps some info about last 29 branches

what about conditional branches??? loops???

couldn't find a reasonable source

# backup slides
