#### Changelog

18 Apr 2023: adjust spacing on control hazard pipeline diagram exercise

18 Apr 2023: fix static prediction exercise: adjust jl to jg, use for\_loop\_top label at bottom correctly, adjust comments to match assembly

18 Apr 2023: predict: repeat last: have assembly use jnz instead of jz

18 Apr 2023: fix static prediction exercise: correctly name forward-NOT-taken, backward TAKEN

#### so far

automating building programs

HW tools for OSes: exceptions, virtual memory

OS security policies: accounts

networking: layers, secure communication

threading+concurrency

HW performance tricks: caching, pipelining

#### last time (1)

data hazard problem: pipelining changes order of register reads/writes

hazard vs. dependency

forwarding to resolve data hazard

value available before it is stored

add 'shortcut' to send to right place

add logic to select shortcut conditionally (compare register #s)

#### last time (2)

forwarding paths + choosing most recent version

combining stalling with forwarding

control hazards: can't figure out next instruction after fetch?

#### anonymous feedback (1)

"Are you going to post a study guide for the final? I think this class is a bit all over the place, so it would be really nice to have maybe a practice exam, or some sort of study guide so that we know how to study. Thanks:)"

```
probably not going to have a practice exam
do now have some exams from pilot posted
likely going to try to look over what's covered in schedule in writing exam
(roughly equal weight for each week of class)
```

#### anonymous feedback (2)

"The exercises in class are actually insanely helpful in helping me understand the topic. Also, you're a goated professor and teach the class so well even though the course material is very difficult."

re: exercises — definitely try to have them for most topics (also helpful for me to know if everyone's lost)

"What are some of your hobbies?" hiking, birding, drawing

#### anonymous feedback (3)

"Some of the TA's do a horrible job of going to their own OH. I sat on discord for over an hour and a half on Monday and none of the scheduled TAs were even on discord. This is not an easy class and when I show up to get help, it should be available as advertised."

This indeed is something that should not happen...

#### quiz Q1

```
cycle time = longest stage time = 500 ps
fetch starts at time 0ps
decode starts at time 500ps
execute starts at time 1000ps
memory starts at time 1500ps
writeback starts at time 2000ps
```

#### quiz Q2

no hazards: once pipeline full start/finish one new instruction every cycle

first instruction finishes after 6 cycles

then 1 instruction finishes every cycle

$$\rightarrow (6 + 99\,999\,999) \text{ cycles}$$

$$\approx 100\,000\,000 \cdot 600 \text{ ps} = 60 \text{ ms}$$
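The Q2 arithmetic can be written as a small calculation. This is only a sketch under the slide's assumptions (no hazards, so one instruction finishes per cycle once the pipeline is full); the function name `pipeline_total_ps` is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Total time in picoseconds for a hazard-free pipeline (assumed model):
 * the first instruction needs `stages` cycles, and every later instruction
 * finishes one cycle after the one before it. */
uint64_t pipeline_total_ps(uint64_t stages, uint64_t instructions,
                           uint64_t cycle_ps) {
    assert(instructions > 0);
    uint64_t cycles = stages + (instructions - 1);
    return cycles * cycle_ps;
}
```

With 6 stages, 100 000 000 instructions and a 600 ps cycle this gives 60 000 003 000 ps, which is the slide's roughly 60 ms.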

#### quiz Q3ab

6 to 5 pipeline stages by combining two stages

cycle time likely to be higher

if third+fourth stage were slowest, maybe slightly less than twice

if third+fourth stage were fastest, maybe only slightly

if not higher, then we started with a poor choice of pipeline stages

higher cycle time $\rightarrow$ lower throughput

assuming stalls are not extremely common, which is likely with forwarding

if pipeline stages are well-balanced before/after change, lower latency

if pipeline stages not well-balanced after, more wasted time

#### quiz Q4

usually fetch new instruction + complete new instruction every cycle

5% take two extra cycles to fetch new instruction; this implies two extra cycles later to complete new instruction

10% take one extra cycle

$$(1 + .05 \times 2 + .1 \times 1) = 1.2 \text{ average cycles per instruction}$$

$\times 1000$ ps = 1200 ps average time per instruction

#### quiz Q5

if no hazards

```
addq %r8, %r9    F1 F2 D  E1 E2 E3 M1 M2 W
xorq %r8, %r11      F1 F2 D  E1 E2 E3 M1 M2 W
subq %r9, %r10         F1 F2 D  E1 E2 E3 M1 M2 W
```

but subq uses %r9 from addq, so it can't start E1 until after addq computes its value

[pipeline diagram; labels: "compare sets flags", "compute if jump goes to LABEL", "use computed result"]

#### making guesses

```
xorq %r10, %r11
movq %r11, 0(%r12)
...

LABEL: addq %r8, %r9
imul %r13, %r14
```

speculate (guess): jne won't go to LABEL

right: 2 cycles faster!; wrong: undo guess before too late

### jXX: speculating right (1)


#### jXX: speculating wrong

[pipeline diagram: the two instructions fetched after the mispredicted jXX are each marked instruction "squashed"; the following instructions continue through E M W]

#### "squashed" instructions

on misprediction need to undo partially executed instructions

mostly: remove from pipeline registers

more complicated pipelines: replace written values in cache/registers/etc.

#### performance

hypothetical instruction mix

predict: $3 \times .03 + 1 \times .05 + 1 \times .92 = 1.06 \text{ cycles/instr.}$

stall: $3 \times .03 + 3 \times .05 + 1 \times .92 = 1.16 \text{ cycles/instr.}$ ($1.16 \div 1.06 \approx 1.09\text{x}$ faster with prediction)
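The two averages are weighted sums over the instruction mix. A minimal sketch of that calculation follows; the type `instr_class` and the function `average_cpi` are illustrative names, and the fractions and cycle counts are the ones in the formulas above.

```c
#include <assert.h>
#include <stddef.h>

/* One class of instructions: what fraction of the mix it is and how many
 * cycles each such instruction costs on average. */
typedef struct {
    double fraction;
    double cycles;
} instr_class;

/* Average cycles per instruction = sum of fraction * cycles over the mix. */
double average_cpi(const instr_class *mix, size_t n) {
    assert(mix != NULL || n == 0);
    double cpi = 0.0;
    for (size_t i = 0; i < n; i++)
        cpi += mix[i].fraction * mix[i].cycles;
    return cpi;
}
```

Calling it with `{.03, 3}, {.05, 1}, {.92, 1}` reproduces the predict case and with `{.03, 3}, {.05, 3}, {.92, 1}` the stall case; dividing the two results gives the speedup.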

### exercise: control hazard timing+forwarding?

with F/D/E/M/W: what is fetched when? what is forwarded?

(2) jne foo (predicted taken, actually not)

(2b) foo: ... (mispred.)

(2c) ... (mispred.)

(3) subq %rax, %r9

(4) call bar

(5) bar: pushq %r9

[pipeline diagram; label: "pass flags"]

### exercise: with different pipeline

```
with F/D/E1/E2/M/W
```

# [solution]: with different pipeline

```
with F/D/E1/E2/M/W

cycle # 0 1 2 3 4 5

(1) addq %rcx, %r9 F D E1 E2 M W
```




#### static branch prediction

forward (target > PC) not taken; backward taken

intuition: loops

#### exercise: static prediction

```
suppose %edi = 3 (initially)
and using forward-not-taken, backwards-taken strategy:
how many mispredictions for je? for jl?
```
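The static rule only needs the branch's own address and its target. A minimal sketch, assuming both addresses are known when the prediction is made; the function name is hypothetical, and treating a jump to its own address as backward is a choice made here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Forward-not-taken, backward-taken: a target above the branch's PC is
 * "forward" and predicted not taken; anything else is predicted taken. */
bool static_predict_taken(uint64_t branch_pc, uint64_t target_pc) {
    bool forward = target_pc > branch_pc;
    return !forward;
}
```

A loop-closing jump back to the top of the loop has a target below its own address, so it is predicted taken on every iteration and is wrong only when the loop exits.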

#### predict: repeat last































#### collisions?

two branches could have same hashed PC

nothing in table tells us about this

versus direct-mapped cache: had *tag bits* to tell

is it worth it?

adding tag bits makes table *much* larger and/or slower

but does anything go wrong when there's a collision?

#### collision results

possibility 1: both branches usually taken; no actual conflict, prediction is better(!)

possibility 2: both branches usually not taken; no actual conflict, prediction is better(!)

possibility 3: one branch taken, one not taken; performance probably worse
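A tagless table makes the collision cases concrete. This is a sketch with made-up parameters: a 16-entry table and a "hash" that simply keeps the low 4 bits of the PC, neither of which comes from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TABLE_BITS 4                  /* assumed table size: 16 entries */
#define TABLE_SIZE (1u << TABLE_BITS)

/* "Repeat last" table with no tag bits: each entry holds one outcome. */
typedef struct {
    bool last_taken[TABLE_SIZE];
} last_outcome_table;

/* Hypothetical hash: the low bits of the PC. */
static unsigned table_index(uint64_t pc) {
    return (unsigned)(pc & (TABLE_SIZE - 1));
}

bool table_predict(const last_outcome_table *t, uint64_t pc) {
    return t->last_taken[table_index(pc)];
}

void table_update(last_outcome_table *t, uint64_t pc, bool taken) {
    t->last_taken[table_index(pc)] = taken;
}
```

Two branches whose PCs share the low 4 bits read and write the same entry. If both are usually taken, each one trains the entry for the other (possibilities 1 and 2); if they disagree, each overwrites the other's outcome (possibility 3).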

# 1-bit predictor for loops

predicts first and last iteration wrong

example: branch to beginning (but same for branch from beginning to end)

everything else correct

#### exercise

```
use 1-bit predictor on this loop
executed in outer loop (not shown) many, many times
```

what is the conditional branch misprediction rate?

```
int i = 0;
while (true) {
  if (i % 3 == 0) goto next;
  ...
next:
  i += 1;
  if (i == 50) break;
}
```
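One way to check an answer is to simulate the predictor. The sketch below makes several modeling assumptions that are not in the slides: each `if` is a single conditional branch that is taken when its condition is true, each branch has its own 1-bit entry (no collisions), and one warm-up pass stands in for the "many, many times" of the outer loop. All names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    long branches;
    long mispredictions;
} branch_stats;

/* 1-bit predictor: predict whatever this branch did last time, then
 * remember the actual outcome. */
static void one_bit_step(bool *last_taken, bool taken, branch_stats *stats) {
    stats->branches++;
    if (*last_taken != taken)
        stats->mispredictions++;
    *last_taken = taken;
}

/* One execution of the inner loop from the exercise. */
static void run_pass(bool *last_if, bool *last_break, branch_stats *stats) {
    int i = 0;
    while (true) {
        one_bit_step(last_if, i % 3 == 0, stats);   /* if (i % 3 == 0) goto next */
        i += 1;
        bool leaving = (i == 50);
        one_bit_step(last_break, leaving, stats);   /* if (i == 50) break */
        if (leaving)
            break;
    }
}

branch_stats simulate_one_bit(int warmup_passes, int measured_passes) {
    assert(warmup_passes >= 0 && measured_passes >= 0);
    bool last_if = false, last_break = false;
    branch_stats discarded = {0, 0}, stats = {0, 0};
    for (int p = 0; p < warmup_passes; p++)
        run_pass(&last_if, &last_break, &discarded);
    for (int p = 0; p < measured_passes; p++)
        run_pass(&last_if, &last_break, &stats);
    return stats;
}
```

Under these assumptions each pass has 100 conditional branches and 36 mispredictions: 34 from the `i % 3` branch, which changes direction two out of every three times, and 2 from the loop exit.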

# beyond local 1-bit predictor

can predict using more historical info

whether taken last several times

example: taken 3 out of 4 last times → predict taken

pattern of how taken recently

example: if last few are T, N, T, N, T, N; next is probably T

makes two branches hashing to same entry not so bad

outcomes of last N conditional jumps ("global history")

take into account conditional jumps in surrounding code

example: loops with if statements will have regular patterns

### predicting ret: ministack of return addresses

predicting ret: ministack in processor registers

push on ministack on call; pop on ret

ministack overflows? discard oldest, mispredict it later



baz return address
bar return address
foo return address

(partial?) stack in CPU registers

# 4-entry return address stack



on call: increment index, save return address in that slot
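A circular buffer is one way to get the "discard oldest" behavior. In the sketch below the call side follows the slide; the ret side (read the current slot, then decrement the index) is an assumption, and all names are made up.

```c
#include <assert.h>
#include <stdint.h>

#define RAS_ENTRIES 4

typedef struct {
    uint64_t slots[RAS_ENTRIES];
    unsigned index;   /* slot holding the most recent return address */
} return_stack;

/* on call: increment index (wrapping around), save return address there */
void ras_on_call(return_stack *r, uint64_t return_address) {
    r->index = (r->index + 1) % RAS_ENTRIES;
    r->slots[r->index] = return_address;
}

/* on ret (assumed): predict the address in the current slot, then step
 * the index back one slot (wrapping around) */
uint64_t ras_on_ret(return_stack *r) {
    assert(r != 0);
    uint64_t predicted = r->slots[r->index];
    r->index = (r->index + RAS_ENTRIES - 1) % RAS_ENTRIES;
    return predicted;
}
```

With five nested calls the fifth overwrites the slot of the first, so the four innermost returns are predicted correctly and the outermost one is mispredicted.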

# 1-cycle fetch?

```
assumption so far:
1 cycle to fetch instruction + identify if jmp, etc.
often not really practical
especially if:
     complex machine code format
     many pipeline stages
```

(future idea) fetching 2+ instructions/cycle

more complex instruction cache

### branch target buffer

will happen in more complex pipelines

what if we can't decode LABEL from machine code for jmp LABEL or jle LABEL fast?

what if we can't decode that there's a RET, CALL, etc. fast?

# BTB: cache for branch targets

| index | tag | offset | target |  |
|------|-------|---|----------|--|
| 0xFF | 0x3FF | 8 | 0x404033 |  |

0x3FFFF3: movq %rax, %rs...

0x3FFFF7: pushq %rbx

0x3FFFF8: call 0x404033

0x400001: popq %rbx

0x400003: cmpq %rbx, %rax

0x400005: jle 0x3FFFF3

...

0x400031: ret

# BTB: cache for branch targets

| index | tag | offset | target |  |
|------|-------|---|----------|--|
| 0x00 | 0x400 | 5 | 0x3FFFF3 |  |
| 0x01 |       |   |          |  |
| ...  |       |   |          |  |

0x3FFFF7: pushq %rbx

0x3FFFF8: call 0x404033

0x400001: popq %rbx

0x400003: cmpq %rbx, %rax

0x400005: jle 0x3FFFF3

...

0x400031: ret
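The table rows are consistent with splitting the branch's address into a 4-bit offset, an 8-bit index and the remaining bits as tag. The sketch below assumes that split and a direct-mapped table; the field widths, the exact-offset match on lookup, and every name are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 4                 /* assumed: 16-byte fetch blocks */
#define INDEX_BITS 8                  /* assumed: 256 entries */
#define BTB_ENTRIES (1u << INDEX_BITS)

typedef struct {
    bool valid;
    uint64_t tag;
    unsigned offset;
    uint64_t target;
} btb_entry;

typedef struct {
    btb_entry entries[BTB_ENTRIES];
} btb;

unsigned btb_offset(uint64_t pc) { return (unsigned)(pc & ((1u << OFFSET_BITS) - 1)); }
unsigned btb_index(uint64_t pc) { return (unsigned)((pc >> OFFSET_BITS) & (BTB_ENTRIES - 1)); }
uint64_t btb_tag(uint64_t pc) { return pc >> (OFFSET_BITS + INDEX_BITS); }

/* Remember where the branch at branch_pc went. */
void btb_record(btb *b, uint64_t branch_pc, uint64_t target) {
    btb_entry *e = &b->entries[btb_index(branch_pc)];
    e->valid = true;
    e->tag = btb_tag(branch_pc);
    e->offset = btb_offset(branch_pc);
    e->target = target;
}

/* Returns true and fills *target if the table has an entry for this PC. */
bool btb_lookup(const btb *b, uint64_t pc, uint64_t *target) {
    assert(target != 0);
    const btb_entry *e = &b->entries[btb_index(pc)];
    if (e->valid && e->tag == btb_tag(pc) && e->offset == btb_offset(pc)) {
        *target = e->target;
        return true;
    }
    return false;
}
```

For the call at 0x3FFFF8 this yields index 0xFF, tag 0x3FF, offset 8; for the jle at 0x400005 it yields index 0x00, tag 0x400, offset 5.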

# indirect branch prediction

for instructions like: jmp *%rax or jmp *(%rax, %rcx, 8)

simple idea: lookup jmp in cache table to see what happened last time

extension: table of (last few jmp instructions, target address)

can predict even when %rax, etc. vary

example: polymorphic method call

idea implemented by Intel's Haswell chips

# backup slides

# exercise: forwarding paths (2)

```
in subq, %r8 is _____ addq.
in subq, %r9 is _____ addq.
in andq, %r9 is _____ subq.
in andq, %r9 is _____ addq.
    A: not forwarded from
```

### beyond 1-bit predictor

devote more space to storing history

main goal: rare exceptions don't immediately change prediction

example: branch taken 99% of the time

1-bit predictor: wrong about 2% of the time

1% when branch not taken

1% of taken branches right after branch not taken

new predictor: wrong about 1% of the time

1% when branch not taken

# 2-bit saturating counter



value increases to 'strongest' taken value

branch almost always taken, then not taken once: still predicted as taken

# example

| 0x40042A | jz 0x400423 |
|----------|-------------|
| 0x40042B | ...         |

| iter. | table<br>before | prediction | outcome   | table<br>after |
|-------|-----------------|------------|-----------|----------------|
| 1     | 01              | not taken  | taken     | 10             |
| 2     | 10              | taken      | taken     | 11             |
| 3     | 11              | taken      | taken     | 11             |
| 4     | 11              | taken      | not taken | 10             |
| 1     | 10              | taken      | taken     | 11             |
| 2     | 11              | taken      | taken     | 11             |
| 3     | 11              | taken      | taken     | 11             |
| 4     | 11              | taken      | not taken | 10             |
| 1     | 10              | taken      | taken     | 11             |
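The table rows follow from two small rules: predict taken when the counter is 10 or 11, and move the counter one step toward the actual outcome without going past 00 or 11. A minimal sketch (the names `counter2_predict` and `counter2_update` are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* 2-bit saturating counter stored as 0..3 (binary 00..11). */
typedef unsigned char counter2;

/* 10 and 11 predict taken; 00 and 01 predict not taken. */
bool counter2_predict(counter2 c) {
    assert(c <= 3);
    return c >= 2;
}

/* Step toward the outcome, saturating at 00 and 11. */
counter2 counter2_update(counter2 c, bool taken) {
    assert(c <= 3);
    if (taken)
        return (counter2)(c < 3 ? c + 1 : 3);
    return (counter2)(c > 0 ? c - 1 : 0);
}
```

Starting from 01 and feeding taken, taken, taken, not taken gives 10, 11, 11, 10, matching iterations 1 to 4 of the table.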

#### generalizing saturating counters

2-bit counter: ignore one exception to taken/not taken

3-bit counter: ignore more exceptions

 $000 \leftrightarrow 001 \leftrightarrow 010 \leftrightarrow 011 \leftrightarrow 100 \leftrightarrow 101 \leftrightarrow 110 \leftrightarrow 111$ 

000-011: not taken

100-111: taken

#### exercise

```
use 2-bit predictor on this loop
executed in outer loop (not shown) many, many times
```

what is the conditional branch misprediction rate?

```
int i = 0;
while (true) {
  if (i % 3 == 0) goto next;
  ...
next:
  i += 1;
  if (i == 50) break;
}
```
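A simulation can check an answer here too. The modeling assumptions are not from the slides: each `if` is one conditional branch that is taken when its condition is true, each branch has its own 2-bit counter starting at 00, and one warm-up pass approximates the steady state. All names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    long branches;
    long mispredictions;
} branch_stats;

/* Predict with a 2-bit saturating counter (0..3), then update it. */
static void two_bit_step(unsigned *counter, bool taken, branch_stats *stats) {
    bool predicted_taken = (*counter >= 2);
    stats->branches++;
    if (predicted_taken != taken)
        stats->mispredictions++;
    if (taken) {
        if (*counter < 3) (*counter)++;
    } else {
        if (*counter > 0) (*counter)--;
    }
}

/* One execution of the inner loop from the exercise. */
static void run_pass(unsigned *ctr_if, unsigned *ctr_break, branch_stats *stats) {
    int i = 0;
    while (true) {
        two_bit_step(ctr_if, i % 3 == 0, stats);    /* if (i % 3 == 0) goto next */
        i += 1;
        bool leaving = (i == 50);
        two_bit_step(ctr_break, leaving, stats);    /* if (i == 50) break */
        if (leaving)
            break;
    }
}

branch_stats simulate_two_bit(int warmup_passes, int measured_passes) {
    assert(warmup_passes >= 0 && measured_passes >= 0);
    unsigned ctr_if = 0, ctr_break = 0;
    branch_stats discarded = {0, 0}, stats = {0, 0};
    for (int p = 0; p < warmup_passes; p++)
        run_pass(&ctr_if, &ctr_break, &discarded);
    for (int p = 0; p < measured_passes; p++)
        run_pass(&ctr_if, &ctr_break, &stats);
    return stats;
}
```

Under these assumptions each pass has 100 conditional branches and 18 mispredictions: the `i % 3` branch is wrong each of the 17 times it is taken, and the loop exit is wrong once.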

# exercise soln (1)

#### branch patterns

```
typical pattern for jump to top of do-while above:
TTTN TTTN TTTN TTTN...(T = taken, N = not taken)
goal: take advantage of recent pattern to make predictions
just saw 'NTTTNT'? predict T next
'TNTTTN'? predict T; 'TTNTTT'? predict N next
```
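One way to use the recent pattern is to keep a short history of outcomes and let it select a 2-bit counter. The sketch below handles a single branch with a 4-outcome history; the sizes, the all-zero starting state, and the names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

#define HISTORY_BITS 4
#define PATTERNS (1u << HISTORY_BITS)

typedef struct {
    unsigned history;                  /* last 4 outcomes, newest in bit 0 */
    unsigned char counters[PATTERNS];  /* one 2-bit counter per pattern */
} local_predictor;

bool local_predict(const local_predictor *p) {
    return p->counters[p->history] >= 2;
}

void local_update(local_predictor *p, bool taken) {
    unsigned char *c = &p->counters[p->history];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
    p->history = ((p->history << 1) | (taken ? 1u : 0u)) & (PATTERNS - 1);
}

/* Feed `periods` repetitions of T T T N; return how many were mispredicted. */
unsigned run_tttn(local_predictor *p, unsigned periods) {
    assert(p != 0);
    unsigned wrong = 0;
    for (unsigned n = 0; n < periods * 4; n++) {
        bool taken = (n % 4 != 3);
        if (local_predict(p) != taken)
            wrong++;
        local_update(p, taken);
    }
    return wrong;
}
```

After a few warm-up repetitions every 4-outcome history is always followed by the same outcome, so the counters settle and the TTTN pattern is predicted without further mispredictions, including the N that a plain 2-bit counter misses.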













#### recent pattern to prediction?

```
easy cases:
just saw TTTTTT: predict T
just saw NNNNNN: predict N
just saw TNTNTN: predict T
hard cases:
TTNTTT
    predict T? loop with many iterations
    (NTTTTTTTNTTTTTTTTT...)
    predict T? if statement mostly taken
```







| iter. | branch to pat. tbl before | pat. to counter before | predict   | actual | pat. to counter after | branch to pat. tbl after |
|-------|---------------------------|------------------------|-----------|--------|-----------------------|--------------------------|
| 1     | TTTN                      | 01                     | not taken | taken  | 10                    | TTNT                     |
| 2     | TTNT                      | 01                     | not taken | taken  | 10                    | TNTT                     |
| 3     | TNTT                      | 11                     | taken     | taken  | 11                    | NTTT                     |

prediction goes to fetch stage

# local patterns and collisions (1)

```
i = 10000;
do {
    p = malloc(...);
    if (p == NULL) goto error; // BRANCH 1
    ...
} while (i-- != 0); // BRANCH 2
```

what if branch 1 and branch 2 hash to same table entry?

pattern: TNTNTNTNTNTNTNTNT...

actually no problem to predict!

# local patterns and collisions (2)

```
i = 10000;
do {
   if (i % 2 == 0) goto skip; // BRANCH 1
        ...
   p = malloc(...);
   if (p == NULL) goto error; // BRANCH 2
skip: ...
} while (i-- != 0); // BRANCH 3
```

what if branch 1 and branch 2 and branch 3 hash to same table entry?

pattern: TTNNTTNNTTNNTTNNTT

also no problem to predict!

# local patterns and collisions (3)

what if branch 1-4 hash to same table entry?

#### global history predictor: idea

one predictor idea: ignore the PC

just record taken/not-taken pattern for all branches

lookup in big table like for local patterns

# global history predictor (1)

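| iter./branch | history before | counter before | predict | outcome   | counter after | history after |
|--------------|----------------|----------------|---------|-----------|---------------|---------------|
| 0/mod 2      | NTTT           | 10             | taken   | taken     | 11            | TTTT          |
| 0/loop       | TTTT           |                |         | taken     |               | TTTT          |
| 1/mod 2      | TTTT           |                |         | not taken |               | TTTN          |
| 1/error      | TTTN           |                |         | not taken |               | TTNN          |
| ...          |                |                |         |           |               |               |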

#### correlating predictor

global history and local info good together

one idea: combine history register + PC ("gshare")
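A common way to combine the two is to XOR the global history with low bits of the PC and use the result to pick a 2-bit counter. The sketch below assumes a 10-bit history and a 1024-entry table; those sizes and all names are illustrative, not details from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GSHARE_BITS 10                /* assumed history length / index width */
#define GSHARE_SIZE (1u << GSHARE_BITS)

typedef struct {
    unsigned history;                     /* recent outcomes, newest in bit 0 */
    unsigned char counters[GSHARE_SIZE];  /* 2-bit saturating counters */
} gshare;

/* Index = global history XOR low bits of the branch's PC. */
unsigned gshare_index(const gshare *g, uint64_t pc) {
    return (g->history ^ (unsigned)pc) & (GSHARE_SIZE - 1);
}

bool gshare_predict(const gshare *g, uint64_t pc) {
    return g->counters[gshare_index(g, pc)] >= 2;
}

void gshare_update(gshare *g, uint64_t pc, bool taken) {
    assert(g != 0);
    unsigned char *c = &g->counters[gshare_index(g, pc)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
    g->history = ((g->history << 1) | (taken ? 1u : 0u)) & (GSHARE_SIZE - 1);
}
```

Because the PC is mixed into the index, two branches that see the same recent history usually land on different counters, while one branch seen after different histories also gets different counters.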



#### mixing predictors

different predictors good at different times

one idea: have two predictors, + predictor to predict which is right



#### loop count predictors (1)

```
for (int i = 0; i < 64; ++i)
```

can we predict this perfectly with predictors we've seen?

yes — local or global history with 64 entries

but this is very important — more efficient way?

### loop count predictors (2)

loop count predictor idea: look for NNNNNNT+repeat (or TTTTTN+repeat)

track for each possible loop branch:

how many repeated Ns (or Ts) so far

how many repeated Ns (or Ts) last time before one T (or N)

something to indicate this pattern is useful?
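The bookkeeping for one loop branch can be sketched as follows. This is a simplification: it tracks only the TTTTTN form, and a single `trained` flag stands in for whatever a real design uses to decide that the pattern is useful. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    unsigned current_run;  /* taken outcomes seen since the last not-taken */
    unsigned last_run;     /* length of the previous completed run */
    bool trained;          /* false until one full run has been observed */
} loop_entry;

/* Predict taken until the current run reaches the previous run's length. */
bool loop_predict(const loop_entry *e) {
    if (e->trained && e->current_run == e->last_run)
        return false;
    return true;
}

void loop_update(loop_entry *e, bool taken) {
    if (taken) {
        e->current_run++;
    } else {
        e->last_run = e->current_run;
        e->current_run = 0;
        e->trained = true;
    }
}

/* Run `loops` executions of a loop whose branch is taken `taken_per_loop`
 * times and then not taken once; return the number of mispredictions. */
unsigned loop_run(loop_entry *e, unsigned taken_per_loop, unsigned loops) {
    assert(e != 0);
    unsigned wrong = 0;
    for (unsigned l = 0; l < loops; l++) {
        for (unsigned n = 0; n <= taken_per_loop; n++) {
            bool taken = (n < taken_per_loop);
            if (loop_predict(e) != taken)
                wrong++;
            loop_update(e, taken);
        }
    }
    return wrong;
}
```

If the 64-iteration loop compiles to a bottom branch that is taken 63 times and then falls through (an assumption about the generated code), the first execution mispredicts only the exit and later executions are predicted without error, using two counters instead of a 64-outcome history.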

known to be used on Intel

#### benchmark results

```
from 1993 paper
(not representative of modern workloads?)
rate for conditional branches on benchmark
variable table sizes
```

#### 2-bit ctr + local history





# 2-bit (bimodal) + local + global hist

from McFarling, "Combining Branch Predictors" (1993)



# global + hash(global+PC) (gshare/gselect)

from McFarling, "Combining Branch Predictors" (1993)



#### real BP?

details of modern CPU's branch predictors often not public but...

Google Project Zero blog post with reverse engineered details

https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html

for RE'd BTB size:

https://xania.org/201602/haswell-and-ivy-btb

#### reverse engineering Haswell BPs

#### branch target buffer

```
4-way, 4096 entries
ignores bottom 4 bits of PC?
hashes PC to index by shifting + XOR
seems to store 32 bit offset from PC (not all 48+ bits of virtual addr)
```

#### indirect branch predictor

like the global history + PC predictor we showed, but...

uses history of recent branch addresses instead of taken/not taken

keeps some info about last 29 branches

what about conditional branches??? loops???

couldn't find a reasonable source