## last time (1)

certificates + certificate authorities

#### cryptographic hashes

hard-to-reverse summary

specialized versions for password storage

#### key exchange

generate secret + key share

combine your secret + other's key share to get shared secret

TLS: everything together

# last time (2)

```
review: single-cycle processor
pipelining idea (laundry analogy)
instructions as series of pipeline stages
latency = time for one (beginning to end)
throughput = rate for many
start: diminishing returns with pipelining
```

# 6:30p lab tomorrow

is in Olsson 018

# anonymous feedback (1)

final exam: could it be remote?

deliberate decision I made early in the semester; has pros/cons

remote: tricky to balance for students not spending N hours on exam

not nice re: technical issues to give tight time limit remotely

remote: need to write questions that work in open-book/notes

# anonymous feedback (2)

"Could you give some more examples with pipeline chart and a quick review of assembly again."

we will have more examples with pipeline chart, b/c there are parts of pipelining we haven't talked about

re: (x86-64) assembly, not going to do a detailed review re: time but...

```
instruction operand=source, operand=source+destination
%XXX: some register (%rXX = 64 bits, %eXX = 32 bits)
$123: the constant 123
some_label: label = memory location
X(%rYY) = memory[X + %rYY]
cmp = set condition codes; jXX = jump based on condition codes
```

```
// init. %r8=800, %r9=900, etc.
addq %r8, %r9 // R8+R9->R9
addq %r10, %r11 // R10+R11->R11
addq %r12, %r13 // R12+R13->R13
addq %r14, %r15 // R14+R15->R15
addq %r9, %r8 // R9+R8->R8
```



| cycle | fetch PC | fetch/decode rA | fetch/decode rB | decode/execute R[rA] | decode/execute R[rB] | decode/execute rB | execute/memory sum | execute/memory rB | memory/writeback sum | memory/writeback rB |
|-------|----------|-----------------|-----------------|----------------------|----------------------|-------------------|--------------------|-------------------|----------------------|---------------------|
| 0     | 0x0      |                 |                 |                      |                      |                   |                    |                   |                      |                     |
| 1     | 0x2      | 8               | 9               |                      |                      |                   |                    |                   |                      |                     |
| 2     | 0x4      | 10              | 11              | 800                  | 900                  | 9                 |                    |                   |                      |                     |
| 3     | 0x6      | 12              | 13              | 1000                 | 1100                 | 11                | 1700               | 9                 |                      |                     |
| 4     | 0x8      | 14              | 15              | 1200                 | 1300                 | 13                | 2100               | 11                | 1700                 | 9                   |
| 5     |          | 9               | 8               | 1400                 | 1500                 | 15                | 2500               | 13                | 2100                 | 11                  |
| 6     |          |                 |                 | 1700                 | 800                  | 8                 | 2900               | 15                | 2500                 | 13                  |
| 7     |          |                 |                 |                      |                      |                   | 2500               | 8                 | 2900                 | 15                  |
| 8     |          |                 |                 |                      |                      |                   |                    |                   | 2500                 | 8                   |

# exercise: throughput/latency (1)

```
cycle #                          0 1 2 3 4 5 6 7 8
0x100: add %r8, %r9              F D E M W
0x108: mov 0x1234(%r10), %r11      F D E M W
0x110: ...                           ...
```

suppose cycle time is 500 ps

exercise: latency of one instruction?

A. 100 ps B. 500 ps C. 2000 ps D. 2500 ps E. something else

exercise: throughput overall?

A. 1 instr/100 ps B. 1 instr/500 ps C. 1 instr/2000 ps D. 1 instr/2500 ps E. something else
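A minimal sketch of the arithmetic behind both questions, assuming an ideal pipeline (one instruction starts every cycle, nothing stalls). The function name `pipeline_timing` is made up for illustration.

```python
def pipeline_timing(num_stages, cycle_time_ps):
    """Return (latency in ps, ps between instruction completions)."""
    # one instruction passes through every stage, one cycle per stage
    latency_ps = num_stages * cycle_time_ps
    # once the pipeline is full, one instruction finishes every cycle
    ps_per_instruction = cycle_time_ps
    return latency_ps, ps_per_instruction


latency, ps_per_instr = pipeline_timing(num_stages=5, cycle_time_ps=500)
print(latency)       # 2500
print(ps_per_instr)  # 500
```

So latency is 5 cycles' worth of time, while throughput is 1 instr per cycle time.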

# exercise: throughput/latency (2)

```
cycle #                          0 1 2 3 4 5 6 7 8
0x100: add %r8, %r9              F D E M W
0x108: mov 0x1234(%r10), %r11      F D E M W
0x110: ...

cycle #                          0  1  2  3  4  5  6  7  8  9
0x100: add %r8, %r9              F1 F2 D1 D2 E1 E2 M1 M2 W1 W2
0x108: mov 0x1234(%r10), %r11       F1 F2 D1 D2 E1 E2 M1 M2 W1
0x110: ...
```

double number of pipeline stages (to 10) + decrease cycle time from 500 ps to 250 ps — throughput?

A. 1 instr/100 ps B. 1 instr/250 ps C. 1 instr/1000 ps D. 1 instr/5000 ps E. something else
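One way to compare the two designs is total time for a long run of instructions. This is a sketch assuming no stalls; `total_time_ps` and the choice of 1000 instructions are illustrative, not from the slides.

```python
def total_time_ps(num_instructions, num_stages, cycle_time_ps):
    """Time until the last of num_instructions leaves an ideal pipeline."""
    # the first instruction needs num_stages cycles;
    # each later instruction finishes one cycle after the previous one
    cycles = num_stages + (num_instructions - 1)
    return cycles * cycle_time_ps


print(total_time_ps(1000, 5, 500))   # 502000
print(total_time_ps(1000, 10, 250))  # 252250
print(total_time_ps(1, 5, 500))      # 2500
print(total_time_ps(1, 10, 250))     # 2500
```

Latency of a single instruction is unchanged, but the long run takes about half the time.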













### diminishing returns: uneven split

Can we split up some logic (e.g. adder) arbitrarily?

Probably not...
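A small sketch of why an uneven split hurts, assuming every stage shares one clock and ignoring pipeline register delays. The delay numbers are hypothetical.

```python
def cycle_time_ps(stage_delays_ps):
    """All stages share one clock, so the slowest stage sets the cycle time."""
    return max(stage_delays_ps)


even_split = [250, 250]    # a 500 ps stage cut exactly in half (hypothetical)
uneven_split = [350, 150]  # same logic, but it only splits unevenly (hypothetical)
print(cycle_time_ps(even_split))    # 250
print(cycle_time_ps(uneven_split))  # 350
```

With the uneven split, doubling the stages improves the cycle time from 500 ps to only 350 ps.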




#### a data hazard

```
// initially %r8 = 800,
// %r9 = 900, etc.
addq %r8, %r9 // R8 + R9 -> R9
addq %r9, %r8 // R9 + R8 -> R8
addq ...
addq ...
```

| cycle | fetch PC | fetch/decode rA | fetch/decode rB | decode/execute R[rA] | decode/execute R[rB] | decode/execute rB | execute/memory sum | execute/memory rB | memory/writeback sum | memory/writeback rB |
|-------|----------|-----------------|-----------------|----------------------|----------------------|-------------------|--------------------|-------------------|----------------------|---------------------|
| 0     | 0x0      |                 |                 |                      |                      |                   |                    |                   |                      |                     |
| 1     | 0x2      | 8               | 9               |                      |                      |                   |                    |                   |                      |                     |
| 2     |          | 9               | 8               | 800                  | 900                  | 9                 |                    |                   |                      |                     |
| 3     |          |                 |                 | 900                  | 800                  | 8                 | 1700               | 9                 |                      |                     |
| 4     |          |                 |                 |                      |                      |                   | 1700               | 8                 | 1700                 | 9                   |
| 5     |          |                 |                 |                      |                      |                   |                    |                   | 1700                 | 8                   |

R[rA] = 900 in cycle 3 should be 1700 (so the second sum should be 2500)

#### data hazard

```
addq %r8, %r9 // (1)
addq %r9, %r8 // (2)
```

| step# | pipeline implementation | ISA specification   |
|-------|-------------------------|---------------------|
| 1     | read r8, r9 for (1)     | read r8, r9 for (1) |
| 2     | read r9, r8 for (2)     | write r9 for (1)    |
| 3     | write r9 for (1)        | read r9, r8 for (2) |
| 4     | write r8 for (2)        | write r8 for (2)    |

pipeline reads older value...

instead of value ISA says was just written
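The mismatch can be reproduced with a toy model. This is only a sketch: it assumes a result becomes readable three instructions after the one that produced it (the 5-stage timing above, with no stalling or forwarding), and the function names are made up.

```python
def run_isa(regs, program):
    """Run (src, dst) adds one at a time, the way the ISA describes them."""
    regs = dict(regs)
    for src, dst in program:
        regs[dst] = regs[src] + regs[dst]
    return regs


def run_unresolved_pipeline(regs, program, delay=3):
    """Same adds, but a result only becomes readable `delay` instructions later."""
    regs = dict(regs)
    pending = []  # (index of first instruction that can see the write, register, value)
    for i, (src, dst) in enumerate(program):
        # apply writes from older instructions that have reached writeback
        while pending and pending[0][0] <= i:
            _, reg, value = pending.pop(0)
            regs[reg] = value
        pending.append((i + delay, dst, regs[src] + regs[dst]))
    for _, reg, value in pending:  # drain the pipeline
        regs[reg] = value
    return regs


start = {"r8": 800, "r9": 900}
program = [("r8", "r9"), ("r9", "r8")]  # addq %r8, %r9 ; addq %r9, %r8
print(run_isa(start, program))                  # {'r8': 2500, 'r9': 1700}
print(run_unresolved_pipeline(start, program))  # {'r8': 1700, 'r9': 1700}
```

The second add sees the old %r9 = 900, so %r8 ends up 1700 instead of 2500.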

#### data hazard compiler solution

```
addq %r8, %r9
nop
nop
addq %r9, %r8
```

one solution: change the ISA

all addqs take effect three instructions later (assuming can read register value while it is being written back)

make it compiler's job

problem: recompile every time processor changes?

# stalling/nop pipeline diagram (1)

```
cycle #          0 1 2 3 4 5 6 7 8
add %r8, %r9     F D E M W
nop                F D E M W
nop                  F D E M W
addq %r9, %r8          F D E M W
```

assumption:

if writing register value, register file will return that value for reads

not actually the way the register file worked in single-cycle CPU (e.g. can read old %r9 while writing new %r9)

# stalling/nop pipeline diagram (2)



if we didn't modify the register file, we'd need an extra cycle

#### data hazard hardware solution

```
addq %r8, %r9
// hardware inserts: nop
// hardware inserts: nop
addq %r9, %r8
```

how about having hardware add the nops?

called stalling

extra logic: sometimes don't change PC; sometimes put do-nothing values in pipeline registers
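A sketch of the decision the stalling logic makes, written as a software pass over a list of `(src, dst)` adds. It assumes a result is readable three instructions after its producer (the register file returns the value being written); `insert_stalls` is a hypothetical name, and `None` stands for the do-nothing values.

```python
def insert_stalls(program, delay=3):
    """Return the instruction stream with None (nop) where hardware would stall."""
    out = []
    last_write = {}  # register -> position in `out` of the latest writer
    for src, dst in program:
        for reg in (src, dst):  # addq reads both of its operands
            if reg in last_write:
                # hold this instruction back until the writer is far enough ahead
                while len(out) - last_write[reg] < delay:
                    out.append(None)
        last_write[dst] = len(out)
        out.append((src, dst))
    return out


print(insert_stalls([("r8", "r9"), ("r9", "r8")]))
# [('r8', 'r9'), None, None, ('r9', 'r8')]
print(insert_stalls([("r8", "r9"), ("r10", "r11")]))
# [('r8', 'r9'), ('r10', 'r11')]
```

With `delay=4` (register file not modified) the same pass inserts three nops, matching the extra cycle mentioned earlier.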

### opportunity

```
// initially %r8 = 800,
// %r9 = 900, etc.
0x0: addq %r8, %r9
0x2: addq %r9, %r8
...
```

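| cycle | fetch PC | fetch/decode rA | fetch/decode rB | decode/execute R[rA] | decode/execute R[rB] | decode/execute rB | execute/memory sum | execute/memory rB | memory/writeback sum | memory/writeback rB |
|-------|----------|-----------------|-----------------|----------------------|----------------------|-------------------|--------------------|-------------------|----------------------|---------------------|
| 0     | 0x0      |                 |                 |                      |                      |                   |                    |                   |                      |                     |
| 1     | 0x2      | 8               | 9               |                      |                      |                   |                    |                   |                      |                     |
| 2     |          | 9               | 8               | 800                  | 900                  | 9                 |                    |                   |                      |                     |
| 3     |          |                 |                 | 900                  | 800                  | 8                 | 1700               | 9                 |                      |                     |
| 4     |          |                 |                 |                      |                      |                   | 1700               | 8                 | 1700                 | 9                   |
| 5     |          |                 |                 |                      |                      |                   |                    |                   | 1700                 | 8                   |

R[rA] = 900 in cycle 3 should be 1700: the sum the first addq's execute stage computes during cycle 2, while the second addq is in decode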
#### exploiting the opportunity



#### opportunity 2

```
// initially %r8 = 800,
// %r9 = 900, etc.
0x0: addq %r8, %r9
0x2: nop
0x3: addq %r9, %r8
```

| cycle | fetch PC | fetch/decode rA | fetch/decode rB | decode/execute R[rA] | decode/execute R[rB] | decode/execute rB | execute/memory sum | execute/memory rB | memory/writeback sum | memory/writeback rB |
|-------|----------|-----------------|-----------------|----------------------|----------------------|-------------------|--------------------|-------------------|----------------------|---------------------|
| 0     | 0x0      |                 |                 |                      |                      |                   |                    |                   |                      |                     |
| 1     | 0x2      | 8               | 9               |                      |                      |                   |                    |                   |                      |                     |
| 2     | 0x3      |                 |                 | 800                  | 900                  | 9                 |                    |                   |                      |                     |
| 3     |          | 9               | 8               |                      |                      |                   | 1700               | 9                 |                      |                     |
| 4     |          |                 |                 | 900                  | 800                  | 8                 |                    |                   | 1700                 | 9                   |
| 5     |          |                 |                 |                      |                      |                   | 1700               | 8                 |                      |                     |
| 6     |          |                 |                 |                      |                      |                   |                    |                   | 1700                 | 8                   |

R[rA] = 900 in cycle 4 should be 1700: the sum held in execute/memory during cycle 3, while the second addq is in decode

## exploiting the opportunity



## exercise: forwarding paths

cycle # 0 1 2 3 4 5 6 7 8

addq %r8, %r9 F D E M W

...

in subq, %r8 is _____ addq.

in xorq, %r9 is _____ addq.

in andq, %r9 is _____ addq.

in andq, %r9 is _____ xorq.

A: not forwarded from

B-D: forwarded to decode from {execute, memory, writeback} stage of
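A sketch for reasoning about options B-D: in the 5-stage pipeline with no stalls, the stage to forward from depends only on how many instructions separate producer and consumer. `forwarded_from` is a hypothetical helper; it does not check whether the producer actually writes the register or whether a later instruction overwrote it.

```python
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]


def forwarded_from(distance):
    """Stage the producer occupies while the consumer is in decode.

    distance = how many instructions later the consumer is (1 = next instruction).
    Returns None when the producer has already finished writeback.
    """
    if distance < 1:
        raise ValueError("consumer must come after producer")
    producer_stage = STAGES.index("decode") + distance
    if producer_stage >= len(STAGES):
        return None  # value is already in the register file
    return STAGES[producer_stage]


for d in range(1, 5):
    print(d, forwarded_from(d))
# 1 execute
# 2 memory
# 3 writeback
# 4 None
```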

#### unsolved problem

combine stalling and forwarding to resolve hazard

assumption in diagram: hazard detected in subq's decode stage (since easier than detecting it in fetch stage)

#### solvable problem



#### why can't we...



clock cycle needs to be long enough to go through data cache AND to go through math circuits! (which we were trying to avoid by putting them in separate stages)

#### control hazard

0x00: cmpq %r8, %r9

0x08: je 0xFFFF

0x10: addq %r10, %r11

| cycle | fetch PC | fetch→decode rA | fetch→decode rB | decode→execute R[rA] | decode→execute R[rB] | execute→writeback result |
|-------|----------|-----------------|-----------------|----------------------|----------------------|--------------------------|
| 0     | 0x0      |                 |                 |                      |                      |                          |
| 1     | 0x8      | 8               | 9               |                      |                      |                          |
| 2     | ???      |                 |                 | 800                  | 900                  |                          |
| 3     | ???      |                 |                 |                      |                      | less than                |

??? = 0xFFFF if R[8] = R[9]; 0x10 otherwise
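The rule for the unknown PC, as a tiny sketch. The function name is made up; the addresses are the ones from the example above.

```python
def next_pc_after_je(r_a, r_b, target=0xFFFF, fall_through=0x10):
    """PC to fetch after `cmpq; je target`: the target only if the compared values are equal."""
    return target if r_a == r_b else fall_through


print(hex(next_pc_after_je(800, 900)))  # 0x10
print(hex(next_pc_after_je(800, 800)))  # 0xffff
```

The inputs are register values, which are not compared until cmpq reaches execute; that is why fetch cannot know the PC in cycles 2 and 3.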

```
cmpq %r8, %r9
jne LABEL            // not taken
xorq %r10, %r11
movq %r11, 0(%r12)
...

cycle #              0 1 2 3 4 5 6 7 8
cmpq %r8, %r9        F D E M W
jne LABEL              F D E M W
(do nothing)             F D E M W
(do nothing)               F D E M W
xorq %r10, %r11              F D E M W
movq %r11, 0(%r12)             F D E M
...

cmpq E (cycle 2): compare sets flags
jne E (cycle 3): compute if jump goes to LABEL
xorq F (cycle 4): use computed result
```

#### making guesses

```
cmpq %r8, %r9
jne LABEL
xorq %r10, %r11
movq %r11, 0(%r12)
...
```

```
LABEL: addq %r8, %r9
       imul %r13, %r14
       ...
```

speculate (guess): jne won't go to LABEL

right: 2 cycles faster!; wrong: undo guess before too late

# jXX: speculating right (1)

```
cmpq %r8, %r9
jne LABEL
xorq %r10, %r11
movq %r11, 0(%r12)
...
LABEL: addq %r8, %r9
       imul %r13, %r14
       ...

cycle #              0 1 2 3 4 5 6 7 8
cmpq %r8, %r9        F D E M W
jne LABEL              F D E M W
xorq %r10, %r11          F D E M W
movq %r11, 0(%r12)         F D E M W
...
```

## jXX: speculating wrong

```
cycle #              0 1 2 3 4 5 6 7 8
cmpq %r8, %r9        F D E M W
jne LABEL              F D E M W
xorq %r10, %r11          F D           instruction "squashed"
(inserted nop)               E M W
movq %r11, 0(%r12)         F           instruction "squashed"
(inserted nop)               D E M W
LABEL: addq %r8, %r9         F D E M W
imul %r13, %r14                F D E M
```

#### "squashed" instructions

on misprediction need to undo partially executed instructions

mostly: remove from pipeline registers

more complicated pipelines: replace written values in cache/registers/etc.

## performance

#### hypothetical instruction mix

| kind          | portion | cycles (predict not-taken) | cycles (stall) |
|---------------|---------|----------------------------|----------------|
| taken jXX     | 3%      | 3                          | 3              |
| non-taken jXX | 5%      | 1                          | 3              |
| others        | 92%     | 1*                         | 1*             |
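A sketch of the average cycles per instruction for this mix. It assumes a taken (mispredicted) jXX costs 3 cycles under predict-not-taken, i.e. the 2 squashed instructions plus the jump itself, and it takes "others" as 1 cycle, ignoring the asterisk.

```python
def average_cpi(mix):
    """mix: list of (portion, cycles) pairs whose portions sum to 1."""
    return sum(portion * cycles for portion, cycles in mix)


# (portion, cycles) for taken jXX, non-taken jXX, others
predict_not_taken = [(0.03, 3), (0.05, 1), (0.92, 1)]
stall = [(0.03, 3), (0.05, 3), (0.92, 1)]

print(round(average_cpi(predict_not_taken), 2))  # 1.06
print(round(average_cpi(stall), 2))              # 1.16
```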

#### exercise: predict+forward (1)

```
cycle #              0 1 2 3 4 5 6 7 8
addq %r8, %r9        F D E M W
cmpq %r9, %r10         F D E M W
jle foo (taken)          F D E M W
foo: andq %r9, %r8         F D E M W
```

if jle is correctly predicted:

in cmpq, %r9 is _____ addq.

in andq, %r9 is _____ addq.

A: not forwarded from

B-D: forwarded to decode from {execute, memory, writeback} stage of

#### exercise: predict+forward (2)

```
cycle #              0 1 2 3 4 5 6 7 8
addq %r8, %r9        F D E M W
cmpq %r9, %r10         F D E M W
jle foo (taken)          F D E M W
(mispredicted)             F D
(mispredicted)               F
foo: andq %r9, %r8             F D E M
```

if jle is mispredicted + resolved after jle's execute:

in cmpq, %r9 is _____ addq.

in andq, %r9 is _____ addq.

A: not forwarded from

B-D: forwarded to decode from {execute, memory, writeback} stage of

#### hazards versus dependencies

dependency — X needs result of instruction Y?

has potential for being messed up by pipeline
(since part of X may run before Y finishes)

hazard — will it not work in some pipeline?

before extra work is done to "resolve" hazards
multiple kinds: so far, data hazards

```
addq %rax, %rbx
subq %rax, %rcx
movq $100, %rcx
addq %rcx, %r10
addq %rbx, %r10
```
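A sketch that lists the dependencies in this sequence, where a dependency means an instruction reads a register whose most recent writer is an earlier instruction. The `(reads, writes)` encoding and `find_dependencies` are made up for illustration; condition codes are ignored.

```python
def find_dependencies(program):
    """program: list of (reads, writes) register-name tuples, in program order.

    Returns (consumer index, producer index, register) triples.
    """
    last_writer = {}
    deps = []
    for i, (reads, writes) in enumerate(program):
        for reg in reads:
            if reg in last_writer:
                deps.append((i, last_writer[reg], reg))
        for reg in writes:
            last_writer[reg] = i
    return deps


program = [
    (("rax", "rbx"), ("rbx",)),  # 0: addq %rax, %rbx
    (("rax", "rcx"), ("rcx",)),  # 1: subq %rax, %rcx
    ((), ("rcx",)),              # 2: movq $100, %rcx
    (("rcx", "r10"), ("r10",)),  # 3: addq %rcx, %r10
    (("rbx", "r10"), ("r10",)),  # 4: addq %rbx, %r10
]
print(find_dependencies(program))
# [(3, 2, 'rcx'), (4, 0, 'rbx'), (4, 3, 'r10')]
```

Whether each of these becomes a hazard depends on the pipeline: how far apart the two instructions are versus how long the result takes to be written.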

### pipeline with different hazards

```
example: 4-stage pipeline:
fetch/decode/execute+memory/writeback

// stage each instruction is in while andq is in decode
                   // 4 stage   // 5 stage
addq %rax, %r8     //           // W
subq %rax, %r9     // W         // M
xorq %rax, %r10    // EM        // E
andq %r8, %r11     // D         // D
```

addq/andq is a hazard with 5-stage pipeline

addq/andq is not a hazard with 4-stage pipeline

more hazards with more pipeline stages
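A sketch of the rule behind the comparison. It assumes registers are read in decode and written in writeback, and that a read in the same cycle as the write still counts as a hazard (before any fix such as a modified register file or forwarding). `is_hazard` is a hypothetical name.

```python
def is_hazard(distance, stages):
    """True if a consumer `distance` instructions after its producer decodes too early."""
    decode = stages.index("D")
    writeback = stages.index("W")
    # the producer is in writeback (writeback - decode) cycles after its own decode;
    # the consumer decodes `distance` cycles after the producer's decode
    return distance <= writeback - decode


five_stage = ["F", "D", "E", "M", "W"]
four_stage = ["F", "D", "EM", "W"]
print(is_hazard(3, five_stage))  # True  (addq/andq, 5-stage)
print(is_hazard(3, four_stage))  # False (addq/andq, 4-stage)
```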

#### exercise: different pipeline

split execute into two stages: F/D/E1/E2/M/W

result only available near end of second execute stage

where do forwarding and stalls occur?

| cycle #              | 0 | 1 | 2  | 3  | 4 | 5 | 6 | 7 | 8 |
|----------------------|---|---|----|----|---|---|---|---|---|
| (1) addq %rcx, %r9   | F | D | E1 | E2 | M | W |   |   |   |
| (2) addq %r9, %rbx   |   |   |    |    |   |   |   |   |   |
| (3) addq %rax, %r9   |   |   |    |    |   |   |   |   |   |
| (4) movq %r9, (%rbx) |   |   |    |    |   |   |   |   |   |
| (5) movq %rcx, %r9   |   |   |    |    |   |   |   |   |   |
#### exercise: different pipeline

split execute into two stages: F/D/E1/E2/M/W

| cycle #              | 0 | 1 | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9 |
|----------------------|---|---|----|----|----|----|----|----|----|---|
| (1) addq %rcx, %r9   | F | D | E1 | E2 | M  | W  |    |    |    |   |
| (2) addq %r9, %rbx   |   | F | D  | D  | E1 | E2 | M  | W  |    |   |
| (3) addq %rax, %r9   |   |   | F  | F  | D  | E1 | E2 | M  | W  |   |
| (4) movq %r9, (%rbx) |   |   |    |    | F  | D  | E1 | E2 | M  | W |
| (5) movq %rcx, %r9   |   |   |    |    |    | F  | D  | E1 | E2 | M |

# backup slides

## adding stages (one way)



divide running instruction into steps

one way: fetch / decode / execute / memory / writeback

add 'pipeline registers' to hold values from instruction













### why registers?

example: fetch/decode

need to store current instruction somewhere ...while fetching next one

# exercise: forwarding paths (2)

cycle # 0 1 2 3 4 5 6 7 8

addq %r8, %r9

subq %r8, %r9

ret (goes to andq)

andq %r10, %r9

in subq, %r8 is _____ addq.

in subq, %r9 is _____ addq.

in andq, %r9 is _____ subq.

in andq, %r9 is _____ addq.

A: not forwarded from

B-D: forwarded to decode from {execute, memory, writeback} stage of