#### CSE 560 Computer Systems Architecture

**Pipelining** 

Washington

#### Performance Review

What metric would you use to compare the performance of computers

- 1. With different ISAs?
- 2. With the same ISA?
- 3. With the same ISA and clock speed?
  - A. MIPS
  - B. Instructions/Program
  - C. Execution time
  - D. IPC
  - E. Clock speed





#### Performance Review

What metric would you use to compare the performance of computers

- 1. With different ISAs?
- Execution time
- 2. With the same ISA?
- MIPS
- 3. With the same ISA and clock speed?
- IPC

- A. MIPS
- B. Instructions/Program
- C. Execution time
- D. IPC
- E. Clock speed

 $\frac{\text{seconds}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \ \ \text{x} \ \frac{\text{cycles}}{\text{instruction}} \ \ \text{x} \ \ \frac{\text{seconds}}{\text{cycle}}$ 

₩ashington

#### This Unit: (Scalar In-Order) Pipelining



- · Principles of pipelining
  - Effects of overhead and hazards
  - Pipeline diagrams
- Data hazards
  - · Stalling and bypassing
- Control hazards (Next lecture)
  - Branch prediction
  - Predication

Washington University in St Louis

### Datapath Background

Washington



Washington
University in St. Louis





#### Single-cycle vs. Multi-cycle Performance

- Single-cycle
  - Clock period = 50ns, CPI = 1
  - Performance = 50ns/insn
- Multi-cycle has opposite performance split of single-cycle
  - + Shorter clock period
  - Higher CPI
- Multi-cycle
  - Branch: 20% (3 cycles), load: 20% (5 cycles), ALU: 60% (4 cycles)
    Clock period = 11ns, CPI = (20%\*3)+(20%\*5)+(60%\*4) = 4
    Why is clock period 11ns and not 10ns?

  - Performance = 44ns/insn
- Aside: CISC makes perfect sense in multi-cycle datapath

#### ₿ Washington

## **Pipelining Basics** ₩ashington

#### Latency versus Throughput insn0.fetch, dec, exec insn1.fetch, dec, exec Single-cycle insn0.fetch insn0.dec insn0.exec Multi-cycle insn1.fetch insn1.dec insn1.exec

- Can we have both low CPI and short clock period?
  - Not if datapath executes only one insn at a time
- · Latency vs. Throughput
  - Latency: no good way to make a single insn go faster
  - + Throughput: luckily, single insn latency not so important
    - Goal is to make programs, not individual insns, go faster
    - · Programs contain billions of insns
  - Key: exploit inter-insn parallelism

#### ₩ashington

#### **Pipelining** insn0.fetch insn0.dec insn0.exec insn1.fetch insn1.dec insn1.exec Multi-cycle insn0.fetch insn0.dec insn1.fetch insn1.dec insn1.exec

- Important performance technique
  - · Improves insn throughput rather instruction latency
- · Begin with multi-cycle design
  - One insn advances from stage 1 to 2, next insn enters stage 1
  - Form of parallelism: "insn-stage parallelism"
  - · Maintains illusion of sequential fetch/execute loop
  - · Individual instruction takes the same number of stages
- + But instructions enter and leave at a much faster rate · Laundry analogy
- Washington







# More Terminology & Foreshadowing • Scalar pipeline: one insn per stage per cycle • Alternative: "superscalar", e.g., 4-wide (later) • In-order pipeline: insns enter execute stage in order • Alternative: "out-of-order" (OoO) (later) • Pipeline depth: number of pipeline stages • Nothing magical about five (Pentium 4 had 22 stages!) • Trend: deeper until Pentium 4, then pulled back a bit

















#### Example Pipeline Perf. Calculation

- - Clock period = 50ns, CPI = 1
     Performance = 50ns/insn
- Multi-cvcle
  - Branch: 20% (3 cycles), load: 20% (5 cycles), ALU: 60% (4 cycles)
  - Clock period = 11ns, CPI = (20%\*3)+(20%\*5)+(60%\*4) = 4
  - Performance = 44ns/insn
- 5-stage pipelined
  - Clock period = 12ns approx. (50ns / 5 stages) + overheads + CPI = 1 (each insn takes 5 cycles, but 1 completes each cycle)
  - Performance = 12ns/insn

    Well actually ... CPI = 1 + some penalty for pipelining (next)

    CPI = 1.5 (on average insn completes every 1.5 cycles)

    Performance = 18ns/insn
  - - Much higher performance than single-cycle or multi-cycle

₿ Washington

#### Clock Period of a Pipelined Processor

Delay<sub>do</sub> = time it takes to travel through original datapath  $N_{ps} = number of pipeline stages$ 

#### Pipeline Clock Period > Delay<sub>dp</sub> / N<sub>ps</sub>

- · Latches add delay
- · Extra "bypassing" logic adds delay
- · Pipeline stages have different delays, clock period is max delay
- These factors have implications for ideal number pipeline stages
  - Diminishing clock frequency gains for longer (deeper) pipelines

Washington

#### CPI Calculation: Accounting for Stalls

Why is Pipelined CPI > 1?

- CPI for scalar in-order pipeline is 1 + stall penalties
- · Stalls used to resolve hazards
  - · Hazard: condition that jeopardizes sequential illusion
  - · Stall: pipeline delay introduced to restore sequential illusion
- · Calculating pipeline CPI
  - · Frequency of stall \* stall cycles
  - · Penalties add (stalls generally don't overlap in in-order
  - 1 + stall-freq<sub>1</sub>\*stall-cyc<sub>1</sub> + stall-freq<sub>2</sub>\*stall-cyc<sub>2</sub> + .
- Correctness/performance/make common case fast (MCCF)
  - Long penalties OK if rare, e.g., 1 + 0.01 \* 10 = 1.1
  - · Stalls have implications for ideal number of pipeline stages

₿ Washington

#### Data Dependences, Pipeline Hazards, and Bypassing

₩ashington

#### Dependences and Hazards

- Dependence: relationship between two insns
  - Data: two insns use same storage location
  - Control: 1 insn affects whether another executes at all
  - Not a bad thing, programs would be boring otherwise
  - Enforced by making older insn go before younger one • Happens naturally in single-/multi-cycle designs
    - · But not in a pipeline
- · Hazard: dependence & possibility of wrong insn order
  - · Effects of wrong insn order cannot be externally visible • Stall: for order by keeping younger insn in same stage
  - · Hazards are a bad thing: stalls reduce performance

Washington



- It wouldn't help: peak fetch still only 1 insn per cycle
- Structural hazards: who gets the register file write port?

Washington

#### Structural Hazards

- · Structural hazards
  - · Two insns trying to use same circuit at same time
    - E.g., structural hazard on register file write port
- To fix structural hazards: proper ISA/pipeline design
  - Each insn uses every structure exactly once
  - · For at most one cycle
  - · Always at same stage relative to F (fetch)
- · Tolerate structure hazards
  - · Add stall logic to stall pipeline when hazards occur

Washington

#### **Example Structural Hazard**

ld r2.0(r1) W M X D add r1 1r3,r4 D W sub r1 r3.r5 M X W M st r6,0(r1)

- Structural hazard: resource needed twice in one cycle
  - Example: unified instruction & data memories (caches)
  - - Separate instruction/data memories (caches)
    - Have cache allow 2 accesses per cycle (slow, expensive)
    - Stall pipeline

Washington



- 1w read \$3 two cycles ago → got wrong value addi read \$3 one cycle ago → got wrong value
- sw reads \$3 this cycle → maybe (depends on register file)

₿ Washington

Memory Data Hazards Register D\$ W lw \$4.0(\$1) sw \$5,0(\$1 Are memory data hazards a problem for this pipeline? No 1w following sw to same address in next cycle, gets right value Why? D\$ read/write always take place in same stage Data hazards through registers? Yes (previous slide) Occur because register write is three stages after register read Can only read a register value three cycles after writing it ₩ashington

























#### Performance Impact of Load/Use Penalty

- Assume
  - Branch: 20%, load: 20%, store: 10%, other: 50%
  - $\bullet\,$  50% of loads are followed by dependent instruction
    - require 1 cycle stall (i.e., insertion of 1 nop)
- Calculate CPI
  - CPI = 1 + (1 \* 20% \* 50%) = 1.1

**Washington** 

#### Reducing Load-Use Stall Frequency

|                 | 1 | 2 | 3 | 4  | 5 | 6  | 7 | 8 | 9 |
|-----------------|---|---|---|----|---|----|---|---|---|
| add \$3 \$2,\$1 | F | D | Х | М  | W |    |   |   |   |
| lw \$4,4(\$3)   |   | F | D | ŧχ | М | w  |   |   |   |
| addi \$6 🛼,1    |   |   | F | d* | D | *X | М | W |   |
| sub \$8 \$3,\$1 |   |   |   |    | F | D  | Х | М | w |

- Use compiler scheduling to reduce load-use stall frequency
  - More on compiler scheduling later

|                 | 1 | 2 | 3 | 4 | 5   | 6          | 7 | 8 | 9 |
|-----------------|---|---|---|---|-----|------------|---|---|---|
| add \$3 \$2,\$1 | F | D | Х | М | W   |            |   |   |   |
| lw \$4,4(\$3)   |   | F | D | X | M   | W          |   |   |   |
| sub \$8 \$3,\$1 |   |   | F | D | * X | М          | W |   |   |
| addi \$6 \$4,1  |   |   |   | F | D   | <b>♥</b> X | М | W |   |

Washington
University in St. Louis





#### Pipeline Diagram with Multiplier

|                        | 1 | 2 | 3  | 4  | 5  | 6  | 7 | 8 | 9 |
|------------------------|---|---|----|----|----|----|---|---|---|
| mul \$4 \$3,\$5        | F | D | P0 | P1 | P2 | P3 | W |   |   |
| addi \$6 <b>\$4,</b> 1 |   | F | d* | d* | d* | D  | Χ | М | W |

- · What about...
  - Two instructions trying to write regfile in same cycle?
  - Structural hazard!
- Must prevent:

|                  | 1 | 2 | 3  | 4  | 5  | 6  | 7 | 8 | 9 |
|------------------|---|---|----|----|----|----|---|---|---|
| mul \$4 \$3,\$5  | F | D | P0 | P1 | P2 | P3 | W |   |   |
| addi \$6 \$1,1   |   | F | D  | Х  | М  | W  |   |   |   |
| add \$5 \$6,\$10 |   |   | F  | D  | Х  | М  | w |   |   |

Washington

#### More Multiplier Nasties

- What about.
  - · Mis-ordered register writes
  - SW thinks add gets \$4 from addi, actually gets it from mul

|                  | 1 | 2 | 3  | 4  | 5  | 6  | 7 | 8 | 9 |
|------------------|---|---|----|----|----|----|---|---|---|
| mul \$4 \$3,\$5  | F | D | P0 | P1 | P2 | P3 | W |   |   |
| addi \$4 \$1,1   |   | F | D  | Х  | М  | W  |   |   |   |
|                  |   |   |    |    |    |    |   |   |   |
|                  |   |   |    |    |    |    |   |   |   |
| add \$10 \$4,\$6 |   |   |    |    | F  | D  | Х | М | W |

- Common? Not for a 4-cycle multiply with 5-stage pipeline
  - More common with deeper pipelines
  - Frequency irrelevant: must be correct no matter how rare



52

#### Corrected Pipeline Diagram

- With the correct stall logic
  - · Prevent mis-ordered writes to the same register
  - · Why two cycles of delay?

|                  | 1 | 2 | 3  | 4  | 5  | 6  | 7 | 8 | 9 |
|------------------|---|---|----|----|----|----|---|---|---|
| mul \$4 \$3,\$5  | F | D | P0 | P1 | P2 | P3 | W |   |   |
| addi \$4 \$1,1   |   | F | d* | d* | D  | Х  | М | W |   |
|                  |   |   |    |    |    |    |   |   |   |
|                  |   |   |    |    |    |    |   |   |   |
| add \$10 \$4,\$6 |   |   |    |    | F  | D  | Χ | М | W |

Multi-cycle operations complicate pipeline logic

**Washington** 

#### **Pipelined Functional Units**

- Almost all multi-cycle functional units are pipelined
  - Each operation takes N cycles
  - Can initiate a new (independent) operation every cycle
  - Requires internal latching and some hardware replication
  - + Cheaper than multiple (non-pipelined) units

• Exception: int/FP divide: difficult to pipeline; not worth it

1 2 3 4 5 6 7 8 9 10 11

divf f0 f1,f2 | F D E/ E/ E/ E/ W wlinf f3 f4,f5 | F s\* s\* s\* D E/ E/ E/ E/ W

- s\* = structural hazard, two insns need same structure
  - · ISAs and pipelines designed minimize these
- Canonical example: all insns go through M stage

₩ashington

54

#### ISA Implementability Review

Give an example of an ISA feature that makes pipelining more difficult and the particular pipeline stages it affects.

Washington



#### ISA Implementability Review

Give an example of an ISA feature that makes pipelining more difficult and the particular pipeline stages it affects.

- Variable instruction length and format make pipelining fetch and decode difficult.
- Implicit state makes dynamic scheduling difficult.
- Variable latencies makes scheduling difficult.

Washington

56