# 18-447 Lecture 7: Pipelined Implementation

James C. Hoe
Department of ECE
Carnegie Mellon University

#### Housekeeping

- Your goal today
  - getting started on pipelined implementations
- Notices
  - Lab 1, Part B, due this week
  - HW 2, due Mon 2/21

Look for Handout #7: Lab 2 on Friday

- Readings
  - P&H Ch 4

#### Doing laundry: by the book (P&H 4.6)



- 1."place one load of dirty clothes in washer"
- 2. "when washer is finished, place washed clothes in dryer"
- 3. "when dryer is finished, you fold dried clothes"
- 4. "when folding is finished, friend put away folded clothes"
  - steps to do a load are sequentially dependent
    - no dependence between different loads
      - different steps do not share resources

#### Doing laundry more quickly: in theory



#### Doing laundry more quickly: in practice



the slowest step decides throughput

#### Doing laundry more quickly: in practice



Throughput restored (2 loads per hour) using 2 dryers

#### Pipeline Idealism

Motivation: Increase throughput without adding hardware cost

- Repetition of identical tasks
   same task repeated for many different inputs
- Repetition of independent tasks
   no ordering dependencies between repeated tasks
- Uniformly partitionable suboperations arbitrary number and placement of boundaries

Good examples: automobile assembly line, doing laundry, but instruction execution???

#### (Ideal) HW Pipelining



Notice: evenly divisible; no feedback wires

#### **Performance Model**

Nonpipelined version with delay T

throughput = 1/(T+S) where S =latch delay



• k-stage pipelined version

throughput<sub>k-stage</sub> = 1 / (T/k + S)throughput<sub>max</sub> = 1 / (1 gate delay + S)



per-task latency became longer: T+kS

#### **Cost Model**

Nonpipelined version with combinational cost G



k-stage pipelined version

$$Cost_{k-stage} = G + Lk$$



## Cost/Performance Trade-off [Peter M. Kogge, 1981]

Cost/Performance:

$$C/P = [Lk + G] / [1/(T/k + S)] = (Lk + G) (T/k + S)$$
  
= LT + GS + LSk + GT/k

Optimal Cost/Performance: find min. C/P w.r.t. choice of k

$$\frac{d}{dk} \left( \frac{Lk+G}{\frac{1}{k}+S} \right) = 0+0+LS - \frac{GT}{k^2}$$

$$LS - \frac{GT}{k^2} = 0$$

$$k_{opt} = \sqrt{\frac{GT}{LS}}$$

### Reality of Instruction Pipelining . . . .



#### **RISC Instruction Processing**

- 5 generic steps
  - instruction fetch
  - instruction decode and operand fetch
  - ALU/execute
  - memory access
  - write-back



### Coalescing and "External Fragmentation"

| steps    | IF | ID       | EX       | MEM       | WB        |
|----------|----|----------|----------|-----------|-----------|
| R-type   | V  | <b>V</b> |          |           | $\sqrt{}$ |
| l-type   | V  | 1        |          |           | $\sqrt{}$ |
| LW       | V  | 1        |          | $\sqrt{}$ |           |
| SW       | V  |          |          | $\sqrt{}$ |           |
| Bxx/JALR | V  | <b>√</b> | <b>√</b> |           | /\        |
| JAL      | 1  |          | <b>√</b> |           | 1         |



#### **Dividing into Stages**



Is this the correct partitioning?
Why not 4 or 6 stages? Why not different boundaries

#### Internal and External Fragmentation



#### **Pipeline Registers**



#### **Pipelined Operation**



#### **Pipelined Operation**



What if LW dest is \$2?

### **Optimize Latency of ALU Insts?**

| steps    | IF       | ID       | EX        | MEM | WB         |
|----------|----------|----------|-----------|-----|------------|
| R-type   | <b>√</b> | <b>V</b> | $\sqrt{}$ |     | $\sqrt{}$  |
| l-type   | V        |          |           |     | $\sqrt{}$  |
| LW       | V        | 1        | 1         | 1   | $\sqrt{}$  |
| SW       | V        | 1        | <b>√</b>  | V   |            |
| Bxx/JALR | V        | 1        | 1         |     | <b>/</b> √ |
| JAL      | <b>V</b> |          | <b>√</b>  |     | <b>V</b>   |



## Illustrating Pipeline Operation: Resource View

|     | t <sub>o</sub> | t <sub>1</sub> | t <sub>2</sub> | t <sub>3</sub> | t <sub>4</sub> | t <sub>5</sub> | t <sub>6</sub> | t <sub>7</sub> | t <sub>8</sub> | t <sub>9</sub> | t <sub>10</sub> |
|-----|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------|-----------------|
| IF  | I <sub>0</sub> | l <sub>1</sub> | l <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> | I <sub>8</sub> | l <sub>9</sub> | I <sub>10</sub> |
| ID  |                | I <sub>o</sub> | I <sub>1</sub> | I <sub>2</sub> | l <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub> | I <sub>8</sub> | l <sub>9</sub>  |
| EX  |                |                | I <sub>0</sub> | I <sub>1</sub> | l <sub>2</sub> | l <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | l <sub>7</sub> | I <sub>8</sub>  |
| MEM |                |                |                | I <sub>0</sub> | l <sub>1</sub> | I <sub>2</sub> | l <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub> | I <sub>7</sub>  |
| WB  |                |                |                |                | I <sub>0</sub> | l <sub>1</sub> | l <sub>2</sub> | l <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> | I <sub>6</sub>  |

# Illustrating Pipeline Operation: Operation View



#### **Example: Read-after-Write Hazard**



## **Example: Pipeline Stalls**

|     | t <sub>o</sub> | t <sub>1</sub> | t <sub>2</sub> | t <sub>3</sub>        | t <sub>4</sub> | t <sub>5</sub> | t <sub>6</sub> | t <sub>7</sub>                     | t <sub>8</sub>        | t <sub>9</sub>        | t <sub>10</sub>       |
|-----|----------------|----------------|----------------|-----------------------|----------------|----------------|----------------|------------------------------------|-----------------------|-----------------------|-----------------------|
| IF  | I <sub>0</sub> | l <sub>1</sub> | I <sub>2</sub> | <b>I</b> <sub>3</sub> | I <sub>4</sub> | I <sub>4</sub> | I <sub>4</sub> | I <sub>4</sub>                     | <b>I</b> <sub>5</sub> | <b>I</b> <sub>6</sub> | I <sub>7</sub>        |
| ID  |                | I <sub>o</sub> | I <sub>1</sub> | I <sub>2</sub>        | l <sub>3</sub> | l <sub>3</sub> | l <sub>3</sub> | <b>↑</b> <sup>1</sup> <sub>3</sub> | I <sub>4</sub>        | <b>I</b> <sub>5</sub> | <b>I</b> <sub>6</sub> |
| EX  |                |                | I <sub>0</sub> | l <sub>1</sub>        | l <sub>2</sub> | Ø              | Ø              | Ø                                  | l <sub>3</sub>        | I <sub>4</sub>        | <b>I</b> <sub>5</sub> |
| MEM |                |                |                | I <sub>o</sub>        | I <sub>1</sub> | I <sub>2</sub> | Ø              | Ø                                  | Ø                     | <b>I</b> <sub>3</sub> | <b>I</b> <sub>4</sub> |
| WB  |                |                |                |                       | I <sub>0</sub> | l <sub>1</sub> |                | Ø                                  | Ø                     | Ø                     | l <sub>3</sub>        |

 $I_2$ =addi x1, x0, 0;  $I_3$ =addi x2, x1, 0;

#### **Control Points**



Identical set of control points as the single-cycle datapath!!

#### **Sequential Control: Special Case**

- For a given instruction
  - same control settings as single-cycle, but
  - control signals required at different cycles, depending on stage
  - decode once using the same logic as single-cycle and buffer control signals until consumed



#### **Pipelined Control**



### **Instruction Pipeline Reality**

- Not identical tasks
  - coalescing instruction types into one "multifunction" pipe
  - external fragmentation (some idle stages)
- Not uniform suboperations
  - group or sub-divide steps into stages to minimize variance
  - internal fragmentation (some too-fast stages )
- Not independent tasks
  - dependency detection and resolution
  - next lecture(s)



#### **Data Dependence**

Data dependence

$$x3 \leftarrow x1$$
 op  $x2$  Read-after-Write (RAW)  
 $x5 \leftarrow x3$  op  $x4$ 

Anti-dependence

$$x3 \leftarrow x1$$
 op  $x2$ 
 $x1 \leftarrow x4$  op  $x5$ 

 $x3 \leftarrow x1$  op x2 Write-after-Read (WAR)

Output-dependence

$$x3 \leftarrow x1$$
 op  $x2$  Write-after-Write (WAW)  
 $x3 \leftarrow x6$  op  $x7$ 

Don't forget memory instructions

#### **Control Dependence**

• C-Code

**Control Flow Graph** 



Assembly Code (linearized)



Does B or C come after A?

18-447-S22-L07-S30, James C. Hoe, CMU/ECE/CALCM, ©2022