### **Precise Runahead Execution**

Ajeya Naithani, Josue Feliu, Almutaz Adileh, Lieven Eeckhout







| ROB |              |
|-----|--------------|
|     | <del>'</del> |









































**Increased** Memory-Level Parallelism (MLP)



**Increased** Memory-Level Parallelism (MLP)







fetch



fetch → decode



fetch → decode → rename



fetch → decode → rename → dispatch



fetch → decode → rename → dispatch → issue



fetch → decode → rename → dispatch → issue → execute



fetch → decode → rename → dispatch → issue → execute → commit









# Runahead Buffer Executes Blocking Chain Speculatively



# Runahead Buffer Executes Blocking Chain Speculatively



# Runahead Buffer Executes Blocking Chain Speculatively



# Runahead Buffer Executes Blocking Chain Speculatively



# Runahead Buffer Executes Blocking Chain Speculatively



**Increased** Memory-Level Parallelism (MLP)

→ time





fetch → decode → rename → dispatch → issue → execute → commit

Runahead execution\*

Runahead buffer\*\*

|                          | Runahead              | Runahead                               |
|--------------------------|-----------------------|----------------------------------------|
|                          | execution*            | buffer**                               |
| Flush ROB                |                       |                                        |
|                          |                       |                                        |
|                          |                       |                                        |
|                          |                       |                                        |
|                          |                       |                                        |
|                          |                       |                                        |
|                          |                       |                                        |
|                          |                       |                                        |
|                          |                       |                                        |
| *[Mutlu of al ISCA'05] * | **[Hashami at al MICD | \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ |

|                 | Runahead              | Runahead |  |
|-----------------|-----------------------|----------|--|
|                 | execution*            | buffer** |  |
| Flush ROB       |                       |          |  |
| 1 10011 1 ( 0 2 | <b>Y</b>              |          |  |
|                 |                       |          |  |
|                 |                       |          |  |
|                 |                       |          |  |
|                 |                       |          |  |
|                 |                       |          |  |
|                 |                       |          |  |
|                 |                       |          |  |
| *[N.4           | <br>*[  a  a  a  a  a | 0,451    |  |

|           | Runahead execution*              | Runahead<br>buffer** |  |
|-----------|----------------------------------|----------------------|--|
| Flush ROB | <b>✓</b>                         |                      |  |
|           |                                  |                      |  |
|           |                                  |                      |  |
|           |                                  |                      |  |
|           |                                  |                      |  |
| *[N.4     | *[]   o o b o vo : o t o   NAICO | \<br>\<br>\?\4.51    |  |

Front-end refill = 8 cycles

Front-end refill = 8 cycles

ROB = 192, width = 4
 ROB fill time = 48 cycles

Front-end refill = 8 cycles

ROB = 192, width = 4
 ROB fill time = 48 cycles

Total overhead = 56 cycles

Front-end refill = 8 cycles

ROB = 192, width = 4
 ROB fill time = 48 cycles

Runahead causes a pipeline bubble of 56 cycles per invocation

Total overhead = 56 cycles





**runahead: 15.9%** 



runahead: 15.9%



runahead: 15.9%

runahead without flushing: 22.7%



runahead: 15.9%

runahead without flushing: 22.7%

|           | Runahead execution* | Runahead<br>buffer** |  |
|-----------|---------------------|----------------------|--|
| Flush ROB | <b>√</b>            |                      |  |
|           |                     |                      |  |
|           |                     |                      |  |
|           |                     |                      |  |
|           |                     |                      |  |
|           |                     |                      |  |

|                 | Runahead                          | Runahead |  |
|-----------------|-----------------------------------|----------|--|
|                 | execution*                        | buffer** |  |
| Flush ROB       |                                   |          |  |
| Short intervals |                                   |          |  |
|                 |                                   |          |  |
|                 |                                   |          |  |
|                 |                                   |          |  |
|                 |                                   |          |  |
| *[N.4           | *[]   a a b a a a ' a b a   MIODA |          |  |

|                 | Runahead   | Runahead |  |
|-----------------|------------|----------|--|
|                 | execution* | buffer** |  |
| Flush ROB       |            |          |  |
| Short intervals |            |          |  |
| *[N.4-4]        | <br>*[     | <br>     |  |

|                             | Runahead   | Runahead |  |
|-----------------------------|------------|----------|--|
|                             | execution* | buffer** |  |
| Flush ROB                   |            |          |  |
| Short intervals             |            |          |  |
| *[NA4] a.t. a.l. 1000 N/OF1 |            |          |  |

|                       | Runahead   | Runahead |  |
|-----------------------|------------|----------|--|
|                       | execution* | buffer** |  |
| Flush ROB             |            |          |  |
| Short intervals       | *          | *        |  |
| Instructions executed |            |          |  |
|                       |            |          |  |
|                       |            |          |  |
|                       |            |          |  |
|                       |            |          |  |

|                       | Runahead   | Runahead |  |
|-----------------------|------------|----------|--|
|                       | execution* | buffer** |  |
| Flush ROB             |            |          |  |
| Short intervals       | *          | *        |  |
| Instructions executed | AII        |          |  |
|                       |            |          |  |
|                       |            |          |  |

|                       | Runahead   | Runahead       |
|-----------------------|------------|----------------|
|                       | execution* | buffer**       |
| Flush ROB             |            |                |
| Short intervals       | *          | *              |
| Instructions executed | AII        | Only one slice |

# Runahead Techniques Provide Limited Prefetch Coverage

Runahead execution: Executes useless instructions

# Runahead Techniques Provide Limited Prefetch Coverage

Runahead execution: Executes useless instructions

Runahead buffer: High coverage for only one slice

# Only One Load does not Lead to Majority of Memory Accesses



# Only One Load does not Lead to Majority of Memory Accesses



Most of the long-latency loads during runahead differ from the stalling load

#### **Applications Access Memory through Multiple Slices**



# **Applications Access Memory through Multiple Slices**



There are more than eight unique load instructions accessing memory during each runahead interval

| Runahead   | Runahead       |
|------------|----------------|
| execution* | buffer**       |
|            |                |
| *          | *              |
| All        | Only one slice |
|            | execution*     |

|                       | Runahead execution* | Runahead buffer** |
|-----------------------|---------------------|-------------------|
| Flush ROB             | <b>√</b>            |                   |
| Short intervals       | *                   | *                 |
| Instructions executed | All                 | Only one slice    |
| Performance           |                     |                   |

|                       | Runahead   | Runahead       |
|-----------------------|------------|----------------|
|                       | execution* | buffer**       |
| Flush ROB             |            |                |
| Short intervals       | *          | *              |
| Instructions executed | All        | Only one slice |
| Performance           | High↑      |                |
|                       |            |                |

|                       | Runahead   | Runahead       |  |
|-----------------------|------------|----------------|--|
|                       | execution* | buffer**       |  |
| Flush ROB             |            |                |  |
| Short intervals       | *          | *              |  |
| Instructions executed | All        | Only one slice |  |
| Performance           | High↑      | High           |  |
|                       |            |                |  |

|                       | Runahead      | Runahead       |  |
|-----------------------|---------------|----------------|--|
|                       | execution*    | buffer**       |  |
| Flush ROB             |               |                |  |
| Short intervals       | *             | *              |  |
| Instructions executed | All           | Only one slice |  |
| Performance           | High <b>†</b> | High↑          |  |
| Energy-Efficiency     |               |                |  |
|                       |               |                |  |

<sup>\*[</sup>Mutlu et al. ISCA' 05]

|                       | Runahead      | Runahead       |  |
|-----------------------|---------------|----------------|--|
|                       | execution*    | buffer**       |  |
| Flush ROB             |               |                |  |
| Short intervals       | *             | *              |  |
| Instructions executed | AII           | Only one slice |  |
| Performance           | High <b>†</b> | High 1         |  |
| Energy-Efficiency     | Low           |                |  |
|                       |               |                |  |

<sup>\*[</sup>Mutlu et al. ISCA' 05]

<sup>\*\*[</sup>Hashemi et al. MICRO' 15]

|                       | Runahead   | Runahead       |  |
|-----------------------|------------|----------------|--|
|                       | execution* | buffer**       |  |
| Flush ROB             |            |                |  |
| Short intervals       | *          | *              |  |
| Instructions executed | All        | Only one slice |  |
| Performance           | High↑      | High↑          |  |
| Energy-Efficiency     | Low        | Same           |  |
|                       |            |                |  |

<sup>\*[</sup>Mutlu et al. ISCA' 05]

<sup>\*\*[</sup>Hashemi et al. MICRO' 15]

|                       | Runahead   | Runahead       |  |
|-----------------------|------------|----------------|--|
|                       | execution* | buffer**       |  |
| Flush ROB             |            |                |  |
| Short intervals       | *          | *              |  |
| Instructions executed | All        | Only one slice |  |
| Performance           | High↑      | High           |  |
| Energy-Efficiency     | Low        | Same           |  |
|                       |            |                |  |

<sup>\*[</sup>Mutlu et al. ISCA' 05]

|                       | Runahead      | Runahead       |   |
|-----------------------|---------------|----------------|---|
|                       | execution*    | buffer**       |   |
| Flush ROB             |               |                | × |
| Short intervals       | ×             | ×              |   |
| Instructions executed | All           | Only one slice |   |
| Performance           | High <b>†</b> | High <b>†</b>  |   |
| Energy-Efficiency     | Low           | Same           |   |
|                       |               |                |   |

<sup>\*[</sup>Mutlu et al. ISCA' 05]

|                       | Runahead execution* | Runahead buffer** |   |
|-----------------------|---------------------|-------------------|---|
|                       | execution           | Dullel            |   |
| Flush ROB             |                     | <b>✓</b>          | × |
| Short intervals       | X                   | *                 |   |
| Instructions executed | All                 | Only one slice    |   |
| Performance           | High <b>†</b>       | High 1            |   |
| Energy-Efficiency     | Low                 | Same              |   |
|                       |                     |                   |   |

<sup>\*[</sup>Mutlu et al. ISCA' 05]

|                       | Runahead      | Runahead       |            |
|-----------------------|---------------|----------------|------------|
|                       | execution*    | buffer**       |            |
| Flush ROB             |               |                | *          |
| Short intervals       | *             | *              |            |
| Instructions executed | All           | Only one slice | All slices |
| Performance           | High <b>†</b> | High <b>†</b>  |            |
| Energy-Efficiency     | Low           | Same           |            |
|                       |               |                |            |

<sup>\*[</sup>Mutlu et al. ISCA' 05]

<sup>\*\*[</sup>Hashemi et al. MICRO' 15]

|                       | Runahead execution* | Runahead buffer** |            |
|-----------------------|---------------------|-------------------|------------|
| Flush ROB             | <b>√</b>            |                   | *          |
| Short intervals       | *                   | *                 |            |
| Instructions executed | AII                 | Only one slice    | All slices |
| Performance           | High                | High              | Very high  |
| Energy-Efficiency     | Low                 | Same              |            |

<sup>\*[</sup>Mutlu et al. ISCA' 05]

|                       | Runahead execution* | Runahead buffer** |            |
|-----------------------|---------------------|-------------------|------------|
| Flush ROB             | ✓ ✓                 | Julio1            | *          |
| Short intervals       | ×                   | *                 |            |
| Instructions executed | All                 | Only one slice    | All slices |
| Performance           | High <b>†</b>       | High              | Very high  |
| Energy-Efficiency     | Low                 | Same              | High       |
|                       |                     |                   |            |

<sup>\*[</sup>Mutlu et al. ISCA' 05]

<sup>\*\*[</sup>Hashemi et al. MICRO' 15]

|                       | Runahead execution* | Runahead<br>buffer** | Precise runahead*** |
|-----------------------|---------------------|----------------------|---------------------|
| Flush ROB             |                     |                      | *                   |
| Short intervals       | *                   | *                    |                     |
| Instructions executed | All                 | Only one slice       | All slices          |
| Performance           | High                | High↑                | Very high           |
| Energy-Efficiency     | Low                 | Same                 | High                |

<sup>\*[</sup>Mutlu et al. ISCA' 05]

<sup>\*\*[</sup>Hashemi et al. MICRO' 15]

**Key insight:** There are sufficient resources to (start) run ahead without flushing the ROB

**Key insight:** There are sufficient resources to (start) run ahead without flushing the ROB

When running ahead:

**Key insight:** There are sufficient resources to (start) run ahead without flushing the ROB

#### When running ahead:

1. Executes only useful instructions in runahead mode

**Key insight:** There are sufficient resources to (start) run ahead without flushing the ROB

#### When running ahead:

1. Executes only useful instructions in runahead mode

2. Efficiently manages microarchitectural resources















There are sufficient resources to start runahead without flushing the ROB

#### **Processor Resources** → Loads **During Runahead** → Producer → Normal commit → Speculative register issue file queue no progress L2 L3 **L1** time current ROB full-window stall memory memory memory

# Processor Resources During Runahead





time

#### **Processor Resources** → Loads **During Runahead** → Producer → Normal commit → Speculative register issue file queue L2 L3 **L1** time future instructions current ROB full-window stall memory memory memory

#### **Processor Resources** → Loads **During Runahead** → Producer → Normal commit → Speculative register issue file queue L2 L3 L4 **L1** time future instructions current ROB full-window stall memory memory memory

#### **Processor Resources** → Loads **During Runahead** → Producer → Normal commit → Speculative register issue file queue L2 L3 L4 **L1** time current ROB future instructions full-window stall memory memory memory memory

# **Processor Resources**



# **Processor Resources**



# **Processor Resources**































#### **Processor Resources** → Loads **During Runahead** → Producer → Normal commit → Speculative register issue file queue L2 L3 **L1** time future instructions current ROB

full-window

stall

memory

memory

memory

memory

# **Processor Resources**









# **Processor Resources**



#### **Processor Resources During Runahead**



## Processor Resources During Runahead



→ Loads

#### **Processor Resources** → Loads **During Runahead** → Producer → Normal commit → Speculative register issue file queue L2 L3 A1 A2 **L1 A3** L4 time current ROB future instructions full-window stall memory memory memory memory

1. How to identify only useful instructions?

1. How to identify only useful instructions?

2. How to recycle (physical) registers?

1. How to identify only useful instructions?



Iterative Backward
Dependency Analysis (IBDA)

2. How to recycle (physical) registers?

1. How to identify only useful instructions?



Iterative Backward
Dependency Analysis (IBDA)

2. How to recycle (physical) registers?



Runahead Register Reclamation



#### Register Allocation Table (RAT)



#### **Register Allocation Table (RAT)**

Arch. register







#### Register Allocation Table (RAT)

Arch. Phy. register







#### Register Allocation Table (RAT)



Phy.

#### **Register Allocation Table (RAT)**

Last-writer

|    |    |             |    | register | register | instruction |
|----|----|-------------|----|----------|----------|-------------|
| A1 | r2 | <del></del> | r1 | r1       | P1       |             |
| A2 | r3 | <del></del> | r2 | r2       | P2       |             |
| A3 | r4 | <del></del> | r3 | r3       | Р3       |             |
| L4 | r5 | <del></del> | r4 | r4       | P4       |             |
|    |    |             |    | r5       | P5       |             |

Arch.

Phy.

#### **Register Allocation Table (RAT)**

Last-writer

|    |    |             |    | register | register | instruction |
|----|----|-------------|----|----------|----------|-------------|
| A1 | r2 | <del></del> | r1 | r1       | P1       | AO          |
| A2 | r3 | ←           | r2 | r2       | P2       |             |
| A3 | r4 | <b></b>     | r3 | r3       | Р3       |             |
| L4 | r5 | <del></del> | r4 | r4       | P4       |             |
|    |    |             |    | r5       | P5       |             |

Arch.

Phv.

#### **Register Allocation Table (RAT)**

**Last-writer** 

|    |             | regist | er register | instruction |
|----|-------------|--------|-------------|-------------|
| A1 | r2 ←        | r1 r1  | P1          | A0          |
| A2 | r3 ←        | r2 r2  | P2          | A1          |
| A3 | r4 <b>←</b> | r3 r3  | Р3          |             |
| L4 | r5 ←—       | r4 r4  | P4          |             |
|    |             | r5     | P5          |             |

Arch.

#### **Register Allocation Table (RAT)**

|    |    |             |    | Arch    | . Phy.     | Last-writer |
|----|----|-------------|----|---------|------------|-------------|
|    |    |             |    | registe | r register | instruction |
| A1 | r2 | <del></del> | r1 | r1      | P1         | AO          |
| A2 | r3 | <del></del> | r2 | r2      | P2         | A1          |
| A3 | r4 | <del></del> | r3 | r3      | Р3         | A2          |
| L4 | r5 | <del></del> | r4 | r4      | P4         |             |
|    |    |             |    | r5      | P5         |             |

### Register Allocation Table (RAT)

|    |    |             |    | registe |
|----|----|-------------|----|---------|
| A1 | r2 | <del></del> | r1 | r1      |
| A2 | r3 | <del></del> | r2 | r2      |
| A3 | r4 | <del></del> | r3 | r3      |
| L4 | r5 | <b></b>     | r4 | r4      |
|    |    |             |    | r5      |

| Arc<br>egist | ,  | Last-writer instruction |  |
|--------------|----|-------------------------|--|
| r1           | P1 | AO                      |  |
| r2           | P2 | A1                      |  |
| r3           | Р3 | A2                      |  |
| r4           | P4 | A3                      |  |
| r5           | P5 |                         |  |

### **Register Allocation Table (RAT)**

|            | Arch.    | Phy.     | Last-writer |
|------------|----------|----------|-------------|
|            | register | register | instruction |
| A1 r2 ← r1 | r1       | P1       | AO          |
| A2 r3 ← r2 | r2       | P2       | A1          |
| A3 r4 ← r3 | r3       | Р3       | A2          |
| L4 r5 ← r4 | r4       | P4       | A3          |
|            | r5       | P5       | L4          |

### Register Allocation Table (RAT)







### **Register Allocation Table (RAT)**





#### Iteration-1:

L4 stalls the window

### **Register Allocation Table (RAT)**



| Arc<br>regist | J  | Last-writer instruction |
|---------------|----|-------------------------|
| r1            | P1 | AO                      |
| r2            | P2 | A1                      |
| r3            | Р3 | A2                      |
| r4            | P4 | A3                      |
| r5            | P5 | L4                      |



#### Iteration-1:

L4 stalls the window

### Register Allocation Table (RAT)



| Arc<br>regist | <b>,</b> | Last-writer instruction |  |
|---------------|----------|-------------------------|--|
| r1            | P1       | AO                      |  |
| r2            | P2       | A1                      |  |
| r3            | Р3       | A2                      |  |
| r4            | P4       | A3                      |  |
| r5            | P5       | L4                      |  |



### **Register Allocation Table (RAT)**





Iteration-2:

L4 hits in the SST

### **Register Allocation Table (RAT)**







Iteration-2:

L4 hits in the SST



While renaming source r4, read A3

### **Register Allocation Table (RAT)**







Iteration-2:

L4 hits in the SST



While renaming source r4, read A3

### **Register Allocation Table (RAT)**





Iteration-3:

A3 hits in the SST

### Register Allocation Table (RAT)



| Arch.<br>register |    | ,  | Last-writer instruction |
|-------------------|----|----|-------------------------|
|                   | r1 | P1 | AO                      |
|                   | r2 | P2 | A1                      |
|                   | r3 | Р3 | A2                      |
|                   | r4 | P4 | A3                      |
|                   | r5 | P5 | L4                      |



Iteration-3:



A3 hits in the SST While renaming source r3, read A2

### Register Allocation Table (RAT)



| Ard<br>regist | ,  | Last-writer instruction |
|---------------|----|-------------------------|
| r1            | P1 | AO                      |
| r2            | P2 | A1                      |
| r3            | Р3 | A2                      |
| r4            | P4 | A3                      |
| r5            | P5 | L4                      |



Iteration-3: A3 hits in the SST



While renaming source r3, read A2

### **Register Allocation Table (RAT)**



| Arc<br>regist | ,  | Last-writer instruction |
|---------------|----|-------------------------|
| r1            | P1 | AO                      |
| r2            | P2 | A1                      |
| r3            | Р3 | A2                      |
| r4            | P4 | A3                      |
| r5            | P5 | L4                      |



Iteration-4: A2 hits in the SST

### Register Allocation Table (RAT)



| Arch.<br>register |   | Phy.<br>register | Last-writer instruction |  |
|-------------------|---|------------------|-------------------------|--|
| r1 P1             |   | P1               | AO                      |  |
| r                 | 2 | P2               | A1                      |  |
| r                 | 3 | Р3               | A2                      |  |
| r                 | 4 | P4               | A3                      |  |
| r                 | 5 | P5               | L4                      |  |



Iteration-4:



A2 hits in the SST While renaming source r2, read A1

### Register Allocation Table (RAT)



| Arch.<br>register |    | J  | Last-writer instruction |  |
|-------------------|----|----|-------------------------|--|
| r1                |    | P1 | A0                      |  |
| r                 | 2  | P2 | A1                      |  |
| r                 | .3 | Р3 | A2                      |  |
| r                 | ٠4 | P4 | A3                      |  |
| r                 | .2 | P5 | L4                      |  |



Iteration-4:



A2 hits in the SST While renaming source r2, read A1

### Register Allocation Table (RAT)



| Arch.<br>register |    | ,  | Last-writer instruction |  |
|-------------------|----|----|-------------------------|--|
| r1                |    | P1 | AO                      |  |
| r                 | 2  | P2 | A1                      |  |
| r                 | .3 | Р3 | A2                      |  |
| r                 | ٠4 | P4 | A3                      |  |
| r                 | ·5 | P5 | L4                      |  |



Iteration-4: A2 hits in the SST



While renaming source r2, read A1



| 1 |  |  |  |
|---|--|--|--|
|   |  |  |  |
|   |  |  |  |
|   |  |  |  |
|   |  |  |  |
|   |  |  |  |
|   |  |  |  |
|   |  |  |  |
|   |  |  |  |
|   |  |  |  |
|   |  |  |  |
|   |  |  |  |

| instruction | dest | src1 | src2 | OldPhy<br>register |
|-------------|------|------|------|--------------------|
|             |      |      |      |                    |
|             |      |      |      |                    |
|             |      |      |      |                    |
|             |      |      |      |                    |
|             |      |      |      |                    |
|             |      |      |      |                    |

|    | instruction     | dest | src1 | src2 | OldPhy<br>register |
|----|-----------------|------|------|------|--------------------|
| I1 | add r1 ← r2, r3 | P1   | P2   | Р3   | Р0                 |
|    |                 |      |      |      |                    |
|    |                 |      |      |      |                    |
|    |                 |      |      |      |                    |
|    |                 |      |      |      |                    |
|    |                 |      |      |      |                    |

|    | instruction     | dest | src1 | src2 | OldPhy<br>register |
|----|-----------------|------|------|------|--------------------|
| I1 | add r1 ← r2, r3 | P1   | P2   | Р3   | P0                 |
| 12 | mul r2 ← r1, r4 | P5   | P1   | P4   | P2                 |
|    |                 |      |      |      |                    |
|    |                 |      |      |      |                    |
|    |                 |      |      |      |                    |
|    |                 |      |      |      |                    |

|    | instruction               | dest | src1 | src2 | OldPhy<br>register |
|----|---------------------------|------|------|------|--------------------|
| I1 | add r1 ← r2, r3           | P1   | P2   | Р3   | Р0                 |
| 12 | mul r2 ← r1, r4           | P5   | P1   | P4   | P2                 |
| 13 | $ld r1 \leftarrow mem[x]$ | P6   |      |      | P1                 |
|    |                           |      |      |      |                    |
|    |                           |      |      |      |                    |
|    |                           |      |      |      |                    |

|    | instruction               | dest | src1 | src2 | OldPhy<br>register |
|----|---------------------------|------|------|------|--------------------|
| I1 | add r1 ← r2, r3           | P1   | P2   | Р3   | Р0                 |
| 12 | mul r2 ← r1, r4           | P5   | P1   | P4   | P2                 |
| I3 | $ld r1 \leftarrow mem[x]$ | P6   |      |      | P1                 |
| 14 | add r2 ← r1, r3           | P7   | P6   | Р3   | P5                 |
|    |                           |      |      |      |                    |
|    |                           |      |      |      |                    |

|            | instruction               | dest | src1 | src2 | OldPhy<br>register |
|------------|---------------------------|------|------|------|--------------------|
| I1         | add r1 ← r2, r3           | P1   | P2   | Р3   | Р0                 |
| 12         | mul r2 ← r1, r4           | P5   | P1   | P4   | P2                 |
| 13         | $ld r1 \leftarrow mem[x]$ | P6   |      |      | P1                 |
| 14         | add r2 ← r1, r3           | P7   | P6   | Р3   | P5                 |
| <b>I</b> 5 | add r2 ← r4, r5           | P9   | P4   | Р8   | P7                 |
|            |                           |      |      |      |                    |

|    | instruction               | dest | src1 | src2 | OldPhy<br>register |
|----|---------------------------|------|------|------|--------------------|
| I1 | add r1 ← r2, r3           | P1   | P2   | Р3   | Р0                 |
| 12 | mul r2 ← r1, r4           | P5   | P1   | P4   | P2                 |
| I3 | $ld r1 \leftarrow mem[x]$ | P6   |      |      | P1                 |
| 14 | add r2 ← r1, r3           | P7   | P6   | Р3   | P5                 |
| 15 | add r2 ← r4, r5           | P9   | P4   | Р8   | P7                 |
| 16 | sub r1 ← r2, r6           | P11  | P9   | P10  | P6                 |

### normal mode

#### runahead mode

|            | instruction               | dest | src1 | src2 | OldPhy<br>register |
|------------|---------------------------|------|------|------|--------------------|
| I1         | add r1 ← r2, r3           | P1   | P2   | Р3   | Р0                 |
| I2         | mul r2 ← r1, r4           | P5   | P1   | P4   | P2                 |
| I3         | $ld r1 \leftarrow mem[x]$ | P6   |      |      | P1                 |
| 14         | add r2 ← r1, r3           | P7   | P6   | Р3   | P5                 |
| <b>I</b> 5 | add r2 ← r4, r5           | P9   | P4   | Р8   | P7                 |
| 16         | sub r1 ← r2, r6           | P11  | P9   | P10  | P6                 |

#### normal mode

|            | instruction               | dest | src1 | src2 | OldPhy<br>register |
|------------|---------------------------|------|------|------|--------------------|
| I1         | add r1 ← r2, r3           | P1   | P2   | Р3   | Р0                 |
| 12         | mul r2 ← r1, r4           | P5   | P1   | P4   | P2                 |
| 13         | $ld r1 \leftarrow mem[x]$ | P6   |      |      | P1                 |
| 14         | add r2 ← r1, r3           | P7   | P6   | Р3   | P5                 |
| <b>I</b> 5 | add r2 ← r4, r5           | P9   | P4   | Р8   | P7                 |
| 16         | sub r1 ← r2, r6           | P11  | P9   | P10  | P6                 |

### runahead mode

|            | OldPhy<br>register |
|------------|--------------------|
| I1         | PØ                 |
| 12         | P2                 |
| <b>I</b> 3 | P1                 |
| 14         | P5                 |
| <b>I</b> 5 | P7                 |
| <b>I</b> 6 | P6                 |

### runahead mode

|    | OldPhy<br>register |
|----|--------------------|
| I1 | PØ                 |
| 12 | P2                 |
| 13 | P1                 |
| 14 | P5                 |
| 15 | P7                 |
| 16 | P6                 |

#### runahead mode

|    | OldPhy<br>register |
|----|--------------------|
| I1 | P0                 |
| 12 | P2                 |
| 13 | P1                 |
| 14 | P5                 |
| 15 | P7                 |
| 16 | P6                 |

#### runahead mode

|             |            | OldPhy<br>register |
|-------------|------------|--------------------|
|             | I1         | P0                 |
| dienotch —— | 12         | P2                 |
| dispatch>   | 13         | P1                 |
|             | 14         | P5                 |
|             | 15         | P7                 |
|             | <b>I</b> 6 | P6                 |

#### runahead mode

|           |            | OldPhy<br>register | Executed ? |
|-----------|------------|--------------------|------------|
|           | I1         | P0                 |            |
| dispatch> | 12         | P2                 |            |
|           | <b>I</b> 3 | P1                 |            |
|           | 14         | P5                 |            |
|           | <b>I</b> 5 | P7                 |            |
|           | 16         | P6                 |            |

#### runahead mode

|            |            | OldPhy<br>register | Executed ? |
|------------|------------|--------------------|------------|
| dispatch — | I1         | P0                 | 0          |
|            | 12         | P2                 | 0          |
|            | <b>I</b> 3 | P1                 | 0          |
|            | 14         | P5                 | 0          |
|            | <b>I</b> 5 | P7                 | 0          |
|            | <b>I</b> 6 | P6                 | 0          |

#### runahead mode

|            |            | OldPhy<br>register | Executed ? |         |
|------------|------------|--------------------|------------|---------|
|            | I1         | P0                 | 0          |         |
| dispatch — | 12         | P2                 | 0          | over to |
|            | 13         | P1                 | 0          | execute |
|            | 14         | P5                 | 0          |         |
|            | <b>I</b> 5 | P7                 | 0          |         |
|            | <b>I</b> 6 | P6                 | 0          |         |

#### runahead mode

|           |    | OldPhy<br>register | Executed ? |          |
|-----------|----|--------------------|------------|----------|
|           | I1 | P0                 | 1          |          |
|           | 12 | P2                 | 0          | Overente |
| dispatch> | 13 | P1                 | 0          | execute  |
|           | 14 | P5                 | 0          |          |
|           | 15 | P7                 | 0          |          |
|           | 16 | P6                 | 0          |          |

#### runahead mode

|            |            | OldPhy<br>register | Executed ? |         |
|------------|------------|--------------------|------------|---------|
|            | I1         | P0                 | 1          |         |
| dispatch — | I2         | P2                 | 1          | overute |
|            | <b>I</b> 3 | P1                 | 0          | execute |
|            | 14         | P5                 | 0          |         |
|            | <b>I</b> 5 | P7                 | 0          |         |
|            | 16         | P6                 | 0          |         |

#### runahead mode

|            |    | OldPhy<br>register | Executed ? |         |
|------------|----|--------------------|------------|---------|
|            | I1 | P0                 | 1          |         |
| dispatch — | 12 | P2                 | 1          | over to |
|            | 13 | P1                 | 0          | execute |
|            | 14 | P5                 | 0          |         |
|            | 15 | P7                 | 1          |         |
|            | I6 | P6                 | 0          |         |

### Runahead Register Reclamation

#### runahead mode



Precise Register Deallocation Queue (PRDQ)

### Runahead Register Reclamation

#### runahead mode



Precise Register Deallocation Queue (PRDQ)

### Runahead Register Reclamation

#### runahead mode



Precise Register Deallocation Queue (PRDQ)



















Mode















Simulator: Sniper 6.0, McPAT



Simulator: Sniper 6.0, McPAT



Workloads: SPEC CPU2006/CPU2017, 1B SimPoints

Simulator: Sniper 6.0, McPAT



Workloads: SPEC CPU2006/CPU2017, 1B SimPoints

**Baseline:** ROB=192, issue queue=92, load/store queue=64, register file=168/168

OoO: Baseline out-of-order core

OoO: Baseline out-of-order core

RA: Runahead execution\*

-- No short runahead intervals

OoO: Baseline out-of-order core

RA: Runahead execution\*

- -- No short runahead intervals
- -- No overlapping intervals

OoO: Baseline out-of-order core

RA: Runahead execution\*

-- No short runahead intervals

-- No overlapping intervals

RA-buffer: Runahead buffer\*\*

OoO: Baseline out-of-order core

RA: Runahead execution\*

-- No short runahead intervals

-- No overlapping intervals

RA-buffer: Runahead buffer\*\*

RA-hybrid: Better performing mechanism between RA-buffer and RA

**OoO:** Baseline out-of-order core

**RA:** Runahead execution\*

-- No short runahead intervals

-- No overlapping intervals

RA-buffer: Runahead buffer\*\*

PRE: Precise runahead

execution\*\*\*

RA-hybrid: Better performing mechanism between RA-buffer and RA





**RA: 15.9%** 



RA: 15.9% RA-buffer: 13.3%



RA: 15.9% RA-buffer: 13.3% RA-hybrid: 20%



PRE: 38.2%



36











PRE: 38.2%





**RA: 1.5X** 











RA: 26.4%



RA: 26.4% RA-buffer: 27.7%



RA: 26.4% RA-buffer: 27.7% RA-hybrid: 31%



RA: 26.4% RA-buffer: 27.7% RA-hybrid: 31% PRE: 50.2%





**RA: +2.4%** 



RA: +2.4% RA-buffer: Same



RA: +2.4% RA-buffer: Same RA-hybrid: Same



1. Never flushes the ROB

1. Never flushes the ROB

2. Executes only useful instructions in runahead mode

1. Never flushes the ROB

2. Executes only useful instructions in runahead mode

3. Efficiently manages microarchitectural resources

1. Never flushes the ROB

2. Executes only useful instructions in runahead mode

3. Efficiently manages microarchitectural resources

18.2% better performance

1. Never flushes the ROB

2. Executes only useful instructions in runahead mode

3. Efficiently manages microarchitectural resources

18.2% better performance

6.2% better energy

# **Precise Runahead Execution**



Ajeya Naithani
Josue Feliu
Almutaz Adileh
Lieven Eeckhout



