#### **Lecture 14: MIPS R10000**

Kunle Olukotun, Gates 302, kunle@ogun.stanford.edu

http://www-leland.stanford.edu/class/ee282h/

#### **MIPS R10000**

- Shipping in 1996
- Designed to overcome performance limits
  - » Memory bandwidth and latency
  - » Compiler scheduling not effective on integer applications
- Representative modern microprocessor design
  - » multiple instruction issue
  - » register renaming
  - » out-of-order execution
  - » speculative execution
  - » non-blocking caches
  - » precise exceptions
- Detailed look at architecture
- Performance summary
- K. Yeager, "The MIPS R10000 superscalar microprocessor," IEEE Micro, vol. 16, pp. 28–40, 1996.


# **R10K Block Diagram**



# **R10K Pipeline**




#### Instruction Fetch/Decode

- Fetch four instructions per cycle
  - » any word alignment within 16 word cache line
- Decode up to four instructions per cycle
  - » requires four separate decoders
- Instructions not decoded are placed in 8-word instruction buffer
  - » caused by structural hazards or special instructions
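
The alignment rule above can be sketched in a few lines. This is only an illustration: the helper name `fetch_group` is hypothetical, and the sketch assumes that a fetch group never crosses a cache-line boundary, so fewer than four instructions come back near the end of a line.

```python
WORD_BYTES = 4    # MIPS instructions are 32 bits wide
LINE_WORDS = 16   # instruction cache line size from the slide
FETCH_WIDTH = 4   # instructions fetched per cycle

def fetch_group(pc):
    """Return the byte addresses of the instructions fetched starting at pc.

    Assumption (not stated on the slide): a fetch group stops at the end
    of the current cache line instead of wrapping into the next one.
    """
    if pc % WORD_BYTES != 0:
        raise ValueError("pc must be word aligned")
    word_in_line = (pc // WORD_BYTES) % LINE_WORDS
    count = min(FETCH_WIDTH, LINE_WORDS - word_in_line)
    return [pc + i * WORD_BYTES for i in range(count)]

# Any word alignment inside the line yields a full group of four ...
print([hex(a) for a in fetch_group(0x1004)])
# ... except close to the end of the line (under the assumption above).
print([hex(a) for a in fetch_group(0x1038)])
```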


#### **Branch Unit**

- Branches occur frequently
- 512-entry branch history table (BHT) of 2-bit counters
- 1 branch delay slot for taken branches
- Delayed branches add difficulty
- Four entry branch stack
  - » alternate branch address
  - » copy of integer and FP register map tables
- Mispredicted branches
  - » immediately restore state (Int & FP map tables) and fetch from alternate branch address
  - » 4 bit branch mask associated with each instruction used to abort instructions
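
A 2-bit, 512-entry BHT can be modeled as a table of saturating counters. The sketch below is a generic textbook predictor, not the R10000 circuit: the class name, the indexing by low-order word-address bits, and the initial counter value are all assumptions made for illustration.

```python
BHT_ENTRIES = 512

class BranchHistoryTable:
    """Table of 2-bit saturating counters (0-1: not taken, 2-3: taken)."""

    def __init__(self):
        # Assumption: every counter starts at 1 (weakly not taken).
        self.counters = [1] * BHT_ENTRIES

    def _index(self, pc):
        # Assumption: index with the low-order bits of the word address.
        return (pc >> 2) % BHT_ENTRIES

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bht = BranchHistoryTable()
pc = 0x400100
for outcome in (True, True, False, True):
    print(bht.predict(pc), outcome)
    bht.update(pc, outcome)
```

Because the counter saturates, a single not-taken outcome in a mostly taken loop branch does not flip the prediction.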

#### **Instruction Fetch and Branch**




## **Register Renaming**

- Convert 5-bit logical register numbers to 6-bit physical register numbers
  - » Eliminates WAR and WAW hazards
  - » Support for speculation and precise interrupts
- Register map tables
  - » Integer: 33 × 6 bit RAM
  - » FP: 32 × 6 bit RAM
- Free lists
  - » lists of currently unassigned physical registers
  - » 32 entry FIFOs
- Active list
  - » All instructions "in flight" in the machine kept in 32 entry FIFO
  - » provides unique 5-bit ID for each instruction
  - » operates like a reorder buffer
  - » logical destination number
  - » old physical register number
  - » done bit
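
The interplay of map table, free list, and active list can be sketched as follows. This is a simplified model under stated assumptions: one register file with 32 logical and 64 physical registers (the 33rd integer map entry is ignored), an initial identity mapping, and a hypothetical `Renamer` class whose method names are not from the source.

```python
from collections import deque

class Renamer:
    def __init__(self, n_logical=32, n_physical=64, active_size=32):
        # Assumption: logical register i initially maps to physical i.
        self.map_table = list(range(n_logical))
        self.free_list = deque(range(n_logical, n_physical))  # FIFO
        self.active_list = deque()                            # program order
        self.active_size = active_size

    def rename(self, dest, srcs):
        """Rename one instruction; return None if it must stall."""
        if not self.free_list or len(self.active_list) >= self.active_size:
            return None
        phys_srcs = [self.map_table[s] for s in srcs]  # read before remap
        new_phys = self.free_list.popleft()
        entry = {"logical_dest": dest,
                 "old_phys": self.map_table[dest],
                 "done": False}
        self.active_list.append(entry)
        self.map_table[dest] = new_phys
        return new_phys, phys_srcs, entry

    def graduate(self):
        """Retire the oldest instruction if done; free its old register."""
        if self.active_list and self.active_list[0]["done"]:
            entry = self.active_list.popleft()
            self.free_list.append(entry["old_phys"])
            return entry
        return None

r = Renamer()
first = r.rename(1, [2, 3])    # r1 = r2 op r3
second = r.rename(1, [1, 4])   # r1 = r1 op r4, reads the renamed r1
print(first[0], first[1], second[0], second[1])
```

The two writes to logical r1 land in different physical registers, which is how renaming removes the WAW hazard. Keeping the old physical register number in the active list is what allows the previous mapping to be restored after an exception or freed at graduation.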

## **Register Renaming**





## **Queues**

- Integer, Address, FP queues
- Design limits clock frequency
- Entries allocated at decode
- Integer queue
  - » 16 entries
  - » no order
  - » ten 16 bit comparators per entry for RAW hazards
- FP queue
  - » similar to integer queue
- Address queue
  - » similar to integer queue
  - » FIFO order
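
The wakeup and select behavior of an unordered queue such as the integer queue can be sketched as below. This is a simplified model: the `IssueQueue` class is hypothetical, readiness is tracked as a set of outstanding physical source registers instead of hardware comparators, and the selection policy (first ready entry found) is an assumption.

```python
class IssueQueue:
    def __init__(self, size=16):
        self.size = size
        self.entries = []

    def allocate(self, name, srcs, ready_regs):
        """Add an instruction at decode; return False if the queue is full."""
        if len(self.entries) >= self.size:
            return False
        waiting = {s for s in srcs if s not in ready_regs}
        self.entries.append({"name": name, "waiting": waiting})
        return True

    def wakeup(self, dest_phys):
        """Broadcast a completed destination register to all entries."""
        for e in self.entries:
            e["waiting"].discard(dest_phys)

    def select(self):
        """Issue any entry whose operands are all ready, regardless of age."""
        for e in self.entries:
            if not e["waiting"]:
                self.entries.remove(e)
                return e["name"]
        return None

q = IssueQueue()
ready = {1, 2, 3}
q.allocate("i1", [1, 40], ready)   # waits on physical register 40 (RAW)
q.allocate("i2", [2, 3], ready)    # all operands ready
print(q.select())                  # the younger i2 issues first
q.wakeup(40)
print(q.select())                  # now i1 can issue
```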

## **Integer Queue**



# **Register Files**

- Integer register files
  - » 64 registers
  - » 7 read ports
  - » 3 write ports
  - » separate 64 bit condition file
- FP register file
  - » 64 registers
  - » 5 read ports
  - » 3 write ports

#### **Execution Units**

| Unit        | Latency (cycles) | Repeat rate (cycles) | Instruction             |
|-------------|------------------|----------------------|-------------------------|
| Either ALU  | 1                | 1                    | add, sub, logical, trap |
| ALU1        | 1                | 1                    | branch                  |
| ALU2        | 10               | 10                   | 64-bit multiply         |
| ALU2        | 67               | 67                   | 64-bit divide           |
| Load/store  | 2                | 1                    | load integer            |
| FP add      | 2                | 1                    | add, sub, compare       |
| FP multiply | 2                | 1                    | DP multiply             |
| FP divide   | 19               | 21                   | DP divide               |
| Load/store  | 3                | 1                    | load FP value           |
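
To see what the two timing columns mean, the sketch below estimates how long a unit needs for `n` operations. Latency bounds a chain of dependent operations, while the repeat rate bounds a stream of independent ones. The dictionary keys and function names are made up for this example, and the model assumes results are forwarded with exactly the listed latency and that nothing else stalls the unit.

```python
# (latency, repeat rate) in cycles, copied from the table above
TIMING = {
    "int add": (1, 1),
    "int multiply 64": (10, 10),
    "int divide 64": (67, 67),
    "fp multiply": (2, 1),    # DP multiply row
    "fp divide": (19, 21),    # DP divide row
}

def independent_cycles(op, n):
    """Cycles until the last of n independent operations completes."""
    latency, repeat = TIMING[op]
    if n == 0:
        return 0
    return (n - 1) * repeat + latency

def dependent_cycles(op, n):
    """Cycles for a chain of n operations, each needing the previous result."""
    latency, _ = TIMING[op]
    return n * latency

print(independent_cycles("fp multiply", 10))     # pipelined unit: 11
print(dependent_cycles("fp multiply", 10))       # serialized by latency: 20
print(independent_cycles("int multiply 64", 4))  # unpipelined unit: 40
```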


### **Instruction Execution Review**



### **Memory Hierarchy**




#### **Memory Hierarchy**

- Instruction cache
  - » 32 KB, 2-way SA, 64 B line size (16 words)
- Load/store Unit
  - » address calculation
  - » memory address translation (TLB)
    - 64 entries, fully associative
    - 2 pages per entry
- Data cache
  - » 32 KB, 2-way SA, 32 B line size, write-back
  - » 2-way interleaved for bandwidth to support loads, stores, cache refills
  - » nonblocking with four outstanding requests
- Secondary cache
  - » 128-bit-wide interface
  - » 512 KB to 16 MB
  - » pseudo 2-way SA using 8 Kbit MRU table
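
The pseudo 2-way set-associative lookup can be sketched as a way predictor: the MRU bit selects which way to probe first, and the other way is probed only if the first probe misses. The model below is a simplification under stated assumptions: one MRU bit per set (8 Kbit taken as 8192 sets), replacement of the non-MRU way on a miss, and a hypothetical `PseudoTwoWayCache` class that tracks tags only.

```python
class PseudoTwoWayCache:
    def __init__(self, n_sets=8 * 1024):
        self.tags = [[None, None] for _ in range(n_sets)]
        self.mru = [0] * n_sets   # one MRU bit per set (assumption)

    def access(self, set_index, tag):
        first = self.mru[set_index]    # predicted (most recently used) way
        second = 1 - first
        ways = self.tags[set_index]
        if ways[first] == tag:
            return "hit-first"         # found on the first probe
        if ways[second] == tag:
            self.mru[set_index] = second
            return "hit-second"        # found on the extra probe
        # Miss: refill the non-MRU way and make it the MRU way (assumption).
        ways[second] = tag
        self.mru[set_index] = second
        return "miss"

c = PseudoTwoWayCache()
for tag in ("A", "A", "B", "A", "A"):
    print(tag, c.access(5, tag))
```

The design choice this illustrates: the external cache needs only a single tag and data path, as in a direct-mapped cache, and pays an extra probe only when the MRU prediction is wrong.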

## **Implementation**

- 0.35 micron process
- 16.6 × 17.9 mm chip
- 298 mm<sup>2</sup>
- 6.8 million transistors
  - » 4.4 million cache
  - » 2.4 million logic
- Full-custom design for datapaths and time-critical control logic
- Semi-custom design for less critical control logic




## **Nonblocking Data Cache Performance**



# **R10K Performance**



13 SPEC95int and 22 SPEC95fp @ 250 MHz
