

CMP

**Key:**
u(<): Unsigned less than
u(>=): Unsigned greater than or equal to
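
As a small illustration of why the key distinguishes unsigned comparisons, the sketch below contrasts signed and unsigned less-than on 32-bit values. It assumes 32-bit operands; the function names are hypothetical and not taken from the design.

```python
MASK32 = 0xFFFFFFFF  # assumption: operands are 32 bits wide


def to_signed32(x: int) -> int:
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def signed_lt(a: int, b: int) -> bool:
    """Signed less-than on 32-bit values."""
    return to_signed32(a) < to_signed32(b)


def unsigned_lt(a: int, b: int) -> bool:
    """u(<): unsigned less-than on 32-bit values."""
    return (a & MASK32) < (b & MASK32)


def unsigned_ge(a: int, b: int) -> bool:
    """u(>=): unsigned greater-than-or-equal, the complement of u(<)."""
    return not unsigned_lt(a, b)
```

With these definitions, the bit pattern 0xFFFFFFFF is less than 1 when compared as signed (it reads as -1) but not when compared as unsigned.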



# JMP Logic



# Forwarding Unit Logic

#### Inputs:

idex\_ireg\_out.rs1
idex\_ireg\_out.rs2
exmem\_ireg\_out.rd
exmem\_ireg\_out.opcode
memwb\_ireg\_out.rd
exmem\_ctrlreg\_out.regfile\_ld
memwb\_ctrlreg\_out.regfile\_ld

### **Outputs:**

rs1forw\_mux\_sel
rs2forw\_mux\_sel

#### Defaults:

rs1forw\_mux\_sel = forw\_mux::idex\_rs1reg\_out
rs2forw\_mux\_sel = forw\_mux::idex\_rs2reg\_out

```
// RS1
if (exmem_ctrlreg_out.regfile_ld & (exmem_ireg_out.rd != 0)
    & (exmem_ireg_out.rd == idex_ireg_out.rs1))
     if (exmem_ireg_out.opcode == op_load)
           rs1forw_mux_sel = forw_mux::dcache_rdata
     else
           rs1forw_mux_sel = forw_mux::exmem_alureg_out
else if (memwb_ctrlreg_out.regfile_ld & (memwb_ireg_out.rd != 0)
    & !(exmem_ctrlreg_out.regfile_ld & (exmem_ireg_out.rd != 0)
        & (exmem_ireg_out.rd == idex_ireg_out.rs1))
    & (memwb_ireg_out.rd == idex_ireg_out.rs1))
     rs1forw_mux_sel = forw_mux::regfile_wdata
// RS2
if (exmem_ctrlreg_out.regfile_ld & (exmem_ireg_out.rd != 0)
    & (exmem_ireg_out.rd == idex_ireg_out.rs2))
     if (exmem_ireg_out.opcode == op_load)
           rs2forw_mux_sel = forw_mux::dcache_rdata
     else
           rs2forw_mux_sel = forw_mux::exmem_alureg_out
else if (memwb_ctrlreg_out.regfile_ld & (memwb_ireg_out.rd != 0)
    & !(exmem_ctrlreg_out.regfile_ld & (exmem_ireg_out.rd != 0)
        & (exmem_ireg_out.rd == idex_ireg_out.rs2))
    & (memwb_ireg_out.rd == idex_ireg_out.rs2))
     rs2forw_mux_sel = forw_mux::regfile_wdata
```
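
The priority between the two forwarding sources can be checked with a small behavioral model. This is only a sketch: the function `forward_sel` and its argument names are hypothetical, and the returned strings simply mirror the forw\_mux selections named in the logic above. The same function serves both rs1 and rs2 by passing the relevant source register and default.

```python
def forward_sel(rs, exmem_rd, exmem_is_load, exmem_regfile_ld,
                memwb_rd, memwb_regfile_ld, default):
    """Return the forwarding mux selection for one source register.

    rs                -- source register index in ID/EX (rs1 or rs2)
    exmem_is_load     -- True when the EX/MEM opcode is a load
    default           -- selection used when no forwarding applies
    """
    # EX/MEM holds the most recent producer, so it is checked first.
    if exmem_regfile_ld and exmem_rd != 0 and exmem_rd == rs:
        return "dcache_rdata" if exmem_is_load else "exmem_alureg_out"
    # MEM/WB is used only when EX/MEM does not already match.
    if memwb_regfile_ld and memwb_rd != 0 and memwb_rd == rs:
        return "regfile_wdata"
    return default


# Both stages write register 5: the younger EX/MEM value wins.
sel_both = forward_sel(5, 5, False, True, 5, True, "idex_rs1reg_out")
# Register 0 is never forwarded, even if a stage claims to write it.
sel_zero = forward_sel(0, 0, False, True, 0, True, "idex_rs1reg_out")
```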

# Mem Forwarding Unit Logic

#### Inputs:

idex\_ireg\_out.opcode
idex\_ireg\_out.rs2
memwb\_ireg\_out.rd
memwb\_ctrlreg\_out.regfile\_ld

### **Outputs:**

memforw\_mux\_sel

#### Defaults:

memforw\_mux\_sel = memforw\_mux::exmem\_rs2reg\_out

# **Memory Hierarchy**



# Arbiter



# **Advanced Features**

# **Prefetching Unit**



### **Arbiter Control Unit State Machine**



```
Output at each state:
do nothing
- mem_read = 0
- mem_write = 0
- addr_sel = addr_mux::icache_addr
- icache_resp = 0
- dcache_resp = 0
- pf_resp = 0
icache read
- mem_read = 1
- mem_write = 0
- addr_sel = addr_mux::icache_addr
- icache_resp = mem_resp
- dcache_resp = 0
- pf_resp = 0
dcache read
- mem_read = 1
- mem_write = 0
- addr_sel = addr_mux::dcache_addr
- dcache_resp = mem_resp
- icache_resp = 0
- pf_resp = 0
dcache write
- mem_read = 0
- mem_write = 1
- addr_sel = addr_mux::dcache_addr
- dcache_resp = mem_resp
- icache_resp = 0
- pf_resp = 0
prefetch read
- mem_read = 1
- mem_write = 0
- addr_sel = addr_mux::pf_addr
- pf_resp = mem_resp
- dcache_resp = 0
- icache_resp = 0
```
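
The per-state outputs listed above can be written as a small lookup function, which makes it easy to check that exactly one requester sees the memory response in each state. This is a sketch only: the function name and the state strings are hypothetical, and the state transitions are not modeled because they are not described here.

```python
def arbiter_outputs(state: str, mem_resp: int) -> dict:
    """Return the arbiter control outputs for a given state.

    The values follow the per-state output listing; only the
    Python names (state strings, dict keys) are assumptions.
    """
    # Start from the "do nothing" outputs and override per state.
    out = dict(mem_read=0, mem_write=0, addr_sel="icache_addr",
               icache_resp=0, dcache_resp=0, pf_resp=0)
    if state == "icache_read":
        out.update(mem_read=1, icache_resp=mem_resp)
    elif state == "dcache_read":
        out.update(mem_read=1, addr_sel="dcache_addr", dcache_resp=mem_resp)
    elif state == "dcache_write":
        out.update(mem_write=1, addr_sel="dcache_addr", dcache_resp=mem_resp)
    elif state == "prefetch_read":
        out.update(mem_read=1, addr_sel="pf_addr", pf_resp=mem_resp)
    elif state != "do_nothing":
        raise ValueError(f"unknown state: {state}")
    return out
```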

# **Branch History Register**



The branch history register (BHR) keeps track of the outcomes of the last N branches. This N-bit history is used as part of the index into the PHT.
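
A minimal behavioral sketch of such a history register is shown below. It assumes the newest outcome enters at the MSB so that the oldest outcome falls off the LSB end, and it assumes the PHT index is formed by concatenating a few low PC bits with the history. The class name, `pht_index`, and the `pc_bits` parameter are hypothetical.

```python
class BranchHistoryRegister:
    """N-bit shift register holding the last N branch outcomes."""

    def __init__(self, n: int):
        self.n = n
        self.value = 0

    def update(self, br_en: int) -> None:
        # Assumption: shift right, dropping the LSB (oldest outcome),
        # and place the newest outcome in the MSB.
        self.value = ((br_en & 1) << (self.n - 1)) | (self.value >> 1)


def pht_index(pc: int, bhr: BranchHistoryRegister, pc_bits: int) -> int:
    """Hypothetical index: {low PC bits above the word offset, history}."""
    pc_part = (pc >> 2) & ((1 << pc_bits) - 1)
    return (pc_part << bhr.n) | bhr.value


bhr = BranchHistoryRegister(4)
for outcome in (1, 0, 1, 1):
    bhr.update(outcome)
history_after_four = bhr.value  # 0b1101
```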

# Pattern History Table (Local History Table & 2nd Level of Global Predictor)



The FSM shown to the right decides which state each branch entry is in, based on whether that specific branch is taken or not taken. The state stored in the table is updated according to the FSM.
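
Since the FSM figure is not reproduced here, the sketch below assumes the common 2-bit saturating counter (0 = strongly not taken, 3 = strongly taken). The actual FSM in this design may differ, and the function names are hypothetical.

```python
def next_state(state: int, taken: bool) -> int:
    """Advance an assumed 2-bit saturating counter by one branch outcome."""
    if taken:
        return min(state + 1, 3)  # saturate at strongly taken
    return max(state - 1, 0)      # saturate at strongly not taken


def predict_taken(state: int) -> bool:
    """Predict taken in the two upper states."""
    return state >= 2


# Starting weakly not taken, two taken branches reach strongly taken;
# a single not-taken outcome then leaves the prediction unchanged.
s = 1
s = next_state(s, True)   # 2
s = next_state(s, True)   # 3
s = next_state(s, False)  # 2
```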

### **Global Predictor**



br\_en tells us the actual outcome of the branch. Shifting it into the BHR shifts out the LSB value.

### **Tournament Predictor**



### **Advanced Cache**

L2 Cache, 4-way set associative cache, parameterized cache



The 4-way cache would use a pseudo-LRU policy to approximate LRU replacement instead of a true LRU policy because it needs less hardware. A true LRU policy would have to keep the full order of previous line accesses, which has 4! = 24 possible orderings. This would be expensive and not worth the cost. A binary tree instead needs only 3 bits to approximate the ordering.
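
The 3-bit tree can be sketched as follows. This is a behavioral model under one assumed bit convention (each bit points toward the side to evict next); the class and method names are hypothetical and the real encoding in the cache may differ.

```python
class TreePLRU4:
    """Tree pseudo-LRU for one 4-way set, using 3 bits.

    b0 selects the half to evict from (0 = ways 0/1, 1 = ways 2/3),
    b1 selects between ways 0 and 1, b2 between ways 2 and 3.
    """

    def __init__(self):
        self.b0 = self.b1 = self.b2 = 0

    def touch(self, way: int) -> None:
        """On an access, point every bit on the path away from `way`."""
        if way < 2:
            self.b0 = 1            # evict from the other half next
            self.b1 = 1 - way      # point at the sibling way
        else:
            self.b0 = 0
            self.b2 = 3 - way      # way 2 -> 1 (way 3), way 3 -> 0 (way 2)

    def victim(self) -> int:
        """Follow the bits down the tree to the way to replace."""
        return self.b1 if self.b0 == 0 else 2 + self.b2
```

After accesses to ways 0, 2, 1, 3 the victim is way 0, matching true LRU. After accesses to ways 0, 1, 2, 3, 0 the tree picks way 2 while true LRU would pick way 1, which shows where the approximation gives up accuracy for the smaller state.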

Since the L2 cache does not have to respond to hits in one cycle like the other caches do, it can respond to hits in 2 cycles like in mp3. However, we are not sure yet what the final hit response time will be for our L2 cache.

We can also parameterize the number of sets in our cache. This gives us the option to tune our memory system later by adding more sets. More sets make the cache footprint bigger but can allow more hits. The tradeoffs will be evaluated in more detail further down the road when we try to maximize performance.
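
To make the effect of the set-count parameter concrete, the sketch below splits an address into tag, index, and offset for a given number of sets and computes the data footprint. The 32-byte line size, 4 ways as a default, and the function names are assumptions for illustration.

```python
def split_address(addr: int, num_sets: int, line_bytes: int = 32):
    """Split an address into (tag, index, offset).

    Both num_sets and line_bytes must be powers of two.
    line_bytes=32 is an assumed line size.
    """
    assert num_sets > 0 and num_sets & (num_sets - 1) == 0
    assert line_bytes > 0 and line_bytes & (line_bytes - 1) == 0
    offset_bits = line_bytes.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


def data_bytes(num_sets: int, ways: int = 4, line_bytes: int = 32) -> int:
    """Data array footprint in bytes (tags and metadata not counted)."""
    return num_sets * ways * line_bytes
```

Doubling the number of sets adds one index bit, removes one tag bit, and doubles the data footprint.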

### Write Buffer



The write buffer has to sit in between levels of memory to intercept incoming writes. This lets lower level caches not waste as much time waiting for writebacks and can instead move on to getting read responses.

The buffer itself is similar to a cache in that it needs multiple lines to store information and logic to check whether there is a hit on one of them.

For incoming writes, if there is space, the buffer stores the data and the address in an internal entry. If there is no space, it is forced to write the data back. For incoming reads, if the address hits, the buffered data is returned because it is the most up-to-date copy. If the read misses, the read request is forwarded to the next level. When the buffer is idle, it can choose an entry to write back.
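
The behavior described above can be sketched as a small model. Several details are assumptions: a full buffer passes the incoming write straight through to the next level, entries drain oldest first when idle, and a plain dict stands in for the next memory level. The class and method names are hypothetical.

```python
class WriteBuffer:
    """Behavioral sketch of a write buffer between two memory levels."""

    def __init__(self, capacity: int, next_level: dict):
        self.capacity = capacity
        self.entries = {}             # addr -> data, in insertion order
        self.next_level = next_level  # stand-in for the next memory level

    def write(self, addr: int, data) -> None:
        if addr in self.entries or len(self.entries) < self.capacity:
            self.entries[addr] = data     # buffer (or update) the write
        else:
            self.next_level[addr] = data  # assumption: no space, write through

    def read(self, addr: int):
        if addr in self.entries:
            return self.entries[addr]     # hit: buffer has the newest copy
        return self.next_level.get(addr)  # miss: forward to the next level

    def idle(self) -> None:
        """When idle, drain the oldest entry (assumed policy)."""
        if self.entries:
            addr = next(iter(self.entries))
            self.next_level[addr] = self.entries.pop(addr)


mem = {}
wb = WriteBuffer(2, mem)
wb.write(0x10, 1)
wb.write(0x20, 2)
wb.write(0x30, 3)  # buffer is full, so this write goes to mem
```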

## Multiplier/Division Unit



We will 'prep' the operands by sign-extending them as needed so that each operation produces the correct result.

Our multiplication unit takes in either the original or the forwarded register data, along with the type of multiplication or division to perform. The operation type selects which unit to activate and whether to use the product, quotient, or remainder buffer.
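
As a reference model for the operand preparation and result selection, the sketch below assumes the RV32M operations (mul, mulh, mulhsu, mulhu, div, divu, rem, remu), which are not listed in this section. Python integers stand in for the sign-extended operands, and the function names are hypothetical. It models results only, not the multi-cycle hardware algorithm.

```python
MASK32 = 0xFFFFFFFF


def sext32(x: int) -> int:
    """Interpret the low 32 bits of x as a signed value."""
    x &= MASK32
    return x - (1 << 32) if x >> 31 else x


def prep(op: str, a: int, b: int):
    """Treat each operand as signed or unsigned depending on the operation."""
    signed_a = op in ("mul", "mulh", "mulhsu", "div", "rem")
    signed_b = op in ("mul", "mulh", "div", "rem")
    x = sext32(a) if signed_a else a & MASK32
    y = sext32(b) if signed_b else b & MASK32
    return x, y


def mul_div(op: str, a: int, b: int) -> int:
    """Return the 32-bit result for the assumed RV32M operation."""
    x, y = prep(op, a, b)
    if op == "mul":
        return (x * y) & MASK32               # low half of the product
    if op in ("mulh", "mulhsu", "mulhu"):
        return ((x * y) >> 32) & MASK32       # high half of the product
    if op in ("div", "divu"):
        if y == 0:
            return MASK32                     # divide by zero: all ones
        q = abs(x) // abs(y)                  # truncate toward zero
        return (-q if (x < 0) != (y < 0) else q) & MASK32
    if op in ("rem", "remu"):
        if y == 0:
            return x & MASK32                 # divide by zero: dividend
        r = abs(x) % abs(y)
        return (-r if x < 0 else r) & MASK32  # sign follows the dividend
    raise ValueError(f"unknown operation: {op}")
```

For example, `mul_div("div", 0xFFFFFFF9, 2)` treats the first operand as -7 and returns 0xFFFFFFFD (-3), while `mul_div("divu", 0xFFFFFFF9, 2)` treats it as a large unsigned value and returns 0x7FFFFFFC.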

Our control word would most likely include a load-multiplier signal that indicates when to load the multiplier. The EX-MEM stage would then be stalled until the multiplication or division operation is done (checked by the mult\_div\_done signal).

Algorithm for division: https://ieeexplore.ieee.org/document/146763 "A hardware algorithm for integer division"