# EE739 - Processor Design

# Course Project Report

# PIPELINED IITB-RISC PROCESSOR

# Submitted by

| Name                     | Roll Number |
|--------------------------|-------------|
| Robin James Payyappillil | 193079026   |
| Nikhil Ajith             | 193079027   |

# Contents

- 1 Introduction
- 2 ISA
- 3 Implementation
  - 3.1 Hardware Flow Chart
  - 3.2 Summary of Components in Datapath
    - 3.2.1 Pipelined and CCR Registers
    - 3.2.2 Multiplexers
    - 3.2.3 Memory
    - 3.2.4 ALU and ALU Controller
    - 3.2.5 Forwarding Unit
    - 3.2.6 Control Decoder
    - 3.2.7 Register File
    - 3.2.8 LA LM SA SM Controller
    - 3.2.9 Branch and Jump Controller
    - 3.2.10 Branch Predictor
  - 3.3 Complete Datapath
  - 3.4 Stages in the Pipeline
    - 3.4.1 Instruction Fetch Stage
    - 3.4.2 Instruction Decode + Operand Read Stage
    - 3.4.3 Execution Stage
    - 3.4.4 Memory Stage
    - 3.4.5 Write Back Stage
  - 3.5 Signals present in the Pipeline Registers
    - 3.5.1 IF/ID Register
    - 3.5.2 ID/EX Register
    - 3.5.3 EX/MEM Register
    - 3.5.4 MEM/WB Register
  - 3.6 Hazards
    - 3.6.1 Branches
    - 3.6.2 Load Hazard
    - 3.6.3 LA LM Hazards
- 4 Results
- 5 Conclusion

# 1 Introduction

This project implements IITB-RISC as a 5-stage pipelined processor. It follows the standard five-stage pipeline: instruction fetch, instruction decode and register read, execute, memory access, and write back.

The processor has 8 general-purpose registers (R0 to R7); register R7 always holds the program counter. The ISA has 19 instructions, and the architecture uses a condition code register with two flags, Carry and Zero.

The architecture is optimized for performance: a **forwarding mechanism** limits the stalls caused by RAW hazards, and a **1-bit branch predictor** speculates on branches and jumps.

# 2 ISA

The ISA, as given in the problem statement, is shown below. It has 3 instruction formats (R, I and J) and a total of 19 instructions.

#### R Type Instruction format

| Opcode  | Register A (RA) | Register B (RB) | Register C (RC) | Unused  | Condition (CZ) |
|---------|-----------------|-----------------|-----------------|---------|----------------|
| (4 bit) | (3 bit)         | (3 bit)         | (3 bit)         | (1 bit) | (2 bit)        |

#### I Type Instruction format

| Opcode  | Register A (RA) | Register C (RC) | Immediate       |
|---------|-----------------|-----------------|-----------------|
| (4 bit) | (3 bit)         | (3-bit)         | (6 bits signed) |

#### J Type Instruction format

| Opcode  | Register A (RA) | Immediate       |  |
|---------|-----------------|-----------------|--|
| (4 bit) | (3 bit)         | (9 bits signed) |  |

Figure 1: Instruction Set
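The field layout of the three formats can be illustrated with a small decoder. This is a minimal Python sketch rather than the project's Verilog: the names `sign_extend` and `decode_fields` and the dictionary keys are illustrative assumptions, and the register fields are assumed to sit at bits 11:9, 8:6 and 5:3, as the hardware flow chart in Section 3.1 uses them.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def decode_fields(word: int) -> dict:
    """Split a 16-bit instruction word into the fields of all three formats.

    Hypothetical helper: the caller uses only the fields that belong to the
    format (R, I or J) of the decoded opcode.
    """
    word &= 0xFFFF
    return {
        "opcode": (word >> 12) & 0xF,    # bits 15:12, all formats
        "reg_11_9": (word >> 9) & 0x7,   # first register field, all formats
        "reg_8_6": (word >> 6) & 0x7,    # second register field, R and I type
        "reg_5_3": (word >> 3) & 0x7,    # third register field, R type only
        "cz": word & 0x3,                # condition bits 1:0, R type only
        "imm6": sign_extend(word, 6),    # bits 5:0, I type
        "imm9": sign_extend(word, 9),    # bits 8:0, J type
    }


fields = decode_fields(0b1001_100_111111101)   # a J type word
print(fields["opcode"], fields["reg_11_9"], fields["imm9"])
```

Decoding every field unconditionally mirrors how hardware slices the instruction word in parallel and lets the control logic decide which slices are meaningful.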

The datapath is derived from the hardware flow chart (HFC) and grows with the requirements of each instruction. The HFC for each instruction is shown below.

# 3 Implementation

## 3.1 Hardware Flow Chart

#### ADD, ADC, ADZ, NDU, NDC, NDZ

PC -> IM , ALU1\_A

+1 -> ALU1\_B

ALU1\_OUT -> PC

IM\_D [11:9] -> RF\_A1

IM\_D [8:6] -> RF\_A2

RF\_D1 -> ALU2\_A

RF\_D2 -> ALU2\_B

ALU2\_OUT -> RF\_D3

IM\_D [5:3] -> RF\_A3

#### ADL

PC -> IM , ALU1\_A

+1 -> ALU1\_B

ALU1\_OUT -> PC

IM\_D [11:9] -> RF\_A1

IM\_D [8:6] -> RF\_A2

RF\_D1 -> ALU2\_A

RF\_D2 -> LS1 -> ALU2\_B

ALU2\_OUT -> RF\_D3

IM\_D [5:3] -> RF\_A3

#### ADI

PC -> IM , ALU1\_A

+1 -> ALU1\_B

ALU1\_OUT -> PC

IM\_D [11:9] -> RF\_A1

RF\_D1 -> ALU2\_A

IM\_D[5:0] -> SE10 -> ALU2\_B

ALU2\_OUT -> RF\_D3

IM\_D [8:6] -> RF\_A3

#### LW

PC -> IM , ALU1\_A

+1 -> ALU1\_B

ALU1\_OUT -> PC

IM\_D [8:6] -> RF\_A1

RF\_D1 -> ALU2\_A

IM\_D[5:0] -> SE10 -> ALU2\_B

ALU2\_OUT -> DATA\_MEM\_ADD

IM\_D [11:9] -> RF\_A3

DATA\_MEM\_DATA -> RF\_D3

#### LHI

PC -> IM , ALU1\_A

+1 -> ALU1\_B

ALU1\_OUT -> PC

IM\_D [8:0] -> LS7 -> RF\_D3

IM\_D [11:9] -> RF\_A3

#### SW

PC -> IM , ALU1\_A

+1 -> ALU1\_B

ALU1\_OUT -> PC

IM\_D [8:6] -> RF\_A1

RF\_D1 -> ALU2\_A

IM\_D[5:0] -> SE10 -> ALU2\_B

ALU2\_OUT -> DATA\_MEM\_ADD

IM\_D [11:9] -> RF\_A2

RF\_D2 -> DATA\_MEM\_IN

#### JAL (ALU3 computes PC + 9-bit immediate)

PC -> IM , ALU1\_A, ALU3\_A

+1 -> ALU1\_B

ALU1\_OUT -> RF\_D3

IM\_D [11:9] -> RF\_A3

IM\_D[8:0] -> SE7 -> ALU3\_B

ALU3\_OUT -> PC

#### JLR

PC -> IM , ALU1\_A

+1 -> ALU1\_B

IM\_D [8:6] -> RF\_A2

ALU1\_OUT -> RF\_D3

IM\_D [11:9] -> RF\_A3

RF\_D2 -> PC

#### JRI

PC -> IM , ALU1\_A

+1 -> ALU1\_B

IM\_D [11:9] -> RF\_A1

RF\_D1-> ALU3\_A

IM\_D[8:0] -> SE7 -> ALU3\_B

ALU3\_OUT -> PC

#### BEQ

PC -> IM , ALU1\_A, ALU3\_A

+1 -> ALU1\_B

IM\_D [11:9] -> RF\_A1

IM\_D [8:6] -> RF\_A2

RF\_D1-> COMPARATOR\_A

RF\_D2 -> COMPARATOR\_B

IM\_D[5:0] -> SE10 -> ALU3\_B

if (zero)

ALU3\_OUT -> PC

else

ALU1\_OUT -> PC

#### LA

```
PC -> IM , ALU1_A
+1 -> ALU1_B

ALU1_OUT -> PC

IM_D [11:9] -> RF_A1

for(K=0; K <=6; K++) {

    RF_D1 -> ALU4_A

    K -> ALU4_B

    ALU4_OUT -> DATA_MEM_ADD

    K -> RF_A3

    DATA_MEM_DATA -> RF_D3
}
```

#### SA

```
PC -> IM , ALU1_A
+1 -> ALU1_B

ALU1_OUT -> PC

IM_D [11:9] -> RF_A1

for(K=0; K <=6; K++) {

    RF_D1 -> ALU4_A

    K -> ALU4_B

    ALU4_OUT -> DATA_MEM_ADD

    RF_R6-R0[(K+1)*16 -1 : K*16] -> DATA_MEM_IN

}
```

#### LM

```
PC -> IM , ALU1_A
+1 -> ALU1_B

ALU1_OUT -> PC

IM_D [11:9] -> RF_A1

L = 0

for(K=0; K <=6; K++) {

    RF_D1 -> ALU4_A

    if( IM_D [6:0] [6 - K] == 1){

        L -> ALU4_B

        ALU4_OUT -> DATA_MEM_ADD

        K -> RF_A3

        DATA_MEM_DATA -> RF_D3
        L = L + 1;
    }

}
```

#### SM

```
PC -> IM , ALU1_A
+1 -> ALU1_B

ALU1_OUT -> PC

IM_D [11:9] -> RF_A1

L = 0

for(K=0; K <=6; K++) {

    RF_D1 -> ALU4_A

    if( IM_D [6:0] [6 - K] == 1){

        L -> ALU4_B

        ALU4_OUT -> DATA_MEM_ADD

        RF_R6-R0[(K+1)*16 -1 : K*16] -> DATA_MEM_IN

        L = L + 1;
    }

}
```
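The LM flow chart can be read as the following behavioural model. It is a hedged Python sketch, not the RTL: `load_multiple` is a hypothetical name, the register file and memory are plain lists, and the base address is assumed to be captured once before any register is overwritten.

```python
def load_multiple(regs, mem, ra, mask7):
    """Behavioural model of the LM flow chart (illustrative only).

    regs  : list of 8 register values, modified in place
    mem   : list of 16-bit memory words
    ra    : index of the register holding the base address (IM_D[11:9])
    mask7 : the 7-bit register mask IM_D[6:0]
    """
    base = regs[ra]                      # RF_D1; assumed latched before the loop
    offset = 0                           # L in the flow chart
    for k in range(7):                   # K = 0..6, one candidate register each
        if (mask7 >> (6 - k)) & 1:       # IM_D[6:0][6 - K] == 1
            regs[k] = mem[base + offset] & 0xFFFF
            offset += 1                  # only selected registers consume a word
    return offset                        # number of memory reads performed


regs = [0] * 8
regs[3] = 10
mem = list(range(64))                    # M[i] = i, for illustration
reads = load_multiple(regs, mem, ra=3, mask7=0b1010000)
print(reads, regs)
```

SM follows the same loop with the data direction reversed, and LA and SA are the special case in which every one of the seven registers is selected.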

## 3.2 Summary of Components in Datapath

The implementation has been done in Verilog.

### 3.2.1 Pipelined and CCR Registers

There are 4 pipeline registers, IF/ID, ID/EX, EX/MEM and MEM/WB, one between each pair of adjacent stages, and each has an active-high write enable. Each pipeline register stores the data required by the stages that follow it. There are also 3 additional registers: the Program Counter, the Carry flag and the Zero flag.

### 3.2.2 Multiplexers

As shown in the datapath, multiplexers implement the steering logic. Their control signals come from the main decoder or from one of the auxiliary controllers (branch controller, branch predictor, or LA/LM controller). Most of them are 2-input or 4-input multiplexers. In the EX stage, for example, the forwarded operand 1 from the register file is connected directly to the ALU, while the second ALU input is selected from three sources: the forwarded operand 2 from the register file, the forwarded operand 2 left-shifted by 1 (for the ADL instruction), or the 6-bit immediate sign-extended to 16 bits. Other multiplexers steer the rest of the datapath in the same way.

### 3.2.3 Memory

Each memory address points to two bytes, and the size of the memory is 4096 bytes (4 KB). The data memory and the instruction memory are each 4 KB. All unused locations of the instruction memory are filled with NOP instructions.

### 3.2.4 ALU and ALU Controller

The ALU performs an ADD, NAND or NOP operation based on the control signals from the ALU controller. Although the main decoder decides which operation the ALU should perform, the decision is re-evaluated in the EX stage for ADC, ADZ and the other conditional instructions, to determine from the C and Z flags whether the operation should actually be performed. The ALU controller also determines whether C and Z should be modified. If the condition of a conditional instruction is not met, the ALU performs no operation, and the register file (R0 to R7) write enable and the data memory write enable are forced to logic low (0) so that the instruction does not modify the system state.
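The gating performed by the ALU controller can be sketched as follows. This is an illustrative Python model under stated assumptions: the function name, the signal names and the condition labels `"always"`, `"carry"` and `"zero"` are hypothetical stand-ins for the decoded CZ field, whose bit encoding is not given here.

```python
def alu_control(condition, carry, zero, reg_write, mem_write, sets_flags):
    """Gate the decoder's enables by the C/Z condition (illustrative model).

    condition is "always", "carry" or "zero"; these labels are assumptions
    that stand in for the decoded condition field of the instruction.
    """
    if condition == "always":
        execute = True
    elif condition == "carry":
        execute = bool(carry)
    elif condition == "zero":
        execute = bool(zero)
    else:
        raise ValueError(f"unknown condition: {condition!r}")
    return {
        "alu_enable": execute,                        # ALU does nothing otherwise
        "reg_write": bool(reg_write) and execute,     # forced low if not executed
        "mem_write": bool(mem_write) and execute,     # forced low if not executed
        "flag_write": bool(sets_flags) and execute,   # C and Z stay unchanged
    }


# A carry-conditional add while the carry flag is clear: nothing is written.
print(alu_control("carry", carry=False, zero=True,
                  reg_write=True, mem_write=False, sets_flags=True))
```

Forcing the enables low, instead of cancelling the instruction, lets a conditional instruction whose condition fails flow through the remaining stages like a NOP.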

### 3.2.5 Forwarding Unit

A separate forwarding control unit sits in the execute stage, and forwarding takes place at the beginning of that stage. The unit compares each source operand address with the destination register address of the instructions in the stages ahead, considering only instructions that will actually update the register file (their register file writeEnable is high). If there is a match, it forwards the most recent data.
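The comparison can be sketched in a few lines. This is a minimal Python model, not the project's Verilog: `StageResult` and `forward_operand` are hypothetical names, and the stages ahead are assumed to be passed in order from the most recent instruction to the oldest.

```python
from typing import NamedTuple, Sequence


class StageResult(NamedTuple):
    """What a later pipeline stage is about to write (illustrative record)."""
    write_enable: bool   # register file writeEnable of that instruction
    dest: int            # destination register address
    data: int            # value that will be written


def forward_operand(src: int, rf_value: int,
                    stages_ahead: Sequence[StageResult]) -> int:
    """Pick the operand value for source register `src` in the EX stage.

    stages_ahead must be ordered from the most recent instruction to the
    oldest one, so the first match is the most recent data.
    """
    for stage in stages_ahead:
        if stage.write_enable and stage.dest == src:
            return stage.data
    return rf_value      # no match: use the value read from the register file


ahead = [StageResult(True, 2, 7), StageResult(True, 2, 5)]
print(forward_operand(2, rf_value=0, stages_ahead=ahead))
```

Checking the write enable before the address match matters for conditional instructions: an instruction whose condition failed has its write enable forced low and must not be used as a forwarding source.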

### 3.2.6 Control Decoder

The main control decoder sits in the decode stage. Based on the decoded instruction, it provides the control signals for all the multiplexers of the steering logic and the write enable signals for the memory and the register file.

### 3.2.7 Register File

The register file consists of 8 registers (R0 to R7) that are updated on the positive edge of the clock when the active-high write enable is asserted. Register R7 always stores the program counter: it can be read by any instruction, but it cannot be modified by one. R7 is hardwired to the PC, so that the latest value of the PC is written to R7 on every clock.

Since the SA and SM instructions need access to all the register values, the register file has a separate output that presents the register values as a single 112-bit vector to the LA LM SA SM controller.
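The 112-bit vector is simply the seven 16-bit registers R0 to R6 placed side by side. The Python sketch below illustrates the packing and slicing; the function names are hypothetical, and placing R0 in the least significant 16 bits is an assumption inferred from the slice `RF_R6-R0[(K+1)*16 -1 : K*16]` used in the SA and SM flow charts.

```python
def pack_registers(regs):
    """Concatenate R0..R6 into one 112-bit integer (R0 in bits 15:0, assumed)."""
    vector = 0
    for k in range(7):
        vector |= (regs[k] & 0xFFFF) << (16 * k)
    return vector


def register_slice(vector, k):
    """Return bits (k+1)*16-1 : k*16 of the vector, i.e. register Rk."""
    return (vector >> (16 * k)) & 0xFFFF


regs = [11, 27, 13, 14, 28, 17, 17, 0]    # example values; R7 is not packed
vector = pack_registers(regs)
print([register_slice(vector, k) for k in range(7)])
```

With this layout the controller can select the word to store in iteration K with a fixed 16-bit slice, without issuing a register file read per iteration.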

### 3.2.8 LA LM SA SM Controller

A separate controller handles LA, LM, SA and SM in the memory stage. Since the memory has only one read/write port, the controller stalls the pipeline whenever one of these four instructions reaches the MEM stage, and keeps it stalled until the instruction has finished all of its memory accesses.

### 3.2.9 Branch and Jump Controller

A separate controller selects the next instruction address from the addresses calculated in the ID and EX stages (for JAL, BEQ, JLR and JRI).



Figure 2: Branch Controller

### 3.2.10 Branch Predictor

The branch predictor is a 1-bit predictor with a depth of 8 entries. Each entry is 33 bits wide: a 16-bit PC, a 16-bit branch target address and a 1-bit history bit. If the present program counter matches an entry in the Branch History Table (the look-up table, LUT), the next program counter is taken from the target address stored in the table, through a MUX. The MUX select is the logical AND of match and the history bit, where match goes high when an entry corresponding to the present PC is found in the table. Each time a new jump or branch instruction is seen, the table is updated; the history bit is also updated dynamically based on whether the branch was taken or not taken.



Figure 3: Branch Predictor Logic



Figure 4: Branch State Diagram
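The predictor behaviour described in this section can be modelled as below. This is a hedged Python sketch, not the RTL: the class and method names are hypothetical, entries are allocated with a pre-incremented pointer that wraps after 7 (as described with the results), and allocating a new entry only when the branch is taken is an assumption.

```python
class BranchHistoryTable:
    """Behavioural sketch of an 8-entry, 1-bit branch predictor."""

    def __init__(self, depth=8):
        self.entries = [None] * depth    # each entry: [pc, target, history]
        self.pointer = 0                 # pre-incremented before each allocation

    def _find(self, pc):
        for entry in self.entries:
            if entry is not None and entry[0] == pc:
                return entry
        return None

    def predict(self, pc):
        """Next PC: the stored target if (match AND history), else PC + 1."""
        entry = self._find(pc)
        if entry is not None and entry[2]:
            return entry[1]
        return (pc + 1) & 0xFFFF

    def update(self, pc, target, taken):
        """Record the resolved outcome of the branch or jump at `pc`."""
        entry = self._find(pc)
        if entry is not None:
            entry[1] = target            # refresh the branch target address
            entry[2] = bool(taken)       # 1-bit history: remember the last outcome
        elif taken:
            # Assumption: a new entry is made only when the branch is taken.
            self.pointer = (self.pointer + 1) % len(self.entries)
            self.entries[self.pointer] = [pc, target, True]


bht = BranchHistoryTable()
bht.update(pc=0x0005, target=0x0002, taken=True)   # first JAL at PC = 5
print(hex(bht.predict(0x0005)))
```

A 1-bit history flips after a single misprediction, so a loop-closing branch is mispredicted once on exit and once more on re-entry; that is the usual trade-off against a 2-bit counter.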

## 3.3 Complete Datapath



A high-resolution .svg file of the datapath is also available.

## 3.4 Stages in the Pipeline

### 3.4.1 Instruction Fetch Stage

This is the first stage of the pipeline; it fetches the two-byte instruction from the instruction memory. It contains a PC selector MUX, which selects between PC + 1 and the target addresses resolved in the ID and EX stages (JAL, JLR, JRI, BEQ). The output of the PC selector MUX goes to a branch predictor select MUX, whose other input comes from the branch LUT. The fetched instruction passes through a NOP MUX controlled by the branch/jump controller, which decides whether the current instruction is to be flushed.

### 3.4.2 Instruction Decode + Operand Read Stage

This is the second stage of the pipeline. It determines the operand addresses, the destination address and whether the instruction modifies the memory or the registers, and it generates all the control signals for the steering logic and for modifying the system state. This stage also contains the register file described in Section 3.2.7. It has a NOP MUX to flush the instruction when a branch is resolved at the EX stage and the speculation turns out to be incorrect. The JAL instruction is resolved at the ID stage. The first time a JAL at a particular PC occurs there is a 1-cycle penalty; if the same JAL occurs again at the same PC there is no penalty, since the predictor takes care of it.

### 3.4.3 Execution Stage

This is the third stage of the pipeline. It contains the forwarding controller, the ALU and the ALU controller, which generates the enables for the carry and zero flags and redefines the register and memory write signals for conditional instructions. The BEQ, JLR and JRI instructions are resolved at the EX stage, since the forwarded values of the registers are available there. If a BEQ is speculated correctly there is no penalty; if the speculation is incorrect there is a 2-cycle penalty.

### 3.4.4 Memory Stage

This is the fourth stage of the pipeline. It contains a 4 KB memory with an active-high write enable, and a separate LA LM SA SM controller. The main function of the controller is to stall the pipeline whenever an LA, LM, SA or SM instruction reaches the memory stage. While that happens, the instruction following it is held at the ID stage by a separate hazard logic block, and the controller holds the LA, LM, SA or SM instruction in the MEM stage until all its memory accesses are completed. The controller also computes the consecutive memory addresses to be accessed. This is necessary because the memory, in practice, cannot have multiple read or write ports.

### 3.4.5 Write Back Stage

This is the final stage of the pipeline. It updates the register file if the corresponding write enable is high.

## 3.5 Signals present in the Pipeline Registers

### 3.5.1 IF/ID Register

- (1) Current Program Counter (PC): needed for calculating the branch/jump address.
- (2) PC + 1: needed if the prediction fails, and for updating the register file for JAL and JLR.
- (3) Instruction: the instruction currently being executed.
- (4) Speculation: whether the branch has been speculated.

### 3.5.2 ID/EX Register

- (1) Current Program Counter (PC): needed for calculating the branch/jump address.
- (2) PC + 1: needed if the prediction fails.
- (3) Control signals: all the control signals for the steering logic.
- (4) Source operand addresses: needed for the forwarding logic at EX.
- (5) WB forwarding entry: holds the destination address and data from the WB stage, solely for forwarding across a 3-cycles-apart data dependency. WB updates the register file at the positive edge of CLK, so the data would only be readable at the next clock edge; this entry makes it available for forwarding. The WB stage updates the register file and this pipeline register entry simultaneously.
- (6) Instruction: the instruction currently being executed.
- (7) Speculation: whether the branch has been speculated.

### 3.5.3 EX/MEM Register

- (1) Control signals: for the steering logic, plus the register and memory write enable signals.
- (2) Destination register address and ALU result: for updating results, accessing or updating memory, and forwarding.
- (3) Register data: to be written to memory in case of SW, SA, SM.
- (4) Memory access address.

### 3.5.4 MEM/WB Register

- (1) Register File write enable.
- (2) Destination address of the Register File.
- (3) Data to be updated at the Register File.

## 3.6 Hazards

To minimize stalling and improve performance, we have implemented **data forwarding** and **dynamic branch prediction using a 1-bit history**.

Data forwarding takes care of immediate, 2-cycles-apart and 3-cycles-apart data dependencies.

A few more issues need to be handled beyond these two mechanisms. The first is branch misprediction, in which case the pipeline needs to be flushed. We flush the pipeline by inserting NOP instructions wherever needed, since that was the most direct approach. The hazards addressed in this implementation are discussed below.

### 3.6.1 Branches

Branch instructions are speculated using the predictor. If the prediction is incorrect, the instructions in IF and ID are replaced with NOPs. JLR and JRI compute their target address from the value of a register, so prediction may not work most of the time; prediction is therefore applied only to JAL and BEQ. JLR and JRI always incur a two-cycle penalty, because they are resolved at the EX stage, where forwarding makes the latest register values available.

### 3.6.2 Load Hazard

Consider the instruction sequence below.

I1: lw ra, MemoryAddr

I2: op rd, ra, rb

Figure 5: Illustration of Immediate dependency on load

When I1 reaches the EX stage, I2 is in the ID stage. However, the value of ra only becomes available at the end of the memory stage, so a 1-cycle stall is needed.
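The stall condition is the usual load-use check, sketched below in Python. The function and parameter names are hypothetical; the model only states when the instruction in ID has to wait, not how the stall is wired into the pipeline registers.

```python
def needs_load_stall(ex_is_load, ex_dest, id_sources):
    """Return True when the instruction in ID must wait one cycle.

    ex_is_load : the instruction currently in EX is a load (I1)
    ex_dest    : destination register of the instruction in EX (ra of I1)
    id_sources : source registers of the instruction in ID (ra, rb of I2)
    """
    return bool(ex_is_load) and ex_dest in id_sources


# I1: lw r1, ...   followed by   I2: op r3, r1, r2
print(needs_load_stall(True, ex_dest=1, id_sources=(1, 2)))
```

After the one-cycle bubble the loaded value has left the memory stage, so it can reach I2 through the normal forwarding path.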

### 3.6.3 LA LM Hazards

Consider the instruction sequence below.

I1: la/lm, MemoryAddr

I2: op rd, ra, rb

Figure 6: Illustration of LA or LM hazards

When instruction I1 reaches the MEM stage, the pipeline is stalled so that the memory is accessed through its single port once per cycle. LA needs to access the memory for 7 cycles. If instruction I2 were allowed to stay in the EX stage, it would not see the register values that LA updates. Therefore I2 is stalled in the ID stage, and it proceeds once I1 has completed its memory accesses in the MEM stage. Any register update from the LA/LM instruction that is still pending in WB is handled through forwarding.

# 4 Results

Several test cases were run to make sure that accurate results are obtained. To our knowledge, the results are satisfactory. Here, we demonstrate a couple of programs being executed.

(1) The first program computes the sum of the natural numbers up to 20. The expected output is $\frac{n(n+1)}{2} = \frac{20 \cdot 21}{2} = 210$.

The program is shown below.

| PC    | Instruction       | Comment                                  |
|-------|-------------------|------------------------------------------|
| 0x000 | ADD R0,R0,R0      | R0 = 0                                   |
| 0x001 | ADI R1,R0,010100  | R1 = 20 (SUM OF NATURAL NUMBERS TILL 20) |
| 0x002 | BEQ R0,R1,000100  |                                          |
| 0x003 | ADI R0,R0,000001  |                                          |
| 0x004 | ADD R2,R2,R0      | R2 WILL HAVE FINAL RESULT (20*21/2)      |
| 0x005 | JAL R4, 111111101 |                                          |

Figure 7: Code to find sum of natural numbers upto 20

The values of all **registers at the beginning are 0**. Register R1 holds the value 20; when register R0 becomes equal to R1 after repeated iterations, the final sum is held in R2. The code from PC = 0x002 to 0x005 repeats 20 times.

The encoded instruction memory is shown in the figure below.

```
// PROGRAM TO FIND SUM OF NUMBERS UPTO 20
instruction_mem[0] <= 16'b0001000000000000;  // ADD R0,R0,R0
instruction_mem[1] <= 16'b0000000001010100;  // ADI R1,R0,010100  (R1 = 20)
instruction_mem[2] <= 16'b1000000001000100;  // BEQ R0,R1,000100
instruction_mem[3] <= 16'b0000000000000001;  // ADI R0,R0,000001
instruction_mem[4] <= 16'b0001010000010000;  // ADD R2,R2,R0
instruction_mem[5] <= 16'b1001100111111101;  // JAL R4, 111111101
```

Figure 8: Screenshot of the program in Instruction Memory

All the other locations of Instruction memory are filled with NOP instructions.
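The 16-bit words of this program can be derived from the instruction formats of Section 2. The Python sketch below is illustrative only: the helper names are hypothetical, the opcode values are read from the four leading bits of the listing, and the register fields are assumed to sit at bits 11:9, 8:6 and 5:3 as in the hardware flow chart.

```python
ADD, ADI, BEQ, JAL = 0b0001, 0b0000, 0b1000, 0b1001   # opcodes taken from the listing


def encode_r(opcode, reg_11_9, reg_8_6, reg_5_3, cz=0):
    """R type: opcode | reg[11:9] | reg[8:6] | reg[5:3] | unused | CZ."""
    return (opcode << 12) | (reg_11_9 << 9) | (reg_8_6 << 6) | (reg_5_3 << 3) | cz


def encode_i(opcode, reg_11_9, reg_8_6, imm6):
    """I type: opcode | reg[11:9] | reg[8:6] | 6-bit signed immediate."""
    return (opcode << 12) | (reg_11_9 << 9) | (reg_8_6 << 6) | (imm6 & 0x3F)


def encode_j(opcode, reg_11_9, imm9):
    """J type: opcode | reg[11:9] | 9-bit signed immediate."""
    return (opcode << 12) | (reg_11_9 << 9) | (imm9 & 0x1FF)


program = [
    encode_r(ADD, 0, 0, 0),        # ADD R0,R0,R0
    encode_i(ADI, 0, 1, 20),       # ADI R1,R0,010100
    encode_i(BEQ, 0, 1, 4),        # BEQ R0,R1,000100
    encode_i(ADI, 0, 0, 1),        # ADI R0,R0,000001
    encode_r(ADD, 2, 0, 2),        # ADD R2,R2,R0 (destination in bits 5:3)
    encode_j(JAL, 4, -3),          # JAL R4, 111111101
]
for address, word in enumerate(program):
    print(f"instruction_mem[{address}] <= 16'b{word:016b};")
```

Masking the immediates to 6 and 9 bits produces the two's-complement encoding, which is why the JAL offset of -3 appears as `111111101`.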

The program took **85** cycles with prediction to produce the result, after which the PC points to location 0x006. Without prediction it would theoretically have taken 104 cycles, so predicting jumps alone improves performance.

| Instruction | Number of Cycles          | Comments                                                                                      |
|-------------|---------------------------|-----------------------------------------------------------------------------------------------|
| I0: (ADD)   | 1                         | -                                                                                             |
| I1: (ADI)   | 1                         | -                                                                                             |
| I2: (BEQ)   | 20 (repeats 20 times) + 2 | Predicted NT by default for the first 20 iterations. 2-cycle penalty for the final iteration. |
| I3: (ADI)   | 20 (repeats 20 times)     | -                                                                                             |
| I4: (ADD)   | 20 (repeats 20 times)     | -                                                                                             |
| I5: (JAL)   | 20 (repeats 20 times) + 1 | 1-cycle penalty the first time. For the remaining iterations it is correctly predicted taken. |
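The two cycle counts follow from the per-instruction accounting in the table. The short Python sketch below reproduces the arithmetic; the figure without prediction assumes that every one of the 20 JALs then pays the 1-cycle penalty and that nothing else changes, which is an assumption consistent with the reported 104 cycles.

```python
ITERATIONS = 20

cycles_with_prediction = {
    "I0 ADD": 1,
    "I1 ADI": 1,
    "I2 BEQ": ITERATIONS + 2,   # 2-cycle penalty on the final, taken BEQ
    "I3 ADI": ITERATIONS,
    "I4 ADD": ITERATIONS,
    "I5 JAL": ITERATIONS + 1,   # 1-cycle penalty the first time only
}
total_with = sum(cycles_with_prediction.values())

# Assumption: without prediction, the remaining 19 JALs also pay 1 cycle each.
total_without = total_with + (ITERATIONS - 1) * 1

print(total_with, total_without)   # 85 104
```

The saving is 19 cycles, one for each JAL after the first, which is roughly 18% of the 104-cycle run.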

The final expected result is 210 in register R2, which is shown in the figure below.



Figure 9: Screenshot of Register File

The figure below shows a screenshot of the ModelSim waveform with branch prediction. As shown in the highlighted area, when PC = 0x005 the instruction is JAL, and the next PC is predicted as 0x0002.



Figure 10: Demonstration of JAL prediction

The BEQ is predicted as NT for the first 20 iterations, since R0 and R1 are not equal. When they become equal and the branch is taken, an entry is made in the Branch History Table. In the current program, however, the BEQ at PC = 0x0002 is not executed again after it is taken, so this entry is never used.

The figure below shows the entries in the Branch History Table (BHT). Entries are made by pre-incrementing a pointer; when the pointer reaches 7, it wraps around to 0 and overwrites the existing entries. In this scenario, entry 1 is for JAL: the most significant 16 bits hold the PC (0x0005 in this case), the next 16 bits hold the BTA (0x0002 in this case) and the LSB holds the history bit (always 1 for JAL). BEQ is in entry 2, since it is TAKEN only in the final iteration. For BEQ, the PC field is 0x0002, the BTA is 0x0006 and the history bit is 1 (since it was TAKEN the last time, while exiting the loop). A test case showing successful prediction of a branch is in the spreadsheet attached with the archive.



Figure 11: Entries in Branch History Table
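The 33-bit entry layout described above (PC in the upper 16 bits, BTA in the next 16 bits, history bit in the LSB) can be checked with a small packing helper. This is an illustrative Python sketch; the function names are hypothetical.

```python
def pack_entry(pc, target, history):
    """Build a 33-bit BHT entry: {PC[15:0], BTA[15:0], history bit}."""
    return ((pc & 0xFFFF) << 17) | ((target & 0xFFFF) << 1) | (1 if history else 0)


def unpack_entry(entry):
    """Split a 33-bit BHT entry back into (pc, target, history)."""
    return (entry >> 17) & 0xFFFF, (entry >> 1) & 0xFFFF, bool(entry & 1)


jal_entry = pack_entry(0x0005, 0x0002, True)    # the JAL entry described above
beq_entry = pack_entry(0x0002, 0x0006, True)    # the BEQ entry described above
print(hex(jal_entry), hex(beq_entry))           # 0xa0005 0x4000d
```

Because the history bit occupies bit 0, the raw hexadecimal value of an entry is not simply the PC followed by the BTA, which is worth keeping in mind when reading the table in a waveform viewer.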

(2) For completeness, the second program mainly illustrates the load and store instructions. The **initial conditions** are shown below.

| Register | Value  |
|----------|--------|
| R0       | 0      |
| R1       | 1      |
| R2       | 2      |
| R3       | 3      |
| R4       | 4      |
| R5       | 5      |
| R6       | 6      |
| R7       | 0 (PC) |

The data memory is initialized sequentially, that is, M[0] contains 0, M[1] contains 1, and so on. Hence M[4095] is 4095.

```
ADD R0, R0, R0
ADD R2, R1, R6
ADD R5, R2, R4
LA R5
ADD R1, R2, R3
LW R5, R5, 1
ADD R4, R5, R0
SA R4
SW R2, R6, 1
```

Figure 12: Test Program - 2

The encoded program in instruction memory is shown below.

```
instruction_mem[0] <= 16'b0001000000000000;  // ADD R0, R0, R0
instruction_mem[1] <= 16'b0001001110010000;  // ADD R2, R1, R6
instruction_mem[2] <= 16'b0001010100101000;  // ADD R5, R2, R4
instruction_mem[3] <= 16'b1110101000000000;  // LA R5
instruction_mem[4] <= 16'b0001010011001000;  // ADD R1, R2, R3
instruction_mem[5] <= 16'b0100101101000001;  // LW R5, R5, 1
instruction_mem[6] <= 16'b0001101000100000;  // ADD R4, R5, R0
instruction_mem[7] <= 16'b1111100000000000;  // SA R4
instruction_mem[8] <= 16'b0101010110000001;  // SW R2, R6, 1
```

Figure 13: Screenshot of Program in Instruction memory

The state of the registers after the execution of each instruction is tabulated below.

| Register | After ADD R2, R1, R6 | After ADD R5, R2, R4 | After LA R5 | After ADD R1, R2, R3 | After LW R5, R5, 1 | After ADD R4, R5, R0 |
|----------|----------------------|----------------------|-------------|----------------------|--------------------|----------------------|
| R0       | 0                    | 0                    | 11          | 11                   | 11                 | 11                   |
| R1       | 1                    | 1                    | 12          | 27                   | 27                 | 27                   |
| R2       | 7                    | 7                    | 13          | 13                   | 13                 | 13                   |
| R3       | 3                    | 3                    | 14          | 14                   | 14                 | 14                   |
| R4       | 4                    | 4                    | 15          | 15                   | 15                 | 28                   |
| R5       | 5                    | 11                   | 16          | 16                   | 17                 | 17                   |
| R6       | 6                    | 6                    | 17          | 17                   | 17                 | 17                   |
| R7       | PC                   | PC                   | PC          | PC                   | PC                 | PC                   |

Finally, there are 2 more instructions, SA R4 and SW R2, R6, 1.

SA R4 stores the contents of R0 to R6 in successive memory locations starting from M[R4] = M[28].

Hence M[28] to M[34] hold R0 to R6. Also, SW R2, R6, 1 means M[R6 + 1] $\leftarrow$ R2, that is, M[18] = 13.

The registers and the modified memory locations finally hold the following values.

| Register or memory location | Value |
|-----------------------------|-------|
| R0           | 11    |
| R1           | 27    |
| R2           | 13    |
| R3           | 14    |
| R4           | 28    |
| R5           | 17    |
| R6           | 17    |
| R7           | PC    |
| M[18]        | 13    |
| M[28]        | 11    |
| M[29]        | 27    |
| M[30]        | 13    |
| M[31]        | 14    |
| M[32]        | 28    |
| M[33]        | 17    |
| M[34]        | 17    |
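The expected values can be cross-checked with a purely functional model of the second test program. The Python sketch below ignores the pipeline entirely and only mirrors the register and memory semantics of the instructions used; the helper names are hypothetical, R7 (the PC) is not modelled, and LA and SA are assumed to capture their base address before the first transfer.

```python
def run_program_2():
    """Execute the second test program on a simple architectural model."""
    regs = [0, 1, 2, 3, 4, 5, 6, 0]     # initial R0..R6; R7 (PC) is not modelled
    mem = list(range(4096))             # M[i] = i, as in the initial conditions

    def add(rc, ra, rb):                # ADD RC, RA, RB
        regs[rc] = (regs[ra] + regs[rb]) & 0xFFFF

    def la(ra):                         # load R0..R6 from M[RA] .. M[RA + 6]
        base = regs[ra]                 # assumed captured before the loop
        for k in range(7):
            regs[k] = mem[base + k]

    def lw(rd, rbase, imm):             # RD = M[Rbase + imm]
        regs[rd] = mem[regs[rbase] + imm]

    def sa(ra):                         # store R0..R6 to M[RA] .. M[RA + 6]
        base = regs[ra]
        for k in range(7):
            mem[base + k] = regs[k]

    def sw(rs, rbase, imm):             # M[Rbase + imm] = RS
        mem[regs[rbase] + imm] = regs[rs]

    add(0, 0, 0)
    add(2, 1, 6)
    add(5, 2, 4)
    la(5)
    add(1, 2, 3)
    lw(5, 5, 1)
    add(4, 5, 0)
    sa(4)
    sw(2, 6, 1)
    return regs, mem


regs, mem = run_program_2()
print(regs[:7], mem[18], mem[28:35])
```

A model like this is useful as a golden reference: the same program can be run in the RTL simulation and the final register file and data memory compared against it.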

The screenshot of the register file is attached below; it matches the expected results.



Figure 14: Screenshot of Register File for second Test Case

The modified memory locations are highlighted in the screenshot below; they also match the expected results.



Figure 15: Screenshot of Data Memory for second Test Case

More test cases are attached in the spreadsheet present in the archive.

# 5 Conclusion

A 5-stage, 16-bit pipelined processor that supports 19 instructions has been implemented and optimized for performance with data forwarding and 1-bit branch prediction. The design has been synthesized in Quartus 18.1, and RTL simulations have been carried out in ModelSim.