

## Politecnico di Torino

## Microelectronic Systems

# DLX Microprocessor: Design & Development Final Project Report

Master degree in Computer Engineering

Referents: Prof. Mariagrazia Graziano, Giovanna Turvani

Authors: Group 19

Fausto Chiatante, Alessandro Tempia Calvino

October 21, 2018

# Contents

| 1 |     | oduction             | 1  |
|---|-----|----------------------|----|
|   | 1.1 | Architecture         | 1  |
|   | 1.2 | Implementation       | 2  |
| 2 | Des | O .                  | 4  |
|   | 2.1 | Top Level            | 4  |
|   | 2.2 | Control Unit         | 5  |
|   | 2.3 | DataPath             | 5  |
|   | 2.4 | ALU                  | 6  |
|   | 2.5 | Forwarding Unit      | 6  |
|   | 2.6 | Hazard Unit          | 9  |
|   | 2.7 | Branch Target Buffer | 9  |
| 3 | Con |                      | 12 |
|   | 3.1 | P4 adder             | 12 |
|   | 3.2 | Floating Point Adder | 13 |

### CHAPTER 1

## Introduction

#### 1.1 Architecture

The DLX (DELUXE) is a fully pipelined processor with a RISC architecture derived from notorious predecessors (AMD 29K, DECstation 3100, HP 850, IBM 801, Intel i860, MIPS M/120A, MIPS M/1000, Motorola 88K, RISC I, SGI 4D/60, SPARCstation-1, Sun-4/110,Sun-4/260) / 13 = 560 = DLX (in roman number).

It is based on the Harvard Architecture which relies on two different memories for instructions and data, allowing simultaneous instruction-fetching and data transactions.

It has a simple 32-bits Load/store architecture with 32 integer registers (R0-R31) and 32 floating-point registers (F0-F31). It uses two main addressing modes:

- Immediate: OP RD, RS1, #56, in which one immediate operand is included in the operation
- Displacement: OP RD, 128(RS), in which the memory is addressed adding an immediate to the content of a register

It features a 5-stage datapath composed of the FETCH, DECODE, EXECUTE, MEMORY and WRITEBACK phases, inhereted from the MIPS architecture.

It executes three types of instructions:

- R-Type: pure registers instruction with two source registers and one destination register specified (ALU register operations); table 1.1
- I-Type: one immediate source, optional source register and destination register (Immediate ALU operations and Branches); table 1.2
- J-Type: Jump and Jump&Link instructions; table 1.3

| _ |            |         |         |        |           |
|---|------------|---------|---------|--------|-----------|
|   | OpCode (6) | RS1 (5) | RS2 (5) | RD (5) | FUNC (11) |

Table 1.1: R-Type instruction.

| OpCode (6) | RS1 (5) | RS2 (5) | IMMEDIATE (16) |
|------------|---------|---------|----------------|

Table 1.2: I-Type instruction.

| OpCode (6) | OFFSET (26) |
|------------|-------------|
| Opcode (o) | 011011 (20) |

Table 1.3: J-Type instruction.

## 1.2 Implementation

| R-t       | ype  | General I | nstructions |
|-----------|------|-----------|-------------|
| Operation | Code | Operation | Code        |
| SLL       | 0x04 | J         | j,0x02      |
| SRL       | 0x06 | JAL       | j,0x03      |
| SRA       | 0x07 | BEQZ      | i,0x04      |
| ADD       | 0x20 | BNEZ      | i,0x05      |
| ADDU      | 0x21 | ADDI      | i,0x08      |
| SUB       | 0x22 | ADDUI     | i,0x09      |
| SUBU      | 0x23 | SUBI      | i,0x0A      |
| AND       | 0x24 | SUBUI     | i,0x0B      |
| OR        | 0x25 | ANDI      | i,0x0C      |
| XOR       | 0x26 | ORI       | i,0x0D      |
| SEQ       | 0x28 | XORI      | i,0x0E      |
| SNE       | 0x29 | JR        | j,0x12      |
| SLT       | 0x2A | JALR      | j,0x13      |
| SGT       | 0x2B | SLLI      | i,0x14      |
| SLE       | 0x2C | NOP       | n,0x15      |
| SGE       | 0x2D | SRLI      | i,0x16      |
| SLTU      | 0x3A | SRAI      | i,0x17      |
| SGTU      | 0x3B | SEQI      | i,0x18      |
| SLEU      | 0x3C | SNEI      | i,0x19      |
| SGEU      | 0x3D | SLTI      | i,0x1A      |
|           |      | SGTI      | i,0x1B      |
|           |      | SLEI      | i,0x1C      |
|           |      | SGEI      | i,0x1D      |
|           |      |           | i,0x23      |
|           |      | SW        | i,0x2B      |
|           |      | SLTUI     | i,0x3A      |
|           |      | SGTUI     | i,0x3B      |
|           |      | SLEUI     | i,0x3C      |
|           |      | SGEUI     | i,0x3D      |

Table 1.4: Supported instruction set

The processor developed in this project keeps unvaried the base architecture of DLX adding optional features such has:

- Hardwired Control Unit
- Expanded instruction set architecture (Jump&Link, Unsigned operations, shift operations, comparison operations)
- Advanced Datapath with reduced branch delay

- Efficient ALU logic (Pentium4 adder, Logic Unit and Shifter derived from SUN Microsystem SPARC T2)
- Forwarding
- Hazard Detection and stalls
- Branch prediction

The ISA supported by the final implementation is contained in table 1.4.

## CHAPTER 2

## Design

### 2.1 Top Level

The top-level during the design phase includes all the major components of the projects (figure 2.1), including the two memories, and it only needs to be inserted into a testbench that provides Clk and Rst signals.



Figure 2.1: Schematic of the DLX top-level entity a-DLX.

At this level are present the connections between the upper-most components. Here also takes place the fetch stage of the pipeline. In this phase, at each clock cycle, the Program Counter (PC) is continuously updated with the correct Next-PC, in order to fetch an instruction from Instruction RAM. The Instruction Register (IR) will contain the latest-fetched instruction.

The IR is parsed to deliver RS1, RS2, RD and the Immediate value to the datapath, according the type of instruction in execution.

#### 2.2 Control Unit

The implemented Control Unit (CU) is of type hard-wired, which means that all the decision are the result of a combinational net. This decision has been taken to ensure the fastest and most easily readable realization.

The CU communicate with the datapath using a control word (CW) generated from the OpCode and eventual Func contained in the IR. This CW is composed of 29 single-bit signals, detailed in tables from 2.1 to 2.4, used to set the right configuration of the datapath. The CW is pipelined in four stages according to the architecture of the processor (the FETCH stage is already completed when an instruction enters the CU's domain). The CU also delivers a 5-bit Operation Code (ALU\_OpCode) which determine the behavior of the ALU during the EXE stage.

| 1           | 2           | 3           | 4           | 5           | 6             | 7          | 8                 | 9             | 10   | 11         | 12         | 13          |
|-------------|-------------|-------------|-------------|-------------|---------------|------------|-------------------|---------------|------|------------|------------|-------------|
| I/R<br>Type | RF<br>Read1 | RF<br>Read2 | RegA<br>LEn | RegB<br>LEn | RegImm<br>LEn | RD1<br>LEn | Sign/<br>Unsigned | MuxImm<br>Sel | Jump | Jump<br>En | Eq<br>Cond | MuxA<br>Sel |

Table 2.1: DECODE stage signals of CW

| 14   | 15    | 16      | 17  |
|------|-------|---------|-----|
| MuxB | RegMe | RALUout | RD2 |
| Sel  | LEn   | LEn     | LEn |

Table 2.2: EXECUTE stage signals of CW

| 18   | 19   | 20   | 21    | 22  | 23       | 24  | 25  |
|------|------|------|-------|-----|----------|-----|-----|
| PC+8 | DRAM | DRAM | DRAM  | LMD | RALUOut2 | RD3 | PC  |
| LEn  | LEn  | Read | Write | LEn | LEn      | LEn | LEn |

Table 2.3: MEMORY stage signals of CW

| 26    | 27    | 28   | 29  |
|-------|-------|------|-----|
| WBMux | RF    | ROut | J&L |
| Sel   | Write | LEn  | Mux |

Table 2.4: WRITEBACK stage signals of CW

There are also present the two signals STALL and Flush\_BTB which will be described respectively in sections 2.6 and 2.7.

#### 2.3 DataPath

The datapath contains the most important phases and critical components of a processor, detailed in the figure 2.2. Four of the five pipeline stages take place at this level.

The first one is the DECODE phase in which the instruction is interpreted by the CU. At this point the register file is accessed and the registers content read. The 16 or 26-bit immediate is extended as Signed/Unsigned and the correct one between the two is selected using a MUX (MuxImm).

Accessing the RF is done in parallel even before decoding the instruction thanks to the fixed location format of DLX instructions. This technique is known as fixed-field decoding.

In this stage are also included all the branching features, anticipated with respect to the standard DLX. This choice is intended to reduce the branch delay slot from three to one clock cycle, with respect to implementations which calculate branch targets through the EXE stage. Reducing the Branch Delay Slot comes with the trade-off of including an additional adder in the DEC stage, resulting in increased area.

In the EXECUTION stage the ALU elaborate the two operands coming from the previous cycle. It performs arithmetic and logical operations selected by the CU according to the current instruction. ALU is also used to calculate memory addresses to perform DRAM accesses.

In the MEMORY stage the data RAM is accessed either in Read or Write, if the instruction is a LOAD or a STORE. Otherwise, the stage is simply bypassed.

Finally, in the WRITE-BACK stage the result of the pipeline (either coming from ALU or DRAM) is saved to the Register File, if needed. At this stage also takes place the Jump&Link procedure which saves the original PC value in the register R31 for future use.

#### **2.4** ALU

The ALU, in figure 2.3, contains 4 main components:

- A Pentium4 adder/subtractor which uses a sparse tree carry-generator and 4-bit carry-select blocks
  to improve the performance over traditional adders. This is the most critical component of the design
  as it has the biggest combinational delay in the system.
- A Logical Unit derived from the UltraSPARC T2 which features a two-level combinational logic, and is detailed in the schematic 2.4. The Logical Unit has been simplified with respect to the original implementation as the ISA of the DLX doesn't include NAND, NOR and XNOR. The two input lines A and B are composed of 32 bits, therefore is important to note that all the NAND gates are implemented internally as 32 3-input-NAND gates.
- A 3-stages Shifter derived from the UltraSPARC T2 as in figure 2.5. The first stage generates four masks
  on 40 bits, shifting at multiples of 8-bits depending on the type of shifting (Left, Right and Arithmetic
  Right). The second stage act a coarse grain shifting choosing the right mask between the available
  ones. At the end, a fine grain shifting translates back the operand to 32 bits with the right shift applied.
- A comparator able to evaluate <, >, ==, !=, >=, <= between two operands. To perform these operations a subtraction between the two operands is needed, and it is performed by the Pentium4 adder. The result of the subtraction and the eventual carry out is taken as an input and a 1 or 0 on 32 bits is produced, depending on the wanted comparison. The comparator is able to perform operations both in signed and unsigned mode.

Talk about multi-cycle

## 2.5 Forwarding Unit

In a pipelined architecture, parallelization is exploited to achieve higher clock frequencies and performance. On the other side, though, having multiple operations in execution at the same time (five operations in this case) can lead to numerous data dependencies. Read-after-Write (RAW) dependencies can



Figure 2.2: Schematic of the Datapath.



Figure 2.3: High-level schematic of the ALU's architecture.

be solved effectively using the Forwarding technique. Forwarding allows data to skip the usual path ending with a WRITE-BACK in the Register File, and arrive directly where is needed. Data is produced mainly from:

- ALU: the output of RegALUout.
- DRAM: the output of MuxC, coming from RegLMD.

On the other hand, data is read from:

- ALU: Operand A and B taken as input from ALU.
- DRAM: the address targeted (doesn't cause problem because is directly connected to the ALU, which will be taken care of with forwarding) and the data to be written.
- Branching: the register values checked to either take a branch or not.

We need to select at each reading point which of the possible data in the pipeline are to be read, either coming form the previous stage, the ALU output or the DRAM output. The Forwarding Unit introduces four signals that control the MUXs inserted before each reading point. The most important information that the FU take as input, is the source and destination registers of each instruction in execution. The second needed information is wether the instruction in execution will write the result in the register file or not. This informations are pipelined (a similar pipeline to the datapath) inside the FU and compared at each stage to reveal possible data dependencies. Depending on the stage in which the dependency is discovered. data will be forwarded from ALU or from MEM stage. As an example, OperationX (just terminated the EXE stage) present a destination register which is equal to the source register A of OperationY (about to enter the EXE stage). The FU will recognize that operationX will write in the Register File, that the register is shared with OperationY and will react activating ForwardA on position 2 (refer to figure 2.2 for the MUX's positions).



Figure 2.4: Schematic of the Logic Unit.

#### 2.6 Hazard Unit

Not all data dependencies can be solved using forwarding. Two particular situations are critical in the presented pipeline:

- A LOAD operation followed by an ALU operation which uses the data retrieved from memory: Data
  is retrieved in MEM stage and should be used in EXE stage. In two successive instructions this two
  stages are concurrent, making forwarding useless.
- A Branch checking a register that has been written in the previous instruction: the data to be written
  in the register is calculated in the EXE stage, while the (anticipated) branch is performed in the DEC
  stage, resulting in the same problem.

More in general, every time a data dependency involves two stage that would necessitate to "go back in time", forwarding cannot be used and other techniques are to be taken in consideration. Normal execution cannot be an option therefore a stall is used. Stalling means inserting a bubble in the normal program execution, in order to delay the problematic instruction just enough to resolve the data hazard. In both of the cases highlighted above, delaying the execution of 1 clock cycle is enough to resolve the problem.

A stall in this architecture is implemented transforming the faulty instruction, currently in the DEC stage, into a NOP which will have no effect on the system. At the same time, fetching has to be paused for 1 clock cycle, in order to be able to re-execute the same instruction in the following cycle. Fetch pausing is performed by disabling the LatchEn on the PC register.

### 2.7 Branch Target Buffer

Another idea in order to increase code execution performances is to insert a branch prediction unit in the processor.

Branches are a very critical part of the code. Take a wrong branch path means to undo the incorrect instructions executed, creating a bubble in the pipeline. That happens because the jump direction is unknown



Figure 2.5: Schematic of the Shifter.

during the fetch as it is calculated in the decode stage.

The goal of this unit is to minimize the possibility of take wrong branches. The classical and easier branch decision is the static one. In this case branches are considered always taken or not taken but this is efficient in term of performance. A little improvement could be realized considering that usually forward branches are more often not taken and that backward branches as more often taken due to loops structures.

The BTB is a type of dynamic prediction. It takes decisions depending the previous history of the instructions. It is composed by a small direct-associative memory in which each entry contains the address of the considered branch and the target value to be loaded in the PC. The access to the memory is realized trough the lower bits of the instruction address depending the number of lines of it the cache. The PC is updated at the end of the fetch stage, before that the branch instruction is decoded, following the criteria indicated in the table 2.5.

| Instruction in buffer | Prediction | Actual branch | Penality c.c. | Action          |
|-----------------------|------------|---------------|---------------|-----------------|
| Yes                   | Taken      | Taken         | 0             | Nothing         |
| Yes                   | Taken      | Not taken     | 1             | Remove from BTB |
| No                    | Not taken  | Taken         | 1             | Add in BTB      |
| No                    | Not taken  | Not taken     | 0             | Nothing         |

Table 2.5: BTB penalty and action table

A basic architecture can be seen in figure 2.6.

In order to undo instructions, in the case of wrong prediction, a BTB logic raises a flush signal to the control unit in the decode stage. Then the CU substitute the wrong instruction with a NOP to guarantee a correct program flow. In particular, a flush is raised when:

- In the previous clock cycle there was not a flush, in order to not allow a comparison on a wrong instruction;
- The previous prediction of the branch was wrong i.e. it differs from the one calculated in the decode stage;



Figure 2.6: Branch-target buffer architecture.

#### CHAPTER 3

# Combinational logic insights

#### 3.1 P4 adder

This is a particular type of adder, on 32 bits, based on a Sparse-Tree carry generator logic that generates the carry every four bits. The carries go into a carry select adder block that generates in parallel the sum for a carry in of zero and of one 4 bits using Ripple-Carry adders, see figure 3.1.



Figure 3.1: Carry Select Block.

This adder carry logic is based on the concept of generation and propagation.

• A generation happens when a carry bit is generated by the sum of two bits, so they are both one.

$$g_i = a_i \wedge b_i$$
.

• A propagation happens when one of the two bits are 1 so that a carry can propagates through.

$$p_i = a_i \vee b_i$$
.

Using generation and propagation the carry can be written in this way:

$$C_{i+1} = (a_i \wedge b_i) \vee (a_i \wedge c_i) \vee (b_i \wedge c_i) = g_i \vee (p_i \wedge c_i)$$

If  $g_i = 1$  then the carry is generated at step i, while if  $p_i = 1$  then input carry is propagated. Iteratively:

$$C_{i+1} = g_i \vee (p_i \wedge (g_{i-1} \vee (p_{i-1} \wedge c_{i-2})))$$

These values are calculated in the first step of the Carry Sparse-Tree in the PG network.

Other two types of blocks should be defined in the carry net. To do that two more definitions are needed:

• The general generate of a stage i is equal to 1 if there is a generation of a carry in i or if there is a propagation in i and a generation in i-1:

$$G_i = G_i \vee (P_i \wedge G_{i-1})$$

• The general propagate of a stage i is equal to 1 if there is a propagate both in stage i and i-1.

$$P_i = P_i \wedge P_{i-1}$$

With these two definitions we can define two types of blocks: PG that implements both the general generation and the general propagation and G that implements only the general generation. They have a particular topology and disposition that can be seen in figure 3.2.



Figure 3.2: Carry Sparse-Tree: PG in white and G in grey.

Could be necessary adding a Carry in to the adder. To do that we need to modify the first block of the PG network adding the general generate and a Cin input. So the resulting generate of the block will be:

$$G = (a_0 \wedge b_0) \vee (C_{in} \wedge (a_0 \oplus b_0));$$

This can be very useful in order to allow the adder also to do the subtraction. To do that just put a xor port between b and the carry in for every b input. Cin now is the control SUB/ADD.

## 3.2 Floating Point Adder