

# Computer Architecture Final Project: Single Cycle CPU

TA: 黃士豪 (Shih-hao Huang)

Due: 2022/12/26 (Mon.) 23:59 (UTC+8)

Email: r10943004@ntu.edu.tw



#### Goal

- Implement a single cycle CPU
- Add multiplication/division unit (mulDiv) to CPU (HW2)
- Handle multi-cycle operations
- Get more familiar with assembly and Verilog



### **Specification**





### **Port Definition**

| Name        | I/O | Width | Description                                                             |
|-------------|-----|-------|-------------------------------------------------------------------------|
| clk         | I   | 1     | Positive edge-triggered clock                                           |
| rst_n       | I   | 1     | Asynchronous negative edge reset                                        |
| mem_wen_D   | O   | 1     | 0: Read data from data/stack memory; 1: Write data to data/stack memory |
| mem_addr_D  | O   | 32    | Address of data/stack memory                                            |
| mem_wdata_D | O   | 32    | Data written to data/stack memory                                       |
| mem_rdata_D | I   | 32    | Data read from data/stack memory                                        |
| mem_addr_I  | O   | 32    | Address of instruction (text) memory                                    |
| mem_rdata_I | I   | 32    | Instruction read from instruction (text) memory                         |



### **Memory Layout (1/2)**

- In RARS simulator
- Text
  - Program code
- Data
  - Variables, arrays, etc.
- Stack
  - Automatic storage





### Relate Memory to Testbench (1/4)

Instruction (text) memory





### Relate Memory to Testbench (2/4)

Data/stack memory





### Relate Memory to Testbench (3/4)

 Reduce size of memory blocks to improve simulation speed



```
`define SIZE_TEXT 36
`define SIZE_DATA 36
`define SIZE_STACK 36
```

 Define offset address for each memory block



 Define high impedance to avoid output conflict

Not synthesizable coding style!

```
always @(*) begin
    q = {(BITS){1'bz}};
    for (i=0; i<word_depth; i=i+1) begin
        if (mem_addr[i] == a)
            q = mem[i];
    end
    if (wen) q = d;
end
```



### Relate Memory to Testbench (4/4)

In RARS

In Testbench



| Value (+0) | Value (+4) | Value (+8) | Value (+c) |
|------------|------------|------------|------------|
| 0x00000006 | 0x00000009 | 0x0000000e | 0x00000004 |
| 0x00000000 | 0x00000000 | 0x00000000 | 0x00000000 |
| 0x00000000 | 0x00000000 | 0x00000000 | 0x00000000 |
| 0x00000000 | 0x00000000 | 0x00000000 | 0x00000000 |





#### **Architecture**

Not complete (does not include jal, jalr, ...)





### **Supporting Instructions**

- Your design must <u>at least</u> support
  - auipc, jal, jalr
  - beq, lw, sw
  - addi, slti, add, sub, xor
  - mul
- For bonus challengers
  - bge, srai, slli ... (you might have to use these instructions to finish your bonus)
- See "Instruction\_Set\_Listings.pdf" for more information on the machine code
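As a rough illustration of how the required instructions can be told apart, the following Python sketch splits a 32-bit instruction word into its fields and names it. The helper `decode` and the lookup tables are hypothetical names introduced here; the opcode and funct values follow the standard RISC-V encodings. The sample words are taken from the bonus text dump shown later in this document.

```python
# Lookup tables keyed by (funct7, funct3) or funct3; names are illustrative.
R_TYPE = {(0x00, 0): "add", (0x20, 0): "sub", (0x00, 4): "xor", (0x01, 0): "mul"}
I_ALU = {0: "addi", 2: "slti"}


def decode(word):
    """Split a 32-bit instruction word into fields and name it (hypothetical helper)."""
    opcode = word & 0x7F
    fields = {
        "rd": (word >> 7) & 0x1F,
        "funct3": (word >> 12) & 0x7,
        "rs1": (word >> 15) & 0x1F,
        "rs2": (word >> 20) & 0x1F,
        "funct7": (word >> 25) & 0x7F,
    }
    f3, f7 = fields["funct3"], fields["funct7"]
    if opcode == 0b0010111:
        name = "auipc"
    elif opcode == 0b1101111:
        name = "jal"
    elif opcode == 0b1100111 and f3 == 0:
        name = "jalr"
    elif opcode == 0b1100011 and f3 == 0:
        name = "beq"
    elif opcode == 0b0000011 and f3 == 2:
        name = "lw"
    elif opcode == 0b0100011 and f3 == 2:
        name = "sw"
    elif opcode == 0b0010011:
        name = I_ALU.get(f3, "unknown")
    elif opcode == 0b0110011:
        name = R_TYPE.get((f7, f3), "unknown")
    else:
        name = "unknown"
    fields["name"] = name
    return fields


for w in (0x00000317, 0x00830067, 0x0002A503, 0xFF5FF0EF, 0x00A2A223):
    print(f"{w:08x} -> {decode(w)['name']}")
```

In a Verilog design the same comparisons would typically become the control unit's case statement on `opcode`, `funct3`, and `funct7`.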



### Supplement: Instruction "auipc"

| Bits    | 31–12              | 11–7 | 6–0    |
|---------|--------------------|------|--------|
| Field   | imm[31:12]         | rd   | opcode |
| Width   | 20                 | 5    | 7      |
| Meaning | U-immediate[31:12] | dest | AUIPC  |

- Add upper immediate to PC, and store the result to rd
  - auipc rd, U-immediate
- Example: auipc x5, 1 (PC = 0x0001001c)
  - 0x0001001c + 0x00001000 = 0x0001101c
  - Store 0x0001101c in x5
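The arithmetic in the example above can be checked with a short Python sketch. The function name `auipc` and the 32-bit mask are illustrative choices, not part of the provided files.

```python
MASK32 = 0xFFFFFFFF


def auipc(pc, imm20):
    """Return PC + (imm20 << 12), wrapped to 32 bits (illustrative model)."""
    return (pc + ((imm20 & 0xFFFFF) << 12)) & MASK32


# Example from the slide: auipc x5, 1 with PC = 0x0001001c
print(f"0x{auipc(0x0001001C, 1):08x}")  # 0x0001101c
```

The lower 12 bits of the PC pass through unchanged, because the immediate only occupies bits 31 to 12.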



### Supplement: Instruction "mul"

| Bits    | 31–25  | 24–20      | 19–15        | 14–12          | 11–7 | 6–0    |
|---------|--------|------------|--------------|----------------|------|--------|
| Field   | funct7 | rs2        | rs1          | funct3         | rd   | opcode |
| Width   | 7      | 5          | 5            | 3              | 5    | 7      |
| Meaning | MULDIV | multiplier | multiplicand | MUL/MULH[[S]U] | dest | OP     |

- Not included in RV32I (it belongs to the M extension)
- Store the lower 32 bits of the result (rs1 × rs2) to rd
- Example: mul x10, x10, x6
  - x10 = 0x00000001, x6 = 0x00000002
  - 0x00000001 × 0x00000002 = 0x00000002
  - Store 0x00000002 in x10
- Your mulDiv can support this instruction!
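A minimal Python sketch of what "lower 32 bits" means for `mul`; the name `mul32` is hypothetical. The lower half of the product is the same whether the operands are read as signed or unsigned, so a simple mask is enough for this model.

```python
MASK32 = 0xFFFFFFFF


def mul32(rs1, rs2):
    """Lower 32 bits of rs1 * rs2, with both operands given as 32-bit values."""
    return ((rs1 & MASK32) * (rs2 & MASK32)) & MASK32


# Example from the slide: x10 = 1, x6 = 2
print(f"0x{mul32(0x00000001, 0x00000002):08x}")  # 0x00000002
# Overflow is silently truncated: 0x10000 * 0x10000 = 2**32
print(f"0x{mul32(0x00010000, 0x00010000):08x}")  # 0x00000000
```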



### **Multi-Cycle Operation**

- Once the CPU decodes a mul operation, issue valid to your mulDiv
- Once the CPU receives ready, store the lower 32 bits of the result to rd
- You might have to design an FSM in your CPU
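The valid/ready handshake described above can be sketched as a cycle-level Python model. Everything here is an assumption made for illustration: the class `MulDivModel`, the function `run_mul`, the state names, the one-cycle `valid` pulse, and the default latency of 32 cycles. The real timing depends on your own mulDiv from HW2.

```python
class MulDivModel:
    """Toy stand-in for a multi-cycle multiplier (latency is an assumption)."""

    def __init__(self, latency=32):
        self.latency = latency
        self.busy = False
        self.count = 0
        self.result = 0
        self.ready = False

    def tick(self, valid, a, b):
        """Advance one clock cycle; ready is high for one cycle when done."""
        self.ready = False
        if not self.busy:
            if valid:
                self.busy = True
                self.count = self.latency
                self.result = a * b  # full product, truncated by the CPU
        else:
            self.count -= 1
            if self.count == 0:
                self.busy = False
                self.ready = True


def run_mul(a, b, latency=32):
    """CPU-side two-state FSM: pulse valid once, then hold the PC until ready."""
    unit = MulDivModel(latency)
    state = "EXEC"
    cycles = 0
    rd = None
    while rd is None:
        valid = state == "EXEC"  # valid is asserted only in the issue cycle
        unit.tick(valid, a, b)
        cycles += 1
        if state == "EXEC":
            state = "WAIT"  # PC and register file are frozen in WAIT
        elif unit.ready:
            rd = unit.result & 0xFFFFFFFF  # write the lower 32 bits to rd
            state = "EXEC"
    return rd, cycles


print(run_mul(1, 2))  # (2, 33) with the assumed 32-cycle latency
```

The point of the sketch is the stall: the program counter must not advance and no register may be written until `ready` arrives, which is why a small FSM is usually needed in an otherwise single-cycle design.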





#### **Test Pattern 1: Leaf**

- Modified from lecture slides
- The procedure loads a,b,c,d from 0x10010000–0x1001000c, and stores the result to 0x10010010
- Run simulation:
  - \$ ncverilog Final\_tb.v +define+leaf +access+r

```
def leaf(a,b,c,d):
    f = (a^b) + (b^c) - (c^d)
    return f
```



| Address    | Value (+0) | Value (+4) | Value (+8) | Value (+c) | Value (+10) |
|------------|------------|------------|------------|------------|-------------|
| 0x10010000 | 6          | 9          | 14         | 4          | 12          |
| 0x10010020 | 0          | 0          | 0          | 0          |             |
| 0x10010040 | 0          | 0          | 0          | 0          |             |
| 0x10010060 | 0          | 0          | 0          | 0          |             |
| 0x10010080 | 0          | 0          | 0          | 0          |             |
| 0x100100a0 | 0          | 0          | 0          | 0          |             |
| 0x100100c0 | 0          | 0          | 0          | 0          |             |
| 0x100100e0 | 0          | 0          | 0          | 0          |             |
| 0x10010100 | 0          | 0          | 0          | 0          |             |
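To check expected results by hand when the inputs change, the leaf function can be modelled in Python with 32-bit wraparound. The mask and the wraparound behaviour are assumptions about how a 32-bit register would hold the result; they are not taken from leaf\_gen.py.

```python
MASK32 = 0xFFFFFFFF


def leaf(a, b, c, d):
    """f = (a^b) + (b^c) - (c^d), wrapped to a 32-bit word (assumed model)."""
    return ((a ^ b) + (b ^ c) - (c ^ d)) & MASK32


print(leaf(6, 9, 14, 4))   # inputs from the data segment above: 12
print(leaf(3, 9, 5, 17))   # inputs requested in the report: 2
```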



#### **Test Pattern 2: Perm**

- Modified from lecture slides
- The procedure loads n,r from 0x10010000–0x10010004, and stores the result to 0x10010008
- Run simulation:
  - \$ ncverilog Final\_tb.v +define+perm +access+r

```
def perm(n,r):
    if r < 1:
        return 1
    else:
        return n*perm(n-1,r-1)

.data
        n: .word 10
        r: .word 3
```

| Address      | Value (+0) | Value (+4) | Value (+8) |
|--------------|------------|------------|------------|
| 0x10010000   | 10         | 3          | 720        |
| 0x10010020   | 0          | 0          | 0          |
| 0x10010040   | 0          | 0          | 0          |
| 0x10010060   | 0          | 0          | 0          |
| 0x10010080   | 0          | 0          | 0          |
| 0x100100a0   | 0          | 0          | 0          |
| 0x100100c0   | 0          | 0          | 0          |
| 0x100100e0   | 0          | 0          | 0          |
| 0x10010100   | 0          | 0          | 0          |
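A runnable Python version of the perm procedure can serve as a reference when the inputs change. Truncating each product to 32 bits mirrors what the `mul` instruction stores in rd; that truncation is an assumption added here and is not taken from perm\_gen.py.

```python
MASK32 = 0xFFFFFFFF


def perm(n, r):
    """n * (n-1) * ... over r factors, each product kept to 32 bits (assumed model)."""
    if r < 1:
        return 1
    return (n * perm(n - 1, r - 1)) & MASK32


print(perm(10, 3))  # default inputs: 10 * 9 * 8 = 720
print(perm(8, 5))   # inputs requested in the report: 6720
```

Each recursive call in the assembly version needs its own stack frame for the return address and n, which is what exercises the stack memory in this pattern.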



### (Bonus) Test Pattern 3: (1/4)

Design your assembly first

$$T(n) = \begin{cases} 2 \times T\left(\left\lfloor \frac{3n}{4} \right\rfloor\right) + \lfloor 0.875n \rfloor - 137, & n \ge 10 \\ 2 \times T(n-1), & 1 \le n < 10 \\ 7, & n = 0 \end{cases}$$

- Example: T(11) = 3456, T(30) = 55489
- Use recursive function

```
# Todo: Define your own function

# Do NOT modify this part!!!
__start:
    la t0, n
    lw x10, 0(t0)
    jal x1,FUNCTION
    la t0, n
    sw x10, 4(t0)
    addi a0,x0,10
    ecall
```
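Before writing the assembly, it may help to have a software reference for the recurrence T(n). The Python sketch below is one such reference, not the required implementation. It computes the two floor terms with integer multiplies and right shifts, which relies on 0.875 being exactly 7/8; this is one possible reason the shift instructions are listed for the bonus.

```python
def T(n):
    """Reference model of the bonus recurrence, using only integer operations."""
    if n >= 10:
        three_quarters = (3 * n) >> 2   # floor(3n / 4)
        seven_eighths = (7 * n) >> 3    # floor(0.875 * n)
        return 2 * T(three_quarters) + seven_eighths - 137
    if n >= 1:
        return 2 * T(n - 1)
    return 7  # base case, n = 0


print(T(11))  # 3456
print(T(30))  # 55489
```

Both printed values match the examples given with the recurrence.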



### (Bonus) Test Pattern 3: (2/4)

Dump text memory file (Hexadecimal format)





### (Bonus) Test Pattern 3: (3/4)





### (Bonus) Test Pattern 3: (4/4)

- Save the binary file as: ./Verilog/bonus/bonus\_text.txt
  - Modify the text file: delete the last 2 instructions
- Test pattern generation: ./Verilog/bonus/bonus\_gen.py
- Run simulation:
  - \$ ncverilog Final\_tb.v +define+bonus +access+r

00000317 00830067 0fc10297 ff828293 0002a503 ff5ff0ef 0fc10297 fe828293 00a2a223

Delete the system call (the last two instructions).



#### **Pattern Generation**

- Three Python scripts are provided:
  - leaf\_gen.py
  - perm\_gen.py
  - bonus\_gen.py
- TA will change the variables in \*\_gen.py to generate new test patterns when testing your CPU design



### **Coding Style Check**

| Register Name | Type      | Width | Bus | MB | AR | AS | SR | SS | ST |
|---------------|-----------|-------|-----|----|----|----|----|----|----|
| alu_in_reg    | Flip-flop | 32    | Y   | N  | Y  | N  | N  | N  | N  |
| counter_reg   | Flip-flop | 5     | Y   | N  | Y  | N  | N  | N  | N  |
| shreg_reg     | Flip-flop | 64    | Y   | N  | Y  | N  | N  | N  | N  |
| state_reg     | Flip-flop | 2     | Y   | N  | Y  | N  | N  | N  | N  |

- All sequential elements must be flip-flops
- Check by Design Compiler
- Command:
  - \$ dv -no\_gui
  - design\_vision> read\_verilog CHIP.v
- Exit:
  - design\_vision> exit



### Report

- Briefly describe your CPU architecture
- Describe how you designed the data path for instructions not covered in the lecture slides (jal, jalr, auipc, ...)
- Describe how you handle multi-cycle instructions (mul)
- Record total simulation time (CYCLE = 10 ns)
  - Leaf: a = 3, b = 9, c = 5, d = 17
  - Perm: n = 8, r = 5
  - (Bonus: n = 11)

Example simulator output: Simulation complete via \$finish(1) at time 4795 NS + 0

- Describe your observation
- Snapshot the "Register table" in Design Compiler (see Coding Style Check)
- List a work distribution table



#### **Submission**

- Deadline: 12/26 (Mon.) 23:59
  - Late submission: 20 % reduction per day
- Upload Final\_group\_<group\_id>.zip to CEIBA
  - Final\_group\_<group\_id>.zip
    - Final\_group\_<group\_id>/
    - Final\_group\_<group\_id>/CHIP.v
    - (Final\_group\_<group\_id>/bonus.s)
    - (Final\_group\_<group\_id>/bonus\_text.txt)
    - Final\_group\_<group\_id>/report.pdf
  - Wrong format: 20 % reduction
- Example



#### Score

- Simulation: 70 % (+ bonus 20 %)
  - Leaf
    - Default: 15 %
    - Change test pattern: 15 %
  - Perm
    - Default: 20 %
    - Change test pattern: 20 %
  - Bonus
    - Default: 10 %
    - Change test pattern: 10 %
- Report: 30 %
  - Content: 20 %
  - Snapshots: 5 %
  - Work distribution: 5 %