### **Computer-Aided VLSI System Design**

# **Homework 1: Arithmetic Logic Unit**

Graduate Institute of Electronics Engineering, National Taiwan University



### Goal



- In this homework, you will learn
  - How to design ALU with simple operations
  - Differences between combinational circuit and sequential circuit
  - How to define registers and wires
  - How to read spec

### Introduction



- The Arithmetic logic unit (ALU) is one of the components of a computer processor
- In this homework, you are going to design an ALU with some special instructions, and use the ALU to compute input data to get the correct results

# **Block Diagram**





# Input/Output



| Signal Name | I/O | Width | Simple Description                                                                                                                                                                                                    |
|-------------|-----|-------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| i_clk       | I   | 1     | Clock signal in the system                                                                                                                                                                                            |
| i_rst_n     | 1   | 1     | Active <b>low</b> asynchronous reset                                                                                                                                                                                  |
| i_valid     | I   | 1     | The signal is <b>high</b> if input data is ready                                                                                                                                                                      |
| i_data_a    | I   | 10    | <ol> <li>For instruction 0000~0100, signed input data with 2's complement representation (4-bit integer + 6-bit fraction)</li> <li>For instruction 0101~1001, no fractional part (10-bit number)</li> </ol>           |
| i_data_b    | I   | 10    |                                                                                                                                                                                                                       |
| i_inst      | Ι   | 4     | Instruction for ALU to operate                                                                                                                                                                                        |
| o_valid     | 0   | 1     | Set <b>high</b> if ready to output result                                                                                                                                                                             |
| o_data      | 0   | 10    | <ol> <li>For instruction 0000~0100, result after ALU processing with 2's complement representation (4-bit integer + 6-bit fraction)</li> <li>For instruction 0101~1001, no fractional part (10-bit number)</li> </ol> |

# **Specification (1)**



- All inputs are synchronized with the negative clock edge
- All outputs should be synchronized at clock rising edge (Flipflops are added before outputs)
- Active low asynchronous reset is used and only once (You should set all your outputs to be zero when i\_rst\_n is low)



# **Specification (2)**



- The i\_valid will turn to high in one cycle for ALU to get i\_data\_a,
   i\_data\_b and i\_inst
- The i\_valid will pulled high in random
- Your o\_valid should be pulled high for only one cycle for every o\_data



# Specification (3)



- The testbed will get your output at negative clock edge to check the answer when your o\_valid is high
- You can raise your o\_valid at any moment.



### Instruction



| Operation                | i_inst [3:0] | Meaning                                                                                | Note                           |  |
|--------------------------|--------------|----------------------------------------------------------------------------------------|--------------------------------|--|
| Signed Addition          | 4'b0000      | o_data = i_data_a + i_data_b                                                           |                                |  |
| Signed<br>Subtraction    | 4'b0001      | o_data = i_data_a - i_data_b                                                           | Overflow may be happened       |  |
| Signed<br>Multiplication | 4'b0010      | o_data = i_data_a * i_data_b                                                           |                                |  |
| MAC                      | 4'b0011      | o_mult = i_data_a * i_data_b<br>o_data <sub>new</sub> = o_mult + o_data <sub>old</sub> |                                |  |
| Tanh                     | 4'b0100      | o_data = tanh(i_data_a)                                                                | Piecewise linear approximation |  |
| ORN                      | 4'b0101      | o_data = i_data_a   i_data_b'                                                          | Bit-wise operation             |  |
| CLZ                      | 4'b0110      | Count leading zero bits                                                                |                                |  |
| CTZ                      | 4'b0111      | Count trailing zero bits                                                               | Only i_data_a is<br>used       |  |
| СРОР                     | 4'b1000      | Count set bits                                                                         | useu                           |  |
| ROL                      | 4'b1001      | Rotate left                                                                            | View i_data_b as shift amount  |  |

# **Output Saturation & Rounding**



- For instructions 0000, 0001, 0010, and 0011, If the output value exceeds the maximum value of 10-bit representation, use the maximum value as output, and vice versa
- For instructions 0010, 0011 and 0100, the result needs to be rounded to the nearest number
  - Check reference [3]

#### **MAC**



- For instruction 0011, you have to implement Multiply accumulate operation (MAC) function
- The output of the Flip-flop must be rounded to the nearest number



### **Tanh Function**



- For instruction 0100, you need to implement a popular activation function, a hyperbolic tangent function. It becomes preferred over the sigmoid function as it gives better performance for multi-layer neural networks
- However, it's not intuitive to implement an exponential operation on hardware



# **Tanh Function Approximation**



- In order to simplify the implementation, we use piecewise linear approximation to compute tanh function
- We divide the curve into 5 segments to compute the output



# **Tanh Function Approximation**



- Choose the slope of the 2<sup>nd</sup>, 3<sup>rd</sup>, 4<sup>th</sup> segment appropriately
- The 5 segments have 4 intersections
  - -(-1.5, -1), (-0.5, -0.5), (0.5, 0.5), (1.5, 1)
- Output must be rounded to the nearest number



#### **CLZ**



- For instruction 0110, count leading zero
  - Count the number of zero bits from MSB end
- For example, if a = 8'b0010\_0000, then CLZ(a)=2

0010\_0000

MSB ---

#### CTZ



- For instruction 0111, count trailing zero
  - Count the number of zero bits from LSB end
- For example, if a = 8'b0010\_0000, then CTZ(a)=5

#### **CPOP**



- For instruction 1000, count the number of set bits
- For example, if a = 8'b0010\_0001, then CPOP(a)=2
- Note you have to avoid the combinational loop, where static timing analysis (STA) cannot be applied



An error example

#### ROL



- For instruction 1001, you need to implement rotate left and view i\_data\_b as the rotation amount
  - The range of i\_data\_b: 0~9
- In rotation, the shifted bits will be filled into the vacant position



Left rotate 1 bit

### alu.v



```
module alu #(
   parameter INT W = 4,
   parameter FRAC W = 6,
   parameter INST_W = 4,
   parameter DATA W = INT W + FRAC W
                     i_clk,
   input
        i_rst_n,
         i valid,
   input signed [DATA_W-1:0] i_data_a,
   input signed [DATA W-1:0] i data b,
   input [INST_W-1:0] i_inst,
   output o_valid,
   output [DATA W-1:0] o data
); // Do not modify
// Wires and Registers
reg [DATA_W-1:0] o_data_w, o_data_r;
reg o_valid_w, o_valid_r;
```

### alu.v



```
// ----
// Continuous Assignment
//
assign o_valid = o_valid_r;
assign o_data = o_data_r;
// ---- Add your own wire data assignments here if needed ---- //

// Combinational Blocks
// ---- Write your conbinational block design here ---- //
always@(*) begin
    o_data_w = ;
    o_valid_w = ;
end
```

### alu.v



```
//
// Sequential Block
//
---- Write your sequential block design here ---- //
always@(posedge i_clk or negedge i_rst_n) begin
    if (!i_rst_n) begin
        o_data_r <= 0;
        o_valid_r <= 0;
    end else begin
        o_data_r <= o_data_w;
        o_valid_r <= o_valid_w;
    end
end
end</pre>
```

### testbench.v



```
timescale 1ns/1ps
define CYCLE
                   10.0
define RST DELAY
define MAX CYCLE 100000
`ifdef I0
   `define Inst I "../00 TESTBED/pattern/INST0 I.dat"
   `define Inst_0 "../00 TESTBED/pattern/INST0 0.dat"
   `define PAT NUM 40
elsif I1
   `define Inst_I "../00_TESTBED/pattern/INST1_I.dat"
   `define Inst 0 "../00 TESTBED/pattern/INST1 0.dat"
   `define PAT NUM 40
elsif I2
   `define Inst_I "../00_TESTBED/pattern/INST2_I.dat"
   `define Inst 0 "../00 TESTBED/pattern/INST2 0.dat"
   `define PAT NUM 40
elsif I3
   `define Inst_I "../00_TESTBED/pattern/INST3_I.dat"
   `define Inst 0 "../00 TESTBED/pattern/INST3 0.dat"
   `define PAT NUM 40
elsif I4
   `define Inst_I "../00_TESTBED/pattern/INST4 I.dat"
   `define Inst 0 "../00 TESTBED/pattern/INST4 O.dat"
    `define PAT NUM 40
```

```
elsif I5
   `define Inst_I "../00_TESTBED/pattern/INST5_I.dat"
   `define Inst 0 "../00 TESTBED/pattern/INST5 0.dat"
   `define PAT NUM 40
elsif I6
   `define Inst I "../00 TESTBED/pattern/INST6 I.dat"
   `define Inst 0 "../00 TESTBED/pattern/INST6 0.dat"
   `define PAT NUM 40
elsif I7
   `define Inst_I "../00_TESTBED/pattern/INST7_I.dat"
   `define Inst 0 "../00 TESTBED/pattern/INST7 0.dat"
   `define PAT NUM 40
`elsif I8
   `define Inst I "../00 TESTBED/pattern/INST8 I.dat"
   `define Inst_0 "../00_TESTBED/pattern/INST8_0.dat"
   `define PAT NUM 40
`elsif I9
   `define Inst I "../00 TESTBED/pattern/INST9 I.dat"
   `define Inst 0 "../00 TESTBED/pattern/INST9 O.dat"
   `define PAT NUM 40
endif
```

### **Commands**



- ./01\_run arg1
   ncverilog -f rtl.f +define+<arg1> +access+rw
   For example:
   ./01\_run I0
- " ./99\_clean
  rm -rf \*.history \*.key \*.log
  rm -rf novas.rc novas.fsdb novas.conf
  rm -rf INCA libs nWaveLog BSSLib.lib++
- Note before you execute the shell script, change the permission of the file by

```
chmod +x 01_run
```

# **Pattern (Input Data)**



| i_inst | i_data_a   | i_data_b<br>I |
|--------|------------|---------------|
| 0000   | 1101001110 | 0100010101    |
| 0000   | 1010010000 | 1110011101    |
| 0000   | 0000110011 | 0101010100    |
| 0000   | 1011111100 | 1101111000    |
| 0000   | 1000101011 | 0011100010    |
| 0000   | 0011010101 | 1101111101    |
| 0000   | 1011101001 | 0100001000    |
|        |            |               |

# **Pattern (Golden Output)**



o\_data

# **Grading Policy (1)**



- Released pattern 70%
  - − 10~14: 40%
  - − 15~19: 30%
- Hidden pattern: 30%
  - Hidden pattern contains all instructions
    - I0~I4: 10000
    - I5~I9: 10000
  - Only if you pass all patterns will you get the full 30% score

# **Grading Policy (2)**



- No late submission
  - O point for this homework
- Lose 3 points for any wrong naming rule or format for submission
- No plagiarize

### **Submission**



 Create a folder named studentID\_hw1 and follow the hierarchy below

```
r11943123_hw1/

--- 01_RTL
--- alu.v
--- rtl.f
```

- Compress the folder studentID\_hw1 in a tar file named studentID\_hw1\_vk.tar (k is the number of version, k =1,2,...)
  - Note: Use lower case in your student ID.(Ex. r11943123\_hw1\_v1.tar)
- Submit to NTU Cool

### **Discussion**



#### = 電腦輔助積體電路系統設計 (EEE5022) > 討論 → [HW1]Discussion

| 111-2 |   |                                                          |
|-------|---|----------------------------------------------------------|
| 首頁    |   |                                                          |
| 課程資訊  |   | [HW1]Discussion                                          |
| 課程內容  |   | 所有班別                                                     |
| 公告    |   | HW1相關問題在此討論,並請以下列格式發問,方便助教按照每個問題回答                       |
| 作業    |   | 1. 問題一                                                   |
| 成績    |   | 2. 問題                                                    |
| 討論    |   |                                                          |
| 文件    | Ø | 另外,若需要截圖,請勿把自己的code截圖或code文字上傳,變成大家的參考答案,若違反將扣本次作業總分10分。 |
| 頁面    | Ø |                                                          |
| 成員    | Ø | [提醒]                                                     |
| 線上測驗  | Ø | 1                                                        |
| 設定    |   | 2                                                        |
|       |   | 3                                                        |
|       |   |                                                          |
|       |   | 祝同學們學習順心                                                 |
|       |   | TA                                                       |

### References



- [1] Reference for 2'complement:
  - https://en.wikipedia.org/wiki/Two%27s\_complement
- [2] Reference for fixed-point representation
  - Fixed-Point Representation: The Q Format and Addition
     Examples
- [3] Reference for rounding to the nearest:
  - Rounding MATLAB & Simulink (mathworks.com)