## Final Project (xml: NoC\_FP.xml)

### Part I. Architecture and Algorithm Choice

| Network topology  | 4x4 Torus mesh                     |
|-------------------|------------------------------------|
| Routing Algorithm | X-Y                                |
| Switching         | Wormhole switching                 |
| Buffer Size       | 1 (the "out_flit" slot serves as the buffer) |
| Virtual Channel   | No/1 Channel per input port        |
| Flow Control      | ACK-REQ protocol                   |



Fig.1 4x4 Torus mesh

To simplify the implementation, I chose the X-Y routing algorithm, which guarantees that no deadlock or livelock occurs while the system is running. For switching, I use wormhole switching, which allows only flits from the same packet to occupy a channel. The flit format is shown below. The first (header) flit of a packet carries BoP [33], source id [31:28], destination id [27:24], Is\_parameter [23], Is\_bias [22], and OP\_id [21:18]. Body flits begin with two zero bits, followed by the data value. The tail flit carries EoP [32] and the last data value.

| Flit   | [33] | [32] | [31:28]     | [27:24] | [23]         | [22]    | [21:18] | [17:0] |
|--------|------|------|-------------|---------|--------------|---------|---------|--------|
| Header | BoP  | EoP  | Src         | Dst     | Is_parameter | Is_bias | OP id   | 0      |
| Body   | BoP  | EoP  | Data [31:0] |         |              |         |         |        |
| Tail   | BoP  | EoP  | Data [31:0] |         |              |         |         |        |

Fig2. Flit format
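The header layout in Fig2 can be sketched as plain bit packing. This is only an illustration, not the project's code: the names `HeaderFields`, `pack_header`, and `unpack_header` are hypothetical, the 34-bit flit is held in a `uint64_t` instead of `sc_lv<34>`, and the header's EoP bit is assumed to be 0.

```cpp
#include <cassert>
#include <cstdint>

// Header fields and their bit positions, taken from Fig2.
struct HeaderFields {
    unsigned src;           // [31:28]
    unsigned dst;           // [27:24]
    bool     is_parameter;  // [23]
    bool     is_bias;       // [22]
    unsigned op_id;         // [21:18]
};

constexpr std::uint64_t kBoP = 1ULL << 33;  // begin-of-packet flag

// Hypothetical helper: build a header flit (BoP = 1, EoP assumed 0, bits [17:0] zero).
inline std::uint64_t pack_header(const HeaderFields& h) {
    assert(h.src < 16 && h.dst < 16 && h.op_id < 16);
    return kBoP
         | (static_cast<std::uint64_t>(h.src) << 28)
         | (static_cast<std::uint64_t>(h.dst) << 24)
         | (static_cast<std::uint64_t>(h.is_parameter) << 23)
         | (static_cast<std::uint64_t>(h.is_bias) << 22)
         | (static_cast<std::uint64_t>(h.op_id) << 18);
}

// Hypothetical helper: recover the fields from a header flit.
inline HeaderFields unpack_header(std::uint64_t flit) {
    assert((flit & kBoP) != 0);  // only header flits carry these fields
    return HeaderFields{
        static_cast<unsigned>((flit >> 28) & 0xF),
        static_cast<unsigned>((flit >> 24) & 0xF),
        ((flit >> 23) & 1) != 0,
        ((flit >> 22) & 1) != 0,
        static_cast<unsigned>((flit >> 18) & 0xF)};
}
```

For example, a weight packet from the controller (id 0) to PE\_1 with Op\_id 1 would be encoded as `pack_header({0, 1, true, false, 1})`.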

#### A. Controller

The Controller reads the parameters and data from ROM and sends them to their assigned destinations. It first sends the weights and biases for each conv and linear layer. For weights, the "Is\_parameter" bit in the header is high to indicate that the packet carries parameters, while "Is\_bias" is low. For biases, "Is\_bias" is high instead.

All weights and biases are transmitted to the cores; to be more precise, they are stored in the PEs.

After sending all the parameters, the Controller sends the image to the first PE (PE\_1) to compute Conv1+MaxPooling1. From then on, data propagation is governed by the PEs themselves, because each PE knows the next destination for its output feature map.

Finally, the Controller receives the output feature map from linear3 (fc8), applies the softmax function to calculate the probability of each class, and prints the results on screen.
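The softmax and ranking step can be sketched as follows. This is a minimal illustration under assumptions, not the Controller's actual code: the function names `softmax` and `top_k` are hypothetical, and the max-subtraction trick is a common numerical-stability choice that the report does not mention.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Numerically stable softmax: subtract the largest logit before exponentiating.
inline std::vector<double> softmax(const std::vector<double>& logits) {
    assert(!logits.empty());
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;
    return probs;
}

// Indices of the k largest probabilities, highest first.
inline std::vector<std::size_t> top_k(const std::vector<double>& probs, std::size_t k) {
    std::vector<std::size_t> idx(probs.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    k = std::min(k, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                      [&](std::size_t a, std::size_t b) { return probs[a] > probs[b]; });
    idx.resize(k);
    return idx;
}
```

Calling `top_k(softmax(logits), 5)` on the fc8 output would give the class indices for a Top-5 report.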

#### B. PE design

Each core contains a PE. When the core objects are initialized, each PE is initialized as well and is told which layer and functions it will execute. The attributes of the layer, such as stride and kernel size, are also set at that point. The listing below gives each PE id and its corresponding operations. It also shows how the Vgg-16 network is partitioned and how data propagates during inference. Note that PE\_0 is replaced by the controller.

```
// PE_v2:
// 0: Controller
// 1: Conv1 + ReLU + Max_pooling1
// 2: Conv2 + ReLU + Max_pooling2
// 3: Conv3 + ReLU
// 7: Conv4 + ReLU
// 6: Conv5 + ReLU + Max_pooling3
// 5: Linear1 + ReLU
// 4: Linear2 + ReLU
// 12: Linear3
// 0: + Softmax + Sort
```

Fig3. PE and the corresponding operations

| PE_0 (controller) → | PE_1 → | PE_2 → | PE_3  |
|---------------------|--------|--------|-------|
| PE_4 ←              | PE_5 ← | PE_6 ← | PE_7  |
| PE_8                | PE_9   | PE_10  | PE_11 |
| PE_12               | PE_13  | PE_14  | PE_15 |

Fig4. PE id and their place (Orange block means some ops will be executed there.)

When the PEs are initialized, each PE also learns which op will be executed in its next destination PE. When it sends its output feature map, the PE sets OP\_id in the packet header to indicate the op to be executed in the next PE. The table below lists the Op\_id values and the corresponding operations.

| Op_id (in decimal, header bits [21:18]) | Op type              |
|-----------------------------------------|----------------------|
| 1                                       | Conv+ReLU+MaxPooling |
| 2                                       | Conv+ReLU            |
| 3                                       | Linear+ReLU          |
| 4                                       | Linear               |
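Combining the Op\_id table with the PE assignment in Fig3, the per-PE forwarding rule can be sketched as a lookup. The names `OpId`, `NextHop`, and `next_hop` are hypothetical, and the value 0 for "no op" on the final hop back to the controller is an assumption, since the report does not say what Op\_id that packet carries.

```cpp
#include <cassert>
#include <cstdint>

// Op_id values from the table above; kNone = 0 is an assumed placeholder.
enum class OpId : std::uint8_t {
    kNone            = 0,
    kConvReluMaxPool = 1,
    kConvRelu        = 2,
    kLinearRelu      = 3,
    kLinear          = 4,
};

struct NextHop {
    int  dst_pe;   // PE id that receives the output feature map
    OpId next_op;  // Op_id written into the outgoing header
};

// Hypothetical lookup derived from Fig3: the successor of each PE and the op it runs.
inline NextHop next_hop(int pe_id) {
    switch (pe_id) {
        case 1:  return {2,  OpId::kConvReluMaxPool};  // Conv2 + ReLU + Max_pooling2
        case 2:  return {3,  OpId::kConvRelu};         // Conv3 + ReLU
        case 3:  return {7,  OpId::kConvRelu};         // Conv4 + ReLU
        case 7:  return {6,  OpId::kConvReluMaxPool};  // Conv5 + ReLU + Max_pooling3
        case 6:  return {5,  OpId::kLinearRelu};       // Linear1 + ReLU
        case 5:  return {4,  OpId::kLinearRelu};       // Linear2 + ReLU
        case 4:  return {12, OpId::kLinear};           // Linear3
        case 12: return {0,  OpId::kNone};             // back to the controller
        default:
            assert(false && "PE has no assigned layer");
            return {-1, OpId::kNone};
    }
}
```

Following `next_hop` from PE\_1 visits the PEs in the order 1, 2, 3, 7, 6, 5, 4, 12 and ends at the controller.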

#### C. Core design:

Each core contains a PE and an NI. The core constantly checks whether there is a packet that needs to be sent. After fetching a packet, I split out the source id, destination id, data, and op\_id, reformat them into flits, and store the flits in a "std::queue<sc\_lv <34>>". The flits are then sent to the router using ACK-REQ (ack\_tx, req\_tx) handshaking.
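The packet-to-flit step can be sketched as below. It is only an illustration under assumptions: `packetize` is a hypothetical name, flits are stored as `uint64_t` rather than `sc_lv<34>`, the header fields are assumed to be already encoded in bits [31:0], and the header's EoP and the tail's BoP are assumed to be 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

constexpr std::uint64_t kBoPBit = 1ULL << 33;  // begin-of-packet, bit [33]
constexpr std::uint64_t kEoPBit = 1ULL << 32;  // end-of-packet, bit [32]

// Hypothetical helper: turn one packet into a header flit, body flits, and a tail flit.
inline std::queue<std::uint64_t> packetize(std::uint32_t header_bits,
                                           const std::vector<std::uint32_t>& data) {
    assert(!data.empty());  // the tail flit must carry the last data value
    std::queue<std::uint64_t> flits;
    flits.push(kBoPBit | header_bits);              // header: BoP = 1
    for (std::size_t i = 0; i + 1 < data.size(); ++i) {
        flits.push(data[i]);                        // body: BoP = 0, EoP = 0
    }
    flits.push(kEoPBit | data.back());              // tail: EoP = 1
    return flits;
}
```

The sender would then pop one flit from the queue each time a req\_tx/ack\_tx handshake completes.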

Conversely, when "req\_rx" is asserted, meaning the router wants to send a flit to this core, the core asserts "ack\_rx" and prepares to reassemble the packet.

After collecting all the flits received from the router, the core calls the "check\_packet()" method to send the packet to the PE.
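The receive side can be sketched as a small collector that gathers flits until the tail arrives. `Packet` and `PacketAssembler` are hypothetical names, and the sketch assumes a header flit has BoP = 1 and a tail flit has EoP = 1; it does not reproduce the project's actual "check packet()" logic.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Packet {
    std::uint32_t header = 0;         // header fields, bits [31:0]
    std::vector<std::uint32_t> data;  // payload words in arrival order
};

// Hypothetical collector: push() returns true when a tail flit completes the packet.
class PacketAssembler {
public:
    bool push(std::uint64_t flit) {
        const bool bop = ((flit >> 33) & 1) != 0;
        const bool eop = ((flit >> 32) & 1) != 0;
        const std::uint32_t low = static_cast<std::uint32_t>(flit & 0xFFFFFFFFULL);
        if (bop) {                    // header flit starts a new packet
            packet_ = Packet{};
            packet_.header = low;
            open_ = true;
            return false;
        }
        assert(open_ && "body/tail flit arrived without a header");
        packet_.data.push_back(low);
        if (eop) {                    // tail flit closes the packet
            open_ = false;
            return true;
        }
        return false;
    }

    const Packet& packet() const { return packet_; }

private:
    Packet packet_;
    bool open_ = false;
};
```

Once `push` returns true, the completed `packet()` is what would be handed to the PE.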

#### D. Router design:

The router has 5 input ports, each with only a 1-slot buffer. I separate the router into 2 parts: input port control and output port control.

In the input port control, a routing unit determines the destination of the incoming flit. Once the destination is known, the input port locks the corresponding output port and restricts it to a single user (here, the packet). Transmission then starts, and each received flit is taken into the input buffer (handshake at the input gate). After the handshake at the input gate, the flit passes to its output gate and waits for the next node to handshake with it. At the input gate, the previous node keeps passing new flits until the tail flit has been passed.
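The output-port locking that wormhole switching relies on can be sketched as simple ownership bookkeeping. This is an assumed model, not the router's actual code: `OutputLocks` and its methods are hypothetical names, and the 5 ports are assumed to be four neighbours plus the local core.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr int kNumPorts = 5;  // assumption: four neighbours plus the local core
constexpr int kFree = -1;     // marks an output port with no owner

// Hypothetical bookkeeping: each output port is owned by at most one input port.
class OutputLocks {
public:
    OutputLocks() { owner_.fill(kFree); }

    // Called when the header flit at `in_port` has been routed to `out_port`.
    bool try_lock(int in_port, int out_port) {
        assert(valid(in_port) && valid(out_port));
        int& owner = owner_[static_cast<std::size_t>(out_port)];
        if (owner != kFree) return false;  // another packet holds the port
        owner = in_port;
        return true;
    }

    // Only the owning input port may forward flits to this output port.
    bool may_send(int in_port, int out_port) const {
        assert(valid(in_port) && valid(out_port));
        return owner_[static_cast<std::size_t>(out_port)] == in_port;
    }

    // Called after the tail flit has been handed to the next node.
    void release(int in_port, int out_port) {
        assert(may_send(in_port, out_port));
        owner_[static_cast<std::size_t>(out_port)] = kFree;
    }

private:
    static bool valid(int port) { return port >= 0 && port < kNumPorts; }
    std::array<int, kNumPorts> owner_;
};
```

A second packet that requests a locked output port simply has to wait until the first packet's tail flit releases it.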

In the output port control, the router needs to check whether the current flit is valid data. After a handshake at the output gate, the output gate must wait for the next flit to arrive from the input gate. Therefore, the "out\_req" signal has to be controlled accordingly.

The routing unit implements the X-Y routing algorithm and calculates the next hop in each router using the router's own id. When a router gets the destination id from a header flit, it can therefore calculate the next hop relative to the current router id.
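A minimal sketch of an X-Y routing decision on a 4x4 torus is shown below. It rests on several assumptions not stated in the report: router id = row * 4 + column (as laid out in Fig4), the shorter wraparound direction is preferred, ties on the ring go in the positive direction, and the port names are hypothetical.

```cpp
#include <cassert>

enum class Port { Local, East, West, North, South };

constexpr int kDim = 4;  // 4x4 torus

// Direction (+1, -1, or 0) of the shortest path from `from` to `to` on a ring of kDim nodes.
// Assumption: a tie (distance 2 on a ring of 4) is resolved in the positive direction.
inline int ring_step(int from, int to) {
    const int forward = ((to - from) % kDim + kDim) % kDim;  // hops in the positive direction
    if (forward == 0) return 0;
    return (forward <= kDim / 2) ? +1 : -1;
}

// X-Y routing: correct the X (column) offset first, then the Y (row) offset.
inline Port xy_route(int current_id, int dst_id) {
    assert(current_id >= 0 && current_id < kDim * kDim);
    assert(dst_id >= 0 && dst_id < kDim * kDim);
    const int cx = current_id % kDim;
    const int cy = current_id / kDim;
    const int dx = dst_id % kDim;
    const int dy = dst_id / kDim;

    const int step_x = ring_step(cx, dx);
    if (step_x != 0) return step_x > 0 ? Port::East : Port::West;

    const int step_y = ring_step(cy, dy);
    if (step_y != 0) return step_y > 0 ? Port::South : Port::North;

    return Port::Local;  // the flit has arrived
}
```

Under these assumptions, a flit at router 12 addressed to the controller (id 0) leaves through the South port and reaches row 0 over the wraparound link in one hop.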

### Part II. Simulation results (on PA)

Top-5 Results

| Index | Val       | Probability | ClassName    |
|-------|-----------|-------------|--------------|
| 285   | 20.206686 | 96.381480%  | Egyptian cat |
| 281   | 16.136835 | 1.646189%   | tabby        |
| 282   | 15.733852 | 1.100187%   | tiger cat    |
| 287   | 14.790856 | 0.428478%   | lynx         |
| 728   | 14.411864 | 0.293315%   | plastic bag  |

Fig4. account: mlchip007, cat

Fig5. account: mlchip007, dog

### Part III. Challenges and Observations

#### A. Test function by unit

Because loading the weights, biases, and image takes a long time, I split the design into two parts during implementation: one loads all weights, biases, and the image; the other runs inference on the image. Because weights and biases are stored in the corresponding PE, I wrote a separate "pe\_load.h", which also declares the PE class and defines how the PE works, to trace the processing and transmission of weights and biases. It directly reads the weights/biases as golden\_weights and golden\_bias. After a PE receives its weights and biases from ROM, it checks them against golden\_weights and golden\_bias. If there is a mismatch, it prints a fatal message on the screen.
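The golden comparison described above can be sketched as follows. The function name `count_mismatches`, the optional tolerance parameter, and the message format are assumptions for illustration; the actual check in "pe\_load.h" may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical check: count entries where the received values differ from the golden copy.
inline std::size_t count_mismatches(const std::vector<float>& received,
                                    const std::vector<float>& golden,
                                    float tolerance = 0.0f) {
    assert(tolerance >= 0.0f);
    if (received.size() != golden.size()) {
        std::fprintf(stderr, "FATAL: size mismatch (%zu vs %zu)\n",
                     received.size(), golden.size());
        // Treat every entry of the longer vector as an error.
        return received.size() > golden.size() ? received.size() : golden.size();
    }
    std::size_t errors = 0;
    for (std::size_t i = 0; i < golden.size(); ++i) {
        if (std::fabs(received[i] - golden[i]) > tolerance) {
            std::fprintf(stderr, "FATAL: index %zu: got %f, expected %f\n",
                         i, received[i], golden[i]);
            ++errors;
        }
    }
    return errors;
}
```

A return value of 0 means the parameters that travelled through the NoC match the golden copy exactly (or within the given tolerance).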

To preload all parameters, please follow Fig6, Fig7, and Fig8.

Fig6. In "pe\_load.h", modify lines 273~302 to access the data.

```
1 #ifndef CORE_H
2 #define CORE_H
3 #define SC_INCLUDE_FX
4
5 #include "systemc.h"
6 //#include "pe.h"
7 #include "pe_load.h"
8
```

Fig7. In "core.h", comment out line 6 and uncomment line 7.

Fig8. In "controller.h", modify line 267 as above (original: current\_step will be 1)

| File                | Description                                                                                                                                                                  |
|---------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| data folder         | Includes the weight, bias, and input image matrices                                                                                                                          |
| run                 | Executable files for the SystemC program                                                                                                                                     |
| controller.h        | Implements the controller                                                                                                                                                    |
| core.h              | Implements the core and NI                                                                                                                                                   |
| pe.h                | Implements the PE                                                                                                                                                            |
| pe_load.h           | Debug use, for loading parameters                                                                                                                                            |
| router.v            | Implements the router                                                                                                                                                        |
| Makefile            | Makefile script for compiling the SystemC program                                                                                                                            |
| main.cpp            | Declares the main function, creates the module instances, and maps the signals. It includes all operation units, pattern module, clockreset modules, and monitor module.    |
| ROM.h/ROM.o/ROM.cpp | ROM stores the image and parameters                                                                                                                                          |
| clockreset.h        | Declares the clock module and reset modules                                                                                                                                  |
| clockreset.cpp      | Implements the clock module and reset modules                                                                                                                                |
| NoC_FP.xml          | PA xml file                                                                                                                                                                  |

