## High Level Synthesis of a trained CNN for handwritten digit recognition

Federico Serafini federico.serafini@studenti.unipr.it

Embedded Systems Università degli Studi di Parma

12/07/2022

#### Outline

- Introduction
- SW Implementation
- High Level Synthesis
- Results and Verification
- Conclusions

### Convolutional Neural Network (CNN)

Neural network with a special architecture that is able to detect spatial structures (features) of the input:

- widely used in image-recognition problems;
- highly-parallelizable algorithm.
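
To make "detecting spatial structures" concrete, here is a minimal sketch in plain C of a single convolution filter followed by ReLU. The sizes and names (`ROWS`, `COLS`, `K`, `conv2d_relu`) are hypothetical and kept tiny for illustration; they are not taken from the project. Every output pixel depends only on its own input window, which is what makes the algorithm easy to parallelize.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 4   /* hypothetical image height */
#define COLS 4   /* hypothetical image width  */
#define K    3   /* hypothetical kernel size  */

/* ReLU activation: negative responses are clamped to zero. */
float relu(float x) { return x > 0.0f ? x : 0.0f; }

/*
 * "Valid" 2D convolution (no padding, stride 1) followed by ReLU.
 * As in most CNN frameworks, the kernel is not flipped
 * (strictly speaking this is a cross-correlation).
 */
void conv2d_relu(float img[ROWS][COLS],
                 float kernel[K][K],
                 float out[ROWS - K + 1][COLS - K + 1])
{
    assert(K <= ROWS && K <= COLS);
    for (size_t r = 0; r + K <= ROWS; ++r) {
        for (size_t c = 0; c + K <= COLS; ++c) {
            float acc = 0.0f;
            /* Weighted sum over the K x K window anchored at (r, c). */
            for (size_t i = 0; i < K; ++i)
                for (size_t j = 0; j < K; ++j)
                    acc += img[r + i][c + j] * kernel[i][j];
            out[r][c] = relu(acc);
        }
    }
}
```

With a kernel whose rows are `{-1, 0, 1}`, an image that steps from dark to bright along the columns yields positive responses, while the mirrored kernel yields responses that ReLU clamps to zero: the filter reacts to one specific spatial pattern.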



#### Basic concepts

#### A single and ambitious objective

Outperform the C implementation through HW parallelism!

#### Workflow

- Python:
  - model definition, training and evaluation;
  - export of (trained) network weights and architecture.
- C: replication of the network.
- Vitis HLS:
  - naive approach (high level synthesis of basic C);
  - stream and dataflow approach.
- Verification.

#### Model definition and evaluation in Python







#### Network replication in C

```
void cnn
(
    float img_in     [IMG_ROWS][IMG_COLS],
    float prediction [DIGITS]
)
{
    /******** Normalization and padding. ********/
    float pad_img [PAD_IMG_ROWS][PAD_IMG_COLS] = { 0 };
    normalization_and_padding(img_in, pad_img);

    /******** Convolution layer. ********/
    float features [FILTERS][IMG_ROWS][IMG_COLS] = { 0 };
    // Convolution with relu as activation function.
    convolutional_layer(pad_img, features);

    /******** Max-pooling layer. ********/
    float pool_features [FILTERS][POOL_IMG_ROWS][POOL_IMG_COLS] = { 0 };
    max_pooling_layer(features, pool_features);

    /******** Flatten layer. ********/
    float flat_array [FLAT_SIZE] = { 0 };
    flattening_layer(pool_features, flat_array);

    /******** Dense layer. ********/
    dense_layer(flat_array, prediction);
}
```
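
As an illustration of two of the stages above, the following sketch shows one possible shape for `max_pooling_layer` and `flattening_layer`. It is not the project's code: the 2x2 window with stride 2 (`POOL_SIZE`), the tiny 4x4 feature maps and the filter-major flattening order are all assumptions. In a real replication the flattening order must match the order in which the dense-layer weights were exported.

```c
#include <assert.h>
#include <stddef.h>

#define FILTERS        8   /* value taken from the slides */
#define IMG_ROWS       4   /* hypothetical, kept tiny */
#define IMG_COLS       4   /* hypothetical, kept tiny */
#define POOL_SIZE      2   /* assumed 2x2 window, stride 2 */
#define POOL_IMG_ROWS  (IMG_ROWS / POOL_SIZE)
#define POOL_IMG_COLS  (IMG_COLS / POOL_SIZE)
#define FLAT_SIZE      (FILTERS * POOL_IMG_ROWS * POOL_IMG_COLS)

/* Keep the largest value of each non-overlapping POOL_SIZE x POOL_SIZE window. */
void max_pooling_layer(float features[FILTERS][IMG_ROWS][IMG_COLS],
                       float pool_features[FILTERS][POOL_IMG_ROWS][POOL_IMG_COLS])
{
    for (size_t f = 0; f < FILTERS; ++f)
        for (size_t r = 0; r < POOL_IMG_ROWS; ++r)
            for (size_t c = 0; c < POOL_IMG_COLS; ++c) {
                float best = features[f][r * POOL_SIZE][c * POOL_SIZE];
                for (size_t i = 0; i < POOL_SIZE; ++i)
                    for (size_t j = 0; j < POOL_SIZE; ++j) {
                        float v = features[f][r * POOL_SIZE + i][c * POOL_SIZE + j];
                        if (v > best)
                            best = v;
                    }
                pool_features[f][r][c] = best;
            }
}

/* Copy the pooled maps into a 1D array (assumed order: filter, row, column). */
void flattening_layer(float pool_features[FILTERS][POOL_IMG_ROWS][POOL_IMG_COLS],
                      float flat_array[FLAT_SIZE])
{
    size_t k = 0;
    for (size_t f = 0; f < FILTERS; ++f)
        for (size_t r = 0; r < POOL_IMG_ROWS; ++r)
            for (size_t c = 0; c < POOL_IMG_COLS; ++c)
                flat_array[k++] = pool_features[f][r][c];
    assert(k == FLAT_SIZE);
}
```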

#### Naive approach: latency estimation



#### Vitis HLS Stream <sup>1</sup>

#### Streaming data transfer

- A FIFO queue (ap\_fifo interface).
- Data samples are sent in sequential order starting from the first (no address management is required).

#### Array implemented as FIFO interface

- The array must be either only read or only written, which allows a point-to-point connection.
- The program must follow FIFO semantics (no random accesses).
- If a stream is used to transfer data between tasks, consider a dataflow region where data streams from one task to the next.
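
The FIFO discipline described above can be modelled in plain C as follows. This is only a software model under stated assumptions, not the Vitis HLS API: in Vitis HLS the same role is played by the `hls::stream` class or by an array port mapped to an `ap_fifo` interface. The names `fifo_t`, `FIFO_DEPTH`, `scale_task` and `sum_task` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_DEPTH 16  /* hypothetical depth */

typedef struct {
    float  data[FIFO_DEPTH];
    size_t head;   /* index of the next sample to read */
    size_t count;  /* number of samples currently stored */
} fifo_t;

int fifo_empty(const fifo_t *f) { return f->count == 0; }
int fifo_full(const fifo_t *f)  { return f->count == FIFO_DEPTH; }

/* Append one sample at the tail: the writer never chooses an address. */
void fifo_write(fifo_t *f, float v)
{
    assert(!fifo_full(f));
    f->data[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
}

/* Remove the oldest sample: the reader gets data in arrival order only. */
float fifo_read(fifo_t *f)
{
    assert(!fifo_empty(f));
    float v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return v;
}

/* Producer task: it only writes to its output stream. */
void scale_task(const float *in, size_t n, float gain, fifo_t *out)
{
    for (size_t i = 0; i < n; ++i)
        fifo_write(out, in[i] * gain);
}

/* Consumer task: it only reads, and needs no random access to sum. */
float sum_task(fifo_t *in, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += fifo_read(in);
    return acc;
}
```

Because each task touches the stream from one side only, the two tasks form the point-to-point, producer-to-consumer connection that a dataflow region relies on.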

<sup>1</sup> Vitis HLS User Guide

`#define FILTERS 8`

#### Stream-dataflow approach: code structure

```
void cnn
(
    float img_in     [IMG_ROWS][IMG_COLS],
    float prediction [DIGITS]
)
{
    /******** Pre-processing the img_in. ********/

    // Normalization and padding.

    /******** Clone the normalized and padded image. ********/

    /*
     * Clone the normalized and padded image in order to
     * have an image for each parallel execution (for each filter).
     */

    /******** Parallel executions start here. ********/

    /*
     * Dataflow section with streams used to transfer data between tasks:
     * -convolution_layer;
     * -max_pooling_layer;
     * -flattening_layer;
     * -dense_layer;
     * -dense_layer_softmax.
     */
}
```
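
The cloning step exists because a streamed array has a single consumer: eight parallel filter chains cannot all read the same FIFO. A rough plain-C sketch of the idea follows, with arrays standing in for streams. The tiny image size and the names `clone_image`, `filter_chain_stub` and `run_all_filters` are hypothetical, and the stub is only a placeholder for the real convolution, pooling and flattening chain.

```c
#include <assert.h>
#include <stddef.h>

#define FILTERS       8  /* value taken from the slides */
#define PAD_IMG_ROWS  4  /* hypothetical, kept tiny */
#define PAD_IMG_COLS  4  /* hypothetical, kept tiny */

/*
 * Read the source image exactly once, in row-major (streaming) order,
 * and write every sample to each per-filter copy.
 */
void clone_image(float pad_img[PAD_IMG_ROWS][PAD_IMG_COLS],
                 float clones[FILTERS][PAD_IMG_ROWS][PAD_IMG_COLS])
{
    for (size_t r = 0; r < PAD_IMG_ROWS; ++r)
        for (size_t c = 0; c < PAD_IMG_COLS; ++c) {
            float px = pad_img[r][c];
            for (size_t f = 0; f < FILTERS; ++f)
                clones[f][r][c] = px;
        }
}

/* Placeholder for one per-filter chain: it reads only its own copy. */
float filter_chain_stub(float img[PAD_IMG_ROWS][PAD_IMG_COLS], float weight)
{
    float acc = 0.0f;
    for (size_t r = 0; r < PAD_IMG_ROWS; ++r)
        for (size_t c = 0; c < PAD_IMG_COLS; ++c)
            acc += img[r][c] * weight;
    return acc;
}

/*
 * The iterations of the loop over filters share no data, so a tool
 * such as Vitis HLS is free to run them concurrently.
 */
void run_all_filters(float pad_img[PAD_IMG_ROWS][PAD_IMG_COLS],
                     float weights[FILTERS],
                     float out[FILTERS])
{
    float clones[FILTERS][PAD_IMG_ROWS][PAD_IMG_COLS];
    clone_image(pad_img, clones);
    for (size_t f = 0; f < FILTERS; ++f)
        out[f] = filter_chain_stub(clones[f], weights[f]);
}
```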

#### Dataflow view



#### Stream-dataflow approach: latency estimation



#### Verification

#### Co-simulation

| Total predictions   | 10   |
|---------------------|------|
| Correct predictions | 100% |

|                  | Min  | Avg  | Max  |
|------------------|------|------|------|
| Latency (cycles) | 5975 | 5975 | 5975 |

#### Export RTL with Vivado synthesis and place and route

|           | BRAM | DSP | FF    | LUT   |
|-----------|------|-----|-------|-------|
| Vitis HLS | 384  | 143 | 47201 | 37585 |
| Vivado    | 224  | 143 | 38791 | 26753 |

|                   | Required | Post-synth | Post-impl |
|-------------------|----------|------------|-----------|
| Clock period (ns) | 10       | 8.123      | 9.157     |
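
As a rough reading of the figures above, the co-simulated latency can be converted into time, assuming the cycle count of 5975 holds when the design runs at the stated clock period:

$$
t = N_{\text{cycles}} \cdot T_{\text{clk}} = 5975 \times 10\,\text{ns} = 59.75\,\mu\text{s}
$$

If the design were instead clocked at the post-implementation period (an assumption, since the required period is 10 ns), the same calculation gives

$$
t \approx 5975 \times 9.157\,\text{ns} \approx 54.71\,\mu\text{s}, \qquad f_{\max} = \frac{1}{9.157\,\text{ns}} \approx 109.2\,\text{MHz}
$$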

#### Future work

- Smarter SW algorithm → smarter HW accelerator.
- Fixed-point arithmetic → reduced area.
- Vitis HLS syntax constructs and libraries → better synthesis.
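
To give an idea of what the fixed-point item could look like, here is a minimal plain-C sketch of a multiply and a dot product in a hypothetical Q8.8 format (`fix_t`, `FRAC_BITS` and the helper names are assumptions, not project code). Vitis HLS provides arbitrary-precision `ap_fixed` types in C++ for this purpose; the sketch does not use them, and it ignores overflow and saturation.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 8                /* hypothetical Q8.8: 8 integer, 8 fractional bits */
#define ONE       (1 << FRAC_BITS)

typedef int16_t fix_t;

/* Convert a float to Q8.8, rounding to the nearest representable value. */
fix_t to_fix(float x)
{
    return (fix_t)(x * ONE + (x >= 0.0f ? 0.5f : -0.5f));
}

float to_float(fix_t x) { return (float)x / ONE; }

/*
 * Multiply two Q8.8 values: the 32-bit product is Q16.16, so shift
 * back by FRAC_BITS. Right-shifting a negative value is
 * implementation-defined in C; mainstream compilers shift arithmetically.
 */
fix_t fix_mul(fix_t a, fix_t b)
{
    int32_t wide = (int32_t)a * (int32_t)b;
    return (fix_t)(wide >> FRAC_BITS);
}

/* Dot product with a 32-bit accumulator, rescaled once at the end. */
fix_t fix_dot(const fix_t *x, const fix_t *w, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (int32_t)x[i] * (int32_t)w[i];
    return (fix_t)(acc >> FRAC_BITS);
}
```

Integer multipliers and adders of this width are typically much cheaper in FPGA resources than single-precision floating-point units, which is where the expected area reduction comes from; the price is a precision loss that would have to be checked against the network's accuracy.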

#### Hardware or software implementation?

Further investigation is needed:

- apply improvements listed above;
- consider application domain requirements (for example the available HW and timing constraints);
- consider also a co-design (HW and SW);
- choose the cheapest solution satisfying the requirements.

# Thank you for your attention.