

# Università di Pisa

Computer Engineering

Electronic and Communication Systems

# Perceptron

Project Report

TEAM MEMBERS: Olgerti Xhanej

Academic Year: 2020/2021

# Contents

- 1 Introduction
  - 1.1 Problem Description
  - 1.2 Applications
  - 1.3 Possible Architectures
- 2 Architecture
  - 2.1 Multiplication Circuit Architecture
  - 2.2 Adder Circuit Architecture
  - 2.3 Activation Function Circuit Architecture
- 3 VHDL Code
  - 3.1 Modules List
  - 3.2 Perceptron
  - 3.3 Parallel Multiplier
  - 3.4 Unsigned Parallel Multiplier
  - 3.5 Tree Adder
    - 3.5.1 Ripple Carry Adder Pipelined
  - 3.6 LUT
    - 3.6.1 Lut generation code
- 4 Test Plan
  - 4.1 System Estimation Test
    - 4.1.1 Estimation Test 1
    - 4.1.2 Estimation Test 2
    - 4.1.3 Estimation Test 3
  - 4.2 System Aimed Test
- 5 XILINX VIVADO Report
  - 5.1 RTL Analysis
  - 5.2 Timing Report
  - 5.3 Resource Utilization Report
  - 5.4 Power Consumption Report
  - 5.5 Warning Messages
- 6 Conclusion

# 1 - Introduction

#### 1.1 Problem Description

The goal of the activity described in this report is to design a network implementing a **perceptron** with a **sigmoid activation** function.

Before describing the design and implementation process, a brief introduction to the architecture is needed.



Figure 1: Perceptron Architecture

A **Perceptron** is a binary classifier that maps its inputs to an output y = f(z), where f is the **activation function** of the perceptron. The inputs are real numbers, and the input z of the activation function is obtained as:

$$z = b + \sum_{i=0}^{N_L - 1} w_i x_i \tag{1}$$

Every input  $x_i$ , every weight  $w_i$  and the bias b are real numbers in the range of [-1, 1].

The activation function, in our case, will be a sigmoid function, described as follows:

$$y = \frac{1}{1 + e^{-z}} \tag{2}$$


Figure 2: Sigmoid Function Plot

Where z is the result of equation (1).
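Equations (1) and (2) can be sketched as a small floating-point reference model. This is only an illustrative sketch: the function name `perceptron` and the example values are assumptions, not part of the project code.

```python
import math

def perceptron(x, w, b):
    """Floating-point reference: z = b + sum(w_i * x_i), y = 1 / (1 + e^-z)."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))  # equation (1)
    y = 1.0 / (1.0 + math.exp(-z))                # equation (2)
    return z, y

# Hypothetical example: ten inputs at 0.5, ten weights at 1, zero bias.
z, y = perceptron([0.5] * 10, [1.0] * 10, 0.0)
print(z, y)
```

A model of this kind is useful as a golden reference when checking the quantized hardware output.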

#### 1.2 Applications

A single perceptron is the building block of *artificial neural networks*, in which different layers of perceptrons are connected. The output of the neural network is a real number and can be used to classify *complex objects*: patterns, human faces, handwriting, medical diagnoses, spam e-mails and so on.



Figure 3: Neural network example

In the image above there is a simple schema of a neural network, in which the circles represent the perceptrons.

#### 1.3 Possible Architectures

The architecture is made up of three main logical parts, from a higher-level point of view:

- Multiplication Circuit: implementation of the multiplication operation between each input  $x_i$  and each weight  $w_i$ .
- Adder Circuit: implementation of the addition between the results of the former phase and the bias b.
- Activation Function Circuit: implementation of the computation of the sigmoid function.

In the next chapter the architecture will be documented with more precision. Different project choices could be made for each logical part of the architecture:

- Multiplication Circuit: it could be implemented through a ROM-based solution in which every possible result is stored and the two inputs form the address used to fetch the result. This solution is good only with a very low number of bits, which is not our case: the ROM would be composed of $2^{(n_{x_i}+n_{w_i})}$ memory cells (where $n_{x_i}$ and $n_{w_i}$ are the numbers of bits of $x_i$ and $w_i$). For this reason the multiplication circuit will be implemented through a Parallel Multiplier, with some additional logic to handle the signed inputs.
- Adder Circuit: different choices could be made to implement the adder circuit. From the simplest to the most complex solution, we can use the Serial Adder, the Parallel Adder or the Parallel Adder with Pipeline. The first one needs less logic but requires n clock cycles to compute an n-bit result. The second one improves on the first by computing one result in one clock cycle; on the other hand, it can cause timing problems due to the long logic chains between two registers. The third solution offers the best trade-off between the number of clock cycles required and the critical path: adding some registers along the carry chain shortens the logic chains, at the cost of a few more clock cycles of latency.
- Activation Function Circuit: as seen during the laboratory class, this part will be implemented by exploiting a Look-Up Table. To do so, a **truncation** of the result of the former computation may be necessary in order to limit the size of the LUT. With 12 bits, $2^{12} = 4096$ entries are necessary, which can be reduced further by exploiting the symmetry of the sigmoid function. For further details see the next chapter.
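The storage figures quoted in the list above can be reproduced with a few lines of arithmetic. The helper names below are hypothetical, and the 8-bit and 9-bit operand widths are the ones given in section 2.1.

```python
def rom_cells(n_x_bits, n_w_bits):
    """Cells needed by a ROM addressed by both operands of a multiplication."""
    return 2 ** (n_x_bits + n_w_bits)

def lut_entries(address_bits, use_symmetry=False):
    """Entries of the sigmoid LUT, optionally halved thanks to its symmetry."""
    entries = 2 ** address_bits
    return entries // 2 if use_symmetry else entries

print(rom_cells(8, 9))                       # ROM-based multiplier
print(lut_entries(12))                       # full 12-bit LUT
print(lut_entries(12, use_symmetry=True))    # LUT after the symmetry optimization
```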

# 2 - Architecture

This chapter discusses in depth the architecture of the three main parts of the **Perceptron**. The general structure can be summarized by the following schema:



Figure 4: General Schema

#### 2.1 Multiplication Circuit Architecture

The Multiplication Circuit, as said before, will be implemented through a Parallel Multiplier. The inputs $x_i$ and $w_i$ are composed of $b_x = 8$ bits and $b_w = 9$ bits respectively. In order to compute the multiplication correctly, the inputs need to be converted to their **unsigned form**; it is then possible to perform the multiplication with the parallel multiplier. The following image presents the general schema of the Parallel Multiplier:



Figure 5: Parallel Multiplier Architecture

Notice that the sign of the result is computed by a simple XOR operation between the signs of the inputs. The **Unsigned Parallel Multiplier** architecture is the following:



Figure 6: Unsigned Parallel Multiplier Architecture

Each block of the schema above corresponds to one of the following logic blocks:



Figure 7: Unsigned Parallel Multiplier Architecture
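The sign handling described in this section (take the magnitudes, multiply them as unsigned numbers, then restore the sign with an XOR) can be sketched behaviourally as follows. The function names are hypothetical and the sketch works on integer bit patterns rather than on signals; it is not the VHDL implementation.

```python
def to_signed(pattern, nbits):
    """Interpret an nbits-wide two's complement bit pattern as an integer."""
    return pattern - (1 << nbits) if pattern & (1 << (nbits - 1)) else pattern

def signed_multiply(a_pattern, b_pattern, nbits_a=8, nbits_b=9):
    """Multiply two two's complement patterns through their magnitudes."""
    a_sign = (a_pattern >> (nbits_a - 1)) & 1
    b_sign = (b_pattern >> (nbits_b - 1)) & 1
    a_mag = abs(to_signed(a_pattern, nbits_a))
    b_mag = abs(to_signed(b_pattern, nbits_b))
    p_mag = a_mag * b_mag                    # role of the unsigned multiplier
    mask = (1 << (nbits_a + nbits_b)) - 1
    if a_sign ^ b_sign:                      # signs differ: negate the product
        return ((~p_mag) + 1) & mask
    return p_mag & mask

# -3 on 8 bits is 0xFD, 5 on 9 bits is 0x005
print(to_signed(signed_multiply(0xFD, 0x005), 17))
```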

#### 2.2 Adder Circuit Architecture

In order to compute equation (1), several sums need to be computed. The building block of this part is the **Parallel Adder with Pipeline**: as said before, adding some registers along the carry chain reduces the impact of the critical path. Furthermore, by exploiting the parallel architecture, a single sum can be computed in a single clock cycle. The next figure presents the Parallel Adder:



Figure 8: Parallel Adder Architecture

To obtain an output after driving an input, one has to wait $\left\lfloor \frac{N}{N_{pipeline}} \right\rfloor$ clock cycles, where N represents the number of bits of a or b and $N_{pipeline}$ represents the maximum number of consecutive FAs without a register in between.
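As a reminder of how the carry chain works, the following sketch models a chain of full adders at bit level. It describes only the combinational behaviour: the pipeline registers affect the latency, not the result, and are not modelled. The function names are hypothetical.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a, b, nbits, cin=0):
    """Add two nbits-wide unsigned integers one bit at a time."""
    carry = cin
    result = 0
    for i in range(nbits):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry

print(ripple_carry_add(5, 9, 4))
```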

To implement the whole sum of 11 terms, a tree approach has been chosen in order to decrease both the number of cycles needed to compute the sum and the number of bits needed. The schema of the tree parallel adder is the following:



Figure 9: Tree Parallel Adder Architecture

Note that some extensions and left shifts (e.g. for b) are not represented.

Some registers have been placed between the sums to limit the impact of the critical path on the performance and on the clock period. To obtain a valid output after driving an input, the latency of each Ripple Carry Adder along the path, computed as seen before, must be added to 3 (the maximum number of consecutive registers in the previous architecture).
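The idea of a tree reduction can be sketched as follows. This is a generic pairwise scheme under the assumption that terms are summed two at a time, level by level; the actual grouping used in Figure 9 may differ, and the function name is hypothetical.

```python
def tree_sum(terms):
    """Sum a list pairwise, level by level; return (total, number_of_levels)."""
    levels = 0
    while len(terms) > 1:
        # Add neighbouring pairs; an odd term out is passed to the next level.
        paired = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            paired.append(terms[-1])
        terms = paired
        levels += 1
    return terms[0], levels

# Eleven terms: ten products plus the bias.
print(tree_sum(list(range(11))))
```

With this scheme eleven terms need four levels. Since each two's complement addition can grow the result by one bit, 17-bit products would grow to 21 bits after four levels, which is consistent with the 21-bit output reported in section 2.3, although the report does not state that the width was derived this way.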

#### 2.3 Activation Function Circuit Architecture

At the end of the latter phase the output is composed of 21 bits. The sigmoid function is computed through a **Look-Up Table**, which would need $2^{21} = 2097152$ entries of 16 bits each. In order to reduce the size of the Look-Up Table a truncation is needed: from 21 bits to 12 bits. In this case the Look-Up Table is composed of $2^{12} = 4096$ entries but, by exploiting the odd symmetry of the sigmoid about the point (0, 0.5), only 4096/2 = 2048 entries are needed.



Figure 10: Look-Up Table Architecture
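The symmetry used to halve the table follows directly from equation (2). A short derivation, added here as a reminder, is:

$$f(-z) = \frac{1}{1 + e^{z}} = \frac{e^{-z}}{1 + e^{-z}} = 1 - \frac{1}{1 + e^{-z}} = 1 - f(z)$$

The value for a negative z can therefore be obtained from the entry stored for |z| by subtracting it from the full-scale output.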

All things considered, following the calculation shown before, 26 **clock cycles** are needed to obtain a correct output after driving an input.

# 3 - VHDL CODE

This chapter presents the main modules that compose the architecture of the **Perceptron with sigmoid activation function**. **Sigasi Studio** was used for the VHDL part and **Visual Studio Code** for the Python part.

First of all, as written in the project specifications, each input shall be represented in standard 2's complement notation: for this reason the inputs are treated in the architecture as simple std\_logic\_vectors (as will be shown in the following paragraphs). By exploiting 2's complement arithmetic and, in particular, fixed-point representations for real numbers (which de facto coincide with 2's complement), it is possible to compute the sums and multiplications described in the previous chapter.
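A fixed-point encoding of this kind can be sketched as follows. The scaling is an assumption: the sketch maps +1 to the largest positive code, in line with the LSB formula of section 3.6, while the exact scaling of the inputs is not stated at this point of the report. The function names are hypothetical.

```python
def encode(value, nbits):
    """Quantize a real in [-1, 1] to an nbits two's complement bit pattern."""
    lsb = 1.0 / (2 ** (nbits - 1) - 1)   # assumed weight of one LSB
    code = round(value / lsb)
    return code & ((1 << nbits) - 1)     # keep only nbits bits

def decode(pattern, nbits):
    """Inverse of encode: bit pattern back to a real number."""
    lsb = 1.0 / (2 ** (nbits - 1) - 1)
    if pattern & (1 << (nbits - 1)):     # MSB set: negative value
        pattern -= 1 << nbits
    return pattern * lsb

print(hex(encode(1.0, 8)), hex(encode(-1.0, 8)))
```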

#### 3.1 Modules List

The modules follow the approach presented in the previous chapter. The following modules were created:

- Perceptron
  - Parallel\_Multiplier
    - Unsigned\_Parallel\_Multiplier
      - Full Adder
      - Half Adder
  - Tree\_Adder
    - Ripple\_Carry\_Adder\_Pipelined
      - DFF
      - Full Adder
  - Sigmoid\_Lut\_2048

A **bottom-up strategy** was followed to build the architecture: starting from the modules that make up the architecture, a testbench was written after finishing each of them in order to test each building block of the **Perceptron** (see next chapter for details).

#### 3.2 Perceptron

This is the main hardware description of the architecture. This module connects all the other modules in order to create the complete architecture. To avoid showing too many lines of code, only the entity definition and the process that handles the LUT are shown.

```
entity Perceptron is
  port(
    -- x_1 to x_10 inputs of the perceptron with 8 bits
    x_1: in std_logic_vector(7 downto 0);
    ...
    x_10: in std_logic_vector(7 downto 0);

    -- w_1 to w_10 inputs of the perceptron with 9 bits
    w_1: in std_logic_vector(8 downto 0);
    ...
    w_10: in std_logic_vector(8 downto 0);

    -- b input of the perceptron with 9 bits
    b: in std_logic_vector(8 downto 0);

    clk: in std_logic;
    rst: in std_logic;

    -- output of the perceptron 16 bits
    f_z: out std_logic_vector(15 downto 0)
  );
end Perceptron;

architecture rtl of Perceptron is
  ...
begin

  d_process: process(clk, rst)
  begin
    -- If z, the candidate input of the sigmoid function, is negative,
    -- then its complement is passed.
    if (z_in(20) = '1') then
      z_in_lut <= std_logic_vector(unsigned(not(z_in)) + 1);
    else
      z_in_lut <= z_in;
    end if;

    -- On the output side, if the candidate input was negative,
    -- the output is complemented with the highest possible
    -- number in the lut in order to mirror it
    if (z_in(20) = '1') then
      f_z <= std_logic_vector(32766 - unsigned(f_z_todo));
    else
      f_z <= std_logic_vector(unsigned(f_z_todo));
    end if;
  end process d_process;

end rtl;
```

The rest of this module instantiates and links the various submodules that make up the **Perceptron** module. The bottom of the previous code snippet shows how the optimization of the LUT is implemented.
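The mirroring logic of the snippet can be modelled behaviourally in a few lines. This is a sketch, not the project code: `sigmoid_code` is a hypothetical name, the four-entry table holds toy values chosen only for illustration, and the clamp on the address is an assumption that does not appear in the VHDL.

```python
MIRROR_CONSTANT = 32766  # constant subtracted from in the VHDL snippet

def sigmoid_code(z_code, half_lut):
    """Look up the quantized sigmoid of a signed, already truncated z."""
    if z_code < 0:
        # Assumption: clamp so that the most negative code stays in range.
        address = min(-z_code, len(half_lut) - 1)
        return MIRROR_CONSTANT - half_lut[address]
    return half_lut[z_code]

# Toy half table (hypothetical values), indexed by |z|.
half_lut = [16384, 24576, 30000, 32766]
print(sigmoid_code(1, half_lut), sigmoid_code(-1, half_lut))
```

Only the non-negative half of the curve is stored; the negative half is derived at lookup time, which is what halves the table size.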

#### 3.3 Parallel Multiplier

This module converts the inputs, which are signed with a 2's complement representation, into their unsigned representation, feeds them to the **Unsigned Parallel Multiplier** module and then converts the product back to the signed representation. The general architecture is shown in Figure 5.

```
entity Parallel_Multiplier is
  generic (
    Nbit_a : positive;
    Nbit_b : positive
  );
  port(
    a_p_signed: in std_logic_vector(Nbit_a - 1 downto 0);
    b_p_signed: in std_logic_vector(Nbit_b - 1 downto 0);
    -- The product will need Nbit_a + Nbit_b bits
    p_signed: out std_logic_vector(Nbit_a + Nbit_b - 1 downto 0)
  );
end entity Parallel_Multiplier;

architecture rtl of Parallel_Multiplier is

  -- Building blocks of the Parallel Multiplier
  component Unsigned_Parallel_Multiplier
    generic(
      Nbit_a : positive;
      Nbit_b : positive
    );
    port(
      a_p: in std_logic_vector(Nbit_a - 1 downto 0);
      b_p: in std_logic_vector(Nbit_b - 1 downto 0);
      p  : out std_logic_vector(Nbit_a + Nbit_b - 1 downto 0)
    );
  end component Unsigned_Parallel_Multiplier;

  -- Unsigned signals (will work with the unsigned parallel multiplier)
  signal p_unsigned: std_logic_vector(Nbit_a + Nbit_b - 1 downto 0);
  signal a_p_unsigned: std_logic_vector(Nbit_a - 1 downto 0);
  signal b_p_unsigned: std_logic_vector(Nbit_b - 1 downto 0);

  -- Will carry the sign bit of the signed representation of the inputs
  signal a_sign: std_logic;
  signal b_sign: std_logic;

begin

  -- Compute the unsigned representation from the signed one
  a_p_unsigned <= std_logic_vector(abs(signed(a_p_signed)));
  b_p_unsigned <= std_logic_vector(abs(signed(b_p_signed)));

  -- 2's complement representation, the result sign is
  -- computed through the xor op. between a and b
  p_signed <= std_logic_vector(unsigned(not(p_unsigned)) + 1)
              when ((a_sign xor b_sign) = '1') else p_unsigned;

  -- Getting the sign from a and b (the MSB of the C2 representation)
  a_sign <= a_p_signed(Nbit_a - 1);
  b_sign <= b_p_signed(Nbit_b - 1);

  unsigned_parallel_mul: Unsigned_Parallel_Multiplier
    generic map(
      Nbit_a => Nbit_a,
      Nbit_b => Nbit_b
    )
    port map(
      a_p => a_p_unsigned,
      b_p => b_p_unsigned,
      p   => p_unsigned
    );

end architecture rtl;
```

#### 3.4 Unsigned Parallel Multiplier

This module actually computes the multiplication. By replicating the architecture shown in Figures 6 and 7, it returns the *unsigned product* of two *unsigned operands*. In order to implement the architecture in the simplest way, a **structural approach** has been followed in the description. Some lines of code are omitted in order to show only a general schema; for further details refer to the source code.

```
entity Unsigned_Parallel_Multiplier is
  generic (
    Nbit_a : positive;
    Nbit_b : positive
  );
  port(
    -- Unsigned representation of inputs
    a_p: in std_logic_vector(Nbit_a - 1 downto 0);
    b_p: in std_logic_vector(Nbit_b - 1 downto 0);

    -- p = a_p * b_p
    p: out std_logic_vector(Nbit_a + Nbit_b - 1 downto 0)
  );
end entity Unsigned_Parallel_Multiplier;

architecture rtl of Unsigned_Parallel_Multiplier is

  -- Building blocks of the Unsigned Parallel Multiplier
  component FULL_ADDER is
    ...
  end component;

  component HALF_ADDER is
    ...
  end component;

  -- Will hold the carry signals among the whole architecture
  signal carry_signal: std_logic_vector((Nbit_a - 1)*(Nbit_b - 1) - 1 downto 0);
  signal last_carry_signal: std_logic_vector((Nbit_b - 1) downto 0);

  -- Will hold the sum result of the FA and HA among the whole architecture
  signal sum_signal: std_logic_vector((Nbit_a - 1)*(Nbit_b - 2) - 1 downto 0);

  -- Will hold the precomputed values for the inputs a and b
  -- of the various Half Adder and Full Adder
  signal a_multiplier: std_logic_vector(Nbit_a + Nbit_b - 2 downto 0);
  signal b_multiplier: std_logic_vector((Nbit_a - 1)*(Nbit_b - 1) - 1 downto 0);

begin

  -- First bit of the result
  p(0) <= (a_p(0) and b_p(0));

  -- Computation of the various inputs of each HA and FA
  d_process: process(a_p, b_p)
  begin
    for j in 1 to Nbit_b loop
      a_multiplier(j - 1) <= (a_p(0) and b_p(Nbit_b - j));
    end loop;
    ...
  end process d_process;

  -- Architecture will follow schema of the Parallel Multiplier
  -- Row index i
  GEN_a: for i in 1 to Nbit_a generate
    -- Column index j
    GEN_b: for j in 1 to Nbit_b - 1 generate
      FIRST_ROW: if i = 1 generate
        -- In the first Row only HA
        LEFT: if j < Nbit_b - 1 generate
          ROW1_LEFT: HALF_ADDER
            port map(
              a    => a_multiplier(j - 1),
              b    => b_multiplier(j - 1),
              s    => sum_signal(j - 1),
              cout => carry_signal(j - 1)
            );
        end generate LEFT;
        RIGHT: if j = Nbit_b - 1 generate
          ROW1_RIGHT: HALF_ADDER
            port map(
              a    => a_multiplier(j - 1),
              b    => b_multiplier(j - 1),
              s    => p(1), -- Result bit
              cout => carry_signal(j - 1)
            );
        end generate RIGHT;
      end generate FIRST_ROW;

      INTERNAL_ROW: if i > 1 and i < Nbit_a generate
        ...
      end generate INTERNAL_ROW;

      LAST_ROW: if i = Nbit_a generate
        ...
      end generate LAST_ROW;
    end generate GEN_b;
  end generate GEN_a;
end architecture rtl;
```
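The principle behind the array of half and full adders can be sketched behaviourally: every partial product is the AND of one bit of each operand, weighted by the sum of the two bit positions. The sketch below uses ordinary integer addition where the hardware uses the HA/FA array, and the function name is hypothetical.

```python
def unsigned_array_multiply(a, b, nbits_a, nbits_b):
    """Multiply two unsigned integers by summing AND-gate partial products."""
    product = 0
    for i in range(nbits_a):
        for j in range(nbits_b):
            partial = ((a >> i) & 1) & ((b >> j) & 1)  # one AND gate
            product += partial << (i + j)              # weight 2^(i + j)
    return product

print(unsigned_array_multiply(13, 21, 4, 5))
```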

#### 3.5 Tree Adder

This module takes the ten multiplication results and the bias and sums every term, performing the computation shown in equation (1). Again, the architectural code is not shown, since it consists only of linking some submodules in the proper way in order to replicate the architecture shown in Figure 9.

```
entity Tree_Adder is
  port(
    -- Inputs: result of the multiplication of xi*wi
    in_1: in std_logic_vector(16 downto 0);
    ...
    in_10: in std_logic_vector(16 downto 0);

    -- Bias input
    b: in std_logic_vector(8 downto 0);
    clk: in std_logic;
    rst: in std_logic;

    -- Output
    z: out std_logic_vector(20 downto 0)
  );
end Tree_Adder;
```

#### 3.5.1 Ripple Carry Adder Pipelined

This module is the main building block of the **Tree Adder** module. In order to shorten the logic chain, some registers were added by exploiting the **DFF** module seen in the Lab lectures.

```
entity Ripple_Carry_Adder_Pipelined is
  generic (Nbit: positive);
  port(
    -- Inputs
    a_r: in std_logic_vector(Nbit-2 downto 0);
    b_r: in std_logic_vector(Nbit-2 downto 0);
    cin_r: in std_logic;
    cout_r: out std_logic;

    -- Will store the result of a_r+b_r
    s_r: out std_logic_vector(Nbit-1 downto 0);
    clk: in std_logic;
    rst: in std_logic
  );
end Ripple_Carry_Adder_Pipelined;

architecture rtl of Ripple_Carry_Adder_Pipelined is
  -- Building blocks of the Ripple Carry Adder Pipelined
  component FULL_ADDER
    ...
  end component FULL_ADDER;

  -- Need of a register to obtain the pipelined version
  component DFF
    ...
  end component DFF;

  -- Will propagate the carry signal among the whole architecture
  signal carry_signal: std_logic_vector(Nbit-1 downto 1);

  -- Will store the output signals of the registers
  signal dff_signal: std_logic_vector(Nbit-1 downto 0) := (others => '0');

begin
  -- Implemented in a structural way, in a similar fashion as seen in the Lab lessons
  GEN: for i in 1 to Nbit generate
    FIRST: if i = 1 generate
      -- First FA
      FFI: FULL_ADDER port map (a_r(0), b_r(0), cin_r, s_r(0), carry_signal(1));
    end generate FIRST;

    INTERNAL: if i > 1 and i < Nbit generate
      -- Need of Register detection
      PIPE: if (i mod 3 = 0) generate
        DFF_I: DFF
          port map(
            d_dff      => carry_signal(i-1),
            clk_dff    => clk,
            resetn_dff => rst,
            q_dff      => dff_signal(i-1)
          );
        FFI: FULL_ADDER port map (a_r(i-1), b_r(i-1), dff_signal(i-1), s_r(i-1), carry_signal(i));
      end generate PIPE;

      -- No need of a register
      NOT_PIPE: if (i mod 3 /= 0) generate
        FFI: FULL_ADDER port map (a_r(i-1), b_r(i-1), carry_signal(i-1), s_r(i-1), carry_signal(i));
      end generate NOT_PIPE;
    end generate INTERNAL;

    -- Implicit extension (the inputs have Nbit-2 bits, the output has Nbit-1 bits and there
    -- are Nbit-1 FA so the last bit is replicated in order to make the extension in the
    -- correct way in C2 representation)
    LAST: if i = Nbit generate
      FFI: FULL_ADDER port map (a_r(Nbit-2), b_r(Nbit-2), carry_signal(Nbit-1), s_r(Nbit-1), cout_r);
    end generate LAST;
  end generate GEN;
end rtl;
```
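The placement of the registers follows from the two generate conditions (`i > 1 and i < Nbit`, and `i mod 3 = 0`). The sketch below lists the stages that receive a register for a given width; the function name and the example widths are hypothetical, since the values of `Nbit` used by the Tree Adder are not shown in the report.

```python
def pipeline_register_stages(nbit, period=3):
    """Stages i (1-based) whose carry input is registered, as in the generate loop."""
    # INTERNAL covers 1 < i < nbit; PIPE selects the multiples of the period.
    return [i for i in range(2, nbit) if i % period == 0]

print(pipeline_register_stages(10))
print(pipeline_register_stages(18))
```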

#### 3.6 LUT

This module stores every possible output of the sigmoid function for the unsigned input range [0; +31] with 12 bits of precision (the working range is [-11; +11], but since additional logic implements the LUT optimization, an unsigned representation is used) and for the output range [0; 1] with 16 bits of precision. To do so a quantization is needed: a real number cannot, in general, be represented exactly with a finite number of bits. To determine the quantized values of the input and of the output, the weight of the LSB in each range must be calculated. Through the **Reach the LSB** method seen during the Lab lectures, the LSB, with N bits of precision, can be calculated as:

$$LSB = \frac{\max x}{2^{N-1} - 1} \quad or \quad \frac{|\min x|}{2^{N-1}} \tag{3}$$

So, in our case:

$$LSB(in) = \frac{31}{2^{11} - 1} = 0.015144113 \tag{4}$$

$$LSB(out) = \frac{1}{2^{15} - 1} = 3.051850947e - 5 \tag{5}$$

The Look-Up Table stores the values $round(f(x \cdot LSB(in)) / LSB(out))$ for x in [0; 2047].

The input will be treated as an address signal to obtain the correct output value in the same fashion as shown in the Laboratory lectures.
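The LSB values of equations (4) and (5) can be reproduced with the first form of equation (3). The helper names below are hypothetical.

```python
def lsb_from_max(max_value, nbits):
    """Weight of one LSB so that max_value maps to the largest positive code."""
    return max_value / (2 ** (nbits - 1) - 1)

def quantize(value, lsb):
    """Integer code that best approximates value with the given LSB."""
    return round(value / lsb)

lsb_in = lsb_from_max(31, 12)    # equation (4)
lsb_out = lsb_from_max(1, 16)    # equation (5)
print(lsb_in, lsb_out, quantize(1.0, lsb_out))
```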

```
entity sigmoid_lut_2048 is
  port (
    address : in std_logic_vector(10 downto 0);
    dds_out : out std_logic_vector(15 downto 0)
  );
end sigmoid_lut_2048;

architecture rtl of sigmoid_lut_2048 is
  type LUT_t is array (natural range 0 to 2047) of integer;
  constant LUT: LUT_t := (
    0 => 16384,
    1 => 16428,
    ...
    2046 => 32766,
    2047 => 32766
  );

begin
  dds_out <= std_logic_vector(TO_SIGNED(LUT(TO_INTEGER(unsigned(address))),16));
end rtl;
```

#### 3.6.1 Lut generation code

Obviously the whole LUT was not compiled by hand. The look-up table outputs were generated through the following Python script, which uses the LSB values for the 12-bit and 16-bit resolutions computed before.

```
import math

# Calculate lsb of x (12 bits) and f(x) (16 bits)
lsb_out = 1 / (2**15 - 1)
lsb_in = 31 / (2**11 - 1)

result = ""

for x in range(0, 2048):
    f_x = 1 / (1 + math.exp(-(x * lsb_in)))
    lut = round(f_x / lsb_out)

    # Generate lut entries for every x
    result += str(x) + " => " + str(lut) + ",\n"

print(result)
```

# 4 - Test Plan

In order to verify the correctness of the system the following tests were made:

1. **Unit Tests**: following the **bottom-up** strategy, each sub-module (Parallel Multiplier, Ripple Carry Adder Pipelined...) has, once implemented, a dedicated testbench that checks its correctness in isolation. Since these tests are **trivial** (they just check whether the sum or the product of some numbers is correct), they are not shown in this documentation.
2. **System Estimation Test**: in this phase, some testbenches were written with particular inputs. The aim of this test is only to check whether the result resembles the sigmoid curve when the inputs are increased over time.
3. **System Aimed Test**: after checking with the previous test that the system is *likely* correct, a Python script computes the expected outputs for different inputs in the range considered, and an additional testbench checks whether the outputs of the system are equal to them.

#### 4.1 System Estimation Test

Even before checking the correctness of the output of the system for a given input, three "estimation tests" were made. The aim of these tests is just to obtain a sigmoid curve by fixing each $x_i$ and $w_i$ and varying the bias b in the range [-1; +1].

#### 4.1.1 Estimation Test 1

In this case we have the following inputs:

$$x_1 \dots x_{10} = 0, \quad w_1 \dots w_{10} = 0 \tag{6}$$

The bias b varies over the whole range [-1, +1]. Recalling the sigmoid function curve, we expect to obtain a curve with an odd symmetry and an almost "linear" behaviour, since we are considering the values of the curve near zero:

Figure 11: Sigmoid Function Plot



The output of the system is the following:

Figure 12: Output of the System 1



The curve is replicated several times because the bias b "turns back" when it reaches its maximum. By comparing the two figures, we can state that the **first estimation test is passed**.
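The expected shape for this test can be reproduced with a short floating-point sweep. This is only a sketch of the expectation, not of the testbench: with all $x_i$ and $w_i$ at zero, z reduces to the bias b. The function names and the number of steps are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sweep_bias(steps=9):
    """Expected output while b sweeps [-1, +1] with x_i = w_i = 0 (so z = b)."""
    biases = [-1.0 + 2.0 * k / (steps - 1) for k in range(steps)]
    return [(b, sigmoid(b)) for b in biases]

for b, y in sweep_bias():
    print(f"b = {b:+.2f} -> y = {y:.4f}")
```

Over this range the output stays roughly between 0.27 and 0.73, which is why the curve looks almost linear.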

#### 4.1.2 Estimation Test 2

$$x_1 \dots x_{10} \approx 1, w_1 \dots w_{10} \approx 1 \tag{7}$$

The bias b varies over the whole range [-1, +1]. We expect to obtain an almost flat curve with "high values", since the sum of the ten products is about 10 and the sigmoid at that value tends to 1.

Figure 13: Output of the System 2



The output of the system is similar to a flat curve. We can state that, by comparing the system output with what we expected, the **second estimation test is passed**.

#### 4.1.3 Estimation Test 3

$$x_1 \dots x_{10} \approx -1, \quad w_1 \dots w_{10} \approx 1 \tag{8}$$

The bias b varies over the whole range [-1, +1]. We expect to obtain an almost flat curve with "low values", since the sum of the ten products is about -10.

Figure 14: Output of the System 3



The curve is flat with low values, but some *noise* can be noticed. This happens because some clock cycles are needed to obtain a proper output, so the intermediate results produced in the meantime are not valid. Still, by comparing the system output with what we expected, we can state that the **third estimation test is passed**.

#### 4.2 System Aimed Test

At this point some tests are carried out by feeding the same inputs to the **Perceptron** architecture realized so far and to a *Python script* that simulates the desired behaviour of the **Perceptron**. The latter is described through the following script:

```
import math

def get_outputs(i, x, w, b):
    print(f"############## TEST #{i} #############")
    print(f"X: {x}")
    print(f"W: {w}")
    print(f"b: {b}")

    total = summation(x, w, b)
    print(f"Sum result:\t\t\t\t {total}")

    f_z = sigmoid_output(total)
    print(f"Sigmoid output:\t\t\t {f_z}")

    # the sum with 12 bits
    sum_in_circuit = round(total / lsb_in)
    print(f"Sum value quantized:\t {sum_in_circuit}")

    f_z_in_circuit = round(f_z / lsb_out)
    print(f"Output value quantized:\t {f_z_in_circuit}")

def summation(x, w, b):
    # z = b + sum of the ten products x[i] * w[i]
    total = 0
    for i in range(0, 10):
        total += x[i] * w[i]
    total += b
    return total

def sigmoid_output(s):
    res = 1 / (1 + math.exp(-s))
    return res

lsb_out = 1 / (2**15 - 1)
lsb_in = 32 / (2**11 - 1)

# Test #1
x = [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
w = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
b = 0
get_outputs(1, x, w, b)
...
# Other tests
...
```

By running the python script the following output has been displayed in the console:

```
1 ################# TEST #1 ###################
  2 X: [-1, -1, -1, -1, -1, -1, -1, -1, -1]
  3 W: [1, 1, 1, 1, 1, 1, 1, 1, 1]
  4 b: 0
  5 Sum result:
                                                                             -10
                                                                                    4.5397868702434395e-05
  6 Sigmoid output:
                                                                                         -640
  7 Sum value quantized:
  8 Output value quantized: 1
################ TEST #2 ################
X: [-0.75, -0.75, -0.75, -0.75, -0.75, -0.75, -0.75, -0.75, -0.75, -0.75]
W: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
b: 0
Sum result: -7.5
Sigmoid output: 0.0005527786369235996
Sum value quantized: -480
Output value quantized: 18
################ TEST #3 ################
X: [-0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5]
W: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
b: 0
Sum result: -5.0
Sigmoid output: 0.0066928509242848554
```

```
Sum value quantized: -320
Output value quantized: 219
################ TEST #4 ################
X: [-0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5]
b: 0
Sum result: -2.5
Sigmoid output: 0.07585818002124355
Sum value quantized: -160
Output value quantized: 2486
################ TEST #5 ################
X: [-0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5]
W: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
b: 0
Sum result: 0.0
Sigmoid output: 0.5
Sum value quantized: 0
Output value quantized: 16384
################ TEST #6 ################
X: [-0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5]
W: [-0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5]
b: 0
Sum result: 2.5
Sigmoid output: 0.9241418199787566
Sum value quantized: 160
Output value quantized: 30281
################ TEST #7 ################
W: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
b: 0
Sum result: 5.0
Sigmoid output: 0.9933071490757153
Sum value quantized: 320
Output value quantized: 32548
################ TEST #8 ################
X: [0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75]
W: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
b: 0
Sum result: 7.5
Sigmoid output: 0.9994472213630764
Sum value quantized: 480
Output value quantized: 32749
################ TEST #9 ################
X: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

```
W: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
b: 0
Sum result: 10
Sigmoid output: 0.9999546021312976
Sum value quantized: 640
Output value quantized: 32766
################ TEST #10 ################
X: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
W: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
b: 1
Sum result: 11
Sigmoid output: 0.999983298578152
Sum value quantized: 704
Output value quantized: 32766
################ TEST #11 ################
X: [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
W: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
b: -1
Sum result: -11
Sigmoid output: 1.670142184809518e-05
Sum value quantized: -704
Output value quantized: 1
```
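The quantized values in the listing above can be reproduced with a few lines of Python. The following is a minimal sketch, not the script used for this report: the scale factors `Z_SCALE = 64` and `OUT_SCALE = 2**15 - 1` are assumptions inferred from the printed numbers (for example, a sum of 10 is printed as 640 and a sigmoid output of 0.5 as 16384), and the function names are hypothetical.

```python
import math

# Assumed scale factors, inferred from the listing (not taken from the report).
Z_SCALE = 64             # a sum of 10 is printed as 640
OUT_SCALE = 2**15 - 1    # largest positive code of a 16-bit two's complement word


def sigmoid(z):
    """Logistic activation f(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def golden_model(x, w, b):
    """Return (z, f(z), quantized z, quantized f(z)) for one test vector."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    fz = sigmoid(z)
    return z, fz, round(z * Z_SCALE), round(fz * OUT_SCALE)


# Test #2 of the listing: ten inputs at -0.75, unit weights, zero bias.
z, fz, z_q, out_q = golden_model([-0.75] * 10, [1] * 10, 0)
print(z, fz, z_q, out_q)
```

With these assumed scale factors the sketch gives the same quantized sum and output as the listing for Test #2 (-480 and 18), and likewise for Tests #10 and #11.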

Running a new testbench with **nearly** the same inputs produced the following results in **Modelsim**:

Figure 15: Output of the System 3



The two sets of outputs are compared in the following table:

| Test     | Python script: $round\left(\frac{z}{LSB}\right)$ | Python script: $round\left(\frac{f(z)}{LSB}\right)$ | Modelsim: $round\left(\frac{z}{LSB}\right)$ | Modelsim: $round\left(\frac{f(z)}{LSB}\right)$ |
|----------|------|-------|-----------|------------|
| Test #1  | -640 | 1     | -638 (+2) | 1 (=)      |
| Test #2  | -480 | 18    | -479 (+1) | 18 (=)     |
| Test #3  | -320 | 219   | -319 (+1) | 225 (+5)   |
| Test #4  | -160 | 2486  | -160 (=)  | 2482 (-4)  |
| Test #5  | 0    | 16384 | 0 (=)     | 16384 (=)  |
| Test #6  | 160  | 30281 | 160 (=)   | 30284 (+3) |
| Test #7  | 320  | 32548 | 318 (-2)  | 32541 (-7) |
| Test #8  | 480  | 32749 | 478 (-2)  | 32748 (-1) |
| Test #9  | 640  | 32766 | 632 (-8)  | 32765 (-1) |
| Test #10 | 704  | 32766 | 696 (-8)  | 32766 (=)  |
| Test #11 | -704 | 1     | -702 (+2) | 0 (-1)     |

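The table above compares z and f(z) (see Equations (1) and (2) for further details) as they are represented in the architecture, that is, in two's complement (C2) representation. The value in parentheses in the Modelsim columns is the difference from the corresponding Python value.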

As the table shows, the outputs are **nearly** the same, with a few small differences that can be ignored. These differences are easily explained: **Python's float** numbers use **64 bits**, instead of the 12 or 16 bits used in our case. This changes the outputs: in our case the number +1, for example, cannot be represented exactly in the chosen fixed-point format (the closest code for $x_i$ with 8 bits is "01111111"), so the more bits are available, the higher the precision. All things considered, we can state that **the system has passed the System Aimed Test** and, for our purposes, **can be considered verified.**
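The effect of the finite word length can be illustrated with a small integer model. This is a sketch under stated assumptions, not a description of the VHDL: it assumes that the $x_i$ are 8-bit values with an LSB of $2^{-7}$, that weights and bias are 9-bit values with an LSB of $2^{-8}$, that out-of-range values such as +1 saturate to the largest positive code, and that the full-precision sum is truncated to an LSB of $2^{-6}$. The function names are hypothetical.

```python
def to_code(value, frac_bits):
    """Quantize to a signed fixed-point code with saturation.

    Assumed format: 1 sign bit followed by frac_bits fractional bits.
    """
    scale = 1 << frac_bits
    code = round(value * scale)
    return max(-scale, min(scale - 1, code))


def z_fixed_point(x, w, b, x_frac=7, w_frac=8, z_frac=6):
    """Integer model of z = sum(x_i * w_i) + b, truncated to z_frac bits."""
    acc = sum(to_code(xi, x_frac) * to_code(wi, w_frac) for xi, wi in zip(x, w))
    acc += to_code(b, w_frac) << x_frac        # align the bias to the product LSB
    return acc >> (x_frac + w_frac - z_frac)   # arithmetic shift, i.e. truncation


print(to_code(1, 7))                            # 127, i.e. "01111111"
print(z_fixed_point([1] * 10, [1] * 10, 0))     # Test #9  -> 632
print(z_fixed_point([-1] * 10, [1] * 10, -1))   # Test #11 -> -702
```

Under these assumptions the model gives 632 and -702 for Tests #9 and #11, which are the values in the Modelsim column, whereas the 64-bit floating-point computation gives 640 and -704. The gap comes from +1 being stored as 127/128 and 255/256 and from truncating rather than rounding the sum.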

# 5 — XILINX VIVADO Report

This chapter presents the results obtained by creating a project with Xilinx VIVADO, with the Zybo Zynq-7000 (xc7z010clg400-1) selected as the target device. As highlighted in the resource utilization section, the Implementation phase was not performed because the Zybo board does not have enough input and output ports: in our case we need 10\*8 + 11\*9 = 179 input pins (even with fixed weights, the 10\*8 = 80 pins of the inputs would still be needed), while only 4 slide switches and 4 push buttons are available to drive the inputs. For this reason, the results shown in this chapter were obtained after the Synthesis phase with the timing (clock) constraint applied.
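The pin budget above can be written out explicitly. The following sketch only restates the arithmetic of this chapter; reading the 11 nine-bit words as ten weights plus the bias is an interpretation, and the constant names are hypothetical.

```python
N_INPUTS = 10    # x_i inputs
X_BITS = 8       # bits per input
N_WEIGHTS = 11   # assumed to be ten weights plus the bias
W_BITS = 9       # bits per weight
OUT_BITS = 16    # bits of the output f(z)

x_pins = N_INPUTS * X_BITS
w_pins = N_WEIGHTS * W_BITS
input_pins = x_pins + w_pins

print(x_pins, w_pins, input_pins)   # 80 99 179
print(input_pins + OUT_BITS)        # 195 data pins, clock and reset excluded
```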

#### 5.1 RTL Analysis

Before proceeding with Synthesis, a preliminary check of the correctness of the system was made by comparing the schematics produced by the **Elaborated Design** with those shown in the architecture chapter. **No problems were found at this stage** (all the schematics are available in PDF format in the project folder).

#### 5.2 Timing Report

After adding the clock constraint and running Synthesis, the following Timing Report was displayed:

| Setup                        |          | Hold                         |          | Pulse Width                              |          |  |
|------------------------------|----------|------------------------------|----------|------------------------------------------|----------|--|
| Worst Negative Slack (WNS):  | 5.669 ns | Worst Hold Slack (WHS):      | 0.139 ns | Worst Pulse Width Slack (WPWS):          | 3.500 ns |  |
| Total Negative Slack (TNS):  | 0.000 ns | Total Hold Slack (THS):      | 0.000 ns | Total Pulse Width Negative Slack (TPWS): | 0.000 ns |  |
| Number of Failing Endpoints: | 0        | Number of Failing Endpoints: | 0        | Number of Failing Endpoints:             | 0        |  |
| Total Number of Endpoints:   | 241      | Total Number of Endpoints:   | 241      | Total Number of Endpoints:               | 412      |  |

Figure 16: Timing Report

As we can see, the Worst Negative Slack (WNS) is **positive**, so the design can be driven at a frequency higher than 125 MHz. We can calculate the **maximum frequency** as:

$$f_{max} = \frac{1}{T_{clk} - WNS} = \frac{1}{8\,\text{ns} - 5.669\,\text{ns}} \approx 429.0\,\text{MHz} \tag{9}$$

$T_{clk}$ is set by the Zybo board, which operates at 125 MHz and therefore gives $T_{clk} = 1/125\,\text{MHz} = 8\,\text{ns}$. The WNS is determined by the **Critical Path** of the architecture, which is shown in the following table (see first row):

| Name    | Slack (ns) | Levels | Routes | High Fanout | From                             | To                                     | Total Delay (ns) |
|---------|------------|--------|--------|-------------|----------------------------------|----------------------------------------|------------------|
| Path 1  | 5.669      | 2      | 3      | 2           | TREE_ADD/reg_1_5/q_dff_reg[9]/C  | TREE_ADD/reg_2_3/q_dff_reg[10]/D       | 2.180            |
| Path 2  | 5.669      | 2      | 3      | 2           | TREE_ADD/reg_1_5/q_dff_reg[12]/C | TREE_ADD/reg_2_3/q_dff_reg[13]/D       | 2.180            |
| Path 3  | 5.675      | 2      | 3      | 2           | TREE_ADD/reg_1_5/q_dff_reg[9]/C  | TREE_ADD/sum_2_3/GEE.DFF_I/q_dff_reg/D | 2.174            |
| Path 4  | 5.675      | 2      | 3      | 2           | TREE_ADD/reg_1_5/q_dff_reg[12]/C | TREE_ADD/sum_2_3/GEE.DFF_I/q_dff_reg/D | 2.174            |
| Path 5  | 5.697      | 2      | 3      | 2           | TREE_ADD/reg_3_1/q_dff_reg[9]/C  | REG_TREE/q_dff_reg[10]/D               | 2.152            |
| Path 6  | 5.697      | 2      | 3      | 2           | TREE_ADD/reg_3_1/q_dff_reg[12]/C | REG_TREE/q_dff_reg[13]/D               | 2.152            |
| Path 7  | 5.697      | 2      | 3      | 2           | TREE_ADD/reg_3_1/q_dff_reg[15]/C | REG_TREE/q_dff_reg[16]/D               | 2.152            |
| Path 8  | 5.697      | 2      | 3      | 2           | TREE_ADD/reg_3_1/q_dff_reg[3]/C  | REG_TREE/q_dff_reg[4]_inv/D            | 2.152            |
| Path 9  | 5.697      | 2      | 3      | 2           | TREE_ADD/reg_3_1/q_dff_reg[6]/C  | REG_TREE/q_dff_reg[7]_inv/D            | 2.152            |
| Path 10 | 5.697      | 2      | 3      | 2           | REG_1/q_dff_reg[9]/C             | TREE_ADD/reg_1_1/q_dff_reg[10]/D       | 2.152            |

Figure 17: Critical Path Report

We can state that the **Tree Adder** module has the greatest impact on the **critical path**: nine of the ten worst paths start or end inside it. Adding registers between the ripple carry adders, and adding pipeline registers inside the adders themselves, during the design phase were therefore both **good design choices**.
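Equation (9) can be checked numerically from the values of the timing report. The following is a minimal sketch; the variable names are hypothetical, and the numbers are the 125 MHz board clock and the 5.669 ns slack of Path 1.

```python
F_CLK_MHZ = 125.0   # Zybo board clock
WNS_NS = 5.669      # worst negative slack from the timing report

t_clk_ns = 1e3 / F_CLK_MHZ      # clock period in ns
t_min_ns = t_clk_ns - WNS_NS    # shortest period that still meets setup
f_max_mhz = 1e3 / t_min_ns

print(f"T_clk = {t_clk_ns:.3f} ns, T_min = {t_min_ns:.3f} ns, f_max = {f_max_mhz:.1f} MHz")
# T_clk = 8.000 ns, T_min = 2.331 ns, f_max = 429.0 MHz
```

Since this figure comes from post-synthesis timing, it is an estimate: routing delays after implementation would normally change it.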

#### 5.3 Resource Utilization Report

The resources used by the synthesized architecture are the following:



Figure 18: Resource Utilization Report

As we can see, the IO resource utilization exceeds 100% of what is available (179 input pins and 16 output pins are needed), so it was not possible to proceed with the implementation phase on the Zybo Zynq-7000. We can also see that roughly 10% of the LUT resources are used; given these numbers, the LUT optimization, which halved the LUT utilization, was a **good design choice**.

#### 5.4 Power Consumption Report

The power consumption report is the following:



Figure 19: Power Consumption Report

As we can see, a total of 0.174 W of power is needed (with the default settings suggested by Xilinx VIVADO), split roughly equally between dynamic and static power. For the **Dynamic Power Consumption**, the largest contributions come from logic and signals.

#### 5.5 Warning Messages

After the synthesis phase the following warning messages were shown:



Figure 20: Warning Messages

So, let's analyse them:

- [Constraints 18-5210] No constraints selected for write: this warning message was shown even during the laboratory classes and it can be ignored.
- [Synth 8-3936] Found unconnected internal register 'z\_in\_lut\_reg' and it is trimmed from '21' to '20' bits: this warning is due to the fact that, after the LUT optimization, the first bit (the 21st) is the sign bit; it is used only to perform an additional operation and plays no part in addressing the LUT.
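The role of the sign bit can be illustrated with the symmetry of the sigmoid, $f(-z) = 1 - f(z)$, which lets a table store only the non-negative half of the input range. The following sketch assumes that this symmetry is the LUT optimization referred to above; the table size, scale factors and names are hypothetical and are not taken from the VHDL.

```python
import math

Z_FRAC = 6            # assumed LSB of z: 2**-6
OUT_MAX = 2**15 - 1   # assumed full-scale output code
LUT_SIZE = 1024       # hypothetical size, covers 0 <= z < 16

# Only non-negative values of z are stored.
LUT = [round(OUT_MAX / (1.0 + math.exp(-i / 2**Z_FRAC))) for i in range(LUT_SIZE)]


def sigmoid_lut(z_code):
    """Look up f(z) for a signed z code, using only the magnitude as address."""
    negative = z_code < 0
    addr = min(abs(z_code), LUT_SIZE - 1)        # the sign is not part of the address
    out = LUT[addr]
    return OUT_MAX - out if negative else out    # f(-z) = 1 - f(z)


print(sigmoid_lut(0), sigmoid_lut(160), sigmoid_lut(-160))   # 16384 30281 2486
```

In this sketch the sign only selects whether the stored value or its complement is returned, which matches the description of a sign bit that performs an additional operation without contributing to the address.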

# 6 — Conclusion

After performing Synthesis with Xilinx VIVADO, we can say that, in order to proceed to the **implementation** and generate the bitstream of the **Perceptron**, another board with a higher I/O capacity would be needed. The implementation was not carried out because its results would have been biased by the I/O planning constraint; in our case this would have led only to partial conclusions.

On the other hand, the maximum frequency of the design could be increased further by inserting pipeline registers more frequently (in this implementation a register is placed after 3 FA modules), at the cost of a higher resource utilization.

A further optimization of the adder architecture would be to add carry generation logic, so that results are obtained in fewer clock cycles.
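One common form of carry generation logic is the carry-lookahead scheme, in which each bit position produces a generate and a propagate signal. The following behavioural sketch assumes that this is the kind of logic meant here; the function name and the 4-bit default width are hypothetical.

```python
def carry_lookahead_add(a, b, width=4, carry_in=0):
    """Add two unsigned integers using generate/propagate signals.

    Returns (sum modulo 2**width, carry out).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]     # generate:  a_i AND b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]   # propagate: a_i XOR b_i
    carries = [carry_in]
    for i in range(width):
        # c[i+1] = g[i] OR (p[i] AND c[i]); in hardware this recurrence is
        # expanded into two-level logic so that no carry waits for a ripple.
        carries.append(g[i] | (p[i] & carries[i]))
    total = sum((p[i] ^ carries[i]) << i for i in range(width))
    return total, carries[width]


print(carry_lookahead_add(0b1011, 0b0110))   # (1, 1): 11 + 6 = 17 = 16 + 1
```

The loop computes the carries one after another for readability; the benefit in hardware comes from evaluating the expanded expressions in parallel, at the cost of additional logic.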