

# Università di Pisa

Computer Engineering

Electronic and Communication Systems

### Perceptron

Project Report

TEAM MEMBERS: Olgerti Xhanej

Academic Year: 2020/2021

# Contents

- 1 Introduction
  - 1.1 Problem Description
  - 1.2 Applications
  - 1.3 Possible Architectures
- 2 Architecture
  - 2.1 Multiplication Circuit Architecture
  - 2.2 Adder Circuit Architecture
  - 2.3 Activation Function Circuit Architecture
- 3 VHDL CODE
  - 3.1 Modules List
  - 3.2 Perceptron
  - 3.3 Parallel Multiplier
  - 3.4 Unsigned Parallel Multiplier
  - 3.5 Tree Adder
    - 3.5.1 Ripple Carry Adder Pipelined
  - 3.6 LUT
    - 3.6.1 Lut generation code
- 4 Test Plan
- 5 XILINX VIVADO Report
- 6 Conclusion

### 1 - Introduction

#### 1.1 Problem Description

The goal of the activity described in this report is to design a network that implements a **perceptron** with a **sigmoid activation** function.

Before describing the design and implementation process, a short introduction to the architecture is needed.



Figure 1: Perceptron Architecture

A **Perceptron** is a binary classifier that maps its inputs to a specific output y = f(z), where f(z) is the **activation function** of the perceptron. The inputs are real numbers, and the input z of the activation function is obtained as:

$$z = b + \sum_{i=0}^{N_L - 1} w_i x_i \tag{1}$$

Every input  $x_i$ , every weight  $w_i$  and the bias b are real numbers in the range of [-1, 1].

The activation function, in our case, will be a sigmoid function, described as follows:

$$y = \frac{1}{1 + e^{-z}} \tag{2}$$

Figure 2: Sigmoid Function Plot

Here z is the result of equation (1).
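Equations (1) and (2) can be summarized by a small floating-point reference model. This is only a sketch for checking values by hand, not the hardware implementation; the function name `perceptron` is hypothetical, and the choice of 10 inputs follows the entity shown in section 3.2.

```python
import math

def perceptron(x, w, b):
    """Floating-point reference for equations (1) and (2)."""
    # Equation (1): weighted sum of the inputs plus the bias
    z = b + sum(w_i * x_i for w_i, x_i in zip(w, x))
    # Equation (2): sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))

# With 10 inputs and every value in [-1, 1], z stays within [-11, 11]
y_max = perceptron([1.0] * 10, [1.0] * 10, 1.0)   # z = 11
y_mid = perceptron([0.0] * 10, [0.0] * 10, 0.0)   # z = 0
print(y_mid, y_max)
```

Since every term of the sum has magnitude at most 1, the bound |z| <= 11 explains the input range used later for the Look-Up Table.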

#### 1.2 Applications

A single perceptron is the building block of *artificial neural networks*, in which different layers of perceptrons are connected. The output of the neural network is a real number and can be used to classify *complex objects*: patterns, human faces, handwriting, medical diagnoses, spam e-mails.



Figure 3: Neural network example

In the image above there is a simple schema of a neural network, in which the circles represent the perceptrons.

#### 1.3 Possible Architectures

From a higher-level point of view, the architecture is made up of three main logical parts:

- Multiplication Circuit: implementation of the multiplication operation between each input  $x_i$  and each weight  $w_i$ .
- Adder Circuit: implementation of the addition between the results of the former phase and the bias b.
- Activation Function Circuit: implementation of the computation of the sigmoid function.

In the next chapter the architecture will be documented in more detail. Different design choices could be made for each logical part of the architecture:

- Multiplication Circuit: it could be implemented through a ROM-based solution in which every possible result is stored and the two inputs together form the address of the result. This solution is good only with a very low number of bits, which is not our case: the ROM would be composed of $2^{(n_{w_i}+n_{x_i})}$ memory cells, where $n_{w_i}$ and $n_{x_i}$ are the bit widths of the weight and of the input. For this reason the multiplication circuit will be implemented through a Parallel Multiplier.
- Adder Circuit: different choices could be made to implement the adder circuit. From the simplest to the most complex solution, we can use the Serial Adder, the Parallel Adder or the Parallel Adder with Pipeline. The first one needs less logic but requires n clock cycles to compute an n-bit result. The second solution improves on the first by computing one result in one clock cycle; on the other hand, it can cause problems due to the long logic chains between two registers. The third solution is the best with respect to both the number of clock cycles required and the critical path: adding some registers in between the computation of the bits shortens the logic chains.
- Activation Function Circuit: as seen during the laboratory class, this part will be implemented with a Look-Up-Table. To do so, a **truncation** of the result of the former computation may be necessary in order to limit the size of the LUT. With 12 bits, $2^{12} = 4096$ entries are necessary, which can be further reduced by exploiting the symmetry of the sigmoid function.
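The memory sizes mentioned in the list above can be checked with a few lines of arithmetic. The sketch below assumes the bit widths stated in the next chapter (8-bit inputs, 9-bit weights); the helper names `rom_cells` and `lut_entries` are hypothetical.

```python
def rom_cells(n_bits_x, n_bits_w):
    """Cells of a ROM addressed by both operands of the multiplication."""
    return 2 ** (n_bits_x + n_bits_w)

def lut_entries(address_bits, use_symmetry=False):
    """Entries of a LUT, optionally halved by storing only one half of the curve."""
    entries = 2 ** address_bits
    return entries // 2 if use_symmetry else entries

print(rom_cells(8, 9))                      # ROM-based multiplier, one per product
print(lut_entries(12))                      # full sigmoid LUT with 12 address bits
print(lut_entries(12, use_symmetry=True))   # LUT storing only non-negative inputs
```

A ROM of this size would be needed for each product, which is why the ROM-based multiplier is discarded while the LUT-based sigmoid remains practical.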

### 2 - Architecture

This chapter discusses in depth the architecture of the three main parts of the perceptron. The general structure can be summarized by the following schema:



Figure 4: General Schema

#### 2.1 Multiplication Circuit Architecture

The Multiplication Circuit, as said before, will be implemented through a Parallel Multiplier. The inputs $x_i$ and $w_i$ are composed of $b_x = 8$ bits and $b_w = 9$ bits respectively. In order to compute the multiplication correctly, the inputs need to be converted to their **unsigned form**; the multiplication can then be performed with the parallel multiplier. The following image presents the general schema of the Parallel Multiplier:



Figure 5: Parallel Multiplier Architecture

Notice that the sign of the result is computed by a simple XOR operation between the signs of the inputs. The Unsigned Parallel Multiplier architecture is the following:



Figure 6: Unsigned Parallel Multiplier Architecture

Each logic block of the schema corresponds to one of the following logic blocks:



Figure 7: Unsigned Parallel Multiplier Architecture

#### 2.2 Adder Circuit Architecture

In order to compute equation (1), several sums need to be computed. The building block of this part is the **Parallel Adder with Pipeline**: as said before, adding some registers along the carry chain reduces the impact of the critical path. Furthermore, thanks to the parallel architecture, a single sum can be computed in a single clock cycle. The next figure presents the Parallel Adder:



Figure 8: Parallel Adder Architecture

The whole sum has 11 terms. To decrease the number of cycles needed to compute it and to reduce the number of bits needed, a tree approach has been chosen. The schema of the tree parallel adder is the following:



Figure 9: Tree Parallel Adder Architecture

Some registers have been placed between the sums to limit the impact of the critical path on the performance and on the minimum clock period.
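The effect of the tree approach can be sketched in software. The pairing order below is an assumption (the actual wiring is the one in Figure 9), and `tree_sum` is a hypothetical helper; the sketch only shows how many adder levels a pairwise reduction of 11 terms needs.

```python
def tree_sum(terms):
    """Pairwise (tree) reduction; returns the total and the number of adder levels."""
    level = list(terms)
    depth = 0
    while len(level) > 1:
        # Add neighbouring pairs; an unpaired last term is forwarded unchanged
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth

# 10 products plus the bias: 11 terms (dummy values used here)
total, levels = tree_sum(range(11))
print(total, levels)
```

Eleven terms shrink to 6, 3, 2 and finally 1 value, so four levels are enough. Allowing one extra bit per level, the 17-bit products of the multipliers grow to 17 + 4 = 21 bits, which matches the width of the `z` output of the Tree Adder.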

#### 2.3 Activation Function Circuit Architecture

At the end of the previous phase the output is 21 bits wide. The sigmoid function is computed through a **Look-Up-Table**, which would need $2^{21} = 2097152$ entries of 16 bits each. To reduce the size of the Look-Up-Table, the input is truncated from 21 bits to 12 bits. In this case the Look-Up Table is composed of $2^{12} = 4096$ entries but, by exploiting the symmetry of the sigmoid around the point $(0, 0.5)$, that is $f(-z) = 1 - f(z)$, only 4096/2 = 2048 entries are needed.



Figure 10: Look-Up Table Architecture

### 3 - VHDL CODE

This chapter presents the main modules that compose the architecture of the **Perceptron with sigmoid activation function**.

#### 3.1 Modules List

As presented in the last chapter, I have followed a similar approach for creating the architecture. The following modules were created:

- Perceptron
  - Parallel\_Multiplier
    - Unsigned\_Parallel\_Multiplier
      - Full Adder
      - Half Adder
  - Tree\_Adder
    - Ripple\_Carry\_Adder\_Pipelined
      - DFF
      - Full Adder
  - Sigmoid\_Lut\_2048

A **bottom-up strategy** was followed to build up the architecture: the modules that make up the architecture were written first and, after finishing each of them, some testbenches were written in order to test each building block of the **Perceptron** (see the next chapter for details).

#### 3.2 Perceptron

This is the main hardware description of the architecture. This module connects all the other modules in order to create the complete architecture. To avoid showing too many lines of code, only the entity definition and the final process of this module are shown.

```
entity Perceptron is
  port(
    -- x_1 to x_10 inputs of the perceptron with 8 bits
    x_1: in std_logic_vector(7 downto 0);
    ...
    x_10: in std_logic_vector(7 downto 0);

    -- w_1 to w_10 inputs of the perceptron with 9 bits
    w_1: in std_logic_vector(8 downto 0);
    ...
    w_10: in std_logic_vector(8 downto 0);

    -- b input of the perceptron with 9 bits
    b: in std_logic_vector(8 downto 0);

    clk: in std_logic;
    rst: in std_logic;

    -- output of the perceptron 16 bits
    f_z: out std_logic_vector(15 downto 0)
  );
end Perceptron;

architecture rtl of Perceptron is

begin

  d_process: process(clk, rst)
  begin
    -- If z, the candidate input of the sigmoid function, is negative,
    -- then its complement is passed.
    if (z_in(20) = '1') then
      z_in_lut <= std_logic_vector(unsigned(not(z_in)) + 1);
    else
      z_in_lut <= z_in;
    end if;

    -- On the output side, if the candidate input was negative
    -- the output is complemented with the highest possible
    -- number in the lut in order to mirror it
    if (z_in(20) = '1') then
      f_z <= std_logic_vector(32766 - unsigned(f_z_todo));
    else
      f_z <= std_logic_vector(unsigned(f_z_todo));
    end if;
  end process d_process;

end rtl;
```

The rest of this module instantiates and links the various submodules that make up the **Perceptron** module. The bottom of the previous code snippet shows how the optimization of the LUT is implemented.
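The mirroring performed by the process above can be modeled in software. The sketch below is an assumption-laden model: it works directly on a signed 12-bit value (the truncation from 21 to 12 bits is not modeled), it rebuilds the table with the formula of section 3.6.1, and the name `sigmoid_lookup` is hypothetical. The constant 32766 is the one used in the VHDL code.

```python
import math

LSB_IN = 11 / (2 ** 11 - 1)    # weight of one input step (12-bit signed input)
LSB_OUT = 1 / (2 ** 15 - 1)    # weight of one output step (16-bit output)

# Table for non-negative inputs only, built as in section 3.6.1
LUT = [round((1 / (1 + math.exp(-(x * LSB_IN)))) / LSB_OUT) for x in range(2048)]

def sigmoid_lookup(z):
    """z: signed quantized input in [-2047, 2047]; returns the quantized sigmoid."""
    if z < 0:
        # Negative input: address the table with |z| and mirror the output
        return 32766 - LUT[-z]
    return LUT[z]

print(sigmoid_lookup(-2047), sigmoid_lookup(0), sigmoid_lookup(2047))
```

Only 2048 entries are stored, yet the function covers the whole signed input range and stays monotonic across z = 0.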

#### 3.3 Parallel Multiplier

This module converts the inputs, which are signed in 2's complement representation, to their unsigned representation, feeds them to the **Unsigned Parallel Multiplier** module and then converts the product back to the signed representation. The general architecture is shown in Figure 5.

```
entity Parallel_Multiplier is
  generic (
    Nbit_a : positive;
    Nbit_b : positive
  );
  port(
    a_p_signed: in std_logic_vector(Nbit_a - 1 downto 0);
    b_p_signed: in std_logic_vector(Nbit_b - 1 downto 0);

    -- The product will need Nbit_a + Nbit_b bits
    p_signed: out std_logic_vector(Nbit_a + Nbit_b - 1 downto 0)
  );
end entity Parallel_Multiplier;

architecture rtl of Parallel_Multiplier is

  -- Building blocks of the Parallel Multiplier
  component Unsigned_Parallel_Multiplier
    generic(
      Nbit_a : positive;
      Nbit_b : positive
    );
    port(
      a_p : in std_logic_vector(Nbit_a - 1 downto 0);
      b_p : in std_logic_vector(Nbit_b - 1 downto 0);
      p   : out std_logic_vector(Nbit_a + Nbit_b - 1 downto 0)
    );
  end component Unsigned_Parallel_Multiplier;

  -- Unsigned signals (used by the unsigned parallel multiplier)
  signal p_unsigned: std_logic_vector(Nbit_a + Nbit_b - 1 downto 0);
  signal a_p_unsigned: std_logic_vector(Nbit_a - 1 downto 0);
  signal b_p_unsigned: std_logic_vector(Nbit_b - 1 downto 0);

  -- Will carry the sign bit of the signed representation of the inputs
  signal a_sign: std_logic;
  signal b_sign: std_logic;

begin

  -- Compute the unsigned representation from the signed one
  a_p_unsigned <= std_logic_vector(abs(signed(a_p_signed)));
  b_p_unsigned <= std_logic_vector(abs(signed(b_p_signed)));

  -- 2's complement representation, the result sign is computed
  -- through the xor op. between a and b
  p_signed <= std_logic_vector(unsigned(not(p_unsigned)) + 1)
              when ((a_sign xor b_sign) = '1') else p_unsigned;

  -- Getting of the sign from a and b (the MSB of the C2 representation)
  a_sign <= a_p_signed(Nbit_a - 1);
  b_sign <= b_p_signed(Nbit_b - 1);

  unsigned_parallel_mul: Unsigned_Parallel_Multiplier
    generic map(
      Nbit_a => Nbit_a,
      Nbit_b => Nbit_b
    )
    port map(
      a_p => a_p_unsigned,
      b_p => b_p_unsigned,
      p   => p_unsigned
    );

end architecture rtl;
```
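The sign handling of this module can be verified with a bit-accurate software model. The following is a sketch under the assumption of 8-bit and 9-bit operands; `to_signed` and `parallel_multiply` are hypothetical helper names, and the unsigned product is computed with Python's `*` instead of the hardware array.

```python
def to_signed(value, nbits):
    """Interpret the low nbits of value as a 2's complement number."""
    value &= (1 << nbits) - 1
    return value - (1 << nbits) if value >> (nbits - 1) else value

def parallel_multiply(a_bits, b_bits, nbit_a=8, nbit_b=9):
    """Model of Parallel_Multiplier: bit patterns in, product bit pattern out."""
    width = nbit_a + nbit_b
    # Sign bits are the MSBs of the 2's complement inputs
    a_sign = (a_bits >> (nbit_a - 1)) & 1
    b_sign = (b_bits >> (nbit_b - 1)) & 1
    # Magnitudes fed to the unsigned multiplier
    p_unsigned = abs(to_signed(a_bits, nbit_a)) * abs(to_signed(b_bits, nbit_b))
    if a_sign ^ b_sign:
        # Negative result: 2's complement of the unsigned product
        return (~p_unsigned + 1) & ((1 << width) - 1)
    return p_unsigned

# -1 (0xFF on 8 bits) times 3 gives -3 on 17 bits
print(to_signed(parallel_multiply(0xFF, 0x003), 17))
```

The model also covers the corner case of the most negative operands, whose magnitude still fits in the unsigned representation of the same width.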

#### 3.4 Unsigned Parallel Multiplier

This module is the one that actually computes the multiplication. By replicating the architecture shown in Figures 6 and 7, this module returns the *unsigned product* of two *unsigned operands*. In order to implement the architecture in the simplest way, a **structural approach** has been followed in the description. Some lines of code were skipped in order to show only the general schema; further details can be found in the source code.

```
entity Unsigned_Parallel_Multiplier is
  generic (
    Nbit_a : positive;
    Nbit_b : positive
  );
  port(
    -- Unsigned representation of inputs
    a_p: in std_logic_vector(Nbit_a - 1 downto 0);
    b_p: in std_logic_vector(Nbit_b - 1 downto 0);

    -- p = a_p * b_p
    p: out std_logic_vector(Nbit_a + Nbit_b - 1 downto 0)
  );
end entity Unsigned_Parallel_Multiplier;

architecture rtl of Unsigned_Parallel_Multiplier is

  -- Building blocks of the Unsigned Parallel Multiplier
  component FULL_ADDER is
    ...
  end component;

  component HALF_ADDER is
    ...
  end component;

  -- Will hold the carry signals among the whole architecture
  signal carry_signal: std_logic_vector((Nbit_a - 1)*(Nbit_b - 1) - 1 downto 0);
  signal last_carry_signal: std_logic_vector((Nbit_b - 1) downto 0);

  -- Will hold the sum result of the FA and HA among the whole architecture
  signal sum_signal: std_logic_vector((Nbit_a - 1)*(Nbit_b - 2) - 1 downto 0);

  -- Will hold the precomputed values for the inputs a and b
  -- of the various Half Adder and Full Adder
  signal a_multiplier: std_logic_vector(Nbit_a + Nbit_b - 2 downto 0);
  signal b_multiplier: std_logic_vector((Nbit_a - 1)*(Nbit_b - 1) - 1 downto 0);

begin

  -- First bit of the result
  p(0) <= (a_p(0) and b_p(0));

  -- Computation of the various inputs of each HA and FA
  d_process: process(a_p, b_p)
  begin
    for j in 1 to Nbit_b loop
      a_multiplier(j - 1) <= (a_p(0) and b_p(Nbit_b - j));
    end loop;
    ...
  end process d_process;

  -- Architecture will follow schema of the Parallel Multiplier
  -- Row index i
  GEN_a: for i in 1 to Nbit_a generate
    -- Column index j
    GEN_b: for j in 1 to Nbit_b - 1 generate

      FIRST_ROW: if i = 1 generate
        -- In the first row only HA
        LEFT: if j < Nbit_b - 1 generate
          ROW1_LEFT: HALF_ADDER
            port map(
              a    => a_multiplier(j - 1),
              b    => b_multiplier(j - 1),
              s    => sum_signal(j - 1),
              cout => carry_signal(j - 1)
            );
        end generate LEFT;

        RIGHT: if j = Nbit_b - 1 generate
          ROW1_RIGHT: HALF_ADDER
            port map(
              a    => a_multiplier(j - 1),
              b    => b_multiplier(j - 1),
              s    => p(1), -- Result bit
              cout => carry_signal(j - 1)
            );
        end generate RIGHT;
      end generate FIRST_ROW;

      INTERNAL_ROW: if i > 1 and i < Nbit_a generate
        ...
      end generate INTERNAL_ROW;

      LAST_ROW: if i = Nbit_a generate
        ...
      end generate LAST_ROW;

    end generate GEN_b;
  end generate GEN_a;

end architecture rtl;
```
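The principle behind the adder array can be illustrated with a gate-level software sketch: every partial-product bit is an AND of one bit of each operand, and the rows are accumulated through full adders. This is not the exact wiring of Figures 6 and 7 (which also uses half adders and a different row organization); `full_adder` and `array_multiply` are hypothetical names.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def array_multiply(a, b, nbit_a, nbit_b):
    """Unsigned product built only from AND gates and full adders."""
    acc = [0] * (nbit_a + nbit_b)          # accumulated result bits
    for j in range(nbit_b):                # one row per bit of b
        carry = 0
        for i in range(nbit_a):            # one adder per bit of a
            partial = ((a >> i) & 1) & ((b >> j) & 1)   # AND gate
            acc[i + j], carry = full_adder(acc[i + j], partial, carry)
        acc[j + nbit_a] = carry            # carry out of the row
    return sum(bit << k for k, bit in enumerate(acc))

print(array_multiply(200, 300, 8, 9))
```

The result needs at most `nbit_a + nbit_b` bits, which is the width of the port `p` in the entity above.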

#### 3.5 Tree Adder

This module takes the ten multiplication results and the bias and sums all the terms, performing the computation shown in equation (1). Also in this case the architectural code is not shown, since it consists only of linking some submodules in the proper way in order to replicate the architecture shown in Figure 9.

```
entity Tree_Adder is
  port(
    -- Inputs: result of the multiplication of xi*wi
    in_1: in std_logic_vector(16 downto 0);
    ...
    in_10: in std_logic_vector(16 downto 0);

    -- Bias input
    b: in std_logic_vector(8 downto 0);

    clk: in std_logic;
    rst: in std_logic;

    -- Output
    z: out std_logic_vector(20 downto 0)
  );
end Tree_Adder;
```

#### 3.5.1 Ripple Carry Adder Pipelined

This module is the main building block of the **Tree Adder** module. In order to shorten the logic chain, some registers were added using the **DFF** module, as seen in the Lab lectures.

```
entity Ripple_Carry_Adder_Pipelined is
  generic (Nbit: positive);
  port(
    -- Inputs
    a_r: in std_logic_vector(Nbit-2 downto 0);
    b_r: in std_logic_vector(Nbit-2 downto 0);
    cin_r: in std_logic;
    cout_r: out std_logic;

    -- Will store the result of a_r+b_r
    s_r: out std_logic_vector(Nbit-1 downto 0);

    clk: in std_logic;
    rst: in std_logic
  );
end Ripple_Carry_Adder_Pipelined;

architecture rtl of Ripple_Carry_Adder_Pipelined is

  -- Building blocks of the Ripple Carry Adder Pipelined
  component FULL_ADDER
    ...
  end component FULL_ADDER;

  -- Need of a register to obtain the pipelined version
  component DFF
    ...
  end component DFF;

  -- Will propagate the carry signal among the whole architecture
  signal carry_signal: std_logic_vector(Nbit-1 downto 1);

  -- Will store the output signals of the registers
  signal dff_signal: std_logic_vector(Nbit-1 downto 0) := (others => '0');

begin

  -- Implemented in a structural way, in a similar fashion as seen
  -- in the Lab lessons
  GEN: for i in 1 to Nbit generate

    FIRST: if i = 1 generate
      -- First FA
      FFI: FULL_ADDER port map (a_r(0), b_r(0), cin_r, s_r(0), carry_signal(1));
    end generate FIRST;

    INTERNAL: if i > 1 and i < Nbit generate
      -- A register is needed every third stage
      PIPE: if (i mod 3 = 0) generate
        DFF_I: DFF
          port map(
            d_dff      => carry_signal(i-1),
            clk_dff    => clk,
            resetn_dff => rst,
            q_dff      => dff_signal(i-1)
          );

        FFI: FULL_ADDER port map (a_r(i-1), b_r(i-1), dff_signal(i-1), s_r(i-1), carry_signal(i));
      end generate PIPE;

      -- No need of a register
      NOT_PIPE: if (i mod 3 /= 0) generate
        FFI: FULL_ADDER port map (a_r(i-1), b_r(i-1), carry_signal(i-1), s_r(i-1), carry_signal(i));
      end generate NOT_PIPE;
    end generate INTERNAL;

    -- Implicit sign extension: the MSB of the inputs has index Nbit-2,
    -- the MSB of the output has index Nbit-1, so the last FA reuses the
    -- input MSBs in order to extend the operands correctly in C2
    -- representation
    LAST: if i = Nbit generate
      FFI: FULL_ADDER port map (a_r(Nbit-2), b_r(Nbit-2), carry_signal(Nbit-1), s_r(Nbit-1), cout_r);
    end generate LAST;

  end generate GEN;

end rtl;
```
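The arithmetic performed by the chain of full adders, including the implicit sign extension of the last stage, can be checked with a short software model. This sketch reproduces only the combinational function: the pipeline registers and their timing are not modeled, and `full_adder` and `ripple_carry_add` are hypothetical names.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(a, b, nbit, cin=0):
    """Add two signed (nbit-1)-bit values; returns the signed nbit-bit sum."""
    in_width = nbit - 1
    mask = (1 << in_width) - 1
    a_bits, b_bits = a & mask, b & mask    # 2's complement bit patterns
    carry, result = cin, 0
    for i in range(nbit):
        # The last stage reuses the input MSBs (sign extension)
        idx = min(i, in_width - 1)
        s, carry = full_adder((a_bits >> idx) & 1, (b_bits >> idx) & 1, carry)
        result |= s << i
    # Interpret the nbit-bit pattern as a signed number
    return result - (1 << nbit) if result >> (nbit - 1) else result

# Two 17-bit products summed on 18 bits
print(ripple_carry_add(-65536, -65536, 18), ripple_carry_add(65535, 65535, 18))
```

Because the output is one bit wider than the inputs, the sum of two operands never overflows, which is why each level of the tree adder grows by one bit.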

#### 3.6 LUT

This module stores the outputs of the sigmoid function for the input range [-11;+11] with 12 bits of precision and the output range [0;1] with 16 bits of precision. In order to do so a quantization is needed: in general, a real number cannot be represented exactly with a finite number of bits. To determine the quantized values of the input and the output, the weight of the LSB in each range must be calculated. Through the **Reach the LSB** method seen during the Lab lectures, the LSB, with N bits of precision, can be calculated as:

$$LSB = \frac{\max x}{2^{N-1} - 1} \quad or \quad \frac{|\min x|}{2^{N-1}} \tag{3}$$

So, in our case:

$$LSB(in) = \frac{11}{2^{11} - 1} = 0.005373717 \tag{4}$$

$$LSB(out) = \frac{1}{2^{15} - 1} = 3.051850947 \times 10^{-5} \tag{5}$$

The Look-Up table stores the values $\mathrm{round}(f(x \cdot LSB(in)) / LSB(out))$ for every integer address x in [0; 2047].

The input will be treated as an address signal to obtain the correct output value in the same fashion as shown in the Laboratory lectures.
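Equations (3) to (5) can be reproduced numerically. The sketch below uses the first form of equation (3); the helper names `lsb` and `quantize` are hypothetical.

```python
def lsb(max_x, n_bits):
    """Equation (3), first form: weight of the LSB for a signed n_bits value."""
    return max_x / (2 ** (n_bits - 1) - 1)

def quantize(value, step):
    """Nearest integer code for a real value, given the LSB weight."""
    return round(value / step)

lsb_in = lsb(11, 12)    # equation (4): input range up to 11 on 12 bits
lsb_out = lsb(1, 16)    # equation (5): output range up to 1 on 16 bits

print(lsb_in, lsb_out)
print(quantize(11, lsb_in), quantize(1.0, lsb_out))
```

The largest representable input and output map to the codes 2047 and 32767, the maximum positive values on 12 and 16 signed bits.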

```
entity sigmoid_lut_2048 is
  port (
    address : in std_logic_vector(10 downto 0);
    dds_out : out std_logic_vector(15 downto 0)
  );
end sigmoid_lut_2048;

architecture rtl of sigmoid_lut_2048 is

  type LUT_t is array (natural range 0 to 2047) of integer;
  constant LUT: LUT_t := (
    0 => 16384,
    1 => 16428,
    ...
    2046 => 32766,
    2047 => 32766
  );

begin
  dds_out <= std_logic_vector(TO_SIGNED(LUT(TO_INTEGER(unsigned(address))), 16));
end rtl;
```

#### 3.6.1 Lut generation code

The whole LUT was obviously not compiled "by hand". The look-up table outputs were generated through the following Python script, which uses the LSB values computed above for the 12-bit input and the 16-bit output.

```
import math

# Calculate the LSB of x (12 bits) and of f(x) (16 bits)
lsb_out = 1 / (2**15 - 1)
lsb_in = 11 / (2**11 - 1)

result = ""

for x in range(0, 2048):
    # Sigmoid of the real value represented by the address x
    f_x = 1 / (1 + math.exp(-(x * lsb_in)))
    lut = round(f_x / lsb_out)

    # Generate the LUT entry for every x
    result += str(x) + " => " + str(lut) + ", \n"

print(result)
```

### 4 - Test Plan

### 5 - XILINX VIVADO Report

### 6 - Conclusion