# Artificial neural networks

After going through Lesson 2, you should now have a much better intuition about neural networks. PyTorch and other deep learning frameworks let you manipulate these structures easily. In this exercise, you will dive into the challenges of computation and data structures for versatile systems such as the Multi-Layer Perceptron (MLP).

In [1]:
import numpy as np
np.random.seed(42)

## Object-oriented Programming

Unlike the previous notebook exercise, this one relies on classes in addition to functions (your PyTorch models are instances of a class).

![Class](https://upload.wikimedia.org/wikipedia/commons/thumb/9/98/CPT-OOP-objects_and_classes_-_attmeth.svg/300px-CPT-OOP-objects_and_classes_-_attmeth.svg.png "Basic car class")
Source: wikibooks.org

### Attributes and constructor
In Python, a basic class (let's call it MyClass) is defined as below:
```
class MyClass:

    def __init__(self, obj_name):
        self.name = obj_name

```

Here we defined the class MyClass and its constructor __init__ (more or less the function that creates the object). With this definition, the class has a single attribute called name.
```
my_obj = MyClass('my_obj_name')
```
will thus create a class instance with the attribute name = 'my_obj_name'. The only mandatory argument of the __init__ method is self, which is how Python refers to the instance being created.


### Methods
The last thing you need to know is that you can define functions specific to your class, called methods. You can define as many as you want, with whichever arguments you see fit.
```
class MyClass:

    def __init__(self, obj_name):
        self.name = obj_name
    
    def my_method(self):
        print('Hello World!')
    
    def whatsyourname(self):
        return self.name

```

Above we defined two methods: one that prints a famous hard-coded string, and another that returns your attribute. Where a function f is called with f(input), a method of a given object my_obj is called with my_obj.my_method(input).
```
my_obj = MyClass('my_obj_name')
my_obj.my_method()
name = my_obj.whatsyourname()
print(name)
```
will print
```
Hello World!
my_obj_name
```

Notice that although self is the first argument in each method definition, we never pass it explicitly when calling the method.
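
The implicit handling of self can be made visible with a small sketch. The class Greeter and its method greet are hypothetical names used only for illustration: calling a method through the instance is equivalent to calling it through the class and passing the instance explicitly.

```python
class Greeter:  # hypothetical class, for illustration only
    def __init__(self, name):
        self.name = name

    def greet(self, greeting):
        return f"{greeting}, {self.name}!"


g = Greeter("Ada")

# Called through the instance: Python passes `g` as `self` automatically.
bound = g.greet("Hello")

# Called through the class: `self` has to be passed explicitly.
unbound = Greeter.greet(g, "Hello")

print(bound)             # Hello, Ada!
print(bound == unbound)  # True
```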

### Example
We model basic phone usage and charging interactions as follows:
- the phone has an owner, and physical characteristics. The owner is specific to the object, but the physical characteristics are the same for all objects of this class.
- we want to be able to check the owner of the phone, charge it, and use it.


```
class Phone:

    charge_speed = 0.1
    max_charge = 1.
    draining_speed = 0.02

    def __init__(self, owner):
        self.owner = owner
        # Set initial charge to 0
        self.battery = 0.
    
    def charge(self, charge_duration):
        self.battery += charge_duration * self.charge_speed
        # Cap the charge at the maximum capacity
        self.battery = min(self.max_charge, self.battery)
    
    def whosyourowner(self):
        return self.owner
    
    def use(self, usage_duration):
        if self.battery <= 0.:
            raise ValueError("No more battery")
        else:
            self.battery -= usage_duration * self.draining_speed
            # The battery level cannot drop below zero
            self.battery = max(0., self.battery)

```
So now, the following will raise an error since your phone isn't initially charged:
```
my_phone = Phone('John')
my_phone.whosyourowner()
my_phone.use(1)
```
but this will work:
```
my_phone = Phone('John')
my_phone.charge(1)
my_phone.use(1.5)
```
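
The difference between attributes shared by all objects of a class and attributes specific to one object can be sketched as follows. Gadget is a hypothetical class that keeps only one shared and one per-object attribute; it is a simplified stand-in, not the Phone class itself.

```python
class Gadget:  # hypothetical class, for illustration only
    charge_speed = 0.1  # class attribute: shared by every instance

    def __init__(self, owner):
        self.owner = owner  # instance attribute: specific to each object


a = Gadget("John")
b = Gadget("Jane")
print(a.owner, b.owner)                # John Jane
print(a.charge_speed, b.charge_speed)  # 0.1 0.1

# Changing the class attribute is visible from every instance
Gadget.charge_speed = 0.2
print(a.charge_speed, b.charge_speed)  # 0.2 0.2

# Assigning through an instance creates an instance attribute
# that shadows the class attribute for that object only
a.charge_speed = 0.5
print(a.charge_speed, b.charge_speed)  # 0.5 0.2
```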

Check this resource https://realpython.com/python3-object-oriented-programming/ for more information.
Alright, you're ready now!

## The Perceptron

Your naive feed-forward perceptron model

- Inputs: input features of size N
- Outputs: scalar activation value (size 1)
- Parameters: weights, bias, activation

Given the input feature vector X of size N, the weights W, the bias b, and the perceptron's activation function activation, you aim to reproduce the operation we will call forward, $forward: \mathbb{R}^N \rightarrow \mathbb{R}$, defined as follows:
$$ \forall (X_i)_{i\ \in\ [1, N]}\ \in\ \mathbb{R}^N,\ forward((X_i)_{i\ \in\ [1, N]}) = activation\Bigg(\Big(\sum_{i=1}^{N} W_i * X_i\Big) + b\Bigg) $$

![Perceptron](https://cdn-images-1.medium.com/max/1600/1*n6sJ4yZQzwKL9wnF5wnVNg.png "Perceptron representation")
*Source: Towards Data Science*

which we will rewrite more concisely using the convention that, for two 1-dimensional vectors A and B of identical size, AB denotes their dot product:
$$ \forall X\ \in\ \mathbb{R}^N, forward(X) = activation(WX + b) $$
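
As a quick sanity check of this notation, the explicit weighted sum and the dot-product form can be compared numerically. This is a minimal sketch on small random values; the variable names and the bias value are chosen for illustration and are not the notebook's inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
X = rng.random(N)          # input features in [0, 1)
W = 2 * rng.random(N) - 1  # weights in [-1, 1)
b = 0.3                    # arbitrary bias chosen for the example

# Explicit form: sum over i of W_i * X_i, plus the bias
explicit = sum(W[i] * X[i] for i in range(N)) + b

# Compact form: dot product of W and X, plus the bias
compact = np.dot(W, X) + b

print(np.isclose(explicit, compact))  # True
```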

In [2]:
# This will generate your inputs for this sub-section
def get_inputs(input_size):
    
    inputs = np.random.rand(input_size)
    # Let's take our weights and bias in [-1, 1] rather than [0, 1]
    weights = 2 * np.random.rand(input_size) - 1
    bias = 2 * np.random.rand(1) - 1
    
    return inputs, weights, bias

### Linear operation
Let's focus on the weighted sum and the bias as they represent the linear part of the forward function.

In [7]:
# Choose the number of inputs
input_size = 8

# Load the inputs
inputs, weights, bias = get_inputs(input_size)
print(inputs.shape, weights.shape, bias.shape)
print(inputs)

(8,) (8,) (1,)
[0.52475643 0.43194502 0.29122914 0.61185289 0.13949386 0.29214465
 0.36636184 0.45606998]


Taking $X = (X_i)_{i\ \in\ [1, N]}$ as inputs, and $W = (W_i)_{i\ \in\ [1, N]}$ as weights, with $b$ being the bias, compute the value of $lin\_comb: \mathbb{R}^N \rightarrow \mathbb{R}$:
$$ lin\_comb(X, W, b) = \Big(\sum_{i=1}^{N} W_i * X_i\Big) + b $$

In [4]:
# TODO: Check if everything is alright by computing WX + b
lin_comb = 0
# Check if this is indeed 1-dimensional
print(lin_comb)

0


### Activation (non-linearity)
If you remember the lesson material, we need to introduce a non-linearity in our perceptron model. This operation is called the activation. 

For feed-forward purposes, you can take any real function operating on scalar inputs $f: \mathbb{R} \rightarrow \mathbb{R}$. Being able to compare the outputs of the activation function is useful, but using the entire real line is overkill, so it comes in handy to squeeze the output space into a narrower interval.

A common activation function is the sigmoid (also called the logistic function) $\sigma$, which meets the squeezing requirement: $\sigma: \mathbb{R} \rightarrow (0, 1)$

$$ \forall x\ \in\ \mathbb{R},\ \sigma(x) = \frac{1}{1 + e^{-x}} $$

![sigmoid](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/320px-Logistic-curve.svg.png "Sigmoid plot")
Source: Wikipedia

In [2]:
# Let's define an activation function (feel free to change this)
def sigmoid(x): return 1. / (1. + np.exp(-x))
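
The squeezing behaviour can be checked numerically. The sketch below redefines sigmoid so that it runs on its own, and evaluates it on a few hand-picked points (an arbitrary choice for illustration): the outputs stay strictly between 0 and 1, increase with the input, and satisfy the symmetry $\sigma(-x) = 1 - \sigma(x)$.

```python
import numpy as np


def sigmoid(x):
    return 1. / (1. + np.exp(-x))


xs = np.array([-10., -1., 0., 1., 10.])
ys = sigmoid(xs)  # np.exp is vectorized, so sigmoid works on arrays too

print(sigmoid(0.))                          # 0.5
print(np.all((ys > 0.) & (ys < 1.)))        # True
print(np.all(np.diff(ys) > 0.))             # True: increasing on these points
print(np.allclose(sigmoid(-xs), 1. - ys))   # True: sigmoid(-x) = 1 - sigmoid(x)
```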

Just as we discussed earlier when defining a class with its attributes and methods, we will now implement a naive perceptron class. Its attributes are its weights and bias, and apart from the constructor, it needs a forward method.

Now edit the NaivePerceptron class of the naive_perceptron.py file.
Use the refresher about Object-Oriented Programming (OOP) at the top of the notebook if necessary.

In [6]:
## TODO: edit the NaivePerceptron methods __init__ and forward to compute the correct activation
from naive_perceptron import NaivePerceptron
p = NaivePerceptron(weights, bias, activation=sigmoid)
p.forward(inputs)

array([0.34733063])

Congrats! You managed to build your own feed-forward perceptron in plain Python and NumPy!
One last thing: set __author__ and __maintainer__ at the top of your naive_perceptron.py file to your own name, you're in charge now ;)

## Your first layer
Alright, we have one perceptron, but state-of-the-art architectures have thousands of them. We therefore need a consistent way to stack many of them: a perceptron layer.

In [7]:
# This will generate your inputs for this sub-section
def get_inputs(input_size, output_size):
    
    inputs = np.random.rand(input_size)
    # Let's take our weights and bias in [-1, 1] rather than [0, 1]
    weights = 2 * np.random.rand(output_size, input_size) - 1
    bias = 2 * np.random.rand(output_size) - 1
    
    return inputs, weights, bias

Given a layer of M perceptrons, with the same input X as previously, and $W_j$ and $b_j$ being the weights and the bias of the j-th perceptron in the layer, we want to define a new forward operation, $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$, that will give us the output features of the layer:
$$ \forall X\ \in\ \mathbb{R}^N, forward(X) = (activation(W_jX + b_j))_{j\ \in\ [1, M]} $$
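
One possible reading of this formula is a loop over the M perceptrons, each one using its own row of weights and its own bias. The helper layer_forward_loop below is a hypothetical function written for illustration, not the content of naive_layer.py, and the small sizes are arbitrary.

```python
import numpy as np


def sigmoid(x):
    return 1. / (1. + np.exp(-x))


def layer_forward_loop(X, W, b, activation):
    """Hypothetical helper: row j of W and entry j of b define the j-th perceptron."""
    return np.array([activation(np.dot(W[j], X) + b[j]) for j in range(W.shape[0])])


rng = np.random.default_rng(0)
N, M = 4, 3                     # N inputs, M perceptrons
X = rng.random(N)
W = 2 * rng.random((M, N)) - 1  # one row of weights per perceptron
b = 2 * rng.random(M) - 1       # one bias per perceptron

out = layer_forward_loop(X, W, b, sigmoid)
print(out.shape)  # (3,)
```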

In [10]:
# Choose the number of inputs
input_size = 64
nb_perceptrons = 8

# Load the inputs
inputs, weights, biases = get_inputs(input_size, nb_perceptrons)
print(inputs.shape, weights.shape, biases.shape)
print(inputs[:5])

(64,) (8, 64) (8,)
[0.96563203 0.80839735 0.30461377 0.09767211 0.68423303]


Using what you already implemented in your NaivePerceptron class, we will define a NaiveLayer that is composed of multiple NaivePerceptron objects.

Edit the NaiveLayer class of the naive_layer.py file.
Use the refresher about Object-Oriented Programming (OOP) at the top of the notebook.

In [11]:
## TODO: edit the NaiveLayer methods __init__ and forward to compute the correct activation
from naive_layer import NaiveLayer
l = NaiveLayer(weights, biases, activation=sigmoid)
l.forward(inputs)

array([1.16338088e-01, 1.25305140e-01, 9.95996845e-01, 2.00959868e-03,
       8.67587232e-01, 7.51686419e-01, 9.96706154e-01, 5.39813449e-04])

You will notice that we could have defined a different activation function for each unit. However, doing so would hinder parallel computation, and it would make the layer's output feature map less consistent.
Don't forget to edit the author and maintainer's name at the top of your naive_layer.py file!

## MLP comeback
Since we have a common activation function for the layer, why not compute it all at once?
So far, we have naively defined separate perceptrons that each compute their activation value individually.

![MLP](http://pubs.sciepub.com/ajmm/3/3/1/bigimage/fig5.png "Multi-layer perceptron")
*Source: Science and Education Publishing*

As the representation above shows, a layer can be parametrized either by its units or by their connections with the input and output layers. Now that we are able to forward information, it would be good to prepare for the weight and bias updates. In our previous definition, the arguments weights, biases and activation were enough to define the entire layer.

So let's try to simplify and speed up our layer modelling.

In [2]:
# Choose the number of inputs
input_size = 64
nb_perceptrons = 8
inputs = np.random.rand(input_size)
print(inputs.shape)

(64,)


Considering a single layer of M perceptrons with N inputs, the activation value $f_j$ of the j-th perceptron is:

$$ \forall X\ \in \mathbb{R}^N, \forall j\ \in [1,M], f_j = activation\Bigg(\Big(\sum_{i=1}^{N}W_{j,i}X_i\Big) + b_j\Bigg) = activation(W_jX + b_j)$$
where $W_{j,i}$ represents the weight of the i-th input $X_i$ for the j-th perceptron.

Let's extend our definition of the activation function so that it operates on vectors rather than scalars, $activation: \mathbb{R}^k \rightarrow \mathbb{R}^k$, by applying it element-wise (a vectorized function, in programming terms).

This gives a matrix view of the previous expression, with $f\ \in \mathbb{R}^M$ the vector of all perceptrons' activation values:

$$ \forall X\ \in \mathbb{R}^N, f(X) = activation\Big(\Big(\Big(\sum_{i=1}^{N}W_{j,i}X_i\Big) + b_j\Big)_{j\ \in\ [1, M]}\Big) = activation\Bigg(\begin{bmatrix}
    (\sum_{i=1}^{N}W_{1,i}X_i) + b_1 \\
    \vdots \\
    (\sum_{i=1}^{N}W_{M,i}X_i) + b_M 
\end{bmatrix}\Bigg)$$

$$ = activation\Bigg(\begin{bmatrix}
    (\sum_{i=1}^{N}W_{1,i}X_i) \\
    \vdots \\
    (\sum_{i=1}^{N}W_{M,i}X_i)
\end{bmatrix} + \begin{bmatrix}b_1 \\ \vdots \\ b_M\end{bmatrix}\Bigg)
= activation\Bigg(\begin{bmatrix}
    W_{1,1} & \cdots & W_{1,N} \\
    \vdots & & \vdots \\
    W_{M,1} & \cdots & W_{M,N}
\end{bmatrix}*\begin{bmatrix}X_1 \\ \vdots \\ X_N\end{bmatrix} + \begin{bmatrix}b_1 \\ \vdots \\ b_M\end{bmatrix}\Bigg)$$

Thus, with $\mathbb{M}_{a,b}(\mathbb{R})$ being the set of real matrices of size $a \times b$,
$$ \forall\ W\ \in\ \mathbb{M}_{M,N}(\mathbb{R}), \forall\ b\ \in\ \mathbb{R}^M, \forall\ X\ \in\ \mathbb{R}^N,\ f = activation(WX + b) $$

As you can see, we can condense all the layer's operations into a single matrix operation, followed by the activation applied across all units. Let's implement this!
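
The equivalence derived above can be verified numerically before touching layer.py. This is a minimal sketch on random values, assuming weights stored as an M x N array as in get_inputs; it compares the per-unit computation with the single matrix expression.

```python
import numpy as np


def sigmoid(x):
    return 1. / (1. + np.exp(-x))


rng = np.random.default_rng(0)
N, M = 64, 8
X = rng.random(N)
W = 2 * rng.random((M, N)) - 1
b = 2 * rng.random(M) - 1

# One perceptron at a time: activation(W_j X + b_j) for each j
per_unit = np.array([sigmoid(np.dot(W[j], X) + b[j]) for j in range(M)])

# All perceptrons at once: activation(WX + b)
vectorized = sigmoid(W @ X + b)

print(vectorized.shape)                   # (8,)
print(np.allclose(per_unit, vectorized))  # True
```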

This time, we will let the layer initialize its weights and biases itself. We just want to make sure it initializes the right number of them.
Edit the Layer class of the layer.py file.
Use the refresher about Object-Oriented Programming (OOP) at the top of the notebook.

In [5]:
## TODO: edit the Layer methods __init__ and forward to compute the correct activation
from layer import Layer
l = Layer(input_size, nb_perceptrons, activation=sigmoid)
l.forward(inputs)

array([0.32327157, 0.18556123, 0.94712284, 0.13010543, 0.01774532,
       0.98100981, 0.82891384, 0.51758245])

Let's check whether we are computing the result at least as fast as before.

In [8]:
# Choose the number of inputs
input_size = 1024
nb_perceptrons = 64

# Load the inputs
inputs, weights, biases = get_inputs(input_size, nb_perceptrons)
print(inputs.shape, weights.shape, biases.shape)

(1024,) (64, 1024) (64,)


In [13]:
from datetime import datetime
l = NaiveLayer(weights, biases, activation=sigmoid)
print(f"{input_size}-unit naive layer")
stime = datetime.now()
l.forward(inputs)  # the operation being timed
print(f"Duration: {datetime.now() - stime}")

l = Layer(input_size, nb_perceptrons, activation=sigmoid)
print(f"\n{input_size}-unit layer")
stime = datetime.now()
l.forward(inputs)  # the operation being timed
print(f"Duration: {datetime.now() - stime}")

1024-unit naive layer
Duration: 0:00:00.000124

1024-unit layer
Duration: 0:00:00.000130
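
A single call timed with datetime is noisy at this scale, so a more robust comparison repeats each computation many times. The sketch below uses the standard timeit module on two hypothetical stand-in functions, forward_loop and forward_matrix, rather than on the NaiveLayer and Layer classes, so it can run on its own. The measured durations depend on the machine and are not shown here.

```python
import timeit

import numpy as np


def sigmoid(x):
    return 1. / (1. + np.exp(-x))


rng = np.random.default_rng(0)
N, M = 1024, 64
X = rng.random(N)
W = 2 * rng.random((M, N)) - 1
b = 2 * rng.random(M) - 1


def forward_loop():
    # Stand-in for the naive layer: one dot product per perceptron
    return np.array([sigmoid(np.dot(W[j], X) + b[j]) for j in range(M)])


def forward_matrix():
    # Stand-in for the vectorized layer: a single matrix-vector product
    return sigmoid(W @ X + b)


repeats = 200
loop_time = timeit.timeit(forward_loop, number=repeats)
matrix_time = timeit.timeit(forward_matrix, number=repeats)

print(f"loop:   {loop_time / repeats * 1e6:.1f} us per call")
print(f"matrix: {matrix_time / repeats * 1e6:.1f} us per call")
```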


Great! Now it's time to create a proper MLP with some depth.

![Deep MLP](http://cs231n.github.io/assets/nn1/neural_net2.jpeg "Deep MLP")
*Source: Stanford CS231n*

For L consecutive layers with individual forward functions $(f_l)_{l\ \in\ [1, L]}$, where the input is of size N and the last layer output is of size O, the MLP forwarding operation $F: \mathbb{R}^N \rightarrow \mathbb{R}^O$ will be:

$$\forall\ X\ \in\ \mathbb{R}^N, F(X) = (f_L \circ f_{L-1} \circ \cdots \circ f_1)(X) $$
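
The composition can be read as a simple loop in which the output of each layer becomes the input of the next one. The sketch below assumes each layer is just a pair of weights and biases; make_layer and mlp_forward are hypothetical helpers for illustration, not the MLP class of layer.py, and the sizes are kept small on purpose.

```python
import numpy as np


def sigmoid(x):
    return 1. / (1. + np.exp(-x))


def make_layer(n_in, n_out, rng):
    """Hypothetical helper: random weights and biases in [-1, 1)."""
    W = 2 * rng.random((n_out, n_in)) - 1
    b = 2 * rng.random(n_out) - 1
    return W, b


def mlp_forward(X, layers, activation):
    """Apply f_1, then f_2, ..., then f_L to the input."""
    out = X
    for W, b in layers:
        out = activation(W @ out + b)
    return out


rng = np.random.default_rng(0)
input_size = 16
layer_output_sizes = [8, 4, 2]

# The input size of each layer is the output size of the previous one
sizes = [input_size] + layer_output_sizes
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes[:-1], sizes[1:])]

out = mlp_forward(rng.random(input_size), layers, sigmoid)
print(out.shape)  # (2,)
```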

In [4]:
# Choose the number of inputs (feel free to change this)
input_size = 1024
inputs = np.random.rand(input_size)
# We'll define the number of units in each consecutive layer (feel free to change this)
layer_output_sizes = [256, 64, 64, 32, 8]

You can now implement your MLP by forwarding the input through each layer.

In [5]:
## TODO: edit the MLP methods __init__ and forward to compute the correct network output
from layer import MLP
mlp = MLP(input_size, layer_output_sizes, activation=sigmoid)
mlp.forward(inputs)

array([0.97895182, 0.2379899 , 0.87799318, 0.07823132, 0.90545582,
       0.8116613 , 0.01358233, 0.20985839])

Now this is really good. Do you see how close you are to having the building blocks of your own framework?
The missing part is the update of the network's parameters, with the well-known backpropagation.

But you're done for the day, congratulations!