<img src="./pics/DL.png" width=110 align="left" style="margin-right: 10px">

# Introduction to Deep Learning

## 02. Neurons

---

### What is a neuron?

<img src ="./pics/external/neuron.png" width="300" align="left"/>

##### Purpose

The basic building block of our nervous system is the **neuron**, or nerve cell. These cells are specialized to process and transmit information through the nervous system. A neuron is activated by electrical signals and communicates with other cells through **synapses**. For this purpose, neurons have different extensions.  

##### Main parts

- **dendrite**s: To receive information, neurons have branching extensions called dendrites. A neuron can be connected to many other cells through its dendrites.
- **axon**: A longer, cable-like extension used to transmit electrical surges (information) to other cells. Its length can be 100x that of the neuron's body. Many neurons can connect to a single axon.
- **soma**: The body of the neuron. It contains the **nucleus**.

##### Working mechanism

A simplified high level view of the inner workings of a neuron is the following:  

The **neuron is stimulated** through its dendrites by the connected neurons, via the synapses between the stimulating neuron's axon and the receiving neuron's dendrite.  
The cell absorbs the stimulation **until a threshold**. When it reaches that threshold, **it'll** start to **stimulate the connected cells** through its axon. A neuron that is actively stimulating other cells is called a "firing neuron".  
These stimulations are small electrical signals. As a **connection is used** more and more, it **gets stronger**, and the information / **knowledge is stored in** these **connection strengths**.


### What is an artificial neuron?

<img src ="./pics/ann/artificial_neuron.png" width="300" align="left"/>

##### Taxonomy

Artificial neural networks are supervised machine learning methods for classification and regression. They are inspired by the inner workings of the (human) brain. Their basic building blocks are simple execution units and the connections between them.  

##### Prediction

These execution units are called **neurons**, and their job is to compute the weighted sum of their inputs, then apply an output function. Because of this simple structure, a single [neuron](https://en.wikipedia.org/wiki/Perceptron) is only capable of solving linearly separable problems. Its mechanism is easily expressed by the following equation:
$$y = f(w \cdot x + b)$$
where $f$ is the applied activation function (also referred to as the output function, or output nonlinearity), representing the nonlinear physical properties of a biological nerve cell; $w$ is the vector of weights associated with the inputs, representing the strength of the connections between the cells; $x$ is the incoming data, representing the electrical signal values; and $b$ is the bias, representing a neuron's threshold.
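
As a quick numeric illustration of the equation above, the sketch below evaluates a single neuron on the four combinations of two binary inputs. The weights, the bias, and the `step` activation are hand-picked assumptions for this illustration only; they are not values used elsewhere in this chapter.

```python
import numpy as np

def step(z):
    """Hypothetical activation: 1 where z >= 0, 0 otherwise."""
    return (z >= 0).astype(int)

# Hand-picked (assumed) parameters, chosen only for this illustration.
w = np.array([0.5, 0.5])
b = -0.7

# Every combination of two binary inputs, one per row.
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# y = f(w . x + b), evaluated for all rows at once.
y = step(x @ w + b)
print(y)
```

With these particular values the neuron outputs 1 only for the input `[1, 1]`, because that is the only row whose weighted sum exceeds 0.7.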

##### Training

The learning process is simply adjusting the input weights based on the observed data. First a **forward step** is executed, which **is** a **prediction** based on the incoming inputs and the current weight and bias values, using the equation above. The output is compared to the associated label and the error is computed. The goal is to minimize this *error* by modifying the weights. The **weight update is** often referred to as the **backward step**.  
Depending on the activation function, many different weight update rules are available. There are many ways to **create** such **update rules**, but the most common is **using stochastic gradient descent methods**.

### Let's try it out!

Implement a simple neuron capable of predicting based on its initial weights.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from helpers import plot_results_with_hyperplane

#### Example Dataset

Let's use a simple dataset, the truth table of the `AND` logical operation.

A | B | output |
--|---|--------|
0 | 0 | 0      |
0 | 1 | 0      |
1 | 0 | 0      |
1 | 1 | 1      |

In [None]:
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
labels = np.array([0, 0, 0, 1])

plt.scatter(x=inputs[:, 0], y=inputs[:, 1], c=labels)

#### Example activation function

Let's use a really simple binary step function, which we'll call the binary absolute function:

$$
\operatorname{abs_{bin}}(x) := {\begin{cases}
                                    1 & {\text{if }} x \geq 0, \\
                                    0 & {\text{otherwise}}.
                                \end{cases}}
$$

In [None]:
def binary_abs(x):
    """Returns 1 if x >= 0; 0 otherwise."""
    return (x >= 0).astype('int')

Quick check that it really does what it's supposed to do:

In [None]:
for value in np.array([-0.1, 0, 12]):
    print(f'binary_abs({value}) = {binary_abs(value)}')

In [None]:
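class Neuron:
    
    def __init__(self, n):
        # Random initial parameters: a negative bias and non-negative weights
        self.bias = -np.random.random()
        self.weights = np.random.rand(1, n)
        
    def predict(self, inputs):
        # Weighted sum of each input row, with the bias added exactly once
        # (adding it before the summation would count it n times)
        sums_of_inputs = np.sum(inputs * self.weights, axis=1) + self.bias
        result = binary_abs(sums_of_inputs)
        return result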

In [None]:
neuron = Neuron(2)
preds = neuron.predict(inputs)

plot_results_with_hyperplane(inputs, labels, neuron);

---

### Rosenblatt Perceptron model

<img src ="./pics/ann/neuron_explanation.png" width="500"/>

One of the first neuron models was the Rosenblatt Perceptron, named after its creator, the psychologist Frank Rosenblatt, who experimented with artificial neural networks and proposed the model in 1958. It is a linear binary classifier. 

#### Prediction

The selected activation function was the threshold function (also referred to as $\operatorname{sign}$):

$$
\operatorname{sign}(x):={\begin{cases}
                             1 & {\text{if }} x \geq 0, \\
                             -1 & {\text{otherwise}}.
                         \end{cases}}
$$

So the prediction can be written as:

$$\operatorname{f}(x) = \operatorname{sign}(w \cdot x + b)$$

#### Training

To compute the weight update function, we are going to use stochastic gradient descent. The gradient of a differentiable function points in the direction of the steepest increase from any given point. Gradient descent methods use this property: pick an initial point, and repeatedly follow the negative gradient to find a local minimum.
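
To make the idea concrete, here is a minimal sketch of plain (non-stochastic) gradient descent on a one-dimensional function. The function $f(x) = (x - 3)^2$, the starting point, the step size, and the number of steps are all assumptions picked for illustration; they are unrelated to the perceptron itself.

```python
def f(x):
    """Hypothetical function to minimize; its minimum is at x = 3."""
    return (x - 3.0) ** 2

def grad_f(x):
    """Derivative of f."""
    return 2.0 * (x - 3.0)

x = 0.0       # assumed initial point
alpha = 0.1   # assumed step size (learning rate)

for _ in range(100):
    # Step against the gradient, i.e. in the direction of steepest decrease.
    x -= alpha * grad_f(x)

print(x, f(x))
```

Each step shrinks the distance to the minimum by a constant factor here, so `x` ends up very close to 3. The stochastic variant follows the same recipe, but estimates the gradient from one sample (or a small batch) at a time.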

Our objective is to minimize the error, which can be expressed as follows:
$$e = t - y$$
where $t$ is the expected label and $y$ is the outcome of the model. Based on the value range of $t$ and $y$, there are 3 possible values: 
- **0**, if $t$ and $y$ match, 
- **+2** if the prediction was -1 instead of 1,
- and **-2** if it predicted 1 when the expected value was -1.

To simplify the computation of the [derivative](https://en.wikipedia.org/wiki/Sign_function), we'll introduce the following inequalities:
$$t \cdot (w \cdot x + b) < 0$$
if there is an error (the label and the weighted sum have opposite signs), 
$$t \cdot (w \cdot x + b) > 0$$
otherwise.

Using these we'll write the function to minimize as:

$$\min \operatorname{L}(w, b) = - \sum_{i | x_i \in E} t_i \cdot (w \cdot x_i + b)$$

Where $E$ is the set of incorrectly classified cases. From here we can get the two update rules by taking the partial derivatives:

$$\begin{align}
    \frac{\partial \operatorname{L}(w, b)}{\partial w} & = - \sum_{i | x_i \in E} t_i \cdot x_i \\
    \frac{\partial \operatorname{L}(w, b)}{\partial b} & = - \sum_{i | x_i \in E} t_i
\end{align}$$
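
The loss and its two partial derivatives can be checked numerically. The sketch below uses the `AND` truth table with labels -1 and 1 and a hand-picked parameter setting; the values of `w` and `b` are assumptions for illustration, and the set $E$ is found by comparing each prediction with its label.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([-1, -1, -1, 1])

# Assumed parameter values, chosen so that some points are misclassified.
w = np.array([0.5, 0.5])
b = 0.2

scores = X @ w + b                  # w . x_i + b for every sample
y = np.where(scores >= 0, 1, -1)    # threshold (sign) activation
E = y != t                          # boolean mask of misclassified samples

# L(w, b) = - sum over E of t_i * (w . x_i + b)
loss = -np.sum(t[E] * scores[E])

# dL/dw = - sum over E of t_i * x_i ;  dL/db = - sum over E of t_i
grad_w = -np.sum(t[E][:, None] * X[E], axis=0)
grad_b = -np.sum(t[E])

print(loss, grad_w, grad_b)
```

Only the misclassified samples contribute, so a model that classifies everything correctly has zero loss and zero gradient.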

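Using these results, and taking one misclassified sample $(x_i, t_i)$ at a time, our update functions are:

$$\begin{align}
    w_{t+1} & = w_{t} + \alpha \cdot t_{i} \cdot x_{i} \\
    b_{t+1} & = b_{t} + \alpha \cdot t_{i}
\end{align}$$

Where $\alpha$ is the learning rate, and the subscripts $t$ and $t+1$ on $w$ and $b$ denote the update step (not the label $t_i$).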
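
A single update step can be worked through by hand. In the sketch below, the starting weights, bias, learning rate, and the chosen sample are all assumed values for illustration; the sample is deliberately one that the current parameters misclassify.

```python
import numpy as np

alpha = 0.1                  # assumed learning rate
w = np.array([0.5, 0.5])     # assumed current weights
b = 0.2                      # assumed current bias

x_i = np.array([0, 1])       # one training sample
t_i = -1                     # its expected label

# Forward step: threshold the weighted sum.
y_i = 1 if np.dot(w, x_i) + b >= 0 else -1

# Backward step: update only when the sample is misclassified.
if y_i != t_i:
    w = w + alpha * t_i * x_i
    b = b + alpha * t_i

print(w, b)
```

The weighted sum is 0.7, so the prediction is 1 while the label is -1. The update therefore lowers the second weight to 0.4 and the bias to 0.1, while the first weight stays unchanged because its input was 0.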

### Let's try it out!

Implement the Rosenblatt Perceptron model.


#### Example Dataset

We are going to use the same dataset with a slight change: the outputs should be -1 instead of 0 in order to work with the Perceptron model.

A | B | output |
--|---|--------|
0 | 0 | -1     |
0 | 1 | -1     |
1 | 0 | -1     |
1 | 1 |  1     |

In [None]:
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
labels = np.array([-1, -1, -1, 1])

plt.scatter(x=inputs[:, 0], y=inputs[:, 1], c=labels)

Implement the activation function:

In [None]:
def sign(x):
    """Returns 1 if x >= 0; -1 otherwise."""
    return (x >= 0).astype('int') - (x < 0).astype('int')

Sanity check:

In [None]:
for value in np.array([-0.1, 0, 12]):
    print(f'sign({value}) = {sign(value)}')

In [None]:
class Rosenblatt:
    
    def __init__(self, n, alpha=0.1, epochs=10):
        """Implement the Rosenblatt Perceptron model.
        
        Parameters:
        -----------
        - n : int
          Number of inputs
        - alpha : float
          Learning rate
        - epochs : int
          Number of training rounds
        """
        self.alpha = alpha
        self.epochs = epochs
        self.weights = np.random.rand(n)
        self.bias = np.random.random()
        
    def predict(self, X):
        X_w = np.dot(X, self.weights) + self.bias
        return sign(X_w)
    
    def train_one_step(self, x, y):
        y_hat = self.predict(x)
        direction = y * (y != y_hat)  # equals 0 if the prediction is correct 
        self.bias += self.alpha * direction
        self.weights += self.alpha * direction * x    
        
    def train(self, X, y):
        nrows = X.shape[0]
        for _ in range(self.epochs):
            # One epoch: update the model on every sample, one at a time
            for i in range(nrows):
                self.train_one_step(X[i], y[i])

In [None]:
perceptron = Rosenblatt(n=2)
preds = perceptron.predict(inputs)

plot_results_with_hyperplane(inputs, labels, perceptron, 'perceptron');

Let's see how the model improves after training:

In [None]:
perceptron.train(inputs, labels)
preds = perceptron.predict(inputs)

plot_results_with_hyperplane(inputs, labels, perceptron, 'perceptron');

In [None]:
perceptron.weights, perceptron.bias

#### Scikit-learn model

Scikit-learn also implements the perceptron model. Let's try that implementation on the same problem:

In [None]:
from sklearn.linear_model import Perceptron

In [None]:
perceptron = Perceptron(verbose=2, random_state=42).fit(inputs, labels)

plot_results_with_hyperplane(inputs, labels, perceptron, 'perceptron');

---

### Limitations

The perceptron model (and a single neuron really) is a linear model with known limitations. Consider the following dataset - the truth table of the `XOR` logical operation:

A | B | output |
--|---|--------|
0 | 0 | -1     |
0 | 1 |  1     |
1 | 0 |  1     |
1 | 1 | -1     |

In [None]:
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
labels = np.array([-1, 1, 1, -1])

plt.scatter(x=inputs[:, 0], y=inputs[:, 1], c=labels)

This is a **linearly inseparable case**: we cannot draw a single straight line which separates the classes without errors. Let's see what happens if we try to fit a model on it:

In [None]:
perceptron = Perceptron(random_state=42).fit(inputs, labels)
plot_results_with_hyperplane(inputs, labels, perceptron, 'perceptron');

In [None]:
perceptron.coef_, perceptron.intercept_

In [None]:
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(labels, perceptron.predict(inputs))
sns.heatmap(conf_mat)

We can see that it failed at this problem, just as we expected.
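
The failure does not depend on the training procedure. As a rough illustration, the sketch below tries every combination of weights and bias on a coarse grid and checks whether any of them classifies all four points correctly. The helper `separable_on_grid` and the grid itself are hypothetical choices for this example, and a grid search is only an illustration, not a proof.

```python
import itertools
import numpy as np

def sign(x):
    """Threshold activation: 1 where x >= 0, -1 otherwise."""
    return np.where(x >= 0, 1, -1)

def separable_on_grid(points, labels, grid):
    """Hypothetical helper: True if some (w1, w2, b) on the grid fits all labels."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        preds = sign(points @ np.array([w1, w2]) + b)
        if np.array_equal(preds, labels):
            return True
    return False

points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
and_labels = np.array([-1, -1, -1, 1])
xor_labels = np.array([-1, 1, 1, -1])

# Assumed candidate values: -2.0, -1.5, ..., 2.0
grid = np.linspace(-2, 2, 9)

print('AND separable on grid:', separable_on_grid(points, and_labels, grid))
print('XOR separable on grid:', separable_on_grid(points, xor_labels, grid))
```

For `AND` a separating line is found (for example $w = (1, 1)$, $b = -1.5$), while for `XOR` no candidate works. A finer grid would not help: the four constraints that `XOR` places on $w_1$, $w_2$ and $b$ contradict each other.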

### Exercise

Consider the following Rosenblatt perceptron:

<img src ="./pics/ann/neuron_exercise.png" width="400" align="left"/>

**1. Compute** the predictions, the values of the weights, and the bias after one pass over the data, given the following inputs (where $d$ is the expected label):

|$x_1$ | $x_2$ | $x_3$ | $d$ |
|------|-------|-------|-----|
| 1    | 0     | 1     | 1   |
| 0    | 1     | 0     | -1  |

And the initial weights are:

|$w_1$ | $w_2$ | $w_3$ | $b$  |
|------|-------|-------|------|
| 0.3  | 0.5   | -0.2  | -1.0 |

**2. Validate** your answer using the previous Rosenblatt class!

In [None]:
# TODO



### Good job!

In the next chapter we'll look into neural networks and examine what happens when more than one neuron is present in the network.