# Introduction to deep learning

We can recall the perceptron algorithm from the lessons in `..\1_supervised_learning\02 Classification.ipynb` and build on it from there. As a reminder:
"""
More generally, the steps can be represented by a perceptron diagram:
 * inputs
 * weights (linear function coefficients, + bias)
 * linear function
 * step function
 * output

<img src="../1_supervised_learning/images/2.03_perceptron_diagram.png" alt="Drawing" style="width: 700px;"/>

The way to update the function is as follows:
 * initialise the function coefficients
 * evaluate whether each point is in the correct half-space
 * consider only the wrongly predicted points, and loop through them:
     * if the point is positive, but predicted negative, the weights are increased
     * if the point is negative, but predicted positive, the weights are decreased

Formula to update weights:
\begin{equation*}
\begin{aligned}
W &= W \pm \alpha X \\
b &= b \pm \alpha
\end{aligned}
\end{equation*}
where $\pm$ is applied according to how the point is misclassified, and $\alpha$ is the learning rate
"""

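The update rule recalled above can be sketched in Python as follows. This is a minimal illustration rather than the code from the earlier notebook: the function name `perceptron_step`, the label convention (1 for positive, 0 for negative), and the choice to predict positive when the linear function is non-negative are all assumptions made here. It also re-evaluates each point with the current weights while looping, which is one common variant of the procedure.

```python
import numpy as np

def perceptron_step(X, y, W, b, alpha=0.01):
    """One pass of the perceptron update over all points (hypothetical helper).

    X: (n_points, n_features) array, y: labels in {0, 1} (assumed convention).
    W, b, alpha: weights, bias, and learning rate.
    """
    W = np.asarray(W, dtype=float).copy()
    for x_i, y_i in zip(X, y):
        # Step function: predict 1 when the linear function is non-negative.
        y_hat = 1 if np.dot(W, x_i) + b >= 0 else 0
        if y_i == 1 and y_hat == 0:
            # Positive point predicted negative: increase weights and bias.
            W += alpha * x_i
            b += alpha
        elif y_i == 0 and y_hat == 1:
            # Negative point predicted positive: decrease weights and bias.
            W -= alpha * x_i
            b -= alpha
    return W, b
```

Correctly classified points leave `W` and `b` unchanged, so repeated passes only react to the remaining mistakes.
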
## Error Functions

Generally, error functions must be:
 * Differentiable
 * Continuous

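These properties let us use gradient descent on the error function to train our models, minimising the error: a small change in the weights then produces a correspondingly small change in the error, which tells us in which direction to adjust the weights.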

<img src=".\images\0.01_disc-cont.PNG" style="width: 600px;" />

This can be done by moving from discrete predictions to continuous predictions (for example, from {yes, no} to {67.4% yes}). To do this, we apply a continuous activation function, such as the **sigmoid function**, to our score.
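As a small sketch of the difference, the snippet below applies a step function and a sigmoid to the same scores. The score values are made up for illustration, and treating a score of exactly 0 as "yes" is an assumption.

```python
import numpy as np

# Illustrative scores only; they do not come from a trained model.
scores = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

# Discrete prediction: the step function returns only 0 (no) or 1 (yes).
discrete = (scores >= 0).astype(int)

# Continuous prediction: the sigmoid returns a probability of "yes" in (0, 1).
continuous = 1.0 / (1.0 + np.exp(-scores))

for s, d, c in zip(scores, discrete, continuous):
    print(f"score={s:+.1f}  step={d}  sigmoid={c:.3f}")
```

The step output jumps from 0 to 1 at the boundary, while the sigmoid output changes smoothly with the score, which is what gradient descent needs.
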

The overall process is:

#### 1 calculate the score for a point

\begin{equation*}
\newcommand{\vect}[1]{\boldsymbol{#1}}
score = \vect{Wx}+b
\end{equation*}

where:
 * $\vect{W}$ is the weight matrix
 * $\vect{x}$ is the feature vector
 * $b$ is the bias

#### 2 calculate the probability that the point is classified as positive by applying the activation function

\begin{equation*}
\sigma(x) = \frac{1}{1+e^{-x}}
\end{equation*}


where:
 * $x$ is the score calculated above (proportional to the signed distance of the point from the separation function)
 

<img src=".\images\0.03_sigmoid.PNG" style="width: 600px;" />
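The two steps above can be sketched together in Python. The helper name `predict_probability` and the values of `W`, `b` and `x` are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def predict_probability(W, x, b):
    """Step 1: score = Wx + b. Step 2: probability = sigmoid(score)."""
    score = np.dot(W, x) + b
    return sigmoid(score)

# Hypothetical weights, bias, and point, chosen only for illustration.
W = np.array([2.0, -1.0])
b = -0.5
x = np.array([1.0, 1.0])

# score = 2.0 - 1.0 - 0.5 = 0.5, so the probability is sigmoid(0.5), about 0.62
print(predict_probability(W, x, b))
```

A score of 0 (a point on the boundary) gives a probability of exactly 0.5, and large positive or negative scores approach 1 or 0.
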

#### Scores and Predictions

Therefore, our perceptron will calculate scores and output predictions as follows:

<img src=".\images\0.04_sigmoid-pred.PNG" style="width: 600px;" />

<img src=".\images\0.05_cont-pred.PNG" style="width: 600px;" />

### Multi-class activation function: Softmax

Softmax generalises the sigmoid to more than two classes.

\begin{equation*}
P(class_a) = \frac{e^{Z_a}}{\sum_{i} e^{Z_i}}
\end{equation*}

where:
 * $Z_a$ is the score of the point for class $a$
 * $Z_i$ is the score of the point for class $i$, with the sum running over all classes
 
<img src=".\images\0.06_softmax.PNG" style="width: 600px;"/>
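To see how the softmax relates to the sigmoid, consider the special case of only two classes, with scores written here as $Z_1$ and $Z_2$ (names chosen for this illustration). Dividing the numerator and denominator by $e^{Z_1}$ gives:

\begin{equation*}
P(class_1) = \frac{e^{Z_1}}{e^{Z_1}+e^{Z_2}} = \frac{1}{1+e^{-(Z_1-Z_2)}} = \sigma(Z_1-Z_2)
\end{equation*}

So with two classes, the softmax is the sigmoid applied to the difference between the two scores.
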

In Python, it is:

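import numpy as np

# A function that takes as input a list of numbers (scores), and returns
# the list of values given by the softmax function.
def softmax(L):
    scores = np.asarray(L, dtype=float)
    # Subtracting the maximum avoids overflow in np.exp for large scores
    # and leaves the result mathematically unchanged.
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

# example
softmax([4,12,0])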

array([3.35348071e-04, 9.99658510e-01, 6.14211417e-06])

Finally, we output the probability of the data point belonging to each class.

<img src=".\images\0.07_softmax-out.PNG" style="width: 600px;"/>

### Maximum Likelihood

Suppose we have two models and want to know which one is better. We calculate the scores for four points, and then their probabilities by applying the **Softmax** function. For each point, we pick the probability of it belonging to its actual class. We multiply these probabilities together, and the product is a measure we can use to compare models: the higher it is, the better the model explains the data. The model on the right returns a higher probability for the training data.

<img src=".\images\0.08_max-likelyhood.PNG" style="width: 600px;" />
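The comparison can be sketched in a few lines of Python. The probabilities below are made-up values, not the ones in the figure, and the names `model_a` and `model_b` are hypothetical; each entry is the probability a model assigns to the true class of one of four training points.

```python
import numpy as np

# Hypothetical probabilities assigned to the TRUE class of four training
# points by two models (illustrative values, not those in the figure).
model_a = np.array([0.6, 0.2, 0.1, 0.7])
model_b = np.array([0.7, 0.9, 0.8, 0.6])

# The likelihood of the data under each model is the product of these values.
likelihood_a = np.prod(model_a)
likelihood_b = np.prod(model_b)

print(f"model A: {likelihood_a:.4f}")
print(f"model B: {likelihood_b:.4f}")
print("better model:", "A" if likelihood_a > likelihood_b else "B")
```

With these numbers, model B has the larger product, so it would be preferred under this criterion.
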

