# Introduction to deep learning

Recall the perceptron algorithm lessons in `..\1_supervised_learning\02 Classification.ipynb`; we move forward from there. As a reminder:
"""
More generally, the steps can be represented by a perceptron diagram:
 * inputs
 * weights (linear function coefficients, + bias)
 * linear function
 * step function
 * output

<img src="../1_supervised_learning/images/2.03_perceptron_diagram.png" alt="Drawing" style="width: 700px;"/>

The way to update the function is as follows:
 * initialise function coefficients
 * evaluate if points are in the correct semi-space
 * consider only the wrongly predicted points, and loop through them:
     * if the point is positive, but predicted negative, the weights are increased
     * if the point is negative, but predicted positive, the weights are decreased

Formula to update weights:
\begin{equation*}
\begin{aligned}
W &= W \pm \alpha X \\
b &= b \pm \alpha
\end{aligned}
\end{equation*}
where $\pm$ is applied according to how the point is misclassified
"""

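The update rule recalled above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code from the earlier notebook: the function name `perceptron_step`, the `{0, 1}` label encoding, the learning rate and the toy data are all assumptions made for the example.

```python
import numpy as np

def perceptron_step(X, y, W, b, alpha=0.01):
    """One pass of the perceptron update over all points (hypothetical helper).

    X: (n_points, n_features) array, y: labels in {0, 1},
    W: (n_features,) weights, b: scalar bias, alpha: learning rate.
    """
    W = np.asarray(W, dtype=float).copy()
    b = float(b)
    for x_i, y_i in zip(np.asarray(X, dtype=float), y):
        # linear function followed by the step function
        y_hat = 1 if np.dot(W, x_i) + b >= 0 else 0
        if y_i == 1 and y_hat == 0:
            # positive point predicted negative: increase the weights
            W += alpha * x_i
            b += alpha
        elif y_i == 0 and y_hat == 1:
            # negative point predicted positive: decrease the weights
            W -= alpha * x_i
            b -= alpha
    return W, b

# Made-up toy data: one positive and one negative point, both misclassified at first
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1, 0])
W, b = perceptron_step(X, y, W=np.array([-1.0, -1.0]), b=0.0, alpha=0.1)
```

Correctly classified points leave `W` and `b` untouched, so repeated passes only move the line in response to errors.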
## Error Functions

Generally, error functions must be:
 * Differentiable
 * Continuous

This is to be able to use gradient descent on the error functions to train our models, minimising the error.

<img src=".\images\0.01_disc-cont.PNG" style="width: 600px;" />

This can be done by moving from discrete predictions to continuous predictions (from {yes,no} to {67.4% yes}). To do this, we apply a continuous activation function (e.g. the **sigmoid function**) to our score.

The overall process is:

#### 1 calculate the score for a point

\begin{equation*}
\newcommand{\vect}[1]{\boldsymbol{#1}}
score = \vect{Wx}+b\\
\end{equation*}

where:
 * $\vect{W}$ is the weight matrix
 * $\vect{x}$ is the feature vector
 * $b$ is the bias

#### 2 calculate the probability of the point being classified as positive by applying the activation function

\begin{equation*}
\sigma(x) = \frac{1}{1+e^{-x}}
\end{equation*}


where:
 * $x$ is the score calculated above (distance from the separation function)
 

<img src=".\images\0.03_sigmoid.PNG" style="width: 600px;" />
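The two steps above (score, then sigmoid) can be sketched as follows. The weights, bias and input points are made-up values chosen only for illustration, and `predict_proba` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(x):
    """Continuous activation: maps any score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def predict_proba(W, x, b):
    """Step 1: linear score Wx + b. Step 2: sigmoid of that score."""
    score = np.dot(W, x) + b
    return sigmoid(score)

# Hypothetical model parameters
W = np.array([2.0, -1.0])
b = 0.5

# A point lying exactly on the boundary has score 0 and probability 0.5
p_boundary = predict_proba(W, np.array([0.0, 0.5]), b)

# A point far on the positive side has a probability close to 1
p_far = predict_proba(W, np.array([3.0, 0.0]), b)
```

The further a point lies from the boundary, the closer its probability gets to 0 or 1, which is what makes the prediction continuous rather than a hard yes/no.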

#### Scores and Predictions

Therefore, our perceptron will calculate scores and output predictions as follows:

<img src=".\images\0.04_sigmoid-pred.PNG" style="width: 600px;" />

<img src=".\images\0.05_cont-pred.PNG" style="width: 600px;" />

### Multi-class activation function: Softmax

Softmax generalises the sigmoid to more than two classes.

\begin{equation*}
P(class_a) = \frac{e^{Z_a}}{\sum_{i}{e^{Z_i}}}
\end{equation*}

where:
 * $Z_a$ is the score of the point for class $a$
 * $Z_i$ is the score of the point for class $i$, with the sum running over all classes
 
<img src=".\images\0.06_softmax.PNG" style="width: 600px;"/>

In Python, it is:

In [5]:
import numpy as np

# Write a function that takes as input a list of numbers, and returns
# the list of values given by the softmax function.
def softmax(L):
    total = sum(map(np.exp,L))
    return np.array(list(map(lambda x:np.exp(x)/total,L)))

# example
softmax([4,12,0])

array([3.35348071e-04, 9.99658510e-01, 6.14211417e-06])
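As a quick sanity check of the link between softmax and sigmoid, the sketch below compares a two-class softmax with scores $(z, 0)$ against the sigmoid of $z$. The score value is arbitrary and the vectorised `softmax` here is an assumed variant written for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(scores):
    """Vectorised softmax over a 1-D list or array of scores."""
    scores = np.asarray(scores, dtype=float)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

z = 1.5  # arbitrary score for the positive class

# With two classes and scores (z, 0):
# e^z / (e^z + e^0) = 1 / (1 + e^-z), i.e. the sigmoid of z
two_class = softmax([z, 0.0])
same_as_sigmoid = np.isclose(two_class[0], sigmoid(z))
```

Here `same_as_sigmoid` evaluates to `True`, and the two softmax outputs sum to 1.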

Finally, we output the probability of a data point belonging to each class.

<img src=".\images\0.07_softmax-out.PNG" style="width: 600px;"/>

### Maximum Likelihood and Cross-Entropy

Suppose we have two models and want to know which one is better. We calculate the scores for four points, then their probabilities by applying the **Softmax** function. For each point, we pick the probability of it belonging to its actual class. We multiply these probabilities, and the product is a measure we can use to compare models. The model on the right returns a higher probability for the training data.

<img src=".\images\0.08_max-likelyhood.PNG" style="width: 600px;" />

To maximise the probability (and minimise the error), we want to see how maximum likelihood and error are related. First, we want to calculate the probability and maximise it. However, the product of many (thousands of) probabilities can be tiny, which is numerically very bad. So we work with logarithms instead, which turn the product into a sum, and take the negative sum of the logarithms:

\begin{equation*}
-\sum_{i}{\ln(P_i)}
\end{equation*}

where:
 * $P_i$ is the likelihood of a single event happening given its label.

Since $0 < P_i \le 1$, we have $\ln(P_i) \le 0$, hence the minus sign in front of the sum. Therefore, we move from maximising the probability to minimising the cross-entropy. We can also see the cross-entropy as a way to calculate each point's error:

<img src=".\images\0.09_cross-entropy.PNG" style="width: 600px;" />
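To see why the product of many probabilities is a problem in practice, the sketch below uses made-up data: 2000 points, each assigned probability 0.5 for its true class. The product underflows to zero in floating point, while the negative sum of logarithms stays a perfectly usable number.

```python
import numpy as np

# Hypothetical probabilities: 2000 points, each with P_i = 0.5
probs = np.full(2000, 0.5)

# 0.5 ** 2000 is far below the smallest representable float64, so it underflows
product = np.prod(probs)

# The negative log-likelihood is just 2000 * ln(2), roughly 1386.29
neg_log_sum = -np.sum(np.log(probs))
```

Two models that both give a product of `0.0` cannot be compared, whereas their negative log sums still can.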

So far, we took $P_i$ as the likelihood of an event happening given its label. We can introduce the label as a variable by expanding the formula as:

\begin{equation*}
\mbox{Cross-Entropy} = -\sum_{i}{\left[ y_i \ln(P_i) + (1-y_i) \ln(1-P_i) \right]}
\end{equation*}

where:
 * $P_i$ is the probability that the i-th event happens
 * $y_i$ is the label of the i-th event (1 if it happens, 0 otherwise)

<img src=".\images\0.10_cross-entropy.PNG" style="width: 600px;" />

<img src=".\images\0.11_cross-entropy.PNG" style="width: 600px;" />

Here we run an example for the most likely combination (line 2):

In [5]:
import numpy as np

y = [1,1,0]
p = [0.8,0.7,0.1]

# Write a function that takes as input two lists Y, P,
# and returns the float corresponding to their cross-entropy.
def cross_entropy(Y, P):
    r = map(lambda y,p: y*np.log(p)+(1-y)*np.log(1.0-p),Y,P)
    return -sum(r)

round(cross_entropy(y,p),2)

0.69

### Multi-class Cross-Entropy

Now assume we have $n=3$ random variables (the doors, which play the role of the data points), each taking one of $m=3$ categories, $\{duck,beaver,walrus\}$, with different probabilities:

<img src=".\images\0.12_multi-cross-entropy.PNG" style="width: 600px;" />

So we have the probability matrix:

\begin{equation*}
\underbrace{\begin{bmatrix}
P_{1,1} &   P_{1,2} & \dots     &   P_{1,m} \\
P_{2,1} &   P_{2,2} & \dots     &   P_{2,m} \\
\vdots  &  \vdots   &   \vdots  &   \vdots  \\   
P_{n,1} &   P_{n,2} & \dots     &   P_{n,m} \\
            \end{bmatrix}
            }_{\mathbf{P}}
            \ ,
\underbrace{\begin{bmatrix}
y_{1,1} &   y_{1,2} & \dots     &   y_{1,m} \\
y_{2,1} &   y_{2,2} & \dots     &   y_{2,m} \\
\vdots  &  \vdots   &   \vdots  &   \vdots  \\   
y_{n,1} &   y_{n,2} & \dots     &   y_{n,m} \\
            \end{bmatrix}
            }_{\mathbf{y}}
\end{equation*}

where:
 * $\mathbf{P}$ is the probability matrix for n random variables and m categories
 * $\mathbf{y}$ is the label matrix, with $y_{i,j}=1$ if the i-th random variable takes the j-th category and 0 otherwise

\begin{equation*}
\mbox{Cross-Entropy} = -\sum_{i=1}^{n}\sum_{j=1}^{m} y_{i,j} \ln(P_{i,j})
\end{equation*}

----

The example above is coded below:

In [12]:
import numpy as np

y = [[1,0,0],
     [0,0,0],
     [0,1,1]]
p = [[0.7,0.3,0.1],
     [0.2,0.4,0.5],
     [0.1,0.3,0.4]]

# to np array
y = np.array(y)
p = np.array(p)

# Takes two arrays Y, P of the same shape and returns
# the float corresponding to their cross-entropy.
def cross_entropy(Y, P):
    # element-wise y * ln(p), summed over every entry of the matrices
    return -np.sum(Y * np.log(P))

cross_entropy(y, p)

2.4769384801388235

By convention, we take the average cross-entropy as the error function.

\begin{equation*}
\mbox{Cross-Entropy} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} y_{i,j} \ln(P_{i,j})
\end{equation*}
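As a rough sketch of the averaged formula, the snippet below reuses the same `y` and `p` matrices as the example cell above and divides the total by the number of random variables. The function name `mean_cross_entropy` is an assumption made for this example.

```python
import numpy as np

# Same label and probability matrices as in the example above
y = np.array([[1, 0, 0],
              [0, 0, 0],
              [0, 1, 1]])
p = np.array([[0.7, 0.3, 0.1],
              [0.2, 0.4, 0.5],
              [0.1, 0.3, 0.4]])

def mean_cross_entropy(Y, P, n):
    """Total cross-entropy divided by the number of random variables n."""
    return -np.sum(Y * np.log(P)) / n

n = 3  # the three doors in the example
avg = mean_cross_entropy(y, p, n)  # roughly 0.826, i.e. 2.4769 / 3
```

Averaging makes the error comparable across data sets of different sizes, since the total otherwise grows with the number of points.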

Now, we know that $P_{i,j} = \hat{y}_{i,j}$:

\begin{equation*}
\mbox{Cross-Entropy} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} y_{i,j} \ln(\hat{y}_{i,j})
\end{equation*}

And $\hat{y} = \sigma(\vect{Wx}+b)$

\begin{equation*}
\mbox{Cross-Entropy} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} y_{i,j} \ln(\sigma(\vect{w}_{j} \cdot \vect{x}_{i}+b_{j}))
\end{equation*}

where:
 * $n$ is the number of random variables - i.e. of samples
 * $m$ is the number of categories
 * $\vect{w}_{j}$ is the weight vector (the row of $\vect{W}$) for the j-th category
 * $\vect{x}_{i}$ is the feature vector of the i-th sample
 * $b_{j}$ is the bias for the j-th category


---
#### Entropy and Cross-Entropy

**The entropy of a probability distribution is as follows:**

$-\sum_{i}{P(i) \log P(i)}$

We assume that we know the probability P for each i. The term i indicates a discrete event that could mean different things depending on the scenario you are dealing with.

**For continuous variables, it can be written using the integral form:**

$-\int{P(x) \log P(x)\,dx}$

Here, x is a continuous variable, and P(x) is the probability density function.

In both discrete and continuous variable cases, we are calculating the expectation (average) of the negative log probability which is the theoretical minimum encoding size of the information from the event x.

So, the above formula can be re-written in the expectation form as follows:
$E_{x \sim P}[-\log P(x)]$

where:
 * $x \sim P$ means that we calculate the expectation with the probability distribution P.

**In short, the entropy tells us the theoretical minimum average encoding size for events that follow a particular probability distribution.**

*Cross-entropy ≥ Entropy*

Commonly, the cross-entropy is expressed using H as follows:
$H(P, Q) = E_{x \sim P}[-\log Q(x)]$

H(P, Q) means that we calculate the expectation using P and the encoding size using Q. As such, H(P, Q) and H(Q, P) are not necessarily the same, except when Q = P, in which case H(P, Q) = H(P, P) = H(P) and it becomes the entropy itself.

This point is subtle but essential. For the expectation, we should use the true probability P as that tells the distribution of events. For the encoding size, we should use Q as that is used to encode messages.

Since the entropy is the theoretical minimum average size, the cross-entropy is always greater than or equal to the entropy.

In other words, if our estimate is perfect, Q = P and, hence, H(P, Q) = H(P). Otherwise, H(P, Q) > H(P).

The cross-entropy compares the model's prediction with the label, which is the true probability distribution. The cross-entropy goes down as the prediction gets more and more accurate. It becomes zero if the prediction is perfect (for a one-hot label, whose entropy is zero). As such, the cross-entropy can be used as a loss function to train a classification model.
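The relationship between entropy and cross-entropy can be checked numerically. The sketch below uses made-up distributions `P` (true) and `Q` (estimate); the helper names `entropy` and `cross_entropy_dist` are assumptions for this example, and terms with $P(i) = 0$ are skipped because they contribute nothing to the expectation.

```python
import numpy as np

def entropy(P):
    """H(P) = -sum_i P(i) log P(i), skipping zero-probability events."""
    P = np.asarray(P, dtype=float)
    nz = P > 0
    return -np.sum(P[nz] * np.log(P[nz]))

def cross_entropy_dist(P, Q):
    """H(P, Q) = -sum_i P(i) log Q(i): expectation under P, encoding with Q."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    nz = P > 0
    return -np.sum(P[nz] * np.log(Q[nz]))

P = [0.5, 0.25, 0.25]  # hypothetical true distribution
Q = [0.7, 0.2, 0.1]    # hypothetical estimate

h_p = entropy(P)                 # entropy of the true distribution
h_pq = cross_entropy_dist(P, Q)  # larger than h_p because Q differs from P
h_pp = cross_entropy_dist(P, P)  # equals h_p when the estimate is perfect

# With a one-hot label and a perfect prediction, the cross-entropy is zero
one_hot = [0.0, 1.0, 0.0]
perfect = cross_entropy_dist(one_hot, one_hot)
```

With these values `h_pq` is about 1.16 against an entropy of about 1.04, and the gap closes as `Q` approaches `P`.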