# Deep Learning

Deep learning has found applications in medical diagnosis, image recognition, self-driving cars, and even in competitions against humans in games such as Go or Jeopardy.

Neural networks lie at the heart of deep learning. They are so called because they mimic the functioning of neurons in the human brain.

## Perceptron

Perceptrons form the building blocks of Neural Networks.

![Perceptron](./images/perceptron.png "Perceptron")

In the figure above, X1, X2, ..., Xn form the inputs, W1, W2, ..., Wn form the weights, and b is the bias. Together they form the input nodes. The inputs are multiplied by their weights and, together with the bias, fed to the "linear function" node, which computes the weighted sum W1X1 + W2X2 + ... + WnXn + b and passes it to a "step function" node. The step function returns "0" or "1" as the final output.

### Perceptron algorithm

The perceptron step works as follows. Take a point with coordinates $(p, q)$, label $y$, and prediction $\hat{y} = step(w_1x_1 + w_2x_2 + b)$ evaluated at $x_1 = p, x_2 = q$, and let $\alpha$ be the learning rate:

* If the point is correctly classified, do nothing.
* If the point is classified positive, but it has a negative label, subtract $\alpha p, \alpha q$ and $\alpha$ from $w_1, w_2$, and $b$ respectively.
* If the point is classified negative, but it has a positive label, add $\alpha p, \alpha q$, and $\alpha$ to $w_1, w_2$ and $b$ respectively.
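
The three cases above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names, the default learning rate `alpha=0.1`, and the convention that the step function returns 1 at exactly zero are assumptions made for the example.

```python
def step(t):
    """Step function: 1 for non-negative input, 0 otherwise (assumed convention)."""
    return 1 if t >= 0 else 0


def perceptron_step(w1, w2, b, p, q, y, alpha=0.1):
    """Apply one perceptron update for the point (p, q) with label y (0 or 1)."""
    y_hat = step(w1 * p + w2 * q + b)
    if y_hat == y:
        # Correctly classified: do nothing.
        return w1, w2, b
    if y_hat == 1 and y == 0:
        # Classified positive but labelled negative: subtract.
        return w1 - alpha * p, w2 - alpha * q, b - alpha
    # Classified negative but labelled positive: add.
    return w1 + alpha * p, w2 + alpha * q, b + alpha
```

Calling `perceptron_step` repeatedly over all points, for several passes, moves the line towards the misclassified points.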

## Error Functions

When the linear function can no longer clearly separate the points of interest, we use an error function to do the job. An error function simply tells us how far we are from our goal.

In the example given below, the error is the distance of the point from the separating line. The probability of the highlighted point, which sits on the line, being blue is therefore 0.5, and the probability of it being red is also 0.5 (1 - 0.5). As we move above the line, the probability of a point being blue increases; as we move below it, that probability decreases.

![Probability](./images/probablity.png)

### Sigmoid function

The discrete (yes/no) outputs can be changed to continuous values (numeric probabilities) by replacing the step function with a sigmoid function.

$$\sigma(x) = 1/(1 + e^{-x})$$

The sigmoid function maps large positive values to a value close to 1 and large negative values to a value close to 0. Values near zero are mapped to about 0.5.

![Perceptron](./images/perceptron2.png "Perceptron")
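
As a small sketch, the sigmoid can be written in Python like this. The two-branch form is an assumption added here to avoid overflow in `math.exp` for very negative inputs; it computes the same function as the formula above.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x).
    e = math.exp(x)
    return e / (1.0 + e)
```

For instance, `sigmoid(0)` is 0.5, while `sigmoid(10)` is very close to 1 and `sigmoid(-10)` is very close to 0.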

### Softmax function

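When we have more than two classes, a softmax function can be used to turn scores into probabilities. For three events E0, E1 and E2 with scores $Z_0$, $Z_1$ and $Z_2$, the softmax probability of E0 is:

$$P(E_0) = \frac{e^{Z_0}}{e^{Z_0} + e^{Z_1} + e^{Z_2}}$$

For example, with scores 0, 1 and 2 this gives $\frac{e^0}{e^2 + e^1 + e^0}$.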
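
A minimal Python sketch of softmax for a list of scores is given below. The function name is illustrative, and subtracting the largest score before exponentiating is an added numerical-stability step that does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    largest = max(scores)
    # Shifting by the largest score avoids overflow and leaves the ratios unchanged.
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With scores `[0, 1, 2]`, the first entry of `softmax([0, 1, 2])` equals $e^0 / (e^2 + e^1 + e^0)$, and higher scores receive higher probabilities.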


### Maximum likelihood

How do we know which of several models makes the best predictions? For each model, we compute the probabilities it assigns to the actual labels of the points and multiply them to arrive at the total probability for that model. We then pick the model with the highest total probability. In the figure below, that is the model on the right, which therefore has the maximum likelihood.

![Maximum likelihood example](./images/maximum_likelihood.png)

## Cross Entropy

Instead of taking the product of the probabilities, which quickly becomes a very small number, we can take the sum of the negatives of their logarithms, which is easier to work with. This is known as the cross entropy. Better models have a lower cross entropy (closer to 0), so our goal is to look for the model with the smallest cross entropy.

![Cross entropy example](./images/cross_entropy.png)
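
The relationship between the product of probabilities and the cross entropy can be sketched as follows. The function names are illustrative, and each list is assumed to hold the probabilities a model assigns to the correct labels of the points.

```python
import math


def likelihood(probs):
    """Product of the probabilities assigned to the correct labels."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def cross_entropy(probs):
    """Sum of the negative logarithms of the same probabilities."""
    return -sum(math.log(p) for p in probs)
```

Because the logarithm turns products into sums, `cross_entropy(probs)` equals `-math.log(likelihood(probs))`, so the model with the highest likelihood is also the one with the lowest cross entropy.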

## Logistic Regression

Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes.

Logistic regression uses cross entropy (also called log loss) as its error function, and gradient descent steps to minimise the error.

$$ Error = -\frac{1}{m} \sum_{i=1}^{m} \left[ (1-y_i) \log(1 - y'_i) + y_i \log(y'_i) \right]$$

$$ Error(W,b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ (1-y_i) \log(1 - \sigma(Wx_i + b)) + y_i \log(\sigma(Wx_i + b)) \right]$$

where $y_i$ is the label of the $i$-th point and $y'_i = \sigma(Wx_i + b)$ is the prediction for it.

### Logistic regression algorithm implementing gradient descent 

1. Start with random weights $w_1, w_2, \dots, w_n$ and a bias $b$. This gives us an initial boundary line $Wx + b = 0$ and predictions $\sigma(Wx+b)$.
2. Compute the error over all the points. We can observe that the error is higher for misclassified points and lower for correctly classified points.
3. For each point with coordinates $(x_1, \dots, x_n)$, label $y$ and prediction $y'$, update $w_i = w_i - \alpha (y' - y) x_i$ and $b = b - \alpha (y' - y)$.
4. Repeat steps 2 and 3 until the error is small, or stop after a fixed number of iterations (epochs).
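
The steps above can be sketched in plain Python as follows. This is a minimal illustration under a few assumptions: the function names, the initial weight range, and the default `alpha` and `epochs` are chosen for the example, and the weight update multiplies the error $(y' - y)$ by the matching input coordinate, which is the gradient of the log loss for a sigmoid output.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def predict(weights, bias, x):
    """Prediction y' = sigmoid(W.x + b) for one point x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def log_loss(weights, bias, points, labels):
    """Mean cross entropy over all points (predictions clipped to avoid log(0))."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(points, labels):
        y_pred = min(max(predict(weights, bias, x), eps), 1 - eps)
        total += (1 - y) * math.log(1 - y_pred) + y * math.log(y_pred)
    return -total / len(points)


def train(points, labels, alpha=0.1, epochs=100, seed=0):
    """Gradient descent, updating the weights after each point."""
    rng = random.Random(seed)
    # Step 1: start with random weights and bias.
    weights = [rng.uniform(-0.5, 0.5) for _ in points[0]]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for x, y in zip(points, labels):
            # Step 3: move the weights against the error (y' - y).
            error = predict(weights, bias, x) - y
            weights = [w - alpha * error * xi for w, xi in zip(weights, x)]
            bias -= alpha * error
    return weights, bias
```

Calling `log_loss` before and after `train` on a small, linearly separable data set should show the error going down as the epochs progress.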



## Neural Networks

Perceptrons can be layered together to form Multi Layered Perceptrons or Neural Networks. A typical neural network has three types of layers. These are the input, hidden, and output layers.

![Neural Network](./images/neural_network.png)

Neural networks can have different architectures.

When we add more nodes to the hidden layer, we are increasing the number of linear models the neural network uses to form a non-linear model in the output layer.

When we add more nodes to the input layer, it means that we are working with a multi-dimensional problem, i.e. n input nodes means we are working in an n-dimensional space.

When we have more nodes in the output layer, we are dealing with a multi class classification problem.

And finally, when we increase the number of layers themselves, we get what is known as a Deep Neural Network, which combines linear models to form non-linear models. These non-linear models are in turn combined to form even more complex models, and so on.







### Feed forward

Feedforward is the process neural networks use to turn the input into an output.

![Feed forward](./images/feed_forward.png)
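
As a rough sketch, feedforward through a network with one hidden layer can be written as below. The layout of the weights (one list of incoming weights per node) and the use of the sigmoid in every layer are assumptions made for this example.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def layer(inputs, weights, biases):
    """Compute one layer; weights[j] holds the incoming weights of node j."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def feedforward(x, hidden_w, hidden_b, output_w, output_b):
    """Pass the input through the hidden layer, then the output layer."""
    hidden = layer(x, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)
```

Each layer applies a linear function followed by the activation, and the output of one layer becomes the input of the next.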

### Back Propagation

How do we train a Neural Network? For this, we'll use the method known as backpropagation. In a nutshell, backpropagation will consist of:

1. Doing a feedforward operation.
2. Comparing the output of the model with the desired output.
3. Calculating the error.
4. Running the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
5. Using this to update the weights and get a better model.
6. Continuing this until we have a model that is good.

![Back propagation 1](./images/back_propagation_1.png)

From the figure above, we can see that the point was misclassified in the final output. The point wants the blue boundary to move closer to it. Using step 4 of the algorithm, we pass the error back to the previous (hidden) layer, which adjusts its weights to the output layer. We can see this in the change in the weights in the image below, where the weight coming from the top of the hidden layer was reduced, and the one from the bottom was increased.

Backpropagation doesn't end there. The hidden layer nodes themselves check how they have classified the point and pass their respective errors back to the input layer, which adjusts its weights to the hidden layer. Finally, we are left with a better model than before, shown in the figure below. The algorithm then goes back to step 1.

![Back propagation 2](./images/back_propagation_2.png)
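
To make the procedure concrete, here is a minimal sketch of one backpropagation step for a network with one hidden layer and a single output node. It assumes sigmoid activations and the log loss, so the error at the output is simply the prediction minus the label; the dictionary keys and function names are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def forward(x, params):
    """Feedforward: return the hidden activations and the output."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(params["hidden_w"], params["hidden_b"])
    ]
    output = sigmoid(
        sum(w * h for w, h in zip(params["output_w"], hidden)) + params["output_b"]
    )
    return hidden, output


def backprop_step(x, y, params, alpha=0.5):
    """One feedforward pass, then push the error back and update all weights."""
    hidden, output = forward(x, params)
    # Error at the output node (sigmoid output with log loss).
    delta_out = output - y
    # Spread the error to each hidden node through its outgoing weight.
    delta_hidden = [
        delta_out * w * h * (1 - h) for w, h in zip(params["output_w"], hidden)
    ]
    return {
        "output_w": [w - alpha * delta_out * h
                     for w, h in zip(params["output_w"], hidden)],
        "output_b": params["output_b"] - alpha * delta_out,
        "hidden_w": [[w - alpha * d * xi for w, xi in zip(ws, x)]
                     for ws, d in zip(params["hidden_w"], delta_hidden)],
        "hidden_b": [b - alpha * d
                     for b, d in zip(params["hidden_b"], delta_hidden)],
    }
```

Repeating `backprop_step` over all training points, for several epochs, corresponds to looping over the numbered steps in this section.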


## Training Neural Networks

Despite our best efforts, our model may not deliver the expected results. This could be due to a multitude of factors, such as noisy data, a poorly chosen network architecture, or a model that takes too long to train. A lot of these problems can be avoided by optimizing the training of our models.

![Neural Network Architectures](./images/neural_network_architectures.png)

The figure above illustrates three neural network models. The model in the middle is the best one, but how do we arrive at it? In practice, a model usually ends up either underfitting or overfitting. So it's better to take an overfitting model and optimize it to make it more like the best-fitting model in the middle.

Now we shall look at some of the strategies to arrive at a better model.

## Early stopping

When a model underfits, both the training and the testing errors are high. When it overfits, the training error is low but the testing error remains high.

![Early stopping](./images/early_stopping.png)

Our aim, therefore, is to stop gradient descent once the testing error starts to increase again after reaching its low point, as illustrated in the figure above. At this point the testing error is at its lowest, even though the training error would keep decreasing with further training. This is known as early stopping.
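
A simple way to picture early stopping is as a scan over the testing error recorded after each epoch. The sketch below is an illustration only: the function name and the `patience` parameter (how many non-improving epochs to tolerate before stopping) are assumptions, not something defined in these notes.

```python
def early_stopping_epoch(test_errors, patience=1):
    """Return the index of the epoch with the lowest testing error seen
    before the error failed to improve for `patience` epochs in a row."""
    best_epoch = 0
    best_error = float("inf")
    waited = 0
    for epoch, error in enumerate(test_errors):
        if error < best_error:
            best_epoch, best_error, waited = epoch, error, 0
        else:
            waited += 1
            if waited >= patience:
                break  # The testing error has started to rise: stop training.
    return best_epoch
```

In a real training loop the weights from the best epoch would be saved and restored once training stops.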

## Regularization

A prediction function with large coefficients (e.g. $ 10x_1 + 10x_2 $) results in predictions with a lower training error than one with small coefficients (e.g. $x_1 + x_2 $), because the sigmoid becomes steeper and the predictions move closer to 0 and 1. However, such models tend to overfit.

We can avoid such issues by adding a penalty term to the error function, scaled by a parameter $\lambda$, which punishes models with large coefficients (that overfit). When $\lambda$ is high, the penalty is high, and vice versa.

*L1 Regularization*

Add $ \lambda (|w_1| + |w_2| + \dots + |w_n|) $ to the error function.

L1 regularization tends to result in sparse weight vectors (e.g. 1, 0, 0, 1, 0, ...), as small weights tend to go to zero. This is useful for feature selection among a large number of features, since the weights of unimportant features zero out.

*L2 Regularization* 

Add $ \lambda (w_1^2 + w_2^2 + \dots + w_n^2) $ to the error function.

L2 regularization tends not to produce sparse vectors; it keeps the weights small and homogeneous (e.g. 0.5, 0.3, -2.0, 0.4, 0.1). L2 regularization tends to give better results for training models than L1.
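
The two penalty terms can be sketched as small Python helpers. The function names and the argument name `lam` (standing in for $\lambda$) are illustrative.

```python
def l1_penalty(weights, lam):
    """L1 penalty: lambda times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 penalty: lambda times the sum of squared weights."""
    return lam * sum(w * w for w in weights)
```

For the two prediction functions mentioned earlier, the weights `[10, 10]` receive a much larger penalty than `[1, 1]` under either scheme, which is what pushes training towards the smaller coefficients.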

## Dropout

Some nodes might end up dominating the training owing to their larger weights. To counter this, we can randomly turn off certain nodes during each epoch. This is done by assigning each node a probability that it will be dropped.
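
A bare-bones sketch of dropout applied to a list of node activations is shown below. The function name and the `rng` argument are assumptions for the example. Practical implementations typically also rescale the surviving activations, which is left out here to keep the idea visible.

```python
import random


def dropout(activations, drop_prob, rng=random):
    """Zero out each activation independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]
```

A new random mask is drawn on every call, so different nodes are switched off at each training pass. Dropout is only used during training, not when making predictions.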

## Random Restart

With gradient descent, we might end up at a local minimum as the final solution. To avoid this, we can run gradient descent from several random starting locations, which makes it more likely that we arrive at the global minimum, or at least at a better local minimum.
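
The idea can be illustrated on a one-dimensional function with two minima. Everything in this sketch is an assumption made for the example: the function $f(x) = x^4 - 3x^2 + x$, the search range, the learning rate, and the number of restarts.

```python
import random


def f(x):
    """Example function with a local minimum near 1.13 and a global one near -1.30."""
    return x**4 - 3 * x**2 + x


def df(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x, alpha=0.01, steps=1000):
    """Plain gradient descent from the starting point x."""
    for _ in range(steps):
        x -= alpha * df(x)
    return x


def random_restart(restarts=10, seed=0):
    """Run gradient descent from several random starts and keep the best result."""
    rng = random.Random(seed)
    candidates = [gradient_descent(rng.uniform(-2.0, 2.0)) for _ in range(restarts)]
    return min(candidates, key=f)
```

A single run started to the right of the hump, such as `gradient_descent(2.0)`, settles in the shallower minimum, whereas the best of several random starts is much more likely to reach the deeper one.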

## Different activation functions

A hyperbolic tangent function or a Rectified Linear Unit (ReLU) function can be used as the activation function instead of a sigmoid function. The derivative of the sigmoid function approaches zero as we move towards either end of the curve. This causes the vanishing gradient problem: during backpropagation these small derivatives are multiplied together across layers, so the gradient gets closer and closer to zero and the weights barely change.
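
The derivatives of the three activation functions can be compared with a short sketch. The function names are illustrative, and the ReLU derivative at exactly zero is taken to be 0 here, which is a common convention rather than something stated in these notes.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Peaks at 0.25 for x = 0 and shrinks towards 0 for large |x|."""
    s = sigmoid(x)
    return s * (1 - s)


def tanh_derivative(x):
    """Peaks at 1 for x = 0."""
    return 1 - math.tanh(x) ** 2


def relu(x):
    return max(0.0, x)


def relu_derivative(x):
    """Stays at 1 for every positive input."""
    return 1.0 if x > 0 else 0.0
```

At `x = 10`, the sigmoid derivative is already below 0.0001, while the ReLU derivative is still 1, which is why ReLU is less prone to vanishing gradients for positive inputs.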

## Batch gradient descent vs. Stochastic gradient descent

Batch gradient descent takes a significant amount of time to complete as the amount of data increases. Matrix computations over the entire data set are both memory and time intensive.

In stochastic gradient descent, the full data set is divided into several smaller samples. The first sample is used to train the neural network: we run feedforward and backpropagation on it and adjust the weights. Then the second sample is used in the same way, and so on until all the samples have been used.
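
Splitting the data into samples might look like the sketch below. The function name, the shuffling step, and the `batch_size` and `seed` parameters are assumptions made for the example.

```python
import random


def make_batches(data, batch_size, seed=0):
    """Shuffle the data and split it into consecutive samples of batch_size items."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [
        shuffled[i:i + batch_size]
        for i in range(0, len(shuffled), batch_size)
    ]
```

One training epoch would then loop over the returned samples and perform a feedforward pass, backpropagation, and a weight update for each sample in turn. The last sample may be smaller than the others when the data does not divide evenly.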

## Momentum

Another technique to get past local minima is to use a momentum parameter $ \beta $, a constant between zero and one that weights the previous steps: each step is combined with $\beta$ times the previous step, $\beta^2$ times the one before that, and so on. This helps us roll over small local minima and gives a good chance of arriving at the global minimum.
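
One common way to write the momentum update is to keep a running "velocity" that accumulates past gradients, each weighted by a further factor of $\beta$. The sketch below follows that form for a single variable; the function name and default values are assumptions made for the example.

```python
def momentum_descent(grad, x, alpha=0.1, beta=0.9, steps=100):
    """Gradient descent with momentum on a function of one variable.

    grad is a function returning the gradient at x.
    """
    velocity = 0.0
    for _ in range(steps):
        # The new velocity is the current gradient plus beta times the old velocity,
        # so older gradients contribute with weights beta, beta^2, beta^3, ...
        velocity = beta * velocity + grad(x)
        x -= alpha * velocity
    return x
```

For example, `momentum_descent(lambda x: 2 * (x - 3), 0.0, steps=500)` minimises $(x - 3)^2$ and ends up very close to 3.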