# Improving Deep Neural Network - Part 1

## Table of Contents

* [1. Setting up your ML application](#chapter1)
    * [1.1  Train/dev/test](#section_1_1)
    * [1.2  Bias vs Variance](#section_1_2)
    * [1.3  Basic recipe of ML](#section_1_3)
* [2. Regularizing your Neural Network](#chapter2)
    * [2.1  Regularization](#section_2_1)
    * [2.2  Why Regularization reduces overfitting](#section_2_2)
    * [2.3  Dropout Regularization](#section_2_3)
    * [2.4  Other Regularization](#section_2_4)
* [3. Setting up your Optimization problem](#chapter3)
    * [3.1 Normalizing Inputs](#section_3_1)
    * [3.2 Exploding Gradients and Weights Initialization](#section_3_2)
    * [3.3 Gradient Checking](#section_3_3)

We know how to implement a neural network. Now we will learn how to make it work well:

- hyperparameter tuning
- setting up your data
- making sure your optimization algorithm runs quickly, so that your learning algorithm learns in a reasonable time

# 1. Setting up your ML application <a class="anchor" id="chapter1"></a>

When we train our neural network we have to make a lot of decisions such as :
- number of layers in the neural network
- number of hidden units
- learning rate
- activation function
- ...

To find good hyperparameters, we have to try different models.

So in practice, applied machine learning is a highly iterative process:

<center><img src="images/07-Improving Deep Neural Network/process-ML.PNG" width = "300px"></center>

- You start with an idea.
- You code it up and try it by running your code.
- You experiment and get back a result that tells you how well your network works.

Then you refine your idea and try a new one.

## 1.1 Train/Dev/Test <a class="anchor" id="section_1_1"></a>

Setting up your train, development and test sets well can make you much more efficient at finding a good model quickly.

<center><img src="images/07-Improving Deep Neural Network/dataset.PNG" width = "400px"></center>

- You train your algorithms on the <b>training set</b>
- You use your <b>dev set</b> to see which of many different models performs best
- You evaluate the best model on your <b>test set</b>

> In the previous era, or on a small dataset:

We split the dataset into:
- 70% / 30% (train / test)
- 60% / 20% / 20% (train / dev / test)

> Big data (big dataset):

We split the dataset into (train / dev / test):
- 98% / 1% / 1%
- 99.5% / 0.25% / 0.25%
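
A minimal sketch of such a split with NumPy, assuming the examples are stored row-wise in `X` with labels in `y`. The function name `split_dataset`, its parameters and the 1% defaults are hypothetical choices for illustration, not part of the original notes.

```python
import numpy as np

def split_dataset(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the examples, then split them into train/dev/test sets.

    X has shape (m, n_x) and y has shape (m,). The fractions are
    illustrative defaults that mimic a 98% / 1% / 1% split.
    """
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)              # random order of the m examples
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev_idx = perm[:n_dev]
    test_idx = perm[n_dev:n_dev + n_test]
    train_idx = perm[n_dev + n_test:]      # everything else is training data
    return ((X[train_idx], y[train_idx]),
            (X[dev_idx], y[dev_idx]),
            (X[test_idx], y[test_idx]))

# Toy data: 10,000 examples with 2 features each.
X = np.arange(20000).reshape(10000, 2)
y = np.arange(10000)
train, dev, test = split_dataset(X, y)
print(train[0].shape, dev[0].shape, test[0].shape)
```

Shuffling before splitting only makes sense when all examples come from the same distribution; it does not address the mismatched-distribution case.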

##### Mismatched Train/Test distribution

Ex :
- Training set: Cat pictures from web
- Dev/ test sets: Cat pictures from users using your app

> Make sure the dev and test set come from the same distribution.

So setting up your train, dev and test sets well will allow you to iterate more quickly. It will also let you measure the bias and variance of your algorithm more efficiently, so that you can select ways to improve it.

## 1.2 Bias vs Variance <a class="anchor" id="section_1_2"></a>

<center><img src="images/07-Improving Deep Neural Network/bias-variance.PNG" width = "500px"></center>

In deep learning, the two key numbers to look at to understand bias and variance are the train set error and the dev (development) set error.

> For example we take a Cat Classifier:

<br>
<center>
<table>
    <tbody>
        <tr>
            <td>Train set Error</td>
            <td>Dev set Error</td>
            <td>Bias-Variance</td>
        </tr>
        <tr>
            <td>1%</td>
            <td>11%</td>
            <td>High variance</td>
        </tr>
        <tr>
            <td>15%</td>
            <td>16%</td>
            <td>High bias</td>
        </tr>
        <tr>
            <td>15%</td>
            <td>30%</td>
            <td>High variance & High bias</td>
        </tr>
        <tr>
            <td>0.5%</td>
            <td>1%</td>
            <td>Low bias & Low variance</td>
        </tr>
    </tbody>
</table>
</center>
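
The reading of the table can be sketched as a small helper. The function `diagnose`, the `tolerance` threshold and the assumption that the best achievable error is close to 0% are all hypothetical choices made for this illustration; in practice the thresholds depend on the problem.

```python
def diagnose(train_error, dev_error, bayes_error=0.0, tolerance=0.02):
    """Label a (train error, dev error) pair; errors are fractions, e.g. 0.15.

    bayes_error and tolerance are illustrative knobs, not values from the notes.
    """
    # High bias: the model does not even fit the training set well.
    high_bias = (train_error - bayes_error) > tolerance
    # High variance: the dev error is much worse than the training error.
    high_variance = (dev_error - train_error) > tolerance
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"

# The four rows of the cat classifier table.
for train_err, dev_err in [(0.01, 0.11), (0.15, 0.16), (0.15, 0.30), (0.005, 0.01)]:
    print(train_err, dev_err, "->", diagnose(train_err, dev_err))
```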

## 1.3 Basic recipe of ML <a class="anchor" id="section_1_3"></a>

- High bias (look at the training set performance):
    - Try a bigger network
    - Train longer

- High variance (look at the dev set performance):
    - Regularization
    - More data

# 2. Regularizing your Neural Network <a class="anchor" id="chapter2"></a>

## 2.1 Regularization <a class="anchor" id="section_2_1"></a>

> Logistic Regression :

$$J(w,b) = \frac{1}{m} \sum_{i=1}^{m} L(y_{pred},y) + \frac{\lambda}{2m} \lVert w \rVert^2_2$$

- L2 Regularization :

$$\lVert w \rVert^2_2  = \sum_{j=1}^{n_x}w_j^2 = w^Tw  $$

- L1 Regularization :

$$ \frac{\lambda}{2m} \sum_{j=1}^{n_x} |w_j| = \frac{\lambda}{2m} \lVert w \rVert_1     $$
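
As a rough illustration, the L2-regularized cost for logistic regression could be computed as below. This sketch assumes that the loss L is the usual cross-entropy and that the examples are stored column-wise; the name `cost_l2` and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_l2(w, b, X, y, lambd):
    """Cross-entropy cost plus the (lambda / 2m) * ||w||_2^2 penalty.

    Shapes (an assumption of this sketch): X is (n_x, m), y is (1, m),
    w is (n_x, 1) and b is a scalar.
    """
    m = X.shape[1]
    y_pred = sigmoid(w.T @ X + b)
    cross_entropy = -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))
    l2_penalty = (lambd / (2 * m)) * np.sum(w ** 2)
    return cross_entropy + l2_penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
y = np.array([[0, 1, 1, 0, 1]])
w = np.array([[0.5], [-0.2], [0.1]])
cost_plain = cost_l2(w, 0.0, X, y, lambd=0.0)
cost_reg = cost_l2(w, 0.0, X, y, lambd=1.0)
print(cost_plain, cost_reg)
```

The two costs differ exactly by the penalty term, here (1 / 10) * (0.25 + 0.04 + 0.01) = 0.03.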

> Neural Network:

$$J(W^{[1]},b^{[1]},W^{[2]},b^{[2]}....,W^{[L]},b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(y_{pred},y) + \frac{\lambda}{2m} \sum_{l=1}^{L} \lVert W^{[l]} \rVert^2_F$$

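Here the squared Frobenius norm of each weight matrix is:

$$\lVert W^{[l]} \rVert^2_F = \sum_{i=1}^{n^{[l]}}  \sum_{j=1}^{n^{[l-1]}} (w_{ij}^{[l]})^2$$

In backpropagation, the regularization term adds an extra term to each gradient:

$$ dW^{[l]} = \text{(from backprop)} + \frac{\lambda}{m} W^{[l]} $$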
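
A minimal sketch of both pieces, assuming the weight matrices and the unregularized gradients are kept in plain Python lists (the helper names `frobenius_penalty` and `add_l2_to_gradients` are hypothetical):

```python
import numpy as np

def frobenius_penalty(weights, lambd, m):
    """(lambda / 2m) * sum over layers of the squared Frobenius norm of W[l]."""
    return (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

def add_l2_to_gradients(grads_from_backprop, weights, lambd, m):
    """Add (lambda / m) * W[l] to each dW[l] computed without regularization."""
    return [dW + (lambd / m) * W for dW, W in zip(grads_from_backprop, weights)]

# Toy two-layer example with m = 10 training examples.
weights = [np.ones((2, 3)), 2.0 * np.ones((1, 2))]
grads = [np.zeros((2, 3)), np.zeros((1, 2))]
penalty = frobenius_penalty(weights, lambd=0.7, m=10)
reg_grads = add_l2_to_gradients(grads, weights, lambd=0.7, m=10)
print(penalty)
```

With this gradient, the update $W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$ equals $(1 - \frac{\alpha \lambda}{m}) W^{[l]} - \alpha \cdot \text{(from backprop)}$, which is why L2 regularization is also called weight decay.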

## 2.2 Why Regularization reduces overfitting? <a class="anchor" id="section_2_2"></a>

> Example :

Let's take a neural network that is overfitting, i.e. that has high variance.

We add the extra <b>regularization</b> term to the cost function:

$$J= \frac{1}{m} \sum_{i=1}^{m} L(y_{pred},y) + \frac{\lambda}{2m} \sum_{l=1}^{L} \lVert W^{[l]} \rVert^2_F$$

<b>It penalizes the weight matrices for being too large.</b>

- If, for example, we set lambda to be really big,
- then the weight matrices W are pushed to be reasonably close to zero.

<center><img src="images/07-Improving Deep Neural Network/nn_example.PNG" width = "300px"></center>

<center><img src="images/07-Improving Deep Neural Network/nn_simplified.PNG" width = "300px"></center>

<b>The weights of some hidden units will be close to zero, so those units have very little effect. The neural network is therefore simplified and behaves like a smaller one, which takes you from the overfitting case much closer to the high bias case.</b> By testing different values of lambda, you can find one that reduces overfitting.

> Tanh Example :

<center><img src="images/07-Improving Deep Neural Network/tanh_function.png" width = "300px"></center>

If lambda is big, then W becomes smaller, so $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ will take values close to 0. In that range, <b>tanh = g(Z) is roughly linear</b>. We saw in a previous course that if every layer is linear, then the whole neural network is linear. So with a roughly linear activation function, the network computes something close to a linear function rather than a highly complex non-linear one. This reduces the complexity of our model and therefore reduces overfitting.
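
A quick numerical check of the "roughly linear" claim, comparing tanh(z) with the straight line z on a small and on a large range (the two ranges are arbitrary choices for illustration):

```python
import numpy as np

z_small = np.linspace(-0.1, 0.1, 5)   # the regime reached with small weights
z_large = np.linspace(-3.0, 3.0, 5)   # the regime reached with large weights

# Largest gap between tanh(z) and the linear function z on each range.
err_small = np.max(np.abs(np.tanh(z_small) - z_small))
err_large = np.max(np.abs(np.tanh(z_large) - z_large))
print(err_small, err_large)
```

The gap is tiny near zero and large far from it, which is the behaviour the argument relies on.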

**What is L2-regularization actually doing?**:

L2-regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.

## 2.3 Dropout Regularization <a class="anchor" id="section_2_3"></a>

**Dropout** is a widely used regularization technique that is specific to deep learning.

<center><img src="images/07-Improving Deep Neural Network/dropout1.PNG" width = "300px" height="160px"><img src="images/07-Improving Deep Neural Network/dropout2.PNG" width = "300px" height="160px"><img src="images/07-Improving Deep Neural Network/dropout3.PNG" width = "300px" height="160px"><img src="images/07-Improving Deep Neural Network/dropout4.PNG" width = "300px" height="160px"></center>

<center><b>Figure</b>:<b>Drop-out on the second hidden layer.</b> <br> At each iteration, you shut down (= set to zero) each neuron of a layer with probability 1 - keep_prob or keep it with probability keep_prob (50% here). The dropped neurons don't contribute to the training in both the forward and backward propagations of the iteration.</center>

<center><img src="images/07-Improving Deep Neural Network/dropout-it1.PNG" width = "37%"><img src="images/07-Improving Deep Neural Network/dropout-it2.PNG" width = "38%"></center>

<center><b>Figure </b>:<b> Drop-out on the first and third hidden layers. </b><br> 1st layer: we shut down on average 40% of the neurons. 3rd layer: we shut down on average 20% of the neurons.</center>

When you shut some neurons down, you actually modify your model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time. 
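
A minimal sketch of dropout on one layer's activations. It uses the common "inverted dropout" variant, which divides the kept activations by `keep_prob` so that their expected value is unchanged; that rescaling is an assumption of this sketch and is not stated in the text above. The function names are hypothetical.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Apply dropout to activations A of shape (n_units, m)."""
    mask = rng.random(A.shape) < keep_prob   # True = keep this neuron
    A_dropped = A * mask / keep_prob         # shut down, then rescale (inverted dropout)
    return A_dropped, mask

def dropout_backward(dA, mask, keep_prob):
    """Use the same mask in backpropagation so dropped neurons get no gradient."""
    return dA * mask / keep_prob

rng = np.random.default_rng(0)
A = np.ones((1000, 100))
A_dropped, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
dA = dropout_backward(np.ones_like(A), mask, keep_prob=0.8)
print(mask.mean(), A_dropped.mean())
```

At test time no neurons are dropped, so this function would simply not be called.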

## 2.4 Other Regularization <a class="anchor" id="section_2_4"></a>

> DATA AUGMENTATION :

Getting more training data can be expensive, and sometimes you just can't get more data. A cheap way to have more data is to apply <b>data augmentation</b> to your images.

- By flipping the images horizontally, you can double the size of your training set.
- By randomly zooming into (cropping) the images, you get further new examples.

<center><img src="images/07-Improving Deep Neural Network/data-augmentation.PNG" width = "600px"></center>

So by taking random distortions and translations of the images, you can augment your dataset and make additional fake training examples. This can be an inexpensive way to give your algorithm more data and therefore regularize it somewhat and reduce overfitting.
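
The two augmentations can be sketched with plain NumPy, assuming the images are stored as an array of shape (m, height, width, channels). The random crop is a simplified stand-in for a random zoom (a real zoom would also resize the crop back to the original size), and both function names are hypothetical.

```python
import numpy as np

def flip_horizontal(images):
    """Mirror a batch of images of shape (m, height, width, channels) left to right."""
    return images[:, :, ::-1, :]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a random crop_h x crop_w window out of one (height, width, channels) image."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w, :]

rng = np.random.default_rng(0)
images = rng.random((4, 32, 32, 3))               # toy batch of 4 random "images"
augmented = np.concatenate([images, flip_horizontal(images)], axis=0)
crop = random_crop(images[0], 28, 28, rng)
print(augmented.shape, crop.shape)
```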

> Early Stopping

With early stopping, you plot the cost function J that you are trying to optimize, and you also plot your dev set error.

<center><img src="images/07-Improving Deep Neural Network/early-stopping.PNG" width = "400px"></center>

The term early stopping refers to the fact that you stop training your neural network early, around the iteration where the dev set error is lowest.

As you train, w gets bigger and bigger, until you may end up with much larger values of the parameters w. By stopping halfway, early stopping leaves you with a mid-size norm of w. So, similarly to L2 regularization, you pick a neural network with a smaller norm for its parameters w, and hopefully that network overfits less.

In a machine learning project you want an algorithm that:
- optimizes the cost function J (tools: gradient descent, ...)
- does not overfit once the cost function J is optimized (tools: regularization, ...)

To carry out these two tasks you use different tools.

The main downside of early stopping is that it couples these two tasks. By stopping gradient descent early, you interrupt the optimization of the cost function J, so you are no longer doing a great job of reducing J, and at the same time you are trying not to overfit.
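
One possible way to pick the stopping point from a recorded dev error curve is sketched below. The `patience` rule (stop once the dev error has not improved for a few measurements) is an assumption added for this illustration; the notes only say that training is stopped early.

```python
def early_stopping_point(dev_errors, patience=3):
    """Return the index of the iteration whose parameters early stopping would keep.

    dev_errors[i] is the dev set error measured after iteration (or epoch) i.
    Training is considered stopped once `patience` measurements in a row
    fail to improve on the best dev error seen so far.
    """
    best_index, best_error = 0, float("inf")
    for i, err in enumerate(dev_errors):
        if err < best_error:
            best_index, best_error = i, err
        elif i - best_index >= patience:
            break   # no improvement for `patience` steps: stop training here
    return best_index

# Toy dev error curve: it improves, then starts to rise as the network overfits.
dev_errors = [0.30, 0.22, 0.18, 0.16, 0.17, 0.19, 0.21, 0.15]
stop_at = early_stopping_point(dev_errors, patience=3)
print(stop_at)
```

In this toy curve training stops before the last value is ever measured, which illustrates the coupling described above: the cost is no longer being optimized once the stopping rule fires.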

# 3. Setting up your optimization problem <a class="anchor" id="chapter3"></a>

## 3.1 Normalizing Inputs <a class="anchor" id="section_3_1"></a>

Normalizing your inputs takes two steps:
- first, subtract the mean so that each feature has zero mean
- then, normalize the variances by dividing each feature by its standard deviation

$$ X_{standard} = \frac{X-\mu}{\sigma}$$

<u>Example</u> :
<center><img src="images/07-Improving Deep Neural Network/normalization-steps.PNG" width = "500px"></center>

If your input features come from very different scales, for example some features range from 0 to 1 and some from 1 to 1000, then it's important to normalize them.

- Normalizing makes your cost function J easier and faster to optimize.
- If the features are not on the same scale, it really hurts your optimization algorithm: gradient descent might need a lot of steps, oscillating back and forth, before it finally finds its way to the minimum.

<u>2D Example</u>:

<center><img src="images/07-Improving Deep Neural Network/normalization-ex.PNG" width = "500px"></center>

<b>IMPORTANT :</b> Use the same mu and sigma (computed on the training set) to normalize the testing set
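
A minimal sketch of this procedure, assuming the examples are stored row-wise. The helper names and the small `eps` added to sigma (to avoid dividing by zero for a constant feature) are assumptions of this example.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute the per-feature mean and standard deviation; X_train is (m, n_x)."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + eps   # eps guards against constant features
    return mu, sigma

def normalize(X, mu, sigma):
    return (X - mu) / sigma

rng = np.random.default_rng(0)
# Two features on very different scales: one in [0, 1), one in [1, 1000).
X_train = np.column_stack([rng.random(500), rng.uniform(1, 1000, 500)])
X_test = np.column_stack([rng.random(100), rng.uniform(1, 1000, 100)])

mu, sigma = fit_normalizer(X_train)             # statistics from the training set only
X_train_norm = normalize(X_train, mu, sigma)
X_test_norm = normalize(X_test, mu, sigma)      # same mu and sigma for the test set
print(X_train_norm.mean(axis=0), X_train_norm.std(axis=0))
```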

## 3.2  Exploding Gradients and Weights Initialization <a class="anchor" id="section_3_2"></a>

One of the problems of training neural networks, especially very deep ones, is vanishing and exploding gradients. When you train a very deep network, your derivatives (slopes) can sometimes become either very big or very small, maybe even exponentially small, and this makes training difficult.
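
The effect can be illustrated with a toy deep network. This sketch assumes linear activations, no biases, and the same weight matrix (a scaled identity) in every layer; these are simplifying assumptions chosen for the demonstration, and the function name is hypothetical. The forward signal then scales like `scale ** n_layers`, and the gradients behave in a similar way.

```python
import numpy as np

def deep_linear_output(scale, n_layers, n_units=2):
    """Push a vector of ones through n_layers linear layers with W = scale * I."""
    W = scale * np.eye(n_units)
    a = np.ones((n_units, 1))
    for _ in range(n_layers):
        a = W @ a
    return a

exploding = deep_linear_output(1.5, 50)[0, 0]   # grows like 1.5 ** 50
vanishing = deep_linear_output(0.5, 50)[0, 0]   # shrinks like 0.5 ** 50
print(exploding, vanishing)
```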


In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, so you might as well be training a neural network with $n^{[l]}=1$ for every layer. This way, the network is no more powerful than a linear classifier like logistic regression. 

It turns out there is a partial solution, which doesn't completely solve this problem but helps a lot: a careful choice of how you initialize the weights.

A well-chosen initialization can:
- Speed up the convergence of gradient descent
- Increase the odds of gradient descent converging to a lower training (and generalization) error

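> Weight initialization:

- ReLU (He initialization):
$$ W^{[l]} = \text{np.random.randn(shape)} * \text{np.sqrt}\left(\frac{2}{n^{[l-1]}}\right)$$

- tanh (Xavier initialization):
$$ W^{[l]} = \text{np.random.randn(shape)} * \text{np.sqrt}\left(\frac{1}{n^{[l-1]}}\right)$$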
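
A sketch of both rules for a whole network. The function `initialize_weights`, the `layer_dims` list and the dictionary keys are hypothetical names for this example, and it draws from NumPy's `default_rng` generator instead of `np.random.randn`, which gives the same standard normal distribution.

```python
import numpy as np

def initialize_weights(layer_dims, activation="relu", seed=0):
    """Initialize W[l] and b[l] for layers 1..L.

    layer_dims = [n_x, n_1, ..., n_L]. The variance of W[l] is 2 / n[l-1]
    for ReLU and 1 / n[l-1] for tanh; biases start at zero.
    """
    rng = np.random.default_rng(seed)
    numerator = 2.0 if activation == "relu" else 1.0
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(numerator / n_prev)
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

params = initialize_weights([1000, 500, 1], activation="relu")
print(params["W1"].shape, params["W1"].std())   # std should be near sqrt(2 / 1000)
```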

## 3.3  Gradient Checking <a class="anchor" id="section_3_3"></a>

> Gradient Checking steps:

- Take W[1], b[1], W[2], b[2], ..., W[L], b[L] and reshape them into one big vector Theta
- Take dW[1], db[1], dW[2], db[2], ..., dW[L], db[L] and reshape them into one big vector dTheta

$$ J(W^{[1]},b^{[1]},\dots,W^{[L]},b^{[L]}) = J(\theta_1,\dots,\theta_n)$$

For each component i :
$$ d\theta_{approx}^{[i]} = \frac{J(\theta_1,\dots,\theta_i + \epsilon,\dots,\theta_n)-J(\theta_1,\dots,\theta_i - \epsilon,\dots,\theta_n)}{2\epsilon}$$

> Check:

$$ \text{diff} = \frac{ \lVert d\theta_{approx} - d\theta \rVert_2}{\lVert d\theta_{approx} \rVert_2 + \lVert d\theta \rVert_2} $$

For $\epsilon$ around $10^{-7}$:

- $\text{diff} \approx 10^{-7}$ : great!
- $\text{diff} \approx 10^{-5}$ : ok!
- $\text{diff} \approx 10^{-3}$ : worry!
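
The steps and the check above can be sketched as follows, assuming the parameters are already flattened into one vector and that the cost and its analytic gradient are available as functions. The names `gradient_check`, `J`, `correct_grad` and `buggy_grad`, and the toy cubic cost, are hypothetical.

```python
import numpy as np

def gradient_check(J, grad_J, theta, epsilon=1e-7):
    """Return the relative difference between numerical and analytic gradients."""
    theta = theta.astype(float)
    d_theta = grad_J(theta)                    # gradient from "backprop"
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon               # nudge only component i up
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon              # and down
        d_theta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator

def J(theta):
    return np.sum(theta ** 3)      # toy cost function

def correct_grad(theta):
    return 3 * theta ** 2          # true gradient of the toy cost

def buggy_grad(theta):
    return 2 * theta ** 2          # deliberately wrong gradient

theta = np.array([1.0, -2.0, 0.5])
diff_ok = gradient_check(J, correct_grad, theta)
diff_bug = gradient_check(J, buggy_grad, theta)
print(diff_ok, diff_bug)
```

The correct gradient gives a very small relative difference, while the buggy one gives a value far above the "worry" threshold.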

> Gradient checking implementation notes:

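- Don't use it in training, only to debug
- If the algorithm fails the gradient check, look at the individual components to try to identify the bug
- Remember to include the regularization term in J and in the gradients
- It doesn't work with dropout, so turn dropout off while checking
- Run it at random initialization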