# Course 2. Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

## Train / Dev / Test sets

List of hyperparameters:
* Number of layers
* Number of hidden units in each layer
* Learning rates
* Regularization parameters
* Activation functions

To find the most suitable hyperparameter values for your neural net, experiment with different values and iterate on those experiments.

The previous era of machine learning was characterized by much smaller amounts of data.
* A 70 / 30 train / test split or a 60 / 20 / 20 train / dev / test split was fine when the dataset had up to 10,000 examples, but with 1M examples a 98 / 1 / 1 split is also a good choice
* Make sure that the dev and test sets come from the same distribution
* Not having a test set might be OK
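
The split ratios above can be sketched in code. This is a minimal illustration, not a prescribed recipe: the helper name `split_indices` and its parameters are made up for this example. Shuffling once and slicing the same permutation is one simple way to make the dev and test sets come from the same distribution.

```python
import numpy as np

def split_indices(m, dev_frac, test_frac, seed=0):
    """Shuffle indices 0..m-1 and split them into train / dev / test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev = idx[:n_dev]
    test = idx[n_dev:n_dev + n_test]
    train = idx[n_dev + n_test:]
    return train, dev, test

# Small dataset: 60 / 20 / 20
train, dev, test = split_indices(10_000, dev_frac=0.2, test_frac=0.2)
small_sizes = (len(train), len(dev), len(test))

# Large dataset: 98 / 1 / 1
train, dev, test = split_indices(1_000_000, dev_frac=0.01, test_frac=0.01)
large_sizes = (len(train), len(dev), len(test))
```

With 1M examples, 1% still leaves 10,000 examples each for dev and test, which is why the smaller fractions are acceptable.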

## Bias and Variance

<img src="imgs/biasvariance.png">

It is possible for a classifier to have both high bias and high variance, which would look something like this:

<img src="imgs/highbias&highvariance.png">


## Basic recipe for machine learning

In [6]:
high_bias = True # measure error on train set 
high_variance = True # measure error on dev / test set

if high_bias:
    print("* Try bigger network")
    print("* Train longer")
    print("* Search different nn architecture")

if high_variance:
    print("* Find more data or create more features")
    print("* Regularization")
    print("* Search different nn architecture")


* Try bigger network
* Train longer
* Search different nn architecture
* Find more data or create more features
* Regularization
* Search different nn architecture
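
In the cell above the two flags are set by hand. As a rough sketch, they could instead be derived from measured error rates. The function name `diagnose`, the `target_error` baseline, and the `tolerance` threshold are all assumptions made for this illustration, and the error values are made-up numbers.

```python
def diagnose(train_error, dev_error, target_error=0.0, tolerance=0.02):
    """Return (high_bias, high_variance) from measured error rates.

    target_error and tolerance are assumed values chosen for illustration.
    """
    # High bias: the model does not even fit the training set well.
    high_bias = (train_error - target_error) > tolerance
    # High variance: the dev error is much worse than the train error.
    high_variance = (dev_error - train_error) > tolerance
    return high_bias, high_variance

# Illustrative (train error, dev error) pairs
examples = {
    "high variance": diagnose(0.01, 0.11),
    "high bias": diagnose(0.15, 0.16),
    "both": diagnose(0.15, 0.30),
    "neither": diagnose(0.005, 0.01),
}
```

The returned flags can be fed straight into the `if high_bias:` / `if high_variance:` checks from the recipe.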


## Regularization

Let us introduce regularization using Logistic Regression as an example. In the formulas below, $\lambda$ is the regularization parameter.

L2 regularization:
\begin{align}
J(W, b) = \frac{1}{m}\sum_{i = 1}^{m}L(y^{(i)}, \hat{y}^{(i)}) + \frac{\lambda}{2m}\sum_{j = 1}^{n_x} w_j^2
\end{align}

L1 regularization:
\begin{align}
J(W, b) = \frac{1}{m}\sum_{i = 1}^{m}L(y^{(i)}, \hat{y}^{(i)}) + \frac{\lambda}{m}\sum_{j = 1}^{n_x} \lvert w_j \rvert
\end{align}

In general, L2 regularization is used much more often. L1 regularization is used when we want to make $W$ sparse.

In the case of Neural Networks, L2 regularization looks like this:
\begin{align}
J(W_1, b_1, W_2, b_2 ... W_L, b_L) = \frac{1}{m}\sum_{i = 1}^{m}L(y^{(i)}, \hat{y}^{(i)}) + \frac{\lambda}{2m}\sum_{l = 1}^{L} {\lVert W_l \rVert}_F^2
\end{align}

where
\begin{align}
{\lVert W_l \rVert}_F^2 = \sum_{i = 1}^{n_l}\sum_{j = 1}^{n_{l - 1}} w_{ij}^2
\end{align}

because each $W_l$ has dimension $n_l$ x $n_{l-1}$.

Note that for a matrix this norm (the square root of the sum of squared entries) is called the Frobenius norm.
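
The neural-network penalty term can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function names and the `lam` argument are made up for this example. `lam` is the regularization parameter that scales the penalty; setting it to 1 gives the unscaled sum of squared weights divided by $2m$.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """Return lam / (2 m) times the sum of squared entries of all weight matrices."""
    return lam / (2 * m) * sum(np.sum(np.square(W)) for W in weights)

def regularized_cost(data_cost, weights, lam, m):
    """Add the L2 penalty to an already averaged data loss."""
    return data_cost + l2_penalty(weights, lam, m)

# Two small weight matrices with hand-checkable values
W1 = np.array([[1.0, 2.0], [3.0, 4.0]])   # squared entries sum to 30
W2 = np.array([[1.0, -1.0]])              # squared entries sum to 2
penalty = l2_penalty([W1, W2], lam=0.5, m=4)   # 0.5 / 8 * 32 = 2.0
total = regularized_cost(0.7, [W1, W2], lam=0.5, m=4)
```

The biases are left out of the penalty here, matching the formulas above, which only sum over the weights.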

## Why Regularization Reduces Overfitting

A couple of good points here; worth watching again:
https://www.youtube.com/watch?v=NyG-7nRpsW8&index=5&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc

## Dropout regularization

## Understanding dropout

## Other regularization methods

## Normalizing Inputs

Let $X = [x_1, x_2, ..., x_n] \in R^{m \times n}$ be our data, where $x_1, ...x_n$ are feature vectors. You can imagine that our data looks like this for $n = 2$:

<img src="imgs/normalization1.png">

We first subtract the mean of each feature, $X - \mu(X) = [x_1 - \frac{1}{m}\sum_{i = 1}^{m}x_{i, 1}, .. , x_n - \frac{1}{m}\sum_{i = 1}^{m}x_{i, n}]$, which centers our data:

<img src="imgs/normalization2.png">

Then we divide by the standard deviation of each feature, $\frac{X - \mu(X)}{\sigma(X)} = [\frac{x_1 - \mu_1}{\sigma_1}, .. , \frac{x_n - \mu_n}{\sigma_n}]$, where $\mu_j = \frac{1}{m}\sum_{i = 1}^{m}x_{i, j}$ and $\sigma_j = \sqrt{\frac{1}{m}\sum_{i = 1}^{m}(x_{i, j} - \mu_j)^2}$, so that the variance of every feature is now equal to 1:

<img src="imgs/normalization3.png">

__Variance__ measures how far a set of (random) numbers is spread out from its average value (mean). The __standard deviation__ is the square root of the variance.
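
The two normalization steps (subtract the mean, then divide by the standard deviation) can be sketched in NumPy. The helper `normalize` and the synthetic data are assumptions made for this illustration. Reusing the training-set `mu` and `sigma` on other data is a common practice, so that all sets go through the same transformation.

```python
import numpy as np

def normalize(X, mu=None, sigma=None):
    """Center each column (feature) of X and scale it to unit variance.

    If mu / sigma are given, they are reused instead of being recomputed.
    """
    if mu is None:
        mu = X.mean(axis=0)      # per-feature mean over the m examples
    if sigma is None:
        sigma = X.std(axis=0)    # per-feature standard deviation (1/m version)
    return (X - mu) / sigma, mu, sigma

# Synthetic data: m = 1000 examples, n = 2 features on very different scales
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.5], size=(1000, 2))
X_train_norm, mu, sigma = normalize(X_train)

# Apply the training statistics to new data
X_dev = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.5], size=(200, 2))
X_dev_norm, _, _ = normalize(X_dev, mu, sigma)
```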

## Vanishing and exploding gradients

When training a very deep neural network, we can run into gradients that are too small (vanishing) or too big (exploding), which makes training difficult. The problem of vanishing / exploding gradients can be partially solved by careful weight initialization, depending on which activation function you are using:

1) If you are using $ReLU$, weights should be initialized with He et al. initialization (https://arxiv.org/abs/1502.01852), which basically sets the variance of $w$ to $\frac{2}{n_{l-1}}$, where $n_{l-1}$ is the number of inputs to layer $l$:


In [10]:
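import numpy as np
layer_size = [5, 4]  # number of units in layer 0 (input) and layer 1
l = 1
# He initialization: standard normal samples scaled so that Var(w) = 2 / n_{l-1}
w = np.random.randn(layer_size[l], layer_size[l-1]) * np.sqrt(2 / layer_size[l-1])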

2) If you are using $Tanh$, weights should be initialized with Xavier initialization (http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf), which sets the variance of $w$ to $\frac{1}{n_{l-1}}$:


In [11]:
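import numpy as np
layer_size = [5, 4]  # number of units in layer 0 (input) and layer 1
l = 1
# Xavier initialization: standard normal samples scaled so that Var(w) = 1 / n_{l-1}
w = np.random.randn(layer_size[l], layer_size[l-1]) * np.sqrt(1 / layer_size[l-1])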

or with this variant, which sets the variance of $w$ to $\frac{2}{n_{l-1} + n_l}$ (Bengio initialization):

In [12]:
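import numpy as np
layer_size = [5, 4]  # number of units in layer 0 (input) and layer 1
l = 1
# Variance depends on both the number of inputs and outputs: Var(w) = 2 / (n_{l-1} + n_l)
w = np.random.randn(layer_size[l], layer_size[l-1]) * np.sqrt(2 / (layer_size[l] + layer_size[l-1]))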
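
The three cells above initialize a single layer. As a hedged sketch, the same scaling rules can be wrapped in one helper that loops over all layers. The function name `init_weights` and the scheme labels `"he"`, `"xavier"`, and `"bengio"` are names chosen for this illustration, not part of any library.

```python
import numpy as np

def init_weights(layer_size, scheme="he", seed=0):
    """Return a list of weight matrices, one per layer l = 1 .. L."""
    rng = np.random.default_rng(seed)
    weights = []
    for l in range(1, len(layer_size)):
        n_in, n_out = layer_size[l - 1], layer_size[l]
        if scheme == "he":          # suggested above for ReLU
            var = 2 / n_in
        elif scheme == "xavier":    # suggested above for Tanh
            var = 1 / n_in
        elif scheme == "bengio":    # variant using inputs and outputs
            var = 2 / (n_in + n_out)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        # Standard normal samples scaled to the target variance
        weights.append(rng.standard_normal((n_out, n_in)) * np.sqrt(var))
    return weights

weights = init_weights([500, 400, 300], scheme="he")
shapes = [W.shape for W in weights]
```

For large layers, the empirical variance of each matrix should land close to the target value, which is an easy sanity check after initialization.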

## Numerical approximation of gradients
## Gradient checking
## Gradient checking implementation notes

## Complete week 2 was skipped

## Tuning process

* Don't use grid search for hyperparameter search. Instead, sample hyperparameter values at random over their ranges.
* Consider zooming in on the region that gives the best results: repeat the random search with the ranges narrowed to that region
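
The two bullets above can be sketched as a random search that narrows its ranges around the best point after each round. Everything here is an assumption made for illustration: the helper names, the uniform sampling, the `shrink` factor, and the toy score function standing in for "train a model and measure dev performance".

```python
import numpy as np

def sample(ranges, n, rng):
    """Draw n random points uniformly from the box described by ranges."""
    return [{k: rng.uniform(lo, hi) for k, (lo, hi) in ranges.items()}
            for _ in range(n)]

def zoom(ranges, best, shrink=0.25):
    """Return narrower ranges centered on best, clipped to the old ranges."""
    new = {}
    for k, (lo, hi) in ranges.items():
        half = shrink * (hi - lo) / 2
        new[k] = (max(lo, best[k] - half), min(hi, best[k] + half))
    return new

def coarse_to_fine(score_fn, ranges, n=25, rounds=3, seed=0):
    """Random search with zoom-in; higher scores are better."""
    rng = np.random.default_rng(seed)
    best, best_score = None, -np.inf
    for _ in range(rounds):
        for point in sample(ranges, n, rng):
            s = score_fn(**point)
            if s > best_score:
                best, best_score = point, s
        ranges = zoom(ranges, best)   # next round searches near the best point
    return best, best_score, ranges

def toy_score(alpha, beta):
    """Hypothetical stand-in for a real dev-set evaluation."""
    return -((alpha - 0.3) ** 2 + (beta - 0.7) ** 2)

start = {"alpha": (0.0, 1.0), "beta": (0.0, 1.0)}
best, best_score, final_ranges = coarse_to_fine(toy_score, start)
```

Unlike a grid, random sampling tries a different value of every hyperparameter at every point, so no trial is wasted repeating a value of the hyperparameter that matters most.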