## Train/Dev/Test sets

- It's impossible to get all your hyperparameters right on a new application on the first try
- So the idea is to go through the loop: $idea\rightarrow code\rightarrow experiment$
- You have to go through the loop many times to figure out your hyperparameters 
- Your data will be split into three parts: 
    - Training set (has to be the largest set)
    - Hold-out cross validation set/development or "dev" set
    - Testing set
- You build a model on the training set, then tune its hyperparameters on the dev set as much as possible. Once the model is ready, you evaluate it on the test set.
- The trend in split ratios (train/dev/test):
    - If the size of the dataset is $100$ to $1000000$: $60/20/20$
    - If the size of the dataset is $1000000$ to $\infty$: $98/1/1$ or $99.5/0.25/0.25$

**Make sure the dev and test sets come from the same distribution (and, where possible, from the same distribution as the training set).**

- The role of the dev set is to compare the models you've built and choose among them.
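
The split ratios above can be sketched with NumPy as follows. This is a minimal illustration, not a prescribed procedure: the helper name `split_indices` and its parameters are hypothetical, and it assumes the examples can be shuffled freely (no time ordering or grouping).

```python
import numpy as np

def split_indices(m, dev_frac, test_frac, seed=0):
    """Shuffle the indices 0..m-1 and split them into train/dev/test parts.

    `split_indices` is a hypothetical helper written for these notes.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(m)
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    n_train = m - n_dev - n_test          # the training set takes the rest
    train_idx = idx[:n_train]
    dev_idx = idx[n_train:n_train + n_dev]
    test_idx = idx[n_train + n_dev:]
    return train_idx, dev_idx, test_idx

# Small dataset: 60/20/20
train_idx, dev_idx, test_idx = split_indices(1000, dev_frac=0.2, test_frac=0.2)
print(len(train_idx), len(dev_idx), len(test_idx))  # 600 200 200
```

For a very large dataset you would pass much smaller fractions, for example `dev_frac=0.01, test_frac=0.01` for a $98/1/1$ split.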

## Bias/Variance

- If your model is simple (underfitting), it has a "high bias"
- If your model is complicated (overfitting), then it has a "high variance."
- Your model will be alright if you balance the bias/variance
<img src="screenshot/1.PNG" style="width:600px;height:350px;">


## Basic Recipe for Machine Learning

- If your algorithm has a high bias (make the model more complicated)
    - Try to make your NN bigger (size of hidden units, number of layers)
    - Try a different model that is suitable for your data
    - Try to run it longer
    - Different (Advanced) optimization algorithms

- If your algorithm has a high variance
    - More data
    - Try regularization
    - Try a different model that is suitable for your data

- Keep applying the two points above until you have both low bias and low variance

- In the earlier days before deep learning, there was a "bias-variance tradeoff". But now, since you have more options/tools for solving the bias and variance problems, it is possible to have low bias and low variance.

- Training a bigger neural network almost never hurts, apart from the extra computational time, as long as you regularize it properly.
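
The recipe can be read as a small decision procedure. The sketch below assumes that bias is judged from the training error and variance from the gap between dev and training error; that reading, the function name `diagnose`, and both thresholds are assumptions made for illustration, not values given in these notes.

```python
def diagnose(train_err, dev_err, acceptable_err=0.02, gap_tol=0.02):
    """Suggest next steps from train/dev error rates.

    Hypothetical helper: `acceptable_err` and `gap_tol` are made-up
    thresholds that you would set for your own problem.
    """
    suggestions = []
    if train_err > acceptable_err:
        # The model does not even fit the training set well: high bias
        suggestions.append("high bias: bigger network, train longer, "
                           "different model, better optimizer")
    if dev_err - train_err > gap_tol:
        # The model fits the training set much better than the dev set: high variance
        suggestions.append("high variance: more data, regularization, "
                           "different model")
    return suggestions

print(diagnose(train_err=0.15, dev_err=0.16))  # only the high-bias suggestion
print(diagnose(train_err=0.01, dev_err=0.11))  # only the high-variance suggestion
```

An empty list means neither check fired under the chosen thresholds.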


## Regularization

- Adding regularization to NN will help it reduce variance (overfitting)
    - Squared $L_{2}$ norm: $||w||_{2}^{2}=\sum_{j=1}^{n_{x}}w_{j}^{2}=w^{T}w$
    - $L_{1}$ norm: $||w||_{1}=\sum_{j=1}^{n_{x}}|w_{j}|$

- Regularization for logistic regression 
    - The regular cost function that we want to minimize is $J(w,b)=\frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})$
    - The $L_2$-regularized cost function is $J(w,b)=\frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}||w||_{2}^{2}$. $L_{2}$ regularization is used much more often.
    - The $L_1$-regularized cost function is $J(w,b)=\frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}||w||_{1}$. $L_{1}$ regularization drives many of the $w$ values to zero, which makes the model sparser and therefore smaller.
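
A minimal sketch of the $L_2$-regularized cost for logistic regression. It assumes the per-example loss $L$ is the usual cross-entropy and that `X` has shape `(n_x, m)`; the function names and the small `eps` guard inside the logarithms are illustrative choices, not part of the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_regularized_cost(w, b, X, y, lam):
    """Cross-entropy cost plus (lam / 2m) * ||w||_2^2.

    Shapes (assumed): X is (n_x, m), y is (1, m), w is (n_x, 1).
    """
    m = X.shape[1]
    y_hat = sigmoid(w.T @ X + b)
    eps = 1e-12  # guards against log(0); an implementation detail
    cross_entropy = -np.mean(y * np.log(y_hat + eps)
                             + (1 - y) * np.log(1 - y_hat + eps))
    l2_penalty = (lam / (2 * m)) * np.sum(np.square(w))
    return cross_entropy + l2_penalty

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 10))
y = (rng.random((1, 10)) > 0.5).astype(float)
w = np.ones((3, 1))

# The difference between the regularized and unregularized cost is the
# penalty: lam / (2m) * ||w||^2 = 2 / 20 * 3 = 0.3
penalty = (l2_regularized_cost(w, 0.0, X, y, lam=2.0)
           - l2_regularized_cost(w, 0.0, X, y, lam=0.0))
print(round(penalty, 6))  # 0.3
```

Note that only `w` is penalized; the bias `b` is left out of the regularization term, matching the formulas above.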

- Regularization for NN
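    - The cost function sums the squared Frobenius norm of every weight matrix: $J=\frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^{L}||w^{[l]}||_{F}^{2}$
    - Backward propagation: $dw^{[l]}=(\text{from back propagation})+\frac{\lambda}{m}w^{[l]}$
    - Weight update with learning rate $\alpha$: $w^{[l]}=(1-\frac{\alpha\lambda}{m})w^{[l]}-\alpha\cdot(\text{from back propagation})$
    - In practice, this penalizes large weights and effectively limits the freedom in the model
    - The new factor $(1-\frac{\alpha\lambda}{m})$ shrinks each weight in proportion to its size, which is why $L_2$ regularization is also called weight decay.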
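
The two ways of writing the update (adding $\frac{\lambda}{m}w^{[l]}$ to the gradient, or shrinking the weights first) are algebraically the same. A quick numerical check, using random stand-ins for one layer's weights and its backprop gradient (all values here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))            # weights of one layer (random stand-in)
dW_backprop = rng.standard_normal((3, 4))  # stand-in for the gradient from backprop
learning_rate, lam, m = 0.1, 0.7, 50       # arbitrary illustrative values

# Form 1: add the regularization term to the gradient, then take a normal step
dW = dW_backprop + (lam / m) * W
W_form1 = W - learning_rate * dW

# Form 2: "weight decay" view, shrink W first, then apply the plain gradient
W_form2 = (1 - learning_rate * lam / m) * W - learning_rate * dW_backprop

print(np.allclose(W_form1, W_form2))  # True
```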

## Why does regularization reduce overfitting?

- Intuition 1:
    - If $\lambda$ is too large: a lot of $w$'s will be close to zero, which makes the NN simpler (you can think of it as behaving closer to logistic regression)
    - If $\lambda$ is well chosen, it will just shrink the weights that make the neural network overfit

- Intuition 2 (with `tanh` activation function):
    - If $\lambda$ is too large, the $w$'s will be small (close to zero), so $z$ stays in the linear part of the `tanh` activation function. The activations go from non-linear to *roughly* linear, which makes the NN a roughly linear classifier.
    - If $\lambda$ is well chosen, it will just make some of the `tanh` activations roughly linear, which will prevent overfitting.
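
To see the "linear part of `tanh`" concretely, the sketch below compares `tanh(z)` with `z` itself for small and for large inputs. The input ranges are arbitrary choices for illustration.

```python
import numpy as np

z_small = np.linspace(-0.1, 0.1, 5)   # what z looks like when the weights are tiny
z_large = np.linspace(-3.0, 3.0, 5)   # what z can look like with large weights

# Largest deviation of tanh(z) from the straight line y = z in each range
small_gap = np.max(np.abs(np.tanh(z_small) - z_small))
large_gap = np.max(np.abs(np.tanh(z_large) - z_large))

print(small_gap)  # well below 1e-3: tanh is almost the identity here
print(large_gap)  # about 2: tanh saturates and is clearly non-linear
```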

**Implementation tip**: if you implement gradient descent, one way to debug it is to plot the cost function `J` against the number of iterations; you want to see `J` decrease monotonically after every iteration. With regularization, plot the new definition of `J` (including the regularization term). If you plot the old definition of `J` (no regularization), you might not see it decrease monotonically.
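
A minimal sketch of this check on a toy problem: batch gradient descent for $L_2$-regularized logistic regression, recording the regularized `J` at every iteration. The data, $\lambda$, learning rate, and iteration count are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_x, m, lam, learning_rate = 2, 200, 1.0, 0.1
X = rng.standard_normal((n_x, m))
y = (X[0:1] + X[1:2] > 0).astype(float)   # toy labels, shape (1, m)
w, b = np.zeros((n_x, 1)), 0.0

costs = []
for _ in range(100):
    y_hat = sigmoid(w.T @ X + b)
    data_cost = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # Record the regularized J, not just the data term
    costs.append(data_cost + (lam / (2 * m)) * np.sum(w ** 2))
    dz = y_hat - y
    dw = (X @ dz.T) / m + (lam / m) * w   # gradient includes the L2 term
    db = np.mean(dz)
    w -= learning_rate * dw
    b -= learning_rate * db

# With a small enough learning rate, the regularized J never goes up
is_monotone = all(c2 <= c1 for c1, c2 in zip(costs, costs[1:]))
print(is_monotone)
```

In practice you would plot `costs` against the iteration index with your plotting library of choice rather than only checking the flag.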

## Dropout regularization

- The dropout regularization eliminates some neurons/weights on each iteration based on a probability. The most common technique to implement dropout is called "inverted dropout".

<img src="screenshot/2.PNG" style="width:600px;height:350px;">

```python
import numpy as np

keep_prob = 0.8  # probability of keeping a unit, 0 < keep_prob <= 1
a3 = np.random.randn(4, 5)  # placeholder for the layer-3 activations, shape (units, examples)

# this code is only for layer 3
# entries whose random number is below keep_prob are kept (about 80%); the other 20% are dropped
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
a3 = np.multiply(a3, d3)  # zero out the dropped units

# scale a3 up so that the expected value of the output is not reduced
# (this is the "inverted" part: it solves the scaling problem)
a3 = a3 / keep_prob
```

- The mask $d^{[l]}$ is used in both forward and backward propagation and is the same for both, but it is different for each iteration (pass) and for each training example.

**At test time we don't use dropout. Applying dropout at test time would add noise to the predictions.**
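
The points above (same mask for forward and backward propagation, no dropout at test time) can be sketched as two small helpers. The function names and the `training` flag are hypothetical, chosen for this illustration.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Inverted dropout on activations `a`. Returns (output, mask)."""
    if not training or keep_prob >= 1.0:
        return a, None                     # test time: no mask, no scaling
    d = rng.random(a.shape) < keep_prob    # True means "keep this unit"
    return a * d / keep_prob, d

def dropout_backward(da, d, keep_prob):
    """Apply the same mask and the same scaling to the gradient."""
    if d is None:
        return da
    return da * d / keep_prob

rng = np.random.default_rng(0)
a3 = np.ones((5, 1000))                    # placeholder activations

a3_train, d3 = dropout_forward(a3, keep_prob=0.8, rng=rng)
da3 = dropout_backward(np.ones_like(a3), d3, keep_prob=0.8)
a3_test, _ = dropout_forward(a3, keep_prob=0.8, rng=rng, training=False)

print(a3_train.mean())              # close to 1.0: expected value is preserved
print(np.array_equal(a3_test, a3))  # True: nothing is dropped at test time
```

A fresh mask is drawn on every call to `dropout_forward`, which is what makes it differ from one iteration to the next.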

## Understanding dropout

- The intuition was that dropout randomly knocks out units in your network. So it's as if on every iteration you're working with a smaller NN, and so using a smaller NN seems like it should have a regularizing effect.
- Another intuition: a unit can't rely on any one feature, so it has to spread out the weights
- It's possible to show that dropout has a similar effect to $L_{2}$ regularization.
- Dropout can have different `keep_prob` per layer.
- The `keep_prob` of the input layer has to be near $1$ (or exactly $1$, meaning no dropout) because you don't want to eliminate a lot of features.
- If you're more worried about some layers overfitting than others, you can set a lower `keep_prob` for those layers. The downside is that this gives you even more hyperparameters to search over using cross-validation. An alternative is to apply dropout to some layers and not to others.
- A lot of researchers are using dropout with computer vision (CV) because they have a very big input size and almost never have enough data, so overfitting is the usual problem. And dropout is a regularization technique to prevent overfitting. 
- A downside of dropout is that the cost function `J` is no longer well defined, so it is hard to debug by plotting `J` per iteration. To work around that, turn off dropout (set every `keep_prob` to $1$), run the code and check that `J` decreases monotonically, and then turn dropout on again.


## Other regularization methods

- Data augmentation: for example, with computer vision data:
    - You can flip all your pictures horizontally, which gives you `m` more data instances;
    - You could also apply a random position and rotation to an image to get more data.
    - In OCR, you can impose random rotations and distortions on digits/letters.
    - New data obtained this way isn't as good as real independent data, but it can still be used as a regularization technique.
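
A minimal sketch of the horizontal-flip augmentation, assuming the images are stored as a NumPy array of shape `(m, height, width, channels)`. The helper name and the tiny example array are made up for illustration.

```python
import numpy as np

def augment_with_horizontal_flips(images):
    """Return the original images followed by their mirrored copies.

    Assumes `images` has shape (m, height, width, channels).
    """
    flipped = images[:, :, ::-1, :]        # reverse the width axis
    return np.concatenate([images, flipped], axis=0)

# Two tiny 2x3 single-channel "images" whose pixel values are just counters
images = np.arange(2 * 2 * 3 * 1).reshape(2, 2, 3, 1)
augmented = augment_with_horizontal_flips(images)

print(augmented.shape)        # (4, 2, 3, 1)
print(images[0, 0, :, 0])     # [0 1 2]
print(augmented[2, 0, :, 0])  # [2 1 0]
```

The labels have to be duplicated in the same order, and flipping only makes sense when the mirrored image keeps the same label.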

- Early stopping
    - In this technique we plot the training set and the dev set cost together for each iteration. At some iteration the dev set cost will stop decreasing and will start increasing.
    - We pick the iteration at which the dev set cost is at its minimum, where the training cost and dev cost are jointly best (low training cost with the lowest dev cost)
    - The instructor prefers $L_{2}$ regularization to early stopping, because early stopping tries to minimize the cost function and avoid overfitting at the same time, which contradicts the orthogonalization approach
    - The advantage of early stopping is that you don't need to search for a hyperparameter as in other regularization approaches (like $\lambda$ in $L_{2}$ regularization)
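
One common way to implement the stopping rule is sketched below. The `patience` parameter (how many non-improving iterations to tolerate before stopping) is an assumption added for this sketch, not something the notes prescribe, and the dev costs are made-up numbers.

```python
def early_stopping_point(dev_costs, patience=3):
    """Return the index of the best dev cost seen before stopping.

    Stops scanning once the dev cost has not improved for `patience`
    consecutive iterations (`patience` is a hypothetical knob).
    """
    best_iter, best_cost = 0, float("inf")
    for i, cost in enumerate(dev_costs):
        if cost < best_cost:
            best_iter, best_cost = i, cost
        elif i - best_iter >= patience:
            break                          # dev cost has stopped improving
    return best_iter

# Dev cost falls, then rises; the final 0.40 is never reached
dev_costs = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.40]
print(early_stopping_point(dev_costs))  # 3
```

In a real training loop you would also save the weights at the best iteration so that you can restore them after stopping.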

<img src="screenshot/3.PNG" style="width:600px;height:350px;">

- Model ensembles
    - Algorithm: (1) Train multiple independent models; (2) At test time average their results
    - It can get you roughly an extra $2\%$ in performance
    - It reduces the generalization error
    - You can also save several snapshots of your NN during training, ensemble them, and average their results
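
The averaging step at test time can be sketched like this, assuming each model outputs class probabilities of shape `(m, n_classes)`. The three probability arrays are invented numbers standing in for real model outputs.

```python
import numpy as np

def ensemble_predict(prob_list):
    """Average per-model class probabilities and pick the most likely class."""
    mean_probs = np.mean(np.stack(prob_list, axis=0), axis=0)
    return mean_probs, np.argmax(mean_probs, axis=1)

# Made-up outputs of three models for two examples and two classes
p1 = np.array([[0.6, 0.4], [0.3, 0.7]])
p2 = np.array([[0.4, 0.6], [0.2, 0.8]])
p3 = np.array([[0.8, 0.2], [0.1, 0.9]])

mean_probs, labels = ensemble_predict([p1, p2, p3])
print(labels)  # [0 1]
```

Here the second model alone would misclassify the first example, but the average does not.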

## Normalizing inputs

- If you normalize your inputs this will speed up the training process a lot

- Normalization consists of these steps:
    - Get the mean of the training set;
    - Subtract the mean from each input;
    - Get the variance of the training set;
    - Normalize the variance (divide each input by the standard deviation).

- These steps should be applied to training, dev, and testing sets (**but using mean and variance of the training set**)

- Why normalize?
    - If we don't normalize the inputs, the cost function will be elongated: steep along some directions and shallow along others, so optimizing it takes a long time
    - If we do normalize, the opposite occurs: the shape of the cost function becomes more symmetric, and we can use a larger learning rate alpha, so the optimization is faster.
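
A minimal sketch of the normalization steps, assuming inputs are stored with shape `(n_x, m)`. The helper names and the small `eps` (to avoid dividing by zero for constant features) are illustrative additions. The key point is that the statistics come from the training set only and are reused for the dev and test sets.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.sqrt(np.var(X_train, axis=1, keepdims=True)) + eps
    return mu, sigma

def normalize(X, mu, sigma):
    """Apply training-set statistics to any split (train, dev, or test)."""
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(2, 500))  # made-up data
X_dev = rng.normal(loc=5.0, scale=3.0, size=(2, 100))

mu, sigma = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mu, sigma)
X_dev_norm = normalize(X_dev, mu, sigma)   # same mu and sigma as for training

print(X_train_norm.mean(axis=1))  # approximately [0, 0]
print(X_train_norm.std(axis=1))   # approximately [1, 1]
```

The dev set will not come out with exactly zero mean and unit variance, which is expected because it is scaled with the training statistics.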

## Vanishing/Exploding gradients

<img src="screenshot/4.PNG" style="width:600px;height:350px;">