## Train/dev/test sets

Regarding split size:
- Traditional machine learning typically used 70/30 (train/test) or 60/20/20 (train/dev/test) splits.
- In the big-data era, however, datasets are giant, so we can use 98/1/1 or 99/0.5/0.5 splits.
- The full dataset is so large that we don't need 20% of it to get coverage. 1% is more than enough: 1% of 1,000,000 examples is still 10,000 examples.
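
A minimal sketch of a 98/1/1 split, assuming the examples can be addressed by index. The helper name `split_indices`, its parameters, and the dataset size of 1,000,000 are made up for illustration:

```python
import numpy as np

def split_indices(n_examples, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle example indices and cut them into train/dev/test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_dev = int(round(n_examples * dev_frac))
    n_test = int(round(n_examples * test_frac))
    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]   # everything that is left
    return train_idx, dev_idx, test_idx

train_idx, dev_idx, test_idx = split_indices(1_000_000)
print(len(train_idx), len(dev_idx), len(test_idx))  # 980000 10000 10000
```

Shuffling before cutting only makes sense when all examples come from the same source. It does not address the mismatched-distribution case.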

Mismatch train/test:

- Example: the training set comes from webpages, while the dev/test sets come from an app camera.
- Rule of thumb: the dev and test sets should come from the same distribution. Then you can catch the issues at the dev stage.


If you iterate on the model based on its results on the `test` set, the test error can no longer be called an unbiased estimate of the model's performance.

## Bias/Variance

Important to master!! The tradeoff matters less in the deep learning era.

- High bias: underfitting (cannot even perform well on the training set)
- High variance: overfitting (huge performance drop going from the training set to the dev/test set)

Ideally we want low bias and low variance. There are cases where a model has both high bias and high variance.
This can happen when the model is complex, but complex in a bad way: performance is low on the training set and gets even worse on the dev set.


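Optimal (Bayes) error: the lowest error rate that can be achieved on the task, so it is the error we accept as a "good" error rate.

- We need the optimal error as a reference to judge whether a model is underfitting (performing badly). A 15% training error is high bias if the optimal error is close to 0%, but not if the optimal error is itself around 15%.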
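
A rough sketch of how the three error rates can be turned into a diagnosis. The function name `diagnose` and the threshold `tol` are illustrative assumptions, not a standard rule:

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.02):
    """Return (high_bias, high_variance) from error rates given as fractions.

    tol is an arbitrary illustrative threshold for "a large gap".
    """
    bias_gap = train_err - bayes_err      # distance from the optimal error
    variance_gap = dev_err - train_err    # drop from training set to dev set
    return bias_gap > tol, variance_gap > tol

print(diagnose(0.01, 0.11))    # (False, True)  -> high variance
print(diagnose(0.15, 0.16))    # (True, False)  -> high bias
print(diagnose(0.15, 0.30))    # (True, True)   -> both
print(diagnose(0.005, 0.01))   # (False, False) -> neither
```

With `bayes_err=0.15`, the second case would no longer count as high bias, which is why the optimal error is needed as a reference.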




## Basic Recipe for ML 

1. High bias? (looking at the training set performance only) Try: 
- Bigger network
- Train longer
- NN architecture search

Try until you achieve low bias

2. Next, high variance? (looking at the dev set performance) Try: 

- More data
- Regularization
- NN architecture search (?)

Try these until the variance is low, going back to step 1 after each change to check that the bias is still low.


For example, if you have high bias, training with more data would probably not help.



### Why no bias/variance tradeoff in DL era?

- In the old days, lowering one usually meant increasing the other.
- More data usually lowers the variance without hurting the bias much.
- A larger model usually lowers the bias without hurting the variance much, as long as it is regularized properly.
- So in the DL era we don't always have to trade one off against the other.
- This is one of the main reasons why supervised deep learning has been so successful.

## Regularization

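One of the main things to try when the variance is high is regularization.

Logistic regression case (a single weight vector $w$ and bias $b$):

- $L_2$ regularization: add a penalty on $\|w\|_2^2 = \sum_{j=1}^{n_x} w_{j}^2$ to the cost, scaled by $\lambda$.
    - Why only $w$ and not $b$? Because $b$ is a single parameter while $w$ holds almost all of the parameters, so regularizing $b$ makes little difference.
- $L_1$ regularization: penalizes $\sum_{j=1}^{n_x} |w_j|$ instead. $w$ will end up being sparse (a lot of zeros).
- $\lambda$ is the regularization parameter, a hyperparameter we have to tune.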
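
A small sketch of the regularized cost for logistic regression. The $\frac{\lambda}{2m}$ scaling of the penalty is a common convention and an assumption here (the notes above only give the sum), and the data is random, purely for illustration:

```python
import numpy as np

def l2_regularized_cost(w, b, X, y, lam):
    """Cross-entropy cost plus (lam / (2m)) * ||w||_2^2.

    X: (n_x, m) features, y: (m,) labels in {0, 1}, w: (n_x,), b: scalar.
    """
    m = X.shape[1]
    z = w @ X + b
    a = 1.0 / (1.0 + np.exp(-z))                    # sigmoid
    eps = 1e-12                                     # avoids log(0)
    cross_entropy = -np.mean(y * np.log(a + eps) + (1 - y) * np.log(1 - a + eps))
    l2_penalty = (lam / (2 * m)) * np.sum(w ** 2)   # b is left out of the penalty
    return cross_entropy + l2_penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
y = np.array([0, 1, 1, 0, 1])
w = np.array([0.5, -1.0, 2.0])

cost_plain = l2_regularized_cost(w, 0.1, X, y, lam=0.0)
cost_reg = l2_regularized_cost(w, 0.1, X, y, lam=1.0)
print(cost_reg - cost_plain)   # the penalty alone: (1 / 10) * 5.25 = 0.525
```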


Neural network case:

- The penalty sums over the weight parameters of all layers.
- When the sum is over matrix elements, it is called the (squared) Frobenius norm rather than the $L_2$ norm: the elementwise squares summed up.
- $L_2$ regularization is also called weight decay, because each gradient step makes the $W$ matrix slightly smaller.
    - This happens because the update multiplies the weight matrix by a number slightly smaller than 1 before subtracting the usual gradient term.
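
A quick numerical check of the weight decay view, as a sketch. It assumes the penalty is $\frac{\lambda}{2m}\|W\|_F^2$, whose gradient is $\frac{\lambda}{m}W$. The values of `alpha`, `lam`, `m` and the random "backprop" gradient are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
dW_data = rng.normal(size=(4, 3))   # stand-in for the backprop gradient of the data loss
alpha, lam, m = 0.1, 0.7, 50        # learning rate, regularization strength, batch size

# Form 1: add the penalty gradient (lam / m) * W, then take a gradient step.
W_step = W - alpha * (dW_data + (lam / m) * W)

# Form 2: shrink W by a factor slightly below 1, then take the plain step.
decay = 1 - alpha * lam / m
W_decay = decay * W - alpha * dW_data

print(decay)                          # a factor just below 1
print(np.allclose(W_step, W_decay))   # True: the two forms are the same update
```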
    
    
 

## Why does regularization reduce overfitting?

- Imagine the tanh graph. It looks like a linear function around 0.

- So if you introduce regularization, which pushes your weights close to 0, the pre-activations stay in that linear region and every unit becomes almost linear. The whole network then behaves almost like a linear function, which is much simpler, and thus overfits less.

- We cannot get crazy decision boundaries.
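
A tiny sketch of the "tanh is linear around 0" claim. The two input ranges are arbitrary choices standing in for small and large weights:

```python
import numpy as np

z_small = np.linspace(-0.1, 0.1, 5)   # pre-activations when the weights are small
z_large = np.linspace(-3.0, 3.0, 5)   # pre-activations when the weights are large

# Largest gap between tanh(z) and the straight line z in each range.
gap_small = np.max(np.abs(np.tanh(z_small) - z_small))
gap_large = np.max(np.abs(np.tanh(z_large) - z_large))

print(gap_small)   # tiny: tanh is nearly the identity here
print(gap_large)   # large: tanh saturates and is clearly non-linear
```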

## Dropout Regularization


How to implement dropout?

### Inverted dropout

1) Create a random mask with the same shape as the activations that marks which units are kept, and multiply the activations by it.
2) Divide the result by the keep probability so that the expected value of the activations does not go down!!

**Test Time**

- No dropout!!
- Since inverted dropout already scaled the values up during training, their expected value stays the same when we turn dropout off, so no extra scaling is needed and the values do NOT become too high.
- In the old days, some implementations of dropout left out this inverted division.
- That caused problems at inference time, because the values then had to be scaled down by the keep probability whenever dropout was turned off.

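```python
import numpy as np

keep_prob = 0.5

# Dropout mask: True where a unit is kept (happens with probability keep_prob)
d3 = np.random.rand(3, 2) < keep_prob
a3 = np.random.rand(3, 2)  # stand-in for the activations of layer 3

# Zero out the dropped units
a3 = np.multiply(a3, d3)

# Inverted scaling: dividing by keep_prob scales the kept values up
a3 /= keep_prob

a3
```

```
array([[0.60380948, 0.        ],
       [0.1673716 , 0.76804806],
       [0.00951474, 1.21607543]])
```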
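
The training/test behaviour described above can be wrapped in one function. This is only a sketch: `dropout_forward` and its `training` flag are hypothetical names, and the all-ones activations are chosen so the effect on the mean is easy to see:

```python
import numpy as np

def dropout_forward(a, keep_prob, training, rng):
    """Inverted dropout: mask and rescale in training, identity at test time."""
    if not training:
        return a                               # no dropout and no rescaling
    mask = rng.random(a.shape) < keep_prob     # True = keep this unit
    return a * mask / keep_prob                # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
train_out = dropout_forward(a, keep_prob=0.8, training=True, rng=rng)
test_out = dropout_forward(a, keep_prob=0.8, training=False, rng=rng)

print(train_out.mean())   # close to 1.0 thanks to the division by keep_prob
print(test_out.mean())    # exactly 1.0
```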

## Understanding Dropout

Intuition: a unit can't rely on any one feature, so it has to spread out its weights.

- For huge layers with many parameters, which overfit more easily, we can increase the dropout rate (lower keep probability).
- Dropping out input features is usually not a good idea.
- Dropout is a regularizer, so it is usually not needed unless the model is overfitting.
- With dropout, your cost function is not really well-defined anymore, since a different random set of units is dropped at each iteration.
- Good practice: turn off dropout first and confirm that the cost decreases monotonically, then repeat with dropout turned on.

## Other Regularization Methods

- Data augmentation: e.g. flip images horizontally to get extra training examples.
- Early stopping: stop training when the dev set error stops improving.
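
A minimal sketch of the early stopping rule. The `patience` parameter (how many non-improving epochs to tolerate) and the dev errors are assumptions made up for illustration:

```python
def early_stopping_epoch(dev_errors, patience=2):
    """Return the index of the best epoch, stopping once the dev error has
    not improved for `patience` consecutive epochs."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(dev_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            break   # dev error stopped improving
    return best_epoch

# Made-up dev errors, one per epoch.
dev_errors = [0.30, 0.22, 0.18, 0.17, 0.19, 0.21, 0.16]
print(early_stopping_epoch(dev_errors))   # 3
```

Note that the last epoch in this made-up run has an even lower error (0.16) that training never reaches, because it was stopped after epoch 5.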


### Downside of Early Stopping

**Orthogonalization Concept**

It is the idea that at any given time, you focus on only one specific goal: reduce overfitting OR optimize the cost function.

However, early stopping couples the two goals: it aims to `regularize` the weights by cutting short the other goal, minimizing the cost function. That makes it harder to focus on one task at a time.

Andrew Ng recommends $L_2$ as a more standard way of regularization for this specific reason.

Benefits of early stopping: no need for an additional hyperparameter search (e.g. over $\lambda$) or extra computation. It is kind of a win-win: less training and better generalization.

## Normalizing Inputs

1. Subtract the mean.
2. Normalize the variance: divide by the standard deviation.


This gives all input features roughly the same scale. You don't have one input with a huge variance and another with a very small variance.

3. **!!Important!!** On the test set, always use the same mean and standard deviation that you computed on the training set!! Do not calculate them separately.


Below are the steps:

1. Calculate the population mean and standard deviation:

$$
\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}\\
\mathrm{var} = \dfrac{\sum_{i=1}^{N} (x_i-\mu)^2}{N}\\
\sigma = \sqrt{\mathrm{var}}
$$


2. Normalize:

$$
x_i = \dfrac{x_i - \mu}{\sigma}
$$
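
The two steps can be sketched with NumPy as follows. The feature scales and the random data are made up. The point is that `mu` and `sigma` come from the training set only and are reused for the test set:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on very different scales; rows are examples.
X_train = rng.normal(loc=[0.5, 500.0], scale=[0.1, 200.0], size=(1000, 2))
X_test = rng.normal(loc=[0.5, 500.0], scale=[0.1, 200.0], size=(200, 2))

mu = X_train.mean(axis=0)      # per-feature mean, training set only
sigma = X_train.std(axis=0)    # per-feature population std (divides by N)

X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma   # reuse the training statistics

print(X_train_norm.mean(axis=0), X_train_norm.std(axis=0))   # ~0 and 1
print(X_test_norm.mean(axis=0), X_test_norm.std(axis=0))     # close to 0 and 1, not exact
```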

**Why normalize??**

If the scales are different, let's say one input ranges over (0, 1) and another over (0, 1000), it will be difficult to train: the cost surface becomes very elongated and gradient descent needs a small learning rate.
It just works way better when the inputs are centered around 0 and have a similar scale.

Even if your input features already have a similar scale, it almost never hurts to normalize your inputs.



## Vanishing/Exploding Gradients

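Vanishing case:


Imagine you have a 20-layer network and all weights are slightly lower than 1.
Then at each layer the activations of the neurons get smaller and smaller, shrinking exponentially with depth.
The gradients shrink in the same way on the way back, which means the effect of the lower layers gets super small (so vanishing gradients).


Exploding case:

The same thing happens in the other direction with values slightly higher than 1: activations and gradients grow exponentially with depth!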
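
A toy sketch of both cases, assuming a linear activation and the same diagonal weight matrix at every layer so that the arithmetic stays visible. The helper name `final_activation` is made up:

```python
import numpy as np

def final_activation(scale, n_layers, n_units=4):
    """Push a vector of ones through n_layers identical linear layers."""
    W = scale * np.eye(n_units)   # weights slightly below or above 1
    a = np.ones((n_units, 1))
    for _ in range(n_layers):
        a = W @ a
    return a[0, 0]                # equals scale ** n_layers

print(final_activation(0.9, 20))   # about 0.12
print(final_activation(1.1, 20))   # about 6.7
print(final_activation(0.9, 100))  # tiny: deeper networks make it much worse
```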


## Weight Initialization for Deep Networks

**Basically a partial solution to vanishing/exploding gradients**

Consider a single neuron. It is a weighted combination of all input features:

$$
z = \sum_{i=1}^{n} x_i w_i
$$

So if $n$ is a very big number, we want the $w_i$'s to be smaller, so that $z$ does not blow up.

As a solution, when initializing the weights we can scale them down by a factor that grows with the value of $n$.


If ReLU activation is used, it is recommended to multiply standard normal weights by $\sqrt{\dfrac{2}{n}}$ (He initialization), which gives them variance $\dfrac{2}{n}$. One possible reason for the factor 2 might be that ReLU zeroes out about half of the values.


If tanh is used (Xavier initialization), the recommended scale is:

$$
\sqrt{\dfrac{1}{n}}
$$


Some also use:

$$
\sqrt{\dfrac{2}{n + n_{+1}}}
$$

where $n$ is the size of the input of this layer and ${n_{+1}}$ just means the size of its output.



Overall, this keeps the scale of $z$ (and hence of the activations) neither too small nor too big, roughly around 1, so that vanishing and exploding gradients become less likely.



### More explanation

Ref: https://www.deeplearning.ai/ai-notes/initialization/index.html

To prevent the gradients of the network’s activations from vanishing or exploding, we will stick to the following rules of thumb:

1- The mean of the activations should be zero.   
2- The variance of the activations should stay the same across every layer.
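
The two rules of thumb can be checked empirically with a rough sketch like the one below. It assumes a stack of tanh layers of equal width. The helper `activation_stds`, the layer sizes, and the batch size are all made up for illustration:

```python
import numpy as np

def activation_stds(scale_fn, n_layers=10, n_units=256, seed=0):
    """Push random inputs through tanh layers and record each layer's activation std."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_units, 512))   # 512 random input examples
    stds = []
    for _ in range(n_layers):
        W = rng.normal(size=(n_units, n_units)) * scale_fn(n_units)
        a = np.tanh(W @ a)
        stds.append(a.std())
    return stds

naive = activation_stds(lambda n: 1.0)                 # unscaled standard normal weights
xavier = activation_stds(lambda n: np.sqrt(1.0 / n))   # Xavier scaling for tanh

print(np.round(naive, 2))    # close to 1 everywhere: tanh is saturated at +-1
print(np.round(xavier, 2))   # moderate values: units stay in the non-saturated range
```

With unscaled weights almost every unit sits at $\pm 1$, where the tanh gradient is nearly zero. With the Xavier scale the spread still shrinks slowly from layer to layer in this toy setup, so the variance is only approximately preserved.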

```python
# Code example
import numpy as np

input_dim = 2000
Xs = np.random.randn(input_dim, 1)

# Random initialization: unscaled standard normal weights
W = np.random.randn(30, input_dim)
print(np.mean(np.matmul(W, Xs)))

# Xavier initialization: scale by sqrt(1 / n)
W = np.random.randn(30, input_dim) * np.sqrt(1 / input_dim)
print(np.mean(np.matmul(W, Xs)))

# He initialization: scale by sqrt(2 / n)
W = np.random.randn(30, input_dim) * np.sqrt(2 / input_dim)
print(np.mean(np.matmul(W, Xs)))
```

```
7.432054162608742
0.1904579926064798
-0.11629760175749276
```



```python
# Let's see
import numpy as np
nin = 200
```

```python
np.random.randn(3, 2)
```

```
array([[-0.41120098,  0.46342294],
       [-0.3797018 , -0.87384133],
       [-0.53344218,  0.22861401]])
```

## Numerical Approximation of Gradients

How to check that your backpropagation is correct?

Let's take $f(\theta) = \theta^3$.

Instead of just nudging $\theta$ to the right and computing the slope, it is better to consider both $\theta + \epsilon$ and $\theta - \epsilon$ (something like a larger triangle that considers both directions).

This should yield a better approximation to the actual value of the derivative.



Optional theory:


There are two ways to approximate the derivative with a finite difference.

Two-sided difference:


$$
\dfrac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}
$$

Then the approximation error is on the order of $O(\epsilon^2)$.

But if we take the one-sided difference, $\dfrac{f(\theta+\epsilon)-f(\theta)}{\epsilon}$, the error is $O(\epsilon)$.

Since $\epsilon$ is very small (much less than 1), $\epsilon^2$ is far smaller than $\epsilon$, so the difference in accuracy is huge.
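
A quick numerical check for $f(\theta) = \theta^3$, as a sketch. The choices $\theta = 1$ and $\epsilon = 0.01$ are arbitrary:

```python
def f(theta):
    return theta ** 3

theta, eps = 1.0, 0.01
exact = 3 * theta ** 2   # true derivative of theta^3

one_sided = (f(theta + eps) - f(theta)) / eps
two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)

print(abs(one_sided - exact))   # about 0.0301, on the order of eps
print(abs(two_sided - exact))   # about 0.0001, on the order of eps ** 2
```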


## Gradient Checking 

Put all your parameters into a single giant one-dimensional vector $\Theta$.

Then, for each parameter, compute the two-sided approximation of its gradient and store the results in $d\Theta_{approx}$.

Calculate the $L_2$ norm of the difference between the actual gradient vector, $d\Theta$, and the approximation, normalized by the sum of the norms of the two vectors: $\dfrac{\|d\Theta_{approx} - d\Theta\|_2}{\|d\Theta_{approx}\|_2 + \|d\Theta\|_2}$.

If it is around $10^{-7}$, your backprop algorithm is very likely correct!! Around $10^{-5}$ is probably fine but deserves a closer look, and much larger values suggest a bug.


This is similar to what Andrej Karpathy does in his zero-to-hero lectures.
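
A minimal sketch of the whole check on a toy cost. `cost` and `analytic_grad` stand in for a network's cost and its backprop output, and all names here are made up:

```python
import numpy as np

def cost(theta):
    """Toy cost J(theta) = sum(theta ** 3); stands in for a network's cost."""
    return np.sum(theta ** 3)

def analytic_grad(theta):
    """Hand-derived gradient, playing the role of backprop's output."""
    return 3 * theta ** 2

def grad_check(cost_fn, grad_fn, theta, eps=1e-7):
    """Relative difference between grad_fn and a two-sided numerical gradient."""
    grad = grad_fn(theta)
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):          # nudge one parameter at a time
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        grad_approx[i] = (cost_fn(plus) - cost_fn(minus)) / (2 * eps)
    numerator = np.linalg.norm(grad_approx - grad)
    denominator = np.linalg.norm(grad_approx) + np.linalg.norm(grad)
    return numerator / denominator

theta = np.array([0.5, -1.2, 2.0, 0.3])
good = grad_check(cost, analytic_grad, theta)
bad = grad_check(cost, lambda t: 2 * t ** 2, theta)   # deliberately wrong gradient

print(good)   # tiny: the gradient matches
print(bad)    # large: the bug is caught
```

The loop needs two cost evaluations per parameter, which is why this is only practical as a debugging tool.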

## Gradient checking implementation notes

- Don't use it in training, of course!! It is slow, so it is just for debugging.

- If the check fails, look at the individual components that differ to identify the bug.

- If the differences all come from $dW^{[l]}$, then look there.

- Remember regularization when implementing grad check: if the cost has a regularization term, it has to be included in both the cost used for the approximation and the gradient.

- Won't work with dropout, of course! Run the check without dropout (e.g. keep probability set to 1)!!

- Don't just run the grad check at the very beginning. It might be the case that for small values of $w$ and $b$ your backprop works fine, BUT for larger values there is an issue. So Andrew recommends running it once at the beginning and perhaps again after some training.


