# Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

# Train / Dev / Test sets

\# layers  
\# hidden units  
Learning rates  
Activation functions

Idea $\longrightarrow$ Code $\longrightarrow$ Experiment $\longrightarrow$ Idea $\longrightarrow$ Code $\longrightarrow$ Experiment

Training set $\longrightarrow$ Hold out/cross validation/Development set/"dev" $\longrightarrow$ test

70/30 (train/test)  
60/20/20 (train/dev/test)


* if you have a relatively small dataset, these traditional ratios might be okay. But if you have a much larger dataset, it's also fine to make your dev and test sets much smaller than 20% or even 10% of your data.

* rule of thumb: make sure the dev and test sets come from the same distribution

* Remember the goal of the test set is to give you an unbiased estimate of the performance of your final network, the one that you selected. But if you don't need that unbiased estimate, then it might be okay to not have a test set.
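
A minimal sketch of such a split for a large dataset, assuming examples are stored as rows of a NumPy array. The function name `split_train_dev_test`, the 98/1/1 ratio and the toy data are illustrative choices, not something fixed by the course. Shuffling once before slicing is one simple way to keep the dev and test sets on the same distribution.

```python
import numpy as np

def split_train_dev_test(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle examples (rows) once, then slice into train/dev/test."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)            # single shuffle shared by all three sets
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev_idx = perm[:n_dev]
    test_idx = perm[n_dev:n_dev + n_test]
    train_idx = perm[n_dev + n_test:]
    return ((X[train_idx], y[train_idx]),
            (X[dev_idx], y[dev_idx]),
            (X[test_idx], y[test_idx]))

# Toy data (assumption): 10,000 examples with 5 features each.
X = np.arange(50_000, dtype=float).reshape(10_000, 5)
y = np.arange(10_000) % 2
train, dev, test = split_train_dev_test(X, y)
print(train[0].shape, dev[0].shape, test[0].shape)
```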

# Bias / Variance

## the bias-variance trade-off

high bias $\longrightarrow$ Underfitting

high variance $\longrightarrow$ Overfitting

* this analysis is predicated on the assumption that human-level performance gets nearly 0% error or, more generally, that the optimal error, sometimes called the Bayes error, is nearly 0%. If the optimal error were much higher, say 15%, then a classifier with 15% training error would be perfectly reasonable: you wouldn't see it as high bias, and its variance would also be pretty low. So the analysis of bias and variance changes when no classifier can do very well, for example if you have really blurry images
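
The diagnosis above can be sketched as a comparison of three numbers. This is only an illustration: the function name `diagnose` and the threshold `tol` are hypothetical, and in practice what counts as a "large" gap is a judgment call.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.02):
    """Rough bias/variance reading from error rates given as fractions in [0, 1]."""
    bias_gap = train_err - bayes_err     # distance from the best achievable error
    variance_gap = dev_err - train_err   # how much worse the dev set is than training
    return {
        "high_bias": bias_gap > tol,
        "high_variance": variance_gap > tol,
    }

print(diagnose(0.15, 0.16))                  # optimal error assumed to be ~0%
print(diagnose(0.15, 0.16, bayes_err=0.15))  # optimal error of 15%
print(diagnose(0.01, 0.11))                  # low training error, much higher dev error
```

The same 15% training error reads as high bias in the first call and as acceptable in the second, which is the point of the note about blurry images.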

# Basic Recipe for Machine Learning

* ways to solve a high bias problem: train a bigger network, train longer, or try some more advanced optimization algorithms.

* the best way to solve a high variance problem is to get more data. You could also try regularization, and sometimes a more appropriate neural network architecture reduces your variance problem as well.

* in the pre-deep learning era there was a trade-off: you could increase bias and reduce variance, or reduce bias and increase variance, because we didn't have many tools that just reduce bias or just reduce variance without hurting the other one. But in the modern deep learning, big data era, so long as you can keep training a bigger network and keep getting more data (which isn't always the case for either of these), getting a bigger network almost always just reduces your bias without necessarily hurting your variance, and getting more data almost always reduces your variance without hurting your bias.

# Regularization

## Logistic Regression

$\min_{w, b} J(w, b)$

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{i}, y^{i}) + \frac{\lambda}{2m}||w||^2_2$$

L2 regularization $\longrightarrow ||w||^2_2 = \sum^{n_x}_{j=1}w_j^2 = w^Tw$

L1 regularization $\longrightarrow \frac{\lambda}{2m}\sum^{n_x}_{j=1}|w_j| = \frac{\lambda}{2m}||w||_1$

With L1 regularization, w will be sparse (it ends up with a lot of zeros in it)
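
A minimal NumPy sketch of the L2-regularized cost for logistic regression. The function name `logistic_cost_l2`, the column-per-example layout and the toy data are assumptions made for illustration. Only w is penalized, not b, matching the formula above.

```python
import numpy as np

def logistic_cost_l2(w, b, X, y, lam):
    """Cross-entropy cost plus (lam / 2m) * ||w||_2^2.

    X has shape (n_x, m) with one example per column; y has shape (m,).
    """
    m = X.shape[1]
    z = w @ X + b
    y_hat = 1.0 / (1.0 + np.exp(-z))              # sigmoid
    cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    l2_penalty = (lam / (2 * m)) * np.dot(w, w)   # w^T w
    return cross_entropy + l2_penalty

X = np.array([[0.5, -1.0, 2.0],
              [1.5, 0.3, -0.7]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.2, -0.1])

unregularized = logistic_cost_l2(w, 0.0, X, y, lam=0.0)
regularized = logistic_cost_l2(w, 0.0, X, y, lam=3.0)
print(regularized - unregularized)   # the penalty term alone
```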

## Neural network
Frobenius norm: the squared Frobenius norm just means the sum of the squares of the elements of a matrix

$$J(w^{[1]}, b^{[1]}, w^{[2]}, b^{[2]}, ..., w^{[L]}, b^{[L]}) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{i}, y^{i}) + \frac{\lambda}{2m}\sum^{L}_{l=1}||w^{[l]}||^2_F$$

$\longrightarrow ||w^{[l]}||^2_F = \sum^{n^{[l]}}_{i=1}\sum^{n^{[l-1]}}_{j=1}(w_{ij}^{[l]})^2$, since $w^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$


$dw^{[l]} = (\text{from backpropagation}) + \frac{\lambda}{m}w^{[l]}$

$w^{[l]} := w^{[l]} - \alpha dw^{[l]}$
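
Differentiating the penalty $\frac{\lambda}{2m}||w||^2_F$ with respect to $w$ gives $\frac{\lambda}{m}w$, so each update also shrinks the weights by a factor $(1 - \frac{\alpha\lambda}{m})$, which is why L2 regularization is also called weight decay. A small sketch of this, with hypothetical helper names `l2_penalty` and `update_with_l2` and random stand-in values for the backpropagation gradient:

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """(lam / 2m) * sum of squared Frobenius norms over all layers."""
    return (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)

def update_with_l2(W, dW_backprop, lam, m, alpha):
    """One gradient step where dW = dW_backprop + (lam / m) * W."""
    dW = dW_backprop + (lam / m) * W
    return W - alpha * dW

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
dW_backprop = rng.standard_normal((3, 4))   # stand-in for the backprop gradient
lam, m, alpha = 0.7, 50, 0.1

stepped = update_with_l2(W, dW_backprop, lam, m, alpha)
# The same step written as weight decay: shrink W, then apply the usual update.
decayed = (1 - alpha * lam / m) * W - alpha * dW_backprop
print(np.allclose(stepped, decayed))
```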

# Why regularization reduces overfitting?

* So the intuition you might take away from this is that if lambda, the regularization parameter, is large, then your parameters W will be relatively small, because large values are penalized in the cost function. And if the weights W are small, then because $z^{[l]} = w^{[l]}a^{[l-1]} + b^{[l]}$, Z will also be relatively small

* And in particular, if Z ends up taking relatively small values, just in a small range around zero, then g(Z) will be roughly linear (for an activation such as tanh). So it's as if every layer is roughly linear, as if it were just linear regression. And we saw in course one that if every layer is linear then your whole network is just a linear network, which cannot fit very complicated decision boundaries and so is less able to overfit
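
A quick numerical check of the "roughly linear" claim, assuming tanh as the activation. The two ranges of z are arbitrary choices made for illustration.

```python
import numpy as np

small_z = np.linspace(-0.1, 0.1, 5)
large_z = np.linspace(-3.0, 3.0, 5)

# Largest gap between tanh(z) and the straight line g(z) = z on each range.
small_gap = np.max(np.abs(np.tanh(small_z) - small_z))
large_gap = np.max(np.abs(np.tanh(large_z) - large_z))
print(small_gap, large_gap)
```

On the small range the gap is tiny, so the layer behaves almost like a linear map; on the large range tanh saturates and the gap is large.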

# Dropout Regularization

* with dropout, what we're going to do is go through each of the layers of the network and set some probability of eliminating each node in the neural network

* for example, a 0.5 chance of keeping each node and a 0.5 chance of removing it

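        import numpy as np

        # a3: activations of layer 3, assumed already computed by forward propagation
        keep_prob = 0.8  # probability of keeping each hidden unit

        # For each example and each hidden unit there is a 0.8 chance that the
        # corresponding entry of d3 is one, and a 0.2 chance that it is zero.
        d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

        a3 = np.multiply(a3, d3)  # element-wise operation: zero out dropped units

        # Inverted dropout technique: scale a3 back up so that its expected
        # value is unchanged (this makes test time easier).
        a3 /= keep_prob

$$z^4 = w^4a^3 + b^4$$

* without the division by keep_prob, a3 would be reduced by 20% in expectation, and z4 with it

* do not use dropout at test time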
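
A self-contained sketch of inverted dropout that can be run as is. The helper name `dropout_forward`, the use of NumPy's `Generator` API and the all-ones activations are assumptions for illustration. It checks that roughly 80% of the units are kept and that the mean activation stays close to its original value after rescaling.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng):
    """Apply inverted dropout to activations `a` (units x examples)."""
    mask = rng.random(a.shape) < keep_prob    # True with probability keep_prob
    return (a * mask) / keep_prob, mask

rng = np.random.default_rng(1)
a3 = np.ones((50, 2000))                      # stand-in activations, all equal to 1
a3_drop, d3 = dropout_forward(a3, keep_prob=0.8, rng=rng)

print(d3.mean())        # fraction of kept units, close to 0.8
print(a3_drop.mean())   # close to 1.0: the expected activation is preserved
```

At test time the function is simply not called, and because of the rescaling no extra correction factor is needed there.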

# Understanding Dropout

Intuition: can't rely on any one feature, so have to spread out weights $\longrightarrow$ shrinking the squared norm of the weights

* has a similar effect to L2 regularization

* unless my algorithm is overfitting, I wouldn't actually bother to use dropout.

* One big downside of dropout is that the cost function J is no longer well-defined. On every iteration, you are randomly killing off a bunch of nodes. And so, if you are double-checking the performance of gradient descent, it's actually harder to check that you have a well-defined cost function J that is going downhill on every iteration

# Other regularization methods

 1. Data Augmentation
 2. Early Stopping
 
Plot either the training error (the 0/1 classification error on the training set) or the cost function J that you are optimizing; it should decrease monotonically. Also plot the dev set error, and stop training around the iteration where the dev set error is lowest.
 
The downside is that early stopping couples two tasks, optimizing the cost function J and not overfitting, so you can no longer work on them independently. Being able to think about one task at a time is a principle sometimes called orthogonalization.
 
 3. One alternative is to just use L2 regularization; then you can train the neural network as long as possible
 
But the downside of this is that you might have to try a lot of values of the regularization parameter lambda, which makes the search more computationally expensive
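
Early stopping is usually driven by the dev set error. A minimal sketch, assuming a list of dev errors recorded once per epoch; the function name `early_stopping_epoch` and the `patience` parameter are hypothetical and not from the course.

```python
def early_stopping_epoch(dev_errors, patience=2):
    """Return the index of the best epoch, stopping once the dev error
    has not improved for `patience` consecutive epochs."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(dev_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            break                     # stop training; later epochs are never run
    return best_epoch

dev_errors = [0.30, 0.22, 0.18, 0.17, 0.19, 0.21, 0.16]
print(early_stopping_epoch(dev_errors))   # 3
```

Note that the last value, 0.16, is never reached: training stops after two epochs without improvement. This is the price of coupling optimization with regularization.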

# Normalizing inputs

* one of the techniques that will speed up your training is if you normalize your inputs

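* you scale your test set in exactly the same way, using the mu and sigma squared computed on the training set, rather than estimating them separately on the training set and the test set

* if you use unnormalized input features, it's more likely that your cost function will be a very squished-out bowl, a very elongated cost function, which forces gradient descent to use a small learning rate and take many steps. With normalized features the cost function is more symmetric and easier to optimize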
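
A short sketch of this normalization, assuming one example per column. The helper names `fit_normalizer` and `normalize`, the small `eps` guard and the synthetic data are illustrative assumptions. The statistics are estimated once on the training set and reused for the test set.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Estimate per-feature mean and standard deviation on the training set.

    X_train has shape (n_x, m), one example per column.
    """
    mu = X_train.mean(axis=1, keepdims=True)
    sigma = X_train.std(axis=1, keepdims=True)
    return mu, sigma + eps            # eps guards against division by zero

def normalize(X, mu, sigma):
    return (X - mu) / sigma

rng = np.random.default_rng(0)
# Two features on very different scales (synthetic data).
X_train = rng.normal(loc=[[5.0], [-2.0]], scale=[[10.0], [0.1]], size=(2, 1000))
X_test = rng.normal(loc=[[5.0], [-2.0]], scale=[[10.0], [0.1]], size=(2, 200))

mu, sigma = fit_normalizer(X_train)
X_train_n = normalize(X_train, mu, sigma)
X_test_n = normalize(X_test, mu, sigma)   # reuse the training-set statistics
print(X_train_n.mean(axis=1), X_train_n.std(axis=1))
```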

# Vanishing / Exploding gradients

* in a very deep network, the activations can end up increasing or decreasing exponentially with depth. The intuition I hope you can take away from this is that if the weights W are all just a little bit bigger than one, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode. And if W is just a little bit less than the identity, say with 0.9, 0.9 on the diagonal, then with a very deep network the activations will decrease exponentially

* a similar argument can be used to show that the derivatives, or the gradients computed during gradient descent, will also increase or decrease exponentially as a function of the number of layers
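
A toy illustration of this effect, assuming linear activations, zero biases and the same weight matrix in every layer. The value 0.9 comes from the note above; 1.1 and the depth of 100 are arbitrary choices.

```python
import numpy as np

def forward_linear(x, scale, depth):
    """Propagate x through `depth` linear layers with W = scale * I."""
    W = scale * np.eye(x.shape[0])
    a = x
    for _ in range(depth):
        a = W @ a
    return a

x = np.ones((2, 1))
exploded = forward_linear(x, 1.1, 100)[0, 0]   # equals 1.1**100, roughly 1.4e4
vanished = forward_linear(x, 0.9, 100)[0, 0]   # equals 0.9**100, roughly 2.7e-5
print(exploded, vanished)
```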

# Weight Initialization for Deep Networks

It turns out that a partial solution to vanishing/exploding gradients, which doesn't solve the problem entirely but helps a lot, is a better, more careful choice of the random initialization for your neural network.

$$Var(w_i) = \frac{1}{n}$$

$$w^{[l]} = \text{np.random.randn(shape)} * \text{np.sqrt}\left(\frac{1}{n^{[l-1]}}\right)$$

if ReLU $\longrightarrow Var(w_i) = \frac{2}{n}$

if tanh $\longrightarrow Var(w_i) = \frac{1}{n}$ (Xavier initialization)

Another version, from Yoshua Bengio and his colleagues:

$$Var(w_i) = \frac{2}{n^{[l-1]} + n^{[l]}}$$
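
A sketch of variance-scaled initialization, where n is the number of inputs to the layer, $n^{[l-1]}$. The function name `init_weights` and its string argument are hypothetical; only the ReLU and tanh variants are covered.

```python
import numpy as np

def init_weights(n_out, n_in, activation="relu", rng=None):
    """Draw W of shape (n_out, n_in) with variance scaled by the fan-in n_in."""
    rng = np.random.default_rng() if rng is None else rng
    if activation == "relu":
        var = 2.0 / n_in
    elif activation == "tanh":
        var = 1.0 / n_in
    else:
        raise ValueError(f"unsupported activation: {activation}")
    # Standard normal samples have variance 1, so scale by sqrt(var).
    return rng.standard_normal((n_out, n_in)) * np.sqrt(var)

rng = np.random.default_rng(0)
W = init_weights(400, 500, "relu", rng)
print(W.var())   # close to 2 / 500 = 0.004
```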

# Numerical approximation of gradients

* So in order to build up to gradient checking, let's first talk about how to numerically approximate gradients

Central (two-sided) finite difference: $f'(\theta) \approx \frac{f(\theta+\varepsilon) - f(\theta-\varepsilon)}{2\varepsilon}$

* When you use this method for gradient checking of backpropagation, it turns out to run twice as slow as if you were to use a one-sided difference. In practice I think it's worth it to use the two-sided method because it's just much more accurate: its approximation error is $O(\varepsilon^2)$, compared with $O(\varepsilon)$ for the one-sided difference
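
A small worked comparison of the two approximations on $f(\theta) = \theta^3$ at $\theta = 1$, where the true derivative is 3. The function and the step size 0.01 are chosen only for illustration.

```python
def f(theta):
    return theta ** 3

theta, eps = 1.0, 0.01
true_grad = 3 * theta ** 2                                  # f'(theta) = 3
two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)   # error is O(eps^2)
one_sided = (f(theta + eps) - f(theta)) / eps               # error is O(eps)

print(abs(two_sided - true_grad))   # about 0.0001
print(abs(one_sided - true_grad))   # about 0.0301
```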

# Gradient checking

$J(\theta)$

for each i:
$$d\theta_{approx}^{[i]} = \frac{J\left(\theta_{1}, \theta_{2}, \ldots, \theta_{i}+\varepsilon, \ldots\right)-J\left(\theta_{1}, \theta_{2}, \ldots, \theta_{i}-\varepsilon, \ldots\right)}{2\varepsilon}$$

$$\stackrel{?}{\approx} d\theta^{[i]} = \frac{\partial J}{\partial \theta_i}$$

$$check = \frac{||d\theta_{approx}-d\theta||_2}{||d\theta_{approx}||_2+||d\theta||_2}$$

With $\varepsilon = 10^{-7}$:

$$check \approx 10^{-7} - \text{great}$$
$$check \approx 10^{-3} - \text{wrong}$$
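
A minimal sketch of this check, assuming all parameters have already been reshaped into a single vector theta. The function name `grad_check` and the toy cost function are hypothetical; a real network would pass its own cost and backpropagation gradient.

```python
import numpy as np

def grad_check(J, grad_J, theta, eps=1e-7):
    """Compare an analytic gradient against a two-sided numerical estimate."""
    d_theta = grad_J(theta)
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps                 # nudge only component i
        minus[i] -= eps
        d_theta_approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(d_theta_approx - d_theta)
    den = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return num / den

# Toy cost with a known gradient (illustration only).
J = lambda t: np.sum(t ** 2) + np.sum(np.sin(t))
good_grad = lambda t: 2 * t + np.cos(t)
buggy_grad = lambda t: 2 * t + np.sin(t)   # deliberate bug

theta = np.array([0.3, -1.2, 2.0, 0.7])
print(grad_check(J, good_grad, theta))     # tiny: the gradient is correct
print(grad_check(J, buggy_grad, theta))    # large: the gradient has a bug
```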

# Gradient Checking Implementation Notes

* First, don't use grad check in training, only to debug

* Second, if an algorithm fails grad check, look at the individual components and try to identify the bug

* Remember regularization: if J includes a regularization term, dθ must include the gradient of that term too

* Doesn't work with dropout, because dropout makes the cost function J random and hard to compute

So what I usually do is implement grad check without dropout. If you want, you can set keep_prob to 1.0 so that dropout is turned off, run the check, and then turn on dropout and hope that the implementation of dropout was correct

<font color="red">Pending to finish with detail</font>