# Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

This notebook is based on my learning from **course 2** of the **Deep Learning Specialization** provided by **deeplearning.ai**. The course videos can be found on [YouTube](https://www.youtube.com/watch?v=1waHlpKiNyY&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc) or [Coursera](https://www.coursera.org/specializations/deep-learning). Learning through Coursera is highly recommended to get access to the quizzes and programming exercises along the course, as well as the course certificate upon completion. Personally, I completed the specialization of 5 courses and acquired the [Specialization Certificate](https://coursera.org/share/e590c28a5c258e500ca6d3ccb4ed57ba). Later, I discovered the YouTube videos and used them for review.

## 1. Practical Aspects of Deep Learning

### Train/Dev/Test Sets
- Applying DL is a very iterative process to find the best hyperparameters.
- Split dataset
    - Previous era (n <= 100,000): 70/30 (train/test) or 60/20/20 (train/dev/test)
    - Big data (n >= 1,000,000): 98/1/1; 10,000 samples might be enough for the test set. For even bigger datasets: 99.5/0.4/0.1
- Beware of mismatched train/test distributions
- It might be OK to not have a test set
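
As a rough illustration of the big-data split, the sketch below shuffles a dataset and carves off small dev and test sets. The helper name `split_dataset`, its default fractions, and the toy arrays are assumptions made for this example, not part of the course material.

```python
import numpy as np

def split_dataset(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the examples (rows of X) and split them into train/dev/test."""
    m = X.shape[0]
    perm = np.random.default_rng(seed).permutation(m)
    n_dev = int(m * dev_frac)
    n_test = int(m * test_frac)
    dev_idx = perm[:n_dev]
    test_idx = perm[n_dev:n_dev + n_test]
    train_idx = perm[n_dev + n_test:]
    return (X[train_idx], y[train_idx]), (X[dev_idx], y[dev_idx]), (X[test_idx], y[test_idx])

# Toy data: 1000 examples with 2 features each
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)
train, dev, test = split_dataset(X, y)
print(len(train[1]), len(dev[1]), len(test[1]))  # 980 10 10
```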

### Bias/Variance
- High variance: overfitting the data (low error on the training set but high error on the dev set)
- High bias: underfitting the data compared to the baseline (human judgement) (high error on both the training and dev sets)
- High bias and high variance: high error on the training set and even worse on the dev set
- Low bias and low variance: a really good model (low error on both the training and dev sets)
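
The four cases above can be sketched as a small helper that compares training error, dev error, and a baseline. The function name `diagnose` and the threshold `tol` are arbitrary choices for illustration; what counts as a "large" gap depends on the problem.

```python
def diagnose(train_error, dev_error, baseline_error=0.0, tol=0.02):
    """Label a model from its error rates (fractions in [0, 1]).

    tol is an illustrative threshold for what counts as a large gap.
    """
    high_bias = (train_error - baseline_error) > tol
    high_variance = (dev_error - train_error) > tol
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"

print(diagnose(0.01, 0.11))   # high variance
print(diagnose(0.15, 0.16))   # high bias
print(diagnose(0.15, 0.30))   # high bias and high variance
print(diagnose(0.005, 0.01))  # low bias and low variance
```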

### Basic Recipe
1. High bias? (training data performance) -> Bigger network, train longer, NN architecture search
2. High variance? (dev set performance) -> More data, regularization, NN architecture search

The bias/variance tradeoff does not always apply in DL if appropriate techniques (e.g. a bigger network and more data) are selected.

### Regularization (might introduce bias/variance tradeoff)

- L2 regularization is used much more often than L1 when training NN models.
- $\lambda$ is the regularization parameter
- In Neural Network:

$J(\mathbf{w}, \mathbf{b}) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)},y^{(i)}) +$<font color='blue'>$\frac{\lambda}{2m}\sum_{l=1}^{L}||\mathbf{w}^{[l]}||^2_F$</font>

<font color='blue'>$\text{Frobenius norm (L2 norm)}: ||\mathbf{w}^{[l]}||^2_F = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}(w^{[l]}_{ij})^2$</font>, $\mathbf{w}^{[l]}: (n^{[l]}, n^{[l-1]})$

$\frac{\partial J}{\partial \mathbf{w}^{[l]}} = \mathbf{dw}^{[l]}, \mathbf{dw}^{[l]} = \frac{1}{m}\mathbf{dZ}^{[l]} \mathbf{A}^{[l-1]T} +$ <font color='blue'>$\frac{\lambda}{m}\mathbf{w}^{[l]}$</font>

- With large $\lambda$, we are telling the model to get smaller $\mathbf{w}$. This would encourage the training process to return a simpler model, which is like having a smoother boundary between classes if visualized. 
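
A minimal NumPy sketch of the two blue terms above: the penalty added to the cost and the extra term added to each layer's gradient. The function names and the toy weight matrices are assumptions for illustration only.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambda / 2m) * sum over layers of the squared Frobenius norm."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_grad(W, lambd, m):
    """Extra term added to dW for one layer: (lambda / m) * W."""
    return (lambd / m) * W

# Toy weights for a 2-layer network
W1 = np.array([[1.0, -2.0], [0.5, 0.0]])
W2 = np.array([[3.0, 1.0]])
penalty = l2_penalty([W1, W2], lambd=0.7, m=10)
print(round(float(penalty), 5))  # 0.53375
```

A larger `lambd` makes both terms larger, which pushes the weights toward smaller values at every update.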

### Dropout Regularization

- In each iteration, randomly eliminate nodes at each layer to get a smaller, diminished network.
- Implementation - inverted dropout
```python
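import numpy as np

keep_prob = 0.8  # probability of keeping each unit
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean keep mask for layer 3
a3 = np.multiply(a3, d3)  # zero out the dropped units
a3 /= keep_prob  # scale up so the expected value of z in z=wa+b is unchanged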
```
#### Notes:
1. **After rescaling a, w and b are still trained as usual.**
2. **With dropout, a different subset of w and b is trained in each iteration, so that overall w and b are not over-trained.**
3. **The activation matrix a is dropped out, not w. All of w is still present when training the model.**


- Making predictions at test time
    - Do not use dropout
    
- Intuition: a unit can't rely on any one feature, so it has to spread out the weights. -> Shrinking the weights, similar to the effect of L2 regularization
- Use a higher keep probability on smaller layers; there is no need to apply dropout to all layers
- Dropout is frequently used in computer vision
- Downside: Cost function $J$ is less well-defined
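
To make the train/test distinction concrete, here is a small sketch of inverted dropout wrapped in a function with a `training` flag. The function name, the flag, and the all-ones activations are assumptions for this example.

```python
import numpy as np

def dropout_forward(a, keep_prob, training, rng):
    """Inverted dropout on activations a; identity at test time."""
    if not training:
        return a
    mask = rng.random(a.shape) < keep_prob  # keep each unit with probability keep_prob
    return a * mask / keep_prob             # rescale so the expected activation is unchanged

rng = np.random.default_rng(0)
a = np.ones((50, 2000))
a_train = dropout_forward(a, keep_prob=0.8, training=True, rng=rng)
a_test = dropout_forward(a, keep_prob=0.8, training=False, rng=rng)
print(a_train.mean())  # close to 1.0 because of the rescaling
print(a_test.mean())   # exactly 1.0, nothing is dropped
```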

#### More Regularization Methods
- Data augmentation
- Early stopping

#### Weight Initialization

With large n (nodes in each layer), we want smaller w.

$\text{ReLU: Var}(\mathbf{w}^{[l]}) = \frac{2}{n^{[l-1]}}$

$\text{tanh: Var}(\mathbf{w}^{[l]}) = \frac{1}{n^{[l-1]}}$

```python

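# Standard-normal draws scaled so that Var(w) = 1/n_previous_l (tanh); use 2/n_previous_l for ReLU
w_l = np.random.randn(*shape) * np.sqrt(1 / n_previous_l)  # shape is a tuple such as (n_l, n_previous_l)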
```

However, in practice, tuning the weight in this way is usually less important compared to other tuning techniques.
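
The two variance rules can be sketched as one helper that picks the scale by activation. The function name `init_weights` and the layer sizes are hypothetical; the weight shape `(n_curr, n_prev)` is assumed so that `W @ A_prev` is well defined.

```python
import numpy as np

def init_weights(n_prev, n_curr, activation="relu", seed=0):
    """Draw W of shape (n_curr, n_prev) with variance 2/n_prev (ReLU) or 1/n_prev (tanh)."""
    rng = np.random.default_rng(seed)
    if activation == "relu":
        scale = np.sqrt(2.0 / n_prev)
    elif activation == "tanh":
        scale = np.sqrt(1.0 / n_prev)
    else:
        raise ValueError("activation must be 'relu' or 'tanh'")
    return rng.standard_normal((n_curr, n_prev)) * scale

W_relu = init_weights(500, 400, "relu")
W_tanh = init_weights(500, 400, "tanh")
print(W_relu.var())  # roughly 2 / 500 = 0.004
print(W_tanh.var())  # roughly 1 / 500 = 0.002
```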



## 2. Optimization Algorithms

### Mini Batch Gradient Descent
- Notation: $X^{\{t\}}, Y^{\{t\}}$
- 1 epoch = one pass through the training set, i.e. one round of gradient descent steps over all $T$ mini-batches

Benefit:
- Reduce training time
- Make progress quickly

Compare mini-batch size:
- If mini-batch size = $m$: **Batch gradient descent**
    - Too long per iteration
- If mini-batch size = 1: **Stochastic gradient descent**
    - Inefficient; loses the speedup from vectorization
- In practice: somewhere between 1 and $m$
    - Fastest learning with vectorization (~1000)
    - Make progress without processing entire training set

Choose mini-batch size:
- If small training set: use batch gradient descent
    - m less than 2000
- Typical mini-batch sizes: 64, 128, 256, 512
- Make sure a mini-batch fits in CPU/GPU memory
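
A minimal sketch of how a training set might be shuffled and partitioned into mini-batches $X^{\{t\}}, Y^{\{t\}}$. It assumes examples are stored as columns; the function name and toy shapes are illustrative.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """X: (n_x, m), Y: (n_y, m). Return a list of shuffled (X_t, Y_t) mini-batches."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]  # same permutation keeps pairs aligned
    batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size             # the last batch may be smaller
        batches.append((X_shuf[:, start:end], Y_shuf[:, start:end]))
    return batches

X = np.random.default_rng(1).standard_normal((3, 150))
Y = np.arange(150).reshape(1, 150)
batches = random_mini_batches(X, Y, batch_size=64)
print([b[0].shape[1] for b in batches])  # [64, 64, 22]
```

One epoch then corresponds to one loop over `batches`, with one gradient step per mini-batch.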

### Exponentially Weighted Averages

Data: $[\theta_1, \theta_2, ... \theta_t]$

- $V_0 = 0$
- $V_1 = \beta V_0 + (1-\beta) \theta_1$
- ...
- $V_t = \beta V_{t-1} + (1-\beta) \theta_t$ (The equation for exponentially weighted averages)

$V_t$ is approximately an average over $\approx \frac{1}{1-\beta}$ days' data.
- $\beta = 0.9 :\approx 10$ days' data
- $\beta = 0.98 :\approx 50$ days' data
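
The recursion can be sketched in a few lines of plain Python. The function name `ewa` and the constant toy series are assumptions for illustration.

```python
def ewa(thetas, beta):
    """Exponentially weighted average V_t = beta * V_{t-1} + (1 - beta) * theta_t, with V_0 = 0."""
    v = 0.0
    out = []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out

temps = [10.0] * 200          # a constant series makes the behavior easy to read
v = ewa(temps, beta=0.9)
print(round(v[0], 6))         # 1.0  (early values are biased toward V_0 = 0)
print(round(v[-1], 6))        # 10.0 (the average catches up with the data)
```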

More about the equation

$$V_t = (1-\beta) \theta_t + \beta V_{t-1}$$ 
$$= (1-\beta) \theta_t + (1-\beta) \times \beta \times \theta_{t-1} + (1-\beta) \times (\beta)^2 \times \theta_{t-2} + (1-\beta) \times (\beta)^3 \times \theta_{t-3} + ...$$

$(1-\varepsilon)^{\frac{1}{\varepsilon}} \approx \frac{1}{e} \approx 0.37$ for small $\varepsilon$. With $\varepsilon = 1-\beta$, it takes about $\frac{1}{1-\beta}$ steps for the weight to decay to roughly $\frac{1}{3}$ of its initial value.
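
A quick numeric check of this rule of thumb, under the assumption that "window" means $\frac{1}{1-\beta}$ rounded to an integer. The variable names are illustrative.

```python
decay = {}
for beta in (0.9, 0.98):
    window = round(1 / (1 - beta))   # 10 and 50
    decay[beta] = beta ** window     # relative weight of a value that is `window` steps old
    print(beta, window, round(decay[beta], 3))
```

Both relative weights come out close to $\frac{1}{e}$, which is why the average is said to cover roughly the last $\frac{1}{1-\beta}$ values.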

### Bias Correction of Exponentially Weighted Averages

Modify the bias due to $V_0 = 0$:

Use $\frac{V_t}{1-\beta^{t}}$ instead of $V_t$

Example:

$t=2, \beta = 0.98: 1-\beta^{t} = 1 - (0.98)^2 = 0.0396$

$\frac{V_2}{0.0396} = \frac{0.0196\theta_1 + 0.02\theta_2}{0.0396}$ The numerator is the original value $V_2$ before correction.
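
The same example can be checked numerically. The sketch below assumes two made-up data points, $\theta_1 = 40$ and $\theta_2 = 50$; the function name is illustrative.

```python
def ewa_corrected(thetas, beta):
    """Return (raw, corrected) exponentially weighted averages, where corrected = V_t / (1 - beta^t)."""
    v = 0.0
    raw, corrected = [], []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        raw.append(v)
        corrected.append(v / (1 - beta ** t))
    return raw, corrected

raw, corrected = ewa_corrected([40.0, 50.0], beta=0.98)
print(round(raw[1], 4))        # 1.784  = 0.0196 * 40 + 0.02 * 50
print(round(corrected[1], 4))  # 45.0505 = 1.784 / 0.0396
```

Without the correction, $V_2$ is far below both data points; with it, the estimate lands between them.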


### Gradient Descent with Momentum

**Compute exponentially weighted averages of the gradients and then use them to update the weights.** Conceptually, it averages out the oscillations of the gradients in previous steps.

$\mathbf{V}_{dw} = 0, \mathbf{V}_{db} = 0$

On iteration $t$:

{

Compute $dw, db$ on current mini-batch

$\mathbf{V}_{dw} = \beta \mathbf{V}_{dw} + (1-\beta)\mathbf{dw}$

$\mathbf{V}_{db} = \beta \mathbf{V}_{db} + (1-\beta)\mathbf{db}$

$\mathbf{w} := \mathbf{w}-\alpha \mathbf{V}_{dw}, \mathbf{b}:=\mathbf{b}-\alpha \mathbf{V}_{db}$

}

Hyperparameters: $\alpha, \beta$

$\beta = 0.9$ is the most common value, which approximately averages the last 10 gradients. In practice, it works very well. Also, in practice people don't bother using bias correction $\frac{V_t}{1-\beta^{t}}$ and just wait until the bias is gone after the first few iterations.

#### Note:
There is only one set of $\mathbf{V}_{dw}$ and $\mathbf{V}_{db}$ per parameter at any time. Their values are updated after each mini-batch using $v_\theta = \beta v_\theta + (1-\beta)\theta_t$, so we only need to keep the running value from the previous mini-batch rather than the gradients of all mini-batches.
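
A minimal sketch of one momentum update, tried on a toy objective $J(\mathbf{w}) = \sum_i w_i^2$ whose gradient is $2\mathbf{w}$. The function name, the toy objective, and the chosen $\alpha$ are assumptions for illustration.

```python
import numpy as np

def momentum_step(w, dw, v_dw, alpha=0.01, beta=0.9):
    """One gradient descent with momentum update; returns the new w and the new running average."""
    v_dw = beta * v_dw + (1 - beta) * dw
    w = w - alpha * v_dw
    return w, v_dw

w = np.array([1.0, -3.0])
v = np.zeros_like(w)          # V_dw starts at 0
for _ in range(500):
    dw = 2 * w                # gradient of the toy objective
    w, v = momentum_step(w, dw, v, alpha=0.1)
print(w)                      # very close to [0, 0], the minimum
```

Only `v` is carried from one iteration to the next, which matches the note that a single running value per parameter is stored.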


### RMSProp

On iteration $t$:

{

Compute $dw, db$ on current mini-batch.

$\mathbf{S}_{dw} = \beta_2 \mathbf{S}_{dw} + (1-\beta_2)\mathbf{dw}^2$

$\mathbf{S}_{db} = \beta_2 \mathbf{S}_{db} + (1-\beta_2)\mathbf{db}^2$

$\mathbf{w} := \mathbf{w}-\alpha \frac{\mathbf{dw}}{\sqrt{\mathbf{S}_{dw}+\varepsilon}}, \mathbf{b} := \mathbf{b}-\alpha \frac{\mathbf{db}}{\sqrt{\mathbf{S}_{db}+\varepsilon}}$

}

$\varepsilon = 10^{-8}$ or a very small number to make sure $dw$ and $db$ are not divided by 0.

Intuitively, it makes the updates move faster in the directions that require larger movement and slower in the directions that require less movement.
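
A small sketch of one RMSProp update, with $\varepsilon$ placed inside the square root as in the formula above. The function name, $\beta_2 = 0.9$, and the toy gradient are assumptions for illustration.

```python
import numpy as np

def rmsprop_step(w, dw, s_dw, alpha=0.01, beta2=0.9, eps=1e-8):
    """One RMSProp update; dw ** 2 is element-wise."""
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2
    w = w - alpha * dw / np.sqrt(s_dw + eps)
    return w, s_dw

w0 = np.array([0.0, 0.0])
dw = np.array([100.0, 0.01])           # one steep direction, one very flat direction
w1, s = rmsprop_step(w0, dw, np.zeros_like(w0))
step = w0 - w1
print(step)                            # both entries are about 0.0316
```

Although the two gradient components differ by a factor of 10,000, dividing by $\sqrt{\mathbf{S}_{dw}}$ gives nearly equal step sizes, so a steep direction no longer dominates the update.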


### Adam Optimization Algorithm

**Combine momentum and RMSProp**

$\mathbf{V}_{dw} = 0, \mathbf{S}_{dw} = 0, \mathbf{V}_{db} = 0, \mathbf{S}_{db} = 0$

On iteration $t$:

{

Compute $dw, db$ on current mini-batch

Momentum:

$\mathbf{V}_{dw} = \beta_1 \mathbf{V}_{dw} + (1-\beta_1)\mathbf{dw}, \mathbf{V}_{db} = \beta_1 \mathbf{V}_{db} + (1-\beta_1)\mathbf{db}$

RMSProp:

$\mathbf{S}_{dw} = \beta_2 \mathbf{S}_{dw} + (1-\beta_2)\mathbf{dw}^2, \mathbf{S}_{db} = \beta_2 \mathbf{S}_{db} + (1-\beta_2)\mathbf{db}^2$

Bias correction:

$\mathbf{V}_{dw}^{corrected} = \mathbf{V}_{dw} / (1-\beta_1^t), \mathbf{V}_{db}^{corrected} = \mathbf{V}_{db} / (1-\beta_1^t)$

$\mathbf{S}_{dw}^{corrected} = \mathbf{S}_{dw} / (1-\beta_2^t), \mathbf{S}_{db}^{corrected} = \mathbf{S}_{db} / (1-\beta_2^t)$

$\mathbf{w} := \mathbf{w}-\alpha \frac{\mathbf{V}_{dw}^{corrected}}{\sqrt{\mathbf{S}_{dw}^{corrected}+\varepsilon}}, \mathbf{b} := \mathbf{b}-\alpha \frac{\mathbf{V}_{db}^{corrected}}{\sqrt{\mathbf{S}_{db}^{corrected}+\varepsilon}}$

}

Hyperparameters: 
- $\alpha$: needs to be tuned
- $\beta_1 \rightarrow dw$: 0.9 recommended; no need to tune; first moment
- $\beta_2 \rightarrow dw^2$: 0.999 recommended; no need to tune; second moment
- $\varepsilon$: $10^{-8}$ recommended; no need to tune

Adam: Adaptive Moment Estimation
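
The update can be sketched as a single function that combines the momentum term, the RMSProp term, and bias correction. The function name, the toy objective $J(w) = w^2 - 10w + 25$, and the learning rate used in the loop are assumptions for illustration.

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * dw           # momentum (first moment)
    s = beta2 * s + (1 - beta2) * dw ** 2      # RMSProp (second moment)
    v_hat = v / (1 - beta1 ** t)               # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / np.sqrt(s_hat + eps)
    return w, v, s

w = np.array([0.0])
v = np.zeros_like(w)
s = np.zeros_like(w)
for t in range(1, 2001):
    dw = 2 * w - 10                            # gradient of w^2 - 10w + 25
    w, v, s = adam_step(w, dw, v, s, t, alpha=0.05)
print(w)                                       # close to [5.], the minimum
```

With bias correction, the very first step has magnitude close to `alpha` regardless of the gradient scale.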


### Learning Rate Decay

**Slowly reduce the learning rate over time**

Intuition: Sometimes the weights might never converge. Reducing alpha helps the weights converge as they approach the minimum by taking smaller steps.

1 epoch = 1 pass through the data

$$\alpha = \frac{1}{1 + \text{decay rate} \times \text{epoch number}}\alpha_0$$

The decay rate becomes another hyperparameter that we need to tune along with $\alpha_0$.

Other learning rate decay methods:

- $\alpha = 0.95^{\text{epoch number}}\alpha_0$

- $\alpha = \frac{k}{\sqrt{\text{epoch number}}}\alpha_0$ or $\frac{k}{\sqrt{t}}\alpha_0$, where $k$ is a constant and $t$ is mini-batch number

- Discrete staircase

- Manual decay

Learning rate decay is usually lower down on the list to try. Choosing $\alpha$ well would really make a difference.
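
The first three schedules can be written as small functions. The function names and the example values ($\alpha_0 = 0.2$, decay rate 1) are assumptions for illustration.

```python
import math

def inverse_decay(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate * epoch)"""
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(alpha0, epoch, base=0.95):
    """alpha = base^epoch * alpha0"""
    return (base ** epoch) * alpha0

def sqrt_decay(alpha0, epoch, k=1.0):
    """alpha = k / sqrt(epoch) * alpha0, defined for epoch >= 1"""
    return k / math.sqrt(epoch) * alpha0

schedule = [round(inverse_decay(0.2, 1.0, e), 4) for e in range(1, 5)]
print(schedule)  # [0.1, 0.0667, 0.05, 0.04]
```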

## 3. Hyperparameter Tuning, Batch Normalization, and Programming Frameworks

### Tuning Process

Priority
- $\alpha$: most important
- $\beta$ (momentum), number of hidden units, mini-batch size: second most important
- Number of layers, learning rate decay: third most important
- $\beta_1, \beta_2, \varepsilon$ (Adam): just use the defaults

Parameter Sampling:
- Try random values: don't use a grid.
- Coarse to fine.

### Use appropriate scale
- $\alpha$: randomly sample at log scale
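
Sampling on a log scale can be sketched as drawing the exponent uniformly and then exponentiating. The function name and the search range $[10^{-4}, 1]$ are assumptions for illustration.

```python
import numpy as np

def sample_log_uniform(low, high, size, rng):
    """Sample values whose log10 is uniform between log10(low) and log10(high)."""
    r = rng.uniform(np.log10(low), np.log10(high), size)
    return 10.0 ** r

rng = np.random.default_rng(0)
alphas = sample_log_uniform(1e-4, 1.0, size=1000, rng=rng)
frac_smallest_decade = np.mean(alphas < 1e-3)
print(frac_smallest_decade)  # roughly 0.25: each of the 4 decades gets a similar share
```

Sampling uniformly on a linear scale instead would put about 90% of the samples between 0.1 and 1, leaving the small learning rates almost unexplored.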

### Hyperparameter Tuning in Practice
- Intuition of hyperparameter setting from one application area may or may not transfer to a different one.
- Intuitions do get stale. Re-evaluate occasionally.
- Babysitting one model when there's not a lot of compute resource vs. Training many models in parallel

### Normalizing Activations in a Network
Normalize not just $X$ but $Z^{[l]}$ as well.

Given some intermediate values in the NN, $z^{(1)}, ..., z^{(m)}$

{

$\mu = \frac{1}{m}\sum_{i}z^{(i)}$

$\sigma^2 = \frac{1}{m}\sum_{i}(z^{(i)}-\mu)^2$

$z_{norm}^{(i)} = \frac{z^{(i)}-\mu}{\sqrt{\sigma^2+\varepsilon}}$

- $\varepsilon$ is added here just in case $\sigma^2$ turns out to be 0.

$\tilde{z}^{(i)} = \gamma z_{norm}^{(i)} + \beta$

$\gamma$ and $\beta$ are learnable parameters of the model.

- We don't always want the hidden unit values to have mean 0 and variance 1. If $\gamma = \sqrt{\sigma^2+\varepsilon}, \beta = \mu$, then $\tilde{z}^{(i)} = z^{(i)}$

Use $\tilde{\mathbf{z}}^{[l](i)}$ instead of $\mathbf{z}^{[l](i)}$

}
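
The steps above can be sketched as a forward pass over one mini-batch. The function name and the toy data are assumptions; examples are assumed to be stored as columns.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Z: (n_units, m); gamma, beta: (n_units, 1). Returns Z_tilde and the intermediate values."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    Z_tilde = gamma * Z_norm + beta
    return Z_tilde, Z_norm, mu, var

rng = np.random.default_rng(0)
Z = 3.0 + 2.0 * rng.standard_normal((4, 128))
eps = 1e-8
Z_tilde, Z_norm, mu, var = batch_norm_forward(Z, np.ones((4, 1)), np.zeros((4, 1)), eps)
print(Z_norm.mean(axis=1))  # each entry is approximately 0
print(Z_norm.var(axis=1))   # each entry is approximately 1

# Choosing gamma = sqrt(var + eps) and beta = mu recovers the original Z
Z_same, _, _, _ = batch_norm_forward(Z, np.sqrt(var + eps), mu, eps)
```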

### Fitting Batch Norm Into Neural Networks

$\mathbf{Z}^{[l]} = \mathbf{w}^{[l]}\mathbf{A}^{[l-1]}$ 

- $\mathbf{b}^{[l]}$ has no impact on $\tilde{\mathbf{Z}}^{[l]}$, since any constant added to $\mathbf{Z}^{[l]}$ is cancelled when the mean is subtracted.

Compute $\mathbf{Z}_{norm}^{[l]}$

$\tilde{\mathbf{Z}}^{[l]} = \mathbf{\gamma}^{[l]} \mathbf{Z}_{norm}^{[l]} + \mathbf{\beta}^{[l]}$

$\mathbf{A}^{[l]} = g^{[l]}(\tilde{\mathbf{Z}}^{[l]})$ 

Parameters: $\mathbf{w}^{[l]}, \mathbf{\beta}^{[l]}, \mathbf{\gamma}^{[l]}$

- $\mathbf{b}^{[l]}$ is not a parameter any more. It is replaced by $\mathbf{\beta}^{[l]}$.
- $\mathbf{\beta}^{[l]}$ here is different from the $\beta$ used for momentum or Adam.
- The shape of $\mathbf{\beta}^{[l]}$ and $\mathbf{\gamma}^{[l]}$ is also $(n^{[l]}, 1)$.

$\mathbf{w}^{[l]}:=\mathbf{w}^{[l]}-\alpha\mathbf{dw}^{[l]}$

$\mathbf{\beta}^{[l]}:=\mathbf{\beta}^{[l]}-\alpha\mathbf{d\beta}^{[l]}$ 

$\mathbf{\gamma}^{[l]}:=\mathbf{\gamma}^{[l]}-\alpha\mathbf{d\gamma}^{[l]}$ 

- We can also use optimization algorithms like Adam to update $\mathbf{\beta}^{[l]}$ and $\mathbf{\gamma}^{[l]}$

```python
tf.nn.batch_normalization
```

In practice, batch norm is usually applied per mini-batch of the training set: each mini-batch is normalized using only the mean and variance computed on that same mini-batch.

### Why does Batch Norm Work?

1. Speeds up learning, just as normalizing the input $X$ does.

2. Makes weights in later layers of a deep NN more robust to changes in the weights of earlier layers (reduces covariate shift).
    - No matter how the values in layer $l-1$ change, after batch norm they always have mean 0 and variance 1 (or whatever other mean and variance $\beta$ and $\gamma$ set). This limits how much updates to the earlier $\mathbf{w}$ and $\mathbf{b}$ can shift the distribution of $\mathbf{A}^{[l-1]}$ that layer $l$ sees and uses to train $\mathbf{w}^{[l]}, \mathbf{b}^{[l]}$. $\mathbf{A}$ becomes more stable and later layers don't have to adapt as much to changes in earlier layers. Each layer learns more independently of the other layers.
    
3. Batch norm as regularization.
    - Each mini-batch is scaled by the mean/variance computed on just that mini-batch.
    - This adds some **noise** to the values $z^{[l]}$ within that mini-batch. (The mean and variance calculated on the mini-batch are not the true mean and variance.) So, similar to dropout, it adds some noise to each hidden layer's activations.
    - This has a **slight** regularization effect. Using a larger mini-batch reduces the noise and therefore the regularization effect.
    - This is just an unintended side effect, so do not rely on batch norm as a regularization tool.

### Batch Norm at Test Time

In training, the $\mu$ and $\sigma^2$ are calculated in each mini-batch:

$\mu = \frac{1}{m}\sum_{i}z^{(i)}$

$\sigma^2 = \frac{1}{m}\sum_{i}(z^{(i)}-\mu)^2$

However, at test time this might be impractical since there could be only 1 test sample at a time. Therefore, we estimate $\mu$ and $\sigma^2$ with an exponentially weighted average across mini-batches: keep track of the $\mu$ and $\sigma^2$ values during training and update the running estimates with each mini-batch.

Then use the following to get $z_{norm}^{(i)}$ and $\tilde{z}^{(i)}$:

$z_{norm} = \frac{z-\mu}{\sqrt{\sigma^2+\varepsilon}}$

$\tilde{z} = \gamma z_{norm} + \beta$
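
A minimal sketch of keeping running estimates during training and using them on a single test example. The function names, the averaging coefficient `momentum=0.9`, and the synthetic data (mean 3, variance 4) are assumptions for illustration.

```python
import numpy as np

def update_running_stats(running_mu, running_var, Z_batch, momentum=0.9):
    """Exponentially weighted average of per-mini-batch mean and variance. Z_batch: (n_units, m)."""
    mu = Z_batch.mean(axis=1, keepdims=True)
    var = Z_batch.var(axis=1, keepdims=True)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batch_norm_inference(z, running_mu, running_var, gamma, beta, eps=1e-8):
    """Normalize test-time values with the running estimates instead of batch statistics."""
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta

rng = np.random.default_rng(0)
running_mu, running_var = np.zeros((2, 1)), np.ones((2, 1))
for _ in range(200):                                   # simulated training mini-batches
    Z_batch = 3.0 + 2.0 * rng.standard_normal((2, 256))
    running_mu, running_var = update_running_stats(running_mu, running_var, Z_batch)

z_single = np.array([[3.0], [5.0]])                    # one test example
out = batch_norm_inference(z_single, running_mu, running_var,
                           gamma=np.ones((2, 1)), beta=np.zeros((2, 1)))
print(out.ravel())                                     # roughly [0, 1]
```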


### Softmax Regression

**Make prediction for multi-class classification.**

The probabilities of all outputs should sum to 1.

In final layer $L$:

{

$\mathbf{z}^{[L]} = \mathbf{w}^{[L]}\mathbf{a}^{[L-1]} + \mathbf{b}^{[L]}$

Activation function: 
- $\mathbf{t} = e^{\mathbf{z}^{[L]}}$ 
- $\mathbf{a}^{[L]} = \frac{\mathbf{t}}{\sum \mathbf{t}}$

The shape of $\mathbf{z}^{[L]}$ = the shape of $\mathbf{t}$ = the shape of $\mathbf{a}^{[L]}$ = $(C,1)$, where $C=$ number of classes

}

Softmax activation takes a vector as input and generates a vector as output, unlike other activation functions, which take a single number as input:

$\mathbf{a}^{[L]} = g^{[L]}(\mathbf{z}^{[L]})$

Hardmax only looks at the vector $\mathbf{a}^{[L]}$ and assigns 1 to the highest number and 0 to the others.
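
A small NumPy sketch of both functions. The input values in `z` are made up for illustration, and subtracting the column maximum before exponentiating is an added numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax for z of shape (C, m)."""
    t = np.exp(z - z.max(axis=0, keepdims=True))  # shift by the max for numerical stability
    return t / t.sum(axis=0, keepdims=True)

def hardmax(a):
    """Put 1 at the largest entry of each column and 0 elsewhere."""
    out = np.zeros_like(a)
    out[a.argmax(axis=0), np.arange(a.shape[1])] = 1
    return out

z = np.array([[5.0], [2.0], [-1.0], [3.0]])       # C = 4 classes, one example
a = softmax(z)
print(np.round(a.ravel(), 3))                     # [0.842 0.042 0.002 0.114]
print(hardmax(a).ravel())                         # [1. 0. 0. 0.]
```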

### Training Softmax Classifier

**Loss function:**

$L(\hat{y}, y) = -\sum_{j=1}^{C}y_j\log\hat{y}_j$ ($C = 4$ in the example below)

**Example:**

$y = \begin{bmatrix}
        0 \\
        1 \\
        0 \\
        0\\ 
        \end{bmatrix}$
        
$\hat{y} = \begin{bmatrix}
            0.3 \\
            0.2 \\
            0.1 \\
            0.4\\ 
            \end{bmatrix}$

$L(\hat{y}, y) = -\sum_{j=1}^{4}y_j\log\hat{y}_j = -\log\hat{y}_2 \rightarrow $ make $\hat{y}_2$ as big as possible.

**Cost function:**

$J(\mathbf{w}, \mathbf{b}) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)}) $

$\mathbf{Y}$ and $\hat{\mathbf{Y}}$ will be a $(C, m)$ matrix.
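
The loss and cost can be checked on the example vectors above with a short sketch. The function names are illustrative; examples are stored as columns so that `Y` and `Y_hat` have shape $(C, m)$.

```python
import numpy as np

def cross_entropy(Y_hat, Y):
    """Per-example loss -sum_j y_j * log(y_hat_j); Y_hat, Y: (C, m). Returns shape (m,)."""
    return -np.sum(Y * np.log(Y_hat), axis=0)

def cost(Y_hat, Y):
    """Average loss over the m examples."""
    return cross_entropy(Y_hat, Y).mean()

Y = np.array([[0.0], [1.0], [0.0], [0.0]])
Y_hat = np.array([[0.3], [0.2], [0.1], [0.4]])
loss = cross_entropy(Y_hat, Y)
print(np.round(loss, 4))  # [1.6094], which is -log(0.2)
```

Only the predicted probability of the true class contributes to the loss, so raising $\hat{y}_2$ is the only way to lower it.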

**Backpropagation:**

$\mathbf{dz}^{[L]} = \frac{\partial J}{\partial \mathbf{z}^{[L]}}$

Deep learning programming frameworks (such as TensorFlow) will take care of the derivative computation.

### The Problem of Local Optima

Most points with zero gradient in a high-dimensional cost function are saddle points rather than local optima.
- Unlikely to get stuck in a bad local optimum
- Plateaus can really make learning slow. -> speed up with optimizers like Adam


### TensorFlow

Use TensorFlow to minimize $J(w) = w^2 - 10w + 25$ and get $w = 5$.

```python
import numpy as np
import tensorflow as tf  # the cells below use the TensorFlow 1.x API
```

```python
# w is the variable we want to optimize
w = tf.Variable(0, dtype=tf.float32)

# The formula creates a computation graph under the hood (backward functions are derived automatically)
#cost = tf.add(tf.add(w**2, tf.multiply(-10., w)), 25)
cost = w**2 - 10*w + 25

# We can also use other optimization methods like Adam
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
print(session.run(w))
```

```
0.0
```


```python
session.run(train)
print(session.run(w))
```

```
0.099999994
```


```python
for i in range(1000):
    session.run(train)
print(session.run(w))
```

```
4.9999886
```


#### Introducing the concept of x

```python
coefficient = np.array([[1.], [-10.], [25.]])

# Get training data into the function
x = tf.placeholder(tf.float32, [3,1])
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
print(session.run(w))
```

```
0.0
```


```python
session.run(train, feed_dict={x:coefficient}) # feed different x in mini-batches
print(session.run(w))
```

```
0.099999994
```


```python
for i in range(1000):
    session.run(train, feed_dict={x:coefficient})
print(session.run(w))
```

```
4.9999886
```
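
As a framework-free cross-check of the numbers above, the same gradient descent can be written in plain Python with the derivative $\frac{dJ}{dw} = 2w - 10$ computed by hand. This is only a sketch for comparison, not how TensorFlow works internally; it uses 64-bit floats, so the last digits differ slightly from the float32 output above.

```python
def grad(w):
    """Derivative of J(w) = w^2 - 10w + 25."""
    return 2 * w - 10

w = 0.0
history = []
for step in range(1001):       # 1 step plus 1000 more, as in the cells above
    w -= 0.01 * grad(w)        # same learning rate of 0.01
    history.append(w)

print(round(history[0], 6))    # 0.1
print(round(history[-1], 6))   # 5.0
```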
