# Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

This notebook is based on what I learned from **course 2** of the **Deep Learning Specialization** provided by **deeplearning.ai**. The course videos can be found on [YouTube](https://www.youtube.com/watch?v=1waHlpKiNyY&list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc) or [Coursera](https://www.coursera.org/specializations/deep-learning). Learning through Coursera is highly recommended for access to the quizzes and programming exercises along the course, as well as the course certificate upon completion. Personally, I completed the specialization of 5 courses and acquired the [Specialization Certificate](https://coursera.org/share/e590c28a5c258e500ca6d3ccb4ed57ba). Later, I discovered the YouTube videos and used them for review.

## 1. Practical Aspects of Deep Learning

### Train/Dev/Test Sets
- Applying DL is a very iterative process to find the best hyperparameters.
- Split dataset
    - Previous era: 70/30 (train/test) or 60/20/20 (train/dev/test) when n <= 100,000
    - Big data: 98/1/1 when n >= 1,000,000, since 10,000 samples might be enough for the dev or test set; 99.5/0.4/0.1 for even bigger datasets
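- Beware of mismatched train/test distributions; the dev and test sets should come from the same distribution
- It might be OK not to have a test set (only a dev set) when an unbiased estimate of final performance is not needed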
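
The 98/1/1 split above can be sketched as follows. This is a minimal illustration, assuming the dataset fits in memory as a list and that all examples come from one pool, so that dev and test share a distribution after shuffling. The helper name `split_dataset` and its parameters are hypothetical, not from the course.

```python
import random

def split_dataset(examples, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle and split examples into train/dev/test lists (hypothetical helper)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # same pool, so dev and test match in distribution
    n_dev = round(len(shuffled) * dev_frac)
    n_test = round(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test

# Stand-in data: integers play the role of examples.
train, dev, test = split_dataset(range(1_000_000))
print(len(train), len(dev), len(test))
```

With one million examples, 1% still leaves 10,000 samples each for the dev and test sets.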

### Bias/Variance
- High variance: overfitting the data (low error on the training set but high error on the dev set)
- High bias: underfitting the data compared to a baseline such as human judgement (high error on both the training and dev sets)
- High bias and high variance (high error on the training set and even worse on the dev set)
- Low bias and low variance: a really good model (low error on both the training and dev sets)
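
The four cases above can be read directly off the training and dev error rates. The sketch below is only an illustration: the function name `diagnose`, the `baseline_err` argument, and the 2% threshold `tol` are assumptions chosen for the example, not values from the course.

```python
def diagnose(train_err, dev_err, baseline_err=0.0, tol=0.02):
    """Label a model from its error rates (hypothetical helper; tol is arbitrary)."""
    high_bias = (train_err - baseline_err) > tol   # underfits relative to the baseline
    high_variance = (dev_err - train_err) > tol    # large gap between train and dev
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"

print(diagnose(0.01, 0.11))    # high variance
print(diagnose(0.15, 0.16))    # high bias
print(diagnose(0.15, 0.30))    # high bias and high variance
print(diagnose(0.005, 0.01))   # low bias and low variance
```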

### Basic Recipe
1. High bias? (training data performance) -> Bigger network, train longer, NN architecture search
2. High variance? (dev set performance) -> More data, regularization, NN architecture search

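The bias/variance tradeoff does not always apply in DL if appropriate techniques are selected: a bigger network (with regularization) can reduce bias without much increase in variance, and more data can reduce variance without much increase in bias.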

### Regularization (might introduce bias/variance tradeoff)

- L2 regularization is used much more often than L1 when training NN models.
- $\lambda$ is the regularization parameter
- In Neural Network:

$J(\mathbf{w}, \mathbf{b}) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)}) +$<font color='blue'>$\frac{\lambda}{2m}\sum_{l=1}^{L}||\mathbf{w}^{[l]}||^2_F$</font>

<font color='blue'>$\text{Frobenius norm (L2 norm)}: ||\mathbf{w}^{[l]}||^2_F = \sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}(w^{[l]}_{ij})^2$</font>, $\mathbf{w}^{[l]}: (n^{[l]}, n^{[l-1]})$

$\frac{\partial J}{\partial \mathbf{w}^{[l]}} =  \mathbf{dw}^{[l]}, \mathbf{dw}^{[l]} = \frac{1}{m}\mathbf{dZ}^{[l]} \mathbf{A}^{[l-1]T} +$ <font color='blue'>$\frac{\lambda}{m}\mathbf{w}^{[l]} $</font>

- With large $\lambda$, we are telling the model to get smaller $\mathbf{w}$. This would encourage the training process to return a simpler model, which is like having a smoother boundary between classes if visualized. 
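
As a rough sketch of the regularized cost term and gradient, the snippet below computes the L2 penalty and the extra $\frac{\lambda}{m}\mathbf{w}$ term. It assumes the convention $z = wa + b$, so that $\mathbf{w}^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$ and activations are stored column-wise. The function names `l2_penalty` and `regularized_dw` are made up for this example.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """(lambda / 2m) * sum over layers of the squared Frobenius norm."""
    return lam / (2 * m) * sum(np.sum(np.square(w)) for w in weights)

def regularized_dw(dZ, A_prev, w, lam):
    """dw = (1/m) dZ A_prev^T + (lambda/m) w, with w of shape (n_l, n_prev)."""
    m = A_prev.shape[1]  # number of examples (columns)
    return dZ @ A_prev.T / m + lam / m * w

rng = np.random.default_rng(0)
m, n_prev, n_l = 5, 4, 3
w = rng.standard_normal((n_l, n_prev))
A_prev = rng.standard_normal((n_prev, m))
dZ = rng.standard_normal((n_l, m))

dw_plain = regularized_dw(dZ, A_prev, w, lam=0.0)
dw_reg = regularized_dw(dZ, A_prev, w, lam=0.7)

# One gradient step with the regularized gradient shrinks w by a constant
# factor (1 - lr * lam / m) before applying the usual update ("weight decay").
lr = 0.1
w_new = w - lr * dw_reg
```

The last lines show why L2 regularization is also called weight decay: every step first multiplies $\mathbf{w}$ by a factor slightly below 1.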

### Dropout Regularization

- In each iteration, randomly eliminate nodes at each layer to get a smaller, diminished network.
- Implementation - inverted dropout
```python
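import numpy as np

keep_prob = 0.8  # probability of keeping each unit
# a3: activations of layer 3 from forward propagation
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean dropout mask
a3 = np.multiply(a3, d3)  # zero out the dropped units
a3 /= keep_prob  # scale up so the expected value of z in z=wa+b is unchanged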
```
#### Notes:
1. **After rescaling the value of a, we still train w and b properly.**
2. **With dropout, a different subset of w and b is trained in each iteration, so that overall w and b are not over-trained.**
3. **The activation matrix a is dropped out, not w. We still have all of w when training the model.**


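- Making predictions at test time
    - Do not use dropout; no extra scaling is needed because inverted dropout already keeps the expected activations unchanged during training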
    
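- Intuition: a unit can't rely on any one feature, so it has to spread out its weights. -> Shrinking the weights, similar to the effect of L2 regularization
- Use a higher keep-prob on smaller layers (fewer parameters, less risk of overfitting) and a lower one on larger layers; dropout does not need to be applied to every layer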
- Dropout is frequently used in computer vision
- Downside: the cost function $J$ is less well-defined, so it is harder to check that it decreases on every iteration
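
To make the train/test difference concrete, here is a small sketch of inverted dropout with a flag that switches it off for prediction. The function name `dropout_forward` and the `training` flag are illustrative assumptions, and the all-ones activations are stand-in data.

```python
import numpy as np

def dropout_forward(a, keep_prob, training, rng):
    """Inverted dropout on activations a; returns a unchanged at test time."""
    if not training:
        return a
    mask = rng.random(a.shape) < keep_prob  # True for units that are kept
    return a * mask / keep_prob             # rescale so the expected value matches a

rng = np.random.default_rng(0)
a = np.ones((100, 1000))                    # stand-in activations
a_train = dropout_forward(a, 0.8, training=True, rng=rng)
a_test = dropout_forward(a, 0.8, training=False, rng=rng)

print(a_train.mean())  # close to 1.0: scaling preserves the expected activation
print(a_test.mean())   # exactly 1.0: nothing is dropped at test time
```

Because the rescaling happens during training, the prediction path needs no correction factor.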

### More Regularization Methods
- Data augmentation
- Early stopping
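
Early stopping can be sketched as tracking the dev error after each epoch and keeping the weights from the best epoch. The helper below works on a pre-recorded list of dev errors purely for illustration; the name `early_stopping_epoch`, the `patience` parameter, and the error values are assumptions, not part of the course material.

```python
def early_stopping_epoch(dev_errors, patience=2):
    """Return the index of the epoch whose weights would be kept.

    Scanning stops once the dev error has not improved for `patience` epochs.
    """
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(dev_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            break  # dev error stopped improving; stop training here
    return best_epoch

# Made-up dev errors: improvement until epoch 3, then the model starts to overfit.
dev_errors = [0.30, 0.22, 0.18, 0.17, 0.19, 0.21, 0.16]
print(early_stopping_epoch(dev_errors))  # 3
```

Note that training stops before the later dip at epoch 6 is ever seen, which is the usual price of a small patience value.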

### Weight Initialization

With large n (the number of nodes feeding into a layer), we want smaller w, because z sums over n terms of the form $w_i a_i$.

$\text{ReLU: Var}(\mathbf{w}^{[l]}) = \frac{2}{n^{[l-1]}}$

$\text{tanh: Var}(\mathbf{w}^{[l]}) = \frac{1}{n^{[l-1]}}$

```python

import numpy as np
# shape = (n_l, n_previous_l); use np.sqrt(2 / n_previous_l) for ReLU
w_l = np.random.randn(*shape) * np.sqrt(1 / n_previous_l)
```

However, in practice, tuning the variance of the initial weights as a hyperparameter is usually less important than other tuning techniques.
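
The scaling rule can be checked empirically. The sketch below draws Gaussian weights and compares their sample variance with the targets $\frac{2}{n^{[l-1]}}$ (ReLU) and $\frac{1}{n^{[l-1]}}$ (tanh). The helper name `init_weights` and the layer sizes are arbitrary choices for this example.

```python
import numpy as np

def init_weights(n_l, n_prev, activation="relu", rng=None):
    """Gaussian weights with Var(w) = 2/n_prev (ReLU) or 1/n_prev (tanh)."""
    if rng is None:
        rng = np.random.default_rng()
    scale = 2.0 if activation == "relu" else 1.0
    # Multiplying standard normal samples by sqrt(v) gives variance v.
    return rng.standard_normal((n_l, n_prev)) * np.sqrt(scale / n_prev)

rng = np.random.default_rng(0)
w_relu = init_weights(500, 1000, "relu", rng)
w_tanh = init_weights(500, 1000, "tanh", rng)

# Sample variances should be near 2/1000 and 1/1000 respectively.
print(w_relu.var(), w_tanh.var())
```

The square root appears only in the multiplier applied to the samples; the variance itself is $\frac{2}{n^{[l-1]}}$ or $\frac{1}{n^{[l-1]}}$.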



## 2. Optimization Algorithms

## 3. Hyperparameter Tuning, Batch Normalization, and Programming Frameworks