# Hyperparameter Tuning and Batch Normalization

## Table of Contents

* [1. Hyperparameter Tuning](#chapter1)
    * [1.1 Tuning Process](#section_1_1)
    * [1.2 Using an Appropriate Scale to Pick Hyperparameters](#section_1_2)
* [2. Batch Normalization](#chapter2)
    * [2.1 Normalizing Activations in a Network](#section_2_1)
    * [2.2 Fitting batch norm into a neural network](#section_2_2)
* [3. MultiClass Classification](#chapter3)
    * [3.1 Softmax Regression](#section_3_1)
    * [3.2 Training Softmax Classifier](#section_3_2)
* [4. Recap](#chapter4)


# 1. Hyperparameter Tuning <a class="anchor" id="chapter1"></a>

## 1.1 Tuning Process <a class="anchor" id="section_1_1"></a>

Training a neural network can involve setting many hyperparameters. This section covers how to organize the tuning process so that it converges on good settings.

<u>Hyperparameters :</u>

- $\alpha$ : learning rate
- $\beta$ : momentum
- $\beta_1$, $\beta_2$, $\epsilon$ : Adam parameters
- Number of layers
- Number of hidden units
- learning rate decay
- mini-batch size


<u>Most important hyperparameters to tune, in decreasing order of priority (the colors mark the priority tiers) :</u>

- <p style="color:red;">alpha (learning rate)</p>
- <p style="color:orange;">beta (momentum)</p>
- <p style="color:orange;">Number of hidden units</p>
- <p style="color:orange;">Mini-batch size</p>
- <p style="color:purple;">Number of layers</p>
- <p style="color:purple;">Learning rate decay</p>
- $\beta_1$, $\beta_2$, $\epsilon$ : rarely tuned. Almost always use $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$

***Tuning process :***

- Try <b>random values of the hyperparameters</b> rather than a regular grid :
    - <u>Example :</u> take hyper_param1 = alpha and hyper_param2 = epsilon. Alpha matters a lot and epsilon hardly at all. A 5 x 5 grid trains 25 models but only tries 5 distinct values of alpha, whereas 25 random points try 25 distinct values of alpha.

- Use a <b>coarse-to-fine search process</b> :
    - Try random values of the hyperparameters over a wide range
    - Select the ones that work best
    - Refine the search: zoom in on the region around these best values and spend more of the search budget sampling densely inside it.
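
The coarse-to-fine process above can be sketched as follows. This is only an illustration under stated assumptions: `validation_score` is a hypothetical stand-in for "train a model and measure it on the dev set", and the two hyperparameters searched (learning rate and number of hidden units), the trial counts and the ranges are arbitrary choices, not values prescribed by these notes.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible


def validation_score(alpha, n_hidden):
    """Hypothetical stand-in for training a model and returning a dev-set score.

    The peak near alpha = 0.01 and n_hidden = 80 is an assumption made
    purely for illustration.
    """
    return -(np.log10(alpha) + 2.0) ** 2 - ((n_hidden - 80) / 50.0) ** 2


def random_search(log_alpha_range, hidden_range, n_trials):
    """Evaluate n_trials random points; return them sorted best-first."""
    trials = []
    for _ in range(n_trials):
        alpha = 10 ** rng.uniform(*log_alpha_range)            # log-scale draw
        n_hidden = int(rng.integers(hidden_range[0], hidden_range[1] + 1))
        trials.append((validation_score(alpha, n_hidden), alpha, n_hidden))
    return sorted(trials, reverse=True)


# Coarse pass: wide ranges, random points.
coarse = random_search((-4, 0), (50, 100), n_trials=25)
top = coarse[:5]

# Fine pass: shrink the search box to the region spanned by the best trials.
log_alphas = [np.log10(alpha) for _, alpha, _ in top]
hiddens = [n_hidden for _, _, n_hidden in top]
fine = random_search(
    (min(log_alphas), max(log_alphas)),
    (min(hiddens), max(hiddens)),
    n_trials=25,
)

best_score, best_alpha, best_hidden = max(coarse[0], fine[0])
print(f"best alpha={best_alpha:.4g}, hidden units={best_hidden}")
```

In a real project each call to `validation_score` is a full training run, so the number of trials per pass is limited by the compute budget.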

## 1.2 Using an Appropriate Scale to Pick Hyperparameters <a class="anchor" id="section_1_2"></a>

It's important to pick the appropriate scale on which to explore the hyperparameters.


<u>Picking hyperparameters at random :</u>

- Trying to choose the number of hidden units, $n^{[l]}$:
    - Example: $n^{[l]}$ between 50 and 100

- Trying to choose the number of layers, $L$:
    - Example: $L$ between 2 and 4

In these cases, sampling uniformly at random over the range might be reasonable.


<u>Appropriate Scale for hyperparameters :</u>

- searching for $\alpha$: 
    - values between 0.0001 and 1 

If we sample uniformly at random on a linear scale, about 90% of the sampled values fall between 0.1 and 1, so only about 10% of the search budget is spent between 0.0001 and 0.1. <br>
<b>Instead, it is more reasonable to search alpha on a log scale than on a linear scale.</b>

--> search alpha on the scale : 0.0001 | ... | 0.001 | ... | 0.01 | ... | 0.1 | ... | 1

To implement this we draw the exponent uniformly (`np.random.rand()` is uniform on $[0, 1)$, unlike `np.random.randn()`, which is Gaussian) :

- `r = -4 * np.random.rand()` : $r \in [-4,0]$
- $\alpha = 10^{r}$
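
As a quick check of the argument above, the sketch below compares linear-scale and log-scale sampling of the learning rate. The sample size and the seed are arbitrary assumptions; the exponent is drawn from a uniform distribution on $[-4, 0]$.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
n_samples = 10_000

# Linear scale: uniform draw directly on [0.0001, 1].
alpha_uniform = rng.uniform(0.0001, 1.0, size=n_samples)

# Log scale: draw the exponent r uniformly in [-4, 0], then alpha = 10**r.
r = -4 * rng.random(size=n_samples)
alpha_log = 10.0 ** r

# Fraction of samples that land in the top decade [0.1, 1].
frac_uniform_top_decade = np.mean(alpha_uniform >= 0.1)
frac_log_top_decade = np.mean(alpha_log >= 0.1)

print(f"linear scale: {frac_uniform_top_decade:.0%} of samples in [0.1, 1]")
print(f"log scale:    {frac_log_top_decade:.0%} of samples in [0.1, 1]")
```

On the log scale each of the four decades between 0.0001 and 1 receives roughly a quarter of the samples, instead of the top decade taking about 90% of them.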


<u>Hyperparameters for exponentially weighted averages :</u>

- searching for $\beta$ between 0.9 and 0.999

Again we need to sample on a log scale rather than a linear scale. Therefore we work with $1-\beta$, which ranges from 0.1 down to 0.001 :

--> search $1-\beta$ on the scale : 0.001 | ... | 0.01 | ... | 0.1

- `r = -1 - 2 * np.random.rand()` : $r \in [-3,-1]$
- $1-\beta = 10^{r}$
- $\beta = 1 - 10^{r}$
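
A minimal sketch of this sampling scheme, assuming the exponent $r$ is drawn uniformly from $[-3, -1]$ so that $1-\beta$ lands between 0.001 and 0.1. The sample size and seed are arbitrary. The reason for the log scale is that an exponentially weighted average with parameter $\beta$ averages over roughly $\frac{1}{1-\beta}$ values, so the behaviour changes much faster near $\beta = 0.999$ (about 1000 values) than near $\beta = 0.9$ (about 10 values).

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
n_samples = 10_000

# r uniform in [-3, -1]  ->  1 - beta = 10**r lies in [0.001, 0.1].
r = -1 - 2 * rng.random(size=n_samples)
beta = 1 - 10.0 ** r

# Approximate number of values each beta averages over.
window = 1 / (1 - beta)

# Half of the exponent range maps to beta >= 0.99.
frac_high = np.mean(beta >= 0.99)

print(f"beta range: [{beta.min():.4f}, {beta.max():.4f}]")
print(f"fraction with beta >= 0.99: {frac_high:.0%}")
```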






# 2. Batch Normalization <a class="anchor" id="chapter2"></a>

## 2.1 Normalizing Activations in a Network <a class="anchor" id="section_2_1"></a>

Batch Normalization makes the hyperparameter search problem much easier and makes the neural network much more robust to the choice of hyperparameters. It also makes it much easier to train very deep networks.


**Normalize inputs to speed up learning :**<br>

<u>Example with Logistic Regression</u>

- $\mu = \frac{1}{m}\sum_{i=1}^{m} X^{(i)}$
- $\sigma = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (X^{(i)} - \mu)^2}$ (computed element-wise, per feature)
- $X_{norm} = \frac{X-\mu}{\sigma}$
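
The three formulas above can be sketched in NumPy as follows. The data is synthetic and the layout `(n_features, m)`, with one example per column, is an assumption made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Synthetic inputs: 2 features, m = 1000 examples, one example per column.
# The features deliberately have very different means and scales.
X = rng.normal(loc=[[5.0], [-2.0]], scale=[[3.0], [0.5]], size=(2, 1000))

# Per-feature mean and standard deviation over the m examples.
mu = X.mean(axis=1, keepdims=True)
sigma = np.sqrt(((X - mu) ** 2).mean(axis=1, keepdims=True))

# Each feature now has zero mean and unit variance.
X_norm = (X - mu) / sigma

print("mean after normalization:", X_norm.mean(axis=1))
print("std after normalization: ", X_norm.std(axis=1))
```

The same `mu` and `sigma` computed on the training set should be reused to normalize dev and test inputs, so that all data goes through the same transformation.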

<u>Deep neural Network :</u>

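- Apply the same idea inside the network: normalize the linear values $z^{[l]}$ of each hidden layer so that the parameters of the following layer, $W^{[l+1]}$ and $b^{[l+1]}$, train much faster.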

## 2.2 Fitting batch norm into a neural network <a class="anchor" id="section_2_2"></a>

# 3. MultiClass Classification <a class="anchor" id="chapter3"></a>

## 3.1 Softmax Regression <a class="anchor" id="section_3_1"></a>

So far, the classification examples we have talked about used binary classification, where there are two possible labels, 0 or 1.

***What if we have multiple possible classes?***

There is a generalization of logistic regression called <b>Softmax Regression</b> that makes predictions when you are trying to recognize one of multiple classes, rather than just two.

<u>Example :</u>

- We are trying to recognize cats, dogs, baby chicks and koalas:
    - koala_class = 0
    - cat_class = 1
    - dog_class = 2
    - chick_class = 3

So we're going to build a neural network where the output layer has 4 output units.


<center><img src="images/09-hyperparameters tuning/softmax.PNG" width ="500px"></center>

We want each unit in the output layer to tell us the probability of one of these four classes.

- 1st node : outputs the probability of the class koala given X
- 2nd node : outputs the probability of the class cat given X
- 3rd node : outputs the probability of the class dog given X
- 4th node : outputs the probability of the class chick given X

**Softmax Layer :**

<center><img src="images/09-hyperparameters tuning/softmax2.PNG" width ="500px"></center>

In the output layer $L$:

- $Z^{[L]} = W^{[L]} a^{[L-1]} + b^{[L]}$
- Activation Function :
    - $t = e^{Z^{[L]}}$ (element-wise exponential)
    - $a^{[L]} = \frac{t}{\sum_{i=1}^{4}t_i} = \frac{e^{Z^{[L]}}}{\sum_{i=1}^{4}t_i}$

<u>In our example with 4 classes :</u>

- $t$ has shape $(4 \times 1)$
- $a^{[L]}$ has shape $(4 \times 1)$


<u>Example with values :</u>

- $Z^{[L]} = \begin{bmatrix} 5 \\ 2 \\ -1 \\ 3 \end{bmatrix}$

- $t = \begin{bmatrix} e^5 \\ e^2 \\ e^{-1} \\ e^3 \end{bmatrix} = \begin{bmatrix} 148.4 \\ 7.4 \\ 0.4 \\ 20.1 \end{bmatrix}$


- $\sum_{i=1}^{4}t_i = 176.3$

Activation :
- $y_{pred} = a^{[L]} = \frac{t}{176.3} = \begin{bmatrix} 0.842 \\ 0.042 \\ 0.002 \\ 0.114 \end{bmatrix}$

The outputs sum to 1, as expected for a probability distribution over the four classes: $0.842 + 0.042 + 0.002 + 0.114 = 1$
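
The worked example can be reproduced with a few lines of NumPy. This is a sketch rather than a reference implementation: the function name `softmax` and the max-subtraction step are additions for this illustration. Subtracting the maximum before exponentiating leaves the result unchanged, because the common factor cancels between numerator and denominator, and it avoids overflow for large values of $Z^{[L]}$.

```python
import numpy as np


def softmax(z):
    """Softmax over the class axis (axis 0) of a (C, 1) vector or (C, m) matrix."""
    # Shift by the column-wise max for numerical stability; the result is unchanged.
    t = np.exp(z - z.max(axis=0, keepdims=True))
    return t / t.sum(axis=0, keepdims=True)


# Z^[L] from the example: rows are koala, cat, dog, chick.
z = np.array([[5.0], [2.0], [-1.0], [3.0]])
a = softmax(z)

print(np.round(a.ravel(), 3))              # class probabilities
print("sum:", a.sum())                     # probabilities add up to 1
print("predicted class:", int(a.argmax())) # index of the most likely class
```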

<center><img src="images/09-hyperparameters tuning/example.PNG" width ="300px"></center>

## 3.2 Training Softmax Classifier <a class="anchor" id="section_3_2"></a>

**Loss function :**

- C : number of classes

$L(y_{pred},y) = - \sum_{j=1}^{C} y_j \log(y_{pred,j})$

**Cost Function on m examples :**

$J(W^{[1]},b^{[1]},...,W^{[L]},b^{[L]}) = \frac{1}{m} \sum_{i=1}^{m} L(y_{pred}^{(i)},y^{(i)})$
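
A minimal sketch of the loss and cost above, assuming one-hot labels and predictions stored as `(C, m)` matrices with one example per column. The function name `cross_entropy_cost` and the two example columns are made up for this illustration. Because $y$ is one-hot, only the term of the true class survives in the sum, so the loss for one example is just $-\log$ of the probability assigned to the correct class.

```python
import numpy as np


def cross_entropy_cost(y_pred, y):
    """Average cross-entropy over m examples; y_pred and y have shape (C, m)."""
    losses = -np.sum(y * np.log(y_pred), axis=0)  # one loss per example
    return losses.mean()


# Two examples, four classes (rows: koala, cat, dog, chick).
# Example 1 is a koala (class 0); example 2 is a dog (class 2).
y = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

# Hypothetical softmax outputs; each column sums to 1.
y_pred = np.array([[0.842, 0.1],
                   [0.042, 0.2],
                   [0.002, 0.6],
                   [0.114, 0.1]])

cost = cross_entropy_cost(y_pred, y)
print(f"cost J = {cost:.4f}")
```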

# 4. Recap <a class="anchor" id="chapter4"></a>

1. If searching among a large number of hyperparameters, you should try random values.

2. Some hyperparameters, such as the learning rate, are more critical than others: a single hyperparameter can have a huge negative impact on training if it is set poorly.



3. If $\beta$ (hyperparameter for momentum) is between 0.9 and 0.99, then a way to sample a value for beta is :
    - `r = np.random.rand()` (uniform on $[0, 1)$, so $1-\beta = 10^{-r-1}$ lies between 0.01 and 0.1)
    - `beta = 1 - 10**(-r - 1)`



4. When batch normalization is applied to the $l$-th layer, the normalization is applied to $z^{[l]}$.

5. In the normalization formula 

    - $z_{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$
    - $\epsilon$ avoids division by zero

6. $\gamma$ and $\beta$ in Batch Norm :

    - set the mean and variance of the linear variable $z^{[l]}$ of a given layer
    - They can be learned using Adam, Gradient Descent, Momentum or RMSprop

7. After training a neural network with Batch Norm, at test time, to evaluate the neural network on a new example we should:

    - Perform the needed normalizations using the $\mu$ and $\sigma^2$ estimated with an exponentially weighted average across the mini-batches seen during training.
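
Points 4 to 7 can be tied together in a short NumPy sketch of the batch norm forward pass for one layer. It is an illustration under assumptions, not a full implementation: the function name `batchnorm_forward`, the `running` dictionary, the averaging parameter `momentum = 0.9` and the synthetic mini-batches are all choices made here. Note that the batch norm parameter `beta` is unrelated to the momentum hyperparameter $\beta$ from section 1.

```python
import numpy as np


def batchnorm_forward(z, gamma, beta, running, training, momentum=0.9, eps=1e-8):
    """Batch norm for one layer; z has shape (n_units, m).

    `running` holds exponentially weighted averages of the mini-batch
    mean and variance and is updated in place while training.
    """
    if training:
        mu = z.mean(axis=1, keepdims=True)
        var = z.var(axis=1, keepdims=True)
        running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        # Test time: a single example has no meaningful batch statistics,
        # so use the averages accumulated during training.
        mu, var = running["mu"], running["var"]
    z_norm = (z - mu) / np.sqrt(var + eps)   # eps avoids division by zero
    return gamma * z_norm + beta             # gamma, beta set the new scale and mean


rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
n_units = 3
gamma = np.ones((n_units, 1))   # learnable scale
beta = np.zeros((n_units, 1))   # learnable shift
running = {"mu": np.zeros((n_units, 1)), "var": np.ones((n_units, 1))}

# "Training": synthetic mini-batches of z with mean 2 and standard deviation 3.
for _ in range(200):
    z_batch = rng.normal(loc=2.0, scale=3.0, size=(n_units, 64))
    z_tilde = batchnorm_forward(z_batch, gamma, beta, running, training=True)

# "Test time": one new example, normalized with the running statistics.
z_new = rng.normal(loc=2.0, scale=3.0, size=(n_units, 1))
z_test = batchnorm_forward(z_new, gamma, beta, running, training=False)

print("running mean:", running["mu"].ravel())
print("running var: ", running["var"].ravel())
```

In a real network `gamma` and `beta` would be updated by the optimizer along with the weights, rather than staying fixed as they do in this sketch.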
