# Hyperparameter Tuning and Batch Normalization

## Table of Contents

* [1.Hyperparameter Tuning](#chapter1)
    * [1.1 Tuning Process](#section_1_1)
    * [1.2 Using an Appropriate Scale to Pick Hyperparameters](#section_1_2)
    * [1.3  ](#section_1_3)
* [2. Batch Normalization](#chapter2)
    * [2.1 Normalizing Activations in a Network](#section_2_1)
* [3. MultiClass Classification](#chapter3)

# 1. Hyperparameter Tuning <a class="anchor" id="chapter1"></a>

## 1.1 Tuning Process <a class="anchor" id="section_1_1"></a>

Training a neural network can involve setting many hyperparameters. This section covers how to organize the tuning process so that it converges on good settings.

<u>Hyperparameters :</u>

- $\alpha$ : learning rate
- $\beta$ : momentum
- $\beta_1$, $\beta_2$, $\epsilon$ : Adam parameters
- Number of layers
- Number of hidden units
- learning rate decay
- mini-batch size


<u>Most important Hyperparameters to tune :</u>

- <p style="color:red;">alpha</p>
- <p style="color:orange;">Beta</p>
- <p style="color:orange;">Number of hidden units</p>
- <p style="color:orange;">Mini-batch size</p>
- <p style="color:purple;">Number of layers</p>
- <p style="color:purple;">learning rate decay</p>
- $\beta_1$, $\beta_2$, $\epsilon$ : almost always use $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$

***Tuning process :***

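- Try <b>random values of hyperparameters</b> :
    - <u>Example :</u> with hyper_param1 = alpha and hyper_param2 = epsilon, sample both at random for every trial. Because alpha matters far more than epsilon, each trial then tests a new value of alpha.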

- Use a <b>coarse to fine search process</b> :
    - Try random values of the hyperparameters over a wide range
    - Select the ones that work best
    - Refine the search: zoom in on the region around the best values and sample it more densely, focusing more resources there if you suspect that the best setting lies in that region.
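
The coarse-to-fine process above can be sketched as follows. This is only an illustration: `validation_score` is a hypothetical stand-in for training a model and measuring its dev-set performance, and the two generic hyperparameters `h1` and `h2`, their ranges, and the trial counts are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_score(h1, h2):
    # Hypothetical stand-in for "train a model and evaluate it on the dev set".
    # It peaks at (0.3, 0.7); a real score would come from an actual training run.
    return -((h1 - 0.3) ** 2 + (h2 - 0.7) ** 2)

def random_search(low, high, n_trials):
    # Sample both hyperparameters uniformly at random inside the box [low, high].
    samples = rng.uniform(low, high, size=(n_trials, 2))
    scores = np.array([validation_score(h1, h2) for h1, h2 in samples])
    best = np.argmax(scores)
    return samples[best], scores[best]

# Coarse pass: random points over the whole range.
low, high = np.array([0.0, 0.0]), np.array([1.0, 1.0])
coarse_best, coarse_score = random_search(low, high, n_trials=25)

# Fine pass: shrink the box around the best coarse point and sample it more densely.
half_width = 0.1
fine_low = np.clip(coarse_best - half_width, low, high)
fine_high = np.clip(coarse_best + half_width, low, high)
fine_best, fine_score = random_search(fine_low, fine_high, n_trials=25)

# Keep whichever point scored best overall.
if fine_score >= coarse_score:
    best_point, best_score = fine_best, fine_score
else:
    best_point, best_score = coarse_best, coarse_score
```

The fine pass spends the same number of trials on a much smaller region, which is what "focusing more resources" amounts to in practice.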

## 1.2 Using an Appropriate Scale to Pick Hyperparameters <a class="anchor" id="section_1_2"></a>

It's important to pick the appropriate scale on which to explore the hyperparameters.


<u>Picking hyperparameters at random :</u>

- Trying to choose the number of hidden units, $n^{[l]}$:
    - Example: $n^{[l]}$ between 50 and 100

- Trying to choose the number of layers, L:
    - Example: L between 2,3 and 4

In these cases, sampling uniformly at random might be reasonable.
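
A minimal sketch of uniform sampling for these two hyperparameters, assuming a range of 50 to 100 hidden units and 2 to 4 layers. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Number of hidden units: uniform over the integers 50..100
# (the upper bound of rng.integers is exclusive, hence 101).
n_hidden = rng.integers(50, 101)

# Number of layers: uniform over {2, 3, 4}.
n_layers = rng.choice([2, 3, 4])
```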


<u>Appropriate Scale for hyperparameters :</u>

- searching for $\alpha$: 
    - values between 0.0001, .... ,1 

If we sample values uniformly at random on a linear scale, about 90% of the samples fall between 0.1 and 1, so only 10% of the resources are spent searching between 0.0001 and 0.1. <br>
<b>Instead, it seems more reasonable to search alpha on a log-scale than a linear scale.</b>

--> search alpha in the scale : 0.0001 | ... | 0.001 | ... | 0.01 | ... | 0.1 | ... | 1

To implement this we use :

- r = -4 * np.random.rand() : $r \in [-4,0]$, since np.random.rand() is uniform on $[0,1)$
- $\alpha = 10^{r}$
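
To make the difference concrete, the sketch below draws $\alpha$ both ways and measures how many samples land in the top decade $[0.1, 1]$. The sample count and the seed are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Linear scale: sample alpha directly and uniformly in [0.0001, 1].
alpha_linear = rng.uniform(1e-4, 1.0, size=n)

# Log scale: sample the exponent r uniformly in [-4, 0], then set alpha = 10**r.
r = rng.uniform(-4.0, 0.0, size=n)
alpha_log = 10.0 ** r

# Fraction of samples that land in the top decade [0.1, 1].
frac_linear = np.mean(alpha_linear >= 0.1)  # roughly 0.90
frac_log = np.mean(alpha_log >= 0.1)        # roughly 0.25
```

On the log scale each of the four decades between 0.0001 and 1 receives about a quarter of the samples.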


<u>Hyperparameters for exponentially weighted averages :</u>

- searching for $\beta$ between 0.9 and 0.999

We again need to sample on a log scale rather than a linear scale, because results are most sensitive to changes in $\beta$ when it is close to 1. Therefore we sample $1-\beta$ :

--> search $1-\beta$ in the scale : 0.001 | ... | 0.01 | ... | 0.1

- r = -1 - 2 * np.random.rand() : $r \in [-3,-1]$
- $1-\beta = 10^{r}$
- $\beta = 1 - 10^{r}$
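
A short sketch of this sampling scheme. The exponent is drawn from $[-3, -1]$, which is the range that maps $1-\beta$ to $[0.001, 0.1]$ and therefore $\beta$ to $[0.9, 0.999]$. The `window` variable uses the common rule of thumb that an exponentially weighted average covers roughly $1/(1-\beta)$ values; the number of samples is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the exponent r uniformly in [-3, -1] so that 1 - beta lies in [0.001, 0.1].
r = rng.uniform(-3.0, -1.0, size=5)
beta = 1.0 - 10.0 ** r

# Rough averaging window of an exponentially weighted average: 1 / (1 - beta).
# beta = 0.9 gives about 10 values, beta = 0.999 about 1000.
window = 1.0 / (1.0 - beta)
```

The window grows from about 10 to about 1000 as $\beta$ goes from 0.9 to 0.999, which is why small changes near 1 matter so much more than small changes near 0.9.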






# 2. Batch Normalization <a class="anchor" id="chapter2"></a>

## 2.1 Normalizing Activations in a Network <a class="anchor" id="section_2_1"></a>

Batch normalization makes the hyperparameter search problem much easier and makes the neural network much more robust to the choice of hyperparameters. It also lets us train even very deep networks much more easily.


**Normalize inputs to speed up learning :**<br>

<u>Example with Logistic Regression</u>

- $\mu = \frac{1}{m}\sum_{i} X^{(i)}$
- $\sigma = \sqrt{\frac{1}{m}\sum_{i} (X^{(i)}-\mu)^2}$
- $X_{norm} = \frac{X-\mu}{\sigma}$
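
The three formulas above can be sketched in NumPy as follows. The toy data and the row-per-example layout are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: m = 1000 examples with 2 features on very different scales.
# Rows are examples here; this layout is an assumption of the sketch.
X = rng.normal(loc=[5.0, -20.0], scale=[2.0, 50.0], size=(1000, 2))

mu = X.mean(axis=0)                            # per-feature mean
sigma = np.sqrt(((X - mu) ** 2).mean(axis=0))  # per-feature standard deviation
X_norm = (X - mu) / sigma

# After normalization each feature has mean ~0 and standard deviation ~1.
```

The same `mu` and `sigma` computed on the training set would normally be reused to normalize dev and test data.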

<u>Deep neural Network :</u>

- Normalize the values $z^{[l]}$ of each hidden layer so that the next layer's parameters $W^{[l+1]}$ and $b^{[l+1]}$ train much faster
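
A minimal sketch of the normalization step applied to one layer's $z^{[l]}$ over a mini-batch. The function name `normalize_z`, the column-per-example layout, and the small constant `eps` added for numerical stability are assumptions of this sketch. Full batch normalization also applies a learnable scale and shift afterwards, which these notes do not cover and which is omitted here.

```python
import numpy as np

def normalize_z(Z, eps=1e-8):
    # Z has shape (n_units, m): one column per example in the mini-batch.
    mu = Z.mean(axis=1, keepdims=True)   # mean of each unit over the batch
    var = Z.var(axis=1, keepdims=True)   # variance of each unit over the batch
    return (Z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=10.0, size=(4, 64))  # toy pre-activations
Z_norm = normalize_z(Z)
```

Each row of `Z_norm` then has mean close to 0 and variance close to 1 across the mini-batch, mirroring what input normalization does for $X$.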

# 3. MultiClass Classification <a class="anchor" id="chapter3"></a>