# Hyperparameter tuning

## Tuning process

Importance of hyperparameters:

- First:
    - learning rate (alpha) is the most important hyperparameter!

- Second:
    - momentum beta (0.9 is a good default)
    - number of hidden units
    - mini-batch size

- Third:
    - number of layers
    - learning rate decay

- Fourth (Adam parameters, usually left at their defaults):
    - beta1: 0.9
    - beta2: 0.999
    - epsilon: $10^{-8}$


Historically, practitioners would lay out a grid and try every combination. This was a feasible approach because there were not many hyperparameters to choose from. For example, with 5 candidate values for each of 2 hyperparameters, we would have to try 25 combinations.

But what is a better method? A better method is to sample the values randomly, because we then try a much wider range of values for each hyperparameter. With a 5 x 5 grid (say, for learning rate and epsilon), we only ever try 5 distinct learning rates. Even though the grid has 25 points, each learning rate is repeated 5 times, so 25 trials explore only 5 learning rates.

However, if the values are chosen randomly, we are no longer confined to 5 learning rates: 25 trials give us 25 distinct learning rates. This matters most when one hyperparameter (here, the learning rate) affects the result far more than the other (epsilon).

<img src="./images/improv_53.png" alt="Drawing" style="width: 400px;"/>
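To make the counting argument concrete, here is a minimal sketch that builds both sets of 25 trials and counts how many distinct learning rates each one explores. The candidate ranges for the learning rate and epsilon are illustrative assumptions, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Grid search: 5 candidate values per hyperparameter -> 25 combinations.
grid_lr = np.logspace(-4, 0, 5)       # assumed learning-rate candidates
grid_eps = np.logspace(-10, -6, 5)    # assumed epsilon candidates
grid_trials = [(lr, eps) for lr in grid_lr for eps in grid_eps]

# Random search: 25 trials, each drawing its own learning rate and epsilon.
random_trials = [
    (10 ** rng.uniform(-4, 0), 10 ** rng.uniform(-10, -6))
    for _ in range(25)
]

distinct_grid_lr = len({lr for lr, _ in grid_trials})
distinct_random_lr = len({lr for lr, _ in random_trials})
print(len(grid_trials), distinct_grid_lr)      # 25 trials, 5 learning rates
print(len(random_trials), distinct_random_lr)  # 25 trials, each with its own learning rate
```

Both approaches spend the same budget of 25 training runs; the random version simply covers far more values of whichever hyperparameter turns out to matter.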

Another method: **Coarse to fine**
- First sample coarsely over the whole range to find the region where the best-performing hyperparameters lie, then zoom into that region and sample more densely within it.
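One possible way to sketch coarse to fine in code is shown below. The `validation_score` function is a hypothetical stand-in for training a model and evaluating it on a dev set, and the ranges and the choice of "keep the top 5" are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_score(lr, hidden):
    """Hypothetical stand-in for training a model and scoring it on a dev set."""
    return -((np.log10(lr) + 2.5) ** 2) - ((hidden - 120) / 100) ** 2

def sample(n, log_lr_range, hidden_range):
    """Draw n random (learning rate, hidden units) pairs from the given ranges."""
    lrs = 10 ** rng.uniform(*log_lr_range, size=n)
    hidden = rng.integers(hidden_range[0], hidden_range[1] + 1, size=n)
    return list(zip(lrs, hidden))

def top_k(trials, k):
    """Return the k trials with the highest validation score."""
    return sorted(trials, key=lambda t: validation_score(*t), reverse=True)[:k]

# Coarse pass: sample over the whole range.
coarse = sample(25, (-4.0, 0.0), (20, 300))
best = top_k(coarse, 5)

# Fine pass: zoom into the box spanned by the best coarse trials.
log_lrs = [np.log10(lr) for lr, _ in best]
hidden = [h for _, h in best]
fine = sample(25, (min(log_lrs), max(log_lrs)), (min(hidden), max(hidden)))

best_lr, best_hidden = top_k(coarse + fine, 1)[0]
print(best_lr, best_hidden)
```

The fine pass spends its whole budget inside the promising region, so it samples that region much more densely than the coarse pass did.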

## Using an appropriate scale to pick hyperparameters

The takeaway is to make sure you are sampling your hyperparameters on the correct scale. For some hyperparameters, sampling from a uniform distribution makes sense.
- Example: total number of layers, or number of hidden units.
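
For these hyperparameters, a plain uniform draw is enough. A minimal sketch, where the ranges are illustrative assumptions rather than values from these notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# rng.integers excludes the upper bound by default.
num_layers = rng.integers(2, 5)            # uniform over {2, 3, 4}
num_hidden_units = rng.integers(50, 101)   # uniform over {50, ..., 100}
print(num_layers, num_hidden_units)
```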


However, this is not always the case. For other hyperparameters, a uniform distribution does not make sense.

In the image below, if we sample the learning rate from a uniform distribution between 0.0001 and 1, about 90% of the values land between 0.1 and 1, and very few land between 0.0001 and 0.01, even though that range deserves just as much of the search
- <img src="./images/improv_36.png" alt="Drawing" style="width: 400px;"/>
- It's more reasonable to search for a learning rate across a log scale!
- <img src="./images/improv_54.png" alt="Drawing" style="width: 400px;"/>

```python
import numpy as np

r = -4 * np.random.rand()        # np.random.rand() is in [0, 1), so r is in (-4, 0]
learning_rate = np.power(10, r)  # hence 10^r lies between 10^-4 and 10^0
```
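
To see the difference between the two scales, the sketch below draws many learning rates both ways and reports what fraction falls in each decade between 0.0001 and 1. The helper name `fraction_per_decade` and the sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

uniform_lr = rng.uniform(1e-4, 1.0, size=n)     # sampled on a linear scale
log_lr = 10 ** rng.uniform(-4.0, 0.0, size=n)   # sampled on a log scale

def fraction_per_decade(samples):
    """Fraction of samples in [1e-4, 1e-3), [1e-3, 1e-2), [1e-2, 1e-1), [1e-1, 1]."""
    edges = 10.0 ** np.arange(-4, 1)
    counts, _ = np.histogram(samples, bins=edges)
    return counts / len(samples)

print(fraction_per_decade(uniform_lr))  # dominated by the last decade
print(fraction_per_decade(log_lr))      # roughly equal across the four decades
```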

Hyperparameters for exponentially weighted averages:
- Beta could be 0.9 (averaging over roughly the last 10 days) or 0.999 (roughly the last 1000 days)
- Sampling on a log scale also makes sense when choosing beta for the exponentially weighted average. Notice that we want beta between 0.9 and 0.999
    - BUT we sample 1 - beta rather than beta itself. Hence, we sample 1 - beta on a log scale between 0.1 and 0.001

<img src="./images/improv_37.png" alt="Drawing" style="width: 300px;"/>
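
Following the same pattern as the learning-rate snippet, sampling beta could look like the sketch below. The variable names are assumptions; the range comes from the 0.9 to 0.999 interval above.

```python
import numpy as np

rng = np.random.default_rng(0)

r = rng.uniform(-3.0, -1.0)  # exponent for 1 - beta, uniform in [-3, -1)
beta = 1 - 10 ** r           # 1 - beta is in [0.001, 0.1), so beta is in (0.9, 0.999]
print(beta)
```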


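It's bad to sample beta on a linear scale because the results are very sensitive to changes in beta when beta is close to 1
- Meaning, the closer beta is to 1, the greater the effect of a small change: going from 0.9 to 0.9005 barely matters (about 10 values averaged either way), while going from 0.999 to 0.9995 doubles the averaging window from about 1000 to about 2000 values
- Hence, sampling 1 - beta on a log scale lets us sample more densely in the region where beta is close to 1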
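
The sensitivity near 1 follows from the rule of thumb that an exponentially weighted average with parameter beta averages over roughly 1 / (1 - beta) values. A small sketch (the function name `effective_window` is an assumption):

```python
def effective_window(beta):
    """Approximate number of values averaged: 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)

# The same step of 0.0005 in beta has very different effects.
for beta in (0.9, 0.9005, 0.999, 0.9995):
    print(beta, round(effective_window(beta)))
```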

## Hyperparameters tuning in practice: Pandas vs. Caviar

There are two tuning methods:
1. Babysitting one model: Here, we would pick a set of hyperparameters, check the model's performance, and tweak them as training progresses.
    - <img src="./images/improv_38.png" alt="Drawing" style="width: 350px;"/>

2. Train many models in parallel!
    - <img src="./images/improv_39.png" alt="Drawing" style="width: 350px;"/>

Reminder: 
- Panda: Train one model and adjust it manually.
    - Pandas typically have one kid and invest all their resources in that one kid
- Caviar: Train many models in parallel, each with a different set of hyperparameters, and let them run on their own.
    - If we have a lot of computational power, the caviar approach can be a great method to use, since we are training in parallel!
    - Fish typically lay many eggs and hope that one will turn out well!
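
As a rough sketch of the caviar approach, the snippet below launches several randomly sampled configurations at once and keeps the best one. `train_and_score` is a hypothetical stand-in for a real training run, and the thread pool is only a placeholder: real training jobs would more likely run in separate processes or on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor
import random

def train_and_score(config):
    """Hypothetical stand-in: a real version would train a model and return its dev-set score."""
    lr, hidden = config["learning_rate"], config["hidden_units"]
    return -abs(lr - 0.01) - abs(hidden - 100) / 1000

random.seed(0)
configs = [
    {"learning_rate": 10 ** random.uniform(-4, 0), "hidden_units": random.randint(50, 150)}
    for _ in range(8)
]

# Launch every configuration at once and keep whichever scores best.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(train_and_score, configs))

best_score, best_config = max(zip(scores, configs), key=lambda pair: pair[0])
print(best_config, best_score)
```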