# Hyper-Parameter Tuning

Hyperparameter: a variable that we must set before running a learning algorithm, because the algorithm does not learn it from the data.

```python
learning_rate = 0.01
minibatch_size = 32
epochs = 12
```

There is no set value for any hyperparameter; the best value depends on the task and the dataset. We can group hyperparameters into two categories:

Optimizer Hyperparameters:

    1. Learning rate
    2. Minibatch size
    3. Number of training iterations
 
Model Hyperparameters:

    1. Number of hidden layers
    2. Model-specific hyperparameters
    
 
## Learning Rate

"The single most important hyperparameter and one should always make sure that it has been tuned" - Yoshua Bengio

A good starting point is 0.01 but the usual suspects are:

```python
0.1
0.01
0.001
0.0001
0.00001
0.000001
``` 

What is the intuition behind the learning rate?

Recall that we use gradient descent to train our neural network model. The task boils down to decreasing the error value calculated by a loss function.

<img src='rnn_img/m43.png' width=40% />

Suppose we plot the error against the weights. The learning rate is the multiplier applied to the gradient to decide how large a step we take toward a local minimum. If the learning rate is too large, each step can overshoot the minimum, and the weights may never settle at the ideal error value. On the flip side, if the learning rate is too small, training can take so long that the model never reaches a reasonable error in practice.
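To make this concrete, here is a minimal sketch that runs plain gradient descent on a toy error function, `error(w) = w**2`, with several learning rates. The error function, the function name `gradient_descent`, the step count, and the value 1.1 (chosen only to show a rate that is too large) are assumptions made for illustration.

```python
def gradient_descent(learning_rate, steps=50, w0=1.0):
    """Minimize the toy error(w) = w**2 starting at w0; return the final error."""
    w = w0
    for _ in range(steps):
        gradient = 2.0 * w              # derivative of w**2 with respect to w
        w -= learning_rate * gradient   # the learning rate scales each step
    return w ** 2

# 1.1 is deliberately too large; 0.000001 is deliberately too small.
for lr in (1.1, 0.1, 0.01, 0.000001):
    print(f"learning_rate={lr:<8} final error={gradient_descent(lr):.6g}")
```

On this toy problem the largest rate makes the error grow with every step, while the smallest rate barely moves the weight away from its starting point in 50 steps.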

<img src='rnn_img/w44.png' width=80% />

We don't know the exact shape of the error curve, so finding a good learning rate may be more difficult than it looks. Remember that a model has many weights, so the error is really a surface over many dimensions rather than a curve over one.

## Minibatch Size

This hyperparameter affects both the resource requirements and the training speed, so choosing it may be less trivial than choosing other hyperparameters. Researchers have argued over several training approaches in this regard:

1. Online (Stochastic): train on one example at a time.
2. Batch: feed the entire dataset to the model at each training iteration.

The common practice today is to set a minibatch size, where the two extremes are:

```python
num_training_examples = 50_000  # hypothetical dataset size, for illustration only

minibatch_size = 1  # Stochastic (online)
# or
minibatch_size = num_training_examples  # Batch
```

Commonly used minibatch sizes are 1, 2, 4, 8, 16, 32, 64, 128, and 256 (32 is the most common).
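A minimal sketch of how the minibatch size splits one epoch into weight updates, assuming a toy dataset of 100 examples. The names `iterate_minibatches` and `examples` are hypothetical and used only for illustration.

```python
import random

def iterate_minibatches(examples, minibatch_size, seed=0):
    """Yield shuffled minibatches that together cover the dataset once (one epoch)."""
    order = list(range(len(examples)))
    random.Random(seed).shuffle(order)   # shuffle so each minibatch is a random sample
    for start in range(0, len(order), minibatch_size):
        yield [examples[i] for i in order[start:start + minibatch_size]]

examples = list(range(100))  # hypothetical toy dataset of 100 examples

# Stochastic (1), a typical minibatch (32), and full batch (all examples).
for size in (1, 32, len(examples)):
    updates = sum(1 for _ in iterate_minibatches(examples, size))
    print(f"minibatch_size={size:>3} -> {updates} weight updates per epoch")
```

A smaller minibatch size gives more, noisier updates per epoch; a larger one gives fewer updates that each see more data.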

For a larger minibatch:

    Pros: More calculations are done on the dataset at once, which can speed up training
    Cons: Needs more memory and is more computationally intensive
    
In practice, smaller minibatch sizes add more noise to the error calculations, which often helps prevent training from getting stuck in a local minimum.

<img src='rnn_img/m44.png' width=80% />

In general, it is ideal to start at 32 and work your way up. A research paper shows the impact of different batch sizes on a CNN.

<img src='rnn_img/w45.png' width=80% />

Overall, a minibatch size that is too large can hurt accuracy and demand more memory, while one that is too small can make training too slow.

## Number of Training Iterations

To choose the right number of iterations, we monitor the validation error.
As long as the validation error keeps decreasing, we can continue training for more iterations.

```text
Epoch 1, Batch 1, Training Error: 4.4181, Validation Error: 4.5543
Epoch 1, Batch 2, Training Error: 4.3181, Validation Error: 4.3543
Epoch 1, Batch 3, Training Error: 4.5181, Validation Error: 4.1363
Epoch 1, Batch 4, Training Error: 3.8181, Validation Error: 3.8233
Epoch 1, Batch 5, Training Error: 3.7181, Validation Error: 3.4123
```

We can use a technique called early stopping, which stops training if the validation error has not improved for a chosen number of training iterations.
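A minimal sketch of early stopping with a patience counter. The first five validation errors come from the log above; the remaining values, the `patience` parameter, and the function name `early_stopping_step` are hypothetical and only illustrate the idea.

```python
def early_stopping_step(validation_errors, patience=3):
    """Return (stop_step, best_step, best_error) for a sequence of validation errors.

    Training stops at the first step where the validation error has not
    improved for `patience` consecutive checks.
    """
    best_error = float("inf")
    best_step = -1
    for step, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_step = error, step   # new best: reset the wait
        elif step - best_step >= patience:
            return step, best_step, best_error    # no improvement for too long
    return len(validation_errors) - 1, best_step, best_error

# First five values are from the log above; the rest are made up for illustration.
validation_errors = [4.5543, 4.3543, 4.1363, 3.8233, 3.4123, 3.45, 3.50, 3.48, 3.60]

stop_step, best_step, best_error = early_stopping_step(validation_errors, patience=3)
print(f"stopped at step {stop_step}, best error {best_error} at step {best_step}")
```

In a real training loop one would typically also save the model weights at `best_step`, so that the best model can be restored after stopping.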

## Number of Hidden Units / Layers

For a model to learn, it needs enough "capacity". The more complex the model, the more capacity it has to learn. However, this also means that the model can easily overfit and memorize the training data rather than learn from it.

<img src='rnn_img/w46.png' width=80% />

If the training accuracy is much higher than the validation accuracy, the model is likely overfitting, and you can try decreasing the number of hidden units. You may also use regularization techniques such as dropout or L2 regularization.
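As a rough illustration of L2 regularization, the sketch below adds a penalty on the squared weights to a mean squared error loss. The function names, the `l2_strength` parameter, and the toy numbers are assumptions made for this example and do not come from any particular library.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def l2_penalty(weights, l2_strength):
    """Penalty that grows with the squared magnitude of the weights."""
    return l2_strength * sum(w ** 2 for w in weights)

def regularized_loss(predictions, targets, weights, l2_strength):
    """Data loss plus the L2 penalty; l2_strength is itself a hyperparameter."""
    return mean_squared_error(predictions, targets) + l2_penalty(weights, l2_strength)

# Hypothetical toy values, for illustration only.
predictions, targets = [1.0, 2.0], [1.0, 3.0]
weights = [3.0, -4.0]

print(regularized_loss(predictions, targets, weights, l2_strength=0.0))
print(regularized_loss(predictions, targets, weights, l2_strength=0.01))
```

Because large weights raise the loss, the optimizer is pushed toward smaller weights, which tends to reduce overfitting.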

"in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more. This is in stark contrast to Convolutional Networks, where depth has been found to be an extremely important component for a good recognition system (e.g. on order of 10 learnable layers)." ~ Andrej Karpathy in https://cs231n.github.io/neural-networks-1/

## LSTM vs GRU (RNN)

- "Our results are not conclusive in comparing the LSTM and the GRU, which suggests that the choice of the type of gated recurrent unit may depend heavily on the dataset and corresponding task."

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling by Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio

- "The GRU outperformed the LSTM on all tasks with the exception of language modelling"

An Empirical Exploration of Recurrent Network Architectures by Rafal Jozefowicz, Wojciech Zaremba, Ilya Sutskever

- "Our consistent finding is that depth of at least two is beneficial. However, between two and three layers our results are mixed. Additionally, the results are mixed between the LSTM and the GRU, but both significantly outperform the RNN."

Visualizing and Understanding Recurrent Networks by Andrej Karpathy, Justin Johnson, Li Fei-Fei

- "In our [Neural Machine Translation] experiments, LSTM cells consistently outperformed GRU cells. Since the computational bottleneck in our architecture is the softmax operation we did not observe large difference in training speed between LSTM and GRU cells"

Massive Exploration of Neural Machine Translation Architectures by Denny Britz, Anna Goldie, Minh-Thang Luong, Quoc Le

# Sources

If you want to learn more about hyperparameters, these are some great resources on the topic:

- Practical recommendations for gradient-based training of deep architectures by Yoshua Bengio

- Deep Learning book - Chapter 11.4: Selecting Hyperparameters by Ian Goodfellow, Yoshua Bengio, Aaron Courville

- Neural Networks and Deep Learning book - Chapter 3: How to choose a neural network's hyper-parameters? by Michael Nielsen

- Efficient BackProp by Yann LeCun

More specialized sources:

- How to Generate a Good Word Embedding? by Siwei Lai, Kang Liu, Liheng Xu, Jun Zhao
- Systematic evaluation of CNN advances on the ImageNet by Dmytro Mishkin, Nikolay Sergievskiy, Jiri Matas
- Visualizing and Understanding Recurrent Networks by Andrej Karpathy, Justin Johnson, Li Fei-Fei