# Example of hyperparameter selection on MNIST

This is a simple, fast example of hyperparameter tuning for an ensemble of deep convolutional neural networks (DCNNs) on MNIST.

## Approaches

Currently, some of the most interesting approaches to automatic hyperparameter tuning are __[HORD](https://github.com/ilija139/HORD)__, __[reverse and forward HG](https://github.com/lucfra/FAR-HO)__, __[GP-EI](https://arxiv.org/abs/1206.2944)__ and __[GP-PES](https://arxiv.org/abs/1406.2541)__; the last two are based on Gaussian processes.

### Implementations

__[Hyper-Engine](https://github.com/maxim5/hyper-engine)__ is a generic toolbox. At the moment, it just implements Bayesian Optimization. __[NeuPy](http://neupy.com/2016/12/17/hyperparameter_optimization_for_neural_networks.html)__ provides a generic toolbox. It implements Bayesian Optimization, Gaussian Process with Expected Improvement, and Tree-structured Parzen Estimators. __[Bayesian Optimization](https://github.com/fmfn/BayesianOptimization)__ is a simpler library implementing Bayesian Optimization.

Other related libraries are __[DeepReplay](https://github.com/dvgodoy/deepreplay)__ for visualization, __[Hyperas](https://github.com/maxpumperla/hyperas)__, a wrapper around Hyperopt (which searches using random search or TPE), __[Kopt](https://github.com/Avsecz/kopt)__ and __[Talos](https://github.com/autonomio/talos)__.
 
### Optimization Algorithm

Additionally, we can avoid parametrizing the optimization algorithm by using __[Learning to Learn](https://github.com/deepmind/learning-to-learn)__, a method that learns the optimizer itself using LSTMs.

## NN Architecture

There are several architectures to choose from, including MCDNN, an ensemble of 35 DCNNs that achieves one of the best accuracies on MNIST. Due to time and computing power constraints, especially given the extra time requirements of automatic hyperparameter optimization, we have chosen a much simpler architecture.

The __[architecture](https://github.com/kkweon/mnist-competition)__ chosen for the NN is an ensemble of three DCNNs: two simplified versions of VGG and a 50-layer ResNet. While training, it performs data augmentation by rotation, shearing and scaling.
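As a rough sketch of how an ensemble can combine its members, the snippet below averages the per-class probabilities of three models and picks the most likely class. Averaging is an assumption made for illustration (the text above does not say how the three networks are combined), and the function name and the probability values are hypothetical.

```python
def ensemble_predict(prob_lists):
    """Average per-class probabilities across models and return (label, averages)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: avg[c])
    return label, avg


# Hypothetical outputs of the three component networks for one image,
# over three classes only to keep the example short.
vgg_a = [0.40, 0.35, 0.25]
vgg_b = [0.20, 0.50, 0.30]
resnet = [0.30, 0.45, 0.25]

label, avg = ensemble_predict([vgg_a, vgg_b, resnet])
print(label, avg)
```

In this toy case the first model alone would pick class 0, while the averaged prediction picks class 1, which is how an ensemble can end up more accurate than any single member.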

## Bayesian Optimization

Due to time constraints during this short exercise, we were not able to integrate the architecture with the HORD implementation, so we opted for the well-known Bayesian Optimization approach. Our work consisted of adding Bayesian Optimization to this architecture.

### Hyperparameters

Many automatic hyperparameter tuning methods, including BO, are not very good with categorical variables. In our case, most of the variables are categorical, with the exception of a few like the learning rate. The learning rate in particular is a variable that should be tested on a logarithmic scale. For that reason, we have turned it into a categorical value, opting for one of a list of possible values separated by factors of 10.
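A minimal sketch of this idea, assuming the optimizer proposes a continuous value that is then rounded to an index into the list of allowed values. The helper `pick_choice` and the variable names are hypothetical and are not taken from the repository.

```python
# Candidate learning rates separated by factors of 10 (a logarithmic scale).
learning_rates = [10.0 ** exponent for exponent in (-2, -3)]


def pick_choice(x, choices):
    """Map a continuous value x to the nearest valid index of `choices`."""
    index = int(round(x))
    index = max(0, min(len(choices) - 1, index))  # clamp to the valid range
    return choices[index]


# The optimizer searches over the index range [0, len(choices) - 1];
# the training code only ever sees one of the listed values.
print(pick_choice(0.2, learning_rates))
print(pick_choice(0.9, learning_rates))
```

The same mapping works for any categorical hyperparameter, such as the optimizer name, although the ordering of the categories is then arbitrary and the surrogate model may be misled by it.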

#### Network architecture

Some parameters affecting network architecture are:

- Non-linear activation function used.
- Number of dense hidden layers.
- Neurons per hidden layer.
- Number of convolutional layers (CNN).
- Number of features in each layer (for a Convolutional NN).

In our case, the architecture is an ensemble of three networks: ideally, we should replicate these parameters for each one of them. One possible option is to tune the hyperparameters of each component network separately and then ensemble them, but this might not lead to the best result and is quite time-consuming.

Again due to time and GPU computing-power restrictions, we have chosen to use fewer parameters and apply them to all three networks. In particular, we chose to parametrize only the non-linear activation function.

#### Training

Among the parameters affecting the training are:

- Epochs.
- Batch-size.
- Optimization algorithm.
- Learning rate.
- Decay of the learning rate.
- Weight initialization.
- Data augmentation: on/off, techniques used.

Due to time and GPU-computing restrictions, we decided to apply the same training hyperparameters to each component network. These restrictions also made us limit the number of epochs to 20, which in turn limits the maximum achievable accuracy. In our example, we decided to parametrize:

- Epochs: 1, 10 and 20. The value 1 is included only so that the search example runs quickly.
- Batch-size: 32 and 64.
- Optimization algorithms: Adagrad, Adam and Adamax (RMSProp seemed to perform clearly worse than these).
- Learning rate: logarithmic scale from 0.01 to 0.001 (due to time limits we did not explore more).
- Decay of the learning rate: 0.99, 0.9, 0.85, 0.8.
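To give a sense of the size of this search space, the sketch below writes the values listed above as a dictionary and enumerates every combination. The dictionary keys are hypothetical names, the learning rates assume the two endpoints 0.01 and 0.001 separated by a factor of 10, and the activation function is left out because its candidate values are not listed here.

```python
from itertools import product

# Training hyperparameters and candidate values, taken from the list above.
search_space = {
    "epochs": [1, 10, 20],
    "batch_size": [32, 64],
    "optimizer": ["Adagrad", "Adam", "Adamax"],
    "learning_rate": [0.01, 0.001],
    "lr_decay": [0.99, 0.9, 0.85, 0.8],
}

names = sorted(search_space)
grid = [
    dict(zip(names, values))
    for values in product(*(search_space[name] for name in names))
]

print(len(grid))  # 3 * 2 * 3 * 2 * 4 = 144 combinations
```

An exhaustive grid search would therefore need 144 full training runs of the ensemble for each activation function, which is the cost that a guided search tries to avoid.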


### Experiment

You can repeat the experiment and may achieve better results, depending on the time and GPU resources you have available.

#### Prerequisites

If you want to repeat the experiment, you need the following environment:

- Anaconda for Python 2.7
- Sonnet: 
```bash
conda install -c hcc dm-sonnet
```
- Bayesian Optimization: 
```bash
conda install -c yikelu bayesian-optimization
```

Or, in Google Colab:

```bash
# Needed if you did not follow the steps above or you are using Google Colab
!pip install dm-sonnet
!pip install bayesian-optimization
```

#### Commands

Clone the repository:

```bash
!git clone https://github.com/carloshavier/hyperparams-bo
```

```
Cloning into 'hyperparams-bo'...
remote: Counting objects: 24, done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 24 (delta 3), reused 24 (delta 3), pack-reused 0
Unpacking objects: 100% (24/24), done.
```

Then change into the repository directory:

```python
import os
os.chdir('hyperparams-bo')
```


You can set the default hyperparameters, as well as the valid range for each hyperparameter, by editing `hypers.py`.

To search using Bayesian Optimization, set the number of initial experiments and the number of experiments in the search phase in `hysearchbo.py`.

In particular, adjust the following call:

```python
bo.maximize(init_points=3, n_iter=5, kappa=2)
```
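With these values the search evaluates 3 initial points and then 5 guided points. As a rough illustration of what `kappa` does, the sketch below assumes the upper-confidence-bound (UCB) acquisition, which is the acquisition that `kappa` parameterizes in the `bayes_opt` releases that accept it in `maximize`. It is not the library's code, and the candidate names and numbers are hypothetical.

```python
def ucb(mean, std, kappa=2.0):
    """Upper confidence bound: predicted mean plus kappa standard deviations."""
    return mean + kappa * std


# Hypothetical surrogate-model predictions: (mean accuracy, uncertainty).
candidates = {
    "well_explored": (0.93, 0.01),
    "uncertain": (0.90, 0.04),
}


def best_candidate(kappa):
    """Return the candidate with the highest UCB score for a given kappa."""
    return max(candidates, key=lambda name: ucb(*candidates[name], kappa=kappa))


init_points, n_iter = 3, 5
total_evaluations = init_points + n_iter  # training runs performed by the search

print(best_candidate(0.0))  # pure exploitation: highest predicted mean
print(best_candidate(2.0))  # kappa=2 rewards uncertainty as well
print(total_evaluations)
```

A larger `kappa` makes the search favor regions where the model is uncertain, while a smaller one concentrates on regions already known to perform well.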

To explore the search space, maximizing accuracy, and then visualize the results, run the command below. Depending on the search range of the hyperparameters and the number of points to explore (`bo.maximize`), it might take several hours or days.

```bash
!python hysearchbo.py
```

### Results

#### Bayesian Optimization

The following graph shows the result of our experiment. In particular, it shows the combinations of values explored by Bayesian Optimization and the accuracy each one reached, encoded by color (green is the maximum accuracy). Note that due to time and computing power restrictions, we limited the number of epochs to 20 during this experiment.

![BO search of best hyperparameters](https://raw.githubusercontent.com/carloshavier/hyperparams-bo/master/experiment-results/experiment.png)

The maximum accuracy was obtained using:

- Non-linear activation function: PReLU.
- Epochs: 10.
- Batch-size: 32.
- Optimization algorithm: Adam.
- Learning rate: 0.01.
- Decay of the learning rate: 0.99.

The ensemble accuracy was $93.55\%$. The raw data is __[here](https://raw.githubusercontent.com/carloshavier/hyperparams-bo/master/experiment-results/ex-7-6-18-01-small.txt)__.

#### Manual Optimization

After looking at the results obtained by BO, we ran a few more experiments, including logarithmic changes to the learning rate and changes to the learning-rate decay and the optimizer. Finally, we pushed the number of epochs to 50, since we were running only a few experiments. Our best result was obtained with:

- Non-linear activation function: PReLU
- Epochs: 50
- Batch-size: 64
- Optimization algorithm: Adam
- Learning rate: 0.01
- Decay of the learning rate: 0.95

The obtained accuracy was:

- Model-0: $97.76\%$
- Model-1: $94.87\%$
- Model-2: $97.27\%$
- Final Test Accuracy: **$98.30\%$**