# Hyper-parameter selection

In this notebook we carry out an experiment to identify the best hyper-parameters for our network.

This notebook builds on the results of overfitting_experimentation.ipynb:
- We will use regularization, dropout and weight constraints for our models.
- We will use image generators with rotation and shift for image data preprocessing.
- We will focus on models at the low-complexity end of CNNs. This means we will use
    - no more than 2 convolutional layers
    - no more than 1000 neurons in fully connected layers
    - no more than 2 fully connected layers.

To find the best network structure, we will experiment with different configurations of fully connected and convolutional layers. We will perform a grid search to identify the best structure.

But first, let's do the imports.

In [14]:
# correct working directory only once 
if "working_directory_corrected" not in vars():
    %cd ..
    working_directory_corrected = True

import pickle
import pandas as pd

## Grid Search

For the grid search we will test all combinations of the following options:
- Convolutional Layers:
     - We will test one convolutional layer with 8, 16, 32 and 64 patterns.
     - We will test two convolutional layers with 8, 16, 32 and 64 patterns in the first layer and half as many in the second layer.
- Fully Connected Layers:
     - We will test one fully connected layer with 25, 50, 100, 250, 500 and 1000 neurons.
     - We will test two fully connected layers with 25, 50, 100, 250, 500 and 1000 neurons in each layer.
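
The options listed above can be enumerated programmatically. The following is a minimal sketch, not the code from the experiment script: the variable names are illustrative, and it assumes that "two fully connected layers" means both layers share the same width (as in entries such as [25, 25]). It covers only the options spelled out in the list, so the size of this grid need not match the grid the script actually ran.

```python
from itertools import product

# Pattern (filter) counts for the first convolutional layer, as listed above.
conv_sizes = [8, 16, 32, 64]
# Neuron counts for the fully connected layers, as listed above.
dense_sizes = [25, 50, 100, 250, 500, 1000]

# One convolutional layer, or two where the second has half as many patterns.
conv_configs = [[n] for n in conv_sizes] + [[n, n // 2] for n in conv_sizes]

# One fully connected layer, or two of equal width (assumption, see lead-in).
dense_configs = [[n] for n in dense_sizes] + [[n, n] for n in dense_sizes]

# Every combination of a convolutional part and a fully connected part.
grid = list(product(conv_configs, dense_configs))
print(len(conv_configs), len(dense_configs), len(grid))
```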

Overall, this means our grid search will test 8*13 = 104 different configurations.

As noted in our overfitting experiments, 10 runs per configuration still lead to some variation in results. For this reason, we will further increase the number of times we train each configuration. Furthermore, we are concerned that the train/test split may influence our results. For this reason, we use cross-validation (with 5 folds) to obtain more robust results. Overall, each configuration will be trained five times for each of the five folds, that is, 25 times in total.
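
The "five runs for each of five folds" scheme can be sketched as follows. This is an illustration in plain Python rather than the code from the experiment script. `repeated_kfold` and `train_and_score` are hypothetical names, and `train_and_score` is a placeholder that returns a random number instead of training a network.

```python
import random
from statistics import mean


def repeated_kfold(n_samples, n_folds=5, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs: n_repeats runs for each of n_folds folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Split the shuffled indices into n_folds nearly equal parts.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for fold_no, test_idx in enumerate(folds):
        # All samples outside the current fold form the training set.
        train_idx = [i for k, fold in enumerate(folds) if k != fold_no for i in fold]
        # Repeat the same split several times to average out training randomness.
        for _ in range(n_repeats):
            yield train_idx, test_idx


def train_and_score(train_idx, test_idx):
    """Hypothetical placeholder: train a model and return its Hamming score."""
    return random.random()


scores = [train_and_score(train, test) for train, test in repeated_kfold(100)]
average_score = mean(scores)
print(len(scores))  # 25 runs for this one configuration
```

Averaging over all 25 scores gives one number per configuration, which is the quantity a grid search of this kind would compare.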

This means the grid search will train a machine learning model 104*25 = 2600 times. Depending on the machine this experiment runs on, this may take hours or even days. For this reason, we will not do this in the notebook environment. Instead, we have defined the script *scripts/hyper_parameter_selection.py* that will execute the experiment. Preferably, this script is run on a cloud server.

The script stores its results in a file, which we can load and inspect in this Jupyter notebook.
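
As a rough illustration of this hand-off, a script can write its results with `pickle.dump` and a notebook can read them back with `pickle.load`. The record layout below (a dictionary mapping an arbitrary key to a CNN configuration, an ANN configuration and a score) is an assumption made for this sketch, and the temporary path stands in for the real results file.

```python
import os
import pickle
import tempfile

# Hypothetical record layout: key -> (CNN config, ANN config, mean Hamming score).
results = {
    0: ([8], [25], 0.1556),
    1: ([8], [50], 0.2008),
}

# Write to a temporary location; the real script would use its own output path.
path = os.path.join(tempfile.mkdtemp(), "grid_search_results.pickle")
with open(path, "wb") as handle:
    pickle.dump(results, handle)

# Later, for example in the notebook, read the records back.
with open(path, "rb") as handle:
    restored = pickle.load(handle)
```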

## Results

The results of executing the grid search are stored in pickle files in the folder *evaluation/experiment_records*. The results of the grid search described above are stored in *grid_search_results.pickle*. (Due to a configuration error, we did not test the [50, 50] configuration for the fully connected layers. The experiment will be repeated when time permits.)

Let's load the results.

In [22]:
filename = "evaluation/experiment_records/grid_search_results.pickle"
# open the file in a context manager so the handle is closed after loading
with open(filename, "rb") as handle:
    results = pickle.load(handle)
results = list(results.values())
results_df = pd.DataFrame(results,columns=["CNN","ANN","Hamming"])

results_df

|   | CNN | ANN | Hamming |
|---|-----|-----|---------|
| 0 | [8] | [25] | 0.1556 |
| 1 | [8] | [50] | 0.2008 |
| 2 | [8] | [100] | 0.2076 |
| 3 | [8] | [250] | 0.208 |
| 4 | [8] | [500] | 0.2096 |
| 5 | [8] | [1000] | 0.1884 |
| 6 | [8] | [25, 25] | 0.168 |


For convenience's sake, we converted the results into a data frame with three columns:
- CNN: the configuration of the convolutional part of the network
- ANN: the configuration of the fully connected part of the network
- Hamming: the average Hamming score over all runs of that configuration

To find the best configuration, we sort the data frame by Hamming score in descending order and look at the first ten results.

In [21]:

results_df.sort_values(by="Hamming", ascending=False, inplace=True)

results_df.head(10)


|   | CNN | ANN | Hamming |
|---|-----|-----|---------|
| 4 | [8] | [500] | 0.2096 |
| 3 | [8] | [250] | 0.208 |
| 2 | [8] | [100] | 0.2076 |
| 1 | [8] | [50] | 0.2008 |
| 5 | [8] | [1000] | 0.1884 |
| 6 | [8] | [25, 25] | 0.168 |
| 0 | [8] | [25] | 0.1556 |


## Conclusions
Looking at the top configurations, the one that performed best was:
- CNN: 1 layer with 8 patterns
- ANN: 1 layer with 500 neurons

We will use this configuration for our evaluation.
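
The best row can also be picked out directly instead of reading it off the sorted table. This sketch rebuilds a small data frame from the rows displayed in this notebook and uses `idxmax` to select the row with the highest Hamming score. It assumes, as the descending sort does, that a higher score is better.

```python
import pandas as pd

# Rows as displayed in the results table of this notebook.
rows = [
    ([8], [25], 0.1556),
    ([8], [50], 0.2008),
    ([8], [100], 0.2076),
    ([8], [250], 0.2080),
    ([8], [500], 0.2096),
    ([8], [1000], 0.1884),
    ([8], [25, 25], 0.1680),
]
results_df = pd.DataFrame(rows, columns=["CNN", "ANN", "Hamming"])

# idxmax returns the index label of the highest score; loc fetches that row.
best = results_df.loc[results_df["Hamming"].idxmax()]
print(best["CNN"], best["ANN"], best["Hamming"])
```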

