## Fashion MNIST classification and hyperparameter optimization using Scikit-Optimize

Here we optimize the hyperparameters of a Convolutional Neural Network to get better results when recognizing images from a data set. We will try to find the best (or at least good) values for 3 hyperparameters: the learning rate for the Adam optimizer, the number of dense layers, and the number of neurons in each dense layer. There are a few ways to search for them. We could try all combinations of values, which is called grid search. It is an option, but not a good one: with 10 different values for each hyperparameter, we would have to make 10 × 10 × 10 = 1000 model evaluations, and that could take too much time. We could also try random values for some time and then keep the best set (random search), and well, it does not sound promising, does it?
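
To make the cost of grid search concrete, here is a small sketch that enumerates a grid with 10 values per hyperparameter. The candidate values themselves are assumptions chosen only for illustration; they are not used anywhere else in this notebook.

```python
import itertools

# Hypothetical candidate values, 10 per hyperparameter (illustration only).
learning_rates = [10 ** (-5 + 4 * i / 9) for i in range(10)]  # spans 1e-5 to 1e-1
dense_layers = list(range(1, 11))                              # 1 to 10 layers
dense_nodes = [5 + 50 * i for i in range(10)]                  # 5 to 455 nodes

# Grid search has to train one model for every combination.
grid = list(itertools.product(learning_rates, dense_layers, dense_nodes))
print(len(grid))  # 1000
```

Every entry in `grid` would cost one full training run, which is why we look for a method that needs far fewer evaluations.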

First let's talk a little bit about Bayesian optimization using Gaussian Processes.

Gaussian processes provide not only a predicted value for some function, but also the uncertainty around that value. Bayesian optimization is a class of iterative optimization methods for the general setting where the set of allowed hyperparameter values (the search space) is known, but there is no information about the function we are trying to optimize. As more experiments are run, the optimizer gains confidence in what the performance function looks like and which values optimize it.
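
The idea of "a value plus its uncertainty" can be sketched with the simplest possible case: a zero-mean Gaussian process with a single observation, where no matrix inversion is needed. The kernel choice (squared exponential), the length scale, the noise level and the observed value are all assumptions made for this illustration; scikit-optimize fits a more elaborate model internally.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def gp_posterior_one_point(x, x_obs, y_obs, noise=1e-6):
    """Posterior mean and variance at x for a zero-mean GP given ONE observation."""
    k_star = rbf(x, x_obs)            # similarity between x and the observed point
    k_obs = rbf(x_obs, x_obs) + noise  # prior variance at the observed point (+ noise)
    mean = k_star / k_obs * y_obs
    var = rbf(x, x) - k_star ** 2 / k_obs
    return mean, var

# Hypothetical observation: the objective was 0.9 at x = 0.
for x in (0.0, 0.5, 3.0):
    mean, var = gp_posterior_one_point(x, x_obs=0.0, y_obs=0.9)
    print(f"x={x}: mean={mean:.3f}, std={math.sqrt(max(var, 0.0)):.3f}")
```

Close to the observed point the standard deviation is almost zero; far away it returns to the prior value of about 1, which tells the optimizer that this region is still unexplored.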


![1_UWvMP_dtt1lQhwQvfNOjLw](https://user-images.githubusercontent.com/23335136/55368076-5ce0b980-54c6-11e9-88e8-a17f6ab7a20f.png)

The performance is a function of the hyperparameters and is not known at the beginning. As the experiments progress, the optimizer becomes more and more confident about what that function looks like; this update in belief is made using Bayes' rule. The function that picks the next hyperparameters to try is called the acquisition function. The one used here is Expected Improvement and, as the name suggests, it chooses the hyperparameters it expects will give the largest improvement, balancing exploration and exploitation.
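
As a rough sketch of how Expected Improvement trades off exploration and exploitation, the function below uses the standard closed form for a minimization problem, given the surrogate's predicted mean and standard deviation at a candidate point. The margin `xi` and the candidate numbers are assumptions for illustration, and this is not scikit-optimize's internal code.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mean, std, best, xi=0.01):
    """Expected Improvement for minimization.

    mean, std: surrogate prediction at the candidate point
    best:      lowest objective value observed so far
    xi:        small margin that favors exploration (assumed value)
    """
    improvement = best - mean - xi
    if std <= 0.0:
        return max(improvement, 0.0)
    z = improvement / std
    return improvement * normal_cdf(z) + std * normal_pdf(z)

best_so_far = -0.85  # a negated accuracy, since the optimizer minimizes
candidates = {
    "confident, slightly better": (-0.86, 0.005),
    "uncertain, predicted worse": (-0.80, 0.10),
    "confident, predicted worse": (-0.80, 0.005),
}
for name, (mean, std) in candidates.items():
    print(name, expected_improvement(mean, std, best_so_far))
```

With these made-up numbers, the uncertain candidate scores highest even though its predicted mean is worse: high uncertainty means it might turn out to be much better, which is the exploration part of the trade-off.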

A better explanation (also longer) can be found here: https://www.youtube.com/watch?v=jtRPxRnOXnk&t=2018s


An even better explanation with way better code (but the code is longer): https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/19_Hyper-Parameters.ipynb


In [1]:
# Import TensorFlow and TensorFlow Datasets
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras import backend as K
tf.logging.set_verbosity(tf.logging.ERROR)

# Helper libraries
import math
import numpy as np
import matplotlib.pyplot as plt

import skopt
from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.plots import plot_convergence
from skopt.plots import plot_objective, plot_evaluations
from skopt.utils import use_named_args


print("tensorflow GPU Version ",tf.__version__)

tensorflow GPU Version  1.12.0


We will use the Fashion MNIST dataset in this problem. It has 70000 images and 10 classes like the good old MNIST, but its shapes are more complex, which makes it a better challenge. Some samples from this dataset can be seen in the image below.

<table>
  <tr><td>
    <img src="https://tensorflow.org/images/fashion-mnist-sprite.png"
         alt="Fashion MNIST sprite"  width="600">
  </td></tr>
  <tr><td align="center">
    <b> </b> <a href="https://github.com/zalandoresearch/fashion-mnist">Fashion-MNIST samples</a> (by Zalando, MIT License).<br/>&nbsp;
  </td></tr>
</table>


We will download the dataset using TensorFlow Datasets. If you don't have it, install it with `pip install tensorflow_datasets`.

In [2]:
dataset, metadata = tfds.load('fashion_mnist', as_supervised=True, with_info=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

Then we take 60000 examples for training and 10000 for testing. As the dataset is fairly small, we will not hold out a validation set. We will not do data exploration here, since the notebook is already long enough.

In [3]:
num_train_examples = metadata.splits['train'].num_examples
num_test_examples = metadata.splits['test'].num_examples
print("Number of training examples: {}".format(num_train_examples))
print("Number of test examples:     {}".format(num_test_examples))

Number of training examples: 60000
Number of test examples:     10000


We need to normalize the images: rescaling the pixel intensities to the range 0-1 speeds up convergence during training. The images in this dataset have a single (grayscale) channel in which each pixel intensity is stored in 8 bits, so values range from 0 to 2^8-1=255 and we divide each image by 255.

In [4]:
def normalize(images, labels):
    
    """
    Normalizes images pixel intensity to range from 0-1

    """
    images = tf.cast(images, tf.float32)

    images /= 255
    return images, labels

# The map function applies the normalize function to each element in the train
# and test datasets
train_dataset =  train_dataset.map(normalize)
test_dataset  =  test_dataset.map(normalize)

Next we need a function that creates models with the different combinations of hyperparameters proposed by the Gaussian process optimizer. This function will be called in every iteration of the optimizer and needs to return a compiled model, ready to be trained and evaluated.

In [5]:
def create_model(learning_rate, num_dense_layers,
                 num_dense_nodes):
    """
    Hyper-parameters:
    learning_rate:     Learning-rate for the optimizer.
    num_dense_layers:  Number of dense layers.
    num_dense_nodes:   Number of nodes in each dense layer.
    
    Returns: compiled model
    """
    model = tf.keras.Sequential()

    model.add(tf.keras.layers.Conv2D(32, (3,3), padding='same', activation=tf.nn.relu,
                           input_shape=(28, 28, 1)))
    model.add(tf.keras.layers.MaxPooling2D((2, 2), strides=2))
    model.add(tf.keras.layers.Conv2D(64, (3,3), padding='same', activation=tf.nn.relu))
    model.add(tf.keras.layers.MaxPooling2D((2, 2), strides=2))
    model.add(tf.keras.layers.Flatten())
        
    for i in range(num_dense_layers):
        model.add(tf.keras.layers.Dense(num_dense_nodes, activation=tf.nn.relu))
        
    model.add(tf.keras.layers.Dense(10,  activation=tf.nn.softmax))
        
    optimizer = tf.keras.optimizers.Adam(lr=learning_rate)
        
    model.compile(optimizer=optimizer, 
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

    return model

Let's run one model to see how it goes

In [6]:
#create model with lr = 0.001, 1 dense layer with 12 nodes
model = create_model(0.001, 1, 12)

BATCH_SIZE = 512
#if you get crashes during training, try lowering BATCH_SIZE

train_dataset = train_dataset.repeat().shuffle(num_train_examples).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)

The training phase will use the 60000 training images 3 times (3 epochs), with ceil(60000/512) = 118 iterations in each epoch.
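
The step counts passed to `fit` and `evaluate` can be checked with a few lines of arithmetic. The lowercase variable names below are stand-ins used only in this sketch, so they do not overwrite the notebook's own variables.

```python
import math

# Copies of the values used in this notebook.
n_train = 60000
n_test = 10000
batch_size = 512

train_steps = math.ceil(n_train / batch_size)
test_steps = math.ceil(n_test / batch_size)
print(train_steps)  # 118
print(test_steps)   # 20

# The training dataset is repeated before it is batched, so every batch is
# full and one "epoch" of 118 steps draws slightly more than 60000 examples.
print(train_steps * batch_size)  # 60416
```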

In [7]:
model.fit(train_dataset, epochs=3, steps_per_epoch=math.ceil(num_train_examples/BATCH_SIZE))

Epoch 1/3
Epoch 2/3


Epoch 3/3




<tensorflow.python.keras.callbacks.History at 0x22807932f60>

Let's see how the model does on the test set

In [8]:
test_loss, test_accuracy = model.evaluate(test_dataset, steps=math.ceil(num_test_examples/BATCH_SIZE))



We got a fairly good model. Let's see if we can find a better one. Now on to the optimization part.

In [9]:
print("Acc : ",test_accuracy)

# Delete the Keras model with these hyper-parameters from memory so we can create a new one
# in the optimization part
del model
K.clear_session()
    

Acc :  0.85341796875


First we need to create a list with the hyperparameters and the ranges they are allowed to take.

In [10]:
# the learning rate is searched on a logarithmic scale, because it has a wide range
dim_learning_rate = Real(low=1e-5, high=1e-1, prior='log-uniform',name='learning_rate')
dim_num_dense_layers = Integer(low=1, high=5, name='num_dense_layers')
dim_num_dense_nodes = Integer(low=5, high=512, name='num_dense_nodes')

dimensions = [dim_learning_rate,
              dim_num_dense_layers,
              dim_num_dense_nodes]
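
To see why a logarithmic scale matters for the learning rate, the sketch below compares log-uniform and plain uniform sampling over the same range. The helper `sample_log_uniform` is a hypothetical function written for this illustration; it mimics the idea behind `prior='log-uniform'` but is not scikit-optimize's code.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log10(low) and log10(high)."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
log_samples = [sample_log_uniform(1e-5, 1e-1, rng) for _ in range(10000)]
lin_samples = [rng.uniform(1e-5, 1e-1) for _ in range(10000)]

# Fraction of samples that fall below 1e-3 in each scheme.
frac_log = sum(s < 1e-3 for s in log_samples) / len(log_samples)
frac_lin = sum(s < 1e-3 for s in lin_samples) / len(lin_samples)
print(f"below 1e-3: log-uniform {frac_log:.2f}, uniform {frac_lin:.2f}")
```

With plain uniform sampling almost every draw lands between 1e-3 and 1e-1, so small learning rates would hardly ever be tried. Sampling the exponent instead spreads the trials evenly across the orders of magnitude.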

Now we create the function that will evaluate our model and return its performance to the optimizer

In [11]:
@use_named_args(dimensions=dimensions)
def fitness(learning_rate, num_dense_layers,
            num_dense_nodes):
    """
    Hyper-parameters:
    learning_rate:     Learning-rate for the optimizer.
    num_dense_layers:  Number of dense layers.
    num_dense_nodes:   Number of nodes in each dense layer.
    """
    
    # we will be updating best_accuracy and best_params as the optimizer calls this function,
    # so the two will be global variables
    global best_accuracy
    global best_params
    
    # Create the neural network with these hyper-parameters.
    model = create_model(learning_rate=learning_rate,
                         num_dense_layers=num_dense_layers,
                         num_dense_nodes=num_dense_nodes)
    
    # for some reason, we have to load the dataset all over again; I suppose it is because
    # we close and clear the session, and the dataset is loaded by TensorFlow
    BATCH_SIZE = 512
    dataset, metadata = tfds.load('fashion_mnist', as_supervised=True, with_info=True)
    train_dataset, test_dataset = dataset['train'], dataset['test']
    train_dataset =  train_dataset.map(normalize)
    test_dataset  =  test_dataset.map(normalize)
    train_dataset = train_dataset.repeat().shuffle(num_train_examples).batch(BATCH_SIZE)
    test_dataset = test_dataset.batch(BATCH_SIZE)

    # Train network
    model.fit(train_dataset, epochs=3, steps_per_epoch=math.ceil(num_train_examples/BATCH_SIZE))
    
    #evaluate it
    test_loss, test_accuracy = model.evaluate(test_dataset, steps=math.ceil(num_test_examples/BATCH_SIZE))

    # if it did better than the best one so far, we save the parameters and accuracy
    if test_accuracy > best_accuracy:

        best_params = [learning_rate, num_dense_layers,num_dense_nodes]
        best_accuracy = test_accuracy

    # Delete the Keras model with these hyper-parameters from memory so we can create a new one
    # in the next iteration
    del model
    K.clear_session()
    
    # NOTE: Scikit-optimize does minimization so it tries to
    # find a set of hyper-parameters with the LOWEST fitness-value.
    # Because we are interested in the HIGHEST classification
    # accuracy, we need to negate this number so it can be minimized.
    return -test_accuracy
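
The sign flip at the end of `fitness` can be illustrated without training anything. In the sketch below, `toy_accuracy` and `random_minimize` are hypothetical stand-ins: the first replaces "train a model and return its accuracy" with a made-up curve, and the second is a deliberately simple minimizer, not the Gaussian process optimizer.

```python
import random

def toy_accuracy(learning_rate_exponent):
    """Made-up accuracy curve that peaks at exponent -3 (learning rate 1e-3)."""
    return 0.9 - 0.05 * (learning_rate_exponent + 3) ** 2

def random_minimize(func, low, high, n_calls, seed=0):
    """Tiny stand-in minimizer: tries random points and keeps the LOWEST value."""
    rng = random.Random(seed)
    best_x, best_y = None, float("inf")
    for _ in range(n_calls):
        x = rng.uniform(low, high)
        y = func(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y

# Minimizing the negated accuracy is the same as maximizing the accuracy.
best_x, best_neg_acc = random_minimize(lambda x: -toy_accuracy(x), -5, -1, n_calls=200)
best_acc = -best_neg_acc  # flip the sign back to read it as an accuracy
print(best_x, best_acc)
```

The minimizer only ever sees negated values, so the result has to be negated once more before it is reported as an accuracy.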

Everything is ready, so now we call our Gaussian process optimizer.

In [12]:
# We need to give the optimizer a starting point, so it will start with the
# hyperparameters we tried earlier
default_parameters = [0.001,1,12]

# We will say that the last model has the best hyperparams and acc
# and if the optimizer finds better ones, they will be updated
best_accuracy = test_accuracy
best_params = default_parameters

# gp_minimize is our glorious Gaussian process optimizer
# n_calls defaults to 100; it should not be too low, or the optimizer won't work well
search_result = gp_minimize(func=fitness,
                            dimensions=dimensions,
                            acq_func='EI', # Expected Improvement.
                            n_calls=60,
                            x0=default_parameters)


Epoch 1/3
Epoch 2/3


Epoch 3/3




In [13]:
print("For this data set, the best hyperparameters found are :")
print("Learn Rate : ",best_params[0])
print("Num of dense layers : ",best_params[1])
print("Num of dense nodes : ",best_params[2])
print("It got an accuracy of :",best_accuracy)

For this data set, the best hyperparameters found are :
Learn Rate :  0.007747151042603749
Num of dense layers :  1
Num of dense nodes :  296
It got an accuracy of : 0.9097886025905609
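
The object returned by `gp_minimize` also records every evaluation, so the progression of the best accuracy can be reconstructed afterwards. The sketch below uses a stand-in object: the attribute names (`x`, `fun`, `x_iters`, `func_vals`) follow scikit-optimize's result object, but the numbers are made up for illustration and are not results from this notebook.

```python
from types import SimpleNamespace

# Stand-in for the object returned by gp_minimize (made-up values).
fake_result = SimpleNamespace(
    x=[0.003, 1, 256],             # best hyperparameters found
    fun=-0.90,                     # lowest objective value (negated accuracy)
    x_iters=[[0.001, 1, 12], [0.05, 4, 100], [0.003, 1, 256], [0.0005, 2, 300]],
    func_vals=[-0.85, -0.10, -0.90, -0.88],
)

def running_best_accuracy(func_vals):
    """Best accuracy seen after each call (func_vals hold NEGATED accuracies)."""
    best, trace = float("-inf"), []
    for value in func_vals:
        best = max(best, -value)
        trace.append(best)
    return trace

trace = running_best_accuracy(fake_result.func_vals)
print(trace)

# Index of the evaluation with the lowest objective value.
best_index = min(range(len(fake_result.func_vals)),
                 key=fake_result.func_vals.__getitem__)
print(fake_result.x_iters[best_index], -fake_result.func_vals[best_index])
```

Applied to the real `search_result`, the same attributes should give the best hyperparameters and accuracy without relying on the `best_params` and `best_accuracy` globals. The `plot_convergence` function imported at the top of the notebook draws a similar running-best curve.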


## Conclusion

Tuning hyperparameters can be hard and time consuming, and it gets even harder if you don't have past experience to give you a feeling for what a CNN should look like to be good at classification tasks.
Here the GP optimizer ran for 60 iterations and found a better set of hyperparameters than my first guess. It confirmed that one dense layer is the best choice (maybe more layers would overfit easily), but it found 296 nodes to be a better option, which is more than 20 times my first guess. The learning rate was not so far from my first guess.
There is a good chance the optimizer would have found an even better set of hyperparameters had it run longer, as it was still exploring the space of possible options. This can be seen in the progression of accuracy across the runs it performs, and it is why you should use 100 or more iterations (n_calls).