# Digit recognizer using Keras hyperparameter tuner 


### Content:
1. **Introduction**
2. **Preprocess the data**
    * 2.1. Load data
    * 2.2. Prepare data
    * 2.3. Format y_train
    * 2.4. Normalize, reshape and split data
    * 2.5. Draw some digits
3. **Data augmentation**
4. **Hyperparameter tuning**
    * 4.1. Build hypermodel
    * 4.2. Hyperband tuner
    * 4.3. Tuner search
    * 4.4. Best models
5. **Fit the model**
    * 5.1. Summaries of best models
    * 5.2. Select the model
    * 5.3. Callbacks
    * 5.4. Fit the model
6. **Plot and evaluate the results**
7. **Predict**
8. **Tensorboard**


# 1. Introduction
The idea of this notebook is to improve the "Hello world" of machine learning using [Keras Tuner](https://github.com/keras-team/keras-tuner). A tuner saves time and simplifies the process of choosing the right hyperparameters for a model. This relatively new Keras library is well suited for the job: it automatically tries out different hyperparameter combinations and, very conveniently, returns the requested number of best models at the end. These can then be processed further.

**Hyperparameter tuning is a time consuming process. This also applies to the automated version. So sit back and have a cup of coffee ☕️ and eat a cookie 🍪 - or simply adjust the tuner, epochs, etc. to your needs.**

To get an overview of the fitting results, check out the [TensorBoard](https://www.tensorflow.org/tensorboard/) (started in section 8).

Install keras tuner

In [None]:
pip install -U keras-tuner

In [None]:
import numpy as np 
import pandas as pd
import tensorflow as tf
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import Model
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Input, Dense, Conv2D, MaxPooling2D, Flatten, BatchNormalization, Dropout, AveragePooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import RMSprop, Adam
from tensorflow.keras.losses import CategoricalCrossentropy
import matplotlib.pyplot as plt
from tensorflow.keras.callbacks import ReduceLROnPlateau, TensorBoard, EarlyStopping
from keras_tuner import Hyperband  # current releases are imported as keras_tuner, not kerastuner
import datetime, os, tensorboard, random, IPython

%matplotlib inline
%load_ext tensorboard

# 2. Preprocess the data
#### The provided data needs to be processed so that it suits the model and can be handled more easily. This includes, for example, reshaping it to fit the `Conv2D` layer from Keras.

### 2.1. Load data
Load the `train` and the `test` data sets using the `tf.keras.datasets.mnist.load_data` method.

In [None]:
(X_train, y_train),(X_test, y_test) = tf.keras.datasets.mnist.load_data(path="mnist.npz")

### 2.2. Prepare data
`load_data` already returns the images and the labels separately, so there is no label column to drop. Check the shape of the raw training images.

In [None]:
print(f'Shape of the raw X_train: {X_train.shape}')

### 2.3. Format y_train
The labels need to be processed from scalars to one-hot vectors. 

In [None]:
y_train = to_categorical(y_train, num_classes=10)
print(f'Shape of the one-hot encoded y_train: {y_train.shape}')
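
For intuition, the conversion from scalar labels to one-hot vectors can be sketched with plain NumPy. This is only an illustration of the kind of array `to_categorical` produces, not its implementation; the helper `one_hot` and the sample labels are made up for this example.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Set the column that matches each label to 1.0.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

sample_labels = [5, 0, 4]           # hypothetical digit labels
encoded = one_hot(sample_labels)
print(encoded.shape)                # one row per label, one column per class
print(np.argmax(encoded, axis=1))   # argmax recovers the original labels
```

The `np.argmax` call is the same trick used in 2.5. to turn a one-hot row back into a digit for the plot titles.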

### 2.4. Normalize, reshape and split data
Normalizing the data helps the CNN to converge faster. The additional axis is needed for the `Conv2D` layer, which expects a channel dimension.

In [None]:
X_train = X_train / 255.0
X_test = X_test / 255.0

X_train = np.expand_dims(X_train, axis=3)
X_test = np.expand_dims(X_test, axis=3)
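
The effect of the two steps can be checked on a small stand-in array. The sketch below uses random values instead of the real MNIST images; `fake_images` is a hypothetical name used only here.

```python
import numpy as np

# Hypothetical stand-in for two 28x28 grayscale images with 8-bit pixel values.
fake_images = np.random.randint(0, 256, size=(2, 28, 28), dtype=np.uint8)

scaled = fake_images / 255.0                    # pixel values now lie in [0, 1]
with_channel = np.expand_dims(scaled, axis=3)   # append a channel axis of size 1

print(fake_images.shape, "->", with_channel.shape)
print(scaled.min() >= 0.0, scaled.max() <= 1.0)
```

The trailing axis of size 1 is the single grayscale channel, which matches the `Input((28,28,1))` used later in the hypermodel.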

Using `train_test_split` from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=split#sklearn.model_selection.train_test_split), the X_train and y_train data are split into training and validation data. The `test_size` is set to 10%.

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
print(f'Shape of the split X_train: {X_train.shape}')
print(f'Shape of the split X_valid: {X_valid.shape}')
print(f'Shape of the split y_train: {y_train.shape}')
print(f'Shape of the split y_valid: {y_valid.shape}')

### 2.5. Draw some digits
Show five random images from the training data and their corresponding labels (`y_train`).

In [None]:
fig, ax = plt.subplots(1, 5, figsize=(28, 28))
for idx, rand in enumerate(random.sample(range(len(X_train)), 5)):  # start at 0 so the first image can be drawn too
    ax[idx].imshow(X_train[rand][:,:,0])
    ax[idx].set_title(f"Label: {np.argmax(y_train[rand])}", fontdict={'size': 15})

Add 10 random images to the tensorboard.

In [None]:
images = [X_train[rand][:,:,0] for rand in random.sample(range(len(X_train)), 10)]
images = np.expand_dims(images, axis=3)
log_dir_td = "logs/train_data/" + datetime.datetime.now().strftime("%d/%m/%y - %H:%M")
file_writer = tf.summary.create_file_writer(log_dir_td)
with file_writer.as_default():
    tf.summary.image("10 random training data examples", images, max_outputs=10, step=0)

# 3. Data augmentation
Use `ImageDataGenerator` for real-time data augmentation.

In [None]:
datagen = ImageDataGenerator(
    featurewise_center=False,
    featurewise_std_normalization=False,
    rotation_range=10,
    zoom_range=0.1,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=False)

datagen.fit(X_train)
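
To get a feeling for how strong these augmentations are on 28x28 images, the settings can be translated into concrete ranges. The short calculation below is a sketch that assumes the documented meaning of the parameters: shift values below 1 are fractions of the image size, `zoom_range` as a float gives the interval `[1 - zoom_range, 1 + zoom_range]`, and `rotation_range` is given in degrees.

```python
# Settings copied from the ImageDataGenerator call above.
image_size = 28
rotation_range = 10
zoom_range = 0.1
width_shift_range = 0.1
height_shift_range = 0.1

# Fractions of the image size become pixel offsets.
max_shift_px = (width_shift_range * image_size, height_shift_range * image_size)
zoom_bounds = (1 - zoom_range, 1 + zoom_range)
rotation_bounds = (-rotation_range, rotation_range)

print(f"Max shift: {max_shift_px[0]:.1f} px wide, {max_shift_px[1]:.1f} px high")
print(f"Zoom factor between {zoom_bounds[0]:.1f} and {zoom_bounds[1]:.1f}")
print(f"Rotation between {rotation_bounds[0]} and {rotation_bounds[1]} degrees")
```

`horizontal_flip` stays off because a mirrored digit is usually not a valid digit anymore.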

# 4. Hyperparameter tuning
The `build_hypermodel` function expects the `hp` parameter. 
At the beginning we define various hyperparameters. Some provide a list of choices, others a range of values. 
Afterwards we use the functional API of TensorFlow to create a model using those hyperparameters.

### 4.1. Build hypermodel

The idea behind this implementation is to set up the hypermodel in such a way that as many hyperparameters as possible can be tuned. The more parameters are tuned, the longer the tuning process takes.

The model uses `Conv2D` layers with filter counts from 8 to 128 and kernel sizes of 3 or 5. The `Conv2D` layers use either `tanh` or `relu` as activation function. Each convolutional block ends with a pooling layer, either `MaxPooling2D` or `AveragePooling2D`, with the default pool size of 2. After this layer there is a `BatchNormalization` layer followed by a `Dropout` layer with a dropout rate of 0.0, 0.25 or 0.5.
After a `Flatten` layer there are 1 to 5 `Dense` layers, followed by a `Dropout` layer and the final `Dense` layer with the softmax activation and 10 units for categorical classification. The `epsilon` and the `learning_rate` of the Adam optimizer are tuned as well. Let's see the impact of this.

One improvement I want to implement is making more layers optional. So, this part can definitely be improved further. Happy to receive some ideas!

In [None]:
def build_hypermodel(hp):
    
    hp_dense_count = hp.Int('dense_count', min_value=1, max_value=5, step=1)
    hp_dropout_final = hp.Choice('dropout_final', values=[0.0, 0.25, 0.5])
    hp_learning_rate = hp.Choice('learning_rate', values =[0.01, 0.001, 0.0001])
    hp_adam_epsilon = hp.Choice('adam_epsilon', values=[1e-07, 1e-08])
    hp_conv_dropout = hp.Choice('dropout_conv', values=[0.0, 0.25, 0.5])
    hp_filter_1 = hp.Int('filter_1', min_value=8, max_value=64, step=16)
    hp_filter_2 = hp.Int('filter_2', min_value=64, max_value=128, step=16)
    hp_kernel_size_1 = hp.Choice('kernel_size_1', values=[3, 5])
    hp_kernel_size_2 = hp.Choice('kernel_size_2', values=[3, 5])
    hp_conv_activation = hp.Choice('conv_activation', values=['tanh', 'relu'])
    hp_pooling_type = hp.Choice('pooling_type', values=['avg', 'max'])
   
    inputs = Input((28,28,1))
    x = inputs

    # First convolutional block.
    x = Conv2D(hp_filter_1,(hp_kernel_size_1, hp_kernel_size_1), padding='same', activation=hp_conv_activation)(x)
    # The input is channels-last, so normalize over the last axis (the default) instead of axis=1.
    x = BatchNormalization()(x)
    x = Conv2D(hp_filter_1,(hp_kernel_size_1, hp_kernel_size_1), padding='same', activation=hp_conv_activation)(x)
    if hp_pooling_type == 'max':
        x = MaxPooling2D()(x)
    else:
        x = AveragePooling2D()(x)
    x = BatchNormalization()(x)
    x = Dropout(hp_conv_dropout)(x)

    # Second convolutional block.
    x = Conv2D(hp_filter_2,(hp_kernel_size_2, hp_kernel_size_2), padding='same', activation=hp_conv_activation)(x)
    x = BatchNormalization()(x)
    x = Conv2D(hp_filter_2,(hp_kernel_size_2, hp_kernel_size_2), padding='same', activation=hp_conv_activation)(x)
    if hp_pooling_type == 'max':
        x = MaxPooling2D()(x)
    else:
        x = AveragePooling2D()(x)
    x = BatchNormalization()(x)
    x = Dropout(hp_conv_dropout)(x)

    x = Flatten()(x)
    x = BatchNormalization()(x)
    
    for i in range(hp_dense_count):    
        hp_dense = hp.Int(f'dense_{i}', min_value=64, max_value=512, step=32)
        hp_dense_activation = hp.Choice(f'dense_activation_{i}', values=['tanh', 'relu'])
        x = Dense(hp_dense, activation=hp_dense_activation)(x)
    
    x = BatchNormalization()(x)
    x = Dropout(hp_dropout_final)(x)
    outputs = Dense(10, activation='softmax')(x)

    model = Model(inputs, outputs)
    model.compile(optimizer=Adam(learning_rate=hp_learning_rate, epsilon=hp_adam_epsilon), loss=CategoricalCrossentropy(), metrics=['accuracy'])
    return model
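
To see why tuning this hypermodel takes a while, it helps to count the candidate values. The sketch below mirrors the `hp.Int` and `hp.Choice` definitions above in plain Python. It assumes that `hp.Int` steps from `min_value` in increments of `step` up to and including `max_value`; the helper `int_values` is hypothetical and not part of Keras Tuner.

```python
from math import prod

def int_values(min_value, max_value, step):
    """Values min_value, min_value + step, ... that do not exceed max_value."""
    return list(range(min_value, max_value + 1, step))

filter_1 = int_values(8, 64, 16)
filter_2 = int_values(64, 128, 16)
dense_units = int_values(64, 512, 32)

# Number of options per hyperparameter defined at the top of build_hypermodel.
options = {
    'dense_count': 5, 'dropout_final': 3, 'learning_rate': 3, 'adam_epsilon': 2,
    'dropout_conv': 3, 'filter_1': len(filter_1), 'filter_2': len(filter_2),
    'kernel_size_1': 2, 'kernel_size_2': 2, 'conv_activation': 2, 'pooling_type': 2,
}
combinations = prod(options.values())

print(filter_1)
print(filter_2)
print(len(dense_units), "possible sizes per Dense layer")
print(combinations, "combinations before the per-layer Dense settings")
```

Under this assumption `filter_1` never reaches 64, because 8 plus multiples of 16 skips it. The count also excludes the per-layer `dense_{i}` and `dense_activation_{i}` choices, so the real search space is larger still, which is why a sampling tuner is used rather than an exhaustive grid.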

### 4.2. Hyperband tuner
Use Hyperband to search for the best model. This tuner uses early stopping to speed up the tuning process.

In [None]:
tuner = Hyperband(build_hypermodel, 
                  objective='val_accuracy',
                  executions_per_trial=1,
                  factor=3,
                  hyperband_iterations=2,
                  max_epochs=23,
                  project_name='cnn_keras_tuner')
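
Hyperband builds on successive halving: many configurations are trained for a few epochs, only the best fraction is kept, and the survivors are trained longer. The sketch below illustrates that idea with `factor=3` and `max_epochs=23` from the tuner above. It is not Keras Tuner's exact bracket schedule, and the starting number of 27 configurations is a made-up value for the example.

```python
def successive_halving(n_configs, start_epochs, max_epochs, factor=3):
    """List (configs, epochs) rounds: keep the top 1/factor, train them factor times longer."""
    rounds = []
    epochs = start_epochs
    while n_configs >= 1:
        rounds.append((n_configs, epochs))
        if epochs >= max_epochs:
            break
        n_configs //= factor                        # only the best configurations survive
        epochs = min(epochs * factor, max_epochs)   # survivors get a larger budget
    return rounds

schedule = successive_halving(n_configs=27, start_epochs=1, max_epochs=23, factor=3)
total_epochs = sum(n * e for n, e in schedule)

for n, e in schedule:
    print(f"{n:2d} configurations trained for {e:2d} epochs")
print("Total epochs trained:", total_epochs)
```

Training all 27 configurations for the full 23 epochs would cost 621 epochs, so discarding weak candidates early is where the time saving comes from.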

### 4.3. Tuner search
Searching for the best hyperparameters. This may take quite a while 😉. To speed it up, reduce `max_epochs` or `hyperband_iterations` (4.2.).

In [None]:
class ClearTrainingOutput(tf.keras.callbacks.Callback):
  def on_train_end(*args, **kwargs):
    IPython.display.clear_output(wait=True)

tuner.search(datagen.flow(X_train, y_train, batch_size=64), 
             epochs=15,
             validation_data=(X_valid, y_valid),
             callbacks=[ClearTrainingOutput(), EarlyStopping('val_accuracy', patience=1)])
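
With `patience=1` a trial is stopped as soon as `val_accuracy` fails to improve for a single epoch. The simplified sketch below shows that rule in plain Python; it ignores `min_delta` and other options of the real `EarlyStopping` callback, and the accuracy values are made up.

```python
def stopping_epoch(val_accuracies, patience=1):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, value in enumerate(val_accuracies, start=1):
        if value > best:
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

fake_val_accuracy = [0.95, 0.97, 0.969, 0.98]   # made-up validation accuracies
stopped_at = stopping_epoch(fake_val_accuracy, patience=1)
print("Stopped after epoch:", stopped_at)
```

In this made-up run the trial ends after the third epoch although the fourth would have improved again. That is the trade-off of a small patience: faster search, at the risk of discarding a configuration after one noisy epoch.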

### 4.4. Best models
Collect the 3 best models found by the Hyperband tuner and show a summary of the results.

In [None]:
best_models = tuner.get_best_models(num_models=3)
tuner.results_summary()  # prints the summary itself and returns None

# 5. Fit the model 
### 5.1. Summaries of best models
Shows the summary of each of the `best_models` selected by the hyperparameter tuning.

In [None]:
for model in best_models:
    model.summary()

### 5.2. Select the model
Select the most promising model (0, 1 or 2).

In [None]:
chosen_model = best_models[0]

### 5.3. Callbacks
The `tensorboard` callback stores the logs after each epoch so they can be shown in the board (see section 8). `reduce_on_plateau` halves the learning rate whenever `val_accuracy` has not improved for 3 epochs, down to the selected minimum `min_lr`.

In [None]:
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%d/%m/%y - %H:%M")
tensorboard = TensorBoard(log_dir=log_dir, histogram_freq=1)
reduce_on_plateau = ReduceLROnPlateau(monitor='val_accuracy', patience=3, factor=0.5, min_lr=0.0001)
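
How the learning rate develops under these settings can be simulated without training anything. The sketch below is a simplified model of the plateau rule with `patience=3`, `factor=0.5` and `min_lr=0.0001`; it ignores the `cooldown` and `min_delta` options of the real callback, and the accuracy values and the starting learning rate of 0.001 are made up.

```python
def simulate_lr(val_accuracies, initial_lr, patience=3, factor=0.5, min_lr=1e-4):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("-inf")
    wait = 0
    lrs = []
    for value in val_accuracies:
        if value > best:
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)   # never go below min_lr
                wait = 0
        lrs.append(lr)
    return lrs

# Made-up run: one improvement, then a long plateau.
fake_val_accuracy = [0.990, 0.991, 0.991, 0.991, 0.991, 0.991, 0.991, 0.991]
lrs = simulate_lr(fake_val_accuracy, initial_lr=0.001)
print(lrs)
```

Note that the tuner may pick `learning_rate=0.0001`, which already equals `min_lr`. In that case the callback cannot lower the rate any further.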

### 5.4. Fit the model
Fit the chosen model using the augmented data. 20 epochs seem to be a reasonable number, but fewer (~15) should also work fine. Check the TensorBoard or the plots for more information. The `reduce_on_plateau` and the `tensorboard` callbacks are set and get called after each epoch.

In [None]:
chosen_model.summary()
history = chosen_model.fit(datagen.flow(X_train, y_train, batch_size=64), 
                    epochs=20, 
                    validation_data=(X_valid, y_valid), 
                    callbacks=[reduce_on_plateau, tensorboard],
                    verbose=2)

# 6. Plot and evaluate the results

In [None]:
fig, axs = plt.subplots(2,1, figsize=(30, 20))

# Plot against 1-based epoch numbers so that the first epoch is not cut off by the x-limits.
n_epochs = len(history.history['accuracy'])
epochs = range(1, n_epochs + 1)

axs[0].plot(epochs, history.history['accuracy'], color='red')
axs[0].plot(epochs, history.history['val_accuracy'], color='green')
axs[0].legend(labels=['Training accuracy','Validation accuracy'])
axs[0].set_xlim(left=1, right=n_epochs)
axs[0].set_xticks(epochs)
axs[0].set_xlabel('Epochs')
axs[0].set_ylabel('Accuracy')
axs[0].grid()

axs[1].plot(epochs, history.history['loss'], color='red')
axs[1].plot(epochs, history.history['val_loss'], color='green')
axs[1].legend(labels=['Training loss','Validation loss'])
axs[1].set_xlim(left=1, right=n_epochs)
axs[1].set_xticks(epochs)
axs[1].set_xlabel('Epochs')
axs[1].set_ylabel('Loss')
axs[1].grid()

# 7. Predict

In [None]:
loss, acc = chosen_model.evaluate(X_valid, y_valid)  # evaluate the fitted model, not the leftover loop variable `model`
print(loss)
print(acc)

In [None]:
prediction = chosen_model.predict(X_valid)
print(f'Prediction shape: {prediction.shape}')
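
Each row of the prediction holds 10 softmax probabilities, one per digit. To turn them into digit labels, take the index of the largest value per row. The sketch below shows this on made-up arrays; `fake_prediction` and `fake_y_valid` are stand-ins for `prediction` and `y_valid`.

```python
import numpy as np

# Made-up softmax outputs for three images: 0.82 for one class, 0.02 for the rest.
fake_prediction = np.full((3, 10), 0.02)
fake_prediction[0, 2] = 0.82
fake_prediction[1, 7] = 0.82
fake_prediction[2, 0] = 0.82

# Made-up one-hot ground truth for the digits 2, 7 and 1.
fake_y_valid = np.zeros((3, 10))
fake_y_valid[[0, 1, 2], [2, 7, 1]] = 1.0

predicted_labels = np.argmax(fake_prediction, axis=1)
true_labels = np.argmax(fake_y_valid, axis=1)
accuracy = np.mean(predicted_labels == true_labels)

print("Predicted:", predicted_labels)
print("True:     ", true_labels)
print(f"Accuracy: {accuracy:.3f}")
```

Applied to the real `prediction` and `y_valid`, the same comparison also gives the indices of the misclassified images, which are often worth plotting.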

# 8. Tensorboard

In [None]:
%tensorboard --logdir logs