[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/francisco-ortin/data-science-course/blob/main/deep-learning/activation/hyperparameter.ipynb)
[![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# Hyperparameter tuning/optimization

When defining ANNs, there are many hyperparameters to tweak. They cover not only the architecture (the number of layers, the number of neurons and the type of activation function in each layer) but also the training procedure (the initialization logic, the type of optimizer, its learning rate, the batch size, and more).

The hyperparameter tuning/optimization problem consists of finding the set of hyperparameters that gives a machine learning model its best performance on a given task or dataset. The goal is to search the hyperparameter space efficiently for the configuration that optimizes the chosen evaluation metric, for example by maximizing accuracy or minimizing loss. This process typically involves running multiple experiments with different hyperparameter configurations, training and evaluating the model for each one, and selecting the configuration that yields the best performance.

As mentioned in this course, we should not use the test set to tune the hyperparameters. Instead, we should split the training set into a training set and a validation set. The training set is used to train the model, while the validation set is used to evaluate the model's performance on unseen data and tune the hyperparameters. Once the hyperparameters are tuned and the model is trained, we can evaluate the model's performance on the test set to get an unbiased estimate of its performance on unseen data.
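As a minimal sketch of that split, the indices of the original training set can be shuffled and divided into two disjoint groups. The helper name `train_val_split` and the 20% validation fraction are assumptions made for illustration; they are not part of this notebook.

```python
import random

def train_val_split(n_samples, val_fraction=0.2, seed=42):
    """Return two disjoint lists of indices: (train, validation).

    Hypothetical helper for illustration; val_fraction is an assumed value.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_val = int(n_samples * val_fraction)
    # The first n_val shuffled indices form the validation set, the rest the training set
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = train_val_split(n_samples=1_000)
print(len(train_idx), len(val_idx))  # 800 200
```

The test set does not appear in this sketch at all: it is kept aside until the hyperparameters have been chosen.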

There are many different approaches to perform hyperparameter tuning/optimization:
- Manual search: manually selecting hyperparameters based on intuition, experience, or trial and error.
- Grid search: exhaustively evaluating all combinations of a predefined grid of hyperparameter values.
- Random search: randomly sampling hyperparameters from a predefined search space and evaluating them.
- Bayesian optimization: fitting a probabilistic model to the results observed so far and using it to guide the search towards promising hyperparameters.
- Evolutionary algorithms: evolving a population of hyperparameter configurations over time through selection, mutation and recombination.
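
To make the difference between grid search and random search concrete, here is a small sketch in plain Python. The function `toy_validation_score` is a made-up stand-in for "train the model and measure its validation accuracy", and the value ranges are assumptions chosen for illustration.

```python
import itertools
import math
import random

def toy_validation_score(learning_rate, n_neurons):
    """Made-up stand-in for 'train a model and return its validation accuracy'."""
    lr_penalty = (math.log10(learning_rate) + 3) ** 2  # best near 1e-3 (assumed)
    size_penalty = ((n_neurons - 128) / 128) ** 2       # best near 128 (assumed)
    return 1.0 - 0.1 * lr_penalty - 0.1 * size_penalty

def grid_search(learning_rates, neuron_counts):
    """Evaluate every combination of the listed values."""
    return [((lr, n), toy_validation_score(lr, n))
            for lr, n in itertools.product(learning_rates, neuron_counts)]

def random_search(n_trials, seed=0):
    """Evaluate n_trials configurations sampled at random from the search space."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-4, -2)  # log-uniform in [1e-4, 1e-2]
        n = rng.randint(16, 256)        # uniform integer in [16, 256]
        results.append(((lr, n), toy_validation_score(lr, n)))
    return results

grid_results = grid_search([1e-4, 1e-3, 1e-2], [16, 64, 256])
random_results = random_search(n_trials=len(grid_results))
best_grid = max(grid_results, key=lambda r: r[1])
best_random = max(random_results, key=lambda r: r[1])
print(f"Grid search:   best config {best_grid[0]}, score {best_grid[1]:.3f}")
print(f"Random search: best config {best_random[0]}, score {best_random[1]:.3f}")
```

Both strategies use the same budget of nine evaluations here. Grid search can only return values that appear in the grid, whereas random search may try any value in the ranges.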

The first three methods are easy to implement, but they typically need many trials to find a good configuration. In this notebook we will use [Bayesian optimization](https://en.wikipedia.org/wiki/Bayesian_optimization). We will use the `Keras Tuner` library, which provides a simple and efficient way to perform hyperparameter tuning/optimization in Keras models.

In [8]:
# make sure the required packages are installed
%pip install pandas numpy seaborn matplotlib scikit-learn keras tensorflow keras_tuner --quiet
# if running in colab, install the required packages and copy the necessary files
directory='data-science-course/deep-learning/activation'
if get_ipython().__class__.__module__.startswith('google.colab'):
    !git clone --depth 1 https://github.com/francisco-ortin/data-science-course.git  2>/dev/null
    !cp --update {directory}/*.py .
    !mkdir -p img data
    !cp {directory}/img/* img/.

import tensorflow as tf
import keras_tuner as kt

Note: you may need to restart the kernel to use updated packages.


## Data preparation

We use the [Fashion-MNIST](https://keras.io/api/datasets/fashion_mnist/) dataset. 

In [2]:
# We download the Fashion-MNIST dataset from Keras.
(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
# We take the first 1,000 instances for training and the last 1,000 for validation
N_TRAIN_INSTANCES, N_VAL_INSTANCES = 1_000, 1_000
X_train, y_train = X_train_full[:N_TRAIN_INSTANCES], y_train_full[:N_TRAIN_INSTANCES]
X_val, y_val = X_train_full[-N_VAL_INSTANCES:], y_train_full[-N_VAL_INSTANCES:]
CLASS_LABELS = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
                "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
# Show shapes of all the datasets
print(f"Shape of X_train = {X_train.shape} and y_train = {y_train.shape}.")
print(f"Shape of X_val = {X_val.shape} and y_val = {y_val.shape}.")
print(f"Shape of X_test = {X_test.shape} and y_test = {y_test.shape}.")
# We rescale the pixel intensities from [0, 255] to real numbers between 0 and 1
X_train, X_val, X_test = X_train / 255, X_val / 255, X_test / 255

Shape of X_train = (1000, 28, 28) and y_train = (1000,).
Shape of X_val = (1000, 28, 28) and y_val = (1000,).
Shape of X_test = (10000, 28, 28) and y_test = (10000,).


## Hyperparameter variable specification

To use the `Keras Tuner` library, we define a subclass of `kt.HyperModel` that specifies the model architecture and the hyperparameters to tune. For example, we can tune the number of hidden layers, the number of neurons in each hidden layer, the learning rate, and the optimizer. This is done in the `build` method of the `HyperModel` subclass. The `fit` method trains the model built with a given set of hyperparameters.

In [3]:
class MyClassificationHyperModel(kt.HyperModel):
    def build(self, hp):
        # Searches for different number of hidden layers between 0 and 8
        n_hidden = hp.Int("n_hidden", min_value=0, max_value=8)  # number of hidden layers
        n_neurons = hp.Int("n_neurons", min_value=16, max_value=256)  # number of neurons per layer
        learning_rate = hp.Float("learning_rate", min_value=1e-4, max_value=1e-2, sampling="log")  # learning rate
        optimizer = hp.Choice("optimizer", values=["sgd", "adam"])  # optimizer
        if optimizer == "sgd":
            optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
        else:
            optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
        # model creation
        model = tf.keras.Sequential()
        model.add(tf.keras.layers.Flatten())
        for _ in range(n_hidden):
            model.add(tf.keras.layers.Dense(n_neurons, activation="relu"))
        model.add(tf.keras.layers.Dense(len(CLASS_LABELS), activation="softmax"))
        model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
        return model

    def fit(self, hp, model, X, y, **kwargs):
        return model.fit(X, y, **kwargs)
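
The learning rate above is declared with `sampling="log"`. The sketch below illustrates the idea behind that choice with plain Python: it compares uniform sampling with log-uniform sampling over the same range. It only illustrates the concept and is not Keras Tuner's internal implementation; the helper names are hypothetical.

```python
import math
import random

LOW, HIGH = 1e-4, 1e-2  # same range as the learning_rate hyperparameter

def sample_linear(rng):
    """Uniform sampling: every value in [LOW, HIGH] is equally likely."""
    return rng.uniform(LOW, HIGH)

def sample_log(rng):
    """Log-uniform sampling: every order of magnitude is equally likely."""
    return 10 ** rng.uniform(math.log10(LOW), math.log10(HIGH))

rng = random.Random(0)
n_samples = 10_000
linear_samples = [sample_linear(rng) for _ in range(n_samples)]
log_samples = [sample_log(rng) for _ in range(n_samples)]

# Fraction of samples in the lower decade [1e-4, 1e-3)
frac_linear = sum(s < 1e-3 for s in linear_samples) / n_samples
frac_log = sum(s < 1e-3 for s in log_samples) / n_samples
print(f"Below 1e-3 with linear sampling: {frac_linear:.1%}")
print(f"Below 1e-3 with log sampling:    {frac_log:.1%}")
```

With uniform sampling, only about 9% of the samples are expected to fall below `1e-3`, so small learning rates are rarely tried. With log-uniform sampling, about half of them do.
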

## Hyperparameter search

`Keras Tuner` provides different ways to search for hyperparameters: `RandomSearch`, `GridSearch`, `BayesianOptimization`, `Hyperband` (which stops poorly performing configurations early and spends the saved budget on promising ones), and `SklearnTuner` (for tuning scikit-learn models). In this example, we use `BayesianOptimization`. It fits a probabilistic surrogate model (a Gaussian process) that approximates the objective function (here, the validation metric) from the hyperparameter configurations evaluated so far, and uses that model to choose the next configuration to try. This typically lets it explore the hyperparameter space more efficiently and find good configurations in fewer trials.
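
The following sketch shows the idea on a one-dimensional toy problem, using only the standard library. Everything in it is an assumption made for illustration: the objective `toy_objective`, the RBF kernel and its length scale, the upper-confidence-bound acquisition rule and the value of `kappa`. It is not how `Keras Tuner` is implemented.

```python
import math

def rbf(a, b, length_scale=0.2):
    """RBF kernel: similarity between two hyperparameter values."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-6):
    """Mean and standard deviation of a zero-mean GP at x_new, given observations."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 0.0))

def toy_objective(x):
    """Made-up 'validation score' with its maximum at x = 0.3."""
    return -(x - 0.3) ** 2

def bayesian_optimize(objective, n_trials=10, kappa=2.0):
    candidates = [i / 100 for i in range(101)]  # discretized search space [0, 1]
    xs = [0.0, 1.0]                             # two initial evaluations
    ys = [objective(x) for x in xs]
    for _ in range(n_trials - len(xs)):
        def ucb(x):
            # Acquisition: predicted score plus a bonus for uncertainty
            mean, std = gp_posterior(xs, ys, x)
            return mean + kappa * std
        x_next = max((c for c in candidates if c not in xs), key=ucb)
        xs.append(x_next)
        ys.append(objective(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs

best_x, best_y, tried = bayesian_optimize(toy_objective)
print(f"Best x found: {best_x:.2f} (score {best_y:.4f}) after {len(tried)} trials")
```

Each iteration refits the surrogate to all evaluations so far and then picks the candidate with the highest acquisition value. A larger `kappa` favors exploring uncertain regions; a smaller one favors refining the best region found so far.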

In [5]:
# first, we define the search method
bayesian_opt_tuner = kt.BayesianOptimization(
    MyClassificationHyperModel(),
    objective="val_accuracy",  # objective function of the optimization problem
    max_trials=10,  # max number of trials (hyperparameter configurations) to evaluate
    directory="hyperparams",  # output folder
    overwrite=False)  # does not delete the existing models in the output folder upon new execution
# second, we perform the search
bayesian_opt_tuner.search(X_train, y_train, epochs=10,
                          validation_data=(X_val, y_val),
                          callbacks=[tf.keras.callbacks.EarlyStopping()])

Trial 10 Complete [00h 00m 02s]
val_accuracy: 0.7590000033378601

Best val_accuracy So Far: 0.7879999876022339
Total elapsed time: 00h 00m 40s


## Best hyperparameters, performance and model retrieval

We can retrieve the best hyperparameters, performance metrics, and model from the tuner.

In [7]:
# get the first hyperparameter set in descending order of performance (the best ones)
hyperparameters = bayesian_opt_tuner.get_best_hyperparameters()[0]   
print(f"Best hyperparameters: {hyperparameters.values}.")

# get the best trial (the one with the best performance); then get the validation accuracy
best_trial = bayesian_opt_tuner.oracle.get_best_trials()[0]
val_accuracy = best_trial.metrics.get_last_value("val_accuracy")
print(f"Validation accuracy with the best hyperparameters: {val_accuracy:.4f}.")

# get the best model and evaluate it on the test set
best_model = bayesian_opt_tuner.get_best_models()[0]
evaluation_results = best_model.evaluate(X_test, y_test)
print(f"Best model's loss (test set): {evaluation_results[0]:.4f}. Best model's accuracy (test set): {evaluation_results[1]:.4f}.")

Best hyperparameters: {'n_hidden': 3, 'n_neurons': 198, 'learning_rate': 0.0002990356912538018, 'optimizer': 'adam'}.
Validation accuracy with the best hyperparameters: 0.7880.
Best model's loss (test set): 0.6144. Best model's accuracy (test set): 0.7810.


## ✨ Questions ✨ 

1. Do you think the search will find a better model if it keeps searching?
2. Considering the hyperparameter tuning process as a black box, which sets do we have to pass to it? Choose one answer:
a) train.
b) train and validation.
c) train, validation and test.
3. Why?
4. Do you take the best model out of the hyperparameter tuning and use it for inference?

### Answers

*Write your answers here.*
