Machine learning models are typically trained by solving an optimization problem that determines the parameters of a function that best fits the data. However, some settings, known as hyperparameters, can't be learned through this process, because they define the model structure and control the optimization procedure itself. Tuning these hyperparameters is a crucial part of model development and can greatly impact model performance.

Hyperparameter tuning is the task of finding good hyperparameters for a given model and dataset, and it is a key stage in the machine learning workflow. It is challenging because no single methodology applies to all scenarios: the best hyperparameters depend heavily on the dataset at hand, the chosen model architecture, and the specific learning task. Identifying a good set of hyperparameters is therefore not about finding a one-size-fits-all solution, but about combining intuition, systematic testing, and optimization techniques.

In this lab, we will delve into the fundamentals of hyperparameter tuning, exploring methods such as Grid Search, Cross Validation, and Bayesian Optimization. Let's dive in!

In [1]:
# Check if we are running on Google Colab, or locally
import sys

IN_COLAB = "google.colab" in sys.modules

In [2]:
if not IN_COLAB:
    # Colab already has these installed, so install them only when running locally
    !pip install -q torch torchvision torchaudio numpy pandas matplotlib scikit-learn
# Optuna is installed in both environments
!pip install -q optuna

In [3]:
# Import standard libraries
import numpy as np
from time import time

# Import libraries for data handling and manipulation
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    cross_validate,
    StratifiedKFold,
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score

import optuna
from torch import nn
import torch
from torch.utils.data import TensorDataset, DataLoader

## Data Preparation

In this lab, we'll be using the Titanic dataset. To prepare our dataset for the subsequent learning tasks, we'll have to perform the following steps:

1. Discard irrelevant columns: Some columns may not contribute to the model's predictive performance and can be removed.
2. Adjust data types of certain columns: Some columns may have incorrect data types that need to be fixed for proper analysis.
3. Split the dataset into Training and Test sets: This ensures we have a separate dataset (Test set) to evaluate our model's performance.
4. Handle missing values: We need to impute or fill in missing values to avoid complications during the learning process.
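5. Scale the data: Standardize the features so that they are on a comparable scale. This matters most for gradient-based models such as the neural network later in this lab; tree-based models such as Random Forests are largely insensitive to feature scale.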
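
As a minimal sketch of the scaling step, the snippet below standardizes a single column in plain Python. The key point is that the mean and standard deviation are learned from the training values only and then reused for the test values. The fare values are made up for illustration, and `fit_standardizer` and `standardize` are hypothetical helper names, not part of scikit-learn.

```python
from statistics import mean, pstdev


def fit_standardizer(column):
    """Learn the mean and (population) standard deviation from training values."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0  # fall back to 1.0 for a constant column
    return mu, sigma


def standardize(column, mu, sigma):
    """Apply previously learned statistics to any column."""
    return [(value - mu) / sigma for value in column]


# Hypothetical fare values, for illustration only
train_fares = [7.25, 71.28, 8.05, 53.10]
test_fares = [8.46, 51.86]

# Statistics come from the training data only ...
mu, sigma = fit_standardizer(train_fares)

# ... and are then applied unchanged to both splits
train_scaled = standardize(train_fares, mu, sigma)
test_scaled = standardize(test_fares, mu, sigma)
```

After this, the training column has mean 0 and standard deviation 1, while the test column generally does not, which is expected: the test set must not influence the preprocessing statistics.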

We'll be carrying out all these steps in the following cells:

In [11]:
fetch_openml("titanic", version=1, as_frame=True, parser="auto").frame

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,,C,,,


In [4]:
# Fetch data
data = fetch_openml("titanic", version=1, as_frame=True, parser="auto").frame

# Select features
unimportant_cols = ["name", "ticket", "cabin", "embarked", "boat", "body", "home.dest"]
data = data.drop(unimportant_cols, axis=1)

# Encode categorical features and convert relevant columns to numeric data type
label_encoder = LabelEncoder()  # Encoder for categorical features
data["sex"] = label_encoder.fit_transform(data["sex"])
data["survived"] = data["survived"].astype("int")

# Split data into features and target
target_data = data["survived"]
feature_data = data.drop("survived", axis=1)

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(
    feature_data, target_data, test_size=0.25, random_state=0
)

# Handle missing values with imputation
imputer = SimpleImputer(
    missing_values=np.nan, strategy="most_frequent"
)  # Imputer for handling missing values
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Print information about the final dataset
print(
    f"There are {X_train.shape[0]} training data points and {X_test.shape[0]} testing points"
)
print(f"There are {X_train.shape[1]} features in the dataset")

There are 981 training data points and 328 testing points
There are 6 features in the dataset


## Grid Search

With our data ready, we can commence our exploration of hyperparameter tuning. Our initial approach will be grid search, a method that involves constructing a grid of potential hyperparameters and systematically examining the model's performance for each combination. To attain a reliable measure of model performance, grid search is often combined with K-Fold cross-validation. The Grid Search CV process involves:

1. Hyperparameter grid definition: Identify the hyperparameters to be tuned and designate possible values for each one. This forms a grid, where each point represents a unique set of hyperparameters.
2. Cross-validation across folds: For each unique set of hyperparameters, run K-Fold cross-validation on your model and calculate the average validation score (or error) across the folds.
3. Hyperparameter selection: Choose the hyperparameters that give the best performance (i.e., the highest score or, equivalently, the lowest error).
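
The three steps above can be sketched in plain Python. This is a toy illustration under stated assumptions, not how `GridSearchCV` is implemented: the model is a tiny 1-D nearest-neighbour classifier, the data are made up, the folds are simple interleaved splits, and all function names are hypothetical.

```python
from itertools import product


def knn_predict(train_x, train_y, x, k, weighted):
    """Predict a 0/1 label from the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    score = 0.0
    for i in nearest:
        weight = 1.0 / (abs(train_x[i] - x) + 1e-9) if weighted else 1.0
        score += weight if train_y[i] == 1 else -weight
    return 1 if score > 0 else 0


def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx); fold f validates on every n_splits-th sample."""
    for fold in range(n_splits):
        val_idx = [i for i in range(n_samples) if i % n_splits == fold]
        train_idx = [i for i in range(n_samples) if i % n_splits != fold]
        yield train_idx, val_idx


def cv_accuracy(xs, ys, k, weighted, n_splits):
    """Step 2: average validation accuracy over the folds."""
    fold_scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), n_splits):
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct = sum(
            knn_predict(train_x, train_y, xs[i], k, weighted) == ys[i]
            for i in val_idx
        )
        fold_scores.append(correct / len(val_idx))
    return sum(fold_scores) / len(fold_scores)


# Made-up 1-D data with noisy labels
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

# Step 1: define the grid; every combination of values is one candidate
param_grid = {"k": [1, 3, 5], "weighted": [False, True]}

results = []
for values in product(*param_grid.values()):
    params = dict(zip(param_grid, values))
    results.append((cv_accuracy(xs, ys, n_splits=3, **params), params))

# Step 3: keep the candidate with the best average score
best_score, best_params = max(results, key=lambda item: item[0])
print(best_params, round(best_score, 3))
```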

Let's delve into the impact of hyperparameter tuning using a simple Random Forest Classifier in the following section.

In [5]:
print("Setting up a Random Forest Classifier...")
clf = RandomForestClassifier()

print("Defining hyperparameters for Grid Search...")
hyperparameter_search = {
    "max_depth": [3, 4, 5, 6, 7],  # Max Depth of each individual tree
    "n_estimators": [50, 100, 150, 200],  # Number of trees generated
    "min_samples_leaf": [1, 2, 4, 8],  # Minimum number of samples found in a leaf
    "min_samples_split": [2, 4, 8, 16, 32],  # Minimum samples required for a split
}

print("Setting up Grid Search with 5-fold Cross Validation...")
grid_search_cv = GridSearchCV(
    estimator=clf,
    param_grid=hyperparameter_search,
    scoring=make_scorer(accuracy_score, greater_is_better=True),
    verbose=1,
    n_jobs=-1,  # Use all CPU cores
    cv=5,
)

print("Running Grid Search (This may take a while)...")
start_time = time()
grid_search_cv.fit(X_train, y_train)
end_time = time()

print(f"Grid Search completed in {end_time - start_time:.0f} seconds")

print(f"Best Parameters: {grid_search_cv.best_params_}")
print(f"Best CV Accuracy: {grid_search_cv.best_score_ * 100:.2f}%")

print("Evaluating model on test data...")
clf = grid_search_cv.best_estimator_
test_predictions = clf.predict(X_test)
accuracy = accuracy_score(y_test, test_predictions)

print(f"Testing Accuracy: {accuracy * 100:.2f}%")

Setting up a Random Forest Classifier...
Defining hyperparameters for Grid Search...
Setting up Grid Search with 5-fold Cross Validation...
Running Grid Search (This may take a while)...
Fitting 5 folds for each of 400 candidates, totalling 2000 fits
Grid Search completed in 59 seconds
Best Parameters: {'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 8, 'n_estimators': 150}
Best CV Accuracy: 81.65%
Evaluating model on test data...
Testing Accuracy: 81.10%


## Advanced Parameter Searching

While Grid Search is effective in many scenarios, its major limitation is its computational cost. The number of candidate configurations is the product of the number of values tried for each hyperparameter, so it grows exponentially with the number of hyperparameters. In our previous example, the grid contained 400 combinations ($5 \times 4 \times 4 \times 5$), and with 5-fold cross-validation we had to train a model 2000 times. This approach might be feasible for simpler models like Random Forests or Linear/Logistic regression, but it becomes prohibitively time-consuming for deep learning models or when dealing with a large search space.
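
To make the cost concrete, the short calculation below counts the candidates in the grid from the Grid Search cell and the resulting number of fits, matching the `Fitting 5 folds for each of 400 candidates, totalling 2000 fits` line in the output. The extra `max_features` entry is a hypothetical extension, added only to show how one more hyperparameter multiplies the work.

```python
from itertools import product
from math import prod

# Same grid as in the Grid Search cell
param_grid = {
    "max_depth": [3, 4, 5, 6, 7],
    "n_estimators": [50, 100, 150, 200],
    "min_samples_leaf": [1, 2, 4, 8],
    "min_samples_split": [2, 4, 8, 16, 32],
}
n_folds = 5

# Candidates = product of the number of values per hyperparameter
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 400 2000

# Hypothetical extension: one more hyperparameter with 3 values triples the work
extended_grid = dict(param_grid, max_features=["sqrt", "log2", None])
n_extended = sum(1 for _ in product(*extended_grid.values()))
print(n_extended, n_extended * n_folds)  # 1200 6000
```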

## Introducing Bayesian Optimization

An alternative hyperparameter tuning technique, Bayesian Optimization, can help mitigate these computational concerns. It's a sequential, model-based optimization method for finding good hyperparameters for a given machine learning model. The technique combines Bayesian inference and optimization to identify promising regions for evaluation, using a surrogate model to approximate the performance of our primary model as a function of its hyperparameters.

One of the key advantages of Bayesian optimization is its efficient exploration of the hyperparameter space, which gives it a significant edge over exhaustive search methods like grid search. It selects each new configuration based on predictions from the surrogate model, which is updated after every evaluation, and so it typically finds good hyperparameters with fewer evaluations.
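
To illustrate the idea of a surrogate guiding the search, here is a heavily simplified 1-D sketch in the spirit of the Tree-structured Parzen Estimator (TPE), the sampler Optuna uses by default. It is not Optuna's implementation: the fixed kernel width, the "best quarter" split, the candidate count, and all function names are assumptions made for illustration.

```python
import math
import random


def gaussian_kde(points, bandwidth):
    """Return a function estimating the density of `points` with Gaussian kernels."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density


def toy_tpe_maximize(objective, low, high, n_trials=40, n_startup=10, seed=0):
    """Toy 1-D sequential model-based search (assumes n_startup >= 2)."""
    rng = random.Random(seed)
    bandwidth = (high - low) / 10  # hypothetical fixed kernel width
    history = []  # (x, score) for every evaluated trial

    for trial in range(n_trials):
        if trial < n_startup:
            # Warm-up: sample at random until there is something to model
            x = rng.uniform(low, high)
        else:
            # Split past trials into the best quarter ("good") and the rest ("bad")
            ranked = sorted(history, key=lambda item: item[1], reverse=True)
            n_good = max(1, len(ranked) // 4)
            good = [x_past for x_past, _ in ranked[:n_good]]
            bad = [x_past for x_past, _ in ranked[n_good:]]
            good_density = gaussian_kde(good, bandwidth)
            bad_density = gaussian_kde(bad, bandwidth)

            # Propose candidates near good trials and keep the one that looks
            # most "good" relative to "bad" under the surrogate densities
            candidates = [
                min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                for _ in range(24)
            ]
            x = max(candidates, key=lambda c: good_density(c) / (bad_density(c) + 1e-12))

        history.append((x, objective(x)))

    return max(history, key=lambda item: item[1])


def toy_objective(x):
    # Stand-in for "cross-validated accuracy as a function of one hyperparameter"
    return -((x - 2.0) ** 2)


best_x, best_score = toy_tpe_maximize(toy_objective, low=-5.0, high=5.0)
print(f"best x = {best_x:.3f}, score = {best_score:.3f}")
```

Each trial after the warm-up uses everything observed so far to decide where to evaluate next, which is the sense in which the search is sequential and model-based.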

Next, let's apply Bayesian Optimization for parameter search in our Random Forest model. We'll use Optuna, a hyperparameter optimization library whose default sampler, the Tree-structured Parzen Estimator (TPE), follows this approach.

In [12]:
def print_callback(study, trial):
    # Print the trial number, the best value and parameters after each trial
    print(f"\nTrial {trial.number} finished.")
    print(f"Best value after trial {trial.number}: {study.best_value:.3f}")
    print(f"Best params after trial {trial.number}: {study.best_params}")


# Define a function that specifies the model, the search space and trains the model
# Optuna will try to optimize the hyperparameters to maximize the output of this function
def optuna_rf_function(trial):
    hyperparameters = {
        "max_depth": trial.suggest_int("max_depth", 3, 7),
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 8),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 32),
    }

    model = RandomForestClassifier(**hyperparameters)

    # Evaluate the model using cross-validation and calculate the mean test score
    cv_result = cross_validate(model, X_train, y_train, cv=5, scoring="accuracy")
    return cv_result["test_score"].mean()


# Create an Optuna study object
study = optuna.create_study(direction="maximize")

# Optimize the study using the sampler
study.optimize(
    optuna_rf_function,
    n_trials=100,
    callbacks=[print_callback],
    show_progress_bar=True,
    gc_after_trial=True,
)

[I 2023-08-28 15:15:41,745] A new study created in memory with name: no-name-807f3d20-710a-4232-865e-f13344c340a0


  0%|          | 0/100 [00:00<?, ?it/s]

[I 2023-08-28 15:15:42,875] Trial 0 finished with value: 0.8093597845229462 and parameters: {'max_depth': 7, 'n_estimators': 190, 'min_samples_leaf': 1, 'min_samples_split': 28}. Best is trial 0 with value: 0.8093597845229462.

Trial 0 finished.
Best value after trial 0: 0.809
Best params after trial 0: {'max_depth': 7, 'n_estimators': 190, 'min_samples_leaf': 1, 'min_samples_split': 28}
[I 2023-08-28 15:15:43,919] Trial 1 finished with value: 0.8103801926862116 and parameters: {'max_depth': 7, 'n_estimators': 157, 'min_samples_leaf': 5, 'min_samples_split': 20}. Best is trial 1 with value: 0.8103801926862116.

Trial 1 finished.
Best value after trial 1: 0.810
Best params after trial 1: {'max_depth': 7, 'n_estimators': 157, 'min_samples_leaf': 5, 'min_samples_split': 20}
[I 2023-08-28 15:15:45,006] Trial 2 finished with value: 0.8032321558064851 and parameters: {'max_depth': 4, 'n_estimators': 177, 'min_samples_leaf': 3, 'min_samples_split': 2}. Best is trial 1 with value: 0.8103801926

KeyboardInterrupt: 

In [7]:
def print_accuracy(accuracy, dataset_name):
    print(f"{dataset_name} Accuracy: {accuracy * 100:.2f}%")


# Obtain the best parameters and their corresponding accuracy
best_params = study.best_params
best_accuracy = study.best_value

# Display the best parameters and their corresponding accuracy
print(f"Best Parameters: {best_params}")
print_accuracy(best_accuracy, "Best CV")

# Train the best model on the training data
best_model = RandomForestClassifier(**best_params)
best_model.fit(X_train, y_train)

# Make predictions on the test set and calculate the accuracy
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

# Display the test accuracy
print_accuracy(test_accuracy, "Testing")

Best Parameters: {'max_depth': 4, 'n_estimators': 189, 'min_samples_leaf': 1, 'min_samples_split': 28}
Best CV Accuracy: 81.75%
Testing Accuracy: 80.79%


**Your Turn**

- How do the CV accuracies of the grid search and Optuna compare? They are very similar.
- What differences do you notice in their testing accuracies? They are also very similar.
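- Which method completed faster, and why do you think this was the case? Optuna needed far fewer model fits: 100 trials with 5-fold cross-validation is 500 fits, compared with 2000 fits for the grid search. This is because Optuna selects new configurations based on predictions from the surrogate model, so it can find good hyperparameters with fewer evaluations. Note that wall-clock time also depends on parallelism: the grid search ran with `n_jobs=-1`, whereas the Optuna trials above ran one at a time.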

## Hyperparameter Tuning for Neural Networks

One compelling use case for Bayesian Optimization is hyperparameter tuning for neural networks. Training these models can be time-consuming and their performance can be significantly affected by hyperparameters such as dropout rate, learning rate, weight decay, batch size, among others. Let's explore how we can optimize these parameters for a neural network.

Our hyperparameters for this network will include:

- The batch size utilized during training
- The learning rate for the training process
- The number of training epochs
- The size of the hidden layer in the network

The details of the network implementation are outside the scope of this lab, but the code is included below. Because our network class has `fit` and `predict` methods, we can reuse much of our code from the Random Forest example.

In [8]:
class TitanicMLP(nn.Module):
    """
    Simple two-layer network for the Titanic dataset.
    """

    def __init__(self, input_dim, hidden_dim, batch_size, learning_rate, epochs):
        super(TitanicMLP, self).__init__()

        # Parameters
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.epochs = epochs

        # Define the forward pass layers
        self.forward_pass = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

        # Define loss function and optimizer
        self.criterion = nn.BCEWithLogitsLoss()
        self.optimizer = torch.optim.SGD(self.parameters(), lr=self.learning_rate)

    def forward(self, x):
        """
        Perform the forward pass.
        """
        return self.forward_pass(x)

    def fit(self, X, y):
        """
        Train the model.
        """
        self.train()

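        # Create float32 tensors (np.asarray also handles pandas objects)
        X_tensor = torch.as_tensor(np.asarray(X), dtype=torch.float32)
        Y_tensor = torch.as_tensor(np.asarray(y), dtype=torch.float32)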

        # Create DataLoader
        train_dataset = TensorDataset(X_tensor, Y_tensor)
        train_loader = DataLoader(
            dataset=train_dataset, batch_size=self.batch_size, shuffle=True
        )

        # Training loop
        for epoch in range(self.epochs):
            for batch_idx, (features, target) in enumerate(train_loader):
                self.optimizer.zero_grad()  # reset gradients
                outputs = self.forward(features)  # forward pass
                loss = self.criterion(
                    torch.squeeze(outputs), torch.squeeze(target)
                )  # calculate loss
                loss.backward()  # backpropagation
                self.optimizer.step()  # update weights

            # Print the loss of the last batch every 10 epochs
            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch + 1}/{self.epochs}, Loss: {loss.item()}")

    def predict(self, X):
        """
        Predict the class of the input data X.
        """
        self.eval()  # switch to evaluation mode
        with torch.no_grad():
            X_tensor = torch.as_tensor(np.asarray(X), dtype=torch.float32)
            y_pred = torch.sigmoid(
                self.forward(X_tensor)
            )  # apply sigmoid to turn logits into probabilities
            y_pred = (
                torch.round(y_pred).squeeze(1).numpy()
            )  # round to 0 or 1 and convert to a 1-D numpy array
        return y_pred



**Your Turn**

- Currently, the hyperparameter search has some placeholder values passed in. Replace these fixed values with Optuna suggestions (`trial.suggest_*` calls) to search the space.
- Examine the scores produced for the various parameter configurations tested by Optuna. Do they exhibit similar performance, or do they significantly influence the model's effectiveness?

Answer: They exhibit similar performance - the Titanic dataset is not very complex, so the hyperparameters do not have a large impact on the model's effectiveness.

Remember, the purpose of hyperparameter tuning is to optimize the model's performance, and sometimes even a minor change in parameters can lead to substantial improvement. So, it's crucial to pay close attention to the scores and make adjustments as necessary.
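
When cross-validation is written as an explicit loop rather than delegated to `cross_validate`, one detail is easy to miss: each fold should start from a freshly constructed model, otherwise weights trained on earlier folds have already seen the current fold's validation data. The sketch below shows this pattern for any object with `fit` and `predict` methods. `MajorityClassModel` and `cross_val_accuracy` are hypothetical names, the data are made up, and the folds are simple interleaved splits rather than stratified ones.

```python
class MajorityClassModel:
    """Toy stand-in for any model exposing fit/predict (for illustration only)."""

    def __init__(self):
        self.label = None

    def fit(self, X, y):
        # "Training" just memorises the most common label
        self.label = 1 if 2 * sum(y) >= len(y) else 0

    def predict(self, X):
        return [self.label for _ in X]


def cross_val_accuracy(make_model, X, y, n_splits=5):
    """Mean validation accuracy, building a new model for every fold."""
    scores = []
    for fold in range(n_splits):
        val_idx = [i for i in range(len(X)) if i % n_splits == fold]
        train_idx = [i for i in range(len(X)) if i % n_splits != fold]

        model = make_model()  # fresh, untrained model for this fold
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = model.predict([X[i] for i in val_idx])

        correct = sum(pred == y[i] for pred, i in zip(preds, val_idx))
        scores.append(correct / len(val_idx))
    return sum(scores) / n_splits


# Made-up data: ten samples, two of them positive
X = [[float(i)] for i in range(10)]
y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]

score = cross_val_accuracy(MajorityClassModel, X, y, n_splits=5)
print(f"Mean CV accuracy: {score:.2f}")  # Mean CV accuracy: 0.80
```

Passing a constructor (or any zero-argument factory) instead of a model instance makes it hard to reuse a trained model across folds by accident.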


In [None]:
def optuna_mlp_function(trial):
    # Define the hyperparameters
    hyperparameters = {
        # Fixed value; registered as a single-choice categorical so that it
        # appears in study.best_params and the best model can be rebuilt from it
        "input_dim": trial.suggest_categorical("input_dim", [X_train.shape[1]]),
        "hidden_dim": trial.suggest_int("hidden_dim", 8, 64),
        "batch_size": trial.suggest_int("batch_size", 8, 64),  # Any integer in [8, 64]
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True),
        "epochs": trial.suggest_categorical("epochs", [10, 20, 30]),
    }

    scores = []
    kfold = StratifiedKFold(n_splits=5)

    # Perform cross-validation
    for train_index, val_index in kfold.split(X_train, y_train):
        X_train_fold = X_train[train_index]
        y_train_fold = y_train.values[train_index]
        X_val_fold = X_train[val_index]
        y_val_fold = y_train.values[val_index]

        # Instantiate a fresh model for each fold so that weights trained on
        # earlier folds (which saw this fold's validation data) do not leak in
        model = TitanicMLP(**hyperparameters)

        # Fit the model and predict on the validation data
        model.fit(X_train_fold, y_train_fold)
        y_pred = model.predict(X_val_fold)

        # Append the accuracy score to the list of scores
        scores.append(accuracy_score(y_val_fold, y_pred))

    # Return the mean cross-validation accuracy
    return np.mean(scores)


# Initialize the Optuna study and set the optimization direction
mlp_study = optuna.create_study(direction="maximize")

# Run the optimization
mlp_study.optimize(
    optuna_mlp_function,
    n_trials=100,
    callbacks=[print_callback],
    show_progress_bar=True,
    gc_after_trial=True,
)

In [10]:
best_params = mlp_study.best_params
best_accuracy = mlp_study.best_value

print(f"Best Parameters: {best_params}")
print_accuracy(best_accuracy, "Best CV")

best_model = TitanicMLP(**best_params)
best_model.fit(X_train, y_train)

y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print_accuracy(test_accuracy, "Testing")

Best Parameters: {'input_dim': 6, 'hidden_dim': 40, 'batch_size': 30, 'learning_rate': 0.06060997465428394, 'epochs': 30}
Best CV Accuracy: 81.45%
Epoch 10/30, Loss: 0.4147522449493408
Epoch 20/30, Loss: 0.46157708764076233
Epoch 30/30, Loss: 0.31265583634376526
Testing Accuracy: 81.71%


**Your Turn**

Besides the parameters we searched for above, can you think of other hyperparameters that could be tuned in a Neural Network?

Hint: There are several aspects of a Neural Network's architecture and training process that can be adjusted. These could include the type of optimizer used, the activation functions applied, the initialization method for weights, and much more. Think about the various components that make up a Neural Network and how adjusting them might impact the model's performance.

Answer: Examples include the type of optimizer, the activation functions, the weight initialization method, the number of hidden layers, the dropout rate, and the weight decay.