Machine learning models are trained by solving an optimization problem that determines the parameters of a function that best fits the data. However, some settings, known as hyperparameters, can't be learned through this process, because they define the model structure and guide the optimization procedure itself. Tuning these hyperparameters is a crucial part of model development and can greatly impact model performance.
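
To make the distinction concrete, here is a minimal sketch that uses scikit-learn's `LogisticRegression` on synthetic data. The model choice, the value `C=0.5`, and the `X_demo`/`y_demo` names are assumptions made for illustration only and are not part of the lab's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration (not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# C (inverse regularization strength) is a hyperparameter: we choose it before training
model = LogisticRegression(C=0.5, max_iter=1000)

# The coefficients and intercept are parameters: they are learned by fit()
model.fit(X_demo, y_demo)

print("Hyperparameter C:", model.get_params()["C"])
print("Learned coefficients:", model.coef_.round(3))
print("Learned intercept:", model.intercept_.round(3))
```

Changing `C` and refitting yields different learned coefficients, which is why the choice of hyperparameters matters.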

Hyperparameter tuning is the task of finding the best hyperparameters for a given model and dataset, and it is a key stage in the machine learning workflow. It is also challenging, because no universal methodology applies to all scenarios: the best hyperparameters depend heavily on the dataset at hand, the chosen model architecture, and the specific learning task. Identifying a good set of hyperparameters is therefore not about finding a one-size-fits-all solution, but about combining intuition, systematic testing, and optimization techniques.

In this lab, we will delve into the fundamentals of hyperparameter tuning, exploring methods such as Grid Search, Cross Validation, and Bayesian Optimization. Let's dive in!

In [None]:
!pip install -q optuna

In [None]:
# Import standard libraries
import numpy as np

## Data Preparation

In this lab, we'll be using the HCC dataset to tune our model.

In [None]:
import pandas as pd
hcc = pd.read_csv('https://github.com/alexwolson/carte_workshop_2024/raw/main/data/HCC_all_ML_classification_test_annotated_frags_all_features_combined_4_tumors.csv.gz', compression='gzip')
# Work with a random 20% subsample of the rows (seeded for reproducibility)
hcc = hcc.sample(frac=0.2, random_state=42)

In [None]:
# Exercise: define the feature matrix X and the target vector y from `hcc`
X = ...
y = ...

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Grid Search

With our data ready, we can commence our exploration of hyperparameter tuning. Our initial approach will be grid search, a method that involves constructing a grid of potential hyperparameters and systematically examining the model's performance for each combination. To attain a reliable measure of model performance, grid search is often combined with K-Fold cross-validation. The Grid Search CV process involves:

1. Hyperparameter grid definition: Identify the hyperparameters to be tuned and designate possible values for each one. This forms a grid, where each point represents a unique set of hyperparameters.
2. Cross-validation across folds: For each unique set of hyperparameters, execute a K-Fold Cross-validation on your model and calculate the average error.
3. Hyperparameter selection: Choose the hyperparameters that give the best average performance (i.e., the lowest error or, equivalently when scoring by accuracy, the highest accuracy).
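
The three steps above can be sketched by hand with scikit-learn's `ParameterGrid` and `cross_val_score`. This is a minimal illustration on synthetic data, not the lab's solution: the grid values, the `X_demo`/`y_demo` names, and the small forest sizes are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

# Synthetic data, used only for illustration (not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Step 1: define the grid (values chosen arbitrarily for this demo)
param_grid = {"n_estimators": [10, 25], "max_depth": [2, 4]}

best_score, best_params = -1.0, None
for params in ParameterGrid(param_grid):
    model = RandomForestClassifier(random_state=0, **params)

    # Step 2: 5-fold cross-validation returns one accuracy value per fold
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_score = fold_scores.mean()

    # Step 3: keep the combination with the highest mean accuracy (lowest error)
    if mean_score > best_score:
        best_score, best_params = mean_score, params

print(f"Best parameters: {best_params}")
print(f"Best mean CV accuracy: {best_score:.3f}")
```

`GridSearchCV` wraps this same loop and adds conveniences such as parallel execution and refitting the best model on the full training set.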

Let's delve into the impact of hyperparameter tuning using a simple Random Forest Classifier in the following section.

In [None]:
# Import required libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score
from time import time

print("Setting up a Random Forest Classifier...")
clf = RandomForestClassifier()

print("Defining hyperparameters for Grid Search...")
hyperparameter_search = {
    # Choose some hyperparameters by looking up the documentation, e.g.
    "n_estimators": [50, 100, 200],
    # Add more
}

print("Setting up Grid Search with 5-fold Cross Validation...")
grid_search_cv = GridSearchCV(
    estimator=clf,
    param_grid=hyperparameter_search,
    scoring=make_scorer(accuracy_score, greater_is_better=True),
    verbose=1,
    n_jobs=-1,  # Use all CPU cores
    cv=5,
)

print("Running Grid Search (This may take a while)...")
start_time = time()
grid_search_cv.fit(X_train, y_train)
end_time = time()

print(f"Grid Search completed in {end_time - start_time:.0f} seconds")

print(f"Best Parameters: {grid_search_cv.best_params_}")
print(f"Best CV Accuracy: {grid_search_cv.best_score_ * 100:.2f}%")

print("Evaluating model on test data...")
clf = grid_search_cv.best_estimator_
test_predictions = clf.predict(X_test)
accuracy = accuracy_score(y_test, test_predictions)

print(f"Testing Accuracy: {accuracy * 100:.2f}%")

## Advanced Parameter Searching

While Grid Search proves effective in many scenarios, its major limitation is its computational demand. The number of models to be trained grows exponentially with the number of hyperparameters. For example, a grid of five hyperparameters with four candidate values each contains $4^5 = 1024$ combinations; with 5-fold cross-validation, that means 5,120 model fits. This approach might be feasible for simpler models like Random Forests or Linear/Logistic regression, but it becomes prohibitively time-consuming for deep learning models or when dealing with a large search space.
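
The arithmetic behind that count can be checked with a few lines of Python. The grid below is hypothetical: the hyperparameter names are real `RandomForestClassifier` arguments, but the candidate values are assumptions chosen only to give five hyperparameters with four values each.

```python
from itertools import product
from math import prod

# Hypothetical grid: five hyperparameters with four candidate values each
example_grid = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [3, 5, 7, 9],
    "min_samples_split": [2, 4, 8, 16],
    "min_samples_leaf": [1, 2, 4, 8],
    "max_features": [0.25, 0.5, 0.75, 1.0],
}
n_folds = 5

# The grid size is the product of the number of values per hyperparameter
n_combinations = prod(len(values) for values in example_grid.values())
n_fits = n_combinations * n_folds

# Enumerating the grid explicitly gives the same count
assert n_combinations == len(list(product(*example_grid.values())))

print(f"{n_combinations} combinations x {n_folds} folds = {n_fits} model fits")
# prints: 1024 combinations x 5 folds = 5120 model fits
```

Adding a sixth hyperparameter with four values would multiply the total by four again.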

## Introducing Bayesian Optimization

An alternative hyperparameter tuning technique, Bayesian Optimization, can help mitigate these computational concerns. It is a sequential, model-based optimization method for finding good hyperparameters for a given machine learning model. It fits a probabilistic surrogate model to the results of past evaluations, approximating the primary model's performance as a function of its hyperparameters, and uses that surrogate to identify promising regions to evaluate next.

One of the key advantages of Bayesian optimization is its efficient exploration of the hyperparameter space, which gives it an edge over exhaustive search methods like grid search. It selects each new configuration based on predictions from the surrogate model, so it typically reaches good hyperparameters with fewer evaluations.
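
To illustrate the idea, here is a minimal sketch of a Bayesian optimization loop over a single hyperparameter. It assumes a Gaussian process surrogate with a squared-exponential kernel and an upper-confidence-bound rule for choosing the next point. These are choices made for this example, and a given library may use a different surrogate or selection rule. The `toy_objective` function is a hypothetical stand-in for an expensive cross-validated score.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_on = rbf_kernel(x_obs, x_new)
    solved = np.linalg.solve(k_oo, k_on)
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_on * solved, axis=0)  # prior variance is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))


def toy_objective(x):
    """Hypothetical stand-in for a cross-validated score; its maximum is at x = 2."""
    return -(x - 2.0) ** 2


# Candidate hyperparameter values and two initial evaluations
candidates = np.linspace(0.0, 5.0, 101)
x_obs = np.array([0.0, 5.0])
y_obs = toy_objective(x_obs)

for _ in range(10):
    # Fit the surrogate to everything evaluated so far
    mean, std = gp_posterior(x_obs, y_obs, candidates)

    # Upper confidence bound: favor a high predicted score or high uncertainty
    ucb = mean + 2.0 * std
    x_next = candidates[np.argmax(ucb)]

    # Evaluate the real objective only at the selected point
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, toy_objective(x_next))

best_x = x_obs[np.argmax(y_obs)]
print(f"Best value found: x = {best_x:.2f} after {len(x_obs)} evaluations")
```

The loop evaluates the objective only 12 times in total, whereas an exhaustive search over the same 101 candidates would need 101 evaluations.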

Next, let's apply Bayesian Optimization to the parameter search for our Random Forest model. We'll use Optuna, a hyperparameter optimization library whose default sampler, the Tree-structured Parzen Estimator (TPE), follows this approach.

In [None]:
import optuna
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier


def print_callback(study, trial):
    # Print the trial number, the best value and parameters after each trial
    print(f"\nTrial {trial.number} finished.")
    print(f"Best value after trial {trial.number}: {study.best_value:.3f}")
    print(f"Best params after trial {trial.number}: {study.best_params}")


# Define a function that specifies the model, the search space and trains the model
# Optuna will try to optimize the hyperparameters to maximize the output of this function
def optuna_rf_function(trial):
    hyperparameters = {
        "max_depth": trial.suggest_int("max_depth", 3, 7),
        # Add more again
    }

    model = RandomForestClassifier(**hyperparameters)

    # Evaluate the model using cross-validation and calculate the mean test score
    cv_result = cross_validate(model, X_train, y_train, cv=5, scoring="accuracy")
    return cv_result["test_score"].mean()


# Create an Optuna study object
study = optuna.create_study(direction="maximize")

# Optimize the study using the sampler
study.optimize(
    optuna_rf_function,
    n_trials=100,
    callbacks=[print_callback],
    show_progress_bar=True,
    gc_after_trial=True,
)

In [None]:
from sklearn.metrics import accuracy_score


def print_accuracy(accuracy, dataset_name):
    print(f"{dataset_name} Accuracy: {accuracy * 100:.2f}%")


# Obtain the best parameters and their corresponding accuracy
best_params = study.best_params
best_accuracy = study.best_value

# Display the best parameters and their corresponding accuracy
print(f"Best Parameters: {best_params}")
print_accuracy(best_accuracy, "Best CV")

# Train the best model on the training data
best_model = RandomForestClassifier(**best_params)
best_model.fit(X_train, y_train)

# Make predictions on the test set and calculate the accuracy
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

# Display the test accuracy
print_accuracy(test_accuracy, "Testing")