# Bayesian Hyperparameter Tuning

In HW 2 we used grid and random search to find the best hyperparameters for our models. However, these methods are often considered inefficient (as many of you experienced firsthand)! In this assignment we will use Bayesian optimization to perform this task faster and, hopefully, more effectively. We will use a popular hyperparameter tuning library called `optuna` to accomplish this. Thankfully, there is a nice blog post on how to do this to help you get started. You can find it here: [optuna tutorial](https://medium.com/@becaye-balde/bayesian-sorcery-for-hyperparameter-optimization-using-optuna-1ee4517e89a). 

Additionally, you can head over to their website to see some additional examples: [https://optuna.org/#code_examples](https://optuna.org/#code_examples)

For this assignment, use `optuna` to optimize the hyperparameters of a random forest model that predicts heat capacity on the dataset from HW 2. Follow the same splitting procedure to ensure that materials aren't mixed between the training and testing sets. Perform the optimization for 10 trials and report the best hyperparameters and the R^2 and MAE on the training and testing sets. How does this compare to your results from HW 2?

In [92]:
# Import statements
import pandas as pd
from CBFV.composition import generate_features
import numpy as np
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GroupShuffleSplit
from sklearn.ensemble import RandomForestRegressor

RANDOM_SEED = 42

In [79]:

# Initialize the program:

    # open data file
df_full = pd.read_csv("data/cp_data_cleaned.csv")  # forward slash avoids the invalid "\c" escape and works on any OS

    # rename Cp column
df_full = df_full.rename(columns={'Cp': 'target'})

    # split the data into test and training sets, grouped by formula and obtain relevant indices 
gss = GroupShuffleSplit(test_size=0.1, n_splits=1, random_state=RANDOM_SEED)
train_index, test_index = next(gss.split(df_full, groups=df_full['formula']))
    
    # generate the train and test dataframes
df_train = df_full.iloc[train_index]
df_test = df_full.iloc[test_index]

    # Featurize the data from the formulae while keeping the T data by extending the features
    # names are such that scaling can be done later
X_train_unscaled, y_train, formulae_train, skipped_train = generate_features(df_train, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)
X_test_unscaled, y_test, formulae_test, skipped_test = generate_features(df_test, elem_prop='oliynyk', drop_duplicates=False, extend_features=True, sum_feat=True)

    # I am not planning on scaling the data, but can put that here if needed
    # for now, just takes the unscaled data
X_train = X_train_unscaled
X_test = X_test_unscaled

    # define the range for the hyperparameters we will optimize and calculate the starting values for reference
n_estimators_min = 2
n_estimators_max = 20
max_depth_min = 1
max_depth_max = 32
    # starting values are integer values between the min and max
n_estimators_start = round((n_estimators_min + n_estimators_max) / 2)
max_depth_start = round((max_depth_min + max_depth_max) / 2)


Processing Input Data:   0%|          | 0/4072 [00:00<?, ?it/s]

Processing Input Data: 100%|██████████| 4072/4072 [00:00<00:00, 10269.54it/s]


	Featurizing Compositions...


Assigning Features...: 100%|██████████| 4072/4072 [00:00<00:00, 8116.97it/s]


	Creating Pandas Objects...


Processing Input Data: 100%|██████████| 475/475 [00:00<00:00, 16381.44it/s]


	Featurizing Compositions...


Assigning Features...: 100%|██████████| 475/475 [00:00<00:00, 7896.91it/s]


	Creating Pandas Objects...


In [80]:
# objective(trial=None)
#
# taken from https://medium.com/@becaye-balde/bayesian-sorcery-for-hyperparameter-optimization-using-optuna-1ee4517e89a
# on 3/31/2024
#
# Modified to include trial=None so I can run the trial with the starting conditions
#  
# 
def objective(trial=None):
    """
    objective(trial)
    Define a search space for the hyperparameters `n_estimators` and `max_depth`
    of a random forest model, then train and evaluate it using cross validation.
    If no trial object is passed, it will run with starting values
    """
    if trial is not None:
        n_estimators = trial.suggest_int('n_estimators', n_estimators_min, n_estimators_max)
        max_depth = trial.suggest_int('max_depth', max_depth_min, max_depth_max, log=True)
    else:
        n_estimators = n_estimators_start
        max_depth = max_depth_start
       
    clf = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=RANDOM_SEED)
    
    # cross_val_score uses the estimator's default scorer, which is R^2 for regressors
    return cross_val_score(clf, X_train, y_train, n_jobs=-1, cv=3).mean()
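
One thing worth checking in the objective above: with `cv=3`, `cross_val_score` builds plain K-fold splits for a regressor, so rows of the same formula measured at different temperatures can land in both the training and validation folds. Below is a minimal sketch of a group-aware alternative, assuming `GroupKFold` is an acceptable substitute. The toy data and the names `groups`, `X`, `y`, and `model` are hypothetical stand-ins, not the notebook's variables.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

# Toy stand-in data: 6 hypothetical "formulas" with 5 temperature rows each.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(6), 5)
X = rng.normal(size=(30, 4))
y = X[:, 0] + 0.1 * rng.normal(size=30)

cv = GroupKFold(n_splits=3)

# Each fold keeps every group entirely in train or entirely in validation.
for train_idx, val_idx in cv.split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])

model = RandomForestRegressor(n_estimators=10, max_depth=4, random_state=42)

# Default scoring for a regressor is R^2, one score per fold.
scores = cross_val_score(model, X, y, groups=groups, cv=cv)
print(scores.mean())
```

In the notebook this would presumably become `cross_val_score(clf, X_train, y_train, groups=formulae_train, cv=GroupKFold(n_splits=3))`, assuming `formulae_train` lines up row for row with `X_train`.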


In [83]:
# code taken from https://medium.com/@becaye-balde/bayesian-sorcery-for-hyperparameter-optimization-using-optuna-1ee4517e89a
# simple implementation of an optuna study with 10 trials that results in the trial object containing the best one    
    # create a study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)

# get the best trial
trial = study.best_trial

[I 2024-03-31 14:09:32,541] A new study created in memory with name: no-name-d89bbf0d-29cc-4926-b6cc-b262e22557f9


[I 2024-03-31 14:09:34,487] Trial 0 finished with value: 0.9119932543880372 and parameters: {'n_estimators': 7, 'max_depth': 31}. Best is trial 0 with value: 0.9119932543880372.
[I 2024-03-31 14:09:36,528] Trial 1 finished with value: 0.9169474360437159 and parameters: {'n_estimators': 9, 'max_depth': 14}. Best is trial 1 with value: 0.9169474360437159.
[I 2024-03-31 14:09:40,555] Trial 2 finished with value: 0.914440158107228 and parameters: {'n_estimators': 18, 'max_depth': 20}. Best is trial 1 with value: 0.9169474360437159.
[I 2024-03-31 14:09:42,668] Trial 3 finished with value: 0.9169474360437159 and parameters: {'n_estimators': 9, 'max_depth': 14}. Best is trial 1 with value: 0.9169474360437159.
[I 2024-03-31 14:09:46,193] Trial 4 finished with value: 0.9095422079237011 and parameters: {'n_estimators': 20, 'max_depth': 10}. Best is trial 1 with value: 0.9169474360437159.
[I 2024-03-31 14:09:46,505] Trial 5 finished with value: 0.5378224745532519 and parameters: {'n_estimators': 

In [84]:
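# Print statements for results of optimization for comparison to starting parameters
# Note: the "accuracy" printed below is the mean 3-fold cross-validated R^2 returned by objective()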
print(f'Starting accuracy: {objective()}')
print(f'Starting hyperparameters: n_estimators: {n_estimators_start}, max_depth = {max_depth_start}')
print('Optimized accuracy: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

Starting accuracy: 0.9112494165349379
Starting hyperparameters: n_estimators: 11, max_depth = 16
Optimized accuracy: 0.9169474360437159
Best hyperparameters: {'n_estimators': 9, 'max_depth': 14}


In [91]:
# generate the RandomForestRegressor rfr using the best trial's hyperparameters 
# evaluate the model against the test data set

# Train predictive model with best hyperparameters:

    # Extract the best hyperparameters
best_params = trial.params
n_estimators = best_params['n_estimators']
max_depth = best_params['max_depth']

    # Create a new instance of RandomForestRegressor with the best hyperparameters
rfr = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=RANDOM_SEED)

    # Fit the regressor to the training data
rfr.fit(X_train, y_train)

    # Generate prediction data from test dataset
y_pred = rfr.predict(X_test)

    # Calculate metrics (MAE, R2, RMSE)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

    # print results
print("Random Forest Regressor")
print("R2:", r2)
print("MAE:", mae)
print("RMSE:",rmse)


Random Forest Regressor
R2: 0.922300940294313
MAE: 10.859762026141437
RMSE: 21.19079467330077
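
The assignment asks for metrics on both the training and testing sets, while the cell above reports the test set only. A minimal sketch of a helper that could be applied to either split follows. `report_metrics` is a hypothetical name introduced for illustration, and the toy values are not from the heat capacity data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def report_metrics(y_true, y_pred):
    """Return R^2, MAE, and RMSE for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mse)),
    }


# Toy check with hand-verifiable numbers.
toy = report_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(toy)
```

In the notebook, the same helper could presumably be called as `report_metrics(y_train, rfr.predict(X_train))` and `report_metrics(y_test, rfr.predict(X_test))` to get both sets of numbers.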


This dataset was used to train an XGBoost model in HW2 Q2.  
The XGB hyperparameters were optimized using a randomized search over 100 iterations.  
  
Metrics for that model (XGB model from HW2):  
R2: 0.9434960074849109  
MAE: 11.148077957876104  
RMSE: 18.070864511559254  
  
These are comparable to the results of the RandomForestRegressor with Bayesian hyperparameter tuning using Optuna with only 10 trials: the random forest has a slightly lower MAE (10.86 vs. 11.15) but a lower R2 (0.922 vs. 0.943) and a higher RMSE (21.19 vs. 18.07).  
  
When hyperparameter tuning was performed on the XGB model, it took several hours to run.  
The Optuna optimization ran much faster and gave comparable results.  
It was also much easier to implement.  