<a href="https://colab.research.google.com/github/adithyamauryakr/pytorchtutorials/blob/main/optuna-hpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

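* Grid search (e.g. scikit-learn's `GridSearchCV`) does not scale: the number of combinations grows multiplicatively with every hyperparameter added. Random search is simple but can miss better regions of the search space.
* Optuna addresses both problems with Bayesian-style search: it models the relationship between the tunable hyperparameters and the metric being optimized (here, accuracy), and uses that model to pick promising hyperparameters to try next.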
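
To make the scaling problem concrete, here is a small standard-library sketch that counts how many configurations an exhaustive grid has to evaluate as hyperparameters are added. The candidate values in `search_space` and the helper `grid_size` are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical candidate values for four random-forest hyperparameters.
search_space = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [5, 10, 15, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

def grid_size(space, names):
    """Count the configurations an exhaustive grid over `names` must evaluate."""
    return len(list(product(*(space[name] for name in names))))

names = list(search_space)
for k in range(1, len(names) + 1):
    # Each added hyperparameter multiplies the number of model fits.
    print(names[:k], grid_size(search_space, names[:k]))
```

With cross-validation, every configuration is fitted once per fold, so the real cost is the grid size times the number of folds.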

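## Key terms in Optuna

1. Study: an optimization session made up of multiple trials.
2. Trial: a single run of the objective function with one combination of hyperparameter values.
3. Trial parameters: the hyperparameter values suggested for a trial.
4. Objective function: the function Optuna optimizes; it takes a trial, trains a model with the suggested hyperparameters, and returns the score (here, accuracy).
5. Sampler: suggests which hyperparameter values to try next. Optuna's default sampler is the Tree-structured Parzen Estimator (TPE).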

In [1]:
!pip install optuna


In [2]:
import optuna
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Note: scikit-learn's built-in `load_diabetes` is a regression dataset,
# so the Pima Indians Diabetes classification dataset is loaded from an
# external CSV instead.
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
           'DiabetesPedigreeFunction', 'Age', 'Outcome']

# The CSV has no header row, so the column names are passed explicitly
df = pd.read_csv(url, names=columns)

df.head()

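| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |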


In [4]:
import numpy as np
# replace zero with NaN in columns where zero is not a valid value
cols_with_missing_vals=['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols_with_missing_vals]=df[cols_with_missing_vals].replace(0, np.nan)

# impute missing vals with the mean of the respective column
df.fillna(df.mean(), inplace=True)

print(df.isnull().sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
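
As a worked example of the imputation step, the sketch below applies the same zero-to-NaN-to-mean logic to a tiny hand-made frame. The values in `toy` are made up for illustration. Note that the column mean is computed over the non-missing entries only, and that `Pregnancies` is left alone because zero is a valid value there.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset; a zero in Glucose or BMI means "missing".
toy = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 0],
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

cols_with_missing_vals = ["Glucose", "BMI"]

# Step 1: mark invalid zeros as NaN, only in the affected columns.
toy[cols_with_missing_vals] = toy[cols_with_missing_vals].replace(0, np.nan)

# Step 2: fill each NaN with its column mean (NaN is skipped when averaging).
toy = toy.fillna(toy.mean())

print(toy)
```

Here the missing `Glucose` entry becomes the mean of 148, 183 and 89, i.e. 140.0, while the zero in `Pregnancies` is untouched.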


In [5]:
X = df.drop('Outcome', axis=1)
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(f'Training set shape: {X_train.shape}')
print(f'Test set shape: {X_test.shape}')


Training set shape: (614, 8)
Test set shape: (154, 8)


In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Define the objective function that Optuna will maximize
def objective(trial):
  # Suggest values for the hyperparameters
  n_estimators = trial.suggest_int('n_estimators', 50, 200)
  max_depth = trial.suggest_int('max_depth', 3, 20)

  # Create a RandomForestClassifier with the suggested hyperparameters
  model = RandomForestClassifier(
      n_estimators=n_estimators,
      max_depth=max_depth,
      random_state=42
  )
  # Mean 3-fold cross-validated accuracy on the training set
  score = cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy').mean()
  return score # return the accuracy score for Optuna to maximize

In [25]:
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=50)

[I 2025-03-20 14:42:51,753] A new study created in memory with name: no-name-5e93ec69-e703-4777-b5e4-dba40bbabdc3
[I 2025-03-20 14:42:52,274] Trial 0 finished with value: 0.7687151283277539 and parameters: {'n_estimators': 107, 'max_depth': 8}. Best is trial 0 with value: 0.7687151283277539.
[I 2025-03-20 14:42:52,971] Trial 1 finished with value: 0.7817471704128806 and parameters: {'n_estimators': 134, 'max_depth': 13}. Best is trial 1 with value: 0.7817471704128806.
[I 2025-03-20 14:42:53,894] Trial 2 finished with value: 0.7622030926191615 and parameters: {'n_estimators': 198, 'max_depth': 7}. Best is trial 1 with value: 0.7817471704128806.
[I 2025-03-20 14:42:54,638] Trial 3 finished with value: 0.7784712258887295 and parameters: {'n_estimators': 151, 'max_depth': 16}. Best is trial 1 with value: 0.7817471704128806.
[I 2025-03-20 14:42:55,331] Trial 4 finished with value: 0.7817391997449387 and parameters: {'n_estimators': 141, 'max_depth': 16}. Best is trial 1 with value: 0.781747

In [26]:
print(f'Best trial acc: {study.best_trial.value}')
print(f'Best hyperparameters: {study.best_trial.params}')

Best trial acc: 0.7882751474573569
Best hyperparameters: {'n_estimators': 110, 'max_depth': 13}


In [27]:
from sklearn.metrics import accuracy_score

best_model = RandomForestClassifier(**study.best_trial.params, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print(f'Test accuracy with best hyperparameters: {test_accuracy:.2f}')

Test accuracy with best hyperparameters: 0.75


## Samplers in Optuna

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Define the objective function that Optuna will maximize
def objective(trial):
  # Suggest values for the hyperparameters
  n_estimators = trial.suggest_int('n_estimators', 50, 200)
  max_depth = trial.suggest_int('max_depth', 3, 20)

  # Create a RandomForestClassifier with the suggested hyperparameters
  model = RandomForestClassifier(
      n_estimators=n_estimators,
      max_depth=max_depth,
      random_state=42
  )
  # Mean 3-fold cross-validated accuracy on the training set
  score = cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy').mean()
  return score # return the accuracy score for Optuna to maximize

In [13]:
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.RandomSampler())
study.optimize(objective, n_trials=50)

[I 2025-03-20 14:34:59,090] A new study created in memory with name: no-name-68a43549-bca8-4ff8-945d-611a1a923f63
[I 2025-03-20 14:35:00,514] Trial 0 finished with value: 0.7882751474573569 and parameters: {'n_estimators': 122, 'max_depth': 13}. Best is trial 0 with value: 0.7882751474573569.
[I 2025-03-20 14:35:01,212] Trial 1 finished with value: 0.771975131516021 and parameters: {'n_estimators': 80, 'max_depth': 20}. Best is trial 0 with value: 0.7882751474573569.
[I 2025-03-20 14:35:01,819] Trial 2 finished with value: 0.7752351347042882 and parameters: {'n_estimators': 73, 'max_depth': 17}. Best is trial 0 with value: 0.7882751474573569.
[I 2025-03-20 14:35:03,247] Trial 3 finished with value: 0.7670731707317073 and parameters: {'n_estimators': 193, 'max_depth': 6}. Best is trial 0 with value: 0.7882751474573569.
[I 2025-03-20 14:35:04,601] Trial 4 finished with value: 0.7801052128168341 and parameters: {'n_estimators': 180, 'max_depth': 14}. Best is trial 0 with value: 0.78827514

In [14]:
print(f'Best trial acc: {study.best_trial.value}')
print(f'Best hyperparameters: {study.best_trial.params}')

Best trial acc: 0.7882751474573569
Best hyperparameters: {'n_estimators': 122, 'max_depth': 13}


In [15]:
from sklearn.metrics import accuracy_score

best_model = RandomForestClassifier(**study.best_trial.params, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print(f'Test accuracy with best hyperparameters: {test_accuracy:.2f}')

Test accuracy with best hyperparameters: 0.76


In [16]:
# Grid search in Optuna: every combination of these values is evaluated
search_space = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [5, 10, 15, 20]
}

In [19]:
# The grid has 4 x 4 = 16 combinations, so n_trials=50 is only an upper bound
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.GridSampler(search_space))
study.optimize(objective, n_trials=50)

[I 2025-03-20 14:38:29,242] A new study created in memory with name: no-name-8092bc3b-db92-4dc2-bc95-d7ca8e634550
[I 2025-03-20 14:38:30,027] Trial 0 finished with value: 0.7654391838036028 and parameters: {'n_estimators': 100, 'max_depth': 5}. Best is trial 0 with value: 0.7654391838036028.
[I 2025-03-20 14:38:31,138] Trial 1 finished with value: 0.7735772357723577 and parameters: {'n_estimators': 150, 'max_depth': 10}. Best is trial 1 with value: 0.7735772357723577.
[I 2025-03-20 14:38:31,667] Trial 2 finished with value: 0.7687151283277539 and parameters: {'n_estimators': 50, 'max_depth': 15}. Best is trial 1 with value: 0.7735772357723577.
[I 2025-03-20 14:38:32,507] Trial 3 finished with value: 0.7752351347042882 and parameters: {'n_estimators': 100, 'max_depth': 15}. Best is trial 3 with value: 0.7752351347042882.
[I 2025-03-20 14:38:33,345] Trial 4 finished with value: 0.7703491152558585 and parameters: {'n_estimators': 100, 'max_depth': 20}. Best is trial 3 with value: 0.775235

In [20]:
print(f'Best trial acc: {study.best_trial.value}')
print(f'Best hyperparameters: {study.best_trial.params}')

Best trial acc: 0.7817391997449387
Best hyperparameters: {'n_estimators': 50, 'max_depth': 10}


In [21]:
from sklearn.metrics import accuracy_score

best_model = RandomForestClassifier(**study.best_trial.params, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print(f'Test accuracy with best hyperparameters: {test_accuracy:.2f}')

Test accuracy with best hyperparameters: 0.75


## Optuna visualisation

In [22]:
from optuna.visualization import plot_optimization_history, plot_parallel_coordinate, plot_slice, plot_contour, plot_param_importances


In [28]:
plot_optimization_history(study).show()

In [29]:
plot_parallel_coordinate(study).show()

In [30]:
plot_slice(study).show()

In [31]:
plot_contour(study).show()

In [32]:
plot_param_importances(study).show()

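## Define-by-run
Optuna builds the search space dynamically while the objective function runs, so the hyperparameters that get suggested can depend on earlier suggestions in the same trial.
* A single study can therefore select both the best algorithm and its hyperparameters, e.g. choosing among
`algo = [SVM, XGBoost, RF, LR]`
* Define a search space for each candidate algorithm and let Optuna find the best overall combination.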

In [33]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

In [38]:
def objective(trial):

  classifier_name = trial.suggest_categorical('classifier', ['SVM', 'RandomForest', 'GBC'])

  if classifier_name == 'SVM':
    #SVM HPs
    c = trial.suggest_float('C', 0.1, 100, log=True)
    kernel = trial.suggest_categorical('kernel', ['linear', 'rbf', 'poly', 'sigmoid'])
    gamma = trial.suggest_categorical('gamma', ['scale', 'auto'])
    model = SVC(C=c, kernel=kernel, gamma=gamma)

  elif classifier_name == 'RandomForest':
    #RF HPs
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 3, 20)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 20)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 20)
    bootstrap = trial.suggest_categorical('bootstrap', [True, False])

    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        bootstrap=bootstrap,
        random_state=42
    )
  elif classifier_name == 'GBC':
    #GBC HPs
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    learning_rate = trial.suggest_float('learning_rate', 0.01, 0.3, log=True)
    max_depth = trial.suggest_int('max_depth', 3, 20)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)

    model = GradientBoostingClassifier(
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=42
    )

  score = cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy').mean()
  return score

In [39]:
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=100)

[I 2025-03-20 14:59:30,850] A new study created in memory with name: no-name-7681a33f-113c-4ef2-bd3a-165443a23fc1
[I 2025-03-20 14:59:31,730] Trial 0 finished with value: 0.7589191774270684 and parameters: {'classifier': 'RandomForest', 'n_estimators': 149, 'max_depth': 8, 'min_samples_split': 10, 'min_samples_leaf': 16, 'bootstrap': False}. Best is trial 0 with value: 0.7589191774270684.
[I 2025-03-20 14:59:33,926] Trial 1 finished with value: 0.7622190339550454 and parameters: {'classifier': 'GBC', 'n_estimators': 255, 'learning_rate': 0.23122497688382293, 'max_depth': 20, 'min_samples_split': 10, 'min_samples_leaf': 2}. Best is trial 1 with value: 0.7622190339550454.
[I 2025-03-20 14:59:34,679] Trial 2 finished with value: 0.760561135023115 and parameters: {'classifier': 'RandomForest', 'n_estimators': 207, 'max_depth': 5, 'min_samples_split': 4, 'min_samples_leaf': 18, 'bootstrap': False}. Best is trial 1 with value: 0.7622190339550454.
[I 2025-03-20 14:59:36,811] Trial 3 finished 

In [40]:
best_trial = study.best_trial
print('Best trial parameters:', best_trial.params)
print('Best trial accuracy:', best_trial.value)

Best trial parameters: {'classifier': 'RandomForest', 'n_estimators': 231, 'max_depth': 17, 'min_samples_split': 8, 'min_samples_leaf': 6, 'bootstrap': True}
Best trial accuracy: 0.7784871672246134


In [42]:
study.trials_dataframe()

| | number | value | datetime_start | datetime_complete | duration | params_C | params_bootstrap | params_classifier | params_gamma | params_kernel | params_learning_rate | params_max_depth | params_min_samples_leaf | params_min_samples_split | params_n_estimators | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.758919 | 2025-03-20 14:59:30.852605 | 2025-03-20 14:59:31.730661 | 0 days 00:00:00.878056 | | False | RandomForest | | | | 8.0 | 16.0 | 10.0 | 149.0 | COMPLETE |
| 1 | 1 | 0.762219 | 2025-03-20 14:59:31.731692 | 2025-03-20 14:59:33.926359 | 0 days 00:00:02.194667 | | | GBC | | | 0.231225 | 20.0 | 2.0 | 10.0 | 255.0 | COMPLETE |
| 2 | 2 | 0.760561 | 2025-03-20 14:59:33.927594 | 2025-03-20 14:59:34.679578 | 0 days 00:00:00.751984 | | False | RandomForest | | | | 5.0 | 18.0 | 4.0 | 207.0 | COMPLETE |
| 3 | 3 | 0.762179 | 2025-03-20 14:59:34.680699 | 2025-03-20 14:59:36.810818 | 0 days 00:00:02.130119 | | | GBC | | | 0.019422 | 20.0 | 9.0 | 5.0 | 205.0 | COMPLETE |
| 4 | 4 | 0.767073 | 2025-03-20 14:59:36.811824 | 2025-03-20 14:59:36.836093 | 0 days 00:00:00.024269 | 0.445042 | | SVM | scale | linear | | | | | | COMPLETE |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 95 | 0.776861 | 2025-03-20 15:01:45.328088 | 2025-03-20 15:01:46.745395 | 0 days 00:00:01.417307 | | True | RandomForest | | | | 20.0 | 6.0 | 4.0 | 242.0 | COMPLETE |
| 96 | 96 | 0.741057 | 2025-03-20 15:01:46.746451 | 2025-03-20 15:01:46.781015 | 0 days 00:00:00.034564 | 0.160467 | | SVM | auto | rbf | | | | | | COMPLETE |
| 97 | 97 | 0.768707 | 2025-03-20 15:01:46.781775 | 2025-03-20 15:01:47.875198 | 0 days 00:00:01.093423 | | True | RandomForest | | | | 20.0 | 7.0 | 4.0 | 242.0 | COMPLETE |
| 98 | 98 | 0.773593 | 2025-03-20 15:01:47.876216 | 2025-03-20 15:01:49.089138 | 0 days 00:00:01.212922 | | True | RandomForest | | | | 19.0 | 5.0 | 3.0 | 257.0 | COMPLETE |


In [46]:
study.trials_dataframe()['params_classifier'].value_counts()

| params_classifier | count |
|---|---|
| RandomForest | 71 |
| SVM | 16 |
| GBC | 13 |


In [47]:
study.trials_dataframe().groupby('params_classifier')['value'].mean()

| params_classifier | value |
|---|---|
| GBC | 0.757824 |
| RandomForest | 0.770612 |
| SVM | 0.749173 |


In [49]:
plot_optimization_history(study).show()

In [50]:
plot_param_importances(study).show()