# Using Optuna for the Dengue LSTM Model  

https://medium.com/@walter_sperat/using-optuna-with-sklearn-the-right-way-part-1-6b4ad0ab2451  

The vast majority of what's below is copied from the website above.  Modifications are made to adapt it to an LSTM modelling effort which requires reshaping of the input arrays, and other adaptations are made to account for the numpy arrays (vs Pandas DataFrames) as inputs.  Comments are largely copied from the website to get adequate context, but some are my own additions for posterity.  

The main objective as of early April 2024 was to see if I could get an optuna pipeline working at all for my case.  The choices of scalers, imputers, etc., will be revisited to better reflect the data I'm using, probably alongside another look at the EDA notebook that precedes the modelling work.

Let's start by setting up to fetch the data.

In [73]:
import sys
import os
print("Before anything is modified here are the system path and the current working directory:")
print(sys.path)
print(os.getcwd())
os.chdir('C:\\Users\\ron_d\\lhl_capstone\\multivariate_timeseries_forecasting_dengue')
sys.path.append('C:\\Users\\ron_d\\lhl_capstone\\multivariate_timeseries_forecasting_dengue\\src')
print("\nNow that things have been modified, here are the system path and the current working directory:")
print(sys.path)
print(os.getcwd())

Before anything is modified here are the system path and the current working directory:
['c:\\Users\\ron_d\\lhl_capstone\\multivariate_timeseries_forecasting_dengue\\notebooks', 'c:\\Users\\ron_d\\anaconda3\\envs\\lhl_ds\\python310.zip', 'c:\\Users\\ron_d\\anaconda3\\envs\\lhl_ds\\DLLs', 'c:\\Users\\ron_d\\anaconda3\\envs\\lhl_ds\\lib', 'c:\\Users\\ron_d\\anaconda3\\envs\\lhl_ds', '', 'c:\\Users\\ron_d\\anaconda3\\envs\\lhl_ds\\lib\\site-packages', 'c:\\Users\\ron_d\\anaconda3\\envs\\lhl_ds\\lib\\site-packages\\win32', 'c:\\Users\\ron_d\\anaconda3\\envs\\lhl_ds\\lib\\site-packages\\win32\\lib', 'c:\\Users\\ron_d\\anaconda3\\envs\\lhl_ds\\lib\\site-packages\\Pythonwin', 'C:\\Users\\ron_d\\lhl_capstone\\multivariate_timeseries_forecasting_dengue\\src', 'C:\\Users\\ron_d\\lhl_capstone\\multivariate_timeseries_forecasting_dengue\\src', 'C:\\Users\\ron_d\\lhl_capstone\\multivariate_timeseries_forecasting_dengue\\src']
C:\Users\ron_d\lhl_capstone\multivariate_timeseries_forecasting_dengue


In [74]:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from timeseries_data_prep import series_to_supervised_mv, train_test_split_rows_reserved
from sklearn.preprocessing import FunctionTransformer

In [75]:
# Set print options to display more rows and columns
np.set_printoptions(threshold=np.inf, linewidth=np.inf)
pd.set_option('display.max_columns', None)

### Data Import and Preparation  
The data files reside a folder up and over from the notebook, so that's handled first to import the desired CSV file.

`series_to_supervised_mv` lags every column by each of the specified number of time steps (`n_weeks` here) to build the lagged input features, then appends the unlagged variables (target and features) as the final block of columns.
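
To make that column layout concrete, here is a minimal sketch of the lagging idea on a toy array. `lag_frame` is a hypothetical stand-in written for illustration; it is not the code in `timeseries_data_prep`, which isn't shown in this notebook, and it only assumes the behaviour visible in the `head()` output (oldest lag first, unlagged block last, incomplete rows dropped).

```python
import numpy as np
import pandas as pd

def lag_frame(values: np.ndarray, n_in: int) -> pd.DataFrame:
    """Hypothetical stand-in for series_to_supervised_mv: lag every column by 1..n_in steps."""
    df = pd.DataFrame(values)
    n_vars = df.shape[1]
    blocks, names = [], []
    # Oldest lag first: (t-n_in), ..., (t-1), then the unlagged block (t)
    for lag in range(n_in, -1, -1):
        blocks.append(df.shift(lag))
        suffix = f"(t-{lag})" if lag > 0 else "(t)"
        names += [f"var{j + 1}{suffix}" for j in range(n_vars)]
    framed = pd.concat(blocks, axis=1)
    framed.columns = names
    # The first n_in rows have no complete history, so they are dropped
    return framed.dropna()

toy = np.arange(12, dtype="float32").reshape(6, 2)  # 6 time steps, 2 variables
framed = lag_frame(toy, n_in=2)
print(framed)
```

With 2 variables and 2 lags the toy frame has 6 columns; the real call with 15 variables and 8 lags yields 8 × 15 lagged columns plus 15 unlagged ones.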

In [76]:
# Get the absolute path to the project directory
project_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))

# Adjust the path to your data file
data_file_path = os.path.join(project_dir, 'multivariate_timeseries_forecasting_dengue', 'data', 'sj_df.csv')

# Read the CSV from the path built above
sj_df = pd.read_csv(data_file_path, header=0, index_col=0)

# For input to series_to_supervised
values = sj_df.values

# ensure all data is float
values_to_pipe = values.astype('float32')

# specify the number of lag weeks
# optimal lags found in eda_multivariate vary from 0 to 14, with one group on the lower end, 
# and one group closer to 8.  We can try 4 and then try 8, we'll see what that does to performance.
n_weeks = 8 
n_features = 15 # 14 inputs, 1 output: all 15 of them are lagged to create the input features (so the target itself lagged is an input feature too)

# frame as supervised learning
reframed_for_pipe = series_to_supervised_mv(values_to_pipe, n_weeks, 1)

#print(reframed.head())
reframed_for_pipe.head()

Unnamed: 0,var1(t-8),var2(t-8),var3(t-8),var4(t-8),var5(t-8),var6(t-8),var7(t-8),var8(t-8),var9(t-8),var10(t-8),var11(t-8),var12(t-8),var13(t-8),var14(t-8),var15(t-8),var1(t-7),var2(t-7),var3(t-7),var4(t-7),var5(t-7),var6(t-7),var7(t-7),var8(t-7),var9(t-7),var10(t-7),var11(t-7),var12(t-7),var13(t-7),var14(t-7),var15(t-7),var1(t-6),var2(t-6),var3(t-6),var4(t-6),var5(t-6),var6(t-6),var7(t-6),var8(t-6),var9(t-6),var10(t-6),var11(t-6),var12(t-6),var13(t-6),var14(t-6),var15(t-6),var1(t-5),var2(t-5),var3(t-5),var4(t-5),var5(t-5),var6(t-5),var7(t-5),var8(t-5),var9(t-5),var10(t-5),var11(t-5),var12(t-5),var13(t-5),var14(t-5),var15(t-5),var1(t-4),var2(t-4),var3(t-4),var4(t-4),var5(t-4),var6(t-4),var7(t-4),var8(t-4),var9(t-4),var10(t-4),var11(t-4),var12(t-4),var13(t-4),var14(t-4),var15(t-4),var1(t-3),var2(t-3),var3(t-3),var4(t-3),var5(t-3),var6(t-3),var7(t-3),var8(t-3),var9(t-3),var10(t-3),var11(t-3),var12(t-3),var13(t-3),var14(t-3),var15(t-3),var1(t-2),var2(t-2),var3(t-2),var4(t-2),var5(t-2),var6(t-2),var7(t-2),var8(t-2),var9(t-2),var10(t-2),var11(t-2),var12(t-2),var13(t-2),var14(t-2),var15(t-2),var1(t-1),var2(t-1),var3(t-1),var4(t-1),var5(t-1),var6(t-1),var7(t-1),var8(t-1),var9(t-1),var10(t-1),var11(t-1),var12(t-1),var13(t-1),var14(t-1),var15(t-1),var1(t),var2(t),var3(t),var4(t),var5(t),var6(t),var7(t),var8(t),var9(t),var10(t),var11(t),var12(t),var13(t),var14(t),var15(t)
8,4.0,0.1226,0.103725,0.198483,0.177617,297.572845,297.742859,292.414276,299.799988,295.899994,32.0,73.365715,12.42,14.012857,2.628572,5.0,0.1699,0.142175,0.162357,0.155486,298.211426,298.442871,293.951416,300.899994,296.399994,17.940001,77.368568,22.82,15.372857,2.371428,4.0,0.03225,0.172967,0.1572,0.170843,298.781433,298.878571,295.434296,300.5,297.299988,26.1,82.052856,34.540001,16.848572,2.3,3.0,0.128633,0.245067,0.227557,0.235886,298.987152,299.228577,295.309998,301.399994,297.0,13.9,80.337143,15.36,16.672857,2.428571,6.0,0.1962,0.2622,0.2512,0.24734,299.518585,299.664276,295.821442,301.899994,297.5,12.2,80.459999,7.52,17.209999,3.014286,2.0,0.1962,0.17485,0.254314,0.181743,299.630005,299.764282,295.85144,302.399994,298.100006,26.49,79.891426,9.58,17.212856,2.1,4.0,0.1129,0.0928,0.205071,0.210271,299.207153,299.221436,295.865723,301.299988,297.700012,38.599998,82.0,3.48,17.234285,2.042857,5.0,0.0725,0.0725,0.151471,0.133029,299.591431,299.528564,296.531433,300.600006,298.399994,30.0,83.375717,151.119995,17.977142,1.571429,10.0,0.10245,0.146175,0.125571,0.1236,299.578583,299.557129,296.378571,302.100006,297.700012,37.509998,82.76857,19.32,17.790001,1.885714
9,5.0,0.1699,0.142175,0.162357,0.155486,298.211426,298.442871,293.951416,300.899994,296.399994,17.940001,77.368568,22.82,15.372857,2.371428,4.0,0.03225,0.172967,0.1572,0.170843,298.781433,298.878571,295.434296,300.5,297.299988,26.1,82.052856,34.540001,16.848572,2.3,3.0,0.128633,0.245067,0.227557,0.235886,298.987152,299.228577,295.309998,301.399994,297.0,13.9,80.337143,15.36,16.672857,2.428571,6.0,0.1962,0.2622,0.2512,0.24734,299.518585,299.664276,295.821442,301.899994,297.5,12.2,80.459999,7.52,17.209999,3.014286,2.0,0.1962,0.17485,0.254314,0.181743,299.630005,299.764282,295.85144,302.399994,298.100006,26.49,79.891426,9.58,17.212856,2.1,4.0,0.1129,0.0928,0.205071,0.210271,299.207153,299.221436,295.865723,301.299988,297.700012,38.599998,82.0,3.48,17.234285,2.042857,5.0,0.0725,0.0725,0.151471,0.133029,299.591431,299.528564,296.531433,300.600006,298.399994,30.0,83.375717,151.119995,17.977142,1.571429,10.0,0.10245,0.146175,0.125571,0.1236,299.578583,299.557129,296.378571,302.100006,297.700012,37.509998,82.76857,19.32,17.790001,1.885714,6.0,0.10245,0.12155,0.160683,0.202567,300.154297,300.278564,296.651428,302.299988,298.700012,28.4,81.281425,14.41,18.071428,2.014286
10,4.0,0.03225,0.172967,0.1572,0.170843,298.781433,298.878571,295.434296,300.5,297.299988,26.1,82.052856,34.540001,16.848572,2.3,3.0,0.128633,0.245067,0.227557,0.235886,298.987152,299.228577,295.309998,301.399994,297.0,13.9,80.337143,15.36,16.672857,2.428571,6.0,0.1962,0.2622,0.2512,0.24734,299.518585,299.664276,295.821442,301.899994,297.5,12.2,80.459999,7.52,17.209999,3.014286,2.0,0.1962,0.17485,0.254314,0.181743,299.630005,299.764282,295.85144,302.399994,298.100006,26.49,79.891426,9.58,17.212856,2.1,4.0,0.1129,0.0928,0.205071,0.210271,299.207153,299.221436,295.865723,301.299988,297.700012,38.599998,82.0,3.48,17.234285,2.042857,5.0,0.0725,0.0725,0.151471,0.133029,299.591431,299.528564,296.531433,300.600006,298.399994,30.0,83.375717,151.119995,17.977142,1.571429,10.0,0.10245,0.146175,0.125571,0.1236,299.578583,299.557129,296.378571,302.100006,297.700012,37.509998,82.76857,19.32,17.790001,1.885714,6.0,0.10245,0.12155,0.160683,0.202567,300.154297,300.278564,296.651428,302.299988,298.700012,28.4,81.281425,14.41,18.071428,2.014286,8.0,0.192875,0.08235,0.191943,0.152929,299.512848,299.592865,296.041443,301.799988,298.0,43.720001,81.46714,22.27,17.418571,2.157143
11,3.0,0.128633,0.245067,0.227557,0.235886,298.987152,299.228577,295.309998,301.399994,297.0,13.9,80.337143,15.36,16.672857,2.428571,6.0,0.1962,0.2622,0.2512,0.24734,299.518585,299.664276,295.821442,301.899994,297.5,12.2,80.459999,7.52,17.209999,3.014286,2.0,0.1962,0.17485,0.254314,0.181743,299.630005,299.764282,295.85144,302.399994,298.100006,26.49,79.891426,9.58,17.212856,2.1,4.0,0.1129,0.0928,0.205071,0.210271,299.207153,299.221436,295.865723,301.299988,297.700012,38.599998,82.0,3.48,17.234285,2.042857,5.0,0.0725,0.0725,0.151471,0.133029,299.591431,299.528564,296.531433,300.600006,298.399994,30.0,83.375717,151.119995,17.977142,1.571429,10.0,0.10245,0.146175,0.125571,0.1236,299.578583,299.557129,296.378571,302.100006,297.700012,37.509998,82.76857,19.32,17.790001,1.885714,6.0,0.10245,0.12155,0.160683,0.202567,300.154297,300.278564,296.651428,302.299988,298.700012,28.4,81.281425,14.41,18.071428,2.014286,8.0,0.192875,0.08235,0.191943,0.152929,299.512848,299.592865,296.041443,301.799988,298.0,43.720001,81.46714,22.27,17.418571,2.157143,2.0,0.2916,0.2118,0.3012,0.280667,299.667145,299.75,296.33429,302.0,297.299988,40.900002,82.144287,59.169998,17.737143,2.414286
12,6.0,0.1962,0.2622,0.2512,0.24734,299.518585,299.664276,295.821442,301.899994,297.5,12.2,80.459999,7.52,17.209999,3.014286,2.0,0.1962,0.17485,0.254314,0.181743,299.630005,299.764282,295.85144,302.399994,298.100006,26.49,79.891426,9.58,17.212856,2.1,4.0,0.1129,0.0928,0.205071,0.210271,299.207153,299.221436,295.865723,301.299988,297.700012,38.599998,82.0,3.48,17.234285,2.042857,5.0,0.0725,0.0725,0.151471,0.133029,299.591431,299.528564,296.531433,300.600006,298.399994,30.0,83.375717,151.119995,17.977142,1.571429,10.0,0.10245,0.146175,0.125571,0.1236,299.578583,299.557129,296.378571,302.100006,297.700012,37.509998,82.76857,19.32,17.790001,1.885714,6.0,0.10245,0.12155,0.160683,0.202567,300.154297,300.278564,296.651428,302.299988,298.700012,28.4,81.281425,14.41,18.071428,2.014286,8.0,0.192875,0.08235,0.191943,0.152929,299.512848,299.592865,296.041443,301.799988,298.0,43.720001,81.46714,22.27,17.418571,2.157143,2.0,0.2916,0.2118,0.3012,0.280667,299.667145,299.75,296.33429,302.0,297.299988,40.900002,82.144287,59.169998,17.737143,2.414286,6.0,0.150567,0.1717,0.2269,0.214557,299.558563,299.635712,295.959991,301.799988,297.100006,42.529999,80.742859,16.48,17.341429,2.071429


Below the data is split into a training and a test fraction, while keeping sequential integrity, by reserving rows (no shuffled data).

In [77]:
test_frac = 0.3
values = reframed_for_pipe.values
print(values)
train, test = train_test_split_rows_reserved(values,int(test_frac*sj_df.shape[0]))    
print(len(train))
print(len(test))

[[ 4.00000000e+00  1.22599997e-01  1.03725001e-01  1.98483303e-01  1.77616701e-01  2.97572845e+02  2.97742859e+02  2.92414276e+02  2.99799988e+02  2.95899994e+02  3.20000000e+01  7.33657150e+01  1.24200001e+01  1.40128574e+01  2.62857151e+00  5.00000000e+00  1.69900000e-01  1.42175004e-01  1.62357107e-01  1.55485705e-01  2.98211426e+02  2.98442871e+02  2.93951416e+02  3.00899994e+02  2.96399994e+02  1.79400005e+01  7.73685684e+01  2.28199997e+01  1.53728571e+01  2.37142849e+00  4.00000000e+00  3.22499983e-02  1.72966704e-01  1.57199994e-01  1.70842901e-01  2.98781433e+02  2.98878571e+02  2.95434296e+02  3.00500000e+02  2.97299988e+02  2.61000004e+01  8.20528564e+01  3.45400009e+01  1.68485718e+01  2.29999995e+00  3.00000000e+00  1.28633305e-01  2.45066702e-01  2.27557093e-01  2.35885695e-01  2.98987152e+02  2.99228577e+02  2.95309998e+02  3.01399994e+02  2.97000000e+02  1.38999996e+01  8.03371429e+01  1.53599997e+01  1.66728573e+01  2.42857146e+00  6.00000000e+00  1.96199998e-01  2.621
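
For reference, a row-reserving split can be sketched in a couple of lines. This is an assumption about what `train_test_split_rows_reserved` does (its source isn't shown here): the hypothetical `split_rows_reserved` below simply keeps the final `n_test` rows for testing and never shuffles.

```python
import numpy as np

def split_rows_reserved(values: np.ndarray, n_test: int):
    """Hypothetical helper: keep the last n_test rows for testing, no shuffling."""
    return values[:-n_test], values[-n_test:]

rows = np.arange(10).reshape(10, 1)  # 10 ordered time steps
train_part, test_part = split_rows_reserved(rows, n_test=3)
print(len(train_part), len(test_part))
```

Every training row precedes every test row, which is the property that matters for a time series.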

In [78]:
if isinstance(values, np.ndarray):
    print("The variable is a NumPy array.")
    dtype_of_array = values.dtype
    print("Data type of the array:", dtype_of_array)

The variable is a NumPy array.
Data type of the array: float32


The training and test sets are now separated into X and y, relying on how `series_to_supervised_mv` ordered the columns.

In [79]:
# split into input and outputs; this is where the undesired columns are dropped,
# as opposed to where they were dropped in the one time lag case
n_obs = n_weeks * n_features
train_X, train_y = train[:, :n_obs], train[:, -n_features] # if everything was done correctly when the dataframe was hauled in, the target at time t is 15 columns from the end
test_X, test_y = test[:, :n_obs], test[:, -n_features]
print(train_X.shape, len(train_X), train_y.shape)

(648, 120) 648 (648,)
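
The index arithmetic behind those slices can be checked on a dummy row. This sketch assumes, as the comment in the cell above does, that the target is the first of the 15 variables, so its unlagged copy sits 15 columns from the end.

```python
import numpy as np

n_weeks, n_features = 8, 15
n_obs = n_weeks * n_features            # 120 lagged input columns
n_cols = n_obs + n_features             # plus 15 unlagged columns at time t

# Label each column with its own index so the slices are easy to inspect
row = np.arange(n_cols).reshape(1, n_cols)
X, y = row[:, :n_obs], row[:, -n_features]

print(X.shape, y)
```

The selected target column is index 120, i.e. the first column after the lagged block.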


In [80]:
if isinstance(train_X, np.ndarray):
    print("The variable is a NumPy array.")

The variable is a NumPy array.


In [81]:
if isinstance(train_y, np.ndarray):
    print("The variable is a NumPy array.")

The variable is a NumPy array.


### Preprocessing  
Most real-world datasets have both numerical and categorical columns, as well as missing data, so let's start by instantiating an imputer:

In [82]:
from sklearn.impute import SimpleImputer
from optuna import Trial

def instantiate_numerical_simple_imputer(trial : Trial, fill_value : int=-1) -> SimpleImputer:
  strategy = trial.suggest_categorical(
    'numerical_strategy', ['mean', 'median', 'most_frequent', 'constant']
  )
  return SimpleImputer(strategy=strategy, fill_value=fill_value)

def instantiate_categorical_simple_imputer(trial : Trial, fill_value : str='missing') -> SimpleImputer:
  strategy = trial.suggest_categorical(
    'categorical_strategy', ['most_frequent', 'constant']
  )
  return SimpleImputer(strategy=strategy, fill_value=fill_value)

We'll be optimizing not only the model, but the preprocessing too.

If you recall my previous article (https://medium.com/@walter_sperat/using-optuna-the-wrong-way-e403f7c8e726), optuna trial objects choose values from the provided distributions through the suggest API, and evaluate the objective function with those values. In this particular case, the values correspond to the imputer's hyperparameters.

An interesting quirk to note is that the first argument of the suggest_x functions is the name optuna will use to track a particular variable (i.e. hyperparameter). Variable names therefore need to be distinct throughout the entire optimization. That's why I had to add the variable type prefix ('numerical_strategy' and 'categorical_strategy'): if both imputers used the same name with their different lists of choices, an error would be raised.

Now let's move on to the categorical encoding by using category-encoders (if anyone's wondering: https://feature-engine.trainindata.com/en/latest/api_doc/encoding/WoEEncoder.html):  

In [83]:
from category_encoders import WOEEncoder

def instantiate_woe_encoder(trial : Trial) -> WOEEncoder:
  params = {
    'sigma': trial.suggest_float('sigma', 0.001, 5),
    'regularization': trial.suggest_float('regularization', 0, 5),
    'randomized': trial.suggest_categorical('randomized', [True, False])
  }
  return WOEEncoder(**params)

Scaling numerical variables doesn't particularly improve the performance of tree-based models, but it's below for example's sake.  Scaling may improve the performance for other algorithms where distance-based measures and coefficients (as model parameters) are used.  

I'll re-examine if this is the scaler I want to use.  I'm just going off the blogger's example, but I seem to recall I went with MinMax for my LSTM regressor.

In [84]:
from sklearn.preprocessing import RobustScaler

def instantiate_robust_scaler(trial : Trial) -> RobustScaler:
  params = {
    'with_centering': trial.suggest_categorical(
      'with_centering', [True, False]
    ),
    'with_scaling': trial.suggest_categorical(
      'with_scaling', [True, False]
    )
  }
  return RobustScaler(**params)

### Model Specification  
Now we need the model itself. This particular implementation is for an ExtraTreesClassifier, but you can obviously abstract it away for basically any model you like:

In [85]:
# from sklearn.ensemble import ExtraTreesClassifier

# def instantiate_extra_trees(trial : Trial) -> ExtraTreesClassifier:
#   params = {
#     'n_estimators': trial.suggest_int('n_estimators', 50, 1000),
#     'max_depth': trial.suggest_int('max_depth', 1, 20),
#     'max_features': trial.suggest_float('max_features', 0, 1),
#     'bootstrap': trial.suggest_categorical('bootstrap', [True, False]),
#     'n_jobs': -1,
#     'random_state': 42
#   }
#   return ExtraTreesClassifier(**params)

Below, the LSTM architecture is defined.  In the lstm_trial_nlags_MV notebook the returned model had to be wrapped in a scikit-learn regressor (KerasRegressor) to work in a scikit-learn pipeline.  I'm not sure yet whether that's required for the optuna pipeline, so the wrapper is commented out for now.  May have to experiment.

In [86]:
def instantiate_lstm():
    model = Sequential()
    model.add(LSTM(50, input_shape=(n_weeks, n_features)))
    model.add(Dense(1))
    model.compile(loss='mae', optimizer='adam')
    
    return model

#regressor = KerasRegressor(build_fn=instantiate_lstm, epochs=50, batch_size=72, verbose=0) # re-insert if found necessary
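
The `input_shape=(n_weeks, n_features)` above means the flat 120-column rows have to become 3D arrays of shape (samples, time steps, features) before they reach the LSTM. A minimal NumPy sketch of that reshape, using fake values that encode their own position, and assuming the columns are ordered lag block by lag block as in the `head()` output:

```python
import numpy as np

n_weeks, n_features = 8, 15
n_samples = 3

# Fake flat input: each entry encodes its (lag block, feature) position as block*100 + feature
flat_row = np.array([b * 100 + f for b in range(n_weeks) for f in range(n_features)])
flat = np.tile(flat_row, (n_samples, 1))                    # shape (3, 120)

cube = flat.reshape((flat.shape[0], n_weeks, n_features))   # shape (3, 8, 15)

print(cube.shape)
print(cube[0, 0, :3])   # first lag block, first three features
print(cube[0, 7, :3])   # last lag block, first three features
```

Because NumPy reshapes in row-major order by default, each consecutive run of 15 columns lands on its own time step.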

### Pipeline Construction  
Create the pipeline instantiation functions:

In [87]:
from sklearn.pipeline import Pipeline

def instantiate_numerical_pipeline(trial : Trial) -> Pipeline:
  pipeline = Pipeline([
    ('imputer', instantiate_numerical_simple_imputer(trial)),
    ('scaler', instantiate_robust_scaler(trial))
  ])
  return pipeline

# formerly def instantiate_categorical_function(trial : Trial) -> Pipeline: # didn't match rest of code
def instantiate_categorical_pipeline(trial : Trial) -> Pipeline:
  pipeline = Pipeline([
    ('imputer', instantiate_categorical_simple_imputer(trial)),
    ('encoder', instantiate_woe_encoder(trial))
  ])
  return pipeline

Putting it all together...

In [88]:
from sklearn.compose import ColumnTransformer

# THE BELOW WAS THE ORIGINAL.  IT HAD NUMERICAL AND CATEGORICAL INPUTS, AND IT WAS USING PANDAS DATAFRAMES.  

# def instantiate_processor(trial : Trial, numerical_columns : list[str], categorical_columns : list[str]) -> ColumnTransformer:
  
#   numerical_pipeline = instantiate_numerical_pipeline(trial)
#   categorical_pipeline = instantiate_categorical_pipeline(trial)
  
#   processor = ColumnTransformer([
#     ('numerical_pipeline', numerical_pipeline, numerical_columns),
#     ('categorical_pipeline', categorical_pipeline, categorical_columns)
#   ])
  
#   return processor

# TO HANDLE MY ALL-NUMERICAL, NUMPY-ARRAY (VS PANDAS DATAFRAME) CASE...
# ATTEMPT # 1

# def instantiate_processor(trial: Trial, numerical_columns: Optional[list[str]] = None, categorical_columns: Optional[list[str]] = None) -> ColumnTransformer:
#     # Instantiate numerical pipeline
#     numerical_pipeline = instantiate_numerical_pipeline(trial)
    
#     # Instantiate categorical pipeline if needed
#     if categorical_columns:
#         categorical_pipeline = instantiate_categorical_pipeline(trial)
#     else:
#         categorical_pipeline = 'passthrough'  # No transformation needed for categorical columns
    
#     # Create transformers for numerical and categorical columns
#     transformers = []
#     if numerical_columns is not None:
#         transformers.append(('numerical_pipeline', numerical_pipeline, numerical_columns))
#     else:
#         transformers.append(('numerical_pipeline', numerical_pipeline, slice(None)))  # Apply to all columns
    
#     if categorical_columns:
#         transformers.append(('categorical_pipeline', categorical_pipeline, categorical_columns))
    
#     # Create ColumnTransformer with specified transformers
#     processor = ColumnTransformer(transformers)
    
#     return processor

# ATTEMPT # 2

def instantiate_processor(trial, numerical_columns=None, categorical_columns=None):
    # Instantiate numerical pipeline
    numerical_pipeline = instantiate_numerical_pipeline(trial)
    
    # Instantiate categorical pipeline if needed
    if categorical_columns:
        categorical_pipeline = instantiate_categorical_pipeline(trial)
    else:
        categorical_pipeline = 'passthrough'  # No transformation needed for categorical columns
    
    # Create transformers for numerical and categorical columns
    transformers = []
    if numerical_columns is not None:
        transformers.append(('numerical_pipeline', numerical_pipeline, numerical_columns))
    else:
        # Apply numerical pipeline to all columns
        num_columns_range = list(range(train_X.shape[1]))  # Adjust this based on the shape of your input data
        transformers.append(('numerical_pipeline', numerical_pipeline, num_columns_range))
    
    if categorical_columns:
        transformers.append(('categorical_pipeline', categorical_pipeline, categorical_columns))
    
    # Create ColumnTransformer with specified transformers
    processor = ColumnTransformer(transformers)
    
    return processor


def instantiate_model(trial : Trial, numerical_columns : list[str], categorical_columns : list[str]) -> Pipeline:
  
  processor = instantiate_processor(trial, numerical_columns, categorical_columns)
  #extra_trees = instantiate_extra_trees(trial)
  lstm_model = instantiate_lstm()
  
  # model = Pipeline([
  #   ('processor', processor),
  #   ('extra_trees', extra_trees)
  # ])

  # MAY HAVE TO INSERT THE RESHAPE VIA FUNCTION TRANSFORMER (see below); see lstm_trial_nlags_MV.ipynb
#   model = Pipeline([
#     ('processor', processor),
#     ('lstm_engine', lstm_model)
#   ])

#   model.fit(train_X, train_y) # no?

#   return model

# WITH RESHAPE INSERTED; TEST IT - must occur *AFTER* processor (numerical pipeline where imputation and scaling occur) but before lstm_engine (where the data is input to the LSTM)
  model = Pipeline([
    ('processor', processor),
    ('reshape', FunctionTransformer(lambda X: X.reshape((X.shape[0], n_weeks, n_features)))),
    ('lstm_engine', lstm_model)
  ])

  model.fit(train_X, train_y) # no?  Wasn't in the original code within this block.

  return model

One of the greatest things about this way of working is that if you want to try more models, you only need to modify the final function, the one that instantiates the predictor itself, because everything else is already predefined.

"But Walter", you might say "isn't this basically the same as with sklearn's grid and random searches? Aren't we adding unnecessary extra steps by introducing optuna?". Well… yes and no.

With what we've seen so far, the truth is that the only thing we have done is add a new library and learned its API (and it also gives us access to optuna's samplers). However, as we'll see in following articles, optuna's flexibility will make some apparently hard things extremely comfortable, and I don't know about you, but I'm all for comfort.

### Objective Function  

So, with all of that out of the way, let's finally define the function optuna will interact with:

In [89]:
from typing import Optional
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import roc_auc_score, make_scorer
from pandas import DataFrame, Series
import numpy as np
from sklearn.metrics import mean_absolute_error

# I'm feeding it numpy arrays for train_X and train_y; modify code accordingly.

# def objective(trial : Trial, X : DataFrame, y : np.ndarray | Series, numerical_columns : Optional[list[str]]=None, categorical_columns : Optional[list[str]]=None, random_state : int=42) -> float:
def objective(trial : Trial, X : np.ndarray, y : np.ndarray | Series, numerical_columns : Optional[list[str]]=None, categorical_columns : Optional[list[str]]=None, random_state : int=42) -> float:

  # when working with Numpy arrays, you can't use select_dtypes as it's not an attribute, and the data's numerical anyway

  # if numerical_columns is None:
  #   numerical_columns = [
  #     *X.select_dtypes(exclude=['object', 'category']).columns
  #   ]
  
  # if categorical_columns is None:
  #   categorical_columns = [
  #     *X.select_dtypes(include=['object', 'category']).columns
  #   ]

  # Also when working with Numpy arrays, using the below won't get you valid column names or indices, so the code is modified to just grab all of the columns below
#   if numerical_columns is None:
#         numerical_columns = list(range(X.shape[1]))

  # numerical_columns is left as None here; instantiate_processor then applies
  # the numerical pipeline to every column of the array

  # TROUBLESHOOTING "ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers or all strings, or boolean mask is allowed"
  print("Numerical Columns:", numerical_columns)  # Print numerical_columns for debugging
  print("Categorical Columns:", categorical_columns)

  model = instantiate_model(trial, numerical_columns, categorical_columns)
  
  # It's one thing to split the training data into various training and validation chunks, while moving around where the validation chunk is, but there are 2 potential problems
  # with the below: 1. shuffle (not with timeseries), 2. Will the number of folds allow for a long enough contiguous training chunk?  The EDA might say more about how much of a 
  # straight run of timestamped data you need to model.

  # I'm beginning to wonder about sklearn.model_selection.TimeSeriesSplit

  #kf = KFold(n_splits=5, shuffle=True, random_state=random_state)
  kf = KFold(n_splits=5, shuffle=False)
  
  #roc_auc_scorer = make_scorer(roc_auc_score, needs_proba=True)
  mae_scorer = make_scorer(mean_absolute_error)
  scores = cross_val_score(model, X, y, scoring=mae_scorer, cv=kf)
  
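  # Note: min(mean, median) is the cautious choice for a higher-is-better score such as ROC AUC;
  # for MAE (lower is better) it is the optimistic one, and max(...) would be the cautious counterpart.
  return np.min([np.mean(scores), np.median(scores)])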

Function arguments have nothing particularly weird: an optuna trial (don't worry, optuna will deal with it for us), a pandas dataframe named X (this can be suitably changed to a numpy array, a datatable table, or whatever you want, as long as you pass appropriate datatypes to the sklearn pipeline), a numpy array/pandas series with the target values, optionally the names of both numerical and categorical columns in their respective lists, and a random_state for the cross-validation.

For this implementation it's important that, if you decide to pass the column names, X's columns are the same as the passed names, no more, no less. This last point can be changed with a little moving-around, but if you're just copy-pasting it can save you a couple of headaches. Something else that's different with respect to the objective functions you may find online (including in optuna's documentation) is that this function is pure: ALL of its arguments must be explicitly passed, meaning that it doesn't depend on the state of the program at execution time. In contrast, other examples tend to make the objective function get the data from elsewhere. This is not necessarily incorrect, and works just as well, but if you made a mistake somewhere, it makes debugging a nightmare. Additionally, pure functions can be easily unit tested.

With that out of the way, let's get back to the code. First of all, as the column names are optional, the function checks if they were passed or not. If they weren't (i.e. if they are None), it will use pandas' interface to automatically detect the corresponding columns. Again, this implementation uses pandas, but you can modify the code to do it with your table library of choice.

In [91]:
# if numerical_columns is None:
#   numerical_columns = [
#     *X.select_dtypes(exclude=['object', 'category']).columns
#   ]
  
# if categorical_columns is None:
#   categorical_columns = [
#    *X.select_dtypes(include=['object', 'category']).columns
#   ]

Ok, now it gets interesting.

In [92]:
# model = instantiate_model(trial, numerical_columns, categorical_columns)

That simple line will build the entire pipeline, choosing the hyperparameters from the distributions we provided. The truth is, if you understood the general logic of the instantiation functions, there's really not a lot to explain here. That's the good thing about optuna's suggest API: it allows us to encapsulate extremely complicated logic within a couple of lines of code.

The next section is some standard-issue sklearn code:

In [93]:
# kf = KFold(n_splits=5, shuffle=True, random_state=random_state)
# roc_auc_scorer = make_scorer(roc_auc_score, needs_proba=True)
# scores = cross_val_score(model, X, y, scoring=roc_auc_scorer, cv=kf)

A correctly instantiated KFold object will allow us to reproducibly separate between train and test splits for cross-validation.
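
Since the data in this notebook is a time series, one alternative worth trying (flagged in the comments of the objective function) is `sklearn.model_selection.TimeSeriesSplit`, which only ever validates on rows that come after the training rows. A minimal sketch on a dummy array; the sizes are arbitrary and this has not been wired into the objective above:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(20).reshape(20, 1)  # 20 ordered time steps
tscv = TimeSeriesSplit(n_splits=4)

folds = list(tscv.split(X_demo))
for train_idx, val_idx in folds:
    print(f"train 0..{train_idx[-1]}  ->  validate {val_idx[0]}..{val_idx[-1]}")
```

The trade-off is that early folds train on much less data than later ones, so the number of splits needs to leave a long enough first training window.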

The following line uses the make_scorer function to turn roc_auc_score into a scorer that cross_val_score can use (for more information on this I highly recommend the sklearn documentation). I arbitrarily chose to optimize for ROC AUC in this particular instance, but you can obviously choose whichever sklearn metric you want, or even create a custom one.
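
When the metric is an error such as MAE, the sign convention matters. The sketch below, on synthetic data with a plain `LinearRegression` (both chosen only for illustration), compares `make_scorer(mean_absolute_error)`, which returns raw positive errors, with the built-in `'neg_mean_absolute_error'` scorer, which returns the same values negated so that higher is better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

cv = KFold(n_splits=3, shuffle=False)
model = LinearRegression()

# Raw MAE: positive numbers, lower is better
raw = cross_val_score(model, X_demo, y_demo, cv=cv, scoring=make_scorer(mean_absolute_error))

# Built-in negated MAE: non-positive numbers, higher is better
neg = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='neg_mean_absolute_error')

print(raw)
print(neg)
```

Whichever form is returned from the objective, the study's direction has to match it: minimize raw MAE, or maximize negated MAE.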

cross_val_score is a nifty little function that conveniently encapsulates performing a reproducible cross-validation on an arbitrary dataset with an arbitrary model and an arbitrary scoring function. The return value of this function will be a numpy array with the scores (the ROC AUC scores in this case) for the test sets of each of the folds.

Finally, we have:

In [94]:
# return np.min([np.mean(scores), np.median(scores)])

Now, here's the other place where optuna really shines. As it doesn't really care what is inside the function as long as it returns a number (or array of numbers… to be continued…), optuna basically turns a blind eye to the internals of the objective function. This allows us to calculate just about whatever we want in there.

In this particular case, I set the return value to be the minimum between the mean and the median of the cross-validation scores. The reason for it is simple: mean values tend to be strongly influenced by outliers, so if a single split has a very good performance, but the others don't (which happens more often than one would like), then the score will be artificially high. Let's look at a simple example to illustrate the point:

In [95]:
# Just an example to show impact of outliers on mean vs median, to illustrate rationale on returning the minimum between the mean and the median of the scores

# scores = np.array([0.51, 0.51, 0.51, 0.51, 0.98])
# np.mean(scores) # 0.604
# np.median(scores) # 0.51

Similarly, the mean is also influenced by very small values, while the median isn't. In this way, the minimum between the mean and median will prevent scores from becoming too optimistic, which may negatively influence surrogate-model optimization (e.g. Bayesian) methods. This also helps prevent overfitting towards an "easy" fold (the one with high performance, which will be the same across all trials).
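
That reasoning is written for a score where higher is better. For an error metric such as MAE the direction flips: an "easy" fold pulls the mean down, so the cautious aggregate is the maximum of the mean and the median. A small sketch with made-up fold values:

```python
import numpy as np

# Higher-is-better scores (e.g. ROC AUC) with one unusually easy fold
auc_scores = np.array([0.51, 0.51, 0.51, 0.51, 0.98])
conservative_auc = min(np.mean(auc_scores), np.median(auc_scores))

# Lower-is-better errors (e.g. MAE) with one unusually easy fold (illustrative values)
mae_scores = np.array([36.0, 36.0, 36.0, 36.0, 20.0])
conservative_mae = max(np.mean(mae_scores), np.median(mae_scores))

print(conservative_auc, conservative_mae)
```

In both cases the aggregate ignores the flattering fold and reports the typical one.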

You can obviously calculate this same score in sklearn grid and random search objects (after finishing the search) through the cv_results_ attribute. However, that requires manually setting things and selecting hyperparameters one by one, which could very easily turn hacky.

The rest...  
If you recall, the missing steps are as follows:

1. Create the study,  
2. Get the best trial,  
3. Instantiate and train the best model.  

The first part is just this:

In [96]:
from optuna import create_study

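# NOTE: the article used direction='maximize' because its objective returned ROC AUC.
# The objective here is based on MAE, which should be minimized, so direction='minimize'
# is the appropriate setting. The run logged below was made with 'maximize', so the
# "best" trial it reports is the one with the highest value.
study = create_study(study_name='optimization', direction='maximize')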

#study.optimize(lambda trial: objective(trial, X_train, y_train), n_trials=100)
study.optimize(lambda trial: objective(trial, train_X, train_y), n_trials=100)

[I 2024-04-08 14:37:04,648] A new study created in memory with name: optimization


Numerical Columns: None
Categorical Columns: None



[I 2024-04-08 14:37:14,734] Trial 0 finished with value: 38.01366024017334 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 0 with value: 38.01366024017334.


[I 2024-04-08 14:37:24,243] Trial 1 finished with value: 36.114485549926755 and parameters: {'numerical_strategy': 'constant', 'with_centering': True, 'with_scaling': True}. Best is trial 0 with value: 38.01366024017334.


[I 2024-04-08 14:37:33,446] Trial 2 finished with value: 36.80457744598389 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 0 with value: 38.01366024017334.


[I 2024-04-08 14:37:42,422] Trial 3 finished with value: 36.7618896484375 and parameters: {'numerical_strategy': 'median', 'with_centering': True, 'with_scaling': True}. Best is trial 0 with value: 38.01366024017334.


[I 2024-04-08 14:37:51,042] Trial 4 finished with value: 36.72601165771484 and parameters: {'numerical_strategy': 'mean', 'with_centering': True, 'with_scaling': True}. Best is trial 0 with value: 38.01366024017334.


[I 2024-04-08 14:37:59,966] Trial 5 finished with value: 37.75879383087158 and parameters: {'numerical_strategy': 'constant', 'with_centering': True, 'with_scaling': False}. Best is trial 0 with value: 38.01366024017334.


[I 2024-04-08 14:38:09,181] Trial 6 finished with value: 37.63752365112305 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': True, 'with_scaling': True}. Best is trial 0 with value: 38.01366024017334.


[I 2024-04-08 14:38:18,226] Trial 7 finished with value: 38.86159133911133 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': True}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:38:27,046] Trial 8 finished with value: 37.799868774414065 and parameters: {'numerical_strategy': 'median', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:38:35,674] Trial 9 finished with value: 36.97966690063477 and parameters: {'numerical_strategy': 'mean', 'with_centering': True, 'with_scaling': True}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:38:44,708] Trial 10 finished with value: 37.441718673706056 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:38:53,918] Trial 11 finished with value: 37.530520820617674 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:39:02,313] Trial 12 finished with value: 37.51869010925293 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:39:11,614] Trial 13 finished with value: 38.0490140914917 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:39:20,062] Trial 14 finished with value: 37.5642126083374 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': True}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:39:28,403] Trial 15 finished with value: 38.22792778015137 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:39:38,000] Trial 16 finished with value: 36.7386661529541 and parameters: {'numerical_strategy': 'median', 'with_centering': False, 'with_scaling': True}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:39:46,669] Trial 17 finished with value: 37.34800567626953 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:39:55,121] Trial 18 finished with value: 37.40697021484375 and parameters: {'numerical_strategy': 'constant', 'with_centering': False, 'with_scaling': True}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:40:04,800] Trial 19 finished with value: 38.36790218353271 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:40:13,359] Trial 20 finished with value: 37.68471565246582 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': True}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:40:21,693] Trial 21 finished with value: 38.35360984802246 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:40:30,031] Trial 22 finished with value: 37.30721015930176 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:40:39,988] Trial 23 finished with value: 37.346985054016116 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:40:48,478] Trial 24 finished with value: 37.5997896194458 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:40:56,921] Trial 25 finished with value: 38.593325805664065 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:41:05,273] Trial 26 finished with value: 37.174325942993164 and parameters: {'numerical_strategy': 'constant', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:41:15,425] Trial 27 finished with value: 37.123819541931155 and parameters: {'numerical_strategy': 'median', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:41:24,186] Trial 28 finished with value: 37.4743293762207 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:41:32,615] Trial 29 finished with value: 37.75054149627685 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:41:40,997] Trial 30 finished with value: 37.68812313079834 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:41:51,312] Trial 31 finished with value: 38.40414180755615 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:41:59,852] Trial 32 finished with value: 38.102563285827635 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:42:08,352] Trial 33 finished with value: 37.60571632385254 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:42:16,817] Trial 34 finished with value: 38.09591236114502 and parameters: {'numerical_strategy': 'mean', 'with_centering': True, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:42:25,201] Trial 35 finished with value: 37.939734840393065 and parameters: {'numerical_strategy': 'constant', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:42:33,793] Trial 36 finished with value: 37.30511837005615 and parameters: {'numerical_strategy': 'mean', 'with_centering': True, 'with_scaling': True}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:42:44,664] Trial 37 finished with value: 38.501680564880374 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:42:53,369] Trial 38 finished with value: 37.031920433044434 and parameters: {'numerical_strategy': 'median', 'with_centering': True, 'with_scaling': True}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:43:02,071] Trial 39 finished with value: 36.87373237609863 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:43:11,040] Trial 40 finished with value: 36.87711658477783 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': True, 'with_scaling': True}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:43:19,603] Trial 41 finished with value: 36.511891746520995 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:43:28,077] Trial 42 finished with value: 37.48541812896728 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:43:39,164] Trial 43 finished with value: 37.27718372344971 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:54:45,409] Trial 44 finished with value: 36.46418018341065 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:54:54,091] Trial 45 finished with value: 37.79909973144531 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:55:02,806] Trial 46 finished with value: 38.36898536682129 and parameters: {'numerical_strategy': 'constant', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:55:11,570] Trial 47 finished with value: 37.906373023986816 and parameters: {'numerical_strategy': 'constant', 'with_centering': False, 'with_scaling': True}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:55:20,552] Trial 48 finished with value: 38.01701831817627 and parameters: {'numerical_strategy': 'constant', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:55:29,422] Trial 49 finished with value: 36.628009986877444 and parameters: {'numerical_strategy': 'constant', 'with_centering': True, 'with_scaling': True}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:55:41,246] Trial 50 finished with value: 38.292212677001956 and parameters: {'numerical_strategy': 'constant', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:55:49,882] Trial 51 finished with value: 38.22637538909912 and parameters: {'numerical_strategy': 'median', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:55:58,453] Trial 52 finished with value: 38.13577823638916 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:56:07,075] Trial 53 finished with value: 37.375814819335936 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 7 with value: 38.86159133911133.


[I 2024-04-08 14:56:15,826] Trial 54 finished with value: 39.43433380126953 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:56:24,543] Trial 55 finished with value: 37.45155067443848 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:56:33,245] Trial 56 finished with value: 38.41146945953369 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:56:45,444] Trial 57 finished with value: 37.96343555450439 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:56:54,241] Trial 58 finished with value: 37.18290367126465 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:57:03,054] Trial 59 finished with value: 36.84217205047607 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:57:11,975] Trial 60 finished with value: 38.37434406280518 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:57:20,778] Trial 61 finished with value: 36.4669828414917 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:57:29,571] Trial 62 finished with value: 36.248019790649415 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:57:38,347] Trial 63 finished with value: 36.96564979553223 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:57:47,159] Trial 64 finished with value: 37.992698097229 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:57:55,953] Trial 65 finished with value: 38.69328002929687 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:58:08,629] Trial 66 finished with value: 37.83587646484375 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:58:17,531] Trial 67 finished with value: 38.2501880645752 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:58:26,391] Trial 68 finished with value: 38.49975872039795 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:58:35,278] Trial 69 finished with value: 37.123053550720215 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': True, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:58:44,047] Trial 70 finished with value: 37.60883560180664 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:58:52,811] Trial 71 finished with value: 38.38143692016602 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:59:01,611] Trial 72 finished with value: 38.769427108764646 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:59:10,444] Trial 73 finished with value: 37.34710159301758 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:59:19,329] Trial 74 finished with value: 37.48450546264648 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:59:28,228] Trial 75 finished with value: 37.150693321228026 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:59:41,362] Trial 76 finished with value: 37.47324047088623 and parameters: {'numerical_strategy': 'median', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:59:50,363] Trial 77 finished with value: 37.83398361206055 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 14:59:59,432] Trial 78 finished with value: 38.16778202056885 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:00:08,508] Trial 79 finished with value: 38.21122875213623 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:00:17,434] Trial 80 finished with value: 36.55858821868897 and parameters: {'numerical_strategy': 'most_frequent', 'with_centering': True, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:00:41,239] Trial 81 finished with value: 37.29408721923828 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:03:49,813] Trial 82 finished with value: 38.773603439331055 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:03:58,473] Trial 83 finished with value: 38.817390060424806 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:04:07,010] Trial 84 finished with value: 37.44103107452393 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:04:15,465] Trial 85 finished with value: 37.04869079589844 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:04:24,149] Trial 86 finished with value: 37.91178646087646 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:04:32,690] Trial 87 finished with value: 36.40995006561279 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:04:46,486] Trial 88 finished with value: 38.29836807250977 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:04:55,288] Trial 89 finished with value: 36.39894752502441 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': True}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:05:04,022] Trial 90 finished with value: 36.94516124725342 and parameters: {'numerical_strategy': 'mean', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:05:12,843] Trial 91 finished with value: 38.7246597290039 and parameters: {'numerical_strategy': 'median', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:05:21,535] Trial 92 finished with value: 38.06767311096191 and parameters: {'numerical_strategy': 'median', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:05:30,141] Trial 93 finished with value: 38.60491676330567 and parameters: {'numerical_strategy': 'median', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:05:38,735] Trial 94 finished with value: 38.06675300598145 and parameters: {'numerical_strategy': 'median', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:05:47,375] Trial 95 finished with value: 36.98569297790527 and parameters: {'numerical_strategy': 'median', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:05:56,049] Trial 96 finished with value: 37.820853042602536 and parameters: {'numerical_strategy': 'median', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:06:04,729] Trial 97 finished with value: 37.8676513671875 and parameters: {'numerical_strategy': 'median', 'with_centering': True, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:06:13,393] Trial 98 finished with value: 37.18369541168213 and parameters: {'numerical_strategy': 'median', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


[I 2024-04-08 15:06:22,027] Trial 99 finished with value: 37.85788345336914 and parameters: {'numerical_strategy': 'median', 'with_centering': False, 'with_scaling': False}. Best is trial 54 with value: 39.43433380126953.


The names of the functions and objects are pretty self-explanatory: the create_study function creates a study object which "knows" both its own name and the direction to optimize (i.e. loss functions must be minimized, while gain/score functions must be maximized). The "optimize" method is a higher-order function that takes the objective function (here is where the trial object automagically appears) and the number of trials to run the optimization for.
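
To make the effect of `direction` concrete, here is a plain-Python sketch (not optuna's implementation) that picks the "best" of four values copied from the trial log above under each direction. The helper `pick_best` is hypothetical. If those values are MAE, as the comment in the cell says, then `'maximize'` reports the trial with the largest error as best, whereas the lowest value among the logged trials belongs to trial 1.

```python
# (trial number, objective value) pairs copied from the log above.
logged = [
    (1, 36.114485549926755),
    (7, 38.86159133911133),
    (54, 39.43433380126953),
    (62, 36.248019790649415),
]

def pick_best(trials, direction):
    """Return the (number, value) pair that counts as best under this direction."""
    if direction not in ("minimize", "maximize"):
        raise ValueError(f"unknown direction: {direction!r}")
    chooser = max if direction == "maximize" else min
    return chooser(trials, key=lambda pair: pair[1])

print(pick_best(logged, "maximize"))  # trial 54, the largest value
print(pick_best(logged, "minimize"))  # trial 1, the smallest value
```

Switching the direction would also change which hyperparameters the sampler explores, so a rerun with `'minimize'` would not simply reproduce this log with a different winner.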

Depending on the size of the dataset, the number of splits selected for cross-validation and learning complexity, the optimize function will take a short or long amount of time (though I dare say it'll be on the longer side).

After that's done, we'll be left with a Study object that contains all of the data/metadata we could ever want about the optimization. If this is your first time using optuna, don't worry, we'll get deeper into the nuances of optuna studies in the next part. For now, let's move on to getting the best combination of hyperparameters out of there.

Study objects contain several attributes; see documentation https://optuna.readthedocs.io/en/stable/reference/generated/optuna.study.Study.html#  

The attributes of interest to us right now are "best_params" and "best_trial", which will contain a dictionary with the best key-value pairs for the chosen hyperparameters and its corresponding trial object, respectively. "direction/s", "system_attrs" and "user_attrs" are study-related metadata (again, don't panic, we'll look at them in the other parts). Regarding "best_trials", it's a way of selecting the best trials for multi-objective optimizations, which we aren't doing right now, so we can safely forget about it for now.

If you want to visually inspect the best combination of hyperparameters, you can do it as follows:

In [97]:
study.best_params

{'numerical_strategy': 'most_frequent',
 'with_centering': False,
 'with_scaling': False}

The next step is to get the best trial, which can be done like so:

In [98]:
best_trial = study.best_trial

The trial itself is a very particular optuna object that contains all the data and metadata concerning that particular combination of hyperparameters, such as the parameters themselves and the objective function's return value (in the original article, the minimum between the mean and median of the ROC AUC scores; in this notebook, the MAE-based value returned by `objective`).

This is where the instantiation functions get their moment to shine:

In [102]:
#model = instantiate_model(best_trial, numerical_columns, categorical_columns)
model = instantiate_model(best_trial, None, None)
#model.fit(X_train, y_train)
model.fit(train_X, train_y)



fill value -1?  Investigate the significance.

That's IT!

We can now easily calculate the out-of-sample performance:

In [104]:
#probabilities = model.predict_proba(X_test)[:, 1]
#score = roc_auc_score(y_test, probabilities)
pred_test_y = model.predict(test_X)
score = mean_absolute_error(test_y, pred_test_y)
print(score)

19.053349


And train the model on all the data:  
(Need to generate X_full and y_full)

For my case, we'll have to use the `values` array from way up above, before it was split into training and test sets. Then it needs to be separated into an X and a y set by the columns that were created when `series_to_supervised` was run on it.  

Basically, with this knowledge, you can split the features from the target:  

train_X, train_y = train[:, :n_obs], train[:, -n_features]

In [106]:
X_full = values[:, :n_obs]
y_full = values[:, -n_features]
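
A small illustration of what those two slices select, using plain nested lists in place of the NumPy array. It assumes the layout produced by a `series_to_supervised`-style framing: `n_lags` past time steps of `n_features` variables each, followed by the current time step, with `n_obs = n_lags * n_features`. The numbers and the name `n_lags` are made up for the example; the notebook's own `n_obs` and `n_features` are defined earlier.

```python
n_lags, n_features = 2, 3
n_obs = n_lags * n_features  # number of lagged input columns

# One row per sample: [t-2 variables | t-1 variables | t variables].
values = [
    [11, 12, 13, 21, 22, 23, 31, 32, 33],
    [21, 22, 23, 31, 32, 33, 41, 42, 43],
]

# Equivalent of values[:, :n_obs]: every lagged column becomes a feature.
X_full = [row[:n_obs] for row in values]
# Equivalent of values[:, -n_features]: the first variable at time t is the target.
y_full = [row[-n_features] for row in values]

print(X_full)  # [[11, 12, 13, 21, 22, 23], [21, 22, 23, 31, 32, 33]]
print(y_full)  # [31, 41]
```

Note that `-n_features` picks a single column, so this split only yields the intended target if the target is the first variable within each time step.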

In [107]:
model.fit(X_full, y_full)



As you've seen throughout this article, optuna allows us to very comfortably define anything needed for an arbitrarily complex optimization.

This way of doing things (instantiation functions, pure objective function) allows us to define things once and build a simple-ish module of optimization-related functions which can be recycled as needed on as many projects as we want.

In general, I don’t recommend moving forward without looking at more than one hyperparameter set. In the above, we only used the best trial, but a better practice would be to look at the top 10% and compare them.
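
One way to pull out that top slice, sketched under assumptions: optuna exposes the trials of a study through `study.trials`, and each one carries `number`, `value` and `params`. The helper `top_fraction` below is hypothetical and relies only on those three attributes, so it is shown here with `SimpleNamespace` stand-ins built from a few lines of the log above rather than with a real study. Use `lower_is_better=True` for a loss such as MAE.

```python
import math
from types import SimpleNamespace

def top_fraction(trials, fraction=0.10, lower_is_better=True):
    """Return the best `fraction` of trials (at least one), best first."""
    finished = [t for t in trials if t.value is not None]  # skip trials without a value
    ranked = sorted(finished, key=lambda t: t.value, reverse=not lower_is_better)
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]

def make_trial(number, value, strategy, centering, scaling):
    """Build a stand-in object with the attributes used above."""
    params = {"numerical_strategy": strategy,
              "with_centering": centering,
              "with_scaling": scaling}
    return SimpleNamespace(number=number, value=value, params=params)

# A handful of trials copied from the log above.
trials = [
    make_trial(1, 36.114485549926755, "constant", True, True),
    make_trial(7, 38.86159133911133, "mean", False, True),
    make_trial(54, 39.43433380126953, "most_frequent", False, False),
    make_trial(62, 36.248019790649415, "most_frequent", False, True),
    make_trial(89, 36.39894752502441, "mean", False, True),
]

for t in top_fraction(trials, fraction=0.4):
    print(t.number, round(t.value, 3), t.params)
```

With a real study, passing `study.trials` in place of the stand-in list should work the same way, since only those three attributes are read.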

Part 2 will extend what we saw here; the following articles will recycle the methodology presented here and look at some more of optuna's features.

https://medium.com/@walter_sperat/using-sklearn-with-optuna-the-right-way-part-2-244ce874e3ff  