# **Set-up**


In [1]:
%%capture
!pip install -r requirements

In [2]:
# Import Packages
## for data and preprocessing
import pandas as pd
from sklearn.preprocessing import StandardScaler

## for model fitting
import lightgbm as lgb
import sklearn.metrics as metric

## for hyperparameter optimization
import optuna

## for replicability
import random

In [3]:
train = pd.read_csv('./data/california_housing_train.csv')
test = pd.read_csv('./data/california_housing_test.csv')

names = train.columns

scaler = StandardScaler()
train = pd.DataFrame(scaler.fit_transform(train),columns=names)
test = pd.DataFrame(scaler.transform(test), columns=names)


X_train = train.drop(['median_house_value'],axis=1)
X_test  = test.drop(['median_house_value'],axis=1)
y_train = train.median_house_value
y_test  = test.median_house_value

# **OPTUNA**

## **General Overview**

Optuna optimizes any objective function. This objective function takes a set of arguments (e.g., hyperparameters) and returns a single value (e.g., validation score).  

In Optuna, we create a **study**. A study is defined by the objective function and the hyperparameter space and, thus, defines the scope and purpose of our optimization exercise.   
Each study consists of a set of **trials**. A trial is a single selection from the hyperparameter space for which we evaluate the objective function. With an adaptive sampler, every next trial builds on the results of the previous ones (i.e., an iterative optimization process).

The optimization algorithm helps in picking the next trial to evaluate in a smart(er) way, until we find the optimal value.

In practice, every hyperparameter optimization exercise consists of 4 steps:

* define a function which **trains a model** and **returns the validation score**

* define the **hyperparameter space** through which the optimization algorithm can search (trials are instances/realizations of this space)

* create a **study**, which describes the optimization exercise: 
    * *Direction* : 
        * minimize: for (Root) Mean Squared Errors, minus-log-likelihood, ... (the lower, the better)
        * maximize: r2_score, auc, accuracy, precision, recall, f1_score, ... (the higher, the better)
    * *Sampler* : the chosen optimization technique **(Optimization)**
    * *Pruner* : early stopping of unpromising trials **(Steroids)**

* **optimize** the study using different trials in a smart way **(worker function)**
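
The four steps above can be sketched without any library. This is a minimal illustration, not Optuna's implementation: `toy_objective`, `sample_params` and `run_study` are hypothetical names, the objective is a made-up quadratic, and plain random sampling stands in for the sampler.

```python
import random

def toy_objective(params):
    """Stand-in for a validation score (hypothetical): lowest at x=2, y=-1."""
    return (params["x"] - 2.0) ** 2 + (params["y"] + 1.0) ** 2

def sample_params(rng):
    """Hyperparameter space: every trial is one draw from it."""
    return {"x": rng.uniform(-5.0, 5.0), "y": rng.uniform(-5.0, 5.0)}

def run_study(objective, n_trials, seed=0):
    """Evaluate n_trials sampled configurations and keep the best (direction: minimize)."""
    rng = random.Random(seed)
    trials = []
    for number in range(n_trials):
        params = sample_params(rng)
        trials.append({"number": number, "params": params, "value": objective(params)})
    best = min(trials, key=lambda t: t["value"])
    return best, trials

best_trial, all_trials = run_study(toy_objective, n_trials=200)
print(best_trial["value"], best_trial["params"])
```

A smarter sampler would replace `sample_params` with a rule that looks at the values of earlier trials before proposing the next configuration.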


Because our time (and computing budget) is limited, we first set a maximum number of trials to evaluate (i.e., a maximum number of iterations). 

In [4]:
N_TRIALS = 200

### Step 1
We define a function which takes a hyperparameter configuration (params, which is defined in Step 2) as its argument.  
Then this function takes our data and trains a machine learning model. In this example, we train a lightgbm model, which can take a lot of interesting hyperparameters to illustrate tuning. Any model architecture can work here (e.g., xgboost, random forest, neural networks, ...).  
Lastly, we make some predictions on our test (or validation) set and compute the validation score. In this example, we use the Root Mean Squared Error (RMSE), but again any validation metric is viable.

In [5]:
def train_evaluate(params):
    '''Train a model using your dataset and return the validation score.'''
    train_data = lgb.Dataset(X_train, label=y_train)
    test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)
    # Train a Model
    model = lgb.train(params, train_data,
                      num_boost_round=params['NUM_BOOST_ROUND'],
                      early_stopping_rounds=params['EARLY_STOPPING_ROUNDS'],
                      valid_sets=[test_data],
                      valid_names=['valid'],
                      )
    # Evaluate the model
    # predict expects the raw feature matrix, not an lgb.Dataset
    preds = model.predict(X_test, num_iteration=model.best_iteration)
    truth = test_data.get_label()
    score = metric.mean_squared_error(truth, preds, squared=False)
      
    #score = model.best_score['valid']['rmse']
    # Return the validation score
    return score
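
As a small worked example of the validation metric, the sketch below computes the RMSE by hand on made-up numbers (the `truth` and `preds` lists are illustrative, not taken from the dataset). Because the target column was standardized together with the features in the set-up cell, the RMSE values reported in this notebook are presumably expressed in standard deviations of the training target rather than in dollars.

```python
import math

def rmse(truth, preds):
    """Root Mean Squared Error: square root of the mean squared difference."""
    if len(truth) != len(preds):
        raise ValueError("truth and preds must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(truth, preds)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

truth = [1.0, 2.0, 3.0, 4.0]
preds = [1.0, 2.0, 3.0, 6.0]
# squared errors are 0, 0, 0, 4 -> mean 1.0 -> RMSE 1.0
print(rmse(truth, preds))
```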

### Step 2
We define the objective function of our optimization exercise, which takes a trial as argument.  
In this function, we first define the parameter space. This parameter space is a dictionary defining each hyperparameter of interest. For each hyperparameter, we use the $trial.suggest$ functionality to define the domain from which we can sample values for the hyperparameters. In a Bayesian way, think about this as our prior (hyper)parameter distribution. We can use various distributions (note that Optuna 3.0 and later deprecate suggest_uniform, suggest_loguniform and suggest_discrete_uniform in favor of suggest_float with its log and step arguments):
- trial.suggest_loguniform for floating point hyperparameters between two bounds, sampled uniformly on a log scale (favoring smaller values),
- trial.suggest_float for floating point hyperparameters between two bounds,
- trial.suggest_int for integer hyperparameters between two bounds, with an optional step-size,
- trial.suggest_uniform for uniformly distributed hyperparameters between two bounds,
- trial.suggest_discrete_uniform for uniformly distributed hyperparameters between two bounds, restricted to a grid with a given step-size,
- ...
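
To make the difference between these distributions concrete, here is a standard-library sketch of what each one draws. The `sample_*` helpers are hypothetical stand-ins written for this illustration; they are not Optuna functions and only mimic the behavior described in the list above.

```python
import math
import random

rng = random.Random(0)  # local generator so the sketch is repeatable

def sample_uniform(low, high):
    """Every value between low and high is equally likely."""
    return rng.uniform(low, high)

def sample_loguniform(low, high):
    """Uniform on a log scale: each order of magnitude gets equal mass."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_discrete_uniform(low, high, q):
    """Uniform over the grid low, low + q, ..., high."""
    n_steps = round((high - low) / q)
    return round(low + q * rng.randint(0, n_steps), 10)

def sample_int(low, high, step=1):
    """Uniform over the integers low, low + step, ..., up to high."""
    return rng.randrange(low, high + 1, step)

n = 10_000
uniform_draws = [sample_uniform(0.01, 0.5) for _ in range(n)]
log_draws = [sample_loguniform(0.01, 0.5) for _ in range(n)]

# Share of learning-rate draws below 0.1; in expectation about 0.18
# for the uniform and about 0.59 for the log-uniform distribution.
share_uniform = sum(v < 0.1 for v in uniform_draws) / n
share_log = sum(v < 0.1 for v in log_draws) / n
print(f"share below 0.1: uniform={share_uniform:.2f}, log-uniform={share_log:.2f}")

subsample_draws = [sample_discrete_uniform(0.1, 1.0, 0.1) for _ in range(5)]
depth_draws = [sample_int(1, 30) for _ in range(5)]
print(subsample_draws, depth_draws)
```

This is why a log-uniform prior is a common choice for a learning rate: small values such as 0.01 to 0.1 are explored far more often than under a plain uniform prior.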

After defining the hyperparameter space, we apply the previously defined function to train a model and return the validation score.  
At this stage, we will also check whether the trial should be pruned or not (depending on whether a pruning strategy was specified).

In [6]:
def objective(trial):
    '''
    Define the Hyperparameter Space from which to sample a configuration.
    Then train a model and output the validation score (see Step 1).
    '''
    # Define the Hyper-parameter Space
    params = {'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 0.5),
              'max_depth': trial.suggest_int('max_depth', 1, 30, step=1),
              'num_leaves': trial.suggest_int('num_leaves', 2, 100),
              'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 10, 100),
              'feature_fraction': trial.suggest_uniform('feature_fraction', 0.1, 1.0),
              'subsample': trial.suggest_discrete_uniform('subsample', 0.1, 1.0,.1),
              'colsample_by_tree': 1,
              'lambda_l1': trial.suggest_float('lambda_l1', 0, 10),
              'lambda_l2': trial.suggest_float('lambda_l2', 0, 10),
              'NUM_BOOST_ROUND': 200,
              'EARLY_STOPPING_ROUNDS': 20,
              'objective': 'rmse',
              }
              
    # Train the model and return the validation score
    score = train_evaluate(params)
    
    #Check Pruning
    trial.report(score,1)
    if trial.should_prune():
        raise optuna.TrialPruned()

    # Return the validation score
    return score

### Step 3
We create the study object, in which we describe the optimization exercise by means of: 

* the *Direction* of our optimization: 
  * minimize: for (Root) Mean Squared Errors, minus-log-likelihood, ... (the lower, the better)
  * maximize: r2_score, auc, accuracy, precision, recall, f1_score, ... (the higher, the better)  
  

* the *Sampler* which is our optimization technique: 
  * GridSampler, applies a Grid Search on a predefined grid (extra arguments required!)
  * RandomSampler, applies Random Search on the parameter space
  * CmaEsSampler, applies a Covariance Matrix Adaptation Evolutionary Search algorithm
  * TPESampler, is the default option, which applies a Tree-structured Parzen Estimator algorithm  


* the *Pruner* which is our pruning strategy to quickly stop unpromising trials:
  * NopPruner, does not prune any trials
  * MedianPruner, prunes a trial whose intermediate result is worse than the median of previous trials at the same step
  * SuccessiveHalvingPruner, uses Asynchronous Successive Halving (repeatedly keeps only the best-performing fraction of trials and stops the rest)
  * HyperbandPruner, uses the Hyperband pruning strategy  
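
The median rule can be sketched in a few lines. This is a simplified illustration and not Optuna's code: `should_prune_median` is a hypothetical helper, it compares only the latest reported value, and the `n_startup_trials` threshold is an assumption of the sketch. The six values are the first six scores from the CMA-ES results table further down, each reported at step 200.

```python
from statistics import median

def should_prune_median(value, step, finished_trials, n_startup_trials=5):
    """Return True if `value` is worse (higher) than the median at the same step.

    finished_trials: list of dicts mapping step -> intermediate value.
    """
    previous = [t[step] for t in finished_trials if step in t]
    if len(previous) < n_startup_trials:
        return False  # not enough history yet to judge
    return value > median(previous)  # direction: minimize

# First six scores of the CMA-ES study (see its results table), reported at step 200
history = [{200: v} for v in
           [0.588295, 0.405101, 0.499723, 0.413266, 0.416543, 0.406795]]

print(median(t[200] for t in history))               # median of the six scores
print(should_prune_median(0.421212, 200, history))   # score of trial 6 in that table
print(should_prune_median(0.400000, 200, history))   # a better-than-median score
```

Under this rule the seventh score (0.421212) lies above the median of the first six, which is consistent with that trial being marked PRUNED in the table. Note that the objective in this notebook reports its score only once, after training has finished, so pruning here appears to label a trial rather than save training time; reporting intermediate scores during training would be needed for that.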

In [7]:
study = optuna.create_study(
    direction = 'minimize',                         
    sampler = optuna.samplers.RandomSampler(),      
    pruner = optuna.pruners.NopPruner()            
    )

[I 2021-11-15 11:16:42,462] A new study created in memory with name: no-name-591837e3-458b-4562-9ff1-e89374cc1f7f


### Step 4
We call the optimize function on our study object to start the optimization process. 

In [8]:
%%script false --no-raise-error
study.optimize(objective, n_trials=N_TRIALS)

Couldn't find program: 'false'


## Example of Several Optimization Strategies

Comment out the following line if you want to track the progress of the Optuna sampler live (warning: this produces a long output trail). 
Run the line as-is to suppress the informational log messages of the Optuna sampler.

In [9]:
optuna.logging.set_verbosity(optuna.logging.WARNING)

In [10]:
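# This seeds Python's built-in random module only; Optuna samplers take their own
# seed argument (e.g. optuna.samplers.TPESampler(seed=1)) for reproducible studies.
random.seed(1)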

Next, we define our $train\_evaluate$ and our $objective$ functions. 

In [11]:
def train_evaluate(params):
    # Format/Preprocess Data
    train_data = lgb.Dataset(X_train, label=y_train)
    test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)
    
    # Train a Model
    model = lgb.train(params, train_data,
                      num_boost_round=params['NUM_BOOST_ROUND'],
                      early_stopping_rounds=params['EARLY_STOPPING_ROUNDS'],
                      valid_sets=[test_data],
                      valid_names=['valid'],
                      )
    
    # Evaluate the model 
    preds = model.predict(X_test,num_iteration=model.best_iteration)
    truth = test_data.get_label()
    score = metric.mean_squared_error(truth, preds, squared=False)
    
    # Return the validation score
    return score

def objective(trial):
    # Define the Hyper-parameter Space
    params = {'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 0.5),
              'max_depth': trial.suggest_int('max_depth', 1, 50),
              'num_leaves': trial.suggest_int('num_leaves', 2, 200),
              'feature_fraction': trial.suggest_uniform('feature_fraction', 0.1, 1.0),
              'subsample': trial.suggest_discrete_uniform('subsample', 0.1, 1.0, .1),
              'colsample_by_tree': 1,
              'lambda_l1': trial.suggest_float('lambda_l1', 0, 10),
              'lambda_l2': trial.suggest_float('lambda_l2', 0, 10),
              'bagging_fraction':trial.suggest_uniform('bagging_fraction', 0, 1),
              'bagging_freq':trial.suggest_int('bagging_freq',0,10),
              'NUM_BOOST_ROUND': 200,
              'EARLY_STOPPING_ROUNDS': 20,
              'objective': 'rmse',
              }
    
    # Train the model and return the validation score
    score = train_evaluate(params)
    
    #Check Pruning
    trial.report(score,200)
    if trial.should_prune():
        raise optuna.TrialPruned()
    
    # Return the validation score
    return score

### **Grid Search**

A grid search does not sample from the continuous hyperparameter space; instead, it evaluates every combination in a search space made of discrete lists of hyperparameter values.
In this example, we search over a small grid of 3 x 4 x 4 hyperparameter values (total size of the grid: 48 possibilities).

In [12]:
%%time
%%capture

def objective_grid(trial):
    # Define the Hyper-parameter Space
    params = {'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.5),
              'max_depth': trial.suggest_int('max_depth', 1, 50),
              'num_leaves': trial.suggest_int('num_leaves', 2, 200),
              'NUM_BOOST_ROUND': 200,
              'EARLY_STOPPING_ROUNDS': 20,
              'objective': 'rmse',
              'verbose': -1,
              }
    #Apply the train_evaluate function
    score = train_evaluate(params)
    return score

search_space = {'learning_rate': [0.01, 0.10, 0.50],
                'max_depth': [1, 10, 20, 30],
                'num_leaves': [2, 10, 20, 100]}

study_gridsearch = optuna.create_study(
    direction='minimize',
    sampler=optuna.samplers.GridSampler(search_space),
    pruner = optuna.pruners.NopPruner() 
    )

study_gridsearch.optimize(objective_grid, n_trials=N_TRIALS)

Wall time: 7.08 s
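
To see what the grid contains, the sketch below enumerates every combination of the `search_space` used above with `itertools.product`. The variable name `grid` is introduced here for illustration only. With 3 learning rates, 4 depths and 4 leaf counts there are 48 combinations, fewer than `N_TRIALS`; the grid sampler is expected to stop once every combination has been evaluated, which would explain the short wall time.

```python
from itertools import product

search_space = {'learning_rate': [0.01, 0.10, 0.50],
                'max_depth': [1, 10, 20, 30],
                'num_leaves': [2, 10, 20, 100]}

# One dictionary per grid point: the Cartesian product of the value lists
grid = [dict(zip(search_space, values))
        for values in product(*search_space.values())]

print(len(grid))   # 3 * 4 * 4 = 48
print(grid[0])     # first grid point
```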


In [13]:
df_gridsearch = study_gridsearch.trials_dataframe()
df_gridsearch.head(10)

Unnamed: 0,number,value,datetime_start,datetime_complete,duration,params_learning_rate,params_max_depth,params_num_leaves,system_attrs_grid_id,system_attrs_search_space,state
0,0,0.747001,2021-11-15 11:16:42.576752,2021-11-15 11:16:42.650712,0 days 00:00:00.073960,0.01,20,2,8,"{'learning_rate': [0.01, 0.1, 0.5], 'max_depth...",COMPLETE
1,1,0.427934,2021-11-15 11:16:42.650712,2021-11-15 11:16:42.748194,0 days 00:00:00.097482,0.5,10,10,37,"{'learning_rate': [0.01, 0.1, 0.5], 'max_depth...",COMPLETE
2,2,0.747001,2021-11-15 11:16:42.748194,2021-11-15 11:16:42.826192,0 days 00:00:00.077998,0.01,10,2,4,"{'learning_rate': [0.01, 0.1, 0.5], 'max_depth...",COMPLETE
3,3,0.598808,2021-11-15 11:16:42.826192,2021-11-15 11:16:42.896260,0 days 00:00:00.070068,0.1,1,20,18,"{'learning_rate': [0.01, 0.1, 0.5], 'max_depth...",COMPLETE
4,4,0.594373,2021-11-15 11:16:42.896260,2021-11-15 11:16:43.020259,0 days 00:00:00.123999,0.01,20,10,9,"{'learning_rate': [0.01, 0.1, 0.5], 'max_depth...",COMPLETE
5,5,0.526162,2021-11-15 11:16:43.021260,2021-11-15 11:16:43.087258,0 days 00:00:00.065998,0.5,1,100,35,"{'learning_rate': [0.01, 0.1, 0.5], 'max_depth...",COMPLETE
6,6,0.526162,2021-11-15 11:16:43.087258,2021-11-15 11:16:43.154259,0 days 00:00:00.067001,0.5,1,2,32,"{'learning_rate': [0.01, 0.1, 0.5], 'max_depth...",COMPLETE
7,7,0.526162,2021-11-15 11:16:43.154259,2021-11-15 11:16:43.223261,0 days 00:00:00.069002,0.5,10,2,36,"{'learning_rate': [0.01, 0.1, 0.5], 'max_depth...",COMPLETE
8,8,0.598808,2021-11-15 11:16:43.223261,2021-11-15 11:16:43.291258,0 days 00:00:00.067997,0.1,30,2,28,"{'learning_rate': [0.01, 0.1, 0.5], 'max_depth...",COMPLETE
9,9,0.598808,2021-11-15 11:16:43.292258,2021-11-15 11:16:43.363261,0 days 00:00:00.071003,0.1,1,2,16,"{'learning_rate': [0.01, 0.1, 0.5], 'max_depth...",COMPLETE


In [14]:
gridsearch = {'score': study_gridsearch.best_value, 'params': study_gridsearch.best_params}
print(gridsearch)

{'score': 0.4009600440564793, 'params': {'learning_rate': 0.1, 'max_depth': 10, 'num_leaves': 100}}


### **Random Search**

In [15]:
%%time
%%capture

study_randomsearch = optuna.create_study(
    direction = 'minimize',
    sampler = optuna.samplers.RandomSampler(),
    pruner = optuna.pruners.NopPruner() 
    )

study_randomsearch.optimize(objective, n_trials=N_TRIALS)

Wall time: 1min 30s


In [16]:
df_randomsearch = study_randomsearch.trials_dataframe()
df_randomsearch.head(10)

Unnamed: 0,number,value,datetime_start,datetime_complete,duration,params_bagging_fraction,params_bagging_freq,params_feature_fraction,params_lambda_l1,params_lambda_l2,params_learning_rate,params_max_depth,params_num_leaves,params_subsample,state
0,0,0.717399,2021-11-15 11:16:49.782566,2021-11-15 11:16:50.234084,0 days 00:00:00.451518,0.190881,0,0.18087,4.721965,8.78493,0.015872,18,73,0.3,COMPLETE
1,1,0.41708,2021-11-15 11:16:50.234084,2021-11-15 11:16:50.688185,0 days 00:00:00.454101,0.257557,6,0.664429,3.202172,0.487312,0.064964,16,51,0.9,COMPLETE
2,2,0.492198,2021-11-15 11:16:50.688185,2021-11-15 11:16:50.868185,0 days 00:00:00.180000,0.042112,8,0.69674,7.020914,7.191303,0.064858,16,27,0.9,COMPLETE
3,3,0.56937,2021-11-15 11:16:50.868185,2021-11-15 11:16:51.219184,0 days 00:00:00.350999,0.246907,8,0.227616,0.691037,1.123191,0.392646,39,193,0.2,COMPLETE
4,4,0.525928,2021-11-15 11:16:51.219184,2021-11-15 11:16:51.391700,0 days 00:00:00.172516,0.144321,7,0.282065,3.069748,6.511695,0.416985,33,35,1.0,COMPLETE
5,5,0.419918,2021-11-15 11:16:51.391700,2021-11-15 11:16:51.685699,0 days 00:00:00.293999,0.539119,6,0.852894,3.164154,6.174017,0.072096,27,17,0.6,COMPLETE
6,6,0.464877,2021-11-15 11:16:51.686699,2021-11-15 11:16:52.386701,0 days 00:00:00.700002,0.793,5,0.286701,1.604678,4.219211,0.138907,46,71,1.0,COMPLETE
7,7,0.402122,2021-11-15 11:16:52.386701,2021-11-15 11:16:53.475734,0 days 00:00:01.089033,0.978519,10,0.906599,6.871575,6.816364,0.255456,43,174,0.1,COMPLETE
8,8,0.528356,2021-11-15 11:16:53.475734,2021-11-15 11:16:53.611764,0 days 00:00:00.136030,0.327816,1,0.120688,0.80622,6.430674,0.286547,14,4,0.8,COMPLETE
9,9,0.40193,2021-11-15 11:16:53.611764,2021-11-15 11:16:54.503764,0 days 00:00:00.892000,0.764707,7,0.960958,0.022391,6.238483,0.059796,43,108,0.6,COMPLETE


In [17]:
randomsearch = {'score': study_randomsearch.best_value, 'params': study_randomsearch.best_params}
print(randomsearch)

{'score': 0.39950396725505927, 'params': {'learning_rate': 0.11956523483516167, 'max_depth': 11, 'num_leaves': 116, 'feature_fraction': 0.6387090866315891, 'subsample': 0.8, 'lambda_l1': 4.49265883499625, 'lambda_l2': 2.4123246887203975, 'bagging_fraction': 0.8372383216442213, 'bagging_freq': 10}}


### **CMAES**

In [18]:
%%time
%%capture

study_cmaes = optuna.create_study(
    direction = 'minimize',
    sampler = optuna.samplers.CmaEsSampler(),
    pruner = optuna.pruners.MedianPruner()  
    )

study_cmaes.optimize(objective, n_trials=N_TRIALS)

Wall time: 2min 54s


In [19]:
df_cmaes = study_cmaes.trials_dataframe()
df_cmaes.head(10)

Unnamed: 0,number,value,datetime_start,datetime_complete,duration,params_bagging_fraction,params_bagging_freq,params_feature_fraction,params_lambda_l1,params_lambda_l2,params_learning_rate,params_max_depth,params_num_leaves,params_subsample,system_attrs_cma:generation,system_attrs_cma:n_restarts,system_attrs_cma:optimizer:0,system_attrs_cma:optimizer:1,system_attrs_cma:optimizer:2,state
0,0,0.588295,2021-11-15 11:18:20.875763,2021-11-15 11:18:20.984764,0 days 00:00:00.109001,0.799408,0,0.193565,6.715192,5.73502,0.037047,3,52,0.2,,,,,,COMPLETE
1,1,0.405101,2021-11-15 11:18:20.984764,2021-11-15 11:18:21.705762,0 days 00:00:00.720998,0.709999,5,0.657881,5.199227,4.759155,0.061399,26,101,0.5,0.0,0.0,,,,COMPLETE
2,2,0.499723,2021-11-15 11:18:21.705762,2021-11-15 11:18:22.110763,0 days 00:00:00.405001,0.391103,5,0.229346,4.794969,4.868205,0.06161,25,101,0.7,0.0,0.0,,,,COMPLETE
3,3,0.413266,2021-11-15 11:18:22.110763,2021-11-15 11:18:22.740763,0 days 00:00:00.630000,0.383544,5,0.655453,5.074825,4.788592,0.080795,26,101,0.6,0.0,0.0,,,,COMPLETE
4,4,0.416543,2021-11-15 11:18:22.740763,2021-11-15 11:18:23.551277,0 days 00:00:00.810514,0.691364,5,0.512578,5.179937,5.186719,0.090731,25,101,0.7,0.0,0.0,,,,COMPLETE
5,5,0.406795,2021-11-15 11:18:23.551277,2021-11-15 11:18:24.234276,0 days 00:00:00.682999,0.563643,5,0.583681,5.098607,5.270263,0.081152,25,101,0.3,0.0,0.0,,,,COMPLETE
6,6,0.421212,2021-11-15 11:18:24.234276,2021-11-15 11:18:25.030344,0 days 00:00:00.796068,0.553276,5,0.440543,5.010775,5.062116,0.087398,25,101,0.5,0.0,0.0,,,,PRUNED
7,7,0.432294,2021-11-15 11:18:25.030344,2021-11-15 11:18:25.588343,0 days 00:00:00.557999,0.47683,5,0.318442,5.268757,5.223007,0.092745,25,101,0.7,0.0,0.0,,,,PRUNED
8,8,0.437774,2021-11-15 11:18:25.588343,2021-11-15 11:18:26.079343,0 days 00:00:00.491000,0.4137,5,0.354789,5.328544,4.791528,0.073717,25,101,0.6,0.0,0.0,,,,PRUNED
9,9,0.433845,2021-11-15 11:18:26.080344,2021-11-15 11:18:26.581857,0 days 00:00:00.501513,0.351278,5,0.412731,5.116559,5.192324,0.098436,26,101,0.5,0.0,0.0,,,,PRUNED


In [20]:
cmaessearch = {'score': study_cmaes.best_value, 'params': study_cmaes.best_params}
print(cmaessearch)

{'score': 0.3962360832826891, 'params': {'learning_rate': 0.10270955990683393, 'max_depth': 26, 'num_leaves': 101, 'feature_fraction': 0.5818597071965921, 'subsample': 0.2, 'lambda_l1': 5.339608201274356, 'lambda_l2': 4.897287644755721, 'bagging_fraction': 0.9979814468041663, 'bagging_freq': 5}}


### **Tree-Parzen Estimator**

In [21]:
%%time
%%capture

study_tpe = optuna.create_study(
    direction = 'minimize',
    sampler = optuna.samplers.TPESampler(),
    pruner = optuna.pruners.NopPruner()
    )

study_tpe.optimize(objective, n_trials=N_TRIALS)

Wall time: 3min 8s


In [22]:
df_tpe = study_tpe.trials_dataframe()
df_tpe.head(10)

Unnamed: 0,number,value,datetime_start,datetime_complete,duration,params_bagging_fraction,params_bagging_freq,params_feature_fraction,params_lambda_l1,params_lambda_l2,params_learning_rate,params_max_depth,params_num_leaves,params_subsample,state
0,0,0.475124,2021-11-15 11:21:15.591427,2021-11-15 11:21:16.155427,0 days 00:00:00.564000,0.167386,10,0.999179,8.871975,8.205199,0.025006,50,156,0.6,COMPLETE
1,1,0.433308,2021-11-15 11:21:16.155427,2021-11-15 11:21:16.500427,0 days 00:00:00.345000,0.331532,3,0.450407,9.518721,7.857436,0.164378,49,138,0.3,COMPLETE
2,2,0.410129,2021-11-15 11:21:16.500427,2021-11-15 11:21:16.818426,0 days 00:00:00.317999,0.671794,2,0.593638,3.182459,3.447314,0.126302,23,24,0.4,COMPLETE
3,3,0.451763,2021-11-15 11:21:16.818426,2021-11-15 11:21:17.103426,0 days 00:00:00.285000,0.332746,5,0.387977,3.659871,1.284126,0.37819,14,83,0.7,COMPLETE
4,4,0.438557,2021-11-15 11:21:17.103426,2021-11-15 11:21:17.298427,0 days 00:00:00.195001,0.158747,5,0.967691,3.336275,4.93965,0.219618,47,90,0.8,COMPLETE
5,5,0.407367,2021-11-15 11:21:17.298427,2021-11-15 11:21:18.074481,0 days 00:00:00.776054,0.954847,2,0.864859,2.743525,0.880957,0.028116,13,100,0.4,COMPLETE
6,6,0.477377,2021-11-15 11:21:18.074481,2021-11-15 11:21:18.272481,0 days 00:00:00.198000,0.030609,4,0.623249,1.279973,5.787029,0.063174,18,160,0.2,COMPLETE
7,7,0.430592,2021-11-15 11:21:18.272481,2021-11-15 11:21:18.540482,0 days 00:00:00.268001,0.709864,4,0.815745,6.066902,4.569959,0.045408,18,19,0.3,COMPLETE
8,8,0.695674,2021-11-15 11:21:18.540482,2021-11-15 11:21:18.855481,0 days 00:00:00.314999,0.512656,7,0.280988,9.096926,1.222224,0.010544,8,135,0.2,COMPLETE
9,9,0.612157,2021-11-15 11:21:18.856479,2021-11-15 11:21:18.988480,0 days 00:00:00.132001,0.521829,4,0.103533,6.302044,8.146512,0.045768,3,165,0.9,COMPLETE


In [23]:
tpesearch = {'score': study_tpe.best_value, 'params': study_tpe.best_params}
print(tpesearch)

{'score': 0.39185314126538845, 'params': {'learning_rate': 0.07543816463347605, 'max_depth': 43, 'num_leaves': 195, 'feature_fraction': 0.7302663462282426, 'subsample': 0.6, 'lambda_l1': 0.01327971653741411, 'lambda_l2': 7.457453904209449, 'bagging_fraction': 0.8532583683762022, 'bagging_freq': 4}}


### **BOHB**

Note that for BOHB, increasing the number of trials is highly beneficial to the final result. This is, in general, true for every use-case in which Hyperband Pruning is applied.  
As the Hyperband pruning algorithm applies Successive Halving on multiple sets of trials to balance the distribution of resources, a higher number of trials will work better. 
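
The building block of Hyperband, Successive Halving, can be sketched as follows. This is a simplified, synchronous version for illustration; Optuna's pruners work asynchronously and Hyperband runs several such brackets with different starting budgets. The names `successive_halving` and `toy_loss`, the halving factor `eta=2` and the loss function are all assumptions of this sketch.

```python
import random

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Repeatedly keep the best 1/eta of the configurations and multiply the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], budget

def toy_loss(learning_rate, budget):
    """Hypothetical loss: best near a learning rate of 0.1, improves with more budget."""
    return (learning_rate - 0.1) ** 2 + 1.0 / budget

rng = random.Random(0)
candidates = [rng.uniform(0.01, 0.5) for _ in range(16)]

# 16 -> 8 -> 4 -> 2 -> 1 configurations, with the budget doubling at every rung
winner, final_budget = successive_halving(candidates, toy_loss)
print(winner, final_budget)
```

Most configurations only ever receive the smallest budget, which is why the approach benefits from a large pool of trials: the more candidates enter the first rung, the better the few that reach the last one tend to be.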

In [24]:
%%time
%%capture

study_bohb = optuna.create_study(
    direction = 'minimize',
    sampler = optuna.samplers.TPESampler(),
    pruner = optuna.pruners.HyperbandPruner()
    )

study_bohb.optimize(objective, n_trials=N_TRIALS*2)

Wall time: 5min 40s


In [25]:
df_bohb = study_bohb.trials_dataframe()
df_bohb.head(10)

Unnamed: 0,number,value,datetime_start,datetime_complete,duration,params_bagging_fraction,params_bagging_freq,params_feature_fraction,params_lambda_l1,params_lambda_l2,params_learning_rate,params_max_depth,params_num_leaves,params_subsample,system_attrs_completed_rung_0,system_attrs_completed_rung_1,system_attrs_completed_rung_2,system_attrs_completed_rung_3,system_attrs_completed_rung_4,state
0,0,0.471825,2021-11-15 11:24:23.949277,2021-11-15 11:24:24.224309,0 days 00:00:00.275032,0.067367,9,0.436638,3.325675,9.425078,0.128138,22,188,0.7,,,,,,COMPLETE
1,1,0.412585,2021-11-15 11:24:24.224309,2021-11-15 11:24:24.589309,0 days 00:00:00.365000,0.595335,10,0.565873,4.154902,7.855833,0.369971,12,50,0.5,0.412585,0.412585,0.412585,0.412585,0.412585,COMPLETE
2,2,0.440096,2021-11-15 11:24:24.590309,2021-11-15 11:24:24.738312,0 days 00:00:00.148003,0.612644,0,0.974753,3.833311,0.924854,0.102677,29,7,0.5,0.440096,,,,,PRUNED
3,3,0.591054,2021-11-15 11:24:24.739311,2021-11-15 11:24:25.276403,0 days 00:00:00.537092,0.937819,2,0.290746,9.941261,9.807908,0.021426,37,117,0.7,0.591054,,,,,PRUNED
4,4,0.40832,2021-11-15 11:24:25.276403,2021-11-15 11:24:25.772403,0 days 00:00:00.496000,0.537561,6,0.865107,1.756577,3.495271,0.049842,14,50,0.8,0.40832,0.40832,0.40832,0.40832,,COMPLETE
5,5,0.573278,2021-11-15 11:24:25.772403,2021-11-15 11:24:25.858404,0 days 00:00:00.086001,0.029669,3,0.197976,1.988151,6.419528,0.367882,28,171,0.9,0.573278,,,,,PRUNED
6,6,0.524816,2021-11-15 11:24:25.858404,2021-11-15 11:24:26.182403,0 days 00:00:00.323999,0.09869,10,0.261691,0.180091,4.871828,0.248116,27,41,0.3,0.524816,,,,,PRUNED
7,7,0.475555,2021-11-15 11:24:26.182403,2021-11-15 11:24:26.529406,0 days 00:00:00.347003,0.056535,9,0.968617,2.607257,6.25918,0.03878,43,130,0.4,0.475555,,,,,PRUNED
8,8,0.762811,2021-11-15 11:24:26.529406,2021-11-15 11:24:26.604402,0 days 00:00:00.074996,0.003276,9,0.698808,8.876982,4.880611,0.192224,47,34,1.0,0.762811,,,,,PRUNED
9,9,0.505828,2021-11-15 11:24:26.604402,2021-11-15 11:24:27.018470,0 days 00:00:00.414068,0.194656,9,0.744505,9.820746,2.081551,0.014539,31,38,0.7,0.505828,0.505828,,,,COMPLETE


In [26]:
bohbsearch = {'score': study_bohb.best_value, 'params': study_bohb.best_params}
print(bohbsearch)

{'score': 0.39100964692210743, 'params': {'learning_rate': 0.08333226439010233, 'max_depth': 18, 'num_leaves': 161, 'feature_fraction': 0.6274713770566759, 'subsample': 0.1, 'lambda_l1': 1.7278750629875486, 'lambda_l2': 9.264696921721434, 'bagging_fraction': 0.4677980650258291, 'bagging_freq': 0}}


### **Summary**
The final overview of our results shows that TPE and BOHB reach the lowest RMSE (0.392 and 0.391), ahead of CMA-ES (0.396), random search (0.400) and grid search (0.401). Note that BOHB was given twice as many trials as the other strategies. 
In this example, the differences are small, which is mostly due to the very stylized example. In many other cases, the gains of hyperparameter optimization are considerable.

In [27]:
pd.DataFrame([gridsearch['score'],randomsearch['score'],cmaessearch['score'],tpesearch['score'],bohbsearch['score']],index=['Grid','Random','CMAES','TPE','BOHB'],columns=['RMSE'])

Unnamed: 0,RMSE
Grid,0.40096
Random,0.399504
CMAES,0.396236
TPE,0.391853
BOHB,0.39101
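
To put the table above in perspective, the following sketch expresses each score as a relative improvement over grid search. The scores are copied from the table; the `gains` dictionary and the choice of grid search as the baseline are only for illustration.

```python
scores = {'Grid': 0.40096, 'Random': 0.399504, 'CMAES': 0.396236,
          'TPE': 0.391853, 'BOHB': 0.39101}

baseline = scores['Grid']
# Relative RMSE reduction in percent compared with grid search
gains = {name: 100 * (baseline - rmse) / baseline for name, rmse in scores.items()}

for name, gain in gains.items():
    print(f"{name:>6}: RMSE={scores[name]:.4f}  improvement over grid={gain:.2f}%")
```

The best strategy improves on grid search by roughly 2.5 percent, which matches the observation that the differences in this stylized example are modest.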


### **Visualization**
#### History

In [28]:
trials_df = study_bohb.trials_dataframe()
trials_df

Unnamed: 0,number,value,datetime_start,datetime_complete,duration,params_bagging_fraction,params_bagging_freq,params_feature_fraction,params_lambda_l1,params_lambda_l2,params_learning_rate,params_max_depth,params_num_leaves,params_subsample,system_attrs_completed_rung_0,system_attrs_completed_rung_1,system_attrs_completed_rung_2,system_attrs_completed_rung_3,system_attrs_completed_rung_4,state
0,0,0.471825,2021-11-15 11:24:23.949277,2021-11-15 11:24:24.224309,0 days 00:00:00.275032,0.067367,9,0.436638,3.325675,9.425078,0.128138,22,188,0.7,,,,,,COMPLETE
1,1,0.412585,2021-11-15 11:24:24.224309,2021-11-15 11:24:24.589309,0 days 00:00:00.365000,0.595335,10,0.565873,4.154902,7.855833,0.369971,12,50,0.5,0.412585,0.412585,0.412585,0.412585,0.412585,COMPLETE
2,2,0.440096,2021-11-15 11:24:24.590309,2021-11-15 11:24:24.738312,0 days 00:00:00.148003,0.612644,0,0.974753,3.833311,0.924854,0.102677,29,7,0.5,0.440096,,,,,PRUNED
3,3,0.591054,2021-11-15 11:24:24.739311,2021-11-15 11:24:25.276403,0 days 00:00:00.537092,0.937819,2,0.290746,9.941261,9.807908,0.021426,37,117,0.7,0.591054,,,,,PRUNED
4,4,0.408320,2021-11-15 11:24:25.276403,2021-11-15 11:24:25.772403,0 days 00:00:00.496000,0.537561,6,0.865107,1.756577,3.495271,0.049842,14,50,0.8,0.408320,0.408320,0.408320,0.408320,,COMPLETE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,395,0.394311,2021-11-15 11:29:59.303154,2021-11-15 11:30:00.564670,0 days 00:00:01.261516,0.450633,0,0.636563,1.306077,8.550578,0.090336,16,160,0.1,0.394311,0.394311,,,,PRUNED
396,396,0.393360,2021-11-15 11:30:00.564670,2021-11-15 11:30:01.505738,0 days 00:00:00.941068,0.482474,0,0.625695,1.525871,9.088922,0.093850,18,165,0.1,0.393360,0.393360,0.393360,0.393360,0.393360,PRUNED
397,397,0.415481,2021-11-15 11:30:01.506738,2021-11-15 11:30:02.296811,0 days 00:00:00.790073,0.833763,9,0.984536,2.005128,9.880492,0.030422,50,83,0.3,0.415481,,,,,PRUNED
398,398,0.392358,2021-11-15 11:30:02.296811,2021-11-15 11:30:03.361862,0 days 00:00:01.065051,0.496017,0,0.619132,1.780544,9.286197,0.085907,18,161,0.1,0.392358,0.392358,0.392358,0.392358,0.392358,COMPLETE


We can see that the optimization history plot for BOHB shows a decreasing pattern. 
In the earlier trials, the algorithm makes big leaps forward in its performance (after about 60 trials, the model has already surpassed random search).    
After that, there is a long stretch where the model barely improves.

We can also notice that the optimization results become increasingly less variable. This is due to the pruning strategy. There are far fewer big jumps in the model performance and the model results are close together.

In [29]:
optuna.visualization.plot_optimization_history(study_bohb)

#### Hyperparameter Importance Plot
In this plot, we can see the importance of our hyperparameters. It seems that the feature fraction and the bagging fraction are the most important hyperparameters for our model. 
The max depth and the subsample are far less important to tune in our example. 

This plot might help us to more efficiently tune hyperparameters in the future (as some hyperparameters do not affect the model performance as much as others).

In [30]:
optuna.visualization.plot_param_importances(study_bohb)

#### Exploration - Exploitation Plot
In this plot, we can see which regions of each hyperparameter have been explored (on the horizontal axes). We also notice that for many important hyperparameters, the darker values (later trials) are clustered together. This suggests that our hyperparameter tuning exercise has converged reasonably well towards a near-optimal value for these hyperparameters.

In [31]:
optuna.visualization.plot_slice(study_bohb)

#### Dominant Profile Plot
The last plot shows all different hyperparameter profiles. This plot becomes increasingly less useful as the number of trials increases (and even more so when no pruning strategies are used). 
However, we can clearly see in this plot that there is a 'dominant profile' which leads to the best performance metric (the darkest profiles correspond to the better models). 

In [32]:
optuna.visualization.plot_parallel_coordinate(study_bohb)