## **HPO Lab with Optuna**

### **Documentation**
#### https://broutonlab.com/blog/efficient-hyperparameter-optimization-with-optuna-framework
#### https://optuna.readthedocs.io/en/stable/tutorial/index.html

### **Quick setup**

Setting up the basic framework is pretty simple and straightforward. It can be divided broadly into 4 steps:

- Define an objective function (Step 1)
- Define a set of hyperparameters to try (Step 2)
- Define the variable/metrics you want to optimize (Step 3)
- Finally, run the function. Here you need to mention:
  * the scoring function/variable you are trying to optimize is to be maximized or minimized
  * the number of trials you want to make. Higher the number of hyper-parameters and more the number of trials defined, the more computationally expensive it is (unless you have a beefy machine or a GPU!)
    

In [None]:
# Connect to gmail
from google.colab import drive
drive.mount('/content/gdrive')


Mounted at /content/gdrive


In [None]:
# Only what you need in colab
!pip3 install optuna pandas sklearn

In [None]:
#import librairies
import optuna
import pandas as pd
from sklearn import linear_model
from sklearn import ensemble
from sklearn import datasets
from sklearn import model_selection

In [None]:
#Load the dataset : Grabbing a sklearn Classification dataset Breast_cancer
# More details about the dataset can be found in https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
X,y = datasets.load_breast_cancer(return_X_y=True, as_frame=True)
X.columns

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')

In [42]:
#Step 1. Define an objective function to be maximized.
def objective(trial):

    classifier_name = trial.suggest_categorical("classifier", ["LogReg", "RandomForest"])
    
    # Step 2. Setup values for the hyperparameters:
    if classifier_name == 'LogReg':
        logreg_c = trial.suggest_float("logreg_c", 1e-10, 1e10, log=True)
        classifier_obj = linear_model.LogisticRegression(C=logreg_c)
    else:
        rf_n_estimators = trial.suggest_int("rf_n_estimators", 10, 1000)
        rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32, log=True)
        classifier_obj = ensemble.RandomForestClassifier(
            max_depth=rf_max_depth, n_estimators=rf_n_estimators
        )

    # Step 3: Scoring method:
    score = model_selection.cross_val_score(classifier_obj, X, y, n_jobs=-1, cv=3)
    accuracy = score.mean()
    return accuracy

# Step 4: Running it
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

[32m[I 2022-11-08 12:07:49,131][0m A new study created in memory with name: no-name-201701bf-43bb-4d4e-9b87-2de367b3ef0f[0m
[32m[I 2022-11-08 12:07:51,205][0m Trial 0 finished with value: 0.8261022927689594 and parameters: {'classifier': 'LogReg', 'logreg_c': 2.5051488827085408e-08}. Best is trial 0 with value: 0.8261022927689594.[0m
[32m[I 2022-11-08 12:07:51,320][0m Trial 1 finished with value: 0.9314768402487701 and parameters: {'classifier': 'LogReg', 'logreg_c': 0.004211366608976266}. Best is trial 1 with value: 0.9314768402487701.[0m
[32m[I 2022-11-08 12:07:55,504][0m Trial 2 finished with value: 0.9420124385036664 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 915, 'rf_max_depth': 2}. Best is trial 2 with value: 0.9420124385036664.[0m
[32m[I 2022-11-08 12:07:57,350][0m Trial 3 finished with value: 0.9613292490485472 and parameters: {'classifier': 'RandomForest', 'rf_n_estimators': 353, 'rf_max_depth': 31}. Best is trial 3 with value: 0.9613292490

In [None]:
print(f"The best trial is : \n{study.best_trial}")
print(f"The best value is : \n{study.best_value}")
print(f"The best parameters are : \n{study.best_params}")

The best trial is : 
FrozenTrial(number=82, values=[0.9648565859092174], datetime_start=datetime.datetime(2022, 11, 8, 11, 2, 7, 768092), datetime_complete=datetime.datetime(2022, 11, 8, 11, 2, 8, 322066), params={'classifier': 'RandomForest', 'rf_n_estimators': 101, 'rf_max_depth': 15}, distributions={'classifier': CategoricalDistribution(choices=('LogReg', 'RandomForest')), 'rf_n_estimators': IntDistribution(high=1000, log=False, low=10, step=1), 'rf_max_depth': IntDistribution(high=32, log=True, low=2, step=1)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=82, state=TrialState.COMPLETE, value=None)
The best value is : 
0.9648565859092174
The best parameters are : 
{'classifier': 'RandomForest', 'rf_n_estimators': 101, 'rf_max_depth': 15}


In [43]:
optuna.importance.get_param_importances(study)

OrderedDict([('classifier', 1.0)])

## **Sampling and pruning with Optuna**

Optuna combines sampling and pruning mechanisms to provide 
efficient hyperparameter optimization.

### **Sampling**

Often, such methods as Grid Search and Random Search are used to optimize hyperparameters.

![sampling]("grid_random_serach.png")


Optuna allows to build and manipulate hyperparameter search spaces dynamically. To sample configurations from search space, Optuna provides two sampling types:

- Relational sampling: these types of methods take into account information about the correlation among the parameters.
- Independent sampling.

Tree-structured Parzen Estimator (TPE) is the default sampler in Optuna. It uses the history of previously evaluated hyperparameter configurations to sample the following ones.

The list of all samplers supported by Optuna can be found in https://optuna.readthedocs.io/en/stable/reference/samplers/index.html 

### **Pruning Mechanism**
A pruning mechanism refers to the termination of unpromising trials during hyperparameter optimization. It periodically monitors each trial's learning curves. It then determines the sets of hyperparameters that will not lead to a good result and should not be taken into account.

The pruning mechanism implemented in Optuna is based on an asynchronous variant of the **Successive Halving Algorithm (SHA)**. Let’s understand the general idea behind the SHA:

- Allocate the minimum amount of resources to each available hyperparameters configuration. The resources, for example, it’s the number of epochs, the number of training examples, training duration, e.t.c.

- Evaluate the performance metrics of all configurations within the allocated resources.
- Keep the top 1/η configurations (η - a reduction factor) with the best scores and discard the rest.
- Increase the minimum amount of resources per configuration by factor η and repeat until the number of resources per configuration reaches the maximum.


![sampling]("random_search_tpe.png")

A complete example can be find in  https://github.com/optuna/optuna-examples/blob/main/simple_pruning.py 

In [39]:
def objective(trial):

    classifier_name = trial.suggest_categorical("classifier", ["LogReg", "ExtraTree", "RandomForest"])
    
    # Step 2. Setup values for the hyperparameters:
    if classifier_name == 'LogReg':
        logreg_c = trial.suggest_float("logreg_c", 1e-10, 1e10, log=True)
        classifier_obj = linear_model.LogisticRegression(C=logreg_c)
    elif classifier_name == 'ExtraTree':
        random_state = 42
        n_jobs = -1
        max_depth = trial.suggest_int("max_depth", 80, 120)
        n_estimators = trial.suggest_int("n_estimators", 80, 1200)
        min_samples_split = trial.suggest_int("min_samples_split", 2, 5)
        min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 5)
        criterion = trial.suggest_categorical('criterion', ['gini', 'entropy'])
        classifier_obj = ensemble.ExtraTreesClassifier( random_state      = random_state,
                                                  n_jobs            = n_jobs,
                                                  max_depth         = max_depth,
                                                  n_estimators      = n_estimators,
                                                  min_samples_split = min_samples_split,
                                                  min_samples_leaf  = min_samples_leaf,
                                                  criterion = criterion
                                                  ) 
    else:
        rf_n_estimators = trial.suggest_int("rf_n_estimators", 10, 1000)
        rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32, log=True)
        classifier_obj = ensemble.RandomForestClassifier(
            max_depth=rf_max_depth, n_estimators=rf_n_estimators
        )

    # Step 3: Scoring method:
    for step in range(100):

      clf.partial_fit(train_x, train_y, classes=classes)


      score = model_selection.cross_val_score(classifier_obj, X, y, n_jobs=-1, cv=3)
      accuracy = score.mean()

      # Step 4: report the result
      trial.report(accuracy, step)

      # Handle pruning based on the intermediate value.
      if trial.should_prune():
        raise optuna.TrialPruned()

    return accuracy

In [40]:
# sampler: We want to use a TPE sampler
# pruner: We use a MedianPruner in order to interrupt unpromising trials
# direction: The direction of study is “maximize” because we want to maximize the accuracy
# n_trials: Number of trials

sampler = optuna.samplers.TPESampler()    
study = optuna.create_study(
    sampler=sampler,
    pruner=optuna.pruners.MedianPruner(
        n_startup_trials=3, n_warmup_steps=5, interval_steps=3
    ),
    direction='maximize')
study.optimize(func=objective, n_trials=100)

[32m[I 2022-11-08 11:55:13,616][0m A new study created in memory with name: no-name-6678ccf8-5b51-48ac-b591-2ea4361f4cd2[0m
[33m[W 2022-11-08 11:55:17,372][0m Trial 0 failed because of the following error: TypeError("'<' not supported between instances of 'str' and 'int'")[0m
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/optuna/study/_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
  File "<ipython-input-39-71beddc401b0>", line 37, in objective
    trial.report(accuracy, classifier_name)
  File "/usr/local/lib/python3.7/dist-packages/optuna/trial/_trial.py", line 455, in report
    if step < 0:
TypeError: '<' not supported between instances of 'str' and 'int'


TypeError: ignored

In [34]:
optuna.visualization.plot_parallel_coordinate(study)

[33m[W 2022-11-08 11:48:27,425][0m Your study has only completed trials with missing parameters.[0m


The best parameters are : 
{'classifier': 'RandomForest', 'rf_n_estimators': 100, 'rf_max_depth': 25}


In [27]:
optuna.visualization.plot_param_importances(study)


In [45]:
# plot optimization history
optuna.visualization.plot_optimization_history(study)

In [46]:
# plot parallel coordinate
optuna.visualization.plot_parallel_coordinate(study)

[33m[W 2022-11-08 12:20:36,378][0m Your study has only completed trials with missing parameters.[0m
