# Hyperparameter optimization
Searching for the best hyperparameters for a model class on a development dataset.
***
Hyperparameter optimization relies on the [Optuna](https://optuna.org/) package. It is driven through the `hypopt_model` function, which takes a large number of options.

In [1]:
from cytoxnet.models.opt import hypopt_model

In [2]:
help(hypopt_model)

Help on function hypopt_model in module cytoxnet.models.opt:

hypopt_model(model_name:str, dev_set:Type[deepchem.data.datasets.NumpyDataset], search_space:dict, study_name:str, target_index:int=None, study_db:str='sqlite:///optimization.db', transformations:list=[], metric:str='r2_score', cv:int=5, trials_per_cpu:int=10, model_kwargs:dict={}, fit_kwargs:dict={}, eval_kwargs:dict={})
    Optimize a specified ToxModel by name over hyperperameter space.
    
    For a ToxModel and a development dataset, search for the best
    hyperparameter set over a specified search window. Optimizing for
    a specified metric. Uses cross validation. Can be run multiple times,
    on multiple cpus by simply executing the function again on each worker.
    `mpirun` is a quick solution to scatter to many workers.
    
    Parameters
    ----------
    model_name : str
        Name of ToxModel type under investigation
    dev_set : deepchem.data.NumpyDataset
        Dataset used for searching for the bes

***
### Minimally prepare a dataset to use for demonstration
See the dataprep example notebook for the functionality and options available when preparing data.

### <span style='color:red'>NEED TO UPDATE WITH DATABASE CALL</span>

In [3]:
import cytoxnet.dataprep.io
import cytoxnet.dataprep.dataprep
import cytoxnet.dataprep.featurize
import pandas as pd

In [4]:
df = cytoxnet.dataprep.io.load_data('lunghini_algea_EC50')

In [5]:
df = cytoxnet.dataprep.featurize.add_features(df, method='RDKitDescriptors')

In [6]:
data = cytoxnet.dataprep.dataprep.convert_to_dataset(
    df,
    X_col='RDKitDescriptors',
    y_col='algea_EC50'
)

In [7]:
data, transformers = cytoxnet.dataprep.dataprep.data_transformation(
    data, ['MinMaxTransformer'], to_transform='y'
)

***
### Create a study
We must first create an Optuna study to store our search in, and save it to disk. We specify `direction` as `maximize` because we will optimize the R2 score.
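To illustrate why `maximize` is the appropriate direction, here is a minimal sketch of the R2 score, assuming the `r2_score` metric follows the usual definition $1 - SS_{res}/SS_{tot}$. The function `r2` and the toy values are illustrative only and are not part of cytoxnet.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared deviation of the targets from their own mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2(y_true, y_true)                 # exact predictions
baseline = r2(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
worse = r2(y_true, [4.0, 3.0, 2.0, 1.0])     # anti-correlated predictions
print(perfect, baseline, worse)
```

Under this definition a perfect model scores 1, a model that always predicts the mean scores 0, and worse models go negative, so larger values are better.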

In [8]:
from optuna import create_study

In [9]:
mystudy = create_study(storage="sqlite:///optimization.db", study_name='opt', direction='maximize')

[I 2021-06-04 18:33:06,933] A new study created in RDB with name: opt


***
### Defining space to search over
For the model we are searching over, we must define the search space of initialization hyperparameters. This is a dictionary, and the form of each value determines how that hyperparameter is sampled. See the `search_space` parameter docs for full details on the options for defining the sample space. Here we search over `n_estimators`, choosing uniformly from 5 to 50 in steps of 5; `min_weight_fraction_leaf`, from 0.1 to 0.2 on a logarithmic scale; and `criterion`, choosing between the two available options.

In [10]:
search_space = {
    'n_estimators': (5, 50, 5), # uniform integer sampling from 5 to 50 in steps of 5
    'min_weight_fraction_leaf': (0.1, 0.2, 'loguniform'), # log scale from 0.1 to 0.2
    'criterion': ['mse', 'mae'] # a choice
}
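As a rough sketch of how a dictionary like this could be turned into Optuna sampling calls: `suggest_int`, `suggest_float`, and `suggest_categorical` are Optuna `Trial` methods, but the helper `suggest_from_space` below is hypothetical and is not cytoxnet's actual implementation. It handles only the three value forms used above.

```python
def suggest_from_space(trial, search_space):
    """Draw one hyperparameter set from a search_space-style dictionary.

    Hypothetical helper: `trial` is expected to behave like an
    optuna.trial.Trial. Only the three value forms shown in this
    example are recognized.
    """
    params = {}
    for name, spec in search_space.items():
        if isinstance(spec, list):
            # A list is treated as a set of discrete choices.
            params[name] = trial.suggest_categorical(name, spec)
        elif (isinstance(spec, tuple) and len(spec) == 3
              and spec[2] == 'loguniform'):
            # (low, high, 'loguniform'): float sampled on a log scale.
            low, high, _ = spec
            params[name] = trial.suggest_float(name, low, high, log=True)
        elif (isinstance(spec, tuple) and len(spec) == 3
              and all(isinstance(v, int) for v in spec)):
            # (low, high, step): integer sampled on a regular grid.
            low, high, step = spec
            params[name] = trial.suggest_int(name, low, high, step=step)
        else:
            raise ValueError(
                f"Unrecognized search space entry for {name!r}: {spec!r}"
            )
    return params
```

Inside an Optuna objective function, a call such as `suggest_from_space(trial, search_space)` would return one concrete parameter dictionary per trial.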

***
### Running the optimization
Now we simply have to execute the function with the options we want. This will run 10 trials per CPU. In this case that means 10 trials in total, since we execute it only once, on a single CPU.

In [11]:
hypopt_model(
    model_name = 'RFR',
    dev_set = data,
    search_space = search_space,
    study_name = 'opt',
    study_db = "sqlite:///optimization.db",
    transformations = transformers,
    metric = 'r2_score',
    trials_per_cpu=10
)



[I 2021-06-04 18:33:31,348] Trial 0 finished with value: 0.3622328067498319 and parameters: {'criterion': 'mae', 'min_weight_fraction_leaf': 0.10126466612092125, 'n_estimators': 15}. Best is trial 0 with value: 0.3622328067498319.
[I 2021-06-04 18:33:32,925] Trial 1 finished with value: 0.3542675939566397 and parameters: {'criterion': 'mse', 'min_weight_fraction_leaf': 0.13953434994985955, 'n_estimators': 40}. Best is trial 0 with value: 0.3622328067498319.
[I 2021-06-04 18:34:14,344] Trial 2 finished with value: 0.356363415738973 and parameters: {'criterion': 'mae', 'min_weight_fraction_leaf': 0.1059605982762334, 'n_estimators': 35}. Best is trial 0 with value: 0.3622328067498319.
[I 2021-06-04 18:34:33,523] Trial 3 finished with value: 0.31456192323792564 and parameters: {'criterion': 'mae', 'min_weight_fraction_leaf': 0.19635900963030725, 'n_estimators': 15}. Best is trial 0 with value: 0.3622328067498319.
[I 2021-06-04 18:34:33,945] Trial 4 finished with value: 0.33827500767525265 and parameters: {'criterion': 'mse', 'min_weight_fraction_leaf': 0.15837090338772292, 'n_estimators': 10}. Best is trial 0 with value: 0.3622328067498319.
[I 2021-06-04 18:35:06,473] Trial 5 finished with value: 0.31257000712928534 and parameters: {'criterion': 'mae', 'min_weight_fraction_leaf': 0.19897961222850935, 'n_estimators': 35}. Best is trial 0 with value: 0.3622328067498319.
[I 2021-06-04 18:35:25,901] Trial 6 finished with value: 0.3164445076622703 and parameters: {'criterion': 'mae', 'min_weight_fraction_leaf': 0.19062598001732028, 'n_estimators': 15}. Best is trial 0 with value: 0.3622328067498319.
[I 2021-06-04 18:35:41,920] Trial 7 finished with value: 0.3220757858280128 and parameters: {'criterion': 'mae', 'min_weight_fraction_leaf': 0.1674855490592344, 'n_estimators': 15}. Best is trial 0 with value: 0.3622328067498319.
[I 2021-06-04 18:35:42,210] Trial 8 finished with value: 0.3616331423373108 and parameters: {'criterion': 'mse', 'min_weight_fraction_leaf': 0.12156915517845218, 'n_estimators': 5}. Best is trial 0 with value: 0.3622328067498319.
[I 2021-06-04 18:35:44,072] Trial 9 finished with value: 0.3835943224691388 and parameters: {'criterion': 'mse', 'min_weight_fraction_leaf': 0.10221979337192384, 'n_estimators': 40}. Best is trial 9 with value: 0.3835943224691388.


***
### Retrieving results
We can access the results from the study. If you want to retrieve these results later and do not have the `study` object in memory, use the `optuna.load_study` function.

We can see the results for all trials as a dataframe.

In [13]:
mystudy.trials_dataframe()

| | number | value | datetime_start | datetime_complete | duration | params_criterion | params_min_weight_fraction_leaf | params_n_estimators | state |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.362233 | 2021-06-04 18:33:07.196852 | 2021-06-04 18:33:31.334337 | 0 days 00:00:24.137485 | mae | 0.101265 | 15 | COMPLETE |
| 1 | 1 | 0.354268 | 2021-06-04 18:33:31.353982 | 2021-06-04 18:33:32.911684 | 0 days 00:00:01.557702 | mse | 0.139534 | 40 | COMPLETE |
| 2 | 2 | 0.356363 | 2021-06-04 18:33:32.930842 | 2021-06-04 18:34:14.329328 | 0 days 00:00:41.398486 | mae | 0.105961 | 35 | COMPLETE |
| 3 | 3 | 0.314562 | 2021-06-04 18:34:14.350075 | 2021-06-04 18:34:33.510318 | 0 days 00:00:19.160243 | mae | 0.196359 | 15 | COMPLETE |
| 4 | 4 | 0.338275 | 2021-06-04 18:34:33.530223 | 2021-06-04 18:34:33.932495 | 0 days 00:00:00.402272 | mse | 0.158371 | 10 | COMPLETE |
| 5 | 5 | 0.31257 | 2021-06-04 18:34:33.952254 | 2021-06-04 18:35:06.460572 | 0 days 00:00:32.508318 | mae | 0.19898 | 35 | COMPLETE |
| 6 | 6 | 0.316445 | 2021-06-04 18:35:06.480159 | 2021-06-04 18:35:25.887704 | 0 days 00:00:19.407545 | mae | 0.190626 | 15 | COMPLETE |
| 7 | 7 | 0.322076 | 2021-06-04 18:35:25.907520 | 2021-06-04 18:35:41.907756 | 0 days 00:00:16.000236 | mae | 0.167486 | 15 | COMPLETE |
| 8 | 8 | 0.361633 | 2021-06-04 18:35:41.927529 | 2021-06-04 18:35:42.196855 | 0 days 00:00:00.269326 | mse | 0.121569 | 5 | COMPLETE |
| 9 | 9 | 0.383594 | 2021-06-04 18:35:42.217148 | 2021-06-04 18:35:44.059149 | 0 days 00:00:01.842001 | mse | 0.10222 | 40 | COMPLETE |


We can also retrieve the best parameter set found during the search, along with its score.

In [14]:
mystudy.best_params

{'criterion': 'mse',
 'min_weight_fraction_leaf': 0.10221979337192384,
 'n_estimators': 40}

In [15]:
mystudy.best_value

0.3835943224691388