# Notebook III: Classifying Morse Features

This notebook shows how to classify vanilla and chemically-enhanced Morse feature vectors using a tuned LightGBM classifier.

Please note that you must have generated the features before classifying them.

(The previous notebook in the series shows how to generate batches of Morse features.)

# Packages
Import the Python packages necessary to run the notebook.

In [1]:
import pandas as pd

Print the working directory. This is useful to know when checking the relative file paths later on.

In [None]:
%pwd

# Classifying vanilla Morse feature vectors

We shall run the script `gb_tuned_sum_all_feature_classification.py` to generate the tuned LightGBM classification results for vanilla Morse feature vectors. Because the hyperparameter tuning is stochastic, different runs may lead to slightly different results. You should specify the following command-line arguments:

Optional arguments:

    -h, --help

    --targets
  
    --dataset
  
    --results_prefix

where 

* `targets` is the protein target (e.g. cxcr4); to specify multiple targets, separate the target names with spaces (e.g. cxcr4 ampc);

* `results_prefix` specifies the filename prefix of the results files generated by the classification script; and 

* `dataset` names the feature dataset together with the virtual screening dataset it was generated from, e.g. dude_aligned for vanilla Morse feature vectors from the DUD-E dataset, muv_kqmolsa for KQMolSA features generated from the MUV dataset, and dude_baseline_q9 for the baseline features generated from the DUD-E dataset. The precise names of these feature datasets depend on the names of the directories the features are stored in.
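As a rough illustration of how these flags fit together, the sketch below builds a parser with the same argument names. It is not the script's actual parser: the function name `build_parser`, the default values and the use of `nargs="+"` are assumptions made for this example.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the documented flags; the real script may differ.
    parser = argparse.ArgumentParser(
        description="Sketch of the classification script's command-line interface")
    # nargs="+" accepts one or more space-separated target names
    parser.add_argument("--targets", nargs="+", default=["cxcr4"])
    # The defaults below are placeholders chosen for this example only
    parser.add_argument("--dataset", default="dude_aligned")
    parser.add_argument("--results_prefix", default="lgbm_results")
    return parser


# Parse an explicit argument list instead of sys.argv so the cell runs anywhere
args = build_parser().parse_args(
    ["--targets", "cxcr4", "ampc",
     "--dataset", "dude_aligned",
     "--results_prefix", "lgbm_dude_diverse_vanilla_morse"])
print(args.targets, args.dataset, args.results_prefix)
```

Passing two names after `--targets` yields a list of two targets, which matches the space-separated convention described above.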

The output of this script will be a series of results files specifying the generalisation errors of the tuned LightGBM for several metrics for each protein target.

By default the script will classify Morse features with a depth of 20 and 32 pentakis dodecahedron directions. The depth `top_values` and number of directions `num_directions` can be altered by changing the following lines _in the script_:
```
top_values = [20]
num_directions = 32
```
Note that, due to how the results files are generated, the depth can only take values in $\{1, 3, 5, 7, 10, 13, 15, 20\}$ and the number of directions must be an integer between 1 and 32. This assumes that the Morse features were generated with the default parameters. If you generated Morse features only up to depth 10 and with 5 directions, for example, the permissible ranges for classification would shrink accordingly.
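Before editing the script, it can help to check the chosen values against these constraints. The helper below is a minimal sketch, assuming the default feature-generation parameters; `check_morse_settings`, `ALLOWED_DEPTHS` and `MAX_DIRECTIONS` are names invented for this example and are not part of the script.

```python
# Permissible values under the default feature-generation parameters
ALLOWED_DEPTHS = {1, 3, 5, 7, 10, 13, 15, 20}
MAX_DIRECTIONS = 32


def check_morse_settings(top_values, num_directions):
    """Raise ValueError if the settings fall outside the permitted ranges."""
    bad_depths = [d for d in top_values if d not in ALLOWED_DEPTHS]
    if bad_depths:
        raise ValueError(f"Unsupported depth(s): {bad_depths}")
    if not isinstance(num_directions, int) or not 1 <= num_directions <= MAX_DIRECTIONS:
        raise ValueError(f"num_directions must be an integer in [1, {MAX_DIRECTIONS}]")
    return True


# The script's default settings pass the check
check_morse_settings(top_values=[20], num_directions=32)
```

If the features were generated with non-default parameters, the two constants would need to be adjusted to match.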

In [3]:
%run ../src/gb_tuned_sum_all_feature_classification.py --targets cxcr4 --results_prefix lgbm_dude_diverse_vanilla_morse --dataset dude_aligned

Best trial config: {'num_leaves': 61, 'min_data_in_leaf': 381, 'max_depth': 99, 'lambda_l1': 8.223083836905306e-06, 'lambda_l2': 8.594689739874726e-07, 'bagging_fraction': 0.9415691853284407, 'min_sum_hessian_in_leaf': 0.9649985837003707, 'feature_fraction': 0.7949992592908648}
Best trial final validation loss: 0.047064037129734226
Best trial final validation ROCAUC: 0.8920401922236785
Best trial final validation BEDROC-5: 0.7217168189451751
Best trial final validation 1% Enrichment Factor: 25.819841269841266
pos_weight = 85.15625 , actives = 32 , inactives =  2725
{'device_type': 'cpu', 'num_threads': 1, 'objective': 'binary', 'scale_pos_weight': 1.0, 'num_leaves': 31, 'min_data_in_leaf': 20, 'bagging_freq': 1, 'bagging_fraction': 1.0, 'min_sum_hessian_in_leaf': 0.001, 'max_depth': 100, 'num_iterations': 100, 'learning_rate': 0.1, 'min_gain_to_split': 0, 'feature_fraction': 1.0, 'verbosity': -1}
{'device_type': 'cpu', 'num_threads': 1, 'objective': 'binary', 'scale_pos_weight': 1.0, '
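
The `pos_weight` value reported in the log above is consistent with the ratio of inactives to actives. The short check below reproduces that arithmetic from the logged counts. Whether and how the script applies this weight is not shown in the output (the printed parameter dictionaries list `scale_pos_weight` as 1.0), so treat this only as a reading of the log line.

```python
# Counts taken from the log line above
actives = 32
inactives = 2725

# Ratio of negative to positive examples
pos_weight = inactives / actives
print(pos_weight)
```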

## Reading results files

In [4]:
# Choose the specific metric you desire
metrics =  ['roc', 'brier', 'logl',
            'bedroc_alpha5', 'bedroc_alpha10', 'bedroc_alpha20',
            'ef_1', 'abs_ef_1']
metric = metrics[0]

# Select the appropriate classification run
# (should match the classification script argument)
dataset = 'dude_aligned' 
results_prefix = 'lgbm_dude_diverse_vanilla_morse' 

# Read the corresponding csv file
results_df = pd.read_csv('../data/results/' +
                         dataset +  '/' +
                         '%s_%s_results.csv' %(results_prefix, metric),
                         sep=',',
                         index_col=0)

display(results_df)

Unnamed: 0,target,top 20,top 20 STD,top 20 low,top 20 up,top 20 min,top 20 max,top 15,top 15 STD,top 15 low,...,top 3 low,top 3 up,top 3 min,top 3 max,top 1,top 1 STD,top 1 low,top 1 up,top 1 min,top 1 max
0,cxcr4,0.864985,0.06585,0.786027,0.905514,0.742474,0.920888,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
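
Most columns in the table above are zero, presumably because only depth 20 was classified (`top_values = [20]`). A minimal sketch for hiding the unpopulated columns follows. It uses a small stand-in frame, `demo_df`, built from a few of the values shown above so that it runs on its own; applying the same steps to `results_df` is an assumption about that file's layout.

```python
import pandas as pd

# Stand-in for results_df with a handful of columns from the output above
demo_df = pd.DataFrame({
    "target": ["cxcr4"],
    "top 20": [0.864985],
    "top 20 STD": [0.06585],
    "top 15": [0.0],
    "top 15 STD": [0.0],
})

# Separate the numeric columns from the target label
numeric = demo_df.drop(columns="target")

# Keep only the columns that contain at least one non-zero entry
populated = numeric.loc[:, (numeric != 0).any(axis=0)]

# Re-attach the target label for display
summary = pd.concat([demo_df[["target"]], populated], axis=1)
print(summary)
```

Note that this filter would also hide a metric that is genuinely zero for every target, so it is best used for display only.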


# Classifying chemically-enhanced Morse feature vectors

We shall run the script `gb_tuned_sum_all_feature_hybrid_classification.py` to generate the tuned LightGBM classification results for chemically-enhanced Morse feature vectors. Because the hyperparameter tuning is stochastic, different runs may lead to slightly different results. You should specify the following command-line arguments:

Optional arguments:

    -h, --help

    --targets
  
    --dataset

    --baseline_dataset
  
    --results_prefix

where 

* `targets` is the protein target (e.g. cxcr4); to specify multiple targets, separate the target names with spaces (e.g. cxcr4 ampc);

* `results_prefix` specifies the filename prefix of the results files generated by the classification script; 

* `dataset` names the feature dataset together with the virtual screening dataset it was generated from, e.g. dude_aligned for vanilla Morse feature vectors from the DUD-E dataset or muv_kqmolsa for KQMolSA features generated from the MUV dataset. The precise names of these feature datasets depend on the names of the directories the features are stored in; and

* `baseline_dataset` names the chemical properties dataset together with the virtual screening dataset it was generated from. There are only two options: dude_baseline_q9 and muv_baseline_q9. The chemical properties are extracted from this dataset to supplement the Morse feature vectors with chemical information. If you select a DUD-E feature dataset, you should also select the DUD-E baseline dataset (and likewise for MUV).
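
A simple guard can catch a mismatch between the two dataset arguments before a long tuning run starts. The sketch below assumes that both names begin with the screening dataset followed by an underscore (e.g. `dude_` or `muv_`), as in the examples above. The function `same_screening_source` is hypothetical, and the check will not work if your directories follow a different naming scheme.

```python
def same_screening_source(dataset, baseline_dataset):
    """Return True if both names share the same leading prefix before '_'."""
    # Assumes names of the form "<screening dataset>_<feature type>"
    return dataset.split("_", 1)[0] == baseline_dataset.split("_", 1)[0]


# A DUD-E feature dataset paired with the DUD-E baseline dataset
print(same_screening_source("dude_aligned", "dude_baseline_q9"))
```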

The output of this script will be a series of results files specifying the generalisation errors of the tuned LightGBM for several metrics for each protein target.

By default the script will classify Morse features with a depth of 20 and 32 pentakis dodecahedron directions. The depth `top_values` and number of directions `num_directions` can be altered by changing the following lines _in the script_:
```
top_values = [20]
num_directions = 32
```
Note that, due to how the results files are generated, the depth can only take values in $\{1, 3, 5, 7, 10, 13, 15, 20\}$ and the number of directions must be an integer between 1 and 32. This assumes that the Morse features were generated with the default parameters. If you generated Morse features only up to depth 10 and with 5 directions, for example, the permissible ranges for classification would shrink accordingly.

In [2]:
%run ../src/gb_tuned_sum_all_feature_hybrid_classification.py --targets cxcr4 --results_prefix lgbm_dude_diverse_chem_morse --dataset dude_aligned --baseline_dataset dude_baseline_q9

Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.


Best trial config: {'num_leaves': 17, 'min_data_in_leaf': 133, 'max_depth': 17, 'lambda_l1': 0.0002428945418263831, 'lambda_l2': 3.3002503949106756e-05, 'bagging_fraction': 0.9507748866368981, 'min_sum_hessian_in_leaf': 7.528541359893696e-07, 'feature_fraction': 0.33248784138019694}
Best trial final validation loss: 0.008143016861063493
Best trial final validation ROCAUC: 0.9994058540847531
Best trial final validation BEDROC-5: 0.9971867311380682
Best trial final validation 1% Enrichment Factor: 83.58174603174602
{'device_type': 'cpu', 'num_threads': 1, 'objective': 'binary', 'num_leaves': 31, 'min_data_in_leaf': 20, 'bagging_freq': 1, 'bagging_fraction': 1.0, 'min_sum_hessian_in_leaf': 0.001, 'max_depth': 100, 'num_iterations': 100, 'learning_rate': 0.1, 'min_gain_to_split': 0, 'feature_fraction': 1.0, 'max_bin': 255, 'verbosity': -1}
{'device_type': 'cpu', 'num_threads': 1, 'objective': 'binary', 'num_leaves': 17, 'min_data_in_leaf': 133, 'bagging_freq': 1, 'bagging_fraction': 0.9507

## Reading results files

In [3]:
# Choose the specific metric you desire
metrics =  ['roc', 'brier', 'logl',
            'bedroc_alpha5', 'bedroc_alpha10', 'bedroc_alpha20',
            'ef_1', 'abs_ef_1']
metric = metrics[0]

# Select the appropriate classification run
# (should match the classification script argument)
dataset = 'dude_aligned' 
results_prefix = 'lgbm_dude_diverse_chem_morse'

# Read the corresponding csv file
results_df = pd.read_csv('../data/results/' +
                         dataset +  '/' +
                         '%s_%s_results.csv' %(results_prefix, metric),
                         sep=',',
                         index_col=0)

display(results_df)

Unnamed: 0,target,top 20,top 20 STD,top 20 low,top 20 up,top 20 min,top 20 max,top 15,top 15 STD,top 15 low,...,top 3 low,top 3 up,top 3 min,top 3 max,top 1,top 1 STD,top 1 low,top 1 up,top 1 min,top 1 max
0,cxcr4,0.997063,0.004693,0.992254,0.999523,0.987702,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
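
To put the two runs side by side, the depth-20 ROC values for cxcr4 from the two results tables in this notebook can be collected into one frame. The sketch below hard-codes those values so that it runs independently of the results files. Reading `top 20` as the mean score and `top 20 STD` as its standard deviation is an assumption based on the column names.

```python
import pandas as pd

# Depth-20 ROC values for cxcr4, copied from the two results tables
comparison = pd.DataFrame(
    {"top 20": [0.864985, 0.997063],
     "top 20 STD": [0.06585, 0.004693]},
    index=["vanilla", "chem-enhanced"],
)
comparison.index.name = "features"

# Difference between the chemically-enhanced and vanilla scores
gain = comparison.loc["chem-enhanced", "top 20"] - comparison.loc["vanilla", "top 20"]

print(comparison)
print(f"Difference in 'top 20': {gain:.6f}")
```

Since different tuning runs may give slightly different results, a single difference like this should be read alongside the reported spread.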
