# Notebook II: Batch Morse Feature Generation

This notebook shows how to generate Morse feature vectors for large batches of molecules. 

(See the previous notebook in the series for a walkthrough of how to generate the Morse feature vector for a single molecule.)

# Packages
Import the Python packages necessary to run the notebook.

In [1]:
import pandas as pd

Print the working directory. This is useful to know when checking the relative file paths later on.

In [None]:
%pwd
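
The commands later in this notebook use relative paths such as `../src/` and `../data/`, so it can help to confirm that they resolve from the working directory before launching a long computation. The following is a minimal sketch using only the standard library. The helper name `check_relative_paths` is hypothetical, and the directory list is simply read off the commands used in this notebook.

```python
from pathlib import Path


def check_relative_paths(paths, base=None):
    """Return {path: exists?} for each relative path, resolved against base.

    base defaults to the current working directory (what %pwd prints).
    """
    base = Path.cwd() if base is None else Path(base)
    return {p: (base / p).exists() for p in paths}


# Relative paths that appear in the commands used in this notebook.
expected = ["../src", "../data", "../data/sdf/dude"]
for path, found in check_relative_paths(expected).items():
    print(f"{path}: {'found' if found else 'MISSING'}")
```

Any path reported as missing suggests the notebook is being run from a different directory than the one the relative paths assume.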

# Batch Feature Generation

Here, we compute the padded Morse transform outputs and chemical percentiles for large batches of molecular SDF files. We do not perform the final step of the **MORSE** procedure (**E**ncapsulate in a vector), which concatenates the Morse and chemical information into a single vector. That step is performed later, during the classification of Morse feature vectors, as demonstrated in the next notebook in the series.
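
To make the deferred concatenation step concrete, here is a minimal sketch on toy data. It is not the code used in the next notebook. It assumes that both tables hold one row per molecule in the same order, and the names `morse_toy`, `chemical_toy` and `combined` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the two feature tables (one row per molecule).
# The real tables are much wider; the values here are made up.
morse_toy = pd.DataFrame([[5.7, 5.3, 0.0], [9.5, 9.0, 8.9]])
chemical_toy = pd.DataFrame([[6.6, 2.9], [10.6, 3.8]])

# Concatenate column-wise so each molecule gets a single feature vector.
# ignore_index=True renumbers the columns 0..n-1 so that the integer column
# labels of the two tables do not clash.
combined = pd.concat([morse_toy, chemical_toy], axis=1, ignore_index=True)

print(combined.shape)  # (2, 5)
```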

## Morse transform computation

We shall run the script `feature_aligned_from_sdf_generation.py` to generate the Morse transform outputs. It accepts the following command-line arguments:

Positional arguments:

    target

Optional arguments:

    -h, --help

    --top 
  
    --file_path_root 
  
    --file_path_sdf 
  
    --file_path_save_feature 
  
    --dataset 

where `target` is the protein target (e.g. cxcr4) for a particular `dataset` (dude or muv) and `top` is the depth of the Morse transform.
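
For readers unfamiliar with this style of interface, the argument list above could be declared with `argparse` roughly as follows. This is a hypothetical sketch, not the script's actual parser: the argument types, the help strings and the absence of defaults are all assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the arguments listed above."""
    parser = argparse.ArgumentParser(description="Batch Morse transform sketch")
    parser.add_argument("target", help="protein target, e.g. cxcr4")
    # The int type and the missing defaults are assumptions, not the script's settings.
    parser.add_argument("--top", type=int, help="depth of the Morse transform")
    parser.add_argument("--file_path_root")
    parser.add_argument("--file_path_sdf")
    parser.add_argument("--file_path_save_feature")
    parser.add_argument("--dataset")
    return parser  # -h/--help is added automatically by argparse


# Parse the same argument values that this notebook passes to the script.
args = build_parser().parse_args([
    "cxcr4",
    "--file_path_sdf", "../data/sdf/dude/",
    "--file_path_save_feature", "../data/features/dude_aligned/",
    "--file_path_root", "../data/",
    "--dataset", "dude_aligned",
    "--top", "20",
])
print(args.target, args.dataset, args.top)
```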

By default, the script calculates the Morse transform for 32 pentakis dodecahedron directions, but this can be changed by editing the `# Directions to analyse the complexes` section _inside the script_.

This script uses Python multiprocessing to improve performance, so it is beneficial to run it on a machine with multiple CPU cores.
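
To see how many cores Python can detect on the current machine before running the script, a quick check like the one below may be useful. Whether the script determines its `cores` value in the same way is an assumption.

```python
import os

# Number of logical CPU cores visible to Python; may be None on some platforms.
cores = os.cpu_count()
print(f"cores = {cores}")
```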

In [3]:
# Warnings of the form '!!! Warning !!! Distance between atoms X and Y (Z A) is suspicious.' may be ignored.
%run ../src/feature_aligned_from_sdf_generation.py cxcr4 --file_path_sdf ../data/sdf/dude/ --file_path_save_feature ../data/features/dude_aligned/ --file_path_root ../data/ --dataset dude_aligned --top 20


cxcr4
cores = 48
Number of molecules: 40
number of active processes: 48
number of active processes: 0
cores = 48
Number of molecules: 3406
number of active processes: 48
number of active processes: 0
target: cxcr4 , actives: 40 , inactives: 3406 , augmented ratio: 1 , top: 20


Feature vector calculation: 100%|██████████| 3446/3446 [08:54<00:00,  6.45it/s]


actives: 40 , inactives: 3406
number of active processes: 48
number of active processes: 0 



If the script ran successfully, then you should see something similar to the following output (except the number of cores and computation time may differ):
```
cxcr4
cores = 48
Number of molecules: 40
number of active processes: 48
number of active processes: 0
cores = 48
Number of molecules: 3406
number of active processes: 48
number of active processes: 0
target: cxcr4 , actives: 40 , inactives: 3406 , augmented ratio: 1 , top: 20
Feature vector calculation: 100%|██████████| 3446/3446 [08:46<00:00,  6.55it/s]
actives: 40 , inactives: 3406
number of active processes: 48
number of active processes: 0
```

### Read the generated Morse transform outputs

In [4]:
# Select the desired dataset and protein target
dataset = 'dude_aligned'
target = 'cxcr4'

# Read the parquet file
morse_transform_df = pd.read_parquet('../data/features/' 
                                     + dataset + '/'
                                     + target
                                     + '_rand_pentakis_top_20_features.parquet')

# Convert parquet columns from strings to int to allow for easier indexing later
morse_transform_df.columns = morse_transform_df.columns.astype(int)

# Display the top few rows
morse_transform_df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 8950 | 8951 | 8952 | 8953 | 8954 | 8955 | 8956 | 8957 | 8958 | 8959 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.767693 | 5.322646 | 5.055909 | 4.579113 | 4.389258 | -2.646848 | -3.085669 | 0.0 | 0.0 | 0.0 | ... | 6.0 | 1.0 | 6.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 9.545556 | 9.028353 | 9.002613 | 8.94537 | 8.481927 | 8.466186 | 8.227662 | 8.227405 | 8.200857 | 6.019904 | ... | 6.0 | 6.0 | 7.0 | 1.0 | 7.0 | 6.0 | 6.0 | 7.0 | 6.0 | 6.0 |
| 2 | 9.697198 | 7.487512 | 7.432236 | 7.218467 | 6.95578 | 5.241619 | 4.706921 | 3.830774 | 3.813104 | 3.293295 | ... | 1.0 | 1.0 | 7.0 | 6.0 | 6.0 | 1.0 | 7.0 | 1.0 | 6.0 | 6.0 |
| 3 | 5.215274 | 5.208668 | 4.483721 | 3.157618 | 2.725861 | 2.165144 | 2.069918 | 1.947704 | 1.150996 | 1.074617 | ... | 6.0 | 1.0 | 6.0 | 6.0 | 1.0 | 6.0 | 1.0 | 6.0 | 6.0 | 6.0 |
| 4 | 6.082441 | 5.878793 | 5.332931 | 3.127058 | 2.804953 | 2.300829 | 2.035817 | 1.88964 | 1.23823 | 0.958774 | ... | 1.0 | 6.0 | 1.0 | 6.0 | 7.0 | 1.0 | 6.0 | 1.0 | 6.0 | 6.0 |
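
The file path above is assembled by string concatenation. As an alternative, the same naming pattern can be wrapped in a small `pathlib` helper, which also makes it easy to check that the file exists before reading it. This is a sketch only: the function name `feature_file` and its `root` default are hypothetical, chosen to match the path used above.

```python
from pathlib import Path


def feature_file(dataset, target, root="../data/features"):
    """Build the path of a feature file from the naming pattern used above."""
    return Path(root) / dataset / f"{target}_rand_pentakis_top_20_features.parquet"


path = feature_file("dude_aligned", "cxcr4")
print(path.as_posix())  # ../data/features/dude_aligned/cxcr4_rand_pentakis_top_20_features.parquet
print(path.exists())    # whether the file is present on disk
```

Since `pd.read_parquet` accepts path objects, the result can be passed to it directly, e.g. `pd.read_parquet(feature_file(dataset, target))`.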


## Chemical information computation

We shall run the script `baseline_quantile_feature_from_sdf_generation.py` to generate the chemical information. It accepts the following command-line arguments:

Positional arguments:

    target

Optional arguments:

    -h, --help 
  
    --file_path_root 
  
    --file_path_sdf 
  
    --file_path_save_feature 
  
    --dataset 

where `target` is the protein target (e.g. cxcr4) for a particular `dataset` (dude_baseline_q9 or muv_baseline_q9). Note that the script also accepts a vestigial optional argument `top`, because it is based upon the Morse feature vector generation script.

In [5]:
# Warnings of the form '!!! Warning !!! Distance between atoms X and Y (Z A) is suspicious.' may be ignored.
%run ../src/baseline_quantile_feature_from_sdf_generation.py cxcr4 --file_path_sdf ../data/sdf/dude/ --file_path_save_feature ../data/features/dude_baseline_q9/ --file_path_root ../data/ --dataset dude_baseline_q9

cxcr4
cores = 48
Number of molecules: 40
number of active processes: 48
number of active processes: 0
cores = 48
Number of molecules: 3406
number of active processes: 48
number of active processes: 0
target: cxcr4 , actives: 40 , inactives: 3406 , augmented ratio: 1 , top: 20
actives: 40 , inactives: 3406
number of active processes: 0
number of active processes: 0 



### Read the generated chemical features

In [6]:
# Select the desired dataset and protein target
dataset = 'dude_baseline_q9'
target = 'cxcr4'

# Read the parquet file
chemical_feature_df = pd.read_parquet('../data/features/' 
                                      + dataset + '/'
                                      + target
                                      + '_rand_pentakis_top_20_features.parquet')

# Convert parquet columns from strings to int to allow for easier indexing later
chemical_feature_df.columns = chemical_feature_df.columns.astype(int)

# Display the top few rows
chemical_feature_df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1153 | 1154 | 1155 | 1156 | 1157 | 1158 | 1159 | 1160 | 1161 | 1162 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.59589 | 2.867395 | 3.09385 | 0.41 | 0.4619 | 4.067 | 6.49686 | 12.89 | 0.161939 | 0.320015 | ... | 0.0 | 17.283257 | 2.450731 | 1.404566 | 0.817618 | 0.115937 | 0.066446 | -0.050707 | -0.070144 | 0.257593 |
| 1 | 10.554902 | 3.833705 | 2.409499 | 0.4865 | 0.2142 | 3.509 | 6.49686 | 16.61 | 0.348087 | 0.212973 | ... | 0.0 | 43.44903 | 3.871281 | 0.975437 | 0.899645 | 0.080158 | 0.020197 | 0.003796 | -0.11286 | 0.021955 |
| 2 | 9.495996 | 3.831296 | 3.75874 | 0.4865 | 0.2142 | 3.509 | 6.49686 | 16.61 | 0.34826 | 0.28525 | ... | 0.0 | 38.615615 | 4.269847 | 1.605814 | 0.867937 | 0.09597 | 0.036093 | -0.028489 | 0.041934 | -0.005566 |
| 3 | 6.033605 | 5.338802 | 2.5385 | 0.641 | 0.6482 | 7.591 | 11.761885 | 12.36 | 0.168055 | 0.336944 | ... | 0.0 | 11.389305 | 6.31709 | 1.69351 | 0.58708 | 0.325625 | 0.087295 | 0.255738 | -0.555734 | 0.180987 |
| 4 | 7.509796 | 6.4647 | 3.951797 | 0.641 | 0.6482 | 7.591 | 11.761885 | 12.36 | 0.168055 | 0.337588 | ... | 0.0 | 14.143593 | 8.915922 | 1.767197 | 0.569693 | 0.359126 | 0.071181 | 0.474481 | -0.549836 | 0.04016 |


# Next notebook
In the next notebook in the series (Notebook III) you shall see how to classify Morse feature vectors.