# **Computational Drug Discovery - SOAT-2: Descriptor Calculation and Dataset Preparation, Part 03**

Khalid El Akri

[*'Chem Code Professor' YouTube channel*](http://youtube.com/@chemcodeprofessor)

In this Jupyter notebook, we build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we build a machine learning model from BindingDB bioactivity data.

In **Part 03**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then assemble them into a dataset for model building in Part 04.

---

## **Download PaDEL-Descriptor**

In [1]:
! curl -L -O https://github.com/chemcodeprofessor/data/raw/master/PaDel-Descriptor.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  206k    0  206k    0     0   266k      0 --:--:-- --:--:-- --:--:--  266k


In [2]:
! curl -L -O https://github.com/chemcodeprofessor/data/raw/master/PaDel-Descriptor.sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  206k    0  206k    0     0   292k      0 --:--:-- --:--:-- --:--:--  292k


In [3]:
! ls -l

total 10864
-rw-r--r--  1 akrikhalid  staff   394873 May 29 01:30 Output3_Sterol O-acyltransferase 2 inhibitors 219.csv
-rw-r--r--  1 akrikhalid  staff   211049 May 29 10:06 PaDel-Descriptor.sh
-rw-r--r--  1 akrikhalid  staff   211058 May 29 10:05 PaDel-Descriptor.zip
-rw-r--r--  1 akrikhalid  staff   262598 May 29 02:55 Sterol O-acyltransferase 2 inhibitors 219 Part 01.ipynb
-rw-r--r--  1 akrikhalid  staff   250806 May 29 10:00 Sterol O-acyltransferase 2 inhibitors 219 Part 02.ipynb
-rw-r--r--  1 akrikhalid  staff     3348 May 29 10:05 Sterol O-acyltransferase 2 inhibitors 219 Part 03.ipynb
-rw-r--r--  1 akrikhalid  staff   394873 May 29 00:27 Sterol O-acyltransferase 2 inhibitors 219.csv
-rw-r--r--@ 1 akrikhalid  staff  1525829 May 28 23:19 Sterol O-acyltransferase 2 inhibitors 219.sdf
-rw-r--r--@ 1 akrikhalid  staff   382470 May 29 00:20 Sterol O-acyltransferase 2 inhibitors 219.tsv
-rw-r--r--  1 akrikhalid  staff    25632 May 29 02:41 final_output.csv
-rw-r--r--  1 akrik

## **UnZip PaDEL-Descriptor.zip**

In [None]:
import zipfile

with zipfile.ZipFile("PaDel-Descriptor.zip", "r") as zip_ref:
    zip_ref.extractall()
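
If a download step saves an error or redirect page instead of the archive, `extractall()` fails with `zipfile.BadZipFile`. The following is a minimal sketch of a check that can run before extraction. The helper names `looks_like_zip` and `safe_extract` are hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def looks_like_zip(path):
    """Return True if the file exists and is a readable ZIP archive."""
    path = Path(path)
    return path.is_file() and zipfile.is_zipfile(path)


def safe_extract(path, dest="."):
    """Extract the archive only after confirming it really is a ZIP file.

    Returns the list of member names that were extracted.
    """
    if not looks_like_zip(path):
        raise ValueError(f"{path} is not a valid ZIP archive; try downloading it again")
    with zipfile.ZipFile(path, "r") as zip_ref:
        zip_ref.extractall(dest)
        return zip_ref.namelist()
```

In this notebook the call would be `safe_extract("PaDel-Descriptor.zip")`, which raises a readable error instead of a stack trace from deep inside `zipfile`.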

In [8]:
! ls -l

total 10880
-rw-r--r--   1 akrikhalid  staff   394873 May 29 01:30 Output3_Sterol O-acyltransferase 2 inhibitors 219.csv
drwxr-xr-x  21 akrikhalid  staff      672 May 24 13:06 PaDel-Descriptor
-rw-r--r--   1 akrikhalid  staff   211049 May 29 10:06 PaDel-Descriptor.sh
-rw-r--r--@  1 akrikhalid  staff   211058 May 29 10:05 PaDel-Descriptor.zip
-rw-r--r--   1 akrikhalid  staff   262598 May 29 02:55 Sterol O-acyltransferase 2 inhibitors 219 Part 01.ipynb
-rw-r--r--   1 akrikhalid  staff   250806 May 29 10:00 Sterol O-acyltransferase 2 inhibitors 219 Part 02.ipynb
-rw-r--r--   1 akrikhalid  staff     9895 May 29 10:07 Sterol O-acyltransferase 2 inhibitors 219 Part 03.ipynb
-rw-r--r--   1 akrikhalid  staff   394873 May 29 00:27 Sterol O-acyltransferase 2 inhibitors 219.csv
-rw-r--r--@  1 akrikhalid  staff  1525829 May 28 23:19 Sterol O-acyltransferase 2 inhibitors 219.sdf
-rw-r--r--@  1 akrikhalid  staff   382470 May 29 00:20 Sterol O-acyltransferase 2 inhibitors 219.tsv

## **Loading SOAT-2 bioactivity data**

Load the curated BindingDB bioactivity data that was pre-processed in Parts 01 and 02 of this machine learning project series. Here we use the **soat_2_bioa_data_preprocessed.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [9]:
import pandas as pd

In [10]:
df = pd.read_csv('soat_2_bioa_data_preprocessed.csv')

In [11]:
df

Unnamed: 0,mol_bdID,mol_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,51053294,COc1ccc(C(=O)O[C@H]2C[C@H]3[C@](C)(COC(C)=O)[C...,active,675.731,5.05780,1.0,12.0,9.045757
1,51068473,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C#N...,active,670.715,4.92088,1.0,12.0,9.045757
2,51053301,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C#N...,active,670.715,4.92088,1.0,12.0,9.045757
3,51053293,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C)c...,active,659.732,5.35762,1.0,11.0,9.045757
4,51053292,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3cccc(F)...,active,663.695,5.18830,1.0,11.0,9.000000
...,...,...,...,...,...,...,...,...
213,51346963,CC(C)=CC=C(OC(=O)CC(C)(C)O)c1cc(O)c2c(O)ccc(O)...,inactive,388.416,3.67350,5.0,7.0,3.727462
214,330437,C=C1CCC[C@]2(C)C[C@]3(O)OC(=O)C(C)=C3C[C@@H]12,inactive,248.322,2.70470,1.0,3.0,3.638272
215,50855129,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1,inactive,295.426,5.65260,1.0,1.0,3.638272
216,330438,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1,inactive,295.426,5.65260,1.0,1.0,3.383000


In [36]:
selection = ['mol_smiles','mol_bdID']
df_selection = df[selection]
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [37]:
! cat molecule.smi | head -6

COc1ccc(C(=O)O[C@H]2C[C@H]3[C@](C)(COC(C)=O)[C@@H](OC(C)=O)CC[C@]3(C)[C@H]3[C@@H](O)c4c(cc(-c5cccnc5)oc4=O)O[C@]23C)cc1	51053294
CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C#N)cc3)[C@@]3(C)Oc4cc(-c5cccnc5)oc(=O)c4[C@H](O)[C@@H]3[C@@]2(C)CC[C@@H]1OC(C)=O	51068473
CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C#N)cc3)[C@@]3(C)Oc4cc(-c5cccnc5)oc(=O)c4[C@H](O)[C@@H]3[C@@]2(C)CC[C@@H]1OC(C)=O	51053301
CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C)cc3)[C@@]3(C)Oc4cc(-c5cccnc5)oc(=O)c4[C@H](O)[C@@H]3[C@@]2(C)CC[C@@H]1OC(C)=O	51053293
CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3cccc(F)c3)[C@@]3(C)Oc4cc(-c5cccnc5)oc(=O)c4[C@H](O)[C@@H]3[C@@]2(C)CC[C@@H]1OC(C)=O	51053292
CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(Cl)cc3)[C@@]3(C)Oc4cc(-c5cccnc5)oc(=O)c4[C@H](O)[C@@H]3[C@@]2(C)CC[C@@H]1OC(C)=O	51053291


In [38]:
! cat molecule.smi | wc -l

     218
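
PaDEL reads `molecule.smi` as one molecule per line, here written as a SMILES string, a tab, and the BindingDB ID. As a rough sketch, the file can be checked from Python before the descriptor run. The function name `read_smi` is hypothetical, and the check only verifies the two-column layout, not the chemical validity of each SMILES string.

```python
def read_smi(path):
    """Read a tab-separated SMILES file into a list of (smiles, mol_id) pairs.

    Raises ValueError on the first line that is not 'SMILES<TAB>ID'.
    """
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split("\t")
            # Exactly two non-empty fields are expected on every line.
            if len(fields) != 2 or not all(fields):
                raise ValueError(f"line {line_no}: expected 'SMILES<TAB>ID', got {line!r}")
            records.append((fields[0], fields[1]))
    return records
```

For the file written above, `len(read_smi("molecule.smi"))` should match the line count reported by `wc -l`.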


## **Calculate fingerprint descriptors**

### **Calculate PaDEL descriptors**

In [39]:
! cat PaDel-Descriptor.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
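
The same command can also be launched from Python, which makes failures easier to catch than a `!` shell call. This is a sketch under assumptions: the function names and their parameters are hypothetical, the flags are copied unchanged from the script above, and the flag descriptions in the comments are my reading of them rather than something stated in this notebook.

```python
import subprocess


def build_padel_command(padel_dir="./PaDEL-Descriptor", output="descriptors_output.csv"):
    """Return the argument list equivalent to the shell script shown above."""
    return [
        "java", "-Xms1G", "-Xmx1G",           # JVM heap size: 1 GB initial and maximum
        "-Djava.awt.headless=true",            # run without a graphical display
        "-jar", f"{padel_dir}/PaDEL-Descriptor.jar",
        "-removesalt",                         # strip salts / counter-ions
        "-standardizenitro",                   # standardize nitro groups
        "-fingerprints",                       # compute fingerprints
        "-descriptortypes", f"{padel_dir}/PubchemFingerprinter.xml",
        "-dir", "./",                          # directory containing the structure files
        "-file", output,                       # CSV file to write
    ]


def run_padel(**kwargs):
    """Run PaDEL; check=True raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_padel_command(**kwargs), check=True)
```

Calling `run_padel()` with the defaults reproduces the script, while `run_padel(output="other.csv")` writes to a different file.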


In [40]:
! bash PaDel-Descriptor.sh

Processing 51053294 in molecule.smi (1/218). 
Processing 51053293 in molecule.smi (4/218). 
Processing 51068473 in molecule.smi (2/218). 
Processing 51053301 in molecule.smi (3/218). 
Processing 51053292 in molecule.smi (5/218). Average speed: 4.86 s/mol.
Processing 51053291 in molecule.smi (6/218). Average speed: 2.47 s/mol.
Processing 51053290 in molecule.smi (7/218). Average speed: 2.50 s/mol.
Processing 51053289 in molecule.smi (8/218). Average speed: 1.67 s/mol.
Processing 51053288 in molecule.smi (9/218). Average speed: 1.51 s/mol.
Processing 51053287 in molecule.smi (10/218). Average speed: 1.27 s/mol.
Processing 51053286 in molecule.smi (11/218). Average speed: 1.28 s/mol.
Processing 51053281 in molecule.smi (12/218). Average speed: 1.11 s/mol.
Processing 51053280 in molecule.smi (13/218). Average speed: 1.02 s/mol.
Processing 51053282 in molecule.smi (14/218). Average speed: 0.93 s/mol.
Processing 51053267 in molecule.smi (15/218). Average speed: 0.85 s/mol.
Processing 5105328

Processing 51125428 in molecule.smi (117/218). Average speed: 0.35 s/mol.
Processing 51061814 in molecule.smi (118/218). Average speed: 0.35 s/mol.
Processing 51061813 in molecule.smi (119/218). Average speed: 0.35 s/mol.
Processing 51061811 in molecule.smi (120/218). Average speed: 0.35 s/mol.
Processing 50088730 in molecule.smi (121/218). Average speed: 0.35 s/mol.
Processing 51061810 in molecule.smi (122/218). Average speed: 0.34 s/mol.
Processing 51105753 in molecule.smi (123/218). Average speed: 0.34 s/mol.
Processing 330434 in molecule.smi (124/218). Average speed: 0.34 s/mol.
Processing 50533665 in molecule.smi (125/218). Average speed: 0.34 s/mol.
Processing 51105756 in molecule.smi (126/218). Average speed: 0.34 s/mol.
Processing 50329415 in molecule.smi (128/218). Average speed: 0.34 s/mol.
Processing 50088731 in molecule.smi (127/218). Average speed: 0.34 s/mol.
Processing 50880965 in molecule.smi (129/218). Average speed: 0.34 s/mol.
Processing 51105764 in molecule.smi (130

#### The log ends with: Descriptor calculation completed in 52.965 secs. Average speed: 0.24 s/mol.

## **Check the descriptor output file `descriptors_output.csv`**

In [41]:
! ls -l

total 6656
-rw-r--r--   1 akrikhalid  staff  394873 May 29 01:30 Output3_Sterol O-acyltransferase 2 inhibitors 219.csv
drwxr-xr-x  21 akrikhalid  staff     672 May 24 13:06 PaDel-Descriptor
-rw-r--r--@  1 akrikhalid  staff     231 May 24 10:18 PaDel-Descriptor.sh
-rw-r--r--@  1 akrikhalid  staff  211058 May 29 10:05 PaDel-Descriptor.zip
-rw-r--r--   1 akrikhalid  staff  262598 May 29 02:55 Sterol O-acyltransferase 2 inhibitors 219 Part 01.ipynb
-rw-r--r--   1 akrikhalid  staff  250806 May 29 10:00 Sterol O-acyltransferase 2 inhibitors 219 Part 02.ipynb
-rw-r--r--   1 akrikhalid  staff   71742 May 29 10:47 Sterol O-acyltransferase 2 inhibitors 219 Part 03.ipynb
-rw-r--r--   1 akrikhalid  staff  394873 May 29 00:27 Sterol O-acyltransferase 2 inhibitors 219.csv
-rw-r--r--   1 akrikhalid  staff  397800 May 29 10:48 descriptors_output.csv
-rw-r--r--   1 akrikhalid  staff   25632 May 29 02:41 final_output.csv
-rw-r--r--   1 akrikhalid  staff     122 May 29 09:55 mannwhi

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [44]:
df2_X = pd.read_csv('descriptors_output.csv')

In [45]:
df2_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,51053294,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,51068473,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,51053301,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,51053293,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,51053292,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,51346963,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
214,50357584,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
215,50855129,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
216,330438,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [46]:
df2_X = df2_X.drop(columns = ["Name"])

In [47]:
df2_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
214,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
215,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
216,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## **Y variable**

### **Select the pIC50 column (already converted from IC50 in the earlier parts)**

In [48]:
df2_Y = df['pIC50']

In [49]:
df2_Y.info()

<class 'pandas.core.series.Series'>
RangeIndex: 218 entries, 0 to 217
Series name: pIC50
Non-Null Count  Dtype  
--------------  -----  
218 non-null    float64
dtypes: float64(1)
memory usage: 1.8 KB


In [50]:
df2_Y.describe()

count    218.000000
mean       6.275259
std        1.551436
min        3.383000
25%        4.954477
50%        6.201065
75%        7.721246
max        9.045757
Name: pIC50, dtype: float64
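
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L. The sketch below assumes IC50 values given in nM, so the unit handling is an assumption rather than something shown in this part of the series, and the function names are hypothetical.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)


def pic50_to_ic50_nm(pic50):
    """Inverse conversion: pIC50 back to an IC50 in nanomolar."""
    return 10 ** (9 - pic50)


print(round(ic50_nm_to_pic50(1.0), 6))   # 1 nM
print(round(ic50_nm_to_pic50(0.9), 6))   # 0.9 nM
```

Under this definition, the maximum of 9.045757 in the summary above corresponds to an IC50 of about 0.9 nM, and a higher pIC50 means a more potent compound.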

## **Combining X and Y variables**

In [51]:
dataset2 = pd.concat([df2_X,df2_Y], axis=1)
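
`pd.concat(..., axis=1)` pairs rows by index, so it only gives the right result if `descriptors_output.csv` lists the molecules in the same order as `df`. The PaDEL log above shows molecules being processed slightly out of order, so it may be worth joining on the molecule ID instead. This is a minimal sketch on toy data. The function name `align_on_id` is hypothetical, and it has to be applied before the `Name` column is dropped.

```python
import pandas as pd


def align_on_id(bioactivity, descriptors, id_col="mol_bdID", name_col="Name", target="pIC50"):
    """Attach the target column to the fingerprints by molecule ID, not by row position."""
    fingerprints = descriptors.rename(columns={name_col: id_col})
    merged = fingerprints.merge(
        bioactivity[[id_col, target]],
        on=id_col,
        how="inner",
        validate="one_to_one",  # raises MergeError if an ID repeats on either side
    )
    return merged.drop(columns=[id_col])


# Toy data: the descriptor rows are deliberately in a different order.
bio = pd.DataFrame({"mol_bdID": [101, 102, 103], "pIC50": [9.0, 6.2, 3.4]})
desc = pd.DataFrame({"Name": [103, 101, 102], "PubchemFP0": [0, 1, 1], "PubchemFP1": [1, 0, 1]})

toy = align_on_id(bio, desc)
print(toy)
```

The `validate="one_to_one"` argument makes the merge fail loudly if an ID appears more than once, which is safer than silently duplicating rows. In that case the duplicates would need to be resolved before joining.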

In [52]:
dataset2

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.045757
1,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.045757
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.045757
3,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.045757
4,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.727462
214,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.638272
215,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.638272
216,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.383000


In [53]:
dataset2['pIC50'].describe()

count    218.000000
mean       6.275259
std        1.551436
min        3.383000
25%        4.954477
50%        6.201065
75%        7.721246
max        9.045757
Name: pIC50, dtype: float64

In [54]:
dataset2.to_csv('soat_2_bioa_data_preprocessed_pIC50_pubchem_fp.csv', index=False)
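
As a quick sanity check on the saved file, the header can be inspected without loading the whole table. The tables above show fingerprint columns `PubchemFP0` through `PubchemFP880`, so the file would be expected to hold 881 fingerprint columns followed by `pIC50`. The function name `summarize_header` is hypothetical.

```python
import csv


def summarize_header(path, prefix="PubchemFP", target="pIC50"):
    """Return (number of fingerprint columns, whether the last column is the target)."""
    with open(path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle))
    n_bits = sum(1 for name in header if name.startswith(prefix))
    return n_bits, header[-1] == target
```

For this notebook, `summarize_header("soat_2_bioa_data_preprocessed_pIC50_pubchem_fp.csv")` should return `(881, True)` if the file was written as intended.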

In [55]:
! ls -l

total 7472
-rw-r--r--   1 akrikhalid  staff  394873 May 29 01:30 Output3_Sterol O-acyltransferase 2 inhibitors 219.csv
drwxr-xr-x  21 akrikhalid  staff     672 May 24 13:06 PaDel-Descriptor
-rw-r--r--@  1 akrikhalid  staff     231 May 24 10:18 PaDel-Descriptor.sh
-rw-r--r--@  1 akrikhalid  staff  211058 May 29 10:05 PaDel-Descriptor.zip
-rw-r--r--   1 akrikhalid  staff  262598 May 29 02:55 Sterol O-acyltransferase 2 inhibitors 219 Part 01.ipynb
-rw-r--r--   1 akrikhalid  staff  250806 May 29 10:00 Sterol O-acyltransferase 2 inhibitors 219 Part 02.ipynb
-rw-r--r--   1 akrikhalid  staff   86813 May 29 10:49 Sterol O-acyltransferase 2 inhibitors 219 Part 03.ipynb
-rw-r--r--   1 akrikhalid  staff  394873 May 29 00:27 Sterol O-acyltransferase 2 inhibitors 219.csv
-rw-r--r--   1 akrikhalid  staff  397800 May 29 10:48 descriptors_output.csv
-rw-r--r--   1 akrikhalid  staff   25632 May 29 02:41 final_output.csv
-rw-r--r--   1 akrikhalid  staff     122 May 29 09:55 mannwhi