# **Computational Drug Discovery - Descriptor Calculation and Dataset Preparation Part 03**

khalid El Akri

[*'Chem Code Professor' YouTube channel*](http://youtube.com/@chemcodeprofessor)

In this Jupyter notebook, we will be building a real-life **data science project** that you can include in your **data science portfolio**. Particularly, we will be building a machine learning model using the ChEMBL bioactivity data.

In **Part 03**, we will be calculating molecular descriptors that are essentially quantitative description of the compounds in the dataset. Finally, we will be preparing this into a dataset for subsequent model building in Part 04.

---

## **Download PaDEL-Descriptor**

In [34]:
!curl -O https://github.com/chemcodeprofessor/data/raw/master/PaDel-Descriptor.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  206k    0  206k    0     0   317k      0 --:--:-- --:--:-- --:--:--  317k


In [35]:
!curl -O https://github.com/chemcodeprofessor/data/raw/master/PaDel-Descriptor.sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  206k    0  206k    0     0   339k      0 --:--:-- --:--:-- --:--:--  339k


In [36]:
! ls -l

total 1016
-rw-r--r--  1 akrikhalid  staff   71663 May 24 13:01 Descriptor_Calculation.ipynb
-rw-r--r--  1 akrikhalid  staff  211054 May 24 13:02 PaDel-Descriptor.sh
-rw-r--r--  1 akrikhalid  staff  211063 May 24 13:02 PaDel-Descriptor.zip
-rw-r--r--  1 akrikhalid  staff   17146 May 24 02:55 bioa_data_preprocessed1.csv


## **UnZip PaDEL-Descriptor.zip**

In [41]:
import zipfile

with zipfile.ZipFile("PaDel-Descriptor.zip", "r") as zip_ref:
    zip_ref.extractall()

In [52]:
! ls -l

total 52616
-rw-r--r--   1 akrikhalid  staff    300602 May 24 13:11 Descriptor_Calculation.ipynb
drwxr-xr-x  21 akrikhalid  staff       672 May 24 13:06 [34mPaDel-Descriptor[m[m
-rw-r--r--   1 akrikhalid  staff    211054 May 24 13:02 PaDel-Descriptor.sh
-rw-r--r--   1 akrikhalid  staff  25765742 May 24 13:05 PaDel-Descriptor.zip
drwxr-xr-x   4 akrikhalid  staff       128 May 24 13:06 [34m__MACOSX[m[m
-rw-r--r--   1 akrikhalid  staff     17146 May 24 02:55 bioa_data_preprocessed1.csv
-rw-r--r--   1 akrikhalid  staff      7991 May 24 13:10 molecule.smi


## **Loading bioactivity data**

Download the curated ChEMBL bioactivity data that has been pre-processed from Parts 1 and 2 of this Machine learning Project series. Here we will be using the **bioa_data_preprocessed1.csv** file that essentially contain the pIC50 values that we will be using for building a regression model.

In [53]:
import pandas as pd

In [54]:
df = pd.read_csv('bioa_data_preprocessed1.csv')

In [55]:
df

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL187579,Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21,intermediate,281.271,1.89262,0.0,5.0,5.142668
1,CHEMBL188487,O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21,intermediate,415.589,3.81320,0.0,2.0,5.026872
2,CHEMBL185698,O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21,inactive,421.190,2.66050,0.0,4.0,4.869666
3,CHEMBL426082,O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21,inactive,293.347,3.63080,0.0,3.0,4.882397
4,CHEMBL187717,O=C1C(=O)N(Cc2cc3ccccc3s2)c2c1cccc2[N+](=O)[O-],intermediate,338.344,3.53900,0.0,5.0,5.698970
...,...,...,...,...,...,...,...,...
128,CHEMBL2146517,COC(=O)[C@@]1(C)CCCc2c1ccc1c2C(=O)C(=O)c2c(C)c...,inactive,338.359,3.40102,0.0,5.0,4.974694
129,CHEMBL187460,C[C@H]1COC2=C1C(=O)C(=O)c1c2ccc2c1CCCC2(C)C,inactive,296.366,3.44330,0.0,3.0,4.995679
130,CHEMBL363535,Cc1coc2c1C(=O)C(=O)c1c-2ccc2c(C)cccc12,inactive,276.291,4.09564,0.0,3.0,4.939302
131,CHEMBL227075,Cc1cccc2c3c(ccc12)C1=C(C(=O)C3=O)[C@@H](C)CO1,inactive,278.307,3.29102,0.0,3.0,4.970616


In [56]:
selection = ['canonical_smiles','molecule_chembl_id']
df_selection = df[selection]
df_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [57]:
! cat molecule.smi | head -6

Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21	CHEMBL187579
O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21	CHEMBL188487
O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21	CHEMBL185698
O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21	CHEMBL426082
O=C1C(=O)N(Cc2cc3ccccc3s2)c2c1cccc2[N+](=O)[O-]	CHEMBL187717
O=C1C(=O)N(Cc2cc3ccccc3s2)c2c(Br)cccc21	CHEMBL365134


In [58]:
! cat molecule.smi | wc -l

     133


## **Calculate fingerprint descriptors**

### **Calculate PaDEL descriptors**

In [65]:
! cat PaDel-Descriptor.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv


In [66]:
! bash Padel-Descriptor.sh

Processing CHEMBL187579 in molecule.smi (1/133). 
Processing CHEMBL426082 in molecule.smi (4/133). 
Processing CHEMBL185698 in molecule.smi (3/133). 
Processing CHEMBL188487 in molecule.smi (2/133). 
Processing CHEMBL187717 in molecule.smi (5/133). Average speed: 1.68 s/mol.
Processing CHEMBL365134 in molecule.smi (6/133). Average speed: 0.93 s/mol.
Processing CHEMBL187598 in molecule.smi (7/133). Average speed: 0.66 s/mol.
Processing CHEMBL190743 in molecule.smi (8/133). Average speed: 0.52 s/mol.
Processing CHEMBL365469 in molecule.smi (9/133). Average speed: 0.59 s/mol.
Processing CHEMBL188983 in molecule.smi (10/133). Average speed: 0.52 s/mol.
Processing CHEMBL191575 in molecule.smi (11/133). Average speed: 0.45 s/mol.
Processing CHEMBL370923 in molecule.smi (12/133). Average speed: 0.40 s/mol.
Processing CHEMBL194398 in molecule.smi (13/133). Average speed: 0.40 s/mol.
Processing CHEMBL196635 in molecule.smi (14/133). Average speed: 0.38 s/mol.
Processing CHEMBL209287 in molecule

Processing CHEMBL2146517 in molecule.smi (110/133). Average speed: 0.14 s/mol.
Processing CHEMBL187460 in molecule.smi (111/133). Average speed: 0.14 s/mol.
Processing CHEMBL363535 in molecule.smi (112/133). Average speed: 0.14 s/mol.
Processing CHEMBL227075 in molecule.smi (113/133). Average speed: 0.14 s/mol.
Processing CHEMBL45830 in molecule.smi (114/133). Average speed: 0.14 s/mol.
Processing CHEMBL187266 in molecule.smi (115/133). Average speed: 0.14 s/mol.
Processing CHEMBL2146517 in molecule.smi (117/133). Average speed: 0.14 s/mol.
Processing CHEMBL215254 in molecule.smi (116/133). Average speed: 0.14 s/mol.
Processing CHEMBL187460 in molecule.smi (118/133). Average speed: 0.14 s/mol.
Processing CHEMBL363535 in molecule.smi (119/133). Average speed: 0.14 s/mol.
Processing CHEMBL227075 in molecule.smi (120/133). Average speed: 0.14 s/mol.
Processing CHEMBL45830 in molecule.smi (121/133). Average speed: 0.14 s/mol.
Processing CHEMBL215254 in molecule.smi (122/133). Average speed

In [67]:
! ls -l

total 52696
-rw-r--r--   1 akrikhalid  staff    289021 May 24 13:49 Descriptor_Calculation.ipynb
drwxr-xr-x  21 akrikhalid  staff       672 May 24 13:06 [34mPaDel-Descriptor[m[m
-rw-r--r--@  1 akrikhalid  staff       231 May 24 10:18 PaDel-Descriptor.sh
-rw-r--r--   1 akrikhalid  staff  25765742 May 24 13:05 PaDel-Descriptor.zip
drwxr-xr-x   4 akrikhalid  staff       128 May 24 13:06 [34m__MACOSX[m[m
-rw-r--r--   1 akrikhalid  staff     17146 May 24 02:55 bioa_data_preprocessed1.csv
-rw-r--r--   1 akrikhalid  staff    247688 May 24 13:50 descriptors_output.csv
-rw-r--r--   1 akrikhalid  staff      7991 May 24 13:47 molecule.smi


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [68]:
df2_X = pd.read_csv('descriptors_output.csv')

In [69]:
df2_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL187579,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL188487,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL185698,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL426082,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL187717,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
128,CHEMBL215254,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
129,CHEMBL2146517,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
130,CHEMBL363535,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
131,CHEMBL45830,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [70]:
df2_X = df2_X.drop(columns=['Name'])

In [71]:
df2_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
128,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
129,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
130,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
131,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## **Y variable**

### **Convert IC50 to pIC50**

In [72]:
df2_Y = df['pIC50']

In [73]:
df2_Y

0      5.142668
1      5.026872
2      4.869666
3      4.882397
4      5.698970
         ...   
128    4.974694
129    4.995679
130    4.939302
131    4.970616
132    4.102923
Name: pIC50, Length: 133, dtype: float64

## **Combining X and Y variable**

In [74]:
dataset2 = pd.concat([df2_X,df2_Y], axis=1)

In [75]:
dataset2

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.142668
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.026872
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.869666
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.882397
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.698970
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
128,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.974694
129,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.995679
130,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.939302
131,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.970616


In [76]:
dataset2.to_csv('bioa_data_preprocessed1_pIC50_pubchem_fp.csv', index=False)

In [77]:
!ls -l

total 52760
-rw-r--r--   1 akrikhalid  staff     70664 May 24 13:51 Descriptor_Calculation.ipynb
drwxr-xr-x  21 akrikhalid  staff       672 May 24 13:06 [34mPaDel-Descriptor[m[m
-rw-r--r--@  1 akrikhalid  staff       231 May 24 10:18 PaDel-Descriptor.sh
-rw-r--r--   1 akrikhalid  staff  25765742 May 24 13:05 PaDel-Descriptor.zip
drwxr-xr-x   4 akrikhalid  staff       128 May 24 13:06 [34m__MACOSX[m[m
-rw-r--r--   1 akrikhalid  staff     17146 May 24 02:55 bioa_data_preprocessed1.csv
-rw-r--r--   1 akrikhalid  staff    248011 May 24 13:51 bioa_data_preprocessed1_pIC50_pubchem_fp.csv
-rw-r--r--   1 akrikhalid  staff    247688 May 24 13:50 descriptors_output.csv
-rw-r--r--   1 akrikhalid  staff      7991 May 24 13:47 molecule.smi


## **Let's download the CSV file to local computer for the Part 04 (Machine Learning Model Building).**