# Computational Drug Discovery [Part 3] 

Descriptor Calculation and Dataset Preparation

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then assemble them into a dataset for model building in Part 4.

---

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **04_bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [18]:
import pandas as pd
df3 = pd.read_csv('Data/04_bioactivity_data_3class_pIC50.csv')

In [19]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL20,CC(=O)Nc1nnc(S(N)(=O)=O)s1,active,222.251,-0.85610,2.0,6.0,6.602060
1,1,CHEMBL19,CC(=O)/N=c1/sc(S(N)(=O)=O)nn1C,active,236.278,-1.42380,1.0,6.0,7.301030
2,2,CHEMBL118,Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2...,inactive,381.379,3.51392,1.0,4.0,4.301030
3,3,CHEMBL26915,COc1ccc(-n2nc(C(F)(F)F)cc2-c2ccc(Cl)cc2)cc1,inactive,352.743,5.22010,0.0,3.0,4.000000
4,4,CHEMBL139,O=C(O)Cc1ccccc1Nc1c(Cl)cccc1Cl,inactive,296.153,4.36410,2.0,2.0,4.000000
...,...,...,...,...,...,...,...,...,...
277,277,CHEMBL4863113,COC(=O)[C@H]1O[C@@H](NC(=O)c2ccc(S(N)(=O)=O)cc...,intermediate,390.370,-2.95560,5.0,9.0,5.744727
278,278,CHEMBL4865818,NS(=O)(=O)c1ccc(C(=O)N[C@@H]2O[C@H](CO)[C@@H](...,intermediate,362.360,-3.13630,6.0,8.0,5.443697
279,279,CHEMBL4870385,NS(=O)(=O)OC[C@H]1C[C@@H](Nc2ncncc2C(=O)c2ccn(...,active,551.423,1.48640,3.0,10.0,7.301030
280,280,CHEMBL4856793,Cc1sc(C(=O)c2cncnc2N[C@H]2C[C@H](O)[C@@H](COS(...,active,579.100,3.16452,3.0,10.0,7.301030
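
The `Unnamed: 0.1` and `Unnamed: 0` columns in the output above look like row indices that were saved to disk in earlier parts and read back as ordinary columns. That reading is an assumption based on their names and values. A minimal sketch of dropping such columns, using a small in-memory stand-in for the CSV (two rows and a subset of columns copied from the output above) rather than the real file:

```python
import io
import pandas as pd

# Tiny stand-in for the CSV file; not the real data file.
csv_text = """Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,pIC50
0,0,CHEMBL20,CC(=O)Nc1nnc(S(N)(=O)=O)s1,active,6.602060
1,1,CHEMBL19,CC(=O)/N=c1/sc(S(N)(=O)=O)nn1C,active,7.301030
"""

demo = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns whose name does not start with "Unnamed".
demo_clean = demo.loc[:, ~demo.columns.str.startswith("Unnamed")]
print(list(demo_clean.columns))
```

Writing intermediate files with `index=False`, as the last cell of this notebook does, avoids creating these columns in the first place.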


In [20]:

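# Keep the SMILES string and the ChEMBL ID, in that order (the usual .smi layout)
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
# Write a tab-separated file with no header or index, one molecule per line
df3_selection.to_csv('Data/molecule.smi', sep='\t', index=False, header=False)
# Copy it to the working directory, where PaDEL is run below
! cp Data/molecule.smi .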

In [21]:
! cat Data/molecule.smi | head -5

CC(=O)Nc1nnc(S(N)(=O)=O)s1	CHEMBL20
CC(=O)/N=c1/sc(S(N)(=O)=O)nn1C	CHEMBL19
Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1	CHEMBL118
COc1ccc(-n2nc(C(F)(F)F)cc2-c2ccc(Cl)cc2)cc1	CHEMBL26915
O=C(O)Cc1ccccc1Nc1c(Cl)cccc1Cl	CHEMBL139


In [22]:
! cat Data/molecule.smi | wc -l

282
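
The file written above holds one molecule per line: a SMILES string, a tab, and the ChEMBL ID. The shell checks can also be done in Python, which makes it easy to flag malformed lines before running PaDEL. The sketch below uses a hypothetical helper, `check_smi`, and a made-up third line to show what a failure looks like:

```python
import io

def check_smi(handle):
    """Return (n_valid_molecules, bad_line_numbers) for a tab-separated .smi file."""
    n_valid, bad = 0, []
    for lineno, line in enumerate(handle, start=1):
        fields = line.rstrip("\n").split("\t")
        # Expect exactly two non-empty fields: SMILES and molecule ID.
        if len(fields) != 2 or not all(f.strip() for f in fields):
            bad.append(lineno)
        else:
            n_valid += 1
    return n_valid, bad

# Two lines from the file above plus one deliberately broken line.
sample = io.StringIO(
    "CC(=O)Nc1nnc(S(N)(=O)=O)s1\tCHEMBL20\n"
    "CC(=O)/N=c1/sc(S(N)(=O)=O)nn1C\tCHEMBL19\n"
    "\tCHEMBL_MISSING_SMILES\n"
)
print(check_smi(sample))
```

On the real file, the same helper would be called as `check_smi(open('Data/molecule.smi'))`. The valid count should match the number of rows in `df3`.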


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [23]:
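# Bring the PaDEL launcher script and archive into the working directory
! mv padel/padel.sh .
! mv padel/padel.zip .
# unzip asks before overwriting files from an earlier extraction; -o would skip the prompt
! unzip padel.zip
# Run PaDEL on molecule.smi
! cat padel.sh | bash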

mv: cannot stat 'padel/padel.sh': No such file or directory
mv: cannot stat 'padel/padel.zip': No such file or directory
Archive:  padel.zip
replace __MACOSX/._PaDEL-Descriptor? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
Processing CHEMBL20 in molecule.smi (1/282). 
Processing CHEMBL19 in molecule.smi (2/282). 
Processing CHEMBL118 in molecule.smi (3/282). 
Processing CHEMBL26915 in molecule.smi (4/282). 
Processing CHEMBL139 in molecule.smi (5/282). Average speed: 2.44 s/mol.
Processing CHEMBL17 in molecule.smi (6/282). Average speed: 1.29 s/mol.
Processing CHEMBL865 in molecule.smi (7/282). Average speed: 1.03 s/mol.
Processing CHEMBL423041 in molecule.smi (8/282). Average speed: 1.13 s/mol.
Processing CHEMBL218490 in molecule.smi (9/282). Average speed: 1.16 s/mol.
Processing CHEMBL122708 in molecule.smi (10/282). Average speed: 0.84 s/mol.
Processing CHEMBL26 in molecule.smi (11/282). Average speed: 0.73 s/mol.
Processing CHEMBL77517 in molecule.smi (12/282). Average speed: 0.67 s/mo

In [24]:
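# Put the PaDEL files back and remove the extracted folders
! mv padel.zip padel.sh padel
! rm -r ./__MACOSX
! rm -r ./PaDEL-Descriptor
# Remove the working copy of the input and move the descriptors into Data
! rm molecule.smi
! mv descriptors_output.csv Data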


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [None]:
df3_X = pd.read_csv('Data/descriptors_output.csv')

In [None]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL20,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL19,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL17,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL26915,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL139,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
277,CHEMBL4876347,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
278,CHEMBL4863113,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
279,CHEMBL4870385,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
280,CHEMBL4856793,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
277,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
278,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
279,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
280,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
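
The X matrix above has one column per fingerprint bit, `PubchemFP0` through `PubchemFP880`, so 881 binary columns are expected. A quick sanity check before modelling can catch missing or non-binary values. This is a sketch: `check_fingerprints` is a hypothetical helper, and the two-column frame is toy data, not the real descriptors.

```python
import pandas as pd

def check_fingerprints(X, prefix="PubchemFP", n_bits=881):
    """Return a list of problems found in a fingerprint DataFrame."""
    problems = []
    fp_cols = [c for c in X.columns if c.startswith(prefix)]
    if len(fp_cols) != n_bits:
        problems.append(f"expected {n_bits} bit columns, found {len(fp_cols)}")
    if X[fp_cols].isna().any().any():
        problems.append("missing values present")
    if not X[fp_cols].isin([0, 1]).all().all():
        problems.append("non-binary values present")
    return problems

# Toy frame with one invalid entry (the value 2).
demo = pd.DataFrame({"PubchemFP0": [1, 1], "PubchemFP1": [0, 2]})
print(check_fingerprints(demo, n_bits=2))
```

For the real matrix the call would be `check_fingerprints(df3_X)`. An empty list means no problems were found.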


### **Y variable**

The pIC50 values were already computed in the earlier parts of this series, so here we only select the `pIC50` column.

In [None]:
df3_Y = df3['pIC50']
df3_Y

0      6.602060
1      7.301030
2      4.301030
3      4.000000
4      4.000000
         ...   
277    5.744727
278    5.443697
279    7.301030
280    7.301030
281    7.045757
Name: pIC50, Length: 282, dtype: float64
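
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes the IC50 is given in nanomolar, which is an assumption about the upstream data and is not confirmed in this part. `pic50_from_nm` is a hypothetical helper name.

```python
import math

def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)

# Illustrative inputs only; they are not read from the dataset.
for ic50 in (250.0, 50.0, 100_000.0):
    print(ic50, round(pic50_from_nm(ic50), 6))
```

A higher pIC50 means a more potent compound: a tenfold drop in IC50 raises pIC50 by exactly 1.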

## **Combining the X and Y variables**

In [None]:
# Join column-wise on the row index; this assumes both frames list the molecules in the same order
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.602060
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.301030
2,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.301030
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
277,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.744727
278,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.443697
279,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.301030
280,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.301030
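
Joining by position is only safe if the descriptor rows come back in the same order as the bioactivity rows. In the outputs above, row 2 of the descriptor table is `CHEMBL17` while row 2 of the bioactivity table is `CHEMBL118`, so it may be worth aligning on the molecule ID instead. A minimal sketch of that alternative, using toy frames with made-up fingerprint bits, and assuming the `Name` column of the descriptor file holds the same IDs as `molecule_chembl_id`:

```python
import pandas as pd

# Toy stand-ins: the descriptors arrive in a different order than the activities.
activities = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL20", "CHEMBL19", "CHEMBL118"],
    "pIC50": [6.602060, 7.301030, 4.301030],
})
descriptors = pd.DataFrame({
    "Name": ["CHEMBL20", "CHEMBL118", "CHEMBL19"],
    "PubchemFP0": [1, 1, 1],   # made-up bits for illustration
    "PubchemFP1": [0, 1, 1],
})

# Join on the ID; validate raises a MergeError if either side has duplicate IDs.
merged = descriptors.merge(
    activities,
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)

# Drop the ID columns so only fingerprint bits and the target remain.
aligned = merged.drop(columns=["Name", "molecule_chembl_id"])
print(aligned)
```

With the real data this would mean keeping the `Name` column in the descriptor frame until after the merge. If an ID appears more than once, the `validate` check fails, and the duplicates need to be resolved before merging.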


In [None]:
dataset3.to_csv('Data/06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)