# Computational Drug Discovery [Part 3] 

Descriptor Calculation and Dataset Preparation

In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then combine these descriptors with the pIC50 values into a dataset for model building in Part 4.

---

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **bioactivity_data_3class_pIC50.csv** file (read below from `Data/04_bioactivity_data_3class_pIC50.csv`), which contains the pIC50 values that serve as the target for a regression model.

In [1]:
import pandas as pd
df3 = pd.read_csv('Data/04_bioactivity_data_3class_pIC50.csv')

In [2]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL3775576,CNC(=O)c1ccc2[nH]nc(Cc3ccc4c(cnn4C)c3)c2c1,active,319.368,2.4000,2.0,4.0,7.585027
1,1,CHEMBL3775317,Cn1cc(-c2ccc(Cc3n[nH]c4ccc(C(=O)N5CC[C@@H](O)C...,active,401.470,2.7610,2.0,5.0,8.397940
2,2,CHEMBL3798663,O=C1NCCC12CCN(c1c(Cl)cncc1-c1ccc(-c3cnn(CCO)c3...,active,451.958,3.3643,2.0,6.0,8.602060
3,3,CHEMBL3798944,CC(C)(O)Cn1cc(-c2ccc(-c3cncc(Cl)c3N3CCC4(CCNC4...,active,480.012,4.1429,2.0,6.0,8.853872
4,4,CHEMBL3798318,O=C1NCCC12CCN(c1c(Cl)cncc1-c1ccc(-c3cnn(CCN4CC...,active,505.066,4.4678,1.0,6.0,7.686133
...,...,...,...,...,...,...,...,...,...
116,116,CHEMBL4849842,Cn1ncc2cc(-c3cnc4[nH]ccc4c3N3CCC4(CCCNC4=O)CC3...,active,414.513,3.6132,2.0,5.0,8.522879
117,117,CHEMBL4878356,Cn1ncc2cc(-c3cnc4[nH]ccc4c3N3CCC4(CC3)CNC4=O)c...,active,386.459,2.8330,2.0,5.0,8.221849
118,118,CHEMBL4862777,Cn1ncc2cc(-c3cnc4[nH]ccc4c3N3CCC4(CC3)CNC(=O)O...,active,402.458,3.1954,2.0,6.0,8.221849
119,119,CHEMBL4853002,CCOC(=O)/C=C/c1ccncc1-c1ccc2cc[nH]c2c1,active,292.338,3.8062,1.0,3.0,7.339704
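
The preview above lists columns named `Unnamed: 0.1` and `Unnamed: 0`. Columns with such names typically appear when a DataFrame is saved with its index and read back again. If they are real columns in your copy of the file, a minimal sketch for dropping them could look like the following; the helper name `drop_unnamed_columns` and the two-row CSV text are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd


def drop_unnamed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns whose name starts with 'Unnamed:'."""
    keep = [col for col in df.columns if not str(col).startswith("Unnamed:")]
    return df[keep].copy()


# Tiny stand-in for the bioactivity file (IDs and pIC50 values copied from the preview).
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,pIC50\n"
    "0,0,CHEMBL3775576,7.585027\n"
    "1,1,CHEMBL3775317,8.397940\n"
)
raw = pd.read_csv(io.StringIO(csv_text))
clean = drop_unnamed_columns(raw)
print(list(clean.columns))
```

Writing future CSV files with `index=False`, as the last cell of this notebook does, avoids creating these columns in the first place.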


In [3]:

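# Write a headerless, tab-separated .smi file: SMILES first, molecule ID second
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('Data/molecule.smi', sep='\t', index=False, header=False)
# Copy the file into the working directory, where the descriptor calculation runs
! cp Data/molecule.smi .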

In [4]:
! cat Data/molecule.smi | head -5

CNC(=O)c1ccc2[nH]nc(Cc3ccc4c(cnn4C)c3)c2c1	CHEMBL3775576
Cn1cc(-c2ccc(Cc3n[nH]c4ccc(C(=O)N5CC[C@@H](O)C5)cc34)cc2)cn1	CHEMBL3775317
O=C1NCCC12CCN(c1c(Cl)cncc1-c1ccc(-c3cnn(CCO)c3)cc1)CC2	CHEMBL3798663
CC(C)(O)Cn1cc(-c2ccc(-c3cncc(Cl)c3N3CCC4(CCNC4=O)CC3)cc2)cn1	CHEMBL3798944
O=C1NCCC12CCN(c1c(Cl)cncc1-c1ccc(-c3cnn(CCN4CCCC4)c3)cc1)CC2	CHEMBL3798318


In [5]:
! cat Data/molecule.smi | wc -l

121
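
The same checks done with `head` and `wc` above can also be made from Python. The sketch below writes a small `.smi` file to an in-memory buffer and reads it back to confirm that every line has a SMILES string and an ID and that no ID is repeated. The two records are copied from the file preview; the column names `smiles` and `name` are arbitrary labels chosen for this example.

```python
import io

import pandas as pd

# Toy stand-in for df3 with two records from the preview above.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL3775576", "CHEMBL3798663"],
    "canonical_smiles": [
        "CNC(=O)c1ccc2[nH]nc(Cc3ccc4c(cnn4C)c3)c2c1",
        "O=C1NCCC12CCN(c1c(Cl)cncc1-c1ccc(-c3cnn(CCO)c3)cc1)CC2",
    ],
})

# Same layout as molecule.smi: SMILES, tab, ID, with no header and no index.
buffer = io.StringIO()
toy[["canonical_smiles", "molecule_chembl_id"]].to_csv(
    buffer, sep="\t", index=False, header=False
)

# Read the text back and run basic sanity checks.
buffer.seek(0)
smi = pd.read_csv(buffer, sep="\t", header=None, names=["smiles", "name"])
n_missing = int(smi.isna().sum().sum())
n_duplicate_ids = int(smi["name"].duplicated().sum())
print(len(smi), n_missing, n_duplicate_ids)
```

For the real file, replacing the buffer with the path `'Data/molecule.smi'` in `pd.read_csv` should give the same kind of report.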


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [6]:


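# Download the PaDEL-Descriptor bundle and its launcher script
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
! unzip padel.zip
# Run the descriptor calculation; the results are written to descriptors_output.csv
! bash padel.sh

# Alternatively, if padel.zip and padel.sh are already in the padel folder, move them to the current directory, unzip padel.zip, and run: ! bash padel.sh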

--2023-02-07 00:44:17--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2023-02-07 00:44:18--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2023-02-07 00:44:22 (10,6 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2023-02-07 00:44:22--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (git

In [7]:
# Tidy up: keep the PaDEL downloads in ./padel and move the results into Data
! mkdir -p padel
! mv padel.zip padel.sh padel
! rm -r ./__MACOSX
! rm -r ./PaDEL-Descriptor
! rm molecule.smi
! mv descriptors_output.csv Data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [8]:
df3_X = pd.read_csv('Data/descriptors_output.csv')

In [9]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL3775576,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL3775317,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL3798663,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL3798944,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL3800311,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116,CHEMBL4862777,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
117,CHEMBL4853002,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
118,CHEMBL4878356,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
119,CHEMBL4849842,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
117,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
118,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
119,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
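
Before using the matrix for modelling, it can help to confirm that it has the expected shape and content. PubChem fingerprints are 881 bits long, which matches the columns `PubchemFP0` to `PubchemFP880` shown above. The sketch below is one possible check, run here on randomly generated toy bits; the helper name `check_fingerprint_matrix` is hypothetical.

```python
import numpy as np
import pandas as pd


def check_fingerprint_matrix(X: pd.DataFrame, n_bits: int = 881) -> None:
    """Raise ValueError if X lacks a PubchemFP column or holds non-binary values."""
    expected = [f"PubchemFP{i}" for i in range(n_bits)]
    missing = [col for col in expected if col not in X.columns]
    if missing:
        raise ValueError(f"{len(missing)} fingerprint columns are missing")
    bits = X[expected]
    if not bits.isin([0, 1]).all().all():
        raise ValueError("fingerprint columns must contain only 0 and 1")


# Toy matrix: 3 molecules x 881 random bits (made-up data, for illustration only).
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    rng.integers(0, 2, size=(3, 881)),
    columns=[f"PubchemFP{i}" for i in range(881)],
)
check_fingerprint_matrix(toy_X)  # passes silently
```

Calling `check_fingerprint_matrix(df3_X)` on the real matrix would flag a truncated or corrupted `descriptors_output.csv` early.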


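### **Y variable**

The file loaded above already contains pIC50 values, so no conversion from IC50 is needed here. We simply select the `pIC50` column as the target variable.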

In [11]:
df3_Y = df3['pIC50']
df3_Y

0      7.585027
1      8.397940
2      8.602060
3      8.853872
4      7.686133
         ...   
116    8.522879
117    8.221849
118    8.221849
119    7.339704
120    7.549905
Name: pIC50, Length: 121, dtype: float64
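
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units, so for an IC50 given in nM, pIC50 = 9 - log10(IC50). The sketch below shows the conversion in both directions. It assumes the IC50 values in the earlier parts were in nM, and the function names are illustrative.

```python
import numpy as np


def ic50_nm_to_pic50(ic50_nm):
    """Convert IC50 values in nM to pIC50 = -log10(IC50 in mol/L)."""
    ic50_molar = np.asarray(ic50_nm, dtype=float) * 1e-9
    return -np.log10(ic50_molar)


def pic50_to_ic50_nm(pic50):
    """Invert the transform: return IC50 in nM for the given pIC50 values."""
    return 10.0 ** (9.0 - np.asarray(pic50, dtype=float))


# First two pIC50 values from the output above, mapped back to nM.
ic50_nm = pic50_to_ic50_nm([7.585027, 8.397940])
print(ic50_nm)
```

Under this definition the first two compounds correspond to IC50 values of roughly 26 nM and 4 nM, and a higher pIC50 means a more potent compound.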

## **Combining the X and Y variables**

In [12]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.585027
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.397940
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.602060
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.853872
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.686133
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
116,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.522879
117,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.221849
118,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.221849
119,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.339704
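
Note that `pd.concat(..., axis=1)` pairs rows by position. In the `Name` column shown earlier, the descriptor file lists `CHEMBL3800311` at row 4, while row 4 of `df3` is `CHEMBL3798318`, so the two tables may not be in the same order. A safer alternative is to join on the molecule ID before dropping it. The sketch below illustrates this on toy data: the IDs and pIC50 values are taken from the outputs above, the fingerprint bits are made up, and the helper `build_dataset` is hypothetical. The `validate="one_to_one"` argument assumes each ID occurs once in both tables.

```python
import pandas as pd


def build_dataset(descriptors: pd.DataFrame, bioactivity: pd.DataFrame) -> pd.DataFrame:
    """Join fingerprints and pIC50 on the molecule ID instead of row position."""
    target = bioactivity[["molecule_chembl_id", "pIC50"]]
    merged = descriptors.merge(
        target,
        left_on="Name",
        right_on="molecule_chembl_id",
        how="inner",
        validate="one_to_one",  # raises if an ID is duplicated on either side
    )
    # The ID columns are only needed for the join, not as model features.
    return merged.drop(columns=["Name", "molecule_chembl_id"])


# Bioactivity rows in their original order.
bioactivity = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL3775576", "CHEMBL3775317", "CHEMBL3798663"],
    "pIC50": [7.585027, 8.397940, 8.602060],
})
# Descriptor rows deliberately listed in a different order (bit values are made up).
descriptors = pd.DataFrame({
    "Name": ["CHEMBL3798663", "CHEMBL3775576", "CHEMBL3775317"],
    "PubchemFP0": [1, 1, 1],
    "PubchemFP1": [0, 1, 1],
})

dataset = build_dataset(descriptors, bioactivity)
print(dataset)
```

On the real data this would be called as `build_dataset(pd.read_csv('Data/descriptors_output.csv'), df3)`, that is, before the `Name` column is dropped.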


In [13]:
dataset3.to_csv('Data/06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)