# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will be building a real-life **data science project** that you can include in your **data science portfolio**. In particular, we will be building a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are essentially quantitative descriptions of the compounds in the dataset. Finally, we will prepare these as a dataset for subsequent model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [24]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2024-06-23 10:08:22--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2024-06-23 10:08:22--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip.2’


2024-06-23 10:08:22 (119 MB/s) - ‘padel.zip.2’ saved [25768637/25768637]

--2024-06-23 10:08:22--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (

In [22]:
! unzip padel.zip

Archive:  padel.zip
replace __MACOSX/._PaDEL-Descriptor? [y]es, [n]o, [A]ll, [N]one, [r]ename: 
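
The output above shows two side effects of re-running these cells: `wget` saves a fresh copy under a new name (`padel.zip.2`), and `unzip` stops to ask whether existing files should be replaced. Below is a minimal Python sketch that avoids both, assuming the files only need to be fetched once. The helper names `fetch_once` and `extract_all` are hypothetical and not part of the original notebook.

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch_once(url: str, dest: str) -> bool:
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if the file was already there.
    """
    path = Path(dest)
    if path.exists():
        return False
    urllib.request.urlretrieve(url, str(path))
    return True


def extract_all(zip_path: str, target_dir: str = ".") -> None:
    """Extract every archive member, overwriting existing files without a prompt."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)


# Usage in the notebook (needs network access, so it is left commented out here):
# base = "https://github.com/dataprofessor/bioinformatics/raw/master/"
# fetch_once(base + "padel.zip", "padel.zip")
# fetch_once(base + "padel.sh", "padel.sh")
# extract_all("padel.zip")
```

Because `fetch_once` checks for the destination file first, re-running the cell should not create `padel.zip.1`, `padel.zip.2`, and so on.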

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [None]:
! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv

--2020-06-09 17:00:26--  https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 655414 (640K) [text/plain]
Saving to: ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv’


2020-06-09 17:00:26 (9.21 MB/s) - ‘acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv’ saved [655414/655414]



In [25]:
import pandas as pd

In [26]:
df3 = pd.read_csv('/content/df_2class.csv')

In [27]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL318782,O=C1C(Cl)=C(Cl)C(=O)c2ccccc21,active,227.046,2.75480,0.0,2.0,6.552842
1,CHEMBL272225,CC1(C)OC2=C(C(=O)c3ccccc3C2=O)C(O)C1Br,active,337.169,2.25290,1.0,4.0,6.290730
2,CHEMBL443067,COC1C2=C(OC(C)(C)C1O)C(=O)c1ccccc1C2=O,active,288.299,1.50440,1.0,5.0,6.010550
3,CHEMBL271827,CC1(C)OC2=C(C(=O)c3ccccc3C2=O)C(N2CCOCC2)C1O,active,343.379,1.19020,1.0,6.0,6.442493
4,CHEMBL260961,CCCCNC1C2=C(OC(C)(C)C1O)C(=O)c1ccccc1C2=O,active,329.396,2.24770,2.0,5.0,7.086186
...,...,...,...,...,...,...,...,...
3401,CHEMBL5273615,Cc1ccc(NC(=O)Nc2cc(-c3ccccc3-c3nc(=O)[nH]o3)cc...,active,532.600,6.95382,3.0,5.0,8.823909
3402,CHEMBL5273615,Cc1ccc(NC(=O)Nc2cc(-c3ccccc3-c3nc(=O)[nH]o3)cc...,active,532.600,6.95382,3.0,5.0,8.337242
3403,CHEMBL5267407,Cc1ccc(NC(=O)Nc2cc(-c3ccccc3-c3nn[nH]n3)cc3c2O...,active,516.605,6.79062,3.0,5.0,8.443697
3404,CHEMBL5267407,Cc1ccc(NC(=O)Nc2cc(-c3ccccc3-c3nn[nH]n3)cc3c2O...,active,516.605,6.79062,3.0,5.0,7.823909


In [28]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [29]:
! cat molecule.smi | head -5

O=C1C(Cl)=C(Cl)C(=O)c2ccccc21	CHEMBL318782
CC1(C)OC2=C(C(=O)c3ccccc3C2=O)C(O)C1Br	CHEMBL272225
COC1C2=C(OC(C)(C)C1O)C(=O)c1ccccc1C2=O	CHEMBL443067
CC1(C)OC2=C(C(=O)c3ccccc3C2=O)C(N2CCOCC2)C1O	CHEMBL271827
CCCCNC1C2=C(OC(C)(C)C1O)C(=O)c1ccccc1C2=O	CHEMBL260961


In [30]:
! cat molecule.smi | wc -l

3406
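
The line count of `molecule.smi` should equal the number of rows in the DataFrame. The table above also shows that some ChEMBL IDs occur more than once (for example `CHEMBL5273615`), so the names written to the file are not unique. The sketch below runs both checks on a small toy frame. The toy data and the file name `toy_molecule.smi` are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df3; the real frame comes from the CSV loaded above.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL1", "CHEMBL2", "CHEMBL2"],
    "canonical_smiles": ["CCO", "c1ccccc1", "c1ccccc1"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    smi_path = os.path.join(tmp_dir, "toy_molecule.smi")

    # Same export as in the notebook: SMILES first, ID second, tab-separated.
    toy[["canonical_smiles", "molecule_chembl_id"]].to_csv(
        smi_path, sep="\t", index=False, header=False
    )

    # Count the lines that were written (the Python equivalent of `wc -l`).
    with open(smi_path) as handle:
        n_lines = sum(1 for _ in handle)

# IDs that occur more than once will also repeat in the descriptor output.
id_counts = toy["molecule_chembl_id"].value_counts()
repeated_ids = id_counts[id_counts > 1].index.tolist()

print(n_lines == len(toy), repeated_ids)
```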


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [31]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
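
Roughly, the command runs PaDEL-Descriptor headless with a 1 GB Java heap, removes salts and standardizes nitro groups, computes fingerprints of the type listed in `PubchemFingerprinter.xml`, reads structure files from the current directory (which is how `molecule.smi` is picked up), and writes the result to `descriptors_output.csv`. As a sketch, the same command can be assembled in Python so that individual arguments are easier to change. The function name `padel_command` and its parameters are hypothetical; the flags are exactly those in `padel.sh`.

```python
import shlex
import subprocess


def padel_command(
    jar: str = "./PaDEL-Descriptor/PaDEL-Descriptor.jar",
    descriptor_xml: str = "./PaDEL-Descriptor/PubchemFingerprinter.xml",
    input_dir: str = "./",
    output_csv: str = "descriptors_output.csv",
    heap: str = "1G",
) -> list:
    """Build the argument list used in padel.sh, one list item per argument."""
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", jar,
        "-removesalt",          # strip salts before calculation
        "-standardizenitro",    # standardize nitro groups
        "-fingerprints",        # calculate fingerprints
        "-descriptortypes", descriptor_xml,
        "-dir", input_dir,      # directory holding the structure files
        "-file", output_csv,    # where the descriptor table is written
    ]


cmd = padel_command()
print(shlex.join(cmd))

# To actually run it (requires Java and the unzipped PaDEL-Descriptor folder):
# subprocess.run(cmd, check=True)
```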


In [32]:
! bash padel.sh

Processing CHEMBL318782 in molecule.smi (1/3406). 
Processing CHEMBL272225 in molecule.smi (2/3406). 
Processing CHEMBL443067 in molecule.smi (3/3406). Average speed: 4.97 s/mol.
Processing CHEMBL260961 in molecule.smi (5/3406). Average speed: 2.03 s/mol.
Processing CHEMBL271827 in molecule.smi (4/3406). Average speed: 2.68 s/mol.
Processing CHEMBL261179 in molecule.smi (6/3406). Average speed: 1.70 s/mol.
Processing CHEMBL261178 in molecule.smi (7/3406). Average speed: 1.56 s/mol.
Processing CHEMBL407954 in molecule.smi (9/3406). Average speed: 1.35 s/mol.
Processing CHEMBL261178 in molecule.smi (8/3406). Average speed: 1.45 s/mol.
Processing CHEMBL407954 in molecule.smi (10/3406). Average speed: 1.29 s/mol.
Processing CHEMBL272032 in molecule.smi (11/3406). Average speed: 1.30 s/mol.
Processing CHEMBL556930 in molecule.smi (13/3406). Average speed: 1.12 s/mol.
Processing CHEMBL260523 in molecule.smi (12/3406). Average speed: 1.22 s/mol.
Processing CHEMBL260744 in molecule.smi (14/340

In [33]:
! ls -l

total 82076
-rw-r--r-- 1 root root  6066836 Jun 23 10:20 descriptors_output.csv
-rw-r--r-- 1 root root   422388 Jun 23 10:03 df_2class.csv
drwx------ 5 root root     4096 Jun 23 09:05 drive
drwxr-xr-x 3 root root     4096 Jun 23 09:04 __MACOSX
-rw-r--r-- 1 root root   200807 Jun 23 10:08 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Jun 23 09:04 padel.sh
-rw-r--r-- 1 root root      231 Jun 23 10:04 padel.sh.1
-rw-r--r-- 1 root root      231 Jun 23 10:08 padel.sh.2
-rw-r--r-- 1 root root 25768637 Jun 23 09:04 padel.zip
-rw-r--r-- 1 root root 25768637 Jun 23 10:04 padel.zip.1
-rw-r--r-- 1 root root 25768637 Jun 23 10:08 padel.zip.2
drwxr-xr-x 1 root root     4096 Jun 20 18:46 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [34]:
df3_X = pd.read_csv('descriptors_output.csv')

In [35]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL318782,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL272225,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL443067,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL271827,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL260961,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3401,CHEMBL5273615,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3402,CHEMBL5273615,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3403,CHEMBL5267407,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3404,CHEMBL4459180,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
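
Note that the PaDEL log above processes some molecules out of order (for example 5 before 4), and the last row shown here (`CHEMBL4459180`) differs from the last row shown for `df3` (`CHEMBL5267407`). The rows of `descriptors_output.csv` may therefore not follow the order of `df3`, which matters because X and Y are later combined by position. The sketch below reorders the descriptor rows to follow the bioactivity table before `Name` is dropped. The helper `align_by_name` is hypothetical, and it assumes that repeated IDs share the same structure, so their fingerprints are interchangeable.

```python
import pandas as pd


def align_by_name(df_meta, df_desc, id_col="molecule_chembl_id", name_col="Name"):
    """Return the rows of df_desc reordered to follow the ID order in df_meta.

    Repeated IDs are matched by their occurrence number (first with first,
    second with second), which keeps the merge strictly one-to-one.
    """
    keys = df_meta[[id_col]].reset_index(drop=True)
    keys["_occ"] = keys.groupby(id_col).cumcount()

    desc = df_desc.rename(columns={name_col: id_col})
    desc["_occ"] = desc.groupby(id_col).cumcount()

    # A left merge keeps the row order of `keys`, i.e. the order of df_meta.
    merged = keys.merge(desc, on=[id_col, "_occ"], how="left", validate="one_to_one")
    return merged.drop(columns="_occ").rename(columns={id_col: name_col})


# Toy data: the descriptor rows arrive in a different order than the metadata.
meta = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "B", "C"],
    "pIC50": [1.0, 2.0, 3.0, 4.0],
})
desc = pd.DataFrame({
    "Name": ["A", "B", "C", "B"],
    "PubchemFP0": [1, 0, 1, 0],
    "PubchemFP1": [0, 1, 1, 1],
})

aligned = align_by_name(meta, desc)
print(aligned)

# In the notebook this would be: df3_X = align_by_name(df3, df3_X)
```

A quick check such as `(df3_X['Name'].values == df3['molecule_chembl_id'].values).all()` shows whether any reordering is needed at all.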


In [36]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3401,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3402,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3403,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3404,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


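### **Y variable**

The pIC50 values were already computed during the pre-processing in Parts 1 and 2, so here we only select the `pIC50` column from the bioactivity data.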

In [37]:
df3_Y = df3['pIC50']
df3_Y

0       6.552842
1       6.290730
2       6.010550
3       6.442493
4       7.086186
          ...   
3401    8.823909
3402    8.337242
3403    8.443697
3404    7.823909
3405    6.958607
Name: pIC50, Length: 3406, dtype: float64
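
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L, so larger values mean more potent compounds. The sketch below converts in both directions, assuming the IC50 is given in nanomolar. That unit is an assumption, since it is not stated in this part, and the function names are hypothetical.

```python
import math


def pic50_from_nM(ic50_nM: float) -> float:
    """pIC50 = -log10(IC50 in mol/L), with 1 nM = 1e-9 mol/L."""
    return -math.log10(ic50_nM * 1e-9)


def ic50_nM_from_pic50(pic50: float) -> float:
    """Inverse conversion: IC50 in nM = 10 ** (9 - pIC50)."""
    return 10 ** (9 - pic50)


# Under the nM assumption, the first value above (6.552842) maps to about 280 nM.
print(round(ic50_nM_from_pic50(6.552842), 1))
print(round(pic50_from_nM(280), 6))
```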

## **Combining the X and Y variables**

In [38]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.552842
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.290730
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.010550
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.442493
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.086186
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3401,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.823909
3402,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.337242
3403,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.443697
3404,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.823909


In [39]:
dataset3.to_csv('bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**