# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. In particular, we will build a machine learning model using ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then assemble them into a dataset for subsequent model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [68]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2025-01-24 02:20:40--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2025-01-24 02:20:40--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip.1’


2025-01-24 02:20:41 (229 MB/s) - ‘padel.zip.1’ saved [25768637/25768637]

--2025-01-24 02:20:41--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (
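
The `wget` output above saves the archive as `padel.zip.1`, which suggests that `padel.zip` was already present from an earlier run. As a minimal sketch (not part of the original notebook), the helper below skips a download when the target file already exists. The function name `download_if_missing` is hypothetical; the URLs in the commented usage are the ones from the cell above.

```python
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: str) -> bool:
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if the file was already there.
    """
    path = Path(dest)
    if path.exists():
        # Nothing to do: avoids creating copies such as padel.zip.1
        return False
    urllib.request.urlretrieve(url, str(path))
    return True


# Usage (commented out so the sketch does not need network access):
# base = "https://github.com/dataprofessor/bioinformatics/raw/master/"
# download_if_missing(base + "padel.zip", "padel.zip")
# download_if_missing(base + "padel.sh", "padel.sh")
```

With this approach, re-running the cell leaves a single `padel.zip` and a single `padel.sh` in the working directory.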

In [70]:
! unzip padel.zip

Archive:  padel.zip
replace __MACOSX/._PaDEL-Descriptor? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config 
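
The archive also unpacks a `__MACOSX` folder and `.DS_Store` files, which are macOS metadata rather than part of PaDEL-Descriptor. The sketch below is an alternative to `! unzip padel.zip`, not the notebook's own method: it uses Python's standard `zipfile` module and skips those entries. The helper name `extract_without_macos_metadata` is an assumption made for illustration.

```python
import zipfile
from pathlib import Path


def extract_without_macos_metadata(zip_path, out_dir="."):
    """Extract an archive, skipping __MACOSX/ entries and .DS_Store files.

    Returns the list of archive members that were extracted.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            member_path = Path(member)
            if "__MACOSX" in member_path.parts or member_path.name == ".DS_Store":
                continue  # macOS metadata, not needed to run PaDEL
            archive.extract(member, out_dir)
            extracted.append(member)
    return extracted


# Usage (assumes padel.zip is in the working directory):
# extract_without_macos_metadata("padel.zip")
```

Because existing files are overwritten silently, this also avoids the interactive `replace ...?` prompt shown above.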

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use a file that contains the pIC50 values needed for building a regression model. Note that the series names this file **bioactivity_data_3class_pIC50.csv**, whereas the cell below reads `df_2class.csv`.

In [71]:
import pandas as pd

In [72]:
df3 = pd.read_csv('df_2class.csv')

In [73]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL303519,c1cnc(N2CCN(Cc3cccc4c3Cc3ccccc3-4)CC2)nc1,active,342.446,3.37000,0.0,4.0,7.698970
1,CHEMBL292943,COc1ccc(-c2cccc(CN3CCN(c4ncccn4)CC3)c2)cc1,active,360.461,3.47440,0.0,5.0,7.244125
2,CHEMBL61682,Fc1ccc(-c2cncc(CN3CCN(c4ccccc4F)CC3)c2)cc1,active,365.427,4.34900,0.0,3.0,8.522879
3,CHEMBL61682,Fc1ccc(-c2cncc(CN3CCN(c4ccccc4F)CC3)c2)cc1,active,365.427,4.34900,0.0,3.0,8.154902
4,CHEMBL64487,COc1ccccc1-c1cccc(CN2CCN(c3ncccn3)CC2)c1,active,360.461,3.47440,0.0,5.0,7.795880
...,...,...,...,...,...,...,...,...
383,CHEMBL5434856,COc1ccc(C2CCN(CCCc3nc4ccccc4s3)CC2)nc1,active,351.519,4.81192,0.0,4.0,6.477556
384,CHEMBL5399689,Cc1ccc2nc(CCCN3CCC(c4ccccn4)CC3)sc2c1,active,367.518,4.51210,0.0,5.0,6.605548
385,CHEMBL5406300,COc1cccc2nc(CCCN3CCC(c4ccccn4)CC3)sc12,active,367.518,4.51210,0.0,5.0,6.913640
386,CHEMBL5399720,COc1ccc2nc(CCCN3CCC(c4ccccn4)CC3)sc2c1,active,395.478,3.21710,1.0,4.0,6.140261


In [74]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [75]:
! cat molecule.smi | head -5

c1cnc(N2CCN(Cc3cccc4c3Cc3ccccc3-4)CC2)nc1	CHEMBL303519
COc1ccc(-c2cccc(CN3CCN(c4ncccn4)CC3)c2)cc1	CHEMBL292943
Fc1ccc(-c2cncc(CN3CCN(c4ccccc4F)CC3)c2)cc1	CHEMBL61682
Fc1ccc(-c2cncc(CN3CCN(c4ccccc4F)CC3)c2)cc1	CHEMBL61682
COc1ccccc1-c1cccc(CN2CCN(c3ncccn3)CC2)c1	CHEMBL64487


In [76]:
! cat molecule.smi | wc -l

388
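
The first five lines of `molecule.smi` already show that a ChEMBL ID can occur more than once: `CHEMBL61682` appears twice, with two different pIC50 values in the table above. Since the molecule name is the only identifier that appears next to each fingerprint in the descriptor output later on, it can help to list such repeats before computing descriptors. A minimal sketch on a toy copy of those five rows (the variable names are illustrative):

```python
import pandas as pd

# Toy copy of the first five rows shown above
toy = pd.DataFrame({
    "canonical_smiles": [
        "c1cnc(N2CCN(Cc3cccc4c3Cc3ccccc3-4)CC2)nc1",
        "COc1ccc(-c2cccc(CN3CCN(c4ncccn4)CC3)c2)cc1",
        "Fc1ccc(-c2cncc(CN3CCN(c4ccccc4F)CC3)c2)cc1",
        "Fc1ccc(-c2cncc(CN3CCN(c4ccccc4F)CC3)c2)cc1",
        "COc1ccccc1-c1cccc(CN2CCN(c3ncccn3)CC2)c1",
    ],
    "molecule_chembl_id": [
        "CHEMBL303519", "CHEMBL292943", "CHEMBL61682",
        "CHEMBL61682", "CHEMBL64487",
    ],
    "pIC50": [7.698970, 7.244125, 8.522879, 8.154902, 7.795880],
})

# Mark every row whose ID occurs more than once
dup_mask = toy["molecule_chembl_id"].duplicated(keep=False)
duplicated_ids = sorted(toy.loc[dup_mask, "molecule_chembl_id"].unique())

print(duplicated_ids)
print(toy.loc[dup_mask, ["molecule_chembl_id", "pIC50"]])
```

On the full data, the same two lines can be applied to `df3` instead of `toy`.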


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [77]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
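
Reading the command: `-Xms1G -Xmx1G` fixes the Java heap at 1 GB, `-Djava.awt.headless=true` runs without a display, and the remaining options tell PaDEL to strip salts, standardize nitro groups, compute fingerprints of the type configured in `PubchemFingerprinter.xml`, read molecules from the current directory, and write `descriptors_output.csv`. As a hedged sketch, the same command can be launched from Python with `subprocess`. The wrapper `run_padel` and its `dry_run` flag are hypothetical; the command string itself is copied from `padel.sh`.

```python
import shlex
import subprocess

# Command copied verbatim from padel.sh
PADEL_COMMAND = (
    "java -Xms1G -Xmx1G -Djava.awt.headless=true "
    "-jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar "
    "-removesalt -standardizenitro -fingerprints "
    "-descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml "
    "-dir ./ -file descriptors_output.csv"
)


def run_padel(command: str = PADEL_COMMAND, dry_run: bool = True):
    """Split the command into arguments and optionally execute it.

    With dry_run=True (the default) nothing is executed, so the sketch
    can be inspected on a machine without Java installed.
    """
    args = shlex.split(command)
    if not dry_run:
        # check=True raises CalledProcessError if PaDEL exits with an error
        subprocess.run(args, check=True)
    return args


print(run_padel())
```

Calling `run_padel(dry_run=False)` would execute the command and raise an exception on failure, which a plain `! bash padel.sh` cell does not do.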


In [78]:
! bash padel.sh

Processing CHEMBL303519 in molecule.smi (1/185). 
Processing CHEMBL292943 in molecule.smi (2/185). 
Processing CHEMBL61682 in molecule.smi (3/185). Average speed: 3.14 s/mol.
Processing CHEMBL61682 in molecule.smi (4/185). Average speed: 1.67 s/mol.
Processing CHEMBL64487 in molecule.smi (5/185). Average speed: 2.35 s/mol.
Processing CHEMBL64597 in molecule.smi (6/185). Average speed: 1.21 s/mol.
Processing CHEMBL64597 in molecule.smi (7/185). Average speed: 1.16 s/mol.
Processing CHEMBL291824 in molecule.smi (8/185). Average speed: 0.98 s/mol.
Processing CHEMBL59942 in molecule.smi (9/185). Average speed: 0.91 s/mol.
Processing CHEMBL59942 in molecule.smi (10/185). Average speed: 0.95 s/mol.
Processing CHEMBL61657 in molecule.smi (11/185). Average speed: 0.78 s/mol.
Processing CHEMBL302183 in molecule.smi (13/185). Average speed: 0.77 s/mol.
Processing CHEMBL302183 in molecule.smi (12/185). Average speed: 0.73 s/mol.
Processing CHEMBL64622 in molecule.smi (14/185). Average speed: 0.72

In [79]:
! ls -l

total 50764
-rw-r--r-- 1 root root   340065 Jan 24 02:23 descriptors_output.csv
-rw-r--r-- 1 root root    48673 Jan 24 01:51 df_2class.csv
drwxr-xr-x 3 root root     4096 Jan 24 02:21 __MACOSX
-rw-r--r-- 1 root root    22572 Jan 24 02:22 molecule.smi
drwxrwxr-x 4 root root     4096 Jan 24 02:21 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Jan 24 01:51 padel.sh
-rw-r--r-- 1 root root      231 Jan 24 02:20 padel.sh.1
-rw-r--r-- 1 root root 25768637 Jan 24 02:00 padel.zip
-rw-r--r-- 1 root root 25768637 Jan 24 02:20 padel.zip.1
drwxr-xr-x 1 root root     4096 Jan 22 14:23 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [81]:
df3_X = pd.read_csv('descriptors_output.csv')

In [82]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL292943,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL303519,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL61682,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL61682,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL64597,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180,CHEMBL1642126,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
181,CHEMBL1642128,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
182,CHEMBL42,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
183,CHEMBL3216758,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
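
Two details of this table are worth checking. The `Name` column begins with `CHEMBL292943`, whereas `molecule.smi` begins with `CHEMBL303519`, and the table ends at index 183 although `molecule.smi` has 388 lines. The descriptor rows therefore do not necessarily line up with the input rows. The sketch below compares the two ID lists on toy data taken from the first rows shown above; the helper `compare_ids` is hypothetical.

```python
import pandas as pd

# First five IDs of molecule.smi and of the descriptor table, as shown above
input_ids = pd.Series(
    ["CHEMBL303519", "CHEMBL292943", "CHEMBL61682", "CHEMBL61682", "CHEMBL64487"],
    name="molecule_chembl_id",
)
output_names = pd.Series(
    ["CHEMBL292943", "CHEMBL303519", "CHEMBL61682", "CHEMBL61682", "CHEMBL64597"],
    name="Name",
)


def compare_ids(input_ids, output_names):
    """Report whether two ID sequences agree in length, order, and content."""
    inputs, outputs = list(input_ids), list(output_names)
    return {
        "same_length": len(inputs) == len(outputs),
        "same_order": inputs == outputs,
        "missing_from_output": sorted(set(inputs) - set(outputs)),
    }


report = compare_ids(input_ids, output_names)
print(report)
```

On the real data, the call would be `compare_ids(df3["molecule_chembl_id"], df3_X["Name"])`, run before the `Name` column is dropped.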


In [83]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
181,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
182,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
183,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### **Y variable**

The pIC50 values were already computed during pre-processing, so here we only select the `pIC50` column.

In [84]:
df3_Y = df3['pIC50']
df3_Y

Unnamed: 0,pIC50
0,7.698970
1,7.244125
2,8.522879
3,8.154902
4,7.795880
...,...
383,6.477556
384,6.605548
385,6.913640
386,6.140261
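
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes the IC50 is given in nanomolar (nM), which is an assumption about the upstream data rather than something shown in this notebook; the function name is illustrative.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    ic50_molar = ic50_nm * 1e-9  # 1 nM = 1e-9 M
    return -math.log10(ic50_molar)


# A higher pIC50 means a lower IC50, i.e. a more potent compound
for ic50_nm in (20, 10000):
    print(ic50_nm, "nM ->", round(pic50_from_ic50_nm(ic50_nm), 6))
```

Under this assumption, the first value in the column above (7.698970) corresponds to an IC50 of about 20 nM, and a pIC50 of 5.0 corresponds to 10,000 nM.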


## **Combining X and Y variables**

In [89]:
df3_X = df3_X.reset_index(drop=True)
df3_Y = df3_Y.reset_index(drop=True)
dataset3 = pd.concat([df3_X, df3_Y], axis=1)

In [92]:
dataset3 = pd.merge(df3_X, df3_Y, left_index=True, right_index=True)

In [94]:
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.698970
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.244125
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.522879
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.154902
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.795880
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
181,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.000000
182,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.920819
183,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.404283
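
Both cells above pair rows purely by position. Since the descriptor file lists molecules in a different order than `df3` and contains fewer rows, a positional join can attach a pIC50 value to the wrong fingerprint. One possible remedy, sketched here under stated assumptions, is to give each input row a unique key, write that key as the molecule name in `molecule.smi`, and join on it afterwards. The `row_key` column and the toy fingerprint bits are hypothetical; a plain join on `molecule_chembl_id` would not be enough because some IDs repeat.

```python
import pandas as pd

# Toy bioactivity table (IDs and pIC50 values from the first rows above)
bio = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL303519", "CHEMBL292943", "CHEMBL61682", "CHEMBL61682"],
    "pIC50": [7.698970, 7.244125, 8.522879, 8.154902],
})
# Hypothetical unique key per row; this is what would be written to molecule.smi
bio["row_key"] = [f"row{i}" for i in bio.index]

# Toy stand-in for descriptors_output.csv: rows arrive in a different order
# and the fingerprint bits are made up for illustration
desc = pd.DataFrame({
    "Name": ["row1", "row0", "row3", "row2"],
    "PubchemFP0": [1, 1, 1, 1],
    "PubchemFP1": [0, 1, 1, 0],
})

# Join on the key instead of on row position; validate guards against
# duplicated keys on either side
merged = desc.merge(
    bio[["row_key", "pIC50"]],
    left_on="Name",
    right_on="row_key",
    how="inner",
    validate="one_to_one",
)
dataset = merged.drop(columns=["Name", "row_key"])
print(dataset)
```

The inner join keeps only molecules present in both tables, so rows without descriptors drop out explicitly instead of being paired with an unrelated row.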


In [93]:
dataset3.to_csv('drd4_bioactivity_data_2class_pIC50_pubchem_fp.csv', index=False)
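
Before moving on to model building, it can be worth reading the saved file back and splitting it into features and target as a quick sanity check. The sketch below uses an in-memory toy CSV with made-up fingerprint bits; the helper `split_features_target` is hypothetical.

```python
import io

import pandas as pd

# Toy stand-in for the saved CSV (fingerprint bits are made up)
csv_text = """PubchemFP0,PubchemFP1,pIC50
1,1,7.698970
1,0,7.244125
1,1,8.522879
"""
dataset = pd.read_csv(io.StringIO(csv_text))


def split_features_target(dataset, target="pIC50"):
    """Split a dataset into a feature matrix X and a target vector y."""
    if dataset.isna().any().any():
        # Missing values usually mean X and Y had different numbers of rows
        raise ValueError("dataset contains missing values; check the join step")
    X = dataset.drop(columns=[target])
    y = dataset[target]
    return X, y


X, y = split_features_target(dataset)
print(X.shape, y.shape)
```

For the real file, `pd.read_csv('drd4_bioactivity_data_2class_pIC50_pubchem_fp.csv')` would replace the toy CSV.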

# **Let's download the CSV file to your local computer for Part 3B (Model Building).**