# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are essentially quantitative descriptions of the compounds in the dataset. Finally, we will assemble these descriptors into a dataset for subsequent model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2024-12-22 20:39:18--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2024-12-22 20:39:18--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2024-12-22 20:39:19 (170 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2024-12-22 20:39:19--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.
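
As a reminder of what the pIC50 column represents, here is a minimal sketch of the usual definition, pIC50 = -log10(IC50 in molar). The function name `pic50_from_ic50_nm` is hypothetical, and the assumption that the raw IC50 is given in nanomolar is ours; the exact pre-processing used in Parts 1 and 2 may differ (for example, it may cap very large values).

```python
import math

def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar (assumed unit) to pIC50."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    ic50_molar = ic50_nm * 1e-9      # nM -> M
    return -math.log10(ic50_molar)   # pIC50 = -log10(IC50 [M])

# A weak binder (100 µM) and a potent one (70 nM)
weak = pic50_from_ic50_nm(100_000)
potent = pic50_from_ic50_nm(70)
print(round(weak, 3), round(potent, 3))
```

A higher pIC50 means a lower IC50, that is, a more potent compound, which is why the regression model in the next part predicts pIC50 rather than the raw concentration.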

In [5]:
! wget https://raw.githubusercontent.com/VictoriaLiendro/drug-analysis-tutorial/refs/heads/main/data/erbB1_04_bioactivity_data_3class_pIC50.csv

--2024-12-22 20:44:19--  https://raw.githubusercontent.com/VictoriaLiendro/drug-analysis-tutorial/refs/heads/main/data/erbB1_04_bioactivity_data_3class_pIC50.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10490 (10K) [text/plain]
Saving to: ‘erbB1_04_bioactivity_data_3class_pIC50.csv’


2024-12-22 20:44:19 (87.3 MB/s) - ‘erbB1_04_bioactivity_data_3class_pIC50.csv’ saved [10490/10490]



In [4]:
import pandas as pd

In [6]:
df3 = pd.read_csv('erbB1_04_bioactivity_data_3class_pIC50.csv')

In [7]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL292323,COc1cccc2c(C(=O)Nc3ccccc3)c(SSc3c(C(=O)Nc4cccc...,inactive,622.772,7.9912,2.0,8.0,4.000000
1,1,CHEMBL304414,Cn1c(SSc2c(C(=O)Nc3ccccc3)c3ccccc3n2C)c(C(=O)N...,inactive,562.720,7.9740,2.0,6.0,4.000000
2,2,CHEMBL62176,CN1C(=S)C(C(=O)Nc2ccccc2)c2ccccc21,inactive,282.368,3.1861,1.0,2.0,4.000000
3,3,CHEMBL62701,Cn1c(SSc2c(C(=O)Nc3ccccc3)c3cccnc3n2C)c(C(=O)N...,inactive,564.696,6.7640,2.0,8.0,4.602060
4,4,CHEMBL137617,C/N=N/Nc1ccc2ncnc(Nc3cccc(Br)c3)c2c1,active,357.215,4.5448,2.0,5.0,7.154902
...,...,...,...,...,...,...,...,...,...
76,76,CHEMBL5430172,CN(C(=O)Nc1ccccn1)C1CCCC1,inactive,219.288,2.4879,1.0,2.0,3.100000
77,77,CHEMBL5411911,CN(C(=O)N(C)C1CCCC1)c1ccccc1,inactive,232.327,3.1172,0.0,1.0,3.900000
78,78,CHEMBL5420832,CN(C(=O)N(C)C1CCCC1)c1ccccn1,inactive,233.315,2.5122,0.0,2.0,3.000000
79,79,CHEMBL5424516,COc1cc2ncnc(NCc3ccc(NC(=O)C(c4ccccc4)N4Cc5cccc...,intermediate,672.786,5.6874,2.0,9.0,5.838632


In [8]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [9]:
! cat molecule.smi | head -5

COc1cccc2c(C(=O)Nc3ccccc3)c(SSc3c(C(=O)Nc4ccccc4)c4cccc(OC)c4n3C)n(C)c12	CHEMBL292323
Cn1c(SSc2c(C(=O)Nc3ccccc3)c3ccccc3n2C)c(C(=O)Nc2ccccc2)c2ccccc21	CHEMBL304414
CN1C(=S)C(C(=O)Nc2ccccc2)c2ccccc21	CHEMBL62176
Cn1c(SSc2c(C(=O)Nc3ccccc3)c3cccnc3n2C)c(C(=O)Nc2ccccc2)c2cccnc21	CHEMBL62701
C/N=N/Nc1ccc2ncnc(Nc3cccc(Br)c3)c2c1	CHEMBL137617


In [10]:
! cat molecule.smi | wc -l

81
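
The same check can be done from Python, which makes it easy to compare the number of lines written with the number of rows in the DataFrame. This is a sketch on hypothetical toy data (the `ID_*` identifiers and the `toy` frame are placeholders for `df3_selection`); it also confirms that every line has exactly two tab-separated fields, the SMILES string and the molecule ID.

```python
import pandas as pd

# Hypothetical toy data standing in for df3_selection
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "molecule_chembl_id": ["ID_A", "ID_B", "ID_C"],
})

# With no path argument, to_csv returns the text it would have written
smi_text = toy.to_csv(sep="\t", index=False, header=False)
lines = smi_text.splitlines()

n_lines = len(lines)
all_two_fields = all(len(line.split("\t")) == 2 for line in lines)
ids_unique = toy["molecule_chembl_id"].is_unique

print(n_lines == len(toy), all_two_fields, ids_unique)
```

If the line count and the row count disagree on the real data, it is worth inspecting the SMILES column for missing values or embedded line breaks before running the descriptor calculation.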


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [11]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
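
For readers who prefer to launch PaDEL-Descriptor from Python rather than through the shell script, the command above can be assembled as an argument list. This is only a sketch: the helper `build_padel_command` and its parameter names are hypothetical, the flags are copied verbatim from `padel.sh`, and the comments describe the flags as we understand them.

```python
import subprocess

def build_padel_command(
    jar="./PaDEL-Descriptor/PaDEL-Descriptor.jar",
    descriptor_xml="./PaDEL-Descriptor/PubchemFingerprinter.xml",
    input_dir="./",
    output_csv="descriptors_output.csv",
    heap="1G",
):
    """Return the PaDEL-Descriptor invocation from padel.sh as a list."""
    return [
        "java",
        f"-Xms{heap}", f"-Xmx{heap}",        # initial and maximum JVM heap size
        "-Djava.awt.headless=true",          # run without a graphical display
        "-jar", jar,
        "-removesalt",                       # strip salts before calculation
        "-standardizenitro",                 # standardize nitro groups
        "-fingerprints",                     # compute fingerprints
        "-descriptortypes", descriptor_xml,  # which fingerprint type to compute
        "-dir", input_dir,                   # folder containing the .smi file
        "-file", output_csv,                 # where the results are written
    ]

cmd = build_padel_command()
print(" ".join(cmd))

# To actually run it (requires Java and the unzipped PaDEL-Descriptor folder):
# subprocess.run(cmd, check=True)
```

Swapping `descriptor_xml` for one of the other XML files unpacked from `padel.zip` (for example `MACCSFingerprinter.xml`) would presumably select a different fingerprint type, although that variation is not tested in this notebook.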


In [12]:
! bash padel.sh

Processing CHEMBL292323 in molecule.smi (1/81). 
Processing CHEMBL304414 in molecule.smi (2/81). 
Processing CHEMBL62176 in molecule.smi (3/81). Average speed: 5.81 s/mol.
Processing CHEMBL62701 in molecule.smi (4/81). Average speed: 2.96 s/mol.
Processing CHEMBL137617 in molecule.smi (5/81). Average speed: 2.09 s/mol.
Processing CHEMBL153577 in molecule.smi (6/81). Average speed: 1.69 s/mol.
Processing CHEMBL137189 in molecule.smi (8/81). Average speed: 1.22 s/mol.
Processing CHEMBL152448 in molecule.smi (7/81). Average speed: 1.45 s/mol.
Processing CHEMBL52765 in molecule.smi (9/81). Average speed: 1.28 s/mol.
Processing CHEMBL152922 in molecule.smi (10/81). Average speed: 0.98 s/mol.
Processing CHEMBL7917 in molecule.smi (11/81). Average speed: 0.90 s/mol.
Processing CHEMBL7827 in molecule.smi (12/81). Average speed: 0.83 s/mol.
Processing CHEMBL440453 in molecule.smi (13/81). Average speed: 0.77 s/mol.
Processing CHEMBL7810 in molecule.smi (14/81). Average speed: 0.74 s/mol.
Proces

In [13]:
! ls -l

total 26000
-rw-r--r-- 1 root root   655414 Dec 22 20:39 acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
-rw-r--r-- 1 root root   155286 Dec 22 20:52 descriptors_output.csv
-rw-r--r-- 1 root root    10490 Dec 22 20:44 erbB1_04_bioactivity_data_3class_pIC50.csv
drwxr-xr-x 3 root root     4096 Dec 22 20:39 __MACOSX
-rw-r--r-- 1 root root     4777 Dec 22 20:44 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Dec 22 20:39 padel.sh
-rw-r--r-- 1 root root 25768637 Dec 22 20:39 padel.zip
drwxr-xr-x 1 root root     4096 Dec 19 14:20 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [14]:
df3_X = pd.read_csv('descriptors_output.csv')

In [15]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL292323,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL304414,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL62176,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL137617,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL62701,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,CHEMBL5430172,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
77,CHEMBL5420832,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
78,CHEMBL5411911,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
79,CHEMBL4284413,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
77,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
78,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
79,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


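### **Y variable**

The bioactivity file loaded above already contains the pIC50 values, so no conversion is needed at this step; we simply select the `pIC50` column.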

In [17]:
df3_Y = df3['pIC50']
df3_Y

Unnamed: 0,pIC50
0,4.000000
1,4.000000
2,4.000000
3,4.602060
4,7.154902
...,...
76,3.100000
77,3.900000
78,3.000000
79,5.838632


## **Combining the X and Y variables**

In [18]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.000000
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.602060
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.154902
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
76,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.100000
77,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.900000
78,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.000000
79,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.838632
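
One caveat: `pd.concat(..., axis=1)` pairs rows by position, and the descriptor output is not guaranteed to be in the same order as the input file. In the tables above, for instance, CHEMBL62701 and CHEMBL137617 sit at rows 3 and 4 of `df3` but at rows 4 and 3 of the descriptor table. A safer pattern, sketched here on hypothetical toy data (the `ID_*` identifiers and the frames `fp` and `activity` are placeholders), is to keep the `Name` column and merge on the molecule ID before dropping it.

```python
import pandas as pd

# Hypothetical stand-in for descriptors_output.csv: rows arrive out of order
fp = pd.DataFrame({
    "Name": ["ID_A", "ID_C", "ID_B"],
    "PubchemFP0": [1, 0, 1],
    "PubchemFP1": [0, 1, 1],
})

# Hypothetical stand-in for the bioactivity table
activity = pd.DataFrame({
    "molecule_chembl_id": ["ID_A", "ID_B", "ID_C"],
    "pIC50": [4.0, 5.5, 7.2],
})

# Join on the identifier; validate raises if an ID appears more than once
merged = fp.merge(
    activity,
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)

# Drop the identifier columns only after the rows have been matched
dataset = merged.drop(columns=["Name", "molecule_chembl_id"])
print(dataset)
```

If the real data contain repeated molecule IDs, `validate="one_to_one"` will raise an error, which is a useful signal that duplicates need to be resolved before modelling.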


In [19]:
dataset3.to_csv('erbB1_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)
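
Before moving on to model building, it can help to run a few quick sanity checks on the combined dataset. The sketch below uses a hypothetical helper, `check_dataset`, on toy data; with the real data one would pass `dataset3` instead.

```python
import pandas as pd

# Hypothetical toy data standing in for dataset3
toy = pd.DataFrame({
    "PubchemFP0": [1, 0],
    "PubchemFP1": [0, 1],
    "pIC50": [4.0, 7.2],
})

def check_dataset(frame, target="pIC50"):
    """Summarize shape, missing values, and whether features are 0/1 bits."""
    features = frame.drop(columns=[target])
    return {
        "n_rows": len(frame),
        "n_features": features.shape[1],
        "has_missing": bool(frame.isna().any().any()),
        "features_binary": bool(features.isin([0, 1]).all().all()),
    }

report = check_dataset(toy)
print(report)
```

Missing values in the combined table usually indicate that the fingerprint and activity tables had different lengths or indices when they were concatenated.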

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**