
# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we build a machine learning model from ChEMBL bioactivity data.

In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset, and then assemble them into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2024-11-30 02:55:12--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2024-11-30 02:55:12--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2024-11-30 02:55:13 (110 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2024-11-30 02:55:13--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **dengue_virus_04_bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that serve as the target variable for the regression model.

In [3]:
! wget https://raw.githubusercontent.com/aghnisyaa/bioactivity_project/refs/heads/main/dengue_virus_04_bioactivity_data_3class_pIC50.csv

--2024-11-30 02:57:45--  https://raw.githubusercontent.com/aghnisyaa/bioactivity_project/refs/heads/main/dengue_virus_04_bioactivity_data_3class_pIC50.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10717 (10K) [text/plain]
Saving to: ‘dengue_virus_04_bioactivity_data_3class_pIC50.csv’


2024-11-30 02:57:46 (71.9 MB/s) - ‘dengue_virus_04_bioactivity_data_3class_pIC50.csv’ saved [10717/10717]



In [4]:
import pandas as pd

In [5]:
df3 = pd.read_csv('dengue_virus_04_bioactivity_data_3class_pIC50.csv')

In [6]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL575429,O=C(O)/C=C/c1ccc(OS(=O)(=O)O)cc1,inactive,244.224,0.9660,2.0,4.0,2.698970
1,1,CHEMBL574855,CN(CCCNC(=O)c1ccc(O)cc1)CCCNC(=O)c1ccc(O)cc1,inactive,385.464,1.9696,4.0,5.0,2.698970
2,2,CHEMBL574190,CCN(CCCN(CC)C(=O)c1ccc(O)cc1)C(=O)c1ccc(O)cc1,inactive,370.449,3.1123,2.0,4.0,2.698970
3,3,CHEMBL575724,CN(CCCNC(=O)c1ccc(O)cc1)CCCNC(=O)c1ccc2cc(O)cc...,inactive,435.524,3.1228,4.0,5.0,3.531653
4,4,CHEMBL582943,CCN(CCOC(=O)/C=C/c1ccc(O)cc1)Cc1cc(Cl)ccc1O,inactive,375.852,3.8297,2.0,5.0,4.337242
...,...,...,...,...,...,...,...,...,...
64,64,CHEMBL495739,Cl.O=S(=O)(NCCNC/C=C/c1ccc(Br)cc1)c1cccc2cnccc12,active,482.831,4.0004,2.0,4.0,6.017729
65,65,CHEMBL5170343,O=C(O)/C=C/C#CCCCCCCCC(=O)O,intermediate,238.283,2.4459,2.0,2.0,5.602060
66,66,CHEMBL251279,CCCCC#CC#CC#CCCCCCCCC(=O)O,intermediate,272.388,4.0021,1.0,1.0,5.568636
67,67,CHEMBL2204412,CCCC/C=C/C#CC#CCCCCCCCC(=O)O,intermediate,274.404,4.5549,1.0,1.0,5.657577


In [7]:
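# PaDEL-Descriptor reads .smi input: one molecule per line, SMILES first, then the molecule name.
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
# Write tab-separated text with no header row and no index column.
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)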

In [8]:
! head -5 molecule.smi

O=C(O)/C=C/c1ccc(OS(=O)(=O)O)cc1	CHEMBL575429
CN(CCCNC(=O)c1ccc(O)cc1)CCCNC(=O)c1ccc(O)cc1	CHEMBL574855
CCN(CCCN(CC)C(=O)c1ccc(O)cc1)C(=O)c1ccc(O)cc1	CHEMBL574190
CN(CCCNC(=O)c1ccc(O)cc1)CCCNC(=O)c1ccc2cc(O)ccc2c1	CHEMBL575724
CCN(CCOC(=O)/C=C/c1ccc(O)cc1)Cc1cc(Cl)ccc1O	CHEMBL582943


In [9]:
! wc -l < molecule.smi

69
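The line count reported here is 69, while the dataframe displayed earlier ends at index 67; the cause is not visible from the outputs alone. A quick consistency check before running PaDEL can catch this kind of mismatch. The sketch below uses a small made-up dataframe (the SMILES strings and the `CHEMBL_X…` IDs are placeholders, not data from this project) and compares the number of written lines with the number of rows, along with missing SMILES and duplicate IDs.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df3; these molecules and IDs are hypothetical.
df = pd.DataFrame({
    "canonical_smiles": ["CCO", "c1ccccc1", "CC(=O)O"],
    "molecule_chembl_id": ["CHEMBL_X1", "CHEMBL_X2", "CHEMBL_X3"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "molecule.smi")
    # Same export settings as in the notebook cell above.
    df[["canonical_smiles", "molecule_chembl_id"]].to_csv(
        path, sep="\t", index=False, header=False
    )
    with open(path) as handle:
        lines = handle.read().splitlines()

n_rows = len(df)
n_lines = len(lines)
n_missing = int(df["canonical_smiles"].isna().sum())
n_duplicate_ids = int(df["molecule_chembl_id"].duplicated().sum())

print(f"rows={n_rows} lines={n_lines} "
      f"missing_smiles={n_missing} duplicate_ids={n_duplicate_ids}")
```

If `n_lines` and `n_rows` disagree on the real data, it is worth inspecting the file before trusting any row-by-row pairing later on.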


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [10]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
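The same command can also be launched from Python, which makes a failed run raise an exception instead of passing silently. This is only a sketch: the arguments are copied verbatim from `padel.sh` above, while the helper names `build_padel_command` and `run_padel` are hypothetical and not part of PaDEL-Descriptor.

```python
import subprocess
from typing import List


def build_padel_command(
    input_dir: str = "./",
    output_csv: str = "descriptors_output.csv",
) -> List[str]:
    """Return the padel.sh command as an argument list (hypothetical helper)."""
    return [
        "java", "-Xms1G", "-Xmx1G", "-Djava.awt.headless=true",
        "-jar", "./PaDEL-Descriptor/PaDEL-Descriptor.jar",
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", "./PaDEL-Descriptor/PubchemFingerprinter.xml",
        "-dir", input_dir,
        "-file", output_csv,
    ]


def run_padel(command: List[str]) -> None:
    """Run PaDEL and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(command, check=True)


command = build_padel_command()
print(" ".join(command))
# Call run_padel(command) only where Java and the unzipped PaDEL-Descriptor folder exist.
```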


In [11]:
! bash padel.sh

Processing CHEMBL575429 in molecule.smi (1/69). 
Processing CHEMBL574855 in molecule.smi (2/69). 
Processing CHEMBL574190 in molecule.smi (3/69). Average speed: 1.86 s/mol.
Processing CHEMBL582943 in molecule.smi (5/69). Average speed: 1.02 s/mol.
Processing CHEMBL575724 in molecule.smi (4/69). Average speed: 1.30 s/mol.
Processing CHEMBL304087 in molecule.smi (6/69). Average speed: 0.96 s/mol.
Processing CHEMBL311226 in molecule.smi (7/69). Average speed: 0.78 s/mol.
Processing CHEMBL226335 in molecule.smi (8/69). Average speed: 0.86 s/mol.
Processing CHEMBL3344315 in molecule.smi (9/69). Average speed: 0.75 s/mol.
Processing CHEMBL402947 in molecule.smi (10/69). Average speed: 0.87 s/mol.
Processing CHEMBL269277 in molecule.smi (11/69). Average speed: 0.94 s/mol.
Processing CHEMBL251254 in molecule.smi (12/69). Average speed: 0.92 s/mol.
Processing CHEMBL82242 in molecule.smi (13/69). Average speed: 0.82 s/mol.
Processing CHEMBL471282 in molecule.smi (14/69). Average speed: 0.81 s/mo

In [12]:
! ls -l

total 25336
-rw-r--r-- 1 root root    10717 Nov 30 02:57 dengue_virus_04_bioactivity_data_3class_pIC50.csv
-rw-r--r-- 1 root root   134010 Nov 30 03:00 descriptors_output.csv
drwxr-xr-x 3 root root     4096 Nov 30 02:55 __MACOSX
-rw-r--r-- 1 root root     5549 Nov 30 02:58 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Nov 30 02:55 padel.sh
-rw-r--r-- 1 root root 25768637 Nov 30 02:55 padel.zip
drwxr-xr-x 1 root root     4096 Nov 25 19:13 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [17]:
df3_X = pd.read_csv('descriptors_output.csv')

In [18]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL575429,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL574855,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL574190,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL575724,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL582943,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64,CHEMBL5170343,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
65,CHEMBL251279,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
66,CHEMBL495739,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
67,CHEMBL2204412,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
print(df3_X.columns)


Index(['Name', 'PubchemFP0', 'PubchemFP1', 'PubchemFP2', 'PubchemFP3',
       'PubchemFP4', 'PubchemFP5', 'PubchemFP6', 'PubchemFP7', 'PubchemFP8',
       ...
       'PubchemFP871', 'PubchemFP872', 'PubchemFP873', 'PubchemFP874',
       'PubchemFP875', 'PubchemFP876', 'PubchemFP877', 'PubchemFP878',
       'PubchemFP879', 'PubchemFP880'],
      dtype='object', length=882)
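The 882 columns are the `Name` column plus 881 fingerprint bits (`PubchemFP0` to `PubchemFP880`). Before modelling, it can help to confirm that every fingerprint column is binary and free of missing values. A minimal sketch on randomly generated toy data (the IDs and bit values are made up for illustration):

```python
import numpy as np
import pandas as pd

n_bits = 881  # PubchemFP0 .. PubchemFP880
fp_columns = [f"PubchemFP{i}" for i in range(n_bits)]

# Toy stand-in for descriptors_output.csv; values are random, not real fingerprints.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 2, size=(3, n_bits)), columns=fp_columns)
toy.insert(0, "Name", ["CHEMBL_X1", "CHEMBL_X2", "CHEMBL_X3"])

fingerprint = toy.drop(columns=["Name"])
is_binary = bool(fingerprint.isin([0, 1]).all().all())
has_missing = bool(fingerprint.isna().any().any())

print(fingerprint.shape, is_binary, has_missing)
```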


In [20]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
65,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
66,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
67,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


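### **Y variable**

The pIC50 values are already present in the pre-processed file, so no conversion is needed here; we only select the `pIC50` column.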

In [21]:
df3_Y = df3['pIC50']
df3_Y

Unnamed: 0,pIC50
0,2.698970
1,2.698970
2,2.698970
3,3.531653
4,4.337242
...,...
64,6.017729
65,5.602060
66,5.568636
67,5.657577
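For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes the IC50 is given in nM (an assumption about units, not something this notebook states), and the function name is hypothetical.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nM to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # -log10(x * 1e-9) == 9 - log10(x)
    return 9.0 - math.log10(ic50_nm)


example = pic50_from_ic50_nm(1000.0)  # 1000 nM = 1 µM
print(round(example, 3))
```

Under this definition, a larger pIC50 means a more potent compound: each unit increase corresponds to a tenfold lower IC50.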


## **Combining the X and Y variables**

In [22]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,2.698970
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,2.698970
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,2.698970
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.531653
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.337242
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.017729
65,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.602060
66,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.568636
67,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.657577
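One caveat: `pd.concat(..., axis=1)` pairs rows by index label, not by molecule ID. In the outputs shown in this notebook, row 64 of `df3` is CHEMBL495739 while row 64 of `descriptors_output.csv` is CHEMBL5170343, and the PaDEL log reports molecule 5 before molecule 4, so the two tables may not share the same row order. A safer alternative is to join on the molecule ID before dropping `Name`. The sketch below illustrates this with made-up IDs and values; it is not the notebook's original method.

```python
import pandas as pd

# Toy stand-ins; IDs, fingerprint bits and pIC50 values are hypothetical.
bioactivity = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],
    "pIC50": [6.0, 5.6, 5.5],
})
descriptors = pd.DataFrame({
    "Name": ["CHEMBL_B", "CHEMBL_C", "CHEMBL_A"],  # deliberately in a different order
    "PubchemFP0": [1, 1, 0],
    "PubchemFP1": [0, 1, 1],
})

# Join on the ID; validate raises MergeError if an ID is duplicated on either side.
aligned = descriptors.merge(
    bioactivity[["molecule_chembl_id", "pIC50"]],
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)

# Keep only the fingerprint bits and the target column.
dataset = aligned.drop(columns=["Name", "molecule_chembl_id"])
print(dataset)
```

With an inner join, molecules missing from either table are dropped, so comparing `len(dataset)` with the input sizes shows whether any were lost.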


In [24]:
dataset3.to_csv('dengue_virus_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

# **Download the CSV file to your local computer for Part 4 (model building).**
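In Google Colab, the download can be triggered from a cell with `google.colab.files.download`. The sketch below wraps that call in a hypothetical helper, `download_if_colab`, which falls back gracefully when the notebook runs outside Colab.

```python
import os


def download_if_colab(path: str) -> bool:
    """Start a browser download in Colab; return False when not running in Colab."""
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    try:
        # Only importable inside a Google Colab runtime.
        from google.colab import files
    except ImportError:
        return False
    files.download(path)
    return True


# Usage in the notebook:
# download_if_colab('dengue_virus_06_bioactivity_data_3class_pIC50_pubchem_fp.csv')
```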