<a href="https://colab.research.google.com/github/soumik03/Major_Project_Bioinformatics/blob/main/CDD_ML_Part_3_Staphylococcus_aureus_Descriptor_Dataset_Preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**


In this Jupyter notebook, we build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we build a machine learning model using ChEMBL bioactivity data.

In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset, and then assemble them into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2022-05-05 07:29:09--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2022-05-05 07:29:09--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2022-05-05 07:29:13 (288 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2022-05-05 07:29:13--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [None]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **Staphylococcus_aureus_04_bioactivity_data_raw_pIC50_3_bioactivity_class.csv** file, which contains the pIC50 values that we will use to build a regression model. The file must already be in the working directory, since the download command below is commented out.

In [None]:
#! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv




In [None]:
import pandas as pd

In [None]:
df3 = pd.read_csv('Staphylococcus_aureus_04_bioactivity_data_raw_pIC50_3_bioactivity_class.csv')

In [None]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL304237,Oc1cc(NCCCNC2CCNc3c(Br)cc(Br)cc32)nc2ccccc12,active,506.242,5.41390,4.0,5.0,8.096910
1,CHEMBL60402,O=c1cc(NCCCNCc2ccc(Cl)c(Cl)c2)[nH]c2ccccc12,active,376.287,4.42670,3.0,3.0,7.795880
2,CHEMBL122734,COc1ccc2c(c1)C(=O)/C(=C/c1ccc([N+](=O)[O-])o1)O2,inactive,287.227,2.81260,0.0,6.0,3.820000
3,CHEMBL122837,O=C(/C=C/c1ccc([N+](=O)[O-])o1)c1cc(Cl)ccc1O,inactive,293.662,3.44290,1.0,5.0,1.930000
4,CHEMBL331082,CC(=O)Oc1ccc(C)cc1C(=O)/C=C/c1ccc([N+](=O)[O-])o1,inactive,315.281,3.31762,0.0,6.0,2.320000
...,...,...,...,...,...,...,...,...
891,CHEMBL4745545,C/C(=C/c1ccccc1)Cn1cc(CCCCCc2nc(N)[nH]c2-c2ccc...,intermediate,468.649,6.43280,2.0,5.0,5.872895
892,CHEMBL4785353,C/C(=C/c1ccccc1)Cn1cc(CCCCCc2nc(N)[nH]c2-c2cc(...,intermediate,462.548,5.58760,2.0,5.0,5.872895
893,CHEMBL4779057,C/C(=C/c1ccccc1)Cn1cc(CCCCCc2nc(N)[nH]c2-c2ccc...,intermediate,444.558,5.44850,2.0,5.0,5.872895
894,CHEMBL4757594,C/C(=C/c1ccccc1)Cn1cc(CCCCCc2nc(N)[nH]c2-c2ccc...,intermediate,502.666,6.97640,2.0,5.0,5.872895


In [None]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [None]:
! cat molecule.smi | head -5

Oc1cc(NCCCNC2CCNc3c(Br)cc(Br)cc32)nc2ccccc12	CHEMBL304237
O=c1cc(NCCCNCc2ccc(Cl)c(Cl)c2)[nH]c2ccccc12	CHEMBL60402
COc1ccc2c(c1)C(=O)/C(=C/c1ccc([N+](=O)[O-])o1)O2	CHEMBL122734
O=C(/C=C/c1ccc([N+](=O)[O-])o1)c1cc(Cl)ccc1O	CHEMBL122837
CC(=O)Oc1ccc(C)cc1C(=O)/C=C/c1ccc([N+](=O)[O-])o1	CHEMBL331082


In [None]:
! cat molecule.smi | wc -l

896
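The file written above has one molecule per line: the SMILES string, a tab, then the ChEMBL ID. As a minimal sketch of a format check before handing the file to PaDEL, the hypothetical helper `check_smi_lines` below counts rows and reports lines that do not have exactly two non-empty fields. The third sample line is deliberately malformed for illustration.

```python
import io

def check_smi_lines(handle):
    """Return (n_rows, bad_line_numbers) for a tab-separated SMILES file."""
    bad = []
    n_rows = 0
    for line_no, line in enumerate(handle, start=1):
        n_rows += 1
        fields = line.rstrip("\n").split("\t")
        # Expect exactly two non-empty fields: SMILES and molecule ID
        if len(fields) != 2 or not all(f.strip() for f in fields):
            bad.append(line_no)
    return n_rows, bad

# In-memory stand-in for molecule.smi; the last line is intentionally broken
sample = io.StringIO(
    "Oc1cc(NCCCNC2CCNc3c(Br)cc(Br)cc32)nc2ccccc12\tCHEMBL304237\n"
    "O=c1cc(NCCCNCc2ccc(Cl)c(Cl)c2)[nH]c2ccccc12\tCHEMBL60402\n"
    "missing_id_line\n"
)
n_rows, bad = check_smi_lines(sample)
print(n_rows, bad)
```

On the real file, `open('molecule.smi')` would replace the `io.StringIO` object.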


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [None]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
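The command in `padel.sh` can also be assembled from Python, which makes it easier to swap the fingerprint type. The sketch below only builds the argument list and does not run Java. The helper name `padel_command` and its parameters are illustrative assumptions; the alternative XML names are those listed in the unzip output above (for example `MACCSFingerprinter.xml`).

```python
import shlex

def padel_command(fingerprint="PubchemFingerprinter",
                  out_csv="descriptors_output.csv",
                  padel_dir="./PaDEL-Descriptor",
                  input_dir="./",
                  heap="1G"):
    """Build the same argument list that padel.sh passes to java."""
    return [
        "java", f"-Xms{heap}", f"-Xmx{heap}", "-Djava.awt.headless=true",
        "-jar", f"{padel_dir}/PaDEL-Descriptor.jar",
        "-removesalt", "-standardizenitro", "-fingerprints",
        "-descriptortypes", f"{padel_dir}/{fingerprint}.xml",
        "-dir", input_dir, "-file", out_csv,
    ]

cmd = padel_command()
print(shlex.join(cmd))

# Same options, but pointing at another fingerprint definition from the zip
maccs_cmd = padel_command(fingerprint="MACCSFingerprinter",
                          out_csv="descriptors_maccs.csv")
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, assuming Java is installed and `molecule.smi` sits in the input directory.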


In [None]:
! bash padel.sh

Processing CHEMBL304237 in molecule.smi (1/896). 
Processing CHEMBL60402 in molecule.smi (2/896). 
Processing CHEMBL122837 in molecule.smi (4/896). Average speed: 1.63 s/mol.
Processing CHEMBL122734 in molecule.smi (3/896). Average speed: 2.86 s/mol.
Processing CHEMBL331082 in molecule.smi (5/896). Average speed: 1.84 s/mol.
Processing CHEMBL434031 in molecule.smi (6/896). Average speed: 0.95 s/mol.
Processing CHEMBL331350 in molecule.smi (8/896). Average speed: 0.74 s/mol.
Processing CHEMBL121970 in molecule.smi (7/896). Average speed: 0.86 s/mol.
Processing CHEMBL333498 in molecule.smi (9/896). Average speed: 0.70 s/mol.
Processing CHEMBL123597 in molecule.smi (10/896). Average speed: 0.62 s/mol.
Processing CHEMBL123607 in molecule.smi (11/896). Average speed: 0.59 s/mol.
Processing CHEMBL123818 in molecule.smi (12/896). Average speed: 0.56 s/mol.
Processing CHEMBL333429 in molecule.smi (13/896). Average speed: 0.53 s/mol.
Processing CHEMBL333530 in molecule.smi (15/896). Average spe

In [None]:
! ls -l

total 26948
-rw-r--r-- 1 root root  1604047 May  5 07:34 descriptors_output.csv
drwxr-xr-x 3 root root     4096 May  5 07:29 __MACOSX
-rw-r--r-- 1 root root    69134 May  5 07:31 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 May  5 07:29 padel.sh
-rw-r--r-- 1 root root 25768637 May  5 07:29 padel.zip
drwxr-xr-x 1 root root     4096 May  3 13:42 sample_data
-rw-r--r-- 1 root root   130606 May  5 07:28 Staphylococcus_aureus_04_bioactivity_data_raw_pIC50_3_bioactivity_class.csv


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [None]:
df3_X = pd.read_csv('descriptors_output.csv')

In [None]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL60402,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL304237,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL122734,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL122837,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL331082,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
891,CHEMBL4745545,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
892,CHEMBL4785353,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
893,CHEMBL4779057,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
894,CHEMBL4757594,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
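Note that the `Name` column above starts with CHEMBL60402, whereas `df3` starts with CHEMBL304237, so the descriptor rows are not guaranteed to be in the same order as the bioactivity table. A minimal sketch of how the order can be checked and restored by ID, using small toy frames with hypothetical fingerprint values:

```python
import pandas as pd

# Toy stand-ins for df3 and the PaDEL output (fingerprint values are made up)
bio = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL304237", "CHEMBL60402", "CHEMBL122734"],
    "pIC50": [8.09691, 7.79588, 3.82],
})
fp = pd.DataFrame({
    "Name": ["CHEMBL60402", "CHEMBL304237", "CHEMBL122734"],
    "PubchemFP0": [1, 1, 1],
    "PubchemFP2": [0, 1, 0],
})

# Do both tables list the molecules in the same order?
same_order = fp["Name"].tolist() == bio["molecule_chembl_id"].tolist()
print("same order:", same_order)

# Reorder the fingerprint rows to follow the bioactivity table, matching by ID
aligned = (
    fp.set_index("Name")
      .reindex(bio["molecule_chembl_id"].tolist())
      .reset_index(drop=True)
)
print(aligned)
```

Matching by ID rather than by row position keeps each fingerprint paired with its own pIC50 value when the two tables are later combined column-wise.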


In [None]:
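# PaDEL runs multi-threaded and may write rows out of input order (compare Name above with df3),
# so realign the fingerprints to df3 by ChEMBL ID before dropping the Name column.
df3_X = df3_X.drop_duplicates('Name').set_index('Name').reindex(df3['molecule_chembl_id']).reset_index(drop=True)
df3_X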

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
891,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
892,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
893,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
894,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### **Y variable**

The IC50 values were already converted to pIC50 in Part 2, so here we simply select the `pIC50` column.

In [None]:
df3_Y = df3['pIC50']
df3_Y

0      8.096910
1      7.795880
2      3.820000
3      1.930000
4      2.320000
         ...   
891    5.872895
892    5.872895
893    5.872895
894    5.872895
895    4.856985
Name: pIC50, Length: 896, dtype: float64
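For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L. The sketch below assumes the IC50 is given in nM; the helper name `pic50_from_nm` is hypothetical and is not the function used in Part 2.

```python
import math

def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    molar = ic50_nm * 1e-9  # nM -> M
    return -math.log10(molar)

print(round(pic50_from_nm(8.0), 6))     # 8.09691
print(round(pic50_from_nm(1000.0), 6))  # 6.0
```

A higher pIC50 therefore corresponds to a more potent compound: each unit increase means a tenfold lower IC50.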

## **Combining the X and Y variables**

In [None]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.096910
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.795880
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.820000
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1.930000
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,2.320000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
891,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.872895
892,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.872895
893,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.872895
894,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.872895


In [None]:
dataset3.to_csv('Staphylococcus_aureus_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)
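When the saved file is read back for model building, the fingerprint columns and the `pIC50` column can be separated again. A minimal sketch, using a hypothetical three-row stand-in for the CSV rather than the real file:

```python
import io
import pandas as pd

# Hypothetical miniature version of the saved fingerprint CSV
csv_text = (
    "PubchemFP0,PubchemFP1,PubchemFP2,pIC50\n"
    "1,1,1,8.09691\n"
    "1,1,0,3.82\n"
    "1,1,0,1.93\n"
)
data = pd.read_csv(io.StringIO(csv_text))

X = data.drop(columns=["pIC50"])  # fingerprint bits
y = data["pIC50"]                 # regression target

# Count missing values; a non-zero count would suggest failed descriptor rows
n_missing = int(data.isna().sum().sum())
print(X.shape, y.shape, n_missing)
```

For the real dataset, `pd.read_csv` would be pointed at the CSV file written above instead of the in-memory text.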

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**