<a href="https://colab.research.google.com/github/clarefausty/Python-Bioinformatics/blob/main/Part3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chinwendu Faustina Achilonu



In **Part 3**, we will be calculating molecular descriptors that are essentially quantitative description of the compounds in the dataset. Finally, we will be preparing this into a dataset for subsequent model building in Part 4.

---

In [29]:
!wget https://github.com/clarefausty/Python-Bioinformatics/blob/main/files/padel.zip
!wget https://github.com/clarefausty/Python-Bioinformatics/blob/main/files/padel.sh

--2024-07-22 04:11:06--  https://github.com/clarefausty/Python-Bioinformatics/blob/main/files/padel.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘padel.zip.1’

padel.zip.1             [ <=>                ] 282.25K  --.-KB/s    in 0.05s   

2024-07-22 04:11:06 (5.54 MB/s) - ‘padel.zip.1’ saved [289028]

--2024-07-22 04:11:06--  https://github.com/clarefausty/Python-Bioinformatics/blob/main/files/padel.sh
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘padel.sh’

padel.sh                [ <=>                ] 300.20K  --.-KB/s    in 0.05s   

2024-07-22 04:11:07 (5.71 MB/s) - ‘padel.sh’ saved [307401]



In [13]:
!unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that has been pre-processed from Parts 1 and 2 of this Bioinformatics Project series. Here we will be using the **Ovarian_cancer_cell_line_processed_02_bioactivity_data.csv** file that essentially contain the pIC50 values that we will be using for building a regression model.

In [14]:
! wget https://github.com/clarefausty/Python-Bioinformatics/blob/main/files/Ovarian_cancer_cell_line_processed_02_bioactivity_data.csv

--2024-07-22 03:45:56--  https://github.com/clarefausty/Python-Bioinformatics/blob/main/files/Ovarian_cancer_cell_line_processed_02_bioactivity_data.csv
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘Ovarian_cancer_cell_line_processed_02_bioactivity_data.csv’

Ovarian_cancer_cell     [ <=>                ] 169.16K  --.-KB/s    in 0.05s   

2024-07-22 03:45:56 (3.34 MB/s) - ‘Ovarian_cancer_cell_line_processed_02_bioactivity_data.csv’ saved [173216]



In [25]:
import pandas as pd

In [27]:
df3 = pd.read_csv('Ovarian_cancer_cell_line_processed_02_bioactivity_data.csv')

In [28]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL84463,COc1cc2cc(C(=O)N3CC(CCl)c4c3cc(NC(=O)OCc3ccc([...,active,645.068,6.9866,2.0,8.0,7.207608
1,1,CHEMBL23330,NC(=O)c1cc(N2CC2)c([N+](=O)[O-])cc1[N+](=O)[O-],inactive,252.186,0.4219,1.0,6.0,3.759451
2,2,CHEMBL311087,COc1cc2cc(C(=O)N3CC(CCl)c4c3cc(NC(=O)OCc3ccc([...,active,719.147,6.7478,3.0,10.0,6.838632
3,3,CHEMBL288377,CNc1cc2c(c3ccccc13)C(CCl)CN2C(=O)c1cc2cc(OC)c(...,active,479.964,5.3714,2.0,5.0,9.69897
4,4,CHEMBL78019,COc1cc2cc(C(=O)N3CC(CCl)c4c3cc(NC(=O)OCc3ccc([...,active,705.12,6.3577,3.0,10.0,6.649752
5,5,CHEMBL314246,COc1cc2cc(C(=O)N3CC(CCl)c4c3cc(NC(=O)OCc3ccc([...,active,735.146,5.7186,4.0,11.0,6.943095
6,6,CHEMBL316382,COc1cc([N+](=O)[O-])ccc1COC(=O)Nc1cc2c(c3ccccc...,active,675.094,6.9952,2.0,9.0,7.229148
7,7,CHEMBL314247,COCCOc1cc([N+](=O)[O-])ccc1COC(=O)Nc1cc2c(c3cc...,active,719.147,7.0118,2.0,10.0,6.974694
8,8,CHEMBL79354,COc1cc2cc(C(=O)N3CC(CCl)c4c3cc(NC(=O)OCc3ccc([...,active,746.217,7.3171,2.0,10.0,6.723538
9,9,CHEMBL83545,COc1cc([N+](=O)[O-])ccc1COC(=O)Nc1ccc(COC(=O)N...,active,824.243,8.7439,3.0,11.0,6.651695


In [35]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [36]:
! cat molecule.smi | head -5

COc1cc2cc(C(=O)N3CC(CCl)c4c3cc(NC(=O)OCc3ccc([N+](=O)[O-])cc3)c3ccccc43)[nH]c2c(OC)c1OC	CHEMBL84463
NC(=O)c1cc(N2CC2)c([N+](=O)[O-])cc1[N+](=O)[O-]	CHEMBL23330
COc1cc2cc(C(=O)N3CC(CCl)c4c3cc(NC(=O)OCc3ccc([N+](=O)[O-])cc3OCCCO)c3ccccc43)[nH]c2c(OC)c1OC	CHEMBL311087
CNc1cc2c(c3ccccc13)C(CCl)CN2C(=O)c1cc2cc(OC)c(OC)c(OC)c2[nH]1	CHEMBL288377
COc1cc2cc(C(=O)N3CC(CCl)c4c3cc(NC(=O)OCc3ccc([N+](=O)[O-])cc3OCCO)c3ccccc43)[nH]c2c(OC)c1OC	CHEMBL78019


In [38]:
! cat molecule.smi | wc -l

34


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [40]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv

In [41]:
! bash padel.sh

Processing CHEMBL84463 in molecule.smi (1/34). 
Processing CHEMBL23330 in molecule.smi (2/34). 
Processing CHEMBL311087 in molecule.smi (3/34). Average speed: 2.01 s/mol.
Processing CHEMBL288377 in molecule.smi (4/34). Average speed: 2.69 s/mol.
Processing CHEMBL314246 in molecule.smi (6/34). Average speed: 1.68 s/mol.
Processing CHEMBL78019 in molecule.smi (5/34). Average speed: 1.99 s/mol.
Processing CHEMBL316382 in molecule.smi (7/34). Average speed: 1.57 s/mol.
Processing CHEMBL314247 in molecule.smi (8/34). Average speed: 1.42 s/mol.
Processing CHEMBL79354 in molecule.smi (9/34). Average speed: 1.39 s/mol.
Processing CHEMBL83545 in molecule.smi (10/34). Average speed: 1.26 s/mol.
Processing CHEMBL37705 in molecule.smi (11/34). Average speed: 1.29 s/mol.
Processing CHEMBL23612 in molecule.smi (12/34). Average speed: 1.35 s/mol.
Processing CHEMBL311195 in molecule.smi (13/34). Average speed: 1.13 s/mol.
Processing CHEMBL314487 in molecule.smi (14/34). Average speed: 1.05 s/mol.
Proc

In [42]:
! ls -l

total 25272
-rw-r--r-- 1 root root    71752 Jul 22 04:29 descriptors_output.csv
drwxr-xr-x 3 root root     4096 Jul 22 03:21 __MACOSX
-rw-r--r-- 1 root root     3004 Jul 22 04:21 molecule.smi
-rw-r--r-- 1 root root     5528 Jul 22 03:52 Ovarian_cancer_cell_line_processed_02_bioactivity_data.csv
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      230 Jul 22 04:28 padel.sh
-rw-r--r-- 1 root root 25768637 Jul 22 03:19 padel.zip
drwxr-xr-x 1 root root     4096 Jul 18 13:22 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [43]:
df3_x = pd.read_csv('descriptors_output.csv')

In [44]:
df3_x

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL23330,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL84463,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL311087,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL288377,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL78019,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,CHEMBL314246,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,CHEMBL316382,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,CHEMBL314247,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,CHEMBL79354,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,CHEMBL83545,1,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [45]:
df3_x = df3_x.drop(columns=["Name"])
df3_x

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
5,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
6,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
7,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
9,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## **Y variable**

### **Convert IC50 to pIC50**

In [46]:
df3_y = df3["pIC50"]
df3_y

0     7.207608
1     3.759451
2     6.838632
3     9.698970
4     6.649752
5     6.943095
6     7.229148
7     6.974694
8     6.723538
9     6.651695
10    8.958607
11    4.236572
12    6.801343
13    7.886057
14    7.055517
15    6.818156
16    6.113509
17    8.000000
18    4.769551
19    4.958607
20    4.823909
21    4.920819
22    5.031517
23    4.769551
24    6.036212
25    5.920819
26    8.045757
27    9.698970
28    8.698970
29    3.249646
30    4.123205
31    7.356547
32    7.259637
33    4.397940
Name: pIC50, dtype: float64

## **Combining X and Y variable**

In [47]:
dataset3 = pd.concat([df3_x, df3_y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.207608
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.759451
2,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.838632
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,9.69897
4,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.649752
5,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.943095
6,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.229148
7,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.974694
8,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.723538
9,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.651695


In [48]:
dataset3.to_csv("Ovarian_cancer_cell_line_processed_02_bioactivity_data_pIC50_pubchem_fp.csv", index=False)

# **Download the CSV file to your local computer for the Part 4 (Model Building).**