<a href="https://colab.research.google.com/github/correctchemist/code/blob/main/bb1_dihydroorotate_part3_descriptors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bioinformatics Project - Dihydroorotate dehydrogenase [Part 3] Descriptor Calculation and Dataset Preparation**

Beatrice Iwuala

In **Part 3**, we calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We then assemble them into a dataset for model building in Part 4.


## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2022-11-13 20:19:08--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2022-11-13 20:19:08--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2022-11-13 20:19:09 (155 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2022-11-13 20:19:09--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [2]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

## **Load bioactivity data**

Load the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we use the **dihydoorotate_04_bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [3]:
import pandas as pd

In [4]:
df3 = pd.read_csv('dihydoorotate_04_bioactivity_data_3class_pIC50.csv')

In [6]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL199572,CN(C(=O)c1ccc(-c2ccccc2)cc1)c1ccccc1C(=O)O,inactive,331.371,4.32840,1.0,2.0,4.370590
1,1,CHEMBL199574,O=C(Nc1ccccc1C(=O)O)c1ccc2cc(Br)ccc2c1,inactive,370.202,4.55280,2.0,2.0,3.845880
2,2,CHEMBL372561,CN(C(=O)c1ccc2cc(Br)ccc2c1)c1ccccc1C(=O)O,inactive,384.229,4.57710,1.0,2.0,4.029653
3,3,CHEMBL370865,O=C(Nc1ccccc1C(=O)O)c1ccc(-c2ccccc2)cc1,inactive,317.344,4.30410,2.0,2.0,3.813892
4,4,CHEMBL199575,CN(C(=O)c1ccc2ccccc2c1)c1ccccc1C(=O)O,inactive,305.333,3.81460,1.0,2.0,3.698970
...,...,...,...,...,...,...,...,...,...
597,597,CHEMBL4569109,Cn1nc(OCC2CC2)c(C(=O)O)c1COc1ccccc1,inactive,302.330,2.48610,1.0,5.0,3.602060
598,598,CHEMBL4568957,Cn1nc(OCc2ccccc2)c(C(=O)O)c1COc1ccccc1,inactive,338.363,3.27630,1.0,5.0,3.602060
599,599,CHEMBL4449622,Cn1nc(O)c(C(N)=O)c1COc1ccccc1,inactive,247.254,0.80360,2.0,5.0,3.602060
600,600,CHEMBL1956285,Cc1cc(Nc2ccc(S(F)(F)(F)(F)F)cc2)n2nc(C(C)(F)F)...,active,415.338,5.94542,1.0,5.0,8.000000
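
The leading columns of the table above look like leftover row indices (`Unnamed: 0`). pandas typically creates such a column when a CSV that was saved with the default `index=True` is read back. The sketch below reproduces this on toy data and shows one way to drop such columns. The toy frame and the `round_trip` helper are illustrative assumptions, not part of the notebook.

```python
import io
import pandas as pd

# Toy frame standing in for a table saved in an earlier part.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL199572", "CHEMBL199574"],
    "pIC50": [4.370590, 3.845880],
})

def round_trip(frame: pd.DataFrame, index: bool) -> pd.DataFrame:
    """Write a frame to an in-memory CSV and read it back (hypothetical helper)."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=index)
    buffer.seek(0)
    return pd.read_csv(buffer)

# Saving with the index adds an unnamed first column on reload.
with_index = round_trip(toy, index=True)
# Saving with index=False keeps only the real columns.
without_index = round_trip(toy, index=False)

# Drop any leftover index columns by their "Unnamed" prefix.
cleaned = with_index.loc[:, ~with_index.columns.str.startswith("Unnamed")]

print(list(with_index.columns))
print(list(cleaned.columns))
```

These extra columns are harmless here because only `canonical_smiles`, `molecule_chembl_id`, and `pIC50` are used later, but dropping them keeps the table easier to read.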


In [7]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [8]:
! cat molecule.smi | head -5

CN(C(=O)c1ccc(-c2ccccc2)cc1)c1ccccc1C(=O)O	CHEMBL199572
O=C(Nc1ccccc1C(=O)O)c1ccc2cc(Br)ccc2c1	CHEMBL199574
CN(C(=O)c1ccc2cc(Br)ccc2c1)c1ccccc1C(=O)O	CHEMBL372561
O=C(Nc1ccccc1C(=O)O)c1ccc(-c2ccccc2)cc1	CHEMBL370865
CN(C(=O)c1ccc2ccccc2c1)c1ccccc1C(=O)O	CHEMBL199575


In [9]:
! cat molecule.smi | wc -l

602
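
The `.smi` file written above holds one molecule per line: the SMILES string, a tab, then the ChEMBL ID. The same checks as `head` and `wc -l` can be done in pandas. This is a minimal sketch on a two-row toy table (rows copied from the output above), writing to an in-memory buffer instead of `molecule.smi`:

```python
import io
import pandas as pd

# Toy stand-in for df3; the two SMILES/ID pairs are copied from the table above.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL199572", "CHEMBL199574"],
    "canonical_smiles": [
        "CN(C(=O)c1ccc(-c2ccccc2)cc1)c1ccccc1C(=O)O",
        "O=C(Nc1ccccc1C(=O)O)c1ccc2cc(Br)ccc2c1",
    ],
    "pIC50": [4.370590, 3.845880],
})

# Same to_csv arguments as in the notebook, but written to a buffer.
buffer = io.StringIO()
toy[["canonical_smiles", "molecule_chembl_id"]].to_csv(
    buffer, sep="\t", index=False, header=False
)

# Read it back: each line should split into exactly a SMILES and an ID.
buffer.seek(0)
smi = pd.read_csv(buffer, sep="\t", header=None, names=["smiles", "name"])

n_rows = len(smi)                                        # equivalent of `wc -l`
n_missing = int(smi.isna().sum().sum())                  # empty SMILES or IDs
n_duplicate_ids = int(smi["name"].duplicated().sum())    # repeated ChEMBL IDs

print(n_rows, n_missing, n_duplicate_ids)
```

Counting duplicate IDs is worth doing on the real file, because any later step that matches rows by ChEMBL ID assumes each ID appears once.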


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [10]:
! cat padel.sh

java -Xms1G -Xmx1G -Djava.awt.headless=true -jar ./PaDEL-Descriptor/PaDEL-Descriptor.jar -removesalt -standardizenitro -fingerprints -descriptortypes ./PaDEL-Descriptor/PubchemFingerprinter.xml -dir ./ -file descriptors_output.csv
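
The one-line command in `padel.sh` is easier to read when split into its parts. The sketch below rebuilds the same command as a Python argument list. The comments on each flag reflect my reading of the PaDEL-Descriptor command-line options and should be checked against the tool's own help output. The command is only printed here, not executed.

```python
import shlex

# JVM options: 1 GB initial and maximum heap, and headless mode (no display).
jvm_options = ["-Xms1G", "-Xmx1G", "-Djava.awt.headless=true"]

padel_options = [
    "-removesalt",        # remove salts/counter-ions before calculation
    "-standardizenitro",  # standardize nitro groups
    "-fingerprints",      # calculate fingerprints
    "-descriptortypes", "./PaDEL-Descriptor/PubchemFingerprinter.xml",
    "-dir", "./",         # directory scanned for input structures (molecule.smi)
    "-file", "descriptors_output.csv",  # output CSV file
]

command = [
    "java", *jvm_options,
    "-jar", "./PaDEL-Descriptor/PaDEL-Descriptor.jar",
    *padel_options,
]

# Print the command as a single shell line for comparison with padel.sh.
print(shlex.join(command))
# To run it from Python instead of bash: subprocess.run(command, check=True)
```

Swapping the XML file passed to `-descriptortypes` for another one from the unzipped `PaDEL-Descriptor/` folder (for example `MACCSFingerprinter.xml`) would presumably select a different fingerprint type.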


In [11]:
! bash padel.sh

Processing CHEMBL199572 in molecule.smi (1/602). 
Processing CHEMBL199574 in molecule.smi (2/602). 
Processing CHEMBL370865 in molecule.smi (4/602). Average speed: 1.97 s/mol.
Processing CHEMBL372561 in molecule.smi (3/602). Average speed: 3.66 s/mol.
Processing CHEMBL199575 in molecule.smi (5/602). Average speed: 1.74 s/mol.
Processing CHEMBL200536 in molecule.smi (6/602). Average speed: 1.44 s/mol.
Processing CHEMBL218467 in molecule.smi (7/602). Average speed: 1.32 s/mol.
Processing CHEMBL973 in molecule.smi (8/602). Average speed: 1.15 s/mol.
Processing CHEMBL217893 in molecule.smi (9/602). Average speed: 1.02 s/mol.
Processing CHEMBL217951 in molecule.smi (10/602). Average speed: 0.92 s/mol.
Processing CHEMBL217897 in molecule.smi (11/602). Average speed: 0.86 s/mol.
Processing CHEMBL387254 in molecule.smi (12/602). Average speed: 0.82 s/mol.
Processing CHEMBL373493 in molecule.smi (13/602). Average speed: 0.77 s/mol.
Processing CHEMBL374338 in molecule.smi (14/602). Average speed

In [12]:
! ls -l

total 26352
-rw-r--r-- 1 root root  1081565 Nov 13 20:33 descriptors_output.csv
-rw-r--r-- 1 root root    72282 Nov 13 20:25 dihydoorotate_04_bioactivity_data_3class_pIC50.csv
drwx------ 6 root root     4096 Nov 13 20:25 drive
drwxr-xr-x 3 root root     4096 Nov 13 20:21 __MACOSX
-rw-r--r-- 1 root root    31311 Nov 13 20:29 molecule.smi
drwxrwxr-x 4 root root     4096 May 30  2020 PaDEL-Descriptor
-rw-r--r-- 1 root root      231 Nov 13 20:19 padel.sh
-rw-r--r-- 1 root root 25768637 Nov 13 20:19 padel.zip
drwxr-xr-x 1 root root     4096 Nov 10 14:30 sample_data


## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [13]:
df3_X = pd.read_csv('descriptors_output.csv')

In [14]:
df3_X

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL199574,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL199572,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL372561,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL370865,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL199575,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
597,CHEMBL4569109,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
598,CHEMBL4568957,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
599,CHEMBL4449622,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
600,CHEMBL1956285,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
597,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
598,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
599,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
600,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
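
Before using the fingerprint matrix, it can help to confirm that it holds only what is expected: columns named `PubchemFP…` with values 0 or 1. This is a minimal sketch on a toy table with three bits instead of the 881 shown above; the toy values are made up for illustration.

```python
import pandas as pd

# Toy fingerprint table with 3 bits instead of the notebook's 881.
toy_X = pd.DataFrame({
    "Name": ["CHEMBL199574", "CHEMBL199572"],
    "PubchemFP0": [1, 1],
    "PubchemFP1": [1, 1],
    "PubchemFP2": [0, 1],
})

# Same step as in the notebook: drop the identifier column.
features = toy_X.drop(columns=["Name"])

# Every remaining column should be a fingerprint bit holding only 0 or 1.
all_fp_columns = all(col.startswith("PubchemFP") for col in features.columns)
all_binary = bool(features.isin([0, 1]).all().all())

# Bits that are identical for every molecule carry no information for a model.
constant_bits = [col for col in features.columns if features[col].nunique() == 1]

print(features.shape, all_fp_columns, all_binary, constant_bits)
```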


### **Y variable**

The pIC50 values are already present in the loaded file, so this step only selects the `pIC50` column.

In [16]:
df3_Y = df3['pIC50']
df3_Y

0      4.370590
1      3.845880
2      4.029653
3      3.813892
4      3.698970
         ...   
597    3.602060
598    3.602060
599    3.602060
600    8.000000
601    8.000000
Name: pIC50, Length: 602, dtype: float64
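
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes IC50 values given in nanomolar. That unit and the function name `ic50_nm_to_pic50` are assumptions for illustration, since the conversion itself was done in an earlier part and is not shown here.

```python
import numpy as np
import pandas as pd

def ic50_nm_to_pic50(ic50_nm: pd.Series) -> pd.Series:
    """Convert IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    ic50_molar = ic50_nm * 1e-9  # nM -> M
    return -np.log10(ic50_molar)

# 10 nM and 250,000 nM (250 µM) as example inputs.
example = ic50_nm_to_pic50(pd.Series([10.0, 250_000.0]))
print(example.round(5).tolist())
```

Under this assumption, an IC50 of 10 nM maps to a pIC50 of 8, so a higher pIC50 means a more potent compound.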

## **Combining X and Y variables**

In [17]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.370590
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.845880
2,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.029653
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.813892
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.698970
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
597,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.602060
598,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.602060
599,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.602060
600,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,8.000000
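
One caution about the positional `pd.concat` used here: in the descriptor output above, `CHEMBL199574` is listed before `CHEMBL199572`, whereas the bioactivity table has them in the opposite order. The PaDEL progress log also shows molecules processed out of order. If the row order differs, a positional concat may pair a fingerprint with another molecule's pIC50. A safer alternative is to join on the molecule ID before dropping `Name`. This is a minimal sketch on toy data, not the notebook's own code:

```python
import pandas as pd

# Toy descriptor output: same molecules as the input, but in a different order.
toy_fp = pd.DataFrame({
    "Name": ["CHEMBL199574", "CHEMBL199572", "CHEMBL372561"],
    "PubchemFP2": [0, 1, 0],
})

# Toy bioactivity table in the original input order (values from the table above).
toy_activity = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL199572", "CHEMBL199574", "CHEMBL372561"],
    "pIC50": [4.370590, 3.845880, 4.029653],
})

# Positional concat pairs row i with row i, whichever molecule it is.
positional = pd.concat(
    [toy_fp.drop(columns=["Name"]), toy_activity["pIC50"]], axis=1
)

# Key-based merge pairs each fingerprint with the pIC50 of the same molecule.
# validate="one_to_one" raises an error if an ID is duplicated on either side.
aligned = toy_fp.merge(
    toy_activity,
    left_on="Name",
    right_on="molecule_chembl_id",
    how="inner",
    validate="one_to_one",
)

print(aligned[["Name", "PubchemFP2", "pIC50"]])
```

On the real data, the `Name` and `molecule_chembl_id` columns can be dropped after the merge. If the dataset contains repeated ChEMBL IDs, `validate="one_to_one"` will raise, which is a signal to deduplicate first or to align on a row identifier instead.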


In [18]:
dataset3.to_csv('dihydroorotate_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

# **Download the CSV file to your local computer for model building in Part 4.**