# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

In this Jupyter notebook, we will be building a real-life **data science project** that you can include in your **data science portfolio**. In particular, we will be building a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then assemble these descriptors into a dataset for model building in Part 4.

---

## **Load bioactivity data**

In [1]:
%%capture
! unzip ../toolkits/padel.zip -d ../toolkits
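
If the `unzip` command is not available (for example on Windows), the same extraction can be done from Python with the standard-library `zipfile` module. The sketch below is a minimal illustration, not part of the original notebook; the helper name `extract_archive` and the demo file names are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Demonstration on a throwaway archive (file names here are made up)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("padel.sh", "echo placeholder\n")
    extracted = extract_archive(demo_zip, Path(tmp) / "toolkits")
    script_exists = (Path(tmp) / "toolkits" / "padel.sh").exists()
```

In this notebook the call would be `extract_archive('../toolkits/padel.zip', '../toolkits')`.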

In [2]:
import pandas as pd
df3 = pd.read_csv('../data/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv')
df3.head()

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL133897,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,active,312.325,2.8032,0.0,6.0,6.124939
1,1,CHEMBL336398,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,active,376.913,4.5546,0.0,5.0,7.0
2,2,CHEMBL131588,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,inactive,426.851,5.3574,0.0,5.0,4.30103
3,3,CHEMBL130628,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,active,404.845,4.7069,0.0,5.0,6.522879
4,4,CHEMBL130478,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,active,346.334,3.0953,0.0,6.0,6.09691


In [3]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('../data/molecule.smi', sep='\t', index=False, header=False)

In [4]:
! head -5 ../data/molecule.smi

CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1	CHEMBL133897
O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1	CHEMBL336398
CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1	CHEMBL131588
O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F	CHEMBL130628
CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C	CHEMBL130478


In [5]:
! cat ../data/molecule.smi | wc -l

5664
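
The row count and file layout can also be checked from Python, which avoids shell pipes and additionally flags malformed rows or repeated IDs. This is a rough sketch using only the standard library; `summarize_smi` is a hypothetical helper, and the two demo rows are copied from the preview above.

```python
import csv
import io


def summarize_smi(handle):
    """Count rows in a tab-separated SMILES file and flag malformed or duplicate entries."""
    n_rows, malformed, duplicates = 0, 0, 0
    seen = set()
    # QUOTE_NONE keeps every character of the SMILES string literal
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        n_rows += 1
        if len(row) != 2 or not row[0] or not row[1]:
            malformed += 1
            continue
        if row[1] in seen:
            duplicates += 1
        seen.add(row[1])
    return {"rows": n_rows, "malformed": malformed, "duplicate_ids": duplicates}


# In-memory stand-in for ../data/molecule.smi
demo = io.StringIO(
    "CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1\tCHEMBL133897\n"
    "O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1\tCHEMBL336398\n"
)
summary = summarize_smi(demo)
```

On the real file, the handle would come from `open('../data/molecule.smi', newline='')`.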


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [6]:
! bash ../toolkits/padel.sh

Processing CHEMBL133897 in molecule.smi (1/5664). 
Processing CHEMBL130628 in molecule.smi (4/5664). 
Processing CHEMBL130112 in molecule.smi (6/5664). 
Processing CHEMBL336398 in molecule.smi (2/5664). 
Processing CHEMBL130478 in molecule.smi (5/5664). 
Processing CHEMBL130098 in molecule.smi (7/5664). 
Processing CHEMBL337486 in molecule.smi (8/5664). 
Processing CHEMBL131588 in molecule.smi (3/5664). 
Processing CHEMBL341437 in molecule.smi (11/5664). Average speed: 1.52 s/mol.
Processing CHEMBL131051 in molecule.smi (10/5664). Average speed: 3.99 s/mol.
Processing CHEMBL336538 in molecule.smi (9/5664). Average speed: 3.97 s/mol.
Processing CHEMBL335033 in molecule.smi (12/5664). Average speed: 0.92 s/mol.
Processing CHEMBL122983 in molecule.smi (13/5664). Average speed: 0.92 s/mol.
Processing CHEMBL339995 in molecule.smi (15/5664). Average speed: 0.68 s/mol.
Processing CHEMBL338720 in molecule.smi (14/5664). Average speed: 0.78 s/mol.
Processing CHEMBL131536 in molecule.smi (17/566

In [7]:
! mv descriptors_output.csv ../data/descriptors_output.csv

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [8]:
df3_X = pd.read_csv('../data/descriptors_output.csv')
df3_X.head()

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL130098,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,CHEMBL130478,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CHEMBL336398,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,CHEMBL337486,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CHEMBL130628,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
df3_X = df3_X.drop(columns=['Name'])
df3_X.head()

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
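
Before using the fingerprint matrix, it can be worth confirming that every column really is a 0/1 bit and that nothing is missing. The following is a minimal sketch on a toy frame standing in for `df3_X`; the helper `check_fingerprint_matrix` and the toy values are hypothetical.

```python
import pandas as pd


def check_fingerprint_matrix(X):
    """Return basic sanity facts about a fingerprint matrix."""
    return {
        "shape": X.shape,
        "all_binary": bool(X.isin([0, 1]).all().all()),
        "n_missing": int(X.isna().sum().sum()),
    }


# Toy stand-in for df3_X with three bit columns (values are made up)
toy_X = pd.DataFrame(
    {
        "PubchemFP0": [1, 1, 1],
        "PubchemFP1": [1, 1, 0],
        "PubchemFP2": [0, 1, 1],
    }
)
report = check_fingerprint_matrix(toy_X)
```

Calling `check_fingerprint_matrix(df3_X)` on the real matrix should report one row per processed molecule and one column per fingerprint bit.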


### **Y variable**

The `pIC50` values are already present in the loaded bioactivity file, so we only need to select that column.

In [10]:
df3_Y = df3['pIC50']
df3_Y.head()

0    6.124939
1    7.000000
2    4.301030
3    6.522879
4    6.096910
Name: pIC50, dtype: float64
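
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below assumes IC50 values given in nM (an assumption, since the raw IC50 column is not shown in this part); `pic50_from_nM` is a hypothetical helper name.

```python
import math


def pic50_from_nM(ic50_nM):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nM <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nM * 1e-9)


# Two illustrative inputs: 100 nM and 50,000 nM
examples = {nM: round(pic50_from_nM(nM), 5) for nM in (100.0, 50000.0)}
```

A lower IC50 (a more potent compound) gives a higher pIC50, which is why the scale is convenient as a regression target.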

## **Combining the X and Y variables**

In [11]:
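# pd.concat(axis=1) pairs rows by index position, not by molecule ID.
# The Name column of descriptors_output.csv starts with CHEMBL130098 while df3
# starts with CHEMBL133897, so verify that both are in the same order.
dataset3 = pd.concat([df3_X, df3_Y], axis=1)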
dataset3.head()

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.124939
1,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,7.0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.30103
3,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.522879
4,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,6.09691
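
The PaDEL log above shows molecules being processed out of order, and the `Name` column of the descriptor file does not follow the order of `df3`. A positional `pd.concat` can therefore pair a fingerprint with the wrong pIC50. One way to guard against this is to join on the molecule ID before dropping `Name`. The sketch below uses toy data with made-up IDs and values; `align_by_id` is a hypothetical helper, not the notebook's original method.

```python
import pandas as pd


def align_by_id(descriptors, activity, id_col_x="Name",
                id_col_y="molecule_chembl_id", target="pIC50"):
    """Join descriptors to the target on molecule ID, then drop the ID columns."""
    merged = descriptors.merge(
        activity[[id_col_y, target]],
        left_on=id_col_x,
        right_on=id_col_y,
        how="inner",
        validate="one_to_one",  # raises if an ID repeats on either side
    )
    return merged.drop(columns=[id_col_x, id_col_y])


# Toy data: the descriptor rows are in a different order than the activity rows
descriptors = pd.DataFrame(
    {"Name": ["MOL_C", "MOL_A", "MOL_B"], "PubchemFP0": [0, 1, 1]}
)
activity = pd.DataFrame(
    {"molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"], "pIC50": [6.0, 7.0, 5.0]}
)

aligned = align_by_id(descriptors, activity)

# Positional concatenation, for comparison: MOL_C is paired with MOL_A's value
naive = pd.concat([descriptors.drop(columns=["Name"]), activity["pIC50"]], axis=1)
```

With the real data, the call would be `align_by_id(pd.read_csv('../data/descriptors_output.csv'), df3)`. If the merge raises because of repeated IDs, the duplicates need to be resolved first.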


## **Saving the CSV file**

In [12]:
dataset3.to_csv('../data/acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)
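
Rows with missing values can appear if the descriptor calculation fails for a molecule or if the X and Y tables differ in length, and many estimators reject them. As a rough sketch, the check below drops incomplete rows and confirms that the table survives a CSV round trip; the toy frame and the helper `drop_incomplete_rows` are hypothetical.

```python
import io

import pandas as pd


def drop_incomplete_rows(dataset):
    """Remove rows with any missing value and report how many were dropped."""
    cleaned = dataset.dropna(axis=0, how="any")
    return cleaned, len(dataset) - len(cleaned)


# Toy stand-in for dataset3 with two incomplete rows (values are made up)
toy = pd.DataFrame({"PubchemFP0": [1, 1, None], "pIC50": [6.1, None, 5.0]})
cleaned, n_dropped = drop_incomplete_rows(toy)

# Round trip through an in-memory CSV, mirroring to_csv(..., index=False)
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

Writing with `index=False`, as the cell above does, also keeps stray `Unnamed: 0` columns from appearing when the file is read back in Part 4.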