# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are quantitative descriptions of the compounds in the dataset. We will then assemble them into a dataset for model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
''' No longer available
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
'''

' No longer available\n! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip\n! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh\n'

In [2]:
''' No longer available
! unzip padel.zip
'''

' No longer available\n! unzip padel.zip\n'

In [3]:
! pip install padelpy



In [4]:
! wget https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip
! unzip -o fingerprints_xml.zip

--2025-12-17 00:09:31--  https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip [following]
--2025-12-17 00:09:31--  https://raw.githubusercontent.com/dataprofessor/padel/main/fingerprints_xml.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
  Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 200 OK
Length: 10871 (11K) [application/zip]
Saving to: ‘fingerprints_xml.zip’


2025-12-17 00:09:31 (14.1 MB/s) - ‘fingerprints_xml.zip’ saved [10871/10871]

Arc

In [5]:
import glob
xml_files = glob.glob("*.xml")
xml_files.sort()
xml_files

['AtomPairs2DFingerprintCount.xml',
 'AtomPairs2DFingerprinter.xml',
 'EStateFingerprinter.xml',
 'ExtendedFingerprinter.xml',
 'Fingerprinter.xml',
 'GraphOnlyFingerprinter.xml',
 'KlekotaRothFingerprintCount.xml',
 'KlekotaRothFingerprinter.xml',
 'MACCSFingerprinter.xml',
 'PubchemFingerprinter.xml',
 'SubstructureFingerprintCount.xml',
 'SubstructureFingerprinter.xml']

In [6]:
FP_list = ['AtomPairs2DCount',
 'AtomPairs2D',
 'EState',
 'CDKextended',
 'CDK',
 'CDKgraphonly',
 'KlekotaRothCount',
 'KlekotaRoth',
 'MACCS',
 'PubChem',
 'SubstructureCount',
 'Substructure']

In [7]:
fp = dict(zip(FP_list, xml_files))
fp

{'AtomPairs2DCount': 'AtomPairs2DFingerprintCount.xml',
 'AtomPairs2D': 'AtomPairs2DFingerprinter.xml',
 'EState': 'EStateFingerprinter.xml',
 'CDKextended': 'ExtendedFingerprinter.xml',
 'CDK': 'Fingerprinter.xml',
 'CDKgraphonly': 'GraphOnlyFingerprinter.xml',
 'KlekotaRothCount': 'KlekotaRothFingerprintCount.xml',
 'KlekotaRoth': 'KlekotaRothFingerprinter.xml',
 'MACCS': 'MACCSFingerprinter.xml',
 'PubChem': 'PubchemFingerprinter.xml',
 'SubstructureCount': 'SubstructureFingerprintCount.xml',
 'Substructure': 'SubstructureFingerprinter.xml'}

In [8]:
fp['AtomPairs2D']

'AtomPairs2DFingerprinter.xml'
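
The `zip` in cell 7 pairs names with files purely by position, so a missing XML file or a changed sort order would silently shift the mapping. A minimal sketch of a more defensive variant follows, assuming the twelve file names shown in the output above; the names `FP_FILES` and `build_fp_map` are hypothetical and not part of the original notebook.

```python
# Explicit name -> XML file mapping, copied from the output of cell 7.
FP_FILES = {
    'AtomPairs2DCount': 'AtomPairs2DFingerprintCount.xml',
    'AtomPairs2D': 'AtomPairs2DFingerprinter.xml',
    'EState': 'EStateFingerprinter.xml',
    'CDKextended': 'ExtendedFingerprinter.xml',
    'CDK': 'Fingerprinter.xml',
    'CDKgraphonly': 'GraphOnlyFingerprinter.xml',
    'KlekotaRothCount': 'KlekotaRothFingerprintCount.xml',
    'KlekotaRoth': 'KlekotaRothFingerprinter.xml',
    'MACCS': 'MACCSFingerprinter.xml',
    'PubChem': 'PubchemFingerprinter.xml',
    'SubstructureCount': 'SubstructureFingerprintCount.xml',
    'Substructure': 'SubstructureFingerprinter.xml',
}


def build_fp_map(available_files):
    """Return a copy of FP_FILES after checking every XML file is present.

    `available_files` is an iterable of file names, e.g. glob.glob("*.xml").
    """
    missing = sorted(set(FP_FILES.values()) - set(available_files))
    if missing:
        raise FileNotFoundError(f"Missing fingerprint XML files: {missing}")
    return dict(FP_FILES)
```

With the files unzipped in the working directory, `fp = build_fp_map(glob.glob("*.xml"))` would give the same dictionary as cell 7, but it fails loudly if a file is absent.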

## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that were pre-processed in Parts 1 and 2 of this Bioinformatics Project series. The original notebook used the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values for building a regression model. That download is now commented out as outdated, and the cells below load the updated **HCV_NS5B_Curated.csv** file instead, which also provides a `pIC50` column.

In [9]:
# ! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
# outdated dataset

In [10]:
import pandas as pd

In [11]:
df3 = pd.read_csv('https://raw.githubusercontent.com/dataprofessor/data/master/HCV_NS5B_Curated.csv')
# updated dataset

In [12]:
df3

Unnamed: 0,CMPD_CHEMBLID,CANONICAL_SMILES,STANDARD_TYPE,RELATION,STANDARD_VALUE,STANDARD_UNITS,pIC50,PROTEIN_ACCESSION,PREF_NAME,DOC_CHEMBLID,...,JOURNAL,YEAR,VOLUME,ISSUE,FIRST_PAGE,MOLWEIGHT,ALOGP,PSA,NUM_RO5_VIOLATIONS,Activity
0,CHEMBL179256,OC(=O)c1ccc2c(c1)nc(c3ccc(O)cc3F)n2C4CCCCC4,IC50,=,1.4,nM,8.853872,Q8JXU8,Hepatitis C virus NS5B RNA-dependent RNA polym...,CHEMBL1142688,...,J. Med. Chem.,2005,48.0,5.0,1314.0,354.37,4.93,75.35,0,Active
1,CHEMBL204350,CC(C)(C)CCN1[C@H](C(=C(C1=O)C2=NS(=O)(=O)c3ccc...,IC50,=,1.7,nM,8.769551,Q8JXU8,Hepatitis C virus NS5B RNA-dependent RNA polym...,CHEMBL1146957,...,Bioorg. Med. Chem. Lett.,2006,16.0,8.0,2205.0,419.54,2.37,107.45,0,Active
2,CHEMBL179257,OC(=O)c1ccc2c(c1)nc(c3ccc(O)cc3)n2C4CCCCC4,IC50,=,3.0,nM,8.522879,Q8JXU8,Hepatitis C virus NS5B RNA-dependent RNA polym...,CHEMBL1142688,...,J. Med. Chem.,2005,48.0,5.0,1314.0,336.38,4.72,75.35,0,Active
3,CHEMBL178784,OC(=O)c1ccc2c(C3CCCCC3)c([nH]c2c1)c4ccc(O)cc4,IC50,=,4.8,nM,8.318759,Q8JXU8,Hepatitis C virus NS5B RNA-dependent RNA polym...,CHEMBL1142688,...,J. Med. Chem.,2005,48.0,5.0,1314.0,335.40,5.51,73.32,1,Active
4,CHEMBL369319,CN(C)C(=O)Cn1c(c2ccc(OCc3ccccc3)cc2)c(C4CCCCC4...,IC50,=,6.0,nM,8.221849,Q8JXU8,Hepatitis C virus NS5B RNA-dependent RNA polym...,CHEMBL1142688,...,J. Med. Chem.,2005,48.0,5.0,1314.0,510.62,6.79,71.77,2,Active
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
573,CHEMBL175762,CCC(CC)n1c(nc2cc(ccc12)C(=O)O)c3ccccn3,IC50,=,139000.0,nM,3.856985,Q8JXU8,Hepatitis C virus NS5B RNA-dependent RNA polym...,CHEMBL1149223,...,Bioorg. Med. Chem. Lett.,2004,14.0,1.0,119.0,309.36,4.16,68.01,0,Inactive
574,CHEMBL197882,Cc1sc(cc1\C(=C\C(=O)C(=O)O)\O)c2ccccc2,IC50,=,167000.0,nM,3.777284,Q8JXU8,Hepatitis C virus NS5B RNA-dependent RNA polym...,CHEMBL1140440,...,J. Med. Chem.,2005,48.0,20.0,6304.0,288.32,2.88,102.83,0,Inactive
575,CHEMBL177122,OC(=O)c1ccc2c(c1)ncn2C3CCCCC3,IC50,=,186000.0,nM,3.730487,Q8JXU8,Hepatitis C virus NS5B RNA-dependent RNA polym...,CHEMBL1149223,...,Bioorg. Med. Chem. Lett.,2004,14.0,1.0,119.0,244.29,3.04,55.12,0,Inactive
576,CHEMBL175454,OC(=O)c1ccc2c(c1)nc(c3ccccn3)n2c4ccccc4,IC50,=,360000.0,nM,3.443697,Q8JXU8,Hepatitis C virus NS5B RNA-dependent RNA polym...,CHEMBL1149223,...,Bioorg. Med. Chem. Lett.,2004,14.0,1.0,119.0,315.33,3.96,68.01,0,Inactive


In [13]:
selection = ['CANONICAL_SMILES','CMPD_CHEMBLID']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [14]:
! cat molecule.smi | head -5

OC(=O)c1ccc2c(c1)nc(c3ccc(O)cc3F)n2C4CCCCC4	CHEMBL179256
CC(C)(C)CCN1[C@H](C(=C(C1=O)C2=NS(=O)(=O)c3ccccc3N2)O)C(C)(C)C	CHEMBL204350
OC(=O)c1ccc2c(c1)nc(c3ccc(O)cc3)n2C4CCCCC4	CHEMBL179257
OC(=O)c1ccc2c(C3CCCCC3)c([nH]c2c1)c4ccc(O)cc4	CHEMBL178784
CN(C)C(=O)Cn1c(c2ccc(OCc3ccccc3)cc2)c(C4CCCCC4)c5ccc(cc15)C(=O)O	CHEMBL369319


In [15]:
! cat molecule.smi | wc -l

     578
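
The `head` and `wc -l` checks above can also be done in Python, which additionally confirms that every line has both fields. This is a sketch only; the helper name `check_smi_file` is hypothetical, and it assumes the tab-separated, header-free layout written in cell 13 (SMILES first, compound ID second).

```python
def check_smi_file(path):
    """Count records in a tab-separated SMILES file and validate each line.

    Each line is expected to hold exactly two non-empty fields:
    the SMILES string, then the compound ID.
    """
    n_records = 0
    with open(path, encoding='utf-8') as handle:
        for line_no, line in enumerate(handle, start=1):
            fields = line.rstrip('\n').split('\t')
            if len(fields) != 2 or not all(fields):
                raise ValueError(f"Line {line_no} is malformed: {line!r}")
            n_records += 1
    return n_records
```

Calling `check_smi_file('molecule.smi')` should agree with the `wc -l` count, provided no row is missing its SMILES or ID.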


## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [16]:
# ! cat padel.sh (not available)

In [17]:
# ! bash padel.sh (not available)

In [18]:
df2 = pd.concat( [df3['CANONICAL_SMILES'],df3['CMPD_CHEMBLID']], axis=1 )
df2.to_csv('molecule.smi', sep='\t', index=False, header=False)
df2

Unnamed: 0,CANONICAL_SMILES,CMPD_CHEMBLID
0,OC(=O)c1ccc2c(c1)nc(c3ccc(O)cc3F)n2C4CCCCC4,CHEMBL179256
1,CC(C)(C)CCN1[C@H](C(=C(C1=O)C2=NS(=O)(=O)c3ccc...,CHEMBL204350
2,OC(=O)c1ccc2c(c1)nc(c3ccc(O)cc3)n2C4CCCCC4,CHEMBL179257
3,OC(=O)c1ccc2c(C3CCCCC3)c([nH]c2c1)c4ccc(O)cc4,CHEMBL178784
4,CN(C)C(=O)Cn1c(c2ccc(OCc3ccccc3)cc2)c(C4CCCCC4...,CHEMBL369319
...,...,...
573,CCC(CC)n1c(nc2cc(ccc12)C(=O)O)c3ccccn3,CHEMBL175762
574,Cc1sc(cc1\C(=C\C(=O)C(=O)O)\O)c2ccccc2,CHEMBL197882
575,OC(=O)c1ccc2c(c1)ncn2C3CCCCC3,CHEMBL177122
576,OC(=O)c1ccc2c(c1)nc(c3ccccn3)n2c4ccccc4,CHEMBL175454


In [19]:
from padelpy import padeldescriptor

# Fingerprint type to compute; must be one of the keys of the `fp` dictionary
fingerprint = 'Substructure'

fingerprint_output_file = f'{fingerprint}.csv'  # 'Substructure.csv'
fingerprint_descriptortypes = fp[fingerprint]   # 'SubstructureFingerprinter.xml'

# Read molecule.smi, standardize the structures, and write the fingerprints to CSV
padeldescriptor(mol_dir='molecule.smi',
                d_file=fingerprint_output_file,
                descriptortypes=fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)
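
To compute more than one fingerprint type, the call above can be wrapped in a loop. The sketch below is hypothetical (the helper `compute_fingerprints` is not part of the notebook or of padelpy); it takes the function to call as a `runner` argument so the loop can be exercised without Java or PaDEL installed, and it reuses exactly the keyword arguments from cell 19.

```python
def compute_fingerprints(names, fp_map, runner, mol_file='molecule.smi'):
    """Run `runner` once per fingerprint name and return the output CSV paths.

    `fp_map` maps fingerprint names to XML files (the `fp` dict above).
    `runner` must accept the same keyword arguments as the
    padeldescriptor call in cell 19.
    """
    outputs = {}
    for name in names:
        if name not in fp_map:
            raise KeyError(f"Unknown fingerprint type: {name}")
        out_file = f"{name}.csv"
        runner(mol_dir=mol_file,
               d_file=out_file,
               descriptortypes=fp_map[name],
               detectaromaticity=True,
               standardizenitro=True,
               standardizetautomers=True,
               threads=2,
               removesalt=True,
               log=True,
               fingerprints=True)
        outputs[name] = out_file
    return outputs
```

In the notebook this would be called as `compute_fingerprints(['Substructure', 'PubChem'], fp, padeldescriptor)`, producing one CSV file per fingerprint type.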

In [20]:
! ls -l

total 26208
-rwxr-xr-x  1 samiha  staff     4645 Mar 27  2018 AtomPairs2DFingerprintCount.xml
-rwxr-xr-x  1 samiha  staff     4645 Mar 27  2018 AtomPairs2DFingerprinter.xml
-rw-r--r--  1 samiha  staff   132618 Nov 20 16:22 CDD_ML_Part_1_Acetylcholinesterase_Bioactivity_Data_Concised.ipynb
-rw-r--r--  1 samiha  staff   262242 Dec 16 23:25 CDD_ML_Part_2_Acetylcholinesterase_Exploratory_Data_Analysis.ipynb
-rw-r--r--  1 samiha  staff   100076 Nov 19 13:21 CDD_ML_Part_4_Acetylcholinesterase_Regression_Random_Forest.ipynb
-rw-r--r--  1 samiha  staff   230778 Nov 19 13:21 CDD_ML_Part_5_Acetylcholinesterase_Compare_Regressors.ipynb
-rwxr-xr-x  1 samiha  staff     4645 Mar 27  2018 EStateFingerprinter.xml
-rwxr-xr-x  1 samiha  staff     4645 Mar 27  2018 ExtendedFingerprinter.xml
-rwxr-xr-x  1 samiha  staff     4645 Mar 27  2018 Fingerprinter.xml
-rwxr-xr-x  1 samiha  staff     4645 Mar 27  2018 GraphOnlyFingerprinter.xml
-rwxr-

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [21]:
df3_X = pd.read_csv(fingerprint_output_file)

In [22]:
df3_X

Unnamed: 0,Name,SubFP1,SubFP2,SubFP3,SubFP4,SubFP5,SubFP6,SubFP7,SubFP8,SubFP9,...,SubFP298,SubFP299,SubFP300,SubFP301,SubFP302,SubFP303,SubFP304,SubFP305,SubFP306,SubFP307
0,CHEMBL179256,0,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
1,CHEMBL204350,1,1,0,1,0,0,0,0,0,...,0,0,1,1,1,1,0,0,0,1
2,CHEMBL179257,0,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
3,CHEMBL178784,0,1,1,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
4,CHEMBL369319,0,1,1,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
573,CHEMBL175762,1,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
574,CHEMBL197882,1,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
575,CHEMBL177122,0,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
576,CHEMBL175454,0,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1


In [23]:
df3_X = df3_X.drop(columns=['Name'])
df3_X

Unnamed: 0,SubFP1,SubFP2,SubFP3,SubFP4,SubFP5,SubFP6,SubFP7,SubFP8,SubFP9,SubFP10,...,SubFP298,SubFP299,SubFP300,SubFP301,SubFP302,SubFP303,SubFP304,SubFP305,SubFP306,SubFP307
0,0,1,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
1,1,1,0,1,0,0,0,0,0,0,...,0,0,1,1,1,1,0,0,0,1
2,0,1,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
3,0,1,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
4,0,1,1,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
573,1,1,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
574,1,1,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
575,0,1,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1
576,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,1


## **Y variable**

### **Use the precomputed pIC50 values**

In [24]:
df3_Y = df3['pIC50']
df3_Y

0      8.853872
1      8.769551
2      8.522879
3      8.318759
4      8.221849
         ...   
573    3.856985
574    3.777284
575    3.730487
576    3.443697
577    3.389340
Name: pIC50, Length: 578, dtype: float64
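
The `pIC50` column is taken directly from the dataset. For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units, which matches the rows shown earlier (for example, 1.4 nM corresponds to 8.853872). A minimal sketch of that conversion, with a hypothetical helper name `pic50_from_nm` and assuming the input is in nanomolar as in the `STANDARD_UNITS` column:

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9  # nM -> M
    return -math.log10(ic50_molar)
```

A larger pIC50 therefore means a more potent compound, which is why the rows labelled Active sit at the top of the table and the Inactive ones at the bottom.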

## **Combining X and Y variables**

In [25]:
dataset3 = pd.concat([df3_X,df3_Y], axis=1)
dataset3

Unnamed: 0,SubFP1,SubFP2,SubFP3,SubFP4,SubFP5,SubFP6,SubFP7,SubFP8,SubFP9,SubFP10,...,SubFP299,SubFP300,SubFP301,SubFP302,SubFP303,SubFP304,SubFP305,SubFP306,SubFP307,pIC50
0,0,1,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,8.853872
1,1,1,0,1,0,0,0,0,0,0,...,0,1,1,1,1,0,0,0,1,8.769551
2,0,1,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,8.522879
3,0,1,1,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,8.318759
4,0,1,1,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,8.221849
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
573,1,1,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,3.856985
574,1,1,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,3.777284
575,0,1,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,3.730487
576,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,1,3.443697
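
The positional `pd.concat` above is only correct if the fingerprint file lists the compounds in the same order as `df3`. A sketch that does not rely on that assumption joins the two tables on the compound ID instead. The helper `combine_by_id` is hypothetical; it assumes the fingerprint table still has its `Name` column (that is, it is applied before the column is dropped in cell 23) and that each compound ID occurs only once.

```python
import pandas as pd


def combine_by_id(fp_df, activity_df, id_col='CMPD_CHEMBLID', y_col='pIC50'):
    """Join fingerprint rows to activity values on the compound ID.

    `fp_df` must still contain the `Name` column written by PaDEL.
    Raises pandas.errors.MergeError if an ID is duplicated on either side.
    """
    merged = fp_df.merge(activity_df[[id_col, y_col]],
                         left_on='Name', right_on=id_col,
                         how='inner', validate='one_to_one')
    # Keep only the fingerprint bits and the target column
    return merged.drop(columns=['Name', id_col])
```

For example, `dataset3 = combine_by_id(pd.read_csv(fingerprint_output_file), df3)` would give the same columns as cell 25, with each pIC50 value matched to its own compound regardless of row order.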


In [26]:
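# File name kept from the original acetylcholinesterase notebook; the contents are
# Substructure fingerprints and pIC50 values for the HCV NS5B dataset loaded above.
dataset3.to_csv('acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)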

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**