# **Bioinformatics Project - Computational Drug Discovery [Part 3] Descriptor Calculation and Dataset Preparation**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. In particular, we will build a machine learning model using the ChEMBL bioactivity data.

In **Part 3**, we will calculate molecular descriptors, which are essentially quantitative descriptions of the compounds in the dataset. Finally, we will assemble these into a dataset for subsequent model building in Part 4.

---

## **Download PaDEL-Descriptor**

In [1]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

zsh:1: command not found: wget
zsh:1: command not found: wget
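
The output above shows that `wget` is not installed in this shell, so neither file was downloaded. As a possible workaround, the same files could be fetched from Python with the standard library. This is a minimal sketch, not part of the original notebook; the helper name `download_if_missing` is hypothetical.

```python
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: str) -> Path:
    """Download `url` to `dest` unless `dest` already exists; return the path."""
    path = Path(dest)
    if not path.exists():
        # urlretrieve writes the response body to the given filename.
        urllib.request.urlretrieve(url, str(path))
    return path


# Example usage (requires network access), with a URL from the cell above:
# download_if_missing(
#     "https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip",
#     "padel.zip",
# )
```

Skipping the download when the file already exists keeps re-running the notebook cheap.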


In [2]:
! unzip padel.zip

unzip:  cannot find or open padel.zip, padel.zip.zip or padel.zip.ZIP.


## **Load bioactivity data**

Download the curated ChEMBL bioactivity data that was pre-processed in Parts 1 and 2 of this Bioinformatics Project series. Here we will use the **bioactivity_data_3class_pIC50.csv** file, which contains the pIC50 values that we will use to build a regression model.

In [3]:
! wget https://raw.githubusercontent.com/dataprofessor/data/master/acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv

zsh:1: command not found: wget


In [4]:
import pandas as pd

In [5]:
df3 = pd.read_csv('acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv')

In [6]:
df3

Unnamed: 0.1,Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,0,CHEMBL463210,CCOP(=S)(OCC)Oc1nc(Cl)c(Cl)cc1Cl,intermediate,350.591,4.7181,0.0,5.0,5.737549
1,1,CHEMBL2252723,CCOP(=O)(OCC)SCCCCCCCCCCN1C(=O)c2ccccc2C1=O,inactive,455.557,6.3177,0.0,6.0,3.947999
2,2,CHEMBL2252722,CCOP(=O)(OCC)SCCCCCCCCCN1C(=O)c2ccccc2C1=O,inactive,441.53,5.9276,0.0,6.0,4.425969
3,3,CHEMBL2252721,CCOP(=O)(OCC)SCCCCCCCCN1C(=O)c2ccccc2C1=O,intermediate,427.503,5.5375,0.0,6.0,5.346787
4,4,CHEMBL2252851,CCOP(=O)(OCC)SCCCCCCCN1C(=O)c2ccccc2C1=O,intermediate,413.476,5.1474,0.0,6.0,5.735182
5,5,CHEMBL2252850,CCOP(=O)(OCC)SCCCCCCN1C(=O)c2ccccc2C1=O,intermediate,399.449,4.7573,0.0,6.0,5.419075
6,6,CHEMBL2252849,CCOP(=O)(OCC)SCCCCCN1C(=O)c2ccccc2C1=O,inactive,385.422,4.3672,0.0,6.0,4.908685
7,7,CHEMBL2252848,CCOP(=O)(OCC)SCCCCN1C(=O)c2ccccc2C1=O,intermediate,371.395,3.9771,0.0,6.0,5.003488
8,8,CHEMBL2252847,CCOP(=O)(OCC)SCCCN1C(=O)c2ccccc2C1=O,intermediate,357.368,3.587,0.0,6.0,5.081445
9,9,CHEMBL2252846,CCOP(=O)(OCC)SCCCCCCCCCCSP(=O)(OCC)OCC,intermediate,478.594,7.9358,0.0,8.0,5.754487


In [7]:
selection = ['canonical_smiles','molecule_chembl_id']
df3_selection = df3[selection]
df3_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [8]:
! cat molecule.smi | head -5

CCOP(=S)(OCC)Oc1nc(Cl)c(Cl)cc1Cl	CHEMBL463210
CCOP(=O)(OCC)SCCCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252723
CCOP(=O)(OCC)SCCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252722
CCOP(=O)(OCC)SCCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252721
CCOP(=O)(OCC)SCCCCCCCN1C(=O)c2ccccc2C1=O	CHEMBL2252851


In [9]:
! cat molecule.smi | wc -l

      18
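
The `molecule.smi` file written above is tab-separated with no header: the SMILES string comes first and the ChEMBL ID second. The following sketch, which uses an in-memory buffer and two rows copied from the table above rather than the real file, shows how that layout can be written and read back to check it. The column names `smiles` and `name` are chosen here for illustration only.

```python
import io

import pandas as pd

# Two rows copied from the bioactivity table shown earlier.
rows = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL463210", "CHEMBL2252723"],
    "canonical_smiles": [
        "CCOP(=S)(OCC)Oc1nc(Cl)c(Cl)cc1Cl",
        "CCOP(=O)(OCC)SCCCCCCCCCCN1C(=O)c2ccccc2C1=O",
    ],
})

# Write SMILES first, ID second, tab-separated, without index or header.
buffer = io.StringIO()
rows[["canonical_smiles", "molecule_chembl_id"]].to_csv(
    buffer, sep="\t", index=False, header=False
)

# Read the same text back and confirm that no molecule was lost.
buffer.seek(0)
smi = pd.read_csv(buffer, sep="\t", header=None, names=["smiles", "name"])
print(len(smi) == len(rows))
```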


## **Calculate fingerprint descriptors**


In [10]:
import zipfile

with zipfile.ZipFile("results.zip", "r") as z:
    names = z.namelist()
    print("Total files:", len(names))
    print("First 50:")
    for n in names[:50]:
        print(n)


Total files: 19
First 50:
mannwhitneyu_NumHAcceptors.csv
acetylcholinesterase_05_bioactivity_data_2class_pIC50.csv
acetylcholinesterase_01_bioactivity_data_raw.csv
mannwhitneyu_NumHDonors.csv
acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
mannwhitneyu_MW.csv
plot_bioactivity_class.pdf
plot_LogP.pdf
plot_MW.pdf
plot_NumHDonors.pdf
plot_NumHAcceptors.pdf
acetylcholinesterase_03_bioactivity_data_curated.csv
acetylcholinesterase_02_bioactivity_data_preprocessed.csv
plot_ic50.pdf
mannwhitneyu_pIC50.csv
plot_MW_vs_LogP.pdf
mannwhitneyu_LogP.csv
acetylcholinesterase_07_bioactivity_data_2class_pIC50_pubchem_fp.csv
acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv


In [11]:
import zipfile

with zipfile.ZipFile("results.zip", "r") as z:
    hits = [n for n in z.namelist() if ("descriptor" in n.lower()) or n.lower().endswith(".csv")]
    print("CSV/descriptor-like files:")
    for h in hits[:100]:
        print(h)

CSV/descriptor-like files:
mannwhitneyu_NumHAcceptors.csv
acetylcholinesterase_05_bioactivity_data_2class_pIC50.csv
acetylcholinesterase_01_bioactivity_data_raw.csv
mannwhitneyu_NumHDonors.csv
acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv
mannwhitneyu_MW.csv
acetylcholinesterase_03_bioactivity_data_curated.csv
acetylcholinesterase_02_bioactivity_data_preprocessed.csv
mannwhitneyu_pIC50.csv
mannwhitneyu_LogP.csv
acetylcholinesterase_07_bioactivity_data_2class_pIC50_pubchem_fp.csv
acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv


In [12]:
import pandas as pd

df_fp = pd.read_csv('acetylcholinesterase_07_bioactivity_data_2class_pIC50_pubchem_fp.csv')

# Keep only descriptor-like columns (drop ID + label columns if present)
drop_cols = [c for c in ['molecule_chembl_id','canonical_smiles','bioactivity_class','standard_value','standard_value_norm','pIC50'] if c in df_fp.columns]
df_desc = df_fp.drop(columns=drop_cols, errors='ignore')

df_desc.to_csv('descriptors_output.csv', index=False)

print("Wrote descriptors_output.csv with shape:", df_desc.shape)


Wrote descriptors_output.csv with shape: (3549, 881)
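
The cell above keeps the descriptor columns by dropping a list of known non-descriptor names. An alternative sketch is to select columns by name pattern instead, assuming every fingerprint column is named `PubchemFP<number>` as in the output shown later in this notebook. The toy frame below is made up for illustration and is not taken from the real file.

```python
import pandas as pd

# Toy frame mixing identifier, fingerprint, and label columns (values are illustrative).
df_example = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL463210", "CHEMBL2252723"],
    "PubchemFP0": [1, 1],
    "PubchemFP1": [1, 0],
    "pIC50": [5.737549, 3.947999],
})

# Keep only columns whose names match PubchemFP followed by digits.
fingerprint_cols = df_example.filter(regex=r"^PubchemFP\d+$")
print(list(fingerprint_cols.columns))
```

Selecting by pattern avoids silently keeping an unexpected non-descriptor column that is missing from the drop list.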


### **Calculate PaDEL descriptors**

In [13]:
! cat padel.sh

cat: padel.sh: No such file or directory


In [14]:
! bash padel.sh

bash: padel.sh: No such file or directory


In [15]:
! ls -l

total 46152
-rw-r--r--  1 anamayer  staff   132373 Dec 17 16:06 CDD_ML_Part_1_Acetylcholinesterase_Bioactivity_Data_Concised.ipynb
-rw-r--r--  1 anamayer  staff   269174 Dec 17 16:47 CDD_ML_Part_2_Acetylcholinesterase_Exploratory_Data_Analysis.ipynb
-rw-r--r--  1 anamayer  staff    51531 Dec 17 17:07 CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb
-rw-r--r--  1 anamayer  staff   100076 Dec 17 15:56 CDD_ML_Part_4_Acetylcholinesterase_Regression_Random_Forest.ipynb
-rw-r--r--  1 anamayer  staff   230778 Dec 17 15:56 CDD_ML_Part_5_Acetylcholinesterase_Compare_Regressors.ipynb
-rw-r--r--  1 anamayer  staff       29 Dec 17 15:56 README.md
-rw-r--r--  1 anamayer  staff   873617 Dec 17 16:06 acetylcholinesterase.zip
-rw-r--r--  1 anamayer  staff     9828 Dec 17 16:59 acetylcholinesterase_01_bioactivity_data_raw.csv
-rw-r--r--  1 anamayer  staff     1093 Dec 17 16:59 acetylcholinesterase_02_bioactivity_data_preprocessed.csv
-rw-r--r--  1 anamayer  staff     1303 Dec 17 

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [16]:
import os
print("descriptors_output.csv" in os.listdir())


True


In [17]:
import os
os.getcwd()
os.listdir()

['CDD_ML_Part_2_Acetylcholinesterase_Exploratory_Data_Analysis.ipynb',
 'descriptors_output.csv',
 'mannwhitneyu_NumHAcceptors.csv',
 'acetylcholinesterase_05_bioactivity_data_2class_pIC50.csv',
 'acetylcholinesterase_01_bioactivity_data_raw.csv',
 'mannwhitneyu_NumHDonors.csv',
 'acetylcholinesterase_04_bioactivity_data_3class_pIC50.csv',
 'mannwhitneyu_MW.csv',
 'plot_bioactivity_class.pdf',
 'plot_LogP.pdf',
 'CDD_ML_Part_1_Acetylcholinesterase_Bioactivity_Data_Concised.ipynb',
 'CDD_ML_Part_3_Acetylcholinesterase_Descriptor_Dataset_Preparation.ipynb',
 'plot_MW.pdf',
 'README.md',
 'plot_NumHDonors.pdf',
 'plot_NumHAcceptors.pdf',
 'acetylcholinesterase.zip',
 'acetylcholinesterase_03_bioactivity_data_curated.csv',
 'molecule.smi',
 'acetylcholinesterase_02_bioactivity_data_preprocessed.csv',
 '.ipynb_checkpoints',
 'plot_ic50.pdf',
 'CDD_ML_Part_5_Acetylcholinesterase_Compare_Regressors.ipynb',
 'results.zip',
 'mannwhitneyu_pIC50.csv',
 'plot_MW_vs_LogP.pdf',
 'mannwhitneyu_LogP.

In [18]:
df3_X = pd.read_csv('descriptors_output.csv')

In [19]:
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3544,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3545,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3546,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3547,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [20]:
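# Drop the 'Name' identifier column if it is present; errors='ignore' makes this a no-op otherwise.
df3_X = df3_X.drop(columns=['Name'], errors='ignore')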
df3_X

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3544,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3545,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3546,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3547,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


## **Y variable**

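### **Convert IC50 to pIC50**

The pIC50 values were already computed in the earlier parts of this series and are stored in the loaded file, so this step only selects the `pIC50` column.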

In [21]:
df3_Y = df3['pIC50']
df3_Y

0     5.737549
1     3.947999
2     4.425969
3     5.346787
4     5.735182
5     5.419075
6     4.908685
7     5.003488
8     5.081445
9     5.754487
10    5.844664
11    5.315155
12    4.991400
13    6.060481
14    4.908685
15    5.093126
16    5.785156
17    1.397940
Name: pIC50, dtype: float64
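
For reference, pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L. The sketch below assumes the IC50 is given in nM; the function name `pic50_from_ic50_nm` is hypothetical, and this is not necessarily the exact code used in the earlier parts.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9  # nM -> mol/L
    return -math.log10(ic50_molar)


# 1000 nM is 1 micromolar, which corresponds to a pIC50 of about 6.
print(round(pic50_from_ic50_nm(1000.0), 3))
```

Because of the negative logarithm, a higher pIC50 means a more potent compound.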

## **Combining the X and Y variables**

In [22]:
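# Note: concat aligns on the row index, so df3_X and df3_Y must have the same number of rows;
# any row without a matching pIC50 value is filled with NaN.
dataset3 = pd.concat([df3_X,df3_Y], axis=1)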
dataset3

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,...,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880,pIC50
0,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.737549
1,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,3.947999
2,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,4.425969
3,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.346787
4,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,5.735182
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3544,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,
3545,1,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,
3546,1,1,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,
3547,1,1,1,1,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,
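
The trailing rows above have an empty pIC50 because `pd.concat(..., axis=1)` aligns rows by index label: `df3_Y` holds only 18 values, while `df3_X` holds several thousand rows, so the unmatched rows are filled with NaN. A minimal sketch of a guard that catches this before the file is saved is shown below. The helper name `combine_xy` is hypothetical, the toy data is made up, and pairing rows by position assumes both inputs list the same molecules in the same order.

```python
import pandas as pd


def combine_xy(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Join descriptors and target column-wise, refusing mismatched row counts."""
    if len(X) != len(y):
        raise ValueError(
            f"Row count mismatch: X has {len(X)} rows, y has {len(y)}"
        )
    # Reset both indexes so rows are paired by position rather than by label.
    return pd.concat(
        [X.reset_index(drop=True), y.reset_index(drop=True)], axis=1
    )


# Toy data for illustration only.
X = pd.DataFrame({"PubchemFP0": [1, 1, 0], "PubchemFP1": [0, 1, 1]})
y = pd.Series([5.7, 3.9, 4.4], name="pIC50")
dataset = combine_xy(X, y)
print(dataset.shape)
```

With such a check in place, a mismatch between the descriptor file and the bioactivity file raises an error instead of producing rows with a missing target.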


In [23]:
dataset3.to_csv('acetylcholinesterase_06_bioactivity_data_3class_pIC50_pubchem_fp.csv', index=False)

# **Let's download the CSV file to your local computer for Part 4 (Model Building).**