# **Computational Drug Discovery - SOAT-2: Download Bioactivity Data, Part 01**

Khalid El Akri

[*'chem Code Professor' YouTube channel*](http://youtube.com/@chemcodeprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. In particular, we will build a machine learning model using BindingDB bioactivity data.

In **Part 01**, we will collect and pre-process data from the BindingDB database. Let's get started, shall we? ;)

# Import SOAT-2_INHIBITORS_DATA and convert it from an SDF file to a CSV file

In [5]:
from rdkit import Chem
import pandas as pd
from rdkit.Chem import PandasTools

# Read the SDF file into RDKit
suppl = Chem.SDMolSupplier('Sterol O-acyltransferase 2 inhibitors 219.sdf')

# Convert each molecule to a dictionary
molecules = []
for mol in suppl:
    if mol is not None:
        molecules.append(mol.GetPropsAsDict(True))

# Convert the dictionaries to a DataFrame
df = pd.DataFrame(molecules)

# Write the DataFrame to a CSV file
df.to_csv('output_Sterol O-acyltransferase 2 inhibitors 219.csv', index=False)

## 1. Let's check the output CSV file

In [6]:
df1 = pd.read_csv('output_Sterol O-acyltransferase 2 inhibitors 219.csv')
df1

Unnamed: 0,_Name,_MolFileInfo,_MolFileComments,_MolFileChiralFlag,From,BindingDB Reactant_set_id,Ligand InChI,Ligand InChI Key,BindingDB MonomerID,BindingDB Ligand Name,...,UniProt (SwissProt) Recommended Name of Target Chain,UniProt (SwissProt) Entry Name of Target Chain,UniProt (SwissProt) Primary ID of Target Chain,UniProt (SwissProt) Secondary ID(s) of Target Chain,UniProt (SwissProt) Alternative ID(s) of Target Chain,UniProt (TrEMBL) Submitted Name of Target Chain,UniProt (TrEMBL) Entry Name of Target Chain,UniProt (TrEMBL) Primary ID of Target Chain,UniProt (TrEMBL) Secondary ID(s) of Target Chain,UniProt (TrEMBL) Alternative ID(s) of Target Chain
0,,Mrv0541 11241417102D,,0,www.bindingDB.org,51053295,InChI=1S/C37H41NO11/c1-20(39)45-19-36(4)27-17-...,WIYPRMUSDJLXEC-ZRPSWOMSSA-N,50429605,CHEMBL2334538,...,Sterol O-acyltransferase 2,ACAT-2,O75908,F5H7W4 I6L9H9 Q4VB99 Q4VBA1 Q96TD4 Q9UNR2,,,,,,
1,,Mrv0541 11241417102D,,0,www.bindingDB.org,51053294,InChI=1S/C37H38N2O10/c1-20(40)45-19-36(4)27-16...,HBSJDWOXHMSOQS-ZRPSWOMSSA-N,50429604,CHEMBL2334539,...,Sterol O-acyltransferase 2,ACAT-2,O75908,F5H7W4 I6L9H9 Q4VB99 Q4VBA1 Q96TD4 Q9UNR2,,,,,,
2,,Mrv0541 11241417102D,,0,www.bindingDB.org,51068473,InChI=1S/C37H38N2O10/c1-20(40)45-19-36(4)27-16...,HBSJDWOXHMSOQS-ZRPSWOMSSA-N,50429604,CHEMBL2334539,...,Sterol O-acyltransferase 2,ACAT-2,O75908,F5H7W4 I6L9H9 Q4VB99 Q4VBA1 Q96TD4 Q9UNR2,,,,,,
3,,Mrv0541 11241417102D,,0,www.bindingDB.org,51053301,InChI=1S/C37H41NO10/c1-20-9-11-23(12-10-20)33(...,PXWLUSGRVNCIQG-ZRPSWOMSSA-N,50429610,CHEMBL2334222,...,Sterol O-acyltransferase 2,ACAT-2,O75908,F5H7W4 I6L9H9 Q4VB99 Q4VBA1 Q96TD4 Q9UNR2,,,,,,
4,,Mrv0541 11241417102D,,0,www.bindingDB.org,51053293,InChI=1S/C36H38FNO10/c1-19(39)44-18-35(4)26-16...,ZUQDNOHGPAJTCX-HRSZXCBJSA-N,50429603,CHEMBL2334540,...,Sterol O-acyltransferase 2,ACAT-2,O75908,F5H7W4 I6L9H9 Q4VB99 Q4VBA1 Q96TD4 Q9UNR2,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214,,Mrv0541 11221409282D,,0,www.bindingDB.org,51346963,InChI=1S/C15H20O3/c1-9-5-4-6-14(3)8-15(17)12(7...,FBMORZZOJSDNRQ-GLQYFDAESA-N,50241945,Atractylenolide III::CHEMBL486961,...,Sterol O-acyltransferase 2,ACAT-2,O75908,F5H7W4 I6L9H9 Q4VB99 Q4VBA1 Q96TD4 Q9UNR2,,,,,,
215,,Mrv0541 11241407572D,,0,www.bindingDB.org,330437,InChI=1S/C20H25NO/c1-2-3-4-5-9-16-20(22)21-19-...,CLTPDJWVYAJEKC-UHFFFAOYSA-N,50371698,"CHEMBL408322::US9149492, 1",...,Sterol O-acyltransferase 2,ACAT-2,O75908,F5H7W4 I6L9H9 Q4VB99 Q4VBA1 Q96TD4 Q9UNR2,,,,,,
216,,Mrv0541 11241407572D,,0,www.bindingDB.org,50855129,InChI=1S/C20H25NO/c1-2-3-4-5-9-16-20(22)21-19-...,CLTPDJWVYAJEKC-UHFFFAOYSA-N,50371698,"CHEMBL408322::US9149492, 1",...,Sterol O-acyltransferase 2,ACAT-2,O75908,F5H7W4 I6L9H9 Q4VB99 Q4VBA1 Q96TD4 Q9UNR2,,,,,,
217,,Mrv0541 11241407572D,,0,www.bindingDB.org,330438,InChI=1S/C21H27NO/c1-2-3-4-5-6-10-17-21(23)22-...,QOVZQINZWUSANF-UHFFFAOYSA-N,50371699,"CHEMBL270041::US9149492, 2",...,Sterol O-acyltransferase 2,ACAT-2,O75908,F5H7W4 I6L9H9 Q4VB99 Q4VBA1 Q96TD4 Q9UNR2,,,,,,


In [44]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219 entries, 0 to 218
Data columns (total 54 columns):
 #   Column                                                     Non-Null Count  Dtype  
---  ------                                                     --------------  -----  
 0   _Name                                                      44 non-null     object 
 1   _MolFileInfo                                               219 non-null    object 
 2   _MolFileComments                                           0 non-null      float64
 3   _MolFileChiralFlag                                         219 non-null    int64  
 4   From                                                       219 non-null    object 
 5   BindingDB Reactant_set_id                                  219 non-null    int64  
 6   Ligand InChI                                               219 non-null    object 
 7   Ligand InChI Key                                           219 non-null    object 
 8   BindingDB 

In [16]:
df1.describe()

Unnamed: 0,_MolFileComments,_MolFileChiralFlag,BindingDB Reactant_set_id,BindingDB MonomerID,Ki (nM),Kd (nM),EC50 (nM),kon (M-1-s-1),koff (s-1),pH,...,DrugBank ID of Ligand,IUPHAR_GRAC ID of Ligand,Number of Protein Chains in Target (,PDB ID(s) of Target Chain,UniProt (SwissProt) Alternative ID(s) of Target Chain,UniProt (TrEMBL) Submitted Name of Target Chain,UniProt (TrEMBL) Entry Name of Target Chain,UniProt (TrEMBL) Primary ID of Target Chain,UniProt (TrEMBL) Secondary ID(s) of Target Chain,UniProt (TrEMBL) Alternative ID(s) of Target Chain
count,0.0,219.0,219.0,219.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,219.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,,0.041096,43735930.0,43704510.0,,,,,,,...,,,1.0,,,,,,,
std,,0.198967,17566410.0,16994870.0,,,,,,,...,,,0.0,,,,,,,
min,,0.0,330429.0,113723.0,,,,,,,...,,,1.0,,,,,,,
25%,,0.0,50357580.0,50150480.0,,,,,,,...,,,1.0,,,,,,,
50%,,0.0,51053270.0,50429010.0,,,,,,,...,,,1.0,,,,,,,
75%,,0.0,51068240.0,50433210.0,,,,,,,...,,,1.0,,,,,,,
max,,1.0,51346970.0,50544940.0,,,,,,,...,,,1.0,,,,,,,


## There is no SMILES column - let's extract the SMILES from the SDF file first

In [41]:
# Extract the SMILES from the SDF file

# Load the SDF file
mols = Chem.SDMolSupplier('Sterol O-acyltransferase 2 inhibitors 219.sdf')

# Extract the name and SMILES string of each molecule
data = []
for mol in mols:
    if mol is not None:
        name = mol.GetProp('_Name')
        smiles = Chem.MolToSmiles(mol)
        data.append({'Name': name, 'SMILES': smiles})

# Save the data to a CSV file using pandas
import pandas as pd
df2 = pd.DataFrame(data)
df2.to_csv('output2_Sterol O-acyltransferase 2 inhibitors 219.csv', index=False)

In [42]:
df2 = pd.read_csv('output2_Sterol O-acyltransferase 2 inhibitors 219.csv')
df2

Unnamed: 0,Name,SMILES
0,,COc1ccc(C(=O)O[C@H]2C[C@H]3[C@](C)(COC(C)=O)[C...
1,,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C#N...
2,,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C#N...
3,,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C)c...
4,,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3cccc(F)...
...,...,...
214,,C=C1CCC[C@]2(C)C[C@]3(O)OC(=O)C(C)=C3C[C@@H]12
215,,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1
216,,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1
217,,CCCCCCCCC(=O)Nc1ccccc1-c1ccccc1


In [43]:
# Drop the Name column, which is NaN for most rows

df2 = df2.drop(["Name"], axis=1)
df2

Unnamed: 0,SMILES
0,COc1ccc(C(=O)O[C@H]2C[C@H]3[C@](C)(COC(C)=O)[C...
1,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C#N...
2,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C#N...
3,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C)c...
4,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3cccc(F)...
...,...
214,C=C1CCC[C@]2(C)C[C@]3(O)OC(=O)C(C)=C3C[C@@H]12
215,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1
216,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1
217,CCCCCCCCC(=O)Nc1ccccc1-c1ccccc1


## There is no IC50 column - let's convert the TSV file to CSV and extract IC50

In [35]:
import pandas as pd

# read TSV file
df3 = pd.read_csv('Sterol O-acyltransferase 2 inhibitors 219.tsv', sep='\t')

# write CSV file
df3.to_csv('Output3_Sterol O-acyltransferase 2 inhibitors 219.csv', index=False)

In [36]:
df3 = pd.read_csv('Output3_Sterol O-acyltransferase 2 inhibitors 219.csv')
df3

Unnamed: 0,BindingDB Reactant_set_id,Ligand SMILES,Ligand InChI,Ligand InChI Key,BindingDB MonomerID,BindingDB Ligand Name,Target Name,Target Source Organism According to Curator or DataSource,Ki (nM),IC50 (nM),...,Sterol O-acyltransferase 2.1,ACAT-2,O75908,F5H7W4 I6L9H9 Q4VB99 Q4VBA1 Q96TD4 Q9UNR2,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98
0,51053294,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C37H38N2O10/c1-20(40)45-19-36(4)27-16...,HBSJDWOXHMSOQS-ZRPSWOMSSA-N,50429604,CHEMBL2334539,Sterol O-acyltransferase 2,,,0.900000,...,,,,,,,,,,
1,51068473,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C37H38N2O10/c1-20(40)45-19-36(4)27-16...,HBSJDWOXHMSOQS-ZRPSWOMSSA-N,50429604,CHEMBL2334539,Sterol O-acyltransferase 2,,,0.900000,...,,,,,,,,,,
2,51053301,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C37H41NO10/c1-20-9-11-23(12-10-20)33(...,PXWLUSGRVNCIQG-ZRPSWOMSSA-N,50429610,CHEMBL2334222,Sterol O-acyltransferase 2,,,0.900000,...,,,,,,,,,,
3,51053293,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C36H38FNO10/c1-19(39)44-18-35(4)26-16...,ZUQDNOHGPAJTCX-HRSZXCBJSA-N,50429603,CHEMBL2334540,Sterol O-acyltransferase 2,,,0.900000,...,,,,,,,,,,
4,51053292,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C36H38ClNO10/c1-19(39)44-18-35(4)26-1...,JJTKTRDEEWKAQK-HRSZXCBJSA-N,50429602,CHEMBL2334541,Sterol O-acyltransferase 2,,,1.000000,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,51346963,CC1=C2C[C@H]3C(=C)CCC[C@]3(C)C[C@]2(O)OC1=O,InChI=1S/C15H20O3/c1-9-5-4-6-14(3)8-15(17)12(7...,FBMORZZOJSDNRQ-GLQYFDAESA-N,50241945,Atractylenolide III::CHEMBL486961,Sterol O-acyltransferase 2,,,187300,...,,,,,,,,,,
214,330437,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1,InChI=1S/C20H25NO/c1-2-3-4-5-9-16-20(22)21-19-...,CLTPDJWVYAJEKC-UHFFFAOYSA-N,50371698,"CHEMBL408322::US9149492, 1",Sterol O-acyltransferase 2,,,230000,...,,,,,,,,,,
215,50855129,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1,InChI=1S/C20H25NO/c1-2-3-4-5-9-16-20(22)21-19-...,CLTPDJWVYAJEKC-UHFFFAOYSA-N,50371698,"CHEMBL408322::US9149492, 1",Sterol O-acyltransferase 2,,,230000,...,,,,,,,,,,
216,330438,CCCCCCCCC(=O)Nc1ccccc1-c1ccccc1,InChI=1S/C21H27NO/c1-2-3-4-5-6-10-17-21(23)22-...,QOVZQINZWUSANF-UHFFFAOYSA-N,50371699,"CHEMBL270041::US9149492, 2",Sterol O-acyltransferase 2,,,414000,...,,,,,,,,,,


In [37]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218 entries, 0 to 217
Data columns (total 99 columns):
 #   Column  Non-Null Count  Dtype  
---  ------

In [26]:
df3.describe()

Unnamed: 0,BindingDB Reactant_set_id,BindingDB MonomerID,Ki (nM),Kd (nM),EC50 (nM),kon (M-1-s-1),koff (s-1),pH,Temp (C),PMID,...,Sterol O-acyltransferase 2.1,ACAT-2,O75908,F5H7W4 I6L9H9 Q4VB99 Q4VBA1 Q96TD4 Q9UNR2,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98
count,218.0,218.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,187.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,43702360.0,43673660.0,,,,,,,,23352710.0,...,,,,,,,,,,
std,17599800.0,17027840.0,,,,,,,,3963607.0,...,,,,,,,,,,
min,330429.0,113723.0,,,,,,,,15149650.0,...,,,,,,,,,,
25%,50357570.0,50150480.0,,,,,,,,23369540.0,...,,,,,,,,,,
50%,51053270.0,50429010.0,,,,,,,,23535330.0,...,,,,,,,,,,
75%,51068330.0,50433210.0,,,,,,,,24165110.0,...,,,,,,,,,,
max,51346970.0,50544940.0,,,,,,,,32035700.0,...,,,,,,,,,,


In [61]:
df4 = df3.rename(columns={"BindingDB Reactant_set_id": "bdID"})
df4

Unnamed: 0,bdID,Ligand SMILES,Ligand InChI,Ligand InChI Key,BindingDB MonomerID,BindingDB Ligand Name,Target Name,Target Source Organism According to Curator or DataSource,Ki (nM),IC50 (nM),...,Sterol O-acyltransferase 2.1,ACAT-2,O75908,F5H7W4 I6L9H9 Q4VB99 Q4VBA1 Q96TD4 Q9UNR2,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98
0,51053294,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C37H38N2O10/c1-20(40)45-19-36(4)27-16...,HBSJDWOXHMSOQS-ZRPSWOMSSA-N,50429604,CHEMBL2334539,Sterol O-acyltransferase 2,,,0.900000,...,,,,,,,,,,
1,51068473,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C37H38N2O10/c1-20(40)45-19-36(4)27-16...,HBSJDWOXHMSOQS-ZRPSWOMSSA-N,50429604,CHEMBL2334539,Sterol O-acyltransferase 2,,,0.900000,...,,,,,,,,,,
2,51053301,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C37H41NO10/c1-20-9-11-23(12-10-20)33(...,PXWLUSGRVNCIQG-ZRPSWOMSSA-N,50429610,CHEMBL2334222,Sterol O-acyltransferase 2,,,0.900000,...,,,,,,,,,,
3,51053293,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C36H38FNO10/c1-19(39)44-18-35(4)26-16...,ZUQDNOHGPAJTCX-HRSZXCBJSA-N,50429603,CHEMBL2334540,Sterol O-acyltransferase 2,,,0.900000,...,,,,,,,,,,
4,51053292,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C36H38ClNO10/c1-19(39)44-18-35(4)26-1...,JJTKTRDEEWKAQK-HRSZXCBJSA-N,50429602,CHEMBL2334541,Sterol O-acyltransferase 2,,,1.000000,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,51346963,CC1=C2C[C@H]3C(=C)CCC[C@]3(C)C[C@]2(O)OC1=O,InChI=1S/C15H20O3/c1-9-5-4-6-14(3)8-15(17)12(7...,FBMORZZOJSDNRQ-GLQYFDAESA-N,50241945,Atractylenolide III::CHEMBL486961,Sterol O-acyltransferase 2,,,187300,...,,,,,,,,,,
214,330437,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1,InChI=1S/C20H25NO/c1-2-3-4-5-9-16-20(22)21-19-...,CLTPDJWVYAJEKC-UHFFFAOYSA-N,50371698,"CHEMBL408322::US9149492, 1",Sterol O-acyltransferase 2,,,230000,...,,,,,,,,,,
215,50855129,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1,InChI=1S/C20H25NO/c1-2-3-4-5-9-16-20(22)21-19-...,CLTPDJWVYAJEKC-UHFFFAOYSA-N,50371698,"CHEMBL408322::US9149492, 1",Sterol O-acyltransferase 2,,,230000,...,,,,,,,,,,
216,330438,CCCCCCCCC(=O)Nc1ccccc1-c1ccccc1,InChI=1S/C21H27NO/c1-2-3-4-5-6-10-17-21(23)22-...,QOVZQINZWUSANF-UHFFFAOYSA-N,50371699,"CHEMBL270041::US9149492, 2",Sterol O-acyltransferase 2,,,414000,...,,,,,,,,,,


In [101]:
df5 = df4.rename(columns={"IC50 (nM)": "IC50"})
df5

Unnamed: 0,bdID,Ligand SMILES,Ligand InChI,Ligand InChI Key,BindingDB MonomerID,BindingDB Ligand Name,Target Name,Target Source Organism According to Curator or DataSource,Ki (nM),IC50,...,Sterol O-acyltransferase 2.1,ACAT-2,O75908,F5H7W4 I6L9H9 Q4VB99 Q4VBA1 Q96TD4 Q9UNR2,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98
0,51053294,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C37H38N2O10/c1-20(40)45-19-36(4)27-16...,HBSJDWOXHMSOQS-ZRPSWOMSSA-N,50429604,CHEMBL2334539,Sterol O-acyltransferase 2,,,0.900000,...,,,,,,,,,,
1,51068473,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C37H38N2O10/c1-20(40)45-19-36(4)27-16...,HBSJDWOXHMSOQS-ZRPSWOMSSA-N,50429604,CHEMBL2334539,Sterol O-acyltransferase 2,,,0.900000,...,,,,,,,,,,
2,51053301,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C37H41NO10/c1-20-9-11-23(12-10-20)33(...,PXWLUSGRVNCIQG-ZRPSWOMSSA-N,50429610,CHEMBL2334222,Sterol O-acyltransferase 2,,,0.900000,...,,,,,,,,,,
3,51053293,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C36H38FNO10/c1-19(39)44-18-35(4)26-16...,ZUQDNOHGPAJTCX-HRSZXCBJSA-N,50429603,CHEMBL2334540,Sterol O-acyltransferase 2,,,0.900000,...,,,,,,,,,,
4,51053292,CC(=O)OC[C@]1(C)[C@H](CC[C@@]2(C)[C@H]1C[C@H](...,InChI=1S/C36H38ClNO10/c1-19(39)44-18-35(4)26-1...,JJTKTRDEEWKAQK-HRSZXCBJSA-N,50429602,CHEMBL2334541,Sterol O-acyltransferase 2,,,1.000000,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
213,51346963,CC1=C2C[C@H]3C(=C)CCC[C@]3(C)C[C@]2(O)OC1=O,InChI=1S/C15H20O3/c1-9-5-4-6-14(3)8-15(17)12(7...,FBMORZZOJSDNRQ-GLQYFDAESA-N,50241945,Atractylenolide III::CHEMBL486961,Sterol O-acyltransferase 2,,,187300,...,,,,,,,,,,
214,330437,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1,InChI=1S/C20H25NO/c1-2-3-4-5-9-16-20(22)21-19-...,CLTPDJWVYAJEKC-UHFFFAOYSA-N,50371698,"CHEMBL408322::US9149492, 1",Sterol O-acyltransferase 2,,,230000,...,,,,,,,,,,
215,50855129,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1,InChI=1S/C20H25NO/c1-2-3-4-5-9-16-20(22)21-19-...,CLTPDJWVYAJEKC-UHFFFAOYSA-N,50371698,"CHEMBL408322::US9149492, 1",Sterol O-acyltransferase 2,,,230000,...,,,,,,,,,,
216,330438,CCCCCCCCC(=O)Nc1ccccc1-c1ccccc1,InChI=1S/C21H27NO/c1-2-3-4-5-6-10-17-21(23)22-...,QOVZQINZWUSANF-UHFFFAOYSA-N,50371699,"CHEMBL270041::US9149492, 2",Sterol O-acyltransferase 2,,,414000,...,,,,,,,,,,


In [102]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218 entries, 0 to 217
Data columns (total 99 columns):
 #   Column  Non-Null Count  Dtype  
---  ------

## The IC50 column has dtype object rather than float64 - let's clean and convert it

In [103]:
print(df5['IC50'])

0       0.900000
1       0.900000
2       0.900000
3       0.900000
4       1.000000
         ...    
213       187300
214       230000
215       230000
216       414000
217       414000
Name: IC50, Length: 218, dtype: object


In [104]:
# Remove any '>' qualifier so the strings can be parsed as numbers
df5['IC50'] = df5['IC50'].str.replace('>', '', regex=False)

In [105]:
df5['IC50'] = df5['IC50'].astype('float64')
print(df5['IC50'].dtypes)

float64
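
The cleaning step above only removes `>`. As a minimal sketch, assuming the column may also hold a `<` qualifier, padding spaces, or empty entries (none of which we have confirmed in this dataset), a small helper can handle each value on its own. The name `parse_ic50` and the example strings are hypothetical and only serve as an illustration.

```python
def parse_ic50(raw):
    """Convert an IC50 string such as ' 0.900000' or '>10000' to a float.

    Returns None for empty or missing entries.
    """
    if raw is None:
        return None
    # Remove surrounding whitespace, then any leading '>' or '<' qualifier
    text = str(raw).strip().lstrip("<>").strip()
    if not text:
        return None
    return float(text)


# Hypothetical example strings, not taken from the BindingDB file
examples = [" 0.900000", ">10000", "<5", "187300", ""]
parsed = [parse_ic50(value) for value in examples]
print(parsed)  # [0.9, 10000.0, 5.0, 187300.0, None]
```

Such a helper could be applied to the column with `df5['IC50'].map(parse_ic50)`. Keep in mind that dropping a qualifier treats a censored value (for example "greater than 10000") as if it were an exact measurement.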


### **Collect the *SMILES* into a list**

In [106]:
mol_smiles = df2.SMILES.tolist()

### **Collect the *bdID* into a list**

In [107]:
mol_bdID = df4.bdID.tolist()

### **Collect the *IC50 (nM)* into a list**

In [108]:
mol_IC50 = df5.IC50.tolist()

### **Labeling compounds as active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with an IC50 of 1,000 nM or less are labeled **active**, those with an IC50 of 10,000 nM or more are labeled **inactive**, and those in between are labeled **intermediate**.

In [109]:
bioactivity_class = []
for i in df5.IC50:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
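
The same labeling rule can be written as a small function, which makes the boundary cases easy to check. This is a minimal sketch: the function name `classify_ic50` and its parameter names are our own choice, and the thresholds are the ones used in the cell above.

```python
def classify_ic50(ic50_nm, active_max=1000.0, inactive_min=10000.0):
    """Label an IC50 value (in nM) with the same rule as the loop above."""
    if ic50_nm >= inactive_min:
        return "inactive"
    if ic50_nm <= active_max:
        return "active"
    return "intermediate"


# Boundary values are included on purpose: 1000 is active, 10000 is inactive
values = [0.9, 1000.0, 5000.0, 10000.0, 414000.0]
labels = [classify_ic50(v) for v in values]
print(labels)  # ['active', 'active', 'intermediate', 'inactive', 'inactive']
```

With pandas, the function could be applied in one line as `df5['IC50'].map(classify_ic50)`.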

### **Combine the 4 lists into a dataframe**

In [110]:
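# zip() pairs the lists by position and stops at the shortest one, so the lists
# must have the same length and row order for each SMILES to match its bdID and IC50
data_tuples = list(zip(mol_bdID, mol_smiles, bioactivity_class, mol_IC50))
df6 = pd.DataFrame(data_tuples, columns=['mol_bdID', 'mol_smiles', 'bioactivity_class', 'mol_IC50'])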

In [111]:
df6

Unnamed: 0,mol_bdID,mol_smiles,bioactivity_class,mol_IC50
0,51053294,COc1ccc(C(=O)O[C@H]2C[C@H]3[C@](C)(COC(C)=O)[C...,active,0.9
1,51068473,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C#N...,active,0.9
2,51053301,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C#N...,active,0.9
3,51053293,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C)c...,active,0.9
4,51053292,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3cccc(F)...,active,1.0
...,...,...,...,...
213,51346963,CC(C)=CC=C(OC(=O)CC(C)(C)O)c1cc(O)c2c(O)ccc(O)...,inactive,187300.0
214,330437,C=C1CCC[C@]2(C)C[C@]3(O)OC(=O)C(C)=C3C[C@@H]12,inactive,230000.0
215,50855129,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1,inactive,230000.0
216,330438,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1,inactive,414000.0
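
A word of caution about this table: the SMILES come from the SDF file (219 records according to `df1.info()`), while the bdID and IC50 values come from the TSV file (218 rows). The rows appear to be shifted by one: for example, bdID 330437 is paired here with the SMILES that the TSV table lists for bdID 51346963. `zip()` does not complain when the lists differ in length, so a guard like the one below can catch the problem early. This is only a sketch; the helper name `combine_columns` and the toy values are hypothetical.

```python
def combine_columns(**columns):
    """Zip equally long lists into row tuples; raise if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return list(zip(*columns.values()))


# Toy stand-ins for the notebook lists (not real BindingDB data)
ids = [101, 102]
smiles_from_sdf = ["CCO", "CCN", "CCC"]  # one extra record, as in the SDF file

try:
    combine_columns(mol_bdID=ids, mol_smiles=smiles_from_sdf)
except ValueError as err:
    message = str(err)
    print(message)

# With matching lengths the rows are paired as expected
rows = combine_columns(mol_bdID=ids, mol_smiles=["CCN", "CCC"])
print(rows)
```

A length check does not prove that the row order matches. Since the TSV table already has a `Ligand SMILES` column, a simpler way to keep everything aligned would be to take all three columns from the same frame, for example `df5[['bdID', 'Ligand SMILES', 'IC50']]`.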


## **Handling missing data**
If any compound has a missing value in the **IC50** column, drop it.

In [120]:
# Keep only the rows that have an IC50 value (none are missing here, so all rows remain)
df7 = df6[df6.mol_IC50.notna()]
df7

Unnamed: 0,mol_bdID,mol_smiles,bioactivity_class,mol_IC50
0,51053294,COc1ccc(C(=O)O[C@H]2C[C@H]3[C@](C)(COC(C)=O)[C...,active,0.9
1,51068473,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C#N...,active,0.9
2,51053301,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C#N...,active,0.9
3,51053293,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3ccc(C)c...,active,0.9
4,51053292,CC(=O)OC[C@@]1(C)[C@@H]2C[C@H](OC(=O)c3cccc(F)...,active,1.0
...,...,...,...,...
213,51346963,CC(C)=CC=C(OC(=O)CC(C)(C)O)c1cc(O)c2c(O)ccc(O)...,inactive,187300.0
214,330437,C=C1CCC[C@]2(C)C[C@]3(O)OC(=O)C(C)=C3C[C@@H]12,inactive,230000.0
215,50855129,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1,inactive,230000.0
216,330438,CCCCCCCC(=O)Nc1ccccc1-c1ccccc1,inactive,414000.0
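
To make clear what the `notna()` filter does, here is a minimal plain-Python sketch of the same idea: a row is kept only if its IC50 is neither `None` nor NaN. The helper name `has_value` and the toy rows are hypothetical.

```python
import math


def has_value(x):
    """Return True unless x is None or a float NaN."""
    return x is not None and not (isinstance(x, float) and math.isnan(x))


# Toy rows (not real BindingDB data): two of the four lack an IC50
rows = [
    {"mol_bdID": 1, "mol_IC50": 0.9},
    {"mol_bdID": 2, "mol_IC50": float("nan")},
    {"mol_bdID": 3, "mol_IC50": None},
    {"mol_bdID": 4, "mol_IC50": 414000.0},
]

kept = [row for row in rows if has_value(row["mol_IC50"])]
kept_ids = [row["mol_bdID"] for row in kept]
print(kept_ids)  # [1, 4]
```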


In [121]:
df7.to_csv('final_output.csv', index=False)

In [122]:
df7.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 218 entries, 0 to 217
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   mol_bdID           218 non-null    int64  
 1   mol_smiles         218 non-null    object 
 2   bioactivity_class  218 non-null    object 
 3   mol_IC50           218 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 8.5+ KB


In [123]:
df7.describe()

Unnamed: 0,mol_bdID,mol_IC50
count,218.0,218.0
mean,43702360.0,21076.343119
std,17599800.0,54043.479575
min,330429.0,0.9
25%,50357570.0,19.0
50%,51053270.0,629.5
75%,51068330.0,11125.0
max,51346970.0,414000.0


In [124]:
! ls -l

total 8560
-rw-r--r--  1 akrikhalid  staff   394873 May 29 01:30 Output3_Sterol O-acyltransferase 2 inhibitors 219.csv
-rw-r--r--  1 akrikhalid  staff   394873 May 29 00:27 Sterol O-acyltransferase 2 inhibitors 219.csv
-rw-r--r--  1 akrikhalid  staff   258787 May 29 02:40 Sterol O-acyltransferase 2 inhibitors 219.ipynb
-rw-r--r--@ 1 akrikhalid  staff  1525829 May 28 23:19 Sterol O-acyltransferase 2 inhibitors 219.sdf
-rw-r--r--@ 1 akrikhalid  staff   382470 May 29 00:20 Sterol O-acyltransferase 2 inhibitors 219.tsv
-rw-r--r--  1 akrikhalid  staff    25632 May 29 02:41 final_output.csv
-rw-r--r--  1 akrikhalid  staff    21055 May 29 01:32 output2_Sterol O-acyltransferase 2 inhibitors 219.csv
-rw-r--r--  1 akrikhalid  staff   394873 May 29 01:44 output4_Sterol O-acyltransferase 2 inhibitors 219.csv
-rw-r--r--  1 akrikhalid  staff   375277 May 29 00:31 output_Sterol O-acyltransferase 2 inhibitors 219.csv
