# **Computational Drug Discovery - SOAT-1: Download Bioactivity Data, Part 01**

Khalid El Akri

[*'chem Code Professor' YouTube channel*](http://youtube.com/@chemcodeprofessor)

In this Jupyter notebook, we build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we build a machine learning model using BindingDB bioactivity data.

In **Part 01**, we collect and pre-process data from the BindingDB database. Let's get started!

# Import and convert the SOAT-1 inhibitor data from an SDF file to a CSV file

In [1]:
from rdkit import Chem
import pandas as pd
from rdkit.Chem import PandasTools

# Read the SDF file into RDKit
suppl = Chem.SDMolSupplier('Sterol O-acyltransferase 1 inhibitors 1703.sdf')

# Convert each molecule to a dictionary
molecules = []
for mol in suppl:
    if mol is not None:
        molecules.append(mol.GetPropsAsDict(True))

# Convert the dictionaries to a DataFrame
df = pd.DataFrame(molecules)

# Write the DataFrame to a CSV file
df.to_csv('output_Sterol O-acyltransferase 1 inhibitors 1703.csv', index=False)

## 1. Let's check the output CSV file

In [2]:
df1 = pd.read_csv('output_Sterol O-acyltransferase 1 inhibitors 1703.csv')
df1

Unnamed: 0,_Name,_MolFileInfo,_MolFileComments,_MolFileChiralFlag,From,BindingDB Reactant_set_id,Ligand InChI,Ligand InChI Key,BindingDB MonomerID,BindingDB Ligand Name,...,UniProt (SwissProt) Recommended Name of Target Chain,UniProt (SwissProt) Entry Name of Target Chain,UniProt (SwissProt) Primary ID of Target Chain,UniProt (SwissProt) Secondary ID(s) of Target Chain,UniProt (SwissProt) Alternative ID(s) of Target Chain,UniProt (TrEMBL) Submitted Name of Target Chain,UniProt (TrEMBL) Entry Name of Target Chain,UniProt (TrEMBL) Primary ID of Target Chain,UniProt (TrEMBL) Secondary ID(s) of Target Chain,UniProt (TrEMBL) Alternative ID(s) of Target Chain
0,,Mrv0541 12012115242D,,0,www.bindingDB.org,50648002,InChI=1S/C28H25N5O5/c34-24(19-10-7-13-21(14-19...,OOMSOWNNUKTQQA-UHFFFAOYSA-N,50229236,CHEMBL46356,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
1,,Mrv0541 12012115242D,,0,www.bindingDB.org,50648007,InChI=1S/C28H26N4O3/c33-24(20-12-6-2-7-13-20)1...,HEODCNVWJVNTHS-UHFFFAOYSA-N,50229241,CHEMBL295382,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
2,,Mrv0541 12012115242D,,0,www.bindingDB.org,50647999,InChI=1S/C29H22N4O3/c34-25(22-14-8-3-9-15-22)1...,JIGJFFJVSBVHBA-UHFFFAOYSA-N,50229233,CHEMBL299604,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
3,,Mrv0541 12012115242D,,0,www.bindingDB.org,50648003,InChI=1S/C28H28N4O3/c1-18(2)14-19(3)30-16-22-2...,NYWDSFJIYIHHHP-UHFFFAOYSA-N,50229237,CHEMBL47580,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
4,,RDKit 2D,,0,www.bindingDB.org,51138741,InChI=1S/C23H29N5OS4/c1-16-14-19(30-2)21(22(24...,AMJCKXKAZQZZDV-UHFFFAOYSA-N,50466752,CHEMBL4281138,...,Sterol O-acyltransferase 1,ACAT-1,P35610,A6NC40 A8K3P4 A9Z1V7 B4DU95 Q5T0X4 Q8N1E4,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1698,,Mrv0541 11221423482D,,0,www.bindingDB.org,50051759,InChI=1S/C30H42N4O2S/c1-4-5-6-7-14-19-34(30(35...,FLVVFAUKANQOOA-UHFFFAOYSA-N,50033977,"1-{2-[2-(4,5-Diphenyl-1H-imidazol-2-ylsulfanyl...",...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
1699,,Mrv0541 04241521002D,,0,www.bindingDB.org,50004370,InChI=1S/C35H35F2N5O3S/c1-44-28-14-9-24(10-15-...,VDAACKVBVCTIAI-UHFFFAOYSA-N,50450364,CHEMBL50171,...,Sterol O-acyltransferase 1,ACAT-1,Q61263,Q5XK32 Q64180,,,,,,
1700,,Mrv0541 04241521002D,,0,www.bindingDB.org,50004374,InChI=1S/C31H37N5OS/c1-24(2)33-31(37)36(22-19-...,WCGYVMMVCWPXRM-UHFFFAOYSA-N,50450353,CHEMBL299178,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
1701,,Mrv0541 04241521002D,,0,www.bindingDB.org,50004847,InChI=1S/C27H36N6O3/c1-2-3-4-5-6-7-8-9-10-14-2...,MOINZZQENNMLQE-UHFFFAOYSA-N,50450386,CHEMBL98716,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,


In [3]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1703 entries, 0 to 1702
Data columns (total 54 columns):
 #   Column                                                     Non-Null Count  Dtype  
---  ------                                                     --------------  -----  
 0   _Name                                                      170 non-null    object 
 1   _MolFileInfo                                               1703 non-null   object 
 2   _MolFileComments                                           0 non-null      float64
 3   _MolFileChiralFlag                                         1703 non-null   int64  
 4   From                                                       1703 non-null   object 
 5   BindingDB Reactant_set_id                                  1703 non-null   int64  
 6   Ligand InChI                                               1703 non-null   object 
 7   Ligand InChI Key                                           1703 non-null   object 
 8   BindingD

In [4]:
df1.describe()

Unnamed: 0,_MolFileComments,_MolFileChiralFlag,BindingDB Reactant_set_id,BindingDB MonomerID,Ki (nM),Kd (nM),EC50 (nM),kon (M-1-s-1),koff (s-1),pH,...,PubChem SID,ChEBI ID of Ligand,IUPHAR_GRAC ID of Ligand,Number of Protein Chains in Target (,UniProt (SwissProt) Alternative ID(s) of Target Chain,UniProt (TrEMBL) Submitted Name of Target Chain,UniProt (TrEMBL) Entry Name of Target Chain,UniProt (TrEMBL) Primary ID of Target Chain,UniProt (TrEMBL) Secondary ID(s) of Target Chain,UniProt (TrEMBL) Alternative ID(s) of Target Chain
count,0.0,1703.0,1703.0,1703.0,0.0,0.0,0.0,0.0,0.0,7.0,...,1703.0,27.0,7.0,1703.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,,0.083382,49756670.0,49299960.0,,,,,,7.4,...,164050800.0,45890.666667,1863.0,1.0,,,,,,
std,,0.27654,6166758.0,6696684.0,,,,,,9.593423e-16,...,107187700.0,20496.751889,2143.058795,0.0,,,,,,
min,,0.0,51064.0,8903.0,,,,,,7.4,...,49846050.0,16196.0,1052.0,1.0,,,,,,
25%,,0.0,50073280.0,50050200.0,,,,,,7.4,...,103947400.0,24868.5,1052.0,1.0,,,,,,
50%,,0.0,50529970.0,50158060.0,,,,,,7.4,...,104061400.0,53721.0,1054.0,1.0,,,,,,
75%,,0.0,50779900.0,50429570.0,,,,,,7.4,...,175445500.0,65444.0,1054.0,1.0,,,,,,
max,,1.0,51346970.0,50544940.0,,,,,,7.4,...,442146500.0,69958.0,6723.0,1.0,,,,,,


## There is no SMILES column, so let's extract SMILES from the SDF file first

In [25]:
# Let's extract SMILES from this SDF file

# Load the SDF file
mols = Chem.SDMolSupplier('Sterol O-acyltransferase 1 inhibitors 1703.sdf')

# Extract the name and SMILES string of each molecule
data = []
for mol in mols:
    if mol is not None:
        name = mol.GetProp('_Name')
        smiles = Chem.MolToSmiles(mol)
        data.append({'Name': name, 'SMILES': smiles})

# Save the data to a CSV file using pandas
import pandas as pd
df2 = pd.DataFrame(data)
df2.to_csv('output2_Sterol O-acyltransferase 1 inhibitors 1703.csv', index=False)

In [26]:
df2 = pd.read_csv('output2_Sterol O-acyltransferase 1 inhibitors 1703.csv')
df2

Unnamed: 0,Name,SMILES
0,,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(C1CCCCC...
1,,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(C1CCCCC...
2,,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(Cc1cccc...
3,,CC(C)CC(C)N1Cc2c(n(CC(=O)c3ccccc3)c3cc(-c4cccc...
4,,CSc1cc(C)nc(SC)c1NC(=O)CN1CCN(CCSc2nc3ccccc3s2...
...,...,...
1698,,CCCCCCCN(CCOCCSc1nc(-c2ccccc2)c(-c2ccccc2)[nH]...
1699,,COc1ccc(-c2nc(SCCCCCN(Cc3ccccn3)C(=O)Nc3ccc(F)...
1700,,CC(C)NC(=O)N(CCCCCSc1nc(-c2ccccc2)c(-c2ccccc2)...
1701,,CCCCCCCCCCCCn1nnnc1C(NC(=O)c1cccc([N+](=O)[O-]...


In [7]:
# Drop the Name column, which contains NaN values

df2 = df2.drop(["Name"], axis=1)
df2

Unnamed: 0,SMILES
0,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(C1CCCCC...
1,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(C1CCCCC...
2,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(Cc1cccc...
3,CC(C)CC(C)N1Cc2c(n(CC(=O)c3ccccc3)c3cc(-c4cccc...
4,CSc1cc(C)nc(SC)c1NC(=O)CN1CCN(CCSc2nc3ccccc3s2...
...,...
1698,CCCCCCCN(CCOCCSc1nc(-c2ccccc2)c(-c2ccccc2)[nH]...
1699,COc1ccc(-c2nc(SCCCCCN(Cc3ccccn3)C(=O)Nc3ccc(F)...
1700,CC(C)NC(=O)N(CCCCCSc1nc(-c2ccccc2)c(-c2ccccc2)...
1701,CCCCCCCCCCCCn1nnnc1C(NC(=O)c1cccc([N+](=O)[O-]...


---

### **Rename the *BindingDB Reactant_set_id* column to *bdID***

In [28]:
df3 = df1.rename(columns={"BindingDB Reactant_set_id": "bdID"})
df3

Unnamed: 0,_Name,_MolFileInfo,_MolFileComments,_MolFileChiralFlag,From,bdID,Ligand InChI,Ligand InChI Key,BindingDB MonomerID,BindingDB Ligand Name,...,UniProt (SwissProt) Recommended Name of Target Chain,UniProt (SwissProt) Entry Name of Target Chain,UniProt (SwissProt) Primary ID of Target Chain,UniProt (SwissProt) Secondary ID(s) of Target Chain,UniProt (SwissProt) Alternative ID(s) of Target Chain,UniProt (TrEMBL) Submitted Name of Target Chain,UniProt (TrEMBL) Entry Name of Target Chain,UniProt (TrEMBL) Primary ID of Target Chain,UniProt (TrEMBL) Secondary ID(s) of Target Chain,UniProt (TrEMBL) Alternative ID(s) of Target Chain
0,,Mrv0541 12012115242D,,0,www.bindingDB.org,50648002,InChI=1S/C28H25N5O5/c34-24(19-10-7-13-21(14-19...,OOMSOWNNUKTQQA-UHFFFAOYSA-N,50229236,CHEMBL46356,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
1,,Mrv0541 12012115242D,,0,www.bindingDB.org,50648007,InChI=1S/C28H26N4O3/c33-24(20-12-6-2-7-13-20)1...,HEODCNVWJVNTHS-UHFFFAOYSA-N,50229241,CHEMBL295382,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
2,,Mrv0541 12012115242D,,0,www.bindingDB.org,50647999,InChI=1S/C29H22N4O3/c34-25(22-14-8-3-9-15-22)1...,JIGJFFJVSBVHBA-UHFFFAOYSA-N,50229233,CHEMBL299604,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
3,,Mrv0541 12012115242D,,0,www.bindingDB.org,50648003,InChI=1S/C28H28N4O3/c1-18(2)14-19(3)30-16-22-2...,NYWDSFJIYIHHHP-UHFFFAOYSA-N,50229237,CHEMBL47580,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
4,,RDKit 2D,,0,www.bindingDB.org,51138741,InChI=1S/C23H29N5OS4/c1-16-14-19(30-2)21(22(24...,AMJCKXKAZQZZDV-UHFFFAOYSA-N,50466752,CHEMBL4281138,...,Sterol O-acyltransferase 1,ACAT-1,P35610,A6NC40 A8K3P4 A9Z1V7 B4DU95 Q5T0X4 Q8N1E4,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1698,,Mrv0541 11221423482D,,0,www.bindingDB.org,50051759,InChI=1S/C30H42N4O2S/c1-4-5-6-7-14-19-34(30(35...,FLVVFAUKANQOOA-UHFFFAOYSA-N,50033977,"1-{2-[2-(4,5-Diphenyl-1H-imidazol-2-ylsulfanyl...",...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
1699,,Mrv0541 04241521002D,,0,www.bindingDB.org,50004370,InChI=1S/C35H35F2N5O3S/c1-44-28-14-9-24(10-15-...,VDAACKVBVCTIAI-UHFFFAOYSA-N,50450364,CHEMBL50171,...,Sterol O-acyltransferase 1,ACAT-1,Q61263,Q5XK32 Q64180,,,,,,
1700,,Mrv0541 04241521002D,,0,www.bindingDB.org,50004374,InChI=1S/C31H37N5OS/c1-24(2)33-31(37)36(22-19-...,WCGYVMMVCWPXRM-UHFFFAOYSA-N,50450353,CHEMBL299178,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
1701,,Mrv0541 04241521002D,,0,www.bindingDB.org,50004847,InChI=1S/C27H36N6O3/c1-2-3-4-5-6-7-8-9-10-14-2...,MOINZZQENNMLQE-UHFFFAOYSA-N,50450386,CHEMBL98716,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,


### **Rename the *IC50 (nM)* column to *IC50***

In [29]:
df4 = df3.rename(columns={"IC50 (nM)": "IC50"})
df4

Unnamed: 0,_Name,_MolFileInfo,_MolFileComments,_MolFileChiralFlag,From,bdID,Ligand InChI,Ligand InChI Key,BindingDB MonomerID,BindingDB Ligand Name,...,UniProt (SwissProt) Recommended Name of Target Chain,UniProt (SwissProt) Entry Name of Target Chain,UniProt (SwissProt) Primary ID of Target Chain,UniProt (SwissProt) Secondary ID(s) of Target Chain,UniProt (SwissProt) Alternative ID(s) of Target Chain,UniProt (TrEMBL) Submitted Name of Target Chain,UniProt (TrEMBL) Entry Name of Target Chain,UniProt (TrEMBL) Primary ID of Target Chain,UniProt (TrEMBL) Secondary ID(s) of Target Chain,UniProt (TrEMBL) Alternative ID(s) of Target Chain
0,,Mrv0541 12012115242D,,0,www.bindingDB.org,50648002,InChI=1S/C28H25N5O5/c34-24(19-10-7-13-21(14-19...,OOMSOWNNUKTQQA-UHFFFAOYSA-N,50229236,CHEMBL46356,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
1,,Mrv0541 12012115242D,,0,www.bindingDB.org,50648007,InChI=1S/C28H26N4O3/c33-24(20-12-6-2-7-13-20)1...,HEODCNVWJVNTHS-UHFFFAOYSA-N,50229241,CHEMBL295382,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
2,,Mrv0541 12012115242D,,0,www.bindingDB.org,50647999,InChI=1S/C29H22N4O3/c34-25(22-14-8-3-9-15-22)1...,JIGJFFJVSBVHBA-UHFFFAOYSA-N,50229233,CHEMBL299604,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
3,,Mrv0541 12012115242D,,0,www.bindingDB.org,50648003,InChI=1S/C28H28N4O3/c1-18(2)14-19(3)30-16-22-2...,NYWDSFJIYIHHHP-UHFFFAOYSA-N,50229237,CHEMBL47580,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
4,,RDKit 2D,,0,www.bindingDB.org,51138741,InChI=1S/C23H29N5OS4/c1-16-14-19(30-2)21(22(24...,AMJCKXKAZQZZDV-UHFFFAOYSA-N,50466752,CHEMBL4281138,...,Sterol O-acyltransferase 1,ACAT-1,P35610,A6NC40 A8K3P4 A9Z1V7 B4DU95 Q5T0X4 Q8N1E4,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1698,,Mrv0541 11221423482D,,0,www.bindingDB.org,50051759,InChI=1S/C30H42N4O2S/c1-4-5-6-7-14-19-34(30(35...,FLVVFAUKANQOOA-UHFFFAOYSA-N,50033977,"1-{2-[2-(4,5-Diphenyl-1H-imidazol-2-ylsulfanyl...",...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
1699,,Mrv0541 04241521002D,,0,www.bindingDB.org,50004370,InChI=1S/C35H35F2N5O3S/c1-44-28-14-9-24(10-15-...,VDAACKVBVCTIAI-UHFFFAOYSA-N,50450364,CHEMBL50171,...,Sterol O-acyltransferase 1,ACAT-1,Q61263,Q5XK32 Q64180,,,,,,
1700,,Mrv0541 04241521002D,,0,www.bindingDB.org,50004374,InChI=1S/C31H37N5OS/c1-24(2)33-31(37)36(22-19-...,WCGYVMMVCWPXRM-UHFFFAOYSA-N,50450353,CHEMBL299178,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,
1701,,Mrv0541 04241521002D,,0,www.bindingDB.org,50004847,InChI=1S/C27H36N6O3/c1-2-3-4-5-6-7-8-9-10-14-2...,MOINZZQENNMLQE-UHFFFAOYSA-N,50450386,CHEMBL98716,...,Sterol O-acyltransferase 1,ACAT-1,O70536,,,,,,,


### **Copy the *IC50* column to a list**

In [47]:
mol_IC50 = []
for i in df4.IC50:
  mol_IC50.append(i)

### **Copy the *SMILES* column to a list**

In [48]:
mol_smiles = []
for i in df2.SMILES:
  mol_smiles.append(i)

### **Copy the *bdID* column to a list**

In [49]:
mol_bdID = []
for i in df4.bdID:
  mol_bdID.append(i)
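
The three loops above copy a column into a list one element at a time. A shorter equivalent is `Series.tolist()`. The sketch below shows both forms side by side on a small made-up frame; the `demo` DataFrame and its values are illustrative assumptions, not BindingDB data.

```python
import pandas as pd

# Hypothetical stand-in for df4; the values are made up for illustration.
demo = pd.DataFrame({
    "bdID": [1, 2, 3],
    "IC50": ["1.4", ">10000", "<5"],
})

# Loop version, as in the cells above
loop_ic50 = []
for value in demo.IC50:
    loop_ic50.append(value)

# One-call equivalent
list_ic50 = demo["IC50"].tolist()
list_bdid = demo["bdID"].tolist()
```

Both forms preserve row order, so the resulting lists can still be paired by position.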

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity values are IC50 measurements in nM. Compounds with an IC50 of 1,000 nM or less are labeled **active**, those with an IC50 of 10,000 nM or more are labeled **inactive**, and those strictly between 1,000 and 10,000 nM are labeled **intermediate**.

In [38]:
# Strip the '>' and '<' qualifiers so the values can be converted to float
df4['IC50'] = df4['IC50'].str.replace(r'[<>]', '', regex=True)

In [39]:
df4['IC50'] = df4['IC50'].astype('float64')
print(df4['IC50'].dtypes)

float64
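
Some IC50 entries carry a `>` or `<` qualifier, which is why the column is read as text and has to be cleaned before the conversion to `float64`. Stripping the symbol discards the fact that the value is only a bound. The following is a minimal sketch of an alternative that keeps the qualifier next to the number; `parse_ic50` is a hypothetical helper, not part of pandas or RDKit, and the example strings are made up.

```python
import math


def parse_ic50(text):
    """Split a raw IC50 entry into (qualifier, value_in_nM).

    Hypothetical helper: the qualifier is '>', '<' or '' and the value
    is NaN when the entry is empty or not numeric.
    """
    if text is None:
        return "", math.nan
    raw = str(text).strip()
    qualifier = ""
    if raw[:1] in ("<", ">"):
        qualifier, raw = raw[0], raw[1:].strip()
    try:
        return qualifier, float(raw)
    except ValueError:
        return qualifier, math.nan


# Made-up entries covering the cases handled in this notebook
examples = [" 1.4", ">10000", "<5", ""]
parsed = [parse_ic50(entry) for entry in examples]
```

Applied to a column, this could be used as `df4['IC50'].map(parse_ic50)`, after which the qualifier and the numeric value can be split into two columns.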


In [40]:
bioactivity_class = []
for i in df4.IC50:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
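
One detail of the loop above: a missing value (NaN) fails both comparisons and therefore falls through to "intermediate". The sketch below expresses the same thresholds as a small function that returns `None` for missing values instead. The function name and the constant names are assumptions made for illustration.

```python
import math

ACTIVE_MAX_NM = 1000      # IC50 <= 1000 nM is labeled active
INACTIVE_MIN_NM = 10000   # IC50 >= 10000 nM is labeled inactive


def classify_ic50(value_nm):
    """Return 'active', 'inactive', 'intermediate', or None for NaN."""
    value = float(value_nm)
    if math.isnan(value):
        return None
    if value >= INACTIVE_MIN_NM:
        return "inactive"
    if value <= ACTIVE_MAX_NM:
        return "active"
    return "intermediate"


# Made-up values, including both boundaries and a missing entry
labels = [classify_ic50(v) for v in [1.4, 1000, 1200, 10000, float("nan")]]
```

With a DataFrame, the same function could be applied as `df4['IC50'].map(classify_ic50)`.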

### **Combine the 4 lists into a dataframe**

In [43]:
data_tuples = list(zip(mol_bdID, mol_smiles, bioactivity_class, mol_IC50))
df5 = pd.DataFrame( data_tuples,  columns=['mol_bdID', 'mol_smiles', 'bioactivity_class', 'mol_IC50'])
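
`zip` pairs the lists by position and silently stops at the shortest one, so the combination is only correct when all four lists have the same length and the same row order. The sketch below adds an explicit length check before building the frame. `combine_columns` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def combine_columns(ids, smiles, classes, ic50):
    """Build the combined frame, refusing lists of unequal length."""
    columns = {
        "mol_bdID": ids,
        "mol_smiles": smiles,
        "bioactivity_class": classes,
        "mol_IC50": ic50,
    }
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return pd.DataFrame(columns)


# Made-up rows for illustration
demo = combine_columns(
    [1, 2],
    ["CCO", "c1ccccc1"],
    ["active", "inactive"],
    [1.4, 20000.0],
)
```

A length check does not prove that the rows are in the same order; that still depends on both files having been written from the same SDF with the same molecules skipped.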

In [44]:
df5

Unnamed: 0,mol_bdID,mol_smiles,bioactivity_class,mol_IC50
0,50648002,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(C1CCCCC...,active,1.4
1,50648007,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(C1CCCCC...,active,1.7
2,50647999,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(Cc1cccc...,active,2.6
3,50648003,CC(C)CC(C)N1Cc2c(n(CC(=O)c3ccccc3)c3cc(-c4cccc...,active,3.0
4,51138741,CSc1cc(C)nc(SC)c1NC(=O)CN1CCN(CCSc2nc3ccccc3s2...,active,3.0
...,...,...,...,...
1698,50051759,CCCCCCCN(CCOCCSc1nc(-c2ccccc2)c(-c2ccccc2)[nH]...,active,190
1699,50004370,COc1ccc(-c2nc(SCCCCCN(Cc3ccccn3)C(=O)Nc3ccc(F)...,active,190
1700,50004374,CC(C)NC(=O)N(CCCCCSc1nc(-c2ccccc2)c(-c2ccccc2)...,active,190
1701,50004847,CCCCCCCCCCCCn1nnnc1C(NC(=O)c1cccc([N+](=O)[O-]...,active,190


In [50]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1703 entries, 0 to 1702
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   mol_bdID           1703 non-null   int64 
 1   mol_smiles         1703 non-null   object
 2   bioactivity_class  1703 non-null   object
 3   mol_IC50           1703 non-null   object
dtypes: int64(1), object(3)
memory usage: 53.3+ KB


In [54]:
df5['mol_IC50'] = df5['mol_IC50'].str.replace('>', '')

In [56]:
df5['mol_IC50'] = df5['mol_IC50'].str.replace('<', '')

In [58]:
df5['mol_IC50'] = df5['mol_IC50'].astype('float64')
print(df5['mol_IC50'].dtypes)

float64


## **Handling missing data**
If any compound has a missing value in the **IC50** column, drop it.

In [59]:
df6 = df5[df5.mol_IC50.notna()]
df6
# no missing data

Unnamed: 0,mol_bdID,mol_smiles,bioactivity_class,mol_IC50
0,50648002,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(C1CCCCC...,active,1.4
1,50648007,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(C1CCCCC...,active,1.7
2,50647999,O=C(Cn1c2c(c(=O)n3nc(-c4ccccc4)cc13)CN(Cc1cccc...,active,2.6
3,50648003,CC(C)CC(C)N1Cc2c(n(CC(=O)c3ccccc3)c3cc(-c4cccc...,active,3.0
4,51138741,CSc1cc(C)nc(SC)c1NC(=O)CN1CCN(CCSc2nc3ccccc3s2...,active,3.0
...,...,...,...,...
1698,50051759,CCCCCCCN(CCOCCSc1nc(-c2ccccc2)c(-c2ccccc2)[nH]...,active,190.0
1699,50004370,COc1ccc(-c2nc(SCCCCCN(Cc3ccccn3)C(=O)Nc3ccc(F)...,active,190.0
1700,50004374,CC(C)NC(=O)N(CCCCCSc1nc(-c2ccccc2)c(-c2ccccc2)...,active,190.0
1701,50004847,CCCCCCCCCCCCn1nnnc1C(NC(=O)c1cccc([N+](=O)[O-]...,active,190.0


In [60]:
df6.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1703 entries, 0 to 1702
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   mol_bdID           1703 non-null   int64  
 1   mol_smiles         1703 non-null   object 
 2   bioactivity_class  1703 non-null   object 
 3   mol_IC50           1703 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 66.5+ KB
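
The boolean-mask filter used above has a direct equivalent in `DataFrame.dropna(subset=...)`. The sketch below compares the two on a made-up frame and counts how many rows were removed, which is a quick way to confirm a "no missing data" observation. The `demo` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows for illustration; the last one has no IC50 value.
demo = pd.DataFrame({
    "mol_bdID": [1, 2, 3],
    "mol_IC50": [1.4, 190.0, float("nan")],
})

# Boolean-mask form, as in the cell above
kept_mask = demo[demo.mol_IC50.notna()]

# Equivalent dropna form; .copy() gives an independent frame
kept_dropna = demo.dropna(subset=["mol_IC50"]).copy()

# Number of rows removed because IC50 was missing
n_dropped = len(demo) - len(kept_dropna)
```

If `n_dropped` is zero on the real data, the filtered frame has the same rows as the input.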


In [62]:
df6.to_csv('final_output.csv', index=False)

In [63]:
df6.describe()

Unnamed: 0,mol_bdID,mol_IC50
count,1703.0,1703.0
mean,49756670.0,19435.4
std,6166758.0,82181.89
min,51064.0,1.4
25%,50073280.0,160.0
50%,50529970.0,1200.0
75%,50779900.0,10000.0
max,51346970.0,1140000.0


---

In [64]:
! ls -l

total 58216
-rw-r--r--@  1 akrikhalid  staff     16872 May 29 14:44 IC50.csv
-rw-r--r--   1 akrikhalid  staff   3041472 May 29 14:36 Output3_Sterol O-acyltransferase 1 inhibitors 1703.csv
drwxr-xr-x   2 akrikhalid  staff        64 May 29 14:22 PDFs
drwxr-xr-x  21 akrikhalid  staff       672 May 24 13:06 PaDel-Descriptor
-rw-r--r--@  1 akrikhalid  staff       231 May 24 10:18 PaDel-Descriptor.sh
-rw-r--r--@  1 akrikhalid  staff    211058 May 29 10:05 PaDel-Descriptor.zip
-rw-r--r--@  1 akrikhalid  staff    338359 May 29 16:14 Sterol O-acyltransferase 1 inhibitors 1703 Part 01.ipynb
-rw-r--r--@  1 akrikhalid  staff    250536 May 29 11:51 Sterol O-acyltransferase 1 inhibitors 1703 Part 02.ipynb
-rw-r--r--@  1 akrikhalid  staff     89689 May 29 10:53 Sterol O-acyltransferase 1 inhibitors 1703 Part 03.ipynb
-rw-r--r--@  1 akrikhalid  staff     76160 May 29 11:06 Sterol O-acyltransferase 1 inhibitors 1703 Part 04.ipynb
-rw-r--r--@  1 akrikhalid  staff  110723