# **CDD - CCR5: Download Bioactivity Data, Part 01**

Khalid El Akri

[*'Chem Code Professor' YouTube channel*](http://youtube.com/@chemcodeprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using BindingDB bioactivity data.

In **Part 01**, we will collect and pre-process data from the BindingDB database. Let's get started!

# Import and convert CCR5_INHIBITORS_DATA from SDF file to CSV file

In [1]:
from rdkit import Chem
import pandas as pd
from rdkit.Chem import PandasTools

# Read the SDF file into RDKit
suppl = Chem.SDMolSupplier('C-C chemokine receptor type 5_Inhibitors_3316.sdf')

# Convert each molecule to a dictionary
molecules = []
for mol in suppl:
    if mol is not None:
        molecules.append(mol.GetPropsAsDict(True))

# Convert the dictionaries to a DataFrame
df = pd.DataFrame(molecules)

# Write the DataFrame to a CSV file
df.to_csv('output_C-C chemokine receptor type 5_Inhibitors_3316.csv', index=False)
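
Each record is converted to its own dictionary, so a record that lacks a given SDF field simply has no key for it, and pandas fills that cell with NaN when the DataFrame is built. A minimal sketch of this behavior, using made-up records (the field names and values below are illustrative stand-ins, not taken from the BindingDB file):

```python
import pandas as pd

# Hypothetical property dictionaries, standing in for mol.GetPropsAsDict(True)
records = [
    {"_Name": "ligand_a", "IC50 (nM)": "12"},
    {"IC50 (nM)": ">10000"},                    # this record has no "_Name" key
    {"_Name": "ligand_c", "Ki (nM)": "0.24"},
]

toy = pd.DataFrame(records)

# The columns are the union of all keys; absent entries show up as NaN
print(toy)
print(toy.notna().sum())
```

This is worth keeping in mind when reading the non-null counts reported by `info()` further down.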

## 1. Let's check the output CSV file

In [2]:
df1 = pd.read_csv('output_C-C chemokine receptor type 5_Inhibitors_3316.csv')
df1

Unnamed: 0,_Name,_MolFileInfo,_MolFileComments,_MolFileChiralFlag,From,BindingDB Reactant_set_id,Ligand InChI,Ligand InChI Key,BindingDB MonomerID,BindingDB Ligand Name,...,UniProt (SwissProt) Recommended Name of Target Chain,UniProt (SwissProt) Entry Name of Target Chain,UniProt (SwissProt) Primary ID of Target Chain,UniProt (SwissProt) Secondary ID(s) of Target Chain,UniProt (SwissProt) Alternative ID(s) of Target Chain,UniProt (TrEMBL) Submitted Name of Target Chain,UniProt (TrEMBL) Entry Name of Target Chain,UniProt (TrEMBL) Primary ID of Target Chain,UniProt (TrEMBL) Secondary ID(s) of Target Chain,UniProt (TrEMBL) Alternative ID(s) of Target Chain
0,,Mrv0541 11211414262D,,0,www.bindingDB.org,50815122,InChI=1S/C34H41BrN4O2/c1-5-41-37-32(27-6-8-30(...,AKHBRWYRHRLXTB-FTTXPQLCSA-N,50134046,(4-{(4-Bromo-phenyl)-[(E)-ethoxyimino]-methyl}...,...,C-C chemokine receptor type 5,,P51681,O14692 O14693 O14695 O14696 O14697 O14698 O146...,,,,,,
1,,Mrv0541 11231414242D,,0,www.bindingDB.org,51209261,InChI=1S/C29H41F2N5O/c1-19(2)27-34-33-20(3)36(...,GSNHKUDZZFZSJB-QYOOZWMWSA-N,50334986,"4,4-Difluoro-cyclohexanecarboxylic acid {(S)-3...",...,C-C chemokine receptor type 5,,P61814,O02746 P79436 Q548Q9,,,,,,
2,,RDKit 2D,,0,www.bindingDB.org,51381873,InChI=1S/C29H41F2N5O/c1-19(2)27-34-33-20(3)36(...,GSNHKUDZZFZSJB-QUMGSSFMSA-N,50464147,CHEMBL256907,...,C-C chemokine receptor type 5,,P61814,O02746 P79436 Q548Q9,,,,,,
3,,Mrv0541 01171608472D,,0,www.bindingDB.org,304257,InChI=1S/C29H41F2N5O/c1-19(2)27-34-33-20(3)36(...,GSNHKUDZZFZSJB-ILVMPNSOSA-N,160935,"US10167299, Maraviroc::US9107954, maraviroc",...,C-C chemokine receptor type 5,,P51681,O14692 O14693 O14695 O14696 O14697 O14698 O146...,,,,,,
4,,Mrv0541 11211413102D,,0,www.bindingDB.org,50740010,InChI=1S/C26H31Br3N4O2/c1-3-35-31-24(18-4-6-20...,RHCLUWBDZMYHPR-QFMPWRQOSA-N,50124943,(4-{(4-Bromo-phenyl)-[(Z)-ethoxyimino]-methyl}...,...,C-C chemokine receptor type 5,,P51681,O14692 O14693 O14695 O14696 O14697 O14698 O146...,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3311,,Mrv0541 11211415462D,,1,www.bindingDB.org,50260998,InChI=1S/C41H51FN4O3/c1-4-29(3)40(41(47)48)45-...,HLZAJLBSYWFNOU-BYBKDXPSSA-N,50142022,"(2R,4R)-2-[(2S,3S)-3-{4-[5-(4-Benzyloxy-benzyl...",...,C-C chemokine receptor type 5,,P51681,O14692 O14693 O14695 O14696 O14697 O14698 O146...,,,,,,
3312,,Mrv0541 11211410322D,,0,www.bindingDB.org,50264307,InChI=1S/C26H34BrN3O/c1-19-5-4-6-20(2)24(19)25...,UWYRPZAYLGOTBK-UHFFFAOYSA-N,50104934,CHEMBL292625::[4-(4-Bromo-phenylamino)-4'-meth...,...,C-C chemokine receptor type 5,,P51681,O14692 O14693 O14695 O14696 O14697 O14698 O146...,,,,,,
3313,,Mrv0541 11211415462D,,1,www.bindingDB.org,50260929,InChI=1S/C36H46F4N4O3/c1-4-24(3)34(35(45)46)43...,BYGHXMKCHHHFAS-RKCOTYCJSA-N,50141966,"(2R,4R)-2-[(2S,3S)-3-(4-{2-Ethyl-5-[4-(2,2,2-t...",...,C-C chemokine receptor type 5,,P51681,O14692 O14693 O14695 O14696 O14697 O14698 O146...,,,,,,
3314,,Mrv0541 11211415462D,,1,www.bindingDB.org,50260935,InChI=1S/C36H46FN5O3/c1-5-24(3)35(36(43)44)41-...,JLQAWCSYGCHXFR-XWHDRLKKSA-N,50141963,"(2R,4S)-2-[(2S,3S)-3-{4-[5-(3-Cyano-4-methoxy-...",...,C-C chemokine receptor type 5,,P51681,O14692 O14693 O14695 O14696 O14697 O14698 O146...,,,,,,


In [3]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3316 entries, 0 to 3315
Data columns (total 54 columns):
 #   Column                                                     Non-Null Count  Dtype  
---  ------                                                     --------------  -----  
 0   _Name                                                      650 non-null    object 
 1   _MolFileInfo                                               3316 non-null   object 
 2   _MolFileComments                                           0 non-null      float64
 3   _MolFileChiralFlag                                         3316 non-null   int64  
 4   From                                                       3316 non-null   object 
 5   BindingDB Reactant_set_id                                  3316 non-null   int64  
 6   Ligand InChI                                               3308 non-null   object 
 7   Ligand InChI Key                                           3308 non-null   object 
 8   BindingD

In [4]:
df1.describe()

Unnamed: 0,_MolFileComments,_MolFileChiralFlag,BindingDB Reactant_set_id,BindingDB MonomerID,Kd (nM),kon (M-1-s-1),koff (s-1),pH,Temp (C),PMID,...,ChEBI ID of Ligand,IUPHAR_GRAC ID of Ligand,Number of Protein Chains in Target (,UniProt (SwissProt) Entry Name of Target Chain,UniProt (SwissProt) Alternative ID(s) of Target Chain,UniProt (TrEMBL) Submitted Name of Target Chain,UniProt (TrEMBL) Entry Name of Target Chain,UniProt (TrEMBL) Primary ID of Target Chain,UniProt (TrEMBL) Secondary ID(s) of Target Chain,UniProt (TrEMBL) Alternative ID(s) of Target Chain
count,0.0,3316.0,3316.0,3316.0,18.0,0.0,0.0,0.0,0.0,3044.0,...,15.0,37.0,3316.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,,0.228287,48802980.0,48281010.0,10723.038889,,,,,18383120.0,...,90891.133333,849.135135,1.0,,,,,,,
std,,0.419792,9353729.0,9698896.0,14027.405122,,,,,5069652.0,...,97417.153481,200.468363,0.0,,,,,,,
min,,0.0,207253.0,38924.0,1.6,,,,,10821720.0,...,7459.0,777.0,1.0,,,,,,,
25%,,0.0,50260920.0,50119330.0,13.5,,,,,15013000.0,...,30351.0,802.0,1.0,,,,,,,
50%,,0.0,50558750.0,50190520.0,4258.0,,,,,18267360.0,...,44975.0,803.0,1.0,,,,,,,
75%,,0.0,50924260.0,50337610.0,18175.0,,,,,21658960.0,...,155774.0,803.0,1.0,,,,,,,
max,,1.0,51444880.0,50583790.0,49000.0,,,,,34795860.0,...,245199.0,1676.0,1.0,,,,,,,


## There is no SMILES column - let's extract SMILES from the SDF file first

In [5]:
# Extract SMILES from the SDF file

# Load the SDF file
mols = Chem.SDMolSupplier('C-C chemokine receptor type 5_Inhibitors_3316.sdf')

# Extract the name and SMILES string of each molecule
data = []
for mol in mols:
    if mol is not None:
        name = mol.GetProp('_Name')
        smiles = Chem.MolToSmiles(mol)
        data.append({'Name': name, 'SMILES': smiles})

# Save the data to a CSV file using pandas (pd was already imported in the first cell)
df2 = pd.DataFrame(data)
df2.to_csv('output2_C-C chemokine receptor type 5_Inhibitors_3316.csv', index=False)

In [7]:
df2 = pd.read_csv('output2_C-C chemokine receptor type 5_Inhibitors_3316.csv')
df2

Unnamed: 0,Name,SMILES
0,,CCO/N=C(/c1ccc(Br)cc1)C1CCN(C2(C)CCN(C(=O)c3c(...
1,,Cc1nnc(C(C)C)n1[C@@H]1C[C@H]2CC[C@@H](C1)N2CC[...
2,,Cc1nnc(C(C)C)n1[C@H]1C[C@H]2CC[C@@H](C1)N2CC[C...
3,,Cc1nnc(C(C)C)n1C1CC2CCC(C1)N2CC[C@H](NC(=O)C1C...
4,,CCO/N=C(\c1ccc(Br)cc1)C1CCN(C2(C)CCN(C(=O)c3c(...
...,...,...
3311,,CC[C@@H](C)[C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(C...
3312,,Cc1cccc(C)c1C(=O)N1CCC(C)(N2CCC(Nc3ccc(Br)cc3)...
3313,,CC[C@@H](C)[C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(C...
3314,,CC[C@H](C)[C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc...


In [8]:
# Drop the Name column, which contains NaN values

df2 = df2.drop(["Name"], axis=1)
df2

Unnamed: 0,SMILES
0,CCO/N=C(/c1ccc(Br)cc1)C1CCN(C2(C)CCN(C(=O)c3c(...
1,Cc1nnc(C(C)C)n1[C@@H]1C[C@H]2CC[C@@H](C1)N2CC[...
2,Cc1nnc(C(C)C)n1[C@H]1C[C@H]2CC[C@@H](C1)N2CC[C...
3,Cc1nnc(C(C)C)n1C1CC2CCC(C1)N2CC[C@H](NC(=O)C1C...
4,CCO/N=C(\c1ccc(Br)cc1)C1CCN(C2(C)CCN(C(=O)c3c(...
...,...
3311,CC[C@@H](C)[C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(C...
3312,Cc1cccc(C)c1C(=O)N1CCC(C)(N2CCC(Nc3ccc(Br)cc3)...
3313,CC[C@@H](C)[C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(C...
3314,CC[C@H](C)[C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(Cc...


## There is no IC50 column - let's convert the TSV file to CSV and extract IC50

In [10]:
import pandas as pd

# read TSV file
df3 = pd.read_csv('C-C chemokine receptor type 5_Inhibitors_3316.tsv', sep='\t')

# write CSV file
df3.to_csv('Output3_C-C chemokine receptor type 5_Inhibitors_3316.csv', index=False)
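
The BindingDB export is wide (99 columns in this file), while the rest of this notebook uses only a few of them. As a sketch, `usecols` can restrict the read to the columns of interest. The inline TSV text below is a made-up stand-in for the real file, and the choice of columns is an assumption for illustration:

```python
import io
import pandas as pd

# Tiny made-up TSV standing in for the BindingDB export
tsv_text = (
    "BindingDB Reactant_set_id\tLigand SMILES\tKi (nM)\tIC50 (nM)\tpH\n"
    "1\tCCO\t0.24\t\t7.4\n"
    "2\tCCN\t\t>10000\t\n"
)

wanted = ["BindingDB Reactant_set_id", "Ligand SMILES", "IC50 (nM)"]

# sep="\t" parses tab-separated text; usecols keeps only the listed columns
small = pd.read_csv(io.StringIO(tsv_text), sep="\t", usecols=wanted)
print(small)
```

With a real file, `io.StringIO(tsv_text)` would be replaced by the file path.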

In [11]:
df3 = pd.read_csv('Output3_C-C chemokine receptor type 5_Inhibitors_3316.csv')
df3

Unnamed: 0,BindingDB Reactant_set_id,Ligand SMILES,Ligand InChI,Ligand InChI Key,BindingDB MonomerID,BindingDB Ligand Name,Target Name,Target Source Organism According to Curator or DataSource,Ki (nM),IC50 (nM),...,C-C chemokine receptor type 5.1,Unnamed: 90,P51681,O14692 O14693 O14695 O14696 O14697 O14698 O14699 O14700 O14701 O14702 O14703 O14704 O14705 O14706 O14707 O14708 O15538 Q9UPA4,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98
0,51209261,CC(C)c1nnc(C)n1[C@H]1C[C@@H]2CC[C@H](C1)N2CC[C...,InChI=1S/C29H41F2N5O/c1-19(2)27-34-33-20(3)36(...,GSNHKUDZZFZSJB-QYOOZWMWSA-N,50334986,"4,4-Difluoro-cyclohexanecarboxylic acid {(S)-3...",C-C chemokine receptor type 5,Macaca fascicularis,0.240000,,...,,,,,,,,,,
1,51381873,CC(C)c1nnc(C)n1[C@@H]1C[C@@H]2CC[C@H](C1)N2CC[...,InChI=1S/C29H41F2N5O/c1-19(2)27-34-33-20(3)36(...,GSNHKUDZZFZSJB-QUMGSSFMSA-N,50464147,CHEMBL256907,C-C chemokine receptor type 5,Macaca fascicularis,0.240000,,...,,,,,,,,,,
2,304257,CC(C)c1nnc(C)n1C1CC2CCC(C1)N2CC[C@H](NC(=O)C1C...,InChI=1S/C29H41F2N5O/c1-19(2)27-34-33-20(3)36(...,GSNHKUDZZFZSJB-ILVMPNSOSA-N,160935,"US10167299, Maraviroc::US9107954, maraviroc",C-C chemokine receptor type 5,,0.24,,...,,,,,,,,,,
3,50740010,CCO\N=C(\C1CCN(CC1)C1(C)CCN(CC1)C(=O)c1c(Br)cn...,InChI=1S/C26H31Br3N4O2/c1-3-35-31-24(18-4-6-20...,RHCLUWBDZMYHPR-QFMPWRQOSA-N,50124943,(4-{(4-Bromo-phenyl)-[(Z)-ethoxyimino]-methyl}...,C-C chemokine receptor type 5,,0.300000,,...,,,,,,,,,,
4,50815144,CCO\N=C(/C1CCN(CC1)C1(C)CCN(CC1)C(=O)c1c(Br)cn...,InChI=1S/C26H31Br3N4O2/c1-3-35-31-24(18-4-6-20...,RHCLUWBDZMYHPR-QLTSDVKISA-N,50134054,(4-{(4-Bromo-phenyl)-[(E)-ethoxyimino]-methyl}...,C-C chemokine receptor type 5,,0.3,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3310,50260998,CC[C@@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c2cc(Cc3...,InChI=1S/C41H51FN4O3/c1-4-29(3)40(41(47)48)45-...,HLZAJLBSYWFNOU-BYBKDXPSSA-N,50142022,"(2R,4R)-2-[(2S,3S)-3-{4-[5-(4-Benzyloxy-benzyl...",C-C chemokine receptor type 5,,,0.500000,...,,,,,,,,,,
3311,50264307,Cc1cccc(C)c1C(=O)N1CCC(C)(CC1)N1CCC(CC1)Nc1ccc...,InChI=1S/C26H34BrN3O/c1-19-5-4-6-20(2)24(19)25...,UWYRPZAYLGOTBK-UHFFFAOYSA-N,50104934,CHEMBL292625::[4-(4-Bromo-phenylamino)-4'-meth...,C-C chemokine receptor type 5,,,0.5,...,,,,,,,,,,
3312,50260929,CC[C@@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c2cc(Cc3...,InChI=1S/C36H46F4N4O3/c1-4-24(3)34(35(45)46)43...,BYGHXMKCHHHFAS-RKCOTYCJSA-N,50141966,"(2R,4R)-2-[(2S,3S)-3-(4-{2-Ethyl-5-[4-(2,2,2-t...",C-C chemokine receptor type 5,,,0.500000,...,,,,,,,,,,
3313,50260935,CC[C@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c2cc(Cc3c...,InChI=1S/C36H46FN5O3/c1-5-24(3)35(36(43)44)41-...,JLQAWCSYGCHXFR-XWHDRLKKSA-N,50141963,"(2R,4S)-2-[(2S,3S)-3-{4-[5-(3-Cyano-4-methoxy-...",C-C chemokine receptor type 5,,,0.500000,...,,,,,,,,,,


In [12]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3315 entries, 0 to 3314
Data columns (total 99 columns):
 #   Column                                                                                                                                                                                                                                                                                                                                                            Non-Null Count  Dtype  
---  ------                                                                                                                                                                                                                                                                                                                                                            --------------  -----  
 0   BindingDB Reactant_set_id                                                                                                 

In [13]:
df3.describe()

Unnamed: 0,BindingDB Reactant_set_id,BindingDB MonomerID,Kd (nM),kon (M-1-s-1),koff (s-1),pH,Temp (C),PMID,PubChem CID,PubChem SID,...,C-C chemokine receptor type 5.1,Unnamed: 90,P51681,O14692 O14693 O14695 O14696 O14697 O14698 O14699 O14700 O14701 O14702 O14703 O14704 O14705 O14706 O14707 O14708 O15538 Q9UPA4,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98
count,3315.0,3315.0,18.0,0.0,0.0,0.0,0.0,3043.0,3315.0,3315.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mean,48802380.0,48280460.0,10723.038889,,,,,18384390.0,34704020.0,146428300.0,...,,,,,,,,,,
std,9355075.0,9700306.0,14027.405122,,,,,5070002.0,37548140.0,90463490.0,...,,,,,,,,,,
min,207253.0,38924.0,1.6,,,,,10821720.0,1318.0,103915300.0,...,,,,,,,,,,
25%,50260920.0,50119330.0,13.5,,,,,15013000.0,3008915.0,104017200.0,...,,,,,,,,,,
50%,50558750.0,50190520.0,4258.0,,,,,18267360.0,21064340.0,104084000.0,...,,,,,,,,,,
75%,50924260.0,50337610.0,18175.0,,,,,21658960.0,52942070.0,136963500.0,...,,,,,,,,,,
max,51444880.0,50583790.0,49000.0,,,,,34795860.0,166636600.0,476025100.0,...,,,,,,,,,,


In [14]:
df4 = df3.rename(columns={"BindingDB Reactant_set_id": "bdID"})
df4

Unnamed: 0,bdID,Ligand SMILES,Ligand InChI,Ligand InChI Key,BindingDB MonomerID,BindingDB Ligand Name,Target Name,Target Source Organism According to Curator or DataSource,Ki (nM),IC50 (nM),...,C-C chemokine receptor type 5.1,Unnamed: 90,P51681,O14692 O14693 O14695 O14696 O14697 O14698 O14699 O14700 O14701 O14702 O14703 O14704 O14705 O14706 O14707 O14708 O15538 Q9UPA4,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98
0,51209261,CC(C)c1nnc(C)n1[C@H]1C[C@@H]2CC[C@H](C1)N2CC[C...,InChI=1S/C29H41F2N5O/c1-19(2)27-34-33-20(3)36(...,GSNHKUDZZFZSJB-QYOOZWMWSA-N,50334986,"4,4-Difluoro-cyclohexanecarboxylic acid {(S)-3...",C-C chemokine receptor type 5,Macaca fascicularis,0.240000,,...,,,,,,,,,,
1,51381873,CC(C)c1nnc(C)n1[C@@H]1C[C@@H]2CC[C@H](C1)N2CC[...,InChI=1S/C29H41F2N5O/c1-19(2)27-34-33-20(3)36(...,GSNHKUDZZFZSJB-QUMGSSFMSA-N,50464147,CHEMBL256907,C-C chemokine receptor type 5,Macaca fascicularis,0.240000,,...,,,,,,,,,,
2,304257,CC(C)c1nnc(C)n1C1CC2CCC(C1)N2CC[C@H](NC(=O)C1C...,InChI=1S/C29H41F2N5O/c1-19(2)27-34-33-20(3)36(...,GSNHKUDZZFZSJB-ILVMPNSOSA-N,160935,"US10167299, Maraviroc::US9107954, maraviroc",C-C chemokine receptor type 5,,0.24,,...,,,,,,,,,,
3,50740010,CCO\N=C(\C1CCN(CC1)C1(C)CCN(CC1)C(=O)c1c(Br)cn...,InChI=1S/C26H31Br3N4O2/c1-3-35-31-24(18-4-6-20...,RHCLUWBDZMYHPR-QFMPWRQOSA-N,50124943,(4-{(4-Bromo-phenyl)-[(Z)-ethoxyimino]-methyl}...,C-C chemokine receptor type 5,,0.300000,,...,,,,,,,,,,
4,50815144,CCO\N=C(/C1CCN(CC1)C1(C)CCN(CC1)C(=O)c1c(Br)cn...,InChI=1S/C26H31Br3N4O2/c1-3-35-31-24(18-4-6-20...,RHCLUWBDZMYHPR-QLTSDVKISA-N,50134054,(4-{(4-Bromo-phenyl)-[(E)-ethoxyimino]-methyl}...,C-C chemokine receptor type 5,,0.3,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3310,50260998,CC[C@@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c2cc(Cc3...,InChI=1S/C41H51FN4O3/c1-4-29(3)40(41(47)48)45-...,HLZAJLBSYWFNOU-BYBKDXPSSA-N,50142022,"(2R,4R)-2-[(2S,3S)-3-{4-[5-(4-Benzyloxy-benzyl...",C-C chemokine receptor type 5,,,0.500000,...,,,,,,,,,,
3311,50264307,Cc1cccc(C)c1C(=O)N1CCC(C)(CC1)N1CCC(CC1)Nc1ccc...,InChI=1S/C26H34BrN3O/c1-19-5-4-6-20(2)24(19)25...,UWYRPZAYLGOTBK-UHFFFAOYSA-N,50104934,CHEMBL292625::[4-(4-Bromo-phenylamino)-4'-meth...,C-C chemokine receptor type 5,,,0.5,...,,,,,,,,,,
3312,50260929,CC[C@@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c2cc(Cc3...,InChI=1S/C36H46F4N4O3/c1-4-24(3)34(35(45)46)43...,BYGHXMKCHHHFAS-RKCOTYCJSA-N,50141966,"(2R,4R)-2-[(2S,3S)-3-(4-{2-Ethyl-5-[4-(2,2,2-t...",C-C chemokine receptor type 5,,,0.500000,...,,,,,,,,,,
3313,50260935,CC[C@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c2cc(Cc3c...,InChI=1S/C36H46FN5O3/c1-5-24(3)35(36(43)44)41-...,JLQAWCSYGCHXFR-XWHDRLKKSA-N,50141963,"(2R,4S)-2-[(2S,3S)-3-{4-[5-(3-Cyano-4-methoxy-...",C-C chemokine receptor type 5,,,0.500000,...,,,,,,,,,,


In [15]:
df5 = df4.rename(columns={"IC50 (nM)": "IC50"})
df5

Unnamed: 0,bdID,Ligand SMILES,Ligand InChI,Ligand InChI Key,BindingDB MonomerID,BindingDB Ligand Name,Target Name,Target Source Organism According to Curator or DataSource,Ki (nM),IC50,...,C-C chemokine receptor type 5.1,Unnamed: 90,P51681,O14692 O14693 O14695 O14696 O14697 O14698 O14699 O14700 O14701 O14702 O14703 O14704 O14705 O14706 O14707 O14708 O15538 Q9UPA4,Unnamed: 93,Unnamed: 94,Unnamed: 95,Unnamed: 96,Unnamed: 97,Unnamed: 98
0,51209261,CC(C)c1nnc(C)n1[C@H]1C[C@@H]2CC[C@H](C1)N2CC[C...,InChI=1S/C29H41F2N5O/c1-19(2)27-34-33-20(3)36(...,GSNHKUDZZFZSJB-QYOOZWMWSA-N,50334986,"4,4-Difluoro-cyclohexanecarboxylic acid {(S)-3...",C-C chemokine receptor type 5,Macaca fascicularis,0.240000,,...,,,,,,,,,,
1,51381873,CC(C)c1nnc(C)n1[C@@H]1C[C@@H]2CC[C@H](C1)N2CC[...,InChI=1S/C29H41F2N5O/c1-19(2)27-34-33-20(3)36(...,GSNHKUDZZFZSJB-QUMGSSFMSA-N,50464147,CHEMBL256907,C-C chemokine receptor type 5,Macaca fascicularis,0.240000,,...,,,,,,,,,,
2,304257,CC(C)c1nnc(C)n1C1CC2CCC(C1)N2CC[C@H](NC(=O)C1C...,InChI=1S/C29H41F2N5O/c1-19(2)27-34-33-20(3)36(...,GSNHKUDZZFZSJB-ILVMPNSOSA-N,160935,"US10167299, Maraviroc::US9107954, maraviroc",C-C chemokine receptor type 5,,0.24,,...,,,,,,,,,,
3,50740010,CCO\N=C(\C1CCN(CC1)C1(C)CCN(CC1)C(=O)c1c(Br)cn...,InChI=1S/C26H31Br3N4O2/c1-3-35-31-24(18-4-6-20...,RHCLUWBDZMYHPR-QFMPWRQOSA-N,50124943,(4-{(4-Bromo-phenyl)-[(Z)-ethoxyimino]-methyl}...,C-C chemokine receptor type 5,,0.300000,,...,,,,,,,,,,
4,50815144,CCO\N=C(/C1CCN(CC1)C1(C)CCN(CC1)C(=O)c1c(Br)cn...,InChI=1S/C26H31Br3N4O2/c1-3-35-31-24(18-4-6-20...,RHCLUWBDZMYHPR-QLTSDVKISA-N,50134054,(4-{(4-Bromo-phenyl)-[(E)-ethoxyimino]-methyl}...,C-C chemokine receptor type 5,,0.3,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3310,50260998,CC[C@@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c2cc(Cc3...,InChI=1S/C41H51FN4O3/c1-4-29(3)40(41(47)48)45-...,HLZAJLBSYWFNOU-BYBKDXPSSA-N,50142022,"(2R,4R)-2-[(2S,3S)-3-{4-[5-(4-Benzyloxy-benzyl...",C-C chemokine receptor type 5,,,0.500000,...,,,,,,,,,,
3311,50264307,Cc1cccc(C)c1C(=O)N1CCC(C)(CC1)N1CCC(CC1)Nc1ccc...,InChI=1S/C26H34BrN3O/c1-19-5-4-6-20(2)24(19)25...,UWYRPZAYLGOTBK-UHFFFAOYSA-N,50104934,CHEMBL292625::[4-(4-Bromo-phenylamino)-4'-meth...,C-C chemokine receptor type 5,,,0.5,...,,,,,,,,,,
3312,50260929,CC[C@@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c2cc(Cc3...,InChI=1S/C36H46F4N4O3/c1-4-24(3)34(35(45)46)43...,BYGHXMKCHHHFAS-RKCOTYCJSA-N,50141966,"(2R,4R)-2-[(2S,3S)-3-(4-{2-Ethyl-5-[4-(2,2,2-t...",C-C chemokine receptor type 5,,,0.500000,...,,,,,,,,,,
3313,50260935,CC[C@H](C)[C@@H](N1C[C@H](CN2CCC(CC2)c2cc(Cc3c...,InChI=1S/C36H46FN5O3/c1-5-24(3)35(36(43)44)41-...,JLQAWCSYGCHXFR-XWHDRLKKSA-N,50141963,"(2R,4S)-2-[(2S,3S)-3-{4-[5-(3-Cyano-4-methoxy-...",C-C chemokine receptor type 5,,,0.500000,...,,,,,,,,,,


In [16]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3315 entries, 0 to 3314
Data columns (total 99 columns):
 #   Column                                                                                                                                                                                                                                                                                                                                                            Non-Null Count  Dtype  
---  ------                                                                                                                                                                                                                                                                                                                                                            --------------  -----  
 0   bdID                                                                                                                      

## The IC50 column has dtype object rather than float64 - let's clean and convert it

In [17]:
print(df5['IC50'])

0             NaN
1             NaN
2             NaN
3             NaN
4             NaN
          ...    
3310     0.500000
3311          0.5
3312     0.500000
3313     0.500000
3314     0.500000
Name: IC50, Length: 3315, dtype: object


In [20]:
df5['IC50'] = df5['IC50'].str.replace('>', '')

In [21]:
df5['IC50'] = df5['IC50'].str.replace('<', '')

In [22]:
df5['IC50'] = df5['IC50'].astype('float64')
print(df5['IC50'].dtypes)

float64
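
The two `str.replace` calls and the `astype` conversion can also be written as a single step. The sketch below runs on made-up values and uses `pd.to_numeric` with `errors="coerce"`, which turns anything it cannot parse into NaN instead of raising an error:

```python
import pandas as pd

# Made-up raw values mimicking an object-dtype IC50 column
raw = pd.Series(["0.500000", ">10000", "<0.1", None, " 30 "], dtype="object")

# Remove the '<' / '>' qualifiers and surrounding spaces, then parse as numbers;
# entries that still cannot be parsed become NaN
cleaned = pd.to_numeric(
    raw.str.replace(r"[<>]", "", regex=True).str.strip(),
    errors="coerce",
)
print(cleaned)
```

Note that stripping the qualifier discards the censoring information: a reported value of `>10000` is stored as exactly 10000.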


### **Collect the *SMILES* into a list**

In [23]:
# tolist() builds the same list as an explicit append loop
mol_smiles = df2.SMILES.tolist()

### **Collect the *bdID* values into a list**

In [24]:
mol_bdID = df4.bdID.tolist()

### **Collect the *IC50 (nM)* values into a list**

In [25]:
mol_IC50 = df5.IC50.tolist()

### **Labeling compounds as active, inactive, or intermediate**
The bioactivity data are IC50 values in nM. Compounds with an IC50 of 1,000 nM or less are labeled **active**, those with an IC50 of 10,000 nM or more are labeled **inactive**, and those strictly between 1,000 and 10,000 nM are labeled **intermediate**. Rows with a missing IC50 fail both comparisons and therefore also receive the **intermediate** label at this stage; they are removed in the missing-data step later in this notebook.

In [26]:
bioactivity_class = []
for i in df5.IC50:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
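
The same thresholds can be wrapped in a small helper that treats a missing value explicitly. This is only a sketch: `classify_ic50` is a hypothetical name, and returning `None` for missing values is one possible choice among several.

```python
import math

def classify_ic50(ic50_nm):
    """Label an IC50 value (in nM) using the thresholds from this notebook."""
    value = float(ic50_nm)
    if math.isnan(value):
        return None              # missing measurement: no class assigned
    if value >= 10000:
        return "inactive"
    if value <= 1000:
        return "active"
    return "intermediate"

labels = [classify_ic50(v) for v in [0.5, 1000, 4258.0, 10000, float("nan")]]
print(labels)
```

Applied to a column, this could be used as `df5.IC50.map(classify_ic50)`.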

### **Combine the 4 lists into a dataframe**

In [27]:
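# zip() pairs the four lists by position, so they must share the same row order and length.
# Note: mol_smiles comes from the SDF file, while mol_bdID and mol_IC50 come from the TSV file.
data_tuples = list(zip(mol_bdID, mol_smiles, bioactivity_class, mol_IC50))
df6 = pd.DataFrame(data_tuples, columns=['mol_bdID', 'mol_smiles', 'bioactivity_class', 'mol_IC50'])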

In [28]:
df6

Unnamed: 0,mol_bdID,mol_smiles,bioactivity_class,mol_IC50
0,51209261,CCO/N=C(/c1ccc(Br)cc1)C1CCN(C2(C)CCN(C(=O)c3c(...,intermediate,
1,51381873,Cc1nnc(C(C)C)n1[C@@H]1C[C@H]2CC[C@@H](C1)N2CC[...,intermediate,
2,304257,Cc1nnc(C(C)C)n1[C@H]1C[C@H]2CC[C@@H](C1)N2CC[C...,intermediate,
3,50740010,Cc1nnc(C(C)C)n1C1CC2CCC(C1)N2CC[C@H](NC(=O)C1C...,intermediate,
4,50815144,CCO/N=C(\c1ccc(Br)cc1)C1CCN(C2(C)CCN(C(=O)c3c(...,intermediate,
...,...,...,...,...
3310,50260998,CCn1nc(Cc2ccc(Oc3ccccc3)cc2)cc1C1CCN(C[C@H]2CN...,active,0.5
3311,50264307,CC[C@@H](C)[C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(C...,active,0.5
3312,50260929,Cc1cccc(C)c1C(=O)N1CCC(C)(N2CCC(Nc3ccc(Br)cc3)...,active,0.5
3313,50260935,CC[C@@H](C)[C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(C...,active,0.5
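
Because `zip` pairs items purely by position, the rows are only correct if the SDF-derived and TSV-derived tables are in the same order. The previews above suggest they are not: `df1` starts with Reactant_set_id 50815122 while `df3` starts with 51209261, and the row counts differ (3316 versus 3315). A more robust alternative is to join on a shared identifier. The sketch below uses made-up tables; the column names mirror this notebook, and it assumes the identifier is available on both sides (for the SDF, that would mean keeping its `BindingDB Reactant_set_id` field next to each SMILES).

```python
import pandas as pd

# Made-up stand-in: SMILES keyed by ID, as if extracted from the SDF file
smiles_df = pd.DataFrame({
    "bdID": [101, 102, 103],
    "SMILES": ["CCO", "CCN", "c1ccccc1"],
})

# Made-up stand-in: activities keyed by ID, in a different order and one row short
activity_df = pd.DataFrame({
    "bdID": [103, 101],
    "IC50": [30.0, 0.5],
})

# Join on the identifier instead of relying on row position;
# validate raises an error if an ID is duplicated on either side
merged = smiles_df.merge(activity_df, on="bdID", how="inner", validate="one_to_one")
print(merged)
```

Each SMILES ends up next to the IC50 measured for the same ID, regardless of how the two inputs were ordered.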


## **Handling missing data**
If a compound has a missing value in the **IC50** column, drop it.

In [29]:
df7 = df6[df6.mol_IC50.notna()]
df7
# rows with a missing IC50 have been removed

Unnamed: 0,mol_bdID,mol_smiles,bioactivity_class,mol_IC50
49,50853863,Cc1ccnc(C)c1C(=O)N1CCC(C)(N2CCN([C@@H](C)c3ccc...,active,0.5
50,50961778,Cc1cccc(C)c1C(=O)N1CCC(C)(N2CCC(N(c3ccccc3)c3c...,active,0.5
51,50448211,Cc1cc(Cl)nc(C)c1C(=O)NCC[C@@H](C)N1CCC(N(Cc2cc...,active,0.5
52,50417247,Cc1cccc(C)c1C(=O)N1CCC(C)(N2CCC(N(Cc3ccccc3)C(...,active,0.5
53,50260844,O=C(O)[C@@H](CC1CCC1)N1C[C@H](CN2CCC(CCCc3cccc...,active,0.5
...,...,...,...,...
3310,50260998,CCn1nc(Cc2ccc(Oc3ccccc3)cc2)cc1C1CCN(C[C@H]2CN...,active,0.5
3311,50264307,CC[C@@H](C)[C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(C...,active,0.5
3312,50260929,Cc1cccc(C)c1C(=O)N1CCC(C)(N2CCC(Nc3ccc(Br)cc3)...,active,0.5
3313,50260935,CC[C@@H](C)[C@H](C(=O)O)N1C[C@H](CN2CCC(c3cc(C...,active,0.5
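
An equivalent way to express this filter is `dropna` restricted to the IC50 column. Adding `reset_index(drop=True)` renumbers the surviving rows from 0, which avoids an index that starts at 49 as in the output above. A sketch on made-up data:

```python
import pandas as pd

# Made-up table with two missing IC50 values
toy = pd.DataFrame({
    "mol_bdID": [1, 2, 3, 4],
    "mol_IC50": [float("nan"), 0.5, float("nan"), 450.0],
})

# Drop rows whose IC50 is missing, then renumber the index from 0
kept = toy.dropna(subset=["mol_IC50"]).reset_index(drop=True)
print(kept)
```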


In [30]:
df7.to_csv('final_output.csv', index=False)

In [31]:
df7.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2946 entries, 49 to 3314
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   mol_bdID           2946 non-null   int64  
 1   mol_smiles         2946 non-null   object 
 2   bioactivity_class  2946 non-null   object 
 3   mol_IC50           2946 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 115.1+ KB


In [32]:
df7.describe()

Unnamed: 0,mol_bdID,mol_IC50
count,2946.0,2946.0
mean,48710390.0,20962.6
std,9573186.0,327913.9
min,304261.0,0.003
25%,50260970.0,3.7
50%,50547950.0,30.0
75%,50945680.0,450.0
max,51444880.0,10000000.0


In [33]:
! ls -l

total 82824
-rw-r--r--@ 1 akrikhalid  staff  22654982 May 29 20:39 C-C chemokine receptor type 5_Inhibitors_3316.sdf
-rw-r--r--@ 1 akrikhalid  staff   5861006 May 29 20:38 C-C chemokine receptor type 5_Inhibitors_3316.tsv
-rw-r--r--@ 1 akrikhalid  staff    238910 Jun  5 22:53 CCR5_inhibitors 3316_Part 01.ipynb
-rw-r--r--@ 1 akrikhalid  staff    250536 May 29 11:51 CCR5_inhibitors 3316_Part 02.ipynb
-rw-r--r--@ 1 akrikhalid  staff     89689 May 29 10:53 CCR5_inhibitors 3316_Part 03.ipynb
-rw-r--r--@ 1 akrikhalid  staff     76160 May 29 11:06 CCR5_inhibitors 3316_Part 04.ipynb
-rw-r--r--@ 1 akrikhalid  staff    563191 May 29 11:20 CCR5_inhibitors 3316_Part_05.ipynb
-rw-r--r--  1 akrikhalid  staff   6052996 Jun  5 22:49 Output3_C-C chemokine receptor type 5_Inhibitors_3316.csv
-rw-r--r--  1 akrikhalid  staff    275282 Jun  5 22:53 final_output.csv
-rw-r--r--  1 akrikhalid  staff    240163 Jun  5 22:48 output2_C-C chemokine receptor type 5_Inhibitors_3316.csv
-rw-r--r--  1 akrik