# **Bioinformatics Project - Computational Drug Discovery [Part 1] Download Bioactivity Data (Concise version)**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using the ChEMBL bioactivity data.

In **Part 1**, we will collect and pre-process data from the ChEMBL Database.

Notes for this concise version:
* Redundant code cells were deleted.
* Code cells for saving files to Google Drive have been deleted.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading https://files.pythonhosted.org/packages/2e/48/0db29040c92726fcc6f99a5bc89e0ea8cf5a9d84753ebaaf53108792da2a/chembl-webresource-client-0.10.2.tar.gz (51kB)
     |████████████████████████████████| 61kB 2.0MB/s
Collecting requests-cache>=0.4.7
  Downloading https://files.pythonhosted.org/packages/7f/55/9b1c40eb83c16d8fc79c5f6c2ffade04208b080670fbfc35e0a5effb5a92/requests_cache-0.5.2-py2.py3-none-any.whl
Building wheels for collected packages: chembl-webresource-client
  Building wheel for chembl-webresource-client (setup.py) ... done
  Created wheel for chembl-webresource-client: filename=chembl_webresource_client-0.10.2-cp36-none-any.whl size=55661 sha256=b18d21e5bd994b267030b3ba05b8ac44afd618392942d9d2d1fe893b2ea5632b
  Stored in directory: /root/.cache/pip/wheels/e6/96/19/3f042bfda7c669bfe24c8278477f57b0fbbf3e488b4c09e3a8
Successfully built chembl-webresource-client
Installing collected packages: requests-cache, ch

## **Importing libraries**

In [0]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for aromatase**

In [3]:
# Target search for aromatase
target = new_client.target
target_query = target.search('aromatase')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P11511', 'xref_name': None, 'xre...",Homo sapiens,Cytochrome P450 19A1,19.0,False,CHEMBL1978,"[{'accession': 'P11511', 'component_descriptio...",SINGLE PROTEIN,9606
1,"[{'xref_id': 'P22443', 'xref_name': None, 'xre...",Rattus norvegicus,Cytochrome P450 19A1,19.0,False,CHEMBL3859,"[{'accession': 'P22443', 'component_descriptio...",SINGLE PROTEIN,10116


### **Select and retrieve bioactivity data for *Human Aromatase* (first entry)**

We will assign the first entry (which corresponds to the target protein, *Human Aromatase*) to the ***selected_target*** variable.

In [4]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL1978'
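
Selecting by position works here because the human entry happens to come first in the search results. As a minimal sketch, assuming the `organism` and `target_type` columns shown in the table above, the target could instead be picked by filtering on those columns. The `toy_targets` frame and the `pick_target` helper are hypothetical stand-ins for illustration and are not part of the ChEMBL client.

```python
import pandas as pd

# Toy stand-in for the `targets` DataFrame above (only the columns used here).
toy_targets = pd.DataFrame({
    "organism": ["Homo sapiens", "Rattus norvegicus"],
    "pref_name": ["Cytochrome P450 19A1", "Cytochrome P450 19A1"],
    "target_chembl_id": ["CHEMBL1978", "CHEMBL3859"],
    "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN"],
})


def pick_target(targets, organism="Homo sapiens", target_type="SINGLE PROTEIN"):
    """Hypothetical helper: ChEMBL ID of the first row matching organism and type."""
    mask = (targets["organism"] == organism) & (targets["target_type"] == target_type)
    matches = targets.loc[mask, "target_chembl_id"]
    if matches.empty:
        raise ValueError(f"no {target_type} target found for {organism}")
    return matches.iloc[0]


selected = pick_target(toy_targets)
print(selected)
```

Filtering by column value keeps the selection stable even if the order of the search results changes between ChEMBL releases.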

Here, we will retrieve only bioactivity data for *Human Aromatase* (CHEMBL1978) that are reported as IC50 values.

In [0]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [0]:
df = pd.DataFrame.from_dict(res)

In [7]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,82585,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,,,CHEMBL1133116,J. Med. Chem.,2000.0,"{'bei': '15.62', 'le': '0.29', 'lle': '0.86', ...",CHEMBL341591,,CHEMBL341591,5.15,False,http://www.openphacts.org/units/Nanomolar,267172,=,1,True,=,,IC50,nM,,7100.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.1
1,,94540,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...,,,CHEMBL1133116,J. Med. Chem.,2000.0,,CHEMBL2111947,,CHEMBL2111947,,False,http://www.openphacts.org/units/Nanomolar,267163,>,1,True,>,,IC50,nM,,50000.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,50.0
2,,112960,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,,,CHEMBL1131917,Bioorg. Med. Chem. Lett.,1999.0,"{'bei': '16.06', 'le': '0.35', 'lle': '0.91', ...",CHEMBL431859,,CHEMBL431859,6.62,False,http://www.openphacts.org/units/Nanomolar,214178,=,1,True,=,,IC50,nM,,238.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.238
3,,116766,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,,,CHEMBL1131917,Bioorg. Med. Chem. Lett.,1999.0,"{'bei': '22.68', 'le': '0.41', 'lle': '2.61', ...",CHEMBL113637,,CHEMBL113637,7.24,False,http://www.openphacts.org/units/Nanomolar,214179,=,1,True,=,,IC50,nM,,57.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.057
4,,118017,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21,,,CHEMBL1131917,Bioorg. Med. Chem. Lett.,1999.0,"{'bei': '22.58', 'le': '0.43', 'lle': '2.68', ...",CHEMBL112021,,CHEMBL112021,7.27,False,http://www.openphacts.org/units/Nanomolar,214173,=,1,True,=,,IC50,nM,,54.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2762,,18715535,[],CHEMBL4257179,Inhibition of CYP19 (unknown origin),B,BAO_0000190,BAO_0000357,single protein format,O=S(=O)(c1cccc(OC(F)(F)F)c1)N1CCN(c2ccc(/C=C/c...,,,CHEMBL4257083,,,"{'bei': '8.29', 'le': '0.17', 'lle': '-1.16', ...",CHEMBL4288256,,CHEMBL4288256,5.00,False,http://www.openphacts.org/units/Nanomolar,3108930,=,38,True,=,,IC50,nM,,10000.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,10000.0
2763,,18715536,[],CHEMBL4257179,Inhibition of CYP19 (unknown origin),B,BAO_0000190,BAO_0000357,single protein format,CCS(=O)(=O)N1CCN(c2ccc(/C=C/c3cc(F)cc(-c4ccncc...,,,CHEMBL4257083,,,"{'bei': '11.07', 'le': '0.21', 'lle': '0.47', ...",CHEMBL4126284,,CHEMBL4126284,5.00,False,http://www.openphacts.org/units/Nanomolar,3108931,=,38,True,=,,IC50,nM,,10000.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,10000.0
2764,,18715537,[],CHEMBL4257179,Inhibition of CYP19 (unknown origin),B,BAO_0000190,BAO_0000357,single protein format,O=S(=O)(N1CCN(c2ccc(/C=C/c3cc(F)cc(-c4ccncc4)c...,,,CHEMBL4257083,,,"{'bei': '12.21', 'le': '0.24', 'lle': '0.97', ...",CHEMBL4128368,,CHEMBL4128368,6.00,False,http://www.openphacts.org/units/Nanomolar,3108932,=,38,True,=,,IC50,nM,,1000.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,1000.0
2765,Not Determined,18715538,[],CHEMBL4257179,Inhibition of CYP19 (unknown origin),B,BAO_0000190,BAO_0000357,single protein format,CCOC(=O)CNC(=O)N1CCN(c2ccc(/C=C/c3cc(Cl)cc(Cn4...,,,CHEMBL4257083,,,,CHEMBL4291002,,CHEMBL4291002,,False,,3108933,,38,False,,,IC50,,,,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,,,,


Finally, we will save the resulting bioactivity data to a CSV file, **bioactivity_data_raw.csv**.

In [0]:
df.to_csv('bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **standard_value** column, drop it.

In [9]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,82585,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,,,CHEMBL1133116,J. Med. Chem.,2000.0,"{'bei': '15.62', 'le': '0.29', 'lle': '0.86', ...",CHEMBL341591,,CHEMBL341591,5.15,False,http://www.openphacts.org/units/Nanomolar,267172,=,1,True,=,,IC50,nM,,7100.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.1
1,,94540,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...,,,CHEMBL1133116,J. Med. Chem.,2000.0,,CHEMBL2111947,,CHEMBL2111947,,False,http://www.openphacts.org/units/Nanomolar,267163,>,1,True,>,,IC50,nM,,50000.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,50.0
2,,112960,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,,,CHEMBL1131917,Bioorg. Med. Chem. Lett.,1999.0,"{'bei': '16.06', 'le': '0.35', 'lle': '0.91', ...",CHEMBL431859,,CHEMBL431859,6.62,False,http://www.openphacts.org/units/Nanomolar,214178,=,1,True,=,,IC50,nM,,238.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.238
3,,116766,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,,,CHEMBL1131917,Bioorg. Med. Chem. Lett.,1999.0,"{'bei': '22.68', 'le': '0.41', 'lle': '2.61', ...",CHEMBL113637,,CHEMBL113637,7.24,False,http://www.openphacts.org/units/Nanomolar,214179,=,1,True,=,,IC50,nM,,57.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.057
4,,118017,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21,,,CHEMBL1131917,Bioorg. Med. Chem. Lett.,1999.0,"{'bei': '22.58', 'le': '0.43', 'lle': '2.68', ...",CHEMBL112021,,CHEMBL112021,7.27,False,http://www.openphacts.org/units/Nanomolar,214173,=,1,True,=,,IC50,nM,,54.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2761,,18715534,[],CHEMBL4257179,Inhibition of CYP19 (unknown origin),B,BAO_0000190,BAO_0000357,single protein format,O=S(=O)(C1CC1)N1CCN(c2ccc(/C=C/c3cc(F)cc(-c4cc...,,,CHEMBL4257083,,,"{'bei': '10.79', 'le': '0.21', 'lle': '0.33', ...",CHEMBL4126996,,CHEMBL4126996,5.00,False,http://www.openphacts.org/units/Nanomolar,3108929,=,38,True,=,,IC50,nM,,10000.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,10000.0
2762,,18715535,[],CHEMBL4257179,Inhibition of CYP19 (unknown origin),B,BAO_0000190,BAO_0000357,single protein format,O=S(=O)(c1cccc(OC(F)(F)F)c1)N1CCN(c2ccc(/C=C/c...,,,CHEMBL4257083,,,"{'bei': '8.29', 'le': '0.17', 'lle': '-1.16', ...",CHEMBL4288256,,CHEMBL4288256,5.00,False,http://www.openphacts.org/units/Nanomolar,3108930,=,38,True,=,,IC50,nM,,10000.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,10000.0
2763,,18715536,[],CHEMBL4257179,Inhibition of CYP19 (unknown origin),B,BAO_0000190,BAO_0000357,single protein format,CCS(=O)(=O)N1CCN(c2ccc(/C=C/c3cc(F)cc(-c4ccncc...,,,CHEMBL4257083,,,"{'bei': '11.07', 'le': '0.21', 'lle': '0.47', ...",CHEMBL4126284,,CHEMBL4126284,5.00,False,http://www.openphacts.org/units/Nanomolar,3108931,=,38,True,=,,IC50,nM,,10000.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,10000.0
2764,,18715537,[],CHEMBL4257179,Inhibition of CYP19 (unknown origin),B,BAO_0000190,BAO_0000357,single protein format,O=S(=O)(N1CCN(c2ccc(/C=C/c3cc(F)cc(-c4ccncc4)c...,,,CHEMBL4257083,,,"{'bei': '12.21', 'le': '0.24', 'lle': '0.97', ...",CHEMBL4128368,,CHEMBL4128368,6.00,False,http://www.openphacts.org/units/Nanomolar,3108932,=,38,True,=,,IC50,nM,,1000.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,1000.0
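
The filtering step above can be illustrated on a small toy table. This is a sketch with made-up molecule IDs (the `CHEMBL_A` style names are hypothetical); it shows that the boolean mask from `notna()` and `dropna(subset=...)` keep the same rows, and how to count what was dropped.

```python
import pandas as pd

# Toy data: the middle row has no standard_value, like row 2765 in the raw table.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"],  # hypothetical IDs
    "standard_value": [7100.0, None, 238.0],
})

# Same pattern as the notebook: keep rows where standard_value is present.
kept = toy[toy["standard_value"].notna()]

# Equivalent spelling using dropna restricted to one column.
kept_alt = toy.dropna(subset=["standard_value"])

n_dropped = len(toy) - len(kept)
print(f"dropped {n_dropped} row(s); remaining index: {list(kept.index)}")
```

Note that the original index labels are preserved, which is why the tables above skip index values after rows are removed.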


In [10]:
len(df2.canonical_smiles.unique())

1989

In [11]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,82585,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,,,CHEMBL1133116,J. Med. Chem.,2000.0,"{'bei': '15.62', 'le': '0.29', 'lle': '0.86', ...",CHEMBL341591,,CHEMBL341591,5.15,False,http://www.openphacts.org/units/Nanomolar,267172,=,1,True,=,,IC50,nM,,7100.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,7.1
1,,94540,[],CHEMBL666794,Inhibition of Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...,,,CHEMBL1133116,J. Med. Chem.,2000.0,,CHEMBL2111947,,CHEMBL2111947,,False,http://www.openphacts.org/units/Nanomolar,267163,>,1,True,>,,IC50,nM,,50000.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,50.0
2,,112960,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,,,CHEMBL1131917,Bioorg. Med. Chem. Lett.,1999.0,"{'bei': '16.06', 'le': '0.35', 'lle': '0.91', ...",CHEMBL431859,,CHEMBL431859,6.62,False,http://www.openphacts.org/units/Nanomolar,214178,=,1,True,=,,IC50,nM,,238.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.238
3,,116766,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,,,CHEMBL1131917,Bioorg. Med. Chem. Lett.,1999.0,"{'bei': '22.68', 'le': '0.41', 'lle': '2.61', ...",CHEMBL113637,,CHEMBL113637,7.24,False,http://www.openphacts.org/units/Nanomolar,214179,=,1,True,=,,IC50,nM,,57.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.057
4,,118017,[],CHEMBL661700,In vitro inhibition of human Cytochrome P450 19A1,B,BAO_0000190,BAO_0000357,single protein format,Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21,,,CHEMBL1131917,Bioorg. Med. Chem. Lett.,1999.0,"{'bei': '22.58', 'le': '0.43', 'lle': '2.68', ...",CHEMBL112021,,CHEMBL112021,7.27,False,http://www.openphacts.org/units/Nanomolar,214173,=,1,True,=,,IC50,nM,,54.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2757,,18577031,[],CHEMBL4222102,Inhibition of aromatase in supersomes (unknown...,B,BAO_0000190,BAO_0000357,single protein format,Cn1c(-c2csc(-c3ncccn3)n2)nc2ccccc21,,,CHEMBL4219133,Bioorg Med Chem,2018.0,"{'bei': '19.44', 'le': '0.37', 'lle': '2.55', ...",CHEMBL4228031,,CHEMBL4228031,5.70,False,http://www.openphacts.org/units/Nanomolar,3084670,=,1,True,=,,IC50,nM,,1984.4,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,1.9844
2758,,18577032,[],CHEMBL4222102,Inhibition of aromatase in supersomes (unknown...,B,BAO_0000190,BAO_0000357,single protein format,c1cnc(-c2nc(-c3nc4ccccc4s3)cs2)nc1,,,CHEMBL4219133,Bioorg Med Chem,2018.0,"{'bei': '20.24', 'le': '0.41', 'lle': '2.12', ...",CHEMBL4228244,,CHEMBL4228244,6.00,False,http://www.openphacts.org/units/Nanomolar,3084671,=,1,True,=,,IC50,nM,,999.74,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,uM,UO_0000065,,0.99974
2761,,18715534,[],CHEMBL4257179,Inhibition of CYP19 (unknown origin),B,BAO_0000190,BAO_0000357,single protein format,O=S(=O)(C1CC1)N1CCN(c2ccc(/C=C/c3cc(F)cc(-c4cc...,,,CHEMBL4257083,,,"{'bei': '10.79', 'le': '0.21', 'lle': '0.33', ...",CHEMBL4126996,,CHEMBL4126996,5.00,False,http://www.openphacts.org/units/Nanomolar,3108929,=,38,True,=,,IC50,nM,,10000.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,10000.0
2762,,18715535,[],CHEMBL4257179,Inhibition of CYP19 (unknown origin),B,BAO_0000190,BAO_0000357,single protein format,O=S(=O)(c1cccc(OC(F)(F)F)c1)N1CCN(c2ccc(/C=C/c...,,,CHEMBL4257083,,,"{'bei': '8.29', 'le': '0.17', 'lle': '-1.16', ...",CHEMBL4288256,,CHEMBL4288256,5.00,False,http://www.openphacts.org/units/Nanomolar,3108930,=,38,True,=,,IC50,nM,,10000.0,CHEMBL1978,Homo sapiens,Cytochrome P450 19A1,9606,,,IC50,nM,UO_0000065,,10000.0
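
It may help to see what `drop_duplicates` does when the same structure was measured more than once: by default it keeps the first row for each SMILES and discards the later measurements. The sketch below uses toy values to show that behavior, followed by one possible alternative (taking the median per structure). The median aggregation is only an illustration of a different design choice, not what this notebook does.

```python
import pandas as pd

# Toy data: "CCO" appears twice with different IC50 values.
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", "c1ccccc1"],
    "standard_value": [100.0, 300.0, 5000.0],
})

# Notebook approach: keep the first occurrence of each SMILES string.
first_only = toy.drop_duplicates(["canonical_smiles"])

# Alternative (assumption, not used in the notebook): median value per SMILES.
median_per_smiles = (
    toy.groupby("canonical_smiles", sort=False)["standard_value"]
    .median()
    .reset_index()
)

print(first_only)
print(median_per_smiles)
```

Keeping the first occurrence is simple and reproducible, but the retained value then depends on row order; an aggregate such as the median does not.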


## **Data pre-processing of the bioactivity data**

### **Combine the 3 columns (molecule_chembl_id,canonical_smiles,standard_value) and bioactivity_class into a DataFrame**

In [12]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL341591,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,7100.0
1,CHEMBL2111947,C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...,50000.0
2,CHEMBL431859,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,238.0
3,CHEMBL113637,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,57.0
4,CHEMBL112021,Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21,54.0
...,...,...,...
2757,CHEMBL4228031,Cn1c(-c2csc(-c3ncccn3)n2)nc2ccccc21,1984.4
2758,CHEMBL4228244,c1cnc(-c2nc(-c3nc4ccccc4s3)cs2)nc1,999.74
2761,CHEMBL4126996,O=S(=O)(C1CC1)N1CCN(c2ccc(/C=C/c3cc(F)cc(-c4cc...,10000.0
2762,CHEMBL4288256,O=S(=O)(c1cccc(OC(F)(F)F)c1)N1CCN(c2ccc(/C=C/c...,10000.0


Save the DataFrame to a CSV file.

In [0]:
df3.to_csv('bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less will be considered **active**, while those with values of 10,000 nM or more will be considered **inactive**. Values between 1,000 and 10,000 nM will be labeled **intermediate**.

In [0]:
df4 = pd.read_csv('bioactivity_data_preprocessed.csv')

In [0]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")
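
The loop above can also be written without explicit iteration. The following is a minimal sketch using `numpy.select` with the same thresholds and the same boundary handling (`>= 10000` is inactive, `<= 1000` is active); the function name `label_bioactivity` and its parameter names are hypothetical.

```python
import numpy as np
import pandas as pd


def label_bioactivity(ic50_nm, active_max=1000.0, inactive_min=10000.0):
    """Hypothetical helper: label IC50 values (nM) as active/intermediate/inactive."""
    values = pd.to_numeric(ic50_nm, errors="raise").astype(float)
    # Conditions are checked in order; anything matching neither is intermediate.
    labels = np.select(
        [values >= inactive_min, values <= active_max],
        ["inactive", "active"],
        default="intermediate",
    )
    return pd.Series(labels, index=values.index, name="class")


# Toy values, including the two boundary cases 1000 and 10000.
toy_values = pd.Series([7100.0, 50000.0, 238.0, 999.74, 10000.0, 1000.0])
toy_labels = label_bioactivity(toy_values)
print(toy_labels.tolist())
```

Because the returned Series reuses the index of its input, it can be attached to the source DataFrame directly, for example with `df4['class'] = label_bioactivity(df4.standard_value)`.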

In [16]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL341591,CC12CCC(O)CC1=CCC1C2CCC2(C)C(CC3CN3)CCC12,7100.00,intermediate
1,CHEMBL2111947,C[C@]12CC[C@H]3[C@@H](CC=C4C[C@@H](O)CC[C@@]43...,50000.00,inactive
2,CHEMBL431859,CCn1c(C(c2ccc(F)cc2)n2ccnc2)c(C)c2cc(Br)ccc21,238.00,active
3,CHEMBL113637,CCn1cc(C(c2ccc(F)cc2)n2ccnc2)c2ccccc21,57.00,active
4,CHEMBL112021,Clc1ccccc1Cn1cc(Cn2ccnc2)c2ccccc21,54.00,active
...,...,...,...,...
1984,CHEMBL4228031,Cn1c(-c2csc(-c3ncccn3)n2)nc2ccccc21,1984.40,intermediate
1985,CHEMBL4228244,c1cnc(-c2nc(-c3nc4ccccc4s3)cs2)nc1,999.74,active
1986,CHEMBL4126996,O=S(=O)(C1CC1)N1CCN(c2ccc(/C=C/c3cc(F)cc(-c4cc...,10000.00,inactive
1987,CHEMBL4288256,O=S(=O)(c1cccc(OC(F)(F)F)c1)N1CCN(c2ccc(/C=C/c...,10000.00,inactive
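
One detail worth noting: `pd.concat(..., axis=1)` aligns rows by index label, not by position. The step above works because `df4` was re-read from the CSV file and therefore has a fresh 0 to n-1 index (which is why the last row is now 1987 rather than 2762). The toy sketch below shows what would happen if a frame with gaps in its index were concatenated with a freshly built label Series, and how `reset_index(drop=True)` avoids the problem.

```python
import pandas as pd

# Toy frame with gaps in its index, like df3 after rows were dropped.
gappy = pd.DataFrame(
    {"standard_value": [54.0, 1984.4, 10000.0]},
    index=[4, 2757, 2761],
)

# Labels built from a plain list get the default index 0, 1, 2.
labels = pd.Series(["active", "intermediate", "inactive"], name="class")

# Index labels do not match, so rows are not paired up and NaNs appear.
misaligned = pd.concat([gappy, labels], axis=1)

# Resetting the index first pairs the rows by position.
aligned = pd.concat([gappy.reset_index(drop=True), labels], axis=1)

print(misaligned)
print(aligned)
```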


Save the DataFrame to a CSV file.

In [0]:
df5.to_csv('bioactivity_data_curated_aromatase.csv', index=False)

In [18]:
! ls -l

total 1616
-rw-r--r-- 1 root root  145772 May 30 07:59 bioactivity_data_curated.csv
-rw-r--r-- 1 root root  127352 May 30 07:59 bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root 1369218 May 30 07:59 bioactivity_data_raw.csv
drwxr-xr-x 1 root root    4096 May 27 16:27 sample_data


---