## A journey into bioinformatics
Original tutorial I followed while developing this notebook: <a href="https://www.youtube.com/watch?v=jBlTQjcKuaY">Python for Bioinformatics - Drug Discovery Using Machine Learning and Data Analysis</a><br>
## Data collection - <a href="https://www.ebi.ac.uk/chembl/">ChEMBL Database</a>
ChEMBL is a database of curated bioactivity data for more than 2 million compounds. Version 26 was compiled from over 76,000 documents and 1.2 million assays, and its data span 13,000 targets, 1,800 cells, and 33,000 indications.<br>
### Import section

In [2]:
# You may need to: pip install chembl_webresource_client

import pandas as pd 
import os

from chembl_webresource_client.new_client import new_client

### Search for a target protein
Acetylcholinesterase in this case, specifically the entry whose target_type is SINGLE PROTEIN and whose organism is Homo sapiens.

In [3]:
qry = new_client.target.search("acetylcholinesterase")
dfsearch = pd.DataFrame(qry)
dfsearch[(dfsearch['organism']=='Homo sapiens') & (dfsearch['target_type']=='SINGLE PROTEIN')]

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P22303', 'xref_name': None, 'xre...",Homo sapiens,Acetylcholinesterase,27.0,False,CHEMBL220,"[{'accession': 'P22303', 'component_descriptio...",SINGLE PROTEIN,9606
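
The filter above only displays the matching row. If the ChEMBL ID is needed later, it may be safer to read it from the filtered rows than to rely on the position of a row in the raw search results. The following is a minimal sketch of that idea; the helper name `pick_target_id` and the small example table (including the placeholder IDs `CHEMBL_A` and `CHEMBL_B`) are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def pick_target_id(search_df, organism="Homo sapiens", target_type="SINGLE PROTEIN"):
    """Return the first target_chembl_id matching the organism and target type."""
    mask = (search_df["organism"] == organism) & (search_df["target_type"] == target_type)
    hits = search_df.loc[mask, "target_chembl_id"]
    if hits.empty:
        raise ValueError(f"no {target_type} target found for {organism}")
    return hits.iloc[0]

# Hypothetical stand-in for the search result; only CHEMBL220 comes from the output above
example = pd.DataFrame({
    "organism": ["Electrophorus electricus", "Homo sapiens", "Homo sapiens"],
    "target_type": ["SINGLE PROTEIN", "SELECTIVITY GROUP", "SINGLE PROTEIN"],
    "target_chembl_id": ["CHEMBL_A", "CHEMBL_B", "CHEMBL220"],
})

target_id = pick_target_id(example)
print(target_id)
```

With the real search results, the same helper could be called as `pick_target_id(dfsearch)`.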


Now we will acquire the bioactivity data for human acetylcholinesterase that are reported as IC50 values (the query below filters on standard_type="IC50").

In [11]:
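# Row 0 of the search results is the human single-protein target shown above (CHEMBL220)
chembl_id = dfsearch['target_chembl_id'][0]
# Cache file for the raw download, named after the target
csv_name = "bioactivity_"+chembl_id +"_raw.csv"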

# If we don't already have the file, then download it
if not os.path.exists(csv_name):
    activity = new_client.activity
    res = activity.filter(target_chembl_id=chembl_id).filter(standard_type="IC50")
    df = pd.DataFrame(res)
    df.to_csv(csv_name, index=False)
else:
    df = pd.read_csv(csv_name)
df.head()

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0
3,,38902,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.3
4,,41170,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.8
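
Note that the `value` column above is reported in uM, whereas the wrangling step works on `standard_value`. Before applying fixed thresholds it can be worth confirming that every row uses the same standardized unit. The sketch below assumes the activity records carry a `standard_units` column (it is hidden behind the `...` in the truncated output, so treat its presence as an assumption); the helper name `keep_nanomolar` and the sample rows are hypothetical.

```python
import pandas as pd

def keep_nanomolar(activity_df):
    """Keep rows whose standardized unit is nM and make standard_value numeric."""
    out = activity_df[activity_df["standard_units"] == "nM"].copy()
    # Values may arrive as strings from the API or as floats from the cached CSV
    out["standard_value"] = pd.to_numeric(out["standard_value"], errors="coerce")
    return out

# Hypothetical sample: two nM rows and one row in a different unit
sample = pd.DataFrame({
    "standard_value": ["750.0", "100.0", "12.5"],
    "standard_units": ["nM", "nM", "ug.mL-1"],
})

# How many rows use each unit (including missing units)
print(sample["standard_units"].value_counts(dropna=False))

nm_only = keep_nanomolar(sample)
print(nm_only)
```

On the real data this would be called as `keep_nanomolar(df)` before selecting columns.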


### Wrangling section
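<ul><li>Keep only the necessary columns</li>
<li>Drop rows with a missing standard value or SMILES string</li>
<li>Drop duplicate compounds (by canonical SMILES)</li>
<li>Add a "class" feature (active, intermediate, or inactive)</li>
</ul>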

In [23]:
def classify_standard_value(x):
    ''' INPUTS: x: the standard value (IC50, in nM) of the molecule
        OUTPUTS: the "class" as per the standard value range:
                 "inactive" (>= 10000), "intermediate" (> 1000), otherwise "active" '''
    x = float(x)
    if x >= 10000:
        return "inactive"
    elif x > 1000:
        return "intermediate"
    return "active"

dfc = df[["molecule_chembl_id","canonical_smiles","standard_value"]].copy()
dfc.dropna(axis=0,how="any", subset=["standard_value","canonical_smiles"], inplace=True)
dfc.drop_duplicates(["canonical_smiles"], inplace=True)
dfc['class'] = dfc['standard_value'].apply(classify_standard_value)
dfc.to_csv("bioactivity_"+chembl_id +"_clean.csv", index=False)
dfc.head()

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL133897,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,750.0,active
1,CHEMBL336398,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,100.0,active
2,CHEMBL131588,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,50000.0,inactive
3,CHEMBL130628,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,300.0,active
4,CHEMBL130478,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,800.0,active
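
As an alternative to the row-by-row `apply`, the same thresholds can be written as a vectorized expression with `numpy.select`. This is a sketch rather than the notebook's method: the function name `classify_series` is hypothetical, and the sample values extend the five rows shown above with a few boundary cases chosen for illustration.

```python
import numpy as np
import pandas as pd

def classify_series(values):
    """Label each standard value as inactive (>= 10000), intermediate (> 1000), or active."""
    x = pd.to_numeric(values, errors="coerce")
    # Conditions are checked in order; rows matching neither get the default.
    # NaN matches neither condition, so drop nulls first (as the notebook does).
    labels = np.select(
        [x >= 10000, x > 1000],
        ["inactive", "intermediate"],
        default="active",
    )
    return pd.Series(labels, index=values.index, name="class")

# First five values come from the table above; the last three are boundary cases
standard_values = pd.Series([750.0, 100.0, 50000.0, 300.0, 800.0, 5000.0, 10000.0, 1000.0])
classes = classify_series(standard_values)

# Class balance, useful for spotting a heavily skewed data set
print(classes.value_counts())
```

On the cleaned frame this would be used as `dfc['class'] = classify_series(dfc['standard_value'])`.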
