# Part 1 - Download Bioactivity Data
Eric Kwok

This notebook is based on [Computational Drug Discovery [Part 1] Download Bioactivity Data](https://github.com/dataprofessor/code/blob/master/python/CDD_ML_Part_1_bioactivity_data.ipynb) by Chanin Nantasenamat.

---

## ChEMBL Database
The [ChEMBL Database](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds, compiled from more than 76,000 documents and 1.2 million assays. The data span 13,000 targets, 1,800 cells, and 33,000 indications (data as of May 18, 2020; ChEMBL version 27).

## Installing Libraries

In [51]:
! pip install chembl_webresource_client



## Importing Libraries

In [52]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

## Search for Target Protein

### Target Search for Influenza

In [53]:
target = new_client.target
target_query = target.search('influenza')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],unidentified influenza virus,unidentified influenza virus,12.0,False,CHEMBL613128,[],ORGANISM,11309
1,[],Influenza B virus,Influenza B virus,12.0,False,CHEMBL613129,[],ORGANISM,11520
2,[],Influenza A virus,Influenza A virus,12.0,False,CHEMBL613740,[],ORGANISM,11320
3,[],Influenza C virus,Influenza C virus,12.0,False,CHEMBL612783,[],ORGANISM,11552
4,"[{'xref_id': 'P03438', 'xref_name': None, 'xre...",Influenza A virus (strain A/X-31 H3N2),Influenza A virus Hemagglutinin,10.0,False,CHEMBL4918,"[{'accession': 'P03438', 'component_descriptio...",SINGLE PROTEIN,132504
5,[],Influenza A virus (H5N1),Influenza A virus (H5N1),10.0,False,CHEMBL613845,[],ORGANISM,102793
6,[],Influenza A virus H3N2,Influenza A virus H3N2,10.0,False,CHEMBL2366902,[],ORGANISM,41857
7,[],Unidentified Influenza A virus (H1N2),Unidentified Influenza A virus (H1N2),9.0,False,CHEMBL2367089,[],ORGANISM,1323429
8,"[{'xref_id': 'P63231', 'xref_name': None, 'xre...",Influenza A virus (A/Udorn/307/1972(H3N2)),Influenza virus A matrix protein M2,8.0,False,CHEMBL2052,"[{'accession': 'P0DOF8', 'component_descriptio...",SINGLE PROTEIN,381517
9,[],Influenza B virus (B/Lee/40),Influenza B virus (B/Lee/40),8.0,False,CHEMBL612452,[],ORGANISM,107412


### Select and retrieve bioactivity data for Influenza A virus (third entry)

In [54]:
selected_target = targets.target_chembl_id[2]
selected_target

'CHEMBL613740'
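
Selecting by position works only while the search results keep the same order. As a minimal sketch of an alternative, the target can be looked up by its `pref_name` instead. The `toy_targets` frame below is a hypothetical stand-in that copies three rows from the search output above, so the example runs without a network call.

```python
import pandas as pd

# Hypothetical stand-in for the `targets` dataframe (three rows copied from the output above)
toy_targets = pd.DataFrame({
    'pref_name': ['unidentified influenza virus', 'Influenza B virus', 'Influenza A virus'],
    'target_chembl_id': ['CHEMBL613128', 'CHEMBL613129', 'CHEMBL613740'],
})

# Look the target up by its preferred name rather than by row position
matches = toy_targets.loc[toy_targets.pref_name == 'Influenza A virus', 'target_chembl_id']

# Fail loudly if the name is missing or ambiguous
if len(matches) != 1:
    raise ValueError(f'expected exactly one match, found {len(matches)}')

selected_by_name = matches.iloc[0]
print(selected_by_name)
```

The same lookup should apply to the real `targets` dataframe, assuming the `pref_name` values match those shown in the search output.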

Here, we retrieve only the bioactivity data for the selected target, Influenza A virus (CHEMBL613740), that are reported as IC50 values; the standardized values are in nM (nanomolar) units.

In [55]:
activity = new_client.activity
result = activity.filter(target_chembl_id=selected_target).filter(standard_type='IC50')
df = pd.DataFrame.from_dict(result)
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,bao_endpoint,bao_format,bao_label,canonical_smiles,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,76866,[],CHEMBL808323,The compound was tested in vitro for the inhib...,F,BAO_0000190,BAO_0000218,organism-based format,CC(=O)Nc1ccc(C(=O)O)cc1NC(=O)CO,...,Influenza A virus,Influenza A virus,11320,,,IC50,mM,UO_0000065,,4.0
1,,76868,[],CHEMBL808324,The compound was tested in vitro for the inhib...,F,BAO_0000190,BAO_0000218,organism-based format,CC(=O)Nc1ccc(C(=O)O)cc1NC(=O)CO,...,Influenza A virus,Influenza A virus,11320,,,IC50,mM,UO_0000065,,10.0
2,,77906,[],CHEMBL808323,The compound was tested in vitro for the inhib...,F,BAO_0000190,BAO_0000218,organism-based format,CC(=O)Nc1c(O)cc(C(=O)O)cc1[N+](=O)[O-],...,Influenza A virus,Influenza A virus,11320,,,IC50,mM,UO_0000065,,0.75
3,,77908,[],CHEMBL808324,The compound was tested in vitro for the inhib...,F,BAO_0000190,BAO_0000218,organism-based format,CC(=O)Nc1c(O)cc(C(=O)O)cc1[N+](=O)[O-],...,Influenza A virus,Influenza A virus,11320,,,IC50,mM,UO_0000065,,1.0
4,,79067,[],CHEMBL808323,The compound was tested in vitro for the inhib...,F,BAO_0000190,BAO_0000218,organism-based format,CC(=O)Nc1c(OC(C)=O)cc(C(=O)O)cc1[N+](=O)[O-],...,Influenza A virus,Influenza A virus,11320,,,IC50,mM,UO_0000065,,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1692,,18746013,[],CHEMBL4262939,Antiviral activity against Influenza A virus (...,F,BAO_0000190,BAO_0000218,organism-based format,C=C(C)[C@@H]1CC[C@]2(C(=O)N[C@H]3C=C(C(=O)OC)O...,...,Influenza A virus,Influenza A virus,11320,,,IC50,uM,UO_0000065,,100.0
1693,,18746014,[],CHEMBL4262939,Antiviral activity against Influenza A virus (...,F,BAO_0000190,BAO_0000218,organism-based format,COC(=O)C1=C[C@H](NC(=O)[C@]23CCC(C)(C)C[C@H]2C...,...,Influenza A virus,Influenza A virus,11320,,,IC50,uM,UO_0000065,,40.5
1694,,18746015,[],CHEMBL4262939,Antiviral activity against Influenza A virus (...,F,BAO_0000190,BAO_0000218,organism-based format,C=C(C)[C@@H]1CC[C@]2(C(=O)N[C@H]3C=C(C(=O)OC)O...,...,Influenza A virus,Influenza A virus,11320,,,IC50,uM,UO_0000065,,100.0
1695,,18746016,[],CHEMBL4262939,Antiviral activity against Influenza A virus (...,F,BAO_0000190,BAO_0000218,organism-based format,COc1cc(/C=C/C(=O)CC(=O)/C=C/c2ccc(O)c(OC)c2)ccc1O,...,Influenza A virus,Influenza A virus,11320,,,IC50,uM,UO_0000065,,6.7
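
The `value` and `units` columns in the output above appear to hold each measurement in its originally reported unit (mM or uM here), while the later steps work with `standard_value` in nM. The sketch below shows the arithmetic behind that relationship. The `TO_NM` table and the `to_nanomolar` helper are hypothetical names introduced for illustration; they are not part of the ChEMBL client.

```python
# Hypothetical conversion factors from the reported concentration unit to nM
TO_NM = {
    'mM': 1e6,  # 1 millimolar = 1,000,000 nanomolar
    'uM': 1e3,  # 1 micromolar = 1,000 nanomolar
    'nM': 1.0,
}

def to_nanomolar(value, units):
    """Convert a concentration to nM; raises KeyError for units not listed in TO_NM."""
    return value * TO_NM[units]

# First row of the output above: 4.0 mM
first_row_nm = to_nanomolar(4.0, 'mM')
# Last row of the output above: 6.7 uM
last_row_nm = to_nanomolar(6.7, 'uM')
print(first_row_nm, last_row_nm)
```

These results agree with the `standard_value` entries of 4000000.0 and 6700.0 shown for the same compounds in the final dataframe.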


Next, we save the resulting bioactivity data to the CSV file **influenza_a_bioactivity.csv**.

In [56]:
df.to_csv('influenza_a_bioactivity.csv', index=False)

## Handling Missing Data
Drop compounds that are missing a value in the **standard_value** column.

In [57]:
df2 = df[df.standard_value.notnull()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,bao_endpoint,bao_format,bao_label,canonical_smiles,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,76866,[],CHEMBL808323,The compound was tested in vitro for the inhib...,F,BAO_0000190,BAO_0000218,organism-based format,CC(=O)Nc1ccc(C(=O)O)cc1NC(=O)CO,...,Influenza A virus,Influenza A virus,11320,,,IC50,mM,UO_0000065,,4.0
1,,76868,[],CHEMBL808324,The compound was tested in vitro for the inhib...,F,BAO_0000190,BAO_0000218,organism-based format,CC(=O)Nc1ccc(C(=O)O)cc1NC(=O)CO,...,Influenza A virus,Influenza A virus,11320,,,IC50,mM,UO_0000065,,10.0
2,,77906,[],CHEMBL808323,The compound was tested in vitro for the inhib...,F,BAO_0000190,BAO_0000218,organism-based format,CC(=O)Nc1c(O)cc(C(=O)O)cc1[N+](=O)[O-],...,Influenza A virus,Influenza A virus,11320,,,IC50,mM,UO_0000065,,0.75
3,,77908,[],CHEMBL808324,The compound was tested in vitro for the inhib...,F,BAO_0000190,BAO_0000218,organism-based format,CC(=O)Nc1c(O)cc(C(=O)O)cc1[N+](=O)[O-],...,Influenza A virus,Influenza A virus,11320,,,IC50,mM,UO_0000065,,1.0
4,,79067,[],CHEMBL808323,The compound was tested in vitro for the inhib...,F,BAO_0000190,BAO_0000218,organism-based format,CC(=O)Nc1c(OC(C)=O)cc(C(=O)O)cc1[N+](=O)[O-],...,Influenza A virus,Influenza A virus,11320,,,IC50,mM,UO_0000065,,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1692,,18746013,[],CHEMBL4262939,Antiviral activity against Influenza A virus (...,F,BAO_0000190,BAO_0000218,organism-based format,C=C(C)[C@@H]1CC[C@]2(C(=O)N[C@H]3C=C(C(=O)OC)O...,...,Influenza A virus,Influenza A virus,11320,,,IC50,uM,UO_0000065,,100.0
1693,,18746014,[],CHEMBL4262939,Antiviral activity against Influenza A virus (...,F,BAO_0000190,BAO_0000218,organism-based format,COC(=O)C1=C[C@H](NC(=O)[C@]23CCC(C)(C)C[C@H]2C...,...,Influenza A virus,Influenza A virus,11320,,,IC50,uM,UO_0000065,,40.5
1694,,18746015,[],CHEMBL4262939,Antiviral activity against Influenza A virus (...,F,BAO_0000190,BAO_0000218,organism-based format,C=C(C)[C@@H]1CC[C@]2(C(=O)N[C@H]3C=C(C(=O)OC)O...,...,Influenza A virus,Influenza A virus,11320,,,IC50,uM,UO_0000065,,100.0
1695,,18746016,[],CHEMBL4262939,Antiviral activity against Influenza A virus (...,F,BAO_0000190,BAO_0000218,organism-based format,COc1cc(/C=C/C(=O)CC(=O)/C=C/c2ccc(O)c(OC)c2)ccc1O,...,Influenza A virus,Influenza A virus,11320,,,IC50,uM,UO_0000065,,6.7
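
As a small illustration of the filtering step, the sketch below applies the same `notnull()` mask to a hypothetical three-row frame and compares it with `dropna(subset=...)`, which is an equivalent way to express the operation in pandas. The `toy` data and its `TOY*` identifiers are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: one record has no standard_value
toy = pd.DataFrame({
    'molecule_chembl_id': ['TOY1', 'TOY2', 'TOY3'],
    'standard_value': ['4000000.0', None, '6700.0'],
})

# Boolean-mask form, as used in the cell above
kept_mask = toy[toy.standard_value.notnull()]

# Equivalent form using dropna restricted to one column
kept_dropna = toy.dropna(subset=['standard_value'])

same = kept_mask.equals(kept_dropna)
n_dropped = len(toy) - len(kept_mask)
print(same, n_dropped)
```

Both forms keep the original row labels, so the index has gaps after filtering; this is why a later `reset_index(drop=True)` is needed before attaching new columns by position.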


## Data Pre-Processing
### Label compounds as being either active, inactive, or intermediate
The bioactivity data are IC50 values expressed in nM. Compounds with values of 1,000 nM or less will be considered **active**, and those with values of 10,000 nM or more will be considered **inactive**. Compounds with values strictly between 1,000 and 10,000 nM will be considered **intermediate**.

In [58]:
bioactivity_class = []
for value in df2.standard_value.astype('float64'):
    if value <= 1000:
        bioactivity_class.append('active')
    elif value >= 10000:
        bioactivity_class.append('inactive')
    else:
        bioactivity_class.append('intermediate')
print(len(bioactivity_class))

1591
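
The labeling rule can also be written as a small function, which makes the behavior at the two boundaries easy to check. This is a minimal sketch: `classify_ic50` and the two threshold constants are hypothetical names, and the comparisons mirror the `<=` and `>=` used in the loop above.

```python
ACTIVE_MAX_NM = 1000      # values at or below this are labeled active
INACTIVE_MIN_NM = 10000   # values at or above this are labeled inactive

def classify_ic50(value_nm):
    """Return the bioactivity class for an IC50 value given in nM."""
    if value_nm <= ACTIVE_MAX_NM:
        return 'active'
    if value_nm >= INACTIVE_MIN_NM:
        return 'inactive'
    return 'intermediate'

# Boundary values plus two values taken from the data shown in this notebook
examples = [1000.0, 10000.0, 6700.0, 4000000.0]
labels = [classify_ic50(v) for v in examples]
print(labels)
```

Assuming `df2` from the previous step, the same function could be applied with `df2.standard_value.astype('float64').apply(classify_ic50)` to produce the labels as a pandas Series.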


### Create new dataframe with metrics of interest
In addition to the bioactivity class, we would like to analyze the following columns:
- molecule_chembl_id
- canonical_smiles
- standard_value

In [62]:
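# Keep only the identifier, structure, and standardized IC50 columns
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
# Reset the index so rows align positionally with bioactivity_class
df3 = df2[selection].reset_index(drop=True)
# Save the selected columns before the class labels are attached
df3.to_csv('df3_selection.csv')
# Append the class labels as a new column
df3 = pd.concat([df3, pd.DataFrame(bioactivity_class, columns=['bioactivity_class'])], axis=1)
df3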

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL327097,CC(=O)Nc1ccc(C(=O)O)cc1NC(=O)CO,4000000.0,inactive
1,CHEMBL327097,CC(=O)Nc1ccc(C(=O)O)cc1NC(=O)CO,10000000.0,inactive
2,CHEMBL324455,CC(=O)Nc1c(O)cc(C(=O)O)cc1[N+](=O)[O-],750000.0,inactive
3,CHEMBL324455,CC(=O)Nc1c(O)cc(C(=O)O)cc1[N+](=O)[O-],1000000.0,inactive
4,CHEMBL321393,CC(=O)Nc1c(OC(C)=O)cc(C(=O)O)cc1[N+](=O)[O-],5000000.0,inactive
...,...,...,...,...
1586,CHEMBL4286184,C=C(C)[C@@H]1CC[C@]2(C(=O)N[C@H]3C=C(C(=O)OC)O...,100000.0,inactive
1587,CHEMBL4294084,COC(=O)C1=C[C@H](NC(=O)[C@]23CCC(C)(C)C[C@H]2C...,40500.0,inactive
1588,CHEMBL4282791,C=C(C)[C@@H]1CC[C@]2(C(=O)N[C@H]3C=C(C(=O)OC)O...,100000.0,inactive
1589,CHEMBL140,COc1cc(/C=C/C(=O)CC(=O)/C=C/c2ccc(O)c(OC)c2)ccc1O,6700.0,intermediate


Write to CSV file.

In [63]:
df3.to_csv('influenza_a_preprocessed.csv', index=False)