# **Bioinformatics Project - Computational Drug Discovery [Part 1]: Download Bioactivity Data (Concise version)**

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using the ChEMBL bioactivity data.

In **Part 1**, we will be performing Data Collection and Pre-Processing from the ChEMBL Database.

Note for this Concise Version:
* Redundant code cells were deleted.
* Code cells for saving files to Google Drive have been deleted.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.
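
The install cell itself is not shown in this concise version. The package is published on PyPI as `chembl_webresource_client`, so in a notebook it is typically installed with `! pip install chembl_webresource_client`. As a minimal sketch, the check below reports whether the package can be imported before the rest of the notebook runs. The helper name `is_installed` is only for illustration and is not part of any library.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("chembl_webresource_client"):
    print("chembl_webresource_client is available")
else:
    print("Missing package; run: pip install chembl_webresource_client")
```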

## **Importing libraries**

In [40]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client
# chembl_webresource_client is a Python client for accessing ChEMBL data via its RESTful web services.
# We use it to retrieve bioactivity data for our target of interest from the ChEMBL database.
# ChEMBL is a manually curated database of bioactive molecules with drug-like properties.
# It brings together chemical, bioactivity and genomic data to aid the translation of genomic
# information into effective new drugs.

## **Search for Target protein**

In [41]:
# Search ChEMBL targets using the free-text query 'covid-19'
# Initialize the ChEMBL target resource
target = new_client.target
target_query = target.search('covid-19') # list of dictionaries is returned
targets = pd.DataFrame.from_dict(target_query) # convert to dataframe
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'Q712U5', 'xref_name': None, 'xre...",Rattus norvegicus,cyclic AMP phosphoprotein,15.0,False,CHEMBL2170,"[{'accession': 'Q712U5', 'component_descriptio...",SINGLE PROTEIN,10116
1,[],Homo sapiens,KU-19-19,15.0,False,CHEMBL1075483,[],CELL-LINE,9606
2,[],Escherichia coli,Metallo-beta-lactamase VIM-19,15.0,False,CHEMBL3309038,"[{'accession': 'D2D9J0', 'component_descriptio...",SINGLE PROTEIN,562
3,[],Homo sapiens,Ubiquitin carboxyl-terminal hydrolase 19,14.0,False,CHEMBL4523156,"[{'accession': 'O94966', 'component_descriptio...",SINGLE PROTEIN,9606
4,[],Homo sapiens,SNB-19,13.0,False,CHEMBL614164,[],CELL-LINE,9606
5,[],Homo sapiens,HOP-19,13.0,False,CHEMBL614832,[],CELL-LINE,9606
6,"[{'xref_id': 'A8QUY6', 'xref_name': None, 'xre...",Aeromonas caviae,IMP-19,13.0,False,CHEMBL5438,"[{'accession': 'A8QUY6', 'component_descriptio...",SINGLE PROTEIN,648
7,[],Homo sapiens,EFM-19,13.0,False,CHEMBL1075439,[],CELL-LINE,9606
8,[],Homo sapiens,Matrix metalloproteinase-19,13.0,False,CHEMBL1938214,"[{'accession': 'Q99542', 'component_descriptio...",SINGLE PROTEIN,9606
9,[],Homo sapiens,ARPE-19,13.0,False,CHEMBL4296399,[],CELL-LINE,9606
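
The search is a free-text match, so the hits above mix cell lines and proteins from several organisms. Because the search result is a list of dictionaries, one way to narrow it down is to filter on the `target_type` and `organism` fields. The sketch below uses a few rows copied from the table above as stand-in data, with no network call. The helper name `filter_targets` is hypothetical.

```python
# Stand-in for a few entries of target_query, copied from the table above.
sample_hits = [
    {"target_chembl_id": "CHEMBL2170", "organism": "Rattus norvegicus",
     "pref_name": "cyclic AMP phosphoprotein", "target_type": "SINGLE PROTEIN"},
    {"target_chembl_id": "CHEMBL1075483", "organism": "Homo sapiens",
     "pref_name": "KU-19-19", "target_type": "CELL-LINE"},
    {"target_chembl_id": "CHEMBL4523156", "organism": "Homo sapiens",
     "pref_name": "Ubiquitin carboxyl-terminal hydrolase 19",
     "target_type": "SINGLE PROTEIN"},
    {"target_chembl_id": "CHEMBL1938214", "organism": "Homo sapiens",
     "pref_name": "Matrix metalloproteinase-19", "target_type": "SINGLE PROTEIN"},
]


def filter_targets(hits, target_type, organism):
    """Keep only the hits that match both the target type and the organism."""
    return [
        hit for hit in hits
        if hit["target_type"] == target_type and hit["organism"] == organism
    ]


human_proteins = filter_targets(sample_hits, "SINGLE PROTEIN", "Homo sapiens")
for hit in human_proteins:
    print(hit["target_chembl_id"], hit["pref_name"])
```

Picking a target from a filtered list like this, or directly by its `target_chembl_id`, is less fragile than relying on a row position, which can shift when the database is updated.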


In [42]:
# Select the target of interest by its position in the search results
# (position 19 lies beyond the 10 rows displayed above), then retrieve its bioactivity data
selected_target = targets.target_chembl_id[19]
print(selected_target)
# initialize the bioactivity data
activity = new_client.activity
# Retrieve the bioactivity data for the selected target, keeping only records with IC50 as the standard type.
# IC50 is the concentration of a compound required to inhibit a given biological process or target activity by 50%.
# It is a measure of potency: the lower the IC50 value, the more potent the compound.
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")
df = pd.DataFrame.from_dict(res)
df.shape

CHEMBL3883323


(151, 45)

In [43]:
# df head show all the columns
df.head(3)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,16508071,[],CHEMBL3779201,Binding affinity to CDK19/Cyclin C (unknown or...,B,,,BAO_0000190,BAO_0000223,...,Homo sapiens,Cyclin-C/Cyclin-dependent kinase 19,9606,,,IC50,nM,UO_0000065,,26.0
1,,16509067,[],CHEMBL3779201,Binding affinity to CDK19/Cyclin C (unknown or...,B,,,BAO_0000190,BAO_0000223,...,Homo sapiens,Cyclin-C/Cyclin-dependent kinase 19,9606,,,IC50,nM,UO_0000065,,4.0
2,,16573757,[],CHEMBL3803437,Binding affinity to human CDK19 (1 to 502 amin...,B,,,BAO_0000190,BAO_0000223,...,Homo sapiens,Cyclin-C/Cyclin-dependent kinase 19,9606,,,IC50,nM,UO_0000065,,2.5


Next, we save the raw bioactivity data to the CSV file **Data/01_bioactivity_data_raw.csv**.

In [44]:
df.to_csv('Data/01_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [45]:
# Remove rows with missing values in the standard_value column
df2 = df[df.standard_value.notna()]
# Remove rows with missing values in the canonical_smiles column.
# canonical_smiles is a unique representation of a molecule in the form of a string.
# Example: Cc1ccccc1C(=O)Nc1ccc(O)cc1
# We will use this column to calculate molecular descriptors,
# which are numerical values that describe the chemical properties of a molecule.
# The mask is built from df2 (not df) so that it aligns with the frame being filtered.
df2 = df2[df2.canonical_smiles.notna()]
df2.head(3)


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,16508071,[],CHEMBL3779201,Binding affinity to CDK19/Cyclin C (unknown or...,B,,,BAO_0000190,BAO_0000223,...,Homo sapiens,Cyclin-C/Cyclin-dependent kinase 19,9606,,,IC50,nM,UO_0000065,,26.0
1,,16509067,[],CHEMBL3779201,Binding affinity to CDK19/Cyclin C (unknown or...,B,,,BAO_0000190,BAO_0000223,...,Homo sapiens,Cyclin-C/Cyclin-dependent kinase 19,9606,,,IC50,nM,UO_0000065,,4.0
2,,16573757,[],CHEMBL3803437,Binding affinity to human CDK19 (1 to 502 amin...,B,,,BAO_0000190,BAO_0000223,...,Homo sapiens,Cyclin-C/Cyclin-dependent kinase 19,9606,,,IC50,nM,UO_0000065,,2.5
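
The same filtering can also be sketched in a single step with `DataFrame.dropna` and its `subset` parameter. The toy frame below is made up for illustration: the SMILES strings are taken from this notebook, but the IDs `A` to `D` and the pattern of missing values are placeholders, not ChEMBL records.

```python
import pandas as pd

# Toy data: rows B and C each lack one of the two required fields.
toy = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "canonical_smiles": [
        "CNC(=O)c1ccc2[nH]nc(Cc3ccc4c(cnn4C)c3)c2c1",
        None,
        "Cc1ccccc1C(=O)Nc1ccc(O)cc1",
        "CCOC(=O)/C=C/c1ccncc1-c1ccc2cc[nH]c2c1",
    ],
    "standard_value": ["26.0", "4.0", None, "45.74"],
})

# Drop every row that is missing a value in either of the two columns.
clean = toy.dropna(subset=["standard_value", "canonical_smiles"])
print(clean)
```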


In [46]:
len(df2.canonical_smiles.unique())

121

In [47]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr.shape

(121, 45)
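
Note that `drop_duplicates` keeps the first row for each SMILES string by default (`keep='first'`), so if a compound was measured more than once, the later measurements are discarded. The original index labels are also kept, so the result can have gaps in its index. A small sketch on made-up values:

```python
import pandas as pd

# Made-up example: the first two rows describe the same molecule.
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", "c1ccccc1"],
    "standard_value": [10.0, 30.0, 5.0],
})

# Only the first occurrence of each SMILES string survives.
first_only = toy.drop_duplicates(["canonical_smiles"])
print(first_only)
```

Whether keeping the first measurement is appropriate depends on the dataset. Aggregating repeated measurements is another option, but it is not what this notebook does.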

## **Data pre-processing of the bioactivity data**

**Select the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame; the bioactivity class is added in the next step**

In [48]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL3775576,CNC(=O)c1ccc2[nH]nc(Cc3ccc4c(cnn4C)c3)c2c1,26.0
1,CHEMBL3775317,Cn1cc(-c2ccc(Cc3n[nH]c4ccc(C(=O)N5CC[C@@H](O)C...,4.0
2,CHEMBL3798663,O=C1NCCC12CCN(c1c(Cl)cncc1-c1ccc(-c3cnn(CCO)c3...,2.5
3,CHEMBL3798944,CC(C)(O)Cn1cc(-c2ccc(-c3cncc(Cl)c3N3CCC4(CCNC4...,1.4
5,CHEMBL3798318,O=C1NCCC12CCN(c1c(Cl)cncc1-c1ccc(-c3cnn(CCN4CC...,20.6
...,...,...,...
146,CHEMBL4849842,Cn1ncc2cc(-c3cnc4[nH]ccc4c3N3CCC4(CCCNC4=O)CC3...,3.0
147,CHEMBL4878356,Cn1ncc2cc(-c3cnc4[nH]ccc4c3N3CCC4(CC3)CNC4=O)c...,6.0
148,CHEMBL4862777,Cn1ncc2cc(-c3cnc4[nH]ccc4c3N3CCC4(CC3)CNC(=O)O...,6.0
149,CHEMBL4853002,CCOC(=O)/C=C/c1ccncc1-c1ccc2cc[nH]c2c1,45.74


Save the DataFrame to a CSV file.

In [49]:
df3.to_csv('Data/02_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less are considered **active**, those with values of 10,000 nM or more are considered **inactive**, and those in between are labeled **intermediate**.

In [50]:
df4 = pd.read_csv('Data/02_bioactivity_data_preprocessed.csv')

In [51]:
# Add a new column, 'class', that labels the bioactivity of each molecule
# based on the standard_value column
def bioactivity_level(value):
  if value >= 10000:
    return "inactive"
  elif value <= 1000:
    return "active"
  else:
    return "intermediate"

df4['class'] = df4['standard_value'].apply(lambda x: bioactivity_level(float(x)))
df4

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL3775576,CNC(=O)c1ccc2[nH]nc(Cc3ccc4c(cnn4C)c3)c2c1,26.00,active
1,CHEMBL3775317,Cn1cc(-c2ccc(Cc3n[nH]c4ccc(C(=O)N5CC[C@@H](O)C...,4.00,active
2,CHEMBL3798663,O=C1NCCC12CCN(c1c(Cl)cncc1-c1ccc(-c3cnn(CCO)c3...,2.50,active
3,CHEMBL3798944,CC(C)(O)Cn1cc(-c2ccc(-c3cncc(Cl)c3N3CCC4(CCNC4...,1.40,active
4,CHEMBL3798318,O=C1NCCC12CCN(c1c(Cl)cncc1-c1ccc(-c3cnn(CCN4CC...,20.60,active
...,...,...,...,...
116,CHEMBL4849842,Cn1ncc2cc(-c3cnc4[nH]ccc4c3N3CCC4(CCCNC4=O)CC3...,3.00,active
117,CHEMBL4878356,Cn1ncc2cc(-c3cnc4[nH]ccc4c3N3CCC4(CC3)CNC4=O)c...,6.00,active
118,CHEMBL4862777,Cn1ncc2cc(-c3cnc4[nH]ccc4c3N3CCC4(CC3)CNC(=O)O...,6.00,active
119,CHEMBL4853002,CCOC(=O)/C=C/c1ccncc1-c1ccc2cc[nH]c2c1,45.74,active
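
To make the boundary behaviour of the thresholds explicit, here is the same function applied to a few values. The value 26.0 comes from the table above; the other three are made-up boundary cases.

```python
def bioactivity_level(value: float) -> str:
    """Classify an IC50 value (in nM) using the thresholds from this notebook."""
    if value >= 10000:
        return "inactive"
    elif value <= 1000:
        return "active"
    else:
        return "intermediate"


examples = [26.0, 1000.0, 5000.0, 10000.0]
labels = {value: bioactivity_level(value) for value in examples}
print(labels)
```

Both thresholds are inclusive: exactly 1,000 nM counts as active and exactly 10,000 nM counts as inactive.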


Save the DataFrame to a CSV file.

In [52]:
df4.to_csv('Data/03_bioactivity_data_curated.csv', index=False)

In [53]:
! rm -f 'Data/bioactivity.zip'
! zip 'Data/bioactivity.zip' Data/*.csv

updating: Data/01_bioactivity_data_raw.csv (deflated 91%)
updating: Data/02_bioactivity_data_preprocessed.csv (deflated 76%)
updating: Data/03_bioactivity_data_curated.csv (deflated 77%)
  adding: Data/04_bioactivity_data_3class_pIC50.csv (deflated 76%)
  adding: Data/05_bioactivity_data_2class_pIC50.csv (deflated 76%)
  adding: Data/06_bioactivity_data_3class_pIC50_pubchem_fp.csv (deflated 97%)
  adding: Data/descriptors_output.csv (deflated 96%)
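
The shell commands above assume that `rm` and `zip` are available on the system. As a portable alternative, the archive could be built with Python's standard `zipfile` module. This is a sketch, not what the notebook ran: the helper name `zip_csv_files` is hypothetical, and unlike the shell version it stores the files without the `Data/` prefix.

```python
import glob
import os
import tempfile
import zipfile


def zip_csv_files(data_dir: str, archive_name: str = "bioactivity.zip") -> str:
    """Write every CSV file in data_dir to a fresh zip archive and return its path."""
    archive_path = os.path.join(data_dir, archive_name)
    # Mode "w" overwrites an existing archive, so no separate delete step is needed.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for csv_path in sorted(glob.glob(os.path.join(data_dir, "*.csv"))):
            # Store only the file name, not the directory part of the path.
            zf.write(csv_path, arcname=os.path.basename(csv_path))
    return archive_path


# Demo in a temporary directory so that no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "example.csv"), "w") as fh:
        fh.write("molecule_chembl_id,standard_value\nCHEMBL3775576,26.0\n")
    path = zip_csv_files(tmp)
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()

print(names)
```

In the notebook, the call would be `zip_csv_files('Data')`.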


In [54]:
! ls -l

total 1368
-rw-rw-r-- 1 adr adr  93988 feb  7 00:43 CDD_ML_Part_01.ipynb
-rw-rw-r-- 1 adr adr 393634 feb  6 00:42 CDD_ML_Part_02.ipynb
-rw-rw-r-- 1 adr adr 283385 feb  6 00:55 CDD_ML_Part_03.ipynb
-rw-rw-r-- 1 adr adr  92415 feb  6 00:58 CDD_ML_Part_04.ipynb
-rw-rw-r-- 1 adr adr 517287 feb  6 01:09 CDD_ML_Part_05.ipynb
drwxrwxr-x 2 adr adr   4096 feb  7 00:43 Data
drwxrwxr-x 2 adr adr   4096 feb  6 00:55 padel
