# **Bioinformatics Project - Computational Drug Discovery [Part 1]**

In **Part 1**, we collect bioactivity data from the ChEMBL Database and pre-process it.

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.
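
The install cell itself is not shown in this export. In a notebook it is typically a single `pip install chembl_webresource_client` command. As a minimal sketch, assuming the import name matches the package name, the following check reports whether the client is already available; `is_installed` is a hypothetical helper, not part of any library.

```python
import importlib.util

# Name used both for installation and for import (assumed to be identical).
PACKAGE = "chembl_webresource_client"


def is_installed(module_name: str) -> bool:
    """Hypothetical helper: True if the top-level module can be located."""
    return importlib.util.find_spec(module_name) is not None


if is_installed(PACKAGE):
    print(f"{PACKAGE} is already installed.")
else:
    # In a Jupyter cell the usual command is: ! pip install chembl_webresource_client
    print(f"{PACKAGE} is missing; install it with: pip install {PACKAGE}")
```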

## **Importing libraries**

In [22]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client
# chembl_webresource_client is a Python client for accessing ChEMBL data via its RESTful web services.
# We use it to retrieve bioactivity data for our target of interest.
# ChEMBL is a manually curated database of bioactive molecules with drug-like properties.
# It brings together chemical, bioactivity and genomic data to aid the translation of genomic
# information into effective new drugs.

## **Search for Target protein**

In [23]:
# Target search for Carbonic anhydrase VA
# initialize the target search of chembl
target = new_client.target
target_query = target.search('Carbonic anhydrase VA') # list of dictionaries is returned
targets = pd.DataFrame.from_dict(target_query) # convert to dataframe
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P35218', 'xref_name': None, 'xre...",Homo sapiens,Carbonic anhydrase VA,47.0,False,CHEMBL4789,"[{'accession': 'P35218', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Carbonic anhydrase V,47.0,False,CHEMBL2111457,"[{'accession': 'Q9Y2D0', 'component_descriptio...",PROTEIN FAMILY,9606
2,[],Homo sapiens,Carbonic anhydrase,38.0,False,CHEMBL2095180,"[{'accession': 'P00918', 'component_descriptio...",PROTEIN FAMILY,9606
3,"[{'xref_id': 'P23280', 'xref_name': None, 'xre...",Homo sapiens,Carbonic anhydrase VI,30.0,False,CHEMBL3025,"[{'accession': 'P23280', 'component_descriptio...",SINGLE PROTEIN,9606
4,[],Bos taurus,Carbonic anhydrase,30.0,False,CHEMBL2096971,"[{'accession': 'P00921', 'component_descriptio...",PROTEIN FAMILY,9913
5,"[{'xref_id': 'P00918', 'xref_name': None, 'xre...",Homo sapiens,Carbonic anhydrase II,28.0,False,CHEMBL205,"[{'accession': 'P00918', 'component_descriptio...",SINGLE PROTEIN,9606
6,"[{'xref_id': 'P00915', 'xref_name': None, 'xre...",Homo sapiens,Carbonic anhydrase I,28.0,False,CHEMBL261,"[{'accession': 'P00915', 'component_descriptio...",SINGLE PROTEIN,9606
7,"[{'xref_id': 'P07451', 'xref_name': None, 'xre...",Homo sapiens,Carbonic anhydrase III,27.0,False,CHEMBL2885,"[{'accession': 'P07451', 'component_descriptio...",SINGLE PROTEIN,9606
8,"[{'xref_id': 'P27139', 'xref_name': None, 'xre...",Rattus norvegicus,Carbonic anhydrase II,27.0,False,CHEMBL4706,"[{'accession': 'P27139', 'component_descriptio...",SINGLE PROTEIN,10116
9,"[{'xref_id': 'Q9ERQ8', 'xref_name': None, 'xre...",Mus musculus,Carbonic anhydrase VII,27.0,False,CHEMBL2216,"[{'accession': 'Q9ERQ8', 'component_descriptio...",SINGLE PROTEIN,10090
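
Because the search returns several closely related targets, picking one by row position is fragile: the order can change between ChEMBL releases. A minimal sketch, assuming the column names shown in the table above, selects the target by name, organism, and type instead. `targets_demo` is a small stand-in copied from four rows of that table, and `pick_target` is a hypothetical helper.

```python
import pandas as pd

# Illustrative subset of the search result shown above (values copied from that table).
targets_demo = pd.DataFrame(
    {
        "pref_name": ["Carbonic anhydrase VA", "Carbonic anhydrase V",
                      "Carbonic anhydrase II", "Carbonic anhydrase I"],
        "organism": ["Homo sapiens"] * 4,
        "target_type": ["SINGLE PROTEIN", "PROTEIN FAMILY",
                        "SINGLE PROTEIN", "SINGLE PROTEIN"],
        "target_chembl_id": ["CHEMBL4789", "CHEMBL2111457", "CHEMBL205", "CHEMBL261"],
    }
)


def pick_target(df, name, organism="Homo sapiens", target_type="SINGLE PROTEIN"):
    """Hypothetical helper: return the ChEMBL ID of the single exact match."""
    mask = (
        (df["pref_name"] == name)
        & (df["organism"] == organism)
        & (df["target_type"] == target_type)
    )
    matches = df.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        raise ValueError(f"Expected exactly one match for {name!r}, found {len(matches)}")
    return matches.iloc[0]


print(pick_target(targets_demo, "Carbonic anhydrase I"))   # CHEMBL261
print(pick_target(targets_demo, "Carbonic anhydrase VA"))  # CHEMBL4789
```

Raising an error when there is not exactly one match makes an ambiguous or empty search visible immediately rather than silently selecting the wrong protein.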


In [24]:
# Select the target of interest by its row index in the search results.
# Note: index 6 is Carbonic anhydrase I (CHEMBL261), not Carbonic anhydrase VA (CHEMBL4789, index 0).
selected_target = targets.target_chembl_id[6]
print(selected_target)
# Initialize the bioactivity (activity) resource
activity = new_client.activity
# Retrieve the bioactivity data for the selected target, keeping only IC50 measurements.
# IC50 is the concentration of a compound required to inhibit a biological activity
# (here, the target enzyme) by 50%.
# It is a measure of potency: the lower the IC50 value, the more potent the compound.
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")
df = pd.DataFrame.from_dict(res)
df.shape

CHEMBL261


(350, 45)

In [25]:
# Preview the first three rows (the display elides the middle columns with "...")
df.head(3)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,651622,[],CHEMBL662716,Inhibitory activity against human carbonic anh...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Carbonic anhydrase I,9606,,,IC50,nM,UO_0000065,,250.0
1,,651633,[],CHEMBL662716,Inhibitory activity against human carbonic anh...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Carbonic anhydrase I,9606,,,IC50,nM,UO_0000065,,50.0
2,,651637,[],CHEMBL662716,Inhibitory activity against human carbonic anh...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Carbonic anhydrase I,9606,,,IC50,nM,UO_0000065,,50000.0


Next, we save the raw bioactivity data to the CSV file **Data/01_bioactivity_data_raw.csv**.

In [26]:
df.to_csv('Data/01_bioactivity_data_raw.csv', index=False)
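
Writing the CSV fails if the `Data` directory does not exist yet. A minimal sketch, assuming the notebook is run from the project root, creates the directory before saving; `ensure_dir` is a hypothetical helper.

```python
from pathlib import Path


def ensure_dir(path):
    """Hypothetical helper: create the directory (and parents) if missing."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


data_dir = ensure_dir("Data")
raw_csv = data_dir / "01_bioactivity_data_raw.csv"
print(raw_csv)
```

The resulting `raw_csv` path can then be passed directly to `DataFrame.to_csv`.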

## **Handling missing data**
Drop any compound that is missing a value in the **standard_value** or **canonical_smiles** column.

In [27]:
# Remove rows with missing values in the standard_value column
df2 = df[df.standard_value.notna()]
# Remove rows with missing values in the canonical_smiles column.
# canonical_smiles is a standardized text (SMILES string) representation of a molecule,
# for example: Cc1ccccc1C(=O)Nc1ccc(O)cc1
# This column will later be used to calculate molecular descriptors,
# which are numerical values that describe the chemical properties of a molecule.
# The mask is built from df2 (not df) so that it aligns with the frame being filtered.
df2 = df2[df2.canonical_smiles.notna()]
df2.head(3)


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,651622,[],CHEMBL662716,Inhibitory activity against human carbonic anh...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Carbonic anhydrase I,9606,,,IC50,nM,UO_0000065,,250.0
1,,651633,[],CHEMBL662716,Inhibitory activity against human carbonic anh...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Carbonic anhydrase I,9606,,,IC50,nM,UO_0000065,,50.0
2,,651637,[],CHEMBL662716,Inhibitory activity against human carbonic anh...,B,,,BAO_0000190,BAO_0000357,...,Homo sapiens,Carbonic anhydrase I,9606,,,IC50,nM,UO_0000065,,50000.0
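
The two filtering steps can also be written as a single `dropna` call. A minimal sketch on a toy frame (the identifiers and values are made up for illustration only):

```python
import pandas as pd

# Toy records; the values are made up for illustration only.
demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["A", "B", "C", "D"],
        "canonical_smiles": ["CCO", None, "c1ccccc1", "CC(=O)O"],
        "standard_value": ["250.0", "50.0", None, "1800.0"],
    }
)

# Drop any row that lacks either field in a single call.
clean = demo.dropna(subset=["standard_value", "canonical_smiles"])

print(clean["molecule_chembl_id"].tolist())  # ['A', 'D']
```

Listing both columns in `subset` keeps rows that have missing values in other, unrelated columns.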


In [28]:
len(df2.canonical_smiles.unique())

282

In [29]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr.shape

(282, 45)
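
Note that `drop_duplicates` keeps the first row for each repeated SMILES string and discards the rest, so a compound measured several times retains only its first IC50 value. A minimal sketch on made-up numbers shows the behavior:

```python
import pandas as pd

# Toy data: the same SMILES measured twice with different IC50 values (made-up numbers).
demo = pd.DataFrame(
    {
        "canonical_smiles": ["CCO", "CCO", "c1ccccc1"],
        "standard_value": [100.0, 900.0, 5000.0],
    }
)

# keep="first" is the default: later rows with a repeated SMILES are discarded.
unique_first = demo.drop_duplicates(subset=["canonical_smiles"])

print(unique_first["standard_value"].tolist())  # [100.0, 5000.0]
print(unique_first.index.tolist())              # [0, 2]  (original row labels are kept)
```

Because the original row labels are kept, the index of the de-duplicated frame has gaps; calling `reset_index(drop=True)` gives consecutive labels if they are needed.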

## **Data pre-processing of the bioactivity data**

**Select the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a new DataFrame; the bioactivity class is added in a later step**

In [30]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL20,CC(=O)Nc1nnc(S(N)(=O)=O)s1,250.0
1,CHEMBL19,CC(=O)/N=c1/sc(S(N)(=O)=O)nn1C,50.0
2,CHEMBL118,Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2...,50000.0
3,CHEMBL26915,COc1ccc(-n2nc(C(F)(F)F)cc2-c2ccc(Cl)cc2)cc1,100000.0
4,CHEMBL139,O=C(O)Cc1ccccc1Nc1c(Cl)cccc1Cl,100000.0
...,...,...,...
345,CHEMBL4863113,COC(=O)[C@H]1O[C@@H](NC(=O)c2ccc(S(N)(=O)=O)cc...,1800.0
346,CHEMBL4865818,NS(=O)(=O)c1ccc(C(=O)N[C@@H]2O[C@H](CO)[C@@H](...,3600.0
347,CHEMBL4870385,NS(=O)(=O)OC[C@H]1C[C@@H](Nc2ncncc2C(=O)c2ccn(...,50.0
348,CHEMBL4856793,Cc1sc(C(=O)c2cncnc2N[C@H]2C[C@H](O)[C@@H](COS(...,50.0


Save the DataFrame to a CSV file.

In [31]:
df3.to_csv('Data/02_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity values are IC50 measurements in nM. Compounds with an IC50 of 1,000 nM or less are considered **active**, those with an IC50 of 10,000 nM or more are considered **inactive**, and those strictly between 1,000 and 10,000 nM are labeled **intermediate**.

In [32]:
df4 = pd.read_csv('Data/02_bioactivity_data_preprocessed.csv')

In [33]:
# Add a new column, 'class', that labels the bioactivity of each molecule
# based on its standard_value (IC50 in nM)
def bioactivity_level(value):
  if value >= 10000:
    return "inactive"
  elif value <= 1000:
    return "active"
  else:
    return "intermediate"

df4['class'] = df4['standard_value'].apply(lambda x: bioactivity_level(float(x)))
df4

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL20,CC(=O)Nc1nnc(S(N)(=O)=O)s1,250.0,active
1,CHEMBL19,CC(=O)/N=c1/sc(S(N)(=O)=O)nn1C,50.0,active
2,CHEMBL118,Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2...,50000.0,inactive
3,CHEMBL26915,COc1ccc(-n2nc(C(F)(F)F)cc2-c2ccc(Cl)cc2)cc1,100000.0,inactive
4,CHEMBL139,O=C(O)Cc1ccccc1Nc1c(Cl)cccc1Cl,100000.0,inactive
...,...,...,...,...
277,CHEMBL4863113,COC(=O)[C@H]1O[C@@H](NC(=O)c2ccc(S(N)(=O)=O)cc...,1800.0,intermediate
278,CHEMBL4865818,NS(=O)(=O)c1ccc(C(=O)N[C@@H]2O[C@H](CO)[C@@H](...,3600.0,intermediate
279,CHEMBL4870385,NS(=O)(=O)OC[C@H]1C[C@@H](Nc2ncncc2C(=O)c2ccn(...,50.0,active
280,CHEMBL4856793,Cc1sc(C(=O)c2cncnc2N[C@H]2C[C@H](O)[C@@H](COS(...,50.0,active
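
The same labeling can be done without a row-by-row `apply`. The sketch below is an alternative, not the notebook's method: it uses `numpy.select` with the same thresholds on made-up IC50 values that include both boundaries.

```python
import numpy as np
import pandas as pd

# Made-up IC50 values in nM, chosen to include both threshold boundaries.
demo = pd.DataFrame({"standard_value": [50.0, 1000.0, 1800.0, 10000.0, 100000.0]})

values = demo["standard_value"].astype(float)
conditions = [values >= 10000, values <= 1000]
labels = ["inactive", "active"]

# Rows matching neither condition fall through to the default label.
demo["class"] = np.select(conditions, labels, default="intermediate")

print(demo["class"].tolist())
# ['active', 'active', 'intermediate', 'inactive', 'inactive']
```

Including the boundary values 1,000 and 10,000 in the toy data makes it easy to confirm that they are assigned to **active** and **inactive**, respectively.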


Save the DataFrame to a CSV file.

In [34]:
df4.to_csv('Data/03_bioactivity_data_curated.csv', index=False)

In [35]:
! rm -f 'Data/bioactivity.zip'
! zip 'Data/bioactivity.zip' Data/*.csv

  adding: Data/01_bioactivity_data_raw.csv (deflated 90%)
  adding: Data/02_bioactivity_data_preprocessed.csv (deflated 78%)
  adding: Data/03_bioactivity_data_curated.csv (deflated 80%)
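
The shell commands above depend on `rm` and `zip` being available. As a portable alternative, the archive can be built with Python's standard `zipfile` module. This is a minimal sketch, and `zip_csv_files` is a hypothetical helper, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def zip_csv_files(data_dir, archive_name="bioactivity.zip"):
    """Hypothetical helper: archive every CSV in data_dir and return the archive path."""
    data_dir = Path(data_dir)
    archive_path = data_dir / archive_name
    # Mode "w" overwrites an existing archive, so no separate delete step is needed.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for csv_path in sorted(data_dir.glob("*.csv")):
            # Store entries as "<directory name>/<file name>", e.g. "Data/x.csv".
            zf.write(csv_path, arcname=f"{data_dir.name}/{csv_path.name}")
    return archive_path
```

Calling `zip_csv_files("Data")` from the project root would write `Data/bioactivity.zip` containing the three CSV files.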


In [36]:
! ls -l

total 1044
-rw-rw-r-- 1 adr adr 101143 feb  7 01:09 CDD_ML_Part_01.ipynb
-rw-rw-r-- 1 adr adr 295105 feb  7 00:44 CDD_ML_Part_02.ipynb
-rw-rw-r-- 1 adr adr  95731 feb  7 00:45 CDD_ML_Part_03.ipynb
-rw-rw-r-- 1 adr adr  52379 feb  7 00:45 CDD_ML_Part_04.ipynb
-rw-rw-r-- 1 adr adr 511768 feb  7 00:47 CDD_ML_Part_05.ipynb
drwxrwxr-x 2 adr adr   4096 feb  7 01:09 Data
