<a href="https://colab.research.google.com/github/deepsharma26/SIRT1_Main/blob/main/Copy_of_CDD_ML_Part_1_Bioactivity_Data_Concised_SIRT1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bioinformatics Project - Computational Drug Discovery [Part 1] Download Bioactivity Data (Concise version)**

Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)

In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. In particular, we will build a machine learning model using the ChEMBL bioactivity data.

In **Part 1**, we will be performing Data Collection and Pre-Processing from the ChEMBL Database.

Notes for this concise version:
* Redundant code cells were deleted.
* Code cells for saving files to Google Drive have been deleted.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [2]:
! pip install chembl_webresource_client



## **Importing libraries**

In [3]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for Sirtuin1 (SIRT1)**

In [4]:
# Target search for SIRT1
target = new_client.target
target_query = target.search('SIRT1')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,NAD-dependent deacetylase sirtuin 1,12.0,False,CHEMBL4506,"[{'accession': 'Q96EB6', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Cereblon/NAD-dependent protein deacetylase sir...,11.0,False,CHEMBL4296132,"[{'accession': 'Q96EB6', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606


### **Select and retrieve bioactivity data for *Human Sirtuin1 (SIRT1)* (zeroth entry)**

We will assign the zeroth entry (which corresponds to the target protein, *Human SIRT1*) to the ***selected_target*** variable.

In [5]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL4506'
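
Selecting by position relies on the search ranking placing the desired target first. As a hedged alternative, the sketch below picks the target by its `target_type` and `organism` instead. The small DataFrame is a stand-in that reproduces only a few columns of the two rows shown above, so it runs without calling the web service.

```python
import pandas as pd

# Stand-in for the search result above (only the columns needed here).
targets = pd.DataFrame(
    {
        "organism": ["Homo sapiens", "Homo sapiens"],
        "pref_name": [
            "NAD-dependent deacetylase sirtuin 1",
            "Cereblon/NAD-dependent protein deacetylase sir...",
        ],
        "target_chembl_id": ["CHEMBL4506", "CHEMBL4296132"],
        "target_type": ["SINGLE PROTEIN", "PROTEIN-PROTEIN INTERACTION"],
    }
)

# Keep human single-protein targets only, then take the first match.
mask = (targets["organism"] == "Homo sapiens") & (
    targets["target_type"] == "SINGLE PROTEIN"
)
matches = targets.loc[mask, "target_chembl_id"]
selected_target = matches.iloc[0]
print(selected_target)
```

If more than one row matches, the first is still taken, so it is worth checking `len(matches)` before relying on the result.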

Here, we will retrieve only bioactivity data for *Human SIRT1* (CHEMBL4506) that are reported as IC50 values.

In [6]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [7]:
df = pd.DataFrame.from_dict(res)

In [8]:
df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,1653881,[],CHEMBL859689,Inhibitory activity against recombinant human ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,0.098
1,,,1653885,[],CHEMBL859693,Inhibitory activity against human SIRT1 by rad...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,1.29
2,,,1653895,[],CHEMBL859689,Inhibitory activity against recombinant human ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,0.205
3,,,1653899,[],CHEMBL859693,Inhibitory activity against human SIRT1 by rad...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,2.5
4,,,1653908,[],CHEMBL859689,Inhibitory activity against recombinant human ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,1.47
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1257,"{'action_type': 'INHIBITOR', 'description': 'N...",,25656033,[],CHEMBL5372270,Inhibition of recombinant SIRT1 (unknown origi...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,95.86
1258,,,25665840,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5374950,Inhibition of recombinant human Sirt1 (134 to ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,100.0
1259,,Not Determined,25665841,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5374950,Inhibition of recombinant human Sirt1 (134 to ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,,,,
1260,,Not Determined,25665842,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5374950,Inhibition of recombinant human Sirt1 (134 to ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,,,,


Finally, we will save the resulting bioactivity data to the CSV file **SIRT1_01_bioactivity_data_raw.csv**.

In [9]:
df.to_csv('SIRT1_01_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [10]:
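# Keep only rows that have both a standard_value and a canonical_smiles.
# The second mask is built from df2 (not df) so that it aligns with the frame being filtered.
df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]
df2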


Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,1653881,[],CHEMBL859689,Inhibitory activity against recombinant human ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,0.098
1,,,1653885,[],CHEMBL859693,Inhibitory activity against human SIRT1 by rad...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,1.29
2,,,1653895,[],CHEMBL859689,Inhibitory activity against recombinant human ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,0.205
3,,,1653899,[],CHEMBL859693,Inhibitory activity against human SIRT1 by rad...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,2.5
4,,,1653908,[],CHEMBL859689,Inhibitory activity against recombinant human ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,1.47
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1254,"{'action_type': 'INHIBITOR', 'description': 'N...",,25625619,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5364135,Inhibition of Sirtuin-1 (unknown origin) incub...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,2.5
1255,"{'action_type': 'INHIBITOR', 'description': 'N...",,25626692,[],CHEMBL5364481,Inhibition of recombinant SIRT1 (unknown origi...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,112.6
1256,"{'action_type': 'INHIBITOR', 'description': 'N...",,25656032,[],CHEMBL5372270,Inhibition of recombinant SIRT1 (unknown origi...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,21.9
1257,"{'action_type': 'INHIBITOR', 'description': 'N...",,25656033,[],CHEMBL5372270,Inhibition of recombinant SIRT1 (unknown origi...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,95.86
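
The two-step filter above can also be written as a single `dropna` call. The following is a minimal sketch on a toy DataFrame; the identifiers and values are illustrative placeholders, not ChEMBL records.

```python
import pandas as pd

# Toy frame; the values are illustrative, not taken from ChEMBL.
toy = pd.DataFrame(
    {
        "molecule_chembl_id": ["A", "B", "C", "D"],
        "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
        "standard_value": ["98.0", "205.0", None, "1470.0"],
    }
)

# Drop any row where either of the two columns is missing.
clean = toy.dropna(subset=["standard_value", "canonical_smiles"])
print(clean)
```

As with the boolean-mask version, the original row index is preserved, so gaps in the index mark the rows that were dropped.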


In [11]:
len(df2.canonical_smiles.unique())

933

In [12]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,1653881,[],CHEMBL859689,Inhibitory activity against recombinant human ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,0.098
2,,,1653895,[],CHEMBL859689,Inhibitory activity against recombinant human ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,0.205
4,,,1653908,[],CHEMBL859689,Inhibitory activity against recombinant human ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,1.47
6,,,1653921,[],CHEMBL859689,Inhibitory activity against recombinant human ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,0.124
10,,,1653937,[],CHEMBL859689,Inhibitory activity against recombinant human ...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1253,"{'action_type': 'INHIBITOR', 'description': 'N...",,25625618,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5364135,Inhibition of Sirtuin-1 (unknown origin) incub...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,1.2
1254,"{'action_type': 'INHIBITOR', 'description': 'N...",,25625619,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5364135,Inhibition of Sirtuin-1 (unknown origin) incub...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,2.5
1255,"{'action_type': 'INHIBITOR', 'description': 'N...",,25626692,[],CHEMBL5364481,Inhibition of recombinant SIRT1 (unknown origi...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,112.6
1256,"{'action_type': 'INHIBITOR', 'description': 'N...",,25656032,[],CHEMBL5372270,Inhibition of recombinant SIRT1 (unknown origi...,B,,,BAO_0000190,...,Homo sapiens,NAD-dependent deacetylase sirtuin 1,9606,,,IC50,uM,UO_0000065,,21.9
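
Note that `drop_duplicates` keeps the first row it encounters for each SMILES string and discards the rest, so any other measurements of the same compound are lost. The sketch below contrasts this with one possible alternative, taking the median per compound. The median is an assumption made for illustration, not the approach used in this notebook, and the toy values are made up.

```python
import pandas as pd

# Toy data: compound M1 was measured twice (illustrative values only).
toy = pd.DataFrame(
    {
        "molecule_chembl_id": ["M1", "M1", "M2"],
        "canonical_smiles": ["CCO", "CCO", "CCN"],
        "standard_value": [100.0, 300.0, 50.0],
    }
)

# Behaviour used in this notebook: keep the first occurrence only.
first_only = toy.drop_duplicates(["canonical_smiles"])

# Alternative (assumption): summarise replicate measurements by their median.
median_per_compound = toy.groupby(
    ["molecule_chembl_id", "canonical_smiles"], as_index=False
)["standard_value"].median()

print(first_only)
print(median_per_compound)
```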


## **Data pre-processing of the bioactivity data**

### **Select the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame**

In [13]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL420311,NC(=O)C1CCCc2c1[nH]c1ccc(Cl)cc21,98.0
2,CHEMBL115600,Cc1ccc2[nH]c3c(c2c1)CCCC3C(N)=O,205.0
4,CHEMBL112265,NC(=O)C1CCCc2c1[nH]c1ccccc21,1470.0
6,CHEMBL446446,NC(=O)C1CCCCc2c1[nH]c1ccc(Cl)cc21,124.0
10,CHEMBL171137,CCOC(=O)C1CCCc2c1[nH]c1ccc(Cl)cc21,100000.0
...,...,...,...
1253,CHEMBL4793948,O=C(Nc1ccccc1)Nc1ccc(NC(=O)c2ccccc2)cc1,1200.0
1254,CHEMBL4749004,O=C(NCCc1c[nH]c2ccccc12)c1cc2ccccc2cc1O,2500.0
1255,CHEMBL5416344,O=C(O)CCNC(=S)NCCCNc1nc(Nc2ccccc2Cl)ncc1C(=O)N...,112600.0
1256,CHEMBL5424599,O=C(O)CCNC(=S)NCCCNc1nc(Nc2ccccc2)nc(Nc2ccccc2...,21900.0
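
Comparing the tables shows that `standard_value` is 1000 times the reported `value` in uM (for example, 0.098 uM becomes 98.0), which is consistent with `standard_value` being expressed in nM. The sketch below checks this for the first three rows shown above. It assumes that the numeric fields arrive as strings and therefore converts them with `pd.to_numeric`. If they are already numeric, the conversion is harmless.

```python
import numpy as np
import pandas as pd

# First three rows shown above, with numeric fields written as strings (assumption).
rows = pd.DataFrame(
    {
        "value": ["0.098", "0.205", "1.47"],  # as reported, in uM
        "units": ["uM", "uM", "uM"],
        "standard_value": ["98.0", "205.0", "1470.0"],
    }
)

reported_um = pd.to_numeric(rows["value"])
standard = pd.to_numeric(rows["standard_value"])

# 1 uM = 1000 nM; use a tolerance because of floating-point rounding.
consistent = np.isclose(reported_um * 1000, standard)
print(consistent.all())
```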


Save the dataframe to a CSV file.

In [14]:
df3.to_csv('SIRT1_02_bioactivity_data_preprocessed.csv', index=False)

Zip the CSV files into a single archive.

In [18]:
! zip SIRT1.zip *.csv

  adding: SIRT1_01_bioactivity_data_raw.csv (deflated 92%)
  adding: SIRT1_02_bioactivity_data_preprocessed.csv (deflated 82%)
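
Outside Colab, the `zip` shell command may not be available. A rough equivalent using Python's standard `zipfile` module is sketched below. The file names and contents are hypothetical placeholders written to a temporary directory so that the example is self-contained.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

workdir = Path(tempfile.mkdtemp())

# Two small placeholder CSV files (hypothetical names and contents).
pd.DataFrame({"a": [1, 2]}).to_csv(workdir / "example_01_raw.csv", index=False)
pd.DataFrame({"a": [1]}).to_csv(workdir / "example_02_preprocessed.csv", index=False)

# Add every CSV in the directory to a compressed archive.
archive = workdir / "example.zip"
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for csv_path in sorted(workdir.glob("*.csv")):
        zf.write(csv_path, arcname=csv_path.name)

# List the archive contents to confirm what was stored.
with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
print(names)
```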


In [19]:
! ls -l

total 948
drwxr-xr-x 1 root root   4096 Feb  6 14:19 sample_data
-rw-r--r-- 1 root root 809063 Feb  8 11:23 SIRT1_01_bioactivity_data_raw.csv
-rw-r--r-- 1 root root  73390 Feb  8 11:24 SIRT1_02_bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root  80796 Feb  8 11:25 SIRT1.zip


---