<a href="https://colab.research.google.com/github/dhynasah/Drug-Discovery-project/blob/main/Part_1_Computational_protein_activity_analysis_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **ChEMBL Database**


This notebook retrieves and preprocesses ChEMBL bioactivity data for building a machine learning model.

ChEMBL is a curated database of bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and it spans 13,000 targets, 1,800 cells, and 33,000 indications [data as of January 25, 2022; ChEMBL version 29].

In [None]:
# Install the ChEMBL web service package so that we can retrieve
# bioactivity data from the ChEMBL database.
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.7-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting itsdangerous>=2.0.1
  Downloading itsdangerous-2.0.1-py3-none-any.whl (18 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-p

In [None]:
#import necessary libraries 
import pandas as pd
from chembl_webresource_client.new_client import new_client

In [None]:
#target search for estrogen 
#takes about 8 minutes to run
target = new_client.target
target_query = target.search('estrogen receptor')
targets = pd.DataFrame.from_dict(target_query)
targets

In [None]:
targets.head(20)

Select the target protein: the estrogen receptor in *Homo sapiens*.

In [None]:
selected_target = targets.target_chembl_id[5]
selected_target

'CHEMBL206'
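
Selecting the target by row position (index 5) depends on the order of the search results, which may differ between runs or ChEMBL releases. The following is a minimal sketch of a more explicit selection, assuming the search results carry `pref_name`, `organism`, and `target_type` columns. The toy table, its placeholder IDs, and the helper name `pick_target` are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the search results; the IDs are placeholders, not real ChEMBL IDs.
toy_targets = pd.DataFrame({
    "target_chembl_id": ["TOY_1", "TOY_2", "TOY_3"],
    "pref_name": ["Estrogen receptor", "Estrogen receptor", "Estrogen-related receptor"],
    "organism": ["Rattus norvegicus", "Homo sapiens", "Homo sapiens"],
    "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN", "SINGLE PROTEIN"],
})

def pick_target(targets, name, organism, target_type="SINGLE PROTEIN"):
    """Return the single target_chembl_id that matches all three criteria."""
    mask = (
        (targets["pref_name"] == name)
        & (targets["organism"] == organism)
        & (targets["target_type"] == target_type)
    )
    matches = targets.loc[mask, "target_chembl_id"]
    # Fail loudly instead of silently picking the wrong row.
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]

selected = pick_target(toy_targets, "Estrogen receptor", "Homo sapiens")
print(selected)
```

Raising an error when zero or several rows match makes a change in the search results visible immediately, rather than letting a different protein slip into the rest of the analysis.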

Select and retrieve the bioactivity data for the selected target that are reported as IC50 values.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")
df = pd.DataFrame.from_dict(res)
print(df.shape)  # a bare df.shape is not displayed because it is not the last expression in the cell
df.head()

In [None]:
# Save the bioactivity data to a CSV file (the same file name is used by the cp command below)
df.to_csv('bioactivity_estrogen_data.csv', index=False)

In [None]:
#copy files to google drive 
from google.colab import drive
drive.mount('/content/gdrive/', force_remount= True)

Mounted at /content/gdrive/


In [None]:
#create a data folder in colab notebooks
! mkdir "/content/gdrive/My Drive/Colab Notebooks/data"

mkdir: cannot create directory ‘/content/gdrive/My Drive/Colab Notebooks/data’: File exists
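
The `File exists` message appears because `mkdir` fails when the path is already present. The following is a minimal sketch of a repeatable alternative in Python, using `pathlib` and `shutil` from the standard library. The temporary directory stands in for the Drive folder and the file name `example.csv` stands in for the CSV; both are assumptions for illustration.

```python
import shutil
import tempfile
from pathlib import Path

# A temporary directory stands in for '/content/gdrive/My Drive/Colab Notebooks'.
base = Path(tempfile.mkdtemp())
data_dir = base / "data"

# exist_ok=True makes the call safe to repeat; parents=True creates missing parents.
data_dir.mkdir(parents=True, exist_ok=True)
data_dir.mkdir(parents=True, exist_ok=True)  # second call raises no error

# Write a small placeholder file and copy it into the data folder.
src = base / "example.csv"
src.write_text("molecule_chembl_id,standard_value\n")
copied = Path(shutil.copy(src, data_dir))
print(copied.name, copied.exists())
```

Note that `exist_ok=True` only tolerates an existing directory. If a regular file already occupies the path, `mkdir` still raises `FileExistsError`.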


In [None]:
! cp bioactivity_estrogen_data.csv '/content/gdrive/My Drive/Colab Notebooks/data'

In [None]:
! ls -l '/content/gdrive/My Drive/Colab Notebooks/data'

-rw------- 1 root root 2407582 Jan 30 14:13 '/content/gdrive/My Drive/Colab Notebooks/data'


In [None]:
! ls

bioactivity_estrogen_data.csv		    gdrive
bioactivity_estrogen_preprocessed_data.csv  sample_data


In [None]:
! head bioactivity_estrogen_data.csv

In [None]:
# Drop the rows whose standard_value is missing
df2 = df[df.standard_value.notna()]
df2.shape

(3551, 45)
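
`notna()` removes only the entries that are actually missing. The following is a minimal sketch of a stricter cleanup, assuming `standard_value` arrives as text (the later `float(i)` call suggests it does). `pd.to_numeric(..., errors="coerce")` turns entries that cannot be parsed into `NaN`, so they are dropped together with the missing ones. The toy frame and its placeholder IDs are illustrative.

```python
import pandas as pd

# Toy data: one missing value and one entry that is not a number.
toy = pd.DataFrame({
    "molecule_chembl_id": ["TOY_1", "TOY_2", "TOY_3", "TOY_4"],
    "standard_value": ["2.5", None, "n/a", "15000"],
})

cleaned = toy.copy()
# Convert text to floats; anything unparseable becomes NaN.
cleaned["standard_value"] = pd.to_numeric(cleaned["standard_value"], errors="coerce")
# Drop rows where the value is missing or could not be parsed.
cleaned = cleaned.dropna(subset=["standard_value"]).reset_index(drop=True)
print(cleaned)
```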

The bioactivity data are reported as IC50 values in nM. Compounds with values of 1,000 nM or less are considered active, while those with values of 10,000 nM or more are considered inactive. Compounds with values between 1,000 and 10,000 nM are labeled intermediate.

In [None]:
bioactivity_threshold = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")
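
The same thresholds can also be applied without an explicit loop. The following is a minimal sketch using `numpy.select`; the function name `classify_ic50` is hypothetical, and the boundary handling (1,000 nM counts as active, 10,000 nM as inactive) mirrors the loop above.

```python
import numpy as np
import pandas as pd

def classify_ic50(values_nm):
    """Label IC50 values in nM: <= 1000 active, >= 10000 inactive, else intermediate."""
    # Accept strings or numbers; raise if a value cannot be parsed.
    values = pd.to_numeric(pd.Series(values_nm), errors="raise").astype(float)
    labels = np.select(
        [values >= 10000, values <= 1000],  # conditions, checked in order
        ["inactive", "active"],             # label for each condition
        default="intermediate",             # everything in between
    )
    return pd.Series(labels, index=values.index)

print(classify_ic50(["2.5", "1000", "5000", "10000"]).tolist())
```

Because the returned series keeps the input index, it can be assigned directly as a new column of the frame it was computed from.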

In [None]:
# The bioactivity class is added to the dataset; it will be used to label the
# data when creating the machine learning model.
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added
df3 = df2[selection].copy()
bioactivity_class = pd.Series(bioactivity_threshold)
df3['bioactivity_classification'] = bioactivity_class.values
df3

In [None]:
# After viewing the dataset in Excel, I noticed more missing values; those were
# dropped as well.
df3 = df3[df3.bioactivity_classification.notna()]
df3.shape

(3551, 4)

In [None]:
df3 = df3[df3.canonical_smiles.notna()]
df3.shape

(3543, 4)

In [None]:
#save bioactivity data to CSV file 
df3.to_csv('bioactivity_estrogen_preprocessed_data.csv', index= False)
! cp bioactivity_estrogen_preprocessed_data.csv '/content/gdrive/My Drive/Colab Notebooks/data'