# Interacting with the ChEMBL database

During lead optimisation projects, one often needs to query the ChEMBL bioactivity database for compounds with reported assay data against a target, typically to build an ML model that predicts that activity for new compounds.

In this notebook, I will show how to interact with the ChEMBL database to retrieve structure-activity relationship (SAR) data.

In [1]:
import re

import chembl_downloader
import pandas as pd
import scikit_posthocs as sp
import seaborn as sns
import useful_rdkit_utils as uru
from rdkit import Chem
from rdkit.Chem.Draw import MolsToGridImage
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit.rdBase import BlockLogs
from tqdm.auto import tqdm
# Enable progress bars in Pandas
tqdm.pandas()

# Download the entire ChEMBL database locally

In [2]:
path = chembl_downloader.download_extract_sqlite()
print(path)

Downloading chembl_35_sqlite.tar.gz: 0.00B [00:00, ?B/s]

/Users/ganeshshahane/.data/chembl/35/data/chembl_35/chembl_35_sqlite/chembl_35.db


# Find binding data for a target

To fetch the binding assay data for a particular target protein, you need to know its ```target_chembl_id```. You can find it by typing the name of the target into the search box of [ChEMBL](https://www.ebi.ac.uk/chembl/). Once you have typed in the full name, select the suggested protein under the ```Targets``` label. The main page of the target then displays its ```target_chembl_id```.
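The same lookup can also be scripted against the local SQLite file downloaded earlier. The sketch below assumes the standard ChEMBL schema, in which the `target_dictionary` table holds `chembl_id`, `pref_name`, `target_type` and `organism` columns. The helper name `find_targets` is hypothetical, and a tiny in-memory table (with one made-up row) stands in for the real database so that the example runs on its own.

```python
import sqlite3


def find_targets(conn, name_fragment):
    """Return (chembl_id, pref_name, organism) rows whose name contains the fragment."""
    sql = """
        SELECT chembl_id, pref_name, organism
        FROM target_dictionary
        WHERE pref_name LIKE ?
        ORDER BY chembl_id
    """
    return conn.execute(sql, (f"%{name_fragment}%",)).fetchall()


# Toy stand-in for the real database, using the assumed column names.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE target_dictionary "
    "(chembl_id TEXT, pref_name TEXT, target_type TEXT, organism TEXT)"
)
demo.executemany(
    "INSERT INTO target_dictionary VALUES (?, ?, ?, ?)",
    [
        ("CHEMBL2189121", "GTPase KRas", "SINGLE PROTEIN", "Homo sapiens"),
        # Made-up row, present only to show that non-matching names are skipped.
        ("FAKE_ID_1", "Placeholder kinase", "SINGLE PROTEIN", "Homo sapiens"),
    ],
)
kras_hits = find_targets(demo, "KRas")
print(kras_hits)
```

With the real file, a connection opened with `sqlite3.connect(path)`, using the `path` printed earlier, would take the place of the in-memory one.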

In the following example, I will fetch binding assay data for KRAS GTPase:

In [3]:
from chembl_webresource_client.new_client import new_client

activity = new_client.activity
# assay_type='B' restricts the results to binding assays
res = activity.filter(target_chembl_id='CHEMBL2189121', assay_type='B')

# Put the results in a DataFrame
kras = pd.DataFrame(res)
print(f"There are {kras.shape[0]} KRAS binding activity records in ChEMBL")

There are 2110 KRAS binding activity records in ChEMBL


In [4]:
kras.head(1)

| | action_type | activity_comment | activity_id | activity_properties | assay_chembl_id | assay_description | assay_type | assay_variant_accession | assay_variant_mutation | bao_endpoint | ... | target_organism | target_pref_name | target_tax_id | text_value | toid | type | units | uo_units | upper_value | value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | 11011900 | [] | CHEMBL2089529 | Binding affinity to K-Ras G12D mutant-GDP complex | B | P01116 | G12D | BAO_0000034 | ... | Homo sapiens | GTPase KRas | 9606 | | | Kd | uM | UO_0000065 | | 1300.0 |


The measurement type (for example IC50 or Kd) is recorded in the ```type``` column, while the units are in the ```units``` column. The type gives a clue as to how a molecule's activity was determined.

In [5]:
kras.type.value_counts().head(5)

type
IC50        1014
INH          447
Activity     204
Kd            72
Ratio         64
Name: count, dtype: int64
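
Because different measurement types are not directly comparable, it usually helps to keep a single endpoint before modelling. The sketch below filters activity records, treated as plain dictionaries like those the web client returns, down to one type and unit. The field names `standard_type`, `standard_units` and `standard_value` are assumed from the ChEMBL activity schema, the helper `select_endpoint` is hypothetical, and the sample records are made up.

```python
def select_endpoint(records, endpoint="IC50", units="nM"):
    """Keep records of one measurement type and unit that have a numeric value."""
    selected = []
    for rec in records:
        if rec.get("standard_type") != endpoint or rec.get("standard_units") != units:
            continue
        try:
            value = float(rec.get("standard_value"))
        except (TypeError, ValueError):
            continue  # missing or non-numeric value
        # Copy the record, storing the value as a float
        selected.append({**rec, "standard_value": value})
    return selected


# Made-up records for illustration only.
sample_activities = [
    {"molecule_chembl_id": "MOL_A", "standard_type": "IC50",
     "standard_units": "nM", "standard_value": "250.0"},
    {"molecule_chembl_id": "MOL_B", "standard_type": "Kd",
     "standard_units": "nM", "standard_value": "40.0"},
    {"molecule_chembl_id": "MOL_C", "standard_type": "IC50",
     "standard_units": "nM", "standard_value": None},
]
ic50_only = select_endpoint(sample_activities)
print([rec["molecule_chembl_id"] for rec in ic50_only])
```

Only `MOL_A` survives here: `MOL_B` is a Kd measurement and `MOL_C` has no value.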

# Get all activities with a pChEMBL value for a molecule

One of the activity metrics, pChEMBL, is used to convey the potency of a given molecule. It is calculated from one of several semi-comparable values in the ChEMBL database and is defined as the negative base-10 logarithm of the molar IC50, XC50, EC50, AC50, Ki, Kd, or potency. pChEMBL therefore permits a rough comparison of these values. For example, a pChEMBL value of 7 indicates a measurable effect on a given target in the presence of 100 nM of the molecule. In the source cited below, the Kd values from Klaeger et al. were converted to pChEMBL values to harmonize them with ChEMBL data, and the mean pChEMBL was calculated for every molecule–target combination, along with the number of quantitative and qualitative associations found in the source databases.

Source: [Allaway et al, 2018](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0297-4)
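
As a worked example of the definition above, the conversion from a concentration in nM to a pChEMBL-style value is a one-liner. The function name `pchembl_from_nm` is hypothetical, and this is only the arithmetic; it says nothing about which records ChEMBL itself assigns a pChEMBL value to.

```python
import math


def pchembl_from_nm(value_nm):
    """Convert a concentration in nM to -log10 of the molar concentration."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)


print(pchembl_from_nm(100))     # 100 nM, the example from the text
print(pchembl_from_nm(1300e3))  # 1300 uM expressed in nM
```

The first call reproduces the value of 7 quoted in the text; weaker binders, such as a Kd in the millimolar range, end up with much lower values.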

In [6]:
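# Activity endpoint of the ChEMBL web services
activities = new_client.activity
# CHEMBL25 is aspirin; keep only records that have a pChEMBL value
res = activities.filter(molecule_chembl_id="CHEMBL25", pchembl_value__isnull=False)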

len(res)

151
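
The per-target averaging described in the pChEMBL paragraph can be sketched on records like these. The sketch assumes each record is a dictionary with `target_chembl_id` and `pchembl_value` keys and that the value may arrive as a string. The helper `mean_pchembl_by_target` is hypothetical, and the sample records are made up rather than taken from ChEMBL.

```python
from collections import defaultdict
from statistics import mean


def mean_pchembl_by_target(records):
    """Average pChEMBL per target, skipping records without a value."""
    grouped = defaultdict(list)
    for rec in records:
        value = rec.get("pchembl_value")
        if value is None:
            continue
        grouped[rec["target_chembl_id"]].append(float(value))
    return {target: mean(values) for target, values in grouped.items()}


# Made-up records for illustration only.
demo_records = [
    {"target_chembl_id": "TARGET_X", "pchembl_value": "6.0"},
    {"target_chembl_id": "TARGET_X", "pchembl_value": "7.0"},
    {"target_chembl_id": "TARGET_Y", "pchembl_value": "5.5"},
    {"target_chembl_id": "TARGET_Y", "pchembl_value": None},
]
target_means = mean_pchembl_by_target(demo_records)
print(target_means)
```

In principle the same function could be applied to the query result `res`, since iterating over it should yield one dictionary per activity record.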