# Dataset report: ChEMBL


> https://www.ebi.ac.uk/chembl/

ChEMBL is a curated database of bioactive molecules with drug-like properties.  The ChEMBL website supports querying directly in the browser.  Users can also download Oracle, MySQL, SQLite, and PostgreSQL versions of the entire database from the [downloads page](https://chembl.gitbook.io/chembl-interface-documentation/downloads).  ChEMBL also provides an interactive schema of the [entire database layout](https://www.ebi.ac.uk/chembl/db_schema).

The ChEMBL database is an especially attractive data source for our purposes because it covers a variety of microbial target organisms, compounds, and compound activities (e.g., MIC, LD50, EC50).  ChEMBL has data on various *Staphylococcus* species (>130,000 hits), *Escherichia* species (>68,000 hits), *Mycobacterium* (>54,000 hits), *Pseudomonas* species (>43,000 hits), *Streptococcus* species (>38,000 hits), *Enterococcus* (>29,000 hits), *Bacillus* (>23,000 hits), and several other groups.

However, navigating this database can be fairly tricky.  For example, to get the query we want, we first need to search by **Target**, setting our target organism to *Escherichia coli*.  This step alone is a bit tricky since, as of writing, 'Escherichia coli', 'Escherichia coli (strain K-12)' and 'Escherichia coli K-12' all appear as separate selectable options; for now we will select 'Escherichia coli' without the strain specification.  After selecting the organism, we are returned only a few hundred results, each corresponding to a unique UniProt ID rather than a compound SMILES or InChI string.  We need to take the additional step of clicking ***Browse Activities*** at the top of the query bar (more information on how to use the ChEMBL query setup [here](https://chembl.gitbook.io/chembl-interface-documentation/frequently-asked-questions/chembl-interface-questions#browsing-related-entities)).  This yields ~65,000 results, which narrow to ~55,000 when we specify µg/mL as our desired units.
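The same narrowing (organism, then activity type, then units) can be sketched offline against an exported activities table. This is only an illustration: the sample rows are made up, and the column names and the `ug.mL-1` unit spelling are assumptions about the export format rather than something confirmed here. `select_mic_rows` is a hypothetical helper.

```python
import csv
import io

# Hypothetical excerpt of an activities export. The rows are invented and
# the column names / unit spelling are assumptions for illustration only.
SAMPLE_EXPORT = """Smiles,Standard Type,Standard Value,Standard Units,Target Organism
CCO,MIC,64,ug.mL-1,Escherichia coli
c1ccccc1O,MIC,0.5,uM,Escherichia coli
CC(=O)O,IC50,12,ug.mL-1,Escherichia coli
CCN,MIC,8,ug.mL-1,Escherichia coli K-12
"""


def select_mic_rows(csv_text, organism="Escherichia coli", units="ug.mL-1"):
    """Keep MIC rows for exactly one organism label and one unit."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if row["Target Organism"] == organism   # exact match: strain labels differ
        and row["Standard Type"] == "MIC"
        and row["Standard Units"] == units
    ]


mic_rows = select_mic_rows(SAMPLE_EXPORT)
print(len(mic_rows), [row["Smiles"] for row in mic_rows])
```

Because the organism filter is an exact string match, the 'Escherichia coli K-12' row is dropped, mirroring the separate strain entries in the web interface.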

In [1]:
import altair as alt
alt.data_transformers.disable_max_rows()
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

import cytoxnet.dataprep.io

## Start with the MIC query

In [2]:
## get the data
dataframe = cytoxnet.dataprep.io.load_data('chembl_ecoli_MIC')
dataframe.describe()

| | ecoli_MIC | molecular_weight |
|---|---|---|
| count | 5271.0 | 5271.0 |
| mean | 2.840188 | 529.43355 |
| std | 2.364505 | 368.980299 |
| min | -11.042922 | 61.04 |
| 25% | 1.832581 | 356.36 |
| 50% | 3.465736 | 435.55 |
| 75% | 4.158883 | 546.575 |
| max | 9.433484 | 4462.64 |


In [3]:
cytoxnet.dataprep.io.create_compound_codex()

In [4]:
dataset = cytoxnet.dataprep.io.add_datasets(
                 dataframes=dataframe,
                 names=['chembl_ecoli_MIC'],
                 id_col='smiles',
                 new_featurizers=None)

In [5]:
print('Number of datapoints in the query: ', len(dataframe))

Number of datapoints in the query:  5271


In [6]:
compounds = pd.read_csv('./database/compounds.csv')
print('Number of unique molecules after SMILES canonicalization: ', len(compounds))

Number of unique molecules after SMILES canonicalization:  4581


### Targets present

#### <span style='color:blue'>__The range of targets seems to be quite large for each species (units of log µg/mL)__</span>

In [7]:
dataframe.describe().loc[['min', 'max']]

| | ecoli_MIC | molecular_weight |
|---|---|---|
| min | -11.042922 | 61.04 |
| max | 9.433484 | 4462.64 |
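To make the range easier to read, the reported summary values can be back-transformed. The sketch below assumes `ecoli_MIC` holds the natural logarithm of the MIC in µg/mL; that is an assumption, not something the loader documents here, but it fits the numbers: the quartiles map to roughly 6.25, 32, and 64 µg/mL, which look like typical dilution-series concentrations.

```python
import math

# Summary values copied from the describe() output of this report.
summary_ln = {
    "min": -11.042922,
    "25%": 1.832581,
    "50%": 3.465736,
    "75%": 4.158883,
    "max": 9.433484,
}

# Assumption: values are ln(MIC in ug/mL), so exp() recovers ug/mL.
summary_ug_per_ml = {key: math.exp(value) for key, value in summary_ln.items()}

for key, value in summary_ug_per_ml.items():
    print(f"{key:>4}: {value:.6g} ug/mL")
```

Under that assumption the data span roughly nine orders of magnitude in concentration, which is why the log scale is used for plotting.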


In [8]:
alt.Chart(dataframe).mark_area(
    opacity=0.7,
    interpolate='step'
).encode(
    alt.X('ecoli_MIC:Q', bin=alt.Bin(maxbins=100)),
    alt.Y('count()', stack=None)
)

#### <span style='color:blue'>__The dataset is heavily imbalanced towards the toxic side (units of log µg/mL)__</span>
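One rough way to put a number on the asymmetry, using only the quartiles reported earlier, is the quartile (Bowley) skewness. This is a swapped-in summary statistic for illustration, not a measure the original analysis used.

```python
# Quartiles of ecoli_MIC copied from the describe() output of this report.
q1, q2, q3 = 1.832581, 3.465736, 4.158883

# Bowley (quartile) skewness: 0 for a symmetric middle half,
# negative when the lower half is more spread out than the upper half.
bowley = (q3 + q1 - 2 * q2) / (q3 - q1)

print(f"Quartile skewness: {bowley:.3f}")
```

A negative value only says the median sits closer to the upper quartile than to the lower one; it should be read alongside the histogram rather than instead of it.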

### Molecule space

In [9]:
!pip install --quiet umap-learn hdbscan



In [13]:
import rdkit.Chem.AllChem
# import umap
import umap.umap_ as umap

Set the descriptors to use for mapping

In [14]:
# Morgan fingerprint (radius 2, 2048 bits) for each SMILES string
dataframe['descriptor'] = dataframe['smiles'].apply(
    lambda smiles: rdkit.Chem.AllChem.GetMorganFingerprintAsBitVect(
        rdkit.Chem.MolFromSmiles(smiles), radius=2, nBits=2048
    )
)

Embed the fingerprints in two dimensions with UMAP

In [15]:
%%time
# Jaccard distance on the binary fingerprints; two components for plotting
umap_model = umap.UMAP(metric="jaccard",
                       n_neighbors=25,
                       n_components=2,
                       low_memory=False,
                       min_dist=0.001)
# Stack the fingerprints into one row per molecule before fitting
X_umap = umap_model.fit_transform(np.vstack(dataframe['descriptor'].values))
dataframe["UMAP_0"], dataframe["UMAP_1"] = X_umap[:, 0], X_umap[:, 1]

CPU times: user 1min 6s, sys: 1.69 s, total: 1min 8s
Wall time: 50.5 s
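For intuition about the `jaccard` metric used in the embedding, here is a minimal pure-Python sketch of Jaccard distance between two fingerprints, each represented as the set of its "on" bit indices. The two example fingerprints are made up, and `jaccard_distance` is a hypothetical helper, not the implementation UMAP uses internally.

```python
def jaccard_distance(bits_a, bits_b):
    """Jaccard distance between two sets of 'on' bit indices.

    Equals 1 minus the Tanimoto similarity of the two bit vectors.
    """
    union = bits_a | bits_b
    if not union:
        # Two empty fingerprints: treated as identical here by convention.
        return 0.0
    return 1.0 - len(bits_a & bits_b) / len(union)


# Made-up fingerprints: 2 shared bits out of 5 distinct bits in total.
fp_a = {1, 5, 9, 42}
fp_b = {1, 5, 77}

distance = jaccard_distance(fp_a, fp_b)
print(distance)
```

Molecules that share most of their substructure bits get a distance near 0 and therefore tend to land close together in the UMAP projection.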


Are there any clusters?

In [16]:
alt.Chart(dataframe[['UMAP_0', 'UMAP_1']]).mark_circle(size=60).encode(
    x='UMAP_0',
    y='UMAP_1',
)

#### <span style='color:blue'>__All three species seem to cover the available space__</span>

### Do any clusters in UMAP space seem to exhibit high toxicity?

In [21]:
alt.Chart(dataframe[['UMAP_0', 'UMAP_1', 'ecoli_MIC']]).mark_circle(size=60).encode(
    x='UMAP_0',
    y='UMAP_1',
    color='ecoli_MIC:Q',
)