# Mapping ChEMBL to ChEBI

The `MOLECULE_DICTIONARY` table in ChEMBL contains mappings to ChEBI for some, but not all, chemicals. This is unsurprising, given that the scope of ChEMBL is larger than that of ChEBI. However, there is still room for improving these mappings.

This notebook identifies molecules that have a label but no ChEBI mapping, then generates prioritized lexical matches using [`gilda`](https://github.com/indralab/gilda) for curation.

In [1]:
import sys
import time

import gilda.grounder
import gilda.term
import pandas as pd
import pystow
from biomappings.gilda_utils import iter_prediction_tuples
from gilda.process import normalize
from IPython.display import Markdown
from tqdm.auto import tqdm

from chembl_downloader import latest, queries, query

In [2]:
print(sys.version)

3.10.8 (main, Oct 13 2022, 10:17:43) [Clang 14.0.0 (clang-1400.0.29.102)]


In [3]:
print(time.asctime())

Thu Nov  3 13:32:50 2022


In [4]:
version = latest()
print(f"Using ChEMBL version {version}")

Using ChEMBL version 31


## Making the Query

The following query over the `MOLECULE_DICTIONARY` table finds all ChEMBL compound identifiers and their preferred names, excluding compounds that already have a ChEBI mapping and compounds that have no preferred name. This lets us focus curation effort on new mappings.

In [5]:
queries.markdown(queries.CHEBI_UNMAPPED_SQL)

```sql
SELECT
    chembl_id,
    pref_name
FROM MOLECULE_DICTIONARY
WHERE
    chebi_par_id IS NULL
    AND pref_name IS NOT NULL
```
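
To see what the two `WHERE` conditions keep and drop, here is a minimal sketch that runs the same filter against an in-memory SQLite table. The three rows and their identifiers are made-up placeholders, and the toy table only has the columns the query touches; the real table lives in the ChEMBL SQLite dump that `chembl_downloader` manages.

```python
import sqlite3

# Toy stand-in for ChEMBL's MOLECULE_DICTIONARY; all rows are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MOLECULE_DICTIONARY "
    "(chembl_id TEXT, pref_name TEXT, chebi_par_id INTEGER)"
)
conn.executemany(
    "INSERT INTO MOLECULE_DICTIONARY VALUES (?, ?, ?)",
    [
        ("CHEMBL_A", "EXAMPLE ONE", 12345),  # already mapped to ChEBI: excluded
        ("CHEMBL_B", "EXAMPLE TWO", None),   # unmapped and named: kept
        ("CHEMBL_C", None, None),            # unmapped but unnamed: excluded
    ],
)

unmapped_sql = """
SELECT chembl_id, pref_name
FROM MOLECULE_DICTIONARY
WHERE chebi_par_id IS NULL
  AND pref_name IS NOT NULL
"""
rows = conn.execute(unmapped_sql).fetchall()
print(rows)  # [('CHEMBL_B', 'EXAMPLE TWO')]
```

Only the unmapped compound that has a name survives, which is the population a name-based matcher can actually work with.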

Make the query with `chembl_downloader.query`.

In [6]:
%%time
df = query(queries.CHEBI_UNMAPPED_SQL, version=version)


In [7]:
df

| | chembl_id | pref_name |
|---|---|---|
| 0 | CHEMBL6206 | BROMOENOL LACTONE |
| 1 | CHEMBL446445 | UCL-1530 |
| 2 | CHEMBL266459 | ZOMEPIRAC SODIUM |
| 3 | CHEMBL216458 | ALPHA-BUNGAROTOXIN |
| 4 | CHEMBL6346 | SCR01020 |
| ... | ... | ... |
| 39225 | CHEMBL4802269 | IZURALIMAB |
| 39226 | CHEMBL4802270 | PUDEXACIANINIUM |
| 39227 | CHEMBL4804171 | AFP-464 FREE SALT |
| 39228 | CHEMBL4804172 | SAMARIUM DOTMP |


## What's Already in ChEBI

ChEBI also maintains its own mappings to ChEMBL. Before proposing new mappings, we check whether ChEBI already provides any that are missing from ChEMBL.

In [8]:
chebi_url = "https://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/reference.tsv.gz"
chebi_df = pystow.ensure_csv(
    "bio",
    "chebi",
    url=chebi_url,
    read_csv_kwargs=dict(
        compression="gzip",
        sep="\t",
        encoding="unicode_escape",
        on_bad_lines="skip",
    ),
)
chebi_mappings = dict(
    chebi_df[chebi_df.REFERENCE_DB_NAME == "ChEMBL"][["REFERENCE_ID", "COMPOUND_ID"]].values
)
len(chebi_mappings)


34312

In [9]:
chebi_idx = df.chembl_id.isin(set(chebi_mappings))

print(
    f"there are {chebi_idx.sum():,}/{len(df.index):,} ({chebi_idx.sum()/len(df.index):.2%}) "
    f"extra mappings from ChEBI"
)

df = df[~chebi_idx]

there are 4,041/39,230 (10.30%)  extra mappings from ChEBI
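
The filtering step above is a set-membership test followed by a complement. A minimal sketch of the same logic with plain Python structures instead of pandas; the identifiers, names, and ChEBI numbers are hypothetical placeholders.

```python
# Hypothetical (chembl_id, pref_name) rows that lack a ChEBI mapping in ChEMBL
unmapped = [
    ("CHEMBL_A", "EXAMPLE ONE"),
    ("CHEMBL_B", "EXAMPLE TWO"),
    ("CHEMBL_C", "EXAMPLE THREE"),
]
# Hypothetical ChEBI-side references: ChEMBL identifier -> ChEBI compound identifier
chebi_side = {"CHEMBL_B": 111, "CHEMBL_Z": 222}

# Rows ChEBI already covers (the analogue of `df[chebi_idx]`)
already_in_chebi = [row for row in unmapped if row[0] in chebi_side]
# Rows still needing a mapping (the analogue of `df[~chebi_idx]`)
remaining = [row for row in unmapped if row[0] not in chebi_side]

share = len(already_in_chebi) / len(unmapped)
print(f"{len(already_in_chebi):,}/{len(unmapped):,} ({share:.2%})")  # 1/3 (33.33%)
```

Entries that appear only on the ChEBI side (`CHEMBL_Z` here) are ignored, since the goal is only to shrink the list of unmapped ChEMBL compounds.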


## Propose New Mappings

First, we index the dataframe of molecules using [`gilda`](https://github.com/indralab/gilda), which implements a scored string matching algorithm.

In [10]:
terms = [
    gilda.term.Term(
        norm_text=normalize(name),
        text=name,
        db="chembl.compound",
        id=identifier,
        entry_name=name,
        status="name",
        source="chembl",
    )
    for identifier, name in tqdm(df.values, unit="term", unit_scale=True)
]

grounder = gilda.grounder.Grounder(terms)

  0%|          | 0.00/35.2k [00:00<?, ?term/s]
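
To give an intuition for what indexing by normalized text buys us, here is a deliberately simplified sketch. It is not gilda's implementation: `toy_normalize` and `index` are hypothetical names, and the normalization rule (lowercase, treat punctuation as spaces, collapse whitespace) is an assumption chosen for illustration. gilda's own `normalize` and its scoring are more involved.

```python
import re
from collections import defaultdict


def toy_normalize(text: str) -> str:
    """Lowercase, turn non-alphanumeric runs into spaces, collapse whitespace.

    A simplification for illustration only, not gilda's `normalize`.
    """
    return " ".join(re.sub(r"[^0-9a-z]+", " ", text.lower()).split())


# A few (identifier, name) pairs taken from the dataframe shown earlier
names = {
    "CHEMBL6206": "BROMOENOL LACTONE",
    "CHEMBL266459": "ZOMEPIRAC SODIUM",
    "CHEMBL216458": "ALPHA-BUNGAROTOXIN",
}

# Map each normalized name to the identifiers that carry it
index = defaultdict(list)
for identifier, name in names.items():
    index[toy_normalize(name)].append(identifier)

# A differently cased and punctuated query still finds the same entry
print(index[toy_normalize("alpha bungarotoxin")])  # ['CHEMBL216458']
```

Keying on the normalized form is what lets names that differ only in case or punctuation meet in the same bucket; a real grounder then scores how close each candidate is to the original text.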

Second, we use a utility function from [`biomappings`](https://github.com/biopragmatics/biomappings) that takes in three things:

1. a `prefix` corresponding to the resource we want to map against
2. the `grounder` object generated from indexing all of the ChEMBL terms
3. a `provenance` string


This function in turn relies on [`pyobo`](https://github.com/pyobo/pyobo) and will download/cache the [ChEBI ontology](https://obofoundry.org/ontology/chebi), so be patient on the first run.

In [11]:
prediction_tuples = list(
    iter_prediction_tuples(
        prefix="chebi",
        grounder=grounder,
        provenance="notebook",
    )
)



[chebi] gilda tuples:   0%|          | 0.00/163k [00:00<?, ?name/s]
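
Conceptually, this step walks over every name in the target vocabulary and looks it up in the index built from ChEMBL. The sketch below mimics that loop with a case-insensitive dictionary lookup. `toy_iter_predictions` is a hypothetical helper, not the `biomappings` function; it omits the confidence score because the toy lookup has no scoring, and the last ChEBI entry is a placeholder that is meant to find no match.

```python
from typing import Iterable, Iterator


def toy_iter_predictions(
    target_names: Iterable[tuple[str, str]],
    index: dict[str, list[tuple[str, str]]],
    provenance: str,
) -> Iterator[tuple[str, ...]]:
    """Yield one tuple per (ChEBI name, ChEMBL entry) pair with equal casefolded names."""
    for chebi_id, chebi_name in target_names:
        for chembl_id, chembl_name in index.get(chebi_name.casefold(), []):
            yield (
                "chebi", chebi_id, chebi_name,
                "skos:exactMatch",
                "chembl.compound", chembl_id, chembl_name,
                "lexical", provenance,
            )


# Index two ChEMBL entries by casefolded name
chembl_index: dict[str, list[tuple[str, str]]] = {}
for chembl_id, name in [("CHEMBL1201563", "INTERFERON BETA-1B"), ("CHEMBL11471", "L-NIO")]:
    chembl_index.setdefault(name.casefold(), []).append((chembl_id, name))

chebi_names = [
    ("CHEBI:5938", "Interferon beta-1b"),
    ("CHEBI:192723", "L-NIO"),
    ("CHEBI:0000000", "placeholder without a match"),  # hypothetical entry
]

predictions = list(toy_iter_predictions(chebi_names, chembl_index, provenance="notebook"))
for prediction in predictions:
    print(prediction)
```

The direction of the loop explains the column layout of the results: ChEBI is the source side because its names drive the iteration, and ChEMBL is the target side because it is what was indexed.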

In [12]:
print(f"Got {len(prediction_tuples):,} predictions")

Got 1,921 predictions


## Results

The predictions are promising: many are exact string matches. They can be verified further at the chemical structure level, but in practice these matches are typically correct without further investigation.

In [13]:
predictions_df = pd.DataFrame(prediction_tuples).sort_values("confidence", ascending=False)
predictions_df

| | source_prefix | source_id | source_name | relation | target_prefix | target_identifier | target_name | type | confidence | source |
|---|---|---|---|---|---|---|---|---|---|---|
| 497 | chebi | CHEBI:190867 | 1-AMINOCYCLOBUTANE CARBOXYLIC ACID | skos:exactMatch | chembl.compound | CHEMBL131244 | 1-AMINOCYCLOBUTANE CARBOXYLIC ACID | lexical | 0.777778 | notebook |
| 904 | chebi | CHEBI:34827 | M2 | skos:exactMatch | chembl.compound | CHEMBL4525134 | M2 | lexical | 0.777778 | notebook |
| 927 | chebi | CHEBI:35811 | 2-endo-hydroxy-1,8-cineole | skos:exactMatch | chembl.compound | CHEMBL2229602 | 2-endo-hydroxy-1,8-cineole | lexical | 0.777778 | notebook |
| 443 | chebi | CHEBI:188062 | XYLOCARPUS A | skos:exactMatch | chembl.compound | CHEMBL3039346 | XYLOCARPUS A | lexical | 0.777778 | notebook |
| 574 | chebi | CHEBI:192723 | L-NIO | skos:exactMatch | chembl.compound | CHEMBL11471 | L-NIO | lexical | 0.777778 | notebook |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 852 | chebi | CHEBI:32187 | Technetium Tc 99m succimer | skos:exactMatch | chembl.compound | CHEMBL1200797 | TECHNETIUM TC 99M SUCCIMER | lexical | 0.723974 | notebook |
| 1182 | chebi | CHEBI:5938 | Interferon beta-1b | skos:exactMatch | chembl.compound | CHEMBL1201563 | INTERFERON BETA-1B | lexical | 0.723974 | notebook |
| 1183 | chebi | CHEBI:5939 | Interferon gamma-1b | skos:exactMatch | chembl.compound | CHEMBL1201564 | INTERFERON GAMMA-1B | lexical | 0.723974 | notebook |
| 1892 | chebi | CHEBI:9423 | Technetium tc 99m sestamibi | skos:exactMatch | chembl.compound | CHEMBL4594241 | TECHNETIUM TC 99M SESTAMIBI | lexical | 0.723974 | notebook |
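
One quick way to triage predictions like these for curation is to separate character-for-character matches from matches that differ only in capitalization. The sketch below does this for four rows copied from the table above; it says nothing about the rows hidden behind the ellipsis, and the variable names are illustrative.

```python
# (source_id, source_name, target_identifier, target_name, confidence),
# copied from four of the displayed predictions
rows = [
    ("CHEBI:192723", "L-NIO", "CHEMBL11471", "L-NIO", 0.777778),
    ("CHEBI:34827", "M2", "CHEMBL4525134", "M2", 0.777778),
    ("CHEBI:5938", "Interferon beta-1b", "CHEMBL1201563", "INTERFERON BETA-1B", 0.723974),
    ("CHEBI:9423", "Technetium tc 99m sestamibi", "CHEMBL4594241", "TECHNETIUM TC 99M SESTAMIBI", 0.723974),
]

# Names that agree character for character
identical = [r for r in rows if r[1] == r[3]]
# Names that agree only after case folding
case_only = [r for r in rows if r[1] != r[3] and r[1].casefold() == r[3].casefold()]

for label, group in [("identical", identical), ("case differs", case_only)]:
    for source_id, _, target_id, _, confidence in group:
        print(f"{label:>12}  {source_id} -> {target_id}  ({confidence})")
```

In these four rows the identical names carry the higher score and the case-only matches the lower one, so sorting by confidence puts the most literal matches first.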
