# Mapping ChEMBL to ChEBI

The `MOLECULE_DICTIONARY` table in ChEMBL contains mappings to ChEBI for some, but not all chemicals. This is unsurprising, given the scope of ChEBML is larger tha ChEBI. However, there is still room for improving these mappings.

This notebook identifies molecules that have no ChEBI mapping (and have a label) then generates prioritized lexical matchings usingy [`gilda`](https://github.com/indralab/gilda) for curation.

In [9]:
import sys
import time

import gilda.grounder
import gilda.term
import pandas as pd
import pystow
from biomappings.gilda_utils import iter_prediction_tuples
from gilda.process import normalize
from tqdm.auto import tqdm
from IPython.display import Markdown

from chembl_downloader import queries, query, latest

In [2]:
print(sys.version)

3.10.8 (main, Oct 13 2022, 10:17:43) [Clang 14.0.0 (clang-1400.0.29.102)]


In [3]:
print(time.asctime())

Thu Nov  3 12:53:35 2022


In [4]:
version = latest()
print(f"Using ChEMBL version {version}")

Using ChEMBL version 31


## Making the Query

The following query over the `MOLECULE_DICTIONARY` finds all ChEMBL compound identifiers and their associated preferred names but filters out ones that already have mappings to ChEBI. This allows us to focus on doing some extra curation of new mappings.

In [5]:
queries.markdown(queries.CHEBI_UNMAPPED_SQL)

```sql
SELECT
    chembl_id,
    pref_name
FROM MOLECULE_DICTIONARY
WHERE
    chebi_par_id IS NULL
    AND pref_name IS NOT NULL
```

Make the query with `chembl_downloader.query`.

In [6]:
%time
df = query(queries.CHEBI_UNMAPPED_SQL, version=version)

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 6.91 µs


In [7]:
df

Unnamed: 0,chembl_id,pref_name
0,CHEMBL6206,BROMOENOL LACTONE
1,CHEMBL446445,UCL-1530
2,CHEMBL266459,ZOMEPIRAC SODIUM
3,CHEMBL216458,ALPHA-BUNGAROTOXIN
4,CHEMBL6346,SCR01020
...,...,...
39225,CHEMBL4802269,IZURALIMAB
39226,CHEMBL4802270,PUDEXACIANINIUM
39227,CHEMBL4804171,AFP-464 FREE SALT
39228,CHEMBL4804172,SAMARIUM DOTMP


## What's Already in ChEBI

ChEBI also maintains its own mappings to ChEMBL - investigate if there's anything available there that is not already available in ChEMBL before moving on to propose new mappings.

In [8]:
chebi_url = "https://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/reference.tsv.gz"
chebi_df = pystow.ensure_df("bio", "chebi", url=chebi_url)

# TODO slice to only have chembl sources then remove unnecessary columns

#: keys are chembl IDs, values are chebi IDs
chebi_mappings = ...

NameError: name 'pystow' is not defined

In [None]:
chebi_idx = df.chembl_id.isin(set(chebi_mappings))

print(
    f"there are {chebi_idx.sum():,}/{len(df.index):,} ({chebi_idx.sum()/len(df.index):.2%}) "
    " extra mappings from ChEBI"
)

df = df[~chebi_idx]

## Propose New Mappings

First, we index the dataframe of molecules using [`gilda`](https://github.com/indralab/gilda), which implements a scored string matching algorithm.

In [None]:
terms = [
    gilda.term.Term(
        norm_text=normalize(name),
        text=name,
        db="chembl.compound",
        id=identifier,
        entry_name=name,
        status="name",
        source="chembl",
    )
    for identifier, name in tqdm(df.values, unit="term", unit_scale=True)
]

grounder = gilda.grounder.Grounder(terms)

Second, we use a utility function from [`biomappings`](https://github.com/biopragmatics/biomappings) that takes in three things:

1. a `prefix` corresponding to the resource we want to map against
2. the `grounder` object generated from indexing all of the ChEMBL terms
3. a `provenance` string


This function in turn relies on [`pyobo`](https://github.com/pyobo/pyobo) and will download/cache the [ChEBI ontology](https://obofoundry.org/ontology/chebi), so be patient on the first run.

In [None]:
prediction_tuples = list(
    iter_prediction_tuples(
        prefix="chebi",
        grounder=grounder,
        provenance="notebook",
    )
)

In [None]:
print(f"Got {len(prediction_tuples):,} predictions")

## Results

The results below show promising results, often resulting in exact string matches. Further proofing can be done on the chemical strucutre level, but these matches are typically correct without further investigation.

In [None]:
predictions_df = pd.DataFrame(prediction_tuples).sort_values("confidence", ascending=False)
predictions_df