# Mapping ChEMBL to ChEBI

The `MOLECULE_DICTIONARY` table in ChEMBL contains mappings to ChEBI for some, but not all chemicals. This is unsurprising, given the scope of ChEBML is larger tha ChEBI. However, there is still room for improving these mappings.

This notebook identifies molecules that have no ChEBI mapping (and have a label) then generates prioritized lexical matchings usingy [`gilda`](https://github.com/indralab/gilda) for curation.

In [1]:
import sys
import time

import gilda.grounder
import gilda.term
import pandas as pd
import pystow
from biomappings.lexical import predict_lexical_mappings
import ssslm
from tqdm.auto import tqdm
import curies
from chembl_downloader import latest, queries, query

In [2]:
print(sys.version)

3.12.10 (main, Apr  8 2025, 11:35:47) [Clang 17.0.0 (clang-1700.0.13.3)]


In [3]:
print(time.asctime())

Fri Jun 27 17:12:10 2025


In [4]:
version = latest()
print(f"Using ChEMBL version {version}")

Using ChEMBL version 35


## Making the Query

The following query over the `MOLECULE_DICTIONARY` finds all ChEMBL compound identifiers and their associated preferred names but filters out ones that already have mappings to ChEBI. This allows us to focus on doing some extra curation of new mappings.

In [5]:
queries.markdown(queries.CHEBI_UNMAPPED_SQL)

```sql
SELECT
    chembl_id,
    pref_name
FROM MOLECULE_DICTIONARY
WHERE
    chebi_par_id IS NULL
    AND pref_name IS NOT NULL
```

Make the query with `chembl_downloader.query`.

In [6]:
%time
df = query(queries.CHEBI_UNMAPPED_SQL, version=version)

CPU times: user 2 μs, sys: 0 ns, total: 2 μs
Wall time: 5.01 μs


In [7]:
df

Unnamed: 0,chembl_id,pref_name
0,CHEMBL6206,BROMOENOL LACTONE
1,CHEMBL446445,UCL-1530
2,CHEMBL216458,ALPHA-BUNGAROTOXIN
3,CHEMBL6346,SCR01020
4,CHEMBL204021,DARAPLADIB
...,...,...
40388,CHEMBL5482969,BUTIROSIN
40389,CHEMBL5482975,CEPHALOSPORIN
40390,CHEMBL5483015,ARSENIC TRIOXIDE
40391,CHEMBL5498461,E133


## What's Already in ChEBI

ChEBI also maintains its own mappings to ChEMBL - investigate if there's anything available there that is not already available in ChEMBL before moving on to propose new mappings.

In [8]:
chebi_url = "https://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/reference.tsv.gz"
chebi_df = pystow.ensure_csv(
    "bio",
    "chebi",
    url=chebi_url,
    read_csv_kwargs={
        "compression": "gzip",
        "sep": "\t",
        "encoding": "unicode_escape",
        "on_bad_lines": "skip",
        "dtype": str,
    },
)
chebi_mappings = dict(
    chebi_df[chebi_df.REFERENCE_DB_NAME == "ChEMBL"][["REFERENCE_ID", "COMPOUND_ID"]].values
)
len(chebi_mappings)

35558

In [9]:
chebi_idx = df.chembl_id.isin(set(chebi_mappings))

print(
    f"there are {chebi_idx.sum():,}/{len(df.index):,} ({chebi_idx.sum() / len(df.index):.2%}) "
    f"extra mappings from ChEBI"
)

df = df[~chebi_idx]

there are 4,295/40,393 (10.63%) extra mappings from ChEBI


## Propose New Mappings

First, we index the dataframe of molecules using [`gilda`](https://github.com/indralab/gilda), which implements a scored string matching algorithm.

In [10]:
literal_mappings = [
    ssslm.LiteralMapping(
        reference=curies.NamableReference(
            prefix="chembl.compound",
            identifier=identifier.strip(),
            name=name,
        ),
        text=name,
        source="chembl",
    )
    for identifier, name in tqdm(df.values, unit="term", unit_scale=True)
]

grounder = ssslm.make_grounder(literal_mappings)

  0%|          | 0.00/36.1k [00:00<?, ?term/s]

Second, we use a utility function from [`biomappings`](https://github.com/biopragmatics/biomappings) that takes in three things:

1. a `prefix` corresponding to the resource we want to map against
2. the `grounder` object generated from indexing all of the ChEMBL terms
3. a `provenance` string


This function in turn relies on [`pyobo`](https://github.com/pyobo/pyobo) and will download/cache the [ChEBI ontology](https://obofoundry.org/ontology/chebi), so be patient on the first run.

In [11]:
prediction_tuples = list(
    predict_lexical_mappings(
        prefix="chebi",
        grounder=grounder,
        provenance="chembl-downloader-repo",
    )
)

[chebi] lexical tuples:   0%|          | 0.00/203k [00:00<?, ?name/s]

[chebi] generated 4,266 predictions from names


In [12]:
print(f"Got {len(prediction_tuples):,} predictions")

Got 4,266 predictions


## Results

The results below show promising results, often resulting in exact string matches. Further proofing can be done on the chemical strucutre level, but these matches are typically correct without further investigation.

In [13]:
predictions_df = pd.DataFrame(prediction_tuples)
predictions_df.sort_values(7, ascending=False)

Unnamed: 0,0,1,2,3,4,5,6,7
3356,"(subject, prefix='chebi' identifier='41462' na...","(predicate, prefix='skos' identifier='exactMat...","(object, prefix='chembl.compound' identifier='...","(mapping_justification, prefix='semapv' identi...","(author, None)","(mapping_tool, chembl-downloader-repo)","(predicate_modifier, None)","(confidence, 0.556)"
2321,"(subject, prefix='chebi' identifier='221846' n...","(predicate, prefix='skos' identifier='exactMat...","(object, prefix='chembl.compound' identifier='...","(mapping_justification, prefix='semapv' identi...","(author, None)","(mapping_tool, chembl-downloader-repo)","(predicate_modifier, None)","(confidence, 0.556)"
3365,"(subject, prefix='chebi' identifier='43633' na...","(predicate, prefix='skos' identifier='exactMat...","(object, prefix='chembl.compound' identifier='...","(mapping_justification, prefix='semapv' identi...","(author, None)","(mapping_tool, chembl-downloader-repo)","(predicate_modifier, None)","(confidence, 0.556)"
3257,"(subject, prefix='chebi' identifier='34827' na...","(predicate, prefix='skos' identifier='exactMat...","(object, prefix='chembl.compound' identifier='...","(mapping_justification, prefix='semapv' identi...","(author, None)","(mapping_tool, chembl-downloader-repo)","(predicate_modifier, None)","(confidence, 0.556)"
3256,"(subject, prefix='chebi' identifier='34827' na...","(predicate, prefix='skos' identifier='exactMat...","(object, prefix='chembl.compound' identifier='...","(mapping_justification, prefix='semapv' identi...","(author, None)","(mapping_tool, chembl-downloader-repo)","(predicate_modifier, None)","(confidence, 0.556)"
...,...,...,...,...,...,...,...,...
1728,"(subject, prefix='chebi' identifier='211166' n...","(predicate, prefix='skos' identifier='exactMat...","(object, prefix='chembl.compound' identifier='...","(mapping_justification, prefix='semapv' identi...","(author, None)","(mapping_tool, chembl-downloader-repo)","(predicate_modifier, None)","(confidence, 0.502)"
3090,"(subject, prefix='chebi' identifier='29540' na...","(predicate, prefix='skos' identifier='exactMat...","(object, prefix='chembl.compound' identifier='...","(mapping_justification, prefix='semapv' identi...","(author, None)","(mapping_tool, chembl-downloader-repo)","(predicate_modifier, None)","(confidence, 0.502)"
3089,"(subject, prefix='chebi' identifier='29534' na...","(predicate, prefix='skos' identifier='exactMat...","(object, prefix='chembl.compound' identifier='...","(mapping_justification, prefix='semapv' identi...","(author, None)","(mapping_tool, chembl-downloader-repo)","(predicate_modifier, None)","(confidence, 0.502)"
4237,"(subject, prefix='chebi' identifier='9423' nam...","(predicate, prefix='skos' identifier='exactMat...","(object, prefix='chembl.compound' identifier='...","(mapping_justification, prefix='semapv' identi...","(author, None)","(mapping_tool, chembl-downloader-repo)","(predicate_modifier, None)","(confidence, 0.502)"
