# Estimation of Chemical Identifier Mapping Precision
Because there are more than 10K mappings predicted by lexical matching between ChEBI and MeSH, we have not been able to exhaustively curate them all. 

Further, we biased our curation towards the mappings that were easiest to curate. Typically, this meant checking mappings whose primary labels were exact string matches (ignoring capitalization). As a result, directly estimating precision as the ratio of positive curations to total curations would be biased upward. A previous estimate calculated this way gave a best-case precision of 99%. In reality, the precision will be lower.

An unbiased way to estimate the precision is instead to draw a random sample from all mappings and curate every mapping in that sample. In this notebook, we sample 100 mappings between ChEBI and MeSH and curate them to do this.
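As a minimal sketch of the sampling idea, the snippet below draws a reproducible random sample from a list of mapping records. The records are hypothetical placeholders, not real ChEBI or MeSH entries, and the use of `random.sample` (which draws without replacement, so no mapping is picked twice) is an assumption made for illustration rather than a description of the notebook's own code.

```python
import random

# Hypothetical stand-ins for the full list of ChEBI-MeSH mappings
mappings = [
    {"source identifier": f"CHEBI:{i}", "target identifier": f"C{i:06d}"}
    for i in range(1, 1001)
]

# A dedicated, seeded generator keeps the draw reproducible
rng = random.Random(0)

# random.sample draws without replacement, so every sampled mapping is distinct
subset = rng.sample(mappings, k=100)

unique_pairs = {(m["source identifier"], m["target identifier"]) for m in subset}
print(f"{len(subset)} sampled mappings, {len(unique_pairs)} unique pairs")
```

By contrast, `random.choices` draws with replacement, so the same mapping can appear more than once in the sample.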

In [1]:
import random
import time
from collections import Counter
from pathlib import Path

import pandas as pd

import biomappings

In [2]:
print(time.asctime())

Tue Feb  7 20:26:44 2023


In [3]:
path = Path("chemical_identifier_evaluation_full.tsv")

if path.is_file():
    print(f"loading from {path}")
    df = pd.read_csv(path, sep="\t")

else:
    mappings = []

    # Collect ChEBI-to-MeSH mappings from each curation state and tag each with its status
    for label, loaded_mappings in [
        ("prediction", biomappings.load_predictions()),
        ("positive", biomappings.load_mappings()),
        ("negative", biomappings.load_false_mappings()),
        ("unsure", biomappings.load_unsure()),
    ]:
        for mapping in loaded_mappings:
            if mapping["source prefix"] == "chebi" and mapping["target prefix"] == "mesh":
                mapping["curation_status"] = label
                mappings.append(mapping)

    mappings = sorted(
        mappings,
        key=lambda m: (
            m["source identifier"],
            m["relation"],
            m["target identifier"],
        ),
    )

    print(f"There are {len(mappings):,} total mappings")

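    random.seed(0)  # fixed seed so the sample is reproducible
    # NOTE: random.choices draws with replacement, so the same mapping may be picked more than once
    subset = random.choices(mappings, k=100)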
    df = pd.DataFrame(subset).sort_values("curation_status")
    df.to_csv(path, sep="\t", index=False)

df

loading from chemical_identifier_evaluation_full.tsv


Unnamed: 0,source prefix,source identifier,source name,relation,target prefix,target identifier,target name,type,confidence,source,curation_status
0,chebi,CHEBI:135656,mazaticol,skos:exactMatch,mesh,C003706,mazaticol,manually_reviewed,,orcid:0000-0001-9439-5346,positive
1,chebi,CHEBI:135822,nafiverine,skos:exactMatch,mesh,C002681,nafiverine,manually_reviewed,,orcid:0000-0003-4423-4370,positive
2,chebi,CHEBI:75408,sulindac sulfide,skos:exactMatch,mesh,C025462,sulindac sulfide,manually_reviewed,,orcid:0000-0001-9439-5346,positive
3,chebi,CHEBI:132171,"5,6-dihydrothymidine",skos:exactMatch,mesh,C029949,"5,6-dihydrothymidine",manually_reviewed,,orcid:0000-0001-9439-5346,positive
4,chebi,CHEBI:4948,Evodiamine,skos:exactMatch,mesh,C049639,evodiamine,manually_reviewed,,orcid:0000-0001-9439-5346,positive
...,...,...,...,...,...,...,...,...,...,...,...
95,chebi,CHEBI:82469,Glycidyl oleate,skos:exactMatch,mesh,C013542,glycidyl oleate,lexical,0.95,generate_chebi_mesh_mappings.py,prediction
96,chebi,CHEBI:6086,Jatrophone,skos:exactMatch,mesh,C006386,jatrophone,lexical,0.95,generate_chebi_mesh_mappings.py,prediction
97,chebi,CHEBI:34339,"3-Hydroxyestra-1,3,5(10),6-tetraen-17-one",skos:exactMatch,mesh,C076348,"3-hydroxyestra-1,3,5(10),6-tetraen-17-one",lexical,0.95,generate_chebi_mesh_mappings.py,prediction
98,chebi,CHEBI:3174,Brevicolline,skos:exactMatch,mesh,C519984,brevicolline,lexical,0.95,generate_chebi_mesh_mappings.py,prediction


In [4]:
counter = Counter()
# Count how often each (source, target) identifier pair occurs in the sample
pairs = Counter(map(tuple, df[["source identifier", "target identifier"]].values))
print(f"There are {len(pairs)} pairs")

# Tally the current curation status of every sampled pair
for label, loaded_mappings in [
    ("prediction", biomappings.load_predictions()),
    ("positive", biomappings.load_mappings()),
    ("negative", biomappings.load_false_mappings()),
    ("unsure", biomappings.load_unsure()),
]:
    for mapping in loaded_mappings:
        counter[label] += pairs.get((mapping["source identifier"], mapping["target identifier"]), 0)

counter

There are 98 pairs


Counter({'prediction': 0, 'positive': 97, 'negative': 2, 'unsure': 1})
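The counts above can be turned into a precision estimate by treating each unsure curation as half a positive and reporting the other half as an error bound. The symbols $n_{+}$, $n_{-}$, $n_{?}$, $n_{\text{pred}}$, $N$, $\hat{p}$ and $\epsilon$ are notation introduced here for illustration only:

```latex
N = n_{\text{pred}} + n_{+} + n_{-} + n_{?} = 0 + 97 + 2 + 1 = 100

\hat{p} = \frac{n_{+} + n_{?}/2}{N} = \frac{97 + 0.5}{100} = 0.975
\qquad
\epsilon = \frac{n_{?}/2}{N} = \frac{0.5}{100} = 0.005
```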

In [5]:
total = sum(counter.values())
# Count each "unsure" curation as half a positive; the other half is reported as the error
precision_mi = (counter["positive"] + counter["unsure"] / 2) / total
precision_error = counter["unsure"] / 2 / total

print(
    f"""\
With {total:,} random curations, we estimate a precision \
of {precision_mi:.1%} ± {precision_error:.1%}
"""
)

With 100 random curations, we estimate a precision of 97.5% ± 0.5%
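Note that the ± above reflects only the ambiguity from unsure curations, not the sampling uncertainty that comes from curating 100 mappings rather than all of them. As a rough sketch of one common way to quantify the latter, the snippet below computes a Wilson score interval for a proportion. The function name `wilson_interval`, the choice of z = 1.96 (roughly a 95% interval), and the assumption that the sampled mappings are independent draws are all illustrative choices that are not part of the original analysis.

```python
import math


def wilson_interval(successes: float, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; z = 1.96 corresponds to ~95% coverage.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denominator = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denominator
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denominator
    )
    return center - half_width, center + half_width


# Counts taken from the Counter output above: 97 positive out of 100 curations
low, high = wilson_interval(97, 100)
print(f"Approximate 95% interval for the precision: {low:.1%} to {high:.1%}")
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better when the proportion is close to 0 or 1, as it is in this case.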

