# Assessment of Value Added from Curation in Biomappings

This notebook assesses the value added by Biomappings, both in a direct comparison to the primary resources that it supplements and in several practical data integration scenarios.

In [1]:
import sys
import time
from collections import defaultdict

import pandas as pd

import biomappings
from biomappings.paper_analysis import (
    EVALUATION,
    Result,
    get_non_obo_mappings,
    get_obo_mappings,
    index_mappings,
)

In [2]:
print(sys.version)
print(time.asctime())

3.11.0 (main, Oct 25 2022, 14:13:24) [Clang 14.0.0 (clang-1400.0.29.202)]
Tue Feb  7 19:04:00 2023


# Importing Data

## Biomappings

Manually curated mappings from Biomappings

In [3]:
biomappings_dd = index_mappings(
    biomappings.load_mappings(),
    path=EVALUATION.join(name="positive_mapping_index.pkl"),
)

Predicted mappings from Biomappings

In [4]:
biomappings_predictions_dd = index_mappings(
    biomappings.load_predictions(),
    path=EVALUATION.join(name="predicted_mapping_index.pkl"),
)

## Primary Mappings

Get primary mappings from 1) OBO ontologies that can be parsed with ROBOT and 2) other resources, via PyOBO.

In [5]:
# Primary mappings from OBO and other sources are going in here
primary_dd = defaultdict(dict)
summary_rows = []

summary_rows.extend(get_obo_mappings(primary_dd, biomappings_dd))
summary_rows.extend(get_non_obo_mappings(primary_dd, biomappings_dd))

# Primary Resource Value Added

While importing data, summaries of the value added on top of each primary resource's mappings were calculated.

In [6]:
summary_df = pd.DataFrame(
    summary_rows,
    columns=[
        "resource",
        "version",
        "external",
        "primary_xrefs",
        "biomappings_xrefs",
        "total_xrefs",
        "percentage_gain",
    ],
)
# `pd.option_context` only takes effect inside a `with` block, so set the option directly
pd.set_option("display.max_rows", summary_df.shape[0])
summary_df

| | resource | version | external | primary_xrefs | biomappings_xrefs | total_xrefs | percentage_gain |
|---|---|---|---|---|---|---|---|
| 0 | doid | 2022-11-01 | umls | 6790 | 426 | 7216 | 6.3 |
| 1 | doid | 2022-11-01 | mesh | 2738 | 2905 | 5643 | 106.1 |
| 2 | doid | 2022-11-01 | efo | 131 | 126 | 257 | 96.2 |
| 3 | mondo | 2022-11-01 | umls | 16771 | 0 | 16771 | 0.0 |
| 4 | mondo | 2022-11-01 | mesh | 7982 | 414 | 8396 | 5.2 |
| 5 | mondo | 2022-11-01 | doid | 9895 | 0 | 9895 | 0.0 |
| 6 | mondo | 2022-11-01 | efo | 2865 | 0 | 2865 | 0.0 |
| 7 | efo | 3.47.0 | mesh | 2429 | 216 | 2645 | 8.9 |
| 8 | efo | 3.47.0 | doid | 2300 | 126 | 2426 | 5.5 |
| 9 | efo | 3.47.0 | cl | 11 | 0 | 11 | 0.0 |
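
The `percentage_gain` column is consistent with dividing the number of cross-references contributed by Biomappings by the number already present in the primary resource. The formula below is inferred from the table rather than taken from `biomappings.paper_analysis`, and the `percentage_gain` helper is a hypothetical name used only for illustration.

```python
def percentage_gain(primary_xrefs: int, biomappings_xrefs: int) -> float:
    """Return the gain from Biomappings relative to the primary xref count, in percent."""
    if primary_xrefs == 0:
        # Assumption: the gain is treated as undefined when there are no primary xrefs
        return float("nan")
    return round(100 * biomappings_xrefs / primary_xrefs, 1)


# Counts copied from the doid rows of the table above
doid_umls_gain = percentage_gain(6790, 426)
doid_mesh_gain = percentage_gain(2738, 2905)
print(doid_umls_gain, doid_mesh_gain)  # 6.3 106.1
```

A gain above 100%, as for doid to mesh, means that Biomappings contributes more cross-references than the primary resource itself provides.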


# Secondary Resource Value Added

The following sections show how the extra mappings available in Biomappings supplement the primary mappings in several practical data integration scenarios. For example, the Comparative Toxicogenomics Database (CTD) uses MeSH identifiers for chemicals. Therefore, to integrate the CTD with resources that use ChEBI identifiers, it's necessary to map from MeSH to ChEBI. Unfortunately, MeSH contains no primary mappings to ChEBI and ChEBI contains no primary mappings to MeSH, so every MeSH-ChEBI mapping in Biomappings covers a chemical that could not otherwise be mapped directly.
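
Each evaluation in this section follows the same pattern: count how many of a dataset's identifiers remain unmappable as more mapping sources are layered on. The sketch below illustrates that pattern only. It is not the implementation of `Result.make`, and the flat index shape, the `count_unmappable` helper, and all identifiers are assumptions made up for the example.

```python
from collections.abc import Iterable, Mapping

# Assumed index shape: {(source_prefix, target_prefix): {source_id: target_id}}
Index = Mapping[tuple[str, str], Mapping[str, str]]


def count_unmappable(
    identifiers: Iterable[str], source: str, target: str, *indexes: Index
) -> int:
    """Count identifiers that have no source-to-target mapping in any given index."""
    mappable: set[str] = set()
    for index in indexes:
        # Updating a set from a mapping adds the mapping's keys (the source identifiers)
        mappable.update(index.get((source, target), {}))
    return len(set(identifiers) - mappable)


# Toy data with made-up identifiers
dataset_ids = {"A", "B", "C", "D"}
primary: Index = {}
curated: Index = {("mesh", "chebi"): {"A": "1"}}
predicted: Index = {("mesh", "chebi"): {"B": "2", "C": "3"}}

missing_primary = count_unmappable(dataset_ids, "mesh", "chebi", primary)
missing_curated = count_unmappable(dataset_ids, "mesh", "chebi", primary, curated)
missing_predicted = count_unmappable(dataset_ids, "mesh", "chebi", primary, curated, predicted)
print(missing_primary, missing_curated, missing_predicted)  # 4 3 1
```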

In [7]:
evaluation_results = []

## Mapping Chemicals in the CTD Chemical-Gene Interactions Dataset

The chemical-gene interaction set uses MeSH identifiers for chemicals and NCBIGene identifiers for genes. The following cells assess the number of these chemicals that can be mapped via MeSH to ChEBI.

Caveat: both MeSH and CTD also track CAS numbers for some chemicals, which can in some cases be used to make two-hop mappings to ChEBI.
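
A two-hop mapping of the kind mentioned in this caveat can be sketched as the composition of two lookup tables. This notebook does not perform such mappings. The `compose` helper and the identifiers below are placeholders made up for illustration, not real MeSH, CAS, or ChEBI records.

```python
def compose(first: dict[str, str], second: dict[str, str]) -> dict[str, str]:
    """Chain two mappings, keeping only keys whose intermediate value appears in `second`."""
    return {
        source: second[intermediate]
        for source, intermediate in first.items()
        if intermediate in second
    }


# Placeholder identifiers, not real records
mesh_to_cas = {"mesh-1": "cas-1", "mesh-2": "cas-2"}
cas_to_chebi = {"cas-1": "chebi-1"}

mesh_to_chebi = compose(mesh_to_cas, cas_to_chebi)
print(mesh_to_chebi)  # {'mesh-1': 'chebi-1'}
```

A composed mapping is only as reliable as its weakest hop, so an error in either table carries through to the result.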

In [8]:
CTD_CHEMICAL_GENE_URL = "https://ctdbase.org/reports/CTD_chem_gene_ixns.tsv.gz"
ctd_header = [
    "chemical_name",
    "chemical_mesh_id",
    "chemical_cas",
    "gene_symbol",
    "gene_ncbigene_id",
    "gene_forms",
    "organism_name",
    "organism_ncbitaxon_id",
    "evidence",
    "interaction",
    "pubmed_ids",
]
# `squeeze=True` is deprecated in `pd.read_csv`, so squeeze the single column afterwards
ctd_gene_chemical_df = EVALUATION.ensure_csv(
    url=CTD_CHEMICAL_GENE_URL,
    read_csv_kwargs={
        "sep": "\t",
        "comment": "#",
        "header": None,
        "dtype": str,
        "keep_default_na": False,
        "usecols": [1],
    },
).squeeze("columns")





In [9]:
result = Result.make(
    dataset="ctd-chemical-gene",
    source="mesh",
    target="chebi",
    datasource_identifiers=set(ctd_gene_chemical_df.tolist()),
    primary=primary_dd,
    secondary=biomappings_dd,
    tertiary=biomappings_predictions_dd,
)
result.print()
evaluation_results.append(result)

Missing                        Unmappable to chebi    % Unmappable
-----------------------------  ---------------------  --------------
Total in ctd-chemical-gene     14,337
Missing w/ mesh                14,337                 100.00%
Missing w/ mesh + BM.          12,884                 89.9%
Missing w/ mesh + BM. + Pred.  9,086                  63.4%


## Mapping Diseases in the CTD Chemical-Diseases Dataset

The chemical-disease association set uses MeSH identifiers for chemicals and either MeSH or [Online Mendelian Inheritance in Man (OMIM)](https://bioregistry.io/registry/omim) identifiers for diseases. The following cells assess the number of unique diseases appearing in this dataset that can be mapped via MeSH to 1) the [Disease Ontology (DO)](https://bioregistry.io/registry/doid) and 2) the [Monarch Disease Ontology (MONDO)](https://bioregistry.io/registry/mondo).

In [10]:
CTD_CHEMICAL_DISEASES_URL = "https://ctdbase.org/reports/CTD_chemicals_diseases.tsv.gz"
"""
    ChemicalName
    ChemicalID (MeSH identifier)
    CasRN (CAS Registry Number, if available)
    DiseaseName
    DiseaseID (MeSH or OMIM identifier)
    DirectEvidence ('|'-delimited list)
    InferenceGeneSymbol
    InferenceScore
    OmimIDs ('|'-delimited list)
    PubMedIDs ('|'-delimited list)
"""
# `squeeze=True` is deprecated in `pd.read_csv`, so squeeze the single column afterwards
ctd_chemical_diseases_df = EVALUATION.ensure_csv(
    url=CTD_CHEMICAL_DISEASES_URL,
    read_csv_kwargs={
        "sep": "\t",
        "comment": "#",
        "header": None,
        "dtype": str,
        "keep_default_na": False,
        "usecols": [4],
    },
).squeeze("columns")
ctd_chemical_diseases_df.head()





0       MESH:D054198
1       MESH:D000230
2    MESH:D000077192
3       MESH:D000505
4       MESH:D013734
Name: 4, dtype: object

Index the diseases identified with MeSH and those identified with OMIM separately.

In [11]:
ctd_chemical_diseases_mesh = {
    x.split(":")[1] for x in ctd_chemical_diseases_df.tolist() if x.startswith("MESH")
}
ctd_chemical_diseases_omim = {
    x.split(":")[1] for x in ctd_chemical_diseases_df.tolist() if x.startswith("OMIM")
}
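
The same split can be made in a single pass by grouping identifiers under their prefix, which avoids scanning the column once per vocabulary. This is a minimal sketch of that alternative, not part of the original analysis. The `group_by_prefix` helper is a hypothetical name, and the OMIM identifier in the example is a made-up placeholder.

```python
from collections import defaultdict


def group_by_prefix(curies):
    """Group local identifiers by the lowercased prefix of each CURIE."""
    grouped = defaultdict(set)
    for curie in curies:
        prefix, sep, identifier = curie.partition(":")
        if sep:  # skip strings that are not prefix:identifier pairs
            grouped[prefix.lower()].add(identifier)
    return dict(grouped)


# The MeSH CURIEs come from the output above; the OMIM one is a made-up placeholder
examples = ["MESH:D054198", "MESH:D000230", "OMIM:000000"]
grouped = group_by_prefix(examples)
print(sorted(grouped["mesh"]))  # ['D000230', 'D054198']
print(sorted(grouped["omim"]))  # ['000000']
```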

Calculate value added summary for the Disease Ontology (DO).

In [12]:
result = Result.make(
    dataset="ctd-chemical-disease",
    source="mesh",
    target="doid",
    datasource_identifiers=ctd_chemical_diseases_mesh,
    primary=primary_dd,
    secondary=biomappings_dd,
    tertiary=biomappings_predictions_dd,
)
result.print()
evaluation_results.append(result)

Missing                        Unmappable to doid    % Unmappable
-----------------------------  --------------------  --------------
Total in ctd-chemical-disease  5,821
Missing w/ mesh                3,017                 51.83%
Missing w/ mesh + BM.          2,676                 46.0%
Missing w/ mesh + BM. + Pred.  2,633                 45.2%


Calculate value added summary for the Monarch Disease Ontology (MONDO).

In [13]:
result = Result.make(
    dataset="ctd-chemical-disease",
    source="mesh",
    target="mondo",
    datasource_identifiers=ctd_chemical_diseases_mesh,
    primary=primary_dd,
    secondary=biomappings_dd,
    tertiary=biomappings_predictions_dd,
)
result.print()
evaluation_results.append(result)

Missing                        Unmappable to mondo    % Unmappable
-----------------------------  ---------------------  --------------
Total in ctd-chemical-disease  5,821
Missing w/ mesh                1,472                  25.29%
Missing w/ mesh + BM.          1,461                  25.1%
Missing w/ mesh + BM. + Pred.  1,418                  24.4%


## Mapping Side Effects in the SIDER Dataset

The [Side Effect Resource (SIDER)](http://sideeffects.embl.de/) contains manually annotated side effects and indications from drug labels. Side effects are stored with the [Unified Medical Language System (UMLS)](https://bioregistry.io/registry/umls) vocabulary. Similarly to the analysis of the CTD's chemical-disease dataset, the following cells assess the number of unique side effects in SIDER that can be mapped via UMLS to the [Disease Ontology (DO)](https://bioregistry.io/registry/doid).


In [14]:
SIDER_URL = "http://sideeffects.embl.de/media/download/meddra_all_se.tsv.gz"

SIDE_EFFECTS_HEADER = [
    "STITCH_FLAT_ID",
    "STITCH_STEREO_ID",
    "UMLS CUI from Label",
    "MedDRA Concept Type",
    "UMLS CUI from MedDRA",
    "MedDRA Concept name",
]

side_effects_df = EVALUATION.ensure_csv(
    url=SIDER_URL,
    read_csv_kwargs={
        "dtype": str,
        "header": None,
        "names": SIDE_EFFECTS_HEADER,
    },
)
side_effects_df

Unnamed: 0,STITCH_FLAT_ID,STITCH_STEREO_ID,UMLS CUI from Label,MedDRA Concept Type,UMLS CUI from MedDRA,MedDRA Concept name
0,CID100000085,CID000010917,C0000729,LLT,C0000729,Abdominal cramps
1,CID100000085,CID000010917,C0000729,PT,C0000737,Abdominal pain
2,CID100000085,CID000010917,C0000737,LLT,C0000737,Abdominal pain
3,CID100000085,CID000010917,C0000737,PT,C0687713,Gastrointestinal pain
4,CID100000085,CID000010917,C0000737,PT,C0000737,Abdominal pain
...,...,...,...,...,...,...
309844,CID171306834,CID071306834,C3203358,PT,C1145670,Respiratory failure
309845,CID171306834,CID071306834,C3665386,LLT,C3665386,Abnormal vision
309846,CID171306834,CID071306834,C3665386,PT,C3665347,Visual impairment
309847,CID171306834,CID071306834,C3665596,LLT,C3665596,Warts


In [15]:
result = Result.make(
    dataset="sider",
    source="umls",
    target="doid",
    datasource_identifiers=set(side_effects_df["UMLS CUI from Label"].unique()),
    primary=primary_dd,
    secondary=biomappings_dd,
    tertiary=biomappings_predictions_dd,
)
result.print()
evaluation_results.append(result)

Missing                        Unmappable to doid    % Unmappable
-----------------------------  --------------------  --------------
Total in sider                 5,868
Missing w/ umls                4,729                 80.59%
Missing w/ umls + BM.          4,695                 80.0%
Missing w/ umls + BM. + Pred.  4,618                 78.7%


## Mapping Cell Lines in the CCLE Achilles Dataset

The [Cancer Cell Line Encyclopedia (CCLE)](https://sites.broadinstitute.org/ccle) stores comparative experiments of chemical and genetic perturbations across a wide array of cancer cell lines. However, it's difficult to directly map from this vocabulary to others. The following cell assesses the ability to map CCLE cell lines to the [Experimental Factor Ontology (EFO)](https://bioregistry.io/registry/efo).

The `sample_info.csv` data file can be downloaded from https://depmap.org/portal/download/ which redirects to a FigShare download at https://ndownloader.figshare.com/files/35020903. Unfortunately, this could not be automated, so this file is included in the same directory as this notebook.

In [16]:
ccle_achilles_df = pd.read_csv("sample_info.csv")
ccle_achilles_df.head()

Unnamed: 0,DepMap_ID,cell_line_name,stripped_cell_line_name,CCLE_Name,alias,COSMICID,sex,source,RRID,WTSI_Master_Cell_ID,...,lineage_sub_subtype,lineage_molecular_subtype,default_growth_pattern,model_manipulation,model_manipulation_details,patient_id,parent_depmap_id,Cellosaurus_NCIt_disease,Cellosaurus_NCIt_id,Cellosaurus_issues
0,ACH-000016,SLR 21,SLR21,SLR21_KIDNEY,,,,Academic lab,CVCL_V607,,...,,,,,,PT-JnARLB,,Clear cell renal cell carcinoma,C4033,
1,ACH-000032,MHH-CALL-3,MHHCALL3,MHHCALL3_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE,,,Female,DSMZ,CVCL_0089,,...,b_cell,,,,,PT-p2KOyI,,Childhood B acute lymphoblastic leukemia,C9140,
2,ACH-000033,NCI-H1819,NCIH1819,NCIH1819_LUNG,,,Female,Academic lab,CVCL_1497,,...,NSCLC_adenocarcinoma,,,,,PT-9p1WQv,,Lung adenocarcinoma,C3512,
3,ACH-000043,Hs 895.T,HS895T,HS895T_FIBROBLAST,,,Female,ATCC,CVCL_0993,,...,,,2D: adherent,,,PT-rTUVZQ,,Melanoma,C3224,
4,ACH-000049,HEK TE,HEKTE,HEKTE_KIDNEY,,,,Academic lab,CVCL_WS59,,...,,,,immortalized,,PT-qWYYgr,,,,No information is available about this cell li...


In [17]:
result = Result.make(
    dataset="ccle-achilles",
    source="ccle",
    target="efo",
    datasource_identifiers=set(ccle_achilles_df["CCLE_Name"].unique()),
    primary=primary_dd,
    secondary=biomappings_dd,
    tertiary=biomappings_predictions_dd,
)
result.print()
evaluation_results.append(result)

Missing                        Unmappable to efo    % Unmappable
-----------------------------  -------------------  --------------
Total in ccle-achilles         1,837
Missing w/ ccle                1,837                100.00%
Missing w/ ccle + BM.          1,326                72.2%
Missing w/ ccle + BM. + Pred.  1,270                69.1%


## Mapping Chemical Reactions in the Rhea Dataset

The Rhea database stores information about biologically relevant chemical reactions using ChEBI. The following cells assess the ability to map the ChEBI chemicals used in Rhea reactions to MeSH for integration with other resources that use MeSH. This is the "inverse" of the problem presented for the CTD Chemical-Gene dataset.

In [18]:
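RHEA_URL = "https://ftp.expasy.org/databases/rhea/tsv/chebiId%5Fname.tsv"
# `squeeze=True` is deprecated in `pd.read_csv`, so squeeze the single column afterwards
rhea_chebi_series = pd.read_csv(RHEA_URL, sep="\t", header=None, usecols=[0]).squeeze("columns")
rhea_chebi_ids = {
    curie.removeprefix("CHEBI:")
    for curie in rhea_chebi_series
    if curie.startswith("CHEBI:")
}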
result = Result.make(
    dataset="rhea",
    source="chebi",
    target="mesh",
    datasource_identifiers=rhea_chebi_ids,
    primary=primary_dd,
    secondary=biomappings_dd,
    tertiary=biomappings_predictions_dd,
)
result.print()
evaluation_results.append(result)





Missing                         Unmappable to mesh    % Unmappable
------------------------------  --------------------  --------------
Total in rhea                   10,970
Missing w/ chebi                10,970                100.00%
Missing w/ chebi + BM.          10,840                98.8%
Missing w/ chebi + BM. + Pred.  10,005                91.2%


## Summary of Practical Integration Problems

Note that for brevity, not all possible mappings between all possible resources were included. For example, only the CCLE to EFO mappings were shown, although several other resources, such as the Cell Ontology (CL) and Cellosaurus, also contain mappings.

Further, this notebook did not consider two-hop or multi-hop mappings. In practice, these are very useful, but they also increase the complexity of the code, data structures, and workflows needed to map datasets.

The data are presented with the following statistics:

1. Missing with primary (`missing_w_primary`): the total number of unique entities that could not be mapped using primary resources' mappings.
2. Missing with primary percentage (`m1 (%)`): the percentage of the total number of unique entities that could not be mapped using primary resources' mappings.
3. Missing with curations (`missing_w_curations`): the total number of unique entities that could not be mapped using a combination of primary resources' mappings and manually curated content from Biomappings.
4. Missing with curations percentage (`m2 (%)`): the percentage of the total number of unique entities that could not be mapped using a combination of primary resources' mappings and manually curated content from Biomappings.
5. Missing with curations percentage delta (`m2 (%) Δ`): the decrease in the percentage unmapped, relative to `m1 (%)`, when adding manual curations from Biomappings, in percentage points. Bigger is better.
6. Missing with curations *and* predictions (`missing_w_predictions`): the total number of unique entities that could not be mapped using a combination of primary resources' mappings, manually curated content from Biomappings, and predicted mappings from Biomappings. This gives insight into the potential benefit of further curation in Biomappings.
7. Missing with curations *and* predictions percentage (`m3 (%)`): the percentage of the total number of unique entities that could not be mapped using a combination of primary resources' mappings, manually curated content from Biomappings, and predicted mappings from Biomappings.
8. Missing with curations *and* predictions percentage delta (`m3 (%) Δ`): the decrease in the percentage unmapped, relative to `m1 (%)`, when adding manual curations and predictions from Biomappings, in percentage points. Bigger is better.

In [19]:
evaluation_df_rows = []
for result in evaluation_results:
    evaluation_df_rows.append(
        (
            result.dataset,
            result.source,
            result.target,
            result.total,
            result.missing,
            round(100 * result.missing / result.total, 1),
            result.missing_biomappings,
            round(100 * result.missing_biomappings / result.total, 1),
            round(100 * (result.missing - result.missing_biomappings) / result.total, 1),
            result.missing_predictions,
            round(100 * result.missing_predictions / result.total, 1),
            round(100 * (result.missing - result.missing_predictions) / result.total, 1),
        )
    )
pd.DataFrame(
    evaluation_df_rows,
    columns=[
        "dataset",
        "source",
        "target",
        "total",
        "missing_w_primary",
        "m1 (%)",
        "missing_w_curations",
        "m2 (%)",
        "m2 (%) Δ",
        "missing_w_predictions",
        "m3 (%)",
        "m3 (%) Δ",
    ],
)

| | dataset | source | target | total | missing_w_primary | m1 (%) | missing_w_curations | m2 (%) | m2 (%) Δ | missing_w_predictions | m3 (%) | m3 (%) Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ctd-chemical-gene | mesh | chebi | 14337 | 14337 | 100.0 | 12884 | 89.9 | 10.1 | 9086 | 63.4 | 36.6 |
| 1 | ctd-chemical-disease | mesh | doid | 5821 | 3017 | 51.8 | 2676 | 46.0 | 5.9 | 2633 | 45.2 | 6.6 |
| 2 | ctd-chemical-disease | mesh | mondo | 5821 | 1472 | 25.3 | 1461 | 25.1 | 0.2 | 1418 | 24.4 | 0.9 |
| 3 | sider | umls | doid | 5868 | 4729 | 80.6 | 4695 | 80.0 | 0.6 | 4618 | 78.7 | 1.9 |
| 4 | ccle-achilles | ccle | efo | 1837 | 1837 | 100.0 | 1326 | 72.2 | 27.8 | 1270 | 69.1 | 30.9 |
| 5 | rhea | chebi | mesh | 10970 | 10970 | 100.0 | 10840 | 98.8 | 1.2 | 10005 | 91.2 | 8.8 |
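
As a check on how the percentage columns relate, the `ctd-chemical-disease` to `doid` row can be recomputed from its counts with the same expressions used in the cell above. Each delta is computed from the raw counts before rounding, which is why `m2 (%) Δ` is 5.9 rather than the 5.8 obtained by subtracting the rounded `m2 (%)` from the rounded `m1 (%)`.

```python
# Counts copied from the ctd-chemical-disease / doid row
total = 5821
missing = 3017
missing_biomappings = 2676
missing_predictions = 2633

m1 = round(100 * missing / total, 1)
m2 = round(100 * missing_biomappings / total, 1)
m2_delta = round(100 * (missing - missing_biomappings) / total, 1)
m3 = round(100 * missing_predictions / total, 1)
m3_delta = round(100 * (missing - missing_predictions) / total, 1)

print(m1, m2, m2_delta)  # 51.8 46.0 5.9
print(m3, m3_delta)  # 45.2 6.6
```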
