# Mapping from CCLE to Cellosaurus/EFO

This notebook supports the case study for mapping cancer cell lines.

In [1]:
from collections import defaultdict

import bioregistry
import bioversions
import networkx as nx
import pandas as pd
import pyobo
from bioregistry import curie_to_str

import biomappings

In [2]:
prefixes = ["ccle", "depmap", "cellosaurus", "efo"]

In [3]:
for prefix in prefixes:
    try:
        print(prefix, bioversions.get_version(prefix))
    except Exception:  # version lookup failed for this prefix
        print(prefix, "missing")



ccle missing
depmap 22Q2
cellosaurus missing
efo 3.47.0


## Load Primary Mappings

Primary mappings are loaded into a directed graph. Nodes represent entities, encoded as canonical Bioregistry CURIEs. A directed edge represents a mapping from the source node to the target node. Only mappings between two different prefixes in the list above are kept.

In [4]:
graph = nx.DiGraph()

for prefix in prefixes:
    df = pyobo.get_xrefs_df(prefix)
    for source_id, target_ns, target_id in df.values:
        if target_ns not in prefixes or prefix == target_ns:
            continue

        source_id = bioregistry.standardize_identifier(prefix, source_id)
        target_id = bioregistry.standardize_identifier(target_ns, target_id)

        graph.add_edge(
            curie_to_str(prefix, source_id),
            curie_to_str(target_ns, target_id),
        )
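
The filtering rule in the loop above can be seen on a toy example. This is a minimal sketch in plain Python: the cross-reference rows and identifiers are made up, and `to_curie` is an illustrative stand-in for `curie_to_str` that skips Bioregistry's identifier standardization.

```python
prefixes = ["ccle", "depmap", "cellosaurus", "efo"]


def to_curie(prefix, identifier):
    """Illustrative stand-in for curie_to_str (no standardization applied)."""
    return f"{prefix}:{identifier}"


# Hypothetical xref rows: (source prefix, source id, target prefix, target id)
toy_xrefs = [
    ("cellosaurus", "CVCL_0001", "efo", "0000001"),  # kept
    ("cellosaurus", "CVCL_0001", "wikidata", "Q1"),  # dropped: target prefix not of interest
    ("cellosaurus", "CVCL_0001", "cellosaurus", "CVCL_0002"),  # dropped: same prefix
    ("ccle", "TOY_LINE", "depmap", "ACH-000000"),  # kept
]

edges = set()
for source_prefix, source_id, target_prefix, target_id in toy_xrefs:
    # Same condition as in the notebook cell: skip foreign and self-prefix targets
    if target_prefix not in prefixes or source_prefix == target_prefix:
        continue
    edges.add((to_curie(source_prefix, source_id), to_curie(target_prefix, target_id)))

for source, target in sorted(edges):
    print(source, "->", target)
```

Only the two rows that link different prefixes of interest survive as edges.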



## Summarize Primary Mappings

The mapping graph is summarized as a dataframe that counts the edges from each source prefix (rows) to each target prefix (columns).

In [5]:
def calculate_graph_df(g):
    """Summarize a mapping graph as a dataframe."""
    rows = [
        [
            sum(u.startswith(source_prefix) and v.startswith(target_prefix) for u, v in g.edges())
            for target_prefix in prefixes
        ]
        for source_prefix in prefixes
    ]
    df = pd.DataFrame(rows, columns=prefixes, index=prefixes)
    df.index.name = "source"
    df.columns.name = "target"
    return df
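
`calculate_graph_df` decides which resource a node belongs to with `startswith`, which works here because none of the four prefixes is the beginning of another. As a hedged alternative, the sketch below counts edges by splitting each CURIE at its first colon and comparing the exact prefix. The function name `count_by_prefix_pair` and the identifiers are hypothetical.

```python
from collections import Counter


def count_by_prefix_pair(edges):
    """Count edges per (source prefix, target prefix) pair."""
    return Counter(
        (source.split(":", 1)[0], target.split(":", 1)[0])
        for source, target in edges
    )


# Hypothetical edges between CURIEs
toy_edges = [
    ("ccle:TOY_A", "depmap:ACH-000001"),
    ("ccle:TOY_B", "depmap:ACH-000002"),
    ("cellosaurus:CVCL_0001", "efo:0000001"),
]

counts = count_by_prefix_pair(toy_edges)
print(counts[("ccle", "depmap")])
print(counts[("efo", "ccle")])  # pairs without edges count as zero
```

Each `(source, target)` key corresponds to one cell of the summary dataframe.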

In [6]:
summary_df = calculate_graph_df(graph)
summary_df

source \ target,ccle,depmap,cellosaurus,efo
ccle,0,1457,0,0
depmap,0,0,1678,0
cellosaurus,1448,1798,0,1302
efo,0,0,0,0


## Apply Reasoning and Inference

The graph is first composed with a reversed copy of itself so that mappings can also be followed backwards. Then, the transitive closure adds an edge between every pair of nodes connected by a path. Note that the diagonal entries now count the number of entities in a source that have mappings, potentially to any external resource.

In [7]:
closure_graph = nx.transitive_closure(nx.compose(graph, graph.reverse()), reflexive=False)

closure_summary_df = calculate_graph_df(closure_graph)
closure_summary_df

source \ target,ccle,depmap,cellosaurus,efo
ccle,1489,1483,1467,725
depmap,1483,1829,1804,763
cellosaurus,1467,1804,2340,1304
efo,725,763,1304,1314
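
To make the effect of this step concrete, here is a minimal sketch in plain Python (without networkx) of what the symmetrized closure computes: every node ends up linked to every node it can reach once reverse edges are allowed. The function `symmetric_closure` and the identifiers are hypothetical. The sketch assumes that `reflexive=False` yields a self-loop only for nodes that lie on a cycle, which after adding reverse edges is every node with at least one mapping.

```python
from collections import defaultdict, deque


def symmetric_closure(edges):
    """Return all (u, v) pairs where v is reachable from u once reverse edges are added."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)  # reverse edge, mirroring the composition with graph.reverse()

    closure = set()
    for start in list(neighbors):
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        # `start` is in `seen` only if a path leads back to it (a cycle)
        closure.update((start, reached) for reached in seen)
    return closure


# Hypothetical primary mappings: two resources both point at the same DepMap entry
toy_edges = [
    ("ccle:TOY_A", "depmap:ACH-000001"),
    ("cellosaurus:CVCL_0001", "depmap:ACH-000001"),
]

closure = symmetric_closure(toy_edges)
print(("ccle:TOY_A", "cellosaurus:CVCL_0001") in closure)  # inferred via DepMap
print(("ccle:TOY_A", "ccle:TOY_A") in closure)  # self-pair, counted on the diagonal
```

The inferred CCLE-Cellosaurus pair is the kind of edge that fills the previously empty cells of the summary, and the self-pairs are what populate the diagonal.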


## Added Benefit of Biomappings

First, we'll load the Biomappings curated content and filter to relevant rows and columns.

In [8]:
mappings_df = pd.DataFrame(biomappings.load_mappings())
idx = (
    mappings_df["source prefix"].isin(prefixes)
    & mappings_df["target prefix"].isin(prefixes)
    & (mappings_df["relation"] == "skos:exactMatch")
)
columns = ["source prefix", "source identifier", "target prefix", "target identifier"]
mappings_df = mappings_df[idx][columns]
mappings_df

index,source prefix,source identifier,target prefix,target identifier
142,ccle,1321N1_CENTRAL_NERVOUS_SYSTEM,cellosaurus,CVCL_0110
143,ccle,143B_BONE,cellosaurus,CVCL_2270
144,ccle,143B_BONE,efo,0006355
145,ccle,22RV1_PROSTATE,cellosaurus,CVCL_1045
146,ccle,22RV1_PROSTATE,efo,0002095
...,...,...,...,...
762,ccle,YMB1_BREAST,efo,0006779
763,ccle,ZR751_BREAST,cellosaurus,CVCL_0588
764,ccle,ZR751_BREAST,efo,0001262
765,ccle,ZR7530_BREAST,cellosaurus,CVCL_1661


Count the curated mappings from CCLE to each target prefix.

In [9]:
mappings_df[mappings_df["source prefix"] == "ccle"].groupby("target prefix").count()[
    "target identifier"
]

target prefix
cellosaurus    106
depmap           1
efo            516
Name: target identifier, dtype: int64

In [10]:
added_value_table = []
total_value_table = []
for source_prefix in prefixes:
    row_added_value = []
    row_total_value = []
    for target_prefix in prefixes:
        before = sum(
            u.startswith(source_prefix) and v.startswith(target_prefix)
            for u, v in closure_graph.edges()
        )
        benefit = sum(
            not closure_graph.has_edge(
                curie_to_str(source_prefix, source_id),
                curie_to_str(target_prefix, target_id),
            )
            for _, source_id, _, target_id in mappings_df[
                (mappings_df["source prefix"] == source_prefix)
                & (mappings_df["target prefix"] == target_prefix)
            ].values
        )
        row_added_value.append(benefit)
        row_total_value.append(before + benefit)

    added_value_table.append(row_added_value)
    total_value_table.append(row_total_value)

added_value_df = pd.DataFrame(added_value_table, index=prefixes, columns=prefixes)
added_value_df.index.name = "source"
added_value_df.columns.name = "target"
added_value_df

source \ target,ccle,depmap,cellosaurus,efo
ccle,0,1,106,516
depmap,0,0,0,0
cellosaurus,0,0,0,0
efo,0,0,0,0


In [11]:
total_value_df = pd.DataFrame(total_value_table, index=prefixes, columns=prefixes)
total_value_df.index.name = "source"
total_value_df.columns.name = "target"
total_value_df

source \ target,ccle,depmap,cellosaurus,efo
ccle,1489,1484,1573,1241
depmap,1483,1829,1804,763
cellosaurus,1467,1804,2340,1304
efo,725,763,1304,1314
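
The two tables above follow a simple rule: a curated mapping adds value only if the closure does not already contain that edge, and the total is the closure count plus the added value (for example, 725 + 516 = 1241 for CCLE to EFO). The sketch below replays that rule on made-up data. The helper names and identifiers are hypothetical, and plain sets stand in for the graph and dataframe.

```python
# Hypothetical edges already present in the closure
closure_edges = {
    ("ccle:TOY_A", "cellosaurus:CVCL_0001"),
    ("ccle:TOY_A", "efo:0000001"),
}

# Hypothetical curated mappings
curated = [
    ("ccle:TOY_A", "efo:0000001"),  # already in the closure, adds nothing
    ("ccle:TOY_B", "cellosaurus:CVCL_0002"),  # new
    ("ccle:TOY_B", "efo:0000002"),  # new
]


def prefix_of(curie):
    """Return the prefix part of a CURIE such as 'efo:0000001'."""
    return curie.split(":", 1)[0]


def added_and_total(source_prefix, target_prefix):
    """Return (benefit, before + benefit) for one source/target prefix pair."""
    before = sum(
        prefix_of(u) == source_prefix and prefix_of(v) == target_prefix
        for u, v in closure_edges
    )
    benefit = sum(
        (u, v) not in closure_edges
        for u, v in curated
        if prefix_of(u) == source_prefix and prefix_of(v) == target_prefix
    )
    return benefit, before + benefit


print(added_and_total("ccle", "efo"))
print(added_and_total("depmap", "efo"))
```

Here one of the two curated CCLE-EFO mappings is already inferable, so only the other counts as added value.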


## Apply Reasoning and Inference with Biomappings

The graph is extended with Biomappings content, the transitive closure is computed, and new statistics are reported. Note that inference yields several CCLE-EFO mappings beyond the direct ones: 1246 here versus 1241 in the previous table.

In [12]:
extended_closure_graph = closure_graph.copy()


for source_prefix, source_id, target_prefix, target_id in mappings_df[columns].values:
    extended_closure_graph.add_edge(
        curie_to_str(source_prefix, source_id),
        curie_to_str(target_prefix, target_id),
    )

extended_closure_graph = nx.transitive_closure(extended_closure_graph, reflexive=False)

extended_closure_summary_df = calculate_graph_df(extended_closure_graph)
extended_closure_summary_df

source \ target,ccle,depmap,cellosaurus,efo
ccle,1490,1484,1575,1246
depmap,1483,1829,1896,1281
cellosaurus,1467,1804,2431,1817
efo,725,763,1360,1828
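
The table above is no longer symmetric: CCLE to EFO rises to 1246 while EFO to CCLE stays at 725. This appears to follow from the loop in the cell, which adds each curated mapping only in its source-to-target direction. The sketch below illustrates the effect with a plain directed breadth-first search. The function `reachable` and the identifiers are hypothetical.

```python
from collections import defaultdict, deque


def reachable(edges, start):
    """Return the set of nodes reachable from `start` along directed edges."""
    successors = defaultdict(set)
    for u, v in edges:
        successors[u].add(v)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


toy_edges = [
    # Primary mapping, present in both directions after symmetrization
    ("cellosaurus:CVCL_0001", "efo:0000001"),
    ("efo:0000001", "cellosaurus:CVCL_0001"),
    # Curated mapping, added in one direction only
    ("ccle:TOY_A", "cellosaurus:CVCL_0001"),
]

print("efo:0000001" in reachable(toy_edges, "ccle:TOY_A"))  # inferred through Cellosaurus
print("ccle:TOY_A" in reachable(toy_edges, "efo:0000001"))  # no path back to CCLE
```

Under this reading, adding the reverse of each curated edge before computing the closure would restore symmetry, at the cost of changing the counts reported above.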
