# Mapping from CCLE to Cellosaurus/EFO

This notebook supports the case study for mapping cancer cell lines.

In [1]:
from collections import Counter, defaultdict

import bioregistry
import bioversions
import networkx as nx
import pandas as pd
import pyobo
from bioregistry import curie_to_str

import biomappings

In [2]:
prefixes = ["ccle", "depmap", "cellosaurus", "efo"]

This is using the 2019 version of the CCLE downloaded from the cBioPortal. It appears since, they have removed the ability to bulk download data.

In [3]:
for prefix in prefixes:
    try:
        print(prefix, bioversions.get_version(prefix))
    except:
        print(prefix, "missing")

ccle missing
depmap 22Q4
cellosaurus 44.0
efo 3.49.0


## Load Primary Mappings

Primary mappings are loaded into a directed graph. Nodes represent entities, encoded in canonical Bioregisry CURIEs. Directed edges represent the existence of a mapping from the source node to the target node.

In [4]:
graph = nx.DiGraph()

for prefix in prefixes:
    if prefix == "cellosaurus":
        df = pd.read_csv("cellosaurus_43_xrefs.tsv", sep="\t")
    else:
        df = pyobo.get_xrefs_df(prefix)
    for source_id, target_ns, target_id in df.values:
        if target_ns not in prefixes or prefix == target_ns:
            continue

        source_id = source_id.removeprefix("EFO_")
        source_id = bioregistry.standardize_identifier(prefix, source_id)
        target_id = target_id.removeprefix("EFO_")
        target_id = bioregistry.standardize_identifier(target_ns, target_id)

        graph.add_edge(
            curie_to_str(prefix, source_id),
            curie_to_str(target_ns, target_id),
        )



In [5]:
# check all CURIEs are correct
for curie in graph:
    assert bioregistry.is_valid_curie(curie)

## Summarize Primary Mappings

The mapping graph is summarized as a dataframe.

In [6]:
def calculate_graph_df(g):
    """Summarize a mapping graph as a dataframe."""
    rows = [
        [
            sum(u.startswith(source_prefix) and v.startswith(target_prefix) for u, v in g.edges())
            for target_prefix in prefixes
        ]
        for source_prefix in prefixes
    ]
    df = pd.DataFrame(rows, columns=prefixes, index=prefixes)
    df.index.name = "source"
    df.columns.name = "target"
    return df

In [7]:
summary_df = calculate_graph_df(graph)
summary_df

target,ccle,depmap,cellosaurus,efo
source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ccle,0,1457,0,0
depmap,0,0,1678,0
cellosaurus,1448,1798,0,1302
efo,0,0,0,0


## Check Benefits of Inference 

Check if there is any benefit to mapping from CCLE to EFO via only Cellosaurus vs. via DepMap then Cellosaurus

In [17]:
direct_cellosaurus = set()
indirect_cellosaurus = set()

via_cellosaurus = set()
via_depmap_cellosaurus = set()

undirected_graph = graph.to_undirected()

for ccle_node in undirected_graph:
    if not ccle_node.startswith("ccle:"):
        continue
    for cellosaurus_node in undirected_graph[ccle_node]:
        if not cellosaurus_node.startswith("cellosaurus:"):
            continue
        direct_cellosaurus.add(cellosaurus_node)
        for efo_node in undirected_graph[cellosaurus_node]:
            if not efo_node.startswith("efo:"):
                continue
            via_cellosaurus.add(efo_node)

    for depmap_node in undirected_graph[ccle_node]:
        if not depmap_node.startswith("depmap:"):
            continue
        for cellosaurus_node in undirected_graph[depmap_node]:
            if not cellosaurus_node.startswith("cellosaurus:"):
                continue
            indirect_cellosaurus.add(cellosaurus_node)
            for efo_node in undirected_graph[cellosaurus_node]:
                if not efo_node.startswith("efo:"):
                    continue
                via_depmap_cellosaurus.add(efo_node)

print(
    f"""\
There are {len(direct_cellosaurus):,} Cellosaurus nodes directly mapped from CCLE.
There are {len(indirect_cellosaurus):,} Cellosaurus nodes accessible when mapping from CCLE via DepMap.
Of the {len(direct_cellosaurus.union(indirect_cellosaurus)):,} accessible, \
{len(direct_cellosaurus.intersection(indirect_cellosaurus)):,} of these are shared, \
{len(direct_cellosaurus - indirect_cellosaurus):,} are only accessible when mapping directly \
and {len(indirect_cellosaurus - direct_cellosaurus):,} are only accessible when mapping via DepMap.

There are no mappings directly from CCLE to EFO. However, Cellosaurus maps to EFO, so mapping first
from CCLE to Cellosaurus then to EFO allows for inference of CCLE to EFO mappings.

There are {len(via_cellosaurus):,} EFO nodes mappable from CCLE via Cellosaurus.
There are {len(via_depmap_cellosaurus):,} EFO nodes mappable from CCLE via DepMap and Cellosaurus.
This totals {len(via_cellosaurus.union(via_depmap_cellosaurus)):,} mappings possible, \
where {len(via_cellosaurus.intersection(via_depmap_cellosaurus)):,} of these are shared, \
{len(via_cellosaurus - via_depmap_cellosaurus):,} are only accessible when mapping via Cellosaurus, \
({via_cellosaurus - via_depmap_cellosaurus})\
and {len(via_depmap_cellosaurus - via_cellosaurus):,} are only accessible when mapping via DepMap then \
Cellosaurus ({via_depmap_cellosaurus - via_cellosaurus}).
"""
)

There are 1,444 Cellosaurus nodes directly mapped from CCLE.
There are 1,450 Cellosaurus nodes accessible when mapping from CCLE via DepMap.
Of the 1,455 accessible, 1,439 of these are shared, 5 are only accessible when mapping directly and 11 are only accessible when mapping via DepMap.

There are no mappings directly from CCLE to EFO. However, Cellosaurus maps to EFO, so mapping first
from CCLE to Cellosaurus then to EFO allows for inference of CCLE to EFO mappings.

There are 718 EFO nodes mappable from CCLE via Cellosaurus.
There are 718 EFO nodes mappable from CCLE via DepMap and Cellosaurus.
This totals 719 mappings possible, where 717 of these are shared, 1 are only accessible when mapping via Cellosaurus, ({'efo:0003125'})and 1 are only accessible when mapping via DepMap then Cellosaurus ({'efo:0001246'}).



## Apply Reasoning and Inference

The graph is first compose with a reverse version of itself to allow reasoning over mappings backwards. Then, the transitive closure is used to add edges between all nodes that can be reached by following a path. Note that the diagonal entries now count the number of entities in a source that have mappings, potentially to any external resource.

In [9]:
closure_graph = nx.transitive_closure(graph.to_undirected(), reflexive=False)

closure_summary_df = calculate_graph_df(closure_graph)
closure_summary_df

target,ccle,depmap,cellosaurus,efo
source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ccle,10,1470,1455,725
depmap,13,13,1690,763
cellosaurus,12,114,2,1304
efo,0,0,0,6


# Check problematic components in this mapping graph

The following finds components in the mapping graph where there are not 1-1 mappings

In [10]:
for component in nx.connected_components(closure_graph):
    prefix_counter = Counter(node.split(":", 1)[0] for node in component)
    if any(v > 1 for v in prefix_counter.values()):
        print(sorted(component))

['ccle:451LU_NS', 'ccle:451LU_SKIN', 'cellosaurus:6357', 'depmap:ACH-001002']
['ccle:HCC2279_LUNG', 'cellosaurus:5131', 'depmap:ACH-000731', 'efo:0005374', 'efo:0006431']
['ccle:KOSC2_UPPER_AERODIGESTIVE_TRACT', 'cellosaurus:1337', 'depmap:ACH-001543', 'depmap:ACH-002260']
['ccle:LC1SQ_LUNG', 'ccle:LC1_LUNG', 'cellosaurus:3008', 'depmap:ACH-002156']
['ccle:MB157_BREAST', 'ccle:MDAMB157_BREAST', 'cellosaurus:0618', 'depmap:ACH-000621', 'depmap:ACH-001120', 'efo:0001206']
['ccle:NCIH292_LUNG', 'cellosaurus:0455', 'depmap:ACH-000474', 'depmap:ACH-001075', 'efo:0006690']
['ccle:NCIH3255_LUNG', 'cellosaurus:6831', 'depmap:ACH-000109', 'depmap:ACH-002137', 'efo:0003123']
['ccle:NCIH513_PLEURA', 'cellosaurus:A570', 'depmap:ACH-002138', 'depmap:ACH-002341']
['ccle:NCIH720_LUNG', 'cellosaurus:1583', 'depmap:ACH-002174', 'efo:0001166', 'efo:0002302']
['ccle:ALEXANDERCELLS_LIVER', 'ccle:PLCPRF5_LIVER', 'cellosaurus:0485', 'depmap:ACH-001318', 'efo:0006291']
['ccle:RH30_SOFT_TISSUE', 'ccle:SJRH30_

Conclusion: in the current versions of each data source, the exact same mappings are possible using depmap as an intermediary. There are some additional mappings from DepMap to cellosaurus, but these DepMap entries are not themselves mapped to cellosaurus.

## Added Benefit of Biomappings

First, we'll load the Biomappings curated content and filter to relevant rows and columns.

In [11]:
mappings_df = pd.DataFrame(biomappings.load_mappings())
idx = (
    mappings_df["source prefix"].isin(prefixes)
    & mappings_df["target prefix"].isin(prefixes)
    & (mappings_df["relation"] == "skos:exactMatch")
)
columns = ["source prefix", "source identifier", "target prefix", "target identifier"]
mappings_df = mappings_df[idx][columns]
mappings_df

Unnamed: 0,source prefix,source identifier,target prefix,target identifier
142,ccle,1321N1_CENTRAL_NERVOUS_SYSTEM,cellosaurus,CVCL_0110
143,ccle,143B_BONE,cellosaurus,CVCL_2270
144,ccle,143B_BONE,efo,0006355
145,ccle,22RV1_PROSTATE,cellosaurus,CVCL_1045
146,ccle,22RV1_PROSTATE,efo,0002095
...,...,...,...,...
827,ccle,YMB1_BREAST,efo,0006779
828,ccle,ZR751_BREAST,cellosaurus,CVCL_0588
829,ccle,ZR751_BREAST,efo,0001262
830,ccle,ZR7530_BREAST,cellosaurus,CVCL_1661


Count the number of mappings for each prefix from CCLE.

In [12]:
mappings_df[mappings_df["source prefix"] == "ccle"].groupby("target prefix").count()[
    "target identifier"
]

target prefix
cellosaurus    114
depmap           4
efo            570
Name: target identifier, dtype: int64

In [13]:
efo_mapping_idx = (mappings_df["source prefix"] == "ccle") & (mappings_df["target prefix"] == "efo")
manual_efo_mapped = set(mappings_df[efo_mapping_idx]["target identifier"])

# refer to these variables in a previous cell
inferred_efo_mapped = via_cellosaurus.union(via_depmap_cellosaurus)

print(
    f"""\
Infered CCLE-EFO: {len(inferred_efo_mapped):,}
Manually curated CCLE-EFO: {len(manual_efo_mapped):,}
Union: {len(inferred_efo_mapped.union(manual_efo_mapped)):,}
Overlap: {len(inferred_efo_mapped.intersection(manual_efo_mapped)):,}
Novel from manual: {len(manual_efo_mapped-inferred_efo_mapped)}
Percentage gain: {len(manual_efo_mapped) / len(inferred_efo_mapped):.2%}
"""
)

Infered CCLE-EFO: 719
Manually curated CCLE-EFO: 570
Union: 1,289
Overlap: 0
Novel from manual: 570
Percentage gain: 79.28%



In [14]:
added_value_table = []
total_value_table = []
for source_prefix in prefixes:
    row_added_value = []
    row_total_value = []
    for target_prefix in prefixes:
        before = sum(
            u.startswith(source_prefix) and v.startswith(target_prefix)
            for u, v in closure_graph.edges()
        )
        benefit = sum(
            not closure_graph.has_edge(
                curie_to_str(source_prefix, source_id),
                curie_to_str(target_prefix, target_id),
            )
            for _, source_id, _, target_id in mappings_df[
                (mappings_df["source prefix"] == source_prefix)
                & (mappings_df["target prefix"] == target_prefix)
            ].values
        )
        row_added_value.append(benefit)
        row_total_value.append(before + benefit)

    added_value_table.append(row_added_value)
    total_value_table.append(row_total_value)

added_value_df = pd.DataFrame(added_value_table, index=prefixes, columns=prefixes)
added_value_df.index.name = "source"
added_value_df.columns.name = "target"
added_value_df

target,ccle,depmap,cellosaurus,efo
source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ccle,0,4,114,8
depmap,0,0,0,0
cellosaurus,0,0,0,0
efo,0,0,0,0


In [15]:
total_value_df = pd.DataFrame(total_value_table, index=prefixes, columns=prefixes)
total_value_df.index.name = "source"
total_value_df.columns.name = "target"
total_value_df

target,ccle,depmap,cellosaurus,efo
source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ccle,10,1474,1569,733
depmap,13,13,1690,763
cellosaurus,12,114,2,1304
efo,0,0,0,6


## Apply Reasoning and Inference with Biomappings

The graph is extended with Biomappings content, the transitive closure is computed, and new statistics are reported. Note that several additional CCLE-EFO mappings are calculatable when using inference over direct mappings!

In [16]:
extended_closure_graph = closure_graph.copy()


for source_prefix, source_id, target_prefix, target_id in mappings_df[columns].values:
    source_id = source_id.removeprefix("EFO_")
    source_id = bioregistry.standardize_identifier(prefix, source_id)
    target_id = target_id.removeprefix("EFO_")
    target_id = bioregistry.standardize_identifier(target_ns, target_id)
    extended_closure_graph.add_edge(
        curie_to_str(source_prefix, source_id),
        curie_to_str(target_prefix, target_id),
    )

extended_closure_graph = nx.transitive_closure(extended_closure_graph, reflexive=False)

extended_closure_summary_df = calculate_graph_df(extended_closure_graph)
extended_closure_summary_df

target,ccle,depmap,cellosaurus,efo
source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ccle,13,1470,1573,727
depmap,17,13,1787,765
cellosaurus,19,114,101,1306
efo,6,0,62,6
