This notebook assesses the number of duplicate nodes in Hetionet (and potentially other resources).

In [1]:
import sys
import time
import pandas as pd
import bioregistry
import itertools as itt
import json
from tqdm.auto import tqdm
from collections import Counter, defaultdict
from pyobo.gilda_utils import get_grounder
from biomappings.graph import get_true_graph
import networkx as nx

In [2]:
print(sys.version)

3.10.8 (main, Oct 13 2022, 10:17:43) [Clang 14.0.0 (clang-1400.0.29.102)]


In [3]:
print(time.asctime())

Fri Oct 28 18:39:13 2022


# Loading

Load the Biomappings data

In [4]:
true_graph = get_true_graph()

Load the Hetionet data into a data frame

In [5]:
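# Source of the Hetionet v1.0 JSON export, which is distributed as a bz2 archive
URL = "https://github.com/hetio/hetionet/raw/master/hetnet/json/hetionet-v1.0.json.bz2"

# Load a local copy that was downloaded from URL and decompressed beforehand
with open("/Users/cthoyt/Downloads/hetionet-v1.0.json") as file:
    data = json.load(file)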
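
The cell above reads a copy that was decompressed by hand. As a minimal sketch, the compressed file could instead be parsed directly with the standard library. The helper name `load_json_bz2` is hypothetical and not part of the original notebook, and the file is assumed to be UTF-8 encoded.

```python
import bz2
import json


def load_json_bz2(path):
    """Parse a bz2-compressed JSON file without writing a decompressed copy.

    Hypothetical helper for illustration; assumes UTF-8 encoded content.
    """
    # Text mode ("rt") makes bz2 decode bytes to str on the fly for json.load
    with bz2.open(path, mode="rt", encoding="utf-8") as file:
        return json.load(file)
```

With a downloaded `hetionet-v1.0.json.bz2`, `data = load_json_bz2(path)` would give the same dictionary as the cell above, assuming the local copy is an unmodified download.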

In [6]:
rows = [
    (
        d["kind"],
        d["identifier"],
        d["name"],
        d["data"]["source"],
        d["data"].get("url"),
        d["data"].get("description"),
    )
    for d in tqdm(data["nodes"], unit_scale=True, unit="node")
]
df = pd.DataFrame(
    rows,
    columns=["kind", "identifier", "name", "source", "url", "description"],
)
df

  0%|          | 0.00/47.0k [00:00<?, ?node/s]

,kind,identifier,name,source,url,description
0,Molecular Function,GO:0031753,endothelial differentiation G-protein coupled ...,Gene Ontology,http://purl.obolibrary.org/obo/GO_0031753,
1,Side Effect,C0023448,Lymphocytic leukaemia,UMLS via SIDER 4.1,http://identifiers.org/umls/C0023448,
2,Gene,5345,SERPINF2,Entrez Gene,http://identifiers.org/ncbigene/5345,"serpin peptidase inhibitor, clade F (alpha-2 a..."
3,Gene,9409,PEX16,Entrez Gene,http://identifiers.org/ncbigene/9409,peroxisomal biogenesis factor 16
4,Biological Process,GO:0032474,otolith morphogenesis,Gene Ontology,http://purl.obolibrary.org/obo/GO_0032474,
...,...,...,...,...,...,...
47026,Gene,92291,CAPN13,Entrez Gene,http://identifiers.org/ncbigene/92291,calpain 13
47027,Biological Process,GO:1902308,regulation of peptidyl-serine dephosphorylation,Gene Ontology,http://purl.obolibrary.org/obo/GO_1902308,
47028,Gene,643338,C15orf62,Entrez Gene,http://identifiers.org/ncbigene/643338,chromosome 15 open reading frame 62
47029,Gene,645121,CCNI2,Entrez Gene,http://identifiers.org/ncbigene/645121,"cyclin I family, member 2"


What are the unique kinds of nodes in Hetionet?

In [7]:
df.kind.unique()

array(['Molecular Function', 'Side Effect', 'Gene', 'Biological Process',
       'Compound', 'Pathway', 'Anatomy', 'Cellular Component', 'Symptom',
       'Disease', 'Pharmacologic Class'], dtype=object)

# Identify Redundant Pathways

Where do pathways come from?

In [8]:
df[df.kind == "Pathway"].groupby("source").count()["identifier"]

source
PID via Pathway Commons          220
Reactome via Pathway Commons    1308
WikiPathways                     294
Name: identifier, dtype: int64

Biological processes and pathways describe the same kind of entity, but Hetionet splits them into separate node kinds.

In [9]:
df[df.kind == "Biological Process"].count()["identifier"]

11381

## Mapping Pathways from WikiPathways and GO

WikiPathways and GO identifiers can be mapped to standard CURIEs with simple string operations.

In [10]:
idx = (df.kind == "Pathway") & (df.source == "WikiPathways")
wikipathways_mapping = {wpid: "wikipathways:" + wpid.split("_")[0] for wpid in df[idx].identifier}

In [11]:
go_mapping = {
    go_curie: go_curie.lower() for go_curie in df[df.kind == "Biological Process"].identifier
}
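
The two string operations above can be illustrated on single identifiers. This is a small sketch: the function names are made up for illustration, and `WP0000_r00000` is a hypothetical placeholder showing an identifier with an underscore-separated suffix, not a real Hetionet node. The GO identifier is taken from the table above.

```python
def wikipathways_to_curie(identifier):
    """Keep the part before the first underscore and add the prefix."""
    return "wikipathways:" + identifier.split("_")[0]


def go_to_curie(identifier):
    """Lowercase the whole identifier; only the GO prefix is affected."""
    return identifier.lower()


# Hypothetical WikiPathways-style identifier with a suffix after the underscore
example_wikipathways = wikipathways_to_curie("WP0000_r00000")
# GO identifier from the node table above
example_go = go_to_curie("GO:0032474")
```

Here `example_wikipathways` is `"wikipathways:WP0000"` and `example_go` is `"go:0032474"`.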

## Grounding Pathways from Reactome and NCI-PID

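Unfortunately, Hetionet imports Reactome and PID pathways via Pathway Commons, so their provenance information is mostly lost (i.e., the identifiers are opaque). Instead, the pathway names are grounded with Gilda to recover identifiers from the original resources.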

In [12]:
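def get_mappings(prefix, source):
    """Ground the names of Hetionet pathways from a source to CURIEs with the given prefix."""
    grounder = get_grounder(prefix=prefix)
    # Summarize which organisms the grounder's terms are annotated with
    counter = Counter(term.organism for terms in grounder.entries.values() for term in terms)
    print(f"{prefix.capitalize()} species count: {counter}")

    idx = (df.kind == "Pathway") & (df.source == source)
    it = tqdm(df[idx].values, unit_scale=True, desc="Grounding")
    rv = {}
    # Rows unpack in column order: kind, identifier, name, source, url, description
    for _, identifier, name, _, _, _ in it:
        # Restrict organism-specific matches to human (NCBI Taxonomy 9606)
        res = grounder.ground(name, organisms=["9606"])
        if res:
            # Keep only the first match returned for the name
            rv[identifier] = prefix + ":" + res[0].term.id

    total = idx.sum()
    n = len(rv)
    print(f"{prefix.capitalize()} got {n:,}/{total:,} ({n/total:.1%}) groundings")
    return rv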


pid_mapping = get_mappings("pid", "PID via Pathway Commons")
reactome_mapping = get_mappings("reactome", "Reactome via Pathway Commons")



[pid] mapping:   0%|          | 0.00/211 [00:00<?, ?name/s]

Pid species count: Counter({None: 211})


Grounding:   0%|          | 0.00/220 [00:00<?, ?it/s]

Pid got 194/220 (88.2%) groundings


[reactome] mapping:   0%|          | 0.00/22.0k [00:00<?, ?name/s]

Reactome species count: Counter({'9606': 2601, '10090': 1715, '9031': 1706, '10116': 1702, '9913': 1696, '7955': 1676, '9823': 1660, '9615': 1657, '8364': 1580, '7227': 1477, '6239': 1304, '44689': 982, '4896': 819, '4932': 812, '5833': 599, '1773': 13})


Grounding:   0%|          | 0.00/1.31k [00:00<?, ?it/s]

Reactome got 1,097/1,308 (83.9%) groundings
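
To make the idea of grounding concrete, here is a toy sketch of name-based lookup. It is a deliberately simplified stand-in and does not reproduce how Gilda scores or disambiguates matches. The identifiers with the `X:` prefix, the function names, and the reference vocabulary are all hypothetical.

```python
import re


def normalize(name):
    """Lowercase a name and collapse runs of non-alphanumerics into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def build_index(names_by_id):
    """Index a reference vocabulary by normalized name (first identifier wins)."""
    index = {}
    for identifier, name in names_by_id.items():
        index.setdefault(normalize(name), identifier)
    return index


def ground(name, index):
    """Return the identifier whose normalized name matches exactly, else None."""
    return index.get(normalize(name))


# Hypothetical reference vocabulary with placeholder identifiers
reference = {"X:0001": "Signaling by NOTCH", "X:0002": "Cell Cycle, Mitotic"}
index = build_index(reference)
```

With this index, `ground("signaling by Notch", index)` returns `"X:0001"`, while a name that is absent from the vocabulary returns `None`. Names that fail to match in this way are one plausible reason that a grounding pass recovers fewer identifiers than there are pathways.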


## Comparing Mappings

In [16]:
mappings = [
    ("reactome", reactome_mapping),
    ("go", go_mapping),
    ("wikipathways", wikipathways_mapping),
    ("pid", pid_mapping),
]

# Two Hetionet nodes are connected here if their CURIEs share an edge in the Biomappings graph
merge_graph = nx.Graph()

for (p1, d1), (p2, d2) in itt.combinations(mappings, 2):
    label = f"{p1} - {p2}"
    it = tqdm(
        itt.product(d1.items(), d2.items()),
        desc=label,
        total=len(d1) * len(d2),
        unit_scale=True,
        unit="pair",
    )
    c = 0
    for (hetio_id_1, curie_1), (hetio_id_2, curie_2) in it:
        if true_graph.has_edge(curie_1, curie_2):
            merge_graph.add_edge(hetio_id_1, hetio_id_2)
            c += 1
    print(label, "had", c)

reactome - go:   0%|          | 0.00/12.5M [00:00<?, ?pair/s]

reactome - go had 95


reactome - wikipathways:   0%|          | 0.00/323k [00:00<?, ?pair/s]

reactome - wikipathways had 42


reactome - pid:   0%|          | 0.00/213k [00:00<?, ?pair/s]

reactome - pid had 0


go - wikipathways:   0%|          | 0.00/3.35M [00:00<?, ?pair/s]

go - wikipathways had 34


go - pid:   0%|          | 0.00/2.21M [00:00<?, ?pair/s]

go - pid had 0


wikipathways - pid:   0%|          | 0.00/57.0k [00:00<?, ?pair/s]

wikipathways - pid had 0
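
The comparison above checks every pair of identifiers, which reaches millions of pairs for the larger combinations. A possible alternative, sketched here under the assumption that the mapping graph is undirected and its edges can be iterated as pairs of CURIEs (for example via `true_graph.edges()`), is to invert each mapping and walk the edges once. The function names are hypothetical and the example data is made up.

```python
def invert(mapping):
    """Map each CURIE to the set of Hetionet identifiers that point to it."""
    rv = {}
    for hetio_id, curie in mapping.items():
        rv.setdefault(curie, set()).add(hetio_id)
    return rv


def find_merge_edges(d1, d2, edges):
    """Return pairs (id from d1, id from d2) whose CURIEs are joined by an edge."""
    inv1, inv2 = invert(d1), invert(d2)
    found = set()
    for u, v in edges:
        # The graph is assumed undirected, so try both orientations of each edge
        for left, right in ((u, v), (v, u)):
            for hetio_id_1 in inv1.get(left, ()):
                for hetio_id_2 in inv2.get(right, ()):
                    found.add((hetio_id_1, hetio_id_2))
    return found


# Made-up example data, not taken from Hetionet or Biomappings
example_d1 = {"H1": "reactome:A", "H2": "reactome:B"}
example_d2 = {"H3": "go:X", "H4": "go:Y"}
example_edges = [("go:X", "reactome:A"), ("reactome:B", "wikipathways:Z")]
example_found = find_merge_edges(example_d1, example_d2, example_edges)
```

Under these assumptions, the number of pairs found should equal the count `c` printed by the loop above for the same two mappings, while the work scales with the number of edges rather than with the product of the mapping sizes.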


In [14]:
n_components = len(list(nx.connected_components(merge_graph)))
n_nodes = merge_graph.number_of_nodes()
duplicates = n_nodes - n_components

print(
    f"There are {n_nodes} nodes that participate in duplicates, "
    f"forming {n_components} connected components in the mapping graph. "
    f"Therefore, {duplicates} nodes are duplicates."
)

There are 318 nodes that participate in duplicates, forming 150 connected components in the mapping graph. Therefore, 168 nodes are duplicates.
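
The subtraction works because a connected component of k mutually mapped nodes represents one entity, so k - 1 of its nodes are redundant; summing over all components gives the number of nodes minus the number of components. The sketch below reproduces the calculation on a made-up edge list using a small union-find in plain Python instead of networkx. The function name and example edges are hypothetical.

```python
def count_duplicates(edges):
    """Return (n_nodes, n_components, n_duplicates) for an undirected edge list."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then follow parents upward
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    n_nodes = len(parent)
    n_components = len({find(node) for node in list(parent)})
    return n_nodes, n_components, n_nodes - n_components


# Made-up example: one component of three nodes and one of two nodes
example_counts = count_duplicates([("a", "b"), ("b", "c"), ("d", "e")])
```

Here `example_counts` is `(5, 2, 3)`: five nodes collapse into two entities, so three nodes are duplicates. The same arithmetic gives 318 - 150 = 168 for the mapping graph above.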


# Identifying Redundant Phenotypes

The next investigation will see if any side effects, symptoms, and diseases are actually mapped to each other.