# Merging RTX and Robokop

This notebook summarizes our results from merging RTX and Robokop using Translator's Node Normalization.
For this we downloaded RTX KG2 v2.7.3 and Robokop `c5ec1f282158182f`.
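To make the merging idea concrete, here is a minimal sketch of how nodes from two KGs collapse onto one canonical identifier. The `CANONICAL_ID` mapping, the `EX_*` identifiers and the `merge_nodes` helper are all made up for illustration; they are not Node Normalization output and not the pipeline's actual implementation.

```python
from collections import defaultdict

# Hypothetical mapping from source identifiers to a canonical identifier.
# These values are invented for illustration only.
CANONICAL_ID = {
    "EX_A:1": "EX_A:1",
    "EX_B:1": "EX_A:1",  # EX_B:1 is treated as equivalent to EX_A:1
    "EX_A:2": "EX_A:2",
}


def merge_nodes(nodes_by_source):
    """Group node ids from several KGs by canonical id and record their sources."""
    merged = defaultdict(set)
    for source, node_ids in nodes_by_source.items():
        for node_id in node_ids:
            # Fall back to the original id when no canonical id is known.
            canonical = CANONICAL_ID.get(node_id, node_id)
            merged[canonical].add(source)
    return {node_id: sorted(sources) for node_id, sources in merged.items()}


unified = merge_nodes({
    "rtxkg2": ["EX_B:1", "EX_A:2"],
    "robokop": ["EX_A:1", "EX_C:3"],
})
print(unified)
```

A node counts as merged when its list of sources contains both KGs, which is the same idea as the `upstream_data_source` array on the unified nodes further down.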


## Summary

TODO

## Questions

- What percentage of nodes is merged?
- What percentage of edges is merged?
- What are examples of edges that are not merged, and why are they not merged?
- How do the triples differ across the KGs, and within the part that is merged across KGs?
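As a rough sketch of how the edge questions could be approached, the toy example below treats two edges as merged when their (subject, predicate, object) triples are identical after normalization. That matching rule is an assumption for illustration, and `edge_overlap` plus the `EX:*` identifiers are hypothetical.

```python
def edge_overlap(edges_a, edges_b):
    """Compare two edge lists given as (subject, predicate, object) triples."""
    triples_a, triples_b = set(edges_a), set(edges_b)
    union = triples_a | triples_b
    both = triples_a & triples_b
    return {
        # Share of the unified edge set that is present in both KGs.
        "merged_pct": 100 * len(both) / len(union) if union else 0.0,
        # Candidates to inspect when asking why an edge was not merged.
        "only_a": sorted(triples_a - triples_b),
        "only_b": sorted(triples_b - triples_a),
    }


rtx = [("EX:1", "biolink:treats", "EX:2"), ("EX:1", "biolink:affects", "EX:3")]
robo = [("EX:1", "biolink:treats", "EX:2"), ("EX:4", "biolink:subclass_of", "EX:3")]
result = edge_overlap(rtx, robo)
print(f"{result['merged_pct']:.2f}% merged")
print("only in RTX:", result["only_a"])
print("only in Robo:", result["only_b"])
```

On the real data the same comparison would be done in Spark rather than with Python sets, since the edge tables are far too large to collect.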



In [2]:
%%capture
# Import dependencies
import pyspark as ps
import os
from pathlib import Path
import subprocess
import pyspark.sql.functions as f

import pandas as pd

# import spark 
%load_ext autoreload
%autoreload 2
from rich.console import Console
from rich.logging import RichHandler
from rich.panel import Panel
console = Console()

# hack that moves this notebook context into the kedro path
root_path = subprocess.check_output(['git', 'rev-parse', '--show-toplevel']).decode().strip()
os.chdir(Path(root_path) / 'pipelines' / 'matrix')

# this loads various objects into the context, see 
# https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html#kedro-line-magics
%load_ext kedro.ipython
# %reload_kedro  --env cloud
# %reload_kedro  --env test
%reload_kedro


In [3]:
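def top_n(df, n: int = 20):
    """Render the first n rows of a Spark DataFrame as a string, truncating cells at 200 characters."""
    # NOTE: relies on the private `_jdf` handle, which may change between PySpark versions
    return df._jdf.showString(n, 200, False)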

In [4]:
%%capture
unified_nodes = catalog.load("integration.prm.unified_nodes")
unified_edges = catalog.load("integration.prm.unified_edges")
robo_nodes = catalog.load("integration.int.robokop.nodes")
robo_edges = catalog.load("integration.int.robokop.edges")
rtx_nodes = catalog.load("integration.int.rtx.nodes")
rtx_edges = catalog.load("integration.int.rtx.edges")


In [5]:
# first calculate the number of nodes and edges in each KG
# (every .count() triggers a Spark job, so compute each one once and reuse it)
unified_node_count = unified_nodes.count()
unified_edge_count = unified_edges.count()
robo_node_count = robo_nodes.count()
robo_edge_count = robo_edges.count()
rtx_node_count = rtx_nodes.count()
rtx_edge_count = rtx_edges.count()

console.rule("[bold blue]Unified KG")
console.print(Panel.fit(f"""
Unified Nodes: {unified_node_count}
Robo Nodes: {robo_node_count}
RTX Nodes: {rtx_node_count}
""", title="Node Counts"))
# now edges
console.print(Panel.fit(f"""
Unified Edges: {unified_edge_count}
Robo Edges: {robo_edge_count}
RTX Edges: {rtx_edge_count}
""", title="Edge Counts"))

nodes_in_both = unified_nodes.filter(f.array_contains(f.col("upstream_data_source"), "rtxkg2") & f.array_contains(f.col("upstream_data_source"), "robokop"))
nodes_in_rtx = unified_nodes.filter(f.array_contains(f.col("upstream_data_source"), "rtxkg2"))
nodes_in_robo = unified_nodes.filter(f.array_contains(f.col("upstream_data_source"), "robokop"))

console.print(Panel.fit(
f"""
Nodes originating from RTX: {nodes_in_rtx.count()/unified_node_count*100:.2f}%
Nodes originating from Robo: {nodes_in_robo.count()/unified_node_count*100:.2f}%
Nodes originating from Both: {nodes_in_both.count()/unified_node_count*100:.2f}%
""", title="Node Origin Proportions"))

edges_in_both = unified_edges.filter(f.array_contains(f.col("upstream_data_source"), "rtxkg2") & f.array_contains(f.col("upstream_data_source"), "robokop"))
edges_in_rtx = unified_edges.filter(f.array_contains(f.col("upstream_data_source"), "rtxkg2"))
edges_in_robo = unified_edges.filter(f.array_contains(f.col("upstream_data_source"), "robokop"))

console.print(Panel.fit(
f"""
Edges originating from RTX: {edges_in_rtx.count()/unified_edge_count*100:.2f}%
Edges originating from Robo: {edges_in_robo.count()/unified_edge_count*100:.2f}%
Edges originating from Both: {edges_in_both.count()/unified_edge_count*100:.2f}%
""", title="Edge Origin Proportions"))


Wow, that's not a lot of edges present in both. I also wonder why there are so many more edges in Robokop: roughly 150M edges there versus only 18M in RTX.
Let's look at the predicate counts:


In [5]:
def stats_on_df(df: ps.sql.DataFrame, col: str, kg_name: str, n=40):
    df_counts = df.groupBy(col).count().sort("count", ascending=False)
    console.print(Panel.fit(top_n(df_counts, n=n), title=f"{col} Counts in {kg_name}"))

stats_on_df(edges_in_both, "predicate", "Both")
stats_on_df(edges_in_rtx, "predicate", "RTX")
stats_on_df(edges_in_robo, "predicate", "Robo")




In [6]:

stats_on_df(nodes_in_both, "category", "Both")
stats_on_df(nodes_in_rtx, "category", "RTX")
stats_on_df(nodes_in_robo, "category", "Robo")

OK, it looks like Robokop has tons of biolink `subclass_of` and `is_nearby_variant_of` edges, as well as 18M `affects` edges. RTX, on the other hand, appears to be heavier on `has_participant` and `occurs_in`.
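Raw counts are hard to compare when one KG is much larger than the other, so a side-by-side view of relative predicate shares may help. The sketch below uses made-up toy lists, not the real counts, and `predicate_shares` and `compare_shares` are hypothetical helpers.

```python
from collections import Counter


def predicate_shares(predicates):
    """Return each predicate's share of all edges in one KG."""
    counts = Counter(predicates)
    total = sum(counts.values())
    return {predicate: count / total for predicate, count in counts.items()}


def compare_shares(predicates_a, predicates_b):
    """List (predicate, share_a, share_b), largest absolute difference first."""
    shares_a = predicate_shares(predicates_a)
    shares_b = predicate_shares(predicates_b)
    rows = [
        (p, shares_a.get(p, 0.0), shares_b.get(p, 0.0))
        for p in set(shares_a) | set(shares_b)
    ]
    return sorted(rows, key=lambda row: (-abs(row[1] - row[2]), row[0]))


# Toy data only; these are not the real predicate distributions.
robo_toy = ["biolink:subclass_of"] * 3 + ["biolink:affects"]
rtx_toy = ["biolink:has_participant"] * 2 + ["biolink:affects"] * 2
rows = compare_shares(robo_toy, rtx_toy)
for predicate, robo_share, rtx_share in rows:
    print(f"{predicate:28s} robo={robo_share:.2f} rtx={rtx_share:.2f}")
```

Sorting by the absolute difference puts the predicates that distinguish the two KGs most strongly at the top.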

## Doing some plotting. Let's get a correlation matrix of categories in the 4 variants (rtx, robo, overlap, union)

I want to see which categories of nodes are connected with each other. For that, I need to join the node categories onto the edges dataframe and then build the correlation matrix.
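On toy data, the category join and the two-level aggregation (subject category to predicate, then predicate to object category) look roughly like the sketch below. It is plain Python rather than Spark, the `EX:*` ids and the `sankey_links` helper are hypothetical, and unlike a Spark left join it labels nodes without a category as `unknown` instead of leaving them null.

```python
from collections import Counter


def sankey_links(edges, node_category):
    """Build (source, sink, value) links: subject category -> predicate -> object category."""
    links = Counter()
    for subject, predicate, obj in edges:
        # Prefixes keep subject-side and object-side categories as separate Sankey nodes.
        subj_cat = "sub:" + node_category.get(subject, "unknown")
        obj_cat = "obj:" + node_category.get(obj, "unknown")
        links[(subj_cat, predicate)] += 1
        links[(predicate, obj_cat)] += 1
    # Largest links first; ties are broken alphabetically for a stable order.
    return sorted(
        ((source, sink, value) for (source, sink), value in links.items()),
        key=lambda row: (-row[2], row[0], row[1]),
    )


node_category = {"EX:1": "biolink:Drug", "EX:2": "biolink:Disease", "EX:3": "biolink:Drug"}
edges = [("EX:1", "biolink:treats", "EX:2"), ("EX:3", "biolink:treats", "EX:2")]
links = sankey_links(edges, node_category)
print(links)
```

Without the `sub:` and `obj:` prefixes, a category that appears on both sides of an edge would become a single node and the diagram would contain cycles.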

In [8]:
def get_category_connections(edges: ps.sql.DataFrame, nodes: ps.sql.DataFrame):
    categories = nodes.select("id", "category")
    # join the node categories onto the subject column of the edges dataframe
    edges = edges.join(categories.withColumnsRenamed({"id": "subject", "category": "subj_category"}), "subject", "left")
    # ... and onto the object column
    edges = edges.join(categories.withColumnsRenamed({"id": "object", "category": "obj_category"}), "object", "left")
    return edges.select("subject", "predicate", "object", "subj_category", "obj_category")


def get_sankey_data_for_kg(edges: ps.sql.DataFrame, nodes: ps.sql.DataFrame) -> pd.DataFrame:
    df = get_category_connections(edges, nodes)
    # preparing sankey diagram data
    df = (df
          .withColumn("subj_category", f.concat(f.lit("sub:"), f.col("subj_category")))
          .withColumn("obj_category", f.concat(f.lit("obj:"), f.col("obj_category")))
    )
    first_level = df.groupBy("subj_category","predicate").count().withColumnsRenamed({"subj_category": "source", "predicate": "sink", "count": "value"})
    second_level = df.groupBy("predicate", "obj_category").count().withColumnsRenamed({"predicate": "source", "obj_category": "sink", "count": "value"})
    return first_level.union(second_level).orderBy("value", ascending=False).toPandas()

# pandas is already imported as pd in the first cell
import plotly.graph_objects as go
import numpy as np

def create_sankey_diagram(df, title):
    # Prepare the data
    all_nodes = pd.concat([df['source'], df['sink']]).unique()
    node_indices = {node: index for index, node in enumerate(all_nodes)}

    # Create color scale
    n_colors = len(all_nodes)
    colors = [f'rgb({r},{g},{b})' for r, g, b in np.random.randint(0, 255, size=(n_colors, 3))]

    # Prepare the Sankey diagram data
    link_source = [node_indices[source] for source in df['source']]
    link_target = [node_indices[sink] for sink in df['sink']]
    link_value = df['value']

    # Create the figure
    fig = go.Figure(data=[go.Sankey(
        node = dict(
          pad = 15,
          thickness = 20,
          line = dict(color = "black", width = 0.5),
          label = list(all_nodes),
          color = colors
        ),
        link = dict(
          source = link_source,
          target = link_target,
          value = link_value
    ))])

    # Update the layout
    fig.update_layout(title_text=title, font_size=10, width=1920, height=800)

    return fig

def plot_sankey_for_kg(edges: ps.sql.DataFrame, nodes: ps.sql.DataFrame, title: str, max_categories: int = 100):
    sankey_data = get_sankey_data_for_kg(edges, nodes)
    # note: despite its name, max_categories caps the number of links (rows sorted by value), not categories
    fig = create_sankey_diagram(sankey_data[:max_categories], title)
    fig.show()

plot_sankey_for_kg(edges_in_robo, nodes_in_robo, "Robo")
# plot_sankey_for_kg(edges_in_rtx, nodes_in_rtx, "RTX")
# plot_sankey_for_kg(edges_in_both, nodes_in_both, "Both")
# plot_sankey_for_kg(unified_edges, unified_nodes, "Unified")


In [7]:
# nodes_in_robo.filter(f.col("category") == "biolink:NamedThing").show(100, truncate=False)
nodes_in_robo.show(100, truncate=False)

[Output truncated: table with columns id, name, category, description, equivalent_identifiers, ...]

In [8]:
unified_nodes.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- category: string (nullable = true)
 |-- description: string (nullable = true)
 |-- equivalent_identifiers: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- all_categories: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- publications: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- labels: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- international_resource_identifier: string (nullable = true)
 |-- upstream_data_source: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [12]:
unified_nodes.filter(f.array_contains(f.col("upstream_data_source"), "robokop")).show(10, truncate=False)

[Output truncated: table with columns id, name, category, description, equivalent_identifiers, ...]