# Sanity check on weights

**Problem:** we need to exclude drug-disease edges from our KG before passing it to GraphSAGE, to avoid data leakage. One solution was to assign zero weights to drug-disease edges through GDS, which is beneficial from a visualisation perspective. However, we were unsure whether zero-weighted drug-disease edges truly provide no signal to GraphSAGE (e.g. are zero-weighted edges counted when calculating node degrees?). So here we examine whether zero-weighting edges is equivalent to removing them completely.
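
To make the concern concrete, here is a toy sketch of two ways a weighted mean aggregator could treat a zero-weight neighbour. This is not GDS's actual implementation (which we have not inspected); the feature values and both function names are made up for illustration.

```python
import numpy as np

# Toy neighbour features for one node: two ordinary neighbours and one
# drug-disease neighbour whose edge weight is set to 0 (all values invented).
neighbour_features = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
weights = np.array([1.0, 1.0, 0.0])

def mean_normalised_by_weight(features, w):
    # Divide by the sum of weights: a zero-weight edge vanishes entirely.
    return (w[:, None] * features).sum(axis=0) / w.sum()

def mean_normalised_by_degree(features, w):
    # Divide by the neighbour count: a zero-weight edge still dilutes the mean.
    return (w[:, None] * features).sum(axis=0) / len(w)

removed = neighbour_features[:2].mean(axis=0)  # edge removed outright
by_weight = mean_normalised_by_weight(neighbour_features, weights)
by_degree = mean_normalised_by_degree(neighbour_features, weights)

print(removed, by_weight, by_degree)
```

If the aggregation normalises by total weight, zero-weighting and removal coincide; if it normalises by degree, they do not. Which of these (if either) applies to GDS is exactly what the experiment below tries to find out empirically.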

**Reproduction:** to reproduce the workflow, follow commit history in this branch:
* step1 - codebase where drug-disease edges are assigned weight 0 in GDS; 
* step2 - codebase where all edges are assigned weight 1 in GDS;
* step3 - codebase where we assign zero weights to drug-disease edges AND we remove drug-disease edges; 
* step4 - codebase where all edges are assigned weight 1 AND we remove drug-disease edges; 
* step5 - codebase where we don't assign any weights to edges AND we remove drug-disease edges;
* step6 - codebase where we don't assign any weights to edges AND we also don't remove drug-disease edges.

I realize that step3 and step4 are a bit redundant, so the most essential are steps 1, 2, 5, and 6.

The embeddings were obtained by simply running
```
make wipe_neo
kedro run -p integration --env base 
kedro run -p embeddings --env base
```
and then saving the embeddings output (`embeddings.model_output.graphsage` in `embeddings/catalog.yml`); I did this through ipython/jupyter.

In [None]:
# Import dependencies
import os
import subprocess
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import pyspark as ps
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

# Start (or reuse) a Spark session for reading the saved embeddings
spark = (
    SparkSession.builder
    .appName("ReadSavedDataFrame")
    .getOrCreate()
)


%load_ext kedro.ipython
%reload_kedro  --env base


First, load the output of GraphSAGE (the calculated embeddings). I will upload them to gdrive if anyone is interested: https://drive.google.com/drive/folders/1noAkqyU0rNaczTFp8ED_2wORtjbJmnMR (embed data folder).

In [None]:
# Load the saved embeddings for each step as pandas DataFrames
step1=spark.read.parquet('scratch/embed_data/step1').toPandas()
step2=spark.read.parquet('scratch/embed_data/step2').toPandas()
step3=spark.read.parquet('scratch/embed_data/step3').toPandas()
step4=spark.read.parquet('scratch/embed_data/step4').toPandas()
step5=spark.read.parquet('scratch/embed_data/step5').toPandas()
step6=spark.read.parquet('scratch/embed_data/step6').toPandas()


If zero-weighted edges are equivalent to not having the edges at all, then the embeddings from step1, step3, and step5 should be the same, while the embeddings from step2 and step6 should be equivalent to each other (step4 also uses uniform weights, but with drug-disease edges removed).

In [None]:
print(step1.equals(step3))
print(step1.equals(step5))
print(step3.equals(step5))


Not what I expected, let's check further.
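
One thing worth ruling out: `DataFrame.equals` is strict. It requires identical row order and exact floating-point equality, so it can return `False` even when two runs agree up to ordering or rounding. A more forgiving check, sketched below on toy data, aligns rows by node id and compares embeddings with `np.allclose`. The `id` column name, the tolerance, and the helper `embeddings_match` are assumptions; only `topological_embedding` appears in this notebook.

```python
import numpy as np
import pandas as pd

def embeddings_match(a, b, id_col="id", emb_col="topological_embedding", atol=1e-6):
    """Return True if both frames hold the same ids with near-equal embeddings."""
    a = a.sort_values(id_col).reset_index(drop=True)
    b = b.sort_values(id_col).reset_index(drop=True)
    if not a[id_col].equals(b[id_col]):
        return False
    # Stack the per-row vectors into 2-D arrays before comparing.
    mat_a = np.vstack(a[emb_col].to_numpy())
    mat_b = np.vstack(b[emb_col].to_numpy())
    return mat_a.shape == mat_b.shape and bool(np.allclose(mat_a, mat_b, atol=atol))

# Toy frames standing in for two steps (hypothetical ids and values).
toy_a = pd.DataFrame({"id": ["n1", "n2"], "topological_embedding": [[0.1, 0.2], [0.3, 0.4]]})
toy_b = pd.DataFrame({"id": ["n2", "n1"], "topological_embedding": [[0.3, 0.4], [0.1, 0.2]]})
toy_c = pd.DataFrame({"id": ["n1", "n2"], "topological_embedding": [[0.1, 0.2], [0.9, 0.4]]})

same_up_to_order = embeddings_match(toy_a, toy_b)  # same values, rows shuffled
different_values = embeddings_match(toy_a, toy_c)  # one embedding changed
print(same_up_to_order, different_values)
```

If this check also fails on the real data, the difference is genuine rather than an artefact of ordering or rounding, though it could still stem from training randomness rather than from the edges themselves.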

Step2 (when all weights are the same and all edges are present) should be equivalent to step6 (when no weighting is applied and all edges are present)

In [None]:
print(step2.equals(step6))
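
To see all pairings at once rather than one `equals` call at a time, a pairwise table of maximum absolute differences may help. The sketch below runs on toy stand-ins; the `id` column and the helper names `max_abs_diff` and `toy` are assumptions, and in this notebook the dictionary would hold `step1` through `step6` instead.

```python
import numpy as np
import pandas as pd

def max_abs_diff(a, b, id_col="id", emb_col="topological_embedding"):
    # Align rows on node id (inner join), then take the largest
    # element-wise difference between the two embedding matrices.
    merged = a.merge(b, on=id_col, suffixes=("_a", "_b"))
    mat_a = np.vstack(merged[f"{emb_col}_a"].to_numpy())
    mat_b = np.vstack(merged[f"{emb_col}_b"].to_numpy())
    return float(np.abs(mat_a - mat_b).max())

def toy(values):
    # Build a tiny two-node frame with made-up embeddings.
    return pd.DataFrame({"id": ["n1", "n2"], "topological_embedding": values})

frames = {
    "toy_a": toy([[0.0, 1.0], [1.0, 0.0]]),
    "toy_b": toy([[0.0, 1.0], [1.0, 0.0]]),
    "toy_c": toy([[0.5, 1.0], [1.0, 0.0]]),
}

names = list(frames)
diff_table = pd.DataFrame(
    [[max_abs_diff(frames[r], frames[c]) for c in names] for r in names],
    index=names,
    columns=names,
)
print(diff_table)
```

Zeros off the diagonal mark pairs of runs that produced the same embeddings; the size of the non-zero entries gives a rough sense of how far apart the others are.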

## Embeddings raw

In [None]:
print(step1.topological_embedding.head(10))

In [None]:
print(step2.topological_embedding.head(10))

In [None]:
print(step3.topological_embedding.head(10))

In [None]:
print(step4.topological_embedding.head(10))

In [None]:
print(step5.topological_embedding.head(10))

In [None]:
print(step6.topological_embedding.head(10))

TODO: check UMAP and cosine similarity.
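
For the cosine part of this TODO, one possible starting point is a per-node cosine similarity between two runs. The sketch below uses toy arrays; `rowwise_cosine` is a hypothetical helper, and it assumes the two matrices have already been aligned so that row `i` refers to the same node in both.

```python
import numpy as np

def rowwise_cosine(mat_a, mat_b, eps=1e-12):
    # Cosine similarity between corresponding rows of two (n_nodes, dim) arrays.
    dots = (mat_a * mat_b).sum(axis=1)
    norms = np.linalg.norm(mat_a, axis=1) * np.linalg.norm(mat_b, axis=1)
    return dots / np.maximum(norms, eps)  # eps guards against zero vectors

# Toy embeddings: row 0 is identical, row 1 is rescaled, row 2 is orthogonal.
emb_a = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]])
emb_b = np.array([[1.0, 0.0], [2.0, 2.0], [0.0, 3.0]])

cos = rowwise_cosine(emb_a, emb_b)
print(cos)
```

In the notebook, the matrices could be built with `np.vstack(step1.topological_embedding)` after sorting both frames by node id. Values near 1 would indicate matching directions; note that independently trained models need not share a coordinate system, so low values alone would not prove that the underlying graphs were treated differently.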