# Sanity check on weights

**Problem:** to avoid data leakage, drug-disease edges from our KG must not be fed into GraphSAGE. One solution was to assign zero weights to drug-disease edges through GDS, which is beneficial from a visualisation perspective. However, we were unsure whether zero-weighted drug-disease edges definitely don't provide any signal to GraphSAGE (e.g. are zero-weighted edges counted when calculating node degrees?). So here we examine whether zero-weighting edges is equivalent to removing them completely.
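
To make the question concrete, here is a toy sketch of two ways a mean aggregator could treat edge weights. This is not GDS's actual implementation (which we have not inspected); the function names and the tiny feature matrix are made up for illustration.

```python
import numpy as np

def mean_by_count(neigh_feats, weights):
    """Weighted sum divided by the NUMBER of neighbours (zero-weight edges still count)."""
    w = np.asarray(weights, dtype=float)[:, None]
    return (w * neigh_feats).sum(axis=0) / len(weights)

def mean_by_weight(neigh_feats, weights):
    """Weighted sum divided by the SUM of weights (zero-weight edges drop out)."""
    w = np.asarray(weights, dtype=float)[:, None]
    return (w * neigh_feats).sum(axis=0) / w.sum()

# Hypothetical node with three neighbours; the last edge plays the drug-disease edge.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]])

removed = mean_by_count(feats[:2], [1, 1])        # edge physically removed
zeroed_count = mean_by_count(feats, [1, 1, 0])    # edge kept with weight 0, degree-normalised
zeroed_weight = mean_by_weight(feats, [1, 1, 0])  # edge kept with weight 0, weight-normalised

print(removed, zeroed_count, zeroed_weight)
```

If the normalisation uses the neighbour count, a zero-weighted edge still changes the result; if it uses the weight sum, zero-weighting and removal coincide. Which of the two (if either) applies in GDS is exactly what the experiment below tries to find out.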

**Reproduction:** to reproduce the workflow, follow the commit history in this branch:
* step1 - codebase where drug-disease edges are assigned weight 0 in GDS;
* step2 - codebase where all edges are assigned weight 1 in GDS;
* step3 - codebase where we assign zero weights to drug-disease edges AND remove drug-disease edges;
* step4 - codebase where all edges are assigned weight 1 AND we remove drug-disease edges;
* step5 - codebase where we don't assign any weights to edges AND we remove drug-disease edges;
* step6 - codebase where we don't assign any weights to edges AND we don't remove drug-disease edges.

I realize step3 and step4 are somewhat redundant, so the essential ones are steps 1, 2, 5 and 6.

The embeddings were obtained by simply running
```
make wipe_neo
kedro run -p integration --env base 
kedro run -p embeddings --env base
```
and then saving the embeddings output (`embeddings.model_output.graphsage` in `embeddings/catalog.yml`); I did this through ipython/jupyter.

In [None]:
# Import dependencies
import os
import subprocess
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import prince
import pyspark as ps
import umap
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
from scipy.spatial import distance
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

# Start (or reuse) a Spark session for reading the saved embeddings
spark = SparkSession.builder \
    .appName("ReadSavedDataFrame") \
    .getOrCreate()




First, load the GraphSAGE output (the calculated embeddings). I will upload them to gdrive if anyone is interested: https://drive.google.com/drive/folders/1noAkqyU0rNaczTFp8ED_2wORtjbJmnMR (embed data folder)

In [None]:
# load the embeddings of every step to compare them
step1=spark.read.parquet('../scratch/embed_data/step1').select('topological_embedding').toPandas()
step2=spark.read.parquet('../scratch/embed_data/step2').select('topological_embedding').toPandas()
step3=spark.read.parquet('../scratch/embed_data/step3').select('topological_embedding').toPandas()
step4=spark.read.parquet('../scratch/embed_data/step4').select('topological_embedding').toPandas()
step5=spark.read.parquet('../scratch/embed_data/step5').select('topological_embedding').toPandas()
step6=spark.read.parquet('../scratch/embed_data/step6').select('topological_embedding').toPandas()

#extra steps with custom weights 
step7=spark.read.parquet('../scratch/embed_data/step7').select('topological_embedding').toPandas()
step8=spark.read.parquet('../scratch/embed_data/step8').select('topological_embedding').toPandas()
step9=spark.read.parquet('../scratch/embed_data/step9').select('topological_embedding').toPandas()
step10=spark.read.parquet('../scratch/embed_data/step10').select('topological_embedding').toPandas()

#interested mainly in the visualisation part so creating emb objects here
emb7= np.array([np.array(i) for i in step7.topological_embedding.values])
emb8 = np.array([np.array(i) for i in step8.topological_embedding.values])
emb9 = np.array([np.array(i) for i in step9.topological_embedding.values])
emb10 = np.array([np.array(i) for i in step10.topological_embedding.values])
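
The list comprehensions above all do the same conversion, so a small helper could cover every step. A minimal sketch, assuming each loaded DataFrame has a `topological_embedding` column of equal-length vectors; the helper name `to_matrix` and the stand-in DataFrame are made up for illustration.

```python
import numpy as np
import pandas as pd

def to_matrix(df, column="topological_embedding"):
    """Stack a column of per-node vectors into a 2-D array of shape (n_nodes, dim)."""
    return np.vstack([np.asarray(v, dtype=float) for v in df[column]])

# Tiny stand-in for a loaded step DataFrame (the real ones come from spark.read.parquet).
demo = pd.DataFrame({"topological_embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]})
demo_emb = to_matrix(demo)
print(demo_emb.shape)  # (3, 2)
```

With the real data this would be called as `to_matrix(step1)`, `to_matrix(step2)` and so on.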


If zero-weighted edges are equivalent to not having the edges at all, then the embeddings from step1, step3, and step5 should be the same, while the embeddings from step2, step4, and step6 should be equivalent to each other.

In [None]:
print(step1.equals(step3))
print(step1.equals(step5))
print(step3.equals(step5))


Not what I expected, let's check further.

Step2 (when all weights are the same and all edges are present) should be equivalent to step6 (when no weighting is applied and all edges are present)

In [None]:
print(step2.equals(step6))
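
Note that `DataFrame.equals` requires exact equality, so even tiny floating-point differences produce `False`. A tolerance-based comparison can be sketched as follows; the helper name `embeddings_match` and the tolerance value are assumptions, and the check presumes both matrices list the nodes in the same order.

```python
import numpy as np

def embeddings_match(a, b, atol=1e-6):
    """Return True if two embedding matrices have the same shape and agree within atol."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a.shape == b.shape and bool(np.allclose(a, b, atol=atol))

x = np.array([[0.1, 0.2], [0.3, 0.4]])
noise_only = embeddings_match(x, x + 1e-9)  # tiny numerical noise
real_diff = embeddings_match(x, x + 0.1)    # a real difference
print(noise_only, real_diff)
```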

In [None]:
#compare step 1,3,5
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

emb1 = np.array([np.array(i) for i in step1.topological_embedding.values])

emb3 = np.array([np.array(i) for i in step3.topological_embedding.values])

emb5 = np.array([np.array(i) for i in step5.topological_embedding.values])


similarities = cosine_similarity(emb1, emb3)
print(np.mean(similarities))
similarities = cosine_similarity(emb1, emb5)
print(np.mean(similarities))
similarities = cosine_similarity(emb3, emb5)
print(np.mean(similarities))

In [None]:
#compare step 2,4,6

emb2 = np.array([np.array(i) for i in step2.topological_embedding.values])

emb4 = np.array([np.array(i) for i in step4.topological_embedding.values])

emb6 = np.array([np.array(i) for i in step6.topological_embedding.values])

similarities = cosine_similarity(emb2, emb4)
print(np.mean(similarities))
similarities = cosine_similarity(emb2, emb6)
print(np.mean(similarities))
similarities = cosine_similarity(emb4, emb6)
print(np.mean(similarities))

In [None]:
#check difference between step1 and 2, step3 and 4, as well as step5 and 6
similarities = cosine_similarity(emb1, emb2)
print(np.mean(similarities))
similarities = cosine_similarity(emb3, emb4)
print(np.mean(similarities))
similarities = cosine_similarity(emb5, emb6)
print(np.mean(similarities))

Although cosine similarity might not be the best metric, it shows how greatly the embeddings differ.
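
One caveat: `cosine_similarity(emb1, emb3)` returns the full matrix of every node in one setup against every node in the other, so its mean mixes same-node and cross-node pairs. A per-node comparison can be sketched as below. It assumes row `i` refers to the same node in both matrices, which may require sorting by a node identifier before converting to arrays (the row order after reading parquet through Spark may not be guaranteed). The helper name `rowwise_cosine` is made up.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between a[i] and b[i] for every row i (same node in both setups)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    num = np.sum(a * b, axis=1)
    denom = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / denom  # rows with zero norm would yield nan

a = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([[3.0, 0.0], [0.0, -1.0]])
sims = rowwise_cosine(a, b)
print(sims, sims.mean())
```

On the real data, `rowwise_cosine(emb1, emb3).mean()` would then summarise how much each node's own embedding moved between the two setups.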

## Embeddings raw

In [None]:
print(np.mean(emb1, axis=1))
pd.DataFrame(emb1)


In [None]:
print(np.mean(emb2, axis=1))
pd.DataFrame(emb2)

In [None]:
print(np.mean(emb3, axis=1))
pd.DataFrame(emb3)

In [None]:
print(np.mean(emb4, axis=1))
pd.DataFrame(emb4)

In [None]:
print(np.mean(emb5, axis=1))
pd.DataFrame(emb5)

In [None]:
print(np.mean(emb6, axis=1))
pd.DataFrame(emb6)

## Visual checks

In [None]:
emb_list =[emb1,emb2,emb3,emb4,emb5,emb6]
print('1 - zero-weighted edges; 2 - one-weighted edges; 3 - zero weighted edges + rm; 4 - one-weighted edges + rm; 5 - rm edges, no weights; 6 - all edges, no weights')
plt.subplots(1,6, figsize=(20, 5), sharex=True)
for i, emb in enumerate(emb_list, start=1):
    plt.subplot(1,6,i)
    plt.title(f'setup {i}')
    plt.hist(emb, edgecolor='black')
plt.tight_layout()
plt.show()

In [None]:
emb_list =[emb1,emb2,emb3,emb4,emb5,emb6]
print('1 - zero-weighted edges; 2 - one-weighted edges; 3 - zero weighted edges + rm; 4 - one-weighted edges + rm; 5 - rm edges, no weights; 6 - all edges, no weights')
plt.subplots(1,6, figsize=(10, 6)) #, sharey=True, sharex=True)
for i, emb in enumerate(emb_list, start=1):
    plt.subplot(1,6,i)
    plt.title(f'setup {i}')
    pca = prince.PCA(n_components=2)
    result = pca.fit_transform(pd.DataFrame(emb))
    plt.scatter(result[0], result[1], alpha=0.5)
    plt.xlabel('')
    plt.ylabel('PC 2')
    plt.grid(True)
plt.suptitle('PC 1',y=0.02)
plt.tight_layout()
plt.show()
#idea - change weights 

In [None]:
# same axes
emb_list =[emb1,emb2,emb3,emb4,emb5,emb6]
print('1 - zero-weighted edges; 2 - one-weighted edges; 3 - zero weighted edges + rm; 4 - one-weighted edges + rm; 5 - rm edges, no weights; 6 - all edges, no weights')
plt.subplots(1,6, figsize=(10, 6), sharey=True, sharex=True)
for i, emb in enumerate(emb_list, start=1):
    plt.subplot(1,6,i)
    plt.title(f'setup {i}')
    pca = prince.PCA(n_components=2)
    result = pca.fit_transform(pd.DataFrame(emb))
    plt.scatter(result[0], result[1], alpha=0.5)
    plt.xlabel('PC 1')
    plt.ylabel('PC 2')
    plt.grid(True)
plt.tight_layout()
plt.show()
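
Sharing axis limits does not make the panels directly comparable, because each panel fits its own PCA and therefore gets its own PC 1 and PC 2. One possible alternative is to fit the PCA once on a reference setup and project every other setup with it. The sketch below uses scikit-learn's `PCA` (rather than `prince`) and random stand-in data, both chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Random stand-ins for two embedding matrices (the real ones are emb1 ... emb6).
reference = rng.normal(size=(50, 8))
other = reference + rng.normal(scale=0.01, size=(50, 8))

pca = PCA(n_components=2).fit(reference)  # fit the axes once, on the reference setup
ref_2d = pca.transform(reference)         # project every setup onto the same axes
other_2d = pca.transform(other)

print(ref_2d.shape, other_2d.shape)
```

Which setup to use as the reference is a judgment call; setup 5 (edges removed, no weights) would be a natural candidate here.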

In [None]:
#check the additional 4 steps with 0.25, 0.5, 0.75 and 0.9 weights
# same axes
emb_list ={1:['weight=0',emb1], 2:['weight=0.25',emb7],3:['weight=0.5',emb8],4:['weight=0.75',emb9],5:['weight=0.9',emb10], 6:['weight=1',emb2]}
plt.subplots(1,6, figsize=(10, 6), sharey=True, sharex=True)
# idx avoids shadowing the builtin id()
for idx, (label, emb) in emb_list.items():
    plt.subplot(1,6,idx)
    plt.title(label)
    pca = prince.PCA(n_components=2)
    result = pca.fit_transform(pd.DataFrame(emb))
    plt.scatter(result[0], result[1], alpha=0.5)
    plt.xlabel('PC 1')
    plt.ylabel('PC 2')
    plt.grid(True)
plt.tight_layout()
plt.show()

## Distance measures

In [None]:
#no treat edges (setups 1, 3, 5)
dist = distance.cdist(emb1, emb3, 'euclidean')
print('Mean', np.mean(dist), 'STD',np.std(dist))
#print(dist, '\n')
dist = distance.cdist(emb1, emb5, 'euclidean')
print('Mean', np.mean(dist), 'STD',np.std(dist))
#print(dist, '\n')
dist = distance.cdist(emb3, emb5, 'euclidean')
print('Mean', np.mean(dist), 'STD',np.std(dist))
#print(dist, '\n')

In [None]:
#treat edges (setups 2, 4, 6)
dist = distance.cdist(emb2, emb4, 'euclidean')
print('Mean', np.mean(dist), 'STD',np.std(dist))
#print(dist, '\n')
dist = distance.cdist(emb2, emb6, 'euclidean')
print('Mean', np.mean(dist), 'STD',np.std(dist))
#print(dist, '\n')
dist = distance.cdist(emb4, emb6, 'euclidean')
print('Mean', np.mean(dist), 'STD',np.std(dist))
#print(dist, '\n')

In [None]:
#extra checks
dist = distance.cdist(emb1, emb2, 'euclidean')
print('Mean', np.mean(dist), 'STD',np.std(dist))
#print(dist, '\n')
dist = distance.cdist(emb3, emb4, 'euclidean')
print('Mean', np.mean(dist), 'STD',np.std(dist))
#print(dist, '\n')
dist = distance.cdist(emb5, emb6, 'euclidean')
print('Mean', np.mean(dist), 'STD',np.std(dist))
#print(dist, '\n')
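
`distance.cdist(emb1, emb3)` also compares every node with every other node, so the mean above is dominated by cross-node distances. A compact alternative is a table of mean same-node distances for every pair of setups. The sketch below assumes aligned row order across setups; the helper names and the toy matrices are made up for illustration.

```python
import numpy as np
import pandas as pd

def mean_rowwise_distance(a, b):
    """Mean Euclidean distance between matching rows of two embedding matrices."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b), axis=1).mean())

def distance_table(embs):
    """Symmetric table of mean_rowwise_distance for every pair in a {name: matrix} dict."""
    names = list(embs)
    table = pd.DataFrame(0.0, index=names, columns=names)
    for i in names:
        for j in names:
            table.loc[i, j] = mean_rowwise_distance(embs[i], embs[j])
    return table

# Toy stand-ins; in the notebook this would be {"setup 1": emb1, ..., "setup 6": emb6}.
toy = {
    "a": np.zeros((3, 2)),
    "b": np.zeros((3, 2)),
    "c": np.full((3, 2), 1.0),
}
table = distance_table(toy)
print(table)
```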

Conclusion so far: the embeddings from setup 1 are very different from those of the remaining setups, while the differences among the remaining setups are very subtle. There are two possible scenarios:
* zero weights have a large effect on the embedding calculation, which is why setup 1 differs so much from the rest (likely, but why?)
* something went wrong with removing the edges and generating the embeddings (unlikely - I checked the code, neo4j does show the edges are absent from the graph data science projection, and all the checks work)