*Author & point of contact: Piotr Kaniewski (piotr@everycure.org)*

# e2e classifier evaluation check
We noticed that our GraphSAGE model produces odd-looking topological embeddings, which likely hurt downstream performance. We should tune GraphSAGE so that it produces more informative embeddings.

# Summary - high level 

We had the following runs: 

**2024-08-07 - pre-troubleshooting e2e run**
* LR = 0.1; BS = 20 000; Search Depth = 100; MaxIter = 10; SizeSample(25,10); Embedding Dimensions and hidden layer dims = 512; 10 epochs
* Loss decreased slowly, from approx. 0.26 to 0.19
* This is where we noticed that downstream task performance was odd (mainly due to bugs in the evaluation matrix) and where we noticed the infamous 'spaghetti monster' UMAP.

**2024-08-19 - first-troubleshooting e2e run**
* LR = 0.01; BS = 20 000; Search Depth = 100; MaxIter = 10; SizeSample(25,10); Embedding Dimensions and hidden layer dims = 512; 10 epochs
* The odd-looking PCA persisted.

**2024-08-20 - 'random' e2e run**
* LR = 0.01; BS = 20 000; Search Depth = 100; MaxIter = 1; SizeSample(25,10); Embedding Dimensions and hidden layer dims = 512; 1 epoch
* Surprisingly, this initially gave a nice-looking PCA and really good downstream task performance (which revealed that we hadn't removed all drug-disease edges; this is now fixed thanks to Laurens). The data leakage was nevertheless proof that **GraphSAGE topological enrichment worked**, which we cannot say for sure about the other runs.
* Yesterday this specific setting was re-run, and while the PCA looked a bit odd, the downstream task performance was consistently really good (a bit lower, but that is expected with a different run).

**2024-08-20 - Chunyu-like e2e run**
* LR = 0.001; BS = 256; Search Depth = 100; MaxIter = 1000; SizeSample(96,96); Embedding Dimensions and hidden layer dims = 512; 10 epochs
* Chunyu-like parameters, but the run took very long, so it was terminated. Given what we know now, it would likely have overfitted anyway.

**2024-08-21 - small Chunyu-like e2e run**
* LR = 0.001; BS = 256; Search Depth = 100; MaxIter = 100; SizeSample(96,96); Embedding Dimensions and hidden layer dims = 512; 10 epochs
* Gave even worse results than the first-troubleshooting e2e run, with an odd-looking PCA and really bad downstream task performance, which points to potential overfitting.

My explanation: 
* We use a very high search depth (100). In effect, when GraphSAGE samples a node, that node is approximated by the nodes within 100 hops of it, which is probably the whole graph. This is done for 20 000 nodes x 4 (batch size x concurrency of 4), which gives us a lot of repeated and consistent information. With such a large batch size, training is also quite stable.
* The sigmoid activation function is a poor choice because its gradient vanishes quickly (hence ReLU being the common default nowadays), but with only one epoch and one iteration the vanishing gradient is not yet apparent. It would nevertheless potentially explain the low variance and the plateauing loss becoming more and more apparent as we train for more iterations/epochs: as we increase the number of epochs, we increase the number of forward passes through sigmoid functions, which squash the values into the range 0 to 1 and give low variance.
  * This is clearly visible in the notebook e2e_pca.ipynb: when we train GraphSAGE for only 1 epoch and 1 iteration, there is a very subtle difference between sigmoid and the others. Only after a longer 
* The wiggly line is a sign of overtraining. GraphSAGE is supposed to approximate a node based on its connectivity, i.e. its neighbors. Think about a node A. In the first epoch, relatively original features are aggregated and averaged to represent A, even when we use 100 hops. In the second epoch, however, we have an approximate aggregation of the aggregations from epoch 1, which decreases in value due to the sigmoid function. These aggregated values should become more and more similar to each other (I think) and also decrease in variance due to the sigmoid, which eventually leads to the squiggly line.
  * This is consistent with the check I did in e2e_pca.ipynb, where the wiggly line starts appearing as we increase the number of epochs/iterations.
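
The variance-shrinking argument above can be checked on a toy example. The sketch below is not GraphSAGE: it assumes a small random graph, scalar node features, a fixed weight of 1, and a "mean of self and neighbors, then sigmoid" update, all chosen purely for illustration (the helper names are hypothetical). It only shows how repeatedly averaging neighbors and squashing the result through a sigmoid drives the spread of node values down.

```python
import math
import random
import statistics


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_random_graph(n_nodes, n_neighbors, rng):
    """Hypothetical toy graph: each node gets `n_neighbors` random neighbors."""
    return {
        node: rng.sample([m for m in range(n_nodes) if m != node], n_neighbors)
        for node in range(n_nodes)
    }


def aggregate_once(values, graph):
    """One round of 'mean of self and neighbors, then sigmoid'."""
    new_values = []
    for node, neighbors in graph.items():
        pooled = [values[node]] + [values[m] for m in neighbors]
        new_values.append(sigmoid(statistics.fmean(pooled)))
    return new_values


def variance_per_round(n_rounds, n_nodes=200, n_neighbors=10, seed=42):
    """Variance of node values before training and after each aggregation round."""
    rng = random.Random(seed)
    graph = make_random_graph(n_nodes, n_neighbors, rng)
    values = [rng.gauss(0.0, 1.0) for _ in range(n_nodes)]
    variances = [statistics.pvariance(values)]
    for _ in range(n_rounds):
        values = aggregate_once(values, graph)
        variances.append(statistics.pvariance(values))
    return variances


variances = variance_per_round(n_rounds=5)
for round_idx, var in enumerate(variances):
    print(f"round {round_idx}: variance = {var:.2e}")
```

In this toy setting the variance drops sharply with every round, which is at least compatible with the collapse described above; it says nothing about the learned weights of the real model.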

# Code

In [None]:
import os
import joblib
import subprocess
import tqdm
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier                                    

# Setting the root path and changing the directory
root_path = subprocess.check_output(['git', 'rev-parse', '--show-toplevel']).decode().strip()
os.chdir(Path(root_path) / 'pipelines' / 'matrix')

#%load_ext kedro.ipython
#%reload_kedro  --env base

In [None]:
# load node metadata
nodes_df = pd.read_parquet("data/03_primary/rtx_kg2/nodes")

# load ground truth (gt)
gt = pd.read_csv('scratch/gt.csv').drop('Unnamed: 0',axis=1)

# 20240807 - pre troubleshooting e2e run

In [None]:
openai_graph = pd.read_csv('notebooks/topological_embeddings_20240807.csv', index_col=0)

openai_graph['topological_embedding'] = [np.fromstring(key.strip('[]'), sep=' ') for key in openai_graph.topological_embedding]

from sklearn.decomposition import PCA

pca = PCA(n_components = 2 )
top_embed = pca.fit_transform(pd.DataFrame(openai_graph['topological_embedding'].tolist()))

plt.scatter(top_embed.transpose()[0], top_embed.transpose()[1])
plt.show()

sns.kdeplot(np.array(openai_graph['topological_embedding'].tolist()).flatten())
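
The scatter plot above is judged by eye. One way to put a number on how collapsed the embeddings are is the share of variance captured by the first principal components. The sketch below assumes synthetic matrices (`spread` and `collapsed` are hypothetical stand-ins, not the notebook's embeddings) and computes the ratios with a plain SVD; for the fitted scikit-learn object in the cell above, the attribute `pca.explained_variance_ratio_` reports the same quantity.

```python
import numpy as np


def explained_variance_ratio(X, n_components=2):
    """Fraction of total variance captured by the top principal components."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered matrix are proportional to the PCA variances.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()


rng = np.random.default_rng(42)

# Hypothetical "healthy" embeddings: independent dimensions.
spread = rng.normal(size=(500, 16))

# Hypothetical "collapsed" embeddings: one latent direction plus tiny noise.
latent = rng.normal(size=(500, 1))
direction = rng.normal(size=(1, 16))
collapsed = latent @ direction + 0.01 * rng.normal(size=(500, 16))

print("spread:   ", explained_variance_ratio(spread))
print("collapsed:", explained_variance_ratio(collapsed))
```

A first ratio close to 1 means almost all points lie along a single line in embedding space, which would be one possible reading of a 'spaghetti'-like projection.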

In [None]:
#create sub-dfs
DRUG_TYPE = ['biolink:Drug', 'biolink:SmallMolecule']
DISEASE_TYPE = ['biolink:Disease', 'biolink:PhenotypicFeature', 'biolink:BehavioralFeature', 'biolink:DiseaseOrPhenotypicFeature']

#sample
nodes_df_drugs = nodes_df[nodes_df['category'].isin(DRUG_TYPE)]
nodes_df_disease = nodes_df[nodes_df['category'].isin(DISEASE_TYPE)]

#train test split 
train, test = train_test_split(gt, stratify=gt['y'], test_size=0.1, random_state=42)
train_tp_df = train[train['y']==1]
train_tp_df_drugs = train_tp_df['source'].reset_index(drop=True)
train_tp_df_diseases = train_tp_df['target'].reset_index(drop=True)
len_tp_tr = len(train_tp_df)
n_rep = 3

# create random drug-disease pairs
rand_drugs = nodes_df_drugs['id'].sample(n_rep*len_tp_tr, replace=True, ignore_index = True, random_state = 42) # 42
rand_disease = nodes_df_disease['id'].sample(n_rep*len_tp_tr, replace=True, ignore_index = True, random_state = 42) # 42
train_tp_diseases_copies = pd.concat([train_tp_df_diseases for _ in range(n_rep)], ignore_index = True)
train_tp_drugs_copies = pd.concat([train_tp_df_drugs for _ in range(n_rep)], ignore_index = True)
tmp_1 = pd.DataFrame({'source': rand_drugs, 'target': train_tp_diseases_copies, 'y': 2})
tmp_2 = pd.DataFrame({'source': train_tp_drugs_copies, 'target': rand_disease, 'y': 2})
un_data_1 =  pd.concat([tmp_1,tmp_2], ignore_index =True)
train_df_1 = pd.concat([train, un_data_1]).sample(frac=1, random_state = 42).reset_index(drop=True)
test = test.reset_index(drop=True)

In [None]:
# test set

# Collecting relevant info from TP test set 
test_tp_df = test[test['y'] == 1]
len_tp_tst = len(test_tp_df)
test_tp_drugs = test_tp_df['source'].reset_index(drop=True)
test_tp_diseases = test_tp_df['target'].reset_index(drop=True)

# Adding synthetic data to TP+TN dataset
rand_drugs_tmp = nodes_df_drugs['id'].sample(len_tp_tst, replace=True, ignore_index = True, random_state = 42)
rand_dis_tmp = nodes_df_disease['id'].sample(len_tp_tst, replace=True, ignore_index = True, random_state = 42)
tmp_1 = pd.DataFrame({'source': rand_drugs_tmp, 'target': test_tp_diseases, 'y': 2})
tmp_2 = pd.DataFrame({'source': test_tp_drugs, 'target': rand_dis_tmp, 'y': 2})
un_data_kgml = pd.concat([tmp_1,tmp_2], ignore_index =True)
test = pd.concat([test, un_data_kgml]).sample(frac=1, random_state = 42).reset_index(drop=True)
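
The synthetic 'unknown' pairs (label 2) built in the two cells above follow one pattern: keep one side of each true-positive pair and replace the other side with a randomly drawn node. A minimal standalone sketch of that pattern, using plain Python lists and hypothetical IDs instead of the notebook's DataFrames (`corrupt_pairs` is an illustrative name, not a function from this project):

```python
import random


def corrupt_pairs(tp_pairs, drug_ids, disease_ids, n_rep=1, seed=42):
    """Build label-2 pairs by replacing the drug or the disease of each true positive."""
    rng = random.Random(seed)
    unknown = []
    for _ in range(n_rep):
        for drug, disease in tp_pairs:
            # Random drug paired with the known disease.
            unknown.append((rng.choice(drug_ids), disease, 2))
            # Known drug paired with a random disease.
            unknown.append((drug, rng.choice(disease_ids), 2))
    return unknown


# Hypothetical toy identifiers, for illustration only.
drug_ids = ["drug:A", "drug:B", "drug:C"]
disease_ids = ["dis:X", "dis:Y", "dis:Z"]
tp_pairs = [("drug:A", "dis:X"), ("drug:B", "dis:Y")]

unknown_pairs = corrupt_pairs(tp_pairs, drug_ids, disease_ids, n_rep=3)
print(len(unknown_pairs))
```

Like the cells above, this sketch does not check whether a sampled pair happens to coincide with a known positive, so a small fraction of label-2 pairs may in fact be true treatments.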

In [None]:

# Create a dictionary for fast lookup
embedding_dict = openai_graph.set_index('id')['topological_embedding'].to_dict()

# Initialize feature matrices
feature_length = 1024
X_openai = np.empty((len(train_df_1), feature_length), dtype='float32')
X_openai_test = np.empty((len(test), feature_length), dtype='float32')

# Function to get concatenated vectors
def get_concatenated_vector(row):
    drug_vector = embedding_dict[row['source']]
    disease_vector = embedding_dict[row['target']]
    return np.concatenate([drug_vector, disease_vector])

# Apply the function to each row of train_df_1
X_openai = np.vstack(train_df_1.apply(get_concatenated_vector, axis=1))

# Convert target variable to numpy array
y_openai = train_df_1['y'].to_numpy()

# Reset index for test DataFrame
test = test.reset_index(drop=True)

X_openai_test = np.vstack(test.apply(get_concatenated_vector, axis=1))

y_openai_test = test['y'].to_numpy()

In [None]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_openai, y_openai)

y_openai_pred = xgb.predict(X_openai_test)
y_openai_proba = xgb.predict_proba(X_openai_test)
joblib.dump(xgb, 'notebooks/e2e_troubleshoot/openai_e2e_0_xgb.pkl')

print('xgboost scores (not treat; treat; unknown)')
print(y_openai_proba)

openai_df_full_graph_1 = pd.DataFrame(y_openai_proba)
openai_df_full_graph_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']
openai_df_full_graph_1.to_csv('notebooks/e2e_troubleshoot/openai_e2e_0_xgb.csv')

In [None]:
openai_df_full_graph_1.columns=['not treat','treat', 'unknown']
openai_df_full_graph_1['GT']=test.y

In [None]:
openai_df_full_1=openai_df_full_graph_1
# Define a function to plot KDEs
def plot_kde(ax, data, column, gt, title):
    sns.kdeplot(data=data.loc[data.GT==gt], x=column, ax=ax, label='Non-topological Embeddings')
    sns.kdeplot(data=openai_df_full_graph_1.loc[openai_df_full_graph_1.GT==gt], x=column, ax=ax, label='Topological Embeddings')
    ax.set_title(title)
    ax.legend()

# Create a 3x3 grid for subplots
fig, axes = plt.subplots(3, 3, figsize=(18, 12))

# Plot True Positives
plot_kde(axes[0, 0], openai_df_full_1, 'not treat', 1, 'OpenAI - Distribution of not treat scores for True Positives')
plot_kde(axes[0, 1], openai_df_full_1, 'treat', 1, 'OpenAI - Distribution of treat scores for True Positives')
plot_kde(axes[0, 2], openai_df_full_1, 'unknown', 1, 'OpenAI - Distribution of unknown scores for True Positives')

# Plot True Negatives
plot_kde(axes[1, 0], openai_df_full_1, 'not treat', 0, 'OpenAI - Distribution of not treat scores for True Negatives')
plot_kde(axes[1, 1], openai_df_full_1, 'treat', 0, 'OpenAI - Distribution of treat scores for True Negatives')
plot_kde(axes[1, 2], openai_df_full_1, 'unknown', 0, 'OpenAI - Distribution of unknown scores for True Negatives')

# Plot True Unknown (GT == 2)
plot_kde(axes[2, 0], openai_df_full_1, 'not treat', 2, 'OpenAI - Distribution of not treat scores for True Unknown')
plot_kde(axes[2, 1], openai_df_full_1, 'treat', 2, 'OpenAI - Distribution of treat scores for True Unknown')
plot_kde(axes[2, 2], openai_df_full_1, 'unknown', 2, 'OpenAI - Distribution of unknown scores for True Unknown')


# Adjust layout
plt.tight_layout()
plt.show()

## Quantitative Evaluation
Checking quantitative scores on test set 1 

In [None]:
# Functions for "alexei_fellowship.ipynb"

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


## Dataset utilities

from sklearn.model_selection import train_test_split


def give_vectorised_dataset(df, emb_dict):
    '''
    Args:
    - df (pd.DataFrame): drug-disease dataset
    - emb_dict (dictionary with np.array values): drug and disease embeddings
    Out: vectorised dataset `(X, y)` (tuple of np.array)
    '''
    # Generating design matrix
    drug_ids = pd.Series(df['source'].unique())
    dis_ids = pd.Series(df['target'].unique())
    vector_length_drug = len(emb_dict[drug_ids[0]]) # Here we assume consistent array lengths
    vector_length_dis = len(emb_dict[dis_ids[0]]) 
    X = np.empty(shape=(len(df), vector_length_drug+vector_length_dis), dtype = 'float32')
    for index, row in df.reset_index().iterrows():
        drug_id = row['source']
        disease_id = row['target']
        drug_vector = emb_dict[drug_id]
        disease_vector = emb_dict[disease_id]
        X[index] = np.concatenate([drug_vector, disease_vector])
    # Target vector
    y = df['y'].to_numpy()
    return X, y
    

def give_X(df, emb_dict):
    '''
    Args:
    - df (pd.DataFrame): drug-disease input dataset (no labels)
    - emb_dict (dictionary with np.array values): drug and disease embeddings
    Out: vectorised input dataset `X` (np.array)
    '''
    # Generating design matrix
    drug_ids = pd.Series(df['source'].unique())
    dis_ids = pd.Series(df['target'].unique())
    vector_length_drug = len(emb_dict[drug_ids[0]]) # Here we assume consistent array lengths
    vector_length_dis = len(emb_dict[dis_ids[0]]) 
    X = np.empty(shape=(len(df), vector_length_drug+vector_length_dis), dtype = 'float32')
    for index, row in df.reset_index().iterrows():
        drug_id = row['source']
        disease_id = row['target']
        drug_vector = emb_dict[drug_id]
        disease_vector = emb_dict[disease_id]
        X[index] = np.concatenate([drug_vector, disease_vector])
    return X



def train_test_split_pd(df, test_size, random_state):
    '''
    Returns train-test split for a Pandas dataframe.
    '''
    df_train, df_test, y_train, y_test = train_test_split(df[['source','target']].to_numpy(), 
         df['y'].to_numpy(), test_size=test_size, stratify=df['y'].to_numpy(), random_state=random_state)
    train_df = pd.concat([pd.DataFrame(df_train), pd.DataFrame(y_train)], axis=1)
    train_df.columns = df.columns
    test_df = pd.concat([pd.DataFrame(df_test), pd.DataFrame(y_test)], axis=1)
    test_df.columns = df.columns
    return train_df, test_df
    
    
## MRR and Hit@k computation

import bisect
pd.options.mode.chained_assignment = None  # silence warning: see https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas


def give_rank(prob, rand_pairs):
    '''
    Computes rank of a probability score `prob` among pd.DataFrame `rand_pairs`, 
    which has a column 'treat' probability scores sorted in a descending fashion.
    '''
    N = len(rand_pairs)
    rand_prob_lst = list(rand_pairs['treat'])
    rand_prob_lst.reverse() # Ascending order for bisect package 
    rank_ascending = bisect.bisect(rand_prob_lst, prob)
    rank = N - rank_ascending + 1
    return rank


def give_mrr(N, model, test_tptn_df, drug_nodes, disease_nodes, emb_dict,
                portion = 1, load_bar = True):
    '''
    Computes MRR using the following sampling method: 
    For each TP drug-disease pair, produce `N` random samples 
    by N/2 replacements of the drug and N/2 replacements of the disease.
    '''
    # Number of replacements
    N_samps = int(N/2)
    
    # Creating TP dataset and sampling portion
    test_tp_df = test_tptn_df[test_tptn_df['y']==1]
    test_tp_df_portion = test_tp_df.sample(frac=portion, replace = False)
    test_tp_df_portion = test_tp_df_portion.reset_index(drop=True)

    # Compute treat probability scores for test TP drug-disease pairs
    test_tp_X = give_X(test_tp_df_portion, emb_dict)
    tp_probs = model.predict_proba(test_tp_X)[:,1]

    # Loading bar
    if load_bar == False:
        tqdm_tmp = lambda x: x
    else:
        tqdm_tmp = tqdm
        
    rec_rank_total = 0
    for idx, row in tqdm_tmp(test_tp_df_portion.iterrows()):
        drug = row['source']
        disease = row['target']
        # Generate random samples
        rand_drugs = drug_nodes['id'].sample(N_samps, replace=False, ignore_index=True)
        rand_dis = disease_nodes['id'].sample(N_samps, replace=False, ignore_index=True)
        tmp_1 = pd.DataFrame({'source': rand_drugs, 'target': disease, 'y': 2})
        tmp_2 = pd.DataFrame({'source': drug, 'target': rand_dis, 'y': 2})
        rand_pairs = pd.concat([tmp_1,tmp_2], ignore_index=True)
        # Compute treat probability scores for random pairs
        rand_X = give_X(rand_pairs, emb_dict)
        rand_pairs['treat'] = model.predict_proba(rand_X)[:,1]
        rand_pairs = rand_pairs.sort_values(by = 'treat', ascending=False)
        # Compute rank and add to sum
        row_prob = tp_probs[idx]
        rank = give_rank(row_prob, rand_pairs)
        rec_rank_total += 1/rank

    mrr = rec_rank_total/len(test_tp_df_portion)
    return mrr


def give_hitk(k, N, model, test_tptn_df, drug_nodes, disease_nodes, emb_dict,
                portion = 1, load_bar = True):
    '''
    Computes Hit@k using the following sampling method: 
    For each TP drug-disease pair, produce `N` random samples 
    by N/2 replacements of the drug and N/2 replacements of the disease.
    '''
    # Number of replacements
    N_samps = int(N/2)
    
    # Creating TP dataset and sampling portion
    test_tp_df = test_tptn_df[test_tptn_df['y']==1]
    test_tp_df_portion = test_tp_df.sample(frac=portion, replace = False)
    test_tp_df_portion = test_tp_df_portion.reset_index(drop=True)

    # Compute treat probability scores for test TP drug-disease pairs
    test_tp_X = give_X(test_tp_df_portion, emb_dict)
    tp_probs = model.predict_proba(test_tp_X)[:,1]

    # Loading bar
    if load_bar == False:
        tqdm_tmp = lambda x: x
    else:
        tqdm_tmp = tqdm
        
    geqk_count = 0
    for idx, row in tqdm_tmp(test_tp_df_portion.iterrows()):
        drug = row['source']
        disease = row['target']
        # Generate random samples
        rand_drugs = drug_nodes['id'].sample(N_samps, replace=False, ignore_index = True)
        rand_dis = disease_nodes['id'].sample(N_samps, replace=False, ignore_index = True)
        tmp_1 = pd.DataFrame({'source': rand_drugs, 'target': disease, 'y': 2})
        tmp_2 = pd.DataFrame({'source': drug, 'target': rand_dis, 'y': 2})
        rand_pairs = pd.concat([tmp_1,tmp_2], ignore_index =True)
        # Compute treat probability scores for random pairs
        rand_X = give_X(rand_pairs, emb_dict)
        rand_pairs['treat'] = model.predict_proba(rand_X)[:,1]
        rand_pairs = rand_pairs.sort_values(by = 'treat', ascending = False)
        # Determine if rank <= k
        row_prob = tp_probs[idx]
        rank = give_rank(row_prob, rand_pairs)
        if rank <= k:
            geqk_count += 1

    hitk = geqk_count/len(test_tp_df_portion)
    return hitk
    

## Bayesian hyperparameter optimisation    

from xgboost import XGBClassifier
from skopt import gp_minimize
from skopt.utils import use_named_args
from skopt.plots import plot_convergence
from tqdm.notebook import tqdm
import time


class tqdm_skopt(object):
    'Loading bar for hyperparameter optimisation'
    def __init__(self, **kwargs):
        self._bar = tqdm(**kwargs) 
        
    def __call__(self, res):
        self._bar.update()


def perform_hyperparameter_opt(df, search_space, args_xg, eval_score, emb_dict,
                                n_calls=100, val_size=1/9, verbose=True, random_state=1):
    """
    Trains an XGBoost model on the training set `df` (pd.DataFrame) with hyperparameter optimisation. 
    The objective function,  `eval_score` must take the model and validation set as arguments. 
    Output: 
    `best_model`: XGBoost model instance with best validation score (trained on the full training set `(X_ftr,y_ftr)`)
    `result`: scikit-optimize `OptimizeResult` instance
    """
    # Define the function used to evaluate a given configuration
    @use_named_args(search_space)
    def evaluate_model(**params):
        # Configure the model with specific hyperparameters
        model = XGBClassifier(**args_xg)
        model.set_params(**params)
        # Train and evaluate with different split every round
        train_set, val_set = train_test_split_pd(df, test_size = val_size, random_state=None)
        X_tr, y_tr = give_vectorised_dataset(train_set, emb_dict)
        model.fit(X_tr, y_tr, verbose = False)
        score = eval_score(model, val_set)
        # Convert from a maximizing score to a minimising score
        return 1.0 - score
    
    # Perform optimisation
    start_time = time.time()
    if verbose == False:
        callback=[tqdm_skopt(total=n_calls, desc="Progress")]
    else:
        callback = None
    result = gp_minimize(evaluate_model, search_space, 
                         verbose = verbose, 
                         n_calls = n_calls,
                         callback=callback,
                         random_state=random_state)
    

    # Train best model on full training set `(X_ftr, y_ftr)`
    opt_params = {param.name:param_val for param, param_val in zip(search_space,result.x)}
    best_model = XGBClassifier(**args_xg)
    best_model.set_params(**opt_params)
    X_ftr, y_ftr = give_vectorised_dataset(df, emb_dict)
    best_model.fit(X_ftr, y_ftr)
    end_time = time.time()
    
    # Summarising finding
    plot_convergence(result)
    plt.show()
    print('Best validation score: %.3f' % (1.0 - result.fun))
    print('Time taken (seconds) %.3f' % (end_time - start_time))
    
    return best_model, result

In [None]:
from sklearn.metrics import f1_score
#from utils import *

# map
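# Predicted class = highest-scoring of the three score columns (no rounding, so near-ties are not forced to 'not treat')
openai_df_full_graph_1['PT'] = openai_df_full_graph_1[['not treat', 'treat', 'unknown']].idxmax(axis=1).map({'not treat': 0, 'treat': 1, 'unknown': 2})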

In [None]:
# Macro F1 of the predicted class (PT) against the ground truth (GT)
f1_score(openai_df_full_graph_1['GT'], openai_df_full_graph_1['PT'], average='macro')
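
For reference, the macro average used above is the unweighted mean of the per-class F1 scores. The from-scratch sketch below uses hypothetical label lists and illustrative helper names; it is meant to mirror the definition, not to replace `sklearn.metrics.f1_score`.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, written as 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)


# Hypothetical labels: 0 = not treat, 1 = treat, 2 = unknown.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0, 2]

print(macro_f1(y_true, y_pred))
```

Because swapping the two arguments only exchanges false positives and false negatives, the per-class F1 and hence the macro score are unchanged by the argument order; precision and recall taken individually would not be.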

In [None]:
N = 1000

other_args = {'drug_nodes':nodes_df_drugs, 'disease_nodes':nodes_df_disease, 'emb_dict':embedding_dict, 
              'portion':1, 'load_bar':False} 

# Computing scores for all models
metrics_lst = []
mrr_tmp = give_mrr(1000, xgb, test[test.y!=2], **other_args)
hit_metrics = []
for k in [1,3,5]:
    hit_metrics.append(give_hitk(k, 1000, xgb, test[test.y!=2],  **other_args))
print(hit_metrics)
print(mrr_tmp)
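
To make the printed numbers easier to interpret, here is a small worked example of the rank-based metrics on hypothetical scores. It follows the same tie handling as `give_rank` above (rank = 1 + number of random pairs scoring strictly higher, via `bisect`), but the helper names and all probabilities are made up for illustration.

```python
import bisect


def rank_among(prob, negative_probs):
    """1 + number of negatives scoring strictly higher than `prob`."""
    ascending = sorted(negative_probs)
    n_not_higher = bisect.bisect_right(ascending, prob)
    return len(ascending) - n_not_higher + 1


def mrr_and_hits(positive_probs, negatives_per_positive, ks=(1, 3, 5)):
    """Mean reciprocal rank and Hit@k over a list of positives."""
    ranks = [rank_among(p, negs) for p, negs in zip(positive_probs, negatives_per_positive)]
    mrr = sum(1 / r for r in ranks) / len(ranks)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return ranks, mrr, hits


# Hypothetical treat scores, for illustration only.
positive_probs = [0.9, 0.4, 0.7]
negatives_per_positive = [
    [0.1, 0.3, 0.5, 0.2],  # no negative beats 0.9 -> rank 1
    [0.8, 0.6, 0.5, 0.1],  # three negatives beat 0.4 -> rank 4
    [0.9, 0.2, 0.7, 0.3],  # one negative beats 0.7 (the tie does not) -> rank 2
]

ranks, mrr, hits = mrr_and_hits(positive_probs, negatives_per_positive)
print(ranks, mrr, hits)
```

With ranks 1, 4 and 2, the MRR is (1 + 1/4 + 1/2) / 3, and Hit@1, Hit@3 and Hit@5 are 1/3, 2/3 and 1 respectively.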

# 20240819 - first troubleshooting e2e run (lower lr)

In [None]:
openai_graph = pd.read_csv('notebooks/topological_embeddings_20240819.csv', index_col=0)

openai_graph['topological_embedding'] = [np.fromstring(key.strip('[]'), sep=' ') for key in openai_graph.topological_embedding]

from sklearn.decomposition import PCA

pca = PCA(n_components = 2 )
top_embed = pca.fit_transform(pd.DataFrame(openai_graph['topological_embedding'].tolist()))

plt.scatter(top_embed.transpose()[0], top_embed.transpose()[1])
plt.show()

sns.kdeplot(np.array(openai_graph['topological_embedding'].tolist()).flatten())

In [None]:

# Create a dictionary for fast lookup
embedding_dict = openai_graph.set_index('id')['topological_embedding'].to_dict()

# Initialize feature matrices
feature_length = 1024
X_openai = np.empty((len(train_df_1), feature_length), dtype='float32')
X_openai_test = np.empty((len(test), feature_length), dtype='float32')

# Function to get concatenated vectors
def get_concatenated_vector(row):
    drug_vector = embedding_dict[row['source']]
    disease_vector = embedding_dict[row['target']]
    return np.concatenate([drug_vector, disease_vector])

# Apply the function to each row of train_df_1
X_openai = np.vstack(train_df_1.apply(get_concatenated_vector, axis=1))

# Convert target variable to numpy array
y_openai = train_df_1['y'].to_numpy()

# Reset index for test DataFrame
test = test.reset_index(drop=True)

X_openai_test = np.vstack(test.apply(get_concatenated_vector, axis=1))

y_openai_test = test['y'].to_numpy()

In [None]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_openai, y_openai)

y_openai_pred = xgb.predict(X_openai_test)
y_openai_proba = xgb.predict_proba(X_openai_test)
joblib.dump(xgb, 'notebooks/e2e_troubleshoot/openai_e2e_1_xgb.pkl')

print('xgboost scores (not treat; treat; unknown)')
print(y_openai_proba)

openai_df_full_graph_1 = pd.DataFrame(y_openai_proba)
openai_df_full_graph_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']
openai_df_full_graph_1.to_csv('notebooks/e2e_troubleshoot/openai_e2e_1_xgb.csv')

In [None]:
openai_df_full_graph_1.columns=['not treat','treat', 'unknown']
openai_df_full_graph_1['GT']=test.y

In [None]:
openai_df_full_1=openai_df_full_graph_1
# Define a function to plot KDEs
def plot_kde(ax, data, column, gt, title):
    sns.kdeplot(data=data.loc[data.GT==gt], x=column, ax=ax, label='Non-topological Embeddings')
    sns.kdeplot(data=openai_df_full_graph_1.loc[openai_df_full_graph_1.GT==gt], x=column, ax=ax, label='Topological Embeddings')
    ax.set_title(title)
    ax.legend()

# Create a 3x3 grid for subplots
fig, axes = plt.subplots(3, 3, figsize=(18, 12))

# Plot True Positives
plot_kde(axes[0, 0], openai_df_full_1, 'not treat', 1, 'OpenAI - Distribution of not treat scores for True Positives')
plot_kde(axes[0, 1], openai_df_full_1, 'treat', 1, 'OpenAI - Distribution of treat scores for True Positives')
plot_kde(axes[0, 2], openai_df_full_1, 'unknown', 1, 'OpenAI - Distribution of unknown scores for True Positives')

# Plot True Negatives
plot_kde(axes[1, 0], openai_df_full_1, 'not treat', 0, 'OpenAI - Distribution of not treat scores for True Negatives')
plot_kde(axes[1, 1], openai_df_full_1, 'treat', 0, 'OpenAI - Distribution of treat scores for True Negatives')
plot_kde(axes[1, 2], openai_df_full_1, 'unknown', 0, 'OpenAI - Distribution of unknown scores for True Negatives')

# Plot True Unknown (GT == 2)
plot_kde(axes[2, 0], openai_df_full_1, 'not treat', 2, 'OpenAI - Distribution of not treat scores for True Unknown')
plot_kde(axes[2, 1], openai_df_full_1, 'treat', 2, 'OpenAI - Distribution of treat scores for True Unknown')
plot_kde(axes[2, 2], openai_df_full_1, 'unknown', 2, 'OpenAI - Distribution of unknown scores for True Unknown')


# Adjust layout
plt.tight_layout()
plt.show()

In [None]:
from sklearn.metrics import f1_score
#from utils import *

# map
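# Predicted class = highest-scoring of the three score columns (no rounding, so near-ties are not forced to 'not treat')
openai_df_full_graph_1['PT'] = openai_df_full_graph_1[['not treat', 'treat', 'unknown']].idxmax(axis=1).map({'not treat': 0, 'treat': 1, 'unknown': 2})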

In [None]:
# Macro F1 of the predicted class (PT) against the ground truth (GT)
f1_score(openai_df_full_graph_1['GT'], openai_df_full_graph_1['PT'], average='macro')

In [None]:
N = 1000

other_args = {'drug_nodes':nodes_df_drugs, 'disease_nodes':nodes_df_disease, 'emb_dict':embedding_dict, 
              'portion':1, 'load_bar':False} 

# Computing scores for all models
metrics_lst = []
mrr_tmp = give_mrr(1000, xgb, test[test.y!=2], **other_args)
hit_metrics = []
for k in [1,3,5]:
    hit_metrics.append(give_hitk(k, 1000, xgb, test[test.y!=2],  **other_args))
print(hit_metrics)
print(mrr_tmp)

# 20240820 - e2e run - most recent 

In [None]:
openai_graph = pd.read_csv('notebooks/topological_embeddings_20240820.csv', index_col=0)

openai_graph['topological_embedding'] = [np.fromstring(key.strip('[]'), sep=' ') for key in openai_graph.topological_embedding]

from sklearn.decomposition import PCA

pca = PCA(n_components = 2 )
top_embed = pca.fit_transform(pd.DataFrame(openai_graph['topological_embedding'].tolist()))

plt.scatter(top_embed.transpose()[0], top_embed.transpose()[1])
plt.show()

sns.kdeplot(np.array(openai_graph['topological_embedding'].tolist()).flatten())

In [None]:

# Create a dictionary for fast lookup
embedding_dict = openai_graph.set_index('id')['topological_embedding'].to_dict()

# Initialize feature matrices
feature_length = 1024
X_openai = np.empty((len(train_df_1), feature_length), dtype='float32')
X_openai_test = np.empty((len(test), feature_length), dtype='float32')

# Function to get concatenated vectors
def get_concatenated_vector(row):
    drug_vector = embedding_dict[row['source']]
    disease_vector = embedding_dict[row['target']]
    return np.concatenate([drug_vector, disease_vector])

# Apply the function to each row of train_df_1
X_openai = np.vstack(train_df_1.apply(get_concatenated_vector, axis=1))

# Convert target variable to numpy array
y_openai = train_df_1['y'].to_numpy()

# Reset index for test DataFrame
test = test.reset_index(drop=True)

X_openai_test = np.vstack(test.apply(get_concatenated_vector, axis=1))

y_openai_test = test['y'].to_numpy()

In [None]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_openai, y_openai)

y_openai_pred = xgb.predict(X_openai_test)
y_openai_proba = xgb.predict_proba(X_openai_test)
joblib.dump(xgb, 'notebooks/e2e_troubleshoot/openai_e2e_2_xgb.pkl')

print('xgboost scores (not treat; treat; unknown)')
print(y_openai_proba)

openai_df_full_graph_1 = pd.DataFrame(y_openai_proba)
openai_df_full_graph_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']
openai_df_full_graph_1.to_csv('notebooks/e2e_troubleshoot/openai_e2e_2_xgb.csv')

In [None]:
openai_df_full_graph_1.columns=['not treat','treat', 'unknown']
openai_df_full_graph_1['GT']=test.y

In [None]:
openai_df_full_1=openai_df_full_graph_1
# Define a function to plot KDEs
def plot_kde(ax, data, column, gt, title):
    sns.kdeplot(data=data.loc[data.GT==gt], x=column, ax=ax, label='Non-topological Embeddings')
    sns.kdeplot(data=openai_df_full_graph_1.loc[openai_df_full_graph_1.GT==gt], x=column, ax=ax, label='Topological Embeddings')
    ax.set_title(title)
    ax.legend()

# Create a 3x3 grid for subplots
fig, axes = plt.subplots(3, 3, figsize=(18, 12))

# Plot True Positives
plot_kde(axes[0, 0], openai_df_full_1, 'not treat', 1, 'OpenAI - Distribution of not treat scores for True Positives')
plot_kde(axes[0, 1], openai_df_full_1, 'treat', 1, 'OpenAI - Distribution of treat scores for True Positives')
plot_kde(axes[0, 2], openai_df_full_1, 'unknown', 1, 'OpenAI - Distribution of unknown scores for True Positives')

# Plot True Negatives
plot_kde(axes[1, 0], openai_df_full_1, 'not treat', 0, 'OpenAI - Distribution of not treat scores for True Negatives')
plot_kde(axes[1, 1], openai_df_full_1, 'treat', 0, 'OpenAI - Distribution of treat scores for True Negatives')
plot_kde(axes[1, 2], openai_df_full_1, 'unknown', 0, 'OpenAI - Distribution of unknown scores for True Negatives')

# Plot True Unknown (GT == 2)
plot_kde(axes[2, 0], openai_df_full_1, 'not treat', 2, 'OpenAI - Distribution of not treat scores for True Unknown')
plot_kde(axes[2, 1], openai_df_full_1, 'treat', 2, 'OpenAI - Distribution of treat scores for True Unknown')
plot_kde(axes[2, 2], openai_df_full_1, 'unknown', 2, 'OpenAI - Distribution of unknown scores for True Unknown')


# Adjust layout
plt.tight_layout()
plt.show()

In [None]:
from sklearn.metrics import f1_score
#from utils import *

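# Predicted class = arg-max over the raw class probabilities.
# (Rounding first would send every row with no score above 0.5 to 'not treat'.)
score_cols = ['not treat', 'treat', 'unknown']
openai_df_full_graph_1['PT'] = openai_df_full_graph_1[score_cols].idxmax(axis=1).map({'not treat': 0, 'treat': 1, 'unknown': 2})

In [None]:
# Macro F1 over the three classes; scikit-learn expects (y_true, y_pred)
f1_score(openai_df_full_graph_1['GT'], openai_df_full_graph_1['PT'], average='macro')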
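
When class probabilities are mapped to a predicted label, rounding the scores before taking the arg-max can give a different label than taking the arg-max of the raw scores. A minimal sketch with made-up probabilities (not taken from any of the runs) illustrates the difference; the array and variable names are hypothetical:

```python
import numpy as np

# Made-up class probabilities in the order (not treat, treat, unknown).
proba = np.array([
    [0.20, 0.45, 0.35],  # no class above 0.5
    [0.10, 0.80, 0.10],  # one clearly dominant class
    [0.30, 0.30, 0.40],  # no class above 0.5
])

# Round first, then arg-max: rows without a score above 0.5 become all zeros,
# and arg-max of an all-zero row returns the first index (0 = not treat).
rounded_then_argmax = np.round(proba).argmax(axis=1)

# Arg-max on the raw probabilities picks the most likely class directly.
plain_argmax = proba.argmax(axis=1)

print(rounded_then_argmax)  # [0 1 0]
print(plain_argmax)         # [1 1 2]
```

The two approaches agree only on rows where one class scores above 0.5, which matters for a three-class model where the top score is often lower than that.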

In [None]:
N = 1000

other_args = {'drug_nodes':nodes_df_drugs, 'disease_nodes':nodes_df_disease, 'emb_dict':embedding_dict, 
              'portion':1, 'load_bar':False} 

# Compute MRR and Hits@k on the test pairs, excluding the 'unknown' class (y == 2)
test_known = test[test.y != 2]
mrr_tmp = give_mrr(N, xgb, test_known, **other_args)
hit_metrics = []
for k in [1, 3, 5]:
    hit_metrics.append(give_hitk(k, N, xgb, test_known, **other_args))
print(hit_metrics)
print(mrr_tmp)
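
`give_mrr` and `give_hitk` come from the project's utilities and are not shown in this notebook. As a rough reference for what such metrics usually compute, here is a minimal sketch, assuming each test pair has already been given a 1-based rank among its candidates. The function names and the example ranks are hypothetical and are not the project's implementation:

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test pairs (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test pairs whose rank is k or better."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Made-up ranks for four test pairs.
example_ranks = [1, 2, 4, 10]

example_mrr = mean_reciprocal_rank(example_ranks)
example_hits = [hits_at_k(example_ranks, k) for k in (1, 3, 5)]

print(example_mrr)   # 0.4625
print(example_hits)  # [0.25, 0.5, 0.75]
```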

# 20240821

In [None]:

openai_graph = pd.read_csv('notebooks/topological_embeddings_20240821.csv', index_col=0)

openai_graph['topological_embedding'] = [np.fromstring(key.strip('[]'), sep=' ') for key in openai_graph.topological_embedding]

from sklearn.decomposition import PCA

pca = PCA(n_components = 2 )
top_embed = pca.fit_transform(pd.DataFrame(openai_graph['topological_embedding'].tolist()))

plt.scatter(top_embed.transpose()[0], top_embed.transpose()[1])
plt.show()

#sns.kdeplot(np.array(openai_graph['topological_embedding'].tolist()).flatten())
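
Before running PCA it can help to confirm that every parsed embedding has the same length. A minimal sketch, assuming the CSV stores each vector as a bracketed, whitespace-separated string as in the cell above; `parse_embedding` and the sample strings are hypothetical and not part of the notebook:

```python
import numpy as np


def parse_embedding(text):
    """Parse a string such as '[0.1 0.2 0.3]' into a float32 vector."""
    return np.array([float(x) for x in text.strip('[]').split()], dtype='float32')


# Made-up examples, including a line break inside the brackets.
raw_vectors = ['[0.1 0.2 0.3]', '[ 1.0\n 2.0 3.0 ]']

vectors = [parse_embedding(t) for t in raw_vectors]

# All embeddings should share one dimensionality before stacking.
dims = {v.shape[0] for v in vectors}
if len(dims) != 1:
    raise ValueError(f'inconsistent embedding lengths: {sorted(dims)}')

matrix = np.vstack(vectors)
print(matrix.shape)  # (2, 3)
```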

In [None]:

# Create a dictionary for fast lookup
embedding_dict = openai_graph.set_index('id')['topological_embedding'].to_dict()

# Each feature vector is the drug embedding concatenated with the disease embedding;
# the feature matrices themselves are built with np.vstack below.

# Function to get concatenated vectors
def get_concatenated_vector(row):
    drug_vector = embedding_dict[row['source']]
    disease_vector = embedding_dict[row['target']]
    return np.concatenate([drug_vector, disease_vector])

# Apply the function to each row of train_df_1
X_openai = np.vstack(train_df_1.apply(get_concatenated_vector, axis=1))

# Convert target variable to numpy array
y_openai = train_df_1['y'].to_numpy()

# Reset index for test DataFrame
test = test.reset_index(drop=True)

X_openai_test = np.vstack(test.apply(get_concatenated_vector, axis=1))

y_openai_test = test['y'].to_numpy()

In [None]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_openai, y_openai)

y_openai_pred = xgb.predict(X_openai_test)  # hard class labels; probabilities are in y_openai_proba
y_openai_proba = xgb.predict_proba(X_openai_test)
joblib.dump(xgb, 'notebooks/e2e_troubleshoot/openai_e2e_3_xgb.pkl')

print('xgboost scores (not treat; treat; unknown)')
print(y_openai_proba)

openai_df_full_graph_1 = pd.DataFrame(y_openai_proba)
openai_df_full_graph_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']
openai_df_full_graph_1.to_csv('notebooks/e2e_troubleshoot/openai_e2e_3_xgb.csv')

In [None]:
openai_df_full_graph_1.columns=['not treat','treat', 'unknown']
openai_df_full_graph_1['GT']=test.y

In [None]:
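# NOTE: openai_df_full_1 is only an alias of openai_df_full_graph_1 here, so the two KDE
# curves in each panel coincide; point it at the non-topological scores to compare them.
openai_df_full_1 = openai_df_full_graph_1
# Define a function to plot KDEs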
def plot_kde(ax, data, column, gt, title):
    sns.kdeplot(data=data.loc[data.GT==gt], x=column, ax=ax, label='Non-topological Embeddings')
    sns.kdeplot(data=openai_df_full_graph_1.loc[openai_df_full_graph_1.GT==gt], x=column, ax=ax, label='Topological Embeddings')
    ax.set_title(title)
    ax.legend()

# Create a 3x3 grid for subplots
fig, axes = plt.subplots(3, 3, figsize=(18, 12))

# Plot True Positives
plot_kde(axes[0, 0], openai_df_full_1, 'not treat', 1, 'OpenAI - Distribution of not treat scores for True Positives')
plot_kde(axes[0, 1], openai_df_full_1, 'treat', 1, 'OpenAI - Distribution of treat scores for True Positives')
plot_kde(axes[0, 2], openai_df_full_1, 'unknown', 1, 'OpenAI - Distribution of unknown scores for True Positives')

# Plot True Negatives
plot_kde(axes[1, 0], openai_df_full_1, 'not treat', 0, 'OpenAI - Distribution of not treat scores for True Negatives')
plot_kde(axes[1, 1], openai_df_full_1, 'treat', 0, 'OpenAI - Distribution of treat scores for True Negatives')
plot_kde(axes[1, 2], openai_df_full_1, 'unknown', 0, 'OpenAI - Distribution of unknown scores for True Negatives')

# Plot True Unknown (GT == 2)
plot_kde(axes[2, 0], openai_df_full_1, 'not treat', 2, 'OpenAI - Distribution of not treat scores for True Unknown')
plot_kde(axes[2, 1], openai_df_full_1, 'treat', 2, 'OpenAI - Distribution of treat scores for True Unknown')
plot_kde(axes[2, 2], openai_df_full_1, 'unknown', 2, 'OpenAI - Distribution of unknown scores for True Unknown')


# Adjust layout
plt.tight_layout()
plt.show()

In [None]:
from sklearn.metrics import f1_score
#from utils import *

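# Predicted class = arg-max over the raw class probabilities.
# (Rounding first would send every row with no score above 0.5 to 'not treat'.)
score_cols = ['not treat', 'treat', 'unknown']
openai_df_full_graph_1['PT'] = openai_df_full_graph_1[score_cols].idxmax(axis=1).map({'not treat': 0, 'treat': 1, 'unknown': 2})

In [None]:
# Macro F1 over the three classes; scikit-learn expects (y_true, y_pred)
f1_score(openai_df_full_graph_1['GT'], openai_df_full_graph_1['PT'], average='macro')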

In [None]:
N = 1000

other_args = {'drug_nodes':nodes_df_drugs, 'disease_nodes':nodes_df_disease, 'emb_dict':embedding_dict, 
              'portion':1, 'load_bar':False} 

# Compute MRR and Hits@k on the test pairs, excluding the 'unknown' class (y == 2)
test_known = test[test.y != 2]
mrr_tmp = give_mrr(N, xgb, test_known, **other_args)
hit_metrics = []
for k in [1, 3, 5]:
    hit_metrics.append(give_hitk(k, N, xgb, test_known, **other_args))
print(hit_metrics)
print(mrr_tmp)

# node sidecar (reproducing 240820)

In [None]:
df = pd.read_csv('notebooks/e2e_troubleshoot/topological_embeddings_nodes_sidecar.csv', index_col=0)

In [None]:
df['topological_embedding'] = [np.fromstring(key.strip('[]'), sep=' ') for key in df.topological_embedding]

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components = 2 )
top_embed = pca.fit_transform(pd.DataFrame(df['topological_embedding'].tolist()))

plt.scatter(top_embed.transpose()[0], top_embed.transpose()[1])
plt.show()


In [None]:
sns.kdeplot(np.array(df['topological_embedding'].tolist()).flatten())

In [None]:

# Create a dictionary for fast lookup
embedding_dict = df.set_index('id')['topological_embedding'].to_dict()

# Each feature vector is the drug embedding concatenated with the disease embedding;
# the feature matrices themselves are built with np.vstack below.

# Function to get concatenated vectors
def get_concatenated_vector(row):
    # Look up the drug (source) and disease (target) embeddings for this pair
    drug_vector = embedding_dict[row['source']]
    disease_vector = embedding_dict[row['target']]
    return np.concatenate([drug_vector, disease_vector])

# Apply the function to each row of train_df_1
X_openai = np.vstack(train_df_1.apply(get_concatenated_vector, axis=1))

# Convert target variable to numpy array
y_openai = train_df_1['y'].to_numpy()

# Reset index for test DataFrame
test = test.reset_index(drop=True)

X_openai_test = np.vstack(test.apply(get_concatenated_vector, axis=1))

y_openai_test = test['y'].to_numpy()

In [None]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_openai, y_openai)

y_openai_pred = xgb.predict(X_openai_test)  # hard class labels; probabilities are in y_openai_proba
y_openai_proba = xgb.predict_proba(X_openai_test)
joblib.dump(xgb, 'notebooks/e2e_troubleshoot/openai_e2e_sidecar_xgb.pkl')

print('xgboost scores (not treat; treat; unknown)')
print(y_openai_proba)

openai_df_full_graph_1 = pd.DataFrame(y_openai_proba)
openai_df_full_graph_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']
openai_df_full_graph_1.to_csv('notebooks/e2e_troubleshoot/openai_e2e_sidecar_xgb.csv')

In [None]:
openai_df_full_graph_1.columns=['not treat','treat', 'unknown']
openai_df_full_graph_1['GT']=test.y

In [None]:
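# NOTE: openai_df_full_1 is only an alias of openai_df_full_graph_1 here, so the two KDE
# curves in each panel coincide; point it at the non-topological scores to compare them.
openai_df_full_1 = openai_df_full_graph_1
# Define a function to plot KDEs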
def plot_kde(ax, data, column, gt, title):
    sns.kdeplot(data=data.loc[data.GT==gt], x=column, ax=ax, label='Non-topological Embeddings')
    sns.kdeplot(data=openai_df_full_graph_1.loc[openai_df_full_graph_1.GT==gt], x=column, ax=ax, label='Topological Embeddings')
    ax.set_title(title)
    ax.legend()

# Create a 3x3 grid for subplots
fig, axes = plt.subplots(3, 3, figsize=(18, 12))

# Plot True Positives
plot_kde(axes[0, 0], openai_df_full_1, 'not treat', 1, 'OpenAI - Distribution of not treat scores for True Positives')
plot_kde(axes[0, 1], openai_df_full_1, 'treat', 1, 'OpenAI - Distribution of treat scores for True Positives')
plot_kde(axes[0, 2], openai_df_full_1, 'unknown', 1, 'OpenAI - Distribution of unknown scores for True Positives')

# Plot True Negatives
plot_kde(axes[1, 0], openai_df_full_1, 'not treat', 0, 'OpenAI - Distribution of not treat scores for True Negatives')
plot_kde(axes[1, 1], openai_df_full_1, 'treat', 0, 'OpenAI - Distribution of treat scores for True Negatives')
plot_kde(axes[1, 2], openai_df_full_1, 'unknown', 0, 'OpenAI - Distribution of unknown scores for True Negatives')

# Plot True Unknown (GT == 2)
plot_kde(axes[2, 0], openai_df_full_1, 'not treat', 2, 'OpenAI - Distribution of not treat scores for True Unknown')
plot_kde(axes[2, 1], openai_df_full_1, 'treat', 2, 'OpenAI - Distribution of treat scores for True Unknown')
plot_kde(axes[2, 2], openai_df_full_1, 'unknown', 2, 'OpenAI - Distribution of unknown scores for True Unknown')


# Adjust layout
plt.tight_layout()
plt.show()

In [None]:
from sklearn.metrics import f1_score
#from utils import *

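# Predicted class = arg-max over the raw class probabilities.
# (Rounding first would send every row with no score above 0.5 to 'not treat'.)
score_cols = ['not treat', 'treat', 'unknown']
openai_df_full_graph_1['PT'] = openai_df_full_graph_1[score_cols].idxmax(axis=1).map({'not treat': 0, 'treat': 1, 'unknown': 2})

In [None]:
# Macro F1 over the three classes; scikit-learn expects (y_true, y_pred)
f1_score(openai_df_full_graph_1['GT'], openai_df_full_graph_1['PT'], average='macro')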

In [None]:
N = 1000

other_args = {'drug_nodes':nodes_df_drugs, 'disease_nodes':nodes_df_disease, 'emb_dict':embedding_dict, 
              'portion':1, 'load_bar':False} 

# Compute MRR and Hits@k on the test pairs, excluding the 'unknown' class (y == 2)
test_known = test[test.y != 2]
mrr_tmp = give_mrr(N, xgb, test_known, **other_args)
hit_metrics = []
for k in [1, 3, 5]:
    hit_metrics.append(give_hitk(k, N, xgb, test_known, **other_args))
print(hit_metrics)
print(mrr_tmp)