*Author & point of contact: Piotr Kaniewski (piotr@everycure.org)*

# LLM Embedding Benchmark - ML classifiers

In this notebook I will conduct EDA of node embeddings generated with help of two LLMs and one non-LLM method:
* OpenAI - generic text-embedding-3-small model with concurrency 50, currently implemented in our pipeline
* PubMedBert - biomedical embedding model, used in KGML-xDTD publication
* Spacy - generic pipeline with a pretrained language model - *en_core_web_md, web data training*
* SciSpacy - biomedical pipeline with a pretrained language model - *en_core_sci_md, biomedical data training*

**Summary:**
In this notebook I am only using a stratified sample of RTX-KG2 nodes to train ML classifiers on both node attribute embeddings (first part) as well as GraphSage embeddings (second part). Therefore the classifiers trained here are trained on very little data but this is done on purpose to detect potential data leakage/compare quality of embeddings. We don't expect these models to perform well with so little data, but if they do perform well on some examples, that could indicate a potential data leakage.

By the end of this notebook I hope to be able to answer/understand better:
* the quality of embeddings before and after topological enrichment (in terms of downstream analysis i.e. how useful are topological embeddings).
* the impact of PCA on embedding's quality (in terms of downstream analysis i.e. do we lose information during PCA)
* potential data leakage.

*For exploratory analysis of the embeddings and sample data preparation refer to other notebooks*

In [1]:
import os
import time
import joblib
import subprocess
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE
from xgboost import XGBClassifier                                    

# Setting the root path and changing the directory
root_path = subprocess.check_output(['git', 'rev-parse', '--show-toplevel']).decode().strip()
os.chdir(Path(root_path) / 'pipelines' / 'matrix')

# Node Attribute Embeddings

## Load data
To further explore if there is data leakage issue we will train the ML classifier on nodes attributes only (non-topological) to explore whether there are obvious sign of data leakage (eg high performance). Note that because we are training on a super small set of drugs, our model performance will be very poor but if there is obvious data leakage, our model should catch it 

In [2]:
#load metadata and gt
sample_df = pd.read_parquet("gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/metadata_df.parquet").drop('index',axis=1)

#load gt 
gt = pd.read_csv('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/gt.csv')
result_gt = gt.loc[(gt.source.isin(sample_df.id)) & (gt.target.isin(sample_df.id))]

In [3]:
# load full embeddings (name + category encoded by embeddings)

#pubmedbert
pubmed_emb = np.array(joblib.load('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/pubmedbert/attribute/embed_full.joblib'))

# #openai
openai_emb = np.array(joblib.load('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/openai/attribute/embed_full.joblib')) 

# #spacy
spacy_emb = np.array(joblib.load('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/spacy/attribute/embed_full.joblib'))

# #scispacy
scispacy_emb = np.array(joblib.load('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/scispacy/attribute/embed_full.joblib'))

In [4]:
# load embeddings with no category/name (name only OR category only encoded by embeddings)

#pubmedbert
pubmed_emb_nonames = joblib.load('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/pubmedbert/attribute/embed_no_cat.joblib')

#openai
openai_emb_nocat = np.array(joblib.load('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/openai/attribute/embed_no_cat.joblib'))

#spacy
spacy_emb_nocat = joblib.load('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/spacy/attribute/embed_no_cat.joblib')

#scispacy
scispacy_emb_nocat = joblib.load('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/scispacy/attribute/embed_no_cat.joblib')

In [5]:
# transform the embeddings with PCA (to examine if it results in information loss)

# reduce dimensions
pca = PCA(n_components=100)

#pubmedbert
pubmed_emb_pca = pca.fit_transform(pubmed_emb)

# #openai
openai_emb_pca = pca.fit_transform(openai_emb)

# #spacy
spacy_emb_pca = pca.fit_transform(spacy_emb)

# #scispacy
scispacy_emb_pca = pca.fit_transform(scispacy_emb)

## Test data no.1 (rasagoline - parkinsons) 

Now prepare the train and test split. Following what we do in the pipeline, we do a random split, generate random drug-disease pairs and create a training and test datasets. Ideally we want a test set to contain drug-disease pair which can be a sign of data leakage. 

In [6]:
from sklearn.model_selection import train_test_split
#create sub-dfs
DRUG_TYPE = ['biolink:Drug', 'biolink:SmallMolecule']
DISEASE_TYPE = ['biolink:Disease', 'biolink:PhenotypicFeature', 'biolink:BehavioralFeature', 'biolink:DiseaseOrPhenotypicFeature']

#sample
sample_df_drugs = sample_df[sample_df['category'].isin(DRUG_TYPE)]
sample_df_disease = sample_df[sample_df['category'].isin(DISEASE_TYPE)]

#train test split 
train, test = train_test_split(result_gt, stratify=result_gt['y'], test_size=0.1, random_state=42)
train_tp_df = train[train['y']==1]
train_tp_df_drugs = train_tp_df['source'].reset_index(drop=True)
train_tp_df_diseases = train_tp_df['target'].reset_index(drop=True)
len_tp_tr = len(train_tp_df)
n_rep = 3

# create random drug-disease pairs
rand_drugs = sample_df_drugs['id'].sample(n_rep*len_tp_tr, replace=True, ignore_index = True, random_state = 42) # 42
rand_disease = sample_df_disease['id'].sample(n_rep*len_tp_tr, replace=True, ignore_index = True, random_state = 42) # 42
train_tp_diseases_copies = pd.concat([train_tp_df_diseases for _ in range(n_rep)], ignore_index = True)
train_tp_drugs_copies = pd.concat([train_tp_df_drugs for _ in range(n_rep)], ignore_index = True)
tmp_1 = pd.DataFrame({'source': rand_drugs, 'target': train_tp_diseases_copies, 'y': 2})
tmp_2 = pd.DataFrame({'source': train_tp_drugs_copies, 'target': rand_disease, 'y': 2})
un_data_1 =  pd.concat([tmp_1,tmp_2], ignore_index =True)
train_df_1 = pd.concat([train, un_data_1]).sample(frac=1).reset_index(drop=True)
test = test.reset_index(drop=True)

In [7]:
test

Unnamed: 0,source,target,y
0,CHEMBL.COMPOUND:CHEMBL2135534,MONDO:0005098,0
1,CHEMBL.COMPOUND:CHEMBL887,MONDO:0014742,1
2,CHEMBL.COMPOUND:CHEMBL1653,MONDO:0005420,0
3,CHEMBL.COMPOUND:CHEMBL2004297,MONDO:0012010,0


Here is our small test. THe most important is compound 2 which has quite known relationship between its disease (Parkinsons). Good 'bait' for data leakage

    CHEMBL.COMPOUND:CHEMBL887 - refers to Rasagline whhich is known parkinson treatment
    MONDO:0014742 - refers to Parkinsons

### Model predictions - full embeddings (node:name + node:category embedded)
Here I am running the predictions on full embeddings.

In [8]:
feature_length = 1536
#pubmed dataset
X_pubmed = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = pubmed_emb[drug_id]
    disease_vector = pubmed_emb[disease_id]
    X_pubmed[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_pubmed_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = pubmed_emb[drug_id]
    disease_vector = pubmed_emb[disease_id]
    X_pubmed_test[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed_test = test.y.to_numpy()

In [9]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_pubmed, y_pubmed)

y_pubmed_pred = xgb.predict_proba(X_pubmed_test)
y_pubmed_proba = xgb.predict_proba(X_pubmed_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_pubmed_proba)

pubmed_df_full_1 = pd.DataFrame(y_pubmed_proba)
pubmed_df_full_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_pubmed, y_pubmed)

y_pubmed_pred = rf_clf.predict_proba(X_pubmed_test)
y_pubmed_proba = rf_clf.predict_proba(X_pubmed_test)

print('random forest scores (not treat; treat; unknown)')
print(y_pubmed_proba)



xgboost scores (not treat; treat; unknown)
[[0.98698217 0.00420795 0.00880986]
 [0.48305073 0.0859496  0.43099967]
 [0.990802   0.00352506 0.00567299]
 [0.99360377 0.00304433 0.00335189]]
random forest scores (not treat; treat; unknown)
[[0.81 0.01 0.18]
 [0.68 0.03 0.29]
 [0.92 0.   0.08]
 [1.   0.   0.  ]]


In [10]:
#openai dataset
feature_length = 1536
X_openai = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = openai_emb[drug_id]
    disease_vector = openai_emb[disease_id]
    X_openai[index] = np.concatenate([drug_vector, disease_vector])
y_openai = train_df_1.y.to_numpy()

#popenai dataset - test
test = test.reset_index(drop=True)
X_openai_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = openai_emb[drug_id]
    disease_vector = openai_emb[disease_id]
    X_openai_test[index] = np.concatenate([drug_vector, disease_vector])
y_openai_test = test.y.to_numpy()

In [11]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_openai, y_openai)

y_openai_pred = xgb.predict_proba(X_openai_test)
y_openai_proba = xgb.predict_proba(X_openai_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_openai_proba)

openai_df_full_1 = pd.DataFrame(y_openai_proba)
openai_df_full_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_openai, y_openai)

y_openai_pred = rf_clf.predict_proba(X_openai_test)
y_openai_proba = rf_clf.predict_proba(X_openai_test)

print('random forest scores (not treat; treat; unknown)')
print(y_openai_proba)

xgboost scores (not treat; treat; unknown)
[[0.9790404  0.00405578 0.01690384]
 [0.16604231 0.00348112 0.8304766 ]
 [0.9885047  0.00310138 0.00839397]
 [0.99514955 0.00238267 0.00246782]]
random forest scores (not treat; treat; unknown)
[[0.86 0.01 0.13]
 [0.46 0.07 0.47]
 [0.94 0.01 0.05]
 [0.99 0.   0.01]]


In [12]:
#spacy dataset
feature_length = 600

X_spacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = spacy_emb[drug_id]
    disease_vector = spacy_emb[disease_id]
    X_spacy[index] = np.concatenate([drug_vector, disease_vector])
y_spacy = train_df_1.y.to_numpy()

#spacy dataset - test
test = test.reset_index(drop=True)
X_spacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = spacy_emb[drug_id]
    disease_vector = spacy_emb[disease_id]
    X_spacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_spacy_test = test.y.to_numpy()

In [13]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_spacy, y_spacy)

y_spacy_pred = xgb.predict_proba(X_spacy_test)
y_spacy_proba = xgb.predict_proba(X_spacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_spacy_proba)

spacy_df_full_1 = pd.DataFrame(y_spacy_proba)
spacy_df_full_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_spacy, y_spacy)

y_spacy_pred = rf_clf.predict_proba(X_spacy_test)
y_spacy_proba = rf_clf.predict_proba(X_spacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_spacy_proba)

xgboost scores (not treat; treat; unknown)
[[0.9924292  0.00209263 0.00547816]
 [0.31870016 0.06715064 0.6141492 ]
 [0.98664075 0.00762538 0.00573388]
 [0.9942736  0.00205151 0.00367488]]
random forest scores (not treat; treat; unknown)
[[0.95      0.01      0.04     ]
 [0.37      0.0797619 0.5502381]
 [0.91      0.04      0.05     ]
 [1.        0.        0.       ]]


In [14]:
feature_length = 400

#scispacy dataset
X_scispacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = scispacy_emb[drug_id]
    disease_vector = scispacy_emb[disease_id]
    X_scispacy[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy = train_df_1.y.to_numpy()

#scispacy dataset - test
test = test.reset_index(drop=True)
X_scispacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = scispacy_emb[drug_id]
    disease_vector = scispacy_emb[disease_id]
    X_scispacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy_test = test.y.to_numpy()

In [15]:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_scispacy, y_scispacy)

y_scispacy_pred = xgb.predict_proba(X_scispacy_test)
y_scispacy_proba = xgb.predict_proba(X_scispacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_scispacy_proba)

scispacy_df_full_1 = pd.DataFrame(y_scispacy_proba)
scispacy_df_full_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_scispacy, y_scispacy)

y_scispacy_pred = rf_clf.predict_proba(X_scispacy_test)
y_scispacy_proba = rf_clf.predict_proba(X_scispacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_scispacy_proba)

xgboost scores (not treat; treat; unknown)
[[0.9937278  0.00361075 0.00266142]
 [0.86867017 0.03667339 0.09465642]
 [0.95299196 0.00476423 0.04224384]
 [0.99514467 0.00269034 0.00216492]]
random forest scores (not treat; treat; unknown)
[[0.94       0.         0.06      ]
 [0.44       0.06288095 0.49711905]
 [0.92       0.02       0.06      ]
 [1.         0.         0.        ]]


Based on the result it seems like RASAGILINE and Parkinsons relationship is not clearly detected but all models (especially LLMs) give higher score to 'unknown' treat rather than ' not treat'. Bearing in mind we train the models on very little data, this is an indication of LLM embeddings knowing more than just a simple word2vec would know. There is surprisingly good performance of Spacy - this could be due to rasagline-parkinsons duo being widely known in the web (if you google rasagline, the pair will come up ) and spacy is trained on web scraped data.

### Model predictions - only-name embeddings (node:name embedded)
Running predictions on embeddings generated without category being encoded. If performance improves considerably, that would be a sign of data leakage which was not visible when we had names + categories embedded (category embedding would be a 'noise' in such scenario)

In [16]:
# overwriting for convenience
pubmed_emb = pubmed_emb_nocat
openai_emb = openai_emb_nocat 
scispacy_emb = scispacy_emb_nocat
spacy_emb = spacy_emb_nocat

In [17]:
feature_length = 1536
#pubmed dataset
X_pubmed = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = pubmed_emb[drug_id]
    disease_vector = pubmed_emb[disease_id]
    X_pubmed[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_pubmed_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = pubmed_emb[drug_id]
    disease_vector = pubmed_emb[disease_id]
    X_pubmed_test[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed_test = test.y.to_numpy()

In [18]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_pubmed, y_pubmed)

y_pubmed_pred = xgb.predict_proba(X_pubmed_test)
y_pubmed_proba = xgb.predict_proba(X_pubmed_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_pubmed_proba)

pubmed_df_name_1 = pd.DataFrame(y_pubmed_proba)
pubmed_df_name_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_pubmed, y_pubmed)

y_pubmed_pred = rf_clf.predict_proba(X_pubmed_test)
y_pubmed_proba = rf_clf.predict_proba(X_pubmed_test)

print('random forest scores (not treat; treat; unknown)')
print(y_pubmed_proba)


xgboost scores (not treat; treat; unknown)
[[0.9871694  0.00351547 0.00931519]
 [0.6987614  0.00548067 0.29575795]
 [0.9936732  0.00337781 0.00294896]
 [0.9956524  0.00284592 0.0015017 ]]
random forest scores (not treat; treat; unknown)
[[0.84 0.04 0.12]
 [0.59 0.06 0.35]
 [0.96 0.01 0.03]
 [0.99 0.01 0.  ]]


In [19]:
#openai dataset
feature_length = 1536
X_openai = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = openai_emb[drug_id]
    disease_vector = openai_emb[disease_id]
    X_openai[index] = np.concatenate([drug_vector, disease_vector])
y_openai = train_df_1.y.to_numpy()

#popenai dataset - test
test = test.reset_index(drop=True)
X_openai_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = openai_emb[drug_id]
    disease_vector = openai_emb[disease_id]
    X_openai_test[index] = np.concatenate([drug_vector, disease_vector])
y_openai_test = test.y.to_numpy()

In [20]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_openai, y_openai)

y_openai_pred = xgb.predict_proba(X_openai_test)
y_openai_proba = xgb.predict_proba(X_openai_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_pubmed_proba)

openai_df_name_1 = pd.DataFrame(y_openai_proba)
openai_df_name_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_openai, y_openai)

y_openai_pred = rf_clf.predict_proba(X_openai_test)
y_openai_proba = rf_clf.predict_proba(X_openai_test)

print('random forest scores (not treat; treat; unknown)')
print(y_openai_proba)

xgboost scores (not treat; treat; unknown)
[[0.84 0.04 0.12]
 [0.59 0.06 0.35]
 [0.96 0.01 0.03]
 [0.99 0.01 0.  ]]
random forest scores (not treat; treat; unknown)
[[0.87 0.01 0.12]
 [0.46 0.11 0.43]
 [0.96 0.01 0.03]
 [0.99 0.   0.01]]


In [21]:
#spacy dataset
feature_length = 600

X_spacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = spacy_emb[drug_id]
    disease_vector = spacy_emb[disease_id]
    X_spacy[index] = np.concatenate([drug_vector, disease_vector])
y_spacy = train_df_1.y.to_numpy()

#spacy dataset - test
test = test.reset_index(drop=True)
X_spacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = spacy_emb[drug_id]
    disease_vector = spacy_emb[disease_id]
    X_spacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_spacy_test = test.y.to_numpy()

In [22]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_spacy, y_spacy)

y_spacy_pred = xgb.predict_proba(X_spacy_test)
y_spacy_proba = xgb.predict_proba(X_spacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_spacy_proba)

spacy_df_name_1 = pd.DataFrame(y_spacy_proba)
spacy_df_name_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_spacy, y_spacy)

y_spacy_pred = rf_clf.predict_proba(X_spacy_test)
y_spacy_proba = rf_clf.predict_proba(X_spacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_spacy_proba)

xgboost scores (not treat; treat; unknown)
[[0.99211967 0.00371249 0.00416783]
 [0.41097662 0.28492334 0.30410007]
 [0.9855666  0.00586496 0.00856836]
 [0.9930848  0.0025752  0.00434003]]
random forest scores (not treat; treat; unknown)
[[0.92  0.045 0.035]
 [0.47  0.07  0.46 ]
 [0.87  0.02  0.11 ]
 [1.    0.    0.   ]]


In [23]:
feature_length = 400

#scispacy dataset
X_scispacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = scispacy_emb[drug_id]
    disease_vector = scispacy_emb[disease_id]
    X_scispacy[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy = train_df_1.y.to_numpy()

#scispacy dataset - test
test = test.reset_index(drop=True)
X_scispacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = scispacy_emb[drug_id]
    disease_vector = scispacy_emb[disease_id]
    X_scispacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy_test = test.y.to_numpy()

In [24]:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_scispacy, y_scispacy)

y_scispacy_pred = xgb.predict_proba(X_scispacy_test)
y_scispacy_proba = xgb.predict_proba(X_scispacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_scispacy_proba)

scispacy_df_name_1 = pd.DataFrame(y_scispacy_proba)
scispacy_df_name_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_scispacy, y_scispacy)

y_scispacy_pred = rf_clf.predict_proba(X_scispacy_test)
y_scispacy_proba = rf_clf.predict_proba(X_scispacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_scispacy_proba)



xgboost scores (not treat; treat; unknown)
[[0.98008156 0.00416898 0.01574952]
 [0.29274723 0.00828164 0.69897115]
 [0.99665844 0.00210099 0.00124054]
 [0.99593014 0.0015508  0.00251903]]
random forest scores (not treat; treat; unknown)
[[0.93       0.01333333 0.05666667]
 [0.37       0.0725     0.5575    ]
 [0.97       0.         0.03      ]
 [0.99       0.01       0.        ]]


Interestingly, for names only embeddings, we get much higher score for spacy models but slightly lower scores for LLM. Either way, potential leakage could be present there.

### Model predictions - PCA embeddings
Now training the classifiers on embeddings post PCA. This could give us an indication on whether attribute information is lost when we do PCA (if thats the case, they will all perform poorly)

In [25]:
# for convenience we overwrite the emb, this way we just copy paste the results
# can be automated 
pubmed_emb = pubmed_emb_pca 
openai_emb = openai_emb_pca
scispacy_emb = scispacy_emb_pca
spacy_emb = spacy_emb_pca

In [26]:
feature_length = 200
#pubmed dataset
X_pubmed = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = pubmed_emb[drug_id]
    disease_vector = pubmed_emb[disease_id]
    X_pubmed[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_pubmed_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = pubmed_emb[drug_id]
    disease_vector = pubmed_emb[disease_id]
    X_pubmed_test[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed_test = test.y.to_numpy()

In [27]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_pubmed, y_pubmed)

y_pubmed_pred = xgb.predict_proba(X_pubmed_test)
y_pubmed_proba = xgb.predict_proba(X_pubmed_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_pubmed_proba)

pubmed_df_full_pca_1 = pd.DataFrame(y_pubmed_proba)
pubmed_df_full_pca_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_pubmed, y_pubmed)

y_pubmed_pred = rf_clf.predict_proba(X_pubmed_test)
y_pubmed_proba = rf_clf.predict_proba(X_pubmed_test)

print('random forest scores (not treat; treat; unknown)')
print(y_pubmed_proba)



xgboost scores (not treat; treat; unknown)
[[0.98710895 0.00344157 0.00944951]
 [0.9525908  0.00661887 0.04079038]
 [0.9943638  0.00283103 0.00280523]
 [0.9954951  0.00185501 0.00264995]]
random forest scores (not treat; treat; unknown)
[[0.89 0.   0.11]
 [0.52 0.07 0.41]
 [0.94 0.01 0.05]
 [1.   0.   0.  ]]


In [28]:
#openai dataset
feature_length = 200
X_openai = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = openai_emb[drug_id]
    disease_vector = openai_emb[disease_id]
    X_openai[index] = np.concatenate([drug_vector, disease_vector])
y_openai = train_df_1.y.to_numpy()

#popenai dataset - test
test = test.reset_index(drop=True)
X_openai_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = openai_emb[drug_id]
    disease_vector = openai_emb[disease_id]
    X_openai_test[index] = np.concatenate([drug_vector, disease_vector])
y_openai_test = test.y.to_numpy()

In [29]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_openai, y_openai)

y_openai_pred = xgb.predict_proba(X_openai_test)
y_openai_proba = xgb.predict_proba(X_openai_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_pubmed_proba)

openai_df_full_pca_1 = pd.DataFrame(y_openai_proba)
openai_df_full_pca_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_openai, y_openai)

y_openai_pred = rf_clf.predict_proba(X_openai_test)
y_openai_proba = rf_clf.predict_proba(X_openai_test)

print('random forest scores (not treat; treat; unknown)')
print(y_openai_proba)

xgboost scores (not treat; treat; unknown)
[[0.89 0.   0.11]
 [0.52 0.07 0.41]
 [0.94 0.01 0.05]
 [1.   0.   0.  ]]
random forest scores (not treat; treat; unknown)
[[0.9  0.01 0.09]
 [0.5  0.07 0.43]
 [0.9  0.01 0.09]
 [0.95 0.01 0.04]]


In [30]:
#spacy dataset
feature_length = 200

X_spacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = spacy_emb[drug_id]
    disease_vector = spacy_emb[disease_id]
    X_spacy[index] = np.concatenate([drug_vector, disease_vector])
y_spacy = train_df_1.y.to_numpy()

#spacy dataset - test
test = test.reset_index(drop=True)
X_spacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = spacy_emb[drug_id]
    disease_vector = spacy_emb[disease_id]
    X_spacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_spacy_test = test.y.to_numpy()

In [31]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_spacy, y_spacy)

y_spacy_pred = xgb.predict_proba(X_spacy_test)
y_spacy_proba = xgb.predict_proba(X_spacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_spacy_proba)

spacy_df_full_pca_1 = pd.DataFrame(y_spacy_proba)
spacy_df_full_pca_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_spacy, y_spacy)

y_spacy_pred = rf_clf.predict_proba(X_spacy_test)
y_spacy_proba = rf_clf.predict_proba(X_spacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_spacy_proba)

xgboost scores (not treat; treat; unknown)
[[0.99077356 0.00338323 0.00584328]
 [0.13399236 0.02303956 0.8429681 ]
 [0.9733611  0.01523507 0.01140388]
 [0.99457705 0.0026587  0.00276429]]
random forest scores (not treat; treat; unknown)
[[0.97       0.01       0.02      ]
 [0.35       0.11166667 0.53833333]
 [0.92       0.         0.08      ]
 [1.         0.         0.        ]]


In [32]:
feature_length = 200

#scispacy dataset
X_scispacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = scispacy_emb[drug_id]
    disease_vector = scispacy_emb[disease_id]
    X_scispacy[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy = train_df_1.y.to_numpy()

#scispacy dataset - test
test = test.reset_index(drop=True)
X_scispacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = scispacy_emb[drug_id]
    disease_vector = scispacy_emb[disease_id]
    X_scispacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy_test = test.y.to_numpy()

In [33]:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_scispacy, y_scispacy)

y_scispacy_pred = xgb.predict_proba(X_scispacy_test)
y_scispacy_proba = xgb.predict_proba(X_scispacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_scispacy_proba)

scispacy_df_full_pca_1 = pd.DataFrame(y_scispacy_proba)
scispacy_df_full_pca_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_scispacy, y_scispacy)

y_scispacy_pred = rf_clf.predict_proba(X_scispacy_test)
y_scispacy_proba = rf_clf.predict_proba(X_scispacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_scispacy_proba)

xgboost scores (not treat; treat; unknown)
[[0.99041253 0.00226776 0.00731974]
 [0.31073108 0.03859127 0.6506777 ]
 [0.973957   0.0101897  0.01585323]
 [0.99363625 0.00306529 0.00329845]]
random forest scores (not treat; treat; unknown)
[[0.97       0.0025     0.0275    ]
 [0.46       0.03569048 0.50430952]
 [0.91       0.015      0.075     ]
 [1.         0.         0.        ]]


### Summary - test data no.1
Comparing XGBoost results for 1) Full Embeddings 2) Only Name Embeddings 3) Post-PCA Full Embeddings

In [34]:
multi_index = pd.MultiIndex.from_arrays([['pubmed_full', 'pubmed_full', 'pubmed_full'],['not treat','treat','unknown']])
pubmed_df_full_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['pubmed_name', 'pubmed_name', 'pubmed_name'],['not treat','treat','unknown']])
pubmed_df_name_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['pubmed_pca', 'pubmed_pca', 'pubmed_pca'],['not treat','treat','unknown']])
pubmed_df_full_pca_1.columns = multi_index

pd.concat([pubmed_df_full_1, pubmed_df_name_1, pubmed_df_full_pca_1], axis=1)

Unnamed: 0_level_0,pubmed_full,pubmed_full,pubmed_full,pubmed_name,pubmed_name,pubmed_name,pubmed_pca,pubmed_pca,pubmed_pca
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.986982,0.004208,0.00881,0.987169,0.003515,0.009315,0.987109,0.003442,0.00945
1,0.483051,0.08595,0.431,0.698761,0.005481,0.295758,0.952591,0.006619,0.04079
2,0.990802,0.003525,0.005673,0.993673,0.003378,0.002949,0.994364,0.002831,0.002805
3,0.993604,0.003044,0.003352,0.995652,0.002846,0.001502,0.995495,0.001855,0.00265


In [35]:
multi_index = pd.MultiIndex.from_arrays([['openai_full', 'openai_full', 'openai_full'],['not treat','treat','unknown']])
openai_df_full_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['openai_name', 'openai_name', 'openai_name'],['not treat','treat','unknown']])
openai_df_name_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['openai_pca', 'openai_pca', 'openai_pca'],['not treat','treat','unknown']])
openai_df_full_pca_1.columns = multi_index

pd.concat([openai_df_full_1, openai_df_name_1, openai_df_full_pca_1], axis=1)

Unnamed: 0_level_0,openai_full,openai_full,openai_full,openai_name,openai_name,openai_name,openai_pca,openai_pca,openai_pca
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.97904,0.004056,0.016904,0.988741,0.004184,0.007075,0.97606,0.004588,0.019352
1,0.166042,0.003481,0.830477,0.284159,0.013867,0.701974,0.410038,0.014051,0.575911
2,0.988505,0.003101,0.008394,0.989175,0.00273,0.008095,0.986033,0.005686,0.00828
3,0.99515,0.002383,0.002468,0.993939,0.003236,0.002826,0.991365,0.002459,0.006176


In [36]:
multi_index = pd.MultiIndex.from_arrays([['spacy_full', 'spacy_full', 'spacy_full'],['not treat','treat','unknown']])
spacy_df_full_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['spacy_name', 'spacy_name', 'spacy_name'],['not treat','treat','unknown']])
spacy_df_name_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['spacy_pca', 'spacy_pca', 'spacy_pca'],['not treat','treat','unknown']])
spacy_df_full_pca_1.columns = multi_index


pd.concat([spacy_df_full_1, spacy_df_name_1, spacy_df_full_pca_1], axis=1)

Unnamed: 0_level_0,spacy_full,spacy_full,spacy_full,spacy_name,spacy_name,spacy_name,spacy_pca,spacy_pca,spacy_pca
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.992429,0.002093,0.005478,0.99212,0.003712,0.004168,0.990774,0.003383,0.005843
1,0.3187,0.067151,0.614149,0.410977,0.284923,0.3041,0.133992,0.02304,0.842968
2,0.986641,0.007625,0.005734,0.985567,0.005865,0.008568,0.973361,0.015235,0.011404
3,0.994274,0.002052,0.003675,0.993085,0.002575,0.00434,0.994577,0.002659,0.002764


In [37]:
multi_index = pd.MultiIndex.from_arrays([['scispacy_full', 'scispacy_full', 'scispacy_full'],['not treat','treat','unknown']])
scispacy_df_full_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['scispacy_name', 'scispacy_name', 'scispacy_name'],['not treat','treat','unknown']])
scispacy_df_name_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['scispacy_pca', 'scispacy_pca', 'scispacy_pca'],['not treat','treat','unknown']])
scispacy_df_full_pca_1.columns = multi_index

pd.concat([scispacy_df_full_1, scispacy_df_name_1, scispacy_df_full_pca_1], axis=1)

Unnamed: 0_level_0,scispacy_full,scispacy_full,scispacy_full,scispacy_name,scispacy_name,scispacy_name,scispacy_pca,scispacy_pca,scispacy_pca
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.993728,0.003611,0.002661,0.980082,0.004169,0.01575,0.990413,0.002268,0.00732
1,0.86867,0.036673,0.094656,0.292747,0.008282,0.698971,0.310731,0.038591,0.650678
2,0.952992,0.004764,0.042244,0.996658,0.002101,0.001241,0.973957,0.01019,0.015853
3,0.995145,0.00269,0.002165,0.99593,0.001551,0.002519,0.993636,0.003065,0.003298


**Conclusions:** 
* PCA seems to lead to information loss for LLMs as the performance of the classifier drops (however not completely) - could indicate that we should find more optimal way of reducing dimensions/better n_components
* Higher scores for spacy models, obtained for only-name embeddings, show the superiority of LLM complex architectures as they are able to capture the keywords better than non-LLM models (hence their score is relatively unchanged between full and name embeddings)
* it is difficult to say whether there is data leakage in those models or not as all of them seem to be more confident when it comes to predicting the ground truth positive. Further tests would be needed
* there is no clear 'superiority' of biomedical models vs non-biomedical models for embedding generation when looking at the classifiers performance

## Test no.2 (cloxotestosterone - prostate carcinoma) 
As it is possible that rasagline and parkinsons are an easy example (if you google rasagline, parkinsons is the first indication that you will see, and it has been approved for a long while), I will aim to find less 'obvious' example

In [38]:
#re-load embeddings as we previously overwrote them

#pubmedbert
pubmed_emb = np.array(joblib.load('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/pubmedbert/attribute/embed_full.joblib'))

# #openai
openai_emb = np.array(joblib.load('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/openai/attribute/embed_full.joblib')) 

# #spacy
spacy_emb = np.array(joblib.load('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/spacy/attribute/embed_full.joblib'))

# #scispacy
scispacy_emb = np.array(joblib.load('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/scispacy/attribute/embed_full.joblib'))

In [39]:
#create sub-dfs
DRUG_TYPE = ['biolink:Drug', 'biolink:SmallMolecule']
DISEASE_TYPE = ['biolink:Disease', 'biolink:PhenotypicFeature', 'biolink:BehavioralFeature', 'biolink:DiseaseOrPhenotypicFeature']

#sample
sample_df_drugs = sample_df[sample_df['category'].isin(DRUG_TYPE)]
sample_df_disease = sample_df[sample_df['category'].isin(DISEASE_TYPE)]

#train test split 
train, test = train_test_split(result_gt, stratify=result_gt['y'], test_size=0.1, random_state=2) # prev 42
train_tp_df = train[train['y']==1]
train_tp_df_drugs = train_tp_df['source'].reset_index(drop=True)
train_tp_df_diseases = train_tp_df['target'].reset_index(drop=True)
len_tp_tr = len(train_tp_df)
n_rep = 3

# create random drug-disease pairs
rand_drugs = sample_df_drugs['id'].sample(n_rep*len_tp_tr, replace=True, ignore_index = True, random_state = 2) # prev 42
rand_disease = sample_df_disease['id'].sample(n_rep*len_tp_tr, replace=True, ignore_index = True, random_state = 2) # 42
train_tp_diseases_copies = pd.concat([train_tp_df_diseases for _ in range(n_rep)], ignore_index = True)
train_tp_drugs_copies = pd.concat([train_tp_df_drugs for _ in range(n_rep)], ignore_index = True)
tmp_1 = pd.DataFrame({'source': rand_drugs, 'target': train_tp_diseases_copies, 'y': 2})
tmp_2 = pd.DataFrame({'source': train_tp_drugs_copies, 'target': rand_disease, 'y': 2})
un_data_1 =  pd.concat([tmp_1,tmp_2], ignore_index =True)
train_df_1 = pd.concat([train, un_data_1]).sample(frac=1).reset_index(drop=True)
test = test.reset_index(drop=True)

In [40]:
test

Unnamed: 0,source,target,y
0,CHEMBL.COMPOUND:CHEMBL2106514,MONDO:0005159,1
1,CHEMBL.COMPOUND:CHEMBL2106119,MONDO:0012010,0
2,CHEMBL.COMPOUND:CHEMBL2004297,MONDO:0007264,0
3,CHEMBL.COMPOUND:CHEMBL294199,MONDO:0007264,0


In [41]:
sample_df.loc[sample_df.id=='MONDO:0005159']

Unnamed: 0,id,name,category,description
19996,MONDO:0005159,prostate carcinoma,biolink:Disease,One of the most common malignant tumors afflic...


In [42]:
sample_df.loc[sample_df.id=='CHEMBL.COMPOUND:CHEMBL2106514']

Unnamed: 0,id,name,category,description
1658,CHEMBL.COMPOUND:CHEMBL2106514,CLOXOTESTOSTERONE,biolink:SmallMolecule,Testosterone is the most important androgen in...


In [43]:
print('CHEMBL.COMPOUND:CHEMBL2106514' in train_df_1.source)
print('MONDO:0005159' in train_df_1.target)

False
False


Another test case - cloxotestosterone for prostate carcinoma. Not as easy to google as rasagoline as in this case. It makes some sense from NLP perspective (testosterone in the name -> prostate, two words somewhat associated to each other) so we can see how good the models are in terms of 'medical knowledge'

### Model predictions - full embeddings (node:name + node:category embedded)

In [44]:
feature_length = 1536
#pubmed dataset
X_pubmed = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = pubmed_emb[drug_id]
    disease_vector = pubmed_emb[disease_id]
    X_pubmed[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_pubmed_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = pubmed_emb[drug_id]
    disease_vector = pubmed_emb[disease_id]
    X_pubmed_test[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed_test = test.y.to_numpy()

In [45]:
# xgboost
xgb = XGBClassifier(random_state = 15)
xgb.fit(X_pubmed, y_pubmed)

y_pubmed_pred = xgb.predict_proba(X_pubmed_test)
y_pubmed_proba = xgb.predict_proba(X_pubmed_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_pubmed_proba)

pubmed_df_full_2 = pd.DataFrame(y_pubmed_proba)
pubmed_df_full_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 15)
rf_clf.fit(X_pubmed, y_pubmed)

y_pubmed_pred = rf_clf.predict_proba(X_pubmed_test)
y_pubmed_proba = rf_clf.predict_proba(X_pubmed_test)

print('random forest scores (not treat; treat; unknown)')
print(y_pubmed_proba)



xgboost scores (not treat; treat; unknown)
[[0.02748222 0.59669477 0.37582305]
 [0.9960361  0.00200322 0.00196074]
 [0.99332196 0.00303311 0.00364488]
 [0.9892854  0.00447604 0.00623859]]
random forest scores (not treat; treat; unknown)
[[0.1  0.29 0.61]
 [0.93 0.02 0.05]
 [0.94 0.   0.06]
 [0.86 0.03 0.11]]


In [46]:
#openai dataset
feature_length = 1536
X_openai = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = openai_emb[drug_id]
    disease_vector = openai_emb[disease_id]
    X_openai[index] = np.concatenate([drug_vector, disease_vector])
y_openai = train_df_1.y.to_numpy()

#popenai dataset - test
test = test.reset_index(drop=True)
X_openai_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = openai_emb[drug_id]
    disease_vector = openai_emb[disease_id]
    X_openai_test[index] = np.concatenate([drug_vector, disease_vector])
y_openai_test = test.y.to_numpy()

In [47]:
# xgboost
xgb = XGBClassifier(random_state = 15)
xgb.fit(X_openai, y_openai)

y_openai_pred = xgb.predict_proba(X_openai_test)
y_openai_proba = xgb.predict_proba(X_openai_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_openai_proba)

openai_df_full_2 = pd.DataFrame(y_openai_proba)
openai_df_full_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 15)
rf_clf.fit(X_openai, y_openai)

y_openai_pred = rf_clf.predict_proba(X_openai_test)
y_openai_proba = rf_clf.predict_proba(X_openai_test)

print('random forest scores (not treat; treat; unknown)')
print(y_openai_proba)

xgboost scores (not treat; treat; unknown)
[[0.01654385 0.5000971  0.48335907]
 [0.99152595 0.00518447 0.00328961]
 [0.99324894 0.0044715  0.00227962]
 [0.990866   0.00431839 0.00481568]]
random forest scores (not treat; treat; unknown)
[[0.1  0.38 0.52]
 [0.95 0.01 0.04]
 [0.93 0.02 0.05]
 [0.84 0.04 0.12]]


In [48]:
#spacy dataset
feature_length = 600

X_spacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = spacy_emb[drug_id]
    disease_vector = spacy_emb[disease_id]
    X_spacy[index] = np.concatenate([drug_vector, disease_vector])
y_spacy = train_df_1.y.to_numpy()

#spacy dataset - test
test = test.reset_index(drop=True)
X_spacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = spacy_emb[drug_id]
    disease_vector = spacy_emb[disease_id]
    X_spacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_spacy_test = test.y.to_numpy()

In [49]:
# xgboost
xgb = XGBClassifier(random_state = 15)
xgb.fit(X_spacy, y_spacy)

y_spacy_pred = xgb.predict_proba(X_spacy_test)
y_spacy_proba = xgb.predict_proba(X_spacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_spacy_proba)

spacy_df_full_2 = pd.DataFrame(y_spacy_proba)
spacy_df_full_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 15)
rf_clf.fit(X_spacy, y_spacy)

y_spacy_pred = rf_clf.predict_proba(X_spacy_test)
y_spacy_proba = rf_clf.predict_proba(X_spacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_spacy_proba)

xgboost scores (not treat; treat; unknown)
[[0.0073759  0.9170792  0.07554487]
 [0.9914779  0.00127018 0.00725192]
 [0.9963756  0.00116073 0.00246363]
 [0.9963756  0.00116073 0.00246363]]
random forest scores (not treat; treat; unknown)
[[0.09       0.42066667 0.48933333]
 [1.         0.         0.        ]
 [0.97       0.01       0.02      ]
 [0.97       0.01       0.02      ]]


In [50]:
feature_length = 400

#scispacy dataset
X_scispacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = scispacy_emb[drug_id]
    disease_vector = scispacy_emb[disease_id]
    X_scispacy[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy = train_df_1.y.to_numpy()

#scispacy dataset - test
test = test.reset_index(drop=True)
X_scispacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = scispacy_emb[drug_id]
    disease_vector = scispacy_emb[disease_id]
    X_scispacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy_test = test.y.to_numpy()

In [51]:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# xgboost
xgb = XGBClassifier(random_state = 15)
xgb.fit(X_scispacy, y_scispacy)

y_scispacy_pred = xgb.predict_proba(X_scispacy_test)
y_scispacy_proba = xgb.predict_proba(X_scispacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_scispacy_proba)

scispacy_df_full_2 = pd.DataFrame(y_scispacy_proba)
scispacy_df_full_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 15)
rf_clf.fit(X_scispacy, y_scispacy)

y_scispacy_pred = rf_clf.predict_proba(X_scispacy_test)
y_scispacy_proba = rf_clf.predict_proba(X_scispacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_scispacy_proba)

xgboost scores (not treat; treat; unknown)
[[0.00207694 0.01543517 0.98248786]
 [0.99762934 0.0011899  0.00118082]
 [0.9965329  0.00124078 0.00222625]
 [0.9965329  0.00124078 0.00222625]]
random forest scores (not treat; treat; unknown)
[[0.09       0.15633333 0.75366667]
 [1.         0.         0.        ]
 [0.98       0.01       0.01      ]
 [0.98       0.01       0.01      ]]


For this pair, the models make much more accurate predictions, with up to 50% probability scores for treat when using LLMs. Here PubMedBERT seems to give the highest score while scispacy - the lowest. This could be another sign of data leakage as this pair is not as easy to find via google (however the names are partially associated with each other).

### Model predictions - only name embeddings (node:name embedded)

In [52]:
# lazy
pubmed_emb = pubmed_emb_nocat
openai_emb = openai_emb_nocat 
scispacy_emb = scispacy_emb_nocat
spacy_emb = spacy_emb_nocat

In [53]:
feature_length = 1536
#pubmed dataset
X_pubmed = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = pubmed_emb[drug_id]
    disease_vector = pubmed_emb[disease_id]
    X_pubmed[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_pubmed_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = pubmed_emb[drug_id]
    disease_vector = pubmed_emb[disease_id]
    X_pubmed_test[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed_test = test.y.to_numpy()

In [54]:
# xgboost
xgb = XGBClassifier(random_state = 15)
xgb.fit(X_pubmed, y_pubmed)

y_pubmed_pred = xgb.predict_proba(X_pubmed_test)
y_pubmed_proba = xgb.predict_proba(X_pubmed_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_pubmed_proba)

pubmed_df_name_2 = pd.DataFrame(y_pubmed_proba)
pubmed_df_name_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 15)
rf_clf.fit(X_pubmed, y_pubmed)

y_pubmed_pred = rf_clf.predict_proba(X_pubmed_test)
y_pubmed_proba = rf_clf.predict_proba(X_pubmed_test)

print('random forest scores (not treat; treat; unknown)')
print(y_pubmed_proba)


xgboost scores (not treat; treat; unknown)
[[0.01757862 0.92539567 0.05702576]
 [0.99484396 0.00308735 0.00206872]
 [0.99260736 0.00284035 0.00455226]
 [0.9919233  0.00250044 0.00557631]]
random forest scores (not treat; treat; unknown)
[[0.14 0.27 0.59]
 [0.98 0.   0.02]
 [0.89 0.02 0.09]
 [0.96 0.01 0.03]]


In [55]:
#openai dataset
feature_length = 1536
X_openai = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = openai_emb[drug_id]
    disease_vector = openai_emb[disease_id]
    X_openai[index] = np.concatenate([drug_vector, disease_vector])
y_openai = train_df_1.y.to_numpy()

#popenai dataset - test
test = test.reset_index(drop=True)
X_openai_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = openai_emb[drug_id]
    disease_vector = openai_emb[disease_id]
    X_openai_test[index] = np.concatenate([drug_vector, disease_vector])
y_openai_test = test.y.to_numpy()

In [56]:
# xgboost
xgb = XGBClassifier(random_state = 15)
xgb.fit(X_openai, y_openai)

y_openai_pred = xgb.predict_proba(X_openai_test)
y_openai_proba = xgb.predict_proba(X_openai_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_openai_proba)

openai_df_name_2 = pd.DataFrame(y_openai_proba)
openai_df_name_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 15)
rf_clf.fit(X_openai, y_openai)

y_openai_pred = rf_clf.predict_proba(X_openai_test)
y_openai_proba = rf_clf.predict_proba(X_openai_test)

print('random forest scores (not treat; treat; unknown)')
print(y_openai_proba)

xgboost scores (not treat; treat; unknown)
[[0.02750883 0.92066085 0.05183035]
 [0.99445194 0.00325369 0.0022944 ]
 [0.99299073 0.00254832 0.00446092]
 [0.985106   0.0059388  0.00895522]]
random forest scores (not treat; treat; unknown)
[[0.1  0.31 0.59]
 [0.95 0.01 0.04]
 [0.94 0.   0.06]
 [0.87 0.01 0.12]]


In [57]:
#spacy dataset
feature_length = 600

X_spacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = spacy_emb[drug_id]
    disease_vector = spacy_emb[disease_id]
    X_spacy[index] = np.concatenate([drug_vector, disease_vector])
y_spacy = train_df_1.y.to_numpy()

#spacy dataset - test
test = test.reset_index(drop=True)
X_spacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = spacy_emb[drug_id]
    disease_vector = spacy_emb[disease_id]
    X_spacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_spacy_test = test.y.to_numpy()

In [58]:
# xgboost
xgb = XGBClassifier(random_state = 15)
xgb.fit(X_spacy, y_spacy)

y_spacy_pred = xgb.predict_proba(X_spacy_test)
y_spacy_proba = xgb.predict_proba(X_spacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_spacy_proba)

spacy_df_name_2 = pd.DataFrame(y_spacy_proba)
spacy_df_name_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 15)
rf_clf.fit(X_spacy, y_spacy)

y_spacy_pred = rf_clf.predict_proba(X_spacy_test)
y_spacy_proba = rf_clf.predict_proba(X_spacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_spacy_proba)

xgboost scores (not treat; treat; unknown)
[[9.6887806e-03 9.4375628e-01 4.6554983e-02]
 [9.9292868e-01 1.6124324e-03 5.4589231e-03]
 [9.9745387e-01 9.5261657e-04 1.5935330e-03]
 [9.9745387e-01 9.5261657e-04 1.5935330e-03]]
random forest scores (not treat; treat; unknown)
[[0.07       0.40666667 0.52333333]
 [0.99       0.         0.01      ]
 [0.94       0.02       0.04      ]
 [0.94       0.02       0.04      ]]


In [59]:
feature_length = 400

#scispacy dataset
X_scispacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = scispacy_emb[drug_id]
    disease_vector = scispacy_emb[disease_id]
    X_scispacy[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy = train_df_1.y.to_numpy()

#scispacy dataset - test
test = test.reset_index(drop=True)
X_scispacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = scispacy_emb[drug_id]
    disease_vector = scispacy_emb[disease_id]
    X_scispacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy_test = test.y.to_numpy()

In [60]:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# xgboost
xgb = XGBClassifier(random_state = 15)
xgb.fit(X_scispacy, y_scispacy)

y_scispacy_pred = xgb.predict_proba(X_scispacy_test)
y_scispacy_proba = xgb.predict_proba(X_scispacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_scispacy_proba)

scispacy_df_name_2 = pd.DataFrame(y_scispacy_proba)
scispacy_df_name_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 15)
rf_clf.fit(X_scispacy, y_scispacy)

y_scispacy_pred = rf_clf.predict_proba(X_scispacy_test)
y_scispacy_proba = rf_clf.predict_proba(X_scispacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_scispacy_proba)



xgboost scores (not treat; treat; unknown)
[[0.02348455 0.06583104 0.9106844 ]
 [0.9956527  0.00187085 0.00247644]
 [0.99533564 0.00166252 0.00300181]
 [0.99486923 0.00226533 0.00286545]]
random forest scores (not treat; treat; unknown)
[[0.08 0.05 0.87]
 [1.   0.   0.  ]
 [0.94 0.04 0.02]
 [0.93 0.02 0.05]]


Once only names are present - the performance increases A LOT for LLm embeddings as well as spacy (but not scispacy).

### Model predictions - post-PCA full embeddings (node:name + node:category embedded)

In [61]:
# for convenience we overwrite the emb, this way we just copy paste the results
# can be automated 
pubmed_emb = pubmed_emb_pca 
openai_emb = openai_emb_pca
scispacy_emb = scispacy_emb_pca
spacy_emb = spacy_emb_pca

In [62]:
feature_length = 200
#pubmed dataset
X_pubmed = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = pubmed_emb[drug_id]
    disease_vector = pubmed_emb[disease_id]
    X_pubmed[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_pubmed_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = pubmed_emb[drug_id]
    disease_vector = pubmed_emb[disease_id]
    X_pubmed_test[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed_test = test.y.to_numpy()

In [63]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_pubmed, y_pubmed)

y_pubmed_pred = xgb.predict_proba(X_pubmed_test)
y_pubmed_proba = xgb.predict_proba(X_pubmed_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_pubmed_proba)

pubmed_df_full_pca_2 = pd.DataFrame(y_pubmed_proba)
pubmed_df_full_pca_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_pubmed, y_pubmed)

y_pubmed_pred = rf_clf.predict_proba(X_pubmed_test)
y_pubmed_proba = rf_clf.predict_proba(X_pubmed_test)

print('random forest scores (not treat; treat; unknown)')
print(y_pubmed_proba)



xgboost scores (not treat; treat; unknown)
[[0.04689866 0.40799338 0.54510796]
 [0.99352807 0.00157446 0.00489744]
 [0.9861966  0.00263397 0.01116944]
 [0.99351764 0.00271332 0.00376906]]
random forest scores (not treat; treat; unknown)
[[0.09 0.36 0.55]
 [0.98 0.   0.02]
 [0.9  0.03 0.07]
 [0.78 0.04 0.18]]


In [64]:
#openai dataset
feature_length = 200
X_openai = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = openai_emb[drug_id]
    disease_vector = openai_emb[disease_id]
    X_openai[index] = np.concatenate([drug_vector, disease_vector])
y_openai = train_df_1.y.to_numpy()

#popenai dataset - test
test = test.reset_index(drop=True)
X_openai_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = openai_emb[drug_id]
    disease_vector = openai_emb[disease_id]
    X_openai_test[index] = np.concatenate([drug_vector, disease_vector])
y_openai_test = test.y.to_numpy()

In [65]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_openai, y_openai)

y_openai_pred = xgb.predict_proba(X_openai_test)
y_openai_proba = xgb.predict_proba(X_openai_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_pubmed_proba)

openai_df_full_pca_2 = pd.DataFrame(y_openai_proba)
openai_df_full_pca_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_openai, y_openai)

y_openai_pred = rf_clf.predict_proba(X_openai_test)
y_openai_proba = rf_clf.predict_proba(X_openai_test)

print('random forest scores (not treat; treat; unknown)')
print(y_openai_proba)

xgboost scores (not treat; treat; unknown)
[[0.09 0.36 0.55]
 [0.98 0.   0.02]
 [0.9  0.03 0.07]
 [0.78 0.04 0.18]]
random forest scores (not treat; treat; unknown)
[[0.18 0.22 0.6 ]
 [0.95 0.   0.05]
 [0.89 0.03 0.08]
 [0.85 0.02 0.13]]


In [66]:
#spacy dataset
feature_length = 200

X_spacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = spacy_emb[drug_id]
    disease_vector = spacy_emb[disease_id]
    X_spacy[index] = np.concatenate([drug_vector, disease_vector])
y_spacy = train_df_1.y.to_numpy()

#spacy dataset - test
test = test.reset_index(drop=True)
X_spacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = spacy_emb[drug_id]
    disease_vector = spacy_emb[disease_id]
    X_spacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_spacy_test = test.y.to_numpy()

In [67]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_spacy, y_spacy)

y_spacy_pred = xgb.predict_proba(X_spacy_test)
y_spacy_proba = xgb.predict_proba(X_spacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_spacy_proba)

spacy_df_full_pca_2 = pd.DataFrame(y_spacy_proba)
spacy_df_full_pca_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_spacy, y_spacy)

y_spacy_pred = rf_clf.predict_proba(X_spacy_test)
y_spacy_proba = rf_clf.predict_proba(X_spacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_spacy_proba)

xgboost scores (not treat; treat; unknown)
[[0.01951226 0.7927984  0.18768929]
 [0.9974381  0.00113529 0.00142662]
 [0.99221134 0.00224067 0.00554804]
 [0.99221134 0.00224067 0.00554804]]
random forest scores (not treat; treat; unknown)
[[0.08       0.42541667 0.49458333]
 [1.         0.         0.        ]
 [0.98       0.         0.02      ]
 [0.98       0.         0.02      ]]


In [68]:
feature_length = 200

#scispacy dataset
X_scispacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = scispacy_emb[drug_id]
    disease_vector = scispacy_emb[disease_id]
    X_scispacy[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy = train_df_1.y.to_numpy()

#scispacy dataset - test
test = test.reset_index(drop=True)
X_scispacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_id = sample_df.loc[sample_df.id==drug].index[0]
    disease_id = sample_df.loc[sample_df.id==disease].index[0]
    drug_vector = scispacy_emb[drug_id]
    disease_vector = scispacy_emb[disease_id]
    X_scispacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy_test = test.y.to_numpy()

In [69]:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_scispacy, y_scispacy)

y_scispacy_pred = xgb.predict_proba(X_scispacy_test)
y_scispacy_proba = xgb.predict_proba(X_scispacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_scispacy_proba)

scispacy_df_full_pca_2 = pd.DataFrame(y_scispacy_proba)
scispacy_df_full_pca_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_scispacy, y_scispacy)

y_scispacy_pred = rf_clf.predict_proba(X_scispacy_test)
y_scispacy_proba = rf_clf.predict_proba(X_scispacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_scispacy_proba)

xgboost scores (not treat; treat; unknown)
[[5.3941538e-03 1.9255865e-02 9.7535002e-01]
 [9.9741119e-01 1.3282673e-03 1.2606047e-03]
 [9.9780482e-01 8.2794001e-04 1.3671808e-03]
 [9.9780482e-01 8.2794001e-04 1.3671808e-03]]
random forest scores (not treat; treat; unknown)
[[0.06       0.08916667 0.85083333]
 [1.         0.         0.        ]
 [0.97       0.         0.03      ]
 [0.97       0.         0.03      ]]


And again - once we apply PCA, the performance for LLM drops, while it is relatively unchanged for spacy/scispacy. This is another example showing how PCA is leading to partial information loss

### Summary - test data no.2

In [72]:
multi_index = pd.MultiIndex.from_arrays([['openai_full', 'openai_full', 'openai_full'],['not treat','treat','unknown']])
pubmed_df_full_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['openai_name', 'openai_name', 'openai_name'],['not treat','treat','unknown']])
pubmed_df_name_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['pubmed_pca', 'pubmed_pca', 'pubmed_pca'],['not treat','treat','unknown']])
pubmed_df_full_pca_2.columns = multi_index

pd.concat([pubmed_df_full_2, pubmed_df_name_2, pubmed_df_full_pca_2], axis=1)

Unnamed: 0_level_0,openai_full,openai_full,openai_full,openai_name,openai_name,openai_name,pubmed_pca,pubmed_pca,pubmed_pca
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.027482,0.596695,0.375823,0.017579,0.925396,0.057026,0.046899,0.407993,0.545108
1,0.996036,0.002003,0.001961,0.994844,0.003087,0.002069,0.993528,0.001574,0.004897
2,0.993322,0.003033,0.003645,0.992607,0.00284,0.004552,0.986197,0.002634,0.011169
3,0.989285,0.004476,0.006239,0.991923,0.0025,0.005576,0.993518,0.002713,0.003769


In [73]:
multi_index = pd.MultiIndex.from_arrays([['openai_full', 'openai_full', 'openai_full'],['not treat','treat','unknown']])
openai_df_full_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['openai_name', 'openai_name', 'openai_name'],['not treat','treat','unknown']])
openai_df_name_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['openai_pca', 'openai_pca', 'openai_pca'],['not treat','treat','unknown']])
openai_df_full_pca_2.columns = multi_index

pd.concat([openai_df_full_2, openai_df_name_2, openai_df_full_pca_2,], axis=1)

Unnamed: 0_level_0,openai_full,openai_full,openai_full,openai_name,openai_name,openai_name,openai_pca,openai_pca,openai_pca
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.016544,0.500097,0.483359,0.027509,0.920661,0.05183,0.016243,0.808951,0.174807
1,0.991526,0.005184,0.00329,0.994452,0.003254,0.002294,0.98881,0.006192,0.004998
2,0.993249,0.004472,0.00228,0.992991,0.002548,0.004461,0.990156,0.0025,0.007344
3,0.990866,0.004318,0.004816,0.985106,0.005939,0.008955,0.991547,0.004245,0.004208


In [74]:
multi_index = pd.MultiIndex.from_arrays([['spacy_full', 'spacy_full', 'spacy_full'],['not treat','treat','unknown']])
spacy_df_full_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['spacy_name', 'spacy_name', 'spacy_name'],['not treat','treat','unknown']])
spacy_df_name_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['spacy_pca', 'spacy_pca', 'spacy_pca'],['not treat','treat','unknown']])
spacy_df_full_pca_2.columns = multi_index

pd.concat([spacy_df_full_2, spacy_df_name_2, spacy_df_full_pca_2], axis=1)

Unnamed: 0_level_0,spacy_full,spacy_full,spacy_full,spacy_name,spacy_name,spacy_name,spacy_pca,spacy_pca,spacy_pca
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.007376,0.917079,0.075545,0.009689,0.943756,0.046555,0.019512,0.792798,0.187689
1,0.991478,0.00127,0.007252,0.992929,0.001612,0.005459,0.997438,0.001135,0.001427
2,0.996376,0.001161,0.002464,0.997454,0.000953,0.001594,0.992211,0.002241,0.005548
3,0.996376,0.001161,0.002464,0.997454,0.000953,0.001594,0.992211,0.002241,0.005548


In [75]:
multi_index = pd.MultiIndex.from_arrays([['scispacy_full', 'scispacy_full', 'scispacy_full'],['not treat','treat','unknown']])
scispacy_df_full_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['scispacy_name', 'scispacy_name', 'scispacy_name'],['not treat','treat','unknown']])
scispacy_df_name_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['scispacy_pca', 'scispacy_pca', 'scispacy_pca'],['not treat','treat','unknown']])
scispacy_df_full_pca_2.columns = multi_index

pd.concat([scispacy_df_full_2, scispacy_df_name_2, scispacy_df_full_pca_2], axis=1)

Unnamed: 0_level_0,scispacy_full,scispacy_full,scispacy_full,scispacy_name,scispacy_name,scispacy_name,scispacy_pca,scispacy_pca,scispacy_pca
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.002077,0.015435,0.982488,0.023485,0.065831,0.910684,0.005394,0.019256,0.97535
1,0.997629,0.00119,0.001181,0.995653,0.001871,0.002476,0.997411,0.001328,0.001261
2,0.996533,0.001241,0.002226,0.995336,0.001663,0.003002,0.997805,0.000828,0.001367
3,0.996533,0.001241,0.002226,0.994869,0.002265,0.002865,0.997805,0.000828,0.001367


Interestingly, it seems that spacy knows all along that cloxotosterone treats prostate carcinoma; while it is odd, the results post-pca clearly show that PCA leads to information loss (10% worse performance, which is also consistent with remaining model performance). Which is not the worst thing in this scenario as it might reduce the data leakage impact.

Both OpenAI and PubMedBERT are relatively confident about the pair when full node attributes are present and very confident when only name is present. This really good performance in my opinion is a sign of data leakage but this is where we also have data in a generic spacy model. Therefore it is challenging to enrich the embeddings while preventing data leakage.

## GraphSage Topological Embeddings
I used the LLM attribute embeddings as inputs in the pipeline (see 'prepare data notebook') to obtain graphsage embeddings following the same process as we apply in our pipeline. I will then train ML classifiers on those topological embeddings using the same test data to see if potential data leakage is more/less visible 


### Prepare graphsage input 

In [76]:
#pubmedbert
pubmed_graph = pd.read_parquet('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/pubmedbert/topological/pubmed_graphsage')

# #openai
openai_graph = pd.read_parquet('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/openai/topological/graphsage_parquet')

# #spacy
spacy_graph = pd.read_parquet('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/spacy/topological/graphsage_parquet')

# #scispacy
scispacy_graph = pd.read_parquet('gs://mtrx-us-central1-wg2-modeling-dev-storage/llm_embed_benchmark/subsample/scispacy/topological/graphsage_parquet')

## Test data no.1 (rasagoline -parkinsons)

In [77]:
from sklearn.model_selection import train_test_split
#create sub-dfs
DRUG_TYPE = ['biolink:Drug', 'biolink:SmallMolecule']
DISEASE_TYPE = ['biolink:Disease', 'biolink:PhenotypicFeature', 'biolink:BehavioralFeature', 'biolink:DiseaseOrPhenotypicFeature']

#sample
sample_df_drugs = sample_df[sample_df['category'].isin(DRUG_TYPE)]
sample_df_disease = sample_df[sample_df['category'].isin(DISEASE_TYPE)]

#train test split 
train, test = train_test_split(result_gt, stratify=result_gt['y'], test_size=0.1, random_state=42)
train_tp_df = train[train['y']==1]
train_tp_df_drugs = train_tp_df['source'].reset_index(drop=True)
train_tp_df_diseases = train_tp_df['target'].reset_index(drop=True)
len_tp_tr = len(train_tp_df)
n_rep = 3

# create random drug-disease pairs
rand_drugs = sample_df_drugs['id'].sample(n_rep*len_tp_tr, replace=True, ignore_index = True, random_state = 2) # 42
rand_disease = sample_df_disease['id'].sample(n_rep*len_tp_tr, replace=True, ignore_index = True, random_state = 2) # 42
train_tp_diseases_copies = pd.concat([train_tp_df_diseases for _ in range(n_rep)], ignore_index = True)
train_tp_drugs_copies = pd.concat([train_tp_df_drugs for _ in range(n_rep)], ignore_index = True)
tmp_1 = pd.DataFrame({'source': rand_drugs, 'target': train_tp_diseases_copies, 'y': 2})
tmp_2 = pd.DataFrame({'source': train_tp_drugs_copies, 'target': rand_disease, 'y': 2})
un_data_1 =  pd.concat([tmp_1,tmp_2], ignore_index =True)
train_df_1 = pd.concat([train, un_data_1]).sample(frac=1).reset_index(drop=True)
test = test.reset_index(drop=True)

In [78]:
test

Unnamed: 0,source,target,y
0,CHEMBL.COMPOUND:CHEMBL2135534,MONDO:0005098,0
1,CHEMBL.COMPOUND:CHEMBL887,MONDO:0014742,1
2,CHEMBL.COMPOUND:CHEMBL1653,MONDO:0005420,0
3,CHEMBL.COMPOUND:CHEMBL2004297,MONDO:0012010,0


### Train Classifiers (topological embeddings)

In [79]:
feature_length = 1024
#pubmed dataset
X_pubmed = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = pubmed_graph.loc[pubmed_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = pubmed_graph.loc[pubmed_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_pubmed[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_pubmed_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = pubmed_graph.loc[pubmed_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = pubmed_graph.loc[pubmed_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_pubmed_test[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed_test = test.y.to_numpy()

In [80]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_pubmed, y_pubmed)

y_pubmed_pred = xgb.predict_proba(X_pubmed_test)
y_pubmed_proba = xgb.predict_proba(X_pubmed_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_pubmed_proba)

pubmed_df_full_graph_1 = pd.DataFrame(y_pubmed_proba)
pubmed_df_full_graph_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_pubmed, y_pubmed)

y_pubmed_pred = rf_clf.predict_proba(X_pubmed_test)
y_pubmed_proba = rf_clf.predict_proba(X_pubmed_test)

print('random forest scores (not treat; treat; unknown)')
print(y_pubmed_proba)

xgboost scores (not treat; treat; unknown)
[[9.1741031e-01 4.1314219e-03 7.8458324e-02]
 [1.6583905e-02 4.4699982e-03 9.7894615e-01]
 [9.8740804e-01 9.8574779e-04 1.1606223e-02]
 [9.9618983e-01 3.6896710e-04 3.4412183e-03]]
random forest scores (not treat; treat; unknown)
[[0.78       0.         0.22      ]
 [0.09640115 0.10965188 0.79394697]
 [0.83       0.         0.17      ]
 [1.         0.         0.        ]]


In [81]:
feature_length = 1024
#pubmed dataset
X_openai = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = openai_graph.loc[openai_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = openai_graph.loc[openai_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_openai[index] = np.concatenate([drug_vector, disease_vector])
y_openai = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_openai_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = openai_graph.loc[openai_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = openai_graph.loc[openai_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_openai_test[index] = np.concatenate([drug_vector, disease_vector])
y_openai_test = test.y.to_numpy()

In [82]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_openai, y_openai)

y_openai_pred = xgb.predict_proba(X_openai_test)
y_openai_proba = xgb.predict_proba(X_openai_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_openai_proba)

openai_df_full_graph_1 = pd.DataFrame(y_openai_proba)
openai_df_full_graph_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_openai, y_openai)

y_openai_pred = rf_clf.predict_proba(X_openai_test)
y_openai_proba = rf_clf.predict_proba(X_openai_test)

print('random forest scores (not treat; treat; unknown)')
print(y_openai_proba)

xgboost scores (not treat; treat; unknown)
[[9.7668684e-01 1.2463948e-03 2.2066712e-02]
 [2.2621643e-02 1.7999855e-03 9.7557837e-01]
 [9.9696749e-01 5.2656047e-04 2.5059672e-03]
 [4.1288260e-01 4.0968424e-03 5.8302057e-01]]
random forest scores (not treat; treat; unknown)
[[0.88 0.   0.12]
 [0.12 0.01 0.87]
 [0.77 0.02 0.21]
 [0.89 0.01 0.1 ]]


In [83]:
feature_length = 1024
#pubmed dataset
X_spacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = spacy_graph.loc[spacy_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = spacy_graph.loc[spacy_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_spacy[index] = np.concatenate([drug_vector, disease_vector])
y_spacy = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_spacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = spacy_graph.loc[spacy_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = spacy_graph.loc[spacy_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_spacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_spacy_test = test.y.to_numpy()

In [84]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_spacy, y_spacy)

y_spacy_pred = xgb.predict_proba(X_spacy_test)
y_spacy_proba = xgb.predict_proba(X_spacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_spacy_proba)

spacy_df_full_graph_1 = pd.DataFrame(y_spacy_proba)
spacy_df_full_graph_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_spacy, y_spacy)

y_spacy_pred = rf_clf.predict_proba(X_spacy_test)
y_spacy_proba = rf_clf.predict_proba(X_spacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_spacy_proba)

xgboost scores (not treat; treat; unknown)
[[0.9826286  0.00350406 0.01386736]
 [0.14782725 0.39425674 0.45791605]
 [0.99558395 0.00234692 0.00206917]
 [0.97300154 0.00121575 0.02578276]]
random forest scores (not treat; treat; unknown)
[[0.63 0.01 0.36]
 [0.31 0.44 0.25]
 [0.95 0.02 0.03]
 [0.83 0.   0.17]]


In [85]:
feature_length = 1024
#pubmed dataset
X_scispacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = scispacy_graph.loc[scispacy_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = scispacy_graph.loc[scispacy_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_scispacy[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_scispacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = scispacy_graph.loc[scispacy_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = scispacy_graph.loc[scispacy_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_scispacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy_test = test.y.to_numpy()

In [86]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_scispacy, y_scispacy)

y_scispacy_pred = xgb.predict_proba(X_scispacy_test)
y_scispacy_proba = xgb.predict_proba(X_scispacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_spacy_proba)

scispacy_df_full_graph_1 = pd.DataFrame(y_scispacy_proba)
scispacy_df_full_graph_1.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_scispacy, y_scispacy)

y_scispacy_pred = rf_clf.predict_proba(X_scispacy_test)
y_scispacy_proba = rf_clf.predict_proba(X_scispacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_scispacy_proba)

xgboost scores (not treat; treat; unknown)
[[0.63 0.01 0.36]
 [0.31 0.44 0.25]
 [0.95 0.02 0.03]
 [0.83 0.   0.17]]
random forest scores (not treat; treat; unknown)
[[0.83 0.   0.17]
 [0.21 0.08 0.71]
 [0.96 0.01 0.03]
 [0.77 0.   0.23]]


### Summary

In [87]:
multi_index = pd.MultiIndex.from_arrays([['pubmed_full_attribute', 'pubmed_full_attribute', 'pubmed_full_attribute'],['not treat','treat','unknown']])
pubmed_df_full_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['pubmed_pca_attribute', 'pubmed_pca_attribute', 'pubmed_pca_attribute'],['not treat','treat','unknown']])
pubmed_df_full_pca_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['pubmed_full_topological', 'pubmed_full_topological', 'pubmed_full_topological'],['not treat','treat','unknown']])
pubmed_df_full_graph_1.columns = multi_index

pd.concat([pubmed_df_full_1, pubmed_df_full_pca_1, pubmed_df_full_graph_1], axis=1)

Unnamed: 0_level_0,pubmed_full_attribute,pubmed_full_attribute,pubmed_full_attribute,pubmed_pca_attribute,pubmed_pca_attribute,pubmed_pca_attribute,pubmed_full_topological,pubmed_full_topological,pubmed_full_topological
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.986982,0.004208,0.00881,0.987109,0.003442,0.00945,0.91741,0.004131,0.078458
1,0.483051,0.08595,0.431,0.952591,0.006619,0.04079,0.016584,0.00447,0.978946
2,0.990802,0.003525,0.005673,0.994364,0.002831,0.002805,0.987408,0.000986,0.011606
3,0.993604,0.003044,0.003352,0.995495,0.001855,0.00265,0.99619,0.000369,0.003441


In [88]:
multi_index = pd.MultiIndex.from_arrays([['openai_full_attribute', 'openai_full_attribute', 'openai_full_attribute'],['not treat','treat','unknown']])
openai_df_full_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['openai_pca_attribute', 'openai_pca_attribute', 'openai_pca_attribute'],['not treat','treat','unknown']])
openai_df_full_pca_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['openai_full_topological', 'openai_full_topological', 'openai_full_topological'],['not treat','treat','unknown']])
openai_df_full_graph_1.columns = multi_index

pd.concat([openai_df_full_1, openai_df_full_pca_1, openai_df_full_graph_1], axis=1)

Unnamed: 0_level_0,openai_full_attribute,openai_full_attribute,openai_full_attribute,openai_pca_attribute,openai_pca_attribute,openai_pca_attribute,openai_full_topological,openai_full_topological,openai_full_topological
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.97904,0.004056,0.016904,0.97606,0.004588,0.019352,0.976687,0.001246,0.022067
1,0.166042,0.003481,0.830477,0.410038,0.014051,0.575911,0.022622,0.0018,0.975578
2,0.988505,0.003101,0.008394,0.986033,0.005686,0.00828,0.996967,0.000527,0.002506
3,0.99515,0.002383,0.002468,0.991365,0.002459,0.006176,0.412883,0.004097,0.583021


In [90]:
multi_index = pd.MultiIndex.from_arrays([['spacy_full_attribute', 'spacy_full_attribute', 'spacy_full_attribute'],['not treat','treat','unknown']])
spacy_df_full_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['spacy_pca_attribute', 'spacy_pca_attribute', 'spacy_pca_attribute'],['not treat','treat','unknown']])
spacy_df_full_pca_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['spacy_full_topological', 'spacy_full_topological', 'spacy_full_topological'],['not treat','treat','unknown']])
spacy_df_full_graph_1.columns = multi_index

pd.concat([spacy_df_full_1,spacy_df_full_pca_1, spacy_df_full_graph_1], axis=1)

Unnamed: 0_level_0,spacy_full_attribute,spacy_full_attribute,spacy_full_attribute,spacy_pca_attribute,spacy_pca_attribute,spacy_pca_attribute,spacy_full_topological,spacy_full_topological,spacy_full_topological
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.992429,0.002093,0.005478,0.990774,0.003383,0.005843,0.982629,0.003504,0.013867
1,0.3187,0.067151,0.614149,0.133992,0.02304,0.842968,0.147827,0.394257,0.457916
2,0.986641,0.007625,0.005734,0.973361,0.015235,0.011404,0.995584,0.002347,0.002069
3,0.994274,0.002052,0.003675,0.994577,0.002659,0.002764,0.973002,0.001216,0.025783


In [91]:
multi_index = pd.MultiIndex.from_arrays([['scispacy_full_attribute', 'scispacy_full_attribute', 'scispacy_full_attribute'],['not treat','treat','unknown']])
scispacy_df_full_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['scispacy_pca_attribute', 'scispacy_pca_attribute', 'scispacy_pca_attribute'],['not treat','treat','unknown']])
scispacy_df_full_pca_1.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['scispacy_full_topological', 'scispacy_full_topological', 'scispacy_full_topological'],['not treat','treat','unknown']])
scispacy_df_full_graph_1.columns = multi_index

pd.concat([scispacy_df_full_1, scispacy_df_full_pca_1, scispacy_df_full_graph_1], axis=1)

Unnamed: 0_level_0,scispacy_full_attribute,scispacy_full_attribute,scispacy_full_attribute,scispacy_pca_attribute,scispacy_pca_attribute,scispacy_pca_attribute,scispacy_full_topological,scispacy_full_topological,scispacy_full_topological
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.993728,0.003611,0.002661,0.990413,0.002268,0.00732,0.962316,0.002551,0.035133
1,0.86867,0.036673,0.094656,0.310731,0.038591,0.650678,0.004431,0.000976,0.994593
2,0.952992,0.004764,0.042244,0.973957,0.01019,0.015853,0.989819,0.001246,0.008935
3,0.995145,0.00269,0.002165,0.993636,0.003065,0.003298,0.98495,0.00887,0.00618


You can definitely see the topological enrichment effect - it increases the confidence of scispacy, OpenAI and PubMedBERT of unknown relationship. The only not so confident model is spacy for which treat score increased. This is where we could maybe see a benefit of biomedical models vs generic models (note that OpenAI is not biomedical but it has been trained on so much data it can be probably considered more biomedical than spacy/scispacy)

The other observation is that although upon PCA application performance of the classifiers drops, it is then 'caught up' by topological enrichment - therefore we could expect even higher performance with no PCA/more optimal dimensionality reduction?

## Test no.2 (cloxotestosterone - prostate carcinoma)

In [92]:
from sklearn.model_selection import train_test_split
#create sub-dfs
DRUG_TYPE = ['biolink:Drug', 'biolink:SmallMolecule']
DISEASE_TYPE = ['biolink:Disease', 'biolink:PhenotypicFeature', 'biolink:BehavioralFeature', 'biolink:DiseaseOrPhenotypicFeature']

#sample
sample_df_drugs = sample_df[sample_df['category'].isin(DRUG_TYPE)]
sample_df_disease = sample_df[sample_df['category'].isin(DISEASE_TYPE)]

#train test split 
train, test = train_test_split(result_gt, stratify=result_gt['y'], test_size=0.1, random_state=2)
train_tp_df = train[train['y']==1]
train_tp_df_drugs = train_tp_df['source'].reset_index(drop=True)
train_tp_df_diseases = train_tp_df['target'].reset_index(drop=True)
len_tp_tr = len(train_tp_df)
n_rep = 3

# create random drug-disease pairs
rand_drugs = sample_df_drugs['id'].sample(n_rep*len_tp_tr, replace=True, ignore_index = True, random_state = 2) # 42
rand_disease = sample_df_disease['id'].sample(n_rep*len_tp_tr, replace=True, ignore_index = True, random_state = 2) # 42
train_tp_diseases_copies = pd.concat([train_tp_df_diseases for _ in range(n_rep)], ignore_index = True)
train_tp_drugs_copies = pd.concat([train_tp_df_drugs for _ in range(n_rep)], ignore_index = True)
tmp_1 = pd.DataFrame({'source': rand_drugs, 'target': train_tp_diseases_copies, 'y': 2})
tmp_2 = pd.DataFrame({'source': train_tp_drugs_copies, 'target': rand_disease, 'y': 2})
un_data_1 =  pd.concat([tmp_1,tmp_2], ignore_index =True)
train_df_1 = pd.concat([train, un_data_1]).sample(frac=1).reset_index(drop=True)
test = test.reset_index(drop=True)

### Train Classifiers (topological embeddings)

In [93]:
feature_length = 1024
#pubmed dataset
X_pubmed = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = pubmed_graph.loc[pubmed_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = pubmed_graph.loc[pubmed_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_pubmed[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_pubmed_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = pubmed_graph.loc[pubmed_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = pubmed_graph.loc[pubmed_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_pubmed_test[index] = np.concatenate([drug_vector, disease_vector])
y_pubmed_test = test.y.to_numpy()

In [94]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_pubmed, y_pubmed)

y_pubmed_pred = xgb.predict_proba(X_pubmed_test)
y_pubmed_proba = xgb.predict_proba(X_pubmed_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_pubmed_proba)

pubmed_df_full_graph_2 = pd.DataFrame(y_pubmed_proba)
pubmed_df_full_graph_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_pubmed, y_pubmed)

y_pubmed_pred = rf_clf.predict_proba(X_pubmed_test)
y_pubmed_proba = rf_clf.predict_proba(X_pubmed_test)

print('random forest scores (not treat; treat; unknown)')
print(y_pubmed_proba)

xgboost scores (not treat; treat; unknown)
[[3.6921572e-02 7.3685247e-01 2.2622597e-01]
 [9.9292362e-01 1.9743734e-03 5.1020617e-03]
 [9.9106419e-01 6.9643505e-04 8.2394043e-03]
 [9.9470961e-01 2.1580290e-03 3.1323335e-03]]
random forest scores (not treat; treat; unknown)
[[0.14 0.43 0.43]
 [0.98 0.   0.02]
 [0.95 0.   0.05]
 [0.91 0.   0.09]]


In [95]:
feature_length = 1024
#pubmed dataset
X_openai = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = openai_graph.loc[openai_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = openai_graph.loc[openai_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_openai[index] = np.concatenate([drug_vector, disease_vector])
y_openai = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_openai_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = openai_graph.loc[openai_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = openai_graph.loc[openai_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_openai_test[index] = np.concatenate([drug_vector, disease_vector])
y_openai_test = test.y.to_numpy()

In [96]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_openai, y_openai)

y_openai_pred = xgb.predict_proba(X_openai_test)
y_openai_proba = xgb.predict_proba(X_openai_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_openai_proba)

openai_df_full_graph_2 = pd.DataFrame(y_openai_proba)
openai_df_full_graph_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_openai, y_openai)

y_openai_pred = rf_clf.predict_proba(X_openai_test)
y_openai_proba = rf_clf.predict_proba(X_openai_test)

print('random forest scores (not treat; treat; unknown)')
print(y_openai_proba)

xgboost scores (not treat; treat; unknown)
[[1.0820476e-01 8.7112164e-01 2.0673547e-02]
 [9.9785221e-01 5.1674782e-04 1.6311018e-03]
 [5.2885312e-01 1.7399462e-02 4.5374742e-01]
 [9.5633411e-01 2.9732803e-02 1.3933076e-02]]
random forest scores (not treat; treat; unknown)
[[0.54 0.14 0.32]
 [0.98 0.01 0.01]
 [0.44 0.03 0.53]
 [0.68 0.1  0.22]]


In [97]:
feature_length = 1024
#pubmed dataset
X_spacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = spacy_graph.loc[spacy_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = spacy_graph.loc[spacy_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_spacy[index] = np.concatenate([drug_vector, disease_vector])
y_spacy = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_spacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = spacy_graph.loc[spacy_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = spacy_graph.loc[spacy_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_spacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_spacy_test = test.y.to_numpy()

In [98]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_spacy, y_spacy)

y_spacy_pred = xgb.predict_proba(X_spacy_test)
y_spacy_proba = xgb.predict_proba(X_spacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_spacy_proba)

spacy_df_full_graph_2 = pd.DataFrame(y_spacy_proba)
spacy_df_full_graph_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_spacy, y_spacy)

y_spacy_pred = rf_clf.predict_proba(X_spacy_test)
y_spacy_proba = rf_clf.predict_proba(X_spacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_spacy_proba)

xgboost scores (not treat; treat; unknown)
[[0.03950122 0.20158602 0.75891274]
 [0.9938775  0.00130645 0.00481606]
 [0.982064   0.0015774  0.01635863]
 [0.9860742  0.00250905 0.01141682]]
random forest scores (not treat; treat; unknown)
[[0.08 0.3  0.62]
 [1.   0.   0.  ]
 [0.81 0.   0.19]
 [0.79 0.07 0.14]]


In [99]:
feature_length = 1024
#pubmed dataset
X_scispacy = np.empty(shape=(len(train_df_1), feature_length), dtype = 'float32')
for index, row in train_df_1.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = scispacy_graph.loc[scispacy_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = scispacy_graph.loc[scispacy_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_scispacy[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy = train_df_1.y.to_numpy()

#pubmed dataset - test
test = test.reset_index(drop=True)
X_scispacy_test = np.empty(shape=(len(test), feature_length), dtype = 'float32')
for index, row in test.iterrows():
    drug = row['source']
    disease = row['target']
    drug_vector = scispacy_graph.loc[scispacy_graph.id==drug].topological_embedding.values[0] #rdb.collect()[0].embedding
    disease_vector = scispacy_graph.loc[scispacy_graph.id==disease].topological_embedding.values[0] #rdb.collect()[0].embedding
    X_scispacy_test[index] = np.concatenate([drug_vector, disease_vector])
y_scispacy_test = test.y.to_numpy()

In [100]:
# xgboost
xgb = XGBClassifier(random_state = 42)
xgb.fit(X_scispacy, y_scispacy)

y_scispacy_pred = xgb.predict_proba(X_scispacy_test)
y_scispacy_proba = xgb.predict_proba(X_scispacy_test)

print('xgboost scores (not treat; treat; unknown)')
print(y_spacy_proba)

scispacy_df_full_graph_2 = pd.DataFrame(y_scispacy_proba)
scispacy_df_full_graph_2.columns = ['not-treat-score', 'treat-score', 'unknown-treat-score']

# random forest
rf_clf = RandomForestClassifier(random_state = 42)
rf_clf.fit(X_scispacy, y_scispacy)

y_scispacy_pred = rf_clf.predict_proba(X_scispacy_test)
y_scispacy_proba = rf_clf.predict_proba(X_scispacy_test)

print('random forest scores (not treat; treat; unknown)')
print(y_scispacy_proba)

xgboost scores (not treat; treat; unknown)
[[0.08 0.3  0.62]
 [1.   0.   0.  ]
 [0.81 0.   0.19]
 [0.79 0.07 0.14]]
random forest scores (not treat; treat; unknown)
[[0.2  0.08 0.72]
 [0.92 0.   0.08]
 [0.67 0.01 0.32]
 [0.45 0.13 0.42]]


### Summary

In [101]:
multi_index = pd.MultiIndex.from_arrays([['pubmed_full_attribute', 'pubmed_full_attribute', 'pubmed_full_attribute'],['not treat','treat','unknown']])
pubmed_df_full_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['pubmed_pca_attribute', 'pubmed_pca_attribute', 'pubmed_pca_attribute'],['not treat','treat','unknown']])
pubmed_df_full_pca_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['pubmed_full_topological', 'pubmed_full_topological', 'pubmed_full_topological'],['not treat','treat','unknown']])
pubmed_df_full_graph_2.columns = multi_index

pd.concat([pubmed_df_full_2, pubmed_df_full_pca_2, pubmed_df_full_graph_2], axis=1)

Unnamed: 0_level_0,pubmed_full_attribute,pubmed_full_attribute,pubmed_full_attribute,pubmed_pca_attribute,pubmed_pca_attribute,pubmed_pca_attribute,pubmed_full_topological,pubmed_full_topological,pubmed_full_topological
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.027482,0.596695,0.375823,0.046899,0.407993,0.545108,0.036922,0.736852,0.226226
1,0.996036,0.002003,0.001961,0.993528,0.001574,0.004897,0.992924,0.001974,0.005102
2,0.993322,0.003033,0.003645,0.986197,0.002634,0.011169,0.991064,0.000696,0.008239
3,0.989285,0.004476,0.006239,0.993518,0.002713,0.003769,0.99471,0.002158,0.003132


In [102]:
multi_index = pd.MultiIndex.from_arrays([['openai_full_attribute', 'openai_full_attribute', 'openai_full_attribute'],['not treat','treat','unknown']])
openai_df_full_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['openai_pca_attribute', 'openai_pca_attribute', 'openai_pca_attribute'],['not treat','treat','unknown']])
openai_df_full_pca_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['openai_full_topological', 'openai_full_topological', 'openai_full_topological'],['not treat','treat','unknown']])
openai_df_full_graph_2.columns = multi_index

pd.concat([openai_df_full_2, openai_df_full_pca_2, openai_df_full_graph_2], axis=1)

Unnamed: 0_level_0,openai_full_attribute,openai_full_attribute,openai_full_attribute,openai_pca_attribute,openai_pca_attribute,openai_pca_attribute,openai_full_topological,openai_full_topological,openai_full_topological
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.016544,0.500097,0.483359,0.016243,0.808951,0.174807,0.108205,0.871122,0.020674
1,0.991526,0.005184,0.00329,0.98881,0.006192,0.004998,0.997852,0.000517,0.001631
2,0.993249,0.004472,0.00228,0.990156,0.0025,0.007344,0.528853,0.017399,0.453747
3,0.990866,0.004318,0.004816,0.991547,0.004245,0.004208,0.956334,0.029733,0.013933


In [103]:
multi_index = pd.MultiIndex.from_arrays([['spacy_full_attribute', 'spacy_full_attribute', 'spacy_full_attribute'],['not treat','treat','unknown']])
spacy_df_full_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['spacy_pca_attribute', 'spacy_pca_attribute', 'spacy_pca_attribute'],['not treat','treat','unknown']])
spacy_df_full_pca_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['spacy_full_topological', 'spacy_full_topological', 'spacy_full_topological'],['not treat','treat','unknown']])
spacy_df_full_graph_2.columns = multi_index

pd.concat([spacy_df_full_2,spacy_df_full_pca_2, spacy_df_full_graph_2], axis=1)

Unnamed: 0_level_0,spacy_full_attribute,spacy_full_attribute,spacy_full_attribute,spacy_pca_attribute,spacy_pca_attribute,spacy_pca_attribute,spacy_full_topological,spacy_full_topological,spacy_full_topological
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.007376,0.917079,0.075545,0.019512,0.792798,0.187689,0.039501,0.201586,0.758913
1,0.991478,0.00127,0.007252,0.997438,0.001135,0.001427,0.993877,0.001306,0.004816
2,0.996376,0.001161,0.002464,0.992211,0.002241,0.005548,0.982064,0.001577,0.016359
3,0.996376,0.001161,0.002464,0.992211,0.002241,0.005548,0.986074,0.002509,0.011417


In [104]:
multi_index = pd.MultiIndex.from_arrays([['scispacy_full_attribute', 'scispacy_full_attribute', 'scispacy_full_attribute'],['not treat','treat','unknown']])
scispacy_df_full_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['scispacy_pca_attribute', 'scispacy_pca_attribute', 'scispacy_pca_attribute'],['not treat','treat','unknown']])
scispacy_df_full_pca_2.columns = multi_index

multi_index = pd.MultiIndex.from_arrays([['scispacy_full_topological', 'scispacy_full_topological', 'scispacy_full_topological'],['not treat','treat','unknown']])
scispacy_df_full_graph_2.columns = multi_index

pd.concat([scispacy_df_full_2, scispacy_df_full_pca_2, scispacy_df_full_graph_2], axis=1)

Unnamed: 0_level_0,scispacy_full_attribute,scispacy_full_attribute,scispacy_full_attribute,scispacy_pca_attribute,scispacy_pca_attribute,scispacy_pca_attribute,scispacy_full_topological,scispacy_full_topological,scispacy_full_topological
Unnamed: 0_level_1,not treat,treat,unknown,not treat,treat,unknown,not treat,treat,unknown
0,0.002077,0.015435,0.982488,0.005394,0.019256,0.97535,0.079009,0.056275,0.864716
1,0.997629,0.00119,0.001181,0.997411,0.001328,0.001261,0.994236,0.001864,0.0039
2,0.996533,0.001241,0.002226,0.997805,0.000828,0.001367,0.968437,0.001984,0.029579
3,0.996533,0.001241,0.002226,0.997805,0.000828,0.001367,0.977582,0.006404,0.016013


Upon topological enrichment, the treat score increases for PubMedBERT and OpenAI. Spacy and Scispacy scores also change but much less considerably. 

## Overall conclusions

* The results show that there is very high possibility of data leakage, especially when you check it for cloxotestosterone- carcinoma example however this data leakage is present not only in case of LLMs but also simpler neural networks like Spacy /SciSpacy which were trained on much less data. This shows that even if we decide to stay away from LLMs for embedding generation, we should in general stay away from pre-trained models which greatly limits the node attribute embeddings generation. 
* In some cases generic models like spacy or openai performed better than PubMedBERT. While it could mean that biomedical embeddings are not actually better, I do not think thats the case - what actually could be the case is that PubMedBERT has greater literature coverage, meaning it knows more than they do (i.e. that this drug-disease pair hasn't proved to treat 100% cases which was perhaps reported in the literature). Either way, more extensive tests should be conducted.

Next step is to train such classifiers on all node embeddings and compare the results for those drug-disease pairs to see whether true positives consistenttly get higher scores. This will also give us an idea of how much information is actually learnt from knowledge graph and how much from embeddings. 
