# PrimeKG Subgraph Construction

In this tutorial, we show how to construct a subgraph from PrimeKG and prepare the graph formats needed for further analysis.

In particular, we will slice a subgraph from PrimeKG related to inflammatory bowel disease (IBD).

The subgraph will contain all nodes and edges that are connected to IBD-related disease nodes, including the following relationships:
- Disease-Protein Relationship
- Disease-Disease Relationship (skipped as of now)
- Protein-Protein Relationship (skipped as of now)
- Drug-Protein Relationship
- Pathway-Protein Relationship
- Pathway-Pathway Relationship (skipped as of now)
- Bioprocess-Protein Relationship
- Molecular Function-Protein Relationship
- Cellular Component-Protein Relationship

In addition, to enrich the nodes and edges, we will perform the following tasks:
- Textual enrichment (only this task is implemented as of now) 
- Multi-modal enrichment (to be added)


First, we import the necessary libraries:

In [3]:
# Import necessary libraries
import os
import numpy as np
import pandas as pd
import networkx as nx
import pickle
from tqdm import tqdm
from torch_geometric.utils import from_networkx
import sys
sys.path.append('../')
from aiagents4pharma.talk2knowledgegraphs.datasets.primekg import PrimeKG
from aiagents4pharma.talk2knowledgegraphs.datasets.starkqa_primekg import StarkQAPrimeKG
from aiagents4pharma.talk2knowledgegraphs.utils.embeddings.ollama import EmbeddingWithOllama
from aiagents4pharma.talk2knowledgegraphs.utils import kg_utils

# Set the logging level for httpx to WARNING to suppress INFO messages
import logging
logging.getLogger("httpx").setLevel(logging.WARNING)

### PrimeKG

We utilize the `PrimeKG` class from the aiagents4pharma/talk2knowledgegraphs library.

The `PrimeKG` class must be initialized with the path to the local directory where the PrimeKG dataset is stored and from which it is loaded.

In [4]:
# Define primekg data by providing a local directory where the data is stored
primekg_data = PrimeKG(local_dir="../../data/primekg/")

# Invoke a method to load the data
primekg_data.load_data()

# Get primekg_nodes and primekg_edges
primekg_nodes = primekg_data.get_nodes()
primekg_edges = primekg_data.get_edges()

Loading nodes of PrimeKG dataset ...
../../data/primekg/primekg_nodes.tsv.gz already exists. Loading the data from the local directory.
Loading edges of PrimeKG dataset ...
../../data/primekg/primekg_edges.tsv.gz already exists. Loading the data from the local directory.


### IBD-related Data Filtering

#### IBD-related Disease Nodes

As a first step, we filter `primekg_nodes` for disease nodes whose names contain any of a set of search terms. Terms that target IBD include:
- inflammatory bowel disease
- crohn
- ulcerative colitis

Note that the cell below was run with a different term list (`Crohn`, `Interleukin-6`, `T-cells`, `(IL-6)`), so the output reflects those terms. This basic query can be replaced with a more complex one that captures more IBD-related nodes.

In [31]:
# Query for nodes related to IBD
relevant_terms = ["Crohn", "Interleukin-6", "T-cells", "(IL-6)"]

# Build one `str.contains` clause per term and join the clauses with `or`
# Note: `str.contains` interprets each term as a regular expression by default
query_str = " or ".join(f'node_name_lower.str.contains("{term.lower()}")' for term in relevant_terms)
# Get the nodes related to IBD
ibd_nodes_df = primekg_nodes.copy()
ibd_nodes_df["node_name_lower"] = primekg_nodes.node_name.apply(lambda x: x.lower())
ibd_nodes_df = ibd_nodes_df[ibd_nodes_df.node_type == "disease"].query(query_str, engine='python')
ibd_nodes_df.drop(columns=["node_name_lower"], inplace=True)
ibd_nodes_df



Unnamed: 0,node_index,node_name,node_source,node_id,node_type
35649,35649,neoplasm of mature T-cells or NK-cells,MONDO,5169,disease
35814,35814,Crohn ileitis and jejunitis,MONDO_grouped,709_21207,disease
35815,35815,small bowel Crohn disease,MONDO,5539,disease
37784,37784,Crohn disease,MONDO_grouped,5011_5535,disease
83770,83770,Crohn's colitis,MONDO,5532,disease
95279,95279,Crohn jejunoileitis,MONDO,708,disease
95280,95280,gastroduodenal Crohn disease,MONDO,710,disease
97088,97088,perianal Crohn disease,MONDO,5537,disease
99325,99325,Crohn disease of the esophagus,MONDO,22901,disease
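
As an alternative to assembling a query string, the same kind of filter can be written with a boolean mask. The sketch below uses a small made-up node table (not PrimeKG data) and matches terms literally with `regex=False`, so characters such as parentheses are not interpreted as regular-expression groups. The column names follow `primekg_nodes`; the rows and the `terms` list are assumptions for illustration.

```python
import pandas as pd

# Toy node table with the same column names as primekg_nodes (rows are made up)
nodes = pd.DataFrame({
    "node_name": ["Crohn disease", "ulcerative colitis", "asthma", "CRP"],
    "node_type": ["disease", "disease", "disease", "gene/protein"],
})
terms = ["crohn", "ulcerative colitis"]  # hypothetical search terms

# OR together one case-insensitive, literal substring match per term
mask = pd.Series(False, index=nodes.index)
for term in terms:
    mask |= nodes["node_name"].str.contains(term, case=False, regex=False)

# Keep only disease nodes that matched at least one term
matches = nodes[(nodes["node_type"] == "disease") & mask]
print(matches["node_name"].tolist())
```

Using `case=False` also removes the need for a temporary lower-cased column.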


#### Disease-Protein Relationship


Based on the IBD-related nodes, we next extract the edges that connect these disease nodes to gene/protein nodes, in either direction.

In [6]:
# IBD disease_protein edges
ibd_disease_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_nodes_df.index.values)) & 
                                                        (primekg_edges.tail_type == 'gene/protein')],
                                          primekg_edges[(primekg_edges.tail_index.isin(ibd_nodes_df.index.values)) & 
                                                        (primekg_edges.head_type == 'gene/protein')]])

# Check dataframe
ibd_disease_protein_edges_df

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation
6018395,37784,Crohn disease,MONDO_grouped,5011_5535,disease,2384,CRP,NCBI,1401,gene/protein,associated with,disease_protein
6018397,83770,Crohn's colitis,MONDO,5532,disease,2384,CRP,NCBI,1401,gene/protein,associated with,disease_protein
6018398,37784,Crohn disease,MONDO_grouped,5011_5535,disease,3088,DNMT3A,NCBI,1788,gene/protein,associated with,disease_protein
6018400,83770,Crohn's colitis,MONDO,5532,disease,3088,DNMT3A,NCBI,1788,gene/protein,associated with,disease_protein
6018401,37784,Crohn disease,MONDO_grouped,5011_5535,disease,2057,FN1,NCBI,2335,gene/protein,associated with,disease_protein
...,...,...,...,...,...,...,...,...,...,...,...,...
3304468,34887,DENND1B,NCBI,163486,gene/protein,35814,Crohn ileitis and jejunitis,MONDO_grouped,709_21207,disease,associated with,disease_protein
3304469,13365,CCNY,NCBI,219771,gene/protein,35814,Crohn ileitis and jejunitis,MONDO_grouped,709_21207,disease,associated with,disease_protein
3304470,35156,FAM92B,NCBI,339145,gene/protein,35814,Crohn ileitis and jejunitis,MONDO_grouped,709_21207,disease,associated with,disease_protein
3304471,34780,IRGM,NCBI,345611,gene/protein,35814,Crohn ileitis and jejunitis,MONDO_grouped,709_21207,disease,associated with,disease_protein
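
The selection above follows a pattern that recurs throughout this tutorial: take the edges whose head is in a set of anchor nodes and whose tail has a given type, then add the mirrored case. A minimal sketch of that pattern as a reusable function is shown below. `select_edges` is a hypothetical helper (not part of the library), the toy rows are made up, and the helper assumes node indices are unique across node types.

```python
import pandas as pd

def select_edges(edges, anchor_index, other_type):
    """Return edges linking any node in anchor_index to a node of other_type,
    in either direction (hypothetical helper for illustration)."""
    forward = edges[edges["head_index"].isin(anchor_index) & (edges["tail_type"] == other_type)]
    backward = edges[edges["tail_index"].isin(anchor_index) & (edges["head_type"] == other_type)]
    return pd.concat([forward, backward])

# Toy edge table (made-up rows, PrimeKG-style column names)
edges = pd.DataFrame({
    "head_index": [1, 10, 1, 20],
    "head_type": ["disease", "gene/protein", "disease", "drug"],
    "tail_index": [10, 1, 2, 10],
    "tail_type": ["gene/protein", "disease", "disease", "gene/protein"],
})

# Edges between disease node 1 and any gene/protein node
disease_protein = select_edges(edges, anchor_index=[1], other_type="gene/protein")
print(disease_protein)
```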


In [7]:
# Get unique protein index
ibd_protein_index = np.unique(np.concatenate([ibd_disease_protein_edges_df[ibd_disease_protein_edges_df.head_type == 'gene/protein'].head_index.unique(),
                                              ibd_disease_protein_edges_df[ibd_disease_protein_edges_df.tail_type == 'gene/protein'].tail_index.unique()]))
ibd_protein_index

array([  144,   729,   989,  1122,  1480,  1567,  1618,  2057,  2111,
        2329,  2384,  2983,  3088,  3259,  3333,  3469,  3484,  3495,
        4162,  4997,  5022,  5195,  5385,  5720,  5915,  6168,  6175,
        6229,  6428,  6661,  7059,  7083,  7899,  7958, 10113, 10191,
       11134, 11523, 12305, 12663, 12763, 12816, 13014, 13365, 22105,
       34778, 34779, 34780, 34814, 34887, 35156])
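
Collecting the unique indices from both edge directions can also be sketched with `np.union1d`, which returns the sorted union of two arrays. The toy edge table below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy edges in both directions (made-up rows)
edges = pd.DataFrame({
    "head_index": [1, 10, 1, 11],
    "head_type": ["disease", "gene/protein", "disease", "gene/protein"],
    "tail_index": [10, 1, 12, 1],
    "tail_type": ["gene/protein", "disease", "gene/protein", "disease"],
})

# Protein indices can appear on either side of an edge
heads = edges.loc[edges["head_type"] == "gene/protein", "head_index"]
tails = edges.loc[edges["tail_type"] == "gene/protein", "tail_index"]

# Sorted, de-duplicated union of both sides
protein_index = np.union1d(heads, tails)
print(protein_index)
```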

#### Disease-Disease Relationship

Here, we can get the records containing the relationships between disease nodes. This step is skipped for now, so the code is commented out.

In [8]:
# # IBD disease_disease edges 
# ibd_disease_disease_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_nodes_df.index.values)) & 
#                                                         (primekg_edges.tail_type == 'disease')],
#                                           primekg_edges[(primekg_edges.tail_index.isin(ibd_nodes_df.index.values)) & 
#                                                         (primekg_edges.head_type == 'disease')]])

# # Check dataframe
# ibd_disease_disease_edges_df

#### Protein-Protein Relationship

We can also get the records containing the relationships between gene/protein nodes. This step is skipped for now, so the code is commented out.

In [9]:
# # IBD protein_protein edges 
# ibd_protein_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_protein_index)) & 
#                                                         (primekg_edges.tail_type == 'gene/protein')],
#                                           primekg_edges[(primekg_edges.tail_index.isin(ibd_protein_index)) & 
#                                                         (primekg_edges.head_type == 'gene/protein')]])

# # Check dataframe
# ibd_protein_protein_edges_df

#### Drug-Protein Relationship

Next, we will get the records containing the relationships of drug-gene/protein nodes.

In [10]:
# IBD drug_protein edges
ibd_drug_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'drug') & 
                                                     (primekg_edges.tail_type == 'gene/protein') & 
                                                     (primekg_edges.tail_index.isin(ibd_protein_index))], 
                                       primekg_edges[(primekg_edges.tail_type == 'drug') & 
                                                     (primekg_edges.head_type == 'gene/protein') & 
                                                     (primekg_edges.head_index.isin(ibd_protein_index))]])

# Check dataframe
ibd_drug_protein_edges_df

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation
322559,14645,Methionine,DrugBank,DB00134,drug,6175,MTHFR,NCBI,4524,gene/protein,enzyme,drug_protein
322560,14833,Riboflavin,DrugBank,DB00140,drug,6175,MTHFR,NCBI,4524,gene/protein,enzyme,drug_protein
322561,14678,Folic acid,DrugBank,DB00158,drug,6175,MTHFR,NCBI,4524,gene/protein,enzyme,drug_protein
322562,14834,Menadione,DrugBank,DB00170,drug,6175,MTHFR,NCBI,4524,gene/protein,enzyme,drug_protein
322563,14835,Benazepril,DrugBank,DB00542,drug,6175,MTHFR,NCBI,4524,gene/protein,enzyme,drug_protein
...,...,...,...,...,...,...,...,...,...,...,...,...
5730796,5385,SLC22A4,NCBI,6583,gene/protein,14476,Testosterone cypionate,DrugBank,DB13943,drug,transporter,drug_protein
5730797,5385,SLC22A4,NCBI,6583,gene/protein,14477,Testosterone enanthate,DrugBank,DB13944,drug,transporter,drug_protein
5730798,5385,SLC22A4,NCBI,6583,gene/protein,14478,Testosterone undecanoate,DrugBank,DB13946,drug,transporter,drug_protein
5730799,5385,SLC22A4,NCBI,6583,gene/protein,14654,Choline salicylate,DrugBank,DB14006,drug,transporter,drug_protein


#### Pathway-Protein Relationship

For this case, we will get the records containing the relationships of pathway-protein nodes.

In [11]:
# IBD pathway_protein edges 
ibd_pathway_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'pathway') & 
                                                        (primekg_edges.tail_type == 'gene/protein') & 
                                                        (primekg_edges.tail_index.isin(ibd_protein_index))],
                                          primekg_edges[(primekg_edges.tail_type == 'pathway') & 
                                                        (primekg_edges.head_type == 'gene/protein') & 
                                                        (primekg_edges.head_index.isin(ibd_protein_index))]])

# Check dataframe
ibd_pathway_protein_edges_df

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation
6508862,128783,Metalloprotease DUBs,REACTOME,R-HSA-5689901,pathway,12305,NLRP3,NCBI,114548,gene/protein,interacts with,pathway_protein
6508863,128804,The NLRP3 inflammasome,REACTOME,R-HSA-844456,pathway,12305,NLRP3,NCBI,114548,gene/protein,interacts with,pathway_protein
6508864,129266,Purinergic signaling in leishmaniasis infection,REACTOME,R-HSA-9660826,pathway,12305,NLRP3,NCBI,114548,gene/protein,interacts with,pathway_protein
6508865,63064,Cytoprotection by HMOX1,REACTOME,R-HSA-9707564,pathway,12305,NLRP3,NCBI,114548,gene/protein,interacts with,pathway_protein
6509093,129136,RHOA GTPase cycle,REACTOME,R-HSA-8980692,pathway,34814,TAGAP,NCBI,117289,gene/protein,interacts with,pathway_protein
...,...,...,...,...,...,...,...,...,...,...,...,...
3834560,11134,NR1H4,NCBI,9971,gene/protein,128001,Synthesis of bile acids and bile salts via 27-...,REACTOME,R-HSA-193807,pathway,interacts with,pathway_protein
3834561,11134,NR1H4,NCBI,9971,gene/protein,128393,PPARA activates gene expression,REACTOME,R-HSA-1989781,pathway,interacts with,pathway_protein
3834562,11134,NR1H4,NCBI,9971,gene/protein,62588,Endogenous sterols,REACTOME,R-HSA-211976,pathway,interacts with,pathway_protein
3834563,11134,NR1H4,NCBI,9971,gene/protein,128116,Nuclear Receptor transcription pathway,REACTOME,R-HSA-383280,pathway,interacts with,pathway_protein


In [12]:
# Get unique pathway index
ibd_pathway_index = np.unique(np.concatenate([ibd_pathway_protein_edges_df[ibd_pathway_protein_edges_df.head_type == 'pathway'].head_index.unique(),
                                              ibd_pathway_protein_edges_df[ibd_pathway_protein_edges_df.tail_type == 'pathway'].tail_index.unique()]))
ibd_pathway_index

array([ 62341,  62373,  62376,  62394,  62400,  62404,  62405,  62414,
        62448,  62462,  62465,  62467,  62469,  62472,  62483,  62543,
        62571,  62575,  62588,  62596,  62603,  62606,  62628,  62644,
        62651,  62691,  62692,  62697,  62702,  62711,  62717,  62733,
        62734,  62805,  62807,  62916,  62925,  62968,  62987,  62996,
        63041,  63064,  63071,  63076, 127601, 127639, 127640, 127682,
       127683, 127688, 127691, 127693, 127694, 127696, 127730, 127731,
       127733, 127791, 127797, 127815, 127835, 127856, 127858, 127866,
       127869, 127886, 127908, 127917, 127918, 127960, 127971, 127977,
       127999, 128001, 128010, 128015, 128025, 128034, 128058, 128065,
       128071, 128072, 128073, 128074, 128111, 128113, 128116, 128117,
       128137, 128138, 128139, 128158, 128165, 128176, 128191, 128198,
       128204, 128208, 128227, 128242, 128243, 128244, 128253, 128254,
       128272, 128299, 128302, 128341, 128348, 128350, 128353, 128360,
      

#### Pathway-Pathway Relationship

We can also get the set of records containing the relationships between pathway nodes. This step is skipped for now, so the code is commented out.

In [13]:
# # # IBD pathway_pathway edges 
# ibd_pathway_pathway_edges_df = pd.concat([primekg_edges[(primekg_edges.head_index.isin(ibd_pathway_index)) & 
#                                                         (primekg_edges.tail_type == 'pathway')],
#                                           primekg_edges[(primekg_edges.tail_index.isin(ibd_pathway_index)) & 
#                                                         (primekg_edges.head_type == 'pathway')]])

# # Check dataframe
# ibd_pathway_pathway_edges_df

#### Bioprocess-Protein Relationship

The next step is to get the records containing the relationships of biological_process-gene/protein nodes.

In [14]:
# IBD bioprocess_protein edges 
ibd_bioprocess_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'biological_process') & 
                                                           (primekg_edges.tail_type == 'gene/protein') & 
                                                           (primekg_edges.tail_index.isin(ibd_protein_index))],
                                             primekg_edges[(primekg_edges.tail_type == 'biological_process') & 
                                                           (primekg_edges.head_type == 'gene/protein') & 
                                                           (primekg_edges.head_index.isin(ibd_protein_index))]])

# Check dataframe
ibd_bioprocess_protein_edges_df

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation
6351300,112487,neutrophil degranulation,GO,43312,biological_process,3333,FPR2,NCBI,2358,gene/protein,interacts with,bioprocess_protein
6351346,112487,neutrophil degranulation,GO,43312,biological_process,5022,ITGAM,NCBI,3684,gene/protein,interacts with,bioprocess_protein
6351457,112487,neutrophil degranulation,GO,43312,biological_process,12663,SLC11A1,NCBI,6556,gene/protein,interacts with,bioprocess_protein
6351710,103224,platelet degranulation,GO,2576,biological_process,2057,FN1,NCBI,2335,gene/protein,interacts with,bioprocess_protein
6351714,103224,platelet degranulation,GO,2576,biological_process,5720,IGF1,NCBI,3479,gene/protein,interacts with,bioprocess_protein
...,...,...,...,...,...,...,...,...,...,...,...,...
3781707,2111,LRRK2,NCBI,120892,gene/protein,51599,negative regulation of peroxidase activity,GO,2000469,biological_process,interacts with,bioprocess_protein
3781708,2111,LRRK2,NCBI,120892,gene/protein,52358,regulation of kidney size,GO,35564,biological_process,interacts with,bioprocess_protein
3781710,2111,LRRK2,NCBI,120892,gene/protein,109343,negative regulation of thioredoxin peroxidase ...,GO,1903125,biological_process,interacts with,bioprocess_protein
3781811,22105,GPBAR1,NCBI,151306,gene/protein,105254,cell surface bile acid receptor signaling pathway,GO,38184,biological_process,interacts with,bioprocess_protein


#### MolFunc-Protein Relationship

Here, we get the records containing the relationships of molecular_function-gene/protein nodes.

In [15]:
# IBD molfunc_protein edges 
ibd_molfunc_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'molecular_function') & 
                                                        (primekg_edges.tail_type == 'gene/protein') & 
                                                        (primekg_edges.tail_index.isin(ibd_protein_index))],
                                           primekg_edges[(primekg_edges.tail_type == 'molecular_function') & 
                                                         (primekg_edges.head_type == 'gene/protein') & 
                                                         (primekg_edges.head_index.isin(ibd_protein_index))]])

# Check dataframe
ibd_molfunc_protein_edges_df

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation
6198366,54290,enzyme binding,GO,19899,molecular_function,2057,FN1,NCBI,2335,gene/protein,interacts with,molfunc_protein
6198442,54290,enzyme binding,GO,19899,molecular_function,989,PPARG,NCBI,5468,gene/protein,interacts with,molfunc_protein
6198634,54290,enzyme binding,GO,19899,molecular_function,6229,NOD2,NCBI,64127,gene/protein,interacts with,molfunc_protein
6198708,54671,protease binding,GO,2020,molecular_function,2057,FN1,NCBI,2335,gene/protein,interacts with,molfunc_protein
6198747,54671,protease binding,GO,2020,molecular_function,2329,TNF,NCBI,7124,gene/protein,interacts with,molfunc_protein
...,...,...,...,...,...,...,...,...,...,...,...,...
3553533,6229,NOD2,NCBI,64127,gene/protein,122117,muramyl dipeptide binding,GO,32500,molecular_function,interacts with,molfunc_protein
3553770,2111,LRRK2,NCBI,120892,gene/protein,115199,GTP-dependent protein kinase activity,GO,34211,molecular_function,interacts with,molfunc_protein
3553771,2111,LRRK2,NCBI,120892,gene/protein,118105,beta-catenin destruction complex binding,GO,1904713,molecular_function,interacts with,molfunc_protein
3553773,2111,LRRK2,NCBI,120892,gene/protein,119847,peroxidase inhibitor activity,GO,36479,molecular_function,interacts with,molfunc_protein


#### CellComp-Protein Relationship

Finally, we get the records containing the relationships of cellular_component-gene/protein nodes.

In [16]:
# IBD cellcomp_protein edges 
ibd_cellcomp_protein_edges_df = pd.concat([primekg_edges[(primekg_edges.head_type == 'cellular_component') & 
                                                        (primekg_edges.tail_type == 'gene/protein') & 
                                                        (primekg_edges.tail_index.isin(ibd_protein_index))],
                                           primekg_edges[(primekg_edges.tail_type == 'cellular_component') & 
                                                         (primekg_edges.head_type == 'gene/protein') & 
                                                         (primekg_edges.head_index.isin(ibd_protein_index))]])

# Check dataframe
ibd_cellcomp_protein_edges_df

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation
6268120,124245,extracellular space,GO,5615,cellular_component,2384,CRP,NCBI,1401,gene/protein,interacts with,cellcomp_protein
6268231,124245,extracellular space,GO,5615,cellular_component,2057,FN1,NCBI,2335,gene/protein,interacts with,cellcomp_protein
6268290,124245,extracellular space,GO,5615,cellular_component,7083,HLA-DRB1,NCBI,3123,gene/protein,interacts with,cellcomp_protein
6268329,124245,extracellular space,GO,5615,cellular_component,3495,IFNG,NCBI,3458,gene/protein,interacts with,cellcomp_protein
6268331,124245,extracellular space,GO,5615,cellular_component,5720,IGF1,NCBI,3479,gene/protein,interacts with,cellcomp_protein
...,...,...,...,...,...,...,...,...,...,...,...,...
3636360,6661,ATG16L1,NCBI,55054,gene/protein,125634,phagophore assembly site membrane,GO,34045,cellular_component,interacts with,cellcomp_protein
3636472,2111,LRRK2,NCBI,120892,gene/protein,127370,amphisome,GO,44753,cellular_component,interacts with,cellcomp_protein
3637211,6661,ATG16L1,NCBI,55054,gene/protein,126444,vacuole-isolation membrane contact site,GO,120095,cellular_component,interacts with,cellcomp_protein
3637234,2111,LRRK2,NCBI,120892,gene/protein,126938,cytoplasmic side of mitochondrial outer membrane,GO,32473,cellular_component,interacts with,cellcomp_protein


#### Merge all dataframes

Once we have the edges of each type, we can merge them into a single dataframe representing the IBD subgraph extracted from PrimeKG.

In [17]:
# PrimeKG edges related to IBD
primekg_ibd_edges_df = pd.concat([ibd_disease_protein_edges_df,
                                #   ibd_disease_disease_edges_df,
                                #   ibd_protein_protein_edges_df,
                                  ibd_drug_protein_edges_df,
                                  ibd_pathway_protein_edges_df,
                                #   ibd_pathway_pathway_edges_df,
                                  ibd_bioprocess_protein_edges_df,
                                  ibd_molfunc_protein_edges_df,
                                  ibd_cellcomp_protein_edges_df])
primekg_ibd_edges_df["edge_type"] = primekg_ibd_edges_df.apply(lambda x: (x.head_type, x.display_relation, x.tail_type), axis=1)
primekg_ibd_edges_df.drop_duplicates(subset=['head_index', 'tail_index'], inplace=True)
primekg_ibd_edges_df.reset_index(drop=True, inplace=True)
primekg_ibd_edges_df

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation,edge_type
0,37784,Crohn disease,MONDO_grouped,5011_5535,disease,2384,CRP,NCBI,1401,gene/protein,associated with,disease_protein,"(disease, associated with, gene/protein)"
1,83770,Crohn's colitis,MONDO,5532,disease,2384,CRP,NCBI,1401,gene/protein,associated with,disease_protein,"(disease, associated with, gene/protein)"
2,37784,Crohn disease,MONDO_grouped,5011_5535,disease,3088,DNMT3A,NCBI,1788,gene/protein,associated with,disease_protein,"(disease, associated with, gene/protein)"
3,83770,Crohn's colitis,MONDO,5532,disease,3088,DNMT3A,NCBI,1788,gene/protein,associated with,disease_protein,"(disease, associated with, gene/protein)"
4,37784,Crohn disease,MONDO_grouped,5011_5535,disease,2057,FN1,NCBI,2335,gene/protein,associated with,disease_protein,"(disease, associated with, gene/protein)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7063,6661,ATG16L1,NCBI,55054,gene/protein,125634,phagophore assembly site membrane,GO,34045,cellular_component,interacts with,cellcomp_protein,"(gene/protein, interacts with, cellular_compon..."
7064,2111,LRRK2,NCBI,120892,gene/protein,127370,amphisome,GO,44753,cellular_component,interacts with,cellcomp_protein,"(gene/protein, interacts with, cellular_compon..."
7065,6661,ATG16L1,NCBI,55054,gene/protein,126444,vacuole-isolation membrane contact site,GO,120095,cellular_component,interacts with,cellcomp_protein,"(gene/protein, interacts with, cellular_compon..."
7066,2111,LRRK2,NCBI,120892,gene/protein,126938,cytoplasmic side of mitochondrial outer membrane,GO,32473,cellular_component,interacts with,cellcomp_protein,"(gene/protein, interacts with, cellular_compon..."
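
The merge step de-duplicates on `head_index` and `tail_index` only, so if the same pair of nodes were connected by more than one relation, only the first such edge would be kept. Whether this occurs among the IBD edges is not shown here; the sketch below illustrates the difference on made-up rows, with a variant that also includes `display_relation` in the key.

```python
import pandas as pd

# Toy edges: node 100 is linked to node 7 by two different relations (made-up rows)
edges = pd.DataFrame({
    "head_index": [100, 100, 200],
    "tail_index": [7, 7, 7],
    "display_relation": ["target", "enzyme", "target"],
})

# De-duplicate on the node pair only: the second relation for (100, 7) is dropped
by_pair = edges.drop_duplicates(subset=["head_index", "tail_index"])

# De-duplicate on the node pair and the relation: all three edges are kept
by_pair_and_relation = edges.drop_duplicates(
    subset=["head_index", "tail_index", "display_relation"]
)

print(len(by_pair), len(by_pair_and_relation))
```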


We can get a dataframe of nodes based on the above edge dataframe as follows:

In [18]:
# PrimeKG nodes related to IBD
primekg_ibd_nodes_df = primekg_nodes[primekg_nodes.index.isin(np.unique(np.hstack([primekg_ibd_edges_df.head_index.unique(), 
                                                                                   primekg_ibd_edges_df.tail_index.unique()])))]
primekg_ibd_nodes_df

Unnamed: 0,node_index,node_name,node_source,node_id,node_type
144,144,SMAD3,NCBI,4088,gene/protein
729,729,STAT3,NCBI,6774,gene/protein
989,989,PPARG,NCBI,5468,gene/protein
1122,1122,PPARA,NCBI,5465,gene/protein
1480,1480,ADIPOQ,NCBI,9370,gene/protein
...,...,...,...,...,...
129310,129310,Potential therapeutics for SARS,REACTOME,R-HSA-9679191,pathway
129355,129355,Detoxification of Reactive Oxygen Species,REACTOME,R-HSA-3299685,pathway
129360,129360,IRAK2 mediated activation of TAK1 complex upon...,REACTOME,R-HSA-975163,pathway
129361,129361,TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...,REACTOME,R-HSA-975110,pathway
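
Because the node dataframe is obtained by filtering `primekg_nodes`, adding a column to it later can trigger pandas' `SettingWithCopyWarning`. A minimal sketch of the same selection with an explicit `.copy()` is shown below, using made-up node and edge tables.

```python
import numpy as np
import pandas as pd

# Toy node and edge tables (made-up rows)
nodes = pd.DataFrame({"node_name": ["A", "B", "C", "D"]}, index=[0, 1, 2, 3])
edges = pd.DataFrame({"head_index": [0, 2], "tail_index": [2, 3]})

# Indices of all nodes that appear as a head or a tail
used = np.union1d(edges["head_index"], edges["tail_index"])

# Take an explicit copy so that new columns can be added safely afterwards
sub_nodes = nodes[nodes.index.isin(used)].copy()
sub_nodes["enriched_node"] = sub_nodes["node_name"] + " (toy)"
print(sub_nodes)
```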


We can store the IBD-related nodes and edges as Parquet files for future use.

In [20]:
# Store the IBD-related nodes and edges
local_dir = '../../data/primekg_ibd/'
os.makedirs(local_dir, exist_ok=True)
primekg_ibd_nodes_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_nodes.parquet'), compression='gzip', index=False)
primekg_ibd_edges_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_edges.parquet'), compression='gzip', index=False)

In [21]:
# Statistics over the IBD-related nodes and edges
print(f"Number of IBD-related nodes: {primekg_ibd_nodes_df.shape[0]}")
print(f"Number of IBD-related edges: {primekg_ibd_edges_df.shape[0]}")

Number of IBD-related nodes: 2079
Number of IBD-related edges: 7068


In [22]:
# Count the number of nodes by node type
primekg_ibd_nodes_df.groupby('node_type').size()

node_type
biological_process    1172
cellular_component     146
disease                  3
drug                   276
gene/protein            51
molecular_function     222
pathway                209
dtype: int64

In [23]:
# Count the number of edges by relation and display_relation
primekg_ibd_edges_df.groupby(['relation','display_relation']).size()

relation            display_relation
bioprocess_protein  interacts with      3880
cellcomp_protein    interacts with       710
disease_protein     associated with      290
drug_protein        enzyme                14
                    target               538
                    transporter          178
molfunc_protein     interacts with       848
pathway_protein     interacts with       610
dtype: int64

In [24]:
# Count the number of edges by edge type
primekg_ibd_edges_df.groupby(['edge_type']).size()

edge_type
(biological_process, interacts with, gene/protein)    1940
(cellular_component, interacts with, gene/protein)     355
(disease, associated with, gene/protein)               145
(drug, enzyme, gene/protein)                             7
(drug, target, gene/protein)                           269
(drug, transporter, gene/protein)                       89
(gene/protein, associated with, disease)               145
(gene/protein, enzyme, drug)                             7
(gene/protein, interacts with, biological_process)    1940
(gene/protein, interacts with, cellular_component)     355
(gene/protein, interacts with, molecular_function)     424
(gene/protein, interacts with, pathway)                305
(gene/protein, target, drug)                           269
(gene/protein, transporter, drug)                       89
(molecular_function, interacts with, gene/protein)     424
(pathway, interacts with, gene/protein)                305
dtype: int64
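
The counts above are mirrored: each `(A, relation, B)` edge type has a `(B, relation, A)` counterpart with the same count, which suggests that every edge is stored in both directions. A minimal sketch of a sanity check for this property is given below on made-up edges; applying it to `primekg_ibd_edges_df` would follow the same pattern.

```python
import pandas as pd

# Toy edges: (1, 10) has a reverse edge, (2, 10) does not (made-up rows)
edges = pd.DataFrame({
    "head_index": [1, 10, 2],
    "tail_index": [10, 1, 10],
})

# Set of directed (head, tail) pairs
pairs = set(zip(edges["head_index"], edges["tail_index"]))

# Pairs whose reverse direction is absent
missing_reverse = sorted((h, t) for h, t in pairs if (t, h) not in pairs)
print(missing_reverse)
```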

### Enrichment (textual only for now)

From this point onwards, we will use the pre-processed IBD-related nodes and edges to create a set of graph formats.

Before that, we should perform enrichment and embedding over the IBD-related nodes and edges.

As of now, we will conduct a textual enrichment over the records.

Since StarkQA provides most of the node information, we will use it to retrieve details for the IBD-related nodes.

In [26]:
# Define starkqa primekg data by providing a local directory where the data is stored
starkqa_data = StarkQAPrimeKG(local_dir="../../data/starkqa_primekg/")

# Invoke a method to load the data
starkqa_data.load_data()

# Get the StarkQAPrimeKG data, which are the QA pairs, split indices, and the node information
# starkqa_df = starkqa_data.get_starkqa()
starkqa_node_info = starkqa_data.get_starkqa_node_info()

Loading StarkQAPrimeKG dataset...
../../data/starkqa_primekg/qa/prime/stark_qa/stark_qa.csv already exists. Loading the data from the local directory.
Loading StarkQAPrimeKG embeddings...


Note that not all nodes in StarkQA-PrimeKG have additional information.

For such nodes, we provide a basic text enrichment by simply stating their node name and type.

In [27]:
def do_enrichment_text(data, starkqa_node_info):
    """
    Enrich the node with additional textual information from BioBridge and StarkQA.

    Args:
        data (dict): The node data from PrimeKG
        starkqa_node_info (dict): The node information from StarkQA-PrimeKG

    Returns:
        str: The textual description of the node
    """
    def as_text(value):
        # Missing values are stringified as 'nan'; blank only those whole values so
        # that words containing 'nan' (e.g., 'pregnancy') are left intact
        text = str(value)
        return '' if text == 'nan' else text

    # Basic textual enrichment of the node
    enriched_node = f"{data['node_name']} belongs to {data['node_type']} category. "

    # Only enrich the node if the node type is gene/protein, drug, disease, or pathway, which
    # has additional information in the node_info of StarkQA-PrimeKG
    added_info = ''
    if data['node_type'] == 'gene/protein':
        added_info = starkqa_node_info['details']['summary'] if 'summary' in starkqa_node_info['details'] else ''
    elif data['node_type'] == 'drug':
        added_info = ' '.join([as_text(starkqa_node_info['details']['description']),
                               as_text(starkqa_node_info['details']['mechanism_of_action']),
                               as_text(starkqa_node_info['details']['protein_binding']),
                               as_text(starkqa_node_info['details']['pharmacodynamics']),
                               as_text(starkqa_node_info['details']['indication'])])
    elif data['node_type'] == 'disease':
        added_info = ' '.join([as_text(starkqa_node_info['details']['mondo_definition']),
                               as_text(starkqa_node_info['details']['mayo_symptoms']),
                               as_text(starkqa_node_info['details']['mayo_causes'])])
    elif data['node_type'] == 'pathway' and 'details' in starkqa_node_info:
        added_info = (f"This pathway is found in {starkqa_node_info['details']['speciesName']}. "
                      + ' '.join([x['text'] for x in starkqa_node_info['details']['summation']]))

    # Append the additional information for enrichment
    enriched_node += added_info
    return enriched_node

By using the above function, we can enrich the node information from PrimeKG with additional information from StarkQA-PrimeKG as shown below:

In [28]:
# Perform node enrichment for each row in primekg_ibd_nodes_df
# Work on an explicit copy to avoid pandas' SettingWithCopyWarning when adding a column
primekg_ibd_nodes_df = primekg_ibd_nodes_df.copy()
text_enriched_nodes = primekg_ibd_nodes_df.apply(lambda x: do_enrichment_text(x, starkqa_node_info[x['node_index']]), axis=1).tolist()
primekg_ibd_nodes_df['enriched_node'] = text_enriched_nodes
primekg_ibd_nodes_df


Unnamed: 0,node_index,node_name,node_source,node_id,node_type,enriched_node
144,144,SMAD3,NCBI,4088,gene/protein,SMAD3 belongs to gene/protein category. The SM...
729,729,STAT3,NCBI,6774,gene/protein,STAT3 belongs to gene/protein category. The pr...
989,989,PPARG,NCBI,5468,gene/protein,PPARG belongs to gene/protein category. This g...
1122,1122,PPARA,NCBI,5465,gene/protein,PPARA belongs to gene/protein category. Peroxi...
1480,1480,ADIPOQ,NCBI,9370,gene/protein,ADIPOQ belongs to gene/protein category. This ...
...,...,...,...,...,...,...
129310,129310,Potential therapeutics for SARS,REACTOME,R-HSA-9679191,pathway,Potential therapeutics for SARS belongs to pat...
129355,129355,Detoxification of Reactive Oxygen Species,REACTOME,R-HSA-3299685,pathway,Detoxification of Reactive Oxygen Species belo...
129360,129360,IRAK2 mediated activation of TAK1 complex upon...,REACTOME,R-HSA-975163,pathway,IRAK2 mediated activation of TAK1 complex upon...
129361,129361,TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...,REACTOME,R-HSA-975110,pathway,TRAF6 mediated IRF7 activation in TLR7/8 or 9 ...


Next, we perform a similar textual enrichment for the edges of the subgraph.

Since StarkQA-PrimeKG provides additional information only for nodes, each edge is described with the basic information of its triple: the head node, the relation, and the tail node.

In [29]:
# Perform textual enrichment over the edges by describing each triple: head node (with its type), relation, and tail node (with its type)
text_enriched_edges = primekg_ibd_edges_df.apply(lambda x: f"{x['head_name']} ({x['head_type']}) has a direct relationship of {x['relation']}:{x['display_relation']} with {x['tail_name']} ({x['tail_type']}).", axis=1).tolist()
primekg_ibd_edges_df['enriched_edge'] = text_enriched_edges
primekg_ibd_edges_df.head()

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation,edge_type,enriched_edge
0,37784,Crohn disease,MONDO_grouped,5011_5535,disease,2384,CRP,NCBI,1401,gene/protein,associated with,disease_protein,"(disease, associated with, gene/protein)",Crohn disease (disease) has a direct relations...
1,83770,Crohn's colitis,MONDO,5532,disease,2384,CRP,NCBI,1401,gene/protein,associated with,disease_protein,"(disease, associated with, gene/protein)",Crohn's colitis (disease) has a direct relatio...
2,37784,Crohn disease,MONDO_grouped,5011_5535,disease,3088,DNMT3A,NCBI,1788,gene/protein,associated with,disease_protein,"(disease, associated with, gene/protein)",Crohn disease (disease) has a direct relations...
3,83770,Crohn's colitis,MONDO,5532,disease,3088,DNMT3A,NCBI,1788,gene/protein,associated with,disease_protein,"(disease, associated with, gene/protein)",Crohn's colitis (disease) has a direct relatio...
4,37784,Crohn disease,MONDO_grouped,5011_5535,disease,2057,FN1,NCBI,2335,gene/protein,associated with,disease_protein,"(disease, associated with, gene/protein)",Crohn disease (disease) has a direct relations...
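
Because the `enriched_edge` column is truncated in the output above, it can help to see one full sentence. The sketch below applies the same f-string template to the values of the first row, copied from the table into a plain dictionary. The function name `describe_edge` is hypothetical.

```python
def describe_edge(row):
    """Build the edge sentence with the same template as the notebook cell."""
    return (f"{row['head_name']} ({row['head_type']}) has a direct relationship of "
            f"{row['relation']}:{row['display_relation']} with "
            f"{row['tail_name']} ({row['tail_type']}).")


# Values of the first edge shown in the table above
first_edge = {
    'head_name': 'Crohn disease',
    'head_type': 'disease',
    'relation': 'disease_protein',
    'display_relation': 'associated with',
    'tail_name': 'CRP',
    'tail_type': 'gene/protein',
}
print(describe_edge(first_edge))
# Crohn disease (disease) has a direct relationship of disease_protein:associated with with CRP (gene/protein).
```

Note that a display relation ending in "with" yields a doubled "with with" in the sentence. This does not break the pipeline, but it is worth keeping in mind when reading the texts that are passed to the embedding model.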


### Embeddings (using textual embedding as of now)

We now embed the enriched node and edge texts using the `EmbeddingWithOllama` class.

For this purpose, we use the `nomic-embed-text` model.

In [30]:
# Using nomic-ai/nomic-embed-text-v1.5 model
emb_model = EmbeddingWithOllama(model_name='nomic-embed-text')

ValueError: Error: Failed to connect to Ollama. Please check that Ollama is downloaded, running and accessible. https://ollama.com/download and restarted Ollama server.
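
The error above indicates that no Ollama server was reachable when the cell was run. As a minimal sketch, the server can be probed before the embedding model is created. The snippet assumes that Ollama listens on its default local address `http://localhost:11434`, and the function name `ollama_is_reachable` is hypothetical. It uses only the Python standard library and does not touch the `EmbeddingWithOllama` class.

```python
import urllib.request


def ollama_is_reachable(base_url='http://localhost:11434', timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, refused connections and timeouts are all OSError subclasses
        return False


if not ollama_is_reachable():
    print('Ollama does not appear to be running; start the server before creating the embedding model.')
```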

#### Node Embedding

We embed the IBD-related nodes with the Ollama model in mini-batches of 100 nodes at a time.

In [None]:
# Since there are many node records, we split them into mini-batches
mini_batch_size = 100
node_embeddings = []
for i in tqdm(range(0, primekg_ibd_nodes_df.shape[0], mini_batch_size)):
    outputs = emb_model.embed_documents(primekg_ibd_nodes_df.enriched_node.values.tolist()[i:i+mini_batch_size])
    node_embeddings.extend(outputs)
# node_embeddings
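
The mini-batch loop can also be wrapped in a small reusable helper. The following is a sketch only: `iter_batches`, `embed_in_batches`, and `fake_embed` are hypothetical names, and `fake_embed` stands in for `emb_model.embed_documents` so that the example runs without an Ollama server.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=100):
    """Apply embed_fn to each batch and concatenate the results in order."""
    embeddings = []
    for batch in iter_batches(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings


def fake_embed(batch):
    # Stand-in for emb_model.embed_documents: one 2-dimensional vector per text
    return [[float(len(text)), 0.0] for text in batch]


texts = [f"node {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed, batch_size=100)
print(len(vectors), len(vectors[0]))  # 250 2
```

With the real model, `embed_fn` would be `emb_model.embed_documents`, and the same helper could serve both the node texts and the edge texts.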

In [None]:
# Check the shape of the node embeddings
len(node_embeddings), len(node_embeddings[0])

In [None]:
# Add them as features to the dataframe
primekg_ibd_nodes_df['x'] = node_embeddings

# Drop and rename several columns
primekg_ibd_nodes_df.drop(columns=['node_source', 'node_id'], inplace=True)
primekg_ibd_nodes_df.rename(columns={'node_index': 'node_id'}, inplace=True)

# Check dataframe of nodes
primekg_ibd_nodes_df.head()

In [None]:
# Duplicate node_id as a new column and use it as the index
primekg_ibd_nodes_df['node'] = primekg_ibd_nodes_df['node_id']
primekg_ibd_nodes_df.set_index('node', inplace=True)
primekg_ibd_nodes_df.head()

In [None]:
# Save the embedded nodes dataframe to a parquet file
primekg_ibd_nodes_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_nodes_embedded.parquet'), compression='gzip', index=False)

#### Edge Embedding

Likewise, we embed the IBD-related edges with the Ollama model in mini-batches of 100 edges at a time.

In [None]:
# Since there are many edge records, we split them into mini-batches
mini_batch_size = 100
edge_embeddings = []
for i in tqdm(range(0, primekg_ibd_edges_df.shape[0], mini_batch_size)):
    outputs = emb_model.embed_documents(primekg_ibd_edges_df.enriched_edge.values.tolist()[i:i+mini_batch_size])
    edge_embeddings.extend(outputs)
# edge_embeddings

In [None]:
# Check the shape of the edge embeddings
len(edge_embeddings), len(edge_embeddings[0])

In [None]:
# Add them as features to the dataframe
primekg_ibd_edges_df['edge_attr'] = edge_embeddings

# Drop and rename several columns
primekg_ibd_edges_df.drop(columns=['head_source', 'head_id', 'head_type', 'tail_source', 'tail_id', 'tail_type', 'display_relation', 'relation'], inplace=True)
primekg_ibd_edges_df.rename(columns={'head_index': 'head_id', 'tail_index': 'tail_id'}, inplace=True)

# Check dataframe of edges
primekg_ibd_edges_df.head()

In [None]:
# Save the embedded edges dataframe to a parquet file
primekg_ibd_edges_df.to_parquet(os.path.join(local_dir, 'primekg_ibd_edges_embedded.parquet'), compression='gzip', index=False)

### Knowledge Graph Construction

In this section, we convert our dataframes into a networkx `DiGraph` object.

In [None]:
# Modify the node dataframe: use "<node_name>_(<node_id>)" as a readable node identifier and as the index
primekg_ibd_nodes_df["node"] = primekg_ibd_nodes_df.apply(lambda x: f"{x.node_name}_({x.node_id})", axis=1)
primekg_ibd_nodes_df["node_id"] = primekg_ibd_nodes_df.apply(lambda x: f"{x.node_name}_({x.node_id})", axis=1)
primekg_ibd_nodes_df.set_index('node', inplace=True)
primekg_ibd_nodes_df.head()

In [None]:
# Modify the edge dataframe: rewrite head and tail identifiers in the same "<name>_(<id>)" format as the nodes
primekg_ibd_edges_df["head_id"] = primekg_ibd_edges_df.apply(lambda x: f"{x.head_name}_({x.head_id})", axis=1)
primekg_ibd_edges_df["tail_id"] = primekg_ibd_edges_df.apply(lambda x: f"{x.tail_name}_({x.tail_id})", axis=1)
primekg_ibd_edges_df.reset_index(drop=True, inplace=True)
primekg_ibd_edges_df.head()

In [None]:
# Convert dataframes to a knowledge graph as a networkx object
kg = nx.DiGraph()
for i, row in primekg_ibd_nodes_df.iterrows():
    kg.add_node(row['node_id'], **row.to_dict())
for i, row in primekg_ibd_edges_df.iterrows():
    kg.add_edge(row['head_id'], row['tail_id'], key=i, **row.to_dict())
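
One caveat worth checking: a networkx `DiGraph` keeps at most one edge per ordered (head, tail) pair, so a later `add_edge` call for the same pair overwrites the attributes of the earlier one, and the `key=i` argument is stored as an ordinary edge attribute rather than as an edge key. Whether the IBD subgraph contains such parallel edges is not shown here. The sketch below, on made-up toy edges, compares `DiGraph` with `MultiDiGraph`, which does keep parallel edges.

```python
import networkx as nx

# Made-up toy edges: two different relations between the same pair of nodes
toy_edges = [
    ('DrugX', 'ProteinY', 'target'),
    ('DrugX', 'ProteinY', 'enzyme'),
    ('ProteinY', 'PathwayZ', 'interacts with'),
]

di_graph = nx.DiGraph()
multi_graph = nx.MultiDiGraph()
for head, tail, relation in toy_edges:
    di_graph.add_edge(head, tail, relation=relation)
    multi_graph.add_edge(head, tail, relation=relation)

print(di_graph.number_of_edges())               # 2: the parallel edge was merged
print(multi_graph.number_of_edges())            # 3: both relations are kept
print(di_graph['DrugX']['ProteinY']['relation'])  # enzyme (the last one written)
```

Comparing `kg.number_of_edges()` with the number of rows in `primekg_ibd_edges_df` is a quick way to see whether any edges were merged.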


In [None]:
# Save graph object
local_dir = '../../../aiagents4pharma/talk2knowledgegraphs/tests/files/'
with open(os.path.join(local_dir, 'primekg_ibd_nx_graph.pkl'), 'wb') as f:
    pickle.dump(kg, f)

# Load graph object
# with open(os.path.join(local_dir, 'primekg_ibd_nx_graph.pkl'), 'rb') as f:
#     kg = pickle.load(f)


In [None]:
# Report the size of the knowledge graph
print("#Nodes", kg.number_of_nodes())
print("#Edges", kg.number_of_edges())

In addition, we can convert the networkx graph into a PyG `Data` object for further processing (e.g., subgraph extraction).

In [None]:
# Convert networkx graph to PyG data object
pyg_graph = from_networkx(kg)

# Save graph object
with open(os.path.join(local_dir, 'primekg_ibd_pyg_graph.pkl'), 'wb') as f:
    pickle.dump(pyg_graph, f)

# Load graph object
# with open(os.path.join(local_dir, 'primekg_ibd_pyg_graph.pkl'), 'rb') as f:
#     pyg_graph = pickle.load(f)

Lastly, we prepare a textualized graph of nodes and edges, which can be used, for instance, in a RAG application.


In [None]:
# Prepare nodes
nodes_df = pd.DataFrame({
    'node_id': list(pyg_graph.node_id),
    'node_attr': list(pyg_graph.enriched_node),
})
nodes_df

In [None]:
# Prepare edges
edges_df = pd.DataFrame({
    'head_id': list(pyg_graph.head_id),
    'edge_type': list(pyg_graph.edge_type),
    'tail_id': list(pyg_graph.tail_id),
})
edges_df

In [None]:
# Save the textualized graph (nodes and edges dataframes) as a single pickle file
with open(os.path.join(local_dir, 'primekg_ibd_text_graph.pkl'), "wb") as f:
    pickle.dump({"nodes": nodes_df, "edges": edges_df}, f)
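
As one possible way to use such a textualized graph, the two dataframes can be serialized to plain text and placed in a prompt. The sketch below is an assumption about downstream usage rather than part of this notebook: the function name `textualize` is hypothetical, and the toy rows are made up to mimic the structure of `nodes_df` and `edges_df`.

```python
import pandas as pd

# Toy stand-ins for nodes_df and edges_df (made-up rows for illustration)
toy_nodes = pd.DataFrame({
    'node_id': ['Crohn disease_(37784)', 'CRP_(2384)'],
    'node_attr': ['Crohn disease belongs to disease category.',
                  'CRP belongs to gene/protein category.'],
})
toy_edges = pd.DataFrame({
    'head_id': ['Crohn disease_(37784)'],
    'edge_type': ['(disease, associated with, gene/protein)'],
    'tail_id': ['CRP_(2384)'],
})


def textualize(nodes, edges):
    """Serialize nodes and edges as two CSV blocks separated by a blank line."""
    return nodes.to_csv(index=False) + '\n' + edges.to_csv(index=False)


graph_text = textualize(toy_nodes, toy_edges)
print(graph_text)
```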