# Multimodal BioBridge-PrimeKG Graph Construction

In this tutorial, we perform a simple pre-processing task on the BioBridge-PrimeKG dataset, which contains multimodal data.
In particular, we use the embeddings already provided by BioBridge, joined with the PrimeKG IBD dataset obtained in the previous tutorial:


[docs/notebooks/talk2knowledgegraphs/tutorial_primekg_subgraph.ipynb](https://github.com/VirtualPatientEngine/AIAgents4Pharma/blob/main/docs/notebooks/talk2knowledgegraphs/tutorial_primekg_subgraph.ipynb)


First of all, we need to import necessary libraries as follows:

In [1]:
# Import necessary libraries
# %load_ext cudf.pandas

import os
import numpy as np
import pandas as pd
import networkx as nx
import pickle
import blosc

from tqdm import tqdm
from torch_geometric.utils import from_networkx
import sys

from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

sys.path.append('../../..')
from aiagents4pharma.talk2knowledgegraphs.datasets.starkqa_primekg import StarkQAPrimeKG
from aiagents4pharma.talk2knowledgegraphs.datasets.biobridge_primekg import BioBridgePrimeKG
from aiagents4pharma.talk2knowledgegraphs.utils.embeddings.ollama import EmbeddingWithOllama
# from aiagents4pharma.talk2knowledgegraphs.utils import kg_utils

# # Set the logging level for httpx to WARNING to suppress INFO messages
import logging
logging.getLogger("httpx").setLevel(logging.WARNING)



In [None]:
import os
os.environ["NVIDIA_API_KEY"] = "XXX"  # Replace with your actual NVIDIA API key

### Prepare BioBridge dataset

The `BioBridgePrimeKG` class downloads the data from the related GitHub repository if it is not available locally.

Otherwise, the data is loaded from the local directories given by `local_dir` and `primekg_dir`.
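The load-locally-or-download behaviour described above can be sketched roughly as follows. This is a generic illustration of the pattern, not the actual `BioBridgePrimeKG` implementation; `load_or_fetch`, `fake_download`, and the file names are hypothetical.

```python
import os
import tempfile


def load_or_fetch(path, fetch):
    """Return the bytes stored at `path`, calling `fetch()` only if the file is missing.

    `fetch` is a hypothetical zero-argument callable that returns the file
    content as bytes (for example, a download from a remote repository).
    """
    if not os.path.exists(path):
        # Cache miss: create the target directory and store the fetched content.
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            f.write(fetch())
    # Cache hit (or freshly written file): read from the local directory.
    with open(path, "rb") as f:
        return f.read()


# Usage with a temporary directory and a stand-in for the download step.
calls = []


def fake_download():
    calls.append(1)
    return b"node_index,node_name\n0,PHYHIP\n"


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "primekg", "nodes.csv")
    first = load_or_fetch(target, fake_download)   # fetched, then cached
    second = load_or_fetch(target, fake_download)  # served from disk
```

The second call never reaches `fake_download`, which mirrors the "already exists. Loading the data from the local directory." messages printed by the loader.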

In [2]:
# Define biobridge primekg data by providing a local directory where the data is stored
biobridge_data = BioBridgePrimeKG(primekg_dir="../../../../data/primekg/",
                                  local_dir="../../../../data/biobridge_primekg/")

# Invoke a method to load the data
biobridge_data.load_data()

# Get the node information of the BioBridge PrimeKG
biobridge_node_info = biobridge_data.get_node_info_dict()
biobridge_node_info.keys()

Loading PrimeKG dataset...
Loading nodes of PrimeKG dataset ...
../../../../data/primekg/primekg_nodes.tsv.gz already exists. Loading the data from the local directory.
Loading edges of PrimeKG dataset ...
../../../../data/primekg/primekg_edges.tsv.gz already exists. Loading the data from the local directory.
Loading data config file of BioBridgePrimeKG...
File data_config.json already exists in ../../../../data/biobridge_primekg/.
Building node embeddings...
Building full triplets...
Building train-test split...


dict_keys(['gene/protein', 'molecular_function', 'cellular_component', 'biological_process', 'drug', 'disease'])

We also use another source of information, StarkQA PrimeKG, which provides details about each node in the graph.
We can use the `StarkQAPrimeKG` class to load the data.
After loading the data with the `load_data` method, we call the `get_starkqa_node_info` method to obtain the node information of StarkQA PrimeKG.

In [3]:
# As an additional source of information, we utilize StarkQA PrimeKG 
starkqa_data = StarkQAPrimeKG(local_dir="../../../../data/starkqa_primekg/")

# Invoke a method to load the data
starkqa_data.load_data()

# Get the node information of the StarkQA PrimeKG
starkqa_node_info = starkqa_data.get_starkqa_node_info()

Loading StarkQAPrimeKG dataset...
../../../../data/starkqa_primekg/qa/prime/stark_qa/stark_qa.csv already exists. Loading the data from the local directory.
Loading StarkQAPrimeKG embeddings...


The following code prepares the node and edge dataframes from the BioBridge dataset.

In [4]:
# Prepare BioBridge-PrimeKG edges
# Build the node index list
node_info_dict = {}
node_index_list = []
for i, node_type in enumerate(biobridge_data.preselected_node_types):
    df_node = pd.read_csv(os.path.join(biobridge_data.local_dir, "processed", f"{node_type}.csv"))
    node_info_dict[biobridge_data.node_type_map[node_type]] = df_node
    node_index_list.extend(df_node["node_index"].tolist())

# Filter the PrimeKG dataset to take into account only the selected node types
edges_df = biobridge_data.primekg.get_edges().copy()
edges_df = edges_df[
    edges_df["head_index"].isin(node_index_list) &\
    edges_df["tail_index"].isin(node_index_list)
]
edges_df = edges_df.reset_index(drop=True)

# Further filtering out some nodes in the embedding dictionary
edges_df = edges_df[
    edges_df["head_index"].isin(list(biobridge_data.emb_dict.keys())) &\
    edges_df["tail_index"].isin(list(biobridge_data.emb_dict.keys()))
].reset_index(drop=True)
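Both filters above apply the same rule: an edge is kept only if its head and its tail are in the allowed node set. A minimal plain-Python sketch of that rule, using hypothetical toy data instead of the PrimeKG dataframes, might look like this:

```python
def filter_edges(edges, allowed_nodes):
    """Keep only edges whose head and tail are both in `allowed_nodes`."""
    allowed = set(allowed_nodes)  # set membership is O(1) per lookup
    return [
        e for e in edges
        if e["head_index"] in allowed and e["tail_index"] in allowed
    ]


# Toy edges; node 3 is not part of the allowed set.
edges = [
    {"head_index": 0, "tail_index": 1},
    {"head_index": 1, "tail_index": 2},
    {"head_index": 2, "tail_index": 3},
]
kept = filter_edges(edges, allowed_nodes=[0, 1, 2])
```

Here the last edge is dropped because its tail is outside the allowed set, even though its head is valid.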

In [5]:
# Prepare BioBridge-PrimeKG nodes
nodes_df = biobridge_data.primekg.get_nodes().copy()
nodes_df = nodes_df[nodes_df["node_index"].isin(np.unique(np.concatenate([edges_df.head_index.unique(), 
                                                                          edges_df.tail_index.unique()])))].reset_index(drop=True)

In [6]:
# Check the number of nodes
print(f"Number of nodes: {len(nodes_df)}")


Number of nodes: 84981


In [7]:
# Check the number of edges
print(f"Number of edges: {len(edges_df)}")

Number of edges: 3904610


### Modal-Specific Enrichment & Embedding


The BioBridge dataset provides multimodal data for several node types: gene/protein, molecular_function, cellular_component, biological_process, drug, and disease.
The following code snippet shows which feature column holds the modality-specific data for each node type and how to obtain the corresponding node embeddings.


In [8]:
# Define feature columns 
dict_feature_columns = {
    "gene/protein": "sequence",
    "molecular_function": "description",
    "cellular_component": "description",
    "biological_process": "description",
    "drug": "smiles",
    "disease": "definition",
}

# Obtain the node embeddings of the BioBridge
biobridge_node_embeddings = biobridge_data.get_node_embeddings()

#### Node Enrichment & Embedding

As mentioned earlier, we can use the StarkQA PrimeKG dataset to simplify the textual enrichment of the nodes.

In [9]:
def get_textual_enrichment(data, node_info):
    """
    Enrich the node with additional information from StarkQA-PrimeKG

    Args:
        data (dict): The node data from PrimeKG
        node_info (dict): The node information from StarkQA-PrimeKG
    """
    # Basic textual enrichment of the node
    enriched_node = f"{data['node_name']} belongs to {data['node_type']} node. "
    # Only enrich the node if the node type is gene/protein, drug, disease, or pathway, which
    # has additional information in the node_info of StarkQA-PrimeKG
    added_info = ''
    if data['node_type'] == 'gene/protein':
        added_info += f"{data['node_name']} is {node_info['details']['name']}. " if 'name' in node_info['details'] else ''
        added_info += node_info['details']['summary'] if 'summary' in node_info['details'] else ''
    elif data['node_type'] == 'drug':
        added_info = ' '.join([str(node_info['details']['description']).replace('nan', ''),
                               str(node_info['details']['mechanism_of_action']).replace('nan', ''),
                               str(node_info['details']['protein_binding']).replace('nan', ''),
                               str(node_info['details']['pharmacodynamics']).replace('nan', ''),
                               str(node_info['details']['indication']).replace('nan', '')])
    elif data['node_type'] == 'disease':
        added_info = ' '.join([str(node_info['details']['mondo_definition']).replace('nan', ''),
                               str(node_info['details']['mayo_symptoms']).replace('nan', ''),
                               str(node_info['details']['mayo_causes']).replace('nan', '')])
    elif data['node_type'] == 'pathway':
        added_info += f"This pathway found in {node_info['details']['speciesName']}. " + ' '.join([x['text'] for x in node_info['details']['summation']]) if 'details' in node_info else ''
    # Append the additional information for enrichment
    enriched_node += added_info
    return enriched_node
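One caveat of `str(value).replace('nan', '')` is that it also removes the letters "nan" inside ordinary words. The sketch below shows one possible alternative that drops only genuinely missing fields; `clean_field` and `join_fields` are hypothetical helpers and are not part of the tutorial's code.

```python
import math


def clean_field(value):
    """Return `value` as text, or an empty string if it is missing."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    text = str(value).strip()
    # Treat a literal "nan" string as missing, but keep words that merely contain "nan".
    return "" if text.lower() == "nan" else text


def join_fields(details, keys):
    """Join the non-empty fields of `details` listed in `keys` with single spaces."""
    parts = [clean_field(details.get(k)) for k in keys]
    return " ".join(p for p in parts if p)


# Toy record: one field is NaN and one key is absent altogether.
details = {
    "description": "A nanoparticle formulation.",
    "mechanism_of_action": float("nan"),
    "indication": "Used in IBD.",
}
text = join_fields(
    details,
    ["description", "mechanism_of_action", "protein_binding", "indication"],
)
naive = str(details["description"]).replace("nan", "")
```

With the substring replacement, "nanoparticle" becomes "oparticle", whereas the helper leaves the description untouched and simply skips the missing fields.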

In [10]:
# Enrich the node with additional textual description from StarkQA-PrimeKG
nodes_df["desc"] = nodes_df.apply(lambda x: get_textual_enrichment(x, starkqa_node_info[x['node_index']]), axis=1)
nodes_df.head(5)

Unnamed: 0,node_index,node_name,node_source,node_id,node_type,desc
0,0,PHYHIP,NCBI,9796,gene/protein,PHYHIP belongs to gene/protein node. PHYHIP is...
1,1,GPANK1,NCBI,7918,gene/protein,GPANK1 belongs to gene/protein node. GPANK1 is...
2,2,ZRSR2,NCBI,8233,gene/protein,ZRSR2 belongs to gene/protein node. ZRSR2 is z...
3,3,NRF1,NCBI,4899,gene/protein,NRF1 belongs to gene/protein node. NRF1 is nuc...
4,4,PI4KA,NCBI,5297,gene/protein,PI4KA belongs to gene/protein node. PI4KA is p...


Next, we embed the description column using the NVIDIA embedding model.

In [13]:
# Using NVIDIA embedding model hosted on local server
emb_model = NVIDIAEmbeddings(
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",  # Available models: "nvolveqa_40k", "nvolveqa_1.5", etc.
    base_url="http://localhost:8000/v1"
)

# Use mini-batch processing to perform the embedding
mini_batch_size = 1000
desc_embeddings = []
for i in tqdm(range(0, nodes_df.shape[0], mini_batch_size)):
    outputs = emb_model.embed_documents(nodes_df.desc.values.tolist()[i:i+mini_batch_size])
    desc_embeddings.extend(outputs)

# Add them as features to the dataframe
nodes_df['desc_emb'] = desc_embeddings
nodes_df.head(5)

  9%|▉         | 8/85 [01:56<18:39, 14.54s/it]


KeyboardInterrupt: 
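The mini-batch loop above can be factored into a small helper. The following is a sketch under the assumption that the embedding function takes a list of texts and returns one vector per text, as `embed_documents` does in the cell above; `embed_in_batches` and `toy_embed` are hypothetical names, and `toy_embed` merely stands in for a real model.

```python
def embed_in_batches(texts, embed_fn, batch_size):
    """Embed `texts` in consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        outputs = embed_fn(batch)
        # Guard against a backend that silently drops or duplicates items.
        if len(outputs) != len(batch):
            raise RuntimeError("embedding backend returned an unexpected number of vectors")
        embeddings.extend(outputs)
    return embeddings


def toy_embed(batch):
    """Stand-in for a real model: a one-dimensional 'embedding' per text."""
    return [[float(len(t))] for t in batch]


texts = [f"node {i}" for i in range(7)]
vectors = embed_in_batches(texts, toy_embed, batch_size=3)
```

Converting the column to a list once and slicing that list, as done here, avoids rebuilding the full list on every iteration.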

We then enrich each node with its modality-specific BioBridge feature and the corresponding BioBridge embedding.

In [12]:
# Obtain modality-specific information
nodes_df["feat"] = nodes_df.apply(lambda x: 
                                           biobridge_node_info[x["node_type"]][biobridge_node_info[x["node_type"]]["node_index"] == x["node_index"]][dict_feature_columns[x["node_type"]]].values[0], axis=1)
nodes_df["feat"] = nodes_df.apply(lambda x: 
                                           x["feat"] 
                                           if not pd.isnull(x["feat"]) else x["node_name"], axis=1)
# Note: np.nan is used because the np.NaN alias was removed in NumPy 2.0
nodes_df["feat_emb"] = nodes_df.apply(lambda x: 
                               biobridge_node_embeddings[x["node_index"]] 
                               if x["node_index"] in biobridge_node_embeddings else np.nan, axis=1)
nodes_df.dropna(subset=["feat_emb"], inplace=True)
nodes_df.head(5)

Unnamed: 0,node_index,node_name,node_source,node_id,node_type,desc,desc_emb,feat,feat_emb
0,0,PHYHIP,NCBI,9796,gene/protein,PHYHIP belongs to gene/protein node. PHYHIP is...,"[0.012382511, 0.0384233, -0.16545817, -0.03335...",MELLSTPHSIEINNITCDSFRISWAMEDSDLERVTHYFIDLNKKEN...,"[0.04029838368296623, -0.018344514071941376, 0..."
1,1,GPANK1,NCBI,7918,gene/protein,GPANK1 belongs to gene/protein node. GPANK1 is...,"[0.007383303, 0.02430243, -0.15115355, -0.0132...",MSRPLLITFTPATDPSDLWKDGQQQPQPEKPESTLDGAAARAFYEA...,"[-0.049913737922906876, -0.04380067065358162, ..."
2,2,ZRSR2,NCBI,8233,gene/protein,ZRSR2 belongs to gene/protein node. ZRSR2 is z...,"[0.011058212, 0.0035900373, -0.13332076, -0.01...",MAAPEKMTFPEKPSHKKYRAALKKEKRKKRRQELARLRDSGLSQKE...,"[0.035360466688871384, -0.09613325446844101, 0..."
3,3,NRF1,NCBI,4899,gene/protein,NRF1 belongs to gene/protein node. NRF1 is nuc...,"[0.011984555, 0.03422219, -0.15100288, 0.01625...",MEEHGVTQTEHMATIEAHAVAQQVQQVHVATYTEHSMLSADEDSPS...,"[-0.052261918783187866, -0.022747397422790527,..."
4,4,PI4KA,NCBI,5297,gene/protein,PI4KA belongs to gene/protein node. PI4KA is p...,"[0.031896856, 0.028683903, -0.13113366, -0.022...",MAAAPARGGGGGGGGGGGCSGSGSSASRGFYFNTVLSLARSLAVQR...,"[0.005174526944756508, -0.049968406558036804, ..."
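The row-wise lookup in the cell above filters a per-type table once for every node. An equivalent result can be obtained by first building a dictionary keyed by `node_index`. The sketch below illustrates the idea in plain Python with hypothetical toy data; `build_feature_lookup` and `pick_feature` are illustrative names, not functions from the library.

```python
def build_feature_lookup(node_info, feature_columns):
    """Map node_index -> modality-specific feature, scanning each table once."""
    lookup = {}
    for node_type, rows in node_info.items():
        column = feature_columns[node_type]
        for row in rows:
            lookup[row["node_index"]] = row.get(column)
    return lookup


def pick_feature(node, lookup):
    """Return the node's feature, falling back to its name when none is recorded."""
    value = lookup.get(node["node_index"])
    return node["node_name"] if value is None else value


# Toy stand-ins for biobridge_node_info and dict_feature_columns.
node_info = {
    "gene/protein": [{"node_index": 0, "sequence": "MELLSTPHSIE"}],
    "drug": [{"node_index": 7, "smiles": None}],
}
feature_columns = {"gene/protein": "sequence", "drug": "smiles"}
lookup = build_feature_lookup(node_info, feature_columns)

features = [
    pick_feature({"node_index": 0, "node_name": "PHYHIP"}, lookup),
    pick_feature({"node_index": 7, "node_name": "SomeDrug"}, lookup),
]
```

The fallback to the node name mirrors the second `apply` call above, which substitutes `node_name` when the feature is null.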


In [13]:
# Check if there are any NaN values in the feature column
nodes_df["feat_emb"].isna().any()

False

Note that for nodes whose features are textual, we replace the original embeddings with new ones computed by an Ollama model (these are used later in the talk2knowledgegraphs application).

In [14]:
# Using nomic-ai/nomic-embed-text-v1.5 model via Ollama
emb_model = EmbeddingWithOllama(model_name='nomic-embed-text')

# The node table is large, so we embed it in mini-batches
mini_batch_size = 100
text_based_df = nodes_df[nodes_df.node_type.isin(['disease', 'biological_process', 'cellular_component', 'molecular_function'])]
text_node_indexes = []
text_node_embeddings = []
for i in tqdm(range(0, text_based_df.shape[0], mini_batch_size)):
    outputs = emb_model.embed_documents(text_based_df.feat.values.tolist()[i:i+mini_batch_size])
    text_node_indexes.extend(text_based_df.node_index.values.tolist()[i:i+mini_batch_size])
    text_node_embeddings.extend(outputs)
dic_text_embeddings = dict(zip(text_node_indexes, text_node_embeddings))
# dic_text_embeddings

100%|██████████| 595/595 [04:06<00:00,  2.42it/s]


In [15]:
# Replace the embeddings of the nodes with the updated embeddings for text-based nodes
nodes_df["feat_emb"] = nodes_df.apply(lambda x: dic_text_embeddings[x["node_index"]] if x["node_index"] in dic_text_embeddings else x["feat_emb"], axis=1)
nodes_df.head(5)

Unnamed: 0,node_index,node_name,node_source,node_id,node_type,desc,desc_emb,feat,feat_emb
0,0,PHYHIP,NCBI,9796,gene/protein,PHYHIP belongs to gene/protein node. PHYHIP is...,"[0.012382511, 0.0384233, -0.16545817, -0.03335...",MELLSTPHSIEINNITCDSFRISWAMEDSDLERVTHYFIDLNKKEN...,"[0.04029838368296623, -0.018344514071941376, 0..."
1,1,GPANK1,NCBI,7918,gene/protein,GPANK1 belongs to gene/protein node. GPANK1 is...,"[0.007383303, 0.02430243, -0.15115355, -0.0132...",MSRPLLITFTPATDPSDLWKDGQQQPQPEKPESTLDGAAARAFYEA...,"[-0.049913737922906876, -0.04380067065358162, ..."
2,2,ZRSR2,NCBI,8233,gene/protein,ZRSR2 belongs to gene/protein node. ZRSR2 is z...,"[0.011058212, 0.0035900373, -0.13332076, -0.01...",MAAPEKMTFPEKPSHKKYRAALKKEKRKKRRQELARLRDSGLSQKE...,"[0.035360466688871384, -0.09613325446844101, 0..."
3,3,NRF1,NCBI,4899,gene/protein,NRF1 belongs to gene/protein node. NRF1 is nuc...,"[0.011984555, 0.03422219, -0.15100288, 0.01625...",MEEHGVTQTEHMATIEAHAVAQQVQQVHVATYTEHSMLSADEDSPS...,"[-0.052261918783187866, -0.022747397422790527,..."
4,4,PI4KA,NCBI,5297,gene/protein,PI4KA belongs to gene/protein node. PI4KA is p...,"[0.031896856, 0.028683903, -0.13113366, -0.022...",MAAAPARGGGGGGGGGGGCSGSGSSASRGFYFNTVLSLARSLAVQR...,"[0.005174526944756508, -0.049968406558036804, ..."


In [16]:
# # Modify the node dataframe
# nodes_df["node"] = nodes_df.apply(lambda x: f"{x.node_name}_({x.node_index})", axis=1)
nodes_df["node_id"] = nodes_df.apply(lambda x: f"{x.node_name}_({x.node_index})", axis=1)
nodes_df.drop(columns=['node_index', 'node_source'], inplace=True)
nodes_df.reset_index(inplace=True)
nodes_df.rename(columns={'index': 'node_index'}, inplace=True)
nodes_df.head(5)

Unnamed: 0,node_index,node_name,node_id,node_type,desc,desc_emb,feat,feat_emb
0,0,PHYHIP,PHYHIP_(0),gene/protein,PHYHIP belongs to gene/protein node. PHYHIP is...,"[0.012382511, 0.0384233, -0.16545817, -0.03335...",MELLSTPHSIEINNITCDSFRISWAMEDSDLERVTHYFIDLNKKEN...,"[0.04029838368296623, -0.018344514071941376, 0..."
1,1,GPANK1,GPANK1_(1),gene/protein,GPANK1 belongs to gene/protein node. GPANK1 is...,"[0.007383303, 0.02430243, -0.15115355, -0.0132...",MSRPLLITFTPATDPSDLWKDGQQQPQPEKPESTLDGAAARAFYEA...,"[-0.049913737922906876, -0.04380067065358162, ..."
2,2,ZRSR2,ZRSR2_(2),gene/protein,ZRSR2 belongs to gene/protein node. ZRSR2 is z...,"[0.011058212, 0.0035900373, -0.13332076, -0.01...",MAAPEKMTFPEKPSHKKYRAALKKEKRKKRRQELARLRDSGLSQKE...,"[0.035360466688871384, -0.09613325446844101, 0..."
3,3,NRF1,NRF1_(3),gene/protein,NRF1 belongs to gene/protein node. NRF1 is nuc...,"[0.011984555, 0.03422219, -0.15100288, 0.01625...",MEEHGVTQTEHMATIEAHAVAQQVQQVHVATYTEHSMLSADEDSPS...,"[-0.052261918783187866, -0.022747397422790527,..."
4,4,PI4KA,PI4KA_(4),gene/protein,PI4KA belongs to gene/protein node. PI4KA is p...,"[0.031896856, 0.028683903, -0.13113366, -0.022...",MAAAPARGGGGGGGGGGGCSGSGSSASRGFYFNTVLSLARSLAVQR...,"[0.005174526944756508, -0.049968406558036804, ..."


In [17]:

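# Store the node dataframe in two sets of files: enrichment and embedding
local_dir = '../../../aiagents4pharma/talk2knowledgegraphs/tests/files/biobridge_multimodal/nodes/'
os.makedirs(local_dir, exist_ok=True)
# Create the output sub-directories up front; to_parquet() does not create them
os.makedirs(os.path.join(local_dir, 'enrichment'), exist_ok=True)
os.makedirs(os.path.join(local_dir, 'embedding'), exist_ok=True)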
for nt in nodes_df.node_type.unique():
    nt_ = nt.replace('/', '_')
    # Enrichment
    # os.makedirs(os.path.join(local_dir, 'enrichment', nt_), exist_ok=True)
    nodes_df[nodes_df.node_type == nt][
        ["node_index", "node_id", "node_name", "node_type", "desc", "feat"]
        ].to_parquet(
        os.path.join(local_dir, 'enrichment', f"{nt_}.parquet.gzip"),
        compression='gzip',
        index=False
    )
    # Embedding
    # os.makedirs(os.path.join(local_dir, 'embedding', nt_), exist_ok=True)
    nodes_df[nodes_df.node_type == nt][
        ["node_id", "desc_emb", "feat_emb"]
        ].to_parquet(
        os.path.join(local_dir, 'embedding', f"{nt_}.parquet.gzip"),
        compression='gzip',
        index=False
    )
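The file layout used above (one `enrichment` and one `embedding` file per node type, with `/` replaced in the type name) can be summarized by a small path helper. This is only a sketch; `node_type_paths` is a hypothetical name.

```python
import os


def node_type_paths(local_dir, node_type):
    """Return the enrichment and embedding file paths for one node type."""
    # Node types such as "gene/protein" contain a path separator, so sanitize them.
    safe_name = node_type.replace("/", "_")
    return {
        kind: os.path.join(local_dir, kind, f"{safe_name}.parquet.gzip")
        for kind in ("enrichment", "embedding")
    }


paths = node_type_paths("nodes", "gene/protein")
file_names = {kind: os.path.basename(p) for kind, p in paths.items()}
```

Without the replacement, `gene/protein` would be interpreted as a nested directory rather than a file name.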

#### Edge Enrichment & Embedding

We also perform enrichment and embedding for the edges of BioBridge-PrimeKG.

This time, we only use textual enrichment, obtained by simply concatenating the head type, the relation, and the tail type of each edge.
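As an illustration of textual edge enrichment, the sketch below builds both a compact type string and a full sentence for a single edge. The field names follow the PrimeKG edge columns used in this notebook; the sentence template is just one possible choice, and `edge_type_string` and `describe_edge` are hypothetical helpers.

```python
def edge_type_string(edge):
    """Concatenate head type, relation, and tail type with a '|' separator."""
    return f"{edge['head_type']}|{edge['display_relation']}|{edge['tail_type']}"


def describe_edge(edge):
    """Render one edge as a short natural-language sentence."""
    return (
        f"{edge['head_name']} ({edge['head_type']}) has a direct relationship of "
        f"{edge['relation']}:{edge['display_relation']} with "
        f"{edge['tail_name']} ({edge['tail_type']})."
    )


edge = {
    "head_name": "PHYHIP", "head_type": "gene/protein",
    "tail_name": "KIF15", "tail_type": "gene/protein",
    "relation": "protein_protein", "display_relation": "ppi",
}
type_str = edge_type_string(edge)
sentence = describe_edge(edge)
```

The type string is shared by every edge of the same kind, while the sentence is specific to each edge and would therefore be far more expensive to embed for millions of edges.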


In [18]:
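# Keep only edges whose head and tail nodes both remain in nodes_df.
# nodes_df.node_index was re-assigned positionally in cell 16, so we match on
# node_id, which still encodes the original PrimeKG index as "name_(index)".
valid_node_ids = set(nodes_df.node_id)
head_keys = edges_df.head_name.astype(str) + "_(" + edges_df.head_index.astype(str) + ")"
tail_keys = edges_df.tail_name.astype(str) + "_(" + edges_df.tail_index.astype(str) + ")"
edges_df = edges_df[head_keys.isin(valid_node_ids) & tail_keys.isin(valid_node_ids)].copy()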

# Adding an additional column to the edges dataframe
edges_df["edge_type"] = edges_df.apply(lambda x: (x.head_type, x.display_relation, x.tail_type), axis=1)
edges_df["edge_type_str"] = edges_df.apply(lambda x: f"{x.head_type}|{x.display_relation}|{x.tail_type}", axis=1)
edges_df.head(5)
# As of now, we are enriching each edge using textual information 
# Perform textual enrichment over the edges by simply concatenating the head and tail nodes with the relation followed by the enriched node information
# text_enriched_edges = edges_df.apply(lambda x: f"{x['head_name']} ({x['head_type']}) has a direct relationship of {x['relation']}:{x['display_relation']} with {x['tail_name']} ({x['tail_type']}).", axis=1).tolist()
# edges_df['feat'] = text_enriched_edges
# edges_df.head(5)


Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation,edge_type,edge_type_str
0,0,PHYHIP,NCBI,9796,gene/protein,8889,KIF15,NCBI,56992,gene/protein,ppi,protein_protein,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein
1,1,GPANK1,NCBI,7918,gene/protein,2798,PNMA1,NCBI,9240,gene/protein,ppi,protein_protein,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein
2,2,ZRSR2,NCBI,8233,gene/protein,5646,TTC33,NCBI,23548,gene/protein,ppi,protein_protein,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein
3,3,NRF1,NCBI,4899,gene/protein,11592,MAN1B1,NCBI,11253,gene/protein,ppi,protein_protein,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein
4,4,PI4KA,NCBI,5297,gene/protein,2122,RGS20,NCBI,8601,gene/protein,ppi,protein_protein,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein


After that, we embed the edges with the same Ollama model that was used for the text-based nodes.

In [19]:
# # Perform embedding using NVIDIA embeddings
# emb_model = NVIDIAEmbeddings(
#     model="nvidia/llama-3.2-nv-embedqa-1b-v2",
#     base_url="http://localhost:8000/v1"
# )
# 
# # Since the records of edges has large amount of data, we will split them into mini-batches
# mini_batch_size = 100
# edge_embeddings = []
# for i in tqdm(range(0, edges_df.shape[0], mini_batch_size)):
#     outputs = emb_model.embed_documents(edges_df.enriched_edge.values.tolist()[i:i+mini_batch_size])
#     edge_embeddings.extend(outputs)

# # Add them as features to the dataframe
# edges_df['edge_attr'] = edge_embeddings

In [20]:
# Using nomic-ai/nomic-embed-text-v1.5 model via Ollama
emb_model = EmbeddingWithOllama(model_name='nomic-embed-text')

# Populate the edge embeddings dictionary
edge_embeddings_keys = edges_df.edge_type.unique().tolist()
edge_embeddings = emb_model.embed_documents([str(e) for e in edge_embeddings_keys])
edge_embeddings_dict = dict(zip(edge_embeddings_keys, edge_embeddings))
edges_df['edge_emb'] = edges_df.apply(lambda x: edge_embeddings_dict[x.edge_type], axis=1)
edges_df.head(5)

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation,edge_type,edge_type_str,edge_emb
0,0,PHYHIP,NCBI,9796,gene/protein,8889,KIF15,NCBI,56992,gene/protein,ppi,protein_protein,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
1,1,GPANK1,NCBI,7918,gene/protein,2798,PNMA1,NCBI,9240,gene/protein,ppi,protein_protein,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
2,2,ZRSR2,NCBI,8233,gene/protein,5646,TTC33,NCBI,23548,gene/protein,ppi,protein_protein,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
3,3,NRF1,NCBI,4899,gene/protein,11592,MAN1B1,NCBI,11253,gene/protein,ppi,protein_protein,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
4,4,PI4KA,NCBI,5297,gene/protein,2122,RGS20,NCBI,8601,gene/protein,ppi,protein_protein,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
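Because every edge of the same type shares one embedding, it is enough to embed each distinct edge type once and then look the vector up per edge. A minimal sketch of this idea follows; `embed_unique` and `counting_embed` are hypothetical, and `counting_embed` merely stands in for a real embedding model.

```python
def embed_unique(keys, embed_fn):
    """Embed each distinct key once, then expand back to one vector per input key."""
    unique_keys = list(dict.fromkeys(keys))  # de-duplicate, keeping first-seen order
    vectors = embed_fn([str(k) for k in unique_keys])
    table = dict(zip(unique_keys, vectors))
    return table, [table[k] for k in keys]


embedded_texts = []


def counting_embed(batch):
    """Stand-in model that records what it was asked to embed."""
    embedded_texts.extend(batch)
    return [[float(len(t))] for t in batch]


edge_types = [
    ("gene/protein", "ppi", "gene/protein"),
    ("drug", "target", "gene/protein"),
    ("gene/protein", "ppi", "gene/protein"),
]
table, per_edge = embed_unique(edge_types, counting_embed)
```

In this toy run the model sees two texts for three edges; on the full graph the saving is the ratio between the number of edges and the number of distinct edge types.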


In [21]:
# Drop and rename several columns
edges_df.drop(columns=['head_source', 'head_id', 'head_type', 'tail_source', 'tail_id', 'tail_type', 'relation'], inplace=True)
edges_df.rename(columns={'head_index': 'head_id', 'tail_index': 'tail_id'}, inplace=True)

# Check dataframe of edges
edges_df.head(5)

Unnamed: 0,head_id,head_name,tail_id,tail_name,display_relation,edge_type,edge_type_str,edge_emb
0,0,PHYHIP,8889,KIF15,ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
1,1,GPANK1,2798,PNMA1,ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
2,2,ZRSR2,5646,TTC33,ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
3,3,NRF1,11592,MAN1B1,ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
4,4,PI4KA,2122,RGS20,ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."


In [22]:
# Make an additional edge index column as identifier
edges_df.reset_index(inplace=True)
edges_df.rename(columns={'index': 'triplet_index'}, inplace=True)
edges_df.head(5)

Unnamed: 0,triplet_index,head_id,head_name,tail_id,tail_name,display_relation,edge_type,edge_type_str,edge_emb
0,0,0,PHYHIP,8889,KIF15,ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
1,1,1,GPANK1,2798,PNMA1,ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
2,2,2,ZRSR2,5646,TTC33,ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
3,3,3,NRF1,11592,MAN1B1,ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
4,4,4,PI4KA,2122,RGS20,ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."


In [23]:
# Modify the edge dataframe
edges_df["head_id"] = edges_df.apply(lambda x: f"{x.head_name}_({x.head_id})", axis=1)
edges_df["tail_id"] = edges_df.apply(lambda x: f"{x.tail_name}_({x.tail_id})", axis=1)
edges_df.drop(columns=['head_name', 'tail_name'], inplace=True)
edges_df.reset_index(drop=True, inplace=True)
edges_df.head(5)

Unnamed: 0,triplet_index,head_id,tail_id,display_relation,edge_type,edge_type_str,edge_emb
0,0,PHYHIP_(0),KIF15_(8889),ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
1,1,GPANK1_(1),PNMA1_(2798),ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
2,2,ZRSR2_(2),TTC33_(5646),ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
3,3,NRF1_(3),MAN1B1_(11592),ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."
4,4,PI4KA_(4),RGS20_(2122),ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281..."


In [24]:
# Add index columns for head and tail nodes
# Map head_id to head_index
edges_df = edges_df.merge(
    nodes_df[["node_index", "node_id"]],
    left_on="head_id",
    right_on="node_id",
    how="left"
).rename(columns={"node_index": "head_index"}).drop(columns=["node_id"])

# Merge to get tail_index
edges_df = edges_df.merge(
    nodes_df[["node_index", "node_id"]],
    left_on="tail_id",
    right_on="node_id",
    how="left"
).rename(columns={"node_index": "tail_index"}).drop(columns=["node_id"])

# Check the final edges dataframe
edges_df.head(5)


Unnamed: 0,triplet_index,head_id,tail_id,display_relation,edge_type,edge_type_str,edge_emb,head_index,tail_index
0,0,PHYHIP_(0),KIF15_(8889),ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281...",0,8816
1,1,GPANK1_(1),PNMA1_(2798),ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281...",1,2787
2,2,ZRSR2_(2),TTC33_(5646),ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281...",2,5610
3,3,NRF1_(3),MAN1B1_(11592),ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281...",3,11467
4,4,PI4KA_(4),RGS20_(2122),ppi,"(gene/protein, ppi, gene/protein)",gene/protein|ppi|gene/protein,"[0.024646243, 0.04494511, -0.13975705, -0.0281...",4,2117
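A left merge leaves a missing value wherever an edge endpoint has no matching `node_id`, so it can be worth checking for unmatched endpoints before saving. The following plain-Python sketch shows the same mapping with an explicit list of misses; `map_to_index` and the toy identifiers are hypothetical.

```python
def map_to_index(ids, index_by_id):
    """Translate node ids to positional indices, collecting ids with no match."""
    mapped, missing = [], []
    for node_id in ids:
        if node_id in index_by_id:
            mapped.append(index_by_id[node_id])
        else:
            mapped.append(None)  # same role as NaN after a left merge
            missing.append(node_id)
    return mapped, missing


# Toy node table: node_id -> new positional node_index.
index_by_id = {"PHYHIP_(0)": 0, "KIF15_(8889)": 8816}
tail_ids = ["KIF15_(8889)", "UNKNOWN_(999999)"]
tail_index, unmatched = map_to_index(tail_ids, index_by_id)
```

An empty `unmatched` list indicates that every edge endpoint was found in the node table.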


In [36]:

# Store the edge dataframe (enrichment part); edge embeddings are stored separately
local_dir = '../../../aiagents4pharma/talk2knowledgegraphs/tests/files/biobridge_multimodal/edges/'
os.makedirs(local_dir, exist_ok=True)
# Enrichment
os.makedirs(os.path.join(local_dir, 'enrichment'), exist_ok=True)
edges_df[
    ["triplet_index", "head_id", "tail_id", "display_relation", "edge_type", "edge_type_str", "head_index", "tail_index"]
    ].to_parquet(
    os.path.join(local_dir, 'enrichment', "edges.parquet.gzip"),
    compression='gzip',
    index=False
)


In [37]:
# Store edge embeddings into a separate file
edge_embeddings_df = pd.DataFrame(
    edge_embeddings_dict.items(),
    columns=['edge_type', 'edge_emb']
)
edge_embeddings_df['edge_type_str'] = edge_embeddings_df.apply(lambda x: f"{x.edge_type[0]}|{x.edge_type[1]}|{x.edge_type[2]}", axis=1)
edge_embeddings_df = edge_embeddings_df[['edge_type', 'edge_type_str', 'edge_emb']]
edge_embeddings_df

# Embedding
os.makedirs(os.path.join(local_dir, 'embedding'), exist_ok=True)
edge_embeddings_df.to_parquet(
    os.path.join(local_dir, 'embedding', "edges.parquet.gzip"),
    compression='gzip',
    index=False
)

In [None]:
# # Embedding
# os.makedirs(os.path.join(local_dir, 'embedding'), exist_ok=True)
# edges_df[
#     ["head_id", "tail_id", "edge_type", "edge_emb"]
#     ].to_parquet(
#     os.path.join(local_dir, 'embedding', "edges.parquet.gzip"),
#     compression='gzip',
#     index=False
# )

In [None]:
# # Store the node dataframes into a parquet file
# local_dir = '../../../aiagents4pharma/talk2knowledgegraphs/tests/files/'
# edges_df.to_parquet(os.path.join(local_dir, 'biobridge_mm_edges.parquet.gzip'), compression='gzip', index=False)


In [None]:
# # Load the data from the parquet files
# local_dir = '../../../aiagents4pharma/talk2knowledgegraphs/tests/files/'
# nodes_df_ = pd.read_parquet(os.path.join(local_dir, 'biobridge_nodes.parquet.gzip'))
# edges_df_ = pd.read_parquet(os.path.join(local_dir, 'biobridge_edges.parquet.gzip'))

Optionally, the dataframes can be converted to a networkx `DiGraph` object (the corresponding cells are commented out here).

In [None]:
# # Modify the node dataframe
# nodes_df["node"] = nodes_df.apply(lambda x: f"{x.node_name}_({x.node_index})", axis=1)
# nodes_df["node_id"] = nodes_df.apply(lambda x: f"{x.node_name}_({x.node_index})", axis=1)
# nodes_df.drop(columns=['node_index', 'node_source'], inplace=True)
# nodes_df.set_index('node', inplace=True)
# nodes_df.head(5)

In [None]:
# # Modify the edge dataframe
# edges_df["head_id"] = edges_df.apply(lambda x: f"{x.head_name}_({x.head_id})", axis=1)
# edges_df["tail_id"] = edges_df.apply(lambda x: f"{x.tail_name}_({x.tail_id})", axis=1)
# edges_df.reset_index(drop=True, inplace=True)
# edges_df.head(5)

In [None]:
# # # Convert dataframes to knowledge graph as networkx object
# kg = nx.DiGraph()
# for i, row in nodes_df.iterrows():
#     kg.add_node(row['node_id'], **row.to_dict())
# for i, row in edges_df.iterrows():
#     kg.add_edge(row['head_id'], row['tail_id'], key=i, **row.to_dict())


In [None]:
# # Save graph object
# local_dir = '../../../aiagents4pharma/talk2knowledgegraphs/tests/files/'
# with open(os.path.join(local_dir, 'biobridge_multimodal_nx_graph.pkl'), 'wb') as f:
#     pickle.dump(kg, f)

# # # Load graph object
# with open(os.path.join(local_dir, 'biobridge_multimodal_nx_graph.pkl'), 'rb') as f:
#     kg_2 = pickle.load(f)


In [None]:
# print ("#Nodes", kg.number_of_nodes())
# print ("#Edges", kg.number_of_edges())

The networkx graph can then be converted to a PyG `Data` object.

In [None]:
# # Convert networkx graph to PyG data object
# pyg_graph = from_networkx(kg)
# pyg_graph.num_nodes = kg.number_of_nodes()
# pyg_graph.num_edges = kg.number_of_edges()

# # Save graph object
# with open(os.path.join(local_dir, 'biobridge_multimodal_pyg_graph.pkl'), 'wb') as f:
#     pickle.dump(pyg_graph, f)

# # Load graph object
# # with open(os.path.join(local_dir, 'biobridge_ibd_pyg_graph.pkl'), 'rb') as f:
# #     pyg_graph = pickle.load(f)

Lastly, we can also prepare a textualized graph of nodes and edges, for instance for a RAG application.
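One common way to textualize a graph is to render the node and edge tables as CSV text that can be placed in a prompt. The sketch below assumes the column names `node_id`, `node_attr`, `head_id`, `edge_type`, and `tail_id`; the `textualize` helper and the CSV layout are illustrative choices rather than the format required by any particular application.

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Serialize a list of dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def textualize(nodes, edges):
    """Render nodes and edges as two CSV tables separated by a blank line."""
    node_text = rows_to_csv(nodes, ["node_id", "node_attr"])
    edge_text = rows_to_csv(edges, ["head_id", "edge_type", "tail_id"])
    return node_text + "\n" + edge_text


nodes = [{"node_id": "PHYHIP_(0)", "node_attr": "PHYHIP belongs to gene/protein node."}]
edges = [{
    "head_id": "PHYHIP_(0)",
    "edge_type": "gene/protein|ppi|gene/protein",
    "tail_id": "KIF15_(8889)",
}]
graph_text = textualize(nodes, edges)
```

Using the `csv` module rather than manual string joins ensures that descriptions containing commas or quotes are escaped correctly.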


In [None]:
# # Prepare nodes
# nodes_df = pd.DataFrame({
#     'node_id': list(pyg_graph.node_id),
#     'node_attr': list(pyg_graph.desc),
# })
# nodes_df.head(5)

In [None]:
# # Prepare edges
# edges_df = pd.DataFrame({
#     'head_id': list(pyg_graph.head_id),
#     'edge_type': list(pyg_graph.edge_type),
#     'tail_id': list(pyg_graph.tail_id),
# })
# edges_df.head(5)

In [None]:
# with open(os.path.join(local_dir, 'biobridge_multimodal_text_graph.pkl'), "wb") as f:
#     pickle.dump({"nodes": nodes_df, "edges": edges_df}, f)

In [None]:
# Check the number of nodes and edges
print(f"Number of nodes: {len(nodes_df)}")
print(f"Number of edges: {len(edges_df)}")