# PrimeKG Loader

In this tutorial, we explain how to load the PrimeKG dataframes, which contain the entities (nodes) and relations (edges) of the knowledge graph.

Background information about PrimeKG can be found in the following repositories:
- https://github.com/mims-harvard/PrimeKG
- https://github.com/mims-harvard/TDC/

Note that we use the PrimeKG release hosted on Harvard Dataverse, which is publicly available at the following link:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IXA7BM

At the time of writing, the latest version of PrimeKG (`kg.csv`) is `2.1`.

First of all, we need to import necessary libraries as follows:

In [56]:
# Import necessary libraries
import sys
import torch
sys.path.append('../../..')
from aiagents4pharma.talk2knowledgegraphs.datasets.primekg import PrimeKG
from aiagents4pharma.talk2knowledgegraphs.utils.enrichments.uniprot_proteins import EnrichmentWithUniProt
from aiagents4pharma.talk2knowledgegraphs.utils import uniprot_utils
from aiagents4pharma.talk2knowledgegraphs.utils.embeddings.huggingface import EmbeddingWithHuggingFace

### Load PrimeKG

The `PrimeKG` dataset class downloads the data from the Harvard Dataverse server if it is not available locally.

Otherwise, the data is loaded from the local directory given by `local_dir`.

In [2]:
# Define primekg data by providing a local directory where the data is stored
primekg_data = PrimeKG(local_dir="../../../../data/primekg/")
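
The load-or-download behaviour described above can be pictured with a small sketch. This is not the actual implementation of the `PrimeKG` class: `load_or_fetch` and the `fetch` callback are hypothetical names used only for illustration, and the fetch step writes a tiny placeholder file instead of contacting Harvard Dataverse.

```python
import os
import tempfile


def load_or_fetch(local_path, fetch):
    """Return the file contents, calling `fetch` only if the file is missing.

    Hypothetical helper: it mimics a cache-first loader, not the PrimeKG class.
    """
    if not os.path.exists(local_path):
        # File not cached yet: create the directory and let `fetch` write it
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        fetch(local_path)
    with open(local_path, "r", encoding="utf-8") as handle:
        return handle.read()


calls = []


def fake_fetch(path):
    """Stand-in for a download: records the call and writes a placeholder."""
    calls.append(path)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("node_index\tnode_name\n0\tPHYHIP\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "primekg", "toy_nodes.tsv")
    first = load_or_fetch(target, fake_fetch)   # triggers the fake download
    second = load_or_fetch(target, fake_fetch)  # served from the local copy

print(len(calls))
```

The second call finds the file on disk and skips the fetch, which matches the "already exists" messages printed when the real data is loaded from the local directory.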

To load the dataframes of nodes and edges from PrimeKG, we call `load_data()` and then retrieve them with `get_nodes()` and `get_edges()`.

In [3]:
# Invoke a method to load the data
primekg_data.load_data()

# Get primekg_nodes and primekg_edges
primekg_nodes = primekg_data.get_nodes()
primekg_edges = primekg_data.get_edges()

Loading nodes of PrimeKG dataset ...
../../../../data/primekg/primekg_nodes.tsv.gz already exists. Loading the data from the local directory.
Loading edges of PrimeKG dataset ...
../../../../data/primekg/primekg_edges.tsv.gz already exists. Loading the data from the local directory.


### Check PrimeKG Dataframes

As mentioned before, `primekg_nodes` and `primekg_edges` are the dataframes of nodes and edges, respectively.

We can further analyze the dataframes to extract the information we need.

For instance, we can construct a graph from the nodes and edges dataframes using the networkx library.

#### PrimeKG Nodes

`primekg_nodes` is a dataframe of nodes, which has the following columns:
- `node_index`: the index of the node
- `node_name`: the name of the node
- `node_source`: the source database of the node
- `node_id`: the id of the node in its source database
- `node_type`: the type of the node

We can check a sample of the primekg nodes to see the list of nodes in the PrimeKG dataset as follows.

In [4]:
# Check a sample of the primekg nodes
primekg_nodes.head()

Unnamed: 0,node_index,node_name,node_source,node_id,node_type
0,0,PHYHIP,NCBI,9796,gene/protein
1,1,GPANK1,NCBI,7918,gene/protein
2,2,ZRSR2,NCBI,8233,gene/protein
3,3,NRF1,NCBI,4899,gene/protein
4,4,PI4KA,NCBI,5297,gene/protein


The current version of PrimeKG has about 130K nodes in total, as the following cell shows.

In [5]:
# Check dimensions of the primekg nodes
primekg_nodes.shape

(129375, 5)

We can break down the primekg nodes by type as follows.

In [6]:
# Show node types and their counts
primekg_nodes['node_type'].value_counts()

node_type
biological_process    28642
gene/protein          27671
disease               17080
effect/phenotype      15311
anatomy               14035
molecular_function    11169
drug                   7957
cellular_component     4176
pathway                2516
exposure                818
Name: count, dtype: int64

PrimeKG was built using various sources, as we can observe from their unique node sources as follows.

In [7]:
# Show source of the primekg nodes
primekg_nodes['node_source'].value_counts()

node_source
GO               43987
NCBI             27671
MONDO            15813
HPO              15311
UBERON           14035
DrugBank          7957
REACTOME          2516
MONDO_grouped     1267
CTD                818
Name: count, dtype: int64

In [8]:
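# Filter the NCBI nodes (not displayed, since only the last expression of a cell is shown)
primekg_nodes[primekg_nodes['node_source'] == 'NCBI'].head(10)
# Check a sample of the primekg edges
primekg_edges.head()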

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation
0,0,PHYHIP,NCBI,9796,gene/protein,8889,KIF15,NCBI,56992,gene/protein,ppi,protein_protein
1,1,GPANK1,NCBI,7918,gene/protein,2798,PNMA1,NCBI,9240,gene/protein,ppi,protein_protein
2,2,ZRSR2,NCBI,8233,gene/protein,5646,TTC33,NCBI,23548,gene/protein,ppi,protein_protein
3,3,NRF1,NCBI,4899,gene/protein,11592,MAN1B1,NCBI,11253,gene/protein,ppi,protein_protein
4,4,PI4KA,NCBI,5297,gene/protein,2122,RGS20,NCBI,8601,gene/protein,ppi,protein_protein


In [9]:
import networkx as nx

kg = nx.DiGraph()
# Build a directed graph from the edge list.
# `edge_key` only applies to multigraphs, so the relation is stored
# as an edge attribute instead.
G = nx.from_pandas_edgelist(
    primekg_edges,
    source="head_name",
    target="tail_name",
    edge_attr="relation",
    create_using=nx.DiGraph(),
)
# Merge G into kg (kg is empty at this point, so it becomes a copy of G)
kg = nx.compose(G, kg)
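
One consequence of using `nx.DiGraph` here is that it keeps at most one edge per ordered (head, tail) pair, so if two nodes are linked by more than one relation, only one of them survives. The sketch below illustrates the mechanism with plain dictionaries on hypothetical rows; it does not use networkx, and the node and relation names are invented.

```python
# Toy edge list (hypothetical rows): (head_name, tail_name, relation)
toy_edges = [
    ("DrugA", "DrugB", "drug_drug"),
    ("DrugA", "GeneX", "drug_protein"),
    ("DrugA", "GeneX", "drug_protein_alt"),  # second relation, same node pair
]

# Simple directed graph: one slot per (head, tail) pair, later rows overwrite
simple_graph = {}
for head, tail, relation in toy_edges:
    simple_graph[(head, tail)] = relation

# Multigraph-style storage: the relation is part of the key, so nothing is lost
multi_graph = set()
for head, tail, relation in toy_edges:
    multi_graph.add((head, tail, relation))

print(len(simple_graph), len(multi_graph))
```

In networkx terms this corresponds to `nx.DiGraph` versus `nx.MultiDiGraph`. Whether PrimeKG actually contains node pairs with several relations is something to check on the real data.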

In [10]:
# Collect the attributes of every head node from the edge list
df_head_nodes = primekg_edges[['head_name', 'head_source', 'head_id', 'head_type']]
df_head_nodes = df_head_nodes.rename(columns={
    'head_name': 'node_name',
    'head_source': 'node_source',
    'head_id': 'node_id',
    'head_type': 'node_type'
})
df_head_nodes = df_head_nodes.set_index('node_name')
df_head_nodes.head(10)
# Print the first node and its attributes as a sanity check
for n, d in df_head_nodes.iterrows():
    print(n, dict(d))
    break
# Attach the attributes to the nodes of the graph
G.add_nodes_from((n, dict(d)) for n, d in df_head_nodes.iterrows())
# df_tail_nodes = primekg_edges[['tail_name', 'tail_source', 'tail_id', 'tail_type']]
# import pandas as pd
# df_nodes = pd.concat([df_head_nodes, df_tail_nodes], ignore_index=True)
# df_nodes = df_nodes.drop_duplicates()

PHYHIP {'node_source': 'NCBI', 'node_id': '9796', 'node_type': 'gene/protein'}
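
Only head nodes receive attributes in the cell above, so a node that appeared solely as a tail would end up without a `node_type`. The commented-out lines hint at also collecting tail nodes. A plain-Python sketch of that idea follows; `collect_nodes` is a hypothetical helper, the first two rows reuse values from the edge sample shown earlier, and the third row is invented.

```python
def collect_nodes(edge_rows):
    """Build {node_name: attributes} from both ends of every edge."""
    nodes = {}
    for row in edge_rows:
        for side in ("head", "tail"):
            name = row[f"{side}_name"]
            # setdefault keeps the first attributes seen for a given name
            nodes.setdefault(name, {
                "node_source": row[f"{side}_source"],
                "node_id": row[f"{side}_id"],
                "node_type": row[f"{side}_type"],
            })
    return nodes


def make_row(head, head_id, tail, tail_id):
    """Toy edge row between two NCBI gene/protein nodes."""
    return {
        "head_name": head, "head_source": "NCBI", "head_id": head_id,
        "head_type": "gene/protein",
        "tail_name": tail, "tail_source": "NCBI", "tail_id": tail_id,
        "tail_type": "gene/protein",
    }


toy_edges = [
    make_row("PHYHIP", "9796", "KIF15", "56992"),
    make_row("GPANK1", "7918", "PNMA1", "9240"),
    make_row("KIF15", "56992", "PHYHIP", "9796"),  # invented reverse edge
]

toy_nodes = collect_nodes(toy_edges)
print(sorted(toy_nodes))
```

Each name is stored once even though it can occur many times across the head and tail columns.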


In [11]:
# Merge again so that kg picks up the node attributes added to G
kg = nx.compose(G, kg)


In [48]:
from tqdm import tqdm

# Inspect the graph: node count, one example node, and a membership test
print(len(kg.nodes))
print(kg.nodes['F2'])
print(kg.has_node('IL17A'))

# Extract all gene IDs from the graph
gene_ids = set()
for n in tqdm(kg.nodes):
    # Skip every node that is not an NCBI gene/protein node
    if kg.nodes[n].get('node_type') != 'gene/protein' or kg.nodes[n].get('node_source') != 'NCBI':
        continue
    gene_ids.add(kg.nodes[n].get('node_id'))
len(gene_ids)

129262
{'node_source': 'NCBI', 'node_id': '2147', 'node_type': 'gene/protein'}
True


  0%|          | 0/129262 [00:00<?, ?it/s]

100%|██████████| 129262/129262 [00:00<00:00, 1149510.07it/s]


27609
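
The graph reports 129262 nodes, while `primekg_nodes` has 129375 rows. A plausible explanation, offered here as an assumption rather than a verified fact, is that a few node names are shared by different entities, which a graph keyed by `head_name`/`tail_name` merges into one node. The rows below are hypothetical and only show the mechanism, together with keying on source and id instead.

```python
# Hypothetical node rows: two different entities share one display name
toy_nodes = [
    {"node_name": "entity A", "node_source": "GO", "node_id": "1"},
    {"node_name": "entity A", "node_source": "REACTOME", "node_id": "R-1"},
    {"node_name": "entity B", "node_source": "NCBI", "node_id": "2"},
]

# Keyed by name: the second "entity A" overwrites the first
by_name = {row["node_name"]: row for row in toy_nodes}

# Keyed by (source, id): every row keeps its own entry
by_source_and_id = {
    (row["node_source"], row["node_id"]): row for row in toy_nodes
}

print(len(toy_nodes), len(by_name), len(by_source_and_id))
```

If name collisions matter for a downstream task, a combined key such as source plus id would be one way to keep the entities apart.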

In [49]:
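# Find the node whose NCBI gene ID is '28582'
for n in tqdm(kg.nodes):
    # Skip every node that is not an NCBI gene/protein node
    if kg.nodes[n].get('node_type') != 'gene/protein' or kg.nodes[n].get('node_source') != 'NCBI':
        continue
    if kg.nodes[n].get('node_id') == '28582':
        print(n, kg.nodes[n])
        break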

 81%|████████▏ | 105142/129262 [00:00<00:00, 1468914.50it/s]

TRBV11-1 {'node_source': 'NCBI', 'node_id': '28582', 'node_type': 'gene/protein'}





In [50]:
# Submit a job to perform ID mapping
inputs = list(gene_ids)[:10]
print (inputs)
job_id = uniprot_utils.submit_id_mapping(
    from_db="GeneID", to_db="UniProtKB", ids=inputs
)
print (f"Job ID: {job_id}")
# Check the status of the job
status = uniprot_utils.check_id_mapping_results_ready(job_id)
print (f"Job status: {status}")

['2116', '9462', '5510', '54474', '119437', '112487', '283755', '390144', '132949', '28582']
Job ID: rccQKMWKou
Job status: True
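
The cell above maps only the first ten IDs, and because `gene_ids` is a set, which ten are picked can differ between runs. To cover all IDs, one option is to sort them and submit them in fixed-size batches. The sketch below shows only the batching step; `iter_batches` is a hypothetical helper, the batch size is arbitrary, and no request is sent to UniProt.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Sorting numerically gives a reproducible order; set iteration order does not
toy_gene_ids = sorted({"2116", "9462", "5510", "54474", "119437"}, key=int)
batches = list(iter_batches(toy_gene_ids, batch_size=2))
print(batches)
```

Each batch could then be passed as `ids` to `uniprot_utils.submit_id_mapping`, as in the cell above.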


In [51]:
# Check and get the ID mapping results
if uniprot_utils.check_id_mapping_results_ready(job_id):
    link = uniprot_utils.get_id_mapping_results_link(job_id)
    mapping_results = uniprot_utils.get_id_mapping_results_stream(link)
    print(mapping_results)

{'results': [{'from': '2116', 'to': {'entryType': 'UniProtKB reviewed (Swiss-Prot)', 'primaryAccession': 'O00321', 'secondaryAccessions': ['A6NFN5', 'B3KUL0', 'B9EIN1', 'Q9UEA0'], 'uniProtkbId': 'ETV2_HUMAN', 'entryAudit': {'firstPublicDate': '1998-07-15', 'lastAnnotationUpdateDate': '2025-04-09', 'lastSequenceUpdateDate': '2012-10-31', 'entryVersion': 170, 'sequenceVersion': 2}, 'annotationScore': 5.0, 'organism': {'scientificName': 'Homo sapiens', 'commonName': 'Human', 'taxonId': 9606, 'lineage': ['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']}, 'proteinExistence': '2: Evidence at transcript level', 'proteinDescription': {'recommendedName': {'fullName': {'value': 'ETS translocation variant 2'}}, 'alternativeNames': [{'fullName': {'value': 'Ets-related protein 71'}}]}, 'genes': [{'geneName': {'value': 'ETV2'}, 'synonyms': [{'value': 'ER71'}, {'value

In [52]:
dic_gene_id_to_descp_seq = {}
for result in mapping_results['results']:
    entry = result['to']
    # Keep only reviewed (Swiss-Prot) entries
    if entry['entryType'] != 'UniProtKB reviewed (Swiss-Prot)':
        continue
    # Reset for every entry so that a missing FUNCTION comment
    # cannot reuse the description of the previous entry
    description = ''
    for comment in entry.get('comments', []):
        if comment['commentType'] == 'FUNCTION':
            for text in comment['texts']:
                description = text['value']
    dic_gene_id_to_descp_seq[result['from']] = {
        'description': description,
        'sequence': entry['sequence']['value'],
    }

# Print the description and sequence of each mapped gene
for gene_id, descp_seq in dic_gene_id_to_descp_seq.items():
    print(f"Gene ID: {gene_id}")
    print(f"Description: {descp_seq['description']}")
    print(f"Sequence: {descp_seq['sequence']}")
    print()


Gene ID: 2116
Description: Binds to DNA sequences containing the consensus pentanucleotide 5'-CGGA[AT]-3'
Sequence: MDLWNWDEASPQEVPPGNKLAGLEGAKLGFCFPDLALQGDTPTATAETCWKGTSSSLASFPQLDWGSALLHPEVPWGAEPDSQALPWSGDWTDMACTAWDSWSGASQTLGPAPLGPGPIPAAGSEGAAGQNCVPVAGEATSWSRAQAAGSNTSWDCSVGPDGDTYWGSGLGGEPRTDCTISWGGPAGPDCTTSWNPGLHAGGTTSLKRYQSSALTVCSEPSPQSDRASLARCPKTNHRGPIQLWQFLLELLHDGARSSCIRWTGNSREFQLCDPKEVARLWGERKRKPGMNYEKLSRGLRYYYRRDIVRKSGGRKYTYRFGGRVPSLAYPDCAGGGRGAETQ

Gene ID: 9462
Description: Inhibitory regulator of the Ras-cyclic AMP pathway
Sequence: MQTPEVPAERSPRRRSISGTSTSEKPNSMDTANTSPFKVPGFFSKRLKGSIKRTKSQSKLDRNTSFRLPSLRSTDDRSRGLPKLKESRSHESLLSPCSTVECLDLGRGEPVSVKPLHSSILGQDFCFEVTYLSGSKCFSCNSASERDKWMENLRRTVQPNKDNCRRAENVLRLWIIEAKDLAPKKKYFCELCLDDTLFARTTSKTKADNIFWGEHFEFFSLPPLHSITVHIYKDVEKKKKKDKNNYVGLVNIPTASVTGRQFVEKWYPVSTPTPNKGKTGGPSIRIKSRFQTITILPMEQYKEFAEFVTSNYTMLCSVLEPVISVRNKEELACALVHILQSTGRAKDFLTDLVMSEVDRCGEHDVLIFRENTIATKSIEEYLKLVGQQYLHDALGEFIKALYESDENCEVDPSKCSSSELIDHQSNLKMCCELAFCKIINSYCVFPRELKEV
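
The extraction step relies on several nested keys (`comments`, `commentType`, `texts`, `sequence`). Below is a defensive variant written as a function, so it can be tried on small inputs. `extract_function_and_sequence` is a hypothetical name, and the two records are invented and heavily shortened; they are not real UniProt responses.

```python
def extract_function_and_sequence(entry):
    """Return (description, sequence) from a UniProt-style entry dict.

    The keys mirror those used in the cell above; missing parts yield "".
    """
    description = ""
    for comment in entry.get("comments", []):
        if comment.get("commentType") == "FUNCTION":
            # As in the notebook cell, the last text of the comment wins
            for text in comment.get("texts", []):
                description = text.get("value", "")
    sequence = entry.get("sequence", {}).get("value", "")
    return description, sequence


with_function = {
    "comments": [
        {"commentType": "SUBUNIT", "texts": [{"value": "Homodimer"}]},
        {"commentType": "FUNCTION", "texts": [{"value": "Toy description"}]},
    ],
    "sequence": {"value": "MDLW"},
}
without_comments = {"sequence": {"value": "MQTP"}}

result_with = extract_function_and_sequence(with_function)
result_without = extract_function_and_sequence(without_comments)
print(result_with)
print(result_without)
```

An entry without a `FUNCTION` comment yields an empty description instead of raising an error or inheriting text from another entry.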

In [None]:
from tqdm import tqdm

# Attach the description and sequence to the matching gene/protein nodes
for node in tqdm(kg.nodes):
    if kg.nodes[node].get('node_type') != 'gene/protein':
        continue
    gene_id = kg.nodes[node].get('node_id')
    # Skip genes without a reviewed UniProt entry in the mapping results
    if gene_id not in dic_gene_id_to_descp_seq:
        continue
    description = dic_gene_id_to_descp_seq[gene_id]['description']
    sequence = dic_gene_id_to_descp_seq[gene_id]['sequence']
    print(f"Node: {node}, Gene ID: {gene_id}, Description: {description}, Sequence: {sequence}")
    # Update the attributes on kg itself, the graph being enriched
    kg.nodes[node].update({'description': description, 'sequence': sequence})


100%|██████████| 129262/129262 [00:00<00:00, 1785402.08it/s]

Node: KRT20, Gene ID: 54474, Description: Plays a significant role in maintaining keratin filament organization in intestinal epithelia. When phosphorylated, plays a role in the secretion of mucin in the small intestine (By similarity), Sequence: MDFSRRSFHRSLSSSLQAPVVSTVGMQRLGTTPSVYGGAGGRGIRISNSRHTVNYGSDLTGGGDLFVGNEKMAMQNLNDRLASYLEKVRTLEQSNSKLEVQIKQWYETNAPRAGRDYSAYYRQIEELRSQIKDAQLQNARCVLQIDNAKLAAEDFRLKYETERGIRLTVEADLQGLNKVFDDLTLHKTDLEIQIEELNKDLALLKKEHQEEVDGLHKHLGNTVNVEVDAAPGLNLGVIMNEMRQKYEVMAQKNLQEAKEQFERQTAVLQQQVTVNTEELKGTEVQLTELRRTSQSLEIELQSHLSMKESLEHTLEETKARYSSQLANLQSLLSSLEAQLMQIRSNMERQNNEYHILLDIKTRLEQEIATYRRLLEGEDVKTTEYQLSTLEERDIKKTRKIKTVVQEVVDGKVVSSEVKEVEENI
Node: PPP1R7, Gene ID: 5510, Description: Regulatory subunit of protein phosphatase 1, Sequence: MAAERGAGQQQSQEMMEVDRRVESEESGDEEGKKHSSGIVADLSEQSLKDGEERGEEDPEEEHELPVDMETINLDRDAEDVDLNHYRIGKIEGFEVLKKVKTLCLRQNLIKCIENLEELQSLRELDLYDNQIKKIENLEALTELEILDISFNLLRNIEGVDKLTRLKKLFLVNNKISKIENLSNLHQLQMLELGSNRIRAIENIDTLTNLESLFLGKNKITKLQNLDALTN




In [57]:
# Check device availability
device = "cuda:0" if torch.cuda.is_available() else "cpu"
device

'cpu'

In [58]:
# Using facebook/esm2_t6_8M_UR50D 
emb_model = EmbeddingWithHuggingFace(model_name='facebook/esm2_t6_8M_UR50D',
                                     model_cache_dir="../../../../data/facebook/esm2_t6_8M_UR50D/",
                                     truncation=False,
                                     device=device)

Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t6_8M_UR50D and are newly initialized: ['esm.pooler.dense.bias', 'esm.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
seq = ['MFLTERNTTSEATFTLLGFSDYLELQIPLFFVFLAVYGFSVVGNLGMIVIIKINPKLHTPMYFFLNHLSFVDFCYSSIIAPMMLVNLVVEDRTISFSGCLVQFFFFCTFVVTELILFAVMAYDHFVAICNPLLYTVAISQKLCAMLVVVLYAWGVACSLTLACSALKLSFHGFNTINHFFCELSSLISLSYPDSYLSQLLLFTVATFNEISTLLIILTSYAFIIVTTLKMPSASGHRKVFSTCASHLTAITIFHGTILFLYCVPNSKNSRHTVKVASVFYTVVIPLLNPLIYSLRNKDVKDAIRKIINTKYFHIKHRHWYPFNFVIEQ']
# Embed the sequences in mini-batches
mini_batch_size = 3
protein_embeddings = []
for i in tqdm(range(0, len(seq), mini_batch_size)):
    outputs = emb_model.embed_documents(seq[i:i+mini_batch_size])
    protein_embeddings.extend(outputs)
    # Release cached GPU memory only when running on CUDA
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()
protein_embeddings

#### PrimeKG Edges

`primekg_edges` is a dataframe of edges, which has the following columns:
- `head_index`: the index of the head node
- `head_name`: the name of the head node
- `head_source`: the source database of the head node
- `head_id`: the id of the head node in its source database
- `head_type`: the type of the head node
- `tail_index`: the index of the tail node
- `tail_name`: the name of the tail node
- `tail_source`: the source database of the tail node
- `tail_id`: the id of the tail node in its source database
- `tail_type`: the type of the tail node
- `display_relation`: the display name of the edge type (e.g. `ppi`)
- `relation`: the type of the edge (e.g. `protein_protein`)

We can also check a sample of the primekg edges to see the interconnections between the nodes in the PrimeKG dataset as follows.

In [9]:
# Check a sample of the primekg edges
primekg_edges.head()

Unnamed: 0,head_index,head_name,head_source,head_id,head_type,tail_index,tail_name,tail_source,tail_id,tail_type,display_relation,relation
0,0,PHYHIP,NCBI,9796,gene/protein,8889,KIF15,NCBI,56992,gene/protein,ppi,protein_protein
1,1,GPANK1,NCBI,7918,gene/protein,2798,PNMA1,NCBI,9240,gene/protein,ppi,protein_protein
2,2,ZRSR2,NCBI,8233,gene/protein,5646,TTC33,NCBI,23548,gene/protein,ppi,protein_protein
3,3,NRF1,NCBI,4899,gene/protein,11592,MAN1B1,NCBI,11253,gene/protein,ppi,protein_protein
4,4,PI4KA,NCBI,5297,gene/protein,2122,RGS20,NCBI,8601,gene/protein,ppi,protein_protein


The current version of PrimeKG has about 8.1M edges in total, as the following cell shows.

In [10]:
# Check dimensions of the primekg edges
primekg_edges.shape

(8100498, 12)

We can break down the primekg edges by type as follows.

In [11]:
# Show edge types and their counts
primekg_edges['display_relation'].value_counts()

display_relation
expression present         3036406
synergistic interaction    2672628
interacts with              686550
ppi                         642150
phenotype present           300634
parent-child                281744
associated with             167482
side effect                 129568
contraindication             61350
expression absent            39774
target                       32760
indication                   18776
enzyme                       10634
transporter                   6184
off-label use                 5136
linked to                     4608
phenotype absent              2386
carrier                       1728
Name: count, dtype: int64
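
To make the counting step concrete, here is a plain-Python sketch of what the count above computes, followed by a filter on one edge type. The rows are toy data: the first two reuse names from the edge sample, while `DrugA` and `DiseaseB` are invented. On the real dataframe the filter would typically be written as `primekg_edges[primekg_edges['display_relation'] == 'ppi']`.

```python
from collections import Counter

# Toy edge rows with only the columns needed here
toy_edges = [
    {"head_name": "PHYHIP", "tail_name": "KIF15", "display_relation": "ppi"},
    {"head_name": "GPANK1", "tail_name": "PNMA1", "display_relation": "ppi"},
    {"head_name": "DrugA", "tail_name": "DiseaseB", "display_relation": "indication"},
]

# Count edges per display_relation, most frequent first
relation_counts = Counter(row["display_relation"] for row in toy_edges)

# Keep only the protein-protein interaction edges
ppi_edges = [
    (row["head_name"], row["tail_name"])
    for row in toy_edges
    if row["display_relation"] == "ppi"
]

print(relation_counts.most_common())
print(ppi_edges)
```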