# PrimeKG-GraphRAG: Biomedical Question Answering Interface

This notebook provides a production-ready interface to the Graph-based Retrieval-Augmented Generation (GraphRAG) system integrated with PrimeKG for biomedical question answering.

## System Overview

The GraphRAG system implements a four-stage pipeline:
1. **Query Processing**: Extracts entities and identifies query intent
2. **Graph Retrieval**: Explores the PrimeKG knowledge graph with an agent-based retriever
3. **Context Organization**: Structures retrieved information hierarchically
4. **Response Generation**: Generates the answer with an LLM
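
The four stages can be pictured as a chain of functions, each consuming the previous stage's output. The sketch below is illustrative only: the function names, the three-triple toy graph, and the substring matching are assumptions made for this example, not the actual `src` implementation, and the final stage only formats a prompt where the real system would call an LLM.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Toy stand-in for PrimeKG: (source, relation, target) triples (hypothetical subset).
TOY_GRAPH: List[Triple] = [
    ("metformin", "indication", "type 2 diabetes"),
    ("BRCA1", "disease_protein", "breast cancer"),
    ("breast cancer", "disease_disease", "breast carcinoma"),
]


def extract_entities(question: str) -> List[str]:
    """Stage 1 (toy): return graph node names that appear in the question."""
    nodes = {name for s, _, t in TOY_GRAPH for name in (s, t)}
    lowered = question.lower()
    return sorted(n for n in nodes if n.lower() in lowered)


def retrieve(entities: List[str]) -> List[Triple]:
    """Stage 2 (toy): keep triples that touch any extracted entity."""
    wanted = set(entities)
    return [t for t in TOY_GRAPH if t[0] in wanted or t[2] in wanted]


def organize(triples: List[Triple]) -> str:
    """Stage 3 (toy): group facts under their source entity."""
    grouped = {}
    for s, r, t in triples:
        grouped.setdefault(s, []).append(f"{r} -> {t}")
    return "\n".join(f"{s}: {'; '.join(f)}" for s, f in sorted(grouped.items()))


def generate(question: str, context: str) -> str:
    """Stage 4 (toy): placeholder for the LLM call; it only formats a prompt."""
    return f"Question: {question}\nContext:\n{context}"


def answer_question(question: str) -> str:
    """Run the four toy stages in order."""
    return generate(question, organize(retrieve(extract_entities(question))))


print(answer_question("How does metformin work to treat type 2 diabetes?"))
```

Keeping each stage a plain function makes it easy to inspect intermediate results, which is also what the `GraphRAGResult` object exposes later in this notebook.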

## Key Features

- **Multi-hop reasoning** for complex biomedical relationships
- **Entity type inference** to reclassify entities whose type is unknown
- **Error handling** with fallback mechanisms


In [11]:
# Import required libraries
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Add project root to path
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root))

# Import GraphRAG system
from src import GraphRAG, GraphRAGConfig

print("✓ Imports successful")

✓ Imports successful


## System Initialization

Initialize the GraphRAG system. This loads the PrimeKG knowledge graph and sets up all pipeline components.

**Note**: First-time initialization may take several minutes to load and cache the PrimeKG data.


In [12]:
# Initialize GraphRAG system
# Configuration is read from environment variables, falling back to defaults
config = GraphRAGConfig()
graphrag = GraphRAG(config)

# Initialize all components (loads PrimeKG data)
print("Initializing GraphRAG system...")
if graphrag.initialize():
    print("✓ GraphRAG system initialized successfully!")
    
    # Display system statistics
    stats = graphrag.data_source.get_graph_statistics()
    print("\nPrimeKG Graph Statistics:")
    print(f"  - Total nodes: {stats.get('total_nodes', 0):,}")
    print(f"  - Total edges: {stats.get('total_edges', 0):,}")
    print(f"  - Entity types: {len(stats.get('entity_types', {}))}")
else:
    print(f"✗ Initialization failed: {graphrag.initialization_error}")
    raise RuntimeError("Failed to initialize GraphRAG system")


2025-11-12 22:22:36,594 - src - INFO - GraphRAG initialized with config: data
Initializing GraphRAG system...
2025-11-12 22:22:36,641 - src - INFO - Initializing GraphRAG system...
2025-11-12 22:22:36,642 - src - INFO - Initializing PrimeKG data source...
2025-11-12 22:22:36,644 - src.graph_data_source - INFO - Initialized PrimeKG data source with data_dir: data
2025-11-12 22:22:36,646 - src.graph_data_source - INFO - Cache enabled: True
2025-11-12 22:22:36,648 - src.graph_data_source - INFO - Auto-download enabled: True
2025-11-12 22:22:36,649 - src.graph_data_source - INFO - Neo4j connection info found in environment. Will attempt to connect on load().


Failed to establish connection to ResolvedIPv6Address(('::1', 7687, 0, 0)) (reason [WinError 10061] No connection could be made because the target machine actively refused it)
Failed to establish connection to ResolvedIPv4Address(('127.0.0.1', 7687)) (reason [WinError 10061] No connection could be made because the target machine actively refused it)
2025-11-12 22:23:03,397 - src.graph_data_source - INFO - Applying enhanced entity classification to cached data...
2025-11-12 22:23:03,399 - src.graph_data_source - INFO - Applying enhanced entity type classification...
2025-11-12 22:23:05,999 - src.graph_data_source - INFO - Enhanced entity classification complete:
2025-11-12 22:23:06,001 - src.graph_data_source - INFO -   Reclassified: 5 entities
2025-11-12 22:23:06,002 - src.graph_data_source - INFO -   Unknown entities: 14,603 -> 14,598
2025-11-12 22:23:06,005 - src.graph_data_source - INFO -   Unknown percentage: 11.3%
2025-11-12 22:23:06,007 - src.graph_data_source - INFO -   Improvem

## Query Processing

Process biomedical questions through the GraphRAG pipeline. The system supports various query types:

- **Treatment queries**: "How does metformin work to treat type 2 diabetes?"
- **Mechanism queries**: "Explain the insulin signaling pathway"
- **Side effect queries**: "What are the side effects of aspirin?"
- **Relationship queries**: "What genes are associated with Alzheimer's disease?"
- **Pathway queries**: "What is the role of BRCA1 in breast cancer development?"
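
To make the idea of intent identification concrete, here is a minimal rule-based sketch. It is not the classifier used by `src.query_processor`: the keyword patterns are assumptions, and only the labels `treatment_query`, `relationship_query`, and `general_query` appear in this notebook's logs (`side_effect_query` and `mechanism_query` are hypothetical names).

```python
import re

# Ordered (label, pattern) rules; the first matching rule wins.
# Patterns and the side_effect/mechanism labels are assumptions for illustration.
INTENT_RULES = [
    ("side_effect_query", re.compile(r"\bside effects?\b|\badverse\b", re.I)),
    ("treatment_query", re.compile(r"\btreat(s|ed|ment|ments)?\b|\btherapy\b", re.I)),
    ("relationship_query", re.compile(r"\bassociated with\b|\brelated to\b", re.I)),
    ("mechanism_query", re.compile(r"\bmechanism\b|\bpathway\b|\bexplain\b", re.I)),
]


def classify_intent(question: str) -> str:
    """Return the label of the first matching rule, or 'general_query'."""
    for label, pattern in INTENT_RULES:
        if pattern.search(question):
            return label
    return "general_query"


for q in [
    "How does metformin work to treat type 2 diabetes?",
    "What genes are associated with Alzheimer's disease?",
    "What is the role of BRCA1 in breast cancer development?",
]:
    print(f"{classify_intent(q):20s} {q}")
```

Rule order matters in a scheme like this: a question that mentions both a treatment and a side effect is labeled by whichever rule comes first, so a real classifier would likely need scoring or a learned model.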


In [13]:
def process_query(question: str, verbose: bool = True):
    """
    Process a biomedical question through the GraphRAG pipeline.
    
    Args:
        question: Natural language biomedical question
        verbose: Whether to display detailed information
        
    Returns:
        GraphRAGResult object with query results
    """
    print(f"\n{'='*80}")
    print(f"Query: {question}")
    print(f"{'='*80}\n")
    
    # Process query
    result = graphrag.query(question)
    
    if verbose:
        # Display processing information. The query type lives on
        # processed_query; entities and relationships live on retrieval_result.
        print(f"Query Type: {result.processed_query.query_type}")
        print(f"Processing Time: {result.processing_time:.2f}s")
        print(f"\nEntities Retrieved: {len(result.retrieval_result.entities)}")
        print(f"Relationships Retrieved: {len(result.retrieval_result.relationships)}")
        print(f"Confidence Score: {result.overall_confidence:.2%}")
        
        # Display key entities
        if result.retrieval_result.entities:
            print(f"\nKey Entities ({min(5, len(result.retrieval_result.entities))} of {len(result.retrieval_result.entities)}):")
            for entity in result.retrieval_result.entities[:5]:
                print(f"  - {entity.name} ({entity.entity_type})")
        
        # Display key relationships
        if result.retrieval_result.relationships:
            print(f"\nKey Relationships ({min(5, len(result.retrieval_result.relationships))} of {len(result.retrieval_result.relationships)}):")
            for rel in result.retrieval_result.relationships[:5]:
                print(f"  - {rel.source_entity.name} --[{rel.relation_type}]--> {rel.target_entity.name}")
    
    print(f"\n{'='*80}")
    print("ANSWER:")
    print(f"{'='*80}")
    print(result.generated_response.text)
    print(f"{'='*80}\n")
    
    return result


### Example Query 1: Treatment Mechanism

Query about how a drug works to treat a disease.


In [14]:
# Example 1: Treatment mechanism query
question1 = "How does metformin work to treat type 2 diabetes?"
result1 = process_query(question1)



Query: How does metformin work to treat type 2 diabetes?

2025-11-12 22:23:16,355 - src.query_processor - INFO - Processing query: How does metformin work to treat type 2 diabetes?
2025-11-12 22:23:16,357 - src.query_processor - INFO - Identified query type: treatment_query
2025-11-12 22:23:16,362 - src.graph_data_source - INFO - Starting memory cleanup (aggressive=False). Initial usage: 3638.6MB
2025-11-12 22:23:20,809 - src.graph_data_source - INFO - Memory cleanup completed. Freed 0.0MB. Current usage: 3638.6MB
2025-11-12 22:23:20,943 - src.graph_data_source - INFO - Starting memory cleanup (aggressive=True). Initial usage: 3638.6MB
2025-11-12 22:23:25,394 - src.graph_data_source - INFO - Memory cleanup completed. Freed 0.0MB. Current usage: 3638.6MB
2025-11-12 22:23:43,837 - src.query_processor - INFO - Loaded biomedical NER model (lazy loading)
2025-11-12 22:23:43,902 - src.query_processor - INFO - Entity extraction found 4 entities using methods: ['pattern_match', 'pattern_match

### Example Query 2: Gene-Disease Association

Query about genetic associations with diseases.


In [15]:
# Example 2: Gene-disease association query
question2 = "What genes are associated with Alzheimer's disease?"
result2 = process_query(question2)


Query: What genes are associated with Alzheimer's disease?

2025-11-12 22:24:04,376 - src.query_processor - INFO - Processing query: What genes are associated with Alzheimer's disease?
2025-11-12 22:24:04,378 - src.query_processor - INFO - Identified query type: relationship_query
2025-11-12 22:24:04,735 - src.query_processor - INFO - Entity extraction found 6 entities using methods: ['pattern_match', 'pattern_match', 'ner', 'ner', 'ner', 'token_search']
2025-11-12 22:24:04,736 - src.query_processor - INFO - Extracted 6 entities
2025-11-12 22:24:04,738 - src.query_processor - INFO - Identified relations: ['disease_disease', 'disease_protein', 'protein_protein']
2025-11-12 22:24:04,739 - src.retriever - INFO - Starting agent-based retrieval for query: What genes are associated with Alzheimer's disease?
2025-11-12 22:24:06,254 - src.retriever - INFO - Query analysis: {'query_type': 'gene_disease', 'primary_entities': ['genes', "Alzheimer's disease"], 'relevant_relationships': ['associat


### Example Query 3: Pathway Explanation

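Query about biological pathways and mechanisms. The run recorded below repeats the Alzheimer's disease question from Example 2; substitute a pathway question such as "Explain the insulin signaling pathway" to exercise this query type.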

In [16]:
# Example 3: Gene-disease association query
question3 = "What genes are associated with Alzheimer's disease?"
result3 = process_query(question3)


Query: What genes are associated with Alzheimer's disease?

2025-11-12 22:24:17,543 - src.query_processor - INFO - Processing query: What genes are associated with Alzheimer's disease?
2025-11-12 22:24:17,545 - src.query_processor - INFO - Identified query type: relationship_query
2025-11-12 22:24:17,548 - src.graph_data_source - INFO - Starting memory cleanup (aggressive=False). Initial usage: 5319.7MB
2025-11-12 22:24:23,812 - src.graph_data_source - INFO - Memory cleanup completed. Freed 0.0MB. Current usage: 5319.7MB
2025-11-12 22:24:24,173 - src.query_processor - INFO - Entity extraction found 6 entities using methods: ['pattern_match', 'pattern_match', 'ner', 'ner', 'ner', 'token_search']
2025-11-12 22:24:24,175 - src.query_processor - INFO - Extracted 6 entities
2025-11-12 22:24:24,176 - src.query_processor - INFO - Identified relations: ['disease_disease', 'disease_protein', 'protein_protein']
2025-11-12 22:24:24,176 - src.retriever - INFO - Starting agent-based retrieval for 


## Custom Query Interface

Enter your own biomedical question below:

In [17]:
# Custom query - modify the question below
custom_question = "What is the role of BRCA1 in breast cancer development?"

# Process custom query
custom_result = process_query(custom_question)



Query: What is the role of BRCA1 in breast cancer development?

2025-11-12 22:24:28,678 - src.query_processor - INFO - Processing query: What is the role of BRCA1 in breast cancer development?
2025-11-12 22:24:28,680 - src.query_processor - INFO - Identified query type: general_query
2025-11-12 22:24:29,280 - src.graph_data_source - INFO - Type re-inference successful: Re-classified 30 entities, found 30 matches
2025-11-12 22:24:29,282 - src.query_processor - INFO - Entity extraction found 5 entities using methods: ['pattern_match', 'pattern_match', 'pattern_match', 'ner', 'token_search']
2025-11-12 22:24:29,283 - src.query_processor - INFO - Extracted 5 entities
2025-11-12 22:24:29,284 - src.query_processor - INFO - Identified relations: ['disease_protein', 'protein_protein']
2025-11-12 22:24:29,285 - src.retriever - INFO - Starting agent-based retrieval for query: What is the role of BRCA1 in breast cancer development?
2025-11-12 22:24:31,255 - src.retriever - INFO - Query analysis:

## Advanced Usage: Accessing Detailed Results

The `GraphRAGResult` object contains detailed information about the query processing:
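
The attribute accesses used in this notebook imply a nested structure roughly like the one sketched below. The layout is inferred from usage only (for example `result.processed_query.query_type` and `result.retrieval_result.entities`); the class names other than `GraphRAGResult`, the field types, and the defaults are assumptions, and the response text and response confidence are placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Entity:
    name: str
    entity_type: str
    relevance_score: float = 0.0


@dataclass
class Relationship:
    source_entity: Entity
    target_entity: Entity
    relation_type: str
    relevance_score: float = 0.0


@dataclass
class ProcessedQuery:
    query_type: str


@dataclass
class RetrievalResult:
    entities: List[Entity] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)
    paths: List[Any] = field(default_factory=list)


@dataclass
class GeneratedResponse:
    text: str
    confidence: float


@dataclass
class GraphRAGResultSketch:
    """Hypothetical mirror of the fields this notebook reads from GraphRAGResult."""
    query: str
    processed_query: ProcessedQuery
    retrieval_result: RetrievalResult
    generated_response: GeneratedResponse
    processing_time: float
    overall_confidence: float
    metadata: Dict[str, Any] = field(default_factory=dict)


def top_entities(result: GraphRAGResultSketch, k: int = 5) -> List[Entity]:
    """Return the k entities with the highest relevance score."""
    ranked = sorted(result.retrieval_result.entities,
                    key=lambda e: e.relevance_score, reverse=True)
    return ranked[:k]


# Example built from two entities and one relationship shown in this notebook's output.
cancer = Entity("breast cancer", "disease", 0.95)
carcinoma = Entity("breast carcinoma", "disease", 1.0)
example = GraphRAGResultSketch(
    query="What is the role of BRCA1 in breast cancer development?",
    processed_query=ProcessedQuery("general_query"),
    retrieval_result=RetrievalResult(
        entities=[cancer, carcinoma],
        relationships=[Relationship(cancer, carcinoma, "disease_disease", 0.7)],
    ),
    generated_response=GeneratedResponse(text="(placeholder)", confidence=0.0),
    processing_time=8.64,
    overall_confidence=0.5864,
)
print([e.name for e in top_entities(example, k=1)])
```

Sorting by `relevance_score` before slicing is a small improvement over taking the first few entities in retrieval order, since the recorded output shows the highest-scoring entities are not listed first.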


In [18]:

# Access detailed result information
def display_detailed_results(result):
    """Display comprehensive information about query results."""
    print("\n" + "="*80)
    print("DETAILED QUERY RESULTS")
    print("="*80)
    
    print(f"\nQuery Information:")
    print(f"  - Query: {result.query}")
    print(f"  - Query Type: {result.processed_query.query_type}")
    print(f"  - Processing Time: {result.processing_time:.2f}s")
    print(f"  - Confidence: {result.overall_confidence:.2%}")
    
    print(f"\nRetrieval Statistics:")
    print(f"  - Entities Retrieved: {len(result.retrieval_result.entities)}")
    print(f"  - Relationships Retrieved: {len(result.retrieval_result.relationships)}")
    print(f"  - Paths Retrieved: {len(result.retrieval_result.paths)}")
    
    if result.retrieval_result.entities:
        print(f"\nRetrieved Entities:")
        for i, entity in enumerate(result.retrieval_result.entities[:10], 1):
            print(f"  {i}. {entity.name} (Type: {entity.entity_type}, Score: {entity.relevance_score:.3f})")
        if len(result.retrieval_result.entities) > 10:
            print(f"  ... and {len(result.retrieval_result.entities) - 10} more entities")
    
    if result.retrieval_result.relationships:
        print(f"\nRetrieved Relationships:")
        for i, rel in enumerate(result.retrieval_result.relationships[:10], 1):
            print(f"  {i}. {rel.source_entity.name} --[{rel.relation_type}]--> {rel.target_entity.name}")
            print(f"      (Relevance: {rel.relevance_score:.3f})")
        if len(result.retrieval_result.relationships) > 10:
            print(f"  ... and {len(result.retrieval_result.relationships) - 10} more relationships")
    
    print(f"\nGenerated Response:")
    answer = result.generated_response.text
    preview = f"{answer[:200]}..." if len(answer) > 200 else answer
    print(f"  - Answer: {preview}")
    print(f"  - Response Confidence: {result.generated_response.confidence:.2%}")
    
    print(f"\nMetadata:")
    for key, value in result.metadata.items():
        if isinstance(value, (str, int, float, bool)):
            print(f"  - {key}: {value}")
    
    print("\n" + "="*80)

# Display detailed results for the last query
if 'custom_result' in locals():
    display_detailed_results(custom_result)


DETAILED QUERY RESULTS

Query Information:
  - Query: What is the role of BRCA1 in breast cancer development?
  - Query Type: general_query
  - Processing Time: 8.64s
  - Confidence: 58.64%

Retrieval Statistics:
  - Entities Retrieved: 30
  - Relationships Retrieved: 50
  - Paths Retrieved: 29

Retrieved Entities:
  1. Merkel cell skin cancer (Type: disease, Score: 0.950)
  2. breast cancer (Type: disease, Score: 0.950)
  3. BRCA1 (Type: gene, Score: 0.950)
  4. 46,XX disorder of gonadal development (Type: disease, Score: 0.950)
  5. Role of ABL in ROBO-SLIT signaling (Type: MolecularFunction, Score: 0.950)
  6. breast (Type: protein, Score: 0.950)
  7. breast carcinoma (Type: disease, Score: 1.000)
  8. male breast (Type: protein, Score: 1.000)
  9. APUdoma (Type: protein, Score: 0.600)
  10. Neoplasm of the skin (Type: protein, Score: 0.500)
  ... and 20 more entities

Retrieved Relationships:
  1. breast cancer --[disease_disease]--> breast carcinoma
      (Relevance: 0.700)
  2. 


## System Performance Metrics

Monitor system performance and statistics:


In [19]:

# Display system performance metrics
print("System Performance Metrics:")
print(f"  - Total Queries Processed: {graphrag.query_count}")
print(f"  - Total Processing Time: {graphrag.total_processing_time:.2f}s")
if graphrag.query_count > 0:
    print(f"  - Average Query Time: {graphrag.total_processing_time / graphrag.query_count:.2f}s")
print(f"  - Error Count: {graphrag.error_count}")
print(f"  - Success Rate: {(1 - graphrag.error_count / max(graphrag.query_count, 1)) * 100:.1f}%")

System Performance Metrics:
  - Total Queries Processed: 4
  - Total Processing Time: 80.91s
  - Average Query Time: 20.23s
  - Error Count: 0
  - Success Rate: 100.0%
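
The derived figures above follow from simple arithmetic on the three counters. As a sketch, the helper below (a hypothetical function, not part of the `GraphRAG` class) reproduces them and guards against division by zero when no queries have run yet.

```python
from typing import Tuple


def summarize_metrics(query_count: int, total_time: float, error_count: int) -> Tuple[float, float]:
    """Return (average seconds per query, success rate in percent)."""
    average = total_time / query_count if query_count > 0 else 0.0
    success_rate = (1 - error_count / max(query_count, 1)) * 100
    return average, success_rate


# Values from the run recorded above: 4 queries, 80.91 s in total, 0 errors.
average, success_rate = summarize_metrics(4, 80.91, 0)
print(f"Average Query Time: {average:.2f}s")   # 80.91 / 4 = 20.2275
print(f"Success Rate: {success_rate:.1f}%")
```

Note that the average is dominated by the first query, which includes the lazy loading of the NER model, so a per-query breakdown would be more informative than the mean.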


## Notes

- **First-time initialization**: Loading PrimeKG data may take several minutes. Subsequent runs use cached data for faster startup.
- **Query types**: The system automatically identifies query intent (treatment, mechanism, side effects, relationships, pathways).
- **Multi-hop reasoning**: The system performs multi-hop graph traversal to find complex relationships (e.g., Drug → Disease → Phenotype for side effects).
- **Error handling**: The system includes error handling and fallback mechanisms so that a failure in one component does not abort the whole query.
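
As an illustration of the multi-hop traversal mentioned in these notes, the sketch below enumerates paths of up to a fixed number of hops with a breadth-first search. It is a generic technique shown on a made-up three-edge graph, not the retriever's actual algorithm; the node names are placeholders, the relation labels are only modeled on PrimeKG-style names, and edges are treated as directed for simplicity.

```python
from collections import deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)


def find_paths(edges: List[Edge], start: str, max_hops: int) -> List[List[str]]:
    """Enumerate simple paths of 1..max_hops edges from start, breadth-first.

    Each path alternates node, relation, node, ... and never revisits a node.
    """
    adjacency: Dict[str, List[Tuple[str, str]]] = {}
    for source, relation, target in edges:
        adjacency.setdefault(source, []).append((relation, target))

    paths: List[List[str]] = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        hops = len(path) // 2  # a path with n edges has 2n + 1 items
        if hops == max_hops:
            continue
        for relation, neighbor in adjacency.get(path[-1], []):
            if neighbor in path[::2]:  # nodes sit at even indices; skip cycles
                continue
            extended = path + [relation, neighbor]
            paths.append(extended)
            queue.append(extended)
    return paths


# Placeholder graph: Drug -> Disease -> Phenotype -> Phenotype.
toy_edges = [
    ("drug_A", "indication", "disease_B"),
    ("disease_B", "disease_phenotype_positive", "phenotype_C"),
    ("phenotype_C", "phenotype_phenotype", "phenotype_D"),
]
for p in find_paths(toy_edges, "drug_A", max_hops=2):
    print(" ".join(p))
```

Capping `max_hops` keeps the search bounded; on a graph the size of PrimeKG the number of paths grows quickly with each extra hop, so a real retriever would also need to prune by relevance.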

## References

- PrimeKG: A knowledge graph for precision medicine (doi:10.7910/DVN/IXA7BM)
- GraphRAG Framework: arXiv:2501.00309
