# LangGraph and Knowledge Graphs Tutorial

## Learning Objectives 🎯

By the end of this tutorial, you will understand:

1. **Knowledge Graphs**: What they are and how they represent domain knowledge
2. **LangGraph**: How to build AI workflows with state management
3. **Biomedical Applications**: Real-world uses of AI + knowledge graphs
4. **Practical Implementation**: How to build your own AI agents

## Prerequisites 📚

- Basic Python programming
- Understanding of databases (helpful but not required)
- Interest in AI and biomedical applications

## What You'll Build 🚀

You'll learn to create AI agents that can answer complex biomedical questions like:
- "What drugs treat Hypertension?"
- "What genes are associated with cardiovascular disease?"
- "Show me the pathway from gene F8 to available treatments"

---

## Part 1: Environment Setup and Connection 🔧

First, let's set up our environment and connect to the knowledge graph database.

In [38]:
# Setup: Import required libraries and connect to our knowledge graph
import sys
import os
from pathlib import Path

# Add the project root to Python path for imports
project_root = Path().resolve().parent.parent
sys.path.append(str(project_root))

print(f"📁 Project root: {project_root}")
print("🐍 Python path updated")

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

print("✅ Environment variables loaded")

📁 Project root: /Users/a.elhusseini/Docs/sandbox/hdsi_replication_proj_2025
🐍 Python path updated
✅ Environment variables loaded


In [39]:
# Connect to our knowledge graph database
from src.agents.graph_interface import GraphInterface

# Get database connection settings
uri = os.getenv("NEO4J_URI", "bolt://localhost:7687")
user = os.getenv("NEO4J_USER", "neo4j")
password = os.getenv("NEO4J_PASSWORD")

# Check if we have the required credentials
if not password:
    print("⚠️  Setup Required:")
    print("   1. Copy .env.example to .env")
    print("   2. Add your Neo4j password: NEO4J_PASSWORD=your_password")
    print("   3. Make sure Neo4j is running")
    print("\n💡 Need help? Check docs/getting-started.md")
    graph_db = None
else:
    try:
        # Attempt to connect to the database
        print(f"🔗 Connecting to Neo4j at {uri}...")
        graph_db = GraphInterface(uri, user, password)
        
        # Test the connection with a simple query
        test_result = graph_db.execute_query("MATCH (n) RETURN count(n) as total_nodes")
        total_nodes = test_result[0]['total_nodes'] if test_result else 0
        
        print("✅ Connected to knowledge graph!")
        print(f"📊 Database contains {total_nodes:,} nodes")
        
        if total_nodes == 0:
            print("\n⚠️  Database appears empty. Run: pdm run load-data")
            
    except Exception as e:
        print(f"❌ Connection failed: {e}")
        print("\n🔧 Troubleshooting:")
        print("   • Make sure Neo4j is running (Neo4j Desktop or Docker)")
        print("   • Check your password in .env file")
        print("   • Verify the URI is correct")
        graph_db = None

🔗 Connecting to Neo4j at bolt://localhost:7687...
✅ Connected to knowledge graph!
📊 Database contains 1,702 nodes


## Part 2: What are Knowledge Graphs? 🕸️

### The Problem with Traditional Data Storage

Imagine you're studying biology and want to answer: **"What genes are related to Hypertension?"**

With traditional databases (tables), you might have:
- `genes` table
- `diseases` table  
- `gene_disease_associations` table

But what about complex questions like: **"What pathway connects gene F8 to available treatments through proteins and diseases?"**

Answering this with tables means chaining joins across several entity and association tables, and every extra hop in the question adds another join.

### Knowledge Graphs: A Better Way

Knowledge graphs store information as **nodes** (entities) and **relationships** (edges):

```
F8 --[ENCODES]--> F8_protein --[ASSOCIATED_WITH]--> Hypertension
                                                         ^
                                                         |
                                                      [TREATS]
                                                         |
                                                     Lisinopril
```

This naturally represents how biological entities relate to each other!
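
To make the multi-hop idea concrete, here is a minimal sketch that stores the diagram above as plain Python triples and walks it from a gene to its treatments. The helper names (`targets`, `sources`, `treatments_for_gene`) are hypothetical and only illustrate the traversal; a real graph database does this with indexes and a query planner.

```python
# Toy in-memory graph: each edge is a (source, relationship, target) triple.
# The three edges mirror the diagram above; this is not a database export.
EDGES = [
    ("F8", "ENCODES", "F8_protein"),
    ("F8_protein", "ASSOCIATED_WITH", "Hypertension"),
    ("Lisinopril", "TREATS", "Hypertension"),
]


def targets(source, relationship):
    """Follow outgoing edges of one type from a node."""
    return [t for s, r, t in EDGES if s == source and r == relationship]


def sources(target, relationship):
    """Follow incoming edges of one type into a node."""
    return [s for s, r, t in EDGES if t == target and r == relationship]


def treatments_for_gene(gene):
    """Walk Gene -> Protein -> Disease <- Drug and collect the drugs."""
    drugs = []
    for protein in targets(gene, "ENCODES"):
        for disease in targets(protein, "ASSOCIATED_WITH"):
            drugs.extend(sources(disease, "TREATS"))
    return drugs


print(treatments_for_gene("F8"))  # ['Lisinopril']
```

Each nested loop corresponds to one hop in the diagram, which is the same work a relational schema would express as one join per hop.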

### Why This Matters

- **Intuitive**: Mirrors how domain experts think about relationships
- **Flexible**: Easy to add new types of entities and relationships
- **Powerful**: Can answer complex multi-hop questions
- **Visual**: Can be displayed as networks and graphs

In [40]:
# Let's explore our biomedical knowledge graph structure
if graph_db:
    # Get the database schema information
    schema = graph_db.get_schema_info()
    
    print("🏗️ Knowledge Graph Schema:")
    print("=" * 40)
    print(f"📦 Node Types: {', '.join(schema['node_labels'])}")
    print(f"🔗 Relationship Types: {', '.join(schema['relationship_types'])}")
    
    print("\n📋 Node Properties:")
    for node_type, properties in schema['node_properties'].items():
        print(f"  🏷️ {node_type}: {properties}")
    
    print("\n🧬 Our Biomedical Graph Contains:")
    print("  • Genes: Genetic sequences with functions and chromosomal locations")
    print("  • Proteins: Encoded proteins with molecular weights and structures") 
    print("  • Diseases: Medical conditions categorized by type and severity")
    print("  • Drugs: Medications with approval status and mechanisms")
    
else:
    print("⚠️ Database not connected. Please fix connection issues above.")

🏗️ Knowledge Graph Schema:
📦 Node Types: Gene, Protein, Disease, Drug
🔗 Relationship Types: ENCODES, ASSOCIATED_WITH, TREATS, TARGETS, LINKED_TO

📋 Node Properties:
  🏷️ Gene: ['expression_level', 'gene_id', 'gene_name', 'chromosome', 'function']
  🏷️ Protein: ['protein_id', 'structure_type', 'protein_name', 'molecular_weight']
  🏷️ Disease: ['category', 'prevalence', 'disease_name', 'disease_id', 'severity']
  🏷️ Drug: ['type', 'drug_name', 'approval_status', 'mechanism', 'drug_id']

🧬 Our Biomedical Graph Contains:
  • Genes: Genetic sequences with functions and chromosomal locations
  • Proteins: Encoded proteins with molecular weights and structures
  • Diseases: Medical conditions categorized by type and severity
  • Drugs: Medications with approval status and mechanisms


In [41]:
# Let's look at some real data in our knowledge graph
if graph_db:
    print("🔍 Sample Data from Our Knowledge Graph:")
    print("=" * 50)
    
    # Sample genes
    print("🧬 Sample Genes:")
    try:
        genes = graph_db.execute_query(
            "MATCH (g:Gene) RETURN g.gene_name, g.function, g.chromosome LIMIT 3"
        )
        for gene in genes:
            function = gene.get('g.function', 'Function not specified')
            chromosome = gene.get('g.chromosome', 'Unknown')
            print(f"  • {gene['g.gene_name']}: {function} (chr{chromosome})")
    except Exception as e:
        print(f"  Error fetching genes: {e}")
    
    # Sample proteins
    print("\n🧪 Sample Proteins:")
    try:
        proteins = graph_db.execute_query(
            "MATCH (p:Protein) RETURN p.protein_name, p.molecular_weight, p.structure_type LIMIT 3"
        )
        for protein in proteins:
            weight = protein.get('p.molecular_weight', 'Unknown')
            structure = protein.get('p.structure_type', 'Unknown')
            print(f"  • {protein['p.protein_name']}: {weight} kDa, {structure}")
    except Exception as e:
        print(f"  Error fetching proteins: {e}")
    
    # Sample diseases
    print("\n🏥 Sample Diseases:")
    try:
        diseases = graph_db.execute_query(
            "MATCH (d:Disease) RETURN d.disease_name, d.category, d.severity LIMIT 3"
        )
        for disease in diseases:
            category = disease.get('d.category', 'Unknown category')
            severity = disease.get('d.severity', 'Unknown')
            print(f"  • {disease['d.disease_name']}: {category} (severity: {severity})")
    except Exception as e:
        print(f"  Error fetching diseases: {e}")
    
    # Sample drugs  
    print("\n💊 Sample Drugs:")
    try:
        drugs = graph_db.execute_query(
            "MATCH (dr:Drug) RETURN dr.drug_name, dr.type, dr.approval_status LIMIT 3"
        )
        for drug in drugs:
            drug_type = drug.get('dr.type', 'Unknown type')
            status = drug.get('dr.approval_status', 'Unknown')
            print(f"  • {drug['dr.drug_name']}: {drug_type} ({status})")
    except Exception as e:
        print(f"  Error fetching drugs: {e}")
    
    # Database summary
    print("\n📊 Database Summary:")
    try:
        summary = graph_db.execute_query(
            "MATCH (n) RETURN labels(n)[0] as node_type, count(*) as count ORDER BY count DESC"
        )
        total_nodes = sum(row['count'] for row in summary)
        print(f"  📈 Total nodes: {total_nodes:,}")
        for row in summary:
            print(f"  • {row['node_type']}: {row['count']:,} nodes")
    except Exception as e:
        print(f"  Error fetching summary: {e}")
        
else:
    print("⚠️ Database not connected. Please fix connection issues above.")

🔍 Sample Data from Our Knowledge Graph:
🧬 Sample Genes:
  • F8: apoptosis (chr15)
  • F9: development (chr15)
  • GBA: development (chr21)

🧪 Sample Proteins:
  • TP53: 11 kDa, greek_key
  • TP73_iso1: 72 kDa, coiled_coil
  • TP73_iso2: 26 kDa, jelly_roll

🏥 Sample Diseases:
  • Hypertension: cardiovascular (severity: life_threatening)
  • Coronary_Artery_Disease: cardiovascular (severity: severe)
  • Heart_Failure: cardiovascular (severity: moderate)

💊 Sample Drugs:
  • Lisinopril: small_molecule (approved)
  • Enalapril: small_molecule (approved)
  • Captopril: small_molecule (approved)

📊 Database Summary:
  📈 Total nodes: 1,702
  • Protein: 661 nodes
  • Gene: 500 nodes
  • Drug: 350 nodes
  • Disease: 191 nodes


## Part 3: Graph Queries with Cypher 🔍

Neo4j uses **Cypher** as its query language. Think of it like SQL, but for graphs!

### Basic Cypher Patterns

1. **MATCH**: Find patterns in the graph
2. **WHERE**: Filter results  
3. **RETURN**: What to give back
4. **LIMIT**: Restrict number of results
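
If you think in Python, the four clauses map roughly onto list operations. The sketch below is only an analogy, assuming a hypothetical `run_query` helper and three gene rows copied from the sample output earlier in this tutorial; it is not how Neo4j executes Cypher.

```python
# Three rows standing in for Gene nodes (values taken from the sample output above).
genes = [
    {"gene_name": "F8", "chromosome": "15"},
    {"gene_name": "F9", "chromosome": "15"},
    {"gene_name": "GBA", "chromosome": "21"},
]


def run_query(nodes, where, returns, limit):
    """Apply the four clauses in order to a list of node dictionaries."""
    matched = nodes                                              # MATCH: candidate nodes
    filtered = [n for n in matched if where(n)]                  # WHERE: keep rows passing the test
    projected = [{k: n[k] for k in returns} for n in filtered]   # RETURN: pick properties
    return projected[:limit]                                     # LIMIT: cap the row count


rows = run_query(
    genes,
    where=lambda n: n["chromosome"] == "15",
    returns=["gene_name"],
    limit=1,
)
print(rows)  # [{'gene_name': 'F8'}]
```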

### Query Pattern Examples

```cypher
// Simple: Find all genes
MATCH (g:Gene) RETURN g.gene_name LIMIT 5

// Relationship: Gene encodes protein
MATCH (g:Gene)-[:ENCODES]->(p:Protein) 
RETURN g.gene_name, p.protein_name LIMIT 5

// Complex: Multi-hop pathway
MATCH (g:Gene)-[:ENCODES]->(p:Protein)-[:ASSOCIATED_WITH]->(d:Disease)
WHERE g.gene_name = 'F8'
RETURN g.gene_name, p.protein_name, d.disease_name
```

Let's try some examples!

In [42]:
# Example 1: Simple query - Find all genes
if graph_db:
    print("🔍 Example 1: Simple Query - Find Genes")
    print("=" * 40)
    
    query = "MATCH (g:Gene) RETURN g.gene_name, g.function LIMIT 5"
    print(f"📝 Query: {query}")
    
    try:
        results = graph_db.execute_query(query)
        print(f"✅ Found {len(results)} results:")
        for i, row in enumerate(results, 1):
            function = row.get('g.function', 'Function not specified')
            print(f"  {i}. {row['g.gene_name']}: {function}")
    except Exception as e:
        print(f"❌ Query failed: {e}")
        
else:
    print("⚠️ Database not connected. Please fix connection issues first.")

🔍 Example 1: Simple Query - Find Genes
📝 Query: MATCH (g:Gene) RETURN g.gene_name, g.function LIMIT 5
✅ Found 5 results:
  1. F8: apoptosis
  2. F9: development
  3. GBA: development
  4. HBB: hormone
  5. HBA1: enzyme_activity


In [43]:
# Example 2: Relationship query - What proteins are encoded by genes?
if graph_db:
    print("🔗 Example 2: Relationship Query - Gene Encodes Protein")
    print("=" * 50)
    
    query = """
    MATCH (g:Gene)-[:ENCODES]->(p:Protein) 
    RETURN g.gene_name, p.protein_name, p.molecular_weight 
    LIMIT 5
    """
    print(f"📝 Query: {query.strip()}")
    
    try:
        results = graph_db.execute_query(query)
        print(f"✅ Found {len(results)} gene-protein relationships:")
        for i, row in enumerate(results, 1):
            weight = row.get('p.molecular_weight', 'Unknown')
            print(f"  {i}. Gene {row['g.gene_name']} → Protein {row['p.protein_name']} ({weight} kDa)")
    except Exception as e:
        print(f"❌ Query failed: {e}")
        
else:
    print("⚠️ Database not connected. Please fix connection issues first.")

🔗 Example 2: Relationship Query - Gene Encodes Protein
📝 Query: MATCH (g:Gene)-[:ENCODES]->(p:Protein) 
    RETURN g.gene_name, p.protein_name, p.molecular_weight 
    LIMIT 5
✅ Found 5 gene-protein relationships:
  1. Gene F8 → Protein F8 (122 kDa)
  2. Gene F9 → Protein F9 (34 kDa)
  3. Gene GBA → Protein GBA (83 kDa)
  4. Gene HBB → Protein HBB (172 kDa)
  5. Gene HBA1 → Protein HBA1 (86 kDa)


In [44]:
# Example 3: Complex multi-hop query - Complete pathway from gene to treatment
if graph_db:
    print("🛤️ Example 3: Complex Query - Gene to Treatment Pathway")
    print("=" * 55)
    
    query = """
    MATCH (g:Gene)-[:ENCODES]->(p:Protein)-[:ASSOCIATED_WITH]->(d:Disease)<-[:TREATS]-(dr:Drug)
    RETURN g.gene_name, p.protein_name, d.disease_name, dr.drug_name
    LIMIT 3
    """
    print(f"📝 Query: {query.strip()}")
    print("\n🔍 This finds: Gene → Protein → Disease ← Drug pathways")
    
    try:
        results = graph_db.execute_query(query)
        print(f"\n✅ Found {len(results)} complete pathways:")
        for i, row in enumerate(results, 1):
            pathway = f"{row['g.gene_name']} → {row['p.protein_name']} → {row['d.disease_name']} ← {row['dr.drug_name']}"
            print(f"  {i}. {pathway}")
            
        if len(results) == 0:
            print("  No complete pathways found. This is normal for synthetic data.")
            print("  💡 Try simpler queries to explore the available relationships.")
            
    except Exception as e:
        print(f"❌ Query failed: {e}")
        
else:
    print("⚠️ Database not connected. Please fix connection issues first.")

🛤️ Example 3: Complex Query - Gene to Treatment Pathway
📝 Query: MATCH (g:Gene)-[:ENCODES]->(p:Protein)-[:ASSOCIATED_WITH]->(d:Disease)<-[:TREATS]-(dr:Drug)
    RETURN g.gene_name, p.protein_name, d.disease_name, dr.drug_name
    LIMIT 3

🔍 This finds: Gene → Protein → Disease ← Drug pathways

✅ Found 3 complete pathways:
  1. F8 → F8 → Spinal_Cord_Injury ← NOV-123
  2. F8 → F8 → Spinal_Cord_Injury ← Risedronate
  3. F9 → F9 → Multiple_Sclerosis ← NOV-322


### 🎯 Exercise 1: Write Your First Query

Now it's your turn! Try writing a query to find drugs that treat diseases whose names contain "hypertension".

**Hint**: 
- Use the pattern `(dr:Drug)-[:TREATS]->(d:Disease)`
- Filter with `WHERE toLower(d.disease_name) CONTAINS 'hypertension'`
- Don't forget `LIMIT` to keep results manageable
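
One optional refinement: instead of typing the search term directly into the query text, Cypher lets you refer to a parameter such as `$term`. The sketch below only builds the query text and a parameter dictionary; `build_drug_query` is a hypothetical helper, and whether `GraphInterface.execute_query` accepts parameters is not shown in this tutorial, so treat the hand-off as an assumption. (The official Neo4j Python driver's `session.run` does accept a parameter dictionary.)

```python
def build_drug_query(term, limit=5):
    """Return Cypher text plus parameters, keeping user input out of the query string."""
    query = (
        "MATCH (dr:Drug)-[:TREATS]->(d:Disease)\n"
        "WHERE toLower(d.disease_name) CONTAINS toLower($term)\n"
        "RETURN dr.drug_name, d.disease_name, dr.type\n"
        "LIMIT $limit"
    )
    # LIMIT expects an integer, so coerce defensively.
    return query, {"term": term, "limit": int(limit)}


query, params = build_drug_query("hypertension")
print(query)
print(params)
```

Keeping the term in a parameter means the same query text works for "diabetes" or "coronary" without string editing, and quotes inside a search term cannot break the query.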

In [45]:
# 🎯 Exercise 1: Find drugs that treat hypertension-related diseases
if graph_db:
    print("🎯 Exercise 1: Find Drugs for Hypertension")
    print("=" * 40)
    
    # Your query here - try to write it yourself first!
    your_query = """
    MATCH (dr:Drug)-[:TREATS]->(d:Disease)
    WHERE toLower(d.disease_name) CONTAINS 'hypertension'
    RETURN dr.drug_name, d.disease_name, dr.type
    LIMIT 5
    """
    
    print(f"📝 Query: {your_query.strip()}")
    
    try:
        results = graph_db.execute_query(your_query)
        if results:
            print(f"\n✅ Found {len(results)} drugs for hypertension:")
            for i, row in enumerate(results, 1):
                drug_type = row.get('dr.type', 'Unknown type')
                print(f"  {i}. {row['dr.drug_name']} ({drug_type}) treats {row['d.disease_name']}")
        else:
            print("\n🤔 No results found. Try these variations:")
            print("   • Search for 'coronary' (heart disease)")
            print("   • Search for 'diabetes' (metabolic disorder)")
            print("   • Remove the WHERE clause to see all drug-disease relationships")
            
    except Exception as e:
        print(f"❌ Query failed: {e}")
        print("💡 Check your query syntax - labels, relationship types, and property names are case-sensitive!")
        
    # Helpful variations to try
    print("\n💡 Try modifying the query above to search for:")
    print("   • 'diabetes' instead of 'hypertension'")
    print("   • 'cardiovascular' for heart-related diseases")
    print("   • Remove WHERE clause to see all drug-disease pairs")
        
else:
    print("⚠️ Database not connected. Please fix connection issues first.")

🎯 Exercise 1: Find Drugs for Hypertension
📝 Query: MATCH (dr:Drug)-[:TREATS]->(d:Disease)
    WHERE toLower(d.disease_name) CONTAINS 'hypertension'
    RETURN dr.drug_name, d.disease_name, dr.type
    LIMIT 5

✅ Found 3 drugs for hypertension:
  1. NOV-147 (protein_hormone) treats Hypertension
  2. Quetiapine (small_molecule) treats Hypertension
  3. Glipizide (small_molecule) treats Pulmonary_Hypertension

💡 Try modifying the query above to search for:
   • 'diabetes' instead of 'hypertension'
   • 'cardiovascular' for heart-related diseases
   • Remove WHERE clause to see all drug-disease pairs


## Part 4: What is LangGraph? 🌊

Great! Now you understand knowledge graphs. But what if we want an AI agent that generates and executes these queries automatically from natural language questions?

### The Challenge: Complex AI Workflows

Imagine building an AI that can:
1. **Understand** a natural language question like "What drugs treat hypertension?"
2. **Extract** important entities ("drugs", "hypertension")
3. **Generate** the right Cypher query
4. **Execute** the query against our graph
5. **Format** results into a natural language response

Each step depends on the previous ones, and we need to manage **state** (information) flowing between steps.

### LangGraph: AI Workflow Engine

LangGraph helps you build **multi-step AI workflows** with:
- **Nodes**: Individual processing steps (functions)
- **Edges**: How steps connect and flow
- **State**: Shared information that flows between steps

### Our Workflow Architecture

```
Natural Language Question
           ↓
    [1. CLASSIFY]
    What type of question?
           ↓
    [2. EXTRACT]
    Find biomedical entities
           ↓
    [3. GENERATE]
    Create Cypher query
           ↓
    [4. EXECUTE]
    Run against database
           ↓
    [5. FORMAT]
    Natural language answer
```

Each step updates a **shared state** that flows to the next step!
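
Before looking at the real agent, it can help to see the idea without any library. The following is a plain-Python stand-in, not LangGraph's API: each step is a function that reads the state and returns the fields it wants to update, and a hypothetical `run_pipeline` helper merges those updates in order. The keyword rules replace the LLM calls purely for illustration.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


def classify(state: State) -> State:
    # Stand-in for an LLM call: a keyword rule picks the question type.
    question = str(state["user_question"]).lower()
    return {"question_type": "drug_treatment" if "drug" in question else "general"}


def extract(state: State) -> State:
    # Stand-in for entity extraction: keep capitalized words after the first word.
    words = str(state["user_question"]).rstrip("?").split()
    return {"entities": [w for w in words[1:] if w[0].isupper()]}


def run_pipeline(steps: List[Callable[[State], State]], state: State) -> State:
    """Run steps in order, merging each step's updates into a fresh state dict."""
    for step in steps:
        state = {**state, **step(state)}
    return state


final = run_pipeline([classify, extract], {"user_question": "What drugs treat Hypertension?"})
print(final["question_type"], final["entities"])  # drug_treatment ['Hypertension']
```

LangGraph adds what this sketch lacks, such as a declared state schema, named nodes and edges, and a compiled graph object, but the flow of state from step to step is the same idea.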

In [46]:
# Let's set up our AI agent that uses LangGraph workflows
print("🤖 Setting Up AI Agent with LangGraph")
print("=" * 40)

# First, check if we have the Anthropic API key
anthropic_key = os.getenv("ANTHROPIC_API_KEY")

if not anthropic_key:
    print("⚠️ AI Setup Required:")
    print("   1. Get an API key at: https://console.anthropic.com/")
    print("   2. Add to your .env file: ANTHROPIC_API_KEY=sk-ant-your-key-here")
    print("\n💡 Without this, you can still explore the code but can't run the AI agent.")
    ai_agent = None
    
elif not graph_db:
    print("⚠️ Need database connection first. Fix Neo4j connection above.")
    ai_agent = None
    
else:
    try:
        # Import and create our WorkflowAgent
        from src.agents.workflow_agent import WorkflowAgent
        
        print("🔧 Initializing WorkflowAgent (this may take a moment)...")
        ai_agent = WorkflowAgent(graph_db, anthropic_key)
        
        print("✅ AI Agent ready!")
        print("🎓 Using WorkflowAgent - the same one used in the main web application")
        print("🧠 This agent demonstrates educational LangGraph workflows with detailed logging")
        
        # Show what the agent can do
        print("\n🎯 The agent can answer questions like:")
        print("   • What drugs treat hypertension?")
        print("   • What protein does gene F8 encode?")
        print("   • What diseases are in the cardiovascular category?")
        
    except ImportError as e:
        print(f"❌ Import failed: {e}")
        print("💡 Make sure you're running from the correct directory")
        ai_agent = None
    except Exception as e:
        print(f"❌ Agent initialization failed: {e}")
        print("💡 Check your API key and network connection")
        ai_agent = None

🤖 Setting Up AI Agent with LangGraph
🔧 Initializing WorkflowAgent (this may take a moment)...
✅ AI Agent ready!
🎓 Using WorkflowAgent - the same one used in the main web application
🧠 This agent demonstrates educational LangGraph workflows with detailed logging

🎯 The agent can answer questions like:
   • What drugs treat hypertension?
   • What protein does gene F8 encode?
   • What diseases are in the cardiovascular category?


## Part 5: AI Agent in Action 🚀

Now let's see our LangGraph-powered AI agent in action! We'll ask it a question and watch how it processes the request through multiple steps.

In [47]:
# Let's ask our AI agent a question and see the complete workflow!
if ai_agent:
    # Choose a question that should work well with our data
    question = "What drugs treat Hypertension?"
    
    print("🎓 Running LangGraph AI Workflow")
    print("=" * 50)
    print(f"❓ Question: {question}")
    print("\n⏳ Processing through workflow steps...")
    print("(Watch the console output to see each step!)")
    
    try:
        # Run the agent - this will execute the full LangGraph workflow
        result = ai_agent.answer_question(question)
        
        # Display the complete results
        print("\n📋 Complete Workflow Results:")
        print("=" * 50)
        
        # Basic information
        print(f"🏷️ Question Type: {result.get('question_type', 'Not classified')}")
        print(f"🧬 Entities Found: {result.get('entities', [])}")
        print(f"📊 Results Count: {result.get('results_count', 0)}")
        
        # The generated query (truncated if too long)
        query = result.get('cypher_query', 'No query generated')
        if len(query) > 100:
            query_display = query[:100] + "..."
        else:
            query_display = query
        print(f"🔧 Generated Query: {query_display}")
        
        # The final answer
        answer = result.get('answer', 'No answer generated')
        print("\n✅ Final Answer:")
        print(f"   {answer}")
        
        # Show any errors
        if result.get('error'):
            print(f"\n⚠️ Error occurred: {result['error']}")
            
        # Show sample raw results for learning
        if result.get('raw_results'):
            print("\n🔬 Sample Database Results (first 2):")
            for i, raw_result in enumerate(result['raw_results'][:2], 1):
                print(f"  {i}. {raw_result}")
                
    except Exception as e:
        print(f"\n❌ Workflow failed: {e}")
        print("💡 This could be due to API limits, network issues, or database problems")
        
else:
    print("⚠️ AI agent not available. Please fix setup issues above.")
    print("\n🤔 What would happen if the agent was working:")
    print("   1. Classify: 'drug_treatment' question type")
    print("   2. Extract: ['Hypertension'] entities")
    print("   3. Generate: MATCH (dr:Drug)-[:TREATS]->(d:Disease) WHERE ... query")
    print("   4. Execute: Run query against Neo4j database")
    print("   5. Format: Generate natural language answer")

🎓 Running LangGraph AI Workflow
❓ Question: What drugs treat Hypertension?

⏳ Processing through workflow steps...
(Watch the console output to see each step!)

🎓 Learning Workflow Starting...
Question: What drugs treat Hypertension?

🔍 Question classified as: drug_treatment
🧬 Found entities: ['Hypertension']
🔧 Generated query: MATCH (drug:Drug)-[:TREATS]->(disease:Disease)
WHERE toLower(disease.disease_name) CONTAINS toLower("Hypertension")
RETURN drug.drug_name, disease.disease_name
LIMIT 10
📊 Found 3 results
✅ Generated final answer

🎯 Workflow Complete!


📋 Complete Workflow Results:
🏷️ Question Type: drug_treatment
🧬 Entities Found: ['Hypertension']
📊 Results Count: 3
🔧 Generated Query: MATCH (drug:Drug)-[:TREATS]->(disease:Disease)
WHERE toLower(disease.disease_name) CONTAINS toLower(...

✅ Final Answer:
   Based on the database search, **3 drugs** were found that treat hypertension-related conditions:

## Drugs for Hypertension Treatment:

**1. NOV-147** - treats Hypertension
**

In [48]:
# Let's try a few different questions to see how the agent handles variety
if ai_agent:
    # Different types of questions to demonstrate the agent's versatility
    test_questions = [
        "What protein does gene F8 encode?",  # Gene-protein relationship
        "What diseases are in the cardiovascular category?",  # Disease categorization
        "What drugs have small_molecule type?",  # Drug classification
    ]
    
    print("🧪 Testing Different Question Types")
    print("=" * 40)
    
    for i, question in enumerate(test_questions, 1):
        print(f"\n🔍 Test {i}: {question}")
        print("-" * 50)
        
        try:
            result = ai_agent.answer_question(question)
            
            # Show key results
            print(f"📊 Found {result.get('results_count', 0)} results")
            answer = result.get('answer', 'No answer generated')
            # Truncate very long answers
            if len(answer) > 150:
                answer = answer[:150] + "..."
            print(f"💬 Answer: {answer}")
            
            if result.get('error'):
                print(f"⚠️ Error: {result['error']}")
                
        except Exception as e:
            print(f"❌ Failed: {e}")
    
    print("\n💡 Notice how the agent:")
    print("   • Classifies different question types automatically")
    print("   • Extracts relevant entities from each question")
    print("   • Generates appropriate Cypher queries")
    print("   • Formats results into natural language")
            
else:
    print("⚠️ AI agent not available. Please fix setup issues above.")

🧪 Testing Different Question Types

🔍 Test 1: What protein does gene F8 encode?
--------------------------------------------------

🎓 Learning Workflow Starting...
Question: What protein does gene F8 encode?

🔍 Question classified as: protein_function
🧬 Found entities: ['F8']
🔧 Generated query: MATCH (g:Gene)-[:ENCODES]->(p:Protein)
WHERE toLower(g.gene_name) CONTAINS toLower("F8")
RETURN p.protein_name
LIMIT 10
📊 Found 1 results
✅ Generated final answer

🎯 Workflow Complete!

📊 Found 1 results
💬 Answer: Based on the database search, **1 result was found** for gene F8.

The F8 gene encodes a protein also called **F8** (coagulation factor VIII). This pr...

🔍 Test 2: What diseases are in the cardiovascular category?
--------------------------------------------------

🎓 Learning Workflow Starting...
Question: What diseases are in the cardiovascular category?

🔍 Question classified as: general
🧬 No specific entities found
🔧 Generated query: MATCH (d:Disease)
WHERE toLower(d.category) CONT

## Part 6: Understanding the Workflow 🔄

Let's examine how the LangGraph workflow actually works by looking at the code and understanding state transitions.

In [49]:
# Let's examine how the LangGraph workflow is built
if ai_agent:
    import inspect
    
    print("🏗️ How the LangGraph Workflow is Built")
    print("=" * 50)
    
    try:
        # Get the workflow creation code
        workflow_code = inspect.getsource(ai_agent._create_workflow)
        print("📝 Workflow Creation Code:")
        print(workflow_code)
        
    except Exception as e:
        print(f"Could not inspect workflow code: {e}")
        print("\n🎓 The workflow is built using LangGraph's StateGraph:")
        
    print("\n🔧 Workflow Construction Steps:")
    print("  1. Create StateGraph with WorkflowState definition")
    print("  2. Add nodes for each processing step:")
    print("     • classify_question - Determines question type")
    print("     • extract_entities - Finds biomedical terms")
    print("     • generate_query - Creates Cypher queries") 
    print("     • execute_query - Runs queries against Neo4j")
    print("     • format_answer - Converts results to natural language")
    print("  3. Connect nodes with edges in sequence")
    print("  4. Set entry point and compile the workflow")
    
    # Show available methods
    print("\n🔍 Available Methods in WorkflowAgent:")
    methods = [method for method in dir(ai_agent) 
               if not method.startswith('_') and callable(getattr(ai_agent, method))]
    for method in methods[:10]:  # Show first 10 to keep it manageable
        print(f"  • {method}()")
    
    if len(methods) > 10:
        print(f"  ... and {len(methods) - 10} more methods")
        
else:
    print("⚠️ AI agent not available. Please fix setup issues above.")

🏗️ How the LangGraph Workflow is Built
📝 Workflow Creation Code:
    def _create_workflow(self):
        """
        Create our LangGraph workflow - the heart of the educational agent.

        This method demonstrates core LangGraph concepts:
        - Building a state graph with connected processing nodes
        - Defining workflow steps and their relationships
        - Creating a linear pipeline for educational clarity

        Think of this as building a flowchart where each step processes
        the shared state and passes it to the next step.
        """
        # Step 1: Create the graph structure
        workflow = StateGraph(WorkflowState)

        # Step 2: Add our processing nodes (each node is a function)
        workflow.add_node("classify", self.classify_question)
        workflow.add_node("extract", self.extract_entities)
        workflow.add_node("generate", self.generate_query)
        workflow.add_node("execute", self.execute_query)
        workflow.add_node("forma

In [50]:
# Let's look at one workflow step in detail
if ai_agent:
    import inspect
    
    print("🔍 Example: The Entity Extraction Step")
    print("=" * 40)
    
    try:
        # Get the entity extraction method
        extract_code = inspect.getsource(ai_agent.extract_entities)
        print("📝 Entity Extraction Code:")
        print(extract_code)
        
    except Exception as e:
        print(f"Could not inspect extraction code: {e}")
        
    print("\n🎓 What Entity Extraction Does:")
    print("  1. Takes the user question from the workflow state")
    print("  2. Uses Claude AI to identify biomedical entities")
    print("  3. Looks for: gene names, protein names, disease names, drug names")
    print("  4. Updates the state with found entities")
    print("  5. Returns the updated state to the next workflow step")
    
    print("\n💡 Key Concepts:")
    print("  • Each step is a function of the state: input state → output state")
    print("  • State flows sequentially through all steps")
    print("  • AI models are used for complex reasoning within steps")
    print("  • Error handling ensures the workflow doesn't break")
        
else:
    print("⚠️ AI agent not available. Please fix setup issues above.")

🔍 Example: The Entity Extraction Step
📝 Entity Extraction Code:
    def extract_entities(self, state: WorkflowState) -> WorkflowState:
        """
        Step 2: Find the important biomedical terms in the question.

        These are the specific things we'll search for in our knowledge graph.
        Examples: "diabetes", "GENE_ALPHA", "aspirin"
        """
        prompt = f"""
        Extract the important biomedical terms from this question.
        Look for specific names of:
        - Genes (like GENE_ALPHA, BRCA1)
        - Diseases (like diabetes, cancer)
        - Drugs (like aspirin, AlphaCure)
        - Proteins (like PROT_BETA, insulin)

        Question: {state['user_question']}

        Return just a simple list like: ["diabetes", "GENE_ALPHA"]
        If you don't find any specific terms, return: []
        """

        response = self.anthropic.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=100,
            messages=[{"role": "use
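
The printed source is cut off above, so how the agent reads the model's reply is not visible here. As a hedged sketch of one reasonable approach, the hypothetical `parse_entity_list` helper below pulls the first JSON-style list out of a reply and falls back to an empty list when nothing parses. It is not the tutorial's actual implementation.

```python
import json
import re


def parse_entity_list(text):
    """Pull the first JSON-style list of strings out of a model reply; return [] on failure."""
    match = re.search(r"\[.*?\]", text, flags=re.DOTALL)
    if not match:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    # Keep only non-empty strings so later steps can rely on the element type.
    return [item.strip() for item in items if isinstance(item, str) and item.strip()]


print(parse_entity_list('Here you go: ["diabetes", "GENE_ALPHA"]'))  # ['diabetes', 'GENE_ALPHA']
print(parse_entity_list("no list here"))  # []
```

Returning an empty list on any parsing problem matches the prompt's own convention of `[]` for "no specific terms", so downstream steps only have one "nothing found" case to handle.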

### State Transitions in Detail

Here's exactly what happens when we ask: **"What drugs treat Hypertension?"**

```python
# Step 0: Initial State
state = {
    "user_question": "What drugs treat Hypertension?",
    "question_type": None,      # Will be filled by classify step
    "entities": None,           # Will be filled by extract step  
    "cypher_query": None,       # Will be filled by generate step
    "results": None,            # Will be filled by execute step
    "final_answer": None,       # Will be filled by format step
    "error": None               # Used for error handling
}

# Step 1: After classify_question
state["question_type"] = "drug_treatment"  # ✅ AI classified it

# Step 2: After extract_entities  
state["entities"] = ["Hypertension"]       # ✅ AI found the disease

# Step 3: After generate_query
state["cypher_query"] = "MATCH (dr:Drug)-[:TREATS]->(d:Disease) WHERE ..."  # ✅ AI generated query

# Step 4: After execute_query
state["results"] = [{"drug_name": "Lisinopril"}, {"drug_name": "Enalapril"}]  # ✅ Database results

# Step 5: After format_answer
state["final_answer"] = "Based on the database, several drugs treat Hypertension..."  # ✅ Natural language answer
```
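
The transitions above can be sketched as a chain of plain functions, each taking the state and returning an updated copy. This is a minimal illustration rather than the tutorial's actual agent: the step bodies are hard-coded stubs standing in for the AI and database calls, and `STEPS` and `run_workflow` are hypothetical names introduced here.

```python
from typing import Any, Dict, List, Optional, TypedDict


class WorkflowState(TypedDict):
    user_question: str
    question_type: Optional[str]
    entities: Optional[List[str]]
    cypher_query: Optional[str]
    results: Optional[List[Dict[str, Any]]]
    final_answer: Optional[str]
    error: Optional[str]


def classify_question(state: WorkflowState) -> WorkflowState:
    # Stub: the real step would ask the AI model to classify the question.
    return {**state, "question_type": "drug_treatment"}


def extract_entities(state: WorkflowState) -> WorkflowState:
    # Stub: the real step would ask the AI model for biomedical terms.
    return {**state, "entities": ["Hypertension"]}


def generate_query(state: WorkflowState) -> WorkflowState:
    # Stub: the query text is kept elided, exactly as in the walkthrough.
    query = "MATCH (dr:Drug)-[:TREATS]->(d:Disease) WHERE ..."
    return {**state, "cypher_query": query}


def execute_query(state: WorkflowState) -> WorkflowState:
    # Stub: canned rows instead of a database call.
    rows = [{"drug_name": "Lisinopril"}, {"drug_name": "Enalapril"}]
    return {**state, "results": rows}


def format_answer(state: WorkflowState) -> WorkflowState:
    names = ", ".join(row["drug_name"] for row in state["results"] or [])
    return {**state, "final_answer": f"Drugs found: {names}"}


# Hypothetical ordered list of steps; each one consumes the previous output.
STEPS = [classify_question, extract_entities, generate_query,
         execute_query, format_answer]


def run_workflow(question: str) -> WorkflowState:
    state: WorkflowState = {
        "user_question": question,
        "question_type": None,
        "entities": None,
        "cypher_query": None,
        "results": None,
        "final_answer": None,
        "error": None,
    }
    for step in STEPS:
        state = step(state)
    return state


final_state = run_workflow("What drugs treat Hypertension?")
print(final_state["final_answer"])
```

Because every step returns a new dictionary instead of editing the old one, you can keep each intermediate state around and compare them while debugging.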

### Key LangGraph Concepts

1. **Explicit State Flow**: Each step receives the current state and returns the updated state, so nothing is passed through hidden variables
2. **Sequential Processing**: Steps run in a defined order
3. **Error Handling**: Any step can set the "error" field to handle problems gracefully
4. **Transparency**: We can inspect the state at any point for debugging and learning
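
The error-handling and transparency points can be illustrated with a small wrapper. This is only a sketch under assumptions: `guarded` and `run_with_trace` are hypothetical helpers, not part of LangGraph or of this project, and the database failure is simulated.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Step = Callable[[State], State]


def guarded(step: Step) -> Step:
    """Skip the step if an earlier one failed; record this step's own failure."""
    def wrapper(state: State) -> State:
        if state.get("error"):
            return state  # pass the state through untouched
        try:
            return step(state)
        except Exception as exc:
            return {**state, "error": f"{step.__name__}: {exc}"}
    return wrapper


def extract_entities(state: State) -> State:
    return {**state, "entities": ["Hypertension"]}


def execute_query(state: State) -> State:
    # Simulated failure so the example shows the error path.
    raise ConnectionError("database unavailable")


def format_answer(state: State) -> State:
    return {**state, "final_answer": "This text is never produced in the demo"}


def run_with_trace(steps: List[Step], state: State) -> Tuple[State, List[Tuple[str, State]]]:
    """Run the steps in order and keep a snapshot of the state after each one."""
    trace: List[Tuple[str, State]] = []
    for step in steps:
        state = guarded(step)(state)
        trace.append((step.__name__, dict(state)))
    return state, trace


initial = {"user_question": "What drugs treat Hypertension?", "error": None}
final, trace = run_with_trace([extract_entities, execute_query, format_answer], initial)

for name, snapshot in trace:
    print(f"{name}: error={snapshot['error']!r}")
```

The trace makes it easy to see which step set the `error` field and that later steps left the state unchanged.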

## Part 7: Hands-on Exercises 🏋️‍♀️

Now it's time to get hands-on! These exercises will help you understand both direct graph querying and AI-powered approaches.

In [51]:
# 🎯 Exercise 2: Compare Direct Queries vs AI Agent
if graph_db:
    print("🎯 Exercise 2: Direct Query vs AI Agent Comparison")
    print("=" * 55)
    
    # Question to test
    question = "What are the disease categories in the database?"
    
    print(f"❓ Question: {question}")
    
    # Method 1: Direct Cypher query
    print("\n🔧 Method 1: Direct Cypher Query")
    print("-" * 30)
    
    direct_query = """
    MATCH (d:Disease) 
    RETURN d.category, count(*) as count 
    ORDER BY count DESC
    """
    
    try:
        direct_results = graph_db.execute_query(direct_query)
        print(f"✅ Direct query found {len(direct_results)} categories:")
        for row in direct_results[:5]:  # Show top 5
            print(f"  • {row['d.category']}: {row['count']} diseases")
            
    except Exception as e:
        print(f"❌ Direct query failed: {e}")
    
    # Method 2: AI Agent
    if ai_agent:
        print("\n🤖 Method 2: AI Agent")
        print("-" * 20)
        
        try:
            ai_result = ai_agent.answer_question(question)
            print("✅ AI agent response:")
            answer = ai_result.get('answer', 'No answer generated')
            print(f"   {answer}")
            
            # Show what query the AI generated
            ai_query = ai_result.get('cypher_query', '')
            if ai_query:
                print(f"\n🔍 AI Generated Query: {ai_query[:100]}{'...' if len(ai_query) > 100 else ''}")
                
        except Exception as e:
            print(f"❌ AI agent failed: {e}")
    else:
        print("\n🤖 Method 2: AI Agent (not available)")
        print("   Would analyze the question and generate a similar query automatically")
    
    print("\n🤔 Comparison:")
    print("   Direct Query: Fast, precise, requires Cypher knowledge")
    print("   AI Agent: Flexible, natural language, but always review the query it generates")
    
else:
    print("⚠️ Database not connected. Please fix connection issues first.")

🎯 Exercise 2: Direct Query vs AI Agent Comparison
❓ Question: What are the disease categories in the database?

🔧 Method 1: Direct Cypher Query
------------------------------
✅ Direct query found 15 categories:
  • oncology: 20 diseases
  • infectious: 18 diseases
  • cardiovascular: 15 diseases
  • neurological: 15 diseases
  • psychiatric: 15 diseases

🤖 Method 2: AI Agent
--------------------

🎓 Learning Workflow Starting...
Question: What are the disease categories in the database?

🔍 Question classified as: general
🧬 Found entities: []
🔧 Generated query: MATCH (d:Disease)
RETURN DISTINCT d.category AS disease_category
LIMIT 10
📊 Found 10 results
✅ Generated final answer

🎯 Workflow Complete!

✅ AI agent response:
   Based on the database query, there are **5 distinct disease categories** available in the database (from a total of 10 results found):

## Disease Categories:
- **Cardiovascular** - diseases affecting the heart and blood vessels
- **Oncology** - cancer-related diseases
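
In the run above, the direct query reported 15 categories while the agent's generated query included `LIMIT 10`, so the two methods need not agree. One way to check is to compare the category sets side by side. The sketch below assumes you have both row lists available as lists of dictionaries; whether `ai_result` exposes its raw rows is not shown in this tutorial, so the AI rows here are made-up sample data, and `category_set` and `compare_categories` are hypothetical helpers.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]


def category_set(rows: List[Row], key: str) -> Set[str]:
    """Collect the non-null values under `key`, lowercased for comparison."""
    return {str(row[key]).lower() for row in rows if row.get(key) is not None}


def compare_categories(direct_rows: List[Row], ai_rows: List[Row],
                       direct_key: str = "d.category",
                       ai_key: str = "disease_category") -> Dict[str, List[str]]:
    direct = category_set(direct_rows, direct_key)
    ai = category_set(ai_rows, ai_key)
    return {
        "shared": sorted(direct & ai),
        "only_direct": sorted(direct - ai),
        "only_ai": sorted(ai - direct),
    }


# First three rows copied from the direct-query output above.
direct_rows = [
    {"d.category": "oncology", "count": 20},
    {"d.category": "infectious", "count": 18},
    {"d.category": "cardiovascular", "count": 15},
]
# Sample rows only; the key matches the alias in the AI-generated query.
ai_rows = [
    {"disease_category": "Cardiovascular"},
    {"disease_category": "Oncology"},
]

comparison = compare_categories(direct_rows, ai_rows)
print(comparison)
```

Anything listed under `only_direct` is a category the agent's answer would have missed.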

In [52]:
# 🎯 Exercise 3: Build Your Own Simple Query Function
if graph_db:
    print("🎯 Exercise 3: Build a Simple Query Function")
    print("=" * 45)
    
    def simple_biomedical_query(question_type, entity_name):
        """
        A simplified query function for learning purposes.
        
        Args:
            question_type: 'drugs_treat', 'gene_encodes', 'high_weight_proteins'
            entity_name: The specific entity to search for
        """
        
        if question_type == 'drugs_treat':
            query = f"""
            MATCH (dr:Drug)-[:TREATS]->(d:Disease)
            WHERE toLower(d.disease_name) CONTAINS toLower('{entity_name}')
            RETURN dr.drug_name, d.disease_name, dr.type
            LIMIT 5
            """
        elif question_type == 'gene_encodes':
            query = f"""
            MATCH (g:Gene)-[:ENCODES]->(p:Protein)
            WHERE g.gene_name = '{entity_name}'
            RETURN g.gene_name, p.protein_name, p.molecular_weight
            """
        elif question_type == 'high_weight_proteins':
            query = f"""
            MATCH (p:Protein)
            WHERE p.molecular_weight > {entity_name}
            RETURN p.protein_name, p.molecular_weight
            ORDER BY p.molecular_weight DESC
            LIMIT 5
            """
        else:
            return f"❌ Unknown question type: {question_type}"
        
        try:
            results = graph_db.execute_query(query)
            return results
        except Exception as e:
            return f"❌ Query error: {e}"
    
    # Test the function
    print("🧪 Testing Simple Query Function:")
    print("-" * 35)
    
    # Test 1: Drugs that treat hypertension
    result1 = simple_biomedical_query('drugs_treat', 'hypertension')
    if isinstance(result1, list) and result1:
        print(f"✅ Drugs for hypertension ({len(result1)} found):")
        for i, row in enumerate(result1[:3], 1):
            drug_type = row.get('dr.type', 'Unknown')
            print(f"  {i}. {row['dr.drug_name']} ({drug_type}) → {row['d.disease_name']}")
    else:
        print(f"Drugs for hypertension: {result1}")
    
    # Test 2: What does gene F8 encode?
    print("\n---")
    result2 = simple_biomedical_query('gene_encodes', 'F8')
    if isinstance(result2, list) and result2:
        print("✅ Gene F8 encodes:")
        for row in result2:
            weight = row.get('p.molecular_weight', 'Unknown')
            print(f"  • Protein {row['p.protein_name']} ({weight} kDa)")
    else:
        print(f"Gene F8 encoding: {result2}")
    
    print("\n💡 Try modifying the function above to:")
    print("   • Add new query types (e.g., 'disease_category')")
    print("   • Handle multiple entities at once")
    print("   • Add better error handling")
    print("   • Compare results with the full AI agent!")
    
else:
    print("⚠️ Database not connected. Please fix connection issues first.")

🎯 Exercise 3: Build a Simple Query Function
🧪 Testing Simple Query Function:
-----------------------------------
✅ Drugs for hypertension (3 found):
  1. NOV-147 (protein_hormone) → Hypertension
  2. Quetiapine (small_molecule) → Hypertension
  3. Glipizide (small_molecule) → Pulmonary_Hypertension

---
✅ Gene F8 encodes:
  • Protein F8 (122 kDa)

💡 Try modifying the function above to:
   • Add new query types (e.g., 'disease_category')
   • Handle multiple entities at once
   • Add better error handling
   • Compare results with the full AI agent!
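
As a starting point for the "better error handling" suggestion, the sketch below separates the query text from the user-supplied value using Cypher `$parameters` instead of f-string interpolation, which avoids broken or injected queries when the input contains quotes. It only builds the query and its parameters; whether `graph_db.execute_query` accepts a parameters argument is an assumption you would need to check in `GraphInterface`. `QUERY_TEMPLATES` and `build_query` are hypothetical names.

```python
from typing import Any, Dict, Tuple

# Same three query types as the exercise, with $placeholders for user input.
QUERY_TEMPLATES: Dict[str, str] = {
    "drugs_treat": (
        "MATCH (dr:Drug)-[:TREATS]->(d:Disease) "
        "WHERE toLower(d.disease_name) CONTAINS toLower($entity) "
        "RETURN dr.drug_name, d.disease_name, dr.type LIMIT 5"
    ),
    "gene_encodes": (
        "MATCH (g:Gene)-[:ENCODES]->(p:Protein) "
        "WHERE g.gene_name = $entity "
        "RETURN g.gene_name, p.protein_name, p.molecular_weight"
    ),
    "high_weight_proteins": (
        "MATCH (p:Protein) "
        "WHERE p.molecular_weight > $min_weight "
        "RETURN p.protein_name, p.molecular_weight "
        "ORDER BY p.molecular_weight DESC LIMIT 5"
    ),
}


def build_query(question_type: str, entity_name: Any) -> Tuple[str, Dict[str, Any]]:
    """Return (query, parameters); raise ValueError for unsupported input."""
    if question_type not in QUERY_TEMPLATES:
        raise ValueError(f"Unknown question type: {question_type}")
    if question_type == "high_weight_proteins":
        # float() raises ValueError for non-numeric strings such as "heavy".
        params: Dict[str, Any] = {"min_weight": float(entity_name)}
    else:
        params = {"entity": str(entity_name)}
    return QUERY_TEMPLATES[question_type], params


query, params = build_query("drugs_treat", "hypertension")
weight_query, weight_params = build_query("high_weight_proteins", "50")
print(params, weight_params)
```

Raising `ValueError` instead of returning an error string also means callers no longer need the `isinstance(result, list)` check to tell success from failure.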


In [53]:
# 🎯 Exercise 4: Advanced Query Challenges
if graph_db:
    print("🎯 Exercise 4: Advanced Query Challenges")
    print("=" * 40)
    
    challenges = {
        "A": {
            "task": "Find all proteins associated with cardiovascular diseases",
            "query": """
            MATCH (p:Protein)-[:ASSOCIATED_WITH]->(d:Disease)
            WHERE d.category = 'cardiovascular'
            RETURN p.protein_name, d.disease_name
            LIMIT 10
            """
        },
        "B": {
            "task": "Find drugs targeting high molecular weight proteins (>50 kDa)",
            "query": """
            MATCH (dr:Drug)-[:TARGETS]->(p:Protein)
            WHERE p.molecular_weight > 50
            RETURN dr.drug_name, p.protein_name, p.molecular_weight
            ORDER BY p.molecular_weight DESC
            LIMIT 10
            """
        },
        "C": {
            "task": "Find complete pathways: Gene → Protein → Disease (for gene F8)",
            "query": """
            MATCH (g:Gene)-[:ENCODES]->(p:Protein)-[:ASSOCIATED_WITH]->(d:Disease)
            WHERE g.gene_name = 'F8'
            RETURN g.gene_name, p.protein_name, d.disease_name
            """
        }
    }
    
    print("📝 Challenge Questions:")
    for key, challenge in challenges.items():
        print(f"   {key}) {challenge['task']}")
    
    # Test Challenge A automatically
    print("\n🧪 Testing Challenge A:")
    print("-" * 25)
    
    try:
        results = graph_db.execute_query(challenges['A']['query'])
        if results:
            print(f"✅ Found {len(results)} protein-disease associations:")
            for i, row in enumerate(results[:3], 1):  # Show first 3
                print(f"  {i}. {row['p.protein_name']} → {row['d.disease_name']}")
            if len(results) > 3:
                print(f"  ... and {len(results) - 3} more")
        else:
            print("No results found. This can happen with synthetic data.")
            print("If the database is empty, run: pdm run load-data")
            
    except Exception as e:
        print(f"❌ Challenge A failed: {e}")
    
    print("\n💡 To try other challenges:")
    print("   • Copy the query from challenges['B']['query'] or challenges['C']['query']")
    print("   • Run: graph_db.execute_query(your_query)")
    print("   • Compare with what the AI agent would generate!")
    
    # If AI agent is available, show how it would handle these
    if ai_agent:
        print("\n🤖 How would the AI agent handle Challenge A?")
        try:
            ai_question = "What proteins are associated with cardiovascular diseases?"
            ai_result = ai_agent.answer_question(ai_question)
            print(f"🎯 AI Question: {ai_question}")
            print(f"📊 AI Found: {ai_result.get('results_count', 0)} results")
            
            ai_query = ai_result.get('cypher_query', '')
            if ai_query:
                print(f"🔍 AI Query (truncated): {ai_query[:80]}{'...' if len(ai_query) > 80 else ''}")
                
        except Exception as e:
            print(f"AI test failed: {e}")
    
else:
    print("⚠️ Database not connected. Please fix connection issues first.")

🎯 Exercise 4: Advanced Query Challenges
📝 Challenge Questions:
   A) Find all proteins associated with cardiovascular diseases
   B) Find drugs targeting high molecular weight proteins (>50 kDa)
   C) Find complete pathways: Gene → Protein → Disease (for gene F8)

🧪 Testing Challenge A:
-------------------------
✅ Found 10 protein-disease associations:
  1. NUCL475_iso2 → Hypertension
  2. FUNC407_iso1 → Hypertension
  3. CELL221 → Hypertension
  ... and 7 more

💡 To try other challenges:
   • Copy the query from challenges['B']['query'] or challenges['C']['query']
   • Run: graph_db.execute_query(your_query)
   • Compare with what the AI agent would generate!

🤖 How would the AI agent handle Challenge A?

🎓 Learning Workflow Starting...
Question: What proteins are associated with cardiovascular diseases?

🔍 Question classified as: protein_function
🧬 Found entities: ['proteins', 'cardiovascular diseases']
🔧 Generated query: MATCH (p:Protein)-[:ASSOCIATED_WITH]-(d:Disease)
WHERE toLower
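
Instead of copying challenges B and C by hand, you could loop over the whole dictionary. The sketch below is one possible shape for that loop: `run_challenges` is a hypothetical helper, and it takes the executor as an argument so it can be tried without a database. In the notebook you would pass `graph_db.execute_query`; the `fake_execute` function and its rows here are invented for the demo.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def run_challenges(challenges: Dict[str, Dict[str, str]],
                   execute: Callable[[str], List[Row]],
                   preview: int = 3) -> Dict[str, Dict[str, Any]]:
    """Run every challenge query and collect a count, a preview, or the error."""
    summary: Dict[str, Dict[str, Any]] = {}
    for key, challenge in challenges.items():
        entry: Dict[str, Any] = {"task": challenge["task"], "count": 0,
                                 "preview": [], "error": None}
        try:
            rows = execute(challenge["query"])
            entry["count"] = len(rows)
            entry["preview"] = rows[:preview]
        except Exception as exc:
            entry["error"] = str(exc)
        summary[key] = entry
    return summary


def fake_execute(query: str) -> List[Row]:
    # Stand-in executor with invented data, used only for this demo.
    if "TARGETS" in query:
        return [{"dr.drug_name": "DrugX", "p.protein_name": "ProtY"}]
    raise RuntimeError("no such relationship in this toy database")


demo_challenges = {
    "B": {"task": "Drugs targeting heavy proteins",
          "query": "MATCH (dr:Drug)-[:TARGETS]->(p:Protein) RETURN dr.drug_name, p.protein_name"},
    "C": {"task": "Gene to disease pathway",
          "query": "MATCH (g:Gene)-[:ENCODES]->(p:Protein) RETURN g.gene_name"},
}

summary = run_challenges(demo_challenges, fake_execute)
for key, info in summary.items():
    status = info["error"] or f"{info['count']} rows"
    print(f"{key}) {info['task']}: {status}")
```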

## Part 8: Real-World Applications and Next Steps 🚀

Congratulations! You've learned the fundamentals of knowledge graphs and LangGraph. Now let's see how these technologies are used in the real world.

In [54]:
# Let's compare different agent approaches available in Helix Navigator
print("🎓 Agent Comparison: Different Approaches in Helix Navigator")
print("=" * 60)

agent_info = {
    "WorkflowAgent": {
        "status": "🎓 ACTIVE (used in web app)",
        "description": "Educational LangGraph implementation", 
        "best_for": "Learning LangGraph workflows, step-by-step transparency",
        "complexity": "Medium - Good balance of features and understandability"
    },
    "Production Patterns": {
        "status": "📚 EXAMPLE (reference only)",
        "description": "Production-ready LangGraph patterns",
        "best_for": "Understanding production implementations, error handling",
        "complexity": "High - Full production features with comprehensive error handling"
    },
    "Direct Queries": {
        "status": "📚 EXAMPLE (reference only)", 
        "description": "Direct template approach without AI",
        "best_for": "Simple, predictable query patterns, fast responses",
        "complexity": "Low - Template-based, no AI reasoning required"
    }
}

for agent_name, info in agent_info.items():
    print(f"\n🤖 {agent_name}:")
    print(f"   Status: {info['status']}")
    print(f"   Description: {info['description']}")
    print(f"   Best for: {info['best_for']}")
    print(f"   Complexity: {info['complexity']}")

print("\n🤔 When to Use Each Approach:")
print("   • Learning/Education: WorkflowAgent (what we used today)")
print("   • Production Systems: Production Patterns")
print("   • Simple/Fast Queries: Direct Queries approach")
print("   • Custom Solutions: Combine approaches based on needs")

🎓 Agent Comparison: Different Approaches in Helix Navigator

🤖 WorkflowAgent:
   Status: 🎓 ACTIVE (used in web app)
   Description: Educational LangGraph implementation
   Best for: Learning LangGraph workflows, step-by-step transparency
   Complexity: Medium - Good balance of features and understandability

🤖 Production Patterns:
   Status: 📚 EXAMPLE (reference only)
   Description: Production-ready LangGraph patterns
   Best for: Understanding production implementations, error handling
   Complexity: High - Full production features with comprehensive error handling

🤖 Direct Queries:
   Status: 📚 EXAMPLE (reference only)
   Description: Direct template approach without AI
   Best for: Simple, predictable query patterns, fast responses
   Complexity: Low - Template-based, no AI reasoning required

🤔 When to Use Each Approach:
   • Learning/Education: WorkflowAgent (what we used today)
   • Production Systems: Production Patterns
   • Simple/Fast Queries: Direct Queries approach
   • Custom Solutions: Combine approaches based on needs

### 🌍 Real-World Applications

**Knowledge graphs and AI are used in:**

1. **Drug Discovery** 💊
   - Find new drug targets by analyzing protein-disease relationship networks
   - Predict drug side effects using molecular interaction graphs
   - Repurpose existing drugs for new indications through pathway analysis

2. **Personalized Medicine** 🧬
   - Match patients to treatments based on genetic profiles and disease networks
   - Predict disease risk using family history and genetic pathway data
   - Optimize treatment plans using patient response pattern graphs

3. **Clinical Decision Support** 🏥
   - Diagnostic assistance using symptom-disease-drug relationship graphs
   - Treatment recommendations based on evidence networks
   - Drug interaction checking across complex medication graphs

4. **Research Acceleration** 🔬
   - Literature mining to extract and connect biomedical relationships
   - Hypothesis generation through graph pattern discovery
   - Cross-domain connection identification between different research areas

### Industry Examples
- **Google**: Knowledge Graph powers search results and information panels
- **Amazon**: Product recommendation systems using collaborative filtering graphs
- **Meta**: Social network analysis and friend recommendation algorithms
- **Pharmaceutical Companies**: Drug discovery pipelines integrate multiple graph data sources
- **Healthcare Systems**: Clinical decision support using medical knowledge graphs

### 🚀 Immediate Next Steps

1. **Explore the Web Interface**: Run `pdm run app` to try the interactive Helix Navigator
2. **Visual Debugging**: Use `pdm run langgraph dev` to see your workflows in LangGraph Studio
3. **Modify the Code**: Try adding new question types or improving the agent responses
4. **Expand the Database**: Load more data and create new relationships

### 📚 Advanced Topics to Explore

1. **Graph Algorithms**: PageRank for node importance, community detection for clustering
2. **Advanced LangGraph**: Conditional edges, parallel processing, human-in-the-loop workflows
3. **Graph Neural Networks**: AI models that work directly on graph structure
4. **Real-time Updates**: Streaming data integration and incremental graph updates
5. **Large-scale Graphs**: Distributed processing for millions/billions of nodes
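
To give a feel for the first topic, here is a small pure-Python PageRank using power iteration. It is a teaching sketch, not a production implementation (graph databases and graph libraries ship optimized versions), and the toy graph with its node names is invented for illustration.

```python
from typing import Dict, List


def pagerank(graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a directed graph given as adjacency lists."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node (no outgoing edges): spread its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Invented toy graph: edges point from a node to the nodes it links to.
toy_graph = {
    "GeneA": ["ProteinA"],
    "ProteinA": ["DiseaseX"],
    "DrugB": ["ProteinA", "DiseaseX"],
    "DiseaseX": [],
}

ranks = pagerank(toy_graph)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.3f}")
```

In this toy graph the disease node ends up with the highest score because every path leads to it, which is the intuition behind using PageRank to surface heavily connected entities.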

### 🎓 Learning Resources

- **Neo4j Documentation**: https://neo4j.com/docs/ - Comprehensive Cypher and graph database guides
- **LangGraph Documentation**: https://langchain-ai.github.io/langgraph/ - Official framework documentation  
- **Helix Navigator Source**: Explore the `src/agents/` directory for implementation details
- **Graph Theory**: Courses on edX, Coursera, or Khan Academy for mathematical foundations
- **Biomedical Databases**: PubMed, UniProt, STRING for real-world data exploration

---

## 🎉 Congratulations!

You've completed the LangGraph and Knowledge Graphs tutorial! You now understand:

- ✅ **How knowledge graphs represent complex domain relationships**
- ✅ **How to write Cypher queries for graph databases**  
- ✅ **How LangGraph manages AI workflow state and processing**
- ✅ **How to build biomedical AI agents that combine natural language and databases**
- ✅ **Real-world applications of these technologies in research and industry**

**Keep experimenting with the Helix Navigator platform and building your own AI agents!** 🚀

### 💬 Questions or Issues?

- Check the documentation in `docs/`
- Explore the code examples in `src/agents/`
- Try the interactive exercises in `docs/exercises/`
- Run the web interface with `pdm run app`