# LangGraph and Knowledge Graphs Tutorial

## Learning Objectives 🎯

By the end of this tutorial, you will understand:

1. **Knowledge Graphs**: What they are and how they represent domain knowledge
2. **LangGraph**: How to build AI workflows with state management
3. **Biomedical Applications**: Real-world uses of AI + knowledge graphs
4. **Practical Implementation**: How to build your own AI agents

## Prerequisites 📚

- Basic Python programming
- Understanding of databases (helpful but not required)
- Interest in AI and biomedical applications

---

## Part 1: What are Knowledge Graphs? 🕸️

### The Problem with Traditional Data Storage

Imagine you're studying biology and want to answer: **"What genes are related to Hypertension?"**

With traditional databases (tables), you might have:
- `genes` table
- `diseases` table  
- `gene_disease_associations` table

But what about complex questions like: **"What pathway connects TP53 to Lisinopril through proteins and diseases?"**

In SQL, this means chaining joins across several tables, and the query grows with every extra hop between entities!
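
Even the first question already needs two joins in a relational schema. The sketch below uses Python's built-in `sqlite3` with an in-memory database. The column names and the rows (`GENE_A`, `GENE_B`, `GENE_C`) are placeholders invented for illustration, not real biomedical data.

```python
import sqlite3

# Toy schema mirroring the three tables listed above; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE diseases (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene_disease_associations (gene_id INTEGER, disease_id INTEGER);
INSERT INTO genes VALUES (1, 'GENE_A'), (2, 'GENE_B'), (3, 'GENE_C');
INSERT INTO diseases VALUES (1, 'Hypertension'), (2, 'Diabetes');
INSERT INTO gene_disease_associations VALUES (1, 1), (2, 1), (3, 2);
""")

def genes_for_disease(connection, disease_name):
    """Return gene names linked to a disease through the association table."""
    rows = connection.execute(
        """
        SELECT g.name
        FROM genes AS g
        JOIN gene_disease_associations AS a ON a.gene_id = g.id
        JOIN diseases AS d ON d.id = a.disease_id
        WHERE d.name = ?
        ORDER BY g.name
        """,
        (disease_name,),
    ).fetchall()
    return [name for (name,) in rows]

hypertension_genes = genes_for_disease(conn, "Hypertension")
print(hypertension_genes)
```

Each additional entity type on the path (protein, drug) would add another association table and another pair of joins to this query.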

### Knowledge Graphs: A Better Way

Knowledge graphs store information as **nodes** (entities) and **relationships** (edges):

```
TP53 --[ENCODES]--> TP53_protein --[ASSOCIATED_WITH]--> Hypertension
                                                             ^
                                                             |
                                                         [TREATS]
                                                             |
                                                         Lisinopril
```

This naturally represents how biological entities relate to each other!
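
To make the idea concrete, here is a minimal sketch that stores the three edges from the diagram as plain Python tuples and finds a path between two entities with breadth-first search. It is only an in-memory illustration: a graph database such as Neo4j does this kind of traversal for you, and the function names here are invented for the example.

```python
from collections import deque

# Edges copied from the diagram: (source, relationship, target).
TRIPLES = [
    ("TP53", "ENCODES", "TP53_protein"),
    ("TP53_protein", "ASSOCIATED_WITH", "Hypertension"),
    ("Lisinopril", "TREATS", "Hypertension"),
]

def build_adjacency(triples):
    """Index every edge from both endpoints so a path may cross it either way."""
    adjacency = {}
    for source, relation, target in triples:
        adjacency.setdefault(source, []).append((relation, target, "->"))
        adjacency.setdefault(target, []).append((relation, source, "<-"))
    return adjacency

def find_path(triples, start, goal):
    """Breadth-first search; returns the shortest list of hops, or None."""
    adjacency = build_adjacency(triples)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for relation, neighbor, direction in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, hops + [(node, relation, direction, neighbor)]))
    return None

path = find_path(TRIPLES, "TP53", "Lisinopril")
for source, relation, direction, target in path:
    # "->" means the edge points from source to target, "<-" the reverse.
    print(f"{source} {direction}[{relation}] {target}")
```

The path crosses the `TREATS` edge against its direction, which is why the adjacency index records both endpoints of every edge.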

## Part 2: Our Biomedical Knowledge Graph 🧬

### Graph Schema

Our knowledge graph contains:

**Nodes (Entities):**
- 🧬 **Gene**: Genetic sequences (e.g., TP53, BRCA1, KRAS)
- 🧪 **Protein**: Proteins encoded by genes (e.g., TP53, BRCA1, insulin)
- 🏥 **Disease**: Medical conditions (e.g., Hypertension, Coronary_Artery_Disease)
- 💊 **Drug**: Medications and treatments (e.g., Lisinopril, Atorvastatin)

**Relationships (Edges):**
- Gene `--[ENCODES]-->` Protein
- Gene `--[LINKED_TO]-->` Disease
- Protein `--[ASSOCIATED_WITH]-->` Disease  
- Drug `--[TREATS]-->` Disease
- Drug `--[TARGETS]-->` Protein
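
The schema above can also be written down as data, which is handy for checking that an edge makes sense before it is loaded. This is a small sketch in plain Python; `ALLOWED_EDGES`, `is_valid_edge`, and `relationships_from` are names made up for this example and are not part of the project code.

```python
# (source label, relationship type, target label) triples from the schema list.
ALLOWED_EDGES = {
    ("Gene", "ENCODES", "Protein"),
    ("Gene", "LINKED_TO", "Disease"),
    ("Protein", "ASSOCIATED_WITH", "Disease"),
    ("Drug", "TREATS", "Disease"),
    ("Drug", "TARGETS", "Protein"),
}

def is_valid_edge(source_label, relationship, target_label):
    """True if the schema allows this relationship between these node types."""
    return (source_label, relationship, target_label) in ALLOWED_EDGES

def relationships_from(label):
    """List the relationship types that may start at a node with this label."""
    return sorted(rel for src, rel, _ in ALLOWED_EDGES if src == label)

print(is_valid_edge("Drug", "TREATS", "Disease"))
print(relationships_from("Gene"))
```

Note that direction matters: a `Disease` node has no outgoing relationships in this schema, so `relationships_from("Disease")` is empty.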

### Why This Matters

This structure mirrors how biologists think about molecular relationships!

In [None]:
# Let's connect to our knowledge graph and explore it!
import sys
import os
from pathlib import Path

# Add the project root to Python path
project_root = Path().resolve().parent.parent
sys.path.append(str(project_root))

from dotenv import load_dotenv
from src.agents.graph_interface import GraphInterface

# Load environment variables
load_dotenv()

# Connect to the graph database
uri = os.getenv("NEO4J_URI", "bolt://localhost:7687")
user = os.getenv("NEO4J_USER", "neo4j")
password = os.getenv("NEO4J_PASSWORD")

if not password:
    print("⚠️ Please set NEO4J_PASSWORD in your .env file")
    print("💡 Tip: Copy .env.example to .env and add your credentials")
else:
    try:
        graph_db = GraphInterface(uri, user, password)
        print("✅ Connected to knowledge graph!")
    except Exception as e:
        print(f"❌ Connection failed: {e}")
        print("💡 Make sure Neo4j is running and credentials are correct")

In [None]:
# Explore our graph schema
schema = graph_db.get_schema_info()

print("🏗️ Knowledge Graph Schema:")
print("=" * 40)
print(f"Node Types: {schema['node_labels']}")
print(f"Relationship Types: {schema['relationship_types']}")
print("\n📊 Node Properties:")
for node_type, properties in schema['node_properties'].items():
    print(f"  {node_type}: {properties}")

In [None]:
# Let's see some actual data!
# Get a few examples of each node type

print("🧬 Sample Genes:")
genes = graph_db.execute_query("MATCH (g:Gene) RETURN g.gene_name, g.function LIMIT 3")
for gene in genes:
    print(f"  • {gene['g.gene_name']}: {gene.get('g.function', 'Function not specified')}")

print("\n🧪 Sample Proteins:")
proteins = graph_db.execute_query("MATCH (p:Protein) RETURN p.protein_name, p.molecular_weight LIMIT 3")
for protein in proteins:
    print(f"  • {protein['p.protein_name']}: {protein['p.molecular_weight']} kDa")

print("\n🏥 Sample Diseases:")
diseases = graph_db.execute_query("MATCH (d:Disease) RETURN d.disease_name, d.category LIMIT 3")
for disease in diseases:
    print(f"  • {disease['d.disease_name']}: {disease['d.category']}")

print("\n💊 Sample Drugs:")
drugs = graph_db.execute_query("MATCH (dr:Drug) RETURN dr.drug_name, dr.type LIMIT 3")
for drug in drugs:
    print(f"  • {drug['dr.drug_name']}: {drug['dr.type']}")
    
print("\n📊 Database Summary:")
summary = graph_db.execute_query("""
MATCH (n) 
RETURN labels(n)[0] as node_type, count(*) as count 
ORDER BY count DESC
""")
for row in summary:
    print(f"  • {row['node_type']}: {row['count']} nodes")

## Part 3: Graph Queries with Cypher 🔍

Neo4j uses **Cypher** as its query language. Think of it like SQL, but for graphs!

### Basic Cypher Patterns

1. **MATCH**: Find patterns in the graph
2. **WHERE**: Filter results
3. **RETURN**: What to give back
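
The three clauses always appear in this order, so a query can be assembled from its parts. The helper below is a hypothetical sketch (`build_cypher` is not part of the project). It keeps the search term out of the query text by using a Cypher parameter (`$term`). Whether `graph_db.execute_query` accepts a parameter dictionary is an assumption you would need to check in `GraphInterface`.

```python
def build_cypher(match, where=None, returns="*", limit=None):
    """Assemble a Cypher string from MATCH, optional WHERE, RETURN, optional LIMIT."""
    lines = [f"MATCH {match}"]
    if where:
        lines.append(f"WHERE {where}")
    lines.append(f"RETURN {returns}")
    if limit is not None:
        lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)

query = build_cypher(
    match="(dr:Drug)-[:TREATS]->(d:Disease)",
    where="toLower(d.disease_name) CONTAINS $term",
    returns="dr.drug_name, d.disease_name",
    limit=5,
)
# Values for the $-parameters travel separately from the query text.
params = {"term": "hypertension"}

print(query)
print(params)
```

Passing values as parameters instead of pasting them into the string avoids quoting mistakes when the search term comes from a user.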

### Example Queries

In [None]:
# Simple query: Find all genes
query1 = "MATCH (g:Gene) RETURN g.gene_name LIMIT 5"
result1 = graph_db.execute_query(query1)

print("🔍 Simple Query: All genes")
print(f"Query: {query1}")
print("Results:")
for row in result1:
    print(f"  • {row['g.gene_name']}")

In [None]:
# Relationship query: Find what proteins are encoded by genes
query2 = """
MATCH (g:Gene)-[:ENCODES]->(p:Protein) 
RETURN g.gene_name, p.protein_name 
LIMIT 5
"""

result2 = graph_db.execute_query(query2)

print("🔗 Relationship Query: Gene encodes Protein")
print(f"Query: {query2.strip()}")
print("Results:")
for row in result2:
    print(f"  • Gene {row['g.gene_name']} encodes Protein {row['p.protein_name']}")

In [None]:
# Complex query: Find complete pathway from gene to treatment
query3 = """
MATCH (g:Gene)-[:ENCODES]->(p:Protein)-[:ASSOCIATED_WITH]->(d:Disease)<-[:TREATS]-(dr:Drug)
RETURN g.gene_name, p.protein_name, d.disease_name, dr.drug_name
LIMIT 3
"""

result3 = graph_db.execute_query(query3)

print("🛤️ Complex Query: Complete pathway Gene → Protein → Disease ← Drug")
print(f"Query: {query3.strip()}")
print("Results:")
for row in result3:
    print(f"  • {row['g.gene_name']} → {row['p.protein_name']} → {row['d.disease_name']} ← {row['dr.drug_name']}")

### 🎯 Exercise 1: Write Your Own Query

Try writing a query to find drugs that treat Hypertension!

**Hint**: Use the pattern `(dr:Drug)-[:TREATS]->(d:Disease)` and filter where disease name contains "hypertension"

In [None]:
# Your turn! Write a query to find drugs that treat Hypertension
# This uses actual data from our database

your_query = """
MATCH (dr:Drug)-[:TREATS]->(d:Disease)
WHERE toLower(d.disease_name) CONTAINS 'hypertension'
RETURN dr.drug_name, d.disease_name
LIMIT 5
"""

print("🎯 Exercise 1: Find drugs that treat Hypertension")
print("Query:")
print(your_query)
print("\nResults:")

try:
    result = graph_db.execute_query(your_query)
    if result:
        for row in result:
            print(f"  • Drug '{row['dr.drug_name']}' treats {row['d.disease_name']}")
    else:
        print("  • No results found. Try running the data loading script first!")
        print("  • Command: pdm run load-data")
except Exception as e:
    print(f"  • Error: {e}")
    print("  • Make sure the database is connected and data is loaded")

# Now try your own variations:
print("\n💡 Try modifying the query to search for:")
print("  • 'coronary' (heart disease)")
print("  • 'diabetes' (metabolic disorder)")  
print("  • Change LIMIT to see more results")

## Part 4: What is LangGraph? 🌊

### The Challenge: Complex AI Workflows

Imagine you want to build an AI that can:
1. Understand a natural language question
2. Extract important entities
3. Generate a database query
4. Execute the query
5. Format the results

Each step depends on the previous ones, and you need to manage **state** (information) flowing between steps.

### LangGraph: AI Workflow Engine

LangGraph helps you build **multi-step AI workflows** with:
- **Nodes**: Individual processing steps
- **Edges**: How steps connect
- **State**: Information that flows between steps
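
Before touching the library, it can help to see the three ideas in plain Python. The sketch below is not LangGraph itself: it is a hand-rolled runner where nodes are functions, edges are a dictionary, and state is a dict passed along. In LangGraph the same roles are played by `StateGraph`, `add_node`, `add_edge`, and `END`. The step logic here is deliberately trivial and invented for illustration.

```python
END = "__end__"  # sentinel marking the end of the workflow in this sketch

def classify(state):
    """Node 1: label the question with a very naive keyword rule."""
    new_state = dict(state)
    is_drug_question = "drug" in state["question"].lower()
    new_state["question_type"] = "drug_treatment" if is_drug_question else "general"
    return new_state

def answer(state):
    """Node 2: produce a placeholder answer from what node 1 stored."""
    new_state = dict(state)
    new_state["answer"] = f"Handled a {state['question_type']} question."
    return new_state

NODES = {"classify": classify, "answer": answer}
EDGES = {"classify": "answer", "answer": END}

def run_workflow(nodes, edges, entry_point, initial_state):
    """Follow the edges from the entry point until END, threading state through."""
    state, current, visited = initial_state, entry_point, []
    while current != END:
        state = nodes[current](state)
        visited.append(current)
        current = edges[current]
    return state, visited

final_state, order = run_workflow(
    NODES, EDGES, "classify", {"question": "What drugs treat Hypertension?"}
)
print(order)
print(final_state)
```

The second node only works because the first one wrote `question_type` into the state, which is exactly the dependency between steps described above.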

### Visual Representation

```
Question → [Classify] → [Extract] → [Generate] → [Execute] → [Format] → Answer
             ↓            ↓           ↓           ↓           ↓
           State      State       State       State       State
```
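
The "State" row in the diagram can be pictured as one dictionary that gains keys as it moves right. Here is a minimal sketch using a `TypedDict`. The field names are assumptions chosen for this example and the project's own state class may differ. A node typically returns just the keys it changed, and those are merged into the running state.

```python
from typing import List, Optional, TypedDict

class WorkflowState(TypedDict, total=False):
    """Fields a question-answering workflow might carry (assumed names)."""
    question: str
    question_type: str
    entities: List[str]
    cypher_query: str
    results_count: int
    answer: str
    error: Optional[str]

def apply_update(state, update):
    """Return a new state with the step's keys merged in; earlier keys survive."""
    merged = dict(state)
    merged.update(update)
    return merged

state: WorkflowState = {"question": "What drugs treat Hypertension?"}

# Hard-coded stand-ins for what each step would contribute.
step_updates = [
    ("classify", {"question_type": "drug_treatment"}),
    ("extract", {"entities": ["Hypertension"]}),
    ("generate", {"cypher_query": "MATCH (dr:Drug)-[:TREATS]->(d:Disease) RETURN dr.drug_name"}),
]

for step_name, update in step_updates:
    state = apply_update(state, update)
    print(f"after {step_name}: {sorted(state)}")
```

Because `total=False` makes every field optional, the state can start with only `question` and fill in the rest as steps run.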

In [None]:
# Import our AI agents
from src.agents.advanced_ai_agent import AdvancedAIAgent
from src.agents.workflow_agent import WorkflowAgent

# Let's understand the LangGraph workflow steps
print("🏗️ LangGraph Workflow Architecture")
print("=" * 50)
print("""
Our AI agents use LangGraph to create structured workflows with these steps:

1. 🏷️ CLASSIFY: Determine question type (gene_disease, drug_treatment, etc.)
2. 🔍 EXTRACT: Extract biomedical entities from the question  
3. 🔧 GENERATE: Create appropriate Cypher query
4. ⚡ EXECUTE: Run query against Neo4j database
5. 📝 FORMAT: Generate natural language response

Each step manages STATE that flows to the next step:
""")

print("""
State Flow Example:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ question:       │    │ question_type:  │    │ entities:       │
│ "What drugs     │ →  │ "drug_treatment"│ →  │ ["Hypertension"]│
│ treat           │    │                 │    │                 │
│ hypertension?"  │    │                 │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ↓
┌─────────────────┐    ┌─────────────────┐
│ cypher_query:   │    │ final_answer:   │
│ "MATCH (dr...)  │ →  │ "Based on the   │
│ WHERE..."       │    │ database..."    │
└─────────────────┘    └─────────────────┘
""")

In [None]:
from src.agents.advanced_workflow_agent import AdvancedWorkflowAgent
from src.agents.workflow_agent import WorkflowAgent
# Create our AI agent
anthropic_key = os.getenv("ANTHROPIC_API_KEY")

if not anthropic_key:
    print("⚠️ Please set ANTHROPIC_API_KEY in your .env file")
    print("💡 Get an API key at: https://console.anthropic.com/")
    print("💡 Add it to your .env file: ANTHROPIC_API_KEY=sk-ant-your-key-here")
else:
    try:
        # Initialize the AI agent (using AdvancedWorkflowAgent for detailed learning)
        ai_agent = AdvancedWorkflowAgent(graph_db, anthropic_key)
        print("✅ AI Agent ready!")
        print("🎓 This agent uses AdvancedWorkflowAgent with detailed step-by-step explanations")
    except Exception as e:
        print(f"❌ Agent initialization failed: {e}")
        print("💡 Check your API key and network connection")

In [None]:
# Let's ask our agent a question and see the complete workflow!
# Using real entities from our database

question = "What genes are associated with Hypertension?"

print("🎓 Running LangGraph AI Agent")
print("=" * 50)
print(f"❓ Question: {question}")
print("\n⚙️ Workflow Steps:")

try:
    result = ai_agent.answer_question(question)
    
    print("\n📋 Complete Workflow Results:")
    print("=" * 50)
    print(f"🏷️ Question Type: {result.get('question_type', 'Not classified')}")
    print(f"🧬 Entities Found: {result.get('entities', 'None found')}")
    print(f"🔧 Generated Query: {result.get('cypher_query', 'No query generated')}")
    print(f"📊 Results Count: {result.get('results_count', 0)}")
    print(f"✅ Final Answer: {result.get('answer', 'No answer generated')}")
    
    if result.get('error'):
        print(f"❌ Error: {result['error']}")
        
except Exception as e:
    print(f"❌ Workflow failed: {e}")
    print("💡 Make sure both Neo4j and Anthropic API are properly configured")

In [None]:
# Try different questions with real biomedical entities!
questions_to_try = [
    "What drugs treat Hypertension?",
    "What protein does TP53 encode?", 
    "What diseases is BRCA1 associated with?",
    "What are the targets of Lisinopril?"
]

# Pick one and try it:
test_question = questions_to_try[0]  # Change the index to try different questions

print(f"🧪 Testing: {test_question}")
print("-" * 50)

try:
    result = ai_agent.answer_question(test_question)
    print(f"✅ Answer: {result.get('answer', 'No answer generated')}")
    print(f"🔧 Query used: {result.get('cypher_query', 'No query shown')}")
    
    if result.get('error'):
        print(f"❌ Error: {result['error']}")
        
except Exception as e:
    print(f"❌ Query failed: {e}")

print("\n💡 Try changing test_question to questions_to_try[1], [2], or [3] to test other questions!")
print("💡 Or write your own question using entity names from the database")

In [None]:
# Let's examine the workflow creation code
import inspect

# Look at how the workflow is created
print("🏗️ How the LangGraph Workflow is Built:")
print("=" * 50)

try:
    # Get the workflow creation method from the agent, if it exists
    workflow_method = getattr(ai_agent, '_build_workflow', None)
    if workflow_method:
        workflow_code = inspect.getsource(workflow_method)
        print(workflow_code)
    else:
        print("Workflow method not found. Let's see available methods:")
        methods = [method for method in dir(ai_agent) if not method.startswith('_') or 'workflow' in method.lower()]
        for method in methods:
            print(f"  • {method}")
except Exception as e:
    print(f"Could not inspect workflow code: {e}")
    print("The workflow is built using LangGraph's StateGraph with nodes for:")
    print("  • classify_question")
    print("  • extract_entities") 
    print("  • generate_cypher_query")
    print("  • execute_query")
    print("  • format_response")

In [None]:
# Let's look at one of the workflow steps in detail
print("🔍 Example: The Entity Extraction Step")
print("=" * 50)

try:
    # Get the entity extraction method
    extract_method = getattr(ai_agent, 'extract_entities', None)
    if extract_method:
        extract_code = inspect.getsource(extract_method)
        print(extract_code)
    else:
        print("Entity extraction method not found on the agent.")
        print("Here's what entity extraction typically does:")
        print("""
def extract_entities(self, state):
    # Takes the user question from state
    question = state['question']
    
    # Uses AI to identify biomedical entities
    # Looks for: gene names, protein names, disease names, drug names
    
    # Updates state with found entities
    state['entities'] = extracted_entities
    return state
        """)
except Exception as e:
    print(f"Could not inspect extraction code: {e}")
    print("Entity extraction finds biomedical terms in user questions")
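
If the agent's own method cannot be inspected, the sketch below shows one simple way such a step could work. It is a stand-in, not the project's implementation: instead of calling an AI model it matches the question against a small dictionary of names taken from the schema section. `KNOWN_ENTITIES` and the returned structure are assumptions made for this example.

```python
import re

# Example names from the schema section, grouped by node label.
KNOWN_ENTITIES = {
    "Gene": ["TP53", "BRCA1", "KRAS"],
    "Disease": ["Hypertension", "Coronary_Artery_Disease"],
    "Drug": ["Lisinopril", "Atorvastatin"],
}

def name_pattern(name):
    """Whole-word, case-insensitive pattern; '_' in a name also matches a space."""
    parts = [re.escape(part) for part in name.split("_")]
    return re.compile(r"\b" + "[_ ]".join(parts) + r"\b", re.IGNORECASE)

def extract_entities(state):
    """Workflow step: return a copy of the state with an 'entities' list added."""
    question = state["question"]
    found = []
    for label, names in KNOWN_ENTITIES.items():
        for name in names:
            if name_pattern(name).search(question):
                found.append({"name": name, "label": label})
    return {**state, "entities": found}

result_state = extract_entities({"question": "Does lisinopril treat hypertension?"})
print(result_state["entities"])
```

A dictionary lookup only finds names it already knows, which is the main reason the real agent delegates this step to a language model.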

## Part 9: Hands-on Exercises 🏋️‍♀️

Now it's your turn to experiment!

In [None]:
# Your challenge: Create a modified workflow with validation
# Hint: You can use graph_db.validate_query() to check if a query is valid

from langgraph.graph import StateGraph, END
from src.agents.advanced_workflow_agent import LearningState

class ImprovedAgent:
    def __init__(self, graph_interface, anthropic_api_key):
        # Your code here!
        pass
    
    def validate_query(self, state: LearningState) -> LearningState:
        """Add your validation logic here!"""
        # Hint: Check if state['cypher_query'] is valid
        # If not valid, set state['error'] = "Invalid query"
        pass

# Try implementing your improved agent!
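
If you want a starting point for the validation step, here is one possible sketch in plain Python. It does not use `graph_db.validate_query()`. Instead it applies a few text checks so that only read-only queries pass. The function names and error messages are invented for this example, and the keyword check is deliberately crude: it would also reject a harmless query that contains a word like `SET` inside a string literal.

```python
import re

# Clauses that modify the database; a read-only workflow should reject them.
WRITE_KEYWORDS = ("CREATE", "MERGE", "DELETE", "DETACH", "SET", "REMOVE", "DROP")

def check_read_only_query(query):
    """Return an error message, or None if the query passes these basic checks."""
    text = (query or "").strip()
    if not text:
        return "Empty query"
    if not re.match(r"(?i)(MATCH|OPTIONAL\s+MATCH)\b", text):
        return "Query must start with MATCH"
    if not re.search(r"(?i)\bRETURN\b", text):
        return "Query has no RETURN clause"
    for keyword in WRITE_KEYWORDS:
        if re.search(rf"(?i)\b{keyword}\b", text):
            return f"Write keyword not allowed: {keyword}"
    return None

def validate_query(state):
    """Workflow step: copy the state and record an error if the query fails."""
    new_state = dict(state)
    new_state["error"] = check_read_only_query(new_state.get("cypher_query"))
    return new_state

good = validate_query({"cypher_query": "MATCH (g:Gene) RETURN g.gene_name LIMIT 5"})
bad = validate_query({"cypher_query": "MATCH (n) SET n.x = 1 RETURN n"})
print(good["error"])
print(bad["error"])
```

In a workflow, this step would sit between query generation and execution, and a later step could skip execution whenever `state['error']` is set.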

## Part 10: Real-World Applications 🌍

### Where are Knowledge Graphs + AI Used?

1. **Drug Discovery** 💊
   - Find new drug targets
   - Predict drug side effects
   - Repurpose existing drugs

2. **Personalized Medicine** 🧬
   - Match patients to treatments based on genetics
   - Predict disease risk
   - Optimize treatment plans

3. **Research Acceleration** 🔬
   - Literature mining and synthesis
   - Hypothesis generation
   - Cross-domain connections

4. **Clinical Decision Support** 🏥
   - Diagnostic assistance
   - Treatment recommendations
   - Drug interaction checking

### Industry Examples
- **Google**: Knowledge Graph for search
- **Amazon**: Product recommendations
- **Facebook**: Social graph analysis
- **Pharmaceutical companies**: Drug discovery pipelines

## Part 11: Next Steps and Advanced Topics 🚀

### Immediate Next Steps
1. **Experiment** with the Streamlit web interface
2. **Try** different question types and see how the agent handles them
3. **Modify** the agent code to add new features
4. **Write** your own Cypher queries for complex biomedical questions

### Advanced Topics to Explore
1. **Graph Algorithms**: PageRank, community detection, shortest paths
2. **Advanced LangGraph**: Conditional edges, parallel processing, human-in-the-loop
3. **Graph Neural Networks**: AI models that work directly on graph structure
4. **Real-time Updates**: Streaming data into knowledge graphs
5. **Large-scale Graphs**: Handling millions/billions of nodes

### Learning Resources
- **Neo4j Documentation**: https://neo4j.com/docs/
- **LangGraph Documentation**: https://langchain-ai.github.io/langgraph/
- **Graph Theory Courses**: edX, Coursera, Khan Academy
- **Biomedical Databases**: PubMed, UniProt, STRING

In [None]:
# Exercise 4: Write custom queries using real biomedical data
exercises = {
    "a": "Find all proteins that are associated with cardiovascular diseases",
    "b": "Find drugs that target proteins with high molecular weight (>50 kDa)", 
    "c": "Find the most common disease categories in our database",
    "d": "Find complete pathways: Gene → Protein → Disease, where the gene is TP53"
}

print("✏️ Query Writing Exercises:")
for key, exercise in exercises.items():
    print(f"{key}) {exercise}")

print("\n🧪 Try these solutions:")

# Exercise A - Solution
query_a = """
MATCH (p:Protein)-[:ASSOCIATED_WITH]->(d:Disease)
WHERE d.category = 'cardiovascular'
RETURN p.protein_name, d.disease_name
LIMIT 10
"""

# Exercise B - Solution  
query_b = """
MATCH (dr:Drug)-[:TARGETS]->(p:Protein)
WHERE p.molecular_weight > 50
RETURN dr.drug_name, p.protein_name, p.molecular_weight
ORDER BY p.molecular_weight DESC
LIMIT 10
"""

# Exercise C - Solution
query_c = """
MATCH (d:Disease)
RETURN d.category, count(*) as disease_count
ORDER BY disease_count DESC
"""

# Exercise D - Solution
query_d = """
MATCH (g:Gene)-[:ENCODES]->(p:Protein)-[:ASSOCIATED_WITH]->(d:Disease)
WHERE g.gene_name = 'TP53'
RETURN g.gene_name, p.protein_name, d.disease_name
"""

# Test one of the queries
print("\n🧪 Testing Exercise A:")
try:
    result_a = graph_db.execute_query(query_a)
    print(f"Found {len(result_a)} protein-disease associations")
    for i, row in enumerate(result_a[:3]):  # Show first 3 results
        print(f"  {i+1}. {row['p.protein_name']} → {row['d.disease_name']}")
    if len(result_a) > 3:
        print(f"  ... and {len(result_a)-3} more")
except Exception as e:
    print(f"Query failed: {e}")

print("\n💡 To run the other queries, swap query_a for query_b, query_c, or query_d and update the printed column names to match each RETURN clause")