# Multi-Agent RAG Orchestration System

## Overview
This notebook implements a production-grade multi-agent system that:
1. **Classifies** user queries by intent using an orchestrator agent
2. **Routes** queries to specialized RAG agents (HR, Tech, Finance)
3. **Traces** all workflows with Langfuse for debugging and monitoring
4. **Evaluates** responses for quality using automated scoring

## Architecture
```
User Query → Orchestrator (Intent Classification)
                    ↓
        ┌───────────┼───────────┐
        ↓           ↓           ↓
    HR Agent    Tech Agent  Finance Agent
        ↓           ↓           ↓
    HR Docs     Tech Docs   Finance Docs
    (Vector)    (Vector)    (Vector)
```

All agents use LangChain for production-grade components and Langfuse for complete observability.

---
# 1. Setup & Imports

In [1]:
import os
import warnings
from pathlib import Path
from typing import Dict, List, Any, Optional
from dotenv import load_dotenv

# LangChain imports (updated for LangChain 1.1.0+)
from langchain_openai import ChatOpenAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_classic.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate
from langchain_core.documents import Document

# Langfuse for tracing
from langfuse.langchain import CallbackHandler
from langfuse import Langfuse

# Suppress warnings
warnings.filterwarnings('ignore')

print("✓ All imports successful")

✓ All imports successful


In [2]:
# Load environment variables from .env
# override=True makes values in .env replace variables already set in the process,
# so edits to .env take effect when this cell is re-run
load_dotenv(override=True)

# Verify required environment variables
required_vars = [
    'OPENAI_API_KEY',
    'OPENAI_API_BASE',
    'LANGFUSE_PUBLIC_KEY',
    'LANGFUSE_SECRET_KEY',
    'LANGFUSE_HOST'
]

missing_vars = [var for var in required_vars if not os.getenv(var)]
if missing_vars:
    print(f"⚠️  Missing environment variables: {', '.join(missing_vars)}")
    print("Please copy .env.example to .env and fill in your API keys")
else:
    print("✓ All environment variables configured")
    print(f"  - Using OpenRouter API: {os.getenv('OPENAI_API_BASE')}")
    print(f"  - Langfuse enabled: {os.getenv('LANGFUSE_HOST')}")


⚠️  Missing environment variables: LANGFUSE_HOST
Please copy .env.example to .env and fill in your API keys


In [3]:
# DEBUG: Verify API credentials
print("\n🔍 API Credentials Check:")
print("=" * 60)

api_key = os.getenv('OPENAI_API_KEY')
api_base = os.getenv('OPENAI_API_BASE')

if api_key:
    print("✅ OPENAI_API_KEY is set")
    print(f"   First 15 chars: {api_key[:15]}...")
    if api_key.startswith('sk-or-v1-'):
        print("   ✅ Format looks correct for OpenRouter")
    else:
        print("   ⚠️  Key doesn't start with 'sk-or-v1-' (expected for OpenRouter)")
else:
    print("❌ OPENAI_API_KEY is NOT set!")
    print("   Please check your .env file")

if api_base:
    print(f"\n✅ OPENAI_API_BASE is set: {api_base}")
else:
    print("\n❌ OPENAI_API_BASE is NOT set!")

print("\n" + "=" * 60)
print("\n💡 If you see errors:")
print("   1. Check your .env file has: OPENAI_API_KEY=sk-or-v1-...")
print("   2. Verify your OpenRouter API key at https://openrouter.ai/keys")
print("   3. Make sure your .env file is in the project root directory")



🔍 API Credentials Check:
✅ OPENAI_API_KEY is set
   First 15 chars: sk-or-v1-d64216...
   ✅ Format looks correct for OpenRouter

✅ OPENAI_API_BASE is set: https://openrouter.ai/api/v1


💡 If you see errors:
   1. Check your .env file has: OPENAI_API_KEY=sk-or-v1-...
   2. Verify your OpenRouter API key at https://openrouter.ai/keys
   3. Make sure your .env file is in the project root directory


In [4]:
# Configuration
CONFIG = {
    'model_name': os.getenv('OPENAI_MODEL', 'openai/gpt-4-turbo-preview'),
    # Note: not used below; create_vector_store() embeds with a local HuggingFace model
    'embedding_model': os.getenv('EMBEDDING_MODEL', 'openai/text-embedding-ada-002'),
    'temperature': 0.1,  # Low temperature for consistent routing
    'chunk_size': 1000,
    'chunk_overlap': 200,
    'retrieval_k': 5,  # Number of documents to retrieve
    'data_dir': Path('data'),
    'vector_store_dir': Path('vector_stores')
}

# Create vector store directory if it doesn't exist
CONFIG['vector_store_dir'].mkdir(exist_ok=True)

print("✓ Configuration loaded:")
for key, value in CONFIG.items():
    print(f"  - {key}: {value}")

✓ Configuration loaded:
  - model_name: openai/gpt-4-turbo-preview
  - embedding_model: openai/text-embedding-ada-002
  - temperature: 0.1
  - chunk_size: 1000
  - chunk_overlap: 200
  - retrieval_k: 5
  - data_dir: data
  - vector_store_dir: vector_stores


---
# 2. Document Loading & Vector Store Creation

We load documents from three specialized knowledge bases:
- **HR Documents**: Employee policies, benefits, leave policies
- **Tech Documents**: API docs, deployment guides, security standards  
- **Finance Documents**: Expense policies, budgets, procurement

Each is chunked and embedded into a separate FAISS vector store for efficient semantic search.
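
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window chunker. It is only a sketch: the `RecursiveCharacterTextSplitter` used in this notebook first tries to split on separators such as paragraph and line breaks, which this version does not do, and `sliding_window_chunks` is a hypothetical name.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With the notebook's settings (1000 and 200), each chunk repeats the last 200 characters of the previous one, which helps keep a sentence that straddles a boundary retrievable from at least one chunk.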

In [5]:
def load_documents_from_directory(directory: Path) -> List[Document]:
    """Load all markdown documents from a directory."""
    loader = DirectoryLoader(
        str(directory),
        glob="**/*.md",
        loader_cls=TextLoader,
        show_progress=True
    )
    documents = loader.load()
    print(f"  Loaded {len(documents)} documents from {directory.name}")
    return documents

def create_vector_store(documents: List[Document], store_name: str) -> FAISS:
    """Create and persist a FAISS vector store from documents."""
    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CONFIG['chunk_size'],
        chunk_overlap=CONFIG['chunk_overlap'],
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = text_splitter.split_documents(documents)
    print(f"  Split into {len(chunks)} chunks")
    
    # Create embeddings
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    print(f"  Using local embeddings model")
    
    # Create vector store
    vector_store = FAISS.from_documents(chunks, embeddings)
    
    # Save to disk
    store_path = CONFIG['vector_store_dir'] / store_name
    vector_store.save_local(str(store_path))
    print(f"  ✓ Vector store saved to {store_path}")
    
    return vector_store

print("✓ Document loading functions defined")

✓ Document loading functions defined


In [6]:
# Load and create vector stores for each domain
print("\n" + "="*60)
print("Creating Vector Stores")
print("="*60)

vector_stores = {}

# HR Documents
print("\n[1/3] HR Documents")
hr_docs = load_documents_from_directory(CONFIG['data_dir'] / 'hr_docs')
vector_stores['hr'] = create_vector_store(hr_docs, 'hr_vector_store')

# Tech Documents
print("\n[2/3] Tech Documents")
tech_docs = load_documents_from_directory(CONFIG['data_dir'] / 'tech_docs')
vector_stores['tech'] = create_vector_store(tech_docs, 'tech_vector_store')

# Finance Documents
print("\n[3/3] Finance Documents")
finance_docs = load_documents_from_directory(CONFIG['data_dir'] / 'finance_docs')
vector_stores['finance'] = create_vector_store(finance_docs, 'finance_vector_store')

print("\n" + "="*60)
print("✓ All vector stores created successfully")
print("="*60)


Creating Vector Stores

[1/3] HR Documents

100%|██████████| 3/3 [00:00<00:00, 583.35it/s]

  Loaded 3 documents from hr_docs
  Split into 36 chunks
  Using local embeddings model
  ✓ Vector store saved to vector_stores/hr_vector_store

[2/3] Tech Documents

100%|██████████| 3/3 [00:00<00:00, 1892.74it/s]

  Loaded 3 documents from tech_docs
  Split into 69 chunks
  Using local embeddings model
  ✓ Vector store saved to vector_stores/tech_vector_store

[3/3] Finance Documents

100%|██████████| 3/3 [00:00<00:00, 2896.62it/s]

  Loaded 3 documents from finance_docs
  Split into 65 chunks
  Using local embeddings model
  ✓ Vector store saved to vector_stores/finance_vector_store

✓ All vector stores created successfully


---
# 3. Agent Definitions

Each specialized agent is a RAG system with:
- Domain-specific vector store retriever
- Custom prompt template
- LangChain RetrievalQA chain
- Langfuse tracing integration

In [7]:
# Initialize Langfuse callback handler (OPTIONAL)
# Note: Langfuse v3+ uses environment variables automatically
# If Langfuse credentials are invalid, tracing will be disabled but the system will still work

try:
    langfuse_handler = CallbackHandler()
    print("✅ Langfuse tracing enabled")
    USE_LANGFUSE = True
except Exception as e:
    print(f"⚠️  Langfuse initialization failed: {e}")
    print("   Continuing without tracing (this is OK for testing)")
    langfuse_handler = None
    USE_LANGFUSE = False

# Verify API credentials before initializing LLM
print("\n🔐 Verifying API credentials...")
api_key = os.getenv('OPENAI_API_KEY')
api_base = os.getenv('OPENAI_API_BASE')

if not api_key or not api_base:
    raise ValueError("❌ Missing OPENAI_API_KEY or OPENAI_API_BASE in environment variables!")

print(f"✅ API Key present: {api_key[:15]}...{api_key[-4:]}")
print(f"✅ API Base: {api_base}")

# Initialize LLM
print("\n🤖 Initializing LLM...")
llm = ChatOpenAI(
    model=CONFIG['model_name'],
    temperature=CONFIG['temperature'],
    api_key=api_key,
    base_url=api_base
)

print(f"✓ LLM initialized with model: {CONFIG['model_name']}")


✅ Langfuse tracing enabled

🔐 Verifying API credentials...
✅ API Key present: sk-or-v1-d64216...5895
✅ API Base: https://openrouter.ai/api/v1

🤖 Initializing LLM...
✓ LLM initialized with model: openai/gpt-4-turbo-preview
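
Because `langfuse_handler` is set to `None` when initialization fails, any later call that passes it as a callback needs to account for that case. One possible approach, sketched here with a hypothetical helper named `tracing_config`, is to build the `config` dictionary in one place:

```python
from typing import Any, Dict, Optional


def tracing_config(handler: Optional[Any]) -> Dict[str, Any]:
    """Build an invoke() config that includes callbacks only when a handler exists."""
    return {"callbacks": [handler]} if handler is not None else {}


# With no handler the config is empty, so the call runs untraced
print(tracing_config(None))
```

Under this assumption, a call such as `llm.invoke(prompt, config=tracing_config(langfuse_handler))` would behave the same whether or not tracing is available.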


In [8]:
# Define prompt templates for each agent

HR_PROMPT = PromptTemplate(
    template="""You are TechCorp's HR Assistant, an expert on employee policies, benefits, and workplace procedures.

Use the following context from our HR documentation to answer the question. If you don't find the answer in the context, say so - don't make up information.

Context:
{context}

Question: {question}

Provide a clear, accurate answer based on the context above. Include specific policy details, numbers, and procedures when available. If the answer requires follow-up with HR, mention that.

Answer:""",
    input_variables=["context", "question"]
)

TECH_PROMPT = PromptTemplate(
    template="""You are TechCorp's Technical Documentation Assistant, an expert on APIs, deployment, and engineering practices.

Use the following context from our technical documentation to answer the question. Provide code examples and technical details when helpful.

Context:
{context}

Question: {question}

Provide a comprehensive technical answer. Include:
- Step-by-step instructions when applicable
- Code examples or configuration snippets
- Best practices and important warnings
- Links to related documentation when relevant

If you don't find the answer in the context, say so clearly.

Answer:""",
    input_variables=["context", "question"]
)

FINANCE_PROMPT = PromptTemplate(
    template="""You are TechCorp's Finance Assistant, an expert on budgets, expenses, and financial policies.

Use the following context from our financial documentation to answer the question. Be precise with numbers, limits, and approval requirements.

Context:
{context}

Question: {question}

Provide a clear, accurate answer including:
- Specific dollar amounts and limits
- Required approvals and procedures
- Relevant policy sections
- Important exceptions or special cases

If you don't find the answer in the context, say so - don't make up financial information.

Answer:""",
    input_variables=["context", "question"]
)

print("✓ Agent prompts defined")

✓ Agent prompts defined


In [9]:
# Create RAG agents for each domain

def create_rag_agent(vector_store: FAISS, prompt: PromptTemplate, agent_name: str) -> RetrievalQA:
    """Create a RetrievalQA agent with the given vector store and prompt."""
    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": CONFIG['retrieval_k']}
    )
    
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt}
    )
    
    print(f"  ✓ {agent_name} agent created")
    return chain

print("\nCreating specialized RAG agents...")

agents = {
    'hr': create_rag_agent(vector_stores['hr'], HR_PROMPT, "HR"),
    'tech': create_rag_agent(vector_stores['tech'], TECH_PROMPT, "Tech"),
    'finance': create_rag_agent(vector_stores['finance'], FINANCE_PROMPT, "Finance")
}

print("\n✓ All RAG agents ready")


Creating specialized RAG agents...
  ✓ HR agent created
  ✓ Tech agent created
  ✓ Finance agent created

✓ All RAG agents ready


---
# 4. Orchestrator & Routing

The orchestrator analyzes user queries and routes them to the appropriate specialized agent.

**Routing Strategy:**
- Uses LLM to classify intent from query
- Categories: HR, Tech, Finance, General
- Classifier output is validated against the known categories (no confidence score is computed)
- Fallback to general response if no clear match
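
The validation step in this strategy can be made more tolerant of formatting noise in the model's reply (for example `Category: tech.` or `**HR**`). The following is a minimal sketch under that assumption; `normalize_category` and `VALID_CATEGORIES` are hypothetical names, and treating a reply that mentions two categories as ambiguous is a design choice of this example rather than something the notebook implements.

```python
import re

VALID_CATEGORIES = ("HR", "TECH", "FINANCE", "GENERAL")


def normalize_category(raw: str, default: str = "GENERAL") -> str:
    """Map a raw classifier reply to one valid category, or to the default."""
    # Pull out alphabetic tokens so punctuation and markdown are ignored
    tokens = re.findall(r"[A-Z]+", raw.upper())
    matches = [t for t in tokens if t in VALID_CATEGORIES]
    # Exactly one distinct category mentioned: accept it; otherwise treat as ambiguous
    return matches[0] if len(set(matches)) == 1 else default


for reply in ["HR", "Category: tech.", "**Finance**", "HR or TECH", ""]:
    print(repr(reply), "->", normalize_category(reply))
```

A reply that names more than one category is a cheap signal of ambiguity, which could be logged for later review instead of being silently routed.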

In [10]:
# Orchestrator prompt for intent classification
ORCHESTRATOR_PROMPT = """You are a query classification system for TechCorp's knowledge base.

Analyze the user's question and classify it into ONE of these categories:

**HR**: Questions about:
- Employee benefits, PTO, sick leave, parental leave
- Company policies, workplace rules, dress code
- Onboarding, hiring, termination procedures
- Performance reviews, compensation
- Holidays, working hours, remote work

**TECH**: Questions about:
- API documentation, endpoints, authentication
- Deployment procedures, CI/CD, infrastructure
- Security standards, best practices, compliance
- Code examples, technical implementation
- System architecture, monitoring, troubleshooting

**FINANCE**: Questions about:
- Expense policies, reimbursement procedures
- Budget planning, approval limits
- Procurement, vendor management
- Travel expenses, meal allowances
- Corporate cards, invoicing

**GENERAL**: Questions that:
- Don't fit the above categories
- Are greetings or casual conversation
- Are too vague to categorize

User Question: {question}

Respond with ONLY the category name (HR, TECH, FINANCE, or GENERAL). No explanation needed.

Category:"""

def classify_intent(question: str) -> str:
    """Classify the user's question into a category."""
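    # Attach the Langfuse handler only if it initialized successfully
    callbacks = [langfuse_handler] if langfuse_handler else []
    response = llm.invoke(
        ORCHESTRATOR_PROMPT.format(question=question),
        config={"callbacks": callbacks}
    )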
    category = response.content.strip().upper()
    
    # Validate category
    valid_categories = ['HR', 'TECH', 'FINANCE', 'GENERAL']
    if category not in valid_categories:
        print(f"  ⚠️  Invalid category '{category}', defaulting to GENERAL")
        category = 'GENERAL'
    
    return category

print("✓ Orchestrator classification function defined")

✓ Orchestrator classification function defined


In [11]:
def route_query(question: str, trace_name: Optional[str] = None) -> Dict[str, Any]:
    """
    Main routing function that:
    1. Classifies the query
    2. Routes to appropriate agent
    3. Returns structured response with metadata
    """
    # Classify intent
    print(f"\n{'='*60}")
    print(f"Query: {question}")
    print(f"{'='*60}")
    
    category = classify_intent(question)
    print(f"\n📋 Classification: {category}")
    
    # Route to appropriate agent
    if category in ['HR', 'TECH', 'FINANCE']:
        agent_key = category.lower()
        print(f"🎯 Routing to {category} Agent...\n")
        
        # Query the agent, tracing with Langfuse when the handler is available
        result = agents[agent_key].invoke(
            {"query": question},
            config={"callbacks": [langfuse_handler] if langfuse_handler else []}
        )
        
        response = {
            'question': question,
            'category': category,
            'agent_used': agent_key,
            'answer': result['result'],
            'source_documents': result['source_documents'],
            'num_sources': len(result['source_documents'])
        }
    else:
        print(f"💬 Handling as general query...\n")
        # General response for non-specialized queries
        general_response = llm.invoke(
            f"""You are TechCorp's helpful assistant. Answer this question politely:
            
Question: {question}

If this is a greeting, respond warmly. If it's a question outside HR, Tech, or Finance topics, explain that you specialize in those areas and offer to help with related questions.

Answer:""",
            config={"callbacks": [langfuse_handler] if langfuse_handler else []}
        )
        
        response = {
            'question': question,
            'category': 'GENERAL',
            'agent_used': 'general',
            'answer': general_response.content,
            'source_documents': [],
            'num_sources': 0
        }
    
    return response

print("✓ Query routing function defined")

✓ Query routing function defined


---
# 5. Testing & Examples

Let's test the multi-agent system with queries spanning all domains.

In [12]:
def display_response(response: Dict[str, Any]):
    """Pretty-print a response from the system."""
    print("\n" + "="*60)
    print("RESPONSE")
    print("="*60)
    print(f"\n{response['answer']}\n")
    
    if response['num_sources'] > 0:
        print(f"\n📚 Sources: {response['num_sources']} documents retrieved")
        print(f"\nTop source: {response['source_documents'][0].metadata.get('source', 'Unknown')}")
    
    print("\n" + "="*60)

print("✓ Display helper defined")

✓ Display helper defined


In [13]:
# Test Query 1: HR Question
response = route_query("How much PTO do I get after working here for 4 years?")
display_response(response)


Query: How much PTO do I get after working here for 4 years?

📋 Classification: HR
🎯 Routing to HR Agent...


RESPONSE

After working at TechCorp for 4 years, you are entitled to 20 days (160 hours) of Paid Time Off (PTO) per year. This is in accordance with the PTO accrual rates based on tenure, where employees in their 3rd to 5th year with the company accrue 20 days of PTO annually.


📚 Sources: 5 documents retrieved

Top source: data/hr_docs/leave_policies.md



In [14]:
# Test Query 2: Tech Question
response = route_query("How do I authenticate with the TechCorp API?")
display_response(response)


Query: How do I authenticate with the TechCorp API?

📋 Classification: TECH
🎯 Routing to TECH Agent...


RESPONSE

To authenticate with the TechCorp API, you need to follow a series of steps to ensure secure and successful authentication. Below are the detailed instructions, best practices, and code examples to guide you through the process.

### Step 1: Obtain Your API Key

1. **Log into your TechCorp account.**
2. **Navigate to Settings > API Keys.**
3. **Click "Generate New API Key".** Once generated, your API key will be displayed. It's crucial to store this key securely as it will be shown only once.

### Step 2: Use the API Key for Authentication

When making requests to the TechCorp API, you must include your API key in the request header for authentication.

**Example Request with cURL:**

```bash
curl -H "Authorization: Bearer YOUR_API_KEY" \
  https://api.techcorp.com/v1/users
```

Replace `YOUR_API_KEY` with your actual API key.

### Best Practices for API Authenticat

In [15]:
# Test Query 3: Finance Question
response = route_query("What's the expense limit for business meals?")
display_response(response)


Query: What's the expense limit for business meals?

📋 Classification: FINANCE
🎯 Routing to FINANCE Agent...


RESPONSE

The expense limit for business meals varies depending on the meal and context, with specific dollar amounts, limits, and required approvals as follows:

### Daily Meal Allowances (Per Person)
- **Breakfast:** $20
- **Lunch:** $25
- **Dinner:** $50
- **Daily Maximum:** $75

### International Meal Allowances (Major Cities)
- **Europe:**
  - Breakfast: $25
  - Lunch: $30
  - Dinner: $60
  - Daily Max: $100
- **Asia:**
  - Breakfast: $20
  - Lunch: $25
  - Dinner: $50
  - Daily Max: $80
- **Other Regions:** Use standard US rates unless approved otherwise.

### Business Meals
- **Client/Prospect Meals:** Requires manager approval for groups of 5 or more. Attendees and business purpose must be documented, and an itemized receipt submitted. Meals should be at reasonable locations, avoiding ultra-luxury establishments.
- **Team Meals:** Quarterly team dinners are appr

In [16]:
# Test Query 4: Edge case - Ambiguous
response = route_query("What are the security requirements for new employees?")
display_response(response)


Query: What are the security requirements for new employees?

📋 Classification: HR
🎯 Routing to HR Agent...


RESPONSE

The security requirements for new employees at TechCorp include the following specific policies and procedures:

1. **Strong Passwords**: Employees are required to use strong passwords for all systems. These passwords must be at least 12 characters long and should be changed every 90 days to maintain security.

2. **Multi-factor Authentication**: All systems must have multi-factor authentication enabled. This adds an extra layer of security by requiring not only a password and username but also something that only the user has on them, i.e., a piece of information only they should know or have immediately to hand - such as a physical token.

3. **VPN for Remote Access**: Employees accessing the company's systems remotely are required to use a Virtual Private Network (VPN). This ensures that remote access is secure and that data transmitted over the internet is 

In [17]:
# Test Query 5: General
response = route_query("Hello! How are you?")
display_response(response)


Query: Hello! How are you?

📋 Classification: GENERAL
💬 Handling as general query...


RESPONSE

Hello! Thank you for your kind greeting. I'm here to assist you. While I don't have feelings or personal experiences, I'm ready and eager to help you with any questions or concerns you might have, especially in HR, Tech, or Finance areas. How can I assist you today?




---
# 6. Langfuse Integration & Tracing

All queries are automatically traced in Langfuse. You can:
- View complete execution paths
- Debug failed retrievals
- Analyze routing accuracy
- Monitor response quality

Access your traces at: https://cloud.langfuse.com

In [18]:
# Initialize Langfuse client for evaluation
langfuse = Langfuse(
    public_key=os.getenv('LANGFUSE_PUBLIC_KEY'),
    secret_key=os.getenv('LANGFUSE_SECRET_KEY'),
    host=os.getenv('LANGFUSE_HOST')
)

def view_recent_traces(limit: int = 5):
    """Display information about recent traces."""
    print(f"\n📊 Recent Traces (last {limit}):\n")
    print("View detailed traces at: https://cloud.langfuse.com")
    print("\nTraces include:")
    print("  - Intent classification step")
    print("  - Agent selection logic")
    print("  - Vector store retrieval")
    print("  - LLM response generation")
    print("  - Full execution timeline")
    print("  - Token usage and costs")

view_recent_traces()


📊 Recent Traces (last 5):

View detailed traces at: https://cloud.langfuse.com

Traces include:
  - Intent classification step
  - Agent selection logic
  - Vector store retrieval
  - LLM response generation
  - Full execution timeline
  - Token usage and costs


---
# 7. Evaluator Agent (BONUS)

Automated evaluation scores each response on:
- **Relevance**: Does it answer the question?
- **Completeness**: Is all necessary information included?
- **Accuracy**: Is the information correct based on sources?
- **Clarity**: Is it well-structured and understandable?

Scores are sent to Langfuse for continuous quality monitoring.
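
Since the evaluator's reply is free text, the parsing step benefits from being strict about what counts as a score. The sketch below is one possible approach, not the notebook's implementation: `parse_scores` and `DIMENSIONS` are hypothetical names, out-of-range or missing values are recorded as `None` rather than `0.0`, and the overall score is recomputed as the mean of the four dimensions instead of trusting the model's own arithmetic.

```python
import re
from statistics import mean
from typing import Dict, Optional

DIMENSIONS = ("relevance", "completeness", "accuracy", "clarity")


def parse_scores(eval_text: str) -> Dict[str, Optional[float]]:
    """Extract 1-5 scores per dimension and recompute the overall average."""
    scores: Dict[str, Optional[float]] = {}
    for dim in DIMENSIONS:
        # Accept "RELEVANCE: 4", "**Relevance:** 4" and "Relevance: 4/5"
        pattern = rf"^\s*\**{dim}\**\s*:\s*\**\s*(\d+(?:\.\d+)?)"
        match = re.search(pattern, eval_text, re.IGNORECASE | re.MULTILINE)
        value = float(match.group(1)) if match else None
        # Discard values outside the rubric's 1-5 range
        scores[dim] = value if value is not None and 1 <= value <= 5 else None
    present = [v for v in scores.values() if v is not None]
    # Only report an overall score when every dimension was parsed
    scores["overall"] = mean(present) if len(present) == len(DIMENSIONS) else None
    return scores


sample = "RELEVANCE: 5\nCOMPLETENESS: 4\nACCURACY: 5\nCLARITY: 4\nOVERALL: 3\nREASONING: ok"
print(parse_scores(sample))
```

Keeping `None` distinct from a real score avoids a parsing failure being counted as a low-quality answer when results are averaged later.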

In [19]:
EVALUATOR_PROMPT = """You are an AI response quality evaluator for TechCorp's knowledge base system.

Evaluate the following response on a scale of 1-5 for each dimension:

**Question:** {question}

**Answer:** {answer}

**Source Documents:** {num_sources} documents retrieved

Evaluate on these dimensions:

1. **RELEVANCE** (1-5): Does the answer directly address the question?
   - 5: Perfectly addresses the question
   - 3: Partially relevant
   - 1: Completely irrelevant

2. **COMPLETENESS** (1-5): Does it include all necessary information?
   - 5: Comprehensive, nothing missing
   - 3: Covers basics but missing some details
   - 1: Incomplete or superficial

3. **ACCURACY** (1-5): Is the information factually correct?
   - 5: Fully accurate based on context
   - 3: Mostly accurate with minor issues
   - 1: Contains significant errors

4. **CLARITY** (1-5): Is it well-structured and easy to understand?
   - 5: Clear, well-organized, professional
   - 3: Understandable but could be clearer
   - 1: Confusing or poorly structured

Respond in this exact format:
RELEVANCE: <score>
COMPLETENESS: <score>
ACCURACY: <score>
CLARITY: <score>
OVERALL: <average score>
REASONING: <brief explanation>
"""

def evaluate_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """
    Evaluate a response using LLM-based scoring.
    Returns scores for relevance, completeness, accuracy, and clarity.
    """
    evaluation_prompt = EVALUATOR_PROMPT.format(
        question=response['question'],
        answer=response['answer'],
        num_sources=response['num_sources']
    )
    
    eval_response = llm.invoke(
        evaluation_prompt,
        config={"callbacks": [langfuse_handler] if langfuse_handler else []}
    )
    
    # Parse evaluation scores
    eval_text = eval_response.content
    scores = {}
    
    for line in eval_text.split('\n'):
        if ':' in line:
            key, value = line.split(':', 1)
            key = key.strip().lower()
            if key in ['relevance', 'completeness', 'accuracy', 'clarity', 'overall']:
                try:
                    scores[key] = float(value.strip().split()[0])
                except (ValueError, IndexError):
                    # Empty or non-numeric score; record 0.0 as a sentinel
                    scores[key] = 0.0
            elif key == 'reasoning':
                scores[key] = value.strip()
    
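    # Send scores to Langfuse. The scoring method name and its required arguments
    # (such as a trace ID) differ between SDK versions; failures are reported below, not raised.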
    try:
        langfuse.score(
            name="response_quality",
            value=scores.get('overall', 0),
            comment=scores.get('reasoning', ''),
            metadata={
                'relevance': scores.get('relevance', 0),
                'completeness': scores.get('completeness', 0),
                'accuracy': scores.get('accuracy', 0),
                'clarity': scores.get('clarity', 0),
                'category': response['category'],
                'agent': response['agent_used']
            }
        )
    except Exception as e:
        print(f"  ⚠️  Could not send scores to Langfuse: {e}")
    
    return scores

def display_evaluation(scores: Dict[str, Any]):
    """Display evaluation scores in a formatted way."""
    print("\n" + "="*60)
    print("QUALITY EVALUATION")
    print("="*60)
    print(f"\n📊 Relevance:    {scores.get('relevance', 0):.1f}/5")
    print(f"📊 Completeness: {scores.get('completeness', 0):.1f}/5")
    print(f"📊 Accuracy:     {scores.get('accuracy', 0):.1f}/5")
    print(f"📊 Clarity:      {scores.get('clarity', 0):.1f}/5")
    print(f"\n⭐ Overall Score: {scores.get('overall', 0):.1f}/5")
    
    if 'reasoning' in scores:
        print(f"\n💭 Reasoning: {scores['reasoning']}")
    
    print("\n" + "="*60)

print("✓ Evaluator functions defined")

✓ Evaluator functions defined


In [20]:
# Test evaluation on a sample query
print("\n" + "#"*60)
print("# EVALUATOR DEMO")
print("#"*60)

test_question = "What is the parental leave policy for non-birth parents?"
response = route_query(test_question)
display_response(response)

print("\n🔍 Running automated evaluation...")
scores = evaluate_response(response)
display_evaluation(scores)


############################################################
# EVALUATOR DEMO
############################################################

Query: What is the parental leave policy for non-birth parents?

📋 Classification: HR
🎯 Routing to HR Agent...


RESPONSE

The parental leave policy for non-birth parents at TechCorp includes the following provisions:

- **Duration of Leave:** Non-birth parents are entitled to 8 weeks (320 hours) of paid leave.
- **Flexibility in Leave Period:** The leave can be taken within 12 months of the birth of the child. Additionally, it may be split into two separate 4-week periods to provide flexibility.
- **Inclusivity:** The policy includes same-sex partners, applying equally to fathers and adoptive parents.
- **Benefits:** Non-birth parents receive the same benefits during their leave as other types of parental leave, ensuring job protection and continuation of benefits.
- **Parental Leave Process:**
  1. Non-birth parents should notify HR and th

---
# 8. Batch Testing

Run multiple test queries and evaluate all responses.
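
Beyond quality scores, a batch run can also measure routing accuracy when each test question carries an expected category. The sketch below shows the idea with a keyword-based stand-in for the classifier so that it runs without an API key; `routing_report` and `keyword_stub` are hypothetical names, and the expected labels are illustrative assumptions rather than ground truth from the notebook.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def routing_report(labeled: List[Tuple[str, str]], classify: Callable[[str], str]) -> Dict[str, object]:
    """Compare predicted categories with expected ones and tally the mismatches."""
    confusions: Counter = Counter()
    correct = 0
    for question, expected in labeled:
        predicted = classify(question)
        if predicted == expected:
            correct += 1
        else:
            confusions[(expected, predicted)] += 1
    total = len(labeled)
    accuracy = correct / total if total else 0.0
    return {"total": total, "accuracy": accuracy, "confusions": dict(confusions)}


def keyword_stub(question: str) -> str:
    """Stand-in classifier for the demo; the real system would call the LLM-based classifier."""
    q = question.lower()
    if "pto" in q:
        return "HR"
    if "api" in q:
        return "TECH"
    if "expense" in q:
        return "FINANCE"
    return "GENERAL"


labeled_queries = [
    ("How much PTO do I get after working here for 4 years?", "HR"),
    ("How do I authenticate with the TechCorp API?", "TECH"),
    ("What's the expense limit for business meals?", "FINANCE"),
    ("What are the security requirements for new employees?", "HR"),
    ("Hello! How are you?", "GENERAL"),
]
print(routing_report(labeled_queries, keyword_stub))
```

Passing `classify_intent` in place of `keyword_stub` would produce the same report for the LLM-based router, and the `confusions` tally points at which category pairs are most often mixed up.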

In [21]:
def run_batch_tests(questions: List[str], evaluate: bool = True):
    """
    Run a batch of test questions through the system.
    Optionally evaluate each response.
    """
    results = []
    
    print("\n" + "#"*60)
    print(f"# BATCH TEST: {len(questions)} queries")
    print("#"*60)
    
    for i, question in enumerate(questions, 1):
        print(f"\n\n[{i}/{len(questions)}] Testing: {question[:60]}...")
        
        # Get response
        response = route_query(question)
        
        # Evaluate if requested
        if evaluate:
            print("  Evaluating...")
            scores = evaluate_response(response)
            response['evaluation'] = scores
            print(f"  ⭐ Score: {scores.get('overall', 0):.1f}/5")
        
        results.append(response)
    
    # Summary
    print("\n\n" + "="*60)
    print("BATCH TEST SUMMARY")
    print("="*60)
    
    categories = {}
    for result in results:
        cat = result['category']
        categories[cat] = categories.get(cat, 0) + 1
    
    print("\n📊 Routing Distribution:")
    for cat, count in categories.items():
        print(f"  - {cat}: {count} queries ({count/len(questions)*100:.1f}%)")
    
    if evaluate and results:  # skip the average for an empty batch to avoid dividing by zero
        avg_score = sum(r.get('evaluation', {}).get('overall', 0) for r in results) / len(results)
        print(f"\n⭐ Average Quality Score: {avg_score:.2f}/5")
        
        low_scores = [r for r in results if r.get('evaluation', {}).get('overall', 0) < 3]
        if low_scores:
            print(f"\n⚠️  {len(low_scores)} response(s) scored below 3.0 - review recommended")
    
    print("\n" + "="*60)
    
    return results

print("✓ Batch testing function defined")

✓ Batch testing function defined


---
# Summary

## What We Built

1. **Multi-Agent RAG System** with three specialized agents (HR, Tech, Finance)
2. **Intelligent Orchestrator** that classifies and routes queries
3. **Complete Langfuse Integration** for tracing and debugging
4. **Automated Evaluator** for response quality scoring
5. **Production-Ready Components** using LangChain

## Technical Decisions

### Why LangChain?
- **Production-grade components**: RetrievalQA, vector stores, chains
- **Maintainability**: Standard abstractions, well-documented
- **Extensibility**: Easy to add new agents or modify retrieval
- **Community support**: Active development and plugins

### Why Separate Vector Stores?
- **Domain isolation**: Each agent has specialized knowledge
- **Better retrieval**: More focused results per domain
- **Scalability**: Can update one domain without affecting others
- **Performance**: Smaller vector spaces = faster search

### Why Orchestrator-Based Routing?
- **Flexibility**: Handles free-form phrasing without keyword rules; each query is currently routed to a single category
- **Accuracy**: LLM classification more robust than keyword matching
- **Debuggability**: Clear decision trail in Langfuse
- **Extensibility**: Easy to add new categories

### Why Langfuse for Tracing?
- **Full visibility**: Every step of execution logged
- **Production monitoring**: Track quality over time
- **Debugging**: Identify misrouting and bad retrievals
- **Analytics**: Understand usage patterns and performance

## Next Steps

To use this system:
1. Set up `.env` file with your API keys
2. Run cells in order to create vector stores
3. Test with your own queries
4. View traces in Langfuse dashboard
5. Monitor quality scores over time

## Access Your Traces

View detailed execution traces at: **https://cloud.langfuse.com**

You'll see:
- Complete query execution paths
- Token usage and costs
- Retrieval relevance
- Response quality scores
- Performance metrics