# Agentic RAG Patterns

## Introduction

Retrieval-Augmented Generation (RAG) has become the standard approach for grounding LLM responses in specific knowledge bases. However, **traditional RAG** has limitations when dealing with complex queries, multiple data sources, or scenarios requiring reasoning over retrieved information.

**Agentic RAG** extends traditional RAG by giving the LLM agency to:
- Plan and decompose complex queries
- Route queries to appropriate data sources
- Adaptively retrieve additional information
- Reason over retrieved documents
- Verify and validate generated answers

### Traditional RAG vs. Agentic RAG

```mermaid
graph TB
    subgraph "Traditional RAG (Static)"
        T1[User Query] --> T2[Embed Query]
        T2 --> T3[Retrieve Top K Docs]
        T3 --> T4[Single LLM Call]
        T4 --> T5[Generate Answer]
    end
    
    subgraph "Agentic RAG (Dynamic)"
        A1[User Query] --> A2{Query Analyzer}
        A2 --> A3[Query Planning]
        A3 --> A4{Router}
        A4 -->|Source 1| A5[Retrieve]
        A4 -->|Source 2| A6[Retrieve]
        A4 -->|Source 3| A7[Retrieve]
        A5 --> A8{Sufficient?}
        A6 --> A8
        A7 --> A8
        A8 -->|No| A9[Adaptive Retrieval]
        A9 --> A8
        A8 -->|Yes| A10[Reasoning]
        A10 --> A11[Generate & Verify]
        A11 --> A12[Final Answer]
    end
    
    style T1 fill:#e1f5ff
    style T5 fill:#ffccbc
    style A1 fill:#e1f5ff
    style A12 fill:#c8e6c9
```

### Key Differences

| Aspect | Traditional RAG | Agentic RAG |
|--------|----------------|-------------|
| **Query Processing** | Single embedding | Query decomposition & planning |
| **Retrieval** | Fixed top-k | Adaptive, iterative |
| **Data Sources** | Single vector store | Multi-source routing |
| **Reasoning** | Direct generation | Multi-step reasoning over docs |
| **Verification** | None | Self-verification & validation |
| **Complexity** | Low | High |
| **Cost** | Low | Higher |
| **Use Cases** | Simple Q&A | Complex analysis, research |
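
For comparison with the agentic patterns that follow, the "Traditional RAG" column can be sketched as a single embed, rank, and truncate pass. This is a toy illustration only: `embed`, `cosine`, and `retrieve_top_k` are hypothetical names, and the bag-of-words "embedding" stands in for a real embedding model.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, int]:
    """Toy 'embedding': bag-of-words term counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: str, corpus: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Static retrieval: score every document once and keep a fixed top-k."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


corpus = [
    "q3 revenue was 15 million dollars",
    "the office moved to a new building",
    "revenue growth in q3 came from new products",
]
top = retrieve_top_k("q3 revenue", corpus, k=2)
for score, doc in top:
    print(f"{score:.2f}  {doc}")
```

Nothing in this pipeline inspects the retrieved documents before generation; that missing feedback step is what the agentic patterns add.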

### When to Use Agentic RAG

- **Complex queries** requiring multi-step reasoning
- **Multiple data sources** (databases, APIs, documents, graphs)
- **High-stakes applications** requiring verification
- **Research tasks** needing comprehensive synthesis
- **Dynamic scenarios** where query complexity varies

## 1. Agentic RAG Architecture Patterns

### 1.1 Query Planning Agent

The Query Planning Agent analyzes incoming queries and creates a retrieval strategy.

```mermaid
sequenceDiagram
    participant U as User
    participant QP as Query Planner
    participant R as Retriever
    participant G as Generator
    
    U->>QP: "What was Q3 revenue and why did it increase?"
    QP->>QP: Analyze Query Type
    QP->>QP: Identify: Multi-part (fact + reasoning)
    QP->>QP: Create Plan
    Note over QP: Plan:<br/>1. Retrieve Q3 revenue data<br/>2. Retrieve Q3 performance analysis<br/>3. Synthesize answer
    
    QP->>R: Execute Step 1
    R-->>QP: Revenue: $10M
    QP->>R: Execute Step 2
    R-->>QP: New product launch increased sales
    QP->>G: Generate with context
    G-->>U: Comprehensive Answer
```

**Benefits:**
- Handles complex multi-part queries
- Optimizes retrieval strategy
- Reduces unnecessary retrievals

In [None]:
# Setup
import anthropic
import os
from typing import List, Dict, Any, Optional
import json

# Initialize Anthropic client
client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

def query_planning_agent(user_query: str) -> Dict[str, Any]:
    """
    Analyze query and create a retrieval plan
    """
    print("=" * 60)
    print("QUERY PLANNING AGENT")
    print("=" * 60)
    print(f"\nUser Query: {user_query}\n")
    
    planning_prompt = f"""You are a query planning agent for a RAG system. Analyze this query and create a retrieval plan.

Query: {user_query}

Analyze:
1. Query type (factual, analytical, comparative, multi-part)
2. Information needed (data, context, reasoning)
3. Number of retrieval steps required
4. Optimal retrieval strategy

Return a JSON plan:
{{
  "query_type": "type",
  "complexity": "simple|moderate|complex",
  "retrieval_steps": [
    {{"step": 1, "action": "retrieve X", "source": "documents|database|api"}},
    {{"step": 2, "action": "retrieve Y", "source": "documents|database|api"}}
  ],
  "reasoning_required": true|false
}}"""
    
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": planning_prompt}]
    )
    
    result_text = response.content[0].text
    
    # Extract JSON from response
    try:
        start = result_text.find('{')
        end = result_text.rfind('}') + 1
        plan = json.loads(result_text[start:end])
    except json.JSONDecodeError:  # reply contained no parseable JSON; fall back to a default plan
        plan = {
            "query_type": "unknown",
            "complexity": "simple",
            "retrieval_steps": [{"step": 1, "action": "retrieve relevant documents", "source": "documents"}],
            "reasoning_required": False
        }
    
    print("Query Analysis:")
    print(f"  Type: {plan['query_type']}")
    print(f"  Complexity: {plan['complexity']}")
    print(f"  Reasoning Required: {plan['reasoning_required']}")
    print(f"\nRetrieval Plan ({len(plan['retrieval_steps'])} steps):")
    for step in plan['retrieval_steps']:
        print(f"  Step {step['step']}: {step['action']} from {step['source']}")
    
    return plan

# Example: Simple query
plan1 = query_planning_agent("What is the company's revenue for Q3 2024?")

print("\n" + "="*60 + "\n")

# Example: Complex query
plan2 = query_planning_agent(
    "Compare our Q3 2024 revenue with Q3 2023, explain the main drivers of change, "
    "and identify which product lines contributed most to growth"
)
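
The find-the-braces-then-parse step used above recurs in every agent in this notebook, so it may be worth factoring out. A minimal sketch, assuming the model reply contains at most one top-level JSON object; `extract_json` is a hypothetical helper name, not part of any library.

```python
import json
from typing import Any, Dict, Optional


def extract_json(text: str, default: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Parse the outermost {...} span in `text`; return a copy of `default` on failure."""
    fallback = dict(default or {})
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return fallback  # no braces at all, or a closing brace before the opening one
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return fallback  # braces present but the content is not valid JSON
    return parsed if isinstance(parsed, dict) else fallback


print(extract_json('Here is the plan: {"complexity": "simple"} Hope that helps.'))
print(extract_json("Sorry, I cannot help.", default={"complexity": "simple"}))
```

Catching only `json.JSONDecodeError` keeps unrelated failures (for example a `KeyboardInterrupt`) from being silently swallowed.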

### 1.2 Routing Agent

The Routing Agent directs queries to the most appropriate data source(s).

```mermaid
graph TB
    A[Query] --> B{Routing Agent}
    
    B -->|Financial Data| C[SQL Database]
    B -->|Product Docs| D[Vector Store]
    B -->|Real-time Info| E[API/Web]
    B -->|Code/Technical| F[Code Repository]
    B -->|Relationships| G[Knowledge Graph]
    
    C --> H[Response Aggregator]
    D --> H
    E --> H
    F --> H
    G --> H
    
    H --> I[Synthesized Answer]
    
    style B fill:#ffccbc
    style H fill:#fff9c4
    style I fill:#c8e6c9
```

**Routing Strategies:**
1. **Semantic Routing** - Based on query meaning
2. **Keyword Routing** - Based on specific terms
3. **LLM-based Routing** - LLM decides best source
4. **Multi-source Routing** - Query multiple sources in parallel
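
Of the strategies listed above, keyword routing is the cheapest to run because it needs no model call. A minimal sketch, assuming the three source names used later in this section; the `ROUTING_KEYWORDS` table and `keyword_route` function are illustrative, not a recommended vocabulary.

```python
from typing import Dict, List

# Hypothetical keyword table: source name -> trigger terms (illustrative only)
ROUTING_KEYWORDS: Dict[str, List[str]] = {
    "sql_database": ["revenue", "sales", "quarter", "q1", "q2", "q3", "q4", "budget"],
    "vector_store": ["how do i", "configure", "guide", "documentation", "install"],
    "api": ["current", "today", "latest", "stock price", "live"],
}


def keyword_route(query: str, default_source: str = "vector_store") -> List[str]:
    """Return every source whose keywords appear in the query, else the default."""
    text = query.lower()
    matched = [
        source
        for source, keywords in ROUTING_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]
    return matched or [default_source]


print(keyword_route("What was our Q3 2024 revenue?"))
print(keyword_route("How do I configure authentication in the API?"))
```

Plain substring matching is brittle ("live" also matches "deliver"), so a router like this is probably best used as a fast first pass, with LLM-based routing as the fallback when no keyword matches.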

In [None]:
class DataSource:
    """Base class for data sources"""
    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description
    
    def query(self, query: str) -> List[Dict[str, Any]]:
        """Execute query against this data source"""
        raise NotImplementedError

class VectorStoreSource(DataSource):
    """Simulated vector store for documents"""
    def query(self, query: str) -> List[Dict[str, Any]]:
        return [
            {
                "content": f"Document about {query} from vector store",
                "score": 0.95,
                "source": "vector_store",
                "metadata": {"type": "documentation"}
            }
        ]

class SQLDatabaseSource(DataSource):
    """Simulated SQL database"""
    def query(self, query: str) -> List[Dict[str, Any]]:
        return [
            {
                "content": f"Structured data about {query} from database",
                "score": 1.0,
                "source": "sql_database",
                "metadata": {"type": "financial_data"}
            }
        ]

class APISource(DataSource):
    """Simulated API for real-time data"""
    def query(self, query: str) -> List[Dict[str, Any]]:
        return [
            {
                "content": f"Real-time information about {query} from API",
                "score": 0.90,
                "source": "api",
                "metadata": {"type": "live_data"}
            }
        ]

def routing_agent(query: str, available_sources: List[DataSource]) -> Dict[str, Any]:
    """
    Route query to appropriate data source(s)
    """
    print("=" * 60)
    print("ROUTING AGENT")
    print("=" * 60)
    print(f"\nQuery: {query}\n")
    
    # Build source descriptions
    source_descriptions = "\n".join([
        f"- {src.name}: {src.description}"
        for src in available_sources
    ])
    
    routing_prompt = f"""You are a routing agent. Determine which data source(s) to query.

Query: {query}

Available Sources:
{source_descriptions}

Select one or more sources. Return JSON:
{{
  "selected_sources": ["source_name1", "source_name2"],
  "reasoning": "why these sources"
}}"""
    
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": routing_prompt}]
    )
    
    result_text = response.content[0].text
    
    try:
        start = result_text.find('{')
        end = result_text.rfind('}') + 1
        routing_decision = json.loads(result_text[start:end])
    except json.JSONDecodeError:  # reply contained no parseable JSON; default to the first source
        routing_decision = {
            "selected_sources": [available_sources[0].name],
            "reasoning": "Default to first source"
        }
    
    print(f"Routing Decision:")
    print(f"  Selected Sources: {', '.join(routing_decision['selected_sources'])}")
    print(f"  Reasoning: {routing_decision['reasoning']}")
    
    # Execute queries on selected sources
    results = []
    source_map = {src.name: src for src in available_sources}
    
    print(f"\nQuerying selected sources:")
    for source_name in routing_decision['selected_sources']:
        if source_name in source_map:
            source = source_map[source_name]
            source_results = source.query(query)
            results.extend(source_results)
            print(f"  {source_name}: {len(source_results)} results")
    
    return {
        "query": query,
        "routing_decision": routing_decision,
        "results": results
    }

# Setup data sources
sources = [
    VectorStoreSource("vector_store", "Document embeddings for product documentation and guides"),
    SQLDatabaseSource("sql_database", "Structured financial and operational data"),
    APISource("api", "Real-time market data and external information")
]

# Example: Financial query
result1 = routing_agent("What was our Q3 2024 revenue?", sources)

print("\n" + "="*60 + "\n")

# Example: Product documentation query
result2 = routing_agent("How do I configure authentication in the API?", sources)

print("\n" + "="*60 + "\n")

# Example: Real-time query
result3 = routing_agent("What is the current stock price of our main competitor?", sources)

### 1.3 Adaptive Retrieval Agent

The Adaptive Retrieval Agent decides whether retrieved information is sufficient or if additional retrieval is needed.

```mermaid
sequenceDiagram
    participant Q as Query
    participant A as Adaptive Agent
    participant R as Retriever
    participant E as Evaluator
    
    Q->>A: User Query
    A->>R: Initial Retrieval (top-5)
    R-->>A: 5 documents
    A->>E: Evaluate Sufficiency
    
    alt Insufficient
        E-->>A: Need more context
        A->>R: Expanded Retrieval (top-10)
        R-->>A: 10 documents
        A->>E: Re-evaluate
    end
    
    alt Still Insufficient
        E-->>A: Need different query
        A->>A: Reformulate Query
        A->>R: New Retrieval
        R-->>A: New documents
    end
    
    E-->>A: Sufficient
    A->>Q: Generate Answer
```

**Adaptive Strategies:**
1. **Expand k** - Retrieve more documents
2. **Query Reformulation** - Rephrase search query
3. **Multi-hop Retrieval** - Follow references
4. **Source Expansion** - Query additional sources
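
One way to read the strategies above is as updates to a small retrieval state that the agent carries between iterations. The sketch below covers expanding k, reformulation, and source expansion (multi-hop is omitted); `RetrievalState`, `apply_strategy`, and the source names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RetrievalState:
    query: str
    k: int = 3
    sources: List[str] = field(default_factory=lambda: ["vector_store"])
    # Sources not yet tried, in preference order (hypothetical names)
    remaining_sources: List[str] = field(default_factory=lambda: ["sql_database", "api"])


def apply_strategy(state: RetrievalState, action: str, new_query: str = "") -> RetrievalState:
    """Mutate the retrieval state according to the evaluator's suggested action."""
    if action == "expand_k":
        state.k += 2  # retrieve two more documents next time
    elif action == "reformulate_query" and new_query:
        state.query = new_query  # caller supplies the rephrased query
    elif action == "different_source" and state.remaining_sources:
        state.sources.append(state.remaining_sources.pop(0))
    return state  # unknown actions leave the state unchanged


state = RetrievalState(query="auth best practices")
apply_strategy(state, "expand_k")
apply_strategy(state, "different_source")
print(state.k, state.sources)
```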

In [None]:
def adaptive_retrieval_agent(
    query: str,
    max_iterations: int = 3,
    sufficiency_threshold: float = 0.8
) -> Dict[str, Any]:
    """
    Adaptively retrieve information until sufficient context is obtained
    """
    print("=" * 60)
    print("ADAPTIVE RETRIEVAL AGENT")
    print("=" * 60)
    print(f"\nQuery: {query}")
    print(f"Max Iterations: {max_iterations}")
    print(f"Sufficiency Threshold: {sufficiency_threshold}\n")
    
    all_retrieved = []
    iteration = 0
    current_query = query
    k = 3  # Start with retrieving 3 documents
    
    while iteration < max_iterations:
        iteration += 1
        print(f"\n[Iteration {iteration}]")
        print(f"  Retrieving top-{k} for: {current_query}")
        
        # Simulate retrieval (in production, this would query real vector store)
        retrieved_docs = [
            {
                "content": f"Document {i} about {current_query} (iteration {iteration})",
                "score": 0.9 - (i * 0.1),
                "id": f"doc_{iteration}_{i}"
            }
            for i in range(1, k + 1)
        ]
        
        all_retrieved.extend(retrieved_docs)
        print(f"  Retrieved: {len(retrieved_docs)} documents")
        
        # Evaluate sufficiency
        sufficiency_prompt = f"""Evaluate if these retrieved documents are sufficient to answer the query.

Query: {query}

Retrieved Documents:
{chr(10).join([f"- {doc['content']} (score: {doc['score']})" for doc in all_retrieved])}

Evaluate:
1. Do these documents contain information to answer the query?
2. Is additional retrieval needed?
3. If insufficient, what strategy should be used?

Return JSON:
{{
  "sufficiency_score": 0.0-1.0,
  "is_sufficient": true|false,
  "reasoning": "explanation",
  "next_action": "done|expand_k|reformulate_query|different_source"
}}"""
        
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=512,
            messages=[{"role": "user", "content": sufficiency_prompt}]
        )
        
        result_text = response.content[0].text
        
        try:
            start = result_text.find('{')
            end = result_text.rfind('}') + 1
            evaluation = json.loads(result_text[start:end])
        except json.JSONDecodeError:  # reply contained no parseable JSON; assume sufficient and stop
            evaluation = {
                "sufficiency_score": 0.9,
                "is_sufficient": True,
                "reasoning": "Assumed sufficient",
                "next_action": "done"
            }
        
        print(f"  Sufficiency Score: {evaluation['sufficiency_score']:.2f}")
        print(f"  Status: {'Sufficient' if evaluation['is_sufficient'] else 'Insufficient'}")
        print(f"  Reasoning: {evaluation['reasoning']}")
        
        # Check if we should continue
        if evaluation['is_sufficient'] or evaluation['sufficiency_score'] >= sufficiency_threshold:
            print(f"\n[Complete] Retrieved sufficient information in {iteration} iteration(s)")
            break
        
        # Adapt retrieval strategy
        next_action = evaluation.get('next_action', 'expand_k')
        print(f"  Next Action: {next_action}")
        
        if next_action == 'expand_k':
            k += 2  # Retrieve 2 more documents
        elif next_action == 'reformulate_query':
            # Reformulate query (simplified here)
            reformulation_prompt = f"Reformulate this query to retrieve different information: {current_query}"
            ref_response = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=256,
                messages=[{"role": "user", "content": reformulation_prompt}]
            )
            current_query = ref_response.content[0].text.strip()
            print(f"  Reformulated to: {current_query}")
        
        if iteration >= max_iterations:
            print(f"\n[Complete] Max iterations reached")
            break
    
    return {
        "query": query,
        "iterations": iteration,
        "total_documents": len(all_retrieved),
        "documents": all_retrieved,
        "final_sufficiency": evaluation.get('sufficiency_score', 0.0)
    }

# Example
result = adaptive_retrieval_agent(
    "What are the best practices for implementing authentication in microservices?",
    max_iterations=3,
    sufficiency_threshold=0.8
)

## 2. Advanced Agentic RAG Patterns

### 2.1 Self-RAG: Self-Reflective Retrieval

Self-RAG adds reflection steps where the agent evaluates its own generation quality.

```mermaid
graph TB
    A[Query] --> B[Retrieve Documents]
    B --> C[Generate Answer]
    C --> D{Self-Reflection}
    
    D -->|Relevant & Supported| E[Return Answer]
    
    D -->|Not Relevant| F[Retrieve Different Docs]
    F --> C
    
    D -->|Not Supported| G[Request More Evidence]
    G --> B
    
    D -->|Hallucination Detected| H[Regenerate]
    H --> C
    
    style A fill:#e1f5ff
    style D fill:#ffccbc
    style E fill:#c8e6c9
```

**Reflection Criteria:**
1. **Relevance** - Are retrieved docs relevant to query?
2. **Support** - Is answer supported by retrieved docs?
3. **Completeness** - Does answer fully address query?
4. **Consistency** - Are there contradictions?

In [None]:
def self_rag_system(query: str, documents: List[str]) -> Dict[str, Any]:
    """
    Self-RAG with reflection and verification
    """
    print("=" * 60)
    print("SELF-RAG SYSTEM")
    print("=" * 60)
    print(f"\nQuery: {query}\n")
    
    max_attempts = 3
    attempt = 0
    
    while attempt < max_attempts:
        attempt += 1
        print(f"\n[Attempt {attempt}]")
        
        # Step 1: Generate answer
        print("  Generating answer...")
        generation_prompt = f"""Answer this query using the provided documents:

Query: {query}

Documents:
{chr(10).join([f"- {doc}" for doc in documents])}

Provide a clear, concise answer citing the documents."""
        
        gen_response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{"role": "user", "content": generation_prompt}]
        )
        
        generated_answer = gen_response.content[0].text
        print(f"  Generated answer (excerpt): {generated_answer[:150]}...")
        
        # Step 2: Self-reflection
        print("  Performing self-reflection...")
        reflection_prompt = f"""Evaluate this generated answer:

Query: {query}

Documents:
{chr(10).join([f"- {doc}" for doc in documents])}

Generated Answer:
{generated_answer}

Evaluate:
1. Relevance (0-1): Are documents relevant to query?
2. Support (0-1): Is answer supported by documents?
3. Completeness (0-1): Does answer fully address query?
4. Factuality (0-1): Is the answer free of hallucinations (1 = none)?

Return JSON:
{{
  "relevance": 0.0-1.0,
  "support": 0.0-1.0,
  "completeness": 0.0-1.0,
  "factuality": 0.0-1.0,
  "overall_quality": 0.0-1.0,
  "accept": true|false,
  "issues": ["list of issues if any"]
}}"""
        
        ref_response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=512,
            messages=[{"role": "user", "content": reflection_prompt}]
        )
        
        result_text = ref_response.content[0].text
        
        try:
            start = result_text.find('{')
            end = result_text.rfind('}') + 1
            reflection = json.loads(result_text[start:end])
        except json.JSONDecodeError:  # reply contained no parseable JSON; fall back to default scores
            reflection = {
                "relevance": 0.9,
                "support": 0.9,
                "completeness": 0.9,
                "factuality": 0.9,
                "overall_quality": 0.9,
                "accept": True,
                "issues": []
            }
        
        print(f"  Reflection Scores:")
        print(f"    Relevance: {reflection['relevance']:.2f}")
        print(f"    Support: {reflection['support']:.2f}")
        print(f"    Completeness: {reflection['completeness']:.2f}")
        print(f"    Factuality: {reflection['factuality']:.2f}")
        print(f"    Overall: {reflection['overall_quality']:.2f}")
        
        # Step 3: Decide
        if reflection['accept'] and reflection['overall_quality'] >= 0.8:
            print(f"\n[Success] Answer accepted after {attempt} attempt(s)")
            return {
                "query": query,
                "answer": generated_answer,
                "attempts": attempt,
                "reflection": reflection,
                "status": "accepted"
            }
        
        if reflection['issues']:
            print(f"  Issues identified: {', '.join(reflection['issues'])}")
        
        if attempt >= max_attempts:
            print(f"\n[Warning] Max attempts reached, returning best answer")
            return {
                "query": query,
                "answer": generated_answer,
                "attempts": attempt,
                "reflection": reflection,
                "status": "max_attempts"
            }
        
        print(f"  Retrying with refined approach...")
    
    return {"status": "failed"}

# Example
example_docs = [
    "The company's Q3 2024 revenue was $15M, up 25% from Q3 2023.",
    "Main growth drivers included new product launches and expanded market presence in APAC.",
    "The SaaS division contributed 60% of total revenue, growing 40% YoY."
]

result = self_rag_system(
    "What was Q3 2024 revenue and what drove the growth?",
    example_docs
)

### 2.2 Corrective RAG (CRAG)

Corrective RAG evaluates retrieved document quality and takes corrective actions.

```mermaid
graph TB
    A[Query] --> B[Initial Retrieval]
    B --> C{Evaluate Relevance}
    
    C -->|High Relevance| D[Use Retrieved Docs]
    C -->|Medium Relevance| E[Filter & Refine]
    C -->|Low Relevance| F[Web Search Fallback]
    
    E --> G[Knowledge Refinement]
    F --> G
    
    D --> H[Generate Answer]
    G --> H
    
    H --> I[Final Answer]
    
    style A fill:#e1f5ff
    style C fill:#ffccbc
    style I fill:#c8e6c9
```

**Corrective Actions:**
1. **Filter** - Remove low-relevance documents
2. **Refine** - Extract only relevant portions
3. **Augment** - Add external knowledge
4. **Fallback** - Use web search if internal docs insufficient
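
The three branches of the diagram above can be sketched as a threshold check on retrieval scores. This is a simplified illustration rather than a full CRAG implementation: the thresholds, the `corrective_retrieve` name, and the `web_search` callable are all assumptions, and a production system would more likely use a trained or LLM-based relevance grader than raw similarity scores.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]

# Assumed thresholds; tune them on your own relevance data
HIGH_RELEVANCE = 0.7
LOW_RELEVANCE = 0.3


def corrective_retrieve(
    query: str,
    docs: List[Doc],
    web_search: Callable[[str], List[Doc]],
) -> Dict[str, Any]:
    """Pick a corrective action based on the best retrieval score."""
    best = max((doc["score"] for doc in docs), default=0.0)
    if best >= HIGH_RELEVANCE:
        # High relevance: use the retrieved documents as-is
        return {"action": "use", "documents": docs}
    if best < LOW_RELEVANCE:
        # Low relevance (or nothing retrieved): fall back to external search
        return {"action": "fallback", "documents": web_search(query)}
    # Medium relevance: drop documents below the low-relevance threshold
    kept = [doc for doc in docs if doc["score"] >= LOW_RELEVANCE]
    return {"action": "filter", "documents": kept}


def fake_web_search(query: str) -> List[Doc]:
    """Stand-in for a real web search call."""
    return [{"content": f"Web result for {query}", "score": 1.0}]


docs = [
    {"content": "Q3 revenue summary", "score": 0.55},
    {"content": "Office relocation notice", "score": 0.2},
]
decision = corrective_retrieve("Q3 revenue", docs, fake_web_search)
print(decision["action"], len(decision["documents"]))
```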

### 2.3 Multi-Hop RAG

Multi-hop RAG performs multiple retrieval steps, using information from each step to inform the next.

```mermaid
sequenceDiagram
    participant Q as Query
    participant A as Agent
    participant R as Retriever
    
    Q->>A: "How did Company X's Q3 performance compare to competitors?"
    
    Note over A: Hop 1: Get Company X data
    A->>R: Retrieve Company X Q3 results
    R-->>A: Company X: $10M revenue
    
    Note over A: Hop 2: Identify competitors
    A->>R: Who are Company X's main competitors?
    R-->>A: Competitors: Company Y, Company Z
    
    Note over A: Hop 3: Get competitor data
    A->>R: Retrieve Company Y, Z Q3 results
    R-->>A: Company Y: $8M, Company Z: $12M
    
    Note over A: Synthesize
    A->>Q: Comparative analysis
```

In [None]:
def multi_hop_rag(query: str, max_hops: int = 3) -> Dict[str, Any]:
    """
    Multi-hop retrieval where each step informs the next
    """
    print("=" * 60)
    print("MULTI-HOP RAG")
    print("=" * 60)
    print(f"\nQuery: {query}")
    print(f"Max Hops: {max_hops}\n")
    
    all_retrieved = []
    hop_history = []
    current_context = ""
    
    for hop in range(1, max_hops + 1):
        print(f"\n[Hop {hop}]")
        
        # Determine what to retrieve next
        planning_prompt = f"""Based on the query and retrieved information so far, what should we retrieve next?

Original Query: {query}

Retrieved So Far:
{current_context if current_context else '(Nothing yet)'}

What information is still needed? Generate a specific retrieval query.
If all information is available, respond with "DONE".

Return just the retrieval query or "DONE"."""
        
        plan_response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=256,
            messages=[{"role": "user", "content": planning_prompt}]
        )
        
        next_query = plan_response.content[0].text.strip()
        
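        # Match the sentinel itself, not queries that merely contain "done" (e.g. "abandoned")
        if next_query.strip("\"' ").upper().startswith("DONE"):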
            print(f"  Agent determined all necessary information has been retrieved")
            break
        
        print(f"  Retrieval query: {next_query}")
        
        # Simulate retrieval (in production, this would query real data sources)
        retrieved = {
            "hop": hop,
            "query": next_query,
            "results": [
                f"Information for hop {hop} about: {next_query}",
                f"Additional details for hop {hop}"
            ]
        }
        
        all_retrieved.append(retrieved)
        hop_history.append(next_query)
        
        # Update context
        current_context += f"\n\nHop {hop} ({next_query}):\n"
        current_context += "\n".join(retrieved["results"])
        
        print(f"  Retrieved: {len(retrieved['results'])} results")
    
    # Final synthesis
    print(f"\n[Synthesis]")
    print(f"  Synthesizing answer from {len(all_retrieved)} hops...")
    
    synthesis_prompt = f"""Synthesize a comprehensive answer using all retrieved information:

Original Query: {query}

Retrieved Information:
{current_context}

Provide a clear, comprehensive answer that integrates all the retrieved information."""
    
    synthesis_response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": synthesis_prompt}]
    )
    
    final_answer = synthesis_response.content[0].text
    
    print(f"\n[Complete] Multi-hop retrieval finished in {len(all_retrieved)} hops")
    
    return {
        "query": query,
        "num_hops": len(all_retrieved),
        "hop_history": hop_history,
        "all_retrieved": all_retrieved,
        "final_answer": final_answer
    }

# Example
result = multi_hop_rag(
    "How does the performance of our flagship product compare to the top 3 competitors "
    "in terms of both market share and customer satisfaction?",
    max_hops=4
)

print(f"\nFinal Answer (excerpt):\n{result['final_answer'][:200]}...")

## 3. Complete Agentic RAG System

### Integrated Architecture

```mermaid
graph TB
    A[User Query] --> B[Query Planning Agent]
    B --> C{Complexity Analysis}
    
    C -->|Simple| D[Direct RAG]
    C -->|Complex| E[Agentic RAG]
    
    E --> F[Routing Agent]
    F --> G1[Vector Store]
    F --> G2[SQL Database]
    F --> G3[External APIs]
    
    G1 --> H[Adaptive Retrieval]
    G2 --> H
    G3 --> H
    
    H --> I{Multi-hop Needed?}
    I -->|Yes| J[Multi-hop Retrieval]
    I -->|No| K[Single-pass]
    
    J --> L[Self-RAG Reflection]
    K --> L
    D --> L
    
    L --> M{Quality Check}
    M -->|Pass| N[Final Answer]
    M -->|Fail| H
    
    style A fill:#e1f5ff
    style C fill:#fff9c4
    style M fill:#ffccbc
    style N fill:#c8e6c9
```

In [None]:
class AgenticRAGSystem:
    """
    Complete agentic RAG system with all patterns integrated
    """
    
    def __init__(self, data_sources: List[DataSource]):
        self.data_sources = data_sources
        self.query_count = 0
        self.total_retrievals = 0
    
    def query(self, user_query: str) -> Dict[str, Any]:
        """
        Process query through complete agentic RAG pipeline
        """
        self.query_count += 1
        
        print("\n" + "="*60)
        print("AGENTIC RAG SYSTEM")
        print("="*60)
        print(f"Query #{self.query_count}: {user_query}\n")
        
        # Step 1: Query Planning
        plan = query_planning_agent(user_query)
        
        # Step 2: Route to appropriate sources
        routing_result = routing_agent(user_query, self.data_sources)
        
        # Step 3: Adaptive Retrieval
        retrieval_result = adaptive_retrieval_agent(
            user_query,
            max_iterations=2 if plan['complexity'] == 'simple' else 3
        )
        
        self.total_retrievals += retrieval_result['total_documents']
        
        # Step 4: Multi-hop if needed
        if plan.get('reasoning_required', False):
            print("\n[Multi-hop reasoning enabled]")
            multihop_result = multi_hop_rag(user_query, max_hops=2)
            context = multihop_result['final_answer']
        else:
            context = "\n".join([
                doc['content'] for doc in retrieval_result['documents']
            ])
        
        # Step 5: Generate answer
        generation_prompt = f"""Answer this query using the retrieved information:

Query: {user_query}

Retrieved Context:
{context}

Provide a comprehensive answer with citations."""
        
        gen_response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=2048,
            messages=[{"role": "user", "content": generation_prompt}]
        )
        
        answer = gen_response.content[0].text
        
        # Step 6: Self-reflection quality check
        reflection_prompt = f"""Evaluate this RAG answer:

Query: {user_query}
Answer: {answer}

Rate overall quality (0-1) and identify any issues.
Return JSON: {{"quality_score": 0.0-1.0, "issues": []}}"""
        
        ref_response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=256,
            messages=[{"role": "user", "content": reflection_prompt}]
        )
        
        result_text = ref_response.content[0].text
        
        try:
            start = result_text.find('{')
            end = result_text.rfind('}') + 1
            quality = json.loads(result_text[start:end])
        except json.JSONDecodeError:  # reply contained no parseable JSON; use a default score
            quality = {"quality_score": 0.85, "issues": []}
        
        print(f"\n[Quality Check] Score: {quality['quality_score']:.2f}")
        
        print("\n" + "="*60)
        print("FINAL ANSWER")
        print("="*60)
        print(answer)
        
        return {
            "query": user_query,
            "plan": plan,
            "routing": routing_result['routing_decision'],
            "retrieval_iterations": retrieval_result['iterations'],
            "total_documents": retrieval_result['total_documents'],
            "answer": answer,
            "quality_score": quality['quality_score'],
            "issues": quality.get('issues', [])
        }
    
    def get_stats(self) -> Dict[str, Any]:
        """
        Get system statistics
        """
        return {
            "total_queries": self.query_count,
            "total_retrievals": self.total_retrievals,
            "avg_retrievals_per_query": self.total_retrievals / max(1, self.query_count)
        }

# Example usage
sources = [
    VectorStoreSource("vector_store", "Document embeddings for product documentation"),
    SQLDatabaseSource("sql_database", "Structured financial and operational data"),
    APISource("api", "Real-time external data")
]

rag_system = AgenticRAGSystem(data_sources=sources)

# Test with different query types
result1 = rag_system.query("What was our Q3 2024 revenue?")

print("\n\n" + "#"*60 + "\n")

result2 = rag_system.query(
    "Compare our Q3 2024 performance to competitors and explain the key factors "
    "that differentiated our results"
)

# Show statistics
stats = rag_system.get_stats()
print("\n" + "="*60)
print("SYSTEM STATISTICS")
print("="*60)
print(f"Total Queries: {stats['total_queries']}")
print(f"Total Retrievals: {stats['total_retrievals']}")
print(f"Avg Retrievals/Query: {stats['avg_retrievals_per_query']:.1f}")

## 4. Production Best Practices

### 4.1 Performance Optimization

**Caching Strategy:**
```python
# Cache at multiple levels (TTLs in seconds)
CACHE_TTL_SECONDS = {
    "query_embeddings": 60 * 60,     # 1 hour
    "retrieved_documents": 30 * 60,  # 30 minutes
    "generated_answers": 15 * 60,    # 15 minutes
    "routing_decisions": 60 * 60,    # 1 hour (LLM routing decisions)
}
```
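
Each of the cache levels above needs some store that expires entries after a time-to-live. A minimal in-memory sketch follows; `TTLCache` is a hypothetical class written for illustration, and a real deployment would more likely use a shared cache such as Redis so that entries survive restarts and are visible across workers.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire `ttl_seconds` after being set."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so tests can control time
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # lazily evict expired entries on read
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self.clock() + self.ttl, value)


answer_cache = TTLCache(ttl_seconds=15 * 60)
answer_cache.set("What was our Q3 2024 revenue?", "cached answer text")
print(answer_cache.get("What was our Q3 2024 revenue?"))
```

Keying on the raw query string only helps with exact repeats; normalizing the query (case, whitespace) before lookup is a cheap way to raise the hit rate.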

**Parallel Execution:**
```python
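# Execute independent retrievals in parallel.
# The retrieve_from_* functions are placeholders for your own async retrieval coroutines.
import asyncio

async def parallel_retrieve(query: str):
    tasks = [
        retrieve_from_vector_store(query),
        retrieve_from_database(query),
        retrieve_from_api(query)
    ]
    # gather() runs the coroutines concurrently and returns results in task order
    results = await asyncio.gather(*tasks)
    return results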
```
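One practical concern with the pattern above is that a single slow or failing source can stall or fail the whole query. The sketch below is one possible way to guard each retrieval with a timeout and degrade to an empty result. `retrieve_all`, the `fake_*` coroutines, and the timeout value are all assumptions made for illustration, not part of the system defined earlier.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Retriever = Callable[[str], Awaitable[List[str]]]


async def retrieve_all(
    query: str, sources: Dict[str, Retriever], timeout_s: float = 2.0
) -> Dict[str, List[str]]:
    """Query every source concurrently; a failed or slow source yields []."""

    async def guarded(name: str, fn: Retriever):
        try:
            return name, await asyncio.wait_for(fn(query), timeout=timeout_s)
        except Exception:  # includes asyncio.TimeoutError
            return name, []

    pairs = await asyncio.gather(*(guarded(n, f) for n, f in sources.items()))
    return dict(pairs)


# Stand-in retrievers (hypothetical) that simulate three behaviours
async def fake_vector_store(query: str) -> List[str]:
    await asyncio.sleep(0.01)
    return [f"doc about {query}"]


async def fake_slow_api(query: str) -> List[str]:
    await asyncio.sleep(1.0)  # slower than the timeout used below
    return ["late result"]


async def fake_broken_db(query: str) -> List[str]:
    raise RuntimeError("connection refused")


results = asyncio.run(
    retrieve_all(
        "q3 revenue",
        {"vector_store": fake_vector_store, "api": fake_slow_api, "database": fake_broken_db},
        timeout_s=0.1,
    )
)
print(results)
```

In a real system you would probably log the swallowed exception rather than silently returning an empty list.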

**Model Selection:**
- Query planning: Claude Haiku (fast, cheap)
- Routing: Claude Haiku
- Generation: Claude Sonnet (quality)
- Reflection: Claude Sonnet

### 4.2 Cost Management

```mermaid
graph TB
    A[Incoming Query] --> B{Check Cache}
    B -->|Hit| C[Return Cached]
    B -->|Miss| D{Complexity}
    
    D -->|Simple| E[Direct RAG with Haiku]
    D -->|Medium| F[Agentic RAG with Haiku + Sonnet]
    D -->|Complex| G[Full Agentic with Sonnet]
    
    E --> H[Cache Result]
    F --> H
    G --> H
    
    H --> I[Return to User]
    
    style C fill:#c8e6c9
    style D fill:#ffccbc
```

**Cost Optimization Tips:**
1. Use complexity detection to avoid over-engineering simple queries
2. Implement aggressive caching for common queries
3. Set iteration limits to prevent runaway costs
4. Use cheaper models for routing and planning
5. Monitor token usage per query type
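The flow in the diagram (cache check, then complexity-based routing with an iteration limit) can be sketched as follows. The keyword heuristic, the `PIPELINES` table, and the `max_iterations` values are illustrative assumptions. In practice the complexity label would more likely come from a small, cheap LLM call, and the model names here are tier labels rather than real model IDs.

```python
from typing import Any, Dict

# Crude keyword heuristic (assumption); a small LLM could replace this
COMPLEX_MARKERS = ("compare", "explain", "why", "analyze", "trade-off")


def detect_complexity(query: str) -> str:
    q = query.lower()
    hits = sum(marker in q for marker in COMPLEX_MARKERS)
    words = len(q.split())
    if hits >= 2 or words > 25:
        return "complex"
    if hits == 1 or words > 12:
        return "moderate"
    return "simple"


# Pipeline, model tiers, and iteration cap per complexity level (illustrative)
PIPELINES: Dict[str, Dict[str, Any]] = {
    "simple": {"pipeline": "direct_rag", "models": ["haiku"], "max_iterations": 1},
    "moderate": {"pipeline": "agentic_rag", "models": ["haiku", "sonnet"], "max_iterations": 2},
    "complex": {"pipeline": "full_agentic", "models": ["sonnet"], "max_iterations": 4},
}


def route(query: str, cache: Dict[str, str]) -> Dict[str, Any]:
    """Return a cached answer if present, otherwise an execution plan."""
    key = query.strip().lower()
    if key in cache:
        return {"source": "cache", "answer": cache[key]}
    level = detect_complexity(query)
    return {"source": "pipeline", "complexity": level, **PIPELINES[level]}


plan = route(
    "Compare our Q3 2024 performance to competitors and explain the key factors "
    "that differentiated our results",
    cache={},
)
print(plan["complexity"], plan["pipeline"], plan["max_iterations"])
```

Capping `max_iterations` per level is what keeps a badly behaved retrieval loop from turning one query into dozens of LLM calls.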

### 4.3 Monitoring and Observability

In [None]:
class MonitoredAgenticRAG(AgenticRAGSystem):
    """
    Agentic RAG with comprehensive monitoring
    """
    
    def __init__(self, data_sources: List[DataSource]):
        super().__init__(data_sources)
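        self.metrics = {
            "queries_by_complexity": {"simple": 0, "moderate": 0, "complex": 0},
            "avg_retrieval_iterations": [],
            "avg_quality_scores": [],
            # Cache counters are placeholders: nothing in this class updates them,
            # so wire them to your caching layer before relying on the hit rate
            "cache_hits": 0,
            "cache_misses": 0,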
            "errors": []
        }
    
    def query(self, user_query: str) -> Dict[str, Any]:
        """
        Query with monitoring
        """
        import time
        start_time = time.time()
        
        try:
            result = super().query(user_query)
            
            # Record metrics
            complexity = result['plan']['complexity']
            self.metrics['queries_by_complexity'][complexity] += 1
            self.metrics['avg_retrieval_iterations'].append(result['retrieval_iterations'])
            self.metrics['avg_quality_scores'].append(result['quality_score'])
            
            elapsed = time.time() - start_time
            
            print(f"\n[Monitoring] Query processed in {elapsed:.2f}s")
            
            return result
            
        except Exception as e:
            self.metrics['errors'].append({
                "query": user_query,
                "error": str(e),
                "timestamp": time.time()
            })
            raise
    
    def get_performance_report(self) -> str:
        """
        Generate performance report
        """
        avg_iterations = sum(self.metrics['avg_retrieval_iterations']) / max(1, len(self.metrics['avg_retrieval_iterations']))
        avg_quality = sum(self.metrics['avg_quality_scores']) / max(1, len(self.metrics['avg_quality_scores']))
        
        report = f"""
{'='*60}
AGENTIC RAG PERFORMANCE REPORT
{'='*60}

Query Statistics:
  Total Queries: {self.query_count}
  By Complexity:
    Simple: {self.metrics['queries_by_complexity']['simple']}
    Moderate: {self.metrics['queries_by_complexity']['moderate']}
    Complex: {self.metrics['queries_by_complexity']['complex']}

Retrieval Statistics:
  Total Retrievals: {self.total_retrievals}
  Avg Iterations/Query: {avg_iterations:.1f}
  Avg Retrievals/Query: {self.total_retrievals / max(1, self.query_count):.1f}

Quality Metrics:
  Avg Quality Score: {avg_quality:.2f}

Cache Performance:
  Hits: {self.metrics['cache_hits']}
  Misses: {self.metrics['cache_misses']}
  Hit Rate: {self.metrics['cache_hits'] / max(1, self.metrics['cache_hits'] + self.metrics['cache_misses']) * 100:.1f}%

Errors: {len(self.metrics['errors'])}
        """
        
        return report

# Example
print("Monitored Agentic RAG system ready for production")

## 5. Interview Q&A

### Q1: What's the main difference between traditional RAG and agentic RAG?

**Answer:**
Traditional RAG follows a fixed pipeline: embed query → retrieve top-k documents → generate answer. It's static and doesn't adapt to query complexity.

Agentic RAG gives the LLM agency to:
- **Plan** - Analyze query complexity and create retrieval strategy
- **Route** - Select appropriate data sources dynamically
- **Adapt** - Iteratively retrieve more information if needed
- **Reason** - Perform multi-hop reasoning over retrieved docs
- **Verify** - Self-evaluate answer quality and retry if insufficient

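Trade-off: Agentic RAG costs more and adds latency because each query triggers several LLM calls, but it handles complex, multi-part queries that a single retrieve-then-generate pass tends to miss.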

### Q2: When should you use agentic RAG vs. traditional RAG?

**Answer:**

**Traditional RAG:**
- Simple factual queries with clear answers
- Single data source
- Latency-critical applications
- Cost-constrained scenarios
- High-volume, low-complexity queries (e.g., customer support FAQs)

**Agentic RAG:**
- Complex multi-part questions
- Multiple heterogeneous data sources
- High-stakes decisions requiring verification
- Research and analysis tasks
- Queries requiring reasoning over retrieved information

**Best Practice:** Use complexity detection to route simple queries to traditional RAG and complex queries to agentic RAG.

### Q3: What is Self-RAG and why is it important?

**Answer:**
Self-RAG adds self-reflection steps where the agent evaluates its own output:

1. **Relevance Check** - Are retrieved docs relevant to query?
2. **Support Check** - Is answer supported by retrieved docs?
3. **Factuality Check** - Are there hallucinations?
4. **Completeness Check** - Does answer fully address query?

If evaluation fails, the agent:
- Retrieves different/additional documents
- Reformulates the query
- Regenerates the answer

**Why Important:**
- Can reduce hallucinations by catching unsupported claims before they reach the user
- Improves answer quality automatically
- Essential for high-stakes applications (medical, legal, financial)
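The evaluate-and-retry loop described above can be sketched as a small control loop. This is a prompt-level illustration under stated assumptions, not the method from the Self-RAG paper: the `Evaluation` dataclass, the `self_rag` function, and the four injected callables are hypothetical. In a real system each callable would wrap a retriever or an LLM call, and here they are toy stubs so the loop can run.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Evaluation:
    relevant: bool   # are the retrieved docs relevant to the query?
    supported: bool  # is the answer supported by the docs?
    factual: bool    # is the answer free of hallucinated claims?
    complete: bool   # does the answer fully address the query?

    @property
    def passed(self) -> bool:
        return self.relevant and self.supported and self.factual and self.complete


def self_rag(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str, List[str], str], Evaluation],
    reformulate: Callable[[str, Evaluation], str],
    max_attempts: int = 3,
) -> Dict[str, Any]:
    search_query, answer = query, ""
    for attempt in range(1, max_attempts + 1):
        docs = retrieve(search_query)
        answer = generate(query, docs)
        verdict = evaluate(query, docs, answer)
        if verdict.passed:
            return {"answer": answer, "attempts": attempt, "verified": True}
        search_query = reformulate(search_query, verdict)  # try a different retrieval
    return {"answer": answer, "attempts": max_attempts, "verified": False}


# Toy stubs (assumptions) so the loop can be exercised end to end
CORPUS = {"q3 2024 revenue report": ["Q3 2024 revenue was $10M"]}

def retrieve(q: str) -> List[str]:
    return CORPUS.get(q, [])

def generate(q: str, docs: List[str]) -> str:
    return docs[0] if docs else "I don't know"

def evaluate(q: str, docs: List[str], answer: str) -> Evaluation:
    grounded = answer in docs
    return Evaluation(relevant=bool(docs), supported=grounded, factual=grounded, complete=grounded)

def reformulate(q: str, verdict: Evaluation) -> str:
    return "q3 2024 revenue report"

result = self_rag("q3 revenue", retrieve, generate, evaluate, reformulate)
print(result)
```

Returning `verified: False` after the last attempt, instead of raising, lets the caller decide whether to show a hedged answer or escalate.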

### Q4: Explain multi-hop RAG with an example.

**Answer:**
Multi-hop RAG performs multiple retrieval steps where each step informs the next.

**Example Query:** "How did our Q3 revenue compare to our main competitors?"

**Multi-hop Process:**
1. **Hop 1:** Retrieve our Q3 revenue → "$10M"
2. **Hop 2:** Identify main competitors → "Company X, Company Y"
3. **Hop 3:** Retrieve competitor Q3 revenues → "X: $8M, Y: $12M"
4. **Synthesis:** "We ranked 2nd with $10M, between Y ($12M) and X ($8M)"

Each retrieval step uses information from previous steps. You can't get competitor revenue without first knowing who the competitors are.

**Use Cases:**
- Comparative analysis
- Research requiring background context
- Questions with implicit dependencies
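The dependency between hops in the example can be sketched with a toy knowledge base. The dictionaries and function names below are hypothetical stand-ins for real retrievers, and the figures are the ones from the example above.

```python
from typing import Any, Dict, List

# Toy knowledge base (assumption); values in $M, taken from the example above
REVENUE_Q3 = {"us": 10, "Company X": 8, "Company Y": 12}
COMPETITORS = {"us": ["Company X", "Company Y"]}


def retrieve_revenue(company: str) -> int:
    return REVENUE_Q3[company]


def retrieve_competitors(company: str) -> List[str]:
    return COMPETITORS[company]


def compare_to_competitors(company: str = "us") -> Dict[str, Any]:
    own = retrieve_revenue(company)                          # Hop 1
    rivals = retrieve_competitors(company)                   # Hop 2
    rival_revenue = {r: retrieve_revenue(r) for r in rivals}  # Hop 3 needs Hop 2's output

    # Synthesis: rank all companies by revenue, highest first
    ranking = sorted({company: own, **rival_revenue}.items(), key=lambda kv: kv[1], reverse=True)
    rank = [name for name, _ in ranking].index(company) + 1
    return {"own_revenue": own, "competitor_revenue": rival_revenue, "rank": rank}


result = compare_to_competitors()
print(result)
```

Hop 3 cannot be issued in parallel with Hop 2 because its inputs are Hop 2's outputs, which is why multi-hop queries add latency that parallelism alone cannot remove.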

### Q5: How do you handle multiple data sources in agentic RAG?

**Answer:**

**1. Source Registration:**
- Register each source with description of what it contains
- Examples: vector store (docs), SQL DB (structured data), API (real-time)

**2. Routing Strategy:**
- Use LLM to route query to appropriate source(s)
- Can query multiple sources in parallel
- Example: Financial query → SQL DB, Product query → Vector store

**3. Result Aggregation:**
- Collect results from all queried sources
- Deduplicate and rank by relevance
- Synthesize into coherent answer

**4. Source Attribution:**
- Track which source provided which information
- Include source citations in final answer
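Steps 3 and 4 (aggregation and attribution) can be sketched as below. The `Hit` dataclass, `aggregate`, and `format_citations` are hypothetical names, and the sketch assumes relevance scores are comparable across sources, which usually requires normalization or a reranker in practice.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    source: str   # which registered source produced this result
    text: str
    score: float  # relevance score, assumed comparable across sources


def aggregate(results_by_source: Dict[str, List[Hit]], top_k: int = 3) -> List[Hit]:
    """Deduplicate by normalized text, keep the best-scoring copy, rank by score."""
    best: Dict[str, Hit] = {}
    for hits in results_by_source.values():
        for hit in hits:
            key = " ".join(hit.text.lower().split())  # case/whitespace-insensitive
            if key not in best or hit.score > best[key].score:
                best[key] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:top_k]


def format_citations(hits: List[Hit]) -> List[str]:
    return [f"[{i}] {h.text} (source: {h.source})" for i, h in enumerate(hits, start=1)]


results = {
    "sql_database": [Hit("sql_database", "Q3 2024 revenue was $10M", 0.95)],
    "vector_store": [
        Hit("vector_store", "q3 2024 revenue was $10M", 0.80),
        Hit("vector_store", "Revenue grew on new product launches", 0.70),
    ],
}
top_hits = aggregate(results)
citations = format_citations(top_hits)
print("\n".join(citations))
```

Keeping `source` on every hit is what makes citation possible after the results from different sources have been merged.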

### Q6: What are the main production challenges with agentic RAG?

**Answer:**

**1. Cost Management:**
- Multiple LLM calls per query (planning, routing, generation, reflection)
- Mitigation: Complexity detection, caching, cheaper models for routing

**2. Latency:**
- Iterative retrieval and multi-hop reasoning are slow
- Mitigation: Parallel execution, set iteration limits, async operations

**3. Quality Consistency:**
- Non-deterministic due to LLM decision-making
- Mitigation: Quality thresholds, monitoring, fallback to traditional RAG

**4. Complexity:**
- More moving parts = more failure modes
- Mitigation: Comprehensive monitoring, graceful degradation, circuit breakers

**5. Context Window Management:**
- Multi-hop retrieval can accumulate large context
- Mitigation: Summarization, selective context passing, context pruning
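As one illustration of the graceful-degradation and circuit-breaker mitigations, the sketch below falls back to a traditional pipeline when the agentic one raises, and skips the agentic path entirely after repeated failures. `FallbackRAG` and the two stub pipelines are hypothetical. A production breaker would also need a cool-down or half-open state, which is reduced here to a manual `reset()`.

```python
from typing import Callable, Dict


class FallbackRAG:
    """Try the agentic pipeline; fall back to traditional RAG on errors."""

    def __init__(
        self,
        agentic: Callable[[str], str],
        traditional: Callable[[str], str],
        failure_threshold: int = 3,
    ):
        self.agentic = agentic
        self.traditional = traditional
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def circuit_open(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def reset(self) -> None:
        self.consecutive_failures = 0

    def query(self, user_query: str) -> Dict[str, str]:
        if self.circuit_open:
            # Skip the agentic path entirely while it is considered unhealthy
            return {"answer": self.traditional(user_query), "mode": "circuit_open"}
        try:
            answer = self.agentic(user_query)
        except Exception:
            self.consecutive_failures += 1
            return {"answer": self.traditional(user_query), "mode": "fallback"}
        self.consecutive_failures = 0
        return {"answer": answer, "mode": "agentic"}


# Stub pipelines (assumptions) to exercise the breaker
def flaky_agentic(query: str) -> str:
    raise TimeoutError("retrieval loop exceeded its budget")


def simple_rag(query: str) -> str:
    return f"top-k answer for: {query}"


rag = FallbackRAG(flaky_agentic, simple_rag, failure_threshold=2)
modes = [rag.query("What was our Q3 revenue?")["mode"] for _ in range(3)]
print(modes)
```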

### Q7: How do you evaluate agentic RAG quality?

**Answer:**

**Retrieval Metrics:**
- Retrieval precision (% of retrieved docs that are relevant)
- Retrieval recall (% of all relevant docs that were retrieved)
- Average iterations to sufficiency

**Generation Metrics:**
- Answer relevance (does it address the query?)
- Faithfulness (is answer supported by retrieved docs?)
- Completeness (does it fully answer the query?)
- Citation accuracy (are citations correct?)

**System Metrics:**
- Latency (p50, p95, p99)
- Cost per query
- Cache hit rate
- Error rate

**Evaluation Methods:**
1. **LLM-as-judge** - Use an LLM to score answers against a fixed rubric
2. **Human evaluation** - Gold standard for critical applications
3. **A/B testing** - Compare agentic vs traditional RAG
4. **Regression testing** - Test on known good query-answer pairs
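Two of the metrics above are easy to compute directly once you have labeled data and latency logs. The following is a minimal sketch. The function names are hypothetical, and it assumes you have a set of known-relevant document IDs per query and a list of per-query latencies in seconds.

```python
import statistics
from typing import Dict, List, Set


def precision_recall(retrieved: List[str], relevant: Set[str]) -> Dict[str, float]:
    """Precision: share of retrieved docs that are relevant.
    Recall: share of relevant docs that were retrieved."""
    retrieved_set = set(retrieved)
    true_positives = len(retrieved_set & relevant)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return {"precision": precision, "recall": recall}


def latency_percentiles(latencies_s: List[float]) -> Dict[str, float]:
    """p50/p95/p99 from raw latencies (needs at least two samples)."""
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


scores = precision_recall(["d1", "d2", "d3", "d4"], relevant={"d1", "d2", "d5"})
print(f"precision={scores['precision']:.2f} recall={scores['recall']:.2f}")

percentiles = latency_percentiles([float(x) for x in range(1, 102)])
print(percentiles)
```

Tracking these per complexity bucket, rather than globally, makes it easier to see whether agentic queries are the ones driving tail latency.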

## 6. Key Takeaways

### Agentic RAG Patterns
1. **Query Planning** - Analyze complexity and create retrieval strategy
2. **Routing** - Direct queries to appropriate data sources
3. **Adaptive Retrieval** - Iteratively retrieve until sufficient
4. **Self-RAG** - Self-evaluate and correct answers
5. **Multi-hop** - Chain retrievals for complex reasoning

### When to Use Agentic RAG
- Complex, multi-faceted queries
- Multiple heterogeneous data sources
- High-stakes applications requiring verification
- Research and analytical tasks
- When traditional RAG quality is insufficient

### Production Best Practices
1. **Complexity Detection** - Route simple queries to traditional RAG
2. **Cost Optimization** - Cache aggressively, use cheaper models for routing
3. **Iteration Limits** - Prevent runaway costs and latency
4. **Monitoring** - Track quality, cost, latency by query complexity
5. **Graceful Degradation** - Fall back to traditional RAG on errors

### Trade-offs
| Aspect | Traditional RAG | Agentic RAG |
|--------|----------------|-------------|
| **Latency** | Fast (~1s) | Slow (~5-15s) |
| **Cost** | Low | High (3-10x) |
| **Quality** | Good for simple | Excellent for complex |
| **Reliability** | More predictable (fixed pipeline) | Less predictable (LLM-driven control flow) |
| **Complexity** | Simple | Complex |

## Next Steps

1. **Explore Framework Comparison** - See how different frameworks implement these patterns
2. **Build Production System** - Apply these patterns to your use case
3. **Implement Monitoring** - Track quality and cost metrics
4. **Optimize Performance** - Cache, parallelize, and use appropriate models

## Additional Resources

- [Self-RAG Paper](https://arxiv.org/abs/2310.11511)
- [Corrective RAG (CRAG)](https://arxiv.org/abs/2401.15884)
- [LangChain RAG Patterns](https://python.langchain.com/docs/use_cases/question_answering/)
- [LlamaIndex Advanced RAG](https://docs.llamaindex.ai/en/stable/examples/query_engine/)
- [Anthropic RAG Best Practices](https://www.anthropic.com/index/claude-2-1-retrieval)