# Module 14: Agentic RAG (Capstone)

This capstone module combines the two halves of the course, RAG (Modules 9-11) and agents (Modules 12-13), into **Agentic RAG** systems.

**What you'll build:**
1. Retrieval as an agent tool (agent decides WHEN to retrieve)
2. Multi-source retrieval (agent chooses WHICH source)
3. Corrective RAG (CRAG) - evaluate and re-retrieve if needed
4. Self-RAG - decide whether retrieval is needed, then check the answer against its sources
5. Full agentic RAG pipeline

**Why Agentic RAG?**

Static RAG pipelines always retrieve, always use the same source, and never check quality. Agentic RAG adds intelligence:

```
Static RAG:   Question → Always Retrieve → Always Generate
Agentic RAG:  Question → Decide IF retrieval needed → Choose WHICH source → 
              Retrieve → Check quality → Re-retrieve if bad → Generate → Self-check
```

**Prerequisites:** Modules 9-13

## 1. Setup

In [None]:
!pip install -q openai python-dotenv chromadb sentence-transformers langchain langchain-openai langgraph

In [None]:
import os
import json
from typing import TypedDict, List, Optional

from dotenv import load_dotenv
load_dotenv()  # looks for a .env file that defines OPENAI_API_KEY

from openai import OpenAI
import chromadb
from sentence_transformers import SentenceTransformer

client = OpenAI()
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

print("Setup complete!")

## 2. Build the Knowledge Base

In [None]:
# Knowledge base about AI/ML topics
documents = [
    {"id": "doc_0", "text": "Neural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes that process information. Deep learning uses neural networks with many hidden layers to learn hierarchical representations of data. Common architectures include CNNs for images and RNNs for sequences.", "source": "textbook"},
    {"id": "doc_1", "text": "The Transformer architecture, introduced in 'Attention Is All You Need' (2017), relies entirely on self-attention mechanisms and processes all positions in parallel. It consists of encoder and decoder stacks with multi-head attention and feed-forward layers. Transformers are the foundation of BERT, GPT, and T5.", "source": "textbook"},
    {"id": "doc_2", "text": "BERT uses masked language modeling for pre-training, randomly masking 15% of tokens and predicting them. It processes text bidirectionally, making it excellent for understanding tasks like classification and NER. BERT-base has 110M parameters.", "source": "textbook"},
    {"id": "doc_3", "text": "GPT models use autoregressive language modeling, predicting the next token left-to-right. GPT-3 has 175B parameters and demonstrated few-shot learning. GPT-4 is multimodal. The GPT family uses decoder-only transformer architecture.", "source": "textbook"},
    {"id": "doc_4", "text": "Retrieval-Augmented Generation (RAG) combines a retriever with a generator to ground LLM outputs in retrieved evidence. RAG addresses hallucination and knowledge cutoff. The retriever uses dense embeddings and vector similarity search.", "source": "textbook"},
    {"id": "doc_5", "text": "Vector databases like ChromaDB, Pinecone, and FAISS store high-dimensional embeddings and support efficient similarity search. ChromaDB is lightweight and local, Pinecone is cloud-native, and FAISS handles billion-scale search.", "source": "textbook"},
    {"id": "doc_6", "text": "Prompt engineering techniques include zero-shot prompting, few-shot with examples, chain-of-thought reasoning, and role prompting. Structured output can be enforced through system messages. Good prompts are specific, provide context, and include examples.", "source": "textbook"},
    {"id": "doc_7", "text": "AI agents are autonomous systems that use LLMs to reason, plan, and take actions through tools. The ReAct pattern interleaves reasoning with actions. Multi-agent systems coordinate specialized agents. LangGraph models agent workflows as state machines.", "source": "textbook"},
    {"id": "doc_8", "text": "Fine-tuning adapts pre-trained models to specific tasks. LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices, reducing compute by 10-100x. QLoRA combines quantization with LoRA for even more efficiency.", "source": "textbook"},
    {"id": "doc_9", "text": "RLHF (Reinforcement Learning from Human Feedback) aligns LLMs with human preferences through supervised fine-tuning, reward model training, and PPO optimization. DPO offers a simpler alternative without a separate reward model.", "source": "textbook"},
    {"id": "doc_10", "text": "Evaluation metrics for RAG include precision@k, recall@k, MRR for retrieval, and faithfulness and relevance for generation. The RAGAS framework automates RAG evaluation. Human evaluation remains the gold standard.", "source": "textbook"},
    {"id": "doc_11", "text": "Chunking strategies for RAG include fixed-size splitting, sentence-based chunking, recursive character splitting, and semantic chunking. Optimal chunk size balances context with precision, typically 200-500 tokens. Overlap between chunks preserves continuity.", "source": "textbook"},
    {"id": "doc_12", "text": "Embedding models like Sentence-BERT, E5, and OpenAI's text-embedding-3-small convert text into dense vectors capturing semantic meaning. They are trained with contrastive learning. The choice of embedding model significantly affects retrieval quality.", "source": "textbook"},
    {"id": "doc_13", "text": "Attention mechanisms allow models to focus on relevant parts of the input. Self-attention computes query-key-value interactions within a sequence. Multi-head attention runs multiple attention computations in parallel, each learning different relationship patterns.", "source": "textbook"},
    {"id": "doc_14", "text": "Diffusion models generate images by learning to reverse a gradual noising process. Models like Stable Diffusion and DALL-E 3 produce high-quality images from text prompts. They have largely replaced GANs due to better training stability.", "source": "textbook"}
]

# Embed and store
chroma_client = chromadb.Client()
# get_or_create_collection avoids an error when this cell is re-run in the same session
kb_collection = chroma_client.get_or_create_collection(name="ai_knowledge_base", metadata={"hnsw:space": "cosine"})

texts = [doc["text"] for doc in documents]
doc_ids = [doc["id"] for doc in documents]
embeddings = embedding_model.encode(texts).tolist()

kb_collection.add(
    documents=texts,
    embeddings=embeddings,
    ids=doc_ids,
    metadatas=[{"source": doc["source"]} for doc in documents]
)

print(f"Knowledge base loaded: {kb_collection.count()} documents")
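
The collection above is created with `hnsw:space` set to `cosine`, so the distances returned by a query should be cosine distances: 1 minus the cosine similarity, with smaller values meaning closer matches. The sketch below shows that computation on toy 2-D vectors in pure Python. The helper names `cosine_distance` and `top_k` are illustrative assumptions and are not part of ChromaDB, which uses an approximate index rather than this brute-force scan.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, vectors, k=3):
    """Brute-force nearest neighbours: (doc_id, distance) pairs, closest first."""
    scored = sorted((cosine_distance(query, vec), doc_id) for doc_id, vec in vectors.items())
    return [(doc_id, dist) for dist, doc_id in scored[:k]]

# Toy "embeddings" (hypothetical data, for illustration only)
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.0], toy_vectors, k=2))
```

A distance of 0 means the vectors point in the same direction, 1 means they are orthogonal, and 2 means they point in opposite directions.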

## 3. Helper Functions

In [None]:
def llm_call(messages, temperature=0.0, model="gpt-4o-mini"):
    """Call OpenAI chat completions."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature
    )
    return response.choices[0].message.content

def kb_search(query, k=3):
    """Search the knowledge base."""
    query_emb = embedding_model.encode([query]).tolist()
    results = kb_collection.query(query_embeddings=query_emb, n_results=k)
    return results["documents"][0], results["ids"][0], results["distances"][0]

def mock_web_search(query):
    """Simulated web search (returns hardcoded results for demo)."""
    web_results = {
        "latest": "As of 2024, the latest developments in AI include multimodal models like GPT-4o, open-source models like Llama 3, and advances in AI agents and reasoning.",
        "news": "Recent AI news: OpenAI released GPT-4o with native multimodal capabilities. Anthropic launched Claude 3.5 Sonnet. Meta open-sourced Llama 3.1 405B.",
        "default": f"Web search results for '{query}': This information is current as of 2024 and may include recent developments not covered in the knowledge base."
    }
    for key in web_results:
        if key in query.lower():
            return web_results[key]
    return web_results["default"]

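def calculator(expression):
    """Calculator for basic arithmetic; accepts only digits, operators, and parentheses."""
    allowed = set('0123456789+-*/.() ')
    if not all(c in allowed for c in expression):
        return "Error: Invalid expression"
    try:
        # The character whitelist blocks names and calls; builtins are removed as defense in depth
        return str(eval(expression, {"__builtins__": {}}, {}))
    except (SyntaxError, ArithmeticError, TypeError, ValueError) as exc:
        # Malformed input such as "2 +" or "1/0" returns an error string instead of crashing
        return f"Error: {exc}"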

print("Helper functions ready!")
print(f"KB search test: {kb_search('transformers', k=1)[0][0][:60]}...")
print(f"Calculator test: 2**10 = {calculator('2**10')}")
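
Filtering characters before calling `eval` is a lightweight guard. An alternative is to parse the expression and walk its syntax tree, allowing only numeric literals and arithmetic operators. The sketch below uses the standard-library `ast` module. The names `safe_eval` and `calculator_ast` and the `max_exponent` cap are assumptions introduced for illustration, not part of the notebook's code.

```python
import ast
import operator

# Whitelisted operators; any other node type is rejected
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_eval(expression, max_exponent=1000):
    """Evaluate an arithmetic expression by walking its AST."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept plain ints and floats only (bools and strings are rejected)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            left, right = _eval(node.left), _eval(node.right)
            # Cap exponents so inputs like 9**9**9 cannot exhaust time or memory
            if isinstance(node.op, ast.Pow) and abs(right) > max_exponent:
                raise ValueError("exponent too large")
            return _BIN_OPS[type(node.op)](left, right)
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression.strip(), mode="eval"))

def calculator_ast(expression):
    """Drop-in style wrapper that returns a string, like the calculator tool."""
    try:
        return str(safe_eval(expression))
    except (ValueError, SyntaxError, ArithmeticError) as exc:
        return f"Error: {exc}"

print(calculator_ast("2**10"))
print(calculator_ast("__import__('os')"))
```

Because the walker only recognizes literals and a fixed set of operators, function calls and attribute access fail with an error string rather than being executed.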

---

## 4. Retrieval as an Agent Tool

The key insight of agentic RAG: **the agent decides WHEN to retrieve**. Not every question needs retrieval - some can be answered from the LLM's parametric knowledge.

We'll use OpenAI function calling to let the LLM decide which tools to use.

In [None]:
# Define tools for function calling
tools = [
    {
        "type": "function",
        "function": {
            "name": "knowledge_base_search",
            "description": "Search the AI/ML knowledge base for information about neural networks, transformers, RAG, embeddings, agents, and related topics. Use this when the question is about AI/ML concepts covered in the course.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "The search query"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for recent or current information. Use this for questions about latest news, current events, or topics not in the knowledge base.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "The search query"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Perform mathematical calculations. Use for any arithmetic, percentages, or numerical computations.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string", "description": "Math expression (e.g., '2 * 3 + 4')"}
                },
                "required": ["expression"]
            }
        }
    }
]

def execute_tool(tool_name, arguments):
    """Execute a tool and return the result."""
    if tool_name == "knowledge_base_search":
        docs, ids, dists = kb_search(arguments["query"], k=3)
        return "\n\n".join(docs)
    elif tool_name == "web_search":
        return mock_web_search(arguments["query"])
    elif tool_name == "calculator":
        return calculator(arguments["expression"])
    return "Unknown tool"

def agentic_rag(question, verbose=True):
    """
    Agentic RAG: LLM decides which tools to use.
    """
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant with access to tools. Use tools when needed to answer questions accurately. You can use multiple tools if needed. If you can answer from your own knowledge confidently, you don't need to use tools."},
        {"role": "user", "content": question}
    ]
    
    if verbose:
        print(f"Question: {question}")
    
    # LLM decides what to do
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,
        temperature=0.0
    )
    
    msg = response.choices[0].message
    
    # If LLM wants to use tools
    if msg.tool_calls:
        messages.append(msg)
        
        for tool_call in msg.tool_calls:
            tool_name = tool_call.function.name
            arguments = json.loads(tool_call.function.arguments)
            
            if verbose:
                print(f"  Tool: {tool_name}({arguments})")
            
            result = execute_tool(tool_name, arguments)
            
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result
            })
        
        # Get final answer with tool results
        final_response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            temperature=0.0
        )
        answer = final_response.choices[0].message.content
    else:
        if verbose:
            print("  No tools used (answered from parametric knowledge)")
        answer = msg.content
    
    if verbose:
        print(f"  Answer: {answer[:200]}...\n")
    
    return answer

# Test with different types of questions
test_questions = [
    "What is the transformer architecture?",          # Should use KB
    "What are the latest AI developments in 2024?",   # Should use web
    "What is 175 billion divided by 1000?",            # Should use calculator
    "What is the capital of France?",                  # Should answer directly
]

for q in test_questions:
    agentic_rag(q)
    print("-" * 60)
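
`agentic_rag` performs a single round of tool calls before producing its final answer. One possible extension is a bounded loop that keeps calling tools until the model stops requesting them. The sketch below shows only the control flow, with the model injected as a callable so it runs without an API key. The reply shape (a dict with `content` and `tool_calls`) is a simplified, hypothetical stand-in for the SDK's response objects, and `run_tool_loop` is an illustrative name.

```python
import json

def run_tool_loop(call_model, tool_registry, messages, max_rounds=3):
    """
    Repeatedly call the model and execute requested tools.

    call_model(messages) must return a dict (hypothetical shape):
      {"content": str or None,
       "tool_calls": [{"id": str, "name": str, "arguments": "<JSON string>"}] or None}
    """
    for _ in range(max_rounds):
        reply = call_model(messages)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply.get("content") or ""

        # Record the assistant turn so the tool results can refer back to it
        messages.append({"role": "assistant", "content": reply.get("content"), "tool_calls": tool_calls})

        for call in tool_calls:
            try:
                arguments = json.loads(call["arguments"])
                result = tool_registry[call["name"]](**arguments)
            except (json.JSONDecodeError, KeyError, TypeError) as exc:
                # Report the failure to the model instead of crashing the loop
                result = f"Tool error: {exc!r}"
            messages.append({"role": "tool", "tool_call_id": call["id"], "content": str(result)})

    return "Stopped: tool-call budget exhausted."

# Usage with a scripted fake model (no network access needed)
script = iter([
    {"content": None, "tool_calls": [{"id": "c1", "name": "calculator", "arguments": '{"expression": "2+2"}'}]},
    {"content": "The answer is 4.", "tool_calls": None},
])
history = [{"role": "user", "content": "What is 2+2?"}]
print(run_tool_loop(lambda msgs: next(script), {"calculator": lambda expression: "4"}, history))
```

The `max_rounds` cap is a design choice: without it, a model that keeps requesting tools would loop indefinitely.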

## Exercise 1: Agent with 3 Tools - Autonomous Selection

**Your Task:** Test the agentic RAG system with 5 diverse questions that require different tools, and analyze the agent's tool selection behavior.

**Steps:**
1. Create 5 questions: 2 requiring KB search, 1 requiring web, 1 requiring calculator, 1 requiring no tools
2. Run each through `agentic_rag` with verbose=True
3. Track which tools were selected for each question

In [None]:
# TODO: Create 5 diverse test questions and analyze tool selection

test_questions = [
    None,  # Your code: KB search question
    None,  # Your code: KB search question
    None,  # Your code: Web search question
    None,  # Your code: Calculator question
    None,  # Your code: No-tool question
]

# Run and analyze
for q in test_questions:
    if q:
        answer = agentic_rag(q)
        print("-" * 60)

### Solution for Exercise 1

In [None]:
test_questions = [
    "How does RAG address the hallucination problem in LLMs?",
    "What are the different chunking strategies used in RAG systems?",
    "What is the latest news about open-source AI models?",
    "If a model has 175 billion parameters and each parameter uses 2 bytes (float16), how many gigabytes of memory does it need?",
    "What does the acronym API stand for?"
]

for q in test_questions:
    answer = agentic_rag(q)
    print("-" * 60)

---

## 5. Query Routing

Before retrieval, we can classify the query to route it to the most appropriate source.

```
Question → Classify → Route:
  ├─ "factual_kb"    → Search knowledge base
  ├─ "recent_events" → Search the web  
  ├─ "calculation"   → Use calculator
  └─ "general"       → Answer from LLM knowledge
```
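
Routing of this kind depends on the classifier returning an exact label, and LLM output can drift in case, spacing, or punctuation. The sketch below normalizes a raw label and dispatches to a handler, falling back to `general` for anything unrecognized. `normalize_category`, `route`, and the stub handlers are hypothetical names introduced for illustration.

```python
import re

ROUTES = ("factual_kb", "recent_events", "calculation", "general")

def normalize_category(raw, default="general"):
    """Map a raw classifier response onto one of the known route names."""
    # Lowercase and collapse every run of non-letters into a single underscore
    cleaned = re.sub(r"[^a-z]+", "_", raw.lower())
    for route_name in ROUTES:
        if route_name in cleaned:
            return route_name
    return default

def route(question, raw_label, handlers):
    """Dispatch the question to the handler registered for its category."""
    return handlers[normalize_category(raw_label)](question)

# Stub handlers that only report which branch ran
handlers = {name: (lambda q, n=name: f"{n}: {q}") for name in ROUTES}

print(route("How does self-attention work?", "Factual_KB.", handlers))
print(route("What is 256 * 768?", "Category: calculation", handlers))
print(route("Tell me a joke", "unsure", handlers))
```

Defaulting to `general` means an unparseable label degrades to a plain LLM answer instead of raising an error.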

In [None]:
def classify_query(question):
    """Classify query to determine routing."""
    response = llm_call([
        {"role": "system", "content": """Classify the user's question into exactly one category. Respond with ONLY the category name.

Categories:
- factual_kb: Questions about AI/ML concepts, neural networks, transformers, RAG, embeddings, agents
- recent_events: Questions about recent news, latest developments, current events
- calculation: Questions requiring mathematical computation
- general: General knowledge questions not about AI/ML"""},
        {"role": "user", "content": question}
    ])
    return response.strip().lower()

def routed_rag(question, verbose=True):
    """RAG with explicit query routing."""
    category = classify_query(question)
    
    if verbose:
        print(f"Q: {question}")
        print(f"  Route: {category}")
    
    if category == "factual_kb":
        docs, _, _ = kb_search(question, k=3)
        context = "\n\n".join(docs)
        answer = llm_call([
            {"role": "system", "content": "Answer based on the provided context. Be concise and accurate."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ])
    elif category == "recent_events":
        web_result = mock_web_search(question)
        answer = llm_call([
            {"role": "system", "content": "Answer based on these web search results."},
            {"role": "user", "content": f"Search results:\n{web_result}\n\nQuestion: {question}"}
        ])
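    elif category == "calculation":
        # Let the LLM extract a bare expression, then evaluate it with the calculator tool
        expression = llm_call([
            {"role": "system", "content": "Extract the math expression from the question. Return ONLY the expression, using digits and the symbols + - * / ( ), with no words or commas."},
            {"role": "user", "content": question}
        ])
        answer = calculator(expression.strip())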
    else:
        answer = llm_call([
            {"role": "user", "content": question}
        ])
    
    if verbose:
        print(f"  Answer: {answer[:150]}...\n")
    return answer, category

# Test routing
routing_tests = [
    "How does self-attention work in transformers?",
    "What AI models were released this year?",
    "What is 256 * 768 * 12?",
    "What is the population of Tokyo?"
]

for q in routing_tests:
    routed_rag(q)
    print("-" * 50)

---

## 6. Corrective RAG (CRAG)

**CRAG** adds a critical step: after retrieval, **evaluate whether the retrieved documents are actually relevant** before generating an answer.

```
Question → Retrieve → Evaluate Relevance
  ├─ RELEVANT     → Generate answer from context
  ├─ AMBIGUOUS    → Re-retrieve with refined query, then generate
  └─ NOT RELEVANT → Fall back to web search or parametric knowledge
```

This prevents the model from generating answers based on irrelevant retrieved context.
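
The corrective decision can be isolated as a small pure function that maps per-document grades to the next action, which makes it easy to test without any LLM calls. The sketch below assumes the three grader labels RELEVANT, PARTIAL, and IRRELEVANT and a threshold of two usable documents. `Action`, `parse_grade`, and `choose_action` are illustrative names, not an established API.

```python
import re
from enum import Enum

class Action(Enum):
    GENERATE = "generate"          # enough usable context
    REFINE = "refine_and_retry"    # some signal: rewrite the query and re-retrieve
    FALLBACK = "fallback"          # nothing usable: web search or parametric knowledge

VALID_GRADES = {"RELEVANT", "PARTIAL", "IRRELEVANT"}

def parse_grade(raw):
    """Read the first word of the grader's reply; treat anything unexpected as IRRELEVANT."""
    tokens = re.findall(r"[A-Z]+", raw.upper())
    return tokens[0] if tokens and tokens[0] in VALID_GRADES else "IRRELEVANT"

def choose_action(grades, min_usable=2):
    """Pick the corrective action from a list of normalized grades."""
    usable = sum(grade in ("RELEVANT", "PARTIAL") for grade in grades)
    if usable >= min_usable:
        return Action.GENERATE
    if usable >= 1:
        return Action.REFINE
    return Action.FALLBACK

raw_replies = ["Relevant.", "partial", "IRRELEVANT"]
print(choose_action([parse_grade(r) for r in raw_replies]))
```

Matching on the first whole word avoids a subtle trap: a substring test for "RELEVANT" would also match "IRRELEVANT".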

In [None]:
def evaluate_relevance(question, document):
    """Use LLM to judge if a document is relevant to the question."""
    response = llm_call([
        {"role": "system", "content": """You are a relevance evaluator. Given a question and a document, determine if the document contains information useful for answering the question.

Respond with ONLY one word:
- RELEVANT: The document directly addresses the question
- PARTIAL: The document is somewhat related but doesn't fully answer
- IRRELEVANT: The document is not related to the question"""},
        {"role": "user", "content": f"Question: {question}\n\nDocument: {document}"}
    ])
    return response.strip().upper()

def corrective_rag(question, verbose=True):
    """
    CRAG: Retrieve → Evaluate → Correct if needed → Generate
    """
    if verbose:
        print(f"Q: {question}")
    
    # Step 1: Initial retrieval
    docs, ids, dists = kb_search(question, k=3)
    
    if verbose:
        print(f"  [Retrieve] Got {len(docs)} documents")
    
    # Step 2: Evaluate relevance of each document
    relevant_docs = []
    for doc, doc_id in zip(docs, ids):
        relevance = evaluate_relevance(question, doc)
        if verbose:
            print(f"  [Evaluate] {doc_id}: {relevance}")
        if relevance in ["RELEVANT", "PARTIAL"]:
            relevant_docs.append(doc)
    
    # Step 3: Corrective action based on evaluation
    if len(relevant_docs) >= 2:
        # Good retrieval - proceed with generation
        if verbose:
            print(f"  [Correct] Sufficient relevant docs ({len(relevant_docs)}) - generating answer")
        context = "\n\n".join(relevant_docs)
        source = "knowledge_base"
    elif len(relevant_docs) == 1:
        # Partial - try to supplement with a refined query
        if verbose:
            print(f"  [Correct] Only 1 relevant doc - refining query and re-retrieving")
        refined_query = llm_call([
            {"role": "system", "content": "Rephrase this question to be more specific for a search engine. Return only the rephrased question."},
            {"role": "user", "content": question}
        ])
        extra_docs, _, _ = kb_search(refined_query, k=2)
        # Skip documents that the first retrieval already returned
        relevant_docs.extend(d for d in extra_docs if d not in relevant_docs)
        context = "\n\n".join(relevant_docs)
        source = "knowledge_base (refined)"
    else:
        # No relevant docs - fall back to web search
        if verbose:
            print(f"  [Correct] No relevant docs - falling back to web search")
        web_result = mock_web_search(question)
        context = web_result
        source = "web_search"
    
    # Step 4: Generate answer
    answer = llm_call([
        {"role": "system", "content": "Answer the question based on the provided context. Be concise."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ])
    
    if verbose:
        print(f"  [Source] {source}")
        print(f"  [Answer] {answer[:200]}")
    
    return answer, source

# Test CRAG with different types of questions
print("=" * 60)
corrective_rag("What is the difference between BERT and GPT architectures?")
print("\n" + "=" * 60)
corrective_rag("What is the weather like in New York today?")  # Should fall back

## Exercise 2: Corrective RAG with Re-retrieval

**Your Task:** Extend the CRAG pipeline to handle the case where initial retrieval returns partially relevant results by re-retrieving with a decomposed query.

**Steps:**
1. Retrieve documents for the question
2. Evaluate each document's relevance
3. If results are poor: decompose the question into sub-questions, retrieve for each
4. Combine all relevant documents and generate

In [None]:
def corrective_rag_v2(question, verbose=True):
    """
    Enhanced CRAG with query decomposition for re-retrieval.
    """
    # TODO: Implement enhanced CRAG
    
    # 1. Initial retrieval
    docs, ids, dists = None, None, None  # Your code
    
    # 2. Evaluate relevance
    relevant_docs = []  # Your code: filter relevant docs
    
    # 3. If insufficient relevant docs, decompose query
    if len(relevant_docs) < 2:
        # Decompose into sub-questions
        sub_questions = None  # Your code: use LLM to break question into parts
        
        # Re-retrieve for each sub-question
        pass  # Your code
    
    # 4. Generate answer from all collected context
    answer = None  # Your code
    
    return answer

# Test
# corrective_rag_v2("Compare how BERT and GPT handle attention masking and what that means for their use cases")

### Solution for Exercise 2

In [None]:
def corrective_rag_v2(question, verbose=True):
    """Enhanced CRAG with query decomposition - SOLUTION."""
    if verbose:
        print(f"Q: {question}")
    
    # 1. Initial retrieval
    docs, ids, dists = kb_search(question, k=3)
    if verbose:
        print(f"  [Retrieve] Got {len(docs)} docs: {ids}")
    
    # 2. Evaluate relevance
    relevant_docs = []
    for doc, doc_id in zip(docs, ids):
        rel = evaluate_relevance(question, doc)
        if verbose:
            print(f"  [Evaluate] {doc_id}: {rel}")
        if rel in ["RELEVANT", "PARTIAL"]:
            relevant_docs.append(doc)
    
    # 3. If insufficient, decompose and re-retrieve
    if len(relevant_docs) < 2:
        if verbose:
            print(f"  [Correct] Only {len(relevant_docs)} relevant - decomposing query")
        
        sub_q_response = llm_call([
            {"role": "system", "content": "Break this complex question into 2-3 simpler sub-questions. Return each on a new line, nothing else."},
            {"role": "user", "content": question}
        ])
        sub_questions = [q.strip().lstrip('0123456789.-) ') for q in sub_q_response.strip().split('\n') if q.strip()]
        
        if verbose:
            print(f"  [Decompose] Sub-questions: {sub_questions}")
        
        for sub_q in sub_questions:
            sub_docs, _, _ = kb_search(sub_q, k=2)
            for doc in sub_docs:
                if doc not in relevant_docs:
                    relevant_docs.append(doc)
        
        if verbose:
            print(f"  [Re-retrieve] Now have {len(relevant_docs)} documents")
    
    # 4. Generate
    if not relevant_docs:
        context = mock_web_search(question)
        source = "web_fallback"
    else:
        context = "\n\n".join(relevant_docs[:5])
        source = "knowledge_base"
    
    answer = llm_call([
        {"role": "system", "content": "Answer the question based on the provided context. Be thorough."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ])
    
    if verbose:
        print(f"  [Source] {source}")
        print(f"  [Answer] {answer[:300]}")
    
    return answer

print("=" * 60)
corrective_rag_v2("Compare how BERT and GPT handle attention and what chunking strategies work best for RAG")

---

## 7. Self-RAG

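**Self-RAG** goes further than CRAG: the model first decides whether retrieval is needed at all, and after generating it checks whether its answer is faithful to the retrieved sources. The `self_rag` function in this section implements the retrieval branch with its faithfulness loop; the confidence check on the no-retrieval branch is left as an extension.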

```
Question → [Need retrieval?]
  ├─ YES → Retrieve → Generate → [Is answer faithful?]
  │                                 ├─ YES → Return answer
  │                                 └─ NO  → Regenerate with better prompt
  └─ NO  → Generate from knowledge → [Is answer confident?]
                                       ├─ YES → Return answer
                                       └─ NO  → Retrieve and try again
```
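
The full diagram, including the confidence check on the no-retrieval branch, can be sketched as plain control flow with every model-dependent step injected as a callable. All function and parameter names below (`self_rag_flow`, `is_confident`, the `strict` flag) are assumptions made for illustration. In practice each callable would wrap an LLM or retrieval call.

```python
def self_rag_flow(question, needs_retrieval, retrieve, generate,
                  is_faithful, is_confident, max_retries=2):
    """
    Return (answer, grounded), where grounded is True only if the answer
    passed the faithfulness check against retrieved context.
    """
    def grounded_answer():
        context = retrieve(question)
        answer = ""
        for attempt in range(max_retries + 1):
            # Retries ask for stricter adherence to the context
            answer = generate(question, context, strict=attempt > 0)
            if is_faithful(context, answer):
                return answer, True
        return answer, False  # best effort: never passed the check

    if needs_retrieval(question):
        return grounded_answer()

    # NO branch: answer directly, then retrieve only if the answer looks shaky
    answer = generate(question, None, strict=False)
    if is_confident(question, answer):
        return answer, False
    return grounded_answer()

# Stubbed usage: the direct answer is not confident, so the flow falls back to retrieval
def fake_generate(question, context, strict):
    if context is None:
        return "a guess"
    return "grounded answer" if strict else "embellished answer"

print(self_rag_flow(
    "What is LoRA?",
    needs_retrieval=lambda q: False,
    retrieve=lambda q: "LoRA trains small adapter matrices.",
    generate=fake_generate,
    is_faithful=lambda ctx, ans: ans == "grounded answer",
    is_confident=lambda q, ans: False,
))
```

Returning the `grounded` flag alongside the answer lets the caller distinguish a verified answer from a best-effort one.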

In [None]:
def needs_retrieval(question):
    """Decide if a question needs external retrieval."""
    response = llm_call([
        {"role": "system", "content": """Determine if this question requires looking up specific information from a knowledge base about AI/ML, or if it can be answered from general knowledge.

Respond with ONLY one word:
- RETRIEVE: Needs specific facts, definitions, or technical details about AI/ML
- GENERATE: Can be answered from general knowledge"""},
        {"role": "user", "content": question}
    ])
    return response.strip().upper() == "RETRIEVE"

def check_faithfulness(question, context, answer):
    """Check if the answer is faithful to the retrieved context."""
    response = llm_call([
        {"role": "system", "content": """Evaluate if the answer is faithful to (supported by) the provided context. The answer should not contain claims that aren't in the context.

Respond with ONLY one word:
- FAITHFUL: Answer is fully supported by context
- UNFAITHFUL: Answer contains claims not in context"""},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer}"}
    ])
    return response.strip().upper() == "FAITHFUL"

def self_rag(question, max_retries=2, verbose=True):
    """
    Self-RAG: Decide when to retrieve, self-check generation.
    """
    if verbose:
        print(f"Q: {question}")
    
    # Step 1: Decide if retrieval is needed
    should_retrieve = needs_retrieval(question)
    if verbose:
        print(f"  [Decision] {'RETRIEVE' if should_retrieve else 'GENERATE directly'}")
    
    if should_retrieve:
        docs, ids, _ = kb_search(question, k=3)
        context = "\n\n".join(docs)
        
        for attempt in range(max_retries + 1):
            # Generate answer
            if attempt == 0:
                prompt = f"Context:\n{context}\n\nQuestion: {question}"
            else:
                prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nIMPORTANT: Only use information from the context above. Do not add any information not present in the context."
            
            answer = llm_call([
                {"role": "system", "content": "Answer based strictly on the provided context."},
                {"role": "user", "content": prompt}
            ])
            
            # Check faithfulness
            is_faithful = check_faithfulness(question, context, answer)
            if verbose:
                print(f"  [Attempt {attempt+1}] Faithful: {is_faithful}")
            
            if is_faithful:
                break
        
        if verbose:
            print(f"  [Answer] {answer[:200]}")
        return answer
    else:
        # Generate without retrieval
        answer = llm_call([
            {"role": "user", "content": question}
        ])
        if verbose:
            print(f"  [Answer] {answer[:200]}")
        return answer

# Test Self-RAG
print("=" * 60)
self_rag("What is the difference between encoder-only and decoder-only transformers?")
print("\n" + "=" * 60)
self_rag("What color is the sky?")  # Should not retrieve

---

## 8. Full Agentic RAG Pipeline with State Machine

Now let's combine everything into a complete pipeline modeled as a state machine.

```
States:
  classify → route → retrieve → evaluate → generate → check → END
                  ↘ web_search ↗         ↘ re_retrieve ↗
```
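
One way to read the diagram is as a table of node functions, each of which updates a shared state dict and returns the name of the next node. The sketch below is a minimal plain-Python state machine under that assumption. The `route` step is folded into `classify`, and every node is a stub standing in for an LLM or retrieval call. The `kb` dict, `new_state`, and the node bodies are hypothetical.

```python
END = "END"

def new_state(question, kb):
    return {"question": question, "query": question, "kb": kb,
            "docs": [], "retries": 0, "answer": None, "trace": []}

def classify(state):
    # Stub router: a keyword test stands in for an LLM classifier
    return "web_search" if "latest" in state["question"].lower() else "retrieve"

def retrieve(state):
    state["docs"] = state["kb"].get(state["query"], [])
    return "evaluate"

def web_search(state):
    state["docs"] = ["(stub web result)"]
    return "generate"

def evaluate(state):
    if state["docs"]:
        return "generate"
    # Allow one re-retrieval, then give up on the knowledge base
    return "re_retrieve" if state["retries"] < 1 else "web_search"

def re_retrieve(state):
    state["retries"] += 1
    state["query"] = state["query"].lower().rstrip("?")  # stand-in for an LLM query rewrite
    return "retrieve"

def generate(state):
    state["answer"] = f"Based on {len(state['docs'])} doc(s): {state['docs'][0]}"
    return "check"

def check(state):
    return END  # a real check would verify faithfulness and possibly loop back

NODES = {"classify": classify, "retrieve": retrieve, "web_search": web_search,
         "evaluate": evaluate, "re_retrieve": re_retrieve,
         "generate": generate, "check": check}

def run_state_machine(nodes, start, state, max_steps=20):
    """Follow transitions until END, recording each visited node in state['trace']."""
    current = start
    for _ in range(max_steps):
        if current == END:
            return state
        state["trace"].append(current)
        current = nodes[current](state)
    raise RuntimeError("state machine did not reach END")

result = run_state_machine(NODES, "classify",
                           new_state("What is RAG?", {"what is rag": ["RAG grounds answers in retrieved text."]}))
print(result["trace"])
print(result["answer"])
```

Keeping transitions inside the node functions keeps the runner generic, and the `max_steps` guard prevents a faulty transition from looping forever.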

In [None]:
class AgenticRAGPipeline:
    """
    Full agentic RAG pipeline combining routing, retrieval,
    evaluation, correction, and self-checking.
    """
    
    def __init__(self):
        self.trace = []  # Record of all steps taken
    
    def _log(self, step, detail):
        self.trace.append({"step": step, "detail": detail})
    
    def classify(self, question):
        """Classify the query type."""
        category = classify_query(question)
        self._log("classify", f"Category: {category}")
        return category
    
    def retrieve(self, question, k=3):
        """Retrieve from knowledge base."""
        docs, ids, dists = kb_search(question, k=k)
        self._log("retrieve", f"Got {len(docs)} docs: {ids}")
        return docs, ids
    
    def evaluate(self, question, docs, ids):
        """Evaluate retrieved document relevance."""
        relevant = []
        for doc, doc_id in zip(docs, ids):
            rel = evaluate_relevance(question, doc)
            self._log("evaluate", f"{doc_id}: {rel}")
            if rel in ["RELEVANT", "PARTIAL"]:
                relevant.append(doc)
        return relevant
    
    def generate(self, question, context):
        """Generate answer from context."""
        answer = llm_call([
            {"role": "system", "content": "Answer the question based on the provided context. Cite specific facts from the context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ])
        self._log("generate", f"Generated {len(answer)} chars")
        return answer
    
    def quality_check(self, question, context, answer):
        """Check answer quality."""
        is_faithful = check_faithfulness(question, context, answer)
        self._log("quality_check", f"Faithful: {is_faithful}")
        return is_faithful
    
    def run(self, question, verbose=True):
        """Execute the full agentic RAG pipeline."""
        self.trace = []
        self._log("start", f"Question: {question}")
        
        # Step 1: Classify
        category = self.classify(question)
        
        # Step 2: Route
        if category == "factual_kb":
            # Retrieve from KB
            docs, ids = self.retrieve(question)
            relevant_docs = self.evaluate(question, docs, ids)
            
            # Corrective step
            if len(relevant_docs) < 1:
                self._log("correct", "No relevant docs - trying web search")
                context = mock_web_search(question)
            elif len(relevant_docs) < 2:
                self._log("correct", "Few relevant docs - supplementing with refined query")
                refined = llm_call([
                    {"role": "system", "content": "Rephrase for better search. Return only the query."},
                    {"role": "user", "content": question}
                ])
                extra_docs, _ = self.retrieve(refined, k=2)
                # Skip documents that the first retrieval already returned
                relevant_docs.extend(d for d in extra_docs if d not in relevant_docs)
                context = "\n\n".join(relevant_docs)
            else:
                context = "\n\n".join(relevant_docs)
            
            # Generate
            answer = self.generate(question, context)
            
            # Quality check
            if not self.quality_check(question, context, answer):
                self._log("regenerate", "Failed quality check - regenerating")
                answer = self.generate(question + " (Only use facts from the context, nothing else)", context)
        
        elif category == "recent_events":
            context = mock_web_search(question)
            answer = self.generate(question, context)
        
        elif category == "calculation":
            answer = llm_call([
                {"role": "system", "content": "Solve the math problem step by step."},
                {"role": "user", "content": question}
            ])
            self._log("calculate", "Direct LLM calculation")
        
        else:
            answer = llm_call([{"role": "user", "content": question}])
            self._log("direct", "Answered from parametric knowledge")
        
        self._log("end", f"Answer length: {len(answer)}")
        
        # Print trace
        if verbose:
            print(f"\nQuestion: {question}")
            print(f"\nPipeline trace:")
            for step in self.trace:
                print(f"  [{step['step']}] {step['detail']}")
            print(f"\nAnswer: {answer[:300]}")
        
        return answer, self.trace

# Test the full pipeline
pipeline = AgenticRAGPipeline()

pipeline.run("How does multi-head attention work in transformers?")
print("\n" + "=" * 60)
pipeline.run("What is 1024 * 768?")
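
The `calculation` branch of `run()` asks the LLM to do the arithmetic, and language models can get exact multi-digit products wrong. As a minimal sketch of a deterministic alternative, the snippet below evaluates plain arithmetic with Python's `ast` module. `safe_eval_arithmetic` is a hypothetical helper, not part of the pipeline above, and it assumes the arithmetic expression has already been extracted from the question.

```python
import ast
import operator

# Hypothetical helper (an assumption, not part of AgenticRAGPipeline).
# Only these operators on numeric literals are allowed.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval_arithmetic(expression):
    """Evaluate a basic arithmetic expression without calling eval()."""

    def _eval(node):
        # Numeric literal (bool is excluded even though it subclasses int)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, etc. are rejected
        raise ValueError(f"Unsupported expression: {expression!r}")

    tree = ast.parse(expression, mode="eval")  # raises SyntaxError on malformed input
    return _eval(tree.body)


print(safe_eval_arithmetic("1024 * 768"))  # 786432
```

One way to use it would be to let the LLM translate the question into an expression and hand only the evaluation to this function, so the numeric result no longer depends on the model.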

## Exercise 3: Complete Agentic RAG for Technical Documentation

**Your Task:** Extend the `AgenticRAGPipeline` to handle multi-step questions by decomposing them.

**Steps:**
1. Detect if a question has multiple parts
2. Decompose into sub-questions
3. Run each sub-question through the pipeline
4. Synthesize a combined answer

In [None]:
def multi_step_agentic_rag(question, verbose=True):
    """
    Handle multi-step questions by decomposition.
    """
    # TODO: Implement multi-step agentic RAG
    
    # 1. Check if question needs decomposition
    needs_decomposition = None  # Your code: use LLM to decide
    
    if needs_decomposition:
        # 2. Decompose
        sub_questions = None  # Your code
        
        # 3. Answer each sub-question
        sub_answers = []  # Your code
        
        # 4. Synthesize
        final_answer = None  # Your code
    else:
        pipeline = AgenticRAGPipeline()
        final_answer, _ = pipeline.run(question, verbose=verbose)
    
    return final_answer

# multi_step_agentic_rag("Compare BERT and GPT architectures, and explain which is better for RAG systems")

### Solution for Exercise 3

In [None]:
def multi_step_agentic_rag(question, verbose=True):
    """Handle multi-step questions - SOLUTION."""
    if verbose:
        print(f"Original question: {question}\n")
    
    # 1. Check if decomposition is needed
    check = llm_call([
        {"role": "system", "content": "Does this question contain multiple distinct parts that should be answered separately? Respond YES or NO only."},
        {"role": "user", "content": question}
    ]).strip().upper()
    
    # Accept replies such as "YES." or "Yes, it does", not just a bare "YES"
    needs_decomposition = check.startswith("YES")
    
    if needs_decomposition:
        if verbose:
            print("[Decomposing question into parts]\n")
        
        # 2. Decompose
        sub_q_text = llm_call([
            {"role": "system", "content": "Break this into 2-3 independent sub-questions. One per line, nothing else."},
            {"role": "user", "content": question}
        ])
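        import re  # local import: nothing else in this cell needs it
        # Strip list markers ("1.", "2)", "-") without eating digits that begin a question
        sub_questions = [re.sub(r'^\s*(?:\d+[.)]|[-*•])\s+', '', q).strip() for q in sub_q_text.splitlines() if q.strip()]
        if not sub_questions:
            sub_questions = [question]  # empty reply: fall back to the original question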
        
        if verbose:
            for i, sq in enumerate(sub_questions):
                print(f"  Sub-Q {i+1}: {sq}")
            print()
        
        # 3. Answer each
        sub_answers = []
        pipeline = AgenticRAGPipeline()
        for sq in sub_questions:
            answer, _ = pipeline.run(sq, verbose=verbose)
            sub_answers.append({"question": sq, "answer": answer})
            if verbose:
                print("-" * 40)
        
        # 4. Synthesize
        parts = "\n\n".join([f"Q: {sa['question']}\nA: {sa['answer']}" for sa in sub_answers])
        final_answer = llm_call([
            {"role": "system", "content": "Synthesize these sub-answers into one coherent, comprehensive response to the original question."},
            {"role": "user", "content": f"Original question: {question}\n\nSub-answers:\n{parts}"}
        ])
    else:
        pipeline = AgenticRAGPipeline()
        final_answer, _ = pipeline.run(question, verbose=verbose)
    
    if verbose:
        print(f"\n{'=' * 60}")
        print(f"FINAL ANSWER:\n{final_answer}")
    
    return final_answer

multi_step_agentic_rag("Compare BERT and GPT architectures, and explain which is better for RAG systems")
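
Turning the model's one-question-per-line reply into a list is the one step of the solution that can be unit-tested without an LLM call. The sketch below isolates it in a hypothetical helper, `parse_sub_questions`, which is an assumption for illustration and not part of the notebook. It strips list markers, drops duplicates, caps the number of sub-questions, and falls back to the original question when the reply is empty.

```python
import re

# Matches list markers such as "1.", "2)", "-", "*" at the start of a line.
# The trailing \s+ keeps text like "3D ..." or "3.5 billion ..." intact.
_LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+")


def parse_sub_questions(llm_reply, fallback, max_questions=3):
    """Turn a one-question-per-line LLM reply into a clean list of sub-questions."""
    seen = set()
    questions = []
    for line in llm_reply.splitlines():
        text = _LIST_MARKER.sub("", line).strip()
        key = text.lower()
        if text and key not in seen:  # skip blank lines and repeated questions
            seen.add(key)
            questions.append(text)
    # Fall back to the original question when the reply held nothing usable
    return questions[:max_questions] or [fallback]


reply = (
    "1. How does BERT work?\n"
    "2) How does GPT work?\n"
    "\n"
    "- How does BERT work?\n"
    "3D convolutions: what are they?"
)
print(parse_sub_questions(reply, fallback="original question"))
```

Capping the list keeps one badly formatted reply from triggering an unbounded number of pipeline runs.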

---

## Exercise 4 (Capstone): End-to-End AI Assistant

**Your Task:** Build a complete AI assistant that combines:
- RAG from the knowledge base
- Web search fallback
- Mathematical computation
- Multi-step reasoning
- Self-checking for quality

Test with 5 diverse queries that exercise different capabilities.

In [None]:
def ai_assistant(question):
    """
    Complete AI assistant combining all agentic RAG capabilities.
    
    TODO: Combine the best of:
    - agentic_rag() for tool selection
    - corrective_rag() for quality checking
    - multi_step_agentic_rag() for complex questions
    """
    # Your code here
    pass

# Test with 5 diverse queries
capstone_queries = [
    None,  # Your code: technical AI question
    None,  # Your code: comparison question
    None,  # Your code: calculation question
    None,  # Your code: recent events question
    None,  # Your code: multi-step reasoning question
]

### Solution for Exercise 4

In [None]:
def ai_assistant(question, verbose=True):
    """Complete AI assistant - SOLUTION."""
    if verbose:
        print(f"{'=' * 60}")
        print(f"USER: {question}")
        print(f"{'=' * 60}")
    
    # Use multi-step agentic RAG as the backbone
    # It handles decomposition, routing, retrieval, evaluation, and generation
    answer = multi_step_agentic_rag(question, verbose=verbose)
    
    if verbose:
        print(f"\n{'=' * 60}")
        print(f"ASSISTANT: {answer}")
        print(f"{'=' * 60}\n")
    
    return answer

# Test with 5 diverse queries
capstone_queries = [
    "How do transformer models use attention mechanisms to process text?",
    "Compare the training approaches of BERT and GPT models",
    "If a transformer has 12 attention heads and each head has dimension 64, what is the total model dimension?",
    "What are the latest developments in open-source AI?",
    "Explain how RAG systems use embeddings for retrieval, and what evaluation metrics are used to measure their quality"
]

for query in capstone_queries:
    ai_assistant(query)
    print("\n")
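
Running the five queries shows the answers, but it does not verify that each query took the intended route. Below is a minimal sketch of a routing check, assuming the classifier returns the category labels that `run()` branches on (`factual_kb`, `recent_events`, `calculation`). Both `check_routing` and the keyword-based `stub_classify` are hypothetical stand-ins so that the snippet runs without an API key.

```python
def check_routing(classify_fn, cases):
    """Return (query, expected, actual) for every case routed differently than expected."""
    mismatches = []
    for query, expected in cases:
        actual = classify_fn(query)
        if actual != expected:
            mismatches.append((query, expected, actual))
    return mismatches


def stub_classify(question):
    """Hypothetical keyword classifier standing in for the LLM-based classify()."""
    q = question.lower()
    if any(word in q for word in ("latest", "recent", "news")):
        return "recent_events"
    if any(ch.isdigit() for ch in q):
        return "calculation"
    return "factual_kb"


cases = [
    ("How do transformer models use attention mechanisms to process text?", "factual_kb"),
    ("If a transformer has 12 attention heads and each head has dimension 64, "
     "what is the total model dimension?", "calculation"),
    ("What are the latest developments in open-source AI?", "recent_events"),
]

# Prints nothing when every query is routed as expected
for query, expected, actual in check_routing(stub_classify, cases):
    print(f"MISROUTED: {query!r} expected {expected}, got {actual}")
```

In the notebook, `pipeline.classify` could be passed in place of `stub_classify`. Each case then costs one LLM call, and the labels may vary between runs.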

---

## 9. Course Summary

Congratulations on completing the AI RAG & Agents course!

### Your Journey

| Module | Topic | Key Skill |
|--------|-------|----------|
| 1 | AI/ML Fundamentals | Linear regression, gradient descent, evaluation metrics |
| 2 | Neural Networks | Perceptrons, backpropagation, PyTorch, MNIST |
| 3 | NLP Fundamentals | Tokenization, TF-IDF, text classification |
| 4 | Word Vectors | GloVe, word analogies, dense representations |
| 5 | Modern Embeddings | API embeddings, semantic similarity, clustering |
| 6 | Text Generation | GPT-2, sampling strategies, temperature |
| 7a/b | Transformers | Self-attention, multi-head attention, encoder/decoder |
| 8 | Prompt Engineering | Zero/few-shot, CoT, structured output |
| 9 | RAG Foundations | Chunking, vector databases, semantic search |
| 10 | RAG Pipeline | End-to-end Q&A, stuff/map-reduce, source attribution |
| 11 | Advanced RAG | Evaluation, HyDE, re-ranking, hybrid search |
| 12 | Agents Introduction | Tool use, ReAct, LangChain agents |
| 13 | Advanced Agents | Multi-agent systems, memory, planning, LangGraph |
| **14** | **Agentic RAG** | **Combining RAG + agents for intelligent retrieval** |

### Where to Go From Here

- **Production RAG**: LlamaIndex, Haystack for production-grade pipelines
- **Fine-tuning**: LoRA/QLoRA for domain-specific models
- **Evaluation**: RAGAS, DeepEval for automated quality assessment
- **Advanced Agents**: CrewAI, AutoGen for complex multi-agent systems
- **Build real applications**: Customer support bots, research assistants, code analysis tools

### References

- Paper: Yan et al. "Corrective Retrieval Augmented Generation" (CRAG, 2024)
- Paper: Asai et al. "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" (2023)
- Docs: LangGraph "Agentic RAG" tutorial
- Paper: Lewis et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)