<a href="https://colab.research.google.com/github/ashworks1706/agents-rag-from-scratch/blob/main/tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Agents and RAG, A Technical Deep Dive

<a href="https://somwrks.notion.site/?source=copy_link" class="btn btn-primary btn-lg" style="background-color: #0366d6; color: white; padding: 5px 10px; border-radius: 5px; text-decoration: none; font-weight: bold; display: inline-block; margin-top: 10px;"><i class="fa fa-file-text-o" aria-hidden="true"></i> Research paper breakdowns</a> <a href="https://github.com/ashworks1706/rlhf-from-scratch" class="btn btn-primary btn-lg" style="background-color: #0366d6; color: white; padding: 5px 10px; border-radius: 5px; text-decoration: none; font-weight: bold; display: inline-block; margin-top: 10px;"><i class="fa fa-file-text-o" aria-hidden="true"></i> RLHF From Scratch</a> <a href="https://github.com/ashworks1706/llm-from-scratch" class="btn btn-primary btn-lg" style="background-color: #0366d6; color: white; padding: 5px 10px; border-radius: 5px; text-decoration: none; font-weight: bold; display: inline-block; margin-top: 10px;"><i class="fa fa-file-text-o" aria-hidden="true"></i> LLM From Scratch</a> <a href="https://github.com/ashworks1706/agents-rag-from-scratch" class="btn btn-primary btn-lg" style="background-color: #0366d6; color: white; padding: 5px 10px; border-radius: 5px; text-decoration: none; font-weight: bold; display: inline-block; margin-top: 10px;"><i class="fa fa-file-text-o" aria-hidden="true"></i> Agents & RAG From Scratch</a>

I'll go through the fundamentals of agents and RAG with the help of the LangChain library.

<img src="https://www.kdnuggets.com/wp-content/uploads/awan_getting_langchain_ecosystem_1-1024x574.png" width=700>



### Brief History

Before we dive into building agents, let's take a moment to understand the journey that brought us to this exciting point in AI history. Understanding where agents came from will help you appreciate why the systems we're building today represent such a significant breakthrough.

Let me tell you a story about how we got here. The concept of intelligent agents has evolved dramatically over the past seven decades, transforming from simple rule-based systems to today's sophisticated AI companions that can reason, plan, and act autonomously.

**The Early Days (1950s-1980s):** The journey began in the 1950s when researchers like Allen Newell and Herbert Simon created the Logic Theorist, a program that could prove mathematical theorems by exploring different logical paths. These early agents were like skilled craftsmen: they could perform specific tasks very well, but only within narrow, pre-defined domains.

The 1970s and 1980s brought expert systems like MYCIN for medical diagnosis and DENDRAL for chemical analysis. While impressive, these systems required months of manual knowledge engineering, where human experts had to explicitly encode their domain knowledge into rigid rule sets. Imagine trying to teach someone to be a doctor by writing down every possible symptom combination and treatment - that's essentially what early AI researchers had to do!

**The Networking Era (1990s-2000s):** The 1990s marked a shift toward more flexible software agents that could operate in networked environments and coordinate with other agents. This period introduced the concept of multi-agent systems, where multiple specialized agents could collaborate to solve complex problems. However, these systems still required extensive manual programming and could only handle situations their creators had anticipated.

<img src="https://miro.medium.com/1*Ygen57Qiyrc8DXAFsjZLNA.gif" width=700>

**The Learning Revolution (2000s-2010s):** The real transformation began in the 2000s with machine learning advances. Agents could now learn from data rather than relying solely on hand-coded rules. Virtual assistants like Siri and Alexa brought agent technology to mainstream consumers, though they remained relatively narrow in scope, essentially sophisticated voice interfaces for search and simple task execution.

**The LLM Breakthrough (2020s):** The breakthrough moment arrived with large language models starting around 2020. Systems like GPT-3 and GPT-4 combined vast knowledge with sophisticated reasoning abilities, creating agents that could understand natural language, maintain context across conversations, and tackle a wide variety of tasks without task-specific programming.

Unlike their predecessors, these modern agents can break down complex problems into steps, use external tools when needed, and adapt to new situations they've never encountered before. This evolution represents a fundamental shift from automation to augmentation: where early agents automated specific, predefined tasks, today's agents can understand our goals and work as collaborative partners in problem-solving.

**Why This History Matters for You:** Understanding this evolution helps us appreciate that we're not just building better chatbots; we're creating systems that can handle ambiguous instructions, incomplete information, and constantly changing contexts. These capabilities make them invaluable for building sophisticated applications like the retrieval-augmented generation systems we'll explore in this tutorial.

## Agents



When we talk about agents in 2025, we're entering a landscape where the term has become both ubiquitous and somewhat ambiguous. Different organizations and researchers use "agent" to describe everything from simple chatbots to fully autonomous systems that can operate independently for weeks.

Another source of confusion is reinforcement learning terminology: the "agent" described in reinforcement learning is a different concept from the LLM agents we deal with here, even though the two share a similar vision.

But don't let this confusion discourage you! This diversity in definition isn't just academic; it reflects fundamentally different architectural approaches that will determine how we build the next generation of AI applications. Let me help you navigate this landscape.

<img src="https://substackcdn.com/image/fetch/$s_!A_Oy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3177e12-432e-4e41-814f-6febf7a35f68_1360x972.png" width=700>

**What Actually Makes an Agent?** At its core, an agent is a system that can perceive its environment, make decisions, and take actions to achieve specific goals. Sounds simple, right? But the way these capabilities are implemented varies dramatically.

Some define agents as fully autonomous systems that operate independently over extended periods, using various tools and adapting their strategies based on feedback. Think of these like a personal assistant who can manage your entire schedule, book flights, handle emails, and make decisions on your behalf without constant supervision.

Others use the term more broadly to describe any system that follows predefined workflows to accomplish tasks. These implementations are more like following a detailed recipe: each step is predetermined, and while the system can handle some variations, it operates within clearly defined boundaries.

**Why This Distinction Matters to You:** The difference between these approaches is crucial because it affects everything from system reliability to development complexity. Understanding this spectrum will help you choose the right approach for your specific needs and avoid over-engineering solutions.

**The Spectrum of Control:** The most useful way to think about this spectrum is through the lens of control and decision-making:

- **Workflows** are systems where large language models and tools are orchestrated through predefined code paths. Every decision point is anticipated by the developer, and the system follows predetermined logic to handle different scenarios.

- **Agents** are systems where the LLM dynamically directs its own processes and tool usage, maintaining control over how it accomplishes tasks. The model itself decides what to do next, which tools to use, and how to adapt when things don't go as planned.

Think of workflows as following a GPS route: you know exactly where you're going and how to get there. Agents are more like having an experienced local guide who can adapt the route based on traffic, weather, or interesting stops along the way.
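
To make the distinction concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: `TOOLS`, `run_workflow`, `run_agent`, and especially `fake_llm_decide`, which is a hard-coded stub standing in for a real LLM call. It only illustrates who owns the control flow, not how a production agent is built.

```python
from typing import Callable, Dict, List

# Hypothetical tools: plain functions that take and return text.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"results for '{q}'",
    "summarize": lambda text: f"summary of [{text}]",
}

def run_workflow(query: str) -> str:
    """Workflow: the developer fixes the order of steps in code."""
    found = TOOLS["search"](query)
    return TOOLS["summarize"](found)

def fake_llm_decide(goal: str, history: List[str]) -> str:
    """Stub standing in for an LLM that picks the next action.
    A real agent would send the goal and history to a model here."""
    if not history:
        return "search"
    if len(history) == 1:
        return "summarize"
    return "finish"

def run_agent(goal: str, max_steps: int = 5) -> str:
    """Agent: the model chooses the next tool until it decides to stop."""
    history: List[str] = []
    current = goal
    for _ in range(max_steps):
        action = fake_llm_decide(goal, history)
        if action == "finish":
            break
        current = TOOLS[action](current)
        history.append(action)
    return current

print(run_workflow("LLM agents"))
print(run_agent("LLM agents"))
```

With this stub both functions produce the same result, which is the point: the difference lies in whether your code or the model decides the next step. The `max_steps` cap is a common safeguard against an agent that never chooses to finish.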

#### Simplicity Defines Perfectionism, Not Complexity


Now, here's some advice that might surprise you: when building applications with LLMs, the fundamental principle should be finding the simplest solution that meets your requirements. This might mean not building agentic systems at all!

Let me explain why this matters. Agentic systems inherently trade latency and cost for better task performance. Every additional decision point, tool call, and reasoning step adds time and expense to your application. You need to carefully consider when this tradeoff makes sense for your specific use case.

**When to Choose Workflows:** Workflows offer predictability and consistency for well-defined tasks where you can anticipate most scenarios and edge cases. They're excellent for:
- Standardized processes like data processing pipelines
- Content moderation workflows
- Structured analysis tasks
- Any situation where you need reliable, repeatable results

**When to Choose Agents:** Agents become the better choice when you need flexibility and model-driven decision-making at scale. This includes situations where:
- The variety of inputs and required responses is too broad to predefine
- The system needs to adapt to entirely new scenarios
- You're dealing with open-ended problems that require creative problem-solving
- The complexity of decision trees would make workflow programming impractical

**The Simple Truth:** Here's what I've learned from building production AI systems: for many applications, the most effective approach involves optimizing single LLM calls with retrieval and in-context examples rather than building complex agentic systems.

Before you architect a sophisticated multi-agent system with elaborate tool chains, ask yourself: "Could I solve this with a well-crafted prompt and some good examples?" You'd be surprised how often the answer is yes.

**But When Complexity is Worth It:** However, as we'll explore throughout this tutorial, there are compelling scenarios where the additional complexity of agents becomes not just beneficial, but necessary for achieving your goals. Understanding when and how to make this transition is what separates effective AI system builders from those who over-engineer solutions to problems that could be solved more simply.

The key is developing good judgment about when to add complexity. Start simple, measure performance, and only add complexity when you can clearly demonstrate that it improves outcomes for your specific use case.

### Prompts

Let's start with the most fundamental skill you'll need as an agent builder: crafting effective prompts.

Think of prompts as the bridge between human intent and AI capabilities: they're how we translate our natural language requests into structured instructions that language models can understand and act upon.

But here's what makes prompts fascinating in agentic systems: they're not just about getting good answers to single questions. In the context of agents, prompts become the architectural blueprints that define not only *what* we want the agent to accomplish, but *how* the agent should approach problem-solving, what tools it can use, and how it should reason through complex tasks.

**Why Prompts Are Your Most Important Tool:** I like to think of prompts as the instruction manual for your AI agent. Just as a well-written manual can make the difference between a novice successfully assembling furniture or ending up with a pile of confused parts, a well-crafted prompt determines whether your agent performs brilliantly or struggles to understand your intent.

The quality and structure of your prompts directly influence:
- The agent's reasoning capabilities
- How it chooses and uses tools  
- Its overall effectiveness in completing tasks
- The consistency of results across different inputs

<img src="https://www.datablist.com/_next/image?url=%2Fhowto_images%2Fhow-to-write-prompt-ai-agents%2Fstructured-ai-agent-prompt.png&w=3840&q=75" width=700>

**The Different Types of Prompts You'll Use:** As we build more sophisticated systems, you'll work with several types of prompts, each serving different purposes:

- **System prompts** establish the agent's role, personality, and fundamental operating principles; these are like giving someone their job description and company handbook before they start work
- **User prompts** contain the specific tasks or questions you want the agent to handle
- **Few-shot prompts** provide examples of desired input-output patterns to guide the agent's responses
- **Chain-of-thought prompts** encourage step-by-step reasoning, helping agents break down complex problems into manageable pieces
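
As a rough illustration of how these prompt types combine, the sketch below assembles one message list containing a system prompt, few-shot examples, a chain-of-thought instruction, and the user prompt. The helper `build_messages` and the tutoring scenario are assumptions made up for this example; the `(role, content)` tuple shape simply mirrors the one used with `ChatPromptTemplate.from_messages` in this tutorial.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content)

def build_messages(question: str, examples: List[Tuple[str, str]]) -> List[Message]:
    """Combine system, few-shot, chain-of-thought, and user prompts."""
    messages: List[Message] = [
        # System prompt: role plus a chain-of-thought instruction.
        ("system", "You are a careful math tutor. "
                   "Think step by step before giving the final answer."),
    ]
    # Few-shot prompts: worked input/output pairs that show the desired format.
    for example_question, example_answer in examples:
        messages.append(("human", example_question))
        messages.append(("ai", example_answer))
    # User prompt: the actual task.
    messages.append(("human", question))
    return messages

few_shot = [
    ("What is 2 + 3 * 4?", "First 3 * 4 = 12, then 2 + 12 = 14. Answer: 14"),
]
for role, content in build_messages("What is 12 * 11?", few_shot):
    print(f"{role}: {content}")
```

Keeping the examples as data rather than hard-coding them in the prompt string makes it easy to swap them per task and to watch how many tokens they consume.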

**The Multi-Step Challenge:** In multi-step agentic workflows, prompt engineering becomes particularly sophisticated because you need to design prompts that not only solve individual tasks but also coordinate between different stages of processing. The agent needs to understand when to use specific tools, how to interpret tool outputs, and how to maintain context across multiple interaction cycles.

This requires careful consideration of prompt structure, token efficiency, and the logical flow of information through your system. Don't worry, we'll practice all of this together as we build real systems.

**Let's See It in Action:** Now that you understand why prompts matter so much, let's explore how to implement effective prompt templates using LangChain with Google's Gemini model. We'll start with basics and gradually work up to sophisticated multi-step prompting strategies.

In [None]:
# Minimal setup and imports
import os
import json
import random
import datetime

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate  # PromptTemplate is used by setup_prompt_templates()
from langchain_core.output_parsers import StrOutputParser
from langchain_core.tools import tool
from langchain.agents import create_tool_calling_agent, AgentExecutor

# Single shared LLM for the tutorial (reuse this everywhere)
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",
    temperature=0.3,
    google_api_key=os.getenv("GOOGLE_API_KEY"),
)

# Small shared state for examples
tutorial_state = {
    "prompt_templates": {},
    "chains": {},
    "demo_data": {},
}

print("Setup complete: shared LLM and tutorial_state are ready.")

In [None]:
# Install required packages for the tutorial (run this cell before the setup cell above)
%pip install langchain langchain-google-genai langchain-core numpy

print("📦 Packages installed for the Agents and RAG tutorial.")

Next, we'll create some prompt templates to reuse as examples.

In [None]:

# Let's create prompt templates that we'll reuse throughout the tutorial
# These will work with our global llm instance

def setup_prompt_templates():
    """Initialize reusable prompt templates for the tutorial"""

    # Basic instructional prompt - for general explanations
    basic_template = PromptTemplate(
        input_variables=["topic", "audience"],
        template="""You are an expert educator who excels at explaining complex topics clearly.

        Topic: {topic}
        Audience: {audience}

        Please provide a clear, engaging explanation that includes:
        1. Core concept definition
        2. Relevant examples or analogies
        3. Key takeaways for the audience level

        Keep your explanation appropriate for the specified audience."""
    )

    # Conversational prompt - for interactive discussions
    chat_template = ChatPromptTemplate.from_messages([
        ("system", """You are a helpful AI assistant with expertise in technology and science.
        You provide accurate, clear explanations and engage in detailed discussions.
        Always think step-by-step when solving problems and explain your reasoning."""),
        ("human", "I need help understanding {concept}. Can you break it down for me?"),
        ("ai", "I'd be happy to help explain {concept}! Let me break this down step by step."),
        ("human", "{user_question}")
    ])

    # Store templates in tutorial_state for reuse throughout the notebook
    tutorial_state["prompt_templates"] = {
        "basic": basic_template,
        "chat": chat_template
    }

    # Create reusable chains with our global llm
    tutorial_state["chains"] = {
        "basic": basic_template | llm | StrOutputParser(),
        "chat": chat_template | llm | StrOutputParser()
    }

    print("✅ Prompt templates created and stored in tutorial_state")
    print("🔗 Chains connected to our global llm instance")
    print("📝 Templates available: basic, chat")
    return tutorial_state["prompt_templates"]

# Initialize our reusable prompt system
prompt_templates = setup_prompt_templates()

print("\nThese templates will be reused throughout the tutorial")
print("No need to recreate them - they're stored in tutorial_state")


Great! Now our LLM can respond to our questions. But how can we tune how strongly it follows the prompt's guidelines versus drawing on its own knowledge and reasoning? Let's see!

### Hyperparameters

Once you've mastered basic prompting, the next level of control comes from understanding how to tune your model's behavior through hyperparameters.

These are the control knobs that determine how a language model generates responses, acting like the settings on a sophisticated instrument that can dramatically change the output quality and behavior.

**Why Understanding Hyperparameters Matters:** Understanding these parameters is crucial for building effective agents because they directly influence:
- How the model balances following prompt instructions versus drawing on its pre-trained knowledge
- How creative or conservative its responses are
- How consistently it behaves across multiple interactions
- Whether it takes safe, predictable paths or explores more novel solutions

Let's walk through the key parameters and the mathematical foundations that drive their behavior.

**Temperature (τ) - The Creativity Knob:** Temperature controls the randomness in the model's token selection process through the softmax function. Here's how it works mathematically:

Given logits $z_i$ for each possible token $i$, the probability distribution is calculated as:

$$P(token_i) = \frac{e^{z_i/\tau}}{\sum_{j=1}^{V} e^{z_j/\tau}}$$

Where:
- $\tau$ (tau) is the temperature parameter
- $V$ is the vocabulary size
- Lower $\tau$ → sharper distribution (more deterministic)
- Higher $\tau$ → flatter distribution (more random)

At $\tau = 1$, we get the standard softmax. As $\tau \to 0$, the distribution approaches a one-hot encoding of the highest logit (very predictable). As $\tau \to \infty$, the distribution becomes uniform (completely random).
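
The formula above can be checked numerically with a few lines of plain Python. This is a sketch for intuition only: the function name `softmax_with_temperature` and the three example logits are made up for illustration, and real models apply this over a vocabulary of many thousands of tokens.

```python
import math
from typing import List

def softmax_with_temperature(logits: List[float], tau: float) -> List[float]:
    """Compute P(token_i) = exp(z_i / tau) / sum_j exp(z_j / tau)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    # Subtracting the max does not change the result but avoids overflow.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical logits for three tokens
for tau in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, tau)
    print(f"tau={tau}: {[round(p, 3) for p in probs]}")
```

Running it shows the top token taking nearly all the probability mass at a low `tau`, and the three probabilities moving close to each other at a high `tau`.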

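**Top-p (Nucleus Sampling) - The Focus Control:** Top-p works by selecting the smallest set of highest-probability tokens whose cumulative probability reaches the threshold $p$. With tokens sorted by decreasing probability, $P(token_{(1)}) \geq P(token_{(2)}) \geq \dots \geq P(token_{(V)})$:

$$\text{Nucleus} = \{token_{(1)}, \dots, token_{(k^*)}\}, \quad k^* = \min\Big\{k : \sum_{j=1}^{k} P(token_{(j)}) \geq p\Big\}$$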

This creates a dynamic vocabulary size: sometimes the model considers many options, sometimes just a few, depending on how confident it is.
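
A small sketch shows the "dynamic vocabulary" effect. The function `top_p_filter` and the two toy distributions are assumptions for illustration; a real decoder applies the same idea to the model's full next-token distribution and then samples from the renormalized nucleus.

```python
from typing import Dict

def top_p_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the smallest set of highest-probability tokens whose
    cumulative probability reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus: Dict[str, float] = {}
    cumulative = 0.0
    for token, prob in ranked:
        nucleus[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(nucleus.values())
    return {token: prob / total for token, prob in nucleus.items()}

# Hypothetical next-token distributions.
confident = {"the": 0.9, "a": 0.05, "an": 0.03, "this": 0.02}
uncertain = {"cat": 0.3, "dog": 0.3, "bird": 0.2, "fish": 0.2}

print(len(top_p_filter(confident, 0.85)), "token(s) kept when the model is confident")
print(len(top_p_filter(uncertain, 0.85)), "token(s) kept when the model is uncertain")
```

The same threshold keeps a single token in the confident case and all four in the uncertain one, which is exactly the adaptivity that a fixed cutoff lacks.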

**Top-k - The Hard Limit:** Top-k simply restricts consideration to the $k$ highest-probability tokens, where $k$ is a fixed integer. It's simpler than top-p but less adaptive.
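
For comparison, here is a minimal top-k sketch under the same assumptions (a toy distribution and a hypothetical helper named `top_k_filter`). Unlike top-p, the number of surviving tokens never depends on how the probability mass is spread.

```python
from typing import Dict

def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep only the k highest-probability tokens, then renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}

probs = {"cat": 0.5, "dog": 0.3, "bird": 0.2}  # hypothetical distribution
print(top_k_filter(probs, k=2))
```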

**Practical Control Parameters:**
- **Max tokens** provides an upper bound $N_{max}$ on sequence length
- **Stop sequences** define termination conditions based on specific token patterns
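
How these two controls end generation can be sketched with a fake token stream. The `generate` function and the token list are purely illustrative and do not reflect how any provider implements decoding; in practice you only pass the limits as parameters and the provider enforces them.

```python
from typing import Iterable, Sequence

def generate(token_stream: Iterable[str], max_tokens: int,
             stop: Sequence[str] = ()) -> str:
    """Consume tokens until max_tokens is reached or a stop sequence appears."""
    text = ""
    for count, token in enumerate(token_stream):
        if count >= max_tokens:
            break  # hard upper bound on sequence length
        text += token
        for stop_sequence in stop:
            index = text.find(stop_sequence)
            if index != -1:
                # Cut the stop sequence and anything after it.
                return text[:index]
    return text

# Hypothetical tokens a model might emit one at a time.
tokens = ["Answer", ":", " 42", "\n", "Question", ":", " next"]

print(repr(generate(tokens, max_tokens=10, stop=["\n"])))
print(repr(generate(tokens, max_tokens=2)))
```

The first call ends at the newline stop sequence; the second is cut off by the token limit, which is why a too-small `max_tokens` can truncate an answer mid-thought.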

**The Art of Parameter Selection:** The key insight is that these parameters create fundamental tradeoffs. You're not just adjusting "creativity"; you're choosing between instruction-following precision and knowledge-bringing flexibility.

For agents, this choice becomes critical: Do you want an agent that follows instructions exactly, or one that can creatively adapt its approach? The answer depends entirely on your use case.



Next, we'll define three model configurations (conservative, balanced, and creative) that differ in temperature and share the same output-token cap, so we can compare their behavior.

In [None]:

# Now let's explore how hyperparameters affect our existing LLM's behavior
# We'll create variants using our global llm configuration as a template

def demonstrate_temperature_effects(topic="quantum computing"):
    """
    Demonstrate how temperature affects the same LLM's responses
    We'll use our existing llm instance and adjust only temperature
    """

    # Use our existing prompt template from tutorial_state
    prompt = tutorial_state["prompt_templates"]["basic"]

    # Create temperature variants using the same model as our global llm
    temperatures = {
        "conservative": 0.1,   # τ = 0.1 for high determinism
        "balanced": 0.7,       # τ = 0.7 (higher than our global llm, which uses 0.3)
        "creative": 1.2        # τ = 1.2 for high creativity
    }

    print("🌡️ TESTING TEMPERATURE EFFECTS ON RESPONSES")
    print(f"Using the same model: {llm.model}")
    print("=" * 60)

    results = {}

    for config_name, temp_value in temperatures.items():
        # Create a temporary llm instance with different temperature
        temp_llm = ChatGoogleGenerativeAI(
            model=llm.model,  # Same model as global llm
            temperature=temp_value,
            max_tokens=150,
            google_api_key=os.getenv("GOOGLE_API_KEY")
        )

        # Use our existing chain pattern
        chain = prompt | temp_llm | StrOutputParser()
        response = chain.invoke({
            "topic": topic,
            "audience": "technical professionals"
        })

        results[config_name] = response
        print(f"\n{config_name.upper()} (τ={temp_value}):")
        print(f"Response: {response}")
        print("-" * 60)

    # Store results in our tutorial state
    tutorial_state["demo_data"]["hyperparameter_comparison"] = results

    return results

# Test with our reusable function
print("\n🧪 Demonstrating how temperature affects our LLM's behavior")
hyperparameter_results = demonstrate_temperature_effects()

print("\n✅ Temperature demonstration complete")
print("📊 Notice how the same model produces different outputs at different temperatures")
print("💡 Our global llm uses τ=0.3 for balanced results throughout the tutorial")


Now let's see how it looks.

In [None]:
# Let's test our hyperparameter demonstrations and see the results
print("🧪 Running hyperparameter demonstrations")
print("=" * 60)

# Test temperature effects using the function we defined
print("\n1️⃣ Testing Temperature Effects on the Same Query")
temp_results = demonstrate_temperature_effects(topic="neural networks")

print("\n2️⃣ Analyzing the Results")
print("Notice how the same model at different temperatures produces:")
print("   • Conservative (τ=0.1): Focused, predictable responses")
print("   • Balanced (τ=0.7): Mix of consistency and variety")
print("   • Creative (τ=1.2): More diverse, exploratory responses")

print("\n💡 Our global llm uses τ=0.3 throughout this tutorial")
print("   This gives us reliable, consistent behavior while allowing some flexibility")

print("\n✅ Hyperparameter demonstrations complete")
print("📊 Results stored in tutorial_state['demo_data']")


 The examples above demonstrate something fundamental about how hyperparameters work in practice. They create a crucial tradeoff between instruction following and creative knowledge application.

**Low Temperature Models:** Excel at following precise formatting requirements and maintaining consistency across multiple calls. This makes them ideal for:
- Structured data extraction
- API responses that need consistent formatting
- Workflows where predictability is paramount
- Any situation where you need the model to be a reliable, consistent executor

**Higher Temperature Models:** Bring more of the model's training knowledge into play, generating more diverse responses and creative solutions. They're better for:
- Creative writing and content generation
- Problem-solving that benefits from novel approaches
- Situations where you want the model to "think outside the box"
- Applications where some variation in responses is actually beneficial

**The Agent Design Choice:** This balance becomes critical in agentic systems where you need to decide whether your agent should be a precise executor of specific instructions or a creative problem-solver that can adapt its approach based on context.

The choice often depends on your use case: customer service bots might need low-temperature consistency to ensure professional, predictable responses, while creative writing assistants might benefit from higher-temperature diversity to generate fresh ideas and varied approaches.

Next, we need to give our agents the ability to extend beyond their base knowledge and interact with the world. This is where tools come into play: they transform a language model from a sophisticated text generator into an active agent that can perform real actions and access current information.

### Tools

With prompts and hyperparameters mastered, it's time to give your agents the ability to interact with the world beyond their training data.

 Tools are what transform language models from sophisticated text generators into active agents capable of performing real-world actions and accessing live information.

**Think of Tools as Your Agent's Hands and Senses:** Without tools, even the most advanced language model is limited to working with only the knowledge it was trained on, which becomes stale the moment training ends. Tools bridge this gap by allowing agents to interact with databases, APIs, web services, file systems, and any other external systems your application needs to work with.

<img src="https://media.licdn.com/dms/image/v2/D4D12AQGyFCaSY8w4Ag/article-cover_image-shrink_720_1280/B4DZYg8dDRHAAI-/0/1744309441965?e=1762992000&v=beta&t=NS3gCnYSTWkxVwnRpHX6tCG7wcXcGgEknNpowIVAo2k" width=700>

**How Tool Calling Actually Works:** The fundamental concept behind tools in agentic systems is function calling (also known as tool calling). Here's what makes this so powerful: modern language models like GPT-4, Claude, and Gemini have been specifically trained to understand when they need external information or capabilities, and can generate structured function calls with appropriate parameters.

When an agent encounters a question about current weather or stock prices, or needs to perform calculations, it can recognize the limitation and call the appropriate tool instead of hallucinating an answer. This is a game-changer for building reliable systems!

**The Tool Execution Dance:** Let me walk you through how this works in practice:

1. **Request Analysis:** The agent receives a user request and analyzes what information or actions are needed
2. **Tool Selection:** It determines which tools to use based on the requirements  
3. **Parameter Formatting:** It formats the tool calls with proper parameters
4. **Execution:** The tools are executed and return results
5. **Synthesis:** The agent receives the results and synthesizes a response using both its knowledge and the tool outputs
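
The five steps above can be sketched without any framework. In this illustration `fake_model` is a hard-coded stand-in for the LLM (steps 1 to 3), and `get_temperature` is a hypothetical tool returning canned data; none of these names come from LangChain or a provider API.

```python
import json
from typing import Any, Callable, Dict

def get_temperature(city: str) -> str:
    """Hypothetical tool: returns a canned reading instead of calling a real API."""
    readings = {"Paris": 18, "Cairo": 31}
    return json.dumps({"city": city, "temp_c": readings.get(city)})

TOOLS: Dict[str, Callable[..., str]] = {"get_temperature": get_temperature}

def fake_model(request: str) -> Dict[str, Any]:
    """Stub for steps 1-3: a real LLM would analyze the request,
    select a tool, and emit a structured call with parameters."""
    for city in ("Paris", "Cairo"):
        if city in request:
            return {"tool": "get_temperature", "args": {"city": city}}
    return {"tool": None, "args": {}}

def answer(request: str) -> str:
    call = fake_model(request)                 # steps 1-3: analyze, select, format
    if call["tool"] is None:
        return "I can answer this without a tool."
    raw = TOOLS[call["tool"]](**call["args"])  # step 4: execute the tool
    result = json.loads(raw)
    # Step 5: synthesize a reply from the tool output.
    return f"It is {result['temp_c']} C in {result['city']}."

print(answer("How warm is it in Paris?"))
```

Note that the model never runs the tool itself: it only produces a structured request, and your application code performs the call and feeds the result back.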

**The Power of Tool Chaining:** This creates a powerful feedback loop where agents can chain multiple tool calls together, use the output of one tool as input to another, and dynamically adapt their approach based on intermediate results. Imagine an agent that searches the web for recent news, summarizes the findings, then generates a report, all in one coherent workflow!
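
As a toy version of the search, summarize, report example, the sketch below feeds each tool's output into the next and records the intermediate results. The three tools are stubs with made-up behavior, and `run_chain` is a hypothetical helper; a real agent would let the model decide the order and inspect each intermediate result before continuing.

```python
from typing import Callable, List, Tuple

Tool = Callable[[str], str]

def search_news(topic: str) -> str:
    """Stub search tool returning two fake headlines."""
    return f"headline A about {topic}; headline B about {topic}"

def summarize(text: str) -> str:
    """Stub summarizer that just counts the headlines."""
    return f"{len(text.split(';'))} headlines found"

def write_report(summary: str) -> str:
    """Stub report writer."""
    return f"REPORT: {summary}"

def run_chain(tools: List[Tool], initial: str) -> Tuple[str, List[str]]:
    """Pass each tool's output to the next tool and keep a trace."""
    trace: List[str] = []
    value = initial
    for tool in tools:
        value = tool(value)
        trace.append(f"{tool.__name__} -> {value}")
    return value, trace

final, trace = run_chain([search_news, summarize, write_report], "AI agents")
for step in trace:
    print(step)
```

Keeping a trace of intermediate values is useful in practice because it lets you see which step in the chain produced a bad result.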

**Three Categories of Tools We'll Explore:**

1. **Built-in tools** that come pre-integrated with language model providers
2. **Explicit tools** that you define and implement yourself  
3. **Model Context Protocol (MCP) tools** that provide standardized interfaces for complex integrations

Each category serves different purposes and offers varying levels of customization and complexity.

#### Starting Simple: Built-in Tools

The easiest way to get started with agent tools is to use the capabilities that come built into your language model.

These are native capabilities provided directly by language model providers, eliminating the need for external integrations or custom implementations.

**Why Built-in Tools Are Awesome:** Google's Gemini models, for example, come with several powerful built-in tools including Google Search integration, code execution capabilities, and mathematical computation tools. These tools are particularly valuable because:

- **Optimized Integration:** They're optimized for the specific model with minimal latency overhead
- **No Extra Setup:** You don't need additional API keys or setup beyond your primary model access  
- **Seamless Experience:** The model provider handles all the complexity of tool execution, result formatting, and error handling
- **Reliability:** They're battle-tested and maintained by the model provider

**Real-World Example:** When you enable Google Search for Gemini, the model can perform web searches and incorporate real-time information directly into its responses without any additional code on your part. It's like giving your agent instant access to the entire internet!

Similarly, the code execution tool allows Gemini to write and run Python code in a sandboxed environment, making it excellent for data analysis, mathematical calculations, and generating visualizations. Imagine asking your agent to "analyze this sales data and create a chart" and having it actually execute the code to do so!

**The Trade-off to Consider:** The main limitation of built-in tools is that you're constrained to what the provider offers. You can't customize their behavior or add your own specialized functionality. But for many use cases, the convenience and reliability make this a great starting point.



In [None]:
# Simple example: prompt templates for tool-oriented chains that reuse the same LLM
# (this cell only builds the prompts; it does not enable any provider-side tools)
def create_builtin_tools_demo():
    """Create search- and code-oriented prompt templates and store them for reuse."""
    # In production you might register tool-capable agents; here we only define the prompts
    search_prompt = ChatPromptTemplate.from_messages([
        ("system", "When useful, search the web for up-to-date info."),
        ("human", "{query}")
    ])
    code_prompt = ChatPromptTemplate.from_messages([
        ("system", "When asked, produce small, safe Python snippets to compute answers."),
        ("human", "{analysis_request}")
    ])
    # store templates for reuse
    tutorial_state["prompt_templates"].update({"search": search_prompt, "code": code_prompt})
    return {"search_prompt": search_prompt, "code_prompt": code_prompt}

builtins = create_builtin_tools_demo()
print('Built-in tool templates created (reused later).')

In [None]:
# Now let's demonstrate our built-in tool agents
# These extend our global llm with additional capabilities

print("🔧 Testing Built-in Tool Capabilities")
print("=" * 60)

# Get our agents from tutorial_state (they use the same base llm).
# Note: no earlier cell registers "builtin_agents", so this is empty unless you add them.
agents = tutorial_state.get("builtin_agents", {})

if agents:
    print("\n✅ Using pre-configured agents with built-in tools")
    print(f"   Available agents: {list(agents.keys())}")

    # Get our chains that use these agents
    chains = tutorial_state.get("chains", {})

    if "search_chain" in chains:
        print("\n🔍 Example: Search-enabled agent")
        print("   This agent can search for current information when needed")
        print("   💡 It uses our same base llm but with Google Search capability")

    if "code_chain" in chains:
        print("\n💻 Example: Code execution agent")
        print("   This agent can write and execute Python code")
        print("   💡 It uses our same base llm but with code execution capability")

else:
    print("⚠️ Built-in agents not yet initialized")
    print("   They will be created when needed using our global llm")

print("\n💡 Key insight: All these agents share the same base LLM")
print("   We're just adding different tool capabilities on top")


#### Explicit Tools: Building Agent Memory

As we build more sophisticated agents, we quickly run into a fundamental challenge: how do we help our agents remember important information across conversations and interactions? This is where memory systems become crucial.

Think about how frustrating it would be to work with a colleague who forgot everything you discussed after each meeting. That's essentially what happens with stateless language models: each interaction starts fresh, with no memory of previous conversations or learned preferences.

Memory systems solve this by allowing agents to:
- **Maintain Context**: Remember what you've discussed previously
- **Learn Preferences**: Adapt to your communication style and needs over time  
- **Build Relationships**: Create more natural, ongoing conversations
- **Accumulate Knowledge**: Learn from interactions to become more effective

**The Challenge:** The tricky part is deciding what to remember, how long to keep it, and how to retrieve relevant memories when needed. Different memory strategies work better for different types of applications.


In [None]:
# Define a couple of focused example tools using the @tool decorator
@tool
def get_weather(city: str) -> str:
    """Return a tiny mocked weather JSON for teaching purposes."""
    condition = random.choice(["sunny", "cloudy", "rainy"])
    return json.dumps({"city": city, "condition": condition, "temp_c": random.randint(0,30)})

@tool
def compound_interest(principal: float, rate: float, years: int) -> str:
    """Minimal compound interest calculation; returns JSON string."""
    amount = principal * (1 + rate) ** years
    return json.dumps({"principal": principal, "rate": rate, "years": years, "final": round(amount,2)})

# Expose tools list for agents/examples
example_tools = [get_weather, compound_interest]
print('Example tools defined: get_weather, compound_interest')

Great. Now we'll create an agent equipped with these tools, ready to be tested.

In [None]:
# Now let's create an agent that uses our custom tools
# This agent will use our existing global llm instance

def create_custom_tool_agent():
    """
    Create an agent with custom tools using our existing llm instance
    This shows how to extend our base agent with specific capabilities
    """

    # Get the tools we created earlier; fall back to the example_tools list defined above
    custom_tools = tutorial_state.get("tools", {}).get("custom_tools") or example_tools

    # Create a prompt that works with our existing llm
    tool_prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a helpful assistant with access to several specialized tools:

        🌤️  get_weather: Get (mocked) current weather for any city
        💰 compound_interest: Calculate investment returns with compound interest

        Use these tools when needed to provide accurate, helpful responses.
        Always explain which tool you're using and why.
        Format JSON data nicely for users."""),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ])

    # Create agent using our global llm
    agent = create_tool_calling_agent(llm, custom_tools, tool_prompt)

    # Create executor
    agent_executor = AgentExecutor(
        agent=agent,
        tools=custom_tools,
        verbose=True,
        handle_parsing_errors=True
    )

    # Store in tutorial state for reuse
    tutorial_state["agents"] = tutorial_state.get("agents", {})
    tutorial_state["agents"]["custom_tool_agent"] = agent_executor

    print("Custom tool agent created using our global llm")
    print(f"🔧 Tools available: {len(custom_tools)}")

    return agent_executor

# Create our reusable agent
tool_agent = create_custom_tool_agent()

print("✅ Agent ready and stored in tutorial_state['agents']")
print("💡 We can reuse this agent for multiple queries without recreating it")


#### Model Context Protocol (MCP)



The Model Context Protocol (MCP) is an open standard that gives AI applications a uniform, secure way to connect to data sources and tools. Think of MCP as a universal translator that lets any AI system talk to any external service through one common protocol, which removes the need for a custom integration per tool or data source.

In short, think of it as a public tool-calling kit.

<img src="https://mintcdn.com/mcp/bEUxYpZqie0DsluH/images/mcp-simple-diagram.png?w=1100&fit=max&auto=format&n=bEUxYpZqie0DsluH&q=85&s=341b88d6308188ab06bf05748c80a494" width=700>


<img src="https://pbs.twimg.com/tweet_video_thumb/Gl7C44tXYAAdDSJ.jpg" width=700>

<img src="https://miro.medium.com/0*qtnzILuhG39c2DML.jpeg" width=700>



MCP was developed by Anthropic to solve the fragmentation problem in AI tool ecosystems. Before MCP, every AI application had to implement its own custom integrations for databases, APIs, file systems, and other external resources. This led to duplicated effort, security inconsistencies, and tools that only worked with specific AI platforms. MCP standardizes these interactions through a client-server architecture where MCP servers expose resources (like databases or file systems) and tools (like calculators or API clients) through a uniform interface.

The protocol is built on JSON-RPC 2.0 messages exchanged over a stateful connection between AI applications (MCP clients) and external resources (MCP servers). Communication is bidirectional: your agent can call tools, and a server can also push notifications such as progress updates or resource-change events. The security model starts from explicit capability declarations negotiated when the connection is set up, so agents can only use what a server has chosen to expose. Consent prompts and sandboxing are the responsibility of the host application rather than something the protocol enforces on its own.
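To make the JSON-RPC framing concrete, here is a minimal sketch of one `tools/call` round trip. It runs entirely in-process: the `TOOLS` table, `make_request`, and `handle` are hypothetical stand-ins for a real MCP client and server, and only the envelope fields (`jsonrpc`, `id`, `method`, `params`, `result`, `error`) and the standard JSON-RPC error codes are taken from the specs.

```python
import json
from itertools import count

_ids = count(1)  # each request in a session gets its own id


def make_request(method: str, params: dict) -> str:
    """Serialize a JSON-RPC 2.0 request envelope."""
    return json.dumps({"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params})


# Hypothetical tool table standing in for a real MCP server's tools.
TOOLS = {"add": lambda a, b: a + b}


def handle(raw: str) -> str:
    """Toy server: answer a tools/call request or return a JSON-RPC error."""
    req = json.loads(raw)
    base = {"jsonrpc": "2.0", "id": req["id"]}  # the response echoes the request id
    params = req.get("params", {})
    if req.get("method") != "tools/call":
        # -32601 is the JSON-RPC 2.0 "Method not found" code
        return json.dumps({**base, "error": {"code": -32601, "message": "Method not found"}})
    tool = TOOLS.get(params.get("name"))
    if tool is None:
        # -32602 is the JSON-RPC 2.0 "Invalid params" code
        return json.dumps({**base, "error": {"code": -32602, "message": "Unknown tool"}})
    value = tool(**params.get("arguments", {}))
    result = {"content": [{"type": "text", "text": str(value)}], "isError": False}
    return json.dumps({**base, "result": result})


request = make_request("tools/call", {"name": "add", "arguments": {"a": 2, "b": 3}})
response = json.loads(handle(request))
print(response["result"]["content"][0]["text"])
```

A real client would send the same envelope over stdio or HTTP instead of calling `handle` directly; the matching `id` is what lets it pair responses with requests on a shared connection.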

What makes MCP particularly powerful for RAG and agentic systems is its ability to provide **contextual data access**. Instead of just calling functions, MCP servers can expose rich contextual information about resources - like database schemas, file structures, or API capabilities - allowing agents to make more informed decisions about how to interact with external systems.
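Here is a rough sketch of what "contextual data access" can look like. A server lists resource metadata first, and the agent uses the descriptions to decide what to read. The two resources, their URIs, and the `pick_resources` helper are invented for illustration; the field names (`uri`, `name`, `description`, `mimeType`) follow the shape of an MCP `resources/list` result.

```python
from dataclasses import dataclass, asdict


@dataclass
class Resource:
    uri: str
    name: str
    description: str
    mimeType: str


# Hypothetical resources a business server might expose.
RESOURCES = [
    Resource("db://customers/schema", "customer_schema",
             "Column names and types for the customers table", "application/json"),
    Resource("file:///reports/q4.md", "q4_report",
             "Quarterly sales summary", "text/markdown"),
]


def list_resources() -> dict:
    """Return metadata only (no contents), like a resources/list result."""
    return {"resources": [asdict(r) for r in RESOURCES]}


def pick_resources(keyword: str) -> list[str]:
    """What an agent might do with the metadata: choose which URIs to read next."""
    kw = keyword.lower()
    return [r["uri"] for r in list_resources()["resources"] if kw in r["description"].lower()]


print(pick_resources("sales"))
```

The point of the design is that the agent sees descriptions before contents, so it can skip resources that are irrelevant to the question instead of loading everything into context.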


In [None]:

import asyncio
import json
import nest_asyncio
from typing import Any, Dict, List, Optional
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
import os
import tempfile
from datetime import datetime

# Enable nested asyncio loops for Jupyter
nest_asyncio.apply()

In [None]:
# Simulate a tiny MCP-like resource collection (synchronous, minimal)
simulated_mcp = {
    "customer_db": {"customers": [{"id": "001", "name": "Alice", "tier": "premium"}]},
    "inventory": {"items": [{"sku": "A001", "name": "Widget", "quantity": 10}]},
    "analytics": {"sales": {"month": 12000, "trend": "up"}},
}

def mcp_read_resource_sync(resource_name: str) -> str:
    """Simple synchronous read from the simulated MCP resources."""
    key = resource_name.lower()
    if key in simulated_mcp:
        return json.dumps(simulated_mcp[key])
    return json.dumps({"error": f"resource '{resource_name}' not found"})

def mcp_call_tool_sync(name: str, arguments: dict) -> str:
    """Very small dispatcher for simulated tools (e.g., update inventory)."""
    if name == "query_analytics":
        metric = arguments.get("metric", "sales")
        return json.dumps(simulated_mcp.get("analytics", {}).get(metric, {}))
    if name == "update_inventory":
        sku = arguments.get("sku")
        qty = arguments.get("quantity", 0)
        items = simulated_mcp["inventory"]["items"]
        item = next((i for i in items if i["sku"] == sku), None)
        if item:
            item["quantity"] = qty
            return json.dumps({"sku": sku, "new_quantity": item["quantity"]})
        return json.dumps({"error": "sku not found"})
    return json.dumps({"error": f"unknown tool {name}"})

print('Simulated MCP available for local examples.')

In [None]:
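# Create small @tool wrappers that call the synchronous simulated MCP helpers.
# @tool requires a docstring on each function; it becomes the tool description.
@tool
def mcp_read_resource(resource_name: str) -> str:
    """Read a simulated MCP resource: customer_db, inventory, or analytics."""
    return mcp_read_resource_sync(resource_name)

@tool
def mcp_query_analytics(metric: str = "sales", period: str = "month") -> str:
    """Return a simulated analytics metric such as 'sales'."""
    return mcp_call_tool_sync("query_analytics", {"metric": metric, "period": period})

@tool
def mcp_update_inventory(sku: str, quantity: int) -> str:
    """Set the stock quantity for a SKU in the simulated inventory."""
    return mcp_call_tool_sync("update_inventory", {"sku": sku, "quantity": quantity})

mcp_tools = [mcp_read_resource, mcp_query_analytics, mcp_update_inventory]

# Build a minimal agent demonstrating tool usage.
# create_tool_calling_agent requires an agent_scratchpad placeholder in the prompt.
mcp_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a business assistant with access to simple MCP tools."),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

mcp_agent = create_tool_calling_agent(llm, mcp_tools, mcp_prompt)

# The test cell calls mcp_executor.invoke(...), so wrap the agent in an executor
mcp_executor = AgentExecutor(
    agent=mcp_agent,
    tools=mcp_tools,
    verbose=True,
    handle_parsing_errors=True
)
print('Minimal MCP-enabled agent created (uses simulated resources).')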

In [None]:
# Test the MCP-style agent against the simulated resources defined above.
# This is the in-memory simulation, not a live MCP server. The simulation has
# no send_notification tool, so for those requests the agent should say so.

print("=" * 60)
print("🧪 TESTING SIMULATED MCP INTEGRATION")
print("=" * 60)

# (title, note, response label, user input)
mcp_tests = [
    ("Test 1: Customer Data Analysis via MCP",
     "🔍 Using MCP resource: customer_db",
     "📋 Response:",
     "Analyze our customer data. Show me the customer information, tier distribution, and total customer value."),
    ("Test 2: Real-time Business Analytics via MCP Tools",
     "📊 Using MCP tool: query_analytics",
     "📈 Response:",
     "Get our current sales and revenue metrics for this month. Also check user growth trends."),
    ("Test 3: Inventory Management via MCP",
     "📦 Using MCP resource and tools: inventory + update_inventory",
     "🏪 Response:",
     "Check our current inventory levels, then add 25 units to the Widget (SKU A001). Also check if we're low on any items."),
    ("Test 4: Business Operations - Notification System",
     "📢 Requesting a notification (no send_notification tool exists in the simulation)",
     "🔔 Response:",
     "Send a high-priority notification to the warehouse manager about low stock levels for any items under 100 units."),
    ("Test 5: Comprehensive Business Dashboard",
     "🎯 Using multiple MCP resources and tools",
     "📊 Dashboard Response:",
     """Create a comprehensive business dashboard showing:
    1. Customer tier distribution and total value
    2. Current sales performance and trends
    3. Inventory status with any low-stock alerts
    4. Send a summary notification to the CEO

    Use all available MCP resources and tools to gather this information."""),
]

for title, note, label, user_input in mcp_tests:
    print(f"\n=== {title} ===")
    print(note)
    result = mcp_executor.invoke({"input": user_input})
    print(label, result["output"])
    print("\n" + "=" * 50)

print("\n" + "=" * 60)
print("✅ SIMULATED MCP INTEGRATION TESTS COMPLETED")
print("🎉 Agent exercised the MCP-style tools end to end!")
print("=" * 60)


This demonstrates how modern AI systems can safely and efficiently integrate with enterprise systems using standardized protocols rather than ad-hoc custom integrations.

In [None]:
# Optional: Cleanup MCP Server Resources
# Only relevant if you started a real MCP server object named business_mcp;
# the simulated resources above need no cleanup.

async def cleanup_mcp_server():
    """Clean up MCP server resources, if a server was started."""
    server = globals().get("business_mcp")
    if server is None:
        print("⚠️ No MCP server object found; nothing to clean up")
        return
    try:
        await server.cleanup()
        print("✅ MCP server resources cleaned up successfully")
    except Exception as e:
        print(f"⚠️ Cleanup warning: {e}")

# Uncomment the line below if you want to cleanup the MCP server
# await cleanup_mcp_server()

print("💡 Simulated MCP resources are ready for use!")
print("🧹 Run cleanup_mcp_server() when finished if you started a real server")

We've seen how **built-in tools** provide immediate capabilities with minimal setup, **explicit tools** offer complete customization for your specific needs, and **MCP tools** enable standardized integration with complex systems while maintaining security and scalability.

The key insight is that tools are what bridge the gap between language model intelligence and real-world utility. Without tools, even the most sophisticated language model is limited to generating text based on its training data. With tools, agents become active participants in your business processes, capable of querying databases, performing calculations, calling APIs, and taking actions in response to user needs.
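The bridging role of tools comes down to a small dispatch step: the model emits a structured request, and ordinary code looks up the function and runs it. Here is a framework-free sketch of that step. The `get_time_zone` tool, the `REGISTRY` dict, and the JSON shape of the model output are all assumptions made for illustration, not the format of any particular provider.

```python
import json


def get_time_zone(city: str) -> str:
    """Hypothetical tool: look up a time zone from a tiny hard-coded table."""
    zones = {"tokyo": "Asia/Tokyo", "paris": "Europe/Paris"}
    return zones.get(city.lower(), "unknown")


# Map tool names (what the model sees) to callables (what actually runs).
REGISTRY = {"get_time_zone": get_time_zone}


def run_tool_call(tool_call_json: str) -> str:
    """Execute one model-issued tool call and return a JSON result string."""
    call = json.loads(tool_call_json)
    fn = REGISTRY.get(call["name"])
    if fn is None:
        return json.dumps({"error": f"unknown tool {call['name']}"})
    return json.dumps({"result": fn(**call["arguments"])})


# Pretend the language model produced this text as its tool request.
model_output = '{"name": "get_time_zone", "arguments": {"city": "Tokyo"}}'
print(run_tool_call(model_output))
```

Agent frameworks wrap this same loop with schema generation, validation, and retries, but the core idea stays the same: text in, function call, text back out.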

As we design agentic systems, the choice between different tool types depends on your specific requirements:
- Use **built-in tools** when the model provider offers functionality that meets your needs
- Create **explicit tools** when you need custom integration with your specific systems  
- Implement **MCP tools** when you need standardized, scalable integrations across multiple AI applications


### Context Engineering



Context management is the cognitive backbone of sophisticated agents, determining how they maintain awareness of ongoing conversations, remember past interactions, and build upon previous knowledge to provide coherent, contextually relevant responses. Without proper context management, even the most capable agents become like individuals with severe short-term memory loss: they might excel at individual tasks but fail to maintain meaningful, coherent interactions over time.

Think of context management as the difference between having a conversation with a knowledgeable expert who remembers your entire discussion versus repeatedly starting fresh with someone who has no recollection of what you've already covered. The former builds understanding progressively, references earlier points, and adapts their communication based on your evolving needs. The latter, while potentially knowledgeable, forces you to repeat yourself and cannot build on the conversational foundation you've established.

In agentic systems, context management becomes even more critical because agents need to coordinate information across multiple tool calls, maintain state during complex workflows, and remember important details that influence future decisions. An agent helping with financial planning needs to remember your risk tolerance, investment timeline, and previous decisions to provide consistent advice. A customer service agent should recall your account history, previous issues, and preferences to deliver personalized support.

The challenge lies in balancing several competing factors: **memory capacity** (how much information can be retained), **relevance** (what information is most important to keep), **efficiency** (managing token limits and processing costs), and **persistence** (maintaining memory across sessions). Different memory strategies excel in different scenarios, and the best approach often involves combining multiple memory types to create a comprehensive context management system.

<img src="https://substackcdn.com/image/fetch/$s_!AyLS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0e3c002-0841-4d5f-9171-3eb63c321824_1600x1224.png" width=700>

Memory systems in agentic applications serve different purposes and have distinct strengths and limitations.

**Buffer-based memories** store raw conversation history up to certain limits, providing complete fidelity but consuming significant token space. **Summary-based memories** compress conversation history into concise summaries, trading some detail for efficiency. **Window-based memories** maintain only recent interactions, ensuring relevance while discarding older context. **Token-aware memories** dynamically manage content based on token consumption, balancing completeness with cost constraints.

Each memory type excels in specific scenarios: use buffer memory for short conversations where every detail matters, summary memory for long-running sessions where themes and key decisions need tracking, window memory for task-oriented interactions where only recent context is relevant, and token buffer memory for cost-sensitive applications with unpredictable conversation lengths.

- **Buffer Memory**: Stores everything - perfect recall but grows indefinitely
- **Summary Memory**: Compresses older content - manageable size with key information preserved  
- **Window Memory**: Only recent context - predictable size but limited history
- **Token Memory**: Drops the oldest messages once a token limit is exceeded - cost-controlled, but early context is lost
- **Entity Memory**: Relationship tracking - maintains entity awareness across conversations
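Before using the LangChain classes, it helps to see how little logic the first few strategies need. Here is a plain-Python sketch of the buffer, window, and token-limited views over the same list of turns. The function names are made up for this example, and `count_tokens` is a crude whitespace word count standing in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def buffer_view(turns: list[str]) -> list[str]:
    """Buffer memory: keep everything."""
    return list(turns)


def window_view(turns: list[str], k: int) -> list[str]:
    """Window memory: keep only the last k turns."""
    return list(turns[-k:]) if k > 0 else []


def token_view(turns: list[str], max_tokens: int) -> list[str]:
    """Token-limited memory: keep the newest turns that fit in the budget."""
    kept, used = [], 0
    for turn in reversed(turns):  # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    return kept[::-1]  # restore chronological order


turns = ["budget is 2M", "manager is Sarah Chen", "launch in Q4", "what are the risks"]
print(len(buffer_view(turns)))   # all four turns survive
print(window_view(turns, 2))     # only the last two turns
print(token_view(turns, 8))      # newest turns totalling at most 8 "tokens"
```

Summary and entity memory differ in kind: they need a model (or some extractor) to rewrite history rather than just select from it, which is why they cost extra calls.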

In [None]:
# Let's create different memory systems that work with our global llm
# These will help our agent remember conversations in different ways

def setup_memory_systems():
    """
    Initialize various memory systems for our agent
    All will use the same global llm but with different memory strategies
    """

    # Create a dedicated LLM for memory operations (slightly lower temperature for consistency)
    memory_llm = ChatGoogleGenerativeAI(
        model=llm.model,  # Same model as global llm
        temperature=0.2,  # Lower temperature for more consistent memory operations
        google_api_key=os.getenv("GOOGLE_API_KEY")
    )

    # Store the memory llm in tutorial_state
    tutorial_state["memory_llm"] = memory_llm

    # Initialize our memory systems.
    # ConversationChain's default prompt expects the variables "history" and "input",
    # so each memory must expose its content under the key "history".
    memory_systems = {
        "Buffer (Complete)": ConversationBufferMemory(
            memory_key="history",
            return_messages=True
        ),
        "Summary (Compressed)": ConversationSummaryMemory(
            llm=memory_llm,
            memory_key="history",
            return_messages=True
        ),
        "Window (Last 3)": ConversationBufferWindowMemory(
            k=3,  # Keep last 3 conversation pairs
            memory_key="history",
            return_messages=True
        ),
        "Token Limited": ConversationTokenBufferMemory(
            llm=memory_llm,
            max_token_limit=500,
            memory_key="history",
            return_messages=True
        ),
        "Entity Tracking": ConversationEntityMemory(
            llm=memory_llm,
            entity_store=InMemoryEntityStore(),
            return_messages=True  # exposes "history" and "entities" by default
        )
    }

    # Store in tutorial_state for reuse
    tutorial_state["memory_systems"] = memory_systems

    # Entity memory also injects an "entities" variable, so it needs the matching prompt
    from langchain.memory.prompt import ENTITY_MEMORY_CONVERSATION_TEMPLATE

    # Create conversation chains for each memory type
    memory_chains = {}
    for name, memory_system in memory_systems.items():
        chain_kwargs = {"llm": memory_llm, "memory": memory_system, "verbose": False}
        if isinstance(memory_system, ConversationEntityMemory):
            chain_kwargs["prompt"] = ENTITY_MEMORY_CONVERSATION_TEMPLATE
        memory_chains[name] = ConversationChain(**chain_kwargs)

    tutorial_state["memory_chains"] = memory_chains

    print("🧠 Memory Systems Initialized")
    print(f"   📊 {len(memory_systems)} different memory strategies")
    print("   🔗 All using consistent memory LLM (τ=0.2)")
    print(f"   💾 Available types: {list(memory_systems.keys())}")

    return memory_systems, memory_chains

# Initialize our memory systems
memory_systems, memory_chains = setup_memory_systems()

print("\n✅ Memory systems ready for use throughout the tutorial")
print("💡 These will help our agent remember conversations in different ways")


##### Comparing Memory Systems Side-by-Side:


In [None]:
# Let's test our memory systems with a business scenario
# We'll use the chains we already created in tutorial_state

print("🧪 Testing Memory Systems with Business Scenario")
print("=" * 60)

# Get our pre-configured memory chains
memory_chains = tutorial_state.get("memory_chains", {})

if not memory_chains:
    print("⚠️ Memory chains not initialized, setting them up now...")
    memory_systems, memory_chains = setup_memory_systems()

# Define a realistic conversation scenario
test_scenario = [
    "Hi, I'm working on the TechCorp project with a $2M budget.",
    "The project manager is Sarah Chen, and we're targeting Q4 launch.",
    "We need to coordinate with the development team led by Mike Rodriguez.",
    "The main deliverable is a cloud migration to Azure platform.",
    "Sarah mentioned the timeline is aggressive - only 3 months to complete.",
    "What are the key risks we should be monitoring for this project?"
]

print(f"📝 Testing with {len(test_scenario)} conversation turns")
print("\nEach memory system will process the same conversation")
print("   Watch how they handle context differently")

# Test each memory system
scenario_results = {}

for memory_name, chain in memory_chains.items():
    print(f"\n--- Testing {memory_name} ---")

    # Process all conversation turns with this memory system
    for i, user_input in enumerate(test_scenario, 1):
        response = chain.predict(input=user_input)
        print(f"Turn {i}: ✅")

    # Store the final response for comparison
    final_response = response[:150] + "..." if len(response) > 150 else response
    scenario_results[memory_name] = {
        "final_response": final_response,
        "memory_type": memory_name
    }

    # Clear memory for next test
    chain.memory.clear()
    print("✓ Completed and cleared")

# Store results
tutorial_state["memory_test_results"] = scenario_results

print(f"\n🏁 Completed testing all {len(memory_chains)} memory systems!")
print("📊 Results stored in tutorial_state for analysis")


Real-world applications often benefit from combining multiple memory strategies to create sophisticated context management systems that leverage the strengths of different approaches while mitigating their individual limitations. CombinedMemory allows you to orchestrate multiple memory systems simultaneously, creating layered context awareness that can handle both immediate needs and long-term relationship building.

For example, you might combine ConversationBufferWindowMemory for immediate context with ConversationEntityMemory for long-term entity tracking, plus a custom memory component for domain-specific information. This creates a multi-layered memory architecture where recent interactions provide immediate context, entity memory maintains relationship continuity, and specialized memory components handle domain-specific requirements like user preferences or system configurations.
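The layering idea can be sketched without any framework: each memory contributes named variables, the combiner merges them, and the prompt template consumes the merged dict. The classes below (`WindowMemory`, `StaticMemory`, `Combined`) are simplified stand-ins written for this example, not LangChain's implementations, and the template is deliberately tiny.

```python
class WindowMemory:
    """Keeps the last k exchanges under one prompt variable."""

    def __init__(self, key: str, k: int):
        self.key, self.k, self.turns = key, k, []

    def load(self) -> dict:
        recent = self.turns[-self.k:] if self.k > 0 else []
        return {self.key: "\n".join(recent)}

    def save(self, user: str, ai: str) -> None:
        self.turns.append(f"Human: {user} | AI: {ai}")


class StaticMemory:
    """Read-only values injected into every prompt (never updated by save)."""

    def __init__(self, values: dict):
        self.values = dict(values)

    def load(self) -> dict:
        return dict(self.values)

    def save(self, user: str, ai: str) -> None:
        pass


class Combined:
    """Merges variables from several memories and rejects key collisions."""

    def __init__(self, memories: list):
        self.memories = memories

    def load(self) -> dict:
        merged = {}
        for memory in self.memories:
            for key, value in memory.load().items():
                if key in merged:
                    raise ValueError(f"two memories both define '{key}'")
                merged[key] = value
        return merged

    def save(self, user: str, ai: str) -> None:
        for memory in self.memories:
            memory.save(user, ai)


TEMPLATE = "Recent: {recent_history}\nPreferences: {user_preferences}\nQuestion: {input}"

memory = Combined([WindowMemory("recent_history", k=2),
                   StaticMemory({"user_preferences": "concise"})])
for n in (1, 2, 3):
    memory.save(f"question {n}", f"answer {n}")

prompt = TEMPLATE.format(input="question 4", **memory.load())
print(prompt)
```

Two practical consequences fall out of this shape: every memory needs a distinct variable name, and the prompt template has to mention every variable the memories provide.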


In [None]:
# Now let's build a sophisticated combined memory system
# This will use our existing memory_llm from tutorial_state

print("🏗️ Building Combined Memory Architecture")
print("=" * 60)

# Get our memory_llm (created earlier with consistent settings)
memory_llm = tutorial_state.get("memory_llm")

if not memory_llm:
    print("⚠️ Memory LLM not found, creating it...")
    memory_llm = ChatGoogleGenerativeAI(
        model=llm.model,
        temperature=0.2,
        google_api_key=os.getenv("GOOGLE_API_KEY")
    )
    tutorial_state["memory_llm"] = memory_llm

# Create individual memory components
print("\n1️⃣ Setting up memory components...")

# Recent Memory - immediate context
# input_key tells each memory which chain input is the user's message;
# CombinedMemory needs this because the inputs also carry other memories' variables.
recent_memory = ConversationBufferWindowMemory(
    k=2,
    memory_key="recent_history",
    input_key="input",
    return_messages=False  # plain text, since the prompt below is a string template
)
print("   ✅ Recent Memory (last 2 turns)")

# Entity Tracker - long-term relationships
# ConversationEntityMemory exposes two variables: "entities" plus its own chat
# history (chat_history_key). Rename the latter so it cannot clash with other memories.
entity_tracker = ConversationEntityMemory(
    llm=memory_llm,  # Reusing our memory_llm
    entity_store=InMemoryEntityStore(),
    chat_history_key="entity_history",
    input_key="input",
    return_messages=False
)
print("   ✅ Entity Tracker (people, projects, companies)")

# Preferences - static user settings (SimpleMemory is read-only)
preferences_memory = SimpleMemory(
    memories={"user_preferences": "No specific preferences set yet"}
)
print("   ✅ Preferences Memory (user settings)")

# Combine them all
print("\n2️⃣ Combining into unified system...")
combined_memory = CombinedMemory(
    memories=[recent_memory, entity_tracker, preferences_memory]
)

# Create custom prompt for combined memory.
# Every variable the memories expose must appear here, or ConversationChain rejects the prompt.
combined_prompt = PromptTemplate(
    input_variables=["recent_history", "entities", "entity_history", "user_preferences", "input"],
    template="""You are an AI assistant with comprehensive memory capabilities.

Recent Conversation: {recent_history}

Known Entities: {entities}

Entity Memory's Own History Window: {entity_history}

User Preferences: {user_preferences}

Based on this context, respond to: {input}

Be conversational and reference relevant context from memory when appropriate."""
)

# Create the conversation chain using our memory_llm
combined_chain = ConversationChain(
    llm=memory_llm,  # Reusing our existing memory_llm
    memory=combined_memory,
    prompt=combined_prompt,
    verbose=True
)

# Store in tutorial_state for reuse
tutorial_state["combined_memory"] = combined_memory
tutorial_state["combined_chain"] = combined_chain

print("\n✅ Combined Memory System Created!")
print("   🔄 Orchestrates: Recent context + Entity tracking + Preferences")
print("   🧠 Uses our existing memory_llm (consistent with other memory ops)")
print("   💾 Stored in tutorial_state for reuse")


**Understanding the Architecture:**

What we just created is a three-layer memory system:

1. **Recent Memory** provides immediate conversational context - what was just said in the last few exchanges
2. **Entity Tracker** maintains long-term awareness of important entities (people, companies, projects) mentioned throughout the conversation
3. **Preferences Memory** holds fixed, user-specific settings that are injected into every prompt. `SimpleMemory` is read-only, so changing a preference means rebuilding it with new values

This architecture mirrors how human memory works - we have immediate working memory for current context, long-term memory for important relationships and facts, and persistent preferences that guide our behavior.
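To see what the entity layer adds on top of a rolling window, here is a toy entity store. Big caveat: `ConversationEntityMemory` asks an LLM to extract and summarize entities, whereas this sketch swaps in a naive regular expression that only catches two-or-more-word capitalized names. `NAME_PATTERN` and `EntityStore` are invented for illustration.

```python
import re
from collections import defaultdict

# Naive stand-in for LLM entity extraction: runs of capitalized words,
# e.g. "Sarah Chen". Single-word or mixed-case names are missed on purpose.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


class EntityStore:
    """Remembers every sentence in which a known entity was mentioned."""

    def __init__(self):
        self.facts = defaultdict(list)

    def observe(self, sentence: str) -> None:
        for name in NAME_PATTERN.findall(sentence):
            self.facts[name].append(sentence)

    def recall(self, text: str) -> dict:
        """Return stored notes for entities that appear in the new input."""
        return {name: notes for name, notes in self.facts.items() if name in text}


store = EntityStore()
store.observe("The project manager is Sarah Chen, and we're targeting Q4 launch.")
store.observe("We coordinate with the development team led by Mike Rodriguez.")
store.observe("Sarah Chen mentioned the timeline is aggressive.")

notes = store.recall("What did Sarah Chen say about timing?")
print(sorted(notes))
print(len(notes["Sarah Chen"]))
```

Unlike the window, this layer keeps facts about Sarah Chen no matter how many turns ago she was mentioned, and it only surfaces them when she comes up again.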


**How Combined Memory Works:**

The `CombinedMemory` system is like having a team of specialists working together:

- **Recent Memory** acts as the "immediate context specialist" - always aware of what just happened
- **Entity Tracker** serves as the "relationship specialist" - remembering who's who and what's what across conversations  
- **Preferences Memory** functions as the "personalization specialist" - maintaining user-specific settings and preferences

When you ask a question, all three systems contribute their expertise:
1. Recent memory provides immediate conversational context
2. Entity tracker identifies relevant relationships and entities
3. Preferences memory ensures responses align with user preferences

The custom prompt template weaves all this information together, creating responses that are both contextually aware and personally relevant.

In [None]:
# Let's test our combined memory system
# We'll use the chain we just created and stored in tutorial_state

print("🧪 Testing Combined Memory System")
print("=" * 60)

# Get our combined chain
combined_chain = tutorial_state.get("combined_chain")

if not combined_chain:
    print("⚠️ Combined chain not found, please run the previous cell first")
else:
    # Define test conversation
    test_conversation = [
        "Hi, I'm Sarah and I prefer concise responses. I'm working on a Python project.",
        "I need help with data analysis using pandas. Can you recommend some techniques?",
        "Actually, I'm working with customer data for my company TechFlow Solutions.",
        "Our CEO Mike Johnson wants insights on customer retention patterns.",
        "Can you suggest a visualization approach for this data?"
    ]

    print(f"📝 Running {len(test_conversation)} conversation turns")
    print("💡 Watch how the combined memory system:")
    print("   • Remembers Sarah prefers concise responses")
    print("   • Tracks entities (Sarah, TechFlow, Mike Johnson)")
    print("   • Maintains recent context")
    print()

    # Process each conversation turn
    for i, user_input in enumerate(test_conversation, 1):
        print(f"\n--- Turn {i} ---")
        print(f"User: {user_input}")

        # Use our combined memory chain
        response = combined_chain.predict(input=user_input)

        # Show brief preview
        preview = response[:100] + "..." if len(response) > 100 else response
        print(f"Preview: {preview}")
        print(f"✅ Turn {i} processed")

    print(f"\n🎯 Completed {len(test_conversation)} turns with combined memory!")
    print("💾 All context preserved across the conversation")

    # Store conversation in tutorial_state
    tutorial_state["combined_memory_conversation"] = test_conversation


**Analyzing What Just Happened:**

In this conversation, the combined memory system used its layers together:

1. **Turn 1**: Sarah introduces herself and states a preference (concise responses). Note that `SimpleMemory` is read-only, so this is not written to the preferences layer; it reaches later turns only through recent history and entity memory
2. **Turn 2**: Discusses pandas and data analysis - entity memory may start tracking "pandas", depending on what the extraction prompt picks up
3. **Turn 3**: Introduces "TechFlow Solutions" - entity memory now tracks this company
4. **Turn 4**: Mentions "Mike Johnson" as CEO - entity memory connects him to TechFlow Solutions
5. **Turn 5**: Asks about visualization - recent memory provides immediate context while entity memory maintains awareness of all the players and context

This creates a conversation experience where the agent:
- Can still see that Sarah prefers concise responses, provided entity memory recorded it (the static preferences layer does not update itself)
- Knows she works at TechFlow Solutions with CEO Mike Johnson (entities)
- Understands the current conversation is about customer retention visualization (recent context)

Let's examine what our memory systems captured:

In [None]:
# Let's examine what our memory systems captured
# And see the full picture of our reusable agent components

print("\n🧠 MEMORY SYSTEM ANALYSIS")
print("=" * 60)

# Check that the combined memory was stored by the earlier cell
combined_memory = tutorial_state.get("combined_memory")
if combined_memory:
    print("✅ Combined Memory System Active")
    print("   Components working together:")
    print("   • Recent Memory (last 2 turns)")
    print("   • Entity Tracker (relationships)")
    print("   • Preferences (user settings)")

print("\n📊 REUSABLE COMPONENTS INVENTORY")
print("=" * 60)

# Show all our reusable components
components_summary = {
    "LLM Instances": 0,
    "Prompt Templates": 0,
    "Chains": 0,
    "Memory Systems": 0,
    "Agents": 0,
    "Tools": 0
}

if "memory_llm" in tutorial_state:
    components_summary["LLM Instances"] += 1

if "prompt_templates" in tutorial_state:
    components_summary["Prompt Templates"] = len(tutorial_state["prompt_templates"])

if "chains" in tutorial_state:
    components_summary["Chains"] = len(tutorial_state["chains"])

if "memory_systems" in tutorial_state:
    components_summary["Memory Systems"] = len(tutorial_state["memory_systems"])

if "agents" in tutorial_state:
    components_summary["Agents"] = len(tutorial_state["agents"])

if "tools" in tutorial_state:
    if "custom_tools" in tutorial_state["tools"]:
        components_summary["Tools"] = len(tutorial_state["tools"]["custom_tools"])

print("\n📈 Component Summary:")
for component_type, count in components_summary.items():
    print(f"   {component_type}: {count}")

print("\n✅ Memory tutorial section completed!")
print("💾 All components stored in tutorial_state")
print("Ready to be reused in subsequent sections")

print("\n💡 TUTORIAL PHILOSOPHY:")
print("   Instead of creating new instances everywhere,")
print("   we build components once and reuse them throughout.")
print("   This mirrors real-world development practices!")

# Update tutorial state
tutorial_state['memory_systems_tested'] = [
    'ConversationBufferMemory',
    'ConversationSummaryMemory',
    'ConversationBufferWindowMemory',
    'ConversationTokenBufferMemory',
    'ConversationEntityMemory',
    'CombinedMemory'
]
tutorial_state['current_section'] = 'memory_complete'


Each approach serves different purposes and excels in specific scenarios:

**ConversationBufferMemory** provides perfect recall for short conversations where every detail matters, but becomes expensive in extended interactions. **ConversationSummaryMemory** enables indefinitely long conversations by maintaining key themes while sacrificing some detail. **ConversationBufferWindowMemory** offers predictable performance by keeping only recent context, ideal for task-oriented interactions. **ConversationTokenBufferMemory** provides optimal context utilization with cost control, perfect for production applications.

**ConversationEntityMemory** excels at tracking relationships and building long-term understanding, while **CombinedMemory** allows sophisticated orchestration of multiple memory strategies. The choice depends on your specific requirements: conversation length, cost constraints, detail requirements, and the importance of long-term relationship building.

In practice, most production agentic systems benefit from combining multiple memory approaches, using recent memory for immediate context, entity memory for relationship continuity, and token-aware management for cost control. This creates robust context management that adapts to different conversation patterns while maintaining performance and reliability.
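
As a rough illustration of that combination, the sketch below keeps a token-budgeted window of recent turns next to a small entity store. It is plain Python rather than LangChain: `TinyCombinedMemory`, its methods, and the whitespace-based token count are all assumptions made for this example, not library APIs.

```python
from collections import deque
from dataclasses import dataclass, field


def rough_token_count(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokens.
    return len(text.split())


@dataclass
class TinyCombinedMemory:
    """Hypothetical sketch: recent turns under a token budget, plus entity facts."""
    max_tokens: int = 50
    recent: deque = field(default_factory=deque)
    entities: dict = field(default_factory=dict)

    def add_turn(self, speaker: str, text: str) -> None:
        self.recent.append((speaker, text))
        # Drop the oldest turns until the window fits the budget again.
        while len(self.recent) > 1 and self._window_tokens() > self.max_tokens:
            self.recent.popleft()

    def remember_entity(self, name: str, fact: str) -> None:
        # Entity facts persist even after the turn that mentioned them is trimmed.
        self.entities.setdefault(name, []).append(fact)

    def _window_tokens(self) -> int:
        return sum(rough_token_count(text) for _, text in self.recent)

    def build_context(self) -> str:
        entity_lines = [f"{name}: {'; '.join(facts)}" for name, facts in self.entities.items()]
        turn_lines = [f"{speaker}: {text}" for speaker, text in self.recent]
        return "\n".join(["Known entities:", *entity_lines, "Recent turns:", *turn_lines])


memory = TinyCombinedMemory(max_tokens=12)
memory.remember_entity("TechFlow Solutions", "Sarah's employer")
memory.add_turn("user", "I work at TechFlow Solutions as a data analyst")
memory.add_turn("assistant", "Noted, I will keep answers concise")
memory.add_turn("user", "How should I visualize customer retention?")
context = memory.build_context()
print(context)
```

With a 12-token budget the first turn is trimmed from the window, yet the entity fact about TechFlow Solutions still reaches the context. That is the division of labor described above: the window controls cost, the entity store preserves continuity.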

### Skills



As we build more sophisticated agents, we quickly discover that while general-purpose language models are incredibly versatile, they often lack the specialized expertise needed for complex, domain-specific tasks. This is where the concept of "skills" becomes crucial: they're like giving your agent professional training in specific areas.

This is, again, more of a third-layer, enterprise-user-level solution: it optimizes the perceivable layer of the LLM, for example by augmenting it with better prompts.

**What Are Agent Skills?** Think of skills as specialized capabilities that combine prompts, tools, memory patterns, and domain knowledge to excel at specific types of problems. Just like a human expert develops specialized skills over years of practice, we can build focused capabilities that allow our agents to perform at expert levels in particular domains.

<img src="https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2Fddd7e6e572ad0b6a943cacefe957248455f6d522-1650x929.jpg&w=1920&q=75" width=700>


<img src="https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F191bf5dd4b6f8cfe6f1ebafe6243dd1641ed231c-1650x1069.jpg&w=1920&q=75" width=700>


<img src="https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F441b9f6cc0d2337913c1f41b05357f16f51f702e-1650x929.jpg&w=1920&q=75" width=700>

For example:

- A **financial analysis skill** might combine market data tools, statistical calculation capabilities, and specialized prompts for interpreting economic indicators
- A **creative writing skill** could integrate research tools, style guidelines, and iterative refinement processes
- A **technical debugging skill** might include code analysis tools, documentation search, and systematic troubleshooting approaches

Skills offer several benefits:

- **Specialization**: Agents can develop deep expertise in specific areas rather than being mediocre generalists
- **Consistency**: Similar problems are approached with proven, refined techniques that improve over time
- **Reusability**: Successful skill patterns can be applied across different contexts and even shared between agents
- **Composability**: Multiple skills can collaborate in complex workflows to solve multifaceted problems

Skills also introduce challenges you need to be aware of:
- **Over-specialization** where agents become inflexible outside their trained domains
- **Complexity** that makes systems harder to debug and maintain
- **Coordination overhead** when multiple skills need to work together effectively

Skills mainly aim to solve the context problem of LLMs: you can only fit so much information into a prompt. Instead of loading everything upfront, you tell the LLM that it can consult, say, a `bicycles` directory to learn more about bicycles. The agent then reads it only when it needs that information, rather than carrying it from the start.
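
A minimal sketch of this load-on-demand idea follows. It assumes a hypothetical layout in which each skill lives in its own directory with a `SKILL.md` file whose first line is a one-sentence summary. The helper names `build_skill_index` and `load_skill` are made up for this example.

```python
import tempfile
from pathlib import Path


def build_skill_index(root: Path) -> dict:
    """Collect only the one-line summaries; this is all the prompt sees upfront."""
    index = {}
    for skill_file in sorted(root.glob("*/SKILL.md")):
        lines = skill_file.read_text(encoding="utf-8").splitlines()
        index[skill_file.parent.name] = lines[0] if lines else ""
    return index


def load_skill(root: Path, name: str) -> str:
    """Read the full skill body only when the agent decides it needs it."""
    return (root / name / "SKILL.md").read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical skill directory, mirroring the `bicycles` example in the text.
    (root / "bicycles").mkdir()
    (root / "bicycles" / "SKILL.md").write_text(
        "Answers questions about bicycle maintenance.\n"
        "Step 1: identify the component.\n"
        "Step 2: look up the torque spec.\n",
        encoding="utf-8",
    )

    index = build_skill_index(root)
    system_prompt = "Available skills:\n" + "\n".join(
        f"- {name}: {summary}" for name, summary in index.items()
    )
    full_text = load_skill(root, "bicycles")

print(system_prompt)
```

The system prompt carries one line per skill, while the detailed steps stay on disk until `load_skill` is called, so the context cost grows with the skills actually used rather than the skills available.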

The key is finding the right balance between specialization and flexibility for your specific use case. Let's build a practical skills system to see these concepts in action:

In [None]:
# Minimal skills system: a simple registry of callable skills (didactic)
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

@dataclass
class SkillResult:
    success: bool
    output: str
    confidence: float
    metadata: Optional[Dict[str, Any]] = None

# Simple registry helpers
def register_skill(name: str, func: Callable[[str], SkillResult], description: str = ""):
    if "skills" not in tutorial_state:
        tutorial_state["skills"] = {}
    tutorial_state["skills"][name] = {"func": func, "description": description}

def run_skill(name: str, input_text: str) -> SkillResult:
    entry = tutorial_state.get("skills", {}).get(name)
    if not entry:
        return SkillResult(False, f"Skill '{name}' not found", 0.0)
    try:
        return entry["func"](input_text)
    except Exception as e:
        return SkillResult(False, str(e), 0.0)

# Example skill implementation (very small and deterministic for teaching)
def financial_analysis_skill(input_text: str) -> SkillResult:
    # Tiny illustrative logic: summarize and return a fixed confidence
    summary = f"[financial_analysis] summary for input (len={len(input_text)}): {input_text[:60]}..."
    return SkillResult(True, summary, 0.8)

# Register the example skill
register_skill("financial_analysis", financial_analysis_skill, "Tiny example financial skill")
print("Minimal skill registry ready: 'financial_analysis' registered.")

### Workflows and Chains



Now that we've mastered the building blocks of agentic systems (prompts, tools, memory, and skills), it's time to explore how we orchestrate these components into sophisticated workflows.

**Think of Workflows as Choreography:** I like to think of workflows as the "choreography" of your agentic system. Just like a ballet performance, they define how different components interact, when they execute, and how information flows between them. Without good choreography, even the most talented individual performers can't create something beautiful together.

 Workflows transform simple LLM interactions into powerful, multi-step reasoning systems. Instead of asking an LLM to solve a complex problem in one shot (which often leads to mediocre results), workflows break down tasks into manageable pieces, allowing for specialization, validation, and iterative improvement.

Here's why this matters so much:

**Why Workflows Are Game-Changers:**

1. **Task Decomposition**: Complex problems become manageable when broken into smaller, focused steps. Instead of "write a marketing campaign," you might have "research audience → generate concepts → create copy → review and refine."

**Error Propagation in Chains**: In prompt chaining, if step $i$ has error rate $\varepsilon_i$ and step errors are independent, the cumulative error follows:
$$E_{total} = 1 - \prod_{i=1}^{n}(1-\varepsilon_i)$$

For identical error rates: $E_{total} = 1 - (1-\varepsilon)^n$

**Parallel Processing Speedup**: Theoretical speedup from parallelization follows Amdahl's Law:
$$S = \frac{1}{(1-P) + \frac{P}{N}}$$

Where P is the parallelizable fraction and N is the number of processors.



2. **Specialization**: Different parts of your system can excel at different aspects of the problem. Your research specialist can be different from your creative writer, each optimized for their specific role.

3. **Quality Control**: You can add validation and error checking at each step. If the research step fails, you catch it before moving to content generation.

4. **Scalability**: Parallel execution and efficient resource utilization mean you can handle more complex tasks without proportional increases in time.

5. **Maintainability**: It's easier to debug, test, and improve individual components rather than trying to fix one monolithic prompt.
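
To make the error-propagation formula above concrete, here is a small numeric check. The 5% per-step error rate is a made-up figure, and the function assumes step failures are independent, as the formula does.

```python
import math


def cumulative_error(step_error_rates):
    """Probability that at least one step in the chain fails,
    assuming step failures are independent."""
    success = math.prod(1.0 - e for e in step_error_rates)
    return 1.0 - success


three_steps = cumulative_error([0.05, 0.05, 0.05])
ten_steps = cumulative_error([0.05] * 10)
print(f"3 steps: {three_steps:.3f}, 10 steps: {ten_steps:.3f}")
# 3 steps: 0.143, 10 steps: 0.401
```

Even a modest per-step error rate compounds quickly, which is why the quality-control gates described in the list above matter more as chains get longer.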

**Understanding the Spectrum:** Workflows exist on a spectrum from simple sequential chains to fully autonomous agents:

```
Simple → Sequential → Parallel  → Dynamic       → Autonomous
Chain    Routing      Execution   Orchestration   Agents
```

Each level adds complexity but also capability. The key is choosing the right level for your specific use case: sometimes a simple chain is perfect, other times you need full autonomy.

**Consensus Accuracy**: For majority voting among n independent voters (n odd), each with individual accuracy p, ensemble accuracy follows:
$$P_{ensemble} = \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k} p^k (1-p)^{n-k}$$

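**Iterative Improvement**: Quality improvement in evaluator-optimizer workflows can be modeled as:
$$Q_n = Q_0 \cdot \left(1 + \alpha \left(1 - \beta^n\right)\right)$$

Where α is the improvement factor (the maximum relative gain) and β, with 0 < β < 1, is the diminishing returns coefficient: each iteration closes a smaller share of the remaining gap, so $Q_n$ rises from $Q_0$ toward $Q_0(1+\alpha)$.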

#### 1. Prompt Chain

<img src="https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F7418719e3dab222dccb379b8879e1dc08ad34c78-2401x1000.png&w=3840&q=75" width=700>

In [None]:
# Building Our First Workflow: Prompt Chaining System
# We'll use our existing LLM to create a sequential workflow

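from langchain_core.runnables import RunnableLambda
from langchain_core.prompts import PromptTemplate  # used in execute_step
from langchain_core.output_parsers import StrOutputParser  # used in execute_step
from langchain.chains import SequentialChain
import time

print("🔗 PROMPT CHAINING WORKFLOW")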
print("=" * 60)

# Check what LLM instances we have available
memory_llm = tutorial_state.get("memory_llm", llm)

class PromptChain:
    """
    Sequential workflow system using our existing LLM

    This is like a factory assembly line where each step:
    - Takes output from the previous step
    - Performs focused transformation
    - Passes result to next step

    Key insight: We reuse the same LLM throughout the chain,
    just with different prompts for each step!
    """

    def __init__(self, llm_instance):
        self.llm = llm_instance
        self.steps_executed = 0
        print(f"🏗️ Prompt Chain initialized")
        print(f"   Using LLM: {self.llm.model}")
        print(f"   Temperature: {self.llm.temperature}")

    def create_step(self, name: str, instruction: str, gate_check=None):
        """
        Define a step in our chain

        Args:
            name: Step identifier for tracking
            instruction: What this step should do
            gate_check: Optional validation function
        """
        return {
            "name": name,
            "instruction": instruction,
            "gate_check": gate_check
        }

    def execute_step(self, step, input_text):
        """
        Execute a single step using our LLM
        """
        print(f"🔄 Executing: {step['name']}")
        start_time = time.time()

        # Quality gate check
        if step.get('gate_check') and not step['gate_check'](input_text):
            print(f"❌ Gate check failed for {step['name']}")
            return None

        # Create prompt for this step
        prompt = PromptTemplate(
            input_variables=["input", "instruction"],
            template="""Task: {instruction}

Input: {input}

Provide a clear, focused response that can be used as input for the next step in the workflow.
Be thorough but concise - the next step depends on your output quality."""
        )

        # Execute using our LLM
        chain = prompt | self.llm | StrOutputParser()
        result = chain.invoke({
            "input": input_text,
            "instruction": step["instruction"]
        })

        execution_time = time.time() - start_time
        self.steps_executed += 1

        print(f"✅ Completed in {execution_time:.2f}s")
        print(f"   Output: {len(result)} characters")

        return result

# Create or reuse prompt chain
if 'prompt_chain' not in tutorial_state:
    prompt_chain = PromptChain(memory_llm)
    tutorial_state['prompt_chain'] = prompt_chain
    print("\n✅ New Prompt Chain created")
else:
    prompt_chain = tutorial_state['prompt_chain']
    print(f"\n✅ Reusing existing Prompt Chain")
    print(f"   Steps executed so far: {prompt_chain.steps_executed}")

print("\n💡 This chain will reuse our memory_llm for all steps")


In [None]:
# Our prompt chain is already initialized in the previous cell
# Let's verify it's ready and show what we have

print("🔍 Verifying Workflow System Status")
print("=" * 60)

prompt_chain = tutorial_state.get('prompt_chain')

if prompt_chain:
    print("✅ Prompt Chain System Ready")
    print(f"   LLM Model: {prompt_chain.llm.model}")
    print(f"   Temperature: {prompt_chain.llm.temperature}")
    print(f"   Steps executed: {prompt_chain.steps_executed}")
    print(f"   Status: Ready for sequential workflows")
else:
    print("⚠️ Prompt chain not found, please run previous cell")

print("\n💡 Ready to build and execute sequential workflows!")


Now let's build and test our first chain. We'll create a practical workflow for marketing copy that demonstrates all the key concepts.


In [None]:
# Let's build and execute a marketing workflow using our existing prompt chain
# Notice how we reuse the chain we created earlier

print("📝 BUILDING MARKETING WORKFLOW")
print("=" * 60)

# Get our prompt chain from tutorial_state
prompt_chain = tutorial_state.get('prompt_chain')

if not prompt_chain:
    print("⚠️ Prompt chain not initialized. Please run previous cells.")
else:
    print(f"✅ Using existing Prompt Chain (executed {prompt_chain.steps_executed} steps so far)")

    # Define our workflow steps
    print("\n🔧 Defining workflow steps...")

    # Step 1: Content Creation
    content_step = prompt_chain.create_step(
        "content_creation",
        "Create compelling marketing copy for a new AI productivity tool. Focus on benefits for busy professionals and include a strong call-to-action."
    )
    print("   1. Content Creation ○")

    # Step 2: Quality Review with Gate Check
    quality_step = prompt_chain.create_step(
        "quality_review",
        "Review this marketing copy for clarity, persuasiveness, and professional tone. Improve grammar and strengthen the value proposition.",
        gate_check=lambda x: len(x) > 50 and len(x.split()) > 10
    )
    print("   2. Quality Review ✓ (with gate check)")

    # Step 3: Translation
    translation_step = prompt_chain.create_step(
        "translation",
        "Translate this marketing copy to Spanish while maintaining tone and persuasiveness."
    )
    print("   3. Translation ○")

    # Execute the workflow
    steps = [content_step, quality_step, translation_step]
    print(f"\nEXECUTING WORKFLOW ({len(steps)} steps)")
    print("=" * 60)

    current_input = "AI productivity tool for busy professionals"
    results = []

    for i, step in enumerate(steps, 1):
        print(f"\n--- Step {i}: {step['name']} ---")
        print(f"Input: {current_input[:60]}...")

        # Execute using our reusable prompt chain
        result = prompt_chain.execute_step(step, current_input)

        if result is None:
            print("❌ Chain terminated due to step failure")
            break

        results.append({
            "step_number": i,
            "step_name": step['name'],
            "input_length": len(current_input),
            "output_length": len(result),
            "output_preview": result[:100] + "..."
        })

        # Output becomes next input
        current_input = result

    # Store results
    tutorial_state['chain_results'] = results
    tutorial_state['latest_workflow_output'] = current_input

    print(f"\n✅ Workflow completed: {len(results)} steps executed")
    print(f"📊 Total steps by this chain: {prompt_chain.steps_executed}")


In [None]:
# Analyze our workflow execution results
print("📊 WORKFLOW EXECUTION ANALYSIS")
print("=" * 60)

# Get results from tutorial_state
results = tutorial_state.get('chain_results', [])
prompt_chain = tutorial_state.get('prompt_chain')

if results:
    print(f"\n✅ Successfully completed {len(results)} steps")

    print("\n📈 Content Evolution:")
    for result in results:
        print(f"  Step {result['step_number']} - {result['step_name']}:")
        print(f"    Input → Output: {result['input_length']} → {result['output_length']} chars")
        print(f"    Preview: {result['output_preview']}")

    # Show final output
    final_output = tutorial_state.get('latest_workflow_output')
    if final_output:
        print(f"\n📝 Final Output Preview:")
        print(f"   {final_output[:150]}...")

    # Show chain stats
    if prompt_chain:
        print(f"\n📊 Chain Statistics:")
        print(f"   Total steps executed by this chain: {prompt_chain.steps_executed}")
        print(f"   LLM reused throughout: {prompt_chain.llm.model}")

    print("\n💡 Key Insight: One LLM, multiple transformations!")
    print("   We didn't create new LLM instances for each step")
    print("   We reused the same one with different prompts")

else:
    print("⚠️ No workflow results found. Please run previous cell.")

print("\n✅ Results stored in tutorial_state for further analysis")


#### 2. Routing Workflows - Intelligent Task Distribution



<img src="https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F5c0c0e9fe4def0b584c04d37849941da55e5e71c-2401x1000.png&w=3840&q=75" width=700>

Now let's explore routing workflows, which intelligently classify inputs and direct them to specialized handlers. Think of it as a smart switchboard that sends different types of requests to the most appropriate specialist.

**The Problem Routing Solves:**

Imagine building a customer service system. You could create one massive prompt that tries to handle all types of inquiries, but this leads to:
- Generic responses that aren't specialized enough
- Conflicting optimization (improving billing support might hurt technical support)
- Difficulty in maintaining and improving specific areas

**How Routing Works:**

1. **Classification**: Analyze the input to determine its type/category
2. **Route Selection**: Choose the appropriate specialized handler
3. **Execution**: Process using the selected specialist
4. **Response**: Return the specialized result

**Mathematical Insight:**

Routing leverages the principle of **specialization gains**. If a general system has accuracy $A_{general}$ and the specialist for category $i$ has accuracy $A_{specialist_i}$, routing achieves:

$$Accuracy_{routed} = \sum_{i} P(category_i) \times A_{specialist_i}$$

Where $P(category_i)$ is the probability that an input belongs to category $i$ and is correctly classified into it. Routing pays off when $Accuracy_{routed}$ exceeds $A_{general}$.
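
A quick numeric reading of that sum, using made-up probabilities and accuracies purely to show the arithmetic. The three categories and the 0.80 general-system accuracy are assumptions, not measurements.

```python
def routed_accuracy(routes):
    """routes: (probability an input reaches this specialist correctly,
    accuracy of that specialist) pairs."""
    return sum(p * acc for p, acc in routes)


# Hypothetical numbers, chosen only to illustrate the calculation.
routes = [
    (0.50, 0.90),  # technical support
    (0.30, 0.95),  # billing support
    (0.15, 0.80),  # general inquiries
]
# The remaining 0.05 of traffic is assumed to be misrouted and counted as wrong.
general_accuracy = 0.80

routed = routed_accuracy(routes)
print(f"routed: {routed:.3f} vs general: {general_accuracy:.3f}")
```

Because misrouted traffic contributes nothing to the sum, a weak classifier can erase the gains from strong specialists, so classification quality is worth measuring on its own.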

**Key Benefits:**
- **Specialization**: Each route can be optimized for specific input types
- **Maintainability**: Update one route without affecting others
- **Performance**: Use different models/strategies per route (fast vs. accurate)
- **Cost Optimization**: Route simple queries to cheaper models

In [None]:
# Building an Intelligent Routing System

class IntelligentRouter:
    """
    An intelligent routing system that acts like a smart receptionist.

    We reuse the existing LLM and extend our existing prompt patterns
    instead of creating everything from scratch.

    This approach shows:
    - How to build upon existing components
    - Maintaining consistency across the codebase
    - Reducing memory usage and initialization time
    - Making the tutorial flow more logical and connected
    """

    def __init__(self, llm_instance=None):
        self.llm = llm_instance or llm  # Falls back to global llm
        self.routes = {}
        print("🎯 Initializing intelligent routing system using existing LLM...")

        # Notice how we're extending the structure we already established
        self.router_prompt = PromptTemplate(
            input_variables=["input_text", "available_routes"],
            template="""You are an intelligent classification system. Your job is to analyze the input and determine which specialist should handle it.

Input to classify: {input_text}

Available specialists:
{available_routes}

CRITICAL: Respond with ONLY the route name that best matches the input type.
No explanation, no extra text - just the exact route name.
If unsure, choose the most general route available."""
        )

        tutorial_state["routers"] = tutorial_state.get("routers", {})
        tutorial_state["routers"]["main_router"] = self

        print("🔄 Router initialized and stored in tutorial_state")

    def register_route(self, name, description, template=None, confidence=0.8):
        """
        Register a new specialist route.

        """

        if template is None:
            # Check if we have a suitable existing template
            existing_templates = tutorial_state.get("prompt_templates", {})
            if "basic" in existing_templates:
                print(f"🔄 Reusing existing basic template for route '{name}'")
                template = existing_templates["basic"]
            else:
                # Fallback: create a simple template
                template = PromptTemplate(
                    input_variables=["input"],
                    template="Handle this request: {input}"
                )

        self.routes[name] = {
            "description": description,
            "template": template,
            "confidence": confidence,
            "usage_count": 0  # Track how often this route is used
        }


    def route(self, input_text: str):
        """
        Route input to the appropriate specialist

        """
        if not self.routes:
            return "No routes registered. Please register routes first."

        # Build available routes description for the classifier
        routes_desc = "\n".join([
            f"- {name}: {route['description']}"
            for name, route in self.routes.items()
        ])

        router_chain = self.router_prompt | self.llm | StrOutputParser()

        try:
            # Get the route decision
            chosen_route = router_chain.invoke({
                "input_text": input_text,
                "available_routes": routes_desc
            }).strip()

            # Validate the route exists
            if chosen_route in self.routes:
                # Update usage stats
                self.routes[chosen_route]["usage_count"] += 1
                return chosen_route
            else:
                # Fallback to first available route
                fallback_route = list(self.routes.keys())[0]
                print(f"⚠️ Route '{chosen_route}' not found, using fallback: {fallback_route}")
                return fallback_route

        except Exception as e:
            print(f"❌ Routing error: {e}")
            return list(self.routes.keys())[0] if self.routes else None

print("🚀 Creating Intelligent Router using existing components...")
print("=" * 60)



In [None]:
# Use our global LLM instead of creating a new one
intelligent_router = IntelligentRouter(llm_instance=llm)

# Register some routes reusing our existing templates
print("\n📝 Registering routes with existing templates...")

intelligent_router.register_route(
    name="general_chat",
    description="General conversation and questions",
    template=tutorial_state["prompt_templates"]["chat"],
    confidence=0.7
)

intelligent_router.register_route(
    name="explanation",
    description="Detailed explanations of concepts and topics",
    template=tutorial_state["prompt_templates"]["basic"],
    confidence=0.9
)

# Register a specialized route (will create new template only if needed)
intelligent_router.register_route(
    name="technical_analysis",
    description="Technical analysis and code-related questions",
    confidence=0.8
)

print("\n✅ ROUTING SYSTEM READY")
print("📦 Router stored in tutorial_state for future use")
print(f"🎯 {len(intelligent_router.routes)} routes registered")

In [None]:
# Initialize our routing system using the global llm
# This router will intelligently direct queries to specialists

print("🎯 Initializing Intelligent Router")
print("=" * 60)

# Check if router already exists
if 'router' not in tutorial_state:
    # Create router using our global llm
    router = IntelligentRouter(llm_instance=llm)
    tutorial_state['router'] = router
    print("✅ New router created using global llm")
else:
    router = tutorial_state['router']
    print("✅ Using existing router from tutorial_state")

print(f"🔧 Router uses: {router.llm.model}")
print(f"🌡️  Temperature: {router.llm.temperature}")
print("💡 Ready to register specialist routes")


In [None]:
# Minimal router for teaching (keyword-based, synchronous)
class SimpleRouter:
    def __init__(self, llm_instance=None):
        # use provided llm or global one
        self.llm = llm_instance or globals().get('llm')
        self.routes = {}

    def register_route(self, name, description, template, confidence=0.5):
        self.routes[name] = {"description": description, "template": template, "confidence": confidence, "usage_count": 0}

    def route(self, text: str):
        tl = text.lower()
        if any(k in tl for k in ("error", "crash", "bug", "failed")):
            return "technical_support"
        if any(k in tl for k in ("charge", "refund", "billing", "invoice")):
            return "billing_support"
        return "general_inquiry"

# Instantiate a simple router and register three concise specialist routes
router = SimpleRouter(llm_instance=llm)
router.register_route("technical_support", "Troubleshoot technical issues", "Technical troubleshooting template", confidence=0.9)
router.register_route("billing_support", "Handle billing questions", "Billing support template", confidence=0.85)
router.register_route("general_inquiry", "General customer questions", "General response template", confidence=0.75)

# expose router to tutorial_state for reuse
tutorial_state["router"] = router
print("Minimal router configured with 3 routes (technical_support, billing_support, general_inquiry).")

In [None]:
# Display our registered team (simple view)
print(f"\n✅ SPECIALIST TEAM ASSEMBLED")
print(f"Total specialists registered: {len(tutorial_state['router'].routes)}")

print(f"\n📊 TEAM ROSTER:")
for route_name, route_info in tutorial_state['router'].routes.items():
    print(f"   🎯 {route_name}")
    print(f"      Confidence: {route_info['confidence']}")
    print(f"      Usage: {route_info['usage_count']}")
    print(f"      Specialty: {route_info['description']}")
    print()

print("🚀 Ready to start routing customer inquiries!")

In [None]:
# Minimal routing processor: uses the simple router and returns simulated responses

def route_and_process(input_text: str):
    router = tutorial_state.get('router')
    if not router or not router.routes:
        return {"route": "unhandled", "result": "Router not initialized", "confidence": 0.0}

    selected_route = router.route(input_text)
    if selected_route not in router.routes:
        # The router picked a route that was never registered
        return {"route": "unhandled", "result": f"No handler for '{selected_route}'", "confidence": 0.0}
    router.routes[selected_route]["usage_count"] += 1

    # For teaching, avoid a full LLM call here; produce a clear simulated response
    template = router.routes[selected_route]["template"]
    simulated_response = f"(simulated) {selected_route} handled the query. Template used: {template}"

    return {
        "route": selected_route,
        "result": simulated_response,
        "confidence": router.routes[selected_route]["confidence"],
        "specialist_usage": router.routes[selected_route]["usage_count"]
    }

print("✅ Routing function ready (simulated responses for teaching).")

In [None]:
# Test Suite: Real Customer Inquiries
# Let's test our routing system with realistic customer service scenarios
print(f"\n🧪 COMPREHENSIVE ROUTING TEST SUITE")
print("=" * 60)

# These are real-world examples that show different types of customer inquiries
test_scenarios = [
    {
        "scenario": "Technical Issue",
        "query": "My app keeps crashing every time I try to export a file. I get error code 500 and then it just closes. This happens on both Windows and Mac versions."
    },
    {
        "scenario": "Billing Problem",
        "query": "I was charged twice for my subscription this month and I need a refund for the duplicate charge. My card ending in 1234 shows two charges on October 15th."
    },
    {
        "scenario": "Product Question",
        "query": "What's the difference between your premium and enterprise plans? I'm trying to decide which one would be best for a team of 15 people."
    },
    {
        "scenario": "Mixed Technical/Billing",
        "query": "I upgraded to premium but I'm still seeing ads and getting limited features. Did my payment go through? How can I check my account status?"
    }
]


In [None]:
# Run the simplified routing tests and capture results
routing_results = []

for i, scenario in enumerate(test_scenarios, 1):
    print(f"\n--- TEST {i}: {scenario['scenario']} ---")
    result = route_and_process(scenario['query'])

    result['test_scenario'] = scenario['scenario']
    result['original_query'] = scenario['query']
    routing_results.append(result)

    print(f"🎯 Route: {result['route']}")
    print(f"📊 Confidence: {result['confidence']}")
    print(f"📝 Response preview: {result['result'][:120]}...")
    print(f"📈 Specialist usage count: {result.get('specialist_usage')}")


In [None]:
# System Performance Analysis (simple metrics)
print(f"\n📊 ROUTING SYSTEM PERFORMANCE ANALYSIS")
print("=" * 60)

route_distribution = {}
for result in routing_results:
    route = result['route']
    route_distribution[route] = route_distribution.get(route, 0) + 1

print(f"📈 ROUTING DISTRIBUTION:")
for route_name, count in route_distribution.items():
    percentage = (count / len(routing_results)) * 100
    print(f"   {route_name}: {count} queries ({percentage:.1f}%)")

avg_confidence = sum(r['confidence'] for r in routing_results) / len(routing_results)
print(f"\n🎯 SYSTEM METRICS:")
print(f"   Average confidence: {avg_confidence:.2f}")
print(f"   Successful routes: {len([r for r in routing_results if r['route'] != 'unhandled'])}/{len(routing_results)}")


#### 3. Parallelization Workflows - Speed and Consensus



<img src="https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F406bb032ca007fd1624f261af717d70e6ca86286-2401x1000.png&w=3840&q=75" width=700>

Parallelization is where things get interesting. Instead of processing sequentially, we can execute multiple tasks simultaneously, either to **divide the work** (sectioning) or to get **multiple perspectives** (voting). This is crucial for production systems where speed and accuracy both matter.

**Two Flavors of Parallelization:**

1. **Sectioning**: Break a large task into independent parts that can run simultaneously
   - Example: Analyzing a document from financial, legal, and technical perspectives
   - Benefit: Speed (total time = max individual time, not sum)

2. **Voting**: Run the same task multiple times to reach consensus  
   - Example: Multiple models evaluating content safety
   - Benefit: Accuracy through ensemble effects
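
To make the "max, not sum" claim for sectioning concrete, here is a minimal sketch that replaces the LLM calls with `time.sleep`. The `fake_section` helper and its delays are illustrative assumptions, not part of the tutorial's pipeline.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_section(name, delay):
    """Hypothetical stand-in for an LLM call: sleeps, then returns a label."""
    time.sleep(delay)
    return f"{name} done"

# Illustrative delays in seconds (assumed, not measured)
sections = [("financial", 0.1), ("legal", 0.2), ("technical", 0.3)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(sections)) as executor:
    # map() returns results in submission order
    outputs = list(executor.map(lambda s: fake_section(*s), sections))
wall_time = time.perf_counter() - start

sequential_estimate = sum(delay for _, delay in sections)
print(outputs)
print(f"wall-clock ~{wall_time:.2f}s vs sequential ~{sequential_estimate:.2f}s")
```

With these delays the wall-clock time should land near the slowest section (about 0.3s) rather than the 0.6s sum, because the threads spend their time waiting, much like threads blocked on network calls to an LLM API.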

**Mathematical Foundation - Amdahl's Law:**

The theoretical speedup from parallelization follows:
$$\text{Speedup} = \frac{1}{(1-P) + \frac{P}{N}}$$

Where:
- P = fraction of work that can be parallelized  
- N = number of parallel processors
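
As a quick worked example of the formula, the sketch below evaluates the speedup for an assumed workload in which 90% of the time is spent in parallelizable LLM calls. The 0.9 figure is an illustration, not a measurement from this tutorial.

```python
def amdahl_speedup(p, n):
    """Theoretical speedup when a fraction p of the work runs on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

# Assumed example: 90% of the pipeline is parallelizable
for n in (1, 3, 10, 100):
    print(f"N={n:>3}: {amdahl_speedup(0.9, n):.2f}x")
```

With P = 0.9, three workers give a 2.5x speedup, and no number of workers can exceed 1 / (1 - P) = 10x, since the serial 10% always remains.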

**Voting Accuracy (Condorcet's Jury Theorem):**

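If $n$ independent classifiers each have accuracy $p > 0.5$, the probability that a strict majority of them is correct is:
$$P_{ensemble} = \sum_{k=\lfloor n/2 \rfloor + 1}^{n} \binom{n}{k} p^k (1-p)^{n-k}$$

This means ensemble accuracy increases with more voters, provided each voter is better than chance ($p > 0.5$) and their errors are independent. Votes drawn from the same LLM with different prompts are correlated, so in practice treat the formula as an optimistic estimate rather than a guarantee.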
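
The sum can be evaluated directly. The sketch below assumes independent voters and an individual accuracy of 0.7 (an illustrative value), and uses odd `n` so that a strict majority always exists; `n // 2 + 1` equals $\lceil n/2 \rceil$ for odd `n`.

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a strict majority of n independent voters is correct."""
    k_min = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

for n in (1, 3, 5, 11):
    print(f"n={n:>2}: {majority_accuracy(0.7, n):.3f}")
```

For p = 0.7 this gives 0.784 with three voters and about 0.837 with five, which is the ensemble effect the theorem describes.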

**When to Use Parallelization:**
- **Sectioning**: When you can identify independent subtasks
- **Voting**: When you need high-confidence decisions
- **Speed Requirements**: When latency is critical
- **Quality Requirements**: When accuracy is paramount

Let's implement both approaches:

In [None]:
# @title
# Parallel Processing with our existing LLM
# We'll reuse our memory_llm for parallel task execution

from concurrent.futures import ThreadPoolExecutor, as_completed
import threading
import time

print("⚡ PARALLEL PROCESSING SYSTEM")
print("=" * 60)

class ParallelProcessor:
    """
    Execute multiple tasks in parallel using our existing LLM

    Key insight: We use the SAME LLM for all parallel tasks,
    but execute them simultaneously in different threads!
    """

    def __init__(self, llm_instance):
        self.llm = llm_instance
        self.tasks_executed = 0
        # Guards tasks_executed, which worker threads update concurrently
        self._counter_lock = threading.Lock()
        print(f"⚡ Parallel Processor initialized")
        print(f"   Using LLM: {self.llm.model}")
        print(f"   Temperature: {self.llm.temperature}")

    def create_section_task(self, name, focus_area, analysis_prompt):
        """Define a parallel task section"""
        return {
            "name": name,
            "focus": focus_area,
            "prompt_template": analysis_prompt
        }

    def execute_section(self, task, input_data):
        """Execute one section using our LLM"""
        print(f"🔄 Processing: {task['name']}")
        start_time = time.time()

        # Create prompt for this section
        prompt = PromptTemplate(
            input_variables=["data", "focus"],
            template=task["prompt_template"]
        )

        # Execute using our shared LLM
        chain = prompt | self.llm | StrOutputParser()
        result = chain.invoke({
            "data": input_data,
            "focus": task["focus"]
        })

        execution_time = time.time() - start_time
        # += is not atomic across threads, so update the counter under the lock
        with self._counter_lock:
            self.tasks_executed += 1

        print(f"✅ '{task['name']}' done in {execution_time:.2f}s")

        return {
            "section": task["name"],
            "focus": task["focus"],
            "result": result,
            "execution_time": execution_time
        }

# Get our memory_llm for consistent parallel processing
memory_llm = tutorial_state.get("memory_llm", llm)

# Create or reuse parallel processor
if 'parallel_processor' not in tutorial_state:
    parallel_processor = ParallelProcessor(memory_llm)
    tutorial_state['parallel_processor'] = parallel_processor
    print("\n✅ New Parallel Processor created")
else:
    parallel_processor = tutorial_state['parallel_processor']
    print(f"\n✅ Reusing existing Parallel Processor")
    print(f"   Tasks executed so far: {parallel_processor.tasks_executed}")

print("\n💡 All parallel tasks will use the SAME LLM instance")
print("   Parallelization happens at the execution level, not LLM level")


In [None]:
# @title
# Initialize our parallel processor using the existing memory_llm
print("⚡ Initializing Parallel Processor")
print("=" * 60)

# Get our memory_llm (which we use for consistent processing)
memory_llm = tutorial_state.get("memory_llm")

if not memory_llm:
    print("⚠️ Memory LLM not found, using global llm")
    memory_llm = llm

# Check if parallel processor exists
if 'parallel_processor' not in tutorial_state:
    parallel_processor = ParallelProcessor(memory_llm)
    tutorial_state['parallel_processor'] = parallel_processor
    print("✅ New parallel processor created")
else:
    parallel_processor = tutorial_state['parallel_processor']
    print("✅ Using existing parallel processor")

print(f"🔧 Processor uses same LLM as memory operations")
print("💡 Ready for parallel task execution")


In [None]:
# @title
# Define parallel analysis tasks using our existing processor
print("🏗️ BUILDING PARALLEL BUSINESS ANALYSIS")
print("=" * 60)

# Get our parallel processor
parallel_processor = tutorial_state.get('parallel_processor')

if not parallel_processor:
    print("⚠️ Parallel processor not initialized")
else:
    print(f"✅ Using existing Parallel Processor")
    print(f"   Tasks executed: {parallel_processor.tasks_executed}")

    # Define parallel sections for business analysis
    print("\n📊 Defining analysis sections...")

    section_tasks = [
        parallel_processor.create_section_task(
            name="Financial Analysis",
            focus_area="financial metrics and projections",
            analysis_prompt="""Analyze this business data from a {focus} perspective:

{data}

Focus specifically on financial health, revenue trends, profitability, and financial risks.
Provide key metrics, insights, and recommendations."""
        ),

        parallel_processor.create_section_task(
            name="Market Analysis",
            focus_area="market position and competitive landscape",
            analysis_prompt="""Analyze this business data from a {focus} perspective:

{data}

Focus on market opportunity, competitive advantages, market risks, and positioning.
Provide market insights and strategic recommendations."""
        ),

        parallel_processor.create_section_task(
            name="Operational Analysis",
            focus_area="operational efficiency and scalability",
            analysis_prompt="""Analyze this business data from an {focus} perspective:

{data}

Focus on operational strengths, efficiency metrics, scalability factors, and operational risks.
Provide operational insights and improvement recommendations."""
        )
    ]

    print(f"   Created {len(section_tasks)} parallel analysis tasks")
    for task in section_tasks:
        print(f"   • {task['name']}")

    # Store tasks for execution in next cell
    tutorial_state['parallel_tasks'] = section_tasks

    print("\n💡 Each task uses the SAME LLM but runs in parallel")


In [None]:
# @title
# Execute parallel business analysis using our existing processor
print("🚀 EXECUTING PARALLEL ANALYSIS")
print("=" * 60)

# Get our processor and tasks
parallel_processor = tutorial_state.get('parallel_processor')
section_tasks = tutorial_state.get('parallel_tasks', [])

if not parallel_processor or not section_tasks:
    print("⚠️ Prerequisites not ready. Please run previous cells.")
else:
    # Business data to analyze
    business_data = """
TechStartup Inc. Q3 2024 Summary:
- Revenue: $2.5M (up 150% YoY)
- Monthly Active Users: 50,000 (up 200% YoY)
- Customer Acquisition Cost: $45
- Monthly Churn Rate: 3.2%
- Burn Rate: $300K/month
- Cash Runway: 18 months
- Team Size: 25 employees
- Market Size: $10B TAM
- Top 3 competitors: BigCorp, StartupX, TechGiant
- Key Features: AI automation, real-time collaboration, mobile-first
"""

    print(f"📊 Analyzing business data with {len(section_tasks)} parallel sections")

    # Execute all sections in parallel
    def run_parallel_analysis(tasks, data):
        """Run sections in parallel using ThreadPoolExecutor"""
        start_time = time.time()

        with ThreadPoolExecutor(max_workers=len(tasks)) as executor:
            # Submit all tasks
            future_to_task = {
                executor.submit(parallel_processor.execute_section, task, data): task
                for task in tasks
            }

            # Collect results
            results = []
            for future in as_completed(future_to_task):
                result = future.result()
                results.append(result)

        total_time = time.time() - start_time

        return results, total_time

    # Run the parallel analysis
    results, total_time = run_parallel_analysis(section_tasks, business_data)

    # Calculate speedup
    sequential_time = sum(r["execution_time"] for r in results)
    speedup = sequential_time / total_time

    print(f"\n⚡ PARALLEL EXECUTION COMPLETE")
    print(f"   Wall-clock time: {total_time:.2f}s")
    print(f"   Estimated sequential time (sum of task times): {sequential_time:.2f}s")
    print(f"   Speedup achieved: {speedup:.1f}x")

    print(f"\n📋 Sections completed:")
    for result in results:
        print(f"  • {result['section']}: {result['execution_time']:.2f}s")

    # Store results
    tutorial_state['parallel_results'] = results
    tutorial_state['parallel_speedup'] = speedup

    print(f"\n💡 Same LLM, parallel execution = {speedup:.1f}x faster!")
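
One detail of the cell above: `as_completed` yields futures in completion order, so `results` may not match the order of `section_tasks`. If the original order matters, one option is to remember each future's index. The sketch below uses a hypothetical `slow_echo` worker in place of `execute_section`, with made-up delays.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_echo(name, delay):
    """Hypothetical worker standing in for execute_section."""
    time.sleep(delay)
    return {"section": name}

# Assumed delays chosen so tasks finish out of submission order
tasks = [
    ("Financial Analysis", 0.3),
    ("Market Analysis", 0.1),
    ("Operational Analysis", 0.2),
]

with ThreadPoolExecutor(max_workers=len(tasks)) as executor:
    # Remember each future's position in the original task list
    future_to_index = {
        executor.submit(slow_echo, name, delay): i
        for i, (name, delay) in enumerate(tasks)
    }
    ordered = [None] * len(tasks)
    for future in as_completed(future_to_index):
        ordered[future_to_index[future]] = future.result()

print([r["section"] for r in ordered])
```

The printed list follows the submission order even though the second task finishes first.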


In [None]:
# @title
# Show parallel analysis results
print("📊 PARALLEL ANALYSIS RESULTS")
print("=" * 60)

results = tutorial_state.get('parallel_results', [])
speedup = tutorial_state.get('parallel_speedup', 0)

if results:
    print(f"\n✅ Completed {len(results)} parallel analyses")
    print(f"⚡ Speedup: {speedup:.1f}x faster than sequential")

    print(f"\nResults by section:")
    for result in results:
        print(f"\n  {result['section']}:")
        print(f"    Time: {result['execution_time']:.2f}s")
        print(f"    Focus: {result['focus']}")
        print(f"    Output: {len(result['result'])} characters")

    print(f"\n💡 Key Insight:")
    print(f"   We used ONE LLM instance for all {len(results)} tasks")
    print(f"   Parallel execution happened at the thread level")
    print(f"   This is more efficient than creating multiple LLM instances!")
else:
    print("⚠️ No results found. Please run previous cell.")


In [None]:
# @title
# Step 3: Parallel Voting - Consensus Through Multiple Perspectives

class VotingSystem:
    """Implement parallel voting for consensus decisions"""

    def __init__(self, llm_instance):
        """
        Initialize voting system with existing LLM instance

        TUTORIAL NOTE: We receive the LLM instance instead of creating a new one.
        This follows our principle of reusing components for efficiency.
        """
        self.llm = llm_instance
        self.votes_cast = 0  # Track voting activity

    def create_vote_prompt(self, base_instruction, perspective_twist=""):
        """Create a voting prompt with slight variation for diversity"""
        return f"""
{base_instruction}

{perspective_twist}

Analyze carefully and provide your assessment. End your response with a clear decision:
DECISION: [YES/NO/UNCERTAIN]
CONFIDENCE: [1-10]
"""

    def cast_vote(self, vote_id, content, instruction, perspective=""):
        """Cast a single vote in the voting process"""
        prompt_text = self.create_vote_prompt(instruction, perspective)

        prompt = PromptTemplate(
            input_variables=["content"],
            template=prompt_text + "\n\nContent to evaluate: {content}"
        )

        chain = prompt | self.llm | StrOutputParser()
        response = chain.invoke({"content": content})

        self.votes_cast += 1  # Track each vote

        # Extract decision (simplified parsing)
        decision = "UNCERTAIN"
        confidence = 5

        if "DECISION: YES" in response:
            decision = "YES"
        elif "DECISION: NO" in response:
            decision = "NO"

        # Try to extract confidence
        if "CONFIDENCE:" in response:
            try:
                conf_line = [line for line in response.split('\n') if 'CONFIDENCE:' in line][0]
                confidence = int(conf_line.split(':')[1].strip().split()[0])
            except (ValueError, IndexError):
                # Keep the default confidence when the value is missing or not an integer
                pass

        return {
            "vote_id": vote_id,
            "decision": decision,
            "confidence": confidence,
            "full_response": response
        }

    def parallel_voting(self, content, base_instruction, num_votes=3):
        """Execute parallel voting with multiple perspectives"""

        # Create diverse perspectives for voting
        perspectives = [
            "Consider this from a conservative, risk-averse viewpoint.",
            "Evaluate this from an optimistic, opportunity-focused angle.",
            "Analyze this from a balanced, neutral perspective."
        ]

        # Ensure we have enough perspectives
        while len(perspectives) < num_votes:
            perspectives.append(f"Provide perspective #{len(perspectives) + 1} evaluation.")

        print(f"🗳️ Conducting parallel voting with {num_votes} voters")
        print(f"   Using existing LLM instance (votes cast so far: {self.votes_cast})")

        # Execute votes in parallel
        with ThreadPoolExecutor(max_workers=num_votes) as executor:
            futures = [
                executor.submit(
                    self.cast_vote,
                    f"voter_{i+1}",
                    content,
                    base_instruction,
                    perspectives[i]
                )
                for i in range(num_votes)
            ]

            votes = [future.result() for future in futures]

        # Calculate consensus
        decisions = [vote["decision"] for vote in votes]
        confidences = [vote["confidence"] for vote in votes]

        yes_votes = decisions.count("YES")
        no_votes = decisions.count("NO")
        uncertain_votes = decisions.count("UNCERTAIN")

        # Determine consensus
        if yes_votes > no_votes and yes_votes > uncertain_votes:
            consensus = "YES"
        elif no_votes > yes_votes and no_votes > uncertain_votes:
            consensus = "NO"
        else:
            consensus = "NO CONSENSUS"

        avg_confidence = sum(confidences) / len(confidences)

        return {
            "votes": votes,
            "consensus": consensus,
            "vote_breakdown": {
                "YES": yes_votes,
                "NO": no_votes,
                "UNCERTAIN": uncertain_votes
            },
            "average_confidence": avg_confidence
        }
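
The string matching in `cast_vote` is deliberately simple. It misses replies such as `DECISION: [YES]` or `CONFIDENCE: 8/10`, which models often produce when they echo the bracketed template. A more tolerant parser might look like the sketch below. `parse_vote` is a hypothetical helper, not part of `VotingSystem`, and the accepted formats are assumptions about what the model may emit.

```python
import re

def parse_vote(response):
    """Hypothetical parser for the DECISION / CONFIDENCE footer of a vote."""
    decision, confidence = "UNCERTAIN", 5  # same defaults as cast_vote

    d = re.search(r"DECISION:\s*\[?\s*(YES|NO|UNCERTAIN)\b", response, re.IGNORECASE)
    if d:
        decision = d.group(1).upper()

    c = re.search(r"CONFIDENCE:\s*\[?\s*(\d+)", response, re.IGNORECASE)
    if c:
        confidence = max(1, min(10, int(c.group(1))))  # clamp to the 1-10 scale

    return decision, confidence

print(parse_vote("Looks fine.\nDECISION: [YES]\nCONFIDENCE: 8/10"))  # ('YES', 8)
print(parse_vote("decision: no\nconfidence: 12"))                    # ('NO', 10)
print(parse_vote("No footer at all"))                                # ('UNCERTAIN', 5)
```

Falling back to `UNCERTAIN` with a middling confidence keeps a malformed reply from counting as a YES or NO vote.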


In [None]:
# Initialize voting system using our existing memory_llm
print("🗳️ Initializing Voting System")
print("=" * 60)

# Get our memory_llm for consistent voting
memory_llm = tutorial_state.get("memory_llm", llm)

# Check if voting system exists - reuse if available
if 'voting_system' not in tutorial_state:
    voting_system = VotingSystem(memory_llm)
    tutorial_state['voting_system'] = voting_system
    print("✅ New voting system created with existing LLM")
    print(f"   LLM instance: memory_llm (temp: 0.2)")
else:
    voting_system = tutorial_state['voting_system']
    print("✅ Reusing existing voting system")
    print(f"   Total votes cast: {voting_system.votes_cast}")

print(f"\n🔧 Configuration:")
print(f"   • Uses same LLM as memory operations")
print(f"   • Enables parallel consensus voting")
print(f"   • Tracks voting activity across sessions")
print("\n💡 Ready for consensus-based decision making")


In [None]:
# Step 4: Voting Demo - Content Safety Assessment

# Get our voting system from tutorial state
voting_system = tutorial_state['voting_system']

# Content to evaluate for safety
test_content = """
Check out this amazing new productivity method that will totally change your life!
It's based on scientific research and has helped thousands of people achieve their goals faster than ever before.
Join our community today and start your transformation journey!
"""

safety_instruction = """
Evaluate whether this content is appropriate for a professional platform.

Consider these factors:
- Is it truthful and not misleading?
- Does it avoid excessive hype or unrealistic claims?
- Is it suitable for a professional audience?
- Does it comply with content guidelines?

Provide detailed reasoning for your assessment.
"""

print("🛡️ CONTENT SAFETY VOTING DEMONSTRATION")
print("=" * 60)
print(f"Evaluating content: '{test_content[:60]}...'")
print(f"Using voting system with {voting_system.votes_cast} votes cast previously\n")

# Conduct the vote using our existing voting system
voting_result = voting_system.parallel_voting(
    content=test_content,
    base_instruction=safety_instruction,
    num_votes=5
)

# Store results for later reference
tutorial_state['voting_results'] = voting_result

# Display results
print(f"\n📊 VOTING RESULTS:")
print(f"Consensus: {voting_result['consensus']}")
print(f"Average Confidence: {voting_result['average_confidence']:.1f}/10")
print(f"Vote Breakdown:")
for decision, count in voting_result['vote_breakdown'].items():
    print(f"  {decision}: {count} votes")
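
`parallel_voting` counts every vote equally and reports the confidences only as an average. As one possible variation (not what the class above does), the sketch below weights each decision by its confidence. `weighted_consensus` and the sample votes are hypothetical, though the vote dictionaries use the same keys that `cast_vote` returns.

```python
from collections import defaultdict

def weighted_consensus(votes):
    """Sum confidence per decision; return the top decision, or NO CONSENSUS on a tie."""
    totals = defaultdict(int)
    for vote in votes:
        totals[vote["decision"]] += vote["confidence"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    if not ranked or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
        return "NO CONSENSUS", dict(totals)
    return ranked[0][0], dict(totals)

# Made-up votes in the same shape cast_vote returns
sample_votes = [
    {"decision": "YES", "confidence": 6},
    {"decision": "YES", "confidence": 5},
    {"decision": "NO", "confidence": 9},
    {"decision": "NO", "confidence": 4},
    {"decision": "UNCERTAIN", "confidence": 5},
]
print(weighted_consensus(sample_votes))
# ('NO', {'YES': 11, 'NO': 13, 'UNCERTAIN': 5})
```

A plain count would call this 2-2 split "NO CONSENSUS", while the weighted tally leans NO. Whether self-reported LLM confidence deserves that much weight is a design choice to validate on your own data.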


In [None]:
# @title
# Show individual votes and summary
voting_result = tutorial_state['voting_results']
voting_system = tutorial_state['voting_system']

print(f"🗳️ INDIVIDUAL VOTES:")
print("=" * 60)
for vote in voting_result['votes']:
    print(f"  {vote['vote_id']}: {vote['decision']} (confidence: {vote['confidence']}/10)")

print(f"\n📊 VOTING SYSTEM STATISTICS:")
print(f"   Total votes cast: {voting_system.votes_cast}")
print(f"   Votes in this session: {len(voting_result['votes'])}")

print(f"\n✅ PARALLELIZATION WORKFLOWS COMPLETE")
print("=" * 60)
print("   ✓ Sectioning: Parallel task decomposition for speed")
print("   ✓ Voting: Consensus-based decision making for accuracy")
print("   ✓ Mathematical foundations: Amdahl's Law & Condorcet's Theorem")
print(f"\n💡 All parallel workflows used the SAME LLM instance")
print(f"   • ParallelProcessor tasks: {tutorial_state['parallel_processor'].tasks_executed}")
print(f"   • VotingSystem votes: {voting_system.votes_cast}")


In [None]:
# @title
# Workflow Summary - Review All Components Built

print("📊 WORKFLOW PATTERNS SUMMARY")
print("=" * 80)
print("\nAll workflow components built and stored in tutorial_state:\n")

# 1. Prompt Chaining
if 'prompt_chain' in tutorial_state:
    chain = tutorial_state['prompt_chain']
    print("✅ 1. PROMPT CHAINING")
    print(f"   • Sequential step-by-step processing")
    print(f"   • Steps executed: {chain.steps_executed}")
    print(f"   • Uses: memory_llm (consistent LLM instance)")

# 2. Routing System
if 'router' in tutorial_state:
    router = tutorial_state['router']
    print("\n✅ 2. INTELLIGENT ROUTING")
    print(f"   • Dynamic query classification and routing")
    print(f"   • Routes registered: {len(router.routes)}")
    print(f"   • Uses: global llm")

# 3. Parallel Processing
if 'parallel_processor' in tutorial_state:
    processor = tutorial_state['parallel_processor']
    print("\n✅ 3. PARALLEL PROCESSING")
    print(f"   • Concurrent task execution")
    print(f"   • Tasks executed: {processor.tasks_executed}")
    print(f"   • Uses: memory_llm (shared across threads)")

# 4. Voting System
if 'voting_system' in tutorial_state:
    voting = tutorial_state['voting_system']
    print("\n✅ 4. CONSENSUS VOTING")
    print(f"   • Parallel voting for consensus decisions")
    print(f"   • Votes cast: {voting.votes_cast}")
    print(f"   • Uses: memory_llm (same as parallel processor)")

print("\n" + "=" * 80)
print("💡 KEY INSIGHT: LLM Instance Reuse")
print("=" * 80)
print("All workflows share TWO LLM instances:")
print("  1. `llm` (temp: 0.3) - Main LLM for general tasks & routing")
print("  2. `memory_llm` (temp: 0.2) - Used for memory, chains, parallel work, voting")
print("\n🎯 Benefits of this architecture:")
print("   • Reduced memory footprint")
print("   • Faster initialization")
print("   • Consistent behavior across workflows")
print("   • More efficient resource utilization")
print("   • Easier to manage and debug")
print("\n✨ This is how production systems should be built!")


In [None]:
# @title
# Workflows Section Complete - Check Tutorial State

print("🎉 WORKFLOWS AND CHAINS SECTION COMPLETE!")
print("=" * 80)

# Show all workflow components in tutorial_state
print("\n📦 Components available in tutorial_state:")
workflow_components = ['prompt_chain', 'router', 'parallel_processor', 'voting_system']
for component in workflow_components:
    if component in tutorial_state:
        print(f"   ✓ {component}")
    else:
        print(f"   ✗ {component} (not found)")

print("\n🔧 LLM Instances:")
print(f"   • llm (global): {type(llm).__name__} (temp: 0.3)")
if 'memory_llm' in tutorial_state:
    print(f"   • memory_llm: {type(tutorial_state['memory_llm']).__name__} (temp: 0.2)")

print("\n📊 Usage Statistics:")
if 'prompt_chain' in tutorial_state:
    print(f"   • Prompt chain steps: {tutorial_state['prompt_chain'].steps_executed}")
if 'parallel_processor' in tutorial_state:
    print(f"   • Parallel tasks: {tutorial_state['parallel_processor'].tasks_executed}")
if 'voting_system' in tutorial_state:
    print(f"   • Votes cast: {tutorial_state['voting_system'].votes_cast}")

print("\n💡 Ready to proceed to Advanced Agent Systems and RAG!")


#### 4. Advanced Workflow Patterns & Agent Systems



<img src="https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F8985fc683fae4780fb34eab1365ab78c7e51bc8e-2401x1000.png&w=3840&q=75" width=700>

In [None]:
# Advanced Agentic Systems - Autonomous Agents and Meta-Workflows

import json
import time
import uuid
from enum import Enum
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

class AgentState(Enum):
    """Agent execution states"""
    IDLE = "idle"
    PLANNING = "planning"
    EXECUTING = "executing"
    EVALUATING = "evaluating"
    BLOCKED = "blocked"
    COMPLETED = "completed"
    FAILED = "failed"


In [None]:

@dataclass
class AgentMemory:
    """Agent working memory and context"""
    task_history: List[Dict] = field(default_factory=list)
    current_context: Dict = field(default_factory=dict)
    learned_patterns: Dict = field(default_factory=dict)
    error_log: List[str] = field(default_factory=list)
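
Each field in `AgentMemory` uses `field(default_factory=...)` so that every agent gets its own fresh list or dict. A bare mutable default such as `error_log: List[str] = []` is rejected by `dataclass` with a `ValueError`. The sketch below uses a hypothetical two-field copy, `DemoMemory`, to show that instances do not share state.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DemoMemory:
    """Hypothetical two-field copy of AgentMemory, for illustration only."""
    task_history: List[Dict] = field(default_factory=list)
    error_log: List[str] = field(default_factory=list)

agent_a, agent_b = DemoMemory(), DemoMemory()
agent_a.error_log.append("Planning error: timeout")

print(agent_a.error_log)  # ['Planning error: timeout']
print(agent_b.error_log)  # []
```

Because the factory runs once per instance, errors logged by one agent never leak into another agent's memory.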


In [None]:
# @title

class AdvancedAgentSystem:
    """
    Autonomous agent system implementing Anthropic's agent patterns
    Features: Dynamic planning, error recovery, learning, human-in-the-loop
    """

    def __init__(self, llm, max_iterations: int = 10):
        self.llm = llm
        self.max_iterations = max_iterations
        self.state = AgentState.IDLE
        self.memory = AgentMemory()
        self.tools = {}
        self.checkpoints = []

    def register_tool(self, name: str, function: Callable, description: str):
        """Register tools for agent use"""
        self.tools[name] = {
            "function": function,
            "description": description,
            "usage_count": 0
        }
        print(f"Registered tool: {name}")

    def create_plan(self, task: str) -> List[Dict]:
        """
        Dynamic planning based on task complexity
        Implements reasoning and planning capabilities
        """
        self.state = AgentState.PLANNING

        planning_prompt = PromptTemplate(
            input_variables=["task", "available_tools", "context"],
            template="""You are an autonomous agent creating an execution plan.

            Task: {task}

            Available Tools: {available_tools}

            Current Context: {context}

            Create a detailed plan with steps, tools needed, and success criteria.
            Format as JSON:
            {{
                "plan_id": "unique_id",
                "steps": [
                    {{
                        "step_id": "step_1",
                        "action": "specific action to take",
                        "tools_needed": ["tool1", "tool2"],
                        "success_criteria": "how to verify success",
                        "estimated_time": "time estimate",
                        "dependencies": ["previous_step_ids"]
                    }}
                ],
                "risks": ["potential issues"],
                "checkpoints": ["human approval points"]
            }}"""
        )

        tools_description = "\n".join([
            f"- {name}: {info['description']}"
            for name, info in self.tools.items()
        ])

        context = json.dumps(self.memory.current_context, indent=2)

        chain = planning_prompt | self.llm | StrOutputParser()
        plan_result = chain.invoke({
            "task": task,
            "available_tools": tools_description,
            "context": context
        })

        # Parse plan (simplified JSON extraction)
        try:
            import re
            json_match = re.search(r'\{.*\}', plan_result, re.DOTALL)
            if json_match:
                plan_data = json.loads(json_match.group())
                plan_steps = plan_data.get("steps", [])

                # Add to memory
                self.memory.task_history.append({
                    "task": task,
                    "plan": plan_data,
                    "created_at": time.time()
                })

                print(f"Created plan with {len(plan_steps)} steps")
                return plan_steps
            # No JSON object in the response: log it and use the fallback plan below
            self.memory.error_log.append("Planning error: no JSON object found in LLM response")
        except Exception as e:
            self.memory.error_log.append(f"Planning error: {str(e)}")

        # Fallback simple plan, used when the plan is missing or cannot be parsed
        return [{
            "step_id": "fallback_1",
            "action": f"Complete task: {task}",
            "tools_needed": [],
            "success_criteria": "Task completion"
        }]

    def execute_step(self, step: Dict) -> Dict:
        """Execute individual plan step with error recovery"""
        step_id = step.get("step_id", str(uuid.uuid4()))
        print(f"Executing step: {step_id}")

        try:
            # Check if tools are needed
            tools_needed = step.get("tools_needed", [])
            tool_results = {}

            for tool_name in tools_needed:
                if tool_name in self.tools:
                    print(f"Using tool: {tool_name}")
                    # Simplified tool execution
                    tool_results[tool_name] = f"Tool {tool_name} executed successfully"
                    self.tools[tool_name]["usage_count"] += 1
                else:
                    print(f"Warning: Tool {tool_name} not available")

            # Execute main action
            execution_prompt = PromptTemplate(
                input_variables=["action", "tool_results", "success_criteria"],
                template="""Execute this action step by step:

                Action: {action}

                Tool Results: {tool_results}

                Success Criteria: {success_criteria}

                Provide detailed execution results and verify success criteria."""
            )

            chain = execution_prompt | self.llm | StrOutputParser()
            result = chain.invoke({
                "action": step["action"],
                "tool_results": json.dumps(tool_results, indent=2),
                "success_criteria": step.get("success_criteria", "completion")
            })

            # Evaluate success
            success = self.evaluate_step_success(step, result)

            return {
                "step_id": step_id,
                "status": "success" if success else "needs_retry",
                "result": result,
                "tool_usage": tool_results,
                "execution_time": time.time()
            }

        except Exception as e:
            error_msg = f"Step execution failed: {str(e)}"
            self.memory.error_log.append(error_msg)
            return {
                "step_id": step_id,
                "status": "failed",
                "error": error_msg,
                "execution_time": time.time()
            }

    def evaluate_step_success(self, step: Dict, result: str) -> bool:
        """Evaluate if step was successful based on criteria"""
        success_criteria = step.get("success_criteria", "")

        evaluation_prompt = PromptTemplate(
            input_variables=["criteria", "result"],
            template="""Evaluate if this result meets the success criteria.

            Success Criteria: {criteria}

            Actual Result: {result}

            Respond with just "SUCCESS" or "FAILURE" followed by brief reasoning."""
        )

        chain = evaluation_prompt | self.llm | StrOutputParser()
        evaluation = chain.invoke({
            "criteria": success_criteria,
            "result": result
        })

        return "SUCCESS" in evaluation.upper()

    def error_recovery(self, failed_step: Dict, error: str) -> Optional[Dict]:
        """Implement error recovery strategies"""
        print(f"Attempting error recovery for: {error}")

        recovery_prompt = PromptTemplate(
            input_variables=["failed_step", "error", "error_history"],
            template="""Analyze this error and suggest recovery strategy:

            Failed Step: {failed_step}

            Error: {error}

            Previous Errors: {error_history}

            Suggest a modified approach or alternative strategy."""
        )

        chain = recovery_prompt | self.llm | StrOutputParser()
        recovery_suggestion = chain.invoke({
            "failed_step": json.dumps(failed_step, indent=2),
            "error": error,
            "error_history": json.dumps(self.memory.error_log[-5:], indent=2)
        })

        # Create modified step (simplified)
        modified_step = failed_step.copy()
        modified_step["action"] = f"RETRY: {modified_step['action']} (Modified based on: {recovery_suggestion[:100]})"

        return modified_step

    def human_checkpoint(self, checkpoint_data: Dict) -> bool:
        """Simulate human-in-the-loop checkpoint"""
        print(f"🚨 HUMAN CHECKPOINT: {checkpoint_data}")
        print("In production, this would pause for human approval")

        # Simulate human approval (always approve for demo)
        approval = True
        print(f"✅ Human approval: {'Granted' if approval else 'Denied'}")
        return approval

    def autonomous_execution(self, task: str) -> Dict:
        """
        Main autonomous agent execution loop
        Implements the complete agent pattern with all capabilities
        """
        print(f"🤖 AUTONOMOUS AGENT STARTING")
        print(f"Task: {task}")

        execution_log = {
            "task": task,
            "start_time": time.time(),
            "steps_completed": 0,
            "errors_encountered": 0,
            "human_interactions": 0,
            "final_status": "in_progress"
        }

        try:
            # Phase 1: Planning
            self.state = AgentState.PLANNING
            plan = self.create_plan(task)

            if not plan:
                raise Exception("Failed to create execution plan")

            # Phase 2: Execution
            self.state = AgentState.EXECUTING
            completed_steps = []

            for iteration in range(self.max_iterations):
                if not plan:
                    break

                current_step = plan.pop(0)

                # Check for human checkpoint
                if "checkpoint" in current_step.get("action", "").lower():
                    if not self.human_checkpoint(current_step):
                        self.state = AgentState.BLOCKED
                        execution_log["final_status"] = "blocked_by_human"
                        break
                    execution_log["human_interactions"] += 1

                # Execute step
                step_result = self.execute_step(current_step)
                completed_steps.append(step_result)
                execution_log["steps_completed"] += 1

                if step_result["status"] == "failed":
                    execution_log["errors_encountered"] += 1

                    # Attempt error recovery
                    recovered_step = self.error_recovery(
                        current_step,
                        step_result.get("error", "Unknown error")
                    )

                    if recovered_step:
                        plan.insert(0, recovered_step)  # Retry at front
                    else:
                        print("❌ Error recovery failed")
                        break

                elif step_result["status"] == "needs_retry":
                    plan.insert(0, current_step)  # Retry same step

                # Progress update
                print(f"Progress: {execution_log['steps_completed']} steps completed")

            # Phase 3: Final evaluation
            self.state = AgentState.EVALUATING
            final_evaluation = self.evaluate_final_result(task, completed_steps)

            execution_log.update({
                "end_time": time.time(),
                "total_duration": time.time() - execution_log["start_time"],
                "completed_steps": completed_steps,
                "final_evaluation": final_evaluation,
                "final_status": "completed" if final_evaluation["success"] else "failed"
            })

            self.state = AgentState.COMPLETED if final_evaluation["success"] else AgentState.FAILED

            print(f"🎯 AUTONOMOUS EXECUTION {'COMPLETED' if final_evaluation['success'] else 'FAILED'}")
            print(f"Duration: {execution_log['total_duration']:.2f}s")
            print(f"Steps: {execution_log['steps_completed']}")
            print(f"Errors: {execution_log['errors_encountered']}")

            return execution_log

        except Exception as e:
            execution_log.update({
                "end_time": time.time(),
                "final_status": "system_error",
                "system_error": str(e)
            })

            self.state = AgentState.FAILED
            print(f"💥 SYSTEM ERROR: {str(e)}")
            return execution_log

    def evaluate_final_result(self, original_task: str, completed_steps: List[Dict]) -> Dict:
        """Final evaluation of task completion"""

        evaluation_prompt = PromptTemplate(
            input_variables=["original_task", "steps_summary"],
            template="""Evaluate if the original task was successfully completed.

            Original Task: {original_task}

            Completed Steps Summary: {steps_summary}

            Provide evaluation including:
            1. Task completion status (SUCCESS/PARTIAL/FAILURE)
            2. Quality assessment (1-10)
            3. Areas of success
            4. Areas for improvement
            5. Overall confidence level"""
        )

        # Cast each result to str so non-string results can still be truncated
        steps_summary = "\n".join([
            f"Step {i+1}: {str(step.get('result', 'No result'))[:100]}..."
            for i, step in enumerate(completed_steps)
        ])

        chain = evaluation_prompt | self.llm | StrOutputParser()
        evaluation_result = chain.invoke({
            "original_task": original_task,
            "steps_summary": steps_summary
        })

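        # Parse evaluation (simplified): this is a plain substring check, so the
        # "Areas of success" heading requested in the prompt can also match it.
        # A stricter parser would read only the status line.
        success = "SUCCESS" in evaluation_result.upper()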

        return {
            "success": success,
            "evaluation": evaluation_result,
            "steps_count": len(completed_steps),
            "quality_indicators": {
                "completion_rate": len([s for s in completed_steps if s.get("status") == "success"]) / max(len(completed_steps), 1),
                "error_rate": len([s for s in completed_steps if s.get("status") == "failed"]) / max(len(completed_steps), 1)
            }
        }


In [None]:

# Demonstration of Autonomous Agent System
class AutonomousAgentDemo:
    """Comprehensive demonstration of autonomous agent capabilities"""

    def __init__(self, llm):
        self.agent = AdvancedAgentSystem(llm)
        self.setup_demo_tools()

    def setup_demo_tools(self):
        """Register demonstration tools"""

        def web_search(query: str) -> str:
            return f"Search results for '{query}': [Simulated web search results]"

        def file_manager(action: str, filename: str = "", content: str = "") -> str:
            return f"File operation '{action}' on '{filename}': Success"

        def api_call(endpoint: str, data: Dict = None) -> str:
            return f"API call to '{endpoint}': Success (simulated)"

        def data_analysis(dataset: str, analysis_type: str = "summary") -> str:
            return f"Analysis '{analysis_type}' on '{dataset}': Completed with insights"

        # Register tools
        self.agent.register_tool("web_search", web_search, "Search the web for information")
        self.agent.register_tool("file_manager", file_manager, "Create, read, update, delete files")
        self.agent.register_tool("api_call", api_call, "Make API calls to external services")
        self.agent.register_tool("data_analysis", data_analysis, "Analyze datasets and generate insights")

    def demo_complex_research_task(self):
        """Demonstrate agent handling complex multi-step research task"""
        print("🔬 AUTONOMOUS RESEARCH AGENT DEMONSTRATION")

        complex_task = """
        Research the current state of quantum computing and create a comprehensive report including:
        1. Recent breakthrough discoveries in quantum computing
        2. Major companies and their quantum computing initiatives
        3. Current limitations and challenges
        4. Potential future applications
        5. Timeline predictions for quantum supremacy achievements

        The report should be well-structured, factual, and include citations.
        """

        result = self.agent.autonomous_execution(complex_task)
        return result

    def demo_software_development_task(self):
        """Demonstrate agent handling software development workflow"""
        print("💻 AUTONOMOUS DEVELOPMENT AGENT DEMONSTRATION")

        dev_task = """
        Create a complete web application for a personal task management system including:
        1. Backend API with user authentication
        2. Database schema for tasks and users
        3. Frontend interface with CRUD operations
        4. Unit tests for core functionality
        5. Deployment configuration
        6. Documentation and README

        Use modern best practices and ensure security considerations.
        """

        result = self.agent.autonomous_execution(dev_task)
        return result

    def run_comprehensive_demo(self):
        """Run comprehensive autonomous agent demonstrations"""
        print("=" * 80)
        print("AUTONOMOUS AGENT SYSTEM DEMONSTRATION")
        print("=" * 80)

        results = {}

        # Demo 1: Research Task
        results['research'] = self.demo_complex_research_task()

        print("\n" + "-" * 40)

        # Demo 2: Development Task
        results['development'] = self.demo_software_development_task()

        print("\n" + "=" * 80)
        print("AUTONOMOUS AGENT DEMONSTRATIONS COMPLETED")
        print("=" * 80)

        return results



In [None]:
# @title
# Initialize and demonstrate autonomous agents
if 'autonomous_agent' not in tutorial_state:
    agent_demo = AutonomousAgentDemo(memory_llm)
    tutorial_state['autonomous_agent'] = agent_demo

    print("🚀 STARTING AUTONOMOUS AGENT DEMONSTRATIONS")
    autonomous_results = agent_demo.run_comprehensive_demo()
    tutorial_state['autonomous_results'] = autonomous_results
else:
    print("🔄 RUNNING AUTONOMOUS AGENT DEMONSTRATIONS")
    autonomous_results = tutorial_state['autonomous_agent'].run_comprehensive_demo()
    tutorial_state['autonomous_results'] = autonomous_results

<img src="https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-cdn.anthropic.com%2Fimages%2F4zrzovbb%2Fwebsite%2F14f51e6406ccb29e695da48b17017e899a6119c7-2401x1000.png&w=3840&q=75" width=700>

## Retrieval-Augmented Generation



Now that we've mastered building intelligent agents and workflows, it's time to tackle one of the most important challenges in modern AI systems: how do we give our agents access to vast, specific, and up-to-date knowledge that wasn't included in their training data?

This is where Retrieval-Augmented Generation (RAG) becomes essential. RAG is the bridge between the incredible reasoning capabilities of large language models and the specific, detailed knowledge that your applications need to be truly useful in real-world scenarios.

<img src="https://miro.medium.com/v2/resize:fit:1400/0*WYv0_CaBmCTt7FXc" width=700>

### Why RAG Is Essential: The Knowledge Gap Problem



A quick note on tooling: LlamaIndex is highly specialized for data ingestion and retrieval, while LangChain is better suited to building complex, multi-step AI workflows. This tutorial uses LangChain.

Let me paint a picture of why RAG matters. Imagine you've built a brilliant customer service agent using the workflows we just learned. It can route questions, use tools, and maintain conversation context perfectly. But then a customer asks about your company's specific return policy that was updated last week, or wants details about a product that was launched after the model's training cutoff.

Even the most advanced language models face critical limitations when used alone:

1. **Knowledge Cutoff**: Training data has a specific cutoff date, making models ignorant of recent information
2. **Domain Specificity**: Models lack deep knowledge about your specific business, products, or internal processes  
3. **Context Window Limits**: Even with large context windows, you can't fit entire knowledge bases into a single conversation
4. **Hallucination Risk**: When models don't know something, they often generate plausible-sounding but incorrect information
5. **Static Knowledge**: The information encoded during training can't be updated without retraining

**Where Our Agent Workflows Hit the Wall:** The sophisticated agent workflows we've built are incredibly powerful for reasoning and decision-making, but they're only as good as the knowledge they have access to. Without RAG:

- Your routing system might correctly identify that a question is about "product specifications," but have no way to retrieve the actual, current specifications
- Your memory system can remember what users have discussed, but can't recall relevant company knowledge or documentation
- Your tools can calculate and process data, but can't access your proprietary knowledge base or recent updates

**RAG as the Solution:** Retrieval-Augmented Generation solves these problems by creating a dynamic bridge between your agents and external knowledge sources. Instead of relying solely on the model's trained knowledge, RAG systems:

- **Retrieve** relevant information from external knowledge bases in real-time
- **Augment** the model's prompt with this retrieved context  
- **Generate** responses that combine the model's reasoning abilities with specific, current, and accurate information

This creates agents that maintain their sophisticated reasoning capabilities while having access to vast, specific, and up-to-date knowledge that makes them truly useful for real-world applications.
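
The three steps above can be sketched in plain Python. This is a toy illustration rather than a production pipeline: `KNOWLEDGE_BASE`, `retrieve`, `augment`, and `generate` are hypothetical names, retrieval is naive word overlap instead of embedding search, and `generate` is a stub standing in for a real LLM call.

```python
import re

# Hypothetical in-memory knowledge base (illustrative content only)
KNOWLEDGE_BASE = [
    "Return policy: items can be returned within 30 days of delivery.",
    "Shipping: standard delivery takes 3 to 5 business days.",
    "Warranty: all devices include a one-year limited warranty.",
]

def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, documents: list, k: int = 1) -> list:
    """Retrieve: rank documents by word overlap with the query, keep the top k."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def augment(query: str, context: list) -> str:
    """Augment: place the retrieved passages in the prompt ahead of the question."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}"
    )

def generate(prompt: str) -> str:
    """Generate: stub for an LLM call (assumption: a real system calls a chat model here)."""
    return f"[LLM response grounded in a {len(prompt)}-character prompt]"

query = "What is the return policy for items?"
context = retrieve(query, KNOWLEDGE_BASE)
prompt = augment(query, context)
print(prompt)
print(generate(prompt))
```

Swapping `retrieve` for a vector-store lookup and `generate` for a model call keeps the same three-step shape.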

**The Most Popular RAG Approaches Right Now:**

**1. GraphRAG**
- Microsoft's breakthrough approach that creates knowledge graphs from documents
- Builds hierarchical community summaries for better context understanding
- Excels at answering complex, multi-hop questions that span multiple documents
- Game-changer for enterprise knowledge bases and research applications

**2. Agentic RAG**
- Combines RAG with autonomous agents that can reason about when and what to retrieve
- Uses sophisticated routing to decide between different knowledge sources
- Self-correcting retrieval based on generated content quality
- Perfect for complex workflows requiring dynamic knowledge access

**3. Multi-Modal RAG**
- Retrieves and processes images, tables, charts alongside text
- Essential for technical documentation, financial reports, and visual content
- Uses vision-language models for comprehensive document understanding

**4. Conversational RAG**
- Maintains conversation history and context across multiple turns
- Intelligently decides when to retrieve new information vs. use conversation memory
- Critical for chatbots and customer service applications

**5. Self-RAG**
- Models self-evaluate when retrieval is needed and assess information quality
- Reduces hallucination by checking factual consistency
- Adaptive retrieval triggers based on confidence and uncertainty

**6. Hybrid Dense-Sparse RAG**
- Combines traditional keyword search (BM25) with semantic embeddings
- Best of both worlds: exact matches + semantic similarity
- Many production systems use this approach for robustness
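
To make the hybrid dense-sparse idea concrete, here is a minimal sketch of one common way to merge a keyword ranking with an embedding ranking: reciprocal rank fusion (RRF). The text above does not prescribe a fusion method, so treat RRF as an assumption. The two input rankings are hypothetical stand-ins for BM25 and vector-search results, and `k=60` is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores 1 / (k + rank) per list it appears in; the scores
    are summed across lists and documents are sorted by the total.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of a keyword (BM25) search and a dense embedding search
bm25_ranking = ["doc_b", "doc_a", "doc_c"]
dense_ranking = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['doc_a', 'doc_b', 'doc_d', 'doc_c']
```

Documents that both retrievers rank highly (`doc_a`, `doc_b`) rise to the top, which is the robustness benefit of combining exact matching with semantic similarity.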



### Preprocessing the documents

Document preprocessing is the foundation of any effective RAG system. Without well-structured, labeled data in the database, no model can perform well. Good preprocessing is especially crucial in large-scale production systems that handle millions of documents in real time, where a small mistake or bug can cause catastrophic failures. Choose preprocessing techniques that fit your requirements and align with your business goals.

**The Challenge:** Raw documents come in countless formats, structures, and sizes. A PDF might contain tables, images, and multi-column layouts. A web page includes navigation menus, advertisements, and dynamic content. A code repository has different file types with distinct syntaxes. Without proper preprocessing, even the most sophisticated retrieval system will struggle to find and present relevant information effectively.

Document preprocessing involves several transformations that can be expressed mathematically:

- **Information Density**: $\rho = \frac{\text{Relevant Content}}{\text{Total Content}}$ - maximizing signal-to-noise ratio
- **Semantic Coherence**: $C(\text{chunk}) = \frac{\sum_{i<j} \text{similarity}(\text{sent}_i, \text{sent}_j)}{n(n-1)/2}$, the average similarity over all pairs of the chunk's $n$ sentences - ensuring chunks maintain internal consistency
- **Optimal Chunk Size**: $\text{size}_{\text{optimal}} = \arg\max_{s} \left(\text{retrieval\_accuracy}(s) - \text{processing\_cost}(s)\right)$
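
As a rough sketch of how the first two quantities could be computed, the snippet below averages pairwise sentence similarity over all $n(n-1)/2$ pairs. The similarity function here is word-level Jaccard overlap, which is only an assumption to keep the example dependency-free; a real system would more likely use embedding cosine similarity. The function names are illustrative.

```python
import re
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a stand-in for embedding similarity."""
    words_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def semantic_coherence(sentences: list) -> float:
    """Average similarity over all unordered sentence pairs, n(n-1)/2 of them."""
    pairs = list(combinations(sentences, 2))
    if not pairs:
        return 1.0  # a single sentence is trivially coherent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def information_density(relevant_chars: int, total_chars: int) -> float:
    """Share of the content that is relevant (rho in the formula above)."""
    return relevant_chars / total_chars if total_chars else 0.0

coherent_chunk = ["RAG retrieves documents", "RAG ranks documents"]
mixed_chunk = ["RAG retrieves documents", "Bananas are yellow"]

print(semantic_coherence(coherent_chunk))  # 0.5
print(semantic_coherence(mixed_chunk))     # 0.0
print(information_density(300, 1000))      # 0.3
```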

<img src="https://chamomile.ai/reliable-rag-with-data-preprocessing/image6.png" width=700>

**The Preprocessing Pipeline:** Our approach follows a systematic four-stage pipeline:

1. **Document Loading**: Extract content from various formats while preserving semantic structure
2. **Splitting**: Break documents into manageable sections based on natural boundaries
3. **Chunking**: Create optimally-sized pieces that balance context and specificity  
4. **Embedding**: Transform text into vector representations for semantic search

Each stage has multiple strategies optimized for different document types and use cases. Let's explore each in detail:

#### Document Loading


Document loading is the critical first step in building effective RAG systems. Different document types require specialized loaders optimized for their unique structures and challenges. Let's explore the ecosystem of document loaders available in LangChain and understand when to use each one.





##### Web Content Loaders


Web content presents unique challenges: dynamic JavaScript rendering, complex layouts, advertisements, navigation elements, and varying HTML structures. Choosing the right web loader depends on your specific requirements around speed, accuracy, and the complexity of target websites.




| **Loader** | **Best For** | **Key Features** | **Considerations** | **Type** |
|------------|--------------|------------------|-------------------|----------|
| [Web](https://python.langchain.com/docs/integrations/document_loaders/web_base) | Simple static pages | • Uses urllib + BeautifulSoup<br>• Fast and lightweight<br>• No external dependencies | • Struggles with JavaScript-heavy sites<br>• Basic HTML parsing only<br>• No dynamic content handling | Package |
| [Unstructured](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) | Complex layouts | • Advanced structure detection<br>• Preserves semantic hierarchy<br>• Handles tables and formatting | • Slower processing<br>• Heavier dependencies<br>• May need additional setup | Package |
| [RecursiveURL](https://python.langchain.com/docs/integrations/document_loaders/recursive_url) | Documentation sites | • Automatically discovers child links<br>• Configurable depth control<br>• Maintains site structure | • Can retrieve too much data<br>• Requires careful depth limits<br>• May hit rate limits | Package |
| [Sitemap](https://python.langchain.com/docs/integrations/document_loaders/sitemap) | Entire websites | • Uses sitemap.xml for discovery<br>• Efficient site crawling<br>• Respects site structure | • Requires valid sitemap<br>• May miss pages not in sitemap<br>• Large sites = long processing | Package |
| [Spider](https://python.langchain.com/docs/integrations/document_loaders/spider) | Production crawling | • LLM-optimized output format<br>• Handles JavaScript rendering<br>• Anti-bot bypass capabilities | • Requires API key<br>• Usage-based pricing<br>• External service dependency | API |
| [Firecrawl](https://python.langchain.com/docs/integrations/document_loaders/firecrawl) | Enterprise scraping | • Self-hostable option<br>• JavaScript execution<br>• Advanced content extraction | • Complex setup if self-hosted<br>• API costs if cloud-hosted<br>• Requires infrastructure | API |
| [Docling](https://python.langchain.com/docs/integrations/document_loaders/docling) | Document-heavy sites | • Specialized for document extraction<br>• Format preservation<br>• Multi-format support | • Focused on document-centric sites<br>• May be overkill for simple pages<br>• Learning curve | Package |
| [Hyperbrowser](https://python.langchain.com/docs/integrations/document_loaders/hyperbrowser) | Complex web apps | • Full browser automation<br>• JavaScript execution<br>• Session management | • Higher latency<br>• Resource intensive<br>• API-based pricing | API |
| [AgentQL](https://python.langchain.com/docs/integrations/document_loaders/agentql) | Structured extraction | • Natural language queries<br>• Precise data targeting<br>• Schema-based extraction | • Best for specific data points<br>• Requires query design<br>• API costs | API |
| [Oxylabs](https://python.langchain.com/docs/integrations/document_loaders/oxylabs) | Large-scale scraping | • Enterprise-grade infrastructure<br>• Geographic proxy support<br>• High success rates | • Premium pricing<br>• Overkill for small projects<br>• External dependency | API |
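
To show why navigation menus, scripts, and footers matter when loading web pages, here is a small standard-library sketch that keeps visible text and drops common boilerplate tags. It is not how any of the loaders in the table are implemented; `MainTextExtractor`, `extract_main_text`, and the choice of tags to skip are assumptions made for illustration.

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collect visible text while skipping tags that usually hold boilerplate."""

    SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside any skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)

def extract_main_text(html: str) -> str:
    """Return the page's visible text, one text node per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

sample_html = """
<html><body>
<nav><a href="/">Home</a> <a href="/docs">Docs</a></nav>
<h1>Return Policy</h1>
<p>Items can be returned within 30 days.</p>
<script>trackVisit();</script>
<footer>Copyright notice</footer>
</body></html>
"""

print(extract_main_text(sample_html))
```

Only the heading and the paragraph survive. Pages that build their content with JavaScript would return little or nothing here, which is the case the browser-based and API loaders in the table are meant to handle.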


There are PDF content loaders as well:

| **Document Loader** | **Description** | **Package/API** |
| --- | --- | --- |
| [PyPDF](https://python.langchain.com/docs/integrations/document_loaders/pypdfloader) | Uses `pypdf` to load and parse PDFs | Package |
| [Unstructured](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) | Uses Unstructured's open source library to load PDFs | Package |
| [Amazon Textract](https://python.langchain.com/docs/integrations/document_loaders/amazon_textract) | Uses AWS API to load PDFs | API |
| [MathPix](https://python.langchain.com/docs/integrations/document_loaders/mathpix) | Uses MathPix to load PDFs | Package |
| [PDFPlumber](https://python.langchain.com/docs/integrations/document_loaders/pdfplumber) | Load PDF files using PDFPlumber | Package |
| [PyPDFDirectory](https://python.langchain.com/docs/integrations/document_loaders/pypdfdirectory) | Load a directory with PDF files | Package |
| [PyPDFium2](https://python.langchain.com/docs/integrations/document_loaders/pypdfium2) | Load PDF files using PyPDFium2 | Package |
| [PyMuPDF](https://python.langchain.com/docs/integrations/document_loaders/pymupdf) | Load PDF files using PyMuPDF | Package |
| [PyMuPDF4LLM](https://python.langchain.com/docs/integrations/document_loaders/pymupdf4llm) | Load PDF content to Markdown using PyMuPDF4LLM | Package |
| [PDFMiner](https://python.langchain.com/docs/integrations/document_loaders/pdfminer) | Load PDF files using PDFMiner | Package |
| [Upstage Document Parse Loader](https://python.langchain.com/docs/integrations/document_loaders/upstage) | Load PDF files using UpstageDocumentParseLoader | Package |
| [Docling](https://python.langchain.com/docs/integrations/document_loaders/docling) | Load PDF files using Docling | Package |

You can also build multimodal RAG. In its simplest form, this means converting each image to text (for example, a caption or description) and then treating the result as a normal vectorized document.

<img src="https://developer-blogs.nvidia.com/wp-content/uploads/2024/03/rag-preprocessing-for-images.png" width=700>

For now, let's just get a document loading function ready:

In [None]:
# Minimal Document Loading System for teaching
class DocumentLoadingSystem:
    """Simple, synchronous document loader used for examples.
    Keeps a tiny in-memory collection of documents (content + metadata).
    """
    def __init__(self):
        self.documents = []

    def load_text(self, text_path: str) -> list:
        """Load a plain text file and return a simple document dict.
        In this tutorial we keep it synchronous and minimal.
        """
        try:
            with open(text_path, "r", encoding="utf-8") as f:
                content = f.read()
            doc = {"content": content, "metadata": {"file_path": text_path}}
            self.documents.append(doc)
            print(f"✅ Loaded: {text_path}")
            return [doc]
        except Exception as e:
            print(f"❌ Could not load {text_path}: {e}")
            return []

    def create_sample_documents(self) -> list:
        """Return a short list of small sample documents for demos."""
        samples = [
            {"content": "AI and ML: overview of key concepts and applications.", "metadata": {"title": "AI Guide"}},
            {"content": "RAG: combine retrieval and generation to ground LLM outputs.", "metadata": {"title": "RAG Notes"}},
        ]
        self.documents.extend(samples)
        print(f"✅ Created {len(samples)} sample documents")
        return samples

# initialize and register
doc_loader = DocumentLoadingSystem()
tutorial_state['doc_loader'] = doc_loader
print('📚 Minimal DocumentLoadingSystem ready (use doc_loader.create_sample_documents()).')

In [None]:
# Demonstration of Document Loading Capabilities

print("🧪 TESTING DOCUMENT LOADING CAPABILITIES")
print("=" * 60)

# Test 1: Load sample documents (for demonstration)
print("\n📝 Test 1: Loading Sample Documents")
sample_docs = doc_loader.create_sample_documents()

# Display sample document info.
# The sample dicts only carry "content" and "metadata", so optional keys
# such as "source_type" are read with .get() and a default value.
for i, doc in enumerate(sample_docs):
    print(f"\n   Document {i+1}:")
    print(f"   Title: {doc['metadata'].get('title', 'No title')}")
    print(f"   Type: {doc.get('source_type', 'sample')}")
    print(f"   Content length: {len(doc['content'])} characters")
    print(f"   Preview: {doc['content'][:100]}...")

# Test 2: Web content loading (simulated with sample URLs)
print("\n🌐 Test 2: Web Content Loading Demo")
print("   Note: Using sample content to demonstrate web loading capabilities")

# Simulate web loading
web_urls = [
    "https://docs.python.org/3/tutorial/",
    "https://langchain.readthedocs.io/"
]

print(f"   Simulated loading from: {web_urls[0]}")
print("   In production, this would fetch live content from these URLs")

# Test 3: Document metadata analysis
print("\n📊 Test 3: Document Metadata Analysis")

metadata_summary = {}
for doc in sample_docs:
    doc_type = doc.get('source_type', 'sample')
    if doc_type not in metadata_summary:
        metadata_summary[doc_type] = {
            'count': 0,
            'total_chars': 0,
            'loaders_used': set()
        }

    metadata_summary[doc_type]['count'] += 1
    metadata_summary[doc_type]['total_chars'] += len(doc['content'])
    metadata_summary[doc_type]['loaders_used'].add(doc.get('loader_used', 'DocumentLoadingSystem'))

print("   Document Type Summary:")
for doc_type, stats in metadata_summary.items():
    print(f"     {doc_type}:")
    print(f"       Count: {stats['count']}")
    print(f"       Total characters: {stats['total_chars']}")
    print(f"       Loaders used: {', '.join(sorted(stats['loaders_used']))}")

# Store results (create the 'loaded_documents' entry if it does not exist yet)
tutorial_state.setdefault('loaded_documents', {})['samples'] = sample_docs
tutorial_state['loading_metadata'] = metadata_summary

print("\n✅ DOCUMENT LOADING TESTS COMPLETE")
print(f"   Loaded {len(sample_docs)} documents successfully")
print("   Ready to proceed to document splitting and chunking")

#### Splitting


 Unlike the straightforward task of simply loading documents, splitting requires sophisticated understanding of document structure, semantic boundaries, and downstream processing requirements. The challenge lies in preserving meaningful context while creating chunks that are optimally sized for both embedding generation and retrieval accuracy.


For example, a technical whitepaper might contain dense theoretical sections that benefit from larger chunks to maintain conceptual coherence, while a customer service manual with step-by-step procedures might work better with smaller, action-oriented segments. Meanwhile, a legal document requires preservation of precise clause boundaries, and a research paper needs to maintain the relationship between hypotheses and supporting evidence. Each document type demands a nuanced approach to splitting that balances computational constraints with semantic integrity.

**The Multi-Dimensional Challenge:**

Document splitting isn't just about managing size; it's about optimizing the fundamental trade-offs that determine RAG system performance. **Chunk size** directly impacts both embedding quality and retrieval precision: smaller chunks provide granular relevance but may lack sufficient context, while larger chunks offer comprehensive context but may dilute the signal for specific queries. **Semantic boundaries** determine whether related concepts stay together or get artificially separated, affecting the model's ability to understand relationships and dependencies. **Overlap strategies** influence how much redundancy exists between chunks, which can improve retrieval recall but increase storage costs and processing time.

**The Computational Reality:**

Modern embedding models typically handle 512-8192 tokens efficiently, but optimal chunk size varies significantly based on your specific use case. Dense retrieval systems often perform better with 200-500 token chunks for precision, while semantic search applications may benefit from 500-1000 token chunks for context. The mathematical relationship isn't linear: doubling chunk size doesn't simply halve precision or double context value. Instead, there's often a sweet spot that maximizes the information density while maintaining retrievability.

**Strategic Splitting Approaches:**

The most effective splitting strategies employ hierarchical approaches that operate at multiple levels simultaneously. **Structural splitting** respects document organization by breaking at natural boundaries like chapters, sections, and paragraphs, preserving the author's intended information architecture. **Semantic splitting** uses NLP techniques to identify topic boundaries and conceptual transitions, ensuring that related ideas remain grouped together. **Sliding window approaches** create overlapping chunks that capture relationships spanning traditional boundaries, particularly valuable for documents where context frequently spans multiple sections.
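
The sliding window approach can be sketched in a few lines. This version works on whitespace-separated words purely for readability, which is an assumption; in practice the unit would usually be tokens or characters. `sliding_window_chunks`, `window_size`, and `overlap` are illustrative names.

```python
def sliding_window_chunks(words, window_size, overlap):
    """Split a list of words into windows that share `overlap` words with the previous one."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be >= 0 and smaller than window_size")
    step = window_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + window_size])
        if start + window_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

words = "one two three four five six seven eight".split()
chunks = [" ".join(c) for c in sliding_window_chunks(words, window_size=4, overlap=2)]
for chunk in chunks:
    print(chunk)
# one two three four
# three four five six
# five six seven eight
```

Each chunk repeats the last two words of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.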

The choice of splitting strategy often determines whether your RAG system delivers frustratingly fragmented responses or remarkably coherent, context-aware answers that feel like they were written by a domain expert who has read and understood your entire knowledge base.

##### Text Splitting Strategies in LangChain

LangChain provides a rich ecosystem of text splitters, each designed for specific document types and use cases. Let's explore the most popular and effective ones:

**Core Text Splitters:**

| **Splitter** | **Best For** | **How It Works** | **Key Parameters** | **Use Cases** |
|--------------|--------------|------------------|-------------------|---------------|
| **CharacterTextSplitter** | General text, simple splitting | Splits on a single character (default `\n\n`) | `chunk_size`, `chunk_overlap`, `separator` | Blog posts, articles, simple documents |
| **RecursiveCharacterTextSplitter** | Most documents (default choice) | Tries separators in order: `\n\n`, `\n`, ` `, `""` | `chunk_size`, `chunk_overlap`, `separators` | General purpose, mixed content |
| **TokenTextSplitter** | Token-aware splitting | Splits based on token count (uses tiktoken) | `chunk_size`, `chunk_overlap`, `encoding_name` | When exact token limits matter (GPT models) |
| **SentenceTransformersTokenTextSplitter** | Sentence transformer models | Optimized for sentence-transformers library | `tokens_per_chunk`, `chunk_overlap`, `model_name` | When using sentence-transformers embeddings |
| **MarkdownHeaderTextSplitter** | Markdown documents | Splits by headers, preserves hierarchy | `headers_to_split_on` | Documentation, README files, technical docs |
| **HTMLHeaderTextSplitter** | HTML content | Splits by HTML headers (`<h1>`, `<h2>`, etc.) | `headers_to_split_on` | Web pages, HTML documentation |
| **RecursiveCharacterTextSplitter.from_language** | Source code | Language-aware separators (Python, JS, etc.) selected through the `Language` enum | `language`, `chunk_size`, `chunk_overlap` | Code repositories, programming tutorials |
| **LatexTextSplitter** | LaTeX documents | Preserves LaTeX structure and math | `chunk_size`, `chunk_overlap` | Academic papers, mathematical content |
| **NLTKTextSplitter** | Sentence-based splitting | Uses NLTK for proper sentence boundary detection | `chunk_size`, `chunk_overlap` | Natural language text requiring sentence integrity |
| **SpacyTextSplitter** | Advanced NLP splitting | Uses spaCy for linguistic-aware splitting | `chunk_size`, `chunk_overlap`, `pipeline` | Complex text requiring NLP processing |

In [None]:
# @title
# Minimal text splitter imports and small corpus for demonstrations
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

# Compact sample texts used by splitter demos
sample_texts = {
    'article': (
        "Artificial Intelligence (AI) enables machines to learn from data. "
        "RAG systems combine retrieval with generation to produce grounded answers. "
        "This short article is for splitter demonstrations."
    ),
    'code': ("def add(a, b):\n    return a + b\n\nprint(add(1,2))"),
    'markdown': ("# Title\n\n## Section\n\nContent under section."),
    'html': ("<h1>Title</h1><p>Paragraph about RAG.</p>"),
}

print('Sample texts and splitter imports ready.')

##### 1. CharacterTextSplitter - Basic Splitting

The simplest splitter that splits on a single character separator. Best for straightforward text where you know the natural boundaries.

In [None]:
# Compact demonstration of CharacterTextSplitter

def demonstrate_character_splitter():
    splitter = CharacterTextSplitter(separator="\n\n", chunk_size=200, chunk_overlap=20)
    chunks = splitter.split_text(sample_texts['article'])
    print(f"Character splitter produced {len(chunks)} chunk(s).")
    for i, chunk in enumerate(chunks, 1):
        print(f"--- Chunk {i} ---\n{chunk[:200]}\n")
    return chunks

character_chunks = demonstrate_character_splitter()

##### 2. RecursiveCharacterTextSplitter - Adaptive Splitting (Most Popular!)

The **default choice** for most applications. It recursively tries different separators in order until chunks fit the desired size. This is the most versatile and commonly used splitter.
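
To illustrate the "try separators in order" idea, here is a simplified pure-Python sketch. It is not LangChain's implementation: it ignores `chunk_overlap` and other options, and `recursive_split` and `merge_pieces` are hypothetical helper names.

```python
def merge_pieces(pieces, sep, chunk_size):
    """Greedily join adjacent pieces with `sep` while the result fits in chunk_size."""
    merged, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged

def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, small_pieces = [], []
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Flush the pieces collected so far, then split the big piece more finely
            chunks.extend(merge_pieces(small_pieces, sep, chunk_size))
            small_pieces = []
            chunks.extend(recursive_split(piece, chunk_size, remaining))
        else:
            small_pieces.append(piece)
    chunks.extend(merge_pieces(small_pieces, sep, chunk_size))
    return chunks

print(recursive_split("aaaa bbbb\n\ncccc dddd eeee", chunk_size=10))
# ['aaaa bbbb', 'cccc dddd', 'eeee']
```

The first paragraph fits as a whole, so it is kept intact. The second is too long, so the function falls back to splitting on spaces and regroups the words into chunks that fit.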

In [None]:
# Compact demonstration of RecursiveCharacterTextSplitter

def demonstrate_recursive_splitter():
    splitter = RecursiveCharacterTextSplitter(chunk_size=150, chunk_overlap=20)
    chunks = splitter.split_text(sample_texts['article'])
    print(f"Recursive splitter produced {len(chunks)} chunk(s). Avg size: {sum(len(c) for c in chunks)/len(chunks):.1f} chars")
    return chunks

recursive_chunks = demonstrate_recursive_splitter()
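
The recursive idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not LangChain's implementation: the name `recursive_split` and the separator list are hypothetical, and chunk overlap is left out so the fallback logic stays visible.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split text into pieces of at most chunk_size characters.

    Tries the first separator; any piece that is still too long is split
    again with the remaining (finer) separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing finer to try: fall back to hard character cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "Para one is short.\n\nPara two is a bit longer than the first one.\n\nThree."
pieces = recursive_split(text, chunk_size=40, separators=["\n\n", "\n", " "])
for p in pieces:
    print(len(p), repr(p))
```

Paragraph breaks are preferred, and only the paragraph that exceeds the limit is broken further at word boundaries.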

##### 3. TokenTextSplitter - Token-Aware Splitting

Splits based on **token count** rather than characters. Essential when working with models that have strict token limits (such as the 4K to 32K context windows of early GPT-3.5 and GPT-4 variants).

In [None]:
# @title
import tiktoken
from langchain.text_splitter import TokenTextSplitter

def demonstrate_token_splitter():
    """
    TokenTextSplitter: Splits based on token count, not characters

    Why tokens matter:
    - Different from characters (1 token ≈ 4 characters in English)
    - Models have token limits (GPT-3.5: 4K, GPT-4: 8K/32K)
    - Token counting varies by model/encoding

    Use Cases:
    - When working with LLMs with strict token limits
    - Need precise control over context window usage
    - Billing based on tokens
    - Ensuring chunks fit model context

    Token vs Character:
    - "Hello" = 1 token, 5 characters
    - "Hello world!" = 3 tokens, 12 characters
    - Token count < character count (usually 1 token ≈ 4 chars)
    """
    print("\n" + "=" * 80)
    print("3. TOKEN TEXT SPLITTER")
    print("=" * 80)

    # Using GPT-3.5-turbo encoding
    splitter = TokenTextSplitter(
        chunk_size=100,          # Maximum tokens per chunk
        chunk_overlap=10,        # Overlap in tokens
        encoding_name="cl100k_base"  # GPT-3.5/4 encoding
    )

    chunks = splitter.split_text(sample_texts['article'])

    # Calculate actual token counts
    encoding = tiktoken.get_encoding("cl100k_base")
    token_counts = [len(encoding.encode(chunk)) for chunk in chunks]

    print(f"\n📊 Token-based splitting results:")
    print(f"   Original text: {len(sample_texts['article'])} characters")
    print(f"   Original tokens: {len(encoding.encode(sample_texts['article']))}")
    print(f"   Number of chunks: {len(chunks)}")
    print(f"   Token counts per chunk: {token_counts}")
    print(f"   Average tokens per chunk: {sum(token_counts) / len(token_counts):.1f}")

    print(f"\n📝 First chunk:")
    print(chunks[0])
    print(f"   → Tokens: {token_counts[0]}, Characters: {len(chunks[0])}")

    # Compare different encodings
    print(f"\n\n🔬 Different encodings (same chunk size = 100 tokens):")
    encodings_to_test = [
        ("cl100k_base", "GPT-3.5/4"),
        ("p50k_base", "GPT-3/Codex"),
        ("r50k_base", "GPT-2")
    ]

    print(f"{'Encoding':>15} | {'Model':>15} | {'# Chunks':>10} | {'Total Tokens':>12}")
    print("-" * 60)

    for enc_name, model_name in encodings_to_test:
        test_splitter = TokenTextSplitter(
            chunk_size=100,
            chunk_overlap=10,
            encoding_name=enc_name
        )
        test_chunks = test_splitter.split_text(sample_texts['article'])
        enc = tiktoken.get_encoding(enc_name)
        total_tokens = sum(len(enc.encode(chunk)) for chunk in test_chunks)
        print(f"{enc_name:>15} | {model_name:>15} | {len(test_chunks):>10} | {total_tokens:>12}")

    # Demonstrate chunk size impact
    print(f"\n\n📏 Impact of chunk size on token splitting:")
    print(f"{'Chunk Size':>12} | {'Overlap':>8} | {'# Chunks':>10} | {'Max Tokens':>12}")
    print("-" * 50)

    for chunk_size in [50, 100, 200, 500]:
        test_splitter = TokenTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=10,
            encoding_name="cl100k_base"
        )
        test_chunks = test_splitter.split_text(sample_texts['article'])
        max_tokens = max(len(encoding.encode(chunk)) for chunk in test_chunks)
        print(f"{chunk_size:12d} | {10:8d} | {len(test_chunks):>10} | {max_tokens:>12}")

    return chunks

token_chunks = demonstrate_token_splitter()
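
To see what `chunk_size` and `chunk_overlap` mean when they are counted in tokens, here is a small sketch of the sliding window a token splitter applies. It assumes whitespace-separated words as stand-in tokens (a real tokenizer such as tiktoken would produce sub-word tokens instead), and the function name `sliding_token_windows` is hypothetical.

```python
from typing import List


def sliding_token_windows(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Group tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: List[List[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows


# Stand-in tokens: whitespace-separated words (an assumption for illustration)
tokens = "one two three four five six seven eight nine ten".split()
windows = sliding_token_windows(tokens, chunk_size=4, chunk_overlap=1)
for w in windows:
    print(w)
```

Each window starts with the last token of the previous one, which is how overlap preserves context across chunk boundaries.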

##### 4. MarkdownHeaderTextSplitter - Structure-Aware Markdown Splitting

Splits Markdown documents based on headers, preserving the document hierarchy and structure. Perfect for documentation, README files, and technical content.

In [None]:
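# Compact demonstration of MarkdownHeaderTextSplitter
from langchain.text_splitter import MarkdownHeaderTextSplitter

def demonstrate_markdown_splitter():
    # Split on "#" and "##" headers; each header's title is stored in the chunk metadata
    headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
    splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    chunks = splitter.split_text(sample_texts['markdown'])
    print(f"Markdown header splitter produced {len(chunks)} chunk(s).")
    for i, c in enumerate(chunks, 1):
        print(f"--- Section {i} ---\nMetadata: {c.metadata}\n{c.page_content}\n")
    return chunks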

markdown_chunks = demonstrate_markdown_splitter()
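
The header-tracking idea behind structure-aware Markdown splitting can be sketched without any library. This is an illustrative approximation, not the library's code: the function name `split_markdown_by_headers` and the metadata layout (header level mapped to title) are assumptions, and `#` lines inside fenced code blocks are not handled.

```python
from typing import Dict, List


def split_markdown_by_headers(markdown: str) -> List[Dict]:
    """Return one section per block of body text, tagged with its header path."""
    sections: List[Dict] = []
    header_path: Dict[int, str] = {}  # header level -> current title at that level
    body: List[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append({"metadata": dict(header_path), "content": text})
        body.clear()

    for line in markdown.splitlines():
        hashes = len(line) - len(line.lstrip("#"))
        if 1 <= hashes <= 6 and line[hashes:hashes + 1] == " ":
            flush()
            # A new header closes every header at the same or a deeper level
            for level in [lvl for lvl in header_path if lvl >= hashes]:
                del header_path[level]
            header_path[hashes] = line[hashes:].strip()
        else:
            body.append(line)
    flush()
    return sections


sample_md = "# Title\n\n## Section\n\nContent under section."
md_sections = split_markdown_by_headers(sample_md)
print(md_sections)
```

Keeping the header path as metadata lets a retriever show where a chunk came from in the document hierarchy.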

##### 5. HTMLHeaderTextSplitter - HTML Structure Parsing

Similar to Markdown splitter but for HTML documents. Preserves HTML header hierarchy and structure.

In [None]:
# @title
from langchain.text_splitter import HTMLHeaderTextSplitter

def demonstrate_html_splitter():
    """
    HTMLHeaderTextSplitter: Preserves HTML document structure

    How it works:
    - Splits on HTML header tags (h1, h2, h3, etc.)
    - Extracts text while preserving hierarchy
    - Metadata includes header hierarchy

    Use Cases:
    - Web scraped content
    - HTML documentation
    - Blog posts
    - Web articles

    Advantages:
    - Understands HTML structure
    - Removes HTML tags but preserves organization
    - Maintains semantic sections
    """
    print("\n" + "=" * 80)
    print("5. HTML HEADER TEXT SPLITTER")
    print("=" * 80)

    # Define HTML headers to split on
    headers_to_split_on = [
        ("h1", "Header 1"),
        ("h2", "Header 2"),
        ("h3", "Header 3"),
    ]

    splitter = HTMLHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on
    )

    # Split HTML document
    chunks = splitter.split_text(sample_texts['html'])

    print(f"\n📊 HTML splitting results:")
    print(f"   Original HTML length: {len(sample_texts['html'])} characters")
    print(f"   Number of sections: {len(chunks)}")

    print(f"\n📝 Extracted sections with hierarchy:")
    for i, chunk in enumerate(chunks, 1):
        print(f"\n--- Section {i} ---")
        print(f"Metadata: {chunk.metadata}")
        print(f"Content: {chunk.page_content}")
        print(f"Length: {len(chunk.page_content)} chars")

    # Show hierarchy structure
    print(f"\n\n🌳 Document structure:")
    for i, chunk in enumerate(chunks, 1):
        indent = "  " * (len(chunk.metadata) - 1)
        headers = " > ".join([f"{k}: {v}" for k, v in chunk.metadata.items()])
        print(f"{indent}{i}. {headers}")

    return chunks

html_chunks = demonstrate_html_splitter()
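
For intuition, the same header-tracking approach can be approximated with the standard library's `html.parser`. This is a rough sketch under stated assumptions, not how `HTMLHeaderTextSplitter` is implemented; the class name `HeaderSectionParser` and the helper `split_html_by_headers` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List


class HeaderSectionParser(HTMLParser):
    """Collect body text into sections tagged with the enclosing header titles."""

    def __init__(self, header_tags=("h1", "h2", "h3")):
        super().__init__()
        self.header_tags = tuple(header_tags)
        self.path: Dict[str, str] = {}     # header tag -> current title
        self.sections: List[Dict] = []
        self._open_header = None           # header tag currently being read, if any
        self._header_buf: List[str] = []
        self._body_buf: List[str] = []

    def flush(self) -> None:
        text = " ".join("".join(self._body_buf).split())
        if text:
            self.sections.append({"metadata": dict(self.path), "content": text})
        self._body_buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self.flush()  # a new header ends the previous section
            self._open_header = tag
            self._header_buf = []

    def handle_endtag(self, tag):
        if tag == self._open_header:
            level = self.header_tags.index(tag)
            for deeper in self.header_tags[level:]:
                self.path.pop(deeper, None)  # reset this level and deeper ones
            self.path[tag] = "".join(self._header_buf).strip()
            self._open_header = None

    def handle_data(self, data):
        if self._open_header:
            self._header_buf.append(data)
        else:
            self._body_buf.append(data)


def split_html_by_headers(html: str) -> List[Dict]:
    parser = HeaderSectionParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.sections


html_sections = split_html_by_headers("<h1>Title</h1><p>Paragraph about RAG.</p>")
print(html_sections)
```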

##### 6. Language-Specific Code Splitter - Syntax-Aware Code Splitting

Splits source code based on programming language syntax. Understands language-specific constructs like functions, classes, and methods.
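
As a minimal sketch of syntax-aware splitting, the snippet below cuts Python source at top-level statements using the standard `ast` module. It is an illustration only: the function name `split_python_source` is hypothetical, it handles Python alone, and comments or blank lines between top-level statements are dropped.

```python
import ast
from typing import List


def split_python_source(source: str) -> List[str]:
    """Return one chunk per top-level statement (function, class, or plain statement)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: List[str] = []
    for node in tree.body:
        # Include decorators, which sit above the line reported for the definition
        decorator_lines = [d.lineno for d in getattr(node, "decorator_list", [])]
        start = min(decorator_lines + [node.lineno]) - 1
        end = node.end_lineno  # available on Python 3.8+
        chunks.append("\n".join(lines[start:end]))
    return chunks


code_sample = "def add(a, b):\n    return a + b\n\nprint(add(1,2))"
code_chunks = split_python_source(code_sample)
for chunk in code_chunks:
    print("--- chunk ---")
    print(chunk)
```

Because the boundaries come from the parser, a function body is never cut in half the way a fixed character window might cut it.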

In [None]:
# Compact chunking utilities (pure Python, didactic).
# Note: these are generic fixed-size helpers, not a syntax-aware code splitter.
from typing import List, Dict

def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> List[Dict]:
    """Split text into fixed-size chunks with simple overlap (pure Python)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be greater than overlap")
    step = chunk_size - overlap
    chunks = []
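    for i in range(0, len(text), step):
        chunk = text[i : i + chunk_size]
        chunks.append({"id": f"chunk_{i//step}", "content": chunk, "start": i, "end": i + len(chunk)})
        if i + chunk_size >= len(text):
            break  # text fully covered; skip a trailing chunk that would contain only overlap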
    return chunks


def context_enrich_chunking(text: str, chunk_size: int = 400, overlap: int = 40, doc_title: str = "Document") -> List[Dict]:
    """Create small context-enriched chunks with a simple prefix for teaching purposes."""
    base = fixed_size_chunk(text, chunk_size=chunk_size, overlap=overlap)
    enriched = []
    for c in base:
        prefix = f"[Document: {doc_title}] [Pos: {c['start']}..{c['end']}]\n\n"
        enriched.append({"id": c["id"], "content": c["content"], "enriched": prefix + c["content"]})
    return enriched

# Register into tutorial_state for reuse in later examples
tutorial_state.setdefault("chunking", {})
tutorial_state["chunking"]["fixed_size_chunk"] = fixed_size_chunk
tutorial_state["chunking"]["context_enrich_chunking"] = context_enrich_chunking

print('Compact chunking utilities ready (fixed_size_chunk, context_enrich_chunking).')

##### 7. LatexTextSplitter - LaTeX Document Splitting

Specialized splitter for LaTeX documents, preserving mathematical equations and LaTeX-specific structure.

In [None]:
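from langchain.text_splitter import LatexTextSplitter

# The shared sample_texts dict has no LaTeX entry, so add a small sample for this demo
sample_texts.setdefault('latex', (
    "\\section{Introduction}\n"
    "Retrieval-augmented generation combines search with text generation.\n\n"
    "\\subsection{Similarity}\n"
    "Cosine similarity is defined as\n"
    "$$\\cos(\\theta) = \\frac{A \\cdot B}{\\|A\\| \\|B\\|}$$\n"
))

def demonstrate_latex_splitter():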
    """
    LatexTextSplitter: Specialized for LaTeX documents

    How it works:
    - Recognizes LaTeX structure (sections, subsections)
    - Preserves equation environments
    - Understands LaTeX-specific syntax

    Use Cases:
    - Academic papers
    - Mathematical content
    - Scientific publications
    - Technical reports with equations

    Advantages:
    - Preserves math equations
    - Respects LaTeX structure
    - Maintains semantic units
    """
    print("\n" + "=" * 80)
    print("7. LATEX TEXT SPLITTER")
    print("=" * 80)

    splitter = LatexTextSplitter(
        chunk_size=200,
        chunk_overlap=20
    )

    chunks = splitter.split_text(sample_texts['latex'])

    print(f"\n📊 LaTeX splitting results:")
    print(f"   Original LaTeX length: {len(sample_texts['latex'])} characters")
    print(f"   Number of chunks: {len(chunks)}")

    print(f"\n📝 LaTeX chunks (preserving equations):")
    for i, chunk in enumerate(chunks, 1):
        print(f"\n--- Chunk {i} ({len(chunk)} chars) ---")
        print(chunk)

    print(f"\n\n🔧 LaTeX-specific separators used:")
    latex_separators = [
        '\\n\\\\chapter{',
        '\\n\\\\section{',
        '\\n\\\\subsection{',
        '\\n\\\\subsubsection{',
        '\\n\\\\begin{enumerate}',
        '\\n\\\\begin{itemize}',
        '\\n\\\\begin{description}',
        '\\n\\\\begin{list}',
        '\\n\\\\begin{quote}',
        '\\n\\\\begin{quotation}',
        '\\n\\\\begin{verse}',
        '\\n\\\\begin{verbatim}',
        '\\n\\\\begin{align}',
        '\\n$$',
        '\\n\\n',
        '\\n',
        ' ',
        ''
    ]

    print("   Priority order:")
    for i, sep in enumerate(latex_separators[:10], 1):
        print(f"   {i}. {repr(sep):30s}")
    print(f"   ... and {len(latex_separators) - 10} more")

    return chunks

latex_chunks = demonstrate_latex_splitter()

##### 8. NLTKTextSplitter - Sentence-Based Splitting

Uses NLTK (Natural Language Toolkit) for proper sentence boundary detection. More accurate than simple character-based splitting for natural language.

In [None]:
# @title
from langchain.text_splitter import NLTKTextSplitter

def demonstrate_nltk_splitter():
    """
    NLTKTextSplitter: Sentence-aware splitting using NLTK

    How it works:
    - Uses NLTK's sentence tokenizer
    - Handles abbreviations (Dr., Mr., etc.)
    - Understands sentence boundaries better than regex

    Use Cases:
    - Natural language text
    - When sentence integrity is crucial
    - Content where periods appear in abbreviations
    - Proper grammatical splitting

    Advantages:
    - More accurate sentence detection
    - Handles edge cases (abbreviations, decimals)
    - Linguistic awareness

    Note: Requires NLTK punkt tokenizer
    """
    print("\n" + "=" * 80)
    print("8. NLTK TEXT SPLITTER")
    print("=" * 80)

    # Download NLTK data if not already present
    try:
        import nltk
        try:
            nltk.data.find('tokenizers/punkt')
        except LookupError:
            print("📥 Downloading NLTK punkt tokenizer...")
            nltk.download('punkt', quiet=True)
            nltk.download('punkt_tab', quiet=True)
    except Exception as e:
        print(f"⚠️  NLTK not available: {e}")
        print("   Install with: pip install nltk")
        return []

    splitter = NLTKTextSplitter(
        chunk_size=200,
        chunk_overlap=20
    )

    # Create text with tricky sentence boundaries
    tricky_text = """
Dr. Smith works at A.I. Labs Inc. in the U.S.A. He published a paper in 2024.
The research showed 95.5% accuracy. Mr. Johnson said, "This is great!"
However, Ms. Brown noted several issues. The model costs $1,000.50 to run.
Prof. Anderson's team achieved better results. They used GPT-4 for testing.
"""

    chunks = splitter.split_text(tricky_text)

    print(f"\n📊 NLTK splitting results:")
    print(f"   Original text length: {len(tricky_text)} characters")
    print(f"   Number of chunks: {len(chunks)}")

    print(f"\n📝 How NLTK handles sentence boundaries:")
    print("   Original text with abbreviations:")
    print(tricky_text)

    print(f"\n   Split into {len(chunks)} chunks:")
    for i, chunk in enumerate(chunks, 1):
        print(f"\n   Chunk {i}:")
        print(f"   {chunk}")

    # Compare with simple period-based splitting
    print(f"\n\n🔬 Comparison: NLTK vs Simple Period Split")
    simple_chunks = tricky_text.split('. ')

    print(f"   NLTK splitting: {len(chunks)} chunks (sentence-aware)")
    print(f"   Period splitting: {len(simple_chunks)} chunks (breaks on abbreviations)")

    print(f"\n   Simple period split (incorrect):")
    for i, chunk in enumerate(simple_chunks[:5], 1):
        print(f"   {i}. {chunk[:50]}...")

    return chunks

try:
    nltk_chunks = demonstrate_nltk_splitter()
except Exception as e:
    print(f"\n⚠️  NLTK splitter demo skipped: {e}")
    nltk_chunks = []

##### 9. SpacyTextSplitter - Advanced NLP-Based Splitting

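Uses spaCy's sentence segmentation for linguistically aware splitting. The same spaCy pipeline can also expose features such as named entity recognition and dependency parsing, although the splitter itself only needs sentence boundaries.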

In [None]:
# @title
from langchain.text_splitter import SpacyTextSplitter

def demonstrate_spacy_splitter():
    """
    SpacyTextSplitter: Advanced NLP-based splitting

    How it works:
    - Uses spaCy's linguistic models
    - Understands sentence structure
    - Can leverage NER, POS tagging, dependencies

    Use Cases:
    - Complex natural language text
    - When linguistic features matter
    - Multi-language support
    - Advanced text analysis

    Advantages:
    - Most sophisticated sentence detection
    - Multi-language support
    - Access to linguistic features
    - Production-grade NLP

    Note: Requires spaCy and language model
    """
    print("\n" + "=" * 80)
    print("9. SPACY TEXT SPLITTER")
    print("=" * 80)

    try:
        import spacy
        # Try to load a spacy model
        try:
            nlp = spacy.load("en_core_web_sm")
        except OSError:
            print("📥 spaCy model not found. Install with:")
            print("   python -m spacy download en_core_web_sm")
            print("\n⚠️  Skipping spaCy demo")
            return []

        splitter = SpacyTextSplitter(
            chunk_size=200,
            chunk_overlap=20,
            pipeline="en_core_web_sm"
        )

        chunks = splitter.split_text(sample_texts['article'])

        print(f"\n📊 spaCy splitting results:")
        print(f"   Original text length: {len(sample_texts['article'])} characters")
        print(f"   Number of chunks: {len(chunks)}")
        print(f"   Chunk sizes: {[len(chunk) for chunk in chunks]}")

        print(f"\n📝 Sample chunks:")
        for i, chunk in enumerate(chunks[:2], 1):
            print(f"\n--- Chunk {i} ---")
            print(chunk[:200] + "..." if len(chunk) > 200 else chunk)

        # Demonstrate linguistic awareness
        print(f"\n\n🧠 spaCy's linguistic awareness:")
        doc = nlp(sample_texts['article'][:300])

        print(f"\n   Detected {len(list(doc.sents))} sentences")
        print(f"   Named entities found:")
        for ent in doc.ents:
            print(f"      - {ent.text:20s} ({ent.label_})")

        print(f"\n✅ spaCy provides the most sophisticated NLP-based splitting")

        return chunks

    except ImportError:
        print("⚠️  spaCy not installed. Install with:")
        print("   pip install spacy")
        print("   python -m spacy download en_core_web_sm")
        return []
    except Exception as e:
        print(f"⚠️  Error in spaCy demo: {e}")
        return []

spacy_chunks = demonstrate_spacy_splitter()

#### Chunking



<img src="https://www.dailydoseofds.com/content/images/2024/11/chunking-rag-1.gif" width=700>

While **splitting** divides documents at natural boundaries (paragraphs, headers, sentences), **chunking** is the strategic process of creating optimally-sized, semantically coherent units for embedding and retrieval. Chunking goes beyond simple text division: it is about engineering information containers that suit your RAG system.

**The Critical Distinction:**

| Aspect | **Splitting** | **Chunking** |
|--------|---------------|--------------|
| **Purpose** | Break documents at natural boundaries | Create optimal retrieval units |
| **Focus** | Structure and syntax | Semantics and meaning |
| **Output** | Text pieces of varying utility | Engineered information containers |
| **Optimization** | Readability, structure preservation | Retrieval accuracy, embedding quality |

**Why Chunking Matters:**

The quality of your chunks directly determines RAG system performance:

$$\text{RAG}_{\text{quality}} = f(\text{chunk}_{\text{semantics}},\ \text{chunk}_{\text{size}},\ \text{chunk}_{\text{overlap}},\ \text{context}_{\text{preservation}})$$

**Key Challenges:**

1. **Context Window vs. Precision**: Larger chunks provide more context but reduce retrieval precision
2. **Semantic Coherence**: Chunks must be self-contained and meaningful
3. **Embedding Quality**: Chunks must fit embedding model constraints while maintaining semantic integrity
4. **Retrieval Granularity**: Finding the sweet spot between too specific and too generic

**The Chunking Spectrum:**

```
Fixed-Size → Sentence-Based → Semantic → Hierarchical → Context-Enriched
 (Simple)       (Better)      (Smart)     (Advanced)      (Production)
```
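
The second step on this spectrum, sentence-based chunking, can be sketched as follows. This is a simplified illustration: the regex sentence splitter is a crude stand-in for NLTK or spaCy, the function name `sentence_chunks` is hypothetical, and a chunk may slightly exceed `max_chars` when the carried-over sentence is long.

```python
import re
from typing import List


def sentence_chunks(text: str, max_chars: int = 200, overlap_sentences: int = 1) -> List[str]:
    """Pack whole sentences into chunks, repeating the last sentence(s) as overlap."""
    # Naive boundary rule: split after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the trailing sentence(s) into the next chunk for context
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


demo_text = (
    "Neural networks learn patterns from data. "
    "They use layers, backpropagation and optimization. "
    "This short sample demonstrates chunking."
)
demo_chunks = sentence_chunks(demo_text, max_chars=100, overlap_sentences=1)
for c in demo_chunks:
    print(len(c), c)
```

Unlike fixed-size windows, no sentence is cut in the middle, and the overlap is a complete sentence rather than an arbitrary character span.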

##### Chunking Strategies Overview

We'll explore chunking from simple to sophisticated, using both **LangChain** (primary, most common) and **LlamaIndex** (advanced, for complex hierarchies).

In [None]:
# Demonstrate the compact chunking utilities on a short sample
sample_document = """
Neural networks learn patterns from data. They use layers, backpropagation and optimization.
This short sample demonstrates chunking behavior in a compact, deterministic way.
"""

print("\n🎯 Compact chunking demonstration")
print("-" * 50)

fixed = tutorial_state["chunking"]["fixed_size_chunk"](sample_document, chunk_size=80, overlap=10)
print(f"Fixed-size chunks produced: {len(fixed)}")
print(f" First chunk preview:\n{fixed[0]['content'][:120]}\n")

enriched = tutorial_state["chunking"]["context_enrich_chunking"](sample_document, chunk_size=80, overlap=10, doc_title="Neural Networks Overview")
print(f"Context-enriched chunks produced: {len(enriched)}")
print(f" Sample enriched preview:\n{enriched[0]['enriched'][:200]}\n")

# Store simple results for later demonstration
tutorial_state.setdefault('chunking_results', {})
tutorial_state['chunking_results']['fixed'] = fixed
tutorial_state['chunking_results']['enriched'] = enriched

print('Chunking demo complete; results saved to tutorial_state["chunking_results"].')

##### 1. LangChain Chunking Strategies (Primary Approach)

LangChain provides flexible chunking through its text splitters, which we covered in the Splitting section. Here we'll focus on **how to optimize them specifically for chunking** in RAG systems.

In [None]:
# Minimal orchestrator example: demonstrates a simple RAG + skill + tool flow
import json  # used below to serialize and parse the simulated MCP payload

def orchestrator(query: str) -> dict:
    """A tiny orchestration pattern for teaching.
    Steps shown:
    1. Retrieve (from the simple in-memory doc_loader)
    2. Run a skill (from the skill registry)
    3. Call an MCP-like tool (simulated)
    4. Aggregate results and return a clear, labelled dict
    """
    results = {}

    # 1) Retrieval (very small, synchronous)
    docs = tutorial_state.get('doc_loader').documents if tutorial_state.get('doc_loader') else []
    retrieved = [d for d in docs if query.lower() in d.get('content','').lower()]
    results['retrieved_count'] = len(retrieved)
    results['retrieved_preview'] = retrieved[0]['content'][:200] if retrieved else None

    # 2) Run a skill if available
    skill_output = None
    if 'skills' in tutorial_state and 'financial_analysis' in tutorial_state['skills']:
        skill_res = run_skill('financial_analysis', query)
        skill_output = {'success': skill_res.success, 'output': skill_res.output, 'confidence': skill_res.confidence}
    results['skill'] = skill_output

    # 3) Call an MCP-like tool (simulated)
    try:
        mcp_data = mcp_read_resource_sync('analytics') if 'mcp_read_resource_sync' in globals() else '{}'
    except Exception as e:
        mcp_data = json.dumps({'error': str(e)})
    results['mcp_analytics'] = json.loads(mcp_data) if isinstance(mcp_data, str) else mcp_data

    # 4) Simple generation/decision step (simulated LLM output)
    # For teaching, avoid real LLM calls; produce an instructive summary instead
    results['summary'] = (
        f"Orchestrator ran retrieval ({results['retrieved_count']} docs), "
        f"ran skill: {bool(skill_output)}, and fetched analytics."
    )

    return results

# Example usage (teaching demo)
demo_q = "retrieval"  # change this string to test different flows
print('\n--- ORCHESTRATOR DEMO ---')
print('Query:', demo_q)
print('Result:', orchestrator(demo_q))

##### 2. LlamaIndex Chunking Strategies (Advanced Approach)

LlamaIndex takes a more sophisticated approach by creating **Nodes** instead of simple text chunks. Nodes are first-class objects with rich metadata, relationships, and hierarchy.

**Key Differences from LangChain:**

| Aspect | LangChain | LlamaIndex |
|--------|-----------|------------|
| **Output** | Text strings | Node objects with metadata |
| **Relationships** | Manual | Automatic parent-child links |
| **Metadata** | Basic | Rich, structured metadata |
| **Use Case** | Flexible text processing | Index-centric workflows |
| **Integration** | Works with any system | Optimized for LlamaIndex indexes |

**When to Use LlamaIndex:**
- Building complex document hierarchies
- Need automatic parent-child relationships
- Using LlamaIndex for indexing/retrieval
- Require sophisticated metadata extraction
- Want semantic-based splitting (embedding-aware)
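
To make the node idea concrete, here is a plain-Python sketch of chunks that carry metadata and parent-child links. It does not use the LlamaIndex API: the `Node` dataclass, its field names, and `build_hierarchy` are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)
    parent_id: Optional[str] = None
    child_ids: List[str] = field(default_factory=list)


def build_hierarchy(doc_title: str, sections: Dict[str, str]) -> Dict[str, Node]:
    """Create a document node plus one child node per section."""
    root = Node(node_id="doc", text=doc_title, metadata={"level": "document"})
    nodes = {root.node_id: root}
    for i, (heading, body) in enumerate(sections.items()):
        child = Node(
            node_id=f"sec_{i}",
            text=body,
            metadata={"level": "section", "heading": heading},
            parent_id=root.node_id,
        )
        root.child_ids.append(child.node_id)  # parent keeps track of its children
        nodes[child.node_id] = child
    return nodes


nodes = build_hierarchy("Guide", {"Intro": "What RAG is.", "Usage": "How to query."})
hit = nodes["sec_1"]
print(hit.metadata["heading"], "->", nodes[hit.parent_id].text)
```

With links like these, a retriever that matches a small section can walk up to its parent to pull in broader context.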

In [None]:
# LlamaIndex conceptual demo (kept minimal for tutorial)
try:
    import llama_index  # type: ignore
    LLAMAINDEX_AVAILABLE = True
    print("✅ LlamaIndex import found. Advanced demos can be enabled if you install dependencies.")
except Exception:
    LLAMAINDEX_AVAILABLE = False
    print("⚠️ LlamaIndex not installed. To run advanced node parsing demos, install: pip install llama-index")


def llamaindex_simple_node_concept(text: str):
    """Return a short conceptual description of what node parsing would produce."""
    if not LLAMAINDEX_AVAILABLE:
        return {"conceptual": True, "nodes": int(max(1, len(text) // 400)), "note": "Install llama-index to run real node parsing."}
    # If available, a small real demo could be added here (kept out of scope)
    return {"conceptual": False, "nodes": 0}


def llamaindex_semantic_splitter_concept(text: str):
    """Conceptual semantic splitter explanation/result for teaching."""
    return {"conceptual": True, "explanation": "Semantic splitting groups sentences by embedding similarity; requires an embedding model."}

# Register simple conceptual helpers
tutorial_state.setdefault('llamaindex', {})
tutorial_state['llamaindex']['simple_node_concept'] = llamaindex_simple_node_concept
tutorial_state['llamaindex']['semantic_splitter_concept'] = llamaindex_semantic_splitter_concept

print('LlamaIndex conceptual helpers registered in tutorial_state.')

#### Embedding


<img src="https://framerusercontent.com/images/v8f1U7fmqjvMy7Rcq8qGO2WJpTI.png?width=960&height=540" width=700>

**What Are Document Embeddings?**

Document embeddings are **dense numerical vector representations** of text that capture semantic meaning in a high-dimensional space. Unlike simple keyword matching, embeddings encode the *meaning* and *context* of text, allowing machines to understand that "car" and "automobile" are similar, or that "king" relates to "queen" in a way similar to how "man" relates to "woman".

**Key Characteristics**

- **Dense Vectors**: Typically 384 to 1536+ dimensions (depending on the model)
- **Semantic Representation**: Similar meanings → similar vectors
- **Fixed Length**: Any text length → fixed-size vector
- **Learned Representations**: Trained on massive text corpora to capture language patterns

**Example Visualization**

- Text: "The cat sat on the mat" → Embedding: `[0.23, -0.45, 0.67, ..., 0.12]` (768 dimensions)
- Text: "A feline rested on the rug" → Embedding: `[0.21, -0.43, 0.69, ..., 0.15]` (very similar values)
- Text: "Python programming language" → Embedding: `[-0.82, 0.31, -0.15, ..., 0.91]` (very different)

In a RAG system, embeddings serve as the **bridge between natural language queries and relevant documents**. Here's how they fit into the pipeline:

1. **Indexing Phase** (Offline)

   `Documents → Split into Chunks → Generate Embeddings → Store in Vector DB`

   - Break documents into semantic chunks (paragraphs, passages)
   - Convert each chunk to an embedding vector
   - Store vectors with metadata in a vector database

2. **Retrieval Phase** (Online)

   `User Query → Generate Query Embedding → Search Similar Vectors → Retrieve Top-K Chunks`

   - Convert the user's question to an embedding (using the same model as indexing)
   - Find the vectors closest to the query vector (for example, by cosine similarity)
   - Return the most semantically relevant document chunks

3. **Generation Phase**

   `Retrieved Chunks + Query → LLM → Grounded Response`

   - Feed the relevant chunks to the LLM as context
   - The LLM generates an answer based on the retrieved information
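
The retrieval phase above can be sketched in a few lines of plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings (an assumption for illustration; a real system would obtain them from an embedding model), and the names `cosine_similarity` and `retrieve_top_k` are hypothetical.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_vec: List[float],
                   index: List[Tuple[str, List[float]]],
                   k: int = 2) -> List[Tuple[float, str]]:
    """Score every stored chunk against the query and return the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# "Vector DB": chunk text paired with a toy, hand-written embedding
index = [
    ("Repairing a dripping tap requires a new washer.", [0.9, 0.1, 0.0]),
    ("Python is a programming language.", [0.0, 0.2, 0.9]),
    ("Fixing leaks in kitchen plumbing.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # toy embedding for "How do I fix a leaky faucet?"

results = retrieve_top_k(query_vec, index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The retrieved chunks would then be placed in the LLM prompt alongside the original question for the generation phase.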

**Why Embeddings Are Critical for RAG**

**Traditional Keyword Search Limitations**

- Query: "How do I fix a leaky faucet?"
- Document: "Repairing a dripping tap requires..."
- ❌ Keyword match: Poor (no shared words)
- ✅ Semantic match: Excellent (same meaning)



**Embedding-Based Search Advantages**

1. **Semantic Understanding**
   - Matches meaning, not just words
   - Handles synonyms, paraphrasing, and context

2. **Multi-lingual Capability**
   - Cross-language retrieval is possible with multilingual embedding models
   - "hello" can match "bonjour" in a shared embedding space

3. **Contextual Nuance**
   - "bank" (financial) vs "bank" (river) distinguished by context
   - Homonyms are disambiguated by the surrounding text rather than matched as identical strings

4. **Ranked Relevance**
   - Similarity scores for ranking results
   - Top-K retrieval returns best matches

**Embedding Space Intuition**

Think of embedding space as a **map of meaning**:



In [None]:
        Pets
      /      \
   Dogs      Cats
    |         |
  Puppy    Kitten

    (far away)

   Programming
      /    \
  Python  JavaScript



- Related concepts cluster together
- Distance = semantic similarity
- Queries find nearest neighbors in this space

**What Makes Good Embeddings for RAG?**

1. **Domain Alignment**: Trained on relevant data (general vs specialized)
2. **Dimensionality**: Balance between expressiveness and compute (384-1536)
3. **Consistency**: Same model for indexing and querying
4. **Retrieval Optimization**: Some models trained specifically for search tasks



**Types of Embeddings and Their Mathematics**

**Word Embeddings vs Document Embeddings**

**Word Embeddings** represent individual words as vectors, while **Document Embeddings** represent entire passages, sentences, or documents. For RAG, we primarily use document embeddings since we need to encode chunks of text.

**Word-Level Examples:**
- Word2Vec (2013)
- GloVe (2014)
- FastText (2016)

**Document-Level Examples:**
- Sentence-BERT (2019)
- Universal Sentence Encoder
- OpenAI Ada-002
- BGE, E5, Instructor models (2023+)

**The Mathematics Behind Embeddings**

##### Cosine Similarity: The Core Metric

The most common way to measure similarity between embeddings:



$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Where:
- $A \cdot B$ = dot product (sum of element-wise multiplication)
- $\|A\|$ = magnitude/length of vector $A$
- Result ranges from -1 (opposite) to 1 (identical direction)



**Example calculation:**



In [None]:
import numpy as np

# Two embedding vectors
embedding_a = np.array([0.5, 0.8, -0.3, 0.6])
embedding_b = np.array([0.6, 0.7, -0.2, 0.5])

# Dot product
dot_product = np.dot(embedding_a, embedding_b)
# 0.5*0.6 + 0.8*0.7 + (-0.3)*(-0.2) + 0.6*0.5 = 1.22

# Magnitudes
magnitude_a = np.linalg.norm(embedding_a)  # sqrt(1.34) ≈ 1.158
magnitude_b = np.linalg.norm(embedding_b)  # sqrt(1.14) ≈ 1.068

# Cosine similarity
similarity = dot_product / (magnitude_a * magnitude_b)
# = 1.22 / (1.158 * 1.068) ≈ 0.987 (very similar!)

**Euclidean Distance** (L2 distance):

$$\text{distance}(A, B) = \sqrt{\sum_i (A_i - B_i)^2}$$



**Dot Product** (without normalization):

$$\text{similarity}(A, B) = \sum_i A_i B_i$$

**Manhattan Distance** (L1 distance):

$$\text{distance}(A, B) = \sum_i |A_i - B_i|$$



**When to use which:**
- **Cosine similarity**: Most common, ignores magnitude (only direction matters)
- **Euclidean distance**: When magnitude matters (rare in semantic search)
- **Dot product**: Faster, used when vectors are already normalized
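
The relationship between these metrics is easy to check numerically. The sketch below uses only the standard library and reuses the two example vectors from the cosine calculation; the helper names are chosen here for illustration. It shows that for unit-length vectors the dot product equals cosine similarity, and squared Euclidean distance equals $2 - 2\cos$.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def euclidean(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(a: List[float]) -> List[float]:
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]


a = [0.5, 0.8, -0.3, 0.6]
b = [0.6, 0.7, -0.2, 0.5]
ua, ub = normalize(a), normalize(b)

print(f"cosine(a, b)          = {cosine(a, b):.4f}")
print(f"dot(unit a, unit b)   = {dot(ua, ub):.4f}")   # same value as cosine
print(f"euclidean(a, b)       = {euclidean(a, b):.4f}")
print(f"manhattan(a, b)       = {manhattan(a, b):.4f}")
print(f"euclid^2 on unit vecs = {euclidean(ua, ub) ** 2:.4f}")
print(f"2 - 2*cosine          = {2 - 2 * cosine(a, b):.4f}")
```

This is why many vector stores normalize embeddings once at indexing time and then rank with the cheaper dot product.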

**Types of Embedding Models**

##### 1. Dense Embeddings (Standard Approach)

**What they are:** Every dimension has a non-zero value, creating a compact representation.

**Characteristics:**
- Fixed-size vectors (typically 384, 768, or 1536 dimensions)
- All values contribute to meaning
- Efficient for similarity search

**Popular Models:**



In [None]:
# Sentence Transformers (most popular for RAG)
from sentence_transformers import SentenceTransformer

# Small, fast model (384 dimensions)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Larger, more accurate (768 dimensions)
model = SentenceTransformer('all-mpnet-base-v2')

# Optimized for retrieval (1024 dimensions)
model = SentenceTransformer('BAAI/bge-large-en-v1.5')

# Generate embeddings
texts = ["This is a document", "Another document"]
embeddings = model.encode(texts)
print(embeddings.shape)  # (2, 1024) for bge-large-en-v1.5; MiniLM gives (2, 384), mpnet (2, 768)



**When to use:**
- General-purpose RAG applications
- Need good out-of-the-box performance
- Have moderate compute resources

##### 2. Sparse Embeddings (BM25-style)

**What they are:** Most dimensions are zero, only a few non-zero values representing keywords.

**Mathematics (BM25 algorithm):**



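$$\text{BM25}(\text{query}, \text{doc}) = \sum_{\text{term} \in \text{query}} \text{IDF}(\text{term}) \cdot \frac{\text{freq} \cdot (k_1 + 1)}{\text{freq} + k_1 \cdot \left(1 - b + b \cdot \frac{|\text{doc}|}{\text{avgdl}}\right)}$$

Where:
- $\text{IDF} = \log\left(\frac{N - n + 0.5}{n + 0.5}\right)$
- $N$ = total documents
- $n$ = documents containing the term
- $\text{freq}$ = term frequency in the document
- $k_1$, $b$ = tuning parameters (typically 1.5, 0.75)
- $|\text{doc}|$ = document length
- $\text{avgdl}$ = average document length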
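
The formula above can be written out directly in plain Python. This is a didactic sketch with a hypothetical function name `bm25_scores`, assuming pre-tokenized, lowercased input. It uses the IDF exactly as stated, which goes negative for terms that appear in more than half of the documents; some libraries adjust for that case.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], corpus: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document in corpus against the tokenized query."""
    N = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / N
    # n per term: number of documents that contain the term at least once
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores: List[float] = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            freq = term_freq.get(term, 0)
            if freq == 0:
                continue  # terms absent from the document contribute nothing
            n = doc_freq[term]
            idf = math.log((N - n + 0.5) / (n + 0.5))
            score += idf * (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores


toy_corpus = [
    "the cat sat on the mat".split(),
    "the dog played in the park".split(),
    "cats and dogs are pets".split(),
]
toy_scores = bm25_scores("cat pet".split(), toy_corpus)
print(toy_scores)
```

Only the first document scores above zero here: exact token matching means "cat" does not match "cats" and "pet" does not match "pets", which is the kind of gap dense embeddings help close.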



**Implementation:**



In [None]:
from rank_bm25 import BM25Okapi

# Tokenize documents
corpus = [
    "The cat sat on the mat",
    "The dog played in the park",
    "Cats and dogs are pets"
]

tokenized_corpus = [doc.split() for doc in corpus]

# Create BM25 index
bm25 = BM25Okapi(tokenized_corpus)

# Query
query = "cat pet"
tokenized_query = query.split()

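# Get scores for all documents
scores = bm25.get_scores(tokenized_query)
print(scores)  # only the first document scores above zero: it contains the exact token "cat";
               # "pet" does not match "pets" and "cat" does not match "Cats" without normalization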

# Get top documents
top_docs = bm25.get_top_n(tokenized_query, corpus, n=2)



**When to use:**
- Exact keyword matching is important
- Domain-specific terminology
- Hybrid with dense embeddings for best results

##### 3. Hybrid Embeddings (Dense + Sparse)

**What they are:** Combine semantic understanding (dense) with keyword precision (sparse).

**Implementation:**



In [None]:
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    def __init__(self, documents, alpha=0.5):
        """
        alpha: weight for dense (1-alpha for sparse)
        alpha=0.5 means equal weight
        """
        self.documents = documents
        self.alpha = alpha

        # Dense embeddings (L2-normalized so that dot product equals cosine similarity)
        self.dense_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.doc_embeddings = self.dense_model.encode(documents, normalize_embeddings=True)

        # Sparse embeddings (BM25)
        tokenized = [doc.split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve(self, query, top_k=5):
        # Dense scores (cosine similarity), min-max scaled to [0, 1]
        query_embedding = self.dense_model.encode([query], normalize_embeddings=True)[0]
        dense_scores = np.dot(self.doc_embeddings, query_embedding)
        # The small epsilon avoids division by zero when all scores are equal
        dense_scores = (dense_scores - dense_scores.min()) / (dense_scores.max() - dense_scores.min() + 1e-10)

        # Sparse scores (BM25)
        sparse_scores = self.bm25.get_scores(query.split())
        sparse_scores = (sparse_scores - sparse_scores.min()) / (sparse_scores.max() - sparse_scores.min() + 1e-10)

        # Combine scores
        hybrid_scores = self.alpha * dense_scores + (1 - self.alpha) * sparse_scores

        # Get top-k
        top_indices = np.argsort(hybrid_scores)[-top_k:][::-1]
        return [(self.documents[i], hybrid_scores[i]) for i in top_indices]

# Usage
docs = [
    "Machine learning models require training data",
    "Deep learning uses neural networks with many layers",
    "Python is popular for data science and ML"
]

retriever = HybridRetriever(docs, alpha=0.7)  # 70% dense, 30% sparse
results = retriever.retrieve("neural network training", top_k=2)

for doc, score in results:
    print(f"Score: {score:.3f} | Doc: {doc}")



**When to use:**
- Best of both worlds for production systems
- Need both semantic and keyword matching
- Can tune alpha based on your domain

##### 4. Cross-Encoders (Re-ranking)

**What they are:** Instead of separate embeddings, they encode query + document together for scoring.

**Key difference:**
- **Bi-encoders** (standard): Encode query and doc separately, compare embeddings (fast)
- **Cross-encoders**: Encode query+doc together (slow but accurate)

**Mathematics:**


bi-encoder: similarity(embed(query), embed(doc))
cross-encoder: score(concat(query, doc))  # joint encoding



**Implementation (for re-ranking):**



In [None]:
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer

# First retrieve with bi-encoder (fast)
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
query = "How to train a neural network?"
docs = [...]  # your document corpus

query_emb = bi_encoder.encode(query)
doc_embs = bi_encoder.encode(docs)

# Get top 100 candidates (fast)
similarities = np.dot(doc_embs, query_emb)
top_100_indices = np.argsort(similarities)[-100:][::-1]
top_100_docs = [docs[i] for i in top_100_indices]

# Re-rank with cross-encoder (accurate)
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query_doc_pairs = [[query, doc] for doc in top_100_docs]
rerank_scores = cross_encoder.predict(query_doc_pairs)

# Get final top-k
final_top_k = np.argsort(rerank_scores)[-5:][::-1]
final_results = [top_100_docs[i] for i in final_top_k]



**When to use:**
- Production systems where accuracy is critical
- Two-stage retrieval: fast bi-encoder → accurate cross-encoder
- Can afford extra compute for re-ranking

#### Specialized Embedding Models

##### Domain-Specific Models

**Medical/Scientific:**


In [None]:
# BioBERT for medical text
model = SentenceTransformer('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')

# SciBERT for scientific papers
model = SentenceTransformer('allenai/scibert_scivocab_uncased')



**Code Search:**


In [None]:
# CodeBERT for code snippets
from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained('microsoft/codebert-base')



**Multi-lingual:**


In [None]:
# Works across 100+ languages
model = SentenceTransformer('paraphrase-multilingual-mpnet-base-v2')



**Embedding dimensions vs performance:**


In [None]:
# Trade-off example
model_384 = SentenceTransformer('all-MiniLM-L6-v2')   # 384d
model_768 = SentenceTransformer('all-mpnet-base-v2')  # 768d
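model_1024 = SentenceTransformer('BAAI/bge-large-en-v1.5') # 1024d

# Rough trends (benchmark on your own data before committing):
# Speed: 384d > 768d > 1024d
# Accuracy: usually 1024d > 768d > 384d, but training data matters as much as width
# Storage: 384d < 768d < 1024d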



**Normalization importance:**


In [None]:
# Always normalize for cosine similarity
from sklearn.preprocessing import normalize

embeddings = model.encode(texts)
embeddings_normalized = normalize(embeddings)

# Now dot product = cosine similarity (faster computation)
similarity = np.dot(embeddings_normalized[0], embeddings_normalized[1])
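
To see why normalization matters, the sketch below compares the raw dot product, the cosine similarity, and the dot product of L2-normalized vectors on two small hand-picked vectors. It uses plain Python lists so it runs without any model; the helper names are made up for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]   # length 5
b = [1.0, 2.0, 2.0]   # length 3

raw_dot = dot(a, b)                                  # depends on vector lengths
cos_direct = cosine(a, b)                            # length-independent
cos_via_dot = dot(l2_normalize(a), l2_normalize(b))  # same value, cheaper at query time

print(raw_dot, cos_direct, cos_via_dot)
```

The raw dot product mixes direction and magnitude, while the last two values agree. Normalizing once at indexing time means every later query only needs a dot product.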

### Storing Documents

<img src="https://d11qzsb0ksp6iz.cloudfront.net/assets/dff374c348_indexing-in-vector-database.webp" width=700>


Once we've converted documents into embeddings, we need an efficient way to store and retrieve them. This is where **vector databases** come in: specialized data stores optimized for similarity search over high-dimensional vectors.

#### Why Traditional Databases Don't Work

**Traditional SQL databases** are designed for exact matches:
```sql
SELECT * FROM documents WHERE title = 'Machine Learning';  -- Fast with index
SELECT * FROM documents WHERE vector_similarity(embedding, query) > 0.8;  -- Slow!
```

**The problem:** Computing cosine similarity against millions of vectors requires comparing the query with every single vector, an O(n) operation that doesn't scale.

In [None]:
# Naive approach (don't do this at scale!)
import numpy as np

def find_similar(query_vector, all_vectors, top_k=5):
    similarities = []
    for i, doc_vector in enumerate(all_vectors):  # O(n) - checks EVERY vector
        similarity = np.dot(query_vector, doc_vector)
        similarities.append((i, similarity))

    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_k]

# For 1 million documents with 768-dim embeddings:
# = 1M dot products × 768 multiply-adds each ≈ 768 million operations PER QUERY!
print("For 1M docs: ~768M operations per query (too slow!)")
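
Brute force is still worth keeping around: it is the right tool for small collections and the ground truth you compare an approximate index against. A slightly tighter sketch of the same exact search, using a bounded heap instead of sorting every score, could look like this (`exact_top_k` and the toy vectors are illustrative names, not part of any library):

```python
import heapq

def exact_top_k(query_vector, all_vectors, top_k=5):
    """Exact search: score every vector, keep only the k best in a heap."""
    scored = (
        (sum(q * x for q, x in zip(query_vector, vec)), idx)
        for idx, vec in enumerate(all_vectors)
    )
    # nlargest keeps a heap of size k: O(n log k) instead of sorting all n scores
    return [(idx, score) for score, idx in heapq.nlargest(top_k, scored)]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
hits = exact_top_k([1.0, 0.0], vectors, top_k=2)
print(hits)
```

The scoring pass is still linear in the number of vectors; only the selection step gets cheaper, which is why large collections need an index.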

#### Vector Databases


##### What Makes Vector Databases Special

Vector databases use specialized data structures and algorithms to enable **approximate nearest neighbor (ANN)** search. By accepting a small loss in recall, they cut query cost from O(n) to sublinear time, roughly O(log n) for graph indexes such as HNSW.

**Key capabilities:**
1. **Efficient similarity search** using specialized indexing
2. **Horizontal scaling** for billions of vectors
3. **Filtering** with metadata (dates, categories, etc.)
4. **Real-time updates** without full reindexing
5. **Multiple distance metrics** (cosine, euclidean, dot product)

##### The Mathematics of Approximate Nearest Neighbor (ANN) Search

##### 1. Locality-Sensitive Hashing (LSH)

**Core idea:** Hash similar vectors to the same bucket, so you only search within relevant buckets.

**Mathematics:**
```
Hash function h(v) maps vectors to buckets such that:
P(h(v₁) = h(v₂)) is high when similarity(v₁, v₂) is high

Example hash function (random projection):
h(v) = sign(w · v)
where w is a random unit vector
```
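
For the random-projection hash, the collision probability has a closed form: two vectors separated by an angle θ land on the same side of a random hyperplane with probability 1 − θ/π. The sketch below checks this empirically with the standard library only; the function names and the 45° example are chosen just for illustration.

```python
import math
import random

def random_hyperplane_hash(vec, w):
    """h(v) = sign(w . v), returned as 1 or 0."""
    return 1 if sum(wi * vi for wi, vi in zip(w, vec)) > 0 else 0

def collision_rate(u, v, trials=20000, seed=0):
    """Fraction of random hyperplanes that give u and v the same hash bit."""
    rng = random.Random(seed)
    same = 0
    for _ in range(trials):
        # Only the direction of w matters for the sign, so no need to normalize it
        w = [rng.gauss(0.0, 1.0) for _ in range(len(u))]
        same += random_hyperplane_hash(u, w) == random_hyperplane_hash(v, w)
    return same / trials

angle = math.pi / 4                      # 45 degrees between u and v
u = [1.0, 0.0]
v = [math.cos(angle), math.sin(angle)]

empirical = collision_rate(u, v)
theoretical = 1 - angle / math.pi
print(f"empirical={empirical:.3f} theoretical={theoretical:.3f}")
```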

In [None]:
import numpy as np

class LSH:
    def __init__(self, num_tables=10, num_hash_functions=8, dim=768):
        """
        num_tables: More tables = better recall but slower
        num_hash_functions: More functions = fewer false positives
        """
        self.num_tables = num_tables
        self.num_hash_functions = num_hash_functions

        # Random projection vectors for each hash function in each table
        self.random_vectors = [
            np.random.randn(num_hash_functions, dim)
            for _ in range(num_tables)
        ]

        # Hash tables: one dict per table, mapping hash key -> list of (doc_id, vector)
        self.tables = [{} for _ in range(num_tables)]

    def _hash(self, vector, table_idx):
        """Create hash key using random projections"""
        # Dot product with random vectors
        projections = np.dot(self.random_vectors[table_idx], vector)
        # Convert to binary hash
        binary_hash = ''.join(['1' if x > 0 else '0' for x in projections])
        return binary_hash

    def insert(self, vector, doc_id):
        """Insert vector into all hash tables"""
        for table_idx in range(self.num_tables):
            hash_key = self._hash(vector, table_idx)

            if hash_key not in self.tables[table_idx]:
                self.tables[table_idx][hash_key] = []

            self.tables[table_idx][hash_key].append((doc_id, vector))

    def query(self, query_vector, top_k=5):
        """Find similar vectors using LSH"""
        candidates = set()

        # Get candidates from all tables
        for table_idx in range(self.num_tables):
            hash_key = self._hash(query_vector, table_idx)

            if hash_key in self.tables[table_idx]:
                for doc_id, vector in self.tables[table_idx][hash_key]:
                    candidates.add((doc_id, tuple(vector)))

        # Compute exact similarities for candidates only
        results = []
        for doc_id, vector in candidates:
            vector = np.array(vector)
            similarity = np.dot(query_vector, vector)
            results.append((doc_id, similarity))

        # Sort and return top-k
        results.sort(key=lambda x: x[1], reverse=True)
        return results[:top_k]

# Demo
lsh = LSH(num_tables=5, num_hash_functions=4, dim=384)
print("LSH Index created with 5 tables and 4 hash functions")

**Trade-offs:**
- **More tables** → better recall (find more relevant docs) but slower & more memory
- **More hash functions** → fewer false positives but smaller buckets
- **Approximate results** → might miss some relevant docs (95%+ recall typical)
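
These trade-offs can be put into numbers. If a single hash bit agrees for two vectors with probability p, then all `num_hash_functions` bits of one table agree with probability p^k, and the pair shares a bucket in at least one of `num_tables` tables with probability 1 − (1 − p^k)^L. The helper below is a small sketch of that formula; the example values of p are assumptions picked for illustration.

```python
def candidate_probability(p, num_hash_functions, num_tables):
    """Chance that two vectors collide in at least one table.

    p: probability that a single hash bit agrees for the pair.
    """
    same_bucket_one_table = p ** num_hash_functions
    return 1 - (1 - same_bucket_one_table) ** num_tables

near = candidate_probability(0.9, num_hash_functions=8, num_tables=10)
far = candidate_probability(0.6, num_hash_functions=8, num_tables=10)
print(f"near pair: {near:.4f}, far pair: {far:.4f}")
```

More bits per table push both numbers down (fewer false positives, smaller buckets), and more tables push both up (better recall, more memory), which is exactly the tension in the list above.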



##### 2. Hierarchical Navigable Small World (HNSW)

**Core idea:** Build a multi-layer graph where each node is a vector. Navigate through layers to quickly zoom in on nearest neighbors.

**Mathematics:**
```
Graph construction:
1. Insert vectors as nodes
2. Connect to M nearest neighbors per layer
3. Higher layers = sparser connections (long jumps)
4. Lower layers = denser connections (fine-grained)

Search complexity: O(log n) with proper parameters
```

**Visualization:**
```
Layer 2: A ---------> Z         (sparse, long-range connections)
         |            |
Layer 1: A --> B --> Y --> Z    (medium density)
         |     |     |     |
Layer 0: A->B->C->..X->Y->Z    (dense, all vectors)
```

**How search works:**
1. Start at top layer (sparse)
2. Navigate to closest neighbor at each step
3. Move down layers when no closer neighbor found
4. At the bottom layer, run the same greedy search over a candidate list of size `ef` to collect the approximate nearest neighbors
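
The descent can be sketched in a few lines. This is a deliberately simplified toy, not how hnswlib is implemented: the graph is hand-built rather than constructed by insertion, and the search keeps a single candidate per layer (equivalent to `ef=1`), whereas real HNSW keeps a candidate list at the bottom layer. All names and data are hypothetical.

```python
def greedy_layer_search(query, entry, neighbors, points):
    """Walk to whichever neighbor is closer to the query until none is."""
    def dist(node):
        return sum((q - x) ** 2 for q, x in zip(query, points[node]))

    current = entry
    improved = True
    while improved:
        improved = False
        for candidate in neighbors.get(current, []):
            if dist(candidate) < dist(current):
                current = candidate
                improved = True
    return current

def layered_search(query, entry, layers, points):
    """layers[0] is the sparse top layer, layers[-1] contains every node."""
    current = entry
    for neighbors in layers:
        # The best node found on one layer becomes the entry point of the next
        current = greedy_layer_search(query, current, neighbors, points)
    return current

# Toy data: eight points on a line with hand-built links
points = {i: (float(i),) for i in range(8)}
layers = [
    {0: [7], 7: [0]},                                      # top: long jumps
    {0: [3], 3: [0, 5], 5: [3, 7], 7: [5]},                # middle
    {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)},  # bottom
]

nearest = layered_search((5.8,), entry=0, layers=layers, points=points)
print(nearest)
```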

In [None]:
# Install hnswlib if needed: pip install hnswlib
import hnswlib
import numpy as np

class HNSWIndex:
    def __init__(self, dim=384, max_elements=10000):
        self.dim = dim
        self.max_elements = max_elements

        # Initialize index
        self.index = hnswlib.Index(space='cosine', dim=dim)
        self.index.init_index(
            max_elements=max_elements,
            ef_construction=200,  # Higher = better quality but slower build
            M=16                   # Number of connections per layer
        )

        self.index.set_ef(50)  # Higher = better search recall but slower

        self.doc_ids = []

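    def add_documents(self, embeddings, doc_ids):
        """Add vectors to index"""
        # Offset labels by the number of stored documents so repeated calls
        # append new items instead of overwriting labels 0..n-1
        start = len(self.doc_ids)
        labels = np.arange(start, start + len(embeddings))
        self.index.add_items(embeddings, ids=labels)
        self.doc_ids.extend(doc_ids)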

    def search(self, query_embedding, top_k=5):
        """Search for similar vectors"""
        # Returns (ids, distances)
        labels, distances = self.index.knn_query(query_embedding, k=top_k)

        results = []
        for label, distance in zip(labels[0], distances[0]):
            doc_id = self.doc_ids[label]
            similarity = 1 - distance  # Convert distance to similarity
            results.append((doc_id, similarity))

        return results

# Demo (will work when hnswlib is installed)
print("HNSW Index class defined")
print("Parameters: M=16 (connections), ef_construction=200, ef=50")

**Parameters explained:**
- **M** (connections per node): Higher = better accuracy but more memory (12-48 typical)
- **ef_construction**: Higher = better index quality but slower build (100-200 typical)
- **ef** (search): Higher = better recall but slower query (50-200 typical)

**Trade-offs:**
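- **Very fast queries** (typically sub-millisecond to a few milliseconds, even with millions of vectors)
- **High recall** (>95% with proper parameters)
- **Memory intensive** (stores the full graph structure alongside the vectors)
- **No easy deletes** (elements can only be soft-deleted, e.g. hnswlib's `mark_deleted` hides them from search results; physically removing them generally means rebuilding)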
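
To get a feel for the "memory intensive" point, here is a back-of-the-envelope estimate. It rests on a rule of thumb, not a measurement: each element stores its float32 vector plus up to `2 * M` neighbor ids of 4 bytes each on the bottom layer, and the upper layers and bookkeeping are ignored. Treat the result as a lower bound.

```python
def hnsw_memory_bytes(num_vectors, dim, M=16, bytes_per_float=4, bytes_per_link=4):
    """Rough lower bound: raw vectors plus bottom-layer links (assumed 2*M per node)."""
    vector_bytes = num_vectors * dim * bytes_per_float
    link_bytes = num_vectors * 2 * M * bytes_per_link
    return vector_bytes + link_bytes

estimate = hnsw_memory_bytes(1_000_000, dim=384, M=16)
print(f"~{estimate / 1e9:.2f} GB for 1M vectors of 384 dims")
```

Doubling `M` doubles the link term, which is why higher `M` costs memory as well as build time.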


##### Comparison Matrix

| Database | Type | Best For | Scale | Speed | Setup Complexity |
|----------|------|----------|-------|-------|------------------|
| **FAISS** | Library | On-premise, high performance | Billions | Very Fast | Medium |
| **Chroma** | Embedded | Prototyping, local dev | Millions | Fast | Low |
| **Pinecone** | Cloud (Managed) | Production, scaling | Billions | Very Fast | Low |
| **Weaviate** | Self-hosted/Cloud | Hybrid search, GraphQL | Tens of millions+ | Fast | Medium |
| **Qdrant** | Self-hosted/Cloud | Production, filtering | Billions | Very Fast | Medium |
| **Milvus** | Distributed | Enterprise, huge scale | Billions | Fast | High |
| **PostgreSQL+pgvector** | SQL Extension | Existing Postgres apps | Millions | Medium | Low |
| **Redis** | In-memory | Ultra-low latency | Millions | Very Fast | Low |

##### 1. **FAISS** (Facebook AI Similarity Search)

**When to use:**
- Need maximum performance on-premise
- Handling billions of vectors
- Have engineering resources
- Want full control

**Code example:**

In [None]:
try:
    import faiss
    import numpy as np

    # Create simple flat index (exact search)
    dim = 384
    index = faiss.IndexFlatL2(dim)

    # Add vectors
    vectors = np.random.randn(1000, dim).astype('float32')
    index.add(vectors)

    # Search
    query = np.random.randn(1, dim).astype('float32')
    distances, indices = index.search(query, k=5)

    print("✅ FAISS example:")
    print(f"   Indexed: {index.ntotal} vectors")
    print(f"   Top-5 indices: {indices[0]}")
    print(f"   Distances: {distances[0]}")

    # Save/load
    faiss.write_index(index, "faiss_index.bin")
    loaded_index = faiss.read_index("faiss_index.bin")
    print(f"   Saved and loaded index successfully")

except ImportError:
    print("⚠️  FAISS not installed. Run: pip install faiss-cpu")

##### 2. **Chroma** (Simple & Embedded)

**When to use:**
- Rapid prototyping
- Local development
- Small to medium datasets
- Want simplicity
- Want a batteries-included store (embedding, storage, and querying in one package) rather than a bare index library

**Code example:**

In [None]:
try:
    import chromadb
    from chromadb.utils import embedding_functions

    # Initialize client
    client = chromadb.Client()

    # Create collection with embedding function
    collection = client.create_collection(
        name="demo_collection",
        embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name="all-MiniLM-L6-v2"
        )
    )

    # Add documents (auto-embeds!)
    collection.add(
        documents=[
            "This is about machine learning",
            "This is about deep learning"
        ],
        metadatas=[
            {"category": "AI"},
            {"category": "AI"}
        ],
        ids=["doc1", "doc2"]
    )

    # Query with filters
    results = collection.query(
        query_texts=["neural networks"],
        n_results=2,
        where={"category": "AI"}
    )

    print("✅ Chroma example:")
    print(f"   Documents: {results['documents']}")
    print(f"   Distances: {results['distances']}")

except ImportError:
    print("⚠️  Chroma not installed. Run: pip install chromadb")

#### Knowledge Graphs


Knowledge graphs are graph structures too, but unlike HNSW (which links vectors purely by distance so they can be searched quickly), a knowledge graph stores extracted entities as nodes and explicit relationships as edges: one thing is related to another for reason `x`.

For example, say you're building something for law. A KG may be the most relevant choice here given the use case (always choose from the use case, never pick something just because it's interesting or trendy; you should be able to defend the decision logically and mathematically). State laws might relate to past incidents or cases, which in turn relate to a political movement. Standard vector search can miss that chain.

If you search `q = I had a car accident on broadway 34, the vehicle was behind me, I'm not sure what to do.`, vector search returns the top `k=3` documents whose content is closest to the query. It will likely find older docs mentioning vehicles, accidents, or Broadway, but it won't surface the document on the political movement, because nothing in the query is semantically close to it.

In some cases you can still get there with heavily optimized vector search; it really depends on what works best for you. Start with the standardized approach for your use case, then tune it toward your specific strategy.

<img src="https://rewirenow.com/app/uploads/2024/10/Rewire_Retrieval-Augmented-Generation-Knowledge-Graph-Picture-5-scaled.jpg" width=700/>

There are various ways to optimize knowledge-graph-based RAG. One is decaying knowledge paths: relationships that are used less and less over time are gradually forgotten, which reduces memory overhead. Another is optimizing the hopping over `N` steps so you reach the most relevant node faster. One of the officers at the AI Society at ASU, [Yahia](https://www.ais-asu.com/), implemented a similarity search that picks the node most relevant to the query and starts the graph traversal from there, which is very effective at reducing latency while keeping accuracy.
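
The "pick an entry node by similarity, then hop at most `N` steps" idea can be sketched without any graph library. Everything below is a toy: the node names, the two-dimensional "embeddings", and the helper functions are invented for the law example and are not taken from any particular implementation.

```python
from collections import deque

def pick_entry_node(query_vec, node_vecs):
    """Return the node whose embedding has the highest dot product with the query."""
    return max(
        node_vecs,
        key=lambda node: sum(q * x for q, x in zip(query_vec, node_vecs[node])),
    )

def traverse(graph, start, max_hops):
    """Breadth-first walk that stops expanding after max_hops edges."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in graph.get(node, []):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops  # node -> number of hops from the entry node

# Hypothetical toy graph: statute -> past case -> political movement -> act
graph = {
    "traffic_accident_statute": ["rear_end_collision_case"],
    "rear_end_collision_case": ["road_safety_movement"],
    "road_safety_movement": ["state_transport_act"],
}
node_vecs = {
    "traffic_accident_statute": [0.9, 0.1],
    "rear_end_collision_case": [0.6, 0.4],
    "road_safety_movement": [0.1, 0.9],
    "state_transport_act": [0.2, 0.8],
}

entry = pick_entry_node([1.0, 0.0], node_vecs)
reachable = traverse(graph, entry, max_hops=2)
print(entry, reachable)
```

The movement node is far from the query in embedding space, yet it is reached in two hops because the edges encode the relationship that vector similarity alone misses.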




In [None]:
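from langchain_chroma.vectorstores import Chroma
from langchain_graph_retriever.transformers import ShreddingTransformer
from langchain_graph_retriever import GraphRetriever
from graph_retriever.strategies import Eager

# Assumes `animals` (a list of Documents) and `embeddings` (an embedding model)
# were defined earlier in the notebook.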



vector_store = Chroma.from_documents(
    documents=list(ShreddingTransformer().transform_documents(animals)),
    embedding=embeddings,
    collection_name="animals",
)
# What does ShreddingTransformer do? Some vector stores cannot store or search
# metadata fields that hold sequence values (lists). The transformer converts
# each sequence-based field into simple metadata values.


traversal_retriever = GraphRetriever(
    store = vector_store,
    edges = [("habitat", "habitat"), ("origin", "origin")],
    strategy = Eager(k=5, start_k=1, max_depth=2),
)


# This graph retriever starts with a single animal that best matches the query,
# then traverses to other animals sharing the same habitat and/or origin.

#The above creates a graph traversing retriever that starts with the nearest animal (start_k=1), retrieves
# 5 documents (k=5) and limits the search to documents that are at most 2 steps away from the first animal (max_depth=2).
# The edges define how metadata values can be used for traversal. In this case, every animal is connected to other
# animals with the same habitat and/or origin.




### Retrieval Mechanisms

LlamaIndex shines in this part of RAG.

Retrieval is one of the core components of RAG, and its quality largely determines how the whole system performs.

It's the component that sits between your query and the documents in a vector or graph database, and finds the documents relevant to the request.

If your retrieval mechanism is poor, then no matter how good your LLM is, the output will be poor too, because the model never sees the right information.

Retrieval isn't only necessary for RAG; it underpins real-world infrastructure everywhere, whether it's Google search, Amazon shopping, or SQL. "Retrieval" as a process is a very general term, with its own story and history that is still unfolding.

<img src="https://pbs.twimg.com/media/Gy0QKbobgAApyNJ?format=png&name=4096x4096" width=700/>

Now there are tons and tons of retrieval mechanisms out there, but your job as an engineer is not to "know" all of them; it's to figure out *which* one works best for your use case, and optimize it.




#### Retrievers and Techniques

People refer to these either as retrievers or as "techniques", since there is mainly a set of base retrievers, plus external components such as reranking mechanisms that make retrieval more effective depending on the use case.

##### BM25 Hybrid Retriever

This is one of the most used retrievers and pretty much the standard retrieval mechanism you'll see everywhere.

BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. It is a family of scoring functions with slightly different components and parameters.

It's essentially a keyword-based search tool that ranks documents by the relevance of the query terms to each document, using the Okapi BM25 ranking function.

We rarely use bare BM25 in RAG; instead we use a hybrid of bi-encoder "semantic" search and BM25 to improve the relevance and accuracy of the retrieved information for our agent to use.

<img src="https://confidentialmind.com/images/blogs/how-bm25-works.png" width=700/>

It works by merging the precise, keyword-matching capabilities of BM25 with the context-understanding abilities of semantic search, which can miss exact terms but understands intent. This hybrid approach creates a more robust retrieval system that is more comprehensive and reliable than either method alone.
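
One common way to do the merging is reciprocal rank fusion (RRF), which combines the two ranked lists using only rank positions, so BM25 scores and cosine similarities never have to be put on the same scale. The sketch below is a generic illustration and not the LlamaIndex implementation; the document ids are hypothetical, and `k=60` is the conventional default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: score(doc) = sum of 1 / (k + rank) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # keyword view
dense_ranking = ["doc_c", "doc_a", "doc_d"]   # semantic view

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still stays in the list.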


In [None]:
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.response.notebook_utils import display_source_node
import Stemmer

# We can pass in the index, docstore, or list of nodes to create the retriever
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=2,
    # Optional: We can pass in the stemmer and set the language for stopwords
    # This is important for removing stopwords and stemming the query + text
    # The default is english for both
    stemmer=Stemmer.Stemmer("english"),
    language="english",
)

retrieved_nodes = bm25_retriever.retrieve(
    "What happened at Viaweb and Interleaf?"
)
for node in retrieved_nodes:
    display_source_node(node, source_length=5000)


##### Simple Query Fusion

##### Reciprocal Rerank Fusion

##### Auto Merging Retriever

##### Metadata Replacement

##### Composable Retrievers

#### Auto-Retrieval

##### Multi Doc Auto-Retrieval

##### Other Services

#### Knowledge Graph Retrievers

#### Composed Retrievers

##### Query Fusion

##### Recursive Table Retrieval

##### Recursive Node Retrieval

##### Braintrust

##### Router Retriever

##### Ensemble Retriever

### Evaluation

Introduce evaluation mechanisms, the different ways of representing them, how different types of documents can be fed to RAG, and the math behind them if needed.

### Additionals

##### Guardrails

## A Complete Agentic System



## Limitations

## Summary

### Key Takeaways: Vector Databases for RAG

**🎯 Algorithm Choice:**

| Scale | Accuracy Need | Memory | Choose |
|-------|---------------|--------|--------|
| < 100K | High | Unlimited | **Flat Index** (exact search) |
| 100K-1M | High | Limited | **HNSW** (fast + accurate) |
| 1M-10M | Medium | Very Limited | **IVF** (clustering) |
| 10M+ | Medium | Extremely Limited | **IVF + PQ** (compressed) |

**⚡ Performance Tips:**

1. **Start simple**: Use flat index for prototyping
2. **Profile first**: Measure before optimizing
3. **Normalize vectors**: For consistent cosine similarity
4. **Batch operations**: Add/query in batches
5. **Monitor recall**: Track accuracy vs speed trade-offs
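
For the last tip, a simple way to monitor recall is to compare what the approximate index returns against an exact (flat) search on a sample of queries. A minimal sketch, with a made-up function name and made-up result ids:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Share of the true top-k neighbors that the approximate search also returned."""
    exact = set(exact_ids[:k])
    if not exact:
        return 0.0
    return len(exact & set(approx_ids[:k])) / len(exact)

# Hypothetical results for one query: ids from an ANN index vs. exact search
approx_result = [4, 9, 1, 7, 3]
exact_result = [4, 1, 2, 9, 8]

value = recall_at_k(approx_result, exact_result, k=5)
print(f"recall@5 = {value:.2f}")
```

Averaging this over a few hundred representative queries, and re-checking after every parameter change, makes the accuracy vs speed trade-off visible.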

**🏗️ Production Checklist:**

- ✅ **Persistence**: Save/load indexes
- ✅ **Metadata filtering**: Support rich queries
- ✅ **Monitoring**: Track query latency, recall
- ✅ **Scaling**: Plan for growth (10x, 100x)
- ✅ **Backup**: Regular index backups
- ✅ **Version control**: Track index versions

**🚀 Next Steps:**

With vector storage mastered, you can now:
1. **Build complete RAG pipelines** (retrieval + generation)
2. **Implement hybrid search** (vector + keyword)
3. **Add reranking** for improved accuracy
4. **Optimize for your use case** (latency vs accuracy)

#### Workflow Pattern Selection Guide & Best Practices

Choosing the right workflow pattern is crucial for building effective agentic systems. Here's a comprehensive guide based on production experience and Anthropic's research:

**🔗 Prompt Chaining** - Use when:
- Tasks can be cleanly decomposed into sequential steps
- Each step benefits from focused attention
- Quality is more important than latency
- You need programmatic validation gates
- Examples: Content generation → review → translation → cultural adaptation

**📍 Routing** - Use when:
- Input types have distinct handling requirements  
- Specialized expertise improves outcomes significantly
- Classification can be performed reliably
- Different cost/performance tradeoffs exist per route
- Examples: Customer service triage, query complexity routing

**⚡ Parallelization** - Use when:
- **Sectioning**: Independent subtasks can run simultaneously
- **Voting**: Multiple perspectives improve decision confidence
- Latency reduction is critical
- Ensemble methods provide measurable accuracy gains
- Examples: Multi-aspect analysis, content moderation, code review

**🎯 Orchestrator-Workers** - Use when:
- Task requirements can't be predicted in advance
- Dynamic subtask generation is needed
- Different specialists handle different aspects
- Complex coordination is required
- Examples: Software development, research synthesis, creative projects

**🔄 Evaluator-Optimizer** - Use when:
- Iterative refinement demonstrably improves quality
- Clear evaluation criteria exist
- The LLM can provide meaningful self-criticism
- Quality improvement justifies additional latency
- Examples: Creative writing, complex analysis, strategic planning

**🤖 Autonomous Agents** - Use when:
- Open-ended problems with unpredictable steps
- Long-running tasks requiring persistence
- Environment interaction and feedback loops exist
- Human oversight can be incorporated at checkpoints
- Trust level supports autonomous operation

**Production Considerations:**

1. **Start Simple**: Begin with the simplest pattern that meets requirements
2. **Measure Performance**: Always evaluate accuracy, latency, and cost tradeoffs
3. **Error Handling**: Implement robust error recovery and fallback strategies
4. **Human Oversight**: Include checkpoints for critical decisions
5. **Composability**: Patterns can be combined for sophisticated workflows
6. **Tool Design**: Invest heavily in clear, well-documented tool interfaces
7. **Testing**: Extensive testing in sandboxed environments before production


### Memory Systems Quick Reference

Now that we've seen memory systems in action, here's a practical guide for choosing the right approach:

| Memory Type | Best Use Case | Pros | Cons | Complexity |
|-------------|---------------|------|------|------------|
| **ConversationBufferMemory** | Short, detail-critical conversations | Perfect recall, simple setup | Linear cost growth, token limits | O(n) |
| **ConversationSummaryMemory** | Long-term relationships, key themes | Scales indefinitely, preserves important info | Loses detail, summarization overhead | O(log n) |
| **ConversationBufferWindowMemory** | Task-oriented, recent context matters | Predictable performance, constant cost | Forgets older context completely | O(k) |
| **ConversationTokenBufferMemory** | Production apps, cost control | Optimal context usage, never exceeds limits | Complex token counting logic | O(tokens) |
| **ConversationEntityMemory** | Relationship tracking, complex scenarios | Maintains entity relationships, intelligent context | Requires entity extraction, higher complexity | O(entities) |
| **CombinedMemory** | Sophisticated applications | Leverages multiple approaches, flexible | Complex setup, coordination overhead | O(combined) |

**Quick Decision Guide:**
- 📝 **Need perfect recall?** → Buffer Memory
- 🔄 **Long conversations?** → Summary Memory  
- ⚡ **Recent context only?** → Window Memory
- 💰 **Cost control critical?** → Token Memory
- 👥 **Tracking relationships?** → Entity Memory
- 🧠 **Multiple requirements?** → Combined Memory

**Memory Performance Characteristics:**
- **Buffer**: Grows with conversation length - great for short, detailed discussions
- **Summary**: Logarithmic growth - ideal for ongoing relationships  
- **Window**: Constant size - perfect for task-focused interactions
- **Token**: Bounded growth - excellent for production cost control
- **Entity**: Scales with entities - powerful for complex relationship tracking
- **Combined**: Flexible scaling - adaptable to diverse requirements
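
The "constant size" behavior of window memory is easy to see in a stripped-down stand-in. The class below only illustrates the idea and is not LangChain's `ConversationBufferWindowMemory`; the class name, method names, and the sample conversation are all made up.

```python
from collections import deque

class WindowMemory:
    """Keeps only the last k exchanges; older turns are dropped automatically."""

    def __init__(self, k=2):
        self.turns = deque(maxlen=k)

    def save(self, user_msg, ai_msg):
        self.turns.append((user_msg, ai_msg))

    def context(self):
        """Render the remembered turns as the text that would go into the prompt."""
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in self.turns)

memory = WindowMemory(k=2)
memory.save("Hi, I'm Sam", "Hello Sam!")
memory.save("I like hiking", "Nice, where do you hike?")
memory.save("Mostly in Arizona", "Great trails there.")

print(memory.context())
```

After the third exchange, the first one (including the user's name) is gone, which is the "forgets older context completely" drawback from the table.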


**🎯 Key Takeaways:**

**1. Chunking vs Splitting:**
- **Splitting** = Breaking documents at boundaries
- **Chunking** = Engineering optimal retrieval units
- Chunking is about **semantic coherence and retrieval quality**

**2. Framework Choice:**

| Scenario | Framework | Why |
|----------|-----------|-----|
| Simple RAG, getting started | **LangChain** | Easier, faster, flexible |
| Production with basic needs | **LangChain** | Battle-tested, well-documented |
| Complex hierarchies needed | **LlamaIndex** | Built-in parent-child relationships |
| Semantic boundaries critical | **LlamaIndex** | SemanticSplitterNodeParser |
| Mixed document types | **Both** | Use appropriate tool per document |

**3. Optimal Chunk Sizes:**

$$optimal\_size = f(embedding\_model, document\_type, retrieval\_precision)$$

General guidelines:
- **100-200 tokens**: High precision, may lack context
- **200-500 tokens**: ⭐ **SWEET SPOT** for most cases
- **500-1000 tokens**: More context, lower precision
- **Overlap**: 10-20% of chunk size
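
As a quick illustration of these guidelines, here is a minimal fixed-size chunker with overlap. It is a sketch under simplifying assumptions: whitespace-separated words stand in for real tokenizer tokens, and `chunk_tokens` is a made-up helper, not a LangChain or LlamaIndex function.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=45):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks

# 1000 placeholder "tokens"; 45 tokens of overlap is 15% of a 300-token chunk
words = [f"w{i}" for i in range(1000)]
chunks = chunk_tokens(words, chunk_size=300, overlap=45)

print(len(chunks), [len(c) for c in chunks])
```

Each chunk starts with the last 45 tokens of the previous one, so a sentence cut at a boundary still appears whole in one of the two chunks.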

**4. Chunking Strategies Ranking:**

| Strategy | Complexity | Quality | Use When |
|----------|------------|---------|----------|
| Fixed-size | ⭐ | ⭐⭐⭐ | Baseline, simple docs |
| Sentence-based | ⭐⭐ | ⭐⭐⭐⭐ | Natural language, Q&A |
| Semantic (structure) | ⭐⭐⭐ | ⭐⭐⭐⭐ | Structured documents |
| Context-enriched | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Production systems |
| Semantic (embedding) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Maximum quality, have resources |
| Hierarchical | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Complex documents |

**5. Advanced Techniques:**

```
Basic → Sentence → Semantic → Hierarchical → Semantic+Hierarchical
                                              (Production-Ready)
```

**Mathematical Foundations:**

**Chunk Quality Score:**
$$Q(chunk) = \alpha \cdot coherence(chunk) + \beta \cdot size\_optimality(chunk) + \gamma \cdot context\_preservation(chunk)$$

**Semantic Coherence:**
$$coherence(chunk) = \frac{\sum_{i<j} sim(sent_i, sent_j)}{n(n-1)/2}$$

where $n$ is the number of sentences in the chunk, so the sum runs over each of the $n(n-1)/2$ sentence pairs once.

**Retrieval Effectiveness:**
$$effectiveness = \frac{precision \times recall}{storage\_cost \times compute\_cost}$$
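
The coherence score above is just the average pairwise similarity between the sentences of a chunk. A small sketch, assuming cosine similarity for `sim` and using hand-made two-dimensional vectors in place of real sentence embeddings:

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def coherence(sentence_vecs):
    """Mean cosine similarity over all unordered sentence pairs in a chunk."""
    pairs = list(combinations(sentence_vecs, 2))
    if not pairs:
        return 1.0  # assumption: a single-sentence chunk counts as fully coherent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Hypothetical sentence embeddings
on_topic = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]   # three sentences on one topic
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]      # one sentence is off-topic

print(f"on_topic={coherence(on_topic):.3f} mixed={coherence(mixed):.3f}")
```

A chunk that mixes topics scores visibly lower, which is the signal embedding-based semantic chunkers use to decide where to cut.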

**6. Production Checklist:**

✅ **Tested chunk sizes** on your specific data  
✅ **Measured retrieval quality** (precision/recall)  
✅ **Added metadata** (source, section, timestamps)  
✅ **Implemented overlap** for context continuity  
✅ **Considered hierarchies** for complex documents  
✅ **Monitored chunk size distribution**  
✅ **Documented configuration** for reproducibility  
✅ **Set up A/B testing** for optimization  

**7. Common Pitfalls to Avoid:**

❌ Using same chunk size for all document types  
❌ No overlap (loses boundary context)  
❌ Ignoring document structure  
❌ Not testing retrieval quality  
❌ Chunks too large (poor precision)  
❌ Chunks too small (insufficient context)  
❌ Missing metadata enrichment  
❌ No monitoring/metrics  

**8. Next Steps:**

Now that you have optimal chunks, the next step is **embedding** them into vector representations for semantic search. We'll cover:

- Embedding models (OpenAI, HuggingFace, etc.)
- Embedding dimensions and trade-offs
- Batch embedding strategies
- Embedding caching and optimization
- Multi-representation embeddings



---



**Key Takeaways from Text Splitting:**

**🎯 Default Choice:**
- **RecursiveCharacterTextSplitter** works for 80% of cases
- Adaptive, efficient, minimal configuration needed
- Respects natural text boundaries
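
To see what "respects natural text boundaries" means, here is a stripped-down sketch of the recursive idea: try the coarsest separator first (paragraphs), and only fall back to finer ones (lines, words, characters) for pieces that are still too long. This is a simplified illustration written for this tutorial, not LangChain's actual implementation: it measures length in characters, has no overlap, and the function name `recursive_split` is our own.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (chunk_size > 0)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "Para one is short.\n\nPara two is also short.\n\nThird."
for chunk in recursive_split(text, chunk_size=30):
    print(repr(chunk))
```

With a 30-character limit, each paragraph ends up in its own chunk because merging any two would exceed the limit; no paragraph is cut mid-sentence.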

**📏 Optimal Chunk Sizes:**
- **Small (50-200 tokens)**: Precise but may lack context
- **Medium (200-500 tokens)**: ⭐ **RECOMMENDED** - best balance
- **Large (500-1000 tokens)**: More context, less precise
- **Overlap**: 10-20% of chunk size for context continuity
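
The overlap rule translates into a simple sliding window: each chunk starts `chunk_size - overlap` tokens after the previous one. The sketch below assumes the text is already tokenized into a list; the function name and the 15% default are illustrative choices inside the 10-20% range above.

```python
def sliding_window_chunks(tokens, chunk_size, overlap_ratio=0.15):
    """Fixed-size chunks where consecutive chunks share `overlap` tokens."""
    overlap = round(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Integers stand in for tokens so the shared boundary is easy to see.
for chunk in sliding_window_chunks(list(range(10)), chunk_size=4, overlap_ratio=0.25):
    print(chunk)
```

In this toy run the last token of each chunk reappears as the first token of the next, which is what keeps a sentence that straddles a boundary retrievable from either side.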

**🔧 Specialized Splitters:**
| Document Type | Splitter | Why |
|---------------|----------|-----|
| Code | `RecursiveCharacterTextSplitter.from_language()` | Syntax-aware, preserves functions/classes |
| Markdown | `MarkdownHeaderTextSplitter` | Structure preservation, rich metadata |
| HTML | `HTMLHeaderTextSplitter` | Clean extraction, hierarchy preservation |
| LaTeX/Academic | `LatexTextSplitter` | Equation preservation |
| Token-constrained | `TokenTextSplitter` | Exact token control for LLMs |
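
To illustrate what "structure preservation, rich metadata" buys you for Markdown, here is a hand-rolled sketch that splits on headers and attaches the current header path to every section. It is a simplified stand-in written for this tutorial, not the implementation of `MarkdownHeaderTextSplitter`; the `h1`/`h2` metadata keys are our own naming, and it does not special-case `#` lines inside fenced code blocks.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_headers(markdown):
    """Return sections as dicts with 'text' and header-path 'metadata'."""
    sections = []
    path = {}  # header level -> title currently in effect
    body = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            metadata = {f"h{lvl}": title for lvl, title in sorted(path.items())}
            sections.append({"text": text, "metadata": metadata})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or deeper level.
            for stale in [lvl for lvl in path if lvl >= level]:
                del path[stale]
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections


doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n# FAQ\nAnswers."
for section in split_markdown_by_headers(doc):
    print(section["metadata"], "->", section["text"])
```

Storing the header path alongside each chunk lets you filter or re-rank by section at retrieval time, and cite where an answer came from.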

**💡 Production Tips:**
1. **Test multiple strategies** with your specific data
2. **Measure retrieval quality** - adjust based on results
3. **Use hierarchical splitting** for structured documents
4. **Add 10-20% overlap** to preserve context
5. **Store metadata** (source, section, hierarchy)
6. **Consider multi-representation** for flexibility
7. **Monitor chunk size distribution** - aim for consistency
8. **Document your configuration** for reproducibility

**⚠️ Common Pitfalls:**
- ❌ Using the same chunk size for all document types
- ❌ No overlap (loses boundary context)
- ❌ Chunks too large (poor precision)
- ❌ Chunks too small (insufficient context)
- ❌ Ignoring document structure
- ❌ Not testing retrieval quality


---

**Next Step:** Now that we have optimally split documents, we'll explore **chunking strategies** and **embedding** techniques to transform these text chunks into vector representations for semantic search.

**🔥 Pro Tip**: Start simple (LangChain + fixed-size), measure quality, then optimize. Don't over-engineer early!




**When to use:**
- Significant domain-specific vocabulary
- Generic models perform poorly on your data
- Have domain-specific training data

### Choosing the Right Embedding Model

**Decision matrix:**

| Use Case | Model Type | Example | Dimensions |
|----------|------------|---------|------------|
| General RAG, good balance | Dense (medium) | all-mpnet-base-v2 | 768 |
| Fast retrieval, limited resources | Dense (small) | all-MiniLM-L6-v2 | 384 |
| Best accuracy, have compute | Dense (large) | bge-large-en-v1.5 | 1024 |
| Keyword-heavy domain | Sparse (BM25) | BM25 (e.g. via `rank_bm25`) | Vocabulary-sized (sparse) |
| Production system | Hybrid | Dense + BM25 | Dense + sparse |
| Need reranking | Cross-encoder | ms-marco-MiniLM | N/A (outputs a score, not a vector) |
| Medical/Legal/Code | Domain-specific | BioBERT/CodeBERT | Varies |
| Multi-language | Multilingual | multilingual-e5-base | 768 |
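
The Dimensions column matters mostly because it drives index size. As a rough back-of-the-envelope sketch, assuming vectors are stored as 32-bit floats (4 bytes per value) and ignoring any index overhead, the raw storage is `num_chunks × dimensions × 4` bytes. The one-million-chunk corpus below is a hypothetical figure.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings stored as float32


def index_size_gb(num_chunks, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw vector storage in decimal gigabytes, excluding index overhead."""
    return num_chunks * dimensions * bytes_per_value / 1e9


# Dense models and dimensions taken from the decision matrix above.
models = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "bge-large-en-v1.5": 1024,
}
for name, dims in models.items():
    print(f"{name}: {index_size_gb(1_000_000, dims):.3f} GB")
```

For one million chunks this works out to about 1.5 GB, 3.1 GB, and 4.1 GB respectively, so moving from the small to the large model nearly triples the memory needed for the vectors alone.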

