# Building an Agentic RAG Pipeline

## Overview

This notebook demonstrates how to build a **Retrieval-Augmented Generation (RAG)** pipeline using LangChain and ChromaDB. 

### What is RAG?

RAG is a technique that combines:
- **Retrieval**: Finding relevant information from a knowledge base
- **Augmentation**: Enhancing prompts with retrieved context
- **Generation**: Using LLMs to generate responses based on the augmented context

### Pipeline Steps

1. **Document Loading**: Load PDF documents from a folder
2. **Text Splitting**: Break documents into smaller chunks for better retrieval
3. **Embedding**: Convert text chunks into vector representations
4. **Vector Store**: Store embeddings in ChromaDB for efficient similarity search
5. **Querying**: Search for relevant documents based on user queries
6. **Local LLM**: Set up a local language model for text generation
7. **Agent Controller**: Build intelligent query routing logic
8. **RAG Agent**: Combine retrieval and generation into a complete system
9. **Testing**: Test the agent with different query types

---

## Prerequisites

Make sure you have the required packages installed:
- `langchain`
- `langchain-community`
- `langchain-chroma`
- `transformers`
- `sentence-transformers`
- `pypdf`


In [1]:
# Import required libraries
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers import pipeline

print("✅ All libraries imported successfully!")




✅ All libraries imported successfully!


## Step 1: Document Loading

The first step in building a RAG pipeline is to load your documents. In this example, we'll load PDF files from a specified folder.

### Why PDFs?
PDFs are a common format for documents, reports, and research papers. LangChain's `PyPDFLoader` can extract text content from PDF files while preserving page structure.


In [2]:
def load_docs(folder_path):
    """
    Load all PDF files from a specified folder.
    
    Args:
        folder_path (str): Path to the folder containing PDF files
        
    Returns:
        list: List of Document objects, each representing a page from the PDFs
    """
    docs = []
    for file in os.listdir(folder_path):
        if file.endswith(".pdf"):
            print(f"    📄 Loading {file}...")
            loader = PyPDFLoader(os.path.join(folder_path, file))
            docs.extend(loader.load())
    return docs

# Update this path to where your PDFs are stored
data_folder = "/Users/balaji/Documents/Learning/AI/ai_agent_projects/data/AI"
docs = load_docs(data_folder)
print(f"\n✅ PDF Pages Loaded: {len(docs)}")
print(f"   Each page is a separate Document object with content and metadata")


    📄 Loading RAG MEETS LLMS.pdf...



    📄 Loading LLM Introduction.pdf...
    📄 Loading LLM Python.pdf...
    📄 Loading LLM.pdf...

✅ PDF Pages Loaded: 142
   Each page is a separate Document object with content and metadata
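
The loader above walks `os.listdir`, which returns entries in arbitrary order and skips files whose extension is upper case (such as `REPORT.PDF`). A minimal sketch of a more deterministic file listing follows; `find_pdfs` is a hypothetical helper name, not part of LangChain.

```python
from pathlib import Path


def find_pdfs(folder_path):
    """Return PDF paths in a folder, sorted by name, matching .pdf in any letter case."""
    folder = Path(folder_path)
    return sorted(
        # Keep regular files only, so a directory named "x.pdf" is ignored
        (p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name.lower(),
    )
```

Each returned path could then be passed to `PyPDFLoader(str(path))` inside the same loop that `load_docs` uses.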


## Step 2: Text Splitting (Chunking)

After loading documents, we need to split them into smaller chunks. This is crucial because:

1. **Token Limits**: LLMs have context window limits
2. **Better Retrieval**: Smaller, focused chunks improve search accuracy
3. **Relevance**: Retrieving entire documents is often unnecessary

### Chunking Strategy

We use `RecursiveCharacterTextSplitter` which:
- Tries a prioritized list of separators (paragraph breaks, line breaks, spaces, then individual characters), recursing until each chunk fits
- Maintains semantic coherence when possible
- Allows overlap between chunks to preserve context

### Parameters:
- **chunk_size**: Maximum size of each chunk (in characters)
- **chunk_overlap**: Number of characters to overlap between chunks
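
To make the two parameters concrete, here is a simplified sliding-window splitter. It is an illustration only: `sliding_chunks` is a hypothetical function that cuts at exact character offsets, whereas `RecursiveCharacterTextSplitter` prefers to cut at separators.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=80):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration: cuts at exact offsets and ignores separators.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the context that `chunk_overlap` preserves across a boundary.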


In [3]:
# Configure text splitter
# chunk_size: Maximum characters per chunk
# chunk_overlap: Characters to overlap between chunks (helps preserve context)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # Adjust based on your needs (500 chars ≈ 75-100 words)
    chunk_overlap=80     # 80 chars overlap ensures context continuity
)

# Split documents into chunks
chunks = text_splitter.split_documents(docs)
print(f"✅ Documents Split into Chunks: {len(chunks)}")
print(f"   Maximum chunk size: {text_splitter._chunk_size} characters")
print(f"   Overlap between chunks: {text_splitter._chunk_overlap} characters")


✅ Documents Split into Chunks: 301
   Maximum chunk size: 500 characters
   Overlap between chunks: 80 characters


## Step 3: Embeddings

Embeddings convert text into numerical vectors (arrays of numbers) that capture semantic meaning. Similar texts have similar embeddings, which allows us to find relevant documents through vector similarity.

### Why HuggingFace Embeddings?

- **Open Source**: Free to use, no API costs
- **Local Execution**: Runs on your machine, ensuring privacy
- **Good Performance**: `all-MiniLM-L6-v2` is a popular, efficient model
- **Small Size**: ~80MB, fast to download and use

### How Embeddings Work:
1. Text → Numerical Vector (e.g., 384 dimensions)
2. Similar texts → Similar vectors (measured by cosine similarity)
3. Vector search → Find most similar documents to a query
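
Cosine similarity itself is a short formula: the dot product of two vectors divided by the product of their lengths. The sketch below uses made-up three-dimensional vectors in place of real 384-dimensional embeddings; the variable names and values are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.2, 0.9]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```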


In [4]:
# Initialize the embedding model
# This will download the model on first use (~80MB)
# Model: all-MiniLM-L6-v2 (384-dimensional embeddings)
embedding_model = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2"
)

print("✅ Embedding model initialized")
print("   Model: all-MiniLM-L6-v2")
print("   Embedding dimension: 384")
print("   Note: Model will be downloaded on first use if not already cached")




✅ Embedding model initialized
   Model: all-MiniLM-L6-v2
   Embedding dimension: 384
   Note: Model will be downloaded on first use if not already cached


## Step 4: Vector Store (ChromaDB)

ChromaDB is a vector database that stores embeddings and enables fast similarity search. We'll use it to:
- Store document chunks as embeddings
- Perform semantic search to find relevant documents
- Persist data to disk for reuse

### Why Persist to Disk?

- **Reusability**: Don't recreate the database every time
- **Performance**: Faster startup on subsequent runs
- **Persistence**: Data survives script restarts

### Implementation Choice

We use `Chroma.from_documents()` because it:
- ✅ Preserves document metadata (source, page numbers, etc.)
- ✅ Persists to disk automatically
- ✅ Is a one-step operation (simpler code)
- ✅ Handles embedding generation internally


In [5]:
# Create ChromaDB vector store
# This will:
# 1. Generate embeddings for all chunks
# 2. Store them in ChromaDB
# 3. Save to disk in the 'chroma_db' directory

chroma_db = Chroma.from_documents(
    documents=chunks,                    # Document chunks to store
    embedding=embedding_model,           # Embedding model to use
    persist_directory="chroma_db"        # Directory to save the database
)

print("✅ Chroma Database Created")
print(f"   Database location: ./chroma_db")
print(f"   Documents stored: {len(chunks)}")
print(f"   Database will persist between runs")


✅ Chroma Database Created
   Database location: ./chroma_db
   Documents stored: 301
   Database will persist between runs
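
One caveat: as far as we can tell, rerunning the cell above against the same `persist_directory` adds the chunks again instead of replacing them, so the collection can fill with duplicates. A possible mitigation is to give every chunk a deterministic id. The sketch below only builds such ids; `chunk_id` is a hypothetical helper, and passing the resulting list through the `ids=` argument of `Chroma.from_documents` so that repeated runs overwrite existing entries is an assumption to verify against your installed version.

```python
import hashlib


def chunk_id(source, page, text):
    """Build a deterministic id from a chunk's source file, page number, and text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{source}:{page}:{digest}"


# Hypothetical values; in the notebook they would come from
# chunk.metadata["source"], chunk.metadata["page"], and chunk.page_content
first = chunk_id("LLM.pdf", 2, "adapted to a wide range of language-related tasks")
again = chunk_id("LLM.pdf", 2, "adapted to a wide range of language-related tasks")
other = chunk_id("LLM.pdf", 3, "adapted to a wide range of language-related tasks")
print(first == again, first == other)  # True False
```

Two identical chunks on the same page would share an id under this scheme, so deduplicate the id list before using it.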


## Step 5: Querying the Database

Now that we have our vector store, we can query it to find relevant documents. The `similarity_search` method:
1. Converts the query into an embedding
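2. Finds the most similar document chunks by vector distance (Chroma's default metric is L2 unless the collection is configured for cosine; the two give the same ranking when embeddings are normalized)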
3. Returns the top-k most relevant chunks

### How Similarity Search Works:
- Query: "What is machine learning?"
- System finds chunks with similar embeddings
- Returns chunks that semantically match the query
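
The search step can be sketched as a brute-force scan that scores every stored vector against the query and keeps the best `k`. This is a toy model of the idea, not how Chroma works internally (it uses an approximate nearest-neighbor index rather than a full scan); the chunk ids and two-dimensional vectors are made up.

```python
import heapq
import math


def top_k(query_vec, chunk_vecs, k=4):
    """Return the k (score, chunk_id) pairs with the highest cosine similarity."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0

    scored = ((cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items())
    return heapq.nlargest(k, scored)


# Hypothetical chunk ids with made-up 2-dimensional vectors
index = {
    "ml_intro": [1.0, 0.1],
    "ml_history": [0.8, 0.4],
    "cooking": [0.0, 1.0],
}
hits = top_k([1.0, 0.0], index, k=2)
print([chunk_id for _, chunk_id in hits])  # ['ml_intro', 'ml_history']
```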


In [6]:
# Create a retriever from the vector store
# A retriever is an interface that returns documents based on a query
retriever = chroma_db.as_retriever(search_kwargs={"k": 4})

# Example query
query = "What is the main topic of the documents?"

# Perform similarity search
# Returns the top-k most similar chunks (default k=4)
results = chroma_db.similarity_search(query, k=4)

print(f"🔍 Query: '{query}'")
print(f"\n✅ Found {len(results)} relevant chunks:\n")
print("=" * 80)

# Display results
for i, result in enumerate(results, 1):
    print(f"\n📄 Result {i}:")
    print(f"Content: {result.page_content[:200]}...")  # First 200 chars
    if hasattr(result, 'metadata'):
        print(f"Metadata: {result.metadata}")
    print("-" * 80)


🔍 Query: 'What is the main topic of the documents?'

✅ Found 4 relevant chunks:


📄 Result 1:
Content: adapted to a wide range of language-related tasks, like generating content or summarizing legal 
documentation....
Metadata: {'producer': 'macOS Version 12.6.1 (Build 21G217) Quartz PDFContext', 'creationdate': '2023-03-17T18:08:34+00:00', 'creator': 'Microsoft Word', 'title': 'A Beginner’s Guide to Large Language Models', 'page': 2, 'keywords': 'Large Language Models, What is a large language model, llm, how do large language models work', 'moddate': '2023-03-17T11:30:56-07:00', 'page_label': '3', 'source': '/Users/balaji/Documents/Learning/AI/ai_agent_projects/data/AI/LLM.pdf', 'subject': 'Large Language Models', 'total_pages': 25, 'author': 'NVIDIA'}
--------------------------------------------------------------------------------

📄 Result 2:
Content: Table 1. Reference Documents  
Document Document Location 
LLM Application on Arc dGPU https://github.com/violet17/LLM_

## Step 6: Adding a Local LLM

To complete the RAG pipeline, we need a Language Model (LLM) that can generate answers based on the retrieved context. Here, we'll use a local LLM from HuggingFace to avoid API costs and ensure privacy.

### Why Google FLAN-T5?

- **Local Execution**: Runs entirely on your machine
- **No API Costs**: Free to use
- **Small Model**: `flan-t5-base` is relatively small (~250M parameters, roughly a 1 GB download)
- **Text Generation**: Good for question-answering tasks

### Note on Model Size:
- The model will be downloaded on first use (roughly 1 GB)
- Requires sufficient RAM to run (recommended: 8GB+)
- For better quality, consider larger models like `flan-t5-large` or `flan-t5-xl`


In [7]:
# Import transformers pipeline for local LLM
from transformers import pipeline

# Initialize a local LLM using HuggingFace Transformers
# This will download the model on first use (roughly 1 GB)
llm = pipeline(
    "text2text-generation",              # Task type: text-to-text generation
    model="google/flan-t5-base",         # Model: Google's FLAN-T5 base model
    max_new_tokens=150                   # Maximum tokens to generate
)

print("✅ Local LLM initialized")
print("   Model: google/flan-t5-base")
print("   Task: text2text-generation")
print("   Max tokens: 150")
print("   Note: Model will be downloaded on first use if not already cached")


Device set to use mps:0


✅ Local LLM initialized
   Model: google/flan-t5-base
   Task: text2text-generation
   Max tokens: 150
   Note: Model will be downloaded on first use if not already cached


## Step 7: Building an Agentic Controller

An **agentic controller** is a decision-making component that determines how to handle a query. It decides whether to:
- **SEARCH**: Retrieve information from the document database
- **DIRECT**: Answer directly without searching (for general knowledge questions)

### Why Use an Agent Controller?

- **Efficiency**: Don't search when it's not needed
- **Smart Routing**: Different queries need different approaches
- **Cost Optimization**: Avoid unnecessary retrieval operations
- **Better UX**: Faster responses for simple questions


In [8]:
# Agent Controller: Decides whether to search documents or answer directly
def agent_controller(query):
    """
    Determines the action to take based on the query.
    
    Args:
        query (str): User's question
        
    Returns:
        str: "search" if document search is needed, "direct" otherwise
    """
    q = query.lower()
    
    # Keywords that indicate document search is needed
    search_keywords = ["pdf", "document", "data", "summarize", "information", "find"]
    
    if any(word in q for word in search_keywords):
        return "search"
    return "direct"

# Test the controller
test_queries = [
    "Give me a summary from the PDF",
    "What is machine learning?",  # General knowledge
    "Find information about AI",
    "What is the weather today?"  # General knowledge
]

print("🧠 Testing Agent Controller:\n")
for query in test_queries:
    action = agent_controller(query)
    print(f"Query: '{query}'")
    print(f"Action: {action.upper()}")
    print()


🧠 Testing Agent Controller:

Query: 'Give me a summary from the PDF'
Action: SEARCH

Query: 'What is machine learning?'
Action: DIRECT

Query: 'Find information about AI'
Action: SEARCH

Query: 'What is the weather today?'
Action: DIRECT
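
Because the controller tests keywords with `in`, it matches substrings: a query containing "undocumented" would trigger a search through "document". A minimal variant that compares whole words is sketched below. `route` and `SEARCH_KEYWORDS` are hypothetical names, and the keyword set is copied from the cell above.

```python
import re

SEARCH_KEYWORDS = {"pdf", "document", "data", "summarize", "information", "find"}


def route(query):
    """Return "search" if any whole word of the query is a search keyword, else "direct"."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    return "search" if words & SEARCH_KEYWORDS else "direct"


print(route("Find information about AI"))   # search
print(route("Is this API undocumented?"))   # direct
```

The trade-off is that plurals such as "documents" no longer match, so they would need to be added to the set explicitly.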



## Step 8: Complete RAG Agent

Now we'll combine everything into a complete RAG agent that:
1. Uses the controller to decide the action
2. Retrieves relevant documents if needed
3. Augments the query with context
4. Generates an answer using the LLM

### How RAG Works:

1. **Query comes in** → Agent controller decides action
2. **If SEARCH**: 
   - Retrieve relevant chunks from vector store
   - Combine chunks into context
   - Augment query with context
   - Generate answer using LLM
3. **If DIRECT**: 
   - Generate answer directly using LLM


In [9]:
# Complete RAG Agent Function
def rag_answer(query):
    """
    Complete RAG agent that retrieves context and generates answers.
    
    Args:
        query (str): User's question
        
    Returns:
        str: Generated answer
    """
    # Step 1: Agent controller decides the action
    action = agent_controller(query)
    
    if action == "search":
        # Step 2: Search mode - retrieve relevant documents
        print(f"🕵️ Agent decided to SEARCH documents for: '{query}'")
        
        # Retrieve relevant chunks from the vector store
        results = retriever.invoke(query)
        
        # Step 3: Combine retrieved chunks into context
        context = "\n".join([r.page_content for r in results])
        
        # Step 4: Augment the query with context
        final_prompt = f"Use this context:\n{context}\n\nAnswer:\n{query}"
        
    else:
        # Direct mode - answer without searching
        print(f"🤖 Agent decided to answer DIRECTLY: '{query}'")
        final_prompt = query
    
    # Step 5: Generate answer using LLM
    response = llm(final_prompt)[0]["generated_text"]
    return response

print("✅ RAG Agent function created")
print("   The agent can now intelligently route queries and generate answers")


✅ RAG Agent function created
   The agent can now intelligently route queries and generate answers
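
Four 500-character chunks plus the question can approach the input length a small model handles well (as far as we know, the `flan-t5-base` tokenizer reports a maximum of 512 tokens). One way to guard against overlong prompts is to cap the context by a character budget. The sketch below is an assumption-laden variant of the prompt assembly in `rag_answer`: `build_prompt`, the 1500-character default, and the "Question:/Answer:" layout are illustrative choices, not the notebook's original code.

```python
def build_prompt(chunks, query, max_context_chars=1500):
    """Join retrieved chunks into a context block, stopping before a character budget is exceeded."""
    kept, used = [], 0
    for text in chunks:
        cost = len(text) + 1  # +1 for the newline that joins chunks
        if used + cost > max_context_chars:
            break  # chunks arrive most-relevant first, so drop the tail
        kept.append(text)
        used += cost
    context = "\n".join(kept)
    return f"Use this context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Three made-up 600-character chunks; only the first two fit a 1500-character budget
toy_chunks = ["alpha " * 100, "beta " * 120, "gamma " * 100]
prompt = build_prompt(toy_chunks, "What is RAG?", max_context_chars=1500)
print("alpha" in prompt, "beta" in prompt, "gamma" in prompt)  # True True False
```

A character budget is only a rough proxy for tokens; counting with the model's tokenizer would be more precise.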


## Step 9: Testing the RAG Agent

Let's test our complete RAG agent with different types of queries to see how it handles:
- Document-specific questions (should trigger search)
- General knowledge questions (should answer directly)


In [10]:
# Test 1: Document-specific question (should trigger search)
print("=" * 80)
print("TEST 1: Document-Specific Question")
print("=" * 80)
query1 = "Give me a 5-point summary from the PDF"
answer1 = rag_answer(query1)
print(f"\n💬 Answer:\n{answer1}\n")


TEST 1: Document-Specific Question
🕵️ Agent decided to SEARCH documents for: 'Give me a 5-point summary from the PDF'

💬 Answer:
SELF-RAG Overview 62 including language translation, summarization, question answering, and text completion. GPT-3 made it evident that large-scale models can accurately perform a wide – and previously unheard-of – range of NLP tasks, from text summarization to text generation. It also showed that LLMs could generate outputs that are nearly indistinguishable from human-created text, all while learning on their own with minimal human intervention.



In [11]:
# Test 2: General knowledge question (should answer directly)
print("=" * 80)
print("TEST 2: General Knowledge Question")
print("=" * 80)
query2 = "What is an Ideal Resume Format? Explain in 50 words."
answer2 = rag_answer(query2)
print(f"\n💬 Answer:\n{answer2}\n")


TEST 2: General Knowledge Question
🤖 Agent decided to answer DIRECTLY: 'What is an Ideal Resume Format? Explain in 50 words.'

💬 Answer:
An Ideal Resume Format is a format for a resume to be written in a professional manner. An Ideal Resume Format is a format for a resume to be written in a professional manner. An Ideal Resume Format is a format for a resume to be written in a professional manner. An Ideal Resume Format is a format for a resume to be written in a professional manner. An Ideal Resume Format is a format for a resume to be written in a professional manner. An Ideal Resume Format is a format for a resume to be written in a professional manner. An Ideal Resume Format is a format for a resume to be written in a professional manner.
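
The answer above loops on a single sentence, which is common for small models decoding greedily. Generation settings such as `no_repeat_ngram_size` or `repetition_penalty` in Transformers may reduce this at the source. As a simple fallback, repeated sentences can also be collapsed after generation; `collapse_repeats` below is a hypothetical post-processing helper, and its sentence splitting is deliberately naive.

```python
import re


def collapse_repeats(text):
    """Drop sentences that exactly repeat an earlier sentence, keeping first occurrences in order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    seen, kept = set(), []
    for sentence in sentences:
        key = sentence.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence.strip())
    return " ".join(kept)


looped = "A resume should be concise. A resume should be concise. A resume should be concise."
print(collapse_repeats(looped))  # A resume should be concise.
```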




## Summary

You've successfully built a **complete Agentic RAG pipeline** that can:
- ✅ Load documents from PDFs
- ✅ Split them into manageable chunks
- ✅ Create semantic embeddings
- ✅ Store them in a searchable vector database
- ✅ Query for relevant information
- ✅ Use a local LLM for text generation
- ✅ Intelligently route queries (search vs. direct)
- ✅ Generate context-aware answers

### Key Components Built:

1. **Document Processing**: PDF loading and chunking
2. **Vector Store**: ChromaDB with persistent storage
3. **Local LLM**: Google FLAN-T5 for text generation
4. **Agent Controller**: Smart query routing
5. **RAG Agent**: Complete retrieval-augmented generation system

### Improvements You Could Make:

- **Better LLM**: Use larger models (flan-t5-large, llama-2, mistral) for better quality
- **Enhanced Controller**: Add more sophisticated routing logic (e.g., using embeddings)
- **Streaming**: Add streaming responses for better UX
- **Memory**: Add conversation memory for multi-turn dialogues
- **Evaluation**: Add metrics to evaluate answer quality
- **UI**: Build a web interface (Gradio, Streamlit) for easy interaction

This foundation can be extended to build powerful AI applications!


## Advanced: Loading an Existing Database

If you've already created a ChromaDB and want to load it (instead of recreating it), use this:

```python
# Load existing ChromaDB
chroma_db = Chroma(
    persist_directory="chroma_db",
    embedding_function=embedding_model
)
```

This is useful when:
- You've already processed documents
- You want to add new documents to an existing database
- You're running queries without modifying the database
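
To choose between loading and rebuilding automatically, one option is to check whether the persist directory already holds files. The sketch below assumes that a non-empty directory means a database was saved there earlier, which is a heuristic rather than a guarantee; `has_persisted_db` is a hypothetical helper.

```python
from pathlib import Path


def has_persisted_db(persist_directory="chroma_db"):
    """Return True if the directory exists and contains at least one entry."""
    path = Path(persist_directory)
    return path.is_dir() and any(path.iterdir())


# Intended use (left as comments so the sketch runs without LangChain installed):
# if has_persisted_db("chroma_db"):
#     chroma_db = Chroma(persist_directory="chroma_db", embedding_function=embedding_model)
# else:
#     chroma_db = Chroma.from_documents(chunks, embedding_model, persist_directory="chroma_db")
```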


## Next Steps: Building a Complete RAG System

To take this pipeline further, you could:

1. **Swap in a Stronger LLM**: Pass the retrieved chunks as context to a more capable model (e.g., OpenAI, Anthropic, or a larger local model)
2. **Create a Chain**: Use LangChain's `RetrievalQA` chain to combine retrieval + generation
3. **Add a Chat Interface**: Build a conversational interface for your RAG system
4. **Implement Agents**: Add agentic capabilities for more complex reasoning

### Example: Complete RAG Chain

```python
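from langchain.chains import RetrievalQA
# OpenAI models live in the separate langchain-openai package and need an OPENAI_API_KEY
from langchain_openai import OpenAI

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=chroma_db.as_retriever(),
    return_source_documents=True
)

# Use .invoke(); calling the chain object directly is deprecated
response = qa_chain.invoke({"query": "What is the main topic?"})
print(response["result"])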
```

