# RAG Pipeline with LlamaIndex

## What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models by providing them with relevant information from a knowledge base. Instead of relying solely on the model's training data, RAG retrieves relevant documents and uses them as context for generating more accurate, grounded responses.
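To make the retrieve-then-generate flow concrete, here is a minimal sketch in plain Python. It is not how LlamaIndex works internally: it ranks documents by simple word overlap instead of embeddings, and it stops at building the prompt rather than calling a model. The helper names (`tokenize`, `overlap_score`, `retrieve`, `build_prompt`) and the sample knowledge base are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, doc):
    """Count word occurrences shared by the query and the document."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum((q & d).values())


def retrieve(query, docs, top_k=1):
    """Return the top_k documents with the highest overlap score."""
    ranked = sorted(docs, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_docs):
    """Place the retrieved documents in the prompt as grounding context."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Made-up knowledge base for illustration
knowledge_base = [
    "The office is closed on public holidays.",
    "Support tickets are answered within two business days.",
    "The cafeteria serves lunch from noon to 2 pm.",
]

question = "When are support tickets answered?"
prompt = build_prompt(question, retrieve(question, knowledge_base))
print(prompt)
```

In a real pipeline, the prompt would then be sent to the LLM, which is the step LlamaIndex and Claude handle in the rest of this notebook.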

## What You'll Learn

In this notebook, you'll build a basic RAG pipeline using LlamaIndex and Claude. The pipeline includes:

1. **Set Up the LLM and Embedding Model** - Configure Claude for generation and a HuggingFace embedding model for search
2. **Download Data** - Get a sample document to index
3. **Load Data** - Read documents into LlamaIndex
4. **Index Data** - Create a searchable vector index
5. **Create Query Engine** - Build an interface for querying
6. **Querying** - Ask questions and get answers grounded in your documents

## Prerequisites

- Python 3.8+
- An Anthropic API key ([Get one here](https://console.anthropic.com/))
- Basic familiarity with Python

## Use Case

This pattern is useful when you need to:
- Answer questions about specific documents or knowledge bases
- Reduce hallucinations by grounding responses in source material
- Build chatbots with domain-specific knowledge
- Create Q&A systems for technical documentation

### Installation

**Note for VSCode users**: If you're running this notebook in VSCode, you may need to install `ipykernel`:
```bash
pip install ipykernel
```

Then install the required packages:

In [1]:
!pip install llama-index
!pip install llama-index-llms-anthropic
!pip install llama-index-embeddings-huggingface



### Set Up API Keys

You'll need an Anthropic API key to use Claude.

**To obtain an API key:**
1. Go to [console.anthropic.com](https://console.anthropic.com/)
2. Sign up or log in
3. Navigate to the API Keys section
4. Create a new API key

**To set up your API key:**

**Linux/Mac:**
```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

**Windows (PowerShell):**
```powershell
$env:ANTHROPIC_API_KEY='your-api-key-here'
```

**Windows (Command Prompt):**
```cmd
set ANTHROPIC_API_KEY=your-api-key-here
```

**Or use a .env file (recommended for development):**

Create a `.env` file in the same directory:
```
ANTHROPIC_API_KEY=your-api-key-here
```

Then install `python-dotenv` (`pip install python-dotenv`) and load the file before reading the key:
```python
from dotenv import load_dotenv

# Load variables from .env into the process environment
load_dotenv()
```

In [2]:
import os

# Verify ANTHROPIC_API_KEY is set in environment
api_key = os.getenv('ANTHROPIC_API_KEY')
if not api_key:
    raise ValueError("ANTHROPIC_API_KEY not found in environment variables. Please set it before running this notebook.")

print("✓ ANTHROPIC_API_KEY found in environment")

✓ ANTHROPIC_API_KEY found in environment


### Set Up the LLM and Embedding Model

We will use:
- **Claude 3.5 Sonnet** (model ID `claude-3-5-sonnet-20241022`) as our LLM for generating responses
- **HuggingFace BGE embeddings** for converting text into vector representations

The BGE (BAAI General Embedding) model is a high-quality, open-source embedding model that works well for semantic search and runs locally without requiring API calls.
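To illustrate what "vector representations" buys you: semantic search compares the query's embedding with each chunk's embedding, commonly with cosine similarity. The sketch below uses made-up three-dimensional vectors (the real `BAAI/bge-base-en-v1.5` model produces 768-dimensional vectors), and `cosine_similarity` is a hypothetical helper written for this example, not a LlamaIndex function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up toy embeddings for illustration
query_vec = [1.0, 0.0, 1.0]
similar_vec = [2.0, 0.0, 2.0]    # same direction as the query, different length
unrelated_vec = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query_vec, similar_vec))
print(cosine_similarity(query_vec, unrelated_vec))
```

Vectors pointing in the same direction score close to 1 regardless of length, while orthogonal vectors score 0, which is why chunks with similar meaning rank higher for a given query.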

In [3]:
from llama_index.llms.anthropic import Anthropic
from llama_index.embeddings.huggingface import HuggingFaceEmbedding



In [4]:
llm = Anthropic(temperature=0.0, model='claude-3-5-sonnet-20241022')
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

In [5]:
from llama_index.core import Settings
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512

### Configure Global Settings

The cell above configures LlamaIndex's global settings:

- **`Settings.llm`** - The language model for generation (Claude 3.5 Sonnet)
- **`Settings.embed_model`** - The embedding model for converting text to vectors (BGE)
- **`Settings.chunk_size = 512`** - Documents are split into 512-token chunks

**Why chunk_size=512?**

This parameter trades retrieval precision against context:
- **Smaller chunks (e.g., 256)**: More precise retrieval, but each chunk may lack surrounding context
- **Larger chunks (e.g., 1024)**: More context per chunk, but less precise matching
- **512 tokens** is a reasonable starting point; tune it against your own documents and queries
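The effect of chunk size is easy to see with a toy splitter. The sketch below is an assumption-laden simplification: it counts whitespace-separated words rather than model tokens and ignores sentence boundaries, so it does not reproduce LlamaIndex's own splitting. `chunk_words` and its `overlap` parameter are hypothetical names for this example.

```python
def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so that content cut at a
    boundary still appears together in one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Ten placeholder words: w0 ... w9
sample = " ".join(f"w{i}" for i in range(10))

small = chunk_words(sample, chunk_size=4, overlap=1)
large = chunk_words(sample, chunk_size=8, overlap=1)

print(len(small), small)
print(len(large), large)
```

Smaller chunks produce more, narrower pieces to match against; larger chunks produce fewer pieces that each carry more surrounding text.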

### Download Data

We'll download the Paul Graham essay if it doesn't already exist in the project's data directory.

In [6]:
import os
import urllib.request

# Use the project's existing data/text directory (one level up from inspiration/)
data_dir = os.path.join('..', 'data', 'text')
os.makedirs(data_dir, exist_ok=True)

# Download file if it doesn't exist
url = 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt'
output_path = os.path.join(data_dir, 'paul_graham_essay.txt')

if os.path.exists(output_path):
    print(f"✓ File already exists: {output_path}")
else:
    print(f"Downloading {url}...")
    urllib.request.urlretrieve(url, output_path)
    print(f"✓ Downloaded to {output_path}")

✓ File already exists: ..\data\text\paul_graham_essay.txt
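If the download is interrupted, `urlretrieve` can leave a partial file behind, and the existence check would then skip re-downloading it. One possible hardening, sketched below under that assumption, writes to a temporary file first and renames it into place only after the transfer completes. `download_if_missing` and its `timeout` parameter are hypothetical names for this example; the sketch uses only the standard library.

```python
import os
import shutil
import tempfile
import urllib.request


def download_if_missing(url, output_path, timeout=30):
    """Download url to output_path unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(output_path):
        return False

    target_dir = os.path.dirname(output_path) or "."
    os.makedirs(target_dir, exist_ok=True)

    # Write to a temporary file in the same directory, then rename it,
    # so output_path never holds a partially downloaded file.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir)
    try:
        with os.fdopen(fd, "wb") as tmp_file, \
                urllib.request.urlopen(url, timeout=timeout) as response:
            shutil.copyfileobj(response, tmp_file)
        os.replace(tmp_path, output_path)
    except BaseException:
        # Clean up the temporary file on any failure, then re-raise
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

Calling `download_if_missing(url, output_path)` with the `url` and `output_path` values from the cell above would behave like the original `if`/`else` block, minus the progress messages.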


In [7]:
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
)

### Load Data

In [8]:
documents = SimpleDirectoryReader("../data/text").load_data()

### Index Data

In [9]:
index = VectorStoreIndex.from_documents(
    documents,
)

### Create Query Engine

In [10]:
query_engine = index.as_query_engine(similarity_top_k=3)
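`similarity_top_k=3` asks the retriever to pass the three highest-scoring chunks to the LLM as context. The sketch below illustrates the selection step on made-up scores and gives a rough upper bound on how much retrieved text reaches the model. `top_k_chunks` and `approx_context_tokens` are hypothetical helpers, and the bound ignores the prompt template and the question itself.

```python
import heapq


def top_k_chunks(scored_chunks, k):
    """Return the k (score, chunk) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[0])


def approx_context_tokens(chunk_size, top_k):
    """Rough upper bound on retrieved context: top_k chunks of chunk_size tokens each."""
    return chunk_size * top_k


# Made-up similarity scores for five chunks
scored = [
    (0.42, "chunk A"),
    (0.91, "chunk B"),
    (0.15, "chunk C"),
    (0.77, "chunk D"),
    (0.60, "chunk E"),
]

selected = top_k_chunks(scored, k=3)
print([chunk for _, chunk in selected])  # ['chunk B', 'chunk D', 'chunk E']
print(approx_context_tokens(512, 3))     # 1536
```

Raising `similarity_top_k` gives the model more material to draw on at the cost of a longer prompt; lowering it keeps the prompt short but risks missing the relevant chunk.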

### Test Queries

Let's test the RAG pipeline with multiple questions to see how it retrieves and answers from the Paul Graham essay.

In [11]:
# Test multiple queries to demonstrate different use cases
queries = [
    "What did author do growing up?",
    "What companies did the author start?",
    "What is the author's view on programming languages?"
]

for i, query in enumerate(queries, 1):
    print(f"\n{'='*70}")
    print(f"Query {i}: {query}")
    print('='*70)
    response = query_engine.query(query)
    print(f"\nAnswer: {response}\n")


Query 1: What did author do growing up?


2025-10-19 12:23:16,382 - INFO - HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"



Answer: From the provided context, I cannot definitively answer what the author did growing up, as the text doesn't provide clear information about the author's childhood or early years. While there is a brief mention that the author skipped time-sharing machines and went straight from batch processing to microcomputers, this is only a small technical detail and doesn't give us a complete picture of their upbringing.


Query 2: What companies did the author start?


2025-10-19 12:23:20,173 - INFO - HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"



Answer: From the provided context, the author helped start Y Combinator (YC), an investment firm that funded startups in batches. He started this with Jessica Livingston, Robert, and Trevor in 2005. He also created Hacker News (which was originally called Startup News) as a news aggregator written in Arc programming language. While there are hints of previous startups (like something that was sold to Yahoo), the specific details of other companies aren't explicitly mentioned in the given context.


Query 3: What is the author's view on programming languages?


2025-10-19 12:23:25,672 - INFO - HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"



Answer: The author shows a particular appreciation for Lisp, viewing it as a language with unique power and elegance due to its origins as a model of computation. He found other programming languages of his time, like PL/I at Cornell, to be relatively primitive. Learning Lisp significantly expanded his understanding of what was possible in programming, to the point where it took years for him to grasp the new boundaries of what could be done.

The author was so invested in Lisp's potential that he spent 4 years creating his own version called Bel, demonstrating his deep commitment to the language's fundamental principles. He was especially drawn to Lisp's distinctive characteristic of being defined by an interpreter written in itself, which set it apart from other programming languages.

His perspective suggests that he values programming languages that offer theoretical elegance and computational power over more conventional, practical languages that might lack these fundamental qual
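When an answer looks incomplete, as with Query 1 above, it can help to check which chunks were actually retrieved. LlamaIndex responses expose the retrieved chunks as `response.source_nodes`, where each item carries a `score` and a `node` with a `get_content()` method. The helper below is a hypothetical sketch built on those attributes; it is demonstrated on stand-in objects with made-up text so it runs without an API call.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars=80):
    """Return one line per retrieved chunk: rank, similarity score, text preview."""
    lines = []
    for rank, item in enumerate(response.source_nodes, start=1):
        # Collapse newlines and repeated spaces so each preview fits on one line
        text = " ".join(item.node.get_content().split())
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        lines.append(f"{rank}. score={score} {text[:max_chars]}")
    return lines


# Stand-in objects that mimic the attributes used above (made-up content)
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.8123,
            node=SimpleNamespace(get_content=lambda: "First chunk of   text\nwith odd spacing."),
        ),
        SimpleNamespace(
            score=None,
            node=SimpleNamespace(get_content=lambda: "Second chunk without a score."),
        ),
    ]
)

for line in summarize_sources(fake_response):
    print(line)
```

With the real pipeline, passing the object returned by `query_engine.query(query)` to `summarize_sources` would show whether the retrieved chunks contain the passage the question is about.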

## Conclusion

You've successfully built a basic RAG pipeline! Here's what we accomplished:

✅ **Set up** Claude 3.5 Sonnet and HuggingFace embeddings  
✅ **Loaded** documents into LlamaIndex  
✅ **Created** a searchable vector index  
✅ **Queried** the knowledge base with natural language  

### Next Steps

- **Try your own documents**: Replace the Paul Graham essay with your own PDFs, text files, or markdown
- **Tune parameters**: Experiment with `chunk_size` and `similarity_top_k` for your use case
- **Add metadata**: Use document metadata to filter and organize your knowledge base
- **Multi-document queries**: Build systems that query across multiple documents (see [structured-docs-rag Phase 2](https://github.com/dzivkovi/structured-docs-rag))
- **Production deployment**: Add error handling, caching, and monitoring for production use

### Resources

- [LlamaIndex Documentation](https://docs.llamaindex.ai/)
- [Anthropic Claude Documentation](https://docs.anthropic.com/)
- [Structured Docs RAG](https://github.com/dzivkovi/structured-docs-rag) - Production-grade RAG for high-stakes domains
- [Anthropic Cookbooks](https://github.com/anthropics/claude-cookbooks) - More examples and patterns

**Source**: This notebook is inspired by [Anthropic's LlamaIndex Cookbook](https://github.com/anthropics/claude-cookbooks/tree/main/third_party/LlamaIndex)