# Module 2: Document Processing & Chunking
  
**Level:** Beginner to Intermediate  
**Prerequisites:** Module 1 completed

---

## Learning Objectives

By the end of this module, you will be able to:

- Explain why chunking is necessary in RAG systems
- Implement basic chunking strategies from scratch
- Choose appropriate chunk sizes for different use cases
- Understand the trade-offs between different chunking approaches
- Preserve and use document metadata effectively
- Handle different document formats (text, PDF)

---

# 1. Why Chunking Matters

## 1.1 The Problem with Whole Documents

Imagine you have a 50-page legal contract and someone asks: *"What is the termination notice period?"*

**Problems with using the whole document:**

1. **Too long for context windows**
   - LLM context windows are limited (roughly 4K-128K tokens, depending on the model)
   - A 50-page document might be 20,000+ tokens
   - Several such documents often won't fit in one prompt alongside the question

2. **Poor retrieval precision**
   - Embedding a whole document loses specificity
   - Can't pinpoint the exact relevant section
   - Answer might be buried in irrelevant content

3. **Inefficient and costly**
   - Processing entire documents is slow
   - Expensive in terms of tokens/API costs
   - Wastes context window space
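
To make the first problem concrete, here is a minimal sketch that estimates whether a document fits in a context window. It assumes a rough heuristic of about 0.75 English words per token (an approximation, not a real tokenizer), and the names `estimate_tokens`, `fits_in_context`, and `reserved_for_answer` are hypothetical helpers introduced only for illustration.

```python
# Rough heuristic (an assumption, not a tokenizer): about 0.75 English words per token.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the whitespace-separated word count."""
    word_count = len(text.split())
    return round(word_count / WORDS_PER_TOKEN)


def fits_in_context(text: str, context_limit: int, reserved_for_answer: int = 500) -> bool:
    """Return True if the estimated tokens still leave room for the model's answer."""
    return estimate_tokens(text) + reserved_for_answer <= context_limit


# Stand-in for a 50-page contract at ~400 words per page (hypothetical figures).
contract = "word " * (50 * 400)

print(estimate_tokens(contract))           # 26667
print(fits_in_context(contract, 4_000))    # False
print(fits_in_context(contract, 128_000))  # True
```

For real measurements you would swap the heuristic for the tokenizer that matches your model, since token counts vary by model and language.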

## 1.2 How Chunking Solves This

**Chunking = Breaking documents into smaller, meaningful pieces**

**Benefits:**
- Each chunk fits in the context window
- More precise retrieval (find exact relevant section)
- Better embeddings (more specific semantic meaning)
- Faster and cheaper processing

**Example:**
```
50-page contract
      ↓
Split into ~50 chunks (about 500 words each)
      ↓
Embed each chunk separately
      ↓
Retrieve only the 3 most relevant chunks
```
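
The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical stand-in that uses word counts instead of a real embedding model, and `chunk_by_words`, `cosine`, and `retrieve` are names invented for this sketch.

```python
import math
from collections import Counter


def chunk_by_words(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the `top_k` chunks most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)
    return ranked[:top_k]


# Tiny hand-written "chunks" so the result is easy to check by eye.
chunks = [
    "payment terms net thirty",
    "termination notice period is thirty days",
    "governing law",
]
print(retrieve("termination notice", chunks, top_k=1))
```

In a real system the word-count vectors would be replaced by embeddings from a model, but the shape of the pipeline (split, embed, rank, keep the top few) stays the same.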

---

# 2. Understanding Chunk Size

The most important decision in chunking: **How big should chunks be?**

## 2.1 The Chunk Size Trade-off

### Small Chunks (100-300 tokens)

**Pros:**
- Very precise retrieval
- Specific answers to specific questions
- Less noise in retrieved context

**Cons:**
- May lose important surrounding context
- Might split related information
- Need to retrieve more chunks to get the full picture

**Best for:** FAQ systems, specific fact lookup

---

### Medium Chunks (300-600 tokens)

**Pros:**
- Good balance of precision and context
- Usually keeps related information together
- Most versatile option

**Cons:**
- A compromise: not optimized for any one query type

**Best for:** Most general RAG use cases (recommended starting point)

---

### Large Chunks (600-1000+ tokens)

**Pros:**
- Preserves more context
- Better for complex, interconnected information
- Fewer chunks to manage

**Cons:**
- Less precise retrieval
- More irrelevant information included
- May exceed context limits with multiple chunks

**Best for:** Research papers, narrative documents
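
A quick back-of-the-envelope sketch can make the trade-off tangible. The sizes (200, 450, 800 tokens), the 20,000-token document, and the 3,000-token context budget below are illustrative assumptions picked from within the ranges above, and the helper names are hypothetical.

```python
import math


def chunk_count(total_tokens: int, chunk_tokens: int) -> int:
    """Number of chunks needed to cover a document (no overlap)."""
    return math.ceil(total_tokens / chunk_tokens)


def max_chunks_in_budget(context_budget: int, chunk_tokens: int) -> int:
    """How many whole chunks fit into the tokens set aside for retrieved context."""
    return context_budget // chunk_tokens


# Illustrative sizes taken from the small / medium / large ranges.
for label, size in [("small", 200), ("medium", 450), ("large", 800)]:
    print(label, chunk_count(20_000, size), max_chunks_in_budget(3_000, size))
# small 100 15
# medium 45 6
# large 25 3
```

Smaller chunks mean more pieces to index but more of them fit in the prompt; larger chunks carry more context each but leave room for only a few.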

## 2.2 Rule of Thumb

**Start with 400-500 tokens (roughly 300-400 words)**

This is a good default for most use cases. You can adjust based on:
- Document type (technical docs → smaller, narratives → larger)
- Query types (specific facts → smaller, summaries → larger)
- Testing and evaluation results

**Remember:** There's no perfect size - it depends on your use case!
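
As a starting point, the rule of thumb can be turned into a small helper that converts a token target into a word window and chunks by that window. This is a sketch, assuming roughly 0.75 words per token (consistent with "400-500 tokens ≈ 300-400 words"); `words_for_token_target` and `chunk_to_token_target` are hypothetical names, and the 450-token default is just a midpoint of the suggested range.

```python
def words_for_token_target(target_tokens: int, words_per_token: float = 0.75) -> int:
    """Convert a token target into an approximate word count (heuristic, not exact)."""
    return max(1, int(target_tokens * words_per_token))


def chunk_to_token_target(text: str, target_tokens: int = 450,
                          words_per_token: float = 0.75) -> list[str]:
    """Split text into word windows sized to approximate `target_tokens` per chunk."""
    window = words_for_token_target(target_tokens, words_per_token)
    words = text.split()
    return [" ".join(words[i:i + window]) for i in range(0, len(words), window)]


print(words_for_token_target(400))  # 300
print(words_for_token_target(500))  # 375

# 1,000 words at a 400-token target -> windows of 300 words -> 4 chunks.
sample = "w " * 1000
print(len(chunk_to_token_target(sample, target_tokens=400)))  # 4
```

Treat `target_tokens` as a knob to tune during evaluation rather than a fixed setting.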