# Lab 03: Text Chunking - Tutorial

## Splitting Documents for RAG Systems

---

**In this lab, you will learn:**
- Why we need to chunk text
- How to split text using string slicing
- How to create chunking functions
- How to add overlap between chunks

**Time:** ~60 minutes

---

## Part 1: Why Chunking?

LLMs have **token limits** (the maximum context size). Approximate ranges, which vary by model and version:
- GPT-4: 8K-128K tokens
- Claude: 100K-200K tokens
- Local LLMs (Ollama): 2K-8K tokens

Large documents must be split into smaller **chunks** to:
1. Fit within token limits
2. Enable precise retrieval
3. Improve search relevance
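
The first reason can be checked with a rough estimate. The sketch below assumes about 4 characters per token, a rule of thumb for English text rather than an exact figure. The helper name `estimate_tokens` and the 2,000-token limit are made up for illustration; a real tokenizer will give different counts.

```python
# Rough check: does a document fit within a token limit?
# ASSUMPTION: ~4 characters per token (rule of thumb for English text only).
CHARS_PER_TOKEN = 4

def estimate_tokens(text):
    """Return an approximate token count (hypothetical helper)."""
    return len(text) // CHARS_PER_TOKEN

document = "Rubella is a contagious viral infection. " * 500
token_limit = 2000  # illustrative limit, not tied to a specific model

tokens = estimate_tokens(document)
print(f"Characters: {len(document)}")
print(f"Estimated tokens: {tokens}")
print(f"Fits in limit: {tokens <= token_limit}")
```

When the estimate exceeds the limit, the document has to be split before it can be used.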

## Part 2: String Slicing Basics

Python strings can be sliced with `text[start:end]`, which returns the characters from index `start` up to, but not including, index `end`.

In [None]:
# Example text
text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(f"Full text: {text}")
print(f"Length: {len(text)}")

In [None]:
# Slicing: text[start:end]
# Returns characters from index 'start' to 'end-1'

print(text[0:5])    # ABCDE (index 0,1,2,3,4)
print(text[5:10])   # FGHIJ (index 5,6,7,8,9)
print(text[10:15])  # KLMNO
print(text[20:])    # UVWXYZ (from 20 to end)
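
Two slice shortcuts are useful for chunking: leaving out `start` or `end` defaults to the beginning or end of the string, and negative indices count from the end. A short sketch with the same alphabet string:

```python
text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

print(text[:5])    # ABCDE (start defaults to 0)
print(text[20:])   # UVWXYZ (end defaults to len(text))
print(text[-3:])   # XYZ (last three characters)
print(text[:-3])   # ABCDEFGHIJKLMNOPQRSTUVW (everything except the last three)
```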

In [None]:
# Manual chunking
chunk_size = 10

chunk1 = text[0:10]    # Characters 0-9
chunk2 = text[10:20]   # Characters 10-19
chunk3 = text[20:30]   # Characters 20-25 (stops at end)

print(f"Chunk 1: {chunk1}")
print(f"Chunk 2: {chunk2}")
print(f"Chunk 3: {chunk3}")
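
The third chunk works even though index 30 is past the end of the string. A small sketch of why: slices are clamped to the string length, while indexing a single position past the end raises an error.

```python
text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# A slice whose end is past the string length is clamped to the end.
last_chunk = text[20:30]
print(last_chunk)         # UVWXYZ
print(len(last_chunk))    # 6

# A slice that starts past the end is simply empty.
empty_chunk = text[30:40]
print(repr(empty_chunk))  # ''

# Indexing a single position past the end does raise an error.
try:
    text[30]
except IndexError as error:
    print(f"IndexError: {error}")
```

This is why a chunking loop does not need a special case for the final, shorter chunk.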

## Part 3: Automatic Chunking with a Loop

In [None]:
# Using range(start, stop, step)
text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
chunk_size = 10

# range(0, 26, 10) gives: 0, 10, 20
for i in range(0, len(text), chunk_size):
    print(f"Start index: {i}")
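
The number of start indices is the text length divided by the chunk size, rounded up. A minimal sketch of that arithmetic, using `math.ceil` from the standard library:

```python
import math

text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
chunk_size = 10

starts = list(range(0, len(text), chunk_size))
expected_count = math.ceil(len(text) / chunk_size)

print(starts)           # [0, 10, 20]
print(expected_count)   # 3
```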

In [None]:
# Complete chunking function
def simple_chunk(text, chunk_size):
    """
    Split text into chunks of specified size.
    
    Args:
        text: The text to chunk
        chunk_size: Number of characters per chunk
    
    Returns:
        List of text chunks
    """
    chunks = []
    
    for i in range(0, len(text), chunk_size):
        chunk = text[i:i + chunk_size]
        chunks.append(chunk)
    
    return chunks

# Test it
result = simple_chunk("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 10)
print(result)
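
The same loop can be written as a list comprehension. This is only an alternative sketch; the name `simple_chunk_compact` is made up here so it does not replace the function above. Joining the chunks is a quick way to confirm that no characters were lost.

```python
def simple_chunk_compact(text, chunk_size):
    """Same idea as a loop-and-append chunker, as a list comprehension."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
chunks = simple_chunk_compact(text, 10)
print(chunks)  # ['ABCDEFGHIJ', 'KLMNOPQRST', 'UVWXYZ']

# Without overlap, joining the chunks gives back the original text.
print("".join(chunks) == text)  # True
```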

In [None]:
# Try with real text
disease_text = "Rubella is a contagious disease caused by the Rubella virus. It typically causes fever, rash, and swollen lymph nodes."

chunks = simple_chunk(disease_text, 30)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: '{chunk}'")
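
Fixed-size chunks ignore word boundaries. The sketch below flags chunk boundaries that fall inside a word. The helper names `fixed_chunks` and `cuts_word` are hypothetical, and the test is a simple heuristic: a boundary counts as a cut when the characters on both sides are letters or digits.

```python
def fixed_chunks(text, chunk_size):
    """Hypothetical compact chunker without overlap."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def cuts_word(left, right):
    """True if the boundary between two chunks falls inside a word."""
    return left[-1].isalnum() and right[0].isalnum()

disease_text = "Rubella is a contagious disease caused by the Rubella virus."
chunks = fixed_chunks(disease_text, 30)

# Compare each chunk with the one that follows it.
for left, right in zip(chunks, chunks[1:]):
    status = "cuts a word" if cuts_word(left, right) else "clean break"
    print(f"'{left[-8:]}' | '{right[:8]}' -> {status}")
```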

## Part 4: Chunking with Overlap

**Problem:** Simple chunking cuts words and sentences at arbitrary positions, so a chunk can lose the context it needs.

**Solution:** Add overlap between chunks!

```
No Overlap:     [AAAA][BBBB][CCCC]
With Overlap:   [AAAA][AABB][BBCC][CCCC]
                       ^^    ^^    ^^
                     overlap characters
```
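
The diagram can be read as arithmetic: each chunk starts `chunk_size - overlap` characters after the previous one. A small sketch with illustrative values (10 and 3 are picked only for this example):

```python
text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
chunk_size = 10
overlap = 3

# Each new chunk starts (chunk_size - overlap) characters after the previous one.
step = chunk_size - overlap
starts = list(range(0, len(text), step))

print(step)    # 7
print(starts)  # [0, 7, 14, 21]
```

The overlap must be smaller than the chunk size. Otherwise the step is zero or negative and the start position never moves forward.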

In [None]:
def chunk_with_overlap(text, chunk_size, overlap):
    """
    Split text into chunks with overlap.
    
    Args:
        text: The text to chunk
        chunk_size: Number of characters per chunk
        overlap: Number of characters to overlap
    
    Returns:
        List of overlapping chunks
    """
    chunks = []
    start = 0
    
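    # Guard against an endless loop: start must move forward each time
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    while start < len(text):
        # Get chunk from start to start+chunk_size
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)

        # Stop once this chunk reaches the end of the text;
        # another chunk would only repeat the overlap
        if end >= len(text):
            break

        # Move start position (minus overlap)
        start = end - overlap

    return chunks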

# Test with simple text
text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
result = chunk_with_overlap(text, 10, 3)

for i, chunk in enumerate(result):
    print(f"Chunk {i+1}: {chunk}")

In [None]:
# Notice the overlap!
print("Notice: 'HIJ' appears in both Chunk 1 and Chunk 2")
print("        'OPQ' appears in both Chunk 2 and Chunk 3")
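
The overlap can also be checked in code instead of by eye. This sketch uses a compact stand-in chunker named `overlap_chunks` (a hypothetical helper that assumes `overlap` is smaller than `chunk_size`) and compares the tail of each chunk with the head of the next.

```python
def overlap_chunks(text, chunk_size, overlap):
    """Hypothetical compact chunker; assumes 0 <= overlap < chunk_size."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks

overlap = 3
chunks = overlap_chunks("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 10, overlap)

# The last `overlap` characters of a chunk should open the next chunk.
for i in range(len(chunks) - 1):
    tail = chunks[i][-overlap:]
    head = chunks[i + 1][:overlap]
    print(f"Chunk {i+1} -> Chunk {i+2}: '{tail}' == '{head}' -> {tail == head}")
```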

In [None]:
# Apply to real text
disease_text = "Rubella causes fever. The rash spreads quickly. Treatment includes rest and fluids."

print("With overlap (size=25, overlap=10):")
chunks = chunk_with_overlap(disease_text, 25, 10)
for i, chunk in enumerate(chunks):
    print(f"  {i+1}: '{chunk}'")
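
Overlap has a cost: the same characters are stored more than once, and the number of chunks can grow. The sketch below compares a few overlap values on the same text. `overlap_chunks` is a hypothetical stand-in that assumes `overlap` is smaller than `chunk_size`.

```python
def overlap_chunks(text, chunk_size, overlap):
    """Hypothetical compact chunker; assumes 0 <= overlap < chunk_size."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks

disease_text = "Rubella causes fever. The rash spreads quickly. Treatment includes rest and fluids."

counts = {}
for overlap in (0, 5, 10):
    chunks = overlap_chunks(disease_text, 25, overlap)
    counts[overlap] = len(chunks)
    stored = sum(len(chunk) for chunk in chunks)  # characters kept across all chunks
    print(f"overlap={overlap}: {len(chunks)} chunks, {stored} characters stored")
```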

## Part 5: Practical Application

In [None]:
# Complete document chunker
def chunk_document(content, chunk_size=100, overlap=20):
    """
    Chunk a document and return chunks with metadata.
    """
    chunks = []
    start = 0
    chunk_id = 0
    
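    # Guard against an endless loop: start must move forward each time
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    while start < len(content):
        end = start + chunk_size
        chunk_text = content[start:end]

        chunks.append({
            "id": chunk_id,
            "text": chunk_text,
            "start": start,
            "end": min(end, len(content))  # clamp to the document length
        })

        # Stop once a chunk reaches the end of the document;
        # another chunk would only repeat the overlap
        if end >= len(content):
            break

        start = end - overlap
        chunk_id += 1

    return chunks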

# Test
document = """
Rubella, also known as German measles, is a contagious viral infection.
The main symptoms include low-grade fever, pink rash, and swollen lymph nodes.
Treatment involves rest, fluids, and fever reducers like paracetamol.
Prevention is through the MMR vaccine given to children.
""".strip()

chunks = chunk_document(document, chunk_size=80, overlap=20)

print(f"Document length: {len(document)} characters")
print(f"Number of chunks: {len(chunks)}")
print("\nChunks:")
for chunk in chunks:
    print(f"  ID {chunk['id']}: [{chunk['start']}-{chunk['end']}] '{chunk['text'][:40]}...'")
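
The `start` and `end` fields make it possible to trace each chunk back to its place in the source. As a sanity check, the sketch below rebuilds the document from the chunks by skipping the overlapping part of each one. `chunk_with_metadata` is a hypothetical compact stand-in that assumes `overlap` is smaller than `chunk_size`.

```python
def chunk_with_metadata(content, chunk_size=100, overlap=20):
    """Hypothetical compact chunker; assumes 0 <= overlap < chunk_size."""
    chunks = []
    step = chunk_size - overlap
    for chunk_id, start in enumerate(range(0, len(content), step)):
        end = min(start + chunk_size, len(content))
        chunks.append({"id": chunk_id, "text": content[start:end],
                       "start": start, "end": end})
        if end == len(content):
            break  # this chunk already reaches the end of the document
    return chunks

document = (
    "Rubella, also known as German measles, is a contagious viral infection. "
    "The main symptoms include low-grade fever, pink rash, and swollen lymph nodes."
)
chunks = chunk_with_metadata(document, chunk_size=80, overlap=20)

# 1. Each chunk's text matches the slice described by its metadata.
for chunk in chunks:
    assert document[chunk["start"]:chunk["end"]] == chunk["text"]

# 2. Skipping the overlapping prefix of each chunk rebuilds the document.
rebuilt = ""
position = 0  # how much of the document has been rebuilt so far
for chunk in chunks:
    rebuilt += chunk["text"][position - chunk["start"]:]
    position = chunk["end"]

print(rebuilt == document)  # True
```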

---

## Summary

| Concept | Code | Description |
|---------|------|-------------|
| String slice | `text[start:end]` | Get substring |
| Range loop | `range(0, len(text), chunk_size)` | Generate start indices |
| Simple chunk | Loop + slice | No overlap |
| Overlap chunk | `start = end - overlap` | Context preserved |

### Key Points:
- Chunk size affects search granularity
- Overlap preserves context between chunks
- Typical starting values: chunk_size=500-1000 characters, overlap=50-100 characters

### Next Step:
Now open `exercise/Lab03_Exercise.ipynb` and complete the exercises!