# Contextual Retrieval for Large Documents

When a document is too large to attach in full to every chunk, we need strategies that preserve the context each chunk depends on while keeping the added text small.

This notebook demonstrates:
1. **Hierarchical Chunking** - Add parent section context instead of full document
2. **Summary-Based Context** - Use document summaries
3. **Metadata Extraction** - Add structured metadata
4. **Sliding Window Context** - Include surrounding chunks

We'll use the VoyageAI API Key Directions PDF as our example document.

In [None]:
# Setup
from dotenv import load_dotenv
from anthropic import Anthropic
import json
import re
from typing import List, Dict

load_dotenv()
client = Anthropic()
model = "claude-3-5-sonnet-20241022"

In [None]:
# Sample document from VoyageAI_API_Key_Directions.pdf
VOYAGE_API_DOCUMENT = """Getting an API Key from VoyageAI

Step 1: Sign up to VoyageAI
1. Navigate to https://www.voyageai.com/
2. Click on the Login button in the top right corner
3. Create an account

Step 2: Create an API key
1. Once logged in, find the 'API Keys' section on the left nav bar
2. Click the 'Create new secret key' button
3. Enter a key name of "Test Key" then click 'Create secret key'
4. Copy the key

Step 3: Add the key to your ".env" file
1. In your editor, find your ".env" file that's next to your notebook.
2. Add in your api key, assigning it to a variable named exactly "VOYAGE_API_KEY"
"""

print(f"Document length: {len(VOYAGE_API_DOCUMENT)} characters")
print(f"\nDocument preview:\n{VOYAGE_API_DOCUMENT[:200]}...")

## Strategy 1: Hierarchical Chunking

Instead of adding the entire document to each chunk, we extract the section/chapter title and add only that context.

In [None]:
def extract_hierarchical_chunks(document: str) -> List[Dict]:
    """
    Split document into sections and create chunks with section context.
    Each chunk only includes its immediate parent section, not the full document.
    """
    chunks = []
    
    # Split on header lines ("Step 1: ...", "Step 2: ..."). The capture group
    # keeps each header in the result, at the odd indices.
    sections = re.split(r'^(Step \d+:[^\n]*)$', document, flags=re.MULTILINE)
    
    current_section = "Introduction"
    
    for i, part in enumerate(sections):
        if not part.strip():
            continue
            
        # Odd indices hold the captured section headers
        if i % 2 == 1:
            current_section = part.strip()
        else:
            # This is content - split into smaller chunks if needed
            content_parts = part.strip().split('\n\n')
            
            for content in content_parts:
                if content.strip():
                    # Add section context to chunk
                    enriched_chunk = {
                        'section': current_section,
                        'original_content': content.strip(),
                        'contextualized_content': f"Section: {current_section}\n\nContent: {content.strip()}"
                    }
                    chunks.append(enriched_chunk)
    
    return chunks

hierarchical_chunks = extract_hierarchical_chunks(VOYAGE_API_DOCUMENT)

print(f"Created {len(hierarchical_chunks)} hierarchical chunks\n")
for i, chunk in enumerate(hierarchical_chunks[:3], 1):
    print(f"--- Chunk {i} ---")
    print(f"Section: {chunk['section']}")
    print(f"Contextualized content:\n{chunk['contextualized_content'][:150]}...\n")
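
The same idea extends to documents with nested headings, where a chunk benefits from the whole chain of parent sections rather than only the nearest one. The sketch below assumes Markdown-style `#` headings and blank-line-separated paragraphs; `chunk_with_breadcrumbs` is a hypothetical helper, not part of the notebook above, and it does not handle `#` lines inside fenced code.

```python
import re
from typing import Dict, List

def chunk_with_breadcrumbs(markdown_text: str) -> List[Dict]:
    """Attach the chain of ancestor headings to each paragraph (hypothetical helper)."""
    heading_re = re.compile(r'^(#{1,6})\s+(.*)$')
    path: List[str] = []      # headings from the top level down to the current section
    buffer: List[str] = []    # lines of the paragraph being collected
    chunks: List[Dict] = []

    def flush() -> None:
        text = "\n".join(buffer).strip()
        buffer.clear()
        if text:
            breadcrumb = " > ".join(path) if path else "Introduction"
            chunks.append({
                'section': breadcrumb,
                'original_content': text,
                'contextualized_content': f"Section: {breadcrumb}\n\nContent: {text}",
            })

    for line in markdown_text.splitlines():
        match = heading_re.match(line)
        if match:
            flush()
            depth = len(match.group(1))
            # Drop headings at this depth or deeper, then record the new one
            del path[depth - 1:]
            path.append(match.group(2).strip())
        elif not line.strip():
            flush()           # a blank line ends the current paragraph
        else:
            buffer.append(line)
    flush()
    return chunks

sample = "# Guide\n\nIntro text.\n\n## Setup\n\nInstall it.\n\n### Keys\n\nCopy the key.\n"
for chunk in chunk_with_breadcrumbs(sample):
    print(chunk['section'])
```

The overhead grows with heading depth, so very deep hierarchies may call for keeping only the last two or three levels of the breadcrumb.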

## Strategy 2: Summary-Based Context

Generate a concise summary of the entire document and prepend it to each chunk.

In [None]:
def generate_document_summary(document: str) -> str:
    """
    Use Claude to generate a concise summary of the document.
    """
    response = client.messages.create(
        model=model,
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Provide a 1-2 sentence summary of this document:

{document}

Keep it brief and focused on the main purpose."""
        }]
    )
    
    return response.content[0].text

doc_summary = generate_document_summary(VOYAGE_API_DOCUMENT)
print(f"Document Summary:\n{doc_summary}\n")

In [None]:
def add_summary_context(chunks: List[str], summary: str) -> List[Dict]:
    """
    Add document summary to each chunk as context.
    """
    return [
        {
            'original_content': chunk,
            'contextualized_content': f"Document Summary: {summary}\n\nChunk Content: {chunk}"
        }
        for chunk in chunks
    ]

# Simple chunking by paragraphs
simple_chunks = [chunk.strip() for chunk in VOYAGE_API_DOCUMENT.split('\n\n') if chunk.strip()]
summary_based_chunks = add_summary_context(simple_chunks[:3], doc_summary)

print(f"Created {len(summary_based_chunks)} summary-based chunks\n")
print("Example chunk with summary context:")
print(summary_based_chunks[0]['contextualized_content'])
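
For a document that does not fit in a single request, the summary itself has to be built in pieces. One possible approach, sketched here under assumptions, is to summarize fixed-size windows and then summarize the partial summaries. `split_into_windows` and `summarize_large_document` are hypothetical helpers, the 8000-character default is an arbitrary placeholder, and the `summarize` argument stands in for any function that maps text to a summary, such as `generate_document_summary` above.

```python
from typing import Callable, List

def split_into_windows(text: str, max_chars: int) -> List[str]:
    """Greedily pack paragraphs into windows of roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole in its own window.
    """
    windows: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            windows.append(current)
            current = paragraph
    if current:
        windows.append(current)
    return windows

def summarize_large_document(
    text: str,
    summarize: Callable[[str], str],
    max_chars: int = 8000,   # placeholder budget, not a model limit
) -> str:
    """Summarize each window, then summarize the joined partial summaries."""
    windows = split_into_windows(text, max_chars)
    if len(windows) <= 1:
        return summarize(text)
    partial_summaries = [summarize(window) for window in windows]
    return summarize("\n".join(partial_summaries))
```

This is a single level of reduction. If the joined partial summaries are still too long for one request, the same function would need to be applied again to them.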

## Strategy 3: Metadata Extraction

Extract structured metadata with Claude and prepend a compact metadata header to each chunk instead of the full document text.

In [None]:
def extract_document_metadata(document: str) -> Dict:
    """
    Extract structured metadata from the document.
    """
    response = client.messages.create(
        model=model,
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Extract metadata from this document in JSON format:

{document}

Provide:
- document_title: The main title/topic
- document_type: Type of document (guide, tutorial, etc.)
- key_topics: List of main topics covered
- target_audience: Who this is for

Return ONLY valid JSON."""
        }]
    )
    
    # Extract the JSON object from the response; it may be wrapped in prose or a code fence
    text = response.content[0].text
    # Greedy match from the first '{' to the last '}' so nested objects are captured whole
    json_match = re.search(r'\{.*\}', text, re.DOTALL)
    if json_match:
        return json.loads(json_match.group())
    return json.loads(text)

metadata = extract_document_metadata(VOYAGE_API_DOCUMENT)
print("Extracted Metadata:")
print(json.dumps(metadata, indent=2))

In [None]:
def add_metadata_context(chunks: List[str], metadata: Dict) -> List[Dict]:
    """
    Add compact metadata to each chunk.
    """
    metadata_header = f"""Document: {metadata.get('document_title', 'Unknown')}
Type: {metadata.get('document_type', 'Unknown')}
Topics: {', '.join(metadata.get('key_topics', []))}
"""
    
    return [
        {
            'original_content': chunk,
            'metadata': metadata,
            'contextualized_content': f"{metadata_header}\nContent: {chunk}"
        }
        for chunk in chunks
    ]

metadata_chunks = add_metadata_context(simple_chunks[:3], metadata)

print("Example chunk with metadata context:")
print(metadata_chunks[1]['contextualized_content'])
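
Because each chunk also carries the `metadata` dict, retrieval can narrow the candidate set before any scoring happens. A minimal sketch of that filtering step follows; `filter_by_metadata` is a hypothetical helper, and the metadata values in the example are made up for illustration rather than taken from a real extraction.

```python
from typing import Any, Dict, List

def filter_by_metadata(chunks: List[Dict], **facets: Any) -> List[Dict]:
    """Keep chunks whose metadata matches every requested facet.

    A list-valued field (such as key_topics) matches when it contains the
    requested value; any other field must compare equal to it.
    """
    def matches(metadata: Dict, key: str, wanted: Any) -> bool:
        value = metadata.get(key)
        if isinstance(value, list):
            return wanted in value
        return value == wanted

    return [
        chunk for chunk in chunks
        if all(matches(chunk.get('metadata', {}), key, wanted)
               for key, wanted in facets.items())
    ]

# Illustrative data only
example_chunks = [
    {'original_content': 'A', 'metadata': {'document_type': 'guide', 'key_topics': ['API keys', 'setup']}},
    {'original_content': 'B', 'metadata': {'document_type': 'reference', 'key_topics': ['API keys']}},
]
guides = filter_by_metadata(example_chunks, document_type='guide')
print([chunk['original_content'] for chunk in guides])
```

Exact matching is brittle when the extracted values vary in wording between documents, so normalizing them (for example, lowercasing) before indexing is worth considering.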

## Strategy 4: Sliding Window Context

Add context from surrounding chunks instead of the full document.

In [None]:
def add_sliding_window_context(chunks: List[str], window_size: int = 1) -> List[Dict]:
    """
    Add context from previous and next chunks.
    
    Args:
        chunks: List of chunk strings
        window_size: Number of chunks before/after to include
    """
    enriched_chunks = []
    
    for i, chunk in enumerate(chunks):
        context_parts = []
        
        # Add previous chunks (abbreviated)
        for j in range(max(0, i - window_size), i):
            preview = chunks[j][:100].replace('\n', ' ')
            context_parts.append(f"[Previous chunk]: {preview}...")
        
        # Add next chunks (abbreviated)
        for j in range(i + 1, min(len(chunks), i + window_size + 1)):
            preview = chunks[j][:100].replace('\n', ' ')
            context_parts.append(f"[Next chunk]: {preview}...")
        
        context_header = "\n".join(context_parts)
        
        enriched_chunk = {
            'chunk_index': i,
            'original_content': chunk,
            'contextualized_content': f"{context_header}\n\n[Current chunk]:\n{chunk}" if context_parts else chunk
        }
        enriched_chunks.append(enriched_chunk)
    
    return enriched_chunks

sliding_window_chunks = add_sliding_window_context(simple_chunks, window_size=1)

print(f"Created {len(sliding_window_chunks)} chunks with sliding window context\n")
print("Example chunk (middle of document):")
middle = len(sliding_window_chunks) // 2  # a chunk with both a previous and a next neighbor
print(sliding_window_chunks[middle]['contextualized_content'])

## Strategy 5: Hybrid Approach

Combine hierarchical sections, extracted metadata, and the document summary into one compact header per chunk.

In [None]:
def create_hybrid_contextual_chunks(document: str) -> List[Dict]:
    """
    Combine hierarchical chunking + metadata + summary for best results.
    """
    # 1. Extract metadata (compact, structured info)
    metadata = extract_document_metadata(document)
    
    # 2. Generate brief summary
    summary = generate_document_summary(document)
    
    # 3. Create hierarchical chunks
    hierarchical_chunks = extract_hierarchical_chunks(document)
    
    # 4. Enrich each chunk with combined context
    hybrid_chunks = []
    for chunk_data in hierarchical_chunks:
        # Create compact header
        header = f"""[Document: {metadata.get('document_title', 'Unknown')}]
[Summary: {summary}]
[Section: {chunk_data['section']}]
"""
        
        hybrid_chunk = {
            'metadata': metadata,
            'summary': summary,
            'section': chunk_data['section'],
            'original_content': chunk_data['original_content'],
            'contextualized_content': f"{header}\n{chunk_data['original_content']}"
        }
        hybrid_chunks.append(hybrid_chunk)
    
    return hybrid_chunks

hybrid_chunks = create_hybrid_contextual_chunks(VOYAGE_API_DOCUMENT)

print(f"Created {len(hybrid_chunks)} hybrid contextual chunks\n")
print("Example hybrid chunk:")
print(hybrid_chunks[2]['contextualized_content'])
print("\n" + "="*80)
print("\nOriginal content (for comparison):")
print(hybrid_chunks[2]['original_content'])

## Comparison: Context Size Analysis

In [None]:
import statistics

def analyze_chunk_sizes(chunks: List[Dict], label: str):
    """
    Analyze the overhead added by contextualization.
    """
    original_sizes = [len(c['original_content']) for c in chunks]
    contextualized_sizes = [len(c['contextualized_content']) for c in chunks]
    overhead = [ctx - orig for orig, ctx in zip(original_sizes, contextualized_sizes)]
    
    print(f"\n{label}:")
    print(f"  Average original size: {statistics.mean(original_sizes):.0f} chars")
    print(f"  Average contextualized size: {statistics.mean(contextualized_sizes):.0f} chars")
    print(f"  Average overhead: {statistics.mean(overhead):.0f} chars ({statistics.mean(overhead)/statistics.mean(original_sizes)*100:.1f}%)")
    print(f"  Total chunks: {len(chunks)}")

# Compare all strategies
analyze_chunk_sizes(hierarchical_chunks, "Hierarchical Chunking")
analyze_chunk_sizes(summary_based_chunks, "Summary-Based Context")
analyze_chunk_sizes(metadata_chunks, "Metadata Context")
analyze_chunk_sizes(sliding_window_chunks, "Sliding Window")
analyze_chunk_sizes(hybrid_chunks, "Hybrid Approach")
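
Character counts can be turned into a rough token estimate for budgeting. The sketch below assumes about 4 characters per token, which is only a rule of thumb for English text and not a property of any particular tokenizer; `estimate_context_overhead` is a hypothetical helper. For real budgets, count tokens with the tokenizer of the model you embed or generate with.

```python
from typing import Dict, List

def estimate_context_overhead(chunks: List[Dict], chars_per_token: float = 4.0) -> Dict[str, float]:
    """Total characters added by contextualization, with a rough token estimate."""
    original = sum(len(c['original_content']) for c in chunks)
    contextualized = sum(len(c['contextualized_content']) for c in chunks)
    added = contextualized - original
    return {
        'added_chars': added,
        # Assumption: ~4 characters per token
        'added_tokens_estimate': added / chars_per_token,
        # How many times larger the index becomes after adding context
        'expansion_ratio': contextualized / original if original else float('inf'),
    }

example = [
    {'original_content': 'x' * 20, 'contextualized_content': 'x' * 30},
    {'original_content': 'x' * 10, 'contextualized_content': 'x' * 30},
]
print(estimate_context_overhead(example))
```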

## Testing with RAG Query

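Let's simulate a simple retrieval scenario. The keyword search scores only the original chunk text, so the comparison shows what the retrieved chunk looks like with and without its context header.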

In [None]:
def simple_keyword_search(query: str, chunks: List[Dict], top_k: int = 2) -> List[Dict]:
    """
    Simple keyword-based search (without embeddings).
    """
    # Tokenize on word characters so punctuation ("key?") does not block matches
    query_words = set(re.findall(r'\w+', query.lower()))
    
    # Score chunks by the number of distinct query words they contain
    scored_chunks = []
    for chunk in chunks:
        content_words = set(re.findall(r'\w+', chunk['original_content'].lower()))
        score = len(query_words & content_words)
        scored_chunks.append((score, chunk))
    
    # Sort by score and return top k
    scored_chunks.sort(reverse=True, key=lambda x: x[0])
    return [chunk for score, chunk in scored_chunks[:top_k]]

# Test query
test_query = "How do I create a new API key?"

print(f"Query: {test_query}\n")
print("=" * 80)

# Show the same retrieved chunk without and with its hybrid context header
print("\n1. WITHOUT CONTEXT (original chunks):")
results = simple_keyword_search(test_query, hybrid_chunks, top_k=1)
print(results[0]['original_content'])

print("\n" + "=" * 80)
print("\n2. WITH HYBRID CONTEXT:")
print(results[0]['contextualized_content'])
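
Context can also change which chunk is retrieved, not only what the model sees afterwards. The sketch below scores either field of a chunk, so the two rankings can be compared directly. `keyword_search_on` is a hypothetical helper, and the two demo chunks assume the guide was split at list-item granularity, which is finer than the chunking used above.

```python
import re
from typing import Dict, List

def keyword_search_on(field: str, query: str, chunks: List[Dict], top_k: int = 1) -> List[Dict]:
    """Rank chunks by how many distinct query words appear in chunk[field]."""
    query_words = set(re.findall(r'\w+', query.lower()))
    scored = []
    for chunk in chunks:
        chunk_words = set(re.findall(r'\w+', chunk[field].lower()))
        scored.append((len(query_words & chunk_words), chunk))
    # The sort is stable, so chunks with equal scores keep their input order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

demo_chunks = [
    {
        'original_content': "4. Copy the key",
        'contextualized_content': "Section: Step 2: Create an API key\n\nContent: 4. Copy the key",
    },
    {
        'original_content': "1. Navigate to https://www.voyageai.com/",
        'contextualized_content': "Section: Step 1: Sign up to VoyageAI\n\nContent: 1. Navigate to https://www.voyageai.com/",
    },
]

query = "Where do I sign up?"
without_context = keyword_search_on('original_content', query, demo_chunks)[0]
with_context = keyword_search_on('contextualized_content', query, demo_chunks)[0]
print("Original text only:", without_context['original_content'])
print("With section header:", with_context['original_content'])
```

Neither original chunk shares a word with the query, so the first search falls back to input order. The section header supplies "sign" and "up", which lets the second search pick the Step 1 chunk.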

## Using Contextualized Chunks with Claude

In [None]:
def answer_with_rag(query: str, chunks: List[Dict], use_context: bool = True):
    """
    Answer a query using retrieved chunks.
    """
    # Retrieve relevant chunks
    retrieved = simple_keyword_search(query, chunks, top_k=2)
    
    # Build context from chunks
    if use_context:
        context = "\n\n".join([c['contextualized_content'] for c in retrieved])
    else:
        context = "\n\n".join([c['original_content'] for c in retrieved])
    
    # Query Claude
    response = client.messages.create(
        model=model,
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Based on the following context, answer the question.

Context:
{context}

Question: {query}

Answer:"""
        }]
    )
    
    return response.content[0].text

# Test both approaches
print("Query:", test_query)
print("\n" + "=" * 80)
print("\nAnswer WITHOUT contextual information:")
print(answer_with_rag(test_query, hybrid_chunks, use_context=False))

print("\n" + "=" * 80)
print("\nAnswer WITH contextual information:")
print(answer_with_rag(test_query, hybrid_chunks, use_context=True))

## Key Takeaways

When dealing with large documents (the overhead figures are rough estimates; measure your own corpus with `analyze_chunk_sizes`):

1. **Hierarchical Chunking** - Best for structured documents (manuals, guides)
   - Adds ~50-100 chars per chunk
   - Preserves document structure

2. **Summary-Based Context** - Best for consistent global context
   - Adds ~100-200 chars per chunk
   - Works well for thematic documents

3. **Metadata Context** - Best for filtering and categorization
   - Minimal overhead (~50-100 chars)
   - Enables faceted search

4. **Sliding Window** - Best for sequential/narrative content
   - Variable overhead (depends on window size)
   - Maintains flow

5. **Hybrid Approach** - Most complete context per chunk
   - Combines benefits of multiple strategies
   - ~200-300 chars overhead, the largest of the five, in exchange for title, summary, and section in every chunk

**Recommendation**: Use the hybrid approach for most production systems, adjusting based on:
- Document size and structure
- Query patterns
- Token budget constraints
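
One way to respect a budget is to assemble the hybrid header from parts in priority order and skip any part that would not fit. This is a minimal sketch under assumptions: `build_header_within_budget` is a hypothetical helper, the budget is counted in characters rather than tokens, and the priority order (section, then title, then summary) is an illustrative choice.

```python
from typing import List, Tuple

def build_header_within_budget(parts: List[Tuple[str, str]], max_chars: int) -> str:
    """Join "[Label: value]" lines in priority order, skipping any that would exceed max_chars."""
    lines: List[str] = []
    used = 0
    for label, value in parts:
        line = f"[{label}: {value}]"
        cost = len(line) + (1 if lines else 0)   # +1 for the joining newline
        if used + cost <= max_chars:
            lines.append(line)
            used += cost
    return "\n".join(lines)

header = build_header_within_budget(
    [
        ("Section", "Step 2: Create an API key"),
        ("Document", "Getting an API Key from VoyageAI"),
        ("Summary", "x" * 200),   # stand-in for a long generated summary
    ],
    max_chars=100,
)
print(header)
```

With a 100-character budget the section and document lines fit and the long summary is dropped. Raising the budget brings the summary back without changing the calling code.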