# Semantic Chunking with LlamaIndex + HuggingFace

This notebook demonstrates semantic chunking with HuggingFace embeddings as an alternative to fixed-size chunking in a RAG pipeline. Semantic chunking compares the embeddings of adjacent sentences and places chunk boundaries where similarity drops, so that each chunk holds semantically related content.
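
As a rough illustration of the idea, the sketch below starts a new chunk wherever the cosine distance between neighbouring sentence vectors exceeds a cutoff. It is not the LlamaIndex implementation: the bag-of-words `embed` function is a toy stand-in for a real HuggingFace model, and `split_on_distance` and the `cutoff` value are hypothetical names and numbers chosen for this example.

```python
import math
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return 1.0 - dot / norm if norm else 1.0


def split_on_distance(sentences: list[str], cutoff: float) -> list[list[str]]:
    """Start a new chunk whenever adjacent sentences are further apart than cutoff."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine_distance(embed(prev), embed(curr)) > cutoff:
            chunks.append([curr])      # large distance: topic shift, open a new chunk
        else:
            chunks[-1].append(curr)    # small distance: keep extending the current chunk
    return chunks


sentences = [
    "The cat sat on the mat.",
    "The cat slept on the mat.",
    "Interest rates rose sharply.",
    "Rates rose again in March.",
]
chunks = split_on_distance(sentences, cutoff=0.8)
for chunk in chunks:
    print(chunk)
```

With a real embedding model the distances are far less clear-cut than in this toy case, which is why the cutoff is usually derived from the distribution of distances rather than fixed by hand.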

## Benefits of Semantic Chunking

- **Semantically Coherent**: Chunks contain related sentences that belong together
- **Natural Boundaries**: Splits occur at meaningful breakpoints, not arbitrary character counts
- **Better Retrieval**: Topically focused chunks tend to match queries more precisely, which can improve RAG results
- **Context Preservation**: Maintains semantic context within each chunk
- **No Embedding API Key Needed**: Embeddings come from local HuggingFace models (same as your existing pipeline); only LlamaParse document parsing requires `LLAMA_CLOUD_API_KEY`

## Setup and Dependencies

First, install the required dependencies:

In [1]:
# Install additional dependencies for semantic chunking
!pip install llama-index-core llama-index-embeddings-huggingface



In [None]:
import os
import sys
from dotenv import load_dotenv

# Add project root to path
project_root = os.path.abspath('..')
if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Load environment variables
load_dotenv()

# Verify required API key
llama_api_key = os.getenv('LLAMA_CLOUD_API_KEY')

print(f"LlamaParse API Key: {'✓ Found' if llama_api_key else '✗ Missing'}")
print("✓ HuggingFace embeddings will be used locally (no API key required)")

if not llama_api_key:
    print("\n⚠️ Please set LLAMA_CLOUD_API_KEY in your .env file")

LlamaParse API Key: ✗ Missing
✓ HuggingFace embeddings will be used locally (no API key required)

⚠️ Please set LLAMA_CLOUD_API_KEY in your .env file


## Import Required Modules

In [None]:
from config.settings import PipelineConfig, ChunkingConfig
from processors.llamaparse_processor import LlamaParseProcessor
from processors.document_converter import DocumentConverter
from utils.logger import get_logger

# Setup logging
logger = get_logger(__name__)

## Configure Semantic Chunking

Create configuration that enables semantic chunking with HuggingFace embeddings:

In [None]:
# Configure semantic chunking with HuggingFace embeddings
semantic_chunking_config = ChunkingConfig(
    # Enable semantic chunking
    use_semantic_chunking=True,
    
    # Semantic chunking parameters
    buffer_size=1,  # Number of sentences to consider together
    breakpoint_percentile_threshold=95,  # Higher = fewer, larger chunks
    embed_model_name="sentence-transformers/all-MiniLM-L6-v2",  # Same as your pipeline!
    
    # Traditional chunking fallback parameters
    chunk_size=1500,
    chunk_overlap=300
)

# Create pipeline configuration
config = PipelineConfig(
    docs_folder="../docs",
    output_folder="../output",
    vector_store_name="semantic_chunks_vector_store",
    chunking=semantic_chunking_config
)

print("Configuration created:")
print(f"  Semantic chunking: {config.chunking.use_semantic_chunking}")
print(f"  Buffer size: {config.chunking.buffer_size}")
print(f"  Breakpoint threshold: {config.chunking.breakpoint_percentile_threshold}%")
print(f"  Embedding model: {config.chunking.embed_model_name}")
print(f"  Same model as your existing pipeline: ✓")
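
The `config.settings` module is not shown in this notebook, so the following is only a guess at what a `ChunkingConfig` with these fields might look like. The class name `ChunkingConfigSketch` and the validation rules are assumptions made for illustration; the defaults mirror the values used in the cell above.

```python
from dataclasses import dataclass


@dataclass
class ChunkingConfigSketch:
    """Hypothetical stand-in for config.settings.ChunkingConfig."""

    use_semantic_chunking: bool = False
    buffer_size: int = 1
    breakpoint_percentile_threshold: int = 95
    embed_model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
    chunk_size: int = 1500
    chunk_overlap: int = 300

    def __post_init__(self) -> None:
        # Assumed sanity checks: reject values that make either strategy ill-defined.
        if self.buffer_size < 0:
            raise ValueError("buffer_size must not be negative")
        if not 0 < self.breakpoint_percentile_threshold < 100:
            raise ValueError("breakpoint_percentile_threshold must be between 0 and 100")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be smaller than chunk_size")


sketch = ChunkingConfigSketch(use_semantic_chunking=True)
print(sketch)
```

Validating at construction time means a bad value fails when the configuration is created rather than partway through a long ingestion run.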

## Process Documents with Semantic Chunking

Let's process some documents and see the difference between traditional and semantic chunking:

In [None]:
# Initialize processors
llamaparse_processor = LlamaParseProcessor(config.llamaparse)
document_converter = DocumentConverter(config.chunking)

print("Processors initialized successfully!")
print("Using HuggingFace embeddings locally (no API calls needed)")

In [None]:
# Process documents with LlamaParse
docs_folder = config.docs_folder
if os.path.exists(docs_folder) and os.listdir(docs_folder):
    print(f"Processing documents from {docs_folder}...")
    
    # Parse documents
    parsed_results = llamaparse_processor.process_folder(docs_folder)
    
    # Convert to LangChain documents
    documents = document_converter.convert_to_documents(parsed_results)
    
    print(f"\nConverted {len(documents)} documents")
else:
    print(f"No documents found in {docs_folder}. Please add some PDF, DOCX, or other supported files.")
    documents = []

## Compare Traditional vs Semantic Chunking

Let's create chunks using both methods and compare the results:

In [None]:
if documents:
    # First, try semantic chunking
    print("=== SEMANTIC CHUNKING (with HuggingFace embeddings) ===")
    semantic_chunks = document_converter.semantic_chunk_documents(documents)
    
    print(f"\nSemantic chunks created: {len(semantic_chunks)}")
    
    # Show first few semantic chunks
    print("\nFirst 3 semantic chunks:")
    for i, chunk in enumerate(semantic_chunks[:3]):
        print(f"\n--- Semantic Chunk {i+1} ({len(chunk.page_content)} chars) ---")
        print(chunk.page_content[:300] + "..." if len(chunk.page_content) > 300 else chunk.page_content)
        print(f"Metadata: {chunk.metadata.get('chunking_method', 'unknown')} chunking")
        print(f"Embedding provider: {chunk.metadata.get('embedding_provider', 'unknown')}")
else:
    print("No documents to process. Please add documents to the docs folder.")

In [None]:
if documents:
    # Now create traditional chunks for comparison
    print("\n=== TRADITIONAL CHUNKING ===")
    
    # Temporarily disable semantic chunking
    original_semantic_setting = document_converter.config.use_semantic_chunking
    document_converter.config.use_semantic_chunking = False
    
    traditional_chunks = document_converter.chunk_documents(documents)
    
    # Restore semantic chunking setting
    document_converter.config.use_semantic_chunking = original_semantic_setting
    
    print(f"\nTraditional chunks created: {len(traditional_chunks)}")
    
    # Show first few traditional chunks
    print("\nFirst 3 traditional chunks:")
    for i, chunk in enumerate(traditional_chunks[:3]):
        print(f"\n--- Traditional Chunk {i+1} ({len(chunk.page_content)} chars) ---")
        print(chunk.page_content[:300] + "..." if len(chunk.page_content) > 300 else chunk.page_content)
        print(f"Metadata: {chunk.metadata.get('chunking_method', 'unknown')} chunking")
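
For reference, here is a simplified picture of what "traditional" chunking does with `chunk_size` and `chunk_overlap`. The project's `chunk_documents` method is not shown here and may well use a smarter, separator-aware splitter; `fixed_size_chunks` is a hypothetical helper that only slices by character count.

```python
def fixed_size_chunks(text: str, chunk_size: int = 1500, chunk_overlap: int = 300) -> list[str]:
    """Slice text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


text = "x" * 4000
pieces = fixed_size_chunks(text)
print([len(piece) for piece in pieces])
```

Because the boundaries depend only on character offsets, a split can land in the middle of a sentence or topic, which is the behaviour semantic chunking tries to avoid.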

## Analyze Chunking Results

Compare the characteristics of both chunking methods:

In [None]:
if documents and 'semantic_chunks' in locals() and 'traditional_chunks' in locals():
    import numpy as np
    
    # Calculate statistics
    semantic_sizes = [len(chunk.page_content) for chunk in semantic_chunks]
    traditional_sizes = [len(chunk.page_content) for chunk in traditional_chunks]
    
    print("=== CHUNKING COMPARISON ===")
    print(f"\nOriginal Documents: {len(documents)}")
    print(f"Semantic Chunks (HuggingFace): {len(semantic_chunks)}")
    print(f"Traditional Chunks: {len(traditional_chunks)}")
    
    print(f"\nSemantic Chunk Sizes:")
    print(f"  Mean: {np.mean(semantic_sizes):.0f} chars")
    print(f"  Std Dev: {np.std(semantic_sizes):.0f} chars")
    print(f"  Min: {np.min(semantic_sizes)} chars")
    print(f"  Max: {np.max(semantic_sizes)} chars")
    
    print(f"\nTraditional Chunk Sizes:")
    print(f"  Mean: {np.mean(traditional_sizes):.0f} chars")
    print(f"  Std Dev: {np.std(traditional_sizes):.0f} chars")
    print(f"  Min: {np.min(traditional_sizes)} chars")
    print(f"  Max: {np.max(traditional_sizes)} chars")
    
    # Show variability
    semantic_cv = np.std(semantic_sizes) / np.mean(semantic_sizes)
    traditional_cv = np.std(traditional_sizes) / np.mean(traditional_sizes)
    
    print(f"\nSize Variability (Coefficient of Variation):")
    print(f"  Semantic: {semantic_cv:.3f} (lower = more consistent)")
    print(f"  Traditional: {traditional_cv:.3f}")
    
    more_variable = semantic_cv > traditional_cv
    print(f"\nSemantic chunking produces {'more' if more_variable else 'less'} variable chunk sizes")
    if more_variable:
        print("This is expected as semantic chunks follow natural content boundaries.")
    print("\n✅ Advantage: Same embedding model used for both chunking AND vector storage!")
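
The coefficient of variation used above is the standard deviation divided by the mean. The short example below reproduces it with the standard library only; `statistics.pstdev` is the population standard deviation, which matches the default behaviour of `np.std`. The chunk sizes are made-up numbers, and `coefficient_of_variation` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def coefficient_of_variation(sizes: list[int]) -> float:
    """Population standard deviation divided by the mean (0.0 for empty or all-zero input)."""
    if not sizes or mean(sizes) == 0:
        return 0.0
    return pstdev(sizes) / mean(sizes)


uniform_sizes = [1500, 1500, 1500, 1500]  # made-up fixed-size chunks
varied_sizes = [500, 1000, 2000, 2500]    # made-up semantic chunks

print(f"uniform: {coefficient_of_variation(uniform_sizes):.3f}")
print(f"varied:  {coefficient_of_variation(varied_sizes):.3f}")
```

The guard for empty or all-zero input avoids a division by zero, a case the notebook cell above does not handle.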

## Configuration Options

Understanding the semantic chunking parameters with HuggingFace embeddings:

### Key Parameters:

1. **`buffer_size`** (default: 1)
   - Number of sentences to group together for similarity analysis
   - Higher values = smoother transitions, larger chunks
   - Lower values = more precise breakpoints, smaller chunks

2. **`breakpoint_percentile_threshold`** (default: 95)
   - Percentile of the distances between adjacent sentence groups above which a breakpoint is placed
   - Higher values = fewer breakpoints, larger chunks
   - Lower values = more breakpoints, smaller chunks
   - Practical range: 50-99 (95 is a reasonable starting point)

3. **`embed_model_name`** (default: "sentence-transformers/all-MiniLM-L6-v2")
   - **Same model as your existing pipeline!**
   - Uses local HuggingFace model (no API calls)
   - Options: Any sentence-transformers model
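
One way to picture `buffer_size`: before embedding, each sentence is joined with its neighbours on either side, so the similarity comparison is made between overlapping windows rather than isolated sentences. The sketch below shows that windowing step under this assumption; `build_windows` is a hypothetical helper, not part of the LlamaIndex API.

```python
def build_windows(sentences: list[str], buffer_size: int = 1) -> list[str]:
    """Join each sentence with up to buffer_size neighbours on each side."""
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


sentences = ["A.", "B.", "C.", "D."]
for size in (0, 1, 2):
    print(size, build_windows(sentences, buffer_size=size))
```

Larger windows share more text with their neighbours, so their embeddings change more gradually from one position to the next. That is consistent with the "smoother transitions" noted in the list above.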

### When to Use Semantic Chunking:

✅ **Good for:**
- Long-form documents (reports, articles, books)
- Documents with varying content structure
- When retrieval quality is more important than processing speed
- Documents with clear topical shifts

✅ **Advantages of HuggingFace Version:**
- **No API keys needed** (local processing)
- **Same model** as your existing pipeline
- **Consistent embeddings** between chunking and retrieval
- **Cost-effective** (no API calls)
- **Privacy-friendly** (no data sent to external APIs)

❌ **Consider traditional chunking for:**
- Highly structured documents (tables, forms)
- Very short documents
- When processing speed is critical

## Testing Different Parameters

Experiment with different semantic chunking parameters:

In [None]:
import numpy as np  # imported here too, since the earlier import sits inside a conditional block

if documents:
    # Test different threshold values
    thresholds = [85, 90, 95, 98]
    
    print("=== PARAMETER TESTING ===")
    print("Testing different breakpoint percentile thresholds...\n")
    
    results = []
    
    for threshold in thresholds:
        # Create new config with different threshold
        test_config = ChunkingConfig(
            use_semantic_chunking=True,
            buffer_size=1,
            breakpoint_percentile_threshold=threshold,
            embed_model_name="sentence-transformers/all-MiniLM-L6-v2"
        )
        
        # Create converter with new config
        test_converter = DocumentConverter(test_config)
        
        # Chunk the already-converted documents (same call as in the comparison above)
        chunks = test_converter.semantic_chunk_documents(documents)
        
        # Calculate stats
        chunk_sizes = [len(chunk.page_content) for chunk in chunks]
        avg_size = np.mean(chunk_sizes) if chunk_sizes else 0
        
        results.append({
            'threshold': threshold,
            'num_chunks': len(chunks),
            'avg_size': avg_size
        })
        
        print(f"Threshold {threshold}%: {len(chunks)} chunks, avg size: {avg_size:.0f} chars")
    
    print("\nObservations:")
    print("- Higher thresholds create fewer, larger chunks")
    print("- Lower thresholds create more, smaller chunks")
    print("- 95% is a good balance for most documents")
    print("- All processing done locally with HuggingFace models!")
else:
    print("Skipping parameter testing - need documents")
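
The relationship between threshold and chunk count can also be seen without any embedding model. The sketch below assumes a breakpoint is placed wherever the distance between adjacent sentences exceeds the chosen percentile of all distances. The distances are made-up numbers, and `percentile` and `count_chunks` are hypothetical helpers that expect a non-empty list.

```python
def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile (the same convention as NumPy's default)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def count_chunks(distances: list[float], threshold_pct: float) -> int:
    """Chunks = 1 + number of adjacent-sentence distances above the percentile cutoff."""
    cutoff = percentile(distances, threshold_pct)
    return 1 + sum(1 for d in distances if d > cutoff)


# Made-up distances between 11 consecutive sentences (10 gaps).
distances = [0.10, 0.15, 0.80, 0.12, 0.20, 0.65, 0.18, 0.11, 0.90, 0.14]
for pct in (50, 70, 90, 95):
    print(f"threshold {pct}% -> {count_chunks(distances, pct)} chunks")
```

Because the cutoff is a percentile of the document's own distances, the same threshold adapts to documents with very different overall similarity levels.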

## Next Steps

Now you can use semantic chunking in your ingestion pipeline:

1. **Update your configuration** to enable semantic chunking
2. **No embedding API key needed** - chunking uses the same HuggingFace model locally (LlamaParse still requires `LLAMA_CLOUD_API_KEY`)
3. **Experiment with parameters** to find the best settings for your documents
4. **Compare retrieval quality** between traditional and semantic chunks

## Key Advantages of This Implementation:

✅ **Seamless Integration**: Uses the same `sentence-transformers/all-MiniLM-L6-v2` model
✅ **No Additional Costs**: No API calls to external services
✅ **Privacy Friendly**: All processing done locally
✅ **Consistent Embeddings**: Same model for chunking and vector storage
✅ **Robust Fallback**: Automatically falls back to traditional chunking if needed

The semantic chunking will automatically fall back to traditional chunking if:
- LlamaIndex packages are not installed
- There are any errors during semantic processing

This ensures your pipeline remains robust while taking advantage of improved chunking when possible!
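
The internals of `DocumentConverter` are not shown in this notebook, so the following is only a guess at the shape of such a fallback: try the semantic path first, and on a missing dependency or any other error, log a warning and use the traditional splitter. All names in the sketch (`chunk_with_fallback`, `missing_dependency`, `split_in_half`) are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("chunking_sketch")


def chunk_with_fallback(
    text: str,
    semantic_splitter: Callable[[str], list[str]],
    fallback_splitter: Callable[[str], list[str]],
) -> tuple[list[str], str]:
    """Try the semantic splitter first; on any failure use the fallback.

    Returns the chunks plus the name of the method that produced them.
    """
    try:
        return semantic_splitter(text), "semantic"
    except ImportError as exc:
        logger.warning("Semantic chunking unavailable (%s); falling back", exc)
    except Exception as exc:  # broad on purpose: any semantic failure triggers the fallback
        logger.warning("Semantic chunking failed (%s); falling back", exc)
    return fallback_splitter(text), "traditional"


def missing_dependency(text: str) -> list[str]:
    # Simulates an environment where the LlamaIndex packages are not installed.
    raise ImportError("llama_index is not installed")


def split_in_half(text: str) -> list[str]:
    # Trivial placeholder for a fixed-size splitter.
    middle = len(text) // 2
    return [text[:middle], text[middle:]]


chunks, method = chunk_with_fallback("abcdef", missing_dependency, split_in_half)
print(method, chunks)
```

Returning the method name alongside the chunks makes it easy to record which path was taken, similar to the `chunking_method` metadata printed earlier in this notebook.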