# Overlapping Chunks with Google Gemini
## Building Context-Aware Document Processing Systems

This notebook explores overlapping chunking techniques that preserve context across chunk boundaries, ensuring critical information isn't lost when processing large documents. We'll build an intelligent document analysis system using various overlap strategies with Google Gemini.

### What You'll Learn:
- Understanding overlapping chunk principles and benefits
- Implementing multiple overlap strategies (fixed, percentage, sentence-based)
- Optimizing overlap size for different use cases
- Building context-aware Q&A systems with Gemini
- Analyzing information retention and redundancy
- Performance optimization and trade-off analysis

### Project Overview:
We'll create an advanced system that:
1. Implements various overlapping strategies for different document types
2. Analyzes optimal overlap sizes and patterns
3. Builds intelligent retrieval with overlap-aware ranking
4. Demonstrates context preservation benefits
5. Provides comprehensive performance analysis and optimization

## 1. Setup and Dependencies

In [None]:
# Install required packages
!pip install google-generativeai sentence-transformers spacy nltk scikit-learn numpy pandas matplotlib seaborn tiktoken networkx

In [None]:
# Download additional dependencies
!python -m spacy download en_core_web_sm
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # needed by sent_tokenize on recent NLTK releases
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

In [None]:
import google.generativeai as genai
from sentence_transformers import SentenceTransformer
import spacy
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import tiktoken
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import re
import os
import time
from typing import List, Dict, Tuple, Optional, Union
from collections import defaultdict
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("Set2")

In [None]:
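# Configure Gemini API and models
# Prefer reading the key from the environment over hardcoding it in the notebook
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "your-gemini-api-key-here")
genai.configure(api_key=GEMINI_API_KEY)

# Initialize models
# Model names change over time; check the current Gemini model list if 'gemini-pro' is unavailable
gemini_model = genai.GenerativeModel('gemini-pro')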
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
nlp = spacy.load('en_core_web_sm')
tokenizer = tiktoken.get_encoding("cl100k_base")

print("✅ All models initialized successfully!")
print(f"📊 Embedding dimensions: {embedding_model.get_sentence_embedding_dimension()}")
print(f"🧠 spaCy model: {nlp.meta['name']} v{nlp.meta['version']}")

## 2. Understanding Overlapping Chunks

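Overlapping chunks share content between adjacent chunks, so a sentence or idea that straddles a boundary appears intact in at least one chunk. The cost is redundancy: the shared tokens are stored, embedded, and sent to the model more than once.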
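
Before the sentence-based demonstration, the core idea can be sketched as a plain sliding window over tokens. This is a minimal illustration, not the chunker built later in this notebook; `sliding_window_chunks` and the toy word list are hypothetical names introduced only for this example.

```python
from typing import List

def sliding_window_chunks(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    stride = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks

words = "a b c d e f g h i j".split()
window_chunks = sliding_window_chunks(words, chunk_size=4, overlap=2)
for chunk in window_chunks:
    print(chunk)
```

Each window repeats the last two tokens of the previous one, so any adjacent pair of tokens appears together in at least one window.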

In [None]:
def count_tokens(text: str) -> int:
    """Count tokens in text using tiktoken."""
    return len(tokenizer.encode(text))

def demonstrate_overlap_concept():
    """Demonstrate the concept and benefits of overlapping chunks."""
    
    sample_text = """
    Artificial intelligence has revolutionized modern computing. Machine learning algorithms 
    can now process vast amounts of data to identify patterns. Deep learning networks, 
    inspired by the human brain, excel at complex tasks. Natural language processing 
    enables computers to understand human communication. Computer vision systems can 
    analyze images and videos with remarkable accuracy. These technologies are transforming 
    industries from healthcare to autonomous vehicles.
    """.strip()
    
    sentences = sent_tokenize(sample_text)
    
    print("🔍 Overlap Concept Demonstration\n")
    print("Original text:")
    print(sample_text)
    print(f"\nTotal tokens: {count_tokens(sample_text)}")
    print(f"Sentences: {len(sentences)}\n")
    
    # Show non-overlapping vs overlapping chunking
    print("📊 Comparison: Non-overlapping vs Overlapping\n")
    
    # Non-overlapping chunks (2 sentences each)
    print("🚫 Non-overlapping chunks:")
    for i in range(0, len(sentences), 2):
        chunk = ' '.join(sentences[i:i+2])
        print(f"Chunk {i//2 + 1}: {chunk}")
        print(f"  Tokens: {count_tokens(chunk)}\n")
    
    # Overlapping chunks (2 sentences each, 1 sentence overlap)
    print("✅ Overlapping chunks (50% overlap):")
    for i in range(len(sentences) - 1):
        chunk = ' '.join(sentences[i:i+2])
        print(f"Chunk {i + 1}: {chunk}")
        print(f"  Tokens: {count_tokens(chunk)}")
        if i > 0:
            overlap = sentences[i]
            print(f"  Overlap: '{overlap[:50]}...'")
        print()
    
    # Calculate overlap statistics
    total_unique_tokens = count_tokens(sample_text)
    overlapping_total_tokens = sum(count_tokens(' '.join(sentences[i:i+2])) 
                                 for i in range(len(sentences) - 1))
    
    redundancy = (overlapping_total_tokens - total_unique_tokens) / total_unique_tokens * 100
    
    print(f"📈 Overlap Statistics:")
    print(f"  Unique content tokens: {total_unique_tokens}")
    print(f"  Total tokens with overlap: {overlapping_total_tokens}")
    print(f"  Redundancy: {redundancy:.1f}%")
    print(f"  Context preservation: ✅ Information bridged across boundaries")

demonstrate_overlap_concept()
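
The redundancy figure printed above can also be estimated before chunking anything. The sketch below assumes an idealized sliding window with a fixed chunk size and overlap (both in tokens), which the sentence-based code above only approximates; `expected_redundancy` is a hypothetical helper name.

```python
import math

def expected_redundancy(total_tokens: int, chunk_size: int, overlap: int) -> float:
    """Redundancy (%) of fixed-size sliding-window chunking: extra tokens / unique tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_tokens <= chunk_size:
        return 0.0  # a single chunk holds everything, nothing is repeated

    stride = chunk_size - overlap
    n_chunks = math.ceil((total_tokens - chunk_size) / stride) + 1
    # Every chunk after the first repeats `overlap` tokens of its predecessor
    extra_tokens = (n_chunks - 1) * overlap
    return extra_tokens / total_tokens * 100

small_doc = expected_redundancy(total_tokens=10, chunk_size=4, overlap=2)
long_doc = expected_redundancy(total_tokens=1_000_000, chunk_size=300, overlap=50)
print(f"small document: {small_doc:.1f}%")
print(f"long document:  {long_doc:.1f}%")
```

For long documents the value approaches `overlap / (chunk_size - overlap)`, so a 50-token overlap on 300-token chunks costs roughly 20% extra tokens.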

## 3. Implementing Overlapping Chunker Class

In [None]:
class OverlappingChunker:
    def __init__(self, chunk_size: int = 512, overlap_strategy: str = 'fixed', 
                 overlap_size: Union[int, float] = 50, min_chunk_size: int = 100):
        """
        Advanced overlapping chunker with multiple strategies.
        
        Args:
            chunk_size: Target size of each chunk (in tokens)
            overlap_strategy: 'fixed', 'percentage', or 'sentence'; any other value
                falls back to 'percentage'
            overlap_size: Size of overlap (tokens for 'fixed', a 0-1 fraction for
                'percentage', number of sentences for 'sentence')
            min_chunk_size: Minimum acceptable chunk size
        """
        self.chunk_size = chunk_size
        self.overlap_strategy = overlap_strategy
        self.overlap_size = overlap_size
        self.min_chunk_size = min_chunk_size
        self.tokenizer = tiktoken.get_encoding("cl100k_base")
        self.nlp = spacy.load('en_core_web_sm')
        
    def _count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        return len(self.tokenizer.encode(text))
    
    def _extract_sentences(self, text: str) -> List[Dict]:
        """Extract sentences with metadata."""
        doc = self.nlp(text)
        sentences = []
        
        for sent in doc.sents:
            sent_text = sent.text.strip()
            if sent_text:
                sentences.append({
                    # Use the list position as the id so chunk sentence indices stay valid
                    'id': len(sentences),
                    'text': sent_text,
                    'start': sent.start_char,
                    'end': sent.end_char,
                    'tokens': self._count_tokens(sent_text),
                    'words': len(sent_text.split())
                })
        
        return sentences
    
    def chunk_text(self, text: str) -> List[Dict]:
        """Main chunking method with overlap strategy."""
        # Clean and prepare text
        text = re.sub(r'\s+', ' ', text.strip())
        sentences = self._extract_sentences(text)
        
        if not sentences:
            return []
        
        if self.overlap_strategy == 'fixed':
            return self._chunk_with_fixed_overlap(sentences)
        elif self.overlap_strategy == 'percentage':
            return self._chunk_with_percentage_overlap(sentences)
        elif self.overlap_strategy == 'sentence':
            return self._chunk_with_sentence_overlap(sentences)
        else:
            return self._chunk_with_percentage_overlap(sentences)  # Default
    
    def _chunk_with_fixed_overlap(self, sentences: List[Dict]) -> List[Dict]:
        """Create chunks with fixed token overlap."""
        chunks = []
        current_chunk_sentences = []
        current_tokens = 0
        chunk_id = 0
        
        i = 0
        while i < len(sentences):
            sentence = sentences[i]
            
            # Check if adding this sentence exceeds chunk size
            if (current_tokens + sentence['tokens'] > self.chunk_size and 
                current_tokens >= self.min_chunk_size):
                
                # Create chunk
                chunk_text = ' '.join([sentences[si]['text'] for si in current_chunk_sentences])
                overlap_info = self._calculate_overlap_info(chunks, current_chunk_sentences, sentences)
                
                chunks.append({
                    'id': chunk_id,
                    'text': chunk_text,
                    'sentences': current_chunk_sentences.copy(),
                    'token_count': current_tokens,
                    'sentence_count': len(current_chunk_sentences),
                    'strategy': self.overlap_strategy,
                    'overlap_info': overlap_info
                })
                
                # Calculate overlap for next chunk
                overlap_tokens = min(int(self.overlap_size), current_tokens // 2)
                
                # Find sentences to include in overlap
                overlap_sentences = []
                overlap_token_count = 0
                
                for si in reversed(current_chunk_sentences):
                    if overlap_token_count + sentences[si]['tokens'] <= overlap_tokens:
                        overlap_sentences.insert(0, si)
                        overlap_token_count += sentences[si]['tokens']
                    else:
                        break
                
                # Start next chunk with the overlap plus the sentence that did not fit.
                # Using sentence i here (not i + 1) keeps every sentence in some chunk.
                current_chunk_sentences = overlap_sentences + [i]
                current_tokens = sum(sentences[si]['tokens'] for si in current_chunk_sentences)
                i += 1
                chunk_id += 1
            else:
                current_chunk_sentences.append(i)
                current_tokens += sentence['tokens']
                i += 1
        
        # Add final chunk
        if current_chunk_sentences and current_tokens >= self.min_chunk_size:
            chunk_text = ' '.join([sentences[si]['text'] for si in current_chunk_sentences])
            overlap_info = self._calculate_overlap_info(chunks, current_chunk_sentences, sentences)
            
            chunks.append({
                'id': chunk_id,
                'text': chunk_text,
                'sentences': current_chunk_sentences,
                'token_count': current_tokens,
                'sentence_count': len(current_chunk_sentences),
                'strategy': self.overlap_strategy,
                'overlap_info': overlap_info
            })
        
        return chunks
    
    def _chunk_with_percentage_overlap(self, sentences: List[Dict]) -> List[Dict]:
        """Create chunks with percentage-based overlap."""
        chunks = []
        current_chunk_sentences = []
        current_tokens = 0
        chunk_id = 0
        
        i = 0
        while i < len(sentences):
            sentence = sentences[i]
            
            if (current_tokens + sentence['tokens'] > self.chunk_size and 
                current_tokens >= self.min_chunk_size):
                
                # Create chunk
                chunk_text = ' '.join([sentences[si]['text'] for si in current_chunk_sentences])
                overlap_info = self._calculate_overlap_info(chunks, current_chunk_sentences, sentences)
                
                chunks.append({
                    'id': chunk_id,
                    'text': chunk_text,
                    'sentences': current_chunk_sentences.copy(),
                    'token_count': current_tokens,
                    'sentence_count': len(current_chunk_sentences),
                    'strategy': self.overlap_strategy,
                    'overlap_info': overlap_info
                })
                
                # Calculate percentage-based overlap
                overlap_tokens = int(current_tokens * self.overlap_size)
                
                # Find sentences for overlap
                overlap_sentences = []
                overlap_token_count = 0
                
                for si in reversed(current_chunk_sentences):
                    if overlap_token_count + sentences[si]['tokens'] <= overlap_tokens:
                        overlap_sentences.insert(0, si)
                        overlap_token_count += sentences[si]['tokens']
                    else:
                        break
                
                # Start next chunk with the overlap plus the sentence that did not fit.
                # Using sentence i here (not i + 1) keeps every sentence in some chunk.
                current_chunk_sentences = overlap_sentences + [i]
                current_tokens = sum(sentences[si]['tokens'] for si in current_chunk_sentences)
                i += 1
                chunk_id += 1
            else:
                current_chunk_sentences.append(i)
                current_tokens += sentence['tokens']
                i += 1
        
        # Add final chunk
        if current_chunk_sentences and current_tokens >= self.min_chunk_size:
            chunk_text = ' '.join([sentences[si]['text'] for si in current_chunk_sentences])
            overlap_info = self._calculate_overlap_info(chunks, current_chunk_sentences, sentences)
            
            chunks.append({
                'id': chunk_id,
                'text': chunk_text,
                'sentences': current_chunk_sentences,
                'token_count': current_tokens,
                'sentence_count': len(current_chunk_sentences),
                'strategy': self.overlap_strategy,
                'overlap_info': overlap_info
            })
        
        return chunks
    
    def _chunk_with_sentence_overlap(self, sentences: List[Dict]) -> List[Dict]:
        """Create chunks with sentence-boundary overlap."""
        chunks = []
        current_chunk_sentences = []
        current_tokens = 0
        chunk_id = 0
        overlap_sentences_count = int(self.overlap_size) if isinstance(self.overlap_size, (int, float)) else 1
        
        i = 0
        while i < len(sentences):
            sentence = sentences[i]
            
            if (current_tokens + sentence['tokens'] > self.chunk_size and 
                current_tokens >= self.min_chunk_size):
                
                # Create chunk
                chunk_text = ' '.join([sentences[si]['text'] for si in current_chunk_sentences])
                overlap_info = self._calculate_overlap_info(chunks, current_chunk_sentences, sentences)
                
                chunks.append({
                    'id': chunk_id,
                    'text': chunk_text,
                    'sentences': current_chunk_sentences.copy(),
                    'token_count': current_tokens,
                    'sentence_count': len(current_chunk_sentences),
                    'strategy': self.overlap_strategy,
                    'overlap_info': overlap_info
                })
                
                # Take last N sentences for overlap
                overlap_start = max(0, len(current_chunk_sentences) - overlap_sentences_count)
                overlap_sentences = current_chunk_sentences[overlap_start:]
                
                # Start next chunk
                current_chunk_sentences = overlap_sentences + [i]
                current_tokens = sum(sentences[si]['tokens'] for si in current_chunk_sentences)
                chunk_id += 1
            else:
                current_chunk_sentences.append(i)
                current_tokens += sentence['tokens']
            
            i += 1
        
        # Add final chunk
        if current_chunk_sentences and current_tokens >= self.min_chunk_size:
            chunk_text = ' '.join([sentences[si]['text'] for si in current_chunk_sentences])
            overlap_info = self._calculate_overlap_info(chunks, current_chunk_sentences, sentences)
            
            chunks.append({
                'id': chunk_id,
                'text': chunk_text,
                'sentences': current_chunk_sentences,
                'token_count': current_tokens,
                'sentence_count': len(current_chunk_sentences),
                'strategy': self.overlap_strategy,
                'overlap_info': overlap_info
            })
        
        return chunks
    
    def _calculate_overlap_info(self, existing_chunks: List[Dict], 
                              current_sentences: List[int], 
                              all_sentences: List[Dict]) -> Dict:
        """Calculate overlap information for a chunk."""
        if not existing_chunks:
            return {'overlap_with_previous': 0, 'overlap_sentences': [], 'overlap_tokens': 0}
        
        previous_chunk = existing_chunks[-1]
        previous_sentences = set(previous_chunk['sentences'])
        current_sentences_set = set(current_sentences)
        
        # Sort so the shared sentence indices are reported in document order
        overlap_sentences = sorted(previous_sentences & current_sentences_set)
        overlap_tokens = sum(all_sentences[si]['tokens'] for si in overlap_sentences)
        
        return {
            'overlap_with_previous': len(overlap_sentences),
            'overlap_sentences': overlap_sentences,
            'overlap_tokens': overlap_tokens
        }

print("✅ OverlappingChunker class implemented!")
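
Because each chunk records the sentence indices it contains, a quick coverage check can confirm that no sentence was dropped at a chunk boundary. The helper below is a sketch and not part of `OverlappingChunker`; `find_uncovered_sentences` and the toy chunks are hypothetical and exist only for illustration.

```python
from typing import Dict, List

def find_uncovered_sentences(chunks: List[Dict], n_sentences: int) -> List[int]:
    """Return the indices of sentences that do not appear in any chunk."""
    covered = set()
    for chunk in chunks:
        covered.update(chunk['sentences'])
    return [i for i in range(n_sentences) if i not in covered]

# Toy chunks in the same shape as the chunker output (only the keys used here)
toy_chunks = [
    {'id': 0, 'sentences': [0, 1, 2]},
    {'id': 1, 'sentences': [2, 4, 5]},  # sentence 3 is missing on purpose
]
missing = find_uncovered_sentences(toy_chunks, n_sentences=6)
print(f"Uncovered sentences: {missing}")
```

With real output, the chunks would come from `chunk_text` and the sentence count from the same whitespace-normalized text. Note that the chunker discards a final chunk smaller than `min_chunk_size`, so trailing sentences can legitimately show up as uncovered.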

## 4. Testing Different Overlap Strategies

In [None]:
# Test document for overlap analysis
test_document = """
Introduction to Quantum Computing

Quantum computing represents a revolutionary approach to information processing that leverages quantum mechanical phenomena. Unlike classical computers that use bits representing either 0 or 1, quantum computers use quantum bits or qubits that can exist in superposition states.

The principle of superposition allows qubits to be in multiple states simultaneously. This fundamental property enables quantum computers to perform certain calculations exponentially faster than classical computers. Quantum entanglement is another crucial phenomenon where qubits become correlated in ways that classical physics cannot explain.

Historical Development

The theoretical foundations of quantum computing were laid in the 1980s by physicists like Richard Feynman and David Deutsch. Feynman proposed that quantum systems could be used to simulate other quantum systems more efficiently than classical computers.

The first quantum algorithms were developed in the 1990s. Peter Shor's algorithm for factoring large numbers demonstrated quantum computing's potential to break current cryptographic systems. Lov Grover's search algorithm showed quadratic speedup for searching unsorted databases.

Quantum Algorithms and Applications

Shor's algorithm poses a significant threat to RSA encryption, which relies on the difficulty of factoring large numbers. A sufficiently large quantum computer could break RSA encryption in polynomial time, rendering current security systems vulnerable.

Grover's algorithm provides quadratic speedup for searching unstructured databases. While classical computers require O(N) operations to search N items, Grover's algorithm requires only O(√N) operations, offering substantial performance improvements.

Quantum machine learning algorithms show promise for optimization problems and pattern recognition. Variational quantum eigensolvers can solve chemistry problems by finding ground state energies of molecules.
"""

# Test different overlap strategies
overlap_strategies = {
    'Fixed (50 tokens)': OverlappingChunker(chunk_size=300, overlap_strategy='fixed', overlap_size=50),
    'Percentage (20%)': OverlappingChunker(chunk_size=300, overlap_strategy='percentage', overlap_size=0.2),
    'Sentence (2 sentences)': OverlappingChunker(chunk_size=300, overlap_strategy='sentence', overlap_size=2),
    'No Overlap': OverlappingChunker(chunk_size=300, overlap_strategy='fixed', overlap_size=0)
}

results = {}

print("🔬 Testing Overlapping Chunk Strategies\n")

for name, chunker in overlap_strategies.items():
    print(f"{'='*60}")
    print(f"Strategy: {name}")
    print(f"{'='*60}")
    
    start_time = time.time()
    chunks = chunker.chunk_text(test_document)
    processing_time = time.time() - start_time
    
    # Calculate overlap statistics
    total_tokens = sum(chunk['token_count'] for chunk in chunks)
    unique_tokens = count_tokens(test_document)
    overlap_tokens = sum(chunk['overlap_info']['overlap_tokens'] for chunk in chunks)
    redundancy = (total_tokens - unique_tokens) / unique_tokens * 100 if unique_tokens > 0 else 0
    
    results[name] = {
        'chunks': chunks,
        'count': len(chunks),
        'avg_tokens': np.mean([c['token_count'] for c in chunks]),
        'total_tokens': total_tokens,
        'unique_tokens': unique_tokens,
        'overlap_tokens': overlap_tokens,
        'redundancy': redundancy,
        'processing_time': processing_time
    }
    
    print(f"Number of chunks: {len(chunks)}")
    print(f"Average tokens per chunk: {np.mean([c['token_count'] for c in chunks]):.1f}")
    print(f"Total tokens (with overlap): {total_tokens}")
    print(f"Unique content tokens: {unique_tokens}")
    print(f"Overlap tokens: {overlap_tokens}")
    print(f"Redundancy: {redundancy:.1f}%")
    print(f"Processing time: {processing_time:.3f}s")
    
    # Show overlap pattern for first few chunks
    print(f"\n📊 Overlap Pattern:")
    for i, chunk in enumerate(chunks[:3]):
        overlap_info = chunk['overlap_info']
        print(f"  Chunk {i}: {overlap_info['overlap_with_previous']} overlapping sentences, "
              f"{overlap_info['overlap_tokens']} tokens")
    
    print("\n")

## 5. Visualizing Overlap Performance

In [None]:
# Create comparison visualization
strategies = list(results.keys())
chunk_counts = [results[s]['count'] for s in strategies]
avg_tokens = [results[s]['avg_tokens'] for s in strategies]
redundancy = [results[s]['redundancy'] for s in strategies]
proc_times = [results[s]['processing_time'] for s in strategies]

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Overlap Strategy Comparison', fontsize=16)

# Chunk count comparison
ax1.bar(strategies, chunk_counts, alpha=0.7, color='skyblue')
ax1.set_title('Number of Chunks')
ax1.set_ylabel('Count')
ax1.tick_params(axis='x', rotation=45)
ax1.grid(True, alpha=0.3)

# Average tokens per chunk
ax2.bar(strategies, avg_tokens, alpha=0.7, color='lightgreen')
ax2.set_title('Average Tokens per Chunk')
ax2.set_ylabel('Tokens')
ax2.tick_params(axis='x', rotation=45)
ax2.grid(True, alpha=0.3)

# Redundancy percentage
ax3.bar(strategies, redundancy, alpha=0.7, color='orange')
ax3.set_title('Context Redundancy')
ax3.set_ylabel('Percentage (%)')
ax3.tick_params(axis='x', rotation=45)
ax3.grid(True, alpha=0.3)

# Processing time
ax4.bar(strategies, proc_times, alpha=0.7, color='lightcoral')
ax4.set_title('Processing Time')
ax4.set_ylabel('Seconds')
ax4.tick_params(axis='x', rotation=45)
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Create comparison table
comparison_df = pd.DataFrame([
    {
        'Strategy': strategy,
        'Chunks': data['count'],
        'Avg Tokens': f"{data['avg_tokens']:.1f}",
        'Total Tokens': data['total_tokens'],
        'Redundancy %': f"{data['redundancy']:.1f}",
        'Processing Time (s)': f"{data['processing_time']:.3f}"
    }
    for strategy, data in results.items()
])

print("\n📊 Strategy Comparison Results:")
print(comparison_df.to_string(index=False))

print("\n📈 Key Insights:")
print("• Overlapping strategies provide more context but increase redundancy")
print("• Percentage-based overlap scales with the size of the preceding chunk")
print("• Sentence-based overlap preserves natural language boundaries")
print("• Higher redundancy may improve context preservation, at the cost of more tokens to embed and send")
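
Redundancy measures cost, not benefit. One possible way to put a number on context preservation is to count how many adjacent sentence pairs end up together in at least one chunk. This metric is an assumption introduced here, not something the notebook computes; `boundary_pair_coverage` is a hypothetical name.

```python
from typing import Dict, List

def boundary_pair_coverage(chunks: List[Dict], n_sentences: int) -> float:
    """Fraction of adjacent sentence pairs (i, i + 1) that co-occur in at least one chunk."""
    if n_sentences < 2:
        return 1.0  # no adjacent pairs to split

    kept_together = set()
    for chunk in chunks:
        present = set(chunk['sentences'])
        kept_together.update(
            i for i in present
            if i + 1 in present and i < n_sentences - 1
        )
    return len(kept_together) / (n_sentences - 1)

no_overlap = [{'sentences': [0, 1]}, {'sentences': [2, 3]}]
one_sentence_overlap = [{'sentences': [0, 1]}, {'sentences': [1, 2, 3]}]

coverage_without = boundary_pair_coverage(no_overlap, n_sentences=4)
coverage_with = boundary_pair_coverage(one_sentence_overlap, n_sentences=4)
print(f"No overlap:       {coverage_without:.2f}")
print(f"1-sentence overlap: {coverage_with:.2f}")
```

Without overlap, the pair that straddles the boundary (sentences 1 and 2) is never seen together; a one-sentence overlap restores it.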

## 6. Overlap-Aware Q&A System

In [None]:
class OverlapAwareQASystem:
    def __init__(self, overlap_strategy: str = 'percentage', chunk_size: int = 400, 
                 overlap_size: Union[int, float] = 0.2):
        self.chunker = OverlappingChunker(
            chunk_size=chunk_size, 
            overlap_strategy=overlap_strategy, 
            overlap_size=overlap_size
        )
        self.gemini_model = genai.GenerativeModel('gemini-pro')
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.chunks = []
        self.chunk_embeddings = None
        self.document_title = ""
        
    def load_document(self, text: str, title: str = "Document"):
        """Load document and create overlap-aware chunks."""
        self.document_title = title
        print(f"🔄 Processing document with {self.chunker.overlap_strategy} overlap strategy...")
        
        # Create overlapping chunks
        self.chunks = self.chunker.chunk_text(text)
        
        # Generate embeddings
        chunk_texts = [chunk['text'] for chunk in self.chunks]
        self.chunk_embeddings = self.embedding_model.encode(chunk_texts)
        
        # Calculate statistics
        total_tokens = sum(chunk['token_count'] for chunk in self.chunks)
        unique_tokens = count_tokens(text)
        redundancy = (total_tokens - unique_tokens) / unique_tokens * 100 if unique_tokens > 0 else 0
        
        print(f"✅ Loaded '{title}' with {len(self.chunks)} overlapping chunks")
        print(f"📊 Redundancy: {redundancy:.1f}% | Strategy: {self.chunker.overlap_strategy}")
        
    def _find_relevant_chunks(self, question: str, max_chunks: int = 3) -> List[Dict]:
        """Find relevant chunks using semantic similarity."""
        if not self.chunks or self.chunk_embeddings is None:
            return []
        
        # Get question embedding and similarities
        question_embedding = self.embedding_model.encode([question])
        similarities = cosine_similarity(question_embedding, self.chunk_embeddings)[0]
        
        # Get top chunks by similarity
        top_indices = np.argsort(similarities)[::-1][:max_chunks]
        
        relevant_chunks = []
        for idx in top_indices:
            chunk = self.chunks[idx].copy()
            chunk['similarity_score'] = float(similarities[idx])
            relevant_chunks.append(chunk)
        
        return relevant_chunks
    
    def answer_question(self, question: str) -> Dict:
        """Answer question using overlap-aware retrieval."""
        if not self.chunks:
            return {"error": "No document loaded"}
        
        print(f"🔍 Finding relevant chunks with overlap awareness: {question}")
        
        # Find relevant chunks
        relevant_chunks = self._find_relevant_chunks(question, max_chunks=3)
        
        if not relevant_chunks:
            return {"error": "No relevant content found"}
        
        # Sort chunks by document order for coherent context
        relevant_chunks.sort(key=lambda x: x['id'])
        
        # Prepare context
        context_parts = []
        for chunk in relevant_chunks:
            chunk_info = f"[Chunk {chunk['id']}]"
            context_parts.append(f"{chunk_info}\n{chunk['text']}")
        
        context = "\n\n".join(context_parts)
        
        # Generate answer
        answer_prompt = f"""
        You are analyzing a document with overlapping chunks to ensure context continuity. 
        Answer the question based on the provided context from "{self.document_title}".
        
        Context (chunks may have overlapping content for continuity):
        {context}
        
        Question: {question}
        
        Instructions:
        1. Provide a comprehensive answer using information from all relevant chunks
        2. When chunks overlap, synthesize the information without redundancy
        3. Use the overlapping context to provide smoother, more connected answers
        4. Cite which chunks your information comes from when helpful
        
        Answer:
        """
        
        try:
            response = self.gemini_model.generate_content(answer_prompt)
            
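            # Calculate overlap statistics for this retrieval.
            # overlap_info is measured against the preceding chunk, so only count it
            # when that preceding chunk was also retrieved.
            retrieved_ids = {chunk['id'] for chunk in relevant_chunks}
            total_context_tokens = sum(chunk['token_count'] for chunk in relevant_chunks)
            total_overlap_tokens = sum(
                chunk['overlap_info']['overlap_tokens']
                for chunk in relevant_chunks
                if chunk['id'] - 1 in retrieved_ids
            )
            context_redundancy = (total_overlap_tokens / total_context_tokens * 100) if total_context_tokens > 0 else 0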
            
            return {
                "question": question,
                "answer": response.text,
                "relevant_chunks": len(relevant_chunks),
                "chunk_details": [
                    {
                        "id": chunk['id'],
                        "similarity": chunk.get('similarity_score', 0),
                        "token_count": chunk['token_count'],
                        "overlap_tokens": chunk['overlap_info']['overlap_tokens']
                    }
                    for chunk in relevant_chunks
                ],
                "context_tokens": total_context_tokens,
                "context_redundancy": context_redundancy,
                "overlap_strategy": self.chunker.overlap_strategy
            }
            
        except Exception as e:
            return {"error": f"Failed to generate answer: {e}"}
    
    def analyze_overlap_structure(self) -> Dict:
        """Analyze the overlap structure of the document."""
        if not self.chunks:
            return {"error": "No document loaded"}
        
        overlap_counts = [c['overlap_info']['overlap_tokens'] for c in self.chunks]
        nonzero_overlaps = [n for n in overlap_counts if n > 0]

        analysis = {
            "total_chunks": len(self.chunks),
            "overlap_strategy": self.chunker.overlap_strategy,
            "avg_tokens_per_chunk": np.mean([c['token_count'] for c in self.chunks]),
            "overlap_statistics": {
                # Average over chunks that actually overlap; 0.0 when none do (avoids NaN)
                "avg_overlap_tokens": float(np.mean(nonzero_overlaps)) if nonzero_overlaps else 0.0,
                "max_overlap_tokens": max(overlap_counts),
                "total_overlap_tokens": sum(overlap_counts)
            }
        }
        
        return analysis

print("✅ OverlapAwareQASystem class implemented!")
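
The prompt above asks the model to ignore repeated text, but the repetition can also be removed before the prompt is built, since every chunk lists its sentence indices. The following is a sketch under one assumption: a `sentence_texts` list mapping index to sentence text is available, which `OverlappingChunker` does not currently return. `build_deduplicated_context` is a hypothetical helper.

```python
from typing import Dict, List

def build_deduplicated_context(chunks: List[Dict], sentence_texts: List[str]) -> str:
    """Join retrieved chunks in document order, emitting each sentence only once."""
    seen = set()
    parts = []
    for chunk in sorted(chunks, key=lambda c: c['id']):
        # Keep only sentences that an earlier retrieved chunk has not already supplied
        fresh = [si for si in chunk['sentences'] if si not in seen]
        seen.update(fresh)
        if fresh:
            body = ' '.join(sentence_texts[si] for si in fresh)
            parts.append(f"[Chunk {chunk['id']}]\n{body}")
    return "\n\n".join(parts)

# Toy data: chunk 1 repeats sentences 1 and 2 from chunk 0
toy_sentences = ["A.", "B.", "C.", "D."]
retrieved = [
    {'id': 1, 'sentences': [1, 2, 3]},
    {'id': 0, 'sentences': [0, 1, 2]},
]
dedup_context = build_deduplicated_context(retrieved, toy_sentences)
print(dedup_context)
```

The trade-off is that a chunk label no longer marks everything that chunk contained, so chunk-level citations in the answer become less precise.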

## 7. Testing the Overlap-Aware Q&A System

In [None]:
# Initialize overlap-aware Q&A system
overlap_qa = OverlapAwareQASystem(
    overlap_strategy='percentage', 
    chunk_size=400, 
    overlap_size=0.25  # 25% overlap
)

overlap_qa.load_document(test_document, "Quantum Computing Introduction")

# Analyze overlap structure
overlap_analysis = overlap_qa.analyze_overlap_structure()
print(f"\n📊 Overlap Structure Analysis:")
for key, value in overlap_analysis.items():
    if isinstance(value, dict):
        print(f"  {key}:")
        for subkey, subvalue in value.items():
            print(f"    {subkey}: {subvalue}")
    else:
        print(f"  {key}: {value}")

# Test questions
test_questions = [
    "What is quantum superposition and how does it provide computational advantages?",
    "Explain Shor's algorithm and its impact on cryptography",
    "What are the key differences between quantum and classical computing?"
]

print("\n🧠 Testing Overlap-Aware Q&A System\n")

for i, question in enumerate(test_questions[:2], 1):  # Test first 2 questions
    print(f"{'='*70}")
    print(f"Question {i}: {question}")
    print(f"{'='*70}")
    
    result = overlap_qa.answer_question(question)
    
    if "error" in result:
        print(f"❌ Error: {result['error']}")
    else:
        print(f"\n💡 Answer:")
        print(result["answer"])
        
        print(f"\n📊 Retrieval Statistics:")
        print(f"  - Overlap strategy: {result['overlap_strategy']}")
        print(f"  - Relevant chunks: {result['relevant_chunks']}")
        print(f"  - Context tokens: {result['context_tokens']}")
        print(f"  - Context redundancy: {result['context_redundancy']:.1f}%")
        
        print(f"\n📚 Chunk Usage Details:")
        for detail in result['chunk_details']:
            print(f"  • Chunk {detail['id']}: Similarity {detail['similarity']:.3f}, "
                  f"Tokens {detail['token_count']}, Overlap {detail['overlap_tokens']}")
    
    print("\n" + "-"*70 + "\n")
    time.sleep(1)  # Rate limiting
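
To make the `chunk_size=400`, `overlap_size=0.25` configuration concrete, here is a sketch of the window arithmetic behind percentage overlap. `percentage_overlap_windows` is a hypothetical helper; it works on raw token offsets and ignores sentence boundaries, so it may differ from how `OverlapAwareQASystem` places boundaries internally.

```python
from typing import List, Tuple


def percentage_overlap_windows(
    n_tokens: int, chunk_size: int, overlap_ratio: float
) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets for chunks with percentage overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    stride = chunk_size - overlap  # how far each new chunk advances
    windows = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:  # the document is fully covered
            break
        start += stride
    return windows


# 25% of 400 tokens is 100 tokens of overlap, so each chunk advances by 300.
print(percentage_overlap_windows(1000, 400, 0.25))
# [(0, 400), (300, 700), (600, 1000)]
```

Each consecutive pair of windows shares exactly 100 tokens, which is the context carried across the boundary.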

## 8. Best Practices for Overlapping Chunks

In [None]:
def overlapping_chunks_best_practices():
    """Display comprehensive best practices for overlapping chunk strategies."""
    
    practices = {
        "🎯 Overlap Strategy Selection": [
            "• Use fixed overlap (50-100 tokens) for consistent, predictable redundancy",
            "• Use percentage overlap (15-25%) for adaptive redundancy based on chunk size",
            "• Use sentence overlap (1-2 sentences) for natural language boundaries",
            "• Consider document structure when choosing overlap strategy",
            "• Test multiple strategies with your specific content type"
        ],
        
        "📏 Optimal Overlap Sizing": [
            "• Start with 15-25% overlap for most applications",
            "• Use smaller overlaps (10-15%) for highly structured documents",
            "• Use larger overlaps (25-40%) for narrative or complex content",
            "• Balance context preservation with computational efficiency",
            "• Monitor redundancy vs. performance trade-offs"
        ],
        
        "🔗 Context Preservation Techniques": [
            "• Preserve sentence boundaries whenever possible",
            "• Consider semantic similarity when determining overlap content",
            "• Track overlap tokens to avoid excessive redundancy",
            "• Use overlap for bridging concepts across chunk boundaries",
            "• Implement quality metrics for context preservation"
        ],
        
        "⚡ Performance Optimization": [
            "• Cache overlap calculations for repeated processing",
            "• Use efficient similarity search to handle redundancy",
            "• Monitor memory usage with large overlap percentages",
            "• Consider streaming processing for very large documents",
            "• Implement parallel processing where possible"
        ],
        
        "🔍 Quality Assurance": [
            "• Measure context preservation effectiveness",
            "• Test boundary scenarios and edge cases",
            "• Validate that overlaps actually improve retrieval quality",
            "• Monitor for information loss at chunk boundaries",
            "• Compare against non-overlapping baselines"
        ],
        
        "⚠️ Common Pitfalls to Avoid": [
            "• Don't use excessive overlap (>50%) without strong justification",
            "• Don't ignore the computational cost of high redundancy",
            "• Don't assume more overlap always means better performance",
            "• Avoid applying one fixed strategy to varied content types",
            "• Always measure the actual improvement from overlaps"
        ]
    }
    
    print("📚 Overlapping Chunks Best Practices\n")
    
    for category, tips in practices.items():
        print(f"{category}")
        for tip in tips:
            print(f"  {tip}")
        print()

overlapping_chunks_best_practices()

# Create decision matrix for overlap strategy selection
decision_matrix = pd.DataFrame([
    {
        'Document Type': 'Academic Papers',
        'Recommended Strategy': 'Sentence (1-2)',
        'Overlap Size': '15-25%',
        'Reasoning': 'Preserve logical flow, respect paragraph structure'
    },
    {
        'Document Type': 'Technical Manuals',
        'Recommended Strategy': 'Fixed (50-75 tokens)',
        'Overlap Size': '10-20%',
        'Reasoning': 'Consistent structure, preserve procedural steps'
    },
    {
        'Document Type': 'Narrative Content',
        'Recommended Strategy': 'Percentage (20-30%)',
        'Overlap Size': '20-35%',
        'Reasoning': 'Maintain story flow, preserve character/plot context'
    },
    {
        'Document Type': 'Legal Documents',
        'Recommended Strategy': 'Sentence (2-3)',
        'Overlap Size': '25-40%',
        'Reasoning': 'Preserve legal context, maintain clause relationships'
    },
    {
        'Document Type': 'News Articles',
        'Recommended Strategy': 'Percentage (15-20%)',
        'Overlap Size': '15-25%',
        'Reasoning': 'Balance context with information density'
    }
])

print("\n🗂️ Document Type Decision Matrix:")
print(decision_matrix.to_string(index=False))

# Performance comparison summary (qualitative ratings, not benchmark results)
performance_summary = pd.DataFrame([
    {
        'Aspect': 'Context Preservation',
        'No Overlap': '⭐⭐',
        'Low Overlap (10-15%)': '⭐⭐⭐',
        'Medium Overlap (20-30%)': '⭐⭐⭐⭐',
        'High Overlap (35%+)': '⭐⭐⭐⭐⭐'
    },
    {
        'Aspect': 'Computational Efficiency',
        'No Overlap': '⭐⭐⭐⭐⭐',
        'Low Overlap (10-15%)': '⭐⭐⭐⭐',
        'Medium Overlap (20-30%)': '⭐⭐⭐',
        'High Overlap (35%+)': '⭐⭐'
    },
    {
        'Aspect': 'Answer Quality',
        'No Overlap': '⭐⭐⭐',
        'Low Overlap (10-15%)': '⭐⭐⭐⭐',
        'Medium Overlap (20-30%)': '⭐⭐⭐⭐⭐',
        'High Overlap (35%+)': '⭐⭐⭐⭐'
    }
])

print("\n⚖️ Performance Trade-off Matrix (qualitative ratings):")
print(performance_summary.to_string(index=False))
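
Two of the practices listed above, testing boundary cases and comparing against a non-overlapping baseline, can be checked with a small experiment. The sketch below uses whitespace-split words in place of model tokens and simple substring matching; `chunk_words` and `phrase_survives` are hypothetical helpers, not part of the system built in this notebook.

```python
from typing import List


def chunk_words(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split a word list into fixed-size chunks that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):  # last chunk reached the end
            break
    return chunks


def phrase_survives(chunks: List[List[str]], phrase: str) -> bool:
    """True if the whole phrase appears inside at least one single chunk."""
    return any(phrase in " ".join(chunk) for chunk in chunks)


# A four-word phrase placed so that it straddles the boundary at word 12.
phrase = "quantum error correction codes"
words = (
    [f"w{i}" for i in range(10)]
    + phrase.split()
    + [f"v{i}" for i in range(10)]
)

baseline = chunk_words(words, chunk_size=12, overlap=0)
overlapped = chunk_words(words, chunk_size=12, overlap=4)

print("No overlap keeps phrase intact:", phrase_survives(baseline, phrase))
print("4-word overlap keeps phrase intact:", phrase_survives(overlapped, phrase))
```

Without overlap the phrase is split between the first and second chunk; with a 4-word overlap the second chunk contains it whole. The same check can be run over real boundary-sensitive phrases from your own documents.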

## 9. Conclusion and Key Takeaways

In this notebook, we explored overlapping chunk techniques and built an overlap-aware Q&A system with Google Gemini.

### 🎯 Key Achievements:
1. **Implemented Multiple Overlap Strategies** - Fixed, percentage, and sentence-based overlapping
2. **Built Overlap-Aware Q&A System** - Context-preserving retrieval capabilities
3. **Analyzed Performance Trade-offs** - Redundancy vs. context quality balance
4. **Developed Best Practices** - Guidelines for different document types
5. **Created Comprehensive Comparisons** - Detailed analysis of strategy effectiveness

### 🚀 Overlapping Chunks Advantages:
- **🔗 Context Continuity**: Preserves information across chunk boundaries
- **📈 Improved Retrieval**: Better chance of capturing complete concepts
- **🧠 Enhanced Understanding**: Maintains narrative and logical flow
- **⚡ Flexible Strategies**: Adaptable to different document types and use cases
- **🔍 Better Q&A Quality**: More comprehensive and coherent answers

### 📊 Key Insights:
- **Optimal Overlap Range**: 15-25% is a reasonable starting point for most applications; tune it against your own documents
- **Strategy Selection**: Document structure should guide overlap strategy choice
- **Redundancy Trade-off**: Higher overlap improves context but increases computational cost
- **Quality vs. Efficiency**: Balance between answer quality and processing resources
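
The redundancy trade-off can be estimated with simple arithmetic: with overlap ratio r, each chunk advances by only (1 - r) of its length, so the number of tokens stored and embedded grows by roughly 1 / (1 - r). The sketch below computes this factor exactly for a given document length; `embedding_overhead` is a hypothetical helper, and it counts raw token offsets only.

```python
import math


def embedding_overhead(n_tokens: int, chunk_size: int, overlap_ratio: float) -> float:
    """Ratio of tokens summed over all chunks to tokens in the document."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    if n_tokens <= chunk_size:
        return 1.0  # a single chunk holds the whole document
    stride = chunk_size - int(chunk_size * overlap_ratio)
    n_chunks = 1 + math.ceil((n_tokens - chunk_size) / stride)
    # The last chunk may be shorter than chunk_size, so count it separately.
    last_start = (n_chunks - 1) * stride
    last_len = min(chunk_size, n_tokens - last_start)
    total = (n_chunks - 1) * chunk_size + last_len
    return total / n_tokens


for ratio in (0.0, 0.25, 0.5):
    factor = embedding_overhead(10_000, 400, ratio)
    print(f"{ratio:.0%} overlap -> {factor:.2f}x tokens to embed")
```

For a 10,000-token document with 400-token chunks, this gives factors of 1.00, 1.32, and 1.96, so moving from 25% to 50% overlap costs considerably more than moving from 0% to 25%.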

### 🛠️ Production Considerations:
- **Caching Strategy**: Store overlap calculations efficiently
- **Scalability**: Implement distributed processing for large collections
- **Monitoring**: Track overlap effectiveness and computational costs
- **User Experience**: Balance answer quality with response time
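
One way to approach the caching point is to key embeddings by a hash of the chunk text, so that chunks left unchanged by re-indexing are not embedded twice. This is a sketch under assumptions: `EmbeddingCache` is a hypothetical class, and `embed_fn` stands in for whatever embedding call the system uses (a toy function here, not a real model).

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Cache embeddings by SHA-256 of the chunk text (hypothetical helper)."""

    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times embed_fn was actually called

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


def toy_embed(text: str) -> List[float]:
    """Stand-in for a real embedding model: character and word counts only."""
    return [float(len(text)), float(len(text.split()))]


cache = EmbeddingCache(toy_embed)
chunks = ["qubits can be in superposition", "shor's algorithm factors integers"]
for chunk in chunks + chunks:  # the second pass simulates re-indexing
    cache.get(chunk)
print(cache.misses)  # 2: each distinct chunk is embedded once
```

An in-memory dict is enough for a notebook; a production system would likely persist the store and bound its size.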

### 🎓 Key Takeaways:
1. **Context is King**: Overlapping chunks help preserve context that would otherwise be split at chunk boundaries
2. **One Size Doesn't Fit All**: Different documents require different overlap strategies
3. **Trade-offs are Real**: Balance context quality with computational efficiency
4. **Measurement Matters**: Always validate that overlaps improve actual performance
5. **Strategy Flexibility**: Adapt overlap approach based on specific use cases

Overlapping chunks preserve context at chunk boundaries and, when tuned and measured against a non-overlapping baseline, can improve answer quality. The techniques demonstrated here provide a starting point for building production systems.

**Happy overlapping!** 🚀🔗