# Semantic Chunking with Google Gemini
## Building an Intelligent Document Analysis System

This notebook demonstrates advanced semantic chunking techniques that preserve meaning and context while processing large documents. Unlike fixed-size chunking, semantic chunking respects natural language boundaries and document structure.

### What You'll Learn:
- Understanding semantic chunking principles
- Implementing structure-aware chunking strategies
- Using embeddings for similarity-based chunking
- Building a semantic document Q&A system with Gemini
- Comparing semantic vs fixed-size approaches
- Advanced techniques for context preservation

### Project Overview:
We'll build an intelligent system that:
1. Analyzes document structure and semantics
2. Creates chunks that preserve meaning and context
3. Uses embedding similarity for optimal chunk boundaries
4. Implements hierarchical chunking strategies
5. Provides superior Q&A performance through semantic understanding

## 1. Setup and Dependencies

In [1]:
# Install required packages
!pip install google-generativeai sentence-transformers spacy nltk scikit-learn numpy pandas matplotlib seaborn tiktoken

Collecting sentence-transformers
  Using cached sentence_transformers-4.1.0-py3-none-any.whl (345 kB)
Collecting spacy
  Downloading spacy-3.8.7-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (31.5 MB)
[K     |████████████████████████████████| 31.5 MB 647 kB/s eta 0:00:011
[?25hCollecting nltk
  Using cached nltk-3.9.1-py3-none-any.whl (1.5 MB)
Collecting pandas
  Using cached pandas-2.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting seaborn
  Using cached seaborn-0.13.2-py3-none-any.whl (294 kB)
Collecting torch>=1.11.0
  Using cached torch-2.7.1-cp310-cp310-manylinux_2_28_x86_64.whl (821.2 MB)
Collecting huggingface-hub>=0.20.0
  Using cached huggingface_hub-0.33.0-py3-none-any.whl (514 kB)
Collecting transformers<5.0.0,>=4.41.0
  Using cached transformers-4.52.4-py3-none-any.whl (10.5 MB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Using cached spacy_loggers-1.0.5-py3-none-any.whl (22 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Using cached

In [7]:
# Download spaCy model and NLTK data
!python -m spacy download en_core_web_sm
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

  return torch._C._cuda_getDeviceCount() > 0
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[K     |████████████████████████████████| 12.8 MB 1.5 MB/s eta 0:00:01
You should consider upgrading via the '/home/mohdasimkhan/.pyenv/versions/chunking/bin/python -m pip install --upgrade pip' command.[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


[nltk_data] Downloading package punkt to
[nltk_data]     /home/mohdasimkhan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/mohdasimkhan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/mohdasimkhan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [3]:
import google.generativeai as genai
from sentence_transformers import SentenceTransformer
import spacy
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import tiktoken
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
import re
import os
import time
from typing import List, Dict, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# Set style for plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

  from .autonotebook import tqdm as notebook_tqdm
  return torch._C._cuda_getDeviceCount() > 0


In [4]:
from dotenv import load_dotenv

load_dotenv()

True

In [5]:
# Configure Gemini API
GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')  # Replace with your actual API key or configure the key in .env
genai.configure(api_key=GEMINI_API_KEY)

# Initialize models
gemini_model = genai.GenerativeModel('gemini-1.5-flash')
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # Lightweight and fast
nlp = spacy.load('en_core_web_sm')
tokenizer = tiktoken.get_encoding("cl100k_base")

print("✅ All models initialized successfully!")
print(f"📊 Embedding model: {embedding_model.get_sentence_embedding_dimension()} dimensions")

✅ All models initialized successfully!
📊 Embedding model: 384 dimensions


## 2. Understanding Semantic Chunking

Semantic chunking preserves the natural structure and meaning of text, unlike fixed-size chunking which can split content arbitrarily.

In [8]:
def count_tokens(text: str) -> int:
    """Count tokens in text using tiktoken."""
    return len(tokenizer.encode(text))

# Demonstrate semantic vs arbitrary splitting
sample_text = """
Machine learning has revolutionized artificial intelligence. It enables computers to learn patterns from data without explicit programming. 
Deep learning, a subset of machine learning, uses neural networks with multiple layers. These networks can process complex patterns in images, text, and audio.
Natural language processing combines linguistics with machine learning. It allows computers to understand and generate human language.
"""

print("🔍 Semantic Analysis Demo\n")
print("Original text:")
print(sample_text.strip())
print(f"\nTotal tokens: {count_tokens(sample_text)}")

# Sentence-level analysis
sentences = sent_tokenize(sample_text)
print(f"\n📝 Sentence breakdown ({len(sentences)} sentences):")
for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent.strip()} ({count_tokens(sent)} tokens)")

# Semantic similarity between sentences
embeddings = embedding_model.encode(sentences)
similarity_matrix = cosine_similarity(embeddings)

print("\n🔗 Sentence similarity matrix:")
for i in range(len(sentences)):
    for j in range(len(sentences)):
        print(f"{similarity_matrix[i][j]:.3f}", end="  ")
    print()

🔍 Semantic Analysis Demo

Original text:
Machine learning has revolutionized artificial intelligence. It enables computers to learn patterns from data without explicit programming. 
Deep learning, a subset of machine learning, uses neural networks with multiple layers. These networks can process complex patterns in images, text, and audio.
Natural language processing combines linguistics with machine learning. It allows computers to understand and generate human language.

Total tokens: 72

📝 Sentence breakdown (6 sentences):
1. Machine learning has revolutionized artificial intelligence. (9 tokens)
2. It enables computers to learn patterns from data without explicit programming. (12 tokens)
3. Deep learning, a subset of machine learning, uses neural networks with multiple layers. (16 tokens)
4. These networks can process complex patterns in images, text, and audio. (14 tokens)
5. Natural language processing combines linguistics with machine learning. (10 tokens)
6. It allows computers

## 3. Implementing Semantic Chunking Strategies

In [9]:
class SemanticChunker:
    def __init__(self, max_chunk_size: int = 512, min_chunk_size: int = 50, 
                 similarity_threshold: float = 0.5, strategy: str = 'sentence'):
        """
        Initialize semantic chunker.
        
        Args:
            max_chunk_size: Maximum tokens per chunk
            min_chunk_size: Minimum tokens per chunk
            similarity_threshold: Threshold for semantic similarity
            strategy: 'sentence', 'paragraph', 'structure', or 'embedding'
        """
        self.max_chunk_size = max_chunk_size
        self.min_chunk_size = min_chunk_size
        self.similarity_threshold = similarity_threshold
        self.strategy = strategy
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.nlp = spacy.load('en_core_web_sm')
        self.tokenizer = tiktoken.get_encoding("cl100k_base")
    
    def _count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        return len(self.tokenizer.encode(text))
    
    def _extract_sentences(self, text: str) -> List[Dict]:
        """Extract sentences with metadata."""
        doc = self.nlp(text)
        sentences = []
        
        for sent in doc.sents:
            sent_text = sent.text.strip()
            if sent_text:
                sentences.append({
                    'text': sent_text,
                    'start': sent.start_char,
                    'end': sent.end_char,
                    'tokens': self._count_tokens(sent_text)
                })
        
        return sentences
    
    def _extract_paragraphs(self, text: str) -> List[Dict]:
        """Extract paragraphs as natural chunks."""
        paragraphs = []
        para_texts = [p.strip() for p in text.split('\n\n') if p.strip()]
        
        start_pos = 0
        for para_text in para_texts:
            end_pos = start_pos + len(para_text)
            paragraphs.append({
                'text': para_text,
                'start': start_pos,
                'end': end_pos,
                'tokens': self._count_tokens(para_text)
            })
            start_pos = end_pos + 2  # Account for \n\n
        
        return paragraphs
    
    def _extract_structural_elements(self, text: str) -> List[Dict]:
        """Extract structural elements (headers, sections, etc.)."""
        elements = []
        
        # Split by multiple newlines first
        sections = re.split(r'\n\s*\n', text)
        
        start_pos = 0
        for section in sections:
            section = section.strip()
            if not section:
                continue
            
            # Identify headers (lines that are short and may have numbers/capitals)
            lines = section.split('\n')
            is_header = (
                len(lines) == 1 and 
                len(section) < 100 and 
                (re.match(r'^\d+\.', section) or section.isupper() or section.istitle())
            )
            
            element_type = 'header' if is_header else 'content'
            
            elements.append({
                'text': section,
                'type': element_type,
                'start': start_pos,
                'end': start_pos + len(section),
                'tokens': self._count_tokens(section)
            })
            
            start_pos += len(section) + 2
        
        return elements
    
    def _chunk_by_embedding_similarity(self, sentences: List[Dict]) -> List[List[Dict]]:
        """Group sentences by semantic similarity."""
        if len(sentences) <= 1:
            return [sentences]
        
        # Get embeddings for all sentences
        texts = [sent['text'] for sent in sentences]
        embeddings = self.embedding_model.encode(texts)
        
        # Calculate similarity matrix
        similarity_matrix = cosine_similarity(embeddings)
        
        # Group similar sentences
        chunks = []
        used_indices = set()
        
        for i, sentence in enumerate(sentences):
            if i in used_indices:
                continue
            
            current_chunk = [sentence]
            current_tokens = sentence['tokens']
            used_indices.add(i)
            
            # Find similar sentences to add to this chunk
            for j in range(i + 1, len(sentences)):
                if j in used_indices:
                    continue
                
                # Check if adding this sentence would exceed token limit
                if current_tokens + sentences[j]['tokens'] > self.max_chunk_size:
                    break
                
                # Check semantic similarity
                if similarity_matrix[i][j] > self.similarity_threshold:
                    current_chunk.append(sentences[j])
                    current_tokens += sentences[j]['tokens']
                    used_indices.add(j)
            
            chunks.append(current_chunk)
        
        return chunks
    
    def chunk_text(self, text: str) -> List[Dict]:
        """Main chunking method based on strategy."""
        # Clean text
        text = re.sub(r'\s+', ' ', text.strip())
        
        if self.strategy == 'sentence':
            return self._chunk_by_sentences(text)
        elif self.strategy == 'paragraph':
            return self._chunk_by_paragraphs(text)
        elif self.strategy == 'structure':
            return self._chunk_by_structure(text)
        elif self.strategy == 'embedding':
            return self._chunk_by_embeddings(text)
        else:
            raise ValueError(f"Unknown strategy: {self.strategy}")
    
    def _chunk_by_sentences(self, text: str) -> List[Dict]:
        """Chunk by combining sentences up to token limit."""
        sentences = self._extract_sentences(text)
        chunks = []
        
        current_chunk_sentences = []
        current_tokens = 0
        chunk_id = 0
        
        for sentence in sentences:
            # Check if adding this sentence would exceed limit
            if (current_tokens + sentence['tokens'] > self.max_chunk_size and 
                current_tokens >= self.min_chunk_size):
                
                # Save current chunk
                chunk_text = ' '.join([s['text'] for s in current_chunk_sentences])
                chunks.append({
                    'id': chunk_id,
                    'text': chunk_text,
                    'type': 'sentence-based',
                    'sentences': current_chunk_sentences.copy(),
                    'token_count': current_tokens,
                    'sentence_count': len(current_chunk_sentences)
                })
                
                # Start new chunk
                current_chunk_sentences = [sentence]
                current_tokens = sentence['tokens']
                chunk_id += 1
            else:
                current_chunk_sentences.append(sentence)
                current_tokens += sentence['tokens']
        
        # Add final chunk
        if current_chunk_sentences and current_tokens >= self.min_chunk_size:
            chunk_text = ' '.join([s['text'] for s in current_chunk_sentences])
            chunks.append({
                'id': chunk_id,
                'text': chunk_text,
                'type': 'sentence-based',
                'sentences': current_chunk_sentences,
                'token_count': current_tokens,
                'sentence_count': len(current_chunk_sentences)
            })
        
        return chunks
    
    def _chunk_by_paragraphs(self, text: str) -> List[Dict]:
        """Chunk by paragraphs, combining small ones."""
        paragraphs = self._extract_paragraphs(text)
        chunks = []
        
        current_chunk_paras = []
        current_tokens = 0
        chunk_id = 0
        
        for paragraph in paragraphs:
            # If paragraph alone exceeds max size, split it by sentences
            if paragraph['tokens'] > self.max_chunk_size:
                # Save current chunk first if it exists
                if current_chunk_paras:
                    chunk_text = '\n\n'.join([p['text'] for p in current_chunk_paras])
                    chunks.append({
                        'id': chunk_id,
                        'text': chunk_text,
                        'type': 'paragraph-based',
                        'paragraphs': current_chunk_paras.copy(),
                        'token_count': current_tokens,
                        'paragraph_count': len(current_chunk_paras)
                    })
                    chunk_id += 1
                    current_chunk_paras = []
                    current_tokens = 0
                
                # Split large paragraph by sentences
                sentence_chunks = self._chunk_by_sentences(paragraph['text'])
                for sent_chunk in sentence_chunks:
                    sent_chunk['id'] = chunk_id
                    sent_chunk['type'] = 'paragraph-split'
                    chunks.append(sent_chunk)
                    chunk_id += 1
                
            elif (current_tokens + paragraph['tokens'] > self.max_chunk_size and 
                  current_tokens >= self.min_chunk_size):
                
                # Save current chunk
                chunk_text = '\n\n'.join([p['text'] for p in current_chunk_paras])
                chunks.append({
                    'id': chunk_id,
                    'text': chunk_text,
                    'type': 'paragraph-based',
                    'paragraphs': current_chunk_paras.copy(),
                    'token_count': current_tokens,
                    'paragraph_count': len(current_chunk_paras)
                })
                
                # Start new chunk
                current_chunk_paras = [paragraph]
                current_tokens = paragraph['tokens']
                chunk_id += 1
            else:
                current_chunk_paras.append(paragraph)
                current_tokens += paragraph['tokens']
        
        # Add final chunk
        if current_chunk_paras and current_tokens >= self.min_chunk_size:
            chunk_text = '\n\n'.join([p['text'] for p in current_chunk_paras])
            chunks.append({
                'id': chunk_id,
                'text': chunk_text,
                'type': 'paragraph-based',
                'paragraphs': current_chunk_paras,
                'token_count': current_tokens,
                'paragraph_count': len(current_chunk_paras)
            })
        
        return chunks
    
    def _chunk_by_structure(self, text: str) -> List[Dict]:
        """Chunk by document structure (headers, sections)."""
        elements = self._extract_structural_elements(text)
        chunks = []
        
        current_chunk_elements = []
        current_tokens = 0
        chunk_id = 0
        current_header = None
        
        for element in elements:
            if element['type'] == 'header':
                # Save previous chunk if it exists
                if current_chunk_elements and current_tokens >= self.min_chunk_size:
                    chunk_text = '\n\n'.join([e['text'] for e in current_chunk_elements])
                    chunks.append({
                        'id': chunk_id,
                        'text': chunk_text,
                        'type': 'structure-based',
                        'header': current_header,
                        'elements': current_chunk_elements.copy(),
                        'token_count': current_tokens,
                        'element_count': len(current_chunk_elements)
                    })
                    chunk_id += 1
                
                # Start new chunk with header
                current_header = element['text']
                current_chunk_elements = [element]
                current_tokens = element['tokens']
                
            else:  # content
                if (current_tokens + element['tokens'] > self.max_chunk_size and 
                    current_tokens >= self.min_chunk_size):
                    
                    # Save current chunk
                    chunk_text = '\n\n'.join([e['text'] for e in current_chunk_elements])
                    chunks.append({
                        'id': chunk_id,
                        'text': chunk_text,
                        'type': 'structure-based',
                        'header': current_header,
                        'elements': current_chunk_elements.copy(),
                        'token_count': current_tokens,
                        'element_count': len(current_chunk_elements)
                    })
                    chunk_id += 1
                    
                    # Start new chunk (keep header if it exists)
                    if current_header:
                        header_element = {'text': current_header, 'type': 'header', 
                                        'tokens': self._count_tokens(current_header)}
                        current_chunk_elements = [header_element, element]
                        current_tokens = header_element['tokens'] + element['tokens']
                    else:
                        current_chunk_elements = [element]
                        current_tokens = element['tokens']
                else:
                    current_chunk_elements.append(element)
                    current_tokens += element['tokens']
        
        # Add final chunk
        if current_chunk_elements and current_tokens >= self.min_chunk_size:
            chunk_text = '\n\n'.join([e['text'] for e in current_chunk_elements])
            chunks.append({
                'id': chunk_id,
                'text': chunk_text,
                'type': 'structure-based',
                'header': current_header,
                'elements': current_chunk_elements,
                'token_count': current_tokens,
                'element_count': len(current_chunk_elements)
            })
        
        return chunks
    
    def _chunk_by_embeddings(self, text: str) -> List[Dict]:
        """Chunk by semantic similarity using embeddings."""
        sentences = self._extract_sentences(text)
        if len(sentences) <= 1:
            return self._chunk_by_sentences(text)
        
        sentence_groups = self._chunk_by_embedding_similarity(sentences)
        chunks = []
        
        for chunk_id, group in enumerate(sentence_groups):
            chunk_text = ' '.join([sent['text'] for sent in group])
            token_count = sum([sent['tokens'] for sent in group])
            
            if token_count >= self.min_chunk_size:
                chunks.append({
                    'id': chunk_id,
                    'text': chunk_text,
                    'type': 'embedding-based',
                    'sentences': group,
                    'token_count': token_count,
                    'sentence_count': len(group),
                    'avg_similarity': self._calculate_group_similarity(group)
                })
        
        return chunks
    
    def _calculate_group_similarity(self, sentences: List[Dict]) -> float:
        """Calculate average similarity within a group of sentences."""
        if len(sentences) <= 1:
            return 1.0
        
        texts = [sent['text'] for sent in sentences]
        embeddings = self.embedding_model.encode(texts)
        similarity_matrix = cosine_similarity(embeddings)
        
        # Calculate average similarity (excluding diagonal)
        total_similarity = 0
        count = 0
        
        for i in range(len(sentences)):
            for j in range(i + 1, len(sentences)):
                total_similarity += similarity_matrix[i][j]
                count += 1
        
        return total_similarity / count if count > 0 else 1.0

print("✅ SemanticChunker class implemented!")

✅ SemanticChunker class implemented!


## 4. Testing Different Semantic Strategies

In [11]:
# Comprehensive test document with clear structure
test_document = """
1. Introduction to Machine Learning

Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. This field has revolutionized how we approach complex problems across various domains.

The core principle of machine learning lies in pattern recognition. Algorithms analyze large datasets to identify patterns and relationships that humans might miss or find too complex to detect manually.

2. Types of Machine Learning

2.1 Supervised Learning

Supervised learning uses labeled training data to teach algorithms to predict outcomes. The algorithm learns from input-output pairs and can then make predictions on new, unseen data.

Common supervised learning tasks include classification and regression. Classification predicts discrete categories, while regression predicts continuous numerical values.

2.2 Unsupervised Learning

Unsupervised learning finds hidden patterns in data without labeled examples. The algorithm must discover structure in the data independently.

Clustering and dimensionality reduction are popular unsupervised learning techniques. Clustering groups similar data points together, while dimensionality reduction simplifies data while preserving important information.

2.3 Reinforcement Learning

Reinforcement learning trains agents to make decisions through trial and error. The agent receives rewards or penalties based on its actions and learns to maximize long-term rewards.

This approach has achieved remarkable success in game playing, robotics, and autonomous systems. The agent learns optimal strategies through exploration and exploitation of its environment.

3. Deep Learning Revolution

Deep learning represents a paradigm shift in machine learning. These neural networks with multiple hidden layers can learn complex representations automatically.

Convolutional neural networks excel at image processing tasks. They can identify features at different levels of abstraction, from edges and textures to complex objects and scenes.

Recurrent neural networks handle sequential data effectively. They maintain memory of previous inputs, making them ideal for natural language processing and time series analysis.

4. Applications and Impact

Machine learning applications span numerous industries. Healthcare benefits from diagnostic assistance and drug discovery. Finance uses ML for fraud detection and algorithmic trading.

Autonomous vehicles rely heavily on machine learning for perception and decision-making. These systems must process sensor data in real-time to navigate safely through complex environments.

Natural language processing enables machines to understand and generate human language. This technology powers chatbots, translation services, and content generation tools.

5. Challenges and Future Directions

Despite significant progress, machine learning faces important challenges. Data quality and availability remain critical issues. Biased datasets can lead to unfair or discriminatory outcomes.

Explainability is crucial for high-stakes applications. Black-box models make it difficult to understand how decisions are made, limiting trust and adoption in critical domains.

The future of machine learning includes automated machine learning, federated learning, and quantum machine learning. These advances promise to make ML more accessible and powerful.
"""

# Test all semantic chunking strategies
strategies = {
    'Sentence-based': SemanticChunker(max_chunk_size=300, strategy='sentence'),
    'Paragraph-based': SemanticChunker(max_chunk_size=400, strategy='paragraph'),
    'Structure-based': SemanticChunker(max_chunk_size=350, strategy='structure'),
    'Embedding-based': SemanticChunker(max_chunk_size=300, similarity_threshold=0.3, strategy='embedding')
}

results = {}

print("🧪 Testing Semantic Chunking Strategies\n")

for name, chunker in strategies.items():
    print(f"{'='*50}")
    print(f"Strategy: {name}")
    print(f"{'='*50}")
    
    start_time = time.time()
    chunks = chunker.chunk_text(test_document)
    processing_time = time.time() - start_time
    
    results[name] = {
        'chunks': chunks,
        'count': len(chunks),
        'avg_tokens': np.mean([c['token_count'] for c in chunks]),
        'processing_time': processing_time
    }
    
    print(f"Number of chunks: {len(chunks)}")
    print(f"Average tokens per chunk: {np.mean([c['token_count'] for c in chunks]):.1f}")
    print(f"Processing time: {processing_time:.3f}s")
    
    # Show first chunk as example
    if chunks:
        print(f"\nFirst chunk preview:")
        print(f"Type: {chunks[0]['type']}")
        print(f"Tokens: {chunks[0]['token_count']}")
        print(f"Text: {chunks[0]['text'][:200]}...")
    
    print("\n")

🧪 Testing Semantic Chunking Strategies

Strategy: Sentence-based
Number of chunks: 2
Average tokens per chunk: 272.5
Processing time: 0.071s

First chunk preview:
Type: sentence-based
Tokens: 300
Text: 1. Introduction to Machine Learning Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. This field ...


Strategy: Paragraph-based
Number of chunks: 2
Average tokens per chunk: 272.5
Processing time: 0.069s

First chunk preview:
Type: paragraph-split
Tokens: 392
Text: 1. Introduction to Machine Learning Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. This field ...


Strategy: Structure-based
Number of chunks: 1
Average tokens per chunk: 546.0
Processing time: 0.000s

First chunk preview:
Type: structure-based
Tokens: 546
Text: 1. Introduction to Machine Learning Machine learning is a 

## 5. Semantic Document Q&A System

In [12]:
class SemanticQASystem:
    def __init__(self, chunking_strategy: str = 'structure', max_chunk_size: int = 400):
        self.chunker = SemanticChunker(
            max_chunk_size=max_chunk_size, 
            strategy=chunking_strategy,
            similarity_threshold=0.3
        )
        self.gemini_model = genai.GenerativeModel('gemini-1.5-flash')
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.chunks = []
        self.chunk_embeddings = None
        self.document_title = ""
        
    def load_document(self, text: str, title: str = "Document"):
        """Load and process document with semantic chunking."""
        self.document_title = title
        print(f"🔄 Processing document with {self.chunker.strategy} chunking...")
        
        # Create semantic chunks
        self.chunks = self.chunker.chunk_text(text)
        
        # Generate embeddings for semantic search
        chunk_texts = [chunk['text'] for chunk in self.chunks]
        self.chunk_embeddings = self.embedding_model.encode(chunk_texts)
        
        print(f"✅ Loaded '{title}' with {len(self.chunks)} semantic chunks")
        
        # Display chunk type distribution
        chunk_types = {}
        for chunk in self.chunks:
            chunk_type = chunk.get('type', 'unknown')
            chunk_types[chunk_type] = chunk_types.get(chunk_type, 0) + 1
        
        print(f"📊 Chunk types: {dict(chunk_types)}")
        
    def _find_relevant_chunks_semantic(self, question: str, max_chunks: int = 3) -> List[Dict]:
        """Find relevant chunks using semantic similarity."""
        if not self.chunks or self.chunk_embeddings is None:
            return []
        
        # Get question embedding
        question_embedding = self.embedding_model.encode([question])
        
        # Calculate similarities
        similarities = cosine_similarity(question_embedding, self.chunk_embeddings)[0]
        
        # Get top chunks by similarity
        top_indices = np.argsort(similarities)[::-1][:max_chunks]
        
        relevant_chunks = []
        for idx in top_indices:
            chunk = self.chunks[idx].copy()
            chunk['similarity_score'] = float(similarities[idx])
            relevant_chunks.append(chunk)
        
        return relevant_chunks
    
    def _find_relevant_chunks_hybrid(self, question: str, max_chunks: int = 3) -> List[Dict]:
        """Find relevant chunks using hybrid approach (semantic + keyword)."""
        if not self.chunks:
            return []
        
        # Semantic similarity
        semantic_chunks = self._find_relevant_chunks_semantic(question, max_chunks * 2)
        
        # Keyword matching boost
        question_words = set(question.lower().split())
        stop_words = set(stopwords.words('english'))
        question_keywords = question_words - stop_words
        
        for chunk in semantic_chunks:
            chunk_words = set(chunk['text'].lower().split())
            keyword_overlap = len(question_keywords & chunk_words) / len(question_keywords) if question_keywords else 0
            
            # Combine semantic and keyword scores
            chunk['hybrid_score'] = chunk['similarity_score'] * 0.7 + keyword_overlap * 0.3
        
        # Sort by hybrid score and return top chunks
        semantic_chunks.sort(key=lambda x: x['hybrid_score'], reverse=True)
        return semantic_chunks[:max_chunks]
    
    def answer_question(self, question: str, use_hybrid: bool = True) -> Dict:
        """Answer question using semantic understanding."""
        if not self.chunks:
            return {"error": "No document loaded"}
        
        print(f"🔍 Finding semantically relevant chunks for: {question}")
        
        # Find relevant chunks
        if use_hybrid:
            relevant_chunks = self._find_relevant_chunks_hybrid(question)
        else:
            relevant_chunks = self._find_relevant_chunks_semantic(question)
        
        if not relevant_chunks:
            return {"error": "No relevant content found"}
        
        # Prepare context with chunk metadata
        context_parts = []
        for i, chunk in enumerate(relevant_chunks, 1):
            chunk_info = f"[Chunk {i} - {chunk['type']}]"
            if 'header' in chunk and chunk['header']:
                chunk_info += f" Section: {chunk['header']}"
            context_parts.append(f"{chunk_info}\n{chunk['text']}")
        
        context = "\n\n".join(context_parts)
        
        # Generate comprehensive answer
        answer_prompt = f"""
        You are an expert document analyst. Based on the semantically relevant context from "{self.document_title}", provide a comprehensive and accurate answer.
        
        Context (from semantically matched sections):
        {context}
        
        Question: {question}
        
        Instructions:
        1. Provide a thorough answer based on the context
        2. If information spans multiple sections, synthesize it coherently
        3. Mention which sections your answer draws from when relevant
        4. If the context doesn't fully address the question, clearly state what's missing
        5. Use specific details and examples from the context
        
        Answer:
        """
        
        try:
            response = self.gemini_model.generate_content(answer_prompt)
            
            return {
                "question": question,
                "answer": response.text,
                "relevant_chunks": len(relevant_chunks),
                "chunk_details": [
                    {
                        "id": chunk['id'],
                        "type": chunk['type'],
                        "similarity": chunk.get('similarity_score', 0),
                        "hybrid_score": chunk.get('hybrid_score', 0),
                        "header": chunk.get('header', 'N/A')
                    }
                    for chunk in relevant_chunks
                ],
                "context_tokens": sum(chunk['token_count'] for chunk in relevant_chunks),
                "search_method": "hybrid" if use_hybrid else "semantic"
            }
            
        except Exception as e:
            return {"error": f"Failed to generate answer: {e}"}
    
    def analyze_document_structure(self) -> Dict:
        """Analyze the semantic structure of the loaded document."""
        if not self.chunks:
            return {"error": "No document loaded"}
        
        analysis = {
            "total_chunks": len(self.chunks),
            "chunk_types": {},
            "avg_tokens_per_chunk": np.mean([c['token_count'] for c in self.chunks]),
            "headers_found": [],
            "semantic_coherence": []
        }
        
        # Analyze chunk types
        for chunk in self.chunks:
            chunk_type = chunk.get('type', 'unknown')
            analysis['chunk_types'][chunk_type] = analysis['chunk_types'].get(chunk_type, 0) + 1
            
            if chunk.get('header'):
                analysis['headers_found'].append(chunk['header'])
            
            if 'avg_similarity' in chunk:
                analysis['semantic_coherence'].append(chunk['avg_similarity'])
        
        if analysis['semantic_coherence']:
            analysis['avg_semantic_coherence'] = np.mean(analysis['semantic_coherence'])
        
        return analysis

print("✅ SemanticQASystem class implemented!")

✅ SemanticQASystem class implemented!


## 6. Loading and Testing with Research Document

In [13]:
# Extended research document for testing
research_document = """
Advances in Natural Language Processing: From Statistical Methods to Large Language Models

Abstract

Natural Language Processing (NLP) has undergone dramatic transformation over the past decades. This paper reviews the evolution from statistical methods to modern transformer-based large language models, examining key breakthroughs, current capabilities, and future directions.

1. Introduction

Natural Language Processing represents the intersection of computational linguistics, artificial intelligence, and computer science. The field aims to enable computers to understand, interpret, and generate human language in ways that are both meaningful and useful.

The journey from rule-based systems to statistical methods and eventually to neural networks reflects broader trends in artificial intelligence. Each paradigm shift has brought new capabilities and applications, fundamentally changing how we interact with technology.

Modern NLP systems can translate languages, summarize documents, answer questions, and even engage in creative writing. These achievements stem from decades of research and the availability of large-scale computational resources.

2. Historical Development

2.1 Rule-Based Systems (1950s-1980s)

Early NLP systems relied on hand-crafted rules and linguistic knowledge. These systems used grammatical rules, dictionaries, and semantic networks to process language. While limited in scope, they provided important foundations for understanding language structure.

Rule-based approaches excelled in narrow domains but struggled with ambiguity and linguistic variation. The complexity of natural language made it impossible to capture all rules manually, leading researchers to explore data-driven approaches.

2.2 Statistical Methods (1990s-2000s)

The statistical revolution in NLP introduced probabilistic models and machine learning techniques. Hidden Markov Models, Conditional Random Fields, and Support Vector Machines became standard tools for tasks like part-of-speech tagging and named entity recognition.

Statistical methods enabled systems to learn from data rather than relying solely on hand-crafted rules. This approach proved more robust to linguistic variation and could handle previously unseen text more effectively.

2.3 Neural Networks and Deep Learning (2010s)

The introduction of neural networks marked another paradigm shift. Word embeddings like Word2Vec and GloVe provided dense vector representations of words, capturing semantic relationships in continuous space.

Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks addressed sequential processing challenges. These architectures could maintain context over longer sequences, improving performance on tasks requiring understanding of sentence-level dependencies.

3. The Transformer Revolution

3.1 Attention Mechanisms

The attention mechanism revolutionized sequence modeling by allowing models to focus on relevant parts of the input. Unlike RNNs, attention enables parallel processing and better handling of long-range dependencies.

Self-attention mechanisms compute relationships between all positions in a sequence simultaneously. This approach captures complex patterns that sequential models often miss, leading to significant improvements in translation and text understanding tasks.

3.2 Transformer Architecture

The Transformer architecture, introduced in "Attention Is All You Need," replaced recurrence with pure attention mechanisms. This design enables efficient parallel training and better scaling to large datasets.

Transformers consist of encoder and decoder stacks with multi-head attention and feed-forward networks. The architecture's modularity and effectiveness led to its adoption across numerous NLP tasks.

3.3 Pre-training and Transfer Learning

BERT introduced bidirectional pre-training, learning representations from both left and right context. This approach significantly improved performance on downstream tasks through transfer learning.

GPT models demonstrated the power of autoregressive language modeling for text generation. These models showed that large-scale pre-training on diverse text corpora could produce remarkably fluent and coherent text.

4. Large Language Models

4.1 Scaling Laws and Emergent Abilities

Research has revealed scaling laws governing language model performance. Increasing model size, training data, and computational resources leads to predictable improvements in capabilities.

Large models exhibit emergent abilities not present in smaller versions. These include few-shot learning, chain-of-thought reasoning, and the ability to follow complex instructions without explicit training.

4.2 Current State-of-the-Art Models

Modern large language models like GPT-4, PaLM, and Claude demonstrate remarkable capabilities across diverse tasks. These models can engage in dialogue, solve reasoning problems, write code, and perform creative tasks.

The integration of multimodal capabilities allows models to process text, images, and other data types together. This convergence opens new possibilities for AI applications across domains.

5. Applications and Impact

5.1 Information Retrieval and Question Answering

Modern NLP systems excel at information retrieval and question answering. They can understand complex queries, search through vast document collections, and provide accurate, contextual answers.

Retrieval-augmented generation combines language models with external knowledge bases. This approach enables systems to access up-to-date information while maintaining the fluency of pre-trained models.

5.2 Content Generation and Creative Applications

Language models have revolutionized content creation. They can write articles, stories, poetry, and technical documentation with human-like quality. These capabilities are transforming publishing, marketing, and educational content creation.

Creative applications include story generation, dialogue systems, and artistic collaboration. AI can now assist human creators in brainstorming, drafting, and refining creative works.

5.3 Code Generation and Programming Assistance

Code generation represents a breakthrough application of language models. Systems like GitHub Copilot and CodeT5 can generate, complete, and explain code across multiple programming languages.

These tools are transforming software development by automating routine coding tasks and helping developers learn new technologies. The integration of AI into development environments is becoming increasingly common.

6. Challenges and Limitations

6.1 Computational Requirements

Training and deploying large language models requires enormous computational resources. The environmental impact and cost of these systems raise concerns about accessibility and sustainability.

Efforts to improve efficiency include model compression, distillation, and hardware optimization. However, the tension between model capability and computational efficiency remains a significant challenge.

6.2 Bias and Fairness

Language models can perpetuate and amplify biases present in training data. These biases can manifest in generated text, affecting fairness and representation across different groups.

Addressing bias requires careful dataset curation, bias detection methods, and mitigation strategies. The challenge is balancing bias reduction with maintaining model performance and capabilities.

6.3 Hallucination and Factual Accuracy

Large language models can generate plausible but incorrect information, a phenomenon known as hallucination. This limitation poses risks for applications requiring factual accuracy.

Improving factual accuracy requires better training methods, fact-checking mechanisms, and uncertainty quantification. Research continues to address these fundamental limitations.

7. Future Directions

7.1 Multimodal Integration

The future of NLP involves deeper integration with other modalities. Vision-language models and speech-text systems represent early steps toward more comprehensive AI understanding.

Multimodal models can understand images, videos, and audio in context with text. This integration enables richer applications and more natural human-AI interaction.

7.2 Reasoning and Planning

Enhancing reasoning capabilities remains a key research direction. Current models show promising reasoning abilities but lack the systematic problem-solving approaches of symbolic AI.

Combining neural networks with symbolic reasoning could lead to more robust and interpretable AI systems. This hybrid approach might address current limitations in logical reasoning and planning.

8. Conclusion

Natural Language Processing has evolved from simple rule-based systems to sophisticated neural networks capable of human-like language understanding and generation. The transformer architecture and large-scale pre-training have driven remarkable progress.

Current challenges include computational efficiency, bias mitigation, and improving factual accuracy. Future research directions focus on multimodal integration and enhanced reasoning capabilities.

As NLP continues to advance, its impact on society will grow. Understanding both the capabilities and limitations of these systems is crucial for responsible development and deployment.
"""

# Initialize semantic Q&A system
semantic_qa = SemanticQASystem(chunking_strategy='structure', max_chunk_size=400)
semantic_qa.load_document(research_document, "NLP Research Paper")

# Analyze document structure
structure_analysis = semantic_qa.analyze_document_structure()
print(f"\n📊 Document Structure Analysis:")
for key, value in structure_analysis.items():
    if key != 'semantic_coherence':  # Skip raw coherence data
        print(f"  {key}: {value}")

🔄 Processing document with structure chunking...
✅ Loaded 'NLP Research Paper' with 1 semantic chunks
📊 Chunk types: {'structure-based': 1}

📊 Document Structure Analysis:
  total_chunks: 1
  chunk_types: {'structure-based': 1}
  avg_tokens_per_chunk: 1532.0
  headers_found: []


## 7. Testing Semantic Q&A Performance

In [None]:
# Comprehensive test questions targeting different aspects
test_questions = [
    "What is the Transformer architecture and why was it revolutionary?",
    "How did NLP evolve from rule-based systems to neural networks?",
    "What are the main challenges with large language models?",
    "Explain the concept of attention mechanisms in neural networks",
    "What applications has NLP found in code generation?",
    "What are the future directions for NLP research?"
]

print("🧠 Testing Semantic Q&A System\n")

for i, question in enumerate(test_questions[:3], 1):  # Test first 3 questions
    print(f"{'='*70}")
    print(f"Question {i}: {question}")
    print(f"{'='*70}")
    
    # Test both semantic and hybrid approaches
    result_hybrid = semantic_qa.answer_question(question, use_hybrid=True)
    
    if "error" in result_hybrid:
        print(f"❌ Error: {result_hybrid['error']}")
    else:
        print(f"\n🎯 Answer (Hybrid Search):")
        print(result_hybrid["answer"])
        
        print(f"\n📈 Search Details:")
        print(f"  - Method: {result_hybrid['search_method']}")
        print(f"  - Relevant chunks: {result_hybrid['relevant_chunks']}")
        print(f"  - Context tokens: {result_hybrid['context_tokens']}")
        
        print(f"\n📚 Chunk Details:")
        for detail in result_hybrid['chunk_details']:
            print(f"  - Chunk {detail['id']}: {detail['type']} (Similarity: {detail['similarity']:.3f}, Header: {detail['header']})")
    
    print("\n" + "-"*70 + "\n")
    time.sleep(1)  # Rate limiting

## 8. Comparison: Semantic vs Fixed-Size Chunking

In [None]:
# Compare semantic chunking with fixed-size chunking
from typing import Dict, Any

class FixedSizeChunker:
    """Simple fixed-size chunker for comparison."""
    
    def __init__(self, chunk_size: int = 400, overlap: int = 50):
        self.chunk_size = chunk_size
        self.overlap = overlap
        self.tokenizer = tiktoken.get_encoding("cl100k_base")
    
    def chunk_text(self, text: str) -> List[Dict]:
        """Create fixed-size chunks."""
        tokens = self.tokenizer.encode(text)
        chunks = []
        start = 0
        chunk_id = 0
        
        while start < len(tokens):
            end = min(start + self.chunk_size, len(tokens))
            chunk_tokens = tokens[start:end]
            chunk_text = self.tokenizer.decode(chunk_tokens)
            
            chunks.append({
                'id': chunk_id,
                'text': chunk_text,
                'type': 'fixed-size',
                'token_count': len(chunk_tokens),
                'start_token': start,
                'end_token': end
            })
            
            start = end - self.overlap
            chunk_id += 1
            
            if end >= len(tokens):
                break
        
        return chunks

def compare_chunking_methods(text: str) -> Dict[str, Any]:
    """Compare different chunking methods."""
    
    # Initialize chunkers
    fixed_chunker = FixedSizeChunker(chunk_size=400, overlap=50)
    semantic_chunkers = {
        'Semantic-Sentence': SemanticChunker(max_chunk_size=400, strategy='sentence'),
        'Semantic-Paragraph': SemanticChunker(max_chunk_size=400, strategy='paragraph'),
        'Semantic-Structure': SemanticChunker(max_chunk_size=400, strategy='structure'),
        'Semantic-Embedding': SemanticChunker(max_chunk_size=400, strategy='embedding')
    }
    
    results = {}
    
    # Test fixed-size chunking
    start_time = time.time()
    fixed_chunks = fixed_chunker.chunk_text(text)
    fixed_time = time.time() - start_time
    
    results['Fixed-Size'] = {
        'chunks': fixed_chunks,
        'count': len(fixed_chunks),
        'avg_tokens': np.mean([c['token_count'] for c in fixed_chunks]),
        'std_tokens': np.std([c['token_count'] for c in fixed_chunks]),
        'processing_time': fixed_time,
        'strategy': 'fixed-size'
    }
    
    # Test semantic chunking methods
    for name, chunker in semantic_chunkers.items():
        start_time = time.time()
        chunks = chunker.chunk_text(text)
        processing_time = time.time() - start_time
        
        results[name] = {
            'chunks': chunks,
            'count': len(chunks),
            'avg_tokens': np.mean([c['token_count'] for c in chunks]),
            'std_tokens': np.std([c['token_count'] for c in chunks]),
            'processing_time': processing_time,
            'strategy': chunker.strategy
        }
    
    return results

# Run comparison
print("⚖️ Comparing Chunking Methods\n")
comparison_results = compare_chunking_methods(research_document)

# Create comparison table
comparison_df = pd.DataFrame([
    {
        'Method': method,
        'Chunk Count': data['count'],
        'Avg Tokens': f"{data['avg_tokens']:.1f}",
        'Token Std': f"{data['std_tokens']:.1f}",
        'Processing Time (s)': f"{data['processing_time']:.3f}",
        'Strategy': data['strategy']
    }
    for method, data in comparison_results.items()
])

print(comparison_df.to_string(index=False))

# Visualize comparison
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Chunking Method Comparison', fontsize=16)

methods = list(comparison_results.keys())
chunk_counts = [comparison_results[m]['count'] for m in methods]
avg_tokens = [comparison_results[m]['avg_tokens'] for m in methods]
std_tokens = [comparison_results[m]['std_tokens'] for m in methods]
proc_times = [comparison_results[m]['processing_time'] for m in methods]

# Chunk count comparison
ax1.bar(methods, chunk_counts, alpha=0.7)
ax1.set_title('Number of Chunks')
ax1.set_ylabel('Count')
ax1.tick_params(axis='x', rotation=45)

# Average tokens per chunk
ax2.bar(methods, avg_tokens, alpha=0.7, color='orange')
ax2.set_title('Average Tokens per Chunk')
ax2.set_ylabel('Tokens')
ax2.tick_params(axis='x', rotation=45)

# Token standard deviation (consistency)
ax3.bar(methods, std_tokens, alpha=0.7, color='green')
ax3.set_title('Token Count Standard Deviation')
ax3.set_ylabel('Std Deviation')
ax3.tick_params(axis='x', rotation=45)

# Processing time
ax4.bar(methods, proc_times, alpha=0.7, color='red')
ax4.set_title('Processing Time')
ax4.set_ylabel('Seconds')
ax4.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\n📊 Key Insights:")
print("• Semantic chunking generally produces fewer, more coherent chunks")
print("• Structure-based chunking respects document organization")
print("• Embedding-based chunking groups semantically similar content")
print("• Fixed-size chunking is fastest but may break semantic boundaries")

## 9. Advanced Semantic Analysis

In [None]:
def analyze_semantic_coherence(chunks: List[Dict], embedding_model) -> Dict:
    """Analyze semantic coherence within and between chunks."""
    
    if len(chunks) < 2:
        return {"error": "Need at least 2 chunks for analysis"}
    
    # Get embeddings for all chunks
    chunk_texts = [chunk['text'] for chunk in chunks]
    embeddings = embedding_model.encode(chunk_texts)
    
    # Calculate pairwise similarities
    similarity_matrix = cosine_similarity(embeddings)
    
    # Analyze coherence metrics
    coherence_metrics = {
        'avg_similarity': np.mean(similarity_matrix[np.triu_indices_from(similarity_matrix, k=1)]),
        'similarity_std': np.std(similarity_matrix[np.triu_indices_from(similarity_matrix, k=1)]),
        'max_similarity': np.max(similarity_matrix[np.triu_indices_from(similarity_matrix, k=1)]),
        'min_similarity': np.min(similarity_matrix[np.triu_indices_from(similarity_matrix, k=1)]),
        'similarity_matrix': similarity_matrix
    }
    
    # Find most and least similar chunk pairs
    triu_indices = np.triu_indices_from(similarity_matrix, k=1)
    similarities = similarity_matrix[triu_indices]
    
    max_idx = np.argmax(similarities)
    min_idx = np.argmin(similarities)
    
    most_similar_pair = (triu_indices[0][max_idx], triu_indices[1][max_idx])
    least_similar_pair = (triu_indices[0][min_idx], triu_indices[1][min_idx])
    
    coherence_metrics['most_similar_chunks'] = {
        'indices': most_similar_pair,
        'similarity': similarities[max_idx],
        'chunk1_preview': chunks[most_similar_pair[0]]['text'][:100] + '...',
        'chunk2_preview': chunks[most_similar_pair[1]]['text'][:100] + '...'
    }
    
    coherence_metrics['least_similar_chunks'] = {
        'indices': least_similar_pair,
        'similarity': similarities[min_idx],
        'chunk1_preview': chunks[least_similar_pair[0]]['text'][:100] + '...',
        'chunk2_preview': chunks[least_similar_pair[1]]['text'][:100] + '...'
    }
    
    return coherence_metrics

def visualize_chunk_similarity(chunks: List[Dict], title: str = "Chunk Similarity Matrix"):
    """Visualize semantic similarity between chunks."""
    
    if len(chunks) < 2:
        print("Need at least 2 chunks for visualization")
        return
    
    # Get embeddings and similarity matrix
    chunk_texts = [chunk['text'] for chunk in chunks]
    embeddings = embedding_model.encode(chunk_texts)
    similarity_matrix = cosine_similarity(embeddings)
    
    # Create heatmap
    plt.figure(figsize=(10, 8))
    
    # Create labels with chunk info
    labels = []
    for i, chunk in enumerate(chunks):
        chunk_type = chunk.get('type', 'unknown')
        header = chunk.get('header', '')
        if header:
            label = f"{i}: {header[:20]}..."
        else:
            label = f"{i}: {chunk_type}"
        labels.append(label)
    
    sns.heatmap(similarity_matrix, 
                annot=True, 
                fmt='.3f',
                cmap='viridis',
                xticklabels=labels,
                yticklabels=labels,
                cbar_kws={'label': 'Cosine Similarity'})
    
    plt.title(title)
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()

# Analyze semantic coherence for different chunking strategies
print("🔬 Advanced Semantic Analysis\n")

for method_name, result_data in list(comparison_results.items())[:3]:  # Analyze first 3 methods
    print(f"\n{'='*50}")
    print(f"Analyzing: {method_name}")
    print(f"{'='*50}")
    
    chunks = result_data['chunks']
    coherence = analyze_semantic_coherence(chunks, embedding_model)
    
    if "error" not in coherence:
        print(f"Average chunk similarity: {coherence['avg_similarity']:.3f}")
        print(f"Similarity std deviation: {coherence['similarity_std']:.3f}")
        print(f"Similarity range: {coherence['min_similarity']:.3f} - {coherence['max_similarity']:.3f}")
        
        print(f"\nMost similar chunks (similarity: {coherence['most_similar_chunks']['similarity']:.3f}):")
        print(f"  Chunk {coherence['most_similar_chunks']['indices'][0]}: {coherence['most_similar_chunks']['chunk1_preview']}")
        print(f"  Chunk {coherence['most_similar_chunks']['indices'][1]}: {coherence['most_similar_chunks']['chunk2_preview']}")
        
        # Visualize similarity matrix for structure-based chunking
        if 'Structure' in method_name:
            visualize_chunk_similarity(chunks, f"Semantic Similarity: {method_name}")
    
    time.sleep(0.5)  # Brief pause between analyses

## 10. Interactive Semantic Q&A Demo

In [None]:
def interactive_semantic_qa():
    """Interactive demo of semantic Q&A system."""
    print("🧠 Interactive Semantic Q&A Demo")
    print("Ask questions about the NLP research document!")
    print("Features: Semantic chunk retrieval, hybrid search, context awareness")
    print("Type 'quit' to exit, 'stats' for document statistics\n")
    
    while True:
        question = input("❓ Your question: ").strip()
        
        if question.lower() in ['quit', 'exit', 'q']:
            print("👋 Thank you for using the Semantic Q&A system!")
            break
        
        if question.lower() == 'stats':
            stats = semantic_qa.analyze_document_structure()
            print("\n📊 Document Statistics:")
            for key, value in stats.items():
                if key != 'semantic_coherence':
                    print(f"  {key}: {value}")
            print()
            continue
        
        if not question:
            continue
        
        print(f"\n🔍 Processing semantic query: {question}")
        
        # Get both semantic and hybrid results
        result_semantic = semantic_qa.answer_question(question, use_hybrid=False)
        result_hybrid = semantic_qa.answer_question(question, use_hybrid=True)
        
        if "error" in result_hybrid:
            print(f"❌ Error: {result_hybrid['error']}")
        else:
            print(f"\n🎯 Answer (Hybrid Semantic Search):")
            print(result_hybrid["answer"])
            
            print(f"\n📊 Search Analysis:")
            print(f"  - Chunks used: {result_hybrid['relevant_chunks']}")
            print(f"  - Context tokens: {result_hybrid['context_tokens']}")
            print(f"  - Search method: {result_hybrid['search_method']}")
            
            print(f"\n🧩 Relevant Chunks:")
            for detail in result_hybrid['chunk_details']:
                print(f"  • Chunk {detail['id']}: {detail['type']} | Sim: {detail['similarity']:.3f} | Hybrid: {detail['hybrid_score']:.3f}")
                if detail['header'] != 'N/A':
                    print(f"    Section: {detail['header']}")
        
        print("\n" + "-"*70 + "\n")

# Uncomment to run interactive demo
# interactive_semantic_qa()

## 11. Best Practices for Semantic Chunking

In [None]:
def semantic_chunking_best_practices():
    """Display best practices for semantic chunking."""
    
    practices = {
        "🎯 Strategy Selection": [
            "• Use sentence-based for general text processing",
            "• Use paragraph-based for well-structured documents",
            "• Use structure-based for academic papers and reports",
            "• Use embedding-based for maximum semantic coherence",
            "• Consider hybrid approaches for complex documents"
        ],
        
        "📏 Size Optimization": [
            "• Balance semantic coherence with computational efficiency",
            "• Allow variable chunk sizes within reasonable bounds",
            "• Set minimum sizes to ensure meaningful content",
            "• Consider context window limits of your target model",
            "• Test different size parameters for your use case"
        ],
        
        "🔗 Context Preservation": [
            "• Maintain document structure information",
            "• Preserve headers and section boundaries",
            "• Include metadata about chunk relationships",
            "• Consider hierarchical chunking for complex documents",
            "• Track semantic similarity between adjacent chunks"
        ],
        
        "⚡ Performance Considerations": [
            "• Cache embeddings when processing multiple queries",
            "• Use efficient embedding models for real-time applications",
            "• Consider preprocessing documents in batches",
            "• Optimize similarity thresholds based on your domain",
            "• Monitor processing time vs. quality trade-offs"
        ],
        
        "🔍 Quality Assurance": [
            "• Validate chunk coherence with similarity metrics",
            "• Test retrieval quality on representative queries",
            "• Compare with simpler chunking methods as baselines",
            "• Monitor for edge cases and boundary issues",
            "• Regularly evaluate against your specific use case"
        ],
        
        "🛠️ Implementation Tips": [
            "• Start with paragraph-based chunking for most applications",
            "• Use spaCy or NLTK for robust text preprocessing",
            "• Implement fallback strategies for edge cases",
            "• Consider domain-specific adaptations",
            "• Document your chunking strategy and parameters"
        ]
    }
    
    print("📚 Semantic Chunking Best Practices\n")
    
    for category, tips in practices.items():
        print(f"{category}")
        for tip in tips:
            print(f"  {tip}")
        print()

semantic_chunking_best_practices()

# Summary comparison table
print("\n📊 Semantic vs Fixed-Size Chunking Summary:")
comparison_table = pd.DataFrame([
    {
        'Aspect': 'Semantic Coherence',
        'Semantic Chunking': 'High - preserves meaning',
        'Fixed-Size Chunking': 'Variable - may break context'
    },
    {
        'Aspect': 'Processing Speed',
        'Semantic Chunking': 'Slower - requires NLP processing',
        'Fixed-Size Chunking': 'Fast - simple token counting'
    },
    {
        'Aspect': 'Chunk Size Consistency',
        'Semantic Chunking': 'Variable - adapts to content',
        'Fixed-Size Chunking': 'Consistent - predictable sizes'
    },
    {
        'Aspect': 'Document Structure',
        'Semantic Chunking': 'Preserves - respects boundaries',
        'Fixed-Size Chunking': 'Ignores - arbitrary splits'
    },
    {
        'Aspect': 'Implementation Complexity',
        'Semantic Chunking': 'Complex - multiple strategies',
        'Fixed-Size Chunking': 'Simple - straightforward'
    },
    {
        'Aspect': 'Best Use Cases',
        'Semantic Chunking': 'Q&A, analysis, comprehension',
        'Fixed-Size Chunking': 'Batch processing, simple tasks'
    }
])

print(comparison_table.to_string(index=False))

## 12. Conclusion and Future Extensions

In this notebook, we've explored semantic chunking techniques and built an intelligent document Q&A system using Google Gemini. 

### Key Achievements:
1. **Implemented multiple semantic chunking strategies** preserving document structure and meaning
2. **Built a sophisticated Q&A system** using semantic similarity and hybrid search
3. **Compared semantic vs fixed-size approaches** with comprehensive metrics
4. **Analyzed semantic coherence** and chunk quality across different methods
5. **Demonstrated real-world applications** with complex research documents

### Semantic Chunking Advantages:
- **🧠 Semantic Coherence**: Preserves meaning and context boundaries
- **📚 Structure Awareness**: Respects document organization and hierarchy
- **🔍 Better Retrieval**: Improves relevance of retrieved content
- **💡 Contextual Understanding**: Maintains logical flow and relationships
- **🎯 Adaptive Sizing**: Adjusts chunk boundaries to content naturally

### Next Steps:
- **Hierarchical Chunking**: Implement multi-level document analysis
- **Domain Adaptation**: Customize chunking for specific fields (legal, medical, technical)
- **Dynamic Chunking**: Adapt strategies based on query types
- **Multimodal Integration**: Handle documents with images, tables, and figures
- **Real-time Processing**: Optimize for streaming and incremental updates

### Production Considerations:
- **Caching**: Store embeddings and processed chunks for efficiency
- **Scalability**: Implement batch processing for large document collections
- **Monitoring**: Track chunk quality and retrieval performance
- **A/B Testing**: Compare chunking strategies for your specific use case
- **User Feedback**: Incorporate user satisfaction metrics

Semantic chunking represents a significant advance over fixed-size approaches, offering better context preservation and improved performance for understanding-based tasks. The choice of strategy should be guided by your specific requirements, document types, and performance constraints.

**Happy semantic chunking!** 🚀🧠