# Hierarchical Chunking with Google Gemini
## Building Multi-Level Document Understanding Systems

This notebook explores hierarchical chunking techniques that create multi-level document representations, preserving both fine-grained details and high-level structure. We'll build intelligent document analysis systems that can reason across different levels of abstraction using Google Gemini.

### What You'll Learn:
- Understanding hierarchical document structure and multi-level chunking
- Implementing document parsing for headers, sections, and subsections
- Building tree-based chunk hierarchies with parent-child relationships
- Creating adaptive retrieval systems that leverage document structure
- Developing context propagation across hierarchy levels
- Analyzing performance benefits of hierarchical approaches

### Project Overview:
We'll create an advanced system that:
1. Analyzes document structure to identify hierarchical elements
2. Creates multi-level chunks with preserved relationships
3. Implements intelligent retrieval across different abstraction levels
4. Builds context-aware Q&A that leverages document hierarchy
5. Provides comprehensive analysis and visualization tools

## 1. Setup and Dependencies

In [None]:
# Install required packages
!pip install google-generativeai sentence-transformers spacy nltk scikit-learn numpy pandas matplotlib seaborn tiktoken networkx anytree

In [None]:
# Download additional dependencies
!python -m spacy download en_core_web_sm
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

In [None]:
import google.generativeai as genai
from sentence_transformers import SentenceTransformer
import spacy
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import tiktoken
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from anytree import Node, RenderTree, PreOrderIter, LevelOrderIter
from anytree.exporter import DotExporter
import re
import os
import time
from typing import List, Dict, Tuple, Optional, Union, Any
from collections import defaultdict, deque
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

In [None]:
# Configure Gemini API and models
# Read the key from the environment; the placeholder is only a fallback
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "your-gemini-api-key-here")
genai.configure(api_key=GEMINI_API_KEY)

# Initialize models
gemini_model = genai.GenerativeModel('gemini-pro')
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
nlp = spacy.load('en_core_web_sm')
tokenizer = tiktoken.get_encoding("cl100k_base")

print("✅ All models initialized successfully!")
print(f"📊 Embedding dimensions: {embedding_model.get_sentence_embedding_dimension()}")
print(f"🧠 spaCy model: {nlp.meta['name']} v{nlp.meta['version']}")
print(f"🌳 Tree structure library: anytree")

## 2. Understanding Hierarchical Document Structure

Hierarchical chunking preserves document structure by creating chunks at multiple levels of granularity, from high-level sections down to individual sentences.
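
As a minimal sketch of this idea, independent of the classes built later in this notebook, the snippet below splits a tiny text into three levels (section, paragraph, sentence) and records each unit's parent. The helper name `build_levels`, the id scheme, and the naive period-based sentence split are illustrative assumptions, not the notebook's implementation.

```python
from typing import Dict, List


def build_levels(section_title: str, body: str) -> List[Dict]:
    """Return flat records for one section, its paragraphs, and their sentences."""
    records = [{"id": "s0", "level": 0, "parent": None, "text": section_title}]
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    for p_idx, paragraph in enumerate(paragraphs):
        p_id = f"s0.p{p_idx}"
        records.append({"id": p_id, "level": 1, "parent": "s0", "text": paragraph})
        # Naive sentence split on ". "; a real pipeline would use nltk or spaCy.
        sentences = [s.strip() for s in paragraph.split(". ") if s.strip()]
        for s_idx, sentence in enumerate(sentences):
            records.append({
                "id": f"{p_id}.t{s_idx}",
                "level": 2,
                "parent": p_id,
                "text": sentence,
            })
    return records


demo = build_levels(
    "Types of Learning",
    "Supervised learning uses labels. It predicts outcomes.\n\n"
    "Unsupervised learning finds structure.",
)
for rec in demo:
    # Indent by level so the parent-child nesting is visible
    print("  " * rec["level"] + f"[{rec['id']}] {rec['text']}")
```

Every record keeps a pointer to its parent, so a retriever can return a single sentence and still recover the paragraph or section it came from.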

In [None]:
def count_tokens(text: str) -> int:
    """Count tokens in text using tiktoken."""
    return len(tokenizer.encode(text))

def demonstrate_hierarchical_concept():
    """Demonstrate hierarchical document structure concepts."""
    
    sample_document = """
# Machine Learning Fundamentals

## 1. Introduction

Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. This field has revolutionized how we approach complex problems across various domains.

### 1.1 Key Concepts

The core principle of machine learning lies in pattern recognition. Algorithms analyze large datasets to identify patterns and relationships that humans might miss or find too complex to detect manually.

### 1.2 Applications

Machine learning applications span numerous industries including healthcare, finance, transportation, and entertainment. Each domain presents unique challenges and opportunities for ML implementation.

## 2. Types of Learning

### 2.1 Supervised Learning

Supervised learning uses labeled training data to teach algorithms to predict outcomes. The algorithm learns from input-output pairs and can then make predictions on new, unseen data.

Common supervised learning tasks include classification and regression. Classification predicts discrete categories, while regression predicts continuous numerical values.

### 2.2 Unsupervised Learning

Unsupervised learning finds hidden patterns in data without labeled examples. The algorithm must discover structure in the data independently.

Clustering and dimensionality reduction are popular unsupervised learning techniques. These methods help understand data structure and relationships.
    """.strip()
    
    print("🌳 Hierarchical Document Structure Demonstration\n")
    print("Sample Document:")
    print(sample_document)
    print(f"\nTotal tokens: {count_tokens(sample_document)}")
    
    # Identify hierarchical elements
    lines = sample_document.split('\n')
    hierarchy_levels = {
        'Title (Level 0)': [],
        'Main Sections (Level 1)': [],
        'Subsections (Level 2)': [],
        'Content Paragraphs': []
    }
    
    for line in lines:
        line = line.strip()
        if line.startswith('# ') and not line.startswith('## '):
            hierarchy_levels['Title (Level 0)'].append(line[2:].strip())
        elif line.startswith('## '):
            hierarchy_levels['Main Sections (Level 1)'].append(line[3:].strip())
        elif line.startswith('### '):
            hierarchy_levels['Subsections (Level 2)'].append(line[4:].strip())
        elif line and not line.startswith('#'):
            if len(line) > 20:  # Filter out short lines
                hierarchy_levels['Content Paragraphs'].append(line[:60] + '...')
    
    print(f"\n🎯 Identified Hierarchical Structure:")
    for level, items in hierarchy_levels.items():
        print(f"\n{level}:")
        for i, item in enumerate(items[:3], 1):  # Show first 3 items
            print(f"  {i}. {item}")
        if len(items) > 3:
            print(f"  ... and {len(items) - 3} more")
    
    print(f"\n📊 Hierarchy Benefits:")
    print(f"  ✅ Preserves document structure and organization")
    print(f"  ✅ Enables multi-level retrieval (sections vs details)")
    print(f"  ✅ Maintains context relationships between levels")
    print(f"  ✅ Supports both broad and specific queries")
    print(f"  ✅ Facilitates navigation and summarization")

demonstrate_hierarchical_concept()
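
The prefix checks in the demonstration above handle three fixed header depths. As a hedged alternative, a single regular expression that captures the run of `#` characters yields the depth for any Markdown header. `parse_header` and `HEADER_RE` are hypothetical names used only for this sketch; note that the depth is 1-based here, so depth 1 corresponds to the "Title (Level 0)" bucket in the demo.

```python
import re
from typing import Optional, Tuple

# One to six '#' characters, at least one space, then a non-empty title
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def parse_header(line: str) -> Optional[Tuple[int, str]]:
    """Return (depth, title) for a Markdown header line, else None."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    hashes, title = match.groups()
    return len(hashes), title


for sample in [
    "# Machine Learning Fundamentals",
    "### 1.1 Key Concepts",
    "#hashtag",
    "Plain text",
]:
    print(repr(sample), "->", parse_header(sample))
```

Requiring whitespace after the `#` run means a line such as `#hashtag` is not mistaken for a header.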

## 3. Implementing Hierarchical Chunker

In [None]:
class HierarchicalChunker:
    def __init__(self, max_chunk_size: int = 512, min_chunk_size: int = 50):
        """
        Hierarchical chunker that creates multi-level document representations.
        
        Args:
            max_chunk_size: Maximum tokens per chunk
            min_chunk_size: Minimum tokens per chunk
        """
        self.max_chunk_size = max_chunk_size
        self.min_chunk_size = min_chunk_size
        self.tokenizer = tiktoken.get_encoding("cl100k_base")
        self.nlp = spacy.load('en_core_web_sm')
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        
    def _count_tokens(self, text: str) -> int:
        """Count tokens in text."""
        return len(self.tokenizer.encode(text))
    
    def _identify_headers(self, text: str) -> List[Dict]:
        """Identify headers and their hierarchy levels."""
        lines = text.split('\n')
        headers = []
        
        for i, line in enumerate(lines):
            line = line.strip()
            
            # Markdown-style headers
            if line.startswith('#'):
                level = len(line) - len(line.lstrip('#'))
                title = line.lstrip('#').strip()
                headers.append({
                    'line_number': i,
                    'level': level,
                    'title': title,
                    'text': line,
                    'type': 'markdown_header'
                })
            
            # Numbered headers (1., 1.1, 2.1.3, etc.)
            elif (numbered := re.match(r'^((?:\d+\.)+\d*)\s+(\S.*)$', line)):
                numbering, title = numbered.groups()
                # Depth is the number of numeric components: "1." -> 1, "2.1" -> 2
                level = len([p for p in numbering.split('.') if p])
                headers.append({
                    'line_number': i,
                    'level': level,
                    'title': title.strip(),
                    'text': line,
                    'type': 'numbered_header'
                })
            
            # Pattern-based headers (ALL CAPS, etc.)
            elif (line.isupper() and len(line.split()) <= 8 and 
                  len(line) > 5 and not re.search(r'[.!?]$', line)):
                headers.append({
                    'line_number': i,
                    'level': 1,  # Default level for pattern headers
                    'title': line,
                    'text': line,
                    'type': 'caps_header'
                })
        
        return headers
    
    def _extract_sections(self, text: str, headers: List[Dict]) -> List[Dict]:
        """Extract document sections based on headers."""
        lines = text.split('\n')
        sections = []
        
        if not headers:
            # No headers found, treat entire document as one section
            return [{
                'id': 0,
                'level': 0,
                'title': 'Document',
                'content': text,
                'start_line': 0,
                'end_line': len(lines) - 1,
                'tokens': self._count_tokens(text),
                'parent_id': None
            }]
        
        # Process sections between headers
        for i, header in enumerate(headers):
            start_line = header['line_number']
            
            # Find end line (next header or end of document)
            if i + 1 < len(headers):
                end_line = headers[i + 1]['line_number'] - 1
            else:
                end_line = len(lines) - 1
            
            # Extract section content
            section_lines = lines[start_line:end_line + 1]
            content = '\n'.join(section_lines).strip()
            
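            # Keep every non-empty section, including header-only ones: children
            # reference their parent by id, so dropping a short parent section
            # would orphan them and flatten the tree.
            if content: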
                # Find parent section
                parent_id = None
                for j in range(i - 1, -1, -1):
                    if headers[j]['level'] < header['level']:
                        parent_id = j
                        break
                
                sections.append({
                    'id': i,
                    'level': header['level'],
                    'title': header['title'],
                    'content': content,
                    'start_line': start_line,
                    'end_line': end_line,
                    'tokens': self._count_tokens(content),
                    'parent_id': parent_id,
                    'header_type': header['type']
                })
        
        return sections
    
    def _create_subsection_chunks(self, section: Dict) -> List[Dict]:
        """Create smaller chunks within a section if needed."""
        if section['tokens'] <= self.max_chunk_size:
            return [section]  # Section is small enough
        
        # Split section into paragraphs
        paragraphs = [p.strip() for p in section['content'].split('\n\n') if p.strip()]
        
        chunks = []
        current_chunk = []
        current_tokens = 0
        chunk_id = 0
        
        for paragraph in paragraphs:
            para_tokens = self._count_tokens(paragraph)
            
            if current_tokens + para_tokens > self.max_chunk_size and current_chunk:
                # Create chunk from current paragraphs
                chunk_content = '\n\n'.join(current_chunk)
                chunks.append({
                    'id': f"{section['id']}.{chunk_id}",
                    'level': section['level'] + 1,
                    'title': f"{section['title']} (Part {chunk_id + 1})",
                    'content': chunk_content,
                    'tokens': current_tokens,
                    'parent_id': section['id'],
                    'type': 'subsection_chunk'
                })
                
                # Start new chunk
                current_chunk = [paragraph]
                current_tokens = para_tokens
                chunk_id += 1
            else:
                current_chunk.append(paragraph)
                current_tokens += para_tokens
        
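        # Add the final chunk. If it is below min_chunk_size, merge it into the
        # previous chunk rather than silently dropping its content.
        if current_chunk:
            chunk_content = '\n\n'.join(current_chunk)
            if current_tokens < self.min_chunk_size and chunks:
                chunks[-1]['content'] += '\n\n' + chunk_content
                chunks[-1]['tokens'] += current_tokens
            else:
                chunks.append({
                    'id': f"{section['id']}.{chunk_id}",
                    'level': section['level'] + 1,
                    'title': f"{section['title']} (Part {chunk_id + 1})",
                    'content': chunk_content,
                    'tokens': current_tokens,
                    'parent_id': section['id'],
                    'type': 'subsection_chunk'
                })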
        
        return chunks if chunks else [section]
    
    def chunk_text(self, text: str) -> Dict[str, Any]:
        """Create hierarchical chunks from text."""
        # Clean text
        text = re.sub(r'\n\s*\n\s*\n', '\n\n', text)  # Normalize multiple newlines
        text = text.strip()
        
        # Identify document structure
        headers = self._identify_headers(text)
        sections = self._extract_sections(text, headers)
        
        # Create hierarchical chunks
        all_chunks = []
        hierarchy_tree = None
        
        # Create tree structure
        nodes = {}
        root = Node("document", chunk_id="root", level=-1, title="Document Root")
        nodes["root"] = root
        
        for section in sections:
            # Create subsection chunks if needed
            section_chunks = self._create_subsection_chunks(section)
            
            for chunk in section_chunks:
                chunk['chunk_type'] = 'hierarchical'
                chunk['hierarchy_path'] = self._get_hierarchy_path(chunk, sections)
                all_chunks.append(chunk)
                
                # Add to tree
                parent_node = root
                if chunk['parent_id'] is not None and str(chunk['parent_id']) in nodes:
                    parent_node = nodes[str(chunk['parent_id'])]
                
                node = Node(
                    chunk['title'],
                    parent=parent_node,
                    chunk_id=str(chunk['id']),
                    level=chunk['level'],
                    tokens=chunk['tokens'],
                    chunk_data=chunk
                )
                nodes[str(chunk['id'])] = node
        
        hierarchy_tree = root
        
        return {
            'chunks': all_chunks,
            'hierarchy_tree': hierarchy_tree,
            'headers': headers,
            'sections': sections,
            'total_chunks': len(all_chunks),
            'max_level': max([c['level'] for c in all_chunks]) if all_chunks else 0,
            'total_tokens': sum([c['tokens'] for c in all_chunks])
        }
    
    def _get_hierarchy_path(self, chunk: Dict, sections: List[Dict]) -> List[str]:
        """Get the hierarchical path for a chunk."""
        path = [chunk['title']]
        current_parent = chunk['parent_id']
        
        while current_parent is not None:
            parent_section = next((s for s in sections if s['id'] == current_parent), None)
            if parent_section:
                path.insert(0, parent_section['title'])
                current_parent = parent_section['parent_id']
            else:
                break
        
        return path

print("✅ HierarchicalChunker class implemented!")
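
Parent lookup in `_extract_sections` scans backwards through earlier headers for each section. A possible alternative, sketched below under the assumption that headers arrive in document order as `(level, title)` pairs, keeps a stack of currently open ancestors so each header is resolved in amortized constant time. `assign_parents` is an illustrative helper, not part of `HierarchicalChunker`.

```python
from typing import List, Optional, Tuple


def assign_parents(headers: List[Tuple[int, str]]) -> List[Optional[int]]:
    """For each header, return the index of its nearest ancestor (or None)."""
    parents: List[Optional[int]] = []
    stack: List[int] = []  # indices of headers that are still open ancestors
    for idx, (level, _title) in enumerate(headers):
        # Close every header at the same depth or deeper than the current one
        while stack and headers[stack[-1]][0] >= level:
            stack.pop()
        parents.append(stack[-1] if stack else None)
        stack.append(idx)
    return parents


outline = [
    (1, "Artificial Intelligence"),
    (2, "1. Introduction"),
    (3, "1.1 Historical Development"),
    (3, "1.2 Current State of AI"),
    (2, "2. Core AI Technologies"),
    (3, "2.1 Machine Learning"),
    (4, "2.1.1 Supervised Learning"),
]
for (level, title), parent in zip(outline, assign_parents(outline)):
    parent_title = outline[parent][1] if parent is not None else "(root)"
    print(f"{'  ' * (level - 1)}{title}  <- {parent_title}")
```

Both approaches pick the nearest preceding header with a strictly lower level; the stack version simply avoids rescanning earlier headers on long documents.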

## 4. Testing Hierarchical Chunking

In [None]:
# Comprehensive test document with clear hierarchical structure
hierarchical_test_doc = """
# Artificial Intelligence: A Comprehensive Guide

## 1. Introduction to Artificial Intelligence

Artificial Intelligence (AI) represents one of the most significant technological advances of the modern era. It encompasses the development of computer systems that can perform tasks typically requiring human intelligence, such as learning, reasoning, perception, and decision-making.

The field of AI has evolved dramatically since its inception in the 1950s. Early pioneers like Alan Turing and John McCarthy laid the groundwork that continues to influence AI development today.

### 1.1 Historical Development

The history of AI can be traced back to ancient myths and stories of artificial beings endowed with intelligence. However, the modern field of AI began in the mid-20th century with the advent of electronic computers.

Key milestones include the development of the first neural networks, the creation of expert systems, and the recent breakthroughs in deep learning and large language models.

### 1.2 Current State of AI

Today's AI systems demonstrate remarkable capabilities across diverse domains. From natural language processing to computer vision, AI has achieved superhuman performance in many specialized tasks.

Modern AI is characterized by machine learning approaches, particularly deep learning, which has enabled significant advances in pattern recognition and decision-making.

## 2. Core AI Technologies

### 2.1 Machine Learning

Machine Learning (ML) is a subset of AI that enables systems to automatically learn and improve from experience without being explicitly programmed. ML algorithms build mathematical models based on training data to make predictions or decisions.

There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Each approach has distinct characteristics and applications.

#### 2.1.1 Supervised Learning

Supervised learning algorithms learn from labeled training data. The algorithm makes predictions based on input-output pairs and is evaluated on its ability to generalize to new, unseen data.

Common supervised learning tasks include classification (predicting categories) and regression (predicting continuous values). Popular algorithms include linear regression, decision trees, and support vector machines.

#### 2.1.2 Unsupervised Learning

Unsupervised learning finds patterns in data without labeled examples. These algorithms discover hidden structures in data, such as clusters, associations, or dimensionality reduction.

Key unsupervised learning techniques include clustering algorithms like K-means, association rule learning, and principal component analysis for dimensionality reduction.

### 2.2 Deep Learning

Deep Learning is a specialized subset of machine learning that uses artificial neural networks with multiple layers (hence "deep") to model and understand complex patterns in data.

Deep learning has revolutionized many AI applications, particularly in computer vision, natural language processing, and speech recognition. The ability to automatically learn hierarchical representations makes deep learning particularly powerful.

#### 2.2.1 Neural Network Architectures

Various neural network architectures have been developed for different types of problems. Convolutional Neural Networks (CNNs) excel at image processing, while Recurrent Neural Networks (RNNs) are suited for sequential data.

Transformer architectures have become dominant in natural language processing, enabling the development of large language models with unprecedented capabilities.

### 2.3 Natural Language Processing

Natural Language Processing (NLP) focuses on the interaction between computers and human language. It involves developing systems that can understand, interpret, and generate human language in valuable ways.

NLP combines computational linguistics with machine learning to enable computers to process and analyze large amounts of natural language data. Applications include machine translation, sentiment analysis, and chatbots.

## 3. AI Applications and Impact

### 3.1 Healthcare

AI is transforming healthcare through applications in medical imaging, drug discovery, personalized treatment, and clinical decision support. Machine learning algorithms can analyze medical images with accuracy matching or exceeding human specialists.

AI-powered systems help identify diseases earlier, predict patient outcomes, and optimize treatment plans. The integration of AI in healthcare promises to improve patient care while reducing costs.

### 3.2 Transportation

Autonomous vehicles represent one of the most visible applications of AI technology. Self-driving cars use computer vision, sensor fusion, and machine learning to navigate complex environments safely.

Beyond autonomous vehicles, AI optimizes traffic flow, improves logistics and supply chain management, and enhances public transportation systems.

### 3.3 Finance

The financial industry has embraced AI for fraud detection, algorithmic trading, risk assessment, and customer service. Machine learning models can detect fraudulent transactions in real-time and assess credit risk more accurately.

AI-powered robo-advisors provide personalized investment advice, while natural language processing enables automated customer support and document analysis.

## 4. Challenges and Future Directions

### 4.1 Ethical Considerations

As AI systems become more powerful and widespread, ethical considerations become increasingly important. Issues include algorithmic bias, privacy concerns, job displacement, and the need for transparent and explainable AI systems.

Developing ethical AI requires interdisciplinary collaboration between technologists, ethicists, policymakers, and society at large to ensure AI benefits humanity while minimizing potential harms.

### 4.2 Technical Challenges

Despite significant progress, AI faces several technical challenges. These include the need for large amounts of training data, computational resource requirements, and the brittleness of AI systems to adversarial examples.

Research continues into more efficient algorithms, better generalization capabilities, and AI systems that can learn from fewer examples.

### 4.3 Future Prospects

The future of AI holds tremendous promise. Anticipated developments include artificial general intelligence (AGI), quantum machine learning, and AI systems that can reason and understand the world more like humans.

As AI continues to advance, it will likely transform every aspect of human society, from how we work and learn to how we interact with technology and each other.
"""

# Test hierarchical chunking
print("🌳 Testing Hierarchical Chunking\n")

hierarchical_chunker = HierarchicalChunker(max_chunk_size=400, min_chunk_size=50)
result = hierarchical_chunker.chunk_text(hierarchical_test_doc)

print(f"📊 Hierarchical Chunking Results:")
print(f"  Total chunks: {result['total_chunks']}")
print(f"  Maximum hierarchy level: {result['max_level']}")
print(f"  Total tokens: {result['total_tokens']}")
print(f"  Headers identified: {len(result['headers'])}")
print(f"  Sections created: {len(result['sections'])}")

# Display hierarchy structure
print(f"\n🌲 Document Hierarchy Tree:")
for pre, _, node in RenderTree(result['hierarchy_tree']):
    if hasattr(node, 'chunk_data'):
        chunk = node.chunk_data
        print(f"{pre}{node.name} (Level {node.level}, {node.tokens} tokens)")
    else:
        print(f"{pre}{node.name}")

# Show chunk details
print(f"\n📋 Sample Chunk Details:")
for i, chunk in enumerate(result['chunks'][:5]):
    print(f"\nChunk {i+1}:")
    print(f"  ID: {chunk['id']}")
    print(f"  Level: {chunk['level']}")
    print(f"  Title: {chunk['title']}")
    print(f"  Tokens: {chunk['tokens']}")
    print(f"  Hierarchy Path: {' → '.join(chunk['hierarchy_path'])}")
    print(f"  Content Preview: {chunk['content'][:100]}...")

# Store result for later use
hierarchical_result = result
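
One possible use of the `hierarchy_path` values printed above is to prepend them to a chunk's text before embedding, so each chunk carries its section context with it. This is an assumption about how the path could be used, not something the notebook's pipeline does; `with_breadcrumb` and the hand-written `sample_chunk` are hypothetical.

```python
from typing import Dict


def with_breadcrumb(chunk: Dict, separator: str = " > ") -> str:
    """Prefix a chunk's content with its hierarchy path."""
    breadcrumb = separator.join(chunk["hierarchy_path"])
    return f"[{breadcrumb}]\n{chunk['content']}"


# Hand-written stand-in for one entry of result['chunks']
sample_chunk = {
    "title": "2.1 Machine Learning",
    "hierarchy_path": ["2. Core AI Technologies", "2.1 Machine Learning"],
    "content": "Machine Learning (ML) is a subset of AI.",
}
print(with_breadcrumb(sample_chunk))
```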

## 5. Hierarchical Visualization

In [None]:
def visualize_hierarchy_structure(result: Dict) -> None:
    """Visualize the hierarchical structure with multiple views."""
    
    chunks = result['chunks']
    
    # Create visualizations
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Hierarchical Document Structure Analysis', fontsize=16)
    
    # 1. Chunks by hierarchy level
    level_counts = defaultdict(int)
    level_tokens = defaultdict(list)
    
    for chunk in chunks:
        level = chunk['level']
        level_counts[level] += 1
        level_tokens[level].append(chunk['tokens'])
    
    levels = sorted(level_counts.keys())
    counts = [level_counts[level] for level in levels]
    
    ax1.bar([f"Level {level}" for level in levels], counts, alpha=0.7, color='skyblue')
    ax1.set_title('Chunks by Hierarchy Level')
    ax1.set_ylabel('Number of Chunks')
    ax1.grid(True, alpha=0.3)
    
    # 2. Token distribution by level
    avg_tokens = [np.mean(level_tokens[level]) for level in levels]
    
    ax2.bar([f"Level {level}" for level in levels], avg_tokens, alpha=0.7, color='lightgreen')
    ax2.set_title('Average Tokens per Level')
    ax2.set_ylabel('Average Tokens')
    ax2.grid(True, alpha=0.3)
    
    # 3. Hierarchy depth distribution
    depth_distribution = [len(chunk['hierarchy_path']) for chunk in chunks]
    
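    # Depths are small integers, so use one bin per depth value
    depth_bins = np.arange(1, max(depth_distribution) + 2)
    ax3.hist(depth_distribution, bins=depth_bins, align='left', alpha=0.7, color='orange', edgecolor='black')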
    ax3.set_title('Hierarchy Depth Distribution')
    ax3.set_xlabel('Hierarchy Depth')
    ax3.set_ylabel('Number of Chunks')
    ax3.grid(True, alpha=0.3)
    
    # 4. Token size distribution
    token_sizes = [chunk['tokens'] for chunk in chunks]
    
    ax4.hist(token_sizes, bins=15, alpha=0.7, color='lightcoral', edgecolor='black')
    ax4.axvline(np.mean(token_sizes), color='red', linestyle='--', 
                label=f'Mean: {np.mean(token_sizes):.1f}')
    ax4.set_title('Chunk Token Size Distribution')
    ax4.set_xlabel('Tokens')
    ax4.set_ylabel('Frequency')
    ax4.legend()
    ax4.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print hierarchy statistics
    print(f"\n📊 Hierarchy Statistics:")
    print(f"  Total levels: {len(levels)}")
    print(f"  Deepest hierarchy path: {max(depth_distribution)}")
    print(f"  Average chunk size: {np.mean(token_sizes):.1f} tokens")
    print(f"  Token size range: {min(token_sizes)} - {max(token_sizes)} tokens")
    
    for level in levels:
        avg_tokens_level = np.mean(level_tokens[level])
        print(f"  Level {level}: {level_counts[level]} chunks, avg {avg_tokens_level:.1f} tokens")

def create_hierarchy_network(result: Dict) -> None:
    """Create network visualization of hierarchy structure."""
    
    # Create network graph
    G = nx.DiGraph()
    
    # Add nodes and edges based on hierarchy tree
    for node in PreOrderIter(result['hierarchy_tree']):
        node_id = node.chunk_id if hasattr(node, 'chunk_id') else str(node.name)
        
        # Add node with attributes
        G.add_node(node_id, 
                  name=str(node.name)[:30] + ('...' if len(str(node.name)) > 30 else ''),
                  level=getattr(node, 'level', 0),
                  tokens=getattr(node, 'tokens', 0))
        
        # Add edge to parent
        if node.parent:
            parent_id = node.parent.chunk_id if hasattr(node.parent, 'chunk_id') else str(node.parent.name)
            G.add_edge(parent_id, node_id)
    
    # Create layout
    plt.figure(figsize=(14, 10))
    
    # Prefer the Graphviz "dot" layout; fall back to a spring layout when
    # pygraphviz (or Graphviz itself) is unavailable
    try:
        pos = nx.nx_agraph.graphviz_layout(G, prog='dot')
    except Exception:
        pos = nx.spring_layout(G, k=3, iterations=50)
    
    # Color nodes by level
    node_colors = []
    node_sizes = []
    
    for node_id in G.nodes():
        level = G.nodes[node_id]['level']
        tokens = G.nodes[node_id]['tokens']
        
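        # Color by level (clamped so the root's level of -1 does not wrap
        # around to the last color via negative indexing)
        colors = ['red', 'orange', 'yellow', 'lightgreen', 'lightblue', 'purple']
        node_colors.append(colors[max(0, min(level, len(colors) - 1))])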
        
        # Size by token count
        size = max(300, min(1500, tokens * 3)) if tokens > 0 else 300
        node_sizes.append(size)
    
    # Draw network
    nx.draw(G, pos, 
            node_color=node_colors,
            node_size=node_sizes,
            with_labels=False,
            arrows=True,
            edge_color='gray',
            alpha=0.7,
            arrowsize=20)
    
    # Add labels
    node_labels = {node_id: G.nodes[node_id]['name'] for node_id in G.nodes()}
    nx.draw_networkx_labels(G, pos, node_labels, font_size=8, font_weight='bold')
    
    plt.title('Hierarchical Document Structure Network', fontsize=16)
    plt.axis('off')
    
    # Add legend
    legend_elements = [
        plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=10, label='Level 0'),
        plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='orange', markersize=10, label='Level 1'),
        plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='yellow', markersize=10, label='Level 2'),
        plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='lightgreen', markersize=10, label='Level 3+')
    ]
    plt.legend(handles=legend_elements, loc='upper right')
    
    plt.tight_layout()
    plt.show()

# Visualize the hierarchical structure
print("📈 Visualizing Hierarchical Structure")
visualize_hierarchy_structure(hierarchical_result)

print("\n🕸️ Creating Hierarchy Network Visualization")
try:
    create_hierarchy_network(hierarchical_result)
except Exception as e:
    print(f"Network visualization skipped: {e}")
    print("Install graphviz for better network layouts: pip install pygraphviz")

## 6. Hierarchical Q&A System
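
The retrieval logic in this section has two steps: rank chunks by cosine similarity (optionally restricted to one hierarchy level), then pull in each hit's parent as extra context at a reduced score. The sketch below reproduces that flow with plain Python lists and hand-made 2-D vectors instead of real embeddings. The `retrieve` function and the toy data are assumptions for illustration; the 0.7 parent discount mirrors the factor used in `_find_context_from_hierarchy`.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: List[float], chunks: List[Dict],
             target_level: Optional[int] = None, top_k: int = 2,
             parent_discount: float = 0.7) -> List[Dict]:
    """Rank chunks (optionally at one level), then append unseen parents."""
    candidates = [c for c in chunks
                  if target_level is None or c["level"] == target_level]
    scored = [{**c, "score": cosine(query_vec, c["vec"]), "context_type": "direct"}
              for c in candidates]
    results = sorted(scored, key=lambda c: c["score"], reverse=True)[:top_k]

    seen = {c["id"] for c in results}
    by_id = {c["id"]: c for c in chunks}
    for hit in list(results):
        parent = by_id.get(hit["parent_id"])
        if parent is not None and parent["id"] not in seen:
            # Parent context inherits a discounted score from its child
            results.append({**parent,
                            "score": hit["score"] * parent_discount,
                            "context_type": "parent_context"})
            seen.add(parent["id"])
    return results


toy_chunks = [
    {"id": 0, "level": 1, "parent_id": None, "title": "Core AI Technologies", "vec": [1.0, 1.0]},
    {"id": 1, "level": 2, "parent_id": 0, "title": "Machine Learning", "vec": [1.0, 0.0]},
    {"id": 2, "level": 2, "parent_id": 0, "title": "Deep Learning", "vec": [0.0, 1.0]},
]
hits = retrieve([1.0, 0.1], toy_chunks, target_level=2, top_k=1)
for hit in hits:
    print(f"{hit['title']:<22} {hit['context_type']:<15} score={hit['score']:.3f}")
```

The parent is added even though it sits at a different level than `target_level`, which is what lets a detail-level match bring its broader section along.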

In [None]:
class HierarchicalQASystem:
    def __init__(self, max_chunk_size: int = 400, min_chunk_size: int = 50):
        self.chunker = HierarchicalChunker(max_chunk_size, min_chunk_size)
        self.gemini_model = genai.GenerativeModel('gemini-pro')
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.hierarchy_result = None
        self.chunk_embeddings = None
        self.document_title = ""
        
    def load_document(self, text: str, title: str = "Document"):
        """Load document and create hierarchical structure."""
        self.document_title = title
        print(f"🔄 Processing document with hierarchical chunking...")
        
        # Create hierarchical structure
        self.hierarchy_result = self.chunker.chunk_text(text)
        
        # Generate embeddings for all chunks
        chunk_texts = [chunk['content'] for chunk in self.hierarchy_result['chunks']]
        self.chunk_embeddings = self.embedding_model.encode(chunk_texts)
        
        print(f"✅ Loaded '{title}' with hierarchical structure")
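        # Count distinct levels; Markdown header levels start at 1, so max_level + 1 would overcount
        num_levels = len({chunk['level'] for chunk in self.hierarchy_result['chunks']})
        print(f"📊 {self.hierarchy_result['total_chunks']} chunks across {num_levels} levels")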
        
    def _find_relevant_chunks_by_level(self, question: str, target_level: Optional[int] = None, max_chunks: int = 3) -> List[Dict]:
        """Find relevant chunks, optionally filtered by hierarchy level."""
        if not self.hierarchy_result or self.chunk_embeddings is None:
            return []
        
        chunks = self.hierarchy_result['chunks']
        
        # Filter by level if specified
        if target_level is not None:
            filtered_chunks = [(i, chunk) for i, chunk in enumerate(chunks) if chunk['level'] == target_level]
        else:
            filtered_chunks = list(enumerate(chunks))
        
        if not filtered_chunks:
            return []
        
        # Get embeddings for filtered chunks
        indices = [i for i, _ in filtered_chunks]
        filtered_embeddings = self.chunk_embeddings[indices]
        
        # Calculate similarities
        question_embedding = self.embedding_model.encode([question])
        similarities = cosine_similarity(question_embedding, filtered_embeddings)[0]
        
        # Get top chunks
        top_local_indices = np.argsort(similarities)[::-1][:max_chunks]
        
        relevant_chunks = []
        for local_idx in top_local_indices:
            global_idx, chunk = filtered_chunks[local_idx]
            chunk_copy = chunk.copy()
            chunk_copy['similarity_score'] = float(similarities[local_idx])
            chunk_copy['global_index'] = global_idx
            relevant_chunks.append(chunk_copy)
        
        return relevant_chunks
    
    def _find_context_from_hierarchy(self, relevant_chunks: List[Dict]) -> List[Dict]:
        """Expand context using hierarchical relationships."""
        expanded_chunks = relevant_chunks.copy()
        chunk_ids = set(str(chunk['id']) for chunk in relevant_chunks)
        
        # Add parent context for better understanding
        for chunk in relevant_chunks:
            if chunk['parent_id'] is not None:
                parent_id = str(chunk['parent_id'])
                if parent_id not in chunk_ids:
                    # Find parent chunk
                    parent_chunk = next(
                        (c for c in self.hierarchy_result['chunks'] if str(c['id']) == parent_id), 
                        None
                    )
                    if parent_chunk:
                        parent_copy = parent_chunk.copy()
                        parent_copy['similarity_score'] = chunk['similarity_score'] * 0.7  # Reduced score
                        parent_copy['context_type'] = 'parent_context'
                        expanded_chunks.append(parent_copy)
                        chunk_ids.add(parent_id)
        
        # Sort by hierarchy level and document order
        expanded_chunks.sort(key=lambda x: (x['level'], x['id']))
        
        return expanded_chunks
    
    def answer_question(self, question: str, strategy: str = 'adaptive', max_chunks: int = 3) -> Dict:
        """Answer question using hierarchical retrieval strategies."""
        if not self.hierarchy_result:
            return {"error": "No document loaded"}
        
        print(f"🔍 Processing question with {strategy} strategy: {question}")
        
        if strategy == 'adaptive':
            relevant_chunks = self._adaptive_retrieval(question, max_chunks)
        elif strategy == 'broad_first':
            relevant_chunks = self._broad_first_retrieval(question, max_chunks)
        elif strategy == 'detailed_first':
            relevant_chunks = self._detailed_first_retrieval(question, max_chunks)
        elif strategy == 'multi_level':
            relevant_chunks = self._multi_level_retrieval(question, max_chunks)
        else:
            relevant_chunks = self._find_relevant_chunks_by_level(question, max_chunks=max_chunks)
        
        if not relevant_chunks:
            return {"error": "No relevant content found"}
        
        # Expand context using hierarchy
        expanded_chunks = self._find_context_from_hierarchy(relevant_chunks)
        
        # Prepare context with hierarchical information
        context_parts = []
        for chunk in expanded_chunks:
            hierarchy_path = ' → '.join(chunk['hierarchy_path'])
            context_type = chunk.get('context_type', 'primary')
            chunk_info = f"[{hierarchy_path}] ({context_type})"
            context_parts.append(f"{chunk_info}\n{chunk['content']}")
        
        context = "\n\n".join(context_parts)
        
        # Generate answer with hierarchical awareness
        answer_prompt = f"""
        You are analyzing a hierarchically structured document "{self.document_title}". 
        Answer the question using the provided context, which includes hierarchical information.
        
        Context (with hierarchy paths):
        {context}
        
        Question: {question}
        
        Instructions:
        1. Use the hierarchical structure to provide a well-organized answer
        2. Reference specific sections when relevant (e.g., "According to Section 2.1...")
        3. Synthesize information across different hierarchy levels when appropriate
        4. If the question requires broad context, emphasize higher-level information
        5. If the question requires specific details, focus on lower-level information
        
        Answer:
        """
        
        try:
            response = self.gemini_model.generate_content(answer_prompt)
            
            return {
                "question": question,
                "answer": response.text,
                "strategy": strategy,
                "primary_chunks": len(relevant_chunks),
                "total_chunks_used": len(expanded_chunks),
                "chunk_details": [
                    {
                        "id": chunk['id'],
                        "level": chunk['level'],
                        "title": chunk['title'],
                        "similarity": chunk.get('similarity_score', 0),
                        "hierarchy_path": ' → '.join(chunk['hierarchy_path']),
                        "context_type": chunk.get('context_type', 'primary'),
                        "tokens": chunk['tokens']
                    }
                    for chunk in expanded_chunks
                ],
                "context_tokens": sum(chunk['tokens'] for chunk in expanded_chunks)
            }
            
        except Exception as e:
            return {"error": f"Failed to generate answer: {e}"}
    
    def _adaptive_retrieval(self, question: str, max_chunks: int) -> List[Dict]:
        """Adaptive retrieval based on question characteristics."""
        # Analyze question to determine appropriate level
        question_lower = question.lower()
        
        # Broad questions typically use higher-level chunks
        broad_keywords = ['overview', 'introduction', 'summary', 'what is', 'explain', 'describe']
        detailed_keywords = ['how', 'specific', 'detail', 'example', 'step', 'process']
        
        is_broad = any(keyword in question_lower for keyword in broad_keywords)
        is_detailed = any(keyword in question_lower for keyword in detailed_keywords)
        
        if is_broad and not is_detailed:
            # Prefer higher-level chunks
            target_levels = [1, 2, 3]  # Prefer levels 1-3
        elif is_detailed and not is_broad:
            # Prefer lower-level chunks
            target_levels = [3, 4, 5]  # Prefer levels 3+
        else:
            # Mixed or unclear - use all levels
            target_levels = None
        
        if target_levels:
            all_relevant = []
            for level in target_levels:
                level_chunks = self._find_relevant_chunks_by_level(question, level, max_chunks//len(target_levels) + 1)
                all_relevant.extend(level_chunks)
            
            # Sort by similarity and take top chunks
            all_relevant.sort(key=lambda x: x['similarity_score'], reverse=True)
            if all_relevant:
                return all_relevant[:max_chunks]
        
        # Mixed or unclear question, or no chunks at the preferred levels: search all levels
        return self._find_relevant_chunks_by_level(question, max_chunks=max_chunks)
    
    def _broad_first_retrieval(self, question: str, max_chunks: int) -> List[Dict]:
        """Retrieve higher-level chunks first."""
        return self._find_relevant_chunks_by_level(question, target_level=1, max_chunks=max_chunks)
    
    def _detailed_first_retrieval(self, question: str, max_chunks: int) -> List[Dict]:
        """Retrieve lower-level detailed chunks first."""
        max_level = self.hierarchy_result['max_level']
        return self._find_relevant_chunks_by_level(question, target_level=max_level, max_chunks=max_chunks)
    
    def _multi_level_retrieval(self, question: str, max_chunks: int) -> List[Dict]:
        """Retrieve chunks from multiple levels."""
        all_chunks = []
        max_level = self.hierarchy_result['max_level']
        
        # Levels 1..max_level are searched below, so split the budget across max_level levels
        chunks_per_level = max(1, max_chunks // max(1, max_level))
        
        for level in range(1, max_level + 1):
            level_chunks = self._find_relevant_chunks_by_level(question, level, chunks_per_level)
            all_chunks.extend(level_chunks)
        
        # Sort by similarity and return top chunks
        all_chunks.sort(key=lambda x: x['similarity_score'], reverse=True)
        return all_chunks[:max_chunks]
    
    def analyze_hierarchy_structure(self) -> Dict:
        """Analyze the hierarchical structure of the loaded document."""
        if not self.hierarchy_result:
            return {"error": "No document loaded"}
        
        chunks = self.hierarchy_result['chunks']
        
        analysis = {
            "total_chunks": len(chunks),
            "hierarchy_levels": self.hierarchy_result['max_level'] + 1,
            "level_distribution": defaultdict(int),
            "avg_tokens_per_level": defaultdict(list),
            "headers_identified": len(self.hierarchy_result['headers']),
            "sections_created": len(self.hierarchy_result['sections'])
        }
        
        for chunk in chunks:
            level = chunk['level']
            analysis['level_distribution'][level] += 1
            analysis['avg_tokens_per_level'][level].append(chunk['tokens'])
        
        # Replace each level's token list with its mean, as plain floats in a plain dict
        analysis['avg_tokens_per_level'] = {
            level: float(np.mean(tokens))
            for level, tokens in analysis['avg_tokens_per_level'].items()
        }
        
        return analysis

print("✅ HierarchicalQASystem class implemented!")
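
The level filter and the parent expansion can be tried without loading any model. The following is a minimal sketch on toy data: the `toy_chunks` list, its 2-D `vector` values, and the helper names `retrieve` and `add_parents` are all made up for illustration. Only the chunk keys (`id`, `level`, `parent_id`, `title`) and the 0.7 parent discount mirror the class above.

```python
import math

# Hypothetical toy chunks; the keys mirror the chunk dicts used by the class,
# while the 2-D "vector" values are made up and stand in for real embeddings.
toy_chunks = [
    {"id": 0, "level": 1, "parent_id": None, "title": "Machine Learning", "vector": [0.7, 0.7]},
    {"id": 1, "level": 2, "parent_id": 0, "title": "Supervised Learning", "vector": [1.0, 0.1]},
    {"id": 2, "level": 2, "parent_id": 0, "title": "Unsupervised Learning", "vector": [0.1, 1.0]},
    {"id": 3, "level": 1, "parent_id": None, "title": "Applications", "vector": [-1.0, 0.2]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, level=None, top_k=1):
    """Rank chunks by similarity, optionally restricted to one hierarchy level."""
    candidates = [c for c in toy_chunks if level is None or c["level"] == level]
    scored = [{**c, "similarity_score": cosine(query, c["vector"])} for c in candidates]
    scored.sort(key=lambda c: c["similarity_score"], reverse=True)
    return scored[:top_k]


def add_parents(hits, parent_weight=0.7):
    """Append each hit's parent once, with a discounted similarity score."""
    by_id = {c["id"]: c for c in toy_chunks}
    seen = {h["id"] for h in hits}
    expanded = list(hits)
    for hit in hits:
        parent = by_id.get(hit["parent_id"])
        if parent is not None and parent["id"] not in seen:
            expanded.append({
                **parent,
                "similarity_score": hit["similarity_score"] * parent_weight,
                "context_type": "parent_context",
            })
            seen.add(parent["id"])
    return expanded


query = [1.0, 0.0]
hits = retrieve(query, level=2, top_k=1)
context = add_parents(hits)
print([c["title"] for c in context])  # ['Supervised Learning', 'Machine Learning']
```

Restricting the search to level 2 picks the most similar subsection, and the expansion step then pulls in its level-1 parent so the answer prompt sees both the detail and its surrounding section.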

## 7. Testing Hierarchical Q&A System

In [None]:
# Initialize hierarchical Q&A system
hierarchical_qa = HierarchicalQASystem(max_chunk_size=400, min_chunk_size=50)
hierarchical_qa.load_document(hierarchical_test_doc, "AI Comprehensive Guide")

# Analyze hierarchy structure
structure_analysis = hierarchical_qa.analyze_hierarchy_structure()
print(f"\n📊 Hierarchy Structure Analysis:")
for key, value in structure_analysis.items():
    if isinstance(value, dict):
        print(f"  {key}:")
        for subkey, subvalue in value.items():
            print(f"    Level {subkey}: {subvalue}")
    else:
        print(f"  {key}: {value}")

# Test different retrieval strategies
test_questions = [
    ("What is artificial intelligence and what are its main components?", "adaptive"),
    ("How do supervised and unsupervised learning differ?", "detailed_first"),
    ("Give me an overview of AI applications across different industries", "broad_first"),
    ("What are the specific neural network architectures mentioned?", "multi_level")
]

print("\n🧠 Testing Hierarchical Q&A with Different Strategies\n")

for i, (question, strategy) in enumerate(test_questions[:2], 1):  # Test first 2 questions
    print(f"{'='*80}")
    print(f"Question {i}: {question}")
    print(f"Strategy: {strategy}")
    print(f"{'='*80}")
    
    result = hierarchical_qa.answer_question(question, strategy=strategy, max_chunks=3)
    
    if "error" in result:
        print(f"❌ Error: {result['error']}")
    else:
        print(f"\n💡 Answer:")
        print(result["answer"])
        
        print(f"\n📊 Retrieval Details:")
        print(f"  - Strategy used: {result['strategy']}")
        print(f"  - Primary chunks: {result['primary_chunks']}")
        print(f"  - Total chunks used: {result['total_chunks_used']}")
        print(f"  - Context tokens: {result['context_tokens']}")
        
        print(f"\n🌳 Hierarchical Context Used:")
        for detail in result['chunk_details']:
            context_indicator = " 📄" if detail['context_type'] == 'primary' else " 🔗"
            print(f"  {context_indicator} {detail['hierarchy_path']}")
            print(f"    Level {detail['level']}, Similarity: {detail['similarity']:.3f}, {detail['tokens']} tokens")
    
    print("\n" + "-"*80 + "\n")
    time.sleep(1)  # Rate limiting
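
One caveat when reading results from the `adaptive` strategy: `_adaptive_retrieval` matches keywords as plain substrings, so `'how'` also fires inside "show". A possible refinement, sketched here as an assumption rather than as part of the class, anchors each keyword at a word start. The helper name `classify_question` is hypothetical; the keyword lists are copied from the class.

```python
import re

BROAD_KEYWORDS = ['overview', 'introduction', 'summary', 'what is', 'explain', 'describe']
DETAILED_KEYWORDS = ['how', 'specific', 'detail', 'example', 'step', 'process']


def _matches(keywords, text):
    # \b anchors the keyword at a word start, so 'how' no longer matches "show".
    # There is no trailing \b, so 'detail' still matches "details".
    return any(re.search(r'\b' + re.escape(keyword), text) for keyword in keywords)


def classify_question(question):
    """Return 'broad', 'detailed', or 'mixed' for a question (hypothetical helper)."""
    text = question.lower()
    broad = _matches(BROAD_KEYWORDS, text)
    detailed = _matches(DETAILED_KEYWORDS, text)
    if broad and not detailed:
        return 'broad'
    if detailed and not broad:
        return 'detailed'
    return 'mixed'


print(classify_question("Show me a summary of the document"))       # broad
print(classify_question("How do supervised and unsupervised learning differ?"))  # detailed
print(classify_question("Explain how backpropagation works"))        # mixed
```

With substring matching, the first question would count as both broad and detailed and fall through to the all-levels search.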

## 8. Strategy Comparison and Analysis

In [None]:
def compare_hierarchical_strategies(qa_system: HierarchicalQASystem, question: str) -> Dict:
    """Compare different hierarchical retrieval strategies."""
    
    strategies = ['adaptive', 'broad_first', 'detailed_first', 'multi_level']
    results = {}
    
    print(f"🔬 Comparing Hierarchical Strategies for: {question}\n")
    
    for strategy in strategies:
        print(f"Testing {strategy} strategy...")
        result = qa_system.answer_question(question, strategy=strategy, max_chunks=3)
        
        if "error" not in result:
            # Analyze strategy effectiveness
            chunk_levels = [detail['level'] for detail in result['chunk_details']]
            avg_level = np.mean(chunk_levels)
            level_diversity = len(set(chunk_levels))
            
            results[strategy] = {
                'answer_length': len(result['answer']),
                'chunks_used': result['total_chunks_used'],
                'context_tokens': result['context_tokens'],
                'avg_hierarchy_level': avg_level,
                'level_diversity': level_diversity,
                'chunk_levels': chunk_levels,
                'answer_preview': result['answer'][:150] + '...'
            }
        else:
            results[strategy] = {'error': result['error']}
    
    return results

def visualize_strategy_comparison(comparison_results: Dict, question: str):
    """Visualize strategy comparison results."""
    
    strategies = list(comparison_results.keys())
    valid_strategies = [s for s in strategies if 'error' not in comparison_results[s]]
    
    if not valid_strategies:
        print("No valid results to visualize")
        return
    
    # Extract metrics
    metrics = {
        'answer_length': [comparison_results[s]['answer_length'] for s in valid_strategies],
        'chunks_used': [comparison_results[s]['chunks_used'] for s in valid_strategies],
        'context_tokens': [comparison_results[s]['context_tokens'] for s in valid_strategies],
        'avg_hierarchy_level': [comparison_results[s]['avg_hierarchy_level'] for s in valid_strategies],
        'level_diversity': [comparison_results[s]['level_diversity'] for s in valid_strategies]
    }
    
    # Create visualization
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle(f'Hierarchical Strategy Comparison\nQuestion: {question[:60]}...', fontsize=14)
    
    # Answer length
    ax1.bar(valid_strategies, metrics['answer_length'], alpha=0.7, color='skyblue')
    ax1.set_title('Answer Length')
    ax1.set_ylabel('Characters')
    ax1.tick_params(axis='x', rotation=45)
    
    # Chunks used
    ax2.bar(valid_strategies, metrics['chunks_used'], alpha=0.7, color='lightgreen')
    ax2.set_title('Total Chunks Used')
    ax2.set_ylabel('Number of Chunks')
    ax2.tick_params(axis='x', rotation=45)
    
    # Average hierarchy level
    ax3.bar(valid_strategies, metrics['avg_hierarchy_level'], alpha=0.7, color='orange')
    ax3.set_title('Average Hierarchy Level')
    ax3.set_ylabel('Level')
    ax3.tick_params(axis='x', rotation=45)
    
    # Level diversity
    ax4.bar(valid_strategies, metrics['level_diversity'], alpha=0.7, color='lightcoral')
    ax4.set_title('Hierarchy Level Diversity')
    ax4.set_ylabel('Unique Levels Used')
    ax4.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
    
    # Print detailed comparison
    print(f"\n📊 Detailed Strategy Comparison:")
    for strategy in valid_strategies:
        data = comparison_results[strategy]
        print(f"\n{strategy.upper()}:")
        print(f"  Answer length: {data['answer_length']} characters")
        print(f"  Chunks used: {data['chunks_used']}")
        print(f"  Context tokens: {data['context_tokens']}")
        print(f"  Avg hierarchy level: {data['avg_hierarchy_level']:.2f}")
        print(f"  Level diversity: {data['level_diversity']} unique levels")
        print(f"  Levels used: {sorted(set(data['chunk_levels']))}")

# Run strategy comparison
comparison_question = "What are the main types of machine learning and their applications?"
strategy_results = compare_hierarchical_strategies(hierarchical_qa, comparison_question)
visualize_strategy_comparison(strategy_results, comparison_question)
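
The two level metrics in the comparison are simple summaries of the levels of the chunks a strategy returned. A minimal sketch with made-up level lists (illustrative inputs, not results from this notebook) shows how they separate a single-level strategy from one that mixes levels. The helper name `level_metrics` is hypothetical.

```python
from statistics import mean


def level_metrics(chunk_levels):
    """Summarize which hierarchy levels a strategy drew its chunks from."""
    return {
        "avg_hierarchy_level": mean(chunk_levels),
        "level_diversity": len(set(chunk_levels)),
    }


# Made-up level lists for illustration
single_level = level_metrics([1, 1, 1])    # every chunk from level 1
mixed_levels = level_metrics([1, 2, 3, 1])  # chunks spread over three levels

print(single_level)  # {'avg_hierarchy_level': 1, 'level_diversity': 1}
print(mixed_levels)  # {'avg_hierarchy_level': 1.75, 'level_diversity': 3}
```

Neither metric measures answer quality. They only describe where in the hierarchy the context came from, so they are best read alongside the answers themselves.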

## 9. Hierarchical vs. Other Chunking Methods

In [None]:
# Simple comparison with other chunking methods
def compare_chunking_approaches(text: str, question: str) -> Dict:
    """Compare hierarchical chunking with other approaches."""
    
    results = {}
    
    # 1. Hierarchical chunking (our implementation)
    print("Testing hierarchical chunking...")
    hierarchical_qa_test = HierarchicalQASystem(max_chunk_size=400)
    hierarchical_qa_test.load_document(text, "Test Document")
    hierarchy_result = hierarchical_qa_test.answer_question(question, strategy='adaptive')
    
    if "error" not in hierarchy_result:
        results['Hierarchical'] = {
            'chunks': hierarchy_result['total_chunks_used'],
            'answer_length': len(hierarchy_result['answer']),
            'context_tokens': hierarchy_result['context_tokens'],
            'structure_preserved': True,
            'hierarchy_levels': len(set(d['level'] for d in hierarchy_result['chunk_details'])),
            'method_type': 'Structure-aware'
        }
    
    # 2. Simple fixed-size chunking simulation
    print("Testing fixed-size chunking...")
    # Simulate fixed-size chunking with chunks of roughly 400 tokens
    words = text.split()
    chunk_size = 300  # words per chunk (about 400 tokens at roughly 0.75 words per token)
    fixed_chunks = []
    
    for i in range(0, len(words), chunk_size):
        chunk_text = ' '.join(words[i:i+chunk_size])
        if len(chunk_text.strip()) > 50:
            fixed_chunks.append({
                'content': chunk_text,
                'tokens': count_tokens(chunk_text)
            })
    
    results['Fixed-Size'] = {
        'chunks': len(fixed_chunks),
        'answer_length': 0,  # Simplified comparison
        'context_tokens': sum(chunk['tokens'] for chunk in fixed_chunks[:3]),  # First 3 chunks (no relevance ranking)
        'structure_preserved': False,
        'hierarchy_levels': 1,  # Flat structure
        'method_type': 'Size-based'
    }
    
    # 3. Paragraph-based chunking simulation
    print("Testing paragraph-based chunking...")
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    para_chunks = []
    
    for para in paragraphs:
        if count_tokens(para) >= 50:  # Minimum size
            para_chunks.append({
                'content': para,
                'tokens': count_tokens(para)
            })
    
    results['Paragraph-Based'] = {
        'chunks': len(para_chunks),
        'answer_length': 0,  # Simplified comparison
        'context_tokens': sum(chunk['tokens'] for chunk in para_chunks[:3]),
        'structure_preserved': 'Partial',
        'hierarchy_levels': 1,
        'method_type': 'Content-aware'
    }
    
    return results

def visualize_chunking_comparison(comparison_results: Dict):
    """Visualize comparison between chunking methods."""
    
    methods = list(comparison_results.keys())
    
    # Create comparison DataFrame
    comparison_data = []
    for method, data in comparison_results.items():
        comparison_data.append({
            'Method': method,
            'Total Chunks': data['chunks'],
            'Context Tokens': data['context_tokens'],
            'Hierarchy Levels': data['hierarchy_levels'],
            'Structure Preserved': data['structure_preserved'],
            'Method Type': data['method_type']
        })
    
    comparison_df = pd.DataFrame(comparison_data)
    print("\n📊 Chunking Method Comparison:")
    print(comparison_df.to_string(index=False))
    
    # Visualize key metrics
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
    fig.suptitle('Chunking Method Comparison', fontsize=16)
    
    # Total chunks
    chunk_counts = [comparison_results[m]['chunks'] for m in methods]
    ax1.bar(methods, chunk_counts, alpha=0.7, color='skyblue')
    ax1.set_title('Total Chunks Created')
    ax1.set_ylabel('Number of Chunks')
    ax1.tick_params(axis='x', rotation=45)
    
    # Context tokens
    context_tokens = [comparison_results[m]['context_tokens'] for m in methods]
    ax2.bar(methods, context_tokens, alpha=0.7, color='lightgreen')
    ax2.set_title('Context Tokens (Top 3 Chunks)')
    ax2.set_ylabel('Tokens')
    ax2.tick_params(axis='x', rotation=45)
    
    # Hierarchy levels
    hierarchy_levels = [comparison_results[m]['hierarchy_levels'] for m in methods]
    ax3.bar(methods, hierarchy_levels, alpha=0.7, color='orange')
    ax3.set_title('Hierarchy Levels Available')
    ax3.set_ylabel('Number of Levels')
    ax3.tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()

# Run comparison
comparison_question = "What are the main AI technologies discussed?"
print(f"🔬 Comparing Chunking Approaches\nQuestion: {comparison_question}\n")

chunking_comparison = compare_chunking_approaches(hierarchical_test_doc, comparison_question)
visualize_chunking_comparison(chunking_comparison)

print("\n📈 Key Advantages of Hierarchical Chunking:")
print("• ✅ Preserves document structure and organization")
print("• ✅ Enables multi-level retrieval strategies")
print("• ✅ Maintains semantic relationships between sections")
print("• ✅ Supports both broad overviews and detailed answers")
print("• ✅ Provides context inheritance from parent sections")
print("• ✅ Facilitates navigation and understanding of complex documents")
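
The fixed-size baseline above sizes its chunks in words while the hierarchical chunker works in tokens, so comparing them requires a conversion. The sketch below assumes a rough English-text heuristic of about 0.75 words per token. That ratio is an approximation that varies by tokenizer and text, and both helper names are hypothetical.

```python
def words_for_token_budget(token_budget, words_per_token=0.75):
    """Convert a token budget to a word count using a rough heuristic ratio."""
    return max(1, int(token_budget * words_per_token))


def fixed_size_chunks(text, token_budget=400):
    """Split text into consecutive word windows sized for the token budget."""
    words = text.split()
    size = words_for_token_budget(token_budget)
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]


# Synthetic 700-word sample, used only to show the window sizes
sample = ' '.join(f"word{i}" for i in range(700))
chunks = fixed_size_chunks(sample, token_budget=400)
print([len(chunk.split()) for chunk in chunks])  # [300, 300, 100]
```

For a tighter match to the hierarchical chunker, the windows could be measured with the notebook's own `count_tokens` instead of the word heuristic.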