# Document Category Analysis with AI/LLM 🤖📄

## Educational Demo: Intelligent Document Organization

**Welcome to this interactive demonstration!** This notebook shows how we can use Large Language Models (LLMs) to automatically analyze and categorize documents.

### What You'll Learn:
1. **OCR & Text Extraction** - How to extract text from images and PDFs
2. **LLM Document Analysis** - Using AI to understand document content
3. **Intelligent Categorization** - Automatically grouping similar documents
4. **Data Visualization** - Creating charts and graphs to understand results
5. **Knowledge Graphs** - Visualizing relationships between documents

### Real-World Applications:
- 📊 **Business**: Organize invoices, contracts, reports
- 🏥 **Healthcare**: Categorize medical records, prescriptions
- 🎓 **Education**: Sort research papers, assignments
- 🏠 **Personal**: Organize family documents, receipts, photos

---

## 1. Setup and Imports 🔧

First, let's import all the libraries we need and configure our environment:

In [None]:
%pip install --quiet --upgrade pip
%pip install --quiet -r requirements.txt

In [None]:
# Core Python libraries
import os
import sys
import re
import random
import logging
import json
from datetime import datetime
from collections import Counter, defaultdict
from pathlib import Path

# Data analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import networkx as nx

# Project-specific imports
from llm_analyzer import LLMAnalyzer
from document_processor import DocumentProcessor

# Configure visualization style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")
print(f"📅 Analysis started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

In [None]:
# Enhanced error handling for document processing
def safe_file_analysis(analyzer, file_path):
    """Safely analyze a file with comprehensive error handling."""
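    try:
        result = analyzer.analyze_file(file_path)
        # analyze_file() logs its own exceptions and returns None on failure,
        # so a None result must be reported as an error here.
        if result is None:
            return None, "Analysis returned no result - see the log for details"
        return result, None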
    except Exception as e:
        error_msg = str(e)
        
        # Check for common error types and provide helpful messages
        if "invalid pdf header" in error_msg.lower():
            return None, "Invalid PDF file format - file may be corrupted or not a real PDF"
        elif "cannot identify image file" in error_msg.lower():
            return None, "Invalid image file format - file may be corrupted or not a real image"
        elif "stream has ended unexpectedly" in error_msg.lower():
            return None, "PDF file appears to be corrupted or incomplete"
        elif "tesseract" in error_msg.lower():
            return None, "OCR engine (Tesseract) error - check installation"
        else:
            return None, f"Unexpected error: {error_msg}"

print("🛡️ Enhanced error handling functions ready")

In [None]:
# Configure logging for the notebook
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('category_analysis_demo.log'),
        logging.StreamHandler()
    ]
)

# Create a logger for our demo
logger = logging.getLogger(__name__)
logger.info("🚀 Document Category Analyzer Demo Started")

# Suppress verbose HTTP request logs from httpx/urllib3
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)

print("📝 Logging configured - check 'category_analysis_demo.log' for detailed logs")

## 2. Define Helper Functions 🛠️

Let's define the core functions that will help us analyze documents:

In [None]:
print("💡 Note: This demo uses .txt files for reliable text extraction")
print("🔧 In production, you'd use real PDFs and images with proper OCR")

class DocumentCategoryAnalyzer:
    """
    An intelligent document categorization system that uses LLMs to:
    1. Extract text from documents (PDFs, images)
    2. Analyze content using AI
    3. Suggest optimal category structures
    4. Visualize results
    """
    
    def __init__(self, source_dir, sample_size=1000):
        """Initialize the analyzer with source directory and sample size."""
        self.source_dir = source_dir
        self.sample_size = sample_size
        self.llm_analyzer = LLMAnalyzer()
        self.document_processor = DocumentProcessor()
        
        # Supported file types
        self.supported_extensions = [
            '.pdf', '.jpg', '.jpeg', '.png', '.tif', '.tiff', '.docx', '.doc',
            '.txt', '.bmp', '.csv', '.xlsx', '.pptx', '.ppt', '.gif',
        ]
        
        # Initial category system
        self.existing_categories = ["Medical Documents", "Receipts", "Contracts", "Photographs", "Other"]
        
        # Use the same LLM instance for consistency
        self.llm = self.llm_analyzer.llm
        
        print(f"🤖 Analyzer initialized for directory: {source_dir}")
        print(f"📊 Sample size: {sample_size} files")
        print(f"📁 Supported formats: {', '.join(self.supported_extensions)}")
    
    def get_all_files(self):
        """Discover all supported files in the source directory."""
        all_files = []
        file_stats = defaultdict(int)
        
        for root, _, files in os.walk(self.source_dir):
            for file in files:
                file_ext = os.path.splitext(file)[1].lower()
                if file_ext in self.supported_extensions:
                    full_path = os.path.join(root, file)
                    all_files.append(full_path)
                    file_stats[file_ext] += 1
        
        logger.info(f"Found {len(all_files)} documents in corpus")
        return all_files, dict(file_stats)
    
    def select_sample(self, all_files):
        """Select a representative sample using stratified sampling."""
        if len(all_files) <= self.sample_size:
            return all_files
        
        # Stratified sampling by file type and folder
        samples_by_ext = defaultdict(list)
        samples_by_folder = defaultdict(list)
        
        for file_path in all_files:
            ext = os.path.splitext(file_path)[1].lower()
            folder = os.path.basename(os.path.dirname(file_path))
            
            samples_by_ext[ext].append(file_path)
            samples_by_folder[folder].append(file_path)
        
        sample = []
        
        # Ensure representation from each file type
        for ext, files in samples_by_ext.items():
            sample.append(random.choice(files))
        
        # Ensure representation from each folder
        for folder, files in samples_by_folder.items():
            if not any(f in sample for f in files):
                sample.append(random.choice(files))
        
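        # Fill remaining slots randomly (none if the sample is already full)
        chosen = set(sample)
        remaining_files = [f for f in all_files if f not in chosen]
        random.shuffle(remaining_files)
        # Clamp at zero: a negative slice bound would add almost every file
        slots_left = max(0, self.sample_size - len(sample))
        sample.extend(remaining_files[:slots_left])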
        
        logger.info(f"Selected {len(sample)} files for analysis")
        return sample
    
    def analyze_file(self, file_path):
        """Analyze a single document using LLM."""
        try:
            filename = os.path.basename(file_path)
            # Note: on Unix, getctime() is the last metadata change, not the creation time
            creation_time = datetime.fromtimestamp(os.path.getctime(file_path))
            
            # Extract text using OCR/parsing
            text = self.document_processor.extract_text(file_path)
            
            # Analyze with LLM
            result = self.llm_analyzer.analyze_document(text, filename, creation_time, file_path)
            
            # Add metadata
            result["file_path"] = file_path
            result["parent_folder"] = os.path.basename(os.path.dirname(file_path))
            result["file_size"] = os.path.getsize(file_path)
            result["text_length"] = len(text) if text else 0
            
            return result
        except Exception as e:
            logger.error(f"Error analyzing file {file_path}: {str(e)}")
            return None

print("✅ DocumentCategoryAnalyzer class defined successfully!")

In [None]:
def create_file_distribution_chart(file_stats):
    """Create a pie chart showing file type distribution."""
    fig = px.pie(
        values=list(file_stats.values()),
        names=list(file_stats.keys()),
        title="📁 File Type Distribution in Dataset",
        color_discrete_sequence=px.colors.qualitative.Set3
    )
    fig.update_traces(textposition='inside', textinfo='percent+label')
    return fig

def create_category_distribution_chart(categories):
    """Create a bar chart showing category distribution, sorted by count."""
    # Sort categories by count (descending)
    sorted_categories = dict(sorted(categories.items(), key=lambda x: x[1], reverse=True))
    
    fig = px.bar(
        x=list(sorted_categories.keys()),
        y=list(sorted_categories.values()),
        title="📊 Document Categories Distribution",
        labels={'x': 'Category', 'y': 'Number of Documents'},
        color=list(sorted_categories.values()),
        color_continuous_scale='viridis'
    )
    fig.update_layout(showlegend=False)
    return fig

def create_folder_category_heatmap(folder_categories):
    """Create a heatmap showing folder-category relationships."""
    # Convert to DataFrame for heatmap
    all_categories = set()
    for cats in folder_categories.values():
        all_categories.update(cats.keys())
    
    # Sort categories by total count across all folders
    category_totals = defaultdict(int)
    for cats in folder_categories.values():
        for cat, count in cats.items():
            category_totals[cat] += count
    
    sorted_categories = sorted(all_categories, key=lambda x: category_totals[x], reverse=True)
    
    # Sort folders by total document count
    folder_totals = {folder: sum(cats.values()) for folder, cats in folder_categories.items()}
    sorted_folders = sorted(folder_categories.keys(), key=lambda x: folder_totals[x], reverse=True)
    
    data = []
    for folder in sorted_folders:
        cats = folder_categories[folder]
        row = [cats.get(cat, 0) for cat in sorted_categories]
        data.append(row)
    
    fig = go.Figure(data=go.Heatmap(
        z=data,
        x=sorted_categories,
        y=sorted_folders,
        colorscale='Blues',
        showscale=True
    ))
    
    fig.update_layout(
        title="🗂️ Folder-Category Relationship Heatmap",
        xaxis_title="Categories",
        yaxis_title="Folders"
    )
    return fig

def create_document_network_graph(analysis_results):
    """Create a network graph showing document relationships."""
    G = nx.Graph()
    
    # Add nodes for categories
    categories = set(result['category'] for result in analysis_results)
    for cat in categories:
        G.add_node(cat, node_type='category', size=20)
    
    # Add nodes for folders
    folders = set(result['parent_folder'] for result in analysis_results)
    for folder in folders:
        G.add_node(folder, node_type='folder', size=15)
    
    # Add edges between folders and categories
    folder_category_counts = defaultdict(lambda: defaultdict(int))
    for result in analysis_results:
        folder_category_counts[result['parent_folder']][result['category']] += 1
    
    for folder, cats in folder_category_counts.items():
        for cat, count in cats.items():
            G.add_edge(folder, cat, weight=count)
    
    # Create layout
    pos = nx.spring_layout(G, k=2, iterations=50)
    
    # Prepare traces
    edge_trace = []
    for edge in G.edges():
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        edge_trace.append(go.Scatter(
            x=[x0, x1, None], y=[y0, y1, None],
            mode='lines',
            line=dict(width=0.5, color='rgba(125,125,125,0.5)'),
            hoverinfo='none',
            showlegend=False
        ))
    
    # Node traces
    category_nodes = [n for n in G.nodes() if n in categories]
    folder_nodes = [n for n in G.nodes() if n in folders]
    
    fig = go.Figure()
    
    # Add edges
    for trace in edge_trace:
        fig.add_trace(trace)
    
    # Add category nodes
    if category_nodes:
        fig.add_trace(go.Scatter(
            x=[pos[node][0] for node in category_nodes],
            y=[pos[node][1] for node in category_nodes],
            mode='markers+text',
            marker=dict(size=20, color='lightblue', line=dict(width=2, color='darkblue')),
            text=category_nodes,
            textposition="middle center",
            name="Categories",
            hovertemplate="Category: %{text}<extra></extra>"
        ))
    
    # Add folder nodes
    if folder_nodes:
        fig.add_trace(go.Scatter(
            x=[pos[node][0] for node in folder_nodes],
            y=[pos[node][1] for node in folder_nodes],
            mode='markers+text',
            marker=dict(size=15, color='lightgreen', line=dict(width=2, color='darkgreen')),
            text=folder_nodes,
            textposition="middle center",
            name="Folders",
            hovertemplate="Folder: %{text}<extra></extra>"
        ))
    
    fig.update_layout(
        title="🕸️ Document Organization Network",
        showlegend=True,
        hovermode='closest',
        margin=dict(b=20,l=5,r=5,t=40),
        annotations=[
            dict(
                text="Network showing relationships between folders and document categories",
                showarrow=False,
                xref="paper", yref="paper",
                x=0.005, y=-0.002,
                xanchor='left', yanchor='bottom',
                font=dict(color="grey", size=12)
            )
        ],
        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
    )
    
    return fig

print("✅ Visualization helper functions defined!")

## 3. Load and Explore Data 📊

Now let's initialize our analyzer and explore the document collection:

In [None]:
# Configuration
SOURCE_DIR = "e:/dropbox/admin/scanned documents"
SAMPLE_SIZE = 1000  

# Initialize the analyzer
analyzer = DocumentCategoryAnalyzer(SOURCE_DIR, SAMPLE_SIZE)

# Discover all files
all_files, file_stats = analyzer.get_all_files()

print(f"📁 Total files found: {len(all_files)}")
print(f"📊 File type breakdown: {file_stats}")
print("\n📝 Sample files:")
for i, file_path in enumerate(all_files[:5]):
    print(f"  {i+1}. {os.path.basename(file_path)} ({os.path.splitext(file_path)[1]})")
    
if len(all_files) > 5:
    print(f"  ... and {len(all_files) - 5} more files")

In [None]:
# Visualize file type distribution
if file_stats:
    fig = create_file_distribution_chart(file_stats)
    fig.show()
else:
    print("⚠️ No files found to visualize")

# Create a summary DataFrame
file_df = pd.DataFrame([
    {
        'File': os.path.basename(file_path),
        'Extension': os.path.splitext(file_path)[1],
        'Folder': os.path.basename(os.path.dirname(file_path)),
        'Size (KB)': round(os.path.getsize(file_path) / 1024, 2)
    }
    for file_path in all_files
])

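if file_df.empty:
    # describe() raises on a DataFrame without columns
    print("\n⚠️ No files to summarize - check SOURCE_DIR")
else:
    print("\n📈 File Statistics:")
    print(file_df.describe())
    print("\n📋 First few files:")
    print(file_df.head())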

## 4. Select Sample Files 🎯

For efficiency, we'll analyze a sample rather than the full collection. The sampler guarantees at least one file per file type and per folder, then fills the remaining slots at random.
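
To make the sampling idea concrete, here is a minimal, standalone sketch of coverage-first sampling by file extension. It is not the notebook's `select_sample` method: the function name `stratified_sample`, the `seed` parameter, and the example paths are assumptions made for illustration.

```python
import random
from collections import defaultdict
from pathlib import Path


def stratified_sample(paths, sample_size, seed=0):
    """Pick up to sample_size paths, covering every extension first."""
    rng = random.Random(seed)  # local RNG so the sketch is reproducible

    # Group paths by lower-cased extension
    by_ext = defaultdict(list)
    for p in paths:
        by_ext[Path(p).suffix.lower()].append(p)

    # One file per extension, stopping early if sample_size is very small
    sample = []
    for ext in sorted(by_ext):
        if len(sample) >= sample_size:
            break
        sample.append(rng.choice(by_ext[ext]))

    # Fill the remaining slots at random from files not yet chosen
    chosen = set(sample)
    rest = [p for p in paths if p not in chosen]
    rng.shuffle(rest)
    sample.extend(rest[: max(0, sample_size - len(sample))])
    return sample


files = ["a/x.pdf", "a/y.pdf", "b/z.png", "b/w.txt", "c/v.pdf"]
picked = stratified_sample(files, sample_size=4)
print(sorted({Path(p).suffix for p in picked}))  # every extension is represented
```

Clamping the fill step with `max(0, ...)` matters when there are more strata than sample slots, because a negative slice bound would otherwise pull in nearly every remaining file.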

In [None]:
# Select representative sample
sample_files = analyzer.select_sample(all_files)

print(f"🎯 Selected {len(sample_files)} files for analysis:")
print("\n📋 Sample Details:")

sample_data = []
for i, file_path in enumerate(sample_files):
    file_info = {
        'Index': i + 1,
        'Filename': os.path.basename(file_path),
        'Extension': os.path.splitext(file_path)[1],
        'Folder': os.path.basename(os.path.dirname(file_path)),
        'Size (KB)': round(os.path.getsize(file_path) / 1024, 2),
        'Full Path': file_path
    }
    sample_data.append(file_info)
    
sample_df = pd.DataFrame(sample_data)
print(sample_df[['Index', 'Filename', 'Extension', 'Folder', 'Size (KB)']].to_string(index=False))

# Show sampling strategy effectiveness
print("\n🔍 Sampling Strategy Results:")
print(f"📊 Extensions represented: {sample_df['Extension'].nunique()} out of {len(file_stats)}")
print(f"📁 Folders represented: {sample_df['Folder'].nunique()}")
print(f"📈 Coverage: {len(sample_files)} / {len(all_files)} files ({round(100*len(sample_files)/len(all_files), 1)}%)")

## 5. Analyze Files with AI 🤖

Now comes the exciting part - using AI to analyze and categorize our documents!

### How it works:
1. **Text Extraction**: OCR for images, parsing for PDFs
2. **LLM Analysis**: AI reads and understands content
3. **Classification**: Assigns categories based on content
4. **Metadata Extraction**: Finds dates, people, descriptions
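
As a rough illustration of the classification and metadata steps, the sketch below normalizes a raw result into the four fields this notebook prints (`category`, `description`, `identity`, `date`). It is an assumption about how such a step could look, not the behavior of `LLMAnalyzer`: the helper `normalize_result`, its fallback values, and the restriction to a fixed category list are all hypothetical.

```python
def normalize_result(raw, allowed_categories, fallback="Other"):
    """Coerce a raw LLM result dict into the four fields the demo prints."""
    # Case-insensitive lookup back to the canonical category spelling
    lookup = {c.lower(): c for c in allowed_categories}
    category = str(raw.get("category") or "").strip()
    return {
        "category": lookup.get(category.lower(), fallback),
        "description": str(raw.get("description") or "").strip() or "Untitled",
        "identity": str(raw.get("identity") or "").strip() or "Unknown",
        "date": str(raw.get("date") or "").strip() or None,
    }


allowed = ["Medical Documents", "Receipts", "Contracts", "Photographs", "Other"]
raw = {"category": " receipts ", "description": "Pharmacy receipt"}
normalized = normalize_result(raw, allowed)
print(normalized)
```

Mapping unknown labels to a fallback keeps downstream charts tidy, at the cost of hiding new categories the model might propose; whether that trade-off is right depends on the use case.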

In [None]:
# Analyze each file in the sample
analysis_results = []
processing_stats = {
    'successful': 0,
    'failed': 0,
    'total_time': 0.0,
    'errors': []
}

print("🔄 Starting document analysis...\n")

for i, file_path in enumerate(sample_files):
    try:
        start_time = datetime.now()
        print(f"📄 Analyzing {i+1}/{len(sample_files)}: {os.path.basename(file_path)}")
        
        result = analyzer.analyze_file(file_path)
        
        if result:
            analysis_results.append(result)
            processing_stats['successful'] += 1
            
            # Show intermediate results
            print(f"  ✅ Category: {result['category']}")
            print(f"  📝 Description: {result['description']}")
            print(f"  👤 Identity: {result['identity']}")
            print(f"  📅 Date: {result['date']}")
        else:
            processing_stats['failed'] += 1
            print(f"  ❌ Analysis failed")
        end_time = datetime.now()
        duration = (end_time - start_time).total_seconds()
        processing_stats['total_time'] += duration
        print(f"  ⏱️ Time: {duration:.1f}s\n")
        
    except Exception as e:
        processing_stats['failed'] += 1
        processing_stats['errors'].append(f"{os.path.basename(file_path)}: {str(e)}")
        print(f"  ❌ Error: {str(e)}\n")

print("\n📊 Analysis Summary:")
print(f"✅ Successful: {processing_stats['successful']}")
print(f"❌ Failed: {processing_stats['failed']}")
print(f"⏱️ Total time: {processing_stats['total_time']:.1f}s")
# max(..., 1) avoids a ZeroDivisionError when the sample is empty
print(f"📈 Average time per file: {processing_stats['total_time']/max(len(sample_files), 1):.1f}s")

## 5.5. Document Gallery Preview 🖼️

Before we visualize results, let's create a gallery showing thumbnails and the AI‐assigned category for each document.

In [None]:
import base64
from io import BytesIO
from PIL import Image, ImageDraw, ImageFont
import fitz  # PyMuPDF for PDF thumbnails
from IPython.display import HTML, display

def create_thumbnail(file_path, max_size=(150,200)):
    """Create thumbnail for image/PDF/text files."""
    # For images
    if file_path.lower().endswith(('png','jpg','jpeg','gif','bmp','tif','tiff')):
        img = Image.open(file_path)
        img.thumbnail(max_size, Image.LANCZOS)
        return img
    # For PDFs
    elif file_path.lower().endswith('pdf'):
        doc = fitz.open(file_path)
        page = doc[0]
        pix = page.get_pixmap()
        # Handle different color modes properly
        if pix.n == 4:  # RGBA
            img = Image.frombytes("RGBA", [pix.width, pix.height], pix.samples)
            img = img.convert("RGB")  # Convert RGBA to RGB
        else:  # RGB
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        img.thumbnail(max_size, Image.LANCZOS)
        doc.close()  # Close the document to free memory
        return img
    # For other file types, create a placeholder image
    else:
        return create_placeholder_thumbnail(max_size, os.path.splitext(file_path)[1])
    
def create_text_thumbnail(file_path, max_size=(150,200)):
    """Create a simple text-based thumbnail for non-image files."""
    try:
        # Generate a placeholder image with file type and name
        img = Image.new('RGB', max_size, color = (255, 255, 255))
        d = ImageDraw.Draw(img)
        text = os.path.basename(file_path)
        text = text if len(text) <= 20 else text[:17] + '...'  # Truncate long names
        d.text((10,10), text, fill=(0,0,0))
        return img
    except Exception as e:
        print(f"Error creating text thumbnail for {file_path}: {e}")
        return None

def create_placeholder_thumbnail(max_size=(150,200), file_ext=""):
    """Create a placeholder thumbnail image."""
    try:
        # Generate a simple placeholder image with file extension
        img = Image.new('RGB', max_size, color = (220, 220, 220))
        d = ImageDraw.Draw(img)
        text = file_ext.upper() if file_ext else "FILE"
        d.text((10,10), text, fill=(0,0,0))
        return img
    except Exception as e:
        print(f"Error creating placeholder thumbnail: {e}")
        return None

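def image_to_base64(img):
    """Encode a PIL image as a base64 PNG string."""
    # PNG cannot store every PIL mode (e.g. CMYK), so convert when needed
    if img.mode not in ("RGB", "RGBA", "L"):
        img = img.convert("RGB")
    buffer = BytesIO()
    img.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()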

def format_file_size(size_bytes):
    """Human-readable file size."""
    for unit in ['B', 'KB', 'MB', 'GB']:
        if size_bytes < 1024:
            return f"{size_bytes:.1f} {unit}"
        size_bytes /= 1024
    return f"{size_bytes:.1f} TB"

# Build gallery items using AI-assigned categories
gallery_items = []
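for idx, fp in enumerate(sample_files, start=1):
    category = next((r['category'] for r in analysis_results if r['file_path']==fp), "Unknown")
    try:
        thumb = create_thumbnail(fp)
    except Exception as e:
        # A corrupted or unreadable file should not break the whole gallery
        print(f"⚠️ Could not create thumbnail for {os.path.basename(fp)}: {e}")
        thumb = None
    if thumb is None:
        thumb = (create_placeholder_thumbnail(file_ext=os.path.splitext(fp)[1])
                 or Image.new('RGB', (150, 200), color=(220, 220, 220)))
    thumb_b64 = "data:image/png;base64," + image_to_base64(thumb)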
    gallery_items.append({
        'index': idx,
        'filename': os.path.basename(fp),
        'size': format_file_size(os.path.getsize(fp)),
        'folder': os.path.basename(os.path.dirname(fp)),
        'category': category,
        'thumbnail': thumb_b64
    })

# Render HTML gallery
html = """
<style>
.gallery-container {
    display: grid;
    grid-template-columns: repeat(auto-fill, minmax(160px, 1fr));
    gap: 10px;
}
.gallery-item {
    background: #fff;
    border: 1px solid #ddd;
    border-radius: 8px;
    overflow: hidden;
    padding: 8px;
    text-align: center;
}
.thumbnail {
    width: 100%;
    height: auto;
    border-radius: 4px;
}
.item-info {
    margin-top: 8px;
    font-size: 14px;
}
.filename {
    font-weight: bold;
}
.category {
    color: #555;
}
.file-details {
    font-size: 12px;
    color: #777;
}
</style>
<div class="gallery-container">
"""
from html import escape  # filenames, folders, and categories may contain <, > or &

for item in gallery_items:
    html += f"""
    <div class="gallery-item">
        <img src="{item['thumbnail']}" class="thumbnail">
        <div class="item-info">
            <div class="filename">📄 {escape(str(item['filename']))}</div>
            <div class="category">{escape(str(item['category']))}</div>
            <div class="file-details">💾 {item['size']} | 📂 {escape(str(item['folder']))}</div>
        </div>
    </div>
    """
html += "</div>"

display(HTML(html))

### 🎓 What Just Happened?

The AI system just performed several complex tasks:

1. **📖 Text Extraction**: Read the content from each document
2. **🧠 Content Understanding**: Used the LLM to understand what each document is about
3. **🏷️ Categorization**: Assigned appropriate categories based on content
4. **👤 Identity Detection**: Looked for person names (Chuck, Colleen) in the documents
5. **📅 Date Extraction**: Found relevant dates in the document content
6. **📝 Description Generation**: Created brief, descriptive titles

**Key AI/ML Concepts Demonstrated:**
- **Natural Language Processing (NLP)**: Understanding human-readable text
- **Named Entity Recognition (NER)**: Identifying people, dates, organizations
- **Text Classification**: Automatically categorizing documents
- **Information Extraction**: Pulling structured data from unstructured text
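
For a feel of what information extraction means at its simplest, here is a rule-based sketch that pulls ISO-formatted dates out of free text. It is only a stand-in for illustration: the notebook relies on the LLM for this step, and the `extract_dates` helper and sample sentence are hypothetical.

```python
import re
from datetime import datetime

# Matches dates written as YYYY-MM-DD
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_dates(text):
    """Return all valid YYYY-MM-DD dates found in text, in order."""
    dates = []
    for m in ISO_DATE.finditer(text):
        try:
            dates.append(datetime(int(m[1]), int(m[2]), int(m[3])).date())
        except ValueError:
            # Looks like a date but is not one (e.g. month 13); skip it
            continue
    return dates


sample = "Invoice issued 2023-04-15, due 2023-05-15. Ref 2023-13-40."
found = extract_dates(sample)
print(found)
```

A pattern like this only handles one date format; the appeal of an LLM is that it can cope with "15 April 2023" or "due next Friday" without a new rule for each variant.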

---

In [None]:
# Create a comprehensive results DataFrame
if analysis_results:
    results_df = pd.DataFrame(analysis_results)
    
    print("\n📋 Analysis Results Summary:")
    print(results_df[['description', 'category', 'identity', 'date', 'parent_folder']].to_string(index=False))
    
    # Category distribution
    category_counts = Counter(result['category'] for result in analysis_results)
    print(f"\n📊 Category Distribution:")
    for category, count in category_counts.most_common():
        print(f"  {category}: {count} documents")
    
    # Identity distribution
    identity_counts = Counter(result['identity'] for result in analysis_results)
    print(f"\n👤 Identity Distribution:")
    for identity, count in identity_counts.most_common():
        print(f"  {identity}: {count} documents")
    
    # Folder analysis
    folder_counts = Counter(result['parent_folder'] for result in analysis_results)
    print(f"\n📁 Folder Distribution:")
    for folder, count in folder_counts.most_common():
        print(f"  {folder}: {count} documents")
        
else:
    print("⚠️ No analysis results to display")

## 6. AI-Generated Category Suggestions 💡

Based on the document analysis, let's ask the AI to suggest an improved categorization system:
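
Because the model is asked to reply with JSON, the reply has to be parsed defensively: models often wrap JSON in a Markdown code fence or surround it with extra prose. The sketch below shows one way to pull the first JSON object out of such a reply using only the standard library. The `extract_json` helper and the sample reply are hypothetical and are not part of `LLMAnalyzer`.

```python
import json
import re

TICKS = "`" * 3  # a Markdown code fence delimiter
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def extract_json(text):
    """Return the first JSON object found in text, or None."""
    # Prefer fenced blocks, then fall back to scanning the whole reply
    candidates = [m.group(1) for m in FENCE.finditer(text)] + [text]
    decoder = json.JSONDecoder()
    for candidate in candidates:
        start = candidate.find("{")
        while start != -1:
            try:
                obj, _ = decoder.raw_decode(candidate, start)
                return obj
            except json.JSONDecodeError:
                # Not valid JSON from this brace; try the next one
                start = candidate.find("{", start + 1)
    return None


reply = (
    "Sure! Here you go:\n" + TICKS + "json\n"
    '{"categories": [{"name": "Taxes"}]}\n' + TICKS + "\nHope it helps {smile}."
)
parsed = extract_json(reply)
print(parsed)
```

Using `raw_decode` lets the parser stop at the end of the first valid object, so trailing text that happens to contain braces does not break the result.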

In [None]:
# Re-run the sample analysis with enhanced error handling (safe_file_analysis).
# Note: this replaces the analysis_results collected in Section 5.
analysis_results = []
processing_stats = {
    'successful': 0,
    'failed': 0,
    'total_time': 0,
    'errors': []
}

print("🔄 Starting document analysis...\n")

for i, file_path in enumerate(sample_files):
    start_time = datetime.now()
    print(f"📄 Analyzing {i+1}/{len(sample_files)}: {os.path.basename(file_path)}")
    
    # Use safe analysis function
    result, error = safe_file_analysis(analyzer, file_path)
    
    if result:
        analysis_results.append(result)
        processing_stats['successful'] += 1
        
        # Show intermediate results
        print(f"  ✅ Category: {result['category']}")
        print(f"  📝 Description: {result['description']}")
        print(f"  👤 Identity: {result['identity']}")
        print(f"  📅 Date: {result['date']}")
        print(f"  📊 Text Length: {result.get('text_length', 0)} characters")
    else:
        processing_stats['failed'] += 1
        processing_stats['errors'].append(f"{os.path.basename(file_path)}: {error}")
        print(f"  ❌ Analysis failed: {error}")
        
    end_time = datetime.now()
    duration = (end_time - start_time).total_seconds()
    processing_stats['total_time'] += duration
    print(f"  ⏱️ Time: {duration:.1f}s\n")

print("\n📊 Analysis Summary:")
print("=" * 40)
print(f"✅ Successful: {processing_stats['successful']}")
print(f"❌ Failed: {processing_stats['failed']}")
print(f"⏱️ Total time: {processing_stats['total_time']:.1f}s")
n_files = max(len(sample_files), 1)  # avoid division by zero on an empty sample
print(f"📈 Average time per file: {processing_stats['total_time']/n_files:.1f}s")
print(f"💯 Success rate: {(processing_stats['successful']/n_files*100):.1f}%")

if processing_stats['errors']:
    print(f"\n⚠️ Errors encountered:")
    for error in processing_stats['errors']:
        print(f"   • {error}")

In [None]:
def suggest_categories(analyzer, analysis_results):
    """Generate improved category suggestions using LLM."""
    if not analysis_results:
        print("⚠️ No analysis results available for category suggestions")
        return None
    
    # Extract data for LLM prompt
    current_categories = [result["category"] for result in analysis_results]
    descriptions = [result["description"] for result in analysis_results]
    parent_folders = [result["parent_folder"] for result in analysis_results]
    
    # Create comprehensive prompt using ALL available data
    prompt = f"""
    I need to organize personal and family documents into meaningful categories.
    
    Currently, I'm using these categories:
    {', '.join(analyzer.existing_categories)}
    
    I also have documents organized in these folders:
    {', '.join(set(parent_folders))}
    
    Here are ALL the document descriptions from my collection:
    {', '.join(descriptions)}
    
    The current categories assigned to these documents are:
    {', '.join(current_categories)}
    
    Based on this information, please suggest:
    1. An improved list of 10-20 categories that would be most useful for organizing and finding documents
    2. A brief explanation of each category
    3. Examples of what types of documents belong in each category
    
    Focus on categories that are:
    - Mutually exclusive (minimal overlap)
    - Collectively exhaustive (cover all document types)
    - Intuitive for everyday use
    - Useful for quickly finding specific documents
    
    Return your response as a JSON object with this structure:
    {{
        "categories": [
            {{
                "name": "Category Name",
                "description": "Brief description",
                "examples": ["Example doc 1", "Example doc 2"]
            }},
            ...
        ]
    }}
    """
    
    try:
        print("🤔 Asking AI for category suggestions...")
        response = analyzer.llm.invoke(prompt)
        print("✅ AI response received!")
        
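        # Assumption: some LLM wrappers return a message object with a
        # .content attribute rather than a plain string; normalize to str.
        response = getattr(response, "content", response)
        if not isinstance(response, str):
            response = str(response)

        # Extract JSON from response (fenced json block, or the outermost {...} span)
        json_match = re.search(r'```json\s*(.*?)\s*```|\{.*\}', response, re.DOTALL)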
        if json_match:
            json_str = json_match.group(1) if json_match.group(1) else json_match.group(0)
            suggestions = json.loads(json_str)
            return suggestions
        else:
            print("⚠️ Could not parse JSON from AI response")
            print(f"Raw response: {response[:500]}...")
            return None
            
    except Exception as e:
        print(f"❌ Error getting category suggestions: {str(e)}")
        return None

# Generate suggestions
suggested_categories = suggest_categories(analyzer, analysis_results)

In [None]:
# Display the suggested categories
if suggested_categories and "categories" in suggested_categories:
    print("\n🎯 AI-Suggested Document Categories:")
    print("=" * 50)
    
    for i, cat in enumerate(suggested_categories["categories"], 1):
        print(f"\n{i}. **{cat['name']}**")
        print(f"   📝 Description: {cat['description']}")
        print(f"   📋 Examples: {', '.join(cat['examples'])}")
    
    # Create comparison DataFrame
    old_categories = analyzer.existing_categories
    new_categories = [cat['name'] for cat in suggested_categories['categories']]
    
    comparison_data = {
        'Aspect': ['Number of Categories', 'Specificity', 'Coverage'],
        'Original System': [len(old_categories), 'Basic', 'General'],
        'AI-Suggested System': [len(new_categories), 'Detailed', 'Comprehensive']
    }
    
    comparison_df = pd.DataFrame(comparison_data)
    print("\n📊 System Comparison:")
    print(comparison_df.to_string(index=False))
    
    # Category evolution chart
    fig = go.Figure()
    
    fig.add_trace(go.Bar(
        name='Original Categories',
        x=old_categories,
        y=[1] * len(old_categories),
        marker_color='lightblue'
    ))
    
    fig.add_trace(go.Bar(
        name='AI-Suggested Categories',
        x=new_categories,
        y=[1] * len(new_categories),
        marker_color='lightgreen',
        yaxis='y2'
    ))
    
    fig.update_layout(
        title="📈 Category System Evolution: Original vs AI-Suggested",
        xaxis_title="Categories",
        yaxis=dict(title="Original System", side="left"),
        yaxis2=dict(title="AI-Suggested System", side="right", overlaying="y"),
        barmode='group',
        height=500
    )
    
    fig.show()
    
else:
    print("⚠️ No category suggestions available")
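
The comparison table above only contrasts the number of categories. A rough sketch of a more concrete comparison is to check which names survive, disappear, or are new. It assumes `analyzer.existing_categories` is a plain list of category names (the notebook already uses it that way for the bar chart); `compare_category_sets` is a hypothetical helper.

```python
def compare_category_sets(old_categories, new_categories):
    """Split category names into kept, dropped, and added groups.

    Names are matched case-insensitively after trimming whitespace.
    Hypothetical helper for illustration only.
    """
    def normalize(name):
        return name.strip().lower()

    # Map normalized name -> original spelling for each system.
    old = {normalize(c): c for c in old_categories}
    new = {normalize(c): c for c in new_categories}

    return {
        "kept": sorted(new[k] for k in old.keys() & new.keys()),
        "dropped": sorted(old[k] for k in old.keys() - new.keys()),
        "added": sorted(new[k] for k in new.keys() - old.keys()),
    }
```

Calling `compare_category_sets(old_categories, new_categories)` in the cell above would give three lists that can be printed next to the summary table. Exact name matching is a simplification: a renamed category shows up as one dropped and one added entry.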

## 7. Visualize Results 📊🎨

Let's create comprehensive visualizations to understand our document analysis results:

In [None]:
if analysis_results:
    # 1. Category Distribution Chart
    categories = Counter(result['category'] for result in analysis_results)
    fig1 = create_category_distribution_chart(categories)
    fig1.show()
    
    # 2. Folder-Category Relationship
    folder_categories = defaultdict(lambda: defaultdict(int))
    for result in analysis_results:
        folder_categories[result['parent_folder']][result['category']] += 1
    
    # Convert the nested defaultdicts to plain dicts before passing them to the heatmap helper
    folder_categories = {k: dict(v) for k, v in folder_categories.items()}
    
    if len(folder_categories) > 1:  # Only show if multiple folders
        fig2 = create_folder_category_heatmap(folder_categories)
        fig2.show()
    
    # 3. Document Properties Analysis
    properties_df = pd.DataFrame(analysis_results)
    
    # File size vs text length correlation
    fig3 = px.scatter(
        properties_df,
        x='file_size',
        y='text_length',
        color='category',
        title="📄 Document Size vs Text Content",
        labels={'file_size': 'File Size (bytes)', 'text_length': 'Extracted Text Length'},
        hover_data=['description', 'parent_folder']
    )
    fig3.show()
    
else:
    print("⚠️ No results available for visualization")
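
The folder-category counts built above are a nested dictionary, while a heatmap needs a dense grid. The helper `create_folder_category_heatmap` is defined elsewhere in this notebook, so the following is only an assumed illustration of that conversion step, with the hypothetical name `to_heatmap_matrix`.

```python
def to_heatmap_matrix(folder_categories):
    """Turn {folder: {category: count}} into (folders, categories, z).

    `z[i][j]` is the number of documents in folders[i] with categories[j];
    combinations that never occur are filled with 0.
    Hypothetical helper for illustration only.
    """
    folders = sorted(folder_categories)
    categories = sorted(
        {cat for counts in folder_categories.values() for cat in counts}
    )
    z = [
        [folder_categories[folder].get(cat, 0) for cat in categories]
        for folder in folders
    ]
    return folders, categories, z
```

The three return values line up with the `y`, `x`, and `z` arguments of `go.Heatmap`, for example `go.Heatmap(z=z, x=categories, y=folders)`.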

In [None]:
if analysis_results and len(analysis_results) > 1:
    # 4. Document Network Graph
    print("🕸️ Creating document organization network...")
    fig4 = create_document_network_graph(analysis_results)
    fig4.show()
    
    # 5. Timeline Analysis (if dates available)
    if analysis_results:
        try:
            # Parse and filter dates flexibly using pandas
            valid_entries = []
            for res in analysis_results:
                dstr = res.get('date')
                if not dstr or dstr == 'Unknown':
                    continue
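                d = pd.to_datetime(dstr, errors='coerce')
                if pd.isna(d):
                    continue
                # Drop any timezone so the comparison with the naive "now" below cannot raise
                if d.tzinfo is not None:
                    d = d.tz_localize(None)
                # Filter out implausible dates
                if d.year < 1900 or d > pd.Timestamp.now():
                    continue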
                valid_entries.append({
                    'Start': d,
                    'End': d + pd.Timedelta(days=1),
                    'Document': res['description'],
                    'Category': res['category']
                })
            if valid_entries:
                timeline_df = pd.DataFrame(valid_entries)
                fig5 = px.timeline(
                    timeline_df,
                    x_start='Start',
                    x_end='End',
                    y='Document',
                    color='Category',
                    title="📅 Document Timeline"
                )
                fig5.show()
            else:
                print("⚠️ No valid dates for timeline")
        except Exception as e:
            print(f"⚠️ Could not create timeline: {e}")
    
    # 6. Advanced Analytics Dashboard
    fig6 = make_subplots(
        rows=2, cols=2,
        subplot_titles=(
            "Categories by Count",
            "Identity Distribution", 
            "Folder Distribution",
            "Text Length Distribution"
        ),
        specs=[[{"type": "bar"}, {"type": "pie"}],
               [{"type": "bar"}, {"type": "histogram"}]]
    )
    
    # Category counts - sorted by count descending
    cat_counts = Counter(r['category'] for r in analysis_results)
    sorted_cat_counts = dict(cat_counts.most_common())
    fig6.add_trace(
        go.Bar(x=list(sorted_cat_counts.keys()), y=list(sorted_cat_counts.values()), name="Categories"),
        row=1, col=1
    )
    
    # Identity pie - sorted by count descending
    id_counts = Counter(r['identity'] for r in analysis_results)
    sorted_id_counts = dict(id_counts.most_common())
    fig6.add_trace(
        go.Pie(labels=list(sorted_id_counts.keys()), values=list(sorted_id_counts.values()), name="Identity"),
        row=1, col=2
    )
    
    # Folder distribution - sorted by count descending
    folder_counts = Counter(r['parent_folder'] for r in analysis_results)
    sorted_folder_counts = dict(folder_counts.most_common())
    fig6.add_trace(
        go.Bar(x=list(sorted_folder_counts.keys()), y=list(sorted_folder_counts.values()), name="Folders"),
        row=2, col=1
    )
    
    # Text length histogram
    text_lengths = [r.get('text_length', 0) for r in analysis_results]
    fig6.add_trace(
        go.Histogram(x=text_lengths, name="Text Length"),
        row=2, col=2
    )
    
    fig6.update_layout(
        title_text="📊 Comprehensive Document Analysis Dashboard",
        height=600,
        showlegend=False
    )
    fig6.show()

else:
    print("⚠️ Need at least two analyzed documents for the network graph, timeline, and dashboard")
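
The timeline step above keeps a document only when its date is present, parseable, and plausible. The same three checks can be sketched with the standard library alone. Note the swap: this version uses `datetime.fromisoformat`, so it accepts only ISO 8601 strings, unlike the flexible `pd.to_datetime` parser used in the cell. The name `parse_plausible_date` and the choice to treat naive dates as UTC are assumptions.

```python
from datetime import datetime, timezone


def parse_plausible_date(value, min_year=1900, now=None):
    """Return a UTC datetime, or None if missing, unparseable, or implausible.

    Hypothetical helper for illustration only.
    """
    # Check 1: missing or placeholder values.
    if not value or value == "Unknown":
        return None

    # Check 2: the value must parse as an ISO 8601 date or datetime.
    try:
        parsed = datetime.fromisoformat(value)
        if parsed.tzinfo is None:
            # Assumption: dates without a timezone are treated as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        else:
            parsed = parsed.astimezone(timezone.utc)
    except (TypeError, ValueError, OverflowError):
        return None

    # Check 3: reject dates before `min_year` or in the future.
    now = now or datetime.now(timezone.utc)
    if parsed.year < min_year or parsed > now:
        return None
    return parsed
```

Normalizing everything to UTC before comparing avoids mixing timezone-aware and naive values, which Python refuses to compare.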

## 8. Save and Display Results 💾📋

Finally, let's save our analysis results and create a comprehensive summary:

In [None]:
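def save_analysis_results(analysis_results, suggested_categories, filename="category_analysis_demo_results.json"):
    """Save all analysis results to a JSON file.

    Relies on notebook globals defined in earlier cells: `analyzer`,
    `all_files`, `processing_stats`, and `file_stats`.
    Returns the saved results dict, or None when there is nothing to save.
    """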
    
    if not analysis_results:
        print("⚠️ No analysis results to save")
        return
    
    # Compute statistics
    categories = Counter(result["category"] for result in analysis_results)
    folder_categories = defaultdict(list)
    
    for result in analysis_results:
        folder_categories[result["parent_folder"]].append(result["category"])
    
    # Collapse each folder's category list into a plain {category: count} dict
    folder_categories = {k: dict(Counter(v)) for k, v in folder_categories.items()}
    
    # Create comprehensive results object
    results = {
        "analysis_metadata": {
            "timestamp": datetime.now().isoformat(),
            "source_directory": analyzer.source_dir,
            "total_files_found": len(all_files),
            "files_analyzed": len(analysis_results),
            "sample_size": analyzer.sample_size,
            "processing_stats": processing_stats
        },
        "file_statistics": {
            "file_type_distribution": file_stats,
            "current_categories": dict(categories),
            "folder_categories": folder_categories
        },
        "analysis_results": analysis_results,
        "suggested_categories": suggested_categories,
        "insights": {
            "most_common_category": categories.most_common(1)[0] if categories else None,
            "total_categories_found": len(categories),
            "average_text_length": sum(r.get('text_length', 0) for r in analysis_results) / len(analysis_results) if analysis_results else 0,
            "folders_analyzed": len(folder_categories)
        }
    }
    
    # Save to file
    with open(filename, "w", encoding='utf-8') as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    
    print(f"💾 Results saved to: {filename}")
    return results

# Save the results
saved_results = save_analysis_results(analysis_results, suggested_categories)

if saved_results:
    print("\n📊 Final Analysis Summary:")
    print("=" * 50)
    
    metadata = saved_results["analysis_metadata"]
    insights = saved_results["insights"]
    
    print(f"📅 Analysis Date: {metadata['timestamp'][:19]}")
    print(f"📁 Source Directory: {metadata['source_directory']}")
    print(f"📄 Files Found: {metadata['total_files_found']}")
    print(f"🔍 Files Analyzed: {metadata['files_analyzed']}")
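    successful = processing_stats['successful']
    attempted = len(sample_files)
    # Guard against an empty sample so the percentage never divides by zero
    success_pct = round(100 * successful / attempted, 1) if attempted else 0.0
    print(f"✅ Success Rate: {successful}/{attempted} ({success_pct}%)")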
    
    if insights['most_common_category']:
        print(f"📊 Most Common Category: {insights['most_common_category'][0]} ({insights['most_common_category'][1]} documents)")
    
    print(f"🏷️ Categories Discovered: {insights['total_categories_found']}")
    print(f"📝 Average Text Length: {round(insights['average_text_length'], 1)} characters")
    print(f"📂 Folders Analyzed: {insights['folders_analyzed']}")
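
One thing to watch when saving: `json.dump` only understands basic types (dicts, lists, strings, numbers, booleans, `None`) and raises `TypeError` for anything else. Whether the results here contain other types depends on what `LLMAnalyzer` returns, which this notebook does not show. If they do, a `default=` hook is one way to convert them. The function `json_fallback` below is a hypothetical sketch.

```python
import json
from datetime import datetime
from pathlib import Path


def json_fallback(value):
    """Convert a few common non-JSON types; reject everything else.

    Hypothetical helper meant to be passed as `default=` to json.dump.
    """
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, Path):
        return str(value)
    if isinstance(value, (set, frozenset)):
        # Sets have no order, so sort for a stable file.
        return sorted(value, key=str)
    raise TypeError(f"Cannot serialize {type(value).__name__} to JSON")
```

It would be used as `json.dump(results, f, indent=2, ensure_ascii=False, default=json_fallback)`. Raising for unknown types keeps unexpected values visible instead of silently turning them into strings.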

In [None]:
# Generate actionable recommendations
print("\n💡 Actionable Recommendations:")
print("=" * 40)

if analysis_results:
    recommendations = []
    
    # Category-based recommendations
    categories = Counter(result['category'] for result in analysis_results)
    if len(categories) > 1:
        recommendations.append(f"✅ {len(categories)} distinct categories detected - the analysis is separating document types")
    else:
        recommendations.append("⚠️ Low category diversity - consider reviewing categorization criteria")
    
    # Folder organization recommendations
    folder_categories = defaultdict(set)
    for result in analysis_results:
        folder_categories[result['parent_folder']].add(result['category'])
    
    mixed_folders = {folder: cats for folder, cats in folder_categories.items() if len(cats) > 1}
    if mixed_folders:
        recommendations.append(f"📁 {len(mixed_folders)} folders contain mixed categories - consider reorganization")
    
    # Text extraction recommendations
    avg_text_length = sum(r.get('text_length', 0) for r in analysis_results) / len(analysis_results)
    if avg_text_length < 100:
        recommendations.append("📝 Low average text extraction (under 100 characters per document) - consider OCR optimization")
    
    # Identity detection recommendations
    identities = Counter(result['identity'] for result in analysis_results)
    unknown_ratio = identities.get('Unknown', 0) / len(analysis_results)
    if unknown_ratio > 0.5:
        recommendations.append("👤 High unknown identity rate - consider improving name detection")
    
    for i, rec in enumerate(recommendations, 1):
        print(f"{i}. {rec}")

else:
    print("⚠️ No analysis results available for recommendations")

print("\n🎓 Learning Outcomes Achieved:")
print("=" * 35)
learning_outcomes = [
    "✅ Understanding OCR and text extraction from documents",
    "✅ Experience with Large Language Model (LLM) analysis",
    "✅ Knowledge of automated document categorization",
    "✅ Data visualization and interpretation skills",
    "✅ Practical application of AI in document management",
    "✅ Understanding of sampling strategies for large datasets"
]

for outcome in learning_outcomes:
    print(outcome)

print("\n🚀 Next Steps:")
print("=" * 15)
next_steps = [
    "1. Try with your own document collection",
    "2. Experiment with different LLM models",
    "3. Implement automated file organization",
    "4. Add more sophisticated text preprocessing",
    "5. Create a web interface for the system",
    "6. Explore clustering algorithms for unsupervised categorization"
]

for step in next_steps:
    print(step)

print("\n" + "=" * 60)
print("🎉 Document Category Analysis Demo Complete! 🎉")
print("=" * 60)
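
As a follow-up to the mixed-folder recommendation above, which only counts folders holding more than one category, a rough sketch of a finer measure is the share of each folder's documents that belong to its dominant category. It assumes each result dict has the `parent_folder` and `category` keys used throughout this notebook; `folder_purity` is a hypothetical name.

```python
from collections import Counter, defaultdict


def folder_purity(results):
    """Map each folder to (dominant_category, share_of_documents).

    A share of 1.0 means every document in the folder has the same category.
    Hypothetical helper for illustration only.
    """
    per_folder = defaultdict(Counter)
    for result in results:
        per_folder[result["parent_folder"]][result["category"]] += 1

    purity = {}
    for folder, counts in per_folder.items():
        category, count = counts.most_common(1)[0]
        purity[folder] = (category, count / sum(counts.values()))
    return purity
```

Folders with a low share are the strongest candidates for reorganization, while a folder at 0.9 may only contain one stray file.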