# Lab 02: Advanced Optical Character Recognition (OCR)

## Overview
This notebook covers OCR techniques beyond basic text extraction, using Azure AI Vision and Azure Document Intelligence (formerly Form Recognizer). You'll extract text from complex documents, handle multiple languages, process handwritten text, and pull structured data from forms and invoices.

## Advanced Topics Covered
- Handwritten text recognition
- Multi-language text detection and extraction
- Document layout analysis and table extraction
- Complex document processing (invoices, forms, receipts)
- OCR accuracy optimization techniques
- Batch document processing
- Text post-processing and error correction
- Integration with Azure Document Intelligence (formerly Form Recognizer) for structured data
- PDF document analysis

## Setup

In [None]:
!pip install azure-ai-vision-imageanalysis azure-ai-formrecognizer azure-identity python-dotenv pillow matplotlib pdf2image -q

In [None]:
import os
import json
from pathlib import Path
from typing import List, Dict, Tuple
from dotenv import load_dotenv
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from azure.identity import DefaultAzureCredential
from PIL import Image, ImageDraw, ImageFont
from IPython.display import display, HTML, Markdown
import matplotlib.pyplot as plt
import matplotlib.patches as patches

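# Load configuration
load_dotenv('python/.env')
vision_endpoint = os.getenv("AI_SERVICE_ENDPOINT")
vision_key = os.getenv("AI_SERVICE_KEY")

# Fail early with a clear message instead of a TypeError from AzureKeyCredential(None)
if not vision_endpoint or not vision_key:
    raise RuntimeError("Set AI_SERVICE_ENDPOINT and AI_SERVICE_KEY in python/.env")

# Initialize Vision client
vision_client = ImageAnalysisClient(
    endpoint=vision_endpoint,
    credential=AzureKeyCredential(vision_key)
)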

# Initialize Document Intelligence client (if available)
doc_intel_endpoint = os.getenv("DOCUMENT_INTELLIGENCE_ENDPOINT")
doc_intel_key = os.getenv("DOCUMENT_INTELLIGENCE_KEY")

if doc_intel_endpoint and doc_intel_key:
    doc_client = DocumentAnalysisClient(
        endpoint=doc_intel_endpoint,
        credential=AzureKeyCredential(doc_intel_key)
    )
    print("✓ Document Intelligence client initialized")
else:
    doc_client = None
    print("⚠ Document Intelligence credentials not found. Some features will be unavailable.")

print("✓ Vision client initialized")

## 1. Handwritten Text Recognition

Extract text from handwritten notes, forms, and documents with Azure AI Vision's OCR capabilities.

In [None]:
def recognize_handwriting(image_path: str, visualize: bool = True) -> Dict:
    """Recognize handwritten text from an image."""
    with open(image_path, "rb") as img_file:
        image_data = img_file.read()
    
    # Analyze image for text
    result = vision_client.analyze(
        image_data=image_data,
        visual_features=[VisualFeatures.READ]
    )
    
    extracted_text = []
    handwritten_lines = []
    
    if result.read:
        for block in result.read.blocks:
            for line in block.lines:
                # Note: This is a simplified heuristic. Low word confidence may
                # indicate handwriting, but the Image Analysis Read result does not
                # expose a handwritten vs. printed classification.
                # For production use, consider Azure Document Intelligence's handwriting detection.
                confidences = [word.confidence for word in line.words]
                is_handwritten = any(c < 0.9 for c in confidences)
                
                line_info = {
                    "text": line.text,
                    # Guard against lines with no words to avoid ZeroDivisionError
                    "confidence": sum(confidences) / len(confidences) if confidences else 0.0,
                    "bounding_box": line.bounding_polygon,
                    "handwritten": is_handwritten
                }
                extracted_text.append(line_info)
                
                if is_handwritten:
                    handwritten_lines.append(line.text)
    
    # Visualize results
    if visualize:
        img = Image.open(image_path)
        fig, ax = plt.subplots(1, figsize=(12, 8))
        ax.imshow(img)
        
        for line_info in extracted_text:
            polygon = line_info["bounding_box"]
            points = [(p.x, p.y) for p in polygon]
            points.append(points[0])  # Close polygon
            
            color = 'red' if line_info["handwritten"] else 'green'
            xs, ys = zip(*points)
            ax.plot(xs, ys, color=color, linewidth=2)
        
        ax.set_title("Handwritten (Red) vs Printed (Green) Text Detection")
        ax.axis('off')
        plt.tight_layout()
        plt.show()
    
    return {
        "all_text": extracted_text,
        "handwritten_lines": handwritten_lines,
        "total_lines": len(extracted_text),
        "handwritten_count": len(handwritten_lines)
    }

# Example usage
print("\n=== Handwritten Text Recognition ===")
print("Tip: For best results with handwritten text:")
print("  - Ensure good lighting and contrast")
print("  - Use high-resolution images")
print("  - Avoid blurry or skewed images")
print("\nTest with your own handwritten document images.")
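
The comment in the cell above points to Azure Document Intelligence for real handwriting detection. As a minimal sketch, assuming an analysis result that exposes `content` plus a `styles` list whose entries carry `is_handwritten`, `confidence`, and `spans` (the shape `azure-ai-formrecognizer` returns from `begin_analyze_document`), the handwritten segments could be collected as follows. The names `handwritten_segments` and `fake_result` are hypothetical, and the stand-in object exists only so the helper can be tried offline.

```python
from types import SimpleNamespace
from typing import List


def handwritten_segments(result, min_confidence: float = 0.5) -> List[str]:
    """Return the substrings of result.content that a style marks as handwritten."""
    segments = []
    for style in (result.styles or []):
        # Skip printed text and low-confidence style predictions
        if not style.is_handwritten or (style.confidence or 0.0) < min_confidence:
            continue
        # Each span is an (offset, length) window into the full content string
        for span in style.spans:
            segments.append(result.content[span.offset:span.offset + span.length])
    return segments


# Stand-in shaped like an analysis result (assumption: for offline illustration only)
fake_result = SimpleNamespace(
    content="Invoice 42 approved by J. Doe",
    styles=[
        SimpleNamespace(
            is_handwritten=True,
            confidence=0.9,
            spans=[SimpleNamespace(offset=23, length=6)],
        )
    ],
)
print(handwritten_segments(fake_result))
```

With a configured `doc_client`, the same helper could be applied to the result of a `prebuilt-layout` analysis instead of the stand-in.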

## 2. Multi-Language Text Detection

Detect and extract text in multiple languages from a single document.

In [None]:
def detect_multilingual_text(image_path: str) -> Dict:
    """Detect text in multiple languages."""
    with open(image_path, "rb") as img_file:
        image_data = img_file.read()
    
    result = vision_client.analyze(
        image_data=image_data,
        visual_features=[VisualFeatures.READ]
    )
    
    language_blocks = {}
    
    if result.read:
        for block in result.read.blocks:
            for line in block.lines:
                # Classify by Unicode script range (simplified; for real language
                # identification, use a language detection library)
                text = line.text
                
                # Check kana before Han characters: Japanese text usually mixes kana
                # with kanji, and kanji share the range used for Chinese.
                if any('\u3040' <= char <= '\u309f' or '\u30a0' <= char <= '\u30ff' for char in text):
                    lang = "Japanese"
                elif any('\u4e00' <= char <= '\u9fff' for char in text):
                    lang = "Chinese"
                elif any('\u0600' <= char <= '\u06ff' for char in text):
                    lang = "Arabic"
                elif any('\u0400' <= char <= '\u04ff' for char in text):
                    lang = "Cyrillic"
                elif any('\uac00' <= char <= '\ud7af' for char in text):
                    lang = "Korean"
                else:
                    lang = "Latin-based"
                
                if lang not in language_blocks:
                    language_blocks[lang] = []
                language_blocks[lang].append(text)
    
    # Display results
    print("\n=== Multi-Language Detection Results ===")
    for lang, texts in language_blocks.items():
        print(f"\n{lang} ({len(texts)} lines):")
        for text in texts[:5]:  # Show first 5 lines
            print(f"  {text}")
        if len(texts) > 5:
            print(f"  ... and {len(texts) - 5} more lines")
    
    return language_blocks

# Notes on multi-language support (run detect_multilingual_text on your own image)
print("\n=== Multi-Language OCR ===")
print("Azure AI Vision supports 100+ languages including:")
print("  - English, Spanish, French, German, Italian, Portuguese")
print("  - Chinese (Simplified & Traditional), Japanese, Korean")
print("  - Arabic, Hebrew, Hindi, Thai")
print("  - And many more...")
print("\nThe service automatically detects language without requiring specification.")
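
The range checks in `detect_multilingual_text` label a line by the first script that matches, so a single foreign character decides the result. A possible alternative, sketched here with only the standard library, tallies the script of every letter and returns the most common one. It assumes the first word of a character's Unicode name (`LATIN`, `CYRILLIC`, `CJK`, `HIRAGANA`, ...) is a good enough script label; `dominant_script` is a hypothetical helper, and it identifies scripts, not languages.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters in text."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            # The Unicode character name starts with the script, e.g. "CYRILLIC SMALL LETTER A"
            counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


for sample in ["Hello world", "Привет, мир", "Invoice №42 от ACME"]:
    print(f"{sample!r}: {dominant_script(sample)}")
```

Because the vote is per letter, a mostly Latin line with a couple of Cyrillic characters still counts as Latin.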

## 3. Document Layout Analysis

Analyze document structure including paragraphs, headings, tables, and reading order.

In [None]:
def analyze_document_layout(image_path: str) -> Dict:
    """Analyze document layout and structure."""
    with open(image_path, "rb") as img_file:
        image_data = img_file.read()
    
    result = vision_client.analyze(
        image_data=image_data,
        visual_features=[VisualFeatures.READ]
    )
    
    layout = {
        "blocks": [],
        "reading_order": [],
        "columns": 1
    }
    
    if result.read:
        # Analyze blocks
        for idx, block in enumerate(result.read.blocks):
            block_info = {
                "block_id": idx,
                "lines": len(block.lines),
                "text": " ".join([line.text for line in block.lines])
            }
            layout["blocks"].append(block_info)
            
            # Build reading order
            for line in block.lines:
                layout["reading_order"].append(line.text)
        
        # Detect columns (simplified heuristic)
        if len(result.read.blocks) > 2:
            x_positions = []
            for block in result.read.blocks:
                if block.lines:
                    avg_x = sum(p.x for p in block.lines[0].bounding_polygon) / len(block.lines[0].bounding_polygon)
                    x_positions.append(avg_x)
            
            # If significant gaps in x positions, likely multi-column
            x_positions.sort()
            gaps = [x_positions[i+1] - x_positions[i] for i in range(len(x_positions)-1)]
            if gaps and max(gaps) > 200:  # Threshold for column detection
                layout["columns"] = 2
    
    # Display results
    print(f"\n=== Document Layout Analysis ===")
    print(f"Total blocks: {len(layout['blocks'])}")
    print(f"Estimated columns: {layout['columns']}")
    print(f"\nReading order (first 10 lines):")
    for i, line in enumerate(layout["reading_order"][:10]):
        print(f"  {i+1}. {line}")
    
    return layout

# What layout analysis provides (run analyze_document_layout on your own document)
print("Document layout analysis helps understand:")
print("  - Reading order and flow")
print("  - Column detection")
print("  - Text block grouping")
print("  - Document structure")
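
The column heuristic in `analyze_document_layout` reports at most two columns. One way to generalize it, shown as a sketch, is to sort the block x positions and start a new column whenever the gap to the previous position exceeds a threshold. `group_into_columns` is a hypothetical helper, and the 200-pixel default simply mirrors the threshold used above; a real document may need a value relative to the page width.

```python
from typing import List


def group_into_columns(x_positions: List[float], min_gap: float = 200.0) -> List[List[float]]:
    """Group x positions into columns, splitting wherever neighbours are at least min_gap apart."""
    columns: List[List[float]] = []
    for x in sorted(x_positions):
        if columns and x - columns[-1][-1] < min_gap:
            columns[-1].append(x)   # close to the previous position: same column
        else:
            columns.append([x])     # large gap (or first value): start a new column
    return columns


# Hypothetical average x positions of five text blocks
columns = group_into_columns([50, 60, 55, 400, 410])
print(f"Estimated columns: {len(columns)}")
```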

## 4. Table Extraction with Document Intelligence

Extract tables from documents while preserving structure and cell relationships.

In [None]:
def extract_tables(document_path: str) -> List[Dict]:
    """Extract tables from documents using Azure Document Intelligence."""
    if not doc_client:
        print("⚠ Document Intelligence not configured. Please set credentials in .env file.")
        return []
    
    with open(document_path, "rb") as f:
        document = f.read()
    
    # Use layout model to extract tables
    poller = doc_client.begin_analyze_document("prebuilt-layout", document)
    result = poller.result()
    
    tables = []
    
    for table in result.tables:
        # Convert table to structured format
        table_data = {
            "row_count": table.row_count,
            "column_count": table.column_count,
            "cells": []
        }
        
        # Create matrix representation
        matrix = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
        
        for cell in table.cells:
            matrix[cell.row_index][cell.column_index] = cell.content
            table_data["cells"].append({
                "row": cell.row_index,
                "column": cell.column_index,
                "content": cell.content,
                "is_header": cell.kind == "columnHeader" if hasattr(cell, 'kind') else False
            })
        
        table_data["matrix"] = matrix
        tables.append(table_data)
    
    # Display tables
    print(f"\n=== Extracted {len(tables)} Table(s) ===")
    for idx, table in enumerate(tables):
        print(f"\nTable {idx + 1}: {table['row_count']} rows × {table['column_count']} columns")
        print("-" * 80)
        
        # Display as formatted table
        for row in table["matrix"]:
            print(" | ".join(str(cell)[:20].ljust(20) for cell in row))
        print("-" * 80)
    
    return tables

# Example usage
print("\n=== Table Extraction ===")
print("Document Intelligence can extract tables from:")
print("  - PDF documents")
print("  - Scanned images")
print("  - Complex layouts")
print("  - Multi-page documents")
print("\nPreserves: cell structure, headers, and cell spanning")
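
`extract_tables` writes each cell only at its own row and column index, so a merged cell fills just its top-left position in the matrix. The sketch below repeats a cell's content across every position it spans, assuming each cell exposes optional `row_span` and `column_span` attributes as Document Intelligence table cells do. `build_matrix` and `demo_cells` are hypothetical names used for illustration.

```python
from types import SimpleNamespace
from typing import List


def build_matrix(row_count: int, column_count: int, cells) -> List[List[str]]:
    """Build a row-major matrix, copying spanning cells into every position they cover."""
    matrix = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        # Treat a missing or None span as a span of one
        row_span = getattr(cell, "row_span", None) or 1
        col_span = getattr(cell, "column_span", None) or 1
        for r in range(cell.row_index, min(cell.row_index + row_span, row_count)):
            for c in range(cell.column_index, min(cell.column_index + col_span, column_count)):
                matrix[r][c] = cell.content
    return matrix


# Stand-in cells: a header that spans two columns, followed by one ordinary row
demo_cells = [
    SimpleNamespace(row_index=0, column_index=0, content="Totals", row_span=1, column_span=2),
    SimpleNamespace(row_index=1, column_index=0, content="Net"),
    SimpleNamespace(row_index=1, column_index=1, content="Gross"),
]
for row in build_matrix(2, 2, demo_cells):
    print(" | ".join(row))
```

Whether to repeat the content or leave covered positions empty depends on the downstream consumer; repeating is convenient for CSV-style exports.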

## 5. Invoice and Receipt Processing

Extract structured data from invoices, receipts, and financial documents.

In [None]:
def process_invoice(document_path: str) -> Dict:
    """Process invoices and extract structured data."""
    if not doc_client:
        print("⚠ Document Intelligence not configured.")
        return {}
    
    with open(document_path, "rb") as f:
        document = f.read()
    
    # Use prebuilt invoice model
    poller = doc_client.begin_analyze_document("prebuilt-invoice", document)
    result = poller.result()
    
    invoice_data = {
        "vendor": None,
        "customer": None,
        "invoice_id": None,
        "invoice_date": None,
        "due_date": None,
        "subtotal": None,
        "tax": None,
        "total": None,
        "items": []
    }
    
    for doc in result.documents:
        # Extract key fields (loop variable renamed so it no longer shadows the file bytes)
        fields = doc.fields
        
        if "VendorName" in fields:
            invoice_data["vendor"] = fields["VendorName"].value
        if "CustomerName" in fields:
            invoice_data["customer"] = fields["CustomerName"].value
        if "InvoiceId" in fields:
            invoice_data["invoice_id"] = fields["InvoiceId"].value
        if "InvoiceDate" in fields:
            invoice_data["invoice_date"] = str(fields["InvoiceDate"].value)
        if "DueDate" in fields:
            invoice_data["due_date"] = str(fields["DueDate"].value)
        if "SubTotal" in fields:
            invoice_data["subtotal"] = fields["SubTotal"].value
        if "TotalTax" in fields:
            invoice_data["tax"] = fields["TotalTax"].value
        if "InvoiceTotal" in fields:
            invoice_data["total"] = fields["InvoiceTotal"].value
        
        # Extract line items
        if "Items" in fields:
            for item in fields["Items"].value:
                item_data = {}
                if "Description" in item.value:
                    item_data["description"] = item.value["Description"].value
                if "Quantity" in item.value:
                    item_data["quantity"] = item.value["Quantity"].value
                if "UnitPrice" in item.value:
                    item_data["unit_price"] = item.value["UnitPrice"].value
                if "Amount" in item.value:
                    item_data["amount"] = item.value["Amount"].value
                invoice_data["items"].append(item_data)
    
    # Display results
    print("\n=== Invoice Processing Results ===")
    print(f"Vendor: {invoice_data['vendor']}")
    print(f"Customer: {invoice_data['customer']}")
    print(f"Invoice ID: {invoice_data['invoice_id']}")
    print(f"Date: {invoice_data['invoice_date']}")
    print(f"Due Date: {invoice_data['due_date']}")
    print(f"\nFinancial Summary:")
    print(f"  Subtotal: ${invoice_data['subtotal']}")
    print(f"  Tax: ${invoice_data['tax']}")
    print(f"  Total: ${invoice_data['total']}")
    print(f"\nLine Items ({len(invoice_data['items'])}):")
    for idx, item in enumerate(invoice_data['items'][:5]):
        print(f"  {idx+1}. {item.get('description', 'N/A')} - Qty: {item.get('quantity', 'N/A')} - ${item.get('amount', 'N/A')}")
    
    return invoice_data

def process_receipt(document_path: str) -> Dict:
    """Process receipts and extract structured data."""
    if not doc_client:
        print("⚠ Document Intelligence not configured.")
        return {}
    
    with open(document_path, "rb") as f:
        document = f.read()
    
    # Use prebuilt receipt model
    poller = doc_client.begin_analyze_document("prebuilt-receipt", document)
    result = poller.result()
    
    receipt_data = {
        "merchant": None,
        "transaction_date": None,
        "transaction_time": None,
        "items": [],
        "subtotal": None,
        "tax": None,
        "tip": None,
        "total": None
    }
    
    for doc in result.documents:
        fields = doc.fields
        
        if "MerchantName" in fields:
            receipt_data["merchant"] = fields["MerchantName"].value
        if "TransactionDate" in fields:
            receipt_data["transaction_date"] = str(fields["TransactionDate"].value)
        if "TransactionTime" in fields:
            receipt_data["transaction_time"] = str(fields["TransactionTime"].value)
        if "Subtotal" in fields:
            receipt_data["subtotal"] = fields["Subtotal"].value
        if "Tax" in fields:
            receipt_data["tax"] = fields["Tax"].value
        if "Tip" in fields:
            receipt_data["tip"] = fields["Tip"].value
        if "Total" in fields:
            receipt_data["total"] = fields["Total"].value
        
        if "Items" in fields:
            for item in fields["Items"].value:
                item_data = {}
                if "Description" in item.value:
                    item_data["description"] = item.value["Description"].value
                if "TotalPrice" in item.value:
                    item_data["price"] = item.value["TotalPrice"].value
                receipt_data["items"].append(item_data)
    
    print("\n=== Receipt Processing Results ===")
    print(f"Merchant: {receipt_data['merchant']}")
    print(f"Date: {receipt_data['transaction_date']} {receipt_data['transaction_time']}")
    print(f"\nItems:")
    for item in receipt_data['items']:
        print(f"  - {item.get('description', 'N/A')}: ${item.get('price', 'N/A')}")
    print(f"\nSubtotal: ${receipt_data['subtotal']}")
    print(f"Tax: ${receipt_data['tax']}")
    print(f"Tip: ${receipt_data['tip']}")
    print(f"Total: ${receipt_data['total']}")
    
    return receipt_data

# Example
print("\n=== Invoice & Receipt Processing ===")
print("Supported document types:")
print("  - Invoices (prebuilt-invoice)")
print("  - Receipts (prebuilt-receipt)")
print("  - Business cards (prebuilt-businessCard)")
print("  - ID documents (prebuilt-idDocument)")
print("  - W-2 forms (prebuilt-tax.us.w2)")
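
Extracted amounts can be sanity-checked with simple arithmetic: line items should sum to the subtotal, and subtotal plus tax should equal the total. The sketch below assumes the amounts have already been reduced to plain numbers (for currency fields that usually means reading the numeric amount from the field value). `check_invoice_totals` is a hypothetical helper, and the one-cent tolerance is an assumption.

```python
from decimal import Decimal
from typing import Dict, Iterable


def check_invoice_totals(item_amounts: Iterable[float], subtotal: float, tax: float,
                         total: float, tolerance: str = "0.01") -> Dict[str, bool]:
    """Check that items sum to the subtotal and that subtotal + tax equals the total."""
    tol = Decimal(tolerance)

    def to_dec(value) -> Decimal:
        # Convert through str() so 0.1 becomes Decimal("0.1"), not its binary approximation
        return Decimal(str(value))

    items_sum = sum((to_dec(a) for a in item_amounts), Decimal("0"))
    return {
        "items_match_subtotal": abs(items_sum - to_dec(subtotal)) <= tol,
        "total_matches": abs(to_dec(subtotal) + to_dec(tax) - to_dec(total)) <= tol,
    }


# Hypothetical amounts for illustration
print(check_invoice_totals([100.0, 50.5], subtotal=150.5, tax=15.05, total=165.55))
```

A failed check does not say which value was misread, but it is a cheap signal for routing a document to manual review.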

## 6. OCR Accuracy Optimization

Techniques to improve OCR accuracy and quality.

In [None]:
from typing import Optional
from PIL import ImageEnhance, ImageFilter

def preprocess_for_ocr(image_path: str, output_path: Optional[str] = None) -> str:
    """Preprocess image to improve OCR accuracy."""
    img = Image.open(image_path)
    
    # Convert to grayscale
    img = img.convert('L')
    
    # Increase contrast
    enhancer = ImageEnhance.Contrast(img)
    img = enhancer.enhance(2.0)
    
    # Increase sharpness
    enhancer = ImageEnhance.Sharpness(img)
    img = enhancer.enhance(2.0)
    
    # Apply a median filter to reduce salt-and-pepper noise
    img = img.filter(ImageFilter.MedianFilter(size=3))
    
    # Increase brightness if needed
    enhancer = ImageEnhance.Brightness(img)
    img = enhancer.enhance(1.2)
    
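    # Save the preprocessed image next to the original, e.g. scan.png -> scan_preprocessed.png.
    # Path handles dots in directory names, which str.replace('.', ...) would corrupt.
    if output_path is None:
        src = Path(image_path)
        output_path = str(src.with_name(f"{src.stem}_preprocessed{src.suffix}"))
    img.save(output_path)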
    
    return output_path

def compare_ocr_quality(image_path: str) -> Dict:
    """Compare OCR results before and after preprocessing."""
    # Process original
    print("Processing original image...")
    with open(image_path, "rb") as f:
        result_original = vision_client.analyze(
            image_data=f.read(),
            visual_features=[VisualFeatures.READ]
        )
    
    # Preprocess and process again
    print("Preprocessing image...")
    preprocessed_path = preprocess_for_ocr(image_path)
    
    print("Processing preprocessed image...")
    with open(preprocessed_path, "rb") as f:
        result_preprocessed = vision_client.analyze(
            image_data=f.read(),
            visual_features=[VisualFeatures.READ]
        )
    
    # Calculate confidence scores
    def get_avg_confidence(result):
        if not result.read or not result.read.blocks:
            return 0.0
        total_conf = 0
        total_words = 0
        for block in result.read.blocks:
            for line in block.lines:
                for word in line.words:
                    total_conf += word.confidence
                    total_words += 1
        return total_conf / total_words if total_words > 0 else 0.0
    
    original_conf = get_avg_confidence(result_original)
    preprocessed_conf = get_avg_confidence(result_preprocessed)
    
    print(f"\n=== OCR Quality Comparison ===")
    print(f"Original average confidence: {original_conf:.2%}")
    print(f"Preprocessed average confidence: {preprocessed_conf:.2%}")
    print(f"Improvement: {(preprocessed_conf - original_conf):.2%}")
    
    return {
        "original_confidence": original_conf,
        "preprocessed_confidence": preprocessed_conf,
        "improvement": preprocessed_conf - original_conf
    }

# OCR Optimization Tips
print("\n=== OCR Accuracy Optimization Tips ===")
print("\n1. Image Quality:")
print("   - Use high resolution (300+ DPI for scanned documents)")
print("   - Ensure good lighting and contrast")
print("   - Avoid blur, glare, and shadows")
print("\n2. Preprocessing:")
print("   - Convert to grayscale")
print("   - Adjust contrast and brightness")
print("   - Deskew rotated images")
print("   - Remove noise with filters")
print("\n3. Document Handling:")
print("   - Flatten curved pages")
print("   - Remove backgrounds")
print("   - Crop to text regions")
print("\n4. Post-processing:")
print("   - Spell checking")
print("   - Dictionary validation")
print("   - Context-based correction")
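
`compare_ocr_quality` reports only the mean word confidence, which can hide a handful of badly recognized words. As a sketch, a per-word summary could also report the minimum and list the words below a threshold so they can be checked by hand. `confidence_report` is a hypothetical helper that takes `(text, confidence)` pairs; the sample pairs and the 0.8 threshold are assumptions.

```python
from statistics import mean
from typing import Dict, List, Tuple


def confidence_report(words: List[Tuple[str, float]], threshold: float = 0.8) -> Dict:
    """Summarize word confidences and list the words that fall below the threshold."""
    if not words:
        return {"mean": 0.0, "min": 0.0, "low_confidence_words": []}
    scores = [conf for _, conf in words]
    return {
        "mean": mean(scores),
        "min": min(scores),
        "low_confidence_words": [text for text, conf in words if conf < threshold],
    }


# Hypothetical (text, confidence) pairs, e.g. built from line.words in a Read result
report = confidence_report([("Invoice", 0.99), ("T0tal", 0.62), ("2023", 0.95)])
print(report)
```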

## 7. Batch Document Processing

Process multiple document images in parallel using a thread pool.

In [None]:
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_document_batch(document_paths: List[str], max_workers: int = 5) -> List[Dict]:
    """Process multiple documents in parallel."""
    results = []
    
    def process_single(path: str) -> Dict:
        """Process a single document."""
        try:
            with open(path, "rb") as f:
                image_data = f.read()
            
            result = vision_client.analyze(
                image_data=image_data,
                visual_features=[VisualFeatures.READ]
            )
            
            text_lines = []
            if result.read:
                for block in result.read.blocks:
                    for line in block.lines:
                        text_lines.append(line.text)
            
            return {
                "path": path,
                "status": "success",
                "text_lines": text_lines,
                "line_count": len(text_lines),
                "full_text": " ".join(text_lines)
            }
        except Exception as e:
            return {
                "path": path,
                "status": "error",
                "error": str(e)
            }
    
    # Process in parallel
    print(f"\nProcessing {len(document_paths)} documents with {max_workers} workers...")
    start_time = time.time()
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_path = {executor.submit(process_single, path): path for path in document_paths}
        
        for future in as_completed(future_to_path):
            result = future.result()
            results.append(result)
            
            if result["status"] == "success":
                print(f"✓ {result['path']}: {result['line_count']} lines extracted")
            else:
                print(f"✗ {result['path']}: {result['error']}")
    
    elapsed = time.time() - start_time
    print(f"\nBatch processing completed in {elapsed:.2f}s")
    
    # Summary (guarded so an empty batch does not raise ZeroDivisionError)
    if results:
        print(f"Average time per document: {elapsed/len(results):.2f}s")
        successful = sum(1 for r in results if r["status"] == "success")
        print(f"\nSuccess rate: {successful}/{len(results)} ({successful/len(results)*100:.1f}%)")
    
    return results

# Example
print("\n=== Batch Document Processing ===")
print("Benefits:")
print("  - Process multiple documents in parallel")
print("  - Reduce total processing time")
print("  - Handle errors gracefully")
print("  - Track progress and success rates")
print("\nExample usage:")
print('  document_paths = ["doc1.jpg", "doc2.jpg", "doc3.png"]  # image files; use Document Intelligence for PDFs')
print('  results = process_document_batch(document_paths, max_workers=5)')
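
Parallel requests make throttling and transient network errors more likely. The Azure SDK clients already apply their own retry policy, so the wrapper below is only a sketch for cases where an extra layer is wanted: it retries a callable with exponential backoff. `call_with_retries` is a hypothetical helper, and which exception types are worth retrying is an assumption left to the caller through `retry_on`.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(func: Callable[[], T], max_attempts: int = 3, base_delay: float = 1.0,
                      retry_on: Tuple[Type[BaseException], ...] = (Exception,)) -> T:
    """Call func(), retrying on the given exceptions with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

Inside `process_single`, the `vision_client.analyze(...)` call could be wrapped in a lambda and passed to this helper.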

## 8. Text Post-Processing and Error Correction

Clean and correct OCR output using various techniques.

In [None]:
import re

def clean_ocr_text(text: str) -> str:
    """Clean OCR output text."""
    # Remove multiple spaces
    text = re.sub(r' +', ' ', text)
    
    # Remove multiple newlines
    text = re.sub(r'\n+', '\n', text)
    
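    # Fix common OCR confusions only where context makes the intent clear.
    # Blind global replacement (e.g. every '5' -> 'S') would corrupt phone
    # numbers, dates, and currency amounts.
    context_fixes = [
        (r'(?<=\d)[lI](?=\d)', '1'),           # l or I between digits -> 1
        (r'(?<=\d)O(?=\d)', '0'),              # letter O between digits -> 0
        (r'(?<=[A-Za-z])0(?=[A-Za-z])', 'o'),  # zero between letters -> o
        (r'(?<=[A-Za-z])\|', 'l'),             # pipe after a letter -> l
    ]
    
    for pattern, replacement in context_fixes:
        text = re.sub(pattern, replacement, text)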
    
    # Remove non-printable characters
    text = ''.join(char for char in text if char.isprintable() or char in '\n\t')
    
    return text.strip()

def validate_with_dictionary(words: List[str], min_confidence: float = 0.8) -> List[Dict]:
    """Score words with simple OCR-artifact heuristics (no dictionary lookup is performed)."""
    # This is a simplified version - in production, use a proper spell checker
    validated = []
    
    for word in words:
        # Start from full confidence and penalize suspicious patterns
        confidence = 1.0
        
        # Check for common OCR artifacts
        if re.search(r'[|\\]', word):  # Contains pipes or backslashes
            confidence *= 0.5
        
        if len(word) > 20:  # Unusually long word
            confidence *= 0.7
        
        if re.search(r'\d[a-zA-Z]|[a-zA-Z]\d', word):  # Mixed letters and numbers
            confidence *= 0.8
        
        validated.append({
            "word": word,
            "valid": confidence >= min_confidence,
            "confidence": confidence
        })
    
    return validated

def extract_structured_data(text: str) -> Dict:
    """Extract structured information from OCR text."""
    extracted = {
        "emails": [],
        "phone_numbers": [],
        "dates": [],
        "urls": [],
        "currency": []
    }
    
    # Extract emails
    emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
    extracted["emails"] = emails
    
    # Extract phone numbers (US format)
    phones = re.findall(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', text)
    extracted["phone_numbers"] = phones
    
    # Extract dates (various formats)
    dates = re.findall(r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b', text)
    extracted["dates"] = dates
    
    # Extract URLs
    urls = re.findall(r'https?://[^\s]+', text)
    extracted["urls"] = urls
    
    # Extract currency amounts
    currency = re.findall(r'\$\d+(?:,\d{3})*(?:\.\d{2})?', text)
    extracted["currency"] = currency
    
    return extracted

# Example usage
sample_text = """
C0ntact us at: support@example.com or cal| us at 555-123-4567
Invoice dat3: 12/31/2023
T0tal amount: $1,234.56
Visit: https://example.com
"""

print("\n=== Text Post-Processing ===")
print("\nOriginal (with OCR errors):")
print(sample_text)

cleaned = clean_ocr_text(sample_text)
print("\nCleaned:")
print(cleaned)

structured = extract_structured_data(cleaned)
print("\nExtracted Structured Data:")
for key, values in structured.items():
    if values:
        print(f"  {key}: {values}")
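
`validate_with_dictionary` scores words without consulting a word list. A dictionary-based correction step could look like the sketch below, which uses `difflib.get_close_matches` from the standard library to snap each word to its nearest vocabulary entry. `correct_with_vocabulary`, the tiny vocabulary, and the 0.75 similarity cutoff are assumptions for illustration; a production system would use a proper spell checker and a domain word list.

```python
from difflib import get_close_matches
from typing import Iterable, List


def correct_with_vocabulary(words: Iterable[str], vocabulary: List[str],
                            cutoff: float = 0.75) -> List[str]:
    """Replace each word with its closest vocabulary entry, if one is similar enough."""
    corrected = []
    for word in words:
        # Compare in lowercase; matches are returned as they appear in the vocabulary
        matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return corrected


vocab = ["invoice", "total", "amount", "contact"]
print(correct_with_vocabulary(["lnvoice", "T0tal", "xyz"], vocab))
```

Words with no sufficiently close match are passed through unchanged, so names and codes outside the vocabulary are not altered.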

## 9. Custom Form Processing

Process documents with a custom model trained in Document Intelligence for a specific form type.

In [None]:
def process_custom_form(document_path: str, model_id: str) -> Dict:
    """Process documents using a custom trained model."""
    if not doc_client:
        print("⚠ Document Intelligence not configured.")
        return {}
    
    with open(document_path, "rb") as f:
        document = f.read()
    
    try:
        # Use custom model
        poller = doc_client.begin_analyze_document(model_id, document)
        result = poller.result()
        
        # Extract fields based on custom model
        extracted_fields = {}
        
        for doc in result.documents:
            for field_name, field_value in doc.fields.items():
                extracted_fields[field_name] = {
                    "value": field_value.value,
                    "confidence": field_value.confidence
                }
        
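        print(f"\n=== Custom Form Processing Results ===")
        for field_name, field_data in extracted_fields.items():
            # Confidence can be missing for some fields, so format it defensively
            conf = field_data['confidence']
            conf_text = f"{conf:.2%}" if conf is not None else "n/a"
            print(f"{field_name}: {field_data['value']} (confidence: {conf_text})")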
        
        return extracted_fields
    
    except Exception as e:
        print(f"Error processing custom form: {str(e)}")
        return {}

# Custom model training information
print("\n=== Custom Form Processing ===")
print("\nTo train a custom model:")
print("1. Collect sample documents (5+ recommended)")
print("2. Upload to Azure Storage")
print("3. Use Document Intelligence Studio to label fields")
print("4. Train the model")
print("5. Get model ID and use for processing")
print("\nCustom models are ideal for:")
print("  - Company-specific forms")
print("  - Unique document layouts")
print("  - Industry-specific documents")
print("  - Documents with custom fields")
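
Since `process_custom_form` returns a confidence for each field, a simple follow-up step is to accept confident fields automatically and queue the rest for manual review. The sketch below works on the dictionary shape returned above (`{"name": {"value": ..., "confidence": ...}}`). `split_by_confidence`, the sample fields, and the 0.8 threshold are assumptions to tune per form type.

```python
from typing import Dict, Tuple


def split_by_confidence(extracted_fields: Dict[str, Dict],
                        threshold: float = 0.8) -> Tuple[Dict, Dict]:
    """Split fields into (accepted, needs_review) by confidence."""
    accepted, needs_review = {}, {}
    for name, data in extracted_fields.items():
        conf = data.get("confidence")
        # Fields with no confidence at all are sent to review as well
        target = accepted if conf is not None and conf >= threshold else needs_review
        target[name] = data
    return accepted, needs_review


# Hypothetical output of process_custom_form
sample_fields = {
    "PolicyNumber": {"value": "AB-1234", "confidence": 0.97},
    "Signature": {"value": "signed", "confidence": 0.55},
    "Notes": {"value": "see attachment", "confidence": None},
}
accepted, needs_review = split_by_confidence(sample_fields)
print("Accepted:", list(accepted))
print("Needs review:", list(needs_review))
```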

## 10. PDF Document Processing

Extract text from multi-page PDF documents.

In [None]:
def process_pdf_document(pdf_path: str) -> Dict:
    """Process multi-page PDF documents."""
    if not doc_client:
        print("⚠ Document Intelligence not configured.")
        return {}
    
    with open(pdf_path, "rb") as f:
        pdf_data = f.read()
    
    # Analyze PDF with layout model
    poller = doc_client.begin_analyze_document("prebuilt-layout", pdf_data)
    result = poller.result()
    
    pdf_content = {
        "page_count": len(result.pages),
        "pages": [],
        "full_text": "",
        "tables": [],
        "key_value_pairs": []
    }
    
    # Process each page
    for page_num, page in enumerate(result.pages, 1):
        page_text = []
        
        for line in page.lines:
            page_text.append(line.content)
        
        page_content = "\n".join(page_text)
        pdf_content["pages"].append({
            "page_number": page_num,
            "text": page_content,
            "line_count": len(page.lines)
        })
        pdf_content["full_text"] += page_content + "\n\n"
    
    # Extract tables
    for table in result.tables:
        pdf_content["tables"].append({
            "row_count": table.row_count,
            "column_count": table.column_count,
            "page": table.bounding_regions[0].page_number if table.bounding_regions else None
        })
    
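    # Extract key-value pairs. prebuilt-layout may return none (the attribute can be
    # empty or None); they are typically produced by models such as prebuilt-document.
    for kv in (getattr(result, 'key_value_pairs', None) or []):
        if kv.key and kv.value:
            pdf_content["key_value_pairs"].append({
                "key": kv.key.content,
                "value": kv.value.content
            })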
    
    # Display summary
    print(f"\n=== PDF Processing Results ===")
    print(f"Total pages: {pdf_content['page_count']}")
    print(f"Total tables: {len(pdf_content['tables'])}")
    print(f"Key-value pairs: {len(pdf_content['key_value_pairs'])}")
    print(f"\nPage summaries:")
    for page in pdf_content["pages"]:
        print(f"  Page {page['page_number']}: {page['line_count']} lines")
    
    return pdf_content

print("\n=== PDF Document Processing ===")
print("Capabilities:")
print("  - Multi-page text extraction")
print("  - Table detection and extraction")
print("  - Key-value pair extraction")
print("  - Layout preservation")
print("  - Reading order detection")
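
Once a PDF has been split into per-page text, simple lookups become possible without calling the service again. As a minimal sketch, the helper below returns the page numbers whose text contains a search term, using the `pages` list produced by `process_pdf_document`. `pages_containing` and the sample content are hypothetical.

```python
from typing import Dict, List


def pages_containing(pdf_content: Dict, term: str) -> List[int]:
    """Return the page numbers whose text contains term (case-insensitive)."""
    needle = term.casefold()
    return [
        page["page_number"]
        for page in pdf_content.get("pages", [])
        if needle in page["text"].casefold()
    ]


# Hypothetical result shaped like the output of process_pdf_document
sample_pdf = {
    "pages": [
        {"page_number": 1, "text": "Master Services Agreement", "line_count": 1},
        {"page_number": 2, "text": "Payment terms: net 30", "line_count": 1},
        {"page_number": 3, "text": "Late PAYMENT fees apply", "line_count": 1},
    ]
}
print(pages_containing(sample_pdf, "payment"))
```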

## Summary

This advanced OCR notebook covered:

1. **Handwritten Text Recognition** - Extract text from handwritten documents
2. **Multi-Language Support** - Detect and process text in 100+ languages
3. **Document Layout Analysis** - Understand document structure and reading order
4. **Table Extraction** - Extract tables while preserving structure
5. **Invoice & Receipt Processing** - Extract structured data from financial documents
6. **OCR Optimization** - Improve accuracy through preprocessing
7. **Batch Processing** - Process multiple documents efficiently
8. **Post-Processing** - Clean and correct OCR output
9. **Custom Models** - Train models for specific form types
10. **PDF Processing** - Extract content from multi-page PDFs

### Next Steps

- Explore Azure Document Intelligence Studio for visual model training
- Build automated document processing pipelines
- Integrate OCR with downstream applications
- Implement quality control and validation workflows
- Optimize for production workloads

### Additional Resources

- [Azure AI Vision OCR Documentation](https://learn.microsoft.com/azure/ai-services/computer-vision/overview-ocr)
- [Azure Document Intelligence](https://learn.microsoft.com/azure/ai-services/document-intelligence/)
- [OCR Best Practices](https://learn.microsoft.com/azure/ai-services/computer-vision/overview-ocr#best-practices)
- [Custom Model Training](https://learn.microsoft.com/azure/ai-services/document-intelligence/how-to-guides/build-a-custom-model)