# Notebook 06: Computer Vision - Optical Character Recognition (OCR)

**Learning Objectives:**
- Extract text from images using OCR
- Use TrOCR (Transformer-based OCR) models
- Handle both printed and handwritten text
- Process documents and receipts

## Prerequisites

### Hardware Requirements

| Model Option | Model Name | Size | Min RAM | Recommended Setup | Notes |
|--------------|------------|------|---------|-------------------|-------|
| **small (CPU-friendly)** | microsoft/trocr-small-printed | 558MB | 4GB | 4GB RAM, CPU | English printed text, learning |
| **large/SOTA (GPU-optimized)** | PaddleOCR | ~3.5GB | 8GB | 12GB VRAM (RTX 4080) | 80+ languages, production-grade |

### Software Requirements
- Python 3.8+
- Libraries: `transformers`, `torch`, `torchvision`, `Pillow` (imported as `PIL`), `requests`
- Optional: `paddlepaddle`, `paddleocr`, `numpy`, `matplotlib` (for large/SOTA option)

## Overview

**OCR (Optical Character Recognition)** extracts text from images.

**Use Cases:**
- Document digitization
- Receipt processing
- License plate recognition
- Form extraction

**TrOCR:**
- Vision Transformer (encoder) + Text Transformer (decoder)
- Trained on printed and handwritten text
- Reported state-of-the-art accuracy on printed and handwritten benchmarks when published (2021)

## Expected Behaviors

### First Time Running
- **Model Download**: ~558MB for TrOCR-small (~3-5 minutes)
- Downloads both vision encoder and text decoder
- Cached for subsequent runs
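
To check whether the model is already cached before running the notebook, one option is to look for its folder in the Hugging Face hub cache. The sketch below assumes the default cache layout (`~/.cache/huggingface/hub`, overridable through the `HF_HUB_CACHE` or `HF_HOME` environment variables) and the `models--<org>--<name>` folder naming. The helper names are illustrative and not part of `transformers`.

```python
import os
from pathlib import Path


def hub_cache_dir():
    """Return the hub cache directory, honoring the common override variables."""
    if os.environ.get("HF_HUB_CACHE"):
        return Path(os.environ["HF_HUB_CACHE"]).expanduser()
    # Assumption: default HF_HOME is ~/.cache/huggingface (XDG overrides ignored here)
    hf_home = os.environ.get("HF_HOME", os.path.join("~", ".cache", "huggingface"))
    return Path(hf_home).expanduser() / "hub"


def cache_folder_name(repo_id):
    """Map 'org/name' to the folder name used inside the hub cache."""
    return "models--" + repo_id.replace("/", "--")


def is_model_cached(repo_id):
    """True if a cache folder for this model exists (it may still be incomplete)."""
    return (hub_cache_dir() / cache_folder_name(repo_id)).is_dir()


print(is_model_cached("microsoft/trocr-small-printed"))
```

A `False` result simply means the first `from_pretrained` call will trigger the download described above.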

### Setup Cell Output
```
PyTorch version: 2.x.x
CUDA available: True/False
```

### Model Loading
```
Model loaded on: cpu (or cuda)
```
- **CPU**: 8-12 seconds to load
- **GPU**: 4-6 seconds

### OCR Output
- Returns extracted text as a string
- Example: `"Hello World"` from an image containing that text

### Input Requirements
- **Best results**: Single line of text, cropped close
- **Image types**: Printed text (TrOCR), or multi-language/complex layouts (PaddleOCR)
- **Resolution**: Higher resolution generally improves accuracy, especially for small text
- **Background**: Clean backgrounds work best

### Accuracy Expectations
- **Printed text (trocr-small-printed)**:
  - Clean, high-res: 92-95% accuracy
  - Low-res or blurry: 70-85% accuracy
- **PaddleOCR**:
  - Clean printed text: 94-97% accuracy
  - Multi-language support: 80+ languages
  - Complex layouts: Excellent (tables, forms)
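
The notebook does not state how these percentages were measured. One common choice is character-level accuracy, defined as 1 minus the character error rate (CER), where CER is the edit distance between the reference and the OCR output divided by the reference length. The sketch below is a minimal pure-Python version under that assumption; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def char_accuracy(ref, hyp):
    """Character accuracy = 1 - CER, clamped at 0 for very long outputs."""
    if not ref:
        return 1.0 if not hyp else 0.0
    cer = edit_distance(ref, hyp) / len(ref)
    return max(0.0, 1.0 - cer)


# One wrong character out of eleven ("l" read as "1")
print(f"{char_accuracy('Hello World', 'Hello Wor1d'):.3f}")
```

In the notebook, `hyp` would be the string returned by `extract_text` and `ref` the ground-truth transcription of the same image.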

### Performance
- **Single text region**:
  - CPU: 1-3 seconds
  - GPU: 0.3-0.8 seconds
- **Slower than image classification** because the decoder generates the text one token at a time
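
To compare these figures with your own hardware, a small timing helper is enough. The sketch below takes the median of several runs after a warm-up call; `time_call` is a hypothetical helper name, and the stand-in workload is only there so the block runs on its own.

```python
import statistics
import time


def time_call(fn, *args, warmup=1, repeats=5):
    """Return the median wall-clock seconds of fn(*args) over `repeats` runs."""
    for _ in range(warmup):
        fn(*args)  # warm-up runs are not timed (the first call often pays one-off costs)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload; in the notebook you would time the OCR call instead,
# e.g. time_call(extract_text, image)
print(f"{time_call(sum, range(100_000)):.6f} s")
```

The median is used rather than the mean so that a single slow run (for example, one interrupted by another process) does not skew the result.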

### Common Issues
- **Multi-line text**: Works best on single lines; process line-by-line (or use PaddleOCR)
- **Rotated text**: May need rotation correction first (PaddleOCR handles this)
- **Small text**: Upscale image before OCR
- **Background noise**: Pre-process to remove noise
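
For the multi-line case, one simple way to process line by line is a horizontal projection profile: count dark pixels in each image row and treat runs of non-empty rows as text lines. The sketch below assumes dark text on a light, reasonably clean background. `find_line_bands` and `row_ink_counts` are illustrative names, and `row_ink_counts` expects a PIL image such as the ones loaded in this notebook.

```python
def find_line_bands(row_ink, min_ink=1, min_height=2):
    """Return (top, bottom) row ranges that contain ink; bottom is exclusive.

    row_ink[y] is the number of dark pixels in image row y.
    """
    bands, start = [], None
    for y, ink in enumerate(row_ink):
        if ink >= min_ink and start is None:
            start = y                      # entering a text line
        elif ink < min_ink and start is not None:
            if y - start >= min_height:    # ignore specks thinner than min_height
                bands.append((start, y))
            start = None
    if start is not None and len(row_ink) - start >= min_height:
        bands.append((start, len(row_ink)))  # line touches the bottom edge
    return bands


def row_ink_counts(image, threshold=128):
    """Count pixels darker than `threshold` in each row of a PIL image."""
    gray = image.convert("L")
    width, height = gray.size
    pixels = gray.load()
    return [sum(pixels[x, y] < threshold for x in range(width)) for y in range(height)]


# Synthetic profile: two text lines separated by blank rows
profile = [0, 0, 5, 9, 7, 0, 0, 0, 4, 8, 6, 3, 0]
print(find_line_bands(profile))  # [(2, 5), (8, 12)]
```

Each band can then be cropped with `image.crop((0, top, image.width, bottom))` and passed to the OCR function on its own. Skewed or noisy scans usually need deskewing or a proper text detector instead.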

### Model Variants
- **TrOCR-small-printed**: Lightweight, for English typed/printed text (this notebook)
- **PaddleOCR**: Production-grade, 80+ languages, complex layouts (covered below)

### Expected Output Quality
- Should match actual text closely
- May have minor errors with unusual fonts
- Punctuation generally preserved
- Letter case is preserved

In [None]:
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel, set_seed
from PIL import Image
import requests
from io import BytesIO
import warnings
warnings.filterwarnings('ignore')

# Set seed for reproducibility
set_seed(1103)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

In [None]:
# CHOOSE YOUR MODEL:

# Option 1: small model (CPU-friendly, English printed text)
MODEL_NAME = "microsoft/trocr-small-printed"  # 558MB

# Option 2: large/SOTA model (GPU-optimized, 80+ languages, production-grade)
# To use PaddleOCR instead, skip to the PaddleOCR section below (cells 14+)
# PaddleOCR offers: 80+ languages, better accuracy, layout analysis
# Note: Handwritten text OCR available in PaddleOCR or via trocr-base-handwritten (not covered here)

print(f"Selected model: {MODEL_NAME}")

In [None]:
# Load model and processor
processor = TrOCRProcessor.from_pretrained(MODEL_NAME)
model = VisionEncoderDecoderModel.from_pretrained(MODEL_NAME)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(f"Model loaded on: {device}")

In [None]:
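def extract_text(image):
    """Run TrOCR on a single-line PIL image and return the decoded text."""
    # Resize and normalize the image into the tensor layout the encoder expects
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)

    # Generate token IDs autoregressively, then decode them into a string
    generated_ids = model.generate(pixel_values)
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

    return text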

In [None]:
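# Example: Extract text from a sample image
# Note: this sample is a handwritten line from the IAM database, so the
# printed-text model selected above may transcribe it poorly.
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"  # Handwritten sample
response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors
image = Image.open(BytesIO(response.content)).convert("RGB")

text = extract_text(image)
print(f"\nExtracted text: {text}")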

In [None]:
# Example: Process multiple text regions
def ocr_multiple_regions(image_urls):
    """Process multiple text images."""
    for i, url in enumerate(image_urls, 1):
        try:
            img = Image.open(BytesIO(requests.get(url, timeout=30).content)).convert("RGB")
            text = extract_text(img)
            print(f"\nImage {i}: {text}")
        except Exception as e:
            print(f"\nImage {i} error: {e}")

# Test with sample URLs
sample_urls = [
    "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
]
ocr_multiple_regions(sample_urls)

In [None]:
# Install PaddleOCR (if not already installed)
# Note: this section uses the PaddleOCR 2.x API (use_gpu=..., ocr.ocr(..., cls=True)).
# Newer major releases may have changed these arguments and the result format.
# !pip install paddlepaddle paddleocr

from paddleocr import PaddleOCR
import numpy as np
import torch
from PIL import Image

# Initialize PaddleOCR
print("Initializing PaddleOCR (this may take a moment)...")
ocr = PaddleOCR(
    use_angle_cls=True,  # Enable text angle classification
    lang='en',           # Language: 'en', 'ch', 'fr', 'de', etc.
    use_gpu=torch.cuda.is_available()
)

print("PaddleOCR initialized")
print(f"Using: {'GPU' if torch.cuda.is_available() else 'CPU'}")

In [None]:
# Load test image (use an image with text)
# Option 1: Use local file
# test_image_path = "sample_data/document.jpg"
# img = Image.open(test_image_path)

# Option 2: Use URL
import sys
sys.path.append('..')  # Add parent directory to path for shared_utils
from shared_utils import load_image_from_url
img = load_image_from_url("https://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Sample_receipt.jpg/800px-Sample_receipt.jpg")
img_array = np.array(img)

# Run OCR
print("Running PaddleOCR...")
result = ocr.ocr(img_array, cls=True)
# result[0] may be None when no text is detected, so fall back to an empty list
detections = result[0] or []

# Display results
print("\n" + "="*70)
print("PADDLEOCR RESULTS")
print("="*70)

for idx, line in enumerate(detections):
    bbox, (text, confidence) = line
    print(f"{idx+1}. {text}")
    print(f"   Confidence: {confidence:.4f}")
    print(f"   Bounding box: {bbox}")
    print()

print("="*70)

In [None]:
# Visualize OCR results (uses img and detections from the previous cell)
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(1, figsize=(12, 8))
ax.imshow(img)

# Draw bounding boxes
for line in detections:
    bbox, (text, confidence) = line
    # bbox is [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
    points = np.array(bbox)
    
    # Create polygon
    polygon = patches.Polygon(
        points,
        linewidth=2,
        edgecolor='red',
        facecolor='none'
    )
    ax.add_patch(polygon)
    
    # Add text label
    ax.text(
        points[0][0],
        points[0][1] - 5,
        f"{text[:20]}... ({confidence:.2f})" if len(text) > 20 else f"{text} ({confidence:.2f})",
        fontsize=8,
        color='red',
        bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.7)
    )

ax.axis('off')
ax.set_title('PaddleOCR Detection Results', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print(f"\nDetected {len(detections)} text regions")

In [None]:
# Advanced: Multi-language OCR
print("\n=== MULTI-LANGUAGE OCR EXAMPLE ===")

# Initialize OCR for different languages
# ocr_chinese = PaddleOCR(lang='ch', use_gpu=torch.cuda.is_available())
# ocr_french = PaddleOCR(lang='fr', use_gpu=torch.cuda.is_available())
# ocr_german = PaddleOCR(lang='de', use_gpu=torch.cuda.is_available())

# Supported languages (partial list)
supported_langs = [
    'en', 'ch', 'fr', 'de', 'ja', 'ko', 'es', 'pt', 'ru', 'ar',
    'hi', 'th', 'vi', 'id', 'ms', 'tl', 'nl', 'it', 'pl', 'tr'
]

print("Supported languages (partial list):")
print(", ".join(supported_langs))
print("\nSee https://github.com/PaddlePaddle/PaddleOCR for full list")
print("\nTo use a different language, initialize with: PaddleOCR(lang='ch') for Chinese, etc.")

In [None]:
# Compare with TrOCR
print("\n=== COMPARISON: TrOCR vs PaddleOCR ===")
print(f"{'Aspect':<20} {'TrOCR':<25} {'PaddleOCR'}")
print("-"*70)
print(f"{'Model Size':<20} {'558MB (small)':<25} {'~3.5GB'}")
print(f"{'Parameters':<20} {'~62M (small)':<25} {'0.9B'}")
print(f"{'Languages':<20} {'English focused':<25} {'80+ languages'}")
print(f"{'Accuracy':<20} {'Good':<25} {'Excellent'}")
print(f"{'Speed (CPU)':<20} {'Fast':<25} {'Moderate'}")
print(f"{'Best For':<20} {'Simple docs':<25} {'Complex layouts'}")
print("="*70)

print("\n**When to use PaddleOCR:**")
print("  - Multi-language documents")
print("  - Complex layouts (tables, forms)")
print("  - Production applications requiring high accuracy")
print("  - When you have sufficient GPU memory (4GB+ VRAM)")

print("\n**When to use TrOCR:**")
print("  - Simple English documents")
print("  - Limited hardware (CPU-only)")
print("  - Faster inference needed")
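
The PaddleOCR cells in this section print each detection separately as `[bbox, (text, confidence)]`. For receipts and forms it is often more useful to turn those detections into plain text in reading order. The sketch below is one simple heuristic, assuming the 2.x-style result format used in this notebook: drop low-confidence detections, group boxes whose vertical centers are close into rows, and sort each row left to right. `results_to_text` and its thresholds are illustrative, and the sample data is made up.

```python
def results_to_text(lines, min_confidence=0.5, line_tolerance=10):
    """Join detections into text, top-to-bottom and then left-to-right.

    lines: iterable of [bbox, (text, confidence)], bbox = four [x, y] corners.
    Boxes whose vertical centers differ by <= line_tolerance pixels share a row.
    """
    kept = []
    for bbox, (text, confidence) in lines:
        if confidence < min_confidence:
            continue  # drop low-confidence detections
        y_center = sum(point[1] for point in bbox) / len(bbox)
        x_left = min(point[0] for point in bbox)
        kept.append((y_center, x_left, text))

    kept.sort()  # vertical position first
    rows = []    # each entry: [reference_y, [(x_left, text), ...]]
    for y_center, x_left, text in kept:
        if rows and abs(y_center - rows[-1][0]) <= line_tolerance:
            rows[-1][1].append((x_left, text))
        else:
            rows.append([y_center, [(x_left, text)]])

    return "\n".join(" ".join(t for _, t in sorted(row)) for _, row in rows)


# Made-up detections in the same shape as result[0]
sample = [
    [[[120, 10], [200, 10], [200, 30], [120, 30]], ("$4.50", 0.97)],
    [[[10, 12], [90, 12], [90, 32], [10, 32]], ("Coffee", 0.99)],
    [[[10, 50], [80, 50], [80, 70], [10, 70]], ("Total", 0.98)],
    [[[120, 52], [200, 52], [200, 72], [120, 72]], ("$4.50", 0.96)],
    [[[10, 90], [40, 90], [40, 100], [10, 100]], ("~~", 0.20)],
]
print(results_to_text(sample))
```

With real output, `lines` would be `result[0]` (or an empty list when nothing is detected). A fixed pixel tolerance is a simplification; scaling it by the median box height tends to be more robust across image sizes.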

In [None]:
# Using MNIST dataset (12MB, handwritten digits 0-9)
# Note: the printed-text model was not trained for handwritten digits, so expect some misses
import torchvision.datasets as datasets
from PIL import Image

print("Downloading MNIST test dataset...")
# Download test set (will cache after first download)
mnist_test = datasets.MNIST(root='./data', train=False, download=True)

print(f"Loaded {len(mnist_test)} test images\n")

# Test OCR on a few MNIST digits
print("=== MNIST Digit Recognition ===")
for i in range(5):
    img, true_digit = mnist_test[i]
    
    # Convert grayscale to RGB (TrOCR expects RGB)
    img_rgb = img.convert("RGB")
    
    # Extract text using OCR
    extracted_text = extract_text(img_rgb)
    
    # Check if prediction matches
    match = "✓" if extracted_text.strip() == str(true_digit) else "✗"
    
    print(f"{match} Image {i+1}: True digit = {true_digit}, Predicted = '{extracted_text.strip()}'")

## Exercises

1. **Custom Images**: Create images with text and test OCR accuracy
2. **Handwritten vs Printed**: Compare models on different text types
3. **Document Processing**: Extract text from a multi-line document
4. **Error Analysis**: Test with challenging fonts or low quality images

In [None]:
# Your code here for exercises


## Key Takeaways

✅ **TrOCR** uses transformers for OCR tasks

✅ Separate models for **printed** vs **handwritten** text

✅ Works best on **cropped text regions**

✅ Can be combined with object detection for full document processing

## Next Steps

- Try **Notebook 07**: Audio Speech Recognition
- Explore [Document AI](https://huggingface.co/models?pipeline_tag=document-question-answering)
- Combine with text detection for end-to-end document OCR

## Resources

- [TrOCR Paper](https://arxiv.org/abs/2109.10282)
- [OCR Guide](https://huggingface.co/docs/transformers/tasks/optical_character_recognition)