Victor-EU/ocr-benchmark

OCR Benchmark

A multi-objective benchmark comparing 10 OCR/document extraction systems (nine ranked extractors plus Landing AI as the reference) on a 403-page financial document.

Benchmark Results

  • Document: Lufthansa Annual Report 2024 (403 pages)
  • Reference: Landing AI
  • Evaluation: Rule-based metrics + LLM Judge (Gemini 3 Pro)

Overall Rankings

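| Rank | Extractor | Text | Tables | Structure | Cost (403 pages) | Category |
|------|-----------|------|--------|-----------|------------------|----------|
| 1 | LlamaParse | 93.3 | 72.3 | 61.7 | ~$20 | AI Cloud |
| 2 | GPT-5.2 | 92.5 | 72.3 | 59.9 | ~$12 | AI Cloud |
| 3 | Gemini 3 Flash | 91.6 | 74.3 | 58.9 | ~$4 | AI Cloud |
| 4 | PyMuPDF | 90.9 | 0.0 | 57.3 | $0 | Local (Free) |
| 5 | Gemini 3 Pro | 87.2 | 68.6 | 61.7 | ~$8 | AI Cloud |
| 6 | ABBYY | 85.3 | 71.9 | 63.8 | $0.73 | Low-Cost |
| 7 | Tesseract | 83.1 | 0.0 | 66.4 | $0 | Local (Free) |
| 8 | Docling | 82.3 | 76.3 | 64.4 | $0 | Local (Free) |
| 9 | Gemini 2.5 Flash | 33.4 | 16.8 | 52.1 | ~$2 | AI Cloud |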

Best in Category

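| Metric | Winner | Score | Runner-up | Score |
|--------|--------|-------|-----------|-------|
| Text Content | LlamaParse | 93.3 | GPT-5.2 | 92.5 |
| Table Extraction | Docling | 76.3 | Gemini 3 Flash | 74.3 |
| Structure | Tesseract | 66.4 | Docling | 64.4 |
| Best Value | ABBYY | 8.0/10 | Docling | Free |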

LLM Judge Validation (ABBYY)

| Dimension | Score | Notes |
|-----------|-------|-------|
| Text Completeness | 9.7/10 | Near-perfect capture |
| Text Accuracy | 8.5/10 | Minor OCR errors |
| Table Extraction | 7.0/10 | Good structure |
| Layout Understanding | 7.8/10 | Good document structure |
| Overall | 8.0/10 | Strong performance |

Evaluated on 50 stratified sample pages using Gemini 3 Pro
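
The stratification scheme is not spelled out here. As a rough sketch, assuming the strata are simply contiguous page ranges of near-equal size, one page per stratum could be drawn as follows. The function name, the seed, and the one-page-per-stratum rule are illustrative assumptions, not the repository's implementation.

```python
import random


def stratified_page_sample(total_pages: int, sample_size: int, seed: int = 0) -> list[int]:
    """Pick one 1-based page number from each of `sample_size` contiguous strata.

    Hypothetical helper: assumes strata are equal-width page ranges.
    """
    if not 0 < sample_size <= total_pages:
        raise ValueError("sample_size must be between 1 and total_pages")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    pages = []
    for i in range(sample_size):
        start = i * total_pages // sample_size + 1    # first page of stratum i
        end = (i + 1) * total_pages // sample_size    # last page of stratum i
        pages.append(rng.randint(start, end))         # inclusive on both ends
    return pages


sample = stratified_page_sample(total_pages=403, sample_size=50)
print(len(sample), sample[:5])
```

Sampling by position spreads the judged pages across the whole report, so front matter, narrative sections, and financial tables all have a chance of being covered.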

Key Findings

  1. No single winner - Different extractors excel at different tasks

  2. AI extractors dominate text - LlamaParse, GPT-5.2, Gemini 3 Flash all score 91%+

  3. Docling dominates tables - 76.3% with best detection and cell accuracy (free!)

  4. ABBYY is the best low-cost option - 85.3/71.9/63.8 (text/tables/structure) at just $0.73 for 403 pages

    • LLM-validated 8.0/10 quality
    • 95.9% word recall (highest)
    • $0.0018/page ($1.80/1000 pages)
  5. Free options are competitive - Docling achieves 85-95% of paid AI quality

  6. Gemini 2.5 Flash should be avoided - Critical failures on this document (33.4 text, 16.8 tables)

Use Case Recommendations

| Use Case | Recommended | Why |
|----------|-------------|-----|
| RAG / Text Search | LlamaParse | 97.1% word precision, best for embeddings |
| Data Extraction (Tables) | Docling | 76.3% table score, free |
| Cost-Sensitive Production | Docling or ABBYY | Free or $0.0018/page |
| Best Overall (Cost No Object) | LlamaParse | 75.8 avg across all metrics |
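
The 75.8 figure is consistent with an unweighted mean of the three rule-based scores. The sketch below recomputes that mean for every ranked extractor from the Overall Rankings table; equal weighting is an assumption inferred from the numbers, not a documented formula.

```python
# Scores copied from the Overall Rankings table: (text, tables, structure).
SCORES = {
    "LlamaParse": (93.3, 72.3, 61.7),
    "GPT-5.2": (92.5, 72.3, 59.9),
    "Gemini 3 Flash": (91.6, 74.3, 58.9),
    "PyMuPDF": (90.9, 0.0, 57.3),
    "Gemini 3 Pro": (87.2, 68.6, 61.7),
    "ABBYY": (85.3, 71.9, 63.8),
    "Tesseract": (83.1, 0.0, 66.4),
    "Docling": (82.3, 76.3, 64.4),
    "Gemini 2.5 Flash": (33.4, 16.8, 52.1),
}


def mean_score(scores: tuple[float, ...]) -> float:
    """Unweighted mean across metrics (assumed, see lead-in)."""
    return sum(scores) / len(scores)


averages = {name: round(mean_score(s), 1) for name, s in SCORES.items()}
best = max(averages, key=averages.get)

# Print extractors from highest to lowest average.
for name, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} {avg:5.1f}")
print("Best average:", best)
```

Extractors that emit no tables (PyMuPDF, Tesseract) are pulled down sharply by the 0.0 table score under this averaging, which is worth keeping in mind if tables do not matter for a given use case.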

Features

  • 10 OCR Extractors: PyMuPDF, Tesseract, Docling, LlamaParse, Landing AI, GPT-5.2, Gemini 3 Flash/Pro, Gemini 2.5 Flash, ABBYY
  • Multi-Objective Evaluation: Text content, table extraction, structure preservation
  • LLM-as-Judge: Gemini 3 Pro evaluates against reference images
  • Cross-Judge Bias Prevention: Extractors from the same model family as the judge (Gemini) receive reduced weight
  • Cost-Performance Analysis: Compare quality vs. price across tiers

Quick Start

```bash
# Setup
pip install -r requirements.txt
cp .env.template .env
# Edit .env with your API keys

# Run benchmark
python scripts/run_multi_objective_benchmark.py

# Run LLM judge evaluation
python scripts/run_llm_judge.py --extractor E12

# View full report
cat analysis/benchmark/multi_objective_report.md
```

Project Structure

ocr-benchmark/
├── analysis/                   # Benchmark results and reports
│   ├── benchmark/              # Multi-objective benchmark report
│   └── llm_judge/              # LLM judge evaluation results
├── configs/                    # Configuration files
├── datasets/                   # Reference data
├── results/                    # Extraction results per extractor
├── scripts/                    # Benchmark and processing scripts
│   ├── run_multi_objective_benchmark.py
│   ├── run_llm_judge.py
│   └── process_abbyy_extraction.py
├── src/                        # Source code
│   ├── extractors/             # OCR extractor implementations
│   ├── judges/                 # LLM judge implementations
│   ├── models/                 # Data models
│   └── utils/                  # Utilities
└── tests/                      # Unit tests

Evaluation Metrics

Rule-Based (0-100)

  1. Text Content: Word recall, precision, F1, sequence similarity
  2. Table Extraction: Detection rate, structure accuracy, cell accuracy
  3. Structure: Heading detection, list preservation, figure markers

LLM Judge (0-10)

  1. Text Completeness: How much visible text was captured?
  2. Text Accuracy: How accurate is the extracted text?
  3. Table Extraction: How well were tables captured?
  4. Layout Understanding: How well is structure preserved?
  5. Structure Bonus (0-3): Extra credit for formatting
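
The judge dimensions and their ranges can be captured in a small validated record. This is a hypothetical data model, not the one in `src/models/`; the class and field names are illustrative.

```python
from dataclasses import dataclass

# The four dimensions scored on a 0-10 scale.
_TEN_POINT_FIELDS = (
    "text_completeness",
    "text_accuracy",
    "table_extraction",
    "layout_understanding",
)


@dataclass(frozen=True)
class JudgeScores:
    """Hypothetical container for one LLM-judge verdict."""

    text_completeness: float
    text_accuracy: float
    table_extraction: float
    layout_understanding: float
    structure_bonus: float = 0.0  # extra credit, 0-3

    def __post_init__(self) -> None:
        for name in _TEN_POINT_FIELDS:
            value = getattr(self, name)
            if not 0.0 <= value <= 10.0:
                raise ValueError(f"{name} must be in [0, 10], got {value}")
        if not 0.0 <= self.structure_bonus <= 3.0:
            raise ValueError(f"structure_bonus must be in [0, 3], got {self.structure_bonus}")


# ABBYY's dimension scores from the LLM Judge Validation table.
abbyy = JudgeScores(9.7, 8.5, 7.0, 7.8)
print(abbyy)
```

No aggregation method is included on purpose: the reported 8.0/10 overall for ABBYY is not the plain mean of its four dimension scores (8.25), so the overall score presumably comes from the judge directly or from a weighting that is not documented here.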

Environment Variables

Required API keys in .env:

| Variable | Used For |
|----------|----------|
| OPENAI_API_KEY | GPT-5.2 extractor |
| GOOGLE_API_KEY | Gemini extractors and LLM judge |
| LLAMA_CLOUD_API_KEY | LlamaParse |
| LANDINGAI_API_KEY | Landing AI (reference) |
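
Before a long run it can help to confirm that the keys are actually set. The check below is a small sketch: it only inspects variables already present in the process environment, and the helper name `missing_keys` is illustrative. How the repository loads `.env` is not shown here.

```python
import os
from typing import Mapping

# Variable names and purposes as listed in the table above.
REQUIRED_KEYS = {
    "OPENAI_API_KEY": "GPT-5.2 extractor",
    "GOOGLE_API_KEY": "Gemini extractors and LLM judge",
    "LLAMA_CLOUD_API_KEY": "LlamaParse",
    "LANDINGAI_API_KEY": "Landing AI (reference)",
}


def missing_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return required variable names that are unset or empty in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


for key in missing_keys():
    print(f"Missing {key} (needed for: {REQUIRED_KEYS[key]})")
```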

Cost Analysis

| Tier | Examples | Cost/Page | 403-Page Cost |
|------|----------|-----------|---------------|
| Free | PyMuPDF, Docling, Tesseract | $0 | $0 |
| Low-Cost | ABBYY | $0.0018 | $0.73 |
| AI Cloud | Gemini 3 Flash, GPT-5.2, LlamaParse | $0.01-0.05 | $4-20 |
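
The per-page figures follow directly from dividing the run cost by 403 pages. A short sketch of that arithmetic, using the totals reported in the rankings table (the cloud totals are approximate, so their per-page values are too):

```python
PAGES = 403

# Total cost in USD for the 403-page run, from the Overall Rankings table.
TOTAL_COST_USD = {
    "ABBYY": 0.73,
    "Gemini 3 Flash": 4.0,   # approximate (~$4)
    "GPT-5.2": 12.0,         # approximate (~$12)
    "LlamaParse": 20.0,      # approximate (~$20)
}


def cost_per_page(total_usd: float, pages: int = PAGES) -> float:
    """Average cost per page for one full run."""
    return total_usd / pages


for name, total in TOTAL_COST_USD.items():
    per_page = cost_per_page(total)
    print(f"{name:<15} ${per_page:.4f}/page  ${per_page * 1000:.2f}/1000 pages")
```

For ABBYY this gives about $0.0018 per page. The $1.80 per 1000 pages quoted earlier is that rounded per-page value scaled up; the unrounded figure is closer to $1.81.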

Running Tests

```bash
pytest tests/ -v
```

Full Report

See analysis/benchmark/multi_objective_report.md for the complete benchmark analysis including:

  • Detailed per-extractor breakdowns
  • Statistical analysis
  • Comparative analysis (ABBYY vs LlamaParse, ABBYY vs Docling, etc.)
  • Methodology documentation

License

MIT
