OCR Benchmark

A comprehensive multi-objective benchmark comparing 10 OCR/document extraction systems on a 403-page financial document.

Benchmark Results

Document: Lufthansa Annual Report 2024 (403 pages)
Reference: Landing AI
Evaluation: Rule-based metrics + LLM Judge (Gemini 3 Pro)

Overall Rankings

Extractors are ranked by Text score. Text, Tables, and Structure are rule-based scores on a 0-100 scale. Landing AI output serves as the reference, so it does not appear in the ranking.

Rank	Extractor	Text	Tables	Structure	Cost (403pg)	Category
1	LlamaParse	93.3	72.3	61.7	~$20	AI Cloud
2	GPT-5.2	92.5	72.3	59.9	~$12	AI Cloud
3	Gemini 3 Flash	91.6	74.3	58.9	~$4	AI Cloud
4	PyMuPDF	90.9	0.0	57.3	$0	Local (Free)
5	Gemini 3 Pro	87.2	68.6	61.7	~$8	AI Cloud
6	ABBYY	85.3	71.9	63.8	$0.73	Low-Cost
7	Tesseract	83.1	0.0	66.4	$0	Local (Free)
8	Docling	82.3	76.3	64.4	$0	Local (Free)
9	Gemini 2.5 Flash	33.4	16.8	52.1	~$2	AI Cloud

Best in Category

Metric	Winner	Score	Runner-up	Score
Text Content	LlamaParse	93.3	GPT-5.2	92.5
Table Extraction	Docling	76.3	Gemini 3 Flash	74.3
Structure	Tesseract	66.4	Docling	64.4
Best Value	ABBYY	8.0/10	Docling	Free

LLM Judge Validation (ABBYY)

Dimension	Score	Notes
Text Completeness	9.7/10	Near-perfect capture
Text Accuracy	8.5/10	Minor OCR errors
Table Extraction	7.0/10	Good structure
Layout Understanding	7.8/10	Good document structure
Overall	8.0/10	Strong performance

Evaluated on 50 stratified sample pages using Gemini 3 Pro
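
As a rough illustration of how 50 stratified pages could be drawn from a 403-page document, the sketch below splits the page range into 50 contiguous strata and picks one page per stratum. Stratifying by page position, the function name `stratified_pages`, and the seed are assumptions made for this example; the benchmark may stratify differently (for instance by content type).

```python
import random


def stratified_pages(total_pages: int = 403, sample_size: int = 50, seed: int = 0) -> list[int]:
    """Pick one 1-based page number from each of `sample_size` contiguous strata."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    pages = []
    for i in range(sample_size):
        # Stratum i covers pages start..end (inclusive); strata do not overlap.
        start = i * total_pages // sample_size + 1
        end = (i + 1) * total_pages // sample_size
        pages.append(rng.randint(start, end))
    return pages


sample = stratified_pages()
print(len(sample), "pages sampled, from", sample[0], "to", sample[-1])
```

Because every stratum contributes exactly one page, the sample covers the front matter, the narrative sections, and the financial statements at the back in roughly equal proportion.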

Key Findings

- No single winner - Different extractors excel at different tasks
- AI extractors dominate text - LlamaParse, GPT-5.2, and Gemini 3 Flash all score above 91 on text
- Docling dominates tables - 76.3 table score with the best detection and cell accuracy (free!)
- ABBYY is the best low-cost option - 85.3/71.9/63.8 (text/tables/structure) at just $0.73 for 403 pages
  - LLM-validated 8.0/10 quality
  - 95.9% word recall (highest)
  - $0.0018/page ($1.80/1000 pages)
- Free options are competitive - Docling achieves 85-95% of paid AI quality
- Gemini 2.5 Flash should be avoided - Critical failures on this document

Use Case Recommendations

Use Case	Recommended	Why
RAG / Text Search	LlamaParse	97.1% word precision, best for embeddings
Data Extraction (Tables)	Docling	76.3% table score, free
Cost-Sensitive Production	Docling or ABBYY	Free or $0.0018/page
Best Overall (Cost No Object)	LlamaParse	75.8 avg across all metrics
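
The 75.8 average quoted for LlamaParse is consistent with an unweighted mean of its three rule-based scores. The sketch below recomputes that mean for every extractor from the Overall Rankings table; treating the average as unweighted is an assumption, and the variable names are illustrative.

```python
# Scores copied from the Overall Rankings table: (text, tables, structure).
scores = {
    "LlamaParse": (93.3, 72.3, 61.7),
    "GPT-5.2": (92.5, 72.3, 59.9),
    "Gemini 3 Flash": (91.6, 74.3, 58.9),
    "PyMuPDF": (90.9, 0.0, 57.3),
    "Gemini 3 Pro": (87.2, 68.6, 61.7),
    "ABBYY": (85.3, 71.9, 63.8),
    "Tesseract": (83.1, 0.0, 66.4),
    "Docling": (82.3, 76.3, 64.4),
    "Gemini 2.5 Flash": (33.4, 16.8, 52.1),
}

# Assumption: the overall figure is a plain mean of the three scores.
averages = {name: round(sum(s) / len(s), 1) for name, s in scores.items()}
best = max(averages, key=averages.get)

for name, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {avg:5.1f}")
print("Highest average:", best)
```

Under that assumption the top four averages sit within about 1.5 points of each other, which is why the recommendation depends so heavily on the use case.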

Features

10 OCR Extractors: PyMuPDF, Tesseract, Docling, LlamaParse, Landing AI, GPT-5.2, Gemini 3 Flash, Gemini 3 Pro, Gemini 2.5 Flash, ABBYY
Multi-Objective Evaluation: Text content, table extraction, structure preservation
LLM-as-Judge: Gemini 3 Pro evaluates against reference images
Cross-Judge Bias Prevention: Same-family extractors receive reduced weight
Cost-Performance Analysis: Compare quality vs. price across tiers
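
To make the cross-judge bias idea concrete, here is a minimal sketch of how a same-family discount could be applied when the judge (Gemini 3 Pro) scores an extractor from its own model family. The 0.5 weight, the family lookup by first word, and the blending formula are all hypothetical; the actual logic lives in the repository's judge code and may differ.

```python
JUDGE_FAMILY = "gemini"      # the judge is Gemini 3 Pro
SAME_FAMILY_WEIGHT = 0.5     # hypothetical discount, not taken from the repo


def model_family(extractor: str) -> str:
    """Crude family lookup: first word of the extractor name, lowercased."""
    return extractor.split()[0].lower()


def judge_weight(extractor: str, judge_family: str = JUDGE_FAMILY) -> float:
    """Return the weight given to the judge's score for this extractor."""
    return SAME_FAMILY_WEIGHT if model_family(extractor) == judge_family else 1.0


def blended_score(rule_based: float, judge_score: float, extractor: str) -> float:
    """Weighted mean of a 0-100 rule-based score and a 0-10 judge score (hypothetical formula)."""
    w = judge_weight(extractor)
    return (rule_based + w * judge_score * 10) / (1 + w)


print(judge_weight("Gemini 3 Flash"))   # same family as the judge, discounted
print(judge_weight("ABBYY"))            # different family, full weight
print(round(blended_score(85.3, 8.0, "ABBYY"), 2))
```

The point of the discount is that a judge's opinion counts for less when it is grading output from a closely related model, so the rule-based metrics carry more of the final score in that case.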

Quick Start

# Setup
pip install -r requirements.txt
cp .env.template .env
# Edit .env with your API keys

# Run benchmark
python scripts/run_multi_objective_benchmark.py

# Run LLM judge evaluation
python scripts/run_llm_judge.py --extractor E12

# View full report
cat analysis/benchmark/multi_objective_report.md

Project Structure

ocr-benchmark/
├── analysis/                   # Benchmark results and reports
│   ├── benchmark/              # Multi-objective benchmark report
│   └── llm_judge/              # LLM judge evaluation results
├── configs/                    # Configuration files
├── datasets/                   # Reference data
├── results/                    # Extraction results per extractor
├── scripts/                    # Benchmark and processing scripts
│   ├── run_multi_objective_benchmark.py
│   ├── run_llm_judge.py
│   └── process_abbyy_extraction.py
├── src/                        # Source code
│   ├── extractors/             # OCR extractor implementations
│   ├── judges/                 # LLM judge implementations
│   ├── models/                 # Data models
│   └── utils/                  # Utilities
└── tests/                      # Unit tests

Evaluation Metrics

Rule-Based (0-100)

Text Content: Word recall, precision, F1, sequence similarity
Table Extraction: Detection rate, structure accuracy, cell accuracy
Structure: Heading detection, list preservation, figure markers
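
The text-content metrics can be sketched with the standard library alone. The example below computes word recall, precision, F1, and a sequence similarity for one extracted string against a reference. Whitespace tokenization, lowercasing, multiset overlap, and the use of `difflib.SequenceMatcher` are assumptions made for illustration; the benchmark's own implementation may normalize or weight things differently.

```python
from collections import Counter
from difflib import SequenceMatcher


def text_metrics(extracted: str, reference: str) -> dict[str, float]:
    """Return word recall, precision, F1, and sequence similarity on a 0-100 scale."""
    ext_words = extracted.lower().split()
    ref_words = reference.lower().split()

    # Multiset intersection: a repeated word only counts as often as it appears in both.
    overlap = sum((Counter(ext_words) & Counter(ref_words)).values())

    recall = overlap / len(ref_words) if ref_words else 0.0
    precision = overlap / len(ext_words) if ext_words else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Order-sensitive similarity over the word sequences.
    similarity = SequenceMatcher(None, ext_words, ref_words).ratio()

    return {
        "recall": round(recall * 100, 1),
        "precision": round(precision * 100, 1),
        "f1": round(f1 * 100, 1),
        "sequence_similarity": round(similarity * 100, 1),
    }


reference = "Revenue increased by 5 percent in 2024"
extracted = "Revenue increased by 5 percent"
metrics = text_metrics(extracted, reference)
print(metrics)
```

In this toy case every extracted word is correct, so precision is 100, while the two missing words pull recall down to 71.4. That is the same distinction the findings draw between ABBYY's high word recall and LlamaParse's high word precision.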

LLM Judge (0-10)

Text Completeness: How much visible text was captured?
Text Accuracy: How accurate is the extracted text?
Table Extraction: How well were tables captured?
Layout Understanding: How well is structure preserved?
Structure Bonus (0-3): Extra credit for formatting

Environment Variables

Required API keys in .env:

Variable	Used For
`OPENAI_API_KEY`	GPT-5.2 extractor
`GOOGLE_API_KEY`	Gemini extractors and LLM judge
`LLAMA_CLOUD_API_KEY`	LlamaParse
`LANDINGAI_API_KEY`	Landing AI (reference)
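
Before starting a long run it can help to confirm that the keys are actually visible to the process. This is a small sketch that checks the four variables listed above; it only inspects the process environment and assumes the `.env` file has already been loaded by whatever mechanism the scripts use. The helper name `missing_keys` is illustrative.

```python
import os
from collections.abc import Mapping

REQUIRED_KEYS = [
    "OPENAI_API_KEY",       # GPT-5.2 extractor
    "GOOGLE_API_KEY",       # Gemini extractors and LLM judge
    "LLAMA_CLOUD_API_KEY",  # LlamaParse
    "LANDINGAI_API_KEY",    # Landing AI (reference)
]


def missing_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


missing = missing_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All required API keys are set.")
```

Extractors that run locally (PyMuPDF, Tesseract, Docling) should not need any of these keys, so a partial set may be enough if only those are being benchmarked.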

Cost Analysis

Tier	Examples	Cost/Page	403-Page Cost
Free	PyMuPDF, Docling, Tesseract	$0	$0
Low-Cost	ABBYY	$0.0018	$0.73
Cloud AI	Gemini Flash, GPT-5.2, LlamaParse	$0.01-0.05	$4-20
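
The per-page figures follow directly from dividing each 403-page total by the page count. The sketch below reproduces that arithmetic for the paid extractors using the approximate totals from the rankings table; the dictionary and variable names are illustrative.

```python
PAGES = 403

# Approximate totals for the 403-page run, taken from the Overall Rankings table.
total_cost_usd = {
    "ABBYY": 0.73,
    "Gemini 2.5 Flash": 2.0,
    "Gemini 3 Flash": 4.0,
    "Gemini 3 Pro": 8.0,
    "GPT-5.2": 12.0,
    "LlamaParse": 20.0,
}

per_page = {name: cost / PAGES for name, cost in total_cost_usd.items()}

for name, cost in per_page.items():
    # Scale to 1000 pages to make small per-page prices easier to compare.
    print(f"{name:18s} ${cost:.4f}/page  ${cost * 1000:.2f}/1000 pages")
```

ABBYY comes out at about $0.0018 per page, and the $4 to $20 cloud totals land at roughly $0.01 to $0.05 per page, matching the tiers in the table.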

Running Tests

pytest tests/ -v

Full Report

See analysis/benchmark/multi_objective_report.md for the complete benchmark analysis including:

Detailed per-extractor breakdowns
Statistical analysis
Comparative analysis (ABBYY vs LlamaParse, ABBYY vs Docling, etc.)
Methodology documentation

License

MIT
