A comprehensive multi-objective benchmark comparing 10 OCR/document extraction systems on a 403-page financial document.
- Document: Lufthansa Annual Report 2024 (403 pages)
- Reference: Landing AI
- Evaluation: Rule-based metrics + LLM Judge (Gemini 3 Pro)
| Rank | Extractor | Text | Tables | Structure | Cost (403pg) | Category |
|---|---|---|---|---|---|---|
| 1 | LlamaParse | 93.3 | 72.3 | 61.7 | ~$20 | AI Cloud |
| 2 | GPT-5.2 | 92.5 | 72.3 | 59.9 | ~$12 | AI Cloud |
| 3 | Gemini 3 Flash | 91.6 | 74.3 | 58.9 | ~$4 | AI Cloud |
| 4 | PyMuPDF | 90.9 | 0.0 | 57.3 | $0 | Local (Free) |
| 5 | Gemini 3 Pro | 87.2 | 68.6 | 61.7 | ~$8 | AI Cloud |
| 6 | ABBYY | 85.3 | 71.9 | 63.8 | $0.73 | Low-Cost |
| 7 | Tesseract | 83.1 | 0.0 | 66.4 | $0 | Local (Free) |
| 8 | Docling | 82.3 | 76.3 | 64.4 | $0 | Local (Free) |
| 9 | Gemini 2.5 Flash | 33.4 | 16.8 | 52.1 | ~$2 | AI Cloud |

| Metric | Winner | Score | Runner-up | Score |
|---|---|---|---|---|
| Text Content | LlamaParse | 93.3 | GPT-5.2 | 92.5 |
| Table Extraction | Docling | 76.3 | Gemini 3 Flash | 74.3 |
| Structure | Tesseract | 66.4 | Docling | 64.4 |
| Best Value | ABBYY | 8.0/10 | Docling | Free |

| Dimension | Score | Notes |
|---|---|---|
| Text Completeness | 9.7/10 | Near-perfect capture |
| Text Accuracy | 8.5/10 | Minor OCR errors |
| Table Extraction | 7.0/10 | Good structure |
| Layout Understanding | 7.8/10 | Good document structure |
| Overall | 8.0/10 | Strong performance |
Evaluated on 50 stratified sample pages using Gemini 3 Pro
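One way to draw a stratified sample of 50 pages from a 403-page document is to split the page range into 50 contiguous strata and pick one page from each. The sketch below assumes position-based strata and a fixed seed. The benchmark's actual stratification criteria are not stated here, and `stratified_page_sample` is a hypothetical helper, not a function from the repository.

```python
import random


def stratified_page_sample(total_pages: int, sample_size: int, seed: int = 0) -> list[int]:
    """Pick one 1-based page number from each of `sample_size` contiguous strata."""
    if not 0 < sample_size <= total_pages:
        raise ValueError("sample_size must be between 1 and total_pages")
    rng = random.Random(seed)  # fixed seed (assumed) so the sample is reproducible
    pages = []
    for i in range(sample_size):
        # Stratum i covers the 0-based half-open page range [start, end).
        start = i * total_pages // sample_size
        end = (i + 1) * total_pages // sample_size
        pages.append(rng.randrange(start, end) + 1)
    return pages


sample = stratified_page_sample(total_pages=403, sample_size=50)
print(len(sample), sample[:5])
```

Because the strata are disjoint and ordered, the sampled pages are unique and spread across the whole report rather than clustered in one section.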
- No single winner - Different extractors excel at different tasks
- AI extractors dominate text - LlamaParse, GPT-5.2, and Gemini 3 Flash all score 91%+
- Docling dominates tables - 76.3% with best detection and cell accuracy (free!)
- ABBYY is the best low-cost option - 85.3/71.9/63.8 at just $0.73 for 403 pages
  - LLM-validated 8.0/10 quality
  - 95.9% word recall (highest)
  - $0.0018/page ($1.80/1000 pages)
- Free options are competitive - Docling achieves 85-95% of paid AI quality
- Gemini 2.5 Flash should be avoided - Critical failures on this document (33.4 text, 16.8 tables)
| Use Case | Recommended | Why |
|---|---|---|
| RAG / Text Search | LlamaParse | 97.1% word precision, best for embeddings |
| Data Extraction (Tables) | Docling | 76.3% table score, free |
| Cost-Sensitive Production | Docling or ABBYY | Free or $0.0018/page |
| Best Overall (Cost No Object) | LlamaParse | 75.8 avg across all metrics |
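The 75.8 average quoted for LlamaParse is consistent with an unweighted mean of its three rule-based scores. The sketch below applies that assumed formula to a few rows of the results table. The `scores` dictionary and `mean_score` helper are illustrative names, not part of the repository.

```python
from statistics import mean

# (text, tables, structure) scores copied from the results table.
scores = {
    "LlamaParse": (93.3, 72.3, 61.7),
    "GPT-5.2": (92.5, 72.3, 59.9),
    "Gemini 3 Flash": (91.6, 74.3, 58.9),
    "ABBYY": (85.3, 71.9, 63.8),
    "Docling": (82.3, 76.3, 64.4),
}


def mean_score(triple):
    """Unweighted mean of the three metric scores, rounded to one decimal."""
    return round(mean(triple), 1)


# Print extractors from highest to lowest unweighted mean.
for name, triple in sorted(scores.items(), key=lambda kv: mean(kv[1]), reverse=True):
    print(f"{name:15s} {mean_score(triple):.1f}")
```

A weighted mean would be a reasonable alternative if one metric matters more for a given use case, for example tables for data extraction.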
- 10 OCR Extractors: PyMuPDF, Tesseract, Docling, LlamaParse, Landing AI, GPT-5.2, Gemini 3 Flash/Pro, Gemini 2.5 Flash, ABBYY
- Multi-Objective Evaluation: Text content, table extraction, structure preservation
- LLM-as-Judge: Gemini 3 Pro evaluates against reference images
- Cross-Judge Bias Prevention: Same-family extractors receive reduced weight
- Cost-Performance Analysis: Compare quality vs. price across tiers
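One possible reading of the cross-judge bias rule above is a per-judgment weight that shrinks when the judge and the extractor come from the same model family. The sketch below is only an illustration under that assumption. The `FAMILY` mapping, the `SAME_FAMILY_WEIGHT` value of 0.5, the use of a second judge, and the example scores are all hypothetical and not taken from the benchmark code.

```python
# Hypothetical mapping from model name to vendor family.
FAMILY = {
    "Gemini 3 Pro": "google",
    "Gemini 3 Flash": "google",
    "Gemini 2.5 Flash": "google",
    "GPT-5.2": "openai",
}
SAME_FAMILY_WEIGHT = 0.5  # assumed value; the real weight is not documented here


def judgment_weight(extractor: str, judge: str) -> float:
    """Return a reduced weight when extractor and judge share a family."""
    ext_family = FAMILY.get(extractor)
    same = ext_family is not None and ext_family == FAMILY.get(judge)
    return SAME_FAMILY_WEIGHT if same else 1.0


def weighted_judge_score(extractor: str, judge_scores: dict[str, float]) -> float:
    """Weighted mean of per-judge scores for one extractor."""
    weights = {judge: judgment_weight(extractor, judge) for judge in judge_scores}
    total = sum(weights.values())
    return sum(judge_scores[j] * weights[j] for j in judge_scores) / total


# Made-up scores from two judges, purely to show the effect of the weight.
example = {"Gemini 3 Pro": 9.0, "GPT-5.2": 6.0}
print(weighted_judge_score("Gemini 3 Flash", example))  # same-family judge down-weighted
print(weighted_judge_score("LlamaParse", example))      # both judges weighted equally
```

With these toy inputs, the Gemini judge's higher score counts for less when the extractor is also a Gemini model, which pulls the combined score toward the other judge.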
# Setup
pip install -r requirements.txt
cp .env.template .env
# Edit .env with your API keys
# Run benchmark
python scripts/run_multi_objective_benchmark.py
# Run LLM judge evaluation
python scripts/run_llm_judge.py --extractor E12
# View full report
cat analysis/benchmark/multi_objective_report.md

ocr-benchmark/
├── analysis/ # Benchmark results and reports
│ ├── benchmark/ # Multi-objective benchmark report
│ └── llm_judge/ # LLM judge evaluation results
├── configs/ # Configuration files
├── datasets/ # Reference data
├── results/ # Extraction results per extractor
├── scripts/ # Benchmark and processing scripts
│ ├── run_multi_objective_benchmark.py
│ ├── run_llm_judge.py
│ └── process_abbyy_extraction.py
├── src/ # Source code
│ ├── extractors/ # OCR extractor implementations
│ ├── judges/ # LLM judge implementations
│ ├── models/ # Data models
│ └── utils/ # Utilities
└── tests/ # Unit tests
- Text Content: Word recall, precision, F1, sequence similarity
- Table Extraction: Detection rate, structure accuracy, cell accuracy
- Structure: Heading detection, list preservation, figure markers
- Text Completeness: How much visible text was captured?
- Text Accuracy: How accurate is the extracted text?
- Table Extraction: How well were tables captured?
- Layout Understanding: How well is structure preserved?
- Structure Bonus (0-3): Extra credit for formatting
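The rule-based text content metrics named above (word recall, precision, F1, sequence similarity) can be approximated with the standard library. This is a minimal sketch that assumes lowercase whitespace tokenization, multiset word overlap, and `difflib.SequenceMatcher` for similarity. The benchmark's own tokenization and matching rules may differ, and `text_metrics` is a hypothetical helper.

```python
from collections import Counter
from difflib import SequenceMatcher


def text_metrics(reference: str, extracted: str) -> dict[str, float]:
    """Word-level recall/precision/F1 plus a sequence similarity ratio."""
    ref_words = reference.lower().split()
    ext_words = extracted.lower().split()
    # Multiset intersection: each word counts at most as often as it occurs in both texts.
    overlap = sum((Counter(ref_words) & Counter(ext_words)).values())
    recall = overlap / len(ref_words) if ref_words else 0.0
    precision = overlap / len(ext_words) if ext_words else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    # Ratio in [0, 1] computed over the word sequences, so word order matters here.
    similarity = SequenceMatcher(None, ref_words, ext_words).ratio()
    return {"recall": recall, "precision": precision, "f1": f1, "similarity": similarity}


metrics = text_metrics(
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps",
)
print(metrics)
```

In this toy case every extracted word is correct, so precision is perfect while recall drops because the tail of the reference is missing. That is the pattern to expect when an extractor truncates a page.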
Required API keys in .env:
| Variable | Used For |
|---|---|
| OPENAI_API_KEY | GPT-5.2 extractor |
| GOOGLE_API_KEY | Gemini extractors and LLM judge |
| LLAMA_CLOUD_API_KEY | LlamaParse |
| LANDINGAI_API_KEY | Landing AI (reference) |

| Tier | Examples | Cost/Page | 403-Page Cost |
|---|---|---|---|
| Free | PyMuPDF, Docling, Tesseract | $0 | $0 |
| Low-Cost | ABBYY | $0.0018 | $0.73 |
| Cloud AI | Gemini Flash, GPT-5.2, LlamaParse | $0.01-0.05 | $4-20 |
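The per-page figures in the tier table follow from dividing the total cost by the 403 pages processed. A small sketch of that arithmetic, using the ABBYY total from the table (`cost_per_page` is an illustrative helper, not part of the repository):

```python
def cost_per_page(total_cost_usd: float, pages: int) -> float:
    """Average cost per page in US dollars."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return total_cost_usd / pages


abbyy = cost_per_page(0.73, 403)
print(f"per page:       ${abbyy:.4f}")
print(f"per 1000 pages: ${abbyy * 1000:.2f}")
```

Computed from the unrounded ratio, the per-1000-page cost comes out marginally above the $1.80 obtained by scaling the rounded $0.0018 rate.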
pytest tests/ -v

See analysis/benchmark/multi_objective_report.md for the complete benchmark analysis including:
- Detailed per-extractor breakdowns
- Statistical analysis
- Comparative analysis (ABBYY vs LlamaParse, ABBYY vs Docling, etc.)
- Methodology documentation
MIT