LORE — LLM OCR Robustness Evaluation

A benchmark for evaluating how well large language models extract and normalize structured data from corrupted OCR text. It tests models across three document domains, four difficulty tiers, and five evaluation dimensions, with no third-party ML dependencies.

What it measures

Real OCR pipelines produce noisy text — characters substituted, lines merged, values truncated, dates in wrong formats. This benchmark asks: given that corrupted text, can an LLM reconstruct the original structured record?

This is harder than it looks. A model that just echoes the schema back scores high on field presence (F1) but low on value correctness (exact match). The benchmark is designed so that exact match rate and mean normalized edit distance are the primary metrics — not field F1.

Domains

Domain	Document type	Fields
`receipts`	Retail / food receipts	vendor, date, total, tax, payment method, line items
`insurance`	Insurance policy documents	policyholder details, policy number, premium, agent
`hospital`	Hospital visit records	patient details, vitals, visit reason, physician, insurance

Difficulty tiers

Each sample is assigned one of four difficulty levels, controlled by how many and which noise functions are applied:

Tier	Noise applied	What breaks
`easy`	0–1 character-level errors	Minor substitutions (O→0, l→1)
`medium`	2–3 character + structure	Merged lines, missing colons, date format changes
`hard`	4–5 character + structure + numeric	Decimal shifts, partial dates, value truncation
`extreme`	7–9 all tiers + semantic traps	Value swaps between fields, section erasure, ghost values, conflicting duplicates
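The tier table can be read as a sampling rule: pick a count from the tier's range, then draw that many noise functions from the tiers the difficulty is allowed to use. The sketch below is one way to express that, not the repository's code. The count ranges and function names come from this README; the uniform draw without replacement and the `plan_noise` helper are assumptions (the project itself mentions weighted selection via config/noise_weights.json).

```python
import random

# Function names as listed under "Noise functions" in this README.
TIER_FUNCTIONS = {
    1: ["char_O0", "char_l1I", "char_B8", "char_S5", "char_sub", "char_del", "char_transpose"],
    2: ["missing_colon", "merged_lines", "split_line", "key_abbrev", "value_truncated", "delimiter_swap"],
    3: ["num_comma_drop", "num_decimal_shift", "date_format_vary", "date_partial"],
    4: ["extraneous_field", "ghost_value", "conflicting_field", "ambiguous_key", "value_swap", "section_erase"],
}

# (count range, highest tier allowed) per difficulty, taken from the table above.
DIFFICULTY = {
    "easy": ((0, 1), 1),
    "medium": ((2, 3), 2),
    "hard": ((4, 5), 3),
    "extreme": ((7, 9), 4),
}


def plan_noise(difficulty: str, rng: random.Random) -> list[str]:
    """Hypothetical helper: choose which noise functions to apply to one sample."""
    (low, high), max_tier = DIFFICULTY[difficulty]
    pool = [name for tier in range(1, max_tier + 1) for name in TIER_FUNCTIONS[tier]]
    # Assumption: uniform sampling without replacement, not the weighted scheme.
    return rng.sample(pool, rng.randint(low, high))


if __name__ == "__main__":
    print(plan_noise("extreme", random.Random(7)))
```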

Metrics

Metric	What it measures	Primary?
`exact_match_rate`	Fraction of fields with exactly correct value	✓ Primary
`mean_ned`	Normalized edit distance on string fields	✓ Primary
`field_f1`	Field presence F1 (harmonic mean of precision and recall)	Secondary
`hallucination_rate`	Fraction of invented fields not derivable from OCR	Secondary
`schema_valid`	Fraction of outputs with correct nested structure	Secondary
`correction_gain`	Improvement over raw OCR text (negative = regression)	Secondary
`parse_success`	Fraction of outputs parseable as valid JSON	Secondary

ID-type fields (policy_number, receipt_number, attending_physician.id, agent.agent_id) require case-sensitive exact matches — a model cannot score partial credit by returning a plausible-looking identifier.
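To make the two primary metrics concrete, here is a minimal pure-Python sketch of normalized edit distance and a per-record exact match rate. It is an illustration under stated assumptions, not the code in evaluator/: the flat field names, the `ID_FIELDS` set, and the rule that non-ID fields are compared after trimming and lowercasing are all hypothetical. Only the case-sensitive treatment of ID fields is taken from the text above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def normalized_edit_distance(pred: str, gold: str) -> float:
    """Edit distance divided by the longer length: 0.0 is identical, lower is better."""
    longest = max(len(pred), len(gold))
    return 0.0 if longest == 0 else levenshtein(pred, gold) / longest


# Hypothetical flat names; the real schemas live in config/domains.json.
ID_FIELDS = {"policy_number", "receipt_number"}


def field_matches(name: str, pred, gold) -> bool:
    if name in ID_FIELDS:
        return pred == gold  # case-sensitive, no normalization
    # Assumption: other fields are compared after trimming and lowercasing.
    return str(pred).strip().lower() == str(gold).strip().lower()


def exact_match_rate(pred: dict, gold: dict) -> float:
    """Fraction of ground-truth fields whose predicted value matches."""
    if not gold:
        return 0.0
    hits = sum(1 for key, value in gold.items()
               if key in pred and field_matches(key, pred[key], value))
    return hits / len(gold)
```

With this sketch, a prediction that returns `PN-12AB` for a ground-truth `PN-12ab` counts as a miss, while `cafe mocha ` still matches `Cafe Mocha`.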

Results — Llama 3.2 (3B, local, dev split)

Difficulty	Exact Match	Mean NED
easy	0.731	0.208
medium	0.518	0.294
hard	0.333	0.377
extreme	0.243	0.428

Overall: exact match 0.456 · NED 0.327 · hallucination rate 0.143

Project structure

llm_ocr_transformation_benchmark/
├── config/
│   ├── domains.json              # Field schemas and required fields per domain
│   ├── eval_config.json          # Frozen evaluation settings (metrics, thresholds, models)
│   ├── generation_config.json    # Dataset generation settings
│   └── noise_weights.json        # Noise function weights and difficulty ranges
│
├── prompts/
│   └── templates/
│       ├── receipts.txt          # Domain-specific prompt template
│       ├── insurance.txt
│       ├── hospital.txt
│       └── task_instructions.json  # Per-task instruction variants
│
├── dataset/
│   ├── gt_generator.py           # Generates diverse ground-truth records
│   ├── serializer.py             # gt_struct → clean text representation
│   ├── noise_generator.py        # Applies noise tiers to clean text
│   ├── sample_builder.py         # Assembles samples and writes JSONL files
│   └── loader.py                 # Loads and filters dataset files
│
├── runners/
│   ├── prompt_formatter.py       # Injects sample into prompt template
│   ├── llm_adapter.py            # Unified API caller (OpenAI / Anthropic / Ollama)
│   └── multi_run.py              # Crash-safe orchestrator with resume support
│
├── parsers/
│   ├── json_coercion.py          # 5-attempt cascade JSON parser
│   └── normalizers.py            # Date / time / number / phone / string normalizers
│
├── evaluator/
│   ├── field_metrics.py          # Precision, recall, F1, exact match
│   ├── normalization_metrics.py  # Levenshtein NED, numeric tolerance
│   ├── correction_metrics.py     # OCR→LLM correction gain
│   ├── hallucination_detector.py # Fuzzy substring hallucination check
│   └── schema_validator.py       # Structural and type validation
│
├── stats/
│   ├── aggregator.py             # Grouped statistics (overall / domain / difficulty)
│   ├── bootstrap.py              # Bootstrap CI + Wilcoxon signed-rank test
│   └── visuals.py                # ASCII and HTML/SVG chart generators
│
├── report/
│   └── generator.py              # Full pipeline → self-contained HTML report
│
├── data/                         # Generated by sample_builder.py — not committed
└── runs/                         # Generated by multi_run.py — not committed
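The idea behind a cascade parser such as the one listed for parsers/json_coercion.py is to try increasingly forgiving repairs until one yields a JSON object. The sketch below shows that pattern with four illustrative attempts. The specific attempts, their order, and the `coerce_json` name are assumptions and do not reproduce the repository's five-attempt cascade.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable
FENCED_BLOCK = re.compile(re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL)
TRAILING_COMMA = re.compile(r",\s*([}\]])")


def coerce_json(raw: str):
    """Return the first candidate that parses to a dict, or None if all fail."""
    candidates = [raw]                        # attempt 1: the output as-is
    fenced = FENCED_BLOCK.search(raw)
    if fenced:
        candidates.append(fenced.group(1))    # attempt 2: contents of a code fence
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        body = raw[start:end + 1]
        candidates.append(body)               # attempt 3: outermost {...} span
        # attempt 4: drop trailing commas (naive: also touches commas inside strings)
        candidates.append(TRAILING_COMMA.sub(r"\1", body))
    for text in candidates:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None
```

Returning `None` instead of raising keeps a failed parse countable, which is what a metric like `parse_success` needs.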

Setup

Requirements: Python 3.10+. No ML libraries required — the entire pipeline runs on the standard library.

git clone https://github.com/ashwin549/lore-benchmark
cd lore-benchmark
pip install -r requirements.txt

Generate the dataset (one time):

python dataset/sample_builder.py

This writes 36 JSONL files to data/ (3 domains × 4 difficulties × 3 splits), 1,200 samples in total: 100 samples per domain and difficulty, split 60/20/20 into train/dev/test.
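The file and sample counts follow directly from the grid of domains, difficulties, and splits. The short sketch below recomputes them as a sanity check. The `expected_layout` helper is hypothetical, and it says nothing about how the generated files are actually named.

```python
from itertools import product

DOMAINS = ["receipts", "insurance", "hospital"]
DIFFICULTIES = ["easy", "medium", "hard", "extreme"]
SPLIT_PERCENT = {"train": 60, "dev": 20, "test": 20}
SAMPLES_PER_CELL = 100  # samples per (domain, difficulty) pair


def expected_layout() -> dict:
    """Map each (domain, difficulty, split) cell to its expected sample count."""
    layout = {}
    for domain, difficulty in product(DOMAINS, DIFFICULTIES):
        for split, percent in SPLIT_PERCENT.items():
            layout[(domain, difficulty, split)] = SAMPLES_PER_CELL * percent // 100
    return layout


if __name__ == "__main__":
    layout = expected_layout()
    print(len(layout), "files,", sum(layout.values()), "samples")
```

Summed over domains and difficulties, this gives 720 train, 240 dev, and 240 test samples.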

Running benchmarks

Free — local models via Ollama

# Install Ollama from https://ollama.com, then:
ollama pull llama3.2

python -c "
from runners.multi_run import run
run(run_id='R06', split='dev')   # 60 samples, ~30-60 min on CPU
"

python report/generator.py       # auto-picks most recent run

Paid — OpenAI or Anthropic

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

python -c "
from runners.multi_run import run
run(run_id='R03', split='test')  # GPT-4o, 240 test samples
"

Comparing two models

python report/generator.py runs/llama3.2_20260301_104150 runs/gpt4o_20260301_120000

The report includes a statistical comparison section with 95% bootstrap CIs and Wilcoxon signed-rank p-values.
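For readers unfamiliar with bootstrap intervals, the following sketch shows a percentile bootstrap over paired per-sample score differences using only the standard library. It is a generic illustration, not the code in stats/bootstrap.py: the function name, the default of 2,000 resamples, and the percentile method are assumptions.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean of (b - a) over paired samples."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


if __name__ == "__main__":
    model_a = [0, 1, 0, 1, 1, 0, 1, 0]  # made-up per-sample exact-match flags
    model_b = [1, 1, 0, 1, 1, 1, 1, 0]
    print(paired_bootstrap_ci(model_a, model_b))
```

An interval that excludes 0 suggests the difference between the two models is unlikely to be resampling noise. Pairing matters here because both models are scored on the same samples.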

Adding a model

Add an entry to config/eval_config.json under supported_models:

"R06": {
    "name":            "llama3.2",
    "backend":         "ollama",
    "temperature":     0,
    "max_tokens":      1024,
    "api_key_env_var": null
}

Supported backends: openai, anthropic, ollama, llama_cpp.

For Groq (free tier, OpenAI-compatible):

"R07": {
    "name":            "llama-3.1-70b-versatile",
    "backend":         "openai",
    "base_url":        "https://api.groq.com/openai/v1",
    "api_key_env_var": "GROQ_API_KEY",
    "temperature":     0,
    "max_tokens":      1024
}
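The entries above only name an environment variable, so something has to turn an entry into a usable configuration at run time. The sketch below shows one way that could look. `resolve_model`, the choice of required keys, and the error behaviour are assumptions, not the adapter's actual logic; the backend names and entry fields are taken from the examples above.

```python
import os

# Assumption: these four keys are mandatory; the project's own rules may differ.
REQUIRED_KEYS = {"name", "backend", "temperature", "max_tokens"}
KNOWN_BACKENDS = {"openai", "anthropic", "ollama", "llama_cpp"}


def resolve_model(entry: dict, env=os.environ) -> dict:
    """Hypothetical helper: validate an entry and look up its API key."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if entry["backend"] not in KNOWN_BACKENDS:
        raise ValueError(f"unknown backend: {entry['backend']}")
    resolved = dict(entry)
    env_var = entry.get("api_key_env_var")
    if env_var is None:
        resolved["api_key"] = None  # local backends need no key
    elif env_var in env:
        resolved["api_key"] = env[env_var]
    else:
        raise KeyError(f"environment variable {env_var} is not set")
    return resolved


if __name__ == "__main__":
    local = {"name": "llama3.2", "backend": "ollama", "temperature": 0,
             "max_tokens": 1024, "api_key_env_var": None}
    print(resolve_model(local))
```

Keeping only the variable name in the config file means the key itself never has to be committed.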

Design decisions

Why synthetic data? Real OCR documents contain PII and are hard to license. Synthetic generation lets us control difficulty precisely, guarantee ground truth, and regenerate the dataset at any time.

Why exact match over F1? Field F1 measures whether the model returns the right field names — easy for any model that reads the prompt schema. Exact match measures whether the model correctly extracted the actual value from corrupted text, which is the hard part.
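A small worked example makes the gap visible. In the sketch below, a "model" that returns every expected key with an empty value gets a perfect field F1 and an exact match of zero. The record, its values, and both helper functions are made up for illustration and simplified to flat dictionaries.

```python
def field_f1(pred: dict, gold: dict) -> float:
    """F1 over field names only: harmonic mean of key precision and key recall."""
    pred_keys, gold_keys = set(pred), set(gold)
    overlap = len(pred_keys & gold_keys)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_keys)
    recall = overlap / len(gold_keys)
    return 2 * precision * recall / (precision + recall)


def exact_match(pred: dict, gold: dict) -> float:
    """Fraction of ground-truth fields whose value is reproduced exactly."""
    hits = sum(1 for key, value in gold.items() if key in pred and pred[key] == value)
    return hits / len(gold)


if __name__ == "__main__":
    gold = {"vendor": "Cafe Mocha", "total": "450.00", "date": "2026-03-01"}  # made-up record
    echo = {key: "" for key in gold}  # echoes the schema, extracts nothing
    print("field F1:", field_f1(echo, gold))        # 1.0
    print("exact match:", exact_match(echo, gold))  # 0.0
```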

Why no ML dependencies? Levenshtein distance, bootstrap CI, and Wilcoxon signed-rank are all implemented in pure Python. This keeps the evaluation pipeline auditable, dependency-free, and runnable anywhere.

Why Indian-context data? The pools (names, cities, insurers, hospital chains, UPI payments) reflect the document types the noise functions were designed around. The benchmark is domain-specific by design — it tests extraction quality, not world knowledge.

Seeded reproducibility: every sample is generated from base_seed + sample_index. Running sample_builder.py twice with the same seed produces identical datasets.
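The seeding scheme can be sketched with a private `random.Random` instance per sample, so that no sample depends on how many random draws came before it. The record fields and value ranges below are placeholders, and `generate_sample` is a hypothetical stand-in for the real generator.

```python
import random


def generate_sample(base_seed: int, sample_index: int) -> dict:
    """Build one toy record from its own generator seeded with base_seed + sample_index."""
    rng = random.Random(base_seed + sample_index)
    return {
        "sample_index": sample_index,
        "total": round(rng.uniform(50, 5000), 2),              # placeholder range
        "payment_method": rng.choice(["cash", "card", "UPI"]),  # placeholder pool
    }


def build_dataset(base_seed: int, size: int) -> list:
    return [generate_sample(base_seed, index) for index in range(size)]


if __name__ == "__main__":
    assert build_dataset(42, 5) == build_dataset(42, 5)
    print(build_dataset(42, 2))
```

One property of an additive scheme is worth knowing: nearby base seeds overlap, since seed 42 at index 1 uses the same generator state as seed 43 at index 0. Base seeds that are meant to give unrelated datasets should therefore be spaced further apart than the dataset size.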

Noise functions

Tier 1 — Character level

char_O0 char_l1I char_B8 char_S5 char_sub char_del char_transpose
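As a rough picture of what character-level noise does, the sketch below implements a look-alike substitution and an adjacent-character swap. The O→0 and l→1 pairs are stated in the difficulty table; B→8 and S→5 are inferred from the function names. The `rate` parameter and both function bodies are assumptions, not the code in dataset/noise_generator.py.

```python
import random

# Visually confusable pairs; B and S mappings are inferred from the names above.
CONFUSIONS = {"O": "0", "l": "1", "B": "8", "S": "5"}


def char_confuse(text: str, rng: random.Random, rate: float = 0.3) -> str:
    """Replace each confusable character with its look-alike with probability `rate`."""
    return "".join(
        CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )


def char_transpose(text: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


if __name__ == "__main__":
    rng = random.Random(0)
    print(char_confuse("Bill No: SOLO-100", rng, rate=1.0))
    print(char_transpose("Total", rng))
```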

Tier 2 — Structure level

missing_colon merged_lines split_line key_abbrev value_truncated delimiter_swap
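Structure-level noise changes layout rather than characters. The sketch below illustrates two of the listed functions on a "Key: value" text. That serialization format, and the exact behaviour of each function, are assumptions made for the example.

```python
import random


def missing_colon(line: str) -> str:
    """Drop the first colon, so the key and value are separated only by a space."""
    return line.replace(":", "", 1)


def merged_lines(text: str, rng: random.Random) -> str:
    """Join one randomly chosen line with the line that follows it."""
    lines = text.split("\n")
    if len(lines) < 2:
        return text
    i = rng.randrange(len(lines) - 1)
    lines[i:i + 2] = [lines[i] + " " + lines[i + 1]]
    return "\n".join(lines)


if __name__ == "__main__":
    clean = "Vendor: Cafe Mocha\nDate: 2026-03-01\nTotal: 450.00"  # made-up receipt
    print(merged_lines(clean, random.Random(1)))
    print(missing_colon("Total: 450.00"))
```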

Tier 3 — Numeric / date

num_comma_drop num_decimal_shift date_format_vary date_partial
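Numeric and date noise is harder for a model to undo because the corrupted value still looks valid. The sketch below shows plausible versions of three of these functions. The signatures, the one-place default shift, and the list of date formats are assumptions chosen for illustration.

```python
import random
from datetime import date

# Assumed output formats; numeric-only patterns avoid locale-dependent month names.
DATE_FORMATS = ["%d/%m/%Y", "%d-%m-%y", "%Y.%m.%d"]


def num_comma_drop(value: str) -> str:
    """Remove thousands separators: '1,234.50' becomes '1234.50'."""
    return value.replace(",", "")


def num_decimal_shift(value: str, offset: int = 1) -> str:
    """Move the decimal point `offset` digits to the right (negative moves left)."""
    if "." not in value:
        return value
    digits = value.replace(".", "")
    position = value.index(".") + offset
    if not 0 < position < len(digits):
        return value  # shifting would push the point off either end
    return digits[:position] + "." + digits[position:]


def date_format_vary(iso_date: str, rng: random.Random) -> str:
    """Re-render an ISO date (YYYY-MM-DD) in a randomly chosen format."""
    return date.fromisoformat(iso_date).strftime(rng.choice(DATE_FORMATS))


if __name__ == "__main__":
    print(num_comma_drop("1,234.50"))
    print(num_decimal_shift("1234.50"))
    print(date_format_vary("2026-03-01", random.Random(0)))
```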

Tier 4 — Semantic traps (extreme only)

extraneous_field ghost_value conflicting_field ambiguous_key value_swap section_erase
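Semantic traps leave every character intact and instead make the document say something false. As one illustration, the sketch below swaps the values of two "Key: value" lines, which is roughly what a function named `value_swap` suggests. The line format and the implementation are assumptions.

```python
import random


def value_swap(text: str, rng: random.Random) -> str:
    """Exchange the values of two randomly chosen 'Key: value' lines."""
    lines = text.split("\n")
    keyed = [i for i, line in enumerate(lines) if ": " in line]
    if len(keyed) < 2:
        return text
    i, j = rng.sample(keyed, 2)
    key_i, value_i = lines[i].split(": ", 1)
    key_j, value_j = lines[j].split(": ", 1)
    lines[i] = f"{key_i}: {value_j}"
    lines[j] = f"{key_j}: {value_i}"
    return "\n".join(lines)


if __name__ == "__main__":
    print(value_swap("Total: 450.00\nTax: 22.50", random.Random(0)))
```

A swap between total and tax cannot be detected from spelling alone. The model has to notice that the numbers no longer make sense together.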

Citation

If you use LORE in your work, please cite:

@misc{lore2026,
  title   = {LORE: LLM OCR Robustness Evaluation},
  year    = {2026},
  url     = {https://github.com/ashwin549/lore-benchmark}
}

License

MIT
