Benchmark NuExtract3 + Unlimited-OCR on olmOCR-bench old_scans by davanstrien · Pull Request #26 · davanstrien/ocr-bench

davanstrien · 2026-06-27T17:32:19Z

What

Extends the old_scans experiment to NuExtract3 (4.5B) and Unlimited-OCR
(3.3B), scored through the same harness — for my own benchmarking. Adds
unlimited_ocr.py, nuextract3.py, and BENCHMARKING.md.

Results

Model	params	old_scans	present	absent	order	baseline
PaddleOCR-VL v1.6	0.9B	38.6	31.2	95.7	27.7	84.7
PaddleOCR-VL v1	0.9B	38.2	32.3	95.7	24.9	88.8
NuExtract3	4.5B	37.8	41.6	41.4	30.5	100.0
Unlimited-OCR	3.3B	30.6	29.0	50.0	25.4	89.8

The caveat (front and center in BENCHMARKING.md)

The single old_scans number conflates transcription (present) and
boilerplate exclusion (absent). NuExtract3 is the best transcriber
(present 41.6 vs 31.2 for PaddleOCR-VL v1.6) and never hallucinates CJK
(baseline 100.0). It "loses" on old_scans only because its markdown mode also
transcribes letterheads/stamps, which pulls absent down to 41.4 against 95.7
for both PaddleOCR-VL versions (verified: the boilerplate comes out as plain
body text, not strippable <figure>/HTML). So the gap is an architecture
tradeoff, not a read-quality deficit: paddle wins through boilerplate
exclusion (its layout pipeline), not through better reading.
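
To make the conflation concrete, here is a minimal sketch of how one pooled pass rate can hide opposite per-category profiles. It assumes the headline number is a pass rate pooled over all tests, which may not match how the harness actually weights categories. The counts are invented for illustration and are not benchmark results; `pass_rates` and `expand` are hypothetical helpers, not part of this repo.

```python
from collections import defaultdict


def pass_rates(results):
    """Return (overall %, {test_type: %}) from (test_type, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for test_type, passed in results:
        totals[test_type] += 1
        passes[test_type] += int(passed)
    per_type = {t: 100.0 * passes[t] / totals[t] for t in totals}
    overall = 100.0 * sum(passes.values()) / sum(totals.values())
    return overall, per_type


def expand(counts):
    """Turn {test_type: (n_passed, n_total)} into a flat list of results."""
    return [
        (test_type, i < n_passed)
        for test_type, (n_passed, n_total) in counts.items()
        for i in range(n_total)
    ]


# Invented counts (NOT from the benchmark): two opposite profiles.
layout_model = {"present": (3, 10), "absent": (9, 10)}      # drops boilerplate
transcriber_model = {"present": (8, 10), "absent": (4, 10)}  # reads everything

for name, counts in [("layout", layout_model), ("transcriber", transcriber_model)]:
    overall, per_type = pass_rates(expand(counts))
    print(name, overall, per_type)
```

Both invented models land on the same pooled score even though one reads far more of the page, which is why the per-category columns are reported alongside old_scans.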

Notes

Each model at its recommended DPI (NuExtract3 170 / Unlimited-OCR 300), footnoted.
Unlimited-OCR's <|det|> grounding stripped; NuExtract3 non-thinking + greedy.
Not size-matched (3.3–4.5B vs 0.9B).
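
A rough sketch of the two preprocessing/postprocessing steps these notes mention, under stated assumptions. The DPI arithmetic is standard (PDF user space is 72 points per inch). The grounding format is an assumption: the regex treats grounding as `<|det|>...<|/det|>` spans, and the real Unlimited-OCR output and the actual code in unlimited_ocr.py may differ. `render_size` and `strip_grounding` are hypothetical names.

```python
import re

PDF_POINTS_PER_INCH = 72  # PDF user-space units per inch


def render_size(width_pt, height_pt, dpi):
    """Pixel size of a page rendered at `dpi`, given its size in PDF points."""
    return (
        round(width_pt * dpi / PDF_POINTS_PER_INCH),
        round(height_pt * dpi / PDF_POINTS_PER_INCH),
    )


# ASSUMPTION: grounding appears as <|det|>...<|/det|> spans in the output.
DET_SPAN = re.compile(r"<\|det\|>.*?<\|/det\|>", re.DOTALL)


def strip_grounding(text):
    """Remove assumed <|det|>...<|/det|> spans, keeping the transcribed text."""
    return DET_SPAN.sub("", text)


# US Letter page (612 x 792 pt) at the two DPIs used in this PR.
print(render_size(612, 792, 170))
print(render_size(612, 792, 300))
print(strip_grounding("Dear Sir<|det|>[[10, 20, 300, 40]]<|/det|>\nYours"))
```

The same page is roughly 1.8x larger per side at 300 DPI than at 170, so the two models see quite different inputs.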

🤖 Generated with Claude Code

old_scans: paddle v1.6 38.6 / v1 38.2 / NuExtract3 37.8 / Unlimited-OCR 30.6.

BENCHMARKING.md leads with the caveat that the single old_scans number conflates transcription (present) and boilerplate exclusion (absent): NuExtract3 is the BEST transcriber (present 41.6 vs paddle 31.2) + never hallucinates CJK (baseline 100%), but scores low on absent because markdown-mode transcribes letterheads/stamps as plain text (verified: not strippable <figure>/HTML). Architecture tradeoff, not a read-quality deficit.

Self-contained uv scripts: render PDF->PNG at each model's DPI, greedy, write the bucket mount directly; NuExtract non-thinking; Unlimited <|det|> grounding stripped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The single old_scans number is fitness for olmOCR's goal (clean reading-order text for LLM training); reasonable for that, wrong yardstick for faithful/archival OCR where the boilerplate is the record.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

davanstrien and others added 2 commits June 27, 2026 18:31

davanstrien merged commit 99f7550 into main Jun 29, 2026
1 check passed

davanstrien deleted the add-multimodel-oldscans-bench branch June 29, 2026 13:39

davanstrien merged 2 commits into main from add-multimodel-oldscans-bench
