Benchmark NuExtract3 + Unlimited-OCR on olmOCR-bench old_scans #26
Merged
old_scans: paddle v1.6 38.6 / v1 38.2 / NuExtract3 37.8 / Unlimited-OCR 30.6.
BENCHMARKING.md leads with the caveat that the single old_scans number conflates transcription (present) and boilerplate exclusion (absent): NuExtract3 is the best transcriber (present 41.6 vs paddle 31.2) and never hallucinates CJK (baseline 100%), but scores low on absent because markdown-mode transcribes letterheads/stamps as plain text (verified: not strippable <figure>/HTML). Architecture tradeoff, not a read-quality deficit.
Self-contained uv scripts: render PDF->PNG at each model's DPI, greedy decoding, write to the bucket mount directly; NuExtract non-thinking; Unlimited <|det|> grounding stripped.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The single old_scans number measures fitness for olmOCR's goal (clean reading-order text for LLM training). It is reasonable for that goal, but the wrong yardstick for faithful/archival OCR, where the boilerplate is the record.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Extends the old_scans experiment to NuExtract3 (4.5B) and Unlimited-OCR (3.3B), scored through the same harness, for my own benchmarking. Adds unlimited_ocr.py, nuextract3.py, and BENCHMARKING.md.
Results
The caveat (front and center in BENCHMARKING.md)
The single old_scans number conflates transcription (present) and boilerplate exclusion (absent). NuExtract3 is the best transcriber (present 41.6 >> paddle 31.2) and never hallucinates CJK (baseline 100%); it "loses" on old_scans only because markdown-mode transcribes letterheads/stamps (verified: plain body text, not strippable <figure>/HTML). So it's an architecture tradeoff, not a read-quality deficit; paddle wins via boilerplate exclusion (its layout pipeline), not better reading.
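To make the conflation concrete, here is a minimal sketch of how a pooled pass rate can rank a stronger transcriber below a model that is better at excluding boilerplate. Everything in it is an assumption for illustration: the (test_type, passed) record shape, the helper names, and the toy counts are hypothetical. They are not the harness's real output format or aggregation rule, and they are not benchmark results.

```python
from collections import defaultdict


def pass_rates_by_type(results):
    """Per-test-type pass rate (percent) from (test_type, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for test_type, ok in results:
        totals[test_type] += 1
        passes[test_type] += int(ok)
    return {t: 100.0 * passes[t] / totals[t] for t in totals}


def pooled_pass_rate(results):
    """Single pooled pass rate (percent) over all tests, ignoring type."""
    results = list(results)
    return 100.0 * sum(int(ok) for _, ok in results) / len(results)


def toy(present_passes, absent_passes, n=10):
    """Hypothetical records: n 'present' and n 'absent' tests."""
    return ([("present", i < present_passes) for i in range(n)]
            + [("absent", i < absent_passes) for i in range(n)])


# Toy numbers only: one model reads better, the other filters boilerplate better.
strong_reader = toy(present_passes=4, absent_passes=2)
strong_filter = toy(present_passes=3, absent_passes=5)

for name, results in [("strong_reader", strong_reader),
                      ("strong_filter", strong_filter)]:
    print(name, pooled_pass_rate(results), pass_rates_by_type(results))
```

With these toy counts the pooled number favors strong_filter even though strong_reader passes more present tests, which is why the per-type breakdown is worth reporting next to the single score.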
Notes
Unlimited <|det|> grounding stripped; NuExtract non-thinking + greedy.
🤖 Generated with Claude Code