Conversation

@samiuc samiuc commented Mar 4, 2025

TODO:

  • Refactor and format the code as per the existing repo structure.
  • Create Pydantic Models for Hyperscaler output -> DoclingDocument.
  • Fix bugs in the code, e.g. Google evals currently return a CER of 1.
  • Add instructions for running the code locally or via CLI.
  • Add documentation for setting up the environment variables for Hyperscalers.
  • Add Docling OCR document support.
  • Subset of documents - what is the total number of documents?
  • Update the dataset card and use HF datasets to load the dataset in the create method automatically.
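For reference, the CER mentioned above is the character error rate: Levenshtein edit distance between the predicted and reference text, divided by the reference length. A CER of 1 usually indicates the prediction shares essentially no overlap with the reference (e.g. an empty or wrongly-paired prediction). A minimal, dependency-free sketch (not the repo's actual evaluator):

```python
# Minimal CER sketch; the actual docling-eval implementation may differ.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, prediction: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not prediction else 1.0
    return levenshtein(reference, prediction) / len(reference)
```

An empty prediction against any non-empty reference yields `cer == 1.0`, which is one plausible cause of the Google-evals bug noted above.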

Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>
cau-git commented Mar 10, 2025

@samiuc Thanks for this contribution, I can see a lot of useful tooling in here that we can certainly adopt into docling-eval.

That said, there are a few misalignments that need to be addressed. I can help if you need more information.

  1. Work on the docling-eval codebase that brings in new datasets or new providers should be based on the branch of PR #30 (feat: Establish new API encapsulation for dataset creation and prediction providers), as discussed with @praveenmidde. It introduces a new abstraction for the whole dataset-building and prediction API, which will make some code in this PR obsolete. We no longer accept contributions in the shape of benchmarks/create.py scripts using the current approach with bare functions.
  2. I see that the current code exports shards to a JSONL format, and the evaluator reads the JSONL back. We must stick to the already established parquet format, which is a built-in feature of the new API, so no serialization code is required on your end.
  3. For the specific case of OCR evaluation, it is not desirable to build up full DoclingDocument instances, since that data model does not carry the detailed information an OCR engine typically provides (i.e. word-level tokens, bounding boxes, etc.). As discussed with @praveenmidde, we plan an extension to docling and docling-core to add another data model for the specific case of representing OCR pages, which this PR will need to adopt when available. We can, however, stick with the current approach until that is available.

samiullahchattha and others added 14 commits March 10, 2025 09:27
…36)

* chore: Rename `docling/` dir as converters. Introduce `visualization/` dir.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Remove unused imports and other code formatting

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Remove the `utils/` dir, delete unused files and move used code in appropriate locations

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Introduce the file visualisation/visualisations.py and move there functions from benchmarks/utils.py

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update MyPy configuration in toml to override tqdm module

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Clean up commented code

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Add CONVERTER_TYPE and MODALITIES columns to all produced datasets

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update pinning of docling

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Code refactoring:
- Move converters/teds.py into evaluators/teds.py
- Move all functions from converters/utils.py into benchmarks/utils.py.
- Rename create_xxx_converter() functions.
- Rename BenchMarkColumns.DOCLING_VERSION as BenchMarkColumns.CONVERTER_VERSION

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>
…ew settings in SmolDocling API. Improve the documentation. (#37)

* chore: Change the pinning of docling

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Fix the modalities supported for DPBench, OmniDocBench, DLNv1. Clean up code.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* docs: Update documentation to have all benchmarks in separate md files and place links in Readme.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Change the initialization of the create_smol_docling_converter() to allow flash-attn

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* docs: List benchmarks in the main readme with short description. Fix broken links in the documentation.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* docs: Fix broken link in Readme.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update lock file

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Add debug code to dump the predicted text in create_dlnv1_e2e_dataset()

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update toml to pin docling with branch and extras

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Disable the generation of VLM text debugging files for DLNv1 benchmark

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update toml to docling v2.25.0 with vln extra

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
… for OCR evaluation

Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>
samiuc commented Mar 18, 2025

Closing this PR in favor of #46.

samiuc closed this Mar 18, 2025