Hyperscalers OCR Evaluation #46

samiuc · 2025-03-18T01:13:22Z

Implements the new API encapsulation layer for dataset creation and prediction providers, based on this PR.
The evaluation dataset is a subset of the original pixparse-idl-wds.
Note: the versions of torch and docling are temporarily pinned (I was facing issues installing torch 2.5.1 locally). The pins will be reverted prior to merge.

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>

cau-git · 2025-03-20T19:19:13Z

@samiuc Thanks for this update. Without checking deeply, I can see that some of the comments on the previous PR are now addressed.

To make sure this will integrate cleanly, I set the base branch to cau/new-class-design, as it should be. However, after doing so, the diff shows that you took units from this branch and re-defined them (potentially with changes) in the main source tree docling_eval under different module paths. I must therefore ask you to ensure that all the work in this PR builds cleanly on top of cau/new-class-design, with the goal that it merges back into that branch. As it is, it won't. Is there any particular reason why you would touch the DatasetRecord class, or put your dataset builders somewhere other than where cau/new-class-design suggests?

cau-git

@samiuc I went through the changes and have a few remarks below.
The main point that needs to be addressed is that there must not be a dataset-specific PredictionProvider for pixparse. The API design decouples prediction providers from dataset builders; both must work independently of one another. Can you please decompose the code accordingly?

Additionally, I would like you to check how much your implementation for Azure DI overlaps with the work of @praveenmidde on this PR, which also implements a PredictionProvider for Azure within the table dataset evaluation. Eventually, we should have only one prediction provider for Azure DI.
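
To illustrate the decoupling requested above, here is a minimal sketch. It is not the actual docling-eval API: every name in it (`Record`, `DatasetBuilder`, `PredictionProvider`, `ToyDatasetBuilder`, `LowercaseProvider`, `run_predictions`) is hypothetical and simplified, and the "OCR" step is a stand-in that only decodes bytes. The point is that the builder never refers to a provider and the provider never refers to a specific dataset; they meet only through a shared record type.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Protocol


@dataclass
class Record:
    """Hypothetical, simplified record; not the real DatasetRecord class."""
    doc_id: str
    image_bytes: bytes


class DatasetBuilder(Protocol):
    """Anything that can yield records, regardless of the source dataset."""
    def iterate(self) -> Iterator[Record]: ...


class PredictionProvider(Protocol):
    """Anything that can turn a record into predicted text."""
    def predict(self, record: Record) -> str: ...


class ToyDatasetBuilder:
    """Toy builder: produces records and knows nothing about any provider."""

    def __init__(self, samples: Dict[str, str]) -> None:
        self._samples = samples

    def iterate(self) -> Iterator[Record]:
        for doc_id, text in self._samples.items():
            # Assumption for the sketch: the "image" is just the encoded text.
            yield Record(doc_id=doc_id, image_bytes=text.encode("utf-8"))


class LowercaseProvider:
    """Toy provider: a real one would call an OCR service at this point."""

    def predict(self, record: Record) -> str:
        return record.image_bytes.decode("utf-8").lower()


def run_predictions(
    builder: DatasetBuilder, provider: PredictionProvider
) -> Dict[str, str]:
    """Combine any builder with any provider; neither depends on the other."""
    return {rec.doc_id: provider.predict(rec) for rec in builder.iterate()}


predictions = run_predictions(
    ToyDatasetBuilder({"doc-1": "Hello OCR"}), LowercaseProvider()
)
```

With this shape, swapping the dataset or the provider is a change at the call site only, which is the property a dataset-specific provider would break.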

docling_eval_next/prediction_providers/hyperscalers.py

docs/examples/run_ocr_pixparse_builder_example.py

docling_eval_next/prediction_providers/hyperscalers.py

docling_eval_next/utils/hyperscalers/utils.py

cau-git · 2025-04-08T12:13:59Z

@samiuc I decomposed this PR into three new ones with updates. Let's please continue the work there, then close this PR.

cau-git · 2025-04-17T10:35:20Z

Closing this since it is superseded by the newer PRs.

samiullahchattha added 2 commits March 17, 2025 17:30

feat: Add OCR evaluation support

7550afe

update poetry.lock

848c47a

samiuc mentioned this pull request Mar 18, 2025

Hyperscalers OCR Evaluation #40

Closed

9 tasks

samiullahchattha and others added 3 commits March 18, 2025 11:24

fix: add types-protobuf dependency

5026f72

Merge branch 'main' into integrate-ocr-benchmarks

71030b5

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>

clip package versions and remove unused files

d49c9b3

samiuc requested a review from cau-git March 18, 2025 23:40

samiuc marked this pull request as ready for review March 18, 2025 23:40

samiuc requested a review from PeterStaar-IBM March 18, 2025 23:40

feat: add CustomHyperscaler class and missing dependency

100886f

cau-git changed the base branch from main to cau/new-class-design March 20, 2025 19:12

Merge branch 'cau/new-class-design' into integrate-ocr-benchmarks

aa0d808

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>

cau-git mentioned this pull request Mar 20, 2025

feat: Establish new API encapsulation for dataset creation and prediction providers #30

Merged

26 tasks

samiullahchattha added 2 commits March 20, 2025 15:21

refactor code as per new branch design

62e3013

Merge branch 'integrate-ocr-benchmarks' of https://github.com/docling-project/docling-eval into integrate-ocr-benchmarks

31b4944

cau-git reviewed Mar 21, 2025

samiuc and others added 2 commits March 21, 2025 16:11

Merge branch 'cau/new-class-design' into integrate-ocr-benchmarks

c311518

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>

address review comments

c812375

samiuc requested a review from cau-git March 25, 2025 19:31

This was referenced Apr 8, 2025

feat: PixParse OCR dataset builder #61

Merged

feat: AWS Textract and Google DocAI Prediction providers #62

Merged

feat: OCR evaluator #63

Merged

cau-git marked this pull request as draft April 8, 2025 13:33

cau-git closed this Apr 17, 2025
