Skip to content

Conversation

@samiuc
Copy link
Contributor

@samiuc samiuc commented Mar 18, 2025

  • Implements new API encapsulation layer for dataset creation and prediction providers based on this PR
  • Evaluation dataset is a subset of original pixparse-idl-wds
  • Note: versions of torch and docling are temporarily pinned. (I was facing issues installing 2.5.1 torch version locally) - will be reverted prior to merge.

@samiuc samiuc mentioned this pull request Mar 18, 2025
9 tasks
@samiuc samiuc requested a review from cau-git March 18, 2025 23:40
@samiuc samiuc marked this pull request as ready for review March 18, 2025 23:40
@samiuc samiuc requested a review from PeterStaar-IBM March 18, 2025 23:40
@cau-git cau-git changed the base branch from main to cau/new-class-design March 20, 2025 19:12
@cau-git
Copy link
Contributor

cau-git commented Mar 20, 2025

@samiuc Thanks for this update. I recognize some of the comments on the previous PR are now addressed, without checking deeply.

To make sure this will integrate cleanly, I set the base branch to cau/new-class-design as it should be. However after doing so, the diff shows that you took units from this branch and re-defined them (potentially with changes) in the main source tree docling_eval with different module paths. I must therefore ask you to ensure that all the work in this PR is cleanly building on top of cau/new-class-design, with the goal that it merges back into that branch. As it is, it won't. Are there any particular reasons why you would touch the DatasetRecord class or put your dataset builders somewhere else than where cau/new-class-design suggests?

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
Copy link
Contributor

@cau-git cau-git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samiuc I went through the changes, and have a few remarks below.
The main point I see needs to be addressed is that there must not be a dataset-specific PredictionProvider for pixparse. The API design decouples prediction providers and dataset builders, they must both work independently from one another. Can you please decompose the code for that?

Additionally, I would like you to check what is the overlap of your implementation for Azure DI with the work of @praveenmidde on this PR, which also implements a PredictionProvider for Azure within the table dataset evaluation. Eventually, we should have only one prediction provider for Azure DI.

@cau-git
Copy link
Contributor

cau-git commented Apr 8, 2025

@samiuc I decomposed this PR into three new ones with updates. Let's please continue the work there, then close this PR.

@cau-git cau-git marked this pull request as draft April 8, 2025 13:33
@cau-git
Copy link
Contributor

cau-git commented Apr 17, 2025

Closing this since it is superseded by the newer PRs.

@cau-git cau-git closed this Apr 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants