chore: Implementation cleanup and fixes for new class design #52
Merged
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
…into cau/implementation-refactor
cau-git added a commit that referenced this pull request on Apr 1, 2025:
…tion providers (#30)

* correct mpy Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatting Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* adding the script to make an initial dataset from pdf's Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* before switching to specific docling-core branch Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* rebased on kv-items and updated the create script in CVAT Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the cvat Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the annotation description on CVAT Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the annotation description on CVAT (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the annotation description on CVAT (3) Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* [WIP] Crafting new dataset builder and prediction provider API Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Restructure to docling_eval_next Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix mypy Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix f-strings Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Changes for prediction_provider interface, to support all cases. Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add omnidocbench DatasetBuilder Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add doclaynet v1, funsd Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add XFUND, more fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* update the kv cell creation to prevent false positives Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* chore: Fixing imports Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Update docling-core version Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: Introduce new design for Evaluators based on BaseEvaluator that accept external predictions. And utility adapters. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Factor PredictionProvider out of dataset builder, many fixes on DatasetRecord Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Sketch example for file-directory prediction provider Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore: Fix typing hints Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Update poetry to doclign-core 2.24.0 Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: WIP: Introduce the FilePredictionProvider that reads files with predictions from the disk
  - It currently supports doctags, markdown, json, yaml formats.
  - We still need to improve the returned type so that it allows for no DoclingDocument but only for the source data (e.g. in case of markdown).
  Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Add DocLayNetV2DatasetBuilder Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Added TableDatasetBuilder and test, update TableFormerPredictionProvider Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore: Update MyPy configuration in toml Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: Refactor the BasePredictionProvider.predict() to return DatasetRecordWithPrediction Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Fix the FilePredictionProvider. Return None in the predicted document in case of Markdown. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Remove the kwargs from all PredictonProvider classes and introduce provider specific initialization arguments Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: Introduce the parameter "ignore_missing_files" in FilePredictionProvider Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Add do_visualization to PredictionProvider Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move next-gen API to main source tree, re-organize module paths Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup, change path handling Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup, change path handling Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* More module removal and renaming Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Small test fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Add the "prediction_format" in the serialization of DatasetRecordWithPrediction Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: Refactor the MarkdownTextEvaluator to support the new classes design. Add unit test. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Improve the new design of MarkdownEvaluator to move common functionalities into the base class Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: Refactor the LayoutEvaluator to use the new class design. Add unit test. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Clean up LayoutEvaluator code Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Implementation cleanup and fixes for new class design (#52)
  * More module removal and renaming Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
  * Small test fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
  * Small test fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
  * Cleanup of tests and more fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
  ---------
  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add visualization for tables Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add visualization for all tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for test files, FilePredictionProvider changes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Put new CLI Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Rename CLI Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update all README with new commands. Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove old examples Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Several Fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* README updates Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add gt_dir arg to create-eval, README fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes, pass tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat: Refactor the TableEvaluator to use the new class design. Move common evaluator code to BaseEvaluator. Add more unit tests. Introduce pytest dependencies. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Update lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Make pytest CI output more verbose Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat: Refactor the ReadingOrderEvaluator to use the new class design. Remove the BaseReadingOrderEvaluator. Add unit test. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Optimize GT downloading behaviour Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add file sources Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Allow pytest output on CI Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Disable tests in CI Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Reenable tests in CI Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add correct @pytest.mark.dependency() Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat: Introduce TypeVars for the UnitEvaluation and DatasetEvaluation used by the BaseEvaluator. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Minimize tests in CI Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat: Refactor BboxTestEvaluator to use the new design. Introduce unit test. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Remove streaming in DocLaynet v1 Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add back test dependency Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Saidgurbuz <said.gurbuz@epfl.ch>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
cau-git added a commit that referenced this pull request on Apr 1, 2025:
* correct mpy Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* reformatting Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* adding the script to make an initial dataset from pdf's Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* before switching to specific docling-core branch Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* rebased on kv-items and updated the create script in CVAT Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* fixed the cvat Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the annotation description on CVAT Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the annotation description on CVAT (2) Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* added the annotation description on CVAT (3) Signed-off-by: Peter Staar <taa@zurich.ibm.com>
* [WIP] Crafting new dataset builder and prediction provider API Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Restructure to docling_eval_next Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix mypy Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fix f-strings Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Changes for prediction_provider interface, to support all cases. Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add omnidocbench DatasetBuilder Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add doclaynet v1, funsd Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add XFUND, more fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* update the kv cell creation to prevent false positives Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
* chore: Fixing imports Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Update docling-core version Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: Introduce new design for Evaluators based on BaseEvaluator that accept external predictions. And utility adapters. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Factor PredictionProvider out of dataset builder, many fixes on DatasetRecord Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Sketch example for file-directory prediction provider Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore: Fix typing hints Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Update poetry to doclign-core 2.24.0 Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: WIP: Introduce the FilePredictionProvider that reads files with predictions from the disk
  - It currently supports doctags, markdown, json, yaml formats.
  - We still need to improve the returned type so that it allows for no DoclingDocument but only for the source data (e.g. in case of markdown).
  Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Add DocLayNetV2DatasetBuilder Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Added TableDatasetBuilder and test, update TableFormerPredictionProvider Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* chore: Update MyPy configuration in toml Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: Refactor the BasePredictionProvider.predict() to return DatasetRecordWithPrediction Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Fix the FilePredictionProvider. Return None in the predicted document in case of Markdown. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Remove the kwargs from all PredictonProvider classes and introduce provider specific initialization arguments Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: Introduce the parameter "ignore_missing_files" in FilePredictionProvider Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Add do_visualization to PredictionProvider Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Move next-gen API to main source tree, re-organize module paths Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup, change path handling Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup, change path handling Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* More module removal and renaming Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Small test fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* fix: Add the "prediction_format" in the serialization of DatasetRecordWithPrediction Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: Refactor the MarkdownTextEvaluator to support the new classes design. Add unit test. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Improve the new design of MarkdownEvaluator to move common functionalities into the base class Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* feat: Refactor the LayoutEvaluator to use the new class design. Add unit test. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* fix: Clean up LayoutEvaluator code Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* chore: Implementation cleanup and fixes for new class design (#52)
  * More module removal and renaming Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
  * Small test fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
  * Small test fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
  * Cleanup of tests and more fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
  ---------
  Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add visualization for tables Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add visualization for all tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes for test files, FilePredictionProvider changes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Put new CLI Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Rename CLI Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update all README with new commands. Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove old examples Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Several Fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* README updates Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add gt_dir arg to create-eval, README fixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Fixes, pass tests Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat: Refactor the TableEvaluator to use the new class design. Move common evaluator code to BaseEvaluator. Add more unit tests. Introduce pytest dependencies. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Update lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Update lockfile Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Make pytest CI output more verbose Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat: Refactor the ReadingOrderEvaluator to use the new class design. Remove the BaseReadingOrderEvaluator. Add unit test. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Optimize GT downloading behaviour Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add file sources Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Allow pytest output on CI Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Disable tests in CI Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Reenable tests in CI Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add correct @pytest.mark.dependency() Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat: Introduce TypeVars for the UnitEvaluation and DatasetEvaluation used by the BaseEvaluator. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Minimize tests in CI Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* feat: Refactor BboxTestEvaluator to use the new design. Introduce unit test. Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
* Remove streaming in DocLaynet v1 Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add back test dependency Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add DocVQA dataset builder Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Bugfixes Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Remove prints Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Cleanup Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
* Add DocVQA to CLI Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Saidgurbuz <said.gurbuz@epfl.ch>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
No description provided.