@cau-git
Contributor

cau-git commented Mar 27, 2025

No description provided.

cau-git added 5 commits March 26, 2025 20:01
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
cau-git changed the title from "Cau/implementation refactor" to "chore: Implementation cleanup and fixes for new class design" on Mar 28, 2025
cau-git merged commit 8243a26 into cau/new-class-design on Mar 28, 2025
2 of 8 checks passed
cau-git added a commit that referenced this pull request Apr 1, 2025
…tion providers (#30)

* correct mpy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatting

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* adding the script to make an initial dataset from pdf's

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* before switching to specific docling-core branch

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* rebased on kv-items and updated the create script in CVAT

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the cvat

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the annotation description on CVAT

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the annotation description on CVAT (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the annotation description on CVAT (3)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* [WIP] Crafting new dataset builder and prediction provider API

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Restructure to docling_eval_next

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix mypy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix f-strings

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Changes for prediction_provider interface, to support all cases.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add omnidocbench DatasetBuilder

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add doclaynet v1, funsd

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add XFUND, more fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update the kv cell creation to prevent false positives

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* chore: Fixing imports

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update docling-core version

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Introduce new design for Evaluators based on BaseEvaluator that accept external predictions.
And utility adapters.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Factor PredictionProvider out of dataset builder, many fixes on DatasetRecord

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Sketch example for file-directory prediction provider

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: Fix typing hints

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update poetry to docling-core 2.24.0

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: WIP: Introduce the FilePredictionProvider that reads files with predictions from the disk
- It currently supports doctags, markdown, json, yaml formats.
- We still need to improve the returned type so that it allows for no DoclingDocument but only for
  the source data (e.g. in case of markdown).

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Add DocLayNetV2DatasetBuilder

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Added TableDatasetBuilder and test, update TableFormerPredictionProvider

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: Update MyPy configuration in toml

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Refactor the BasePredictionProvider.predict() to return DatasetRecordWithPrediction

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Fix the FilePredictionProvider. Return None in the predicted document in case of Markdown.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Remove the kwargs from all PredictionProvider classes and introduce provider-specific
initialization arguments

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Introduce the parameter "ignore_missing_files" in FilePredictionProvider

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Add do_visualization to PredictionProvider

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move next-gen API to main source tree, re-organize module paths

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup, change path handling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup, change path handling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* More module removal and renaming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small test fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Add the "prediction_format" in the serialization of DatasetRecordWithPrediction

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Refactor the MarkdownTextEvaluator to support the new classes design. Add unit test.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Improve the new design of MarkdownEvaluator to move common functionalities into the base class

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Refactor the LayoutEvaluator to use the new class design. Add unit test.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Clean up LayoutEvaluator code

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Implementation cleanup and fixes for new class design (#52)

* More module removal and renaming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small test fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small test fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup of tests and more fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add visualization for tables

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add visualization for all tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for test files, FilePredictionProvider changes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Put new CLI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Rename CLI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update all README with new commands.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove old examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Several Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* README updates

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add gt_dir arg to create-eval, README fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes, pass tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Refactor the TableEvaluator to use the new class design.
Move common evaluator code to BaseEvaluator.
Add more unit tests. Introduce pytest dependencies.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Update lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make pytest CI output more verbose

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Refactor the ReadingOrderEvaluator to use the new class design.
Remove the BaseReadingOrderEvaluator. Add unit test.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Optimize GT downloading behaviour

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add file sources

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Allow pytest output on CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Disable tests in CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reenable tests in CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add correct @pytest.mark.dependency()

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Introduce TypeVars for the UnitEvaluation and DatasetEvaluation used by the BaseEvaluator.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Minimize tests in CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Refactor BboxTestEvaluator to use the new design. Introduce unit test.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Remove streaming in DocLaynet v1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add back test dependency

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Saidgurbuz <said.gurbuz@epfl.ch>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
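
For readers unfamiliar with the "Introduce TypeVars for the UnitEvaluation and DatasetEvaluation used by the BaseEvaluator" item in the commit message above, the following is a minimal sketch of how a base evaluator could be made generic over its per-unit and per-dataset result types. Every name in it (`BaseEvaluatorSketch`, `evaluate_unit`, `aggregate`, `ExactMatchEvaluator`) is a hypothetical illustration and is not taken from the docling-eval code base.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, List, Tuple, TypeVar

# Type variables for the two result types an evaluator produces (names assumed).
UnitEvaluationT = TypeVar("UnitEvaluationT")
DatasetEvaluationT = TypeVar("DatasetEvaluationT")


class BaseEvaluatorSketch(ABC, Generic[UnitEvaluationT, DatasetEvaluationT]):
    """Hypothetical base class: the shared loop lives here, subclasses supply the metric."""

    @abstractmethod
    def evaluate_unit(self, truth: str, prediction: str) -> UnitEvaluationT:
        """Score one ground-truth/prediction pair."""

    @abstractmethod
    def aggregate(self, units: List[UnitEvaluationT]) -> DatasetEvaluationT:
        """Combine per-unit scores into a dataset-level result."""

    def __call__(self, pairs: List[Tuple[str, str]]) -> DatasetEvaluationT:
        units = [self.evaluate_unit(truth, pred) for truth, pred in pairs]
        return self.aggregate(units)


@dataclass
class ExactMatch:
    matched: bool


@dataclass
class ExactMatchSummary:
    accuracy: float


class ExactMatchEvaluator(BaseEvaluatorSketch[ExactMatch, ExactMatchSummary]):
    """Toy evaluator used only to show how the type parameters are bound."""

    def evaluate_unit(self, truth: str, prediction: str) -> ExactMatch:
        return ExactMatch(matched=(truth == prediction))

    def aggregate(self, units: List[ExactMatch]) -> ExactMatchSummary:
        if not units:
            return ExactMatchSummary(accuracy=0.0)
        hits = sum(1 for unit in units if unit.matched)
        return ExactMatchSummary(accuracy=hits / len(units))
```

With this shape, a type checker such as mypy can infer that `ExactMatchEvaluator()(pairs)` returns an `ExactMatchSummary`, which is presumably the kind of benefit the TypeVar change was after.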
cau-git added a commit that referenced this pull request Apr 1, 2025
* correct mpy

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* reformatting

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* adding the script to make an initial dataset from pdf's

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* before switching to specific docling-core branch

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* rebased on kv-items and updated the create script in CVAT

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* fixed the cvat

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the annotation description on CVAT

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the annotation description on CVAT (2)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* added the annotation description on CVAT (3)

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

* [WIP] Crafting new dataset builder and prediction provider API

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Restructure to docling_eval_next

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix mypy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fix f-strings

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Changes for prediction_provider interface, to support all cases.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add omnidocbench DatasetBuilder

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add doclaynet v1, funsd

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add XFUND, more fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* update the kv cell creation to prevent false positives

Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>

* chore: Fixing imports

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update docling-core version

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Introduce new design for Evaluators based on BaseEvaluator that accept external predictions.
And utility adapters.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Factor PredictionProvider out of dataset builder, many fixes on DatasetRecord

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Sketch example for file-directory prediction provider

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: Fix typing hints

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update poetry to docling-core 2.24.0

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: WIP: Introduce the FilePredictionProvider that reads files with predictions from the disk
- It currently supports doctags, markdown, json, yaml formats.
- We still need to improve the returned type so that it allows for no DoclingDocument but only for
  the source data (e.g. in case of markdown).

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Add DocLayNetV2DatasetBuilder

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Added TableDatasetBuilder and test, update TableFormerPredictionProvider

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* chore: Update MyPy configuration in toml

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Refactor the BasePredictionProvider.predict() to return DatasetRecordWithPrediction

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Fix the FilePredictionProvider. Return None in the predicted document in case of Markdown.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Remove the kwargs from all PredictionProvider classes and introduce provider-specific
initialization arguments

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Introduce the parameter "ignore_missing_files" in FilePredictionProvider

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Add do_visualization to PredictionProvider

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Move next-gen API to main source tree, re-organize module paths

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup, change path handling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup, change path handling

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* More module removal and renaming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small test fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* fix: Add the "prediction_format" in the serialization of DatasetRecordWithPrediction

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Refactor the MarkdownTextEvaluator to support the new classes design. Add unit test.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Improve the new design of MarkdownEvaluator to move common functionalities into the base class

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* feat: Refactor the LayoutEvaluator to use the new class design. Add unit test.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Clean up LayoutEvaluator code

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Implementation cleanup and fixes for new class design (#52)

* More module removal and renaming

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small test fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Small test fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup of tests and more fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add visualization for tables

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add visualization for all tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes for test files, FilePredictionProvider changes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Put new CLI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Rename CLI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update all README with new commands.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove old examples

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Several Fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* README updates

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add gt_dir arg to create-eval, README fixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Fixes, pass tests

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Refactor the TableEvaluator to use the new class design.
Move common evaluator code to BaseEvaluator.
Add more unit tests. Introduce pytest dependencies.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Update lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Update lockfile

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Make pytest CI output more verbose

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Refactor the ReadingOrderEvaluator to use the new class design.
Remove the BaseReadingOrderEvaluator. Add unit test.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Optimize GT downloading behaviour

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add file sources

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Allow pytest output on CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Disable tests in CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Reenable tests in CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add correct @pytest.mark.dependency()

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Introduce TypeVars for the UnitEvaluation and DatasetEvaluation used by the BaseEvaluator.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Minimize tests in CI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* feat: Refactor BboxTestEvaluator to use the new design. Introduce unit test.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* Remove streaming in DocLaynet v1

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add back test dependency

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add DocVQA dataset builder

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Bugfixes

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Remove prints

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Cleanup

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

* Add DocVQA to CLI

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

---------

Signed-off-by: Peter Staar <taa@zurich.ibm.com>
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Signed-off-by: Saidgurbuz <said.gurbuz@epfl.ch>
Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Co-authored-by: Peter Staar <taa@zurich.ibm.com>
Co-authored-by: Saidgurbuz <said.gurbuz@epfl.ch>
Co-authored-by: Nikos Livathinos <nli@zurich.ibm.com>
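
The FilePredictionProvider items in the commit message above (predictions read from disk in doctags, markdown, json, or yaml format, plus the "ignore_missing_files" parameter) can be illustrated with a small sketch. It is an assumption-laden illustration, not the actual implementation: the function name `load_prediction`, the `EXTENSIONS` mapping, and the file-naming scheme are all hypothetical.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from prediction format to file extension.
EXTENSIONS = {
    "doctags": ".dt",
    "markdown": ".md",
    "json": ".json",
    "yaml": ".yaml",
}


def load_prediction(
    pred_dir: Path,
    doc_id: str,
    prediction_format: str,
    ignore_missing_files: bool = False,
) -> Optional[str]:
    """Return the raw prediction text for doc_id, or None if it is missing and ignored."""
    path = pred_dir / f"{doc_id}{EXTENSIONS[prediction_format]}"
    if not path.is_file():
        if ignore_missing_files:
            # Skip this record instead of aborting the whole run.
            return None
        raise FileNotFoundError(f"No prediction file found at {path}")
    return path.read_text(encoding="utf-8")
```

Returning the raw text rather than a parsed document mirrors the note in the commit message that a Markdown prediction may carry only source data and no DoclingDocument; how the real provider represents that case is not shown here.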