Conversation

@samiuc samiuc commented Mar 4, 2025

TODO:

  • Refactor and format the code as per the existing repo structure.
  • Create Pydantic Models for Hyperscaler output -> DoclingDocument.
  • Fix bugs in the code, e.g. Google evals currently return a CER of 1.
  • Add instructions for running the code locally or via CLI.
  • Add documentation for setting up the environment variables for Hyperscalers.
  • Add Docling OCR document support.
  • Subset of documents - what is the total number of documents?
  • Update the dataset card and use HF datasets to load the dataset in the create method automatically.
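For reference, the CER mentioned above is the character error rate: Levenshtein edit distance between the predicted and reference text, divided by the reference length. A CER of 1 usually indicates the prediction shares essentially no overlap with the reference (e.g. an empty or wrongly-paired prediction). A minimal, dependency-free sketch (not the repo's actual evaluator):

```python
# Minimal CER sketch; the actual docling-eval implementation may differ.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, prediction: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not prediction else 1.0
    return levenshtein(reference, prediction) / len(reference)
```

An empty prediction against any non-empty reference yields `cer == 1.0`, which is one plausible cause of the Google-evals bug noted above.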

Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>
cau-git commented Mar 10, 2025

@samiuc Thanks for this contribution, I can see a lot of useful tooling in here that we can certainly adopt into docling-eval.

That said, there are a few misalignments that need to be addressed. I can help if you need more information.

  1. Work on the docling-eval codebase that brings in new datasets or new providers should be based on the branch of PR #30 (feat: Establish new API encapsulation for dataset creation and prediction providers), as discussed with @praveenmidde. It introduces a new abstraction for the whole dataset-building and prediction API, which will make some code in this PR obsolete. We no longer accept contributions in the shape of benchmarks/create.py scripts using the current approach with bare functions.
  2. I see that the current code exports shards to a JSONL format, and the evaluator reads the JSONL back. We must stick to the already established parquet format, which is a built-in feature of the new API, so no serialization code is required on your end.
  3. For the specific case of OCR evaluation, it is not desirable to build up full DoclingDocument instances, since that data model does not carry the detailed information an OCR engine typically provides (i.e. word-level tokens, bounding boxes, etc.). As discussed with @praveenmidde, we plan an extension to docling and docling-core to add another data model for the specific case of representing OCR pages, which this PR will need to adopt when available. We can, however, stick with the current approach until that is available.

samiullahchattha and others added 14 commits March 10, 2025 09:27
…36)

* chore: Rename `docling/` dir as converters. Introduce `visualization/` dir.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Remove unused imports and other code formatting

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Remove the `utils/` dir, delete unused files and move used code in appropriate locations

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Introduce the file visualisation/visualisations.py and move there functions from benchmarks/utils.py

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update MyPy configuration in toml to override tqdm module

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Clean up commented code

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Add CONVERTER_TYPE and MODALITIES columns to all produced datasets

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update pinning of docling

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Code refactoring:
- Move converters/teds.py into evaluators/teds.py
- Move all functions from converters/utils.py into benchmarks/utils.py.
- Rename create_xxx_converter() functions.
- Rename BenchMarkColumns.DOCLING_VERSION as BenchMarkColumns.CONVERTER_VERSION

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>
…ew settings in SmolDocling API. Improve the documentation. (#37)

* chore: Change the pinning of docling

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Fix the modalities supported for DPBench, OmniDocBench, DLNv1. Clean up code.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* docs: Update documentation to have all benchmarks in separate md files and place links in Readme.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Change the initialization of the create_smol_docling_converter() to allow flash-attn

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* docs: List benchmarks in the main readme with short description. Fix broken links in the documentation.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* docs: Fix broken link in Readme.

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update lock file

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Add debug code to dump the predicted text in create_dlnv1_e2e_dataset()

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update toml to pin docling with branch and extras

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* fix: Disable the generation of VLM text debugging files for DLNv1 benchmark

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

* chore: Update toml to docling v2.25.0 with vln extra

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>

---------

Signed-off-by: Nikos Livathinos <nli@zurich.ibm.com>
Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>
Signed-off-by: Christoph Auer <60343111+cau-git@users.noreply.github.com>
… for OCR evaluation

Signed-off-by: samiullahchattha <Sami.Ullah1@ibm.com>
samiuc commented Mar 18, 2025

Closing this PR in favor of #46.

samiuc closed this Mar 18, 2025