Merged
Commits
100 commits
3053b56
correct mpy
PeterStaar-IBM Feb 12, 2025
a7869e5
reformatting
PeterStaar-IBM Feb 12, 2025
9b03f64
Merge branch 'main' into fix/docling-dpbench
PeterStaar-IBM Feb 12, 2025
31f7a1a
adding the script to make an initial dataset from pdf's
PeterStaar-IBM Feb 13, 2025
86efb2d
before switching to specific docling-core branch
PeterStaar-IBM Feb 13, 2025
b504372
rebased on kv-items and updated the create script in CVAT
PeterStaar-IBM Feb 14, 2025
68f8b1d
fixed the cvat
PeterStaar-IBM Feb 14, 2025
f045f0b
Merge branch 'main' into fix/docling-dpbench
PeterStaar-IBM Feb 14, 2025
7f59ff2
added the annotation description on CVAT
PeterStaar-IBM Feb 14, 2025
3baf9f1
added the annotation description on CVAT (2)
PeterStaar-IBM Feb 14, 2025
c45d2e3
added the annotation description on CVAT (3)
PeterStaar-IBM Feb 14, 2025
d8a8a59
[WIP] Crafting new dataset builder and prediction provider API
cau-git Feb 18, 2025
b8cb738
Merge from main
cau-git Feb 19, 2025
23834fb
Restructure to docling_eval_next
cau-git Feb 19, 2025
2206a97
Fix mypy
cau-git Feb 19, 2025
ea02901
Merge branch 'main' of github.com:DS4SD/docling-eval into cau/new-cla…
cau-git Feb 19, 2025
a6295bc
Fix f-strings
cau-git Feb 19, 2025
684fd27
Merge from main
cau-git Mar 17, 2025
415d767
Merge branch 'main' of github.com:DS4SD/docling-eval into cau/new-cla…
cau-git Mar 18, 2025
3b62bc6
Changes for prediction_provider interface, to support all cases.
cau-git Mar 19, 2025
0555485
Add omnidocbench DatasetBuilder
cau-git Mar 19, 2025
3027ba8
Add doclaynet v1, funsd
cau-git Mar 20, 2025
12f025f
Fixes
cau-git Mar 20, 2025
4693b2c
Add XFUND, more fixes
cau-git Mar 20, 2025
f8bd070
update the kv cell creation to prevent false positives
Saidgurbuz Mar 21, 2025
57df7bb
chore: Fixing imports
nikos-livathinos Mar 21, 2025
1250f5d
chore: Update docling-core version
nikos-livathinos Mar 21, 2025
51260fd
feat: Introduce new design for Evaluators based on BaseEvaluator that…
nikos-livathinos Mar 20, 2025
6812656
Factor PredictionProvider out of dataset builder, many fixes on Datas…
cau-git Mar 24, 2025
8df9157
Merge branch 'cau/new-class-design' of github.com:DS4SD/docling-eval …
cau-git Mar 24, 2025
9aed020
Sketch example for file-directory prediction provider
cau-git Mar 24, 2025
fc2b725
chore: Fix typing hints
nikos-livathinos Mar 25, 2025
040deb5
chore: Update poetry to doclign-core 2.24.0
nikos-livathinos Mar 25, 2025
d8835c1
feat: WIP: Introduce the FilePredictionProvider that reads files with…
nikos-livathinos Mar 25, 2025
0d4cccb
Add DocLayNetV2DatasetBuilder
cau-git Mar 25, 2025
c55095e
Added TableDatasetBuilder and test, update TableFormerPredictionProvider
cau-git Mar 25, 2025
2a36c55
Updated from remote
cau-git Mar 25, 2025
9175fc9
chore: Update MyPy configuration in toml
nikos-livathinos Mar 26, 2025
1708ed9
feat: Refactor the BasePredictionProvider.predict() to return Dataset…
nikos-livathinos Mar 26, 2025
e4e658d
Fixes
cau-git Mar 26, 2025
c354e31
fix: Fix the FilePredictionProvider. Return None in the predicted doc…
nikos-livathinos Mar 26, 2025
3bb6716
fix: Remove the kwargs from all PredictonProvider classes and introdu…
nikos-livathinos Mar 26, 2025
c135ed0
Fixes
cau-git Mar 26, 2025
2c9bf72
Merge branch 'cau/new-class-design' of github.com:DS4SD/docling-eval …
cau-git Mar 26, 2025
9c34bec
feat: Introduce the parameter "ignore_missing_files" in FilePredictio…
nikos-livathinos Mar 26, 2025
9a31cf6
Add do_visualization to PredictionProvider
cau-git Mar 26, 2025
adb5262
Merge from remote
cau-git Mar 26, 2025
637d7ae
Move next-gen API to main source tree, re-organize module paths
cau-git Mar 26, 2025
75b3b4f
Fixes
cau-git Mar 26, 2025
509ccad
Cleanup, change path handling
cau-git Mar 26, 2025
c520b60
Cleanup, change path handling
cau-git Mar 26, 2025
4a5af02
Merge branch 'cau/new-class-design' of github.com:DS4SD/docling-eval …
cau-git Mar 26, 2025
c0d6ec7
More module removal and renaming
cau-git Mar 26, 2025
86744a7
Small test fixes
cau-git Mar 26, 2025
9291b78
fix: Add the "prediction_format" in the serialization of DatasetRecor…
nikos-livathinos Mar 27, 2025
5fa0a0d
feat: Refactor the MarkdownTextEvaluator to support the new classes d…
nikos-livathinos Mar 27, 2025
c3a2929
fix: Improve the new design of MarkdownEvaluator to move common funct…
nikos-livathinos Mar 27, 2025
5b971c9
feat: Refactor the LayoutEvaluator to use the new class design. Add u…
nikos-livathinos Mar 27, 2025
ed7c5e0
fix: Clean up LayoutEvaluator code
nikos-livathinos Mar 27, 2025
8243a26
chore: Implementation cleanup and fixes for new class design (#52)
cau-git Mar 28, 2025
0a4dd3c
Import and unused code cleanup
cau-git Mar 28, 2025
8fc3e20
Update from base branch
cau-git Mar 28, 2025
fc7e44c
Add visualization for tables
cau-git Mar 28, 2025
f066d8d
Add visualization for all tests
cau-git Mar 28, 2025
8d799d1
Merge branch 'cau/new-class-design' into nli/new_design_adoption
cau-git Mar 28, 2025
d2bc3be
Fixes for test files, FilePredictionProvider changes
cau-git Mar 28, 2025
1826b88
Put new CLI
cau-git Mar 28, 2025
025fb58
Cleanup
cau-git Mar 28, 2025
ac78771
Merge pull request #51 from docling-project/nli/new_design_adoption
cau-git Mar 28, 2025
791ff64
Rename CLI
cau-git Mar 28, 2025
622a541
Update all README with new commands.
cau-git Mar 31, 2025
373ca8e
Remove old examples
cau-git Mar 31, 2025
43f4360
Several Fixes
cau-git Mar 31, 2025
b41924c
README updates
cau-git Mar 31, 2025
e4bd417
Add gt_dir arg to create-eval, README fixes
cau-git Mar 31, 2025
e898ab5
Fixes, pass tests
cau-git Mar 31, 2025
7be89ef
feat: Refactor the TableEvaluator to use the new class design.
nikos-livathinos Mar 31, 2025
a7f830b
Update lockfile
cau-git Mar 31, 2025
8c81563
Update lockfile
cau-git Mar 31, 2025
3e607ee
Make pytest CI output more verbose
cau-git Mar 31, 2025
af87bad
feat: Refactor the ReadingOrderEvaluator to use the new class design.
nikos-livathinos Mar 31, 2025
6af49cd
Optimize GT downloading behaviour
cau-git Apr 1, 2025
c5b4a24
Add file sources
cau-git Apr 1, 2025
f165e79
Allow pytest output on CI
cau-git Apr 1, 2025
79cb068
Disable tests in CI
cau-git Apr 1, 2025
153e5b9
Reenable tests in CI
cau-git Apr 1, 2025
7e83d61
Add correct @pytest.mark.dependency()
cau-git Apr 1, 2025
40c6d97
feat: Introduce TypeVars for the UnitEvaluation and DatasetEvaluation…
nikos-livathinos Apr 1, 2025
fe96106
Merge branch 'cau/new-class-design' of github.com:DS4SD/docling-eval …
cau-git Apr 1, 2025
880ecf2
Minimize tests in CI
cau-git Apr 1, 2025
a23f7ea
feat: Refactor BboxTestEvaluator to use the new design. Introduce uni…
nikos-livathinos Apr 1, 2025
77b3bed
Remove streaming in DocLaynet v1
cau-git Apr 1, 2025
9739dd6
Merge branch 'cau/new-class-design' of github.com:DS4SD/docling-eval …
cau-git Apr 1, 2025
71511d7
Add back test dependency
cau-git Apr 1, 2025
fddc215
Add DocVQA dataset builder
cau-git Apr 1, 2025
97ef6ab
Bugfixes
cau-git Apr 1, 2025
9c437e8
Remove prints
cau-git Apr 1, 2025
c836959
Merge from main
cau-git Apr 1, 2025
e6655b1
Cleanup
cau-git Apr 1, 2025
604e4a7
Add DocVQA to CLI
cau-git Apr 1, 2025
4 changes: 4 additions & 0 deletions docling_eval/cli/main.py
@@ -19,6 +19,7 @@
)
from docling_eval.dataset_builders.doclaynet_v1_builder import DocLayNetV1DatasetBuilder
from docling_eval.dataset_builders.doclaynet_v2_builder import DocLayNetV2DatasetBuilder
from docling_eval.dataset_builders.docvqa_builder import DocVQADatasetBuilder
from docling_eval.dataset_builders.dpbench_builder import DPBenchDatasetBuilder
from docling_eval.dataset_builders.funsd_builder import FUNSDDatasetBuilder
from docling_eval.dataset_builders.omnidocbench_builder import (

@@ -171,6 +172,9 @@ def get_dataset_builder(
    elif benchmark == BenchMarkNames.PUBTABNET:
        return PubTabNetDatasetBuilder(**common_params)  # type: ignore

    elif benchmark == BenchMarkNames.DOCVQA:
        return DocVQADatasetBuilder(**common_params)  # type: ignore

    else:
        raise ValueError(f"Unsupported benchmark: {benchmark}")
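The `get_dataset_builder` change above adds one more `elif` branch to an already long chain. A table-driven sketch of the same routing (hypothetical, simplified class names — not the repository's actual code) keeps each new benchmark to a single dictionary entry:

```python
from enum import Enum


class Benchmark(str, Enum):
    PUBTABNET = "PubTabNet"
    DOCVQA = "DocVQA"


class PubTabNetDatasetBuilder:
    def __init__(self, **common_params):
        self.params = common_params


class DocVQADatasetBuilder:
    def __init__(self, **common_params):
        self.params = common_params


# Map each benchmark to its builder class instead of chaining elif branches.
_BUILDERS = {
    Benchmark.PUBTABNET: PubTabNetDatasetBuilder,
    Benchmark.DOCVQA: DocVQADatasetBuilder,
}


def get_dataset_builder(benchmark: Benchmark, **common_params):
    try:
        return _BUILDERS[benchmark](**common_params)
    except KeyError:
        raise ValueError(f"Unsupported benchmark: {benchmark}")
```

Adding a benchmark then touches only the enum and the `_BUILDERS` table, not the dispatch function itself.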
3 changes: 3 additions & 0 deletions docling_eval/datamodels/types.py
@@ -45,6 +45,7 @@ class EvaluationModality(str, Enum):
    CAPTIONING = "captioning"  # to compute the accuracy of captions to table/figure
    BBOXES_TEXT = "bboxes_text"
    KEY_VALUE = "key_value"
    QUESTION_ANSWERING = "question_answering"


class BenchMarkNames(str, Enum):

@@ -67,6 +68,8 @@ class BenchMarkNames(str, Enum):
    FINTABNET = "FinTabNet"
    WIKITABNET = "WikiTabNet"

    DOCVQA = "DocVQA"

    # Formula
    # ???
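Because `BenchMarkNames` subclasses both `str` and `Enum`, the new `DOCVQA` member compares equal to its plain-string value, which is what lets CLI arguments and serialized records round-trip through the enum. A minimal self-contained sketch of that behavior:

```python
from enum import Enum


class BenchMarkNames(str, Enum):
    DOCVQA = "DocVQA"


# A str-valued Enum member behaves like a string: it can be parsed back
# from user input and compared against plain strings directly.
parsed = BenchMarkNames("DocVQA")
assert parsed is BenchMarkNames.DOCVQA
assert BenchMarkNames.DOCVQA == "DocVQA"
assert BenchMarkNames.DOCVQA.value == "DocVQA"
```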
195 changes: 195 additions & 0 deletions docling_eval/dataset_builders/docvqa_builder.py
@@ -0,0 +1,195 @@
import io
import logging
from pathlib import Path
from typing import Iterable, List, Optional, Set

import PIL.Image
from datasets import load_dataset
from docling_core.types import DoclingDocument
from docling_core.types.doc import (
    BoundingBox,
    CoordOrigin,
    DocItemLabel,
    GroupItem,
    GroupLabel,
    ImageRef,
    PageItem,
    ProvenanceItem,
    Size,
    TableCell,
    TableData,
)
from docling_core.types.io import DocumentStream
from tqdm import tqdm

from docling_eval.datamodels.dataset_record import DatasetRecord
from docling_eval.datamodels.types import BenchMarkColumns, EvaluationModality
from docling_eval.dataset_builders.dataset_builder import (
    BaseEvaluationDatasetBuilder,
    HFSource,
)
from docling_eval.utils.utils import (
    add_pages_to_true_doc,
    crop_bounding_box,
    extract_images,
    from_pil_to_base64uri,
    get_binhash,
)

# Get logger
_log = logging.getLogger(__name__)


class DocVQADatasetBuilder(BaseEvaluationDatasetBuilder):
    """
    DocVQA dataset builder implementing the base dataset builder interface.

    This builder processes the DocVQA dataset, which contains document
    layout annotations for a variety of document types.
    """

    def __init__(
        self,
        target: Path,
        split: str = "test",
        begin_index: int = 0,
        end_index: int = -1,
    ):
        """
        Initialize the DocVQA dataset builder.

        Args:
            target: Path where processed dataset will be saved
            split: Dataset split to use
            begin_index: Start index for processing (inclusive)
            end_index: End index for processing (exclusive), -1 means process all
        """
        super().__init__(
            name="DocVQA",
            dataset_source=HFSource(repo_id="lmms-lab/DocVQA"),
            target=target,
            split=split,
            begin_index=begin_index,
            end_index=end_index,
        )

    def _process_document(self, doc_id, qa_items) -> DatasetRecord:
        """Process all QA items for a single document."""
        _log.debug(f"Processing document: {doc_id}")

        doc = DoclingDocument(name=f"{doc_id}")
        image: PIL.Image.Image = qa_items[0]["image"]
        image = image.convert("RGB")
        image_ref = ImageRef(
            mimetype="image/png",
            dpi=72,
            size=Size(width=image.width, height=image.height),
            uri=from_pil_to_base64uri(image),
        )
        page_item = PageItem(
            page_no=1,
            size=Size(width=float(image.width), height=float(image.height)),
            image=image_ref,
        )

        doc.pages[1] = page_item
        for qa_item in qa_items:
            _log.debug("  Processing QA item data...")

        # Extract images from the ground truth document
        doc, true_pictures, true_page_images = extract_images(
            document=doc,
            pictures_column=BenchMarkColumns.GROUNDTRUTH_PICTURES.value,
            page_images_column=BenchMarkColumns.GROUNDTRUTH_PAGE_IMAGES.value,
        )

        # Convert image to bytes for storage
        with io.BytesIO() as img_byte_stream:
            image.save(img_byte_stream, format="PNG")
            img_byte_stream.seek(0)
            img_bytes = img_byte_stream.getvalue()

        # Create dataset record
        record = DatasetRecord(
            doc_id=str(doc_id),
            doc_hash=get_binhash(img_bytes),
            ground_truth_doc=doc,
            original=DocumentStream(name=str(doc_id), stream=io.BytesIO(img_bytes)),
            mime_type="image/png",
            modalities=[
                EvaluationModality.LAYOUT,
                EvaluationModality.QUESTION_ANSWERING,
            ],
            ground_truth_pictures=true_pictures,
            ground_truth_page_images=true_page_images,
        )

        return record

    def iterate(self) -> Iterable[DatasetRecord]:
        """
        Iterate through the dataset and yield DatasetRecord objects.

        Yields:
            DatasetRecord objects
        """
        assert isinstance(self.dataset_source, HFSource)

        path = self.dataset_source.repo_id
        if self.dataset_local_path is not None:
            path = str(self.dataset_local_path)
        # Load dataset from the retrieved path
        ds = load_dataset(path, split=self.split, name="DocVQA")

        # Apply HuggingFace's select method for index ranges
        total_ds_len = len(ds)
        begin, end = self.get_effective_indices(total_ds_len)

        # Select the range (HuggingFace datasets have a convenient select method)
        ds = ds.select(range(begin, end))
        selected_ds_len = len(ds)

        # Log stats
        self.log_dataset_stats(total_ds_len, selected_ds_len)

        skipped_rows = 0
        exported_rows = 0

        sorted_dataset = ds.sort("docId")

        # Initialize variables
        current_doc_id = None
        current_doc_qa_items = []  # type: ignore

        # Iterate through the sorted dataset
        for sample in tqdm(
            sorted_dataset,
            total=selected_ds_len,
            ncols=128,
            desc="Processing DocVQA records...",
        ):
            # Check if we've moved to a new docId
            if sample["docId"] != current_doc_id:
                # Process the previous doc's QA items (skip first iteration)
                if current_doc_qa_items:
                    rec = self._process_document(current_doc_id, current_doc_qa_items)
                    yield rec
                    exported_rows += 1

                # Start a new document group
                current_doc_id = sample["docId"]
                current_doc_qa_items = [sample]
            else:
                current_doc_qa_items.append(sample)

        # Process the final document group
        if current_doc_qa_items:
            rec = self._process_document(current_doc_id, current_doc_qa_items)
            yield rec
            exported_rows += 1

        _log.info(
            "Exported rows: %s. Skipped rows: %s.",
            exported_rows,
            skipped_rows,
        )
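`iterate()` groups a docId-sorted stream into one record per document by tracking the previous key manually. The same sort-then-group idea can be expressed with `itertools.groupby` — a standalone sketch with toy data, not the builder's actual code:

```python
from itertools import groupby
from operator import itemgetter

# Toy QA samples; several questions can point at the same document.
samples = [
    {"docId": "b", "question": "Q2"},
    {"docId": "a", "question": "Q1"},
    {"docId": "a", "question": "Q3"},
]

# groupby only merges *adjacent* equal keys, so sorting by docId first
# is required -- the same reason iterate() calls ds.sort("docId").
samples.sort(key=itemgetter("docId"))

records = [
    (doc_id, [s["question"] for s in items])
    for doc_id, items in groupby(samples, key=itemgetter("docId"))
]
# records == [("a", ["Q1", "Q3"]), ("b", ["Q2"])]
```

The explicit loop in `iterate()` does the same thing while remaining a generator, which suits yielding one `DatasetRecord` at a time without materializing all groups.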
22 changes: 22 additions & 0 deletions tests/test_dataset_builder.py
@@ -21,6 +21,7 @@
)
from docling_eval.dataset_builders.doclaynet_v1_builder import DocLayNetV1DatasetBuilder
from docling_eval.dataset_builders.doclaynet_v2_builder import DocLayNetV2DatasetBuilder
from docling_eval.dataset_builders.docvqa_builder import DocVQADatasetBuilder
from docling_eval.dataset_builders.dpbench_builder import DPBenchDatasetBuilder
from docling_eval.dataset_builders.funsd_builder import FUNSDDatasetBuilder
from docling_eval.dataset_builders.omnidocbench_builder import (

@@ -579,3 +580,24 @@ def test_run_pubtabnet_builder():
        odir=target_path / "evaluations" / EvaluationModality.TABLE_STRUCTURE.value,
        split="val",
    )


@pytest.mark.skipif(
    IS_CI, reason="Skipping test in CI because the dataset is too heavy."
)
def test_run_docvqa_builder():
    target_path = Path(f"./scratch/{BenchMarkNames.DOCVQA.value}/")

    dataset_layout = DocVQADatasetBuilder(
        target=target_path / "gt_dataset",
        end_index=25,
    )

    dataset_layout.save_to_disk()  # does all the job of iterating the dataset, making GT+prediction records, and saving them in shards as parquet.
    docling_provider = create_docling_prediction_provider(page_image_scale=2.0)

    docling_provider.create_prediction_dataset(
        name=dataset_layout.name,
        gt_dataset_dir=target_path / "gt_dataset",
        target_dataset_dir=target_path / "eval_dataset_e2e",
    )
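The test caps the run with `end_index=25`; per the builder's docstring, `begin_index` is inclusive, `end_index` is exclusive, and `-1` means process everything. A tiny sketch of that clamping logic — an assumption about what `get_effective_indices` does, since its implementation is not shown in this diff:

```python
def effective_indices(total_len: int, begin_index: int = 0, end_index: int = -1):
    """Clamp (begin, end) to the dataset length; end_index == -1 means 'all'.

    Hypothetical stand-in for BaseEvaluationDatasetBuilder.get_effective_indices.
    """
    end = total_len if end_index == -1 else min(end_index, total_len)
    begin = min(max(begin_index, 0), end)
    return begin, end
```

With the test's `end_index=25` on a larger split, this yields `range(0, 25)`, so only the first 25 rows are selected and processed.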