Missing text refs #1298

@ejkitchen

Description

Bug

Initial Context

We're looking at document processing where PDFs are converted into structured data:

  • PictureItems in documents contain RefItems that should point to text content (like captions or descriptions)
  • These references follow the pattern #/texts/{index} but some are invalid - they point to text entries that don't exist
  • This inconsistency is visible in document traversal where some text references resolve successfully while others fail
  • Pictures can have multiple types of annotations (classifications, descriptions, molecule data, etc.)
  • This structure is demonstrated in the test data (test/data/doc/dummy_doc.yaml)
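To make the failure mode concrete, here is a minimal sketch of resolving a #/texts/{index} reference. It assumes the document has been exported to a plain dict in which references appear as {"$ref": "..."} entries and text items live in a top-level "texts" list. The helper name resolve_text_ref and the toy data are hypothetical and are not part of Docling.

```python
import re
from typing import Any, Optional

# Matches references of the form "#/texts/{index}"
TEXT_REF = re.compile(r"^#/texts/(\d+)$")


def resolve_text_ref(doc: dict[str, Any], ref: str) -> Optional[dict[str, Any]]:
    """Return the text entry a '#/texts/{index}' ref points to, or None."""
    match = TEXT_REF.match(ref)
    if match is None:
        return None  # not a text reference at all
    index = int(match.group(1))
    texts = doc.get("texts", [])
    return texts[index] if index < len(texts) else None


# Toy document (assumed layout): two text entries, and one picture whose
# captions hold one valid reference and one that points past the end of "texts".
doc = {
    "texts": [{"text": "Figure 1: overview"}, {"text": "Body paragraph"}],
    "pictures": [{"captions": [{"$ref": "#/texts/0"}, {"$ref": "#/texts/7"}]}],
}

for caption in doc["pictures"][0]["captions"]:
    target = resolve_text_ref(doc, caption["$ref"])
    status = "missing" if target is None else target["text"]
    print(caption["$ref"], "->", status)
```

The second reference is the kind of entry this issue is about: syntactically valid, but with no text item behind it.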

Processing Flow

a. Document Processing Pipeline (standard_pdf_pipeline.py):

Uses StandardPdfPipeline with three enrichment models:

self.enrichment_pipe = [
    CodeFormulaModel(...),
    DocumentPictureClassifier(...),
    picture_description_model,
]

b. Picture Classification (document_picture_classifier.py):

  • Processes images and adds classification annotations
  • Adds PictureClassificationData directly to PictureItem.annotations
  • Does not create separate text items or references

c. Picture Description Models:

  • Base implementation in picture_description_base_model.py
  • Concrete implementations in picture_description_vlm_model.py and picture_description_api_model.py
  • Process flow:
    • Takes images from PictureItems
    • Generates text descriptions
    • Adds descriptions as PictureDescriptionData to PictureItem.annotations
    • Does not create separate text items or references
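The enrichment behaviour described above can be sketched with stand-in classes. This is an illustration only: Picture, Document, DescriptionAnnotation and describe_pictures are hypothetical simplifications, not the real PictureItem, DoclingDocument or PictureDescriptionData types. The point is that a description lands in annotations while the list of text items, and the references into it, stay untouched.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DescriptionAnnotation:
    """Hypothetical stand-in for PictureDescriptionData."""
    text: str
    provenance: str


@dataclass
class Picture:
    """Hypothetical stand-in for PictureItem."""
    self_ref: str
    captions: list[str] = field(default_factory=list)  # refs such as "#/texts/0"
    annotations: list[object] = field(default_factory=list)


@dataclass
class Document:
    """Hypothetical stand-in for DoclingDocument."""
    texts: list[str] = field(default_factory=list)
    pictures: list[Picture] = field(default_factory=list)


def describe_pictures(doc: Document, describe: Callable[[Picture], str]) -> None:
    """Attach a description to every picture without touching doc.texts."""
    for picture in doc.pictures:
        picture.annotations.append(
            DescriptionAnnotation(text=describe(picture), provenance="stub-model")
        )


doc = Document(
    texts=["Figure 1: overview"],
    pictures=[Picture(self_ref="#/pictures/0", captions=["#/texts/0"])],
)
texts_before = len(doc.texts)
describe_pictures(doc, lambda p: f"description of {p.self_ref}")

# The description is stored on the picture; no new text item or ref was created.
print(len(doc.texts) == texts_before, doc.pictures[0].annotations[0].text)
```

If the real models behave this way, they cannot be the source of the dangling #/texts/{index} references, which narrows the search to whatever creates the caption and child references.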

Data Structure

  • PictureItem (from docling_core/types/doc/document.py) can have multiple annotation types
  • Annotations are stored directly in the annotations list
  • References to text items (RefItems) exist, but their creation and management aren't consistent

Core Issue

The system has references to text items (#/texts/{index}) that don't correspond to actual text entries in the document. This raises several questions:

Questions for the Git Issue

  • Are these missing text references intentional or a bug?
  • Should all textual content (captions, descriptions, etc.) be stored as separate text items with valid references?
  • If some text content should be referenced and some stored directly in annotations, what are the rules determining this?
  • Is there a documentation gap regarding the expected behavior for text reference creation and management?

This appears to be either an intentional design choice that needs documentation or an implementation issue where text references are being created without their corresponding text items.
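To help tell the two cases apart, a check along the following lines could list every dangling reference in an exported document. It is a rough sketch that assumes the document is available as a nested dict/list structure with references stored as {"$ref": "#/..."} entries. The function names iter_refs, ref_exists and dangling_refs are hypothetical and are not part of the Docling API.

```python
from typing import Any, Iterator


def iter_refs(node: Any) -> Iterator[str]:
    """Yield every '$ref' string found anywhere in a nested dict/list structure."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str):
            yield ref
        for value in node.values():
            yield from iter_refs(value)
    elif isinstance(node, list):
        for value in node:
            yield from iter_refs(value)


def ref_exists(doc: dict[str, Any], ref: str) -> bool:
    """Follow a '#/a/b/0'-style pointer through doc; False if any step is missing."""
    node: Any = doc
    for part in ref.removeprefix("#/").split("/"):
        if isinstance(node, list) and part.isdigit() and int(part) < len(node):
            node = node[int(part)]
        elif isinstance(node, dict) and part in node:
            node = node[part]
        else:
            return False
    return True


def dangling_refs(doc: dict[str, Any]) -> list[str]:
    """Return the sorted, de-duplicated refs that do not resolve inside doc."""
    return sorted({ref for ref in iter_refs(doc) if not ref_exists(doc, ref)})


# Toy document (assumed layout): one caption points at a text entry that does not exist.
doc = {
    "body": {"children": [{"$ref": "#/texts/0"}, {"$ref": "#/pictures/0"}]},
    "texts": [{"self_ref": "#/texts/0", "text": "Intro"}],
    "pictures": [{"self_ref": "#/pictures/0", "captions": [{"$ref": "#/texts/4"}]}],
}

print(dangling_refs(doc))  # ['#/texts/4']
```

Running such a check right after conversion and again after each enrichment step would show at which stage the invalid references first appear.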

Steps to reproduce

Simply iterate through a document like this:

from typing import Dict, List

from docling_core.types.doc.document import DoclingDocument, NodeItem


def get_page_refs(document: DoclingDocument, max_pages: int = 3) -> Dict[str, NodeItem]:
    """
    Get all references needed for the first N pages, including their dependencies.
    """
    # Dictionary to group items by page number
    page_items: Dict[int, List[NodeItem]] = {}

    # First pass: collect direct page items and their references
    for item, level in document.iterate_items(with_groups=False):
        prov_list = getattr(item, "prov", [])
        # Items without provenance fall back to page 0
        page_num = prov_list[0].page_no if prov_list and hasattr(prov_list[0], "page_no") else 0

        # Only process up to max_pages (page_no is 1-based, so page max_pages itself is kept)
        if page_num > max_pages:
            continue

        page_items.setdefault(page_num, []).append(item)

    # Build reference lookup for ALL items
    ref_lookup: Dict[str, NodeItem] = {}
    for item, _ in document.iterate_items(with_groups=True):
        if hasattr(item, "self_ref"):
            # Normalize keys so they always start with '#'
            ref_key = item.self_ref if item.self_ref.startswith("#") else f"#{item.self_ref}"
            ref_lookup[ref_key] = item

    # Get all of the refs

Docling version

Docling version: 2.28.4
Docling Core version: 2.25.0
Docling IBM Models version: 3.4.1
Docling Parse version: 4.0.0
Python: cpython-312 (3.12.3)
Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39

Python version

Python 3.12.3

Labels: bug (Something isn't working)
