Description
Bug
When Docling is given PDFs with mainly two-column layouts as input, the reading order is wrong in multiple places. Examples: headings or paragraphs are attributed to the reading flow of the other column, images are attached as children of the wrong heading, and paragraphs within the same text column are swapped in the Markdown result. In the worst case, the word order inside a single heading element was reversed ("Key Features" ---docling-parse----> "Features Key").
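To make the expected behaviour concrete, here is a minimal sketch of the reading order I would expect for a simple two-column page: the whole left column top to bottom, then the whole right column. This is not Docling's algorithm. The `Box` class, the `two_column_order` function, the coordinates, and the "split at half the page width" rule are all hypothetical and only illustrate the expectation.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical text block; not a Docling type."""
    text: str
    x0: float  # left edge of the block
    y0: float  # top edge of the block, with y growing downward


def two_column_order(boxes, page_width):
    """Naive expected order: left column top-to-bottom, then right column."""
    mid = page_width / 2  # assumption: the column gutter sits at mid-page
    left = sorted((b for b in boxes if b.x0 < mid), key=lambda b: b.y0)
    right = sorted((b for b in boxes if b.x0 >= mid), key=lambda b: b.y0)
    return [b.text for b in left + right]


# Made-up layout: one heading and one paragraph per column.
boxes = [
    Box("Key Features", 50, 100),
    Box("Specifications", 320, 100),
    Box("Feature paragraph", 50, 140),
    Box("Spec paragraph", 320, 140),
]
expected_order = two_column_order(boxes, page_width=600)
print(expected_order)
```

In the affected files, the Markdown output instead interleaves blocks from the two columns, roughly as if the page had been read row by row.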
Example Files:
Data Sheet Example 1
Data Sheet Example 2
Steps to reproduce
Load the files via the basic example:
from docling.document_converter import DocumentConverter
source = "<document URL>"  # placeholder: URL or local path of one of the example files
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
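A small check like the following can flag the scrambled headings without reading the whole output by eye. It is only a sketch on plain strings: `missing_phrases` is a hypothetical helper, not part of Docling, and the `sample` string stands in for the real value of `result.document.export_to_markdown()`.

```python
def missing_phrases(markdown, expected_phrases):
    """Return the expected phrases that do not appear verbatim in the Markdown."""
    # Collapse all whitespace so line breaks inside a phrase do not cause misses.
    normalized = " ".join(markdown.split())
    return [phrase for phrase in expected_phrases if phrase not in normalized]


# Stand-in for the converter output; mirrors the swapped heading reported above.
sample = "## Features Key\n\nSome paragraph text."
missing = missing_phrases(sample, ["Key Features", "Some paragraph text."])
print(missing)
```

With the real output, the list of expected phrases would be the headings copied by hand from the PDF.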
Docling version
Docling version: 2.27.0
Docling Core version: 2.23.2
Docling IBM Models version: 3.4.0
Docling Parse version: 4.0.0
Python: cpython-311 (3.11.10)
Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
Python version
Python 3.11.10