Text missing from PDF with large ratio #1202

@LuRe97

Bug

I am testing various PDF files to improve the solution I am building on top of Docling. I came across a PDF with an unusual page ratio that was saved from a website. When loaded with Docling, many characters that should be accessible as raw text are missing from the output. I attached the file to this issue. Loading the same file with other libraries, e.g. PyPDF, works as intended.

Problematic PDF:
Baldur's Gate III Guide - IGN (1).pdf

PyPDF Parsing Output:
PyPDF_Parsing_Output.txt

Docling Parsing Output:
Docling_Parsing_Output.txt

Steps to reproduce

Run the basic example with the problematic PDF:
from docling.document_converter import DocumentConverter

source = "Baldur.s.Gate.III.Guide.-.IGN.1.pdf"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
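To put a number on how much text goes missing, the two outputs can be compared word by word. The following is only a rough sketch, not part of the original report: the helper names `words`, `word_recall`, `pypdf_text_and_ratio`, `docling_text`, and `compare` are made up for illustration, and it assumes `pypdf` and `docling` are both installed. It treats the PyPDF text as the reference and reports what share of its words also appear in the Docling Markdown export, along with the height-to-width ratio of the first page.

```python
import re
from collections import Counter


def words(text: str) -> Counter:
    """Lower-cased word counts; punctuation and Markdown markup are ignored."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def word_recall(reference: str, candidate: str) -> float:
    """Fraction of word occurrences in `reference` that also occur in `candidate`."""
    ref, cand = words(reference), words(candidate)
    total = sum(ref.values())
    if total == 0:
        return 1.0  # nothing to recover from an empty reference
    matched = sum(min(count, cand[word]) for word, count in ref.items())
    return matched / total


def pypdf_text_and_ratio(path: str) -> tuple[str, float]:
    """Raw text of all pages plus the height/width ratio of the first page."""
    from pypdf import PdfReader  # imported lazily so the helpers above work without it

    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    box = reader.pages[0].mediabox
    return text, float(box.height) / float(box.width)


def docling_text(path: str) -> str:
    """Markdown export from Docling, using the same calls as the basic example."""
    from docling.document_converter import DocumentConverter

    return DocumentConverter().convert(path).document.export_to_markdown()


def compare(path: str) -> None:
    """Print the page ratio and the share of PyPDF words present in Docling's output."""
    reference, ratio = pypdf_text_and_ratio(path)
    recall = word_recall(reference, docling_text(path))
    print(f"first page height/width: {ratio:.1f}")
    print(f"share of PyPDF words found in Docling output: {recall:.1%}")
```

Calling `compare("Baldur.s.Gate.III.Guide.-.IGN.1.pdf")` would print both figures for the attached file. Word recall is a crude measure (it ignores word order and layout), but a low value would confirm that text is being dropped rather than just reformatted.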

Docling version

Docling version: 2.27.0
Docling Core version: 2.23.2
Docling IBM Models version: 3.4.0
Docling Parse version: 4.0.0
Python: cpython-311 (3.11.10)
Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35

Python version

Python 3.11.10

Labels

bug (Something isn't working)