Description
Bug
I am testing various PDF files to improve a solution I am building on top of Docling. I came across a PDF with an unusual page aspect ratio that was saved from a website. When it is loaded with Docling, many characters that should be accessible as raw text are omitted. The file is attached to this issue. Loading the same file with other libraries, e.g. PyPDF, works as intended.
Problematic PDF:
Baldur's Gate III Guide - IGN (1).pdf
PyPDF Parsing Output:
PyPDF_Parsing_Output.txt
Docling Parsing Output:
Docling_Parsing_Output.txt
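To make the gap between the two attached outputs easier to quantify, here is a minimal sketch that counts which word tokens from a reference text are absent from a candidate text. It uses only the standard library. The helper names `missing_tokens` and `coverage` are hypothetical and are not part of Docling or PyPDF, and the word-level tokenization is an assumption that ignores layout and reading order.

```python
import re
from collections import Counter


def _tokenize(text: str) -> list[str]:
    # Lowercased word tokens; punctuation and whitespace are discarded.
    return re.findall(r"\w+", text.lower())


def missing_tokens(reference: str, candidate: str) -> Counter:
    """Tokens that occur more often in `reference` than in `candidate`."""
    # Counter subtraction keeps only positive differences.
    return Counter(_tokenize(reference)) - Counter(_tokenize(candidate))


def coverage(reference: str, candidate: str) -> float:
    """Fraction of reference tokens that also appear in `candidate`."""
    ref_counts = Counter(_tokenize(reference))
    total = sum(ref_counts.values())
    if total == 0:
        return 1.0  # nothing to cover
    missing = sum((ref_counts - Counter(_tokenize(candidate))).values())
    return 1 - missing / total


# Example usage with the two attached files (paths assumed to be local copies):
# from pathlib import Path
# pypdf_text = Path("PyPDF_Parsing_Output.txt").read_text(encoding="utf-8")
# docling_text = Path("Docling_Parsing_Output.txt").read_text(encoding="utf-8")
# print(f"coverage: {coverage(pypdf_text, docling_text):.1%}")
# print(missing_tokens(pypdf_text, docling_text).most_common(20))
```

Treating the PyPDF output as the reference is itself an assumption. Markdown syntax in the Docling export is mostly stripped by the tokenizer, but differences in hyphenation or ligatures would still show up as missing tokens.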
Steps to reproduce
Run the basic example with the problematic PDF:
from docling.document_converter import DocumentConverter
source = "Baldur.s.Gate.III.Guide.-.IGN.1.pdf"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
Docling version
Docling version: 2.27.0
Docling Core version: 2.23.2
Docling IBM Models version: 3.4.0
Docling Parse version: 4.0.0
Python: cpython-311 (3.11.10)
Platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
Python version
Python 3.11.10