Open
Labels
bug (Something isn't working), docx (issue related to docx backend)
Description
Bug
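In-text citations are not extracted from DOCX files: the surrounding sentences are kept, but each citation is dropped, leaving a gap or empty parentheses.
Converting the same manuscript from PDF preserves every citation.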
Steps to reproduce
Consider the two uploaded files, Drought_Manuscript_mini.docx and Drought_Manuscript_mini.pdf.
DOCX
Filename: docling_docx_test.py
from io import BytesIO
from pathlib import Path

from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter
from docling.exceptions import ConversionError

# Load the DOCX into memory and wrap it as a named stream for docling.
file = Path("Drought_Manuscript_mini.docx")
filename = file.name
buf = BytesIO(file.read_bytes())
source = DocumentStream(name=filename, stream=buf)

# Convert with default settings and print the Markdown export.
converter = DocumentConverter()
result = converter.convert(source)
doc = result.document.export_to_markdown()
print(doc)

Output
Drought is one of the most complex and least understood natural disasters,
causing significant agricultural, hydrological, and socioeconomic impacts .
Annually, about 55 million people worldwide experience droughts, posing major
threats to livestock and crops. Droughts jeopardize livelihoods, increase
disease and mortality risks, and prompt massive migration . By 2030, water
scarcity will affect 40% of the global population, with up to 700 million
people at risk of displacement due to drought . Climate change exacerbates
these issues, leading to prolonged dry periods, unrest, and population
movements . Recently, the severity of drought events has intensified,
amplifying their effects on ecosystems and agriculture as a result of
climate change consequences . Drought substantially negatively impacts
agricultural production and income in India, with production decreasing
by 85% and income by 93% during drought years ().
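To narrow down where the citations go missing, it may help to check how they are stored inside the DOCX itself. The sketch below is only a diagnostic and relies on an unconfirmed assumption: that the citations were inserted by a reference manager as content controls (`w:sdt`) or field codes (`w:fldSimple`, `w:instrText`, `w:fldChar`) rather than as plain runs. The helper name `count_citation_containers` is made up for this example and is not part of docling; it uses only the standard library.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

# Namespace of WordprocessingML elements in a (transitional) DOCX file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Elements that commonly wrap generated content such as citations (assumption).
WANTED = {"sdt", "fldSimple", "instrText", "fldChar"}


def count_citation_containers(docx_path) -> Counter:
    """Count content-control and field elements in word/document.xml."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    counts = Counter()
    prefix = "{" + W_NS + "}"
    for element in root.iter():
        if element.tag.startswith(prefix):
            local_name = element.tag[len(prefix):]
            if local_name in WANTED:
                counts[local_name] += 1
    return counts


if __name__ == "__main__":
    sample = Path("Drought_Manuscript_mini.docx")
    if sample.exists():
        print(count_citation_containers(sample))
```

If the counts are non-zero, the citation text probably lives inside those wrappers, which would point at how the docx backend handles them.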
PDF
Filename: docling_pdf_test.py
from io import BytesIO
from pathlib import Path

from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter
from docling.exceptions import ConversionError

# Load the PDF into memory and wrap it as a named stream for docling.
file = Path("Drought_Manuscript_mini.pdf")
filename = file.name
buf = BytesIO(file.read_bytes())
source = DocumentStream(name=filename, stream=buf)

# Convert with default settings and print the Markdown export.
converter = DocumentConverter()
result = converter.convert(source)
doc = result.document.export_to_markdown()
print(doc)

Output
Drought is one of the most complex and least understood natural disasters,
causing significant agricultural, hydrological, and socioeconomic
impacts (Hagman G 1984). Annually, about 55 million people worldwide
experience droughts, posing major threats to livestock and crops. Droughts
jeopardize livelihoods, increase disease and mortality risks, and prompt
massive migration (VERMA et al. 2023). By 2030, water scarcity will
affect 40% of the global population, with up to 700 million people at risk
of displacement due to drought (World Health Organization (WHO) 2024).
Climate change exacerbates these issues, leading to prolonged dry periods,
unrest, and population movements (de Bruin et al. 2018). Recently, the
severity of drought events has intensified, amplifying their effects on
ecosystems and agriculture as a result of climate change
consequences (Hammouri 2022). Drought substantially negatively
impacts agricultural production and income in India,
with production decreasing by 85% and income by 93% during
drought years ((Prasad et al. 2023)).
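To list exactly what the DOCX conversion drops, the two Markdown exports can be compared word by word. This is a minimal sketch using only the standard library's `difflib`; the helper name `text_only_in_second` is hypothetical, and the two sample strings are shortened excerpts of the outputs above rather than the full exports.

```python
from difflib import SequenceMatcher


def text_only_in_second(first: str, second: str) -> list[str]:
    """Return runs of words that occur in `second` with no counterpart in `first`."""
    a, b = first.split(), second.split()
    # autojunk=False keeps frequent words from being ignored in long documents.
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    runs = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            runs.append(" ".join(b[j1:j2]))
    return runs


# Shortened excerpts from the DOCX and PDF outputs.
docx_md = "socioeconomic impacts . Annually, about 55 million people"
pdf_md = "socioeconomic impacts (Hagman G 1984). Annually, about 55 million people"

print(text_only_in_second(docx_md, pdf_md))
```

Passing the full `export_to_markdown()` strings from both scripts in place of the excerpts should list each citation that is present in the PDF output but absent from the DOCX output, along with any unrelated differences such as line wrapping artifacts.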
Docling version
2.28.2
Python version
3.12