bug: DOCX references are not extracted properly #1250

@barseghyanartur

Bug

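In-text citations (references) are dropped when a DOCX file is converted: the Markdown output has a gap or empty parentheses where each citation should be.

Converting the same manuscript as a PDF preserves the citations.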

Steps to reproduce

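Convert the two attached files, Drought_Manuscript_mini.docx and Drought_Manuscript_mini.pdf, with the scripts below and compare the output.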

DOCX

Filename: docling_docx_test.py

from io import BytesIO
from pathlib import Path

from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter

# Load the DOCX into memory and hand it to Docling as a named stream.
file = Path("Drought_Manuscript_mini.docx")
filename = file.name
buf = BytesIO(file.read_bytes())
source = DocumentStream(name=filename, stream=buf)

# Convert with default settings and print the Markdown export.
converter = DocumentConverter()
result = converter.convert(source)
doc = result.document.export_to_markdown()
print(doc)

Output

Drought is one of the most complex and least understood natural disasters, 
causing significant agricultural, hydrological, and socioeconomic impacts . 
Annually, about 55 million people worldwide experience droughts, posing major 
threats to livestock and crops. Droughts jeopardize livelihoods, increase 
disease and mortality risks, and prompt massive migration . By 2030, water 
scarcity will affect 40% of the global population, with up to 700 million 
people at risk of displacement due to drought . Climate change exacerbates 
these issues, leading to prolonged dry periods, unrest, and population 
movements . Recently, the severity of drought events has intensified, 
amplifying their effects on ecosystems and agriculture as a result of 
climate change consequences . Drought substantially negatively impacts 
agricultural production and income in India, with production decreasing 
by 85% and income by 93% during drought years ().
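The gaps in the DOCX output above follow two visible patterns: a space directly before a period, and an empty pair of parentheses. A minimal sketch of a check for those two symptoms is shown here; the function name and both patterns are assumptions tuned to this particular output, not part of Docling.

```python
import re

# Heuristic patterns (assumptions based on the output in this report):
# "()" left behind by a parenthesized citation, and "word ." left behind
# by a citation that sat between the last word and the period.
EMPTY_PARENS = re.compile(r"\(\s*\)")
SPACE_BEFORE_PERIOD = re.compile(r"\w \.(?=\s|$)")


def find_citation_gaps(text: str) -> list[str]:
    """Return a short snippet for each suspected missing citation."""
    flat = " ".join(text.split())  # undo hard line wraps and extra spaces
    snippets = []
    for pattern in (EMPTY_PARENS, SPACE_BEFORE_PERIOD):
        for match in pattern.finditer(flat):
            start = max(0, match.start() - 20)
            snippets.append(flat[start:match.end()])
    return snippets
```

Passing the printed Markdown to `find_citation_gaps` would list each suspicious spot with a little leading context. The check is only a heuristic and can flag ordinary typos as well.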

PDF

Filename: docling_pdf_test.py

from io import BytesIO
from pathlib import Path

from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter

# Load the PDF into memory and hand it to Docling as a named stream.
file = Path("Drought_Manuscript_mini.pdf")
filename = file.name
buf = BytesIO(file.read_bytes())
source = DocumentStream(name=filename, stream=buf)

# Convert with default settings and print the Markdown export.
converter = DocumentConverter()
result = converter.convert(source)
doc = result.document.export_to_markdown()
print(doc)

Output

Drought is one of the most complex and least understood natural disasters, 
causing significant agricultural, hydrological, and socioeconomic 
impacts (Hagman G 1984). Annually, about 55 million people worldwide 
experience droughts, posing major threats to livestock  and crops. Droughts 
jeopardize livelihoods, increase disease and mortality risks, and prompt 
massive migration (VERMA et al. 2023). By 2030, water scarcity will 
affect 40% of the global population, with up to 700 million people at risk 
of displacement due to drought (World Health Organization (WHO) 2024). 
Climate change exacerbates these issues, leading to prolonged dry periods, 
unrest, and population movements (de Bruin et al. 2018). Recently, the 
severity of drought events has intensified, amplifying their effects on 
ecosystems and agriculture as a result of climate change 
consequences (Hammouri 2022). Drought substantially   negatively   
impacts   agricultural   production   and   income   in   India,   
with   production decreasing by 85% and income by 93% during 
drought years ((Prasad et al. 2023)).
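To quantify the difference between the two outputs, the parenthetical citations found in the PDF output can be looked up in the DOCX output. The sketch below assumes both Markdown strings are already available; `missing_citations` and the `CITATION` pattern are hypothetical helpers written for this report, and the pattern only recognizes simple citations that end in a four-digit year.

```python
import re

# Parenthesized text ending in a four-digit year, e.g. "(Hagman G 1984)".
# Citations containing inner parentheses, such as
# "(World Health Organization (WHO) 2024)", are skipped by this simple pattern.
CITATION = re.compile(r"\(([^()]*\b(?:19|20)\d{2})\)")


def missing_citations(pdf_md: str, docx_md: str) -> list[str]:
    """Return citations present in the PDF output but absent from the DOCX output."""
    pdf_flat = " ".join(pdf_md.split())    # normalize wraps and repeated spaces
    docx_flat = " ".join(docx_md.split())
    return [c for c in CITATION.findall(pdf_flat) if c not in docx_flat]
```

An empty result would indicate that every simple citation recognized in the PDF output also appears in the DOCX output.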

Docling version

2.28.2

Python version

3.12

Drought_Manuscript_mini.docx
Drought_Manuscript_mini.pdf
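The cause is not established in this report. One hypothesis worth checking is that the citations are stored in the DOCX as fields or content controls, as reference managers commonly do, rather than as plain text runs. The sketch below counts those WordprocessingML elements in `word/document.xml` using only the standard library; `count_citation_containers` is a hypothetical helper, and nonzero counts would only be a hint, not proof of the cause.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace (the "w:" prefix in document.xml).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Elements that commonly wrap generated text: content controls (w:sdt),
# simple fields (w:fldSimple) and complex fields (w:fldChar, w:instrText).
TAGS = ("sdt", "fldSimple", "fldChar", "instrText")


def count_citation_containers(docx_source) -> dict[str, int]:
    """Count field and content-control elements in a DOCX path or binary stream."""
    with zipfile.ZipFile(docx_source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return {tag: sum(1 for _ in root.iter(f"{{{W_NS}}}{tag}")) for tag in TAGS}
```

For example, `count_citation_containers("Drought_Manuscript_mini.docx")` would report how many of each element the attached file contains.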

Labels

bug (Something isn't working), docx (issue related to docx backend)
