Misidentified Headings #3424
-
@Arshil-Akkala, heading misidentification by the layout model is a documented limitation: the Heron model can sometimes incorrectly classify regular text or list items as headings. Some workarounds:

1. Try a different layout model
The Egret models may give better accuracy for your documents:

    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.layout_model_specs import DOCLING_LAYOUT_EGRET_LARGE
    from docling.datamodel.pipeline_options import PdfPipelineOptions, LayoutOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Swap the default layout model for the larger Egret variant.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.layout_options = LayoutOptions(
        model_spec=DOCLING_LAYOUT_EGRET_LARGE
    )
    # Pass the options to the converter so they take effect for PDF input.
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )

2. Raise the detection score threshold
You can increase the minimum confidence required for layout detections, which may filter out false-positive headings:

    from docling.datamodel.pipeline_options import LayoutObjectDetectionOptions
    from docling.models.inference_engines.object_detection.onnxruntime_engine import OnnxRuntimeObjectDetectionEngineOptions

    # Detections scoring below 0.5 are discarded.
    layout_options = LayoutObjectDetectionOptions.from_preset(
        "layout_heron_default",
        engine_options=OnnxRuntimeObjectDetectionEngineOptions(score_threshold=0.5)
    )

3. Post-process with the DoclingDocument API
After conversion, you can programmatically reclassify incorrectly labeled headings back to text.

4. Use a community post-processing package
One such package infers heading structure from font size, boldness, and numbering patterns as a post-processing step, and may help correct misidentified headings [5].

As for development: recent PRs have improved heading detection for DOCX [6], but for PDFs specifically, heading classification still depends largely on the layout model's detection output, and no heading-specific tuning is exposed yet.
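To make the post-processing idea in point 3 more concrete, here is a minimal sketch of a text-based filter that demotes unlikely headings before they are attached to chunks. It deliberately works on plain (label, text) pairs rather than docling objects: `looks_like_false_heading`, `demote_false_headings`, the label strings, and the word-count threshold are all assumptions made for this illustration and are not part of docling's API. With a real converted document, the same check could be applied to each item that was labeled as a section header.

```python
import re

# Hypothetical label strings for this sketch; they stand in for whatever
# labels the converted document actually carries.
HEADING = "section_header"
TEXT = "text"

# Bullet glyphs or enumerators such as "a)", "(iv)", "3)" suggest a list item.
# Numbered headings like "1. Introduction" or "2.1 Scope" do not match.
_LIST_MARKER = re.compile(r"^\s*(?:[-*•‣◦]|\(?[a-zA-Z0-9]{1,3}\))\s+")


def looks_like_false_heading(text: str, max_words: int = 12) -> bool:
    """Heuristically decide whether a detected heading is really body text."""
    stripped = text.strip()
    if not stripped:
        return True
    if _LIST_MARKER.match(stripped):
        return True
    if len(stripped.split()) > max_words:
        return True
    # Sentences and list points usually end in punctuation; headings rarely do.
    return stripped.endswith((".", ";", ","))


def demote_false_headings(items):
    """Return (label, text) pairs with suspicious headings relabeled as text."""
    return [
        (TEXT if label == HEADING and looks_like_false_heading(text) else label, text)
        for label, text in items
    ]
```

These rules are intentionally crude and will also demote genuine headings that happen to end in a period or start with an enumerator like "A)", so the thresholds need tuning against your own documents.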
-
How sure & confident are you about Egret models for layout? I mean, if the layout model is able to identify the headings properly, then it's a win.
-
Hi, I am using docling 2.92.0. When I give it a PDF with some headings as input, it identifies the proper headings, but it also misidentifies some points under a heading as headings. I am trying to persist the headings to the chunks. From checking some of the issues, my understanding is that the layout model (Heron) is misidentifying them. Is there any development, or are there workarounds until this is fixed? Thanks