Misidentified Headings #3424

Arshil-Akkala · 2026-05-10T14:56:06Z

Hi, I am using docling 2.92.0. When I give it a PDF that contains headings, it identifies the real headings correctly, but it also misidentifies some points under a heading as headings. This matters because I am trying to persist the headings to the chunks. From the issues I have checked, my understanding is that the layout model (Heron) is misidentifying them. Is there any development in progress, or a workaround until this is fixed?

Thanks

@Arshil-Akkala · 2026-05-10T15:01:12Z

dosubot[bot]
Bot May 10, 2026

@Arshil-Akkala, heading misidentification by the layout model is a documented limitation: the Heron model can sometimes incorrectly classify regular text or list items as SECTION_HEADER, particularly in legal or complex document layouts [1]. Here are a few workarounds:

1. Try a different layout model

The Egret models may give better accuracy for your documents:

from docling.datamodel.pipeline_options import PdfPipelineOptions, LayoutOptions
from docling.datamodel.layout_model_specs import DOCLING_LAYOUT_EGRET_LARGE

# Use the Egret Large layout model instead of Heron.
pipeline_options = PdfPipelineOptions()
pipeline_options.layout_options = LayoutOptions(
    model_spec=DOCLING_LAYOUT_EGRET_LARGE
)

[2]

2. Raise the detection score threshold

You can increase the minimum confidence required for layout detections, which may filter out false-positive headings:

from docling.datamodel.pipeline_options import LayoutObjectDetectionOptions
from docling.models.inference_engines.object_detection.onnxruntime_engine import OnnxRuntimeObjectDetectionEngineOptions

layout_options = LayoutObjectDetectionOptions.from_preset(
    "layout_heron_default",
    engine_options=OnnxRuntimeObjectDetectionEngineOptions(score_threshold=0.5)
)

[3]

3. Post-process with the DoclingDocument API

After conversion, you can programmatically reclassify incorrectly labeled headings back to text using iterate_items() and replace_item() [4].

4. Use docling-hierarchical-pdf for heading correction

This community package infers heading structure from font size, boldness, and numbering patterns as a post-processing step, and may help correct misidentified headings [5].
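
To make the post-processing idea in option 3 more concrete, here is a minimal sketch of the decision logic only. It does not call docling: `LabeledItem` is a hypothetical stand-in for a converted item, and `looks_like_false_heading`, `demote_false_headings`, and the thresholds are assumptions chosen for illustration. In a real pipeline you would apply the same check while walking the document with iterate_items() and then relabel with replace_item(), whose exact signatures are not shown here.

```python
import re
from dataclasses import dataclass


@dataclass
class LabeledItem:
    """Hypothetical stand-in for a converted item (not a docling class)."""
    label: str  # e.g. "section_header" or "text"
    text: str


# Assumed heuristic: bullets or "(a)" / "ii)" style markers suggest a list point.
_LIST_MARKER = re.compile(
    r"^\s*(?:[-*\u2022]|\(?[a-z]\)|\(?[ivx]+\))\s+", re.IGNORECASE
)


def looks_like_false_heading(text: str, max_words: int = 12) -> bool:
    """Return True if text labelled as a heading is more likely body text."""
    stripped = text.strip()
    if not stripped:
        return True  # an empty heading carries no structure
    if _LIST_MARKER.match(stripped):
        return True  # starts like a list item
    if len(stripped.split()) > max_words:
        return True  # too long for a typical heading
    # Headings rarely end with sentence punctuation.
    return stripped.endswith((".", ";", ","))


def demote_false_headings(items):
    """Relabel suspicious section headers as plain text; keep everything else."""
    return [
        LabeledItem("text", it.text)
        if it.label == "section_header" and looks_like_false_heading(it.text)
        else it
        for it in items
    ]


items = [
    LabeledItem("section_header", "2. Scope of Agreement"),
    LabeledItem("section_header", "(a) the supplier shall deliver the goods;"),
    LabeledItem("text", "Body paragraph."),
]
cleaned = demote_false_headings(items)
```

The rules are deliberately conservative and will need tuning per document set; for example, a corpus whose real headings end with a colon or run long would need a different `max_words` or punctuation check.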

As for development: there have been recent PRs improving heading detection for DOCX [6], but for PDFs specifically, heading classification remains largely dependent on the layout model's detection output, with no heading-specific tuning exposed yet.

Arshil-Akkala · 2026-05-10T15:25:40Z

How confident are you that the Egret models do better at layout detection? If the layout model itself can identify the headings properly, that would be a win.

dosubot[bot] Bot May 10, 2026

@Arshil-Akkala, I want to temper expectations here. Based on a user benchmark on a 500-image subset of the DocLayNet test set, Heron actually outperforms Egret-XLarge in most detection metrics (AP 0.702 vs 0.687 at IoU 0.5:0.95) [1]. So switching to Egret likely won't improve heading detection.

More importantly, there's evidence that heading detection issues affect all layout models equally: the problem appears to originate in downstream processing (the ReadingOrderPredictor), not in the layout model itself [2]. This means swapping models probably won't fix the misidentification you're seeing.

Your best bet for correcting false-positive headings is likely one of the post-processing approaches I mentioned earlier:

Using docling-hierarchical-pdf to re-infer heading structure from style features [3]
Programmatically reclassifying items via the DoclingDocument API after conversion [4]

Apologies for the misleading suggestion in my earlier message. I should have flagged the benchmark data upfront.
