fix: avoid in-place mutation of pipeline options breaking cache key #3115

Open
majiayu000 wants to merge 1 commit into docling-project:main from majiayu000:fix/issue-3109-pipeline-cache-key-mismatch

@majiayu000

Issue resolved by this Pull Request:
Resolves #3109

Summary

StandardPdfPipeline._init_models() was mutating pipeline_options.code_formula_options in-place (overwriting extract_code and extract_formulas), which changed the hash computed by _get_pipeline_options_hash(). This caused a cache miss when convert() called _get_pipeline() after initialize_pipeline(), resulting in unnecessary pipeline re-initialization.

Root cause

  1. CodeFormulaVlmOptions.extract_code defaults to True, extract_formulas defaults to True
  2. PdfPipelineOptions.do_code_enrichment defaults to False, do_formula_enrichment defaults to False
  3. _init_models() mutated the shared options: code_formula_opts.extract_code = self.pipeline_options.do_code_enrichment (False)
  4. This mutation changed the hash of pipeline_options, so the cached pipeline was stored under hash A but looked up under hash B
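The four steps above can be reproduced in isolation. The following is a minimal sketch using only the standard library; the dataclasses `CodeFormulaOptions` and `PipelineOptions` and the helper `options_hash()` are hypothetical stand-ins for docling's Pydantic models and `_get_pipeline_options_hash()`, whose actual hashing scheme may differ.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CodeFormulaOptions:  # hypothetical stand-in for CodeFormulaVlmOptions
    extract_code: bool = True
    extract_formulas: bool = True


@dataclass
class PipelineOptions:  # hypothetical stand-in for PdfPipelineOptions
    do_code_enrichment: bool = False
    do_formula_enrichment: bool = False
    code_formula_options: CodeFormulaOptions = field(
        default_factory=CodeFormulaOptions
    )


def options_hash(options: PipelineOptions) -> str:
    """Hash the serialized options (assumed analogue of the real hash helper)."""
    payload = json.dumps(asdict(options), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


options = PipelineOptions()
hash_before = options_hash(options)  # "hash A": key used when caching

# The buggy pattern: overwrite fields on the shared nested options object.
options.code_formula_options.extract_code = options.do_code_enrichment
options.code_formula_options.extract_formulas = options.do_formula_enrichment

hash_after = options_hash(options)  # "hash B": key used on the next lookup
print(hash_before == hash_after)  # False
```

Because the nested options are part of the serialized payload, flipping `extract_code` from True to False is enough to change the key.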

Fix

Use model_copy(update=...) to create a local copy of the options with the updated values, instead of mutating the original pipeline_options object. This keeps the hash stable across calls.
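As an illustration of the copy-instead-of-mutate approach, here is a standard-library sketch in which `dataclasses.replace()` plays the role of Pydantic's `model_copy(update=...)`. The class names and `options_hash()` are hypothetical stand-ins, not the code in `standard_pdf_pipeline.py`.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field, replace


@dataclass
class CodeFormulaOptions:  # hypothetical stand-in for CodeFormulaVlmOptions
    extract_code: bool = True
    extract_formulas: bool = True


@dataclass
class PipelineOptions:  # hypothetical stand-in for PdfPipelineOptions
    do_code_enrichment: bool = False
    do_formula_enrichment: bool = False
    code_formula_options: CodeFormulaOptions = field(
        default_factory=CodeFormulaOptions
    )


def options_hash(options: PipelineOptions) -> str:
    """Hash the serialized options (assumed analogue of the real hash helper)."""
    payload = json.dumps(asdict(options), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


options = PipelineOptions()
hash_before = options_hash(options)

# Build a local copy carrying the overridden values; the shared object is
# left untouched. With Pydantic this would be model_copy(update={...}).
local_opts = replace(
    options.code_formula_options,
    extract_code=options.do_code_enrichment,
    extract_formulas=options.do_formula_enrichment,
)

hash_after = options_hash(options)
print(hash_before == hash_after)  # True
print(local_opts.extract_code, options.code_formula_options.extract_code)  # False True
```

The model receives `local_opts`, while the caller's `options` object, and therefore its hash, stays as it was.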

Changes

  • docling/pipeline/standard_pdf_pipeline.py: Replace in-place mutation with model_copy()
  • tests/test_options.py: Add regression test verifying pipeline cache reuse after initialize_pipeline()
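The shape of such a regression check can be sketched with a toy cache. Everything below (`Converter`, `Pipeline`, `init_count`, the `mutate_in_place` flag) is hypothetical and only mimics the described behaviour; it is not the test added in `tests/test_options.py`.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field, replace


@dataclass
class CodeFormulaOptions:  # hypothetical stand-in
    extract_code: bool = True


@dataclass
class PipelineOptions:  # hypothetical stand-in
    do_code_enrichment: bool = False
    code_formula_options: CodeFormulaOptions = field(
        default_factory=CodeFormulaOptions
    )


def options_hash(options: PipelineOptions) -> str:
    payload = json.dumps(asdict(options), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Pipeline:
    def __init__(self, options: PipelineOptions, mutate_in_place: bool):
        if mutate_in_place:
            # Old behaviour: write through to the shared options object.
            options.code_formula_options.extract_code = options.do_code_enrichment
            self.code_formula_options = options.code_formula_options
        else:
            # Fixed behaviour: work on a local copy.
            self.code_formula_options = replace(
                options.code_formula_options,
                extract_code=options.do_code_enrichment,
            )


class Converter:
    def __init__(self, options: PipelineOptions, mutate_in_place: bool = False):
        self.options = options
        self.mutate_in_place = mutate_in_place
        self.cache: dict[str, Pipeline] = {}
        self.init_count = 0  # number of pipeline constructions

    def _get_pipeline(self) -> Pipeline:
        key = options_hash(self.options)  # key is computed before construction
        if key not in self.cache:
            self.init_count += 1
            self.cache[key] = Pipeline(self.options, self.mutate_in_place)
        return self.cache[key]

    def initialize_pipeline(self) -> None:
        self._get_pipeline()

    def convert(self) -> Pipeline:
        return self._get_pipeline()


fixed = Converter(PipelineOptions())
fixed.initialize_pipeline()
fixed.convert()

buggy = Converter(PipelineOptions(), mutate_in_place=True)
buggy.initialize_pipeline()
buggy.convert()

print(fixed.init_count, buggy.init_count)  # 1 2
```

Asserting that the construction count stays at one after `initialize_pipeline()` followed by `convert()` is the property a cache-reuse regression test needs to pin down.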

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

Use model_copy() instead of mutating code_formula_options in-place
during StandardPdfPipeline._init_models(). The in-place mutation
changed the hash of pipeline_options, causing initialize_pipeline()
to cache under one key while convert() computed a different key,
resulting in unnecessary pipeline re-initialization.

Fixes docling-project#3109

Signed-off-by: majiayu000 <1835304752@qq.com>
@mergify

mergify bot commented Mar 12, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
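To see what this rule accepts, the pattern can be tried locally. This sketch assumes the `~=` condition behaves like a regular-expression search against the title; the helper `is_conventional()` and every sample title except the first are made up for illustration.

```python
import re

# Pattern copied verbatim from the merge-protection rule above.
TITLE_PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return TITLE_PATTERN.search(title) is not None


titles = [
    "fix: avoid in-place mutation of pipeline options breaking cache key",
    "feat(cache)!: change pipeline cache key",  # scope and breaking marker
    "Fix cache key mismatch",  # capitalized type, no colon
    "bugfix: cache key mismatch",  # type not in the allowed list
]

results = [is_conventional(t) for t in titles]
print(results)  # [True, True, False, False]
```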

@github-actions
Contributor

DCO Check Passed

Thanks @majiayu000, all your commits are properly signed off. 🎉

@dolfim-ibm dolfim-ibm requested a review from cau-git March 13, 2026 07:24
@codecov

codecov bot commented Mar 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@PeterStaar-IBM
Member

@M-Hassan-Raza is this ready for review?

@M-Hassan-Raza
Contributor

> @M-Hassan-Raza is this ready for review?

You might have the wrong tag there Peter 😅 Perhaps @majiayu000 can answer this better!

@majiayu000 majiayu000 marked this pull request as ready for review March 14, 2026 17:37

Development

Successfully merging this pull request may close these issues.

initialize_pipeline should have extract_code and extract_formulas default value as False

3 participants