fix: avoid in-place mutation of pipeline options breaking cache key #3115
Open
majiayu000 wants to merge 1 commit into docling-project:main
Use model_copy() instead of mutating code_formula_options in-place during StandardPdfPipeline._init_models(). The in-place mutation changed the hash of pipeline_options, causing initialize_pipeline() to cache under one key while convert() computed a different key, resulting in unnecessary pipeline re-initialization.

Fixes docling-project#3109

Signed-off-by: majiayu000 <1835304752@qq.com>
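For readers who want to see the failure mode in isolation, here is a minimal sketch of it. It makes several assumptions: it uses stdlib dataclasses and dataclasses.replace() as a stand-in for the pydantic models and model_copy(update=...), and the class names, the options_hash() helper, and the SHA-256-over-JSON hashing scheme are all hypothetical, not docling's actual implementation. It only illustrates why mutating a nested options object changes a hash derived from the whole option tree, while working on a copy leaves it stable.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field, replace


@dataclass
class CodeFormulaOptions:
    # Hypothetical stand-in for the nested code/formula options.
    extract_code: bool = True
    extract_formulas: bool = True


@dataclass
class PipelineOptions:
    # Hypothetical stand-in for the top-level pipeline options.
    do_code_enrichment: bool = False
    do_formula_enrichment: bool = False
    code_formula_options: CodeFormulaOptions = field(
        default_factory=CodeFormulaOptions
    )


def options_hash(opts: PipelineOptions) -> str:
    """Assumed cache key: a hash over every field, nested ones included."""
    payload = json.dumps(asdict(opts), sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def init_models_mutating(opts: PipelineOptions) -> CodeFormulaOptions:
    # Buggy pattern: overwrites fields on the caller's shared object.
    nested = opts.code_formula_options
    nested.extract_code = opts.do_code_enrichment
    nested.extract_formulas = opts.do_formula_enrichment
    return nested


def init_models_copying(opts: PipelineOptions) -> CodeFormulaOptions:
    # Fixed pattern: derive a local copy and leave the caller's object alone.
    return replace(
        opts.code_formula_options,
        extract_code=opts.do_code_enrichment,
        extract_formulas=opts.do_formula_enrichment,
    )


def hash_changes(init_models) -> bool:
    """Return True if running init_models alters the options hash."""
    opts = PipelineOptions()
    before = options_hash(opts)
    init_models(opts)
    return options_hash(opts) != before


mutating_breaks_key = hash_changes(init_models_mutating)
copying_breaks_key = hash_changes(init_models_copying)
local_copy = init_models_copying(PipelineOptions())
```

In this sketch the mutating variant changes the key between the "store" and "lookup" steps, which is the cache miss described above, while the copying variant still hands the model the overridden values without touching the caller's options.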
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Contributor
✅ DCO Check Passed
Thanks @majiayu000, all your commits are properly signed off. 🎉
Codecov Report
✅ All modified and coverable lines are covered by tests.
Member
@M-Hassan-Raza is this ready for review?
Contributor
You might have the wrong tag there Peter 😅 Perhaps @majiayu000 can answer this better!
Issue resolved by this Pull Request:
Resolves #3109
Summary
StandardPdfPipeline._init_models() was mutating pipeline_options.code_formula_options in-place (overwriting extract_code and extract_formulas), which changed the hash computed by _get_pipeline_options_hash(). This caused a cache miss when convert() called _get_pipeline() after initialize_pipeline(), resulting in unnecessary pipeline re-initialization.

Root cause
- CodeFormulaVlmOptions.extract_code defaults to True, and extract_formulas defaults to True.
- PdfPipelineOptions.do_code_enrichment defaults to False, and do_formula_enrichment defaults to False.
- _init_models() mutated the shared options: code_formula_opts.extract_code = self.pipeline_options.do_code_enrichment (False).
- This changed the hash of pipeline_options, so the cached pipeline was stored under hash A but looked up under hash B.

Fix
Use model_copy(update=...) to create a local copy of the options with the updated values, instead of mutating the original pipeline_options object. This keeps the hash stable across calls.

Changes
- docling/pipeline/standard_pdf_pipeline.py: Replace in-place mutation with model_copy()
- tests/test_options.py: Add regression test verifying pipeline cache reuse after initialize_pipeline()

Checklist: