Fix DBDagBag returning stale SerializedDAG after in-place version update#65834
Fix DBDagBag returning stale SerializedDAG after in-place version update#65834alvinttang wants to merge 1 commit into
Conversation
…ates in-place SerializedDagModel.write_dag updates the serialized DAG in-place under the same dag_version_id when the version has no associated task instances (added in apache#45524). Long-lived DBDagBag instances such as the scheduler's self.scheduler_dag_bag cache deserialized SerializedDAG objects keyed only by dag_version_id, with no staleness check. Once an in-place update happens, the scheduler keeps returning the stale cached DAG until the process is restarted - newly added tasks are marked "removed" on every scheduling tick, and removed tasks keep getting scheduled. Cache the dag_hash alongside the deserialized DAG and re-check it against the DB on every cache hit via a single-column lookup. On hash mismatch, drop the cache entry and reload the full row. The extra query is a tiny indexed lookup on the unique dag_version_id, far cheaper than the previously skipped JSON deserialization on a true cache hit. Closes: apache#65696
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide
|
|
@alvinttang Converting to draft — this PR doesn't yet meet our Pull Request quality criteria.
See the linked criteria for how to fix each item, then mark the PR "Ready for review". This is not a rejection — just an invitation to bring the PR up to standard. No rush. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. |
|
@alvinttang This draft PR has been inactive for 13 days since the last triage comment and no response from the author. Closing to keep the queue clean. You are welcome to reopen this PR when you resume work, or to open a new one addressing the issues previously raised. There is no rush — take your time. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting |
Summary
Fixes #65696.
DBDagBag._dagsis an unbounded process-lived dict keyed bydag_version_id.SerializedDagModel.write_dag(introduced in #45524) takes a fast path that does an in-placeUPDATE serialized_dag SET data=…, dag_hash=…under the samedag_version_idwhenever the existing version has no associated task instances. After such an update the cached UUID still resolves to the oldSerializedDAG, so the scheduler keeps marking newly added tasks asremovedand keeps scheduling deleted tasks until the process is restarted.Fix
airflow-core/src/airflow/models/dagbag.py: cache(SerializedDAG, dag_hash)tuples instead of bare DAGs. On every cache lookup, do a cheapSELECT dag_hash FROM serialized_dag WHERE dag_version_id = ?and compare. Hash match → return cached. Mismatch → pop and fall through to fresh load. Also fixed the post-DB double-checked locking branch the same way. ~35 LOC of production change.Test
airflow-core/tests/unit/models/test_dagbag.py::TestDBDagBag::test_get_dag_invalidates_cache_when_dag_hash_changes_in_place— RED before patch, GREEN after. Updated 3 pre-existing tests for the new tuple cache shape.pytest tests/unit/models/test_dagbag.py→ 22/22 pass.ruff checkclean on both files.Risk notes
SELECT dag_hashper cache hit on a unique-indexed column. Cheaper than the deserialization it preserves on hits and cheaper than the existing full-row load it short-circuits on misses._dagswere updated. Other call sites use the full Mapping API.get_serialized_dag_model()(separate path, untouched). The API server usescache_size/cache_ttland now also benefits from staleness checks.Refs #65696