Fix DBDagBag returning stale SerializedDAG after in-place version update #65834

Closed
alvinttang wants to merge 1 commit into apache:main from alvinttang:fix/dbdagbag-stale-cache-on-inplace-update

@alvinttang

Summary

Fixes #65696. DBDagBag._dags is an unbounded process-lived dict keyed by dag_version_id. SerializedDagModel.write_dag (introduced in #45524) takes a fast path that does an in-place UPDATE serialized_dag SET data=…, dag_hash=… under the same dag_version_id whenever the existing version has no associated task instances. After such an update the cached UUID still resolves to the old SerializedDAG, so the scheduler keeps marking newly added tasks as removed and keeps scheduling deleted tasks until the process is restarted.
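
The failure mode described above can be reproduced with a toy model. This is a minimal sketch, not Airflow code: `db`, `cache`, `write_dag_in_place`, and `get_dag` are hypothetical stand-ins for the serialized_dag table, DBDagBag._dags, the in-place fast path of write_dag, and the cache lookup.

```python
# Hypothetical stand-ins; none of these names are Airflow's real classes.
db = {}     # dag_version_id -> row; plays the role of the serialized_dag table
cache = {}  # dag_version_id -> deserialized DAG; plays the role of DBDagBag._dags


def write_dag_in_place(version_id, task_ids, dag_hash):
    """Overwrite data and dag_hash while keeping the same dag_version_id."""
    db[version_id] = {"data": list(task_ids), "dag_hash": dag_hash}


def get_dag(version_id):
    """Cache lookup keyed only by dag_version_id, with no staleness check."""
    if version_id in cache:
        return cache[version_id]
    dag = tuple(db[version_id]["data"])  # stands in for deserialization
    cache[version_id] = dag
    return dag


write_dag_in_place("v1", ["extract"], "hash-a")
first = get_dag("v1")                                    # loaded and cached
write_dag_in_place("v1", ["extract", "load"], "hash-b")  # in-place update
stale = get_dag("v1")                                    # still the old DAG
```

In this sketch `stale` lacks the newly added `load` task even though the row in `db` already contains it, which mirrors the scheduler behavior reported in #65696.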

Fix

airflow-core/src/airflow/models/dagbag.py: cache (SerializedDAG, dag_hash) tuples instead of bare DAGs. On every cache lookup, do a cheap SELECT dag_hash FROM serialized_dag WHERE dag_version_id = ? and compare. Hash match → return cached. Mismatch → pop and fall through to fresh load. Also fixed the post-DB double-checked locking branch the same way. ~35 LOC of production change.
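
The lookup logic described above might look roughly like the following. This is a sketch under stated assumptions rather than the actual patch: `HashCheckedCache`, `fetch_hash`, and `load_row` are hypothetical names, the "table" is a plain dict, and locking is left out.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class HashCheckedCache:
    """Hypothetical cache that revalidates each hit against the stored dag_hash."""

    def __init__(
        self,
        fetch_hash: Callable[[str], Optional[str]],
        load_row: Callable[[str], Optional[Tuple[Any, str]]],
    ) -> None:
        self._fetch_hash = fetch_hash  # cheap: returns only the current dag_hash
        self._load_row = load_row      # expensive: returns (dag, dag_hash) or None
        self._entries: Dict[str, Tuple[Any, str]] = {}

    def get(self, version_id: str) -> Optional[Any]:
        cached = self._entries.get(version_id)
        if cached is not None:
            dag, cached_hash = cached
            if self._fetch_hash(version_id) == cached_hash:
                return dag  # hash match: the cached object is still current
            self._entries.pop(version_id, None)  # mismatch: drop the stale entry
        loaded = self._load_row(version_id)  # fall through to a fresh load
        if loaded is None:
            return None
        self._entries[version_id] = loaded
        return loaded[0]


# Toy "table" (an assumption): dag_version_id -> (task ids, dag_hash)
rows = {"v1": (("extract",), "hash-a")}
full_loads = []  # records every expensive full-row load


def fetch_hash(version_id):
    row = rows.get(version_id)
    return row[1] if row else None


def load_row(version_id):
    full_loads.append(version_id)
    return rows.get(version_id)


bag = HashCheckedCache(fetch_hash, load_row)
first = bag.get("v1")                         # miss: full load
again = bag.get("v1")                         # hit: hash check only
rows["v1"] = (("extract", "load"), "hash-b")  # in-place update, same id
fresh = bag.get("v1")                         # mismatch: reload
```

Storing the hash next to the DAG keeps the revalidation to a single comparison per hit, so the full load runs only on a first lookup or after the stored hash changes.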

Test

airflow-core/tests/unit/models/test_dagbag.py::TestDBDagBag::test_get_dag_invalidates_cache_when_dag_hash_changes_in_place — RED before patch, GREEN after. Updated 3 pre-existing tests for the new tuple cache shape.

pytest tests/unit/models/test_dagbag.py → 22/22 pass. ruff check clean on both files.

Risk notes

  • One extra single-column SELECT dag_hash per cache hit, filtered on the unique-indexed dag_version_id column. The query costs far less than the deserialization that a cache hit avoids, and less than the full-row load performed on a miss.
  • Tuple cache value is an internal change. Three tests that introspected _dags were updated. Other call sites use the full Mapping API.
  • The triggerer uses get_serialized_dag_model() (separate path, untouched). The API server uses cache_size / cache_ttl and now also benefits from staleness checks.
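
For a sense of what the per-hit query involves, here is a sketch using an in-memory SQLite database. The three-column table is a simplified assumption, not Airflow's real serialized_dag schema, and `current_hash` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in schema; UNIQUE makes SQLite create an index on the column.
conn.execute(
    "CREATE TABLE serialized_dag ("
    " dag_version_id TEXT NOT NULL UNIQUE,"
    " dag_hash TEXT NOT NULL,"
    " data TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO serialized_dag VALUES (?, ?, ?)",
    ("v1", "hash-a", '{"tasks": ["extract"]}'),
)


def current_hash(conn, version_id):
    """Fetch only dag_hash for one version; returns None if the row is missing."""
    row = conn.execute(
        "SELECT dag_hash FROM serialized_dag WHERE dag_version_id = ?",
        (version_id,),
    ).fetchone()
    return row[0] if row else None


before = current_hash(conn, "v1")
# In-place update under the same dag_version_id.
conn.execute(
    "UPDATE serialized_dag SET data = ?, dag_hash = ? WHERE dag_version_id = ?",
    ('{"tasks": ["extract", "load"]}', "hash-b", "v1"),
)
after = current_hash(conn, "v1")
```

The lookup never reads the `data` column, so its cost does not grow with the size of the serialized DAG.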

Refs #65696

…ates in-place

SerializedDagModel.write_dag updates the serialized DAG in-place under the
same dag_version_id when the version has no associated task instances (added
in apache#45524). Long-lived DBDagBag instances such as the scheduler's
self.scheduler_dag_bag cache deserialized SerializedDAG objects keyed only by
dag_version_id, with no staleness check. Once an in-place update happens, the
scheduler keeps returning the stale cached DAG until the process is restarted
- newly added tasks are marked "removed" on every scheduling tick, and removed
tasks keep getting scheduled.

Cache the dag_hash alongside the deserialized DAG and re-check it against the
DB on every cache hit via a single-column lookup. On hash mismatch, drop the
cache entry and reload the full row. The extra query is a tiny indexed lookup
on the unique dag_version_id, far cheaper than the previously skipped JSON
deserialization on a true cache hit.

Closes: apache#65696
@alvinttang alvinttang requested review from XD-DENG and ashb as code owners April 25, 2026 14:54
@boring-cyborg

boring-cyborg Bot commented Apr 25, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally. It is a heavy Docker image, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

@potiuk potiuk marked this pull request as draft May 5, 2026 20:35
@potiuk
Member

potiuk commented May 5, 2026

@alvinttang Converting to draft — this PR doesn't yet meet our Pull Request quality criteria.

  • Merge conflicts — your branch has conflicts with main. See docs.
  • Unit tests — Failing: Postgres tests: core / DB-core:Postgres:14:3.10:Core...Serialization, MySQL tests: core / DB-core:MySQL:8.0:3.10:Core...Serialization. See docs.

See the linked criteria for how to fix each item, then mark the PR "Ready for review". This is not a rejection — just an invitation to bring the PR up to standard. No rush.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

@potiuk
Member

potiuk commented May 19, 2026

@alvinttang This draft PR has been inactive for 13 days since the last triage comment, with no response from the author. Closing to keep the queue clean.

You are welcome to reopen this PR when you resume work, or to open a new one addressing the issues previously raised. There is no rush — take your time.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.


Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

@potiuk potiuk closed this May 19, 2026

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Scheduler returns stale SerializedDAG when DAG version is updated in-place (no task instances)

2 participants