AIP-103: Adding periodic task state garbage collection and retention support#66463
amoghrajesh wants to merge 10 commits into
Conversation
jason810496
left a comment
Would it be better to introduce batching / pagination for the task state garbage collection?
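One way such batching could look, sketched against SQLite so it is self-contained (`task_state` and `expires_at` follow this PR, but `delete_batched` and the paging-by-primary-key approach are assumptions for illustration, not the PR's actual code):

```python
import sqlite3


def delete_batched(conn: sqlite3.Connection, predicate_sql: str,
                   params: tuple = (), batch_size: int = 1000) -> int:
    """Delete matching task_state rows in fixed-size batches.

    Select a page of primary keys, delete exactly those rows, commit,
    repeat — so a large backlog never turns into one giant DELETE that
    holds locks for the whole run.
    """
    total = 0
    while True:
        ids = [row[0] for row in conn.execute(
            f"SELECT id FROM task_state WHERE {predicate_sql} LIMIT ?",
            (*params, batch_size),
        )]
        if not ids:
            break
        placeholders = ",".join("?" * len(ids))
        conn.execute(f"DELETE FROM task_state WHERE id IN ({placeholders})", ids)
        conn.commit()  # release locks between batches
        total += len(ids)
    return total
```

With the ORM, the same shape falls out of a `select(...).limit(batch_size)` followed by a keyed `delete()` per page.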
```python
break
return total

deleted = _delete_batched(TaskStateModel.expires_at < now)
```
I wonder if it’d be a good idea if the actual expiration is calculated on the fly instead. If I’m understanding correctly, this currently relies on the `expires_at` column being correctly updated whenever `updated_at` is updated (if the former is not set explicitly). This seems a bit fragile.
expires_at is set once at write time in every set() call and is never updated independently of the row — there's no dependency on updated_at being in sync with it. If you call set() again on the same key, the upsert recalculates and overwrites both updated_at and expires_at together atomically.
One legitimate edge case you may be pointing at: if a user starts with `default_retention_days = 0`, then later raises it to 30 days, those old NULL rows won't be picked up by the current `WHERE expires_at < now()` pass. We can add a second pass `WHERE expires_at IS NULL AND updated_at < now - default_retention_days` for that case. How does that sound?
What do you mean by a second pass? Where would this happen? (In abstract it sounds like a plan; it’s similar to how the next run needs to be recalculated when you change the dag schedule definition.)
A second pass would be something like `WHERE expires_at IS NULL AND updated_at < now - default_retention_days`, to catch rows that were written when `default_retention_days=0` but the config was later raised.
Something like:

```python
# Pass 1: code right now
deleted_expired = _delete_batched(TaskStateModel.expires_at < now)

# Pass 2: rows with NULL expires_at that are stale under the current global default
if default_retention_days > 0:
    cutoff = now - timedelta(days=default_retention_days)
    deleted_stale = _delete_batched(
        TaskStateModel.expires_at.is_(None) & (TaskStateModel.updated_at < cutoff)
    )
```

It would run in the same `airflow state-store cleanup` command.
But on thinking more, I do not think that it is needed. `expires_at = NULL` is an explicit signal — either `default_retention_days=0` was set, or `retention_days=0` was passed at write time. Both mean "keep this row forever." Retroactively deleting them on a config change would violate what was promised at write time.
I wonder if we should add a command to change the retention of existing TIs; I feel some users would need it, either because they didn’t know about the feature previously or want to change policy entirely. Or maybe those people can just manually delete the TIs in another way anyway?
TBH, I would hold off for now — people can run direct SQL statements if needed, and any new key(s) that get written will automatically pick up the new `default_retention_days` configured. If there is clear demand for it we can add something like `airflow state-store set-expiry` when needed, but it feels premature before we know how common that use case is.
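For illustration, a one-off of the kind mentioned above might look like this (table and column names follow this PR; `retrofit_expiry`, the 30-day window, and the SQLite `datetime()` call are hypothetical, and the date arithmetic would need translating for the Postgres/MySQL metastore dialects):

```python
import sqlite3

# Hypothetical one-off: retroactively apply a 30-day policy to rows
# written while default_retention_days was 0 (expires_at IS NULL).
RETROFIT_SQL = """
UPDATE task_state
SET expires_at = datetime(updated_at, '+30 days')
WHERE expires_at IS NULL
"""


def retrofit_expiry(conn: sqlite3.Connection) -> int:
    """Run the retrofit statement; returns the number of rows updated."""
    cur = conn.execute(RETROFIT_SQL)
    conn.commit()
    return cur.rowcount
```

Rows that already carry an explicit `expires_at` are left untouched, which keeps the write-time promise intact.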
Co-authored-by: Wei Lee <weilee.rx@gmail.com>
| f" Dag {dag_id!r}, run {run_id!r}, task {task_id!r}, map_index {map_index!r}, key {key!r}" | ||
| ) | ||
| else: | ||
| print("Custom backend configured — cannot preview rows.") |
Or should we make `summary_dry_run` part of the base state backend? If it's not implemented, then we should print this message.
Good thought, making `dry_run_summary()` part of `BaseStateBackend` would clean up the `isinstance` check and give custom server-side backends a hook to implement their own preview. That said, the current design is intentional: the return format (`{"expired": list}` of DB rows) is specific to `MetastoreStateBackend`'s storage model. A Redis or S3 backend would likely have nothing meaningful to report here, or a completely different representation.
I'd keep it scoped to `MetastoreStateBackend` for now and track it on a need-to-do basis, because I'm not sure the CLI cleanup is the right way to clean up custom backends (I think not). If custom backends ever need dry-run support, we can design the interface with their semantics in mind rather than retrofitting the DB row format.
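If that interface were ever wanted, a minimal sketch of the base-class hook could look like this (class and method names are assumptions based on the discussion above, not the PR's actual code):

```python
from __future__ import annotations

from abc import ABC


class BaseStateBackend(ABC):
    """Sketch: a dry-run preview hook with a no-op default, so only
    backends that can enumerate expiring entries need to opt in."""

    def dry_run_summary(self) -> dict | None:
        # Default: this backend cannot preview what cleanup would remove.
        return None


class MetastoreStateBackend(BaseStateBackend):
    """The DB-backed backend opts in with its row-based format."""

    def dry_run_summary(self) -> dict | None:
        return {"expired": self._find_expired_rows()}

    def _find_expired_rows(self) -> list:
        return []  # stand-in for the real expiry query
```

A Redis or S3 backend would simply inherit the `None` default, and the CLI could print the "no preview available" message whenever the hook returns `None`.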
If that's the case, we'd better emphasize this (only support MetastoreStateBackend) in the doc, help text, etc.
Cool, what do you think about this: I renamed the command to `cleanup-task-states` and updated the description to explicitly say "Only applies when `MetastoreStateBackend` is configured; custom backends are skipped." Also moved the backend check to a top-level early return so the limitation is enforced before any dry-run logic.
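A rough sketch of that early-return shape, with stand-in backend classes rather than the real Airflow ones:

```python
# Stand-in backend classes; the real ones live in Airflow's state backend code.
class BaseStateBackend:
    def cleanup(self) -> int:
        return 0


class MetastoreStateBackend(BaseStateBackend):
    def cleanup(self) -> int:
        return 42  # stand-in for "rows deleted"


def cleanup_task_states(backend, dry_run: bool = False) -> int:
    # Top-level early return: the MetastoreStateBackend-only limitation
    # is enforced before any dry-run or deletion logic runs.
    if not isinstance(backend, MetastoreStateBackend):
        print("Custom backend configured — cannot preview rows.")
        return 0
    if dry_run:
        print("Dry run: no rows deleted.")
        return 0
    return backend.cleanup()
```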
| f" Dag {dag_id!r}, run {run_id!r}, task {task_id!r}, map_index {map_index!r}, key {key!r}" | ||
| ) | ||
| else: | ||
| print("Custom backend configured — cannot preview rows.") |
There was a problem hiding this comment.
If that's the case, we'd better emphasize this (only support MetastoreStateBackend) in the doc, help text, etc.
```python
),
GroupCommand(
    name="state-store",
    help="Manage task and asset state storage",
```
This is not true based on what we have now: we don't do asset state management. We can create an issue, or add a command that prints the message but does not yet do anything (raise `NotImplementedError`, maybe?) to ensure that we do not forget, and even if we forget it still kinda makes sense.
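A placeholder along the lines suggested could be as small as this (the function name and message are hypothetical):

```python
def cleanup_asset_states(args=None):
    """Reserved command: asset state retention is not implemented yet.

    Raising instead of silently succeeding keeps the CLI honest and
    makes the gap impossible to forget.
    """
    raise NotImplementedError("asset state cleanup is not implemented yet")
```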
Renamed the command to `cleanup-task-states` so it's scoped correctly from the start. The group help is now accurate since we only have task state management. Asset state retention can be added as a separate `cleanup-asset-states` command when we get there.
let's also add a test for the other store to ensure the message is printed and it's a no-op
Added `test_custom_backend_is_skipped`, which mocks a `StateBackend`, asserts the "Custom backend configured" message is printed, and verifies `cleanup()` is never called.
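For reference, a self-contained sketch of such a test (the CLI handler and backend class are inlined here as stand-ins; the real test would import them from the Airflow codebase):

```python
from unittest import mock


class MetastoreStateBackend:
    """Stand-in for the real metastore backend class."""


def cleanup_task_states(backend):
    # Inlined stand-in for the CLI handler under test.
    if not isinstance(backend, MetastoreStateBackend):
        print("Custom backend configured — cannot preview rows.")
        return
    backend.cleanup()


def test_custom_backend_is_skipped():
    backend = mock.Mock(spec=["cleanup"])  # any non-metastore backend
    with mock.patch("builtins.print") as mock_print:
        cleanup_task_states(backend)
    # The skip message is printed, and cleanup() is never invoked.
    assert "Custom backend configured" in mock_print.call_args[0][0]
    backend.cleanup.assert_not_called()
```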
closes: #66459
What?
Task state rows live as long as their parent DAG run. In deployments that don't run `airflow db cleanup` — or where task state should expire sooner than the DAG run — rows accumulate indefinitely. This PR adds an explicit retention mechanism independent of DAG run cleanup. To perform effective cleanup, the following is needed:
- `task_state` rows older than N days
- the `asset_active` entry is deleted, but `asset_state` rows stay behind silently

Proposed change
- `expires_at` column on `task_state` - `updated_at` alone can't distinguish a 7 day key from a 30 day key. NULL means fall back to the global `default_retention_days`; set means delete after this timestamp regardless of `updated_at`. Setting `default_retention_days = 0` disables time-based cleanup entirely (`expires_at` cleanup still runs).
- `BaseStateBackend.cleanup()` no-op default — custom backends override this to implement their own retention policy. The backend reads `[state_store] default_retention_days` from config itself since the AIP says "the backend is responsible for enforcing the retention policy."
- `[state_store]`: `default_retention_days = 30` (task_state only — does not affect asset_state) and `clear_on_success = False`.
- `MetastoreStateBackend.cleanup()` runs two passes for task_state: rows past the `updated_at + default_retention_days` cutoff, and rows with `expires_at < now()`.
- `airflow state-store cleanup` CLI command — calls `get_state_backend().cleanup()`. Operators schedule this via cron or a maintenance DAG. Supports `--dry-run`.
- `_update_asset_orphanage()` — runs in the same pass as asset deregistration, which is when the orphans are created. This is the right home since it is an internal consistency operation, not a user-facing data lifecycle decision.

Why a CLI command instead of the scheduler?
Running cleanup as a periodic scheduler task was considered, but it raises performance concerns for the scheduler, since cleanup doesn't come without a time cost.
A dedicated CLI command keeps the separation clean: schedule it where it makes sense for a deployment.
User implications / backcompat
New config options under `[state_store]` with safe defaults — no action needed to maintain existing behaviour. The `expires_at` column is nullable; existing rows get `NULL` (global default retention applies).
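For reference, the new options might look like this in `airflow.cfg` (option names and values are taken from this PR description; the exact defaults shipped may differ):

```ini
[state_store]
# Global retention for task_state rows; 0 disables time-based cleanup
# (per-key expires_at cleanup still runs). Does not affect asset_state.
default_retention_days = 30
# Follow-up feature (#66460): clear task state when the TI succeeds.
clear_on_success = False
```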
Test setup
Ran a dag with a single task instance and pushed 3 task states for it.
Global Retention test
Run this query:
Run the state store cleanup:
What's next
- `clear_on_success` hook: Clear task state on TI success #66460
- `task_state.set(retention_days=N)` API to populate `expires_at` at write time: Add ability for per-task state key retention at operator level #66461

Was generative AI tooling used to co-author this PR?
Add `{pr_number}.significant.rst` in `airflow-core/newsfragments`. You can add this file in a follow-up commit after the PR is created so you know the PR number.