AIP-103: Adding periodic task state garbage collection and retention support#66463
amoghrajesh wants to merge 10 commits into
Conversation
jason810496
left a comment
Would it be better to introduce batching / pagination for the task state garbage collection?
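One way such batching could look, sketched against SQLite so it is self-contained (`task_state` and `expires_at` follow this PR, but `delete_batched` and the paging-by-primary-key approach are assumptions for illustration, not the PR's actual code):

```python
import sqlite3


def delete_batched(conn: sqlite3.Connection, predicate_sql: str,
                   params: tuple = (), batch_size: int = 1000) -> int:
    """Delete matching task_state rows in fixed-size batches.

    Select a page of primary keys, delete exactly those rows, commit,
    repeat — so a large backlog never turns into one giant DELETE that
    holds locks for the whole run.
    """
    total = 0
    while True:
        ids = [row[0] for row in conn.execute(
            f"SELECT id FROM task_state WHERE {predicate_sql} LIMIT ?",
            (*params, batch_size),
        )]
        if not ids:
            break
        placeholders = ",".join("?" * len(ids))
        conn.execute(f"DELETE FROM task_state WHERE id IN ({placeholders})", ids)
        conn.commit()  # release locks between batches
        total += len(ids)
    return total
```

With the ORM, the same shape falls out of a `select(...).limit(batch_size)` followed by a keyed `delete()` per page.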
```python
break
return total

deleted = _delete_batched(TaskStateModel.expires_at < now)
```
I wonder if it’d be a good idea if the actual expiration is calculated on the fly instead. If I’m understanding correctly, this currently relies on the `expires_at` column being correctly updated whenever `updated_at` is updated (if the former is not set explicitly). This seems a bit fragile.
expires_at is set once at write time in every set() call and is never updated independently of the row — there's no dependency on updated_at being in sync with it. If you call set() again on the same key, the upsert recalculates and overwrites both updated_at and expires_at together atomically.
One legitimate edge case you may be pointing at: if a user starts with `default_retention_days = 0`, then later raises it to 30 days, those old NULL rows won't be picked up by the current `WHERE expires_at < now()` pass. We can add a second pass `WHERE expires_at IS NULL AND updated_at < now - default_retention_days` for that case. How does that sound?
What do you mean by a second pass? Where would this happen? (In abstract it sounds like a plan; it’s similar to how the next run needs to be recalculated when you change the dag schedule definition.)
A second pass would be something like `WHERE expires_at IS NULL AND updated_at < now - default_retention_days`, to catch rows that were written when `default_retention_days=0` but the config was later raised.
Something like:

```python
# Pass 1: code right now
deleted_expired = _delete_batched(TaskStateModel.expires_at < now)

# Pass 2: rows with NULL expires_at that are stale under the current global default
if default_retention_days > 0:
    cutoff = now - timedelta(days=default_retention_days)
    deleted_stale = _delete_batched(
        TaskStateModel.expires_at.is_(None) & (TaskStateModel.updated_at < cutoff)
    )
```

It would run in the same `airflow state-store cleanup` command.
But on thinking more, I do not think that it is needed. `expires_at = NULL` is an explicit signal — either `default_retention_days=0` was set, or `retention_days=0` was passed at write time. Both mean "keep this row forever." Retroactively deleting them on a config change would violate what was promised at write time.
I wonder if we should add a command to change the retention of existing TIs; I feel some users would need it, either because they didn’t know about the feature previously or want to change policy entirely. Or maybe those people can just manually delete the TIs in another way anyway?
TBH, I would hold off for now — people can run direct SQL statements if needed, and any new key(s) that get written will automatically pick up the new `default_retention_days` configured. If there is clear demand for it we can add something like `airflow state-store set-expiry` when needed, but it feels premature before we know how common that use case is.
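For illustration, a one-off of the kind mentioned above might look like this (table and column names follow this PR; `retrofit_expiry`, the 30-day window, and the SQLite `datetime()` call are hypothetical, and the date arithmetic would need translating for the Postgres/MySQL metastore dialects):

```python
import sqlite3

# Hypothetical one-off: retroactively apply a 30-day policy to rows
# written while default_retention_days was 0 (expires_at IS NULL).
RETROFIT_SQL = """
UPDATE task_state
SET expires_at = datetime(updated_at, '+30 days')
WHERE expires_at IS NULL
"""


def retrofit_expiry(conn: sqlite3.Connection) -> int:
    """Run the retrofit statement; returns the number of rows updated."""
    cur = conn.execute(RETROFIT_SQL)
    conn.commit()
    return cur.rowcount
```

Rows that already carry an explicit `expires_at` are left untouched, which keeps the write-time promise intact.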
Co-authored-by: Wei Lee <weilee.rx@gmail.com>
| f" Dag {dag_id!r}, run {run_id!r}, task {task_id!r}, map_index {map_index!r}, key {key!r}" | ||
| ) | ||
| else: | ||
| print("Custom backend configured — cannot preview rows.") |
Or should we make `summary_dry_run` part of the base state backend? If it's not implemented, then we should print this message.
Good thought, making `dry_run_summary()` part of `BaseStateBackend` would clean up the `isinstance` check and give custom server-side backends a hook to implement their own preview. That said, the current design is intentional: the return format (`{"expired": list}` of DB rows) is specific to `MetastoreStateBackend`'s storage model. A Redis or S3 backend would likely have nothing meaningful to report here, or a completely different representation.
I'd keep it scoped to `MetastoreStateBackend` for now and track it on a need-to-do basis, because I'm not sure the CLI cleanup is the right way to clean up custom backends (I think not). If custom backends ever need dry-run support, we can design the interface with their semantics in mind rather than retrofitting the DB row format.
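If that interface were ever wanted, a minimal sketch of the base-class hook could look like this (class and method names are assumptions based on the discussion above, not the PR's actual code):

```python
from __future__ import annotations

from abc import ABC


class BaseStateBackend(ABC):
    """Sketch: a dry-run preview hook with a no-op default, so only
    backends that can enumerate expiring entries need to opt in."""

    def dry_run_summary(self) -> dict | None:
        # Default: this backend cannot preview what cleanup would remove.
        return None


class MetastoreStateBackend(BaseStateBackend):
    """The DB-backed backend opts in with its row-based format."""

    def dry_run_summary(self) -> dict | None:
        return {"expired": self._find_expired_rows()}

    def _find_expired_rows(self) -> list:
        return []  # stand-in for the real expiry query
```

A Redis or S3 backend would simply inherit the `None` default, and the CLI could print the "no preview available" message whenever the hook returns `None`.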
If that's the case, we'd better emphasize this (only support MetastoreStateBackend) in the doc, help text, etc.
Cool, what do you think about this: I renamed the command to `cleanup-task-states` and updated the description to explicitly say "Only applies when `MetastoreStateBackend` is configured; custom backends are skipped." Also moved the backend check to a top-level early return so the limitation is enforced before any dry-run logic.
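A rough sketch of that early-return shape, with stand-in backend classes rather than the real Airflow ones:

```python
# Stand-in backend classes; the real ones live in Airflow's state backend code.
class BaseStateBackend:
    def cleanup(self) -> int:
        return 0


class MetastoreStateBackend(BaseStateBackend):
    def cleanup(self) -> int:
        return 42  # stand-in for "rows deleted"


def cleanup_task_states(backend, dry_run: bool = False) -> int:
    # Top-level early return: the MetastoreStateBackend-only limitation
    # is enforced before any dry-run or deletion logic runs.
    if not isinstance(backend, MetastoreStateBackend):
        print("Custom backend configured — cannot preview rows.")
        return 0
    if dry_run:
        print("Dry run: no rows deleted.")
        return 0
    return backend.cleanup()
```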
| f" Dag {dag_id!r}, run {run_id!r}, task {task_id!r}, map_index {map_index!r}, key {key!r}" | ||
| ) | ||
| else: | ||
| print("Custom backend configured — cannot preview rows.") |
There was a problem hiding this comment.
If that's the case, we'd better emphasize this (only support MetastoreStateBackend) in the doc, help text, etc.
```python
),
GroupCommand(
    name="state-store",
    help="Manage task and asset state storage",
```
This is not true based on what we have now: we don't do asset state management. We can create an issue, or add a command that prints the message but does not yet do anything (raise `NotImplementedError`, maybe?) to ensure that we do not forget, and even if we forget it still kinda makes sense.
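A placeholder along the lines suggested could be as small as this (the function name and message are hypothetical):

```python
def cleanup_asset_states(args=None):
    """Reserved command: asset state retention is not implemented yet.

    Raising instead of silently succeeding keeps the CLI honest and
    makes the gap impossible to forget.
    """
    raise NotImplementedError("asset state cleanup is not implemented yet")
```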
Renamed the command to `cleanup-task-states` so it's scoped correctly from the start. The group help is now accurate since we only have task state management. Asset state retention can be added as a separate `cleanup-asset-states` command when we get there.
let's also add a test for the other store to ensure the message is printed and it's a no-op
Added `test_custom_backend_is_skipped`, which mocks a `StateBackend`, asserts the "Custom backend configured" message is printed, and verifies `cleanup()` is never called.
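For reference, a self-contained sketch of such a test (the CLI handler and backend class are inlined here as stand-ins; the real test would import them from the Airflow codebase):

```python
from unittest import mock


class MetastoreStateBackend:
    """Stand-in for the real metastore backend class."""


def cleanup_task_states(backend):
    # Inlined stand-in for the CLI handler under test.
    if not isinstance(backend, MetastoreStateBackend):
        print("Custom backend configured — cannot preview rows.")
        return
    backend.cleanup()


def test_custom_backend_is_skipped():
    backend = mock.Mock(spec=["cleanup"])  # any non-metastore backend
    with mock.patch("builtins.print") as mock_print:
        cleanup_task_states(backend)
    # The skip message is printed, and cleanup() is never invoked.
    assert "Custom backend configured" in mock_print.call_args[0][0]
    backend.cleanup.assert_not_called()
```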
closes: #66459
What?
Task state rows live as long as their parent DAG run. In deployments that don't run `airflow db cleanup` — or where task state should expire sooner than the DAG run — rows accumulate indefinitely. This PR adds an explicit retention mechanism independent of DAG run cleanup. To perform effective cleanup, the following is needed:
- `task_state` rows older than N days
- the `asset_active` entry is deleted, but `asset_state` rows stay behind silently

Proposed change
- `expires_at` column on `task_state` - `updated_at` alone can't distinguish a 7 day key from a 30 day key. NULL means fall back to the global `default_retention_days`; set means delete after this timestamp regardless of `updated_at`. Setting `default_retention_days = 0` disables time-based cleanup entirely (`expires_at` cleanup still runs).
- `BaseStateBackend.cleanup()` no-op default — custom backends override this to implement their own retention policy. The backend reads `[state_store] default_retention_days` from config itself since the AIP says "the backend is responsible for enforcing the retention policy."
- `[state_store]`: `default_retention_days = 30` (task_state only — does not affect asset_state) and `clear_on_success = False`.
- `MetastoreStateBackend.cleanup()` runs two passes for task_state: rows past the `updated_at + default_retention_days` cutoff, and rows with `expires_at < now()`.
- `airflow state-store cleanup` CLI command — calls `get_state_backend().cleanup()`. Operators schedule this via cron or a maintenance DAG. Supports `--dry-run`.
- `_update_asset_orphanage()` — runs in the same pass as asset deregistration, which is when the orphans are created. This is the right home since it is an internal consistency operation, not a user-facing data lifecycle decision.

Why a CLI command instead of the scheduler?
Running cleanup as a periodic scheduler task was considered, but it raises performance concerns for the scheduler, since cleanup doesn't come without a time cost.
A dedicated CLI command keeps the separation clean: schedule it where it makes sense for a deployment.
User implications / backcompat
New config options under `[state_store]` with safe defaults — no action needed to maintain existing behaviour. The `expires_at` column is nullable; existing rows get `NULL` (global default retention applies).
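For reference, the new options might look like this in `airflow.cfg` (option names and values are taken from this PR description; the exact defaults shipped may differ):

```ini
[state_store]
# Global retention for task_state rows; 0 disables time-based cleanup
# (per-key expires_at cleanup still runs). Does not affect asset_state.
default_retention_days = 30
# Follow-up feature (#66460): clear task state when the TI succeeds.
clear_on_success = False
```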
Test setup
Ran a dag with a single task instance and pushed 3 task states for it.
Global Retention test
Run this query:
Run the state store cleanup:
What's next
- `clear_on_success` hook: Clear task state on TI success #66460
- `task_state.set(retention_days=N)` API to populate `expires_at` at write time: Add ability for per-task state key retention at operator level #66461

Was generative AI tooling used to co-author this PR?
Add `{pr_number}.significant.rst` in `airflow-core/newsfragments`. You can add this file in a follow-up commit after the PR is created so you know the PR number.