Record the writer info for every asset store write for better cross linkage by amoghrajesh · Pull Request #67902 · apache/airflow

amoghrajesh · 2026-06-02T14:12:26Z

Was generative AI tooling used to co-author this PR?

Yes: claude sonnet 4.6

related to UI PR: #67292

What problem are we solving?

Asset store entries can be written by any task, watcher, or admin via the core API, but there is no way to know who made the last write. This makes it impossible to link a stored value back to the run that produced it, which is a gap for UI attribution and auditability.

Current behaviour

The asset_store table has no record of the writing actor. All writes are anonymous: the value and timestamp are stored, but there is no way to trace which task instance, watcher, or API call was responsible.

How this helps

Debugging: when a watermark or checkpoint value looks wrong, you can immediately see which DAG run wrote it without digging through logs or correlating timestamps manually.
Auditability: external tooling (governance, lineage, monitoring) can consume the last_updated_by fields to build a provenance trail from stored value back to the producing run.
UI linkage: the dag_id, run_id, task_id, and map_index fields are exactly what the Grid view needs to cross-link from an asset store entry to the task instance that wrote it.

Proposed change

Adds five flat writer columns to asset_store:

last_updated_by_kind
last_updated_by_dag_id
last_updated_by_run_id
last_updated_by_task_id
last_updated_by_map_index

Introduces AssetStoreWriterKind (task, watcher, api) with a validate_writer_fields method that enforces per-kind contracts: task requires all four task fields to be set; watcher and api require them all to be null.

| `kind` | `dag_id` | `run_id` | `task_id` | `map_index` |
|---|---|---|---|---|
| `task` | set | set | set | set |
| `watcher` | `null` | `null` | `null` | `null` |
| `api` | `null` | `null` | `null` | `null` |

A match/case _: raise AssertionError guard ensures any future new kind is handled explicitly.

The execution API PUT endpoints extract writer fields from the task instance and record them with kind=task. The core API PUT records kind=api. Both GET endpoints return a last_updated_by block including kind, dag_id, run_id, task_id, and map_index — null task fields for non-task writes.

Design decisions worth flagging

Why flat denormalized columns instead of a FK to task_instance: watchers (BaseEventTrigger) write asset store entries but have no task instance, so a FK cannot be the universal reference. Flat columns also survive task instance cleanup; a FK would be cleared on delete, losing the provenance trail.

Why set_asset_store lives on MetastoreStoreBackend, not BaseStoreBackend:
recording writer kind and task fields is a database centric concern. Adding these parameters to the base interface would force every alternative backend (Redis, S3, etc.) to implement a concept that has no meaning for them. Instead, MetastoreStoreBackend gets dedicated set_asset_store/aset_asset_store methods, and the API routes dispatch to them via isinstance check, falling back to the generic set() for other backends.
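
A rough sketch of that dispatch, using simplified stand-in classes. Only the class names, set(), and set_asset_store come from the description; the constructors, parameters, and the write_entry helper are hypothetical and exist only to show the isinstance fallback.

```python
from __future__ import annotations

from typing import Any


class BaseStoreBackend:
    """Stand-in for the generic interface: it has no notion of a writer."""

    def __init__(self) -> None:
        self.data: dict[str, Any] = {}

    def set(self, key: str, value: Any) -> None:
        self.data[key] = value


class MetastoreStoreBackend(BaseStoreBackend):
    """Stand-in for the database-backed store, which can also record the writer."""

    def __init__(self) -> None:
        super().__init__()
        self.writers: dict[str, dict[str, Any]] = {}

    def set_asset_store(
        self,
        key: str,
        value: Any,
        *,
        kind: str,
        writer_fields: dict[str, Any] | None = None,
    ) -> None:
        self.set(key, value)
        self.writers[key] = {"kind": kind, **(writer_fields or {})}


def write_entry(
    backend: BaseStoreBackend,
    key: str,
    value: Any,
    *,
    kind: str,
    writer_fields: dict[str, Any] | None = None,
) -> None:
    """Hypothetical route helper: writer-aware path only for the metastore backend."""
    if isinstance(backend, MetastoreStoreBackend):
        backend.set_asset_store(key, value, kind=kind, writer_fields=writer_fields)
    else:
        # Other backends (Redis, S3, ...) never see writer parameters.
        backend.set(key, value)
```

The trade-off is that attribution is silently unavailable on non-metastore backends, which is the intended outcome here rather than an error.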

Testing

This is how the response looks for each kind:

kind: task

{
    "asset_store": [
        {
            "key": "last_run_summary",
            "value": {
                "rows_loaded": 668,
                "prev_watermark": "2026-01-01T00:00:00+00:00",
                "completed_at": "2026-06-03T05:59:35.512421+00:00"
            },
            "updated_at": "2026-06-03T05:59:35.546524Z",
            "last_updated_by": {
                "kind": "task",
                "dag_id": "example_asset_store_producer",
                "run_id": "manual__2026-06-03T05:59:31.611729+00:00",
                "task_id": "load",
                "map_index": -1
            }
        },
        {
            "key": "total_runs",
            "value": 1,
            "updated_at": "2026-06-03T05:59:35.536352Z",
            "last_updated_by": {
                "kind": "task",
                "dag_id": "example_asset_store_producer",
                "run_id": "manual__2026-06-03T05:59:31.611729+00:00",
                "task_id": "load",
                "map_index": -1
            }
        },
        {
            "key": "watermark",
            "value": "2026-06-03T05:59:35.512421+00:00",
            "updated_at": "2026-06-03T05:59:35.518394Z",
            "last_updated_by": {
                "kind": "task",
                "dag_id": "example_asset_store_producer",
                "run_id": "manual__2026-06-03T05:59:31.611729+00:00",
                "task_id": "load",
                "map_index": -1
            }
        }
    ],
    "total_entries": 3
}

kind: api

{
    "asset_store": [
        {
            "key": "dict-value",
            "value": {
                "example-key": "example-value"
            },
            "updated_at": "2026-06-03T06:03:27.434455Z",
            "last_updated_by": {
                "kind": "api",
                "dag_id": null,
                "run_id": null,
                "task_id": null,
                "map_index": null
            }
        },
        {
            "key": "int-value",
            "value": 7,
            "updated_at": "2026-06-03T06:03:12.783128Z",
            "last_updated_by": {
                "kind": "api",
                "dag_id": null,
                "run_id": null,
                "task_id": null,
                "map_index": null
            }
        },
        {
            "key": "some-value",
            "value": "2026-05-01T00:00:00Z",
            "updated_at": "2026-06-03T06:03:03.147371Z",
            "last_updated_by": {
                "kind": "api",
                "dag_id": null,
                "run_id": null,
                "task_id": null,
                "map_index": null
            }
        }
    ],
    "total_entries": 3
}

What's next

Worker-side AssetStoreAccessor SDK class and the watcher write path in BaseEventTrigger are follow-up PRs in the AIP-103 series which are being tracked separately: #67839

jroachgolf84

What's going to happen when AssetStore is updated from a BaseEventTrigger? Is that going to cause an issue (as there is no dag_id, task_id, etc.)?

amoghrajesh · 2026-06-02T14:26:52Z

cc @kaxil I think I need your input on this one

kaxil · 2026-06-02T22:16:47Z

@jroachgolf84 good question, but I don't think a BaseEventTrigger writes the asset store at all, so this case doesn't really arise.

The asset store holds producer-side state: a task's incremental watermark/cursor and the AIP-103 checkpointing state. That's authored by a task, which always has a TI. A watcher doesn't produce that. Its job is to detect an external event and fire an AssetEvent, and it delegates cursor/durability to the source it watches (SQS visibility, PubSub/Redis consumer offsets, Kafka offsets). Anything it carries with an event goes in the AssetEvent payload, not the store.

So the writers are tasks (via the Execution API, recording ti_id) and the Core API PUT (an admin write with no task, recorded as NULL). The column is nullable with FK ON DELETE SET NULL, and the read routes outer-join task_instance and return last_updated_by: null, so that admin path is handled and isn't an error.

Nothing writes the store from a trigger today afaik. If we ever do write asset state from the triggerer, it'd be from deferral triggers (a task that deferred), and those keep their task instance attached (trigger.task_instance), so they record a real ti_id just like the worker path. Watchers are the one trigger type with no TI, and also the one with no reason to write the store. So last_updated_by_ti_id as a TI FK is the right shape across all of it.

jroachgolf84 · 2026-06-02T22:20:21Z

Using the Asset Store in a BaseEventTrigger is actually a core use-case here. This is needed to allow for watermarking in event-driven scheduling (which was a driving factor when drafting this AIP).

jroachgolf84 · 2026-06-02T22:21:40Z

Check out this PR - it's quite pertinent: #67839

jroachgolf84 · 2026-06-02T22:25:29Z

IMO, we should keep Asset Store such that it is Task unaware. Thoughts?

Not being able to persist state from a BaseEventTrigger is one of the biggest blockers to creating Triggers for Asset watching (and one of the primary blockers for widespread community adoption).

kaxil · 2026-06-02T22:25:50Z

+        ti1 = create_task_instance(task_id="task1")
+        ti2 = create_task_instance(task_id="task2")


This test fails on every backend in CI (it's why the run is red), not flaky:

sqlalchemy.exc.IntegrityError: UNIQUE constraint failed: dag_run.dag_id, dag_run.run_id

create_task_instance defaults to dag_id="dag" with a default run_id and creates a fresh dag_run each call, so these two calls collide on the dag_run unique key. Only task_id differs, which doesn't help since the collision is on the dag_run, not the TI. Distinct dag runs fix it:

ti1 = create_task_instance(task_id="task1", dag_id="dag1")
ti2 = create_task_instance(task_id="task2", dag_id="dag2")

(distinct run_id works too). The test only needs two distinct TI ids, so separate dag runs satisfy the intent.
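
The collision can be reproduced outside Airflow with a toy table that has the same unique key. This is only an illustration using the standard-library sqlite3 module: create_dag_run is a hypothetical stand-in for the dag_run row the fixture creates, and the default run_id value is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dag_run ("
    " dag_id TEXT NOT NULL,"
    " run_id TEXT NOT NULL,"
    " UNIQUE (dag_id, run_id))"
)


def create_dag_run(dag_id: str = "dag", run_id: str = "test") -> None:
    """Hypothetical stand-in for the dag_run insert done by the fixture."""
    conn.execute("INSERT INTO dag_run (dag_id, run_id) VALUES (?, ?)", (dag_id, run_id))


# Two calls with the same defaults hit the unique key, whatever the task_id is.
create_dag_run()
try:
    create_dag_run()
    collided = False
except sqlite3.IntegrityError:
    collided = True

# Distinct dag_id values give two separate dag runs, so both inserts succeed.
create_dag_run(dag_id="dag1")
create_dag_run(dag_id="dag2")
row_count = conn.execute("SELECT COUNT(*) FROM dag_run").fetchone()[0]
```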

kaxil · 2026-06-02T22:27:32Z

Aah, hmm, let me check that PR and get back!

jroachgolf84 · 2026-06-03T00:33:08Z

Sounds good, thanks!

kaxil · 2026-06-03T01:31:08Z

@jroachgolf84 I read your PR and the AIP use-case you pointed to, and I agree it's a valid use-case. So you're right, especially the S3 use-case and watermarking. I'm also on board with keeping the store task-unaware.

The one thing I'd flag so we don't over-correct: tasks do write the Asset Store too (a producer maintaining an asset's watermark across runs), and for those entries it's genuinely useful to be able to jump to which run set a value, e.g. when a watermark looks wrong. So I'd keep a per-entry "where was this written", just not as a task-only field.

The problem with last_updated_by_ti_id isn't that it records a writer, it's that it can only record a task writer, so watcher watermarks (the headline case) come back NULL. The user-facing value is really a link to where the value was set: the run's logs for a task write, the triggerer logs for a watcher write, rather than a raw owner shown in a column. So if we record anything it should be writer-agnostic (a kind: task / watcher / api) so we can build the right link for each, instead of a task-only field that can't point anywhere for a watcher. Ownership itself stays where it belongs via the scope, asset-state on the asset and task-state on the TI, which already matches the UI layout.

On storage, the asset store is long-lived (a watermark outlives the runs that touch it), so for the task case I'd lean to plain dag_id/run_id/task_id/map_index over a task_instance FK: an FK with ON DELETE SET NULL loses the link target the moment the run is cleaned up, while the plain fields still build it. That part's an implementation detail though, the user-facing thing is the link.
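
To make the "link per kind" idea concrete, here is a small sketch of how a consumer might turn a last_updated_by block into a link target. The LinkTarget type, the where labels, and the function name are all hypothetical; no real Airflow URL or UI route is implied.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class LinkTarget:
    """Hypothetical description of where a written value can be traced to."""

    where: str  # "task_logs" or "triggerer_logs" (illustrative labels)
    dag_id: str | None = None
    run_id: str | None = None
    task_id: str | None = None
    map_index: int | None = None


def link_target(last_updated_by: dict[str, Any] | None) -> LinkTarget | None:
    """Pick a link target from the writer kind; None means nothing to link to."""
    if last_updated_by is None:
        return None
    kind = last_updated_by.get("kind")
    if kind == "task":
        # Plain fields are enough to address the run, even if the TI row is gone.
        return LinkTarget(
            where="task_logs",
            dag_id=last_updated_by["dag_id"],
            run_id=last_updated_by["run_id"],
            task_id=last_updated_by["task_id"],
            map_index=last_updated_by["map_index"],
        )
    if kind == "watcher":
        return LinkTarget(where="triggerer_logs")
    if kind == "api":
        return None
    raise ValueError(f"unknown writer kind: {kind!r}")
```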

jroachgolf84 · 2026-06-03T01:58:35Z

Thanks for taking a look at that - I'm on board with your comments here.

amoghrajesh · 2026-06-03T06:06:45Z

Great conversation folks, I redid this as flat denormalized columns instead of a FK to task_instance and added a writer kind too.

Now this is how it looks, recording these fields (last_updated_by_kind/dag_id/run_id/task_id/map_index) and an AssetStoreWriterKind enum (task/watcher/api). Main reasons:

Watchers (BaseEventTrigger) have no task instance, so a FK can't be the universal reference
Flat columns survive task instance cleanup; a FK would lose the history on delete

| `kind` | `dag_id` | `run_id` | `task_id` | `map_index` |
|---|---|---|---|---|
| `task` | set | set | set | set |
| `watcher` | `null` | `null` | `null` | `null` |
| `api` | `null` | `null` | `null` | `null` |

Validation is enforced at write time via AssetStoreWriterKind.validate_writer_fields().

kaxil · 2026-06-03T11:36:13Z

            ["asset_id", "key"],
            values,
-            dict(value=value, updated_at=now),
+            dict(value=value, updated_at=now, **writer_info),


The upsert always lists the last_updated_by_* columns in the update set, so a plain set() on an AssetScope (which reaches _set_asset_store with kind=None) overwrites previously recorded writer info back to NULL on conflict. No in-tree Metastore caller hits that path today since both routes go through set_asset_store, but the follow-up watcher write (#67839) and the worker-side AssetStoreAccessor also write asset state, and if either uses set() instead of set_asset_store(), attribution gets silently cleared with no error. Consider dropping the writer columns from the update set when kind is None so an attribution-less write leaves the existing values untouched.

Fixed. When kind is None (plain set() call), writer columns are now excluded from the ON CONFLICT UPDATE set and only value and updated_at are updated on conflict, leaving any existing writer info untouched. Writer columns are still included in the INSERT values (so new rows get NULL writer fields as expected). The writer-aware path (set_asset_store / aset_asset_store) continues to include writer columns in both the insert and update sets.

Handled in 7250b42
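
The behaviour described in this fix can be sketched with a toy upsert against SQLite. This is an assumption-heavy illustration, not the PR's SQLAlchemy code: the table layout is reduced to the columns named in this thread, and the upsert helper and its parameters are hypothetical.

```python
import sqlite3

WRITER_COLUMNS = (
    "last_updated_by_kind",
    "last_updated_by_dag_id",
    "last_updated_by_run_id",
    "last_updated_by_task_id",
    "last_updated_by_map_index",
)

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE asset_store ('
    ' "asset_id" INTEGER NOT NULL, "key" TEXT NOT NULL,'
    ' "value" TEXT, "updated_at" TEXT,'
    ' "last_updated_by_kind" TEXT, "last_updated_by_dag_id" TEXT,'
    ' "last_updated_by_run_id" TEXT, "last_updated_by_task_id" TEXT,'
    ' "last_updated_by_map_index" INTEGER,'
    ' PRIMARY KEY ("asset_id", "key"))'
)


def upsert(asset_id, key, value, now, writer_info=None):
    """Insert or update one entry; writer columns are only overwritten when a kind is given."""
    writer_info = writer_info or {}
    values = {"asset_id": asset_id, "key": key, "value": value, "updated_at": now}
    # New rows always get the writer columns (NULL when no writer is supplied).
    values.update({col: writer_info.get(col) for col in WRITER_COLUMNS})

    update_cols = ["value", "updated_at"]
    if writer_info.get("last_updated_by_kind") is not None:
        # Writer-aware path: refresh attribution on conflict as well.
        update_cols.extend(WRITER_COLUMNS)

    cols = ", ".join(f'"{c}"' for c in values)
    params = ", ".join(f":{c}" for c in values)
    updates = ", ".join(f'"{c}" = excluded."{c}"' for c in update_cols)
    conn.execute(
        f"INSERT INTO asset_store ({cols}) VALUES ({params}) "
        f'ON CONFLICT ("asset_id", "key") DO UPDATE SET {updates}',
        values,
    )


def read(asset_id, key):
    return conn.execute(
        'SELECT "value", "last_updated_by_kind", "last_updated_by_dag_id" '
        'FROM asset_store WHERE "asset_id" = ? AND "key" = ?',
        (asset_id, key),
    ).fetchone()
```

With this shape, a task write followed by an attribution-less write changes the value but leaves the recorded writer in place, while a later write with kind=api replaces the writer columns (clearing the task fields).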

kaxil · 2026-06-03T11:36:13Z

+    ).one_or_none()
+    if row is None:
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,


A missing TI here returns 500, but the sibling execution-API routes in task_instances.py return 404 {"reason": "not_found", ...} for the same stale-TI condition (e.g. the TI row was cleaned up between token mint and this PUT). A 500 reads as a server bug and can trigger client retries / 5xx alerts. Returning 404 to match the convention would be more accurate.

Fixed, thanks!

Handled in 7250b42

kaxil · 2026-06-03T11:36:14Z

    )
+
+    @property
+    def last_updated_by_kind_enum(self) -> AssetStoreWriterKind | None:


Is last_updated_by_kind_enum used anywhere? The routes build the response straight off the last_updated_by_kind string column, so this looks unused in the PR. If a UI/API consumer is coming, fine to keep; otherwise it's dead code.

Removed it. Had added it speculatively but the routes and UI consume last_updated_by_kind as a plain string via the JSON response, so the ORM property has no use here.

Handled in 7250b423f3

amoghrajesh · 2026-06-03T17:13:40Z

@kaxil would you mind taking another look when you can pls?

Record the task instance that wrote each asset store entry

9f13f3f

amoghrajesh requested review from XD-DENG, ashb, bbovenzi, bugraoz93, choo121600, dheerajturaga, ephraimbuddy, guan404ming, jason810496, kaxil, pierrejeambrun, potiuk, rawwar, ryanahamilton, shubhamraj-git and vatsrahul1001 as code owners June 2, 2026 14:12

boring-cyborg Bot added the area:airflow-ctl, area:API, area:db-migrations, area:task-sdk, area:UI, and backport-to-airflow-ctl/v0-1-test labels Jun 2, 2026

amoghrajesh self-assigned this Jun 2, 2026

amoghrajesh added this to the Airflow 3.3.0 milestone Jun 2, 2026

amoghrajesh added this to AIP-103: Task State Management Jun 2, 2026

github-project-automation Bot moved this to Backlog in AIP-103: Task State Management Jun 2, 2026

amoghrajesh requested a review from jroachgolf84 June 2, 2026 14:13

jroachgolf84 suggested changes Jun 2, 2026

kaxil reviewed Jun 2, 2026

amoghrajesh changed the title ~~Record the task instance that wrote each asset store entry~~ Record the writer info for every asset store write for better cross linkage Jun 3, 2026

amoghrajesh added 2 commits June 3, 2026 11:41

Record the task instance that wrote each asset store entry

68a7f4e

Merge branch 'main' into aip-103-add-ti-reference-to-asset-store

864230b

kaxil reviewed Jun 3, 2026

jroachgolf84 approved these changes Jun 3, 2026

review comments from kaxil

7250b42

amoghrajesh requested a review from kaxil June 3, 2026 17:13

kaxil approved these changes Jun 3, 2026
