
Record the writer info for every asset store write for better cross linkage#67902

Open
amoghrajesh wants to merge 4 commits into
apache:mainfrom
astronomer:aip-103-add-ti-reference-to-asset-store

Conversation

@amoghrajesh
Contributor

@amoghrajesh amoghrajesh commented Jun 2, 2026


Was generative AI tooling used to co-author this PR?
  • Yes: Claude Sonnet 4.6

related to UI PR: #67292

What problem are we solving?

Asset store entries can be written by any task, watcher, or admin via core API, but there is no way to know who made the last write. This makes it impossible to link a stored value back to the run that produced it, a gap for UI attribution and auditability.

Current behaviour

The asset_store table has no record of the writing actor. All writes are anonymous: the value and timestamp are stored, but there is no way to trace which task instance, watcher, or API call was responsible.

How this helps

  • Debugging: when a watermark or checkpoint value looks wrong, you can immediately see which DAG run wrote it without digging through logs or correlating timestamps manually.
  • Auditability: external tooling (governance, lineage, monitoring) can consume the last_updated_by fields to build a provenance trail from stored value back to the producing run.
  • UI linkage: the dag_id, run_id, task_id, and map_index fields are exactly what the Grid view needs to cross-link from an asset store entry to the task instance that wrote it.

Proposed change

Adds five flat writer columns to asset_store:

  • last_updated_by_kind
  • last_updated_by_dag_id
  • last_updated_by_run_id
  • last_updated_by_task_id
  • last_updated_by_map_index

Introduces AssetStoreWriterKind (task, watcher, api) with a validate_writer_fields method that enforces per-kind contracts: task requires all four task fields to be set; watcher and api require them all to be null.

kind     dag_id  run_id  task_id  map_index
task     set     set     set      set
watcher  null    null    null     null
api      null    null    null     null

The validation uses match/case with a final case _: raise AssertionError guard, so any kind added in the future must be handled explicitly.

The execution API PUT endpoints extract writer fields from the task instance and record them with kind=task. The core API PUT records kind=api. Both GET endpoints return a last_updated_by block including kind, dag_id, run_id, task_id, and map_index — null task fields for non-task writes.

Design decisions worth flagging

Why flat denormalized columns instead of an FK to task_instance: watchers (BaseEventTrigger) write asset store entries but have no task instance, so an FK cannot be the universal reference. Flat columns also survive task instance cleanup, whereas an FK would be cleared on delete, losing the provenance trail.

Why set_asset_store lives on MetastoreStoreBackend, not BaseStoreBackend: recording writer kind and task fields is a database-centric concern. Adding these parameters to the base interface would force every alternative backend (Redis, S3, etc.) to implement a concept that has no meaning for them. Instead, MetastoreStoreBackend gets dedicated set_asset_store/aset_asset_store methods, and the API routes dispatch to them via an isinstance check, falling back to the generic set() for other backends.
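
A minimal sketch of that dispatch, for illustration. The two classes below are simplified stand-ins that only share names with the real backends; the method signatures, the in-memory dicts, and the write_from_route helper are all hypothetical.

```python
from __future__ import annotations

from typing import Any


class BaseStoreBackend:
    """Generic backend: knows nothing about writers (stand-in for the real class)."""

    def __init__(self) -> None:
        self.data: dict[str, Any] = {}

    def set(self, key: str, value: Any) -> None:
        self.data[key] = value


class MetastoreStoreBackend(BaseStoreBackend):
    """Database backend: additionally records who wrote the value."""

    def __init__(self) -> None:
        super().__init__()
        self.writers: dict[str, dict[str, Any]] = {}

    def set_asset_store(self, key: str, value: Any, *, kind: str, **task_fields: Any) -> None:
        self.data[key] = value
        self.writers[key] = {"kind": kind, **task_fields}


def write_from_route(
    backend: BaseStoreBackend, key: str, value: Any, *, kind: str, **task_fields: Any
) -> None:
    """Route-level dispatch: use the writer-aware method only when the backend has one."""
    if isinstance(backend, MetastoreStoreBackend):
        backend.set_asset_store(key, value, kind=kind, **task_fields)
    else:
        # Other backends (Redis, S3, ...) get the plain write; writer info is dropped.
        backend.set(key, value)
```

The trade-off is that the route knows about one concrete backend class, in exchange for keeping the base interface free of writer parameters.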

Testing

This is how the response looks for each kind:

  1. kind: task
{
    "asset_store": [
        {
            "key": "last_run_summary",
            "value": {
                "rows_loaded": 668,
                "prev_watermark": "2026-01-01T00:00:00+00:00",
                "completed_at": "2026-06-03T05:59:35.512421+00:00"
            },
            "updated_at": "2026-06-03T05:59:35.546524Z",
            "last_updated_by": {
                "kind": "task",
                "dag_id": "example_asset_store_producer",
                "run_id": "manual__2026-06-03T05:59:31.611729+00:00",
                "task_id": "load",
                "map_index": -1
            }
        },
        {
            "key": "total_runs",
            "value": 1,
            "updated_at": "2026-06-03T05:59:35.536352Z",
            "last_updated_by": {
                "kind": "task",
                "dag_id": "example_asset_store_producer",
                "run_id": "manual__2026-06-03T05:59:31.611729+00:00",
                "task_id": "load",
                "map_index": -1
            }
        },
        {
            "key": "watermark",
            "value": "2026-06-03T05:59:35.512421+00:00",
            "updated_at": "2026-06-03T05:59:35.518394Z",
            "last_updated_by": {
                "kind": "task",
                "dag_id": "example_asset_store_producer",
                "run_id": "manual__2026-06-03T05:59:31.611729+00:00",
                "task_id": "load",
                "map_index": -1
            }
        }
    ],
    "total_entries": 3
}
  2. kind: api
{
    "asset_store": [
        {
            "key": "dict-value",
            "value": {
                "example-key": "example-value"
            },
            "updated_at": "2026-06-03T06:03:27.434455Z",
            "last_updated_by": {
                "kind": "api",
                "dag_id": null,
                "run_id": null,
                "task_id": null,
                "map_index": null
            }
        },
        {
            "key": "int-value",
            "value": 7,
            "updated_at": "2026-06-03T06:03:12.783128Z",
            "last_updated_by": {
                "kind": "api",
                "dag_id": null,
                "run_id": null,
                "task_id": null,
                "map_index": null
            }
        },
        {
            "key": "some-value",
            "value": "2026-05-01T00:00:00Z",
            "updated_at": "2026-06-03T06:03:03.147371Z",
            "last_updated_by": {
                "kind": "api",
                "dag_id": null,
                "run_id": null,
                "task_id": null,
                "map_index": null
            }
        }
    ],
    "total_entries": 3
}
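
For illustration, the last_updated_by block in both responses could be assembled from the five flat columns roughly like this. AssetStoreRow is a hypothetical stand-in for the ORM row, not the model in this PR, and the sketch does not cover how rows without any recorded kind are rendered.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class AssetStoreRow:
    """Hypothetical stand-in for an asset_store row; only the writer columns are modelled."""

    last_updated_by_kind: str | None = None
    last_updated_by_dag_id: str | None = None
    last_updated_by_run_id: str | None = None
    last_updated_by_task_id: str | None = None
    last_updated_by_map_index: int | None = None


def last_updated_by_block(row: AssetStoreRow) -> dict[str, Any]:
    """Strip the column prefix to get the keys used in the JSON response."""
    prefix = "last_updated_by_"
    return {name.removeprefix(prefix): value for name, value in vars(row).items()}
```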

What's next

Worker-side AssetStoreAccessor SDK class and the watcher write path in BaseEventTrigger are follow-up PRs in the AIP-103 series which are being tracked separately: #67839



Collaborator

@jroachgolf84 jroachgolf84 left a comment

What's going to happen when AssetStore is updated from a BaseEventTrigger? Is that going to cause an issue (as there is no dag_id, task_id, etc.)?

@amoghrajesh
Contributor Author

cc @kaxil I think I need your input on this one

@kaxil
Member

kaxil commented Jun 2, 2026

@jroachgolf84 good question, but I don't think a BaseEventTrigger writes the asset store at all, so this case doesn't really arise.

The asset store holds producer-side state: a task's incremental watermark/cursor and the AIP-103 checkpointing state. That's authored by a task, which always has a TI. A watcher doesn't produce that. Its job is to detect an external event and fire an AssetEvent, and it delegates cursor/durability to the source it watches (SQS visibility, PubSub/Redis consumer offsets, Kafka offsets). Anything it carries with an event goes in the AssetEvent payload, not the store.

So the writers are tasks (via the Execution API, recording ti_id) and the Core API PUT (an admin write with no task, recorded as NULL). The column is nullable with FK ON DELETE SET NULL, and the read routes outer-join task_instance and return last_updated_by: null, so that admin path is handled and isn't an error.

Nothing writes the store from a trigger today afaik. If we ever do write asset state from the triggerer, it'd be from deferral triggers (a task that deferred), and those keep their task instance attached (trigger.task_instance), so they record a real ti_id just like the worker path. Watchers are the one trigger type with no TI, and also the one with no reason to write the store. So last_updated_by_ti_id as a TI FK is the right shape across all of it.

@jroachgolf84
Collaborator

Using the Asset Store in a BaseEventTrigger is actually a core use-case here. This is needed to allow for watermarking in event-driven scheduling (which was a driving factor when drafting this AIP).

@jroachgolf84
Collaborator

Check out this PR - it's quite pertinent: #67839

@jroachgolf84
Collaborator

jroachgolf84 commented Jun 2, 2026

IMO, we should keep Asset Store such that it is Task unaware. Thoughts?

Not being able to persist state from a BaseEventTrigger is one of the biggest blockers to creating Triggers for Asset watching (and one of the primary blockers for widespread community adoption).

Comment on lines +448 to +449
ti1 = create_task_instance(task_id="task1")
ti2 = create_task_instance(task_id="task2")
Member

This test fails on every backend in CI (it's why the run is red), not flaky:

sqlalchemy.exc.IntegrityError: UNIQUE constraint failed: dag_run.dag_id, dag_run.run_id

create_task_instance defaults to dag_id="dag" with a default run_id and creates a fresh dag_run each call, so these two calls collide on the dag_run unique key. Only task_id differs, which doesn't help since the collision is on the dag_run, not the TI. Distinct dag runs fix it:

ti1 = create_task_instance(task_id="task1", dag_id="dag1")
ti2 = create_task_instance(task_id="task2", dag_id="dag2")

(distinct run_id works too). The test only needs two distinct TI ids, so separate dag runs satisfy the intent.

@kaxil
Member

kaxil commented Jun 2, 2026

Check out this PR - it's quite pertinent: #67839

Using the Asset Store in a BaseEventTrigger is actually a core use-case here. This is needed to allow for watermarking in event-driven scheduling (which was a driving factor when drafting this AIP).

Aah, hmm, let me check that PR and get back!

@jroachgolf84
Collaborator

Sounds good, thanks!

@kaxil
Member

kaxil commented Jun 3, 2026

@jroachgolf84 I read your PR and the AIP use-case you pointed to, and I agree it's a valid use-case. So you're right, especially the S3 use-case and watermarking. I'm also on board with keeping the store task-unaware.

The one thing I'd flag so we don't over-correct: tasks do write the Asset Store too (a producer maintaining an asset's watermark across runs), and for those entries it's genuinely useful to be able to jump to which run set a value, e.g. when a watermark looks wrong. So I'd keep a per-entry "where was this written", just not as a task-only field.

The problem with last_updated_by_ti_id isn't that it records a writer, it's that it can only record a task writer, so watcher watermarks (the headline case) come back NULL. The user-facing value is really a link to where the value was set: the run's logs for a task write, the triggerer logs for a watcher write, rather than a raw owner shown in a column. So if we record anything it should be writer-agnostic (a kind: task / watcher / api) so we can build the right link for each, instead of a task-only field that can't point anywhere for a watcher. Ownership itself stays where it belongs via the scope, asset-state on the asset and task-state on the TI, which already matches the UI layout.

On storage, the asset store is long-lived (a watermark outlives the runs that touch it), so for the task case I'd lean to plain dag_id/run_id/task_id/map_index over a task_instance FK: an FK with ON DELETE SET NULL loses the link target the moment the run is cleaned up, while the plain fields still build it. That part's an implementation detail though, the user-facing thing is the link.

@jroachgolf84
Collaborator

Thanks for taking a look at that - I'm on board with your comments here.

@amoghrajesh amoghrajesh changed the title from "Record the task instance that wrote each asset store entry" to "Record the writer info for every asset store write for better cross linkage" Jun 3, 2026
@amoghrajesh
Contributor Author

Great conversation, folks. I redid this as flat denormalized columns instead of an FK to task_instance and added a writer kind as well.

It now records these fields (last_updated_by_kind/dag_id/run_id/task_id/map_index) along with an AssetStoreWriterKind enum (task/watcher/api). Main reasons:

  • Watchers (BaseEventTrigger) have no task instance, so an FK can't be the universal reference
  • Flat columns survive task instance cleanup; an FK would lose the history on delete

kind     dag_id  run_id  task_id  map_index
task     set     set     set      set
watcher  null    null    null     null
api      null    null    null     null

Validation is enforced at write time via AssetStoreWriterKind.validate_writer_fields().

  ["asset_id", "key"],
  values,
- dict(value=value, updated_at=now),
+ dict(value=value, updated_at=now, **writer_info),
Member

The upsert always lists the last_updated_by_* columns in the update set, so a plain set() on an AssetScope (which reaches _set_asset_store with kind=None) overwrites previously recorded writer info back to NULL on conflict. No in-tree Metastore caller hits that path today since both routes go through set_asset_store, but the follow-up watcher write (#67839) and the worker-side AssetStoreAccessor also write asset state, and if either uses set() instead of set_asset_store(), attribution gets silently cleared with no error. Consider dropping the writer columns from the update set when kind is None so an attribution-less write leaves the existing values untouched.

Contributor Author

Fixed. When kind is None (plain set() call), writer columns are now excluded from the ON CONFLICT UPDATE set and only value and updated_at are updated on conflict, leaving any existing writer info untouched. Writer columns are still included in the INSERT values (so new rows get NULL writer fields as expected). The writer-aware path (set_asset_store / aset_asset_store) continues to include writer columns in both the insert and update sets.
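
To illustrate the behaviour, here is a standalone sketch using sqlite3 from the standard library. It is not the SQLAlchemy code in this PR: the upsert helper and the trimmed-down table are hypothetical, only two of the five writer columns are modelled, and updated_at is omitted for brevity.

```python
import sqlite3

WRITER_COLUMNS = ("last_updated_by_kind", "last_updated_by_dag_id")  # subset, for brevity


def upsert(conn, asset_id, key, value, writer=None):
    """Insert or update one entry; touch writer columns on conflict only if writer is given."""
    row = {"asset_id": asset_id, "key": key, "value": value}
    for col in WRITER_COLUMNS:
        row[col] = (writer or {}).get(col)  # new rows get NULL writer fields
    update_cols = ["value"]
    if writer is not None:
        update_cols += WRITER_COLUMNS  # writer-aware path overwrites attribution
    columns = ", ".join(row)
    placeholders = ", ".join(f":{c}" for c in row)
    updates = ", ".join(f"{c} = excluded.{c}" for c in update_cols)
    conn.execute(
        f"INSERT INTO asset_store ({columns}) VALUES ({placeholders}) "
        f"ON CONFLICT (asset_id, key) DO UPDATE SET {updates}",
        row,
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE asset_store ("
    "asset_id INTEGER, key TEXT, value TEXT, "
    "last_updated_by_kind TEXT, last_updated_by_dag_id TEXT, "
    "PRIMARY KEY (asset_id, key))"
)
writer = {"last_updated_by_kind": "task", "last_updated_by_dag_id": "producer"}
upsert(conn, 1, "watermark", "a", writer)  # attributed write
upsert(conn, 1, "watermark", "b")  # plain write: value changes, attribution is kept
```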

Contributor Author

Handled in 7250b42

).one_or_none()
if row is None:
    raise HTTPException(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
Member

A missing TI here returns 500, but the sibling execution-API routes in task_instances.py return 404 {"reason": "not_found", ...} for the same stale-TI condition (e.g. the TI row was cleaned up between token mint and this PUT). A 500 reads as a server bug and can trigger client retries / 5xx alerts. Returning 404 to match the convention would be more accurate.

Contributor Author

Fixed, thanks!

Contributor Author

Handled in 7250b42

)

@property
def last_updated_by_kind_enum(self) -> AssetStoreWriterKind | None:
Member

Is last_updated_by_kind_enum used anywhere? The routes build the response straight off the last_updated_by_kind string column, so this looks unused in the PR. If a UI/API consumer is coming, fine to keep; otherwise it's dead code.

Contributor Author

Removed it. Had added it speculatively but the routes and UI consume last_updated_by_kind as a plain string via the JSON response, so the ORM property has no use here.

Contributor Author

Handled in 7250b423f3

@amoghrajesh amoghrajesh requested a review from kaxil June 3, 2026 17:13
@amoghrajesh
Contributor Author

@kaxil would you mind taking another look when you can pls?


Labels

area:airflow-ctl, area:API, area:db-migrations, area:task-sdk, area:UI, backport-to-airflow-ctl/v0-1-test

Projects

Status: Backlog


3 participants