
Decorate custom state refs with an envelope for UI clarity#67530

Open: amoghrajesh wants to merge 5 commits into apache:main from astronomer:aip-103-backlog-decorate-external-state-refs

amoghrajesh (Contributor) commented May 26, 2026


Was generative AI tooling used to co-author this PR?
  • Yes - claude sonnet 4.6

What problem are we solving?

When a custom worker backend (e.g. S3, GCS) stores a state value externally and writes a reference string back to the DB, the UI has no way to tell whether a value like s3://bucket/ti_123/job_id is:

  • The user's actual state value (a plain string they stored), or
  • An opaque reference to externally-stored data

Without this distinction, the UI would show the raw path as if it were the value, which can be confusing and misleading.

Current behaviour

Custom backends return a reference string from serialize_task_state_to_ref(), which is stored verbatim in the DB. The DB value column contains either a plain JSON value or a reference string with no structural difference between the two — the UI cannot differentiate them.
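
To make the ambiguity concrete, here is a minimal sketch in plain Python (not Airflow code; both variable names are hypothetical) of what the value column could hold before this change:

```python
# Hypothetical values as they might sit in the DB value column before this change.
user_value = "s3://bucket/ti_123/job_id"   # a plain string the user stored as state
backend_ref = "s3://bucket/ti_123/job_id"  # a ref returned by serialize_task_state_to_ref()

# Both are ordinary JSON strings, so a reader such as the UI has nothing to branch on.
indistinguishable = type(user_value) is type(backend_ref) and user_value == backend_ref
print(indistinguishable)
```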

Proposed change

When a custom worker backend is configured, the framework now automatically wraps the reference returned by serialize_task_state_to_ref() in a typed envelope before storing:

{"__airflow_state_ref__": "s3://bucket/ti_123/job_id"}

On read, the framework detects the envelope, extracts the ref, and passes it to deserialize_task_state_from_ref(), so the backend never sees the envelope. If a stored value does not carry the marker (e.g. a corrupt row), the raw value is returned and a warning is logged.

The default path (no custom backend) is unaffected: plain JSON values are stored and returned as before.

UI Impact

UI PR for reference: #67292

The UI reads state values directly from the DB and displays them as-is. With this change, when a custom backend is in use, the UI will show {"__airflow_state_ref__": "..."} instead of a raw reference string, making it visually clear that the value is a pointer to externally-stored data rather than the actual state value.

Testing

Created a custom worker-side backend backed by the local file system:

from __future__ import annotations

import json
from pathlib import Path
from typing import TYPE_CHECKING

from airflow.sdk.state import BaseStateBackend

if TYPE_CHECKING:
    from datetime import datetime

    from pydantic import JsonValue
    from sqlalchemy.ext.asyncio import AsyncSession
    from sqlalchemy.orm import Session

    from airflow_shared.state import StateScope

BASE_DIR = Path("/tmp/airflow_state")


class FileStateBackend(BaseStateBackend):
    """Stores task/asset state values as local JSON files; returns the path as the ref."""
    def serialize_task_state_to_ref(self, *, value: JsonValue, key: str, ti_id: str) -> str:
        path = BASE_DIR / f"ti_{ti_id}" / f"{key}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(value))
        return str(path)

    def deserialize_task_state_from_ref(self, stored: str) -> JsonValue:
        return json.loads(Path(stored).read_text())

    def serialize_asset_state_to_ref(self, *, value: JsonValue, key: str, asset_ref: str) -> str:
        safe = asset_ref.replace("/", "_").replace(":", "")
        path = BASE_DIR / "assets" / safe / f"{key}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(value))
        return str(path)

    def deserialize_asset_state_from_ref(self, stored: str) -> JsonValue:
        return json.loads(Path(stored).read_text())


    def get(self, scope: StateScope, key: str, *, session: Session | None = None) -> str | None:
        raise NotImplementedError(
            "FileStateBackend is a worker-side backend; server uses MetastoreStateBackend"
        )

    def set(
        self,
        scope: StateScope,
        key: str,
        value: str,
        *,
        expires_at: datetime | None = None,
        session: Session | None = None,
    ) -> None:
        raise NotImplementedError

    def delete(self, scope: StateScope, key: str, *, session: Session | None = None) -> None:
        raise NotImplementedError

    def clear(
        self, scope: StateScope, *, all_map_indices: bool = False, session: Session | None = None
    ) -> None:
        raise NotImplementedError

    async def aget(self, scope: StateScope, key: str, *, session: AsyncSession | None = None) -> str | None:
        raise NotImplementedError

    async def aset(
        self,
        scope: StateScope,
        key: str,
        value: str,
        *,
        expires_at: datetime | None = None,
        session: AsyncSession | None = None,
    ) -> None:
        raise NotImplementedError

    async def adelete(self, scope: StateScope, key: str, *, session: AsyncSession | None = None) -> None:
        raise NotImplementedError

    async def aclear(
        self, scope: StateScope, *, all_map_indices: bool = False, session: AsyncSession | None = None
    ) -> None:
        raise NotImplementedError

Ran breeze with: export AIRFLOW__WORKERS__STATE_BACKEND=file_state_backend.FileStateBackend

DAG:

from datetime import datetime

from airflow.sdk import Context, dag, task


@dag(schedule=None, start_date=datetime(2026, 4, 23), catchup=True)
def simple_task_state():

    @task
    def my_task(**context: Context):
        task_state = context["task_state"]

        task_state.set("job_id", "12345")
        task_state.set("secret-dict", {"key": "value"})
        task_state.set("int_value", 42)

        print("Fetching task states I stored earlier")

        print("job_id:", task_state.get("job_id"), type(task_state.get("job_id")))
        print("secret-dict:", task_state.get("secret-dict"), type(task_state.get("secret-dict")))
        print("int_value:", task_state.get("int_value"), type(task_state.get("int_value")))

    my_task()

The task is agnostic to the envelope.
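
A self-contained simulation of that round trip, for readers without a breeze environment. InMemoryBackend and FakeTaskState are hypothetical stand-ins written for this sketch; only the two backend method names and the marker key come from this PR.

```python
from __future__ import annotations

import json
from typing import Any

REF_KEY = "__airflow_state_ref__"  # marker key from the PR description


class InMemoryBackend:
    """Hypothetical stand-in for a custom backend: keeps payloads in a dict."""

    def __init__(self) -> None:
        self.blobs: dict[str, str] = {}

    def serialize_task_state_to_ref(self, *, value: Any, key: str, ti_id: str) -> str:
        ref = f"mem://ti_{ti_id}/{key}"
        self.blobs[ref] = json.dumps(value)
        return ref  # raw ref only; the backend does not build the envelope

    def deserialize_task_state_from_ref(self, stored: str) -> Any:
        return json.loads(self.blobs[stored])


class FakeTaskState:
    """Hypothetical accessor; ``db`` plays the role of the metadata DB value column."""

    def __init__(self, backend: InMemoryBackend, ti_id: str) -> None:
        self.backend = backend
        self.ti_id = ti_id
        self.db: dict[str, Any] = {}

    def set(self, key: str, value: Any) -> None:
        ref = self.backend.serialize_task_state_to_ref(value=value, key=key, ti_id=self.ti_id)
        self.db[key] = {REF_KEY: ref}  # the accessor wraps the ref before "storing" it

    def get(self, key: str) -> Any:
        stored = self.db[key]
        return self.backend.deserialize_task_state_from_ref(stored[REF_KEY])


task_state = FakeTaskState(InMemoryBackend(), ti_id="123")
task_state.set("secret-dict", {"key": "value"})
task_state.set("int_value", 42)
print(task_state.get("secret-dict"), task_state.get("int_value"))
print(task_state.db["int_value"])
```

The caller gets back the original Python values, while the simulated DB column holds only envelopes.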


The file system is updated with the files generated by the custom backend:

[Breeze:3.10.20] root@78d3c819fefd:/tmp/airflow_state/ti_019e6821-0661-7bdc-ad37-8dad5ac1fa14$ pwd
/tmp/airflow_state/ti_019e6821-0661-7bdc-ad37-8dad5ac1fa14
[Breeze:3.10.20] root@78d3c819fefd:/tmp/airflow_state/ti_019e6821-0661-7bdc-ad37-8dad5ac1fa14$ ls -l
total 12
-rw-r--r-- 1 root root  2 May 27 06:30 int_value.json
-rw-r--r-- 1 root root  7 May 27 06:30 job_id.json
-rw-r--r-- 1 root root 16 May 27 06:30 secret-dict.json
[Breeze:3.10.20] root@78d3c819fefd:/tmp/airflow_state/ti_019e6821-0661-7bdc-ad37-8dad5ac1fa14$ cat int_value.json
42
[Breeze:3.10.20] root@78d3c819fefd:/tmp/airflow_state/ti_019e6821-0661-7bdc-ad37-8dad5ac1fa14$ cat job_id.json
"12345"
[Breeze:3.10.20] root@78d3c819fefd:/tmp/airflow_state/ti_019e6821-0661-7bdc-ad37-8dad5ac1fa14$ cat secret-dict.json
{"key": "value"}

The Core API returns the external reference for the UI to build upon. The extra escaping, as in {\"__airflow_state_ref__\": \"/tmp/airflow_state/ti_019e6821-0661-7bdc-ad37-8dad5ac1fa14/int_value.json\"}, will be fixed by #67547.


return backend.deserialize_task_state_from_ref(stored)
assert isinstance(ref, str)
return backend.deserialize_task_state_from_ref(ref)
return stored
Member

When _get_worker_state_backend() returns a backend but stored is not an ExternalState envelope (not isinstance(stored, dict), missing __type, or __type != "ExternalState"), this falls through to return stored on line 522 -- the raw value is returned to the user as if no backend were configured. This matters during rolling upgrades / mixed-version workers: a row written by an older worker (pre-#67530) stored the raw ref string directly (no envelope). After upgrade, the new worker reads it back, fails the isinstance(stored, dict) check, and returns the ref string itself to user code rather than dereferencing it through the backend. The pre-#67530 code path explicitly called backend.deserialize_task_state_from_ref(stored) in that case.

Can you confirm the intended behavior here? If pre-envelope rows aren't expected to exist (fresh schema with no migration), worth a comment or a log warning so the silent fallthrough doesn't mask a future bug. If pre-envelope rows can exist (rolling deploy), this needs a fallback to the old deserialize_task_state_from_ref(stored) path for isinstance(stored, str).

Same issue at line 656 for asset state.

Contributor Author

Pre-envelope rows cannot exist in practice because this feature (custom state backends + serialize_task_state_to_ref) is brand new in 3.3 and has never shipped to date. There are no deployed clusters writing raw ref strings. The "older worker (pre-#67530)" scenario would only apply if someone were running a build of 3.3 with an earlier iteration of this feature, which is not, and should not be, the case. So the rolling-upgrade / backwards-compat branch is not needed.

But the silent fallthrough concern is valid from a defensive coding standpoint: if a backend is configured and the stored value is not an envelope, the caller gets back junk with no signal. I am adding a warning log that covers this case, without adding dead code for a migration path that does not exist.
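
A minimal sketch of a read path with that warning, assuming the single-key envelope from the PR description. The function name read_task_state and its callable parameter are hypothetical and exist only to keep the example self-contained; the actual change may be structured differently.

```python
from __future__ import annotations

import logging
from typing import Any, Callable

log = logging.getLogger(__name__)
REF_KEY = "__airflow_state_ref__"  # marker key from the PR description


def read_task_state(stored: Any, deserialize_from_ref: Callable[[str], Any] | None) -> Any:
    """Resolve a stored value; ``deserialize_from_ref`` is None when no backend is configured."""
    if deserialize_from_ref is None:
        return stored  # default path: plain JSON value, returned as-is
    if isinstance(stored, dict) and isinstance(stored.get(REF_KEY), str):
        return deserialize_from_ref(stored[REF_KEY])
    # Backend configured but no envelope: return the raw value, but leave a trace.
    log.warning("State backend is configured but the stored value has no %r marker", REF_KEY)
    return stored
```

No branch here dereferences a bare string through the backend, which matches the decision not to add a backwards-compat path.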

Contributor Author

Handled in d101f30

backend = _get_worker_state_backend()
if backend is not None:
# serialize_task_state_to_ref always returns str by contract; stored contains the ref.
if backend is not None and isinstance(stored, dict) and stored.get("__type") == "ExternalState":
Member

The envelope shape {"__type", "__var"} aliases the wire format that airflow-core/src/airflow/serialization/serialized_objects.py uses for serializing complex objects (OLD_TYPE = "__type", OLD_DATA = "__var"). Reusing those keys here creates two classes of collisions worth thinking through.

First, user-supplied dicts with the same shape: a user who does task_state.set("k", {"__type": "ExternalState", "__var": "some-string"}) will, on read, have their dict misinterpreted as a backend ref and passed to backend.deserialize_task_state_from_ref("some-string"). There's no FORBIDDEN_XCOM_KEYS-equivalent guard on task_state.set() to block this. Even without a backend the value still appears legitimate, so the trap is silent until the deployment turns a backend on.

Second, cross-mixing with BaseSerialization: if anything downstream (UI renderer, audit log, copying state to XCom) ever calls BaseSerialization.deserialize on this dict, it will try to resolve "ExternalState" as a registered class.

The PR title mentions "for UI clarity" -- and the UI is the consumer of this envelope. So the question is: what's the rationale for adopting __type/__var (Airflow's internal serialization envelope) here, vs. a distinct namespace like {"__airflow_state_ref__": "..."} that can't collide with either user values or BaseSerialization? If it's intentional alignment with the UI's existing renderer, worth a comment pointing to that. If it's incidental, the namespace collision is worth resolving now while there's no on-disk data.
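
The first collision can be reproduced in a few lines. The sketch below uses the old-shape check quoted in the diff context for this thread; the function and variable names are hypothetical.

```python
from __future__ import annotations

from typing import Any


def is_old_style_envelope(stored: Any) -> bool:
    """The check quoted in the review: a dict whose __type is "ExternalState"."""
    return isinstance(stored, dict) and stored.get("__type") == "ExternalState"


# What the framework would write for a real backend ref under the old shape.
framework_written = {"__type": "ExternalState", "__var": "s3://bucket/ti_123/job_id"}
# What a user could legitimately pass to task_state.set(); it is just a dict.
user_written = {"__type": "ExternalState", "__var": "some-string"}

# Both values pass the check, so the reader cannot tell them apart.
print(is_old_style_envelope(framework_written), is_old_style_envelope(user_written))
```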

Contributor Author

The __type / __var choice was not deliberate; there is no intentional alignment with BaseSerialization or the UI renderer. The UI simply renders raw values stored in the metadata DB, and the envelope is just meant to signal "this value is a reference to external storage, not the actual state."

Given that, switching to a distinct single-key shape makes sense. Will change to {"__airflow_state_ref__": "<ref>"}.

Contributor Author

Handled in d101f30

# serialize_task_state_to_ref always returns str by contract; stored contains the ref.
if backend is not None and isinstance(stored, dict) and stored.get("__type") == "ExternalState":
# unwrap the marker to get the ref, and retrieve the actual value from the backend using the ref
ref = stored["__var"]
Member

stored["__var"] raises a bare KeyError if a malformed envelope (e.g. {"__type": "ExternalState"} with no __var) ever lands in the DB -- corrupt row, partial write, schema drift on a future backend. Bare KeyError propagating out of TaskStateAccessor.get() will be hard to diagnose from the user's traceback.

Minor: ref = stored.get("__var") plus an explicit if not isinstance(ref, str): raise AirflowRuntimeError(...) (or similar) gives a typed, identifiable error. Same on line 652 for the asset-state path.
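
A sketch of that suggestion against the old two-key shape. MalformedStateEnvelopeError is a hypothetical local exception standing in for AirflowRuntimeError "or similar", and extract_ref is a name invented for this example.

```python
class MalformedStateEnvelopeError(RuntimeError):
    """Hypothetical error type; the review suggests AirflowRuntimeError or similar."""


def extract_ref(stored: dict) -> str:
    """Pull the ref out of an old-style envelope, failing with a descriptive error."""
    ref = stored.get("__var")  # .get() avoids a bare KeyError on a missing field
    if not isinstance(ref, str):
        raise MalformedStateEnvelopeError(
            f"External state envelope has no usable '__var' ref: {stored!r}"
        )
    return ref
```

With this shape, a corrupt row surfaces as a named error carrying the offending value instead of a bare KeyError from deep inside the accessor.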

Contributor Author

This is now resolved by the envelope change from previous comment. The read path checks _EXTERNAL_STATE_REF_KEY in stored before accessing the value, so a malformed envelope can't produce a bare KeyError. And since the new shape is a single key ({"__airflow_state_ref__": ref}), there is no separate field that could be missing.

Contributor Author

Handled in d101f30

if backend is not None:
# decorate the value with a marker to indicate that it's stored externally, and include the ref to the external storage
ref = backend.serialize_task_state_to_ref(value=value, key=key, ti_id=str(self._ti_id))
stored = {"__type": "ExternalState", "__var": ref}
Member

The string literals "__type", "__var", and "ExternalState" appear four times across this file (516/518, 555, 650/652, 670) and form the wire contract between the worker SDK and any downstream consumer (UI, future server-side reader). Hoisting them to module-level constants -- e.g. _EXTERNAL_STATE_TYPE = "ExternalState", _EXTERNAL_STATE_TYPE_KEY = "__type", _EXTERNAL_STATE_REF_KEY = "__var" -- and writing a small helper like _wrap_external_ref(ref) -> dict / _unwrap_external_ref(stored) -> str | None will prevent a single typo ("__var" -> "_var", "ExternalState" -> "ExternalSate") from silently breaking the round-trip in one direction, make the contract visible to the UI / server-side renderer that consumes this format, and be the natural place to add the malformed-envelope guard from the comment on line 518.

Contributor Author

From previous comments, the key is already hoisted to _EXTERNAL_STATE_REF_KEY and the envelope reduced to a single key, so the multi-key typo risk is gone.

Adding _wrap_external_ref / _unwrap_external_ref helpers now.

Contributor Author

Handled in d101f30

return backend.deserialize_asset_state_from_ref(stored)
assert isinstance(ref, str)
return backend.deserialize_asset_state_from_ref(ref)
return stored
Member

Same silent fallthrough as the task-state read at line 522 -- if a backend is configured but stored isn't an ExternalState envelope, the raw stored value is returned to user code without going through backend.deserialize_asset_state_from_ref().

Same question applies: are pre-envelope rows expected to exist (rolling upgrade), or is this a fresh schema? Either a backwards-compat branch for isinstance(stored, str) or a log/warning that this code path was hit is worth adding.

Contributor Author

Already handled alongside the task-state fix above. The same warning log fallthrough was added to the asset state read path at the same time.

Contributor Author

On the rolling upgrade question: pre-envelope rows cannot exist. The custom state backend feature is brand new in 3.3 and has never shipped, so there are no deployed clusters with rows written in the old format. No backwards-compat branch needed.

Contributor Author

Handled in d101f30

if backend is not None:
# decorate the value with a marker to indicate that it's stored externally, and include the ref to the external storage
ref = backend.serialize_asset_state_to_ref(value=value, key=key, asset_ref=asset_ref)
stored = {"__type": "ExternalState", "__var": ref}
Member

The envelope is now a wire-protocol contract between the SDK and any consumer of task_state/asset_state values (the UI in this PR, but also any future server-side reader or third-party state backend implementer). Right now that contract is encoded only as four duplicated dict literals across this one file.

Worth documenting in BaseStateBackend (shared/state/src/airflow_shared/state/__init__.py) -- a brief note in the class docstring or on serialize_task_state_to_ref / serialize_asset_state_to_ref -- that the returned ref string is wrapped by the worker into {"__type": "ExternalState", "__var": <ref>} before persistence, so implementers don't try to wrap it themselves. And in the UI consumer / renderer: a comment pointing back to this envelope shape so renderer changes here don't break the UI silently.

(The PR description says "for UI clarity" but doesn't link the UI consumer. Could you add a link to the UI PR/issue in the description so the rollout sequence is reviewable? Otherwise it's hard to verify the envelope shape actually matches what the renderer expects.)

Contributor Author

Docstring added to both serialize_task_state_to_ref and serialize_asset_state_to_ref in BaseStateBackend explaining that the framework wraps/unwraps the envelope automatically and implementers should return only the raw ref string.

On the UI link: the UI reads values directly from the DB and renders them as-is. With this change, external refs will display as {"__airflow_state_ref__": "..."} rather than a bare path string, which is the intended visual signal. I linked the UI PR in the PR description. The intended design is that the envelope shape itself is visible in the value column to indicate that the value is an external ref.
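
For backend implementers, the documented contract could read roughly as in the sketch below. SketchStateBackend is a hypothetical stand-in for BaseStateBackend, and the docstring wording is illustrative, not the text that was actually added.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class SketchStateBackend(ABC):
    """Hypothetical stand-in for BaseStateBackend, showing only the documented contract."""

    @abstractmethod
    def serialize_task_state_to_ref(self, *, value: Any, key: str, ti_id: str) -> str:
        """
        Store ``value`` externally and return only the raw reference string.

        The framework wraps the returned ref as ``{"__airflow_state_ref__": ref}``
        before persisting it and unwraps it again before calling
        ``deserialize_task_state_from_ref``, so implementations must not wrap it.
        """
```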

Contributor Author

Handled in d101f30

@amoghrajesh amoghrajesh requested a review from potiuk as a code owner May 27, 2026 06:57
@amoghrajesh amoghrajesh requested a review from kaxil May 27, 2026 06:58
@amoghrajesh amoghrajesh moved this from In progress to In review in AIP-103: Task State Management May 28, 2026

Projects

Status: In review


2 participants