Decorate custom state refs with an envelope for UI clarity#67530
Decorate custom state refs with an envelope for UI clarity#67530amoghrajesh wants to merge 5 commits into
Conversation
| return backend.deserialize_task_state_from_ref(stored) | ||
| assert isinstance(ref, str) | ||
| return backend.deserialize_task_state_from_ref(ref) | ||
| return stored |
There was a problem hiding this comment.
When _get_worker_state_backend() returns a backend but stored is not an ExternalState envelope (not isinstance(stored, dict), missing __type, or __type != "ExternalState"), this falls through to return stored on line 522 -- the raw value is returned to the user as if no backend were configured. This matters during rolling upgrades / mixed-version workers: a row written by an older worker (pre-#67530) stored the raw ref string directly (no envelope). After upgrade, the new worker reads it back, fails the isinstance(stored, dict) check, and returns the ref string itself to user code rather than dereferencing it through the backend. The pre-#67530 code path explicitly called backend.deserialize_task_state_from_ref(stored) in that case.
Can you confirm the intended behavior here? If pre-envelope rows aren't expected to exist (fresh schema with no migration), worth a comment or a log warning so the silent fallthrough doesn't mask a future bug. If pre-envelope rows can exist (rolling deploy), this needs a fallback to the old deserialize_task_state_from_ref(stored) path for isinstance(stored, str).
Same issue at line 656 for asset state.
There was a problem hiding this comment.
Pre envelope rows cannot exist in practic cos this feature (custom state backends + serialize_task_state_to_ref) is brand new in 3.3 and has never shipped till date. There are no deployed clusters writing raw ref strings. The "older worker (pre-#67530)" scenario would only apply if someone was running build of 3.3 with an earlier iteration of this feature, which is not/should not be the case. So the rolling-upgrade / backwards-compat branch is not needed.
But the silent fallthrough concern is valid from a defensive coding standpoint, ie: if a backend is configured and the stored value is not an envelope, the caller gets back junk with no signal. Adding a warning log that covers that without adding dead code for a migration path that does not exist.
| backend = _get_worker_state_backend() | ||
| if backend is not None: | ||
| # serialize_task_state_to_ref always returns str by contract; stored contains the ref. | ||
| if backend is not None and isinstance(stored, dict) and stored.get("__type") == "ExternalState": |
There was a problem hiding this comment.
The envelope shape {"__type", "__var"} aliases the wire format that airflow-core/src/airflow/serialization/serialized_objects.py uses for serializing complex objects (OLD_TYPE = "__type", OLD_DATA = "__var"). Reusing those keys here creates two classes of collisions worth thinking through.
First, user-supplied dicts with the same shape: a user who does task_state.set("k", {"__type": "ExternalState", "__var": "some-string"}) will, on read, have their dict misinterpreted as a backend ref and passed to backend.deserialize_task_state_from_ref("some-string"). There's no FORBIDDEN_XCOM_KEYS-equivalent guard on task_state.set() to block this. Even without a backend the value still appears legitimate, so the trap is silent until the deployment turns a backend on.
Second, cross-mixing with BaseSerialization: if anything downstream (UI renderer, audit log, copying state to XCom) ever calls BaseSerialization.deserialize on this dict, it will try to resolve "ExternalState" as a registered class.
The PR title mentions "for UI clarity" -- and the UI is the consumer of this envelope. So the question is: what's the rationale for adopting __type/__var (Airflow's internal serialization envelope) here, vs. a distinct namespace like {"__airflow_state_ref__": "..."} that can't collide with either user values or BaseSerialization? If it's intentional alignment with the UI's existing renderer, worth a comment pointing to that. If it's incidental, the namespace collision is worth resolving now while there's no on-disk data.
There was a problem hiding this comment.
The __type, __var choice was not deliberate, there is no intentional alignment with BaseSerialization or the UI renderer. The UI simply renders raw values stored in the metadata DB and the envelope is just meant to signal "this value is a reference to external storage, not the actual state."
Given that, switching to a distinct single-key shape makes sense. Will change to {"__airflow_state_ref__": "<ref>"}.
| # serialize_task_state_to_ref always returns str by contract; stored contains the ref. | ||
| if backend is not None and isinstance(stored, dict) and stored.get("__type") == "ExternalState": | ||
| # unwrap the marker to get the ref, and retrieve the actual value from the backend using the ref | ||
| ref = stored["__var"] |
There was a problem hiding this comment.
stored["__var"] raises a bare KeyError if a malformed envelope (e.g. {"__type": "ExternalState"} with no __var) ever lands in the DB -- corrupt row, partial write, schema drift on a future backend. Bare KeyError propagating out of TaskStateAccessor.get() will be hard to diagnose from the user's traceback.
Minor: ref = stored.get("__var") plus an explicit if not isinstance(ref, str): raise AirflowRuntimeError(...) (or similar) gives a typed, identifiable error. Same on line 652 for the asset-state path.
There was a problem hiding this comment.
This is now resolved by the envelope change from previous comment. The read path checks _EXTERNAL_STATE_REF_KEY in stored before accessing the value, so a malformed envelope can't produce a bare KeyError. And since the new shape is a single key ({"__airflow_state_ref__": ref}), there is no separate field that could be missing.
| if backend is not None: | ||
| # decorate the value with a marker to indicate that it's stored externally, and include the ref to the external storage | ||
| ref = backend.serialize_task_state_to_ref(value=value, key=key, ti_id=str(self._ti_id)) | ||
| stored = {"__type": "ExternalState", "__var": ref} |
There was a problem hiding this comment.
The string literals "__type", "__var", and "ExternalState" appear four times across this file (516/518, 555, 650/652, 670) and form the wire contract between the worker SDK and any downstream consumer (UI, future server-side reader). Hoisting them to module-level constants -- e.g. _EXTERNAL_STATE_TYPE = "ExternalState", _EXTERNAL_STATE_TYPE_KEY = "__type", _EXTERNAL_STATE_REF_KEY = "__var" -- and writing a small helper like _wrap_external_ref(ref) -> dict / _unwrap_external_ref(stored) -> str | None will prevent a single typo ("__var" -> "_var", "ExternalState" -> "ExternalSate") from silently breaking the round-trip in one direction, make the contract visible to the UI / server-side renderer that consumes this format, and be the natural place to add the malformed-envelope guard from the comment on line 518.
There was a problem hiding this comment.
From previous comments, the key is already hoisted to _EXTERNAL_STATE_REF_KEY and the envelope reduced to a single key, so the multi-key typo risk is gone.
Adding _wrap_external_ref / _unwrap_external_ref helpers now.
| return backend.deserialize_asset_state_from_ref(stored) | ||
| assert isinstance(ref, str) | ||
| return backend.deserialize_asset_state_from_ref(ref) | ||
| return stored |
There was a problem hiding this comment.
Same silent fallthrough as the task-state read at line 522 -- if a backend is configured but stored isn't an ExternalState envelope, the raw stored value is returned to user code without going through backend.deserialize_asset_state_from_ref().
Same question applies: are pre-envelope rows expected to exist (rolling upgrade), or is this a fresh schema? Either a backwards-compat branch for isinstance(stored, str) or a log/warning that this code path was hit is worth adding.
There was a problem hiding this comment.
Already handled alongside the task-state fix above. The same warning log fallthrough was added to the asset state read path at the same time.
There was a problem hiding this comment.
On the rolling upgrade question: pre-envelope rows cannot exist. The custom state backend feature is brand new in 3.3 and has never shipped, so there are no deployed clusters with rows written in the old format. No backwards-compat branch needed.
| if backend is not None: | ||
| # decorate the value with a marker to indicate that it's stored externally, and include the ref to the external storage | ||
| ref = backend.serialize_asset_state_to_ref(value=value, key=key, asset_ref=asset_ref) | ||
| stored = {"__type": "ExternalState", "__var": ref} |
There was a problem hiding this comment.
The envelope is now a wire-protocol contract between the SDK and any consumer of task_state/asset_state values (the UI in this PR, but also any future server-side reader or third-party state backend implementer). Right now that contract is encoded only as four duplicated dict literals across this one file.
Worth documenting in BaseStateBackend (shared/state/src/airflow_shared/state/init.py) -- a brief note in the class docstring or on serialize_task_state_to_ref / serialize_asset_state_to_ref -- that the returned ref string is wrapped by the worker into {"__type": "ExternalState", "__var": <ref>} before persistence, so implementers don't try to wrap it themselves. And in the UI consumer / renderer: a comment pointing back to this envelope shape so renderer changes here don't break the UI silently.
(The PR description says "for UI clarity" but doesn't link the UI consumer. Could you add a link to the UI PR/issue in the description so the rollout sequence is reviewable? Otherwise it's hard to verify the envelope shape actually matches what the renderer expects.)
There was a problem hiding this comment.
Docstring added to both serialize_task_state_to_ref and serialize_asset_state_to_ref in BaseStateBackend explaining that the framework wraps/unwraps the envelope automatically and implementers should return only the raw ref string.
On the UI link — the UI reads values directly from the DB and renders them as-is. With this change, external refs will display as {"__airflow_state_ref__": "..."} rather than a bare path string, which is the intended visual signal. I linked the UI PR in the PR desc and the intended design is that envelope shape itself being visible in the value column to indicate that its an external ref
Was generative AI tooling used to co-author this PR?
What problem are we solving?
When a custom worker backend (e.g. S3, GCS) stores a state value externally and writes a reference string back to the DB, the UI has no way to tell whether a value like
s3://bucket/ti_123/job_idis:Without this distinction, the UI would show the raw path as if it were the value, which can be confusing and misleading.
Current behaviour
Custom backends return a reference string from
serialize_task_state_to_ref(), which is stored verbatim in the DB. The DB value column contains either a plain JSON value or a reference string with no structural difference between the two — the UI cannot differentiate them.Proposed change
When a custom worker backend is configured, the framework now automatically wraps the reference returned by
serialize_task_state_to_ref()in a typed envelope before storing:{"__airflow_state_ref__": "s3://bucket/ti_123/job_id"}On read, the framework detects the envelope, extracts the ref, and passes it to
deserialize_task_state_from_ref()and the backend never sees the envelope. If a stored value does not carry the marker (e.g. a corrupt row), the raw value is returned and a warning is logged.The default path (no custom backend) is unaffected, plain JSON values are stored and returned as before.
UI Impact
UI PR for reference #67292
The UI reads state values directly from the DB and displays them as-is. With this change, when a custom backend is in use, the UI will show
{"__airflow_state_ref__": "..."}instead of a raw reference string, making it visually clear that the value is a pointer to externally-stored data rather than the actual state value.Testing
Created a custom worker side backend based on file system:
Ran breeze with:
export AIRFLOW__WORKERS__STATE_BACKEND=file_state_backend.FileStateBackendDAG:
The task is agnostic to the envelope.
File system is updated with the custom backend generated files:
Core API will return the external reference for UI to build upon, the extra encoding like
{\"__airflow_state_ref__\": \"/tmp/airflow_state/ti_019e6821-0661-7bdc-ad37-8dad5ac1fa14/int_value.json\"}will be fixed by: #67547{pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.