Speed up Dag serialization by skipping redundant asset roundtrip#67702
Open
shahar1 wants to merge 1 commit into
Open
Speed up Dag serialization by skipping redundant asset roundtrip#67702shahar1 wants to merge 1 commit into
shahar1 wants to merge 1 commit into
Conversation
detect_task_dependencies built the asset dependency id via ensure_serialized_asset(), which runs a full encode→decode roundtrip (rebuilding group/extra/watchers/access_control) on every asset outlet of every task, only to read back name/uri for the unique key. Asset encode/decode copies name/uri verbatim, so the key can be built directly from the object. This removes the redundant roundtrip per outlet, cutting end-to-end serialization time by ~3-9% on asset-heavy Dags (larger gains the more outlets per task) with byte-identical output.
341bc5c to
3fd4dcb
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
_DependencyDetector.detect_task_dependenciesbuilds the assetdependency_idfor every asset outlet of every task via:
ensure_serialized_asset()runs a fulldecode_asset_like(encode_asset_like(obj))encode→decode roundtrip — rebuilding the entire
SerializedAsset(group, extra,watchers, access_control) — purely to read back
.name/.urifor the unique key.Fix
encode_asset_likecopiesname/uriverbatim and_decode_assetreconstructsthem verbatim, so the roundtrip cannot change either field. The key is built
directly from the object instead:
One full encode+decode per asset outlet is removed. The asset-alias branch is
left unchanged (rare path that genuinely needs the serialized object).
Correctness
Asset.__init__, never inside the roundtrip, soobj.name/obj.urialreadyhold the normalized values.
airflow-core/tests/unit/serialization/test_dag_serialization.pypass.[0]section asserts identical keys across asset shapes(name+uri, uri-only, name-only, special characters, with group/extra).
Benchmark
Comparing two separate process runs proved unreliable on a loaded machine
(absolute timings swung ~2× from background load alone). The credible
measurement is a single-process interleaved A/B that patches in both
implementations and times them round-by-round under identical instantaneous
load (
bench_asset_roundtrip_ab.py):Consistent ~3–9% end-to-end serialization speedup (largest on asset-heavy
Dags), and ~1.5–2× on the changed line in isolation.
Benchmark scripts: https://gist.github.com/shahar1/841592531adfd66def64fc67fcc3ea6c
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Opus 4.8) following the guidelines