Fix Callback.handle_event crash on OTel metrics with dict tag values #67527
The triggerer crashes on the next deadline async callback when OpenTelemetry
metrics are enabled:
  File ".../airflow/jobs/triggerer_job_runner.py", line 659, in handle_events
    Trigger.submit_event(...)
  File ".../airflow/models/callback.py", line 234, in handle_event
    Stats.incr(**self.get_metric_info(status, self.output))
  File ".../airflow/_shared/observability/metrics/otel_logger.py", line 211, in incr
    counter.add(count, attributes=tags)
  File ".../opentelemetry/sdk/metrics/.../view_instrument_match.py", line 105
    aggr_key = frozenset(attributes.items())
TypeError: unhashable type: 'dict'
`Callback.get_metric_info` builds the metric tags dict directly from the
callback's `result` and `self.data` (which includes `kwargs`). Both are
frequently dicts — for deadline async callbacks the `result` is the user
callback's return value, and `kwargs` is the captured callback kwargs. When
the metrics backend is OTel, the SDK builds the aggregation key as
`frozenset(attributes.items())`, which raises if any value is unhashable
(dict, list, set). The result is a triggerer crash and stalled triggers.
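The failure mode can be reproduced without Airflow or the OTel SDK. The following is a minimal sketch that only mimics the key construction shown in the traceback; the helper name `aggregation_key` and the tag names are hypothetical, not taken from either codebase.

```python
def aggregation_key(attributes: dict) -> frozenset:
    """Mimic the key construction from the traceback (hypothetical helper)."""
    return frozenset(attributes.items())


# Primitive tag values are hashable, so the key builds without error.
ok_key = aggregation_key({"status": "success", "attempt": 1})

# A dict-valued tag makes the (key, value) tuple unhashable.
try:
    aggregation_key({"status": "success", "result": {"rows": 3}})
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```

Each `(key, value)` pair is hashed as a tuple, and a tuple is only hashable if all of its elements are, which is why a single dict, list, or set value is enough to fail the whole call.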
The bug is metrics-backend-dependent: statsd accepts non-primitive tag values
without complaint, so OSS users running default statsd never see it. OTel
backends (used in production by Astronomer Astro Cloud and any OSS deployment
that enables `[metrics] otel_*`) hit it consistently.
Reproduces against 3.2.1 and main; not a 3.2.x regression.
Sanitize tag values to primitives before returning from `get_metric_info`:
keep `str | int | float | bool | None` as-is, JSON-stringify anything else.
`json.dumps` is called with `default=str` so that values like `datetime`
fall back cleanly instead of raising.
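The sanitization rule described above could be sketched as follows. This is an illustration under stated assumptions, not the code merged in the PR: the names `sanitize_tag_value` and `sanitize_tags` and the sample tags are hypothetical.

```python
from __future__ import annotations

import json
from datetime import datetime
from typing import Any

# Types kept as-is, per the description above.
_PRIMITIVES = (str, int, float, bool, type(None))


def sanitize_tag_value(value: Any) -> str | int | float | bool | None:
    """Return primitives unchanged; JSON-stringify anything else (hypothetical helper)."""
    if isinstance(value, _PRIMITIVES):
        return value
    # default=str covers values json cannot encode natively, e.g. datetime or set.
    return json.dumps(value, default=str)


def sanitize_tags(tags: dict[str, Any]) -> dict[str, Any]:
    """Apply sanitize_tag_value to every tag value (hypothetical helper)."""
    return {key: sanitize_tag_value(value) for key, value in tags.items()}


raw_tags = {
    "status": "success",
    "result": {"rows": 3},
    "kwargs": {"deadline": datetime(2024, 1, 1)},
}
clean_tags = sanitize_tags(raw_tags)

# Every value is now hashable, so this no longer raises.
frozenset(clean_tags.items())
```

One caveat of this approach: `default` is only consulted for values, so a dict whose keys are not `str`, `int`, `float`, `bool`, or `None` would still make `json.dumps` raise `TypeError`.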
Adds a regression test that asserts every tag value is hashable and that
`frozenset(tags.items())` does not raise.
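A check in the spirit of that regression test might look like the sketch below. It is not the test added by the PR; `assert_tags_hashable` and the sample tags are hypothetical, and a real test would call `get_metric_info` on a `Callback` instead of using a literal dict.

```python
from collections.abc import Hashable, Mapping
from typing import Any


def assert_tags_hashable(tags: Mapping[str, Any]) -> None:
    """Fail if any tag value would break frozenset(tags.items()) (hypothetical helper)."""
    for key, value in tags.items():
        assert isinstance(value, Hashable), f"tag {key!r} is unhashable: {value!r}"
    # Final guard: a nested unhashable (e.g. a tuple holding a list) passes
    # the isinstance check above but raises TypeError here.
    frozenset(tags.items())


# Hypothetical sanitized tags: the dict result is already a JSON string.
assert_tags_hashable({"status": "success", "result": '{"rows": 3}', "attempt": 1})
```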
Reported by Astronomer Runtime team while testing 3.2.2rc2-based images.
Hi maintainer, this PR was merged without a milestone set.
Backport failed to create: v3-2-test.
Note: see "Merging PRs targeted for Airflow 3.X". If in doubt, please ask in the #release-management Slack channel.
You can attempt to backport this manually by running: `cherry_picker 5978911 v3-2-test`
This should apply the commit to the v3-2-test branch and leave the commit in a conflict state. After you have resolved the conflicts, you can continue the backport process by running: `cherry_picker --continue`
If you don't have cherry-picker installed, see the installation guide.
…) (#67555)
* Fix Callback.handle_event crash on OTel metrics with dict tag values (#67527) (#67529)
Fix Callback.handle_event crash on OTel metrics with dict tag values (#67527)
(cherry picked from commit 5978911)
* [v3-2-test] Fix N+1 query in bulk task instance delete endpoint (#67304)
* Fix N+1 query pattern in bulk task instance delete endpoint
* Add regression test for bulk task instance delete N+1
* Refactor N+1 regression test to use parametrize pattern
(cherry picked from commit e31cca1)
Co-authored-by: Colten <jun930436@gmail.com>
* Fix CI
---------
Co-authored-by: Rahul Vats <43964496+vatsrahul1001@users.noreply.github.com>
Co-authored-by: Colten <jun930436@gmail.com>
Co-authored-by: pierrejeambrun <pierrejbrun@gmail.com>
Summary
The triggerer crashes on the next deadline async callback when OpenTelemetry metrics are enabled, failing with `TypeError: unhashable type: 'dict'`.
Root cause
`Callback.get_metric_info` builds the metric tags dict directly from the callback's `result` and `self.data` (which includes `kwargs`). Both are frequently dicts: for deadline async callbacks the `result` is the user callback's return value, and `kwargs` is the captured callback kwargs.
When the metrics backend is OpenTelemetry, the SDK builds the aggregation key as `frozenset(attributes.items())`, which raises `TypeError: unhashable type: 'dict'` if any value is unhashable (dict, list, set). Result: triggerer crash, stalled triggers.
The bug is metrics-backend-dependent: statsd accepts non-primitive tag values without complaint, so OSS users running the default statsd never see it. OpenTelemetry backends hit it consistently. It was surfaced by Astronomer Astro Cloud, but affects any OSS deployment that enables `[metrics] otel_*`.
Reproduces against 3.2.1 and main; not a 3.2.x regression.
Fix
Sanitize tag values to primitives before returning from `get_metric_info`. Keep `str | int | float | bool | None` as-is; JSON-stringify anything else using `default=str` so values like `datetime` fall back cleanly instead of raising.
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Opus 4.7) following the guidelines