Fix Callback.handle_event crash on OTel metrics with dict tag values #67527

Merged: vatsrahul1001 merged 1 commit into main from fix-callback-otel-unhashable-dict on May 26, 2026

vatsrahul1001 commented May 26, 2026

Summary

Triggerer crashes on the next deadline async callback when OpenTelemetry metrics are enabled:

File ".../airflow/jobs/triggerer_job_runner.py", line 659, in handle_events
    Trigger.submit_event(...)
File ".../airflow/models/callback.py", line 234, in handle_event
    Stats.incr(**self.get_metric_info(status, self.output))
File ".../airflow/_shared/observability/metrics/otel_logger.py", line 211, in incr
    counter.add(count, attributes=tags)
File ".../opentelemetry/sdk/metrics/.../view_instrument_match.py", line 105
    aggr_key = frozenset(attributes.items())
TypeError: unhashable type: 'dict'

Root cause

Callback.get_metric_info builds the metric tags dict directly from the callback's result and self.data (which includes kwargs). Both are frequently dicts — for deadline async callbacks the result is the user callback's return value, and kwargs is the captured callback kwargs.

When the metrics backend is OpenTelemetry, the SDK builds the aggregation key as frozenset(attributes.items()), which raises TypeError: unhashable type: 'dict' if any value is unhashable (dict, list, set). Result: triggerer crash, stalled triggers.
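The failure can be reproduced without Airflow or the OpenTelemetry SDK. The following is a minimal sketch that uses only the `frozenset(attributes.items())` expression from the traceback; the tag names and values are illustrative assumptions, not taken from the PR.

```python
# Illustrative tags: "result" holds a dict, as a user callback's return value might.
attributes = {"status": "success", "result": {"rows": 3}}

raised = False
try:
    # Same expression the SDK uses to build its aggregation key.
    frozenset(attributes.items())
except TypeError as exc:
    raised = True
    print(f"TypeError: {exc}")

# With every value reduced to a primitive, the key builds without error.
ok_key = frozenset({"status": "success", "result": '{"rows": 3}'}.items())
print(len(ok_key))
```

Each item is a `(key, value)` tuple, and hashing a tuple hashes its members, so a single dict, list, or set value is enough to make the whole key unhashable.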

The bug is metrics-backend-dependent: statsd accepts non-primitive tag values without complaint, so OSS users running the default statsd never see it. OpenTelemetry backends hit it consistently — surfaced by Astronomer Astro Cloud, but affects any OSS deployment that enables [metrics] otel_*.

Reproduces against 3.2.1 and main — not a 3.2.x regression.

Fix

Sanitize tag values to primitives before returning from get_metric_info. Keep str | int | float | bool | None as-is; JSON-stringify anything else using default=str so values like datetime fall back cleanly instead of raising.
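The sanitization rule described above could look roughly like the sketch below. The helper name `sanitize_tag_value` and the sample tags are hypothetical and chosen for illustration; the actual change in `callback.py` may be structured differently.

```python
from __future__ import annotations

import json
from datetime import datetime
from typing import Any


def sanitize_tag_value(value: Any) -> str | int | float | bool | None:
    """Hypothetical helper: keep primitives as-is, JSON-stringify anything else."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    # default=str makes json.dumps fall back to str() for values it cannot
    # encode natively, such as datetime, instead of raising TypeError.
    return json.dumps(value, default=str)


# Illustrative raw tags resembling a callback result and captured kwargs.
raw_tags = {
    "status": "success",
    "result": {"rows": 3, "finished_at": datetime(2026, 5, 26, 6, 0)},
    "kwargs": ["a", "b"],
}
tags = {key: sanitize_tag_value(value) for key, value in raw_tags.items()}

# Every value is now a primitive, so the aggregation key can be built.
aggr_key = frozenset(tags.items())
```

Stringifying rather than dropping non-primitive values keeps the information in the tag while making it safe for any backend that requires hashable attribute values.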

Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.7)

Generated-by: Claude Code (Opus 4.7) following the guidelines

vatsrahul1001 added the backport-to-v3-2-test label May 26, 2026
Comment thread airflow-core/src/airflow/models/callback.py Outdated
Comment thread airflow-core/tests/unit/models/test_callback.py Outdated
The triggerer crashes on the next deadline async callback when OpenTelemetry
metrics are enabled:

    File ".../airflow/jobs/triggerer_job_runner.py", line 659, in handle_events
        Trigger.submit_event(...)
    File ".../airflow/models/callback.py", line 234, in handle_event
        Stats.incr(**self.get_metric_info(status, self.output))
    File ".../airflow/_shared/observability/metrics/otel_logger.py", line 211, in incr
        counter.add(count, attributes=tags)
    File ".../opentelemetry/sdk/metrics/.../view_instrument_match.py", line 105
        aggr_key = frozenset(attributes.items())
    TypeError: unhashable type: 'dict'

`Callback.get_metric_info` builds the metric tags dict directly from the
callback's `result` and `self.data` (which includes `kwargs`). Both are
frequently dicts — for deadline async callbacks the `result` is the user
callback's return value, and `kwargs` is the captured callback kwargs. When
the metrics backend is OTel, the SDK builds the aggregation key as
`frozenset(attributes.items())`, which raises if any value is unhashable
(dict, list, set). The result is a triggerer crash and stalled triggers.

The bug is metrics-backend-dependent: statsd accepts non-primitive tag values
without complaint, so OSS users running default statsd never see it. OTel
backends (used in production by Astronomer Astro Cloud and any OSS deployment
that enables `[metrics] otel_*`) hit it consistently.

Reproduces against 3.2.1 and main; not a 3.2.x regression.

Sanitize tag values to primitives before returning from `get_metric_info`:
keep `str | int | float | bool | None` as-is and JSON-stringify anything else.
`json.dumps` is called with `default=str` so that values like `datetime` fall
back cleanly instead of raising.

Adds a regression test that asserts every tag value is hashable and that
`frozenset(tags.items())` does not raise.
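
A check in the spirit of that regression test might look like the sketch below. It is not the test added in this PR: the helper name `find_unhashable_tags` and the sample tags are assumptions made for illustration.

```python
from typing import Any


def find_unhashable_tags(tags: dict[str, Any]) -> list[str]:
    """Hypothetical helper: return the keys whose values cannot be hashed."""
    bad = []
    for key, value in tags.items():
        try:
            hash(value)
        except TypeError:
            bad.append(key)
    return bad


# Tags as they might look after sanitization: primitives only.
sanitized = {"status": "success", "result": '{"rows": 3}', "kwargs": "[]"}

assert find_unhashable_tags(sanitized) == []
# Building the aggregation key the way the OTel SDK does must not raise.
frozenset(sanitized.items())
```

Calling `hash()` directly, rather than relying on an `isinstance` check, also catches containers such as a tuple that wraps a dict.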

Reported by Astronomer Runtime team while testing 3.2.2rc2-based images.
vatsrahul1001 force-pushed the fix-callback-otel-unhashable-dict branch from 6097e9d to 9a34e73 on May 26, 2026 06:08
vatsrahul1001 merged commit 5978911 into main on May 26, 2026
74 checks passed
vatsrahul1001 deleted the fix-callback-otel-unhashable-dict branch on May 26, 2026 06:39
github-actions bot added this to the Airflow 3.2.3 milestone on May 26, 2026
github-actions commented

Hi maintainer, this PR was merged without a milestone set.
We've automatically set the milestone to Airflow 3.2.3 based on: backport label targeting v3-2-test
If this milestone is not correct, please update it to the appropriate milestone.

This comment was generated by Milestone Tag Assistant.

github-actions commented

Backport failed to create: v3-2-test. View the failure log Run details

Note: As of "Merging PRs targeted for Airflow 3.X", the committer who merges the PR is responsible for backporting the PRs that are bug fixes (generally speaking) to the maintenance branches.

If in doubt, please ask in the #release-management Slack channel.

| Status | Branch | Result |
| --- | --- | --- |
| | v3-2-test | Commit Link |

You can attempt to backport this manually by running:

cherry_picker 5978911 v3-2-test

This should apply the commit to the v3-2-test branch and leave the commit in conflict state marking
the files that need manual conflict resolution.

After you have resolved the conflicts, you can continue the backport process by running:

cherry_picker --continue

If you don't have cherry-picker installed, see the installation guide.

vatsrahul1001 added a commit that referenced this pull request May 26, 2026
…67527)

Fix Callback.handle_event crash on OTel metrics with dict tag values (#67527)

(cherry picked from commit 5978911)
vatsrahul1001 added a commit that referenced this pull request May 26, 2026
…67527) (#67529)

Fix Callback.handle_event crash on OTel metrics with dict tag values (#67527)

(cherry picked from commit 5978911)
pierrejeambrun added a commit that referenced this pull request May 27, 2026
…) (#67555)

* Fix Callback.handle_event crash on OTel metrics with dict tag values (#67527) (#67529)

Fix Callback.handle_event crash on OTel metrics with dict tag values (#67527)

(cherry picked from commit 5978911)

* [v3-2-test] Fix N+1 query in bulk task instance delete endpoint (#67304)

* Fix N+1 query pattern in bulk task instance delete endpoint

* Add regression test for bulk task instance delete N+1

* Refactor N+1 regression test to use parametrize pattern
(cherry picked from commit e31cca1)

Co-authored-by: Colten <jun930436@gmail.com>

* Fix CI

---------

Co-authored-by: Rahul Vats <43964496+vatsrahul1001@users.noreply.github.com>
Co-authored-by: Colten <jun930436@gmail.com>
Co-authored-by: pierrejeambrun <pierrejbrun@gmail.com>

Labels

backport-to-v3-2-test

3 participants