feat(metrics/otel): add histogram bucket views per metric-name pattern by 1fanwang · Pull Request #66805 · apache/airflow

1fanwang · 2026-05-12T16:38:58Z

Fix for the one-size-fits-all OTel histogram bucketing called out in #66801. Histogram-eligible Airflow metrics all share the default boundaries today, which makes the tails of some of our LinkedIn DI dashboards useless — task duration buckets fine, scheduler-loop duration needs finer resolution at the low end. This PR adds a config-driven metric-name pattern → buckets mapping so we can tune per metric without touching code.

Problem

Follow-up to #64207, which set ExponentialBucketHistogramAggregation as the
default for every OTel Histogram instrument. That covers timer/timing.
Non-timer histograms (*_count, *_duration on direct Stats.histogram(),
custom *_delay metrics) still pick bucket boundaries at each call site,
so the same family of metrics can end up with different bucket shapes
depending on which module created the instrument.

Fix

New shared/observability/.../metrics/histogram_buckets.py module that:

Defines three named bucket aggregations — LATENCY_BUCKETS (exponential),
COUNT_BUCKETS (small linear), DELAY_BUCKETS (large-range).
Exposes DEFAULT_PATTERN_BUCKETS mapping *_duration / *_delay /
*_count to those aggregations.
Provides build_views_for_patterns() returning OTel View objects keyed
by instrument-name glob. Deployments can pass an override dict.

otel_logger.get_otel_logger() layers the pattern views on top of the
existing instrument-type baseline view — the per-instrument-type
exponential default still applies, the pattern views only refine it for
matching names.

Tests

test_histogram_buckets.py covers the default pattern map, the per-pattern
aggregation resolution, custom-mapping override, and that
build_views_for_patterns() returns one View per entry with the right
instrument_name.

test_otel_logger.py::test_get_otel_logger_uses_exponential_histogram_view
updated to assert the layered shape (baseline view + pattern views).

Design note

Flagging this as a design discussion. A reasonable counter-proposal is
"let each call site pass its bucket boundaries" rather than centralising
the map. Happy to take that direction if maintainers prefer.

Closes #66801

Reproducer

Captured before/after via the updated test

The discriminating test is shared/observability/tests/observability/metrics/test_otel_logger.py::TestOtelMetrics::test_get_otel_logger_uses_exponential_histogram_view. It collects the views= passed to the MeterProvider, checks the first one is the instrument-type exponential baseline, and checks the remaining views' _instrument_name set equals {"*_count", "*_duration", "*_delay"}.

Before the fix (git checkout upstream/main -- shared/observability/src/airflow_shared/observability/metrics/otel_logger.py):

FAILED shared/observability/tests/observability/metrics/test_otel_logger.py::TestOtelMetrics::test_get_otel_logger_uses_exponential_histogram_view
shared/observability/tests/observability/metrics/test_otel_logger.py:436: in test_get_otel_logger_uses_exponential_histogram_view
    assert pattern_names == {"*_count", "*_duration", "*_delay"}
E   AssertionError: assert equals failed
E     set()
E                      +  '*_count',
E                      +  '*_delay',
E                      +  '*_duration',

The pre-fix get_otel_logger builds a single instrument-type baseline view and nothing else, so views[1:] is empty and the pattern-name set does not match.

After restoring the fix:

PASSED shared/observability/tests/observability/metrics/test_otel_logger.py::TestOtelMetrics::test_get_otel_logger_uses_exponential_histogram_view

The full test_histogram_buckets.py suite plus the modified test_get_otel_logger_uses_exponential_histogram_view go 10/10 green.

End-to-end repro outside the test harness

The script below builds a MeterProvider with views=build_views_for_patterns() plus an InMemoryMetricReader, emits one sample per default pattern, and reads back which aggregation the SDK resolved per instrument:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import InMemoryMetricReader
from airflow_shared.observability.metrics.histogram_buckets import build_views_for_patterns

reader = InMemoryMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader], views=build_views_for_patterns()))
meter = metrics.get_meter("repro")

meter.create_histogram("task_duration").record(0.05)
meter.create_histogram("schedule_delay").record(45)
meter.create_histogram("retry_count").record(3)

for rm in reader.get_metrics_data().resource_metrics:
    for sm in rm.scope_metrics:
        for m in sm.metrics:
            print(m.name, type(m.data.data_points[0]).__name__)

Output:

task_duration  ExponentialHistogramDataPoint
schedule_delay HistogramDataPoint
retry_count    HistogramDataPoint

task_duration is collected through ExponentialBucketHistogramAggregation (matches *_duration); schedule_delay and retry_count are collected as fixed-boundary HistogramDataPoint with the configured (1, 5, 15, 60, 300, 900, 1800, 3600, 7200, 21600) and (1, 2, 5, 10, 25, 50, 100, 250, 500, 1000) boundaries respectively. Without the patch, all three would fall through to the instrument-type exponential default regardless of name.

Follow-up to apache#64207, which set ExponentialBucketHistogramAggregation as the instrument-type default for OTel histograms. Non-timer histograms (*_count, *_duration, *_delay) still pick bucket boundaries at each call site. This change moves that decision into a single declarative map keyed by metric-name suffix glob, and layers the resulting Views on top of the existing instrument-type baseline. Closes apache#66801 Signed-off-by: 1fanwang <1fannnw@gmail.com>

1fanwang requested review from amoghrajesh, ashb and potiuk as code owners May 12, 2026 16:38

1fanwang mentioned this pull request May 14, 2026

Declarative OTel histogram bucket views per metric-name pattern #66801

Open

2 tasks

choo121600 added the ready for maintainer review Set after triaging when all criteria pass. label May 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(metrics/otel): add histogram bucket views per metric-name pattern#66805

feat(metrics/otel): add histogram bucket views per metric-name pattern#66805
1fanwang wants to merge 1 commit into
apache:mainfrom
1fanwang:metrics-otel-bucket-views

1fanwang commented May 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

1fanwang commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Tests

Design note

Reproducer

Captured before/after via the updated test

End-to-end repro outside the test harness

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1fanwang commented May 12, 2026 •

edited

Loading