Skip to content

feat(metrics/otel): add histogram bucket views per metric-name pattern#66805

Open
1fanwang wants to merge 1 commit into
apache:mainfrom
1fanwang:metrics-otel-bucket-views
Open

feat(metrics/otel): add histogram bucket views per metric-name pattern#66805
1fanwang wants to merge 1 commit into
apache:mainfrom
1fanwang:metrics-otel-bucket-views

Conversation

@1fanwang
Copy link
Copy Markdown
Contributor

@1fanwang 1fanwang commented May 12, 2026

Fix for the one-size-fits-all OTel histogram bucketing called out in #66801. Histogram-eligible Airflow metrics all share the default boundaries today, which makes the tails of some of our LinkedIn DI dashboards useless — task duration buckets fine, scheduler-loop duration needs finer resolution at the low end. This PR adds a config-driven metric-name pattern → buckets mapping so we can tune per metric without touching code.

Problem

Follow-up to #64207, which set ExponentialBucketHistogramAggregation as the
default for every OTel Histogram instrument. That covers timer/timing.
Non-timer histograms (*_count, *_duration on direct Stats.histogram(),
custom *_delay metrics) still pick bucket boundaries at each call site,
so the same family of metrics can end up with different bucket shapes
depending on which module created the instrument.

Fix

New shared/observability/.../metrics/histogram_buckets.py module that:

  • Defines three named bucket aggregations — LATENCY_BUCKETS (exponential),
    COUNT_BUCKETS (small linear), DELAY_BUCKETS (large-range).
  • Exposes DEFAULT_PATTERN_BUCKETS mapping *_duration / *_delay /
    *_count to those aggregations.
  • Provides build_views_for_patterns() returning OTel View objects keyed
    by instrument-name glob. Deployments can pass an override dict.

otel_logger.get_otel_logger() layers the pattern views on top of the
existing instrument-type baseline view — the per-instrument-type
exponential default still applies, the pattern views only refine it for
matching names.

Tests

test_histogram_buckets.py covers the default pattern map, the per-pattern
aggregation resolution, custom-mapping override, and that
build_views_for_patterns() returns one View per entry with the right
instrument_name.

test_otel_logger.py::test_get_otel_logger_uses_exponential_histogram_view
updated to assert the layered shape (baseline view + pattern views).

Design note

Flagging this as a design discussion. A reasonable counter-proposal is
"let each call site pass its bucket boundaries" rather than centralising
the map. Happy to take that direction if maintainers prefer.

Closes #66801

Reproducer

Captured before/after via the updated test

The discriminating test is shared/observability/tests/observability/metrics/test_otel_logger.py::TestOtelMetrics::test_get_otel_logger_uses_exponential_histogram_view. It collects the views= passed to the MeterProvider, checks the first one is the instrument-type exponential baseline, and checks the remaining views' _instrument_name set equals {"*_count", "*_duration", "*_delay"}.

Before the fix (git checkout upstream/main -- shared/observability/src/airflow_shared/observability/metrics/otel_logger.py):

FAILED shared/observability/tests/observability/metrics/test_otel_logger.py::TestOtelMetrics::test_get_otel_logger_uses_exponential_histogram_view
shared/observability/tests/observability/metrics/test_otel_logger.py:436: in test_get_otel_logger_uses_exponential_histogram_view
    assert pattern_names == {"*_count", "*_duration", "*_delay"}
E   AssertionError: assert equals failed
E     set()
E                      +  '*_count',
E                      +  '*_delay',
E                      +  '*_duration',

The pre-fix get_otel_logger builds a single instrument-type baseline view and nothing else, so views[1:] is empty and the pattern-name set does not match.

After restoring the fix:

PASSED shared/observability/tests/observability/metrics/test_otel_logger.py::TestOtelMetrics::test_get_otel_logger_uses_exponential_histogram_view

The full test_histogram_buckets.py suite plus the modified test_get_otel_logger_uses_exponential_histogram_view go 10/10 green.

End-to-end repro outside the test harness

The script below builds a MeterProvider with views=build_views_for_patterns() plus an InMemoryMetricReader, emits one sample per default pattern, and reads back which aggregation the SDK resolved per instrument:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import InMemoryMetricReader
from airflow_shared.observability.metrics.histogram_buckets import build_views_for_patterns

reader = InMemoryMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader], views=build_views_for_patterns()))
meter = metrics.get_meter("repro")

meter.create_histogram("task_duration").record(0.05)
meter.create_histogram("schedule_delay").record(45)
meter.create_histogram("retry_count").record(3)

for rm in reader.get_metrics_data().resource_metrics:
    for sm in rm.scope_metrics:
        for m in sm.metrics:
            print(m.name, type(m.data.data_points[0]).__name__)

Output:

task_duration  ExponentialHistogramDataPoint
schedule_delay HistogramDataPoint
retry_count    HistogramDataPoint

task_duration is collected through ExponentialBucketHistogramAggregation (matches *_duration); schedule_delay and retry_count are collected as fixed-boundary HistogramDataPoint with the configured (1, 5, 15, 60, 300, 900, 1800, 3600, 7200, 21600) and (1, 2, 5, 10, 25, 50, 100, 250, 500, 1000) boundaries respectively. Without the patch, all three would fall through to the instrument-type exponential default regardless of name.

Follow-up to apache#64207, which set ExponentialBucketHistogramAggregation as
the instrument-type default for OTel histograms.  Non-timer histograms
(*_count, *_duration, *_delay) still pick bucket boundaries at each
call site.  This change moves that decision into a single declarative
map keyed by metric-name suffix glob, and layers the resulting Views
on top of the existing instrument-type baseline.

Closes apache#66801

Signed-off-by: 1fanwang <1fannnw@gmail.com>
@choo121600 choo121600 added the ready for maintainer review Set after triaging when all criteria pass. label May 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready for maintainer review Set after triaging when all criteria pass.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Declarative OTel histogram bucket views per metric-name pattern

2 participants