
Core: Add OpenTelemetry MetricsReporter #16169

@moomindani

Description

Motivation

Iceberg currently ships three built-in MetricsReporter implementations:

  • LoggingMetricsReporter — writes a verbose toString to a log file (api module)
  • InMemoryMetricsReporter — added in #7447 (Display Spark read metrics on Spark SQL UI) so that Spark can pick up scan metrics and surface them on the Spark SQL UI; it stores only the most recent report in memory and is wired into Spark internally rather than being a general-purpose external sink
  • RESTMetricsReporter — REST catalog only

This leaves users without an out-of-the-box way to ship ScanReport / CommitReport to an external observability platform (Prometheus, CloudWatch, Datadog, Grafana Cloud, Honeycomb, etc.). The gap applies even to Spark users:

  • The Spark SQL UI is production-ready and well-suited to its purpose: human-driven performance tuning and debugging through a Web UI. It is not designed to be an integrated monitoring platform, so it does not replace long-term retention, dashboards across many jobs, or automated alerts.
  • Streaming and 24/7 Spark workloads want metrics in the same observability stack the rest of their services use, rather than requiring an operator to open the UI per job.
  • Multi-engine users (Spark + Trino + Flink, etc.) want a single uniform metrics path rather than per-engine integrations.

Non-Spark engines (Trino, Flink, Dremio, Hive, etc.) and non-REST catalogs (Hive Metastore, Glue, JDBC, Hadoop) lack any built-in option at all.

Several existing issues and PRs touch this gap.

OpenTelemetry is the open, vendor-neutral standard for telemetry, hosted by the CNCF and supported by every major observability vendor and cloud platform (Prometheus, Grafana Cloud, Datadog, New Relic, Honeycomb, AWS CloudWatch, Google Cloud Monitoring, Azure Monitor, Databricks Zerobus, etc.). Because the OpenTelemetry Collector and downstream backends already handle fan-out, transformation, and routing, an OTLP-based MetricsReporter gives Iceberg users a single well-known integration point that reaches all of these backends without Iceberg having to maintain vendor-specific reporter code itself. This is a strong fit for a project, like Iceberg, that intentionally avoids tying its observability story to any single vendor.

Proposed design

Revision (2026-05-01): the original draft of this Issue defined a number of otel.* catalog properties (endpoint, protocol, headers, …) that the reporter would use to build its own OTLP exporter and SdkMeterProvider. After reviewing #14360, I revised the design to align with the same philosophy used there: the host application owns the OpenTelemetry SDK lifecycle, and Iceberg only consumes a Meter from it. The previous property-based design is preserved in the issue history.

The proposal adds a new org.apache.iceberg.metrics.OtelMetricsReporter that implements MetricsReporter, following the standard no-arg constructor + initialize(Map<String, String>) pattern.

SDK ownership

The reporter does not create or own any OpenTelemetry SDK. In initialize, it obtains the OpenTelemetry instance from GlobalOpenTelemetry.get() and acquires a Meter named org.apache.iceberg. This matches the canonical OpenTelemetry usage pattern: the host application (Spark / Flink / Trino / a long-running JVM service) registers an SDK once via OpenTelemetrySdk.builder()...buildAndRegisterGlobal() (or via the OpenTelemetry Java agent), and any library — including Iceberg — picks it up automatically. If no SDK has been registered, OpenTelemetry returns a no-op Meter and metric calls are silently dropped, matching the standard OpenTelemetry API contract.
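
A minimal sketch of what the reporter could look like under this design. MetricsReporter, MetricsReport, ScanReport, and CommitReport are Iceberg's existing types; the class body below is illustrative, not a final implementation.

    // Sketch only: the key point is that initialize() looks up the globally
    // registered OpenTelemetry instance instead of building an SDK of its own.
    package org.apache.iceberg.metrics;

    import io.opentelemetry.api.GlobalOpenTelemetry;
    import io.opentelemetry.api.metrics.Meter;
    import java.util.Map;

    public class OtelMetricsReporter implements MetricsReporter {
      private Meter meter;

      @Override
      public void initialize(Map<String, String> properties) {
        // If the host registered an SDK, this returns it; otherwise this
        // yields a no-op Meter and every recording below is silently dropped.
        this.meter = GlobalOpenTelemetry.get().getMeter("org.apache.iceberg");
      }

      @Override
      public void report(MetricsReport report) {
        if (report instanceof ScanReport) {
          // record iceberg.scan.* instruments (see "Metric mapping" below)
        } else if (report instanceof CommitReport) {
          // record iceberg.commit.* instruments
        }
      }
    }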

This means there is no Iceberg-specific configuration surface for endpoint, protocol, headers, exporter intervals, or resource attributes. All of these are owned by the host application or by the standard OpenTelemetry environment variables (e.g. OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_HEADERS).
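
In practice most host applications would use the OpenTelemetry Java agent or SDK autoconfiguration driven by those environment variables. For completeness, a hand-rolled registration could look roughly like this; the class name and endpoint are placeholders, not part of the proposal.

    // Hypothetical host-side bootstrap: the application owns the SDK and
    // registers it globally so GlobalOpenTelemetry.get() can find it.
    import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
    import io.opentelemetry.sdk.OpenTelemetrySdk;
    import io.opentelemetry.sdk.metrics.SdkMeterProvider;
    import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;

    public final class HostOtelBootstrap {
      public static void main(String[] args) {
        SdkMeterProvider meterProvider =
            SdkMeterProvider.builder()
                .registerMetricReader(
                    PeriodicMetricReader.builder(
                            OtlpGrpcMetricExporter.builder()
                                .setEndpoint("http://localhost:4317") // placeholder endpoint
                                .build())
                        .build())
                .build();

        // Must happen before any Iceberg catalog initializes the reporter,
        // because GlobalOpenTelemetry pins a no-op instance on first get().
        OpenTelemetrySdk.builder().setMeterProvider(meterProvider).buildAndRegisterGlobal();

        // ... start the engine / initialize Iceberg catalogs here ...
      }
    }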

Catalog properties

Just one — registering the reporter:

metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter
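
With a Spark catalog, for example, this is just one more catalog property; the catalog name my_catalog below is a placeholder.

    # Hypothetical Spark session configuration; `my_catalog` is illustrative.
    spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.my_catalog.metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter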

Metric mapping

Each ScanReport and CommitReport field maps to a stable metric name with attributes that match existing Iceberg conventions; a recording sketch follows the list:

  • iceberg.scan.planning.duration (histogram, ms)
  • iceberg.scan.result.{data_files,delete_files} (sum)
  • iceberg.scan.data_manifests.{scanned,skipped} (sum)
  • iceberg.scan.file_size.bytes (sum, By)
  • iceberg.commit.duration (histogram, ms)
  • iceberg.commit.{attempts,records.added} (sum)
  • iceberg.commit.data_files.{added,removed} (sum)
  • iceberg.commit.file_size.added_bytes (sum, By)

Attributes: iceberg.table.name, iceberg.snapshot.id, iceberg.schema.id, iceberg.operation.
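
To make the mapping concrete, here is a hedged sketch of recording one histogram and one counter from the list above. Attribute values are made up, and instruments would normally be created once in initialize() rather than per report.

    import io.opentelemetry.api.GlobalOpenTelemetry;
    import io.opentelemetry.api.common.Attributes;
    import io.opentelemetry.api.metrics.DoubleHistogram;
    import io.opentelemetry.api.metrics.LongCounter;
    import io.opentelemetry.api.metrics.Meter;

    public final class MetricMappingSketch {
      public static void main(String[] args) {
        Meter meter = GlobalOpenTelemetry.get().getMeter("org.apache.iceberg");

        DoubleHistogram planningDuration =
            meter.histogramBuilder("iceberg.scan.planning.duration").setUnit("ms").build();
        LongCounter resultDataFiles =
            meter.counterBuilder("iceberg.scan.result.data_files").build();

        // Attribute keys follow the list above; values here are placeholders.
        Attributes attrs =
            Attributes.builder()
                .put("iceberg.table.name", "db.events")
                .put("iceberg.snapshot.id", 123456789L)
                .build();

        planningDuration.record(42.0, attrs);
        resultDataFiles.add(17, attrs);
      }
    }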

Dependencies

Only io.opentelemetry:opentelemetry-api is added as compileOnly to iceberg-core. The OpenTelemetry SDK and OTLP exporters are not added to the runtime classpath — they come from the host application. Test scope adds opentelemetry-sdk + opentelemetry-sdk-testing for InMemoryMetricReader-based unit tests, and opentelemetry-exporter-otlp for the (gated) end-to-end smoke test.
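
As a sketch of the intended unit-test setup: a real SDK backed by an in-memory reader is registered globally, the reporter records through it, and the test inspects the collected metric data. The inspection step here is illustrative.

    import io.opentelemetry.api.GlobalOpenTelemetry;
    import io.opentelemetry.sdk.OpenTelemetrySdk;
    import io.opentelemetry.sdk.metrics.SdkMeterProvider;
    import io.opentelemetry.sdk.testing.exporter.InMemoryMetricReader;

    public final class OtelReporterTestSketch {
      public static void main(String[] args) {
        InMemoryMetricReader reader = InMemoryMetricReader.create();

        GlobalOpenTelemetry.resetForTest(); // test-only escape hatch
        OpenTelemetrySdk.builder()
            .setMeterProvider(SdkMeterProvider.builder().registerMetricReader(reader).build())
            .buildAndRegisterGlobal();

        // ... trigger a scan or commit so the reporter records, then:
        reader.collectAllMetrics().forEach(metric -> System.out.println(metric.getName()));
      }
    }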

JDK 11+, no breaking changes.

Validation

Validated end-to-end against two completely different OTLP backends, using the same reporter class without modification; to my mind this portability is the most important property of the design:

  1. Databricks Zerobus Ingest (OTLP/gRPC, Bearer auth) — metrics land directly in a Unity Catalog Delta table; verified with SQL aggregations matching injected values exactly.
  2. Amazon CloudWatch (OTLP/HTTP, SigV4 via OTel Collector) — same reporter, same metric names, same attributes; verified via PromQL sum by() and ratio queries.

In both cases the host process built and registered an OpenTelemetrySdk (with the appropriate exporter and headers) before initializing Iceberg's reporter. The reporter itself was unchanged across the two backends. Detailed validation reports (commands, queries, gotchas) can be shared on request or attached to the PR.

Disclosure

I used Claude Code to help draft and prototype this work. I reviewed every change by hand and ran the full test/lint loop locally before each iteration; the validation results above are from my own runs against real backends. Per the project's AI-assisted contribution guidelines, I am keeping this Issue open for design feedback before opening a PR.

Open questions

These are the calls I am currently planning to make. I am happy to adjust if folks see a better path.

  1. Module placement. Leaning toward iceberg-core with opentelemetry-api declared as compileOnly, so users who do not opt in pay no runtime classpath cost. If a separate iceberg-opentelemetry module would fit better with how Iceberg evolves observability features, please let me know.
  2. Configuration surface. The current design exposes zero Iceberg-specific properties; everything flows through GlobalOpenTelemetry. Is that the right call, or should the reporter also accept a small fallback set (e.g. otel.endpoint) so that catalogs configured purely through Spark --conf / Flink config can still drive the exporter without having to register a global SDK in the host JVM?
  3. Metric naming. No OpenTelemetry semantic conventions exist for table-format operations today, so I have used iceberg.scan.* / iceberg.commit.* to stay close to the names already used in MetricsContext. If someone is tracking work upstream in opentelemetry-specification, I would rather align with that than ship a name that has to change later.
  4. Coordination with #14360 (Core: Add support for OpenTelemetry in HTTPClient). This proposal is complementary: that PR instruments REST-catalog HTTP calls, while this one instruments Iceberg-level scan/commit reports. Both can use the same GlobalOpenTelemetry instance in the host process. cc @ebyhr; happy to coordinate on dependency / module / license decisions to keep them consistent.
  5. Anything else I should resolve before opening a PR?
