Motivation

Iceberg currently ships three built-in MetricsReporter implementations:
LoggingMetricsReporter — writes a verbose toString to a log file (api module)
InMemoryMetricsReporter — added in Display Spark read metrics on Spark SQL UI #7447 so that Spark can pick up scan metrics and surface them on the Spark SQL UI; it stores only the most recent report in memory and is wired into Spark internally rather than being a general-purpose external sink
RESTMetricsReporter — REST catalog only
This leaves users without an out-of-the-box way to ship ScanReport / CommitReport to an external observability platform (Prometheus, CloudWatch, Datadog, Grafana Cloud, Honeycomb, etc.). The gap applies even to Spark users:
The Spark SQL UI is production-ready and well-suited to its purpose: human-driven performance tuning and debugging through a Web UI. It is not designed to be an integrated monitoring platform, so it does not replace long-term retention, dashboards across many jobs, or automated alerts.
Streaming and 24/7 Spark workloads want metrics in the same observability stack the rest of their services use, rather than requiring an operator to open the UI per job.
Multi-engine users (Spark + Trino + Flink, etc.) want a single uniform metrics path rather than per-engine integrations.
Non-Spark engines (Trino, Flink, Dremio, Hive, etc.) and non-REST catalogs (Hive Metastore, Glue, JDBC, Hadoop) lack any built-in option at all.
Several existing issues and PRs touch this gap:
org.apache.iceberg.SnapshotUpdate#metricsReporter(MetricsReporter reporter) #14490, Improvement: Allow callers to configure metrics reporter on ScanBuilder #14875, and CreateSnapshotEvent / CommitReport are sent prematurely when using a Transaction #7278 — requests / discussions around metrics export
A CloudMonitoringMetricsReporter that uses reflection to access internal types
Core: Add support for OpenTelemetry in HTTPClient #14360 — adds OpenTelemetry support to HTTPClient for REST-catalog HTTP traceability (complementary scope: instruments the network layer, while this proposal instruments Iceberg-level scan/commit metrics)
OpenTelemetry is the open, vendor-neutral standard for telemetry, hosted by the CNCF and supported by every major observability vendor and cloud platform (Prometheus, Grafana Cloud, Datadog, New Relic, Honeycomb, AWS CloudWatch, Google Cloud Monitoring, Azure Monitor, Databricks Zerobus, etc.). Because the OpenTelemetry Collector and downstream backends already handle fan-out, transformation, and routing, an OTLP-based MetricsReporter gives Iceberg users a single well-known integration point that reaches all of these backends without Iceberg having to maintain vendor-specific reporter code itself. This is a strong fit for a project, like Iceberg, that intentionally avoids tying its observability story to any single vendor.
Proposed design
Revision (2026-05-01): the original draft of this Issue defined a number of otel.* catalog properties (endpoint, protocol, headers, …) that the reporter would use to build its own OTLP exporter and SdkMeterProvider. After reviewing #14360, I revised the design to align with the same philosophy used there: the host application owns the OpenTelemetry SDK lifecycle, and Iceberg only consumes a Meter from it. The previous property-based design is preserved in the issue history.
A new org.apache.iceberg.metrics.OtelMetricsReporter implementing MetricsReporter with the standard no-arg constructor + initialize(Map<String,String>) pattern.
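A minimal sketch of the assumed shape (the instrument wiring, field names, and method bodies here are illustrative, not the final implementation; ScanReport, TimerResult, and the MetricsReporter interface are Iceberg's existing metrics API):

```java
package org.apache.iceberg.metrics;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.LongHistogram;
import io.opentelemetry.api.metrics.Meter;
import java.util.Map;

// Illustrative sketch only — the instrument wiring is an assumption, not final code.
public class OtelMetricsReporter implements MetricsReporter {
  private LongHistogram planningDuration;

  // No-arg constructor so catalogs can instantiate the reporter reflectively.
  public OtelMetricsReporter() {}

  @Override
  public void initialize(Map<String, String> properties) {
    // The host application owns the SDK; without a registered SDK this is a no-op Meter.
    Meter meter = GlobalOpenTelemetry.get().getMeter("org.apache.iceberg");
    this.planningDuration =
        meter.histogramBuilder("iceberg.scan.planning.duration").ofLongs().setUnit("ms").build();
  }

  @Override
  public void report(MetricsReport report) {
    if (report instanceof ScanReport) {
      TimerResult planning = ((ScanReport) report).scanMetrics().totalPlanningDuration();
      if (planning != null) {
        planningDuration.record(planning.totalDuration().toMillis());
      }
    }
    // CommitReport fields map to iceberg.commit.* instruments in the same way.
  }
}
```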
SDK ownership
The reporter does not create or own any OpenTelemetry SDK. In initialize, it obtains the OpenTelemetry instance from GlobalOpenTelemetry.get() and acquires a Meter named org.apache.iceberg. This matches the canonical OpenTelemetry usage pattern: the host application (Spark / Flink / Trino / a long-running JVM service) registers an SDK once via OpenTelemetrySdk.builder()...buildAndRegisterGlobal() (or via the OpenTelemetry Java agent), and any library — including Iceberg — picks it up automatically. If no SDK has been registered, OpenTelemetry returns a no-op Meter and metric calls are silently dropped, matching the standard OpenTelemetry API contract.
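For illustration, the host-side half of that handoff might look like the following (host application code, not Iceberg code; assumes opentelemetry-sdk and opentelemetry-exporter-otlp on the host classpath, with the endpoint and interval purely illustrative):

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;
import java.time.Duration;

public final class OtelBootstrap {
  public static void main(String[] args) {
    // Register the SDK once, globally, before any Iceberg catalog is created.
    SdkMeterProvider meterProvider =
        SdkMeterProvider.builder()
            .registerMetricReader(
                PeriodicMetricReader.builder(
                        OtlpGrpcMetricExporter.builder()
                            .setEndpoint("https://collector.example.com:4317") // illustrative
                            .build())
                    .setInterval(Duration.ofSeconds(30))
                    .build())
            .build();
    OpenTelemetrySdk.builder().setMeterProvider(meterProvider).buildAndRegisterGlobal();

    // Any library — including Iceberg's reporter — now picks up the same instance:
    Meter meter = GlobalOpenTelemetry.get().getMeter("org.apache.iceberg");
  }
}
```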
This means there is no Iceberg-specific configuration surface for endpoint, protocol, headers, exporter intervals, or resource attributes. All of these are owned by the host application or by the standard OpenTelemetry environment variables (e.g. OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_HEADERS).

Catalog properties
Just one — registering the reporter, via the standard metrics-reporter-impl catalog property.
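Concretely, for a Spark job whose OpenTelemetry SDK comes from the Java agent or SDK autoconfigure, the entire wiring is one catalog property plus standard OTEL_* environment variables (the catalog name demo and all values are illustrative; metrics-reporter-impl is Iceberg's existing catalog property for plugging in a MetricsReporter):

```shell
# OpenTelemetry side: exporter configuration via standard env vars, owned by the host
export OTEL_SERVICE_NAME=nightly-compaction
export OTEL_EXPORTER_OTLP_ENDPOINT=https://collector.example.com:4317
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"

# Iceberg side: register the reporter on the catalog
spark-sql --conf \
  spark.sql.catalog.demo.metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter
```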
Metric mapping
Each ScanReport and CommitReport field maps to a stable metric name with attributes that match existing Iceberg conventions:
iceberg.scan.planning.duration (histogram, ms)
iceberg.scan.result.{data_files,delete_files} (sum)
iceberg.scan.data_manifests.{scanned,skipped} (sum)
iceberg.scan.file_size.bytes (sum, By)
iceberg.commit.duration (histogram, ms)
iceberg.commit.{attempts,records.added} (sum)
iceberg.commit.data_files.{added,removed} (sum)
iceberg.commit.file_size.added_bytes (sum, By)
Attributes: iceberg.table.name, iceberg.snapshot.id, iceberg.schema.id, iceberg.operation.

Dependencies
Only io.opentelemetry:opentelemetry-api is added as compileOnly to iceberg-core. The OpenTelemetry SDK and OTLP exporters are not added to the runtime classpath — they come from the host application. Test scope adds opentelemetry-sdk + opentelemetry-sdk-testing for InMemoryMetricReader-based unit tests, and opentelemetry-exporter-otlp for the (gated) end-to-end smoke test.
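In build terms the change would look roughly like this sketch of iceberg-core's Gradle file (artifact coordinates are the published OpenTelemetry artifacts; the 1.40.0 version is illustrative):

```groovy
dependencies {
  // API only at compile time — no runtime classpath cost for users who don't opt in.
  compileOnly "io.opentelemetry:opentelemetry-api:1.40.0"

  // Test scope: real SDK + in-memory reader for unit tests,
  // OTLP exporter for the gated end-to-end smoke test.
  testImplementation "io.opentelemetry:opentelemetry-sdk:1.40.0"
  testImplementation "io.opentelemetry:opentelemetry-sdk-testing:1.40.0"
  testImplementation "io.opentelemetry:opentelemetry-exporter-otlp:1.40.0"
}
```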
JDK 11+, no breaking changes.
Validation
Validated end-to-end against two completely different OTLP backends, using the same reporter class without modification — to my mind the most important property of the design:
Databricks Zerobus Ingest (OTLP/gRPC, Bearer auth) — metrics land directly in a Unity Catalog Delta table; verified with SQL aggregations matching injected values exactly.
Amazon CloudWatch (OTLP/HTTP, SigV4 via OTel Collector) — same reporter, same metric names, same attributes; verified via PromQL sum by() and ratio queries.
In both cases the host process built and registered an OpenTelemetrySdk (with the appropriate exporter and headers) before initializing Iceberg's reporter. The reporter itself was unchanged across the two backends. Detailed validation reports (commands, queries, gotchas) can be shared on request or attached to the PR.
Disclosure
I used Claude Code to help draft and prototype this work. I reviewed every change by hand and ran the full test/lint loop locally before each iteration; the validation results above are from my own runs against real backends. Per the project's AI-assisted contribution guidelines, I am keeping this Issue open for design feedback before opening a PR.
Open questions
These are the calls I am currently planning to make. I am happy to adjust if folks see a better path.
Module placement. Leaning toward iceberg-core with opentelemetry-api declared as compileOnly, so users who do not opt in pay no runtime classpath cost. If a separate iceberg-opentelemetry module would fit better with how Iceberg evolves observability features, please let me know.
Configuration surface. The current design exposes zero Iceberg-specific properties; everything flows through GlobalOpenTelemetry. Is that the right call, or should the reporter also accept a small fallback set (e.g. otel.endpoint) so that catalogs configured purely through Spark --conf / Flink config can still drive the exporter without having to register a global SDK in the host JVM?
Metric naming. No OpenTelemetry semantic conventions exist for table-format operations today, so I have used iceberg.scan.* / iceberg.commit.* to stay close to the names already used in MetricsContext. If someone is tracking work upstream in opentelemetry-specification, I would rather align with that than ship a name that has to change later.
Coordination with #14360. Both features consume the same GlobalOpenTelemetry instance in the host process. cc @ebyhr — happy to coordinate on dependency / module / license decisions to keep them consistent.