
Core: Add OpenTelemetry MetricsReporter #16169

@moomindani

Description

Motivation

Iceberg currently ships three built-in MetricsReporter implementations:

  • LoggingMetricsReporter — writes a verbose toString to a log file (api module)
  • InMemoryMetricsReporter — added in #7447 (Display Spark read metrics on Spark SQL UI) so that Spark can pick up scan metrics and surface them on the Spark SQL UI; it stores only the most recent report in memory and is wired into Spark internally rather than being a general-purpose external sink
  • RESTMetricsReporter — REST catalog only

This leaves users without an out-of-the-box way to ship ScanReport / CommitReport to an external observability platform (Prometheus, CloudWatch, Datadog, Grafana Cloud, Honeycomb, etc.). The gap applies even to Spark users:

  • The Spark SQL UI is production-ready and well-suited to its purpose: human-driven performance tuning and debugging through a Web UI. It is not designed to be an integrated monitoring platform, so it does not replace long-term retention, dashboards across many jobs, or automated alerts.
  • Streaming and 24/7 Spark workloads want metrics in the same observability stack the rest of their services use, rather than requiring an operator to open the UI per job.
  • Multi-engine users (Spark + Trino + Flink, etc.) want a single uniform metrics path rather than per-engine integrations.

Non-Spark engines (Trino, Flink, Dremio, Hive, etc.) and non-REST catalogs (Hive Metastore, Glue, JDBC, Hadoop) lack any built-in option at all.

Several existing issues and PRs touch this gap.

OpenTelemetry is the open, vendor-neutral standard for telemetry, hosted by the CNCF and supported by every major observability vendor and cloud platform (Prometheus, Grafana Cloud, Datadog, New Relic, Honeycomb, AWS CloudWatch, Google Cloud Monitoring, Azure Monitor, Databricks Zerobus, etc.). Because the OpenTelemetry Collector and downstream backends already handle fan-out, transformation, and routing, an OTLP-based MetricsReporter gives Iceberg users a single well-known integration point that reaches all of these backends without Iceberg having to maintain vendor-specific reporter code itself. This is a strong fit for a project, like Iceberg, that intentionally avoids tying its observability story to any single vendor.

Proposed design

Revision (2026-05-01): the original draft of this Issue defined a number of otel.* catalog properties (endpoint, protocol, headers, …) that the reporter would use to build its own OTLP exporter and SdkMeterProvider. After reviewing #14360, I revised the design to align with the same philosophy used there: the host application owns the OpenTelemetry SDK lifecycle, and Iceberg only consumes a Meter from it. The previous property-based design is preserved in the issue history.

The proposal adds a new org.apache.iceberg.metrics.OtelMetricsReporter that implements MetricsReporter, following the standard no-arg constructor + initialize(Map<String, String>) pattern.

SDK ownership

The reporter does not create or own any OpenTelemetry SDK. In initialize, it obtains the OpenTelemetry instance from GlobalOpenTelemetry.get() and acquires a Meter named org.apache.iceberg. This matches the canonical OpenTelemetry usage pattern: the host application (Spark / Flink / Trino / a long-running JVM service) registers an SDK once via OpenTelemetrySdk.builder()...buildAndRegisterGlobal() (or via the OpenTelemetry Java agent), and any library — including Iceberg — picks it up automatically. If no SDK has been registered, OpenTelemetry returns a no-op Meter and metric calls are silently dropped, matching the standard OpenTelemetry API contract.
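
A minimal sketch of what the reporter could look like under this design. MetricsReporter, MetricsReport, ScanReport, and CommitReport are Iceberg's existing types; the class body below is illustrative, not a final implementation.

    // Sketch only: the key point is that initialize() looks up the globally
    // registered OpenTelemetry instance instead of building an SDK of its own.
    package org.apache.iceberg.metrics;

    import io.opentelemetry.api.GlobalOpenTelemetry;
    import io.opentelemetry.api.metrics.Meter;
    import java.util.Map;

    public class OtelMetricsReporter implements MetricsReporter {
      private Meter meter;

      @Override
      public void initialize(Map<String, String> properties) {
        // If the host registered an SDK, this returns it; otherwise this
        // yields a no-op Meter and every recording below is silently dropped.
        this.meter = GlobalOpenTelemetry.get().getMeter("org.apache.iceberg");
      }

      @Override
      public void report(MetricsReport report) {
        if (report instanceof ScanReport) {
          // record iceberg.scan.* instruments (see "Metric mapping" below)
        } else if (report instanceof CommitReport) {
          // record iceberg.commit.* instruments
        }
      }
    }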

This means there is no Iceberg-specific configuration surface for endpoint, protocol, headers, exporter intervals, or resource attributes. All of these are owned by the host application or by the standard OpenTelemetry environment variables (e.g. OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_HEADERS).
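
In practice most host applications would use the OpenTelemetry Java agent or SDK autoconfiguration driven by those environment variables. For completeness, a hand-rolled registration could look roughly like this; the class name and endpoint are placeholders, not part of the proposal.

    // Hypothetical host-side bootstrap: the application owns the SDK and
    // registers it globally so GlobalOpenTelemetry.get() can find it.
    import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
    import io.opentelemetry.sdk.OpenTelemetrySdk;
    import io.opentelemetry.sdk.metrics.SdkMeterProvider;
    import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;

    public final class HostOtelBootstrap {
      public static void main(String[] args) {
        SdkMeterProvider meterProvider =
            SdkMeterProvider.builder()
                .registerMetricReader(
                    PeriodicMetricReader.builder(
                            OtlpGrpcMetricExporter.builder()
                                .setEndpoint("http://localhost:4317") // placeholder endpoint
                                .build())
                        .build())
                .build();

        // Must happen before any Iceberg catalog initializes the reporter,
        // because GlobalOpenTelemetry pins a no-op instance on first get().
        OpenTelemetrySdk.builder().setMeterProvider(meterProvider).buildAndRegisterGlobal();

        // ... start the engine / initialize Iceberg catalogs here ...
      }
    }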

Catalog properties

Just one — registering the reporter:

metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter
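
With a Spark catalog, for example, this is just one more catalog property; the catalog name my_catalog below is a placeholder.

    # Hypothetical Spark session configuration; `my_catalog` is illustrative.
    spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
    spark.sql.catalog.my_catalog.metrics-reporter-impl=org.apache.iceberg.metrics.OtelMetricsReporter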

Metric mapping

Each ScanReport and CommitReport field maps to a stable metric name with attributes that match existing Iceberg conventions; a recording sketch follows the list:

  • iceberg.scan.planning.duration (histogram, ms)
  • iceberg.scan.result.{data_files,delete_files} (sum)
  • iceberg.scan.data_manifests.{scanned,skipped} (sum)
  • iceberg.scan.file_size.bytes (sum, By)
  • iceberg.commit.duration (histogram, ms)
  • iceberg.commit.{attempts,records.added} (sum)
  • iceberg.commit.data_files.{added,removed} (sum)
  • iceberg.commit.file_size.added_bytes (sum, By)

Attributes: iceberg.table.name, iceberg.snapshot.id, iceberg.schema.id, iceberg.operation.
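
To make the mapping concrete, here is a hedged sketch of recording one histogram and one counter from the list above. Attribute values are made up, and instruments would normally be created once in initialize() rather than per report.

    import io.opentelemetry.api.GlobalOpenTelemetry;
    import io.opentelemetry.api.common.Attributes;
    import io.opentelemetry.api.metrics.DoubleHistogram;
    import io.opentelemetry.api.metrics.LongCounter;
    import io.opentelemetry.api.metrics.Meter;

    public final class MetricMappingSketch {
      public static void main(String[] args) {
        Meter meter = GlobalOpenTelemetry.get().getMeter("org.apache.iceberg");

        DoubleHistogram planningDuration =
            meter.histogramBuilder("iceberg.scan.planning.duration").setUnit("ms").build();
        LongCounter resultDataFiles =
            meter.counterBuilder("iceberg.scan.result.data_files").build();

        // Attribute keys follow the list above; values here are placeholders.
        Attributes attrs =
            Attributes.builder()
                .put("iceberg.table.name", "db.events")
                .put("iceberg.snapshot.id", 123456789L)
                .build();

        planningDuration.record(42.0, attrs);
        resultDataFiles.add(17, attrs);
      }
    }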

Dependencies

Only io.opentelemetry:opentelemetry-api is added as compileOnly to iceberg-core. The OpenTelemetry SDK and OTLP exporters are not added to the runtime classpath — they come from the host application. Test scope adds opentelemetry-sdk + opentelemetry-sdk-testing for InMemoryMetricReader-based unit tests, and opentelemetry-exporter-otlp for the (gated) end-to-end smoke test.
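
As a sketch of the intended unit-test setup: a real SDK backed by an in-memory reader is registered globally, the reporter records through it, and the test inspects the collected metric data. The inspection step here is illustrative.

    import io.opentelemetry.api.GlobalOpenTelemetry;
    import io.opentelemetry.sdk.OpenTelemetrySdk;
    import io.opentelemetry.sdk.metrics.SdkMeterProvider;
    import io.opentelemetry.sdk.testing.exporter.InMemoryMetricReader;

    public final class OtelReporterTestSketch {
      public static void main(String[] args) {
        InMemoryMetricReader reader = InMemoryMetricReader.create();

        GlobalOpenTelemetry.resetForTest(); // test-only escape hatch
        OpenTelemetrySdk.builder()
            .setMeterProvider(SdkMeterProvider.builder().registerMetricReader(reader).build())
            .buildAndRegisterGlobal();

        // ... trigger a scan or commit so the reporter records, then:
        reader.collectAllMetrics().forEach(metric -> System.out.println(metric.getName()));
      }
    }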

JDK 11+, no breaking changes.

Validation

Validated end-to-end against two completely different OTLP backends, using the same reporter class without modification; to my mind this portability is the most important property of the design:

  1. Databricks Zerobus Ingest (OTLP/gRPC, Bearer auth) — metrics land directly in a Unity Catalog Delta table; verified with SQL aggregations matching injected values exactly.
  2. Amazon CloudWatch (OTLP/HTTP, SigV4 via OTel Collector) — same reporter, same metric names, same attributes; verified via PromQL sum by() and ratio queries.

In both cases the host process built and registered an OpenTelemetrySdk (with the appropriate exporter and headers) before initializing Iceberg's reporter. The reporter itself was unchanged across the two backends. Detailed validation reports (commands, queries, gotchas) can be shared on request or attached to the PR.

Disclosure

I used Claude Code to help draft and prototype this work. I reviewed every change by hand and ran the full test/lint loop locally before each iteration; the validation results above are from my own runs against real backends. Per the project's AI-assisted contribution guidelines, I am keeping this Issue open for design feedback before opening a PR.

Open questions

These are the calls I am currently planning to make. I am happy to adjust if folks see a better path.

  1. Module placement. Leaning toward iceberg-core with opentelemetry-api declared as compileOnly, so users who do not opt in pay no runtime classpath cost. If a separate iceberg-opentelemetry module would fit better with how Iceberg evolves observability features, please let me know.
  2. Configuration surface. The current design exposes zero Iceberg-specific properties; everything flows through GlobalOpenTelemetry. Is that the right call, or should the reporter also accept a small fallback set (e.g. otel.endpoint) so that catalogs configured purely through Spark --conf / Flink config can still drive the exporter without having to register a global SDK in the host JVM?
  3. Metric naming. No OpenTelemetry semantic conventions exist for table-format operations today, so I have used iceberg.scan.* / iceberg.commit.* to stay close to the names already used in MetricsContext. If someone is tracking work upstream in opentelemetry-specification, I would rather align with that than ship a name that has to change later.
  4. Coordination with #14360 (Core: Add support for OpenTelemetry in HTTPClient). This proposal is complementary: that PR instruments REST-catalog HTTP calls, while this one instruments Iceberg-level scan/commit reports. Both can use the same GlobalOpenTelemetry instance in the host process. cc @ebyhr; happy to coordinate on dependency / module / license decisions to keep them consistent.
  5. Anything else I should resolve before opening a PR?
