[Bug] OTLP metric exporter leaks memory until OOM when using BatchSplittingMetricExporter on OpenTelemetry Java 1.44.0 ~ 1.46.x #10266

@Houlong66

Description

BUG REPORT

RocketMQ version: any build that includes the `BatchSplittingMetricExporter` from PR #10239 (issue #10240), running against OpenTelemetry Java SDK 1.44.0 ~ 1.46.x.

JDK: 11 / 21 (both affected).

Describe the bug

After PR #10239 merged `BatchSplittingMetricExporter` into the broker, brokers running on OpenTelemetry Java 1.44.1 (the repo default) begin leaking heap linearly, at roughly 3 ~ 10 MB per metric collection cycle. Old Gen climbs monotonically to OOM within hours on a 6 GB heap, while Eden remains normal.

Root cause

OpenTelemetry Java 1.44.0 changed the default `OtlpGrpcMetricExporter` memory mode to `MemoryMode.REUSABLE_DATA`. The `MetricReusableDataMarshaler` implementation keeps a pool of marshalers in a non-thread-safe `ArrayDeque`:

  • `pool.poll()` is called on the `PeriodicMetricReader` thread during `export()`
  • `pool.add(marshaler)` is called on the OkHttp callback thread inside the `whenComplete` lambda

`BatchSplittingMetricExporter.export()` issues N concurrent sub-batch exports per cycle, so the two call sites race on the `ArrayDeque` and corrupt its internal `head`/`tail`/`elements` invariant. The pool grows unbounded, and each leaked marshaler retains ~132 KiB of `MarshalerContext` internal caches.
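The racy pattern can be sketched as below. Class and method names here are illustrative stand-ins, not the actual SDK internals; the demo runs single-threaded for determinism, with comments marking where the two threads race in the real exporter.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch of the racy pool pattern described above.
// The real class is MetricReusableDataMarshaler; names here are hypothetical.
class RacyMarshalerPool {
    // Non-thread-safe pool: this is the bug.
    private final Deque<Object> pool = new ArrayDeque<>();

    CompletableFuture<Void> export() {
        // In the SDK this poll() runs on the PeriodicMetricReader thread.
        Object polled = pool.poll();
        final Object marshaler =
            polled != null ? polled : new Object(); // stands in for LowAllocationMetricsRequestMarshaler
        return doExport(marshaler).whenComplete((r, t) -> {
            // In the SDK this add() runs on the OkHttp callback thread and
            // races with poll() when BatchSplittingMetricExporter issues
            // N concurrent sub-batch exports.
            pool.add(marshaler);
        });
    }

    private CompletableFuture<Void> doExport(Object marshaler) {
        return CompletableFuture.completedFuture(null); // network call elided
    }

    int poolSize() {
        return pool.size();
    }
}
```

Single-threaded, the pool stays at exactly one marshaler in steady state; only concurrent use corrupts the `ArrayDeque`.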

Evidence

MAT heap dump of an OOM'd broker:

  • 93% of heap retained by one `BatchSplittingMetricExporter → MetricReusableDataMarshaler.marshalerPool`
  • 17,531 leaked `LowAllocationMetricsRequestMarshaler` instances (should be <= 1 in steady state)
  • `ArrayDeque` invariant broken: logical `size()` = 32,233, actual non-null slots = 17,531 — direct evidence of cross-thread race corruption

Upstream issue and fix: `open-telemetry/opentelemetry-java#7019` → PR `open-telemetry/opentelemetry-java#7041` (replaces `ArrayDeque` with `ConcurrentLinkedDeque`). Released in OpenTelemetry Java v1.47.0 (2025-02-07).
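The shape of the upstream fix can be illustrated with a minimal sketch (class and method names are assumptions, not the actual patch): swapping the pool for a `ConcurrentLinkedDeque` makes concurrent `poll()`/`add()` safe, so the pool can never hold more marshalers than there were concurrent borrowers.

```java
import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.CountDownLatch;

// Minimal sketch of the v1.47.0 fix: a thread-safe marshaler pool.
class SafeMarshalerPool {
    private final Deque<Object> pool = new ConcurrentLinkedDeque<>();

    Object borrow() {
        Object m = pool.poll();              // safe under concurrency
        return m != null ? m : new Object(); // allocate only when pool is empty
    }

    void release(Object m) {
        pool.add(m);                         // safe under concurrency
    }

    int size() {
        return pool.size();
    }

    // Hammer the pool from many threads. Each thread holds at most one
    // object at a time, so the pool can never grow past `threads` entries.
    static int stress(int threads, int iterations) throws InterruptedException {
        SafeMarshalerPool p = new SafeMarshalerPool();
        CountDownLatch done = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                for (int j = 0; j < iterations; j++) {
                    Object m = p.borrow();
                    p.release(m);
                }
                done.countDown();
            }).start();
        }
        done.await();
        return p.size();
    }
}
```

With the unsynchronized `ArrayDeque`, the same workload can corrupt `head`/`tail` and grow without bound, matching the heap-dump evidence above.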

Steps to reproduce

  1. Apply PR #10239 ("[ISSUE #10240] Add BatchSplittingMetricExporter to prevent OTLP gRPC export failures") to a broker
  2. Enable `metricsExporterType=OTLP_GRPC` with a high-cardinality workload (many unique `consumer_group × topic` combinations)
  3. Run at steady traffic on OpenTelemetry Java 1.44.0 ~ 1.46.x
  4. Observe G1 Old Gen monotonically climb to heap limit → OOM

Expected behavior

Broker heap stays bounded; OTLP metrics export does not leak marshaler instances.

Proposed fix

See linked PR:

  1. Bump OpenTelemetry to 1.47.0 (upstream `ArrayDeque → ConcurrentLinkedDeque` fix)
  2. Default `OtlpGrpcMetricExporter.memoryMode` to `IMMUTABLE_DATA`, exposed via `brokerConfig.metricsExportOtelMemoryMode`
  3. Cap concurrent sub-batches with `brokerConfig.metricsExportBatchMaxConcurrent` (default 4)
  4. Escape hatch `brokerConfig.metricsExportBatchSplitEnabled` (default true) to bypass the splitter entirely
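Item 3 (capping concurrent sub-batches) could look roughly like the following sketch. The class shape and the `stress` driver are assumptions for illustration, not the actual patch; only the config key name comes from the proposal above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: bound concurrent sub-batch exports with a Semaphore.
// maxConcurrent would come from brokerConfig.metricsExportBatchMaxConcurrent.
class BoundedBatchExporter {
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    final AtomicInteger maxObserved = new AtomicInteger();

    BoundedBatchExporter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    void exportSubBatch(Runnable send) throws InterruptedException {
        permits.acquire();             // block when the cap is reached
        try {
            int now = inFlight.incrementAndGet();
            maxObserved.accumulateAndGet(now, Math::max);
            send.run();                // stands in for one OTLP gRPC export
            inFlight.decrementAndGet();
        } finally {
            permits.release();
        }
    }

    // Fire many sub-batches at once and report the peak concurrency seen.
    static int stress(int cap, int tasks) throws InterruptedException {
        BoundedBatchExporter exporter = new BoundedBatchExporter(cap);
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                try {
                    exporter.exportSubBatch(() -> { });
                } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                }
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        return exporter.maxObserved.get();
    }
}
```

Capping concurrency also bounds how many marshalers can be in flight at once, which limits worst-case damage even on an unpatched SDK.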
