RocketMQ version: any build with #10240 (`BatchSplittingMetricExporter`), running against OpenTelemetry Java SDK 1.44.0 ~ 1.46.x.
JDK: 11 / 21 (both affected).
Describe the bug
After #10239 merged `BatchSplittingMetricExporter` into the broker, brokers running on OpenTelemetry Java 1.44.1 (the repo default) begin leaking heap linearly at roughly 3 ~ 10 MB per metric collection cycle. Old Gen climbs monotonically to OOM within hours on a 6 GB heap; Eden remains normal.
Root cause
OpenTelemetry Java 1.44.0 changed the default `OtlpGrpcMetricExporter` memory mode to `MemoryMode.REUSABLE_DATA`. The `MetricReusableDataMarshaler` implementation keeps a pool of marshalers in a non-thread-safe `ArrayDeque`:
- `pool.poll()` is called on the `PeriodicMetricReader` thread during `export()`
- `pool.add(marshaler)` is called on the OkHttp callback thread inside the `whenComplete` lambda
`BatchSplittingMetricExporter.export()` issues N concurrent sub-batch exports per cycle, so the two call sites race on the `ArrayDeque` and corrupt its internal `head`/`tail`/`elements` invariant. The pool grows unbounded, and each leaked marshaler retains ~132 KiB of `MarshalerContext` internal caches.
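For illustration, the borrow/return pattern described above can be sketched as below. Class and method names are simplified stand-ins (a `byte[]` stands in for the marshaler, `CompletableFuture` for the transport callback), not the SDK's actual internals; the sketch uses the `ConcurrentLinkedDeque` that the upstream fix swaps in, whereas 1.44.x used a plain `ArrayDeque` at the same two call sites.

```java
import java.util.Deque;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedDeque;

class PooledExporter {
    // Thread-safe pool; OTel 1.44.x kept this as a non-thread-safe ArrayDeque,
    // which is what the two call sites below race on.
    private final Deque<byte[]> pool = new ConcurrentLinkedDeque<>();

    CompletableFuture<Void> export() {
        byte[] marshaler = pool.poll();      // call site 1: metric reader thread
        if (marshaler == null) {
            marshaler = new byte[1024];      // allocate on first use
        }
        final byte[] borrowed = marshaler;
        CompletableFuture<Void> send = CompletableFuture.runAsync(() -> {
            // ... serialize and send one sub-batch using `borrowed` ...
        });
        // call site 2: completion runs on the transport's callback thread
        return send.whenComplete((ok, err) -> pool.add(borrowed));
    }

    int poolSize() {
        return pool.size();
    }
}
```

With concurrent sub-batch exports, call site 1 and call site 2 execute on different threads simultaneously; `ConcurrentLinkedDeque` makes that safe, while `ArrayDeque` silently corrupts its `head`/`tail`/`elements` state.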
Evidence
MAT heap dump of an OOM'd broker:
- 93% of heap retained by one `BatchSplittingMetricExporter → MetricReusableDataMarshaler.marshalerPool`
- 17,531 leaked `LowAllocationMetricsRequestMarshaler` instances (should be <= 1 in steady state)
- `ArrayDeque` invariant broken: logical `size()` = 32,233, actual non-null slots = 17,531 — direct evidence of cross-thread race corruption
Upstream issue and fix: `open-telemetry/opentelemetry-java#7019` → PR `open-telemetry/opentelemetry-java#7041` (replaces `ArrayDeque` with `ConcurrentLinkedDeque`). Released in OpenTelemetry Java v1.47.0 (2025-02-07).
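Until RocketMQ bumps its OpenTelemetry dependency, pinning the exporter artifact to the fixed release avoids the leak. A sketch assuming Maven coordinates (the broker's actual dependency management may differ):

```xml
<!-- Pin the OTLP exporter to >= 1.47.0, where opentelemetry-java#7041
     replaced the ArrayDeque marshaler pool with ConcurrentLinkedDeque. -->
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-exporter-otlp</artifactId>
  <version>1.47.0</version>
</dependency>
```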
Steps to reproduce
Expected behavior
Broker heap stays bounded; OTLP metrics export does not leak marshaler instances.
Proposed fix
See linked PR.