[ISSUE #10266] Fix OOM caused by OpenTelemetry 1.44 OtlpGrpcMetricExporter pool race#10267

Open
Houlong66 wants to merge 1 commit into apache:develop from Houlong66:fix/otel-1.44-reusable-pool-race

Houlong66 commented Apr 20, 2026

Summary

Closes #10266. Brokers running BatchSplittingMetricExporter (#10240) on OpenTelemetry Java SDK 1.44.0 through 1.46.x leak heap linearly until OOM, because MetricReusableDataMarshaler keeps its marshaler pool in a non-thread-safe ArrayDeque. Upstream fixed this in OpenTelemetry 1.47.0 via open-telemetry/opentelemetry-java#7041. This PR applies a three-layer defense on the broker side.

Changes

  • pom.xml: bump opentelemetry.version 1.44.1 → 1.47.0 and opentelemetry-exporter-prometheus.version → 1.47.0-alpha. OT 1.47.0 replaces the racy ArrayDeque with ConcurrentLinkedDeque.

  • BrokerConfig: three new fields, all hot-updatable via updateBrokerConfig:

    • metricsExportOtelMemoryMode (String, default "IMMUTABLE_DATA", case-insensitive; valid: IMMUTABLE_DATA / REUSABLE_DATA). The default forces OtlpGrpcMetricExporter to IMMUTABLE_DATA, which bypasses the pool path entirely.
    • metricsExportBatchMaxConcurrent (int, default 4). Bounds in-flight sub-batches in BatchSplittingMetricExporter via a Semaphore. Set to 1 to serialize exports (matching the behavior before batch splitting); 0 or Integer.MAX_VALUE disables the limit.
    • metricsExportBatchSplitEnabled (boolean, default true). Escape hatch: set false to skip the splitter wrapper entirely and use the raw OtlpGrpcMetricExporter.
  • BrokerMetricsManager:

    • Reads metricsExportOtelMemoryMode via a new resolveMemoryMode(String) helper (invalid values → IMMUTABLE_DATA + WARN log).
    • Conditionally wraps with BatchSplittingMetricExporter per metricsExportBatchSplitEnabled.
  • BatchSplittingMetricExporter:

    • Constructor extended to (MetricExporter, IntSupplier batchSize, IntSupplier maxConcurrent).
    • export() builds a per-call Semaphore from the supplied concurrency limit; each sub-batch acquires a permit before delegate.export(batch) and releases it in whenComplete.
    • Adds defensive snapshotAllMetrics() on the collection before export to prevent ArrayIndexOutOfBoundsException in NumberDataPointMarshaler.createRepeated when async instrument callbacks mutate data point collections concurrently with export serialization.
  • BatchSplittingMetricExporterTest: adds concurrency-limiting cases (testConcurrencyLimitBoundsInFlightBatches, testConcurrencyLimitZeroMeansUnlimited), null-arg constructor rejection for the new parameter, and snapshot defensive-copy cases (testSnapshotCreatesNewMetricData, testSnapshotFallsBackToOriginal, testSnapshotPointsAreIndependentCopy).
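
The permit handling described for export() can be illustrated with plain JDK types. This is only a sketch, not the code in this PR: BoundedBatchDispatcher, dispatch, and the Function-based delegate are hypothetical stand-ins for BatchSplittingMetricExporter and MetricExporter.export, CompletableFuture stands in for the SDK's result type, and treating negative limits as "unlimited" is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;
import java.util.function.IntSupplier;

// Hypothetical stand-in for BatchSplittingMetricExporter; names are illustrative.
final class BoundedBatchDispatcher<T> {
    private final Function<List<T>, CompletableFuture<Void>> delegate;
    private final IntSupplier batchSize;
    private final IntSupplier maxConcurrent;

    BoundedBatchDispatcher(Function<List<T>, CompletableFuture<Void>> delegate,
                           IntSupplier batchSize,
                           IntSupplier maxConcurrent) {
        this.delegate = Objects.requireNonNull(delegate, "delegate");
        this.batchSize = Objects.requireNonNull(batchSize, "batchSize");
        this.maxConcurrent = Objects.requireNonNull(maxConcurrent, "maxConcurrent");
    }

    CompletableFuture<Void> dispatch(List<T> items) {
        int size = Math.max(1, batchSize.getAsInt());
        int limit = maxConcurrent.getAsInt();
        // 0 and Integer.MAX_VALUE mean "no limit"; negatives are treated the same (assumption).
        // The semaphore is built per call, so the suppliers are re-read on every dispatch.
        final Semaphore permits =
            (limit <= 0 || limit == Integer.MAX_VALUE) ? null : new Semaphore(limit);

        List<CompletableFuture<Void>> inFlight = new ArrayList<>();
        for (int from = 0; from < items.size(); from += size) {
            List<T> batch = items.subList(from, Math.min(from + size, items.size()));
            if (permits != null) {
                permits.acquireUninterruptibly(); // blocks while `limit` batches are in flight
            }
            CompletableFuture<Void> result;
            try {
                result = delegate.apply(batch);
            } catch (RuntimeException e) {
                if (permits != null) {
                    permits.release(); // do not leak the permit if the delegate throws
                }
                throw e;
            }
            // Release on success and on failure, once the sub-batch completes.
            inFlight.add(result.whenComplete((ok, err) -> {
                if (permits != null) {
                    permits.release();
                }
            }));
        }
        return CompletableFuture.allOf(inFlight.toArray(new CompletableFuture<?>[0]));
    }
}
```

With a limit of 1 the loop degenerates to one sub-batch at a time, which is the serialized behavior the config description mentions.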

Why default to IMMUTABLE_DATA

IMMUTABLE_DATA matches the de facto behavior before OT 1.44.0, when no REUSABLE_DATA path shipped. Keeping it as the default restores the pre-1.44 behavior that users have run safely, while leaving operators a knob to opt into REUSABLE_DATA on OT >= 1.47 when reducing allocation pressure matters more.
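
For illustration, the fallback rule for metricsExportOtelMemoryMode (case-insensitive match, invalid values → IMMUTABLE_DATA plus a WARN log) might look roughly like the sketch below. The nested MemoryMode enum is a local stand-in for the SDK's enum so the snippet compiles on its own, java.util.logging replaces the broker's logger, and the handling of null, blank, and whitespace-padded values is an assumption rather than something this PR states.

```java
import java.util.Locale;
import java.util.logging.Logger;

// Hypothetical helper class; only the method name resolveMemoryMode comes from the PR.
final class MemoryModeResolver {
    /** Local stand-in for the OpenTelemetry SDK's MemoryMode enum. */
    enum MemoryMode { IMMUTABLE_DATA, REUSABLE_DATA }

    private static final Logger LOG = Logger.getLogger(MemoryModeResolver.class.getName());

    static MemoryMode resolveMemoryMode(String configured) {
        // Assumption: a missing or blank value silently takes the default.
        if (configured == null || configured.trim().isEmpty()) {
            return MemoryMode.IMMUTABLE_DATA;
        }
        try {
            // Locale.ROOT keeps the upper-casing independent of the JVM's default locale.
            return MemoryMode.valueOf(configured.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            LOG.warning("Invalid metricsExportOtelMemoryMode '" + configured
                + "', falling back to IMMUTABLE_DATA");
            return MemoryMode.IMMUTABLE_DATA;
        }
    }
}
```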

Motivation

Verified via MAT analysis of an OOM heap dump from production:

  • 93% of heap retained by one BatchSplittingMetricExporter → MetricReusableDataMarshaler.marshalerPool
  • 17,531 leaked LowAllocationMetricsRequestMarshaler instances in an unbounded ArrayDeque
  • ArrayDeque logical size() does not match the number of non-null slots (32,233 vs 17,531), which is direct evidence of cross-thread race corruption
  • Linear old-gen growth of roughly 3 to 10 MB per cycle (approx. 2.25 GB over 600 cycles)
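
As a rough cross-check of these figures, the arithmetic can be written out. The sketch below assumes the ~132 KiB per-marshaler size quoted in the commit message and decimal GB/MB units for the growth rate; the class and method names are illustrative only.

```java
// Back-of-the-envelope check of the heap-dump numbers; not part of the PR.
final class LeakEstimate {
    static final long LEAKED_MARSHALERS = 17_531L;
    static final long BYTES_PER_MARSHALER = 132L * 1024L; // ~132 KiB each (from the commit message)

    /** Total bytes retained by the leaked marshalers. */
    static long retainedBytes() {
        return LEAKED_MARSHALERS * BYTES_PER_MARSHALER;
    }

    /** Same figure in GiB (2^30 bytes). */
    static double retainedGiB() {
        return retainedBytes() / (1024.0 * 1024.0 * 1024.0);
    }

    /** Average growth per export cycle in MB, given total growth in GB (decimal units). */
    static double mbPerCycle(double totalGb, int cycles) {
        return totalGb * 1000.0 / cycles;
    }
}
```

retainedBytes() comes to 2,369,630,208 bytes (about 2.2 GiB), and mbPerCycle(2.25, 600) gives 3.75 MB per cycle, which sits at the low end of the observed 3 to 10 MB range.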

Test Plan

  • All 24 BatchSplittingMetricExporterTest cases (existing and new) pass on origin/develop plus this commit
  • Full-repository mvn compile build passes
  • Integration: run broker against high-cardinality load on OT 1.47.0 for 24h, verify old gen does not climb linearly (recommended in staging prior to final merge)

Related

OpenTelemetry Java 1.44.0 ~ 1.46.x ships OtlpGrpcMetricExporter with
MemoryMode.REUSABLE_DATA by default. The underlying
MetricReusableDataMarshaler.marshalerPool is a non-thread-safe
ArrayDeque accessed concurrently by the reader thread (poll) and the
OkHttp callback thread (add, via whenComplete). With
BatchSplittingMetricExporter issuing N concurrent sub-batch exports
per cycle, the pool races and leaks marshalers (~132 KiB each) until
OOM. Fixed upstream in 1.47.0 via open-telemetry/opentelemetry-java#7041
(ArrayDeque -> ConcurrentLinkedDeque).
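
The shape of the pool access pattern can be sketched with JDK types only. ReusablePool, borrow, and giveBack below are hypothetical names, not the SDK's internals; the sketch just shows a poll-or-create pool backed by ConcurrentLinkedDeque, where one thread may poll while another adds without corrupting the deque.

```java
import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.function.Supplier;

// Illustrative pool; stands in for the marshaler pool described above.
final class ReusablePool<T> {
    // ConcurrentLinkedDeque tolerates concurrent poll()/add(); ArrayDeque does not.
    private final Deque<T> idle = new ConcurrentLinkedDeque<>();
    private final Supplier<T> factory;

    ReusablePool(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Takes an idle instance if one exists, otherwise creates a new one. */
    T borrow() {
        T pooled = idle.poll();
        return pooled != null ? pooled : factory.get();
    }

    /** Returns an instance to the pool, typically from a completion callback. */
    void giveBack(T item) {
        idle.add(item);
    }

    /** Number of idle instances; O(n) on ConcurrentLinkedDeque, for diagnostics only. */
    int idleCount() {
        return idle.size();
    }
}
```

With a thread-safe deque, the number of instances ever created stays bounded by the number of concurrent borrowers, so the pool cannot grow without limit the way the corrupted ArrayDeque did.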

- Bump OpenTelemetry to 1.47.0 in pom.xml so the upstream race fix is
  in effect.
- Default OtlpGrpcMetricExporter to MemoryMode.IMMUTABLE_DATA to
  preserve the pre-1.44 default behavior; exposed via
  brokerConfig.metricsExportOtelMemoryMode ("IMMUTABLE_DATA" /
  "REUSABLE_DATA", case-insensitive). Operators may opt in to
  REUSABLE_DATA when running on OTel >= 1.47.
- Cap concurrent in-flight sub-batches in BatchSplittingMetricExporter
  with a Semaphore controlled by
  brokerConfig.metricsExportBatchMaxConcurrent (default 4; set to 1
  to serialize and match pre-batch behavior; 0 or Integer.MAX_VALUE
  means unlimited).
- Add brokerConfig.metricsExportBatchSplitEnabled (default true) as
  an escape hatch to bypass BatchSplittingMetricExporter entirely,
  restoring the raw OtlpGrpcMetricExporter wiring.
- Defensively snapshot MetricData points before export to avoid
  ArrayIndexOutOfBoundsException in NumberDataPointMarshaler when
  async instrument callbacks mutate point collections during export.
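
The defensive-copy idea in the last item can be sketched as follows. PointSnapshots and snapshot are hypothetical names, and the fall-back-to-original branch is an assumption inferred from the test name testSnapshotFallsBackToOriginal; the actual snapshotAllMetrics() in this PR operates on MetricData and may differ.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;

// Illustrative helper; not the PR's snapshotAllMetrics() implementation.
final class PointSnapshots {
    /**
     * Copies the points into an independent, read-only list so that later
     * mutation of the source collection cannot change what gets serialized.
     */
    static <T> Collection<T> snapshot(Collection<T> points) {
        try {
            return Collections.unmodifiableList(new ArrayList<>(points));
        } catch (RuntimeException e) {
            // Assumption: if copying itself fails (for example because the source
            // is being mutated right now), export the original rather than drop data.
            return points;
        }
    }
}
```

A copy like this narrows the window for concurrent mutation to the duration of the copy; it does not make the source collection itself thread-safe.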
@codecov-commenter

Codecov Report

❌ Patch coverage is 39.02439% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.89%. Comparing base (9879968) to head (cd98386).

Files with missing lines                                   Patch %   Lines
.../rocketmq/broker/metrics/BrokerMetricsManager.java      0.00%     11 Missing ⚠️
.../java/org/apache/rocketmq/common/BrokerConfig.java      25.00%    9 Missing ⚠️
...q/broker/metrics/BatchSplittingMetricExporter.java      72.22%    5 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             develop   #10267      +/-   ##
=============================================
- Coverage      48.99%   48.89%   -0.10%     
+ Complexity     13459    13441      -18     
=============================================
  Files           1375     1375              
  Lines         100394   100432      +38     
  Branches       12964    12969       +5     
=============================================
- Hits           49188    49108      -80     
- Misses         45217    45315      +98     
- Partials        5989     6009      +20     

Development

Successfully merging this pull request may close these issues.

[Bug] OTLP metric exporter leaks memory until OOM when using BatchSplittingMetricExporter on OpenTelemetry Java 1.44.0 ~ 1.46.x