[Enhancement] Synchronize metrics shutdown to prevent JVM crashes during broker shutdown #9701

@guyinyou

Before Creating the Enhancement Request

  • I have confirmed that this should be classified as an enhancement rather than a bug/feature.

Summary

Add a synchronous blocking wait to the shutdown of metrics components, to prevent JVM crashes caused by race conditions during the broker shutdown process.

Motivation

Currently, the metrics shutdown process in BrokerMetricsManager uses asynchronous operations without proper synchronization. This creates race conditions where:

  1. Dependencies (like periodicMetricReader, metricExporter) may shut down before the services that depend on them
  2. Services continue to access already-shutdown dependencies, causing JVM crashes
  3. Data loss may occur due to incomplete flush operations during shutdown

This enhancement is critical for production stability, as JVM crashes during broker shutdown can lead to:

  • Data corruption
  • Incomplete metrics export
  • Service unavailability
  • Difficult troubleshooting in production environments

The enhancement benefits the entire RocketMQ community by ensuring graceful and reliable broker shutdowns, especially in high-throughput production environments where metrics collection is heavily utilized.

Describe the Solution You'd Like

Implement synchronous blocking wait for all metrics-related shutdown operations in BrokerMetricsManager.shutdown():

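  1. Replace async calls with sync blocking: Convert all shutdown operations to wait on the returned result with CompletableResultCode.join(timeout, unit), the OpenTelemetry result type returned by forceFlush() and shutdown(), using an appropriate timeout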
  2. Ensure proper shutdown order: Force each component to complete shutdown before proceeding to the next
  3. Add retry mechanism: Use while loops to retry failed operations until successful
  4. Apply to all exporter types: Implement the fix for OTLP_GRPC, PROM, and LOG metrics exporters
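
The blocking-wait and retry steps above can be sketched with only the JDK. This is a minimal illustration, not the BrokerMetricsManager code: `BlockingRetry`, `awaitSuccess`, and the `Supplier<CompletableFuture<Boolean>>` operation type are hypothetical stand-ins for the result type returned by the real readers and exporters, and the attempt cap is an added assumption so the sketch cannot spin forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper; names are illustrative and not part of RocketMQ.
public class BlockingRetry {

    /**
     * Runs an asynchronous operation, blocks until it finishes or the timeout
     * elapses, and retries until it reports success or maxAttempts is reached.
     * Returns true if some attempt succeeded.
     */
    public static boolean awaitSuccess(Supplier<CompletableFuture<Boolean>> operation,
                                       int maxAttempts, long timeout, TimeUnit unit) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Block the calling thread until this attempt completes.
                if (operation.get().get(timeout, unit)) {
                    return true;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            } catch (Exception e) {
                // Timeout or failed execution: fall through and retry.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated flush that fails twice before succeeding.
        boolean ok = awaitSuccess(
            () -> CompletableFuture.supplyAsync(() -> calls.incrementAndGet() >= 3),
            5, 1, TimeUnit.SECONDS);
        System.out.println("success=" + ok + " attempts=" + calls.get());
        // prints "success=true attempts=3"
    }
}
```

Whether the real fix should cap the number of attempts is a design choice this issue leaves open; an uncapped loop matches the proposal more literally but can block shutdown indefinitely if an operation never succeeds.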

Implementation details:

  • Use join(Integer.MAX_VALUE, TimeUnit.DAYS) to ensure completion
  • Add isSuccess() checks to verify operation completion
  • Maintain the same shutdown sequence but with proper synchronization
  • Ensure forceFlush() completes before shutdown() for each component
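
The ordering requirement in the list above (flush completes before shutdown, and each component finishes before the next one starts) might look like the following. It is a sketch under stated assumptions: `OrderedShutdown`, `Component`, and `recording` are hypothetical names, and JDK `CompletableFuture` (Java 9+ for `orTimeout` and `List.of`) stands in for the OpenTelemetry result type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical model of ordered shutdown; not the RocketMQ or OpenTelemetry API.
public class OrderedShutdown {

    /** Stand-in for a metrics component such as a reader or exporter. */
    public interface Component {
        CompletableFuture<Void> forceFlush();
        CompletableFuture<Void> shutdown();
    }

    /** Demo component that records each completed step in a shared log. */
    public static Component recording(String name, List<String> log) {
        return new Component() {
            public CompletableFuture<Void> forceFlush() {
                return CompletableFuture.runAsync(() -> log.add(name + ".forceFlush"));
            }
            public CompletableFuture<Void> shutdown() {
                return CompletableFuture.runAsync(() -> log.add(name + ".shutdown"));
            }
        };
    }

    /** Flushes, then shuts down, each component, blocking after every step. */
    public static void shutdownInOrder(List<Component> components, long timeout, TimeUnit unit) {
        for (Component c : components) {
            // join() blocks the caller; orTimeout fails the wait if it takes too long.
            c.forceFlush().orTimeout(timeout, unit).join(); // flush before shutdown
            c.shutdown().orTimeout(timeout, unit).join();   // finish before the next component
        }
    }

    public static void main(String[] args) {
        // Writes are sequential because each step is joined before the next starts.
        List<String> log = new ArrayList<>();
        shutdownInOrder(List.of(recording("reader", log), recording("exporter", log)),
                        5, TimeUnit.SECONDS);
        System.out.println(log);
        // prints "[reader.forceFlush, reader.shutdown, exporter.forceFlush, exporter.shutdown]"
    }
}
```

Without the two `join()` calls the four steps could be scheduled concurrently and complete in any order, which is the race described in the Motivation section.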

Code changes:

// Before (async: the returned results are ignored, which allows race conditions)
periodicMetricReader.forceFlush();
periodicMetricReader.shutdown();

// After (sync: block on each step and retry until it reports success)
while (!periodicMetricReader.forceFlush().join(Integer.MAX_VALUE, TimeUnit.DAYS).isSuccess());
while (!periodicMetricReader.shutdown().join(Integer.MAX_VALUE, TimeUnit.DAYS).isSuccess());
