Description
Before Creating the Enhancement Request
- I have confirmed that this should be classified as an enhancement rather than a bug/feature.
Summary
Add a synchronous, blocking wait to the shutdown of metrics components to prevent JVM crashes caused by race conditions during broker shutdown.
Motivation
Currently, the metrics shutdown process in BrokerMetricsManager uses asynchronous operations without proper synchronization. This creates race conditions where:
- Dependencies (such as periodicMetricReader and metricExporter) may shut down before the services that depend on them
- Services continue to access dependencies that have already shut down, causing JVM crashes
- Data loss may occur due to incomplete flush operations during shutdown
This enhancement is critical for production stability, as JVM crashes during broker shutdown can lead to:
- Data corruption
- Incomplete metrics export
- Service unavailability
- Difficult troubleshooting in production environments
The enhancement benefits the entire RocketMQ community by ensuring graceful and reliable broker shutdowns, especially in high-throughput production environments where metrics collection is heavily utilized.
Describe the Solution You'd Like
Implement synchronous blocking wait for all metrics-related shutdown operations in BrokerMetricsManager.shutdown():
- Replace async calls with sync blocking: Convert all shutdown operations to block on the returned result with CompletableResultCode.join(timeout, unit) and an appropriate timeout
- Ensure proper shutdown order: Force each component to complete shutdown before proceeding to the next
- Add retry mechanism: Use while loops to retry failed operations until successful
- Apply to all exporter types: Implement the fix for OTLP_GRPC, PROM, and LOG metrics exporters
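The ordering requirement in the list above can be sketched without the metrics SDK. This is a minimal illustration, not the BrokerMetricsManager code: it assumes each shutdown step can be modeled as a `Supplier<CompletableFuture<Boolean>>` (a stand-in for the SDK's own result type), and the names `OrderedShutdownSketch`, `await`, and `runInOrder` are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper: runs asynchronous shutdown steps one at a time.
final class OrderedShutdownSketch {

    /** Starts one asynchronous step and blocks until it completes; true only on success. */
    static boolean await(Supplier<CompletableFuture<Boolean>> step, long timeout, TimeUnit unit) {
        try {
            return Boolean.TRUE.equals(step.get().get(timeout, unit));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            return false;
        } catch (Exception e) {
            return false; // timed out, or the step completed exceptionally
        }
    }

    /**
     * Runs every step strictly in list order. A step is not started until the
     * previous one has finished (or timed out). Returns true if all succeeded.
     */
    static boolean runInOrder(List<Supplier<CompletableFuture<Boolean>>> steps,
                              long timeout, TimeUnit unit) {
        boolean allOk = true;
        for (Supplier<CompletableFuture<Boolean>> step : steps) {
            allOk &= await(step, timeout, unit); // non-short-circuit: later steps still run
        }
        return allOk;
    }
}
```

In this sketch a failed step does not stop the sequence, on the assumption that a shutdown path should still attempt to release the remaining components; whether that is the right policy for the broker is a design choice the issue leaves open.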
Implementation details:
- Use join(Integer.MAX_VALUE, TimeUnit.DAYS) to ensure completion
- Add isSuccess() checks to verify operation completion
- Maintain the same shutdown sequence but with proper synchronization
- Ensure forceFlush() completes before shutdown() for each component
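To make the join/isSuccess pattern concrete, here is a small self-contained sketch. `SketchResult` and `SketchComponent` are hypothetical stand-ins written for this example; they only mimic the shape of the calls named above (a result with `join(timeout, unit)` and `isSuccess()`, a component with `forceFlush()` and `shutdown()`) and are not the SDK classes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for an asynchronous result with join(timeout, unit) and isSuccess(). */
final class SketchResult {
    private final CountDownLatch done = new CountDownLatch(1);
    private volatile boolean success;

    /** Marks the operation as finished with the given outcome. */
    void complete(boolean ok) {
        success = ok;
        done.countDown();
    }

    /** Blocks until completion or timeout, then returns this so calls can be chained. */
    SketchResult join(long timeout, TimeUnit unit) {
        try {
            done.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return this;
    }

    /** True only if the operation has finished and reported success. */
    boolean isSuccess() {
        return done.getCount() == 0 && success;
    }
}

/** Hypothetical component exposing the two lifecycle calls discussed in the issue. */
interface SketchComponent {
    SketchResult forceFlush();
    SketchResult shutdown();
}

final class FlushThenShutdown {
    /** Calls shutdown() only after the join on forceFlush() has returned. */
    static boolean close(SketchComponent c, long timeout, TimeUnit unit) {
        boolean flushed = c.forceFlush().join(timeout, unit).isSuccess();
        boolean stopped = c.shutdown().join(timeout, unit).isSuccess();
        return flushed && stopped;
    }
}
```

Note that in this stand-in a join that times out simply returns, and `isSuccess()` then reports false, so a caller has to check the result rather than assume the wait implies completion.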
Code changes:
// Before (async - causes race conditions)
periodicMetricReader.forceFlush();
periodicMetricReader.shutdown();
// After (sync - prevents race conditions)
// Retry each step until it reports success; the empty loop bodies are intentional.
while (!periodicMetricReader.forceFlush().join(Integer.MAX_VALUE, TimeUnit.DAYS).isSuccess());
while (!periodicMetricReader.shutdown().join(Integer.MAX_VALUE, TimeUnit.DAYS).isSuccess());
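An unbounded retry loop combined with an effectively infinite timeout can keep the broker from ever exiting if a component never reports success. One possible variant, offered here as a sketch rather than as part of the proposal, is to cap the number of attempts. `BoundedRetry` and its parameters are hypothetical names introduced for this example.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper: a capped alternative to an open-ended while loop.
final class BoundedRetry {
    /**
     * Calls attempt until it returns true, the attempt budget is used up,
     * or the thread is interrupted. Returns true if some attempt succeeded.
     */
    static boolean retry(BooleanSupplier attempt, int maxAttempts, long pauseMillis) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            if (i < maxAttempts) {
                try {
                    TimeUnit.MILLISECONDS.sleep(pauseMillis); // brief pause between attempts
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

With such a helper, a flush step could be written as `BoundedRetry.retry(() -> periodicMetricReader.forceFlush().join(10, TimeUnit.SECONDS).isSuccess(), 3, 100)`, where the 10-second timeout, 3 attempts, and 100 ms pause are arbitrary illustrative values.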