HDDS-15269. Avoid 30s shutdown wait in ReconTaskController #10402
Merged
ReconTaskControllerImpl.stop() switched to a graceful shutdown() + awaitTermination(30s) in HDDS-13956, but the event processing loop only exits on interrupt, so shutdown() can never drain it and stop() always burned the full 30s timeout before shutdownNow() took effect. Add a volatile running flag that the loop checks, cleared in stop(), so the loop exits on its next poll cycle and graceful shutdown completes promptly. Add testStopCompletesPromptly as a guard.
testProcessReInitializationEventWith* called stop() on a Mockito spy to quiesce the background event loop. The spy is a shallow copy, so this only flipped the running flag on the copy while the live event-processing thread (started on the original controller) kept running, forcing stop() to wait out the full 30s shutdown timeout. Stop the original controller instead.
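The spy pitfall can be illustrated without Mockito. The sketch below is only an analogy: the `Controller` class and its copy constructor are hypothetical stand-ins, assuming (as the commit message states) that a spy is a shallow copy carrying its own `running` field. Stopping the copy leaves the original's flag untouched.

```java
final class ShallowCopyFlagDemo {

  /** Hypothetical stand-in for a controller with a cooperative running flag. */
  static final class Controller {
    private volatile boolean running;

    Controller() {
    }

    /** Field-by-field copy, mimicking the shallow copy a spy is assumed to make. */
    Controller(Controller other) {
      this.running = other.running;
    }

    void start() {
      running = true;
    }

    void stop() {
      running = false;
    }

    boolean isRunning() {
      return running;
    }
  }

  public static void main(String[] args) {
    Controller original = new Controller();
    original.start();

    // The copy has its own field, so stopping it does not reach the original.
    Controller copy = new Controller(original);
    copy.stop();

    System.out.println("copy running: " + copy.isRunning());         // prints "copy running: false"
    System.out.println("original running: " + original.isRunning()); // prints "original running: true"
  }
}
```

A thread started on the original would keep seeing `running == true`, which is why the tests need to stop the original controller.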
adoroszlai (Contributor) approved these changes on Jun 1, 2026 and left a comment:
Thanks @chihsuan for the patch. Not only does this speed up TestReconTaskControllerImpl, but all Recon integration tests as well (saving 30s for each cluster stop):
-Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 56.83 s -- in org.apache.hadoop.ozone.recon.TestNSSummaryAdmin
+Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 27.02 s -- in org.apache.hadoop.ozone.recon.TestNSSummaryAdmin
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 82.93 s -- in org.apache.hadoop.ozone.recon.TestNSSummaryMemoryLeak
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 54.92 s -- in org.apache.hadoop.ozone.recon.TestNSSummaryMemoryLeak
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 149.4 s -- in org.apache.hadoop.ozone.recon.TestReconAsPassiveScm
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 61.61 s -- in org.apache.hadoop.ozone.recon.TestReconAsPassiveScm
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 120.9 s -- in org.apache.hadoop.ozone.recon.TestReconContainerEndpoint
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 61.89 s -- in org.apache.hadoop.ozone.recon.TestReconContainerEndpoint
-Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 174.1 s -- in org.apache.hadoop.ozone.recon.TestReconContainerHealthSummaryEndToEnd
+Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 86.49 s -- in org.apache.hadoop.ozone.recon.TestReconContainerHealthSummaryEndToEnd
-Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 108.3 s -- in org.apache.hadoop.ozone.recon.TestReconInsightsForDeletedDirectories
+Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 81.83 s -- in org.apache.hadoop.ozone.recon.TestReconInsightsForDeletedDirectories
-Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 62.88 s -- in org.apache.hadoop.ozone.recon.TestReconQuasiClosedContainerEndpoint
+Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 32.82 s -- in org.apache.hadoop.ozone.recon.TestReconQuasiClosedContainerEndpoint
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 172.0 s -- in org.apache.hadoop.ozone.recon.TestReconScmSnapshot
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 83.28 s -- in org.apache.hadoop.ozone.recon.TestReconScmSnapshot
-Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 354.9 s -- in org.apache.hadoop.ozone.recon.TestReconTasks
+Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 176.5 s -- in org.apache.hadoop.ozone.recon.TestReconTasks
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 62.43 s -- in org.apache.hadoop.ozone.recon.TestReconTasksMultiNode
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 34.28 s -- in org.apache.hadoop.ozone.recon.TestReconTasksMultiNode
-Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 103.2 s -- in org.apache.hadoop.ozone.recon.TestReconWithOzoneManager
+Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 45.61 s -- in org.apache.hadoop.ozone.recon.TestReconWithOzoneManager
-Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 68.88 s -- in org.apache.hadoop.ozone.recon.TestReconWithOzoneManagerFSO
+Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 38.21 s -- in org.apache.hadoop.ozone.recon.TestReconWithOzoneManagerFSO
-Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 164.3 s -- in org.apache.hadoop.ozone.recon.TestStorageDistributionEndpointEC
+Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 135.2 s -- in org.apache.hadoop.ozone.recon.TestStorageDistributionEndpointEC
-Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 164.0 s -- in org.apache.hadoop.ozone.recon.TestStorageDistributionEndpointRatis
+Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 134.2 s -- in org.apache.hadoop.ozone.recon.TestStorageDistributionEndpointRatis
What changes were proposed in this pull request?
TestReconTaskControllerImpl took ~145s to run 17 tests. Profiling each method showed the time was not spread out: four tests each blocked for almost exactly 30 seconds inside ReconTaskController.stop().
Root cause
ReconTaskControllerImpl runs a single background thread (processBufferedEventsAsync) that loops on an interruptible eventBuffer.poll(...) and only exits when the thread is interrupted.
When this loop was introduced (HDDS-13576), stop() used shutdownNow(), which interrupts the thread, so it exited immediately. A later reliability change (HDDS-13956) replaced that with a graceful shutdown. shutdown() stops accepting new work but never interrupts the running thread, and the loop has no non-interrupt exit path. So the graceful phase can never drain the loop: awaitTermination always burns its full 30s timeout before the fallback shutdownNow() finally stops it. Every stop() therefore costs ~30s.
This is not test-only: in production it delays Recon shutdown by ~30s and logs a misleading "did not terminate within 30 seconds" warning on every shutdown.
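The root cause can be reproduced in isolation. The following is a minimal sketch, not the actual Recon code: the class and method names are hypothetical, the queue holds plain strings, and the timeout is shortened from 30s to 300ms so the demo runs quickly. It shows that shutdown() followed by awaitTermination cannot end a loop whose only exit is an interrupt, so the full timeout elapses before shutdownNow() takes effect.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class InterruptOnlyLoopDemo {

  /** Starts a worker whose loop can only be ended by an interrupt. */
  static ExecutorService startInterruptOnlyLoop(BlockingQueue<String> eventBuffer) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    executor.execute(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          String event = eventBuffer.poll(50, TimeUnit.MILLISECONDS);
          if (event != null) {
            System.out.println("processing " + event);
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt(); // restore the flag so the loop exits
        }
      }
    });
    return executor;
  }

  /** Graceful stop with a forced fallback; returns true if the graceful phase drained. */
  static boolean stopGracefully(ExecutorService executor, long timeoutMs)
      throws InterruptedException {
    executor.shutdown(); // rejects new work, sends no interrupt
    boolean drained = executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    if (!drained) {
      executor.shutdownNow(); // interrupts the worker
      executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    }
    return drained;
  }

  public static void main(String[] args) throws InterruptedException {
    ExecutorService executor = startInterruptOnlyLoop(new LinkedBlockingQueue<>());
    long start = System.nanoTime();
    boolean drained = stopGracefully(executor, 300);
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    System.out.println("drained gracefully: " + drained + ", elapsed ms: " + elapsedMs);
  }
}
```

With this loop shape the graceful phase is expected to time out on every call, which mirrors the ~30s cost per stop() described above.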
Fix (production)
Give the loop a cooperative exit path: a volatile boolean running flag set in start() and cleared in stop(). The loop checks it each iteration, so after stop() it exits on the next poll cycle and the existing graceful shutdown completes promptly.
I chose the cooperative flag over simply reverting to shutdownNow() because it preserves the intent of HDDS-13956: an in-flight event is still allowed to finish instead of being interrupted mid-processing. The 30s awaitTermination stays as a genuine safety net rather than the normal path.
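A minimal sketch of the cooperative-flag approach follows. It is an illustration under assumptions, not the patched ReconTaskControllerImpl: the class name, the 50ms poll interval, the string event type, and the `stop(timeoutMs)` signature are all hypothetical. Only the idea of a volatile running flag checked by the loop comes from the description above.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class CooperativeLoopSketch {
  private final BlockingQueue<String> eventBuffer = new LinkedBlockingQueue<>();
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  // volatile so the worker thread sees the write made by the stopping thread
  private volatile boolean running;

  void start() {
    running = true; // set before the worker starts so its first check passes
    executor.execute(this::processBufferedEvents);
  }

  private void processBufferedEvents() {
    // Two exit paths: the cooperative flag, or an interrupt as a last resort.
    while (running && !Thread.currentThread().isInterrupted()) {
      try {
        String event = eventBuffer.poll(50, TimeUnit.MILLISECONDS);
        if (event != null) {
          System.out.println("processing " + event); // an in-flight event finishes normally
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  /** Returns true if the graceful phase completed without the forced fallback. */
  boolean stop(long timeoutMs) throws InterruptedException {
    running = false; // loop exits after its current poll returns
    executor.shutdown();
    boolean drained = executor.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    if (!drained) {
      executor.shutdownNow(); // safety net only
    }
    return drained;
  }

  public static void main(String[] args) throws InterruptedException {
    CooperativeLoopSketch controller = new CooperativeLoopSketch();
    controller.start();
    long start = System.nanoTime();
    boolean drained = controller.stop(2000);
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    System.out.println("drained gracefully: " + drained + ", elapsed ms: " + elapsedMs);
  }
}
```

In this sketch the worker should leave the loop within roughly one poll interval after stop() clears the flag, so the timeout is reached only if an event handler itself hangs.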
Fix (test)
Two tests (testProcessReInitializationEventWith*) call stop() on a Mockito spy(controller) to quiesce the background loop. A spy is a shallow copy, so it flips the running flag on the copy while the live thread (started on the original controller in setUp) keeps running, still hitting the 30s timeout.
They now stop the original controller. This does not change what the tests assert; stop() there is only setup to avoid a race with the background loop.
Scope
Kept intentionally small to fix only the critical issue (the 30s production
shutdown hang). Two unrelated, lower-impact test-only slowdowns remain and are
left for a follow-up to avoid scope creep:
testNewRetryLogicWithMaxRetriesExceeded and testFailedTaskRetryLogic use real-clock Thread.sleep to wait out RETRY_DELAY_MS; speeding these up needs the retry delay to be injectable.
Result
TestReconTaskControllerImpl: ~145s → ~26s, all tests pass, checkstyle clean.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15269
How was this patch tested?
Added testStopCompletesPromptly, which fails (~31s) before the fix and passes (~1s) after, guarding against regression.
TestReconTaskControllerImpl class: 18 tests, 0 failures, ~26s (down from ~145s).
mvn -pl hadoop-ozone/recon checkstyle:check: 0 violations.