GEODE-9887: Fix for deadlock during server shutdown #7194
Merged
This deadlock happens when a server is shut down gracefully.
The "Distributed system shutdown hook" thread starts the
"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread" shutdown threads
and waits for them to finish.
Each "ConcurrentParallelGatewaySenderEventProcessor Stopper Thread"
then sets the AckReaderThread.shutdown flag, indicating that the AckReaderThread
should also shut down, and waits for it by joining the thread for at most 15 seconds.
The "AckReaderThread for : Event Processor for GatewaySender_sender"
thread blocks during shutdown because it tries to acquire the lock that is
already held by the "Distributed system shutdown hook" thread.
This deadlock lasts only 15 seconds, because the thread join
times out in every "ConcurrentParallelGatewaySenderEventProcessor
Stopper Thread", forcing those threads to finish. Once they have finished,
the "Distributed system shutdown hook" thread can continue,
release the lock and complete the shutdown. The problem is that
a 15-second delay can cause traffic loss if the read-timeout on clients
is configured to a lower value.
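The stall described above can be reproduced in isolation. The following is a minimal sketch, not Geode code: the class name, method name and the plain ReentrantLock are hypothetical stand-ins, and the timeout is a parameter so the example does not have to wait the full 15 seconds. The calling thread plays the role of the shutdown hook, which holds the lock while waiting for a stopper thread that is in turn waiting (with a timeout) for the ack reader.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the stall; none of these names come from Geode.
class ShutdownStallSketch {

    /**
     * Simulates the shutdown sequence and returns its duration in milliseconds.
     * The calling thread plays the "Distributed system shutdown hook".
     */
    static long simulateShutdownMillis(long joinTimeoutMillis) {
        if (joinTimeoutMillis <= 0) {
            // Thread.join(0) waits forever, which would turn the stall into a real deadlock.
            throw new IllegalArgumentException("joinTimeoutMillis must be positive");
        }

        // Stand-in for the lock that the shutdown hook holds during shutdown.
        final ReentrantLock lock = new ReentrantLock();

        // Plays the AckReaderThread: it needs the lock before it can exit.
        final Thread ackReader = new Thread(() -> {
            lock.lock();
            lock.unlock();
        }, "AckReaderThread (sketch)");

        // Plays a stopper thread: waits for the ack reader, but only up to the timeout.
        final Thread stopper = new Thread(() -> {
            try {
                ackReader.join(joinTimeoutMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "Stopper Thread (sketch)");

        long start = System.nanoTime();
        lock.lock(); // the hook holds the lock for the whole shutdown
        try {
            ackReader.start(); // blocks immediately on lock.lock()
            stopper.start();
            stopper.join();    // returns only after the stopper's timed join expires
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            lock.unlock();     // only now can the ack reader proceed and exit
        }
        try {
            ackReader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }
}
```

Calling ShutdownStallSketch.simulateShutdownMillis(300) should take roughly 300 ms, because nothing can make progress until the timed join gives up. With the 15-second value from the description, the same structure delays shutdown by about 15 seconds.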
The fix:
When an exception occurs in AckReaderThread because the server is shutting down
(e.g. CacheClosedException), check whether shutdown is already in progress before
trying to acquire the lock and initiate shutdown of the dispatcher threads.
When the server is shutting down, the dispatcher threads are already
in the process of shutting down, so there is no need for AckReaderThread
to initiate the same thing again.
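The shape of this check can be sketched as follows. This is an illustration under assumptions, not the code changed in this PR: the class, the shutdownInProgress flag, the lifecycleLock field and the method names are all hypothetical, and the real change works against Geode's own shutdown state and locks.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the guard; none of these names come from Geode.
class AckReaderShutdownGuardSketch {
    // Stand-in for the lock held by the shutdown hook during shutdown.
    private final ReentrantLock lifecycleLock = new ReentrantLock();
    // Stand-in for "server shutdown is already ongoing".
    private final AtomicBoolean shutdownInProgress = new AtomicBoolean(false);
    // Counts how often the ack reader itself initiated a dispatcher stop.
    private final AtomicInteger dispatcherStopRequests = new AtomicInteger();

    /** Called by whoever starts the server shutdown. */
    void markShutdownInProgress() {
        shutdownInProgress.set(true);
    }

    /**
     * Called when the ack reader hits an exception.
     * Returns true if this call initiated the dispatcher shutdown.
     */
    boolean onAckReaderFailure(RuntimeException cause) {
        if (shutdownInProgress.get()) {
            // Dispatchers are already being stopped; skip the lock entirely
            // so the ack reader can exit without waiting on the shutdown hook.
            return false;
        }
        lifecycleLock.lock();
        try {
            dispatcherStopRequests.incrementAndGet(); // placeholder for stopping dispatchers
            return true;
        } finally {
            lifecycleLock.unlock();
        }
    }

    int dispatcherStopRequests() {
        return dispatcherStopRequests.get();
    }
}
```

In this sketch, a failure before shutdown still takes the lock and stops the dispatchers, while a failure after markShutdownInProgress() returns immediately without touching the lock. The important design point is that the check happens before lock acquisition, since a check made after acquiring the lock would still block.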
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
Has your PR been rebased against the latest commit within the target branch (typically develop)?
Is your initial contribution a single, squashed commit?
Does gradlew build run cleanly?
Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?