@jvarenina jvarenina commented Dec 13, 2021

This deadlock occurs when the server is shut down gracefully.
The "Distributed system shutdown hook" thread starts the
"ConcurrentParallelGatewaySenderEventProcessor Stopper Thread" threads
and waits for them to finish.
Each "ConcurrentParallelGatewaySenderEventProcessor Stopper Thread"
then sets the AckReaderThread.shutdown flag, indicating that the AckReaderThread
should also shut down, and waits for it by joining the thread for at most 15 seconds.
The "AckReaderThread for : Event Processor for GatewaySender_sender"
thread blocks during shutdown because it tries to acquire a lock that is
already held by the "Distributed system shutdown hook" thread.
The deadlock lasts only 15 seconds, because the thread join
times out in every "ConcurrentParallelGatewaySenderEventProcessor
Stopper Thread", which lets those threads finish. Once they have finished,
the "Distributed system shutdown hook" thread can continue,
release the lock, and complete the shutdown. The problem is that
a 15-second delay can cause traffic loss if the read-timeout on clients
is configured to a lower value.
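
The stall described above can be reproduced in isolation. The following is a minimal sketch, not Geode code: the class name, the lock, and the thread names are hypothetical stand-ins, and the 15-second join is replaced by a caller-supplied timeout so that it runs quickly. The calling thread plays the shutdown hook, holding the lock while it waits for the stopper; the stopper signals shutdown and does a timed join on the reader; the reader blocks on the lock.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

class ShutdownStallDemo {
  /**
   * Simulates one shutdown. Returns true if the stopper's timed join expired
   * while the reader was still blocked on the lock held by the caller.
   */
  static boolean stopperJoinTimesOut(long joinTimeoutMillis) {
    if (joinTimeoutMillis <= 0) {
      // Thread.join(0) waits forever, which would turn the stall into a real deadlock.
      throw new IllegalArgumentException("joinTimeoutMillis must be positive");
    }
    ReentrantLock lock = new ReentrantLock(); // hypothetical lock owned by the "hook"
    CountDownLatch shutdownRequested = new CountDownLatch(1); // stands in for the shutdown flag
    boolean[] readerStillAlive = new boolean[1];

    Thread reader = new Thread(() -> {
      try {
        shutdownRequested.await(); // reader notices that shutdown was requested
      } catch (InterruptedException e) {
        return;
      }
      lock.lock(); // blocks here: the hook thread still owns the lock
      lock.unlock();
    }, "ack-reader");

    Thread stopper = new Thread(() -> {
      shutdownRequested.countDown(); // tell the reader to shut down
      try {
        reader.join(joinTimeoutMillis); // bounded wait, like the 15 s join
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      readerStillAlive[0] = reader.isAlive();
    }, "stopper");

    lock.lock(); // the calling thread plays the shutdown hook
    try {
      reader.start();
      stopper.start();
      stopper.join(); // the hook waits for the stopper while holding the lock
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      lock.unlock(); // only now can the reader proceed and exit
    }
    return readerStillAlive[0];
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    boolean timedOut = stopperJoinTimesOut(200);
    long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
    System.out.println("stopper join timed out: " + timedOut + ", stalled ~" + elapsedMillis + " ms");
  }
}
```

In this sketch the reader cannot finish before the join expires, because the lock is released only after the stopper returns. The shutdown therefore always pays the full join timeout.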

The fix:
When an exception occurs in the AckReaderThread because the server is shutting down
(e.g. CacheClosedException), check whether shutdown is already in progress before
trying to acquire the lock and initiate shutdown of the dispatcher threads.
If the server is shutting down, the dispatcher threads are already
being shut down, so there is no need for the AckReaderThread
to initiate the same thing again.
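
The idea behind the fix can be sketched as follows. This is an illustration under stated assumptions, not the actual change in the commit: the class, field, and method names are hypothetical, and a plain `ReentrantLock` stands in for whatever lock the shutdown hook holds.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

class AckReaderShutdownCheck {
  // Hypothetical stand-ins; these are not Geode's actual fields.
  private final ReentrantLock lifecycleLock = new ReentrantLock();
  private final AtomicBoolean dispatchersStopped = new AtomicBoolean();
  private volatile boolean shutdownInProgress;

  void markShutdownInProgress() {
    shutdownInProgress = true;
  }

  /**
   * Called from the reader's exception handler. Returns true if this call
   * initiated dispatcher shutdown, false if it was skipped.
   */
  boolean handleFailure(RuntimeException cause) {
    if (shutdownInProgress) {
      // Shutdown is already under way, so do not touch the lock at all.
      return false;
    }
    lifecycleLock.lock();
    try {
      dispatchersStopped.set(true); // placeholder for stopping the dispatcher threads
      return true;
    } finally {
      lifecycleLock.unlock();
    }
  }

  /** Holds the lock like a shutdown hook would and checks that the reader still exits. */
  static boolean readerFinishesWhileLockHeld(long waitMillis) {
    AckReaderShutdownCheck state = new AckReaderShutdownCheck();
    state.lifecycleLock.lock(); // the calling thread plays the shutdown hook
    try {
      state.markShutdownInProgress();
      Thread reader = new Thread(
          () -> state.handleFailure(new IllegalStateException("cache closed")), "ack-reader");
      reader.start();
      try {
        reader.join(waitMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
      return !reader.isAlive(); // true means the reader did not block on the lock
    } finally {
      state.lifecycleLock.unlock();
    }
  }

  public static void main(String[] args) {
    System.out.println("reader finished while lock held: " + readerFinishesWhileLockHeld(2000));
  }
}
```

Because the flag is checked before the lock is requested, the reader returns immediately once shutdown has been signalled, and the stopper's join no longer has to time out. A check followed by a lock still leaves a small window if the flag is set between the two steps; the sketch does not try to close it.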

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?

  • Has your PR been rebased against the latest commit within the target branch (typically develop)?

  • Is your initial contribution a single, squashed commit?

  • Does gradlew build run cleanly?

  • Have you written or updated unit tests to verify your changes?

  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?

@jvarenina jvarenina changed the title from "GEODE-9881: Fix for deadlock during server shutdown" to "GEODE-9887: Fix for deadlock during server shutdown" Dec 14, 2021
@jvarenina jvarenina marked this pull request as ready for review December 14, 2021 17:42

@nabarunnag nabarunnag left a comment

Looks good to me

@mkevo mkevo merged commit 0e787a6 into apache:develop Jan 4, 2022
@jvarenina jvarenina deleted the feature/GEODE-9887 branch January 4, 2022 09:30
jvarenina added a commit to Nordix/geode that referenced this pull request Jan 21, 2022
jvarenina added a commit to Nordix/geode that referenced this pull request Apr 14, 2022