Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[STORM-3658] Avoid deadlock and race condition caused by shutdown hooks #3299

Merged
merged 3 commits into from
Jul 2, 2020

Conversation

Ethanlm
Copy link
Contributor

@Ethanlm Ethanlm commented Jun 29, 2020

What is the purpose of the change

Details explained in https://issues.apache.org/jira/browse/STORM-3658.

In short, the problem here is

  1. there could be a race condition between registering the shutdown hooks and invoking shutdown hooks
  2. There is a possible deadlock between a thread like executor thread that invokes shutdown hook (when it invokes Runtime.getRuntime().exit) and the shutdown hook/thread itself that tries to terminate threads like the executor thread.

How was the change tested

  1. The deadlock and race condition can be reproduced with a WordCountTopology whose WordCountBolt simply throws a RuntimeException in prepare method. Use it to validate the change. The race condition and deadlock is not happening.

The log below shows shutdown hook waited for 100ms and continue. This avoids the deadlock.

2020-06-29 20:33:44.237 o.a.s.e.ExecutorShutdown ShutdownHook-shutdownFunc [WARN] Thread Thread-15-count-executor[1, 1] is still alive (100 ms after interruption). Stop wait
ing for it.
  1. The shutdown hooks are registered at very early stage (before any other threads are spawned). This is validated by looking at the logs and see Adding shutdown hook with kill in 3 secs message appears very early. And I added a bug in my test to print out thread dump right after registering the shutdown hook. The only alive threads at that point are
Log4j2-TF-6-Scheduled-1
Reference Handler
Signal Dispatcher
Finalizer
main

And also tested with the above topology and don't see race condition happen again (while it was very easy to happen before the code change)

  1. thread name change can be validated by the log
2020-06-29 20:33:44.023 o.a.s.u.Utils ShutdownHook-sleepKill-3s [INFO] Halting after 3 seconds
....
2020-06-29 20:33:44.040 o.a.s.d.w.Worker ShutdownHook-shutdownFunc [INFO] Shutting down worker wc4-15-1593462538 513216e5-17e1-421f-809e-47cefd6eb136-10.215.73.209 6702
...
  1. workerstate could be null if exception happens before workerstate object is initialized properly. This was tested by throwing an exception before workerstate construction is all done. And the modified shutdown method doesn't throw NPE.

@Ethanlm
Copy link
Contributor Author

Ethanlm commented Jul 1, 2020

Last commit re-organized shutdown method to avoid checking workerState!=null in many different places. Tested again.

Copy link
Contributor

@kishorvpatil kishorvpatil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Ethanlm Ethanlm merged commit 2595675 into apache:master Jul 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants