-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
STORM-3355: Use supervisor.worker.shutdown.sleep.secs to set worker s… #2974
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If my memory is correct, it was to ensure worker to try to shutdown earlier, and let supervisor kills worker when worker fails to shutdown. If we wait for same secs in both worker and supervisor, it would be non-deterministic.
I'm not sure why we don't rely on supervisor's kill and try to kill in both side. I might see the reason but unfortunately I can't remember. Do you know why?
@@ -127,7 +127,9 @@ public static void main(String[] args) throws Exception { | |||
Worker worker = new Worker(conf, null, stormId, assignmentId, Integer.parseInt(supervisorPort), | |||
Integer.parseInt(portStr), workerId); | |||
worker.start(); | |||
Utils.addShutdownHookWithForceKillIn1Sec(worker::shutdown); | |||
LOG.info("Adding shutdown hook with kill in {} secs", ObjectReader.getInt(conf.get(Config.SUPERVISOR_WORKER_SHUTDOWN_SLEEP_SECS))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ObjectReader.getInt(conf.get(Config.SUPERVISOR_WORKER_SHUTDOWN_SLEEP_SECS))
will be executed twice once end users configure to print INFO or lower. Would be better to have separate variable here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, will fix
I'm not sure it's an issue that it's non-deterministic. The force kill only happens if the worker fails to shut down within the time limit, which should hopefully be rare. If the time limit is hit, does it matter whether the supervisor I assumed the reason we want the worker to suicide is in case the supervisor issues a soft kill, and then crashes. I'm not sure how the supervisor handles it if it is restarted and the worker is a "zombie" hanging in the middle of shutdown? If that's not the reason, I'm not sure. Relying on the supervisor's kill might be fine? |
…uicide delay to allow users to configure how long they're willing to wait for orderly shutdown
I found origin commit to introduce 1 sec suicide but Github fails to link the PR and I can't get the reason I wasn't sure about worker, but if we concern about supervisor being crashed just after issuing soft kill, that would make sense. Btw, I think you're right. Once it reaches force kill it wouldn't matter who triggers it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 The change is OK to me, and doesn't seem to bring any regression here.
The reason for replacing Since the behavior before that commit was for the worker to just die immediately on shutdown, I think the shutdown hook was probably added in the worker because that was the same way it was done for the daemons. I'm still not sure the shutdown hook is necessary for the worker, but I don't think it hurts either. Edit: Thanks for looking into why the hook is there btw. |
…uicide delay to allow users to configure how long they're willing to wait for orderly shutdown
https://issues.apache.org/jira/browse/STORM-3355
This will not be useful on Windows, as there is no easy way that I can see for shutting down JVMs cleanly. I dug around for a while and it looks like the JVM devs also haven't found a reasonable way to do orderly shutdowns on Windows https://bugs.openjdk.java.net/browse/JDK-8056139. We can probably be excused for not doing better.
On Linux it works as expected. Tested by running ExclamationTopology with a
cleanup
that sleeps for 15 seconds, and setting the timeout parameter to 30 seconds. The worker was allowed to shut down cleanly.