Forced termination of workers by nannies not working properly #1303
Comments
Reproduced. Taking a look. Thank you for the simple reproducible example.
On Tue, Aug 1, 2017 at 9:57 PM, eferreira wrote:
It looks like this is broken since #1148. This is all on Linux. Let me know if I'm the one doing something wrong. Steps to reproduce (a minimal client-side sketch follows the list):
1. Run dask-scheduler in one terminal and dask-worker in another.
2. Connect a client from Python and call client.submit(time.sleep, 3600).
3. Call client.restart(). This prints a stack trace in the scheduler but returns after a while. Meanwhile the nanny is unable to kill the worker, yet it spawns another one anyway, so there are now two worker processes and the first one is still stuck doing the sleep.
4. Hit Ctrl+C in the terminal running dask-worker. Again it tries to kill its children, but it fails to kill the stuck worker and hangs waiting for it to exit.
5. Hit Ctrl+C again; now dask-worker finally exits, but the stuck worker process is left behind as an orphaned process on the machine, still doing the sleep!
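For convenience, here is a minimal client-side sketch of steps 2 and 3, assuming a scheduler is already listening at the default address tcp://127.0.0.1:8786 (the address is an assumption; substitute your own):

```python
# Minimal reproduction sketch. Assumes `dask-scheduler` and `dask-worker`
# are already running; the scheduler address below is the default and is
# an assumption here.
import time
from distributed import Client

client = Client("tcp://127.0.0.1:8786")   # connect a client (step 2)
future = client.submit(time.sleep, 3600)  # occupy the worker (step 2)
client.restart()                          # triggers the hang (step 3)
```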
It seems like the only way the nanny tries to force-kill the worker is by sending SIGTERM. But the worker explicitly catches SIGTERM and tries to wait for everything to finish cleanly instead of letting itself be killed.
I guess the worker should not catch SIGTERM if that's what the nanny is going to use to do a force-kill, right? Also, to be more thorough, shouldn't the nanny be more aggressive and send a SIGKILL if SIGTERM doesn't work after a while? This would cover cases where a task does something weird like changing the signal handlers on the worker process or messing up the state of the worker process in some other way.
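A minimal sketch of the escalation suggested above, assuming a multiprocessing.Process-style handle to the worker and an illustrative five-second grace period (the handle, helper name, and grace period are all assumptions, not what the nanny currently does):

```python
# Sketch of SIGTERM-then-SIGKILL escalation (Linux). `proc` is assumed to
# expose `pid` and `is_alive()` like multiprocessing.Process; the grace
# period is an illustrative choice.
import os
import signal
import time

def force_kill(proc, grace=5.0):
    os.kill(proc.pid, signal.SIGTERM)      # polite request; can be caught
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not proc.is_alive():
            return True                    # worker exited on its own
        time.sleep(0.1)
    os.kill(proc.pid, signal.SIGKILL)      # cannot be caught or ignored
    return False
```

Because SIGKILL cannot be caught or ignored, this also covers tasks that tamper with the worker's signal handlers.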
mrocklin added a commit to mrocklin/distributed that referenced this issue on Aug 2, 2017:
Previously, restarting a cluster that had long-running tasks would sometimes hang. This was for two reasons:
1. The nanny's restart timeout was longer than the scheduler's restart timeout. Now we pass a fraction of the scheduler's timeout down to the nanny.
2. The workers used to wait for the executor to finish all currently running tasks. Now we don't.
Fixes dask#1303
mrocklin added a commit that referenced this issue on Aug 2, 2017:
Respect timeouts when restarting
Previously, restarting a cluster that had long-running tasks would sometimes hang. This was for two reasons:
1. The nanny's restart timeout was longer than the scheduler's restart timeout. Now we pass a fraction of the scheduler's timeout down to the nanny.
2. The workers used to wait for the executor to finish all currently running tasks. Now we don't.
Fixes #1303
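As a rough illustration of the first point, with a hypothetical helper name and an arbitrary 0.8 factor (the actual code may use different names and values):

```python
# Hypothetical illustration of point 1: derive the nanny's kill timeout
# from the scheduler's restart timeout so the nanny always gives up first.
def nanny_timeout(scheduler_timeout: float, fraction: float = 0.8) -> float:
    return scheduler_timeout * fraction

assert nanny_timeout(20.0) < 20.0  # nanny's deadline expires before the scheduler's
```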
This has been resolved in #1304.