Forced termination of workers by nannies not working properly #1303

Closed
eferreira opened this issue Aug 2, 2017 · 2 comments · Fixed by #1304

It looks like this has been broken since #1148. This is all on Linux. Let me know if I'm the one doing something wrong. Steps to reproduce (a minimal client-side sketch of steps 2–3 follows the list):

  1. Run dask-scheduler on one terminal and dask-worker on another terminal
  2. Connect a client from python and call client.submit(time.sleep, 3600)
  3. Call client.restart(). This prints a stack trace on the scheduler but eventually returns. Meanwhile the nanny fails to kill the worker yet spawns another one anyway, so there are now two worker processes and the first one is still stuck in the sleep.
  4. Try hitting Ctrl+C in the terminal running dask-worker. Again it tries to kill its children, but it fails to kill the stuck worker and hangs waiting for it to exit.
  5. Hit Ctrl+C again; now dask-worker finally exits, but the stuck worker process is left behind as an orphaned process on the machine, still doing the sleep!
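
Minimal client-side sketch of steps 2–3, assuming the scheduler from step 1 is listening on the default 127.0.0.1:8786 address:

    import time
    from distributed import Client

    # Connect to the scheduler started in step 1.
    client = Client("127.0.0.1:8786")

    # Step 2: occupy the worker's executor with a long-running task.
    future = client.submit(time.sleep, 3600)

    # Step 3: ask the scheduler to restart all workers.  On the affected
    # versions this prints a stack trace on the scheduler and leaves the
    # original worker process alive next to the newly spawned one.
    client.restart()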

It seems like the only way that the nanny tries to force-kill the worker is by sending SIGTERM. But the worker explicitly catches SIGTERM and tries to wait for everything to finish cleanly instead of letting itself be killed.

I guess the worker should not try to catch SIGTERM if that's what the nanny is going to use to do a force-kill, right? Also, to be more thorough, shouldn't the nanny be more aggressive and send a SIGKILL if SIGTERM doesn't work after a while? This would cover cases where a task does something weird like changing the signal handlers on the worker process or messing up the state of the worker process in some other way.
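An illustrative sketch (not distributed's actual code) of the SIGTERM-then-SIGKILL escalation suggested above; the function name force_kill and the grace period are made up:

    import os
    import signal
    import time

    def force_kill(pid, grace=5.0, poll=0.1):
        """Ask ``pid`` to exit with SIGTERM; escalate to SIGKILL after ``grace`` seconds."""
        os.kill(pid, signal.SIGTERM)       # polite request; a worker may catch this
        deadline = time.monotonic() + grace
        while time.monotonic() < deadline:
            try:
                os.kill(pid, 0)            # signal 0 only checks that the process exists
            except ProcessLookupError:
                return True                # the process exited on its own
            time.sleep(poll)
        os.kill(pid, signal.SIGKILL)       # cannot be caught or ignored
        return False

A nanny that owns the child through multiprocessing would more likely call terminate() and join() on the Process handle so the child also gets reaped, but the escalation idea is the same.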

mrocklin commented Aug 2, 2017 via email

mrocklin added a commit to mrocklin/distributed that referenced this issue Aug 2, 2017
Previously, restarting a cluster that had long-running tasks would sometimes
hang.  This happened for two reasons:

1.  The nanny's restart timeout was longer than the scheduler's restart
    timeout.  Now we pass a fraction of the scheduler's timeout down to the
    nanny.

2.  The workers used to wait for the executor to finish all currently
    running tasks.  Now we don't.

Fixes dask#1303
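
A rough sketch of point 1, with hypothetical names (restart_workers, nanny.kill) and using asyncio for brevity: the scheduler hands each nanny only a fraction of its own restart timeout, so a stuck worker is force-killed before the scheduler-side restart itself times out.

    import asyncio

    async def restart_workers(nannies, timeout=20.0, fraction=0.8):
        # Give the nannies less time than the scheduler is willing to wait,
        # leaving headroom for their responses to travel back.
        nanny_timeout = timeout * fraction
        await asyncio.wait_for(
            asyncio.gather(*(n.kill(timeout=nanny_timeout) for n in nannies)),
            timeout=timeout,
        )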
mrocklin added a commit that referenced this issue Aug 2, 2017
Respect timeouts when restarting

Fixes #1303

mrocklin commented Aug 2, 2017

This has been resolved in #1304
