Forced termination of workers by nannies not working properly #1303

Closed
eferreira opened this issue Aug 2, 2017 · 2 comments · Fixed by #1304

It looks like this has been broken since #1148. This is all on Linux. Let me know if I'm the one doing something wrong. Steps to reproduce (a minimal client-side sketch of steps 2–3 follows the list):

  1. Run dask-scheduler on one terminal and dask-worker on another terminal
  2. Connect a client from python and call client.submit(time.sleep, 3600)
  3. Call client.restart(). This prints a stack trace on the scheduler but eventually returns. Meanwhile the nanny fails to kill the worker yet spawns another one anyway, so there are now two worker processes and the first one is still stuck in the sleep.
  4. Try hitting Ctrl+C in the terminal running dask-worker. Again it tries to kill its children, but it fails to kill the stuck worker and hangs waiting for it to exit.
  5. Hit Ctrl+C again; now dask-worker finally exits, but the stuck worker process is left behind as an orphaned process on the machine, still doing the sleep!
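
Minimal client-side sketch of steps 2–3, assuming the scheduler from step 1 is listening on the default 127.0.0.1:8786 address:

    import time
    from distributed import Client

    # Connect to the scheduler started in step 1.
    client = Client("127.0.0.1:8786")

    # Step 2: occupy the worker's executor with a long-running task.
    future = client.submit(time.sleep, 3600)

    # Step 3: ask the scheduler to restart all workers.  On the affected
    # versions this prints a stack trace on the scheduler and leaves the
    # original worker process alive next to the newly spawned one.
    client.restart()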

It seems like the only way that the nanny tries to force-kill the worker is by sending SIGTERM. But the worker explicitly catches SIGTERM and tries to wait for everything to finish cleanly instead of letting itself be killed.

I guess the worker should not try to catch SIGTERM if that's what the nanny is going to use to do a force-kill, right? Also, to be more thorough, shouldn't the nanny be more aggressive and send a SIGKILL if SIGTERM doesn't work after a while? This would cover cases where a task does something weird like changing the signal handlers on the worker process or messing up the state of the worker process in some other way.
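An illustrative sketch (not distributed's actual code) of the SIGTERM-then-SIGKILL escalation suggested above; the function name force_kill and the grace period are made up:

    import os
    import signal
    import time

    def force_kill(pid, grace=5.0, poll=0.1):
        """Ask ``pid`` to exit with SIGTERM; escalate to SIGKILL after ``grace`` seconds."""
        os.kill(pid, signal.SIGTERM)       # polite request; a worker may catch this
        deadline = time.monotonic() + grace
        while time.monotonic() < deadline:
            try:
                os.kill(pid, 0)            # signal 0 only checks that the process exists
            except ProcessLookupError:
                return True                # the process exited on its own
            time.sleep(poll)
        os.kill(pid, signal.SIGKILL)       # cannot be caught or ignored
        return False

A nanny that owns the child through multiprocessing would more likely call terminate() and join() on the Process handle so the child also gets reaped, but the escalation idea is the same.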

mrocklin commented Aug 2, 2017 via email

mrocklin added a commit to mrocklin/distributed that referenced this issue Aug 2, 2017
Previously, restarting a cluster that had long-running tasks would sometimes
hang.  This happened for two reasons:

1.  The nanny's restart timeout was longer than the scheduler's restart
    timeout.  Now we pass a fraction of the scheduler's timeout down to the
    nanny.

2.  The workers used to wait for the executor to finish all currently
    running tasks.  Now we don't.

Fixes dask#1303
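
A rough sketch of point 1, with hypothetical names (restart_workers, nanny.kill) and using asyncio for brevity: the scheduler hands each nanny only a fraction of its own restart timeout, so a stuck worker is force-killed before the scheduler-side restart itself times out.

    import asyncio

    async def restart_workers(nannies, timeout=20.0, fraction=0.8):
        # Give the nannies less time than the scheduler is willing to wait,
        # leaving headroom for their responses to travel back.
        nanny_timeout = timeout * fraction
        await asyncio.wait_for(
            asyncio.gather(*(n.kill(timeout=nanny_timeout) for n in nannies)),
            timeout=timeout,
        )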
mrocklin added a commit that referenced this issue Aug 2, 2017
Respect timeouts when restarting

Fixes #1303

mrocklin commented Aug 2, 2017

This has been resolved in #1304
