Respect timeouts when restarting #1304

Merged
merged 2 commits into from Aug 2, 2017


mrocklin
Member

@mrocklin mrocklin commented Aug 2, 2017

Previously, restarting a cluster that had long-running tasks would sometimes
hang. There were two causes:

  1. The nanny's restart timeout was longer than the scheduler's restart
    timeout.
    Now we pass a fraction of the scheduler's timeout down to the nanny.

  2. The workers used to wait for the executor to finish all currently
    running tasks.
    Now they don't.

Fixes #1303. Either of these changes is enough to fix the issue on its own.
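
For readers who want to see the two ideas in isolation, here is a rough sketch using only the standard library. It is not the code from this PR: `restart_nanny`, `restart_cluster`, `close_worker`, and the `fraction=0.8` default are all hypothetical names and values chosen for illustration. The first half shows an inner timeout that is a fraction of the outer one, so the inner step gives up before the outer deadline fires. The second half shows shutting down an executor without waiting for running tasks.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


async def restart_nanny(timeout):
    """Hypothetical nanny-side restart with its own deadline."""
    try:
        # asyncio.sleep(10) stands in for a worker that is slow to close.
        await asyncio.wait_for(asyncio.sleep(10), timeout=timeout)
        return "closed"
    except asyncio.TimeoutError:
        # The worker did not close in time, so the nanny would terminate it.
        return "killed"


async def restart_cluster(timeout, fraction=0.8):
    """Hypothetical scheduler-side restart.

    The nanny receives only a fraction of the scheduler's timeout
    (0.8 is an assumed value), so it finishes before the outer deadline.
    """
    return await asyncio.wait_for(
        restart_nanny(timeout * fraction), timeout=timeout
    )


def close_worker(executor):
    """Shut the executor down without waiting for running tasks.

    Returns how long the shutdown call itself took, in seconds.
    """
    start = time.monotonic()
    executor.shutdown(wait=False)  # returns immediately
    return time.monotonic() - start
```

Calling `asyncio.run(restart_cluster(timeout=0.2))` exercises the first path: the inner 0.16 s deadline expires before the outer 0.2 s one, so the outer call returns normally instead of timing out. For the second path, submitting `time.sleep` to a `ThreadPoolExecutor` and then calling `close_worker` on it returns without blocking on the sleeping task. Note that `shutdown(wait=False)` does not interrupt a task that is already running; it only stops the caller from waiting on it.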

@eferreira

3.2 is raising SkipErrors
@mrocklin mrocklin merged commit 7985689 into dask:master Aug 2, 2017
@mrocklin mrocklin deleted the restart-timeout branch August 2, 2017 20:09
@mrocklin mrocklin restored the restart-timeout branch October 6, 2017 15:25
Successfully merging this pull request may close these issues.

Forced termination of workers by nannies not working properly