Support slowly arriving workers in SpecCluster #2904

Merged
merged 17 commits into dask:master from mrocklin:deploy-slow on Aug 1, 2019

Conversation

@mrocklin
Member

commented Jul 29, 2019

Previously SpecCluster waited until all workers had checked in with the
scheduler. This made sense for LocalCluster or SSHCluster because
there isn't really a significant delay in starting things that we can't
control. However, for other systems like dask-jobqueue or
dask-kubernetes workers might not ever start, so we need a different
system.

Now, SpecCluster still awaits the Worker object that it is passed, but
doesn't require that the worker has started in the scheduler. We now
expect awaiting to mean

"We have successfully handed control of starting the worker to some other robust system"

Our job at this point is done. We hope that the worker arrives, but
from our perspective this local Worker object is awaited and "running".
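
As a concrete illustration, here is a hedged usage sketch. It assumes the public SpecCluster spec format (a dict of `{name: {"cls": ..., "options": ...}}`); the exact options are illustrative. The point is that entering the cluster awaits each worker object, which now only means "the start request was handed off", not "the worker has registered with the scheduler".

```python
# Usage sketch, not taken from this PR's tests; options are illustrative.
import asyncio

from distributed import Scheduler, SpecCluster, Worker


async def main():
    async with SpecCluster(
        scheduler={"cls": Scheduler, "options": {"port": 0}},
        workers={"w0": {"cls": Worker, "options": {"nthreads": 1}}},
        asynchronous=True,
    ) as cluster:
        # Local worker objects exist and are "running", even if they have
        # not yet checked in with the scheduler.
        print(cluster.workers)


asyncio.run(main())
```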

This commit also includes a minimal Worker class, SlowWorker, that serves
as a nice example of what SpecCluster expects from a worker object.
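
The class below is a rough, self-contained sketch in the spirit of that SlowWorker (names and details are illustrative, not copied from the test file). It shows the minimal surface SpecCluster seems to expect from a worker object: it is awaitable, exposes a status, and can be closed, while the "real" worker may show up much later.

```python
# Toy sketch of a slowly arriving worker; no dask imports needed.
import asyncio


class SlowWorkerSketch:
    """Awaiting this object hands off the launch; the worker connects later."""

    def __init__(self, delay=2):
        self.delay = delay
        self.status = None

    async def start(self):
        # Hand off to an (imagined) external system; do not wait for the
        # worker to actually come up before reporting ourselves as running.
        asyncio.ensure_future(self._arrive_eventually())
        self.status = "running"

    async def _arrive_eventually(self):
        await asyncio.sleep(self.delay)  # stand-in for a queue/k8s launch
        self.status = "connected"

    def __await__(self):
        async def _():
            if self.status is None:
                await self.start()
            return self

        return _().__await__()

    async def close(self):
        self.status = "closed"


async def main():
    w = await SlowWorkerSketch(delay=2)  # returns immediately
    assert w.status == "running"         # "running" == control handed off
    await w.close()


asyncio.run(main())
```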

I plan to do a bit more work here and review our Adaptive policies for the case
where workers may take a while to arrive. I hope that this allows us to
replicate/replace a lot of the fine work that @jhamman did over at dask-jobqueue.

cc @jcrist @jacobtomlinson

mrocklin added 12 commits Jul 29, 2019
Support slowly arriving workers in SpecCluster
Close workers more gracefully
This commit does two things (sketched below):

1.  We wait a little longer to shut down the executor, in case it is still
    in use
2.  The worker no longer asks the Nanny to terminate it. Instead it asks
    the nanny to shut down gracefully after it is gone, and then
    continues closing itself as normal.
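
A rough, purely illustrative sketch of that ordering, using toy classes and hypothetical method names rather than distributed's actual Worker/Nanny code:

```python
# Toy sketch of the close ordering described above.
import asyncio
from concurrent.futures import ThreadPoolExecutor


class ToyNanny:
    restart = True

    async def close_gracefully(self):
        # Note that the child is going away on purpose; don't restart it.
        self.restart = False


class ToyWorker:
    def __init__(self, nanny=None):
        self.nanny = nanny
        self.executor = ThreadPoolExecutor(1)

    async def close(self, executor_wait=True):
        if self.nanny is not None:
            # 2. Ask the nanny to stand down once we're gone, rather than
            #    asking it to terminate us.
            await self.nanny.close_gracefully()
        # 1. Shut the executor down with wait=True so in-flight tasks can
        #    finish, then continue closing ourselves as normal.
        self.executor.shutdown(wait=executor_wait)


asyncio.run(ToyWorker(ToyNanny()).close())
```
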
@mrocklin

Member Author

commented Jul 30, 2019

@jcrist would you mind reviewing the adaptive_core.py file? I think that it abstracts all of the adaptive logic away from assumptions about the cluster class. It would be good to get your perspective both as a controls person and as someone who might want to connect this logic to another cluster class type (or two) in the future.
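
For readers following along, here is a rough sketch (hypothetical names, heavily simplified, not the actual adaptive_core.py) of the kind of separation described: the core only computes targets and decisions, and leaves every cluster-specific action to overridable hooks.

```python
from abc import ABC, abstractmethod


class AdaptiveCoreSketch(ABC):
    """Cluster-agnostic adaptive loop; subclasses supply the cluster specifics."""

    @abstractmethod
    async def target(self) -> int:
        """Desired total number of workers, from whatever metrics apply."""

    @abstractmethod
    async def scale_up(self, n: int) -> None:
        """Ask the concrete cluster for n workers in total."""

    @abstractmethod
    async def scale_down(self, workers: list) -> None:
        """Ask the concrete cluster to retire these workers."""

    async def adapt(self, current_workers: list) -> None:
        # The decision logic knows nothing about how workers are launched.
        n = await self.target()
        if n > len(current_workers):
            await self.scale_up(n)
        elif n < len(current_workers):
            # A real implementation would choose which workers are safest
            # to close; here we just take the tail of the list.
            await self.scale_down(current_workers[n:])
```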

mrocklin added 5 commits Jul 30, 2019
@mrocklin

Member Author

commented Aug 1, 2019

Moving ahead with this. I'm more than happy to change things after merging.

@mrocklin mrocklin merged commit ff3437c into dask:master Aug 1, 2019

2 checks passed

continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed

@mrocklin mrocklin deleted the mrocklin:deploy-slow branch Aug 1, 2019
