
quickly return WorkerPool, add workers later #83

Open
kolia opened this issue Aug 6, 2021 · 4 comments

@kolia
Contributor

kolia commented Aug 6, 2021

Getting many requested pods can trigger scale-up which takes time.

Currently this is dealt with via a timeout: any requested pods that have not stood up and connected by the timeout are dropped, and launch returns with however many pods have come up. This can be awkward.

An alternative way to deal with spin-up slowness is to return a WorkerPool quickly, maybe as soon as there is one worker connected, and continue adding workers to that pool after returning.

To be practical, this method should have a worker-initialization hook, so that workers only join the WorkerPool after eval'ing some quoted code in Main, typically `using` commands for packages and other definitions.
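A rough sketch of what such an API could look like (hypothetical; `launch_pool` and `init_expr` are names invented here, not part of any existing package):

```julia
using Distributed

# Hypothetical sketch: return a WorkerPool immediately and keep adding
# workers in the background. `init_expr` is the initialization hook: it is
# evaluated in Main on each worker before the worker joins the pool.
function launch_pool(nworkers::Integer; init_expr::Expr = :(nothing))
    pool = WorkerPool()
    @async for _ in 1:nworkers
        id = only(addprocs(1))                            # stand up one worker
        Distributed.remotecall_eval(Main, id, init_expr)  # run the init hook
        push!(pool, id)                                   # join pool only after init
    end
    return pool  # caller gets the pool back without waiting for all workers
end
```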

@omus
Member

omus commented Aug 6, 2021

I like the concept of dynamically adding workers into a pool but unfortunately we're restricted by how Distributed.jl works at the moment. Attempting to do this with the existing Distributed.jl would definitely be painful and probably end up being very fragile.

The concept however could warrant an iteration to Distributed.jl which could start out as an external package. Some basic thoughts on what changes would be made to the existing Distributed.jl stdlib:

  • Update the cluster manager interface such that workers are added via a Channel instead of a Vector
  • addprocs returns a WorkerPool or something similar to which workers can be added dynamically as they report in
  • @everywhere calls are applied to workers in the WorkerPool as they report in. This means that workers can execute this logic at very different times. There may be implications of this I'm not considering.
  • pmap or @distributed can partition work based upon the expected size of the WorkerPool and start running immediately with the current set of workers available
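The first two bullets could be sketched roughly like this (hypothetical API, not the current Distributed.jl interface; `addprocs_channel` is a name invented for illustration):

```julia
using Distributed

# Sketch: launched workers flow through a Channel so consumers can react to
# each worker as it reports in, instead of blocking until all have connected.
function addprocs_channel(n::Integer)
    ch = Channel{Int}(n)
    @async begin
        for _ in 1:n
            put!(ch, only(addprocs(1)))  # report each worker as it connects
        end
        close(ch)
    end
    return ch
end

# e.g. apply @everywhere-style setup to workers as they arrive:
# for id in addprocs_channel(8)
#     Distributed.remotecall_eval(Main, id, :(using LinearAlgebra))
# end
```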

@ericphanson
Member

Are you sure we can't dynamically add workers to a pool? It seems like pmap just take!s workers from the pool instead of predistributing the work upfront.

(Though a redesign does sound good too!)
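One way to test this observation (an experiment sketch, not verified behavior): start a pmap against a pool and push! an extra worker in mid-flight.

```julia
using Distributed

addprocs(1)
pool = WorkerPool(workers())

# pmap take!s a worker from the pool for each task, so a worker push!-ed
# into the pool after the pmap has started should pick up remaining work.
t = @async pmap(x -> (sleep(1); myid()), pool, 1:20)

push!(pool, only(addprocs(1)))  # add a second worker mid-computation
ids = fetch(t)
# if dynamic addition works, both worker IDs should appear in `ids`
```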

@omus
Member

omus commented Aug 10, 2021

Are you sure we can't dynamically add workers to a pool?

Here's the code that calls the launch method defined by the Distributed interface:

https://github.com/JuliaLang/julia/blob/c2b4b382c11b5668cb9091138b1fa9178c47bff5/stdlib/Distributed/src/cluster.jl#L480-L499

You're expected to add workers to the launched vector and there's no way to pass back a WorkerPool. There may be a way of doing an unblocked launch and call setup_launched_worker within it but I'd expect you to run into strange corner cases.

If someone wants to look into this further that would be great. They may find something I've missed or at worst validate my assessment.

@omus
Member

omus commented Aug 18, 2021

I may have thought of a workaround to this problem. If we define an alternative addprocs function, maybe spawn, what we could do internally is call addprocs asynchronously, adding a single worker at a time. This should allow us to immediately return a mutable Vector of worker IDs or possibly even a WorkerPool. Depending on the internals of WorkerPool and the functions that use it, we may be able to allocate work to workers as they come in.
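A minimal sketch of that workaround, assuming `spawn` is a new name (it is not an existing Distributed.jl function):

```julia
using Distributed

# Sketch: call addprocs asynchronously, one worker per call, pushing each
# resulting ID into a WorkerPool that is returned to the caller at once.
function spawn(n::Integer; kwargs...)
    pool = WorkerPool()
    for _ in 1:n
        @async push!(pool, only(addprocs(1; kwargs...)))
    end
    return pool  # usable immediately; workers join as each addprocs completes
end
```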
