Preloading a module in every worker process #1013
In general I'm in favor of the idea of preloading state. Because many people start Dask in different ways, I think that this should live deeper than the command line interface used to start it. This is similar to #495 and PR #505 (never merged). I suspect that the scheduling bits of this make them somewhat more challenging. I can see two (of many) paths forward.

Thoughts @bm371613?
These Environments would not solve my problem, as they are registered by a client. I would require an environment to be there before any client connects. One of the two would have to be possible:

- the scheduler can be started with a `required_environments` option, which is later set up for every worker that connects
- the worker can be started with an `initial_environments` option, and then it tells the scheduler what is available already

As environments are objects, they cannot be passed as CLI options to scheduler/worker easily. They could be initialized in a module which would be preloaded and, in that sense, what I propose could work together with #495. Other than that, I would see these problems separately.
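Purely for illustration, the two options above might look something like this on the command line; neither flag exists, and the names simply echo the ones proposed in this comment:

```
# hypothetical invocations only; neither flag is implemented
$ dask-scheduler --required-environments my_project.environments
$ dask-worker tcp://scheduler-address:8786 --initial-environments my_project.environments
```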
Environments as defined in that issue are stored on the scheduler and are tried against workers as they arrive. How the environments arrive on the scheduler is an interesting question, but generally not the hard part of the problem. Presumably in your case you would set up the scheduler, then register an environment with it somehow, either with custom code or by spinning up a short-lived client. Does starting a short-lived client pose a problem?

Running some arbitrary Python script at startup as you suggest is also a solution. I'm slightly less excited by it just because it feels tacked-on. This sort of thing does come up decently often though, where people want to slightly modify the setup process. It would be interesting to think about how we would want to arrange this. If you want to put together a proposal I'd be happy to look at it.

What would your script do precisely? Does it contain a particular function that we call? How does it learn about the scheduler or worker that it is supposed to modify? Does it return something? Perhaps the worker again? Should it expect the event loop to be running or not running at the time when it is called?
And are there other problems that this would resolve, other than your immediate problem?
Running a short-lived client would be inconvenient. It would do for experimenting with a notebook, but it would not play nicely with production deployment tools. It could be worked around, but not in an elegant way.

My script would not interact with distributed at all. No special interface, just "execute me". It would call methods on global objects defined in other modules (initialize resources). These objects have `__getstate__` and `__setstate__` overridden to translate the client's uninitialized resources into the worker's already initialized ones. This may sound hacky, but the point is to make it transparent so that it can work without distributed too.
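A minimal sketch of the kind of object being described, with all names invented for illustration (nothing below is from distributed or from this thread):

```python
# Illustrative only: a handle that pickles as a lightweight description and
# unpickles into an already initialized resource on the worker.
_INITIALIZED = {}  # populated at worker startup, e.g. by a preloaded module


def initialize(name):
    """Stand-in for an expensive setup step (opening connections, loading data)."""
    _INITIALIZED[name] = f"live resource for {name}"


class ResourceHandle:
    def __init__(self, name):
        self.name = name
        self.resource = None  # uninitialized on the client

    def __getstate__(self):
        return {"name": self.name}  # ship only the description, never live state

    def __setstate__(self, state):
        self.name = state["name"]
        # On the worker the resource was initialized at startup, so a task
        # holding this handle never pays the initialization cost itself.
        self.resource = _INITIALIZED.get(self.name)
```

The point of the sketch is that a task can simply hold a `ResourceHandle`; shipping it to a worker transparently swaps in whatever the worker-side startup code already initialized.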
I guess I'm looking around for other applications that this would help or enable, which would help us justify maintaining this feature long term.
All global setup will benefit:
How do you plan to run your script internally?
I guess currently the best alternative would be to make your own `dask-worker-2.py`:

```python
from distributed.cli.dask_worker import go

# your own code here

go()
```

```
$ python dask-worker-2.py ...
```
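For example, the placeholder in that wrapper could hold whatever process-wide setup the workers need; a sketch, with the logging configuration chosen only as an illustration:

```python
# dask-worker-2.py  (sketch of the wrapper approach described above)
import logging

from distributed.cli.dask_worker import go

# your own code here: e.g. configure logging or initialize global resources
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

if __name__ == "__main__":
    go()
```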
That is exactly what I did, but it is not that easy.
I am developing some generic code that will later be used by people not willing to understand why they have to write their scripts according to some weird instructions, with all this unusual …

On the other hand, looking into the dask-worker code, preloading would not be very intrusive, and it is not specific to my problem (now that I have thought about it, I would really use it for configuring logging).
I'm generally not opposed to providing some mechanism for user-provided code to run in both of the command line executables (`dask-scheduler` and `dask-worker`).
You might be surprised. People generally want to add lots of little things into Dask to solve their problems. Other people come by, misuse these features, and then blame the project for being buggy. This has caused me to be fairly conservative about adding new features. I'm not saying "no". I'm saying "yes, but let's be thoughtful". If you want to push a PR with your intended solution I'd be happy to take a look at it. You should expect some back and forth though.
Fair enough, I will push a PR.
Just to chime in from my timezone: I'd appreciate something like this too, mainly to initialize logging and some global configs that depend on the environment it is running in (e.g. an abstraction over an object store, which internally uses S3 when on AWS, and GCS when on Google Cloud). A mechanism to make sure some code is executed every time a worker is started would especially help in situations where you want to scale up after starting by adding extra workers.
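A sketch of that kind of environment-dependent startup code; the module name, detection logic, and backends shown are all invented for illustration:

```python
# worker_startup.py  (hypothetical startup module)
import os

OBJECT_STORE = None  # chosen once per worker process, then used by tasks


def configure_object_store():
    """Pick an object-store backend based on where the worker is running."""
    global OBJECT_STORE
    provider = os.environ.get("CLOUD_PROVIDER", "").lower()
    if provider == "aws":
        OBJECT_STORE = {"backend": "s3"}
    elif provider == "gcp":
        OBJECT_STORE = {"backend": "gcs"}
    else:
        OBJECT_STORE = {"backend": "local"}


configure_object_store()
```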
@vincentschut can you say more about your bonus request?
Thanks for your help!
I would like to propose a new `--preload` option for `dask-worker`. It would let the user pass a module string (`foo.bar`), and then the module would be loaded in every worker process.

Rationale

The preloaded module could initialize some resources, later used by tasks. Lazy initialization is not an option if workers might often be stopped or started and the initialization is too slow for a task to risk waiting.

Would you accept a PR with that feature?
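To make the proposal concrete, here is a sketch of what it might look like from the user's side; the module contents and the exact flag behaviour are assumptions, since nothing is implemented yet:

```python
# foo/bar.py  (the module named on the command line; its contents are illustrative)
import time

# Stand-in for an expensive initialization step that should happen once per
# worker process, before any task needs the resource.
time.sleep(5)
RESOURCE = {"status": "initialized"}
```

The worker would then be started with the proposed flag:

```
$ dask-worker tcp://scheduler-address:8786 --preload foo.bar
```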