
Preloading a module in every worker process #1013

Closed
bm371613 opened this issue Apr 12, 2017 · 16 comments

@bm371613
Contributor

I would like to propose a new --preload option for dask-worker. It would let the user pass a module name as a string (e.g. foo.bar), and that module would then be loaded in every worker process.

Rationale

The preloaded module could initialize some resources that are later used by tasks. Lazy initialization is not an option when workers may often be stopped or started and the initialization is too slow for a task to risk waiting on.
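To make that concrete, here is a minimal sketch of what such a preload module might look like, assuming the proposed flag does nothing more than import the named module in each worker process; the module, function, and resource names below are invented:

```python
# my_project/worker_init.py -- hypothetical module passed via the proposed --preload flag.
# Importing it in every worker process runs this top-level code once per process,
# before any task arrives.
import time

_RESOURCES = {}

def get_resource(name):
    """Called from tasks to reuse the process-wide, already-initialized resource."""
    return _RESOURCES[name]

# Slow initialization happens here, at import time, rather than lazily inside a task.
time.sleep(5)                      # stand-in for expensive setup (loading a model, opening pools, ...)
_RESOURCES["session"] = object()   # stand-in for the initialized resource
```

A worker would then be started roughly as `dask-worker scheduler-address:8786 --preload my_project.worker_init` (the flag does not exist yet; that is the proposal).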

Would you accept a PR with that feature?

@mrocklin
Member

In general I'm in favor of the idea of preloading state. Because many people start Dask in different ways, I think this should live deeper than the command-line interface, to start with. This is similar to #495 and PR #505 (never merged).

I suspect that the scheduling bits of this make it somewhat more challenging. I can see two (of many) paths forward:

  1. Try to push the current PR forward
  2. Implement something simpler in such a way that it could be easily extended to the full scope of environments in the future

Thoughts @bm371613 ?

@bm371613
Contributor Author

bm371613 commented Apr 12, 2017

These Environments would not solve my problem, as they are registered by a client. I would require an environment to be there before any client connects. One of two things would have to be possible:

  • the scheduler can be started with a required_environments option, and those environments are later set up on every worker that connects
  • the worker can be started with an initial_environments option, and it then tells the scheduler what is already available

As environments are objects, they cannot easily be passed as CLI options to the scheduler/worker. They could, however, be initialized in a module which would be preloaded, and in that sense what I propose could work together with #495.

@mrocklin
Member

mrocklin commented Apr 12, 2017 via email

@mrocklin
Member

mrocklin commented Apr 12, 2017 via email

@bm371613
Contributor Author

bm371613 commented Apr 12, 2017

Running a short-lived client would be inconvenient. It would do for experimenting in a notebook, but it would not play nicely with production deployment tools. It could be worked around, but not in an elegant way.

My script would not interact with distributed at all. No special interface, just "execute me". It would call methods on global objects defined in other modules (to initialize resources). These objects have __getstate__ and __setstate__ overridden to translate the client's uninitialized resources into the worker's already-initialized resources. This may sound hacky, but the point is to make it transparent so that it can work without distributed too.
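Roughly the kind of pattern described above, with all names invented; the point is only that the pickling hooks swap an uninitialized client-side handle for whatever the worker process has already initialized:

```python
# Illustrative sketch, not distributed's API. A preloaded module would fill
# _INITIALIZED with ready-to-use resources in each worker process; the handle's
# pickling hooks then make the swap transparent.

_INITIALIZED = {}   # populated at preload/startup time in each worker process

class ResourceHandle:
    def __init__(self, name):
        self.name = name
        # None on the client (nothing was preloaded there), a live object on a worker.
        self.resource = _INITIALIZED.get(name)

    def __getstate__(self):
        # Only the resource's name travels over the wire.
        return {"name": self.name}

    def __setstate__(self, state):
        # On the receiving process, pick up whatever the preload already built.
        self.name = state["name"]
        self.resource = _INITIALIZED.get(self.name)
```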

@mrocklin
Member

mrocklin commented Apr 12, 2017 via email

@bm371613
Contributor Author

All global setup will benefit, for example (see the logging sketch after this list):

  • sys.set...
  • configuring logging (especially if a project uses a non-standard way of reading its configuration)
  • monkeypatching
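For instance, a preload module that only configures logging could be as small as this (the module name is invented):

```python
# my_project/logging_setup.py -- hypothetical module to be preloaded in every worker process.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(processName)s %(name)s %(levelname)s %(message)s",
)
logging.getLogger("my_project").info("logging configured in worker process")
```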

@mrocklin
Member

How do you plan to run your script internally?

@mrocklin
Member

I guess currently the best alternative would be to make your own dask-worker-2.py script:

```python
from distributed.cli.dask_worker import go

# your own code here

go()
```

```
$ python dask-worker-2.py ...
```

@bm371613
Contributor Author

That is exactly what I did, but it is not that easy. The --nanny/--no-nanny options combined with the forkserver especially complicate this, as the module can be imported as __main__, as __mp_main__, or not imported by a worker at all if it happens to be foo/__main__.py. You end up guessing based on __name__ and sys.argv, and it is still not fully functional (it cannot be used in entry_points in setup.py).
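For illustration, the guard in such a dask-worker-2.py ends up looking roughly like this; it is a sketch of the fragile workaround being described, not a recommendation:

```python
# dask-worker-2.py -- sketch of the guessing described above.
from distributed.cli.dask_worker import go

def setup():
    # project-specific initialization (logging, resources, ...)
    pass

# With --nanny and the forkserver, this file may be imported as __main__ in the
# parent process, as __mp_main__ in the children, or not at all; so we guess,
# and in practice also have to peek at sys.argv to rule out unrelated imports.
if __name__ in ("__main__", "__mp_main__"):
    setup()

if __name__ == "__main__":
    go()          # actually start the worker CLI
```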

@bm371613
Contributor Author

I am developing some generic code that will later be used by people who will not want to understand why they have to write their scripts according to some odd instructions, with all this unusual __name__ handling, and it still will not work for every way of invoking it. And all this guessing could easily be broken by a change in dask-worker.

On the other hand, looking into the dask-worker code, preloading would not be very intrusive, and it is not specific to my problem (now that I think about it, I would really use it for configuring logging too).

@mrocklin
Member

I'm generally not opposed to providing some mechanism for user-provided code to run in both of the command-line executables (dask-scheduler and dask-worker); however, I do think we need to think about this for a bit to make sure we're covering things well.

> On the other hand, looking into dask-worker code, preloading would not be very intrusive

You might be surprised. People generally want to add lots of little things into Dask to solve their problems. Other people come by, misuse these features, and then blame the project for being buggy. This has caused me to be fairly conservative about adding new features. I'm not saying "no". I'm saying "yes, but let's be thoughtful".

If you want to push a PR with your intended solution I'd be happy to take a look at it. You should expect some back and forth though.

@bm371613
Contributor Author

Fair enough, I will push a PR.

@vincentschut

Just to chime in from my timezone: I'd appreciate something like this too, mainly to initialize logging and some global configs that depend on the environment the worker is running on (e.g. an abstraction over an object store that internally uses S3 when on AWS and GCS when on Google Cloud). A mechanism to make sure some code is executed every time a worker is started would especially help when you want to scale up after starting by adding extra workers.
Bonus: run different initialization based on worker resources :-)
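A sketch of that kind of preload module, with an invented name and a deliberately crude environment probe; the worker-resources part of the wish is left out:

```python
# my_project/env_setup.py -- hypothetical preload module choosing an object-store backend.
import os

def _detect_backend():
    # Crude, illustrative detection; a real deployment would probe cloud metadata services.
    if os.environ.get("AWS_REGION"):
        return "s3"
    if os.environ.get("GOOGLE_CLOUD_PROJECT"):
        return "gcs"
    return "local"

# Chosen once when the worker process imports the module; tasks just read it.
OBJECT_STORE = _detect_backend()
```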

@mrocklin
Member

@vincentschut can you say more about your bonus request?

@bm371613
Contributor Author

Thanks for your help!
