
Preloading a module in every worker process #1013

Closed
bm371613 opened this issue Apr 12, 2017 · 16 comments

@bm371613
Contributor

I would like to propose a new --preload option for dask-worker. It would let the user pass a module name as a string (e.g. foo.bar), and that module would then be loaded in every worker process.

Rationale

The preloaded module could initialize some resources that are later used by tasks. Lazy initialization is not an option when workers may often be stopped or started and the initialization is too slow for a task to risk waiting on.
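To make that concrete, here is a minimal sketch of what such a preload module might look like, assuming the proposed flag does nothing more than import the named module in each worker process; the module, function, and resource names below are invented:

```python
# my_project/worker_init.py -- hypothetical module passed via the proposed --preload flag.
# Importing it in every worker process runs this top-level code once per process,
# before any task arrives.
import time

_RESOURCES = {}

def get_resource(name):
    """Called from tasks to reuse the process-wide, already-initialized resource."""
    return _RESOURCES[name]

# Slow initialization happens here, at import time, rather than lazily inside a task.
time.sleep(5)                      # stand-in for expensive setup (loading a model, opening pools, ...)
_RESOURCES["session"] = object()   # stand-in for the initialized resource
```

A worker would then be started roughly as `dask-worker scheduler-address:8786 --preload my_project.worker_init` (the flag does not exist yet; that is the proposal).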

Would you accept a PR with that feature?

@mrocklin
Member

In general I'm in favor of the idea of preloading state. Because many people start Dask in different ways, I think this should live deeper than the command-line interface, to start with. This is similar to #495 and PR #505 (never merged).

I suspect that the scheduling bits of this make it somewhat more challenging. I can see two (of many) paths forward:

  1. Try to push the current PR forward
  2. Implement something simpler in such a way that it could be easily extended to the full scope of environments in the future

Thoughts @bm371613 ?

@bm371613
Contributor Author

bm371613 commented Apr 12, 2017

These Environments would not solve my problem, as they are registered by a client. I would require an environment to be there before any client connects. One of two things would have to be possible:

  • the scheduler can be started with a required_environments option, and those environments are later set up on every worker that connects
  • the worker can be started with an initial_environments option, and it then tells the scheduler what is already available

As environments are objects, they cannot easily be passed as CLI options to the scheduler/worker. They could, however, be initialized in a module which would be preloaded, and in that sense what I propose could work together with #495.

@mrocklin
Member

mrocklin commented Apr 12, 2017 via email

@mrocklin
Member

mrocklin commented Apr 12, 2017 via email

@bm371613
Contributor Author

bm371613 commented Apr 12, 2017

Running a short-lived client would be inconvenient. It would do for experimenting in a notebook, but it would not play nicely with production deployment tools. It could be worked around, but not in an elegant way.

My script would not interact with distributed at all. No special interface, just "execute me". It would call methods on global objects defined in other modules (to initialize resources). These objects have __getstate__ and __setstate__ overridden to translate the client's uninitialized resources into the worker's already-initialized resources. This may sound hacky, but the point is to make it transparent so that it can work without distributed too.
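Roughly the kind of pattern described above, with all names invented; the point is only that the pickling hooks swap an uninitialized client-side handle for whatever the worker process has already initialized:

```python
# Illustrative sketch, not distributed's API. A preloaded module would fill
# _INITIALIZED with ready-to-use resources in each worker process; the handle's
# pickling hooks then make the swap transparent.

_INITIALIZED = {}   # populated at preload/startup time in each worker process

class ResourceHandle:
    def __init__(self, name):
        self.name = name
        # None on the client (nothing was preloaded there), a live object on a worker.
        self.resource = _INITIALIZED.get(name)

    def __getstate__(self):
        # Only the resource's name travels over the wire.
        return {"name": self.name}

    def __setstate__(self, state):
        # On the receiving process, pick up whatever the preload already built.
        self.name = state["name"]
        self.resource = _INITIALIZED.get(self.name)
```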

@mrocklin
Member

mrocklin commented Apr 12, 2017 via email

@bm371613
Contributor Author

All global setup will benefit, for example (see the logging sketch after this list):

  • sys.set...
  • configuring logging (especially if a project uses a non-standard way of reading its configuration)
  • monkeypatching
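For instance, a preload module that only configures logging could be as small as this (the module name is invented):

```python
# my_project/logging_setup.py -- hypothetical module to be preloaded in every worker process.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(processName)s %(name)s %(levelname)s %(message)s",
)
logging.getLogger("my_project").info("logging configured in worker process")
```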

@mrocklin
Member

How do you plan to run your script internally?

@mrocklin
Member

I guess currently the best alternative would be to make your own dask-worker-2.py script:

```python
from distributed.cli.dask_worker import go

# your own code here

go()
```

```
$ python dask-worker-2.py ...
```

@bm371613
Contributor Author

That is exactly what I did, but it is not that easy. The --nanny/--no-nanny options combined with the forkserver especially complicate this, as the module can be imported as __main__, as __mp_main__, or not imported by a worker at all if it happens to be foo/__main__.py. You end up guessing based on __name__ and sys.argv, and it is still not fully functional (it cannot be used in entry_points in setup.py).
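For illustration, the guard in such a dask-worker-2.py ends up looking roughly like this; it is a sketch of the fragile workaround being described, not a recommendation:

```python
# dask-worker-2.py -- sketch of the guessing described above.
from distributed.cli.dask_worker import go

def setup():
    # project-specific initialization (logging, resources, ...)
    pass

# With --nanny and the forkserver, this file may be imported as __main__ in the
# parent process, as __mp_main__ in the children, or not at all; so we guess,
# and in practice also have to peek at sys.argv to rule out unrelated imports.
if __name__ in ("__main__", "__mp_main__"):
    setup()

if __name__ == "__main__":
    go()          # actually start the worker CLI
```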

@bm371613
Contributor Author

I am developing some generic code that will later be used by people who will not want to understand why they have to write their scripts according to some odd instructions, with all this unusual __name__ handling, and it still will not work for every way of invoking it. And all this guessing could easily be broken by a change in dask-worker.

On the other hand, looking into the dask-worker code, preloading would not be very intrusive, and it is not specific to my problem (now that I think about it, I would really use it for configuring logging too).

@mrocklin
Member

I'm generally not opposed to providing some mechanism for user-provided code to run in both of the command-line executables (dask-scheduler and dask-worker); however, I do think we need to think about this for a bit to make sure we're covering things well.

> On the other hand, looking into dask-worker code, preloading would not be very intrusive

You might be surprised. People generally want to add lots of little things into Dask to solve their problems. Other people come by, misuse these features, and then blame the project for being buggy. This has caused me to be fairly conservative about adding new features. I'm not saying "no". I'm saying "yes, but let's be thoughtful".

If you want to push a PR with your intended solution I'd be happy to take a look at it. You should expect some back and forth though.

@bm371613
Contributor Author

Fair enough, I will push a PR.

@vincentschut

Just to chime in from my timezone: I'd appreciate something like this too, mainly to initialize logging and some global configs that depend on the environment the worker is running on (e.g. an abstraction over an object store that internally uses S3 when on AWS and GCS when on Google Cloud). A mechanism to make sure some code is executed every time a worker is started would especially help when you want to scale up after starting by adding extra workers.
Bonus: run different initialization based on worker resources :-)
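A sketch of that kind of preload module, with an invented name and a deliberately crude environment probe; the worker-resources part of the wish is left out:

```python
# my_project/env_setup.py -- hypothetical preload module choosing an object-store backend.
import os

def _detect_backend():
    # Crude, illustrative detection; a real deployment would probe cloud metadata services.
    if os.environ.get("AWS_REGION"):
        return "s3"
    if os.environ.get("GOOGLE_CLOUD_PROJECT"):
        return "gcs"
    return "local"

# Chosen once when the worker process imports the module; tasks just read it.
OBJECT_STORE = _detect_backend()
```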

@mrocklin
Member

@vincentschut can you say more about your bonus request?

@bm371613
Contributor Author

Thanks for your help!
