PipInstall: support local packages and add client method #4628
When developing a module locally, it's often nicer to have the package fully pip-installed, rather than the files just added to your PYTHONPATH like `client.upload_file` does—particularly when your package has other dependencies. If you're changing dependencies frequently, in some cases just pip-installing the right dependencies can be faster than rebuilding the environment over and over where you're running the cluster, plus it ensures consistency.
jrbourbeau
left a comment
Thanks for the PR @gjoseph92! Adding support for local packages and logging seems like a nice addition.
There are tests for PipInstall today; however, you're correct that we don't actually call pip install in those tests. Instead, I think we end up taking a mocking approach.
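For readers unfamiliar with the mocking approach mentioned above, here is a minimal sketch of the general idea, not the actual test in distributed. The `pip_install` helper and `check_pip_install_is_mocked` are hypothetical names made up for illustration; the point is only that patching `subprocess.run` lets a test assert on the command without touching the current environment.

```python
import subprocess
import sys
from unittest import mock


def pip_install(packages, pip_options=()):
    """Hypothetical helper: run `pip install` in the current interpreter."""
    cmd = [sys.executable, "-m", "pip", "install", *pip_options, *packages]
    return subprocess.run(cmd, check=True)


def check_pip_install_is_mocked():
    """Call pip_install with subprocess.run patched out and return the recorded call."""
    with mock.patch("subprocess.run") as fake_run:
        pip_install(["requests"], pip_options=["--upgrade"])
    # The real pip is never invoked; we only verify the command that would have run.
    fake_run.assert_called_once_with(
        [sys.executable, "-m", "pip", "install", "--upgrade", "requests"],
        check=True,
    )
    return fake_run.call_args
```

The trade-off is the one raised in this thread: the environment stays untouched, but the test never proves that the install itself works.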
Regarding adding a client.pip_install method, I think it's worth getting input from other folks in the community. I recall there was some contention around adding the PipInstall plugin in the first place as historically we tended to minimize the amount of user software environment management we took on (there are lots of rough edge cases that come along with this). That said, people may have different thoughts today. We'll just need to decide how much of a first-class citizen we want the PipInstall plugin to be. cc @jcrist @TomAugspurger @jacobtomlinson who may have thoughts on this topic
The main motivation for the client method was that the [...]

Maybe we could have a [...]

I guess overall, whether or not it's in the client, people are going to keep doing things like this as they've always done: https://coiled-users.slack.com/archives/C0195GJKQ1G/p1616516879026900?thread_ts=1616488852.025600&cid=C0195GJKQ1G. Personally, I don't think making it more convenient is so bad, so long as it's clearly documented as being for experimentation only.

EDIT: here's the linked thread OP:
An answer:
Point being, I imagine it's pretty common for users to do this, hopefully with a similar awareness of the limitations.
Would you mind copying over the linked content so those not in the Coiled Slack can view it?
Haha, I've got to admit the [...]
jrbourbeau
left a comment
I think the updates to the PipInstall plugin are great. Thinking about it more, I'd prefer to hold off on adding a Client.pip_install method and instead improve/increase the visibility of our documentation around using the PipInstall plugin (xref dask/dask#7459).
Looking ahead, I'd much rather distributed, or better yet some new dask-contrib package, grow CondaInstall, AptInstall, ... plugins for patching worker environments instead of this functionality living directly on the Client.
Having the client method was pretty critical when I opened this: it was way too easy to not change the name of the plugin between repeated executions of a script, which would lead to the scheduler silently ignoring the new plugin (because the name was already registered) and therefore not installing the newest copy of your local package on the cluster. This would be very confusing for a user. But now that #4748 added [...]

That said, my preference would be that if we actually think something is a bad idea, we shouldn't offer it at all. And otherwise, things we offer should be easy to use. Having a separate package for environment management plugins would definitely be ideal. But if we don't think that'll happen anytime soon, I don't like the idea of making users write boilerplate to dissuade them from using an API.
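To make the name-collision problem concrete: one way to avoid reusing a stale plugin name is to derive the name from the contents of the local package file, which is the "hash the file contents" idea mentioned later in this thread. The sketch below is only an illustration under that assumption; `plugin_name_for` and the `pip-install` prefix are hypothetical and not part of distributed.

```python
import hashlib
from pathlib import Path


def plugin_name_for(path, prefix="pip-install"):
    """Hypothetical helper: build a plugin name that changes whenever the file changes."""
    # Hash the raw bytes so any edit to the package archive yields a new name.
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    # A short prefix of the digest keeps the name readable in logs.
    return f"{prefix}-{digest[:12]}"
```

With a name like this, re-running a script after editing the package registers a plugin under a new name, while an unchanged package maps to the same name and is skipped.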
Agreed setting a
Historically Dask has not taken on software environment management. However, there were enough users who wanted to patch their worker environment that someone built the
In general I agree with this sentiment, but in this case I wouldn't categorize `client.register_worker_plugin(PipInstall(packages=...))` as being overly cumbersome boilerplate code.
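For reference, the one-liner above expands to roughly the following. This is a sketch assuming a distributed version in which `Client.register_worker_plugin` and `distributed.diagnostics.plugin.PipInstall` are available, as they are at the time of this PR; `install_on_workers` and its arguments are hypothetical names used only for illustration.

```python
def install_on_workers(scheduler_address, packages):
    """Hypothetical wrapper: pip-install `packages` on every worker of a running cluster."""
    # Imported lazily so this sketch can be imported without distributed installed.
    from distributed import Client
    from distributed.diagnostics.plugin import PipInstall

    client = Client(scheduler_address)
    # The plugin runs `pip install` on each current worker and on workers that join later.
    client.register_worker_plugin(PipInstall(packages=list(packages)))
    return client
```

Nothing runs until the function is called with the address of a live scheduler.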
Great point. I was still thinking of needing to hash the file contents. Regardless, this still needs tests. Based on scattered feedback I've heard from a few folks though, I think it would be useful, so hopefully I can get to that at some point.
Totally agree the updates you've made here to [...]
Personally, I've found pip-installing the package to be a nicer and more reliable workflow than `client.upload_file` when developing full packages locally.

I did not add tests, because honestly I couldn't figure out how to. NVM there are already tests, just didn't see them! I'll add some more following this pattern. `PipInstall` doesn't have tests, and the idea of calling `pip install` within a test and modifying the current environment feels pretty nasty to me. And as far as I could find, there's no easy way to activate a virtual environment midway through a script (and then deactivate it later). But if others think calling `pip install` within a test would be okay/worthwhile, then I can install (and then uninstall) something like https://pypi.org/project/test-pip-install.

I'll also add a note to https://docs.dask.org/en/latest/setup/environment.html#send-source in a separate PR.

`black distributed` / `flake8 distributed`