Skip to content

PipInstall: support local packages and add client method#4628

Open
gjoseph92 wants to merge 1 commit into
dask:mainfrom
gjoseph92:pip-install-local-modules
Open

PipInstall: support local packages and add client method#4628
gjoseph92 wants to merge 1 commit into
dask:mainfrom
gjoseph92:pip-install-local-modules

Conversation

@gjoseph92

@gjoseph92 gjoseph92 commented Mar 23, 2021

Copy link
Copy Markdown
Collaborator

When developing a module locally, it's often nicer to have the package fully pip-installed, rather than the files just added to your PYTHONPATH like client.upload_file does—particularly when your package has other dependencies. If you're changing dependencies frequently, in some cases just pip-installing the right dependencies can be faster than rebuilding the environment over and over where you're running the cluster, plus it ensures consistency.

Personally, I've found this to be a nicer and more reliable workflow than client.upload_file when developing full packages locally.

I did not add tests, because honestly I couldn't figure out how to. PipInstall doesn't have tests, and the idea of calling pip install within a test and modifying the current environment feels pretty nasty to me. And as far as I could find, there's no easy way to activate a virtual environment midway through a script (and then deactivate it later). But if others think calling pip install within a test would be okay/worthwhile, then I can install (and then uninstall) something like https://pypi.org/project/test-pip-install. NVM there are already tests, just didn't see them! I'll add some more following this pattern.

I'll also add a note to https://docs.dask.org/en/latest/setup/environment.html#send-source in a separate PR.

  • Tests added / passed
  • Passes black distributed / flake8 distributed

When developing a module locally, it's often nicer to have the package fully pip-installed, rather than the files just added to your PYTHONPATH like `client.upload_file` does—particularly when your package has other dependencies. If you're changing dependencies frequently, in some cases just pip-installing the right dependencies can be faster than rebuilding the environment over and over where you're running the cluster, plus it ensures consistency.
@gjoseph92 gjoseph92 changed the title PipInstall: support local packages PipInstall: support local packages and add client method Mar 23, 2021

@jrbourbeau jrbourbeau left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That for the PR @gjoseph92! Adding support for local packages and logging seems like a nice addition.

There are tests for PipInstall today, however you're correct that we don't actually call pip install in those tests. Instead I think we end up taking a mocking approach.

Regarding adding a client.pip_install method, I think it's worth getting input from other folks in the community. I recall there was some contention around adding the PipInstall plugin in the first place as historically we tended to minimize the amount of user software environment management we took on (there are lots of rough edge cases that come along with this). That said, people may have different thoughts today. We'll just need to decide how much of a first-class citizen we want the PipInstall plugin to be. cc @jcrist @TomAugspurger @jacobtomlinson who may have thoughts on this topic

@gjoseph92

gjoseph92 commented Mar 24, 2021

Copy link
Copy Markdown
Collaborator Author

The main motivation for the client method was that the name needs to change every time the files change in order for the new version of the plugin to get registered. This seemed easy to miss, and hard to recognize when you've missed it (the module is there, but my new changes aren't?), so I thought I'd wrap up that logic in something more convenient.

Maybe we could have a .name property on PipInstall instances that give you a "recommended" name, and suggest using client.register_worker_plugin(plugin, name=plugin.name) in the docs instead?

I guess overall, whether or not it's not in the client, people are going to keep doing things like this as they've always done: https://coiled-users.slack.com/archives/C0195GJKQ1G/p1616516879026900?thread_ts=1616488852.025600&cid=C0195GJKQ1G. Personally, I don't think making it more convenient is so bad, so long as it's clearly documented to only use for experimentation.

EDIT: here's the linked thread

OP:

Is it possible to update the software environment of a cluster while it’s running?

An answer:

My two cents:
If you're only missing some dependencies you can also use something like:

from subprocess import run
def install_missing_deps():
    run('pip install pyarrow'.split())  # missing pyarrow, for example
client.run(install_missing_deps)

This only lasts for as long as the cluster lives but you can rebuild your software environment in another session and keep working on the one you already have up

Point being, I imagine it's pretty common for users to do this, hopefully with a similar awareness of the limitations.

@jrbourbeau

Copy link
Copy Markdown
Member

Would you mind copying over the linked content so those not in the Coiled slack can view it?

@jose-moralez

Copy link
Copy Markdown

Haha, I've got to admit the client.run approach is very quick and dirty. I really like the local_packages addition to the PipInstall plugin, I just recently started developing a package and was using client.upload_file to test it out in the cluster but this definitely seems cleaner and more reliable.

@jrbourbeau jrbourbeau left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the updates to the PipInstall plugin are great. Thinking about it more, I'd prefer to hold off on adding a Client.pip_install method and instead improve/increase the visibility of our documentation around using the PipInstall plugin (xref dask/dask#7459).

Looking ahead I'd much rather distributed, or better yet some new dask-contrib package, grow CondaInstall, AptInstall, ... plugins for patching worker environments instead of this functionality living directly on the Client

@gjoseph92

Copy link
Copy Markdown
Collaborator Author

I'd prefer to hold off on adding a Client.pip_install method

Having the client method was pretty critical when I opened this—it was way too easy to not change the name of the plugin between repeated executions of a script, which would lead to the scheduler silently ignoring the new plugin (because the name was already registered) and therefore not installing the newest copy of your local package on the cluster. This would be very confusing for a user.

But now that #4748 added _get_worker_plugin_name, I think we could just add a name @property to PipInstall and it would work okay without the client method.

That said, my preference would be that if we actually think something is a bad idea, we shouldn't offer it at all. And otherwise, things we offer should be easy to use. Having a separate package for environment management plugins would definitely ideal. But if we don't think that'll happen anytime soon, I don't like the idea of making users write boilerplate to dissuade them from using an API.

@jrbourbeau

Copy link
Copy Markdown
Member

But now that #4748 added _get_worker_plugin_name, I think we could just add a name @Property to PipInstall and it would work okay without the client method

Agreed setting a name property would be a nice improvement (FWIW the _get_worker_plugin_name logic predates #4748, that PR just moved it into a separate method)

That said, my preference would be that if we actually think something is a bad idea, we shouldn't offer it at all

Historically Dask has not taken on software environment management. However, there were enough users who wanted to patch their worker environment that someone built the PipInstall plugin. We don't recommend this as the primary way to management software environments, but from a practical perspective it's a useful tool to have in your toolbox for certain cases.

And otherwise, things we offer should be easy to use...I don't like the idea of making users write boilerplate to dissuade them from using an API

In general I agree with this sentiment, but in this case I wouldn't categorize

client.register_worker_plugin(PipInstall(packages=...))

as being overly cumbersome boilerplate code.

@gjoseph92

Copy link
Copy Markdown
Collaborator Author

I wouldn't categorize client.register_worker_plugin(PipInstall(packages=...)) as being overly cumbersome boilerplate code.

Great point. I was still thinking of needing to hash the file contents.

Regardless, this still needs tests. Based on scattered feedback I've heard from a few folks though, I think it would be useful, so hopefully I can get to that at some point.

@jrbourbeau

Copy link
Copy Markdown
Member

Totally agree the updates you've made here to PipInstall are nice additions / improvements 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants