Can we submit functions without requiring containers to have exact same environment? #159

Closed
ericmjl opened this issue Jun 28, 2019 · 2 comments

Comments

@ericmjl

ericmjl commented Jun 28, 2019

@mrocklin, I just came across #84 after trying Dask on Kubernetes with a colleague this Wednesday, and I ran into one thing that I'm not sure whether to consider a limitation or a feature.

I noticed that the containers Kubernetes manages have to match the environment of the Jupyter notebook; otherwise, code from the notebook will not execute on the remote workers. I think this poses a problem for the "JLab on laptop, execute on remote cluster" use case.

I see the problem showing up in two places.

In the first case, problems arise because I actively develop custom source code in a per-project Python package. I consider this good practice, because it lets me reuse code across notebooks and write proper unit tests for the things that need them. However, if I develop on my laptop and don't work in a container, local changes are not reflected on the remote workers. The alternative is to work inside a container, but that means rebuilding the container and shipping it to the remote workers every time I change the code base and want to test-drive it on them.

In the second case, if I decide on the fly that I need a new package and install it into my conda environment, code that uses the new package will not execute on the remote workers until I rebuild and re-ship the container image to them.

I'm wondering if it might be possible to package up every function that is used (and its dependencies) and submit them to the Dask cluster, rather than requiring the worker containers to be identical to the Jupyter server's compute environment. Or am I missing something in my mental model here?
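To make the failure mode concrete, here is a minimal sketch of the setup I mean; the package name `my_project`, the `featurize` function, and the `worker-spec.yaml` file are hypothetical stand-ins, not code from this issue:

```python
# Hypothetical sketch of the failing setup: a function imported from a local,
# actively developed package is submitted to a dask-kubernetes cluster whose
# worker image was built before the latest local edits (or lacks the package).
from dask.distributed import Client
from dask_kubernetes import KubeCluster

from my_project.features import featurize  # local package, edited on the laptop

cluster = KubeCluster.from_yaml("worker-spec.yaml")  # workers run a fixed image
client = Client(cluster)

# The task is pickled by reference to my_project.features.featurize, so the
# worker has to `import my_project` when it deserializes the task. If the
# container image doesn't have my_project (or has a stale copy), this fails
# with ModuleNotFoundError or runs the old code.
future = client.submit(featurize, "data.csv")
print(future.result())
```

(As far as I understand, functions defined interactively in the notebook are shipped by value, but anything imported from a package is looked up by module name on the worker, which is why the image has to carry the same code.)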

@mrocklin
Member

The environments don't need to be identical, but any function you serialize on the client side will need to deserialize on the worker. So if you've installed scikit-learn version X on your client and call a function defined in that library, then scikit-learn version X will also have to be on the worker when we deserialize that function. Moving libraries around like that is out of scope for Dask.

But, for example, Jupyter itself doesn't need to be on the workers, unless you plan to send along a Jupyter function.
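Roughly, something like this (the scheduler address and the example function below are placeholders for illustration, not code from this issue):

```python
# Illustrative sketch: the task below is serialized on the client and
# deserialized on the worker, so scikit-learn must be installed (at a
# compatible version) on both sides, while Jupyter is never referenced
# and so isn't needed on the workers.
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address

def fit_and_score(X, y):
    # Imported inside the task, so the worker needs scikit-learn to run this.
    from sklearn.linear_model import LogisticRegression
    return LogisticRegression().fit(X, y).score(X, y)

future = client.submit(fit_and_score, [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
print(future.result())
```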

@ericmjl
Author

ericmjl commented Jun 28, 2019

Ok, thanks for clarifying, @mrocklin! Going to close this issue.

ericmjl closed this as completed Jun 28, 2019