[New feature] full code pickle for arbitrary modules #206
Comments
As far as I understand your case, you want to proactively pickle and import the full module at once, instead of incrementally pickling new class and function definitions on the fly as needed. Although I am not sure.
I am wondering if your use case would not be better handled via a shared filesystem exposing the content of a development folder to both the client and worker containers. Reloading modules when the shared code is edited would still be tricky: one would have to restart all the workers to be safe, but this is already tricky in Jupyter on a single host (e.g. via the hackish %autoreload magic).
@ogrisel I don't see any benefit in aggressively pickling my whole package. The current policy of pickling just what's referenced looks fine to me.
I suggest you implement the suggested feature and try to use it in a dask/distributed context, then open a PR to report back if it works for your use case. In particular, I am curious about the interactive development setting when you edit the code in the registered packages while instances of classes of that package are still alive in remote workers on the cluster. We also recently "fixed" the behavior of functions that access shared global variables in dynamic modules: https://github.com/cloudpipe/cloudpickle/pull/205/files. You might want to have a look at the newly added test to ensure that semantics are in line with user expectations. It would also be interesting to see if it's possible to add support for compiled extensions (e.g. Cython modules) for the same scenario, but maybe this can be handled as a separate PR.
@crusaderky I think your description fits a very much required scenario. I'm interested to hear what methodology you eventually chose to address the challenge. Do you simply make the code available to the workers via copy or a shared file system?
@syagev I hacked my way through it with:

```python
class C:
    pass

C.__module__ = "__main__"  # fool cloudpickle into fully serializing the code
```
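A minimal sketch of why this trick works, assuming a plain cloudpickle round trip: classes whose `__module__` is `"__main__"` are pickled by value, so their definition travels with the payload instead of being looked up by import:

```python
import pickle
import cloudpickle

class C:
    pass

C.__module__ = "__main__"  # pretend the class was defined interactively

# The class definition is embedded in the payload, so the receiving
# process can unpickle it with plain pickle, without importing C's module.
payload = cloudpickle.dumps(C())
obj = pickle.loads(payload)
```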
I'm also running into this issue with some custom code in a module and Dask. Perhaps cloudpickle could add an option where a specific list of extra modules can be included for pickling beyond just `__main__`.
I would also greatly appreciate such a feature for deploying self-contained models.
I've opened PR #391 to add a dynamic modules option. It's based on a runtime patch on cloudpickle that I've been using for a couple of months.
I've tried PR #391 for the very same use case, and on a first test it works like a charm.
Adding a tested deep serialisation option. Based on PR #391 by kinghuang, but taking on feedback from maintainers and adding tests.
@bonelli @kinghuang @cyruoff @crusaderky @Samreay it would be great if you could confirm if the work done in #417 can solve this issue in your own usual workflow.
Ping also @jrbourbeau.
From the comments on PR #417 it looks like this should fit my use case; to be sure, I would have to update our code base, which at the moment uses the last commit of PR #391. Anyhow, my use case is exactly one in which multiple different clients connect to a single k8s/docker dask cluster (both uncustomized scheduler and uncustomized workers), all clients bring their own modules, sometimes different versions of the same modules, and they should all run correctly at the same time on the same dask cluster.
@ogrisel I just tried out #417, and it indeed enables Dask users to run tasks which use Python modules that are present on the client but not on the workers. EDIT: I should note that I was using a modified version of
This might be an obvious contribution, but I can confirm that this helps with saving custom mlflow models that have functional dependencies in external modules. For anyone looking to test it, cloudpickle can be installed as if #417 were merged with a pip command like the one below.
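A sketch of such a command, assuming GitHub's standard `refs/pull/<N>/head` syntax for installing an unmerged pull request:

```bash
pip install "git+https://github.com/cloudpipe/cloudpickle.git@refs/pull/417/head"
```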
Interesting. I believe #417 should work because it does not mutate
I've simulated a difference in environments in a new unit test and added that to the PR :)
I have two problems with cloudpickle today:
Python modules outside of PYTHONPATH
I develop a high-level framework and allow the end user to inject plugins (arbitrary .py files containing subclasses of classes defined by my framework). These Python files are shipped by the user along with the data, meaning outside of the PYTHONPATH, and I load them with importlib. After all objects are loaded, I pickle the state of the whole framework for quick recovery and forensics. The problem is that, when loading such pickles, I have no idea where to find the user-provided Python files, hence the unpickling fails. The workaround, implementing a sidecar config file that lists all paths that need to be added to the PYTHONPATH before unpickling, is ugly at best.
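For concreteness, a minimal sketch of this kind of importlib-based plugin loading (the function and module names are illustrative, not part of the actual framework):

```python
import importlib.util

def load_plugin(path, name="user_plugin"):
    """Load a user-provided .py file that lives outside of the PYTHONPATH."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the plugin's class definitions
    return module
```

Instances of classes defined this way pickle with a reference to the module name, which is exactly what breaks when the pickle is loaded in a process that cannot import that module.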
dask distributed on docker
A typical devops design for dask distributed is to have:

- a barebones docker image for the scheduler (e.g. `conda install distributed nomkl bokeh`)
- a docker image for the workers, with all the modules required by the client code

The last point is a pain, particularly in the development phase, as it needs to contain all user-specific modules.
I can see how the distributed developers have already noticed this problem, and this is why (I guess) cloudpickle is rigged to fully serialise the code of any function or class that has been defined on the fly in a Jupyter notebook. This is great for the very first stage of prototyping, but it doesn't help when the user has part of his development-phase code written down in custom Python files, which is the norm on any project that is not completely brand new.
It would be awesome if one could develop his whole application with a barebones, general-purpose docker worker (e.g. `conda install dask distributed numpy scipy`), which will just work(tm) as long as you don't introduce custom C extensions.

Proposed design
I'm proposing two changes to address these problems:
Replace the hardcoded `__main__` references with a configurable set of modules. This set would default to `__main__` only, and the user would be able to add his own packages. It should not be needed to specify individual subpackages; e.g. to enable full serialisation for `foo.bar`, I just need to add `foo` to the set.

So for example, in the first lines of my own code, I would have:
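Something along these lines; the exact registration call below is illustrative of the proposal, not a settled API (the API that eventually landed via PR #417 is `register_pickle_by_value`):

```python
import cloudpickle
import mypackage  # a package that exists only on the client side

# Register the whole package for serialisation by value, so that its
# code travels inside the pickle instead of being re-imported by name
# on the worker.
cloudpickle.register_pickle_by_value(mypackage)
```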
And in `.dask/config.yaml`:
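A hypothetical equivalent entry in the dask config; the key names are made up for illustration, since the proposal only sketches the idea of exposing the same setting cluster-wide:

```yaml
# illustrative keys: packages whose code should be pickled in full
distributed:
  cloudpickle:
    serialize-by-value:
      - mypackage
```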
With these changes, `mypackage` would need to exist exclusively on the client side and not on the worker.

Opinions / suggestions?