
[New feature] full code pickle for arbitrary modules #206

Closed
crusaderky opened this issue Sep 23, 2018 · 17 comments · Fixed by #417

@crusaderky

crusaderky commented Sep 23, 2018

I have two problems with cloudpickle today:

Python modules outside of PYTHONPATH

I develop a high-level framework and allow the end user to inject plugins: arbitrary .py files containing subclasses of classes defined by my framework. These Python files are shipped by the user along with the data, that is, outside of the PYTHONPATH, and I load them with importlib. After all objects are loaded, I pickle the state of the whole framework for quick recovery and forensics. The problem is that, when loading such pickles, I have no idea where to find the user-provided Python files, so the unpickling fails. The workaround, a sidecar config file that lists all paths that need to be added to the PYTHONPATH before unpickling, is ugly at best.
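
For concreteness, this is roughly how the plugins get loaded today (the file path and module name below are made up for illustration):

import importlib.util

# Load a user-provided plugin that lives outside of the PYTHONPATH
# ("/data/plugins/my_plugin.py" and "my_plugin" are hypothetical names).
spec = importlib.util.spec_from_file_location("my_plugin", "/data/plugins/my_plugin.py")
plugin = importlib.util.module_from_spec(spec)
spec.loader.exec_module(plugin)

# Instances of the classes defined in `plugin` end up inside the framework
# state that later gets pickled for recovery and forensics.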

dask distributed on docker

A typical devops design for dask distributed is to have:

  • A generic docker image for the distributed scheduler (conda install distributed nomkl bokeh)
  • An application-specific docker image for the client
  • An application-specific docker image for the workers

The last point is a pain, particularly in the development phase, as that image needs to contain all user-specific modules.
I can see that the distributed developers have already noticed this problem, which is (I guess) why cloudpickle is rigged to fully serialise the code of any function or class that has been defined on the fly in a Jupyter notebook. This is great for the very first stage of prototyping, but it doesn't help once part of the user's development-phase code lives in custom Python files - which is the norm on any project that is not completely brand new.

It would be awesome if one could develop a whole application against a barebones, general-purpose docker worker (e.g. conda install dask distributed numpy scipy), which would just work(tm) as long as no custom C extensions are introduced.

Proposed design

I'm proposing two changes to address these problems:

  1. in the cloudpickle package, expose a set of packages whose contents must be pickled completely. This would replace the hardcoded __main__ references. This set would default to __main__ only, and users would be able to add their own packages.
    It should not be necessary to specify individual subpackages - e.g. to enable full serialisation for foo.bar, adding foo to the set should be enough.
  2. in the distributed package, add a config switch that specifies the order of pickle-like modules to try. This solves performance issues too (see Everything is pickled twice dask/distributed#1371). This would default to the current pickle -> cloudpickle, but a user might want to disable pickle and go straight to cloudpickle. This would also allow using custom modules.

So for example, in the first lines of my own code, I would have:

import cloudpickle
cloudpickle.full_pickle_packages.add('mypackage')

And in .dask/config.yaml:

# Order of pickle-like modules to be tried in sequence to serialize objects
pickle:
    # - pickle  # need to disable this to allow for full serialisation of mypackage
    - cloudpickle

With these changes, mypackage would only need to exist on the client side, not on the workers.
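
For illustration only, the top-level package matching I have in mind could look roughly like this (full_pickle_packages and _should_pickle_by_value are names I'm making up here, not existing cloudpickle API):

# Hypothetical registry; defaults to __main__ only.
full_pickle_packages = {"__main__"}

def _should_pickle_by_value(obj):
    """Return True if obj was defined under a registered top-level package."""
    module_name = getattr(obj, "__module__", None) or "__main__"
    return module_name.split(".")[0] in full_pickle_packages

# After full_pickle_packages.add("mypackage"), anything defined in mypackage
# or in any of its subpackages (e.g. mypackage.bar) would be fully serialised.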

Opinions / suggestions?

@crusaderky crusaderky changed the title from "[New feature] pickle code for arbitrary modules" to "[New feature] full code pickle for arbitrary modules" on Sep 23, 2018
@ogrisel
Contributor

ogrisel commented Sep 24, 2018

This would replace the hardcoded __main__ references. This set would default to __main__ only, and users would be able to add their own packages.

As far as I understand, __main__ is currently not pickled entirely ahead of time: only the dependencies of a given object whose class / function is defined in __main__ are pickled on the fly. The goal is to keep the payload as lightweight as possible; see the following non-regression test recently merged into master: https://github.com/cloudpipe/cloudpickle/pull/204/files.
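
A toy illustration of that behaviour (my own example, not taken from the test linked above): when a function defined in __main__ is pickled, only its code and the globals it actually references travel with it.

import pickle
import cloudpickle

SCALE = 2  # module-level global in __main__

def double(x):
    return SCALE * x

# Only double's code and the globals it references (here SCALE) end up in
# the payload; the rest of __main__ is not serialised.
payload = cloudpickle.dumps(double)
assert pickle.loads(payload)(21) == 42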

In your case, I think you want to proactively pickle and import the full module at once, instead of incrementally pickling new class and function definitions as needed on the fly, though I am not sure.

@ogrisel
Contributor

ogrisel commented Sep 24, 2018

I am wondering if your use case would not be better handled via shared filesystem to share the content of a development folder both on the client and worker containers.

Reloading modules when the shared code is edited would still be tricky: one would have to restart all the workers to be safe but this is already tricky in jupyter on a single host (e.g. via the hackish %%autoreload magic).

@crusaderky
Author

crusaderky commented Sep 24, 2018

@ogrisel I don't see any benefit in aggressively pickling my whole package. The current policy of pickling just what's referenced looks fine to me.

@ogrisel
Contributor

ogrisel commented Sep 27, 2018

I suggest you implement the proposed feature, try to use it in a dask/distributed context, and then open a PR to report back whether it works for your use case.

In particular, I am curious about the interactive development setting when you edit the code in the registered packages while instances of classes of that package are still alive in remote workers on the cluster. We also recently "fixed" the behavior of functions that access shared global variables in dynamic modules: https://github.com/cloudpipe/cloudpickle/pull/205/files . You might want to have a look at the newly added test to ensure that semantics are in line with user expectations.

It would also be interesting to see if it's possible to add support for compiled extensions (e.g. Cython modules) for the same scenario, but maybe this can be handled as a separate PR.

@syagev

syagev commented Nov 20, 2019

@crusaderky I think your description fits a very much needed scenario. I'm interested to hear what approach you eventually chose to address the challenge. Do you simply make the code available to the workers via a copy or a shared file system?

@crusaderky
Author

@syagev I hacked my way around it:

class C:
    pass

C.__module__ = "__main__" # fool cloudpickle into fully serializing the code
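
For anyone copying this: the same trick can be generalised to every class of an already-imported plugin module. This is an untested sketch and the helper name is made up:

import inspect

def force_pickle_by_value(module):
    # Relabel all classes defined in `module` so that cloudpickle cannot
    # look them up by reference and falls back to serialising their code.
    for obj in vars(module).values():
        if inspect.isclass(obj) and obj.__module__ == module.__name__:
            obj.__module__ = "__main__"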

@kinghuang

I'm also running into this issue with some custom code in a module and Dask. Perhaps cloudpickle could add an option to include a specific list of extra modules for full pickling, beyond just __main__?

@cyruoff

cyruoff commented Apr 30, 2020

I would also greatly appreciate such a feature for deploying self-contained models.

@kinghuang

kinghuang commented Jul 8, 2020

I've opened PR #391 to add a dynamic modules option. It's based on a runtime patch on cloudpickle that I've been using for a couple of months.

@bonelli

bonelli commented Nov 15, 2020

I've tried PR #391 for the very same use case, and on a first test it works like a charm.

Samreay pushed a commit to Samreay/cloudpickle that referenced this issue Apr 20, 2021
Adding tested deep serialisation option.

Based on PR391 by kinghuang, but taking on
feedback from maintainers and adding tests.
Samreay pushed a commit to Samreay/cloudpickle that referenced this issue Jun 15, 2021
Adding tested deep serialisation option.

Based on PR391 by kinghuang, but taking on
feedback from maintainers and adding tests.
@ogrisel
Contributor

ogrisel commented Jun 22, 2021

@bonelli @kinghuang @cyruoff @crusaderky @Samreay it would be great if you could confirm whether the work done in #417 solves this issue in your usual workflow.

@ogrisel
Contributor

ogrisel commented Jun 22, 2021

ping also @jrbourbeau.

@bonelli

bonelli commented Jun 22, 2021

@bonelli @kinghuang @cyruoff @crusaderky @Samreay it would be great if you could confirm whether the work done in #417 solves this issue in your usual workflow.

From the comments on PR #417 it looks like this should fit my use case. To be sure, I would have to update our code base, which at the moment uses the last commit on PR #391.

Anyhow, my use case is exactly one in which multiple different clients connect to a single k8s/docker dask cluster (with an uncustomized scheduler and uncustomized workers). All clients bring their own modules, sometimes different versions of the same modules, and they should all run correctly at the same time on the same dask cluster.

@jrbourbeau
Member

jrbourbeau commented Jun 22, 2021

@ogrisel I just tried out #417 and it indeed enables Dask users to run tasks that use Python modules present on the client machine but not on the worker machines.

EDIT: I should note that I was using a modified version of distributed when experimenting with #417. Some changes to distributed would be needed to fully enable this.

@Samreay
Contributor

Samreay commented Jun 29, 2021

This might be an obvious point, but I can confirm that this helps with saving custom mlflow models that have functional dependencies in external modules.

For anyone looking to test it, the pip command to install cloudpickle as if #417 were merged is:

pip install git+https://github.com/cloudpipe/cloudpickle.git@refs/pull/417/merge
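
If I read the PR correctly, usage then looks roughly like this (take register_pickle_by_value / unregister_pickle_by_value as my reading of #417, and mymodule as a placeholder for your own client-side module):

import cloudpickle
import mymodule  # only installed on the client machine

# Ask cloudpickle to serialise mymodule's functions and classes by value
# instead of by reference, so the workers don't need mymodule installed.
cloudpickle.register_pickle_by_value(mymodule)

payload = cloudpickle.dumps(mymodule.some_function)

# cloudpickle.unregister_pickle_by_value(mymodule) restores the default behaviour.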

@ogrisel
Contributor

ogrisel commented Jun 30, 2021

All clients bring their own modules, sometimes different versions of the same modules, and they should all run correctly at the same time on the same dask cluster.

Interesting. I believe #417 should work, because it does not mutate sys.modules when you pickle functions/objects from modules registered to be pickled by value. However, we do not have explicit tests for this kind of isolation. Right now it's kind of implicit, and we would not be able to detect a regression with the current test suite...

@Samreay
Contributor

Samreay commented Jun 30, 2021

I've simulated a difference in environments in a new unit test and added that to the PR :)
