
[New feature] full code pickle for arbitrary modules #206

Closed
crusaderky opened this issue Sep 23, 2018 · 17 comments · Fixed by #417

@crusaderky

crusaderky commented Sep 23, 2018

I have two problems with cloudpickle today:

Python modules outside of PYTHONPATH

I develop a high-level framework and allow the end user to inject plugins: arbitrary .py files containing subclasses of classes defined by my framework. These Python files are shipped by the user along with the data, that is, outside of the PYTHONPATH, and I load them with importlib. After all objects are loaded, I pickle the state of the whole framework for quick recovery and forensics. The problem is that, when loading such pickles, I have no idea where to find the user-provided Python files, so the unpickling fails. The workaround, a sidecar config file that lists all paths that need to be added to the PYTHONPATH before unpickling, is ugly at best.
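
For concreteness, this is roughly how the plugins get loaded today (the file path and module name below are made up for illustration):

import importlib.util

# Load a user-provided plugin that lives outside of the PYTHONPATH
# ("/data/plugins/my_plugin.py" and "my_plugin" are hypothetical names).
spec = importlib.util.spec_from_file_location("my_plugin", "/data/plugins/my_plugin.py")
plugin = importlib.util.module_from_spec(spec)
spec.loader.exec_module(plugin)

# Instances of the classes defined in `plugin` end up inside the framework
# state that later gets pickled for recovery and forensics.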

dask distributed on docker

A typical devops design for dask distributed is to have:

  • A generic docker image for the distributed scheduler (conda install distributed nomkl bokeh)
  • An application-specific docker image for the client
  • An application-specific docker image for the workers

The last point is a pain, particularly in the development phase, as that image needs to contain all user-specific modules.
I can see that the distributed developers have already noticed this problem, which is (I guess) why cloudpickle is rigged to fully serialise the code of any function or class that has been defined on the fly in a Jupyter notebook. This is great for the very first stage of prototyping, but it doesn't help once part of the user's development-phase code lives in custom Python files - which is the norm on any project that is not completely brand new.

It would be awesome if one could develop a whole application against a barebones, general-purpose docker worker (e.g. conda install dask distributed numpy scipy), which would just work(tm) as long as no custom C extensions are introduced.

Proposed design

I'm proposing two changes to address these problems:

  1. in the cloudpickle package, expose a set of packages whose contents must be pickled completely. This would replace the hardcoded __main__ references. This set would default to __main__ only, and users would be able to add their own packages.
    It should not be necessary to specify individual subpackages - e.g. to enable full serialisation for foo.bar, adding foo to the set should be enough.
  2. in the distributed package, add a config switch that specifies the order of pickle-like modules to try. This solves performance issues too (see Everything is pickled twice dask/distributed#1371). This would default to the current pickle -> cloudpickle, but a user might want to disable pickle and go straight to cloudpickle. This would also allow using custom modules.

So for example, in the first lines of my own code, I would have:

import cloudpickle
cloudpickle.full_pickle_packages.add('mypackage')

And in .dask/config.yaml:

# Order of pickle-like modules to be tried in sequence to serialize objects
pickle:
    # - pickle  # need to disable this to allow for full serialisation of mypackage
    - cloudpickle

With these changes, mypackage would only need to exist on the client side, not on the workers.
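
For illustration only, the top-level package matching I have in mind could look roughly like this (full_pickle_packages and _should_pickle_by_value are names I'm making up here, not existing cloudpickle API):

# Hypothetical registry; defaults to __main__ only.
full_pickle_packages = {"__main__"}

def _should_pickle_by_value(obj):
    """Return True if obj was defined under a registered top-level package."""
    module_name = getattr(obj, "__module__", None) or "__main__"
    return module_name.split(".")[0] in full_pickle_packages

# After full_pickle_packages.add("mypackage"), anything defined in mypackage
# or in any of its subpackages (e.g. mypackage.bar) would be fully serialised.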

Opinions / suggestions?

@crusaderky crusaderky changed the title from "[New feature] pickle code for arbitrary modules" to "[New feature] full code pickle for arbitrary modules" on Sep 23, 2018
@ogrisel
Contributor

ogrisel commented Sep 24, 2018

This would replace the hardcoded __main__ references. This set would default to __main__ only, and users would be able to add their own packages.

As far as I understand, __main__ is currently not pickled entirely ahead of time: only the dependencies of a given object whose class / function is defined in __main__ are pickled on the fly. The goal is to keep the payload as lightweight as possible; see the following non-regression test recently merged into master: https://github.com/cloudpipe/cloudpickle/pull/204/files.
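
A toy illustration of that behaviour (my own example, not taken from the test linked above): when a function defined in __main__ is pickled, only its code and the globals it actually references travel with it.

import pickle
import cloudpickle

SCALE = 2  # module-level global in __main__

def double(x):
    return SCALE * x

# Only double's code and the globals it references (here SCALE) end up in
# the payload; the rest of __main__ is not serialised.
payload = cloudpickle.dumps(double)
assert pickle.loads(payload)(21) == 42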

In your case, I think you want to proactively pickle and import the full module at once, instead of incrementally pickling new class and function definitions as needed on the fly, though I am not sure.

@ogrisel
Contributor

ogrisel commented Sep 24, 2018

I am wondering if your use case would not be better handled via shared filesystem to share the content of a development folder both on the client and worker containers.

Reloading modules when the shared code is edited would still be tricky: one would have to restart all the workers to be safe but this is already tricky in jupyter on a single host (e.g. via the hackish %%autoreload magic).

@crusaderky
Author

crusaderky commented Sep 24, 2018

@ogrisel I don't see any benefit in aggressively pickling my whole package. The current policy of pickling just what's referenced looks fine to me.

@ogrisel
Contributor

ogrisel commented Sep 27, 2018

I suggest you implement the proposed feature, try to use it in a dask/distributed context, and then open a PR to report back whether it works for your use case.

In particular, I am curious about the interactive development setting when you edit the code in the registered packages while instances of classes of that package are still alive in remote workers on the cluster. We also recently "fixed" the behavior of functions that access shared global variables in dynamic modules: https://github.com/cloudpipe/cloudpickle/pull/205/files . You might want to have a look at the newly added test to ensure that semantics are in line with user expectations.

It would also be interesting to see if it's possible to add support for compiled extensions (e.g. Cython modules) for the same scenario, but maybe this can be handled as a separate PR.

@syagev

syagev commented Nov 20, 2019

@crusaderky I think your description fits a very much needed scenario. I'm interested to hear what approach you eventually chose to address the challenge. Do you simply make the code available to the workers via a copy or a shared file system?

@crusaderky
Author

@syagev I hacked my way around it:

class C:
    pass

C.__module__ = "__main__" # fool cloudpickle into fully serializing the code
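
For anyone copying this: the same trick can be generalised to every class of an already-imported plugin module. This is an untested sketch and the helper name is made up:

import inspect

def force_pickle_by_value(module):
    # Relabel all classes defined in `module` so that cloudpickle cannot
    # look them up by reference and falls back to serialising their code.
    for obj in vars(module).values():
        if inspect.isclass(obj) and obj.__module__ == module.__name__:
            obj.__module__ = "__main__"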

@kinghuang

I'm also running into this issue with some custom code in a module and Dask. Perhaps cloudpickle could add an option to include a specific list of extra modules for full pickling, beyond just __main__?

@cyruoff

cyruoff commented Apr 30, 2020

I would also greatly appreciate such a feature for deploying self-contained models.

@kinghuang

kinghuang commented Jul 8, 2020

I've opened PR #391 to add a dynamic modules option. It's based on a runtime patch on cloudpickle that I've been using for a couple of months.

@bonelli

bonelli commented Nov 15, 2020

I've tried PR #391 for the very same use case, and on a first test it works like a charm.

Samreay pushed a commit to Samreay/cloudpickle that referenced this issue Apr 20, 2021
Adding tested deep serialisation option.

Based on PR391 by kinghuang, but taking on
feedback from maintainers and adding tests.
Samreay pushed a commit to Samreay/cloudpickle that referenced this issue Jun 15, 2021
Adding tested deep serialisation option.

Based on PR391 by kinghuang, but taking on
feedback from maintainers and adding tests.
@ogrisel
Contributor

ogrisel commented Jun 22, 2021

@bonelli @kinghuang @cyruoff @crusaderky @Samreay it would be great if you could confirm whether the work done in #417 solves this issue in your usual workflow.

@ogrisel
Contributor

ogrisel commented Jun 22, 2021

ping also @jrbourbeau.

@bonelli

bonelli commented Jun 22, 2021

@bonelli @kinghuang @cyruoff @crusaderky @Samreay it would be great if you could confirm whether the work done in #417 solves this issue in your usual workflow.

From the comments on PR #417 it looks like this should fit my use case. To be sure, I would have to update our code base, which at the moment uses the last commit on PR #391.

Anyhow, my use case is exactly one in which multiple different clients connect to a single k8s/docker dask cluster (with an uncustomized scheduler and uncustomized workers). All clients bring their own modules, sometimes different versions of the same modules, and they should all run correctly at the same time on the same dask cluster.

@jrbourbeau
Member

jrbourbeau commented Jun 22, 2021

@ogrisel I just tried out #417 and it indeed enables Dask users to run tasks that use Python modules present on the client machine but not on the worker machines.

EDIT: I should note that I was using a modified version of distributed when experimenting with #417. Some changes to distributed would be needed to fully enable this.

@Samreay
Contributor

Samreay commented Jun 29, 2021

This might be an obvious point, but I can confirm that this helps with saving custom mlflow models that have functional dependencies in external modules.

For anyone looking to test it, the pip command to install cloudpickle as if #417 were merged is:

pip install git+https://github.com/cloudpipe/cloudpickle.git@refs/pull/417/merge
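
If I read the PR correctly, usage then looks roughly like this (take register_pickle_by_value / unregister_pickle_by_value as my reading of #417, and mymodule as a placeholder for your own client-side module):

import cloudpickle
import mymodule  # only installed on the client machine

# Ask cloudpickle to serialise mymodule's functions and classes by value
# instead of by reference, so the workers don't need mymodule installed.
cloudpickle.register_pickle_by_value(mymodule)

payload = cloudpickle.dumps(mymodule.some_function)

# cloudpickle.unregister_pickle_by_value(mymodule) restores the default behaviour.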

@ogrisel
Contributor

ogrisel commented Jun 30, 2021

All clients bring their own modules, sometimes different versions of the same modules, and they should all run correctly at the same time on the same dask cluster.

Interesting. I believe #417 should work, because it does not mutate sys.modules when you pickle functions/objects from modules registered to be pickled by value. However, we do not have explicit tests for this kind of isolation. Right now it's kind of implicit, and we would not be able to detect a regression with the current test suite...

@Samreay
Contributor

Samreay commented Jun 30, 2021

I've simulated a difference in environments in a new unit test and added that to the PR :)
