Supporting out-of-band buffers with pickle protocol 5 #5472

Open
jakirkham opened this issue May 21, 2020 · 3 comments
Labels
enhancement (Feature requests and improvements), feat / serialize (Feature: Serialization, saving and loading), help wanted (Contributions welcome!)

Comments

@jakirkham

jakirkham commented May 21, 2020

Feature description

Typically, pickling in Python creates one large bytes object with types, functions, and data all packed in to allow easy reconstruction later. Originally, pickling was focused on reading from and writing to disk. These days, however, it is increasingly used as a serialization protocol for objects on the wire. In that case, the copies of data required to put everything into a single bytes object hurt performance and don't offer much (as the data could be shipped along in separate buffers without copying).

For these reasons, Python added support for out-of-band buffers in pickle, which allows the user to flag buffers of data for pickle to extract and send alongside the typical bytes object (thus avoiding unneeded copying of data). This was submitted and accepted as PEP 574 and is part of Python 3.8 (along with a backport package for Python 3.5, 3.6, and 3.7). On the implementation side this just comes down to implementing __reduce_ex__ instead of __reduce__ (basically the same with a protocol version argument) and placing any bytes-like data (like NumPy arrays and memoryviews) into PickleBuffer objects. For older pickle protocols this step can simply be skipped. Here's an example. The rest is on libraries using protocol 5 (like Dask) to implement and use.
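To make the `__reduce_ex__` / `PickleBuffer` pattern concrete, here is a minimal sketch modeled on the example in the Python `pickle` documentation. The `Payload` class and its `_rebuild` helper are hypothetical names made up for illustration, not spaCy code, and it assumes Python 3.8+ so that `PickleBuffer` is in the standard library.

```python
import pickle
from pickle import PickleBuffer  # standard library on Python 3.8+


class Payload:
    """Hypothetical object that owns one large, contiguous buffer."""

    def __init__(self, data):
        self.data = bytearray(data)

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            # Protocol 5: wrap the buffer so pickle may ship it out-of-band.
            return type(self)._rebuild, (PickleBuffer(self.data),)
        # Older protocols reject PickleBuffer, so fall back to an in-band copy.
        return type(self)._rebuild, (bytes(self.data),)

    @classmethod
    def _rebuild(cls, buf):
        obj = cls.__new__(cls)
        # memoryview(...).obj recovers the object that actually owns the memory.
        with memoryview(buf) as view:
            owner = view.obj
        # Adopt a bytearray as-is; copy anything else (e.g. read-only bytes).
        obj.data = owner if isinstance(owner, bytearray) else bytearray(owner)
        return obj


original = Payload(b"x" * 1_000_000)

# buffer_callback collects the buffers instead of embedding them in the stream.
buffers = []
meta = pickle.dumps(original, protocol=5, buffer_callback=buffers.append)

# The consumer passes the same buffers back, in the same order.
restored = pickle.loads(meta, buffers=buffers)

# Protocol 4 still works through the in-band fallback branch.
legacy = pickle.loads(pickle.dumps(original, protocol=4))
```

Here `meta` holds only the small metadata stream, while the megabyte of data stays in `buffers`. Within a single process the restored object ends up sharing the original `bytearray`; a library like Dask would instead send each buffer over the wire as its own frame.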

Could the feature be a custom component or spaCy plugin?

If so, we will tag it as project idea so other users can take it on.


I don't think so as this relies on changing the pickle implementations of spaCy objects. Though I could be wrong :)

@jakirkham
Author

I should add that this would only be needed on objects that hold data that could be better handled out-of-band. Objects that don't own data directly themselves wouldn't need this. Also, NumPy arrays already support this behavior.

@adrianeboyd added the enhancement and feat / serialize labels on May 21, 2020
@honnibal
Member

Oh thanks for explaining this! I didn't know about it. I've definitely been frustrated by Pickle before.

I think there should be a way to do this cleverly if we add support in preshed as well. I'm very keen to have this project move forward but I don't have bandwidth for it myself. I'd love for someone to take this on.

@jakirkham
Author

Of course! It's a pretty new feature and maybe not as widely known. Know the feeling. Out-of-band pickling should help.

There are some clever ways to make the change simpler still. For example, since NumPy arrays already support out-of-band pickling, if __reduce__ or __getstate__ methods return NumPy arrays (or can be tweaked to do so), things mostly just work. We had this observation with Pandas recently (pandas-dev/pandas#34244).
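A rough sketch of why this "mostly just works": pickle recurses into whatever `__getstate__` returns, so any nested object that already knows about protocol 5 gets its buffer extracted automatically. To keep the sketch free of third-party dependencies, the hypothetical `Chunk` class below stands in for a NumPy array, and `Table` is an invented container that has no protocol 5 code of its own.

```python
import pickle
from pickle import PickleBuffer  # standard library on Python 3.8+


class Chunk:
    """Stand-in for a NumPy array: a buffer owner that supports protocol 5."""

    def __init__(self, data):
        self.data = bytearray(data)

    def __reduce_ex__(self, protocol):
        if protocol >= 5:
            return Chunk, (PickleBuffer(self.data),)
        return Chunk, (bytes(self.data),)


class Table:
    """Hypothetical container with an ordinary __getstate__ and nothing else."""

    def __init__(self, name, values):
        self.name = name
        self.values = Chunk(values)

    def __getstate__(self):
        # Returning the buffer-owning object itself (not a bytes copy of it)
        # is what lets pickle route its data out-of-band.
        return {"name": self.name, "values": self.values}

    def __setstate__(self, state):
        self.__dict__.update(state)


table = Table("vectors", b"\x00" * 4096)

buffers = []
meta = pickle.dumps(table, protocol=5, buffer_callback=buffers.append)
restored = pickle.loads(meta, buffers=buffers)
```

The container never mentions `PickleBuffer`, yet its 4096 bytes land in `buffers` rather than in `meta`. The design point is that state should be returned as buffer-aware objects instead of being flattened to `bytes` first.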
