Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-2660: [Python] Experimental zero-copy pickling #2161

Closed
wants to merge 3 commits into from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Jun 25, 2018

Zero-copy pickling of buffers and buffer-based objects will be possible using PEP 574 (if/when accepted). The PyPI backport "pickle5" helps us test that possibility.

@codecov-io
Copy link

codecov-io commented Jun 25, 2018

Codecov Report

Merging #2161 into master will increase coverage by 0.01%.
The diff coverage is 63.63%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2161      +/-   ##
==========================================
+ Coverage   84.39%   84.41%   +0.01%     
==========================================
  Files         293      293              
  Lines       44820    44841      +21     
==========================================
+ Hits        37826    37851      +25     
  Misses       6963     6963              
+ Partials       31       27       -4
Impacted Files Coverage Δ
python/pyarrow/io.pxi 60.52% <40%> (-0.04%) ⬇️
python/pyarrow/compat.py 79.25% <40%> (+0.74%) ⬆️
python/pyarrow/tests/test_array.py 98.89% <73.91%> (-1.11%) ⬇️
go/arrow/math/int64_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/uint64_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/memory/memory_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/float64_sse4_amd64.go 0% <0%> (-100%) ⬇️
go/arrow/math/uint64_amd64.go 33.33% <0%> (ø) ⬆️
go/arrow/math/int64_amd64.go 33.33% <0%> (ø) ⬆️
go/arrow/math/float64_amd64.go 33.33% <0%> (ø) ⬆️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 078b806...50f0491. Read the comment docs.

@wesm
Copy link
Member

wesm commented Jul 23, 2018

@pitrou this is cool. Do you see any reason not to rebase and merge this for 0.10.0?

@pitrou
Copy link
Member Author

pitrou commented Jul 24, 2018

There were changes to the PEP lately, I must adapt the code before :-)

Zero-copy pickling of buffers and buffer-based objects will be possible
using PEP 574 (if/when accepted).  The PyPI backport "pickle5" helps us
test that possibility.
@pitrou pitrou force-pushed the ARROW-2660-zero-copy-pickling branch from d444d1e to 50f0491 Compare July 24, 2018 14:49
@pitrou
Copy link
Member Author

pitrou commented Jul 24, 2018

I've fixed for the latest PEP updates and rebased.

@@ -133,6 +133,8 @@ popd

pushd python

pip install pickle5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This install call is redundant with the once below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, no, the second one is in a distinct virtualenv where we install the wheel we just built.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I got confused by the free-standing pip install here with no other packages. This is then just because we have no conda package for it yet?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, but it's very quick to compile anyway and there are no non-Python dependencies.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW we did add a package to conda-forge. Though it's true this is quite fast to build.

ref: https://github.com/conda-forge/pickle5-feedstock

@pitrou
Copy link
Member Author

pitrou commented Jul 25, 2018

Quick benchmark:

>>> import pickle5 as pickle
>>> import pyarrow as pa
>>> import pandas as pd
>>> df = pd.DataFrame({'ints': range(100000), 'strs': [str(i) for i in range(100000)]})
>>> table = pa.Table.from_pandas(df)

# Pickling a Pandas dataframe is slow, no difference with protocol 5
>>> %timeit pickle.loads(pickle.dumps(df))
29.1 ms ± 33.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit pickle.loads(pickle.dumps(df, protocol=5))
29.1 ms ± 43.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Pickling a Arrow table is faster
>>> %timeit pickle.loads(pickle.dumps(table))
3.33 ms ± 2.26 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# ... even faster with the new protocol 5
>>> %timeit pickle.loads(pickle.dumps(table, protocol=5))
526 µs ± 2.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# ... and even faster with zero-copy buffers
>>> %timeit buffers = []; serd = pickle.dumps(table, protocol=5, buffer_callback=buffers.append); pickle.loads(serd, buffers=buffers)
154 µs ± 152 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# There are 5 exported buffers for this table, and the serialized pickle stream is around 1kB
>>> buffers = []; serd = pickle.dumps(table, protocol=5, buffer_callback=buffers.append)
>>> buffers
[<pickle.PickleBuffer at 0x7f5a3096fac8>,
 <pickle.PickleBuffer at 0x7f5a3096f348>,
 <pickle.PickleBuffer at 0x7f5a27630948>,
 <pickle.PickleBuffer at 0x7f5a27630c48>,
 <pickle.PickleBuffer at 0x7f5a27630dc8>]
>>> len(serd)
1053

# Note that currently our exported buffers don't expose type information
>>> [(memoryview(buf).format, memoryview(buf).shape) for buf in buffers]
[('b', (800000,)),
 ('b', (12500,)),
 ('b', (400004,)),
 ('b', (488890,)),
 ('b', (800000,))]

@mrocklin

@pitrou
Copy link
Member Author

pitrou commented Jul 25, 2018

I created ARROW-2913 for the issue that exported buffers lose the data type. @mrocklin your informed opinion on that one could be useful.

(note it doesn't block this PR)

@wesm
Copy link
Member

wesm commented Jul 25, 2018

Nice and informative benchmarks. The copying of memory in unpickling with NumPy arrays etc. has been a long-standing gripe of mine

@xhochy
Copy link
Member

xhochy commented Jul 26, 2018

This looks really nice. @pitrou What is the best way to keep updated of the status of a PEP? Poll the website?

I guess, we wait with merging until the PEP is accepted?

@pitrou
Copy link
Member Author

pitrou commented Jul 26, 2018

What is the best way to keep updated of the status of a PEP? Poll the website?

If you don't want to read python-dev, then you can just do that indeed.

I guess, we wait with merging until the PEP is accepted?

Or we could merge already as @wesm proposes. The changes should be transparent if you don't use pickle5.

@pitrou pitrou changed the title [EXP] ARROW-2660: [Python] Zero-copy pickling ARROW-2660: [Python] Zero-copy pickling Jul 30, 2018
@pitrou pitrou changed the title ARROW-2660: [Python] Zero-copy pickling ARROW-2660: [Python] Experimental zero-copy pickling Jul 30, 2018
@pitrou pitrou closed this in 2422d9c Jul 30, 2018
@pitrou pitrou deleted the ARROW-2660-zero-copy-pickling branch July 30, 2018 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants