Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Handle buffer protocol objects in ensure_bytes #2969

Merged
merged 2 commits into from Aug 21, 2019

Conversation

@TomAugspurger
Copy link
Member

commented Aug 20, 2019

I'm a bit concerned that I'm just working around a problem with calling maybe_compress / ensure_bytes inside dumps. Would it be better to update ensure_bytes to handle pyarrow Buffer objects? For pyarrow, this would be something like

if hasattr(s, 'to_pybytes'):
    return s.to_pybytes()

or something like bytes(s) as the final condition?

Closes #2966

@dhirschfeld

This comment has been minimized.

Copy link
Contributor

commented Aug 20, 2019

The returned header = [buf] is actually picklable so it's somewhat icky that it would blow up elsewhere:

In [21]: sink
Out[21]: <pyarrow.lib.BufferOutputStream at 0x2c85aee33b0>

In [22]: len(pickle.dumps([sink.getvalue()]))
Out[22]: 1961

In [24]: len(distributed.protocol.dumps([sink.getvalue()]))
distributed.protocol.core - CRITICAL - Failed to Serialize
<snip>
TypeError: can not serialize 'pyarrow.lib.Buffer' object

That said, the workaround is straightforward (don't return complex objects in the header) so I I'm 👍 for this fix.

@dhirschfeld

This comment has been minimized.

Copy link
Contributor

commented Aug 20, 2019

For pyarrow, this would be something like:

if hasattr(s, 'to_pybytes'):
    return s.to_pybytes()

Adding an arrow-specific workaround is a slippery slope so I think fixing it in the serialize_table
function as is done here is better.

@TomAugspurger

This comment has been minimized.

Copy link
Member Author

commented Aug 21, 2019

Adding an arrow-specific workaround is a slippery slope

Yeah, we certainly don't want that. But my hope is that bytes(thing) in ensure_bytes is generic enough. It should work for anything implementing the buffer protocol.

@TomAugspurger TomAugspurger changed the title BUG: Handle pyarrow Buffer when serializing Table BUG: Handle buffer protocol objects in ensure_bytes Aug 21, 2019

@TomAugspurger

This comment has been minimized.

Copy link
Member Author

commented Aug 21, 2019

Updated to fix the actual problem in ensure_bytes I think. I'll followup with a PR to Dask's version if this seems reasonable.

@dhirschfeld dhirschfeld referenced this pull request Aug 21, 2019

@mrocklin mrocklin merged commit 4ae0271 into dask:master Aug 21, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.