Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Deduplicate non-scalar Python objects when using pyarrow.serialize #17411

Closed
asfimport opened this issue Aug 20, 2017 · 6 comments
Closed

Comments

@asfimport
Copy link

If a Python object appears multiple times within a list/tuple/dictionary, then when pyarrow serializes the object, it will duplicate the object many times. This leads to a potentially huge expansion in the size of the object (e.g., the serialized version of 100 * [np.zeros(10 ** 6)] will be 100 times bigger than it needs to be).

import pyarrow as pa

l = [0]
original_object = [l, l]
# Serialize and deserialize the object.
buf = pa.serialize(original_object).to_buffer()
new_object = pa.deserialize(buf)
# This works.
assert original_object[0] is original_object[1]
# This fails.
assert new_object[0] is new_object[1]

One potential way to address this is to use the Arrow dictionary encoding.

Reporter: Robert Nishihara / @robertnishihara

PRs and other links:

Note: This issue was originally created as ARROW-1382. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Robert Nishihara / @robertnishihara:
A related example is an object that recursively contains itself (this example is a bit contrived, but you could imagine a graph data structure with cyclic references).

import pyarrow as pa

l = []
original_object = l.append(l)
# Serialize the object. This fails.
pa.serialize(original_object)

The pa.serialize call fails with

ArrowException: Unknown error: 'NoneType' object is not iterable

The error really should be

ArrowNotImplementedError: This object exceeds the maximum recursion depth. It may contain itself recursively.

That's the error you run the following

import pyarrow as pa

l1 = []
l2 = []
l1.append(l2)
l2.append(l1)
# This fails.
pa.serialize(l1)

@asfimport
Copy link
Author

Wes McKinney / @wesm:
Deduplication would be nice. I changed the issue title to reflect

@asfimport
Copy link
Author

Robert Nishihara / @robertnishihara:
Sounds good. This can be addressed at different levels of generality. The problem applies even beyond ndarrays, e.g., if you serialize 100 * [[]].

@asfimport
Copy link
Author

Wes McKinney / @wesm:
I changed the title to be "non-scalar Python objects". I think we can deduplicate by object id on anything that isn't a scalar value

@asfimport
Copy link
Author

Robert Nishihara / @robertnishihara:
That sounds right :)

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
PyArrow serialization is deprecated, closing as won't fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant