[Python] Deduplicate non-scalar Python objects when using pyarrow.serialize #17411
Comments
Robert Nishihara / @robertnishihara:
import pyarrow as pa
l = []
original_object = l.append(l)
# Serialize the object. This fails.
pa.serialize(original_object)
The error is
ArrowException: Unknown error: 'NoneType' object is not iterable
The error really should be
ArrowNotImplementedError: This object exceeds the maximum recursion depth. It may contain itself recursively.
That's the error you get if you run the following:
import pyarrow as pa
l1 = []
l2 = []
l1.append(l2)
l2.append(l1)
# This fails.
pa.serialize(l1)
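The self-referential cases above could be caught before serialization starts. The following is a minimal sketch using only the standard library; `contains_cycle` is a hypothetical helper, not part of pyarrow, and it assumes only lists, tuples, and dicts need to be inspected.

```python
def contains_cycle(obj, _path=None):
    """Return True if a list/tuple/dict nest contains itself, directly or indirectly.

    Hypothetical helper for illustration; not a pyarrow API.
    """
    if _path is None:
        _path = set()  # ids of the containers on the current descent path
    if not isinstance(obj, (list, tuple, dict)):
        return False  # scalars and other objects are treated as leaves
    if id(obj) in _path:
        return True  # we reached a container that is its own ancestor
    _path.add(id(obj))
    # Dict keys must be hashable, so only the values can lead back to a container.
    children = obj.values() if isinstance(obj, dict) else obj
    try:
        return any(contains_cycle(child, _path) for child in children)
    finally:
        # Leaving this container: a later sibling may reference it without a cycle.
        _path.discard(id(obj))


# A list that contains itself.
l = []
l.append(l)

# Two lists that contain each other.
l1 = []
l2 = []
l1.append(l2)
l2.append(l1)

# A shared but non-recursive object is not a cycle.
shared = [1, 2, 3]
no_cycle = [shared, shared]
```

Tracking only the current path, rather than every object seen so far, keeps repeated references such as `[shared, shared]` from being reported as cycles.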
If a Python object appears multiple times within a list/tuple/dictionary, then when pyarrow serializes the object, it will duplicate the object many times. This leads to a potentially huge expansion in the size of the object (e.g., the serialized version of 100 * [np.zeros(10 ** 6)] will be 100 times bigger than it needs to be).
One potential way to address this is to use the Arrow dictionary encoding.
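As a rough illustration of the dictionary-encoding idea, the sketch below stores each distinct object once and replaces every occurrence with a small integer index. It is plain Python, not pyarrow's implementation: `dedup_encode` and `dedup_decode` are hypothetical names, deduplication is by object identity rather than equality, only the top-level list is handled, and a `bytes` object stands in for the large NumPy array.

```python
def dedup_encode(items):
    """Split `items` into (unique_values, indices), deduplicating by identity.

    Hypothetical helper for illustration; not a pyarrow API.
    """
    unique_values = []  # each distinct object is stored exactly once
    index_of = {}       # id(obj) -> position in unique_values
    indices = []        # one small integer per original element
    for obj in items:
        key = id(obj)   # stable here because `items` keeps every object alive
        if key not in index_of:
            index_of[key] = len(unique_values)
            unique_values.append(obj)
        indices.append(index_of[key])
    return unique_values, indices


def dedup_decode(unique_values, indices):
    """Rebuild the original list; repeated entries share one object again."""
    return [unique_values[i] for i in indices]


payload = bytes(10 ** 6)   # stand-in for a large array such as np.zeros(10 ** 6)
items = 100 * [payload]    # 100 references to the same object
unique_values, indices = dedup_encode(items)
restored = dedup_decode(unique_values, indices)
```

Here only one copy of the payload plus 100 indices would need to be written, instead of 100 copies of the payload.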
Reporter: Robert Nishihara / @robertnishihara
Note: This issue was originally created as ARROW-1382. Please see the migration documentation for further details.