ARROW-1382: [Python] Raise an exception when serializing objects containing multiple copies of the same object. #2859
Conversation
cc @pcmoritz
This needs to be tried out and tested in more scenarios before merging.
If we want to merge something like this, I'd say it should be behind a flag which is off by default. An alternative approach that could be on by default (and which I can look into) is, as mentioned, to use dictionary encoding. We would use something similar to this to detect duplicate objects and collect them in a first pass. In a second pass we would store them in an array at the end of the RecordBatch and then replace the occurrences with indices into that array.
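[Editor's note] The two-pass idea above can be sketched in pure Python. This is a hypothetical illustration, not Arrow code; the names collect_duplicates and dictionary_encode are invented, and identity tracking via id() has the usual caveat that interned values (small ints, short strings) can alias.

```python
def collect_duplicates(obj, seen=None, dups=None):
    """First pass: find sub-objects reachable more than once (by identity).

    Caveat: id()-based tracking can flag interned values (small ints,
    short strings) that merely happen to share identity.
    """
    if seen is None:
        seen, dups = set(), {}
    if isinstance(obj, (list, tuple)):
        for child in obj:
            if id(child) in seen:
                dups.setdefault(id(child), child)
            else:
                seen.add(id(child))
                collect_duplicates(child, seen, dups)
    return dups


def dictionary_encode(obj):
    """Second pass: keep one copy of each duplicate in a side table and
    replace the remaining occurrences with ('ref', index) placeholders.
    Simplified to handle lists only."""
    dups = collect_duplicates(obj)
    table = list(dups.values())
    index = {id(v): i for i, v in enumerate(table)}
    emitted = set()

    def encode(node):
        if id(node) in index:
            if id(node) in emitted:
                return ('ref', index[id(node)])
            emitted.add(id(node))
        if isinstance(node, list):
            return [encode(c) for c in node]
        return node

    return encode(obj), table


shared = [1, 2, 3]
encoded, table = dictionary_encode([shared, shared, shared])
# encoded -> [[1, 2, 3], ('ref', 0), ('ref', 0)], table -> [[1, 2, 3]]
```

In a real implementation the side table would live at the end of the RecordBatch, as described above, and the placeholders would be indices into it.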
So in some way it is like Python's deepcopy with
I have no objection to this being on; I would even say that it should be on by default, as you said. I am mostly waiting for a proper solution here, but this can help find bugs in the meantime. I am not sure why we would have it if it is not on by default. It is better to throw an exception and let people disable the check once they determine that their use case warrants it.
This really looks like a hack to me. I think the medium-term plan should instead be to remove most of the custom serialization layer and exploit PEP 574 instead.
I am wondering why we would not try to implement the dictionary encoding rather than hacking around the issue like this. I agree with @pitrou that using pickle5 eventually would be a good idea (though there are some technical questions, e.g. output sizing / allocation).
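[Editor's note] For context on the PEP 574 suggestion: pickle protocol 5 (standard since Python 3.8) lets large buffers travel out-of-band instead of being copied into the pickle stream. A minimal standard-library sketch, with no Arrow involved:

```python
import pickle

# A large, writable buffer we want to transport without copying it
# into the pickle stream itself.
big = bytearray(b"x" * (10 ** 6))

# buffer_callback collects out-of-band buffers instead of embedding them.
buffers = []
stream = pickle.dumps(pickle.PickleBuffer(big), protocol=5,
                      buffer_callback=buffers.append)

# The pickle stream stays tiny; the data lives in `buffers`.
assert len(stream) < 100
assert len(buffers) == 1

# The consumer supplies the buffers back when loading.
restored = pickle.loads(stream, buffers=buffers)
assert bytes(restored) == bytes(big)
```

This addresses the copy-avoidance side; the output sizing / allocation questions mentioned above concern how a consumer like Arrow would pre-allocate and own those buffers.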
@pitrou @wesm I don't view it as a hack because it is a subset of what needs to be done to implement the dictionary encoding approach (that is, you need a way to keep track of which objects you've seen before). As for the dictionary approach, I think it's probably a good idea, but it's still not clear to me how to actually do it efficiently (maybe @pcmoritz has more thoughts about this). I submitted this PR because the current behavior is very problematic for some Python objects. @pitrou, PEP 574 looks like a really good idea. However, it wouldn't use Arrow at all, right? So we would lose certain things like sharing serialized objects between languages. Is that right?
Can you give an example of the kind of sharing you are thinking about?
@pitrou, so we haven't implemented this yet, but it would be natural to use
Well... IMHO, Arrow isn't in the business of making arbitrary objects interoperable between languages.
Ok, not arbitrary objects, but isn't the intention for the data layout to be language agnostic so as to allow some degree of interoperability?
Right, but that's when sharing Arrow arrays, not native Python instances. Am I missing something?
Sure, but the motivating example is
I don't understand why it should work at all. These are not Arrow arrays, but a list of NumPy arrays. It's definitely Python specific.
What you might want to do is to call
Ok, I'm not 100% sure what the ideal behavior is here. Will think it over a little.
I forgot about the plan to be able to read these objects in Java. If we limit to tensors and other objects recognized by Arrow, then that's valuable to support.
I'd like to be able to use
I wouldn't need/want arbitrary objects to be serialised, but I'd very much like to be able to serialise objects representing mappings (dicts) of strings (names) to arrays, tables, datetimes, and primitive types. IIUC this can already be done: https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization
Closing for now. Still need to address this issue at some point.
Is there a related issue for this?
This is one approach to addressing a long-standing issue in which Python objects that contain multiple copies of the same object are serialized (by pyarrow.serialize) as if they contained distinct copies of that object. For example, the object 1000 * [np.zeros(10**8)] will be serialized to contain 1000 distinct arrays. For graph data structures, this can lead to exponential size blowups and incorrect behavior (in the sense of not preserving the structure of the Python object). This PR raises an exception when we detect that we are in this scenario.
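[Editor's note] The check described here amounts to identity tracking during traversal. A hypothetical pure-Python sketch of the idea (the PR's actual implementation lives in Arrow's serialization code, not here):

```python
def check_no_shared_objects(obj, _seen=None):
    """Raise if any container is reachable more than once (by identity).

    Only containers are tracked, since interned scalars (small ints,
    short strings) can legitimately share identity.
    """
    if _seen is None:
        _seen = set()
    if isinstance(obj, (list, dict)):
        if id(obj) in _seen:
            raise ValueError(
                "object graph contains multiple references to the same "
                "object; naive serialization would duplicate it")
        _seen.add(id(obj))
        children = obj.values() if isinstance(obj, dict) else obj
        for child in children:
            check_no_shared_objects(child, _seen)


shared = [0.0] * 3
check_no_shared_objects([shared])  # fine: a single reference

try:
    check_no_shared_objects(1000 * [shared])  # 1000 references to one list
    raised = False
except ValueError:
    raised = True
# raised is True: the duplicate reference was detected
```

As noted above, this same bookkeeping is the first step toward the dictionary-encoding approach, which would deduplicate instead of raising.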
A complementary (and desirable) approach would be to use dictionary encodings to actually serialize more objects correctly.