
ARROW-1382: [Python] Raise an exception when serializing objects containing multiple copies of the same object #2859

Closed
wants to merge 3 commits

Conversation

robertnishihara
Contributor


This is one approach to addressing a long-standing issue in which Python objects that contain multiple copies of the same object are serialized (by pyarrow.serialize) as if they contain distinct copies of that object, e.g., the object 1000 * [np.zeros(10**8)] will be serialized to contain 1000 distinct arrays. For graph data structures, this can lead to exponential size blowups and incorrect behavior (in the sense of not preserving the structure of the Python object).
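To illustrate the aliasing at the heart of the issue, here is a minimal pure-Python sketch (a small list stands in for np.zeros(10**8)); every element of the multiplied list is a reference to one object, which is exactly what an identity-tracking serializer has to detect:

```python
# Stand-in for np.zeros(10**8); multiplying a list copies references, not data.
big = list(range(10))
obj = 1000 * [big]

# All 1000 entries alias ONE object...
assert all(item is obj[0] for item in obj)

# ...so a serializer that ignores identity writes the payload 1000 times,
# and shared structure is lost: after a naive round-trip, mutating one entry
# would no longer be visible through the others. In the original object it is:
obj[0].append(99)
assert obj[500][-1] == 99
```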

This PR raises an exception when we detect that we are in this scenario.

A complementary (and desirable) approach would be to use dictionary encodings to actually serialize more objects correctly.

@robertnishihara
Contributor Author

cc @pcmoritz

@robertnishihara
Contributor Author

This needs to be tried out and tested in more scenarios before merging.

@pcmoritz
Contributor

If we want to merge something like this, I'd say it should be behind a flag which is off by default.

An alternative approach that could be on by default (and which I can look into) is, as mentioned, to use dictionary encoding. We would use something similar to this to detect duplicate objects and collect them in a first pass. In a second pass we would store them in an array at the end of the RecordBatch and replace the occurrences with indices into that array.
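A rough sketch of that two-pass idea in plain Python (hypothetical helper names, not the actual pyarrow implementation; identity is tracked with id() while references are held alive):

```python
def dedup_encode(items):
    """Pass 1: find objects that occur more than once (by identity)
    and assign each a slot in a dictionary table.
    Pass 2: replace duplicated objects with index references."""
    seen, dupes = {}, {}
    for item in items:
        key = id(item)
        if key in seen and key not in dupes:
            dupes[key] = len(dupes)  # assign the next dictionary slot
        seen[key] = item             # hold a reference so id() stays valid
    table = [None] * len(dupes)
    for key, idx in dupes.items():
        table[idx] = seen[key]
    encoded = [("ref", dupes[id(item)]) if id(item) in dupes else ("val", item)
               for item in items]
    return encoded, table

def dedup_decode(encoded, table):
    """Rebuild the list, resolving index references through the table
    so that shared structure is preserved."""
    return [table[v] if tag == "ref" else v for tag, v in encoded]
```

A duplicated object is then stored once in the table and referenced by index, so shared structure survives the round-trip.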

@mitar
Contributor

mitar commented Oct 29, 2018

So in some way it is like Python's deepcopy with the memo argument. :-)
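Right, copy.deepcopy's memo dict is exactly that kind of identity tracking; it preserves shared references in the copy:

```python
import copy

shared = [1, 2, 3]
obj = [shared, shared]       # two references to one list

dup = copy.deepcopy(obj)     # the internal memo dict deduplicates by id()
assert dup[0] is dup[1]      # sharing is preserved in the copy
assert dup[0] == shared      # same contents...
assert dup[0] is not shared  # ...but still a genuinely deep copy
```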

@robertnishihara
Contributor Author

@pcmoritz @mitar I'm ok with making this off by default, but in the cases where this PR would raise an exception, the current serialization doesn't accurately preserve the underlying object, which could be an issue for some applications.

@mitar
Contributor

mitar commented Oct 29, 2018

I was not suggesting this should be off. I would even say that it should be on, as you said.

I am mostly waiting for a proper solution here. This can help find bugs, though, so I am not sure why we would have it if it is not on by default. It is better to throw an exception and let people disable it once they determine that their use case warrants that.

@pitrou
Member

pitrou commented Oct 29, 2018

This really looks like a hack to me. I think the middle-term plan should be instead to remove most of the custom serialization layer and exploit PEP 574 instead.
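For reference, a minimal sketch of what PEP 574 (pickle protocol 5, Python 3.8+) provides: large buffers can be collected through a buffer_callback and transferred out-of-band instead of being copied into the pickle stream:

```python
import pickle

payload = bytearray(b"large binary payload")
buffers = []

# With protocol 5, PickleBuffer contents go to buffer_callback
# out-of-band rather than being embedded in the pickle bytes.
data = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                    buffer_callback=buffers.append)

# The consumer must supply the same buffers again at load time.
restored = pickle.loads(data, buffers=buffers)
assert bytes(restored) == bytes(payload)
assert len(buffers) == 1  # one out-of-band buffer was collected
```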

@wesm
Member

wesm commented Oct 29, 2018

I am wondering why we would not try to implement the dictionary encoding rather than hacking around the issue like this. I agree with @pitrou that using pickle5 eventually would be a good idea (though there are some technical questions, e.g. output sizing / allocation)

@robertnishihara
Contributor Author

@pitrou @wesm I don't view it as a hack because it is a subset of what needs to be done to implement the dictionary encoding approach (that is, you need a way to keep track of which objects you've seen before).

As for the dictionary approach, I think it's probably a good idea, but it's still not clear to me how to actually do it efficiently (maybe @pcmoritz has more thoughts about this). I submitted this PR because the current behavior is very problematic for some Python objects.

@pitrou, PEP 574 looks like a really good idea. However, it wouldn't use Arrow at all, right? So we would lose certain things like sharing serialized objects between languages. Is that right?

@pitrou
Member

pitrou commented Oct 29, 2018

So we would lose certain things like sharing serialized objects between languages.

Can you give an example of the kind of sharing you are thinking about?

@robertnishihara
Contributor Author

@pitrou, so we haven't implemented this yet, but it would be natural to use pyarrow.serialize to serialize certain primitive types (scalars, arrays, strings, tuples, potentially even lists/dictionaries) and then deserialize the resulting serialized object in a Java program (as the equivalent Java type) and vice versa.

@pitrou
Member

pitrou commented Oct 29, 2018

Well... IMHO, Arrow isn't in the business of making arbitrary objects interoperable between languages.

@robertnishihara
Contributor Author

Ok, not arbitrary objects, but isn't the intention for the data layout to be language agnostic so as to allow some degree of interoperability?

@pitrou
Member

pitrou commented Oct 29, 2018

Right, but that's when sharing Arrow arrays, not native Python instances. Am I missing something?

@mitar
Contributor

mitar commented Oct 29, 2018

Sure, but the motivating example is 1000 * [np.zeros(10**8)], which should work both in Python and Java efficiently, no?

@pitrou
Member

pitrou commented Oct 29, 2018

I don't understand why it should work at all. These are not Arrow arrays, but a list of Numpy arrays. It's definitely Python specific.

@pitrou
Member

pitrou commented Oct 29, 2018

What you might want to do is to call pa.array on your Python object so as to have an Arrow "equivalent" of the Python object (if at all possible), but of course that won't deduplicate objects either ;-)

@robertnishihara
Contributor Author

Ok, I'm not 100% sure what the ideal behavior is here. Will think it over a little.

@wesm
Member

wesm commented Oct 29, 2018

I forgot about the plan to be able to read these objects in Java. If we limit to tensors and other objects recognized by Arrow then that's valuable to support.

@dhirschfeld

dhirschfeld commented Oct 30, 2018

I'd like to be able to use arrow to share objects between different languages (python/R/typescript).

I wouldn't need/want arbitrary objects to be serialised, but I'd very much like to be able to serialise objects representing mappings (dicts) of strings (names) to arrays, tables, datetimes and primitive types.

IIUC this can already be done - https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization
I'm just piping up because I wouldn't want to lose this functionality and AFAICS pickle (PEP-574) wouldn't support this (language-interoperability) usecase.

@robertnishihara
Contributor Author

Closing for now. Still need to address this issue at some point.

@mitar
Contributor

mitar commented May 22, 2019

Is there a related issue for this?
