ARROW-1382: [Python] Raise an exception when serializing objects containing multiple copies of the same object. #2859
Conversation
cc @pcmoritz
This needs to be tried out and tested in more scenarios before merging.
If we want to merge something like this, I'd say it should be behind a flag which is off by default. An alternative approach that could be on by default (and which I can look into) is, as mentioned, to use dictionary encoding. We would use something similar to this to detect duplicate objects and collect them in a first pass. In a second pass we would store them in an array at the end of the RecordBatch and then replace the occurrences with indices into that array.
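[Editor's note] The two-pass idea above can be sketched in pure Python. This is a hypothetical illustration, not Arrow code; the names collect_duplicates and dictionary_encode are invented, and identity tracking via id() has the usual caveat that interned values (small ints, short strings) can alias.

```python
def collect_duplicates(obj, seen=None, dups=None):
    """First pass: find sub-objects reachable more than once (by identity).

    Caveat: id()-based tracking can flag interned values (small ints,
    short strings) that merely happen to share identity.
    """
    if seen is None:
        seen, dups = set(), {}
    if isinstance(obj, (list, tuple)):
        for child in obj:
            if id(child) in seen:
                dups.setdefault(id(child), child)
            else:
                seen.add(id(child))
                collect_duplicates(child, seen, dups)
    return dups


def dictionary_encode(obj):
    """Second pass: keep one copy of each duplicate in a side table and
    replace the remaining occurrences with ('ref', index) placeholders.
    Simplified to handle lists only."""
    dups = collect_duplicates(obj)
    table = list(dups.values())
    index = {id(v): i for i, v in enumerate(table)}
    emitted = set()

    def encode(node):
        if id(node) in index:
            if id(node) in emitted:
                return ('ref', index[id(node)])
            emitted.add(id(node))
        if isinstance(node, list):
            return [encode(c) for c in node]
        return node

    return encode(obj), table


shared = [1, 2, 3]
encoded, table = dictionary_encode([shared, shared, shared])
# encoded -> [[1, 2, 3], ('ref', 0), ('ref', 0)], table -> [[1, 2, 3]]
```

In a real implementation the side table would live at the end of the RecordBatch, as described above, and the placeholders would be indices into it.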
So in some way it is like Python's deepcopy with
I have no objection to this being on; I would even say that it should be on by default, as you said. I am mostly waiting for a proper solution here, but this can help find bugs in the meantime. I am not sure why we would have it if it is not on by default. It is better to throw an exception and let people disable the check once they determine that their use case warrants it.
This really looks like a hack to me. I think the medium-term plan should instead be to remove most of the custom serialization layer and exploit PEP 574 instead.
I am wondering why we would not try to implement the dictionary encoding rather than hacking around the issue like this. I agree with @pitrou that using pickle5 eventually would be a good idea (though there are some technical questions, e.g. output sizing / allocation).
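[Editor's note] For context on the PEP 574 suggestion: pickle protocol 5 (standard since Python 3.8) lets large buffers travel out-of-band instead of being copied into the pickle stream. A minimal standard-library sketch, with no Arrow involved:

```python
import pickle

# A large, writable buffer we want to transport without copying it
# into the pickle stream itself.
big = bytearray(b"x" * (10 ** 6))

# buffer_callback collects out-of-band buffers instead of embedding them.
buffers = []
stream = pickle.dumps(pickle.PickleBuffer(big), protocol=5,
                      buffer_callback=buffers.append)

# The pickle stream stays tiny; the data lives in `buffers`.
assert len(stream) < 100
assert len(buffers) == 1

# The consumer supplies the buffers back when loading.
restored = pickle.loads(stream, buffers=buffers)
assert bytes(restored) == bytes(big)
```

This addresses the copy-avoidance side; the output sizing / allocation questions mentioned above concern how a consumer like Arrow would pre-allocate and own those buffers.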
@pitrou @wesm I don't view it as a hack because it is a subset of what needs to be done to implement the dictionary encoding approach (that is, you need a way to keep track of which objects you've seen before). As for the dictionary approach, I think it's probably a good idea, but it's still not clear to me how to actually do it efficiently (maybe @pcmoritz has more thoughts about this). I submitted this PR because the current behavior is very problematic for some Python objects. @pitrou, PEP 574 looks like a really good idea. However, it wouldn't use Arrow at all, right? So we would lose certain things like sharing serialized objects between languages. Is that right?
Can you give an example of the kind of sharing you are thinking about?
@pitrou, so we haven't implemented this yet, but it would be natural to use
Well... IMHO, Arrow isn't in the business of making arbitrary objects interoperable between languages.
Ok, not arbitrary objects, but isn't the intention for the data layout to be language agnostic so as to allow some degree of interoperability?
Right, but that's when sharing Arrow arrays, not native Python instances. Am I missing something?
Sure, but the motivating example is
I don't understand why it should work at all. These are not Arrow arrays, but a list of NumPy arrays. It's definitely Python specific.
What you might want to do is to call
Ok, I'm not 100% sure what the ideal behavior is here. Will think it over a little.
I forgot about the plan to be able to read these objects in Java. If we limit to tensors and other objects recognized by Arrow, then that's valuable to support.
I'd like to be able to use
I wouldn't need/want arbitrary objects to be serialised, but I'd very much like to be able to serialise objects representing mappings (dicts) of strings (names) to arrays, tables, datetimes, and primitive types. IIUC this can already be done: https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization
Closing for now. Still need to address this issue at some point.
Is there a related issue for this?
This is one approach to addressing a long-standing issue in which Python objects that contain multiple copies of the same object are serialized (by pyarrow.serialize) as if they contained distinct copies of that object. For example, the object 1000 * [np.zeros(10**8)] will be serialized to contain 1000 distinct arrays. For graph data structures, this can lead to exponential size blowups and incorrect behavior (in the sense of not preserving the structure of the Python object). This PR raises an exception when we detect that we are in this scenario.
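[Editor's note] The check described here amounts to identity tracking during traversal. A hypothetical pure-Python sketch of the idea (the PR's actual implementation lives in Arrow's serialization code, not here):

```python
def check_no_shared_objects(obj, _seen=None):
    """Raise if any container is reachable more than once (by identity).

    Only containers are tracked, since interned scalars (small ints,
    short strings) can legitimately share identity.
    """
    if _seen is None:
        _seen = set()
    if isinstance(obj, (list, dict)):
        if id(obj) in _seen:
            raise ValueError(
                "object graph contains multiple references to the same "
                "object; naive serialization would duplicate it")
        _seen.add(id(obj))
        children = obj.values() if isinstance(obj, dict) else obj
        for child in children:
            check_no_shared_objects(child, _seen)


shared = [0.0] * 3
check_no_shared_objects([shared])  # fine: a single reference

try:
    check_no_shared_objects(1000 * [shared])  # 1000 references to one list
    raised = False
except ValueError:
    raised = True
# raised is True: the duplicate reference was detected
```

As noted above, this same bookkeeping is the first step toward the dictionary-encoding approach, which would deduplicate instead of raising.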
A complementary (and desirable) approach would be to use dictionary encodings to actually serialize more objects correctly.