
[Python] Improve performance of serializing object dtype ndarrays #17848

Closed
asfimport opened this issue Nov 24, 2017 · 6 comments



asfimport commented Nov 24, 2017

I haven't looked carefully at the hot path for this, but I would expect these statements to have roughly the same performance (offloading the ndarray serialization to pickle)

In [1]: import pickle

In [2]: import numpy as np

In [3]: import pyarrow as pa
In [4]: arr = np.array(['foo', 'bar', None] * 100000, dtype=object)

In [5]: timeit serialized = pa.serialize(arr).to_buffer()
10 loops, best of 3: 27.1 ms per loop

In [6]: timeit pickled = pickle.dumps(arr)
100 loops, best of 3: 6.03 ms per loop

@robertnishihara @pcmoritz I encountered this while working on ARROW-1783, but it can likely be resolved independently

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm
Fix Version: 0.8.0

Note: This issue was originally created as ARROW-1854. Please see the migration documentation for further details.


Robert Nishihara / @robertnishihara:
Your numbers are much better than what I'm seeing. It looks like the poor performance comes from our handling of lists. Since pyarrow handles a numpy array of objects by first converting it to a list and then serializing that list, we can't do better than the list case.

EDIT: Actually I'm seeing similar numbers (updated below). I think I had compiled without optimizations.

import pickle
import pyarrow as pa
import numpy as np

print(pa.__version__)  # '0.7.2.dev165+ga446fbd.d20171116'

arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
arr_list = arr.tolist()

# Serializing the array.
%timeit pa.serialize(arr).to_buffer()
29.1 ms ± 535 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr)
7.43 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# Serializing the list.
%timeit pa.serialize(arr_list).to_buffer()
27.5 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr_list)
5.87 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Wes McKinney / @wesm:
I'm quite confident we can do better. I think instead of converting an ndarray to a list, we should pickle it and send the pickle as a buffer along with any other buffers that are produced during the serialization pass.
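A minimal sketch of the idea, using only pickle and numpy; `serialize_ndarray`, `deserialize_ndarray`, and the components dict are illustrative names, not pyarrow's API — the point is just that the pickle payload travels as one opaque buffer next to whatever other buffers the serializer produces:

```python
import pickle
import numpy as np

# Sketch only: pickle the object-dtype ndarray once and ship the bytes
# as a single opaque buffer, rather than converting the array to a
# Python list element by element. Names are illustrative, not pyarrow's.
def serialize_ndarray(arr):
    payload = pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
    # In the real serializer this buffer would sit alongside any other
    # buffers produced during the serialization pass.
    return {"type": "ndarray-pickle", "buffers": [payload]}

def deserialize_ndarray(components):
    assert components["type"] == "ndarray-pickle"
    return pickle.loads(components["buffers"][0])

arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
out = deserialize_ndarray(serialize_ndarray(arr))
assert out.dtype == object and list(out[:3]) == ['foo', 'bar', None]
```

This sidesteps the per-element list conversion entirely, which is where the benchmark above spends most of its time.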


Robert Nishihara / @robertnishihara:
That would certainly work. It wouldn't give us any of the benefits of using Arrow, but for numpy arrays of general Python objects, we probably shouldn't expect that anyway.

It may be as simple as changing the custom serializer/deserializer. I'll take a quick look at that.


Robert Nishihara / @robertnishihara:
We may run into problems when the numpy array can't be pickled/unpickled but can be serialized and deserialized with cloudpickle. E.g.,

import numpy as np
import pickle
import cloudpickle

class Foo(object):
    pass

a = np.array([Foo()])

Pickle will succeed at pickling a, but it won't be able to unpickle it (in a different process). Cloudpickle will succeed but will be much slower. Our current approach will succeed and will be faster than cloudpickle.
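To make that failure mode concrete, here is a self-contained sketch; the module name `fake_user_module` is fabricated to simulate a class that is defined in the sending process but not importable in the receiving one. Plain pickle stores only a reference to the class, so the dump succeeds while the load fails wherever the class can't be imported — the case cloudpickle handles by serializing the class definition itself:

```python
import pickle
import sys
import types

import numpy as np

# Fabricated module name, standing in for user code that exists in the
# sending process but is not importable in the receiving process.
mod = types.ModuleType("fake_user_module")

class Foo(object):
    pass

Foo.__module__ = "fake_user_module"
mod.Foo = Foo
sys.modules["fake_user_module"] = mod

a = np.array([Foo()], dtype=object)

# dumps succeeds: the payload only records a reference to
# fake_user_module.Foo, not the class definition itself.
payload = pickle.dumps(a)

# Simulate the receiving process, where the module is not importable.
del sys.modules["fake_user_module"]
try:
    pickle.loads(payload)
    load_failed = False
except ModuleNotFoundError:
    load_failed = True

assert load_failed  # cloudpickle avoids this by pickling Foo by value
```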


Wes McKinney / @wesm:
Issue resolved by pull request #1360


Brian Bowman:
I’m out of the office for vacation, followed by the SAS Winter Holiday, until Tuesday, January 2nd, 2018.

-Brian

