
[Python] Improve performance of serializing object dtype ndarrays #17848

Closed
asfimport opened this issue Nov 24, 2017 · 6 comments



asfimport commented Nov 24, 2017

I haven't looked carefully at the hot path for this, but I would expect these statements to have roughly the same performance (offloading the ndarray serialization to pickle)

In [1]: import pickle

In [2]: import numpy as np

In [3]: import pyarrow as pa
In [4]: arr = np.array(['foo', 'bar', None] * 100000, dtype=object)

In [5]: timeit serialized = pa.serialize(arr).to_buffer()
10 loops, best of 3: 27.1 ms per loop

In [6]: timeit pickled = pickle.dumps(arr)
100 loops, best of 3: 6.03 ms per loop

@robertnishihara @pcmoritz I encountered this while working on ARROW-1783, but it can likely be resolved independently

Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm
Fix Version: 0.8.0

Note: This issue was originally created as ARROW-1854. Please see the migration documentation for further details.


Robert Nishihara / @robertnishihara:
Your numbers are much better than what I'm seeing. It looks like the poor performance comes from our handling of lists. Since pyarrow handles a numpy array of objects by first converting it to a list and then serializing that list, we can't do better than the list case.

EDIT: Actually I'm seeing similar numbers (updated below). I think I had compiled without optimizations.

import pickle
import pyarrow as pa
import numpy as np

print(pa.__version__)  # '0.7.2.dev165+ga446fbd.d20171116'

arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
arr_list = arr.tolist()

# Serializing the array.
%timeit pa.serialize(arr).to_buffer()
29.1 ms ± 535 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr)
7.43 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# Serializing the list.
%timeit pa.serialize(arr_list).to_buffer()
27.5 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pickle.dumps(arr_list)
5.87 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Wes McKinney / @wesm:
I'm quite confident we can do better. I think instead of converting an ndarray to a list, we should pickle it and send the pickle as a buffer along with any other buffers that are produced during the serialization pass.
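A minimal sketch of the idea, using only pickle and numpy; `serialize_ndarray`, `deserialize_ndarray`, and the components dict are illustrative names, not pyarrow's API — the point is just that the pickle payload travels as one opaque buffer next to whatever other buffers the serializer produces:

```python
import pickle
import numpy as np

# Sketch only: pickle the object-dtype ndarray once and ship the bytes
# as a single opaque buffer, rather than converting the array to a
# Python list element by element. Names are illustrative, not pyarrow's.
def serialize_ndarray(arr):
    payload = pickle.dumps(arr, protocol=pickle.HIGHEST_PROTOCOL)
    # In the real serializer this buffer would sit alongside any other
    # buffers produced during the serialization pass.
    return {"type": "ndarray-pickle", "buffers": [payload]}

def deserialize_ndarray(components):
    assert components["type"] == "ndarray-pickle"
    return pickle.loads(components["buffers"][0])

arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
out = deserialize_ndarray(serialize_ndarray(arr))
assert out.dtype == object and list(out[:3]) == ['foo', 'bar', None]
```

This sidesteps the per-element list conversion entirely, which is where the benchmark above spends most of its time.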


Robert Nishihara / @robertnishihara:
That would certainly work. It wouldn't give us any of the benefits of using Arrow, but for numpy arrays of general Python objects, we probably shouldn't expect that anyway.

It may be as simple as changing the custom serializer/deserializer. I'll take a quick look at that.


Robert Nishihara / @robertnishihara:
We may run into problems when the numpy array can't be pickled/unpickled but can be serialized and deserialized with cloudpickle. E.g.,

import numpy as np
import pickle
import cloudpickle

class Foo(object):
    pass

a = np.array([Foo()])

Pickle will succeed at pickling a, but it won't be able to unpickle it (in a different process). Cloudpickle will succeed but will be much slower. Our current approach will succeed and will be faster than cloudpickle.
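To make that failure mode concrete, here is a self-contained sketch; the module name `fake_user_module` is fabricated to simulate a class that is defined in the sending process but not importable in the receiving one. Plain pickle stores only a reference to the class, so the dump succeeds while the load fails wherever the class can't be imported — the case cloudpickle handles by serializing the class definition itself:

```python
import pickle
import sys
import types

import numpy as np

# Fabricated module name, standing in for user code that exists in the
# sending process but is not importable in the receiving process.
mod = types.ModuleType("fake_user_module")

class Foo(object):
    pass

Foo.__module__ = "fake_user_module"
mod.Foo = Foo
sys.modules["fake_user_module"] = mod

a = np.array([Foo()], dtype=object)

# dumps succeeds: the payload only records a reference to
# fake_user_module.Foo, not the class definition itself.
payload = pickle.dumps(a)

# Simulate the receiving process, where the module is not importable.
del sys.modules["fake_user_module"]
try:
    pickle.loads(payload)
    load_failed = False
except ModuleNotFoundError:
    load_failed = True

assert load_failed  # cloudpickle avoids this by pickling Foo by value
```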


Wes McKinney / @wesm:
Issue resolved by pull request #1360


Brian Bowman:
I’m out of the office for vacation, followed by the SAS Winter Holiday, until Tuesday, January 2nd, 2018.

-Brian

