[Python] Improve performance of serializing object dtype ndarrays #17848
Comments
Robert Nishihara / @robertnishihara: EDIT: Actually I'm seeing similar numbers (updated below). I think I had compiled without optimizations.
import pickle
import pyarrow as pa
import numpy as np
print(pa.__version__) # '0.7.2.dev165+ga446fbd.d20171116'
arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
arr_list = arr.tolist()
# Serializing the array.
%timeit pa.serialize(arr).to_buffer()
29.1 ms ± 535 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pickle.dumps(arr)
7.43 ms ± 196 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Serializing the list.
%timeit pa.serialize(arr_list).to_buffer()
27.5 ms ± 669 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pickle.dumps(arr_list)
5.87 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Wes McKinney / @wesm:
Robert Nishihara / @robertnishihara: It may be as simple as changing the custom serializer/deserializer. I'll take a quick look at that.
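For illustration only, here is a minimal sketch of what "offloading to pickle" in a serializer/deserializer pair could look like. The names `serialize_with_pickle` and `deserialize_with_pickle` are hypothetical and are not taken from pyarrow. How such callbacks would be registered with pyarrow is not shown, since the thread does not say. The sketch uses only the standard library and the list form of the benchmark data, shrunk for speed.

```python
import pickle
import timeit


def serialize_with_pickle(obj):
    # Hypothetical serializer callback: hand the whole object to pickle.
    return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)


def deserialize_with_pickle(data):
    # Hypothetical deserializer callback: inverse of serialize_with_pickle.
    return pickle.loads(data)


# Same shape as the list benchmark in this thread, but smaller.
values = ['foo', 'bar', None] * 1000

payload = serialize_with_pickle(values)
restored = deserialize_with_pickle(payload)

# Time repeated serialization; absolute numbers depend on the machine.
elapsed = timeit.timeit(lambda: serialize_with_pickle(values), number=100)
print(f"100 pickle serializations took {elapsed:.4f} s")
```

If a custom serializer did nothing more than this, its cost should sit close to a bare `pickle.dumps` call, which is the expectation stated in the issue description. An object dtype ndarray could presumably be passed through the same two functions, though that is not exercised here.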
Robert Nishihara / @robertnishihara:
import numpy as np
import pickle
import cloudpickle
class Foo(object):
    pass
a = np.array([Foo()])
Pickle will succeed at pickling
Wes McKinney / @wesm:
Brian Bowman: -Brian
On Nov 24, 2017, at 3:16 PM, Wes McKinney (JIRA) jira@apache.org wrote:
EXTERNAL
Wes McKinney created ARROW-1854:
I haven't looked carefully at the hot path for this, but I would expect these statements to have roughly the same performance (offloading the ndarray serialization to pickle)
In [1]: import pickle
In [2]: import numpy as np
In [3]: import pyarrow as pa
In [4]: arr = np.array(['foo', 'bar', None] * 100000, dtype=object)
In [5]: timeit serialized = pa.serialize(arr).to_buffer()
10 loops, best of 3: 27.1 ms per loop
In [6]: timeit pickled = pickle.dumps(arr)
100 loops, best of 3: 6.03 ms per loop
@robertnishihara @pcmoritz I encountered this while working on ARROW-1783, but it can likely be resolved independently
I haven't looked carefully at the hot path for this, but I would expect these statements to have roughly the same performance (offloading the ndarray serialization to pickle)
@robertnishihara @pcmoritz I encountered this while working on ARROW-1783, but it can likely be resolved independently
Reporter: Wes McKinney / @wesm
Assignee: Wes McKinney / @wesm
Related issues:
Original Issue Attachments:
PRs and other links:
Note: This issue was originally created as ARROW-1854. Please see the migration documentation for further details.