`to_numpy().tolist()` is significantly faster than `.tolist()` #34354

Ben-Epstein · 2023-02-26T02:22:31Z

Describe the bug, including details regarding any error messages, version, and platform.

pyarrow version '9.0.0'
numpy version 1.23.5

Code to reproduce

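    import pyarrow as pa
    import random

    # 500k random integers, built from a generator
    x = pa.array(random.randint(0, 10000) for _ in range(500_000))

    # %time is an IPython/Jupyter magic; run these lines in a notebook or IPython shell
    %time y = x.to_pylist()
    %time z = x.to_numpy().tolist()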

It would seem to me that these two calls should have nearly the same performance.

Component(s)

Python

westonpace · 2023-02-27T22:37:23Z

I'm not aware that anyone has tried particularly hard to optimize `to_pylist`. I think the expectation at the moment is that it won't be used all that often on large arrays, since a Python list is a very inefficient way to represent the data.

However, from a glance, my guess would be that the difference is that Arrow implements `to_pylist` mostly in Python:

    def to_pylist(self):
        """
        Convert to a list of native Python objects.

        Returns
        -------
        lst : list
        """
        return [x.as_py() for x in self]

In NumPy, by contrast, the entire `tolist` function is in C. So in Arrow you get 500k Python-level calls and in NumPy you get one. It should be fairly straightforward to implement the more efficient version in Arrow. I would hope it could mostly be done in Cython. If someone is interested in taking this on, I can try giving a few pointers / suggestions.
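To make the per-element cost concrete, here is a minimal sketch that uses only the standard library. It is an analogy, not pyarrow's actual code: `Scalar` and `WrappedArray` are hypothetical stand-ins, and the stdlib `array.array.tolist()` plays the role of a conversion that loops in C. Timings will vary by machine, so none are quoted here.

```python
import timeit
from array import array


class Scalar:
    """Hypothetical stand-in for a per-element wrapper object."""

    __slots__ = ("_value",)

    def __init__(self, value):
        self._value = value

    def as_py(self):
        return self._value


class WrappedArray:
    """Hypothetical container that yields one Scalar per element."""

    def __init__(self, values):
        # Contiguous buffer of signed 64-bit integers.
        self._data = array("q", values)

    def __iter__(self):
        # One Scalar is allocated per element, at Python level.
        return (Scalar(v) for v in self._data)

    def to_pylist_slow(self):
        # Same shape as the list comprehension quoted above.
        return [x.as_py() for x in self]

    def to_pylist_fast(self):
        # A single call; array.array.tolist() does its loop in C.
        return self._data.tolist()


arr = WrappedArray(range(500_000))

# Both paths must produce the same list.
assert arr.to_pylist_slow() == arr.to_pylist_fast()

slow = timeit.timeit(arr.to_pylist_slow, number=3)
fast = timeit.timeit(arr.to_pylist_fast, number=3)
print(f"per-element: {slow:.3f}s  bulk: {fast:.3f}s")
```

The two methods return identical lists; only the number of Python-level calls and temporary objects differs, which is the gap described above.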

jorisvandenbossche · 2023-03-03T16:48:23Z

There is quite some discussion on this topic in #28694. It is indeed slower because we currently do this in Python space, and in particular because we wrap each element in a Scalar object before converting it to a Python object. That issue also has some discussion on moving this into C++.

#11302 also implemented `to_pylist` in C++ specifically for the new MonthDayNano type.

I think we can close this issue in favor of #28694.

jorisvandenbossche · 2023-03-23T10:00:53Z

Duplicate of #28694

Ben-Epstein added the Type: bug label Feb 26, 2023

github-actions bot added the Component: Python label Feb 26, 2023

Ben-Epstein mentioned this issue Feb 26, 2023: [BUG-REPORT] tolist is much slower than to_numpy().tolist() vaexio/vaex#2325 (Open)

jorisvandenbossche marked this as a duplicate of #28694 Mar 23, 2023

jorisvandenbossche closed this as not planned Mar 23, 2023
