to_numpy().tolist() is significantly faster than .tolist() #34354

Closed
Ben-Epstein opened this issue Feb 26, 2023 · 3 comments

Comments

@Ben-Epstein

Describe the bug, including details regarding any error messages, version, and platform.

pyarrow version '9.0.0'
numpy version 1.23.5

Code to reproduce

import pyarrow as pa
import random

x = pa.array(random.randint(0, 10000) for _ in range(500_000))
%time y = x.to_pylist()
%time z = x.to_numpy().tolist()

It seems to me that these two calls should have nearly the same performance.

Component(s)

Python

@westonpace
Member

I'm not aware that anyone has tried particularly hard to optimize to_pylist. I think the expectation at the moment is that it won't be used all that often on large arrays, since a Python list is a very inefficient way to represent the data.

However, from a glance, my guess would be that the difference is that Arrow implements to_pylist mostly in python:

    def to_pylist(self):
        """
        Convert to a list of native Python objects.

        Returns
        -------
        lst : list
        """
        return [x.as_py() for x in self]

In NumPy, by contrast, the entire tolist function is implemented in C. So in Arrow you get 500k Python-level calls and in NumPy you get one. It should be fairly straightforward to implement a more efficient version in Arrow. I would hope it could mostly be done in Cython. If someone is interested in taking this on, I can try giving a few pointers / suggestions.

@jorisvandenbossche
Member

There is quite a bit of discussion on this topic in #28694. It is indeed slower because we currently do this in Python space, and in particular because each element is wrapped in a Scalar object before being converted to a Python object. That issue also discusses moving this into C++.

#11302 also implemented to_pylist in C++ specifically for the new MonthDayNano type.

I think we can close this issue in favor of #28694

@jorisvandenbossche
Member

Duplicate of #28694

@jorisvandenbossche marked this as a duplicate of #28694 Mar 23, 2023
@jorisvandenbossche closed this as not planned Mar 23, 2023