
[Python] Large performance difference in conversion of binary array to object dtype array in to_pandas vs to_numpy #42026

Open
jorisvandenbossche opened this issue Jun 7, 2024 · 1 comment

When you have a binary array, converting it to an object dtype array with to_pandas() (e.g. when converting a table to pandas) vs to_numpy() (or calling np.asarray(..) on a pyarrow array) shows a considerable performance difference, even though both result in exactly the same numpy object dtype array (for to_pandas it is additionally wrapped in a pandas Series, but that should not add much overhead).

Example:

import numpy as np
import pyarrow as pa

def random_ascii(length):
    return bytes(np.random.randint(65, 123, size=length, dtype='i1'))

arr = pa.chunked_array(
    [pa.array(random_ascii(i) for i in np.random.randint(20, 100, 1_000_000)) for _ in range(10)]
)

In [60]: %timeit _ = arr.to_pandas()
1.98 s ± 41.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [61]: %timeit _ = arr.to_numpy(zero_copy_only=False)
382 ms ± 775 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
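
Both conversion paths should yield the exact same object-dtype array. A quick sanity check (my own snippet, not part of the original report) to confirm the results match:

# Verify both paths produce identical numpy object-dtype arrays
result_pandas = arr.to_pandas().to_numpy()
result_numpy = arr.to_numpy(zero_copy_only=False)

assert result_pandas.dtype == object
assert result_numpy.dtype == object
assert (result_pandas == result_numpy).all()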

(noticed this in geopandas/geopandas#3322)

@jorisvandenbossche
Member Author

Profiling both showed a clear difference: there is a lot of hash table usage in the to_pandas() case. That reminded me that we have a deduplicate_objects option, which is False by default (and is what to_numpy uses) but is set to True by default in to_pandas().

That explains the difference, and in a case like this where all binary values are unique, it just adds unnecessary overhead. Disabling it gives the expected similar performance for both to_numpy and to_pandas:

In [4]: %timeit _ = arr.to_numpy(zero_copy_only=False)
375 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit _ = arr.to_pandas(deduplicate_objects=False)
380 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now I do wonder if we should consider turning it off by default in the case of binary data... (for strings, starting with pandas 3.0 which keeps the arrow memory, the option will no longer be relevant anyway)
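
For context on why the option exists at all: when values do repeat, deduplication converts equal binary values to a single shared Python object instead of one fresh object per element, saving memory. A minimal sketch of my own (assuming identity of the resulting elements reflects the hash-table lookup described above):

import pyarrow as pa

# Heavy repetition: only two distinct values across a million elements.
repeated = pa.chunked_array([pa.array([b"spam", b"eggs"] * 500_000)])

dedup = repeated.to_pandas()                             # deduplicate_objects=True (default)
no_dedup = repeated.to_pandas(deduplicate_objects=False)

# With deduplication, equal values share one bytes object; without it,
# each element is a separate copy.
print(dedup[0] is dedup[2])        # expected: True
print(no_dedup[0] is no_dedup[2])  # expected: False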
