
[Python] Large performance difference in conversion of binary array to object dtype array in to_pandas vs to_numpy #42026

Open
jorisvandenbossche opened this issue Jun 7, 2024 · 1 comment

When you have a binary array, converting it to an object dtype array with to_pandas() (e.g. when converting a table to pandas) vs to_numpy() (or calling np.asarray(..) on a pyarrow array) shows a considerable performance difference, even though both result in exactly the same numpy object dtype array (for to_pandas it is additionally wrapped in a pandas Series, but that should not add much overhead).

Example:

import numpy as np
import pyarrow as pa

def random_ascii(length):
    return bytes(np.random.randint(65, 123, size=length, dtype='i1'))

arr = pa.chunked_array(
    [pa.array(random_ascii(i) for i in np.random.randint(20, 100, 1_000_000)) for _ in range(10)]
)

In [60]: %timeit _ = arr.to_pandas()
1.98 s ± 41.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [61]: %timeit _ = arr.to_numpy(zero_copy_only=False)
382 ms ± 775 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
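
Both conversion paths should yield the exact same object-dtype array. A quick sanity check (my own snippet, not part of the original report) to confirm the results match:

# Verify both paths produce identical numpy object-dtype arrays
result_pandas = arr.to_pandas().to_numpy()
result_numpy = arr.to_numpy(zero_copy_only=False)

assert result_pandas.dtype == object
assert result_numpy.dtype == object
assert (result_pandas == result_numpy).all()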

(noticed this in geopandas/geopandas#3322)

@jorisvandenbossche
Member Author

Profiling both showed a clear difference: there is a lot of hash table usage in the to_pandas() case. That reminded me that we have a deduplicate_objects option, which is False by default (and is what to_numpy uses) but is set to True by default in to_pandas().

That explains the difference, and in a case like this where all binary values are unique, it just adds unnecessary overhead. Disabling it gives the expected similar performance for both to_numpy and to_pandas:

In [4]: %timeit _ = arr.to_numpy(zero_copy_only=False)
375 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit _ = arr.to_pandas(deduplicate_objects=False)
380 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now I do wonder if we should consider turning it off by default in the case of binary data... (for strings, starting with pandas 3.0 which keeps the arrow memory, the option will no longer be relevant anyway)
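
For context on why the option exists at all: when values do repeat, deduplication converts equal binary values to a single shared Python object instead of one fresh object per element, saving memory. A minimal sketch of my own (assuming identity of the resulting elements reflects the hash-table lookup described above):

import pyarrow as pa

# Heavy repetition: only two distinct values across a million elements.
repeated = pa.chunked_array([pa.array([b"spam", b"eggs"] * 500_000)])

dedup = repeated.to_pandas()                             # deduplicate_objects=True (default)
no_dedup = repeated.to_pandas(deduplicate_objects=False)

# With deduplication, equal values share one bytes object; without it,
# each element is a separate copy.
print(dedup[0] is dedup[2])        # expected: True
print(no_dedup[0] is no_dedup[2])  # expected: False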
