
[Python] Heuristic in dataframe_to_arrays that decides whether to multithread the conversion causes slow conversions #17194

Closed
asfimport opened this issue May 22, 2020 · 5 comments


When calling pa.Table.from_pandas(), the code path that uses the ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas) makes the conversion much, much slower.

I have a simple example, but the time difference is much worse with a real table.

```
Python 3.7.3 | packaged by conda-forge | (default, Dec  6 2019, 08:54:18)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pyarrow as pa

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({"A": [0] * 10000000})

In [4]: %timeit table = pa.Table.from_pandas(df)
577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```


Environment: macOS 10.15.4 (also happening on Windows 10)
Python: 3.7.3
PyArrow: 0.16.0
Pandas: 0.25.3
Reporter: Kevin Glasson
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-8888. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
There is currently a heuristic, based on the number of rows vs the number of columns, that decides whether to use multithreading when it is not specified by the user (https://github.com/apache/arrow/blob/d00c50a6ca0d88e3458742091c59f0fc5c2fc7de/python/pyarrow/pandas_compat.py#L541-L549).
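
For illustration, the decision looks roughly like this (a paraphrased sketch, not the exact source; the real thresholds live in the linked pandas_compat.py):

```python
import pyarrow as pa

def choose_nthreads(df, nthreads=None):
    # Sketch of the default heuristic: multithread when the frame has
    # "many rows relative to columns". The exact cutoff here is illustrative.
    if nthreads is None:
        nrows, ncols = len(df), len(df.columns)
        if nrows > ncols * 100:
            nthreads = pa.cpu_count()
        else:
            nthreads = 1
    return nthreads
```

Note that a single-column frame with many rows (as in the report above) still takes the threaded path, even though there is only one column to parallelize over.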

This will probably not be the best decision for all cases. For example, here you have only a single column, and the parallelization works by processing each column in its own thread, so with a single column a thread pool only adds overhead. Also, ints are very cheap (zero-copy) to convert.

So I suspect that with more columns and a more expensive conversion, you will see the benefit of the default multithreading. For example, using floats instead of ints, and more columns:

```
In [1]: df = pd.DataFrame({key: [0.0] * 1_000_000 for key in range(100)})

In [2]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
1.3 s ± 7.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit table = pa.Table.from_pandas(df, nthreads=None)
327 ms ± 5.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

(With this number of columns but only ints, the single-threaded path is still faster.)

So you are certainly welcome to look into the default heuristic that decides how many threads are used, and whether it can be improved (e.g. ensuring we never use more threads than there are columns; see the sketch below), but I think it will never be ideal for all possible use cases.
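
A minimal version of that refinement might look like this (illustrative only, not the actual fix that landed):

```python
import pyarrow as pa

def choose_nthreads_capped(df, nthreads=None):
    if nthreads is None:
        nrows, ncols = len(df), len(df.columns)
        # Same illustrative cutoff as in the sketch above.
        nthreads = pa.cpu_count() if nrows > ncols * 100 else 1
    # Never spawn more threads than there are columns to convert.
    return min(nthreads, len(df.columns))
```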

Joris Van den Bossche / @jorisvandenbossche:
BTW, you mention that for your real use case the difference is worse than in the simple example. What are the characteristics of your actual use case (number of columns, data types, etc.)?

Kevin Glasson:
Yeah - so off the top of my head it was about 6 million rows and 40 columns, mostly string objects, mixed in with some timestamps.

When I was profiling it, I could see it was spending nearly all of its time in 'threading'.

The reduction was 10x: instead of a write taking 50 minutes, it took 4. There could be other inefficiencies in my code, of course, but just changing that one flag gave me that massive reduction.

I can try and share the profiling if I get around to running it again.
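
For reference, one quick way to capture such a profile (a sketch; it assumes a DataFrame `df` like the one in the report above):

```python
import cProfile

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"A": [0] * 10_000_000})

# Default (heuristic-chosen) thread count vs. forced single thread:
cProfile.run("pa.Table.from_pandas(df)", sort="cumtime")
cProfile.run("pa.Table.from_pandas(df, nthreads=1)", sort="cumtime")
```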

Joris Van den Bossche / @jorisvandenbossche:
Trying with strings, I don't see much speedup from multithreading in that case, but also no significant slowdown:

```
In [10]: df = pd.DataFrame({key: ['a'] * 1_000_000 for key in range(10)})

In [11]: %timeit table = pa.Table.from_pandas(df)
3.43 s ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
3.79 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Strings are stored in pandas as Python objects. I am not fully sure how the conversion in arrow is implemented, but it might be that it requires the GIL, in which case multithreading won't help.
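
A rough way to observe this effect outside of from_pandas is a micro-benchmark that converts columns with pa.array, serially vs. in a thread pool (a hypothetical sketch, not part of the issue):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({key: ["a"] * 1_000_000 for key in range(10)})

def convert(name):
    # Column-wise conversion, mirroring what dataframe_to_arrays does.
    return pa.array(df[name])

start = time.perf_counter()
serial = [convert(name) for name in df.columns]
print("serial:  ", time.perf_counter() - start)

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    threaded = list(pool.map(convert, df.columns))
print("threaded:", time.perf_counter() - start)
```

If the object-dtype conversion holds the GIL, the threaded timing will be close to (or worse than) the serial one.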

Such a big slowdown as you mention is still strange, though, so if you can look into it further, that's certainly welcome.

Wes McKinney / @wesm:
Issue resolved by pull request #7563

asfimport added this to the 1.0.0 milestone on Jan 11, 2023