
[Python] Heuristic in dataframe_to_arrays that decides whether to multithread the conversion causes slow conversions #17194

Closed
asfimport opened this issue May 22, 2020 · 5 comments


When calling pa.Table.from_pandas(), the code path that uses the ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas) makes the conversion much, much slower.

I have a simple example, but the time difference is much worse with a real table.

```
Python 3.7.3 | packaged by conda-forge | (default, Dec  6 2019, 08:54:18)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pyarrow as pa

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({"A": [0] * 10000000})

In [4]: %timeit table = pa.Table.from_pandas(df)
577 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
106 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```


Environment: macOS 10.15.4 (also happening on Windows 10)
Python: 3.7.3
PyArrow: 0.16.0
Pandas: 0.25.3
Reporter: Kevin Glasson
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-8888. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
There is currently a heuristic, based on the number of rows vs the number of columns, that decides whether to use multithreading when it is not specified by the user (https://github.com/apache/arrow/blob/d00c50a6ca0d88e3458742091c59f0fc5c2fc7de/python/pyarrow/pandas_compat.py#L541-L549).
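
For illustration, the decision looks roughly like this (a paraphrased sketch, not the exact source; the real thresholds live in the linked pandas_compat.py):

```python
import pyarrow as pa

def choose_nthreads(df, nthreads=None):
    # Sketch of the default heuristic: multithread when the frame has
    # "many rows relative to columns". The exact cutoff here is illustrative.
    if nthreads is None:
        nrows, ncols = len(df), len(df.columns)
        if nrows > ncols * 100:
            nthreads = pa.cpu_count()
        else:
            nthreads = 1
    return nthreads
```

Note that a single-column frame with many rows (as in the report above) still takes the threaded path, even though there is only one column to parallelize over.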

This will probably not be the best decision for all cases. For example, here you have only a single column, and the parallelization works by processing each column in its own thread, so with a single column a thread pool only adds overhead. Also, ints are very cheap (zero-copy) to convert.

So I suspect that with more columns and a more expensive conversion, you will see the benefit of the default multithreading. For example, using floats instead of ints, and more columns:

```
In [1]: df = pd.DataFrame({key: [0.0] * 1_000_000 for key in range(100)})

In [2]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
1.3 s ± 7.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit table = pa.Table.from_pandas(df, nthreads=None)
327 ms ± 5.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

(With this number of columns but only ints, the single-threaded path is still faster.)

So you are certainly welcome to look into the default heuristic that decides how many threads are used, and whether it can be improved (e.g. ensuring we never use more threads than there are columns; see the sketch below), but I think it will never be ideal for all possible use cases.
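
A minimal version of that refinement might look like this (illustrative only, not the actual fix that landed):

```python
import pyarrow as pa

def choose_nthreads_capped(df, nthreads=None):
    if nthreads is None:
        nrows, ncols = len(df), len(df.columns)
        # Same illustrative cutoff as in the sketch above.
        nthreads = pa.cpu_count() if nrows > ncols * 100 else 1
    # Never spawn more threads than there are columns to convert.
    return min(nthreads, len(df.columns))
```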

Joris Van den Bossche / @jorisvandenbossche:
BTW, you mention that for your real use case the difference is worse than in the simple example. What are the characteristics of your actual use case (number of columns, data types, etc.)?

Kevin Glasson:
Yeah - so off the top of my head it was about 6 million rows and 40 columns, mostly string objects, mixed in with some timestamps.

When I was profiling it, I could see it was spending nearly all of its time in 'threading'.

The reduction was 10x: instead of a write taking 50 minutes, it took 4. There could be other inefficiencies in my code, of course, but just changing that one flag gave me that massive reduction.

I can try and share the profiling if I get around to running it again.
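
For reference, one quick way to capture such a profile (a sketch; it assumes a DataFrame `df` like the one in the report above):

```python
import cProfile

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"A": [0] * 10_000_000})

# Default (heuristic-chosen) thread count vs. forced single thread:
cProfile.run("pa.Table.from_pandas(df)", sort="cumtime")
cProfile.run("pa.Table.from_pandas(df, nthreads=1)", sort="cumtime")
```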

Joris Van den Bossche / @jorisvandenbossche:
Trying with strings, I don't see much speedup from multithreading in that case, but also no significant slowdown:

```
In [10]: df = pd.DataFrame({key: ['a'] * 1_000_000 for key in range(10)})

In [11]: %timeit table = pa.Table.from_pandas(df)
3.43 s ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
3.79 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Strings are stored in pandas as Python objects. I am not fully sure how the conversion in arrow is implemented, but it might be that it requires the GIL, in which case multithreading won't help.
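
A rough way to observe this effect outside of from_pandas is a micro-benchmark that converts columns with pa.array, serially vs. in a thread pool (a hypothetical sketch, not part of the issue):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({key: ["a"] * 1_000_000 for key in range(10)})

def convert(name):
    # Column-wise conversion, mirroring what dataframe_to_arrays does.
    return pa.array(df[name])

start = time.perf_counter()
serial = [convert(name) for name in df.columns]
print("serial:  ", time.perf_counter() - start)

start = time.perf_counter()
with ThreadPoolExecutor() as pool:
    threaded = list(pool.map(convert, df.columns))
print("threaded:", time.perf_counter() - start)
```

If the object-dtype conversion holds the GIL, the threaded timing will be close to (or worse than) the serial one.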

Such a big slowdown as you mention is still strange, though, so if you can look into it further, that's certainly welcome.

Wes McKinney / @wesm:
Issue resolved by pull request #7563

asfimport added this to the 1.0.0 milestone on Jan 11, 2023