[Python] Heuristic in dataframe_to_arrays that decides whether to multithread the conversion causes slow conversions #17194
Comments
Joris Van den Bossche / @jorisvandenbossche: And this will probably not be the best decision for all cases. For example, if you have only a single column: the parallelization is done by processing each column in a thread, so with a single column a thread pool only adds unnecessary overhead. Also, ints are very cheap (zero-copy) to convert. So I suspect that with more columns and a more expensive conversion, you will see the benefit of the default multithreading. For example, using floats instead of ints and more columns:

In [1]: df = pd.DataFrame({key: [0.0] * 1_000_000 for key in range(100)})
In [2]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
1.3 s ± 7.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit table = pa.Table.from_pandas(df, nthreads=None)
327 ms ± 5.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

(With this number of columns but with only ints, using no threads is still faster.) So you are certainly welcome to look into the default heuristic that decides how many threads are used and whether it can be improved (e.g. ensuring that we never use more threads than there are columns), but I think it will never be ideal for all possible use cases.
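One concrete improvement along the lines suggested above, capping the worker count at the column count, could be sketched as follows. `choose_nthreads` is a hypothetical helper for illustration, not part of pyarrow's API:

```python
import os


def choose_nthreads(n_columns, requested=None):
    """Hypothetical sketch: never use more worker threads than columns.

    A single-column frame is always converted serially, avoiding the
    thread-pool overhead described in this issue.
    """
    if requested is not None:
        return max(1, min(requested, n_columns))
    cpus = os.cpu_count() or 1
    return max(1, min(cpus, n_columns))
```

With this rule, `choose_nthreads(1)` is always 1, so the degenerate single-column case never pays for a pool, while wide frames still fan out across the available cores.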
Kevin Glasson: When I was profiling it, I could see it was spending nearly all of its time in 'threading'. The reduction was 10x: instead of a write taking 50 minutes, it took 4. There could be other inefficiencies in my code, of course, but changing that one flag alone gave me that massive reduction. I can try to share the profiling output if I get around to running it again.
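For reference, the kind of profiling described here can be reproduced with the standard library alone. This toy version (where `convert_column` is a stand-in for the real per-column conversion, not pyarrow code) profiles a thread-pooled conversion from the main thread; lock acquisition dominating the report is the typical signature of a pool that is mostly waiting:

```python
import cProfile
import io
import pstats
from concurrent.futures import ThreadPoolExecutor


def convert_column(col):
    # Stand-in for a per-column conversion step.
    return [x * 2 for x in col]


def convert_parallel(columns):
    # Mirrors the shape of the dataframe_to_arrays thread pool:
    # one task per column.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(convert_column, columns))


columns = [list(range(1000)) for _ in range(8)]

profiler = cProfile.Profile()
profiler.enable()
result = convert_parallel(columns)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()  # top call sites; look for lock.acquire near the top
```

Note that `cProfile` only traces the thread it was enabled on, so what it shows here is how long the main thread spends blocked on the pool rather than the per-worker cost.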
Joris Van den Bossche / @jorisvandenbossche:

In [10]: df = pd.DataFrame({key: ['a'] * 1_000_000 for key in range(10)})
In [11]: %timeit table = pa.Table.from_pandas(df)
3.43 s ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]: %timeit table = pa.Table.from_pandas(df, nthreads=1)
3.79 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Strings are stored in pandas as Python objects. I am not fully sure how the conversion in Arrow is implemented, but it may be that it requires the GIL, in which case multithreading won't help. Such a big slowdown as you mention is still strange though, so if you can look into it further, that's certainly welcome.
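The GIL effect mentioned above is easy to demonstrate without pyarrow at all. In this sketch, `convert_objects` stands in for an element-by-element object-dtype conversion; because it is pure Python, it holds the GIL the whole time, so spreading it over a thread pool cannot make it faster:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def convert_objects(values):
    # Pure-Python work: holds the GIL throughout, like converting an
    # object-dtype column element by element.
    return [str(v) for v in values]


columns = [list(range(50_000)) for _ in range(4)]

t0 = time.perf_counter()
serial = [convert_objects(c) for c in columns]
t_serial = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(convert_objects, columns))
t_threaded = time.perf_counter() - t0

# On CPython, t_threaded is typically no better than t_serial: the GIL
# serializes the work, and the pool only adds scheduling overhead.
```

This is why conversions that release the GIL in C++ (numeric columns) can benefit from the thread pool while object-dtype string columns may not.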
When pa.Table.from_pandas() takes the code path that uses the ThreadPoolExecutor in dataframe_to_arrays (called by Table.from_pandas), the conversion is much, much slower.
I have a simple example, but the time difference is much worse with a real table.
Environment: MacOS: 10.15.4 (Also happening on windows 10)
Python: 3.7.3
Pyarrow: 0.16.0
Pandas: 0.25.3
Reporter: Kevin Glasson
Assignee: Wes McKinney / @wesm
PRs and other links:
Note: This issue was originally created as ARROW-8888. Please see the migration documentation for further details.