[Python] pandas compat: clean-up code in columns Index reconstruction in Table.to_pandas #40720

jorisvandenbossche · 2024-03-21T17:31:29Z

Over the years, pandas_compat.py has grown quite complex and accumulated a lot of pandas compatibility code, much of which can probably be simplified now that old pandas and Python versions are no longer supported.

One aspect I noticed while working on to_pandas was the complexity of reconstructing the .columns Index object of the resulting DataFrame. Right now that always goes through a MultiIndex (even for simple column names), which adds considerable overhead in the simple case.
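
To illustrate the kind of overhead described above, here is a minimal sketch that uses only public pandas API. It is not the code in pandas_compat.py: the helper names `columns_via_multiindex` and `columns_direct` are made up for this illustration. Both build the same flat `.columns` Index for simple string column names, one by way of a single-level MultiIndex and one directly.

```python
import timeit

import pandas as pd

names = ["a", "b", "c"]


def columns_via_multiindex(names):
    # Hypothetical helper: build a one-level MultiIndex, then flatten it
    # back to a plain Index by taking its only level.
    mi = pd.MultiIndex.from_arrays([names])
    return mi.get_level_values(0)


def columns_direct(names):
    # Hypothetical helper: construct the flat Index directly.
    return pd.Index(names)


# Both routes give an equal Index for simple column names.
assert columns_via_multiindex(names).equals(columns_direct(names))

# Rough timing comparison; absolute numbers depend on machine and pandas version.
for func in (columns_via_multiindex, columns_direct):
    seconds = timeit.timeit(lambda: func(names), number=1_000)
    print(f"{func.__name__}: {seconds / 1_000 * 1e6:.1f} µs per call")
```

The point of the sketch is only that the detour through a MultiIndex does extra work to arrive at the same result. How much it costs inside `Table.to_pandas` is best measured on the real code path.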

…n names in Table.to_pandas (#40721)

### Rationale for this change

Over the years, `pandas_compat.py` has grown quite complex and accumulated a lot of pandas compatibility code, much of which can probably be simplified now that old pandas and Python versions are no longer supported.

One part of the code where this is the case is the reconstruction of the `.columns` Index object of the resulting DataFrame. Right now that always goes through a MultiIndex (even for simple column names), which adds considerable overhead in the simple case. It also contains some old Python/pandas compat code that could be removed.

### What changes are included in this PR?

The simplification to not go through a MultiIndex for the simple cases gives a nice speed-up as well:

```python
In [1]: table = pa.table({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3], 'c': [3, 4, 5]})

In [2]: %timeit table.to_pandas()
251 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)  # <-- main
68.1 µs ± 894 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  # <-- PR
```

### Are these changes tested?

We should have extensive existing tests for this.

### Are there any user-facing changes?

That should not be the case.

* GitHub Issue: #40720

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

jorisvandenbossche · 2024-03-26T14:37:47Z

Issue resolved by pull request #40721

corneliusroemer · 2024-06-29T12:06:37Z

I tried to find out which versions of pandas and Python are supported, but couldn't find anything. It would be great if you could point me somewhere, @jorisvandenbossche.

For example, I noticed that PyArrow 16.1.0 no longer seems to be usable with pandas 1.5.3 (I think it worked until 16 was released). I'm getting AttributeError: module 'pyarrow' has no attribute '__version__' from pandas.
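
One way to narrow down an error like this is to check which versions are actually installed and which file Python would import as `pyarrow`. The sketch below uses only the standard library. The helper name `describe_package` is made up for this illustration, and it is an assumption (not confirmed anywhere in this thread) that the error stems from an incomplete install or from a local file or directory shadowing the real package rather than from a version incompatibility.

```python
import importlib.util
from importlib import metadata


def describe_package(name):
    """Return (installed version or None, import location or None) for a package."""
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None  # no installed distribution metadata found

    spec = importlib.util.find_spec(name)  # locates the module without importing it
    if spec is None:
        location = None
    else:
        # Namespace packages have no origin; show their search paths instead.
        location = spec.origin or str(list(spec.submodule_search_locations or []))
    return version, location


for pkg in ("pyarrow", "pandas"):
    print(pkg, *describe_package(pkg))
```

If the reported version is `None`, or the location is not inside the environment's site-packages directory, that would point towards an environment problem; if both look right, the version combination itself becomes the more likely suspect.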

jorisvandenbossche added the Component: Python label Mar 21, 2024

jorisvandenbossche self-assigned this Mar 21, 2024

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Mar 21, 2024

apacheGH-40720: [Python] Simplify and improve perf of creation of the column names in Table.to_pandas (7958396)

github-actions bot mentioned this issue Mar 21, 2024

GH-40720: [Python] Simplify and improve perf of creation of the column names in Table.to_pandas #40721

Merged

jorisvandenbossche added this to the 16.0.0 milestone Mar 26, 2024

jorisvandenbossche closed this as completed Mar 26, 2024

mzappitello mentioned this issue May 13, 2024

Chore(deps): Bump pyarrow from 15.0.0 to 16.0.0 mbta/lamp#320

Merged
