[Python] pandas compat: clean-up code in columns Index reconstruction in Table.to_pandas #40720

Closed
jorisvandenbossche opened this issue Mar 21, 2024 · 2 comments

@jorisvandenbossche
Member

Over the years, pandas_compat.py has grown quite complex and accumulated a lot of pandas compatibility code, which can probably be simplified now that old pandas and Python versions are no longer supported.

One aspect that I noticed while working on to_pandas was the complexity in the reconstruction of the .columns Index object of the resulting DataFrame. Right now that always goes through a MultiIndex (even for simple column names), which adds considerable overhead in the simple case.

@jorisvandenbossche jorisvandenbossche self-assigned this Mar 21, 2024
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Mar 21, 2024
jorisvandenbossche added a commit that referenced this issue Mar 26, 2024
…n names in Table.to_pandas (#40721)

### Rationale for this change

Over the years, `pandas_compat.py` has grown quite complex and accumulated a lot of pandas compatibility code, which can probably be simplified now that old pandas and Python versions are no longer supported.

One part of the code where this is the case is the reconstruction of the `.columns` Index object of the resulting DataFrame. Right now that always goes through a MultiIndex (even for simple column names), which adds considerable overhead in the simple case. It also contains some old Python/pandas compat code that could be removed.
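
To make the idea concrete, here is a minimal sketch of choosing the Index type based on the column labels. It is not the code in `pandas_compat.py`; `rebuild_columns` is a hypothetical helper, and treating tuple labels as the multi-level case is an assumption made for illustration.

```python
import pandas as pd


def rebuild_columns(names):
    """Hypothetical helper: pick an Index type that matches the column labels."""
    # Assumption for this sketch: tuple labels mean multi-level columns,
    # which are the only case that needs a MultiIndex.
    if names and all(isinstance(n, tuple) for n in names):
        return pd.MultiIndex.from_tuples(names)
    # Simple labels are wrapped in a flat Index, skipping MultiIndex construction.
    return pd.Index(names)


flat = rebuild_columns(["a", "b", "c"])
nested = rebuild_columns([("x", "a"), ("x", "b")])
print(type(flat).__name__, type(nested).__name__)
```

The point of the sketch is only that the flat path never builds a MultiIndex; the real reconstruction also has to handle column metadata stored in the Arrow schema.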

### What changes are included in this PR?

Not going through a MultiIndex for the simple cases also gives a nice speed-up (roughly 3.7x on this small table):

```python
In [1]: import pyarrow as pa

In [2]: table = pa.table({'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3], 'c': [3, 4, 5]})

In [3]: %timeit table.to_pandas()
251 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)   # <-- main
68.1 µs ± 894 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)  # <-- PR
```

### Are these changes tested?

The extensive existing tests should already cover this.

### Are there any user-facing changes?

There should not be any.
* GitHub Issue: #40720

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche added this to the 16.0.0 milestone Mar 26, 2024
@jorisvandenbossche
Member Author

Issue resolved by pull request #40721

@corneliusroemer

I tried to find out what the supported versions of pandas/python are, but couldn't find anything. It would be great if you could point me somewhere @jorisvandenbossche.

For example, I noticed that PyArrow 16.1.0 no longer seems to be usable with pandas 1.5.3 (I think it worked until 16 was released). I'm getting AttributeError: module 'pyarrow' has no attribute '__version__' from pandas.
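
For anyone reporting a similar version mismatch, a small sketch like the following can list the installed versions without relying on the `__version__` attribute that fails above. It only reads package metadata from the standard library; `installed_version` is a hypothetical helper name, and this does not diagnose the cause of the error.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages discussed in this thread.
for name in ("pyarrow", "pandas"):
    print(name, installed_version(name))
```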
