[Python] avoid using pandas internals #35081

Closed
2 tasks done
jbrockmendel opened this issue Apr 12, 2023 · 3 comments

jbrockmendel commented Apr 12, 2023


Describe the bug, including details regarding any error messages, version, and platform.

ATM pyarrow passes a BlockManager to the pd.DataFrame constructor, but in doing so accesses pandas functions/classes that are not public and that some pandas maintainers (me) would like to wean arrow off of (xref pandas-dev/pandas#52419).

@jorisvandenbossche tells me the current usage is performance-motivated, particularly (but not exclusively?) to avoid the performance hit associated with pandas silently consolidating, which it no longer does in 2.0.

Let's see if we can find an alternative using pandas' public API. Starting bid: is pd.DataFrame.from_arrays a viable alternative?
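
For illustration, here is a minimal sketch of building a DataFrame from a list of 1-D arrays using only the public constructor. The column names and array values are made up for this example, and `copy=False` is only a request: whether pandas actually avoids copying or consolidating depends on the pandas version, which this sketch does not verify.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: one 1-D array per column, plus matching column names.
names = ["a", "b"]
arrays = [
    np.array([1, 2, 3], dtype="int64"),
    np.array([0.5, 1.5, 2.5], dtype="float64"),
]

# Public API only: pass a dict mapping column name -> array to the constructor.
# copy=False asks pandas not to copy the inputs where it is able to honor that.
df = pd.DataFrame(dict(zip(names, arrays)), copy=False)

print(df.dtypes)
```

This stays entirely on the public constructor, at the cost of giving the caller no direct control over how columns are grouped internally.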

Component(s)

Python

jorisvandenbossche changed the title from "REF: avoid using pandas internals" to "[Python] avoid using pandas internals" Apr 12, 2023
jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Oct 18, 2023

graingert commented Oct 18, 2023

pandas 2.2.0.dev0+390.g59d4e84128 now raises "DeprecationWarning: Passing a BlockManager to DataFrame is deprecated and will raise in a future version. Use public APIs instead." for this usage:

import pyarrow.dataset
import fsspec

paths = [
    "https://github.com/Parquet/parquet-compatibility/raw/master/parquet-testdata/impala/1.1.1-NONE/nation.impala.parquet"
]
(
    pyarrow.dataset.dataset(paths, filesystem=fsspec.filesystem("http"))
    .schema.empty_table()
    .to_pandas()
)

The deprecation landed in pandas-dev/pandas#52419.

jorisvandenbossche commented

@graingert Yes, although you shouldn't see that as an end user; that is being fixed on the pandas side (see pandas-dev/pandas#52419 (comment)).

You will still see the warning when all warnings are enabled or turned into errors (for example when running tests), so we have to switch to a different API in pyarrow.
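
To make the point about warning filters concrete, here is a small sketch using only the standard library `warnings` module. `legacy_call` is a hypothetical stand-in for a library call that emits a `DeprecationWarning`; it is not pyarrow or pandas code.

```python
import warnings


def legacy_call():
    # Hypothetical stand-in for a library call that emits a DeprecationWarning.
    warnings.warn(
        "Passing a BlockManager to DataFrame is deprecated",
        DeprecationWarning,
        stacklevel=2,
    )
    return "ok"


def run_with_filter(action):
    """Run legacy_call under one warnings filter action and report the outcome."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        try:
            result = legacy_call()
        except DeprecationWarning:
            # With the "error" action the warning is raised as an exception.
            return "raised", 0
        return result, len(caught)


ignored = run_with_filter("ignore")  # warning suppressed
shown = run_with_filter("always")    # warning recorded
errored = run_with_filter("error")   # warning raised as an exception

print(ignored, shown, errored)
```

Test runners that show all warnings or turn them into errors correspond to the last two cases, which is why the deprecation surfaces there even if end users never see it.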

jorisvandenbossche added a commit that referenced this issue Jan 8, 2024
…38321)

### Rationale for this change

This usage probably stems from a time when it was required to specify the Block type; nowadays it is enough to specify the dtype, which cuts down on our usage of internal pandas objects.

Part of #35081

* Closes: #38341

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
jorisvandenbossche added this to the 16.0.0 milestone Mar 21, 2024
raulcd added the "Priority: Blocker" label Apr 11, 2024
jorisvandenbossche added a commit that referenced this issue Apr 16, 2024
…pandas` (#40897)

### Rationale for this change

Avoid using pandas internals to create Block objects ourselves by using a new API for pandas>=3.

* GitHub Issue: #35081

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
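
The commit message above describes choosing a code path based on the pandas version ("a new API for pandas>=3"). A minimal, stdlib-only sketch of that kind of version gate follows. The function names and the returned labels are hypothetical and do not come from pyarrow or pandas; the thread does not name the new API, so it is not named here either.

```python
import re


def pandas_major(version: str) -> int:
    """Extract the leading major number from a version string.

    Works for development versions such as '2.2.0.dev0+390.g59d4e84128'.
    """
    match = re.match(r"\s*(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1))


def choose_construction_path(version: str) -> str:
    # Hypothetical labels for the two code paths; the real code would call
    # the corresponding construction routine instead of returning a string.
    return "new_api" if pandas_major(version) >= 3 else "legacy_internals"


print(choose_construction_path("2.2.0.dev0+390.g59d4e84128"))
print(choose_construction_path("3.0.0"))
```

Gating on the major version only keeps the check robust against development and local version suffixes.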
jorisvandenbossche commented

Issue resolved by pull request #40897.
