[Python] avoid using pandas internals #35081
Comments
pandas 2.2.0.dev0+390.g59d4e84128 raises a warning with the following code:

    import pyarrow.dataset
    import fsspec

    paths = [
        "https://github.com/Parquet/parquet-compatibility/raw/master/parquet-testdata/impala/1.1.1-NONE/nation.impala.parquet"
    ]
    (
        pyarrow.dataset.dataset(paths, filesystem=fsspec.filesystem("http"))
        .schema.empty_table()
        .to_pandas()
    )

This landed in pandas-dev/pandas#52419.
@graingert yes, although you shouldn't see that as an end user; that is being fixed on the pandas side (-> pandas-dev/pandas#52419 (comment)). You will still see the warning when all warnings are enabled to show or to raise as errors (like when running tests), so we have to switch to a different API in pyarrow.
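To illustrate why the warning is invisible to most end users but breaks test runs, here is a minimal standard-library sketch. It assumes the warning is a `DeprecationWarning` (the thread does not name the category), and `legacy_call` is a hypothetical stand-in for the `to_pandas()` call, not pyarrow code.

```python
import warnings


def legacy_call():
    # Hypothetical stand-in for a library call that touches a deprecated
    # internal API. The category is an assumption, not taken from the thread.
    warnings.warn("internal API is deprecated", DeprecationWarning, stacklevel=2)
    return "ok"


def call_with_warnings_as_errors(func):
    # Mimics a test runner configured to turn every warning into an error.
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        return func()


# With warnings escalated, the call raises instead of returning.
try:
    call_with_warnings_as_errors(legacy_call)
    escalated = False
except DeprecationWarning:
    escalated = True

# With the category ignored, the call completes normally.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    result = legacy_call()
```

Silencing the category only hides the symptom, which is why the comment above concludes that pyarrow has to move to a different API.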
…38321)

### Rationale for this change

This usage probably stems from a long time ago that it was required to specify the Block type, but nowadays it's good enough to just specify the dtype, and thus cutting down on our usage of internal pandas objects.

Part of #35081

* Closes: #38341

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…pandas` (#40897)

### Rationale for this change

Avoiding using pandas internals to create Block objects ourselves, using a new API for pandas>=3

* GitHub Issue: #35081

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Issue resolved by pull request 40897
Describe the bug, including details regarding any error messages, version, and platform.
ATM pyarrow passes a BlockManager to the pd.DataFrame constructor, but in doing so accesses pandas functions/classes that are not public and that some pandas maintainers (me) would like to wean arrow off of (xref pandas-dev/pandas#52419).
@jorisvandenbossche tells me the current usage is performance motivated, particularly (but not exclusively?) the performance hit associated with pandas silently consolidating, which it no longer does in 2.0.
Let's see if we can find an alternative using pandas' public API. Starting bid: is pd.DataFrame.from_arrays a viable alternative?
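As a rough sketch of what a public-API-only construction path could look like, the snippet below builds a DataFrame from a list of 1-D NumPy arrays through the regular `pd.DataFrame` constructor. This is not what pyarrow does, and it is not the `from_arrays` method proposed above; `frame_from_arrays` and the column names are hypothetical. Whether `copy=False` actually avoids a copy depends on the pandas version and the dtypes involved, so no performance claim is made here.

```python
import numpy as np
import pandas as pd


def frame_from_arrays(names, arrays):
    """Build a DataFrame from 1-D arrays using only the public constructor.

    Hypothetical helper for illustration. Column names must be unique,
    because they are used as dict keys.
    """
    if len(names) != len(arrays):
        raise ValueError("names and arrays must have the same length")
    if len(set(names)) != len(names):
        raise ValueError("column names must be unique")
    # copy=False asks pandas not to copy the inputs where it can avoid it;
    # it is a request, not a guarantee.
    return pd.DataFrame(dict(zip(names, arrays)), copy=False)


df = frame_from_arrays(
    ["a", "b"],
    [np.arange(3, dtype="int64"), np.array([0.5, 1.5, 2.5], dtype="float64")],
)
```

A dict-based path like this cannot represent duplicate column names, which is one reason a dedicated array-based constructor is attractive for a library such as pyarrow.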
Component(s)
Python