Avoid `pd.core` import in dask utils by mroeschke · Pull Request #9907 · dask/dask

mroeschke · 2023-02-01T03:22:41Z

Closes #xxxx
Tests added / passed
Passes pre-commit run --all-files

pandas.core is documented as private https://pandas.pydata.org/pandas-docs/stable/reference/index.html, and additionally I plan to break pd.core.strings.StringMethods as it's causing a circular import for work on integrating string method support in pd.arrays.ArrowExtensionArray

GPUtester · 2023-02-01T03:22:43Z

Can one of the admins verify this patch?

Admins can comment ok to test to allow this one PR to run or add to allowlist to allow all future PRs from the same author to run.

quasiben · 2023-02-01T14:12:22Z

add to allowlist

quasiben · 2023-02-01T14:15:01Z

pd.core is also used for pulling docstrings like:

dask/dask/dataframe/groupby.py

Line 1672 in 834a19e

@derived_from(pd.core.groupby.GroupBy)

Is there a better way to do this ?

jrbourbeau

Thanks @mroeschke! Happy to move away from pandas.core

pd.core is also used for pulling docstrings like

Yeah, that's a good point. Grepping through the codebase it looks like we use

pd.core.groupby.GroupBy
pd.core.groupby.DataFrameGroupBy.aggregate
pd.core.groupby.DataFrameGroupBy.agg
pd.core.groupby.DataFrameGroupBy
pd.core.groupby.SeriesGroupBy
pd.core.groupby.SeriesGroupBy.aggregate
pd.core.groupby.SeriesGroupBy.agg
pd.core.indexing.IndexingError

So mostly groupby-related things and IndexingError. I'm not sure about the groupby objects though

For IndexingError it looks like we can use pd.errors.IndexingError instead

In [1]: import pandas as pd

In [2]: pd.errors.IndexingError
Out[2]: pandas.errors.IndexingError

In [3]: pd.core.indexing.IndexingError
Out[3]: pandas.errors.IndexingError

mroeschke · 2023-02-01T18:10:55Z

Unfortunately the groupby objects are not exposed publicly yet. There is an issue, pandas-dev/pandas#48577, discussing how to expose these.

jrbourbeau

Sorry I didn't check beforehand, it looks like pd.errors.IndexingError was only added in pandas=1.5 (our current minimum supported version is 1.0). I'd recommend something like the following

diff --git a/dask/dataframe/_compat.py b/dask/dataframe/_compat.py
index 9791345ad..a0434065e 100644
--- a/dask/dataframe/_compat.py
+++ b/dask/dataframe/_compat.py
@@ -106,3 +106,9 @@ def dtype_eq(a: type, b: type) -> bool:
     ):
         return False
     return a == b
+
+
+if PANDAS_GT_150:
+    IndexingError = pd.errors.IndexingError
+else:
+    IndexingError = pd.core.indexing.IndexingError
diff --git a/dask/dataframe/indexing.py b/dask/dataframe/indexing.py
index 9a88c7b36..656a9cf8d 100644
--- a/dask/dataframe/indexing.py
+++ b/dask/dataframe/indexing.py
@@ -9,7 +9,7 @@ from pandas.api.types import is_bool_dtype
 from dask.array.core import Array
 from dask.base import tokenize
 from dask.dataframe import methods
-from dask.dataframe._compat import PANDAS_GT_130
+from dask.dataframe._compat import PANDAS_GT_130, IndexingError
 from dask.dataframe.core import Series, new_dd_object
 from dask.dataframe.utils import is_index_like, is_series_like, meta_nonempty
 from dask.highlevelgraph import HighLevelGraph
@@ -93,7 +93,7 @@ class _LocIndexer(_IndexerBase):
             if len(key) > self.obj.ndim:
                 # raise from pandas
                 msg = "Too many indexers"
-                raise pd.errors.IndexingError(msg)
+                raise IndexingError(msg)

             iindexer = key[0]
             cindexer = key[1]
diff --git a/dask/dataframe/tests/test_indexing.py b/dask/dataframe/tests/test_indexing.py
index 514b94041..bef985daf 100644
--- a/dask/dataframe/tests/test_indexing.py
+++ b/dask/dataframe/tests/test_indexing.py
@@ -5,7 +5,7 @@ import pytest
 import dask
 import dask.dataframe as dd
 from dask.base import tokenize
-from dask.dataframe._compat import PANDAS_GT_110, PANDAS_GT_120, tm
+from dask.dataframe._compat import PANDAS_GT_110, PANDAS_GT_120, IndexingError, tm
 from dask.dataframe.indexing import _coerce_loc_index
 from dask.dataframe.utils import assert_eq, make_meta

@@ -188,17 +188,17 @@ def test_loc2d():
     assert_eq(d.loc[3:, ["a"]], full.loc[3:, ["a"]])

     # 3d
-    with pytest.raises(pd.errors.indexing.IndexingError):
+    with pytest.raises(IndexingError):
         d.loc[3, 3, 3]

     # Series should raise
-    with pytest.raises(pd.errors.indexing.IndexingError):
+    with pytest.raises(IndexingError):
         d.a.loc[3, 3]

-    with pytest.raises(pd.errors.indexing.IndexingError):
+    with pytest.raises(IndexingError):
         d.a.loc[3:, 3]

-    with pytest.raises(pd.errors.indexing.IndexingError):
+    with pytest.raises(IndexingError):
         d.a.loc[d.a % 2 == 0, 3]

jrbourbeau

Thanks @mroeschke! This is in

Also, I noticed this is your first code contribution to this repository. Welcome!

mroeschke · 2023-02-02T16:08:16Z

Thanks for the review!

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* [WIP] feat(datasets): add the option to use pandas 2.0.0 * chore(datasets): add pandas<2.0.0 back to versions * Update test_requirements.txt * Unbound `pyarrow` in datasets testing requirements * Change import to 2.0.0-compatible `pandas.testing` Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Unpin Dask to get dask/dask#9907 Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Fix requirements.txt using pre-commit hook Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Use one partition for Dask so indexes are the same Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Don't require unavailable pandas 2.0.0 for 3.7 now * Unleash the pandas :panda: * Check whether SQLAlchemy 2.0.x will break stuff TM * chore: "sort" requirements? Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Update RELEASE.md Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * Use `Inspector.has_table` instead of deprecated `Engine.table_name`` Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu> * chore: clean up relaxation of SQLAlchemy to 2.0.x * docs(datasets): update docstrings to reference latest SQLAlchemy * docs(datasets): always point to latet SQLAlchemy docs --------- Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

Avoid pd.core import in dask utils

cba26d4

github-actions bot added the dataframe label Feb 1, 2023

jrbourbeau reviewed Feb 1, 2023

View reviewed changes

Import IndexError from pd.errors

234f36a

jrbourbeau reviewed Feb 1, 2023

View reviewed changes

jrbourbeau changed the title ~~Avoid pd.core import in dask utils~~ Avoid pd.core import in dask utils Feb 1, 2023

mroeschke added 2 commits February 1, 2023 16:02

Version control IndexingError

225b9c5

Linting

54da45a

jrbourbeau approved these changes Feb 2, 2023

View reviewed changes

jrbourbeau merged commit e2dfb3c into dask:main Feb 2, 2023

mroeschke deleted the doc/avoid_pd_core branch February 2, 2023 16:08

mroeschke mentioned this pull request Feb 7, 2023

ENH: Allow ArrowDtype(pa.string()) to be compatable with str accessor pandas-dev/pandas#51207

Merged

deepyaman added a commit to kedro-org/kedro-plugins that referenced this pull request Mar 16, 2023

Unpin Dask to get dask/dask#9907

fbb2993

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

This was referenced Apr 8, 2023

[python-package] support pandas 2.0 lightgbm-org/LightGBM#5739

Open

Using pandas >= 2.0 with dask.dataframe requires dask >= 2023.2.0 #10164

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Avoid `pd.core` import in dask utils#9907

Avoid `pd.core` import in dask utils#9907
jrbourbeau merged 4 commits intodask:mainfrom
mroeschke:doc/avoid_pd_core

mroeschke commented Feb 1, 2023

Uh oh!

GPUtester commented Feb 1, 2023

Uh oh!

quasiben commented Feb 1, 2023

Uh oh!

quasiben commented Feb 1, 2023

Uh oh!

jrbourbeau left a comment

Uh oh!

mroeschke commented Feb 1, 2023

Uh oh!

jrbourbeau left a comment

Uh oh!

jrbourbeau left a comment

Uh oh!

mroeschke commented Feb 2, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

mroeschke commented Feb 1, 2023

Uh oh!

GPUtester commented Feb 1, 2023

Uh oh!

quasiben commented Feb 1, 2023

Uh oh!

quasiben commented Feb 1, 2023

Uh oh!

jrbourbeau left a comment

Choose a reason for hiding this comment

Uh oh!

mroeschke commented Feb 1, 2023

Uh oh!

jrbourbeau left a comment

Choose a reason for hiding this comment

Uh oh!

jrbourbeau left a comment

Choose a reason for hiding this comment

Uh oh!

mroeschke commented Feb 2, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants