Skip to content

Avoid pd.core import in dask utils#9907

Merged
jrbourbeau merged 4 commits intodask:mainfrom
mroeschke:doc/avoid_pd_core
Feb 2, 2023
Merged

Avoid pd.core import in dask utils#9907
jrbourbeau merged 4 commits intodask:mainfrom
mroeschke:doc/avoid_pd_core

Conversation

@mroeschke
Copy link
Copy Markdown
Contributor

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

cc @jrbourbeau

pandas.core is documented as private https://pandas.pydata.org/pandas-docs/stable/reference/index.html, and additionally I plan to break pd.core.strings.StringMethods as it's causing a circular import for work on integrating string method support in pd.arrays.ArrowExtensionArray

@GPUtester
Copy link
Copy Markdown
Collaborator

Can one of the admins verify this patch?

Admins can comment ok to test to allow this one PR to run or add to allowlist to allow all future PRs from the same author to run.

@quasiben
Copy link
Copy Markdown
Member

quasiben commented Feb 1, 2023

add to allowlist

@quasiben
Copy link
Copy Markdown
Member

quasiben commented Feb 1, 2023

pd.core is also used for pulling docstrings like:

@derived_from(pd.core.groupby.GroupBy)

Is there a better way to do this ?

Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mroeschke! Happy to move away from pandas.core

pd.core is also used for pulling docstrings like

Yeah, that's a good point. Grepping through the codebase it looks like we use

  • pd.core.groupby.GroupBy
  • pd.core.groupby.DataFrameGroupBy.aggregate
  • pd.core.groupby.DataFrameGroupBy.agg
  • pd.core.groupby.DataFrameGroupBy
  • pd.core.groupby.SeriesGroupBy
  • pd.core.groupby.SeriesGroupBy.aggregate
  • pd.core.groupby.SeriesGroupBy.agg
  • pd.core.indexing.IndexingError

So mostly groupby-related things and IndexingError. I'm not sure about the groupby objects though

For IndexingError it looks like we can use pd.errors.IndexingError instead

In [1]: import pandas as pd

In [2]: pd.errors.IndexingError
Out[2]: pandas.errors.IndexingError

In [3]: pd.core.indexing.IndexingError
Out[3]: pandas.errors.IndexingError

@mroeschke
Copy link
Copy Markdown
Contributor Author

Unfortunately the groupby objects are not exposed publicly yet. There is an issue, pandas-dev/pandas#48577, discussing how to expose these.

Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't check beforehand, it looks like pd.errors.IndexingError was only added in pandas=1.5 (our current minimum supported version is 1.0). I'd recommend something like the following

diff --git a/dask/dataframe/_compat.py b/dask/dataframe/_compat.py
index 9791345ad..a0434065e 100644
--- a/dask/dataframe/_compat.py
+++ b/dask/dataframe/_compat.py
@@ -106,3 +106,9 @@ def dtype_eq(a: type, b: type) -> bool:
     ):
         return False
     return a == b
+
+
+if PANDAS_GT_150:
+    IndexingError = pd.errors.IndexingError
+else:
+    IndexingError = pd.core.indexing.IndexingError
diff --git a/dask/dataframe/indexing.py b/dask/dataframe/indexing.py
index 9a88c7b36..656a9cf8d 100644
--- a/dask/dataframe/indexing.py
+++ b/dask/dataframe/indexing.py
@@ -9,7 +9,7 @@ from pandas.api.types import is_bool_dtype
 from dask.array.core import Array
 from dask.base import tokenize
 from dask.dataframe import methods
-from dask.dataframe._compat import PANDAS_GT_130
+from dask.dataframe._compat import PANDAS_GT_130, IndexingError
 from dask.dataframe.core import Series, new_dd_object
 from dask.dataframe.utils import is_index_like, is_series_like, meta_nonempty
 from dask.highlevelgraph import HighLevelGraph
@@ -93,7 +93,7 @@ class _LocIndexer(_IndexerBase):
             if len(key) > self.obj.ndim:
                 # raise from pandas
                 msg = "Too many indexers"
-                raise pd.errors.IndexingError(msg)
+                raise IndexingError(msg)

             iindexer = key[0]
             cindexer = key[1]
diff --git a/dask/dataframe/tests/test_indexing.py b/dask/dataframe/tests/test_indexing.py
index 514b94041..bef985daf 100644
--- a/dask/dataframe/tests/test_indexing.py
+++ b/dask/dataframe/tests/test_indexing.py
@@ -5,7 +5,7 @@ import pytest
 import dask
 import dask.dataframe as dd
 from dask.base import tokenize
-from dask.dataframe._compat import PANDAS_GT_110, PANDAS_GT_120, tm
+from dask.dataframe._compat import PANDAS_GT_110, PANDAS_GT_120, IndexingError, tm
 from dask.dataframe.indexing import _coerce_loc_index
 from dask.dataframe.utils import assert_eq, make_meta

@@ -188,17 +188,17 @@ def test_loc2d():
     assert_eq(d.loc[3:, ["a"]], full.loc[3:, ["a"]])

     # 3d
-    with pytest.raises(pd.errors.indexing.IndexingError):
+    with pytest.raises(IndexingError):
         d.loc[3, 3, 3]

     # Series should raise
-    with pytest.raises(pd.errors.indexing.IndexingError):
+    with pytest.raises(IndexingError):
         d.a.loc[3, 3]

-    with pytest.raises(pd.errors.indexing.IndexingError):
+    with pytest.raises(IndexingError):
         d.a.loc[3:, 3]

-    with pytest.raises(pd.errors.indexing.IndexingError):
+    with pytest.raises(IndexingError):
         d.a.loc[d.a % 2 == 0, 3]

@jrbourbeau jrbourbeau changed the title Avoid pd.core import in dask utils Avoid pd.core import in dask utils Feb 1, 2023
Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mroeschke! This is in

Also, I noticed this is your first code contribution to this repository. Welcome!

@jrbourbeau jrbourbeau merged commit e2dfb3c into dask:main Feb 2, 2023
@mroeschke mroeschke deleted the doc/avoid_pd_core branch February 2, 2023 16:08
@mroeschke
Copy link
Copy Markdown
Contributor Author

Thanks for the review!

deepyaman added a commit to kedro-org/kedro-plugins that referenced this pull request Mar 16, 2023
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
deepyaman added a commit to kedro-org/kedro-plugins that referenced this pull request Apr 4, 2023
* [WIP] feat(datasets): add the option to use pandas 2.0.0

* chore(datasets): add pandas<2.0.0 back to versions

* Update test_requirements.txt

* Unbound `pyarrow` in datasets testing requirements

* Change import to 2.0.0-compatible `pandas.testing`

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* Unpin Dask to get dask/dask#9907

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* Fix requirements.txt using pre-commit hook

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* Use one partition for Dask so indexes are the same

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* Don't require unavailable pandas 2.0.0 for 3.7 now

* Unleash the pandas :panda:

* Check whether SQLAlchemy 2.0.x will break stuff TM

* chore: "sort" requirements?

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* Update RELEASE.md

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* Use `Inspector.has_table` instead of deprecated `Engine.table_name``

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>

* chore: clean up relaxation of SQLAlchemy to 2.0.x

* docs(datasets): update docstrings to reference latest SQLAlchemy

* docs(datasets): always point to latet SQLAlchemy docs

---------

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants