Avoid pd.core import in dask utils #9907
Conversation
Can one of the admins verify this patch? Admins can comment: add to allowlist
dask/dask/dataframe/groupby.py, line 1672 in 834a19e: Is there a better way to do this?
jrbourbeau
left a comment
Thanks @mroeschke! Happy to move away from pandas.core

> pd.core is also used for pulling docstrings like ...

Yeah, that's a good point. Grepping through the codebase it looks like we use:
- pd.core.groupby.GroupBy
- pd.core.groupby.DataFrameGroupBy.aggregate
- pd.core.groupby.DataFrameGroupBy.agg
- pd.core.groupby.DataFrameGroupBy
- pd.core.groupby.SeriesGroupBy
- pd.core.groupby.SeriesGroupBy.aggregate
- pd.core.groupby.SeriesGroupBy.agg
- pd.core.indexing.IndexingError
So mostly groupby-related things and IndexingError. I'm not sure about the groupby objects, though.
For IndexingError it looks like we can use pd.errors.IndexingError instead:
In [1]: import pandas as pd
In [2]: pd.errors.IndexingError
Out[2]: pandas.errors.IndexingError
In [3]: pd.core.indexing.IndexingError
Out[3]: pandas.errors.IndexingError
Unfortunately the groupby objects are not exposed publicly yet. There is an issue, pandas-dev/pandas#48577, discussing how to expose these.
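Until pandas exposes those classes publicly, one version-agnostic workaround (a sketch of an alternative, not what this PR does) is to recover the groupby types from a throwaway groupby call instead of importing them from pandas.core:

```python
import pandas as pd

# Build trivial groupby objects and take their types; this avoids importing
# from the private pandas.core namespace entirely.
DataFrameGroupBy = type(pd.DataFrame({"a": [1]}).groupby("a"))
SeriesGroupBy = type(pd.Series([1]).groupby(level=0))

print(DataFrameGroupBy.__name__)  # DataFrameGroupBy
print(SeriesGroupBy.__name__)     # SeriesGroupBy
```

The recovered types behave normally in isinstance checks, which is the main thing dask needs them for.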
jrbourbeau
left a comment
Sorry, I didn't check beforehand; it looks like pd.errors.IndexingError was only added in pandas=1.5 (our current minimum supported version is pandas=1.0). I'd recommend something like the following:
diff --git a/dask/dataframe/_compat.py b/dask/dataframe/_compat.py
index 9791345ad..a0434065e 100644
--- a/dask/dataframe/_compat.py
+++ b/dask/dataframe/_compat.py
@@ -106,3 +106,9 @@ def dtype_eq(a: type, b: type) -> bool:
):
return False
return a == b
+
+
+if PANDAS_GT_150:
+ IndexingError = pd.errors.IndexingError
+else:
+ IndexingError = pd.core.indexing.IndexingError
diff --git a/dask/dataframe/indexing.py b/dask/dataframe/indexing.py
index 9a88c7b36..656a9cf8d 100644
--- a/dask/dataframe/indexing.py
+++ b/dask/dataframe/indexing.py
@@ -9,7 +9,7 @@ from pandas.api.types import is_bool_dtype
from dask.array.core import Array
from dask.base import tokenize
from dask.dataframe import methods
-from dask.dataframe._compat import PANDAS_GT_130
+from dask.dataframe._compat import PANDAS_GT_130, IndexingError
from dask.dataframe.core import Series, new_dd_object
from dask.dataframe.utils import is_index_like, is_series_like, meta_nonempty
from dask.highlevelgraph import HighLevelGraph
@@ -93,7 +93,7 @@ class _LocIndexer(_IndexerBase):
if len(key) > self.obj.ndim:
# raise from pandas
msg = "Too many indexers"
- raise pd.errors.IndexingError(msg)
+ raise IndexingError(msg)
iindexer = key[0]
cindexer = key[1]
diff --git a/dask/dataframe/tests/test_indexing.py b/dask/dataframe/tests/test_indexing.py
index 514b94041..bef985daf 100644
--- a/dask/dataframe/tests/test_indexing.py
+++ b/dask/dataframe/tests/test_indexing.py
@@ -5,7 +5,7 @@ import pytest
import dask
import dask.dataframe as dd
from dask.base import tokenize
-from dask.dataframe._compat import PANDAS_GT_110, PANDAS_GT_120, tm
+from dask.dataframe._compat import PANDAS_GT_110, PANDAS_GT_120, IndexingError, tm
from dask.dataframe.indexing import _coerce_loc_index
from dask.dataframe.utils import assert_eq, make_meta
@@ -188,17 +188,17 @@ def test_loc2d():
assert_eq(d.loc[3:, ["a"]], full.loc[3:, ["a"]])
# 3d
- with pytest.raises(pd.errors.IndexingError):
+ with pytest.raises(IndexingError):
d.loc[3, 3, 3]
# Series should raise
- with pytest.raises(pd.errors.IndexingError):
+ with pytest.raises(IndexingError):
d.a.loc[3, 3]
- with pytest.raises(pd.errors.IndexingError):
+ with pytest.raises(IndexingError):
d.a.loc[3:, 3]
- with pytest.raises(pd.errors.IndexingError):
+ with pytest.raises(IndexingError):
d.a.loc[d.a % 2 == 0, 3]
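The suggested _compat shim can be exercised on its own; here is a minimal self-contained sketch, with the version flag computed inline rather than imported from dask.dataframe._compat:

```python
import pandas as pd
from packaging.version import Version

# Mirror of the suggested _compat shim: prefer the public location,
# fall back to the private one on pandas < 1.5.
PANDAS_GT_150 = Version(pd.__version__).release >= (1, 5, 0)

if PANDAS_GT_150:
    IndexingError = pd.errors.IndexingError
else:
    IndexingError = pd.core.indexing.IndexingError

# Either way, callers raise and catch a single name.
try:
    raise IndexingError("Too many indexers")
except IndexingError as e:
    print(e)
```

This keeps the private-module import confined to one compat module, so dropping pandas < 1.5 support later means deleting a single branch.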
jrbourbeau
left a comment
Thanks @mroeschke! This is in!
Also, I noticed this is your first code contribution to this repository. Welcome!
Thanks for the review!
Referenced by a downstream commit:

* [WIP] feat(datasets): add the option to use pandas 2.0.0
* chore(datasets): add pandas<2.0.0 back to versions
* Update test_requirements.txt
* Unbound pyarrow in datasets testing requirements
* Change import to 2.0.0-compatible pandas.testing
* Unpin Dask to get dask/dask#9907
* Fix requirements.txt using pre-commit hook
* Use one partition for Dask so indexes are the same
* Don't require unavailable pandas 2.0.0 for 3.7 now
* Unleash the pandas :panda:
* Check whether SQLAlchemy 2.0.x will break stuff TM
* chore: "sort" requirements?
* Update RELEASE.md
* Use Inspector.has_table instead of deprecated Engine.table_name
* chore: clean up relaxation of SQLAlchemy to 2.0.x
* docs(datasets): update docstrings to reference latest SQLAlchemy
* docs(datasets): always point to latest SQLAlchemy docs

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
pre-commit run --all-files

cc @jrbourbeau
pandas.core is documented as private (https://pandas.pydata.org/pandas-docs/stable/reference/index.html), and additionally I plan to break pd.core.strings.StringMethods, as it's causing a circular import for work on integrating string method support in pd.arrays.ArrowExtensionArray.
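As an aside on the StringMethods point: if downstream code needs that class before any public export lands, its type is already reachable through the public .str accessor. This is an illustrative workaround, not something this PR adds:

```python
import pandas as pd

# The .str accessor instance is a StringMethods object, so its type can be
# obtained without importing pd.core.strings.StringMethods directly.
StringMethods = type(pd.Series([], dtype="object").str)
print(StringMethods.__name__)  # StringMethods
```

Like the groupby workaround above, this stays valid even if pandas moves the class inside pandas.core, since only the public accessor is touched.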