Skip to content

Deprecate axis= for some groupby operations#10094

Merged
jrbourbeau merged 5 commits intodask:mainfrom
jrbourbeau:groupby-axis-deprecated
Mar 23, 2023
Merged

Deprecate axis= for some groupby operations#10094
jrbourbeau merged 5 commits intodask:mainfrom
jrbourbeau:groupby-axis-deprecated

Conversation

@jrbourbeau
Copy link
Copy Markdown
Member

Corresponding PR to pandas-dev/pandas#52018

Note that shift-related tests, e.g. test_groupby_shift_basic_input, should be raising deprecation warnings, but aren't. When I run the same code outside the test suite, I see the deprecation warnings. Will need to look into this further.

PANDAS_GT_150 = PANDAS_VERSION >= Version("1.5.0")
PANDAS_GT_200 = PANDAS_VERSION.major >= 2 # Also true for nightly builds
PANDAS_GT_200 = PANDAS_VERSION.major >= 2
PANDAS_GT_210 = PANDAS_VERSION.release >= (2, 1, 0)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phofl I saw pandas-dev/pandas#52018 is in the 2.1 milestone, so I think the changes in this PR are correct. Just wanted to double check that this deprecation won't show up in a future pandas=2.0.x release

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this will come in 2.1. generally we don’t backport deprecations if we can avoid it.

there will be a couple more axis=1 deprecations soonish (just as a heads up)

Comment on lines +1928 to +1931
if axis in (1, "columns"):
raise NotImplementedError(
f"The axis={axis} keyword is not implemented for groupby.idxmax"
)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting for reviewers that axis=1 isn't currently supported and just gets silently ignored. I've opted here to raise an informative error in this case (test coverage is included in test_df_groupby_idx_axis)

Comment on lines +1896 to +1900
if axis in (1, "columns"):
raise NotImplementedError(
f"The axis={axis} keyword is not implemented for groupby.idxmin"
)
self._normalize_axis(axis, "idxmin")
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment here about axis=1 being silently ignored now

@jrbourbeau
Copy link
Copy Markdown
Member Author

rerun tests

"be removed in a future version. Call without passing 'axis' instead.",
FutureWarning,
)
else:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that with this condition, we emit a warning for any arbitrary axis="foo", even when this would subsequently raise an error. Would it make sense to change this to

Suggested change
else:
elif axis in (1, "columns", None):

To capture the remaining supported axis values instead?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To capture the remaining supported axis values instead?

What other axis values are supported?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had assumed (1, "columns", None) based on DataFrame._validate_axis, but I recognize that this list varies; I guess my underlying question is why we opt to flip pandas' approach here of specifying the deprecated axis argument in a finite set of cases, and the general deprecation warning otherwise:

https://github.com/pandas-dev/pandas/blob/3083ae932e0b122e963608fb25cf3df06d588b03/pandas/core/groupby/groupby.py#L980-L995

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas has some internal methods for normalizing to 0 or 1 for input axis values (i.e. "columns" --> 1). For example, not even when one specifies axis="columns" in pandas, the deprecation warnings that's emitted says axis=1

In [6]: type(df)
Out[6]: pandas.core.frame.DataFrame

In [7]: df
Out[7]:
   A  B    C    D    E
0  1  3  NaN  4.0  6.0
1  1  4  3.0  NaN  NaN
2  2  3  NaN  5.0  7.0
3  2  4  NaN  NaN  NaN

In [8]: df.groupby("A").fillna(0, axis="columns")
<ipython-input-8-b27392e2513d>:1: FutureWarning: DataFrameGroupBy.fillna with axis=1 is deprecated and will be removed in a future version. Operate on the un-grouped DataFrame instead
  df.groupby("A").fillna(0, axis="columns")
Out[8]:
     B    C    D    E
0  3.0  0.0  4.0  6.0
1  4.0  3.0  0.0  0.0
2  3.0  0.0  5.0  7.0
3  4.0  0.0  0.0  0.0

In this PR we're not normalizing to integers and are instead using the user-provided axis value in the deprecation warning

In [12]: ddf.groupby("A").fillna(0, axis="columns")
/Users/james/projects/dask/dask/dask/dataframe/groupby.py:2741: FutureWarning: DataFrameGroupBy.fillna with axis=columns is deprecated and will be removed in a future version. Operate on the un-grouped DataFrame instead
  warnings.warn(
Out[12]:
Dask DataFrame Structure:
                     B        C        D        E
npartitions=2
               float64  float64  float64  float64
                   ...      ...      ...      ...
                   ...      ...      ...      ...
Dask Name: droplevel, 8 graph layers

I think this is a bit more clear, but probably not a huge deal either way. Does that help clarify things? Or am I missing something?

extract_kwargs,
nonempty=True,
)
meta = self._meta_nonempty.shift(**meta_kwargs)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intent here to emit the deprecation warning only through Dask, and suppress any cases where it comes up in pandas? If so, we might want to conditionally suppress warnings here depending on the value of include_axis to avoid getting two identical warnings.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's the intent. While digging into #10094 (comment) I realized we weren't properly handling the case when we're checking for multiple warnings being emitted. As part of fixing, I decided to always forward axis to pandas, but also catch and ignore these deprecation warnings (coming from pandas, we emit these warnings ourself in `dask)

303f586

token="groupby-shift",
group_keys=self.group_keys,
meta=meta,
**axis_kwarg,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Believe the same as above should apply to the g.shift() call in _groupby_slice_shift

Comment on lines +1309 to +1312
with groupby_axis_deprecated():
expected = df.groupby("A", group_keys=group_keys).fillna(0, axis=axis)
with groupby_axis_deprecated():
result = ddf.groupby("A", group_keys=group_keys).fillna(0, axis=axis)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding, is there a reason we're opting to do the fillna calls in separate contexts throughout instead of

Suggested change
with groupby_axis_deprecated():
expected = df.groupby("A", group_keys=group_keys).fillna(0, axis=axis)
with groupby_axis_deprecated():
result = ddf.groupby("A", group_keys=group_keys).fillna(0, axis=axis)
with groupby_axis_deprecated():
expected = df.groupby("A", group_keys=group_keys).fillna(0, axis=axis)
result = ddf.groupby("A", group_keys=group_keys).fillna(0, axis=axis)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing them separately ensure that both pandas and dask emit FutureWarnings. With a single context manager, things would still pass if only one emitted a FutureWarning

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification!

Copy link
Copy Markdown
Contributor

@j-bennet j-bennet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jrbourbeau
Copy link
Copy Markdown
Member Author

Thanks @j-bennet @charlesbluca for reviewing!

@jrbourbeau jrbourbeau merged commit 8ba5ad5 into dask:main Mar 23, 2023
@jrbourbeau jrbourbeau deleted the groupby-axis-deprecated branch March 23, 2023 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants