-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-12565] Dataframe compare implementation #16027
Conversation
@frame_base.with_docs_from(pd.DataFrame) | ||
@frame_base.args_to_kwargs(pd.DataFrame) | ||
@frame_base.populate_defaults(pd.DataFrame) | ||
def compare(self, other, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @TheNeuralBit, I have a similar issue here as in idxmin
and idxmax
proxy. I wasn't able to create a valid proxy since the expected proxy is a DataFrame with a MultiIndex and its structure depends on the differences between the inputs. So, I'm not sure how this proxy can be created beforehand. Any suggestion on how this can be solved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @roger-mike, looking through the docs for this operation here, I think we will need to be restrictive about the arguments we support. As you point out, the default (align_axis=1
, keep_shape=False
) will drop columns that are equivalent. Since that makes the shape depend on the input data, we should raise WontImplementError(reason="non-deferred-columns")
in that case.
We should still be able to support align_axis=1
, keep_shape=True
though. And I think we should be able to support align_axis=0
no matter what the other args are.
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comments @TheNeuralBit. I handled the (align_axis=1
, keep_shape=False
) case as you suggest. The other cases work just fine. I also disabled proxy check for the (align_axis=0
, keep_shape=False
) test case since its result also depends on the input data. Let me know what you think 👍 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks!
One more thing: any time there's a caveat like the wontimplementerror here, we should add a docstring for it, we have infrastructure that puts this in a "Differences from pandaas" section. See here.
Something like:
def compare(self, other, **kwargs): | |
def compare(self, other, **kwargs): | |
"""The default values ``align_axis=1 and ``keep_shape=False`` are not supported, because the output columns depend on the data. To use ``align_axis=1``, please specify ``keep_shape=True``.""" |
R: @TheNeuralBit could you take a look? Thanks 👍 . |
self._run_test( | ||
lambda s1, s2: s1.compare(s2, keep_shape=True, keep_equal=True), s1, s2) | ||
|
||
def test_compare_dataframe(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test for Dataframe.compare
in pandas_doctests.test.py
was skipped because it tries to apply a loc
to df2
and it's not implemented, making the tests fail. @TheNeuralBit Do you think this should remain skipped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think it's approptiate to keep that test skipped. The problem is that it creates test data by modifying the DataFrame in-place with loc, which we don't support:
>>> df2 = df.copy()
>>> df2.loc[0, 'col1'] = 'c'
>>> df2.loc[2, 'col3'] = 4.0
>>> df2
col1 col2 col3
0 c 1.0 1.0
1 a 2.0 2.0
2 b 3.0 4.0
3 b NaN 4.0
4 a 5.0 5.0
It's fine to just test compare
here in frames_test.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I'll keep them skipped then.
Codecov Report
@@ Coverage Diff @@
## master #16027 +/- ##
==========================================
+ Coverage 74.62% 83.64% +9.01%
==========================================
Files 643 447 -196
Lines 81120 61648 -19472
==========================================
- Hits 60533 51563 -8970
+ Misses 19617 10085 -9532
+ Partials 970 0 -970 Continue to review full report at Codecov.
|
da75a84
to
4ae8f26
Compare
4ae8f26
to
bb3dea2
Compare
R: @TheNeuralBit Could you take a look? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, just a few more minor suggestions.
We should also wrap this with a hasattr check like we do for value_counts
:
beam/sdks/python/apache_beam/dataframe/frames.py
Lines 3613 to 3617 in 3d75542
if hasattr(pd.DataFrame, 'value_counts'): | |
@frame_base.with_docs_from(pd.DataFrame) | |
def value_counts(self, subset=None, sort=False, normalize=False, | |
ascending=False, dropna=True): | |
"""``sort`` is ``False`` by default, and ``sort=True`` is not supported |
Otherwise we will break pandas 1.0 support.
|
||
preserve_partition = None | ||
|
||
if align_axis and not keep_shape: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
align_axis
is allowed to be 'columns' or 'index'
if align_axis and not keep_shape: | |
if align_axis in (1, 'columns') and not keep_shape: |
"compare(align_axis=1, keep_shape=False) is not allowed", | ||
reason='non-deferred-columns' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple nits on the exception:
"compare(align_axis=1, keep_shape=False) is not allowed", | |
reason='non-deferred-columns' | |
f"compare(align_axis={align_axis!r}, keep_shape={keep_shape!r}) is not allowed because the output columns depend on the data, please specify keep_shape=True.", | |
reason='non-deferred-columns' |
- We can use a format string to display the actual user-specified args
- Added a little detail about why this happened, and how to address it
(this may not work as written, it probably needs some linting)
@frame_base.with_docs_from(pd.DataFrame) | ||
@frame_base.args_to_kwargs(pd.DataFrame) | ||
@frame_base.populate_defaults(pd.DataFrame) | ||
def compare(self, other, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks!
One more thing: any time there's a caveat like the wontimplementerror here, we should add a docstring for it, we have infrastructure that puts this in a "Differences from pandaas" section. See here.
Something like:
def compare(self, other, **kwargs): | |
def compare(self, other, **kwargs): | |
"""The default values ``align_axis=1 and ``keep_shape=False`` are not supported, because the output columns depend on the data. To use ``align_axis=1``, please specify ``keep_shape=True``.""" |
@frame_base.populate_defaults(pd.DataFrame) | ||
def compare(self, other, **kwargs): | ||
align_axis = kwargs.get('align_axis', 1) | ||
keep_shape = kwargs.get('keep_shape', False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you put align_axis
and keep_shape
in the argument list, then @frame_base.populate_defaults
should pull the default values from the pandas code, and you won't need to specify them here. Did you try that?
reason='non-deferred-columns' | ||
) | ||
|
||
if align_axis: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if align_axis: | |
if align_axis in (1, 'columns'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, just a few more minor suggestions.
We should also wrap this with a hasattr check like we do for value_counts
:
beam/sdks/python/apache_beam/dataframe/frames.py
Lines 3613 to 3617 in 3d75542
if hasattr(pd.DataFrame, 'value_counts'): | |
@frame_base.with_docs_from(pd.DataFrame) | |
def value_counts(self, subset=None, sort=False, normalize=False, | |
ascending=False, dropna=True): | |
"""``sort`` is ``False`` by default, and ``sort=True`` is not supported |
Otherwise we will break pandas 1.0 support.
@@ -2049,6 +2049,31 @@ def repeat(self, repeats, axis): | |||
"repeat(repeats=) value must be an int or a " | |||
f"DeferredSeries (encountered {type(repeats)}).") | |||
|
|||
if hasattr(pd.Series, 'compare'): | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why yapf spaces this check and not the one in DataFrame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
I tried to address the merge conflict myself in the GitHub UI, but it looks like I broke the formatter and linter. Could you look at those errors @roger-mike? |
47fd628
to
1a31d1d
Compare
1a31d1d
to
08becba
Compare
Done 👍 |
Implementation of
compare
for DeferredDataFrame and DeferredSeriesExamples testing status on various runners
Post-Commit SDK/Transform Integration Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.