Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-12565] Dataframe compare implementation #16027

Merged
merged 4 commits into from Dec 10, 2021

Conversation

roger-mike
Copy link
Contributor

Implementation of compare for DeferredDataFrame and DeferredSeries


Lang ULR Dataflow Flink Samza Spark Twister2
Go --- Build Status Build Status Build Status Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Python --- Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status ---
XLang Build Status Build Status Build Status Build Status Build Status ---

Examples testing status on various runners

Lang ULR Dataflow Flink Samza Spark Twister2
Go --- --- --- --- --- --- ---
Java --- Build Status
Build Status
Build Status
--- --- --- --- ---
Python --- --- --- --- --- --- ---
XLang --- --- --- --- --- --- ---

Post-Commit SDK/Transform Integration Tests Status (on master branch)

Go Java Python
Build Status Build Status Build Status
Build Status
Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status Build Status --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@frame_base.with_docs_from(pd.DataFrame)
@frame_base.args_to_kwargs(pd.DataFrame)
@frame_base.populate_defaults(pd.DataFrame)
def compare(self, other, **kwargs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @TheNeuralBit, I have a similar issue here as in idxmin and idxmax proxy. I wasn't able to create a valid proxy since the expected proxy is a DataFrame with a MultiIndex and its structure depends on the differences between the inputs. So, I'm not sure how this proxy can be created beforehand. Any suggestion on how this can be solved?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @roger-mike, looking through the docs for this operation here, I think we will need to be restrictive about the arguments we support. As you point out, the default (align_axis=1, keep_shape=False) will drop columns that are equivalent. Since that makes the shape depend on the input data, we should raise WontImplementError(reason="non-deferred-columns") in that case.

We should still be able to support align_axis=1, keep_shape=True though. And I think we should be able to support align_axis=0 no matter what the other args are.

Does that make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments @TheNeuralBit. I handled the (align_axis=1, keep_shape=False) case as you suggest. The other cases work just fine. I also disabled proxy check for the (align_axis=0, keep_shape=False) test case since its result also depends on the input data. Let me know what you think 👍 .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

One more thing: any time there's a caveat like the wontimplementerror here, we should add a docstring for it, we have infrastructure that puts this in a "Differences from pandaas" section. See here.

Something like:

Suggested change
def compare(self, other, **kwargs):
def compare(self, other, **kwargs):
"""The default values ``align_axis=1 and ``keep_shape=False`` are not supported, because the output columns depend on the data. To use ``align_axis=1``, please specify ``keep_shape=True``."""

@roger-mike
Copy link
Contributor Author

R: @TheNeuralBit could you take a look? Thanks 👍 .

self._run_test(
lambda s1, s2: s1.compare(s2, keep_shape=True, keep_equal=True), s1, s2)

def test_compare_dataframe(self):
Copy link
Contributor Author

@roger-mike roger-mike Dec 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test for Dataframe.compare in pandas_doctests.test.py was skipped because it tries to apply a loc to df2 and it's not implemented, making the tests fail. @TheNeuralBit Do you think this should remain skipped?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think it's approptiate to keep that test skipped. The problem is that it creates test data by modifying the DataFrame in-place with loc, which we don't support:

>>> df2 = df.copy()
>>> df2.loc[0, 'col1'] = 'c'
>>> df2.loc[2, 'col3'] = 4.0
>>> df2
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

It's fine to just test compare here in frames_test.py.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I'll keep them skipped then.

@codecov
Copy link

codecov bot commented Dec 1, 2021

Codecov Report

Merging #16027 (08becba) into master (07b956a) will increase coverage by 9.01%.
The diff coverage is 92.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16027      +/-   ##
==========================================
+ Coverage   74.62%   83.64%   +9.01%     
==========================================
  Files         643      447     -196     
  Lines       81120    61648   -19472     
==========================================
- Hits        60533    51563    -8970     
+ Misses      19617    10085    -9532     
+ Partials      970        0     -970     
Impacted Files Coverage Δ
sdks/python/apache_beam/dataframe/frames.py 94.89% <92.00%> (-0.04%) ⬇️
...hon/apache_beam/runners/direct/test_stream_impl.py 94.02% <0.00%> (-2.24%) ⬇️
...ks/python/apache_beam/runners/worker/sdk_worker.py 88.90% <0.00%> (-0.16%) ⬇️
...o/pkg/beam/io/rtrackers/offsetrange/offsetrange.go
sdks/go/pkg/beam/util/harnessopts/sampler.go
sdks/go/pkg/beam/transforms/stats/min.go
sdks/go/pkg/beam/core/graph/coder/iterable.go
sdks/go/pkg/beam/core/runtime/xlangx/resolve.go
sdks/go/pkg/beam/core/runtime/exec/fn_arity.go
...o/pkg/beam/runners/dataflow/dataflowlib/metrics.go
... and 190 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 07b956a...08becba. Read the comment docs.

@roger-mike roger-mike changed the title [WIP][BEAM-12565] Dataframe compare implementation [BEAM-12565] Dataframe compare implementation Dec 1, 2021
@roger-mike
Copy link
Contributor Author

R: @TheNeuralBit Could you take a look? Thanks.

Copy link
Member

@TheNeuralBit TheNeuralBit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, just a few more minor suggestions.

We should also wrap this with a hasattr check like we do for value_counts:

if hasattr(pd.DataFrame, 'value_counts'):
@frame_base.with_docs_from(pd.DataFrame)
def value_counts(self, subset=None, sort=False, normalize=False,
ascending=False, dropna=True):
"""``sort`` is ``False`` by default, and ``sort=True`` is not supported

Otherwise we will break pandas 1.0 support.

sdks/python/apache_beam/dataframe/pandas_doctests_test.py Outdated Show resolved Hide resolved

preserve_partition = None

if align_axis and not keep_shape:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align_axis is allowed to be 'columns' or 'index'

Suggested change
if align_axis and not keep_shape:
if align_axis in (1, 'columns') and not keep_shape:

Comment on lines 3474 to 3475
"compare(align_axis=1, keep_shape=False) is not allowed",
reason='non-deferred-columns'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple nits on the exception:

Suggested change
"compare(align_axis=1, keep_shape=False) is not allowed",
reason='non-deferred-columns'
f"compare(align_axis={align_axis!r}, keep_shape={keep_shape!r}) is not allowed because the output columns depend on the data, please specify keep_shape=True.",
reason='non-deferred-columns'
  • We can use a format string to display the actual user-specified args
  • Added a little detail about why this happened, and how to address it

(this may not work as written, it probably needs some linting)

@frame_base.with_docs_from(pd.DataFrame)
@frame_base.args_to_kwargs(pd.DataFrame)
@frame_base.populate_defaults(pd.DataFrame)
def compare(self, other, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

One more thing: any time there's a caveat like the wontimplementerror here, we should add a docstring for it, we have infrastructure that puts this in a "Differences from pandaas" section. See here.

Something like:

Suggested change
def compare(self, other, **kwargs):
def compare(self, other, **kwargs):
"""The default values ``align_axis=1 and ``keep_shape=False`` are not supported, because the output columns depend on the data. To use ``align_axis=1``, please specify ``keep_shape=True``."""

@frame_base.populate_defaults(pd.DataFrame)
def compare(self, other, **kwargs):
align_axis = kwargs.get('align_axis', 1)
keep_shape = kwargs.get('keep_shape', False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you put align_axis and keep_shape in the argument list, then @frame_base.populate_defaults should pull the default values from the pandas code, and you won't need to specify them here. Did you try that?

reason='non-deferred-columns'
)

if align_axis:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if align_axis:
if align_axis in (1, 'columns'):

Copy link
Member

@TheNeuralBit TheNeuralBit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, just a few more minor suggestions.

We should also wrap this with a hasattr check like we do for value_counts:

if hasattr(pd.DataFrame, 'value_counts'):
@frame_base.with_docs_from(pd.DataFrame)
def value_counts(self, subset=None, sort=False, normalize=False,
ascending=False, dropna=True):
"""``sort`` is ``False`` by default, and ``sort=True`` is not supported

Otherwise we will break pandas 1.0 support.

@@ -2049,6 +2049,31 @@ def repeat(self, repeats, axis):
"repeat(repeats=) value must be an int or a "
f"DeferredSeries (encountered {type(repeats)}).")

if hasattr(pd.Series, 'compare'):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why yapf spaces this check and not the one in DataFrame

Copy link
Member

@TheNeuralBit TheNeuralBit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@TheNeuralBit
Copy link
Member

I tried to address the merge conflict myself in the GitHub UI, but it looks like I broke the formatter and linter. Could you look at those errors @roger-mike?

@roger-mike roger-mike force-pushed the feat/dataframe-compare branch 2 times, most recently from 47fd628 to 1a31d1d Compare December 10, 2021 18:24
@roger-mike
Copy link
Contributor Author

I tried to address the merge conflict myself in the GitHub UI, but it looks like I broke the formatter and linter. Could you look at those errors @roger-mike?

Done 👍

@TheNeuralBit TheNeuralBit merged commit f98a3b0 into apache:master Dec 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants