[BEAM-12565] Dataframe compare implementation #16027

roger-mike · 2021-11-19T18:03:03Z

Implementation of compare for DeferredDataFrame and DeferredSeries

roger-mike · 2021-11-25T17:26:14Z

sdks/python/apache_beam/dataframe/frames.py

+  @frame_base.with_docs_from(pd.DataFrame)
+  @frame_base.args_to_kwargs(pd.DataFrame)
+  @frame_base.populate_defaults(pd.DataFrame)
+  def compare(self, other, **kwargs):


Hi @TheNeuralBit, I have a similar issue here as in idxmin and idxmax proxy. I wasn't able to create a valid proxy since the expected proxy is a DataFrame with a MultiIndex and its structure depends on the differences between the inputs. So, I'm not sure how this proxy can be created beforehand. Any suggestion on how this can be solved?

Hi @roger-mike, looking through the docs for this operation here, I think we will need to be restrictive about the arguments we support. As you point out, the default (align_axis=1, keep_shape=False) will drop columns that are equivalent. Since that makes the shape depend on the input data, we should raise WontImplementError(reason="non-deferred-columns") in that case.

We should still be able to support align_axis=1, keep_shape=True though. And I think we should be able to support align_axis=0 no matter what the other args are.

Does that make sense?
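To make the shape dependence concrete, here is a minimal sketch using plain pandas (not Beam) with made-up data. It assumes pandas 1.1 or later, where `compare` exists. With the defaults, the set of output columns changes with the number of differing columns; with `keep_shape=True` it does not.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col2': [1.0, 2.0, 3.0]})

# Same frame with a difference in one column only.
one_diff = df.copy()
one_diff['col1'] = ['c', 'a', 'b']

# Same frame with differences in both columns.
two_diffs = one_diff.copy()
two_diffs['col2'] = [1.0, 2.0, 4.0]

# Defaults (align_axis=1, keep_shape=False): equal columns are dropped,
# so the output columns depend on the data.
default_one = df.compare(one_diff)
default_two = df.compare(two_diffs)

# keep_shape=True: every input column is kept, whatever the data is.
kept_one = df.compare(one_diff, keep_shape=True)
kept_two = df.compare(two_diffs, keep_shape=True)

print(list(default_one.columns))
print(list(default_two.columns))
print(list(kept_one.columns) == list(kept_two.columns))
```

This is why a proxy can be built up front for `keep_shape=True` but not for the default arguments.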

Thanks for your comments @TheNeuralBit. I handled the (align_axis=1, keep_shape=False) case as you suggest. The other cases work just fine. I also disabled proxy check for the (align_axis=0, keep_shape=False) test case since its result also depends on the input data. Let me know what you think 👍 .

Great, thanks!

One more thing: any time there's a caveat like the WontImplementError here, we should add a docstring for it. We have infrastructure that puts this in a "Differences from pandas" section. See here.

Something like:

Suggested change

-  def compare(self, other, **kwargs):
+  def compare(self, other, **kwargs):
+    """The default values ``align_axis=1`` and ``keep_shape=False`` are not supported, because the output columns depend on the data. To use ``align_axis=1``, please specify ``keep_shape=True``."""

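As a rough illustration of the docstring idea, the sketch below shows one way a decorator could append a wrapper's own docstring under a "Differences from pandas" heading. Everything here is hypothetical (`with_docs_from_sketch`, `ReferenceFrame`, `DeferredFrameSketch`); it is not Beam's `frame_base.with_docs_from`, only a toy with the same general shape.

```python
def with_docs_from_sketch(base_type, section_title='Differences from pandas'):
  """Hypothetical decorator: reuse base_type's docs and append caveats."""
  def decorator(func):
    base_doc = getattr(base_type, func.__name__).__doc__ or ''
    caveat = func.__doc__
    if caveat:
      underline = '-' * len(section_title)
      func.__doc__ = (
          f'{base_doc.rstrip()}\n\n{section_title}\n{underline}\n'
          f'{caveat.strip()}\n')
    else:
      func.__doc__ = base_doc
    return func
  return decorator


class ReferenceFrame:
  """Stand-in for the library class whose documentation is reused."""
  def compare(self, other):
    """Compare to another frame and show the differences."""


class DeferredFrameSketch:
  @with_docs_from_sketch(ReferenceFrame)
  def compare(self, other, **kwargs):
    """``align_axis=1`` with ``keep_shape=False`` is not supported."""


print(DeferredFrameSketch.compare.__doc__)
```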
roger-mike · 2021-11-26T18:45:10Z

R: @TheNeuralBit could you take a look? Thanks 👍 .

roger-mike · 2021-12-01T15:47:00Z

sdks/python/apache_beam/dataframe/frames_test.py

+    self._run_test(
+        lambda s1, s2: s1.compare(s2, keep_shape=True, keep_equal=True), s1, s2)
+
+  def test_compare_dataframe(self):


The test for DataFrame.compare in pandas_doctests_test.py was skipped because it assigns to df2 through loc, which is not implemented, so the test fails. @TheNeuralBit Do you think this should remain skipped?

Yeah I think it's appropriate to keep that test skipped. The problem is that it creates test data by modifying the DataFrame in-place with loc, which we don't support:

>>> df2 = df.copy()
>>> df2.loc[0, 'col1'] = 'c'
>>> df2.loc[2, 'col3'] = 4.0
>>> df2
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

It's fine to just test compare here in frames_test.py.

Got it. I'll keep them skipped then.
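For tests outside the doctests, the same input can be built without in-place assignment. The sketch below uses plain pandas and the data from the doctest above; the variable names are illustrative, and it only checks that the two ways of building `df2` agree.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'a', 'b', 'b', 'a'],
    'col2': [1.0, 2.0, 3.0, float('nan'), 5.0],
    'col3': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# What the pandas doctest does: mutate a copy in place through loc.
df2_in_place = df.copy()
df2_in_place.loc[0, 'col1'] = 'c'
df2_in_place.loc[2, 'col3'] = 4.0

# Equivalent data built directly, with no in-place assignment.
df2 = pd.DataFrame({
    'col1': ['c', 'a', 'b', 'b', 'a'],
    'col2': [1.0, 2.0, 3.0, float('nan'), 5.0],
    'col3': [1.0, 2.0, 4.0, 4.0, 5.0],
})

# DataFrame.equals treats NaNs in the same position as equal.
print(df2.equals(df2_in_place))
```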

codecov · 2021-12-01T16:00:21Z

Codecov Report

Merging #16027 (08becba) into master (07b956a) will increase coverage by 9.01%.
The diff coverage is 92.00%.

@@            Coverage Diff             @@
##           master   #16027      +/-   ##
==========================================
+ Coverage   74.62%   83.64%   +9.01%     
==========================================
  Files         643      447     -196     
  Lines       81120    61648   -19472     
==========================================
- Hits        60533    51563    -8970     
+ Misses      19617    10085    -9532     
+ Partials      970        0     -970

Impacted Files	Coverage Δ
sdks/python/apache_beam/dataframe/frames.py	`94.89% <92.00%> (-0.04%)`	⬇️
...hon/apache_beam/runners/direct/test_stream_impl.py	`94.02% <0.00%> (-2.24%)`	⬇️
...ks/python/apache_beam/runners/worker/sdk_worker.py	`88.90% <0.00%> (-0.16%)`	⬇️
...o/pkg/beam/io/rtrackers/offsetrange/offsetrange.go
sdks/go/pkg/beam/util/harnessopts/sampler.go
sdks/go/pkg/beam/transforms/stats/min.go
sdks/go/pkg/beam/core/graph/coder/iterable.go
sdks/go/pkg/beam/core/runtime/xlangx/resolve.go
sdks/go/pkg/beam/core/runtime/exec/fn_arity.go
...o/pkg/beam/runners/dataflow/dataflowlib/metrics.go
... and 190 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

roger-mike · 2021-12-06T22:55:04Z

R: @TheNeuralBit Could you take a look? Thanks.

TheNeuralBit

Thanks, just a few more minor suggestions.

We should also wrap this with a hasattr check like we do for value_counts:

beam/sdks/python/apache_beam/dataframe/frames.py

Lines 3613 to 3617 in 3d75542

  if hasattr(pd.DataFrame, 'value_counts'):
    @frame_base.with_docs_from(pd.DataFrame)
    def value_counts(self, subset=None, sort=False, normalize=False,
                     ascending=False, dropna=True):
      """``sort`` is ``False`` by default, and ``sort=True`` is not supported

Otherwise we will break pandas 1.0 support.
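`compare` was only added in pandas 1.1, so the wrapper has to be defined conditionally. The guard pattern can be sketched with the standard library alone: `DeferredText` is a hypothetical wrapper class, and `str.removeprefix` (Python 3.9+) stands in for a method that exists only in newer versions of the wrapped library.

```python
class DeferredText:
  """Hypothetical wrapper; stands in for a deferred frame class."""
  def __init__(self, value):
    self._value = value

  # Define the wrapper method only when the wrapped type provides it,
  # in the same spirit as `if hasattr(pd.DataFrame, 'value_counts'):`.
  if hasattr(str, 'removeprefix'):

    def removeprefix(self, prefix):
      return DeferredText(self._value.removeprefix(prefix))


# On older interpreters the attribute is simply absent instead of
# failing at call time.
print(hasattr(DeferredText, 'removeprefix'))
```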

sdks/python/apache_beam/dataframe/pandas_doctests_test.py

TheNeuralBit · 2021-12-09T15:48:35Z

sdks/python/apache_beam/dataframe/frames.py

+
+    preserve_partition = None
+
+    if align_axis and not keep_shape:


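align_axis can also be passed by name ('columns' or 'index'), and 'index' is truthy even though it means axis 0, so the check should compare against the column values explicitly.

Suggested change

-    if align_axis and not keep_shape:
+    if align_axis in (1, 'columns') and not keep_shape: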
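A small sketch of why the truthiness check is not enough. The helper names are made up for illustration; only the two conditions come from the suggestion above.

```python
def truthy_check_rejects(align_axis, keep_shape):
  """The original condition: treats any truthy align_axis as columns."""
  return bool(align_axis and not keep_shape)


def explicit_check_rejects(align_axis, keep_shape):
  """The suggested condition: only 1 or 'columns' selects columns."""
  return align_axis in (1, 'columns') and not keep_shape


# 'index' is a non-empty string, so the truthy check wrongly rejects it.
print(truthy_check_rejects('index', False))    # True
print(explicit_check_rejects('index', False))  # False
print(explicit_check_rejects('columns', False))  # True
```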

TheNeuralBit · 2021-12-09T15:51:42Z

sdks/python/apache_beam/dataframe/frames.py

+        "compare(align_axis=1, keep_shape=False) is not allowed",
+        reason='non-deferred-columns'


A couple nits on the exception:

Suggested change

-        "compare(align_axis=1, keep_shape=False) is not allowed",
-        reason='non-deferred-columns'
+        f"compare(align_axis={align_axis!r}, keep_shape={keep_shape!r}) is not allowed because the output columns depend on the data, please specify keep_shape=True.",
+        reason='non-deferred-columns'

We can use a format string to display the actual user-specified args

Added a little detail about why this happened, and how to address it

(this may not work as written, it probably needs some linting)
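A runnable sketch of the suggested message, split across lines so it stays within a typical line-length limit. `WontImplementErrorSketch` and `check_compare_args` are hypothetical stand-ins so the snippet runs without Beam; the real code would raise `frame_base.WontImplementError`.

```python
class WontImplementErrorSketch(NotImplementedError):
  """Hypothetical stand-in for an error type that carries a `reason`."""
  def __init__(self, msg, reason=None):
    super().__init__(msg)
    self.reason = reason


def check_compare_args(align_axis, keep_shape):
  """Raise when the output columns would depend on the data."""
  if align_axis in (1, 'columns') and not keep_shape:
    raise WontImplementErrorSketch(
        f"compare(align_axis={align_axis!r}, keep_shape={keep_shape!r}) "
        "is not allowed because the output columns depend on the data, "
        "please specify keep_shape=True.",
        reason='non-deferred-columns')


try:
  check_compare_args('columns', False)
except WontImplementErrorSketch as e:
  # !r shows the value exactly as the user passed it, quotes included.
  print(e)
```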


TheNeuralBit · 2021-12-09T16:02:44Z

sdks/python/apache_beam/dataframe/frames.py

+  @frame_base.populate_defaults(pd.DataFrame)
+  def compare(self, other, **kwargs):
+    align_axis = kwargs.get('align_axis', 1)
+    keep_shape = kwargs.get('keep_shape', False)


If you put align_axis and keep_shape in the argument list, @frame_base.populate_defaults should pull their default values from the pandas signature, so you won't need to specify them here. Did you try that?
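The general idea can be sketched with `inspect`: look up defaults in a reference signature for any parameter the wrapper names explicitly but leaves without a default. This is a simplified toy under stated assumptions, not Beam's `populate_defaults`; `populate_defaults_sketch` and `reference_compare` are hypothetical names.

```python
import functools
import inspect


def reference_compare(
    self, other, align_axis=1, keep_shape=False, keep_equal=False):
  """Stand-in for the library method whose defaults are reused."""


def populate_defaults_sketch(reference):
  ref_params = inspect.signature(reference).parameters

  def decorator(func):
    func_sig = inspect.signature(func)
    # Defaults for parameters that func names but gives no default.
    defaults = {
        name: ref_params[name].default
        for name, param in func_sig.parameters.items()
        if name in ref_params
        and param.default is inspect.Parameter.empty
        and ref_params[name].default is not inspect.Parameter.empty
    }

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
      bound = func_sig.bind_partial(*args, **kwargs)
      for name, value in defaults.items():
        if name not in bound.arguments:
          kwargs[name] = value
      return func(*args, **kwargs)

    return wrapper

  return decorator


@populate_defaults_sketch(reference_compare)
def compare(self, other, align_axis, keep_shape, **kwargs):
  return align_axis, keep_shape, kwargs


print(compare(None, None))                   # (1, False, {})
print(compare(None, None, keep_shape=True))  # (1, True, {})
```

With this shape, the body of `compare` can use `align_axis` and `keep_shape` directly instead of calling `kwargs.get` with hard-coded defaults.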

TheNeuralBit · 2021-12-09T16:03:18Z

sdks/python/apache_beam/dataframe/frames.py

+        reason='non-deferred-columns'
+      )
+
+    if align_axis:


Suggested change

-    if align_axis:
+    if align_axis in (1, 'columns'):

roger-mike · 2021-12-09T23:25:26Z

sdks/python/apache_beam/dataframe/frames.py

@@ -2049,6 +2049,31 @@ def repeat(self, repeats, axis):
          "repeat(repeats=) value must be an int or a "
          f"DeferredSeries (encountered {type(repeats)}).")

+  if hasattr(pd.Series, 'compare'):
+


Not sure why yapf spaces this check and not the one in DataFrame

TheNeuralBit

Thanks!

TheNeuralBit · 2021-12-10T17:55:27Z

I tried to address the merge conflict myself in the GitHub UI, but it looks like I broke the formatter and linter. Could you look at those errors @roger-mike?

roger-mike · 2021-12-10T20:59:22Z

Done 👍

[BEAM-12565] Series implementation of compare

45c9f6f

roger-mike changed the title ~~[WIP][BEAM-12565] Dataframe compare implementation~~ [BEAM-12565] Dataframe compare implementation Dec 1, 2021

roger-mike force-pushed the feat/dataframe-compare branch from da75a84 to 4ae8f26 Compare December 6, 2021 22:31

[BEAM-12565] DataFrame implementation of compare

bb3dea2

roger-mike force-pushed the feat/dataframe-compare branch from 4ae8f26 to bb3dea2 Compare December 6, 2021 22:49

TheNeuralBit reviewed Dec 9, 2021

[BEAM-12565] Fixed minor issues and error checks

287ad69

roger-mike requested a review from TheNeuralBit December 9, 2021 23:35

TheNeuralBit approved these changes Dec 10, 2021

roger-mike force-pushed the feat/dataframe-compare branch 2 times, most recently from 47fd628 to 1a31d1d Compare December 10, 2021 18:24

Merge branch 'master' into feat/dataframe-compare

08becba

roger-mike force-pushed the feat/dataframe-compare branch from 1a31d1d to 08becba Compare December 10, 2021 18:35

roger-mike requested a review from TheNeuralBit December 10, 2021 20:59

TheNeuralBit merged commit f98a3b0 into apache:master Dec 10, 2021
