[SPARK-40328][PS] Implement DataFrame.compare by devin-petersohn · Pull Request #55143 · apache/spark

devin-petersohn · 2026-04-01T18:20:29Z

What changes were proposed in this pull request?

Implement DataFrame.compare for the pandas API on Spark. compare was already implemented for Series but not for DataFrame.
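For readers unfamiliar with the method, here is a minimal sketch of the pandas behavior this PR targets. It uses plain pandas rather than pandas-on-Spark, on the assumption that the new method mirrors pandas for the `keep_shape` and `keep_equal` parameters shown in the diff. The frames `left` and `right` are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
right = pd.DataFrame({"a": [1, 9, 3], "b": [4, 5, 6]})

# Default: only rows and columns that contain a difference are kept.
# Columns become a two-level index: (column name, "self" / "other").
diff = left.compare(right)

# keep_shape=True keeps every row and column; equal cells become NaN.
full = left.compare(right, keep_shape=True)

# keep_equal=True shows the equal values instead of NaN.
full_equal = left.compare(right, keep_shape=True, keep_equal=True)
```

Only row 1 of column `a` differs here, so `diff` has a single row and the two columns `("a", "self")` and `("a", "other")`.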

Why are the changes needed?

Implements a missing API to improve pandas compatibility.

Does this PR introduce any user-facing change?

Yes, a new DataFrame.compare method.

How was this patch tested?

CI

Was this patch authored or co-authored using generative AI tooling?

Co-authored-by: Claude Opus 4

Co-authored-by: Devin Petersohn <devin.petersohn@snowflake.com> Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

HyukjinKwon · 2026-04-05T22:47:42Z

Seems fine. cc @zhengruifeng @gaogaotiantian @Yicong-Huang

devin-petersohn · 2026-04-07T19:59:38Z

@zhengruifeng, @gaogaotiantian, or @Yicong-Huang - any thoughts?

gaogaotiantian · 2026-04-07T21:29:39Z

+                {"a": [2, 2, 3, 4, 1], "b": [2, 2, 3, 4, 1]},
+                index=pd.Index([5, 4, 3, 2, 1]),
+            )
+            psdf1.compare(psdf2)


nit: Can we move the data initialization out of the assertRaisesRegex block? I think keeping only the line that actually raises inside the block will make it easier to understand.
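A minimal sketch of the layout suggested here, using only the standard library. `compare_indexes` and its error message are hypothetical stand-ins for the call under test; they are not taken from the PR.

```python
import unittest


def compare_indexes(left, right):
    """Hypothetical stand-in for an operation that rejects mismatched indexes."""
    if left != right:
        raise ValueError("Can only compare identically-labeled objects")
    return True


class CompareErrorTest(unittest.TestCase):
    def test_mismatched_index_raises(self):
        # Build the inputs outside the context manager ...
        left_index = [1, 2, 3, 4, 5]
        right_index = [5, 4, 3, 2, 1]

        # ... so the block contains only the call expected to raise.
        with self.assertRaisesRegex(ValueError, "identically-labeled"):
            compare_indexes(left_index, right_index)
```

With this layout, an unexpected ValueError raised during setup fails the test instead of being mistaken for the expected error.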

gaogaotiantian · 2026-04-07T21:31:26Z

        return DataFrame(internal)

+    def compare(
+        self, other: "DataFrame", keep_shape: bool = False, keep_equal: bool = False


We carry some historical baggage in our pandas-like API: some methods have different parameters, which can be confusing when they are passed as positional arguments. For new APIs, however, how about we make keep_shape and keep_equal keyword-only? Then adding align_axis in the future won't be a problem.
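A minimal sketch of what keyword-only parameters look like in Python. `Frame` is a hypothetical toy class, not the pandas-on-Spark DataFrame; only the signature matters.

```python
class Frame:
    """Hypothetical minimal class used only to illustrate the signature."""

    def __init__(self, data):
        self.data = data

    def compare(self, other, *, keep_shape=False, keep_equal=False):
        # Parameters after the bare `*` can only be passed by keyword.
        return {"keep_shape": keep_shape, "keep_equal": keep_equal}


left, right = Frame([1, 2]), Frame([1, 3])
options = left.compare(right, keep_shape=True)

try:
    left.compare(right, True)  # positional use is rejected with TypeError
    positional_accepted = True
except TypeError:
    positional_accepted = False
```

Because callers cannot rely on argument position, a parameter such as align_axis could later be added anywhere after the `*` without breaking existing calls.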

Yicong-Huang · 2026-04-07T22:23:52Z

+            # Determine which columns have any difference and prune the rest.
+            has_diff = sdf.select(
+                [
+                    F.max(cond.cast("int")).alias(name_like_string(label))
+                    for label, cond in diff_conditions
+                ]
+            ).head()


There seem to be two data passes (scans): the first to determine which columns have any differences, and the second to filter the data. Is it possible to reduce the number of passes?
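To make the question concrete, here is a plain-Python sketch of the single-pass idea: record which columns differ while collecting the differing rows, instead of scanning the input once per question. This is only an illustration on in-memory lists. It is not Spark code and not necessarily how the PR addressed the comment; `compare_rows` and the sample rows are hypothetical.

```python
def compare_rows(left_rows, right_rows, columns):
    """Return (differing rows, columns with any difference) in one input pass."""
    cols_with_diff = set()
    differing = []
    # Single scan over the input: collect differing rows and column flags together.
    for index, (left, right) in enumerate(zip(left_rows, right_rows)):
        changed = [c for c in columns if left[c] != right[c]]
        if changed:
            cols_with_diff.update(changed)
            differing.append((index, left, right))
    # Prune columns using only the already-collected differing rows.
    kept = [c for c in columns if c in cols_with_diff]
    result = [
        (index, {c: (left[c], right[c]) for c in kept})
        for index, left, right in differing
    ]
    return result, kept


left_rows = [{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}]
right_rows = [{"a": 1, "b": 4}, {"a": 9, "b": 5}, {"a": 3, "b": 6}]
rows, kept_columns = compare_rows(left_rows, right_rows, ["a", "b"])
```

The pruning step touches only the rows that already differ, so the full input is read once. Whether an equivalent plan is possible in Spark depends on how the query is expressed.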

devin-petersohn and others added 5 commits April 1, 2026 13:18

[SPARK-40328][PS] Implement DataFrame.compare

6398f02

Co-authored-by: Devin Petersohn <devin.petersohn@snowflake.com> Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

fix version tag and add empty test

b7b5192

Co-authored-by: Devin Petersohn <devin.petersohn@snowflake.com> Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

register test in CI modules and add connect parity test

eb1d828

Co-authored-by: Devin Petersohn <devin.petersohn@snowflake.com> Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

fix doctest output

0556dfa

Co-authored-by: Devin Petersohn <devin.petersohn@snowflake.com> Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

use numeric-only doctest examples

fe907cd

Co-authored-by: Devin Petersohn <devin.petersohn@snowflake.com> Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>

gaogaotiantian reviewed Apr 7, 2026

Yicong-Huang reviewed Apr 7, 2026

devin-petersohn and others added 3 commits April 8, 2026 08:57

retrigger CI

52767e0

optimize compare query plan and fix tests

710de1c

Co-Authored-By: Claude <noreply@anthropic.com>

Merge upstream/master into devin/dataframe-compare

6feacca

devin-petersohn wants to merge 8 commits into apache:master from devin-petersohn:devin/dataframe-compare


4 participants
