[SPARK-40328][PS] Implement DataFrame.compare #55143

Open
devin-petersohn wants to merge 8 commits into apache:master from devin-petersohn:devin/dataframe-compare
Conversation

@devin-petersohn
Contributor

What changes were proposed in this pull request?

Implement DataFrame.compare for the pandas API on Spark. This was already implemented for Series but was unimplemented for DataFrame.

Why are the changes needed?

Implements missing API to improve pandas compatibility.

Does this PR introduce any user-facing change?

Yes, new DataFrame.compare method.

How was this patch tested?

CI

Was this patch authored or co-authored using generative AI tooling?

Co-authored-by: Claude Opus 4

devin-petersohn and others added 5 commits April 1, 2026 13:18
Co-authored-by: Devin Petersohn <devin.petersohn@snowflake.com>
Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
@HyukjinKwon
Member

Seems fine. cc @zhengruifeng @gaogaotiantian @Yicong-Huang

@devin-petersohn
Contributor Author

@zhengruifeng, @gaogaotiantian, or @Yicong-Huang - any thoughts?

{"a": [2, 2, 3, 4, 1], "b": [2, 2, 3, 4, 1]},
index=pd.Index([5, 4, 3, 2, 1]),
)
psdf1.compare(psdf2)
Contributor

nit: Can we move the data initialization out of the assertRaisesRegex block? I think having the line that would actually raise in the block only will make it easier to understand.
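
For illustration, the suggested layout might look like the sketch below. It is not the PR's actual test: `compare_stub` and its error message are hypothetical stand-ins for `psdf1.compare(psdf2)`, used so the example runs without Spark.

```python
import unittest


def compare_stub(left_index, right_index):
    # Hypothetical stand-in for psdf1.compare(psdf2): raises when the
    # two index label sequences are not identical.
    if left_index != right_index:
        raise ValueError("Can only compare identically-labeled objects")
    return []


class CompareErrorTest(unittest.TestCase):
    def test_mismatched_index(self):
        # Data initialization stays outside the context manager ...
        left_index = [1, 2, 3, 4, 5]
        right_index = [5, 4, 3, 2, 1]

        # ... so the block contains only the call that is expected to raise.
        with self.assertRaisesRegex(ValueError, "identically-labeled"):
            compare_stub(left_index, right_index)
```

With this layout, an unexpected failure while building the test data surfaces as an ordinary error instead of being mistaken for (or masked by) the exception under test.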

Comment thread python/pyspark/pandas/frame.py Outdated
        return DataFrame(internal)

    def compare(
        self, other: "DataFrame", keep_shape: bool = False, keep_equal: bool = False
Contributor

We have some historical burdens about having different parameters for our pandas-like API, which could be confusing when they are used with positional arguments. However, for new APIs, how about we make keep_shape and keep_equal keyword only? If we want to add align_axis in the future, that won't be a problem.
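
A minimal sketch of what the keyword-only suggestion would mean for callers, assuming a bare `*` in the signature. The `Frame` class here is a hypothetical stand-in, not the pandas-on-Spark `DataFrame`, and its `compare` simply echoes the flags it received.

```python
class Frame:
    """Hypothetical stand-in for DataFrame, used only to show the signature."""

    def compare(self, other, *, keep_shape: bool = False, keep_equal: bool = False):
        # Everything after the bare `*` must be passed by keyword.
        return (keep_shape, keep_equal)


left, right = Frame(), Frame()

# Keyword usage works as before.
flags = left.compare(right, keep_shape=True)

# Positional usage is rejected, so inserting a new parameter such as
# align_axis ahead of keep_shape later cannot silently change its meaning.
try:
    left.compare(right, True)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Note that pandas itself accepts these arguments positionally, so keyword-only parameters would be stricter than the pandas signature.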

Comment thread python/pyspark/pandas/frame.py Outdated
Comment on lines +9373 to +9379
        # Determine which columns have any difference and prune the rest.
        has_diff = sdf.select(
            [
                F.max(cond.cast("int")).alias(name_like_string(label))
                for label, cond in diff_conditions
            ]
        ).head()
Contributor

There seem to be two passes over the data: the first to determine which columns have any differences, and the second to filter the data. Is it possible to reduce the number of passes?
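
To make the two passes concrete, here is a small in-memory model in plain Python. It is not the PR's Spark code: `compare_two_pass` is a hypothetical helper, columns are modeled as dicts of lists, and `None` stands in for the missing values that pandas shows for equal cells.

```python
def compare_two_pass(left, right):
    """Model of the two scans; left/right map column name -> list of values."""
    columns = list(left)
    n_rows = len(left[columns[0]]) if columns else 0

    # Pass 1: per column, does any row differ? This plays the role of the
    # max(cond.cast("int")) aggregation in the snippet under review.
    kept = [
        c for c in columns
        if any(a != b for a, b in zip(left[c], right[c]))
    ]

    # Pass 2: scan the rows again, keeping those where a kept column differs.
    rows = []
    for i in range(n_rows):
        if any(left[c][i] != right[c][i] for c in kept):
            cells = {
                c: (left[c][i], right[c][i])
                if left[c][i] != right[c][i]
                else (None, None)
                for c in kept
            }
            rows.append((i, cells))
    return kept, rows


kept, rows = compare_two_pass(
    {"a": [1, 2, 3], "b": [1, 2, 3]},
    {"a": [1, 9, 3], "b": [1, 2, 3]},
)
```

In this model the first pass exists only to decide which columns appear in the output; in Spark, each pass would correspond to a separate job over `sdf`, which is the cost the question above is about.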

4 participants