[SPARK-40328][PS] Implement DataFrame.compare #55143
devin-petersohn wants to merge 8 commits into apache:master
Co-authored-by: Devin Petersohn <devin.petersohn@snowflake.com> Signed-off-by: Devin Petersohn <devin.petersohn@gmail.com>
Seems fine. cc @zhengruifeng @gaogaotiantian @Yicong-Huang

@zhengruifeng, @gaogaotiantian, or @Yicong-Huang - any thoughts?
    {"a": [2, 2, 3, 4, 1], "b": [2, 2, 3, 4, 1]},
    index=pd.Index([5, 4, 3, 2, 1]),
)
psdf1.compare(psdf2)
nit: Can we move the data initialization out of the assertRaisesRegex block? I think having the line that would actually raise in the block only will make it easier to understand.
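A minimal sketch of the suggested layout, using only the standard library. The `compare` function and its error message here are hypothetical stand-ins for illustration, not the pandas-on-Spark implementation; the point is only that setup lives outside the `assertRaisesRegex` block and the single raising call lives inside it.

```python
import unittest


def compare(left, right):
    """Hypothetical stand-in for DataFrame.compare on dict-of-lists data."""
    if left.keys() != right.keys():
        # Assumed message, chosen for this sketch only.
        raise ValueError("Can only compare identically-labeled objects")
    return {k: (left[k], right[k]) for k in left if left[k] != right[k]}


class CompareErrorTest(unittest.TestCase):
    def test_mismatched_labels(self):
        # Data initialization happens before the context manager, so a
        # failure here cannot be mistaken for the expected error.
        left = {"a": [1, 2], "b": [3, 4]}
        right = {"a": [1, 2], "c": [3, 4]}

        # Only the call that is expected to raise sits inside the block.
        with self.assertRaisesRegex(ValueError, "identically-labeled"):
            compare(left, right)
```

With this shape, an unexpected exception during setup surfaces as a test error rather than being accidentally matched (or masked) by the assertion.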
    return DataFrame(internal)

def compare(
    self, other: "DataFrame", keep_shape: bool = False, keep_equal: bool = False
We have some historical burdens about having different parameters for our pandas-like API, which could be confusing when they are used with positional arguments. However, for new APIs, how about we make keep_shape and keep_equal keyword only? If we want to add align_axis in the future, that won't be a problem.
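To illustrate the suggestion, here is a small sketch of a keyword-only signature. The `Frame` class is a hypothetical placeholder, not the real `DataFrame`; only the bare `*` in the parameter list matters.

```python
class Frame:
    """Hypothetical placeholder class used only to show the signature."""

    def compare(self, other, *, keep_shape=False, keep_equal=False):
        # Everything after the bare `*` must be passed by keyword.
        # The body just echoes the options; it performs no comparison.
        return {"keep_shape": keep_shape, "keep_equal": keep_equal}


frame = Frame()

# Keyword usage works as expected.
options = frame.compare(frame, keep_shape=True)

# Positional usage is rejected by Python itself with a TypeError.
try:
    frame.compare(frame, True)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Because callers can never rely on argument position, a parameter such as `align_axis` could later be added anywhere in the keyword-only section without breaking existing calls.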
# Determine which columns have any difference and prune the rest.
has_diff = sdf.select(
    [
        F.max(cond.cast("int")).alias(name_like_string(label))
        for label, cond in diff_conditions
    ]
).head()
There seem to be two data passes (scans): the first determines which columns have any differences, and the second filters the data. Is it possible to reduce the number of passes?
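As a plain-Python illustration of the idea behind this question (not Spark code, and not a claim about how the PR resolves it), the sketch below collects the differing rows and the per-column "has any difference" flags in a single loop over the input, then prunes columns on the already-filtered result. The function name and the list-of-dicts row format are assumptions made for this example.

```python
def compare_single_pass(left_rows, right_rows, columns):
    """Hypothetical helper: rows are dicts, aligned by position."""
    column_has_diff = {c: False for c in columns}
    differing_rows = []

    # One pass over the full input: record rows with any difference and
    # note which columns differed along the way.
    for idx, (left, right) in enumerate(zip(left_rows, right_rows)):
        changed = [c for c in columns if left[c] != right[c]]
        if changed:
            differing_rows.append((idx, left, right))
            for c in changed:
                column_has_diff[c] = True

    # Column pruning touches only the (usually much smaller) filtered rows.
    kept = [c for c in columns if column_has_diff[c]]
    return [
        (
            idx,
            {
                # Equal cells in a kept column are masked with None pairs.
                c: (left[c], right[c]) if left[c] != right[c] else (None, None)
                for c in kept
            },
        )
        for idx, left, right in differing_rows
    ]


left_rows = [{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}]
right_rows = [{"a": 1, "b": 4}, {"a": 2, "b": 0}, {"a": 3, "b": 6}]
result = compare_single_pass(left_rows, right_rows, ["a", "b"])
```

Whether an equivalent restructuring pays off in Spark depends on how the plan is executed, so this only shows the shape of the trade-off.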
Co-Authored-By: Claude <noreply@anthropic.com>
What changes were proposed in this pull request?
Implement `DataFrame.compare` for the pandas API on Spark. This was already implemented for `Series` but was unimplemented for `DataFrame`.

Why are the changes needed?
Implements missing API to improve pandas compatibility.
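For reference, the pandas behavior being matched can be sketched as follows. This uses plain pandas rather than the pandas API on Spark, and the sample data is made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
right = pd.DataFrame({"a": [1, 2, 3], "b": [4, 0, 6]})

# Default: keep only the rows and columns that contain a difference.
# Result columns are a MultiIndex of (column, "self") / (column, "other").
diff = left.compare(right)

# keep_shape=True keeps every row and column; equal cells become NaN
# unless keep_equal=True is also passed.
full = left.compare(right, keep_shape=True, keep_equal=True)
```

Here `diff` contains a single row (index `1`) with only column `b`, while `full` retains all three rows and both columns.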
Does this PR introduce any user-facing change?
Yes, new `DataFrame.compare` method.

How was this patch tested?
CI
Was this patch authored or co-authored using generative AI tooling?
Co-authored-by: Claude Opus 4