[SPARK-54090][PYTHON] Fix cascading diff in assertDataFrameEqual when row counts differ by Yicong-Huang · Pull Request #55522 · apache/spark

Yicong-Huang · 2026-04-23T20:33:54Z

What changes were proposed in this pull request?

When assertDataFrameEqual is called with checkRowOrder=False (the default) and the two inputs have different row counts, a single missing/extra row cascades into a mismatch on every subsequent row. This inflates both the diff count and the reported mismatch percentage.

Root cause: after sorting both lists by str(row), assert_rows_equal pairs rows positionally with zip_longest. When one side is shorter, every row after the gap is paired with the wrong partner, so each of those pairs is reported as a diff.
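The shift can be reproduced without Spark. The following is a minimal sketch, not the actual assert_rows_equal code: plain tuples stand in for Row objects, and the percentage is assumed to be the number of differing pairs divided by the number of paired positions.

```python
from itertools import zip_longest

# Plain tuples stand in for pyspark Row objects in this illustration.
actual = [("1",), ("2",), ("3",), ("4",), ("5",)]
expected = [("1",), ("2",), ("4",), ("5",)]

# Sort both sides by their string form, then pair them by position.
pairs = list(zip_longest(sorted(actual, key=str), sorted(expected, key=str)))

# Every pair after the missing ("3",) is misaligned and counts as a diff.
positional_diffs = [(a, e) for a, e in pairs if a != e]

# Assumed formula: differing pairs over total paired positions.
percent = 100.0 * len(positional_diffs) / len(pairs)
print(positional_diffs, percent)
```

Only one row is actually missing, yet three of the five positional pairs differ.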

Fix: switch to a merge-walk over the sorted lists only when their lengths differ. Equal lengths keep zip_longest so that field-level diffs continue to be reported as paired rows (preserving existing docstrings and tests that rely on "B vs X" style pairing).

The merge-walk uses compare_rows for equality (honoring rtol/atol) and str(r) for ordering decisions (consistent with how the lists were sorted).
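A merge-walk of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this patch: merge_walk_diff and rows_equal are hypothetical names, rows_equal is a simplified stand-in for compare_rows that uses math.isclose (whose tolerance formula may differ from the real one), and plain tuples stand in for Row objects.

```python
import math


def rows_equal(a, e, rtol=1e-5, atol=1e-8):
    """Hypothetical stand-in for compare_rows: floats compare within tolerance."""
    if len(a) != len(e):
        return False
    for x, y in zip(a, e):
        if isinstance(x, float) and isinstance(y, float):
            if not math.isclose(x, y, rel_tol=rtol, abs_tol=atol):
                return False
        elif x != y:
            return False
    return True


def merge_walk_diff(actual, expected):
    """Return (actual_row, expected_row) diff tuples; None marks an absent side."""
    a = sorted(actual, key=str)
    e = sorted(expected, key=str)
    i = j = 0
    diffs = []
    while i < len(a) and j < len(e):
        if rows_equal(a[i], e[j]):
            # Matching rows: advance both sides, nothing to report.
            i += 1
            j += 1
        elif str(a[i]) < str(e[j]):
            # The actual row sorts first, so it has no partner: an extra row.
            diffs.append((a[i], None))
            i += 1
        elif str(a[i]) > str(e[j]):
            # The expected row sorts first, so it has no partner: a missing row.
            diffs.append((None, e[j]))
            j += 1
        else:
            # Same string form but not equal: report as a paired diff.
            diffs.append((a[i], e[j]))
            i += 1
            j += 1
    # Whatever remains on either side has no partner.
    diffs.extend((row, None) for row in a[i:])
    diffs.extend((None, row) for row in e[j:])
    return diffs


actual = [("1",), ("2",), ("3",), ("4",), ("5",)]
expected = [("1",), ("2",), ("4",), ("5",)]
print(merge_walk_diff(actual, expected))
```

Checking equality before comparing string forms matters: two rows that differ only within tolerance have different string forms, but they should still be consumed as a match rather than reported as one extra row plus one missing row.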

Why are the changes needed?

Reproducer:

from pyspark.testing.utils import assertDataFrameEqual
from pyspark.sql import Row

actual   = [Row(id='1'), Row(id='2'), Row(id='3'), Row(id='4'), Row(id='5')]
expected = [Row(id='1'), Row(id='2'),              Row(id='4'), Row(id='5')]
assertDataFrameEqual(actual, expected)

Before this fix: Results do not match: ( 60.00000 % ) (3 of 5 rows reported as different).
After this fix: Results do not match: ( 20.00000 % ) (only Row(id='3') is reported).

A larger example from the JIRA: rows1 has 5 rows, rows2 has 3 of them missing in the middle -- the old code reports 80% mismatch; the new code reports 40%, matching what a user would expect from a sorted-set comparison.

Does this PR introduce any user-facing change?

Yes, but only in the error message / reported data when assertDataFrameEqual(checkRowOrder=False) is given inputs whose row counts differ:

- The reported mismatch percentage is no longer inflated by positional shifting.
- includeDiffRows=True now returns (row, None) tuples for extras and (None, row) tuples for missing rows, rather than shifted (row, row) pairs.
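Code that consumes the diff rows may want to separate the three tuple shapes. The helper below is a hypothetical sketch: split_diff_rows is not part of PySpark, and it assumes the diff rows are already available as a list of (actual_row, expected_row) tuples in the shape described above. How that list is obtained from the raised error is not shown here.

```python
def split_diff_rows(diff_rows):
    """Split (actual, expected) diff tuples into extras, missing, and changed."""
    # (row, None): present in actual only.
    extras = [a for a, e in diff_rows if a is not None and e is None]
    # (None, row): present in expected only.
    missing = [e for a, e in diff_rows if a is None and e is not None]
    # (row, row): paired rows whose fields differ (the equal-length case).
    changed = [(a, e) for a, e in diff_rows if a is not None and e is not None]
    return extras, missing, changed


# Sample data in the assumed shape; tuples stand in for Row objects.
sample = [(("3",), None), (None, ("9",)), (("B",), ("X",))]
extras, missing, changed = split_diff_rows(sample)
print(extras, missing, changed)
```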

Behavior when row counts are equal (including all existing docstring examples) is unchanged.

How was this patch tested?

Added five targeted tests in python/pyspark/sql/tests/test_utils.py:
- test_different_row_count_middle_missing_no_cascading_diff
- test_different_row_count_multiple_missing
- test_different_row_count_includeDiffRows
- test_different_row_count_mixed_extra_and_missing
- test_different_row_count_extras_at_end

Was this patch authored or co-authored using generative AI tooling?

No.

Yicong-Huang · 2026-04-24T04:29:57Z

cc @HyukjinKwon @zhengruifeng

HyukjinKwon · 2026-04-24T04:33:23Z

Merged to master.

fix: use merge-walk diff for sorted rows in assertDataFrameEqual

5c5acdd

HyukjinKwon approved these changes Apr 24, 2026

HyukjinKwon closed this in 845a1b5 Apr 24, 2026

Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-54090
