[SPARK-54090][PYTHON] Fix cascading diff in assertDataFrameEqual when row counts differ #55522

Closed
Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-54090

Yicong-Huang commented Apr 23, 2026

What changes were proposed in this pull request?

When assertDataFrameEqual is called with checkRowOrder=False (the default) and the two inputs have different row counts, a single missing/extra row cascades into a mismatch on every subsequent row. This inflates both the diff count and the reported mismatch percentage.

Root cause: after sorting both lists by str(row), assert_rows_equal pairs rows with zip_longest. When one side is shorter, the pairing shifts past every row following the hole.
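The shift can be reproduced outside PySpark. The following is a minimal sketch, not the library's code: it uses plain strings in place of Row objects and mirrors only the sort-then-zip_longest pairing described above.

```python
from itertools import zip_longest

# Plain strings stand in for Row objects (an assumption made for brevity).
actual = ["1", "2", "3", "4", "5"]
expected = ["1", "2", "4", "5"]  # "3" is missing

# Sort both sides by their string form, then pair them positionally.
pairs = list(zip_longest(sorted(actual, key=str), sorted(expected, key=str)))

# Every pair after the hole is misaligned, so each one counts as a diff.
mismatched = [(a, e) for a, e in pairs if a != e]

print(mismatched)  # [('3', '4'), ('4', '5'), ('5', None)]
print(f"{100 * len(mismatched) / len(pairs):.5f} %")  # 60.00000 %
```

One missing row yields three reported diffs, which is the 60% figure quoted in the reproducer below.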

Fix: switch to a merge-walk over the sorted lists only when their lengths differ. Equal lengths keep zip_longest so that field-level diffs continue to be reported as paired rows (preserving existing docstrings and tests that rely on "B vs X" style pairing).

The merge-walk uses compare_rows for equality (honoring rtol/atol) and str(r) for ordering decisions (consistent with how the lists were sorted).
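A merge-walk of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the patch itself: `rows_match` is a hypothetical stand-in for compare_rows, `merge_walk_diff` is a hypothetical name, and plain values are used in place of Row objects.

```python
from math import isclose
from typing import Any, List, Optional, Tuple

Diff = Tuple[Optional[Any], Optional[Any]]  # (actual, expected)


def rows_match(a: Any, e: Any, rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    """Hypothetical stand-in for compare_rows: floats compare within tolerance."""
    if isinstance(a, float) and isinstance(e, float):
        return isclose(a, e, rel_tol=rtol, abs_tol=atol)
    return a == e


def merge_walk_diff(actual: List[Any], expected: List[Any]) -> List[Diff]:
    """Walk two lists that are already sorted by str(row) and collect diffs."""
    diffs: List[Diff] = []
    i = j = 0
    while i < len(actual) and j < len(expected):
        a, e = actual[i], expected[j]
        if rows_match(a, e):
            # Rows agree: advance both sides together.
            i += 1
            j += 1
        elif str(a) < str(e):
            # `a` sorts first, so it has no partner: an extra row in actual.
            diffs.append((a, None))
            i += 1
        else:
            # `e` sorts first (or ties without matching): missing from actual.
            diffs.append((None, e))
            j += 1
    # Whatever remains on either side is unmatched.
    diffs.extend((a, None) for a in actual[i:])
    diffs.extend((None, e) for e in expected[j:])
    return diffs


print(merge_walk_diff(["1", "2", "3", "4", "5"], ["1", "2", "4", "5"]))
# [('3', None)]
print(merge_walk_diff(["1", "3", "6"], ["1", "2", "3"]))
# [(None, '2'), ('6', None)]
```

Because only the side that sorts first is advanced on a mismatch, the walk realigns immediately after a hole instead of shifting every later pair.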

Why are the changes needed?

Reproducer:

from pyspark.testing.utils import assertDataFrameEqual
from pyspark.sql import Row

actual   = [Row(id='1'), Row(id='2'), Row(id='3'), Row(id='4'), Row(id='5')]
expected = [Row(id='1'), Row(id='2'),              Row(id='4'), Row(id='5')]
assertDataFrameEqual(actual, expected)  # raises PySparkAssertionError: Row(id='3') has no match in expected

Before this fix: Results do not match: ( 60.00000 % ) (3 of 5 rows reported as different).
After this fix: Results do not match: ( 20.00000 % ) (only Row(id='3') is reported).

A larger example from the JIRA: rows1 has 5 rows and rows2 contains only 3 of them, with the missing rows in the middle. The old code reports an 80% mismatch; the new code reports 40% (2 of 5 rows), matching what a user would expect from a sorted-set comparison.

Does this PR introduce any user-facing change?

Yes, but only in the error message / reported data when assertDataFrameEqual(checkRowOrder=False) is given inputs whose row counts differ:

  • The reported mismatch percentage is no longer inflated by positional shifting.
  • includeDiffRows=True now returns (row, None) tuples for extras and (None, row) tuples for missing rows, rather than shifted (row, row) pairs.

Behavior when row counts are equal (including all existing docstring examples) is unchanged.

How was this patch tested?

  • Added five targeted tests in python/pyspark/sql/tests/test_utils.py:
    • test_different_row_count_middle_missing_no_cascading_diff
    • test_different_row_count_multiple_missing
    • test_different_row_count_includeDiffRows
    • test_different_row_count_mixed_extra_and_missing
    • test_different_row_count_extras_at_end

Was this patch authored or co-authored using generative AI tooling?

No.

@Yicong-Huang

cc @HyukjinKwon @zhengruifeng

@HyukjinKwon

Merged to master.
