[SPARK-45552][PS] Introduce flexible parameters to assertDataFrameEqual
#43433
cc @HyukjinKwon @allanf-db FYI
Gentle reminder for @HyukjinKwon as CI passed. Also cc @ueshin @zhengruifeng
ignoreColumnOrder: bool = False,
ignoreColumnName: bool = False,
ignoreColumnType: bool = False,
Have we considered ignoreSchema?
It sounds worthwhile to have for some scenarios. Let's discuss it in a separate thread!
Can we also add some tests in test_utils?
Merged to master.
What changes were proposed in this pull request?
This PR proposes to add six new parameters to assertDataFrameEqual: ignoreNullable, ignoreColumnOrder, ignoreColumnName, ignoreColumnType, maxErrors, and showOnlyDiff, to provide users with more flexibility in DataFrame testing.
Why are the changes needed?
To enhance the utility of assertDataFrameEqual by accommodating common DataFrame comparison scenarios that users might encounter, without requiring manual adjustments or workarounds.
Does this PR introduce any user-facing change?
Yes. assertDataFrameEqual now has the option to use the six new parameters:
ignoreNullable:
When set to True (default), the nullable property of the columns being compared is not taken into account, and the columns are considered equal even if they have different nullable settings.
When set to False, columns are considered equal only if they have the same nullable setting.
ignoreColumnOrder:
When set to False (default), columns are compared in the order they appear in the DataFrames.
When set to True, a column in the expected DataFrame is compared to the column with the same name in the actual DataFrame.
ignoreColumnOrder cannot be set to True if ignoreColumnName is also set to True.
ignoreColumnName:
When set to False (default), column names are checked and the function fails if they are different.
When set to True, the function will succeed even if column names are different. Column data types are compared for columns in the order they appear in the DataFrames.
ignoreColumnName cannot be set to True if ignoreColumnOrder is also set to True.
ignoreColumnType:
When set to False (default), column data types are checked and the function fails if they are different.
When set to True, the schema equality check will succeed even if column data types are different, and the function will attempt to compare rows.
maxErrors:
Once this number of row comparisons has failed, the function returns regardless of how many rows have been compared.
Set to None by default, which means all rows are compared regardless of the number of failures.
showOnlyDiff:
If set to False (default), the error message will include all rows (when there is at least one row that is different).
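To make the interaction of these options easier to follow, here is a minimal pure-Python sketch of the semantics described above. It is a toy model only, not the code from this PR: the `Field` tuple and the `schemas_match` and `count_row_failures` helpers are hypothetical names invented for illustration, and the sketch assumes both inputs have the same number of rows.

```python
from typing import List, NamedTuple, Optional, Sequence


class Field(NamedTuple):
    """Toy stand-in for one schema column (hypothetical, not a Spark class)."""
    name: str
    dtype: str
    nullable: bool


def schemas_match(
    actual: List[Field],
    expected: List[Field],
    ignoreNullable: bool = True,
    ignoreColumnOrder: bool = False,
    ignoreColumnName: bool = False,
    ignoreColumnType: bool = False,
) -> bool:
    # The description states that these two options are mutually exclusive.
    if ignoreColumnOrder and ignoreColumnName:
        raise ValueError(
            "ignoreColumnOrder and ignoreColumnName cannot both be True"
        )
    if len(actual) != len(expected):
        return False
    if ignoreColumnOrder:
        # Pair columns by name instead of by position.
        actual = sorted(actual, key=lambda f: f.name)
        expected = sorted(expected, key=lambda f: f.name)
    for a, e in zip(actual, expected):
        if not ignoreColumnName and a.name != e.name:
            return False
        if not ignoreColumnType and a.dtype != e.dtype:
            return False
        if not ignoreNullable and a.nullable != e.nullable:
            return False
    return True


def count_row_failures(
    actual_rows: Sequence,
    expected_rows: Sequence,
    maxErrors: Optional[int] = None,
) -> int:
    # Assumption: both sequences have the same length and are already aligned.
    failures = 0
    for a, e in zip(actual_rows, expected_rows):
        if a != e:
            failures += 1
            # Stop early once maxErrors mismatches have been seen.
            if maxErrors is not None and failures >= maxErrors:
                break
    return failures
```

In this model, two schemas that differ only in column order fail the default check but pass with `ignoreColumnOrder=True`, and `maxErrors=2` stops the row scan after the second mismatch instead of visiting every row.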
How was this patch tested?
Added usage examples to the doctests for each parameter.
Was this patch authored or co-authored using generative AI tooling?
No.