[SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual` #43433

itholic · 2023-10-18T11:29:20Z

What changes were proposed in this pull request?

This PR proposes adding six new parameters to `assertDataFrameEqual`: ignoreNullable, ignoreColumnOrder, ignoreColumnName, ignoreColumnType, maxErrors, and showOnlyDiff. They give users more flexibility in DataFrame testing.

Why are the changes needed?

To enhance the utility of assertDataFrameEqual by accommodating various common DataFrame comparison scenarios that users might encounter, without necessitating manual adjustments or workarounds.

Does this PR introduce any user-facing change?

Yes. `assertDataFrameEqual` now accepts six new optional parameters:

| Parameter | Type | Comment |
| --- | --- | --- |
| ignoreNullable | Boolean [optional] | Specifies whether a column’s nullable property is included when checking for schema equality. When set to True (default), the nullable property of the columns being compared is not taken into account, and the columns are considered equal even if they have different nullable settings. When set to False, columns are considered equal only if they have the same nullable setting. |
| ignoreColumnOrder | Boolean [optional] | Specifies whether to compare columns in the order they appear in the DataFrames or by column name. When set to False (default), columns are compared in the order they appear in the DataFrames. When set to True, a column in the expected DataFrame is compared to the column with the same name in the actual DataFrame. ignoreColumnOrder cannot be set to True if ignoreColumnName is also set to True. |
| ignoreColumnName | Boolean [optional] | Specifies whether to fail the initial schema equality check if the column names in the two DataFrames are different. When set to False (default), column names are checked and the function fails if they are different. When set to True, the function succeeds even if column names are different; column data types are then compared for columns in the order they appear in the DataFrames. ignoreColumnName cannot be set to True if ignoreColumnOrder is also set to True. |
| ignoreColumnType | Boolean [optional] | Specifies whether to ignore the data type of the columns when comparing. When set to False (default), column data types are checked and the function fails if they are different. When set to True, the schema equality check succeeds even if column data types are different, and the function attempts to compare rows. |
| maxErrors | Integer [optional] | The maximum number of row comparison failures to encounter before returning. Once this many row comparisons have failed, the function returns regardless of how many rows have been compared. Defaults to None, which means all rows are compared regardless of the number of failures. |
| showOnlyDiff | Boolean [optional] | If set to True, the error message includes only the rows that are different. If set to False (default), the error message includes all rows (when at least one row is different). |
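To make the interaction between the four schema-related flags concrete, here is a minimal pure-Python sketch of the rules in the table above. It is not the PySpark implementation: `Field` and `schemas_match` are hypothetical names invented for this illustration, the snake_case arguments only mirror the camelCase parameters, and pairing columns by sorted name is an assumption about how "compare by column name" could be modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for one schema column (not a PySpark class)."""
    name: str
    dtype: str
    nullable: bool = True


def schemas_match(
    actual: List[Field],
    expected: List[Field],
    ignore_nullable: bool = True,
    ignore_column_order: bool = False,
    ignore_column_name: bool = False,
    ignore_column_type: bool = False,
) -> bool:
    """Model of the schema check described in the parameter table."""
    if ignore_column_order and ignore_column_name:
        # The table states that these two flags cannot both be True.
        raise ValueError(
            "ignore_column_order and ignore_column_name are mutually exclusive"
        )
    if len(actual) != len(expected):
        return False
    if ignore_column_order:
        # Assumption: pair columns by name instead of by position.
        actual = sorted(actual, key=lambda f: f.name)
        expected = sorted(expected, key=lambda f: f.name)
    for a, e in zip(actual, expected):
        if not ignore_column_name and a.name != e.name:
            return False
        if not ignore_column_type and a.dtype != e.dtype:
            return False
        if not ignore_nullable and a.nullable != e.nullable:
            return False
    return True


left = [Field("id", "bigint", nullable=False), Field("name", "string")]
right = [Field("name", "string"), Field("id", "bigint", nullable=True)]

# Positional comparison: "id" is paired with "name", so the names differ.
print(schemas_match(left, right))
# Name-based comparison: columns line up, and nullability is ignored by default.
print(schemas_match(left, right, ignore_column_order=True))
```

In this model the first call reports a mismatch and the second a match; passing `ignore_nullable=False` to the second call would make it fail again because the `id` columns differ in nullability.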

How was this patch tested?

Added usage examples into doctest for each parameter.
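As an illustration of the doctest approach, the sketch below documents a function with `>>>` examples in its docstring and runs them with the standard `doctest` module. The function `diff_rows` is a hypothetical pure-Python stand-in, not part of PySpark; it only models the maxErrors and showOnlyDiff semantics from the parameter table, and it assumes both inputs have the same number of rows.

```python
import doctest
from typing import List, Optional, Sequence


def diff_rows(
    actual: Sequence[tuple],
    expected: Sequence[tuple],
    max_errors: Optional[int] = None,
    show_only_diff: bool = False,
) -> List[str]:
    """Report row comparisons, modelling maxErrors and showOnlyDiff.

    >>> actual = [(1, "a"), (2, "x"), (3, "y")]
    >>> expected = [(1, "a"), (2, "b"), (3, "c")]
    >>> diff_rows(actual, expected, show_only_diff=True)
    ["row 1: (2, 'x') != (2, 'b')", "row 2: (3, 'y') != (3, 'c')"]
    >>> diff_rows(actual, expected, max_errors=1, show_only_diff=True)
    ["row 1: (2, 'x') != (2, 'b')"]
    >>> diff_rows(actual, actual)
    []
    """
    report: List[str] = []
    failures = 0
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            failures += 1
            report.append(f"row {i}: {a!r} != {e!r}")
        elif not show_only_diff:
            # Matching rows are listed only when show_only_diff is False.
            report.append(f"row {i}: {a!r} == {e!r}")
        if max_errors is not None and failures >= max_errors:
            break  # stop once the failure budget is used up
    # Per the table, rows are reported only when at least one row differs.
    return report if failures else []


# Run the docstring examples; failures, if any, are printed to stdout.
doctest.run_docstring_examples(diff_rows, {"diff_rows": diff_rows}, name="diff_rows")
```

Keeping the examples in the docstring means each parameter's documented behavior is checked whenever the doctests run.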

Was this patch authored or co-authored using generative AI tooling?

No.

itholic · 2023-10-18T11:29:57Z

cc @HyukjinKwon @allanf-db FYI


itholic · 2023-10-24T23:46:43Z

Gentle reminder for @HyukjinKwon as CI passed. Also cc @ueshin @zhengruifeng

allisonwang-db · 2023-10-27T00:44:38Z

python/pyspark/testing/utils.py

+    ignoreColumnOrder: bool = False,
+    ignoreColumnName: bool = False,
+    ignoreColumnType: bool = False,


Have we considered `ignoreSchema`?

It sounds worth having for some scenarios. Let's discuss it in a separate thread!

allisonwang-db

Can we also add some tests in test_utils?


HyukjinKwon · 2023-10-30T02:06:40Z

Merged to master.

[SPARK-45552][PS] Introduce flexible parameters to assertDataFrameEqual

95e236f

github-actions bot added the PYTHON label Oct 18, 2023

itholic added 3 commits October 18, 2023 20:36

fix docstring

e2241be

remove unnecessary import for doctest

a6321d0

Merge branch 'master' of https://github.com/apache/spark into SPARK-45552

32cc9a9

itholic changed the title ~~[SPARK-45552][PS] Introduce flexible parameters to assertDataFrameEqual~~ [SPARK-45554][PS] Introduce flexible parameters to assertDataFrameEqual Oct 19, 2023

itholic changed the title ~~[SPARK-45554][PS] Introduce flexible parameters to assertDataFrameEqual~~ [SPARK-45552][PS] Introduce flexible parameters to assertDataFrameEqual Oct 19, 2023

itholic added 3 commits October 19, 2023 10:01

fix docs

72143cc

Add ignoreNullable

9c9d0bd

Merge branch 'master' of https://github.com/apache/spark into SPARK-45552

4dd0e41

allisonwang-db reviewed Oct 27, 2023


Added unittests

15761f5

github-actions bot added the SQL label Oct 27, 2023

itholic added 3 commits October 27, 2023 11:41

fix linter

a847b30

remove dummy

2905e58

Merge branch 'master' of https://github.com/apache/spark into SPARK-45552

b9e848e

HyukjinKwon approved these changes Oct 30, 2023


HyukjinKwon closed this in 4af4dde Oct 30, 2023

itholic deleted the SPARK-45552 branch November 20, 2023 01:36
