[SPARK-45554][PYTHON] Introduce flexible parameter to assertSchemaEqual #43450

Closed
itholic wants to merge 3 commits

itholic commented Oct 19, 2023

What changes were proposed in this pull request?

This PR proposes adding three new parameters to assertSchemaEqual: ignoreNullable, ignoreColumnOrder and ignoreColumnName, to give users more flexibility in schema testing.

Why are the changes needed?

To enhance the utility of assertSchemaEqual by accommodating various common schema comparison scenarios that users might encounter, without necessitating manual adjustments or workarounds.

Does this PR introduce any user-facing change?

Yes. assertSchemaEqual now has the option to use the three new parameters:

| Parameter | Type | Comment |
| --- | --- | --- |
| ignoreNullable | Boolean [optional] | Specifies whether a column’s nullable property is included when checking for schema equality. When set to True (default), the nullable property of the columns being compared is not taken into account and the columns are considered equal even if they have different nullable settings. When set to False, columns are considered equal only if they have the same nullable setting. |
| ignoreColumnOrder | Boolean [optional] | Specifies whether to compare columns in the order they appear in the schemas or by column name. When set to False (default), columns are compared in the order they appear in the schemas. When set to True, a column in the expected schema is compared to the column with the same name in the actual schema. ignoreColumnOrder cannot be set to True if ignoreColumnName is also set to True. |
| ignoreColumnName | Boolean [optional] | Specifies whether to fail the initial schema equality check if the column names in the two schemas are different. When set to False (default), column names are checked and the function fails if they are different. When set to True, the function succeeds even if column names are different. Column data types are compared for columns in the order they appear in the schemas. ignoreColumnName cannot be set to True if ignoreColumnOrder is also set to True. |
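
To make the interaction of the three flags concrete, here is a minimal pure-Python sketch of the comparison rules described in the table above. It is not the PySpark implementation: `Field` and `schemas_match` are hypothetical stand-ins invented for illustration (a schema is modeled as a plain list of fields), and only the parameter names and defaults are taken from this PR description.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for a schema field; not pyspark.sql.types.StructField.
    name: str
    dtype: str
    nullable: bool = True


def schemas_match(
    actual: List[Field],
    expected: List[Field],
    ignoreNullable: bool = True,
    ignoreColumnOrder: bool = False,
    ignoreColumnName: bool = False,
) -> bool:
    """Model of the rules in the table above (assumed semantics, not PySpark code)."""
    # The description states the two flags are mutually exclusive.
    if ignoreColumnOrder and ignoreColumnName:
        raise ValueError("ignoreColumnOrder and ignoreColumnName cannot both be True")
    if len(actual) != len(expected):
        return False
    if ignoreColumnOrder:
        # Pair columns by name instead of by position.
        actual = sorted(actual, key=lambda f: f.name)
        expected = sorted(expected, key=lambda f: f.name)
    for a, e in zip(actual, expected):
        if not ignoreColumnName and a.name != e.name:
            return False
        if a.dtype != e.dtype:
            return False
        if not ignoreNullable and a.nullable != e.nullable:
            return False
    return True


a = [Field("id", "bigint", nullable=False), Field("name", "string")]
b = [Field("name", "string"), Field("id", "bigint", nullable=True)]
c = [Field("key", "bigint", nullable=False), Field("label", "string")]

print(schemas_match(a, b))                          # positional compare: names differ
print(schemas_match(a, b, ignoreColumnOrder=True))  # paired by name, nullability ignored
print(schemas_match(a, b, ignoreColumnOrder=True, ignoreNullable=False))
print(schemas_match(a, c, ignoreColumnName=True))   # only types compared, by position
```

In this model the four calls evaluate to False, True, False and True respectively. The real function raises an assertion error on mismatch rather than returning a Boolean; the Boolean return here is only a simplification for the sketch.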

How was this patch tested?

Added usage examples to the doctests for each parameter.

Was this patch authored or co-authored using generative AI tooling?

No.


itholic commented Oct 19, 2023

cc @HyukjinKwon @allanf-db

itholic commented Oct 24, 2023

CI has also passed. Gentle reminder for @HyukjinKwon; also cc @ueshin @zhengruifeng.


allisonwang-db left a comment

These parameters will be super helpful!

Three review comments on python/pyspark/testing/utils.py (outdated, resolved)

HyukjinKwon (Member) commented

Merged to master.

itholic deleted the SPARK-45554 branch November 20, 2023 01:36