[SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property #29576

Closed
wants to merge 1 commit into from

Conversation

viirya
Member

@viirya viirya commented Aug 29, 2020

What changes were proposed in this pull request?

This PR changes the key data type check in HashJoin to use sameType. It backports #29555 to branch-2.4.

Why are the changes needed?

The resolution condition of SetOperation requires only that each left data type be sameType as the corresponding right one. Logically, the EqualTo expression in an equi-join likewise requires only that the left data type be sameType as the right data type. It is therefore unreasonable for HashJoin to require the left key data types to be exactly the same as the right key data types.

This stricter check produces inconsistent results when doing except between two dataframes.

If the two dataframes have no nested fields, HashJoin passes the key type check even when the fields' nullable properties differ, because it checks each field's data type individually and so ignores field nullability.

If the two dataframes have nested fields such as a struct, HashJoin fails the key type check, because it now compares two struct types and the nullability of the nested fields affects the comparison.
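
For illustration only, the difference between the two checks can be sketched with a toy model in plain Python. Everything here is an assumption made for the sketch: the classes IntegerType, StructField and StructType are simplified stand-ins, and same_type, strict_keys_match and relaxed_keys_match are hypothetical helpers. None of this is Spark's actual code; it only mirrors the behavior described above, where a strict comparison sees nested nullability and a sameType-style comparison ignores it.

```python
from dataclasses import dataclass
from typing import Tuple


# Toy stand-ins for Spark SQL data types (hypothetical, not Spark's API).
@dataclass(frozen=True)
class IntegerType:
    pass


@dataclass(frozen=True)
class StructField:
    name: str
    data_type: object
    nullable: bool = True


@dataclass(frozen=True)
class StructType:
    fields: Tuple[StructField, ...]


def same_type(left, right):
    """Compare two types while ignoring nullability at every nesting level."""
    if isinstance(left, StructType) and isinstance(right, StructType):
        return len(left.fields) == len(right.fields) and all(
            l.name == r.name and same_type(l.data_type, r.data_type)
            for l, r in zip(left.fields, right.fields)
        )
    return left == right


def strict_keys_match(left_keys, right_keys):
    """Exact equality of key types: nested nullability is part of the comparison."""
    return list(left_keys) == list(right_keys)


def relaxed_keys_match(left_keys, right_keys):
    """Pairwise sameType-style comparison: nullability is ignored."""
    return len(left_keys) == len(right_keys) and all(
        same_type(l, r) for l, r in zip(left_keys, right_keys)
    )


# Flat keys: nullability belongs to the column, not to the key's data type,
# so both sides present the same IntegerType to either check.
flat_left = [IntegerType()]
flat_right = [IntegerType()]

# Nested keys: the struct type itself carries the nullability of its fields.
nested_left = [StructType((StructField("a", IntegerType(), nullable=True),))]
nested_right = [StructType((StructField("a", IntegerType(), nullable=False),))]

flat_strict = strict_keys_match(flat_left, flat_right)
nested_strict = strict_keys_match(nested_left, nested_right)
nested_relaxed = relaxed_keys_match(nested_left, nested_right)
```

In this model the strict check accepts the flat keys but rejects the nested ones, while the relaxed check accepts both, which is the consistency this PR aims for.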

Does this PR introduce any user-facing change?

Yes. The except operation between dataframes now behaves consistently, whether or not the dataframes have nested fields.

How was this patch tested?

Unit test.

@viirya
Member Author

viirya commented Aug 29, 2020

cc @maropu

@maropu
Member

maropu commented Aug 29, 2020

Thanks! LGTM

@SparkQA

SparkQA commented Aug 29, 2020

Test build #128009 has finished for PR 29576 at commit d015266.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu pushed a commit that referenced this pull request Aug 29, 2020
…t nullable property

Closes #29576 from viirya/SPARK-32693-2.4.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
@maropu
Member

maropu commented Aug 29, 2020

Thanks! Merged to branch-2.4.

@maropu maropu closed this Aug 29, 2020
@viirya
Member Author

viirya commented Aug 29, 2020

Thanks @maropu

@viirya viirya deleted the SPARK-32693-2.4 branch December 27, 2023 18:24