
[SPARK-37829][SQL] Dataframe.joinWith outer-join should return a null value for unmatched row #40755

Closed
kings129 wants to merge 8 commits into apache:master from kings129:encoder_bug_fix

Conversation

kings129 (Contributor) commented Apr 12, 2023

What changes were proposed in this pull request?

When doing an outer join with joinWith on DataFrames, unmatched rows return Row objects with null fields instead of a single null value. This is not the expected behavior; it is a regression introduced in this commit.
This pull request fixes the regression. Note that it is not a full rollback of that commit: it does not add back the "schema" variable.

```
case class ClassData(a: String, b: Int)
val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF
val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF

left.joinWith(right, left("b") === right("b"), "left_outer").collect
```

```
Wrong results (current behavior):    Array(([a,1],[null,null]), ([b,2],[x,2]))
Correct results:                     Array(([a,1],null), ([b,2],[x,2]))
```
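In practical terms, the fix means callers can test the unmatched side against null directly. A small usage sketch (the pattern match is illustrative, not code from this PR):

```
// With the fix, the unmatched right side is a single null, so callers can
// match on it directly instead of probing a Row whose fields are all null.
val pairs = left.joinWith(right, left("b") === right("b"), "left_outer")
  .collect()
  .map {
    case (l, null) => (l.getString(0), None)               // unmatched row
    case (l, r)    => (l.getString(0), Some(r.getInt(1)))  // matched row
  }
// pairs: Array((a,None), (b,Some(2)))
```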

Why are the changes needed?

We need to address the regression mentioned above. It results in an unexpected behavior change in the DataFrame joinWith API between versions 2.4.8 and 3.0.0+, which could cause data correctness issues for users who expect the old behavior when using Spark 3.0.0+.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test (reusing the test from a previously closed pull request; credit to Clément de Groc).
Ran the sql-core and sql-catalyst modules locally with `./build/mvn clean package -pl sql/core,sql/catalyst`.
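For readers unfamiliar with Spark's QueryTest-style suites, a rough sketch of the shape such a test takes (the test name and assertions are illustrative, not copied from the PR):

```
test("SPARK-37829: joinWith outer join should return null for unmatched rows") {
  // Reuses ClassData from the example above.
  val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF()
  val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF()
  val joined = left.joinWith(right, left("b") === right("b"), "left_outer")
  // The unmatched left row must pair with a single null, not Row(null, null).
  checkAnswer(joined.toDF(), Row(Row("a", 1), null) :: Row(Row("b", 2), Row("x", 2)) :: Nil)
}
```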

github-actions bot added the SQL label Apr 12, 2023
hvanhovell (Contributor) commented:

cc @zhenlineo since you are working on this on the connect side.

kings129 (Contributor, Author) commented:

cc @cloud-fan @viirya for review, thanks!


Review thread on the changed encoder construction:

```
new ExpressionEncoder[Any](
  nullSafe(newSerializerInput, newSerializer),
  nullSafe(newDeserializerInput, newDeserializer),
```
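Here nullSafe wraps the serializer and deserializer so that a null input short-circuits to a null output. Conceptually it behaves like the following sketch (an approximation of the helper, not the exact Spark source):

```
import org.apache.spark.sql.catalyst.expressions.{Expression, If, IsNull, Literal}

// Sketch: if the input expression evaluates to null, the whole
// (de)serializer yields null instead of running `result`.
def nullSafe(input: Expression, result: Expression): Expression =
  If(IsNull(input), Literal.create(null, result.dataType), result)
```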
cloud-fan (Contributor) commented:

It seems we push the null check down to the children deserializers. Why is the serializer fine?

kings129 (Contributor, Author) commented:

This change is intended to create a deserializer of the form newinstance(class scala.Tuple*) that can convert to a single null value, matching the behavior before the commit that introduced the regression.
Regarding the serializer: in the new unit test added in this pull request, when the tuple is not null a named_struct is created for each element, and null is handled there.

```
if (isnull(input[0, scala.Tuple2, true])) null
else named_struct(
  _1, if (isnull(input[0, scala.Tuple2, true]._1)) null
      else named_struct(
        a, if (input[0, scala.Tuple2, true]._1.isNullAt) null
           else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
                validateexternaltype(getexternalrowfield(input[0, scala.Tuple2, true]._1, 0, a), StringType, ObjectType(class java.lang.String)),
                true, false, true),
        b, assertnotnull(validateexternaltype(getexternalrowfield(input[0, scala.Tuple2, true]._1, 1, b), IntegerType, ObjectType(class java.lang.Integer)).intValue)
      ) AS _1#18,
  _2, if (isnull(input[0, scala.Tuple2, true]._2)) null
      else named_struct(
        a, if (input[0, scala.Tuple2, true]._2.isNullAt) null
           else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString,
                validateexternaltype(getexternalrowfield(input[0, scala.Tuple2, true]._2, 0, a), StringType, ObjectType(class java.lang.String)),
                true, false, true),
        b, assertnotnull(validateexternaltype(getexternalrowfield(input[0, scala.Tuple2, true]._2, 1, b), IntegerType, ObjectType(class java.lang.Integer)).intValue)
      ) AS _2#19
)
```

kings129 (Contributor, Author) commented:

@cloud-fan does my comment answer your question? PTAL, thanks!

cloud-fan (Contributor) commented:

It looks correct to me to add the null check for the children deserializers, but I don't quite understand why this PR removes the outermost null check. After looking at the code, I think it doesn't matter, as the outermost null check will be removed anyway: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L274

Since this is unrelated to this PR, let's not touch it. If you do want to fix it (adding a null check and removing it later is useless), let's fix the serializer as well.

kings129 (Contributor, Author) commented:

@cloud-fan, thanks for the explanation! You're right; it doesn't matter whether we keep the outermost null check. (The null check for the deserializer was also added in the refactor commit.)

I also prefer making minimal changes to fix the target issue, so I added back the outermost null check for the deserializer.

cloud-fan closed this in 74ce620 on Apr 19, 2023
cloud-fan pushed a commit that referenced this pull request on Apr 19, 2023

… value for unmatched row

Closes #40755 from kings129/encoder_bug_fix.

Authored-by: --global <xuqiang129@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 74ce620)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan (Contributor) commented:

thanks, merging to master/3.4!

cloud-fan (Contributor) commented:

@kings129 can you open a new PR for branch 3.3? Thanks!

kings129 added a commit to kings129/spark that referenced this pull request on Apr 19, 2023

… value for unmatched row

Closes apache#40755 from kings129/encoder_bug_fix.

Authored-by: --global <xuqiang129@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
kings129 (Contributor, Author) commented:

> @kings129 can you open a new PR for branch 3.3? Thanks!

Thanks for the quick review, @cloud-fan!
Yes, here is the pull request for branch 3.3: #40858

dongjoon-hyun pushed a commit that referenced this pull request on May 7, 2023

… null value for unmatched row

This is a pull request to port the fix from the master branch to version 3.3. [PR](#40755)

Closes #40858 from kings129/fix_encoder_branch_33.

Authored-by: --global <xuqiang129@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request on Jun 20, 2023

… value for unmatched row (cherry picked from commit 74ce620)
GladwinLee pushed a commit to lyft/spark that referenced this pull request on Oct 10, 2023

… value for unmatched row (cherry picked from commit 74ce620)
catalinii pushed a commit to lyft/spark that referenced this pull request on Oct 10, 2023

… value for unmatched row (cherry picked from commit 74ce620)
cloud-fan pushed a commit that referenced this pull request on Mar 14, 2024

## What changes were proposed in this pull request?

#40755 adds a null check on the input of the child deserializer in the tuple encoder. This breaks the deserializer for the `Option` type, because null should be deserialized into `None` rather than null. This PR adds a boolean parameter to `ExpressionEncoder.tuple` so that only the use case that #40755 intended to fix gets this null check.

## How was this patch tested?

Unit test.

Closes #45508 from chenhao-db/SPARK-47385.

Authored-by: Chenhao Li <chenhao.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
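To see why the unconditional null check is a problem for `Option`, consider a simple round-trip. A minimal sketch (illustrative only, not the test from #45508):

```
// Assumes a SparkSession `spark` and `import spark.implicits._`.
// A missing value in an Option[_] column must come back as None, never null.
// An unconditional null check in the tuple encoder short-circuits to null
// before the Option deserializer gets a chance to produce None.
val ds = Seq((1, Some("a")), (2, None: Option[String])).toDS()
// Expected after the SPARK-47385 fix: (2, None), not (2, null).
assert(ds.collect().toSet == Set((1, Some("a")), (2, None)))
```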
cloud-fan pushed a commit that referenced this pull request on Mar 14, 2024 (cherry picked from commit 9986462)

cloud-fan pushed a commit that referenced this pull request on Mar 14, 2024 (cherry picked from commit 9986462)

cloud-fan pushed a commit that referenced this pull request on Mar 14, 2024 (cherry picked from commit 9986462)
sweisdb pushed a commit to sweisdb/spark that referenced this pull request on Apr 1, 2024