[SPARK-31227][SQL] Non-nullable null type in complex types should not coerce to nullable type #27991

HyukjinKwon · 2020-03-23T13:03:52Z

What changes were proposed in this pull request?

This PR prevents a non-nullable null type inside a complex type from being coerced to a nullable type.

A null-typed struct field, array element, or map entry that is marked non-nullable can indicate an empty array, struct, or map. Because such a container holds no values, nullability does not need to be forced when we find common types.

This PR also reverts and supersedes d7b97a1.
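
To make the intended rule concrete, here is a minimal, self-contained sketch of common-type resolution for arrays. It is not Spark's implementation: `DType`, `NullT`, `IntT`, `ArrayT` and `commonType` are hypothetical stand-ins, and the model only covers the post-change behavior described above, where a non-nullable null element type does not force `containsNull` to true.

```scala
object NullabilitySketch {
  // Hypothetical, simplified stand-ins for Spark's data types (not the real API).
  sealed trait DType
  case object NullT extends DType
  case object IntT extends DType
  final case class ArrayT(element: DType, containsNull: Boolean) extends DType

  // Find a common type for two types, or None if they are incompatible.
  def commonType(a: DType, b: DType): Option[DType] = (a, b) match {
    case (x, y) if x == y => Some(x)
    // The null type widens to whatever the other side is.
    case (NullT, other) => Some(other)
    case (other, NullT) => Some(other)
    // For arrays, element nullability is only the OR of both inputs;
    // a null element type on its own does not force containsNull = true.
    case (ArrayT(e1, n1), ArrayT(e2, n2)) =>
      commonType(e1, e2).map(e => ArrayT(e, n1 || n2))
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    val emptyArray = ArrayT(NullT, containsNull = false) // models array()
    val intArray = ArrayT(IntT, containsNull = false)    // models array(1)
    val nullArray = ArrayT(NullT, containsNull = true)   // models array(NULL)

    // An empty array merged with array(1) keeps containsNull = false.
    assert(commonType(emptyArray, intArray) == Some(ArrayT(IntT, containsNull = false)))
    // An empty array merged with array(NULL) is nullable because array(NULL) is.
    assert(commonType(emptyArray, nullArray) == Some(ArrayT(NullT, containsNull = true)))
  }
}
```

The same idea would apply to struct fields and map entries, which this sketch leaves out for brevity.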

Why are the changes needed?

To make type coercion coherent and consistent. Currently, we correctly keep the nullability even between non-nullable fields:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
spark.range(1).select(array(lit(1)).cast(ArrayType(IntegerType, false))).printSchema()
spark.range(1).select(array(lit(1)).cast(ArrayType(DoubleType, false))).printSchema()

spark.range(1).selectExpr("concat(array(1), array(1)) as arr").printSchema()

Does this PR introduce any user-facing change?

Yes.

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
spark.range(1).select(array().cast(ArrayType(IntegerType, false))).printSchema()

spark.range(1).selectExpr("concat(array(), array(1)) as arr").printSchema()

Before:

org.apache.spark.sql.AnalysisException: cannot resolve 'array()' due to data type mismatch: cannot cast array<null> to array<int>;;
'Project [cast(array() as array<int>) AS array()#68]
+- Range (0, 1, step=1, splits=Some(12))

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:149)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:140)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:333)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:330)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)

root
 |-- arr: array (nullable = false)
 |    |-- element: integer (containsNull = true)

After:

root
 |-- array(): array (nullable = false)
 |    |-- element: integer (containsNull = false)

root
 |-- arr: array (nullable = false)
 |    |-- element: integer (containsNull = false)
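
The "After" schemas can also be checked programmatically instead of by reading `printSchema()` output. The following is a sketch only, assuming a Spark build that includes this change (3.0 or later) with default settings and a local session; the object name, app name and column alias are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.array
import org.apache.spark.sql.types.{ArrayType, IntegerType}

object EmptyArrayNullabilityCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for a schema-only check.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-array-nullability-check")
      .getOrCreate()
    try {
      val expected = ArrayType(IntegerType, containsNull = false)

      // Casting an empty array to a non-nullable element type should analyze.
      val casted = spark.range(1).select(array().cast(expected).as("arr"))
      assert(casted.schema("arr").dataType == expected)

      // Concatenating with an empty array should keep containsNull = false.
      val concatenated = spark.range(1)
        .selectExpr("concat(array(), array(1)) as arr")
      assert(concatenated.schema("arr").dataType == expected)
    } finally {
      spark.stop()
    }
  }
}
```

On a build without this change, the first `select` would be expected to fail analysis with the exception shown under "Before".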

How was this patch tested?

Unit tests were added, and the change was also tested manually.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

SparkQA · 2020-03-23T18:05:30Z

Test build #120206 has finished for PR 27991 at commit 7c54a5f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-23T18:33:21Z

Test build #120209 has finished for PR 27991 at commit 51acfeb.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-24T05:28:30Z

Test build #120226 has finished for PR 27991 at commit dfd9343.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-24T07:05:01Z

Test build #120246 has finished for PR 27991 at commit 7bae2c1.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-24T07:05:02Z

Test build #120235 has finished for PR 27991 at commit e4557ae.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-03-24T09:01:10Z

retest this please

SparkQA · 2020-03-24T14:41:06Z

Test build #120261 has finished for PR 27991 at commit 7bae2c1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-25T05:16:56Z

Test build #120292 has finished for PR 27991 at commit 05dc916.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-25T18:07:38Z

Test build #120356 has finished for PR 27991 at commit d57d4a6.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-03-25T19:35:32Z

Test build #120360 has finished for PR 27991 at commit 1bf68f0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-03-26T07:43:07Z

thanks, merging to master/3.0!

… coerce to nullable type

Closes #27991 from HyukjinKwon/SPARK-31227.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 3bd10ce)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

bart-samwel · 2020-04-23T12:01:21Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala

@@ -1532,6 +1532,13 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
    assert(e.getMessage.contains("string, binary or array"))
  }

+  test("SPARK-31227: Non-nullable null type should not coerce to nullable type in concat") {


bart-samwel: How about concat(array(), array(NULL))? That should have the same type as array(NULL).

maropu: Yes, the output has the same type as array(null):

scala> sql("select concat(array(), array(NULL))").printSchema
root
 |-- concat(array(), array(NULL)): array (nullable = false)
 |    |-- element: null (containsNull = true)

scala> sql("select array()").printSchema
root
 |-- array(): array (nullable = false)
 |    |-- element: null (containsNull = false)

scala> sql("select array(null)").printSchema
root
 |-- array(NULL): array (nullable = false)
 |    |-- element: null (containsNull = true)

Any concern?

bart-samwel: Only that it should be tested, since it's an interesting corner case!

HyukjinKwon: I believe those cases are tested in CastSuite.scala and TypeCoercionSuite.scala, covering all types, unless I missed something. I kept just one end-to-end test here since it was the case reported in the JIRA SPARK-31227.
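
For reference, the corner case raised in this thread could be pinned down with an end-to-end check along these lines. This is only a sketch, assuming Spark 3.0 or later with default settings (where `array()` has a null element type, as in the REPL output above); it is not one of the tests added by this PR, and the object name and alias are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, NullType}

object ConcatNullArrayCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("concat-null-array-check")
      .getOrCreate()
    try {
      val df = spark.sql("select concat(array(), array(NULL)) as arr")
      // array(NULL) contains a null, so the result must stay nullable
      // even though array() on its own has containsNull = false.
      assert(df.schema("arr").dataType == ArrayType(NullType, containsNull = true))
    } finally {
      spark.stop()
    }
  }
}
```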

HyukjinKwon requested a review from cloud-fan March 23, 2020 13:03

cloud-fan reviewed Mar 23, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

maropu reviewed Mar 23, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

maropu reviewed Mar 23, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (outdated)

cloud-fan reviewed Mar 24, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (outdated)

cloud-fan reviewed Mar 24, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (outdated)

viirya reviewed Mar 24, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (outdated)

Non-nullable null type should not coerce to nullable type

7bae2c1

HyukjinKwon force-pushed the SPARK-31227 branch from e4557ae to 7bae2c1 March 24, 2020 06:28

Addres comments

05dc916

maropu reviewed Mar 25, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

maropu approved these changes Mar 25, 2020

Address comments

d57d4a6

cloud-fan reviewed Mar 25, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (outdated)

Address comments

1bf68f0

maropu approved these changes Mar 25, 2020

cloud-fan closed this in 3bd10ce Mar 26, 2020

bart-samwel reviewed Apr 23, 2020

HyukjinKwon deleted the SPARK-31227 branch July 27, 2020 07:45

bart-samwel Apr 23, 2020

maropu Apr 23, 2020

bart-samwel Apr 23, 2020

HyukjinKwon Apr 24, 2020
