
[SPARK-22472][SQL] add null check for top-level primitive values #19707

Closed
cloud-fan wants to merge 2 commits into apache:master from cloud-fan:bug

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

One powerful feature of Dataset is that we can easily map SQL rows to Scala/Java objects, with runtime null checks performed automatically.

For example, let's say we have a parquet file with schema <a: int, b: string> and a case class Data(a: Int, b: String). Users can easily read this parquet file into Data objects, and Spark will throw an NPE if column a has null values.

However, this null checking is missing for top-level primitive values. For example, let's say we have a parquet file with schema <a: int>, and we read it into a Scala Int. If column a has null values, we get some weird results.

scala> val ds = spark.read.parquet(...).as[Int]

scala> ds.show()
+----+
|v   |
+----+
|null|
|1   |
+----+

scala> ds.collect
res0: Array[Long] = Array(0, 1)

scala> ds.map(_ * 2).show
+-----+
|value|
+-----+
|-2   |
|2    |
+-----+

This is because internally Spark uses special default values for primitive types, but never expects users to see or operate on these default values directly.

This PR adds a null check for top-level primitive values.
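
A minimal sketch of the new behavior (illustrative only: it swaps the parquet file for an equivalent in-memory DataFrame, and assumes the failure surfaces as a SparkException wrapping the NPE, as the new test suggests):

```
import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

// Option[Int] produces a nullable int column; the None row becomes a SQL null.
val df = Seq(Some(1), None).toDF("i")

try {
  // Before this patch this returned Array(1, 0): the null silently became the
  // internal default value for Int. Now the top-level deserializer is wrapped
  // in AssertNotNull, so the job fails instead.
  df.as[Int].collect()
} catch {
  case e: SparkException =>
    assert(e.getCause.isInstanceOf[NullPointerException])
}
```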

How was this patch tested?

new test

@cloud-fan
Contributor Author

cc @gatorsmile @kiszk @srowen

assert(e.getCause.isInstanceOf[NullPointerException])

withTempPath { path =>
  Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
Member

nit: toDF() also works.

Contributor Author

Not a big deal, but toDF("i") is more explicit about the column name.
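
For context, a sketch of how the surrounding test plausibly fits together, reconstructed from the excerpt above (the exact DatasetSuite body may differ):

```
withTempPath { path =>
  Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
  // Reading the nullable column back as a non-nullable Int should now fail.
  val e = intercept[SparkException] {
    spark.read.parquet(path.getCanonicalPath).as[Int].collect()
  }
  assert(e.getCause.isInstanceOf[NullPointerException])
}
```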

@kiszk
Member

kiszk commented Nov 9, 2017

LGTM except one minor comment

@SparkQA

SparkQA commented Nov 9, 2017

Test build #83644 has finished for PR 19707 at commit dad5080.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// If the top-level type is nullable, keep the deserializer as-is; otherwise
// (e.g. a primitive like Int) wrap it in a runtime null check.
if (nullable) {
  expr
} else {
  AssertNotNull(expr, walkedTypePath)
}
Member

Hi, @cloud-fan.
It looks great. Can we add a test case in ScalaReflectionSuite, too?
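
A rough sketch of what such a test could assert (illustrative; the test actually added to ScalaReflectionSuite may differ):

```
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull

// A non-nullable top-level type like Int should get the runtime null check...
assert(ScalaReflection.deserializerFor[Int].isInstanceOf[AssertNotNull])
// ...while a nullable top-level type like java.lang.Integer should not.
assert(!ScalaReflection.deserializerFor[java.lang.Integer].isInstanceOf[AssertNotNull])
```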

assert(e.getCause.isInstanceOf[NullPointerException])

withTempPath { path =>
  Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
Member

dongjoon-hyun commented Nov 9, 2017

Is this PR orthogonal to data source formats?
Could you test more data sources, like JSON, here?

Contributor Author

It doesn't matter; I just need a DataFrame. I could even use Seq(Some(1), None).toDF, but using parquet is more convincing.
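
For reference, the in-memory alternative yields the same shape of input, a nullable int column containing a null:

```
Seq(Some(1), None).toDF("i").printSchema()
// root
//  |-- i: integer (nullable = true)
```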

Member

Thanks!

Member

dongjoon-hyun left a comment

+1, LGTM.

@SparkQA

SparkQA commented Nov 9, 2017

Test build #83654 has finished for PR 19707 at commit a59c399.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

HyukjinKwon left a comment

LGTM too

Member

gatorsmile left a comment

This needs a release note. Exceptions will now be thrown when the data contains nulls and the code invokes the deserializer. For example, users can see AssertNotNull in the DeserializeToObject operator in the plan.
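
Illustratively (the exact plan text is approximate and varies by version), the check is visible when explaining a typed operation:

```
val ds = Seq(1, 2).toDS()
ds.map(_ * 2).explain(true)
// The analyzed plan contains something like (abbreviated):
// SerializeFromObject [input[0, int, false] AS value#...]
// +- MapElements <function1>, ...
//    +- DeserializeToObject assertnotnull(...), obj#...: int
//       +- LocalRelation [value#...]
```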

@gatorsmile
Member

LGTM

asfgit closed this in 0025dde Nov 10, 2017
asfgit pushed a commit that referenced this pull request Nov 10, 2017

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19707 from cloud-fan/bug.

(cherry picked from commit 0025dde)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
#	sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
@gatorsmile
Member

Thanks! Merged to master/2.2

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19707 from cloud-fan/bug.

(cherry picked from commit 0025dde)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
#	sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala