
[SPARK-22472][SQL] add null check for top-level primitive values #19707

Closed
cloud-fan wants to merge 2 commits into apache:master from cloud-fan:bug

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

One powerful feature of Dataset is that we can easily map SQL rows to Scala/Java objects, with runtime null checks performed automatically.

For example, let's say we have a parquet file with schema <a: int, b: string> and a case class Data(a: Int, b: String). Users can easily read this parquet file into Data objects, and Spark will throw an NPE if column a has null values.

However, this null checking is missing for top-level primitive values. For example, let's say we have a parquet file with schema <a: int>, and we read it into a Scala Int. If column a has null values, we get some weird results.

scala> val ds = spark.read.parquet(...).as[Int]

scala> ds.show()
+----+
|v   |
+----+
|null|
|1   |
+----+

scala> ds.collect
res0: Array[Long] = Array(0, 1)

scala> ds.map(_ * 2).show
+-----+
|value|
+-----+
|-2   |
|2    |
+-----+

This is because internally Spark uses special default values for primitive types, but never expects users to see or operate on these default values directly.

This PR adds a null check for top-level primitive values.
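
A minimal sketch of the new behavior (illustrative only: it swaps the parquet file for an equivalent in-memory DataFrame, and assumes the failure surfaces as a SparkException wrapping the NPE, as the new test suggests):

```
import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
import spark.implicits._

// Option[Int] produces a nullable int column; the None row becomes a SQL null.
val df = Seq(Some(1), None).toDF("i")

try {
  // Before this patch this returned Array(1, 0): the null silently became the
  // internal default value for Int. Now the top-level deserializer is wrapped
  // in AssertNotNull, so the job fails instead.
  df.as[Int].collect()
} catch {
  case e: SparkException =>
    assert(e.getCause.isInstanceOf[NullPointerException])
}
```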

How was this patch tested?

new test

@cloud-fan
Contributor Author

cc @gatorsmile @kiszk @srowen

assert(e.getCause.isInstanceOf[NullPointerException])

withTempPath { path =>
  Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
Member

nit: toDF() also works.

Contributor Author

Not a big deal, but toDF("i") is more explicit about the column name.
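
For context, a sketch of how the surrounding test plausibly fits together, reconstructed from the excerpt above (the exact DatasetSuite body may differ):

```
withTempPath { path =>
  Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
  // Reading the nullable column back as a non-nullable Int should now fail.
  val e = intercept[SparkException] {
    spark.read.parquet(path.getCanonicalPath).as[Int].collect()
  }
  assert(e.getCause.isInstanceOf[NullPointerException])
}
```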

@kiszk
Member

kiszk commented Nov 9, 2017

LGTM except one minor comment

@SparkQA

SparkQA commented Nov 9, 2017

Test build #83644 has finished for PR 19707 at commit dad5080.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// If the top-level type is nullable, keep the deserializer as-is; otherwise
// (e.g. a primitive like Int) wrap it in a runtime null check.
if (nullable) {
  expr
} else {
  AssertNotNull(expr, walkedTypePath)
}
Member

Hi, @cloud-fan.
It looks great. Can we add a test case in ScalaReflectionSuite, too?
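
A rough sketch of what such a test could assert (illustrative; the test actually added to ScalaReflectionSuite may differ):

```
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull

// A non-nullable top-level type like Int should get the runtime null check...
assert(ScalaReflection.deserializerFor[Int].isInstanceOf[AssertNotNull])
// ...while a nullable top-level type like java.lang.Integer should not.
assert(!ScalaReflection.deserializerFor[java.lang.Integer].isInstanceOf[AssertNotNull])
```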

assert(e.getCause.isInstanceOf[NullPointerException])

withTempPath { path =>
  Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
Member

dongjoon-hyun commented Nov 9, 2017

Is this PR orthogonal to data source formats?
Could you test more data sources, like JSON, here?

Contributor Author

It doesn't matter; I just need a DataFrame. I could even use Seq(Some(1), None).toDF, but using parquet is more convincing.
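
For reference, the in-memory alternative yields the same shape of input, a nullable int column containing a null:

```
Seq(Some(1), None).toDF("i").printSchema()
// root
//  |-- i: integer (nullable = true)
```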

Member

Thanks!

Member

dongjoon-hyun left a comment

+1, LGTM.

@SparkQA

SparkQA commented Nov 9, 2017

Test build #83654 has finished for PR 19707 at commit a59c399.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

HyukjinKwon left a comment

LGTM too

Member

gatorsmile left a comment

This needs a release note. Exceptions will now be thrown when the data contains nulls and the code invokes the deserializer. For example, users can see AssertNotNull in the DeserializeToObject operator in the plan.
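
Illustratively (the exact plan text is approximate and varies by version), the check is visible when explaining a typed operation:

```
val ds = Seq(1, 2).toDS()
ds.map(_ * 2).explain(true)
// The analyzed plan contains something like (abbreviated):
// SerializeFromObject [input[0, int, false] AS value#...]
// +- MapElements <function1>, ...
//    +- DeserializeToObject assertnotnull(...), obj#...: int
//       +- LocalRelation [value#...]
```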

@gatorsmile
Member

LGTM

asfgit closed this in 0025dde Nov 10, 2017
asfgit pushed a commit that referenced this pull request Nov 10, 2017

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19707 from cloud-fan/bug.

(cherry picked from commit 0025dde)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
#	sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
@gatorsmile
Member

Thanks! Merged to master/2.2

MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19707 from cloud-fan/bug.

(cherry picked from commit 0025dde)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
#	sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala