[SPARK-22472][SQL] add null check for top-level primitive values #19707
Conversation
assert(e.getCause.isInstanceOf[NullPointerException])

withTempPath { path =>
  Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
nit: `toDF()` also works.
not a big deal, but `toDF("i")` is more explicit about the column name.
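For reference, a quick sketch of the two variants (illustrative only, not part of the diff; assumes a `SparkSession` named `spark` is in scope):

```
import spark.implicits._

// toDF() uses the default column name "value"; toDF("i") names the column explicitly.
val df1 = Seq(new Integer(1), null).toDF()
val df2 = Seq(new Integer(1), null).toDF("i")
```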
LGTM except one minor comment
Test build #83644 has finished for PR 19707 at commit.
  expr
} else {
  AssertNotNull(expr, walkedTypePath)
}
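For context, a rough simplified sketch of what this branch does (not the actual `ScalaReflection` code; the placeholder values below are hypothetical): the deserializer expression for a non-nullable top-level type is wrapped in Catalyst's `AssertNotNull`, so a null input fails fast instead of silently becoming a default value.

```
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}
import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull

// Hypothetical stand-ins for the real deserializer expression and type path.
val expr: Expression = Literal(1)
val walkedTypePath = Seq("- root class: scala.Int")
val nullable = false

// Nullable types are returned as-is; primitives get an explicit null check.
val deserializer = if (nullable) expr else AssertNotNull(expr, walkedTypePath)
```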
Hi, @cloud-fan. It looks great. Can we add a test case in `ScalaReflectionSuite`, too?
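A hypothetical shape for such a check (the test actually added may assert something different; assumes the `ScalaReflection.deserializerFor` API of this era): it verifies that the deserializer generated for a top-level primitive is wrapped in `AssertNotNull`.

```
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull

// After this change, the top of the deserializer for a primitive like Int
// should be an AssertNotNull null check.
assert(ScalaReflection.deserializerFor[Int].isInstanceOf[AssertNotNull])
```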
assert(e.getCause.isInstanceOf[NullPointerException])

withTempPath { path =>
  Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
Is this PR orthogonal to data source formats? Could you test more data sources, like JSON, here?
It doesn't matter; I just need a DataFrame. I could even use `Seq(Some(1), None).toDF`, but using parquet is more convincing.
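For illustration, a sketch of the same end-to-end check against a JSON source (not part of this PR; assumes a `SparkSession` named `spark`, ScalaTest's `intercept`/`assert`, and that JSON schema inference types the column as bigint, hence `as[Long]`):

```
import java.nio.file.Files
import org.apache.spark.SparkException
import org.scalatest.Assertions._
import spark.implicits._

val path = Files.createTempDirectory("spark-22472").resolve("json").toString
Seq(new Integer(1), null).toDF("i").write.json(path)

// Reading the null row back into a top-level primitive should now fail fast.
val ds = spark.read.json(path).select("i").as[Long]
val e = intercept[SparkException](ds.map(_ * 2).collect())
assert(e.getCause.isInstanceOf[NullPointerException])
```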
Thanks!
+1, LGTM.
Test build #83654 has finished for PR 19707 at commit.
LGTM too
This needs a release note. The exception will be thrown when the data contains nulls and user code invokes the deserializer. For example, users can see `AssertNotNull` contained in the `DeserializeToObject` operator in the plan.
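For example, a user-level way to look for this in the plan (a sketch only; the exact explain output varies by version):

```
import spark.implicits._

val ds = Seq(Some(1), None).toDF("i").as[Int]
// The analyzed plan of a typed operation is expected to contain a
// DeserializeToObject operator whose expression includes assertnotnull(...).
ds.map(_ * 2).explain(true)
```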
LGTM
## What changes were proposed in this pull request?

One powerful feature of `Dataset` is that we can easily map SQL rows to Scala/Java objects and do a runtime null check automatically. For example, let's say we have a parquet file with schema `<a: int, b: string>`, and we have a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw an NPE if column `a` has null values.

However, the null checking is left behind for top-level primitive values. For example, let's say we have a parquet file with schema `<a: Int>`, and we read it into a Scala `Int`. If column `a` has null values, we will get some weird results.

```
scala> val ds = spark.read.parquet(...).as[Int]

scala> ds.show()
+----+
|v   |
+----+
|null|
|1   |
+----+

scala> ds.collect
res0: Array[Long] = Array(0, 1)

scala> ds.map(_ * 2).show
+-----+
|value|
+-----+
|-2   |
|2    |
+-----+
```

This is because internally Spark uses some special default values for primitive types, but never expects users to see or operate on these default values directly.

This PR adds a null check for top-level primitive values.

## How was this patch tested?

new test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19707 from cloud-fan/bug.

(cherry picked from commit 0025dde)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

# Conflicts:
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
#	sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
Thanks! Merged to master/2.2
What changes were proposed in this pull request?

One powerful feature of `Dataset` is that we can easily map SQL rows to Scala/Java objects and do a runtime null check automatically. For example, let's say we have a parquet file with schema `<a: int, b: string>`, and we have a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw an NPE if column `a` has null values.

However, the null checking is left behind for top-level primitive values. For example, let's say we have a parquet file with schema `<a: Int>`, and we read it into a Scala `Int`. If column `a` has null values, we will get some weird results. This is because internally Spark uses some special default values for primitive types, but never expects users to see or operate on these default values directly.

This PR adds a null check for top-level primitive values.
How was this patch tested?
new test
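A rough reconstruction of the end-to-end check from the review snippets above (the actual new test may differ; assumes a `SparkSession` named `spark`, ScalaTest's `intercept`/`assert`, and Spark's `withTempPath` test helper):

```
import org.apache.spark.SparkException
import org.scalatest.Assertions._
import spark.implicits._

withTempPath { path =>
  // Write a nullable int column, then read it back as a top-level Scala Int.
  Seq(new Integer(1), null).toDF("i").write.parquet(path.getCanonicalPath)
  val ds = spark.read.parquet(path.getCanonicalPath).as[Int]

  // With the new null check, the null row fails fast instead of becoming 0.
  val e = intercept[SparkException](ds.map(_ * 2).collect())
  assert(e.getCause.isInstanceOf[NullPointerException])
}
```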