[SPARK-18251][SQL] the type of Dataset can't be Option of non-flat type #15979
Conversation
cc @yhuai @liancheng

Test build #69000 has finished for PR 15979 at commit …

What does "non-flat type" mean?

"Non-flat type" means "complex type", i.e. array, seq, map, product, etc.
retest it please
```scala
throw new UnsupportedOperationException(
  "Cannot create encoder for Option of non-flat type, as non-flat type is represented " +
    "as a row, and the entire row can not be null in Spark SQL like normal databases. " +
    "You can wrap your type with Tuple1 if you do want top level null objects.")
```
Let's provide an example in the error message to help users understand how to handle this case.
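For example, something along these lines (a sketch only, not the wording that was merged; `Point` and the `spark` session are illustrative):

```scala
import spark.implicits._

case class Point(x: Int, y: Int)

// Rejected after this PR: encoding a top-level None would require an
// all-null row, which Spark SQL does not allow.
// val bad = Seq(Some(Point(1, 2)), None).toDS()

// The suggested workaround: wrap with Tuple1, so the Option becomes a
// single nullable struct column instead of the row itself.
val ok = Seq(Tuple1(Some(Point(1, 2))), Tuple1(None)).toDS()
ok.printSchema()
// root
//  |-- _1: struct (nullable = true)
//  |    |-- x: integer (nullable = false)
//  |    |-- y: integer (nullable = false)
```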
looks good. @liancheng want to double check?

Test build #69326 has finished for PR 15979 at commit …

retest this please

Test build #69328 has finished for PR 15979 at commit …
My only concern is that "non-flat type" is neither intuitive nor a well-known term. In fact, this PR only prevents `Option` of `Product` types. Otherwise LGTM.
FWIW I don't think we should call it nonflat.

Test build #69387 has started for PR 15979 at commit …

retest this please

Good to merge pending Jenkins. Thanks!

Test build #69394 has finished for PR 15979 at commit …

retest this please

Test build #69403 has finished for PR 15979 at commit …

Test build #69411 has finished for PR 15979 at commit …

Merging to master. Thanks!

@rxin Shall we backport this to branch-2.1? I think it's relatively safe.

Sounds good.

Also backported to branch-2.1.
## What changes were proposed in this pull request?

For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow the entire row to be null; only its columns can be null. That's the reason we forbid users from using top-level null objects in #13469.

However, if users wrap a non-flat type with `Option`, we may still encode a top-level null object to a row, which is not allowed. This PR fixes this case, and suggests users wrap their type with `Tuple1` if they do want top-level null objects.

## How was this patch tested?

New test.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15979 from cloud-fan/option.

(cherry picked from commit f135b70)
Signed-off-by: Cheng Lian <lian@databricks.com>
```scala
if (ScalaReflection.optionOfProductType(tpe)) {
  throw new UnsupportedOperationException(
    "Cannot create encoder for Option of Product type, because Product type is represented " +
      …
```
this also means an `Aggregator` cannot use an `Option` of `Product` type for its intermediate type, e.g. `Aggregator[Int, Option[(Int, Int)], Int]` is now invalid. but i see no good reason why such an `Aggregator` wouldn't exist?
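For example, a sketch of such an aggregator (the sum-and-count logic and all names here are illustrative, not from the PR):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

// None is the empty buffer and Some((sum, count)) the running state, so the
// intermediate type is Option[(Int, Int)], exactly the shape the new check
// rejects when the buffer encoder is derived.
object SumCount extends Aggregator[Int, Option[(Int, Int)], Double] {
  def zero: Option[(Int, Int)] = None

  def reduce(b: Option[(Int, Int)], a: Int): Option[(Int, Int)] =
    b.map { case (s, c) => (s + a, c + 1) }.orElse(Some((a, 1)))

  def merge(b1: Option[(Int, Int)], b2: Option[(Int, Int)]): Option[(Int, Int)] =
    (b1.toSeq ++ b2.toSeq).reduceOption((x, y) => (x._1 + y._1, x._2 + y._2))

  def finish(r: Option[(Int, Int)]): Double =
    r.map { case (s, c) => s.toDouble / c }.getOrElse(0.0)

  // Deriving this encoder is where the new UnsupportedOperationException fires.
  def bufferEncoder: Encoder[Option[(Int, Int)]] = ExpressionEncoder()
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```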
this strikes me more as a limitation on `Dataset[X]` than on `Encoder[X]`
and now that i think about it more, i also think `Dataset[Option[(Int, Int)]]` should be valid too if possible. it should not be represented by a top-level `Row` object, so the schema should be:

```scala
StructType(StructField("_1",
  StructType(
    StructField("_1", IntegerType, false),
    StructField("_2", IntegerType, false)),
  true))
```

we do this trick where we nest top-level non-struct types inside a row, why not do the same thing for `Option[X <: Product]`?
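Built with the public types, that proposed schema would look like this sketch (field names as in the comment above; nothing here is merged behavior):

```scala
import org.apache.spark.sql.types._

// The Option value as one nullable struct column, rather than the tuple's
// fields spilling into the top-level row; a null struct column could then
// encode None without the whole row being null.
val proposed = StructType(Seq(
  StructField("_1",
    StructType(Seq(
      StructField("_1", IntegerType, nullable = false),
      StructField("_2", IntegerType, nullable = false))),
    nullable = true)))
```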
this means anything that uses an encoder can no longer use `Option[_ <: Product]`. `Dataset.groupByKey[K]` requires an encoder for `K`. none of these always create top-level row objects (for which this pull request creates the restriction that they cannot be null); for an aggregator it is sometimes the case. so i am not sure it makes sense to put this restriction on the encoder; it seems to belong on the dataset. another example of something that won't work anymore: given `val x: Dataset[(String, Option[(String, String)])]`, the chain `x.groupByKey(_._1).mapValues(_._2).agg(someAgg)` now fails, because in this case the `mapValues` requires an encoder for `Option[(String, String)]`.
Does it work before? Please see the discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-18251

Ideally we have a mapping between a type `T` and its catalyst schema, and `Option[T]` maps to the same catalyst schema as `T`, with additional null handling. We shouldn't change this mapping, which means we can't use a single-field struct type to represent `Option[T]`.

It's still possible to support `Option[T]` completely (without breaking backward compatibility), but that may need a lot of hacky code and special handling. I don't think it's worth it, as we can easily work around it with `Tuple1`.
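To make that mapping concrete, a sketch using the internal `ExpressionEncoder` (internal API, shown only for illustration): `Option[T]` shares `T`'s schema, differing only in nullability.

```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

ExpressionEncoder[Int]().schema
// StructType(StructField(value,IntegerType,false))

ExpressionEncoder[Option[Int]]().schema
// StructType(StructField(value,IntegerType,true))
```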
Yes it worked before
spark 2.0.x does not have `mapValues`. but this works:

```
scala> Seq(("a", Some((1, 1))), ("a", None)).toDS.groupByKey(_._2).count.show
+-----------+--------+
|        key|count(1)|
+-----------+--------+
|[null,null]|       1|
|      [1,1]|       1|
+-----------+--------+
```
admittedly the result looks weird. it really should be:

```
+-----------+--------+
|        key|count(1)|
+-----------+--------+
|       null|       1|
|      [1,1]|       1|
+-----------+--------+
```

is that a separate bug or related? i remember running into this before, because serializing and then deserializing `None` comes back out as `Some((null, null))`, which causes an NPE in codegen. i ran into this with `Aggregator` buffers.
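A sketch of that round trip using the internal encoder API as it existed around Spark 2.0/2.1 (the calls are assumptions from that era; after this PR the derivation throws instead of succeeding):

```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Before this PR this derived an encoder; after it, it throws
// UnsupportedOperationException.
val enc = ExpressionEncoder[Option[(Int, Int)]]().resolveAndBind()

// None serializes to a row whose two columns are both null ...
val row = enc.toRow(None)

// ... and deserializing reportedly yields Some((null, null)) rather than
// None, which is what produced the [null,null] key shown above.
val back = enc.fromRow(row)
```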