[SPARK-18251][SQL] the type of Dataset can't be Option of non-flat type #15979
Conversation
cc @yhuai @liancheng

Test build #69000 has finished for PR 15979 at commit …

What does "non-flat type" mean?

"Non-flat type" means "complex type", i.e. array, seq, map, product, etc.
retest it please
```scala
throw new UnsupportedOperationException(
  "Cannot create encoder for Option of non-flat type, as non-flat type is represented " +
    "as a row, and the entire row can not be null in Spark SQL like normal databases. " +
    "You can wrap your type with Tuple1 if you do want top level null objects.")
```
Let's provide an example in the error message to help users understand how to handle this case.
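For example, something along these lines (a sketch only, not the wording that was merged; `Point` and the `spark` session are illustrative):

```scala
import spark.implicits._

case class Point(x: Int, y: Int)

// Rejected after this PR: encoding a top-level None would require an
// all-null row, which Spark SQL does not allow.
// val bad = Seq(Some(Point(1, 2)), None).toDS()

// The suggested workaround: wrap with Tuple1, so the Option becomes a
// single nullable struct column instead of the row itself.
val ok = Seq(Tuple1(Some(Point(1, 2))), Tuple1(None)).toDS()
ok.printSchema()
// root
//  |-- _1: struct (nullable = true)
//  |    |-- x: integer (nullable = false)
//  |    |-- y: integer (nullable = false)
```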
looks good. @liancheng want to double check?

Test build #69326 has finished for PR 15979 at commit …

retest this please

Test build #69328 has finished for PR 15979 at commit …
My only concern is that "non-flat type" is neither intuitive nor a well-known term. In fact, this PR only prevents `Option` of `Product` types. Otherwise LGTM.
FWIW I don't think we should call it nonflat.

Test build #69387 has started for PR 15979 at commit …

retest this please

Good to merge pending Jenkins. Thanks!

Test build #69394 has finished for PR 15979 at commit …

retest this please

Test build #69403 has finished for PR 15979 at commit …

Test build #69411 has finished for PR 15979 at commit …

Merging to master. Thanks!

@rxin Shall we backport this to branch-2.1? I think it's relatively safe.

Sounds good.

Also backported to branch-2.1.
## What changes were proposed in this pull request?

For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow the entire row to be null; only its columns can be null. That's the reason we forbid users from using top-level null objects in #13469.

However, if users wrap a non-flat type with `Option`, we may still encode a top-level null object to a row, which is not allowed. This PR fixes this case, and suggests users wrap their type with `Tuple1` if they do want top-level null objects.

## How was this patch tested?

New test.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15979 from cloud-fan/option.

(cherry picked from commit f135b70)
Signed-off-by: Cheng Lian <lian@databricks.com>
```scala
if (ScalaReflection.optionOfProductType(tpe)) {
  throw new UnsupportedOperationException(
    "Cannot create encoder for Option of Product type, because Product type is represented " +
      …
```
this also means an `Aggregator` cannot use an `Option` of `Product` type for its intermediate type, e.g. `Aggregator[Int, Option[(Int, Int)], Int]` is now invalid. but i see no good reason why such an `Aggregator` wouldn't exist?
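For example, a sketch of such an aggregator (the sum-and-count logic and all names here are illustrative, not from the PR):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

// None is the empty buffer and Some((sum, count)) the running state, so the
// intermediate type is Option[(Int, Int)], exactly the shape the new check
// rejects when the buffer encoder is derived.
object SumCount extends Aggregator[Int, Option[(Int, Int)], Double] {
  def zero: Option[(Int, Int)] = None

  def reduce(b: Option[(Int, Int)], a: Int): Option[(Int, Int)] =
    b.map { case (s, c) => (s + a, c + 1) }.orElse(Some((a, 1)))

  def merge(b1: Option[(Int, Int)], b2: Option[(Int, Int)]): Option[(Int, Int)] =
    (b1.toSeq ++ b2.toSeq).reduceOption((x, y) => (x._1 + y._1, x._2 + y._2))

  def finish(r: Option[(Int, Int)]): Double =
    r.map { case (s, c) => s.toDouble / c }.getOrElse(0.0)

  // Deriving this encoder is where the new UnsupportedOperationException fires.
  def bufferEncoder: Encoder[Option[(Int, Int)]] = ExpressionEncoder()
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```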
this strikes me more as a limitation on `Dataset[X]` than on `Encoder[X]`
and now that i think about it more, i also think `Dataset[Option[(Int, Int)]]` should be valid too if possible. it should not be represented by a top-level `Row` object, so the schema should be:

```scala
StructType(StructField("_1",
  StructType(
    StructField("_1", IntegerType, false),
    StructField("_2", IntegerType, false)),
  true))
```

we do this trick where we nest top-level non-struct types inside a row, why not do the same thing for `Option[X <: Product]`?
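Built with the public types, that proposed schema would look like this sketch (field names as in the comment above; nothing here is merged behavior):

```scala
import org.apache.spark.sql.types._

// The Option value as one nullable struct column, rather than the tuple's
// fields spilling into the top-level row; a null struct column could then
// encode None without the whole row being null.
val proposed = StructType(Seq(
  StructField("_1",
    StructType(Seq(
      StructField("_1", IntegerType, nullable = false),
      StructField("_2", IntegerType, nullable = false))),
    nullable = true)))
```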
this means anything that uses an encoder can no longer use `Option[_ <: Product]`. `Dataset.groupByKey[K]` requires an encoder for `K`. none of these always create top-level row objects (for which this pull request creates the restriction that they cannot be null); for an aggregator it is sometimes the case. so i am not sure it makes sense to put this restriction on the encoder; it seems to belong on the dataset. another example of something that won't work anymore: given `val x: Dataset[(String, Option[(String, String)])]`, the chain `x.groupByKey(_._1).mapValues(_._2).agg(someAgg)` now fails, because in this case the `mapValues` requires an encoder for `Option[(String, String)]`.
Does it work before? Please see the discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-18251

Ideally we have a mapping between a type `T` and its catalyst schema, and `Option[T]` maps to the same catalyst schema as `T`, with additional null handling. We shouldn't change this mapping, which means we can't use a single-field struct type to represent `Option[T]`.

It's still possible to support `Option[T]` completely (without breaking backward compatibility), but that may need a lot of hacky code and special handling. I don't think it's worth it, as we can easily work around it with `Tuple1`.
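To make that mapping concrete, a sketch using the internal `ExpressionEncoder` (internal API, shown only for illustration): `Option[T]` shares `T`'s schema, differing only in nullability.

```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

ExpressionEncoder[Int]().schema
// StructType(StructField(value,IntegerType,false))

ExpressionEncoder[Option[Int]]().schema
// StructType(StructField(value,IntegerType,true))
```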
Yes it worked before
spark 2.0.x does not have `mapValues`. but this works:

```
scala> Seq(("a", Some((1, 1))), ("a", None)).toDS.groupByKey(_._2).count.show
+-----------+--------+
|        key|count(1)|
+-----------+--------+
|[null,null]|       1|
|      [1,1]|       1|
+-----------+--------+
```
admittedly the result looks weird. it really should be:

```
+-----------+--------+
|        key|count(1)|
+-----------+--------+
|       null|       1|
|      [1,1]|       1|
+-----------+--------+
```

is that a separate bug or related? i remember running into this before, because serializing and then deserializing `None` comes back out as `Some((null, null))`, which causes an NPE in codegen. i ran into this with `Aggregator` buffers.
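A sketch of that round trip using the internal encoder API as it existed around Spark 2.0/2.1 (the calls are assumptions from that era; after this PR the derivation throws instead of succeeding):

```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Before this PR this derived an encoder; after it, it throws
// UnsupportedOperationException.
val enc = ExpressionEncoder[Option[(Int, Int)]]().resolveAndBind()

// None serializes to a row whose two columns are both null ...
val row = enc.toRow(None)

// ... and deserializing reportedly yields Some((null, null)) rather than
// None, which is what produced the [null,null] key shown above.
val back = enc.fromRow(row)
```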