
[SPARK-18251][SQL] the type of Dataset can't be Option of non-flat type #15979

Closed
wants to merge 3 commits

Conversation

cloud-fan
Contributor

@cloud-fan cloud-fan commented Nov 22, 2016

What changes were proposed in this pull request?

For an input object of non-flat type, we can't encode it to a row if it's null, as Spark SQL doesn't allow the entire row to be null; only its columns can be null. That's the reason we forbade users from using top-level null objects in #13469.

However, if users wrap a non-flat type with Option, then we may still encode a top-level null object to a row, which is not allowed.

This PR fixes this case, and suggests users wrap their type with Tuple1 if they do want top-level null objects.

How was this patch tested?

new test
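As a plain-Scala sketch of the suggested workaround (no Spark session needed; the actual Dataset call is only hinted at in a comment, and the helper name is hypothetical):

```scala
object OptionEncodingSketch {
  // Hedged sketch: wrap an optional Product ("non-flat") value in Tuple1, so
  // the top-level value -- the future row -- is never null; the None lives in
  // the row's single, nullable column instead.
  // In Spark this would look something like: Seq(Tuple1(maybePair)).toDS()
  def wrap[T](opt: Option[T]): Tuple1[Option[T]] = Tuple1(opt)
}
```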

@cloud-fan
Contributor Author

cc @yhuai @liancheng

@SparkQA

SparkQA commented Nov 22, 2016

Test build #69000 has finished for PR 15979 at commit cb834b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 22, 2016

What does "non-flat type" mean?

@cloud-fan
Contributor Author

"non-flat type" means "complex type", i.e. array, seq, map, product, etc.

@cloud-fan
Contributor Author

retest this please

throw new UnsupportedOperationException(
"Cannot create encoder for Option of non-flat type, as non-flat type is represented " +
"as a row, and the entire row can not be null in Spark SQL like normal databases. " +
"You can wrap your type with Tuple1 if you do want top level null objects.")
Contributor

Let's provide an example in the error message to help users understand how to handle this case.

@yhuai
Contributor

yhuai commented Nov 29, 2016

looks good. @liancheng want to double check?

@SparkQA

SparkQA commented Nov 29, 2016

Test build #69326 has finished for PR 15979 at commit 01b072d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 29, 2016

Test build #69328 has finished for PR 15979 at commit 01b072d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

My only concern is that "non-flat" type is neither intuitive nor a well-known term. In fact, this PR only prevents Option[T <: Product] from being a top-level Dataset type. How about just calling them "Product" types?

Otherwise LGTM.

@rxin
Contributor

rxin commented Nov 29, 2016

FWIW I don't think we should call it nonflat.

@SparkQA

SparkQA commented Nov 30, 2016

Test build #69387 has started for PR 15979 at commit 70dd650.

@cloud-fan
Contributor Author

retest this please

@liancheng
Contributor

Good to merge pending Jenkins. Thanks!

@SparkQA

SparkQA commented Nov 30, 2016

Test build #69394 has finished for PR 15979 at commit 70dd650.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 30, 2016

Test build #69403 has finished for PR 15979 at commit 70dd650.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 30, 2016

Test build #69411 has finished for PR 15979 at commit 876e5c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

Merging to master. Thanks!

@asfgit asfgit closed this in f135b70 Nov 30, 2016
@liancheng
Contributor

@rxin Shall we backport this to branch-2.1? I think it's relatively safe.

@rxin
Contributor

rxin commented Nov 30, 2016

Sounds good.

@liancheng
Contributor

Also backported to branch-2.1.

asfgit pushed a commit that referenced this pull request Nov 30, 2016

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15979 from cloud-fan/option.

(cherry picked from commit f135b70)
Signed-off-by: Cheng Lian <lian@databricks.com>
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#15979 from cloud-fan/option.

if (ScalaReflection.optionOfProductType(tpe)) {
throw new UnsupportedOperationException(
"Cannot create encoder for Option of Product type, because Product type is represented " +
Contributor

This also means an Aggregator cannot use an Option of Product type for its intermediate type, e.g.
Aggregator[Int, Option[(Int, Int)], Int] is now invalid. But I see no good reason why such an Aggregator shouldn't exist?
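A plain-Scala sketch of what such an aggregation would compute (hypothetical semantics: tracking a (min, max) pair, with Option[(Int, Int)] as the intermediate buffer, which is exactly the buffer shape this change rejects in the real Aggregator API):

```scala
object MinMaxAgg {
  // Intermediate buffer is Option[(Int, Int)]: None until the first input,
  // Some((min, max)) afterwards -- the analogue of Aggregator.reduce.
  def reduce(buf: Option[(Int, Int)], in: Int): Option[(Int, Int)] = buf match {
    case None           => Some((in, in))
    case Some((lo, hi)) => Some((math.min(lo, in), math.max(hi, in)))
  }

  // The analogue of Aggregator.finish: final output is the range, 0 if empty.
  def finish(buf: Option[(Int, Int)]): Int =
    buf.map { case (lo, hi) => hi - lo }.getOrElse(0)

  def run(xs: List[Int]): Int =
    finish(xs.foldLeft(Option.empty[(Int, Int)])(reduce))
}
```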

Contributor

@koertkuipers koertkuipers Dec 4, 2016

This strikes me more as a limitation on Dataset[X] than on Encoder[X].

Contributor

@koertkuipers koertkuipers Dec 4, 2016

And now that I think about it more, I also think Dataset[Option[(Int, Int)]] should be valid too, if possible.

It should not be represented by a top-level Row object, so the schema should be:

StructType(StructField("_1",
  StructType(
    StructField("_1", IntegerType, false),
    StructField("_2", IntegerType, false)),
  true))

We already do this trick of nesting top-level non-struct types inside a row, so why not do the same thing for Option[X <: Product]?

@koertkuipers
Contributor

This means anything that uses an encoder can no longer use Option[_ <: Product]; encoders are not just used for top-level Dataset creation:

  • Dataset.groupByKey[K] requires an encoder for K.
  • KeyValueGroupedDataset.mapValues[W] requires an encoder for W.
  • Aggregator[A, B, C] requires encoders for B and C.

None of these always create top-level row objects (for which this pull request creates the restriction that they cannot be null).

For an Aggregator it is sometimes the case: dataset.select(aggregator) does create a top-level row object, but dataset.groupByKey(...).agg(aggregator) does not.

So I am not sure it makes sense to put this restriction on the encoder; it seems to belong on the Dataset.

Another example of something that won't work anymore:

val x: Dataset[(String, Option[(String, String)])] = ...
x.groupByKey(_._1).mapValues(_._2).agg(someAgg)

In this case the mapValues call requires an Encoder[Option[(String, String)]].
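A plain-collections analogue of the pipeline above (names hypothetical); it only illustrates that the grouped values have type Option[(String, String)], which is why mapValues would need an Encoder for that type in the real Dataset API:

```scala
object PipelineSketch {
  // groupByKey(_._1).mapValues(_._2), in plain collection terms: the values
  // handed to the downstream aggregation are Option[(String, String)] instances.
  def run(x: Seq[(String, Option[(String, String)])]): Map[String, Seq[Option[(String, String)]]] =
    x.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
}
```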

@cloud-fan
Contributor Author

cloud-fan commented Dec 4, 2016

val x: Dataset[(String, Option[(String, String)])] = ...
x.groupByKey(_._1).mapValues(_._2).agg(someAgg)

Did this work before?

Please see the discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-18251
Ideally we have a mapping between a type T and its catalyst schema, and Option[T] maps to the same catalyst schema as T, with additional null handling. We shouldn't change this mapping, which means we can't use a single-field struct type to represent Option[T].

It's still possible to support Option[T] completely (without breaking backward compatibility), but that would need a lot of hacky code and special handling. I don't think it's worth it, as we can easily work around it with Tuple1.
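A toy model of the invariant described above (the Schema type and the values here are hypothetical simplifications, not Spark's ScalaReflection): Option[T] keeps T's schema and only adds null handling, so a single-field struct cannot be used to tell Option[(Int, Int)] apart from (Int, Int).

```scala
object ToySchema {
  // Hypothetical, drastically simplified schema model.
  sealed trait Schema
  case object IntCol extends Schema
  final case class Struct(fields: List[Schema]) extends Schema

  // (Int, Int) maps to a two-column struct...
  val pairSchema: Schema = Struct(List(IntCol, IntCol))

  // ...and, per the invariant, Option[(Int, Int)] maps to the very same
  // struct (null handling lives outside the schema), so the two types are
  // indistinguishable at the schema level.
  val optionPairSchema: Schema = pairSchema
}
```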

@koertkuipers
Contributor

koertkuipers commented Dec 4, 2016 via email

@koertkuipers
Contributor

koertkuipers commented Dec 4, 2016 via email

@koertkuipers
Contributor

koertkuipers commented Dec 4, 2016 via email

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#15979 from cloud-fan/option.