
[SPARK-23862][SQL] Spark ExpressionEncoder should support java enum type in scala #20974

Closed
wants to merge 1 commit

Conversation

@fangshil fangshil commented Apr 4, 2018

What changes were proposed in this pull request?

In SPARK-21255, Spark upstream added support for creating encoders for Java enum types, but only in the Java API (for enums used within Java beans). Since Java enums can come from third-party Java libraries, we have use cases that require:

  1. using a Java enum type as a field of a Scala case class
  2. using a Java enum as the type T in Dataset[T]

Spark's ExpressionEncoder already supports ser/de for many Java types via ScalaReflection, so we propose adding support for Java enums as well, as a follow-up to SPARK-21255.
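The proposed encoding stores an enum as its name string and recovers it via `valueOf`. A minimal plain-Java sketch of that round trip (my own illustration, not Spark's generated serializer/deserializer expressions; the nested `SaveMode` below is a stand-in for any third-party enum, mirroring the constants of Spark's real `SaveMode`):

```java
// Sketch of the enum <-> string round trip the encoder would perform.
public class EnumRoundTrip {
    enum SaveMode { Append, Overwrite, ErrorIfExists, Ignore }

    static String serialize(SaveMode m) {
        return m.name();               // stable string representation
    }

    static SaveMode deserialize(String s) {
        return SaveMode.valueOf(s);    // valueOf is synthesized by javac
    }

    public static void main(String[] args) {
        for (SaveMode m : SaveMode.values()) {
            SaveMode back = deserialize(serialize(m));
            if (back != m) throw new AssertionError("round trip failed for " + m);
        }
        System.out.println("all round trips OK");
    }
}
```

Note that `valueOf` and `values` are exactly the compiler-synthesized members mentioned below, which is why a Scala-defined class cannot play the role of a Java enum in a test.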

How was this patch tested?

Tested the patch in our production cluster. Added a unit test.
Since:

  1. it is not possible to define a Java enum directly in Scala, because an enum class defined in Scala will lack methods like valueOf that the Java compiler adds
  2. it is not possible to define a test enum in a Java class and use it from a Scala test, because compiling a single Scala test (-DwildcardSuites=org.apache.spark.sql.DatasetSuite) won't compile the test Java class first

As a result, I use a public Java enum from the Spark SQL API (SaveMode.java) in the test. Please advise if there is a better way to test.

@@ -108,6 +108,10 @@ abstract class SQLImplicits extends LowPrioritySQLImplicits {
/** @since 2.0.0 */
implicit def newBoxedBooleanEncoder: Encoder[java.lang.Boolean] = Encoders.BOOLEAN

/** @since 2.4.0 */
Contributor

@fangshil I think this will need to be updated to 2.5.0 now that 2.4.0 has been released

Member


It will have to be 3.0.0 as there won't be a 2.5.0

@benmccann
Contributor

@fangshil would you be able to rebase this PR?

@benmccann
Contributor

@gatorsmile @cloud-fan would you be able to give this PR a look or suggest a more appropriate reviewer?

Member

@srowen srowen left a comment


Looks reasonable


@ajacques
Contributor

If I understand this PR correctly, this is going to be internally using the String representation of the Enum instead of some type of ordinal. I like the support for Enums, but I've seen cases where using a string instead of the ordinal significantly increases the data set size. Is there a reason why we're preferring String over an ordinal?

@srowen
Member

srowen commented Dec 21, 2018

Probably because the ordinal value of an enum could change if more constants are added?

@ajacques
Contributor

Is this just the in-Spark representation or does it also get persisted? If it's only in-memory, then I would expect all Spark executors to have the same JAR, share the same Enum values, and therefore should match.

@srowen
Member

srowen commented Dec 21, 2018

I think this is the type that ends up in the Dataset or DataFrame, so you'd be writing the string or int ordinal representation, if you wrote it to disk or something.

@cloud-fan
Contributor

cloud-fan commented Dec 21, 2018

All the data in Spark SQL must have a schema. When we convert a java enum to Spark SQL data, which type should we pick? To me string is a better type than int.
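To illustrate the ordinal-fragility point raised above, here is a small example of my own (not from the PR): a hypothetical "before" and "after" version of the same enum, where inserting a new constant shifts every later ordinal while the names stay stable.

```java
// ColorV1/ColorV2 simulate two releases of the same enum: V2 inserts BLUE,
// which shifts the ordinals of RED and GREEN but leaves their names intact.
public class OrdinalFragility {
    enum ColorV1 { RED, GREEN }
    enum ColorV2 { BLUE, RED, GREEN }   // BLUE added in a later release

    public static void main(String[] args) {
        // Data written by V1 code as an ordinal...
        int stored = ColorV1.RED.ordinal();                       // 0
        // ...decodes to the wrong constant under V2.
        System.out.println(ColorV2.values()[stored]);             // prints BLUE
        // Name-based storage survives the change.
        System.out.println(ColorV2.valueOf(ColorV1.RED.name()));  // prints RED
    }
}
```

This is the persistence hazard that makes a string schema the safer default when the data may outlive one build of the JAR.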

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Sep 17, 2019

Test build #110690 has finished for PR 20974 at commit 90effb2.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

ping @fangshil to update or close.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 12, 2020
@github-actions github-actions bot closed this Jan 13, 2020
@xkrogen
Contributor

xkrogen commented Dec 18, 2020

@HyukjinKwon @srowen @cloud-fan -- It sounds like we had consensus that this PR was okay. Are there any concerns from any of you if I pick up this PR and re-submit? @fangshil is no longer working on this. Retroactive apologies from our side for dropping it for so long, but I think it is still a good enhancement.

@srowen
Member

srowen commented Dec 18, 2020

I think it's reasonable. So this is a round trip: it will let you get enums back from the string rep? OK.
I think we might want tests for the string representation, and for arrays of enums?
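A plain-Java sketch of the kind of array round-trip check being asked for (hypothetical, not the actual Spark test, which would go through the `Dataset` encoder; `SaveMode` is again a stand-in enum):

```java
import java.util.Arrays;

public class EnumArrayRoundTrip {
    enum SaveMode { Append, Overwrite, ErrorIfExists, Ignore }

    // Encode an array of enums as their name strings.
    static String[] encode(SaveMode[] in) {
        return Arrays.stream(in).map(Enum::name).toArray(String[]::new);
    }

    // Decode the names back to enum constants.
    static SaveMode[] decode(String[] in) {
        return Arrays.stream(in).map(SaveMode::valueOf).toArray(SaveMode[]::new);
    }

    public static void main(String[] args) {
        SaveMode[] original = { SaveMode.Append, SaveMode.Ignore };
        SaveMode[] back = decode(encode(original));
        if (!Arrays.equals(original, back)) throw new AssertionError("mismatch");
        System.out.println("array round trip OK");
    }
}
```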

@xkrogen
Contributor

xkrogen commented Dec 18, 2020

Yes, I will be happy to add more tests to confirm proper functionality.
I will put a new PR soon. Thanks for the quick feedback Sean!

@HyukjinKwon
Member

Yeah, thanks @xkrogen.
