
[SPARK-23862][SQL] Spark ExpressionEncoder should support java enum type in scala #20974

Closed
wants to merge 1 commit

Conversation

@fangshil fangshil commented Apr 4, 2018

What changes were proposed in this pull request?

In SPARK-21255, Spark upstream added support for creating encoders for Java enum types, but only in the Java API (for enums used within Java beans). Since Java enums can come from third-party Java libraries, we have use cases that require:

  1. using a Java enum type as a field of a Scala case class
  2. using a Java enum as the type T in Dataset[T]

Spark's ExpressionEncoder already supports ser/de for many Java types via ScalaReflection, so we propose adding support for Java enums as well, as a follow-up to SPARK-21255.
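The proposed encoding stores an enum as its name string and recovers it via `valueOf`. A minimal plain-Java sketch of that round trip (my own illustration, not Spark's generated serializer/deserializer expressions; the nested `SaveMode` below is a stand-in for any third-party enum, mirroring the constants of Spark's real `SaveMode`):

```java
// Sketch of the enum <-> string round trip the encoder would perform.
public class EnumRoundTrip {
    enum SaveMode { Append, Overwrite, ErrorIfExists, Ignore }

    static String serialize(SaveMode m) {
        return m.name();               // stable string representation
    }

    static SaveMode deserialize(String s) {
        return SaveMode.valueOf(s);    // valueOf is synthesized by javac
    }

    public static void main(String[] args) {
        for (SaveMode m : SaveMode.values()) {
            SaveMode back = deserialize(serialize(m));
            if (back != m) throw new AssertionError("round trip failed for " + m);
        }
        System.out.println("all round trips OK");
    }
}
```

Note that `valueOf` and `values` are exactly the compiler-synthesized members mentioned below, which is why a Scala-defined class cannot play the role of a Java enum in a test.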

How was this patch tested?

Tested the patch in our production cluster. Added a unit test.
Since:

  1. it is not possible to define a Java enum directly in Scala, because an enum class defined in Scala will lack methods like valueOf that the Java compiler adds
  2. it is not possible to define a test enum in a Java class and use it from a Scala test, because compiling a single Scala test (-DwildcardSuites=org.apache.spark.sql.DatasetSuite) won't compile the test Java class first

As a result, I use a public Java enum from the Spark SQL API (SaveMode.java) in the test. Please advise if there is a better way to test.

@@ -108,6 +108,10 @@ abstract class SQLImplicits extends LowPrioritySQLImplicits {
/** @since 2.0.0 */
implicit def newBoxedBooleanEncoder: Encoder[java.lang.Boolean] = Encoders.BOOLEAN

/** @since 2.4.0 */
Contributor

@fangshil I think this will need to be updated to 2.5.0 now that 2.4.0 has been released

Member


It will have to be 3.0.0 as there won't be a 2.5.0

@benmccann
Contributor

@fangshil would you be able to rebase this PR?

@benmccann
Contributor

@gatorsmile @cloud-fan would you be able to give this PR a look or suggest a more appropriate reviewer?

Member

@srowen srowen left a comment


Looks reasonable


@ajacques
Contributor

If I understand this PR correctly, this is going to be internally using the String representation of the Enum instead of some type of ordinal. I like the support for Enums, but I've seen cases where using a string instead of the ordinal significantly increases the data set size. Is there a reason why we're preferring String over an ordinal?

@srowen
Member

srowen commented Dec 21, 2018

Probably because the ordinal value of an enum could change if more constants are added?

@ajacques
Contributor

Is this just the in-Spark representation or does it also get persisted? If it's only in-memory, then I would expect all Spark executors to have the same JAR, share the same Enum values, and therefore should match.

@srowen
Member

srowen commented Dec 21, 2018

I think this is the type that ends up in the Dataset or DataFrame, so you'd be writing the string or int ordinal representation, if you wrote it to disk or something.

@cloud-fan
Contributor

cloud-fan commented Dec 21, 2018

All the data in Spark SQL must have a schema. When we convert a java enum to Spark SQL data, which type should we pick? To me string is a better type than int.
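To illustrate the ordinal-fragility point raised above, here is a small example of my own (not from the PR): a hypothetical "before" and "after" version of the same enum, where inserting a new constant shifts every later ordinal while the names stay stable.

```java
// ColorV1/ColorV2 simulate two releases of the same enum: V2 inserts BLUE,
// which shifts the ordinals of RED and GREEN but leaves their names intact.
public class OrdinalFragility {
    enum ColorV1 { RED, GREEN }
    enum ColorV2 { BLUE, RED, GREEN }   // BLUE added in a later release

    public static void main(String[] args) {
        // Data written by V1 code as an ordinal...
        int stored = ColorV1.RED.ordinal();                       // 0
        // ...decodes to the wrong constant under V2.
        System.out.println(ColorV2.values()[stored]);             // prints BLUE
        // Name-based storage survives the change.
        System.out.println(ColorV2.valueOf(ColorV1.RED.name()));  // prints RED
    }
}
```

This is the persistence hazard that makes a string schema the safer default when the data may outlive one build of the JAR.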

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Sep 17, 2019

Test build #110690 has finished for PR 20974 at commit 90effb2.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

ping @fangshil to update or close.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 12, 2020
@github-actions github-actions bot closed this Jan 13, 2020
@xkrogen
Contributor

xkrogen commented Dec 18, 2020

@HyukjinKwon @srowen @cloud-fan -- It sounds like we had consensus that this PR was okay. Are there any concerns from any of you if I pick up this PR and re-submit? @fangshil is no longer working on this. Retroactive apologies from our side for dropping it for so long, but I think it is still a good enhancement.

@srowen
Member

srowen commented Dec 18, 2020

I think it's reasonable. So this is a round trip: it will let you get enums back from the string rep? OK.
I think we might want tests for the string representation, and for arrays of enums?
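A plain-Java sketch of the kind of array round-trip check being asked for (hypothetical, not the actual Spark test, which would go through the `Dataset` encoder; `SaveMode` is again a stand-in enum):

```java
import java.util.Arrays;

public class EnumArrayRoundTrip {
    enum SaveMode { Append, Overwrite, ErrorIfExists, Ignore }

    // Encode an array of enums as their name strings.
    static String[] encode(SaveMode[] in) {
        return Arrays.stream(in).map(Enum::name).toArray(String[]::new);
    }

    // Decode the names back to enum constants.
    static SaveMode[] decode(String[] in) {
        return Arrays.stream(in).map(SaveMode::valueOf).toArray(SaveMode[]::new);
    }

    public static void main(String[] args) {
        SaveMode[] original = { SaveMode.Append, SaveMode.Ignore };
        SaveMode[] back = decode(encode(original));
        if (!Arrays.equals(original, back)) throw new AssertionError("mismatch");
        System.out.println("array round trip OK");
    }
}
```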

@xkrogen
Contributor

xkrogen commented Dec 18, 2020

Yes, I will be happy to add more tests to confirm proper functionality.
I will put a new PR soon. Thanks for the quick feedback Sean!

@HyukjinKwon
Member

Yeah, thanks @xkrogen.
