[SPARK-23862][SQL] Spark ExpressionEncoder should support java enum type in scala #20974
Conversation
@@ -108,6 +108,10 @@ abstract class SQLImplicits extends LowPrioritySQLImplicits {
  /** @since 2.0.0 */
  implicit def newBoxedBooleanEncoder: Encoder[java.lang.Boolean] = Encoders.BOOLEAN

  /** @since 2.4.0 */
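The added lines themselves are collapsed in the diff above. Judging from the PR title and the neighboring encoders, the new implicit presumably looks something like the sketch below; the method name newJavaEnumEncoder and its body are assumptions, not copied from the diff.

    import scala.reflect.runtime.universe.TypeTag
    import org.apache.spark.sql.Encoder
    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

    // Sketch (inside SQLImplicits): an implicit encoder for any Java enum
    // subtype, built through ExpressionEncoder's reflection machinery.
    implicit def newJavaEnumEncoder[A <: java.lang.Enum[A] : TypeTag]: Encoder[A] =
      ExpressionEncoder()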
@fangshil I think this will need to be updated to 2.5.0 now that 2.4.0 has been released.
It will have to be 3.0.0 as there won't be a 2.5.0
@fangshil would you be able to rebase this PR?
@gatorsmile @cloud-fan would you be able to give this PR a look or suggest a more appropriate reviewer?
Looks reasonable
If I understand this PR correctly, this is going to internally use the String representation of the Enum instead of some type of ordinal. I like the support for Enums, but I've seen cases where using a string instead of the ordinal significantly increases the data set size. Is there a reason why we're preferring String over an ordinal?
Probably because the ordinal value of an enum could change if more constants are added?
Is this just the in-Spark representation, or does it also get persisted? If it's only in-memory, then I would expect all Spark executors to have the same JAR, share the same Enum values, and therefore the values should match.
I think this is the type that ends up in the Dataset or DataFrame, so you'd be writing the string or int ordinal representation if you wrote it to disk.
All the data in Spark SQL must have a schema. When we convert a Java enum to Spark SQL data, which type should we pick? To me, string is a better type than int.
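For illustration, here is how the two candidate representations behave on an existing Spark Java enum (SaveMode); this is plain JVM enum behavior, not code from the PR:

    import org.apache.spark.sql.SaveMode

    // Name-based representation: stable unless a constant is renamed.
    SaveMode.Overwrite.name()                               // "Overwrite"
    java.lang.Enum.valueOf(classOf[SaveMode], "Overwrite")  // SaveMode.Overwrite

    // Ordinal-based representation: a plain Int whose meaning silently shifts
    // if constants are inserted or reordered in SaveMode.java.
    SaveMode.Overwrite.ordinal()  // 1, assuming the declaration order Append, Overwrite, ...
    SaveMode.values()(1)          // SaveMode.Overwrite -- only while that order holds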
ok to test
Test build #110690 has finished for PR 20974 at commit
ping @fangshil to update or close.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
@HyukjinKwon @srowen @cloud-fan -- It sounds like we had consensus that this PR was okay. Are there any concerns from any of you if I pick up this PR and re-submit? @fangshil is no longer working on this. Retroactive apologies from our side for dropping it for so long, but I think it is still a good enhancement.
I think it's reasonable. So this is round-trip: it will let you get enums back from the string representation? OK.
Yes, I will be happy to add more tests to confirm proper functionality. |
Yeah, thanks @xkrogen. |
What changes were proposed in this pull request?
In SPARK-21255, Spark upstream added support for creating encoders for Java enum types, but only in the Java API (for enums used within Java Beans). Since Java enums can come from third-party Java libraries, we have a use case that requires encoding them directly from the Scala API.
Spark's ExpressionEncoder already supports serializing and deserializing many Java types through ScalaReflection, so we propose adding support for Java enums as well, as a follow-up to SPARK-21255.
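As a hypothetical illustration of the user-facing effect (assuming the new implicit is exposed through spark.implicits._ and the string representation discussed above):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // A Seq of Java enums should now convert directly to a Dataset...
    val ds = Seq(SaveMode.Append, SaveMode.Overwrite).toDS()
    // ...whose single column is expected to carry the enum as a string.
    ds.printSchema()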
How was this patch tested?
Tested the patch in our production cluster and added a unit test.
Since we cannot define a Java enum type in Scala code, I use the Spark SQL public Java enum API (SaveMode.java) in the test. Please advise if there is a better way to test this.
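A minimal round-trip check along those lines might look like the sketch below, assuming a Spark SQL test suite where testImplicits is in scope; the test name is made up:

    import org.apache.spark.sql.SaveMode

    test("SPARK-23862: encoder round-trips java enums") {
      import testImplicits._
      val input = Seq(SaveMode.Append, SaveMode.Overwrite, SaveMode.Ignore)
      // Encode to a Dataset and decode back; the values should be unchanged.
      assert(input.toDS().collect().toSeq == input)
    }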