Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][SPARK-20384][SQL] Support value classes and always encoded as underlying type #33316

Conversation

eejbyfeldt
Copy link
Contributor

@eejbyfeldt eejbyfeldt commented Jul 13, 2021

The [WIP] is only since solutions to the same issue is already being proposed in #33205. But the code is complete and should work as expected.

What changes were proposed in this pull request?

This PR adds support for using value class in nested case classes. This has previously been proposed in the following PRs: #22309 #27153 #33205

The first PR had the same approach as in this one of always encoding the value class as the underlying type. But as noted by @cloud-fan: #22309 (comment) the implementation was quite spread out in ScalaReflection.scala this is addressed in this PR by "unwrapping" the value class only when used inside a case class. This removes the need for the extra argument instantiateValueClass to deserializerFor and makes the implementation less spread out.

The other two are both by @mickjermsurawong-stripe and takes the approach that value class are only "unwrapped" is cases where there there will be no runtime object. This has the advantage of being fully backwards compatible with cases where case class currently work. The partial support for value classes was original added in: #15284
does only add test cases for single column value classes. Therefore it seems like other cases where value classes happen to work like Option[ValueClass] or Array[ValueClass] are mostly accidental and where not thoughtfully designed to have the current behavior.

This PR is meant for discussion and alternative approach to #33205 if there is no need for schema backwards compatibility.

Why are the changes needed?

Support value classes in Datasets, so that they can be used in code bases that currently use them in their modeling.

Does this PR introduce any user-facing change?

This add supports for using value class in DataFrames and Datasets that would previously led to exceptions.

The main change the can effect existing code is that previously single column value classes would be encoded as they where simple case class.
To avoid extra complexity this patch always treats value classes as their underlying type.
Therefore given the following case class:

case class MyInt(wrapped: Int) extends AnyVal

this means that both Dataset[Int] and Dataset[MyInt] will have the same schema

root
 |-- value: integer (nullable = false)

Before this patch the schema of Dataset[MyInt] would instead have been

root
 |-- wrapped: integer (nullable = false)

How was this patch tested?

New unit tests in: ScalaReflectionSuite.scala, ExpressionEncoderSuite.scala, DataFrameSuite.scala and DatasetSuite.scala.

@github-actions github-actions bot added the SQL label Jul 13, 2021
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@eejbyfeldt
Copy link
Contributor Author

Closing and hoping we can have: #33205 merged instead.

@eejbyfeldt eejbyfeldt closed this Aug 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
2 participants