
[SPARK-41690][SQL][CONNECT] Agnostic Encoders#39186

Closed
hvanhovell wants to merge 5 commits intoapache:masterfrom
hvanhovell:SPARK-41690

Conversation

@hvanhovell
Contributor

What changes were proposed in this pull request?

This PR introduces AgnosticEncoders. AgnosticEncoders describe how an external type maps to a Spark data type. They are agnostic in the sense that they do not prescribe which internal format is to be used.

For example, the following class:

case class Person(id: Long, name: String, hobbies: Seq[String])

Translates into the following agnostic encoder:

ProductEncoder(Person,List(
  (id, PrimitiveLongEncoder),
  (name, StringEncoder),
  (hobbies, IterableEncoder(scala.collection.Seq,StringEncoder))))
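To make the idea concrete, here is a minimal self-contained sketch of the encoder concept in plain Scala. The class names mirror the ones above, but the definitions (and the `schemaOf` helper) are illustrative simplifications, not the actual classes added by this PR: a consumer can walk the encoder tree to derive a schema without ever committing to an internal row format.

```scala
// Illustrative sketch: an encoder describes the mapping from an external
// type to a Spark data type, without fixing the internal format.
sealed trait AgnosticEncoder
case object PrimitiveLongEncoder extends AgnosticEncoder
case object StringEncoder extends AgnosticEncoder
case class IterableEncoder(companion: String, element: AgnosticEncoder)
  extends AgnosticEncoder
case class ProductEncoder(name: String, fields: List[(String, AgnosticEncoder)])
  extends AgnosticEncoder

// One of several artifacts that can be derived from the encoder alone;
// a serializer or deserializer would be other folds over the same tree.
def schemaOf(enc: AgnosticEncoder): String = enc match {
  case PrimitiveLongEncoder => "bigint"
  case StringEncoder        => "string"
  case IterableEncoder(_, e) => s"array<${schemaOf(e)}>"
  case ProductEncoder(_, fields) =>
    fields
      .map { case (name, e) => s"$name: ${schemaOf(e)}" }
      .mkString("struct<", ", ", ">")
}

val person = ProductEncoder("Person", List(
  ("id", PrimitiveLongEncoder),
  ("name", StringEncoder),
  ("hobbies", IterableEncoder("scala.collection.Seq", StringEncoder))))

println(schemaOf(person))
// struct<id: bigint, name: string, hobbies: array<string>>
```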

This PR integrates AgnosticEncoders into ScalaReflection, so they are used for all Dataset operations. A follow-up will address RowEncoder and JavaReflection. In the old situation we traversed the type hierarchy once for each operation (serializerForType, deserializerForType & schemaFor). In the new situation we create an AgnosticEncoder first and then generate a serializer, deserializer, and/or schema from it. This saves a significant amount of time, especially for ExpressionEncoder, where we now need only one pass through the type hierarchy instead of 2 or 3.

Why are the changes needed?

For the Spark Connect Scala Client we need encoders. We want to stay as close as possible to the current Dataset API, and encoders are part of that. Additionally, we would like to retain the rich type support.

Encoders are currently tied to ExpressionEncoders, which we cannot use for a couple of reasons:

  1. Mid-term we don't want to have a dependency on Catalyst. Splitting off the public API that will be shared between Catalyst and the client is tracked in SPARK-41400.
  2. ExpressionEncoders only support the internal row format. The client will use Arrow instead.
  3. We are not particularly keen on sending the expressions needed by ExpressionEncoders over the wire; they are overpowered for this purpose.

So we need an alternative to the current ExpressionEncoders. This class needs to be serializable.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests.

@hvanhovell hvanhovell requested a review from cloud-fan December 22, 2022 19:03
@github-actions github-actions bot added the SQL label Dec 22, 2022
|The type path of the target object is:
|- array element class: "scala.Long"
|- field (class: "scala.Array", name: "arr")
|- array element class: "long"
Contributor Author
This is a side effect of not having the same type information when creating the deserializer.

typePath: WalkedTypePath): Expression = enc match {
case _ if isNativeEncoder(enc) =>
input
case BooleanEncoder =>
Contributor

aren't these handled by case _ if isNativeEncoder(enc) => already?

Contributor

nvm, we have PrimitiveBooleanEncoder and BooleanEncoder
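For readers following the thread, the distinction the reviewer is pointing at can be shown in plain Scala (this snippet is illustrative and not part of the PR): the primitive and boxed boolean types differ in nullability, which is why they need separate encoders.

```scala
// scala.Boolean is a primitive and can never be null; java.lang.Boolean
// is the boxed reference type and can be null. PrimitiveBooleanEncoder
// and BooleanEncoder cover these two cases respectively.
val primitive: Boolean = true
val boxed: java.lang.Boolean = null

println(s"primitive=$primitive boxed=$boxed")
```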

Contributor

maybe BoxedBooleanEncoder is a better name

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 030c1ba Dec 27, 2022