
[SPARK-41690][SQL][CONNECT] Agnostic Encoders#39186

Closed
hvanhovell wants to merge 5 commits intoapache:masterfrom
hvanhovell:SPARK-41690

Conversation

@hvanhovell
Contributor

What changes were proposed in this pull request?

This PR introduces AgnosticEncoders. AgnosticEncoders describe how an external type maps to a Spark data type. They are agnostic in the sense that they do not prescribe which internal format is to be used.

For example, the following class:

case class Person(id: Long, name: String, hobbies: Seq[String])

Translates into the following agnostic encoder:

ProductEncoder(Person,List(
  (id, PrimitiveLongEncoder),
  (name, StringEncoder),
  (hobbies, IterableEncoder(scala.collection.Seq,StringEncoder))))
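To make the idea concrete, here is a minimal self-contained sketch of the encoder concept in plain Scala. The class names mirror the ones above, but the definitions (and the `schemaOf` helper) are illustrative simplifications, not the actual classes added by this PR: a consumer can walk the encoder tree to derive a schema without ever committing to an internal row format.

```scala
// Illustrative sketch: an encoder describes the mapping from an external
// type to a Spark data type, without fixing the internal format.
sealed trait AgnosticEncoder
case object PrimitiveLongEncoder extends AgnosticEncoder
case object StringEncoder extends AgnosticEncoder
case class IterableEncoder(companion: String, element: AgnosticEncoder)
  extends AgnosticEncoder
case class ProductEncoder(name: String, fields: List[(String, AgnosticEncoder)])
  extends AgnosticEncoder

// One of several artifacts that can be derived from the encoder alone;
// a serializer or deserializer would be other folds over the same tree.
def schemaOf(enc: AgnosticEncoder): String = enc match {
  case PrimitiveLongEncoder => "bigint"
  case StringEncoder        => "string"
  case IterableEncoder(_, e) => s"array<${schemaOf(e)}>"
  case ProductEncoder(_, fields) =>
    fields
      .map { case (name, e) => s"$name: ${schemaOf(e)}" }
      .mkString("struct<", ", ", ">")
}

val person = ProductEncoder("Person", List(
  ("id", PrimitiveLongEncoder),
  ("name", StringEncoder),
  ("hobbies", IterableEncoder("scala.collection.Seq", StringEncoder))))

println(schemaOf(person))
// struct<id: bigint, name: string, hobbies: array<string>>
```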

This PR integrates AgnosticEncoders into ScalaReflection, so they are used for all Dataset operations. A follow-up will address RowEncoder and JavaReflection. In the old situation we traversed the type hierarchy once for each operation (serializerForType, deserializerForType & schemaFor). In the new situation we create an AgnosticEncoder first and then generate a serializer, deserializer, and/or schema from it. This saves a significant amount of time, especially for ExpressionEncoder, where we now need only one pass through the type hierarchy instead of 2 or 3.

Why are the changes needed?

For the Spark Connect Scala Client we need encoders. We want to stay as close as possible to the current Dataset API, and encoders are part of that. Additionally, we would like to retain the rich type support.

Encoders are currently tied to ExpressionEncoders, which we cannot use for a couple of reasons:

  1. Mid-term we don't want to have a dependency on Catalyst. Splitting off the public API that will be shared between Catalyst and the client is tracked in SPARK-41400.
  2. ExpressionEncoders only support the internal row format. The client will use Arrow instead.
  3. We are not particularly keen on sending the expressions needed by ExpressionEncoders over the wire; they are overpowered for this purpose.

So we need an alternative to the current ExpressionEncoders. This class needs to be serializable.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests.

@hvanhovell hvanhovell requested a review from cloud-fan December 22, 2022 19:03
@github-actions github-actions bot added the SQL label Dec 22, 2022
|The type path of the target object is:
|- array element class: "scala.Long"
|- field (class: "scala.Array", name: "arr")
|- array element class: "long"
Contributor Author
This is a side effect of not having the same type information when creating the deserializer.

typePath: WalkedTypePath): Expression = enc match {
case _ if isNativeEncoder(enc) =>
input
case BooleanEncoder =>
Contributor

aren't these handled by case _ if isNativeEncoder(enc) => already?

Contributor

nvm, we have PrimitiveBooleanEncoder and BooleanEncoder
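For readers following the thread, the distinction the reviewer is pointing at can be shown in plain Scala (this snippet is illustrative and not part of the PR): the primitive and boxed boolean types differ in nullability, which is why they need separate encoders.

```scala
// scala.Boolean is a primitive and can never be null; java.lang.Boolean
// is the boxed reference type and can be null. PrimitiveBooleanEncoder
// and BooleanEncoder cover these two cases respectively.
val primitive: Boolean = true
val boxed: java.lang.Boolean = null

println(s"primitive=$primitive boxed=$boxed")
```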

Contributor

maybe BoxedBooleanEncoder is a better name

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 030c1ba Dec 27, 2022