[SPARK-48008][1/2] Support UDAFs in Spark Connect #46245
Conversation
Files with review threads (resolved):

- sql/core/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala
- connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/Aggregator.scala
- ...connect/client/jvm/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala (outdated)
- connector/connect/common/src/main/protobuf/spark/connect/expressions.proto (outdated)
```scala
@@ -346,4 +347,42 @@ class UserDefinedFunctionE2ETestSuite extends QueryTest {
    val result = df.select(f($"id")).as[Long].head()
    assert(result == 1L)
  }

  test("UDAF custom Aggregator - primitive types") {
```
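The hunk above adds an end-to-end test for an `Aggregator` over primitive types. The logic such a test exercises can be sketched without a Spark cluster; the trait below is a hypothetical stand-in for Spark's `Aggregator` contract (`zero` / `reduce` / `merge` / `finish`), and `runLocally` mimics Spark's per-partition reduce followed by a merge of partial buffers:

```scala
// Hypothetical stand-in for Spark's Aggregator contract, with no
// Spark dependency, so the aggregation logic can run anywhere.
abstract class SimpleAggregator[IN, BUF, OUT] extends Serializable {
  def zero: BUF
  def reduce(b: BUF, a: IN): BUF
  def merge(b1: BUF, b2: BUF): BUF
  def finish(reduction: BUF): OUT
}

// A primitive-type aggregator of the kind the new test exercises.
object LongSum extends SimpleAggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(b: Long, a: Long): Long = b + a
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
}

// Simulate how Spark evaluates the aggregator: reduce within each
// partition, merge the partial buffers, then finish.
def runLocally(partitions: Seq[Seq[Long]]): Long = {
  val partials = partitions.map(_.foldLeft(LongSum.zero)(LongSum.reduce))
  LongSum.finish(partials.foldLeft(LongSum.zero)(LongSum.merge))
}
```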
Can you add a test for a UDAF with a custom toColumn implementation?
Let's do this in the next PR where we add support for toColumn.
# Conflicts:
#   connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/UserDefinedFunctionE2ETestSuite.scala

# Conflicts:
#   python/pyspark/sql/connect/proto/commands_pb2.py
LGTM
@HyukjinKwon I think the current protobuf style checks are a bit too strict. The changes made by @xupefei are wire-compatible. Can we make this a bit more lenient?

@xupefei can you fix the style, and perhaps revert the renaming of the messages?

Done! I've reverted the Protobuf change but kept the naming changes in the Scala code.

Just saw this. Tests seem to be passing fine (?).

This is the Proto test that is failing: https://github.com/xupefei/spark/actions/runs/9255404729/job/25459140837, for commit 82b802d. I reverted the Proto naming change.

cc @grundprinzip ^^^

Merging!
### What changes were proposed in this pull request?

This PR changes Spark Connect to support defining and registering `Aggregator[IN, BUF, OUT]` UDAFs. The mechanism is similar to the one used for Scalar UDFs: on the client side, we serialize the `Aggregator` instance and send it to the server, where the data is deserialized into an `Aggregator` instance recognized by Spark Core.

With this PR we now have two `Aggregator` interfaces defined, one in the Connect API and one in Core. They declare exactly the same abstract methods and share the same `SerialVersionUID`, so the Java serialization engine can map one to the other. It is very important to keep these two definitions in sync.

The second part of this effort will add the `Aggregator.toColumn` API (currently not implemented due to dependencies on Spark Core).

### Why are the changes needed?

Spark Connect does not have UDAF support. We need to fix that.

### Does this PR introduce _any_ user-facing change?

Yes, Connect users can now define an `Aggregator` and register it:

```scala
val agg = new Aggregator[Int, Int, Int] { ... }
spark.udf.register("agg", udaf(agg))
val ds: Dataset[Data] = ...
val aggregated = ds.selectExpr("agg(i)")
```

### How was this patch tested?

Added new tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46245 from xupefei/connect-udaf.

Authored-by: Paddy Xu <xupaddy@gmail.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
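The description above hinges on the two `Aggregator` definitions sharing the same `SerialVersionUID` so that a client-serialized instance deserializes cleanly on the server. The round trip can be sketched with a hypothetical mirror of the contract; the UID below is an illustrative value, not Spark's real one, and `ClientAggregator` is a stand-in name, not the Connect API:

```scala
import java.io._

// Hypothetical mirror of the paired Aggregator definitions: the
// Connect-side copy and the Core-side copy must declare the same
// abstract members and the same serialVersionUID, otherwise Java
// deserialization on the server rejects the client's payload.
@SerialVersionUID(1234567890123456789L) // illustrative value only
abstract class ClientAggregator[IN, BUF, OUT] extends Serializable {
  def zero: BUF
  def reduce(b: BUF, a: IN): BUF
  def merge(b1: BUF, b2: BUF): BUF
  def finish(reduction: BUF): OUT
}

// A concrete aggregator a client might define and register.
class SumAgg extends ClientAggregator[Int, Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: Int): Int = b + a
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(reduction: Int): Int = reduction
}

// Round-trip through Java serialization, as the client does when
// shipping the aggregator instance to the server.
def roundTrip[T <: Serializable](value: T): T = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(value)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  in.readObject().asInstanceOf[T]
}
```

If the two sides declared different UIDs, `readObject` would throw an `InvalidClassException`, which is why the description stresses keeping the definitions in sync.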
### What changes were proposed in this pull request?

This PR follows #46245 to add support for the `udaf.toColumn` API in Spark Connect. Here we introduce a new Protobuf message, `proto.TypedAggregateExpression`, that includes a serialized UDF packet. On the server, we unpack it into an `Aggregator` object and generate a real `TypedAggregateExpression` instance with the encoder information passed along with the UDF.

### Why are the changes needed?

Because the `toColumn` API is not supported in the previous PR.

### Does this PR introduce _any_ user-facing change?

Yes, from now on users can create a typed UDAF using the `udaf.toColumn` API.

### How was this patch tested?

New tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46849 from xupefei/connect-udaf-tocolumn.

Authored-by: Paddy Xu <xupaddy@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
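Semantically, selecting a typed aggregate column reduces the whole Dataset to a single value. The sketch below is a hypothetical local model of that evaluation, not the Spark Connect API: `selectToColumn` stands in for what `ds.select(agg.toColumn)` computes once the server has rebuilt the `Aggregator` from the serialized packet:

```scala
// Hypothetical stand-in for the Aggregator contract; names are
// illustrative and do not come from the Spark Connect API.
abstract class MiniAggregator[IN, BUF, OUT] extends Serializable {
  def zero: BUF
  def reduce(b: BUF, a: IN): BUF
  def merge(b1: BUF, b2: BUF): BUF
  def finish(reduction: BUF): OUT
}

// A typed average: buffer is (running sum, count).
object AvgDouble extends MiniAggregator[Double, (Double, Long), Double] {
  def zero: (Double, Long) = (0.0, 0L)
  def reduce(b: (Double, Long), a: Double): (Double, Long) =
    (b._1 + a, b._2 + 1)
  def merge(x: (Double, Long), y: (Double, Long)): (Double, Long) =
    (x._1 + y._1, x._2 + y._2)
  def finish(r: (Double, Long)): Double = r._1 / r._2
}

// Local model of ds.select(agg.toColumn): fold the whole input
// through reduce, then finish the buffer into the output value.
def selectToColumn[IN, BUF, OUT](
    data: Seq[IN],
    agg: MiniAggregator[IN, BUF, OUT]): OUT =
  agg.finish(data.foldLeft(agg.zero)(agg.reduce))
```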