[SPARK-49308][CONNECT] Support UserDefinedAggregateFunction in Spark Connect Scala Client #49785
Conversation
Merging to master/4.0
### What changes were proposed in this pull request?
This PR adds support for `UserDefinedAggregateFunction` in the Spark Connect Scala Client. While this is a deprecated feature, we still believe it is useful to support it to ensure we reduce incompatibilities between classic and connect. Implementation-wise I opted to convert the `UserDefinedAggregateFunction` to an `Aggregator`, and use that code path for execution. This is probably not as fast as the original implementation (more allocations).

### Why are the changes needed?
This reduces friction between the classic and connect implementations.

### Does this PR introduce _any_ user-facing change?
Yes. It enables Spark Connect Scala Client users to use `UserDefinedAggregateFunction`s.

### How was this patch tested?
Added tests to

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #49785 from hvanhovell/SPARK-49308.

Authored-by: Herman van Hovell <herman@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 4953a9c)
Signed-off-by: Herman van Hovell <herman@databricks.com>
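To illustrate the approach the description outlines, here is a minimal, self-contained sketch of wrapping a mutable-buffer UDAF contract behind an `Aggregator`-style `zero`/`reduce`/`merge`/`finish` interface. All names here (`SimpleUdaf`, `UdafAsAggregator`, `LongSum`) are hypothetical stand-ins, not Spark's real classes, which carry `Row` and schema machinery.

```scala
// Hypothetical, simplified stand-in for Spark's UserDefinedAggregateFunction
// contract: a mutable buffer plus initialize/update/merge/evaluate.
trait SimpleUdaf {
  def bufferSize: Int
  def initialize(buffer: Array[Any]): Unit
  def update(buffer: Array[Any], input: Any): Unit
  def merge(b1: Array[Any], b2: Array[Any]): Unit
  def evaluate(buffer: Array[Any]): Any
}

// Aggregator-style adaptor: zero/reduce/merge/finish delegate to the UDAF.
// The extra Array allocations mirror the "probably not as fast" caveat above.
final class UdafAsAggregator(udaf: SimpleUdaf) {
  def zero: Array[Any] = {
    val b = new Array[Any](udaf.bufferSize)
    udaf.initialize(b)
    b
  }
  def reduce(b: Array[Any], input: Any): Array[Any] = { udaf.update(b, input); b }
  def merge(b1: Array[Any], b2: Array[Any]): Array[Any] = { udaf.merge(b1, b2); b1 }
  def finish(b: Array[Any]): Any = udaf.evaluate(b)
}

// Example UDAF: sum of Long inputs.
object LongSum extends SimpleUdaf {
  val bufferSize = 1
  def initialize(buffer: Array[Any]): Unit = buffer(0) = 0L
  def update(buffer: Array[Any], input: Any): Unit =
    buffer(0) = buffer(0).asInstanceOf[Long] + input.asInstanceOf[Long]
  def merge(b1: Array[Any], b2: Array[Any]): Unit =
    b1(0) = b1(0).asInstanceOf[Long] + b2(0).asInstanceOf[Long]
  def evaluate(buffer: Array[Any]): Any = buffer(0)
}
```

A partial-aggregation round trip (two partitions reduced separately, then merged) exercises all four methods of the adaptor.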
The diff hunk under review:

```scala
        NewInstance(cls, arguments, Nil, propagateNull = false, dt, outerPointerGetter))

    case AgnosticEncoders.RowEncoder(fields) =>
      val isExternalRow = !path.dataType.isInstanceOf[StructType]
```
Is it really safe to call `dataType` here? The `path` expression might not be resolved, and then this will throw an exception.
It should be. If you don't know the `dataType` at this point, then you can't build a deserializer.
The problem comes up if you have a `RowEncoder` being used inside a `ProductEncoder`. Then the `path` in the recursion will come from
`spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DeserializerBuildHelper.scala`, line 401 at 9a99ecb:

```scala
createDeserializer(field.enc, getter, newTypePath),
```
and then
`spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DeserializerBuildHelper.scala`, line 398 at 9a99ecb:

```scala
addToPath(path, field.name, field.enc.dataType, newTypePath)
```

`spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/DeserializerBuildHelper.scala`, line 37 at 9a99ecb:

```scala
val newPath = UnresolvedExtractValue(path, expressions.Literal(part))
```
creates an `UnresolvedExtractValue`, and the `.dataType` call will throw:

```
org.apache.spark.sql.catalyst.analysis.UnresolvedException: [INTERNAL_ERROR] Invalid call to dataType on unresolved object SQLSTATE: XX000
	at org.apache.spark.sql.catalyst.analysis.UnresolvedExtractValue.dataType(unresolved.scala:939)
	at org.apache.spark.sql.catalyst.DeserializerBuildHelper$.createDeserializer(DeserializerBuildHelper.scala:411)
```
Is there some assumption somewhere that the encoders should not be fully composable and `RowEncoder` can only be used in certain cases?
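The failure mode described above can be reproduced in miniature without Spark. All names in this sketch (`Expr`, `BoundRef`, `UnresolvedExtract`) are hypothetical stand-ins for Catalyst's expression classes: an analysis-time placeholder node simply cannot answer `dataType` until it has been resolved.

```scala
import scala.util.Try

// Hypothetical stand-ins for Catalyst expressions (not Spark's real classes).
sealed trait Expr { def resolved: Boolean; def dataType: String }

// A fully resolved reference: dataType is known.
final case class BoundRef(dataType: String) extends Expr {
  val resolved = true
}

// Mirrors UnresolvedExtractValue: created while recursing into a nested
// field, and unable to report a dataType before analysis resolves it.
final case class UnresolvedExtract(child: Expr, field: String) extends Expr {
  val resolved = false
  def dataType: String = throw new UnsupportedOperationException(
    "Invalid call to dataType on unresolved object")
}

// A RowEncoder nested inside a ProductEncoder sees a path like this one,
// so an unguarded dataType check blows up.
val nestedPath: Expr = UnresolvedExtract(BoundRef("struct"), "field")
val unguardedCheck: Try[Boolean] = Try(nestedPath.dataType == "struct")
```

At the top level of a query, the path would be a resolved reference and the same check succeeds, which is why the bug only surfaces when the `RowEncoder` is nested.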
@hvanhovell Created this PR #51319 that fixes the issue.
### What changes were proposed in this pull request?
This fixes support for using a `RowEncoder` inside a `ProductEncoder`.

### Why are the changes needed?
The current code does a `dataType` check on a path when constructing the `RowEncoder` deserializer. This is not safe: if the `RowEncoder` is used inside a `ProductEncoder`, it will throw because the path `Expression` is unresolved. The check was introduced in #49785.

### Does this PR introduce _any_ user-facing change?
Yes, it makes it possible to use `RowEncoder` in more cases.

### How was this patch tested?
Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #51319 from eejbyfeldt/SPARK-52614.

Authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@choreograph.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
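The general shape of such a fix can be sketched in miniature: guard the `dataType` call behind a `resolved` check so an unresolved nested path no longer throws. All names here are hypothetical stand-ins, and this is not necessarily the exact code in #51319.

```scala
// Hypothetical stand-ins for Catalyst expressions (not Spark's real classes).
sealed trait PathExpr { def resolved: Boolean; def dataType: String }

final case class ResolvedRef(dataType: String) extends PathExpr {
  val resolved = true
}

final case class UnresolvedField(field: String) extends PathExpr {
  val resolved = false
  def dataType: String = throw new UnsupportedOperationException(
    "Invalid call to dataType on unresolved object")
}

// Unguarded version (the spirit of the original check): throws on an
// unresolved path.
def isExternalRowUnsafe(path: PathExpr): Boolean = path.dataType != "struct"

// Guarded version: only consults dataType when it is safe to do so.
def isExternalRowSafe(path: PathExpr): Boolean =
  !(path.resolved && path.dataType == "struct")
```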
This is a backport of SPARK-52614 (#51319) to branch-4.0.

### What changes were proposed in this pull request?
This fixes support for using a `RowEncoder` inside a `ProductEncoder`.

### Why are the changes needed?
The current code does a `dataType` check on a path when constructing the `RowEncoder` deserializer. This is not safe: if the `RowEncoder` is used inside a `ProductEncoder`, it will throw because the path `Expression` is unresolved. The check was introduced in #49785.

### Does this PR introduce _any_ user-facing change?
Yes, it makes it possible to use `RowEncoder` in more cases.

### How was this patch tested?
Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #52503 from eejbyfeldt/SPARK-52614-4.0.

Authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@choreograph.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>