hvanhovell
Contributor

What changes were proposed in this pull request?

This PR adds support for UserDefinedAggregateFunction to the Spark Connect Scala Client. Although this is a deprecated feature, supporting it is still useful because it reduces incompatibilities between classic and connect.

Implementation-wise, I opted to convert the UserDefinedAggregateFunction to an Aggregator and to use that code path for execution. This is probably not as fast as the original implementation because it allocates more.
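
To make the trade-off concrete, here is a minimal sketch of the adapter idea in plain Scala. `SimpleUdaf`, `SimpleAggregator`, `UdafAdapter`, and `MeanUdaf` are hypothetical stand-ins, not Spark's actual classes and not the code in this PR. The sketch only assumes that a UDAF mutates a buffer in place while an aggregator returns a buffer value from each step, which is one plausible source of the extra allocations.

```scala
// Hypothetical stand-ins for illustration only; these are NOT Spark's real APIs.

// UDAF-style contract: the function mutates a caller-owned buffer in place.
trait SimpleUdaf {
  def bufferSize: Int
  def initialize(buffer: Array[Any]): Unit
  def update(buffer: Array[Any], input: Seq[Any]): Unit
  def merge(into: Array[Any], from: Array[Any]): Unit
  def evaluate(buffer: Array[Any]): Any
}

// Aggregator-style contract: every step returns the resulting buffer value.
trait SimpleAggregator[IN, BUF, OUT] {
  def zero: BUF
  def reduce(buffer: BUF, input: IN): BUF
  def merge(left: BUF, right: BUF): BUF
  def finish(buffer: BUF): OUT
}

// Adapter that runs a UDAF through the aggregator contract.
// Each step copies the immutable buffer into a scratch array and back,
// so it allocates more than mutating a single buffer would.
final class UdafAdapter(udaf: SimpleUdaf)
    extends SimpleAggregator[Seq[Any], Vector[Any], Any] {

  def zero: Vector[Any] = {
    val scratch = new Array[Any](udaf.bufferSize)
    udaf.initialize(scratch)
    scratch.toVector
  }

  def reduce(buffer: Vector[Any], input: Seq[Any]): Vector[Any] = {
    val scratch = buffer.toArray[Any] // fresh copy for every input row
    udaf.update(scratch, input)
    scratch.toVector
  }

  def merge(left: Vector[Any], right: Vector[Any]): Vector[Any] = {
    val scratch = left.toArray[Any]
    udaf.merge(scratch, right.toArray[Any])
    scratch.toVector
  }

  def finish(buffer: Vector[Any]): Any = udaf.evaluate(buffer.toArray[Any])
}

// Example UDAF: arithmetic mean with buffer layout (sum: Double, count: Long).
object MeanUdaf extends SimpleUdaf {
  val bufferSize: Int = 2

  def initialize(buffer: Array[Any]): Unit = {
    buffer(0) = 0.0
    buffer(1) = 0L
  }

  def update(buffer: Array[Any], input: Seq[Any]): Unit = {
    buffer(0) = buffer(0).asInstanceOf[Double] + input.head.asInstanceOf[Double]
    buffer(1) = buffer(1).asInstanceOf[Long] + 1L
  }

  def merge(into: Array[Any], from: Array[Any]): Unit = {
    into(0) = into(0).asInstanceOf[Double] + from(0).asInstanceOf[Double]
    into(1) = into(1).asInstanceOf[Long] + from(1).asInstanceOf[Long]
  }

  def evaluate(buffer: Array[Any]): Any = {
    val count = buffer(1).asInstanceOf[Long]
    if (count == 0L) null else buffer(0).asInstanceOf[Double] / count
  }
}

object UdafAdapterDemo {
  def main(args: Array[String]): Unit = {
    val agg = new UdafAdapter(MeanUdaf)
    // Two partial aggregations, as two partitions might produce.
    val left = Seq(1.0, 2.0).foldLeft(agg.zero)((b, x) => agg.reduce(b, Seq(x)))
    val right = Seq(6.0).foldLeft(agg.zero)((b, x) => agg.reduce(b, Seq(x)))
    println(agg.finish(agg.merge(left, right))) // mean of 1.0, 2.0 and 6.0
  }
}
```

The adapter keeps the UDAF untouched and pays for that with a copy on every `reduce` and `merge` call; a real implementation may well avoid some of these copies.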

Why are the changes needed?

This reduces friction between the classic and connect implementations.

Does this PR introduce any user-facing change?

Yes. It enables Spark Connect Scala Client users to use UserDefinedAggregateFunctions.

How was this patch tested?

Added tests to

Was this patch authored or co-authored using generative AI tooling?

No.

@hvanhovell
Contributor Author

Merging to master/4.0

@asfgit asfgit closed this in 4953a9c Feb 4, 2025
asfgit pushed a commit that referenced this pull request Feb 4, 2025
…Connect Scala Client

Closes #49785 from hvanhovell/SPARK-49308.

Authored-by: Herman van Hovell <herman@databricks.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
(cherry picked from commit 4953a9c)
Signed-off-by: Herman van Hovell <herman@databricks.com>
NewInstance(cls, arguments, Nil, propagateNull = false, dt, outerPointerGetter))

case AgnosticEncoders.RowEncoder(fields) =>
val isExternalRow = !path.dataType.isInstanceOf[StructType]
Contributor

Is it really safe to call dataType here? The path expression might not be resolved and then this will throw an exception.

Contributor Author

It should be. If you don't know the dataType at this point, then you can't build a deserializer.

Contributor

The problem comes up when a RowEncoder is used inside a ProductEncoder. Then the path in the recursion will come from

createDeserializer(field.enc, getter, newTypePath),

and then
addToPath(path, field.name, field.enc.dataType, newTypePath)
and then here
val newPath = UnresolvedExtractValue(path, expressions.Literal(part))
so the path will contain UnresolvedExtractValue and the .dataType will throw

   org.apache.spark.sql.catalyst.analysis.UnresolvedException: [INTERNAL_ERROR] Invalid call to dataType on unresolved object SQLSTATE: XX000
  at org.apache.spark.sql.catalyst.analysis.UnresolvedExtractValue.dataType(unresolved.scala:939)
  at org.apache.spark.sql.catalyst.DeserializerBuildHelper$.createDeserializer(DeserializerBuildHelper.scala:411)

Is there some assumption somewhere that the encoders should not be fully composable and that RowEncoder can only be used in certain cases?
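
The failure mode described above can be reproduced in miniature. The sketch below uses hypothetical, simplified stand-ins (`Expr`, `BoundRef`, `UnresolvedExtract`, `PathChecks`) rather than Catalyst's real classes, and the guarded variant is only one possible way to avoid the exception, shown as an assumption and not necessarily how the issue is fixed in Spark.

```scala
import scala.util.Try

// Hypothetical, simplified stand-ins; these are NOT Catalyst's real classes.
sealed trait DataType
case object IntType extends DataType
final case class StructType(fieldNames: Seq[String]) extends DataType

sealed trait Expr {
  def resolved: Boolean
  def dataType: DataType
}

// A resolved reference whose type is known.
final case class BoundRef(dataType: DataType) extends Expr {
  val resolved: Boolean = true
}

// An unresolved field access: its type is unknown until analysis has run.
final case class UnresolvedExtract(child: Expr, fieldName: String) extends Expr {
  val resolved: Boolean = false
  def dataType: DataType =
    throw new IllegalStateException("Invalid call to dataType on unresolved object")
}

object PathChecks {
  // Same shape as the reviewed line: throws when `path` is unresolved.
  def isExternalRowUnchecked(path: Expr): Boolean =
    !path.dataType.isInstanceOf[StructType]

  // One possible guard (an assumption for illustration): inspect the type
  // only when the path is resolved, and report "unknown" otherwise.
  def isExternalRowGuarded(path: Expr): Option[Boolean] =
    if (path.resolved) Some(!path.dataType.isInstanceOf[StructType]) else None
}

object PathChecksDemo {
  def main(args: Array[String]): Unit = {
    // A nested path, like the one built for a field inside a product type.
    val nested = UnresolvedExtract(BoundRef(StructType(Seq("row"))), "row")
    println(Try(PathChecks.isExternalRowUnchecked(nested)).isFailure) // true
    println(PathChecks.isExternalRowGuarded(nested))                  // None
    println(PathChecks.isExternalRowGuarded(BoundRef(IntType)))       // Some(true)
  }
}
```

The point of the sketch is only that any code path reaching `dataType` on an unresolved node fails, regardless of how the encoders were composed.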

Contributor

@hvanhovell Created this PR #51319 that fixes the issue.

asf-gitbox-commits pushed a commit that referenced this pull request Oct 1, 2025
### What changes were proposed in this pull request?
This fixes support for using a RowEncoder inside a ProductEncoder.

### Why are the changes needed?
The current code performs a dataType check on the path when constructing the RowEncoder deserializer. This is not safe: if the RowEncoder is used inside a ProductEncoder, the check throws because the path expression is unresolved.

The check was introduced in #49785

### Does this PR introduce _any_ user-facing change?
Yes, it makes it possible to use RowEncoder in more cases.

### How was this patch tested?
Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #51319 from eejbyfeldt/SPARK-52614.

Authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@choreograph.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>
eejbyfeldt added a commit to eejbyfeldt/spark that referenced this pull request Oct 2, 2025
asf-gitbox-commits pushed a commit that referenced this pull request Oct 6, 2025
This is a backport of SPARK-52614 #51319 to branch-4.0

### What changes were proposed in this pull request?
This fixes support for using a RowEncoder inside a ProductEncoder.

### Why are the changes needed?
The current code performs a dataType check on the path when constructing the RowEncoder deserializer. This is not safe: if the RowEncoder is used inside a ProductEncoder, the check throws because the path expression is unresolved.

The check was introduced in #49785

### Does this PR introduce _any_ user-facing change?
Yes, it makes it possible to use RowEncoder in more cases.

### How was this patch tested?
Existing and new unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #52503 from eejbyfeldt/SPARK-52614-4.0.

Authored-by: Emil Ejbyfeldt <emil.ejbyfeldt@choreograph.com>
Signed-off-by: Herman van Hovell <herman@databricks.com>