Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-43302][SQL] Make Python UDAF an AggregateFunction #40739

Closed
wants to merge 15 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Apr 11, 2023

What changes were proposed in this pull request?

Today, PythonUDF can be used as an aggregate function according to the eval type. However, this is done in a tricky way, as PythonUDF does not extend AggregateFunction and we need to add special handling of it here and there. This is pretty error-prone, and we have hit issues such as #39824

This PR adds a new PythonUDAF expression which extends AggregateFunction. Now python udaf will be handled the same as normal aggregate functions, except for the places that we need to extract python functions. After this, we can remove most of the special handling of PythonUDF.

Why are the changes needed?

code cleanup

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@cloud-fan cloud-fan changed the title [WIP] Make Python UDAF an AggregateFunction [SPARK-43302][SQL] Make Python UDAF an AggregateFunction Apr 27, 2023
@@ -93,12 +93,19 @@ struct<>
-- !query output
org.apache.spark.sql.AnalysisException
{
"errorClass" : "MISSING_AGGREGATION",
"sqlState" : "42803",
"errorClass" : "GROUP_BY_POS_AGGREGATE",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now the error is the same as normal expressions.

@@ -315,24 +315,9 @@ struct<1:int>
-- !query
SELECT 1 FROM range(10) HAVING udaf(id) > 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now we can run this query, which is the same as normal expressions.

@cloud-fan
Copy link
Contributor Author

thanks for review, merging to master!

@cloud-fan cloud-fan closed this in b496edb Apr 28, 2023
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request May 10, 2023
### What changes were proposed in this pull request?

Today, `PythonUDF` can be used as an aggregate function according to the eval type. However, this is done in a tricky way, as `PythonUDF` does not extend `AggregateFunction` and we need to add special handling of it here and there. This is pretty error-prone, and we have hit issues such as apache#39824

This PR adds a new `PythonUDAF` expression which extends `AggregateFunction`. Now python udaf will be handled the same as normal aggregate functions, except for the places that we need to extract python functions. After this, we can remove most of the special handling of `PythonUDF`.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes apache#40739 from cloud-fan/python_udaf.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
yaooqinn pushed a commit that referenced this pull request May 17, 2023
### What changes were proposed in this pull request?

This is a followup of #40739 to do some code cleanup
1. remove the pattern `PYTHON_UDAF` as it's not used by any rule.
2. add `PythonFuncExpression.evalType` for convenience: catalyst rules (including third-party extensions) may want to get the eval type of a python function, no matter it's UDF or UDAF.
3. update the python profile to use `PythonUDAF.resultId` instead of `AggregateExpression.resultId`, to be consistent with `PythonUDF`

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #41142 from cloud-fan/follow.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants