[SPARK-43302][SQL] Make Python UDAF an AggregateFunction #40739

cloud-fan · 2023-04-11T09:41:54Z

What changes were proposed in this pull request?

Today, PythonUDF can be used as an aggregate function, depending on its eval type. However, this is done in a tricky way: PythonUDF does not extend AggregateFunction, so we need to add special handling for it in many places. This is error-prone, and we have hit issues such as #39824.

This PR adds a new PythonUDAF expression which extends AggregateFunction. Python UDAFs are now handled the same way as normal aggregate functions, except in the places where we need to extract Python functions. After this, we can remove most of the special handling of PythonUDF.
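To make the motivation concrete, here is a toy Python model of the idea. It is only a sketch and not Spark's Scala code: the class names echo the expressions named in this PR, while `Sum`, `is_aggregate_before`, and `is_aggregate_after` are hypothetical helpers, and the eval type constants are assumed to mirror PySpark's `PythonEvalType` values.

```python
from dataclasses import dataclass

# Assumed to mirror PySpark's PythonEvalType; used here only as labels.
SQL_BATCHED_UDF = 100
SQL_GROUPED_AGG_PANDAS_UDF = 202


class Expression:
    """Toy stand-in for a Catalyst expression."""


class AggregateFunction(Expression):
    """Toy stand-in for Catalyst's AggregateFunction."""


@dataclass
class Sum(AggregateFunction):
    """Hypothetical built-in aggregate, for comparison."""
    column: str


@dataclass
class PythonUDF(Expression):
    """A Python function whose role is only known from its eval type."""
    name: str
    eval_type: int


@dataclass
class PythonUDAF(AggregateFunction):
    """A Python aggregate that is an AggregateFunction by type."""
    name: str


def is_aggregate_before(expr: Expression) -> bool:
    # Old shape: each rule needs an extra eval-type branch for Python functions.
    if isinstance(expr, PythonUDF):
        return expr.eval_type == SQL_GROUPED_AGG_PANDAS_UDF
    return isinstance(expr, AggregateFunction)


def is_aggregate_after(expr: Expression) -> bool:
    # New shape: one type check covers built-in and Python aggregates alike.
    return isinstance(expr, AggregateFunction)
```

In this model, any rule written against `AggregateFunction` picks up `PythonUDAF` without changes, which is the kind of special-casing the PR aims to remove.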

Why are the changes needed?

code cleanup

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

cloud-fan · 2023-04-27T03:55:35Z

sql/core/src/test/resources/sql-tests/results/udaf/udaf-group-by-ordinal.sql.out

@@ -93,12 +93,19 @@ struct<>
 -- !query output
 org.apache.spark.sql.AnalysisException
 {
-  "errorClass" : "MISSING_AGGREGATION",
-  "sqlState" : "42803",
+  "errorClass" : "GROUP_BY_POS_AGGREGATE",


Now the error is the same as for normal expressions.

cloud-fan · 2023-04-27T03:56:09Z

sql/core/src/test/resources/sql-tests/results/udaf/udaf-group-by.sql.out

@@ -315,24 +315,9 @@ struct<1:int>
 -- !query
 SELECT 1 FROM range(10) HAVING udaf(id) > 0


Now we can run this query, the same as with normal expressions.

cloud-fan · 2023-04-28T03:06:39Z

thanks for review, merging to master!

### What changes were proposed in this pull request?

Today, `PythonUDF` can be used as an aggregate function according to the eval type. However, this is done in a tricky way, as `PythonUDF` does not extend `AggregateFunction` and we need to add special handling of it here and there. This is pretty error-prone, and we have hit issues such as apache#39824

This PR adds a new `PythonUDAF` expression which extends `AggregateFunction`. Now python udaf will be handled the same as normal aggregate functions, except for the places that we need to extract python functions. After this, we can remove most of the special handling of `PythonUDF`.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes apache#40739 from cloud-fan/python_udaf.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

### What changes were proposed in this pull request?

This is a followup of #40739 to do some code cleanup:

1. Remove the pattern `PYTHON_UDAF`, as it's not used by any rule.
2. Add `PythonFuncExpression.evalType` for convenience: catalyst rules (including third-party extensions) may want to get the eval type of a Python function, whether it's a UDF or a UDAF.
3. Update the python profile to use `PythonUDAF.resultId` instead of `AggregateExpression.resultId`, to be consistent with `PythonUDF`.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #41142 from cloud-fan/follow.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
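The convenience of a shared `evalType` accessor can be sketched with a toy Python model. This is not Spark's Scala code: `collect_eval_types` is a hypothetical rule-like helper, the constants are assumed to mirror PySpark's `PythonEvalType` values, and giving `PythonUDAF` a fixed grouped-aggregate eval type is an assumption of this sketch.

```python
from abc import ABC, abstractmethod

# Assumed to mirror PySpark's PythonEvalType; used here only as labels.
SQL_BATCHED_UDF = 100
SQL_GROUPED_AGG_PANDAS_UDF = 202


class PythonFuncExpression(ABC):
    """Toy common interface for Python function expressions."""

    @property
    @abstractmethod
    def eval_type(self) -> int:
        ...


class PythonUDF(PythonFuncExpression):
    def __init__(self, name: str, eval_type: int) -> None:
        self.name = name
        self._eval_type = eval_type

    @property
    def eval_type(self) -> int:
        return self._eval_type


class PythonUDAF(PythonFuncExpression):
    def __init__(self, name: str) -> None:
        self.name = name

    @property
    def eval_type(self) -> int:
        # Assumption of this sketch: a UDAF always reports the grouped-agg type.
        return SQL_GROUPED_AGG_PANDAS_UDF


def collect_eval_types(expressions) -> set:
    # A rule can read eval_type without caring whether it sees a UDF or a UDAF.
    return {
        e.eval_type for e in expressions if isinstance(e, PythonFuncExpression)
    }
```

A caller such as a third-party rule would then only depend on the shared interface, not on the two concrete classes.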

github-actions bot added PYTHON SQL labels Apr 11, 2023

cloud-fan force-pushed the python_udaf branch 5 times, most recently from ba72bae to 404aa7e Compare April 13, 2023 13:03

github-actions bot added the STRUCTURED STREAMING label Apr 13, 2023

cloud-fan force-pushed the python_udaf branch from 404aa7e to f18dc9e Compare April 14, 2023 05:22

github-actions bot removed the STRUCTURED STREAMING label Apr 14, 2023

cloud-fan force-pushed the python_udaf branch 4 times, most recently from 39b2dca to 97378e7 Compare April 18, 2023 08:34

github-actions bot added the CONNECT label Apr 19, 2023

cloud-fan force-pushed the python_udaf branch 2 times, most recently from dcb9775 to 84c8c37 Compare April 20, 2023 15:16

cloud-fan and others added 5 commits April 24, 2023 18:49

Make Python UDAF an AggregateFunction

043427f

Add sql(isDistinct: String)

c17c79d

fix string

5e0d557

fix spark connect

c6ee67c

fix string

acaa90c

cloud-fan force-pushed the python_udaf branch from 84c8c37 to acaa90c Compare April 24, 2023 10:50

cloud-fan and others added 5 commits April 24, 2023 23:40

fix

fd6cbc6

fix string

0d3c6ed

fix

4db6af3

Update SparkConnectPlanner.scala

64f812d

fix

048b5a6

cloud-fan force-pushed the python_udaf branch from 9fb09ce to 048b5a6 Compare April 26, 2023 03:25

fix

3b05bfb

cloud-fan added 2 commits April 27, 2023 11:25

fix

97b20df

fix

b47e86b

cloud-fan changed the title ~~[WIP] Make Python UDAF an AggregateFunction~~ [SPARK-43302][SQL] Make Python UDAF an AggregateFunction Apr 27, 2023

refine

cc19e71

cloud-fan commented Apr 27, 2023

fix style

c7c0b97

HyukjinKwon approved these changes Apr 27, 2023

cloud-fan closed this in b496edb Apr 28, 2023

cloud-fan mentioned this pull request May 11, 2023

[SPARK-43302][SQL][FOLLOWUP] Code cleanup for PythonUDAF #41142

Closed
