New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-43302][SQL] Make Python UDAF an AggregateFunction #40739
Conversation
ba72bae
to
404aa7e
Compare
39b2dca
to
97378e7
Compare
dcb9775
to
84c8c37
Compare
@@ -93,12 +93,19 @@ struct<> | |||
-- !query output | |||
org.apache.spark.sql.AnalysisException | |||
{ | |||
"errorClass" : "MISSING_AGGREGATION", | |||
"sqlState" : "42803", | |||
"errorClass" : "GROUP_BY_POS_AGGREGATE", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now the error is the same as normal expressions.
@@ -315,24 +315,9 @@ struct<1:int> | |||
-- !query | |||
SELECT 1 FROM range(10) HAVING udaf(id) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now we can run this query, which is the same as normal expressions.
thanks for review, merging to master! |
### What changes were proposed in this pull request? Today, `PythonUDF` can be used as an aggregate function according to the eval type. However, this is done in a tricky way, as `PythonUDF` does not extend `AggregateFunction` and we need to add special handling of it here and there. This is pretty error-prone, and we have hit issues such as apache#39824 This PR adds a new `PythonUDAF` expression which extends `AggregateFunction`. Now python udaf will be handled the same as normal aggregate functions, except for the places that we need to extract python functions. After this, we can remove most of the special handling of `PythonUDF`. ### Why are the changes needed? code cleanup ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes apache#40739 from cloud-fan/python_udaf. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@apache.org> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? This is a followup of #40739 to do some code cleanup 1. remove the pattern `PYTHON_UDAF` as it's not used by any rule. 2. add `PythonFuncExpression.evalType` for convenience: catalyst rules (including third-party extensions) may want to get the eval type of a python function, no matter it's UDF or UDAF. 3. update the python profile to use `PythonUDAF.resultId` instead of `AggregateExpression.resultId`, to be consistent with `PythonUDF` ### Why are the changes needed? code cleanup ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #41142 from cloud-fan/follow. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Kent Yao <yao@apache.org>
What changes were proposed in this pull request?
Today,
PythonUDF
can be used as an aggregate function according to the eval type. However, this is done in a tricky way, asPythonUDF
does not extendAggregateFunction
and we need to add special handling of it here and there. This is pretty error-prone, and we have hit issues such as #39824This PR adds a new
PythonUDAF
expression which extendsAggregateFunction
. Now python udaf will be handled the same as normal aggregate functions, except for the places that we need to extract python functions. After this, we can remove most of the special handling ofPythonUDF
.Why are the changes needed?
code cleanup
Does this PR introduce any user-facing change?
no
How was this patch tested?
existing tests