[SPARK-41633][SQL] Identify aggregation expressions in the nodePatterns of PythonUDF #39142
gengliangwang wants to merge 4 commits into apache:master from
Conversation
override def toString: String = s"$name(${children.mkString(", ")})#${resultId.id}$typeSuffix"

final override val nodePatterns: Seq[TreePattern] = Seq(PYTHON_UDF)
private def nodePatternsOfPythonFunction: Option[TreePattern] = {
Just to think about future-proofing: how do we prevent someone from adding another expression in the future that can be an aggregate expression but doesn't update this location?
Good point. I just did a simple refactoring on the method PythonUDF.isGroupedAggPandasUDFEvalType to check whether an eval type is a pandas aggregation function.
I added a comment for developers as well.
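As a rough sketch of that refactoring (only isGroupedAggPandasUDFEvalType and the SQL_GROUPED_AGG_PANDAS_UDF constant come from the discussion; the object name and the second helper here are illustrative, the real change is in the PR diff), the idea is a single eval-type predicate that both the expression-level check and the node-pattern computation share:

import org.apache.spark.api.python.PythonEvalType
import org.apache.spark.sql.catalyst.expressions.{Expression, PythonUDF}

// Illustrative companion-style helper, not the exact code from the PR.
object GroupedAggPandasUDFCheck {
  // Single source of truth: is this eval type a pandas aggregation function?
  def isGroupedAggPandasUDFEvalType(evalType: Int): Boolean =
    evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF

  // The expression-level check reuses the same predicate.
  def isGroupedAggPandasUDF(e: Expression): Boolean = e match {
    case u: PythonUDF => isGroupedAggPandasUDFEvalType(u.evalType)
    case _ => false
  }
}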
I asked @sigmod a similar question while reviewing PRs under https://issues.apache.org/jira/browse/SPARK-35042. The answer is that developers need to maintain the node patterns and create more test cases to make sure no regressions happen.
We can add a test case that goes over all the PythonEvalType values as well. But given that it is written in the following way:
private[spark] object PythonEvalType {
val NON_UDF = 0
val SQL_BATCHED_UDF = 100
val SQL_SCALAR_PANDAS_UDF = 200
val SQL_GROUPED_MAP_PANDAS_UDF = 201
val SQL_GROUPED_AGG_PANDAS_UDF = 202
val SQL_WINDOW_AGG_PANDAS_UDF = 203
val SQL_SCALAR_PANDAS_ITER_UDF = 204
val SQL_MAP_PANDAS_ITER_UDF = 205
val SQL_COGROUPED_MAP_PANDAS_UDF = 206
val SQL_MAP_ARROW_ITER_UDF = 207
val SQL_GROUPED_MAP_PANDAS_UDF_WITH_STATE = 208
I am still thinking about how to refactor it and write tests that enumerate all the types for future-proofing.
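One possible shape for such a test (my own sketch, not from the PR; the "reviewed" set mirrors the constants listed above) is to enumerate the constants via reflection, so that adding a new eval type fails the check until someone decides whether it needs AGGREGATE_EXPRESSION:

import org.apache.spark.api.python.PythonEvalType

// Sketch only. Must live under the org.apache.spark package, since
// PythonEvalType is private[spark].
object EvalTypeCoverageCheck {
  def main(args: Array[String]): Unit = {
    // Collect the NON_UDF / SQL_* Int constants declared on the object.
    val declared = PythonEvalType.getClass.getDeclaredMethods
      .filter { m =>
        m.getParameterCount == 0 &&
          m.getReturnType == java.lang.Integer.TYPE &&
          (m.getName == "NON_UDF" || m.getName.startsWith("SQL_"))
      }
      .map(m => m.getName -> m.invoke(PythonEvalType).asInstanceOf[Int])
      .toMap
    // Eval types whose node patterns have already been reviewed.
    val reviewed = Set(0, 100, 200, 201, 202, 203, 204, 205, 206, 207, 208)
    val unreviewed = declared.filterNot { case (_, v) => reviewed(v) }
    assert(unreviewed.isEmpty,
      s"Decide the node patterns for new eval types: ${unreviewed.keys.mkString(", ")}")
  }
}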
// SPARK-41633: We should check whether to update the node patterns when adding a new eval type.
private def nodePatternsOfPythonFunction: Seq[TreePattern] = {
  if (PythonUDF.isGroupedAggPandasUDFEvalType(evalType)) {
    Seq(AGGREGATE_EXPRESSION)
I think it's better to make the node pattern match the node type. We created node patterns to speed up isInstanceOf checks, not to replace them. Even if we fix the node pattern here, expr.exists(_.isInstanceOf[AggregateExpression]) is still broken.
I think we should have a PythonUDAF that indicates an aggregate function explicitly, instead of relying on the node pattern.
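To illustrate the point (a sketch of my own, not code from the PR): the pattern bit is meant as a cheap pre-filter for the authoritative isInstanceOf check, so the two have to stay consistent:

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.trees.TreePattern.AGGREGATE_EXPRESSION

object PatternVsInstanceOf {
  def containsAggregateExpression(expr: Expression): Boolean = {
    // Fast path: prune trees whose pattern bits say the node cannot be present.
    expr.containsPattern(AGGREGATE_EXPRESSION) &&
      // Slow path: the type check that actually decides the answer.
      expr.exists(_.isInstanceOf[AggregateExpression])
  }
}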
I agree that we should have a subtype for aggregate. That's a much safer change. Should've done that from the beginning IMO.
Due to the comments in #39142 (comment), I am closing this one.
What changes were proposed in this pull request?
For Python UDFs with aggregation functions, we can mark them as AGGREGATE_EXPRESSION in the nodePatterns of PythonUDF, so that we can check whether an expression contains aggregation in a handy way: expr.containsPattern(AGGREGATE_EXPRESSION).
Why are the changes needed?
Addresses a comment in #39134 (comment). This should provide a handy way to check whether an expression contains aggregations.
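For illustration only (a hypothetical call site, not code from the PR), the kind of check this is meant to simplify looks like this before and after:

import org.apache.spark.api.python.PythonEvalType
import org.apache.spark.sql.catalyst.expressions.{Expression, PythonUDF}
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.trees.TreePattern.AGGREGATE_EXPRESSION

object ContainsAggExample {
  // Before: a manual traversal that has to remember the grouped-agg pandas UDF case.
  def hasAggManual(expr: Expression): Boolean = expr.exists {
    case _: AggregateExpression => true
    case u: PythonUDF => u.evalType == PythonEvalType.SQL_GROUPED_AGG_PANDAS_UDF
    case _ => false
  }

  // After: a single node-pattern check, once the UDF reports AGGREGATE_EXPRESSION.
  def hasAggByPattern(expr: Expression): Boolean =
    expr.containsPattern(AGGREGATE_EXPRESSION)
}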
Does this PR introduce any user-facing change?
No
How was this patch tested?
New UT