[SPARK-23177][SQL][PySpark] Extract zero-parameter UDFs from aggregate #20360
Conversation
Test build #86518 has finished for PR 20360 at commit
Test build #86520 has finished for PR 20360 at commit
retest this please.
LGTM. I left one question though.
```diff
@@ -45,7 +45,8 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
   private def hasPythonUdfOverAggregate(expr: Expression, agg: Aggregate): Boolean = {
     expr.find {
-      e => PythonUDF.isScalarPythonUDF(e) && e.find(belongAggregate(_, agg)).isDefined
+      e => PythonUDF.isScalarPythonUDF(e) &&
+        (e.references.isEmpty || e.find(belongAggregate(_, agg)).isDefined)
```
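The predicate change can be paraphrased in a small Python sketch. This is an illustrative model of the rule's condition only, not Spark's actual implementation; the `Expr` class and all attribute names here are invented for the illustration:

```python
# Hypothetical model of a Catalyst-style expression tree, just to show the
# shape of the hasPythonUdfOverAggregate condition before and after the fix.
class Expr:
    def __init__(self, children=(), references=(), is_python_udf=False,
                 belongs_to_aggregate=False):
        self.children = list(children)
        self.references = set(references)
        self.is_python_udf = is_python_udf
        self.belongs_to_aggregate = belongs_to_aggregate

    def find(self, pred):
        # Mirrors Expression.find: first node (pre-order) satisfying pred, else None.
        if pred(self):
            return self
        for child in self.children:
            found = child.find(pred)
            if found is not None:
                return found
        return None

def has_python_udf_over_aggregate(expr):
    # Old condition: the UDF must (transitively) touch the aggregate.
    # New condition: also extract UDFs that reference nothing at all,
    # e.g. a zero-parameter UDF or f_udf(lit("2")).
    return expr.find(
        lambda e: e.is_python_udf and
        (not e.references or
         e.find(lambda x: x.belongs_to_aggregate) is not None)
    ) is not None
```

Under this model, a zero-reference UDF is now reported for extraction, while a UDF that only references non-aggregate columns still is not.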
Can we use just `e.children` instead of `e.references`?
I just wanted to cover literal inputs, e.g. `df2 = df.distinct().withColumn("a", f_udf(f.lit("2")))`.
Sorry, I wrote a duplicate comment and removed it. It didn't show up while I was writing.
Oh, I see, sounds good. Thanks!
Test build #86523 has finished for PR 20360 at commit
LGTM
```diff
@@ -45,7 +45,8 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
   private def hasPythonUdfOverAggregate(expr: Expression, agg: Aggregate): Boolean = {
     expr.find {
       e => PythonUDF.isScalarPythonUDF(e) && e.find(belongAggregate(_, agg)).isDefined
```
Shall we update the classdoc too? It currently says: "Extracts all the Python UDFs in logical aggregate, which depends on aggregate expression or grouping key, evaluate them after aggregate."
Yes. Updated.
LGTM
Test build #86551 has finished for PR 20360 at commit
Merged to master.
@viirya, mind if I ask you to open a backport to branch-2.3?
@HyukjinKwon OK. I will open a backport later.
Is there any workaround for this? My environment hasn't been upgraded to 2.3.0, but I have the exact code the JIRA ticket describes. (http://mail-archives.apache.org/mod_mbox/spark-issues/201801.mbox/%3CJIRA.13132665.1516622460000.6681.1516622520346@Atlassian.JIRA%3E)
@hankim maybe like:
What changes were proposed in this pull request?
We extract Python UDFs in a logical aggregate that depend on an aggregate expression or grouping key in the ExtractPythonUDFFromAggregate rule. But Python UDFs that don't depend on such expressions should also be extracted, to avoid the issue reported in the JIRA.
A small code snippet to reproduce that issue looks like:
The error exception is raised as:
This exception is raised because HashAggregateExec tries to bind the aliased Python UDF expression (e.g., `pythonUDF0#50 AS a#44`) to the grouping key.

How was this patch tested?
Added test.