[SPARK-17474] [SQL] fix python udf in TakeOrderedAndProjectExec #15030
Test build #65161 has finished for PR 15030 at commit
Test build #65167 has finished for PR 15030 at commit
@@ -148,8 +148,8 @@ case class TakeOrderedAndProjectExec(
         localTopK, child.output, SinglePartition, serializer))
     shuffled.mapPartitions { iter =>
       val topK = org.apache.spark.util.collection.Utils.takeOrdered(iter.map(_.copy()), limit)(ord)
-      if (projectList.isDefined) {
-        val proj = UnsafeProjection.create(projectList.get, child.output)
+      if (AttributeSet(projectList) != child.outputSet) {
Should this be an order-insensitive, set-based comparison, or should it use AttributeSeq instead? I'm wondering whether we could hit a bug if the project happens to permute the child output columns, since in that case I think we'd end up skipping the final column-reordering projection.
Good point, we should just compare it with output as Seq directly.
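To illustrate the concern raised above, here is a minimal Scala sketch. `Attr` and the two helper functions are hypothetical stand-ins, not Spark's `Attribute`, `AttributeSet`, or the code in this PR; plain `Set` and `Seq` are used only to show that a set-based comparison misses a column permutation while a sequence comparison catches it.

```scala
object ProjectionCheck {
  // Hypothetical stand-in for a Catalyst attribute: just a column name.
  final case class Attr(name: String)

  // Child output columns in their original order.
  val childOutput: Seq[Attr] = Seq(Attr("a"), Attr("b"))
  // A project list that only permutes the child's columns.
  val permuted: Seq[Attr] = Seq(Attr("b"), Attr("a"))

  // Set-based check: ordering is ignored, so a permutation looks like "no change".
  def needsProjectionBySet(projectList: Seq[Attr], output: Seq[Attr]): Boolean =
    projectList.toSet != output.toSet

  // Sequence-based check: ordering matters, so a permutation is detected.
  def needsProjectionBySeq(projectList: Seq[Attr], output: Seq[Attr]): Boolean =
    projectList != output

  def main(args: Array[String]): Unit = {
    // false: the set comparison would skip the reordering projection
    println(needsProjectionBySet(permuted, childOutput))
    // true: the sequence comparison still applies the projection
    println(needsProjectionBySeq(permuted, childOutput))
  }
}
```

Under these assumptions, only the sequence comparison keeps the final column-reordering projection when the project list is a permutation of the child output.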
Test build #65273 has finished for PR 15030 at commit
LGTM
Merging into 2.0 and master.
## What changes were proposed in this pull request?

When there is any Python UDF in the Project between Sort and Limit, the Project is collected into TakeOrderedAndProjectExec, and ExtractPythonUDFs fails to pull the Python UDFs out because QueryPlan.expressions does not include the expressions inside Option[Seq[Expression]].

Ideally, we should fix `QueryPlan.expressions`, but I tried with no luck (it always runs into an infinite loop). In this PR, I changed TakeOrderedAndProjectExec to not use Option[Seq[Expression]] to work around it.

cc JoshRosen

## How was this patch tested?

Added regression test.

Author: Davies Liu <davies@databricks.com>

Closes #15030 from davies/all_expr.

(cherry picked from commit a91ab70)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
What changes were proposed in this pull request?
When there is any Python UDF in the Project between Sort and Limit, the Project is collected into TakeOrderedAndProjectExec, and ExtractPythonUDFs fails to pull the Python UDFs out because QueryPlan.expressions does not include the expressions inside Option[Seq[Expression]].
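The gap described above can be modeled with a small Scala sketch. This is a simplified, hypothetical model, not Spark's actual `QueryPlan.expressions`: `Expr`, `WithOption`, `WithSeq`, and the `expressions` helper are names invented for illustration. It assumes a collector that walks a case class's constructor fields and recognizes only a bare expression, `Some(expression)`, or a collection of expressions.

```scala
object ExpressionsSketch {
  // Hypothetical stand-in for a Catalyst Expression.
  final case class Expr(sql: String)

  // Simplified model of a reflective collector (an assumption, not Spark's code):
  // it inspects each constructor field and does not look inside Option[Seq[_]].
  def expressions(node: Product): Seq[Expr] =
    node.productIterator.flatMap {
      case e: Expr         => Seq(e)
      case Some(e: Expr)   => Seq(e)
      case xs: Iterable[_] => xs.collect { case e: Expr => e }.toSeq
      case _               => Seq.empty[Expr]
    }.toList

  // Operator shape with an optional project list.
  final case class WithOption(limit: Int, projectList: Option[Seq[Expr]])
  // Operator shape with a plain project list.
  final case class WithSeq(limit: Int, projectList: Seq[Expr])

  // Hypothetical expression standing in for a Python UDF call.
  val udf: Expr = Expr("my_python_udf(a)")

  def main(args: Array[String]): Unit = {
    // The UDF nested in Option[Seq[Expr]] is not found by this collector.
    println(expressions(WithOption(10, Some(Seq(udf)))).isEmpty)
    // The same UDF in a plain Seq[Expr] is found.
    println(expressions(WithSeq(10, Seq(udf))).contains(udf))
  }
}
```

In this model, wrapping the project list in `Option` hides the UDF expression from the collector, while a plain `Seq` exposes it, so any rule that relies on the collected expressions can see it.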
Ideally, we should fix `QueryPlan.expressions`, but I tried with no luck (it always runs into an infinite loop). In this PR, I changed TakeOrderedAndProjectExec to not use Option[Seq[Expression]] to work around it. cc @JoshRosen
How was this patch tested?
Added regression test.