[SPARK-17474] [SQL] fix python udf in TakeOrderedAndProjectExec #15030

davies · 2016-09-09T19:06:08Z

What changes were proposed in this pull request?

When there is a Python UDF in the Project between Sort and Limit, the Project is collected into TakeOrderedAndProjectExec. ExtractPythonUDFs then fails to pull the Python UDFs out, because QueryPlan.expressions does not include expressions inside an Option[Seq[Expression]].

Ideally, we should fix QueryPlan.expressions, but I tried with no luck (it always runs into an infinite loop). In this PR, I changed TakeOrderedAndProjectExec to not use Option[Seq[Expression]] to work around it. cc @JoshRosen
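To make the failure mode concrete, here is a small Python model of the problem described above. It is only a sketch, not Spark's Scala code: `Expr`, `Some`, `collect_expressions`, `TopKWithOption`, and `TopKWithSeq` are hypothetical stand-ins, and the set of cases the collector handles is an assumption modeled on the description (bare expressions and sequences are found, a sequence wrapped in an Option is not).

```python
from dataclasses import dataclass, fields
from typing import Any, List, Optional


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for a Catalyst Expression."""
    name: str


@dataclass(frozen=True)
class Some:
    """Hypothetical stand-in for Scala's Some(...); Python's None models None."""
    value: Any


def collect_expressions(node: Any) -> List[Expr]:
    """Collect expressions reachable from a node's fields.

    Assumed cases: a bare Expr, Some(Expr), and (nested) lists of Expr.
    There is no case for Some(list), so those expressions are skipped.
    """
    def from_seq(seq):
        out = []
        for item in seq:
            if isinstance(item, Expr):
                out.append(item)
            elif isinstance(item, list):
                out.extend(from_seq(item))
        return out

    found = []
    for f in fields(node):
        value = getattr(node, f.name)
        if isinstance(value, Expr):
            found.append(value)
        elif isinstance(value, Some) and isinstance(value.value, Expr):
            found.append(value.value)
        elif isinstance(value, list):
            found.extend(from_seq(value))
        # Anything else, including Some([...]), contributes nothing.
    return found


@dataclass
class TopKWithOption:
    limit: int
    sort_order: List[Expr]
    project_list: Optional[Some]  # models Option[Seq[Expression]]


@dataclass
class TopKWithSeq:
    limit: int
    sort_order: List[Expr]
    project_list: List[Expr]  # models a plain Seq of expressions


udf = Expr("my_python_udf(id)")
before = TopKWithOption(1, [Expr("id")], Some([Expr("id"), udf]))
after = TopKWithSeq(1, [Expr("id")], [Expr("id"), udf])

print(udf in collect_expressions(before))  # False: the UDF is invisible
print(udf in collect_expressions(after))   # True: the UDF can be extracted
```

In this model, a rule that relies on `collect_expressions` to find UDFs never sees the one hidden inside `Some([...])`, which is why switching the field to a plain sequence is enough to work around the problem.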

How was this patch tested?

Added regression test.

SparkQA · 2016-09-09T19:15:16Z

Test build #65161 has finished for PR 15030 at commit 3ea0daf.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2016-09-09T19:21:46Z

[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/test/scala/org/apache/spark/sql/execution/TakeOrderedAndProjectSuite.scala:62: type mismatch;
[error]  found   : None.type
[error]  required: Seq[org.apache.spark.sql.catalyst.expressions.NamedExpression]
[error]           noOpFilter(TakeOrderedAndProjectExec(limit, sortOrder, None, input)),
[error]                                                                  ^
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/test/scala/org/apache/spark/sql/execution/TakeOrderedAndProjectSuite.scala:77: type mismatch;
[error]  found   : Some[Seq[org.apache.spark.sql.catalyst.expressions.Attribute]]
[error]  required: Seq[org.apache.spark.sql.catalyst.expressions.NamedExpression]
[error]             TakeOrderedAndProjectExec(limit, sortOrder, Some(Seq(input.output.last)), input)),
[error]

SparkQA · 2016-09-09T22:12:14Z

Test build #65167 has finished for PR 15030 at commit 263c147.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2016-09-12T19:03:01Z

sql/core/src/main/scala/org/apache/spark/sql/execution/limit.scala

@@ -148,8 +148,8 @@ case class TakeOrderedAndProjectExec(
        localTopK, child.output, SinglePartition, serializer))
    shuffled.mapPartitions { iter =>
      val topK = org.apache.spark.util.collection.Utils.takeOrdered(iter.map(_.copy()), limit)(ord)
-      if (projectList.isDefined) {
-        val proj = UnsafeProjection.create(projectList.get, child.output)
+      if (AttributeSet(projectList) != child.outputSet) {


Should this be an order-insensitive, set-based comparison, or should it use AttributeSeq instead? I'm wondering whether we could hit a bug if the project happens to permute the child output columns, since in that case I think we'd end up skipping the final column-reordering projection.

davies · 2016-09-12

Good point, we should just compare it with the output as a Seq directly.
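The concern raised in this exchange can be illustrated with a short Python sketch. Column names as strings stand in for attributes, and the two helper names are hypothetical. It only shows why a set comparison misses a pure reordering while a sequence comparison does not.

```python
def needs_projection_set(project_list, output):
    """Order-insensitive check: compares the columns as sets."""
    return set(project_list) != set(output)


def needs_projection_seq(project_list, output):
    """Order-sensitive check: compares the columns position by position."""
    return list(project_list) != list(output)


child_output = ["a", "b"]
permuted = ["b", "a"]  # same columns, different order

print(needs_projection_set(permuted, child_output))  # False: reordering is skipped
print(needs_projection_seq(permuted, child_output))  # True: projection still runs
```

With the set-based check, a project list that only permutes the child's columns looks like a no-op, so the rows would come back in the child's column order rather than the requested one.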

SparkQA · 2016-09-12T21:21:40Z

Test build #65273 has finished for PR 15030 at commit 1e319d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2016-09-12T23:00:08Z

LGTM

davies · 2016-09-12T23:35:19Z

Merging into 2.0 and master.

## What changes were proposed in this pull request?

When there is a Python UDF in the Project between Sort and Limit, the Project is collected into TakeOrderedAndProjectExec. ExtractPythonUDFs then fails to pull the Python UDFs out, because QueryPlan.expressions does not include expressions inside an Option[Seq[Expression]].

Ideally, we should fix `QueryPlan.expressions`, but I tried with no luck (it always runs into an infinite loop). In this PR, I changed TakeOrderedAndProjectExec to not use Option[Seq[Expression]] to work around it. cc JoshRosen

## How was this patch tested?

Added regression test.

Author: Davies Liu <davies@databricks.com>

Closes #15030 from davies/all_expr.

(cherry picked from commit a91ab70)
Signed-off-by: Davies Liu <davies.liu@gmail.com>

fix python udf in TakeOrderedAndProjectExec

3ea0daf

fix tests

263c147

JoshRosen reviewed Sep 12, 2016

use Seq to compare

1e319d8

asfgit closed this in a91ab70 Sep 12, 2016

peter-toth mentioned this pull request Jun 21, 2020

[SPARK-29375][SPARK-28940][SPARK-32041][SQL] Whole plan exchange and subquery reuse #28885

Closed
