[SPARK-20718][SQL] FileSourceScanExec with different filter orders should be the same after canonicalization #17959

wzhfy · 2017-05-12T03:25:46Z

What changes were proposed in this pull request?

Since constraints in QueryPlan is a set, the order of the filters derived from it can differ between otherwise identical plans. Usually this is OK because of canonicalization. However, in FileSourceScanExec, the data filters and partition filters are sequences, and their order is not canonicalized. As a result, def sameResult can return false for two scans that differ only in the order of their data/partition filters. This leads to, e.g., different decisions in ReuseExchange, and thus to unstable performance.
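To illustrate the problem, here is a minimal, self-contained sketch. It does not use Spark: `Filter`, `Scan`, `naiveSameResult` and the sort-based `canonicalized` are hypothetical stand-ins chosen for illustration, and sorting by a key is only one possible way to make a filter sequence order-insensitive. It is not necessarily how this patch canonicalizes filters.

```scala
object FilterOrderSketch {
  // Hypothetical stand-in for a filter expression; NOT Spark's Expression.
  final case class Filter(column: String, op: String, value: Int)

  // Hypothetical stand-in for a scan node that keeps its filters in a Seq.
  final case class Scan(table: String, dataFilters: Seq[Filter]) {
    // Seq equality is order-sensitive, so this depends on filter order.
    def naiveSameResult(other: Scan): Boolean = this == other

    // Assumed canonical form: sort filters by a deterministic key.
    def canonicalized: Scan =
      copy(dataFilters = dataFilters.sortBy(f => (f.column, f.op, f.value)))

    // Compare canonical forms so that filter order no longer matters.
    def sameResult(other: Scan): Boolean = canonicalized == other.canonicalized
  }

  // Same filters, different order.
  val a: Scan = Scan("t", Seq(Filter("x", ">", 1), Filter("y", "<", 5)))
  val b: Scan = Scan("t", Seq(Filter("y", "<", 5), Filter("x", ">", 1)))
  // A genuinely different filter set.
  val c: Scan = Scan("t", Seq(Filter("x", ">", 2), Filter("y", "<", 5)))

  def main(args: Array[String]): Unit = {
    println(s"naive a vs b:         ${a.naiveSameResult(b)}")
    println(s"canonicalized a vs b: ${a.sameResult(b)}")
    println(s"canonicalized a vs c: ${a.sameResult(c)}")
  }
}
```

In this toy model, the order-sensitive comparison treats `a` and `b` as different scans, while comparing canonical forms treats them as the same and still distinguishes `c`.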

How was this patch tested?

Added a new test for FileSourceScanExec.sameResult.

wzhfy · 2017-05-12T03:26:44Z

cc @cloud-fan @hvanhovell

wzhfy · 2017-05-12T03:36:13Z

also cc @gatorsmile

cloud-fan · 2017-05-12T04:49:10Z

LGTM

SparkQA · 2017-05-12T05:32:43Z

Test build #76843 has finished for PR 17959 at commit 9ec86ec.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait DataSourceScanExec extends LeafExecNode with CodegenSupport with PredicateHelper

gatorsmile · 2017-05-12T05:42:07Z

sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala

      None)
  }
+
+  private def canonicalizeFilters(filters: Seq[Expression], output: Seq[Attribute])


Add a function description?

gatorsmile · 2017-05-12T05:42:41Z

How about HiveTableScanExec?

…ould be the same after canonicalization

## What changes were proposed in this pull request?

Since `constraints` in `QueryPlan` is a set, the order of filters can differ. Usually this is ok because of canonicalization. However, in `FileSourceScanExec`, its data filters and partition filters are sequences, and their orders are not canonicalized. So `def sameResult` returns different results for different orders of data/partition filters. This leads to, e.g. different decision for `ReuseExchange`, and thus results in unstable performance.

## How was this patch tested?

Added a new test for `FileSourceScanExec.sameResult`.

Author: wangzhenhua <wangzhenhua@huawei.com>

Closes #17959 from wzhfy/canonicalizeFileSourceScanExec.

(cherry picked from commit c8da535)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan · 2017-05-12T05:55:55Z

Merged to master/2.2. Please send a follow-up PR to address @gatorsmile's comments, thanks!

wzhfy · 2017-05-12T06:56:49Z

@gatorsmile Right, thanks for pointing this out!

same result for FileSourceScanExec with different filter orders

9ec86ec

gatorsmile reviewed May 12, 2017

asfgit closed this in c8da535 May 12, 2017
