Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20718][SQL] FileSourceScanExec with different filter orders should be the same after canonicalization #17959

Closed
wants to merge 1 commit into from

Conversation

wzhfy
Copy link
Contributor

@wzhfy wzhfy commented May 12, 2017

What changes were proposed in this pull request?

Since constraints in QueryPlan is a set, the order of filters can differ. Usually this is ok because of canonicalization. However, in FileSourceScanExec, its data filters and partition filters are sequences, and their orders are not canonicalized. So def sameResult returns different results for different orders of data/partition filters. This leads to, e.g. different decision for ReuseExchange, and thus results in unstable performance.

How was this patch tested?

Added a new test for FileSourceScanExec.sameResult.

@wzhfy
Copy link
Contributor Author

wzhfy commented May 12, 2017

cc @cloud-fan @hvanhovell

@wzhfy
Copy link
Contributor Author

wzhfy commented May 12, 2017

also cc @gatorsmile

@cloud-fan
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented May 12, 2017

Test build #76843 has finished for PR 17959 at commit 9ec86ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait DataSourceScanExec extends LeafExecNode with CodegenSupport with PredicateHelper

None)
}

private def canonicalizeFilters(filters: Seq[Expression], output: Seq[Attribute])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a function description?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@gatorsmile
Copy link
Member

How about HiveTableScanExec?

asfgit pushed a commit that referenced this pull request May 12, 2017
…ould be the same after canonicalization

## What changes were proposed in this pull request?

Since `constraints` in `QueryPlan` is a set, the order of filters can differ. Usually this is ok because of canonicalization. However, in `FileSourceScanExec`, its data filters and partition filters are sequences, and their orders are not canonicalized. So `def sameResult` returns different results for different orders of data/partition filters. This leads to, e.g. different decision for `ReuseExchange`, and thus results in unstable performance.

## How was this patch tested?

Added a new test for `FileSourceScanExec.sameResult`.

Author: wangzhenhua <wangzhenhua@huawei.com>

Closes #17959 from wzhfy/canonicalizeFileSourceScanExec.

(cherry picked from commit c8da535)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in c8da535 May 12, 2017
@cloud-fan
Copy link
Contributor

merged to master/2.2, please send a follow-up PR to address @gatorsmile 's comments, thanks!

@wzhfy
Copy link
Contributor Author

wzhfy commented May 12, 2017

@gatorsmile Right, thanks for pointing this out!

robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017
…ould be the same after canonicalization

## What changes were proposed in this pull request?

Since `constraints` in `QueryPlan` is a set, the order of filters can differ. Usually this is ok because of canonicalization. However, in `FileSourceScanExec`, its data filters and partition filters are sequences, and their orders are not canonicalized. So `def sameResult` returns different results for different orders of data/partition filters. This leads to, e.g. different decision for `ReuseExchange`, and thus results in unstable performance.

## How was this patch tested?

Added a new test for `FileSourceScanExec.sameResult`.

Author: wangzhenhua <wangzhenhua@huawei.com>

Closes apache#17959 from wzhfy/canonicalizeFileSourceScanExec.
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
…ould be the same after canonicalization

## What changes were proposed in this pull request?

Since `constraints` in `QueryPlan` is a set, the order of filters can differ. Usually this is ok because of canonicalization. However, in `FileSourceScanExec`, its data filters and partition filters are sequences, and their orders are not canonicalized. So `def sameResult` returns different results for different orders of data/partition filters. This leads to, e.g. different decision for `ReuseExchange`, and thus results in unstable performance.

## How was this patch tested?

Added a new test for `FileSourceScanExec.sameResult`.

Author: wangzhenhua <wangzhenhua@huawei.com>

Closes apache#17959 from wzhfy/canonicalizeFileSourceScanExec.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants