[SPARK-20718][SQL] FileSourceScanExec with different filter orders should be the same after canonicalization #17959
also cc @gatorsmile
LGTM
Test build #76843 has finished for PR 17959 at commit
      None)
  }

  private def canonicalizeFilters(filters: Seq[Expression], output: Seq[Attribute])
Add a function description?
OK
How about
…ould be the same after canonicalization

## What changes were proposed in this pull request?

Since `constraints` in `QueryPlan` is a set, the order of filters can differ. Usually this is fine because of canonicalization. However, in `FileSourceScanExec`, the data filters and partition filters are sequences, and their order is not canonicalized, so `def sameResult` returns different results for different orders of data/partition filters. This can lead to, e.g., different decisions for `ReuseExchange`, and thus to unstable performance.

## How was this patch tested?

Added a new test for `FileSourceScanExec.sameResult`.

Author: wangzhenhua <wangzhenhua@huawei.com>

Closes #17959 from wzhfy/canonicalizeFileSourceScanExec.

(cherry picked from commit c8da535)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Merged to master/2.2. Please send a follow-up PR to address @gatorsmile's comments, thanks!
@gatorsmile Right, thanks for pointing this out!
## What changes were proposed in this pull request?

Since `constraints` in `QueryPlan` is a set, the order of filters can differ. Usually this is ok because of canonicalization. However, in `FileSourceScanExec`, its data filters and partition filters are sequences, and their orders are not canonicalized. So `def sameResult` returns different results for different orders of data/partition filters. This leads to, e.g. different decision for `ReuseExchange`, and thus results in unstable performance.

## How was this patch tested?

Added a new test for `FileSourceScanExec.sameResult`.
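
To illustrate the problem described above, here is a minimal, self-contained sketch that does not use Spark at all. The types `Expr`, `Attr`, `Lit`, `GreaterThan`, `IsNotNull`, and `Scan` are hypothetical stand-ins, and sorting filters by their string form is only an assumed way to get a deterministic order; it is not necessarily how this patch canonicalizes filters in `FileSourceScanExec`.

```scala
// Hypothetical stand-ins for illustration; these are NOT Spark's classes.
sealed trait Expr
final case class Attr(name: String) extends Expr
final case class Lit(value: Int) extends Expr
final case class GreaterThan(left: Expr, right: Expr) extends Expr
final case class IsNotNull(child: Expr) extends Expr

final case class Scan(table: String, dataFilters: Seq[Expr]) {
  // Order-sensitive comparison: Seq equality depends on element order,
  // which mirrors the behavior the pull request describes.
  def sameResultNaive(other: Scan): Boolean =
    table == other.table && dataFilters == other.dataFilters

  // Assumed canonical form: drop duplicates and sort by a deterministic key
  // so that the original filter order no longer matters.
  def canonicalized: Scan =
    copy(dataFilters = dataFilters.distinct.sortBy(_.toString))

  // Order-insensitive comparison via the canonical form.
  def sameResult(other: Scan): Boolean =
    canonicalized == other.canonicalized
}

object Demo {
  val f1: Expr = IsNotNull(Attr("a"))
  val f2: Expr = GreaterThan(Attr("a"), Lit(1))

  // Same filters, different order.
  val scanA: Scan = Scan("t", Seq(f1, f2))
  val scanB: Scan = Scan("t", Seq(f2, f1))
}
```

With this sketch, `Demo.scanA.sameResultNaive(Demo.scanB)` is false while `Demo.scanA.sameResult(Demo.scanB)` is true, which is the kind of difference that would change a reuse decision such as `ReuseExchange`.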