New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-34354][SQL] Fix failure when apply CostBasedJoinReorder on self-join #32027
Conversation
This reverts commit a7398ae949a04ec7fab575438a5583056ed7fbb7.
This reverts commit 682a2081926dbae97cf6c17b12d3866e92081525.
This reverts commit 3e33cf387afa956d000c1af5ac03ea1892569d67.
This reverts commit 8293531c398abfead078cc8186956c9beb60ea86.
� Conflicts: � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q16.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q16/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q32/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q33.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q33/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q39a.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q39a/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q39b.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q39b/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q41.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q41/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q44.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q44/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q58.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q58/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q71.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q71/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q92/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q94.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q94/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q95.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q95/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q67a.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q67a/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q70a.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q70a/explain.txt
This reverts commit 9211170
� Conflicts: � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q32/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q41.sf100/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q41/explain.txt � sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q92/explain.txt
This commit 2541fd6 fixes the test failure. Previously, the test failed due to:
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Test build #136819 has finished for PR 32027 at commit
|
I will leave it to @cloud-fan for a final look. |
thanks, merging to master! |
NOTE: This fix improved some TPCDS(sf=20) queries (q14a, q14b ...), , e.g., 180636ms=>141545ms in q14b. |
…f-join (#93) * [SPARK-34354][SQL] Fix failure when apply CostBasedJoinReorder on self-join This PR introduces a new analysis rule `DeduplicateRelations`, which deduplicates any duplicate relations in a plan first and then deduplicates conflicting attributes(which resued the `dedupRight` of `ResolveReferences`). `CostBasedJoinReorder` could fail when applying on self-join, e.g., ```scala // test in JoinReorderSuite test("join reorder with self-join") { val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === nameToAttr("t2.k-1-5"))) .select(nameToAttr("t1.v-1-10")) .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5"))) // this can fail Optimize.execute(plan.analyze) } ``` Besides, with the new rule `DeduplicateRelations`, we'd be able to enable some optimizations, e.g., LeftSemiAnti pushdown, redundant project removal, as reflects in updated unit tests. Added and updated unit tests. Closes apache#32027 from Ngone51/join-reorder-3. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * Undo tpcds-plan-stability changes * Resolve conflicts in list of rules in 'Resolution' batch * Resolve conflicts in #collectConflictPlans * Resolve conflicts in #dedupRight * Resolve conflicts in ResolveReferences#apply * Resolve conflicts around #newAliases and #findAliases * Resolve conflicts in AnalysisSuite * [SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`. ```scala val df = spark.range(2) df.join(df, df("id") <=> df("id")).show() ``` For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition: https://github.com/apache/spark/blob/537a49fc0966b0b289b67ac9c6ea20093165b0da/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala#L125 However, `DetectAmbiguousSelfJoin` can not apply this prediction due to the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tags info for the copied node. Fortunately, the query is still considered as non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left side plan and the reference is the same as the left side plan. However, this's not the expected behavior but only a coincidence. No. Updated a unit test Closes apache#31260 from Ngone51/fix-missing-tags. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This PR introduces a new analysis rule
DeduplicateRelations
, which deduplicates any duplicate relations in a plan first and then deduplicates conflicting attributes(which resued thededupRight
ofResolveReferences
).Why are the changes needed?
CostBasedJoinReorder
could fail when applying on self-join, e.g.,Besides, with the new rule
DeduplicateRelations
, we'd be able to enable some optimizations, e.g., LeftSemiAnti pushdown, redundant project removal, as reflects in updated unit tests.Does this PR introduce any user-facing change?
How was this patch tested?
Added and updated unit tests.