Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34354][SQL] Fix failure when apply CostBasedJoinReorder on self-join #32027

Closed
wants to merge 42 commits into from

Conversation

Ngone51
Copy link
Member

@Ngone51 Ngone51 commented Apr 1, 2021

What changes were proposed in this pull request?

This PR introduces a new analysis rule DeduplicateRelations, which deduplicates any duplicate relations in a plan first and then deduplicates conflicting attributes(which resued the dedupRight of ResolveReferences).

Why are the changes needed?

CostBasedJoinReorder could fail when applying on self-join, e.g.,

// test in JoinReorderSuite
test("join reorder with self-join") {
  val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === nameToAttr("t2.k-1-5")))
      .select(nameToAttr("t1.v-1-10"))
      .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5")))

    // this can fail
    Optimize.execute(plan.analyze)
}

Besides, with the new rule DeduplicateRelations, we'd be able to enable some optimizations, e.g., LeftSemiAnti pushdown, redundant project removal, as reflects in updated unit tests.

Does this PR introduce any user-facing change?

How was this patch tested?

Added and updated unit tests.

cloud-fan and others added 30 commits March 24, 2021 16:13
This reverts commit a7398ae949a04ec7fab575438a5583056ed7fbb7.
This reverts commit 682a2081926dbae97cf6c17b12d3866e92081525.
This reverts commit 3e33cf387afa956d000c1af5ac03ea1892569d67.
This reverts commit 8293531c398abfead078cc8186956c9beb60ea86.
� Conflicts:
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q16.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q16/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q32/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q33.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q33/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q39a.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q39a/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q39b.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q39b/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q41.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q41/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q44.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q44/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q58.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q58/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q70/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q71.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q71/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q92/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q94.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q94/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q95.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q95/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q67a.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q67a/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q70a.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/q70a/explain.txt
This reverts commit 9211170
� Conflicts:
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q32/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q41.sf100/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q41/explain.txt
�	sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q92/explain.txt
@Ngone51
Copy link
Member Author

Ngone51 commented Apr 1, 2021

cc @cloud-fan @maropu @HyukjinKwon

@Ngone51
Copy link
Member Author

Ngone51 commented Apr 1, 2021

This commit 2541fd6 fixes the test failure. Previously, the test failed due to:

FlatMapCoGroupsInPandas has a self flatMap. So, when DeduplicateRelations applies on FlatMapCoGroupsInPandas, both leftAttributes and rightAttributes would be rewrite after the right relation deduplicated. But that's not right for the leftAttributes as it should reference the left relation.

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41400/

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41400/

@github-actions github-actions bot added the SQL label Apr 1, 2021
@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Test build #136819 has finished for PR 32027 at commit 2541fd6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

I will leave it to @cloud-fan for a final look.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in f897cc2 Apr 2, 2021
@maropu
Copy link
Member

maropu commented Apr 7, 2021

NOTE: This fix improved some TPCDS(sf=20) queries (q14a, q14b ...), , e.g., 180636ms=>141545ms in q14b.

laflechejonathan pushed a commit to palantir/spark that referenced this pull request Sep 27, 2021
…f-join (#93)

* [SPARK-34354][SQL] Fix failure when apply CostBasedJoinReorder on self-join

This PR introduces a new analysis rule `DeduplicateRelations`, which deduplicates any duplicate relations in a plan first and then deduplicates conflicting attributes(which resued the `dedupRight` of `ResolveReferences`).

`CostBasedJoinReorder` could fail when applying on self-join, e.g.,

```scala
// test in JoinReorderSuite
test("join reorder with self-join") {
  val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === nameToAttr("t2.k-1-5")))
      .select(nameToAttr("t1.v-1-10"))
      .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5")))

    // this can fail
    Optimize.execute(plan.analyze)
}
```
Besides, with the new rule `DeduplicateRelations`, we'd be able to enable some optimizations, e.g., LeftSemiAnti pushdown, redundant project removal, as reflects in updated unit tests.

Added and updated unit tests.

Closes apache#32027 from Ngone51/join-reorder-3.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* Undo tpcds-plan-stability changes

* Resolve conflicts in list of rules in 'Resolution' batch

* Resolve conflicts in #collectConflictPlans

* Resolve conflicts in #dedupRight

* Resolve conflicts in ResolveReferences#apply

* Resolve conflicts around #newAliases and #findAliases

* Resolve conflicts in AnalysisSuite

* [SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance

Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`.

```scala
val df = spark.range(2)
df.join(df, df("id") <=> df("id")).show()
```

For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition:

https://github.com/apache/spark/blob/537a49fc0966b0b289b67ac9c6ea20093165b0da/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala#L125

However, `DetectAmbiguousSelfJoin` can not apply this prediction due to the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tags info for the copied node.

Fortunately, the query is still considered as non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left side plan and the reference is the same as the left side plan. However, this's not the expected behavior but only a coincidence.

No.

Updated a unit test

Closes apache#31260 from Ngone51/fix-missing-tags.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Co-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants