Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance #31260

Closed
wants to merge 1 commit into from

Conversation

Ngone51
Copy link
Member

@Ngone51 Ngone51 commented Jan 20, 2021

What changes were proposed in this pull request?

Call copyTagsFrom for the new node created by MultiInstanceRelation.newInstance().

Why are the changes needed?

val df = spark.range(2)
df.join(df, df("id") <=> df("id")).show()

For this query, it's supposed to be non-ambiguous join by the rule DetectAmbiguousSelfJoin because of the same attribute reference in the condition:

case Equality(AttrWithCast(a), AttrWithCast(b)) if a.sameRef(b) =>

However, DetectAmbiguousSelfJoin can not apply this prediction due to the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after MultiInstanceRelation.newInstance. That's why we should preserve the tags info for the copied node.

Fortunately, the query is still considered as non-ambiguous join because DetectAmbiguousSelfJoin only checks the left side plan and the reference is the same as the left side plan. However, this's not the expected behavior but only a coincidence.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Updated a unit test

@github-actions github-actions bot added the SQL label Jan 20, 2021
@Ngone51
Copy link
Member Author

Ngone51 commented Jan 20, 2021

@cloud-fan Please take a look, thanks!

cloud-fan pushed a commit that referenced this pull request Jan 20, 2021
…eRelation.newInstance

### What changes were proposed in this pull request?

Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`.

### Why are the changes needed?

```scala
val df = spark.range(2)
df.join(df, df("id") <=> df("id")).show()
```

For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition:

https://github.com/apache/spark/blob/537a49fc0966b0b289b67ac9c6ea20093165b0da/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala#L125

However, `DetectAmbiguousSelfJoin` can not apply this prediction due to the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tags info for the copied node.

Fortunately, the query is still considered as non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left side plan and the reference is the same as the left side plan. However, this's not the expected behavior but only a coincidence.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated a unit test

Closes #31260 from Ngone51/fix-missing-tags.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f498977)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jan 20, 2021
…eRelation.newInstance

### What changes were proposed in this pull request?

Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`.

### Why are the changes needed?

```scala
val df = spark.range(2)
df.join(df, df("id") <=> df("id")).show()
```

For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition:

https://github.com/apache/spark/blob/537a49fc0966b0b289b67ac9c6ea20093165b0da/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala#L125

However, `DetectAmbiguousSelfJoin` can not apply this prediction due to the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tags info for the copied node.

Fortunately, the query is still considered as non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left side plan and the reference is the same as the left side plan. However, this's not the expected behavior but only a coincidence.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated a unit test

Closes #31260 from Ngone51/fix-missing-tags.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f498977)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan cloud-fan closed this in f498977 Jan 20, 2021
@cloud-fan
Copy link
Contributor

thanks, merging to master/3.1/3.0!

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm2

@Ngone51
Copy link
Member Author

Ngone51 commented Jan 20, 2021

thanks all!!

@Ngone51 Ngone51 deleted the fix-missing-tags branch January 20, 2021 13:48
@SparkQA
Copy link

SparkQA commented Jan 20, 2021

Test build #134264 has finished for PR 31260 at commit 7cc0a98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Hi, All.

This breaks the compilation in branch-3.0. Could you fix that?

error] /__w/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala:28: object DetectAmbiguousSelfJoin is not a member of package org.apache.spark.sql.execution.analysis
[error] Note: class DetectAmbiguousSelfJoin exists, but it has no companion object.
[error] import org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin.LogicalPlanWithDatasetId
[error]                                                ^
[error] /__w/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala:274: not found: value LogicalPlanWithDatasetId
[error]           LogicalPlanWithDatasetId(_, leftId),
[error]           ^
[error] /__w/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala:275: not found: value LogicalPlanWithDatasetId
[error]           LogicalPlanWithDatasetId(_, rightId), _, _, _) =>
[error]           ^

@HyukjinKwon
Copy link
Member

Thanks, @dongjoon-hyun. I reverted from branch-3.0.

@dongjoon-hyun
Copy link
Member

Thanks, @HyukjinKwon !

@cloud-fan
Copy link
Contributor

@Ngone51 can you open a PR against 3.0? thanks!

skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
…eRelation.newInstance

### What changes were proposed in this pull request?

Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`.

### Why are the changes needed?

```scala
val df = spark.range(2)
df.join(df, df("id") <=> df("id")).show()
```

For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition:

https://github.com/apache/spark/blob/537a49fc0966b0b289b67ac9c6ea20093165b0da/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala#L125

However, `DetectAmbiguousSelfJoin` can not apply this prediction due to the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tags info for the copied node.

Fortunately, the query is still considered as non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left side plan and the reference is the same as the left side plan. However, this's not the expected behavior but only a coincidence.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated a unit test

Closes apache#31260 from Ngone51/fix-missing-tags.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
laflechejonathan pushed a commit to palantir/spark that referenced this pull request Sep 27, 2021
…f-join (#93)

* [SPARK-34354][SQL] Fix failure when apply CostBasedJoinReorder on self-join

This PR introduces a new analysis rule `DeduplicateRelations`, which deduplicates any duplicate relations in a plan first and then deduplicates conflicting attributes(which resued the `dedupRight` of `ResolveReferences`).

`CostBasedJoinReorder` could fail when applying on self-join, e.g.,

```scala
// test in JoinReorderSuite
test("join reorder with self-join") {
  val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === nameToAttr("t2.k-1-5")))
      .select(nameToAttr("t1.v-1-10"))
      .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5")))

    // this can fail
    Optimize.execute(plan.analyze)
}
```
Besides, with the new rule `DeduplicateRelations`, we'd be able to enable some optimizations, e.g., LeftSemiAnti pushdown, redundant project removal, as reflects in updated unit tests.

Added and updated unit tests.

Closes apache#32027 from Ngone51/join-reorder-3.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* Undo tpcds-plan-stability changes

* Resolve conflicts in list of rules in 'Resolution' batch

* Resolve conflicts in #collectConflictPlans

* Resolve conflicts in #dedupRight

* Resolve conflicts in ResolveReferences#apply

* Resolve conflicts around #newAliases and #findAliases

* Resolve conflicts in AnalysisSuite

* [SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance

Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`.

```scala
val df = spark.range(2)
df.join(df, df("id") <=> df("id")).show()
```

For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition:

https://github.com/apache/spark/blob/537a49fc0966b0b289b67ac9c6ea20093165b0da/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala#L125

However, `DetectAmbiguousSelfJoin` can not apply this prediction due to the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tags info for the copied node.

Fortunately, the query is still considered as non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left side plan and the reference is the same as the left side plan. However, this's not the expected behavior but only a coincidence.

No.

Updated a unit test

Closes apache#31260 from Ngone51/fix-missing-tags.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Co-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants