[SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance #31260

Ngone51 · 2021-01-20T09:31:31Z

What changes were proposed in this pull request?

Call copyTagsFrom for the new node created by MultiInstanceRelation.newInstance().

Why are the changes needed?

val df = spark.range(2)
df.join(df, df("id") <=> df("id")).show()

For this query, it's supposed to be non-ambiguous join by the rule DetectAmbiguousSelfJoin because of the same attribute reference in the condition:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala

Line 125 in 537a49f

case Equality(AttrWithCast(a), AttrWithCast(b)) if a.sameRef(b) =>

However, DetectAmbiguousSelfJoin can not apply this prediction due to the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after MultiInstanceRelation.newInstance. That's why we should preserve the tags info for the copied node.

Fortunately, the query is still considered as non-ambiguous join because DetectAmbiguousSelfJoin only checks the left side plan and the reference is the same as the left side plan. However, this's not the expected behavior but only a coincidence.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Updated a unit test

Ngone51 · 2021-01-20T09:32:18Z

@cloud-fan Please take a look, thanks!

…eRelation.newInstance ### What changes were proposed in this pull request? Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`. ### Why are the changes needed? ```scala val df = spark.range(2) df.join(df, df("id") <=> df("id")).show() ``` For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition: https://github.com/apache/spark/blob/537a49fc0966b0b289b67ac9c6ea20093165b0da/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala#L125 However, `DetectAmbiguousSelfJoin` can not apply this prediction due to the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tags info for the copied node. Fortunately, the query is still considered as non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left side plan and the reference is the same as the left side plan. However, this's not the expected behavior but only a coincidence. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated a unit test Closes #31260 from Ngone51/fix-missing-tags. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit f498977) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan · 2021-01-20T13:37:26Z

thanks, merging to master/3.1/3.0!

HyukjinKwon

lgtm2

Ngone51 · 2021-01-20T13:48:16Z

thanks all!!

SparkQA · 2021-01-20T14:26:04Z

Test build #134264 has finished for PR 31260 at commit 7cc0a98.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2021-01-21T00:45:25Z

Hi, All.

This breaks the compilation in branch-3.0. Could you fix that?

error] /__w/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala:28: object DetectAmbiguousSelfJoin is not a member of package org.apache.spark.sql.execution.analysis
[error] Note: class DetectAmbiguousSelfJoin exists, but it has no companion object.
[error] import org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin.LogicalPlanWithDatasetId
[error]                                                ^
[error] /__w/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala:274: not found: value LogicalPlanWithDatasetId
[error]           LogicalPlanWithDatasetId(_, leftId),
[error]           ^
[error] /__w/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala:275: not found: value LogicalPlanWithDatasetId
[error]           LogicalPlanWithDatasetId(_, rightId), _, _, _) =>
[error]           ^

HyukjinKwon · 2021-01-21T00:56:06Z

Thanks, @dongjoon-hyun. I reverted from branch-3.0.

dongjoon-hyun · 2021-01-21T01:13:18Z

Thanks, @HyukjinKwon !

cloud-fan · 2021-01-21T04:47:05Z

@Ngone51 can you open a PR against 3.0? thanks!

…eRelation.newInstance ### What changes were proposed in this pull request? Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`. ### Why are the changes needed? ```scala val df = spark.range(2) df.join(df, df("id") <=> df("id")).show() ``` For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition: https://github.com/apache/spark/blob/537a49fc0966b0b289b67ac9c6ea20093165b0da/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala#L125 However, `DetectAmbiguousSelfJoin` can not apply this prediction due to the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tags info for the copied node. Fortunately, the query is still considered as non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left side plan and the reference is the same as the left side plan. However, this's not the expected behavior but only a coincidence. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated a unit test Closes apache#31260 from Ngone51/fix-missing-tags. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…f-join (#93) * [SPARK-34354][SQL] Fix failure when apply CostBasedJoinReorder on self-join This PR introduces a new analysis rule `DeduplicateRelations`, which deduplicates any duplicate relations in a plan first and then deduplicates conflicting attributes(which resued the `dedupRight` of `ResolveReferences`). `CostBasedJoinReorder` could fail when applying on self-join, e.g., ```scala // test in JoinReorderSuite test("join reorder with self-join") { val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === nameToAttr("t2.k-1-5"))) .select(nameToAttr("t1.v-1-10")) .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5"))) // this can fail Optimize.execute(plan.analyze) } ``` Besides, with the new rule `DeduplicateRelations`, we'd be able to enable some optimizations, e.g., LeftSemiAnti pushdown, redundant project removal, as reflects in updated unit tests. Added and updated unit tests. Closes apache#32027 from Ngone51/join-reorder-3. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> * Undo tpcds-plan-stability changes * Resolve conflicts in list of rules in 'Resolution' batch * Resolve conflicts in #collectConflictPlans * Resolve conflicts in #dedupRight * Resolve conflicts in ResolveReferences#apply * Resolve conflicts around #newAliases and #findAliases * Resolve conflicts in AnalysisSuite * [SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`. ```scala val df = spark.range(2) df.join(df, df("id") <=> df("id")).show() ``` For this query, it's supposed to be non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because of the same attribute reference in the condition: https://github.com/apache/spark/blob/537a49fc0966b0b289b67ac9c6ea20093165b0da/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala#L125 However, `DetectAmbiguousSelfJoin` can not apply this prediction due to the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tags info for the copied node. Fortunately, the query is still considered as non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left side plan and the reference is the same as the left side plan. However, this's not the expected behavior but only a coincidence. No. Updated a unit test Closes apache#31260 from Ngone51/fix-missing-tags. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: Wenchen Fan <wenchen@databricks.com>

fix

7cc0a98

github-actions bot added the SQL label Jan 20, 2021

cloud-fan approved these changes Jan 20, 2021

View reviewed changes

cloud-fan closed this in f498977 Jan 20, 2021

HyukjinKwon reviewed Jan 20, 2021

View reviewed changes

Ngone51 deleted the fix-missing-tags branch January 20, 2021 13:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance #31260

[SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance #31260

Ngone51 commented Jan 20, 2021

Ngone51 commented Jan 20, 2021

cloud-fan commented Jan 20, 2021

HyukjinKwon left a comment

Ngone51 commented Jan 20, 2021

SparkQA commented Jan 20, 2021

dongjoon-hyun commented Jan 21, 2021

HyukjinKwon commented Jan 21, 2021

dongjoon-hyun commented Jan 21, 2021

cloud-fan commented Jan 21, 2021

[SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance #31260

[SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance #31260

Conversation

Ngone51 commented Jan 20, 2021

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Ngone51 commented Jan 20, 2021

cloud-fan commented Jan 20, 2021

HyukjinKwon left a comment

Choose a reason for hiding this comment

Ngone51 commented Jan 20, 2021

SparkQA commented Jan 20, 2021

dongjoon-hyun commented Jan 21, 2021

HyukjinKwon commented Jan 21, 2021

dongjoon-hyun commented Jan 21, 2021

cloud-fan commented Jan 21, 2021