
[SPARK-29682][SQL] Resolve conflicting attributes in Expand correctly #26441

Closed
wants to merge 8 commits

Conversation

imback82
Contributor

imback82 commented Nov 8, 2019

### What changes were proposed in this pull request?

This PR addresses issues where conflicting attributes in Expand are not correctly handled.

### Why are the changes needed?

```scala
val numsDF = Seq(1, 2, 3, 4, 5, 6).toDF("nums")
val cubeDF = numsDF.cube("nums").agg(max(lit(0)).as("agcol"))
cubeDF.join(cubeDF, "nums").show
```

fails with the following exception:

```
org.apache.spark.sql.AnalysisException:
Failure when resolving conflicting references in Join:
'Join Inner
:- Aggregate [nums#38, spark_grouping_id#36], [nums#38, max(0) AS agcol#35]
:  +- Expand [List(nums#3, nums#37, 0), List(nums#3, null, 1)], [nums#3, nums#38, spark_grouping_id#36]
:     +- Project [nums#3, nums#3 AS nums#37]
:        +- Project [value#1 AS nums#3]
:           +- LocalRelation [value#1]
+- Aggregate [nums#38, spark_grouping_id#36], [nums#38, max(0) AS agcol#58]
   +- Expand [List(nums#3, nums#37, 0), List(nums#3, null, 1)], [nums#3, nums#38, spark_grouping_id#36]
                                                                         ^^^^^^^
      +- Project [nums#3, nums#3 AS nums#37]
         +- Project [value#1 AS nums#3]
            +- LocalRelation [value#1]

Conflicting attributes: nums#38
```

As you can see from the above plan, `nums#38`, the output of `Expand` on the right side of the `Join`, should have been handled to produce a new attribute. Since the conflict is not resolved in `Expand`, the failure happens upstream at `Aggregate`. This PR adds handling for conflicting attributes in `Expand`.
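For context, a minimal sketch of what "producing a new attribute" means here (Catalyst internals, shown for illustration only): each `AttributeReference` carries an `ExprId`, and `newInstance()` copies the attribute with a fresh one, which is how the dedup logic breaks a conflict:

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.IntegerType

// nums#38 in the plan above is an AttributeReference whose ExprId is 38.
val nums = AttributeReference("nums", IntegerType)()
// Same name and type, but a freshly generated ExprId.
val fresh = nums.newInstance()
assert(fresh.name == nums.name && fresh.exprId != nums.exprId)
```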

### Does this PR introduce any user-facing change?

Yes, the previous example now shows the following output:

```
+----+-----+-----+
|nums|agcol|agcol|
+----+-----+-----+
|   1|    0|    0|
|   6|    0|    0|
|   4|    0|    0|
|   2|    0|    0|
|   5|    0|    0|
|   3|    0|    0|
+----+-----+-----+
```

### How was this patch tested?

Added new unit test.

@imback82
Contributor Author

imback82 commented Nov 8, 2019

cc: @cloud-fan @viirya

```scala
cubeDF.select("nums").distinct
  .join(group0, Seq("nums"), "inner")
  .join(group1, Seq("nums"), "inner"),
Seq.empty)
```
Member


Why is this test query different from the one in the PR description? I think the simpler query in the description is better.

@SparkQA

SparkQA commented Nov 9, 2019

Test build #113476 has finished for PR 26441 at commit b439d62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class DateTimeConstants
  • public final class CalendarInterval implements Serializable, Comparable<CalendarInterval>
  • case class LocalShuffleReaderExec(child: SparkPlan) extends UnaryExecNode
  • class ContinuousRecordEndpoint(buckets: Seq[Seq[UnsafeRow]], lock: Object)

@SparkQA

SparkQA commented Nov 10, 2019

Test build #113506 has finished for PR 26441 at commit d815362.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class MergeIntoTable(
  • sealed abstract class MergeAction(
  • case class DeleteAction(condition: Option[Expression]) extends MergeAction(condition)
  • case class UpdateAction(
  • case class InsertAction(
  • case class Assignment(key: Expression, value: Expression) extends Expression with Unevaluable

```scala
(oldVersion,
  oldVersion.copy(
    projectList = newNamedExpression(projectList, conflictingAttributes)))

case oldVersion @ Aggregate(_, aggregateExpressions, _)
```
Member


It seems that the query in the PR description works well with the fix below, too:

```scala
case oldVersion @ Aggregate(_, aggregateExpressions, _)
    if AttributeSet(aggregateExpressions.map(_.toAttribute))
      .intersect(conflictingAttributes).nonEmpty =>
  (oldVersion, oldVersion.copy(
    aggregateExpressions = aggregateExpressions.map(_.newInstance())))
```

Could we fix this issue in an easier way than the current fix?

Contributor Author


> Could we fix this issue in an easier way than the current fix?

I don't think it is robust enough. For example, the following test fails with the suggested fix:

```
[info] - [SPARK-6231] join - self join auto resolve ambiguity *** FAILED *** (251 milliseconds)
[info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: Resolved attribute(s) key#4619 missing from key#4518,value#4519 in operator !Aggregate [key#4619], [key#4619, sum(cast(key#4619 as bigint)) AS sum(key)#4620L]. Attribute(s) with the same name appear in the operation: key. Please check if the right attribute(s) are used.;;
[info]   Join Inner, (key#4518 = key#4518)
[info]   :- Aggregate [key#4518], [key#4518, count(1) AS count(1)#4610L]
[info]   :  +- Project [_1#4513 AS key#4518, _2#4514 AS value#4519]
[info]   :     +- LocalRelation [_1#4513, _2#4514]
[info]   +- !Aggregate [key#4619], [key#4619, sum(cast(key#4619 as bigint)) AS sum(key)#4620L]
[info]      +- Project [_1#4513 AS key#4518, _2#4514 AS value#4519]
[info]         +- LocalRelation [_1#4513, _2#4514]
```

Member


Ur, I see... In the query you showed in the PR description, it seems the dedup logic doesn't work in the Expand node (^^^^^ below):

```
'Join Inner
:- Aggregate [nums#121, spark_grouping_id#119], [nums#121, max(0) AS agcol#118]
:  +- Expand [List(nums#79, nums#120, 0), List(nums#79, null, 1)], [nums#79, nums#121, spark_grouping_id#119]
:     +- Project [nums#79, nums#79 AS nums#120]
:        +- Project [value#76 AS nums#79]
:           +- LocalRelation [value#76]
+- Aggregate [nums#121, spark_grouping_id#119], [nums#121, max(0) AS agcol#124]
              ^^^^^^^^
   +- Expand [List(nums#79, nums#120, 0), List(nums#79, null, 1)], [nums#79, nums#121, spark_grouping_id#119]
                                                                             ^^^^^^^^
      +- Project [nums#79, nums#79 AS nums#120]
         +- Project [value#76 AS nums#79]
            +- LocalRelation [value#76]
```

So, we might be able to fix this dedup issue by adding an entry for `Expand` in `dedupRight` like this?

```scala
case oldVersion @ Expand(_, output, _)
    if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty =>
  (oldVersion, oldVersion.copy(output = output.map(_.newInstance())))
```

@cloud-fan
Contributor

It's better to explain why the bug happens in the PR description. I don't understand the current fix. Just FYI on why we only handle aliases in `Project`: the self-join dedup logic tries to find the root node that causes the conflicts. Sometimes it's an alias in a `Project`, sometimes it's a leaf node. For plain attributes in a `Project`, there must be other nodes under the `Project` that cause the conflicts.
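For reference, a hedged sketch of what that `Project` case in `dedupRight` looked like around this time (from memory; `findAliases` and `newAliases` are the Analyzer's helpers, and the exact shape may differ):

```scala
// Only conflicting aliases get fresh ExprIds here. A plain attribute
// reference in the projectList is never the root of a conflict; some
// node below the Project must have produced it.
case oldVersion @ Project(projectList, _)
    if findAliases(projectList).intersect(conflictingAttributes).nonEmpty =>
  (oldVersion, oldVersion.copy(projectList = newAliases(projectList)))
```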

@SparkQA

SparkQA commented Nov 11, 2019

Test build #113565 has finished for PR 26441 at commit 7a295cd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imback82
Contributor Author

Thanks @cloud-fan. I updated the PR description, and please let me know if you need additional info. Also, let me know if updating Project is not necessary.

@imback82
Contributor Author

retest this please

@maropu
Member

maropu commented Nov 11, 2019

Can this issue happen with operators other than the cube aggregate shown above?

@SparkQA

SparkQA commented Nov 12, 2019

Test build #113601 has finished for PR 26441 at commit 7a295cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

@imback82 thanks for updating the PR description! I see the problem now. The `nums#38` attribute is conflicting and it comes from `Expand`. Ideally we should let `Expand` re-generate its output attributes, but `Expand` doesn't clearly distinguish the output attributes inherited from its child from the output attributes it produces itself.

I think we should change `Expand` to follow `Generate`:

```scala
case class Expand(..., additionalOutput: Seq[Attribute]) {
  override def producedAttributes: AttributeSet = AttributeSet(additionalOutput)
  def output = child.output ++ additionalOutput
}
```

And also follow how we dedup `Generate`:

```scala
case oldVersion: Generate
    if oldVersion.producedAttributes.intersect(conflictingAttributes).nonEmpty =>
  val newOutput = oldVersion.generatorOutput.map(_.newInstance())
  (oldVersion, oldVersion.copy(generatorOutput = newOutput))
```

@imback82
Contributor Author

Thanks @cloud-fan! Your suggested solution of updating Expand works as expected. However, I do not think the following

```scala
def output = child.output ++ additionalOutput
```

is always true.

For example,

```
Expand [List(nums#3, nums#37, 0), List(nums#3, null, 1)], [nums#3, nums#38, spark_grouping_id#36]
  +- Project [nums#3, nums#3 AS nums#37]
```

`nums#37` is an output of the child, but not an output of `Expand`.

So instead of adding additionalOutput to Expand, I just did the following:

```scala
case oldVersion: Expand
    if oldVersion.producedAttributes.intersect(conflictingAttributes).nonEmpty =>
  val producedAttributes = oldVersion.producedAttributes
  val newOutput = oldVersion.output.map { attr =>
    if (producedAttributes.contains(attr)) attr.newInstance() else attr
  }
  (oldVersion, oldVersion.copy(output = newOutput))
```

where Expand.producedAttributes is updated as:

```scala
override def producedAttributes: AttributeSet = AttributeSet(output diff child.output)
```
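As a sanity check of that definition against the plan above (a standalone sketch, with strings standing in for Catalyst attributes): the output is `[nums#3, nums#38, spark_grouping_id#36]` and the child's output is `[nums#3, nums#37]`, so the diff keeps exactly the attributes `Expand` itself produces:

```scala
// The #n suffixes stand for ExprIds; `diff` drops the attribute
// inherited from the child (nums#3) and keeps the produced ones.
val output      = Seq("nums#3", "nums#38", "spark_grouping_id#36")
val childOutput = Seq("nums#3", "nums#37")
assert((output diff childOutput) == Seq("nums#38", "spark_grouping_id#36"))
```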

Let me know if this approach is fine instead of updating Expand.

@cloud-fan
Contributor

Sounds good!

@viirya
Member

viirya commented Nov 13, 2019

+1 for the proposed fix. And please also update the PR title and description accordingly.

imback82 changed the title [SPARK-29682][SQL] Resolve conflicting references in aggregate expressions [SPARK-29682][SQL] Resolve conflicting references in Expand correctly Nov 14, 2019
imback82 changed the title [SPARK-29682][SQL] Resolve conflicting references in Expand correctly [SPARK-29682][SQL] Resolve conflicting attributes in Expand correctly Nov 14, 2019
@SparkQA

SparkQA commented Nov 14, 2019

Test build #113739 has finished for PR 26441 at commit 11927c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 14, 2019

Test build #113740 has finished for PR 26441 at commit e743133.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan closed this in e46e487 Nov 14, 2019
cloud-fan pushed a commit that referenced this pull request Nov 14, 2019
Closes #26441 from imback82/spark-29682.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e46e487)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Contributor

thanks, merging to master/2.4!

@imback82
Contributor Author

thanks @cloud-fan @maropu @viirya @dongjoon-hyun for review and help!

@dongjoon-hyun
Member

Thank you so much!
