Skip to content

Conversation

@marmbrus
Copy link
Contributor

We need to handle ambiguous exprIds that are produced by new aliases as well as those caused by leaf nodes (MultiInstanceRelation).

Attempting to fix this revealed a bug in equals for Alias as these objects were comparing equal even when the expression ids did not match. Additionally, LocalRelation did not correctly provide statistics, and some tests in catalyst and hive were not using the helper functions for comparing plans.

Based on #4991 by @chenghao-intel

@SparkQA
Copy link

SparkQA commented Mar 17, 2015

Test build #28700 has finished for PR 5062 at commit 0b9c687.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marmbrus Only handle the Project probably not enough. For example:

  test("self join with aliases#1") {
    Seq(1).map(i => (i, i.toString)).toDF("int", "str").groupBy($"str").max("int").as("str").registerTempTable("df")
    Seq(1).map(i => (i, i.toString)).toDF("int", "str").groupBy($"str").max("int").as("str").explain(true)

    checkAnswer(
      sql(
        """
          |SELECT x.str, y.str
          |FROM df x JOIN df y ON x.str = y.str
        """.stripMargin),
      Row(1, 1) :: Nil)
  }

Output in console like

'Subquery str
 'Aggregate ['str], ['str,MAX(int#48) AS MAX(int)#50]
  Project [_1#46 AS int#48,_2#47 AS str#49]
   LocalRelation [_1#46,_2#47], [[1,1]]


Aggregate [str#49], [str#49,MAX(int#48) AS MAX(int)#50]
 Project [_1#46 AS int#48,_2#47 AS str#49]
  LocalRelation [_1#46,_2#47], [[1,1]]


Aggregate [str#49], [str#49,MAX(int#48) AS MAX(int)#50]
 LocalRelation [int#48,str#49], [[1,1]]


Aggregate false, [str#49], [str#49,MAX(PartialMax#53) AS MAX(int)#50]
 Exchange (HashPartitioning [str#49], 5)
  Aggregate true, [str#49], [str#49,MAX(int#48) AS PartialMax#53]
   LocalTableScan [int#48,str#49], [[1,1]]

Code Generation: false
== RDD ==

next on empty iterator
java.util.NoSuchElementException: next on empty iterator
    at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
    at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
    at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
    at scala.collection.IterableLike$class.head(IterableLike.scala:91)
    at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$head(ArrayBuffer.scala:47)
    at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a doable solution is add a Project node on top of cached plan or temporal table when lookupRelation.

@chenghao-intel
Copy link
Contributor

LGTM, and we can keep adding more rule if we missed one for the self join in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we ignore the qualifiers in the equal test? Will that cause another bug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@SparkQA
Copy link

SparkQA commented Mar 18, 2015

Test build #28756 has finished for PR 5062 at commit 8038a36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Copy link
Contributor Author

Thanks for looking this over and helping with the fix! Merging to master and 1.3

asfgit pushed a commit that referenced this pull request Mar 18, 2015
…ases

We need to handle ambiguous `exprId`s that are produced by new aliases as well as those caused by leaf nodes (`MultiInstanceRelation`).

Attempting to fix this revealed a bug in `equals` for `Alias` as these objects were comparing equal even when the expression ids did not match. Additionally, `LocalRelation` did not correctly provide statistics, and some tests in `catalyst` and `hive` were not using the helper functions for comparing plans.

Based on #4991 by chenghao-intel

Author: Michael Armbrust <michael@databricks.com>

Closes #5062 from marmbrus/selfJoins and squashes the following commits:

8e9b84b [Michael Armbrust] check qualifier too
8038a36 [Michael Armbrust] handle aggs too
0b9c687 [Michael Armbrust] fix more tests
c3c574b [Michael Armbrust] revert change.
725f1ab [Michael Armbrust] add statistics
a925d08 [Michael Armbrust] check for conflicting attributes in join resolution
b022ef7 [Michael Armbrust] Handle project aliases.
d8caa40 [Michael Armbrust] test case: SPARK-6247
f9c67c2 [Michael Armbrust] Check for duplicate attributes in join resolution.
898af73 [Michael Armbrust] Fix Alias equality.

(cherry picked from commit 3579003)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 3579003 Mar 18, 2015
@SparkQA
Copy link

SparkQA commented Mar 18, 2015

Test build #28761 has finished for PR 5062 at commit 8e9b84b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus marmbrus deleted the selfJoins branch August 3, 2015 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants