[SPARK-12828][SQL]add natural join support #10762

adrian-wang · 2016-01-15T01:44:55Z

Jira:
https://issues.apache.org/jira/browse/SPARK-12828

SparkQA · 2016-01-15T02:00:13Z

Test build #49434 has finished for PR 10762 at commit 91a4a87.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class NaturalJoin(

rxin · 2016-01-15T02:02:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala

@@ -179,10 +180,15 @@ object SqlParser extends AbstractSparkSQLParser with DataTypeParser {
    )

  protected lazy val joinedRelation: Parser[LogicalPlan] =
-    relationFactor ~ rep1(joinType.? ~ (JOIN ~> relationFactor) ~ joinConditions.?) ^^ {
+    relationFactor ~


note this is going away

Yes, I will update the hive parser today, too.

SparkQA · 2016-01-15T07:15:29Z

Test build #49445 has finished for PR 10762 at commit 15b96ba.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-01-15T07:16:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala

@@ -144,6 +147,35 @@ case class Join(
    }
  }

+  def outerProjectList: Seq[NamedExpression] = {


what is this doing?

For select * from a natural join, we need to use a Project to get rid of redundant columns.

In MySQL, select * from a natural join would contain only one row if both sides have a row of the same name, but users can still use something like t1.a or t2.a to explicitly reference the column. Do you think we should implement it exactly the same way? I'm afraid it would be too complicated a change, but I can do that if necessary.

did you mean "column" instead of row? columns of the same name should only appear once.

also this code should go into the analyzer.

yes I mean "column"...
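To make the expected behaviour concrete, here is a minimal sketch in plain Scala with no Spark dependency. `Table`, `commonColumns`, `joinConditions`, and `starOutput` are hypothetical names invented for this illustration, not code from the patch, and the output ordering (shared columns first) is an assumption based on the usual convention.

```scala
object NaturalJoinSketch {
  // Hypothetical, simplified table: a name plus its column names in order.
  final case class Table(name: String, columns: Seq[String])

  // Column names present on both sides, in left-side order; these act as join keys.
  def commonColumns(l: Table, r: Table): Seq[String] =
    l.columns.filter(c => r.columns.contains(c))

  // Equality predicates implied by the natural join, rendered as SQL-like text.
  def joinConditions(l: Table, r: Table): Seq[String] =
    commonColumns(l, r).map(c => s"${l.name}.$c = ${r.name}.$c")

  // Columns returned by `select *`: each shared column once, then the
  // columns unique to the left side, then those unique to the right side.
  def starOutput(l: Table, r: Table): Seq[String] = {
    val common = commonColumns(l, r)
    val leftOnly = l.columns.filterNot(c => common.contains(c))
    val rightOnly = r.columns.filterNot(c => common.contains(c))
    common ++ leftOnly ++ rightOnly
  }
}
```

Both the conditions and the projection depend on the column names of each side, which is why this logic needs resolved schemas rather than just parsed table names.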

rxin · 2016-01-15T07:18:55Z

Thanks for making the change -- after looking at the change I think this new version has too many changes (it is fairly ugly that we need to update a lot of files because of the flag). Are there ways to simplify this?

When I suggested not introducing a new operator, I was thinking about just changing the join type to something like

case class NaturalJoin(tpe: JoinType) extends JoinType
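A rough sketch of how such a wrapper type could look, using a simplified stand-in hierarchy rather than Spark's actual `JoinType` definitions. The enclosing object and the `underlying` helper are hypothetical and exist only for this illustration.

```scala
object JoinTypeSketch {
  // Simplified stand-in for a join type hierarchy; not Spark's real classes.
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object FullOuter extends JoinType

  // Wraps the underlying join type, so the existing Join operator can carry
  // "natural" as part of its type instead of needing a new plan node or flag.
  final case class NaturalJoin(tpe: JoinType) extends JoinType

  // Hypothetical helper: strip the NaturalJoin wrapper once the join keys
  // and project list have been worked out.
  def underlying(jt: JoinType): JoinType = jt match {
    case NaturalJoin(inner) => underlying(inner)
    case other => other
  }
}
```

With this shape, code that pattern matches on the plain join types is unaffected until a `NaturalJoin` value is actually introduced.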

adrian-wang · 2016-01-15T07:49:22Z

@rxin, Thanks for your time, I'll draft another version accordingly.

SparkQA · 2016-01-15T09:37:48Z

Test build #49455 has finished for PR 10762 at commit 41f50cf.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class NarrowReferenceHolder(child: LogicalPlan) extends UnaryNode

SparkQA · 2016-01-20T08:14:36Z

Test build #49768 has finished for PR 10762 at commit 25a7226.

This patch fails to build.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-01-20T11:54:20Z

Test build #49776 has finished for PR 10762 at commit 05ab0e9.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-01-21T04:06:58Z

Test build #49845 has finished for PR 10762 at commit afb60a5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

adrian-wang · 2016-01-21T05:56:53Z

retest this please.

SparkQA · 2016-01-21T07:49:35Z

Test build #49859 has finished for PR 10762 at commit afb60a5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-01-22T02:58:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

@@ -104,6 +105,9 @@ trait CheckAnalysis {
              s"filter expression '${f.condition.prettyString}' " +
                s"of type ${f.condition.dataType.simpleString} is not a boolean.")

+          case j @ Join(_, _, NaturalJoin(_), _) =>
+            failAnalysis(s"natural join not resolved.")


when will this happen? is this a useful error message to show at all?

This should never happen, except when natural joins remain unresolved after 100 iterations.

ok so this can happen - maybe we should print something more than just a message to tell which natural join cannot be resolved?

If this happens, it must be because the join operator itself has some problem, not because of the NaturalJoin type. I think we should remove this case and let other checking rules tell users what's wrong.

rxin · 2016-01-22T03:08:25Z

Actually now i think about it - maybe we should just have a constructor for Join that the parser calls, and the constructor just creates a normal join with the right project list and conditions. Seems like it'd be a lot simpler.

adrian-wang · 2016-01-22T04:39:08Z

When the parser calls the constructor, how can we get the schema of tables? We need schema to build project list and conditions.

rxin · 2016-01-23T20:58:09Z

@hvanhovell can you help review the parser changes?

cloud-fan · 2016-01-28T21:07:02Z

one comment: #10762 (comment)

adrian-wang · 2016-01-29T02:31:42Z

Actually, MySQL and Oracle do not support normal full outer join either.
PostgreSQL does support natural full outer join: http://www.postgresql.org/docs/9.1/static/queries-table-expressions.html
I think we should support natural full outer join since we support full outer join.

SparkQA · 2016-01-29T05:31:42Z

Test build #50342 has finished for PR 10762 at commit cb8af0e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-01-29T09:29:58Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveNaturalJoinSuite.scala

+class ResolveNaturalJoinSuite extends AnalysisTest {
+  import org.apache.spark.sql.catalyst.analysis.TestRelations._
+
+  val t1 = testRelation2.select('a, 'b)


make all of them lazy; otherwise an exception thrown here will make the entire suite disappear from jenkins reporting.
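The difference can be sketched in plain Scala without any test framework. `brokenFixture`, `EagerSuite`, and `LazySuite` are hypothetical names for this illustration. A plain `val` in a class body is evaluated while the constructor runs, whereas a `lazy val` is evaluated on first access.

```scala
object LazyValSketch {
  // Hypothetical fixture builder that fails, standing in for a broken test relation.
  def brokenFixture(): Seq[String] =
    throw new IllegalStateException("fixture could not be built")

  // Eager: the exception is thrown while the constructor runs,
  // so the suite object can never be created.
  class EagerSuite {
    val t1: Seq[String] = brokenFixture()
  }

  // Lazy: construction succeeds; the exception surfaces only when t1 is
  // first used, which would be inside an individual test body.
  class LazySuite {
    lazy val t1: Seq[String] = brokenFixture()
  }
}
```

In the lazy variant a failure is attributed to the test that first touches the value, rather than preventing the suite from being instantiated at all.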

SparkQA · 2016-01-29T09:48:18Z

Test build #50369 has finished for PR 10762 at commit 7e7de89.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ResolveNaturalJoinSuite extends AnalysisTest

SparkQA · 2016-01-29T10:14:31Z

Test build #50372 has finished for PR 10762 at commit 2cf5629.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-29T11:56:43Z

Test build #50375 has finished for PR 10762 at commit 192f8bf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-01-29T17:51:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala

    childrenResolved &&
      expressions.forall(_.resolved) &&
      selfJoinResolved &&
      condition.forall(_.dataType == BooleanType)
  }
+
+  // Joins are only resolved if they don't introduce ambiguous expression ids.


This comment should belong to resolvedExceptNatural?

This is original comment from the old Join.

yup, this is the original comment for resolved, but now we renamed resolved to resolvedExceptNatural right?

SparkQA · 2016-02-01T03:58:58Z

Test build #50470 has finished for PR 10762 at commit 12de061.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-02-01T05:16:30Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveNaturalJoinSuite.scala

+    AttributeReference("a", StringType, nullable = false)(),
+    AttributeReference("b", StringType, nullable = false)(),
+    AttributeReference("c", StringType, nullable = false)())
+  lazy val tt1 = testRelation0.select('a, 'b)


I think we need some better names than tt1, aa, trueB, etc.....

SparkQA · 2016-02-01T07:02:52Z

Test build #50476 has finished for PR 10762 at commit 6aa2a79.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-02-01T07:13:54Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveNaturalJoinSuite.scala

+class ResolveNaturalJoinSuite extends AnalysisTest {
+  import org.apache.spark.sql.catalyst.analysis.TestRelations._
+
+  lazy val t1 = testRelation2.select('a, 'b)


the naming here is super confusing. can you improve it? some suggestions:

Define relations inline here, and don't use TestRelations

Name relations r1, r2, r3 ...

Prefix attribute names with relation names, e.g. r1_a, r1_b, ...

cloud-fan · 2016-02-01T08:31:28Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveNaturalJoinSuite.scala

+import org.apache.spark.sql.types.StringType
+
+class ResolveNaturalJoinSuite extends AnalysisTest {
+  lazy val r1 = LocalRelation(


since we need each field anyway, how about we define the fields first and use them to compose relation? i.e.

val a = 'a.string
val b = 'b.string
....
val f = 'f.string
val a_nonNullable = a.withNullability(false)
...
val r1 = LocalRelation(a, b, c)
val r2 = LocalRelation(a, b)
val r3 = LocalRelation(d, e, f)
...

tips: using the dsl can make the tests much more readable, see an example here

SparkQA · 2016-02-01T09:23:27Z

Test build #50480 has finished for PR 10762 at commit 42f9d2c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-01T11:44:55Z

Test build #50487 has finished for PR 10762 at commit 307cb5e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

adrian-wang · 2016-02-04T05:01:09Z

@cloud-fan any more comments?

rxin · 2016-02-04T05:04:36Z

LGTM - going to merge it.

rxin · 2016-02-04T05:28:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+        // we should only keep unique columns(depends on joinType) for joinCols
+        val projectList = joinType match {
+          case LeftOuter =>
+            leftKeys ++ lUniqueOutput ++ rUniqueOutput.map(_.withNullability(true))


why are we switching the ordering of output columns?

nvm i figured it out.
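For readers following along, the shape of the quoted `LeftOuter` case can be sketched in plain Scala. `Attr` and `leftOuterProjectList` are hypothetical, simplified stand-ins and not the analyzer code. Only the left outer case is shown because it is the only one quoted above.

```scala
object LeftOuterProjectSketch {
  // Hypothetical, simplified attribute: a name plus a nullability flag.
  final case class Attr(name: String, nullable: Boolean) {
    def withNullability(n: Boolean): Attr = copy(nullable = n)
  }

  // Project list for a natural left outer join, mirroring the quoted line:
  // join keys from the left side, then left-only columns, then right-only
  // columns marked nullable, since unmatched left rows have no right values.
  def leftOuterProjectList(left: Seq[Attr], right: Seq[Attr]): Seq[Attr] = {
    val joinNames = left.map(_.name).filter(n => right.exists(_.name == n))
    val leftKeys = left.filter(a => joinNames.contains(a.name))
    val lUniqueOutput = left.filterNot(a => joinNames.contains(a.name))
    val rUniqueOutput = right.filterNot(a => joinNames.contains(a.name))
    leftKeys ++ lUniqueOutput ++ rUniqueOutput.map(_.withNullability(true))
  }
}
```

Placing the join keys first means the output order differs from simply concatenating the two sides, which is the reordering the question refers to.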

This is a small addendum to #10762 to make the code more robust against future changes.

Author: Reynold Xin <rxin@databricks.com>

Closes #11070 from rxin/SPARK-12828-natural-join.

rxin reviewed Jan 15, 2016

adrian-wang force-pushed the naturaljoin branch from 15b96ba to 41f50cf Compare January 15, 2016 09:18

adrian-wang changed the title ~~[SPARK-12828][SQL]add natural join support~~ [WIP][SPARK-12828][SQL]add natural join support Jan 15, 2016

adrian-wang force-pushed the naturaljoin branch from 41f50cf to 25a7226 Compare January 20, 2016 07:37

adrian-wang added 2 commits January 20, 2016 18:46

add natural join support

e0fc72a

add in new parser, and use the old Join node, and add test for Analyzer

b5611d5

adrian-wang force-pushed the naturaljoin branch from 05ab0e9 to afb60a5 Compare January 21, 2016 02:47

adrian-wang added 2 commits January 20, 2016 18:48

fix compile

2041382

fix df

afb60a5

adrian-wang changed the title ~~[WIP][SPARK-12828][SQL]add natural join support~~ [SPARK-12828][SQL]add natural join support Jan 21, 2016

rxin reviewed Jan 22, 2016

address comments

cb8af0e

in a separate suite

7e7de89

rxin reviewed Jan 29, 2016

lazy val

2cf5629

Update CheckAnalysis.scala

192f8bf

cloud-fan reviewed Jan 29, 2016

improve some doc

12de061

use withTempTable

b3a6c32

cloud-fan reviewed Feb 1, 2016

rename val

6aa2a79

rxin reviewed Feb 1, 2016

rename val

42f9d2c

cloud-fan reviewed Feb 1, 2016

use dsl

307cb5e

asfgit closed this in 0f81318 Feb 4, 2016

rxin reviewed Feb 4, 2016

rxin mentioned this pull request Feb 4, 2016

[SPARK-12828][SQL] Natural join follow-up #11070

Closed
