[SPARK-12828][SQL]add natural join support #10762

Closed
wants to merge 17 commits into from

adrian-wang
Contributor


SparkQA commented Jan 15, 2016

Test build #49434 has finished for PR 10762 at commit 91a4a87.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class NaturalJoin(

@@ -179,10 +180,15 @@ object SqlParser extends AbstractSparkSQLParser with DataTypeParser {
)

protected lazy val joinedRelation: Parser[LogicalPlan] =
relationFactor ~ rep1(joinType.? ~ (JOIN ~> relationFactor) ~ joinConditions.?) ^^ {
relationFactor ~
Contributor

note this is going away

Contributor Author

Yes, I will update the hive parser today, too.

Contributor Author

Done.


SparkQA commented Jan 15, 2016

Test build #49445 has finished for PR 10762 at commit 15b96ba.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -144,6 +147,35 @@ case class Join(
}
}

def outerProjectList: Seq[NamedExpression] = {
Contributor

what is this doing?

Contributor Author

For select * from a natural join, we need to use a Project to get rid of redundant columns.

In MySQL, select * from a natural join contains only one row if both sides have a row of the same name, but users can still use something like t1.a or t2.a to reference the column explicitly. Do you think we should implement exactly the same behavior? I'm afraid it would be too complicated a change, but I can do that if necessary.

Contributor

did you mean "column" instead of row? columns of the same name should only appear once.

also this code should go into the analyzer.

Contributor Author

yes I mean "column"...
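
As a rough illustration of the point in this thread (columns with the same name should appear only once in the output of a natural join), here is a minimal sketch in plain Scala. It models a relation as an ordered list of column names only; `NaturalJoinColumns` and `outputColumns` are hypothetical names for illustration and are not part of Spark or of this patch.

```scala
// Hypothetical helper, not Spark code: a relation is modeled as its ordered column names.
object NaturalJoinColumns {
  /** Column names of `left NATURAL JOIN right`: shared names once, then the remaining columns. */
  def outputColumns(left: Seq[String], right: Seq[String]): Seq[String] = {
    // Names present on both sides become the join columns (kept in left-side order).
    val joinCols = left.filter(c => right.contains(c))
    // Everything else is unique to one side and is passed through unchanged.
    val leftOnly = left.filterNot(c => joinCols.contains(c))
    val rightOnly = right.filterNot(c => joinCols.contains(c))
    joinCols ++ leftOnly ++ rightOnly
  }
}
```

With `left = Seq("a", "b", "c")` and `right = Seq("a", "d")`, the shared column `a` is emitted once, followed by the columns unique to each side.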

Contributor

rxin commented Jan 15, 2016

Thanks for making the change -- after looking at the change I think this new version has too many changes (it is fairly ugly that we need to update a lot of files because of the flag). Are there ways to simplify this?

When I suggested not introducing a new operator, I was thinking about just changing the join type to something like

case class NaturalJoin(tpe: JoinType) extends JoinType
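
To make the suggestion concrete, here is a self-contained sketch of a join-type hierarchy in which "natural" is a wrapper around an ordinary join type rather than a separate operator. Only the `NaturalJoin(tpe: JoinType) extends JoinType` line comes from the comment above; the other case objects and the `describe` helper are assumptions made for illustration and may differ from the actual Catalyst definitions.

```scala
// Simplified stand-in for a join-type hierarchy; not the real Catalyst classes.
sealed trait JoinType
case object Inner extends JoinType
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object FullOuter extends JoinType

// The wrapper from the review comment: marks a join as natural and keeps the underlying type.
final case class NaturalJoin(tpe: JoinType) extends JoinType

object JoinTypeDemo {
  // Hypothetical helper showing that existing pattern matches only need one extra case.
  def describe(jt: JoinType): String = jt match {
    case NaturalJoin(underlying) => s"natural ${describe(underlying)}"
    case Inner                   => "inner"
    case LeftOuter               => "left outer"
    case RightOuter              => "right outer"
    case FullOuter               => "full outer"
  }
}
```

The appeal of this shape is that the `Join` operator keeps its existing signature; code that does not care about natural joins never has to look at a new flag.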

@adrian-wang
Contributor Author

@rxin, Thanks for your time, I'll draft another version accordingly.

@adrian-wang adrian-wang changed the title [SPARK-12828][SQL]add natural join support [WIP][SPARK-12828][SQL]add natural join support Jan 15, 2016

SparkQA commented Jan 15, 2016

Test build #49455 has finished for PR 10762 at commit 41f50cf.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class NarrowReferenceHolder(child: LogicalPlan) extends UnaryNode


SparkQA commented Jan 20, 2016

Test build #49768 has finished for PR 10762 at commit 25a7226.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 20, 2016

Test build #49776 has finished for PR 10762 at commit 05ab0e9.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 21, 2016

Test build #49845 has finished for PR 10762 at commit afb60a5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@adrian-wang adrian-wang changed the title [WIP][SPARK-12828][SQL]add natural join support [SPARK-12828][SQL]add natural join support Jan 21, 2016
@adrian-wang
Contributor Author

retest this please.


SparkQA commented Jan 21, 2016

Test build #49859 has finished for PR 10762 at commit afb60a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -104,6 +105,9 @@ trait CheckAnalysis {
s"filter expression '${f.condition.prettyString}' " +
s"of type ${f.condition.dataType.simpleString} is not a boolean.")

case j @ Join(_, _, NaturalJoin(_), _) =>
failAnalysis(s"natural join not resolved.")
Contributor

when will this happen? is this a useful error message to show at all?

Contributor Author

This should never happen, except when natural joins remain unresolved after 100 iterations.

Contributor

ok so this can happen - maybe we should print something more than just a message to tell which natural join cannot be resolved?

Contributor

If this happens, it must be because the join operator itself has a problem, not because of the NaturalJoin type. I think we should remove this case and let the other checking rules tell users what's wrong.

Contributor

rxin commented Jan 22, 2016

Actually now i think about it - maybe we should just have a constructor for Join that the parser calls, and the constructor just creates a normal join with the right project list and conditions. Seems like it'd be a lot simpler.

@adrian-wang
Contributor Author

When the parser calls the constructor, how can we get the schema of tables? We need schema to build project list and conditions.

Contributor

rxin commented Jan 23, 2016

@hvanhovell can you help review the parser changes?

@cloud-fan
Contributor

one comment: #10762 (comment)

@adrian-wang
Contributor Author

Actually, MySQL and Oracle do not support normal full outer join either.
PostgreSQL does support natural full outer join: http://www.postgresql.org/docs/9.1/static/queries-table-expressions.html
I think we should support natural full outer join since we support full outer join.


SparkQA commented Jan 29, 2016

Test build #50342 has finished for PR 10762 at commit cb8af0e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

class ResolveNaturalJoinSuite extends AnalysisTest {
import org.apache.spark.sql.catalyst.analysis.TestRelations._

val t1 = testRelation2.select('a, 'b)
Contributor

make all of them lazy; otherwise an exception thrown here will make the entire suite disappear from jenkins reporting.
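
The reasoning behind this request can be sketched without any test framework: a plain `val` initializer runs while the class is being constructed, so a failing fixture aborts construction of the whole suite, whereas a `lazy val` only fails when a test first touches it. The class names below are hypothetical and exist purely for illustration.

```scala
object LazyFixtureDemo {
  // Eager fixture: the initializer runs in the constructor, so `new EagerSuite` itself throws.
  class EagerSuite {
    val relation: String = sys.error("fixture failed")
  }

  // Lazy fixture: construction succeeds; the error surfaces only on first access.
  class LazySuite {
    lazy val relation: String = sys.error("fixture failed")
  }
}
```

In a real suite, the lazy variant means the failure is attributed to the individual test that reads the fixture rather than preventing the suite from being instantiated at all.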


SparkQA commented Jan 29, 2016

Test build #50369 has finished for PR 10762 at commit 7e7de89.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ResolveNaturalJoinSuite extends AnalysisTest


SparkQA commented Jan 29, 2016

Test build #50372 has finished for PR 10762 at commit 2cf5629.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 29, 2016

Test build #50375 has finished for PR 10762 at commit 192f8bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

childrenResolved &&
expressions.forall(_.resolved) &&
selfJoinResolved &&
condition.forall(_.dataType == BooleanType)
}

// Joins are only resolved if they don't introduce ambiguous expression ids.
Contributor

This comment should belong to resolvedExceptNatural?

Contributor Author

This is original comment from the old Join.

Contributor

yup, this is the original comment for resolved, but now we renamed resolved to resolvedExceptNatural right?


SparkQA commented Feb 1, 2016

Test build #50470 has finished for PR 10762 at commit 12de061.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

AttributeReference("a", StringType, nullable = false)(),
AttributeReference("b", StringType, nullable = false)(),
AttributeReference("c", StringType, nullable = false)())
lazy val tt1 = testRelation0.select('a, 'b)
Contributor

I think we need some better names than tt1, aa, trueB, etc.....


SparkQA commented Feb 1, 2016

Test build #50476 has finished for PR 10762 at commit 6aa2a79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

class ResolveNaturalJoinSuite extends AnalysisTest {
import org.apache.spark.sql.catalyst.analysis.TestRelations._

lazy val t1 = testRelation2.select('a, 'b)
Contributor

the naming here is super confusing. can you improve it? some suggestions:

  1. Define relations inline here, and don't use TestRelations
  2. Name relations r1, r2, r3 ...
  3. Prefix attribute names with relation names, e.g. r1_a, r1_b, ...

import org.apache.spark.sql.types.StringType

class ResolveNaturalJoinSuite extends AnalysisTest {
lazy val r1 = LocalRelation(
Contributor

since we need each field anyway, how about we define the fields first and use them to compose relation? i.e.

val a = 'a.string
val b = 'b.string
....
val f = 'f.string
val a_nonNullable = a.withNullability(false)
...

val r1 = LocalRelation(a, b, c)
val r2 = LocalRelation(a, b)
val r3 = LocalRelation(d, e, f)
...

Tip: using the DSL can make the tests much more readable; see an example here


SparkQA commented Feb 1, 2016

Test build #50480 has finished for PR 10762 at commit 42f9d2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 1, 2016

Test build #50487 has finished for PR 10762 at commit 307cb5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@adrian-wang
Contributor Author

@cloud-fan any more comments?

Contributor

rxin commented Feb 4, 2016

LGTM - going to merge it.

@asfgit asfgit closed this in 0f81318 Feb 4, 2016
// we should only keep unique columns(depends on joinType) for joinCols
val projectList = joinType match {
case LeftOuter =>
leftKeys ++ lUniqueOutput ++ rUniqueOutput.map(_.withNullability(true))
Contributor

why are we switching the ordering of output columns?

Contributor

nvm i figured it out.
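
For readers following the ordering question, the sketch below models the project list in plain Scala: join keys come first, then the columns unique to each side, and columns from the side that may have no match are marked nullable. Only the `LeftOuter` line mirrors the diff hunk above; the `RightOuter` and `Inner` cases, the `Col` type, and how the keys are derived are assumptions for illustration, and full outer joins are left out. Presumably the keys are taken from the left side in a left outer join because that side always has a value.

```scala
object ProjectListSketch {
  // Simplified stand-in for an attribute: a name plus a nullability flag.
  final case class Col(name: String, nullable: Boolean) {
    def withNullability(n: Boolean): Col = copy(nullable = n)
  }

  sealed trait Kind
  case object Inner extends Kind
  case object LeftOuter extends Kind
  case object RightOuter extends Kind

  def projectList(kind: Kind, left: Seq[Col], right: Seq[Col]): Seq[Col] = {
    // Column names shared by both sides, in left-side order.
    val joinNames = left.map(_.name).filter(n => right.exists(_.name == n))
    val leftKeys = left.filter(c => joinNames.contains(c.name))
    val rightKeys = joinNames.map(n => right.find(_.name == n).get) // same order as leftKeys
    val lUniqueOutput = left.filterNot(c => joinNames.contains(c.name))
    val rUniqueOutput = right.filterNot(c => joinNames.contains(c.name))
    kind match {
      // From the diff: right-only columns may be missing for unmatched left rows.
      case LeftOuter  => leftKeys ++ lUniqueOutput ++ rUniqueOutput.map(_.withNullability(true))
      // Assumed mirror image of the LeftOuter case.
      case RightOuter => rightKeys ++ lUniqueOutput.map(_.withNullability(true)) ++ rUniqueOutput
      // Assumed: an inner join needs no nullability change.
      case Inner      => leftKeys ++ lUniqueOutput ++ rUniqueOutput
    }
  }
}
```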

asfgit pushed a commit that referenced this pull request Feb 4, 2016
This is a small addendum to #10762 to make the code more robust against future changes.

Author: Reynold Xin <rxin@databricks.com>

Closes #11070 from rxin/SPARK-12828-natural-join.
5 participants