-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12828][SQL]add natural join support #10762
Conversation
Test build #49434 has finished for PR 10762 at commit
|
@@ -179,10 +180,15 @@ object SqlParser extends AbstractSparkSQLParser with DataTypeParser { | |||
) | |||
|
|||
protected lazy val joinedRelation: Parser[LogicalPlan] = | |||
relationFactor ~ rep1(joinType.? ~ (JOIN ~> relationFactor) ~ joinConditions.?) ^^ { | |||
relationFactor ~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note this is going away
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will update the hive parser today, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Test build #49445 has finished for PR 10762 at commit
|
@@ -144,6 +147,35 @@ case class Join( | |||
} | |||
} | |||
|
|||
def outerProjectList: Seq[NamedExpression] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For select * from a natural join, we need to use a Project to get rid of redundant columns.
In mysql, select * from natural join would only contain one row if both size have the row of the same name. but user do can use something like t1.a
or t2.a
to explicitly reference the column here. Do you think we should implement it exactly the same? I'm afraid it would be a too complicated change, but I can do that if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you mean "column" instead of row? columns of the same name should only appear once.
also this code should go into the analyzer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I mean "column"...
Thanks for making the change -- after looking at the change I think this new version has too many changes (it is fairly ugly that we need to update a lot of files because of the flag). Are there ways to simplify this? When I suggested not introducing a new operator, I was thinking about just changing the join type to something like case class NaturalJoin(tpe: JoinType) extends JoinType |
@rxin, Thanks for you time, I'll draft another version accordingly. |
15b96ba
to
41f50cf
Compare
Test build #49455 has finished for PR 10762 at commit
|
41f50cf
to
25a7226
Compare
Test build #49768 has finished for PR 10762 at commit
|
Test build #49776 has finished for PR 10762 at commit
|
05ab0e9
to
afb60a5
Compare
Test build #49845 has finished for PR 10762 at commit
|
retest this please. |
Test build #49859 has finished for PR 10762 at commit
|
@@ -104,6 +105,9 @@ trait CheckAnalysis { | |||
s"filter expression '${f.condition.prettyString}' " + | |||
s"of type ${f.condition.dataType.simpleString} is not a boolean.") | |||
|
|||
case j @ Join(_, _, NaturalJoin(_), _) => | |||
failAnalysis(s"natural join not resolved.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when will this happen? is this a useful error message to show at all/?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will never happen but when there were remaining natural joins after 100 iterations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so this can happen - maybe we should print something more than just a message to tell which natural join cannot be resolved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this happens, it must because join operator itself has some problems, but not caused by the NaturalJoin
type. I think we should remove this case and let other checking rules to tell users what's wrong.
Actually now i think about it - maybe we should just have a constructor for Join that the parser calls, and the constructor just creates a normal join with the right project list and conditions. Seems like it'd be a lot simpler. |
When the parser calls the constructor, how can we get the schema of tables? We need schema to build project list and conditions. |
@hvanhovell can you help review the parser changes? |
one comment: #10762 (comment) |
Acutally MySQL and Oracle does not support normal full outer join either. |
Test build #50342 has finished for PR 10762 at commit
|
class ResolveNaturalJoinSuite extends AnalysisTest { | ||
import org.apache.spark.sql.catalyst.analysis.TestRelations._ | ||
|
||
val t1 = testRelation2.select('a, 'b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make all of them lazy; otherwise an exception thrown here will make the entire suite disappear from jenkins reporting.
Test build #50369 has finished for PR 10762 at commit
|
Test build #50372 has finished for PR 10762 at commit
|
Test build #50375 has finished for PR 10762 at commit
|
childrenResolved && | ||
expressions.forall(_.resolved) && | ||
selfJoinResolved && | ||
condition.forall(_.dataType == BooleanType) | ||
} | ||
|
||
// Joins are only resolved if they don't introduce ambiguous expression ids. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment should belong to resolvedExceptNatural
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is original comment from the old Join.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, this is the original comment for resolved
, but now we renamed resolved
to resolvedExceptNatural
right?
Test build #50470 has finished for PR 10762 at commit
|
AttributeReference("a", StringType, nullable = false)(), | ||
AttributeReference("b", StringType, nullable = false)(), | ||
AttributeReference("c", StringType, nullable = false)()) | ||
lazy val tt1 = testRelation0.select('a, 'b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need some better names than tt1
, aa
, trueB
, etc.....
Test build #50476 has finished for PR 10762 at commit
|
class ResolveNaturalJoinSuite extends AnalysisTest { | ||
import org.apache.spark.sql.catalyst.analysis.TestRelations._ | ||
|
||
lazy val t1 = testRelation2.select('a, 'b) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the naming here is super confusing. can you improve it? some suggestions:
- Define relations inline here, and don't use TestRelations
- Name relations r1, r2, r3 ...
- Prefix attribute names with relation names, e.g. r1_a, r1_b, ...
import org.apache.spark.sql.types.StringType | ||
|
||
class ResolveNaturalJoinSuite extends AnalysisTest { | ||
lazy val r1 = LocalRelation( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we need each field anyway, how about we define the fields first and use them to compose relation? i.e.
val a = 'a.string
val b = 'b.string
....
val f = 'f.string
val a_nonNullable = a.withNullability(false)
...
val r1 = LocalRelation(a, b, c)
val r2 = LocalRelation(a, b)
val r3 = LocalRelation(d, e, f)
...
tips: using the dsl can make the tests much readable, see an example here
Test build #50480 has finished for PR 10762 at commit
|
Test build #50487 has finished for PR 10762 at commit
|
@cloud-fan any more comments? |
LGTM - going to merge it. |
// we should only keep unique columns(depends on joinType) for joinCols | ||
val projectList = joinType match { | ||
case LeftOuter => | ||
leftKeys ++ lUniqueOutput ++ rUniqueOutput.map(_.withNullability(true)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we switching the ordering of output columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm i figured it out.
This is a small addendum to #10762 to make the code more robust again future changes. Author: Reynold Xin <rxin@databricks.com> Closes #11070 from rxin/SPARK-12828-natural-join.
Jira:
https://issues.apache.org/jira/browse/SPARK-12828