[SPARK-12656] [SQL] Implement Intersect with Left-semi Join #10630
@rxin Please review the implementation. Thank you! |
Which mainstream RDBMS is that? |
* ==> SELECT a1, a2 FROM Tab1, Tab2 ON a1<=>b1 AND a2<=>b2 | ||
* }}} | ||
*/ | ||
object ReplaceIntersectWithLeftSemi extends Rule[LogicalPlan] { |
LeftSemi -> LeftSemiJoin or just SemiJoin
yeah. Forgot to specify the join type
MS SQL Server did that |
@@ -322,13 +323,32 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { | |||
} | |||
|
|||
test("intersect") { | |||
val intersectDF = lowerCaseData.intersect(lowerCaseData) | |||
|
|||
// Before Optimizer, the operator is Intersect |
This should go into one of the optimizer unit test suites, not here.
ok, will add a new test suite for it.
LGTM. cc @cloud-fan to take a look too. |
Test build #48900 has finished for PR 10630 at commit
|
} | ||
} | ||
|
||
def apply(plan: LogicalPlan): LogicalPlan = plan transform { |
use transformUp?
cc @yhuai
actually nvm.
def apply(plan: LogicalPlan): LogicalPlan = plan transform { | ||
case Intersect(left, right) => | ||
val joinCond = left.output.zip(right.output).map { case (l, r) => | ||
EqualNullSafe(l, r) } |
nit: can we put it in one line?
When resolving the conflicts, I realized the multi-children Let me know if we need to open a separate PR to do it now. So far, unlike |
Test build #49936 has finished for PR 10630 at commit
|
I don't think it's a problem for set operations to have conflicting attribute ids, because only one child's attribute references need to be propagated up (unlike with a join). |
Yeah, agree! Thank you! |
@@ -125,17 +128,15 @@ object EliminateSerialization extends Rule[LogicalPlan] { | |||
|
|||
/** | |||
* Pushes certain operations to both sides of a Union, Intersect or Except operator. | |||
======= | |||
* Pushes certain operations to both sides of a Union or Except operator. | |||
>>>>>>> IntersectBySemiJoinMerged |
Seems we need to remove this.
yeah, sure, will do.
Test build #50176 has finished for PR 10630 at commit
|
@@ -111,6 +113,7 @@ object SamplePushDown extends Rule[LogicalPlan] { | |||
} | |||
|
|||
/** | |||
<<<<<<< HEAD |
remove this
will do.
Test build #50257 has finished for PR 10630 at commit
|
Test build #50313 has finished for PR 10630 at commit
|
failAnalysis( | ||
s""" | ||
|Failure when resolving conflicting references in Join: |
now we can keep this message as it only checks join :)
Can users observe this error, or can it be considered an internal error? BTW, we are about to convert it to an internal error in PR #41476.
LGTM. we can merge it first and @gatorsmile can address remaining comments in a follow-up PR. |
This is not that big. Let's just do it together here. |
Thank you! Just cleaned up the code. : ) |
LGTM, pending test |
Test build #50368 has finished for PR 10630 at commit
|
Thanks - I'm going to merge this. |
Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
After a search, I found that one mainstream RDBMS does the same: in its query explain output, Intersect is replaced by a left-semi join. A left-semi join could also help outer-join elimination in the Optimizer, as shown in PR #10566.
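To make the rewrite described above concrete, here is a minimal sketch in plain Scala collections rather than Catalyst. It models Intersect as a left-semi join on null-safe equality of every column pair, followed by a distinct. The names `Row`, `equalNullSafe`, `leftSemiJoin`, and `intersect` are hypothetical helpers for illustration only and are not Spark APIs; `None` stands in for SQL NULL.

```scala
object IntersectAsSemiJoin {
  // Hypothetical row model: a sequence of nullable column values (None = SQL NULL).
  type Row = Seq[Option[Int]]

  // Null-safe equality, like SQL's <=>: NULL <=> NULL is true, unlike plain `=`.
  def equalNullSafe(l: Option[Int], r: Option[Int]): Boolean = l == r

  // Left-semi join: keep each left row that has at least one matching right row.
  // Only left-side columns are returned, and left duplicates are preserved.
  def leftSemiJoin(left: Seq[Row], right: Seq[Row])(cond: (Row, Row) => Boolean): Seq[Row] =
    left.filter(l => right.exists(r => cond(l, r)))

  // Intersect expressed as distinct(left-semi join on all column pairs).
  def intersect(left: Seq[Row], right: Seq[Row]): Seq[Row] = {
    // Pair up columns positionally, mirroring left.output.zip(right.output) in the rule.
    val joinCond: (Row, Row) => Boolean = (l, r) =>
      l.length == r.length && l.zip(r).forall { case (a, b) => equalNullSafe(a, b) }
    leftSemiJoin(left, right)(joinCond).distinct
  }
}
```

For example, with `tab1 = Seq(Seq(Some(1), None), Seq(Some(1), None), Seq(Some(2), Some(3)))` and `tab2 = Seq(Seq(Some(1), None), Seq(Some(4), Some(5)))`, `IntersectAsSemiJoin.intersect(tab1, tab2)` yields the single row `Seq(Some(1), None)`. The null-safe comparison is what lets the NULL column match, and the distinct step removes the duplicate that the semi-join alone would keep.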