-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12656] [SQL] Implement Intersect with Left-semi Join #10630
Changes from 6 commits
01e4cdf
6835704
9180687
b38a21e
d2b84af
fda8025
ac0dccd
6e0018b
0546772
b37a64f
c2a872c
ab6dbd7
4276356
0bd1771
7bd102b
bfa99c5
cd23b03
100174a
9aad1cf
6742984
e4c34f0
2dab708
9864b3f
24cea7d
27192be
a932cdb
04a26bd
0458770
6a52e2b
f820c61
4372170
1debdfa
763706d
4de6ec1
9422a4f
52bdf48
1e95df3
d59b37b
6a7979d
fd87585
e566d79
3be78c4
e51de8f
b600089
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -40,6 +40,8 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] { | |
Batch("Aggregate", FixedPoint(100), | ||
ReplaceDistinctWithAggregate, | ||
RemoveLiteralFromGroupExpressions) :: | ||
Batch("Intersect", FixedPoint(100), | ||
ReplaceIntersectWithSemiJoin) :: | ||
Batch("Operator Optimizations", FixedPoint(100), | ||
// Operator push down | ||
SetOperationPushDown, | ||
|
@@ -93,18 +95,13 @@ object SamplePushDown extends Rule[LogicalPlan] { | |
} | ||
|
||
/** | ||
* Pushes certain operations to both sides of a Union, Intersect or Except operator. | ||
* Pushes certain operations to both sides of a Union or Except operator. | ||
* Operations that are safe to pushdown are listed as follows. | ||
* Union: | ||
* Right now, Union means UNION ALL, which does not de-duplicate rows. So, it is | ||
* safe to pushdown Filters and Projections through it. Once we add UNION DISTINCT, | ||
* we will not be able to pushdown Projections. | ||
* | ||
* Intersect: | ||
* It is not safe to pushdown Projections through it because we need to get the | ||
* intersect of rows by comparing the entire rows. It is fine to pushdown Filters | ||
* with deterministic condition. | ||
* | ||
* Except: | ||
* It is not safe to pushdown Projections through it because we need to get the | ||
* intersect of rows by comparing the entire rows. It is fine to pushdown Filters | ||
|
@@ -116,15 +113,15 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper { | |
* Maps Attributes from the left side to the corresponding Attribute on the right side. | ||
*/ | ||
private def buildRewrites(bn: BinaryNode): AttributeMap[Attribute] = { | ||
assert(bn.isInstanceOf[Union] || bn.isInstanceOf[Intersect] || bn.isInstanceOf[Except]) | ||
assert(bn.isInstanceOf[Union] || bn.isInstanceOf[Except]) | ||
assert(bn.left.output.size == bn.right.output.size) | ||
|
||
AttributeMap(bn.left.output.zip(bn.right.output)) | ||
} | ||
|
||
/** | ||
* Rewrites an expression so that it can be pushed to the right side of a | ||
* Union, Intersect or Except operator. This method relies on the fact that the output attributes | ||
* Union or Except operator. This method relies on the fact that the output attributes | ||
* of a union/intersect/except are always equal to the left child's output. | ||
*/ | ||
private def pushToRight[A <: Expression](e: A, rewrites: AttributeMap[Attribute]) = { | ||
|
@@ -175,17 +172,6 @@ object SetOperationPushDown extends Rule[LogicalPlan] with PredicateHelper { | |
p | ||
} | ||
|
||
// Push down filter through INTERSECT | ||
case Filter(condition, i @ Intersect(left, right)) => | ||
val (deterministic, nondeterministic) = partitionByDeterministic(condition) | ||
val rewrites = buildRewrites(i) | ||
Filter(nondeterministic, | ||
Intersect( | ||
Filter(deterministic, left), | ||
Filter(pushToRight(deterministic, rewrites), right) | ||
) | ||
) | ||
|
||
// Push down filter through EXCEPT | ||
case Filter(condition, e @ Except(left, right)) => | ||
val (deterministic, nondeterministic) = partitionByDeterministic(condition) | ||
|
@@ -965,6 +951,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] { | |
} | ||
} | ||
|
||
/** | ||
* Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator. | ||
* {{{ | ||
* SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 | ||
* ==> SELECT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 | ||
* }}} | ||
*/ | ||
object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] { | ||
def apply(plan: LogicalPlan): LogicalPlan = plan transform { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. use transformUp? cc @yhuai There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. actually nvm. |
||
case Intersect(left, right) => | ||
val joinCond = left.output.zip(right.output).map { case (l, r) => | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. one nit - might as well add assert(left.output.size == right.output.size) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added. Thanks! |
||
EqualNullSafe(l, r) } | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: can we put it in one line? |
||
Join(left, right, LeftSemi, joinCond.reduceLeftOption(And)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks like intersect also implies de-dup unless There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You are right! It should remove duplicates. I just tried it in DB2. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does our current intersect do de-dup? On Thu, Jan 7, 2016 at 8:50 AM -0800, "Xiao Li" notifications@github.com wrote: In sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
You are right! It should remove duplicates. I just tried it in DB2. — There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just ran it. The existing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I also checked the underlying RDD API: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can just insert a distinct into this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure. Will do |
||
} | ||
} | ||
|
||
/** | ||
* Removes literals from group expressions in [[Aggregate]], as they have no effect to the result | ||
* but only makes the grouping key bigger. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
/* | ||
* Licensed to the Apache Software Foundation (ASF) under one or more | ||
* contributor license agreements. See the NOTICE file distributed with | ||
* this work for additional information regarding copyright ownership. | ||
* The ASF licenses this file to You under the Apache License, Version 2.0 | ||
* (the "License"); you may not use this file except in compliance with | ||
* the License. You may obtain a copy of the License at | ||
* | ||
* http://www.apache.org/licenses/LICENSE-2.0 | ||
* | ||
* Unless required by applicable law or agreed to in writing, software | ||
* distributed under the License is distributed on an "AS IS" BASIS, | ||
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
|
||
package org.apache.spark.sql.catalyst.optimizer | ||
|
||
import org.apache.spark.sql.catalyst.dsl.expressions._ | ||
import org.apache.spark.sql.catalyst.dsl.plans._ | ||
import org.apache.spark.sql.catalyst.plans.{LeftSemi, PlanTest} | ||
import org.apache.spark.sql.catalyst.plans.logical._ | ||
import org.apache.spark.sql.catalyst.rules.RuleExecutor | ||
|
||
class ReplaceOperatorSuite extends PlanTest { | ||
|
||
object Optimize extends RuleExecutor[LogicalPlan] { | ||
val batches = | ||
Batch("Intersect", FixedPoint(100), | ||
ReplaceIntersectWithSemiJoin) :: Nil | ||
} | ||
|
||
test("replace Intersect with Left-semi Join") { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. also move the test for |
||
val table1 = LocalRelation('a.int, 'b.int) | ||
val table2 = LocalRelation('c.int, 'd.int) | ||
|
||
val query = Intersect(table1, table2) | ||
val optimized = Optimize.execute(query.analyze) | ||
|
||
val correctAnswer = | ||
Join(table1, table2, LeftSemi, Option('a <=> 'c && 'b <=> 'd)).analyze | ||
|
||
comparePlans(optimized, correctAnswer) | ||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -329,6 +329,14 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { | |
Row(3, "c") :: | ||
Row(4, "d") :: Nil) | ||
checkAnswer(lowerCaseData.intersect(upperCaseData), Nil) | ||
|
||
// check null equality | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we have a test case where there are duplicates on the left side? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A good case might be two tables that each contain two int rows [1], [1] and intersect them. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you! Let me add the test case. It passes but we should still add this case! |
||
checkAnswer( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. leave a line of comment on what this is testing (e.g. null equality). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok, will do it. |
||
nullInts.intersect(nullInts), | ||
Row(1) :: | ||
Row(2) :: | ||
Row(3) :: | ||
Row(null) :: Nil) | ||
} | ||
|
||
test("udf") { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to add a comment at here to mention that this rewrite is just for
INTERSECT
(i.e.INTERSECT DISTINCT
) and is not applicable toINTERSECT ALL
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, will do it. Actually, I will also implement
INTERSECT ALL
after this one. Thanks!