[SPARK-12656] [SQL] Implement Intersect with Left-semi Join #10630

gatorsmile · 2016-01-07T05:48:33Z

Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and instead transform a logical Intersect into a left-semi join followed by a distinct. This way, we can take advantage of all the benefits of the join implementations (e.g. managed memory, code generation, broadcast joins).

After a search, I found that one of the mainstream RDBMSs does the same. In its query explain output, Intersect is replaced by a left-semi join. A left-semi join could also help outer-join elimination in the Optimizer, as shown in PR #10566.
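
A minimal sketch of the equivalence described above, modeled on plain Scala collections rather than on Spark. Everything here (`IntersectAsSemiJoin`, `Row`, `nullSafeEq`, `leftSemiJoin`, `intersect`) is a hypothetical name chosen for illustration, not code from this patch. It assumes rows are sequences of nullable ints and that `None` stands in for SQL NULL.

```scala
object IntersectAsSemiJoin {
  // A row is a sequence of nullable ints; None stands in for SQL NULL.
  type Row = Seq[Option[Int]]

  // Null-safe equality (<=>): NULL <=> NULL is true, unlike SQL's '='.
  // Option's == already behaves this way (None == None is true).
  def nullSafeEq(l: Option[Int], r: Option[Int]): Boolean = l == r

  // Left-semi join: keep each left row that has at least one match on the right.
  // Only left-side columns are returned, and a left row is emitted at most once
  // per occurrence on the left, however many right rows match it.
  def leftSemiJoin(left: Seq[Row], right: Seq[Row]): Seq[Row] =
    left.filter { l =>
      right.exists { r =>
        l.length == r.length && l.zip(r).forall { case (a, b) => nullSafeEq(a, b) }
      }
    }

  // INTERSECT with distinct semantics = DISTINCT over the left-semi join.
  def intersect(left: Seq[Row], right: Seq[Row]): Seq[Row] =
    leftSemiJoin(left, right).distinct
}
```

The distinct step matters because the semi-join alone would keep duplicate rows from the left side, and the null-safe comparison matters because set operations treat two NULLs as equal.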

gatorsmile · 2016-01-07T05:49:43Z

@rxin Please review the implementation. Thank you!

rxin · 2016-01-07T05:51:17Z

Which mainstream RDBMS is that?

rxin · 2016-01-07T05:51:45Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+ *   ==>  SELECT a1, a2 FROM Tab1, Tab2 ON a1<=>b1 AND a2<=>b2
+ * }}}
+ */
+object ReplaceIntersectWithLeftSemi extends Rule[LogicalPlan] {


LeftSemi -> LeftSemiJoin or just SemiJoin

yeah. Forgot to specify the join type

gatorsmile · 2016-01-07T05:53:52Z

MS SQL Server did that

rxin · 2016-01-07T05:54:32Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

@@ -322,13 +323,32 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
  }

  test("intersect") {
+    val intersectDF = lowerCaseData.intersect(lowerCaseData)
+
+    // Before Optimizer, the operator is Intersect


This should go into one of the optimizer unit test suites, not here.

ok, will add a new test suite for it.

rxin · 2016-01-07T05:57:17Z

LGTM.

cc @cloud-fan to take a look too.

SparkQA · 2016-01-07T06:16:42Z

Test build #48900 has finished for PR 10630 at commit 0bd1771.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-01-07T06:24:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+    }
+  }
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {


use transformUp?

cc @yhuai

actually nvm.

cloud-fan · 2016-01-07T06:59:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case Intersect(left, right) =>
+      val joinCond = left.output.zip(right.output).map { case (l, r) =>
+        EqualNullSafe(l, r) }


nit: can we put it in one line?
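
For readers following the diff, here is a hedged sketch of the shape of this rewrite on a toy plan tree. The case classes below are simplified stand-ins that only borrow Catalyst's names; `ToyIntersectRewrite`, `Relation`, `LeftSemiJoin`, `joinCond` and `rewrite` are hypothetical and do not reproduce the actual rule in Optimizer.scala. The join condition is built on one line, as the nit suggests.

```scala
object ToyIntersectRewrite {
  // Toy expressions (stand-ins, not the Catalyst classes of the same name).
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class EqualNullSafe(left: Expr, right: Expr) extends Expr
  case class And(left: Expr, right: Expr) extends Expr

  // Toy logical plans.
  sealed trait Plan { def output: Seq[Attr] }
  case class Relation(output: Seq[Attr]) extends Plan
  case class Intersect(left: Plan, right: Plan) extends Plan {
    def output: Seq[Attr] = left.output
  }
  case class LeftSemiJoin(left: Plan, right: Plan, cond: Option[Expr]) extends Plan {
    def output: Seq[Attr] = left.output
  }
  case class Distinct(child: Plan) extends Plan {
    def output: Seq[Attr] = child.output
  }

  // Pair the children's columns positionally and AND the null-safe comparisons.
  def joinCond(left: Plan, right: Plan): Option[Expr] =
    left.output.zip(right.output).map { case (l, r) => EqualNullSafe(l, r) }
      .reduceLeftOption[Expr]((acc, e) => And(acc, e))

  // Replace every Intersect with Distinct over a left-semi join.
  def rewrite(plan: Plan): Plan = plan match {
    case Intersect(l, r) =>
      val (nl, nr) = (rewrite(l), rewrite(r))
      Distinct(LeftSemiJoin(nl, nr, joinCond(nl, nr)))
    case LeftSemiJoin(l, r, c) => LeftSemiJoin(rewrite(l), rewrite(r), c)
    case Distinct(c) => Distinct(rewrite(c))
    case rel: Relation => rel
  }
}
```

In this toy model, `rewrite(Intersect(t1, t2))` for two-column relations yields `Distinct(LeftSemiJoin(t1, t2, Some(And(a1 <=> b1, a2 <=> b2))))`, which mirrors the SQL in the rule's doc comment once the join type is spelled out.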

gatorsmile · 2016-01-23T09:16:03Z

When resolving the conflicts, I realized that the multi-child Union might introduce duplicate exprIds. So far, I have not added or changed the corresponding function to de-duplicate them. If it is needed, this is not trivial work: when a Union has hundreds of children, the current per-pair de-duplication is infeasible, which means we would need to rewrite the whole dedupRight function.

Let me know if we need to open a separate PR to do it now. So far, unlike with Intersect, we have not hit any issue even when duplicate exprId values exist in a Union. Thanks!

SparkQA · 2016-01-23T11:02:39Z

Test build #49936 has finished for PR 10630 at commit 6a7979d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2016-01-25T19:31:24Z

I don't think it's a problem for there to be conflicting attribute ids for set operations. This is because only one child's attribute references need to be propagated up (unlike with a join).
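
A toy illustration of this point, under the assumption that a set operation exposes only its first child's attributes while a join exposes both sides. The types below (`ToyExprIds`, `Attr`, `Relation`, `Union`, `Join`, `hasAmbiguousIds`) are hypothetical simplifications, not Catalyst's classes.

```scala
object ToyExprIds {
  case class Attr(name: String, exprId: Long)

  sealed trait Plan { def output: Seq[Attr] }
  case class Relation(output: Seq[Attr]) extends Plan

  // A set operation propagates only its first child's attributes upward.
  case class Union(children: Seq[Plan]) extends Plan {
    def output: Seq[Attr] = children.head.output
  }

  // A join propagates attributes from both sides, so ids must be unique across them.
  case class Join(left: Plan, right: Plan) extends Plan {
    def output: Seq[Attr] = left.output ++ right.output
  }

  // True when two output attributes share an exprId and so cannot be told apart.
  def hasAmbiguousIds(plan: Plan): Boolean = {
    val ids = plan.output.map(_.exprId)
    ids.distinct.size != ids.size
  }
}
```

With the same relation `t` used for every child, `hasAmbiguousIds(Union(Seq(t, t, t)))` is false while `hasAmbiguousIds(Join(t, t))` is true, which is why a self-join needs de-duplication and a self-union does not in this model.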

gatorsmile · 2016-01-25T21:15:51Z

Yeah, agree! Thank you!

yhuai · 2016-01-26T20:44:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -125,17 +128,15 @@ object EliminateSerialization extends Rule[LogicalPlan] {

 /**
 * Pushes certain operations to both sides of a Union, Intersect or Except operator.
+=======
+ * Pushes certain operations to both sides of a Union or Except operator.
+>>>>>>> IntersectBySemiJoinMerged


Seems we need to remove this.

yeah, sure, will do.


SparkQA · 2016-01-27T07:51:27Z

Test build #50176 has finished for PR 10630 at commit e566d79.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-01-27T19:54:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -111,6 +113,7 @@ object SamplePushDown extends Rule[LogicalPlan] {
 }

 /**
+<<<<<<< HEAD


remove this

SparkQA · 2016-01-28T07:43:40Z

Test build #50257 has finished for PR 10630 at commit 3be78c4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-01-29T01:11:13Z

Test build #50313 has finished for PR 10630 at commit e51de8f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode

cloud-fan · 2016-01-29T07:40:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

            failAnalysis(
              s"""
-                 |Failure when resolving conflicting references in Join:


now we can keep this message as it only checks join :)

Can users observe this error, or can it be considered an internal error? BTW, we are about to convert it to an internal error in PR #41476.

cloud-fan · 2016-01-29T07:49:46Z

LGTM. we can merge it first and @gatorsmile can address remaining comments in a follow-up PR.

rxin · 2016-01-29T07:55:36Z

This is not that big. Let's just do it together here.

gatorsmile · 2016-01-29T09:11:25Z

Thank you! Just cleaned up the code. :)

cloud-fan · 2016-01-29T09:17:14Z

LGTM, pending test

SparkQA · 2016-01-29T10:36:51Z

Test build #50368 has finished for PR 10630 at commit b600089.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-01-29T19:21:53Z

Thanks - I'm going to merge this.

gatorsmile and others added 14 commits November 13, 2015 14:50

Merge remote-tracking branch 'upstream/master'

01e4cdf

Merge remote-tracking branch 'upstream/master'

6835704

Merge remote-tracking branch 'upstream/master'

9180687

SPARK-11633

b38a21e

Merge remote-tracking branch 'upstream/master' into joinMakeCopy

d2b84af

Merge remote-tracking branch 'upstream/master'

fda8025

Merge branch 'master' of https://github.com/gatorsmile/spark

ac0dccd

Merge remote-tracking branch 'upstream/master'

6e0018b

converge

0546772

converge

b37a64f

Merge remote-tracking branch 'upstream/master'

c2a872c

Merge remote-tracking branch 'upstream/master'

ab6dbd7

Merge remote-tracking branch 'upstream/master'

4276356

replace Intersect with Left-semi Join

0bd1771

rxin reviewed Jan 7, 2016

Merge remote-tracking branch 'upstream/master' into IntersectBySemiJoin

7bd102b

rxin reviewed Jan 7, 2016

gatorsmile added 4 commits January 6, 2016 22:48

address comments.

bfa99c5

Merge remote-tracking branch 'upstream/master' into IntersectBySemiJoin

cd23b03

clean code.

100174a

clean code.

9aad1cf

cloud-fan reviewed Jan 7, 2016

yhuai reviewed Jan 26, 2016

gatorsmile added 2 commits January 26, 2016 22:12

Merge remote-tracking branch 'upstream/master' into IntersectBySemiJo…

fd87585

…inMergedNew

address comments.

e566d79

cloud-fan reviewed Jan 27, 2016

address comments.

3be78c4

fixed the failed cases.

e51de8f

cloud-fan reviewed Jan 29, 2016

addressed comments.

b600089

asfgit closed this in 5f686cc Jan 29, 2016

gatorsmile deleted the IntersectBySemiJoin branch February 6, 2016 22:29

gatorsmile mentioned this pull request Apr 28, 2016

[SPARK-12660] [SPARK-14967] [SQL] Implement Except Distinct by Left Anti Join #12736

Closed

MaxGekk Jun 27, 2023
