[SPARK-2205] [SQL] Avoid unnecessary exchange operators in multi-way joins #7773

JoshRosen · 2015-07-30T02:54:30Z

This PR adds PartitioningCollection, which represents the outputPartitioning of a SparkPlan with multiple children (e.g. ShuffledHashJoin). This lets a SparkPlan carry multiple descriptions of its partitioning scheme. ShuffledHashJoin, for example, has two: left.outputPartitioning and right.outputPartitioning. As a result, a query like select * from t1 join t2 on (t1.x = t2.x) join t3 on (t2.x = t3.x) will have only three Exchange operators (when shuffled joins are needed) instead of four.
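To make the idea concrete, here is a minimal, self-contained sketch. It is a toy model, not the classes added by this PR: `ToyPartitioning`, `ToyHashPartitioning`, `ToyPartitioningCollection`, the `clusteredBy` method, and the partition count of 200 are all assumptions made for illustration. It shows why reporting both children's partitionings lets the planner skip a shuffle before the second join.

```scala
// Toy model only: these types mimic the idea, not Spark's actual classes.
object PartitioningCollectionSketch {
  sealed trait ToyPartitioning {
    // True if rows with equal values for `required` are guaranteed to share a partition.
    def clusteredBy(required: Set[String]): Boolean
  }

  // Rows are hashed on `keys` into `numPartitions` partitions.
  final case class ToyHashPartitioning(keys: Set[String], numPartitions: Int)
      extends ToyPartitioning {
    def clusteredBy(required: Set[String]): Boolean = keys == required
  }

  // Several equally valid descriptions of the same physical layout.
  final case class ToyPartitioningCollection(partitionings: Seq[ToyPartitioning])
      extends ToyPartitioning {
    def clusteredBy(required: Set[String]): Boolean =
      partitionings.exists(_.clusteredBy(required))
  }

  // Output of `t1 join t2 on (t1.x = t2.x)` after both sides were shuffled on x.
  val joinOutput: ToyPartitioning = ToyPartitioningCollection(Seq(
    ToyHashPartitioning(Set("t1.x"), 200),
    ToyHashPartitioning(Set("t2.x"), 200)))

  // The second join (t2.x = t3.x) needs its left input clustered by t2.x.
  val needsExchange: Boolean = !joinOutput.clusteredBy(Set("t2.x"))

  // If only the left child's description were reported, a shuffle would be added.
  val leftOnly: ToyPartitioning = ToyHashPartitioning(Set("t1.x"), 200)
  val needsExchangeLeftOnly: Boolean = !leftOnly.clusteredBy(Set("t2.x"))

  def main(args: Array[String]): Unit = {
    println(s"exchange needed with collection: $needsExchange")
    println(s"exchange needed with left-only description: $needsExchangeLeftOnly")
  }
}
```

In this sketch the collection answers "yes" as soon as any one of its descriptions matches the requirement, which is what removes the fourth Exchange in the three-way join above.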

The code in this PR was authored by @yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations.


SparkQA · 2015-07-30T03:07:10Z

Test build #38962 has finished for PR 7773 at commit 801b807.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2015-07-30T03:43:22Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/ShuffledHashOuterJoin.scala

+    case x =>
+      throw new IllegalArgumentException(s"HashOuterJoin should not take $x as the JoinType")
+  }
+


I will remove this change for now. Once we have the nullSafe concept, we can better describe how the result of this join operator is partitioned. For example, right now, it is not safe to say that the output of this operator is partitioned by the rightKeys when we have a left outer join (because rows with null keys are not clustered).
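The null-key problem can be seen in a small simulation. This is a sketch using plain Scala collections, not Spark code; the sample rows, the two-partition setup, and the `partitionOf` helper are assumptions for illustration. Both sides are hashed on the join key, a left outer join runs inside each partition, and unmatched left rows produce a null (here `None`) right key in whichever partition they happen to live in.

```scala
object OuterJoinNullKeysSketch {
  // Toy rows: (key, payload). Not Spark code.
  val left: Seq[(Int, String)]  = Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d")
  val right: Seq[(Int, String)] = Seq(1 -> "x", 2 -> "y")
  val numPartitions = 2

  // Stand-in for hash partitioning on the join key.
  def partitionOf(key: Int): Int = Math.floorMod(key, numPartitions)

  // Left outer join evaluated independently inside each partition.
  // Output rows: (partition, leftKey, rightKey), where None models a null right key.
  val joined: Seq[(Int, Int, Option[Int])] =
    for {
      p <- 0 until numPartitions
      (lk, _) <- left if partitionOf(lk) == p
      matches = right.filter { case (rk, _) => rk == lk && partitionOf(rk) == p }
      rk <- if (matches.isEmpty) Seq(Option.empty[Int]) else matches.map(m => Option(m._1))
    } yield (p, lk, rk)

  // Partitions that hold at least one row whose right key is null.
  val partitionsWithNullRightKey: Set[Int] =
    joined.collect { case (p, _, None) => p }.toSet

  def main(args: Array[String]): Unit =
    println(s"partitions containing null right keys: $partitionsWithNullRightKey")
}
```

Rows with a null right key end up in more than one partition, so the output cannot be described as clustered by the right keys, even though every non-null right key still maps to a single partition.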

SparkQA · 2015-07-30T03:49:34Z

Test build #38966 has finished for PR 7773 at commit 4a99204.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PartitioningCollection(partitionings: Seq[Partitioning])

yhuai · 2015-07-30T04:20:21Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashOuterJoin.scala

@@ -57,6 +57,8 @@ case class BroadcastHashOuterJoin(
  override def requiredChildDistribution: Seq[Distribution] =
    UnspecifiedDistribution :: UnspecifiedDistribution :: Nil

+  override def outputPartitioning: Partitioning = streamedPlan.outputPartitioning


This is a bug fix.


yhuai · 2015-07-30T04:25:39Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

@@ -94,8 +95,12 @@ sealed trait Partitioning {
   */
  def compatibleWith(other: Partitioning): Boolean

-  /** Returns the expressions that are used to key the partitioning. */
-  def keyExpressions: Seq[Expression]


Seems keyExpressions is not used at all and I do not remember when we added it. So, I am removing it.

SparkQA · 2015-07-30T04:45:00Z

Test build #38980 has finished for PR 7773 at commit 73913f7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PartitioningCollection(partitionings: Seq[Partitioning])

SparkQA · 2015-07-30T06:28:04Z

Test build #38989 has finished for PR 7773 at commit 2963857.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PartitioningCollection(partitionings: Seq[Partitioning])

JoshRosen · 2015-07-30T06:38:34Z

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

+
+  test("PartitioningCollection") {
+    // First, we disable broadcast join.
+    val origThreshold = conf.autoBroadcastJoinThreshold


This test could use the new SQLTestUtils withConf and withTempTable helper functions, I think.

I'll make this change now.
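For readers unfamiliar with this style of helper, the following is a rough sketch of the loan pattern such a function typically follows. It is not the SQLTestUtils implementation: the `conf` map, the `withConf` signature, and the initial threshold value are stand-ins assumed for illustration. The point is that the original setting is restored in a `finally` block, so a failing test body cannot leak a changed configuration into later tests.

```scala
import scala.collection.mutable

object WithConfSketch {
  // Hypothetical stand-in for a SQL configuration object.
  val conf: mutable.Map[String, String] =
    mutable.Map("spark.sql.autoBroadcastJoinThreshold" -> "10485760")

  // Sets the given pairs, runs `body`, and restores the previous values
  // even if `body` throws.
  def withConf[T](pairs: (String, String)*)(body: => T): T = {
    // Remember what each key held before (None if it was unset).
    val previous = pairs.map { case (k, _) => k -> conf.get(k) }
    pairs.foreach { case (k, v) => conf(k) = v }
    try body
    finally previous.foreach {
      case (k, Some(v)) => conf(k) = v
      case (k, None)    => conf.remove(k)
    }
  }

  def main(args: Array[String]): Unit = {
    withConf("spark.sql.autoBroadcastJoinThreshold" -> "-1") {
      println(s"inside: ${conf("spark.sql.autoBroadcastJoinThreshold")}")
    }
    println(s"after: ${conf("spark.sql.autoBroadcastJoinThreshold")}")
  }
}
```

Compared with the manual `origThreshold` save and restore in the diff above, this shape keeps the set and reset steps in one place.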

JoshRosen · 2015-07-30T18:23:34Z

Jenkins, retest this please.

JoshRosen · 2015-07-30T19:37:36Z

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

+  test("PartitioningCollection") {
+    // First, we disable broadcast join.
+    val origThreshold = conf.autoBroadcastJoinThreshold
+    setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, 0)


I think that you have to set AUTO_BROADCASTJOIN_THRESHOLD to -1 to disable broadcast, not 0.

Actually, it looks like the implementation might be out of sync w.r.t. the docs for AUTO_BROADCASTJOIN_THRESHOLD...

SparkQA · 2015-07-30T20:05:10Z

Test build #39080 has finished for PR 7773 at commit 2963857.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PartitioningCollection(partitionings: Seq[Partitioning])

SparkQA · 2015-07-30T20:57:06Z

Test build #39089 has finished for PR 7773 at commit 8104ea8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PartitioningCollection(partitionings: Seq[Partitioning])

JoshRosen · 2015-07-30T21:09:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

@@ -122,7 +127,10 @@ case object SinglePartition extends Partitioning {
    case _ => false
  }

-  override def keyExpressions: Seq[Expression] = Nil
+  override def guarantees(other: Partitioning): Boolean = other match {


Shouldn't SinglePartition technically guarantee any partitioning which produces a single partition, such as HashPartitioning with a single partition? I guess that hash partitioning with one partition shouldn't ever occur, though.
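As an illustration of the behavior being suggested here (not of what this PR implements), a toy version of `guarantees` could treat any one-partition scheme as satisfied by a single partition. The `Toy*` types below are hypothetical and only sketch the question.

```scala
object SinglePartitionSketch {
  // Hypothetical toy types; they only illustrate the question raised above.
  sealed trait ToyPartitioning {
    def numPartitions: Int
    def guarantees(other: ToyPartitioning): Boolean
  }

  case object ToySinglePartition extends ToyPartitioning {
    val numPartitions = 1
    // With one partition, all rows are co-located, so any scheme that
    // also produces exactly one partition is trivially guaranteed.
    def guarantees(other: ToyPartitioning): Boolean = other.numPartitions == 1
  }

  final case class ToyHashPartitioning(keys: Seq[String], numPartitions: Int)
      extends ToyPartitioning {
    def guarantees(other: ToyPartitioning): Boolean = other match {
      case ToyHashPartitioning(k, n) => k == keys && n == numPartitions
      case _                         => false
    }
  }

  def main(args: Array[String]): Unit = {
    println(ToySinglePartition.guarantees(ToyHashPartitioning(Seq("a"), 1)))
    println(ToySinglePartition.guarantees(ToyHashPartitioning(Seq("a"), 10)))
  }
}
```

Under this sketch a single partition guarantees a one-partition hash partitioning but not a ten-partition one.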

SparkQA · 2015-07-30T22:12:03Z

Test build #39092 has finished for PR 7773 at commit 8acac75.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PartitioningCollection(partitionings: Seq[Partitioning])

JoshRosen · 2015-07-30T22:23:12Z

Let's wait and see if we can merge #7807 to remove compatibleWith, since that would avoid the potential for confusion between compatibleWith and guarantees.

… from Exchange

While reviewing yhuai's patch for SPARK-2205 (#7773), I noticed that Exchange's `compatible` check may be incorrectly returning `false` in many cases. As far as I know, this is not actually a problem because the `compatible`, `meetsRequirements`, and `needsAnySort` checks are serving only as short-circuit performance optimizations that are not necessary for correctness. In order to reduce code complexity, I think that we should remove these checks and unconditionally rewrite the operator's children. This should be safe because we rewrite the tree in a single bottom-up pass.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7807 from JoshRosen/SPARK-9489 and squashes the following commits:

9d76ce9 [Josh Rosen] [SPARK-9489] Remove compatibleWith, meetsRequirements, and needsAnySort checks from Exchange


SparkQA · 2015-07-31T02:34:19Z

Test build #39131 has finished for PR 7773 at commit 5c45924.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-08-01T18:13:42Z

Jenkins, retest this please.

yhuai · 2015-08-01T18:26:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+   * guarantees the same partitioning scheme described by `other`.
+   */
+  // TODO: Add an example once we have the `nullSafe` concept.
+  def guarantees(other: Partitioning): Boolean


Do you think semanticEqual is a better name? I think this method is basically doing an equality check. For example, if other is a HashPartitioning('a :: Nil, 10) and this is a SinglePartition, we probably do not want to return true, because the parent of this operator can be a join and the sibling of this operator can be HashPartitioned.

I think that we should only consider a name like semanticEqual or semanticEquiv if a.guarantees(b) implies b.guarantees(a) and vice-versa.

Yeah, makes sense. Then semanticEqual is not a good name, because once we have the concept of nullSafe, this method will not have the commutative property: nullSafe hash partitioning can be treated as nullUnsafe hash partitioning, but not the other way around.

Yep. Let's leave this for now.
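The asymmetry discussed in this thread can be sketched with a toy type. The nullSafe concept does not exist in this PR, so the `nullSafe` flag and the `guarantees` rule below are purely hypothetical, assumed here only to show why a.guarantees(b) need not imply b.guarantees(a).

```scala
object GuaranteesAsymmetrySketch {
  // Hypothetical: `nullSafe = true` means rows with null keys are also clustered.
  final case class ToyHashPartitioning(
      keys: Seq[String],
      numPartitions: Int,
      nullSafe: Boolean) {
    // A null-safe layout can stand in for a null-unsafe one, but not vice versa.
    def guarantees(other: ToyHashPartitioning): Boolean =
      keys == other.keys &&
        numPartitions == other.numPartitions &&
        (nullSafe || !other.nullSafe)
  }

  val safe: ToyHashPartitioning   = ToyHashPartitioning(Seq("a"), 10, nullSafe = true)
  val unsafe: ToyHashPartitioning = ToyHashPartitioning(Seq("a"), 10, nullSafe = false)

  def main(args: Array[String]): Unit = {
    println(s"safe guarantees unsafe: ${safe.guarantees(unsafe)}")
    println(s"unsafe guarantees safe: ${unsafe.guarantees(safe)}")
  }
}
```

Because the relation holds in one direction only, a name that suggests equality would be misleading, which matches the conclusion reached above.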

SparkQA · 2015-08-01T19:53:18Z

Test build #39370 has finished for PR 7773 at commit 5c45924.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PartitioningCollection(partitionings: Seq[Partitioning])

JoshRosen · 2015-08-02T21:23:57Z

Jenkins, retest this please.

SparkQA · 2015-08-02T22:47:49Z

Test build #39444 has finished for PR 7773 at commit 5c45924.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PartitioningCollection(partitionings: Seq[Partitioning])

yhuai · 2015-08-02T22:57:12Z

The failed test (HiveCompatibilitySuite's semijoin) is tracked by https://issues.apache.org/jira/browse/SPARK-9482.

yhuai · 2015-08-02T23:50:58Z

test this please

yhuai · 2015-08-03T00:46:01Z

@JoshRosen If you think changes in this PR are good, how about we merge it once it passes tests?

SparkQA · 2015-08-03T01:28:05Z

Test build #39474 has finished for PR 7773 at commit 5c45924.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class PartitioningCollection(partitionings: Seq[Partitioning])

yhuai · 2015-08-03T03:43:13Z

OK. I am merging it to master.

JoshRosen · 2015-08-03T03:58:48Z

Yeah, sorry for timing out here. Consider this a post-hoc LGTM.

yhuai and others added 11 commits July 27, 2015 14:15

Filter out rows that will not be joined in equal joins early.

2201129

Do not add unnessary filters.

d5b84c3

Introduce NullSafeHashPartitioning and NullUnsafePartitioning.

69bb072

Bug fix and refactoring.

7c2d2d8

wip

e616d3b

Add PartitioningCollection.

c6667e7

Style

f9516b0

First round of cleanup.

d3d2e64

Bug fix.

c57a954

Merge remote-tracking branch 'origin/master' into multi-way-join-planning-improvements

247e5fa

Carve out only SPARK-2205 changes.

884ab95

Delete unrelated expression change

4a99204

JoshRosen force-pushed the multi-way-join-planning-improvements branch from 801b807 to 4a99204 Compare July 30, 2015 03:26

yhuai reviewed Jul 30, 2015

Add comments and test. Also, revert the change in ShuffledHashOuterJoin for now.

73913f7

yhuai reviewed Jul 30, 2015

Revert unnecessary SqlConf change.

2963857

JoshRosen changed the title ~~[SPARK-2205] [SQL] [WIP] Avoid unnecessary exchange operators in multi-way joins~~ [SPARK-2205] [SQL] Avoid unnecessary exchange operators in multi-way joins Jul 30, 2015

JoshRosen reviewed Jul 30, 2015

JoshRosen mentioned this pull request Jul 30, 2015

[SPARK-9489] Remove unnecessary compatibility and requirements checks from Exchange #7807

Closed

Refactor test to use SQLTestUtils

cd8269b

JoshRosen force-pushed the multi-way-join-planning-improvements branch 2 times, most recently from 58b27eb to 0c18da3 Compare July 31, 2015 00:53

Merge remote-tracking branch 'origin/master' into multi-way-join-planning-improvements

5c45924

JoshRosen force-pushed the multi-way-join-planning-improvements branch from 0c18da3 to 5c45924 Compare July 31, 2015 00:53

JoshRosen mentioned this pull request Aug 1, 2015

[SPARK-7871][SQL]Improve the outputPartitioning for HashOuterJoin #6413

Closed

3 tasks

yhuai reviewed Aug 1, 2015

asfgit closed this in 114ff92 Aug 3, 2015

JoshRosen deleted the multi-way-join-planning-improvements branch August 3, 2015 03:59
