
[SPARK-17080][SQL][FOLLOWUP] Improve documentation, change buildJoin method structure and add a debug log #17353

Closed
wants to merge 6 commits into from

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Mar 20, 2017

What changes were proposed in this pull request?

  1. Improve documentation for class Cost and JoinReorderDP and method buildJoin().
  2. Change code structure of buildJoin() to make the logic clearer.
  3. Add a debug-level log to record information for join reordering, including time cost, the number of items and the number of plans in memo.

How was this patch tested?

Not related.

@wzhfy
Contributor Author

wzhfy commented Mar 20, 2017

cc @gatorsmile

* @param conf SQLConf for statistics computation.
* @param conditions The overall set of join conditions.
* @param topOutput The output attributes of the final plan.
* @return Return a new JoinPlan if the two sides can be joined with some conditions. Otherwise,
@gatorsmile gatorsmile Mar 20, 2017

How about?

Builds and returns a new JoinPlan if there exists at least one join condition involving references from both left and right. Otherwise, returns None.

@@ -201,7 +201,16 @@ object JoinReorderDP extends PredicateHelper {
nextLevel.toMap
}

/** Build a new join node. */
/**
* Builds a new JoinPlan if the two given sides can be joined with some conditions.
Member

Let us reword it too.

Builds a new JoinPlan if there is a join predicate connecting two given sides.

@gatorsmile
Member

Could you also update the description of object JoinReorderDP based on the recent update in #17286?

@gatorsmile
Member

Could we move the check if (oneSidePlan.itemIds.intersect(otherSidePlan.itemIds).isEmpty) into buildJoin?

@gatorsmile
Member

Replace the following code using pattern matching:

if (joinPlan.isDefined) {
  val newJoinPlan = joinPlan.get
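The suggested rewrite might look roughly like the sketch below. This is a hypothetical illustration, not the final Spark source: JoinPlan is stubbed with a single cost field, and pickBetter stands in for the surrounding selection logic.

```scala
// Hypothetical sketch of the pattern-match rewrite suggested above.
// JoinPlan is a stand-in; in Spark it wraps the candidate plan and its cost.
case class JoinPlan(planCost: Int)

def pickBetter(joinPlan: Option[JoinPlan], best: JoinPlan): JoinPlan =
  joinPlan match {
    // One match expression replaces the isDefined check plus the .get call
    case Some(newJoinPlan) if newJoinPlan.planCost < best.planCost => newJoinPlan
    case _ => best
  }
```

Matching on Some/None avoids the unsafe Option.get and keeps the "is there a better plan?" decision in a single expression.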

@SparkQA

SparkQA commented Mar 20, 2017

Test build #74845 has finished for PR 17353 at commit 7598eb8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Cost(card: BigInt, size: BigInt)

@gatorsmile
Member

gatorsmile commented Mar 20, 2017

Does it make sense to introduce some counts to record how many new JoinPlans we build?

That way we can find out how much we pruned during the search, which makes it easy to ensure no regression happens when we improve the code in the future.

Also cc @cloud-fan

@SparkQA

SparkQA commented Mar 20, 2017

Test build #74864 has started for PR 17353 at commit 65b2b5b.

@@ -202,14 +201,15 @@ object JoinReorderDP extends PredicateHelper {
}

/**
* Builds a new JoinPlan if the two given sides can be joined with some conditions.
* Builds a new JoinPlan if there exists at least one join condition involving references from
* both left and right.
Member

Builds a new JoinPlan when both conditions hold
- the sets of items contained in both left and right sides do not overlap 
- there exists at least one join condition involving references from both sides

Contributor Author

Great! Thanks.
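The two conditions agreed on above might be sketched as a single guard. This is an illustrative stand-in only: the item sets and the condition flag are simplified placeholders for Spark's actual JoinPlan item IDs and join-condition handling.

```scala
// Illustrative guard for buildJoin, combining the two conditions:
// item-set disjointness and the existence of a connecting join condition.
def shouldBuildJoin(
    oneSideItemIds: Set[Int],
    otherSideItemIds: Set[Int],
    hasConnectingCondition: Boolean): Boolean = {
  // Condition 1: the item sets of the two sides must not overlap,
  // otherwise the two sub-plans share a base relation.
  val disjoint = oneSideItemIds.intersect(otherSideItemIds).isEmpty
  // Condition 2: at least one join condition must reference both sides,
  // which prunes cartesian-product candidates from the search space.
  disjoint && hasConnectingCondition
}
```

Returning early when either condition fails is what keeps the dynamic-programming search from enumerating overlapping or cartesian-product plans.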

* @return Return a new JoinPlan if the two sides can be joined with some conditions. Otherwise,
* return None.
* @return Builds and returns a new JoinPlan if there exists at least one join condition
* involving references from both left and right. Otherwise, returns None.
Member

Now, we can simplify it to Builds and returns a new JoinPlan if both conditions hold. Otherwise, returns None.

@@ -202,14 +201,15 @@ object JoinReorderDP extends PredicateHelper {
}

/**
* Builds a new JoinPlan if the two given sides can be joined with some conditions.
* Builds a new JoinPlan if there exists at least one join condition involving references from
* both left and right.
* @param oneJoinPlan One side JoinPlan for building a new JoinPlan.
* @param otherJoinPlan The other side JoinPlan for building a new join node.
Member

Do you want to rename them to leftPlan and rightPlan?

Contributor Author

The left and right sides are decided inside this method; it tends to build a left-deep tree.

@wzhfy
Contributor Author

wzhfy commented Mar 20, 2017

Where should we check the count? To whom do we want to expose it?
How about a debug-level log?

@cloud-fan
Contributor

debug log SGTM

@gatorsmile
Member

Yeah. The counts can help us understand the pruning rate of the search space. When CBO join reordering is very slow, we can check the counts.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74934 has finished for PR 17353 at commit 2a9ba46.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74945 has started for PR 17353 at commit 71459c5.

@@ -152,6 +156,10 @@ object JoinReorderDP extends PredicateHelper {
foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
}

// val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
// logDebug(s"Join reordering finished. Duration: $durationInMs ms, number of items: " +
// s"${items.length}, number of plans in memo: ${foundPlans.map(_.size).sum}")
Member

?

Contributor Author

Oops, I forgot to restore them...

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74947 has started for PR 17353 at commit af511b2.

@gatorsmile
Member

Please update PR description and title.

LGTM pending Jenkins

@wzhfy wzhfy changed the title [SPARK-17080][SQL][FOLLOWUP] Improve documentation and naming for methods/variables [SPARK-17080][SQL][FOLLOWUP] Improve documentation, change buildJoin method structure and add a debug log Mar 21, 2017
@wzhfy
Contributor Author

wzhfy commented Mar 21, 2017

retest this please

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74963 has finished for PR 17353 at commit af511b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@wzhfy
Contributor Author

wzhfy commented Mar 21, 2017

retest this please...

* We also prune cartesian product candidates when building a new plan if there exists no join
* condition involving references from both left and right. This pruning strategy significantly
* reduces the search space.
* For example, given A J B J C J D, plans maintained for each level will be as follows:
Contributor

will be -> may be?


def search(
    conf: SQLConf,
    items: Seq[LogicalPlan],
    conditions: Set[Expression],
    topOutput: AttributeSet): LogicalPlan = {

  val startTime = System.nanoTime()
Contributor

Use System.currentTimeMillis if we only care about ms-level precision.

Contributor Author

nanoTime() is more reliable than currentTimeMillis(): https://github.com/databricks/scala-style-guide#misc_currentTimeMillis_vs_nanoTime
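The timing pattern from the patch could be factored as a small helper, sketched below under the same reasoning: System.nanoTime is monotonic, so differences between two readings are safe even if the wall clock is adjusted mid-run, whereas currentTimeMillis is not. The helper name timeMs is hypothetical, not from the Spark source.

```scala
// Sketch of the nanoTime-based duration measurement used in the patch.
// Returns the body's result together with the elapsed time in milliseconds.
def timeMs[T](body: => T): (T, Long) = {
  val startTime = System.nanoTime()
  val result = body
  // Convert nanoseconds to milliseconds, as in the patch's debug log
  val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
  (result, durationInMs)
}
```

With such a helper, the debug log can report the duration of the whole search without sprinkling start/end timestamps through the method body.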

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74978 has finished for PR 17353 at commit af511b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74977 has finished for PR 17353 at commit af511b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74980 has finished for PR 17353 at commit 40af14c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 14865d7 Mar 21, 2017
4 participants