
[SPARK-17080][SQL][FOLLOWUP] Improve documentation, change buildJoin method structure and add a debug log #17353

Closed
wants to merge 6 commits into from

Conversation

wzhfy
Contributor

@wzhfy wzhfy commented Mar 20, 2017

What changes were proposed in this pull request?

  1. Improve documentation for class Cost and JoinReorderDP and method buildJoin().
  2. Change code structure of buildJoin() to make the logic clearer.
  3. Add a debug-level log to record information for join reordering, including time cost, the number of items and the number of plans in memo.

How was this patch tested?

Not related.

@wzhfy
Contributor Author

wzhfy commented Mar 20, 2017

cc @gatorsmile

* @param conf SQLConf for statistics computation.
* @param conditions The overall set of join conditions.
* @param topOutput The output attributes of the final plan.
* @return Return a new JoinPlan if the two sides can be joined with some conditions. Otherwise,
@gatorsmile gatorsmile Mar 20, 2017

How about?

Builds and returns a new JoinPlan if there exists at least one join condition involving references from both left and right. Otherwise, returns None.

@@ -201,7 +201,16 @@ object JoinReorderDP extends PredicateHelper {
nextLevel.toMap
}

/** Build a new join node. */
/**
* Builds a new JoinPlan if the two given sides can be joined with some conditions.
Member

Let us reword it too.

Builds a new JoinPlan if there is a join predicate connecting two given sides.

@gatorsmile
Member

Could you also update the description of object JoinReorderDP based on the recent update in #17286?

@gatorsmile
Member

Could we move the check if (oneSidePlan.itemIds.intersect(otherSidePlan.itemIds).isEmpty) into buildJoin?

@gatorsmile
Member

Replace the following code using pattern matching:

if (joinPlan.isDefined) {
  val newJoinPlan = joinPlan.get
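The suggested rewrite might look roughly like the sketch below. This is a hypothetical illustration, not the final Spark source: JoinPlan is stubbed with a single cost field, and pickBetter stands in for the surrounding selection logic.

```scala
// Hypothetical sketch of the pattern-match rewrite suggested above.
// JoinPlan is a stand-in; in Spark it wraps the candidate plan and its cost.
case class JoinPlan(planCost: Int)

def pickBetter(joinPlan: Option[JoinPlan], best: JoinPlan): JoinPlan =
  joinPlan match {
    // One match expression replaces the isDefined check plus the .get call
    case Some(newJoinPlan) if newJoinPlan.planCost < best.planCost => newJoinPlan
    case _ => best
  }
```

Matching on Some/None avoids the unsafe Option.get and keeps the "is there a better plan?" decision in a single expression.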

@SparkQA

SparkQA commented Mar 20, 2017

Test build #74845 has finished for PR 17353 at commit 7598eb8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Cost(card: BigInt, size: BigInt)

@gatorsmile
Member

gatorsmile commented Mar 20, 2017

Does it make sense to introduce some counts to record how many new JoinPlans we build?

That way we can find out how much we pruned during the search, which makes it easy to ensure no regression happens when we improve the code in the future.

Also cc @cloud-fan

@SparkQA

SparkQA commented Mar 20, 2017

Test build #74864 has started for PR 17353 at commit 65b2b5b.

@@ -202,14 +201,15 @@ object JoinReorderDP extends PredicateHelper {
}

/**
* Builds a new JoinPlan if the two given sides can be joined with some conditions.
* Builds a new JoinPlan if there exists at least one join condition involving references from
* both left and right.
Member

Builds a new JoinPlan when both conditions hold
- the sets of items contained in both left and right sides do not overlap 
- there exists at least one join condition involving references from both sides

Contributor Author

Great! Thanks.
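The two conditions agreed on above might be sketched as a single guard. This is an illustrative stand-in only: the item sets and the condition flag are simplified placeholders for Spark's actual JoinPlan item IDs and join-condition handling.

```scala
// Illustrative guard for buildJoin, combining the two conditions:
// item-set disjointness and the existence of a connecting join condition.
def shouldBuildJoin(
    oneSideItemIds: Set[Int],
    otherSideItemIds: Set[Int],
    hasConnectingCondition: Boolean): Boolean = {
  // Condition 1: the item sets of the two sides must not overlap,
  // otherwise the two sub-plans share a base relation.
  val disjoint = oneSideItemIds.intersect(otherSideItemIds).isEmpty
  // Condition 2: at least one join condition must reference both sides,
  // which prunes cartesian-product candidates from the search space.
  disjoint && hasConnectingCondition
}
```

Returning early when either condition fails is what keeps the dynamic-programming search from enumerating overlapping or cartesian-product plans.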

* @return Return a new JoinPlan if the two sides can be joined with some conditions. Otherwise,
* return None.
* @return Builds and returns a new JoinPlan if there exists at least one join condition
* involving references from both left and right. Otherwise, returns None.
Member

Now, we can simplify it to Builds and returns a new JoinPlan if both conditions hold. Otherwise, returns None.

@@ -202,14 +201,15 @@ object JoinReorderDP extends PredicateHelper {
}

/**
* Builds a new JoinPlan if the two given sides can be joined with some conditions.
* Builds a new JoinPlan if there exists at least one join condition involving references from
* both left and right.
* @param oneJoinPlan One side JoinPlan for building a new JoinPlan.
* @param otherJoinPlan The other side JoinPlan for building a new join node.
Member

Do you want to rename them to leftPlan and rightPlan?

Contributor Author

The left and right sides are decided inside this method; it tends to build a left-deep tree.

@wzhfy
Contributor Author

wzhfy commented Mar 20, 2017

Where should we check the count? To whom do we want to expose it?
How about a debug-level log?

@cloud-fan
Contributor

debug log SGTM

@gatorsmile
Member

Yeah. The counts can help us understand the pruning rate of the search space. When CBO join reordering is very slow, we can check the counts.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74934 has finished for PR 17353 at commit 2a9ba46.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74945 has started for PR 17353 at commit 71459c5.

@@ -152,6 +156,10 @@ object JoinReorderDP extends PredicateHelper {
foundPlans += searchLevel(foundPlans, conf, conditions, topOutput)
}

// val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
// logDebug(s"Join reordering finished. Duration: $durationInMs ms, number of items: " +
// s"${items.length}, number of plans in memo: ${foundPlans.map(_.size).sum}")
Member

?

Contributor Author

Oops, I forgot to restore them...

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74947 has started for PR 17353 at commit af511b2.

@gatorsmile
Member

Please update PR description and title.

LGTM pending Jenkins

@wzhfy wzhfy changed the title [SPARK-17080][SQL][FOLLOWUP] Improve documentation and naming for methods/variables [SPARK-17080][SQL][FOLLOWUP] Improve documentation, change buildJoin method structure and add a debug log Mar 21, 2017
@wzhfy
Contributor Author

wzhfy commented Mar 21, 2017

retest this please

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74963 has finished for PR 17353 at commit af511b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@wzhfy
Contributor Author

wzhfy commented Mar 21, 2017

retest this please...

* We also prune cartesian product candidates when building a new plan if there exists no join
* condition involving references from both left and right. This pruning strategy significantly
* reduces the search space.
* For example, given A J B J C J D, plans maintained for each level will be as follows:
Contributor

will be -> may be?


def search(
    conf: SQLConf,
    items: Seq[LogicalPlan],
    conditions: Set[Expression],
    topOutput: AttributeSet): LogicalPlan = {

  val startTime = System.nanoTime()
Contributor

Use System.currentTimeMillis if we only care about ms-level precision.

Contributor Author

nanoTime() is more reliable than currentTimeMillis(): https://github.com/databricks/scala-style-guide#misc_currentTimeMillis_vs_nanoTime
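The timing pattern from the patch could be factored as a small helper, sketched below under the same reasoning: System.nanoTime is monotonic, so differences between two readings are safe even if the wall clock is adjusted mid-run, whereas currentTimeMillis is not. The helper name timeMs is hypothetical, not from the Spark source.

```scala
// Sketch of the nanoTime-based duration measurement used in the patch.
// Returns the body's result together with the elapsed time in milliseconds.
def timeMs[T](body: => T): (T, Long) = {
  val startTime = System.nanoTime()
  val result = body
  // Convert nanoseconds to milliseconds, as in the patch's debug log
  val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
  (result, durationInMs)
}
```

With such a helper, the debug log can report the duration of the whole search without sprinkling start/end timestamps through the method body.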

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74978 has finished for PR 17353 at commit af511b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74977 has finished for PR 17353 at commit af511b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 21, 2017

Test build #74980 has finished for PR 17353 at commit 40af14c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 14865d7 Mar 21, 2017
4 participants