[SPARK-21322][SQL][followup] support histogram in filter cardinality estimation #19952

cloud-fan · 2017-12-12T14:37:47Z

What changes were proposed in this pull request?

Some code cleanup, refactoring, and naming improvements.

How was this patch tested?

Existing tests.

cloud-fan · 2017-12-12T14:38:06Z

cc @ron8hu @wzhfy

cloud-fan · 2017-12-12T14:39:33Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

-   *
-   * @param value a literal value of a column
-   * @param bins an array of bins for a given numeric equi-height histogram
-   * @return the id of the first bin into which a column value falls.


Now it's a private method and we can omit the parameter doc as it's trivial.

cloud-fan · 2017-12-12T14:41:08Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

+   * @param upperBound the highest value of the given range
+   * @param upperBoundInclusive whether the upperBound is included in the range
+   * @param lowerBound the lowest value of the given range
+   * @param lowerBoundInclusive whether the lowerBound is included in the range


Instead of asking callers to pass in the upper and lower bin ids, it's more intuitive to pass whether the range boundaries are inclusive.
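To make the effect of the inclusive flags concrete, here is a minimal sketch, not the code in this patch. The parameter names follow the diff above, but the `Bin` type, the `RangeBinsSketch` object, and the whole method body are hypothetical. It assumes values are spread uniformly inside a bin, and that a zero-width bin (one whose `lo` equals `hi`, as can happen with heavily repeated values) counts only when its single value lies inside the range once the boundary flags are taken into account.

```scala
// Hypothetical, simplified bin type for illustration; not Spark's HistogramBin.
final case class Bin(lo: Double, hi: Double)

object RangeBinsSketch {
  /** Number of bins (possibly fractional) overlapped by the given range. */
  def numBinsHoldingRange(
      upperBound: Double,
      upperBoundInclusive: Boolean,
      lowerBound: Double,
      lowerBoundInclusive: Boolean,
      bins: Seq[Bin]): Double = {
    require(lowerBound <= upperBound, "lowerBound must not exceed upperBound")
    bins.map { bin =>
      if (bin.lo == bin.hi) {
        // Zero-width bin: every value in it equals bin.lo, so it counts only
        // when that single value lies inside the range, boundaries considered.
        val aboveLower = bin.lo > lowerBound || (lowerBoundInclusive && bin.lo == lowerBound)
        val belowUpper = bin.lo < upperBound || (upperBoundInclusive && bin.lo == upperBound)
        if (aboveLower && belowUpper) 1.0 else 0.0
      } else {
        // Regular bin: prorate by the overlapped width (uniformity assumption).
        val overlap = math.min(upperBound, bin.hi) - math.max(lowerBound, bin.lo)
        math.max(overlap, 0.0) / (bin.hi - bin.lo)
      }
    }.sum
  }
}
```

With bins `[0, 10]`, `[10, 10]`, `[10, 10]`, `[10, 20]`, the range `[0, 10]` covers three bins in this sketch, while `[0, 10)` covers only one. Under these assumptions, that is the kind of difference the caller can now express with a Boolean instead of a bin id.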

SparkQA · 2017-12-12T17:12:11Z

Test build #84773 has finished for PR 19952 at commit ebcd6d1.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-12-12T19:07:04Z

retest this please

SparkQA · 2017-12-12T21:58:41Z

Test build #84784 has finished for PR 19952 at commit ebcd6d1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy

LGTM except for some minor comments.

wzhfy · 2017-12-13T02:16:23Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

-   * @param lowerId id of the low end bin holding the low end value of a column range
-   * @param higherEnd a given upper bound value of a specified column value range
-   * @param lowerEnd a given lower bound value of a specified column value range
+   * Note that the return value is double type, because the range boundaries usually occupy a


nit: returned value

wzhfy · 2017-12-13T02:17:46Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

-   * @param lowerEnd a given lower bound value of a specified column value range
+   * Note that the return value is double type, because the range boundaries usually occupy a
+   * portion of a bin. An extrema case is [value, value] which is generated by equal predicate
+   * `col = value`, we can get more accuracy by allowing returning portion of histogram bins.


nit: get higher accuracy
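The docstring's point about returning a fraction of a bin can be sketched as follows. This is an illustration under stated assumptions, not the patch's implementation: `NdvBin`, its fields, and `binFraction` are hypothetical names, and the rule that a point range `[value, value]` occupies `1 / ndv` of a bin assumes the bin's distinct values are equally frequent.

```scala
// Hypothetical bin carrying a distinct-value count; names are illustrative only.
final case class NdvBin(lo: Double, hi: Double, ndv: Long)

object PointRangeSketch {
  /** Fraction of `bin` occupied by [lowerBound, upperBound], which must lie inside the bin. */
  def binFraction(lowerBound: Double, upperBound: Double, bin: NdvBin): Double = {
    require(bin.ndv > 0, "a bin must hold at least one distinct value")
    require(bin.lo <= lowerBound && lowerBound <= upperBound && upperBound <= bin.hi)
    if (bin.lo == bin.hi) {
      // The bin holds a single value, so any range inside it covers it fully.
      1.0
    } else if (lowerBound == upperBound) {
      // Point range, e.g. from `col = value`: one of the bin's ndv distinct values.
      1.0 / bin.ndv
    } else {
      // Ordinary range: prorate by width (uniformity assumption).
      (upperBound - lowerBound) / (bin.hi - bin.lo)
    }
  }
}
```

An integer bin count would have to round a point range to either 0 or 1 bin. Returning a `Double` lets an equality predicate contribute a small, nonzero share of the bin.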

wzhfy · 2017-12-13T02:28:54Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

-   * @param min the lower bound of the current valid range for a given column
-   * @param datumNumber the numeric value of a literal
-   * @return the selectivity percentage for a condition in the current range.
+   * Computes the possibility of a equal predicate using histogram.


nit: an equality predicate

wzhfy · 2017-12-13T02:34:24Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

+      upperBound: Double,
+      upperBoundInclusive: Boolean,
+      lowerBound: Double,
+      lowerBoundInclusive: Boolean,
      histogram: Histogram): Double = {


Would it be better to pass the bin array instead of the histogram? That would simplify the many histogram.bins references here.

wzhfy · 2017-12-13T02:41:21Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

+    val numBinsHoldingEntireRange = EstimationUtils.numBinsHoldingRange(
+      max, upperBoundInclusive = true, min, lowerBoundInclusive = true, histogram)
+
+    val numBinsHoldingDatum = op match {


numBinsHoldingRange

ron8hu · 2017-12-13T03:08:36Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

+   * [lowerBound, upperBound].
+   *
+   * Note that the returned value is double type, because the range boundaries usually occupy a
+   * portion of a bin. An extrema case is [value, value] which is generated by equal predicate


typo: extreme

SparkQA · 2017-12-13T05:33:54Z

Test build #84822 has finished for PR 19952 at commit 8fe0c49.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-13T06:24:33Z

Test build #84827 has finished for PR 19952 at commit 4e35c43.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-12-13T06:49:25Z

thanks, merging to master!

code cleanup

ebcd6d1

cloud-fan force-pushed the minor branch from 3f1f3b1 to ebcd6d1 Compare December 12, 2017 14:38

wzhfy reviewed Dec 13, 2017

address comments

8fe0c49

ron8hu suggested changes Dec 13, 2017

fix typo

4e35c43

asfgit closed this in bdb5e55 Dec 13, 2017
