[SPARK-21984] [SQL] Join estimation based on equi-height histogram #19594

wzhfy · 2017-10-28T01:54:03Z

What changes were proposed in this pull request?

Equi-height histograms are one of the state-of-the-art statistics for cardinality estimation. They provide better estimation accuracy and handle skewed data well.

This PR improves join estimation by using equi-height histograms. It differs from the basic estimation (based on ndv) in how it computes the join cardinality and the new ndv after the join.

The main idea is as follows:

find overlapped ranges between two histograms from two join keys;
apply the formula T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)) in each overlapped range.
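A minimal sketch of how the formula might be applied per overlapped range and then summed. The `OverlappedRange` here is a simplified stand-in whose field names are assumptions for illustration, and `rangeCardinality` / `joinCardinality` are hypothetical helper names, not the methods in this patch.

```scala
object JoinCardinalitySketch {
  // Simplified stand-in for the OverlappedRange class added by this PR;
  // the field names are assumptions for illustration.
  final case class OverlappedRange(
      lo: Double,
      hi: Double,
      leftNdv: Double,
      rightNdv: Double,
      leftNumRows: Double,
      rightNumRows: Double)

  // T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1)), restricted to one range.
  def rangeCardinality(r: OverlappedRange): Double =
    r.leftNumRows * r.rightNumRows / math.max(r.leftNdv, r.rightNdv)

  // The join estimate is the sum of the per-range estimates.
  def joinCardinality(ranges: Seq[OverlappedRange]): Double =
    ranges.map(rangeCardinality).sum
}
```

Restricting the formula to a range means a heavily populated value only inflates the estimate for the ranges that actually contain it, which is where the benefit for skewed data comes from.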

How was this patch tested?

Added new test cases.

SparkQA · 2017-10-28T04:58:18Z

Test build #83146 has finished for PR 19594 at commit 67bd651.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class OverlappedRange(

SparkQA · 2017-11-14T08:05:01Z

Test build #83825 has finished for PR 19594 at commit 96776ce.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class OverlappedRange(

SparkQA · 2017-11-15T04:07:37Z

Test build #83871 has finished for PR 19594 at commit 8b2084a.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class OverlappedRange(

SparkQA · 2017-11-15T04:39:58Z

Test build #83870 has finished for PR 19594 at commit 67bd651.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
case class OverlappedRange(

wzhfy · 2017-11-15T06:41:38Z

cc @cloud-fan @gatorsmile @ron8hu

ron8hu · 2017-11-29T00:35:22Z

...alyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala

+    val histogram1 = Histogram(height = 300, Array(
+      HistogramBin(lo = 30, hi = 30, ndv = 1), HistogramBin(lo = 30, hi = 60, ndv = 30)))
+    val histogram2 = Histogram(height = 100, Array(
+      HistogramBin(lo = 0, hi = 50, ndv = 50), HistogramBin(lo = 50, hi = 100, ndv = 40)))


Histograms are supposed to handle skewed distributions effectively. In this test case, histogram2 has a skewed distribution, as one bin has only one distinct value. Can you add a test case in which both join columns have skewed distributions? That is, both join columns have at least one bin with one distinct value each.

OK, I've added test cases for joins of skewed histograms (same skewed value and different skewed values).

ron8hu · 2017-11-29T00:38:18Z

...alyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala

+    val histogram1 = Histogram(height = 300, Array(
+      HistogramBin(lo = 50, hi = 60, ndv = 10), HistogramBin(lo = 60, hi = 75, ndv = 3)))
+    val histogram2 = Histogram(height = 100, Array(
+      HistogramBin(lo = 0, hi = 50, ndv = 50), HistogramBin(lo = 50, hi = 100, ndv = 40)))


For very skewed cases, multiple bins in a histogram may have the same distinct value. We may add one more test case to cover this situation.

OK, I've added such a test case.

SparkQA · 2017-12-09T03:33:26Z

Test build #84676 has finished for PR 19594 at commit e69e213.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-12-09T07:51:17Z

retest this please

SparkQA · 2017-12-09T08:05:01Z

Test build #84679 has finished for PR 19594 at commit e69e213.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-12-09T09:05:19Z

retest this please

SparkQA · 2017-12-09T11:38:53Z

Test build #84682 has finished for PR 19594 at commit e69e213.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-12-09T12:13:20Z

retest this please

SparkQA · 2017-12-09T15:08:23Z

Test build #84683 has finished for PR 19594 at commit e69e213.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-12-10T01:35:13Z

ping @cloud-fan

cloud-fan · 2017-12-12T14:52:57Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

+      leftHistogram: Histogram,
+      rightHistogram: Histogram,
+      newMin: Double,
+      newMax: Double): Seq[OverlappedRange] = {


how about upperBound/lowerBound? It's hard to understand the meaning of new by looking at this method.

max/min is also fine

yea I think upperBound/lowerBound is better.

cloud-fan · 2017-12-12T14:54:54Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

+      .filter(b => b.lo <= newMax && b.hi >= newMin)
+
+    leftBins.foreach { lb =>
+      rightBins.foreach { rb =>


nit:

for { leftBin <- leftBins; rightBin <- rightBins } yield { ... OverlappedRange ... }

Then we can omit val overlappedRanges = new ArrayBuffer[OverlappedRange]()

We only collect an OverlappedRange when the left part and the right part intersect, and that decision is based on some computation, so it's not very convenient to express it as a guard. The yield form therefore doesn't seem well suited to this case.
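For reference, one way the two styles could be reconciled is to move the computation into a helper that returns an Option and use it as a third generator. This is only a sketch: `Bin`, `Overlap`, `intersect` and `overlaps` are hypothetical names, and the intersection test is far simpler than the case analysis in this patch.

```scala
object OverlapCollectSketch {
  final case class Bin(lo: Double, hi: Double)
  final case class Overlap(lo: Double, hi: Double)

  // Hypothetical helper: Some(overlap) if the two bins intersect, None otherwise.
  def intersect(l: Bin, r: Bin): Option[Overlap] = {
    val lo = math.max(l.lo, r.lo)
    val hi = math.min(l.hi, r.hi)
    if (lo <= hi) Some(Overlap(lo, hi)) else None
  }

  // The Option generator drops non-intersecting pairs, so no ArrayBuffer is needed.
  def overlaps(left: Seq[Bin], right: Seq[Bin]): Seq[Overlap] =
    for {
      l <- left
      r <- right
      o <- intersect(l, r)
    } yield o
}
```

Whether this reads better than the foreach/ArrayBuffer version depends on how much per-case logic ends up inside the helper.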

cloud-fan · 2017-12-12T14:58:04Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

+      // --------+------------------+------------+-------------+------->
+      (min, max)
+    } else if (bin.lo <= min && bin.hi >= min) {
+      //       bin.lo              min        bin.hi


what if the max is after the bin.hi?

in this case, max is after the bin.hi, so the trimmed part is (min, bin.hi). I'll update the figure to indicate that.
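To make the branch order concrete, here is a small sketch of the range-trimming cases, consistent with the fragments quoted in this thread. `Bin` and `trimmedRange` are hypothetical names, and the sketch assumes the caller has already filtered out bins that do not intersect [lowerBound, upperBound].

```scala
object TrimRangeSketch {
  final case class Bin(lo: Double, hi: Double)

  // Returns the part of `bin` that lies inside [lowerBound, upperBound].
  def trimmedRange(bin: Bin, lowerBound: Double, upperBound: Double): (Double, Double) =
    if (bin.lo <= lowerBound && bin.hi >= upperBound) {
      // bin.lo   lowerBound   upperBound   bin.hi
      (lowerBound, upperBound)
    } else if (bin.lo <= lowerBound && bin.hi >= lowerBound) {
      // bin.lo   lowerBound   bin.hi   upperBound
      // The first branch did not match, so upperBound is after bin.hi here.
      (lowerBound, bin.hi)
    } else if (bin.lo <= upperBound && bin.hi >= upperBound) {
      // lowerBound   bin.lo   upperBound   bin.hi
      (bin.lo, upperBound)
    } else {
      // lowerBound   bin.lo   bin.hi   upperBound
      assert(bin.lo >= lowerBound && bin.hi <= upperBound)
      (bin.lo, bin.hi)
    }
}
```

Because the branches are checked in order, the second branch can only be reached when upperBound lies beyond bin.hi, which is why it returns (lowerBound, bin.hi).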

cloud-fan · 2017-12-12T15:07:43Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

+            // --------+------------------+------------+----------------+------->
+            val leftRatio = (left.hi - right.lo) / (left.hi - left.lo)
+            val rightRatio = (left.hi - right.lo) / (right.hi - right.lo)
+            if (leftRatio == 0) {


it's more understandable to write if (right.lo == left.hi)

cloud-fan · 2017-12-12T15:10:33Z

.../main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala

+        }
+        keyStatsAfterJoin += (
+          leftKey -> joinStat.copy(histogram = leftKeyStat.histogram),
+          rightKey -> joinStat.copy(histogram = rightKeyStat.histogram)


should we update the histogram after join?

Currently we don't update the histogram, since min/max tell us which bins are still valid, so correctness is not affected. Updating histograms would help reduce memory usage for histogram propagation, though. We can do this in both filter and join estimation in follow-up PRs.

Actually keeping it unchanged is more memory efficient. We just pass around pointers, but updating the histogram means creating a new one.

Let's keep it, and add some comments to explain it

ah right, we can keep it.
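The memory argument can be illustrated with a small sketch: copying a case class with an existing histogram reuses the same object instead of allocating a new one. `Histogram` and `ColumnStat` below are simplified stand-ins, not the real Catalyst classes.

```scala
object HistogramSharingSketch {
  // Simplified stand-ins for the Catalyst statistics classes (assumptions).
  final case class Histogram(height: Double, bins: Array[Double])
  final case class ColumnStat(min: Double, max: Double, histogram: Option[Histogram])

  val original: Histogram = Histogram(100, Array(0.0, 50.0, 100.0))
  val leftKeyStat: ColumnStat = ColumnStat(0, 100, Some(original))

  // Stats computed for the join output: new min/max, histogram not yet set.
  val joinStat: ColumnStat = ColumnStat(10, 60, None)

  // copy() only stores a reference to the existing histogram.
  val leftAfterJoin: ColumnStat = joinStat.copy(histogram = leftKeyStat.histogram)

  // Reference equality: both stats point to the very same Histogram instance.
  val sharesHistogram: Boolean = leftAfterJoin.histogram.get eq original
}
```

The updated min and max then tell any later estimation step which bins of the shared histogram are still relevant.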

SparkQA · 2017-12-16T05:11:04Z

Test build #84989 has finished for PR 19594 at commit 2a4ee99.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-16T05:52:36Z

Test build #84991 has finished for PR 19594 at commit 2637429.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-12-16T08:30:58Z

retest this please

SparkQA · 2017-12-16T11:25:39Z

Test build #85001 has finished for PR 19594 at commit 2637429.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2017-12-18T13:10:37Z

retest this please

cloud-fan · 2017-12-18T14:51:02Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

+   * Given an original bin and a value range [lowerBound, upperBound], returns the trimmed part
+   * of the bin in that range and its number of rows.
+   */
+  def trimBin(bin: HistogramBin, height: Double, lowerBound: Double, upperBound: Double)


maybe explain in the comment that height means the average number of rows of the given bin inside an equi-height histogram.
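A sketch of how `height` might feed into the row count of a trimmed bin, assuming values are spread evenly within a bin and the trimmed range [lo, hi] already lies inside the bin. `trimmedNumRows` and its parameter names are hypothetical, not the signature in this patch.

```scala
object TrimmedRowsSketch {
  // `height` is the (average) number of rows held by each bin of an
  // equi-height histogram, so a fraction of a bin holds that fraction of rows.
  def trimmedNumRows(
      binLo: Double,
      binHi: Double,
      binNdv: Long,
      height: Double,
      lo: Double,
      hi: Double): Double =
    if (hi == lo) {
      // A single value: give it an equal share of the bin's rows.
      height / binNdv
    } else {
      // Scale the bin's rows by the fraction of its width that is kept.
      height * (hi - lo) / (binHi - binLo)
    }
}
```

For example, keeping [10, 50] of a bin [0, 50] with height 100 retains 80 rows under this assumption.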

cloud-fan · 2017-12-18T15:22:56Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

+      (bin.lo, bin.hi)
+    }
+
+    if (bin.hi == bin.lo) {


do we really need this branch? I think the else if branch can also cover it, if we assume bin.ndv must be 1 if bin.hi == bin.lo

cloud-fan · 2017-12-18T15:23:38Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

+              rightNumRows = rightHeight / right.ndv
+            )
+          } else if (right.lo == right.hi) {
+            // Case2: the right bin has only one value


do we really need case 1 and 2? aren't they covered by branches below?

cloud-fan · 2017-12-18T15:26:16Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

+            // Case3: the left bin is "smaller" than the right bin
+            //      left.lo            right.lo     left.hi          right.hi
+            // --------+------------------+------------+----------------+------->
+            if (left.hi == right.lo) {


yea this branch is needed, otherwise we will get 0 ratio which leads to wrong result.
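To see why the zero-ratio case needs its own branch, here is a sketch of the "left bin is smaller" case. It assumes a single shared boundary value is treated as one distinct value holding height / ndv rows on each side, as the `rightNumRows = rightHeight / right.ndv` fragment above suggests. `Bin`, `Overlap` and `overlapLeftSmaller` are hypothetical names.

```scala
object BoundaryOverlapSketch {
  final case class Bin(lo: Double, hi: Double, ndv: Long)
  final case class Overlap(
      lo: Double,
      hi: Double,
      leftNdv: Double,
      rightNdv: Double,
      leftNumRows: Double,
      rightNumRows: Double)

  // Precondition: left.lo <= right.lo <= left.hi <= right.hi, and neither bin
  // is a single-value bin (those are assumed to be handled by earlier cases).
  def overlapLeftSmaller(
      left: Bin, leftHeight: Double, right: Bin, rightHeight: Double): Overlap =
    if (left.hi == right.lo) {
      // The bins touch at exactly one value. The ratios below would be 0 and
      // the overlap would wrongly contribute no rows, so handle it explicitly.
      Overlap(right.lo, right.lo, 1, 1, leftHeight / left.ndv, rightHeight / right.ndv)
    } else {
      val leftRatio = (left.hi - right.lo) / (left.hi - left.lo)
      val rightRatio = (left.hi - right.lo) / (right.hi - right.lo)
      Overlap(right.lo, left.hi,
        left.ndv * leftRatio, right.ndv * rightRatio,
        leftHeight * leftRatio, rightHeight * rightRatio)
    }
}
```

Without the first branch, two bins that share only a boundary value would produce an overlapped range with zero ndv and zero rows on both sides.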

cloud-fan · 2017-12-18T15:27:53Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

+    } else {
+      //    lowerBound            bin.lo        bin.hi     upperBound
+      // --------+------------------+------------+-------------+------->
+      (bin.lo, bin.hi)


add an assert to make sure if we reach here, the case is what we want.

cloud-fan · 2017-12-18T15:35:22Z

.../main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala

@@ -225,6 +236,43 @@ case class JoinEstimation(join: Join) extends Logging {
    (ceil(card), newStats)
  }

+  /** Compute join cardinality using equi-height histograms. */
+  private def computeByEquiHeightHistogram(


I think it's ok to only say Histogram in method names and explain it's equi-height in comments.

cloud-fan · 2017-12-18T15:36:46Z

.../main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala

+      rightHistogram = rightHistogram,
+      // Only numeric values have equi-height histograms.
+      lowerBound = newMin.get.toString.toDouble,
+      upperBound = newMax.get.toString.toDouble)


if we assume the min/max must be defined here, I think the parameter type should be double instead of Option[Any]

that's because we need to update the column stats' min and max at the end of the method.

cloud-fan · 2017-12-18T15:42:54Z

.../main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala

+    val leftKeyStat = leftStats.attributeStats(leftKey)
+    val rightKeyStat = rightStats.attributeStats(rightKey)
+    val newMaxLen = math.min(leftKeyStat.maxLen, rightKeyStat.maxLen)
+    val newAvgLen = (leftKeyStat.avgLen + rightKeyStat.avgLen) / 2


shall we count left/right numRows when calculating this?

How would we use left/right numRows to calculate this? Ideally avgLen is calculated as (total length of keys) / numRowsAfterJoin. For string types, we don't know the exact length of the matched keys (we don't support string histograms yet); for numeric types, the avgLen of both sides should be the same. So the equation is a fair approximation.

cloud-fan · 2017-12-18T15:44:24Z

.../main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala

+          // truncated by the updated max/min. In this way, only pointers of the histograms are
+          // propagated and thus reduce memory consumption.
+          leftKey -> joinStat.copy(histogram = leftKeyStat.histogram),
+          rightKey -> joinStat.copy(histogram = rightKeyStat.histogram)


shall we do this inside computeByEquiHeightHistogram?

i.e. https://github.com/apache/spark/pull/19594/files#diff-6387e7aaeb7d8e0cb1457b9d0fe5cd00R272

I put it here because computeByEquiHeightHistogram returns a single stats, here we keep the histogram for leftKey and rightKey respectively.

cloud-fan · 2017-12-18T15:58:26Z

...alyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala

+
+    val expectedRanges = Seq(
+      // histogram1.bins(0) overlaps t0
+      OverlappedRange(10, 30, 10, 40*1/2, 300, 80*1/2),


space between operators.

cloud-fan · 2017-12-18T16:00:38Z

...alyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala

+      OverlappedRange(50, 60, 30*1/3, 8, 300*1/3, 20)
+    )
+    assert(expectedRanges.equals(
+      getOverlappedRanges(histogram1, histogram2, lowerBound = 10D, upperBound = 60D)))


10D looks weird, how about 10.0

actually we can just write 10 right?

cloud-fan · 2017-12-18T16:05:31Z

...alyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala

+      expectedMin = 10D,
+      expectedMax = 60D,
+      // 10 + 20 + 8
+      expectedNdv = 38L,


expectedNdv = 10 + 20 + 8?

SparkQA · 2017-12-18T16:06:05Z

Test build #85061 has finished for PR 19594 at commit 2637429.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-12-18T16:06:20Z

...alyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/JoinEstimationSuite.scala

+      // 10 + 20 + 8
+      expectedNdv = 38L,
+      // 300*40/20 + 200*40/20 + 100*20/10
+      expectedRows = 1200L)
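The arithmetic in those two comments can be reproduced with a short sketch. The first and third tuples are taken from the OverlappedRange literals quoted earlier in this test file; the middle tuple is an assumption chosen to match the `200*40/20` and `+ 20` terms, and summing min(ndv) per range is inferred from the `10 + 20 + 8` comment rather than confirmed by the quoted code.

```scala
object ExpectedStatsSketch {
  // (leftNdv, rightNdv, leftNumRows, rightNumRows) for each overlapped range.
  val ranges: Seq[(Double, Double, Double, Double)] = Seq(
    (10.0, 20.0, 300.0, 40.0),
    (20.0, 20.0, 200.0, 40.0), // assumed values, see the note above
    (10.0, 8.0, 100.0, 20.0))

  // New ndv: the smaller side's ndv in each range, summed (10 + 20 + 8).
  val expectedNdv: Double =
    ranges.map { case (lNdv, rNdv, _, _) => math.min(lNdv, rNdv) }.sum

  // Rows: leftRows * rightRows / max(ndv) per range, summed (600 + 400 + 200).
  val expectedRows: Double =
    ranges.map { case (lNdv, rNdv, lRows, rRows) =>
      lRows * rRows / math.max(lNdv, rNdv)
    }.sum
}
```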


cloud-fan · 2017-12-18T16:18:41Z

LGTM except some minor comments

SparkQA · 2017-12-19T13:52:21Z

Test build #85106 has finished for PR 19594 at commit 16797d2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-12-19T13:55:35Z

thanks, merging to master!

wzhfy force-pushed the join_estimation_histogram branch from 67bd651 to 96776ce Compare November 14, 2017 05:29

join estimation based on histogram

8b2084a

wzhfy force-pushed the join_estimation_histogram branch 2 times, most recently from 67bd651 to 8b2084a Compare November 15, 2017 01:17

wzhfy changed the title ~~[WIP] [SPARK-21984] Join estimation based on equi-height histogram~~ [SPARK-21984] [SQL] Join estimation based on equi-height histogram Nov 15, 2017

ron8hu reviewed Nov 29, 2017

ron8hu suggested changes Nov 29, 2017

wzhfy added 2 commits December 8, 2017 15:07

Merge branch 'master' into join_estimation_histogram

6cb9b39

add test cases, add some comments and a small factor

e69e213

cloud-fan reviewed Dec 12, 2017

wzhfy added 3 commits December 14, 2017 11:28

Merge branch 'master' into join_estimation_histogram

ad14a5e

fix comments

2a4ee99

add comment

2637429

cloud-fan reviewed Dec 18, 2017

wzhfy added 2 commits December 19, 2017 15:56

Merge branch 'master' into join_estimation_histogram

e1669ed

fix more comments

16797d2

asfgit closed this in 571aa27 Dec 19, 2017
