New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-21322][SQL] support histogram in filter cardinality estimation #19783
Conversation
Test build #84003 has finished for PR 19783 at commit
|
@@ -158,8 +196,8 @@ class FilterEstimationSuite extends StatsEstimationTestBase { | |||
val condition = Not(And(LessThan(attrInt, Literal(3)), Literal(null, IntegerType))) | |||
validateEstimatedStats( | |||
Filter(condition, childStatsTestPlan(Seq(attrInt), 10L)), | |||
Seq(attrInt -> colStatInt.copy(distinctCount = 8)), | |||
expectedRowCount = 8) | |||
Seq(attrInt -> colStatInt.copy(distinctCount = 7)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add new test cases for filter estimation based on histogram, instead of modifying existing test results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
dd5b975
to
8e5d04e
Compare
Test build #84190 has finished for PR 19783 at commit
|
Test build #84236 has finished for PR 19783 at commit
|
* @return the number of the first bin/bucket into which a column values falls. | ||
*/ | ||
|
||
def findFirstBucketForValue(value: Double, histogram: Histogram): Int = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we unify all names to bin
/bins
in code and comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had bucket(s) and bin(s) used interchangeably. To avoid confusion, I will unify them to use only bin/bins.
052d111
to
241089c
Compare
Test build #84317 has finished for PR 19783 at commit
|
Some(1.0 / BigDecimal(ndv)) | ||
} else { | ||
// We compute filter selectivity using Histogram information | ||
attr.dataType match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use if (colStat.histogram.isEmpty)
to seperate the logic of basic stats (Some(1.0 / BigDecimal(ndv))
) and histogram computation.
@@ -332,8 +332,45 @@ case class FilterEstimation(plan: Filter) extends Logging { | |||
colStatsMap.update(attr, newStats) | |||
} | |||
|
|||
Some(1.0 / BigDecimal(ndv)) | |||
} else { | |||
// We compute filter selectivity using Histogram information |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this comment where the histogram computation really starts
// returns 1/ndv if there is no histogram | ||
if (colStat.histogram.isEmpty) return Some(1.0 / BigDecimal(ndv)) | ||
|
||
// We traverse histogram bins to locate the literal value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not accurate, here we want to get the bins occupied by the literal value, because if the value is skewed, it can occupy multiple bins.
// We traverse histogram bins to locate the literal value | ||
val hgmBins = colStat.histogram.get.bins | ||
val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble | ||
// find the interval where this datum locates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can remove this comment, it's explained above
// find the interval where this datum locates | ||
var lowerId, higherId = -1 | ||
for (i <- hgmBins.indices) { | ||
// if datum > upperBound, just move to next bin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove the comment, it does not match the logic at next line (there's no "move" logic)
(1.0 / hgmBins.length) / math.max(lowerBinNdv, 1) + | ||
(1.0 / hgmBins.length) / math.max(higherBinNdv, 1) | ||
} | ||
Some(percent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about simplifying the above logic as:
val occupiedBins = if (lowerId == higherId) {
1.0 / lowerBinNdv
} else {
(higherId - lowerId - 1) + 1.0 / lowerBinNdv + 1.0 / higherBinNdv
}
Some(occupiedBins / hgmBins.length)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
binId += 1 | ||
} | ||
if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1)) { | ||
if (value == histogram.bins(i + 1).lo) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
merge two if
s: if ((value == histogram.bins(i).hi) && (value == histogram.bins(i + 1).lo) && (i < histogram.bins.length - 1))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used two statements instead of one statement is because, when i points to the last bin, this condition "value == histogram.bins(i + 1).lo" may be out of bound. By separating the conditions into two statements, we can be sure that the out-of-bound error will not happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By "out of bound", do you mean it exceeds 100 length limit? You can just switch new line after &&
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. I meant the upper bound for the array of bins in a histogram. The default length of the histogram bin array is 254. When i is equal to 253 (the last bin), then i+1 is 254 leading to out-of-bound error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just move this condition after the length check:
if ((value == histogram.bins(i).hi) && (i < histogram.bins.length - 1) && (value == histogram.bins(i + 1).lo))
* @param histogram a numeric equi-height histogram | ||
* @return the number of the first bin into which a column values falls. | ||
*/ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you remove the empty line between method comment and its definition?
same for other methods here.
histogram: Histogram): Double = { | ||
// find bins where current min and max locate | ||
val minBinId = findFirstBinForValue(lowerEnd, histogram) | ||
val maxBinId = findLastBinForValue(higherEnd, histogram) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about lowerBinId, higherBinId
?
1.0 | ||
} else if (binId == 0 && curBin.hi != curBin.lo) { | ||
if (higherValue == lowerValue) { | ||
// in the case curBin.binNdv == 0, current bin is occupied by one value, which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
binNdv will never be zero
} else if (lowerId == higherId) { | ||
getOccupation(lowerId, higherEnd, lowerEnd, histogram) * histogram.bins(lowerId).ndv | ||
} else { | ||
// compute how much lowerEnd/higherEnd occupy its bin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: occupies its bin
Test build #84365 has finished for PR 19783 at commit
|
6e6c49b
to
9d2a463
Compare
Test build #84385 has finished for PR 19783 at commit
|
def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { | ||
var binId = 0 | ||
bins.foreach { bin => | ||
if (value > bin.hi) binId += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks more like a while loop pattern, can we use while loop here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Actually while loop is better because it can exit early when the condition no longer qualifies.
* @param bins an array of bins for a given numeric equi-height histogram | ||
* @return the number of the last bin into which a column values falls. | ||
*/ | ||
def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this method so different from findFirstBinForValue
? It looks like we just need to reverse the iteration order, i.e. from bins.length
to 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. We can simplify the logic by iterating from bins.length-1 to 0.
1.0 / curBin.ndv.toDouble | ||
} else { | ||
// Use proration since the range falls inside this bin. | ||
(higherValue - lowerValue) / (curBin.hi - curBin.lo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the only branch we need to specialize for binId=0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to specialize it?
* @param higherEnd a given upper bound value of a specified column value range | ||
* @param lowerEnd a given lower bound value of a specified column value range | ||
* @param histogram a numeric equi-height histogram | ||
* @return the selectivity percentage for column values in [lowerEnd, higherEnd]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't match the java doc: Returns the number of bins...
9d2a463
to
d068888
Compare
Test build #84517 has finished for PR 19783 at commit
|
* | ||
* @param value a literal value of a column | ||
* @param bins an array of bins for a given numeric equi-height histogram | ||
* @return the number of the first bin into which a column values falls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the id of the first bin into which the given value falls.
def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = { | ||
var i = 0 | ||
while ((i < bins.length) && (value > bins(i).hi)) { | ||
i +=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s space after +=
@@ -114,4 +114,171 @@ object EstimationUtils { | |||
} | |||
} | |||
|
|||
/** | |||
* Returns the number of the first bin into which a column values falls for a specified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
} | ||
|
||
/** | ||
* Returns the number of the last bin into which a column values falls for a specified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
* | ||
* @param value a literal value of a column | ||
* @param bins an array of bins for a given numeric equi-height histogram | ||
* @return the number of the last bin into which a column values falls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = { | ||
var i = bins.length - 1 | ||
while ((i >= 0) && (value < bins(i).lo)) { | ||
i -=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a space after -=
binId: Int, | ||
higherValue: Double, | ||
lowerValue: Double, | ||
histogram: Histogram): Double = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the method signature looks weird, shouldn't it be
private def getOccupation(
higherValue: Double,
lowerValue: Double,
bin: HistogramBin)
val curBin = histogram.bins(binId) | ||
if (curBin.hi == curBin.lo) { | ||
// the entire bin is covered in the range | ||
1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get it, shouldn't we check lowerValue <= curBin.lo <= higherValue
here?
1.0 | ||
} else if (higherValue == lowerValue) { | ||
// set percentage to 1/NDV | ||
1.0 / curBin.ndv.toDouble |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we check the lowerValue/higherValues
fits in the bin value range?
// of next bin is equal to the hi value of the previous bin. We bump up | ||
// ndv value only if the hi values of two consecutive bins are different. | ||
var middleNdv: Long = 0 | ||
for (i <- histogram.bins.indices) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again this is a typical while loop pattern.
var middleNdv: Long = 0 | ||
for (i <- histogram.bins.indices) { | ||
val bin = histogram.bins(i) | ||
if (bin.hi != bin.lo && i >= lowerId + 1 && i <= higherId - 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var i = lowerId + 1
while (i < higherId) {
...
i += 1
}
// The total ndv is minPartNdv + maxPartNdv + Ndvs between them. | ||
// In order to avoid counting same distinct value twice, we check if the upperBound value | ||
// of next bin is equal to the hi value of the previous bin. We bump up | ||
// ndv value only if the hi values of two consecutive bins are different. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't match the code, the actual logic is: only if the lo and hi values of the bin are different
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change the comment so that it matches with the code. Actually my original comment means the same thing as your comment. This is because the hi value of a bin is equal to the lo value of the next bin.
// Here we traverse histogram bins to locate the range of bins the literal values falls | ||
// into. For skewed distribution, a literal value can occupy multiple bins. | ||
val hgmBins = colStat.histogram.get.bins | ||
val datum = EstimationUtils.toDecimal(literal.value, literal.dataType).toDouble |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @wzhfy , you would refactor this part to always use Double for CBO computing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I'll refactor this part.
((datum == hgmBins(i).hi) && (datum < hgmBins(i + 1).hi))) { | ||
higherId = i | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about
var lowerId = -1
var highIdFound = false
var i = 0
while (i < hgmBins.length || highIdFound) {
if (datum <= hgmBins(i).hi && lowerId < 0) lowerId = i
if (datum >= hgmBins(i).lo) highIdFound = true
}
val highId = i
LGTM overall |
|
||
// compute how many bins the column's current valid range [min, max] occupies. | ||
// Note that a column's [min, max] range may vary after we apply some filter conditions. | ||
val minToMaxLength = EstimationUtils.getOccupationBins(maxBinId, minBinId, max, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I prefer to have this method unit-tested, because it's the core part of filter estimation. We can do this in follow-up anyway.
c9538b8
to
be1e7ba
Compare
Test build #84700 has finished for PR 19783 at commit
|
retest this please. |
Test build #84725 has finished for PR 19783 at commit
|
retest this please |
Test build #84732 has finished for PR 19783 at commit
|
For the past 2 test builds #84725 and #84732, I checked the test result on the web. Actually there were no failures. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84725/testReport/. It appears that there is a bug in the jenkins test system. |
retest this please. |
Test build #84751 has finished for PR 19783 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except some code style issue, we can address them later
@@ -114,4 +114,99 @@ object EstimationUtils { | |||
} | |||
} | |||
|
|||
/** | |||
* Returns the number of the first bin into which a column value falls for a specified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: number
-> index
?
* | ||
* @param value a literal value of a column | ||
* @param bins an array of bins for a given numeric equi-height histogram | ||
* @return the id of the first bin into which a column value falls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this is redundant, shall we remove it?
} | ||
|
||
/** | ||
* Returns the number of the last bin into which a column value falls for a specified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
* | ||
* @param value a literal value of a column | ||
* @param bins an array of bins for a given numeric equi-height histogram | ||
* @return the id of the last bin into which a column value falls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
* @param higherValue a given upper bound value of a specified column value range | ||
* @param lowerValue a given lower bound value of a specified column value range | ||
* @param bin a single histogram bin | ||
* @return the percentage of a single bin holding values in [lowerValue, higherValue]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant
* double value is because we may return a portion of a bin. For example, a predicate | ||
* "column = 8" may return the number of bins 0.2 if the holding bin has 5 distinct values. | ||
* | ||
* @param higherId id of the high end bin holding the high end value of a column range |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: higherIndex
} else { | ||
// compute how much lowerEnd/higherEnd occupies its bin | ||
val lowerCurBin = histogram.bins(lowerId) | ||
val lowerPart = getOccupation(lowerCurBin.hi, lowerEnd, lowerCurBin) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we assert that lowerBin.lo <= lowerEnd
// returns 1/ndv if there is no histogram | ||
Some(1.0 / BigDecimal(ndv)) | ||
} else { | ||
// We compute filter selectivity using Histogram information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you create a new method?
// range may change due to another condition applied earlier. | ||
val min = EstimationUtils.toDecimal(colStat.min.get, literal.dataType).toDouble | ||
val max = EstimationUtils.toDecimal(colStat.max.get, literal.dataType).toDouble | ||
val minBinId = EstimationUtils.findFirstBinForValue(min, hgmBins) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: minBinIndex
val lowerBinNdv = hgmBins(lowerBinId).ndv | ||
val higherBinNdv = hgmBins(higherBinId).ndv | ||
// assume uniform distribution in each bin | ||
val occupiedBins = if (lowerBinId == higherBinId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this just EstimationUtils.getOccupationBins(higherBinId, lowerBinId, datum, datum, histogram)
?
thanks, merging to master! |
What changes were proposed in this pull request?
Histogram is effective in dealing with skewed distribution. After we generate histogram information for column statistics, we need to adjust filter estimation based on histogram data structure.
How was this patch tested?
We revised all the unit test cases by including histogram data structure.
Please review http://spark.apache.org/contributing.html before opening a pull request.