-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-21322][SQL][followup] support histogram in filter cardinality estimation #19952
Conversation
* | ||
* @param value a literal value of a column | ||
* @param bins an array of bins for a given numeric equi-height histogram | ||
* @return the id of the first bin into which a column value falls. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it's a private method and we can omit the parameter doc as it's trivial.
* @param upperBound the highest value of the given range | ||
* @param upperBoundInclusive whether the upperBound is included in the range | ||
* @param lowerBound the lowest value of the given range | ||
* @param lowerBoundInclusive whether the lowerBound is included in the range |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of asking the callers to pass in the upper and lower bin id, it's more intuitive to pass whether to include the range boundaries.
Test build #84773 has finished for PR 19952 at commit
|
retest this please |
Test build #84784 has finished for PR 19952 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except some minor comments
* @param lowerId id of the low end bin holding the low end value of a column range | ||
* @param higherEnd a given upper bound value of a specified column value range | ||
* @param lowerEnd a given lower bound value of a specified column value range | ||
* Note that the return value is double type, because the range boundaries usually occupy a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: returned value
* @param lowerEnd a given lower bound value of a specified column value range | ||
* Note that the return value is double type, because the range boundaries usually occupy a | ||
* portion of a bin. An extrema case is [value, value] which is generated by equal predicate | ||
* `col = value`, we can get more accuracy by allowing returning portion of histogram bins. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: get higher accuracy
* @param min the lower bound of the current valid range for a given column | ||
* @param datumNumber the numeric value of a literal | ||
* @return the selectivity percentage for a condition in the current range. | ||
* Computes the possibility of a equal predicate using histogram. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: an equality predicate
upperBound: Double, | ||
upperBoundInclusive: Boolean, | ||
lowerBound: Double, | ||
lowerBoundInclusive: Boolean, | ||
histogram: Histogram): Double = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it better to pass the bin array instead of histogram? we can simplify many histogram.bins
here.
val numBinsHoldingEntireRange = EstimationUtils.numBinsHoldingRange( | ||
max, upperBoundInclusive = true, min, lowerBoundInclusive = true, histogram) | ||
|
||
val numBinsHoldingDatum = op match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numBinsHoldingRange
* [lowerBound, upperBound]. | ||
* | ||
* Note that the returned value is double type, because the range boundaries usually occupy a | ||
* portion of a bin. An extrema case is [value, value] which is generated by equal predicate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: extreme
Test build #84822 has finished for PR 19952 at commit
|
Test build #84827 has finished for PR 19952 at commit
|
thanks, merging to master! |
What changes were proposed in this pull request?
some code cleanup/refactor and naming improvement.
How was this patch tested?
existing tests