HIVE-29479: Improve histogram-based selectivity estimation for two-sided range predicates#6477
HIVE-29479: Improve histogram-based selectivity estimation for two-sided range predicates#6477rubenada wants to merge 9 commits into
Conversation
…ded range predicates
soumyakanti3578
left a comment
There was a problem hiding this comment.
Overall looks good to me, but maybe we should add tests for IS [NOT] NULL with SEARCH, and also a couple of tests for SEARCH containing both ranges and points?
|
Please go through the Sonar issues as well. I think some of them are good suggestions. :) |
thomasrebele
left a comment
There was a problem hiding this comment.
Thank you for the PR! The overall approach looks good. I've made a few suggestions and requests.
| rexBuilder.makeCall(SqlStdOperatorTable.IS_NULL, ref).accept(FilterSelectivityEstimator.this)); | ||
| } | ||
|
|
||
| RangeSets.forEach(sarg.rangeSet, new RangeSets.Consumer<C>() { |
There was a problem hiding this comment.
Maybe the code could be simplified with for (Range<C> range : sarg.rangeSet.asRanges()) { ... }? It might be possible to treat a range as-is, without differentiating all the different kinds of ranges.
There was a problem hiding this comment.
Good point! I have applied the suggested refactoring.
There was a problem hiding this comment.
While I agree that the refactored code is much smaller/simplified now, I feel the previous version was more organized and readable as it's now a huge method with multiple if else blocks.
Moreover, I see that implementing RangeSets.Consumer<C> is the preferred method both in Hive (org.apache.hadoop.hive.ql.optimizer.calcite.RangeConverter) and in Calcite (several places). If the new code is not significantly more performant than the earlier version, then maybe we should keep things familiar?
Another small benefit of implementing RangeSets.Consumer<C> is it will be easily searchable from IDE by looking for all subclasses.
BTW, I am willing to approve this as-is, but just wanted to hear both of your thoughts on this.
There was a problem hiding this comment.
I have no strong opinion here.
I agree that the original version with the RangeSets.Consumer<C> seemed a bit better in terms of code homogeneity and maintainability (easier to find if we ever need to apply adjustments to all rangeSet processing, like the separate, incoming HIVE-28911).
So I'm willing to move back to the Consumer approach, wdyt @thomasrebele ?
There was a problem hiding this comment.
There are a few places in Calcite that iterate over sarg.rangeSet.asRanges() without the Conumer:
The places where a RangeSets.Consumer<C> is used in Calcite, there is an easy mapping from the different range types to a distinct action. Hive always uses the sarg.rangeSet with a RangeSets.Consumer. However, I could only find one usage, and it was introduced by @soumyakanti3578, so I'm not sure whether the opinion is unbiased :) The usages (including those from the Consumer) can be found in the IDE by looking at the usages of org.apache.calcite.util.Sarg#rangeSet.
I had a try simplifying the code a bit, see thomasrebele@29cb98b. It's a bit less efficient than Ruben's proposal. It might be a bit more readable.
|
|
@soumyakanti3578 , @thomasrebele thanks for the review. I think I have addressed all the remarks, feel free to take another look. |



See HIVE-29479.
What changes were proposed in this pull request?
This PR adapts FilterSelectivityEstimator so that histogram statistics are used for all types of range predicates (so far it was only done for single-sided range predicates and bounded ranges).
Why are the changes needed?
This PR allows the CBO planner to use histogram statistics for all types of range predicates.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit tests added.