[CALCITE-4465] Estimate the number of distinct values by predicates #2330
liyafan82 wants to merge 1 commit into apache:main
    // point: 10 (non-nullable)
    Sarg sarg = Sarg.of(false,
        ImmutableRangeSet.of(Range.closed(10, 10)));
    assertThat(sarg.numDistinctVals(nonNullableInt), is(1.0));
Could you please factor this into a helper method that receives the sarg, the type, and the expected matcher?
Then the error message could include sarg + ".numDistinctVals(" + type + ")", so developers can see the inputs that produced the invalid output.
The current assertion failures would look like "expected 1.0 got 1.5", which makes it hard to tell what is going on.
An alternative is to split the assertions into individual tests (e.g. parameterized tests).
Thank you for the good suggestion.
I have revised the code accordingly.
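A minimal sketch of the helper-method idea from this thread, for readers who want to see the shape of it. It deliberately avoids Calcite and Hamcrest so it stands alone; the class name NdvAssertions, the method names, and the plain string arguments standing in for the sarg and type are hypothetical and are not the code that was actually committed. The only point is that the failure message names the inputs.

```java
import java.util.Objects;

/** Hypothetical test helper; illustrative only, not part of Calcite. */
final class NdvAssertions {
  private NdvAssertions() {}

  /** Builds the label suggested in the review: sarg + ".numDistinctVals(" + type + ")". */
  static String describe(String sargText, String typeText) {
    return sargText + ".numDistinctVals(" + typeText + ")";
  }

  /**
   * Compares an actual NDV estimate with the expected one and, on mismatch,
   * fails with a message that includes the inputs that produced the value.
   * A null estimate means "no estimate" and is compared like any other value.
   */
  static void checkNdv(String sargText, String typeText,
      Double actual, Double expected) {
    if (!Objects.equals(actual, expected)) {
      throw new AssertionError(describe(sargText, typeText)
          + ": expected " + expected + " but was " + actual);
    }
  }
}
```

With a helper of this shape, a failure reads as "<sarg>.numDistinctVals(<type>): expected 1.0 but was 1.5" instead of a bare pair of numbers.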
    Number lowerNum = (Number) lower;
    Number upperNum = (Number) upper;

    boolean discreteType = type.getSqlTypeName() == SqlTypeName.BOOLEAN.BOOLEAN
Is .BOOLEAN.BOOLEAN expected here?
Nice catch. Thank you.
-    return mq.getDistinctRowCount(rel.getInput(), groupKey, unionPreds);
+    Double ndvUpperBound = RexUtil.estimateColumnsNdv(groupKey, unionPreds);
+    return NumberUtil.min(
+        mq.getDistinctRowCount(rel.getInput(), groupKey, unionPreds), ndvUpperBound);
I am trying to understand why this uses min rather than max. Could you clarify a bit?
Thanks for your comments.
We use min here because the estimated NDV is an upper bound on the actual NDV. In other words, the real NDV should never exceed the estimated NDV.
For example, given the expression x = 1, the estimated NDV is 1. However, the real NDV can also be 0, when the underlying data does not contain a row with x equal to 1.
In this code, if mq.getDistinctRowCount(rel.getInput(), groupKey, unionPreds) returns a value smaller than 1, we use that value. If it returns a value greater than 1, we use the estimated NDV of 1 instead, because it is an upper bound.
This is why we are using min here.
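The reasoning above can be sketched in a few lines of standalone Java. This is an illustration under stated assumptions, not Calcite's code: NdvBound, min, and capByPredicates are hypothetical names, and the null handling (null means "no estimate", so the other argument is returned) is an assumption modeled on how the thread describes null estimates.

```java
/** Hypothetical sketch of capping a distinct-row-count estimate by an upper bound. */
final class NdvBound {
  private NdvBound() {}

  /**
   * Null-tolerant minimum. Null is read as "no estimate" (assumption),
   * so when one side is null the other side is returned unchanged.
   */
  static Double min(Double a, Double b) {
    if (a == null) {
      return b;
    }
    if (b == null) {
      return a;
    }
    return Math.min(a, b);
  }

  /**
   * The predicate-derived NDV is an upper bound, so the result can never
   * exceed it; a smaller estimate from the input is kept as is.
   */
  static Double capByPredicates(Double inputDistinctRowCount, Double ndvUpperBound) {
    return min(inputDistinctRowCount, ndvUpperBound);
  }
}
```

For the x = 1 example, an input estimate of 100 is capped to 1, while an input estimate of 0.5 stays at 0.5. Using max would instead push the 0.5 estimate up to 1, which treats the upper bound as a lower bound.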
    for (RexNode condition : conditions) {
      Double singleNdv = estimateColumnNdvSingleCondition(colIdx, condition);
      if (singleNdv != null) {
        if (ndv == null) {
-        if (ndv == null) {
+        ndv = (ndv == null) ? singleNdv : Math.min(ndv, singleNdv);
This applies to lines 2658 to 2668.
Accepted. Thanks for the suggestion.
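For context, the suggested one-liner folds the per-condition estimates into the tightest bound while skipping conditions that yield no estimate. A self-contained sketch follows; ConditionNdv and tightestBound are hypothetical names, and the list of precomputed per-condition values stands in for the calls to estimateColumnNdvSingleCondition shown in the diff.

```java
import java.util.List;

/** Hypothetical sketch of combining per-condition NDV estimates for one column. */
final class ConditionNdv {
  private ConditionNdv() {}

  /**
   * Returns the smallest non-null estimate, or null when no condition
   * produced an estimate. Every condition must hold at once, so each
   * estimate is a valid upper bound and the smallest one is the tightest.
   */
  static Double tightestBound(List<Double> perConditionNdv) {
    Double ndv = null;
    for (Double singleNdv : perConditionNdv) {
      if (singleNdv != null) {
        // Same expression as in the review suggestion.
        ndv = (ndv == null) ? singleNdv : Math.min(ndv, singleNdv);
      }
    }
    return ndv;
  }
}
```

The ternary is safe because singleNdv is known to be non-null inside the if block; ndv is only unboxed in the branch where it is non-null.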
    public static <C extends Comparable<C>> @Nullable Double numDistinctVals(
        Range<C> range, RelDataType type) {
      if (RangeSets.isPoint(range)) {
        return 1.0;
Hmm, why return 1.0 when the range is a point (i.e. rather than returning null)?
Returning null means we do not have an estimate or, equivalently, that the number of distinct values is treated as unbounded.
When the range is a point (e.g. x = 1), we estimate that there is only one distinct value, which is likely to be true in practice.
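To make the null-versus-1.0 distinction concrete, here is a simplified standalone sketch for closed integer ranges only. It is not the method in the diff: IntRangeNdv is a hypothetical class, null bounds are used to model an unbounded side, and the "upper - lower + 1" count for a bounded discrete range is an assumption, since the thread only shows the point case.

```java
/** Hypothetical, integer-only sketch of a range-based NDV upper bound. */
final class IntRangeNdv {
  private IntRangeNdv() {}

  /**
   * Upper bound on the number of distinct integers in the closed range
   * [lower, upper]. A null bound models an unbounded side, in which case
   * there is no estimate and null is returned.
   */
  static Double numDistinctVals(Integer lower, Integer upper) {
    if (lower == null || upper == null) {
      return null; // unbounded on at least one side: no estimate
    }
    if (lower > upper) {
      throw new IllegalArgumentException("lower must not exceed upper");
    }
    if (lower.equals(upper)) {
      return 1.0; // a point such as x = 10 admits at most one value
    }
    // Widen to long before subtracting so extreme int bounds cannot overflow.
    return (double) ((long) upper - (long) lower + 1L);
  }
}
```

Under these assumptions the range [10, 10] gives 1.0, the range [1, 5] gives 5.0, and a range with a missing bound gives null.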
According to our current implementation (RelMdDistinctRowCount), estimating the number of distinct values (NDV) does not make good use of the filter condition. It simply forwards the call to its input operator with the filter condition attached.
In fact, more information can be obtained for some special but commonly used conditions. For example, given the condition x = 'a', we can deduce that NDV(x) <= 1. Given the condition x in ('a', 'b'), we can deduce that NDV(x) <= 2.
More generally, if we have x in ('a', 'b') AND y in ('c', 'd', 'e'), we have NDV(x, y) <= 2 * 3 = 6.
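The multi-column bound in the description is a product of per-column bounds. A small standalone sketch of that arithmetic follows; MultiColumnNdv and upperBound are hypothetical names, and returning null as soon as one column has no bound is an assumption that follows the "null means no estimate" convention used elsewhere in this PR.

```java
import java.util.List;

/** Hypothetical sketch of combining per-column NDV upper bounds. */
final class MultiColumnNdv {
  private MultiColumnNdv() {}

  /**
   * Multiplies the per-column upper bounds. If any column has no bound
   * (null), the combined bound is unknown and null is returned (assumption).
   */
  static Double upperBound(List<Double> perColumnBounds) {
    double product = 1.0;
    for (Double bound : perColumnBounds) {
      if (bound == null) {
        return null;
      }
      product *= bound;
    }
    return product;
  }
}
```

For x in ('a', 'b') AND y in ('c', 'd', 'e'), the per-column bounds are 2 and 3, so the combined bound is 6. As in the rest of the PR, this is only an upper bound; the actual NDV may be smaller if some combinations never occur in the data.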