Skip to content

[CALCITE-4465] Estimate the number of distinct values by predicates#2330

Open
liyafan82 wants to merge 1 commit intoapache:mainfrom
liyafan82:fly_0113_ndv
Open

[CALCITE-4465] Estimate the number of distinct values by predicates#2330
liyafan82 wants to merge 1 commit intoapache:mainfrom
liyafan82:fly_0113_ndv

Conversation

@liyafan82
Copy link
Contributor

According to our current implementation (RelMdDistinctRowCount), estimating the number of distinctive values (NDV) does not make good use of the filter condition. It simply forwards the call to its input operator with the fiter condition attached.

In fact, more information can be obtained for some special but commonly used conditions. For example, given condition x = 'a', we can deduce that NDV( x ) <= 1. Given condition x in ('a', 'b'), we can deduce that NDV( x ) <= 2.
More generally, if we have x in ('a', 'b') AND y in ('c', 'd', 'e'), we have NDV(x, y) <= 2 * 3 = 6.

@liyafan82 liyafan82 force-pushed the fly_0113_ndv branch 3 times, most recently from a28a08d to 03f896d Compare January 19, 2021 05:15
@amaliujia amaliujia self-requested a review January 19, 2021 05:47
@liyafan82 liyafan82 force-pushed the fly_0113_ndv branch 4 times, most recently from 71ce2dc to 950e3c0 Compare January 19, 2021 07:48
// point: 10 (non-nullable)
Sarg sarg = Sarg.of(false,
ImmutableRangeSet.of(Range.closed(10, 10)));
assertThat(sarg.numDistinctVals(nonNullableInt), is(1.0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please factor this to a helper method that receives sarg, type, and the expected matcher?

Then the error message could include sarg + ".numDistinctVals(" + type + ")" so the developers could see the inputs that produced invalid output.

The current assertion failures would look like "expected 1.0 got 1.5", and it would be hard to tell what is going on.

An alternative option is to split assertions to individual tests (e.g. parameterized)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the good suggestion.
I have revised the code accordingly.

Number lowerNum = (Number) lower;
Number upperNum = (Number) upper;

boolean discreteType = type.getSqlTypeName() == SqlTypeName.BOOLEAN.BOOLEAN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is .BOOLEAN.BOOLEAN expected here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. Thank you.

return mq.getDistinctRowCount(rel.getInput(), groupKey, unionPreds);
Double ndvUpperBound = RexUtil.estimateColumnsNdv(groupKey, unionPreds);
return NumberUtil.min(
mq.getDistinctRowCount(rel.getInput(), groupKey, unionPreds), ndvUpperBound);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to understand why use min not max? Could you clarify a bit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your comments.

Here we are using min, because the estimated NDV is an upper bound of the actual NDV. In other words, the real NDV should never exceed the estimated NDV.

For example, given expression x = 1, the estimated NDV is 1. However, the real NDV can also be 0, when the underlying data do not contain a row with x equal to 1.

For this code, if the value of mq.getDistinctRowCount(rel.getInput(), groupKey, unionPreds) is smaller than 1, we just use that value. On the other hand, if the value of mq.getDistinctRowCount(rel.getInput(), groupKey, unionPreds) is greater than 1, we use the NDV 1, because it is an upper bound.

This is why we are using min here.

for (RexNode condition : conditions) {
Double singleNdv = estimateColumnNdvSingleCondition(colIdx, condition);
if (singleNdv != null) {
if (ndv == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (ndv == null) {
ndv = (ndv == null) ? singleNdv : Math.min(ndv, singleNdv);

line 2658 to line 2668

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accepted. Thanks for the suggestion.

public static <C extends Comparable<C>> @Nullable Double numDistinctVals(
Range<C> range, RelDataType type) {
if (RangeSets.isPoint(range)) {
return 1.0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm Why 1.0 when range is a point (i.e. not return null)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning null means we do not have an estimation, or equivalently, the number of distinct values is infinity.

When the range is a point (e.g. x = 1), we estimate that there is only one distinct value, which is likely to be true in practice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants