[CALCITE-4465] Estimate the number of distinct values by predicates by liyafan82 · Pull Request #2330 · apache/calcite

liyafan82 · 2021-01-19T01:58:16Z

In the current implementation (RelMdDistinctRowCount), the estimate of the number of distinct values (NDV) does not make good use of the filter condition. It simply forwards the call to the input operator with the filter condition attached.

In fact, more information can be obtained for some special but commonly used conditions. For example, given condition x = 'a', we can deduce that NDV( x ) <= 1. Given condition x in ('a', 'b'), we can deduce that NDV( x ) <= 2.
More generally, if we have x in ('a', 'b') AND y in ('c', 'd', 'e'), we have NDV(x, y) <= 2 * 3 = 6.
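
As a rough illustration of the product bound above, here is a minimal sketch. `NdvProductBound` and `upperBound` are hypothetical names used only for this example and are not part of Calcite; the input is assumed to be the number of literals in each column's IN list.

```java
import java.util.List;

/** Hypothetical helper (not Calcite API): bounds NDV of a column tuple. */
final class NdvProductBound {
  /**
   * Multiplies the number of allowed values per column. The result is an
   * upper bound: the data may contain fewer combinations than this.
   */
  static double upperBound(List<Integer> inListSizes) {
    double bound = 1.0;
    for (int size : inListSizes) {
      bound *= size;
    }
    return bound;
  }

  public static void main(String[] args) {
    // x in ('a', 'b') AND y in ('c', 'd', 'e')
    System.out.println(upperBound(List.of(2, 3))); // prints 6.0
  }
}
```

A single equality such as x = 'a' is the special case of a one-element list, which gives a bound of 1.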

vlsi · 2021-01-20T13:19:16Z

core/src/test/java/org/apache/calcite/rex/RexProgramTest.java

+    // point: 10 (non-nullable)
+    Sarg sarg = Sarg.of(false,
+        ImmutableRangeSet.of(Range.closed(10, 10)));
+    assertThat(sarg.numDistinctVals(nonNullableInt), is(1.0));


Could you please factor this to a helper method that receives sarg, type, and the expected matcher?

Then the error message could include sarg + ".numDistinctVals(" + type + ")" so the developers could see the inputs that produced invalid output.

The current assertion failures would look like "expected 1.0 got 1.5", and it would be hard to tell what is going on.

An alternative is to split the assertions into individual tests (e.g. parameterized ones).

Thank you for the good suggestion.
I have revised the code accordingly.
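
For readers following the thread, the requested helper could look roughly like the sketch below. It is not the code that was pushed to the PR. `NdvAssertions`, `describe` and `checkNdv` are hypothetical names, and plain `Object` parameters stand in for `Sarg` and `RelDataType` so that the example compiles without Calcite or Hamcrest on the classpath. In the real test, the same label could presumably be passed as the reason argument of Hamcrest's `assertThat(String, T, Matcher)`.

```java
import java.util.Objects;

/** Hypothetical test helper; Object stands in for Sarg and RelDataType. */
final class NdvAssertions {
  /** Builds a label that shows the inputs that produced the value. */
  static String describe(Object sarg, Object type) {
    return sarg + ".numDistinctVals(" + type + ")";
  }

  /** Fails with a message that names the inputs, not just the numbers. */
  static void checkNdv(Object sarg, Object type, Double actual, Double expected) {
    if (!Objects.equals(actual, expected)) {
      throw new AssertionError(
          describe(sarg, type) + ": expected " + expected + " but got " + actual);
    }
  }
}
```

With this shape, a failure reads as the sarg and type followed by the expected and actual values, rather than only "expected 1.0 got 1.5".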

vlsi · 2021-01-20T13:20:01Z

core/src/main/java/org/apache/calcite/util/RangeSets.java

+      Number lowerNum = (Number) lower;
+      Number upperNum = (Number) upper;
+
+      boolean discreteType = type.getSqlTypeName() == SqlTypeName.BOOLEAN.BOOLEAN


Is .BOOLEAN.BOOLEAN expected here?

Nice catch. Thank you.

amaliujia · 2021-01-22T05:02:58Z

core/src/main/java/org/apache/calcite/rel/metadata/RelMdDistinctRowCount.java

-    return mq.getDistinctRowCount(rel.getInput(), groupKey, unionPreds);
+    Double ndvUpperBound = RexUtil.estimateColumnsNdv(groupKey, unionPreds);
+    return NumberUtil.min(
+        mq.getDistinctRowCount(rel.getInput(), groupKey, unionPreds), ndvUpperBound);


I am trying to understand why this uses min rather than max. Could you clarify a bit?

Thanks for your comments.

Here we are using min, because the estimated NDV is an upper bound of the actual NDV. In other words, the real NDV should never exceed the estimated NDV.

For example, given expression x = 1, the estimated NDV is 1. However, the real NDV can also be 0, when the underlying data do not contain a row with x equal to 1.

For this code, if the value of mq.getDistinctRowCount(rel.getInput(), groupKey, unionPreds) is smaller than 1, we just use that value. On the other hand, if the value of mq.getDistinctRowCount(rel.getInput(), groupKey, unionPreds) is greater than 1, we use the NDV 1, because it is an upper bound.

This is why we are using min here.
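
The two cases described above can be sketched as follows. `NdvCap` is a hypothetical stand-in written for this example, not Calcite's `NumberUtil`; it assumes that a null argument means "no estimate available" and should not affect the result.

```java
/** Hypothetical stand-in for a null-tolerant minimum (not Calcite's NumberUtil). */
final class NdvCap {
  /** Returns the smaller value; a null argument means "no estimate". */
  static Double min(Double inputEstimate, Double predicateBound) {
    if (inputEstimate == null) {
      return predicateBound;
    }
    if (predicateBound == null) {
      return inputEstimate;
    }
    return Math.min(inputEstimate, predicateBound);
  }
}
```

For the condition x = 1 the predicate bound is 1.0: an input estimate of 0.5 stays 0.5, while an input estimate of 40.0 is capped at 1.0.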

amaliujia · 2021-01-22T05:09:43Z

core/src/main/java/org/apache/calcite/rex/RexUtil.java

+    for (RexNode condition : conditions) {
+      Double singleNdv = estimateColumnNdvSingleCondition(colIdx, condition);
+      if (singleNdv != null) {
+        if (ndv == null) {


Suggested change (line 2658 to line 2668):

-        if (ndv == null) {
+        ndv = (ndv == null) ? singleNdv : Math.min(ndv, singleNdv);

Accepted. Thanks for the suggestion.
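
In isolation, the loop with the suggested one-liner behaves like the sketch below. `ConditionNdvFold` and `tightest` are hypothetical names; the list of nullable doubles stands in for the results of `estimateColumnNdvSingleCondition`, with null assumed to mean that a condition gives no estimate.

```java
import java.util.List;

/** Hypothetical illustration of folding per-condition NDV estimates. */
final class ConditionNdvFold {
  /** Returns the smallest non-null estimate, or null if there is none. */
  static Double tightest(List<Double> perConditionNdv) {
    Double ndv = null;
    for (Double singleNdv : perConditionNdv) {
      if (singleNdv != null) {
        // First estimate seeds the result; later ones can only tighten it.
        ndv = (ndv == null) ? singleNdv : Math.min(ndv, singleNdv);
      }
    }
    return ndv;
  }
}
```

Each condition is an independent upper bound on the same column, so the smallest one is kept.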

amaliujia · 2021-01-22T05:14:20Z

core/src/main/java/org/apache/calcite/util/RangeSets.java

+  public static <C extends Comparable<C>> @Nullable Double numDistinctVals(
+      Range<C> range, RelDataType type) {
+    if (RangeSets.isPoint(range)) {
+      return 1.0;


Hmm, why return 1.0 when the range is a point (i.e., rather than returning null)?

Returning null means we do not have an estimate or, equivalently, that the number of distinct values is infinite.

When the range is a point (e.g. x = 1), we estimate that there is only one distinct value, which is likely to be true in practice.
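
The convention can be sketched as below. This is not the PR's `RangeSets.numDistinctVals`: `RangeNdv` is a hypothetical class, the range is assumed to be closed and given by two numeric bounds, and the discrete-type branch is an assumption about how a non-point range over an integer-like type might be counted.

```java
/** Hypothetical sketch of the "1.0 for a point, null for unknown" convention. */
final class RangeNdv {
  /**
   * Estimates NDV for the closed range [lower, upper].
   * Returns null when no finite estimate is available.
   */
  static Double numDistinctVals(double lower, double upper, boolean discreteType) {
    if (lower == upper) {
      return 1.0; // a point such as x = 1 admits exactly one value
    }
    if (discreteType) {
      // Assumption: integer-like types allow counting the values in between.
      return upper - lower + 1;
    }
    return null; // continuous range: no estimate
  }
}
```

Under these assumptions, [10, 10] gives 1.0, [1, 5] over a discrete type gives 5.0, and [1, 5] over a continuous type gives null.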

liyafan82 force-pushed the fly_0113_ndv branch 3 times, most recently from a28a08d to 03f896d Compare January 19, 2021 05:15

amaliujia self-requested a review January 19, 2021 05:47

liyafan82 force-pushed the fly_0113_ndv branch 4 times, most recently from 71ce2dc to 950e3c0 Compare January 19, 2021 07:48

vlsi reviewed Jan 20, 2021


liyafan82 force-pushed the fly_0113_ndv branch from 950e3c0 to 9cdbd52 Compare January 21, 2021 02:41

amaliujia reviewed Jan 22, 2021


[CALCITE-4465] Estimate the number of distinct values by predicates

b715126

liyafan82 force-pushed the fly_0113_ndv branch from 9cdbd52 to b715126 Compare January 22, 2021 09:01


liyafan82 wants to merge 1 commit into apache:main from liyafan82:fly_0113_ndv


3 participants
