avoid binary searches in distinct count #7802
richardstartin wants to merge 1 commit into apache:master
Codecov Report
@@ Coverage Diff @@
## master #7802 +/- ##
============================================
- Coverage 71.68% 65.16% -6.52%
- Complexity 4072 4076 +4
============================================
Files 1578 1533 -45
Lines 80784 78916 -1868
Branches 12001 11799 -202
============================================
- Hits 57906 51429 -6477
- Misses 18985 23824 +4839
+ Partials 3893 3663 -230
Jackie-Jiang left a comment:
Can I conclude that using RoaringBitmapWriter is always more performant than directly writing into the RoaringBitmap?
Let's add this optimization to DistinctCountBitmap and DistinctCountHLL as well
@Jackie-Jiang it depends. Constructing bitmaps from unordered values is inefficient, and in general there's no point in using |
I have been working with a user to understand why distinct counts of string values with group-bys are slow, and there appear to be two issues:
When sorting by the distinct-counted field, surprisingly, the contribution of binary searches gets worse:

This change accumulates dictionary ids in a small bitset (8KB, i.e. 2^16 bits) which is flushed to the dictionary id bitmap whenever a dictionary id in a different 16-bit interval is encountered, to avoid doing binary searches. Whenever the cardinality of the counted column is less than 2^16, every dictionary id falls in the same 16-bit interval, so this is enough capacity to accumulate the entire distinct count without an intermediate flush. When the dictionary ids within a column roughly increase with docId (+/- 2^15), consecutive ids mostly stay within one interval, so the small bitset bypasses the binary searches.
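
The buffering idea described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class name `BufferedDictIdSet` and its methods are hypothetical, and a plain `HashMap` of 8KB bitsets keyed by the high 16 bits stands in for the real dictionary id bitmap so that the example needs nothing beyond the JDK. Dictionary ids are assumed to be non-negative.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: buffers dictionary ids that share the same high 16 bits. */
final class BufferedDictIdSet {
  // 2^16 bits = 1024 longs = 8KB: one bit per possible low-16-bit value.
  private final long[] buffer = new long[1 << 10];
  // Stand-in for the dictionary id bitmap (assumption): high 16 bits -> 8KB bitset.
  private final Map<Integer, long[]> containers = new HashMap<>();
  // High 16 bits of the interval currently buffered; -1 means the buffer is empty.
  private int currentKey = -1;

  void add(int dictId) {
    int key = dictId >>> 16;
    if (key != currentKey) {
      // A different 16-bit interval was encountered: flush, then switch intervals.
      flush();
      currentKey = key;
    }
    int low = dictId & 0xFFFF;
    // Setting a bit is a direct array write; no search is needed.
    // Java uses only the low 6 bits of the shift count for long shifts.
    buffer[low >>> 6] |= 1L << low;
  }

  void flush() {
    if (currentKey < 0) {
      return;
    }
    // One lookup of the target interval per flush instead of one per value.
    long[] target = containers.computeIfAbsent(currentKey, k -> new long[1 << 10]);
    for (int i = 0; i < buffer.length; i++) {
      target[i] |= buffer[i];
      buffer[i] = 0L;
    }
    currentKey = -1;
  }

  int cardinality() {
    flush();
    int count = 0;
    for (long[] words : containers.values()) {
      for (long word : words) {
        count += Long.bitCount(word);
      }
    }
    return count;
  }
}
```

In this sketch the cost of locating the target interval is paid once per flush, so when ids arrive clustered within one 16-bit interval almost every `add` is a single array write. If ids alternate between intervals on every call, each `add` triggers an 8KB flush, which is the case where buffering would not be expected to help.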