Skip to content

avoid binary searches in distinct count#7802

Closed
richardstartin wants to merge 1 commit intoapache:masterfrom
richardstartin:distinct-count-dictionarized
Closed

avoid binary searches in distinct count#7802
richardstartin wants to merge 1 commit intoapache:masterfrom
richardstartin:distinct-count-dictionarized

Conversation

@richardstartin
Copy link
Copy Markdown
Member

I have been working with a user to understand why distinctcounts of string values with group-bys are slow, and there appear to be two issues:

  1. It's necessary to dedictionarize segment level bitmaps of dictionary ids before producing intermediate results. This manifests itself in the set construction itself (line 1), reading the string values (line 3, which is mostly solved by optimise getUnpaddedString with SWAR padding search #7708 in 0.9.0), and hashing the newly created strings to put in to the set (lines 4 and 6).
  2. Appending dictionary ids to the segment level bitmaps, which results in binary searches (line 2)

Screenshot 2021-11-19 at 16 31 01

When sorting by the distinctcounted field, surprisingly, the contribution of binary searches gets worse:
Screenshot 2021-11-19 at 16 33 59

This change accumulates dictionary ids in a small bitset (8KB) which is flushed to the dictionary id bitmap whenever a dictionary id in a different 16 bit interval is encountered, to avoid doing binary searches. Whenever the cardinality of the counted column is less than 2^16, this is enough capacity to accumulate the entire distinct count, and when the dictionary ids within a column roughly increase with docId (+/- 2^15) the small bitset bypasses binary searches.

@richardstartin richardstartin force-pushed the distinct-count-dictionarized branch from db43fc4 to 7a3fcf3 Compare November 19, 2021 16:44
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Nov 19, 2021

Codecov Report

Merging #7802 (f383401) into master (ab01ac3) will decrease coverage by 6.51%.
The diff coverage is 100.00%.

❗ Current head f383401 differs from pull request most recent head 7a3fcf3. Consider uploading reports for the commit 7a3fcf3 to get more accurate results
Impacted file tree graph

@@             Coverage Diff              @@
##             master    #7802      +/-   ##
============================================
- Coverage     71.68%   65.16%   -6.52%     
- Complexity     4072     4076       +4     
============================================
  Files          1578     1533      -45     
  Lines         80784    78916    -1868     
  Branches      12001    11799     -202     
============================================
- Hits          57906    51429    -6477     
- Misses        18985    23824    +4839     
+ Partials       3893     3663     -230     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 68.62% <100.00%> (+<0.01%) ⬆️
unittests2 14.58% <0.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ion/function/DistinctCountAggregationFunction.java 32.00% <100.00%> (+0.60%) ⬆️
...n/function/DistinctCountMVAggregationFunction.java 16.54% <100.00%> (ø)
...a/org/apache/pinot/common/metrics/MinionMeter.java 0.00% <0.00%> (-100.00%) ⬇️
...g/apache/pinot/common/metrics/ControllerMeter.java 0.00% <0.00%> (-100.00%) ⬇️
.../apache/pinot/common/metrics/BrokerQueryPhase.java 0.00% <0.00%> (-100.00%) ⬇️
.../apache/pinot/common/metrics/MinionQueryPhase.java 0.00% <0.00%> (-100.00%) ⬇️
...he/pinot/common/messages/SegmentReloadMessage.java 0.00% <0.00%> (-100.00%) ⬇️
...pinot/core/data/manager/realtime/TimerService.java 0.00% <0.00%> (-100.00%) ⬇️
...pinot/minion/exception/TaskCancelledException.java 0.00% <0.00%> (-100.00%) ⬇️
...not/common/exception/HttpErrorStatusException.java 0.00% <0.00%> (-100.00%) ⬇️
... and 359 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ab01ac3...7a3fcf3. Read the comment docs.

@richardstartin richardstartin force-pushed the distinct-count-dictionarized branch from 7a3fcf3 to 98048eb Compare November 19, 2021 18:06
Copy link
Copy Markdown
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I conclude that using RoaringBitmapWriter is always more performant than directly writing into the RoaringBitmap?

Let's add this optimization to DistinctCountBitmap and DistinctCountHLL as well

@richardstartin
Copy link
Copy Markdown
Member Author

@Jackie-Jiang it depends. Constructing bitmaps from unordered values is inefficient, and in general there's no point in using RoaringBitmapWriter, so if the dictionary ids have a very large range this won't help much, but if the cardinality is less than the 2^16 this particular configuration will work very well because all the adds will be buffered into a small bitset, regardless of order. It's not clear to me how helpful this will be with unordered data with very high range though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants