perf: optimize DISTINCTCOUNTHLL for high-cardinality dictionary-encoded columns#18141
Closed
deeppatel710 wants to merge 4 commits into apache:master from
Conversation
…it(); fix nonLeaderCleanup

- Remove redundant public abstract modifiers from RealtimeOffsetAutoResetHandler interface
- Update init() javadoc: "called once in constructor" -> "called once after instantiation"
- Remove constructor injection from RealtimeOffsetAutoResetKafkaHandler; init() is now the sole initialization path called by the manager after no-arg reflective instantiation
- Fix ensureBackfillJobsRunning signature: List<String> -> Collection<String> to match interface
- Update RealtimeOffsetAutoResetManager.getOrConstructHandler() to use no-arg constructor + init()
- Fix bug in nonLeaderCleanup(): also clear _tableBackfillTopics to avoid stale state on re-election
- Fix misleading error message: "Custom analyzer" -> "Custom handler"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove redundant `ensureBackfillJobsRunning` abstract override from RealtimeOffsetAutoResetKafkaHandler; the interface already declares it
- Fix TestRealtimeOffsetAutoResetHandler to use no-arg constructor so reflection-based instantiation in getOrConstructHandler() works correctly
- Strengthen testNonLeaderCleanup to assert handler is removed from internal map after nonLeaderCleanup is called

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
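The no-arg-constructor-plus-init() lifecycle these commits describe can be sketched as below. This is a hypothetical illustration, not Pinot's actual code: the interface and method names mirror the commit messages, but `LoggingHandler`, the `tableName` parameter, and the manager's map are invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the pattern described above: handlers are created reflectively via a
// no-arg constructor, then initialized through init() as the sole initialization path.
public class HandlerConstructionSketch {
  interface RealtimeOffsetAutoResetHandler {
    void init(String tableName); // called once after instantiation
  }

  // Hypothetical handler implementation used only for this illustration.
  public static class LoggingHandler implements RealtimeOffsetAutoResetHandler {
    String _tableName;
    public LoggingHandler() { } // no-arg constructor required for reflection
    @Override
    public void init(String tableName) {
      _tableName = tableName;
    }
  }

  final Map<String, RealtimeOffsetAutoResetHandler> _handlers = new ConcurrentHashMap<>();

  // Analogous to getOrConstructHandler(): reflective no-arg construction + init().
  RealtimeOffsetAutoResetHandler getOrConstructHandler(String className, String tableName)
      throws ReflectiveOperationException {
    RealtimeOffsetAutoResetHandler handler = _handlers.get(tableName);
    if (handler == null) {
      handler = (RealtimeOffsetAutoResetHandler) Class.forName(className)
          .getDeclaredConstructor().newInstance();
      handler.init(tableName); // initialization happens here, never in the constructor
      _handlers.put(tableName, handler);
    }
    return handler;
  }

  public static void main(String[] args) throws Exception {
    HandlerConstructionSketch manager = new HandlerConstructionSketch();
    RealtimeOffsetAutoResetHandler h = manager.getOrConstructHandler(
        "HandlerConstructionSketch$LoggingHandler", "myTable");
    System.out.println(h instanceof LoggingHandler);
  }
}
```

Separating construction from initialization this way is what lets the manager instantiate any configured handler class uniformly, which is why the constructor-injection variant was removed.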
…high-cardinality columns

For dictionary-encoded columns with high cardinality (e.g., 14M+ distinct values), DISTINCTCOUNTHLL spent O(n log n) time inserting dictionary IDs into a RoaringBitmap before converting to HLL at finalization. This mirrors the performance issue originally reported for DISTINCTCOUNTSMARTHLL (fixed in apache#17411).

This commit introduces an optional third argument `dictSizeThreshold` (default: 100,000). When the dictionary size exceeds the threshold, dictionary values are offered directly to the HyperLogLog without going through a RoaringBitmap first. Since DISTINCTCOUNTHLL already produces an approximate result, bitmap deduplication is not needed for correctness in high-cardinality scenarios — HLL handles duplicate offers gracefully.

The optimization applies to all aggregation paths:
- Non-group-by SV and MV
- Group-by SV (both SV and MV group keys)
- Group-by MV (both SV and MV group keys)

Usage:
DISTINCTCOUNTHLL(col)            -- default threshold (100K)
DISTINCTCOUNTHLL(col, 12)        -- custom log2m, default threshold
DISTINCTCOUNTHLL(col, 12, 50000) -- custom log2m and threshold
DISTINCTCOUNTHLL(col, 12, 0)     -- disable optimization (threshold = MAX_VALUE)

Expected speedup for high-cardinality columns: 4x-10x, consistent with the benchmark results demonstrated for DISTINCTCOUNTSMARTHLL in apache#17411.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
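The threshold dispatch this commit describes can be sketched as follows. All names here are illustrative, not Pinot's actual classes: a `BitSet` stands in for the RoaringBitmap of dictionary IDs, and an exact `Set` of values stands in for the approximate HyperLogLog (whose tolerance of duplicate offers is the property the optimization relies on).

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the dictSizeThreshold dispatch described above.
public class DictSizeDispatchSketch {
  static final int DEFAULT_DICT_SIZE_THRESHOLD = 100_000; // default from the commit

  // Low-cardinality path: dedupe dictionary IDs in a bitmap, convert at finalization.
  static int viaBitmap(int[] dictIds, String[] dictionary) {
    BitSet bitmap = new BitSet(); // stand-in for RoaringBitmap
    for (int dictId : dictIds) {
      bitmap.set(dictId);
    }
    Set<String> hll = new HashSet<>(); // stand-in for HyperLogLog
    for (int id = bitmap.nextSetBit(0); id >= 0; id = bitmap.nextSetBit(id + 1)) {
      hll.add(dictionary[id]);
    }
    return hll.size();
  }

  // High-cardinality path: offer dictionary values directly, bypassing the bitmap.
  static int direct(int[] dictIds, String[] dictionary) {
    Set<String> hll = new HashSet<>();
    for (int dictId : dictIds) {
      hll.add(dictionary[dictId]); // duplicate offers are harmless
    }
    return hll.size();
  }

  static int distinctCount(int[] dictIds, String[] dictionary, int dictSizeThreshold) {
    return dictionary.length > dictSizeThreshold
        ? direct(dictIds, dictionary)
        : viaBitmap(dictIds, dictionary);
  }

  public static void main(String[] args) {
    String[] dictionary = {"a", "b", "c", "d"};
    int[] dictIds = {0, 1, 1, 2, 0};
    System.out.println(distinctCount(dictIds, dictionary, 2));   // direct path -> 3
    System.out.println(distinctCount(dictIds, dictionary, 100)); // bitmap path -> 3
  }
}
```

Both paths produce the same estimate; the direct path simply skips the O(n log n) bitmap maintenance when the dictionary is large enough that deduplication buys nothing.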
Contributor
Author
@Jackie-Jiang can you take a look at this optimization PR and leave feedback please? Thanks
Contributor
Author
Closing in favor of clean branch
Summary
Fixes the performance bottleneck in `DISTINCTCOUNTHLL` for dictionary-encoded columns with high cardinality (reported in #17336). `DISTINCTCOUNTHLL` currently always accumulates dictionary IDs into a `RoaringBitmap` before converting to HLL at finalization. For high-cardinality columns (14M+ distinct values), bitmap insertions dominate execution time at O(n log n), causing queries to take 6-10 seconds.
This adds an optional third argument `dictSizeThreshold` (default: 100,000). When `dictionary.length() > dictSizeThreshold`, dictionary values are offered directly to the HyperLogLog, bypassing the bitmap entirely. Since `DISTINCTCOUNTHLL` already returns an approximate result, exact bitmap deduplication is unnecessary — HLL handles duplicate offers gracefully.
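The claim that duplicate offers are harmless follows from how HLL registers are updated: each register keeps only the maximum rank it has ever seen, and taking a max is idempotent. A toy HLL-style register update (deliberately simplified, not Pinot's or stream-lib's implementation) makes this concrete:

```java
import java.util.Arrays;

// Toy HyperLogLog-style estimator illustrating that re-offering the same value
// never changes the sketch state: register updates are max() operations.
public class DuplicateOfferSketch {
  final int[] registers = new int[16]; // 16 buckets (log2m = 4)

  void offer(String value) {
    long h = value.hashCode() * 0x9E3779B97F4A7C15L; // spread the hash bits
    int bucket = (int) (h & 15);                     // low bits pick the register
    int rank = Long.numberOfLeadingZeros(h | 15) + 1; // toy rank of remaining bits
    registers[bucket] = Math.max(registers[bucket], rank); // max is idempotent
  }

  public static void main(String[] args) {
    DuplicateOfferSketch once = new DuplicateOfferSketch();
    DuplicateOfferSketch many = new DuplicateOfferSketch();
    for (String v : new String[]{"a", "b", "c"}) {
      once.offer(v);
      for (int i = 0; i < 1000; i++) {
        many.offer(v); // 1000 duplicate offers per value
      }
    }
    // Identical register state, hence identical estimate.
    System.out.println(Arrays.equals(once.registers, many.registers));
  }
}
```

This is why skipping bitmap deduplication cannot change the estimate: the HLL state after n offers of a value equals the state after one offer.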
The same root cause was fixed for `DISTINCT_COUNT_SMART_HLL` in #17411, which demonstrated 4x-10x CPU reduction for high-cardinality workloads. This PR applies the equivalent optimization to the more commonly used `DISTINCTCOUNTHLL` function.

Usage