Skip to content

perf: optimize DISTINCTCOUNTHLL for high-cardinality dictionary-encoded columns#18141

Closed
deeppatel710 wants to merge 4 commits intoapache:masterfrom
deeppatel710:feat/distinctcounthll-dict-size-optimization
Closed

perf: optimize DISTINCTCOUNTHLL for high-cardinality dictionary-encoded columns#18141
deeppatel710 wants to merge 4 commits intoapache:masterfrom
deeppatel710:feat/distinctcounthll-dict-size-optimization

Conversation

@deeppatel710
Copy link
Copy Markdown
Contributor

Summary

Fixes the performance bottleneck in DISTINCTCOUNTHLL for dictionary-encoded columns with high cardinality (reported in #17336).

DISTINCTCOUNTHLL currently always accumulates dictionary IDs into a RoaringBitmap before converting to HLL at finalization. For high-cardinality columns
(14M+ distinct values), bitmap insertions dominate execution time at O(n log n), causing queries to take 6-10 seconds.

This adds an optional third argument dictSizeThreshold (default: 100,000). When dictionary.length() > dictSizeThreshold, dictionary values are offered
directly to the HyperLogLog, bypassing the bitmap entirely. Since DISTINCTCOUNTHLL already returns an approximate result, exact bitmap deduplication is
unnecessary — HLL handles duplicate offers gracefully.

The same root cause was fixed for DISTINCT_COUNT_SMART_HLL in #17411, which demonstrated 4x-10x CPU reduction for high-cardinality workloads. This PR
applies the equivalent optimization to the more commonly used DISTINCTCOUNTHLL function.

Usage

DISTINCTCOUNTHLL(col)             -- default threshold (100K)
DISTINCTCOUNTHLL(col, 12)         -- custom log2m, default threshold
DISTINCTCOUNTHLL(col, 12, 50000)  -- custom log2m and threshold     
DISTINCTCOUNTHLL(col, 12, 0)      -- disable optimization (threshold = Integer.MAX_VALUE)                                                                      

Changes                                                                                                                                                        
                                                          
- DistinctCountHLLAggregationFunction: added DEFAULT_DICT_SIZE_THRESHOLD = 100_000, _dictSizeThreshold field, optional 3rd constructor arg, and direct-HLL path
 in all 6 aggregation paths (non-group-by SV/MV, group-by SV/MV with SV/MV group keys)
- Tests: added 4 new test cases covering parameter defaults, custom threshold, disabled optimization, and correctness of direct-HLL path vs bitmap path        
                                                                                                                                                               
Test plan
                                                                                                                                                               
- All existing HLL tests pass                                                                                                                                  
- New tests verify parameter parsing and approximate correctness of direct-HLL path
- Manual benchmark against high-cardinality column (expected ~4-10x speedup matching #17411 results)  

Deep Patel and others added 4 commits April 7, 2026 23:39
…it(); fix nonLeaderCleanup

- Remove redundant public abstract modifiers from RealtimeOffsetAutoResetHandler interface
- Update init() javadoc: "called once in constructor" -> "called once after instantiation"
- Remove constructor injection from RealtimeOffsetAutoResetKafkaHandler; init() is now the
  sole initialization path called by the manager after no-arg reflective instantiation
- Fix ensureBackfillJobsRunning signature: List<String> -> Collection<String> to match interface
- Update RealtimeOffsetAutoResetManager.getOrConstructHandler() to use no-arg constructor + init()
- Fix bug in nonLeaderCleanup(): also clear _tableBackfillTopics to avoid stale state on re-election
- Fix misleading error message: "Custom analyzer" -> "Custom handler"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove redundant `ensureBackfillJobsRunning` abstract override from
  RealtimeOffsetAutoResetKafkaHandler; the interface already declares it
- Fix TestRealtimeOffsetAutoResetHandler to use no-arg constructor so
  reflection-based instantiation in getOrConstructHandler() works correctly
- Strengthen testNonLeaderCleanup to assert handler is removed from
  internal map after nonLeaderCleanup is called

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…high-cardinality columns

For dictionary-encoded columns with high cardinality (e.g., 14M+ distinct values),
DISTINCTCOUNTHLL spent O(n log n) time inserting dictionary IDs into a RoaringBitmap
before converting to HLL at finalization. This mirrors the performance issue originally
reported for DISTINCTCOUNTSMARTHLL (fixed in apache#17411).

This commit introduces an optional third argument `dictSizeThreshold` (default: 100,000).
When the dictionary size exceeds the threshold, dictionary values are offered directly
to the HyperLogLog without going through a RoaringBitmap first. Since DISTINCTCOUNTHLL
already produces an approximate result, bitmap deduplication is not needed for correctness
in high-cardinality scenarios — HLL handles duplicate offers gracefully.

The optimization applies to all aggregation paths:
- Non-group-by SV and MV
- Group-by SV (both SV and MV group keys)
- Group-by MV (both SV and MV group keys)

Usage:
  DISTINCTCOUNTHLL(col)               -- default threshold (100K)
  DISTINCTCOUNTHLL(col, 12)           -- custom log2m, default threshold
  DISTINCTCOUNTHLL(col, 12, 50000)    -- custom log2m and threshold
  DISTINCTCOUNTHLL(col, 12, 0)        -- disable optimization (threshold = MAX_VALUE)

Expected speedup for high-cardinality columns: 4x-10x, consistent with the
benchmark results demonstrated for DISTINCTCOUNTSMARTHLL in apache#17411.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@deeppatel710
Copy link
Copy Markdown
Contributor Author

deeppatel710 commented Apr 9, 2026

@Jackie-Jiang can you take a look at this optimization PR and leave a feedback please? Thanks

@deeppatel710
Copy link
Copy Markdown
Contributor Author

Closing in favor of clean branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant