Add inverted-index-based distinct operator with runtime cost heuristic #17872
Open
xiangfu0 wants to merge 4 commits into apache:master
Merge inverted index logic into DistinctOperator with a runtime cost heuristic that chooses between inverted-index and scan paths based on dictionary cardinality vs filtered doc count. The default cost ratio of 5 (derived from JMH benchmarking) can be overridden via the invertedIndexDistinctCostRatio query option.

Opt-in via: OPTION(useIndexBasedDistinctOperator=true)

Changes:
- DistinctOperator: dual-path execution with cost heuristic
- DistinctPlanNode: wire inverted index context when eligible
- CommonConstants/QueryOptionsUtils: new query option keys
- Integration tests for correctness validation
- Unit tests for cost heuristic boundary behavior
- JMH benchmark for inverted index vs scan crossover analysis

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
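A minimal sketch of the cost heuristic this commit describes, assuming the comparison given in the PR summary (dictionaryCardinality * costRatio <= filteredDocCount). The class and method names are hypothetical and are not taken from the PR's code.

```java
// Hypothetical sketch; names are illustrative and not the PR's actual classes.
final class DistinctCostHeuristicSketch {
  // Default ratio quoted in the PR description.
  static final int DEFAULT_COST_RATIO = 5;

  private DistinctCostHeuristicSketch() {
  }

  // Returns true when the inverted-index path looks cheaper than the scan path.
  static boolean useInvertedIndexPath(int dictionaryCardinality, int filteredDocCount, int costRatio) {
    // Widen to long so the multiplication cannot overflow for large dictionaries.
    return (long) dictionaryCardinality * costRatio <= filteredDocCount;
  }
}
```

With the default ratio, a column with 200 dictionary entries would take the inverted-index path only when the filter matches at least 1,000 documents.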
…lity check

Remove the useIndexBasedDistinctOperator query option. DistinctOperator now has a single constructor and checks inverted index eligibility internally (single column, dictionary + inverted index, no nulls). The cost heuristic runs automatically, with no opt-in needed. The invertedIndexDistinctCostRatio query option is retained for tuning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
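The eligibility conditions listed in this commit (single column, dictionary + inverted index, no nulls) could be checked roughly as follows. This is only a sketch: ColumnInfo is a hypothetical stand-in for whatever segment metadata the operator actually consults, and "no nulls" is modelled as a plain boolean flag.

```java
import java.util.List;

// Hypothetical sketch; ColumnInfo is an illustrative stand-in, not a Pinot type.
final class InvertedIndexEligibilitySketch {
  static final class ColumnInfo {
    final boolean hasDictionary;
    final boolean hasInvertedIndex;
    final boolean hasNulls;

    ColumnInfo(boolean hasDictionary, boolean hasInvertedIndex, boolean hasNulls) {
      this.hasDictionary = hasDictionary;
      this.hasInvertedIndex = hasInvertedIndex;
      this.hasNulls = hasNulls;
    }
  }

  private InvertedIndexEligibilitySketch() {
  }

  // Eligible only for a single DISTINCT column with dictionary + inverted index and no nulls.
  static boolean isEligible(List<ColumnInfo> distinctColumns) {
    if (distinctColumns.size() != 1) {
      return false;
    }
    ColumnInfo column = distinctColumns.get(0);
    return column.hasDictionary && column.hasInvertedIndex && !column.hasNulls;
  }
}
```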
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #17872 +/- ##
============================================
- Coverage 63.27% 63.19% -0.08%
- Complexity 1466 1481 +15
============================================
Files 3190 3191 +1
Lines 192101 192462 +361
Branches 29433 29509 +76
============================================
+ Hits 121547 121630 +83
- Misses 61040 61275 +235
- Partials 9514 9557 +43
…s not eligible

- Create ProjectOperator eagerly in constructor when inverted index path is not eligible, ensuring forward-index-disabled validation errors are thrown during plan creation and explain plan tree is complete
- Handle null _projectOperator in getExecutionStatistics() for error handling paths where the operator is called before execution completes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ward compatibility

Restore original DistinctOperator unchanged to avoid impacting existing production workloads. The inverted-index-based distinct logic is now in a separate InvertedIndexDistinctOperator, enabled via the query option useInvertedIndexDistinct=true. The cost ratio can be tuned via invertedIndexDistinctCostRatio (default 5).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
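For illustration, the two query options named in this commit could be read from a query-options map along these lines. The option keys and the default of 5 come from the commit message; the class, the method names, and the choice to fail on a non-numeric ratio are assumptions, and the PR's actual QueryOptionsUtils changes are not shown here.

```java
import java.util.Map;

// Hypothetical sketch; not the PR's QueryOptionsUtils implementation.
final class DistinctQueryOptionsSketch {
  static final String USE_INVERTED_INDEX_DISTINCT = "useInvertedIndexDistinct";
  static final String COST_RATIO = "invertedIndexDistinctCostRatio";
  static final int DEFAULT_COST_RATIO = 5;

  private DistinctQueryOptionsSketch() {
  }

  // Opt-in flag: treated as disabled unless explicitly set to "true".
  static boolean isInvertedIndexDistinctEnabled(Map<String, String> queryOptions) {
    return Boolean.parseBoolean(queryOptions.get(USE_INVERTED_INDEX_DISTINCT));
  }

  // Cost ratio: falls back to the default when the option is absent.
  // Assumption: a non-numeric value is rejected via NumberFormatException.
  static int getCostRatio(Map<String, String> queryOptions) {
    String value = queryOptions.get(COST_RATIO);
    return value == null ? DEFAULT_COST_RATIO : Integer.parseInt(value.trim());
  }
}
```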
4935111 to a1219f4
Summary

Add an inverted-index-based execution path for single-column DISTINCT queries in DistinctOperator. When a column has both a dictionary and an inverted index, the operator automatically decides at runtime whether to use the inverted index path (iterate dictionary entries with bitmap intersections) or the scan path (iterate filtered documents), based on a cost heuristic.

Key changes:
- DistinctOperator now internally checks inverted index eligibility (single column, dictionary + inverted index, no null values) and chooses the optimal path at runtime
- Cost heuristic: the inverted index path is chosen when dictionaryCardinality * costRatio <= filteredDocCount (default costRatio=5)
- invertedIndexDistinctCostRatio query option allows tuning the cost ratio per query
- DistinctPlanNode simplified: it no longer needs to check eligibility or query options
- InvertedIndexDistinctOperator: all logic absorbed into DistinctOperator

JMH Benchmark Results

Setup: 1M docs, 4 dictionary cardinalities × 5 filter selectivities, @Fork(value=1, warmups=3)

Key takeaways:
- The inverted index path wins when dictCardinality << filteredDocCount (e.g., 26,376x for 100 unique values over 1M docs)
- The scan path wins when dictCardinality >= filteredDocCount (e.g., high cardinality with selective filters)
- The default costRatio=5 captures the crossover well: the inverted index is chosen when it provides significant benefit, and scan is chosen for selective filters with high cardinality
- invertedIndexDistinctCostRatio query option allows per-query tuning if needed

Test plan
- InvertedIndexDistinctCostHeuristicTest (8 tests covering heuristic boundary, correctness, default behavior)
- DistinctQueriesTest (existing tests pass)
- OfflineClusterIntegrationTest.testDistinctWithInvertedIndex
- BenchmarkInvertedIndexDistinct (40 parameter combinations)

🤖 Generated with Claude Code
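To make the two execution paths from the summary concrete, here is a small self-contained sketch. It uses java.util.BitSet as a stand-in for the bitmaps a real inverted index would use, and plain arrays as stand-ins for the forward and inverted indexes; none of these names or structures come from the PR's code.

```java
import java.util.BitSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the two paths; data structures are simplified stand-ins.
final class DistinctPathsSketch {
  private DistinctPathsSketch() {
  }

  // Scan path: visit every filtered document and collect its dictionary id.
  // forwardIndex[docId] holds the dictionary id of the value in that document.
  static Set<Integer> scanPath(int[] forwardIndex, BitSet filteredDocs) {
    Set<Integer> distinctDictIds = new TreeSet<>();
    for (int docId = filteredDocs.nextSetBit(0); docId >= 0; docId = filteredDocs.nextSetBit(docId + 1)) {
      distinctDictIds.add(forwardIndex[docId]);
    }
    return distinctDictIds;
  }

  // Inverted-index path: visit every dictionary entry and keep it if its
  // posting bitmap shares at least one document with the filter bitmap.
  // invertedIndex[dictId] holds the documents containing that value.
  static Set<Integer> invertedIndexPath(BitSet[] invertedIndex, BitSet filteredDocs) {
    Set<Integer> distinctDictIds = new TreeSet<>();
    for (int dictId = 0; dictId < invertedIndex.length; dictId++) {
      if (invertedIndex[dictId].intersects(filteredDocs)) {
        distinctDictIds.add(dictId);
      }
    }
    return distinctDictIds;
  }
}
```

Both methods return the same set of dictionary ids; they differ in whether the work scales with the number of filtered documents (scan) or with the number of dictionary entries (inverted index), which is the trade-off the cost ratio is meant to capture.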