Add inverted-index-based distinct operator with runtime cost heuristic #17872

Open
xiangfu0 wants to merge 4 commits into apache:master from xiangfu0:inverted-index-distinct-operator

@xiangfu0 xiangfu0 commented Mar 13, 2026

Summary

Add an inverted-index-based execution path for single-column DISTINCT queries in DistinctOperator. When a column has both a dictionary and an inverted index, the operator automatically decides at runtime whether to use the inverted index path (iterate dictionary entries with bitmap intersections) or the scan path (iterate filtered documents), based on a cost heuristic.
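
To make the two paths concrete, here is a minimal, self-contained sketch. It is not the Pinot implementation: `java.util.BitSet` stands in for the inverted-index posting bitmaps, an `int[]` stands in for the dictionary-encoded forward index, and the class and method names (`DistinctPathsSketch`, `distinctByScan`, `distinctByInvertedIndex`, `buildPostings`) are hypothetical.

```java
import java.util.BitSet;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: BitSet and int[] are stand-ins for the real index structures.
final class DistinctPathsSketch {

  // Scan path: visit each filtered doc and read its dictionary id.
  // Work grows with the number of filtered docs.
  static Set<Integer> distinctByScan(int[] forwardIndex, BitSet filteredDocs) {
    Set<Integer> dictIds = new TreeSet<>();
    for (int docId = filteredDocs.nextSetBit(0); docId >= 0; docId = filteredDocs.nextSetBit(docId + 1)) {
      dictIds.add(forwardIndex[docId]);
    }
    return dictIds;
  }

  // Inverted-index path: visit each dictionary id and keep it if its
  // posting bitmap intersects the filter. Work grows with the cardinality.
  static Set<Integer> distinctByInvertedIndex(BitSet[] postings, BitSet filteredDocs) {
    Set<Integer> dictIds = new TreeSet<>();
    for (int dictId = 0; dictId < postings.length; dictId++) {
      if (postings[dictId].intersects(filteredDocs)) {
        dictIds.add(dictId);
      }
    }
    return dictIds;
  }

  // Builds one bitmap of doc ids per dictionary id from the forward index.
  static BitSet[] buildPostings(int[] forwardIndex, int cardinality) {
    BitSet[] postings = new BitSet[cardinality];
    for (int dictId = 0; dictId < cardinality; dictId++) {
      postings[dictId] = new BitSet(forwardIndex.length);
    }
    for (int docId = 0; docId < forwardIndex.length; docId++) {
      postings[forwardIndex[docId]].set(docId);
    }
    return postings;
  }

  public static void main(String[] args) {
    int[] forwardIndex = {0, 1, 2, 1, 0, 2, 2, 1}; // dict id per doc
    BitSet filteredDocs = new BitSet();
    filteredDocs.set(1);
    filteredDocs.set(3);
    filteredDocs.set(7);
    BitSet[] postings = buildPostings(forwardIndex, 3);
    System.out.println(distinctByScan(forwardIndex, filteredDocs));
    System.out.println(distinctByInvertedIndex(postings, filteredDocs));
  }
}
```

Both methods return the same set of dictionary ids for any filter; they differ only in whether the loop is driven by the filtered docs or by the dictionary, which is what the cost heuristic weighs.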

Key changes:

  • DistinctOperator now internally checks inverted index eligibility (single column, dictionary + inverted index, no null values) and chooses the optimal path at runtime
  • Cost heuristic: use inverted index when dictionaryCardinality * costRatio <= filteredDocCount (default costRatio=5)
  • invertedIndexDistinctCostRatio query option allows tuning the cost ratio per query
  • DistinctPlanNode simplified — no longer needs to check eligibility or query options
  • Deleted InvertedIndexDistinctOperator — all logic absorbed into DistinctOperator
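
The heuristic and the query option can be sketched as follows. This is a hedged illustration, not the code in this PR: the class name, the method names, the use of a plain `Map<String, String>` for query options, and the choice of `double` for the ratio are all assumptions. Only the option key, the default of 5, and the inequality come from the description above.

```java
import java.util.Map;

// Hypothetical helper; names and types are assumptions for illustration.
final class DistinctCostHeuristicSketch {
  static final String COST_RATIO_OPTION = "invertedIndexDistinctCostRatio";
  static final double DEFAULT_COST_RATIO = 5;

  // Reads the per-query cost ratio, falling back to the default when unset.
  static double costRatio(Map<String, String> queryOptions) {
    String value = queryOptions.get(COST_RATIO_OPTION);
    return value == null ? DEFAULT_COST_RATIO : Double.parseDouble(value);
  }

  // Inverted index path is chosen when dictionaryCardinality * costRatio <= filteredDocCount.
  static boolean useInvertedIndex(int dictionaryCardinality, long filteredDocCount, double costRatio) {
    return dictionaryCardinality * costRatio <= filteredDocCount;
  }

  public static void main(String[] args) {
    double defaultRatio = costRatio(Map.of());
    // 100 distinct values, 1M filtered docs: low cardinality relative to the filter.
    System.out.println(useInvertedIndex(100, 1_000_000L, defaultRatio));
    // 100K distinct values, 1K filtered docs: highly selective filter.
    System.out.println(useInvertedIndex(100_000, 1_000L, defaultRatio));
    // Raising the ratio per query makes the inverted index path harder to choose.
    double tuned = costRatio(Map.of(COST_RATIO_OPTION, "20"));
    System.out.println(useInvertedIndex(100_000, 1_000_000L, tuned));
  }
}
```

Doing the multiplication in `double` sidesteps integer overflow for very large cardinalities; the comparison is inclusive, so a column sitting exactly on the boundary takes the inverted index path.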

JMH Benchmark Results

Setup: 1M docs, 4 dictionary cardinalities × 5 filter selectivities, @Fork(value=1, warmups=3)

| dictCard | filterDocs | Inverted (us/op) | Scan (us/op) | Winner | Speedup |
|---|---|---|---|---|---|
| 100 | 1K | 29.2 | 9.5 | Scan | 3.1x |
| 100 | 10K | 16.2 | 96.0 | Inverted | 5.9x |
| 100 | 100K | 1.1 | 2,097 | Inverted | 1,961x |
| 100 | 500K | 0.7 | 9,887 | Inverted | 13,183x |
| 100 | 1M | 0.7 | 19,518 | Inverted | 26,376x |
| 1K | 1K | 4,090 | 15.9 | Scan | 257x |
| 1K | 10K | 864 | 365 | Scan | 2.4x |
| 1K | 100K | 22.1 | 3,004 | Inverted | 136x |
| 1K | 500K | 10.6 | 15,408 | Inverted | 1,453x |
| 1K | 1M | 8.5 | 31,585 | Inverted | 3,738x |
| 10K | 1K | 9,130 | 20.1 | Scan | 454x |
| 10K | 10K | 13,190 | 272 | Scan | 48.5x |
| 10K | 100K | 234 | 734 | Inverted | 3.1x |
| 10K | 500K | 123 | 2,180 | Inverted | 17.7x |
| 10K | 1M | 95.2 | 4,016 | Inverted | 42.2x |
| 100K | 1K | 23,483 | 18.3 | Scan | 1,283x |
| 100K | 10K | 53,949 | 557 | Scan | 96.8x |
| 100K | 100K | 9,877 | 1,372 | Scan | 7.2x |
| 100K | 500K | 4,348 | 4,003 | ~Equal | ~1x |
| 100K | 1M | 2,191 | 7,016 | Inverted | 3.2x |

Key takeaways:

  • Inverted index path provides orders of magnitude speedup when dictCardinality << filteredDocCount (e.g., 26,376x for 100 unique values over 1M docs)
  • Scan path wins in every measured cell where dictCardinality >= filteredDocCount (e.g., high cardinality with selective filters); it also wins narrowly (2.4x to 3.1x) at 100 values over 1K docs and 1K values over 10K docs
  • Default costRatio=5 captures the crossover well: the inverted index is chosen when it provides significant benefit, and scan is chosen for selective filters with high cardinality
  • The invertedIndexDistinctCostRatio query option allows per-query tuning if needed
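
As a rough sanity check on the crossover, the sketch below replays the heuristic against the 20 cells in the table and counts how often it picks the path that measured faster. The class and method names are hypothetical, the timings are copied from the table, and the ~Equal cell is scored by its raw numbers (scan slightly faster). Single-fork JMH numbers are noisy, so treat the counts as indicative only.

```java
// Hypothetical replay of the cost heuristic against the benchmark table.
final class HeuristicAgreementSketch {
  // {dictCard, filterDocs, inverted us/op, scan us/op}, copied from the table.
  static final double[][] CELLS = {
    {100, 1_000, 29.2, 9.5}, {100, 10_000, 16.2, 96.0}, {100, 100_000, 1.1, 2_097},
    {100, 500_000, 0.7, 9_887}, {100, 1_000_000, 0.7, 19_518},
    {1_000, 1_000, 4_090, 15.9}, {1_000, 10_000, 864, 365}, {1_000, 100_000, 22.1, 3_004},
    {1_000, 500_000, 10.6, 15_408}, {1_000, 1_000_000, 8.5, 31_585},
    {10_000, 1_000, 9_130, 20.1}, {10_000, 10_000, 13_190, 272}, {10_000, 100_000, 234, 734},
    {10_000, 500_000, 123, 2_180}, {10_000, 1_000_000, 95.2, 4_016},
    {100_000, 1_000, 23_483, 18.3}, {100_000, 10_000, 53_949, 557}, {100_000, 100_000, 9_877, 1_372},
    {100_000, 500_000, 4_348, 4_003}, {100_000, 1_000_000, 2_191, 7_016},
  };

  // Same inequality as the PR description: inverted when dictCard * costRatio <= filteredDocs.
  static boolean useInvertedIndex(double dictCard, double filteredDocs, double costRatio) {
    return dictCard * costRatio <= filteredDocs;
  }

  // Number of cells where the heuristic's choice matches the faster measured path.
  static int agreement(double costRatio) {
    int matches = 0;
    for (double[] cell : CELLS) {
      boolean invertedMeasuredFaster = cell[2] < cell[3];
      if (useInvertedIndex(cell[0], cell[1], costRatio) == invertedMeasuredFaster) {
        matches++;
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    for (double ratio : new double[] {1, 5, 10}) {
      System.out.println("costRatio=" + ratio + " agrees on " + agreement(ratio) + " of " + CELLS.length);
    }
  }
}
```

With these numbers the default ratio of 5 agrees with the faster path in 17 of 20 cells; the three misses are the two narrow scan wins (100 over 1K, 1K over 10K) and the ~Equal cell.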

Test plan

  • Unit tests: InvertedIndexDistinctCostHeuristicTest (8 tests covering heuristic boundary, correctness, default behavior)
  • Unit tests: DistinctQueriesTest (existing tests pass)
  • Integration tests: OfflineClusterIntegrationTest.testDistinctWithInvertedIndex
  • JMH benchmark: BenchmarkInvertedIndexDistinct (40 parameter combinations)
  • Checkstyle and spotless pass

🤖 Generated with Claude Code

xiangfu0 and others added 2 commits March 13, 2026 01:24
Merge inverted index logic into DistinctOperator with a runtime cost
heuristic that chooses between inverted-index and scan paths based on
dictionary cardinality vs filtered doc count. The default cost ratio
of 5 (derived from JMH benchmarking) can be overridden via the
invertedIndexDistinctCostRatio query option.

Opt-in via: OPTION(useIndexBasedDistinctOperator=true)

Changes:
- DistinctOperator: dual-path execution with cost heuristic
- DistinctPlanNode: wire inverted index context when eligible
- CommonConstants/QueryOptionsUtils: new query option keys
- Integration tests for correctness validation
- Unit tests for cost heuristic boundary behavior
- JMH benchmark for inverted index vs scan crossover analysis

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lity check

Remove the useIndexBasedDistinctOperator query option. DistinctOperator
now has a single constructor and checks inverted index eligibility
internally (single column, dictionary + inverted index, no nulls).
The cost heuristic runs automatically — no opt-in needed.

The invertedIndexDistinctCostRatio query option is retained for tuning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codecov-commenter commented Mar 13, 2026

Codecov Report

❌ Patch coverage is 46.82927% with 109 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.19%. Comparing base (097a89f) to head (a1219f4).
⚠️ Report is 4 commits behind head on master.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../operator/query/InvertedIndexDistinctOperator.java | 46.56% | 88 Missing and 13 partials ⚠️ |
| ...a/org/apache/pinot/core/plan/DistinctPlanNode.java | 38.46% | 3 Missing and 5 partials ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17872      +/-   ##
============================================
- Coverage     63.27%   63.19%   -0.08%     
- Complexity     1466     1481      +15     
============================================
  Files          3190     3191       +1     
  Lines        192101   192462     +361     
  Branches      29433    29509      +76     
============================================
+ Hits         121547   121630      +83     
- Misses        61040    61275     +235     
- Partials       9514     9557      +43     

| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 55.50% <46.82%> (-7.73%) ⬇️ |
| java-21 | 63.17% <46.82%> (-0.08%) ⬇️ |
| temurin | 63.19% <46.82%> (-0.08%) ⬇️ |
| unittests | 63.19% <46.82%> (-0.08%) ⬇️ |
| unittests1 | 55.52% <46.82%> (-0.07%) ⬇️ |
| unittests2 | 34.21% <0.00%> (-0.06%) ⬇️ |

xiangfu0 and others added 2 commits March 13, 2026 13:18
…s not eligible

- Create ProjectOperator eagerly in constructor when inverted index path is
  not eligible, ensuring forward-index-disabled validation errors are thrown
  during plan creation and explain plan tree is complete
- Handle null _projectOperator in getExecutionStatistics() for error handling
  paths where the operator is called before execution completes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ward compatibility

Restore original DistinctOperator unchanged to avoid impacting existing
production workloads. The inverted-index-based distinct logic is now in a
separate InvertedIndexDistinctOperator, enabled via the query option
useInvertedIndexDistinct=true. The cost ratio can be tuned via
invertedIndexDistinctCostRatio (default 5).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
xiangfu0 force-pushed the inverted-index-distinct-operator branch from 4935111 to a1219f4 March 13, 2026 23:34