Refine index-based DISTINCT operators (JSON / inverted)#18588
Open
Jackie-Jiang wants to merge 1 commit into
Open
Refine index-based DISTINCT operators (JSON / inverted)#18588Jackie-Jiang wants to merge 1 commit into
Jackie-Jiang wants to merge 1 commit into
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR refines Pinot’s opt-in index-based DISTINCT execution paths for JSON and inverted/sorted indexes, adds a JSON missing-path query option, and replaces lower-level mock tests with query-driven coverage.
Changes:
- Moves JSON DISTINCT validation/execution into
JsonIndexDistinctOperator, adds 5-arg and skip-missing-path handling. - Refines inverted/sorted DISTINCT path selection, execution stats, and bitmap/range checks.
- Reworks tests around broker/operator query execution and removes older mock-based unit tests.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java |
Adds the jsonIndexDistinctSkipMissingPath query option key. |
pinot-common/src/main/java/org/apache/pinot/common/utils/config/QueryOptionsUtils.java |
Adds parsing helper for the new JSON DISTINCT skip option. |
pinot-core/src/main/java/org/apache/pinot/core/plan/DistinctPlanNode.java |
Routes jsonExtractIndex DISTINCT expressions to the JSON index operator by function name. |
pinot-core/src/main/java/org/apache/pinot/core/operator/transform/function/JsonExtractIndexTransformFunction.java |
Refactors JSON index transform argument handling and filter expression naming. |
pinot-core/src/main/java/org/apache/pinot/core/operator/query/JsonIndexDistinctOperator.java |
Reworks JSON index DISTINCT validation, filtering, missing-path handling, MV type acceptance, and statistics. |
pinot-core/src/main/java/org/apache/pinot/core/operator/query/InvertedIndexDistinctOperator.java |
Refines inverted/sorted index DISTINCT paths and statistics reporting. |
pinot-core/src/test/java/org/apache/pinot/queries/JsonIndexDistinctOperatorQueriesTest.java |
Adds query-path coverage for JSON index DISTINCT behavior. |
pinot-core/src/test/java/org/apache/pinot/queries/InvertedIndexDistinctOperatorQueriesTest.java |
Updates inverted index DISTINCT query tests and statistics assertions. |
pinot-core/src/test/java/org/apache/pinot/core/operator/query/JsonIndexDistinctOperatorTest.java |
Removes older mock-based JSON operator unit tests. |
pinot-core/src/test/java/org/apache/pinot/core/operator/query/InvertedIndexDistinctOperatorUnitTest.java |
Removes older mock-based inverted operator unit tests. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18588 +/- ##
============================================
+ Coverage 56.82% 64.29% +7.47%
- Complexity 7 1137 +1130
============================================
Files 2567 3335 +768
Lines 149066 205899 +56833
Branches 24103 32104 +8001
============================================
+ Hits 84700 132392 +47692
- Misses 57178 62889 +5711
- Partials 7188 10618 +3430
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
dea8cbd to
8351441
Compare
- JsonIndexDistinctOperator: validate 3/4/5-arg jsonExtractIndex in the constructor (mirroring JsonExtractIndexTransformFunction.init), accept MV `_ARRAY` types, intersect per-value doc ids with the WHERE-clause filter through a unified `remainingDocs` bitmap, and surface numDocsScanned in execution statistics. - New `jsonIndexDistinctSkipMissingPath` query option: when true, the operator skips parsing the 4-arg default, skips `remainingDocs` tracking, and never throws "Illegal Json Path". - `canUseJsonIndexDistinct` simplified to a function-name check; planner routes any `jsonExtractIndex` call through the operator and lets the constructor validate arguments. - InvertedIndexDistinctOperator: cache `_totalDocs`, short-circuit the DESC sorted path with `intersects` instead of `getLongCardinality`, drop redundant `advanceIfNeeded` / inner `hasNext`, and report a correct numDocsScanned for sorted / inverted paths. - Tests rewritten as queries-based suites (`JsonIndexDistinctOperatorQueriesTest`, `InvertedIndexDistinctOperatorQueriesTest`) that drive the full broker -> operator path and assert on execution statistics; the older mock-based unit tests are removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8351441 to
0590d12
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Refines the two index-based
DISTINCToperators added in #17872 / #17820 and rewrites their tests to drive the full broker → operator path.JsonIndexDistinctOperatorJsonExtractIndexTransformFunction.init. The operator accepts 3-/4-/5-argjsonExtractIndexcalls (path, type, optional default, optionalJSON_MATCHfilter expression) and MV_ARRAYtypes (INT_ARRAY,STRING_ARRAY, etc.).canUseJsonIndexDistinctis simplified to a function-name check; the planner routes everyjsonExtractIndexcall through the operator and lets the constructor surface invalid arguments asIllegalArgumentException.WHERE-clause filter through a singleremainingDocsbitmap, so cross-pathJSON_MATCH, base-column filters, andIS NULLfilters all use the same code path.numDocsScannedis populated from the filtered doc set (or total docs when the filter matches everything), so execution statistics line up with the scan/projection path.jsonIndexDistinctSkipMissingPath: when set, the operator skips parsing the 4-arg default, skipsremainingDocstracking, and suppresses the "Illegal Json Path" throw for the 3-arg form. Useful when the caller knows every doc has the path (or doesn't care about misses).InvertedIndexDistinctOperator_totalDocsin the constructor instead of recomputing per call.intersects(boolean) rather thangetLongCardinality, which is orders of magnitude cheaper on dense bitmaps.advanceIfNeeded(startDocId)on the ASC sorted path and the redundant innerfilterIter.hasNext()check.numDocsScannedon the sorted / inverted paths (previously zero).FilterPreparationhelper and renames_numEntriesExamined→_numEntriesExaminedPostFilterso the stats name matches its meaning.Tests
JsonIndexDistinctOperatorQueriesTest,InvertedIndexDistinctOperatorQueriesTest) that drivesSELECT DISTINCTthroughBaseQueriesTest, asserts on result tables, explain strings, and execution statistics (numDocsScanned,numEntriesScannedInFilter,numEntriesScannedPostFilter,numTotalDocs).OPTION(...)syntax in the suites is converted to standardSET a=b;prefixes; repeated query strings are extracted into shared constants.