[WIP] Push JSON-index GROUP BY COUNT(*) into a dictionary scan #18561
Draft
siddharthteotia wants to merge 4 commits into
For shapes like SELECT jsonExtractIndex(col, '$.x', 'STRING'), COUNT(*) FROM t GROUP BY 1, GroupByPlanNode now routes to a new JsonIndexGroupByOperator that counts via per-value posting-list cardinality over the JSON-index dictionary, avoiding forward-index reads and JSON parsing on the matching docs. Same-path JSON_MATCH predicates are pushed into the index lookup; cross-column filters are applied as a residual bitmap.

Parsing and same-path JSON_MATCH push-down helpers shared with JsonIndexDistinctOperator are extracted into JsonExtractIndexUtils. The DISTINCT operator's behavior is unchanged.

A same-path JSON_MATCH that could match missing-path docs (IS_NULL) is not forwarded into the JSON-index lookup, so correctness does not depend on implementation-specific "returns empty map" behavior of the reader SPI.

The new operator emits IntermediateRecord per (value, count) and lets the combine path sum across segments via CountAggregationFunction.merge.

Out of scope for v1 and falling back to the existing GroupByOperator: COUNT(DISTINCT), MIN/MAX/SUM, HAVING, multiple group-by keys, mutable segments, and the multi-stage engine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
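The counting step described in this commit can be illustrated with a minimal sketch. It is not the PR's code: `java.util.BitSet` stands in for the index's posting-list bitmaps, a `TreeMap` stands in for the JSON-index dictionary range for one path, and the class and method names are hypothetical. Dropping values whose intersection is empty is also an assumption about how groups are emitted.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: BitSet and TreeMap are stand-ins, not the Pinot index types. */
final class PostingListGroupByCountSketch {

  /**
   * For each value at the path, intersects its posting list with the filter
   * bitmap and records the cardinality. No document is read or parsed.
   */
  static Map<String, Long> countByValue(TreeMap<String, BitSet> postingListsByValue,
      BitSet filterDocs) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (Map.Entry<String, BitSet> entry : postingListsByValue.entrySet()) {
      // Clone so the stored posting list is not mutated by and().
      BitSet matched = (BitSet) entry.getValue().clone();
      matched.and(filterDocs);
      long count = matched.cardinality();
      if (count > 0) {
        // Assumption: values with no matching docs produce no group.
        counts.put(entry.getKey(), count);
      }
    }
    return counts;
  }

  /** Builds a bitmap from a list of doc ids. */
  static BitSet docs(int... docIds) {
    BitSet bits = new BitSet();
    for (int docId : docIds) {
      bits.set(docId);
    }
    return bits;
  }

  public static void main(String[] args) {
    TreeMap<String, BitSet> postings = new TreeMap<>();
    postings.put("DE", docs(0, 3, 4));
    postings.put("FR", docs(1, 5));
    postings.put("US", docs(2, 6, 7));
    BitSet filter = docs(0, 1, 2, 3, 6); // docs matched by the WHERE clause
    System.out.println(countByValue(postings, filter)); // prints {DE=2, FR=1, US=2}
  }
}
```

The loop runs once per distinct value rather than once per matched doc, which is the source of the savings when the path has few distinct values relative to the matched set.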
… tune the threshold
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #18561 +/- ##
============================================
+ Coverage 63.74% 64.28% +0.53%
+ Complexity 1932 1126 -806
============================================
Files 3292 3313 +21
Lines 201519 203983 +2464
Branches 31322 31754 +432
============================================
+ Hits 128468 131133 +2665
+ Misses 62773 62345 -428
- Partials 10278 10505 +227
…x JMH bench per-invocation state
This is WIP. Not yet ready for review.
Summary
Pushes GROUP BY jsonExtractIndex(col, '$.path', 'TYPE') + COUNT(*) into a dictionary scan over the JSON index instead of the forward-index + Jackson parse path. Work scales with the number of distinct values at the path (D), not with the number of matched documents (M). A runtime selectivity gate routes the query back to the standard GroupByOperator when D > k × M (k is SELECTIVITY_THRESHOLD), so the new path is chosen only when it is expected to be cheaper.

What runs today
For a query of this shape, GroupByOperator iterates the WHERE bitmap and, for each matched doc, reads the raw JSON from the forward index, parses it with Jackson, extracts the path, and hashes the value into the group map. For 5M matched docs and 200 countries, that is 5M parses to compute what the JSON index already knows.

What this PR does
- JsonIndexGroupByOperator (pinot-core/.../operator/query/). For each entry in the dictionary range covering the path, intersects the posting list with the WHERE bitmap via RoaringBitmap.andCardinality and emits (value, count). Zero forward-index reads, zero JSON parses.
- JsonExtractIndexUtils extracted from the existing JsonIndexDistinctOperator so both index-aware operators can share parsing + same-path JSON_MATCH push-down logic. DISTINCT operator behavior is unchanged.
- canUse(...). Compares path cardinality (D) to matched-doc count (M); routes to JsonIndexGroupByOperator only when D ≤ SELECTIVITY_THRESHOLD × M. New SPI method JsonIndexReader.getDistinctValueCountForPath(path) provides the cheap D estimate (ImmutableJsonIndexReader answers in O(log N) via the dictionary range; MutableJsonIndexImpl answers via the TreeMap sub-range; default delegates to materializing the value set for third-party readers).
- GroupByPlanNode refactored to build the filter operator once and reuse it for either path.
- BenchmarkJsonIndexGroupByCount sweeps (pathCardinality × matchedFraction) to empirically settle SELECTIVITY_THRESHOLD. Current value (2.0) is a placeholder pending the benchmark numbers.
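
The selectivity gate and the sub-range estimate of D described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the class name is hypothetical, the dictionary is modelled as a `TreeMap` keyed by `path + '\0' + value` (an assumed layout chosen for the example), and the threshold uses the placeholder value 2.0 quoted above.

```java
import java.util.TreeMap;

/** Illustrative only: names and the key layout are assumptions for this sketch. */
final class SelectivityGateSketch {

  // Placeholder value quoted in the PR description; pending benchmark results.
  static final double SELECTIVITY_THRESHOLD = 2.0;

  // Assumed separator between path and value in a dictionary key.
  static final char SEPARATOR = '\0';

  /**
   * Estimates D as the number of keys in the sorted sub-range that starts
   * with "path + SEPARATOR". Note that size() on a TreeMap sub-map view walks
   * the range, so this stand-in is not the O(log N) lookup an on-disk
   * dictionary range would allow.
   */
  static <V> int distinctValueCountForPath(TreeMap<String, V> dictionary, String path) {
    String fromInclusive = path + SEPARATOR;
    String toExclusive = path + (char) (SEPARATOR + 1);
    return dictionary.subMap(fromInclusive, true, toExclusive, false).size();
  }

  /** Chooses the index path only when D <= SELECTIVITY_THRESHOLD * M. */
  static boolean canUse(long distinctValues, long matchedDocs) {
    return distinctValues <= SELECTIVITY_THRESHOLD * matchedDocs;
  }

  public static void main(String[] args) {
    TreeMap<String, Integer> dictionary = new TreeMap<>();
    dictionary.put("$.city" + SEPARATOR + "Berlin", 0);
    dictionary.put("$.country" + SEPARATOR + "DE", 1);
    dictionary.put("$.country" + SEPARATOR + "FR", 2);
    dictionary.put("$.country" + SEPARATOR + "US", 3);
    dictionary.put("$.country_code" + SEPARATOR + "49", 4);

    int d = distinctValueCountForPath(dictionary, "$.country");
    System.out.println(d);                 // prints 3
    System.out.println(canUse(d, 5_000));  // prints true: few values, many matched docs
    System.out.println(canUse(d, 1));      // prints false: 3 > 2.0 * 1
  }
}
```

With a gate of this form, a highly selective filter (small M) falls back to the standard operator, since scanning every value at the path would cost more than parsing the few matched docs.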