[WIP] Push JSON-index GROUP BY COUNT(*) into a dictionary scan #18561

Draft
siddharthteotia wants to merge 4 commits into apache:master from siddharthteotia:json-index-groupby-count-pushdown

@siddharthteotia siddharthteotia commented May 21, 2026

This is WIP. Not yet ready for review.

Summary

Pushes GROUP BY jsonExtractIndex(col, '$.path', 'TYPE') + COUNT(*) into a dictionary scan over the JSON index, replacing the forward-index read + Jackson parse path. Work scales with the number of distinct values at the path (D), not with the number of matched documents (M). A runtime selectivity gate routes the query back to the standard GroupByOperator when D > k × M, where k is SELECTIVITY_THRESHOLD, so the new path is chosen only when it is expected to win.
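
For illustration, the gate reduces to a single comparison. The sketch below is not the PR's canUse(...) code: the class and method names are hypothetical, and k is passed in as a parameter because the description lists the current value (2.0) only as a placeholder.

```java
// Hypothetical sketch of the selectivity gate; not the code in this PR.
final class SelectivityGateSketch {
  /**
   * Returns true when the index-based path is expected to win.
   *
   * @param distinctValuesAtPath D: number of distinct values at the JSON path
   * @param matchedDocs          M: number of docs matched by the WHERE clause
   * @param k                    threshold multiplier (SELECTIVITY_THRESHOLD in the PR text)
   */
  static boolean useIndexPath(long distinctValuesAtPath, long matchedDocs, double k) {
    // Take the index path only when D <= k * M; otherwise fall back to GroupByOperator.
    return distinctValuesAtPath <= k * matchedDocs;
  }
}
```

With the numbers from the example in this description (D = 200, M = 5,000,000), the check passes for any k ≥ 1. A query that matches 100 docs on a path with 1,000,000 distinct values fails it at k = 2.0 and would fall back.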

What runs today

Given a query like:

SELECT jsonExtractIndex(payload, '$.country', 'STRING') AS country, COUNT(*)
FROM events
WHERE JSON_MATCH(payload, '"$.event_type" = ''click''')
GROUP BY country

GroupByOperator iterates the WHERE bitmap and, for each matched doc, reads the raw JSON from the forward index, parses it with Jackson, extracts the path, and hashes into the group map. For 5M matched docs and 200 countries, that's 5M parses for what the JSON index already knows.
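
A rough model of that per-document loop, to make the cost visible. This is a simplified sketch, not Pinot's GroupByOperator: java.util.BitSet stands in for the WHERE bitmap, a String array stands in for the forward index, and parseAndExtract is a hypothetical callback standing in for the Jackson parse plus path extraction.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified model of the existing path; all names here are hypothetical.
final class PerDocGroupBySketch {
  static Map<String, Long> groupByCount(
      BitSet matched, String[] rawJsonByDocId, Function<String, String> parseAndExtract) {
    Map<String, Long> counts = new HashMap<>();
    // One forward-index read and one parse per matched document: O(M) parses.
    for (int docId = matched.nextSetBit(0); docId >= 0; docId = matched.nextSetBit(docId + 1)) {
      String key = parseAndExtract.apply(rawJsonByDocId[docId]);
      counts.merge(key, 1L, Long::sum);
    }
    return counts;
  }
}
```

The callback runs once per set bit in the filter, so the number of parses tracks M regardless of how few distinct keys come out.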

What this PR does

  1. New operator JsonIndexGroupByOperator (pinot-core/.../operator/query/). For each entry in the dictionary range covering the path, intersects the posting list with the WHERE bitmap via RoaringBitmap.andCardinality and emits (value, count). Zero forward-index reads, zero JSON parses.
  2. Shared parsing helper JsonExtractIndexUtils extracted from the existing JsonIndexDistinctOperator so both index-aware operators can share parsing + same-path JSON_MATCH push-down logic. DISTINCT operator behavior is unchanged.
  3. Same-path JSON_MATCH push-down. A WHERE predicate on the same path as the GROUP BY key gets pushed into the index lookup. Cross-column / cross-path filters are applied as a residual bitmap intersection.
  4. IS_NULL safety. A same-path JSON_MATCH that could match missing-path docs is NOT forwarded into the index lookup, so correctness no longer depends on implementation-specific "returns empty map" behavior of the reader SPI.
  5. Selectivity gate in canUse(...). Compares path cardinality (D) to matched-doc count (M); routes to JsonIndexGroupByOperator only when D ≤ SELECTIVITY_THRESHOLD × M. New SPI method JsonIndexReader.getDistinctValueCountForPath(path) provides the cheap D estimate (ImmutableJsonIndexReader answers in O(log N) via the dictionary range; MutableJsonIndexImpl answers via the TreeMap sub-range; default delegates to materializing the value set for third-party readers).
  6. GroupByPlanNode refactored to build the filter operator once and reuse it for either path.
  7. JMH benchmark BenchmarkJsonIndexGroupByCount sweeps (pathCardinality × matchedFraction) to empirically settle SELECTIVITY_THRESHOLD. Current value (2.0) is a placeholder pending the benchmark numbers.
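
The core of items 1 and 5 can be sketched with plain JDK collections. This is an illustration under stated assumptions, not the operator in this PR: a TreeMap stands in for the JSON-index dictionary, java.util.BitSet stands in for RoaringBitmap (the clone + and + cardinality sequence plays the role of RoaringBitmap.andCardinality), the "path, separator, value" key layout is assumed for the sketch, and all class and method names are hypothetical. Any doc-id mapping the real index performs is ignored here.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: dictionary-range scan with per-value posting-list counts.
final class JsonIndexGroupBySketch {
  // Assumed separator between path and value in a dictionary key.
  private static final char SEP = '\0';

  // Contiguous dictionary range holding every "path<SEP>value" key for the path.
  static SortedMap<String, BitSet> rangeForPath(TreeMap<String, BitSet> index, String path) {
    return index.subMap(path + SEP, path + (char) (SEP + 1));
  }

  // D estimate: number of distinct values stored under the path.
  static int distinctValueCountForPath(TreeMap<String, BitSet> index, String path) {
    return rangeForPath(index, path).size();
  }

  // GROUP BY value, COUNT(*): one bitmap intersection per distinct value, no parsing.
  static Map<String, Long> groupByCount(TreeMap<String, BitSet> index, String path, BitSet filter) {
    Map<String, Long> counts = new LinkedHashMap<>();
    int prefixLength = path.length() + 1; // path plus separator
    for (Map.Entry<String, BitSet> entry : rangeForPath(index, path).entrySet()) {
      BitSet hits = (BitSet) entry.getValue().clone();
      hits.and(filter); // residual WHERE bitmap
      int count = hits.cardinality();
      if (count > 0) {
        counts.put(entry.getKey().substring(prefixLength), (long) count);
      }
    }
    return counts;
  }
}
```

The loop body runs once per dictionary entry in the range, so the work follows D rather than M, which is the trade-off the selectivity gate is meant to arbitrate.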

siddharthteotia and others added 2 commits May 20, 2026 22:26
For shapes like SELECT jsonExtractIndex(col, '$.x', 'STRING'), COUNT(*) FROM t
GROUP BY 1, GroupByPlanNode now routes to a new JsonIndexGroupByOperator that
counts via per-value posting-list cardinality over the JSON-index dictionary,
avoiding forward-index reads and JSON parsing on the matching docs. Same-path
JSON_MATCH predicates are pushed into the index lookup; cross-column filters
are applied as a residual bitmap.

Parsing and same-path JSON_MATCH push-down helpers shared with
JsonIndexDistinctOperator are extracted into JsonExtractIndexUtils. The
DISTINCT operator's behavior is unchanged.

A same-path JSON_MATCH that could match missing-path docs (IS_NULL) is not
forwarded into the JSON-index lookup, so correctness does not depend on
implementation-specific "returns empty map" behavior of the reader SPI.

The new operator emits IntermediateRecord per (value, count) and lets the
combine path sum across segments via CountAggregationFunction.merge. Out of
scope for v1 and falling back to the existing GroupByOperator: COUNT(DISTINCT),
MIN/MAX/SUM, HAVING, multiple group-by keys, mutable segments, and the
multi-stage engine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
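
To illustrate the cross-segment step described in the commit message: each segment emits its own (value, count) records, and combining them amounts to summing counts per value. The sketch below uses plain maps with hypothetical names; it does not use IntermediateRecord or CountAggregationFunction.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of summing per-segment (value, count) results.
final class SegmentCountMergeSketch {
  static Map<String, Long> mergeSegments(List<Map<String, Long>> perSegmentCounts) {
    Map<String, Long> merged = new TreeMap<>();
    for (Map<String, Long> segment : perSegmentCounts) {
      // Counts for the same group key are additive across segments.
      segment.forEach((value, count) -> merged.merge(value, count, Long::sum));
    }
    return merged;
  }
}
```

Because COUNT(*) is additive, the merge needs no per-document state, which is why only this aggregation is in scope for v1 per the list above.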
@siddharthteotia siddharthteotia changed the title Push JSON-index GROUP BY COUNT(*) into a dictionary scan [WIP] Push JSON-index GROUP BY COUNT(*) into a dictionary scan May 21, 2026

codecov-commenter commented May 21, 2026

Codecov Report

❌ Patch coverage is 61.57205% with 88 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.28%. Comparing base (d9dd6e0) to head (711e45b).
⚠️ Report is 18 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...not/core/operator/query/JsonExtractIndexUtils.java | 47.00% | 33 Missing and 20 partials ⚠️ |
| .../core/operator/query/JsonIndexGroupByOperator.java | 73.72% | 23 Missing and 8 partials ⚠️ |
| ...va/org/apache/pinot/core/plan/GroupByPlanNode.java | 0.00% | 2 Missing and 1 partial ⚠️ |
| ...core/operator/query/JsonIndexDistinctOperator.java | 85.71% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18561      +/-   ##
============================================
+ Coverage     63.74%   64.28%   +0.53%     
+ Complexity     1932     1126     -806     
============================================
  Files          3292     3313      +21     
  Lines        201519   203983    +2464     
  Branches      31322    31754     +432     
============================================
+ Hits         128468   131133    +2665     
+ Misses        62773    62345     -428     
- Partials      10278    10505     +227     
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-21 | 64.28% <61.57%> (+0.53%) ⬆️ |
| temurin | 64.28% <61.57%> (+0.53%) ⬆️ |
| unittests | 64.28% <61.57%> (+0.53%) ⬆️ |
| unittests1 | 56.76% <61.57%> (+0.97%) ⬆️ |
| unittests2 | 35.48% <0.00%> (+0.23%) ⬆️ |
