Add JsonIndexDistinct benchmark and optimize same-path JSON_MATCH #17921
xiangfu0 merged 3 commits into apache:master
Codecov Report
| | master | #17921 | +/- |
|---|---|---|---|
| Coverage | 63.37% | 63.29% | -0.08% |
| Complexity | 1543 | 1543 | |
| Files | 3200 | 3200 | |
| Lines | 194169 | 194335 | +166 |
| Branches | 29915 | 29961 | +46 |
| Hits | 123051 | 123004 | -47 |
| Misses | 61466 | 61635 | +169 |
| Partials | 9652 | 9696 | +44 |
Pull request overview
This PR adds a new JMH benchmark and introduces an optimization path for SELECT DISTINCT jsonExtractIndex(...) queries that can fully push down a same-path JSON_MATCH filter into the JSON index, avoiding per-value bitmap materialization and flattened→real doc ID conversion.
Changes:
- Add `BenchmarkJsonIndexDistinct` covering 3 query variants and a `_missingPathFraction` benchmark parameter.
- Extend `JsonIndexReader` with a `getMatchingValues(key, filter)` SPI and implement an optimized override in `ImmutableJsonIndexReader`.
- Update `JsonIndexDistinctOperator` to use a fast path for fully pushed-down same-path `JSON_MATCH`, and add unit tests for planning/strategy behavior.
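The SPI change in the second bullet follows a common pattern: a default method that works for every reader, plus an override in the reader that can answer more cheaply. The sketch below illustrates only that pattern. `ToyJsonIndexReader`, `ToyImmutableReader`, `matchingDocIdsByValue`, and `matchingValues` are hypothetical names invented for this example and are not Pinot's actual interfaces or signatures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Hypothetical, simplified stand-in for an index-reader SPI (not Pinot's real interface).
interface ToyJsonIndexReader {
  // Existing-style call: for values under `path` that pass `filter`, return value -> doc IDs.
  Map<String, int[]> matchingDocIdsByValue(String path, Predicate<String> filter);

  // Default: correct for any reader, but builds a doc-ID map only to throw it away.
  default List<String> matchingValues(String path, Predicate<String> filter) {
    return new ArrayList<>(matchingDocIdsByValue(path, filter).keySet());
  }
}

// Toy reader that keeps value strings separately addressable, so it can skip doc IDs.
final class ToyImmutableReader implements ToyJsonIndexReader {
  private final Map<String, TreeMap<String, int[]>> postings = new TreeMap<>();

  void add(String path, String value, int... docIds) {
    postings.computeIfAbsent(path, p -> new TreeMap<>()).put(value, docIds);
  }

  @Override
  public Map<String, int[]> matchingDocIdsByValue(String path, Predicate<String> filter) {
    Map<String, int[]> out = new TreeMap<>();
    for (Map.Entry<String, int[]> e : postings.getOrDefault(path, new TreeMap<>()).entrySet()) {
      if (filter.test(e.getKey())) {
        out.put(e.getKey(), e.getValue().clone()); // copies doc IDs the caller may never read
      }
    }
    return out;
  }

  // Optimized override: inspects only the value strings, never the doc-ID arrays.
  @Override
  public List<String> matchingValues(String path, Predicate<String> filter) {
    List<String> out = new ArrayList<>();
    for (String value : postings.getOrDefault(path, new TreeMap<>()).keySet()) {
      if (filter.test(value)) {
        out.add(value);
      }
    }
    return out;
  }
}
```

Because the new method has a default body, other reader implementations keep compiling and stay correct; only the reader with a cheaper answer needs to override it.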
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/reader/JsonIndexReader.java | Adds `getMatchingValues` SPI default method for distinct-only value retrieval. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java | Implements `getMatchingValues` with a buffer-based intersection optimization; refactors filter/path normalization helpers. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/query/JsonIndexDistinctOperator.java | Adds same-path `JSON_MATCH` pushdown + a “distinct values only” fast path and a doc-centric strategy for sparse/high-cardinality cases. |
| pinot-core/src/test/java/org/apache/pinot/core/operator/query/JsonIndexDistinctOperatorTest.java | Adds unit tests validating pushdown/strategy/default behavior (needs updates to match new fast path). |
| pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkJsonIndexDistinct.java | Adds a JMH benchmark to measure baseline vs index-based distinct operator across variants and missing-path scenarios. |
Latest Benchmark Results (post-fix rerun)
THREE_ARG Results (same-path filter, no default value)
FOUR_ARG Results (same-path filter, with default value)
`JsonIndexDistinctOperator` outperforms `DistinctOperator` in all 24 tested parameter combinations, with speedups ranging from 1.5x to 6.8x.
Settings: 500K rows, JDK 17
…or same-path filters
For fully-pushed-down same-path `JSON_MATCH` predicates, the previous implementation scanned the dictionary twice (once for filter evaluation, once for value-map building) and materialized per-value posting list bitmaps + doc ID conversions that were never used.
This adds `getMatchingDistinctValues()` to the `JsonIndexReader` SPI with a fused single-pass implementation in `ImmutableJsonIndexReader` that evaluates the predicate directly on dictionary value strings: zero posting list reads, zero bitmap operations, zero doc ID mapping lookups.
Benchmark shows 12x-337x speedup across all cardinality/selectivity combinations, eliminating the previous regression at high cardinality + low selectivity.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ath) comparison
Adds `THREE_ARG_EXTRA_FILTER` and `FOUR_ARG_EXTRA_FILTER` query variants that combine the same-path `REGEXP_LIKE` with a cross-path `JSON_MATCH` on `$.cluster`, preventing the fully-pushed-down fast path. This exercises the bitmap-based code path and confirms it still outperforms baseline in most configurations (1.2x-4.2x).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Documentation PR: pinot-contrib/pinot-docs#616
Documentation has been created for the JSON index DISTINCT optimization.
Summary
- Add `BenchmarkJsonIndexDistinct` JMH benchmark for `JsonIndexDistinctOperator` vs `DistinctOperator`
- `JsonIndexDistinctOperator` that uses the JSON index directly for `SELECT DISTINCT jsonExtractIndex(...)` queries
- … `defaultValue` is provided (matching baseline behavior)
- Add `getMatchingDistinctValues()` to the `JsonIndexReader` SPI with a fused single-pass dictionary scan in `ImmutableJsonIndexReader`: it evaluates same-path predicates directly on dictionary value strings without reading posting lists, building bitmaps, or converting doc IDs
Optimization: Single-pass dictionary scan for fully-pushed-down same-path filters
When the WHERE clause is a single same-path `JSON_MATCH` predicate (e.g., `WHERE JSON_MATCH(col, 'REGEXP_LIKE("$.path", ...)')`), the previous implementation:
- scanned the dictionary twice (once for filter evaluation, once for value-map building)
- materialized per-value posting list bitmaps and doc ID conversions that were never used
The new `getMatchingDistinctValues()` method fuses filter evaluation and value extraction into a single dictionary scan that evaluates the predicate directly on each dictionary entry's value string. Zero posting list reads, zero bitmap operations, zero doc ID mapping lookups.
Benchmark Results
Settings: 500K rows, JDK 17, `-f 0 -wi 2 -i 3`, `verifySpeedup()` median guard passed all configs.
Same-path filter only (fully-pushed-down fast path)
40/40 configs pass (THREE_ARG + FOUR_ARG × all cardinalities × all fractions). Speedups: 12x–337x.
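To make the difference between the two code paths concrete, here is a minimal sketch of a two-scan approach next to a fused single-scan approach over a toy in-memory dictionary. Everything in it is an assumption made for illustration: the class `FusedScanSketch`, the `path + '\0' + value` key layout, the `int[]` posting lists, and the use of `Matcher.find()` for `REGEXP_LIKE`. It is not the code in `ImmutableJsonIndexReader`.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Toy model: each dictionary key is path + SEP + value and maps to a posting list of doc IDs.
// Assumption: every doc carries at most one value per path (as in a flattened index).
final class FusedScanSketch {
  static final char SEP = '\0';

  // Two scans: first evaluate the filter into a doc bitmap, then rebuild the value list from it.
  static List<String> twoPass(TreeMap<String, int[]> dictionary, String path, Pattern regex) {
    String prefix = path + SEP;
    BitSet matchingDocs = new BitSet();
    for (Map.Entry<String, int[]> e : dictionary.entrySet()) {
      String key = e.getKey();
      if (key.startsWith(prefix) && regex.matcher(key.substring(prefix.length())).find()) {
        for (int doc : e.getValue()) {
          matchingDocs.set(doc); // posting-list read + bitmap write
        }
      }
    }
    List<String> values = new ArrayList<>();
    for (Map.Entry<String, int[]> e : dictionary.entrySet()) {
      String key = e.getKey();
      if (!key.startsWith(prefix)) {
        continue;
      }
      for (int doc : e.getValue()) {
        if (matchingDocs.get(doc)) { // second posting-list read
          values.add(key.substring(prefix.length()));
          break;
        }
      }
    }
    return values;
  }

  // One scan: test the predicate on the value string itself; posting lists are never touched.
  static List<String> fused(TreeMap<String, int[]> dictionary, String path, Pattern regex) {
    String prefix = path + SEP;
    List<String> values = new ArrayList<>();
    for (String key : dictionary.keySet()) {
      if (key.startsWith(prefix)) {
        String value = key.substring(prefix.length());
        if (regex.matcher(value).find()) {
          values.add(value);
        }
      }
    }
    return values;
  }
}
```

Under the one-value-per-path assumption both methods return the same list, which is why the posting-list work in `twoPass` is pure overhead when the only filter is on the extracted path itself.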
Same-path + cross-path filter (bitmap-based path, no fast-path shortcut)
With a cross-path filter the fast path is not used; the bitmap-based path still outperforms baseline in 20/24 configs (1.2x–4.2x), is even in 2 configs (~1.0x), and shows a minor regression in 2 configs (0.7x at cardinality=1000/fraction=50%).
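The reason a cross-path filter rules out the fast path can be shown with a small sketch: once the filter constrains a different path, a value qualifies only if some document holding it also passes that filter, so posting lists have to be consulted. The code below is an illustration under assumptions. `CrossPathSketch` and its methods are hypothetical, and `java.util.BitSet` stands in for whatever bitmap type the index actually uses.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the bitmap-based path: distinct values restricted by a doc-level filter.
final class CrossPathSketch {
  // Keeps a value only if at least one of its docs is in the set allowed by the other filter.
  static List<String> distinctWithDocFilter(TreeMap<String, BitSet> postingsByValue, BitSet allowedDocs) {
    List<String> values = new ArrayList<>();
    for (Map.Entry<String, BitSet> e : postingsByValue.entrySet()) {
      if (e.getValue().intersects(allowedDocs)) { // per-value bitmap work cannot be skipped here
        values.add(e.getKey());
      }
    }
    return values;
  }

  // Small helper to build a BitSet from doc IDs.
  static BitSet bits(int... docIds) {
    BitSet set = new BitSet();
    for (int doc : docIds) {
      set.set(doc);
    }
    return set;
  }
}
```

For example, with `prod` in docs {0, 1}, `staging` in {2}, `dev` in {3}, and a cluster filter that allows docs {0, 2}, only `prod` and `staging` survive, a result that cannot be derived from the value strings alone.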
Test Plan
- `JsonIndexDistinctOperatorTest`: 9 unit tests (fast path verification, cross-path fallback, default value, missing path error, planning eligibility)
- `JsonPathTest`: 42 integration tests covering all same-path predicate types (EQ, NOT_EQ, IN, NOT_IN, REGEXP_LIKE, IS_NOT_NULL), 4-arg default value, LIMIT, cross-path filter, and baseline comparison on both SSE and MSE
- `JsonIndexTest`: 18 segment-level JSON index tests
- `DistinctQueriesTest`: 6 distinct query tests
- `JsonExtractIndexTransformFunctionTest`: 31 transform function tests
- `./mvnw spotless:apply checkstyle:check license:check`: all pass
- `verifySpeedup()` assertion: all 40 same-path configs pass
🤖 Generated with Claude Code