Add JsonIndexDistinct benchmark and optimize same-path JSON_MATCH #17921

Merged
xiangfu0 merged 3 commits into apache:master from xiangfu0:xiangfu/json-index-distinct-benchmark on Apr 3, 2026

xiangfu0 commented Mar 20, 2026

Summary

  • Add BenchmarkJsonIndexDistinct JMH benchmark for JsonIndexDistinctOperator vs DistinctOperator
  • Add optimized JsonIndexDistinctOperator that uses the JSON index directly for SELECT DISTINCT jsonExtractIndex(...) queries
  • Throw error in value-centric path when docs are missing the queried JSON path and no defaultValue is provided (matching baseline behavior)
  • Add getMatchingDistinctValues() to JsonIndexReader SPI with fused single-pass dictionary scan in ImmutableJsonIndexReader — evaluates same-path predicates directly on dictionary value strings without reading posting lists, building bitmaps, or converting doc IDs
  • Add unit tests for same-path pushdown, different-path doc-level filtering, 4-arg default value handling, and planning eligibility
  • Add integration tests covering all same-path predicate types: EQ, NOT_EQ, IN, NOT_IN, REGEXP_LIKE, IS_NOT_NULL, plus 4-arg with default and LIMIT
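The missing-path rule in the third bullet can be illustrated with a small sketch. This is not the Pinot source: the class name `MissingPathSketch`, the method `finishDistinct`, its parameters, and the exception type are all hypothetical, chosen only to show how the 3-arg form (no default) and the 4-arg form (with default) might diverge when some matching docs lack the queried path.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the missing-path rule; names are illustrative only.
class MissingPathSketch {

  /**
   * @param indexedValues distinct values found in the index for the path
   * @param docsWithPath  number of matching docs that contain the path
   * @param matchingDocs  total number of docs selected by the filter
   * @param defaultValue  the 4-arg default, or null for the 3-arg form
   */
  static Set<String> finishDistinct(Set<String> indexedValues, int docsWithPath,
      int matchingDocs, String defaultValue) {
    Set<String> result = new TreeSet<>(indexedValues);
    if (docsWithPath < matchingDocs) {
      if (defaultValue == null) {
        // Exception type and message are assumptions made for illustration.
        throw new IllegalStateException("JSON path missing in "
            + (matchingDocs - docsWithPath) + " docs and no default value given");
      }
      // Docs without the path contribute the default as one more distinct value.
      result.add(defaultValue);
    }
    return result;
  }

  public static void main(String[] args) {
    // 4-arg style call: two docs lack the path, so the default joins the result.
    System.out.println(finishDistinct(Set.of("a", "b"), 8, 10, "N/A"));
  }
}
```

Calling `finishDistinct(Set.of("a"), 8, 10, null)` in this sketch throws, which mirrors the "error when no defaultValue is provided" behavior described in the summary.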

Optimization: Single-pass dictionary scan for fully-pushed-down same-path filters

When the WHERE clause is a single same-path JSON_MATCH predicate (e.g., WHERE JSON_MATCH(col, 'REGEXP_LIKE("$.path", ...)')), the previous implementation:

  1. Scanned the dictionary twice — once to evaluate the filter (OR matching posting lists into a bitmap), then again to build the value→docId map (AND each posting list with the filter bitmap)
  2. Materialized per-value bitmaps and converted flattened doc IDs to real doc IDs — even though these bitmaps were never accessed for the fully-pushed-down case

The new getMatchingDistinctValues() method fuses filter evaluation and value extraction into a single dictionary scan that evaluates the predicate directly on each dictionary entry's value string. As a result, it performs no posting list reads, no bitmap operations, and no doc ID mapping lookups.
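The difference between the two shapes can be sketched as follows. This is a toy model, not the ImmutableJsonIndexReader code: the dictionary for one JSON path is represented as a `TreeMap` from value string to a `java.util.BitSet` posting list, every doc is assumed to hold exactly one value for the path, and `FusedScanSketch`, `twoPass`, and `singlePass` are hypothetical names.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical toy model of a per-path dictionary: value string -> posting list.
class FusedScanSketch {

  // Two-pass shape: pass 1 ORs matching posting lists into a filter bitmap,
  // pass 2 ANDs every posting list with that bitmap to find surviving values.
  static SortedSet<String> twoPass(TreeMap<String, BitSet> dict, Predicate<String> filter) {
    BitSet matchingDocs = new BitSet();
    for (Map.Entry<String, BitSet> e : dict.entrySet()) {
      if (filter.test(e.getKey())) {
        matchingDocs.or(e.getValue());
      }
    }
    SortedSet<String> result = new TreeSet<>();
    for (Map.Entry<String, BitSet> e : dict.entrySet()) {
      BitSet docs = (BitSet) e.getValue().clone();
      docs.and(matchingDocs);
      if (!docs.isEmpty()) {
        result.add(e.getKey());
      }
    }
    return result;
  }

  // Fused shape: the predicate runs on the value string only; posting lists
  // are never read and no bitmap is built.
  static SortedSet<String> singlePass(TreeMap<String, BitSet> dict, Predicate<String> filter) {
    SortedSet<String> result = new TreeSet<>();
    for (String value : dict.keySet()) {
      if (filter.test(value)) {
        result.add(value);
      }
    }
    return result;
  }

  static BitSet bits(int... docIds) {
    BitSet b = new BitSet();
    for (int id : docIds) {
      b.set(id);
    }
    return b;
  }

  // Sample data: five docs, one value each.
  static TreeMap<String, BitSet> sampleDict() {
    TreeMap<String, BitSet> dict = new TreeMap<>();
    dict.put("prod-1", bits(0, 3));
    dict.put("test-1", bits(1));
    dict.put("test-2", bits(2, 4));
    return dict;
  }

  public static void main(String[] args) {
    Predicate<String> regexpLike = v -> v.matches(".*test.*");
    System.out.println(twoPass(sampleDict(), regexpLike));
    System.out.println(singlePass(sampleDict(), regexpLike));
  }
}
```

Under the one-value-per-doc assumption both methods return the same set, so the second scan and the bitmap work in `twoPass` add cost without changing the answer, which is the redundancy the fused scan removes.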

Benchmark Results

Settings: 500K rows, JDK 17, -f 0 -wi 2 -i 3. The verifySpeedup() median guard passed for all configs.

Same-path filter only (fully-pushed-down fast path)

SELECT DISTINCT JSON_EXTRACT_INDEX(tags, '$.instance', 'STRING') AS tag_value
FROM myTable WHERE JSON_MATCH(tags, 'REGEXP_LIKE("$.instance", ''.*test.*'')')

| Cardinality | Match % | DistinctOp (ms) | JsonIndexDistinctOp (ms) | Speedup |
|---|---|---|---|---|
| 10 | 1% | 4.45 | 0.06 | 80x |
| 10 | 50% | 10.68 | 0.05 | 203x |
| 10 | 100% | 17.91 | 0.05 | 337x |
| 100 | 1% | 3.12 | 0.06 | 51x |
| 100 | 50% | 12.67 | 0.06 | 203x |
| 100 | 100% | 21.29 | 0.06 | 333x |
| 1,000 | 1% | 7.14 | 0.14 | 51x |
| 1,000 | 50% | 19.14 | 0.15 | 127x |
| 1,000 | 100% | 29.62 | 0.17 | 175x |
| 10,000 | 1% | 20.76 | 0.93 | 22x |
| 10,000 | 50% | 57.20 | 1.11 | 52x |
| 10,000 | 100% | 89.54 | 1.32 | 68x |
| 100,000 | 1% | 112.79 | 9.09 | 12x |
| 100,000 | 50% | 390.33 | 12.85 | 30x |
| 100,000 | 100% | 694.57 | 19.54 | 36x |

40/40 configs pass (THREE_ARG + FOUR_ARG × all cardinalities × all fractions). Speedups: 12x–337x.

Same-path + cross-path filter (bitmap-based path, no fast-path shortcut)

SELECT DISTINCT JSON_EXTRACT_INDEX(tags, '$.instance', 'STRING') AS tag_value
FROM myTable
WHERE JSON_MATCH(tags, 'REGEXP_LIKE("$.instance", ''.*test.*'')')
  AND JSON_MATCH(tags, '"$.cluster" = ''cluster-0''')

| Cardinality | Match % | DistinctOp (ms) | JsonIndexDistinctOp (ms) | Speedup |
|---|---|---|---|---|
| 100 | 10% | 3.32 | 0.79 | 4.2x |
| 100 | 50% | 3.12 | 1.15 | 2.7x |
| 100 | 100% | 3.19 | 1.93 | 1.7x |
| 1,000 | 10% | 4.15 | 2.04 | 2.0x |
| 1,000 | 50% | 4.20 | 5.87 | 0.7x |
| 1,000 | 100% | 4.53 | 4.49 | 1.0x |
| 10,000 | 10% | 13.89 | 8.42 | 1.6x |
| 10,000 | 50% | 16.01 | 11.29 | 1.4x |
| 10,000 | 100% | 19.43 | 15.80 | 1.2x |
| 100,000 | 10% | 94.26 | 42.21 | 2.2x |
| 100,000 | 50% | 105.45 | 48.80 | 2.2x |
| 100,000 | 100% | 103.19 | 76.41 | 1.4x |

With a cross-path filter the fast path is not used; the bitmap-based path still outperforms baseline in 20/24 configs (1.2x–4.2x), is even in 2 configs (~1.0x), and shows a minor regression in 2 configs (0.7x at cardinality=1000/fraction=50%).
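A rough sketch of why the cross-path case needs doc-level bitmaps is shown below. It is illustrative only: `java.util.BitSet` stands in for the index's posting-list bitmaps, and `CrossPathSketch`, `distinctInstances`, and the sample data are hypothetical. The point is that a value such as `test-2` can satisfy the same-path REGEXP_LIKE and still be excluded because none of its docs satisfy the `$.cluster` predicate, so the value string alone cannot decide membership.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical sketch of a bitmap-based distinct with a cross-path filter.
class CrossPathSketch {

  static BitSet bits(int... docIds) {
    BitSet b = new BitSet();
    for (int id : docIds) {
      b.set(id);
    }
    return b;
  }

  static SortedSet<String> distinctInstances(Map<String, BitSet> instancePostings,
      Pattern instanceRegex, BitSet clusterDocs) {
    // Step 1: same-path predicate, OR the posting lists of matching values.
    BitSet filterDocs = new BitSet();
    for (Map.Entry<String, BitSet> e : instancePostings.entrySet()) {
      if (instanceRegex.matcher(e.getKey()).matches()) {
        filterDocs.or(e.getValue());
      }
    }
    // Step 2: cross-path predicate narrows the doc set.
    filterDocs.and(clusterDocs);
    // Step 3: keep each value whose posting list overlaps the filtered docs.
    SortedSet<String> result = new TreeSet<>();
    for (Map.Entry<String, BitSet> e : instancePostings.entrySet()) {
      if (e.getValue().intersects(filterDocs)) {
        result.add(e.getKey());
      }
    }
    return result;
  }

  static Map<String, BitSet> samplePostings() {
    Map<String, BitSet> postings = new LinkedHashMap<>();
    postings.put("test-1", bits(0, 1));
    postings.put("test-2", bits(2));
    postings.put("prod-1", bits(3));
    return postings;
  }

  public static void main(String[] args) {
    // Docs 0 and 3 are the ones with cluster = 'cluster-0' in this toy data.
    BitSet clusterDocs = bits(0, 3);
    System.out.println(
        distinctInstances(samplePostings(), Pattern.compile(".*test.*"), clusterDocs));
  }
}
```

In this toy data only `test-1` survives: `test-2` matches the regex but lives in a different cluster, and `prod-1` is in the right cluster but fails the regex.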

Test Plan

  • JsonIndexDistinctOperatorTest — 9 unit tests (fast path verification, cross-path fallback, default value, missing path error, planning eligibility)
  • JsonPathTest — 42 integration tests covering all same-path predicate types (EQ, NOT_EQ, IN, NOT_IN, REGEXP_LIKE, IS_NOT_NULL), 4-arg default value, LIMIT, cross-path filter, and baseline comparison on both SSE and MSE
  • JsonIndexTest — 18 segment-level JSON index tests
  • DistinctQueriesTest — 6 distinct query tests
  • JsonExtractIndexTransformFunctionTest — 31 transform function tests
  • ./mvnw spotless:apply checkstyle:check license:check — all pass
  • Full JMH benchmark run with verifySpeedup() assertion — all 40 same-path configs pass
  • Extra-filter benchmark run — 24 configs, correctness validated against baseline

🤖 Generated with Claude Code

xiangfu0 force-pushed the xiangfu/json-index-distinct-benchmark branch from 7ed24b5 to ba76279 on March 20, 2026 09:51
codecov-commenter commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 22.72727% with 153 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.29%. Comparing base (c23b8fd) to head (0b0a064).
⚠️ Report is 1 commits behind head on master.


| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...t/index/readers/json/ImmutableJsonIndexReader.java | 0.00% | 103 Missing ⚠️ |
| ...core/operator/query/JsonIndexDistinctOperator.java | 47.87% | 27 Missing and 22 partials ⚠️ |
| ...inot/segment/spi/index/reader/JsonIndexReader.java | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17921      +/-   ##
============================================
- Coverage     63.37%   63.29%   -0.08%     
  Complexity     1543     1543              
============================================
  Files          3200     3200              
  Lines        194169   194335     +166     
  Branches      29915    29961      +46     
============================================
- Hits         123051   123004      -47     
- Misses        61466    61635     +169     
- Partials       9652     9696      +44     

| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 63.27% <22.72%> (+<0.01%) ⬆️ |
| java-21 | 63.25% <22.72%> (-0.10%) ⬇️ |
| temurin | 63.29% <22.72%> (-0.08%) ⬇️ |
| unittests | 63.29% <22.72%> (-0.08%) ⬇️ |
| unittests1 | 55.56% <22.72%> (+<0.01%) ⬆️ |
| unittests2 | 34.16% <0.00%> (-0.11%) ⬇️ |

xiangfu0 force-pushed the xiangfu/json-index-distinct-benchmark branch from ba76279 to 8b55cd4 on March 20, 2026 21:47
xiangfu0 changed the title from "Add JsonIndexDistinct benchmark and JSON_MATCH pushdown" to "Add JsonIndexDistinct benchmark and narrow operator scope" on Mar 20, 2026
xiangfu0 added the json (Related to JSON column support) and index (Related to indexing (general)) labels on Mar 20, 2026
xiangfu0 force-pushed the xiangfu/json-index-distinct-benchmark branch 2 times, most recently from e510cc2 to ad423b8 on March 24, 2026 19:43
xiangfu0 changed the title from "Add JsonIndexDistinct benchmark and narrow operator scope" to "Add JsonIndexDistinct benchmark and optimize same-path JSON_MATCH" on Mar 24, 2026
xiangfu0 force-pushed the xiangfu/json-index-distinct-benchmark branch 2 times, most recently from 5f2cd8f to a038c11 on March 25, 2026 05:46
Copilot AI left a comment

Pull request overview

This PR adds a new JMH benchmark and introduces an optimization path for SELECT DISTINCT jsonExtractIndex(...) queries that can fully push down a same-path JSON_MATCH filter into the JSON index, avoiding per-value bitmap materialization and flattened→real doc ID conversion.

Changes:

  • Add BenchmarkJsonIndexDistinct covering 3 query variants and a _missingPathFraction benchmark parameter.
  • Extend JsonIndexReader with a getMatchingValues(key, filter) SPI and implement an optimized override in ImmutableJsonIndexReader.
  • Update JsonIndexDistinctOperator to use a fast path for fully pushed-down same-path JSON_MATCH, and add unit tests for planning/strategy behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
|---|---|
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/reader/JsonIndexReader.java | Adds getMatchingValues SPI default method for distinct-only value retrieval. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java | Implements getMatchingValues with a buffer-based intersection optimization; refactors filter/path normalization helpers. |
| pinot-core/src/main/java/org/apache/pinot/core/operator/query/JsonIndexDistinctOperator.java | Adds same-path JSON_MATCH pushdown + a “distinct values only” fast path and a doc-centric strategy for sparse/high-cardinality cases. |
| pinot-core/src/test/java/org/apache/pinot/core/operator/query/JsonIndexDistinctOperatorTest.java | Adds unit tests validating pushdown/strategy/default behavior (needs updates to match new fast path). |
| pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkJsonIndexDistinct.java | Adds a JMH benchmark to measure baseline vs index-based distinct operator across variants and missing-path scenarios. |

xiangfu0 force-pushed the xiangfu/json-index-distinct-benchmark branch 2 times, most recently from 836fb40 to 97f2488 on March 25, 2026 08:40
xiangfu0 requested a review from Copilot on March 25, 2026 08:46

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Comment thread pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkJsonIndexDistinct.java Outdated
xiangfu0 force-pushed the xiangfu/json-index-distinct-benchmark branch from 97f2488 to 4dd3752 on March 25, 2026 19:33
xiangfu0 commented

Latest Benchmark Results (post-fix rerun)

After fixing the maybeAddDefaultValue() missing-path error handling, re-ran the benchmark with representative parameters:
-p _instanceCardinality=100,1000,10000,100000 -p _testInstanceFraction=0.1,0.5,1.0 -p _queryVariant=THREE_ARG,FOUR_ARG

THREE_ARG Results (same-path filter, no default value)

| Cardinality | Match % | DistinctOp (ms) | JsonIndexDistinctOp (ms) | Speedup |
|---|---|---|---|---|
| 100 | 10% | 5.98 | 0.88 | 6.8x |
| 100 | 50% | 13.22 | 1.94 | 6.8x |
| 100 | 100% | 22.84 | 3.57 | 6.4x |
| 1,000 | 10% | | 2.67 | |
| 1,000 | 50% | 18.53 | 4.61 | 4.0x |
| 1,000 | 100% | 31.53 | | |
| 10,000 | 10% | 30.45 | 20.53 | 1.5x |
| 10,000 | 50% | 52.33 | 21.27 | 2.5x |
| 10,000 | 100% | 82.93 | 29.46 | 2.8x |
| 100,000 | 10% | 159.41 | 82.45 | 1.9x |
| 100,000 | 50% | 338.35 | 128.43 | 2.6x |
| 100,000 | 100% | 592.60 | 197.57 | 3.0x |

FOUR_ARG Results (same-path filter, with default value)

| Cardinality | Match % | DistinctOp (ms) | JsonIndexDistinctOp (ms) | Speedup |
|---|---|---|---|---|
| 100 | 10% | 5.26 | 0.90 | 5.8x |
| 100 | 50% | | 1.89 | |
| 100 | 100% | 20.47 | 3.35 | 6.1x |
| 1,000 | 10% | 10.16 | 2.59 | 3.9x |
| 1,000 | 50% | 17.65 | 4.66 | 3.8x |
| 1,000 | 100% | 29.44 | 7.47 | 3.9x |
| 10,000 | 10% | 33.50 | 17.75 | 1.9x |
| 10,000 | 50% | 50.28 | 22.92 | 2.2x |
| 10,000 | 100% | 86.84 | 27.46 | 3.2x |
| 100,000 | 10% | 178.14 | 87.18 | 2.0x |
| 100,000 | 50% | 393.78 | 128.28 | 3.1x |
| 100,000 | 100% | 742.72 | 313.15 | 2.4x |

JsonIndexDistinctOperator outperforms DistinctOperator in all 24 tested parameter combinations, with speedups ranging from 1.5x to 6.8x. The verifySpeedup() median guard passed for all configurations.

Settings: 500K rows, JDK 17, -f 1 -wi 2 -i 3 -w 1 -r 1, assertSpeedup=true
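
The body of verifySpeedup() is not shown in this conversation. One plausible shape for a median-based guard, written purely as an assumption, is sketched below: it compares the median of the baseline timings against the median of the candidate timings and fails when the ratio drops below a threshold. The class `MedianGuardSketch`, the parameter `minSpeedup`, and the sample timings are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of a median-based speedup guard; not the benchmark's code.
class MedianGuardSketch {

  static double median(double[] samples) {
    if (samples.length == 0) {
      throw new IllegalArgumentException("no samples");
    }
    double[] sorted = samples.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    // Odd count: middle element. Even count: mean of the two middle elements.
    return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
  }

  static double speedup(double[] baselineMs, double[] candidateMs) {
    return median(baselineMs) / median(candidateMs);
  }

  // Fails when the candidate's median speedup is below the required minimum.
  static void verifySpeedup(double[] baselineMs, double[] candidateMs, double minSpeedup) {
    double observed = speedup(baselineMs, candidateMs);
    if (observed < minSpeedup) {
      throw new IllegalStateException(
          "speedup " + observed + " is below required " + minSpeedup);
    }
  }

  public static void main(String[] args) {
    double[] baseline = {10.0, 12.0, 11.0};   // made-up timings in ms
    double[] candidate = {5.0, 5.5, 6.0};     // made-up timings in ms
    verifySpeedup(baseline, candidate, 1.0);
    System.out.println("speedup = " + speedup(baseline, candidate));
  }
}
```

Using the median rather than the mean keeps a single slow iteration from flipping the verdict, which matters with only three measurement iterations (-i 3).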

xiangfu0 force-pushed the xiangfu/json-index-distinct-benchmark branch from 4dd3752 to 618f271 on March 26, 2026 04:46
xiangfu0 and others added 3 commits March 29, 2026 02:11
…or same-path filters

For fully-pushed-down same-path JSON_MATCH predicates, the previous implementation
scanned the dictionary twice (once for filter evaluation, once for value-map building)
and materialized per-value posting list bitmaps + doc ID conversions that were never
used. This adds getMatchingDistinctValues() to JsonIndexReader SPI with a fused
single-pass implementation in ImmutableJsonIndexReader that evaluates the predicate
directly on dictionary value strings — zero posting list reads, zero bitmap operations,
zero doc ID mapping lookups.

Benchmark shows 12x-337x speedup across all cardinality/selectivity combinations,
eliminating the previous regression at high cardinality + low selectivity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ath) comparison

Adds THREE_ARG_EXTRA_FILTER and FOUR_ARG_EXTRA_FILTER query variants that combine
the same-path REGEXP_LIKE with a cross-path JSON_MATCH on $.cluster, preventing the
fully-pushed-down fast path. This exercises the bitmap-based code path and confirms
it still outperforms baseline in most configurations (1.2x-4.2x).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
xiangfu0 force-pushed the xiangfu/json-index-distinct-benchmark branch from c12de3b to 0b0a064 on March 29, 2026 09:11
xiangfu0 requested a review from Copilot on March 30, 2026 22:54

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

xiangfu0 merged commit 0c6521a into apache:master on Apr 3, 2026
20 checks passed
xiangfu0 deleted the xiangfu/json-index-distinct-benchmark branch on April 3, 2026 07:00
xiangfu0 added a commit to pinot-contrib/pinot-docs that referenced this pull request Apr 3, 2026

xiangfu0 commented Apr 3, 2026

📝 Documentation PR: pinot-contrib/pinot-docs#616

Documentation has been created for the JSON index DISTINCT optimization. The PR documents:

  • The new same-path JSON_MATCH optimization for SELECT DISTINCT queries
  • How the optimizer evaluates predicates directly on dictionary values
  • Performance tips for leveraging this optimization
  • Example queries demonstrating the feature

xiangfu0 added a commit to pinot-contrib/pinot-docs that referenced this pull request Apr 4, 2026