Optimize JSON index doc id mapping#18680

Open
xiangfu0 wants to merge 1 commit into
apache:masterfrom
xiangfu0:json-index-docid-fastpath

@xiangfu0
Contributor

@xiangfu0 xiangfu0 commented Jun 4, 2026

Summary

  • skip flattened-doc-id to segment-doc-id translation when the JSON index mapping is identity
  • emit JSON index V3 for immutable scalar-only JSON indexes so the identity doc-id mapping is omitted from disk
  • keep a direct doc-id bitmap path for realtime JSON index predicates and immutable scalar paths whose JSON paths cannot expand through arrays
  • fall back to flattened-doc evaluation for array paths to preserve same-array-element semantics
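
The first bullet can be sketched as follows. This is a minimal illustration, not the PR's code: the class name `IdentityDocIdFastPath`, the `int[]` mapping, and the use of `java.util.BitSet` in place of Pinot's bitmap types are all assumptions made for the example.

```java
import java.util.BitSet;

// Hypothetical sketch: skip flattened-doc-id -> segment-doc-id translation
// when the mapping is the identity (flattened id i maps to doc id i).
final class IdentityDocIdFastPath {
  // True when every flattened doc id equals its segment doc id.
  // A real reader would compute this once at load time and cache it.
  static boolean isIdentity(int[] flattenedToDocId) {
    for (int i = 0; i < flattenedToDocId.length; i++) {
      if (flattenedToDocId[i] != i) {
        return false;
      }
    }
    return true;
  }

  // Converts matching flattened doc ids into matching segment doc ids.
  static BitSet toDocIds(BitSet matchingFlattenedDocIds, int[] flattenedToDocId) {
    if (isIdentity(flattenedToDocId)) {
      // Fast path: the flattened ids already are segment doc ids.
      return (BitSet) matchingFlattenedDocIds.clone();
    }
    BitSet docIds = new BitSet();
    for (int i = matchingFlattenedDocIds.nextSetBit(0); i >= 0;
        i = matchingFlattenedDocIds.nextSetBit(i + 1)) {
      docIds.set(flattenedToDocId[i]);
    }
    return docIds;
  }

  public static void main(String[] args) {
    BitSet matches = new BitSet();
    matches.set(0);
    matches.set(2);
    // Doc 0 flattens to two records (ids 0 and 1), doc 1 to one record (id 2).
    System.out.println(toDocIds(matches, new int[] {0, 0, 1})); // prints {0, 1}
    // Identity mapping: the result equals the input.
    System.out.println(toDocIds(matches, new int[] {0, 1, 2})); // prints {0, 2}
  }
}
```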

User Manual

No table config changes are required. Existing JSON index configurations continue to work. Queries using JSON_MATCH or jsonExtractIndex can benefit automatically when a predicate targets a scalar/object JSON path that does not require array-element correlation.

When every indexed JSON document flattens to exactly one record, new immutable segments write JSON index V3 and omit the identity flattened-doc-id to real-doc-id mapping. JSON documents with arrays keep the compatible V2 layout and add a scalar-path direct-doc-id sidecar when useful.
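
The version choice described above might look roughly like this on the write side. The names `VERSION_2` and `VERSION_3` appear in the PR diff, but their numeric values, the class name, and the per-document record-count array are assumptions for illustration only.

```java
// Hypothetical sketch of choosing the on-disk layout at segment build time.
final class JsonIndexVersionChooser {
  static final int VERSION_2 = 2; // assumed value; layout that stores the doc-id mapping
  static final int VERSION_3 = 3; // assumed value; layout that omits the identity mapping

  // numFlattenedRecordsPerDoc[i] is how many flattened records doc i produced.
  // The mapping is the identity only when every doc produced exactly one record.
  static int chooseVersion(int[] numFlattenedRecordsPerDoc) {
    for (int numRecords : numFlattenedRecordsPerDoc) {
      if (numRecords != 1) {
        return VERSION_2;
      }
    }
    return VERSION_3;
  }

  public static void main(String[] args) {
    System.out.println(chooseVersion(new int[] {1, 1, 1})); // prints 3
    System.out.println(chooseVersion(new int[] {1, 32, 1})); // prints 2
  }
}
```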

Sample table config snippet:

{
  "tableIndexConfig": {
    "jsonIndexConfigs": {
      "payload": {}
    }
  }
}

Sample queries:

SELECT COUNT(*)
FROM myTable
WHERE JSON_MATCH(payload, '"$.eventType" = ''click''');

SELECT COUNT(*)
FROM myTable
WHERE JSON_MATCH(payload, '"$.dir" != ''upstream''');

SELECT jsonExtractIndex(payload, '$.eventType', 'STRING')
FROM myTable
WHERE JSON_MATCH(payload, '"$.country" = ''US''');

Array predicates still use flattened-doc semantics:

SELECT COUNT(*)
FROM myTable
WHERE JSON_MATCH(payload, '"$.items[*].sku" = ''abc'' AND "$.items[*].qty" > 1');
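
To see why array predicates cannot use a per-document bitmap shortcut, consider the toy model below. It is an illustrative sketch, not Pinot code: the `FlattenedRecord` type and the three sample records are made up, and `java.util.BitSet` stands in for the real bitmap types. Doc 0 has one item with sku `abc` and a different item with qty 5, so it must not match.

```java
import java.util.BitSet;

// Hypothetical model of same-array-element semantics for the query above.
final class ArrayElementCorrelation {
  // One flattened record per element of payload.items.
  static final class FlattenedRecord {
    final int docId;
    final String sku;
    final int qty;

    FlattenedRecord(int docId, String sku, int qty) {
      this.docId = docId;
      this.sku = sku;
      this.qty = qty;
    }
  }

  static final FlattenedRecord[] RECORDS = {
      new FlattenedRecord(0, "abc", 1), // doc 0, items[0]
      new FlattenedRecord(0, "xyz", 5), // doc 0, items[1]
      new FlattenedRecord(1, "abc", 3)  // doc 1, items[0]
  };

  // Correct: both conditions must hold on the same flattened record.
  static BitSet flattenedEvaluation() {
    BitSet docs = new BitSet();
    for (FlattenedRecord r : RECORDS) {
      if (r.sku.equals("abc") && r.qty > 1) {
        docs.set(r.docId);
      }
    }
    return docs;
  }

  // Wrong for arrays: ANDing per-document bitmaps loses element correlation.
  static BitSet docLevelEvaluation() {
    BitSet skuDocs = new BitSet();
    BitSet qtyDocs = new BitSet();
    for (FlattenedRecord r : RECORDS) {
      if (r.sku.equals("abc")) {
        skuDocs.set(r.docId);
      }
      if (r.qty > 1) {
        qtyDocs.set(r.docId);
      }
    }
    skuDocs.and(qtyDocs);
    return skuDocs;
  }

  public static void main(String[] args) {
    System.out.println(flattenedEvaluation()); // prints {1}
    System.out.println(docLevelEvaluation());  // prints {0, 1}
  }
}
```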

Benchmark

Local harness: .bench-compare/JsonDocIdFastPathBench.java, JDK 21, baseline dd6520c726, current 2fd1c3bcb1, 10 warmups / 30 measured iterations.

| Scenario | Predicate | Baseline avg | Current avg | Speedup | Baseline index | Current index | Storage reduction |
| --- | --- | --- | --- | --- | --- | --- | --- |
| scalar dir-only, 1M rows | "$.dir" != 'upstream' | 4.260707 ms | 0.302997 ms | 14.1x | 4,256,550 B | 256,566 B | 16.59x |
| scalar dir-only, 1M rows | "$.dir" IS NOT NULL | 5.014949 ms | 0.078740 ms | 63.7x | 4,256,550 B | 256,566 B | 16.59x |
| scalar + extra fields, 1M rows | "$.dir" != 'upstream' | 4.230829 ms | 0.257292 ms | 16.4x | 7,282,779 B | 3,282,795 B | 2.22x |
| array length 32, 50k rows | "$.dir" != 'upstream' | 7.862618 ms | 0.151867 ms | 51.8x | 12,856,486 B | 12,881,206 B | 1.00x |
| array length 128, 50k rows | "$.dir" != 'upstream' | 31.125785 ms | 0.100103 ms | 310.9x | 51,454,913 B | 51,479,633 B | 1.00x |
| array length 128, 50k rows | "$.dir" IS NOT NULL | 33.870899 ms | 0.068719 ms | 492.9x | 51,454,913 B | 51,479,633 B | 1.00x |

Notes: V3 removes the identity doc-id mapping bytes entirely (4,000,000 B -> 0 B for the 1M-row scalar cases). Total index-size reduction depends on the remaining dictionary/inverted-index payload. Array scenarios keep the V2 mapping for compatibility and add a 24,720 B direct-doc-id sidecar in this benchmark.
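
The figures in these notes can be re-derived from the table. The sketch below assumes the identity mapping costs one 4-byte int per flattened doc, which is an inference from the 4,000,000 B figure for 1M rows rather than something the PR states; the class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical arithmetic check of the benchmark's storage numbers.
final class MappingSizeArithmetic {
  // Assumes one 4-byte int per flattened doc in the doc-id mapping.
  static long identityMappingBytes(long numFlattenedDocs) {
    return numFlattenedDocs * Integer.BYTES;
  }

  // Storage reduction as reported in the table: baseline size / current size.
  static String reduction(long baselineBytes, long currentBytes) {
    return String.format(Locale.ROOT, "%.2fx", (double) baselineBytes / currentBytes);
  }

  public static void main(String[] args) {
    System.out.println(identityMappingBytes(1_000_000L));   // prints 4000000
    System.out.println(reduction(4_256_550L, 256_566L));     // prints 16.59x
    System.out.println(reduction(7_282_779L, 3_282_795L));   // prints 2.22x
    // Sidecar overhead in the array scenarios: current minus baseline.
    System.out.println(12_881_206L - 12_856_486L);           // prints 24720
  }
}
```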

Tests

  • ./mvnw -pl pinot-segment-local -Dtest=JsonIndexTest test
  • ./mvnw -pl pinot-tools -DskipTests compile
  • ./mvnw spotless:apply -pl pinot-segment-local,pinot-tools
  • ./mvnw checkstyle:check -pl pinot-segment-local,pinot-tools
  • ./mvnw license:format -pl pinot-segment-local,pinot-tools
  • ./mvnw license:check -pl pinot-segment-local,pinot-tools
  • git diff --check

@codecov-commenter

codecov-commenter commented Jun 4, 2026

Codecov Report

❌ Patch coverage is 71.26654% with 152 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.50%. Comparing base (dd6520c) to head (2fd1c3b).
⚠️ Report is 21 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...t/index/readers/json/ImmutableJsonIndexReader.java | 61.92% | 57 Missing and 34 partials ⚠️ |
| ...local/realtime/impl/json/MutableJsonIndexImpl.java | 69.74% | 42 Missing and 17 partials ⚠️ |
| ...nt/creator/impl/inv/json/BaseJsonIndexCreator.java | 97.75% | 0 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18680      +/-   ##
============================================
+ Coverage     64.47%   64.50%   +0.03%     
  Complexity     1291     1291              
============================================
  Files          3371     3373       +2     
  Lines        208551   209127     +576     
  Branches      32569    32730     +161     
============================================
+ Hits         134455   134900     +445     
- Misses        63292    63368      +76     
- Partials      10804    10859      +55     

| Flag | Coverage Δ |
| --- | --- |
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-21 | 64.50% <71.26%> (+0.03%) ⬆️ |
| temurin | 64.50% <71.26%> (+0.03%) ⬆️ |
| unittests | 64.50% <71.26%> (+0.03%) ⬆️ |
| unittests1 | 56.84% <42.34%> (-0.06%) ⬇️ |
| unittests2 | 37.21% <70.51%> (+0.11%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes JSON index doc-id handling by avoiding flattened-doc-id → segment-doc-id translation when it’s unnecessary and by introducing a direct doc-id evaluation path for eligible realtime JSON predicates, while preserving correct semantics for array paths.

Changes:

  • Detect and fast-path identity doc-id mappings in ImmutableJsonIndexReader to avoid per-match doc-id translation.
  • Add a direct doc-id posting-list map and corresponding predicate evaluation path to MutableJsonIndexImpl for eligible (non-array) JSON paths.
  • Keep fallback behavior for array-path predicates to preserve same-array-element correlation semantics.
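
A rough sketch of the second and third points follows. It is not the code in `MutableJsonIndexImpl`: the class name, the `path + '\0' + value` key format, the `isEligible` check (which simply looks for `[` in the path), and the use of `java.util.BitSet` are simplifying assumptions for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical posting-list map keyed by (path, value) that stores segment
// doc ids directly, so no flattened-doc-id translation is needed at query time.
final class DirectDocIdPostingLists {
  private final Map<String, BitSet> _docIdPostingLists = new HashMap<>();

  // Simplified eligibility check: paths with array steps need flattened-doc evaluation.
  static boolean isEligible(String path) {
    return !path.contains("[");
  }

  private static String key(String path, String value) {
    return path + '\0' + value;
  }

  // Records that the document docId has the given value at the given path.
  void add(int docId, String path, String value) {
    _docIdPostingLists.computeIfAbsent(key(path, value), k -> new BitSet()).set(docId);
  }

  // Returns matching segment doc ids for path = value on an eligible path.
  BitSet matchEquals(String path, String value) {
    if (!isEligible(path)) {
      throw new IllegalArgumentException("Array path requires flattened-doc evaluation: " + path);
    }
    BitSet postings = _docIdPostingLists.get(key(path, value));
    return postings == null ? new BitSet() : (BitSet) postings.clone();
  }

  public static void main(String[] args) {
    DirectDocIdPostingLists index = new DirectDocIdPostingLists();
    index.add(0, "$.eventType", "click");
    index.add(1, "$.eventType", "view");
    index.add(2, "$.eventType", "click");
    System.out.println(index.matchEquals("$.eventType", "click")); // prints {0, 2}
    System.out.println(isEligible("$.items[*].sku"));              // prints false
  }
}
```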

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java | Adds identity mapping detection and a direct-doc-id evaluation path when flattened-doc IDs match segment doc IDs. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/json/MutableJsonIndexImpl.java | Adds a doc-id posting list map and predicate evaluation path to bypass flattened-doc mapping for eligible realtime predicates. |

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

@xiangfu0 xiangfu0 force-pushed the json-index-docid-fastpath branch 2 times, most recently from bbee8be to 4d44dff Compare June 6, 2026 09:18
@xiangfu0 xiangfu0 force-pushed the json-index-docid-fastpath branch from 4d44dff to 2fd1c3b Compare June 6, 2026 09:51
Contributor Author

@xiangfu0 xiangfu0 left a comment

Found 1 high-signal compatibility issue; see inline comment.

  ByteBuffer headerBuffer = ByteBuffer.allocate(HEADER_LENGTH);
- headerBuffer.putInt(VERSION_2);
+ boolean omitDocIdMapping = _docIdMappingIdentity;
+ headerBuffer.putInt(omitDocIdMapping ? VERSION_3 : VERSION_2);
Contributor Author

This writes a brand-new on-disk JSON index version. Older Pinot servers only accept VERSION_1/VERSION_2, so any segment built with VERSION_3 will fail to load during a rolling upgrade or when older servers consume newly-pushed segments. The V2 sidecar is backward-compatible because old readers ignore trailing bytes; this version bump is not. Please keep writing VERSION_2 (with the optional sidecar/zero-length marker) unless the rollout is gated on every reader in the cluster understanding VERSION_3.
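
One way to picture the failure mode and the suggested gating is sketched below. Everything here is hypothetical: the numeric version values, the `oldReaderCheck` method modelling an older server, and the `allReadersSupportV3` flag are illustrative assumptions, not existing Pinot APIs or configs.

```java
// Hypothetical sketch of the compatibility concern and a gated writer.
final class JsonIndexVersionGate {
  static final int VERSION_1 = 1; // assumed value
  static final int VERSION_2 = 2; // assumed value
  static final int VERSION_3 = 3; // assumed value

  // Models an older reader that only accepts VERSION_1 and VERSION_2.
  static void oldReaderCheck(int version) {
    if (version != VERSION_1 && version != VERSION_2) {
      throw new IllegalStateException("Unsupported JSON index version: " + version);
    }
  }

  // Writes V3 only when the mapping is identity AND the (hypothetical) rollout
  // flag says every reader in the cluster understands V3; otherwise stays on V2.
  static int versionToWrite(boolean docIdMappingIdentity, boolean allReadersSupportV3) {
    return docIdMappingIdentity && allReadersSupportV3 ? VERSION_3 : VERSION_2;
  }

  public static void main(String[] args) {
    int ungated = versionToWrite(true, true);
    int gated = versionToWrite(true, false);
    oldReaderCheck(gated); // passes: an old reader can still load V2
    try {
      oldReaderCheck(ungated);
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // prints Unsupported JSON index version: 3
    }
  }
}
```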


3 participants