Optimize JSON index doc id mapping #18680
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## master #18680 +/- ##
============================================
+ Coverage 64.47% 64.50% +0.03%
Complexity 1291 1291
============================================
Files 3371 3373 +2
Lines 208551 209127 +576
Branches 32569 32730 +161
============================================
+ Hits 134455 134900 +445
- Misses 63292 63368 +76
- Partials 10804 10859 +55
Pull request overview
This PR optimizes JSON index doc-id handling by avoiding flattened-doc-id → segment-doc-id translation when it’s unnecessary and by introducing a direct doc-id evaluation path for eligible realtime JSON predicates, while preserving correct semantics for array paths.
Changes:
- Detect and fast-path identity doc-id mappings in `ImmutableJsonIndexReader` to avoid per-match doc-id translation.
- Add a direct doc-id posting-list map and corresponding predicate evaluation path to `MutableJsonIndexImpl` for eligible (non-array) JSON paths.
- Keep fallback behavior for array-path predicates to preserve same-array-element correlation semantics.
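To illustrate why array-path predicates keep the flattened-doc fallback, here is a minimal sketch. It is not code from this PR: the class name, the sample documents, and the posting lists are all hypothetical. It assumes each array element becomes its own flattened record, so two predicates on the same array must be intersected per flattened record before mapping back to doc ids.

```java
import java.util.Set;
import java.util.TreeSet;

final class ArrayCorrelationSketch {
    // Hypothetical data, not taken from the PR:
    // doc 0 = {"a":[{"x":1,"y":2},{"x":3,"y":4}]} flattens to records 0 and 1;
    // doc 1 = {"a":[{"x":1,"y":4}]} flattens to record 2.
    static final int[] FLAT_TO_DOC = {0, 0, 1};
    static final Set<Integer> X_EQ_1 = Set.of(0, 2); // flattened records where x = 1
    static final Set<Integer> Y_EQ_4 = Set.of(1, 2); // flattened records where y = 4

    // Map flattened record ids to the doc ids that own them.
    static Set<Integer> toDocs(Set<Integer> flatIds) {
        Set<Integer> docs = new TreeSet<>();
        for (int flatId : flatIds) {
            docs.add(FLAT_TO_DOC[flatId]);
        }
        return docs;
    }

    // Intersect per flattened record first: both predicates must hit the same array element.
    static Set<Integer> matchFlattened() {
        Set<Integer> both = new TreeSet<>(X_EQ_1);
        both.retainAll(Y_EQ_4);
        return toDocs(both);
    }

    // Intersect per doc: loses the same-element correlation.
    static Set<Integer> matchDocLevel() {
        Set<Integer> docs = toDocs(X_EQ_1);
        docs.retainAll(toDocs(Y_EQ_4));
        return docs;
    }
}
```

In this sketch the flattened intersection returns only doc 1, while the doc-level intersection also returns doc 0, whose matching `x` and `y` values sit in different array elements. That difference is the reason a direct doc-id path is only safe for non-array paths.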
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/json/ImmutableJsonIndexReader.java | Adds identity mapping detection and a direct-doc-id evaluation path when flattened-doc IDs match segment doc IDs. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/json/MutableJsonIndexImpl.java | Adds a doc-id posting list map and predicate evaluation path to bypass flattened-doc mapping for eligible realtime predicates. |
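The identity-mapping fast path described for `ImmutableJsonIndexReader` can be sketched roughly as follows. This is an illustrative sketch, not the PR's implementation: the class and method names are hypothetical, and it assumes the flattened-doc-id to doc-id mapping is a non-decreasing `int[]` and that matching flattened ids arrive sorted.

```java
import java.util.Arrays;

final class IdentityMappingSketch {
    // The mapping is the identity when every doc flattens to exactly one record,
    // i.e. flattened id i belongs to doc i for all i.
    static boolean isIdentity(int[] flatToDoc, int numDocs) {
        if (flatToDoc.length != numDocs) {
            return false;
        }
        for (int i = 0; i < flatToDoc.length; i++) {
            if (flatToDoc[i] != i) {
                return false;
            }
        }
        return true;
    }

    // Translate sorted matching flattened ids to doc ids, skipping the work for identity mappings.
    static int[] toDocIds(int[] sortedFlatIds, int[] flatToDoc, boolean identity) {
        if (identity) {
            return sortedFlatIds; // flattened ids already are doc ids
        }
        int[] out = new int[sortedFlatIds.length];
        int n = 0;
        for (int flatId : sortedFlatIds) {
            int docId = flatToDoc[flatId];
            // Consecutive flattened records of one doc collapse to a single doc id.
            if (n == 0 || out[n - 1] != docId) {
                out[n++] = docId;
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

Checking for identity once at load time lets every later predicate evaluation skip the per-match lookup, which is the saving the table row above describes.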
733888f to 8c4df24
bbee8be to 4d44dff
4d44dff to 2fd1c3b
xiangfu0 left a comment
Found 1 high-signal compatibility issue; see inline comment.
      ByteBuffer headerBuffer = ByteBuffer.allocate(HEADER_LENGTH);
    - headerBuffer.putInt(VERSION_2);
    + boolean omitDocIdMapping = _docIdMappingIdentity;
    + headerBuffer.putInt(omitDocIdMapping ? VERSION_3 : VERSION_2);
This writes a brand-new on-disk JSON index version. Older Pinot servers only accept VERSION_1/VERSION_2, so any segment built with VERSION_3 will fail to load during a rolling upgrade or when older servers consume newly-pushed segments. The V2 sidecar is backward-compatible because old readers ignore trailing bytes; this version bump is not. Please keep writing VERSION_2 (with the optional sidecar/zero-length marker) unless the rollout is gated on every reader in the cluster understanding VERSION_3.
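The compatibility concern can be modeled with a small sketch. Everything here is hypothetical: the numeric values of the version constants, the `allReadersUnderstandV3` gate, and the class itself are assumptions for illustration and do not reflect Pinot's actual reader or any existing config flag.

```java
import java.nio.ByteBuffer;

final class VersionGateSketch {
    // Assumed numeric values; the real constants live in the Pinot source.
    static final int VERSION_1 = 1;
    static final int VERSION_2 = 2;
    static final int VERSION_3 = 3;

    // Models an older reader that only accepts the versions it knows about.
    static boolean oldReaderCanLoad(ByteBuffer index) {
        int version = index.getInt(0); // absolute read of the leading version field
        return version == VERSION_1 || version == VERSION_2;
    }

    // Writes only the version field; emits VERSION_3 only when the (hypothetical) gate allows it.
    static ByteBuffer writeHeader(boolean docIdMappingIdentity, boolean allReadersUnderstandV3) {
        ByteBuffer header = ByteBuffer.allocate(Integer.BYTES);
        boolean omitDocIdMapping = docIdMappingIdentity && allReadersUnderstandV3;
        header.putInt(omitDocIdMapping ? VERSION_3 : VERSION_2);
        header.flip();
        return header;
    }
}
```

With the gate off, an identity-mapped segment still carries `VERSION_2` and the modeled old reader loads it. With the gate on, the same reader rejects the segment, which is the rolling-upgrade failure the comment describes.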
Summary
User Manual
No table config changes are required. Existing JSON index configurations continue to work. Queries using `JSON_MATCH` or `jsonExtractIndex` can benefit automatically when a predicate targets a scalar/object JSON path that does not require array-element correlation.

When every indexed JSON document flattens to exactly one record, new immutable segments write JSON index V3 and omit the identity flattened-doc-id to real-doc-id mapping. JSON documents with arrays keep the compatible V2 layout and add a scalar-path direct-doc-id sidecar when useful.
Sample table config snippet:

    { "tableIndexConfig": { "jsonIndexConfigs": { "payload": {} } } }

Sample queries:
Array predicates still use flattened-doc semantics:
Benchmark
Local harness: `.bench-compare/JsonDocIdFastPathBench.java`, JDK 21, baseline `dd6520c726`, current `2fd1c3bcb1`, 10 warmups / 30 measured iterations.

Benchmark table (only the predicate column survives), in row order: `"$.dir" != 'upstream'`, `"$.dir" IS NOT NULL`, `"$.dir" != 'upstream'`, `"$.dir" != 'upstream'`, `"$.dir" != 'upstream'`, `"$.dir" IS NOT NULL`.

Notes: V3 removes the identity doc-id mapping bytes entirely (`4,000,000 B -> 0 B` for the 1M-row scalar cases). Total index-size reduction depends on the remaining dictionary/inverted-index payload. Array scenarios keep the V2 mapping for compatibility and add a 24,720 B direct-doc-id sidecar in this benchmark.

Tests
- `./mvnw -pl pinot-segment-local -Dtest=JsonIndexTest test`
- `./mvnw -pl pinot-tools -DskipTests compile`
- `./mvnw spotless:apply -pl pinot-segment-local,pinot-tools`
- `./mvnw checkstyle:check -pl pinot-segment-local,pinot-tools`
- `./mvnw license:format -pl pinot-segment-local,pinot-tools`
- `./mvnw license:check -pl pinot-segment-local,pinot-tools`
- `git diff --check`