OPEN_STRUCT storage layer — columnar two-tier dense/sparse index (PR 2/4)#18643
Open
tarun11Mavani wants to merge 21 commits into
Open
OPEN_STRUCT storage layer — columnar two-tier dense/sparse index (PR 2/4)#18643tarun11Mavani wants to merge 21 commits into
tarun11Mavani wants to merge 21 commits into
Conversation
7f254e4 to
d0e436f
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18643 +/- ##
============================================
+ Coverage 64.41% 64.49% +0.08%
- Complexity 1282 1291 +9
============================================
Files 3362 3381 +19
Lines 207907 209304 +1397
Branches 32463 32714 +251
============================================
+ Hits 133923 135000 +1077
- Misses 63221 63450 +229
- Partials 10763 10854 +91
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…x (PR 2/4) Storage layer for the OPEN_STRUCT column type: a self-describing column whose keys are discovered at ingest time and stored columnar in two tiers. Mutable (consuming) path: - MutableOpenStructIndex / MutableKeyColumn: per-key dictionary-encoded forward index + presence bitmap, with 3-level type resolution (declared child FieldSpec, else value-based inference) and maxDenseKeys capping. - MutableOpenStructDataSource exposes per-key DataSources to the query layer via the OpenStructDataSource SPI; implemented as a MutableIndex. Seal / offline path: - OpenStructColumnSplitter classifies keys into dense vs sparse by fill rate (plus explicit denseKeys and maxDenseKeys), writes each dense key as a standard materialized column (col$key) with optional dictionary/inverted/ null-vector, and packs the rest into a single sparse JSON column (col$__sparse__). Emits per-child and parent column metadata. - BaseSegmentCreator merges the splitter's per-child metadata into the segment properties so each child loads as its own column. Immutable (sealed) path: - ImmutableSegmentImpl groups materialized children (parentColumn metadata) under an ImmutableOpenStructDataSource in a single pass over column metadata. - ImmutableSegmentLoader keeps materialized children that are absent from the user-facing schema. Wiring / validation: - OpenStructIndexType + OpenStructIndexPlugin register the index; reader factory is a no-op (children load as standard columns). - TableConfigUtils rejects user columns containing the reserved '$' separator when any field is OPEN_STRUCT. - RealtimeSegmentStatsContainer returns EmptyColumnStatistics for the parent. - V1Constants: PARENT_COLUMN, HAS_SPARSE_COLUMN; OpenStructNaming helpers including shared value->DataType inference. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c6a3db5 to
72cf061
Compare
A BIG_DECIMAL-valued OPEN_STRUCT key aborted segment seal: BIG_DECIMAL is its own stored type (unlike BOOLEAN->INT, TIMESTAMP->LONG), so the splitter's type switches hit `default: throw`. It is reachable via inferDataType for any java.math.BigDecimal or an explicit child FieldSpec. Add BIG_DECIMAL as a variable-length stored type alongside STRING/BYTES in OpenStructColumnSplitter: getDefaultValue (BigDecimal.ZERO), dictionary build, raw var-byte forward index (putBigDecimal), and dictionary-vs-raw sizing. Dictionary dedup uses BigDecimal.equals (matching SegmentDictionaryCreator's equals-keyed indexOfSV), not compareTo, so scale-differing equal values (1.0 vs 1.00) stay distinct instead of silently resolving to dict id 0. The realtime mutable path and segment min/max metadata already handle BIG_DECIMAL and are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tisticsCollector Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… stats collector Treats absent docs as null docs holding the default value (standard model). For keys with absent docs, CARDINALITY now counts the default and MIN/MAX include it. Intended. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…MetadataInfo Replaces the hand-written emitVirtualColumnMetadata with the standard metadata writer plus the OPEN_STRUCT-specific keys (PARENT_COLUMN, hasNullValue, hasInvertedIndex). The shared writer additionally emits standard keys the old code omitted, producing a superset of the previous metadata. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…istinctValuesPerKey Replaces the _distinctValuesPerKey distinct-string-set tracking with the sealed stats collector's cardinality and longest-element length. _totalRawBytesPerKey is retained for the var-length raw-size estimate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dictElementSize is already 0 on the raw path (only assigned when a dictionary creator exists), so the conditional alias was a no-op. Pass it directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reators Drives each dense materialized child's forward, dictionary-id forward, and inverted index through the standard ForwardIndexCreator / DictionaryBasedInvertedIndexCreator obtained from StandardIndexes, using an IndexCreationContext built from the sealed stats collector (no TableConfig). The dict-vs-raw decision now mirrors BaseSegmentCreator.createDictionaryForColumn (standard default flags), and absent docs store the Pinot dimension null default. Deletes writeRawForwardIndex, shouldUseDictionary, getDefaultValue, and _totalRawBytesPerKey. Behavior changes (intended, unreleased feature): absent-doc defaults are now dimension nulls (STRING "null", INT MIN_VALUE, ...); the local size-ratio dict downgrade is removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…riteDenseKeyColumn Split the ~144-line writeDenseKeyColumn into a focused orchestrator plus two private helpers, mirroring how BaseSegmentCreator decomposes column creation: - resolveUseDictionary(...): the three-step dict-vs-raw decision. - writeForwardAndInvertedIndexes(...): dictionary + forward + inverted creation, the per-doc add loop, and the nested resource handling; returns the dictionary element size for metadata. Behavior-preserving — no logic, ordering, or resource-management change. Covered by the existing OpenStructColumnSplitterTest (16 tests). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…o OpenStructTypeInference inferDataType performs value->DataType inference, an orthogonal concern to OpenStructNaming's column-name string mapping (it uses none of the naming constants). Relocate it to a dedicated OpenStructTypeInference class in the same package so each class has a single responsibility; update the two callers. Behavior-preserving — the method body is moved verbatim. The core Object classification still delegates to PinotDataType.getSingleValueType; the remaining PinotDataType->DataType switch is OPEN_STRUCT-specific policy that exists nowhere else. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…taType Data-driven test locking the value->DataType mapping policy now that it lives in its own class: each return branch (INT via all four widening inputs, LONG, FLOAT, DOUBLE, BIG_DECIMAL, BOOLEAN, TIMESTAMP, STRING via all four folding inputs, BYTES) plus the unrepresentable cases (Map, bare Object) returning null. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ldConfig, no TableConfig) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…at table-config time Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (inverted/range/bloom) Drive per-key index creation from IndexService#getAllIndexes (filtered to the OPEN_STRUCT allowlist), building the dictionary separately and reconciling forward/dictionary encoding with the dictionary decision. Each vetted index type's validate() runs against the resolved child FieldSpec at build time so misconfigurations (e.g. range on a non-numeric raw key) fail with the canonical message instead of crashing inside the creator. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…exCreator failures dictCreator.seal() was positioned after the inner try/finally for index creators. If any IndexCreator.seal() threw, the dictionary file was left unsealed on disk. Move seal into the inner finally so it runs regardless of index creator success/failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…build pipeline The OPEN_STRUCT storage/creator layer was implemented but the offline batch ingestion pipeline (SegmentIndexCreationDriverImpl) was never made OPEN_STRUCT-aware, causing segment builds to throw. Four gaps fixed: - PinotDataType.getPinotDataTypeForIngestion: add OPEN_STRUCT case returning MAP so the Map value passes through the transform pipeline unchanged (unblocks both offline and realtime ingestion). - StatsCollectorUtil.createStatsCollector: add OPEN_STRUCT case using MapColumnPreIndexStatsCollector and exclude from NoDictCollector opt. - ForwardIndexType.shouldCreateIndex: return false for OPEN_STRUCT parent — it has no forward index (mirroring MutableSegmentImpl). - BaseSegmentCreator.writeMetadata: register splitter child columns in the DIMENSIONS property so V3 converter discovers their index files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix three gaps in the realtime OPEN_STRUCT consume→seal path: 1. MutableOpenStructIndex: remove consume-time dense-key cap that incorrectly dropped keys by arrival order. All observed keys are now retained; dense/sparse classification deferred to seal time via OpenStructColumnSplitter.classify(). 2. PinotSegmentRecordReader: OPEN_STRUCT parent columns have no forward index, so PinotSegmentColumnReader crashes on them. Track OpenStructDataSource columns separately and reconstruct per-doc map values via getMapValue(docId) for the row-major seal path. 3. SegmentColumnarIndexCreator: the column-major build path (used by RealtimeSegmentConverter when columnMajorSegmentBuilder is enabled) also tried to create PinotSegmentColumnReader for OPEN_STRUCT parents. Read from OpenStructDataSource.getMapValue(docId) instead, feeding the map values to the OpenStructColumnSplitter. Add OpenStructDataSource.getMapValue() as a default SPI method (throws UnsupportedOperationException), implemented by MutableOpenStructDataSource which reconstructs the map from per-key MutableKeyColumns. Add realtime e2e integration test (OpenStructIngestionCommitRealtimeTest) that validates: Kafka consume, forceCommit, COUNT(*), dense/sparse child column presence and types, per-key forward/dictionary index_map, and parent/sparse metadata properties. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eProcessor The SegmentPreProcessor's InvertedIndexHandler was stripping inverted indexes from OPEN_STRUCT child columns because IndexLoadingConfig only knew about schema-level columns. Add addOpenStructChildConfigs() to resolve per-key FieldConfig from the parent's OpenStructIndexConfig and inject it into the config map before handlers run. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Exercises the row-major offline build path (PinotSegmentRecordReader → SegmentColumnarIndexCreator) with the same per-key index matrix and dense/sparse validation as the realtime variant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
Author
|
@xiangfu0 @raghavyadav01 please take a look. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This is the second PR in a 4-part stack introducing
DataType.OPEN_STRUCT. It adds the segment storage layer — the runtime that makes OPEN_STRUCT columns ingestible, sealable, and queryable. Without it, declaringdataType: OPEN_STRUCTin a schema does nothing.RFC: https://docs.google.com/document/d/14kPmjDTKbO8l0ql4rrN7I5Yki5pqMw6GeGmxxc9grsU/edit?tab=t.0
PR 1 (SPI + data model): #18368
Architecture
OPEN_STRUCT columns use a two-tier columnar storage model:
denseKeyMinFillRate(default 0.5) or explicitly listed indenseKeysare materialized as standard Pinot columns (<column>$<key>), each with its own forward index, dictionary, null vector, and optional per-key indexes (inverted, range, bloom).<column>$__sparse__).Dense keys reuse the entire standard column infrastructure — no bespoke binary format. New index types (range, text, JSON) work on dense keys with zero custom code, and segment-level column pruning applies automatically.
Mutable (consuming) path
MutableOpenStructIndex— per-key in-memory state during consumption. Implements bothMutableIndexandOpenStructIndexReader(dual-role: write-side index + read-side reader for queries against mutable segments). All observed keys are retained during consumption; dense/sparse classification is deferred to seal time where fill rates are known.MutableKeyColumn— per-key forward index + presence bitmap + type tracking. Three-level type resolution: declared childFieldSpec→ value-based inference viaOpenStructTypeInference.MutableOpenStructDataSource— exposes per-key DataSources to the query layer via theOpenStructDataSourceSPI. SupportsgetMapValue(docId)for row-major map reconstruction during seal.Seal / offline path
OpenStructColumnSplitter(~550 LOC) — classifies keys into dense vs sparse by fill rate (plus explicitdenseKeysandmaxDenseKeys), writes each dense key as a standard materialized column via the standard index creator pipeline (ForwardIndexCreator,DictionaryBasedInvertedIndexCreator, etc.), and packs the rest into a single sparse JSON column. Emits per-child and parent column metadata. Dense-key indexes are driven by a generic creator loop overIndexService.getAllIndexes()filtered to a vetted allowlist (inverted, range, bloom), with each index type'svalidate()run at build time against the resolved childFieldSpec.FieldIndexConfigsUtil.fromFieldConfig— new additive method that buildsFieldIndexConfigsfrom a singleFieldConfigwithout requiring aTableConfig/Schema, enabling per-key index resolution for synthetic materialized children.BaseSegmentCreator— merges the splitter's per-child metadata into the segment properties so each child loads as its own column. Registers splitter child columns in theDIMENSIONSproperty so the V3 converter discovers their index files.AbstractColumnStatisticsCollectorfamily rather than hand-rolled tracking. Added aFieldSpec-based constructor to the base collector + 7 scalar collectors so they can be used without aStatsCollectorConfig/TableConfig.Immutable (sealed) path
ImmutableSegmentImpl— single-pass post-load grouping. Scans column metadata forparentColumn, groups child columns by parent, and builds oneImmutableOpenStructDataSourceper parent OPEN_STRUCT field. Child columns are removed from_dataSourceMapand only accessible via the parent'sgetDataSource(key).ImmutableSegmentLoader— exempts materialized child columns from schema-vs-segment reconciliation (they exist in the segment but not in the user-facing schema).Ingestion pipeline wiring
PinotDataType.getPinotDataTypeForIngestion— OPEN_STRUCT case returns MAP so theMapvalue passes through the transform pipeline unchanged.StatsCollectorUtil.createStatsCollector— OPEN_STRUCT case usesMapColumnPreIndexStatsCollector; excluded fromNoDictCollectoroptimization.ForwardIndexType.shouldCreateIndex— returnsfalsefor OPEN_STRUCT parent columns (no forward index on the parent; data lives in materialized children).PinotSegmentRecordReader— detects OPEN_STRUCT parents (no forward index), tracks theirOpenStructDataSource, and reconstructs per-docMap<String, Object>viagetMapValue()for the row-major seal path.SegmentColumnarIndexCreator— column-major build path reads fromOpenStructDataSource.getMapValue()instead of attempting aPinotSegmentColumnReaderon the parent.IndexLoadingConfig.addOpenStructChildConfigs— resolves per-keyFieldConfigfrom the parent'sOpenStructIndexConfigand injects it into the config map beforeSegmentPreProcessorhandlers run, preventingInvertedIndexHandlerfrom stripping per-key inverted indexes.Wiring / validation
OpenStructIndexType+OpenStructIndexPlugin— register the index via@AutoService(IndexPlugin.class). Reader factory is a no-op (children load as standard columns). Index handler returnsneedUpdateIndices() = falseto prevent data corruption from handler-path index updates on OPEN_STRUCT parent columns.OpenStructSupportedIndexes— vetted allowlist of per-key index types (inverted, range, bloom) with table-config-time validation.TableConfigUtils— rejects user columns containing the reserved$separator when any field is OPEN_STRUCT.RealtimeSegmentStatsContainer— returnsEmptyColumnStatisticsfor the OPEN_STRUCT parent.pinot-spichangesOpenStructTypeInference— value→DataType inference extracted fromOpenStructNaminginto its own single-responsibility class.OpenStructNaming— addedisMaterializedChildNameandparentColumnNameparser helpers for name-based child column detection.V1Constants—HAS_SPARSE_COLUMNmetadata key for parent column metadata.Backward compatibility
DataType.MAPsegments and schemas are untouched. Zero behavior change for MAP.$column name restriction applies only to schemas containing at least one OPEN_STRUCT field.Follow-up PRs
pinot-core):OpenStructFilterOperator,ItemTransformFunctiontyping path,AggregationPlanNode,GroupByPlanNode.Test plan
OpenStructColumnSplitterTest— 18 tests covering dense/sparse classification, per-type materialization (INT, LONG, FLOAT, DOUBLE, STRING, BYTES, BIG_DECIMAL, BOOLEAN, TIMESTAMP), dictionary vs raw encoding, inverted/range/bloom index creation, absent-doc defaults, explicitdenseKeys,maxDenseKeyscap, sparse JSON round-trip, BIG_DECIMAL scale-distinct valuesOpenStructSegmentCreationTest— end-to-end segment creation with V1/V3 format conversion and index map verificationMutableOpenStructIndexTest— consume-time key tracking, type inference, per-key forward index reads,getMapValue()reconstruction, all-keys-retained semanticsMutableOpenStructDataSourceTest— per-key DataSource exposure,isMaterialized/isFullyMaterializedsemanticsImmutableOpenStructDataSourceTest— post-load child grouping, dense/sparse/mixed materialization status,getMapValue()for immutable segmentsOpenStructIndexTypeTest— index registration, no-op handler, config extraction, per-key index validationStatsCollectorUtilFieldSpecTest— FieldSpec-based collector factory for all scalar typesFieldIndexConfigsUtilTest—fromFieldConfigwith dictionary/raw encoding, inverted/range/bloom indexes, default fallbackOpenStructTypeInferenceTest— value→DataType mapping for all input types including widening (Byte/Short→INT), folding (Date/UUID→STRING), and unrepresentable (Map, bare Object→null)OpenStructNamingTest— materialized/sparse column naming,isMaterializedChildName,parentColumnNameparsingOpenStructIngestionCommitRealtimeTest— realtime e2e: Kafka consume, forceCommit, COUNT(*), dense/sparse child column presence and types, per-key forward/dictionary/inverted index verification, parent/sparse metadata propertiesOpenStructIngestionCommitOfflineTest— offline e2e: row-major build path (PinotSegmentRecordReader → SegmentColumnarIndexCreator) with the same per-key index matrix and dense/sparse validation