Skip to content

Remove SegmentIndexCreationInfo and ColumnIndexCreationInfo wrapper classes#18116

Merged
xiangfu0 merged 1 commit intoapache:masterfrom
Jackie-Jiang:remove_segment_creation_info
Apr 8, 2026
Merged

Remove SegmentIndexCreationInfo and ColumnIndexCreationInfo wrapper classes#18116
xiangfu0 merged 1 commit intoapache:masterfrom
Jackie-Jiang:remove_segment_creation_info

Conversation

@Jackie-Jiang
Copy link
Copy Markdown
Contributor

@Jackie-Jiang Jackie-Jiang commented Apr 7, 2026

Summary

Removes unnecessary wrapper classes and consolidates segment creation parameters.

Delete SegmentIndexCreationInfo and ColumnIndexCreationInfo

  • SegmentIndexCreationInfo was a trivial wrapper holding only int totalDocs — replaced with a plain int parameter
  • ColumnIndexCreationInfo wrapped ColumnStatistics with 3 extra fields that are now derived at usage sites:
    • useVarLengthDictionary: re-derived from DictionaryIndexConfig + column stats
    • isAutoGenerated: detected via instanceof DefaultColumnStatistics
    • defaultNullValue: computed from FieldSpec.getDefaultNullValueString()

Simplify SegmentCreator interface

  • init() reduced from 7 parameters to 4: (SegmentGeneratorConfig, int totalDocs, TreeMap<String, ColumnStatistics>, File outDir)
  • InstanceType, immutableToMutableIdMap, Schema moved into SegmentGeneratorConfig
  • Common init logic (directory creation, _colIndexes population) consolidated into BaseSegmentCreator.init()
  • SegmentColumnarIndexCreator no longer overrides init() — inherits from BaseSegmentCreator

Consolidate parameters into SegmentGeneratorConfig

  • Added instanceType, mutableSegmentCompacted, mutableToImmutableDocIdMap fields
  • Callers (SegmentPurger, SegmentProcessorFramework, RealtimeSegmentConverter, minion tasks, etc.) now set these on the config instead of passing them as separate init() parameters
  • Removed SegmentIndexCreationDriver.init(config, instanceType) overload — instanceType is now on the config

Simplify SegmentIndexCreationDriverImpl

  • Removed _segmentIndexCreationInfo field — uses _totalDocs directly
  • _indexCreationInfoMap replaced with _columnStatisticsMap (TreeMap<String, ColumnStatistics>)
  • Stats collection stores ColumnStatistics directly instead of wrapping in ColumnIndexCreationInfo
  • init() overloads simplified: InstanceType parameter removed (read from config)

Fix NoDictColumnStatisticsCollector.getUniqueValuesSet()

  • Changed from throwing NotImplementedException to returning null
  • Added @Nullable annotation to ColumnStatistics.getUniqueValuesSet()
  • Same fix applied to MutableNoDictColumnStatistics

Improve Lucene text index reuse during realtime conversion

  • LuceneTextIndexCreator: renamed immutableToMutableIdMap to mutableToImmutableDocIdMap to match actual semantics
  • Added mutableSegmentCompacted parameter — disables mutable index reuse when segment is compacted (compaction changes doc IDs, making the mutable text index invalid)
  • RealtimeSegmentConverter: computes mutableToImmutableDocIdMap from sortedDocIds when both sorting and mutable text index reuse are enabled
  • MutableSegmentImpl: added hasColumnWithReuseMutableTextIndex() to check if any column uses mutable text index reuse

Cleanup

  • IndexCreationContext.Builder: replaced withAllColumnStatistics() + withColumnIndexCreationInfo() with single withColumnStatistics() method
  • CompactedPinotSegmentRecordReader: removed unused constructor overloads taking ThreadSafeMutableRoaringBitmap
  • Updated all callers and tests across pinot-core, pinot-plugins, pinot-segment-local, pinot-segment-spi

Test plan

  • Existing unit tests pass (ColumnMetadataTest, SegmentPreProcessorTest, MutableSegmentImplTest, LuceneTextIndexBufferIntegrationTest)
  • Minion task tests pass (MergeRollupTaskExecutorTest, PurgeTaskExecutorTest, RealtimeToOfflineSegmentsTaskExecutorTest)
  • Existing integration tests pass
  • No behavioral change except: NoDictColumnStatisticsCollector.getUniqueValuesSet() returns null instead of throwing, and mutable text index reuse is disabled when segment is compacted

@Jackie-Jiang Jackie-Jiang added ingestion Related to data ingestion pipeline cleanup Code cleanup or removal of dead code refactor Code restructuring without changing behavior backward-incompat Introduces a backward-incompatible API or behavior change bug Something is not working as expected labels Apr 7, 2026
@Jackie-Jiang Jackie-Jiang force-pushed the remove_segment_creation_info branch from 557efcb to 8238301 Compare April 7, 2026 20:55
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 72.15909% with 49 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.03%. Comparing base (03b78cc) to head (b2dc5a5).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
...inot/segment/spi/creator/IndexCreationContext.java 0.00% 18 Missing ⚠️
...ot/segment/spi/creator/SegmentGeneratorConfig.java 26.66% 11 Missing ⚠️
...loader/defaultcolumn/BaseDefaultColumnHandler.java 58.33% 8 Missing and 2 partials ⚠️
...local/segment/creator/impl/BaseSegmentCreator.java 92.30% 0 Missing and 4 partials ⚠️
...sks/refreshsegment/RefreshSegmentTaskExecutor.java 0.00% 2 Missing ⚠️
...upsertcompaction/UpsertCompactionTaskExecutor.java 0.00% 2 Missing ⚠️
...l/realtime/converter/RealtimeSegmentConverter.java 90.90% 1 Missing ⚠️
...ment/creator/impl/text/LuceneTextIndexCreator.java 91.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18116      +/-   ##
============================================
+ Coverage     63.97%   64.03%   +0.06%     
- Complexity     1611     1617       +6     
============================================
  Files          3181     3179       -2     
  Lines        193767   193697      -70     
  Branches      29918    29914       -4     
============================================
+ Hits         123953   124028      +75     
+ Misses        60034    59865     -169     
- Partials       9780     9804      +24     
Flag Coverage Δ
custom-integration1 100.00% <ø> (?)
integration 100.00% <ø> (?)
integration1 100.00% <ø> (?)
integration2 0.00% <ø> (?)
java-11 64.00% <72.15%> (+0.08%) ⬆️
java-21 63.93% <72.15%> (+<0.01%) ⬆️
temurin 64.03% <72.15%> (+0.06%) ⬆️
unittests 64.02% <72.15%> (+0.05%) ⬆️
unittests1 55.87% <41.76%> (+<0.01%) ⬆️
unittests2 34.39% <67.61%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…lasses

These two classes add unnecessary indirection:
- SegmentIndexCreationInfo is a trivial wrapper holding only `int totalDocs`
- ColumnIndexCreationInfo wraps ColumnStatistics with 3 extra fields that
  can be derived at usage sites

Replace SegmentIndexCreationInfo with plain `int totalDocs` parameter.
Replace ColumnIndexCreationInfo with ColumnStatistics directly. The extra
fields are handled as:
- useVarLengthDictionary: re-derived from DictionaryIndexConfig + stats
- isAutoGenerated: detected via `instanceof DefaultColumnStatistics`
- defaultNullValue: computed from FieldSpec.getDefaultNullValueString()

Also simplifies SegmentCreator.init() and SegmentGeneratorConfig by
consolidating parameters and moving common initialization into
BaseSegmentCreator.
@Jackie-Jiang Jackie-Jiang force-pushed the remove_segment_creation_info branch from 8238301 to b2dc5a5 Compare April 7, 2026 22:17
@xiangfu0 xiangfu0 merged commit e7ea035 into apache:master Apr 8, 2026
30 of 32 checks passed
@Jackie-Jiang Jackie-Jiang deleted the remove_segment_creation_info branch April 8, 2026 20:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backward-incompat Introduces a backward-incompatible API or behavior change bug Something is not working as expected cleanup Code cleanup or removal of dead code ingestion Related to data ingestion pipeline refactor Code restructuring without changing behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants