Skip to content

Add getValueSize() to Dictionary and ValueReader to avoid buffer allocation#18012

Merged
xiangfu0 merged 1 commit intoapache:masterfrom
Jackie-Jiang:dictionary_value_size
Mar 28, 2026
Merged

Add getValueSize() to Dictionary and ValueReader to avoid buffer allocation#18012
xiangfu0 merged 1 commit intoapache:masterfrom
Jackie-Jiang:dictionary_value_size

Conversation

@Jackie-Jiang
Copy link
Copy Markdown
Contributor

Summary

  • Add getValueSize(int dictId) to the Dictionary interface, returning the byte size of the stored value (for STRING: UTF-8 encoded length; for BIG_DECIMAL: serialized bytes length; for fixed-width types: the type's natural byte size via the default implementation)
  • Add getValueSize(int index, int numBytesPerValue) and getUnpaddedValueSize(int index, int numBytesPerValue) to the ValueReader interface, with implementations in FixedByteValueReaderWriter (returns numBytesPerValue / ZeroInWord scan) and VarLengthValueReader (reads from offset table)
  • Refactor MutableColumnStatistics and IndexAndDictionaryBasedForwardIndexCreator to use getValueSize() instead of allocating bytes just to measure their length
  • Implement getValueSize() across all immutable and mutable dictionary types (on-heap, off-heap, constant-value, same-value)

Test plan

  • Added getValueSize assertions to ImmutableDictionaryTest for all dictionary types (INT, LONG, FLOAT, DOUBLE, BIG_DECIMAL, STRING, BYTES)
  • Added testGetValueSize to FixedByteValueReaderWriterTest covering both getValueSize and getUnpaddedValueSize
  • Added getValueSize / getUnpaddedValueSize assertions to VarLengthValueReaderWriterTest

🤖 Generated with Claude Code

@Jackie-Jiang Jackie-Jiang added enhancement Improvement to existing functionality performance Related to performance optimization labels Mar 28, 2026
@Jackie-Jiang Jackie-Jiang force-pushed the dictionary_value_size branch from 1620b31 to aab2035 Compare March 28, 2026 00:18
@xiangfu0 xiangfu0 requested review from Copilot, jtao15 and xiangfu0 March 28, 2026 00:21
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces byte-size introspection APIs on dictionary/value readers to avoid allocating buffers just to compute value lengths, and refactors index/statistics codepaths to use the new APIs.

Changes:

  • Add Dictionary#getValueSize(int) with implementations across immutable/mutable dictionary types.
  • Add ValueReader#getByteSize(...) / getUnpaddedByteSize(...) with implementations for fixed-byte and var-length readers.
  • Refactor forward-index creation and mutable column stats to compute max/min lengths via getValueSize() (no materialization).

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pinot-tools/src/main/java/org/apache/pinot/tools/segment/converter/DictionaryToRawIndexConverter.java Use Dictionary#getValueSize() when computing longest entry for raw index creation.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/reader/Dictionary.java Add getValueSize(int) default API and update related doc comments.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/readerwriter/FixedByteValueReaderWriterTest.java Add tests for fixed-byte getByteSize/getUnpaddedByteSize.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/readers/ImmutableDictionaryTest.java Add assertions for Dictionary#getValueSize() across dictionary types.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/io/util/VarLengthValueReaderWriterTest.java Add assertions for var-length reader getByteSize/getUnpaddedByteSize.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/virtualcolumn/PartitionIdVirtualColumnProvider.java Implement getBytesValue()/getValueSize() for virtual string dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/StringDictionary.java Implement getValueSize() via unpadded byte size.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/OnHeapStringDictionary.java Implement getValueSize() for on-heap string dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/OnHeapBytesDictionary.java Implement getByteArrayValue() override and getValueSize().
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/OnHeapBigDecimalDictionary.java Implement getValueSize() for on-heap big-decimal dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueStringDictionary.java Implement getValueSize() for constant string dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueBytesDictionary.java Implement getValueSize() for constant bytes dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueBigDecimalDictionary.java Cache serialized bytes; implement getBytesValue() and getValueSize().
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/BytesDictionary.java Implement getValueSize() for immutable bytes dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/BigDecimalDictionary.java Implement getValueSize() for immutable big-decimal dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/BaseImmutableDictionary.java Add protected helpers for byte-size queries via ValueReader.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/InvertedIndexAndDictionaryBasedForwardIndexCreator.java Replace allocation/encoding-based length tracking with getValueSize().
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/StringOnHeapMutableDictionary.java Implement getValueSize() for mutable on-heap string dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/StringOffHeapMutableDictionary.java Implement getValueSize() for mutable off-heap string dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/SameValueMutableDictionary.java Delegate getValueSize() to underlying dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BytesOnHeapMutableDictionary.java Implement getValueSize() for mutable on-heap bytes dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BytesOffHeapMutableDictionary.java Implement getValueSize() for mutable off-heap bytes dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BigDecimalOnHeapMutableDictionary.java Fix bytes serialization impl; implement getValueSize().
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BigDecimalOffHeapMutableDictionary.java Implement getValueSize() for mutable off-heap big-decimal dictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableColumnStatistics.java Use Dictionary#getValueSize() to compute min/max element lengths.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/MutableOffHeapByteArrayStore.java Add getValueSize(int) to avoid allocating returned byte arrays.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/util/VarLengthValueReader.java Implement getByteSize/getUnpaddedByteSize via the offset table.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/util/ValueReader.java Add byte-size query APIs for variable/fixed-size readers.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/util/FixedByteValueReaderWriter.java Implement getByteSize/getUnpaddedByteSize for fixed-byte store.

@Jackie-Jiang Jackie-Jiang force-pushed the dictionary_value_size branch from aab2035 to 220053e Compare March 28, 2026 00:45
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 28, 2026

Codecov Report

❌ Patch coverage is 16.17647% with 57 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.52%. Comparing base (cfd7268) to head (220053e).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
...l/io/writer/impl/MutableOffHeapByteArrayStore.java 0.00% 11 Missing ⚠️
...ment/local/io/util/FixedByteValueReaderWriter.java 53.33% 5 Missing and 2 partials ⚠️
...ltime/converter/stats/MutableColumnStatistics.java 0.00% 7 Missing ⚠️
...ot/segment/local/io/util/VarLengthValueReader.java 0.00% 5 Missing ⚠️
...tedIndexAndDictionaryBasedForwardIndexCreator.java 20.00% 4 Missing ⚠️
...dex/readers/ConstantValueBigDecimalDictionary.java 0.00% 3 Missing ⚠️
.../dictionary/BigDecimalOnHeapMutableDictionary.java 0.00% 2 Missing ⚠️
...l/segment/index/readers/OnHeapBytesDictionary.java 0.00% 2 Missing ⚠️
...irtualcolumn/PartitionIdVirtualColumnProvider.java 0.00% 2 Missing ⚠️
...dictionary/BigDecimalOffHeapMutableDictionary.java 0.00% 1 Missing ⚠️
... and 13 more

❗ There is a different number of reports uploaded between BASE (cfd7268) and HEAD (220053e). Click for more details.

HEAD has 4 uploads less than BASE
Flag BASE (cfd7268) HEAD (220053e)
unittests 3 2
temurin 9 8
java-11 5 4
unittests2 1 0
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18012      +/-   ##
============================================
- Coverage     63.27%   55.52%   -7.75%     
+ Complexity     1543      752     -791     
============================================
  Files          3200     2505     -695     
  Lines        194074   143033   -51041     
  Branches      29883    22954    -6929     
============================================
- Hits         122792    79414   -43378     
+ Misses        61637    56878    -4759     
+ Partials       9645     6741    -2904     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 55.50% <16.17%> (-7.75%) ⬇️
java-21 55.47% <16.17%> (-0.01%) ⬇️
temurin 55.52% <16.17%> (-7.75%) ⬇️
unittests 55.51% <16.17%> (-7.75%) ⬇️
unittests1 55.51% <16.17%> (+0.01%) ⬆️
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fix the failure tests

@xiangfu0 xiangfu0 merged commit fe3e5de into apache:master Mar 28, 2026
42 of 48 checks passed
@Jackie-Jiang Jackie-Jiang deleted the dictionary_value_size branch March 29, 2026 02:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Improvement to existing functionality performance Related to performance optimization

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants