Add getValueSize() to Dictionary and ValueReader to avoid buffer allocation #18012
Pull request overview
This PR introduces byte-size introspection APIs on dictionary/value readers to avoid allocating buffers just to compute value lengths, and refactors index/statistics codepaths to use the new APIs.
Changes:
- Add `Dictionary#getValueSize(int)` with implementations across immutable/mutable dictionary types.
- Add `ValueReader#getByteSize(...)`/`getUnpaddedByteSize(...)` with implementations for fixed-byte and var-length readers.
- Refactor forward-index creation and mutable column stats to compute max/min lengths via `getValueSize()` (no materialization).
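
To make the allocation difference concrete, here is a minimal sketch, not the PR's actual code. `SizedDictionary`, `of`, and both `maxLength...` helpers are hypothetical names invented for illustration; only the idea of a `getValueSize(int)` accessor comes from the PR. It contrasts measuring the longest entry by materializing each value against asking the dictionary for the size directly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class ValueSizeSketch {
  /** Hypothetical stand-in for a dictionary; NOT Pinot's actual interface. */
  public interface SizedDictionary {
    int length();
    byte[] getBytesValue(int dictId); // hands out a fresh array on every call
    int getValueSize(int dictId);     // reports the byte length only
  }

  /** Toy dictionary (assumption): values are held as UTF-8 byte arrays. */
  public static SizedDictionary of(List<String> values) {
    final byte[][] encoded = new byte[values.size()][];
    for (int i = 0; i < encoded.length; i++) {
      encoded[i] = values.get(i).getBytes(StandardCharsets.UTF_8);
    }
    return new SizedDictionary() {
      @Override public int length() { return encoded.length; }
      @Override public byte[] getBytesValue(int dictId) { return encoded[dictId].clone(); }
      @Override public int getValueSize(int dictId) { return encoded[dictId].length; }
    };
  }

  /** Old pattern: one throwaway array per entry just to read its length. */
  public static int maxLengthByMaterializing(SizedDictionary dict) {
    int max = 0;
    for (int id = 0; id < dict.length(); id++) {
      max = Math.max(max, dict.getBytesValue(id).length);
    }
    return max;
  }

  /** New pattern: same result, no per-entry allocation. */
  public static int maxLengthBySize(SizedDictionary dict) {
    int max = 0;
    for (int id = 0; id < dict.length(); id++) {
      max = Math.max(max, dict.getValueSize(id));
    }
    return max;
  }

  public static void main(String[] args) {
    SizedDictionary dict = of(Arrays.asList("a", "h\u00e9llo", "xyz"));
    System.out.println(maxLengthByMaterializing(dict));
    System.out.println(maxLengthBySize(dict));
  }
}
```

Both helpers return the same number; the second simply never creates the intermediate arrays, which is the saving the PR is after when scanning a large dictionary.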
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pinot-tools/src/main/java/org/apache/pinot/tools/segment/converter/DictionaryToRawIndexConverter.java | Use Dictionary#getValueSize() when computing longest entry for raw index creation. |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/reader/Dictionary.java | Add getValueSize(int) default API and update related doc comments. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/readerwriter/FixedByteValueReaderWriterTest.java | Add tests for fixed-byte getByteSize/getUnpaddedByteSize. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/readers/ImmutableDictionaryTest.java | Add assertions for Dictionary#getValueSize() across dictionary types. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/io/util/VarLengthValueReaderWriterTest.java | Add assertions for var-length reader getByteSize/getUnpaddedByteSize. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/virtualcolumn/PartitionIdVirtualColumnProvider.java | Implement getBytesValue()/getValueSize() for virtual string dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/StringDictionary.java | Implement getValueSize() via unpadded byte size. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/OnHeapStringDictionary.java | Implement getValueSize() for on-heap string dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/OnHeapBytesDictionary.java | Implement getByteArrayValue() override and getValueSize(). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/OnHeapBigDecimalDictionary.java | Implement getValueSize() for on-heap big-decimal dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueStringDictionary.java | Implement getValueSize() for constant string dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueBytesDictionary.java | Implement getValueSize() for constant bytes dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueBigDecimalDictionary.java | Cache serialized bytes; implement getBytesValue() and getValueSize(). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/BytesDictionary.java | Implement getValueSize() for immutable bytes dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/BigDecimalDictionary.java | Implement getValueSize() for immutable big-decimal dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/BaseImmutableDictionary.java | Add protected helpers for byte-size queries via ValueReader. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/InvertedIndexAndDictionaryBasedForwardIndexCreator.java | Replace allocation/encoding-based length tracking with getValueSize(). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/StringOnHeapMutableDictionary.java | Implement getValueSize() for mutable on-heap string dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/StringOffHeapMutableDictionary.java | Implement getValueSize() for mutable off-heap string dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/SameValueMutableDictionary.java | Delegate getValueSize() to underlying dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BytesOnHeapMutableDictionary.java | Implement getValueSize() for mutable on-heap bytes dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BytesOffHeapMutableDictionary.java | Implement getValueSize() for mutable off-heap bytes dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BigDecimalOnHeapMutableDictionary.java | Fix bytes serialization impl; implement getValueSize(). |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BigDecimalOffHeapMutableDictionary.java | Implement getValueSize() for mutable off-heap big-decimal dictionary. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableColumnStatistics.java | Use Dictionary#getValueSize() to compute min/max element lengths. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/MutableOffHeapByteArrayStore.java | Add getValueSize(int) to avoid allocating returned byte arrays. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/util/VarLengthValueReader.java | Implement getByteSize/getUnpaddedByteSize via the offset table. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/util/ValueReader.java | Add byte-size query APIs for variable/fixed-size readers. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/util/FixedByteValueReaderWriter.java | Implement getByteSize/getUnpaddedByteSize for fixed-byte store. |
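
The two reader strategies named in the table (offset table for var-length values, fixed slots for fixed-byte values) can be sketched roughly as follows. This is an illustration under stated assumptions, not Pinot's implementation: the class `ByteSizeSketch`, its method names, and both buffer layouts are hypothetical, zero is assumed as the padding byte, and the unpadded scan here goes byte by byte rather than a word at a time.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteSizeSketch {
  /** Assumed var-length layout: (n + 1) int offsets, then the value bytes. */
  public static ByteBuffer buildVarLength(String... values) {
    byte[][] enc = new byte[values.length][];
    int total = 0;
    for (int i = 0; i < values.length; i++) {
      enc[i] = values[i].getBytes(StandardCharsets.UTF_8);
      total += enc[i].length;
    }
    int header = (values.length + 1) * Integer.BYTES;
    ByteBuffer buf = ByteBuffer.allocate(header + total);
    int offset = header;
    for (int i = 0; i < values.length; i++) {
      buf.putInt(i * Integer.BYTES, offset); // start offset of value i
      buf.position(offset);
      buf.put(enc[i]);
      offset += enc[i].length;
    }
    buf.putInt(values.length * Integer.BYTES, offset); // end offset of the last value
    return buf;
  }

  /** Size is the difference of two adjacent offsets; value bytes are never read. */
  public static int varLengthSize(ByteBuffer buf, int index) {
    int start = buf.getInt(index * Integer.BYTES);
    int end = buf.getInt((index + 1) * Integer.BYTES);
    return end - start;
  }

  /** Assumed fixed-byte layout: numBytesPerValue bytes per slot, right-padded with 0. */
  public static ByteBuffer buildFixed(int numBytesPerValue, String... values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * numBytesPerValue); // zero-filled
    for (int i = 0; i < values.length; i++) {
      byte[] enc = values[i].getBytes(StandardCharsets.UTF_8);
      if (enc.length > numBytesPerValue) {
        throw new IllegalArgumentException("value does not fit in slot: " + values[i]);
      }
      buf.position(i * numBytesPerValue);
      buf.put(enc);
    }
    return buf;
  }

  /** Padded size is simply the slot width. */
  public static int fixedSize(int numBytesPerValue) {
    return numBytesPerValue;
  }

  /** Unpadded size: scan the slot for the first padding byte (simple byte-wise scan). */
  public static int fixedUnpaddedSize(ByteBuffer buf, int index, int numBytesPerValue) {
    int base = index * numBytesPerValue;
    for (int i = 0; i < numBytesPerValue; i++) {
      if (buf.get(base + i) == 0) {
        return i;
      }
    }
    return numBytesPerValue;
  }

  public static void main(String[] args) {
    ByteBuffer vbuf = buildVarLength("a", "h\u00e9llo", "");
    ByteBuffer fbuf = buildFixed(8, "a", "h\u00e9llo", "12345678");
    for (int i = 0; i < 3; i++) {
      System.out.println(varLengthSize(vbuf, i) + " " + fixedUnpaddedSize(fbuf, i, 8));
    }
  }
}
```

In both cases the size comes from metadata or an in-place scan of the backing buffer, so no value array is copied out. The byte-wise loop is the simplest correct form; a word-at-a-time zero detection would be a faster variant of the same idea.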
Codecov Report
❌ Patch coverage is
Additional details and impacted files

| Coverage Diff | master | #18012 | +/- |
|---|---|---|---|
| Coverage | 63.27% | 55.52% | -7.75% |
| Complexity | 1543 | 752 | -791 |
| Files | 3200 | 2505 | -695 |
| Lines | 194074 | 143033 | -51041 |
| Branches | 29883 | 22954 | -6929 |
| Hits | 122792 | 79414 | -43378 |
| Misses | 61637 | 56878 | -4759 |
| Partials | 9645 | 6741 | -2904 |
xiangfu0 left a comment:
Please fix the failing tests.
Summary
- Add `getValueSize(int dictId)` to the `Dictionary` interface, returning the byte size of the stored value (for STRING: UTF-8 encoded length; for BIG_DECIMAL: serialized byte length; for fixed-width types: the type's natural byte size via the default implementation).
- Add `getValueSize(int index, int numBytesPerValue)` and `getUnpaddedValueSize(int index, int numBytesPerValue)` to the `ValueReader` interface, with implementations in `FixedByteValueReaderWriter` (returns `numBytesPerValue` / ZeroInWord scan) and `VarLengthValueReader` (reads from the offset table).
- Update `MutableColumnStatistics` and `InvertedIndexAndDictionaryBasedForwardIndexCreator` to use `getValueSize()` instead of allocating bytes just to measure their length.
- Implement `getValueSize()` across all immutable and mutable dictionary types (on-heap, off-heap, constant-value, same-value).

Test plan
- Add `getValueSize` assertions to `ImmutableDictionaryTest` for all dictionary types (INT, LONG, FLOAT, DOUBLE, BIG_DECIMAL, STRING, BYTES).
- Add `testGetValueSize` to `FixedByteValueReaderWriterTest`, covering both `getValueSize` and `getUnpaddedValueSize`.
- Add `getValueSize`/`getUnpaddedValueSize` assertions to `VarLengthValueReaderWriterTest`.

🤖 Generated with Claude Code