GH-3503: Optimize ByteStreamSplitValuesWriter with batched scatter writes #3504

Open

iemejia wants to merge 1 commit into apache:master from
Conversation
…ter writes

The current ByteStreamSplitValuesWriter.writeFloat/writeDouble/writeInteger/writeLong path allocates a new byte[4] or byte[8] per value via BytesUtils.intToBytes / BytesUtils.longToBytes, then dispatches one single-byte CapacityByteArrayOutputStream.write(int) call per byte per value (4 calls per float/int, 8 per double/long). For a 100k-value page that is up to 800k single-byte virtual dispatches plus 100k short-lived byte[] allocations.

This change collapses that hot path in two stacked steps:

1. Eliminate the per-value byte[] allocation by inlining the little-endian decomposition with bit shifts into helper methods bufferInt(int) / bufferLong(long), instead of going through BytesUtils.intToBytes / BytesUtils.longToBytes, which allocate byte[4] / byte[8] on every call.
2. Batch values into a small per-instance scratch buffer (BATCH_SIZE = 128) and flush them as bulk write(byte[], off, len) calls, one per stream per flush, replacing N * elementSizeInBytes single-byte virtual dispatches with elementSizeInBytes bulk writes.

The batch is flushed automatically when full and on getBytes(), and is included in getBufferedSize() so page-sizing decisions remain correct. reset() and close() clear the pending batch. The constant was selected by sweeping 16/32/64/128/256/512/1024; 128 maximises FLOAT throughput while still capturing most of the DOUBLE/LONG gains.

Only one of intBatch / longBatch is used per writer instance; the four numeric subclasses (Float/Double/Integer/Long) each call exactly one of bufferInt / bufferLong via their writeXxx implementations. The FixedLenByteArrayByteStreamSplitValuesWriter still uses scatterBytes(byte[]) since its values arrive as already-laid-out byte arrays.
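For context, BYTE_STREAM_SPLIT stores byte k of every value in stream k. Below is a minimal standalone sketch of that scatter for 4-byte little-endian values; it is illustrative only (the class and helper names are hypothetical, not parquet-java code):

```java
import java.util.Arrays;

public class StreamSplitLayout {
    // Scatter 4-byte little-endian values: byte k of value i goes to streams[k][i].
    static byte[][] split(int[] values) {
        byte[][] streams = new byte[4][values.length];
        for (int i = 0; i < values.length; i++) {
            for (int k = 0; k < 4; k++) {
                streams[k][i] = (byte) (values[i] >>> (8 * k)); // little-endian byte k
            }
        }
        return streams;
    }

    public static void main(String[] args) {
        // 1.0f is 0x3F800000, so its little-endian bytes are 00 00 80 3F.
        byte[][] s = split(new int[] { Float.floatToIntBits(1.0f) });
        System.out.printf("%02X %02X %02X %02X%n",
            s[0][0] & 0xFF, s[1][0] & 0xFF, s[2][0] & 0xFF, s[3][0] & 0xFF);
        // prints "00 00 80 3F"
    }
}
```

The per-value hot path described above performs exactly this scatter, but one byte at a time through virtual write(int) calls, which is what the PR batches.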
Benchmark (new ByteStreamSplitEncodingBenchmark, 100k values per invocation, JDK 18, JMH -wi 5 -i 10 -f 3, 30 samples per row):

| Type   | Before (ops/s) | After (ops/s) | Improvement   | Alloc B/op            |
|--------|----------------|---------------|---------------|-----------------------|
| Float  | 15,080,427     | 65,060,920    | +331% (4.31x) | 33.27 -> 9.27 (-72%)  |
| Double | 6,994,501      | 49,475,535    | +608% (7.07x) | 42.54 -> 18.55 (-56%) |
| Int    | 15,641,334     | 68,128,560    | +335% (4.36x) | 33.27 -> 9.27 (-72%)  |
| Long   | 7,090,154      | 53,225,645    | +651% (7.51x) | 42.54 -> 18.55 (-56%) |

The remaining per-op allocation (~9 B/op for Int/Float, ~19 B/op for Long/Double) is the BytesInput[] returned by getBytes() and the streams' internal slabs, which are amortised across the page rather than per value. All 573 parquet-column tests pass.
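The two stacked steps can be sketched as a standalone model. This is illustrative only: the class and field names are hypothetical, and ByteArrayOutputStream stands in for the writer's CapacityByteArrayOutputStream; it is not the actual patch.

```java
import java.io.ByteArrayOutputStream;

public class BatchedScatterSketch {
    static final int BATCH_SIZE = 128;   // value chosen by the sweep described above

    private final int[] intBatch = new int[BATCH_SIZE];
    private int count = 0;
    // Stand-ins for the per-byte-stream output buffers.
    private final ByteArrayOutputStream[] streams = new ByteArrayOutputStream[4];
    private final byte[] flushBuf = new byte[BATCH_SIZE];

    public BatchedScatterSketch() {
        for (int k = 0; k < 4; k++) streams[k] = new ByteArrayOutputStream();
    }

    // Step 1: no per-value byte[4] allocation -- the value itself is buffered
    // and the little-endian decomposition happens later, with bit shifts.
    public void bufferInt(int v) {
        intBatch[count++] = v;
        if (count == BATCH_SIZE) flushBatch();
    }

    // Step 2: one bulk write(byte[], off, len) per stream instead of
    // count * 4 single-byte write(int) dispatches.
    private void flushBatch() {
        for (int k = 0; k < 4; k++) {
            for (int i = 0; i < count; i++) {
                flushBuf[i] = (byte) (intBatch[i] >>> (8 * k)); // byte k of value i
            }
            streams[k].write(flushBuf, 0, count);
        }
        count = 0;
    }

    // Pending values are flushed before the page is materialized,
    // mirroring the getBytes() behaviour described above.
    public byte[][] finish() {
        flushBatch();
        byte[][] out = new byte[4][];
        for (int k = 0; k < 4; k++) out[k] = streams[k].toByteArray();
        return out;
    }
}
```

A double/long variant would use a long[] batch and eight streams; the structure is otherwise identical.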
This was referenced Apr 19, 2026

iemejia added a commit to iemejia/parquet-java that referenced this pull request on Apr 19, 2026:
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor configuration and the AppendingTransformer entries for BenchmarkList / CompilerHints. As a result, the shaded jar built from master fails at runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds jmh-generator-annprocess to maven-compiler-plugin's annotation processor paths, and adds AppendingTransformer entries for META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.
- Adds 11 JMH benchmarks covering the encode/decode paths used by the pending performance-optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504, apache#3506, apache#3510), so reviewers can reproduce the reported numbers and detect regressions: IntEncodingBenchmark, BinaryEncodingBenchmark, ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark, FixedLenByteArrayEncodingBenchmark, FileReadBenchmark, FileWriteBenchmark, RowGroupFlushBenchmark, ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (previously 0 from a working build, and unrunnable at all from a default build).
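For reference, the shade-plugin change described above typically looks like the following pom fragment. This is a sketch under the assumption of a standard maven-shade-plugin setup; the exact configuration landing in that commit may differ.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <!-- Concatenate JMH's generated resource files across jars instead of
           letting one jar's copy silently overwrite another's. -->
      <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
        <resource>META-INF/BenchmarkList</resource>
      </transformer>
      <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
        <resource>META-INF/CompilerHints</resource>
      </transformer>
    </transformers>
  </configuration>
</plugin>
```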
Rationale for this change
`ByteStreamSplitValuesWriter` is the primary writer for `BYTE_STREAM_SPLIT`-encoded `FLOAT`, `DOUBLE`, `INT32`, and `INT64` columns. Each value goes through a hot path that performs both an unnecessary allocation and N single-byte virtual dispatches. For `FloatByteStreamSplitValuesWriter.writeFloat(float v)`: `BytesUtils.intToBytes` allocates a fresh `byte[4]` on every call, and `scatterBytes` then loops over those bytes, issuing one single-byte write per byte. So per value: one `byte[4]` allocation + 4 single-byte virtual dispatches. For a 100k-value `FLOAT` page that is 100k allocations and 400k single-byte writes. `DOUBLE`/`LONG` are even worse (`byte[8]`, 800k single-byte writes).

What changes are included in this PR?
Two stacked changes in `ByteStreamSplitValuesWriter`:

1. Eliminate per-value allocation: replace `super.scatterBytes(BytesUtils.intToBytes(v))` with `bufferInt(v)` / `bufferLong(v)`, which perform the little-endian decomposition with bit shifts directly, with no temporary `byte[]`.
2. Batch single-byte writes: accumulate `BATCH_SIZE = 128` values in a small per-instance scratch buffer and flush them as bulk `write(byte[], off, len)` calls (one per stream), replacing `BATCH_SIZE * elementSizeInBytes` single-byte virtual dispatches with `elementSizeInBytes` bulk writes per flush. The constant was chosen by sweeping 16/32/64/128/256/512/1024 — 128 is the sweet spot for `FLOAT` throughput while still capturing most of the `DOUBLE`/`LONG` gains.

Pending values are included in `getBufferedSize()` (so page-sizing decisions remain correct) and flushed in `getBytes()`. `reset()` and `close()` clear pending state. Only the four numeric subclasses (Float/Double/Integer/Long) use the batching path; `FixedLenByteArrayByteStreamSplitValuesWriter` continues to use `scatterBytes(byte[])` since its values arrive as already-laid-out byte arrays.

Benchmark
New `ByteStreamSplitEncodingBenchmark` (100k values per invocation, JDK 18, JMH `-wi 5 -i 10 -f 3`, 30 samples per row):

| Type   | Before (ops/s) | After (ops/s) | Improvement   | Alloc B/op            |
|--------|----------------|---------------|---------------|-----------------------|
| Float  | 15,080,427     | 65,060,920    | +331% (4.31x) | 33.27 -> 9.27 (-72%)  |
| Double | 6,994,501      | 49,475,535    | +608% (7.07x) | 42.54 -> 18.55 (-56%) |
| Int    | 15,641,334     | 68,128,560    | +335% (4.36x) | 33.27 -> 9.27 (-72%)  |
| Long   | 7,090,154      | 53,225,645    | +651% (7.51x) | 42.54 -> 18.55 (-56%) |

The remaining per-op allocation (~9 B/op for Int/Float, ~19 B/op for Long/Double) is the `BytesInput[]` returned by `getBytes()` and the streams' internal slabs, which are amortised across the page rather than per value.

Are these changes tested?
Yes. All 573 `parquet-column` tests pass; 51 BSS-specific tests pass (`mvn test -pl parquet-column -Dtest='*ByteStreamSplit*'`). No new test was added because behaviour is unchanged (covered by the existing round-trip and writer tests).

Are there any user-facing changes?
No. Only an internal writer optimization. No public API, file format, or configuration change.
Closes #3503
Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500.
How to reproduce the benchmarks
The JMH benchmarks cited above are being added to `parquet-benchmarks` in #3512. Once that lands, reproduce by comparing runs against `master` (baseline) and this branch (optimized).