Describe the enhancement requested
ByteStreamSplitValuesWriter is the primary writer for BYTE_STREAM_SPLIT-encoded FLOAT, DOUBLE, INT32, and INT64 columns. Each value goes through a hot path that performs both an unnecessary allocation and N single-byte virtual dispatches.
For `FloatByteStreamSplitValuesWriter.writeFloat(float v)`:

```java
super.scatterBytes(BytesUtils.intToBytes(Float.floatToIntBits(v)));
```
`BytesUtils.intToBytes` allocates a fresh `byte[4]` on every call. `scatterBytes` then loops:

```java
for (int i = 0; i < bytes.length; ++i) {
  this.byteStreams[i].write(bytes[i]); // CapacityByteArrayOutputStream.write(int)
}
```
That is, per value: 1 byte[4] allocation + 4 single-byte virtual dispatches. For a 100k-value FLOAT page that is 100k allocations and 400k single-byte writes. DOUBLE/LONG are even worse (byte[8], 800k single-byte writes).
JMH (new ByteStreamSplitEncodingBenchmark, 100k values per invocation, JDK 18, -wi 5 -i 10 -f 3, 30 samples) on master:
| Type | ops/s | gc.alloc.rate.norm |
|--------|--------|--------------------|
| Float | 15.08M | 33.27 B/op |
| Double | 6.99M | 42.54 B/op |
| Int | 15.64M | 33.27 B/op |
| Long | 7.09M | 42.54 B/op |
The B/op figure for Float/Int (33 B) is mostly the per-value byte[4] allocation.
Proposal
Two stacked changes in ByteStreamSplitValuesWriter:
- Eliminate per-value allocation: replace `super.scatterBytes(BytesUtils.intToBytes(v))` with `bufferInt(v)` / `bufferLong(v)` helpers that perform the little-endian decomposition with bit shifts directly, with no temporary `byte[]`.
- Batch single-byte writes: accumulate `BATCH_SIZE = 128` values in a small per-instance scratch buffer and flush them as N bulk `write(byte[], off, len)` calls (one per stream), replacing `BATCH_SIZE * elementSizeInBytes` single-byte virtual dispatches with `elementSizeInBytes` bulk writes per flush. The constant was chosen by sweeping 16/32/64/128/256/512/1024; 128 is the sweet spot for FLOAT throughput while still capturing most of the DOUBLE/LONG gains.
Pending values are included in getBufferedSize() (so page-sizing decisions remain correct) and flushed in getBytes(). reset() and close() clear pending state. Only the four numeric subclasses use the batching path; FixedLenByteArrayByteStreamSplitValuesWriter continues to use scatterBytes(byte[]) since its values arrive as already-laid-out byte arrays.
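To make the two changes and their bookkeeping concrete, here is a minimal self-contained sketch for the INT32/FLOAT case. It uses plain `ByteArrayOutputStream` stand-ins, and the names (`BatchedIntSplitterSketch`, `scratch`, `toBytes`) are illustrative assumptions, not the actual members of `ByteStreamSplitValuesWriter`, which operates on `CapacityByteArrayOutputStream`:

```java
import java.io.ByteArrayOutputStream;

/**
 * Illustrative sketch of the batched byte-stream-split write path.
 * Not the actual patch: stand-in streams and hypothetical names.
 */
class BatchedIntSplitterSketch {
  private static final int BATCH_SIZE = 128;
  private static final int ELEMENT_SIZE = 4; // bytes per INT32/FLOAT value

  private final ByteArrayOutputStream[] byteStreams = new ByteArrayOutputStream[ELEMENT_SIZE];
  // One scratch row per byte stream, flushed with bulk writes when full.
  private final byte[][] scratch = new byte[ELEMENT_SIZE][BATCH_SIZE];
  private int pending = 0;

  BatchedIntSplitterSketch() {
    for (int i = 0; i < ELEMENT_SIZE; i++) {
      byteStreams[i] = new ByteArrayOutputStream();
    }
  }

  /** Change 1: little-endian decomposition with bit shifts, no temporary byte[]. */
  void bufferInt(int v) {
    scratch[0][pending] = (byte) v;
    scratch[1][pending] = (byte) (v >>> 8);
    scratch[2][pending] = (byte) (v >>> 16);
    scratch[3][pending] = (byte) (v >>> 24);
    if (++pending == BATCH_SIZE) {
      flushPending();
    }
  }

  /** Change 2: ELEMENT_SIZE bulk writes replace pending * ELEMENT_SIZE single-byte writes. */
  private void flushPending() {
    for (int i = 0; i < ELEMENT_SIZE; i++) {
      byteStreams[i].write(scratch[i], 0, pending);
    }
    pending = 0;
  }

  /** Pending values count toward the buffered size so page-sizing decisions stay correct. */
  long getBufferedSize() {
    long scattered = 0;
    for (ByteArrayOutputStream s : byteStreams) {
      scattered += s.size();
    }
    return scattered + (long) pending * ELEMENT_SIZE;
  }

  /** Analogue of getBytes(): flush pending values, then concatenate the streams in order. */
  byte[] toBytes() {
    flushPending();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (ByteArrayOutputStream s : byteStreams) {
      out.write(s.toByteArray(), 0, s.size());
    }
    return out.toByteArray();
  }
}
```

Buffering `0x04030201` and `0x08070605` and calling `toBytes()` yields the split layout: all low bytes first (`01 05`), then the next stream (`02 06`), and so on, with each stream filled by a single bulk write.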
Expected speedup (same JMH config):
| Type | Before | After | Δ | Alloc B/op |
|--------|--------|--------|--------------|---------------------|
| Float | 15.08M | 65.06M | +331% (4.3x) | 33.27 → 9.27 (-72%) |
| Double | 6.99M | 49.48M | +608% (7.1x) | 42.54 → 18.55 (-56%) |
| Int | 15.64M | 68.13M | +335% (4.4x) | 33.27 → 9.27 (-72%) |
| Long | 7.09M | 53.23M | +651% (7.5x) | 42.54 → 18.55 (-56%) |
Scope
- Single file change to `parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesWriter.java`.
- No public-API change; `bufferInt`/`bufferLong` are package-internal helpers; existing public methods preserve their contracts.
- All 573 `parquet-column` tests pass; 51 BSS-specific tests pass.
Relation
Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494 (PlainValuesReader), #3496 (PlainValuesWriter), #3500 (Binary.hashCode cache).