GH-3503: Optimize ByteStreamSplitValuesWriter with batched scatter writes#3504

Open
iemejia wants to merge 1 commit into apache:master from iemejia:perf-bss-writer-batch

Conversation


@iemejia iemejia commented Apr 19, 2026

Rationale for this change

ByteStreamSplitValuesWriter is the primary writer for BYTE_STREAM_SPLIT-encoded FLOAT, DOUBLE, INT32, and INT64 columns. Each value goes through a hot path that performs both an unnecessary allocation and one single-byte virtual dispatch per byte (4 for FLOAT/INT32, 8 for DOUBLE/INT64).

For FloatByteStreamSplitValuesWriter.writeFloat(float v):

super.scatterBytes(BytesUtils.intToBytes(Float.floatToIntBits(v)));

BytesUtils.intToBytes allocates a fresh byte[4] on every call. scatterBytes then loops:

for (int i = 0; i < bytes.length; ++i) {
  this.byteStreams[i].write(bytes[i]);   // CapacityByteArrayOutputStream.write(int)
}

So per value: 1 byte[4] allocation + 4 single-byte virtual dispatches. For a 100k-value FLOAT page that is 100k allocations and 400k single-byte writes. DOUBLE/LONG are even worse (byte[8], 800k single-byte writes).
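For concreteness, the pre-change hot path can be sketched as a self-contained class. This is an illustrative stand-in, not the actual Parquet code: ByteArrayOutputStream substitutes for CapacityByteArrayOutputStream, and the intToBytes body is assumed to match BytesUtils' little-endian behaviour.

```java
import java.io.ByteArrayOutputStream;

// Sketch of the pre-change hot path: one byte[4] allocation and four
// single-byte writes (one virtual dispatch each) per float value.
class ScatterBaselineSketch {
    // Stand-in for BytesUtils.intToBytes: allocates a fresh byte[4] per call.
    static byte[] intToBytes(int v) {
        return new byte[] {(byte) v, (byte) (v >>> 8), (byte) (v >>> 16), (byte) (v >>> 24)};
    }

    // Stand-in for the four per-byte CapacityByteArrayOutputStreams.
    final ByteArrayOutputStream[] byteStreams = {
        new ByteArrayOutputStream(), new ByteArrayOutputStream(),
        new ByteArrayOutputStream(), new ByteArrayOutputStream()
    };

    void writeFloat(float v) {
        byte[] bytes = intToBytes(Float.floatToIntBits(v)); // per-value allocation
        for (int i = 0; i < bytes.length; ++i) {
            byteStreams[i].write(bytes[i]); // one single-byte dispatch per byte
        }
    }
}
```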

What changes are included in this PR?

Two stacked changes in ByteStreamSplitValuesWriter:

  1. Eliminate per-value allocation: replace super.scatterBytes(BytesUtils.intToBytes(v)) with bufferInt(v) / bufferLong(v) helpers that perform the little-endian decomposition directly with bit shifts, so no temporary byte[] is allocated.

  2. Batch single-byte writes: accumulate BATCH_SIZE = 128 values in a small per-instance scratch buffer and flush them as N bulk write(byte[], off, len) calls (one per stream), replacing BATCH_SIZE * elementSizeInBytes single-byte virtual dispatches with elementSizeInBytes bulk writes per flush. The constant was chosen by sweeping 16/32/64/128/256/512/1024 — 128 is the sweet spot for FLOAT throughput while still capturing most of the DOUBLE/LONG gains.

Pending values are included in getBufferedSize() (so page-sizing decisions remain correct) and flushed in getBytes(). reset() and close() clear pending state. Only the four numeric subclasses (Float/Double/Integer/Long) use the batching path; FixedLenByteArrayByteStreamSplitValuesWriter continues to use scatterBytes(byte[]) since its values arrive as already-laid-out byte arrays.
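The two optimizations combined can be sketched as follows for the 4-byte (int/float) case. The class and method names (BatchedScatterSketch, streamBytes) are illustrative, and ByteArrayOutputStream again stands in for the real CapacityByteArrayOutputStream; the real writer also wires this into getBytes()/reset()/close() as described above.

```java
import java.io.ByteArrayOutputStream;

// Sketch: little-endian decomposition via bit shifts (no temporary byte[])
// plus batching, flushed as one bulk write(byte[], off, len) per stream.
public class BatchedScatterSketch {
    private static final int BATCH_SIZE = 128;
    private static final int ELEMENT_SIZE = 4; // int/float

    private final ByteArrayOutputStream[] byteStreams = new ByteArrayOutputStream[ELEMENT_SIZE];
    private final int[] intBatch = new int[BATCH_SIZE]; // per-instance scratch buffer
    private int pending = 0;

    public BatchedScatterSketch() {
        for (int i = 0; i < ELEMENT_SIZE; i++) byteStreams[i] = new ByteArrayOutputStream();
    }

    // Replaces scatterBytes(BytesUtils.intToBytes(v)): no allocation per value.
    void bufferInt(int v) {
        intBatch[pending++] = v;
        if (pending == BATCH_SIZE) flushBatch();
    }

    // One bulk write per stream instead of `pending` single-byte writes per stream.
    void flushBatch() {
        byte[] scratch = new byte[pending];
        for (int stream = 0; stream < ELEMENT_SIZE; stream++) {
            int shift = 8 * stream; // little-endian: stream i receives byte i
            for (int i = 0; i < pending; i++) {
                scratch[i] = (byte) (intBatch[i] >>> shift);
            }
            byteStreams[stream].write(scratch, 0, pending);
        }
        pending = 0;
    }

    // Pending values count toward buffered size so page sizing stays correct.
    long bufferedSize() {
        long size = (long) pending * ELEMENT_SIZE;
        for (ByteArrayOutputStream s : byteStreams) size += s.size();
        return size;
    }

    byte[] streamBytes(int i) {
        return byteStreams[i].toByteArray();
    }
}
```

A batch of 128 ints turns 512 single-byte dispatches into 4 bulk writes, at the cost of one scratch-buffer pass per flush.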

Benchmark

New ByteStreamSplitEncodingBenchmark (100k values per invocation, JDK 18, JMH -wi 5 -i 10 -f 3, 30 samples per row):

  Type     Before (ops/s)   After (ops/s)   Improvement     Alloc B/op
  Float        15.08M           65.06M      +331% (4.3x)    33.27 → 9.27  (-72%)
  Double        6.99M           49.48M      +608% (7.1x)    42.54 → 18.55 (-56%)
  Int          15.64M           68.13M      +335% (4.4x)    33.27 → 9.27  (-72%)
  Long          7.09M           53.23M      +651% (7.5x)    42.54 → 18.55 (-56%)

The remaining per-op allocation (~9 B/op for Int/Float, ~19 B/op for Long/Double) is the BytesInput[] returned by getBytes() and the streams' internal slabs, which are amortised across the page rather than per value.

Are these changes tested?

Yes. All 573 parquet-column tests pass; 51 BSS-specific tests pass (mvn test -pl parquet-column -Dtest='*ByteStreamSplit*'). No new test was added because behaviour is unchanged (covered by the existing round-trip and writer tests).

Are there any user-facing changes?

No. Only an internal writer optimization. No public API, file format, or configuration change.

Closes #3503

Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500.

How to reproduce the benchmarks

The JMH benchmarks cited above are being added to parquet-benchmarks in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar 'ByteStreamSplitEncodingBenchmark' \
    -wi 5 -i 10 -f 3

Compare runs against master (baseline) and this branch (optimized).

…ter writes

The current ByteStreamSplitValuesWriter.writeFloat/writeDouble/writeInteger/
writeLong path allocates a new byte[4] or byte[8] per value via
BytesUtils.intToBytes / BytesUtils.longToBytes, then dispatches one
single-byte CapacityByteArrayOutputStream.write(int) call per byte per
value (4 calls per float/int, 8 per double/long). For a 100k-value page
that is up to 800k single-byte virtual dispatches plus 100k short-lived
byte[] allocations.

This change collapses that hot path in two stacked steps:

1. Eliminate the per-value byte[] allocation by inlining the
   little-endian decomposition with bit shifts into helper methods
   bufferInt(int) / bufferLong(long), instead of going through
   BytesUtils.intToBytes / BytesUtils.longToBytes which allocate
   byte[4] / byte[8] on every call.

2. Batch values into a small per-instance scratch buffer (BATCH_SIZE = 128)
   and flush them as N bulk write(byte[], off, len) calls per stream per
   flush, replacing N * elementSizeInBytes single-byte virtual dispatches
   with elementSizeInBytes bulk writes. The batch is flushed automatically
   when full, on getBytes(), and is included in getBufferedSize() so page
   sizing decisions remain correct. reset() and close() clear the pending
   batch. The constant was selected by sweeping 16/32/64/128/256/512/1024;
   128 maximises FLOAT throughput while still capturing most of the
   DOUBLE/LONG gains.

Only one of intBatch / longBatch is used per writer instance; the four
numeric subclasses (Float/Double/Integer/Long) each call exactly one of
bufferInt / bufferLong via their writeXxx implementations. The
FixedLenByteArrayByteStreamSplitValuesWriter still uses scatterBytes(byte[])
since its values arrive as already-laid-out byte arrays.

Benchmark (new ByteStreamSplitEncodingBenchmark, 100k values per
invocation, JDK 18, JMH -wi 5 -i 10 -f 3, 30 samples per row):

  Type     Before (ops/s)   After (ops/s)    Improvement   Alloc B/op
  Float       15,080,427      65,060,920    +331% (4.31x)  33.27 -> 9.27  (-72%)
  Double       6,994,501      49,475,535    +608% (7.07x)  42.54 -> 18.55 (-56%)
  Int         15,641,334      68,128,560    +335% (4.36x)  33.27 -> 9.27  (-72%)
  Long         7,090,154      53,225,645    +651% (7.51x)  42.54 -> 18.55 (-56%)

The remaining per-op allocation (~9 B/op for Int/Float, ~19 B/op for
Long/Double) is the BytesInput[] returned by getBytes() and the streams'
internal slabs, which are amortised across the page rather than per value.

All 573 parquet-column tests pass.
iemejia added a commit to iemejia/parquet-java that referenced this pull request Apr 19, 2026
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor
configuration and the AppendingTransformer entries for BenchmarkList /
CompilerHints. As a result, the shaded jar built from master fails at
runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds
  jmh-generator-annprocess to maven-compiler-plugin's annotation
  processor paths, and adds AppendingTransformer entries for
  META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.

- Adds 11 JMH benchmarks covering the encode/decode paths used by the
  pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504,
  apache#3506, apache#3510), so reviewers can reproduce the reported numbers and
  detect regressions:

    IntEncodingBenchmark, BinaryEncodingBenchmark,
    ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark,
    FixedLenByteArrayEncodingBenchmark, FileReadBenchmark,
    FileWriteBenchmark, RowGroupFlushBenchmark,
    ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (was 0 from a
working build, or unrunnable at all from a default build).

Development

Successfully merging this pull request may close these issues.

Optimize ByteStreamSplitValuesWriter: remove per-value allocation and batch single-byte writes