GH-3503: Optimize ByteStreamSplitValuesWriter with batched scatter writes#3504

Open
iemejia wants to merge 1 commit into apache:master from iemejia:perf-bss-writer-batch

Conversation


@iemejia iemejia commented Apr 19, 2026

Rationale for this change

ByteStreamSplitValuesWriter is the primary writer for BYTE_STREAM_SPLIT-encoded FLOAT, DOUBLE, INT32, and INT64 columns. Each value goes through a hot path that performs both an unnecessary allocation and one single-byte virtual dispatch per byte (4 for FLOAT/INT32, 8 for DOUBLE/INT64).

For FloatByteStreamSplitValuesWriter.writeFloat(float v):

super.scatterBytes(BytesUtils.intToBytes(Float.floatToIntBits(v)));

BytesUtils.intToBytes allocates a fresh byte[4] on every call. scatterBytes then loops:

for (int i = 0; i < bytes.length; ++i) {
  this.byteStreams[i].write(bytes[i]);   // CapacityByteArrayOutputStream.write(int)
}

So per value: 1 byte[4] allocation + 4 single-byte virtual dispatches. For a 100k-value FLOAT page that is 100k allocations and 400k single-byte writes. DOUBLE/LONG are even worse (byte[8], 800k single-byte writes).
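For concreteness, the pre-change hot path can be sketched as a self-contained class. This is an illustrative stand-in, not the actual Parquet code: ByteArrayOutputStream substitutes for CapacityByteArrayOutputStream, and the intToBytes body is assumed to match BytesUtils' little-endian behaviour.

```java
import java.io.ByteArrayOutputStream;

// Sketch of the pre-change hot path: one byte[4] allocation and four
// single-byte writes (one virtual dispatch each) per float value.
class ScatterBaselineSketch {
    // Stand-in for BytesUtils.intToBytes: allocates a fresh byte[4] per call.
    static byte[] intToBytes(int v) {
        return new byte[] {(byte) v, (byte) (v >>> 8), (byte) (v >>> 16), (byte) (v >>> 24)};
    }

    // Stand-in for the four per-byte CapacityByteArrayOutputStreams.
    final ByteArrayOutputStream[] byteStreams = {
        new ByteArrayOutputStream(), new ByteArrayOutputStream(),
        new ByteArrayOutputStream(), new ByteArrayOutputStream()
    };

    void writeFloat(float v) {
        byte[] bytes = intToBytes(Float.floatToIntBits(v)); // per-value allocation
        for (int i = 0; i < bytes.length; ++i) {
            byteStreams[i].write(bytes[i]); // one single-byte dispatch per byte
        }
    }
}
```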

What changes are included in this PR?

Two stacked changes in ByteStreamSplitValuesWriter:

  1. Eliminate per-value allocation: replace super.scatterBytes(BytesUtils.intToBytes(v)) with bufferInt(v) / bufferLong(v) helpers that perform the little-endian decomposition directly with bit shifts, so no temporary byte[] is allocated.

  2. Batch single-byte writes: accumulate BATCH_SIZE = 128 values in a small per-instance scratch buffer and flush them as N bulk write(byte[], off, len) calls (one per stream), replacing BATCH_SIZE * elementSizeInBytes single-byte virtual dispatches with elementSizeInBytes bulk writes per flush. The constant was chosen by sweeping 16/32/64/128/256/512/1024 — 128 is the sweet spot for FLOAT throughput while still capturing most of the DOUBLE/LONG gains.

Pending values are included in getBufferedSize() (so page-sizing decisions remain correct) and flushed in getBytes(). reset() and close() clear pending state. Only the four numeric subclasses (Float/Double/Integer/Long) use the batching path; FixedLenByteArrayByteStreamSplitValuesWriter continues to use scatterBytes(byte[]) since its values arrive as already-laid-out byte arrays.
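The two optimizations combined can be sketched as follows for the 4-byte (int/float) case. The class and method names (BatchedScatterSketch, streamBytes) are illustrative, and ByteArrayOutputStream again stands in for the real CapacityByteArrayOutputStream; the real writer also wires this into getBytes()/reset()/close() as described above.

```java
import java.io.ByteArrayOutputStream;

// Sketch: little-endian decomposition via bit shifts (no temporary byte[])
// plus batching, flushed as one bulk write(byte[], off, len) per stream.
public class BatchedScatterSketch {
    private static final int BATCH_SIZE = 128;
    private static final int ELEMENT_SIZE = 4; // int/float

    private final ByteArrayOutputStream[] byteStreams = new ByteArrayOutputStream[ELEMENT_SIZE];
    private final int[] intBatch = new int[BATCH_SIZE]; // per-instance scratch buffer
    private int pending = 0;

    public BatchedScatterSketch() {
        for (int i = 0; i < ELEMENT_SIZE; i++) byteStreams[i] = new ByteArrayOutputStream();
    }

    // Replaces scatterBytes(BytesUtils.intToBytes(v)): no allocation per value.
    void bufferInt(int v) {
        intBatch[pending++] = v;
        if (pending == BATCH_SIZE) flushBatch();
    }

    // One bulk write per stream instead of `pending` single-byte writes per stream.
    void flushBatch() {
        byte[] scratch = new byte[pending];
        for (int stream = 0; stream < ELEMENT_SIZE; stream++) {
            int shift = 8 * stream; // little-endian: stream i receives byte i
            for (int i = 0; i < pending; i++) {
                scratch[i] = (byte) (intBatch[i] >>> shift);
            }
            byteStreams[stream].write(scratch, 0, pending);
        }
        pending = 0;
    }

    // Pending values count toward buffered size so page sizing stays correct.
    long bufferedSize() {
        long size = (long) pending * ELEMENT_SIZE;
        for (ByteArrayOutputStream s : byteStreams) size += s.size();
        return size;
    }

    byte[] streamBytes(int i) {
        return byteStreams[i].toByteArray();
    }
}
```

A batch of 128 ints turns 512 single-byte dispatches into 4 bulk writes, at the cost of one scratch-buffer pass per flush.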

Benchmark

New ByteStreamSplitEncodingBenchmark (100k values per invocation, JDK 18, JMH -wi 5 -i 10 -f 3, 30 samples per row):

  Type     Before (ops/s)   After (ops/s)   Improvement     Alloc B/op
  Float        15.08M           65.06M      +331% (4.3x)    33.27 → 9.27  (-72%)
  Double        6.99M           49.48M      +608% (7.1x)    42.54 → 18.55 (-56%)
  Int          15.64M           68.13M      +335% (4.4x)    33.27 → 9.27  (-72%)
  Long          7.09M           53.23M      +651% (7.5x)    42.54 → 18.55 (-56%)

The remaining per-op allocation (~9 B/op for Int/Float, ~19 B/op for Long/Double) is the BytesInput[] returned by getBytes() and the streams' internal slabs, which are amortised across the page rather than per value.

Are these changes tested?

Yes. All 573 parquet-column tests pass; 51 BSS-specific tests pass (mvn test -pl parquet-column -Dtest='*ByteStreamSplit*'). No new test was added because behaviour is unchanged (covered by the existing round-trip and writer tests).

Are there any user-facing changes?

No. Only an internal writer optimization. No public API, file format, or configuration change.

Closes #3503

Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500.

How to reproduce the benchmarks

The JMH benchmarks cited above are being added to parquet-benchmarks in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar 'ByteStreamSplitEncodingBenchmark' \
    -wi 5 -i 10 -f 3

Compare runs against master (baseline) and this branch (optimized).

…ter writes

The current ByteStreamSplitValuesWriter.writeFloat/writeDouble/writeInteger/
writeLong path allocates a new byte[4] or byte[8] per value via
BytesUtils.intToBytes / BytesUtils.longToBytes, then dispatches one
single-byte CapacityByteArrayOutputStream.write(int) call per byte per
value (4 calls per float/int, 8 per double/long). For a 100k-value page
that is up to 800k single-byte virtual dispatches plus 100k short-lived
byte[] allocations.

This change collapses that hot path in two stacked steps:

1. Eliminate the per-value byte[] allocation by inlining the
   little-endian decomposition with bit shifts into helper methods
   bufferInt(int) / bufferLong(long), instead of going through
   BytesUtils.intToBytes / BytesUtils.longToBytes which allocate
   byte[4] / byte[8] on every call.

2. Batch values into a small per-instance scratch buffer (BATCH_SIZE = 128)
   and flush them as N bulk write(byte[], off, len) calls per stream per
   flush, replacing N * elementSizeInBytes single-byte virtual dispatches
   with elementSizeInBytes bulk writes. The batch is flushed automatically
   when full, on getBytes(), and is included in getBufferedSize() so page
   sizing decisions remain correct. reset() and close() clear the pending
   batch. The constant was selected by sweeping 16/32/64/128/256/512/1024;
   128 maximises FLOAT throughput while still capturing most of the
   DOUBLE/LONG gains.

Only one of intBatch / longBatch is used per writer instance; the four
numeric subclasses (Float/Double/Integer/Long) each call exactly one of
bufferInt / bufferLong via their writeXxx implementations. The
FixedLenByteArrayByteStreamSplitValuesWriter still uses scatterBytes(byte[])
since its values arrive as already-laid-out byte arrays.

Benchmark (new ByteStreamSplitEncodingBenchmark, 100k values per
invocation, JDK 18, JMH -wi 5 -i 10 -f 3, 30 samples per row):

  Type     Before (ops/s)   After (ops/s)    Improvement   Alloc B/op
  Float       15,080,427      65,060,920    +331% (4.31x)  33.27 -> 9.27  (-72%)
  Double       6,994,501      49,475,535    +608% (7.07x)  42.54 -> 18.55 (-56%)
  Int         15,641,334      68,128,560    +335% (4.36x)  33.27 -> 9.27  (-72%)
  Long         7,090,154      53,225,645    +651% (7.51x)  42.54 -> 18.55 (-56%)

The remaining per-op allocation (~9 B/op for Int/Float, ~19 B/op for
Long/Double) is the BytesInput[] returned by getBytes() and the streams'
internal slabs, which are amortised across the page rather than per value.

All 573 parquet-column tests pass.
iemejia added a commit to iemejia/parquet-java that referenced this pull request Apr 19, 2026
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor
configuration and the AppendingTransformer entries for BenchmarkList /
CompilerHints. As a result, the shaded jar built from master fails at
runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds
  jmh-generator-annprocess to maven-compiler-plugin's annotation
  processor paths, and adds AppendingTransformer entries for
  META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.

- Adds 11 JMH benchmarks covering the encode/decode paths used by the
  pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504,
  apache#3506, apache#3510), so reviewers can reproduce the reported numbers and
  detect regressions:

    IntEncodingBenchmark, BinaryEncodingBenchmark,
    ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark,
    FixedLenByteArrayEncodingBenchmark, FileReadBenchmark,
    FileWriteBenchmark, RowGroupFlushBenchmark,
    ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (was 0 from a
working build, or unrunnable at all from a default build).

Development

Successfully merging this pull request may close these issues.

Optimize ByteStreamSplitValuesWriter: remove per-value allocation and batch single-byte writes