Optimize ByteStreamSplitValuesWriter: remove per-value allocation and batch single-byte writes #3503

@iemejia

Describe the enhancement requested

ByteStreamSplitValuesWriter is the primary writer for BYTE_STREAM_SPLIT-encoded FLOAT, DOUBLE, INT32, and INT64 columns. Each value goes through a hot path that performs both an unnecessary allocation and N single-byte virtual dispatches, where N is the element size in bytes (4 for FLOAT/INT32, 8 for DOUBLE/INT64).

For FloatByteStreamSplitValuesWriter.writeFloat(float v):

super.scatterBytes(BytesUtils.intToBytes(Float.floatToIntBits(v)));

BytesUtils.intToBytes allocates a fresh byte[4] on every call. scatterBytes then loops:

for (int i = 0; i < bytes.length; ++i) {
  this.byteStreams[i].write(bytes[i]);   // CapacityByteArrayOutputStream.write(int)
}

That is, per value: 1 byte[4] allocation + 4 single-byte virtual dispatches. For a 100k-value FLOAT page that is 100k allocations and 400k single-byte writes. DOUBLE/LONG are even worse (byte[8], 800k single-byte writes).
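The cost model above can be reproduced with a small stand-alone sketch. It is not the Parquet code: java.io.ByteArrayOutputStream stands in for CapacityByteArrayOutputStream, the local intToBytes helper stands in for BytesUtils.intToBytes, and the class name and the two counters are hypothetical additions for illustration.

```java
import java.io.ByteArrayOutputStream;

// Stand-alone model of the per-value path described above (assumed names, not the real class).
final class ScatterPathSketch {
  private final ByteArrayOutputStream[] byteStreams; // stand-in for CapacityByteArrayOutputStream
  long allocations = 0;      // temporary byte[] arrays created
  long singleByteWrites = 0; // write(int) calls issued

  ScatterPathSketch(int elementSizeInBytes) {
    byteStreams = new ByteArrayOutputStream[elementSizeInBytes];
    for (int i = 0; i < elementSizeInBytes; i++) {
      byteStreams[i] = new ByteArrayOutputStream();
    }
  }

  // Little-endian decomposition into a fresh array: this is the per-value allocation.
  private byte[] intToBytes(int v) {
    allocations++;
    return new byte[] {(byte) v, (byte) (v >>> 8), (byte) (v >>> 16), (byte) (v >>> 24)};
  }

  // One single-byte write per stream, as in the loop quoted above.
  private void scatterBytes(byte[] bytes) {
    for (int i = 0; i < bytes.length; ++i) {
      byteStreams[i].write(bytes[i]);
      singleByteWrites++;
    }
  }

  void writeFloat(float v) {
    scatterBytes(intToBytes(Float.floatToIntBits(v)));
  }

  byte[] stream(int i) {
    return byteStreams[i].toByteArray();
  }

  public static void main(String[] args) {
    ScatterPathSketch w = new ScatterPathSketch(4);
    for (int i = 0; i < 100_000; i++) {
      w.writeFloat(i * 0.5f);
    }
    // prints "100000 allocations, 400000 single-byte writes"
    System.out.println(w.allocations + " allocations, " + w.singleByteWrites + " single-byte writes");
  }
}
```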

JMH (new ByteStreamSplitEncodingBenchmark, 100k values per invocation, JDK 18, -wi 5 -i 10 -f 3, 30 samples) on master:

Type     ops/s     gc.alloc.rate.norm
Float    15.08M    33.27 B/op
Double    6.99M    42.54 B/op
Int      15.64M    33.27 B/op
Long      7.09M    42.54 B/op

The B/op figure for Float/Int (33 B) is mostly the per-value byte[4] allocation.

Proposal

Two stacked changes in ByteStreamSplitValuesWriter:

  1. Eliminate per-value allocation: replace super.scatterBytes(BytesUtils.intToBytes(v)) with bufferInt(v) / bufferLong(v) that perform the little-endian decomposition with bit shifts directly, no temporary byte[].

  2. Batch single-byte writes: accumulate BATCH_SIZE = 128 values in a small per-instance scratch buffer and flush them with one bulk write(byte[], off, len) call per stream. Each flush then issues elementSizeInBytes bulk writes in place of BATCH_SIZE * elementSizeInBytes single-byte virtual dispatches. The constant was chosen by sweeping 16/32/64/128/256/512/1024; 128 is the sweet spot for FLOAT throughput while still capturing most of the DOUBLE/LONG gains.
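The two changes can be sketched together as follows. This is a hedged illustration and not the actual patch: only the names bufferInt, bufferLong and BATCH_SIZE come from the proposal. The class name, the one-scratch-row-per-stream layout, the bulkWrites counter and the use of java.io.ByteArrayOutputStream as a stand-in stream are assumptions.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of allocation-free decomposition plus batched flushing.
final class BatchedSplitSketch {
  static final int BATCH_SIZE = 128; // value taken from the proposal

  private final int elementSizeInBytes;
  private final ByteArrayOutputStream[] byteStreams; // stand-in for the real streams
  private final byte[][] scratch; // assumed layout: scratch[s] holds byte s of each pending value
  private int pending = 0;
  long bulkWrites = 0; // illustration only: counts write(byte[], off, len) calls

  BatchedSplitSketch(int elementSizeInBytes) {
    this.elementSizeInBytes = elementSizeInBytes;
    this.byteStreams = new ByteArrayOutputStream[elementSizeInBytes];
    this.scratch = new byte[elementSizeInBytes][BATCH_SIZE];
    for (int s = 0; s < elementSizeInBytes; s++) {
      byteStreams[s] = new ByteArrayOutputStream();
    }
  }

  // Little-endian decomposition with shifts, no temporary byte[]. Expects elementSizeInBytes == 4.
  void bufferInt(int v) {
    scratch[0][pending] = (byte) v;
    scratch[1][pending] = (byte) (v >>> 8);
    scratch[2][pending] = (byte) (v >>> 16);
    scratch[3][pending] = (byte) (v >>> 24);
    if (++pending == BATCH_SIZE) {
      flush();
    }
  }

  // Same idea for 8-byte values. Expects elementSizeInBytes == 8.
  void bufferLong(long v) {
    for (int s = 0; s < 8; s++) {
      scratch[s][pending] = (byte) (v >>> (8 * s));
    }
    if (++pending == BATCH_SIZE) {
      flush();
    }
  }

  // One bulk write per stream instead of one single-byte write per value per stream.
  void flush() {
    if (pending == 0) {
      return;
    }
    for (int s = 0; s < elementSizeInBytes; s++) {
      byteStreams[s].write(scratch[s], 0, pending);
      bulkWrites++;
    }
    pending = 0;
  }

  // Flushes pending values, then concatenates the streams in order.
  byte[] getBytes() {
    flush();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (ByteArrayOutputStream s : byteStreams) {
      out.write(s.toByteArray(), 0, s.size());
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    BatchedSplitSketch w = new BatchedSplitSketch(4);
    for (int i = 0; i < 100_000; i++) {
      w.bufferInt(Float.floatToIntBits(i * 0.5f));
    }
    byte[] page = w.getBytes();
    System.out.println(page.length + " bytes, " + w.bulkWrites + " bulk writes");
  }
}
```

Under these assumptions, 100k FLOAT values need 782 flushes (781 full batches plus one partial), so 3,128 bulk writes replace the 400k single-byte writes of the current path.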

Pending values are included in getBufferedSize() (so page-sizing decisions remain correct) and flushed in getBytes(). reset() and close() clear pending state. Only the four numeric subclasses use the batching path; FixedLenByteArrayByteStreamSplitValuesWriter continues to use scatterBytes(byte[]) since its values arrive as already-laid-out byte arrays.
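The size accounting can be illustrated in isolation. The sketch below tracks only byte counts, not the bytes themselves; the class name, the bufferValue and getBytesLength methods, and the flushedBytes field are hypothetical, and only getBufferedSize, reset and BATCH_SIZE mirror names used in this issue.

```java
// Hypothetical accounting-only model: shows why pending values must count toward the buffered size.
final class PendingAccountingSketch {
  static final int BATCH_SIZE = 128; // value taken from the proposal

  private final int elementSizeInBytes;
  private long flushedBytes = 0; // bytes already handed to the per-byte streams
  private int pending = 0;       // values still sitting in the scratch buffer

  PendingAccountingSketch(int elementSizeInBytes) {
    this.elementSizeInBytes = elementSizeInBytes;
  }

  // Models buffering one value; a full batch is moved to the streams.
  void bufferValue() {
    if (++pending == BATCH_SIZE) {
      flush();
    }
  }

  private void flush() {
    flushedBytes += (long) pending * elementSizeInBytes;
    pending = 0;
  }

  // Counts flushed and pending bytes, so a page-size check sees every written value.
  long getBufferedSize() {
    return flushedBytes + (long) pending * elementSizeInBytes;
  }

  // Models getBytes(): pending values are flushed before the page is produced.
  long getBytesLength() {
    flush();
    return flushedBytes;
  }

  // Clears both flushed and pending state.
  void reset() {
    flushedBytes = 0;
    pending = 0;
  }

  public static void main(String[] args) {
    PendingAccountingSketch w = new PendingAccountingSketch(4);
    for (int i = 0; i < 130; i++) {
      w.bufferValue();
    }
    // 128 values flushed (512 bytes) plus 2 pending (8 bytes): prints 520
    System.out.println(w.getBufferedSize());
  }
}
```

If getBufferedSize() ignored the pending values in this model, it would under-report by up to (BATCH_SIZE - 1) * elementSizeInBytes bytes, which is the drift the proposal avoids.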

Expected speedup (same JMH config):

Type     Before (ops/s)   After (ops/s)   Δ               Alloc B/op
Float    15.08M           65.06M          +331% (4.3x)    33.27 → 9.27 (-72%)
Double    6.99M           49.48M          +608% (7.1x)    42.54 → 18.55 (-56%)
Int      15.64M           68.13M          +335% (4.4x)    33.27 → 9.27 (-72%)
Long      7.09M           53.23M          +651% (7.5x)    42.54 → 18.55 (-56%)

Scope

  • Single-file change to parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesWriter.java.
  • No public-API change; bufferInt/bufferLong are package-internal helpers; existing public methods preserve their contracts.
  • All 573 parquet-column tests pass; 51 BSS-specific tests pass.

Relation

Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494 (PlainValuesReader), #3496 (PlainValuesWriter), #3500 (Binary.hashCode cache).
