
GH-3505: Optimize ByteStreamSplitValuesReader page transposition#3506

Open
iemejia wants to merge 1 commit into apache:master from iemejia:perf-bss-reader-gather

Conversation


@iemejia iemejia commented Apr 19, 2026

Rationale for this change

ByteStreamSplitValuesReader is the symmetric reader for BYTE_STREAM_SPLIT-encoded FLOAT, DOUBLE, INT32, and INT64 columns. On initFromPage it eagerly transposes the entire page from stream-split layout (elementSizeInBytes separate streams of valuesCount bytes each) back to interleaved layout. The current loop is:

private byte[] decodeData(ByteBuffer encoded, int valuesCount) {
  byte[] decoded = new byte[encoded.limit()];
  int destByteIndex = 0;
  for (int srcValueIndex = 0; srcValueIndex < valuesCount; ++srcValueIndex) {
    for (int stream = 0; stream < elementSizeInBytes; ++stream, ++destByteIndex) {
      decoded[destByteIndex] = encoded.get(srcValueIndex + stream * valuesCount);
    }
  }
  return decoded;
}
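For concreteness, here is a tiny reference encoder for the layout being undone; the class and helper name are made up for illustration and are not part of parquet-java:

```java
import java.util.Arrays;

// Illustrative helper (not in parquet-java): produces the stream-split
// layout that decodeData transposes back into interleaved form.
public class StreamSplitDemo {

  static byte[] toStreamSplit(byte[] interleaved, int valuesCount, int width) {
    byte[] out = new byte[interleaved.length];
    for (int v = 0; v < valuesCount; v++) {
      for (int s = 0; s < width; s++) {
        // byte s of value v lands in stream s at position v
        out[s * valuesCount + v] = interleaved[v * width + s];
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Two 4-byte values [0,1,2,3] and [4,5,6,7]:
    byte[] split = toStreamSplit(new byte[] {0, 1, 2, 3, 4, 5, 6, 7}, 2, 4);
    System.out.println(Arrays.toString(split)); // [0, 4, 1, 5, 2, 6, 3, 7]
  }
}
```

Each of the `width` streams holds one byte position of every value, which is why the decoder's inner read address is `srcValueIndex + stream * valuesCount`.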

Two issues on the hot path:

  1. Every read goes through ByteBuffer.get(int) (per-call bounds checks + virtual dispatch through HeapByteBuffer/DirectByteBuffer).
  2. The inner stream offset (stream * valuesCount) is recomputed on every iteration even though it depends only on the outer loop.

For a 100k-value FLOAT page, that is 400k ByteBuffer.get(int) calls; for DOUBLE/LONG it is 800k.

What changes are included in this PR?

Rewrite decodeData in three steps:

  1. Drop down to a byte[] view of the encoded buffer. When encoded.hasArray() is true (the typical case), use the backing array directly with the correct base offset; otherwise copy once with a single get(byte[]) call. Eliminates the per-byte ByteBuffer.get(int) bounds check and virtual dispatch.

  2. Specialize loops for the common element sizes (4 and 8). Hoist all stream * valuesCount offsets into local ints (s0..s3 for floats/ints, s0..s7 for doubles/longs) and write each output slot exactly once in a single sequential pass. Reads come from elementSizeInBytes concurrent sequential streams, which modern hardware prefetchers handle well.

  3. Generic fallback for arbitrary element sizes (FIXED_LEN_BYTE_ARRAY of any width).
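A compact sketch of the three steps above, written as a standalone static method rather than the actual instance method on ByteStreamSplitValuesReader (the 8-byte specialization is elided and the real patch may differ in detail):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Standalone sketch of the rewritten decodeData described above; the real
// change is an instance method on ByteStreamSplitValuesReader.
public class BssDecodeSketch {

  static byte[] decodeData(ByteBuffer encoded, int valuesCount, int elementSizeInBytes) {
    final int totalBytes = valuesCount * elementSizeInBytes;
    final byte[] src;
    final int base;
    if (encoded.hasArray()) {
      // Step 1, fast path: read straight from the backing array.
      src = encoded.array();
      base = encoded.arrayOffset() + encoded.position();
    } else {
      // Non-array buffer: one bulk copy instead of a get(int) per byte.
      src = new byte[totalBytes];
      encoded.duplicate().get(src);
      base = 0;
    }
    byte[] decoded = new byte[totalBytes];
    if (elementSizeInBytes == 4) {
      // Step 2: hoist the per-stream offsets; one sequential write pass.
      final int s0 = base;
      final int s1 = base + valuesCount;
      final int s2 = base + 2 * valuesCount;
      final int s3 = base + 3 * valuesCount;
      for (int i = 0, d = 0; i < valuesCount; i++) {
        decoded[d++] = src[s0 + i];
        decoded[d++] = src[s1 + i];
        decoded[d++] = src[s2 + i];
        decoded[d++] = src[s3 + i];
      }
    } else {
      // Step 3: generic fallback for arbitrary element widths
      // (an 8-byte specialization analogous to the 4-byte one is elided).
      for (int i = 0, d = 0; i < valuesCount; i++) {
        for (int s = 0; s < elementSizeInBytes; s++) {
          decoded[d++] = src[base + s * valuesCount + i];
        }
      }
    }
    return decoded;
  }

  public static void main(String[] args) {
    // 2 float-width values in stream-split order: byte k of every value together.
    byte[] enc = {0, 4, 1, 5, 2, 6, 3, 7};
    byte[] dec = decodeData(ByteBuffer.wrap(enc), 2, 4);
    System.out.println(Arrays.toString(dec)); // interleaved: [0, 1, 2, 3, 4, 5, 6, 7]
  }
}
```

The specialized branch reads from four sequential cursors and writes one sequential output, which is the access pattern the PR relies on the hardware prefetcher to handle well.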

Benchmark

New ByteStreamSplitDecodingBenchmark (100k values per invocation, JDK 18, JMH -wi 5 -i 10 -f 3, 30 samples per row):

  Type     Before (ops/s)   After (ops/s)   Δ
  Float    47.80M           162.29M         +240% (3.40x)
  Double   26.32M           66.00M          +151% (2.51x)
  Int      47.07M           162.18M         +245% (3.45x)
  Long     26.80M           66.00M          +146% (2.46x)
Decoded output is byte-identical to before; per-op heap allocation is unchanged.

Are these changes tested?

Yes. All 573 parquet-column tests pass; 51 BSS-specific tests pass (mvn test -pl parquet-column -Dtest='*ByteStreamSplit*'). No new test was added because the decoded bytes are unchanged (covered by existing round-trip and ByteStreamSplitValuesReaderTest tests).

Are there any user-facing changes?

No. Only an internal reader optimization. No public API, file format, or configuration change.

Closes #3505

Symmetric companion to #3504 (writer-side BSS optimization). Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500, #3504.

How to reproduce the benchmarks

The JMH benchmarks cited above are being added to parquet-benchmarks in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar 'ByteStreamSplitDecodingBenchmark' \
    -wi 5 -i 10 -f 3

Compare runs against master (baseline) and this branch (optimized).

ByteStreamSplitValuesReader.decodeData eagerly transposes an entire page
from stream-split layout (elementSizeInBytes streams of valuesCount bytes
each) back to interleaved layout (valuesCount elements of elementSizeInBytes
bytes each). The current loop performs one ByteBuffer.get(int) per byte,
which incurs per-call bounds checks and virtual dispatch through
HeapByteBuffer/DirectByteBuffer for every single byte of the page. For a
100k-value FLOAT page that is 400k get(int) calls; for DOUBLE/LONG it is
800k.

This change rewrites decodeData in three steps:

1. Drop down to a byte[] view of the encoded buffer. When encoded.hasArray()
   is true (the typical case) use the backing array directly with the
   correct base offset; otherwise copy once with a single get(byte[]) call.
   This eliminates the per-byte ByteBuffer.get(int) bounds check and
   virtual dispatch.

2. Specialize loops for the common element sizes (4 and 8). Hoist all
   stream * valuesCount offsets out of the inner loop into local ints
   (s0..s3 for floats/ints, s0..s7 for doubles/longs), and write each
   output slot exactly once in a single sequential pass. Reads come from
   elementSizeInBytes concurrent sequential streams which modern hardware
   prefetchers handle well.

3. Generic fallback for arbitrary element sizes (FIXED_LEN_BYTE_ARRAY of
   any width) keeps the existing behaviour.

Benchmark (new ByteStreamSplitDecodingBenchmark, 100k values per
invocation, JDK 18, JMH -wi 5 -i 10 -f 3, 30 samples per row):

  Type     Before (ops/s)   After (ops/s)    Improvement
  Float       47,798,981     162,294,904    +240% (3.40x)
  Double      26,320,043      66,002,524    +151% (2.51x)
  Int         47,072,832     162,177,747    +245% (3.45x)
  Long        26,795,544      65,999,343    +146% (2.46x)

Decoded output is byte-identical to before; per-op heap allocation is
unchanged (the only allocation is the per-page decode buffer plus the
boxing of returned primitives by the benchmark).

All 573 parquet-column tests pass; 51 BSS-specific tests pass.
@iemejia iemejia force-pushed the perf-bss-reader-gather branch from fa627d5 to 88a3b0e on April 19, 2026 at 19:15
iemejia added a commit to iemejia/parquet-java that referenced this pull request Apr 19, 2026
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor
configuration and the AppendingTransformer entries for BenchmarkList /
CompilerHints. As a result, the shaded jar built from master fails at
runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds
  jmh-generator-annprocess to maven-compiler-plugin's annotation
  processor paths, and adds AppendingTransformer entries for
  META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.

- Adds 11 JMH benchmarks covering the encode/decode paths used by the
  pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504,
  apache#3506, apache#3510), so reviewers can reproduce the reported numbers and
  detect regressions:

    IntEncodingBenchmark, BinaryEncodingBenchmark,
    ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark,
    FixedLenByteArrayEncodingBenchmark, FileReadBenchmark,
    FileWriteBenchmark, RowGroupFlushBenchmark,
    ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (was 0 from a
working build, or unrunnable at all from a default build).