
GH-3505: Optimize ByteStreamSplitValuesReader page transposition#3506

Open
iemejia wants to merge 1 commit into apache:master from iemejia:perf-bss-reader-gather

Conversation


@iemejia iemejia commented Apr 19, 2026

Rationale for this change

ByteStreamSplitValuesReader is the symmetric reader for BYTE_STREAM_SPLIT-encoded FLOAT, DOUBLE, INT32, and INT64 columns. On initFromPage it eagerly transposes the entire page from stream-split layout (elementSizeInBytes separate streams of valuesCount bytes each) back to interleaved layout. The current loop is:

private byte[] decodeData(ByteBuffer encoded, int valuesCount) {
  byte[] decoded = new byte[encoded.limit()];
  int destByteIndex = 0;
  for (int srcValueIndex = 0; srcValueIndex < valuesCount; ++srcValueIndex) {
    for (int stream = 0; stream < elementSizeInBytes; ++stream, ++destByteIndex) {
      decoded[destByteIndex] = encoded.get(srcValueIndex + stream * valuesCount);
    }
  }
  return decoded;
}
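For concreteness, here is a tiny reference encoder for the layout being undone; the class and helper name are made up for illustration and are not part of parquet-java:

```java
import java.util.Arrays;

// Illustrative helper (not in parquet-java): produces the stream-split
// layout that decodeData transposes back into interleaved form.
public class StreamSplitDemo {

  static byte[] toStreamSplit(byte[] interleaved, int valuesCount, int width) {
    byte[] out = new byte[interleaved.length];
    for (int v = 0; v < valuesCount; v++) {
      for (int s = 0; s < width; s++) {
        // byte s of value v lands in stream s at position v
        out[s * valuesCount + v] = interleaved[v * width + s];
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Two 4-byte values [0,1,2,3] and [4,5,6,7]:
    byte[] split = toStreamSplit(new byte[] {0, 1, 2, 3, 4, 5, 6, 7}, 2, 4);
    System.out.println(Arrays.toString(split)); // [0, 4, 1, 5, 2, 6, 3, 7]
  }
}
```

Each of the `width` streams holds one byte position of every value, which is why the decoder's inner read address is `srcValueIndex + stream * valuesCount`.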

Two issues on the hot path:

  1. Every read goes through ByteBuffer.get(int) (per-call bounds checks + virtual dispatch through HeapByteBuffer/DirectByteBuffer).
  2. The inner stream offset (stream * valuesCount) is recomputed on every iteration even though it depends only on the outer loop.

For a 100k-value FLOAT page, that is 400k ByteBuffer.get(int) calls; for DOUBLE/LONG it is 800k.

What changes are included in this PR?

Rewrite decodeData in three steps:

  1. Drop down to a byte[] view of the encoded buffer. When encoded.hasArray() is true (the typical case), use the backing array directly with the correct base offset; otherwise copy once with a single get(byte[]) call. Eliminates the per-byte ByteBuffer.get(int) bounds check and virtual dispatch.

  2. Specialize loops for the common element sizes (4 and 8). Hoist all stream * valuesCount offsets into local ints (s0..s3 for floats/ints, s0..s7 for doubles/longs) and write each output slot exactly once in a single sequential pass. Reads come from elementSizeInBytes concurrent sequential streams, which modern hardware prefetchers handle well.

  3. Generic fallback for arbitrary element sizes (FIXED_LEN_BYTE_ARRAY of any width).
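A compact sketch of the three steps above, written as a standalone static method rather than the actual instance method on ByteStreamSplitValuesReader (the 8-byte specialization is elided and the real patch may differ in detail):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Standalone sketch of the rewritten decodeData described above; the real
// change is an instance method on ByteStreamSplitValuesReader.
public class BssDecodeSketch {

  static byte[] decodeData(ByteBuffer encoded, int valuesCount, int elementSizeInBytes) {
    final int totalBytes = valuesCount * elementSizeInBytes;
    final byte[] src;
    final int base;
    if (encoded.hasArray()) {
      // Step 1, fast path: read straight from the backing array.
      src = encoded.array();
      base = encoded.arrayOffset() + encoded.position();
    } else {
      // Non-array buffer: one bulk copy instead of a get(int) per byte.
      src = new byte[totalBytes];
      encoded.duplicate().get(src);
      base = 0;
    }
    byte[] decoded = new byte[totalBytes];
    if (elementSizeInBytes == 4) {
      // Step 2: hoist the per-stream offsets; one sequential write pass.
      final int s0 = base;
      final int s1 = base + valuesCount;
      final int s2 = base + 2 * valuesCount;
      final int s3 = base + 3 * valuesCount;
      for (int i = 0, d = 0; i < valuesCount; i++) {
        decoded[d++] = src[s0 + i];
        decoded[d++] = src[s1 + i];
        decoded[d++] = src[s2 + i];
        decoded[d++] = src[s3 + i];
      }
    } else {
      // Step 3: generic fallback for arbitrary element widths
      // (an 8-byte specialization analogous to the 4-byte one is elided).
      for (int i = 0, d = 0; i < valuesCount; i++) {
        for (int s = 0; s < elementSizeInBytes; s++) {
          decoded[d++] = src[base + s * valuesCount + i];
        }
      }
    }
    return decoded;
  }

  public static void main(String[] args) {
    // 2 float-width values in stream-split order: byte k of every value together.
    byte[] enc = {0, 4, 1, 5, 2, 6, 3, 7};
    byte[] dec = decodeData(ByteBuffer.wrap(enc), 2, 4);
    System.out.println(Arrays.toString(dec)); // interleaved: [0, 1, 2, 3, 4, 5, 6, 7]
  }
}
```

The specialized branch reads from four sequential cursors and writes one sequential output, which is the access pattern the PR relies on the hardware prefetcher to handle well.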

Benchmark

New ByteStreamSplitDecodingBenchmark (100k values per invocation, JDK 18, JMH -wi 5 -i 10 -f 3, 30 samples per row):

  Type     Before (ops/s)   After (ops/s)   Δ
  Float    47.80M           162.29M         +240% (3.40x)
  Double   26.32M           66.00M          +151% (2.51x)
  Int      47.07M           162.18M         +245% (3.45x)
  Long     26.80M           66.00M          +146% (2.46x)
Decoded output is byte-identical to before; per-op heap allocation is unchanged.

Are these changes tested?

Yes. All 573 parquet-column tests pass; 51 BSS-specific tests pass (mvn test -pl parquet-column -Dtest='*ByteStreamSplit*'). No new test was added because the decoded bytes are unchanged (covered by existing round-trip and ByteStreamSplitValuesReaderTest tests).

Are there any user-facing changes?

No. Only an internal reader optimization. No public API, file format, or configuration change.

Closes #3505

Symmetric companion to #3504 (writer-side BSS optimization). Part of a small series of focused performance PRs from work in parquet-perf. Previous: #3494, #3496, #3500, #3504.

How to reproduce the benchmarks

The JMH benchmarks cited above are being added to parquet-benchmarks in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar 'ByteStreamSplitDecodingBenchmark' \
    -wi 5 -i 10 -f 3

Compare runs against master (baseline) and this branch (optimized).

ByteStreamSplitValuesReader.decodeData eagerly transposes an entire page
from stream-split layout (elementSizeInBytes streams of valuesCount bytes
each) back to interleaved layout (valuesCount elements of elementSizeInBytes
bytes each). The current loop performs one ByteBuffer.get(int) per byte,
which incurs per-call bounds checks and virtual dispatch through
HeapByteBuffer/DirectByteBuffer for every single byte of the page. For a
100k-value FLOAT page that is 400k get(int) calls; for DOUBLE/LONG it is
800k.

This change rewrites decodeData in three steps:

1. Drop down to a byte[] view of the encoded buffer. When encoded.hasArray()
   is true (the typical case) use the backing array directly with the
   correct base offset; otherwise copy once with a single get(byte[]) call.
   This eliminates the per-byte ByteBuffer.get(int) bounds check and
   virtual dispatch.

2. Specialize loops for the common element sizes (4 and 8). Hoist all
   stream * valuesCount offsets out of the inner loop into local ints
   (s0..s3 for floats/ints, s0..s7 for doubles/longs), and write each
   output slot exactly once in a single sequential pass. Reads come from
   elementSizeInBytes concurrent sequential streams which modern hardware
   prefetchers handle well.

3. Generic fallback for arbitrary element sizes (FIXED_LEN_BYTE_ARRAY of
   any width) keeps the existing behaviour.

Benchmark (new ByteStreamSplitDecodingBenchmark, 100k values per
invocation, JDK 18, JMH -wi 5 -i 10 -f 3, 30 samples per row):

  Type     Before (ops/s)   After (ops/s)    Improvement
  Float       47,798,981     162,294,904    +240% (3.40x)
  Double      26,320,043      66,002,524    +151% (2.51x)
  Int         47,072,832     162,177,747    +245% (3.45x)
  Long        26,795,544      65,999,343    +146% (2.46x)

Decoded output is byte-identical to before; per-op heap allocation is
unchanged (the only allocation is the per-page decode buffer plus the
boxing of returned primitives by the benchmark).

All 573 parquet-column tests pass; 51 BSS-specific tests pass.
@iemejia iemejia force-pushed the perf-bss-reader-gather branch from fa627d5 to 88a3b0e on April 19, 2026 at 19:15
iemejia added a commit to iemejia/parquet-java that referenced this pull request Apr 19, 2026
… shaded jar

The parquet-benchmarks pom is missing the JMH annotation-processor
configuration and the AppendingTransformer entries for BenchmarkList /
CompilerHints. As a result, the shaded jar built from master fails at
runtime with "Unable to find the resource: /META-INF/BenchmarkList".

This commit:

- Fixes parquet-benchmarks/pom.xml so the shaded jar is runnable: adds
  jmh-generator-annprocess to maven-compiler-plugin's annotation
  processor paths, and adds AppendingTransformer entries for
  META-INF/BenchmarkList and META-INF/CompilerHints to the shade plugin.

- Adds 11 JMH benchmarks covering the encode/decode paths used by the
  pending performance optimization PRs (apache#3494, apache#3496, apache#3500, apache#3504,
  apache#3506, apache#3510), so reviewers can reproduce the reported numbers and
  detect regressions:

    IntEncodingBenchmark, BinaryEncodingBenchmark,
    ByteStreamSplitEncodingBenchmark, ByteStreamSplitDecodingBenchmark,
    FixedLenByteArrayEncodingBenchmark, FileReadBenchmark,
    FileWriteBenchmark, RowGroupFlushBenchmark,
    ConcurrentReadWriteBenchmark, BlackHoleOutputFile, TestDataFactory.

After this change the shaded jar registers 87 benchmarks (was 0 from a
working build, or unrunnable at all from a default build).