GH-3530: Optimize BYTE_STREAM_SPLIT encoding/decoding #3569

Open
iemejia wants to merge 1 commit into apache:master from iemejia:parquet-perf-v2-par5-bss

@iemejia iemejia commented May 17, 2026

Part of #3530 — Apache Parquet Java Performance Improvements

Summary

Optimize scalar encode/decode for the BYTE_STREAM_SPLIT encoding.

Reader: Specialized transpose loops for element sizes 2/4/8/12/16 bytes plus generic fallback. Bulk array access when backing array is available.
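As an illustration of what a specialized transpose loop for 4-byte elements might look like, here is a minimal sketch. It is not the code from this PR: the class and method names are hypothetical, and it assumes the encoded page is already available as a `byte[]` holding four consecutive streams of `count` bytes each, with stream 0 carrying the least significant byte of every value.

```java
import java.util.Arrays;

// Hypothetical sketch, not the PR's implementation.
final class BssTransposeSketch {

    // Rebuilds `count` little-endian ints from four byte streams laid out
    // back to back: [stream0 | stream1 | stream2 | stream3].
    static int[] decodeInts(byte[] encoded, int count) {
        int[] out = new int[count];
        // Precompute the start offset of streams 1..3 once, outside the loop.
        int s1 = count, s2 = 2 * count, s3 = 3 * count;
        for (int i = 0; i < count; i++) {
            out[i] = (encoded[i] & 0xFF)
                    | (encoded[s1 + i] & 0xFF) << 8
                    | (encoded[s2 + i] & 0xFF) << 16
                    | (encoded[s3 + i] & 0xFF) << 24;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] encoded = {1, 5, 2, 6, 3, 7, 4, 8}; // two values, four streams
        System.out.println(Arrays.toString(decodeInts(encoded, 2)));
    }
}
```

Fixing the element width at compile time lets the loop body read exactly four bytes per value with no inner loop over streams, which is presumably what makes the specialized paths faster than a generic one.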

Writer: Batched scatter buffers (int[]/long[] batches of 64) replacing per-value scatterBytes() which allocated temp byte[] and issued N single-byte writes.
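The batching idea can be sketched roughly as follows. This is an illustrative approximation rather than the PR's writer: the class name, the use of `ByteArrayOutputStream` as the per-stream sink, and the `toByteArray()` helper are assumptions made for the example. Only the batch size of 64 and the one-bulk-write-per-stream flush come from the description above.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical sketch, not the PR's implementation.
final class BatchedBssIntWriterSketch {
    private static final int BATCH = 64;              // batch size from the PR description
    private final ByteArrayOutputStream[] streams = new ByteArrayOutputStream[4];
    private final int[] batch = new int[BATCH];       // values waiting to be scattered
    private final byte[] scratch = new byte[BATCH];   // reused for each stream's bytes
    private int pending = 0;

    BatchedBssIntWriterSketch() {
        for (int s = 0; s < 4; s++) streams[s] = new ByteArrayOutputStream();
    }

    void writeInt(int v) {
        batch[pending++] = v;
        if (pending == BATCH) flush();
    }

    // One bulk write per stream instead of one single-byte write per value per stream.
    private void flush() {
        for (int s = 0; s < 4; s++) {
            int shift = 8 * s;
            for (int i = 0; i < pending; i++) scratch[i] = (byte) (batch[i] >>> shift);
            streams[s].write(scratch, 0, pending);
        }
        pending = 0;
    }

    // Flushes any partial batch and concatenates the four streams.
    byte[] toByteArray() {
        flush();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (ByteArrayOutputStream s : streams) out.write(s.toByteArray(), 0, s.size());
        return out.toByteArray();
    }

    public static void main(String[] args) {
        BatchedBssIntWriterSketch w = new BatchedBssIntWriterSketch();
        w.writeInt(0x04030201);
        w.writeInt(0x08070605);
        System.out.println(Arrays.toString(w.toByteArray()));
    }
}
```

With this shape, a full batch costs four `write(byte[], off, len)` calls for 64 values, and no per-value temporary array is allocated.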

Includes unit tests for transpose specializations, batch-boundary crossing, getBufferedSize with partial batches, direct ByteBuffer decode paths, and close/reset with pending unflushed batches.

JMH benchmarks: BssEncodingBenchmark, BssDecodingBenchmark covering FLOAT, DOUBLE, INT32, INT64, and FIXED_LEN_BYTE_ARRAY.

Benchmark results

Environment: JDK 25.0.3 (Temurin), OpenJDK 64-Bit Server VM, JMH 1.37, Linux x86_64.

Decoding:

| Benchmark | Baseline (M ops/s) | Optimized (M ops/s) | Speedup |
|---|---|---|---|
| decodeInt | 203 | 1,034 | 5.1x |
| decodeFloat | 263 | 1,032 | 3.9x |
| decodeDouble | 132 | 363 | 2.8x |
| decodeLong | 133 | 365 | 2.7x |
| decodeFlba(2) | 286 | 491 | 1.7x |
| decodeFlba(12) | 95 | 179 | 1.9x |
| decodeFlba(16) | 78 | 142 | 1.8x |

Encoding:

| Benchmark | Baseline (M ops/s) | Optimized (M ops/s) | Speedup |
|---|---|---|---|
| encodeDouble | 53 | 365 | 6.9x |
| encodeLong | 52 | 356 | 6.9x |
| encodeInt | 99 | 515 | 5.2x |
| encodeFloat | 101 | 499 | 5.0x |
| encodeFlba(16) | 32 | 95 | 3.0x |
| encodeFlba(12) | 41 | 114 | 2.8x |
| encodeFlba(7) | 69 | 166 | 2.4x |
| encodeFlba(2) | 192 | 314 | 1.6x |

Every benchmark improves, with no regressions. The 8-byte types (DOUBLE and INT64) gain the most from the batched scatter on the encode side (6.9x), since the baseline wrote each value's 8 bytes one at a time into 8 separate streams.

Reader: replace generic ByteBuffer.get() transpose loop in decodeData()
with specialized single-pass loops for element sizes 2/4/8/12/16 bytes
plus a stream-oriented generic fallback. Bulk-access the backing array
directly when available, falling back to a single bulk copy for direct
buffers.
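The heap-versus-direct handling and the stream-oriented fallback could look something like the sketch below. It is an assumption-laden illustration, not the PR's `decodeData()`: the class and method names are hypothetical, and the output is returned as a plain interleaved `byte[]` for simplicity.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch, not the PR's implementation.
final class BssGenericDecodeSketch {

    // Interleaves `width` streams of `count` bytes each back into `count` values.
    static byte[] decode(ByteBuffer encoded, int width, int count) {
        int total = width * count;
        byte[] src;
        int base;
        if (encoded.hasArray()) {
            // Heap buffer: read the backing array in place, no copy.
            src = encoded.array();
            base = encoded.arrayOffset() + encoded.position();
        } else {
            // Direct (or read-only) buffer: one bulk copy, then plain array access.
            src = new byte[total];
            encoded.duplicate().get(src, 0, total); // duplicate() leaves the caller's position untouched
            base = 0;
        }
        byte[] out = new byte[total];
        // Stream-oriented order: walk each stream sequentially and scatter into the output.
        for (int s = 0; s < width; s++) {
            int streamStart = base + s * count;
            for (int i = 0; i < count; i++) {
                out[i * width + s] = src[streamStart + i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] enc = {1, 3, 5, 2, 4, 6}; // width 2, three values
        System.out.println(Arrays.toString(decode(ByteBuffer.wrap(enc), 2, 3)));
    }
}
```

Either path ends with the inner loops indexing a plain array rather than calling `ByteBuffer.get()` once per byte.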

Writer: replace per-value scatterBytes() (which allocates a temp byte[]
and issues N single-byte stream writes) with batched scatter buffers.
Int/Long values accumulate in int[]/long[] batches of 64 and flush as
bulk write(byte[], off, len) calls -- one per stream. FLBA uses
per-stream byte[][] scratch buffers with the same batching strategy.
getBufferedSize() now accounts for unflushed batch values.
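To make the FLBA scratch buffers and the `getBufferedSize()` accounting concrete, here is a rough sketch. The class name, the `ByteArrayOutputStream` sinks, and the exact size formula are assumptions for illustration. The PR only states that per-stream `byte[][]` scratch buffers are used and that unflushed values are counted.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not the PR's implementation.
final class BatchedBssFlbaWriterSketch {
    private static final int BATCH = 64;
    private final int width;                         // fixed length of each value in bytes
    private final ByteArrayOutputStream[] streams;   // one sink per byte position
    private final byte[][] scratch;                  // scratch[s] holds pending bytes for stream s
    private int pending = 0;

    BatchedBssFlbaWriterSketch(int width) {
        this.width = width;
        this.streams = new ByteArrayOutputStream[width];
        this.scratch = new byte[width][BATCH];
        for (int s = 0; s < width; s++) streams[s] = new ByteArrayOutputStream();
    }

    void write(byte[] value) {
        if (value.length != width) throw new IllegalArgumentException("unexpected value length");
        for (int s = 0; s < width; s++) scratch[s][pending] = value[s];
        if (++pending == BATCH) flush();
    }

    void flush() {
        for (int s = 0; s < width; s++) streams[s].write(scratch[s], 0, pending);
        pending = 0;
    }

    // Flushed bytes plus the bytes still sitting in the scratch buffers.
    long getBufferedSize() {
        long flushed = 0;
        for (ByteArrayOutputStream s : streams) flushed += s.size();
        return flushed + (long) pending * width;
    }

    public static void main(String[] args) {
        BatchedBssFlbaWriterSketch w = new BatchedBssFlbaWriterSketch(7);
        for (int i = 0; i < 3; i++) w.write(new byte[7]);
        System.out.println(w.getBufferedSize());
    }
}
```

Without the `pending * width` term, a caller checking the size between flushes would under-count by up to one batch of values.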

Add JMH benchmarks for scalar encode/decode of all 5 BSS types (FLOAT,
DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY). Add TestDataFactory for
deterministic FLBA benchmark data generation. Add unit tests for
transpose specializations, batch-boundary crossing, getBufferedSize
with partial batches, direct ByteBuffer decode paths, and close/reset
with pending unflushed batches.

@iemejia iemejia force-pushed the parquet-perf-v2-par5-bss branch from f7fdee5 to 84de9c6 on May 17, 2026 23:04