GH-3530: Optimize BYTE_STREAM_SPLIT encoding/decoding by iemejia · Pull Request #3569 · apache/parquet-java

iemejia · 2026-05-17T22:38:55Z

Part of #3530 — Apache Parquet Java Performance Improvements

Summary

Optimize scalar encode/decode for the BYTE_STREAM_SPLIT encoding.

Reader: Specialized transpose loops for element sizes of 2, 4, 8, 12, and 16 bytes, plus a generic fallback for other sizes. The backing array is accessed in bulk when one is available.
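As an illustration of the reader-side idea, here is a minimal sketch of a 4-byte specialization. It is not the code from this PR: the class and method names (`BssDecodeSketch`, `decodeFloats`, `encodeFloats`) are hypothetical, and it assumes the usual BYTE_STREAM_SPLIT layout in which stream `s` holds byte `s` of every value in little-endian order, with the streams concatenated.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class BssDecodeSketch {

    // Reference encoder, used here only to produce input for the decoder:
    // stream s holds byte s (little-endian order) of every value.
    static byte[] encodeFloats(float[] values) {
        int n = values.length;
        byte[] encoded = new byte[n * 4];
        for (int i = 0; i < n; i++) {
            int bits = Float.floatToRawIntBits(values[i]);
            for (int s = 0; s < 4; s++) {
                encoded[s * n + i] = (byte) (bits >>> (8 * s));
            }
        }
        return encoded;
    }

    // Specialized 4-byte transpose: a single pass over the values, with the
    // four stream base offsets computed once outside the loop.
    static float[] decodeFloats(ByteBuffer encoded, int valueCount) {
        int length = valueCount * 4;
        if (encoded.remaining() < length) {
            throw new IllegalArgumentException("not enough encoded bytes");
        }
        byte[] src;
        int base;
        if (encoded.hasArray()) {
            // Heap buffer: read the backing array directly, no copy.
            src = encoded.array();
            base = encoded.arrayOffset() + encoded.position();
        } else {
            // Direct or read-only buffer: one bulk copy through a duplicate,
            // so the caller's position is left untouched.
            src = new byte[length];
            encoded.duplicate().get(src);
            base = 0;
        }
        int s0 = base;
        int s1 = base + valueCount;
        int s2 = base + 2 * valueCount;
        int s3 = base + 3 * valueCount;
        float[] out = new float[valueCount];
        for (int i = 0; i < valueCount; i++) {
            int bits = (src[s0 + i] & 0xFF)
                    | (src[s1 + i] & 0xFF) << 8
                    | (src[s2 + i] & 0xFF) << 16
                    | (src[s3 + i] & 0xFF) << 24;
            out[i] = Float.intBitsToFloat(bits);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] values = {1.5f, -2.25f, 3.0f};
        byte[] encoded = encodeFloats(values);
        System.out.println(Arrays.toString(decodeFloats(ByteBuffer.wrap(encoded), values.length)));
    }
}
```

The point of the specialization is that the inner loop has a fixed shape for a known width, so it assembles each value from four array reads instead of issuing one `ByteBuffer.get()` per byte.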

Writer: Batched scatter buffers (int[]/long[] batches of 64) replace the per-value scatterBytes(), which allocated a temporary byte[] and issued N single-byte writes for every value.
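The batching strategy on the writer side can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's implementation: `BssIntBatchWriterSketch` and its methods are hypothetical names, `ByteArrayOutputStream` stands in for whatever per-stream output the real writer uses, and only the INT32 case is shown.

```java
import java.io.ByteArrayOutputStream;

final class BssIntBatchWriterSketch {
    static final int BATCH_SIZE = 64;

    private final ByteArrayOutputStream[] streams = new ByteArrayOutputStream[4];
    private final int[] batch = new int[BATCH_SIZE];
    private final byte[] scratch = new byte[BATCH_SIZE];
    private int pending; // values accumulated but not yet flushed

    BssIntBatchWriterSketch() {
        for (int s = 0; s < streams.length; s++) {
            streams[s] = new ByteArrayOutputStream();
        }
    }

    // Accumulate the value; no allocation and no stream write on this path.
    void writeInteger(int value) {
        batch[pending++] = value;
        if (pending == BATCH_SIZE) {
            flush();
        }
    }

    // One bulk write per stream for the whole batch, instead of one
    // single-byte write per stream for every value.
    private void flush() {
        for (int s = 0; s < streams.length; s++) {
            int shift = 8 * s;
            for (int i = 0; i < pending; i++) {
                scratch[i] = (byte) (batch[i] >>> shift);
            }
            streams[s].write(scratch, 0, pending);
        }
        pending = 0;
    }

    // Flushed bytes plus the bytes still sitting in the pending batch.
    long getBufferedSize() {
        long size = (long) pending * streams.length;
        for (ByteArrayOutputStream stream : streams) {
            size += stream.size();
        }
        return size;
    }

    // Flush any partial batch, then concatenate the streams in order.
    byte[] getBytes() {
        flush();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (ByteArrayOutputStream stream : streams) {
            out.write(stream.toByteArray(), 0, stream.size());
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        BssIntBatchWriterSketch writer = new BssIntBatchWriterSketch();
        for (int i = 0; i < 100; i++) {
            writer.writeInteger(i);
        }
        System.out.println(writer.getBufferedSize());
    }
}
```

With a batch of 64 INT32 values, this shape issues 4 bulk writes per flush where a per-value scatter would issue 256 single-byte writes. It also shows why the size accounting has to include the pending batch: after 100 writes, 36 values have not reached any stream yet.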

Includes unit tests for transpose specializations, batch-boundary crossing, getBufferedSize with partial batches, direct ByteBuffer decode paths, and close/reset with pending unflushed batches.

JMH benchmarks: BssEncodingBenchmark, BssDecodingBenchmark covering FLOAT, DOUBLE, INT32, INT64, and FIXED_LEN_BYTE_ARRAY.

Benchmark results

Environment: JDK 25.0.3 (Temurin), OpenJDK 64-Bit Server VM, JMH 1.37, Linux x86_64.

Decoding:

Benchmark	Baseline (M ops/s)	Optimized (M ops/s)	Speedup
decodeInt	203	1,034	5.1x
decodeFloat	263	1,032	3.9x
decodeDouble	132	363	2.8x
decodeLong	133	365	2.7x
decodeFlba(2)	286	491	1.7x
decodeFlba(12)	95	179	1.9x
decodeFlba(16)	78	142	1.8x

Encoding:

Benchmark	Baseline (M ops/s)	Optimized (M ops/s)	Speedup
encodeDouble	53	365	6.9x
encodeLong	52	356	6.9x
encodeInt	99	515	5.2x
encodeFloat	101	499	5.0x
encodeFlba(16)	32	95	3.0x
encodeFlba(12)	41	114	2.8x
encodeFlba(7)	69	166	2.4x
encodeFlba(2)	192	314	1.6x

Every benchmark shows a clear improvement, with no regressions. The 8-byte types (DOUBLE and INT64) benefit most from the batched scatter on encode (6.9x), since the baseline wrote 8 single bytes per value, one to each of 8 separate streams.

Reader: replace generic ByteBuffer.get() transpose loop in decodeData() with specialized single-pass loops for element sizes 2/4/8/12/16 bytes plus a stream-oriented generic fallback. Bulk-access the backing array directly when available, falling back to a single bulk copy for direct buffers.

Writer: replace per-value scatterBytes() (which allocates a temp byte[] and issues N single-byte stream writes) with batched scatter buffers. Int/Long values accumulate in int[]/long[] batches of 64 and flush as bulk write(byte[], off, len) calls, one per stream. FLBA uses per-stream byte[][] scratch buffers with the same batching strategy. getBufferedSize() now accounts for unflushed batch values.

Add JMH benchmarks for scalar encode/decode of all 5 BSS types (FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY). Add TestDataFactory for deterministic FLBA benchmark data generation.

Add unit tests for transpose specializations, batch-boundary crossing, getBufferedSize with partial batches, direct ByteBuffer decode paths, and close/reset with pending unflushed batches.
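For the FLBA case, a rough sketch of per-stream scratch buffers paired with a stream-oriented generic decode might look like the following. The names (`BssFlbaSketch`, `write`, `decode`) are hypothetical, the use of `ByteArrayOutputStream` is an assumption made to keep the example self-contained, and the width of 7 in `main` is just an example of a size with no specialized loop.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class BssFlbaSketch {
    static final int BATCH_SIZE = 64;

    private final int width;
    private final ByteArrayOutputStream[] streams;
    private final byte[][] scratch; // scratch[s][i] = byte s of the i-th pending value
    private int pending;

    BssFlbaSketch(int width) {
        this.width = width;
        this.streams = new ByteArrayOutputStream[width];
        this.scratch = new byte[width][BATCH_SIZE];
        for (int s = 0; s < width; s++) {
            streams[s] = new ByteArrayOutputStream();
        }
    }

    // Scatter the value's bytes into the per-stream scratch rows.
    void write(byte[] value) {
        if (value.length != width) {
            throw new IllegalArgumentException("expected " + width + " bytes");
        }
        for (int s = 0; s < width; s++) {
            scratch[s][pending] = value[s];
        }
        pending++;
        if (pending == BATCH_SIZE) {
            flush();
        }
    }

    // Each scratch row is already contiguous, so a flush is one bulk write per stream.
    private void flush() {
        for (int s = 0; s < width; s++) {
            streams[s].write(scratch[s], 0, pending);
        }
        pending = 0;
    }

    byte[] getBytes() {
        flush();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (ByteArrayOutputStream stream : streams) {
            out.write(stream.toByteArray(), 0, stream.size());
        }
        return out.toByteArray();
    }

    // Generic fallback for any width: walk one stream at a time, reading it
    // sequentially and writing byte s of every value.
    static byte[][] decode(byte[] encoded, int width) {
        int valueCount = encoded.length / width;
        byte[][] values = new byte[valueCount][width];
        for (int s = 0; s < width; s++) {
            int base = s * valueCount;
            for (int i = 0; i < valueCount; i++) {
                values[i][s] = encoded[base + i];
            }
        }
        return values;
    }

    public static void main(String[] args) {
        BssFlbaSketch writer = new BssFlbaSketch(7);
        writer.write(new byte[] {1, 2, 3, 4, 5, 6, 7});
        writer.write(new byte[] {8, 9, 10, 11, 12, 13, 14});
        System.out.println(Arrays.deepToString(decode(writer.getBytes(), 7)));
    }
}
```

Laying the scratch out as one row per stream means the scatter happens at write time, on plain array stores, and the flush needs no further rearrangement before the bulk writes.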

iemejia force-pushed the parquet-perf-v2-par5-bss branch from f7fdee5 to 84de9c6 on May 17, 2026 23:04

iemejia wants to merge 1 commit into apache:master from iemejia:parquet-perf-v2-par5-bss
