Overview
This is an umbrella issue tracking a series of performance improvements to the Parquet vectorized reader in Spark SQL. The changes target allocation reduction, bulk-read optimizations, and JIT-friendly code patterns across multiple encoding paths.
All PRs are independent and can be reviewed and merged in any order. Together they yield throughput gains of roughly 1.1x to 7.2x, depending on the encoding and data shape, for Parquet reads with no user-facing behavioral changes.
Note: I know Spark tickets are managed in JIRA. This issue exists only as a central place to point other parties to the individual encoding performance improvements, without creating a parent ticket in JIRA.
Summary
| # | PR | Focus | Key Speedup |
|---|---|---|---|
| 1 | #55919 | DELTA_BINARY_PACKED bulk reads | up to 7.2x |
| 2 | #55920 | Dictionary decode hasNull fast path | 1.24x |
| 3 | #55921 | BYTE_STREAM_SPLIT vectorized reader | 2.8-4.5x |
| 4 | #55922 | RLE PACKED batch ByteBuffer slice | 2.1-2.4x |
| 5 | #55923 | Timestamp/date updater bulk reads | up to 2.9x |
| 6 | #55924 | DELTA_BYTE_ARRAY allocation reduction | 1.1-1.9x |
| 7 | #55932 | DELTA_LENGTH_BYTE_ARRAY allocation reduction | 1.2-1.4x |
Pull Requests
1. DELTA_BINARY_PACKED bulk read optimization
PR: #55919 (SPARK-56892)
Replaces per-element lambda dispatch in readIntegers/readLongs with bulk paths that compute prefix sums in-place and write via putInts/putLongs. Also eliminates 3 allocations per value in readUnsignedLongs by replacing BigInteger(Long.toUnsignedString(v)) with a reusable ByteBuffer.
| Type | Speedup |
|---|---|
| INT32 (monotonic) | 1.4x |
| INT64 (monotonic) | 3.8x |
| readUnsignedLongs | 7.2x |
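
As a rough illustration of the two ideas above (not the code from the PR), the sketch below computes a prefix sum in place over a block of decoded deltas and converts an unsigned 64-bit value to `BigInteger` through a reusable 8-byte scratch buffer instead of an intermediate string. The class and method names are hypothetical.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;

// Hypothetical sketch; names do not come from the Spark code base.
final class DeltaBulkSketch {
    // Turns deltas into absolute values in place:
    // deltas[i] becomes firstValue + deltas[0] + ... + deltas[i].
    static void prefixSumInPlace(long[] deltas, long firstValue) {
        long running = firstValue;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            deltas[i] = running;
        }
    }

    // Interprets v as an unsigned 64-bit value without building a String.
    // Assumes scratch is a heap buffer of capacity 8 in big-endian order (the default).
    static BigInteger toUnsigned(long v, ByteBuffer scratch) {
        scratch.clear();
        scratch.putLong(v);
        return new BigInteger(1, scratch.array());
    }
}
```

After the prefix sum, the whole array can be handed to a single bulk write, which is what removes the per-element dispatch.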
2. Dictionary decoding hasNull fast path + per-class updater overrides
PR: #55920 (SPARK-56893)
Adds a hasNull() fast path that skips per-element null checks when the column has no nulls (common case). Per-class decodeDictionaryIds overrides give C2 monomorphic call sites, enabling full inlining of type-specific decode expressions.
| Scenario | Speedup |
|---|---|
| No nulls (avg across 6 updaters) | 1.24x |
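
The shape of the fast path can be sketched as follows. This is a simplified stand-in that uses plain arrays rather than Spark column vectors, and the class, method, and parameter names are assumptions made for the example.

```java
// Hypothetical sketch; the real updaters operate on Spark column vectors.
final class DictionaryDecodeSketch {
    static void decode(int[] ids, boolean[] isNull, boolean hasNull,
                       long[] dictionary, long[] out) {
        if (!hasNull) {
            // Fast path: the column has no nulls, so skip the per-element check.
            for (int i = 0; i < ids.length; i++) {
                out[i] = dictionary[ids[i]];
            }
        } else {
            // Slow path: only decode positions that are not null.
            for (int i = 0; i < ids.length; i++) {
                if (!isNull[i]) {
                    out[i] = dictionary[ids[i]];
                }
            }
        }
    }
}
```

Hoisting the null test out of the loop leaves a branch-free body in the common case, which tends to be easier for the JIT to optimize.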
3. Vectorized BYTE_STREAM_SPLIT reader
PR: #55921 (SPARK-56894)
Adds a new VectorizedByteStreamSplitValuesReader that decodes BSS-encoded pages (FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY) using batch byte-gathering instead of falling back to parquet-mr per-value reads.
| Type | Speedup vs parquet-mr |
|---|---|
| INT32 | 4.5x |
| INT64 | 2.8x |
| FLOAT | 4.2x |
| DOUBLE | 2.8x |
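
For readers unfamiliar with the encoding, here is a minimal sketch of batch byte-gathering for INT32. It assumes the page holds four byte streams laid out back to back (stream k carries byte k of every value, least significant byte first). The class and method names are hypothetical and the code is not taken from the PR.

```java
// Hypothetical sketch of BYTE_STREAM_SPLIT decoding for 4-byte values.
final class ByteStreamSplitSketch {
    // page holds 4 streams of `count` bytes each; byte k of value i is at k * count + i.
    static int[] decodeInts(byte[] page, int count) {
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = (page[i] & 0xFF)
                   | (page[count + i] & 0xFF) << 8
                   | (page[2 * count + i] & 0xFF) << 16
                   | (page[3 * count + i] & 0xFF) << 24;
        }
        return out;
    }
}
```

Decoding a whole batch in one loop over the streams avoids a reader call per value.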
4. Batch ByteBuffer slice in RLE PACKED decode
PR: #55922 (SPARK-56895)
Replaces per-group in.slice(bitWidth) (one ByteBuffer allocation per 8 values) with a single bulk slice for the entire PACKED run. Eliminates ~128K short-lived ByteBuffer allocations per 1M-value page.
| bitWidth | Speedup (readIntegers) |
|---|---|
| 4 | 2.1x |
| 8 | 2.4x |
| 12 | 1.6x |
| 20 | 1.4x |
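
The slicing pattern can be sketched with plain `java.nio.ByteBuffer` as below. The bit-by-bit unpacking is deliberately naive and only there to keep the example self-contained. It assumes values are packed starting from the least significant bit of each byte, and all names are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: one slice per PACKED run instead of one per group of 8 values.
final class PackedRunSketch {
    static void readPackedRun(ByteBuffer in, int bitWidth, int numGroups, int[] out) {
        // Each group of 8 values occupies exactly bitWidth bytes.
        int runBytes = numGroups * bitWidth;
        ByteBuffer run = in.slice();   // single allocation for the whole run
        run.limit(runBytes);
        in.position(in.position() + runBytes);

        // Naive unpacking with absolute reads from the shared slice.
        for (int i = 0; i < numGroups * 8; i++) {
            int value = 0;
            for (int b = 0; b < bitWidth; b++) {
                int bitPos = i * bitWidth + b;
                int bit = (run.get(bitPos >>> 3) >>> (bitPos & 7)) & 1;
                value |= bit << b;
            }
            out[i] = value;
        }
    }
}
```

With a per-group slice, a page of about 1M values needs one buffer object for every 8 values, which is where the ~128K figure above comes from.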
5. Bulk read paths for timestamp/date Parquet vector updaters
PR: #55923 (SPARK-56896)
Replaces per-element readValue loops with two-pass bulk read + in-place conversion for five updaters (LongAsMicrosUpdater, LongAsNanosUpdater, LongAsMicrosRebaseUpdater, DateToTimestampNTZUpdater, DateToTimestampNTZWithRebaseUpdater). Avoids per-element virtual dispatch through VectorizedValuesReader.
| Updater | Speedup |
|---|---|
| LongAsMicrosUpdater | 2.9x |
| DateToTimestampNTZUpdater | 1.2x |
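
A minimal sketch of the two-pass idea follows. It assumes little-endian INT64 input and a milliseconds-to-microseconds conversion as the per-value transform. Both the conversion and the names are assumptions for illustration, not the PR's code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of "bulk read, then convert in place".
final class TimestampBulkSketch {
    static void readMillisAsMicros(ByteBuffer in, long[] out, int count) {
        // Pass 1: copy `count` raw longs in a single bulk call.
        in.order(ByteOrder.LITTLE_ENDIAN).asLongBuffer().get(out, 0, count);
        in.position(in.position() + count * Long.BYTES);

        // Pass 2: convert in place; this tight loop has no reader calls in it.
        for (int i = 0; i < count; i++) {
            out[i] = Math.multiplyExact(out[i], 1000L);
        }
    }
}
```

Splitting the work this way keeps the reader call out of the inner loop, so the conversion pass runs over a plain array.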
6. Reduce per-value allocations in DELTA_BYTE_ARRAY decoder
PR: #55924 (SPARK-56897)
Replaces ByteBuffer-based state tracking with a reusable byte[] buffer, eliminating 2 ByteBuffer allocations per decoded value (~8K objects per 4096-value page). Also rewrites skipBinary to avoid column vector reset/swap overhead.
| Operation | Speedup |
|---|---|
| readBinary | 1.1-1.3x |
| skipBinary | 1.5-1.9x |
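
To make the state-tracking change concrete, here is a small sketch of prefix-plus-suffix reconstruction into a reusable `byte[]`. The buffer keeps the previous value, so the shared prefix is already in place and only the suffix needs copying. Class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: the previous value lives in a reusable, growable byte[].
final class DeltaByteArraySketch {
    private byte[] previous = new byte[16];

    // Rebuilds the next value as previous[0..prefixLen) + suffix and returns its length.
    // The result is available in buffer()[0..length) until the next call.
    int next(int prefixLen, byte[] suffix, int suffixOff, int suffixLen) {
        int len = prefixLen + suffixLen;
        if (len > previous.length) {
            // Grow geometrically so resizing stays rare.
            previous = Arrays.copyOf(previous, Math.max(len, previous.length * 2));
        }
        System.arraycopy(suffix, suffixOff, previous, prefixLen, suffixLen);
        return len;
    }

    byte[] buffer() {
        return previous;
    }
}
```

In steady state this allocates nothing per value. The buffer only grows when a value is longer than any seen before.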
7. Reduce per-value allocation in DELTA_LENGTH_BYTE_ARRAY decoder
PR: #55932 (SPARK-56907)
Replaces per-value in.slice(length) with a single bulk slice for the entire batch. Replaces per-value skip loop with a single bulk skip.
| Operation | Speedup |
|---|---|
| readBinary (small payloads) | 1.2x |
| skipBinary | 1.4x |
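
Both changes reduce to summing the already-decoded lengths and touching the input once. The sketch below shows this with plain `java.nio.ByteBuffer`. The names are hypothetical and the real reader works against Parquet's input stream abstraction rather than a bare buffer.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: one slice or one skip per batch, driven by the decoded lengths.
final class DeltaLengthSketch {
    static int totalLength(int[] lengths, int start, int count) {
        int total = 0;
        for (int i = 0; i < count; i++) {
            total += lengths[start + i];
        }
        return total;
    }

    // Returns a single view covering all `count` values and advances `in` past them.
    static ByteBuffer bulkSlice(ByteBuffer in, int[] lengths, int start, int count) {
        int total = totalLength(lengths, start, count);
        ByteBuffer batch = in.slice();
        batch.limit(total);
        in.position(in.position() + total);
        return batch;
    }

    // Skips `count` values with a single position update.
    static void bulkSkip(ByteBuffer in, int[] lengths, int start, int count) {
        in.position(in.position() + totalLength(lengths, start, count));
    }
}
```

Individual values are then located inside the batch view by running offsets over the same lengths array.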
Common Themes
- Allocation reduction: Replace per-value ByteBuffer.slice() / ByteBuffer.wrap() with bulk reads into reusable buffers
- Bulk vectorized reads: Replace per-element virtual dispatch with single batch calls backed by System.arraycopy
- JIT-friendly patterns: Per-class method overrides for monomorphic call sites; avoiding megamorphic profile pollution from shared helpers
Benchmarking
All benchmarks were run on AMD EPYC 9V45 with OpenJDK 17/25, comparing upstream master against the patched version on the same machine with identical JVM flags.