[SPARK-56894][SQL] Add vectorized Parquet BYTE_STREAM_SPLIT reader #55921
Open
iemejia wants to merge 1 commit into
Adds a vectorized reader for the Parquet BYTE_STREAM_SPLIT encoding,
enabling native batch decoding of BSS-encoded columns in Spark's
vectorized Parquet reader.
BYTE_STREAM_SPLIT de-interleaves the bytes of N fixed-width values into
W separate streams (one per byte position). Decoding gathers the bytes
back: value[i] = {stream[0][i], stream[1][i], ..., stream[W-1][i]}.
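The gather step above can be illustrated with a small standalone sketch. This is not the code from this PR: the class and method names are hypothetical, and it assumes the page holds the W streams back to back (stream 0 first) and that the gathered bytes form each value's little-endian representation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not the reader added by this PR.
final class ByteStreamSplitDecodeSketch {

    // Gathers `count` values of `width` bytes each from a page laid out as
    // stream 0 (count bytes), stream 1 (count bytes), ..., stream width-1.
    // Returns the values as contiguous bytes, one value after another.
    static byte[] gather(byte[] page, int count, int width) {
        if (page.length != count * width) {
            throw new IllegalArgumentException("page size must equal count * width");
        }
        byte[] out = new byte[count * width];
        for (int i = 0; i < count; i++) {
            for (int b = 0; b < width; b++) {
                // value[i] byte b comes from stream[b][i]
                out[i * width + b] = page[b * count + i];
            }
        }
        return out;
    }

    // Decodes INT32 values, assuming little-endian byte order within a value.
    static int[] decodeInts(byte[] page, int count) {
        ByteBuffer buf = ByteBuffer.wrap(gather(page, count, Integer.BYTES))
                .order(ByteOrder.LITTLE_ENDIAN);
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = buf.getInt();
        }
        return values;
    }
}
```

For a two-value page `{1, 5, 2, 6, 3, 7, 4, 8}`, the four streams are `{1, 5}`, `{2, 6}`, `{3, 7}`, `{4, 8}`, so the gathered values are `0x04030201` and `0x08070605`.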
This encoding is increasingly used for time-series and scientific data
(e.g., IoT sensor readings, financial tick data) because adjacent values
typically share high-order bytes, making each stream highly compressible.
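To make the compressibility argument concrete, the following sketch splits a few nearby float readings into their four byte streams. It is an illustration under stated assumptions (hypothetical class name, little-endian value layout, made-up sample readings), not the Parquet writer's implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical illustration only; not the Parquet writer's code.
final class ByteStreamSplitEncodeSketch {

    // Scatters each float's 4 little-endian bytes into 4 separate streams:
    // streams[b][i] holds byte b of value i.
    static byte[][] split(float[] values) {
        int width = Float.BYTES;
        byte[][] streams = new byte[width][values.length];
        ByteBuffer buf = ByteBuffer.allocate(width).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < values.length; i++) {
            buf.putFloat(0, values[i]); // absolute put, position stays at 0
            for (int b = 0; b < width; b++) {
                streams[b][i] = buf.get(b);
            }
        }
        return streams;
    }

    public static void main(String[] args) {
        // Made-up sample readings that differ only slightly.
        float[] readings = {20.0f, 20.5f, 21.0f, 21.5f};
        byte[][] streams = split(readings);
        for (int b = 0; b < streams.length; b++) {
            System.out.println("stream " + b + ": " + Arrays.toString(streams[b]));
        }
    }
}
```

For these four readings, the highest byte (sign and upper exponent bits) is identical in every value and the two lowest mantissa bytes are all zero, so three of the four streams are constant runs. Only one stream carries the variation, which is the property a general-purpose compressor can exploit.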
Before this PR, BSS-encoded columns threw
SparkUnsupportedOperationException in vectorized mode.
Changes:
- VectorizedByteStreamSplitValuesReader: new reader extending
VectorizedReaderBase. Eagerly reads all page bytes in initFromPage,
then assembles each value by gathering one byte from each of the W
streams. The per-element assembleInt/assembleLong helpers are shared by
the single-value and batch read methods. Supports INT32, INT64, FLOAT,
DOUBLE, and FIXED_LEN_BYTE_ARRAY.
- VectorizedColumnReader.getValuesReader: added BYTE_STREAM_SPLIT case
that dispatches by primitive type to determine typeWidth.
- 31 unit tests across 5 suites (Integer, Long, Float, Double, FLBA)
covering batch reads, single-value reads, skip operations, special
values (NaN, Inf, min/max), and direct ByteBuffer support.
- Benchmark comparing Spark vectorized reader vs parquet-mr per-value
reader: 2.7-4.1x speedup across all types.
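The shape of the changes listed above can be sketched roughly as follows. This is a simplified stand-in, not the code in this PR: the `PhysicalType` enum, the `typeWidth` helper, and the array-based read methods are assumptions made for illustration, whereas the real reader extends VectorizedReaderBase and writes into Spark's column vectors. The helper names `assembleInt` and `assembleLong` are taken from the description; their bodies here are a guess that assumes a little-endian value layout.

```java
// Hypothetical, simplified sketch; not the reader added by this PR.
final class BssReaderSketch {

    // Stand-in for Parquet's primitive type names (assumption for this sketch).
    enum PhysicalType { INT32, INT64, FLOAT, DOUBLE, FIXED_LEN_BYTE_ARRAY }

    // Dispatch by primitive type to find the number of byte streams per value.
    static int typeWidth(PhysicalType type, int fixedLength) {
        switch (type) {
            case INT32:
            case FLOAT:
                return 4;
            case INT64:
            case DOUBLE:
                return 8;
            case FIXED_LEN_BYTE_ARRAY:
                return fixedLength;
            default:
                throw new UnsupportedOperationException("Unsupported type: " + type);
        }
    }

    private final byte[] page; // all page bytes, read eagerly
    private final int count;   // number of values, which is also the stream length
    private int pos;           // index of the next value to read

    BssReaderSketch(byte[] page, int typeWidth) {
        this.page = page;
        this.count = page.length / typeWidth;
    }

    // Gathers byte b of value i from stream b, least significant byte first.
    private int assembleInt(int i) {
        return (page[i] & 0xFF)
                | (page[count + i] & 0xFF) << 8
                | (page[2 * count + i] & 0xFF) << 16
                | (page[3 * count + i] & 0xFF) << 24;
    }

    private long assembleLong(int i) {
        long v = 0;
        for (int b = 0; b < 8; b++) {
            v |= (page[b * count + i] & 0xFFL) << (8 * b);
        }
        return v;
    }

    // Single-value and batch reads share the same per-element helper.
    int readInteger() {
        return assembleInt(pos++);
    }

    void readIntegers(int total, int[] dst, int offset) {
        for (int k = 0; k < total; k++) {
            dst[offset + k] = assembleInt(pos + k);
        }
        pos += total;
    }

    void readLongs(int total, long[] dst, int offset) {
        for (int k = 0; k < total; k++) {
            dst[offset + k] = assembleLong(pos + k);
        }
        pos += total;
    }

    // Floats reuse the int path and reinterpret the bits.
    void readFloats(int total, float[] dst, int offset) {
        for (int k = 0; k < total; k++) {
            dst[offset + k] = Float.intBitsToFloat(assembleInt(pos + k));
        }
        pos += total;
    }

    // Skipping only advances the index because every stream is randomly accessible.
    void skip(int total) {
        pos += total;
    }
}
```

In this sketch, holding the whole page in memory is what makes skip a constant-time index bump, since no bytes need to be consumed to reach a later value.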
b5c9564 to
2d06832
What changes were proposed in this pull request?
This PR adds a vectorized reader for the Parquet BYTE_STREAM_SPLIT encoding (VectorizedByteStreamSplitValuesReader), wired into VectorizedColumnReader.getValuesReader().
BYTE_STREAM_SPLIT de-interleaves N fixed-width values (W bytes each) into W separate byte streams. Decoding gathers the original bytes back: value[i] = {stream[0][i], stream[1][i], ..., stream[W-1][i]}. This encoding is particularly effective for time-series and scientific data where adjacent values share high-order bytes.
The new reader:
- Reads all page bytes eagerly into a byte[] via initFromPage
- Uses assembleInt/assembleLong helpers for byte gathering
- Provides batch read methods (readIntegers, readLongs, readFloats, readDoubles, readBinary) and skip methods
The VectorizedColumnReader change is a single case BYTE_STREAM_SPLIT -> block (12 lines) that resolves the type width from the column descriptor and yields the new reader.
Why are the changes needed?
Before this PR, Spark fell back to parquet-mr's per-value ByteStreamSplitValuesReader for BSS-encoded columns. The new vectorized batch reader is 2.8-4.5x faster on the benchmark.
Does this PR introduce any user-facing change?
No. This is an internal performance optimization. BSS-encoded Parquet columns that were already readable via the parquet-mr fallback are now decoded faster through the vectorized path. No API, configuration, or behavioral changes.
How was this patch tested?
- ParquetByteStreamSplitEncodingSuite.scala: ParquetByteStreamSplitEncodingSuite[T] with 7 shared test cases (roundtrip, nulls, skip, large batches, special values, sequential reads, mixed skip-read)
- [...] assertEqual for bitwise NaN-safe comparison)
- VectorizedByteStreamSplitReaderBenchmark.scala comparing against parquet-mr per-value readers
Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenCode (Claude claude-opus-4.6)