[SPARK-55683][SQL] Optimize VectorizedPlainValuesReader.readUnsignedLongs #54479

LuciferYang wants to merge 1 commit into apache:master

Conversation
Test first; the PR description and performance comparison will be updated later.
Pull request overview
Optimizes Spark SQL’s Parquet vectorized PLAIN decoding for UINT_64 by replacing the per-value BigInteger(String) construction path with direct byte-level conversion, and extends the existing Parquet IO test to cover boundary cases for unsigned 64-bit decoding.
Changes:
- Reworked `VectorizedPlainValuesReader.readUnsignedLongs` to convert little-endian `UINT_64` bytes into BigInteger-compatible big-endian two's-complement byte arrays without `String`/`BigInteger` intermediates.
- Added a helper to produce `BigInteger.toByteArray()`-compatible encodings (zero handling + sign-byte handling).
- Extended `ParquetIOSuite`'s `UINT_64` test data with boundary values (0/1/max/min/-1/-2).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java | Implements the new byte-manipulation decoding path + helper for BigInteger-compatible output. |
| sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala | Extends existing UINT_64 Parquet read test with boundary values to validate the new encoding logic. |
Comments suppressed due to low confidence (2)
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java:278
`putLittleEndianBytesAsBigInteger` still allocates a new `byte[]` per value (`new byte[totalLen]`, plus `new byte[]{0}` for zeros). Since `WritableColumnVector.putByteArray(rowId, value, offset, count)` copies the bytes into `arrayData()`, you can reuse a single scratch buffer (e.g., `byte[9]`) across the loop and pass the appropriate offset/length to avoid per-element heap allocations entirely.
```java
// Zero value: must write [0x00] rather than an empty array.
// BigInteger.ZERO.toByteArray() returns [0x00], and new BigInteger(new byte[0])
// throws NumberFormatException("Zero length BigInteger").
if (msbIndex == offset && src[offset] == 0) {
  c.putByteArray(rowId, new byte[]{0});
  return;
}
// Prepend a 0x00 sign byte if the most significant byte has bit 7 set.
// This matches BigInteger.toByteArray() behavior: positive values whose highest
// magnitude byte has the MSB set are prefixed with 0x00 to distinguish them
// from negative values in two's-complement encoding.
boolean needSignByte = (src[msbIndex] & 0x80) != 0;
int valueLen = msbIndex - offset + 1;
int totalLen = needSignByte ? valueLen + 1 : valueLen;
byte[] dest = new byte[totalLen];
int destOffset = 0;
if (needSignByte) {
  dest[destOffset++] = 0x00;
}
// Reverse byte order: little-endian src → big-endian dest
for (int i = msbIndex; i >= offset; i--) {
  dest[destOffset++] = src[i];
}
c.putByteArray(rowId, dest, 0, totalLen);
```
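The scratch-buffer reuse suggested in the review comment can be sketched as a standalone routine. This is an illustrative sketch, not the merged Spark code: the class and method names are made up for the demo, and instead of writing into a column vector it returns the number of encoded bytes left in the reusable `byte[9]` buffer. The `main` method checks the encoding against `BigInteger.toByteArray()` for the boundary values used in the test.

```java
import java.math.BigInteger;
import java.util.Arrays;

public class UnsignedLongEncoding {
  // Hypothetical sketch: encode the 8 little-endian bytes at
  // src[offset..offset+7] into BigInteger.toByteArray()-compatible big-endian
  // form inside a reusable scratch buffer (worst case 9 bytes: sign + value).
  // Returns the number of valid bytes written at scratch[0..len-1].
  static int encodeUnsignedLong(byte[] src, int offset, byte[] scratch) {
    // Find the most significant non-zero byte (little-endian: highest index).
    int msbIndex = offset + 7;
    while (msbIndex > offset && src[msbIndex] == 0) {
      msbIndex--;
    }
    // Zero value: BigInteger.ZERO.toByteArray() is [0x00], never empty.
    if (msbIndex == offset && src[offset] == 0) {
      scratch[0] = 0;
      return 1;
    }
    int len = 0;
    // Prepend a 0x00 sign byte if bit 7 of the top magnitude byte is set,
    // so the two's-complement encoding stays positive.
    if ((src[msbIndex] & 0x80) != 0) {
      scratch[len++] = 0;
    }
    // Reverse byte order: little-endian src -> big-endian scratch.
    for (int i = msbIndex; i >= offset; i--) {
      scratch[len++] = src[i];
    }
    return len;
  }

  public static void main(String[] args) {
    byte[] scratch = new byte[9]; // allocated once, reused for every value
    for (long v : new long[]{0L, 1L, Long.MAX_VALUE, Long.MIN_VALUE, -2L, -1L}) {
      byte[] le = new byte[8];
      for (int i = 0; i < 8; i++) {
        le[i] = (byte) (v >>> (8 * i)); // little-endian, as in Parquet PLAIN
      }
      int len = encodeUnsignedLong(le, 0, scratch);
      byte[] expected = new BigInteger(Long.toUnsignedString(v)).toByteArray();
      if (!Arrays.equals(expected, Arrays.copyOf(scratch, len))) {
        throw new AssertionError("mismatch for " + v);
      }
    }
    System.out.println("all boundary values match BigInteger.toByteArray()");
  }
}
```

Reuse is safe only because the consumer (`putByteArray` in the real code) copies the bytes out before the next iteration overwrites the buffer.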
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala:1275
- The boundary values sequence is duplicated here and again when constructing `boundaryExpected` below. Consider defining a single `val boundaryValues = Seq(...)` and reusing it for both writing and expected rows to avoid accidental drift if the list changes in the future.
```scala
// Boundary values: zero, one, signed extremes interpreted as unsigned
Seq(0L, 1L, Long.MaxValue, Long.MinValue, -2L, -1L).foreach { v =>
  val group = factory.newGroup().append("a", v)
  writer.write(group)
}
```
Merged to master for Apache Spark 4.2.0. Thank you, @LuciferYang and @pan3793.
This suggestion from Copilot sounds good. Let me try it out to see the effect. If there are benefits, I'll submit a FOLLOWUP.
It has shown stable optimization effects for onHeap, and I've opened a follow-up:
…adUnsignedLongs` to reuse scratch buffer and avoid per-element allocations

### What changes were proposed in this pull request?

This PR adopts the suggestion from Copilot in #54479 (review): it further optimizes `VectorizedPlainValuesReader.readUnsignedLongs` by introducing a reusable scratch buffer to eliminate the per-element `byte[]` allocations introduced in the previous refactoring.

The previous implementation allocates a new `byte[]` per element for the encoded output:

```java
// Previous: new byte[totalLen] per element, plus new byte[]{0} for zero values
byte[] dest = new byte[totalLen];
...
c.putByteArray(rowId, dest, 0, totalLen);
```

The new implementation allocates a single `byte[9]` scratch buffer once per batch and reuses it across all elements. Since `WritableColumnVector.putByteArray` copies the bytes into its internal storage immediately, the scratch buffer can be safely overwritten on the next iteration:

```java
// New: one byte[9] allocated per batch, reused for every element
byte[] scratch = new byte[9];
for (...) {
  putLittleEndianBytesAsBigInteger(c, rowId, src, offset, scratch);
}
```

The scratch buffer is sized at 9 bytes to accommodate the worst case: 1 `0x00` sign byte + 8 value bytes. The zero value special case is also handled via the scratch buffer, avoiding the previous `new byte[]{0}` allocation.

### Why are the changes needed?

The previous implementation still allocates one `byte[]` per element for the encoded output. For a typical batch of 4096 values this means 4096 heap allocations per `readUnsignedLongs` call, creating GC pressure in workloads that read large `UINT_64` columns. With the scratch buffer approach, the entire batch produces only 2 allocations regardless of batch size: `byte[9]` (scratch) and `byte[8]` (direct buffer fallback read buffer).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
- Pass GitHub Actions
- Reused the JMH benchmark provided in #54479; the test results are as follows:

Java 17

```
[info] Benchmark                                                              (numValues)  Mode  Cnt       Score      Error  Units
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_offHeap_New     10000000  avgt   10  233820.658 ± 1888.523  us/op
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_offHeap_Old     10000000  avgt   10  255563.248 ± 3500.165  us/op
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_onHeap_New      10000000  avgt   10  228672.684 ± 2985.496  us/op
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_onHeap_Old      10000000  avgt   10  275756.804 ± 2065.405  us/op
```

Java 21

```
[info] Benchmark                                                              (numValues)  Mode  Cnt       Score       Error  Units
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_offHeap_New     10000000  avgt   10  241977.924 ± 15125.343  us/op
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_offHeap_Old     10000000  avgt   10  250343.470 ±  1342.509  us/op
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_onHeap_New      10000000  avgt   10  212929.948 ±  1387.671  us/op
[info] VectorizedPlainValuesReaderJMHBenchmark.readUnsignedLongs_onHeap_Old      10000000  avgt   10  274561.949 ±  1226.348  us/op
```

Judging from the test results, the onHeap path demonstrates approximately a 17-22% improvement, while the offHeap path shows roughly a 3-9% improvement across Java 17 and Java 21.

### Was this patch authored or co-authored using generative AI tooling?

Yes, Claude Sonnet 4.6 was used to assist in completing the code writing.

Closes #54510 from LuciferYang/SPARK-55683-FOLLOWUP.

Lead-authored-by: yangjie01 <yangjie01@baidu.com>
Co-authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
This PR optimizes `VectorizedPlainValuesReader.readUnsignedLongs` by replacing the per-element `BigInteger` heap allocation chain with direct byte manipulation.

The original implementation allocates multiple objects per element:
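A sketch of that per-element pattern (illustrative only, not the exact removed Spark code; `raw` stands in for the 8 bytes read from the page as a signed long):

```java
import java.math.BigInteger;

public class OldPathSketch {
  public static void main(String[] args) {
    long raw = -1L; // bit pattern 0xFFFFFFFFFFFFFFFF, i.e. UINT_64 max
    // Per element, this path allocates a String, a BigInteger (plus its
    // internal int[] magnitude array), and the final byte[]:
    String s = Long.toUnsignedString(raw);  // "18446744073709551615"
    BigInteger big = new BigInteger(s);     // parses the decimal string back
    byte[] bytes = big.toByteArray();       // 9 bytes: 0x00 sign byte + 8 value bytes
    System.out.println(s + " -> " + bytes.length + " bytes");
  }
}
```

The round-trip through a decimal `String` is what makes this path expensive; the value was already available as raw bytes before it was formatted.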
The new implementation reads raw little-endian bytes directly from the `ByteBuffer` backing array (when available) and converts them to `BigInteger`-compatible big-endian encoding in a single pass. The private helper `putLittleEndianBytesAsBigInteger` handles the conversion with output matching `BigInteger.toByteArray()` semantics:

- Zero values are encoded as `[0x00]` (1 byte) rather than an empty array, since `new BigInteger(new byte[0])` throws `NumberFormatException`
- A leading `0x00` sign byte is prepended when the most significant byte has bit 7 set, to ensure the value is interpreted as positive by `BigInteger`
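Both edge cases can be confirmed directly against the JDK, independently of the Spark code (the class name here is just for the demo):

```java
import java.math.BigInteger;
import java.util.Arrays;

public class ToByteArraySemantics {
  static void check(boolean cond, String msg) {
    if (!cond) throw new AssertionError(msg);
  }

  public static void main(String[] args) {
    // Zero encodes as a single 0x00 byte, never as an empty array.
    check(Arrays.equals(BigInteger.ZERO.toByteArray(), new byte[]{0}), "zero");
    // An empty two's-complement array is rejected outright.
    try {
      new BigInteger(new byte[0]);
      check(false, "expected NumberFormatException");
    } catch (NumberFormatException expected) {
      // thrown with message "Zero length BigInteger"
    }
    // 255 sets bit 7 of its magnitude byte, so a 0x00 sign byte is prepended...
    check(Arrays.equals(BigInteger.valueOf(255).toByteArray(),
        new byte[]{0, (byte) 0xFF}), "sign byte for 255");
    // ...while 127 needs no sign byte.
    check(Arrays.equals(BigInteger.valueOf(127).toByteArray(),
        new byte[]{127}), "no sign byte for 127");
    System.out.println("toByteArray() semantics confirmed");
  }
}
```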
The original implementation constructs a `BigInteger` via `Long.toUnsignedString` + `new BigInteger(String)`, which involves per-element allocations of a `String`, a `BigInteger`, its internal `int[]` magnitude array, and the final `byte[]`. For a typical batch of 4096 values this means ~16K object allocations, creating significant GC pressure in workloads reading large `UINT_64` columns.

The new implementation reduces this to one `byte[]` allocation per element by operating directly on the raw bytes from the `ByteBuffer`, avoiding all intermediate object creation. Additionally, the direct buffer fallback path reuses a single `byte[8]` scratch buffer across the entire batch.

Does this PR introduce any user-facing change?
No.
How was this patch tested?
- The test `SPARK-34817: Read UINT_64 as Decimal from parquet` in `ParquetIOSuite` was extended with boundary values covering the critical edge cases of the new byte manipulation logic
- Kept the original implementation as `OldVectorizedPlainValuesReader`, and compared the latency of the old and new `readUnsignedLongs` methods using JMH

Run `build/sbt "sql/Test/runMain org.apache.spark.sql.execution.datasources.parquet.VectorizedPlainValuesReaderJMHBenchmark"` to conduct the test.

Benchmark results:
Both the onHeap and offHeap paths show approximately an 8x improvement.
Was this patch authored or co-authored using generative AI tooling?
Yes, Claude Sonnet 4.6 was used to assist in completing the code writing.