[SPARK-57036][SQL] Use intrinsic bulk-fill APIs for constant-value WritableColumnVector methods #56082
Open
viirya wants to merge 2 commits into
…itableColumnVector methods
Six bulk-fill methods on the column vectors implement constant-value
fills with degenerate per-element loops:
OnHeapColumnVector:
putBooleans(int rowId, int count, boolean value)
putBytes(int rowId, int count, byte value)
putShorts(int rowId, int count, short value)
putLongs(int rowId, int count, long value)
OffHeapColumnVector:
putBooleans(int rowId, int count, boolean value)
putBytes(int rowId, int count, byte value)
Replace them with intrinsic substitutions:
- OnHeap variants -> Arrays.fill on the typed array.
- OffHeap variants -> Platform.setMemory with a small-count fallback
to an inline byte loop, gated by a SET_MEMORY_THRESHOLD of 128.
Below the threshold, the fixed JNI cost of Unsafe.setMemory
outweighs its benefit and the inline loop wins; at or above it,
setMemory wins, with gains reaching ~10x at count=4096+.
Also adds WritableColumnVectorBulkFillBenchmark for measuring the
constant-value bulk-fill APIs across a count sweep (1, 8, 64, 512,
4096, 65536), covering both OnHeap and OffHeap paths. This is the
benchmark used to produce the numbers in the PR description.
OffHeap multi-byte fills (putShorts / putInts / putLongs / putFloats /
putDoubles) are out of scope: Platform.setMemory is byte-only and a
value=0 short-circuit alternative was tried and showed no measurable
gain on Apple M4 Max + OpenJDK 21.
Co-authored-by: Claude Code
….13, split 1 of 1)
peter-toth
approved these changes
May 24, 2026
Thank you @peter-toth
What changes were proposed in this pull request?
Six bulk-fill methods on the column vectors implement constant-value
fills with degenerate per-element loops. This PR replaces them with
intrinsic substitutions:
- OnHeapColumnVector.putBooleans(rowId, count, value) -> Arrays.fill(byte[], ..., (byte) v)
- OnHeapColumnVector.putBytes(rowId, count, value) -> Arrays.fill(byte[], ...)
- OnHeapColumnVector.putShorts(rowId, count, value) -> Arrays.fill(short[], ...)
- OnHeapColumnVector.putLongs(rowId, count, value) -> Arrays.fill(long[], ...)
- OffHeapColumnVector.putBooleans(rowId, count, value) -> Platform.setMemory with small-count fallback
- OffHeapColumnVector.putBytes(rowId, count, value) -> Platform.setMemory with small-count fallback

The two OffHeap methods share a SET_MEMORY_THRESHOLD = 128 constant.
Below the threshold, an inline byte loop avoids the JNI fixed cost of
Unsafe.setMemory; at or above, setMemory dominates and the gain
accelerates rapidly.
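The threshold gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: a heap byte[] stands in for off-heap memory so the example runs without Unsafe, and ThresholdFillSketch, bulkFill, and bulkCalls are hypothetical names introduced only for this sketch. Only the SET_MEMORY_THRESHOLD value of 128 comes from the PR.

```java
import java.util.Arrays;

/** Sketch of a threshold-gated constant fill; a byte[] stands in for off-heap memory. */
final class ThresholdFillSketch {
    // Same value as the PR's SET_MEMORY_THRESHOLD.
    static final int SET_MEMORY_THRESHOLD = 128;

    // Hypothetical counter recording how often the bulk path ran (demo only).
    static int bulkCalls = 0;

    /** Hypothetical stand-in for a bulk native fill such as Platform.setMemory. */
    static void bulkFill(byte[] mem, int offset, int count, byte value) {
        bulkCalls++;
        Arrays.fill(mem, offset, offset + count, value);
    }

    /** Fills count bytes starting at rowId, choosing the path by count. */
    static void putBytes(byte[] mem, int rowId, int count, byte value) {
        if (count < SET_MEMORY_THRESHOLD) {
            // Small fill: a plain loop avoids the fixed cost of the bulk call.
            for (int i = 0; i < count; i++) {
                mem[rowId + i] = value;
            }
        } else {
            // Large fill: the fixed cost is amortized over many bytes.
            bulkFill(mem, rowId, count, value);
        }
    }

    public static void main(String[] args) {
        byte[] mem = new byte[1024];
        putBytes(mem, 0, 8, (byte) 1);      // below the threshold: inline loop
        putBytes(mem, 100, 512, (byte) 2);  // at or above the threshold: bulk path
        System.out.println("bulk calls: " + bulkCalls);
    }
}
```

Both paths write the same bytes; the gate only changes which mechanism does the writing, which is why the change is invisible to callers.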
This PR also adds WritableColumnVectorBulkFillBenchmark to measure
these constant-value bulk-fill APIs across a count sweep covering both
the small-count (call-overhead dominated) and large-count (memory
bandwidth dominated) regimes.
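To make the shape of such a sweep concrete, here is a rough timing sketch. It is not the WritableColumnVectorBulkFillBenchmark from this PR: FillSweepSketch and measure are hypothetical names, the fill target is a plain byte[], and there is no warm-up or statistical treatment, so its numbers are only indicative. The count values are the ones listed in the commit message.

```java
import java.util.Arrays;

/** Rough sketch of a count sweep that reports a fill rate in M elements/s. */
final class FillSweepSketch {
    // Count sweep taken from the commit message.
    static final int[] COUNTS = {1, 8, 64, 512, 4096, 65536};

    /** Fills count bytes iters times (iters >= 1) and returns M elements/s. */
    static double measure(int count, int iters) {
        byte[] data = new byte[count];
        long start = System.nanoTime();
        for (int it = 0; it < iters; it++) {
            Arrays.fill(data, 0, count, (byte) it);
        }
        long elapsedNs = Math.max(1L, System.nanoTime() - start);
        // Read one element back so the fills have an observable result.
        if (data[count - 1] != (byte) (iters - 1)) {
            throw new IllegalStateException("unexpected fill result");
        }
        // elements per nanosecond times 1e3 gives millions of elements per second.
        return (double) count * iters / elapsedNs * 1e3;
    }

    public static void main(String[] args) {
        for (int count : COUNTS) {
            // Keep the total work per data point roughly constant.
            int iters = Math.max(1, (1 << 24) / count);
            System.out.printf("count=%d rate=%.1f M/s%n", count, measure(count, iters));
        }
    }
}
```

At small counts the per-call overhead dominates the measured rate, while at large counts the rate approaches what the memory system can sustain, which is the distinction the sweep is meant to expose.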
Why are the changes needed?
The bulk-fill APIs on WritableColumnVector are the natural call to
make from any column writer, but their implementations were per-element
loops. Switching to intrinsics:
- Arrays.fill is backed by HotSpot's _jbyte_fill / _jshort_fill /
  _jlong_fill intrinsic stubs.
- Unsafe.setMemory lowers to a native memset. For OffHeap byte
  fills the original per-byte Platform.putByte loop cannot be
  vectorized through the JNI call, so the gain is dramatic at large
  counts.
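For the OnHeap side, the substitution amounts to the following. This is a minimal sketch assuming the vector stores booleans as 0/1 bytes in a byte[]; OnHeapFillSketch and its method names are hypothetical and only mirror the shape of the real methods.

```java
import java.util.Arrays;

/** Sketch of replacing a per-element constant fill with Arrays.fill. */
final class OnHeapFillSketch {
    /** Original style: one store per element. */
    static void putBooleansLoop(byte[] data, int rowId, int count, boolean value) {
        byte v = (byte) (value ? 1 : 0);
        for (int i = 0; i < count; i++) {
            data[rowId + i] = v;
        }
    }

    /** Replacement style: a single ranged fill; the end index is exclusive. */
    static void putBooleansFill(byte[] data, int rowId, int count, boolean value) {
        Arrays.fill(data, rowId, rowId + count, (byte) (value ? 1 : 0));
    }

    /** The same pattern applies to wider element types such as long. */
    static void putLongsFill(long[] data, int rowId, int count, long value) {
        Arrays.fill(data, rowId, rowId + count, value);
    }

    public static void main(String[] args) {
        byte[] a = new byte[16];
        byte[] b = new byte[16];
        putBooleansLoop(a, 4, 8, true);
        putBooleansFill(b, 4, 8, true);
        System.out.println("same result: " + Arrays.equals(a, b));
    }
}
```

Because Arrays.fill takes an exclusive end index, the range is written as rowId to rowId + count; elements outside that range are left untouched, matching the loop.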
Benchmark numbers (GitHub Actions, JDK 17, Scala 2.13)
Measured by running WritableColumnVectorBulkFillBenchmark via the
Run benchmarks workflow on both the baseline (#56084) and this PR's
branch, so the two runs use identical hardware and JDK. Rate (M
elements/s):
OffHeap byte fills (putBytes / putBooleans), the headline win:
(Numbers averaged across putBytes and putBooleans since they share
the same code path.)
At and above the 128-element threshold, setMemory produces a 7-16x
improvement that grows with run length, consistent with the fixed call
cost of memset being amortized over long fills. Below the threshold,
both runs use the same inline byte loop, so the small differences at
count=1 and count=8 are GHA run-to-run variance rather than a
structural change.

OnHeap fills: on the GHA runner (Linux + Zulu JDK 17) the C2
compiler already auto-vectorizes the original byte loop near the byte
memory-bandwidth ceiling, so Arrays.fill is at parity (~2,790 M/s,
unchanged across putBooleans / putBytes / putShorts / putLongs,
all counts, both baseline and patched). On Apple M4 Max + OpenJDK 21
the same change yields +5-33% in the small/medium count range. The
OnHeap changes are kept for consistency with the OffHeap fixes and to
avoid future divergence between platforms.
OffHeap multi-byte fills (putShorts / putInts / putLongs /
putFloats / putDoubles) are out of scope: Platform.setMemory is
byte-only and a value=0 short-circuit alternative was tried and showed
no measurable gain.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests; no behavior change. Ran locally:
- VectorizedRleValuesReaderSuite
- ColumnVectorSuite
- ColumnarBatchSuite
- ParquetIOSuite

237 tests, all pass.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.7)