[SPARK-56522][SQL] Batch PACKED null/non-null runs in VectorizedRleValuesReader #55386
LuciferYang wants to merge 7 commits into apache:master
Conversation
Let me update the benchmark results to confirm the effect. I'll add the PR description later.
// package-private access across runtime packages. Reflection with setAccessible sidesteps
// the check without widening production visibility.

private val stateCls = SparkClassUtils.classForName[Any](
Refactor common code into VectorizedRleValuesReaderTestUtils and ParquetReadStateTestAccess to enable reuse in VectorizedRleValuesReaderSuite.
nullRatio=0.5, clustered                6      6      0     170.5     5.9     0.0X
nullRatio=0.9, random                   8      8      0     137.2     7.3     0.0X
nullRatio=0.9, clustered                6      6      0     183.6     5.4     0.0X
nullRatio=1.0, random                   0      0      0    2832.8     0.4     0.4X
The performance has regressed; we need to investigate.
### What changes were proposed in this pull request?
This PR introduces a batched PACKED decode path for the nullable column
read loop in `VectorizedRleValuesReader`, replacing the per-row virtual
dispatch with a run-aware, value-side batched loop. Two methods are
changed:
* `readBatchInternal` (flat nullable columns, no def-level
materialization)
* `readValuesN` (nullable columns with def-level materialization,
used for nested or wide-schema reads)
Both PACKED branches are extracted into private helpers
(`readPackedBatch`, `readPackedBatchWithDefLevels`) so that the
enclosing methods stay small enough for C2 to inline the RLE fast path
into its caller.
The PACKED helpers:
* scan `currentBuffer` for contiguous runs of identical def-level
values
* dispatch **one** batched value read per non-null run
(`updater.readValues(runLen, ...)`), with a `runLen == 1` fast path
calling the singular `updater.readValue(...)`
* write nulls and def-level values per row (`putNull` /
`putInt`) — deliberately NOT batched via `putNulls` / `putInts`
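The run-scanning shape described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual Spark code: the `decode` method, its call-log strings, and the plain `int[]` def-level buffer are hypothetical stand-ins for `WritableColumnVector`, `ParquetVectorUpdater`, and `currentBuffer`.

```java
import java.util.ArrayList;
import java.util.List;

public class PackedRunSketch {
  /**
   * Scans defLevels[start, start + total) for contiguous runs of identical
   * values. A non-null run (level == maxDefLevel) produces ONE batched value
   * read, with a singular fast path when the run has length 1. Null rows are
   * logged one by one, mirroring the per-row putNull/putInt writes the PR
   * deliberately keeps.
   */
  public static List<String> decode(int[] defLevels, int start, int total, int maxDefLevel) {
    List<String> calls = new ArrayList<>();
    int i = start;
    int end = start + total;
    while (i < end) {
      int level = defLevels[i];
      int runEnd = i + 1;
      while (runEnd < end && defLevels[runEnd] == level) runEnd++;
      int runLen = runEnd - i;
      if (level == maxDefLevel) {
        if (runLen == 1) {
          calls.add("readValue(" + i + ")");                    // singular fast path
        } else {
          calls.add("readValues(" + runLen + ", " + i + ")");   // one batched read per run
        }
      } else {
        for (int r = i; r < runEnd; r++) {
          calls.add("putNull(" + r + ")");                      // per-row, by design
        }
      }
      i = runEnd;
    }
    return calls;
  }
}
```

For a buffer like `{1, 1, 0, 1, 0, 0, 1}` this produces one batched read for the leading pair, singular reads for the isolated values, and per-row null writes in between.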
### Why is the null/def-level write per-row instead of batched?
The RLE branch already calls `putNulls(offset, n)` and
`defLevels.putInts(offset, n, value)` with `n = 4096`. HotSpot compiles
each of those methods with a global method-level profile. Introducing
PACKED call sites with short `runLen` (≤ 7, since the Parquet hybrid
encoder never emits PACKED runs of 8+) would average the count profile
over small and large values, causing C2 to drop count-specialised
optimisations and slow the RLE fast path by 3-4x. Keeping those writes
per-row in PACKED preserves the monomorphic profile for the RLE hot
path while still delivering the value-side batching win.
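The "never emits PACKED runs of 8+" observation follows from the Parquet RLE/bit-packed hybrid format: each run begins with a varint header whose low bit selects the encoding, bit-packed runs always cover a multiple of 8 values, and the reference encoder switches to RLE once a value repeats 8 times, so identical-value runs left inside PACKED data stay short. A minimal header-interpretation sketch (the `describeRun` helper is hypothetical; the varint is assumed already decoded to an `int`):

```java
public class HybridHeaderSketch {
  /**
   * Interprets one run header from the Parquet RLE/bit-packed hybrid encoding:
   *   LSB == 1 -> bit-packed run of (header >>> 1) groups of 8 values
   *   LSB == 0 -> RLE run of (header >>> 1) repeated values
   */
  public static String describeRun(int header) {
    if ((header & 1) == 1) {
      int groups = header >>> 1;
      return "PACKED, " + (groups * 8) + " values";
    }
    return "RLE, " + (header >>> 1) + " repeats";
  }
}
```

So a header of 5 (binary 101) announces a bit-packed run of 2 groups, i.e. 16 values, while a header of 8 announces an RLE run of 4 repeats.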
### Test coverage
Adds `VectorizedRleValuesReaderSuite` (14 tests) exercising RLE and
PACKED across on-heap and off-heap `WritableColumnVector` modes,
including multi-page, cross-batch, row-index filtering, and buffer-grow
scenarios.
Adds `VectorizedRleValuesReaderTestUtils` and
`ParquetReadStateTestAccess` — shared test helpers used by both the new
suite and the existing `VectorizedRleValuesReaderBenchmark`
(introduced in SPARK-47741). The benchmark is refactored to consume
these utilities (no functional change).
### Benchmark impact
Measured with `VectorizedRleValuesReaderBenchmark` on GHA
(AMD EPYC 7763, JDK 17 / 21 / 25).
RLE fast paths (nullRatio = 0.0 / 1.0): within ±1% of master on all
three JDKs.
PACKED clustered workloads (JDK 21):
* `nullable without def-levels, null=0.1 clustered`: +84%
* `nullable without def-levels, null=0.3 clustered`: +73%
* `nullable without def-levels, null=0.5 clustered`: +66%
* `nullable without def-levels, null=0.9 clustered`: +51%
* `nullable with def-levels, null=0.1..0.9 clustered`: +22% to +61%
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
* Added `VectorizedRleValuesReaderSuite`
* Ran the existing `VectorizedRleValuesReaderBenchmark` on JDK 17,
21, and 25 via GHA to capture result files
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7
…uet.VectorizedRleValuesReaderBenchmark (JDK 17, Scala 2.13, split 1 of 1)
…uet.VectorizedRleValuesReaderBenchmark (JDK 21, Scala 2.13, split 1 of 1)
…uet.VectorizedRleValuesReaderBenchmark (JDK 25, Scala 2.13, split 1 of 1)
reused reader, trueRatio=0.5            1      1      0     833.3     1.2     0.0X
cold reader, trueRatio=0.9              1      1      0     753.3     1.3     0.0X
reused reader, trueRatio=0.9            1      1      0     753.6     1.3     0.0X
cold reader, trueRatio=1.0              0      0      0   25165.0     0.0     1.0X
Is this improved as a side-effect also?
Not a side-effect of this PR. readBooleans (Group A) is untouched — the PR only changes the PACKED branches in readBatchInternal / readValuesN. This case reports Best Time(ms) = 0 (sub-ms per 1M-row iteration) and is sensitive to first-case JIT warmup. Master captured 4201, this run captured 25165. Adjacent PACKED cases (trueRatio=0.1/0.5/0.9) are unchanged between runs (755 vs 754, 835 vs 833).
nullRatio=0.5, clustered                5      5      0     194.7     5.1     0.0X
nullRatio=0.9, random                   7      7      0     151.9     6.6     0.0X
nullRatio=0.9, clustered                5      5      0     200.5     5.0     0.0X
nullRatio=1.0, random                   0      0      0   11945.0     0.1     1.0X
So, this PR improves this?
No PR-caused improvement. This is the RLE all-null fast path in readBatchInternal, byte-identical to master. The same case measures across runs:
| JDK | master | this PR |
|---|---|---|
| 17 | 11580 | 11675 |
| 21 | 6191 | 11945 |
| 25 | 12016 | 11957 |
JDK 21 master's 6191 is the single outlier among six observations; this PR's 11945 matches the ~11500-12000 band seen on JDK 17 / 25.
reused reader, trueRatio=0.9            1      2      0     706.8     1.4     0.1X
cold reader, trueRatio=1.0              0      0      0   59098.0     0.0     9.8X
reused reader, trueRatio=1.0            0      0      0   57130.7     0.0     9.4X
cold reader, trueRatio=0.0              0      0      0   60530.9     0.0     1.0X
The baseline seems to be faster.
The baseline is actually the slower side here — master captured 6055, this PR captured 60530. Same first-case sub-ms JIT-warmup variance (Best Time(ms) = 0). readBooleans code path is unchanged by this PR.
cold reader, trueRatio=0.9              1      1      0     709.0     1.4     0.0X
reused reader, trueRatio=0.9            1      2      0     705.6     1.4     0.0X
cold reader, trueRatio=1.0              0      0      0   59164.7     0.0     1.0X
reused reader, trueRatio=1.0            0      0      0   58832.7     0.0     1.0X
Although the relative figure drops from 9x to 1x, the baseline itself improved, so this does not seem to be a regression.
reused reader, trueRatio=0.9            1      1      0     891.6     1.1     0.0X
cold reader, trueRatio=1.0              0      0      0   67001.7     0.0     0.9X
reused reader, trueRatio=1.0            0      0      0   72126.6     0.0     1.0X
cold reader, trueRatio=0.0              0      0      0   47464.1     0.0     1.0X
The baseline seems to have changed somehow; the ratio itself is not a problem.
/**
 * PACKED-branch decode for {@link #readBatchInternal}. Extracted into a separate method so
 * C2 can keep {@link #readBatchInternal}'s RLE fast path small and inlineable. Scans for
Thank you for adding this purpose.
Thanks for reviewing! The design-intent comment is there because the per-row null / def-level write looks like a missed batching opportunity at first glance — wanted to make the rationale explicit for future readers.
Merged to master for Apache Spark 4.2.0.

Thank you @dongjoon-hyun ~
cloud-fan left a comment:
Tiny post-merge nit in case of a future follow-up — otherwise safe to ignore.
 * Each batch uses a fresh output vector since `state.valueOffset` resets to 0 per batch —
 * mirrors production where `VectorizedColumnReader` hands in a batch-sized vector.
The em-dash — (U+2014) is non-ASCII. Spark's convention (per dev docs / scalastyle) is ASCII-only in code and comments. Tiny post-merge nit — easy to tidy up alongside the planned pre-warm follow-up.
Suggested change:
Before:
 * Each batch uses a fresh output vector since `state.valueOffset` resets to 0 per batch —
 * mirrors production where `VectorizedColumnReader` hands in a batch-sized vector.
After:
 * Each batch uses a fresh output vector since `state.valueOffset` resets to 0 per batch,
 * mirroring production where `VectorizedColumnReader` hands in a batch-sized vector.
Thanks for catching this. Fixed in a follow-up — branch SPARK-56522-ascii-nit, I'll open the PR shortly.
…ctorizedRleValuesReaderSuite` docstring

### What changes were proposed in this pull request?
Replace a non-ASCII em-dash (U+2014) in the `runAndAssert` docstring of `VectorizedRleValuesReaderSuite` with an ASCII comma, following `cloud-fan`'s post-merge review comment on PR #55386.

### Why are the changes needed?
Spark's convention is ASCII-only in code and comments.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Grep for non-ASCII in the file returns empty after the change.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #55499 from LuciferYang/SPARK-56522-ascii-nit.

Authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?
Rewrite the PACKED branches of `readBatchInternal` and `readValuesN` in `VectorizedRleValuesReader` to scan `currentBuffer` for contiguous runs and batch the value-side reads, replacing the per-row virtual dispatch.
* One `updater.readValues(runLen, ...)` call per non-null run, with a `runLen == 1` fast path calling `updater.readValue(...)`.
* Null and def-level writes stay per-row (`putNull` / `putInt`). Batching them via `putNulls` / `putInts` with short counts (PACKED runs are capped at 7 by the Parquet hybrid encoder) pollutes those methods' shared JIT profile: the RLE branch calls them with `n = 4096` and loses its count-specialized code, regressing the RLE fast path by 3-4x.
* The PACKED branches are extracted into `readPackedBatch` / `readPackedBatchWithDefLevels` so the enclosing methods stay small enough for C2 to inline the RLE fast path.

### Why are the changes needed?
`VectorizedRleValuesReader` is on the critical path of every nullable Parquet column read. In PACKED mode (scattered-null columns) the original code dispatches per element through a virtual call. Batching the value side converts O(n) virtual dispatch into O(runs) batch calls while leaving the RLE fast path untouched.
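The O(n)-to-O(runs) claim can be made concrete with a small counting sketch. This is a hypothetical helper, not Spark code: for a def-level buffer it counts the value-side dispatches the old per-row loop would make against those the new run-batched loop makes.

```java
public class DispatchCountSketch {
  /**
   * Returns {perRowDispatches, perRunDispatches} for one def-level buffer:
   * the old scheme makes one virtual call per non-null row, the new scheme
   * one batched call per contiguous non-null run.
   */
  public static int[] countDispatches(int[] defLevels, int maxDefLevel) {
    int perRow = 0;
    int perRun = 0;
    boolean inRun = false;
    for (int level : defLevels) {
      if (level == maxDefLevel) {
        perRow++;                              // old: one call per non-null row
        if (!inRun) {
          perRun++;                            // new: one call per run start
          inRun = true;
        }
      } else {
        inRun = false;                         // a null ends the current run
      }
    }
    return new int[]{perRow, perRun};
  }
}
```

For clustered nulls the run count is far below the row count, which is where the batched path wins; for fully random nulls runs degenerate toward length 1 and the two counts converge.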
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
* Added `VectorizedRleValuesReaderSuite`.

**Benchmark**
* Ran `VectorizedRleValuesReaderBenchmark` on GHA (AMD EPYC 7763, JDK 17). All three JDK result files are updated in this PR; JDK 21 and JDK 25 show comparable results.
* RLE fast paths stay at baseline.
* PACKED (per-row ns, lower is better): short-run random cases at mid null ratios (0.3 / 0.5) are roughly flat; short runs benefit little from batching.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7