[SPARK-56522][SQL] Batch PACKED null/non-null runs in VectorizedRleValuesReader #55386
LuciferYang wants to merge 7 commits into apache:master
Conversation
Let me update the benchmark results to confirm the effect. I'll add the PR description later.
// package-private access across runtime packages. Reflection with setAccessible sidesteps
// the check without widening production visibility.

private val stateCls = SparkClassUtils.classForName[Any](
Refactor common code into VectorizedRleValuesReaderTestUtils and ParquetReadStateTestAccess to enable reuse in VectorizedRleValuesReaderSuite.
nullRatio=0.5, clustered                6      6      0     170.5     5.9     0.0X
nullRatio=0.9, random                   8      8      0     137.2     7.3     0.0X
nullRatio=0.9, clustered                6      6      0     183.6     5.4     0.0X
nullRatio=1.0, random                   0      0      0    2832.8     0.4     0.4X
The performance has regressed; we need to investigate.
### What changes were proposed in this pull request?
This PR introduces a batched PACKED decode path for the nullable column
read loop in `VectorizedRleValuesReader`, replacing the per-row virtual
dispatch with a run-aware, value-side batched loop. Two methods are
changed:
* `readBatchInternal` (flat nullable columns, no def-level
materialization)
* `readValuesN` (nullable columns with def-level materialization,
used for nested or wide-schema reads)
Both PACKED branches are extracted into private helpers
(`readPackedBatch`, `readPackedBatchWithDefLevels`) so that the
enclosing methods stay small enough for C2 to inline the RLE fast path
into its caller.
The PACKED helpers:
* scan `currentBuffer` for contiguous runs of identical def-level
values
* dispatch **one** batched value read per non-null run
(`updater.readValues(runLen, ...)`), with a `runLen == 1` fast path
calling the singular `updater.readValue(...)`
* write nulls and def-level values per row (`putNull` /
`putInt`) — deliberately NOT batched via `putNulls` / `putInts`
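The run-scanning shape described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual Spark code: the `decode` method, its call-log strings, and the plain `int[]` def-level buffer are hypothetical stand-ins for `WritableColumnVector`, `ParquetVectorUpdater`, and `currentBuffer`.

```java
import java.util.ArrayList;
import java.util.List;

public class PackedRunSketch {
  /**
   * Scans defLevels[start, start + total) for contiguous runs of identical
   * values. A non-null run (level == maxDefLevel) produces ONE batched value
   * read, with a singular fast path when the run has length 1. Null rows are
   * logged one by one, mirroring the per-row putNull/putInt writes the PR
   * deliberately keeps.
   */
  public static List<String> decode(int[] defLevels, int start, int total, int maxDefLevel) {
    List<String> calls = new ArrayList<>();
    int i = start;
    int end = start + total;
    while (i < end) {
      int level = defLevels[i];
      int runEnd = i + 1;
      while (runEnd < end && defLevels[runEnd] == level) runEnd++;
      int runLen = runEnd - i;
      if (level == maxDefLevel) {
        if (runLen == 1) {
          calls.add("readValue(" + i + ")");                    // singular fast path
        } else {
          calls.add("readValues(" + runLen + ", " + i + ")");   // one batched read per run
        }
      } else {
        for (int r = i; r < runEnd; r++) {
          calls.add("putNull(" + r + ")");                      // per-row, by design
        }
      }
      i = runEnd;
    }
    return calls;
  }
}
```

For a buffer like `{1, 1, 0, 1, 0, 0, 1}` this produces one batched read for the leading pair, singular reads for the isolated values, and per-row null writes in between.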
### Why is the null/def-level write per-row instead of batched?
The RLE branch already calls `putNulls(offset, n)` and
`defLevels.putInts(offset, n, value)` with `n = 4096`. HotSpot compiles
each of those methods with a global method-level profile. Introducing
PACKED call sites with short `runLen` (≤ 7, since the Parquet hybrid
encoder never emits PACKED runs of 8+) would average the count profile
over small and large values, causing C2 to drop count-specialised
optimisations and slow the RLE fast path by 3-4x. Keeping those writes
per-row in PACKED preserves the monomorphic profile for the RLE hot
path while still delivering the value-side batching win.
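The "never emits PACKED runs of 8+" observation follows from the Parquet RLE/bit-packed hybrid format: each run begins with a varint header whose low bit selects the encoding, bit-packed runs always cover a multiple of 8 values, and the reference encoder switches to RLE once a value repeats 8 times, so identical-value runs left inside PACKED data stay short. A minimal header-interpretation sketch (the `describeRun` helper is hypothetical; the varint is assumed already decoded to an `int`):

```java
public class HybridHeaderSketch {
  /**
   * Interprets one run header from the Parquet RLE/bit-packed hybrid encoding:
   *   LSB == 1 -> bit-packed run of (header >>> 1) groups of 8 values
   *   LSB == 0 -> RLE run of (header >>> 1) repeated values
   */
  public static String describeRun(int header) {
    if ((header & 1) == 1) {
      int groups = header >>> 1;
      return "PACKED, " + (groups * 8) + " values";
    }
    return "RLE, " + (header >>> 1) + " repeats";
  }
}
```

So a header of 5 (binary 101) announces a bit-packed run of 2 groups, i.e. 16 values, while a header of 8 announces an RLE run of 4 repeats.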
### Test coverage
Adds `VectorizedRleValuesReaderSuite` (14 tests) exercising RLE and
PACKED across on-heap and off-heap `WritableColumnVector` modes,
including multi-page, cross-batch, row-index filtering, and buffer-grow
scenarios.
Adds `VectorizedRleValuesReaderTestUtils` and
`ParquetReadStateTestAccess` — shared test helpers used by both the new
suite and the existing `VectorizedRleValuesReaderBenchmark`
(introduced in SPARK-47741). The benchmark is refactored to consume
these utilities (no functional change).
### Benchmark impact
Measured with `VectorizedRleValuesReaderBenchmark` on GHA
(AMD EPYC 7763, JDK 17 / 21 / 25).
RLE fast paths (nullRatio = 0.0 / 1.0): within ±1% of master on all
three JDKs.
PACKED clustered workloads (JDK 21):
* `nullable without def-levels, null=0.1 clustered`: +84%
* `nullable without def-levels, null=0.3 clustered`: +73%
* `nullable without def-levels, null=0.5 clustered`: +66%
* `nullable without def-levels, null=0.9 clustered`: +51%
* `nullable with def-levels, null=0.1..0.9 clustered`: +22% to +61%
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
* Added `VectorizedRleValuesReaderSuite`
* Ran the existing `VectorizedRleValuesReaderBenchmark` on JDK 17,
21, and 25 via GHA to capture result files
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7
…uet.VectorizedRleValuesReaderBenchmark (JDK 17, Scala 2.13, split 1 of 1)
…uet.VectorizedRleValuesReaderBenchmark (JDK 21, Scala 2.13, split 1 of 1)
…uet.VectorizedRleValuesReaderBenchmark (JDK 25, Scala 2.13, split 1 of 1)
reused reader, trueRatio=0.5            1      1      0     833.3     1.2     0.0X
cold reader, trueRatio=0.9              1      1      0     753.3     1.3     0.0X
reused reader, trueRatio=0.9            1      1      0     753.6     1.3     0.0X
cold reader, trueRatio=1.0              0      0      0   25165.0     0.0     1.0X
Is this improved as a side-effect also?
Not a side-effect of this PR. readBooleans (Group A) is untouched — the PR only changes the PACKED branches in readBatchInternal / readValuesN. This case reports Best Time(ms) = 0 (sub-ms per 1M-row iteration) and is sensitive to first-case JIT warmup. Master captured 4201, this run captured 25165. Adjacent PACKED cases (trueRatio=0.1/0.5/0.9) are unchanged between runs (755 vs 754, 835 vs 833).
nullRatio=0.5, clustered                5      5      0     194.7     5.1     0.0X
nullRatio=0.9, random                   7      7      0     151.9     6.6     0.0X
nullRatio=0.9, clustered                5      5      0     200.5     5.0     0.0X
nullRatio=1.0, random                   0      0      0   11945.0     0.1     1.0X
So, this PR improves this?
No PR-caused improvement. This is the RLE all-null fast path in readBatchInternal, byte-identical to master. The same case measures across runs:
| JDK | master | this PR |
|---|---|---|
| 17 | 11580 | 11675 |
| 21 | 6191 | 11945 |
| 25 | 12016 | 11957 |
JDK 21 master's 6191 is the single outlier among six observations; this PR's 11945 matches the ~11500-12000 band seen on JDK 17 / 25.
reused reader, trueRatio=0.9            1      2      0     706.8     1.4     0.1X
cold reader, trueRatio=1.0              0      0      0   59098.0     0.0     9.8X
reused reader, trueRatio=1.0            0      0      0   57130.7     0.0     9.4X
cold reader, trueRatio=0.0              0      0      0   60530.9     0.0     1.0X
The baseline seems to be faster.
The baseline is actually the slower side here — master captured 6055, this PR captured 60530. Same first-case sub-ms JIT-warmup variance (Best Time(ms) = 0). readBooleans code path is unchanged by this PR.
cold reader, trueRatio=0.9              1      1      0     709.0     1.4     0.0X
reused reader, trueRatio=0.9            1      2      0     705.6     1.4     0.0X
cold reader, trueRatio=1.0              0      0      0   59164.7     0.0     1.0X
reused reader, trueRatio=1.0            0      0      0   58832.7     0.0     1.0X
Although the relative figure drops from 9x to 1x, the baseline itself improved, so this does not seem to be a regression.
reused reader, trueRatio=0.9            1      1      0     891.6     1.1     0.0X
cold reader, trueRatio=1.0              0      0      0   67001.7     0.0     0.9X
reused reader, trueRatio=1.0            0      0      0   72126.6     0.0     1.0X
cold reader, trueRatio=0.0              0      0      0   47464.1     0.0     1.0X
The baseline seems to have changed somehow; the ratio itself is not a problem.
/**
 * PACKED-branch decode for {@link #readBatchInternal}. Extracted into a separate method so
 * C2 can keep {@link #readBatchInternal}'s RLE fast path small and inlineable. Scans for
Thank you for adding this purpose.
Thanks for reviewing! The design-intent comment is there because the per-row null / def-level write looks like a missed batching opportunity at first glance — wanted to make the rationale explicit for future readers.
Merged to master for Apache Spark 4.2.0.

Thank you @dongjoon-hyun ~
cloud-fan left a comment:
Tiny post-merge nit in case of a future follow-up — otherwise safe to ignore.
 * Each batch uses a fresh output vector since `state.valueOffset` resets to 0 per batch —
 * mirrors production where `VectorizedColumnReader` hands in a batch-sized vector.
The em-dash — (U+2014) is non-ASCII. Spark's convention (per dev docs / scalastyle) is ASCII-only in code and comments. Tiny post-merge nit — easy to tidy up alongside the planned pre-warm follow-up.
Suggested change:
Before:
 * Each batch uses a fresh output vector since `state.valueOffset` resets to 0 per batch —
 * mirrors production where `VectorizedColumnReader` hands in a batch-sized vector.
After:
 * Each batch uses a fresh output vector since `state.valueOffset` resets to 0 per batch,
 * mirroring production where `VectorizedColumnReader` hands in a batch-sized vector.
Thanks for catching this. Fixed in a follow-up — branch SPARK-56522-ascii-nit, I'll open the PR shortly.
…ctorizedRleValuesReaderSuite` docstring

### What changes were proposed in this pull request?
Replace a non-ASCII em-dash (U+2014) in the `runAndAssert` docstring of `VectorizedRleValuesReaderSuite` with an ASCII comma, following `cloud-fan`'s post-merge review comment on PR #55386.

### Why are the changes needed?
Spark's convention is ASCII-only in code and comments.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Grep for non-ASCII in the file returns empty after the change.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #55499 from LuciferYang/SPARK-56522-ascii-nit.

Authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
### What changes were proposed in this pull request?
Rewrite the PACKED branches of `readBatchInternal` and `readValuesN` in `VectorizedRleValuesReader` to scan `currentBuffer` for contiguous runs and batch the value-side reads, replacing the per-row virtual dispatch.
* One `updater.readValues(runLen, ...)` call per non-null run, with a `runLen == 1` fast path calling `updater.readValue(...)`.
* Null and def-level writes stay per-row (`putNull` / `putInt`). Batching them via `putNulls` / `putInts` with short counts (PACKED runs are capped at 7 by the Parquet hybrid encoder) pollutes those methods' shared JIT profile: the RLE branch calls them with `n = 4096` and loses its count-specialized code, regressing the RLE fast path by 3-4x.
* The PACKED branches are extracted into `readPackedBatch` / `readPackedBatchWithDefLevels` so the enclosing methods stay small enough for C2 to inline the RLE fast path.

### Why are the changes needed?
`VectorizedRleValuesReader` is on the critical path of every nullable Parquet column read. In PACKED mode (scattered-null columns) the original code dispatches per element through a virtual call. Batching the value side converts O(n) virtual dispatch into O(runs) batch calls while leaving the RLE fast path untouched.
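The O(n)-to-O(runs) claim can be made concrete with a small counting sketch. This is a hypothetical helper, not Spark code: for a def-level buffer it counts the value-side dispatches the old per-row loop would make against those the new run-batched loop makes.

```java
public class DispatchCountSketch {
  /**
   * Returns {perRowDispatches, perRunDispatches} for one def-level buffer:
   * the old scheme makes one virtual call per non-null row, the new scheme
   * one batched call per contiguous non-null run.
   */
  public static int[] countDispatches(int[] defLevels, int maxDefLevel) {
    int perRow = 0;
    int perRun = 0;
    boolean inRun = false;
    for (int level : defLevels) {
      if (level == maxDefLevel) {
        perRow++;                              // old: one call per non-null row
        if (!inRun) {
          perRun++;                            // new: one call per run start
          inRun = true;
        }
      } else {
        inRun = false;                         // a null ends the current run
      }
    }
    return new int[]{perRow, perRun};
  }
}
```

For clustered nulls the run count is far below the row count, which is where the batched path wins; for fully random nulls runs degenerate toward length 1 and the two counts converge.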
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
* Added `VectorizedRleValuesReaderSuite`.

**Benchmark**
* Ran `VectorizedRleValuesReaderBenchmark` on GHA (AMD EPYC 7763, JDK 17). All three JDK result files are updated in this PR; JDK 21 and JDK 25 show comparable results.
* RLE fast paths stay at baseline.
* PACKED (per-row ns, lower is better): short-run random cases at mid null ratios (0.3 / 0.5) are roughly flat; short runs benefit little from batching.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Opus 4.7