
[SPARK-56522][SQL] Batch PACKED null/non-null runs in VectorizedRleValuesReader#55386

Closed
LuciferYang wants to merge 7 commits into apache:master from LuciferYang:SPARK-56522

Conversation

@LuciferYang (Contributor) commented Apr 17, 2026

What changes were proposed in this pull request?

Rewrite the PACKED branches of readBatchInternal and readValuesN in VectorizedRleValuesReader to scan currentBuffer for contiguous runs and batch the value-side reads, replacing the per-row virtual dispatch.

  • Non-null runs: one updater.readValues(runLen, ...) per run, with a runLen == 1 fast path calling updater.readValue(...).
  • Null and def-level writes stay per-row (putNull / putInt). Batching them via putNulls / putInts with short counts (PACKED runs are capped at 7 by the Parquet hybrid encoder) pollutes those methods' shared JIT profile — the RLE branch calls them with n = 4096 and loses its count-specialized code, regressing the RLE fast path by 3-4×.
  • PACKED bodies are extracted into readPackedBatch / readPackedBatchWithDefLevels so the enclosing methods stay small enough for C2 to inline the RLE fast path.
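The run-scanning shape described above can be sketched as follows. This is a simplified, hypothetical model — plain arrays and a call log stand in for Spark's actual `ParquetVectorUpdater` / `WritableColumnVector` APIs, and `decode` is an illustrative name, not the PR's code:

```java
import java.util.ArrayList;
import java.util.List;

public class PackedRunScan {
  // Scans def levels for contiguous null/non-null runs and returns a log of the
  // calls a run-aware decoder would issue: one batched "readValues(n)" per
  // non-null run (with a runLen == 1 fast path), but per-row "putNull" writes.
  static List<String> decode(int[] defLevels, int maxDefLevel) {
    List<String> calls = new ArrayList<>();
    int i = 0;
    while (i < defLevels.length) {
      int runStart = i;
      boolean nonNull = defLevels[i] == maxDefLevel;
      // extend the run while the null/non-null class stays the same
      while (i < defLevels.length && (defLevels[i] == maxDefLevel) == nonNull) i++;
      int runLen = i - runStart;
      if (nonNull) {
        // one batched value-side call per run; singular fast path for runLen == 1
        calls.add(runLen == 1 ? "readValue" : "readValues(" + runLen + ")");
      } else {
        // null writes stay per-row, mirroring the PR's decision to protect the
        // count-specialized JIT profile of the batched RLE write path
        for (int j = 0; j < runLen; j++) calls.add("putNull");
      }
    }
    return calls;
  }

  public static void main(String[] args) {
    // def levels: 1 = value present, 0 = null
    List<String> calls = decode(new int[]{1, 1, 1, 0, 0, 1, 0, 1, 1}, 1);
    // prints [readValues(3), putNull, putNull, readValue, putNull, readValues(2)]
    System.out.println(calls);
  }
}
```

Nine rows produce six calls instead of nine; in the real reader, where PACKED runs stay short, the `runLen == 1` fast path is what keeps the per-run overhead low.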

Why are the changes needed?

VectorizedRleValuesReader is on the critical path of every nullable Parquet column read. In PACKED mode (scattered-null columns) the original code dispatches per element through a virtual call. Batching the value side converts O(n) virtual dispatch into O(runs) batch calls while leaving the RLE fast path untouched.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Add VectorizedRleValuesReaderSuite
  • Pass GitHub Actions

Benchmark

VectorizedRleValuesReaderBenchmark on GHA (AMD EPYC 7763, JDK 17). All three JDK result files are updated in this PR; JDK 21 and JDK 25 show comparable results.

RLE fast paths stay at baseline:

| Case | master (M rows/s) | this PR (M rows/s) | Δ |
|------|-------------------|--------------------|---|
| Group C null=1.0 | 8057 | 8031 | -0.3% |
| Group D null=1.0 | 11580 | 11675 | +0.8% |

PACKED (per-row ns, lower is better):

| Case | master | this PR | Δ |
|------|--------|---------|---|
| Group D null=0.1 clustered | 10.1 | 5.3 | -48% |
| Group D null=0.3 clustered | 9.6 | 5.3 | -45% |
| Group D null=0.5 clustered | 9.1 | 5.3 | -42% |
| Group D null=0.9 clustered | 8.1 | 5.1 | -37% |
| Group D null=0.1 random | 11.0 | 7.2 | -35% |
| Group D null=0.9 random | 9.3 | 6.7 | -28% |
| Group C null=0.1 clustered | 12.2 | 6.9 | -43% |
| Group C null=0.3 clustered | 11.5 | 6.4 | -44% |
| Group C null=0.5 clustered | 10.9 | 6.3 | -42% |
| Group C null=0.9 clustered | 9.6 | 5.8 | -40% |
| Group C null=0.1 random | 12.8 | 8.9 | -30% |
| Group C null=0.9 random | 10.3 | 8.0 | -22% |

Short-run random cases at mid null ratios (0.3 / 0.5) are roughly flat — short runs benefit little from batching.
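The flat mid-ratio random cases follow from a back-of-the-envelope run-length model (my own illustration assuming i.i.d. nulls; not part of the PR or its benchmark):

```java
public class RunCount {
  // Expected number of null/non-null runs in an n-row batch where each row is
  // null independently with probability p: a new run starts at row 0 and at
  // every adjacent pair that differs, so E[runs] = 1 + (n - 1) * 2p(1 - p).
  static double expectedRuns(int n, double p) {
    return 1 + (n - 1) * 2 * p * (1 - p);
  }

  public static void main(String[] args) {
    int n = 4096; // one Parquet vectorized batch
    // average run length: p=0.1 -> ~5.5, p=0.3 -> ~2.4, p=0.5 -> ~2.0
    for (double p : new double[]{0.1, 0.3, 0.5}) {
      System.out.printf("p=%.1f avg run length ~ %.1f%n", p, n / expectedRuns(n, p));
    }
  }
}
```

At null=0.5 the average random run is about two rows, so a batched read still fires roughly one call per two rows and gains little over per-row dispatch, while clustered nulls keep runs long at every ratio.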

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7

@LuciferYang LuciferYang marked this pull request as draft April 17, 2026 03:22
@LuciferYang (Contributor Author):

Let me update the benchmark results to confirm the effect. I'll add the PR description later.

```scala
// package-private access across runtime packages. Reflection with setAccessible sidesteps
// the check without widening production visibility.

private val stateCls = SparkClassUtils.classForName[Any](
```
Contributor Author:

Refactor common code into VectorizedRleValuesReaderTestUtils and ParquetReadStateTestAccess to enable reuse in VectorizedRleValuesReaderSuite.

```
nullRatio=0.5, clustered                              6     6     0    170.5     5.9    0.0X
nullRatio=0.9, random                                 8     8     0    137.2     7.3    0.0X
nullRatio=0.9, clustered                              6     6     0    183.6     5.4    0.0X
nullRatio=1.0, random                                 0     0     0   2832.8     0.4    0.4X
```
Contributor Author:

The performance has regressed; we need to investigate.

…Reader

### What changes were proposed in this pull request?

This PR introduces a batched PACKED decode path for the nullable column
read loop in `VectorizedRleValuesReader`, replacing the per-row virtual
dispatch with a run-aware, value-side batched loop. Two methods are
changed:

  * `readBatchInternal` (flat nullable columns, no def-level
    materialization)
  * `readValuesN` (nullable columns with def-level materialization,
    used for nested or wide-schema reads)

Both PACKED branches are extracted into private helpers
(`readPackedBatch`, `readPackedBatchWithDefLevels`) so that the
enclosing methods stay small enough for C2 to inline the RLE fast path
into its caller.

The PACKED helpers:

  * scan `currentBuffer` for contiguous runs of identical def-level
    values
  * dispatch **one** batched value read per non-null run
    (`updater.readValues(runLen, ...)`), with a `runLen == 1` fast path
    calling the singular `updater.readValue(...)`
  * write nulls and def-level values per row (`putNull` /
    `putInt`) — deliberately NOT batched via `putNulls` / `putInts`

### Why is the null/def-level write per-row instead of batched?

The RLE branch already calls `putNulls(offset, n)` and
`defLevels.putInts(offset, n, value)` with `n = 4096`. HotSpot compiles
each of those methods with a global method-level profile. Introducing
PACKED call sites with short `runLen` (≤ 7, since the Parquet hybrid
encoder never emits PACKED runs of 8+) would average the count profile
over small and large values, causing C2 to drop count-specialised
optimisations and slow the RLE fast path by 3-4x. Keeping those writes
per-row in PACKED preserves the monomorphic profile for the RLE hot
path while still delivering the value-side batching win.

### Test coverage

Adds `VectorizedRleValuesReaderSuite` (14 tests) exercising RLE and
PACKED across on-heap and off-heap `WritableColumnVector` modes,
including multi-page, cross-batch, row-index filtering, and buffer-grow
scenarios.

Adds `VectorizedRleValuesReaderTestUtils` and
`ParquetReadStateTestAccess` — shared test helpers used by both the new
suite and the existing `VectorizedRleValuesReaderBenchmark`
(introduced in SPARK-47741). The benchmark is refactored to consume
these utilities (no functional change).

### Benchmark impact

Measured with `VectorizedRleValuesReaderBenchmark` on GHA
(AMD EPYC 7763, JDK 17 / 21 / 25).

RLE fast paths (nullRatio = 0.0 / 1.0): within ±1% of master on all
three JDKs.

PACKED clustered workloads (JDK 21):
  * `nullable without def-levels, null=0.1 clustered`: +84%
  * `nullable without def-levels, null=0.3 clustered`: +73%
  * `nullable without def-levels, null=0.5 clustered`: +66%
  * `nullable without def-levels, null=0.9 clustered`: +51%
  * `nullable with def-levels, null=0.1..0.9 clustered`: +22% to +61%

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

  * Added `VectorizedRleValuesReaderSuite`
  * Ran the existing `VectorizedRleValuesReaderBenchmark` on JDK 17,
    21, and 25 via GHA to capture result files

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.7
…uet.VectorizedRleValuesReaderBenchmark (JDK 17, Scala 2.13, split 1 of 1)
…uet.VectorizedRleValuesReaderBenchmark (JDK 21, Scala 2.13, split 1 of 1)
…uet.VectorizedRleValuesReaderBenchmark (JDK 25, Scala 2.13, split 1 of 1)
@LuciferYang LuciferYang marked this pull request as ready for review April 20, 2026 11:36
```
reused reader, trueRatio=0.5                          1     1     0    833.3     1.2    0.0X
cold reader, trueRatio=0.9                            1     1     0    753.3     1.3    0.0X
reused reader, trueRatio=0.9                          1     1     0    753.6     1.3    0.0X
cold reader, trueRatio=1.0                            0     0     0  25165.0     0.0    1.0X
```
Member:

Is this improved as a side-effect also?

Contributor Author:

Not a side-effect of this PR. readBooleans (Group A) is untouched — the PR only changes the PACKED branches in readBatchInternal / readValuesN. This case reports Best Time(ms) = 0 (sub-ms per 1M-row iteration) and is sensitive to first-case JIT warmup. Master captured 4201, this run captured 25165. Adjacent PACKED cases (trueRatio=0.1/0.5/0.9) are unchanged between runs (755 vs 754, 835 vs 833).

```
nullRatio=0.5, clustered                              5     5     0    194.7     5.1    0.0X
nullRatio=0.9, random                                 7     7     0    151.9     6.6    0.0X
nullRatio=0.9, clustered                              5     5     0    200.5     5.0    0.0X
nullRatio=1.0, random                                 0     0     0  11945.0     0.1    1.0X
```
Member:

So, this PR improves this?

Contributor Author:

No PR-caused improvement. This is the RLE all-null fast path in readBatchInternal, byte-identical to master. The same case measures across runs:

| JDK | master | this PR |
|-----|--------|---------|
| 17 | 11580 | 11675 |
| 21 | 6191 | 11945 |
| 25 | 12016 | 11957 |

JDK 21 master's 6191 is the single outlier among six observations; this PR's 11945 matches the ~11500-12000 band seen on JDK 17 / 25.

```
reused reader, trueRatio=0.9                          1     2     0    706.8     1.4    0.1X
cold reader, trueRatio=1.0                            0     0     0  59098.0     0.0    9.8X
reused reader, trueRatio=1.0                          0     0     0  57130.7     0.0    9.4X
cold reader, trueRatio=0.0                            0     0     0  60530.9     0.0    1.0X
```
Member:

The base line seems to be faster.

Contributor Author:

The baseline is actually the slower side here — master captured 6055, this PR captured 60530. Same first-case sub-ms JIT-warmup variance (Best Time(ms) = 0). readBooleans code path is unchanged by this PR.

```
cold reader, trueRatio=0.9                            1     1     0    709.0     1.4    0.0X
reused reader, trueRatio=0.9                          1     2     0    705.6     1.4    0.0X
cold reader, trueRatio=1.0                            0     0     0  59164.7     0.0    1.0X
reused reader, trueRatio=1.0                          0     0     0  58832.7     0.0    1.0X
```
@dongjoon-hyun (Member) commented Apr 20, 2026:

Although this is 9x -> 1x, given that the baseline improved, this does not seem to be a regression.

```
reused reader, trueRatio=0.9                          1     1     0    891.6     1.1    0.0X
cold reader, trueRatio=1.0                            0     0     0  67001.7     0.0    0.9X
reused reader, trueRatio=1.0                          0     0     0  72126.6     0.0    1.0X
cold reader, trueRatio=0.0                            0     0     0  47464.1     0.0    1.0X
```
Member:

Baseline seems to be changed somehow. Not a problem in the ratio.


```java
/**
 * PACKED-branch decode for {@link #readBatchInternal}. Extracted into a separate method so
 * C2 can keep {@link #readBatchInternal}'s RLE fast path small and inlineable. Scans for
```
Member:

Thank you for adding this purpose.

Contributor Author:

Thanks for reviewing! The design-intent comment is there because the per-row null / def-level write looks like a missed batching opportunity at first glance — wanted to make the rationale explicit for future readers.

@dongjoon-hyun (Member) left a comment:

+1, LGTM.

@dongjoon-hyun (Member):

Merged to master for Apache Spark 4.2.0.

@LuciferYang (Contributor Author):

Thank you @dongjoon-hyun ~
Let me try adding one pre-warm iteration for Group A and see if it stabilizes the test results.

@LuciferYang LuciferYang deleted the SPARK-56522 branch April 22, 2026 06:28
@cloud-fan (Contributor) left a comment:

Tiny post-merge nit in case of a future follow-up — otherwise safe to ignore.

Comment on lines +137 to +138
```java
 * Each batch uses a fresh output vector since `state.valueOffset` resets to 0 per batch —
 * mirrors production where `VectorizedColumnReader` hands in a batch-sized vector.
```
Contributor:

The em-dash (U+2014) is non-ASCII. Spark's convention (per dev docs / scalastyle) is ASCII-only in code and comments. Tiny post-merge nit — easy to tidy up alongside the planned pre-warm follow-up.

Suggested change:

```diff
- * Each batch uses a fresh output vector since `state.valueOffset` resets to 0 per batch —
- * mirrors production where `VectorizedColumnReader` hands in a batch-sized vector.
+ * Each batch uses a fresh output vector since `state.valueOffset` resets to 0 per batch,
+ * mirroring production where `VectorizedColumnReader` hands in a batch-sized vector.
```

Contributor Author:

Good catch.

Contributor Author:

Thanks for catching this. Fixed in a follow-up — branch SPARK-56522-ascii-nit, I'll open the PR shortly.

LuciferYang added a commit that referenced this pull request Apr 23, 2026
…ctorizedRleValuesReaderSuite` docstring

### What changes were proposed in this pull request?

Replace a non-ASCII em-dash (U+2014) in the `runAndAssert` docstring of `VectorizedRleValuesReaderSuite` with an ASCII comma, following `cloud-fan`'s post-merge review comment on PR #55386.

### Why are the changes needed?

Spark's convention is ASCII-only in code and comments.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Grep for non-ASCII in the file returns empty after the change.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #55499 from LuciferYang/SPARK-56522-ascii-nit.

Authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>