[SPARK-56801][SQL] Add bulk read+widen path for INT32 to Double Parquet vector updater by LuciferYang · Pull Request #55795 · apache/spark

LuciferYang · 2026-05-11T09:36:17Z

What changes were proposed in this pull request?

Extend the bulk read+widen pattern introduced in SPARK-56791 to IntegerToDoubleUpdater (parquet INT32 read into Spark DoubleType).

A new readIntegersAsDoubles default method on VectorizedValuesReader does the per-row fallback. VectorizedPlainValuesReader overrides it to fetch source bytes once via getBuffer(total * 4) and run a tight in-method conversion loop. IntegerToDoubleUpdater.readValues becomes a one-line delegation. The widen is Java's primitive int-to-double conversion, lossless for all INT32 values.

Why are the changes needed?

IntegerToDoubleUpdater.readValues allocates a fresh ByteBuffer slice inside getBuffer(4) for every element on the legacy path, and that allocation dominates the loop. Collapsing N allocations into one is the same win SPARK-56791 delivered for the INT32 -> Long sibling, with the gain again amplifying on newer JDKs where escape analysis better optimizes the tight loop:

JDK	Baseline	After	Speedup
17	479.1 M/s	1475.2 M/s	3.08x
21	531.8 M/s	6215.0 M/s	11.68x
25	454.6 M/s	6423.7 M/s	14.13x

Peer Updaters in the same benchmark group hold within run-to-run noise, confirming the change is local to IntegerToDoubleUpdater.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests in ParquetVectorUpdaterSuite cover boundary batch lengths (0, 1, 7, 8, 9, 17, 1024, 4097) and the singular readValue path. A new end-to-end test in ParquetIOSuite round-trips INT32 written to parquet and read back as DoubleType, exercising both REQUIRED columns (no def-levels) and OPTIONAL columns with interleaved nulls so that readValue and readValues are both invoked.

Benchmark results on JDK 17, 21, and 25 are committed on the branch.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

…et vector updater

…uet.ParquetVectorUpdaterBenchmark (JDK 17, Scala 2.13, split 1 of 1)

LuciferYang · 2026-05-11T12:20:37Z

-DateToTimestampNTZUpdater                               36             36           1         29.1          34.4       0.0X
-DowncastLongUpdater (INT64 -> Decimal(9,2))              2              2           0        487.0           2.1       0.4X
+IntegerToLongUpdater                                     1              1           0       1277.6           0.8       1.0X
+IntegerToDoubleUpdater                                   1              1           0       1475.2           0.7       1.2X


This is the embodiment of this optimization. will update others results later

…uet.ParquetVectorUpdaterBenchmark (JDK 21, Scala 2.13, split 1 of 1)

…uet.ParquetVectorUpdaterBenchmark (JDK 25, Scala 2.13, split 1 of 1)

…uet.ParquetVectorUpdaterBenchmark (JDK 21, Scala 2.13, split 1 of 1)

LuciferYang · 2026-05-12T03:05:58Z

cc @dongjoon-hyun @yaooqinn

…et vector updater ### What changes were proposed in this pull request? Extend the bulk read+widen pattern introduced in SPARK-56791 to `IntegerToDoubleUpdater` (parquet INT32 read into Spark `DoubleType`). A new `readIntegersAsDoubles` default method on `VectorizedValuesReader` does the per-row fallback. `VectorizedPlainValuesReader` overrides it to fetch source bytes once via `getBuffer(total * 4)` and run a tight in-method conversion loop. `IntegerToDoubleUpdater.readValues` becomes a one-line delegation. The widen is Java's primitive int-to-double conversion, lossless for all INT32 values. ### Why are the changes needed? `IntegerToDoubleUpdater.readValues` allocates a fresh `ByteBuffer` slice inside `getBuffer(4)` for every element on the legacy path, and that allocation dominates the loop. Collapsing N allocations into one is the same win SPARK-56791 delivered for the INT32 -> Long sibling, with the gain again amplifying on newer JDKs where escape analysis better optimizes the tight loop: | JDK | Baseline | After | Speedup | |----:|---------:|----------:|--------:| | 17 | 479.1 M/s | 1475.2 M/s | 3.08x | | 21 | 531.8 M/s | 6215.0 M/s | 11.68x | | 25 | 454.6 M/s | 6423.7 M/s | 14.13x | Peer Updaters in the same benchmark group hold within run-to-run noise, confirming the change is local to `IntegerToDoubleUpdater`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests in `ParquetVectorUpdaterSuite` cover boundary batch lengths (0, 1, 7, 8, 9, 17, 1024, 4097) and the singular `readValue` path. A new end-to-end test in `ParquetIOSuite` round-trips INT32 written to parquet and read back as `DoubleType`, exercising both REQUIRED columns (no def-levels) and OPTIONAL columns with interleaved nulls so that `readValue` and `readValues` are both invoked. Benchmark results on JDK 17, 21, and 25 are committed on the branch. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code Closes #55795 from LuciferYang/SPARK-56801-int32-to-double. Authored-by: YangJie <yangjie01@baidu.com> Signed-off-by: yangjie01 <yangjie01@baidu.com> (cherry picked from commit 1360e5c) Signed-off-by: yangjie01 <yangjie01@baidu.com>

LuciferYang · 2026-05-12T11:59:24Z

Merged into master. Thanks @yaooqinn

dongjoon-hyun

+1, late LGTM.

LuciferYang · 2026-05-12T18:12:45Z

Thank you @dongjoon-hyun

[SPARK-56801][SQL] Add bulk read+widen path for INT32 to Double Parqu…

68e1593

…et vector updater

LuciferYang marked this pull request as draft May 11, 2026 09:40

Benchmark results for org.apache.spark.sql.execution.datasources.parq…

792390e

…uet.ParquetVectorUpdaterBenchmark (JDK 17, Scala 2.13, split 1 of 1)

LuciferYang commented May 11, 2026

View reviewed changes

LuciferYang added 5 commits May 11, 2026 13:22

Benchmark results for org.apache.spark.sql.execution.datasources.parq…

2879c29

…uet.ParquetVectorUpdaterBenchmark (JDK 21, Scala 2.13, split 1 of 1)

Benchmark results for org.apache.spark.sql.execution.datasources.parq…

49e9af3

…uet.ParquetVectorUpdaterBenchmark (JDK 25, Scala 2.13, split 1 of 1)

Benchmark results for org.apache.spark.sql.execution.datasources.parq…

ff85e52

…uet.ParquetVectorUpdaterBenchmark (JDK 21, Scala 2.13, split 1 of 1)

Benchmark results for org.apache.spark.sql.execution.datasources.parq…

1277d71

…uet.ParquetVectorUpdaterBenchmark (JDK 21, Scala 2.13, split 1 of 1)

Benchmark results for org.apache.spark.sql.execution.datasources.parq…

3273935

…uet.ParquetVectorUpdaterBenchmark (JDK 21, Scala 2.13, split 1 of 1)

LuciferYang marked this pull request as ready for review May 11, 2026 23:57

Merge branch 'apache:master' into SPARK-56801-int32-to-double

ca04b99

yaooqinn approved these changes May 12, 2026

View reviewed changes

LuciferYang closed this in 1360e5c May 12, 2026

dongjoon-hyun reviewed May 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-56801][SQL] Add bulk read+widen path for INT32 to Double Parquet vector updater#55795

[SPARK-56801][SQL] Add bulk read+widen path for INT32 to Double Parquet vector updater#55795
LuciferYang wants to merge 8 commits into
apache:masterfrom
LuciferYang:SPARK-56801-int32-to-double

LuciferYang commented May 11, 2026 •

edited

Loading

Uh oh!

LuciferYang May 11, 2026

Uh oh!

LuciferYang commented May 12, 2026

Uh oh!

LuciferYang commented May 12, 2026

Uh oh!

dongjoon-hyun left a comment

Uh oh!

LuciferYang commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

LuciferYang commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

LuciferYang May 11, 2026

Choose a reason for hiding this comment

Uh oh!

LuciferYang commented May 12, 2026

Uh oh!

LuciferYang commented May 12, 2026

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

LuciferYang commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

LuciferYang commented May 11, 2026 •

edited

Loading