[SPARK-56802][SQL] Add bulk read+widen path for FLOAT to Double Parquet vector updater #55816

Closed
LuciferYang wants to merge 4 commits into apache:master from LuciferYang:SPARK-56802-float-to-double

LuciferYang (Contributor) commented May 12, 2026

What changes were proposed in this pull request?

Extend the bulk read+widen pattern introduced in SPARK-56791 to FloatToDoubleUpdater (Parquet FLOAT read into Spark DoubleType).

A new readFloatsAsDoubles default method on VectorizedValuesReader provides the per-row fallback. VectorizedPlainValuesReader overrides it to fetch the source bytes once via getBuffer(total * 4) and run a tight in-method conversion loop. FloatToDoubleUpdater.readValues becomes a one-line delegation. The widening is Java's primitive float-to-double conversion: it is exact for every finite and infinite float, and a NaN float widens to a double NaN (the JVM may canonicalize the payload).
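
As a rough illustration of the pattern described above, the following self-contained Java sketch contrasts a per-row default method with a bulk override that takes one buffer view for the whole batch. The types `FloatSource`, `PlainFloatSource`, and `FloatWidenDemo`, the `double[]` output, and the method signatures are simplified assumptions made for this example; they are not the actual `VectorizedValuesReader` or `VectorizedPlainValuesReader` code from this PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical, simplified stand-in for a values reader (not Spark's interface). */
interface FloatSource {
  /** Reads the next 4-byte float from the underlying data. */
  float readFloat();

  /** Per-row fallback: one readFloat() call per element. */
  default void readFloatsAsDoubles(int total, double[] out, int offset) {
    for (int i = 0; i < total; i++) {
      out[offset + i] = readFloat(); // implicit float -> double widening
    }
  }
}

/** Assumed layout: consecutive 4-byte little-endian floats in memory. */
final class PlainFloatSource implements FloatSource {
  private final ByteBuffer data;

  PlainFloatSource(byte[] bytes) {
    this.data = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
  }

  /** Returns a view of the next `length` bytes and advances past them. */
  private ByteBuffer getBuffer(int length) {
    // slice() resets the byte order to big-endian, so it is set again here.
    ByteBuffer slice = data.slice().order(ByteOrder.LITTLE_ENDIAN);
    slice.limit(length);
    data.position(data.position() + length);
    return slice;
  }

  @Override
  public float readFloat() {
    return getBuffer(4).getFloat(); // one new slice per element
  }

  @Override
  public void readFloatsAsDoubles(int total, double[] out, int offset) {
    ByteBuffer buf = getBuffer(total * 4); // one slice for the whole batch
    for (int i = 0; i < total; i++) {
      out[offset + i] = buf.getFloat(i * 4); // absolute read, then widen
    }
  }
}

final class FloatWidenDemo {
  /** Encodes floats as little-endian bytes, as the sketch above expects. */
  static byte[] encode(float... values) {
    ByteBuffer b = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN);
    for (float v : values) {
      b.putFloat(v);
    }
    return b.array();
  }

  public static void main(String[] args) {
    double[] out = new double[3];
    FloatSource source = new PlainFloatSource(encode(1.5f, -0.0f, Float.POSITIVE_INFINITY));
    source.readFloatsAsDoubles(3, out, 0);
    System.out.println(Arrays.toString(out));
  }
}
```

In this sketch the fallback and the override produce the same values; only the number of buffer views created per batch differs, which is the cost the PR description attributes to the legacy path.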

Why are the changes needed?

FloatToDoubleUpdater.readValues allocates a fresh ByteBuffer slice inside getBuffer(4) for every element on the legacy path, and that allocation dominates the loop. Collapsing N allocations into one is the same win SPARK-56791 delivered for the INT32 -> Long sibling, with the gain again amplifying on newer JDKs where escape analysis better optimizes the tight loop. Baselines and after-numbers are taken directly from the benchmark-result diffs committed on this branch:

| JDK | Baseline | After | Speedup |
|----:|---------:|------:|--------:|
| 17 | 480.2 M/s | 1418.8 M/s | 2.95x |
| 21 | 489.5 M/s | 2527.2 M/s | 5.16x |
| 25 | 483.3 M/s | 2570.7 M/s | 5.32x |

Peer updaters in the same benchmark group hold within run-to-run noise, indicating that the change is local to FloatToDoubleUpdater.
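
The speedup figures can be rechecked from the two throughput columns with a few lines of arithmetic. The sketch below is only a sanity check on the numbers quoted in this description; the class and method names are made up for the example, and it assumes that "M/s" means million values per second, so that time per value in nanoseconds is 1000 divided by the rate.

```java
final class SpeedupCheck {
  /** Ratio of the after-throughput to the baseline throughput. */
  static double speedup(double baselineMps, double afterMps) {
    return afterMps / baselineMps;
  }

  /** Assumes the rate is in million values per second: 1e9 ns / (rate * 1e6). */
  static double nanosPerRow(double rateMps) {
    return 1000.0 / rateMps;
  }

  public static void main(String[] args) {
    // {JDK, baseline M/s, after M/s}, copied from the table in this description.
    double[][] rows = {
      {17, 480.2, 1418.8},
      {21, 489.5, 2527.2},
      {25, 483.3, 2570.7},
    };
    for (double[] r : rows) {
      System.out.printf(
          "JDK %d: %.2fx speedup, %.1f ns per value after%n",
          (int) r[0], speedup(r[1], r[2]), nanosPerRow(r[2]));
    }
  }
}
```

Under that reading of the unit, the JDK 17 after-rate of 1418.8 M/s corresponds to roughly 0.7 ns per value.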

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests in ParquetVectorUpdaterSuite cover boundary batch lengths (0, 1, 7, 8, 9, 17, 1024, 4097), the singular readValue path, and special values (signed zeros, +/-Infinity, NaN, with raw-bit assertions on signed zeros). A NaN-safe assertDoublesEqual helper uses java.lang.Double.compare so NaN equality and signed-zero distinction are well-defined.
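
To show why a `Double.compare`-based check is useful for these special values, here is a minimal Java sketch. The `DoubleAsserts` class and its `assertDoublesEqual` method are hypothetical stand-ins written for this example; they are not the helper from `ParquetVectorUpdaterSuite`.

```java
final class DoubleAsserts {
  /** Fails unless Double.compare treats the two values as equal. */
  static void assertDoublesEqual(double expected, double actual) {
    if (Double.compare(expected, actual) != 0) {
      throw new AssertionError("expected " + expected + " but got " + actual);
    }
  }

  public static void main(String[] args) {
    double widenedNaN = (double) Float.NaN;
    double widenedNegZero = (double) -0.0f;

    // The == operator cannot express either property:
    System.out.println(widenedNaN == Double.NaN); // NaN never equals itself under ==
    System.out.println(widenedNegZero == 0.0d);   // == treats -0.0 and 0.0 as equal

    // Double.compare treats NaN as equal to NaN and orders -0.0 below 0.0.
    assertDoublesEqual(Double.NaN, widenedNaN);
    assertDoublesEqual(-0.0d, widenedNegZero);
    assertDoublesEqual(Double.NEGATIVE_INFINITY, (double) Float.NEGATIVE_INFINITY);

    // Raw-bit check: a widened negative zero keeps only the sign bit set.
    long bits = Double.doubleToRawLongBits(widenedNegZero);
    System.out.println(Long.toHexString(bits));
  }
}
```

`Double.compare` is used here rather than a raw-bit comparison for NaN because the description notes that the JVM may canonicalize the NaN payload during widening.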

A new end-to-end test in ParquetIOSuite round-trips FLOAT values written to Parquet and read back as DoubleType, exercising both REQUIRED columns (no def-levels) and OPTIONAL columns with interleaved nulls so that readValue and readValues are both invoked.

Benchmark results on JDK 17, 21, and 25 are committed on the branch.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

LuciferYang marked this pull request as draft May 12, 2026 05:55
LuciferYang force-pushed the SPARK-56802-float-to-double branch from 308150a to 051b94b May 12, 2026 17:15
…uet.ParquetVectorUpdaterBenchmark (JDK 17, Scala 2.13, split 1 of 1)
…uet.ParquetVectorUpdaterBenchmark (JDK 21, Scala 2.13, split 1 of 1)
…uet.ParquetVectorUpdaterBenchmark (JDK 25, Scala 2.13, split 1 of 1)
DowncastLongUpdater (INT64 -> Decimal(9,2)) 2 2 0 455.0 2.2 0.4X
IntegerToLongUpdater 1 1 0 1280.6 0.8 1.0X
IntegerToDoubleUpdater 1 1 0 1537.9 0.7 1.2X
FloatToDoubleUpdater 1 1 0 1418.8 0.7 1.1X
LuciferYang (Contributor, Author) commented:

This row shows the effect of this optimization.

LuciferYang marked this pull request as ready for review May 13, 2026 09:16

LuciferYang (Contributor, Author) commented:

cc @yaooqinn @dongjoon-hyun

LuciferYang added a commit that referenced this pull request May 13, 2026
…et vector updater

Closes #55816 from LuciferYang/SPARK-56802-float-to-double.

Authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit bc4bf69)
Signed-off-by: yangjie01 <yangjie01@baidu.com>

LuciferYang (Contributor, Author) commented:

Merged into master. Thanks @yaooqinn

dongjoon-hyun (Member) commented:

+1, LGTM.

LuciferYang (Contributor, Author) commented:

Thank you @dongjoon-hyun
