[SPARK-56804][SQL] Add bulk read+convert path for DATE to TimestampNTZ Parquet vector updater by LuciferYang · Pull Request #55971 · apache/spark

LuciferYang · 2026-05-19T02:36:45Z

What changes were proposed in this pull request?

Extend the bulk-read pattern (SPARK-56791 / 56801 / 56802 / 56803) to DateToTimestampNTZUpdater. Add a readIntegersAsTimestampMicros default method on VectorizedValuesReader, override it in VectorizedPlainValuesReader to fetch source bytes once via getBuffer(total * 4) and run a tight conversion loop, and reduce DateToTimestampNTZUpdater.readValues to a one-line delegation. Scope: UTC, CORRECTED rebase mode; the LEGACY / EXCEPTION rebase variants (handled by DateToTimestampNTZWithRebaseUpdater) are out of scope.
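The difference between per-value slicing and a single bulk fetch can be sketched as follows. This is a minimal, standalone illustration using only `java.nio.ByteBuffer`, not the code in this PR: the class `BulkDateToMicrosSketch`, its method names, and the plain `long[]` output array are hypothetical stand-ins for the real reader and column vector, and it assumes little-endian INT32 day counts, as in Parquet plain encoding.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch; names are illustrative and do not mirror Spark's classes.
final class BulkDateToMicrosSketch {
    static final long MICROS_PER_DAY = 86_400_000_000L;

    // Per-value path: takes a fresh 4-byte slice for every row.
    static void readOneByOne(ByteBuffer page, int total, long[] out, int rowId) {
        for (int i = 0; i < total; i++) {
            ByteBuffer slice = page.slice().order(ByteOrder.LITTLE_ENDIAN);
            slice.limit(4);
            page.position(page.position() + 4);
            out[rowId + i] = Math.multiplyExact((long) slice.getInt(), MICROS_PER_DAY);
        }
    }

    // Bulk path: takes one slice covering total * 4 bytes, then converts in a tight loop.
    static void readBulk(ByteBuffer page, int total, long[] out, int rowId) {
        int length = Math.multiplyExact(total, 4);
        ByteBuffer buf = page.slice().order(ByteOrder.LITTLE_ENDIAN);
        buf.limit(length);
        page.position(page.position() + length);
        for (int i = 0; i < total; i++) {
            // Days since epoch (INT32) widened to microseconds since epoch at UTC.
            out[rowId + i] = Math.multiplyExact((long) buf.getInt(), MICROS_PER_DAY);
        }
    }
}
```

Both methods produce the same output; the bulk variant simply replaces N small slice allocations with one.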

Why are the changes needed?

This was first proposed in #55855 and closed because benchmarks showed no measurable gain -- the per-row work was dominated by DateTimeUtils.daysToMicros's LocalDate / ZonedDateTime / Instant allocation chain, swamping the savings from collapsing N getBuffer(4) slice allocations.

SPARK-56874 fixed that by fast-pathing daysToMicros at ZoneOffset.UTC to Math.multiplyExact(days.toLong, MICROS_PER_DAY). With the conversion now cheap, the bulk-read savings become visible:
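To make the cost difference concrete, here is a rough Java sketch of the two conversion routes. It is an illustration under assumptions, not Spark's `DateTimeUtils` source: the class `DaysToMicrosSketch` and the method `viaJavaTime` are hypothetical names, and the `java.time` route is only an approximation of the allocation chain described above.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical sketch; not the actual DateTimeUtils implementation.
final class DaysToMicrosSketch {
    static final long MICROS_PER_DAY = 86_400_000_000L;

    // General route: allocates a LocalDate, a ZonedDateTime, and an Instant per call.
    static long viaJavaTime(int days, ZoneId zone) {
        Instant instant = LocalDate.ofEpochDay(days).atStartOfDay(zone).toInstant();
        long micros = Math.multiplyExact(instant.getEpochSecond(), 1_000_000L);
        return Math.addExact(micros, instant.getNano() / 1_000L);
    }

    // Fast path: at UTC every day is exactly MICROS_PER_DAY long, so one multiply suffices.
    static long daysToMicros(int days, ZoneId zone) {
        // Note: ZoneId.of("UTC") is a region id and is not equal to ZoneOffset.UTC,
        // so in this sketch it falls through to the general route.
        if (ZoneOffset.UTC.equals(zone)) {
            return Math.multiplyExact((long) days, MICROS_PER_DAY);
        }
        return viaJavaTime(days, zone);
    }
}
```

At `ZoneOffset.UTC` the two routes return the same value, which is why the shortcut is safe for the UTC-only scope of this PR.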

JDK	Master baseline	With this PR	Speedup
17	2.8 ns/row (357.5 M/s)	1.7 ns/row (605.2 M/s)	~1.7x
21	2.7 ns/row (366.1 M/s)	1.1 ns/row (934.8 M/s)	~2.5x
25	2.6 ns/row (378.3 M/s)	1.1 ns/row (884.9 M/s)	~2.4x
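For readers who want to cross-check the columns, the relationship between them is simple arithmetic: ns/row is 1000 divided by throughput in M rows/s, and speedup is the ratio of the two throughputs. The helper below is a hypothetical sketch (the class `ThroughputSketch` is not part of the PR or the benchmark harness). Applied to the throughput columns it gives ratios of roughly 1.69, 2.55, and 2.34, close to the rounded figures in the table.

```java
// Hypothetical helper for checking the benchmark table's arithmetic.
final class ThroughputSketch {
    // Converts throughput in millions of rows per second to nanoseconds per row.
    static double nsPerRow(double millionRowsPerSecond) {
        return 1000.0 / millionRowsPerSecond;
    }

    // Speedup as the ratio of the new throughput to the baseline throughput.
    static double speedup(double baselineMps, double withPrMps) {
        return withPrMps / baselineMps;
    }
}
```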

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit tests in ParquetVectorUpdaterSuite for readIntegersAsTimestampMicros (various batch sizes, single-value readValue, UTC conversion semantics). New integration test in ParquetIOSuite for INT32 DATE -> TimestampNTZType through the vectorized reader with both dictionary-encoded and plain pages. Benchmark numbers above are from GitHub Actions (GHA).

Was this patch authored or co-authored using generative AI tooling?

No


LuciferYang · 2026-05-19T06:55:58Z

cc @yaooqinn @dongjoon-hyun

yaooqinn

LGTM.

Closes #55971 from LuciferYang/SPARK-56804-date-to-tsntz.

Authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
(cherry picked from commit 19073d8)
Signed-off-by: yangjie01 <yangjie01@baidu.com>

LuciferYang · 2026-05-19T10:08:08Z

Merged into master and branch-4.3. Thanks @yaooqinn

LuciferYang added 3 commits May 14, 2026 23:34

[SPARK-56804][SQL] Add bulk read+convert path for DATE to TimestampNTZ Parquet vector updater

eaf396f

Merge remote-tracking branch 'upstream/master' into SPARK-56804-date-to-tsntz

04f32f9

Conflicts: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala

update benchmark results

2d48760

LuciferYang marked this pull request as draft May 19, 2026 02:37

Merge branch 'apache:master' into SPARK-56804-date-to-tsntz

4f66340

LuciferYang marked this pull request as ready for review May 19, 2026 04:49

yaooqinn approved these changes May 19, 2026

LuciferYang closed this in 19073d8 May 19, 2026

LuciferYang wants to merge 4 commits into apache:master from LuciferYang:SPARK-56804-date-to-tsntz
