[SPARK-57100][SQL] Add columnar (ColumnVector) support for nanosecond timestamp types by MaxGekk · Pull Request #56198 · apache/spark

MaxGekk · 2026-05-29T09:03:26Z

What changes were proposed in this pull request?

Implement columnar storage support for TimestampNTZNanosType and TimestampLTZNanosType across the column-vector stack. The layout mirrors CalendarInterval: each column gets two child vectors — a Long child for epochMicros and a Short child for nanosWithinMicro (range [0, 999]).

Concretely:

ColumnVector — getTimestampNTZNanos / getTimestampLTZNanos now read from child vectors instead of throwing SparkUnsupportedOperationException.
WritableColumnVector — allocates the two child columns in the constructor; adds putTimestampNTZNanos / putTimestampLTZNanos write methods.
ConstantColumnVector — same child-column allocation; adds setTimestampNanosVal for the constant-value (partition-column) path.
RowToColumnConverter (Columnar.scala) — adds TimestampNTZNanosConverter / TimestampLTZNanosConverter objects (append epochMicros + nanosWithinMicro to children via appendStruct); routes nullable columns through StructNullableTypeConverter.
ColumnVectorUtils — handles both types in populate (constant-column path) and in appendValue (null and non-null branches).

Why are the changes needed?

SPARK-56981 added row-level physical representation for nanosecond timestamps, but columnar execution could not hold or move these values — any attempt to build a ColumnarBatch from rows containing nanosecond timestamps threw an unsupported-operation exception. This PR closes that gap.

Does this PR introduce any user-facing change?

Yes. ColumnarBatch can now be built from InternalRows containing TimestampNTZNanosType / TimestampLTZNanosType values. Previously this threw SparkUnsupportedOperationException.

How was this patch tested?

Added four unit tests to RowToColumnConverterSuite:

TimestampNTZNanosType column roundtrip — non-null values survive the row→column→read cycle.
TimestampNTZNanosType column with nulls — null slots are preserved correctly.
TimestampLTZNanosType column roundtrip — same for the LTZ variant.
TimestampLTZNanosType column with nulls — same for the LTZ variant.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6 (claude.ai/code)

… timestamp types Implement read/write/append support for TimestampNTZNanosType and TimestampLTZNanosType in column vectors, following the CalendarInterval two-child-vector pattern (Long for epochMicros, Short for nanosWithinMicro). Co-authored-by: Max Gekk <max.gekk@gmail.com>

…-vector support Four issues found in code review: 1. appendStruct(true) null-propagation: extend the StructType|VariantType guard in WritableColumnVector to also recurse for CalendarIntervalType, TimestampNTZNanosType, and TimestampLTZNanosType children, so that a nullable struct field of these types correctly propagates nulls into their own child sub-columns, preventing index divergence. 2. MutableColumnarRow: add copy(), get(), and update() branches for TimestampNTZNanosType and TimestampLTZNanosType, plus setTimestampNTZNanos and setTimestampLTZNanos setters. 3. ColumnVector Javadoc: fix "int vector" -> "short vector" for child 1 of the nanosecond timestamp layout. 4. Test coverage: add testVectors (OnHeap + OffHeap) for both nanos types to ColumnVectorSuite; add populate tests to ColumnVectorUtilsSuite; add nanos columns to the ColumnarBatchSuite RowToColumnConverter end-to-end test. Co-authored-by: Max Gekk <max.gekk@gmail.com>

Co-authored-by: Max Gekk <max.gekk@gmail.com>

MaxGekk · 2026-05-30T05:25:03Z

@dongjoon-hyun @peter-toth May I ask you to review this PR. We need this to support timestamps with nanosecond precision in Parquet vectorized reader and in ORC.

peter-toth

Summary

Closes the SPARK-56981 follow-up gap so that ColumnarBatch can hold nanosecond timestamps. Each parent column gets two children — Long epochMicros + Short nanosWithinMicro — mirroring the existing CalendarInterval pattern.

Prior state and problem. SPARK-56981 added the row-layer plumbing (SpecializedGetters, UnsafeRow, UnsafeArrayData) for TimestampNTZNanosType/TimestampLTZNanosType, but the column-vector stack was untouched: the default ColumnVector.getTimestamp{NTZ,LTZ}Nanos threw SparkUnsupportedOperationException, WritableColumnVector had no putters or constructor child allocation, and every type-dispatch site (MutableColumnarRow.update, ColumnVectorUtils.populate/appendValue, RowToColumnConverter.getConverterForType) lacked a branch. Any path materializing a ColumnarBatch from rows containing nanos values blew up. Structurally this is because each dispatch site predates the nanos types and fans out independently — adding the type means walking each one.

Design approach. Treat the new types as structurally identical to CalendarInterval (struct-shaped parent with two children) and follow the existing dispatch shape rather than introduce a new pattern. The ColumnVector defaults read via getChild(0).getLong + getChild(1).getShort; child columns are auto-allocated in the WritableColumnVector and ConstantColumnVector constructors; nullable wrappers are routed to StructNullableTypeConverter. NTZ and LTZ remain as parallel methods even with identical bodies, matching the row-layer convention from SPARK-56981.

Key design decisions.

The ColumnVector default getters now read the layout instead of throwing, so concrete subclasses that don't override but happen to have child vectors of the right primitive types could silently return values rather than fail loudly. Same risk profile as getInterval/getVariant; callers are expected to dispatch on dataType() first.
ConstantColumnVector exposes a single type-agnostic setTimestampNanosVal, while WritableColumnVector and MutableColumnarRow keep parallel NTZ/LTZ setters. Each surface follows its neighbour's existing convention.
The appendStruct(true) recursion in WritableColumnVector now treats CalendarIntervalType, TimestampNTZNanosType, TimestampLTZNanosType as structurally-childed alongside StructType/VariantType, so a null parent struct cascades to grandchild cursors.

Implementation sketch. Java side: ColumnVector default getters; WritableColumnVector putters + constructor + appendStruct recursion; ConstantColumnVector constructor + setter; MutableColumnarRow setters + update/get/copy dispatch; ColumnVectorUtils populate + appendValue. Scala side: two new RowToColumnConverter cases and the routing branch in getConverterForType for nullable wrappers.

Behavioral changes worth calling out.

ColumnarBatch can now hold nanosecond timestamp columns where it previously threw SparkUnsupportedOperationException.
WritableColumnVector.appendStruct(true) now recurses into CalendarIntervalType child columns, fixing a previously-latent grandchild-cursor skew (see #1 below). Not flagged in the PR description.

General

The PR description doesn't mention the CalendarIntervalType change in WritableColumnVector.appendStruct. It's a real fix for nested struct-of-interval scenarios, narrowly latent before this PR. A one-line note in the description would help reviewers focused on nanos not miss the interval-semantics shift.

Suggested improvements

appendStruct CalendarIntervalType is a separate latent fix. The recursion now treats CalendarInterval child columns as structurally-childed; previously they took the primitive path and would have skewed grandchild cursors for null parent rows. Worth either splitting out or adding a struct-of-interval test. [sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java:766]
Two identical dispatch branches in populate. appendValue already collapses both nanos types into one branch; populate does not. [sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVectorUtils.java:110]
Missing MutableColumnarRow test for TimestampLTZNanosType. The new setTimestampLTZNanos/update/get/copy paths for LTZ aren't exercised; the NTZ test mirrors them straightforwardly. [sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnVectorSuite.scala:424]

peter-toth · 2026-05-30T09:18:10Z

      for (WritableColumnVector c: childColumns) {
-        if (c.type instanceof StructType || c.type instanceof VariantType) {
+        if (c.type instanceof StructType || c.type instanceof VariantType
+            || c.type instanceof CalendarIntervalType


Adding CalendarIntervalType here isn't just supporting the new nanos types — it also fixes a previously-latent bug for nested struct-of-interval. Pre-PR, when an outer struct column was appended as null and one of its child columns was a CalendarInterval, the interval child took the else branch (c.appendNull()), advancing only the interval's own cursor and leaving its three grandchild primitive columns (months/days/microseconds) un-advanced. Subsequent rows would then write into the wrong grandchild slots — silent skew.

The fix is correct, but:

The PR description doesn't mention this. Worth one line so reviewers don't miss the interval-semantics change.

There's no test exercising the new recursion. The minimum case is a struct-of-interval column with at least one null parent row, then read back the next non-null row's children to verify they aren't shifted. Same shape extends to struct-of-TimestampNanos.

Up to you whether to split out into a separate commit or keep bundled.

I've extracted the CalendarIntervalType fix into a separate PR, along with a test that covers the struct-of-interval null-parent skew case: #56235 (SPARK-57184). This PR now only adds the nanosecond timestamp types to the appendStruct recursion guard.

peter-toth · 2026-05-30T09:18:10Z

      } else if (pdt instanceof PhysicalCalendarIntervalType) {
        // The value of `numRows` is irrelevant.
        col.setCalendarInterval((CalendarInterval) row.get(fieldIdx, t));
+      } else if (pdt instanceof PhysicalTimestampNTZNanosType) {


The two else if branches are identical. appendValue below at line 178 already collapses both nanos types into one condition with ||. Suggest the same here:

} else if (pdt instanceof PhysicalTimestampNTZNanosType || pdt instanceof PhysicalTimestampLTZNanosType) { col.setTimestampNanosVal((TimestampNanosVal) row.get(fieldIdx, t)); }

Done. Collapsed both nanos branches into one condition with ||, matching appendValue.

peter-toth · 2026-05-30T09:18:11Z

+      assert(mutableRow.get(0, TimestampNTZNanosType(9)) === v)
+      assert(mutableRow.copy().get(0, TimestampNTZNanosType(9)) === v)
+    }
+  }


The PR adds a MutableColumnarRow test for TimestampNTZNanosType but not for TimestampLTZNanosType. The LTZ paths (setTimestampLTZNanos at MutableColumnarRow.java:348, the update(TimestampLTZNanosType) and get(TimestampLTZNanosType) dispatches at MutableColumnarRow.java:240,272, and the copy() branch at MutableColumnarRow.java:104) aren't exercised. Adding a parallel mutable ColumnarRow with TimestampLTZNanosType block right after this one closes the gap.

Done. Added a parallel mutable ColumnarRow with TimestampLTZNanosType test that exercises setTimestampLTZNanos/getTimestampLTZNanos, the update/get dispatches, and the copy() branch.

peter-toth · 2026-05-30T09:23:15Z

@MaxGekk, probably we should extract the CalendarIntervalType related fix to a separate PR, but the other 2 are just minor nits.

…olumn-vector support Apply review feedback from the PR: 1. Extract the CalendarIntervalType fix out of appendStruct: WritableColumnVector no longer recurses for CalendarIntervalType children, keeping only the nanosecond timestamp types in this PR. The interval fix will be handled separately. 2. Collapse the two identical nanos branches in ColumnVectorUtils.populate into a single condition with ||, matching appendValue. 3. Add a MutableColumnarRow test for TimestampLTZNanosType, mirroring the existing TimestampNTZNanosType test. Co-authored-by: Max Gekk <max.gekk@gmail.com>

MaxGekk · 2026-05-31T21:50:28Z

Merging to master/4.x. Thank you @peter-toth for review.

… timestamp types ### What changes were proposed in this pull request? Implement columnar storage support for `TimestampNTZNanosType` and `TimestampLTZNanosType` across the column-vector stack. The layout mirrors `CalendarInterval`: each column gets two child vectors — a `Long` child for `epochMicros` and a `Short` child for `nanosWithinMicro` (range [0, 999]). Concretely: - **`ColumnVector`** — `getTimestampNTZNanos` / `getTimestampLTZNanos` now read from child vectors instead of throwing `SparkUnsupportedOperationException`. - **`WritableColumnVector`** — allocates the two child columns in the constructor; adds `putTimestampNTZNanos` / `putTimestampLTZNanos` write methods. - **`ConstantColumnVector`** — same child-column allocation; adds `setTimestampNanosVal` for the constant-value (partition-column) path. - **`RowToColumnConverter` (`Columnar.scala`)** — adds `TimestampNTZNanosConverter` / `TimestampLTZNanosConverter` objects (append `epochMicros` + `nanosWithinMicro` to children via `appendStruct`); routes nullable columns through `StructNullableTypeConverter`. - **`ColumnVectorUtils`** — handles both types in `populate` (constant-column path) and in `appendValue` (null and non-null branches). ### Why are the changes needed? SPARK-56981 added row-level physical representation for nanosecond timestamps, but columnar execution could not hold or move these values — any attempt to build a `ColumnarBatch` from rows containing nanosecond timestamps threw an unsupported-operation exception. This PR closes that gap. ### Does this PR introduce _any_ user-facing change? Yes. `ColumnarBatch` can now be built from `InternalRow`s containing `TimestampNTZNanosType` / `TimestampLTZNanosType` values. Previously this threw `SparkUnsupportedOperationException`. ### How was this patch tested? Added four unit tests to `RowToColumnConverterSuite`: - `TimestampNTZNanosType column roundtrip` — non-null values survive the row→column→read cycle. - `TimestampNTZNanosType column with nulls` — null slots are preserved correctly. - `TimestampLTZNanosType column roundtrip` — same for the LTZ variant. - `TimestampLTZNanosType column with nulls` — same for the LTZ variant. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Sonnet 4.6 (claude.ai/code) Closes #56198 from MaxGekk/nanos-column-vector. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit e73fadf) Signed-off-by: Max Gekk <max.gekk@gmail.com>

…ing a null struct ### What changes were proposed in this pull request? Include `CalendarIntervalType` in the recursion guard of `WritableColumnVector.appendStruct(boolean isNull)`, so that appending a NULL parent struct cascades `appendStruct(true)` into a `CalendarInterval` child column and advances all three of its grandchild columns (months / days / microseconds). ```java for (WritableColumnVector c: childColumns) { if (c.type instanceof StructType || c.type instanceof VariantType || c.type instanceof CalendarIntervalType) { c.appendStruct(true); } else { c.appendNull(); } } ``` This was split out of the nanosecond-timestamp `ColumnVector` PR (SPARK-57100, #56198) per review, since it is an independent fix. ### Why are the changes needed? A `CalendarInterval` column is struct-shaped: it is backed by three grandchild primitive columns (`months` as int, `days` as int, `microseconds` as long). The recursion guard in `appendStruct` only handled `StructType` and `VariantType`, so an interval child column took the `else` branch (`c.appendNull()`), which advances only the interval column's own cursor and leaves its three grandchild cursors un-advanced. As a result, for a struct column with a `CalendarInterval` field, appending a NULL parent row left the interval's grandchild cursors behind by one. A subsequent non-null row then wrote its `months`/`days`/`microseconds` into the wrong (earlier) grandchild slots, and reading that row back returned a skewed value - silent data corruption for the nested struct-of-interval case. ### Does this PR introduce _any_ user-facing change? Yes. Reading back a struct-of-interval column that contains a NULL parent row followed by a non-null row now returns the correct interval value instead of a skewed one. Previously it returned corrupted data. ### How was this patch tested? Added a unit test to `ColumnarBatchSuite` that uses `RowToColumnConverter` to convert a null parent struct followed by non-null struct-of-interval rows, and verifies the interval values are read back correctly. The test fails without the fix and passes with it. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.8 Closes #56235 from MaxGekk/fix-appendstruct-interval. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>

…ing a null struct ### What changes were proposed in this pull request? Include `CalendarIntervalType` in the recursion guard of `WritableColumnVector.appendStruct(boolean isNull)`, so that appending a NULL parent struct cascades `appendStruct(true)` into a `CalendarInterval` child column and advances all three of its grandchild columns (months / days / microseconds). ```java for (WritableColumnVector c: childColumns) { if (c.type instanceof StructType || c.type instanceof VariantType || c.type instanceof CalendarIntervalType) { c.appendStruct(true); } else { c.appendNull(); } } ``` This was split out of the nanosecond-timestamp `ColumnVector` PR (SPARK-57100, #56198) per review, since it is an independent fix. ### Why are the changes needed? A `CalendarInterval` column is struct-shaped: it is backed by three grandchild primitive columns (`months` as int, `days` as int, `microseconds` as long). The recursion guard in `appendStruct` only handled `StructType` and `VariantType`, so an interval child column took the `else` branch (`c.appendNull()`), which advances only the interval column's own cursor and leaves its three grandchild cursors un-advanced. As a result, for a struct column with a `CalendarInterval` field, appending a NULL parent row left the interval's grandchild cursors behind by one. A subsequent non-null row then wrote its `months`/`days`/`microseconds` into the wrong (earlier) grandchild slots, and reading that row back returned a skewed value - silent data corruption for the nested struct-of-interval case. ### Does this PR introduce _any_ user-facing change? Yes. Reading back a struct-of-interval column that contains a NULL parent row followed by a non-null row now returns the correct interval value instead of a skewed one. Previously it returned corrupted data. ### How was this patch tested? Added a unit test to `ColumnarBatchSuite` that uses `RowToColumnConverter` to convert a null parent struct followed by non-null struct-of-interval rows, and verifies the interval values are read back correctly. The test fails without the fix and passes with it. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Opus 4.8 Closes #56235 from MaxGekk/fix-appendstruct-interval. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit 18f2bac) Signed-off-by: Max Gekk <max.gekk@gmail.com>

MaxGekk changed the title ~~[SPARK-57100][SQL] Add columnar (ColumnVector) support for nanosecond timestamp types~~ [WIP][SPARK-57100][SQL] Add columnar (ColumnVector) support for nanosecond timestamp types May 29, 2026

MaxGekk added 3 commits May 29, 2026 11:27

[SPARK-57100][SQL] Fix Scalastyle import order in ColumnarBatchSuite

e46c1f2

Co-authored-by: Max Gekk <max.gekk@gmail.com>

Remove unused import

7a10e70

MaxGekk changed the title ~~[WIP][SPARK-57100][SQL] Add columnar (ColumnVector) support for nanosecond timestamp types~~ [SPARK-57100][SQL] Add columnar (ColumnVector) support for nanosecond timestamp types May 29, 2026

peter-toth reviewed May 30, 2026

View reviewed changes

MaxGekk added 2 commits May 31, 2026 19:42

Merge remote-tracking branch 'origin/master' into nanos-column-vector

37c8283

MaxGekk mentioned this pull request May 31, 2026

[SPARK-57184][SQL] Recurse into CalendarInterval children when appending a null struct #56235

Closed

peter-toth approved these changes May 31, 2026

View reviewed changes

MaxGekk closed this in e73fadf May 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57100][SQL] Add columnar (ColumnVector) support for nanosecond timestamp types#56198

[SPARK-57100][SQL] Add columnar (ColumnVector) support for nanosecond timestamp types#56198
MaxGekk wants to merge 6 commits into
apache:masterfrom
MaxGekk:nanos-column-vector

MaxGekk commented May 29, 2026

Uh oh!

MaxGekk commented May 30, 2026

Uh oh!

peter-toth left a comment

Uh oh!

peter-toth May 30, 2026

Uh oh!

MaxGekk May 31, 2026

Uh oh!

peter-toth May 30, 2026

Uh oh!

MaxGekk May 31, 2026

Uh oh!

peter-toth May 30, 2026

Uh oh!

MaxGekk May 31, 2026

Uh oh!

peter-toth commented May 30, 2026

Uh oh!

MaxGekk commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

MaxGekk commented May 29, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

MaxGekk commented May 30, 2026

Uh oh!

peter-toth left a comment

Choose a reason for hiding this comment

Summary

General

Suggested improvements

Uh oh!

peter-toth May 30, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk May 31, 2026

Choose a reason for hiding this comment

Uh oh!

peter-toth May 30, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk May 31, 2026

Choose a reason for hiding this comment

Uh oh!

peter-toth May 30, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk May 31, 2026

Choose a reason for hiding this comment

Uh oh!

peter-toth commented May 30, 2026

Uh oh!

MaxGekk commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants