
[GLUTEN-11622][VL] Fallback TimestampNTZ to Spark#11609

Merged
rui-mo merged 3 commits into apache:main from acvictor:acvictor/timestampntz
Feb 19, 2026

Conversation

@acvictor (Contributor) commented Feb 12, 2026

What changes are proposed in this pull request?

TimestampNTZ is not fully supported in Velox. This PR adds a FallbackByTimestampNTZ validator that unconditionally falls back any operator whose input or output schema contains TimestampNTZType.
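The idea behind such a validator can be sketched in plain Scala. The types below are simplified stand-ins for Spark's `DataType` hierarchy, not Gluten's actual validator API; the real `FallbackByTimestampNTZ` plugs into Gluten's validation framework and inspects each operator's input and output schemas:

```scala
// Simplified stand-ins for Spark's DataType hierarchy (hypothetical, for illustration).
sealed trait DataType
case object StringType extends DataType
case object TimestampType extends DataType
case object TimestampNTZType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class StructType(fields: List[DataType]) extends DataType

// Recursively check whether a type contains TimestampNTZType anywhere,
// including inside nested arrays and structs.
def containsTimestampNTZ(dt: DataType): Boolean = dt match {
  case TimestampNTZType   => true
  case ArrayType(elem)    => containsTimestampNTZ(elem)
  case StructType(fields) => fields.exists(containsTimestampNTZ)
  case _                  => false
}

// The validator falls back any operator whose input or output schema matches.
def shouldFallback(schemas: List[DataType]): Boolean =
  schemas.exists(containsTimestampNTZ)
```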

The Arrow type mapping for TimestampNTZType is added in SparkArrowUtil. It is required because RowToVeloxColumnarExec transitions are inserted after the validation phase at R2C boundaries, and these call SparkArrowUtil.toArrowSchema, which must be able to handle every type present in the schema. Without this mapping, the transition crashes with UnsupportedOperationException even though the operator itself was correctly fallen back.

How was this patch tested?

Added UTs in VeloxParquetDataTypeValidationSuite and DeltaSuite to verify that TimestampNTZ scans fall back to Spark and return correct results. Also added tests for Delta tables with NTZ columns, NTZ partition columns, and filters on NTZ columns, where we originally saw the UnsupportedOperationException being thrown.

Was this patch authored or co-authored using generative AI tooling?

No

Related issue: #11622

@github-actions github-actions bot added the VELOX label Feb 12, 2026
@acvictor acvictor force-pushed the acvictor/timestampntz branch from afc463d to 62c6a55 Compare February 12, 2026 14:41
@github-actions github-actions bot added the CORE (works for Gluten Core) and DATA_LAKE labels Feb 12, 2026
@github-actions

Run Gluten Clickhouse CI on x86

@FelixYBW FelixYBW requested a review from rui-mo February 12, 2026 23:51
@acvictor
Contributor Author

@rui-mo can you please review this PR? Thanks in advance!


@rui-mo (Contributor) left a comment

Hi @acvictor, has the RowToVeloxColumnarExec transition for the TimestampNTZ type already been supported with this PR? Thanks.

@rui-mo rui-mo changed the title [VL] Fallback TimestampNTZ to Spark [GLUTEN-11622][VL] Fallback TimestampNTZ to Spark Feb 16, 2026
@acvictor
Contributor Author

> Hi @acvictor, has the RowToVeloxColumnarExec transition for the TimestampNTZ type already been supported with this PR? Thanks.

Thank you for the review!

This PR fixes the SparkArrowUtil Arrow type mapping so that the RTC transitions at fallback boundaries don't crash with UnsupportedOperationException. The flow is:

  1. The new FallbackByTimestampNTZ validator detects TimestampNTZ in a plan node's schema and forces a fallback to Spark's row-based execution.
  2. At the fallback boundary, RowToVeloxColumnarExec calls SparkArrowUtil.toArrowSchema() to convert the schema. Before this PR, that call would throw java.lang.UnsupportedOperationException: Unsupported data type: timestamp_ntz at org.apache.spark.sql.utils.SparkArrowUtil$.toArrowType(SparkArrowUtil.scala:58).
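The shape of the fix can be sketched as follows. The types here are hypothetical stand-ins for Spark's and Arrow's classes (the real change is a new case in SparkArrowUtil.toArrowType using Arrow's Java API); the key point is that timestamp_ntz maps to an Arrow timestamp with microsecond precision and no timezone:

```scala
// Hypothetical stand-ins for Spark and Arrow timestamp types, for illustration only.
sealed trait SparkType
case object TimestampType extends SparkType
case object TimestampNTZType extends SparkType

case class ArrowTimestamp(unit: String, timezone: Option[String])

def toArrowType(dt: SparkType, sessionTz: String): ArrowTimestamp = dt match {
  // TimestampType carries the session timezone into Arrow.
  case TimestampType    => ArrowTimestamp("MICROSECOND", Some(sessionTz))
  // Before the fix, TimestampNTZType fell through to an
  // UnsupportedOperationException; the fix maps it to a
  // timezone-less Arrow timestamp.
  case TimestampNTZType => ArrowTimestamp("MICROSECOND", None)
}
```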

@rui-mo
Contributor

rui-mo commented Feb 16, 2026

@acvictor Thank you for the additional information. I’m still a bit unclear: in a plan like op0 -> R2C -> op1, if all plan nodes involving the timestamp_ntz type fall back, could you help clarify which operator op1 might be that would lead to an R2C being inserted?

@acvictor
Contributor Author

> @acvictor Thank you for the additional information. I’m still a bit unclear: in a plan like op0 -> R2C -> op1, if all plan nodes involving the timestamp_ntz type fall back, could you help clarify which operator op1 might be that would lead to an R2C being inserted?

I added debug logs to Gluten and used the test "use TIMESTAMP_NTZ in a partition column" as an example. This test creates a table with schema c1 STRING, c2 TIMESTAMP, c3 TIMESTAMP_NTZ partitioned by c3, inserts a row, then calls spark.table("delta_test").head.

op1 here would be ColumnarCollectLimitExec.

The actual runtime plan is:

VeloxColumnarToRowExec
└── ColumnarCollectLimitExec                (op1)
    └── RowToVeloxColumnarExec
        └── WholeStageCodegenExec           (op0, vanilla Spark, wraps the fallen-back FileScan)
            └── ColumnarToRow
                └── FileScan parquet spark_catalog.default.delta_test
                    [c1, c2, c3(TimestampNTZ)] PARTITIONED BY (c3)

Debug logs added to Transitions.scala confirm it:

   [TRANSITION-DEBUG] node: ColumnarCollectLimit
   [TRANSITION-DEBUG]   conv: Impl(None$,VanillaBatchType$) -> Impl(Any,Is(VeloxBatchType$))
   [TRANSITION-DEBUG]   child: Scan parquet spark_catalog.default.delta_test
   [TRANSITION-DEBUG]   new: RowToVeloxColumnar
   [TRANSITION-DEBUG]   schema: StructType(...,StructField(c3,TimestampNTZType,true))

ColumnarCollectLimitExec appears despite the FallbackByTimestampNTZ validator because it is registered by a post-transform rule, which runs after validation. The rule sees the vanilla CollectLimitExec with a columnar child and unconditionally replaces it with ColumnarCollectLimitExec, bypassing the validator entirely. InsertTransitions then sees a convention mismatch (VanillaBatch -> VeloxBatch) and inserts the RowToVeloxColumnarExec, which throws in SparkArrowUtil.toArrowSchema because there is no case for TimestampNTZType. The validator alone cannot handle this, because post-transform rules like CollectLimitTransformerRule can reintroduce Gluten native operators after validation has already run.
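The ordering problem described above can be modeled as a toy pipeline. All names below are simplified stand-ins for illustration, not Gluten's actual rule or plan APIs:

```scala
// Toy plan nodes (hypothetical, for illustration).
sealed trait Plan { def columnar: Boolean }
case object VanillaScan extends Plan { val columnar = false } // fell back by the validator
case class CollectLimit(child: Plan) extends Plan { val columnar = false }
case class ColumnarCollectLimit(child: Plan) extends Plan { val columnar = true }
case class RowToColumnar(child: Plan) extends Plan { val columnar = true }

// A post-transform rule runs AFTER validation and replaces CollectLimit
// unconditionally, bypassing any fallback decision already made.
def postTransform(p: Plan): Plan = p match {
  case CollectLimit(child) => ColumnarCollectLimit(child)
  case other               => other
}

// Transition insertion: a columnar operator over a row-based child
// forces a row-to-columnar transition at the boundary -- this is where
// toArrowSchema gets called on a schema that may contain TimestampNTZ.
def insertTransitions(p: Plan): Plan = p match {
  case ColumnarCollectLimit(child) if !child.columnar =>
    ColumnarCollectLimit(RowToColumnar(child))
  case other => other
}
```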

@rui-mo
Contributor

rui-mo commented Feb 17, 2026

> VeloxColumnarToRowExec
> └── ColumnarCollectLimitExec - op1
> └── RowToVeloxColumnarExec

@acvictor Thanks for providing the detailed query plan. ColumnarCollectLimitExec is a Scala implementation and might automatically support the TimestampNTZ type.

For this patch, to make the TimestampNTZ fallback strategy work, I assume we need to ensure that both VeloxColumnarToRowExec and RowToVeloxColumnarExec handle TimestampNTZ correctly. I was a bit surprised by your test results showing that they seem to work with only a small change in toArrowSchema. Is this because the type is treated during conversion simply as an Arrow timestamp without a timezone? Could you help confirm if they work well? Thanks.

@acvictor
Contributor Author

> VeloxColumnarToRowExec
> └── ColumnarCollectLimitExec - op1
> └── RowToVeloxColumnarExec
>
> @acvictor Thanks for providing the detailed query plan. ColumnarCollectLimitExec is a Scala implementation and might automatically support the TimestampNTZ type.
>
> For this patch, to make the TimestampNTZ fallback strategy work, I assume we need to ensure that both VeloxColumnarToRowExec and RowToVeloxColumnarExec handle TimestampNTZ correctly. I was a bit surprised by your test results showing that they seem to work with only a small change in toArrowSchema. Is this because the type is treated during conversion simply as an Arrow timestamp without a timezone? Could you help confirm if they work well? Thanks.

> Is this because the type is treated during conversion simply as an Arrow timestamp without a timezone?

Yes, that's right! I was able to get all OSS Delta tests to pass with this change, as well as the Spark and Gluten UTs, and have not found any issues so far.

@rui-mo rui-mo merged commit 7dde101 into apache:main Feb 19, 2026
62 checks passed
@acvictor acvictor deleted the acvictor/timestampntz branch February 21, 2026 11:28
@rui-mo rui-mo mentioned this pull request Mar 10, 2026

Labels

CORE (works for Gluten Core), DATA_LAKE, VELOX
