
Spark 4.1: Support time type #16665

Open
Benjamin0313 wants to merge 1 commit into apache:main from Benjamin0313:spark-4.1-time-type-support

Benjamin0313 commented Jun 2, 2026

What

Adds support for the Iceberg time type in the Spark 4.1 module, mapping it to Spark's
TimeType (introduced in SPARK-51162).

Previously, projecting or writing a time column from Spark failed in TypeToSparkType with
UnsupportedOperationException: Spark does not support time fields.
This revisits #9006, which was closed before Spark had a native time type.

TimeType only exists in Spark 4.1, so this targets spark/v4.1 only (not 3.5 / 4.0).

How

Type conversion

  • TypeToSparkType: Iceberg time → Spark TimeType() (microsecond precision)
  • SparkTypeToType: Spark TimeType → Iceberg time

Value conversion — Iceberg stores time as microseconds-from-midnight; Spark 4.1 stores
nanoseconds-from-midnight (SPARK-52460). Conversion happens at the read/write boundary
(×1000 on read, ÷1000 on write):

  • Parquet — SparkParquetReaders (TimeReader), SparkParquetWriters (TimeMicrosWriter)
  • ORC — SparkOrcValueReaders#times, SparkOrcValueWriters#times (via LongColumnVector)
  • Avro — SparkPlannedAvroReader / SparkAvroWriter (time-micros logical type)
  • Row-level: SparkValueConverter, InternalRowWrapper
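
For reviewers, the unit conversion described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class name TimeUnitConversion and its method names are hypothetical, and it assumes time-of-day values are non-negative.

```java
/** Hypothetical helper; the names are illustrative and not taken from this PR. */
final class TimeUnitConversion {
  private static final long NANOS_PER_MICRO = 1_000L;

  private TimeUnitConversion() {}

  /** Read path: microseconds-from-midnight (Iceberg) to nanoseconds-from-midnight (Spark). */
  static long microsToNanos(long micros) {
    // multiplyExact throws ArithmeticException on overflow instead of wrapping silently
    return Math.multiplyExact(micros, NANOS_PER_MICRO);
  }

  /** Write path: nanoseconds-from-midnight (Spark) to microseconds-from-midnight (Iceberg). */
  static long nanosToMicros(long nanos) {
    // Sub-microsecond digits are dropped; for non-negative input this equals nanos / 1000
    return Math.floorDiv(nanos, NANOS_PER_MICRO);
  }
}
```

Because the read path multiplies by exactly 1000, a value read from a file and written back round-trips without loss. Only a nanosecond value that did not originate from microseconds loses precision on write.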

Vectorized reads are intentionally not supported in this PR. Spark 4.1's ColumnarBatch
cannot expose TimeType values (ColumnarBatchRow#get throws
Datatype not supported TimeType(6)), and exposing time through the shared arrow module's
accessor would require an engine-wide change affecting Flink and others. SparkBatch therefore
falls back to row-based reads when a time column is projected (both Parquet and ORC). This
restriction can be lifted in a follow-up once Spark's vectorized time support matures.
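
The fallback decision can be pictured roughly as below. This is a simplified sketch under stated assumptions, not the SparkBatch code: ColumnType and VectorizationCheck are hypothetical stand-ins for the Iceberg schema types, and the check looks only at top-level projected columns.

```java
import java.util.List;

/** Hypothetical illustration of the row-based fallback; not the actual SparkBatch logic. */
final class VectorizationCheck {
  /** Stand-in for a projected column's type; real code would inspect the Iceberg schema. */
  enum ColumnType { INT, LONG, STRING, TIMESTAMP, TIME }

  private VectorizationCheck() {}

  /** Returns false as soon as any projected column is a time column. */
  static boolean canUseVectorizedReads(List<ColumnType> projectedColumns) {
    return !projectedColumns.contains(ColumnType.TIME);
  }
}
```

In this sketch, a projection of INT and TIME columns takes the row-based path, while a projection that omits the time column can stay vectorized.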

Testing

  • Enabled the existing supportsTime() hook in TestSparkParquetReader, TestSparkAvroReader,
    and TestSparkRecordOrcReaderWriter, exercising schema + value round-trips via testTypeSchema.
  • Re-enabled TestInternalRowWrapper#testTime.
  • Added time handling to test helpers (GenericsHelpers#assertEqualsSafe/assertEqualsUnsafe,
    RandomData).
  • TestSparkOrcReader keeps supportsTime() == false because it also exercises the vectorized
    path, which is not supported here.

AI assistance

This change was implemented with the help of an AI coding assistant (Claude). I reviewed and
understand the implementation end-to-end and verified it locally (spotlessApply and the Spark 4.1
module tests pass). I'd especially welcome scrutiny on:

  • Deferring vectorized reads (the SparkBatch row-based fallback for time columns).
  • The time value paths in SparkValueConverter and InternalRowWrapper.

Closes #16663

Map Iceberg's time type to Spark 4.1's TimeType (added in SPARK-51162) for
row-based reads and writes across Parquet, ORC, and Avro. Iceberg stores
time as microseconds from midnight while Spark stores it as nanoseconds, so
values are converted on the boundary (x1000 on read, /1000 on write).

Vectorized reads are intentionally left unsupported for now: Spark 4.1's
ColumnarBatch (ColumnarBatchRow#get) does not support TimeType, and exposing
time through the shared Arrow accessor would require an engine-wide change.
SparkBatch therefore falls back to row-based reads when a time column is
projected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
