Spark 4.1: Support time type#16665
Open
Benjamin0313 wants to merge 1 commit into
Open
Conversation
Map Iceberg's time type to Spark 4.1's TimeType (added in SPARK-51162) for row-based reads and writes across Parquet, ORC, and Avro. Iceberg stores time as microseconds from midnight while Spark stores it as nanoseconds, so values are converted on the boundary (x1000 on read, /1000 on write). Vectorized reads are intentionally left unsupported for now: Spark 4.1's ColumnarBatch (ColumnarBatchRow#get) does not support TimeType, and exposing time through the shared Arrow accessor would require an engine-wide change. SparkBatch therefore falls back to row-based reads when a time column is projected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds support for the Iceberg
timetype in the Spark 4.1 module, mapping it to Spark'sTimeType(introduced in SPARK-51162).Previously, projecting or writing a
timecolumn from Spark threwUnsupportedOperationException: Spark does not support time fieldsfromTypeToSparkType.This revisits #9006, which was closed in 2019 — before Spark had a native time type.
TimeTypeonly exists in Spark 4.1, so this targetsspark/v4.1only (not 3.5 / 4.0).How
Type conversion
TypeToSparkType: Icebergtime→ SparkTimeType()(microsecond precision)SparkTypeToType: SparkTimeType→ IcebergtimeValue conversion — Iceberg stores time as microseconds-from-midnight; Spark 4.1 stores
nanoseconds-from-midnight (SPARK-52460). Conversion happens at the read/write boundary
(×1000 on read, ÷1000 on write):
SparkParquetReaders(TimeReader),SparkParquetWriters(TimeMicrosWriter)SparkOrcValueReaders#times,SparkOrcValueWriters#times(viaLongColumnVector)SparkPlannedAvroReader/SparkAvroWriter(time-microslogical type)SparkValueConverter,InternalRowWrapperVectorized reads are intentionally not supported in this PR. Spark 4.1's
ColumnarBatchcannot expose
TimeTypevalues (ColumnarBatchRow#getthrowsDatatype not supported TimeType(6)), and exposing time through the sharedarrowmodule'saccessor would require an engine-wide change affecting Flink and others.
SparkBatchthereforefalls back to row-based reads when a time column is projected (both Parquet and ORC). This can be
lifted in a follow-up once Spark's vectorized time support matures.
Testing
supportsTime()hook inTestSparkParquetReader,TestSparkAvroReader,and
TestSparkRecordOrcReaderWriter, exercising schema + value round-trips viatestTypeSchema.TestInternalRowWrapper#testTime.timehandling to test helpers (GenericsHelpers#assertEqualsSafe/assertEqualsUnsafe,RandomData).TestSparkOrcReaderkeepssupportsTime() == falsebecause it also exercises the vectorizedpath, which is not supported here.
AI assistance
This change was implemented with the help of an AI coding assistant (Claude). I reviewed and
understand the implementation end-to-end and verified it locally (
spotlessApplyand the Spark 4.1module tests pass). I'd especially welcome scrutiny on:
SparkBatchrow-based fallback for time columns).SparkValueConverterandInternalRowWrapper.Closes #16663