Native DataFusion scan silently returns wrong values reading INT96 as TimestampNTZ prior to Spark 4.0 #4218

@andygrove

Description

When a Parquet file stores timestamps as INT96 (how Spark writes TimestampType, i.e. a local-time-zone (LTZ) instant normalized to UTC) and the read schema requests TimestampNTZ, the native_datafusion scan silently returns wall-clock values that disagree with what was written.

Spark 3.x itself raises an error in this scenario (SPARK-36182) to prevent silent reinterpretation of an LTZ instant as NTZ. Comet's native scan should either match Spark's behavior by raising an error, or correctly handle the timezone conversion.
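To make the mismatch concrete, here is a minimal sketch of the arithmetic using only java.time, with the values from the reproduction below. The object name `Int96NtzDemo` and its fields are illustrative and not part of Comet or Spark. It assumes the shifted value comes from taking the stored UTC instant as a wall-clock value as-is; the issue does not state the exact value Comet returns.

```scala
import java.time.{LocalDateTime, ZoneId, ZoneOffset}

// Illustrative only: models the LTZ -> NTZ reinterpretation, not Comet's code path.
object Int96NtzDemo {
  // Session time zone and wall-clock value used in the reproduction.
  val sessionTz: ZoneId = ZoneId.of("America/Los_Angeles")
  val written: LocalDateTime = LocalDateTime.parse("2020-01-01T12:00:00")

  // Writing a TimestampType value stores the instant: the wall clock
  // interpreted in the session time zone (PST, UTC-8, on this date).
  val storedInstant = written.atZone(sessionTz).toInstant

  // Assumed naive NTZ read: the UTC wall clock of the stored instant, unchanged.
  val naiveNtz: LocalDateTime = LocalDateTime.ofInstant(storedInstant, ZoneOffset.UTC)

  // Explicit conversion: render the instant back in the session time zone.
  val converted: LocalDateTime = LocalDateTime.ofInstant(storedInstant, sessionTz)
}
```

Under these assumptions `Int96NtzDemo.naiveNtz` is 2020-01-01T20:00, eight hours away from the written value, while `Int96NtzDemo.converted` recovers 2020-01-01T12:00.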

Steps to Reproduce

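// Intended to run inside a Comet/Spark SQL test suite that provides
// withSQLConf, withTempPath, spark, and the toDF implicits.
import java.sql.Timestamp
import java.time.LocalDateTime

import org.apache.comet.CometConf
import org.apache.spark.SparkException
import org.apache.spark.sql.internal.SQLConf

val sessionTz = "America/Los_Angeles"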
val written = "2020-01-01 12:00:00"

withSQLConf(
  SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz,
  SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> "INT96",
  SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath

    // Write "2020-01-01 12:00:00" America/Los_Angeles as INT96.
    // The bits encode the UTC instant 2020-01-01 20:00:00.
    Seq(Timestamp.valueOf(written)).toDF("ts").write.parquet(path)

    // Spark refuses to read INT96 as TimestampNTZ (SPARK-36182)
    withSQLConf(CometConf.COMET_ENABLED.key -> "false") {
      intercept[SparkException] {
        spark.read.schema("ts timestamp_ntz").parquet(path).collect()
      }
    }

    // native_datafusion silently returns a shifted value
    withSQLConf(CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION) {
      val rows = spark.read.schema("ts timestamp_ntz").parquet(path).collect()
      val actual = rows.head.getAs[LocalDateTime](0)
      // actual != LocalDateTime.parse("2020-01-01T12:00:00")
      // The value is silently wrong — shifted by the timezone offset
    }
  }
}

Expected Behavior

Comet should match Spark's behavior and raise an error when asked to read INT96 timestamps as TimestampNTZ, since the LTZ→NTZ reinterpretation cannot be done safely without explicit conversion.
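One way to picture the expected behavior is a small guard evaluated before an INT96 column is read. This is a hypothetical sketch: `RequestedType`, `Int96ReadGuard`, and the error message are made up for illustration and do not correspond to any existing Comet or Spark API. It only covers the INT96 case discussed in this issue.

```scala
// Hypothetical model of the requested Spark type; these names are not from Comet or Spark.
sealed trait RequestedType
case object TimestampLtz extends RequestedType
case object TimestampNtz extends RequestedType

object Int96ReadGuard {
  // Returns Left with an error message when an INT96 column is requested as
  // TimestampNTZ, and Right(()) when it is requested as a regular (LTZ) timestamp.
  def check(requested: RequestedType): Either[String, Unit] = requested match {
    case TimestampNtz =>
      Left("INT96 stores an LTZ instant; refusing to read it as TimestampNTZ without explicit conversion")
    case TimestampLtz =>
      Right(())
  }
}
```

A scan could surface the `Left` case as an exception instead of returning rows, which would line up with the SparkException the reproduction expects from Spark itself.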

Actual Behavior

The native DataFusion scan returns a result without error, but the timestamp value is silently incorrect (shifted by the session timezone offset).
