[SUPPORT] Spark write path silently ignores spark.sql.parquet.outputTimestampType and hoodie.parquet.outputtimestamptype #18752

@linliu-code

Describe the problem you faced

When writing a Hudi table via the Spark DataSource, the Parquet file's timestamp column logical type is always TIMESTAMP(MICROS, isAdjustedToUTC=true) for Spark TimestampType columns, regardless of what the user requests via:

  1. SparkSession config spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS (or INT96), or
  2. The documented Hudi write option hoodie.parquet.outputtimestamptype=TIMESTAMP_MILLIS.

In contrast, Spark's own Parquet writer (df.write.parquet(...)) under the same SparkSession honors both TIMESTAMP_MILLIS and INT96.

This means Hudi-written tables cannot:

  • store timestamps as TIMESTAMP(MILLIS, …) for downstream consumers that expect millisecond-precision Parquet, or
  • store timestamps as INT96 for legacy Hive/Impala readers that require INT96.

Data values round-trip correctly (MICROS is the higher-precision encoding, so no value information is lost), but the ignored setting silently breaks (a) storage workflows targeting MILLIS for smaller files and (b) interop with readers requiring INT96. It also means the documented hoodie.parquet.outputtimestamptype config has no effect.

TimestampNTZType is not affected: Spark itself also ignores outputTimestampType for NTZ, so Hudi matching that is correct.

To Reproduce

Single-file pyspark script — no Docker required.

export HUDI_BUNDLE=/path/to/hudi-spark3.4-bundle_2.12-<VERSION>.jar
spark-submit \
  --master 'local[2]' \
  --jars "$HUDI_BUNDLE" \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  repro.py

repro.py:

import datetime, glob, os, shutil, tempfile
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

WORK = tempfile.mkdtemp(prefix="hudi_ts_repro_")
ROW = (1, datetime.datetime(2026, 5, 15, 12, 34, 56, 123456, tzinfo=datetime.timezone.utc))
SCHEMA = StructType([
    StructField("id", IntegerType(), False),
    StructField("ts_tz", TimestampType(), False),
])
HUDI_OPTS = {
    "hoodie.table.name": "ts_repro",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "id",
    "hoodie.datasource.write.partitionpath.field": "",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}

def parquet_logical_type(d, col):
    spark = SparkSession.builder.getOrCreate()
    jvm = spark._jvm
    pq = sorted(glob.glob(f"{d}/**/*.parquet", recursive=True))[0]
    reader = jvm.org.apache.parquet.hadoop.ParquetFileReader.open(
        jvm.org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(
            jvm.org.apache.hadoop.fs.Path(pq), spark._jsc.hadoopConfiguration()))
    try:
        s = reader.getFileMetaData().getSchema().toString()
    finally:
        reader.close()  # release the file handle even if reading the schema fails
    for line in s.splitlines():
        line = line.strip()
        if f" {col};" in line or (f" {col} " in line and "(" in line):
            return line[line.find(col):].rstrip(";").strip()
    return "(not found)"

def run(label, out_ts):
    spark = (SparkSession.builder.appName(f"r_{label}")
        .config("spark.sql.session.timeZone", "UTC")
        .config("spark.sql.parquet.outputTimestampType", out_ts).getOrCreate())
    spark.sparkContext.setLogLevel("WARN")
    df = spark.createDataFrame([ROW], SCHEMA)
    sp = os.path.join(WORK, f"{label}_spark")  # plain Spark Parquet output
    hp = os.path.join(WORK, f"{label}_hudi")   # Hudi table base path
    shutil.rmtree(sp, ignore_errors=True)
    shutil.rmtree(hp, ignore_errors=True)
    df.coalesce(1).write.mode("overwrite").parquet(sp)
    df.write.format("hudi").options(**HUDI_OPTS).mode("overwrite").save(hp)
    print(f"  {label}: Spark-direct = {parquet_logical_type(sp, 'ts_tz')}   |   Hudi = {parquet_logical_type(hp, 'ts_tz')}")
    spark.stop()

run("session_MILLIS", "TIMESTAMP_MILLIS")
run("session_MICROS", "TIMESTAMP_MICROS")
run("session_INT96",  "INT96")
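
The script above sets only the session-level config. The "Hudi option MILLIS" column in the bundle table further down refers to passing hoodie.parquet.outputtimestamptype as a write option instead. A minimal sketch of that variant follows; the helper names with_hudi_timestamp_option and write_hudi are hypothetical, and df is assumed to be the DataFrame built inside run().

```python
# Hypothetical helpers for exercising the Hudi write option rather than the
# session config. Only the option key comes from this report; the function
# names are illustrative.

HUDI_TS_OPTION = "hoodie.parquet.outputtimestamptype"


def with_hudi_timestamp_option(base_opts, out_ts):
    """Return a copy of base_opts with the Hudi timestamp output type set."""
    opts = dict(base_opts)  # copy so the shared base options are not mutated
    opts[HUDI_TS_OPTION] = out_ts
    return opts


def write_hudi(df, path, opts):
    """Write df as a Hudi table at path using the given write options."""
    df.write.format("hudi").options(**opts).mode("overwrite").save(path)
```

Inside run(), the Hudi write would then become write_hudi(df, hp, with_hudi_timestamp_option(HUDI_OPTS, "TIMESTAMP_MILLIS")). According to this report, the resulting file still carries TIMESTAMP(MICROS,true) on the affected bundles.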

Expected behavior

Same Parquet logical type as Spark-direct:

session_MILLIS: Spark-direct = ts_tz (TIMESTAMP(MILLIS,true))  |  Hudi = ts_tz (TIMESTAMP(MILLIS,true))
session_MICROS: Spark-direct = ts_tz (TIMESTAMP(MICROS,true))  |  Hudi = ts_tz (TIMESTAMP(MICROS,true))
session_INT96 : Spark-direct = ts_tz                           |  Hudi = ts_tz

Actual behavior

session_MILLIS: Spark-direct = ts_tz (TIMESTAMP(MILLIS,true))  |  Hudi = ts_tz (TIMESTAMP(MICROS,true))   << diverged
session_MICROS: Spark-direct = ts_tz (TIMESTAMP(MICROS,true))  |  Hudi = ts_tz (TIMESTAMP(MICROS,true))   << matches
session_INT96 : Spark-direct = ts_tz                           |  Hudi = ts_tz (TIMESTAMP(MICROS,true))   << diverged
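
The "diverged" and "matches" annotations can also be checked mechanically. The sketch below is an assumption-laden helper, not part of the repro script: honors_request and EXPECTED_SUFFIX are hypothetical names, and the expected strings are copied from the "Expected behavior" lines of this report.

```python
# Expected suffix after the column name, as printed by parquet_logical_type,
# for each requested output type (values taken from this report).
EXPECTED_SUFFIX = {
    "TIMESTAMP_MILLIS": " (TIMESTAMP(MILLIS,true))",
    "TIMESTAMP_MICROS": " (TIMESTAMP(MICROS,true))",
    "INT96": "",  # INT96 columns are printed without a logical-type annotation
}


def honors_request(col, out_ts, observed):
    """Return True if the observed schema fragment matches the requested type."""
    return observed.strip() == col + EXPECTED_SUFFIX[out_ts]


if __name__ == "__main__":
    # The Hudi column of the three result lines above.
    hudi_observed = "ts_tz (TIMESTAMP(MICROS,true))"
    for requested in ("TIMESTAMP_MILLIS", "TIMESTAMP_MICROS", "INT96"):
        status = "matches" if honors_request("ts_tz", requested, hudi_observed) else "diverged"
        print(f"{requested}: {status}")
```

Applied to the Hudi column above, only the TIMESTAMP_MICROS request is honored.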

Reproduced identically against three bundles by only swapping --jars:

Bundle (Maven Central / Apache staging)    | session MILLIS | Hudi option MILLIS | session INT96
hudi-spark3.4-bundle_2.12-0.15.0.jar       | reproduces     | reproduces         | reproduces
hudi-spark3.4-bundle_2.12-0.15.1-rc1.jar   | reproduces     | reproduces         | reproduces
hudi-spark3.4-bundle_2.12-1.1.1.jar        | reproduces     | reproduces         | reproduces

The bug spans at least 0.15.0 through 1.1.1, so it survived the 1.x rewrite.

Environment Description

  • Hudi version: 0.15.0, 0.15.1-rc1, 1.1.1 (all reproduce; Spark 3.4 bundles, Scala 2.12)
  • Spark version: 3.4.3
  • Hadoop version: 3 (bundled Spark distribution)
  • Storage: local FS; bug is in the Parquet writer logical-type selection and is storage-independent
  • Running on Docker?: optional

Additional context

  • TimestampNTZType is unaffected (Spark itself ignores outputTimestampType for NTZ).
  • hoodie.parquet.outputtimestamptype is documented but appears to be effectively dead code across at least three releases.
  • Workaround for users needing MILLIS: convert to LongType epoch-millis before writing. No workaround for INT96 / legacy Hive interop without re-writing files outside Hudi.
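
The MILLIS workaround in the last bullet could look roughly like the sketch below. The helper names with_epoch_millis and to_epoch_millis, and the target column name, are hypothetical; the Spark side relies on the built-in SQL function unix_millis (available since Spark 3.1) and is invoked through selectExpr so the snippet needs no pyspark import. Note that the column is then a plain LongType, so readers no longer see a timestamp logical type at all.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def to_epoch_millis(dt):
    """Pure-Python reference: whole milliseconds since the epoch (floored).

    dt must be timezone-aware.
    """
    return (dt - EPOCH) // datetime.timedelta(milliseconds=1)


def with_epoch_millis(df, src, dst):
    """Replace timestamp column src with a LongType column dst of epoch millis.

    df is assumed to be a Spark DataFrame; unix_millis drops the
    sub-millisecond digits of the timestamp.
    """
    return df.selectExpr("*", f"unix_millis({src}) AS {dst}").drop(src)
```

For the row used in the repro script, a call such as with_epoch_millis(df, "ts_tz", "ts_ms") before the Hudi write would store the value that to_epoch_millis yields for the same datetime, with the trailing 456 microseconds discarded.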

Stacktrace

n/a — no exception; the bug is silent.
