Describe the problem you faced
When writing a Hudi table via the Spark DataSource, the Parquet file's timestamp column logical type is always TIMESTAMP(MICROS, isAdjustedToUTC=true) for Spark TimestampType columns, regardless of what the user requests via:
- SparkSession config
spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS (or INT96), or
- The documented Hudi write option
hoodie.parquet.outputtimestamptype=TIMESTAMP_MILLIS.
In contrast, Spark's own Parquet writer (df.write.parquet(...)) under the same SparkSession honors both TIMESTAMP_MILLIS and INT96.
This means Hudi-written tables cannot:
- store timestamps as
TIMESTAMP(MILLIS, …) for downstream consumers that expect millisecond-precision Parquet, or
- store timestamps as
INT96 for legacy Hive/Impala readers that require INT96.
Data values round-trip correctly (MICROS is the higher-precision encoding, so no value information is lost) — but it silently breaks (a) storage workflows targeting MILLIS for smaller files and (b) interop with readers requiring INT96. It also means the documented hoodie.parquet.outputtimestamptype config has no effect.
TimestampNTZType is not affected: Spark itself also ignores outputTimestampType for NTZ, so Hudi matching that is correct.
To Reproduce
Single-file pyspark script — no Docker required.
export HUDI_BUNDLE=/path/to/hudi-spark3.4-bundle_2.12-<VERSION>.jar
spark-submit \
--master 'local[2]' \
--jars "$HUDI_BUNDLE" \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
repro.py
import datetime, glob, os, shutil, tempfile
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType
WORK = tempfile.mkdtemp(prefix="hudi_ts_repro_")
ROW = (1, datetime.datetime(2026, 5, 15, 12, 34, 56, 123456, tzinfo=datetime.timezone.utc))
SCHEMA = StructType([
StructField("id", IntegerType(), False),
StructField("ts_tz", TimestampType(), False),
])
HUDI_OPTS = {
"hoodie.table.name": "ts_repro",
"hoodie.datasource.write.recordkey.field": "id",
"hoodie.datasource.write.precombine.field": "id",
"hoodie.datasource.write.partitionpath.field": "",
"hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}
def parquet_logical_type(d, col):
spark = SparkSession.builder.getOrCreate()
jvm = spark._jvm
pq = sorted(glob.glob(f"{d}/**/*.parquet", recursive=True))[0]
reader = jvm.org.apache.parquet.hadoop.ParquetFileReader.open(
jvm.org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(
jvm.org.apache.hadoop.fs.Path(pq), spark._jsc.hadoopConfiguration()))
s = reader.getFileMetaData().getSchema().toString(); reader.close()
for line in s.splitlines():
line = line.strip()
if f" {col};" in line or (f" {col} " in line and "(" in line):
return line[line.find(col):].rstrip(";").strip()
return "(not found)"
def run(label, out_ts):
spark = (SparkSession.builder.appName(f"r_{label}")
.config("spark.sql.session.timeZone", "UTC")
.config("spark.sql.parquet.outputTimestampType", out_ts).getOrCreate())
spark.sparkContext.setLogLevel("WARN")
df = spark.createDataFrame([ROW], SCHEMA)
sp = os.path.join(WORK, f"{label}_spark"); hp = os.path.join(WORK, f"{label}_hudi")
shutil.rmtree(sp, ignore_errors=True); shutil.rmtree(hp, ignore_errors=True)
df.coalesce(1).write.mode("overwrite").parquet(sp)
df.write.format("hudi").options(**HUDI_OPTS).mode("overwrite").save(hp)
print(f" {label}: Spark-direct = {parquet_logical_type(sp, 'ts_tz')} | Hudi = {parquet_logical_type(hp, 'ts_tz')}")
spark.stop()
run("session_MILLIS", "TIMESTAMP_MILLIS")
run("session_MICROS", "TIMESTAMP_MICROS")
run("session_INT96", "INT96")
Expected behavior
Same Parquet logical type as Spark-direct:
session_MILLIS: Spark-direct = ts_tz (TIMESTAMP(MILLIS,true)) | Hudi = ts_tz (TIMESTAMP(MILLIS,true))
session_MICROS: Spark-direct = ts_tz (TIMESTAMP(MICROS,true)) | Hudi = ts_tz (TIMESTAMP(MICROS,true))
session_INT96 : Spark-direct = ts_tz | Hudi = ts_tz
Actual behavior
session_MILLIS: Spark-direct = ts_tz (TIMESTAMP(MILLIS,true)) | Hudi = ts_tz (TIMESTAMP(MICROS,true)) << diverged
session_MICROS: Spark-direct = ts_tz (TIMESTAMP(MICROS,true)) | Hudi = ts_tz (TIMESTAMP(MICROS,true)) << matches
session_INT96 : Spark-direct = ts_tz | Hudi = ts_tz (TIMESTAMP(MICROS,true)) << diverged
Reproduced identically against three bundles by only swapping --jars:
| Bundle (Maven Central / Apache staging) |
session MILLIS |
Hudi option MILLIS |
session INT96 |
hudi-spark3.4-bundle_2.12-0.15.0.jar |
reproduces |
reproduces |
reproduces |
hudi-spark3.4-bundle_2.12-0.15.1-rc1.jar |
reproduces |
reproduces |
reproduces |
hudi-spark3.4-bundle_2.12-1.1.1.jar |
reproduces |
reproduces |
reproduces |
Bug spans at least 0.15.0 → 1.1.1, surviving the 1.x rewrite.
Environment Description
- Hudi version: 0.15.0, 0.15.1-rc1, 1.1.1 (all reproduce; Spark 3.4 bundles, Scala 2.12)
- Spark version: 3.4.3
- Hadoop version: 3 (bundled Spark distribution)
- Storage: local FS; bug is in the Parquet writer logical-type selection and is storage-independent
- Running on Docker?: optional
Additional context
TimestampNTZType is unaffected (Spark itself ignores outputTimestampType for NTZ).
hoodie.parquet.outputtimestamptype is documented but appears to be effectively dead code across at least three releases.
- Workaround for users needing MILLIS: convert to LongType epoch-millis before writing. No workaround for INT96 / legacy Hive interop without re-writing files outside Hudi.
Stacktrace
n/a — no exception; the bug is silent.
Describe the problem you faced
When writing a Hudi table via the Spark DataSource, the Parquet file's timestamp column logical type is always
TIMESTAMP(MICROS, isAdjustedToUTC=true)for SparkTimestampTypecolumns, regardless of what the user requests via:spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS(orINT96), orhoodie.parquet.outputtimestamptype=TIMESTAMP_MILLIS.In contrast, Spark's own Parquet writer (
df.write.parquet(...)) under the same SparkSession honors bothTIMESTAMP_MILLISandINT96.This means Hudi-written tables cannot:
TIMESTAMP(MILLIS, …)for downstream consumers that expect millisecond-precision Parquet, orINT96for legacy Hive/Impala readers that require INT96.Data values round-trip correctly (MICROS is the higher-precision encoding, so no value information is lost) — but it silently breaks (a) storage workflows targeting MILLIS for smaller files and (b) interop with readers requiring INT96. It also means the documented
hoodie.parquet.outputtimestamptypeconfig has no effect.TimestampNTZTypeis not affected: Spark itself also ignoresoutputTimestampTypefor NTZ, so Hudi matching that is correct.To Reproduce
Single-file pyspark script — no Docker required.
Expected behavior
Same Parquet logical type as Spark-direct:
Actual behavior
Reproduced identically against three bundles by only swapping
--jars:hudi-spark3.4-bundle_2.12-0.15.0.jarhudi-spark3.4-bundle_2.12-0.15.1-rc1.jarhudi-spark3.4-bundle_2.12-1.1.1.jarBug spans at least 0.15.0 → 1.1.1, surviving the 1.x rewrite.
Environment Description
Additional context
TimestampNTZTypeis unaffected (Spark itself ignoresoutputTimestampTypefor NTZ).hoodie.parquet.outputtimestamptypeis documented but appears to be effectively dead code across at least three releases.Stacktrace
n/a — no exception; the bug is silent.