[SUPPORT] Spark write path silently ignores spark.sql.parquet.outputTimestampType and hoodie.parquet.outputtimestamptype

## Describe the problem you faced

When writing a Hudi table via the Spark DataSource, the Parquet file's timestamp column logical type is always `TIMESTAMP(MICROS, isAdjustedToUTC=true)` for Spark `TimestampType` columns, regardless of what the user requests via:

1. SparkSession config `spark.sql.parquet.outputTimestampType=TIMESTAMP_MILLIS` (or `INT96`), or
2. The documented Hudi write option `hoodie.parquet.outputtimestamptype=TIMESTAMP_MILLIS`.

In contrast, Spark's own Parquet writer (`df.write.parquet(...)`) under the **same SparkSession** honors both `TIMESTAMP_MILLIS` and `INT96`.

This means Hudi-written tables cannot:
- store timestamps as `TIMESTAMP(MILLIS, …)` for downstream consumers that expect millisecond-precision Parquet, or
- store timestamps as `INT96` for legacy Hive/Impala readers that require INT96.

Data values round-trip correctly (MICROS is the higher-precision encoding, so no value information is lost) — but it silently breaks (a) storage workflows targeting MILLIS for smaller files and (b) interop with readers requiring INT96. It also means the documented `hoodie.parquet.outputtimestamptype` config has no effect.

`TimestampNTZType` is **not** affected: Spark itself also ignores `outputTimestampType` for NTZ, so Hudi matching that is correct.

## To Reproduce

Single-file pyspark script — no Docker required.

```bash
export HUDI_BUNDLE=/path/to/hudi-spark3.4-bundle_2.12-<VERSION>.jar
spark-submit \
  --master 'local[2]' \
  --jars "$HUDI_BUNDLE" \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  repro.py
```

```python
import datetime, glob, os, shutil, tempfile
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

WORK = tempfile.mkdtemp(prefix="hudi_ts_repro_")
ROW = (1, datetime.datetime(2026, 5, 15, 12, 34, 56, 123456, tzinfo=datetime.timezone.utc))
SCHEMA = StructType([
    StructField("id", IntegerType(), False),
    StructField("ts_tz", TimestampType(), False),
])
HUDI_OPTS = {
    "hoodie.table.name": "ts_repro",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "id",
    "hoodie.datasource.write.partitionpath.field": "",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
}

def parquet_logical_type(d, col):
    spark = SparkSession.builder.getOrCreate()
    jvm = spark._jvm
    pq = sorted(glob.glob(f"{d}/**/*.parquet", recursive=True))[0]
    reader = jvm.org.apache.parquet.hadoop.ParquetFileReader.open(
        jvm.org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(
            jvm.org.apache.hadoop.fs.Path(pq), spark._jsc.hadoopConfiguration()))
    s = reader.getFileMetaData().getSchema().toString(); reader.close()
    for line in s.splitlines():
        line = line.strip()
        if f" {col};" in line or (f" {col} " in line and "(" in line):
            return line[line.find(col):].rstrip(";").strip()
    return "(not found)"

def run(label, out_ts):
    spark = (SparkSession.builder.appName(f"r_{label}")
        .config("spark.sql.session.timeZone", "UTC")
        .config("spark.sql.parquet.outputTimestampType", out_ts).getOrCreate())
    spark.sparkContext.setLogLevel("WARN")
    df = spark.createDataFrame([ROW], SCHEMA)
    sp = os.path.join(WORK, f"{label}_spark"); hp = os.path.join(WORK, f"{label}_hudi")
    shutil.rmtree(sp, ignore_errors=True); shutil.rmtree(hp, ignore_errors=True)
    df.coalesce(1).write.mode("overwrite").parquet(sp)
    df.write.format("hudi").options(**HUDI_OPTS).mode("overwrite").save(hp)
    print(f"  {label}: Spark-direct = {parquet_logical_type(sp, 'ts_tz')}   |   Hudi = {parquet_logical_type(hp, 'ts_tz')}")
    spark.stop()

run("session_MILLIS", "TIMESTAMP_MILLIS")
run("session_MICROS", "TIMESTAMP_MICROS")
run("session_INT96",  "INT96")
```

## Expected behavior

Same Parquet logical type as Spark-direct:

```
session_MILLIS: Spark-direct = ts_tz (TIMESTAMP(MILLIS,true))  |  Hudi = ts_tz (TIMESTAMP(MILLIS,true))
session_MICROS: Spark-direct = ts_tz (TIMESTAMP(MICROS,true))  |  Hudi = ts_tz (TIMESTAMP(MICROS,true))
session_INT96 : Spark-direct = ts_tz                           |  Hudi = ts_tz
```

## Actual behavior

```
session_MILLIS: Spark-direct = ts_tz (TIMESTAMP(MILLIS,true))  |  Hudi = ts_tz (TIMESTAMP(MICROS,true))   << diverged
session_MICROS: Spark-direct = ts_tz (TIMESTAMP(MICROS,true))  |  Hudi = ts_tz (TIMESTAMP(MICROS,true))   << matches
session_INT96 : Spark-direct = ts_tz                           |  Hudi = ts_tz (TIMESTAMP(MICROS,true))   << diverged
```

Reproduced identically against three bundles by only swapping `--jars`:

| Bundle (Maven Central / Apache staging) | session MILLIS | Hudi option MILLIS | session INT96 |
|---|---|---|---|
| `hudi-spark3.4-bundle_2.12-0.15.0.jar` | reproduces | reproduces | reproduces |
| `hudi-spark3.4-bundle_2.12-0.15.1-rc1.jar` | reproduces | reproduces | reproduces |
| `hudi-spark3.4-bundle_2.12-1.1.1.jar` | reproduces | reproduces | reproduces |

Bug spans at least 0.15.0 → 1.1.1, surviving the 1.x rewrite.

## Environment Description

- Hudi version: 0.15.0, 0.15.1-rc1, 1.1.1 (all reproduce; Spark 3.4 bundles, Scala 2.12)
- Spark version: 3.4.3
- Hadoop version: 3 (bundled Spark distribution)
- Storage: local FS; bug is in the Parquet writer logical-type selection and is storage-independent
- Running on Docker?: optional

## Additional context

- `TimestampNTZType` is unaffected (Spark itself ignores `outputTimestampType` for NTZ).
- `hoodie.parquet.outputtimestamptype` is documented but appears to be effectively dead code across at least three releases.
- Workaround for users needing MILLIS: convert to LongType epoch-millis before writing. No workaround for INT96 / legacy Hive interop without re-writing files outside Hudi.

## Stacktrace

n/a — no exception; the bug is silent.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SUPPORT] Spark write path silently ignores spark.sql.parquet.outputTimestampType and hoodie.parquet.outputtimestamptype #18752

Describe the problem you faced

To Reproduce

Expected behavior

Actual behavior

Environment Description

Additional context

Stacktrace

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Bundle (Maven Central / Apache staging)	session MILLIS	Hudi option MILLIS	session INT96
`hudi-spark3.4-bundle_2.12-0.15.0.jar`	reproduces	reproduces	reproduces
`hudi-spark3.4-bundle_2.12-0.15.1-rc1.jar`	reproduces	reproduces	reproduces
`hudi-spark3.4-bundle_2.12-1.1.1.jar`	reproduces	reproduces	reproduces

[SUPPORT] Spark write path silently ignores spark.sql.parquet.outputTimestampType and hoodie.parquet.outputtimestamptype #18752

Description

Describe the problem you faced

To Reproduce

Expected behavior

Actual behavior

Environment Description

Additional context

Stacktrace

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions