
Conversation

@AbinayaJayaprakasam (Contributor) commented Nov 25, 2025

Convert TIMESTAMP(NANOS,*) to LongType regardless of nanosAsLong config to allow reading Parquet files with nanosecond precision timestamps.

### What changes were proposed in this pull request?

Simplified the TIMESTAMP(NANOS) handling in ParquetSchemaConverter to always convert to LongType, removing the nanosAsLong condition check that caused TIMESTAMP(NANOS,false) files to be unreadable.

### Why are the changes needed?

SPARK-40819 added spark.sql.legacy.parquet.nanosAsLong as a workaround for TIMESTAMP(NANOS,true), but:

- Only worked for TIMESTAMP(NANOS,true), not for TIMESTAMP(NANOS,false)
- Required users to know about an obscure internal config flag
- Still required manual casting from Long to Timestamp

This fix makes all NANOS timestamps readable by default. Since Spark cannot fully support nanosecond precision in its type system, converting to LongType preserves precision while allowing files to be read.
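For context, the pre-fix workaround from SPARK-40819 looked roughly like this (a hedged sketch; the file path is illustrative, and the flag only ever helped for TIMESTAMP(NANOS,true) files):

```python
# Pre-fix workaround (sketch): opt in to the legacy flag so that
# TIMESTAMP(NANOS,true) columns are read as LongType.
# TIMESTAMP(NANOS,false) files still failed even with this flag set.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.legacy.parquet.nanosAsLong", "true")
         .getOrCreate())

df = spark.read.parquet("/tmp/nanos_utc.parquet")  # NANOS column -> LongType
```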

### Does this PR introduce any user-facing change?

Yes - Parquet files with TIMESTAMP(NANOS,*) are now readable by default without configuration. Values are read as LongType (nanoseconds since epoch). Users can convert to timestamp if needed: `(col('nanos') / 1e9).cast('timestamp')`
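A minimal end-to-end sketch of that conversion (the file path and the column name `nanos` are illustrative):

```python
# Read a Parquet file whose TIMESTAMP(NANOS,*) column now surfaces as
# LongType, then derive a TimestampType column from it. Dividing by 1e9
# yields seconds since the epoch; the cast truncates to Spark's
# microsecond timestamp precision.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/tmp/nanos.parquet")
df.printSchema()  # nanos: long

df = df.withColumn("ts", (col("nanos") / 1e9).cast("timestamp"))
df.show(truncate=False)
```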

### How was this patch tested?

- Updated ParquetSchemaSuite test expectations (lines 1112-1121)
- All 110 tests in ParquetSchemaSuite pass
- Manually tested with a TIMESTAMP(NANOS,false) Parquet file generated via PyArrow

### Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the SQL label Nov 25, 2025
@AbinayaJayaprakasam (Contributor, Author) commented Nov 25, 2025

What problem does this solve?
- Parquet files with TIMESTAMP(NANOS,false) exist and are completely unreadable
- SPARK-40819 only fixed TIMESTAMP(NANOS,true), and only behind a config flag
- No workaround exists for affected users

Testing procedure:

Step 1: Generated a test Parquet file
[screenshot]
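A sketch of what that generation step presumably looked like (file path, column name, and values are illustrative; `version="2.6"` is passed so PyArrow keeps nanosecond precision rather than coercing it):

```python
# Write a Parquet file with a TIMESTAMP(NANOS,false) column. A tz-naive
# pyarrow timestamp("ns") type maps to isAdjustedToUTC=false.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "nanos": pa.array([1, 1_700_000_000_000_000_001],
                      type=pa.timestamp("ns")),
})
pq.write_table(table, "/tmp/nanos.parquet", version="2.6")
```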

Step 2: Read it with PySpark
[screenshot]
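And a sketch of the read-back (path illustrative): with the fix, schema conversion succeeds and the column arrives as LongType:

```python
# Read the file back with PySpark; the TIMESTAMP(NANOS,false) column is
# exposed as LongType (nanoseconds since epoch) instead of failing.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/tmp/nanos.parquet")
df.printSchema()  # nanos: long
df.show(truncate=False)
```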

Step 3: Before the fix, the read fails
[screenshot]

Step 4: After the fix, the read succeeds with the column as LongType
[screenshot]

Test coverage:
Updated the existing ParquetSchemaSuite test: changed the expectation from "error" to "success with LongType"
[screenshot]

Behavior Matrix

| Scenario | Before | After | Breaking? |
| --- | --- | --- | --- |
| NANOS + nanosAsLong=true | LongType | LongType | No |
| NANOS + nanosAsLong=false | ERROR | LongType | No (fix!) |
| MICROS/MILLIS timestamps | TimestampType | TimestampType | No |

@AbinayaJayaprakasam (Contributor, Author) commented

All build failures are due to CI infrastructure issues during the "Free up disk space" setup step (exit code 100 - package download failures from Ubuntu mirrors).

The failures occurred before code compilation and tests could run, as evidenced by:

- No test log files generated
- No test result files uploaded
- All failures in the same setup phase

This is unrelated to the code changes, so I pushed a dummy commit [059c360] to retrigger the CI.
