[SPARK-36182][SQL] Support TimestampNTZ type in Parquet data source #34495
gengliangwang wants to merge 5 commits into apache:master
Conversation
void validateTimestampType(DataType sparkType) {
  assert(logicalTypeAnnotation instanceof TimestampLogicalTypeAnnotation);
  // Throw an exception if the Parquet type is TimestampLTZ and the Catalyst type is TimestampNTZ.
Here only reading TimestampLTZ as TimestampNTZ is disallowed. Suggestions are welcome.
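For context, a minimal sketch of the disallowed direction, assuming a Spark SQL test suite with the standard withSQLConf/withTempPath helpers (illustrative only, not the PR's actual test):

  // Force INT64 TIMESTAMP_MICROS output so the written column carries
  // isAdjustedToUTC = true, i.e. TIMESTAMP_LTZ semantics.
  withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> "TIMESTAMP_MICROS") {
    withTempPath { dir =>
      val path = dir.getCanonicalPath
      sql("SELECT timestamp'2021-01-01 00:00:00' AS c1").write.parquet(path)
      // Reading the LTZ column back with a TIMESTAMP_NTZ read schema is
      // expected to fail instead of silently reinterpreting the values.
      intercept[Exception] {
        spark.read.schema("c1 TIMESTAMP_NTZ").parquet(path).collect()
      }
    }
  }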
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #144933 has finished for PR 34495 at commit
// For TimestampNTZType column, Spark always output as INT64 with Timestamp annotation in
// MICROS time unit.
(row: SpecializedGetters, ordinal: Int) =>
  recordConsumer.addLong(row.getLong(ordinal))
(row: SpecializedGetters, ordinal: Int) => recordConsumer.addLong(row.getLong(ordinal))
void converterErrorForTimestampNTZ(String parquetType) {
nit: convertErrorForTimestampNTZ?
| test("SPARK-36182: can't read TimestampLTZ as TimestampNTZ") { |
Can we add a test for reading TimestampNTZ as TimestampLTZ?
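A sketch of what such a test might look like, under the same test-suite assumptions as above (names and values are illustrative):

  withTempPath { dir =>
    val path = dir.getCanonicalPath
    // Written as TIMESTAMP_NTZ: INT64 with isAdjustedToUTC = false, MICROS unit.
    sql("SELECT timestamp_ntz'2021-01-01 00:00:00' AS c1").write.parquet(path)
    // Reading it back with a TIMESTAMP_LTZ read schema is allowed, since
    // TIMESTAMP_LTZ is the "wider" type; values are then interpreted in the
    // session time zone.
    val df = spark.read.schema("c1 TIMESTAMP").parquet(path)
    assert(df.collect().length == 1)
  }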
void convertErrorForTimestampNTZ(String parquetType) {
  throw new RuntimeException("Unable to create Parquet converter for data type " +
    DataTypes.TimestampNTZType.json() + " whose Parquet type is " + parquetType);
can we use the SQL format? TIMESTAMP WITHOUT TIMEZONE
Both are "timestamp_ntz"
I was not referring to .sql, I mean "TIMESTAMP WITHOUT TIMEZONE".
I prefer keeping "timestamp_ntz", which is the keyword we used in DDL/literals.
nvm, timestamp_ntz is kind of an alias of TIMESTAMP WITHOUT TIMEZONE in Spark today, and a shorter name is better here.
!parquetType.getLogicalTypeAnnotation
  .asInstanceOf[TimestampLogicalTypeAnnotation].isAdjustedToUTC &&
parquetType.getLogicalTypeAnnotation
  .asInstanceOf[TimestampLogicalTypeAnnotation].getUnit == TimeUnit.MICROS =>
shall we have a method to return an optional timestamp unit? then we can shorten this very long condition.
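One possible shape for such a helper (the method name and its placement are hypothetical, not from this PR):

  import org.apache.parquet.schema.Type
  import org.apache.parquet.schema.LogicalTypeAnnotation.{TimeUnit, TimestampLogicalTypeAnnotation}

  // Returns the time unit only when the annotation is a timestamp that is not
  // adjusted to UTC, i.e. has TIMESTAMP_NTZ semantics.
  private def unadjustedTimestampUnit(parquetType: Type): Option[TimeUnit] =
    parquetType.getLogicalTypeAnnotation match {
      case t: TimestampLogicalTypeAnnotation if !t.isAdjustedToUTC => Some(t.getUnit)
      case _ => None
    }

  // The long guard above then becomes:
  // case ... if unadjustedTimestampUnit(parquetType).contains(TimeUnit.MICROS) => ...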
| test("SPARK-36182: writing and reading TimestampNTZType column") { | ||
| withTable("ts") { | ||
| sql("create table ts (c1 timestamp_ntz) using parquet") |
can we test that we can not insert ltz into a ntz column?
We should forbid it in the store assignment rule, which is independent of data sources. I will raise another PR for that.
Kubernetes integration test unable to build dist. exiting with code: 1
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #145052 has finished for PR 34495 at commit
Test build #145061 has finished for PR 34495 at commit
thanks, merging to master!
What changes were proposed in this pull request?
Support TimestampNTZ type in the Parquet data source. In this PR, Parquet's timestamp types are mapped to the two timestamp types in Spark as follows:

Parquet writer
For TIMESTAMP_NTZ columns, Spark follows the Parquet Logical Type Definitions and sets the field isAdjustedToUTC as false when writing. The output type is always INT64 in the MICROS time unit. Since Parquet's timestamp logical annotation can only be used with INT64, writing TIMESTAMP_NTZ as INT96 is not supported; otherwise, it would be hard to decide the timestamp type on the reader side.

Parquet reader
For INT64 columns with the timestamp annotation, the Catalyst type is decided by isAdjustedToUTC. Reading a TIMESTAMP_LTZ column as TIMESTAMP_NTZ is disallowed, while reading a TIMESTAMP_NTZ column as TIMESTAMP_LTZ is a safe operation since TIMESTAMP_LTZ is the "wider" type.
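A minimal usage sketch of the resulting behavior (illustrative only; it mirrors the new test added in this PR):

  spark.sql("CREATE TABLE ts (c1 TIMESTAMP_NTZ) USING parquet")
  spark.sql("INSERT INTO ts VALUES (timestamp_ntz'2021-07-01 12:00:00')")
  // The Parquet file stores c1 as INT64 with a Timestamp annotation where
  // isAdjustedToUTC = false and the unit is MICROS.
  spark.table("ts").printSchema()  // c1 is read back as timestamp_ntz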
Why are the changes needed?
Support TimestampNTZ type in Parquet data source
Does this PR introduce any user-facing change?
Yes, support TimestampNTZ type in Parquet data source
How was this patch tested?
New UTs