[SPARK-36182][SQL] Support TimestampNTZ type in Parquet data source #34495

Closed
gengliangwang wants to merge 5 commits into apache:master from gengliangwang:parquetTSNTZ

Conversation

@gengliangwang
Member

@gengliangwang gengliangwang commented Nov 5, 2021

What changes were proposed in this pull request?

Support TimestampNTZ type in the Parquet data source. In this PR, Parquet's timestamp types are mapped to Spark's two timestamp types as follows:

Parquet type     | Logical annotation                      | Spark Catalyst type
INT64 Timestamp  | isAdjustedToUTC = false, unit = MILLIS  | TimestampNTZType
INT64 Timestamp  | isAdjustedToUTC = false, unit = MICROS  | TimestampNTZType
INT64 Timestamp  | isAdjustedToUTC = true,  unit = MILLIS  | TimestampType
INT64 Timestamp  | isAdjustedToUTC = true,  unit = MICROS  | TimestampType
INT96 Timestamp  | -                                       | TimestampType
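The mapping in the table can be sketched as follows. This is a hypothetical illustration of the rule, not Spark's actual API; the enum and method names are invented for clarity.

```java
// Illustrative sketch of the Parquet-to-Catalyst timestamp mapping (names are not Spark's API).
enum CatalystType { TIMESTAMP_LTZ, TIMESTAMP_NTZ }
enum ParquetTimeUnit { MILLIS, MICROS }

class TimestampMapping {
    // INT64 timestamps carry isAdjustedToUTC and a time unit.
    // MILLIS and MICROS map the same way; only the UTC flag decides NTZ vs LTZ.
    static CatalystType forInt64(boolean isAdjustedToUTC, ParquetTimeUnit unit) {
        return isAdjustedToUTC ? CatalystType.TIMESTAMP_LTZ : CatalystType.TIMESTAMP_NTZ;
    }

    // INT96 timestamps have no logical annotation and are always read as TimestampType (LTZ).
    static CatalystType forInt96() {
        return CatalystType.TIMESTAMP_LTZ;
    }
}
```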

Parquet writer

For TIMESTAMP_NTZ columns, the writer follows the Parquet Logical Type Definitions and sets the field isAdjustedToUTC to false. The output type is always INT64 in the MICROS time unit. Parquet's timestamp logical annotation can only be applied to INT64, so writing TIMESTAMP_NTZ as INT96 is not supported; otherwise, it would be hard to determine the timestamp type on the reader side.
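The serialized value for a TIMESTAMP_NTZ column can be sketched like this: the local date-time is encoded as microseconds since 1970-01-01T00:00:00, taking the local value as-is with no time-zone adjustment, which is what isAdjustedToUTC = false means. This is a minimal sketch, not Spark's actual writer code; the class and method names are invented.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

// Sketch only: encode a timezone-less timestamp as epoch microseconds.
class NtzWriterSketch {
    static long toMicros(LocalDateTime ldt) {
        // ZoneOffset.UTC serves only as a fixed anchor for the epoch arithmetic;
        // the local value itself is never shifted.
        long seconds = ldt.toEpochSecond(ZoneOffset.UTC);
        return seconds * 1_000_000L + ldt.getNano() / 1_000L;
    }
}
```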

Parquet reader

  • INT96 columns: The reader behavior is the same with Spark 3.2 or prior.
  • Schema inference for INT64 Timestamp columns: when inferring the schema of a file, Spark infers TIMESTAMP_NTZ or TIMESTAMP_LTZ according to the annotation flag isAdjustedToUTC.
  • Row converter for INT64 Timestamp columns during reading
    • Given a TIMESTAMP_NTZ Parquet column and a catalyst schema of TIMESTAMP_LTZ type, Spark allows the read operation since TIMESTAMP_LTZ is the "wider" type.
    • Given a TIMESTAMP_LTZ Parquet column and a catalyst schema of TIMESTAMP_NTZ type, Spark disallows the read operation since TIMESTAMP_NTZ is the "narrower" type.
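The compatibility rule above reduces to one disallowed combination, which can be sketched as follows (the class and method names are illustrative, not Spark's API):

```java
// Sketch of the read-compatibility rule: reading NTZ data as LTZ is allowed (widening);
// reading LTZ data as NTZ is rejected (narrowing).
class ReadCompatibilitySketch {
    static boolean canRead(boolean parquetIsAdjustedToUTC, boolean catalystIsNtz) {
        // Only the LTZ-file / NTZ-schema combination is disallowed.
        return !(parquetIsAdjustedToUTC && catalystIsNtz);
    }
}
```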

Why are the changes needed?

Support TimestampNTZ type in Parquet data source

Does this PR introduce any user-facing change?

Yes, support TimestampNTZ type in Parquet data source

How was this patch tested?

New unit tests.

@github-actions github-actions bot added the SQL label Nov 5, 2021
@gengliangwang
Member Author

cc @sadikovi @cloud-fan @beliefer


void validateTimestampType(DataType sparkType) {
  assert(logicalTypeAnnotation instanceof TimestampLogicalTypeAnnotation);
  // Throw an exception if the Parquet type is TimestampLTZ and the Catalyst type is TimestampNTZ.
Member Author

Here only reading TimestampLTZ as TimestampNTZ is disallowed. Suggestions are welcome.

@SparkQA

SparkQA commented Nov 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49404/

@SparkQA

SparkQA commented Nov 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49405/

@SparkQA

SparkQA commented Nov 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49404/

@SparkQA

SparkQA commented Nov 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49405/

@SparkQA

SparkQA commented Nov 5, 2021

Test build #144933 has finished for PR 34495 at commit b8e1e02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// For a TimestampNTZType column, Spark always outputs INT64 with the Timestamp annotation
// in the MICROS time unit.
(row: SpecializedGetters, ordinal: Int) =>
  recordConsumer.addLong(row.getLong(ordinal))
Contributor

@beliefer beliefer Nov 7, 2021

(row: SpecializedGetters, ordinal: Int) => recordConsumer.addLong(row.getLong(ordinal))

}
}

void converterErrorForTimestampNTZ(String parquetType) {
Contributor

nit: convertErrorForTimestampNTZ?

Member Author

Thanks, updated

}
}

test("SPARK-36182: can't read TimestampLTZ as TimestampNTZ") {
Contributor

Can we add a test for reading TimestampNTZ as TimestampLTZ?

Member Author

Thanks, updated


void convertErrorForTimestampNTZ(String parquetType) {
  throw new RuntimeException("Unable to create Parquet converter for data type " +
      DataTypes.TimestampNTZType.json() + " whose Parquet type is " + parquetType);
Contributor

can we use the SQL format? TIMESTAMP WITHOUT TIMEZONE

Member Author

Both are "timestamp_ntz"

Contributor

I was not referring to .sql, I mean "TIMESTAMP WITHOUT TIMEZONE".

Member Author

I prefer keeping "timestamp_ntz", which is the keyword we used in DDL/literals..

Contributor

nvm, timestamp_ntz is kind of an alias of TIMESTAMP WITHOUT TIMEZONE in Spark today, and a shorter name is better here.

!parquetType.getLogicalTypeAnnotation
  .asInstanceOf[TimestampLogicalTypeAnnotation].isAdjustedToUTC &&
parquetType.getLogicalTypeAnnotation
  .asInstanceOf[TimestampLogicalTypeAnnotation].getUnit == TimeUnit.MICROS =>
Contributor

shall we have a method to return an optional timestamp unit? then we can shorten this very long condition.

Member Author

ok, updated


test("SPARK-36182: writing and reading TimestampNTZType column") {
  withTable("ts") {
    sql("create table ts (c1 timestamp_ntz) using parquet")
Contributor

can we test that we can not insert ltz into a ntz column?

Member Author

@gengliangwang gengliangwang Nov 10, 2021

We should forbid it in the store assignment rule, which is independent of data sources. I will raise another PR for that.

@SparkQA

SparkQA commented Nov 10, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49522/

@SparkQA

SparkQA commented Nov 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49531/

@SparkQA

SparkQA commented Nov 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49531/

@SparkQA

SparkQA commented Nov 10, 2021

Test build #145052 has finished for PR 34495 at commit 8a71793.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 10, 2021

Test build #145061 has finished for PR 34495 at commit 760f5b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in ef5278f Nov 11, 2021