
[SPARK-38813][SQL][3.3] Remove TimestampNTZ type support in Spark 3.3 #36094

Closed
wants to merge 6 commits into from

Conversation

@gengliangwang (Member) commented Apr 7, 2022

What changes were proposed in this pull request?

Remove TimestampNTZ type support in the production code of Spark 3.3.
To achieve this goal, this PR adds the check `Utils.isTesting` in the following code branches:

  • the keywords "timestamp_ntz" and "timestamp_ltz" in the parser
  • the new expressions from https://issues.apache.org/jira/browse/SPARK-35662
  • using java.time.LocalDateTime as the external type for TimestampNTZType
  • `SQLConf.timestampType`, which determines the default timestamp type of Spark SQL

This is to minimize the code difference from the master branch, so that future users won't think TimestampNTZ is already available in Spark 3.3.
The downside is that users can still find TimestampNTZType under the package `org.apache.spark.sql.types`. There should be nothing left other than this.
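For illustration, here is a minimal sketch of the gating pattern described above, assuming Spark's usual test flags; the object and method names are hypothetical and not the actual parser code touched by this PR:

```scala
import org.apache.spark.sql.types.{DataType, TimestampNTZType, TimestampType}

object NtzGateSketch {
  // Rough stand-in for the spark-internal Utils.isTesting: true only under test flags.
  def isTesting: Boolean =
    sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")

  // Hypothetical parser-side branch: the new keyword is only recognized in tests,
  // so a production 3.3 build rejects it like any other unsupported type name.
  def resolveTimestampKeyword(keyword: String): DataType = keyword.toLowerCase match {
    case "timestamp"                  => TimestampType
    case "timestamp_ntz" if isTesting => TimestampNTZType
    case other =>
      throw new IllegalArgumentException(s"DataType $other is not supported.")
  }
}
```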

Why are the changes needed?

The TimestampNTZ project is not finished yet:

  • It lacks Hive metastore support
  • It lacks JDBC support
  • We need to spend time scanning the codebase for any other missing support; TimestampType currently has far more code usages than TimestampNTZType

Does this PR introduce any user-facing change?

No, the TimestampNTZ type is not released yet.

How was this patch tested?

UT

@gengliangwang (Member, Author) commented Apr 7, 2022

cc @beliefer @sadikovi as well.
We have made great progress in the past few months, but it seems risky to release TimestampNTZ now. I suggest postponing it to the 3.4 release.

@gengliangwang (Member, Author)

This one is similar to #33444

@cloud-fan (Contributor)

I think the decision was to not support Hive tables, but JDBC is a valid concern. It's really painful to support NTZ in JDBC as we already treat the existing LTZ as NTZ in the JDBC source. We need a good migration plan.

@beliefer (Contributor) commented Apr 7, 2022

Yes. It is very painful to support NTZ in some data sources.

@gengliangwang (Member, Author)

@cloud-fan I meant Hive metastore support. I just updated the PR description.

@cloud-fan (Contributor) commented Apr 7, 2022

Spark stores table schemas in table properties, so the Hive metastore is not an issue here. It's all about Hive compatibility, as we can't change the mapping between the Spark timestamp and the Hive timestamp, which is Spark LTZ <-> Hive timestamp (Hive only has NTZ).
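As a tiny illustration of that fixed mapping (an assumption-level sketch, not actual Spark code): Hive's single TIMESTAMP type has NTZ semantics, but it has always been read as Spark's LTZ TimestampType, and that mapping cannot be switched to the new type without breaking existing tables.

```scala
import org.apache.spark.sql.types.{DataType, TimestampType}

// Long-standing mapping: Hive TIMESTAMP (NTZ semantics) -> Spark TimestampType (LTZ).
// Remapping it to TimestampNTZType would change the meaning of existing Hive tables.
def hiveTypeToSpark(hiveTypeName: String): DataType = hiveTypeName.toLowerCase match {
  case "timestamp" => TimestampType
  case other       => throw new IllegalArgumentException(s"Unhandled Hive type: $other")
}
```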

@sadikovi (Contributor) commented Apr 7, 2022

Can you define the risk? Lack of support in data sources is rather a limitation that could be prioritised and resolved. Is there a risk of regression in users' queries, performance implications, or backward compatibility?

As far as I am concerned, TimestampNTZ is not the default type anyway, and you have to explicitly opt in to use it, which is up to users. If you try writing this type to a JDBC data source, I expect an error to be thrown. Is the behaviour different from that?

We can state in the release notes that support is limited to certain data sources but I would not necessarily disable it for everything.
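To illustrate the opt-in point, a hedged spark-shell sketch (assuming an NTZ-enabled build; the config key `spark.sql.timestampType` backs the `SQLConf.timestampType` mentioned in the PR description, and the value name here is an assumption):

```scala
// Session-level opt-in: make unqualified timestamp literals/columns default to NTZ.
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
spark.sql("SELECT TIMESTAMP'2022-04-07 10:00:00' AS ts").printSchema()
// With NTZ enabled, `ts` would come back as timestamp_ntz; writing such a column to a
// JDBC source would then be expected to fail rather than be silently converted.
```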

@sadikovi (Contributor) commented Apr 7, 2022

I would suggest "disabling" the data type instead of removing it, e.g. failing queries at the analysis phase if the query contains TimestampNTZ.

@cloud-fan (Contributor)

@sadikovi that sounds like a simpler solution, e.g. we can do this in CheckAnalysis and only allow NTZ in tests.

@gengliangwang (Member, Author) commented Apr 7, 2022

> I would suggest "disabling" the data type instead of removing it, e.g. failing queries at the analysis phase if the query contains TimestampNTZ.

@sadikovi could you raise a PR for this? I think we need to check create/alter table as well.

@HyukjinKwon (Member)

Hm, can we just simply mention that TimestampNTZ is unstable for now, and not supported very well yet? We should remove this from the PySpark API docs too, at least.

@gengliangwang (Member, Author)

Another issue is data source support. Take Parquet as an example: Spark SQL will infer the schema as TimestampNTZType if the annotation isAdjustedToUTC is `false`.
cc @cloud-fan @gatorsmile, shall we mark TimestampNTZ as unstable, as @HyukjinKwon suggested?
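A rough sketch of the Parquet inference behavior mentioned above (illustrative only, not the actual ParquetSchemaConverter code):

```scala
import org.apache.spark.sql.types.{DataType, TimestampNTZType, TimestampType}

// A Parquet TIMESTAMP whose logical-type annotation has isAdjustedToUTC = false would be
// read back as TimestampNTZType; isAdjustedToUTC = true keeps the existing LTZ mapping.
def inferSparkTimestamp(isAdjustedToUTC: Boolean): DataType =
  if (isAdjustedToUTC) TimestampType else TimestampNTZType
```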

```
@@ -157,6 +158,10 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
      case _: ShowTableExtended =>
        throw QueryCompilationErrors.commandUnsupportedInV2TableError("SHOW TABLE EXTENDED")

      case operator: LogicalPlan
        if !Utils.isTesting && operator.output.exists(_.dataType.isInstanceOf[TimestampNTZType]) =>
```
Member Author

I removed the !Utils.isTesting check and tested manually. It worked.

Contributor

I got it. LGTM.

@gengliangwang (Member, Author)

As this PR is becoming big, I created two follow-ups:

@itholic (Contributor) commented Apr 8, 2022

Thanks for pinging!

```
"LOCAL TIME ZONE type.")
.version("3.3.0")
.version("3.4.0")
```
Contributor

This change should be done in the master branch as well.

Member Author

Sure

@gengliangwang (Member, Author)

@cloud-fan @sadikovi @beliefer @HyukjinKwon Thanks for the inputs.
Merging to 3.3

gengliangwang added a commit that referenced this pull request Apr 8, 2022
Closes #36094 from gengliangwang/disableNTZ.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
gengliangwang added a commit that referenced this pull request Apr 8, 2022
… as 3.4.0

### What changes were proposed in this pull request?

Update the version info of TimestampNTZ-related changes to 3.4.0.

### Why are the changes needed?

We decided to remove TimestampNTZ support from the 3.3 release (#36094) and release TimestampNTZ in 3.4.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA tests.

Closes #36118 from gengliangwang/updateNTZVersion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
@dongjoon-hyun (Member)

Thank you for the decision!

@dongjoon-hyun (Member) left a comment

+1, late LGTM.

gengliangwang pushed a commit that referenced this pull request Apr 13, 2022
…or Spark 3.3

### What changes were proposed in this pull request?

This is a follow-up for #36094.
I added `Utils.isTesting` whenever we perform schema conversion or row conversion for TimestampNTZType.

I verified that the tests, e.g. ParquetIOSuite, fail with an unsupported data type error when running in non-testing mode:
```
[info]   Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 40.0 failed 1 times, most recent failure: Lost task 1.0 in stage 40.0 (TID 66) (ip-10-110-16-208.us-west-2.compute.internal executor driver): org.apache.spark.sql.AnalysisException: Unsupported data type timestamp_ntz
[info] 	at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotConvertDataTypeToParquetTypeError(QueryCompilationErrors.scala:1304)
[info] 	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:707)
[info] 	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:479)
[info] 	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:471)
```

### Why are the changes needed?
We have to disable TimestampNTZType as other parts of the codebase do not yet support this type.

### Does this PR introduce _any_ user-facing change?

No, the TimestampNTZ type is not released yet.

### How was this patch tested?

I tested the changes manually by rerunning the test suites that verify TimestampNTZType in the non-testing mode.

Closes #36137 from sadikovi/SPARK-38829-parquet-ntz-off.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
@ueshin (Member) commented Apr 21, 2022

A weird error message is seen as follows, likely caused by this PR:

```
scala> val df = spark.range(2)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.select("i")
org.apache.spark.sql.AnalysisException: Invalid call to dataType on unresolved object;
'Project ['i]
+- Range (0, 2, step=1, splits=Some(16))

  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:137)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$4(CheckAnalysis.scala:164)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$4$adapted(CheckAnalysis.scala:164)
  at scala.collection.IndexedSeqOptimized.prefixLengthImpl(IndexedSeqOptimized.scala:41)
  at scala.collection.IndexedSeqOptimized.exists(IndexedSeqOptimized.scala:49)
  at scala.collection.IndexedSeqOptimized.exists$(IndexedSeqOptimized.scala:49)
  at scala.collection.mutable.ArrayBuffer.exists(ArrayBuffer.scala:49)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:164)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:100)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:357)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:100)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:95)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:187)
...
```

The error message should be the following, as the master branch shows:

```
org.apache.spark.sql.AnalysisException: Column 'i' does not exist. Did you mean one of the following? [id];
...
```

```
@@ -157,6 +158,10 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
      case _: ShowTableExtended =>
        throw QueryCompilationErrors.commandUnsupportedInV2TableError("SHOW TABLE EXTENDED")

      case operator: LogicalPlan
        if !Utils.isTesting && operator.output.exists(_.dataType.isInstanceOf[TimestampNTZType]) =>
```
Contributor

Ah, a requirement is that we should call dataType only on resolved attributes, so the check should be:

```scala
operator.output.exists(attr => attr.resolved && attr.dataType.isInstanceOf[TimestampNTZType])
```

Contributor

@gengliangwang can you help to fix it?

Member Author

Yes, I am about to create a PR.

gengliangwang added a commit that referenced this pull request Apr 22, 2022
…stampNTZ output

### What changes were proposed in this pull request?

In #36094, a check that fails on TimestampNTZ output was added.
However, if there is an unresolved attribute in the plan, even if it is not related to TimestampNTZ, the error message becomes confusing:
```
scala> val df = spark.range(2)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.select("i")
org.apache.spark.sql.AnalysisException: Invalid call to dataType on unresolved object;
'Project ['i]
+- Range (0, 2, step=1, splits=Some(16))

  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:137)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$4(CheckAnalysis.scala:164)
...
```

Before the changes, it was:
```
org.apache.spark.sql.AnalysisException: Column 'i' does not exist. Did you mean one of the following? [id];
```

This PR improves the check for TimestampNTZ and restores the error message for unresolved attributes.

### Why are the changes needed?

Fix a regression in analysis error message.

### Does this PR introduce _any_ user-facing change?

No, it is not released yet.

### How was this patch tested?

Manual test

Closes #36316 from gengliangwang/bugFix.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
gengliangwang added a commit that referenced this pull request May 19, 2022
### What changes were proposed in this pull request?

In [#36094](#36094), a check that fails on TimestampNTZ type output (since we are disabling TimestampNTZ in 3.3) was added:
```
      case operator: LogicalPlan
        if !Utils.isTesting && operator.output.exists(attr =>
          attr.resolved && attr.dataType.isInstanceOf[TimestampNTZType]) =>
        operator.failAnalysis("TimestampNTZ type is not supported in Spark 3.3.")
```

However, the check can cause misleading error messages.

In 3.3:
```
> sql( "select date '2018-11-17' > 1").show()
org.apache.spark.sql.AnalysisException: Invalid call to toAttribute on unresolved object;
'Project [unresolvedalias((2018-11-17 > 1), None)]
+- OneRowRelation

  at org.apache.spark.sql.catalyst.analysis.UnresolvedAlias.toAttribute(unresolved.scala:510)
  at org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:70)
```
In master or 3.2:
```
> sql( "select date '2018-11-17' > 1").show()
org.apache.spark.sql.AnalysisException: cannot resolve '(DATE '2018-11-17' > 1)' due to data type mismatch: differing types in '(DATE '2018-11-17' > 1)' (date and int).; line 1 pos 7;
'Project [unresolvedalias((2018-11-17 > 1), None)]
+- OneRowRelation

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```

We should just remove the check to avoid such a regression. It's not necessary for disabling TimestampNTZ anyway.

### Why are the changes needed?

Fix a regression in the error output of the analysis check.

### Does this PR introduce _any_ user-facing change?

No, it is not released yet.

### How was this patch tested?

Build and try on `spark-shell`

Closes #36609 from gengliangwang/fixRegression.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>