
[SPARK-38813][SQL][3.3] Remove TimestampNTZ type support in Spark 3.3 #36094

Closed
wants to merge 6 commits into from

Conversation

@gengliangwang (Member) commented Apr 7, 2022

What changes were proposed in this pull request?

Remove TimestampNTZ type support in the production code of Spark 3.3.
To achieve this goal, this PR adds the check `Utils.isTesting` in the following code branches:

  • the keywords "timestamp_ntz" and "timestamp_ltz" in the parser
  • the new expressions from https://issues.apache.org/jira/browse/SPARK-35662
  • using java.time.LocalDateTime as the external type for TimestampNTZType
  • `SQLConf.timestampType`, which determines the default timestamp type of Spark SQL

This is to minimize the code difference from the master branch, so that future users won't think TimestampNTZ is already available in Spark 3.3.
The downside is that users can still find TimestampNTZType under the package `org.apache.spark.sql.types`. There should be nothing left other than this.
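For illustration, here is a minimal sketch of the gating pattern described above, assuming Spark's usual test flags; the object and method names are hypothetical and not the actual parser code touched by this PR:

```scala
import org.apache.spark.sql.types.{DataType, TimestampNTZType, TimestampType}

object NtzGateSketch {
  // Rough stand-in for the spark-internal Utils.isTesting: true only under test flags.
  def isTesting: Boolean =
    sys.env.contains("SPARK_TESTING") || sys.props.contains("spark.testing")

  // Hypothetical parser-side branch: the new keyword is only recognized in tests,
  // so a production 3.3 build rejects it like any other unsupported type name.
  def resolveTimestampKeyword(keyword: String): DataType = keyword.toLowerCase match {
    case "timestamp"                  => TimestampType
    case "timestamp_ntz" if isTesting => TimestampNTZType
    case other =>
      throw new IllegalArgumentException(s"DataType $other is not supported.")
  }
}
```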

Why are the changes needed?

The TimestampNTZ project is not finished yet:

  • It lacks Hive metastore support
  • It lacks JDBC support
  • We need to spend time scanning the codebase for any other missing support; TimestampType currently has far more code usages than TimestampNTZType

Does this PR introduce any user-facing change?

No, the TimestampNTZ type is not released yet.

How was this patch tested?

UT

@gengliangwang (Member, Author) commented Apr 7, 2022

cc @beliefer @sadikovi as well.
We have made great progress in the past few months, but it seems risky to release TimestampNTZ now. I suggest postponing it to the 3.4 release.

@gengliangwang (Member, Author)

This one is similar to #33444

@cloud-fan (Contributor)

I think the decision was to not support Hive tables, but JDBC is a valid concern. It's really painful to support NTZ in JDBC as we already treat the existing LTZ as NTZ in the JDBC source. We need a good migration plan.

@beliefer (Contributor) commented Apr 7, 2022

Yes. It is very painful to support NTZ in some data sources.

@gengliangwang (Member, Author)

@cloud-fan I meant Hive metastore support. I just updated the PR description.

@cloud-fan (Contributor) commented Apr 7, 2022

Spark stores table schemas in table properties, so the Hive metastore is not an issue here. It's all about Hive compatibility, as we can't change the mapping between the Spark timestamp and the Hive timestamp, which is Spark LTZ <-> Hive timestamp (Hive only has NTZ).
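As a tiny illustration of that fixed mapping (an assumption-level sketch, not actual Spark code): Hive's single TIMESTAMP type has NTZ semantics, but it has always been read as Spark's LTZ TimestampType, and that mapping cannot be switched to the new type without breaking existing tables.

```scala
import org.apache.spark.sql.types.{DataType, TimestampType}

// Long-standing mapping: Hive TIMESTAMP (NTZ semantics) -> Spark TimestampType (LTZ).
// Remapping it to TimestampNTZType would change the meaning of existing Hive tables.
def hiveTypeToSpark(hiveTypeName: String): DataType = hiveTypeName.toLowerCase match {
  case "timestamp" => TimestampType
  case other       => throw new IllegalArgumentException(s"Unhandled Hive type: $other")
}
```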

@sadikovi (Contributor) commented Apr 7, 2022

Can you define the risk? Lack of support in data sources is rather a limitation that could be prioritised and resolved. Is there a risk of regression in users' queries, performance implications, or backward compatibility?

As far as I am concerned, TimestampNTZ is not the default type anyway, and you have to explicitly opt in to use it, which is up to users. If you try writing this type to a JDBC data source, I expect an error to be thrown. Is the behaviour different from that?

We can state in the release notes that support is limited to certain data sources but I would not necessarily disable it for everything.
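To illustrate the opt-in point, a hedged spark-shell sketch (assuming an NTZ-enabled build; the config key `spark.sql.timestampType` backs the `SQLConf.timestampType` mentioned in the PR description, and the value name here is an assumption):

```scala
// Session-level opt-in: make unqualified timestamp literals/columns default to NTZ.
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
spark.sql("SELECT TIMESTAMP'2022-04-07 10:00:00' AS ts").printSchema()
// With NTZ enabled, `ts` would come back as timestamp_ntz; writing such a column to a
// JDBC source would then be expected to fail rather than be silently converted.
```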

@sadikovi (Contributor) commented Apr 7, 2022

I would suggest "disabling" the data type instead of removing it, e.g. failing queries at the analysis phase if the query contains TimestampNTZ.

@cloud-fan (Contributor)

@sadikovi that sounds like a simpler solution, e.g. we can do this in CheckAnalysis and only allow NTZ in tests.

@gengliangwang (Member, Author) commented Apr 7, 2022

> I would suggest "disabling" the data type instead of removing it, e.g. failing queries at the analysis phase if the query contains TimestampNTZ.

@sadikovi could you raise a PR for this? I think we need to check create/alter table as well.

@HyukjinKwon (Member)

Hm, can we just simply mention that TimestampNTZ is unstable for now, and not supported very well yet? We should remove this from the PySpark API docs too, at least.

@gengliangwang (Member, Author)

Another issue is data source support. Take Parquet as an example: Spark SQL will infer the schema as TimestampNTZType if the annotation isAdjustedToUTC is `false`.
cc @cloud-fan @gatorsmile, shall we mark TimestampNTZ as unstable, as @HyukjinKwon suggested?
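A rough sketch of the Parquet inference behavior mentioned above (illustrative only, not the actual ParquetSchemaConverter code):

```scala
import org.apache.spark.sql.types.{DataType, TimestampNTZType, TimestampType}

// A Parquet TIMESTAMP whose logical-type annotation has isAdjustedToUTC = false would be
// read back as TimestampNTZType; isAdjustedToUTC = true keeps the existing LTZ mapping.
def inferSparkTimestamp(isAdjustedToUTC: Boolean): DataType =
  if (isAdjustedToUTC) TimestampType else TimestampNTZType
```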

```
@@ -157,6 +158,10 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
      case _: ShowTableExtended =>
        throw QueryCompilationErrors.commandUnsupportedInV2TableError("SHOW TABLE EXTENDED")

      case operator: LogicalPlan
        if !Utils.isTesting && operator.output.exists(_.dataType.isInstanceOf[TimestampNTZType]) =>
```
Member Author

I removed the !Utils.isTesting check and tested manually. It worked.

Contributor

I got it. LGTM.

@gengliangwang (Member, Author)

As this PR is becoming big, I created two follow-ups:

@itholic (Contributor) commented Apr 8, 2022

Thanks for pinging!

```
"LOCAL TIME ZONE type.")
.version("3.3.0")
.version("3.4.0")
```
Contributor

This change should be done in the master branch as well.

Member Author

Sure

@gengliangwang (Member, Author)

@cloud-fan @sadikovi @beliefer @HyukjinKwon Thanks for the inputs.
Merging to 3.3

gengliangwang added a commit that referenced this pull request Apr 8, 2022
Closes #36094 from gengliangwang/disableNTZ.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
gengliangwang added a commit that referenced this pull request Apr 8, 2022
… as 3.4.0

### What changes were proposed in this pull request?

Update the version info of TimestampNTZ-related changes to 3.4.0.

### Why are the changes needed?

We decided to remove TimestampNTZ support from the 3.3 release (#36094) and release TimestampNTZ in 3.4.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GA tests.

Closes #36118 from gengliangwang/updateNTZVersion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
@dongjoon-hyun (Member)

Thank you for the decision!

@dongjoon-hyun (Member) left a comment

+1, late LGTM.

gengliangwang pushed a commit that referenced this pull request Apr 13, 2022
…or Spark 3.3

### What changes were proposed in this pull request?

This is a follow-up for #36094.
I added `Utils.isTesting` whenever we perform schema conversion or row conversion for TimestampNTZType.

I verified that the tests, e.g. ParquetIOSuite, fail with an unsupported data type error when running in non-testing mode:
```
[info]   Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 40.0 failed 1 times, most recent failure: Lost task 1.0 in stage 40.0 (TID 66) (ip-10-110-16-208.us-west-2.compute.internal executor driver): org.apache.spark.sql.AnalysisException: Unsupported data type timestamp_ntz
[info] 	at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotConvertDataTypeToParquetTypeError(QueryCompilationErrors.scala:1304)
[info] 	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:707)
[info] 	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:479)
[info] 	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:471)
```

### Why are the changes needed?
We have to disable TimestampNTZType as other parts of the codebase do not yet support this type.

### Does this PR introduce _any_ user-facing change?

No, the TimestampNTZ type is not released yet.

### How was this patch tested?

I tested the changes manually by rerunning the test suites that verify TimestampNTZType in the non-testing mode.

Closes #36137 from sadikovi/SPARK-38829-parquet-ntz-off.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
@ueshin (Member) commented Apr 21, 2022

A weird error message is seen as follows, likely caused by this PR:

```
scala> val df = spark.range(2)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.select("i")
org.apache.spark.sql.AnalysisException: Invalid call to dataType on unresolved object;
'Project ['i]
+- Range (0, 2, step=1, splits=Some(16))

  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:137)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$4(CheckAnalysis.scala:164)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$4$adapted(CheckAnalysis.scala:164)
  at scala.collection.IndexedSeqOptimized.prefixLengthImpl(IndexedSeqOptimized.scala:41)
  at scala.collection.IndexedSeqOptimized.exists(IndexedSeqOptimized.scala:49)
  at scala.collection.IndexedSeqOptimized.exists$(IndexedSeqOptimized.scala:49)
  at scala.collection.mutable.ArrayBuffer.exists(ArrayBuffer.scala:49)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:164)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:100)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:357)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:100)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:95)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:187)
...
```

The error message should be the following, as the master branch shows:

```
org.apache.spark.sql.AnalysisException: Column 'i' does not exist. Did you mean one of the following? [id];
...
```

```
@@ -157,6 +158,10 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
      case _: ShowTableExtended =>
        throw QueryCompilationErrors.commandUnsupportedInV2TableError("SHOW TABLE EXTENDED")

      case operator: LogicalPlan
        if !Utils.isTesting && operator.output.exists(_.dataType.isInstanceOf[TimestampNTZType]) =>
```
Contributor

Ah, a requirement is that we should call dataType only on resolved attributes, so the check should be:

```scala
operator.output.exists(attr => attr.resolved && attr.dataType.isInstanceOf[TimestampNTZType])
```

Contributor

@gengliangwang can you help to fix it?

Member Author

Yes, I am about to create a PR.

gengliangwang added a commit that referenced this pull request Apr 22, 2022
…stampNTZ output

### What changes were proposed in this pull request?

In #36094, a check that fails on TimestampNTZ output was added.
However, if there is an unresolved attribute in the plan, even if it is not related to TimestampNTZ, the error message becomes confusing:
```
scala> val df = spark.range(2)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.select("i")
org.apache.spark.sql.AnalysisException: Invalid call to dataType on unresolved object;
'Project ['i]
+- Range (0, 2, step=1, splits=Some(16))

  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:137)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$4(CheckAnalysis.scala:164)
...
```

Before the changes, it was:
```
org.apache.spark.sql.AnalysisException: Column 'i' does not exist. Did you mean one of the following? [id];
```

This PR improves the check for TimestampNTZ and restores the error message for unresolved attributes.

### Why are the changes needed?

Fix a regression in analysis error message.

### Does this PR introduce _any_ user-facing change?

No, it is not released yet.

### How was this patch tested?

Manual test

Closes #36316 from gengliangwang/bugFix.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
gengliangwang added a commit that referenced this pull request May 19, 2022
### What changes were proposed in this pull request?

In [#36094](#36094), a check that fails on TimestampNTZ type output (since we are disabling TimestampNTZ in 3.3) was added:
```
      case operator: LogicalPlan
        if !Utils.isTesting && operator.output.exists(attr =>
          attr.resolved && attr.dataType.isInstanceOf[TimestampNTZType]) =>
        operator.failAnalysis("TimestampNTZ type is not supported in Spark 3.3.")
```

However, the check can cause misleading error messages.

In 3.3:
```
> sql( "select date '2018-11-17' > 1").show()
org.apache.spark.sql.AnalysisException: Invalid call to toAttribute on unresolved object;
'Project [unresolvedalias((2018-11-17 > 1), None)]
+- OneRowRelation

  at org.apache.spark.sql.catalyst.analysis.UnresolvedAlias.toAttribute(unresolved.scala:510)
  at org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:70)
```
In master or 3.2:
```
> sql( "select date '2018-11-17' > 1").show()
org.apache.spark.sql.AnalysisException: cannot resolve '(DATE '2018-11-17' > 1)' due to data type mismatch: differing types in '(DATE '2018-11-17' > 1)' (date and int).; line 1 pos 7;
'Project [unresolvedalias((2018-11-17 > 1), None)]
+- OneRowRelation

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```

We should just remove the check to avoid such a regression. It's not necessary for disabling TimestampNTZ anyway.

### Why are the changes needed?

Fix a regression in the error output of the analysis check.

### Does this PR introduce _any_ user-facing change?

No, it is not released yet.

### How was this patch tested?

Build and try on `spark-shell`

Closes #36609 from gengliangwang/fixRegression.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>