[SPARK-38813][SQL][3.3] Remove TimestampNTZ type support in Spark 3.3 #36094
Conversation
Remove TimestampNTZ type support in the production code of Spark 3.3. To achieve the goal, this PR adds the check `Utils.isTesting` in the following code branches: - the keywords "timestamp_ntz" and "timestamp_ltz" in the parser - the new expressions from https://issues.apache.org/jira/browse/SPARK-35662 - using `java.time.LocalDateTime` as the external type for TimestampNTZType - `SQLConf.timestampType`, which determines the default timestamp type of Spark SQL. This is to minimize the code difference with the master branch, so that future users won't think TimestampNTZ is already available in Spark 3.3. The downside is that users can still find TimestampNTZType under the package `org.apache.spark.sql.types`. There should be nothing left other than this.
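The parser-keyword gating described above can be sketched as follows. This is a simplified standalone model, not Spark's actual `AstBuilder` code; the method name, error text, and passing the test flag as a parameter are all illustrative:

```scala
// Hypothetical sketch: only accept the new timestamp keywords when the
// test flag is set, mirroring the Utils.isTesting guard described above.
object ParserSketch {
  def parsePrimitiveType(name: String, isTesting: Boolean): String =
    name.toLowerCase match {
      case ("timestamp_ntz" | "timestamp_ltz") if !isTesting =>
        throw new IllegalArgumentException(s"DataType $name is not supported.")
      case "timestamp_ntz" => "TimestampNTZType"
      case "timestamp_ltz" => "TimestampType" // LTZ maps to the existing default type
      case other           => other
    }
}
```

With the flag off, both keywords fail to parse; with it on, tests can still exercise the new type.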
This one is similar to #33444.
I think the decision was to not support Hive tables, but JDBC is a valid concern. It's really painful to support NTZ in JDBC as we already treat the existing LTZ as NTZ in the JDBC source. We need a good migration plan.
Yes. It is very painful to support NTZ in some data sources.
@cloud-fan I meant Hive metastore support. I just updated the PR description.
Spark stores the table schema in table properties, so the Hive metastore is not an issue here. It's all about Hive compatibility, as we can't change the mapping between the Spark timestamp and the Hive timestamp, which is Spark LTZ <-> Hive timestamp (Hive only has NTZ).
Can you define the risk? Lack of support in data sources is rather a limitation that could be prioritised and resolved. Is there a risk of regression in users' queries, performance implications, or backward compatibility? As far as I am concerned, TimestampNTZ is not a default type anyway and you have to explicitly opt in to use this type, which is up to users. If you try writing this type in a JDBC data source, I expect an error to be thrown. Is the behaviour different from that? We can state in the release notes that support is limited to certain data sources, but I would not necessarily disable it for everything.
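For context on the failure mode this comment expects: a JDBC-style writer typically dispatches on the schema's column types, so an unsupported type surfaces as an error at write time. A minimal standalone sketch, where the type hierarchy and setter names are illustrative stand-ins rather than the actual Spark JDBC code:

```scala
// Hypothetical sketch of a writer rejecting an unsupported column type.
sealed trait SqlType
case object IntType extends SqlType
case object TimestampLtzType extends SqlType // the existing TIMESTAMP WITH LOCAL TIME ZONE
case object TimestampNtzType extends SqlType

object JdbcWriterSketch {
  // Pick the JDBC setter for a column type; fail fast on unsupported ones.
  def setterFor(t: SqlType): String = t match {
    case IntType          => "setInt"
    case TimestampLtzType => "setTimestamp"
    case TimestampNtzType =>
      throw new IllegalArgumentException(s"Unsupported data type: $t")
  }
}
```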
I would suggest "disabling" the data type instead of removing it, e.g. failing queries at the analysis phase if the query contains TimestampNTZ.
@sadikovi that sounds like a simpler solution. e.g. we can do this in
@sadikovi could you raise a PR for this? I think we need to check
Hm, can we just simply mention that
Another issue is data source support. Take Parquet as an example: Spark SQL will infer the schema as
```
@@ -157,6 +158,10 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
     case _: ShowTableExtended =>
       throw QueryCompilationErrors.commandUnsupportedInV2TableError("SHOW TABLE EXTENDED")

+    case operator: LogicalPlan
+      if !Utils.isTesting && operator.output.exists(_.dataType.isInstanceOf[TimestampNTZType]) =>
```
I removed the `!Utils.isTesting` check and tested manually. It worked.
I got it. LGTM.
As this PR is becoming big, I created two follow-ups:
Thanks for pinging!
```
       "LOCAL TIME ZONE type.")
-      .version("3.3.0")
+      .version("3.4.0")
```
this change should be done in the master branch as well.
Sure
@cloud-fan @sadikovi @beliefer @HyukjinKwon Thanks for the inputs.
### What changes were proposed in this pull request?
Remove TimestampNTZ type support in the production code of Spark 3.3. To achieve the goal, this PR adds the check `Utils.isTesting` in the following code branches:
- the keywords "timestamp_ntz" and "timestamp_ltz" in the parser
- the new expressions from https://issues.apache.org/jira/browse/SPARK-35662
- using `java.time.LocalDateTime` as the external type for TimestampNTZType
- `SQLConf.timestampType`, which determines the default timestamp type of Spark SQL

This is to minimize the code difference with the master branch, so that future users won't think TimestampNTZ is already available in Spark 3.3. The downside is that users can still find TimestampNTZType under the package `org.apache.spark.sql.types`. There should be nothing left other than this.

### Why are the changes needed?
The TimestampNTZ project is not finished yet:
* It lacks Hive metastore support
* It lacks JDBC support
* We need to spend time scanning the codebase to find any missing support; the current code usages of `TimestampType` are larger than those of `TimestampNTZType`

### Does this PR introduce _any_ user-facing change?
No, the TimestampNTZ type is not released yet.

### How was this patch tested?
UT

Closes #36094 from gengliangwang/disableNTZ.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
… as 3.4.0

### What changes were proposed in this pull request?
Update the version info of the TimestampNTZ-related changes as 3.4.0.

### Why are the changes needed?
We decided to remove the TimestampNTZ support in the 3.3 release (#36094) and release TimestampNTZ in the 3.4 release.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
GA tests.

Closes #36118 from gengliangwang/updateNTZVersion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
Thank you for the decision!
+1, late LGTM.
…or Spark 3.3

### What changes were proposed in this pull request?
This is a follow-up for #36094. I added `Utils.isTesting` whenever we perform schema conversion or row conversion for TimestampNTZType. I verified that the tests, e.g. ParquetIOSuite, fail with an unsupported data type error when running in non-testing mode:
```
[info] Cause: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 40.0 failed 1 times, most recent failure: Lost task 1.0 in stage 40.0 (TID 66) (ip-10-110-16-208.us-west-2.compute.internal executor driver): org.apache.spark.sql.AnalysisException: Unsupported data type timestamp_ntz
[info]   at org.apache.spark.sql.errors.QueryCompilationErrors$.cannotConvertDataTypeToParquetTypeError(QueryCompilationErrors.scala:1304)
[info]   at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:707)
[info]   at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:479)
[info]   at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:471)
```

### Why are the changes needed?
We have to disable TimestampNTZType as other parts of the codebase do not yet support this type.

### Does this PR introduce _any_ user-facing change?
No, the TimestampNTZ type is not released yet.

### How was this patch tested?
I tested the changes manually by rerunning the test suites that verify TimestampNTZType in the non-testing mode.

Closes #36137 from sadikovi/SPARK-38829-parquet-ntz-off.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
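The gating pattern this follow-up describes can be sketched as below. `convertField`, the type-name strings, and passing the test flag explicitly are simplified stand-ins for Spark's `SparkToParquetSchemaConverter`, not the real implementation:

```scala
object SchemaConverterSketch {
  // Convert a logical type name to a (simplified) Parquet physical type.
  // TimestampNTZ is only allowed when running under the test harness.
  def convertField(dataType: String, isTesting: Boolean): String = dataType match {
    case "timestamp_ntz" if isTesting =>
      "INT64 (TIMESTAMP(MICROS, isAdjustedToUTC = false))"
    case "timestamp_ntz" =>
      throw new IllegalArgumentException("Unsupported data type timestamp_ntz")
    case "timestamp_ltz" =>
      "INT64 (TIMESTAMP(MICROS, isAdjustedToUTC = true))"
    case other => other
  }
}
```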
A weird error message is seen as follows, likely caused by this PR:
The error message should be the following as the master branch shows:
```
@@ -157,6 +158,10 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog {
     case _: ShowTableExtended =>
       throw QueryCompilationErrors.commandUnsupportedInV2TableError("SHOW TABLE EXTENDED")

+    case operator: LogicalPlan
+      if !Utils.isTesting && operator.output.exists(_.dataType.isInstanceOf[TimestampNTZType]) =>
```
Ah, a requirement is that we should call `dataType` only on resolved attributes, so the check should be `operator.output.exists(attr => attr.resolved && attr.dataType.isInstanceOf[TimestampNTZType])`.
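The hazard being fixed here is that `dataType` throws on unresolved attributes. A minimal standalone model of the guarded check, where the `Attr` hierarchy is illustrative rather than Spark's actual `Attribute` API:

```scala
// Model: unresolved attributes must not have dataType called on them.
sealed trait Attr { def resolved: Boolean }

final case class ResolvedAttr(name: String, dataType: String) extends Attr {
  val resolved = true
}

final case class UnresolvedAttr(name: String) extends Attr {
  val resolved = false
  def dataType: String =
    throw new UnsupportedOperationException(
      "Invalid call to dataType on unresolved object")
}

object CheckSketch {
  // Guarded check: only inspect dataType once the attribute is resolved,
  // so unresolved columns fall through to the normal analysis error path.
  def outputHasNtz(output: Seq[Attr]): Boolean =
    output.exists {
      case a: ResolvedAttr => a.dataType == "timestamp_ntz"
      case _               => false
    }
}
```

Without the `resolved` guard, an unrelated unresolved column (e.g. a typo in a column name) would surface the confusing "Invalid call to dataType" error instead of the usual "Column does not exist" message.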
@gengliangwang can you help to fix it?
Yes, I am about to create a PR.
…stampNTZ output

### What changes were proposed in this pull request?
In #36094, a check for failing on TimestampNTZ output was added. However, if there is an unresolved attribute in the plan, even if it is not related to TimestampNTZ, the error message becomes confusing:
```
scala> val df = spark.range(2)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> df.select("i")
org.apache.spark.sql.AnalysisException: Invalid call to dataType on unresolved object;
'Project ['i]
+- Range (0, 2, step=1, splits=Some(16))
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:137)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$4(CheckAnalysis.scala:164)
  ...
```
Before the changes it was:
```
org.apache.spark.sql.AnalysisException: Column 'i' does not exist. Did you mean one of the following? [id];
```
This PR improves the check for TimestampNTZ and restores the error message for unresolved attributes.

### Why are the changes needed?
Fix a regression in the analysis error message.

### Does this PR introduce _any_ user-facing change?
No, it is not released yet.

### How was this patch tested?
Manual test

Closes #36316 from gengliangwang/bugFix.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
In [#36094](#36094), a check for failing on TimestampNTZ type output (since we are disabling TimestampNTZ in 3.3) was added:
```
case operator: LogicalPlan
  if !Utils.isTesting && operator.output.exists(attr =>
    attr.resolved && attr.dataType.isInstanceOf[TimestampNTZType]) =>
  operator.failAnalysis("TimestampNTZ type is not supported in Spark 3.3.")
```
However, the check can cause misleading error messages. In 3.3:
```
> sql("select date '2018-11-17' > 1").show()
org.apache.spark.sql.AnalysisException: Invalid call to toAttribute on unresolved object;
'Project [unresolvedalias((2018-11-17 > 1), None)]
+- OneRowRelation
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAlias.toAttribute(unresolved.scala:510)
  at org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:70)
```
In master or 3.2:
```
> sql("select date '2018-11-17' > 1").show()
org.apache.spark.sql.AnalysisException: cannot resolve '(DATE '2018-11-17' > 1)' due to data type mismatch: differing types in '(DATE '2018-11-17' > 1)' (date and int).; line 1 pos 7;
'Project [unresolvedalias((2018-11-17 > 1), None)]
+- OneRowRelation
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```
We should just remove the check to avoid such a regression. It's not necessary for disabling TimestampNTZ anyway.

### Why are the changes needed?
Fix a regression in the error output of the analysis check.

### Does this PR introduce _any_ user-facing change?
No, it is not released yet.

### How was this patch tested?
Build and try on `spark-shell`.

Closes #36609 from gengliangwang/fixRegression.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
What changes were proposed in this pull request?

Remove TimestampNTZ type support in the production code of Spark 3.3. To achieve the goal, this PR adds the check `Utils.isTesting` in the following code branches:

- the keywords "timestamp_ntz" and "timestamp_ltz" in the parser
- the new expressions from SPARK-35662
- using `java.time.LocalDateTime` as the external type for `TimestampNTZType`
- `SQLConf.timestampType`, which determines the default timestamp type of Spark SQL

This is to minimize the code difference with the master branch, so that future users won't think TimestampNTZ is already available in Spark 3.3. The downside is that users can still find `TimestampNTZType` under the package `org.apache.spark.sql.types`. There should be nothing left other than this.

Why are the changes needed?

The TimestampNTZ project is not finished yet:

- It lacks Hive metastore support
- It lacks JDBC support
- We need to spend time scanning the codebase to find any missing support; the current code usages of `TimestampType` are larger than those of `TimestampNTZType`

Does this PR introduce any user-facing change?

No, the TimestampNTZ type is not released yet.

How was this patch tested?

UT