[SPARK-42243][SQL] Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns
#39812
Conversation
```scala
"backward compatibility. As a result, for JSON/CSV files and partition directories " +
"written with TimestampNTZ columns, the inference results will still be of TimestampLTZ " +
"types.")
  .version("3.4.0")
  .booleanConf
  .createWithDefault(false)
```
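Since the flag defaults to `false`, NTZ inference is opt-in. As a rough, hypothetical sketch of the branching this implies (plain Python for illustration, not Spark's actual inference code; the function name and type strings are invented here):

```python
from datetime import datetime

def infer_partition_type(value: str, infer_ntz: bool) -> str:
    """Hypothetical sketch: if a partition value parses as a timestamp,
    infer TIMESTAMP_NTZ only when the flag is enabled; otherwise keep
    the legacy TIMESTAMP (LTZ) type for backward compatibility."""
    try:
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return "STRING"  # not a timestamp-shaped value
    return "TIMESTAMP_NTZ" if infer_ntz else "TIMESTAMP"

print(infer_partition_type("2023-01-31 10:00:00", infer_ntz=False))  # TIMESTAMP
print(infer_partition_type("2023-01-31 10:00:00", infer_ntz=True))   # TIMESTAMP_NTZ
print(infer_partition_type("not-a-timestamp", infer_ntz=True))       # STRING
```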
Does it mean users can't do an NTZ roundtrip (write and read) in 3.4 by default?
Yup. Take partition directory naming formats as an example: the outputs from TimestampNTZ and TimestampLTZ are exactly the same.
Merging to master/3.4. cc @xinrong-meng
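To illustrate why the roundtrip is lossy: a naive (NTZ-like) and a zone-aware (LTZ-like) value for the same wall-clock time format to identical partition strings, so the original type cannot be recovered from the directory name alone. A minimal Python sketch (illustrative only, not Spark's actual partition-path encoder):

```python
from datetime import datetime, timezone

# Same wall-clock time, once without a zone (NTZ-like), once with (LTZ-like).
ntz = datetime(2023, 1, 31, 10, 0, 0)
ltz = datetime(2023, 1, 31, 10, 0, 0, tzinfo=timezone.utc)

fmt = "%Y-%m-%d %H:%M:%S"
print(ntz.strftime(fmt))  # 2023-01-31 10:00:00
print(ltz.strftime(fmt))  # 2023-01-31 10:00:00 -- identical, the type is lost
```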
…led` to infer timestamp type on partition columns

### What changes were proposed in this pull request?
Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns, instead of `spark.sql.timestampType`.

### Why are the changes needed?
Similar to #39777:
* make the schema inference in data sources consistent
* use a lightweight configuration for data source schema inference

### Does this PR introduce _any_ user-facing change?
No, TimestampNTZ is not released yet.

### How was this patch tested?
UT

Closes #39812 from gengliangwang/partitionNTZ.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit b509ad1)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
…bled on JDBC data source

### What changes were proposed in this pull request?
Similar to #39777 and #39812, this PR proposes to use `spark.sql.inferTimestampNTZInDataSources.enabled` to control the behavior of timestamp type inference on JDBC data sources.

### Why are the changes needed?
Unify the TimestampNTZ type inference behavior over data sources. In JDBC/JSON/CSV data sources, a column can be Timestamp type or TimestampNTZ type. We need a lightweight configuration to control the behavior.

### Does this PR introduce _any_ user-facing change?
No, TimestampNTZ is not released yet.

### How was this patch tested?
UTs

Closes #39868 from gengliangwang/jdbcNTZ.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…urces.timestampNTZTypeInference.enabled

### What changes were proposed in this pull request?
Rename the TimestampNTZ data source inference configuration from `spark.sql.inferTimestampNTZInDataSources.enabled` to `spark.sql.sources.timestampNTZTypeInference.enabled`. For more context on this configuration: #39777 #39812 #39868

### Why are the changes needed?
Since the configuration is for data sources, we can put it under the prefix `spark.sql.sources`. The new naming is consistent with another configuration, `spark.sql.sources.partitionColumnTypeInference.enabled`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #39885 from gengliangwang/renameConf.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns, instead of `spark.sql.timestampType`.

### Why are the changes needed?
Similar to #39777:
* make the schema inference in data sources consistent
* use a lightweight configuration for data source schema inference

### Does this PR introduce _any_ user-facing change?
No, TimestampNTZ is not released yet.

### How was this patch tested?
UT
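The motivation for swapping the controlling config can be sketched as a before/after contrast: previously the inferred type was keyed off the session-wide `spark.sql.timestampType`, which also changes what a plain TIMESTAMP means everywhere else, while the new flag is a dedicated boolean scoped to data source inference. A hypothetical plain-Python illustration (function names invented, not the actual Spark implementation):

```python
def inferred_type_before(session_timestamp_type: str) -> str:
    # Before: tied to the session-wide spark.sql.timestampType setting,
    # which has far broader effects than schema inference.
    return "TIMESTAMP_NTZ" if session_timestamp_type == "TIMESTAMP_NTZ" else "TIMESTAMP_LTZ"

def inferred_type_after(infer_ntz_in_data_sources: bool) -> str:
    # After: a dedicated, lightweight boolean that only affects
    # data source schema inference.
    return "TIMESTAMP_NTZ" if infer_ntz_in_data_sources else "TIMESTAMP_LTZ"

print(inferred_type_before("TIMESTAMP_LTZ"))  # TIMESTAMP_LTZ
print(inferred_type_after(True))              # TIMESTAMP_NTZ
```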