
[SPARK-42221][SQL] Introduce a new conf for TimestampNTZ schema inference in JSON/CSV #39777

Closed
wants to merge 1 commit

Conversation

gengliangwang
Member

What changes were proposed in this pull request?

The TimestampNTZ schema inference over data sources is inconsistent in the current code (most of them infer Timestamp LTZ by default for backward compatibility):

  • CSV & JSON: depends on spark.sql.timestampType to determine the result
  • ORC: depends on whether there is metadata written. If not, inferred as Timestamp LTZ
  • Parquet: infer timestamp column with annotation isAdjustedToUTC = false as Timestamp NTZ. There is a configuration spark.sql.parquet.timestampNTZ.enabled to determine whether to support NTZ. When spark.sql.parquet.timestampNTZ.enabled is false, users can't write Timestamp NTZ columns to parquet files.
  • Avro: Local timestamp type is a new logical type so there is no backward compatibility issue and there is no configuration to control the inference.

Since we are going to release Timestamp NTZ in Spark 3.4.0, I propose a new configuration spark.sql.inferTimestampNTZInDataSources.enabled for TimestampNTZ schema inference. The flag is false by default for backward compatibility. When true, if a column can be either TimestampNTZ or TimestampLTZ, the inferred result will be TimestampNTZ. This PR converts the JSON/CSV data sources. If the proposal looks fine to others, I will continue with the other data sources.
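As a rough illustration of the proposed behavior, a minimal sketch (the file path and contents are hypothetical, and the conf key is the one introduced here, before its later rename):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Default (false): an ambiguous timestamp keeps inferring as TIMESTAMP_LTZ.
spark.conf.set("spark.sql.inferTimestampNTZInDataSources.enabled", "false")
spark.read.option("inferSchema", "true")
  .csv("/tmp/ts.csv")   // hypothetical file with a line like 2023-01-30T12:00:00
  .printSchema()        // _c0: timestamp

// With the new flag on, the same column infers as TIMESTAMP_NTZ.
spark.conf.set("spark.sql.inferTimestampNTZInDataSources.enabled", "true")
spark.read.option("inferSchema", "true")
  .csv("/tmp/ts.csv")
  .printSchema()        // _c0: timestamp_ntz
```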

Why are the changes needed?

  • The TimestampNTZ schema inference over data sources is inconsistent in the current code
  • The configuration spark.sql.timestampType is heavyweight: it changes the default timestamp type of DDL and SQL functions. If a user only wants to read back newly written TimestampNTZ data without breaking existing workloads, a lightweight flag is the better fit (see the sketch below).
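To make the difference in scope concrete, a hedged comparison (the table and statements are illustrative):

```scala
// Existing heavyweight conf: changes the default TIMESTAMP type globally
// (SQL DDL, CAST, timestamp literals), not just data source inference.
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
spark.sql("CREATE TABLE t (ts TIMESTAMP) USING parquet")  // ts is now TIMESTAMP_NTZ

// Proposed lightweight conf: affects only schema inference in data sources;
// DDL and SQL function defaults are left untouched.
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_LTZ")  // restore the default
spark.conf.set("spark.sql.inferTimestampNTZInDataSources.enabled", "true")
```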

Does this PR introduce any user-facing change?

No, TimestampNTZ is not released yet.

How was this patch tested?

UTs

@gengliangwang
Member Author

cc @srielau @sadikovi @xinrong-meng

@@ -252,6 +252,9 @@ class CSVInferSchemaSuite extends SparkFunSuite with SQLHelper {
withSQLConf(SQLConf.TIMESTAMP_TYPE.key -> "TIMESTAMP_NTZ") {
Contributor

Shall we remove this test? The TIMESTAMP_TYPE config does not affect schema inference anymore.

Member Author

Yeah, I am keeping it on purpose. It doesn't hurt to check the impact of TIMESTAMP_TYPE on the behavior of the CSV data source.
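A sketch of what such a regression check could look like (the `spark` session and `path` are illustrative, not the exact suite code):

```scala
import org.apache.spark.sql.types.{TimestampNTZType, TimestampType}

// TIMESTAMP_TYPE alone should no longer flip CSV inference to NTZ...
withSQLConf(
    "spark.sql.timestampType" -> "TIMESTAMP_NTZ",
    "spark.sql.inferTimestampNTZInDataSources.enabled" -> "false") {
  // `path` points at a CSV file whose column holds ambiguous timestamps
  val schema = spark.read.option("inferSchema", "true").csv(path).schema
  assert(schema.head.dataType == TimestampType)
}

// ...while the new conf is what actually controls the inferred type.
withSQLConf("spark.sql.inferTimestampNTZInDataSources.enabled" -> "true") {
  val schema = spark.read.option("inferSchema", "true").csv(path).schema
  assert(schema.head.dataType == TimestampNTZType)
}
```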

@cloud-fan
Contributor

LGTM. For ORC and Avro, the NTZ type is new (ORC has metadata, Avro has a new logical type), so do we only need to apply this new conf to Parquet?

@gengliangwang
Member Author

gengliangwang commented Jan 30, 2023

For ORC and Avro, the NTZ type is new (ORC has metadata, Avro has a new logical type), so do we only need to apply this new conf to Parquet?

For ORC without timestamp metadata written by Spark, the new configuration will still control the behavior. WDYT?

@cloud-fan
Contributor

For ORC without timestamp metadata written by Spark, the new configuration will still control the behavior.

SGTM

@gengliangwang
Member Author

Merging to master/3.4, cc @xinrong-meng

gengliangwang added a commit that referenced this pull request Jan 30, 2023
…ence in JSON/CSV

Closes #39777 from gengliangwang/ntzOptions.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 973b3d5)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
gengliangwang added a commit that referenced this pull request Jan 31, 2023
…led` to infer timestamp type on partition columns

### What changes were proposed in this pull request?

Use `spark.sql.inferTimestampNTZInDataSources.enabled` to infer timestamp type on partition columns, instead of `spark.sql.timestampType`.
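A minimal sketch of the partition-column case (the directory layout and column name are hypothetical):

```scala
// Layout like /tmp/events/ts=2023-01-30 12%3A00%3A00/part-00000.parquet
spark.conf.set("spark.sql.inferTimestampNTZInDataSources.enabled", "true")
val df = spark.read.parquet("/tmp/events")
// With the flag on, the partition column `ts` is inferred as TimestampNTZType;
// with the default (false), it stays TimestampType (LTZ).
df.schema("ts").dataType
```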

### Why are the changes needed?

Similar to #39777:
* make the schema inference in data sources consistent
* use a lightweight configuration for data source schema inference.

### Does this PR introduce _any_ user-facing change?

No, TimestampNTZ is not released yet.

### How was this patch tested?

UT

Closes #39812 from gengliangwang/partitionNTZ.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Feb 3, 2023
…bled on JDBC data source

### What changes were proposed in this pull request?

Similar to #39777 and #39812, this PR proposes to use `spark.sql.inferTimestampNTZInDataSources.enabled` to control the behavior of timestamp type inference on JDBC data sources.
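A hedged sketch of the JDBC side (the connection details are placeholders):

```scala
spark.conf.set("spark.sql.inferTimestampNTZInDataSources.enabled", "true")
val events = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/shop")  // placeholder URL
  .option("dbtable", "events")                             // placeholder table
  .option("user", "spark")
  .option("password", "...")
  .load()
// With the flag on, a database TIMESTAMP WITHOUT TIME ZONE column is read back
// as TimestampNTZType; with the default (false), it maps to TimestampType (LTZ).
```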

### Why are the changes needed?

Unify the TimestampNTZ type inference behavior over data sources. In JDBC/JSON/CSV data sources, a column can be Timestamp type or TimestampNTZ type. We need a lightweight configuration to control the behavior.

### Does this PR introduce _any_ user-facing change?

No, TimestampNTZ is not released yet.

### How was this patch tested?

UTs

Closes #39868 from gengliangwang/jdbcNTZ.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
MaxGekk pushed a commit that referenced this pull request Feb 5, 2023
…urces.timestampNTZTypeInference.enabled

### What changes were proposed in this pull request?

Rename the TimestampNTZ data source inference configuration from `spark.sql.inferTimestampNTZInDataSources.enabled` to `spark.sql.sources.timestampNTZTypeInference.enabled`.
For more context on this configuration, see:
#39777
#39812
#39868

### Why are the changes needed?

Since the configuration is for data sources, we can put it under the prefix `spark.sql.sources`. The new name is consistent with the existing configuration `spark.sql.sources.partitionColumnTypeInference.enabled`.
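For reference, the key before and after the rename (usage is illustrative):

```scala
// Before (#39777): spark.sql.inferTimestampNTZInDataSources.enabled
// After  (#39885): spark.sql.sources.timestampNTZTypeInference.enabled
spark.conf.set("spark.sql.sources.timestampNTZTypeInference.enabled", "true")
```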

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #39885 from gengliangwang/renameConf.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>