[SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat #43243

Hisoka-X · 2023-10-06T05:14:58Z

What changes were proposed in this pull request?

This PR fixes CSV/JSON schema inference so that it no longer reports an error when timestamps do not match the specified timestampFormat.

// Example
import spark.implicits._  // needed for .toDS()

val csv = spark.read
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("inferSchema", true)
  .csv(Seq("2884-06-24T02:45:51.138").toDS())
csv.show()
// Error:
// Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19

This bug only happens when a partition has a single row. The data type should be StringType, not TimestampType, because the value does not match timestampFormat.

Taking CSV as an example: in CSVInferSchema::tryParseTimestampNTZ, the value is first parsed with timestampNTZFormatter.parseWithoutTimeZoneOptional, and if that succeeds the inferred type is TimestampType. If the same partition has another row, that row is parsed by tryParseTimestamp with the user-defined timestampFormat, which finds that the value cannot be converted to a timestamp, so the final result is StringType. With only one row, that second check never runs, so the TimestampType inferred through timestampNTZFormatter.parseWithoutTimeZoneOptional is kept even though it is wrong for a regular timestamp. The NTZ formatter should only be used when spark.sql.timestampType is TIMESTAMP_NTZ. If spark.sql.timestampType is TIMESTAMP_LTZ, the value should be parsed directly with tryParseTimestamp, which avoids returning TimestampType when timestamps do not match the specified timestampFormat.
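The one-row versus multi-row behaviour described above can be illustrated with a small model that uses only java.time and no Spark classes. This is a sketch under stated assumptions, not the code in CSVInferSchema: `InferenceSketch`, `Inferred`, `firstBefore`, `firstAfter`, and `next` are hypothetical names, and `DateTimeFormatter.ISO_LOCAL_DATE_TIME` merely stands in for a permissive NTZ formatter.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical model of per-partition type inference (LTZ session default).
object InferenceSketch {
  sealed trait Inferred
  case object StringT extends Inferred
  case object TimestampT extends Inferred

  // The user-specified timestampFormat from the example in the description.
  private val userFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")
  // Assumption: a permissive ISO parser stands in for timestampNTZFormatter.
  private val permissive = DateTimeFormatter.ISO_LOCAL_DATE_TIME

  private def parses(s: String, f: DateTimeFormatter): Boolean =
    Try(LocalDateTime.parse(s, f)).isSuccess

  // First value of a partition, before the fix: the permissive parser is
  // consulted first, so a value that violates userFormat can still win.
  private def firstBefore(s: String): Inferred =
    if (parses(s, permissive) || parses(s, userFormat)) TimestampT else StringT

  // First value of a partition, after the fix: only userFormat decides.
  private def firstAfter(s: String): Inferred =
    if (parses(s, userFormat)) TimestampT else StringT

  // Later values are only re-checked against userFormat.
  private def next(soFar: Inferred, s: String): Inferred = soFar match {
    case TimestampT if parses(s, userFormat) => TimestampT
    case _ => StringT
  }

  // Assumes rows is non-empty.
  private def infer(first: String => Inferred, rows: Seq[String]): Inferred =
    rows.tail.foldLeft(first(rows.head))(next)

  def inferBefore(rows: Seq[String]): Inferred = infer(firstBefore, rows)
  def inferAfter(rows: Seq[String]): Inferred = infer(firstAfter, rows)
}
```

In this model a one-row partition containing "2884-06-24T02:45:51.138" yields `TimestampT` before the fix and `StringT` after it, while a two-row partition yields `StringT` in both cases.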

Why are the changes needed?

Fix schema inference bug.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a new test.

Was this patch authored or co-authored using generative AI tooling?

No

Hisoka-X · 2023-10-06T05:32:11Z

cc @MaxGekk @gengliangwang

Hisoka-X · 2023-10-06T08:24:29Z

This PR is based on #43245.

MaxGekk

> This PR is based on #43245.

The dependency has been merged. Could you rebase this PR, please?

Hisoka-X · 2023-10-09T12:51:37Z

> > This PR is based on #43245.
>
> The dependency has been merged. Could you rebase this PR, please?

Done. Thanks @MaxGekk

MaxGekk · 2023-10-09T15:27:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala

+    if ((SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY ||
+        timestampType == TimestampNTZType) &&
+        timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) {


Could you clarify this:
(legacyTimeParserPolicy = LEGACY || timestampType == TimestampLTZType)
We try to parse the value as NTZ, and if it is parsable we return TimestampLTZType?

MaxGekk · Oct 9, 2023

This confuses me: we return TIMESTAMP LTZ when the input was parsed by an NTZ function.

also cc @gengliangwang

Hisoka-X · Oct 10, 2023

Because the LEGACY behavior used timestampNTZFormatter to parse timestamps, I did not change it for LEGACY mode. Without this, some test cases, such as CSVLegacyTimeParserSuite.SPARK-37326: Timestamp type inference for a column with TIMESTAMP_NTZ values, cannot pass. https://github.com/Hisoka-X/spark/runs/17462554632

If I understand correctly, it should be (legacyTimeParserPolicy = LEGACY || timestampType == TimestampNTZType), not (legacyTimeParserPolicy = LEGACY || timestampType == TimestampLTZType).

MaxGekk · Oct 11, 2023

> Because the LEGACY behavior used timestampNTZFormatter to parse timestamps.

I see. It is OK if such legacy behaviour is covered by a test.

cloud-fan · Jan 17, 2024

> Because the LEGACY behavior used timestampNTZFormatter to parse timestamps.

I can't find the related code. @Hisoka-X, can you point to it?

Hisoka-X · Jan 17, 2024

Because the string format is the same for the two types.

cloud-fan · Jan 17, 2024

Then we hit the else branch and tryParseTimestamp can infer the type properly?

Hisoka-X · Jan 17, 2024

According to the test cases, I think yes. Is there any case that is not right now?

cloud-fan · Jan 17, 2024

The code looks wrong: we may infer LTZ using the NTZ formatter. This can be a potential bug and bite us in the future.

Hisoka-X · Jan 17, 2024

Yep, but it only happens in legacy mode. Feel free to change it if you think the legacy behavior is not right.
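The case the reviewers question can be made concrete by writing the condition from the diff as a pure function. This is only a sketch: `GuardSketch`, `TsType`, `triesNtzParserFirst`, and `ntzParserMayYieldLtz` are hypothetical names, and it assumes, as the review comments describe, that a successful NTZ parse returns the session's configured timestamp type.

```scala
// Hypothetical model of the guard discussed in this review thread.
object GuardSketch {
  sealed trait TsType
  case object Ltz extends TsType
  case object Ntz extends TsType

  // Mirrors the condition in the diff: the NTZ formatter is consulted first
  // in LEGACY parser mode or when the session default is TIMESTAMP_NTZ.
  def triesNtzParserFirst(legacy: Boolean, sessionType: TsType): Boolean =
    legacy || sessionType == Ntz

  // Assumption: a successful NTZ parse returns the session's timestamp type.
  // The NTZ formatter can then decide an LTZ result only when this is true.
  def ntzParserMayYieldLtz(legacy: Boolean, sessionType: TsType): Boolean =
    triesNtzParserFirst(legacy, sessionType) && sessionType == Ltz
}
```

Under these assumptions `ntzParserMayYieldLtz` is true only for LEGACY mode combined with an LTZ session default, which is the combination raised as a potential bug.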

MaxGekk · 2023-10-11T16:22:46Z

+1, LGTM. Merging to master/3.5.
Thank you, @Hisoka-X.

…ot match specified timestampFormat

Closes #43243 from Hisoka-X/SPARK-45433-inference-mismatch-timestamp-one-row.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit eae5c0e)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

MaxGekk · 2023-10-11T16:35:33Z

@Hisoka-X Could you backport these changes to branch-3.4? This PR fails on 3.4:

[error] /Users/maximgekk/proj/review-Hisoka-X_SPARK-45433-inference-mismatch-timestamp-one-row-3.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:30:8: object LegacyBehaviorPolicy is not a member of package org.apache.spark.sql.internal
[error] import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf}
[error]        ^
[error] /Users/maximgekk/proj/review-Hisoka-X_SPARK-45433-inference-mismatch-timestamp-one-row-3.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:206:48: not found: value LegacyBehaviorPolicy
[error]     if ((SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY ||
[error]

… do not match specified timestampFormat

### What changes were proposed in this pull request?
This is a backport PR of #43243. Fix the bug of schema inference when timestamps do not match specified timestampFormat. Please check #43243 for detail.

### Why are the changes needed?
Fix schema inference bug on 3.4.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
add new test.

### Was this patch authored or co-authored using generative AI tooling?

Closes #43343 from Hisoka-X/backport-SPARK-45433-inference-schema.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>

### What changes were proposed in this pull request?
This is a refinement of #43243. This PR enforces one thing: we only infer TIMESTAMP NTZ type using NTZ parser, and only infer LTZ type using LTZ parser. This consistency is important to avoid nondeterministic behaviors.

### Why are the changes needed?
Avoid non-deterministic behaviors. After #43243, we can still have inconsistency if the LEGACY mode is enabled.

### Does this PR introduce _any_ user-facing change?
Yes for the legacy parser. Now it's more likely to infer string type instead of inferring timestamp type "by luck"

### How was this patch tested?
existing tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44789

Closes #44800 from cloud-fan/infer.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e4e4076)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
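The rule stated in this refinement (NTZ is inferred only by the NTZ parser, LTZ only by the LTZ parser) can be sketched as a small decision function. This is an illustrative model, not the merged code: `RefinedRuleSketch` and its parameters are hypothetical, and the two booleans stand for whether each parser accepts the field.

```scala
// Hypothetical model of the refined inference rule.
object RefinedRuleSketch {
  sealed trait Inferred
  case object TimestampNtz extends Inferred
  case object TimestampLtz extends Inferred
  case object StringT extends Inferred

  // sessionIsNtz: whether spark.sql.timestampType is TIMESTAMP_NTZ.
  // ntzParses / ltzParses: whether the NTZ / LTZ parser accepts the field.
  def infer(sessionIsNtz: Boolean, ntzParses: Boolean, ltzParses: Boolean): Inferred =
    if (sessionIsNtz && ntzParses) TimestampNtz // NTZ comes only from the NTZ parser
    else if (ltzParses) TimestampLtz            // LTZ comes only from the LTZ parser
    else StringT                                // otherwise fall back to string
}
```

Note that no legacy flag appears in the model: with an LTZ session default, a value accepted only by the NTZ parser falls through to `StringT` rather than becoming a timestamp "by luck".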

[SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do n…

7107fa4

…ot match specified timestampFormat

Hisoka-X marked this pull request as ready for review October 6, 2023 05:15

github-actions bot added the SQL label Oct 6, 2023

Hisoka-X marked this pull request as draft October 6, 2023 08:45

Hisoka-X added 2 commits October 6, 2023 17:04

update

2611057

update

a661b9c

MaxGekk reviewed Oct 9, 2023

Merge branch 'master_' into SPARK-45433-inference-mismatch-timestamp-…

454d11b

…one-row

Hisoka-X marked this pull request as ready for review October 9, 2023 12:51

MaxGekk reviewed Oct 9, 2023

MaxGekk approved these changes Oct 11, 2023

MaxGekk closed this in eae5c0e Oct 11, 2023

Hisoka-X mentioned this pull request Oct 12, 2023

[SPARK-45433][SQL][3.4] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat #43343

Closed

Hisoka-X deleted the SPARK-45433-inference-mismatch-timestamp-one-row branch October 12, 2023 02:20

MaxGekk mentioned this pull request Jan 19, 2024

[SPARK-46769][SQL] Fix type inferring for timestamps without time zone in JSON/CSV #44789

Closed

cloud-fan mentioned this pull request Jan 19, 2024

[SPARK-46769][SQL] Refine timestamp related schema inference #44800

Closed
