Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat #43243

Conversation

Hisoka-X
Copy link
Member

@Hisoka-X Hisoka-X commented Oct 6, 2023

What changes were proposed in this pull request?

This PR fix CSV/JSON schema inference when timestamps do not match specified timestampFormat will report error.

//eg
val csv = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("inferSchema", true).csv(Seq("2884-06-24T02:45:51.138").toDS())
csv.show() 
//error
Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19 

This bug only happend when partition had one row. The data type should be StringType not TimestampType because the value not match timestampFormat.

Use csv as eg, in CSVInferSchema::tryParseTimestampNTZ, first, use timestampNTZFormatter.parseWithoutTimeZoneOptional to inferring return TimestampType, if same partition had another row, it will use tryParseTimestamp to parse row with user defined timestampFormat, then found it can't be convert to timestamp with timestampFormat. Finally return StringType. But when only one row, we use timestampNTZFormatter.parseWithoutTimeZoneOptional to parse normally timestamp not right. We should only parse it when spark.sql.timestampType is TIMESTAMP_NTZ. If spark.sql.timestampType is TIMESTAMP_LTZ, we should directly parse it use tryParseTimestamp. To avoid return TimestampType when timestamps do not match specified timestampFormat.

Why are the changes needed?

Fix schema inference bug.

Does this PR introduce any user-facing change?

No

How was this patch tested?

add new test.

Was this patch authored or co-authored using generative AI tooling?

No

@Hisoka-X Hisoka-X marked this pull request as ready for review October 6, 2023 05:15
@github-actions github-actions bot added the SQL label Oct 6, 2023
@Hisoka-X
Copy link
Member Author

Hisoka-X commented Oct 6, 2023

cc @MaxGekk @gengliangwang

@Hisoka-X
Copy link
Member Author

Hisoka-X commented Oct 6, 2023

This PR base on #43245

@Hisoka-X Hisoka-X marked this pull request as draft October 6, 2023 08:45
Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR base on #43245

The dependency has been merged. Could you rebase this PR, please.

@Hisoka-X
Copy link
Member Author

Hisoka-X commented Oct 9, 2023

This PR base on #43245

The dependency has been merged. Could you rebase this PR, please.

Done. Thanks @MaxGekk

@Hisoka-X Hisoka-X marked this pull request as ready for review October 9, 2023 12:51
Comment on lines +206 to +208
if ((SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY ||
timestampType == TimestampNTZType) &&
timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify this:
(legacyTimeParserPolicy = LEGACY || timestampType == TimestampLTZType)
we are trying to parse it as NTZ, and it is parsable we return TimestampLTZType?

This confuses me, return TIMESTAMP LTZ when the input was parsed by a NTZ function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also cc @gengliangwang

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Because the LEGACY behavior used timestampNTZFormatter to parse timestamp. So I don't change it when use LEGACY mode. Without this, some test case like CSVLegacyTimeParserSuite.SPARK-37326: Timestamp type inference for a column with TIMESTAMP_NTZ values can't passed. https://github.com/Hisoka-X/spark/runs/17462554632
  2. It should be (legacyTimeParserPolicy = LEGACY || timestampType == TimestampNTZType) not (legacyTimeParserPolicy = LEGACY || timestampType == TimestampLTZType) if I think correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the LEGACY behavior used timestampNTZFormatter to parse timestamp.

I see. It is ok if the such legacy behaviour is covered by a test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the LEGACY behavior used timestampNTZFormatter to parse timestamp.

I can't find the related code, @Hisoka-X can you point to it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the string format are same of two type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we hit the else branch and tryParseTimestamp can infer the type properly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According test case, I think yes. Is any case not right now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks wrong, we may infer ltz using the nzt formatter. This can be a potential bug and bite us in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, but it only happened when use legacy mode. Feel free to change it if you think the legacy behavior not right.

@MaxGekk
Copy link
Member

MaxGekk commented Oct 11, 2023

+1, LGTM. Merging to master/3.5.
Thank you, @Hisoka-X.

@MaxGekk MaxGekk closed this in eae5c0e Oct 11, 2023
MaxGekk pushed a commit that referenced this pull request Oct 11, 2023
…ot match specified timestampFormat

### What changes were proposed in this pull request?
This PR fix CSV/JSON schema inference when timestamps do not match specified timestampFormat will report error.
```scala
//eg
val csv = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("inferSchema", true).csv(Seq("2884-06-24T02:45:51.138").toDS())
csv.show()
//error
Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
```
This bug only happend when partition had one row. The data type should be `StringType` not `TimestampType` because the value not match `timestampFormat`.

Use csv as eg, in `CSVInferSchema::tryParseTimestampNTZ`, first, use `timestampNTZFormatter.parseWithoutTimeZoneOptional` to inferring return `TimestampType`, if same partition had another row, it will use `tryParseTimestamp` to parse row with user defined `timestampFormat`, then found it can't be convert to timestamp with `timestampFormat`. Finally return `StringType`. But when only one row, we use `timestampNTZFormatter.parseWithoutTimeZoneOptional` to parse normally timestamp not right. We should only parse it when `spark.sql.timestampType` is `TIMESTAMP_NTZ`. If `spark.sql.timestampType` is `TIMESTAMP_LTZ`, we should directly parse it use `tryParseTimestamp`. To avoid return `TimestampType` when timestamps do not match specified timestampFormat.

### Why are the changes needed?
Fix schema inference bug.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
add new test.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43243 from Hisoka-X/SPARK-45433-inference-mismatch-timestamp-one-row.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit eae5c0e)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk
Copy link
Member

MaxGekk commented Oct 11, 2023

@Hisoka-X Could you backport this changes to branch-3.4. This PR fails on 3.4:

[error] /Users/maximgekk/proj/review-Hisoka-X_SPARK-45433-inference-mismatch-timestamp-one-row-3.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:30:8: object LegacyBehaviorPolicy is not a member of package org.apache.spark.sql.internal
[error] import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf}
[error]        ^
[error] /Users/maximgekk/proj/review-Hisoka-X_SPARK-45433-inference-mismatch-timestamp-one-row-3.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala:206:48: not found: value LegacyBehaviorPolicy
[error]     if ((SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY ||
[error]

@Hisoka-X Hisoka-X deleted the SPARK-45433-inference-mismatch-timestamp-one-row branch October 12, 2023 02:20
MaxGekk pushed a commit that referenced this pull request Oct 12, 2023
… do not match specified timestampFormat

### What changes were proposed in this pull request?
This is a backport PR of #43243. Fix the bug of schema inference when timestamps do not match specified timestampFormat. Please check #43243 for detail.

### Why are the changes needed?
Fix schema inference bug on 3.4.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
add new test.

### Was this patch authored or co-authored using generative AI tooling?

Closes #43343 from Hisoka-X/backport-SPARK-45433-inference-schema.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
… do not match specified timestampFormat

### What changes were proposed in this pull request?
This is a backport PR of apache#43243. Fix the bug of schema inference when timestamps do not match specified timestampFormat. Please check apache#43243 for detail.

### Why are the changes needed?
Fix schema inference bug on 3.4.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
add new test.

### Was this patch authored or co-authored using generative AI tooling?

Closes apache#43343 from Hisoka-X/backport-SPARK-45433-inference-schema.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
cloud-fan added a commit that referenced this pull request Jan 20, 2024
### What changes were proposed in this pull request?

This is a refinement of #43243 . This PR enforces one thing: we only infer TIMESTAMP NTZ type using NTZ parser, and only infer LTZ type using LTZ parser. This consistency is important to avoid nondeterministic behaviors.

### Why are the changes needed?

Avoid non-deterministic behaviors. After #43243 , we can still have inconsistency if the LEGACY mode is enabled.

### Does this PR introduce _any_ user-facing change?

Yes for the legacy parser. Now it's more likely to infer string type instead of inferring timestamp type "by luck"

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44789

Closes #44800 from cloud-fan/infer.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Jan 20, 2024
This is a refinement of #43243 . This PR enforces one thing: we only infer TIMESTAMP NTZ type using NTZ parser, and only infer LTZ type using LTZ parser. This consistency is important to avoid nondeterministic behaviors.

Avoid non-deterministic behaviors. After #43243 , we can still have inconsistency if the LEGACY mode is enabled.

Yes for the legacy parser. Now it's more likely to infer string type instead of inferring timestamp type "by luck"

existing tests

no

Closes #44789

Closes #44800 from cloud-fan/infer.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e4e4076)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants