[SPARK-45433][SQL] Fix CSV/JSON schema inference when timestamps do not match specified timestampFormat #43243
Conversation
This PR is based on #43245.
> This PR is based on #43245.

The dependency has been merged. Could you rebase this PR, please?
```scala
if ((SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY ||
    timestampType == TimestampNTZType) &&
    timestampNTZFormatter.parseWithoutTimeZoneOptional(field, false).isDefined) {
```
Could you clarify this: when `(legacyTimeParserPolicy = LEGACY || timestampType == TimestampLTZType)`, we try to parse the field as NTZ, and if it is parsable we return `TimestampLTZType`?
This confuses me: we return TIMESTAMP LTZ when the input was parsed by an NTZ function.
also cc @gengliangwang
- Because the `LEGACY` behavior used `timestampNTZFormatter` to parse timestamps, I didn't change it when using `LEGACY` mode. Without this, some test cases like `CSVLegacyTimeParserSuite`'s "SPARK-37326: Timestamp type inference for a column with TIMESTAMP_NTZ values" can't pass. https://github.com/Hisoka-X/spark/runs/17462554632
- It should be `(legacyTimeParserPolicy = LEGACY || timestampType == TimestampNTZType)`, not `(legacyTimeParserPolicy = LEGACY || timestampType == TimestampLTZType)`, if I think correctly.
> Because the LEGACY behavior used `timestampNTZFormatter` to parse timestamps.

I see. It is OK if such legacy behaviour is covered by a test.
> Because the LEGACY behavior used `timestampNTZFormatter` to parse timestamps.

I can't find the related code, @Hisoka-X can you point to it?
Because the string formats of the two types are the same.
Then we hit the else branch, and `tryParseTimestamp` can infer the type properly?
According to the test cases, I think yes. Is there any case that is not right now?
The code looks wrong: we may infer LTZ using the NTZ formatter. This can be a latent bug that bites us in the future.
Yep, but it only happens when using legacy mode. Feel free to change it if you think the legacy behavior is not right.
+1, LGTM. Merging to master/3.5.
…ot match specified timestampFormat

### What changes were proposed in this pull request?

This PR fixes CSV/JSON schema inference reporting an error when timestamps do not match the specified timestampFormat.

```scala
// e.g.
val csv = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("inferSchema", true).csv(Seq("2884-06-24T02:45:51.138").toDS())
csv.show()
// error
// Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
```

This bug only happened when a partition had one row. The data type should be `StringType`, not `TimestampType`, because the value does not match `timestampFormat`.

Using CSV as an example: in `CSVInferSchema::tryParseTimestampNTZ`, we first use `timestampNTZFormatter.parseWithoutTimeZoneOptional` for inference and return `TimestampType`. If the same partition has another row, `tryParseTimestamp` is then used to parse that row with the user-defined `timestampFormat`, finds it can't be converted to a timestamp with `timestampFormat`, and finally returns `StringType`. But when there is only one row, using `timestampNTZFormatter.parseWithoutTimeZoneOptional` to parse a normal timestamp is not right. We should only parse with it when `spark.sql.timestampType` is `TIMESTAMP_NTZ`. If `spark.sql.timestampType` is `TIMESTAMP_LTZ`, we should directly parse with `tryParseTimestamp`, to avoid returning `TimestampType` when timestamps do not match the specified timestampFormat.

### Why are the changes needed?

Fix schema inference bug.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

add new test.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43243 from Hisoka-X/SPARK-45433-inference-mismatch-timestamp-one-row.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit eae5c0e)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
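The single-row decision described above can be sketched in plain Scala. This is a simplified model, not Spark's actual `CSVInferSchema` code: `InferSketch`, the `preferNtz` flag, and `java.time`'s ISO parser are hypothetical stand-ins for `spark.sql.timestampType` handling and `timestampNTZFormatter.parseWithoutTimeZoneOptional`.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

object InferSketch {
  // Simplified stand-ins for the inferred Spark SQL types.
  sealed trait Inferred
  case object TimestampNtz extends Inferred
  case object TimestampLtz extends Inferred
  case object StringType extends Inferred

  // Lenient ISO parser, playing the role of timestampNTZFormatter.
  private val isoNtz = DateTimeFormatter.ISO_LOCAL_DATE_TIME

  // After the fix: only try the lenient NTZ parse when the session prefers
  // TIMESTAMP_NTZ; otherwise go straight to the user-specified timestampFormat.
  def inferField(field: String, userFormat: DateTimeFormatter,
                 preferNtz: Boolean): Inferred = {
    if (preferNtz && Try(LocalDateTime.parse(field, isoNtz)).isSuccess) {
      TimestampNtz
    } else if (Try(LocalDateTime.parse(field, userFormat)).isSuccess) {
      TimestampLtz
    } else {
      StringType // matched neither parser, so keep the column as string
    }
  }
}
```

With `preferNtz = false` and the format `yyyy-MM-dd'T'HH:mm:ss`, the value `2884-06-24T02:45:51.138` from the bug report falls through to `StringType` instead of being accepted by the lenient NTZ parse, which is the behavior this PR intends.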
@Hisoka-X Could you backport these changes to
… do not match specified timestampFormat

### What changes were proposed in this pull request?

This is a backport PR of #43243. It fixes the bug of schema inference when timestamps do not match the specified timestampFormat. Please check #43243 for details.

### Why are the changes needed?

Fix schema inference bug on 3.4.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

add new test.

### Was this patch authored or co-authored using generative AI tooling?

Closes #43343 from Hisoka-X/backport-SPARK-45433-inference-schema.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

This is a refinement of #43243. This PR enforces one thing: we only infer the TIMESTAMP NTZ type using the NTZ parser, and only infer the LTZ type using the LTZ parser. This consistency is important to avoid nondeterministic behaviors.

### Why are the changes needed?

Avoid non-deterministic behaviors. After #43243, we can still have inconsistency if the LEGACY mode is enabled.

### Does this PR introduce _any_ user-facing change?

Yes, for the legacy parser. Now it's more likely to infer the string type instead of inferring a timestamp type "by luck".

### How was this patch tested?

existing tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #44789

Closes #44800 from cloud-fan/infer.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This is a refinement of #43243. This PR enforces one thing: we only infer the TIMESTAMP NTZ type using the NTZ parser, and only infer the LTZ type using the LTZ parser. This consistency is important to avoid nondeterministic behaviors.

Avoid non-deterministic behaviors. After #43243, we can still have inconsistency if the LEGACY mode is enabled.

Yes, for the legacy parser. Now it's more likely to infer the string type instead of inferring a timestamp type "by luck".

existing tests

no

Closes #44789

Closes #44800 from cloud-fan/infer.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e4e4076)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
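The rule this refinement enforces (each timestamp type is inferred only by its own parser, never crossed) can be illustrated with a small hypothetical sketch. `ConsistentInfer` is not Spark code; `java.time`'s local and zoned ISO parsers stand in for Spark's NTZ and LTZ formatters.

```scala
import java.time.{LocalDateTime, ZonedDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object ConsistentInfer {
  sealed trait Inferred
  case object Ntz extends Inferred
  case object Ltz extends Inferred
  case object Str extends Inferred

  // A successful NTZ (local, no zone) parse yields NTZ; a successful
  // LTZ (zoned) parse yields LTZ; neither parser infers the other's type.
  def infer(field: String): Inferred = {
    if (Try(LocalDateTime.parse(field, DateTimeFormatter.ISO_LOCAL_DATE_TIME)).isSuccess) Ntz
    else if (Try(ZonedDateTime.parse(field, DateTimeFormatter.ISO_ZONED_DATE_TIME)).isSuccess) Ltz
    else Str
  }
}
```

A value without a zone offset is classified as NTZ, a value with one as LTZ, and anything unparsable as string, so the inferred type never depends on which parser happened to run first.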