[SPARK-31150][SQL] Parsing seconds fraction with variable length for timestamp#27906
[SPARK-31150][SQL] Parsing seconds fraction with variable length for timestamp#27906yaooqinn wants to merge 19 commits intoapache:masterfrom yaooqinn:SPARK-31150
Conversation
|
Test build #119767 has finished for PR 27906 at commit
|
|
Test build #119768 has finished for PR 27906 at commit
|
|
Test build #119771 has finished for PR 27906 at commit
|
|
Test build #119784 has finished for PR 27906 at commit
|
|
retest this please |
|
Test build #119790 has finished for PR 27906 at commit
|
|
cc @cloud-fan @maropu @MaxGekk @xuanyuanking, thanks |
| options.locale, | ||
| legacyFormat = FAST_DATE_FORMAT) | ||
| legacyFormat = FAST_DATE_FORMAT, | ||
| varLenEnabled = options.timestampFormat.contains('S')) |
There was a problem hiding this comment.
it's only for parsing so we should always set it to true?
There was a problem hiding this comment.
I have updated this, and mv the cache optimization to DateTimeFormatterHelper
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeFormatterHelper.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala
Outdated
Show resolved
Hide resolved
|
Test build #119854 has finished for PR 27906 at commit
|
|
Test build #119855 has finished for PR 27906 at commit
|
|
Test build #119860 has finished for PR 27906 at commit
|
|
Test build #119856 has finished for PR 27906 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
Outdated
Show resolved
Hide resolved
| var rest = pattenPart | ||
| while (rest.nonEmpty) { | ||
| rest match { | ||
| case extractor(prefix, sss, suffix) => |
There was a problem hiding this comment.
nit: sss -> secondFraction
| builder.appendFraction(ChronoField.NANO_OF_SECOND, 0, sss.length, false) | ||
| } | ||
| rest = suffix | ||
| case _ => // never reach |
There was a problem hiding this comment.
it's reachable. The outer-most is pattern.split("'"), so it's possible that a section of pattern string doesn't have S.
There was a problem hiding this comment.
the extractor here can match anything, so this is unreachable. If a pattern split contains no S, it goes the prefix
There was a problem hiding this comment.
then let's throw IllegalStateException as it's un-reachable
| /** | ||
| * Building a formatter for parsing seconds fraction with variable length | ||
| */ | ||
| def appendPattern( |
There was a problem hiding this comment.
the name is bad. How about createBuilderWithVarLengthSecondFraction?
|
Test build #119877 has finished for PR 27906 at commit
|
|
Test build #119872 has finished for PR 27906 at commit
|
|
Test build #119880 has finished for PR 27906 at commit
|
|
Test build #119882 has finished for PR 27906 at commit
|
|
Hi, @yaooqinn and @cloud-fan . |
|
Test build #119886 has finished for PR 27906 at commit
|
|
Test build #119888 has finished for PR 27906 at commit
|
|
Hi @dongjoon-hyun, thanks for pointing that out. I'd say that this is a regression of 3.0.0. |
|
Yea the JIRA should be reported as bug. The same query works in 2.4 but returns null in 3.0. |
| "yyyy-MM-dd'T'HH:mm:ss.SSSSSS", | ||
| DateTimeUtils.getZoneId(zoneId)) | ||
| DateTimeUtils.getZoneId(zoneId), | ||
| needVarLengthSecondFraction = true) |
There was a problem hiding this comment.
is this really needed? The test input is fixed: 2018-12-02T10:11:12.001234
There was a problem hiding this comment.
We are not going to use fixed-length formatter for parsing right? So I guess test that one is not needed anymore. the var-length formatter should work fine with fixed inputs.
| "yyyy-MM-dd'T'HH:mm:ss.SSSSSS", | ||
| DateTimeUtils.getZoneId(zoneId)) | ||
| DateTimeUtils.getZoneId(zoneId), | ||
| needVarLengthSecondFraction = true) |
There was a problem hiding this comment.
really? this test is for formatting, and the actual formatting doesn't do padding, so this flag should be false.
There was a problem hiding this comment.
oh. I mean this should be false too.
| val parsed = formatter.parse(timestamp) | ||
| val timestamp = TimestampFormatter(pattern, zoneId).format(micros) | ||
| val parsed = TimestampFormatter( | ||
| pattern, zoneId, needVarLengthSecondFraction = true).parse(timestamp) |
There was a problem hiding this comment.
this change is needed because it is how the actual roundtrip goes no matter what the input is
|
|
||
|
|
||
| -- !query | ||
| select to_timestamp('2019-10-06 10:11:12.123', 'yyyy-MM-dd HH:mm:ss[.SSSSSS]') |
There was a problem hiding this comment.
can we test the optional case? to_timestamp('2019-10-06 10:11:12', 'yyyy-MM-dd HH:mm:ss[.SSSSSS]')
cloud-fan
left a comment
There was a problem hiding this comment.
LGTM except a few comments about tests
|
|
||
| test("SPARK-30958: parse timestamp with negative year") { | ||
| val formatter1 = TimestampFormatter("yyyy-MM-dd HH:mm:ss", ZoneOffset.UTC) | ||
| val formatter1 = TimestampFormatter("yyyy-MM-dd HH:mm:ss", ZoneOffset.UTC, true) |
There was a problem hiding this comment.
is the point to test the actual parser? with needVarLengthSecondFraction = true.
There was a problem hiding this comment.
yes, the actual parser should work with patterns w/o SSS too. I don't add new tests in TimestampFormatterSuite but modify the existing ones because we have enough e2e tests in date_time.sql
xuanyuanking
left a comment
There was a problem hiding this comment.
+1, only few comments for code structure, please add more comments for the util functions.
| /** | ||
| * Building a formatter for parsing seconds fraction with variable length | ||
| */ | ||
| def createBuilderWithVarLengthSecondFraction( |
There was a problem hiding this comment.
How about move this method into DateTimeUtils, together with convertIncompatiblePattern?
I think we need to collect all these kind of util functions analyzing pattern string together.
There was a problem hiding this comment.
This is very specific to datetime parsing. How about we move convertIncompatiblePattern to object DateTimeFormatterHelper?
There was a problem hiding this comment.
Then, I will not do this refactoring in this PR, not related
| val newPattern = DateTimeUtils.convertIncompatiblePattern(pattern) | ||
| val key = (newPattern, locale) | ||
| val useVarLen = needVarLengthSecondFraction && newPattern.split("'").zipWithIndex | ||
| .exists { case (p, idx) => idx % 2 == 0 && p.contains('S') } |
There was a problem hiding this comment.
Also for this one, although it's only one line, maybe it still worth having a function in DateTimeUtils and more comments.
There was a problem hiding this comment.
S in literal is very rare, maybe newPattern.contains('S') is good enough as it's only for performance.
|
Test build #119909 has finished for PR 27906 at commit
|
|
Test build #119914 has finished for PR 27906 at commit
|
|
Test build #119921 has finished for PR 27906 at commit
|
|
thanks, merging to master/3.0! |
…timestamp
### What changes were proposed in this pull request?
This PR is to support parsing timestamp values with variable length second fraction parts.
e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse timestamp with 0~6 digit-length second fraction but fail >=7
```sql
select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') from values
('2019-10-06 10:11:12.'),
('2019-10-06 10:11:12.0'),
('2019-10-06 10:11:12.1'),
('2019-10-06 10:11:12.12'),
('2019-10-06 10:11:12.123UTC'),
('2019-10-06 10:11:12.1234'),
('2019-10-06 10:11:12.12345CST'),
('2019-10-06 10:11:12.123456PST') t(v)
2019-10-06 03:11:12.123
2019-10-06 08:11:12.12345
2019-10-06 10:11:12
2019-10-06 10:11:12
2019-10-06 10:11:12.1
2019-10-06 10:11:12.12
2019-10-06 10:11:12.1234
2019-10-06 10:11:12.123456
select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]')
NULL
```
Since 3.0, we use java 8 time API to parse and format timestamp values. when we create the `DateTimeFormatter`, we use `appendPattern` to create the build first, where the 'S..S' part will be parsed to a fixed-length(= `'S..S'.length`). This fits the formatting part but too strict for the parsing part because the trailing zeros are very likely to be truncated.
### Why are the changes needed?
improve timestamp parsing and more compatible with 2.4.x
### Does this PR introduce any user-facing change?
no, the related changes are newly added
### How was this patch tested?
add uts
Closes #27906 from yaooqinn/SPARK-31150.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0946a95)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…timestamp
### What changes were proposed in this pull request?
This PR is to support parsing timestamp values with variable length second fraction parts.
e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse timestamp with 0~6 digit-length second fraction but fail >=7
```sql
select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') from values
('2019-10-06 10:11:12.'),
('2019-10-06 10:11:12.0'),
('2019-10-06 10:11:12.1'),
('2019-10-06 10:11:12.12'),
('2019-10-06 10:11:12.123UTC'),
('2019-10-06 10:11:12.1234'),
('2019-10-06 10:11:12.12345CST'),
('2019-10-06 10:11:12.123456PST') t(v)
2019-10-06 03:11:12.123
2019-10-06 08:11:12.12345
2019-10-06 10:11:12
2019-10-06 10:11:12
2019-10-06 10:11:12.1
2019-10-06 10:11:12.12
2019-10-06 10:11:12.1234
2019-10-06 10:11:12.123456
select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]')
NULL
```
Since 3.0, we use java 8 time API to parse and format timestamp values. when we create the `DateTimeFormatter`, we use `appendPattern` to create the build first, where the 'S..S' part will be parsed to a fixed-length(= `'S..S'.length`). This fits the formatting part but too strict for the parsing part because the trailing zeros are very likely to be truncated.
### Why are the changes needed?
improve timestamp parsing and more compatible with 2.4.x
### Does this PR introduce any user-facing change?
no, the related changes are newly added
### How was this patch tested?
add uts
Closes apache#27906 from yaooqinn/SPARK-31150.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This PR is to support parsing timestamp values with variable length second fraction parts.
e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse timestamp with 0~6 digit-length second fraction but fail >=7
Since 3.0, we use java 8 time API to parse and format timestamp values. when we create the
DateTimeFormatter, we useappendPatternto create the build first, where the 'S..S' part will be parsed to a fixed-length(='S..S'.length). This fits the formatting part but too strict for the parsing part because the trailing zeros are very likely to be truncated.Why are the changes needed?
improve timestamp parsing and more compatible with 2.4.x
Does this PR introduce any user-facing change?
no, the related changes are newly added
How was this patch tested?
add uts