[SPARK-31150][SQL] Parsing seconds fraction with variable length for timestamp #27906

Closed
yaooqinn wants to merge 19 commits into apache:master from yaooqinn:SPARK-31150

Conversation

@yaooqinn
Member

@yaooqinn yaooqinn commented Mar 13, 2020

What changes were proposed in this pull request?

This PR is to support parsing timestamp values with variable length second fraction parts.

e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse timestamps whose second fraction has 0 to 6 digits, but fails when it has 7 or more:

```sql
select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') from values
 ('2019-10-06 10:11:12.'),
 ('2019-10-06 10:11:12.0'),
 ('2019-10-06 10:11:12.1'),
 ('2019-10-06 10:11:12.12'),
 ('2019-10-06 10:11:12.123UTC'),
 ('2019-10-06 10:11:12.1234'),
 ('2019-10-06 10:11:12.12345CST'),
 ('2019-10-06 10:11:12.123456PST') t(v)
2019-10-06 03:11:12.123
2019-10-06 08:11:12.12345
2019-10-06 10:11:12
2019-10-06 10:11:12
2019-10-06 10:11:12.1
2019-10-06 10:11:12.12
2019-10-06 10:11:12.1234
2019-10-06 10:11:12.123456

select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]')
NULL
```

Since 3.0, we use the Java 8 time API to parse and format timestamp values. When we create the `DateTimeFormatter`, we use `appendPattern` to create the builder first, where the 'S..S' part is parsed with a fixed length (= `'S..S'.length`). This fits the formatting side, but it is too strict for parsing, because the trailing zeros of the second fraction are very likely to be truncated in the input.
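For illustration only, a minimal standalone sketch (plain Java 8 time API, not the PR's code; the variable names are made up) of why the fixed-length 'S..S' pattern rejects shorter fractions while appendFraction with a zero minimum width accepts them:

```scala
import java.time.format.DateTimeFormatterBuilder
import java.time.temporal.ChronoField

// 'SSSSSS' via appendPattern is a fixed-width fraction: parsing requires exactly six digits.
val fixed = new DateTimeFormatterBuilder()
  .appendPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
  .toFormatter

// appendFraction with minWidth = 0 and maxWidth = 6 accepts zero to six fractional digits.
val varLen = new DateTimeFormatterBuilder()
  .appendPattern("yyyy-MM-dd HH:mm:ss.")
  .appendFraction(ChronoField.NANO_OF_SECOND, 0, 6, false)
  .toFormatter

fixed.parse("2019-10-06 10:11:12.123456")   // ok: exactly six digits
// fixed.parse("2019-10-06 10:11:12.1")     // DateTimeParseException: too few fraction digits
varLen.parse("2019-10-06 10:11:12.1")       // ok: one digit
varLen.parse("2019-10-06 10:11:12.123456")  // ok: six digits
```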

Why are the changes needed?

Improve timestamp parsing and make it more compatible with 2.4.x.

Does this PR introduce any user-facing change?

No, the related changes are newly added.

How was this patch tested?

Added unit tests.

@SparkQA

SparkQA commented Mar 13, 2020

Test build #119767 has finished for PR 27906 at commit 3e94191.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 13, 2020

Test build #119768 has finished for PR 27906 at commit 8ff8875.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 13, 2020

Test build #119771 has finished for PR 27906 at commit c58764d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 14, 2020

Test build #119784 has finished for PR 27906 at commit b135dd5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

retest this please

@SparkQA

SparkQA commented Mar 14, 2020

Test build #119790 has finished for PR 27906 at commit b135dd5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

cc @cloud-fan @maropu @MaxGekk @xuanyuanking, thanks

  options.locale,
- legacyFormat = FAST_DATE_FORMAT)
+ legacyFormat = FAST_DATE_FORMAT,
+ varLenEnabled = options.timestampFormat.contains('S'))
Contributor

it's only for parsing so we should always set it to true?

Member Author

I have updated this, and moved the cache optimization to DateTimeFormatterHelper.

@SparkQA

SparkQA commented Mar 16, 2020

Test build #119854 has finished for PR 27906 at commit 2686c2e.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2020

Test build #119855 has finished for PR 27906 at commit a0e449a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2020

Test build #119860 has finished for PR 27906 at commit 81b69fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2020

Test build #119856 has finished for PR 27906 at commit 18338df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

var rest = pattenPart
while (rest.nonEmpty) {
  rest match {
    case extractor(prefix, sss, suffix) =>
Contributor

nit: sss -> secondFraction

        builder.appendFraction(ChronoField.NANO_OF_SECOND, 0, sss.length, false)
      }
      rest = suffix
    case _ => // never reach
Contributor

It's reachable. The outermost loop is over pattern.split("'"), so it's possible that a section of the pattern string doesn't contain S.

Member Author

The extractor here can match anything, so this is unreachable. If a pattern section contains no S, it all goes into the prefix group.

Contributor

Then let's throw IllegalStateException, as it's unreachable.
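To make the "matches anything" argument concrete, a small standalone sketch; the regex shape and names below are assumptions for illustration and may differ from the exact ones in the PR:

```scala
// A three-group regex: text before the first 'S' run, the 'S' run itself (possibly empty),
// and everything after it. Every group may be empty, so any input matches.
val extractor = "^([^S]*)(S*)(.*)$".r

"HH:mm:ss.SSSSSS[zzz]" match {
  case extractor(prefix, secondFraction, suffix) =>
    // prefix = "HH:mm:ss.", secondFraction = "SSSSSS", suffix = "[zzz]"
    println(s"$prefix | $secondFraction | $suffix")
}

// A section with no 'S' still matches: everything lands in the prefix group.
"yyyy-MM-dd" match {
  case extractor(prefix, secondFraction, suffix) =>
    assert(prefix == "yyyy-MM-dd" && secondFraction.isEmpty && suffix.isEmpty)
}
```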

/**
 * Building a formatter for parsing seconds fraction with variable length
 */
def appendPattern(
Contributor

the name is bad. How about createBuilderWithVarLengthSecondFraction?

@SparkQA

SparkQA commented Mar 16, 2020

Test build #119877 has finished for PR 27906 at commit 64fa825.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2020

Test build #119872 has finished for PR 27906 at commit 234f7c8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2020

Test build #119880 has finished for PR 27906 at commit d4aea29.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2020

Test build #119882 has finished for PR 27906 at commit 18de1b7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Hi, @yaooqinn and @cloud-fan .
This was filed as an improvement JIRA. Is this a regression in 3.0.0 or an improvement for 3.1.0?

@SparkQA

SparkQA commented Mar 16, 2020

Test build #119886 has finished for PR 27906 at commit e550f16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2020

Test build #119888 has finished for PR 27906 at commit bb1a580.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

Hi @dongjoon-hyun, thanks for pointing that out. I'd say that this is a regression of 3.0.0.

@cloud-fan
Contributor

Yea, the JIRA should be reported as a bug. The same query works in 2.4 but returns null in 3.0.

"yyyy-MM-dd'T'HH:mm:ss.SSSSSS",
- DateTimeUtils.getZoneId(zoneId))
+ DateTimeUtils.getZoneId(zoneId),
+ needVarLengthSecondFraction = true)
Contributor

is this really needed? The test input is fixed: 2018-12-02T10:11:12.001234

Member Author

We are not going to use the fixed-length formatter for parsing, right? So I guess testing that one is not needed anymore; the var-length formatter should work fine with fixed inputs.

"yyyy-MM-dd'T'HH:mm:ss.SSSSSS",
- DateTimeUtils.getZoneId(zoneId))
+ DateTimeUtils.getZoneId(zoneId),
+ needVarLengthSecondFraction = true)
Contributor

ditto

Member Author

this one is not needed

Contributor

really? this test is for formatting, and the actual formatting doesn't do padding, so this flag should be false.

Member Author

oh. I mean this should be false too.

- val parsed = formatter.parse(timestamp)
+ val timestamp = TimestampFormatter(pattern, zoneId).format(micros)
+ val parsed = TimestampFormatter(
+   pattern, zoneId, needVarLengthSecondFraction = true).parse(timestamp)
Contributor

ditto

Member Author

This change is needed because it is how the actual roundtrip goes, no matter what the input is.

Contributor

ok makes sense.



-- !query
select to_timestamp('2019-10-06 10:11:12.123', 'yyyy-MM-dd HH:mm:ss[.SSSSSS]')
Contributor

can we test the optional case? to_timestamp('2019-10-06 10:11:12', 'yyyy-MM-dd HH:mm:ss[.SSSSSS]')

Contributor

@cloud-fan cloud-fan left a comment

LGTM except a few comments about tests


  test("SPARK-30958: parse timestamp with negative year") {
-   val formatter1 = TimestampFormatter("yyyy-MM-dd HH:mm:ss", ZoneOffset.UTC)
+   val formatter1 = TimestampFormatter("yyyy-MM-dd HH:mm:ss", ZoneOffset.UTC, true)
Contributor

Is the point to test the actual parser, with needVarLengthSecondFraction = true?

Member Author

@yaooqinn yaooqinn Mar 17, 2020

Yes, the actual parser should work with patterns without SSS too. I don't add new tests in TimestampFormatterSuite but modify the existing ones, because we have enough e2e tests in date_time.sql.

Member

@xuanyuanking xuanyuanking left a comment

+1, only a few comments on code structure; please add more comments for the util functions.

/**
 * Building a formatter for parsing seconds fraction with variable length
 */
def createBuilderWithVarLengthSecondFraction(
Member

How about moving this method into DateTimeUtils, together with convertIncompatiblePattern?
I think we should collect all these kinds of util functions that analyze pattern strings in one place.

Contributor

This is very specific to datetime parsing. How about we move convertIncompatiblePattern to object DateTimeFormatterHelper?

Member Author

Then I will not do this refactoring in this PR, as it's not related.

val newPattern = DateTimeUtils.convertIncompatiblePattern(pattern)
val key = (newPattern, locale)
val useVarLen = needVarLengthSecondFraction && newPattern.split("'").zipWithIndex
  .exists { case (p, idx) => idx % 2 == 0 && p.contains('S') }
Member

Also for this one, although it's only one line, maybe it's still worth having a function in DateTimeUtils and more comments.

Contributor

S in a literal is very rare; maybe newPattern.contains('S') is good enough, as this check is only for performance.

Member Author

Agreed.
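For illustration, a small sketch (the function names are assumptions, not the PR's code) of the two checks discussed above; splitting on the single quote puts quoted literal text at odd indices, so the even-index check ignores an 'S' that only appears inside a literal:

```scala
// True if the pattern contains an 'S' outside of quoted literal text.
// Splitting on ' puts literal sections at odd indices, e.g.
// "HH:mm:ss.SSS" -> Array("HH:mm:ss.SSS"); "HH'S'mm" -> Array("HH", "S", "mm").
def hasUnquotedSecondFraction(pattern: String): Boolean =
  pattern.split("'").zipWithIndex.exists { case (part, idx) => idx % 2 == 0 && part.contains('S') }

// The simpler check suggested in the review: good enough in practice, because an
// 'S' inside a quoted literal is very rare and this is only a performance shortcut.
def maybeHasSecondFraction(pattern: String): Boolean = pattern.contains('S')

assert(hasUnquotedSecondFraction("yyyy-MM-dd HH:mm:ss.SSSSSS"))
assert(!hasUnquotedSecondFraction("HH'S'mm"))  // 'S' only inside a literal
assert(maybeHasSecondFraction("HH'S'mm"))      // the shortcut is pessimistic here
```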

@SparkQA

SparkQA commented Mar 17, 2020

Test build #119909 has finished for PR 27906 at commit d1ab415.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 17, 2020

Test build #119914 has finished for PR 27906 at commit 27095da.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 17, 2020

Test build #119921 has finished for PR 27906 at commit ec44083.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 0946a95 Mar 17, 2020
cloud-fan pushed a commit that referenced this pull request Mar 17, 2020
…timestamp

Closes #27906 from yaooqinn/SPARK-31150.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0946a95)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…timestamp

Closes apache#27906 from yaooqinn/SPARK-31150.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>