[SPARK-32424][SQL] Fix silent data change for timestamp parsing if overflow happens

### What changes were proposed in this pull request?

The `Seconds.toMicros` API, used to convert epoch seconds to microseconds, saturates silently on overflow, per its Javadoc:

```java
/**
 * Equivalent to
 * {@link #convert(long, TimeUnit) MICROSECONDS.convert(duration, this)}.
 * @param duration the duration
 * @return the converted duration,
 * or {@code Long.MIN_VALUE} if conversion would negatively
 * overflow, or {@code Long.MAX_VALUE} if it would positively overflow.
 */
```
This PR changes it to `Math.multiplyExact(epochSeconds, MICROS_PER_SECOND)`, which raises an `ArithmeticException` on overflow instead.
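The behavioral difference between the two conversions can be sketched in plain Java (a minimal standalone sketch; the class and variable names are illustrative, not from the patch):

```java
import java.util.concurrent.TimeUnit;

public class OverflowDemo {
    private static final long MICROS_PER_SECOND = 1_000_000L;

    public static void main(String[] args) {
        // An epoch-second value large enough that seconds -> micros overflows a long.
        long epochSeconds = Long.MAX_VALUE / 1_000L;

        // TimeUnit saturates silently, returning Long.MAX_VALUE instead of failing.
        long saturated = TimeUnit.SECONDS.toMicros(epochSeconds);
        System.out.println(saturated == Long.MAX_VALUE); // true

        // Math.multiplyExact makes the overflow loud.
        try {
            Math.multiplyExact(epochSeconds, MICROS_PER_SECOND);
        } catch (ArithmeticException e) {
            System.out.println(e.getMessage()); // long overflow
        }
    }
}
```

The saturated value is what silently became a wrong-but-plausible timestamp before this patch.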

### Why are the changes needed?

Fix a silent data change between Spark 3.x and 2.x. For example:
```
 ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200722   bin/spark-sql -S -e "select to_timestamp('300000', 'y');"
+294247-01-10 12:00:54.775807
```
```
 kentyaohulk  ~/Downloads/spark/spark-2.4.5-bin-hadoop2.7  bin/spark-sql -S  -e "select to_timestamp('300000', 'y');"
284550-10-19 15:58:1010.448384
```
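The `+294247` year in the 3.x output is simply the largest year whose timestamp still fits in a signed 64-bit microsecond count; a quick back-of-the-envelope check (assuming an average Gregorian year of 365.2425 days, i.e. 31,556,952 seconds):

```java
public class MaxYearCheck {
    public static void main(String[] args) {
        long microsPerSecond = 1_000_000L;
        // Average Gregorian year: 365.2425 days = 31,556,952 seconds.
        long secondsPerYear = 31_556_952L;
        // How many whole years fit in Long.MAX_VALUE microseconds?
        long maxYearsFromEpoch = Long.MAX_VALUE / (secondsPerYear * microsPerSecond);
        // About 292,277 years past the 1970 epoch, i.e. year 294247.
        System.out.println(1970 + maxYearsFromEpoch);
    }
}
```

This matches the `[0, 294247]` upper bound asserted in the new tests below.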

### Does this PR introduce _any_ user-facing change?

Yes, this PR raises an `ArithmeticException` instead of returning a wrong answer on overflow.

### How was this patch tested?

Added a unit test.

Closes #29220 from yaooqinn/SPARK-32424.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
yaooqinn authored and cloud-fan committed Jul 27, 2020
1 parent 548b7db commit d315ebf
Showing 14 changed files with 323 additions and 30 deletions.
2 changes: 1 addition & 1 deletion docs/sql-ref-datetime-pattern.md
@@ -70,7 +70,7 @@ The count of pattern letters determines the format.
For formatting, the fraction length would be padded to the number of contiguous 'S' with zeros.
Spark supports datetime of micro-of-second precision, which has up to 6 significant digits, but can parse nano-of-second with exceeded part truncated.

- Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years. Otherwise, the sign is output if the pad width is exceeded when 'G' is not present. 11 or more letters will fail.
- Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years. Otherwise, the sign is output if the pad width is exceeded when 'G' is not present. 7 or more letters will fail.

- Month: It follows the rule of Number/Text. The text form depends on the letter - 'M' denotes the 'standard' form, and 'L' is for the 'stand-alone' form. These two forms differ only in certain languages. For example, in Russian, 'Июль' is the stand-alone form of July, and 'Июля' is the standard form. Here are examples for all supported pattern letters:
- `'M'` or `'L'`: Month number in a year starting from 1. There is no difference between 'M' and 'L'. Months from 1 to 9 are printed without padding.
@@ -296,7 +296,9 @@ private object DateTimeFormatterHelper {
// unchecked `ArrayIndexOutOfBoundsException` by the `NumberPrinterParser` for formatting. It
// makes the call side difficult to handle exceptions and easily leads to silent data change
// because of the exceptions being suppressed.
Seq("y").map(_ * 11)
// SPARK-32424: The max year that we can actually handle is 6 digits; anything longer
// will overflow
Seq("y").map(_ * 7)
}.toSet

/**
@@ -24,7 +24,6 @@ import java.time.format.{DateTimeFormatter, DateTimeParseException}
import java.time.temporal.ChronoField.MICRO_OF_SECOND
import java.time.temporal.TemporalQueries
import java.util.{Calendar, GregorianCalendar, Locale, TimeZone}
import java.util.concurrent.TimeUnit.SECONDS

import org.apache.commons.lang3.time.FastDateFormat

@@ -83,7 +82,7 @@ class Iso8601TimestampFormatter(
val epochSeconds = zonedDateTime.toEpochSecond
val microsOfSecond = zonedDateTime.get(MICRO_OF_SECOND)

Math.addExact(SECONDS.toMicros(epochSeconds), microsOfSecond)
Math.addExact(Math.multiplyExact(epochSeconds, MICROS_PER_SECOND), microsOfSecond)
} catch checkParsedDiff(s, legacyFormatter.parse)
}
}
@@ -437,4 +437,14 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite {
assert(formatter.format(date(1970, 4, 10)) == "100")
}
}

test("SPARK-32424: avoid silent data change when timestamp overflows") {
val formatter = TimestampFormatter("y", UTC, isParsing = true)
assert(formatter.parse("294247") === date(294247))
assert(formatter.parse("-290307") === date(-290307))
val e1 = intercept[ArithmeticException](formatter.parse("294248"))
assert(e1.getMessage === "long overflow")
val e2 = intercept[ArithmeticException](formatter.parse("-290308"))
assert(e2.getMessage === "long overflow")
}
}
@@ -3,8 +3,8 @@
-- separating this from datetime-formatting.sql, because the text form
-- for patterns with 5 letters in SimpleDateFormat varies from different JDKs
select date_format('2018-11-17 13:33:33.333', 'GGGGG');
-- pattern letter count can not be greater than 10
select date_format('2018-11-17 13:33:33.333', 'yyyyyyyyyyy');
-- pattern letter count can not be greater than 6
select date_format('2018-11-17 13:33:33.333', 'yyyyyyy');
-- q/L in JDK 8 will fail when the count is more than 2
select date_format('2018-11-17 13:33:33.333', 'qqqqq');
select date_format('2018-11-17 13:33:33.333', 'QQQQQ');
@@ -11,7 +11,7 @@ create temporary view v as select col from values

select col, date_format(col, 'G GG GGG GGGG') from v;

select col, date_format(col, 'y yy yyy yyyy yyyyy yyyyyy yyyyyyy yyyyyyyy yyyyyyyyy yyyyyyyyyy') from v;
select col, date_format(col, 'y yy yyy yyyy yyyyy yyyyyy') from v;

select col, date_format(col, 'q qq') from v;

@@ -1,5 +1,14 @@
--- TESTS FOR DATETIME PARSING FUNCTIONS WITH INVALID VALUES ---

-- parsing invalid value with pattern 'y'
select to_timestamp('294248', 'y'); -- out of year value range [0, 294247]
select to_timestamp('1', 'yy'); -- the number of digits must be 2 for 'yy'.
select to_timestamp('-12', 'yy'); -- out of year value range [0, 99] for reduced two digit form
select to_timestamp('123', 'yy'); -- the number of digits must be 2 for 'yy'.
select to_timestamp('1', 'yyy'); -- the number of digits must be in [3, 6] for 'yyy'

select to_timestamp('1234567', 'yyyyyyy'); -- the length of 'y' pattern must be less than 7

-- parsing invalid values with pattern 'D'
select to_timestamp('366', 'D');
select to_timestamp('9', 'DD');
28 changes: 28 additions & 0 deletions sql/core/src/test/resources/sql-tests/inputs/datetime-parsing.sql
@@ -1,5 +1,33 @@
--- TESTS FOR DATETIME PARSING FUNCTIONS ---

-- parsing with pattern 'y'.
-- the range of valid year is [-290307, 294247],
-- but particularly, some Thrift clients use java.sql.Timestamp to parse timestamps, which allows
-- only positive year values less than or equal to 9999. So the cases below only use [1, 9999] to pass
-- ThriftServerQueryTestSuite
select to_timestamp('1', 'y');
select to_timestamp('009999', 'y');

-- reduced two digit form is used, the range of valid year is 20-[01, 99]
select to_timestamp('00', 'yy');
select to_timestamp('99', 'yy');

-- the range of valid year is [-290307, 294247], the number of digits must be in [3, 6] for 'yyy'
select to_timestamp('001', 'yyy');
select to_timestamp('009999', 'yyy');

-- the range of valid year is [-9999, 9999], the number of digits must be 4 for 'yyyy'.
select to_timestamp('0001', 'yyyy');
select to_timestamp('9999', 'yyyy');

-- the range of valid year is [-99999, 99999], the number of digits must be 5 for 'yyyyy'.
select to_timestamp('00001', 'yyyyy');
select to_timestamp('09999', 'yyyyy');

-- the range of valid year is [-290307, 294247], the number of digits must be 6 for 'yyyyyy'.
select to_timestamp('000001', 'yyyyyy');
select to_timestamp('009999', 'yyyyyy');

-- parsing with pattern 'D'
select to_timestamp('9', 'D');
select to_timestamp('300', 'D');
@@ -12,12 +12,12 @@ You may get a different result due to the upgrading of Spark 3.0: Fail to recogn


-- !query
select date_format('2018-11-17 13:33:33.333', 'yyyyyyyyyyy')
select date_format('2018-11-17 13:33:33.333', 'yyyyyyy')
-- !query schema
struct<>
-- !query output
org.apache.spark.SparkUpgradeException
You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html


-- !query
@@ -32,17 +32,17 @@ struct<col:timestamp,date_format(col, G GG GGG GGGG):string>


-- !query
select col, date_format(col, 'y yy yyy yyyy yyyyy yyyyyy yyyyyyy yyyyyyyy yyyyyyyyy yyyyyyyyyy') from v
select col, date_format(col, 'y yy yyy yyyy yyyyy yyyyyy') from v
-- !query schema
struct<col:timestamp,date_format(col, y yy yyy yyyy yyyyy yyyyyy yyyyyyy yyyyyyyy yyyyyyyyy yyyyyyyyyy):string>
struct<col:timestamp,date_format(col, y yy yyy yyyy yyyyy yyyyyy):string>
-- !query output
1582-05-31 19:40:35.123 1582 82 1582 1582 01582 001582 0001582 00001582 000001582 0000001582
1969-12-31 15:00:00 1969 69 1969 1969 01969 001969 0001969 00001969 000001969 0000001969
1970-12-31 04:59:59.999 1970 70 1970 1970 01970 001970 0001970 00001970 000001970 0000001970
1996-03-31 07:03:33.123 1996 96 1996 1996 01996 001996 0001996 00001996 000001996 0000001996
2018-11-17 05:33:33.123 2018 18 2018 2018 02018 002018 0002018 00002018 000002018 0000002018
2019-12-31 09:33:33.123 2019 19 2019 2019 02019 002019 0002019 00002019 000002019 0000002019
2100-01-01 01:33:33.123 2100 00 2100 2100 02100 002100 0002100 00002100 000002100 0000002100
1582-05-31 19:40:35.123 1582 82 1582 1582 01582 001582
1969-12-31 15:00:00 1969 69 1969 1969 01969 001969
1970-12-31 04:59:59.999 1970 70 1970 1970 01970 001970
1996-03-31 07:03:33.123 1996 96 1996 1996 01996 001996
2018-11-17 05:33:33.123 2018 18 2018 2018 02018 002018
2019-12-31 09:33:33.123 2019 19 2019 2019 02019 002019
2100-01-01 01:33:33.123 2100 00 2100 2100 02100 002100


-- !query
@@ -32,17 +32,17 @@ struct<col:timestamp,date_format(col, G GG GGG GGGG):string>


-- !query
select col, date_format(col, 'y yy yyy yyyy yyyyy yyyyyy yyyyyyy yyyyyyyy yyyyyyyyy yyyyyyyyyy') from v
select col, date_format(col, 'y yy yyy yyyy yyyyy yyyyyy') from v
-- !query schema
struct<col:timestamp,date_format(col, y yy yyy yyyy yyyyy yyyyyy yyyyyyy yyyyyyyy yyyyyyyyy yyyyyyyyyy):string>
struct<col:timestamp,date_format(col, y yy yyy yyyy yyyyy yyyyyy):string>
-- !query output
1582-05-31 19:40:35.123 1582 82 1582 1582 01582 001582 0001582 00001582 000001582 0000001582
1969-12-31 15:00:00 1969 69 1969 1969 01969 001969 0001969 00001969 000001969 0000001969
1970-12-31 04:59:59.999 1970 70 1970 1970 01970 001970 0001970 00001970 000001970 0000001970
1996-03-31 07:03:33.123 1996 96 1996 1996 01996 001996 0001996 00001996 000001996 0000001996
2018-11-17 05:33:33.123 2018 18 2018 2018 02018 002018 0002018 00002018 000002018 0000002018
2019-12-31 09:33:33.123 2019 19 2019 2019 02019 002019 0002019 00002019 000002019 0000002019
2100-01-01 01:33:33.123 2100 00 2100 2100 02100 002100 0002100 00002100 000002100 0000002100
1582-05-31 19:40:35.123 1582 82 1582 1582 01582 001582
1969-12-31 15:00:00 1969 69 1969 1969 01969 001969
1970-12-31 04:59:59.999 1970 70 1970 1970 01970 001970
1996-03-31 07:03:33.123 1996 96 1996 1996 01996 001996
2018-11-17 05:33:33.123 2018 18 2018 2018 02018 002018
2019-12-31 09:33:33.123 2019 19 2019 2019 02019 002019
2100-01-01 01:33:33.123 2100 00 2100 2100 02100 002100


-- !query
@@ -1,5 +1,58 @@
-- Automatically generated by SQLQueryTestSuite
-- Number of queries: 13
-- Number of queries: 19


-- !query
select to_timestamp('294248', 'y')
-- !query schema
struct<>
-- !query output
java.lang.ArithmeticException
long overflow


-- !query
select to_timestamp('1', 'yy')
-- !query schema
struct<>
-- !query output
org.apache.spark.SparkUpgradeException
You may get a different result due to the upgrading of Spark 3.0: Fail to parse '1' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.


-- !query
select to_timestamp('-12', 'yy')
-- !query schema
struct<to_timestamp(-12, yy):timestamp>
-- !query output
NULL


-- !query
select to_timestamp('123', 'yy')
-- !query schema
struct<>
-- !query output
org.apache.spark.SparkUpgradeException
You may get a different result due to the upgrading of Spark 3.0: Fail to parse '123' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.


-- !query
select to_timestamp('1', 'yyy')
-- !query schema
struct<>
-- !query output
org.apache.spark.SparkUpgradeException
You may get a different result due to the upgrading of Spark 3.0: Fail to parse '1' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.


-- !query
select to_timestamp('1234567', 'yyyyyyy')
-- !query schema
struct<>
-- !query output
org.apache.spark.SparkUpgradeException
You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html


-- !query
@@ -1,5 +1,101 @@
-- Automatically generated by SQLQueryTestSuite
-- Number of queries: 13
-- Number of queries: 25


-- !query
select to_timestamp('1', 'y')
-- !query schema
struct<to_timestamp(1, y):timestamp>
-- !query output
0001-01-01 00:00:00


-- !query
select to_timestamp('009999', 'y')
-- !query schema
struct<to_timestamp(009999, y):timestamp>
-- !query output
9999-01-01 00:00:00


-- !query
select to_timestamp('00', 'yy')
-- !query schema
struct<to_timestamp(00, yy):timestamp>
-- !query output
2000-01-01 00:00:00


-- !query
select to_timestamp('99', 'yy')
-- !query schema
struct<to_timestamp(99, yy):timestamp>
-- !query output
1999-01-01 00:00:00


-- !query
select to_timestamp('001', 'yyy')
-- !query schema
struct<to_timestamp(001, yyy):timestamp>
-- !query output
0001-01-01 00:00:00


-- !query
select to_timestamp('009999', 'yyy')
-- !query schema
struct<to_timestamp(009999, yyy):timestamp>
-- !query output
9999-01-01 00:00:00


-- !query
select to_timestamp('0001', 'yyyy')
-- !query schema
struct<to_timestamp(0001, yyyy):timestamp>
-- !query output
0001-01-01 00:00:00


-- !query
select to_timestamp('9999', 'yyyy')
-- !query schema
struct<to_timestamp(9999, yyyy):timestamp>
-- !query output
9999-01-01 00:00:00


-- !query
select to_timestamp('00001', 'yyyyy')
-- !query schema
struct<to_timestamp(00001, yyyyy):timestamp>
-- !query output
0001-01-01 00:00:00


-- !query
select to_timestamp('09999', 'yyyyy')
-- !query schema
struct<to_timestamp(09999, yyyyy):timestamp>
-- !query output
9999-01-01 00:00:00


-- !query
select to_timestamp('000001', 'yyyyyy')
-- !query schema
struct<to_timestamp(000001, yyyyyy):timestamp>
-- !query output
0001-01-01 00:00:00


-- !query
select to_timestamp('009999', 'yyyyyy')
-- !query schema
struct<to_timestamp(009999, yyyyyy):timestamp>
-- !query output
9999-01-01 00:00:00


-- !query
