[SPARK-29371][SQL] Fractional representation for interval string #26592

Closed
wants to merge 15 commits into from

yaooqinn
Member

@yaooqinn yaooqinn commented Nov 19, 2019

What changes were proposed in this pull request?

Add fractional representation for interval values

postgres=# select interval '1.41 years 2.51 months 2.21 weeks 15.24 days 3.31 hours 5.38 minutes 12.3456789 seconds';
               interval
---------------------------------------
 1 year 6 mons 45 days 27:38:35.145679
(1 row)

postgres=# select interval '1.42 years 2.51 months 2.21 weeks 15.24 days 3.31 hours 5.38 minutes 12.3456789 seconds';
               interval
---------------------------------------
 1 year 7 mons 45 days 27:38:35.145679

RULES

1. The table shows how each unit converts to CalendarInterval's months, days and microseconds.

| Unit | To | Rounding mode | Example | Notes |
|------|----|---------------|---------|-------|
| year | months only | ROUND_DOWN | 1.41 year = 1 year 4 months; 1.42 year = 1 year 5 months | Only the integral part of total years * 12 is used as months; the rest is dropped. |
| month | months | none | | The integral part of the month value is added exactly to months. |
| month | days | none | | The fractional part of the month value is multiplied by 30; the integral part of that product is added exactly to days. |
| month | microseconds | java.lang.Math.round | 1.1333333333301 months = 1 month 3 days 23 hours 59 minutes 59.999992 seconds | The remainder is added to microseconds. |
| week | days | none | | |
| week | microseconds | java.lang.Math.round | 2.13333333 week = 14 days 22 hours 23 minutes 59.997984 seconds | The remainder is added to microseconds. |
| day | days | none | | |
| day | microseconds | java.lang.Math.round | | The remainder of the day value is added to microseconds. |
| hour, minute, second, millisecond, microsecond | microseconds | java.lang.Math.round | | The values of all these units go to microseconds. |
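
To make the table easier to check by hand, here is a minimal sketch of the year, month, week and day rows. It is not the parser from this PR: the class name `FractionalUnitSketch`, the `convert` and `spill` helpers and the use of `double` arithmetic are all assumptions made for illustration, and it assumes 12 months per year and a fixed 30-day month.

```java
import java.util.Arrays;

// Hypothetical sketch of the conversion table above; not the PR's implementation.
class FractionalUnitSketch {
    static final long MONTHS_PER_YEAR = 12L;
    static final long DAYS_PER_MONTH = 30L;   // fixed 30-day month (assumption)
    static final long DAYS_PER_WEEK = 7L;
    static final long MICROS_PER_DAY = 86_400_000_000L;

    /** Returns {months, days, microseconds} for a single value-unit pair. */
    static long[] convert(double value, String unit) {
        switch (unit) {
            case "year":
                // ROUND_DOWN: keep only the integral part of years * 12.
                return new long[] {(long) (value * MONTHS_PER_YEAR), 0L, 0L};
            case "month": {
                long months = (long) value;
                // The fractional part of the month spills into days and microseconds.
                return spill(months, (value - months) * DAYS_PER_MONTH);
            }
            case "week":
                return spill(0L, value * DAYS_PER_WEEK);
            case "day":
                return spill(0L, value);
            default:
                throw new IllegalArgumentException("unit not covered by this sketch: " + unit);
        }
    }

    // Integral days go to the days field; the remainder is rounded into microseconds.
    private static long[] spill(long months, double totalDays) {
        long days = (long) totalDays;
        long micros = Math.round((totalDays - days) * MICROS_PER_DAY);
        return new long[] {months, days, micros};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(convert(1.41, "year")));
        System.out.println(Arrays.toString(convert(1.42, "year")));
        System.out.println(Arrays.toString(convert(1.1333333333301, "month")));
        System.out.println(Arrays.toString(convert(2.13333333, "week")));
    }
}
```

For the month example, 86399999992 microseconds is 23:59:59.999992, and for the week example 80639997984 microseconds is 22:23:59.997984, which lines up with the Example column.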

2. Additional RULES

2.1. Rounding to microseconds happens for each value-unit pair, not once at the end. For example, '0.0004 millisecond 0.4 microseconds' should yield 0, not 1 microsecond:

postgres=# select interval '0.0004 millisecond 0.4 microseconds';
 interval
----------
 00:00:00
(1 row)

postgres=# select interval '0.001 millisecond 0.4 microseconds';
    interval
-----------------
 00:00:00.000001
(1 row)
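
The difference between the two rounding points can be seen in a small sketch. The class and method names here are hypothetical, and `roundAtEnd` exists only for contrast; it is not something this PR or PostgreSQL does.

```java
// Hypothetical sketch contrasting per-unit rounding with rounding once at the end.
class PerUnitRoundingSketch {
    static final double MICROS_PER_MILLI = 1000.0;

    // Rule 2.1: round each value-unit pair to whole microseconds, then sum.
    static long roundEachUnit(double millis, double micros) {
        return Math.round(millis * MICROS_PER_MILLI) + Math.round(micros);
    }

    // Shown only for contrast: sum the raw values first, round once at the end.
    static long roundAtEnd(double millis, double micros) {
        return Math.round(millis * MICROS_PER_MILLI + micros);
    }

    public static void main(String[] args) {
        System.out.println(roundEachUnit(0.0004, 0.4)); // 0.4 and 0.4 each round to 0
        System.out.println(roundAtEnd(0.0004, 0.4));    // 0.8 rounds to 1
        System.out.println(roundEachUnit(0.001, 0.4));  // 1 + 0
    }
}
```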

2.2. In PostgreSQL, when the seconds value is a decimal, the millisecond and microsecond values cannot also be decimal. I see no reason to support this restriction, given its cost in performance and parsing complexity.

postgres=# select interval '0.1111111 seconds 2 microseconds';
ERROR:  invalid input syntax for type interval: "0.1111111 seconds 2 microseconds"
LINE 1: select interval '0.1111111 seconds 2 microseconds';

postgres=# select interval '1 seconds 2.1 microseconds';
    interval
-----------------
 00:00:01.000002
(1 row)

2.3. 0.5 microseconds rounds to 0 in PostgreSQL but to 1 in Spark, presumably because the host languages use different rounding policies.
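
A small sketch of the two policies, for reference. `Math.round` rounds ties up, while `Math.rint` rounds ties to the nearest even value. That PostgreSQL behaves like round-half-even here is an assumption; it is only inferred from the outputs quoted in this PR (0.5 gives 0, 0.51 and 0.6 give 1). The class and method names are hypothetical.

```java
// Hypothetical sketch: two tie-breaking policies for a fractional microsecond.
class HalfRoundingSketch {
    // Round half up, as java.lang.Math.round does.
    static long halfUp(double micros) {
        return Math.round(micros);
    }

    // Round half to even, via Math.rint (assumed to mirror the PostgreSQL outputs above).
    static long halfEven(double micros) {
        return (long) Math.rint(micros);
    }

    public static void main(String[] args) {
        for (double v : new double[] {0.5, 0.51, 0.6}) {
            System.out.println(v + " -> halfUp=" + halfUp(v) + ", halfEven=" + halfEven(v));
        }
    }
}
```

Only the exact tie (0.5) differs between the two; 0.51 and 0.6 round to 1 under both policies.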

This PR closes #26314, which has a performance regression.

Why are the changes needed?

In PostgreSQL, interval field values can have fractional parts. See https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT

Does this PR introduce any user-facing change?

Yes

  1. Adds support for fractional input in interval values.
  2. No out-of-range exception for nanoseconds; such values are rounded silently.

How was this patch tested?

1. Added unit tests.
2. Benchmark test, using a modified benchmark that adds several fractional second values.

https://github.com/apache/spark/pull/26592/files#diff-c02ae27cf4adc93f30d4a13839aa6bbaR85-R89
Fractional seconds are supported by both master and this fix, so the two can be compared on equal terms. The performance loss here is only 2~3%.

master(before)
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
[info] cast strings to intervals:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval                          789            838          45          1.3         789.1       1.0X
[info] prepare string w/o interval                         761           1248         728          1.3         760.9       1.0X
[info] 1 units w/ interval                                 576            624          42          1.7         576.0       1.4X
[info] 1 units w/o interval                                589            623          32          1.7         588.9       1.3X
[info] 2 units w/ interval                                 814            913          99          1.2         814.3       1.0X
[info] 2 units w/o interval                                698            713          18          1.4         698.0       1.1X
[info] 4 units w/ interval                                1564           1626          88          0.6        1564.2       0.5X
[info] 4 units w/o interval                               1459           1596         210          0.7        1459.2       0.5X
[info] 6 units w/ interval                                1965           2026          67          0.5        1964.6       0.4X
[info] 6 units w/o interval                               1747           1771          21          0.6        1747.4       0.5X
[info] 8 units w/ interval                                2193           2348         163          0.5        2193.0       0.4X
[info] 8 units w/o interval                               2247           2274          24          0.4        2247.3       0.4X
[info] 10 units w/ interval                               2321           2437         185          0.4        2320.8       0.3X
[info] 10 units w/o interval                              2306           2339          39          0.4        2305.8       0.3X
[info] 12 units w/ interval                               2774           2793          18          0.4        2774.1       0.3X
[info] 12 units w/o interval                              2765           2791          22          0.4        2764.8       0.3X
[info] 14 units w/ interval                               3265           3290          36          0.3        3264.8       0.2X
[info] 14 units w/o interval                              3264           3276          16          0.3        3263.7       0.2X
this fix
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
[info] cast strings to intervals:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval                          675            735          68          1.5         675.2       1.0X
[info] prepare string w/o interval                         569            630          60          1.8         569.2       1.2X
[info] 1 units w/ interval                                 519            565          68          1.9         519.1       1.3X
[info] 1 units w/o interval                                450            468          21          2.2         450.1       1.5X
[info] 2 units w/ interval                                 637            645          11          1.6         636.6       1.1X
[info] 2 units w/o interval                                601            616          14          1.7         601.5       1.1X
[info] 4 units w/ interval                                1403           1418          26          0.7        1403.2       0.5X
[info] 4 units w/o interval                               1404           1408           5          0.7        1403.6       0.5X
[info] 6 units w/ interval                                1720           1728          12          0.6        1720.2       0.4X
[info] 6 units w/o interval                               1679           1686           6          0.6        1678.7       0.4X
[info] 8 units w/ interval                                1998           2022          21          0.5        1997.6       0.3X
[info] 8 units w/o interval                               2012           2021          13          0.5        2011.6       0.3X
[info] 10 units w/ interval                               2362           2385          22          0.4        2362.0       0.3X
[info] 10 units w/o interval                              2378           2401          22          0.4        2377.6       0.3X
[info] 12 units w/ interval                               2835           2851          17          0.4        2835.0       0.2X
[info] 12 units w/o interval                              2829           2832           4          0.4        2829.4       0.2X
[info] 14 units w/ interval                               3325           3376          47          0.3        3325.2       0.2X
[info] 14 units w/o interval                              3323           3336          13          0.3        3323.2       0.2X
[info]
[success] Total time: 231 s, completed 2019-11-19 13:40:18
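
For reading the two tables side by side, here is a minimal sketch that computes the relative change in best time for a few rows. The class and method names are hypothetical; the numbers are copied from the "Best Time(ms)" column above.

```java
import java.util.Locale;

// Hypothetical helper for comparing "Best Time(ms)" between the two runs above.
class BenchmarkDeltaSketch {
    // Positive result: the fix is slower than master; negative: faster.
    static double relativeChange(double beforeMs, double afterMs) {
        return (afterMs - beforeMs) / beforeMs;
    }

    static String percent(double beforeMs, double afterMs) {
        return String.format(Locale.ROOT, "%.1f%%", 100.0 * relativeChange(beforeMs, afterMs));
    }

    public static void main(String[] args) {
        System.out.println("14 units w/ interval: " + percent(3265, 3325));
        System.out.println("12 units w/ interval: " + percent(2774, 2835));
        System.out.println("8 units w/ interval:  " + percent(2193, 1998));
    }
}
```

On these rows the fix is roughly 2% slower for 12 and 14 units and roughly 9% faster for 8 units, so the differences are of the same order as the run-to-run noise visible in the Stdev column.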
3. Precision check against PostgreSQL

postgres=# select interval '1.41666666666666 year';
   interval
---------------
 1 year 4 mons
(1 row)

postgres=# select interval '1.41666666666667 year';
   interval
---------------
 1 year 5 mons
(1 row)

postgres=# select interval '2.13333333 week';
        interval
-------------------------
 14 days 22:23:59.997984
(1 row)

postgres=# select interval '2.13333334 week';
        interval
-------------------------
 14 days 22:24:00.004032
(1 row)


postgres=# select interval '1.133333333330 months';
           interval
------------------------------
 1 mon 3 days 23:59:59.999991
(1 row)

postgres=# select interval '1.1333333333301 months';
           interval
------------------------------
 1 mon 3 days 23:59:59.999992
(1 row)

postgres=#
postgres=# select interval '0.50 microseconds';
 interval
----------
 00:00:00
(1 row)

postgres=# select interval '0.60 microseconds';
    interval
-----------------
 00:00:00.000001
(1 row)

@yaooqinn
Member Author

cc @cloud-fan @MaxGekk @HyukjinKwon. Compared to #26314, the performance is improved here.

Thanks for reviewing

@cloud-fan
Contributor

Can we have a summary of the rules? What I see from the examples:

  1. The fractional part of a year only affects months
  2. The fractional part of a week affects all the following units
  3. ...

@yaooqinn
Member Author

> Can we have a summary of the rules? What I see from the examples:
>
>   1. The fractional part of a year only affects months
>   2. The fractional part of a week affects all the following units
>   3. ...

OK, I will add these to the PR description.

@SparkQA

SparkQA commented Nov 19, 2019

Test build #114067 has finished for PR 26592 at commit dd40ea0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 19, 2019

Test build #114061 has finished for PR 26592 at commit 4018278.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

yaooqinn commented Nov 19, 2019

postgres=# select interval '0.1111111 seconds 2 microseconds';
ERROR:  invalid input syntax for type interval: "0.1111111 seconds 2 microseconds"
LINE 1: select interval '0.1111111 seconds 2 microseconds';
                        ^
postgres=# select interval '1 seconds 2 microseconds';
    interval
-----------------
 00:00:01.000002
(1 row)

postgres=# select interval '1 seconds 2.1 microseconds';
    interval
-----------------
 00:00:01.000002
(1 row)

postgres=# select interval '1 minute 2.1 microseconds';
    interval
-----------------
 00:01:00.000002
(1 row)

PG does not allow millis and micros to be fractional numbers when, and only when, the seconds part is fractional too. That is a bit difficult to follow.

@MaxGekk
Member

MaxGekk commented Nov 19, 2019

> PG does not allow millis and micros to be fractional numbers

millis can have a fraction:

# select interval '0.1111111 milliseconds';
    interval
-----------------
 00:00:00.000111

@yaooqinn
Member Author

> > PG does not allow millis and micros to be fractional numbers
>
> millis can have a fraction:
>
> # select interval '0.1111111 milliseconds';
>     interval
> -----------------
>  00:00:00.000111

That holds when, and only when, the seconds part is fractional.

-- !query 59 output
13.123456 seconds -13.123456 seconds
13.123457 seconds -13.123457 seconds
Member Author

@yaooqinn yaooqinn Nov 19, 2019

postgres=# select interval '13.123456789 seconds';
    interval
-----------------
 00:00:13.123457
(1 row)

postgres=# select interval '-13.123456789 seconds';
     interval
------------------
 -00:00:13.123457
(1 row)



-- !query 126
select interval '0.50 microseconds'
Member Author

This case differs from PG:

postgres=# select interval '0.5 us';
 interval
----------
 00:00:00
(1 row)

postgres=# select interval '0.51 us';
    interval
-----------------
 00:00:00.000001
(1 row)

@@ -85,7 +85,8 @@ object IntervalBenchmark extends SqlBasedBenchmark {
     val timeUnits = Seq(
       "13 months", " 1 months",
       "100 weeks", "9 days", "12 hours", "- 3 hours",
-      "5 minutes", "45 seconds", "123 milliseconds", "567 microseconds")
+      "5 minutes", "45.123456 seconds", "123 milliseconds", "567 microseconds",
+      "98.76543210 seconds", "12.34567890 seconds", "99.999999999 seconds")
Member Author

Only decimal seconds are added, because that is all the master branch supports, so the comparison with this PR is fair. Parsing the other units should cost the same as parsing seconds.

intervalToTest.append(unit)
addCase(benchmark, N, intervalToTest)
if (i % 2 == 0) {
Member Author

More and more units have been added, so this tries to cut down the number of test rounds.

@SparkQA

SparkQA commented Nov 19, 2019

Test build #114081 has finished for PR 26592 at commit 4dc8f02.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 19, 2019

Test build #114089 has finished for PR 26592 at commit 5b4f518.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 19, 2019

Test build #114085 has finished for PR 26592 at commit a5e2811.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -646,7 +646,7 @@ class ExpressionParserSuite extends AnalysisTest {
         Literal(new CalendarInterval(
           0,
           0,
-          -13 * MICROS_PER_SECOND - 123 * MICROS_PER_MILLIS - 456)))
+          -13 * MICROS_PER_SECOND - 123 * MICROS_PER_MILLIS - 457)))
Member Author

postgres=# select interval '-13.123456789 second';
     interval
------------------
 -00:00:13.123457
(1 row)
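
The expected value in the test can be checked with a short sketch. It assumes the seconds value is scaled to microseconds as a `double` and rounded with `Math.round`; the class and method names are hypothetical, and the truncating variant is included only because it reproduces the older `- 456` expectation, which is one possible reading of the diff.

```java
// Hypothetical sketch: rounding vs. truncating nanosecond digits of a seconds value.
class SecondsRoundingSketch {
    static final long MICROS_PER_SECOND = 1_000_000L;
    static final long MICROS_PER_MILLIS = 1_000L;

    // Round to the nearest microsecond, as java.lang.Math.round does.
    static long roundedMicros(double seconds) {
        return Math.round(seconds * MICROS_PER_SECOND);
    }

    // Drop everything below one microsecond (truncate toward zero).
    static long truncatedMicros(double seconds) {
        return (long) (seconds * MICROS_PER_SECOND);
    }

    public static void main(String[] args) {
        System.out.println(roundedMicros(-13.123456789));   // matches the "- 457" expectation
        System.out.println(truncatedMicros(-13.123456789)); // matches the "- 456" expectation
    }
}
```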

@SparkQA

SparkQA commented Nov 19, 2019

Test build #114098 has finished for PR 26592 at commit 854b6f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 20, 2019

Test build #114138 has finished for PR 26592 at commit 5604b83.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

retest this please

@SparkQA

SparkQA commented Nov 20, 2019

Test build #114146 has finished for PR 26592 at commit 5604b83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 25, 2019

Test build #114380 has finished for PR 26592 at commit 4dffac9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

retest this please

@SparkQA

SparkQA commented Nov 25, 2019

Test build #114389 has finished for PR 26592 at commit 4dffac9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 27, 2019

Test build #114512 has finished for PR 26592 at commit fe3e6ba.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Member Author

retest this please

@SparkQA

SparkQA commented Nov 27, 2019

Test build #114518 has finished for PR 26592 at commit fe3e6ba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

5 participants