[SPARK-29371][SQL] Fractional representation for interval string #26592
Conversation
cc @cloud-fan @MaxGekk @HyukjinKwon. Compared to #26314, the performance is improved here. Thanks for reviewing.
Can we have a summary of the rules? What I see from the examples:
OK, I will add these to the PR description.
Test build #114067 has finished for PR 26592 at commit
Test build #114061 has finished for PR 26592 at commit
postgres=# select interval '0.1111111 seconds 2 microseconds';
ERROR: invalid input syntax for type interval: "0.1111111 seconds 2 microseconds"
LINE 1: select interval '0.1111111 seconds 2 microseconds';
^
postgres=# select interval '1 seconds 2 microseconds';
interval
-----------------
00:00:01.000002
(1 row)
postgres=# select interval '1 seconds 2.1 microseconds';
interval
-----------------
00:00:01.000002
(1 row)
postgres=# select interval '1 minute 2.1 microseconds';
interval
-----------------
00:01:00.000002
(1 row)
PG does not allow millis and micros to be fractions when, and only when, the seconds part is also a fraction. A bit difficult to follow.
millis can have a fraction:
The restriction applies when, and only when, the seconds part is a fraction.
-- !query 59 output
13.123456 seconds -13.123456 seconds
13.123457 seconds -13.123457 seconds
postgres=# select interval '13.123456789 seconds';
interval
-----------------
00:00:13.123457
(1 row)
postgres=# select interval '-13.123456789 seconds';
interval
------------------
-00:00:13.123457
(1 row)
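The updated expectations follow half-up rounding of the fractional seconds to whole microseconds. A minimal Scala sketch of that conversion (a hypothetical helper, not Spark's actual code):

// Half-up rounding of fractional seconds to whole microseconds; reproduces
// the 13.123456789 -> 13.123457 results above for both signs.
val MICROS_PER_SECOND = 1000000L
def secondsToMicros(s: Double): Long = Math.round(s * MICROS_PER_SECOND)
secondsToMicros(13.123456789)   // 13123457
secondsToMicros(-13.123456789)  // -13123457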
-- !query 126
select interval '0.50 microseconds'
This case is different from PG:
postgres=# select interval '0.5 us';
interval
----------
00:00:00
(1 row)
postgres=# select interval '0.51 us';
interval
-----------------
00:00:00.000001
(1 row)
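A plausible cause (an assumption; the thread only guesses at the rounding policies): the JVM's Math.round rounds half up, while PostgreSQL's C implementation rounds half to even, so the two engines disagree exactly at the .5 boundary:

// Assumption: Spark rounds half up via the JVM's Math.round, while PG's C
// code rounds half to even (rint); they differ only at the .5 boundary.
Math.round(0.5)  // 1   -> Spark parses '0.5 microseconds' as 1 microsecond
Math.rint(0.5)   // 0.0 -> half to even, matching PG's 00:00:00 above
Math.rint(0.51)  // 1.0 -> both engines agree once past the boundary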
@@ -85,7 +85,8 @@ object IntervalBenchmark extends SqlBasedBenchmark {
   val timeUnits = Seq(
     "13 months", " 1 months",
     "100 weeks", "9 days", "12 hours", "- 3 hours",
-    "5 minutes", "45 seconds", "123 milliseconds", "567 microseconds")
+    "5 minutes", "45.123456 seconds", "123 milliseconds", "567 microseconds",
+    "98.76543210 seconds", "12.34567890 seconds", "99.999999999 seconds")
Only seconds decimals are added here, because the master branch supports those too, so we can compare it with this PR fairly. Also, parsing the other units should behave the same as parsing seconds.
intervalToTest.append(unit)
addCase(benchmark, N, intervalToTest)
if (i % 2 == 0) {
Units kept being added, so this reduces the number of benchmark rounds a bit.
Test build #114081 has finished for PR 26592 at commit
Test build #114089 has finished for PR 26592 at commit
Test build #114085 has finished for PR 26592 at commit
@@ -646,7 +646,7 @@ class ExpressionParserSuite extends AnalysisTest {
       Literal(new CalendarInterval(
         0,
         0,
-        -13 * MICROS_PER_SECOND - 123 * MICROS_PER_MILLIS - 456)))
+        -13 * MICROS_PER_SECOND - 123 * MICROS_PER_MILLIS - 457)))
postgres=# select interval '-13.123456789 second';
interval
------------------
-00:00:13.123457
(1 row)
Test build #114098 has finished for PR 26592 at commit
Test build #114138 has finished for PR 26592 at commit
retest this please
Test build #114146 has finished for PR 26592 at commit
Test build #114380 has finished for PR 26592 at commit
retest this please
Test build #114389 has finished for PR 26592 at commit
Test build #114512 has finished for PR 26592 at commit
retest this please
Test build #114518 has finished for PR 26592 at commit
What changes were proposed in this pull request?
Add fractional representation for interval values
RULES
1. Each unit converts to CalendarInterval's months, days and microseconds, as these examples show (see the first sketch after these rules):
1.41 year = 1 years 4 months
1.42 year = 1 years 5 months
1.1333333333301 months = 1 months 3 days 23 hours 59 minutes 59.999992 seconds
2.13333333 week = 14 days 22 hours 23 minutes 59.997984 seconds
2. Additional RULES
2.1. Rounding to whole microseconds happens per value-unit pair, not at the end. For example, 0.0004 millisecond 0.4 microseconds should be 0, not 1 microsecond (see the second sketch after these rules).
2.2. When the seconds value is a decimal, the millisecond and microsecond values cannot be decimals; I see no reason to support this at the cost of performance and more complex parsing logic.
2.3. 0.5 microsecond rounds to 0 in PostgreSQL, but to 1 in Spark, I guess due to the different rounding policies of the host languages.
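For illustration, here is a minimal Scala sketch of the cascading conversion behind the rule 1 examples (hypothetical helpers that assume 1 month = 30 days; not necessarily the PR's actual code):

// Fractional years keep only whole months: (1.41 * 12).toInt = 16, i.e. 1 years 4 months.
def yearsToMonths(years: Double): Int = (years * 12).toInt

// Fractional months keep whole days; the remainder becomes microseconds:
// 1.1333333333301 months -> (1, 3, 86399999992), i.e. 1 months 3 days 23:59:59.999992.
val MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000
def monthsToInterval(months: Double): (Int, Int, Long) = {
  val wholeMonths = months.toInt
  val fracDays = (months - wholeMonths) * 30
  val wholeDays = fracDays.toInt
  (wholeMonths, wholeDays, Math.round((fracDays - wholeDays) * MICROS_PER_DAY))
}

And a sketch of rule 2.1, per value-unit rounding versus rounding once at the end (again a hypothetical helper):

// Each value-unit pair is rounded to whole microseconds on its own, so
// sub-microsecond remainders from different units never accumulate.
def toMicros(value: Double, microsPerUnit: Long): Long =
  Math.round(value * microsPerUnit)

// '0.0004 millisecond 0.4 microseconds'
val perUnit = toMicros(0.0004, 1000L) + toMicros(0.4, 1L) // 0 + 0 = 0
val atEnd   = Math.round(0.0004 * 1000 + 0.4)             // round(0.8) = 1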
This PR closes #26314, which had a performance regression.
Why are the changes needed?
In PostgreSQL, interval field values can have fractional parts. See https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-INTERVAL-INPUT
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
1. Added unit tests.
2. Benchmark test, with a modified benchmark that adds several fractional-second values:
https://github.com/apache/spark/pull/26592/files#diff-c02ae27cf4adc93f30d4a13839aa6bbaR85-R89
These values are supported by both master and this fix, so the two can be compared on equal footing; only a 2~3% performance loss here.
master (before)
this fix
3. Precision check against PostgreSQL.