-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-30026][SQL] Whitespaces can be identified as delimiters in interval string #26662
Conversation
Test build #114404 has finished for PR 26662 at commit
|
Test build #114410 has finished for PR 26662 at commit
|
// scalastyle:on | ||
val bytes = s.getBytes | ||
if (bytes.isEmpty) { | ||
val strs = input.trimAll().toLowerCase().split(UTF8String.fromString("interval\\s+"), -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, no. Compiling the regular expression per each call? Could you run the benchmark. Just wondering how much are you going to slow it down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this vs master vs static UTF8String.fromString("interval\s+")
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
[info] cast strings to intervals: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval 342 383 36 2.9 342.2 1.0X
[info] prepare string w/o interval 312 325 16 3.2 311.6 1.1X
[info] 1 units w/ interval 1002 1019 15 1.0 1002.3 0.3X
[info] 1 units w/o interval 815 817 1 1.2 815.0 0.4X
[info] 2 units w/ interval 1090 1092 2 0.9 1089.8 0.3X
[info] 2 units w/o interval 918 926 8 1.1 917.9 0.4X
[info] 3 units w/ interval 1618 1620 2 0.6 1618.0 0.2X
[info] 3 units w/o interval 1550 1733 241 0.6 1549.7 0.2X
[info] 4 units w/ interval 1751 1755 7 0.6 1750.8 0.2X
[info] 4 units w/o interval 1580 1587 7 0.6 1579.5 0.2X
[info] 5 units w/ interval 1874 1877 4 0.5 1873.6 0.2X
[info] 5 units w/o interval 1696 1705 8 0.6 1695.9 0.2X
[info] 6 units w/ interval 2025 2054 35 0.5 2025.1 0.2X
[info] 6 units w/o interval 1836 1843 10 0.5 1835.9 0.2X
[info] 7 units w/ interval 2280 2284 4 0.4 2280.4 0.2X
[info] 7 units w/o interval 2090 2096 8 0.5 2089.7 0.2X
[info] 8 units w/ interval 2363 2368 7 0.4 2362.6 0.1X
[info] 8 units w/o interval 2170 2182 13 0.5 2170.4 0.2X
[info] 9 units w/ interval 2618 2626 11 0.4 2618.1 0.1X
[info] 9 units w/o interval 2432 2439 9 0.4 2432.2 0.1X
[info] 10 units w/ interval 2622 2635 14 0.4 2621.7 0.1X
[info] 10 units w/o interval 2455 2462 11 0.4 2455.2 0.1X
[info] 11 units w/ interval 3080 3086 9 0.3 3079.9 0.1X
[info] 11 units w/o interval 2900 2907 12 0.3 2899.6 0.1X
[info]
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
[info] cast strings to intervals: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval 343 384 35 2.9 343.4 1.0X
[info] prepare string w/o interval 319 328 14 3.1 319.3 1.1X
[info] 1 units w/ interval 390 396 7 2.6 390.5 0.9X
[info] 1 units w/o interval 353 359 8 2.8 352.7 1.0X
[info] 2 units w/ interval 446 453 12 2.2 446.3 0.8X
[info] 2 units w/o interval 436 444 9 2.3 436.3 0.8X
[info] 3 units w/ interval 962 976 20 1.0 961.5 0.4X
[info] 3 units w/o interval 913 919 10 1.1 912.8 0.4X
[info] 4 units w/ interval 1065 1082 27 0.9 1064.8 0.3X
[info] 4 units w/o interval 1058 1083 24 0.9 1057.7 0.3X
[info] 5 units w/ interval 1158 1171 11 0.9 1157.8 0.3X
[info] 5 units w/o interval 1129 1140 10 0.9 1129.4 0.3X
[info] 6 units w/ interval 1296 1332 32 0.8 1295.9 0.3X
[info] 6 units w/o interval 1273 1293 24 0.8 1273.5 0.3X
[info] 7 units w/ interval 1543 1570 27 0.6 1543.1 0.2X
[info] 7 units w/o interval 1500 1510 9 0.7 1500.4 0.2X
[info] 8 units w/ interval 1607 1609 2 0.6 1606.7 0.2X
[info] 8 units w/o interval 1588 1595 6 0.6 1587.9 0.2X
[info] 9 units w/ interval 1726 1727 0 0.6 1726.2 0.2X
[info] 9 units w/o interval 1721 1727 5 0.6 1721.2 0.2X
[info] 10 units w/ interval 1982 1989 6 0.5 1982.1 0.2X
[info] 10 units w/o interval 1963 1969 6 0.5 1963.1 0.2X
[info] 11 units w/ interval 2044 2050 6 0.5 2043.6 0.2X
[info] 11 units w/o interval 2013 2024 10 0.5 2013.1 0.2X
[info]
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
[info] cast strings to intervals: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval 366 398 27 2.7 366.4 1.0X
[info] prepare string w/o interval 326 338 18 3.1 325.5 1.1X
[info] 1 units w/ interval 1001 1009 9 1.0 1000.8 0.4X
[info] 1 units w/o interval 793 803 14 1.3 792.9 0.5X
[info] 2 units w/ interval 1090 1094 6 0.9 1090.0 0.3X
[info] 2 units w/o interval 894 901 6 1.1 894.4 0.4X
[info] 3 units w/ interval 1689 1706 28 0.6 1688.7 0.2X
[info] 3 units w/o interval 1496 1512 15 0.7 1495.6 0.2X
[info] 4 units w/ interval 1823 1862 35 0.5 1822.8 0.2X
[info] 4 units w/o interval 1634 1637 3 0.6 1634.2 0.2X
[info] 5 units w/ interval 1948 1959 16 0.5 1947.8 0.2X
[info] 5 units w/o interval 1750 1790 56 0.6 1749.7 0.2X
[info] 6 units w/ interval 2075 2114 51 0.5 2075.5 0.2X
[info] 6 units w/o interval 1897 1942 73 0.5 1896.5 0.2X
[info] 7 units w/ interval 2319 2352 33 0.4 2319.1 0.2X
[info] 7 units w/o interval 2346 2512 286 0.4 2346.0 0.2X
[info] 8 units w/ interval 2479 2509 32 0.4 2479.2 0.1X
[info] 8 units w/o interval 2384 2427 40 0.4 2384.5 0.2X
[info] 9 units w/ interval 2506 2523 19 0.4 2505.7 0.1X
[info] 9 units w/o interval 2297 2411 152 0.4 2297.2 0.2X
[info] 10 units w/ interval 2839 2867 38 0.4 2839.1 0.1X
[info] 10 units w/o interval 2660 2671 18 0.4 2659.9 0.1X
[info] 11 units w/ interval 2768 2802 51 0.4 2767.5 0.1X
[info] 11 units w/o interval 2579 2586 6 0.4 2579.3 0.1X
[info]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 9 units from 1726 -> 2618. This is slowing down by 50% which is significant. Is it possible to implement the same without regexps and split?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll think about that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
[info] cast strings to intervals: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval 357 370 22 2.8 357.0 1.0X
[info] prepare string w/o interval 315 333 22 3.2 314.7 1.1X
[info] 1 units w/ interval 356 380 21 2.8 355.8 1.0X
[info] 1 units w/o interval 317 326 12 3.2 317.1 1.1X
[info] 2 units w/ interval 481 488 8 2.1 480.8 0.7X
[info] 2 units w/o interval 456 464 9 2.2 456.0 0.8X
[info] 3 units w/ interval 1074 1080 5 0.9 1073.7 0.3X
[info] 3 units w/o interval 1025 1027 2 1.0 1025.4 0.3X
[info] 4 units w/ interval 1192 1196 5 0.8 1192.2 0.3X
[info] 4 units w/o interval 1219 1233 14 0.8 1218.9 0.3X
[info] 5 units w/ interval 1367 1382 23 0.7 1367.3 0.3X
[info] 5 units w/o interval 1295 1301 7 0.8 1295.1 0.3X
[info] 6 units w/ interval 1489 1525 31 0.7 1489.3 0.2X
[info] 6 units w/o interval 1496 1500 6 0.7 1495.8 0.2X
[info] 7 units w/ interval 1326 1330 4 0.8 1325.5 0.3X
[info] 7 units w/o interval 1324 1332 11 0.8 1324.1 0.3X
[info] 8 units w/ interval 1535 1547 11 0.7 1535.4 0.2X
[info] 8 units w/o interval 1542 1547 5 0.6 1542.3 0.2X
[info] 9 units w/ interval 1623 1641 18 0.6 1623.0 0.2X
[info] 9 units w/o interval 1615 1619 3 0.6 1615.3 0.2X
[info] 10 units w/ interval 1845 1861 16 0.5 1844.7 0.2X
[info] 10 units w/o interval 1858 1868 9 0.5 1857.8 0.2X
[info] 11 units w/ interval 1919 1925 11 0.5 1918.7 0.2X
[info] 11 units w/o interval 1973 1995 23 0.5 1972.8 0.2X
[info]
Thanks for your advice @MaxGekk. the performance regression has been resolved.
cc @cloud-fan @maropu, thanks for reviewing this. |
@@ -485,14 +485,15 @@ object IntervalUtils { | |||
|
|||
def trimToNextState(b: Byte, next: ParseState): Unit = { | |||
b match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still need to do match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if-else looks better.
@@ -560,7 +563,7 @@ object IntervalUtils { | |||
case _ if '0' <= b && b <= '9' && fractionScale > 0 => | |||
fraction += (b - '0') * fractionScale | |||
fractionScale /= 10 | |||
case ' ' if !pointPrefixed || fractionScale < initialFractionScale => | |||
case _ if b <= ' ' && (!pointPrefixed || fractionScale < initialFractionScale) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, do we still the pattern match here?
@@ -79,7 +79,7 @@ class IntervalUtilsSuite extends SparkFunSuite { | |||
|
|||
checkFromInvalidString(null, "cannot be null") | |||
|
|||
for (input <- Seq("", " ", "interval", "interval1 day", "foo", "foo 1 day")) { | |||
for (input <- Seq("", "interval", "foo", "foo 1 day")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove some test cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see, it's covered in the new one
val strings = s.split(UTF8String.blankString(1), -1) | ||
val lenRight = s.substring(i, s.numBytes()).split(UTF8String.blankString(1), -1).length | ||
val sep = UTF8String.fromString("\\s+") | ||
val strings = s.split(sep, -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UTF8String.split
simply calls java String.split
. Since we only need to get the current word for error message, can we just call String.split
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, now String.split
shall work, let me change this.
Test build #114451 has finished for PR 26662 at commit
|
Test build #114453 has finished for PR 26662 at commit
|
Test build #114465 has finished for PR 26662 at commit
|
thanks, merging to master! |
…erval string ### What changes were proposed in this pull request? We are now able to handle whitespaces for integral and fractional types, and the leading or trailing whitespaces for interval, date, and timestamps. But the current interval parser is not able to identify whitespaces as separates as PostgreSQL can do. This PR makes the whitespaces handling be consistent for nterval values. Typed interval literal, multi-unit representation, and casting from strings are all supported. ```sql postgres=# select interval E'1 \t day'; interval ---------- 1 day (1 row) postgres=# select interval E'1\t' day; interval ---------- 1 day (1 row) ``` ### Why are the changes needed? Whitespace handling should be consistent for interval value, and across different types in Spark. PostgreSQL feature parity. ### Does this PR introduce any user-facing change? Yes, the interval string of multi-units values which separated by whitespaces can be valid now. ### How was this patch tested? add ut. Closes apache#26662 from yaooqinn/SPARK-30026. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
We are now able to handle whitespaces for integral and fractional types, and the leading or trailing whitespaces for interval, date, and timestamps. But the current interval parser is not able to identify whitespaces as separates as PostgreSQL can do.
This PR makes the whitespaces handling be consistent for nterval values.
Typed interval literal, multi-unit representation, and casting from strings are all supported.
Why are the changes needed?
Whitespace handling should be consistent for interval value, and across different types in Spark.
PostgreSQL feature parity.
Does this PR introduce any user-facing change?
Yes, the interval string of multi-units values which separated by whitespaces can be valid now.
How was this patch tested?
add ut.