Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-30026][SQL] Whitespaces can be identified as delimiters in interval string #26662

Closed
wants to merge 5 commits into from
Closed

Conversation

yaooqinn
Copy link
Member

What changes were proposed in this pull request?

We are now able to handle whitespaces for integral and fractional types, and the leading or trailing whitespaces for interval, date, and timestamps. But the current interval parser is not able to identify whitespaces as separates as PostgreSQL can do.

This PR makes the whitespaces handling be consistent for nterval values.
Typed interval literal, multi-unit representation, and casting from strings are all supported.

postgres=# select interval E'1 \t day';
 interval
----------
 1 day
(1 row)

postgres=# select interval E'1\t' day;
 interval
----------
 1 day
(1 row)

Why are the changes needed?

Whitespace handling should be consistent for interval value, and across different types in Spark.
PostgreSQL feature parity.

Does this PR introduce any user-facing change?

Yes, the interval string of multi-units values which separated by whitespaces can be valid now.

How was this patch tested?

add ut.

@SparkQA
Copy link

SparkQA commented Nov 25, 2019

Test build #114404 has finished for PR 26662 at commit 7de372a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 25, 2019

Test build #114410 has finished for PR 26662 at commit 7af388b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// scalastyle:on
val bytes = s.getBytes
if (bytes.isEmpty) {
val strs = input.trimAll().toLowerCase().split(UTF8String.fromString("interval\\s+"), -1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, no. Compiling the regular expression per each call? Could you run the benchmark. Just wondering how much are you going to slow it down.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this vs master vs static UTF8String.fromString("interval\s+")

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
[info] cast strings to intervals:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval                          342            383          36          2.9         342.2       1.0X
[info] prepare string w/o interval                         312            325          16          3.2         311.6       1.1X
[info] 1 units w/ interval                                1002           1019          15          1.0        1002.3       0.3X
[info] 1 units w/o interval                                815            817           1          1.2         815.0       0.4X
[info] 2 units w/ interval                                1090           1092           2          0.9        1089.8       0.3X
[info] 2 units w/o interval                                918            926           8          1.1         917.9       0.4X
[info] 3 units w/ interval                                1618           1620           2          0.6        1618.0       0.2X
[info] 3 units w/o interval                               1550           1733         241          0.6        1549.7       0.2X
[info] 4 units w/ interval                                1751           1755           7          0.6        1750.8       0.2X
[info] 4 units w/o interval                               1580           1587           7          0.6        1579.5       0.2X
[info] 5 units w/ interval                                1874           1877           4          0.5        1873.6       0.2X
[info] 5 units w/o interval                               1696           1705           8          0.6        1695.9       0.2X
[info] 6 units w/ interval                                2025           2054          35          0.5        2025.1       0.2X
[info] 6 units w/o interval                               1836           1843          10          0.5        1835.9       0.2X
[info] 7 units w/ interval                                2280           2284           4          0.4        2280.4       0.2X
[info] 7 units w/o interval                               2090           2096           8          0.5        2089.7       0.2X
[info] 8 units w/ interval                                2363           2368           7          0.4        2362.6       0.1X
[info] 8 units w/o interval                               2170           2182          13          0.5        2170.4       0.2X
[info] 9 units w/ interval                                2618           2626          11          0.4        2618.1       0.1X
[info] 9 units w/o interval                               2432           2439           9          0.4        2432.2       0.1X
[info] 10 units w/ interval                               2622           2635          14          0.4        2621.7       0.1X
[info] 10 units w/o interval                              2455           2462          11          0.4        2455.2       0.1X
[info] 11 units w/ interval                               3080           3086           9          0.3        3079.9       0.1X
[info] 11 units w/o interval                              2900           2907          12          0.3        2899.6       0.1X
[info]

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
[info] cast strings to intervals:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval                          343            384          35          2.9         343.4       1.0X
[info] prepare string w/o interval                         319            328          14          3.1         319.3       1.1X
[info] 1 units w/ interval                                 390            396           7          2.6         390.5       0.9X
[info] 1 units w/o interval                                353            359           8          2.8         352.7       1.0X
[info] 2 units w/ interval                                 446            453          12          2.2         446.3       0.8X
[info] 2 units w/o interval                                436            444           9          2.3         436.3       0.8X
[info] 3 units w/ interval                                 962            976          20          1.0         961.5       0.4X
[info] 3 units w/o interval                                913            919          10          1.1         912.8       0.4X
[info] 4 units w/ interval                                1065           1082          27          0.9        1064.8       0.3X
[info] 4 units w/o interval                               1058           1083          24          0.9        1057.7       0.3X
[info] 5 units w/ interval                                1158           1171          11          0.9        1157.8       0.3X
[info] 5 units w/o interval                               1129           1140          10          0.9        1129.4       0.3X
[info] 6 units w/ interval                                1296           1332          32          0.8        1295.9       0.3X
[info] 6 units w/o interval                               1273           1293          24          0.8        1273.5       0.3X
[info] 7 units w/ interval                                1543           1570          27          0.6        1543.1       0.2X
[info] 7 units w/o interval                               1500           1510           9          0.7        1500.4       0.2X
[info] 8 units w/ interval                                1607           1609           2          0.6        1606.7       0.2X
[info] 8 units w/o interval                               1588           1595           6          0.6        1587.9       0.2X
[info] 9 units w/ interval                                1726           1727           0          0.6        1726.2       0.2X
[info] 9 units w/o interval                               1721           1727           5          0.6        1721.2       0.2X
[info] 10 units w/ interval                               1982           1989           6          0.5        1982.1       0.2X
[info] 10 units w/o interval                              1963           1969           6          0.5        1963.1       0.2X
[info] 11 units w/ interval                               2044           2050           6          0.5        2043.6       0.2X
[info] 11 units w/o interval                              2013           2024          10          0.5        2013.1       0.2X
[info]

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
[info] cast strings to intervals:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval                          366            398          27          2.7         366.4       1.0X
[info] prepare string w/o interval                         326            338          18          3.1         325.5       1.1X
[info] 1 units w/ interval                                1001           1009           9          1.0        1000.8       0.4X
[info] 1 units w/o interval                                793            803          14          1.3         792.9       0.5X
[info] 2 units w/ interval                                1090           1094           6          0.9        1090.0       0.3X
[info] 2 units w/o interval                                894            901           6          1.1         894.4       0.4X
[info] 3 units w/ interval                                1689           1706          28          0.6        1688.7       0.2X
[info] 3 units w/o interval                               1496           1512          15          0.7        1495.6       0.2X
[info] 4 units w/ interval                                1823           1862          35          0.5        1822.8       0.2X
[info] 4 units w/o interval                               1634           1637           3          0.6        1634.2       0.2X
[info] 5 units w/ interval                                1948           1959          16          0.5        1947.8       0.2X
[info] 5 units w/o interval                               1750           1790          56          0.6        1749.7       0.2X
[info] 6 units w/ interval                                2075           2114          51          0.5        2075.5       0.2X
[info] 6 units w/o interval                               1897           1942          73          0.5        1896.5       0.2X
[info] 7 units w/ interval                                2319           2352          33          0.4        2319.1       0.2X
[info] 7 units w/o interval                               2346           2512         286          0.4        2346.0       0.2X
[info] 8 units w/ interval                                2479           2509          32          0.4        2479.2       0.1X
[info] 8 units w/o interval                               2384           2427          40          0.4        2384.5       0.2X
[info] 9 units w/ interval                                2506           2523          19          0.4        2505.7       0.1X
[info] 9 units w/o interval                               2297           2411         152          0.4        2297.2       0.2X
[info] 10 units w/ interval                               2839           2867          38          0.4        2839.1       0.1X
[info] 10 units w/o interval                              2660           2671          18          0.4        2659.9       0.1X
[info] 11 units w/ interval                               2768           2802          51          0.4        2767.5       0.1X
[info] 11 units w/o interval                              2579           2586           6          0.4        2579.3       0.1X
[info]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 9 units from 1726 -> 2618. This is slowing down by 50% which is significant. Is it possible to implement the same without regexps and split?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll think about that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
[info] cast strings to intervals:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval                          357            370          22          2.8         357.0       1.0X
[info] prepare string w/o interval                         315            333          22          3.2         314.7       1.1X
[info] 1 units w/ interval                                 356            380          21          2.8         355.8       1.0X
[info] 1 units w/o interval                                317            326          12          3.2         317.1       1.1X
[info] 2 units w/ interval                                 481            488           8          2.1         480.8       0.7X
[info] 2 units w/o interval                                456            464           9          2.2         456.0       0.8X
[info] 3 units w/ interval                                1074           1080           5          0.9        1073.7       0.3X
[info] 3 units w/o interval                               1025           1027           2          1.0        1025.4       0.3X
[info] 4 units w/ interval                                1192           1196           5          0.8        1192.2       0.3X
[info] 4 units w/o interval                               1219           1233          14          0.8        1218.9       0.3X
[info] 5 units w/ interval                                1367           1382          23          0.7        1367.3       0.3X
[info] 5 units w/o interval                               1295           1301           7          0.8        1295.1       0.3X
[info] 6 units w/ interval                                1489           1525          31          0.7        1489.3       0.2X
[info] 6 units w/o interval                               1496           1500           6          0.7        1495.8       0.2X
[info] 7 units w/ interval                                1326           1330           4          0.8        1325.5       0.3X
[info] 7 units w/o interval                               1324           1332          11          0.8        1324.1       0.3X
[info] 8 units w/ interval                                1535           1547          11          0.7        1535.4       0.2X
[info] 8 units w/o interval                               1542           1547           5          0.6        1542.3       0.2X
[info] 9 units w/ interval                                1623           1641          18          0.6        1623.0       0.2X
[info] 9 units w/o interval                               1615           1619           3          0.6        1615.3       0.2X
[info] 10 units w/ interval                               1845           1861          16          0.5        1844.7       0.2X
[info] 10 units w/o interval                              1858           1868           9          0.5        1857.8       0.2X
[info] 11 units w/ interval                               1919           1925          11          0.5        1918.7       0.2X
[info] 11 units w/o interval                              1973           1995          23          0.5        1972.8       0.2X
[info]

Thanks for your advice @MaxGekk. the performance regression has been resolved.

@yaooqinn
Copy link
Member Author

cc @cloud-fan @maropu, thanks for reviewing this.

@@ -485,14 +485,15 @@ object IntervalUtils {

def trimToNextState(b: Byte, next: ParseState): Unit = {
b match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need to do match?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if-else looks better.

@@ -560,7 +563,7 @@ object IntervalUtils {
case _ if '0' <= b && b <= '9' && fractionScale > 0 =>
fraction += (b - '0') * fractionScale
fractionScale /= 10
case ' ' if !pointPrefixed || fractionScale < initialFractionScale =>
case _ if b <= ' ' && (!pointPrefixed || fractionScale < initialFractionScale) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, do we still the pattern match here?

@@ -79,7 +79,7 @@ class IntervalUtilsSuite extends SparkFunSuite {

checkFromInvalidString(null, "cannot be null")

for (input <- Seq("", " ", "interval", "interval1 day", "foo", "foo 1 day")) {
for (input <- Seq("", "interval", "foo", "foo 1 day")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove some test cases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah i see, it's covered in the new one

val strings = s.split(UTF8String.blankString(1), -1)
val lenRight = s.substring(i, s.numBytes()).split(UTF8String.blankString(1), -1).length
val sep = UTF8String.fromString("\\s+")
val strings = s.split(sep, -1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UTF8String.split simply calls java String.split. Since we only need to get the current word for error message, can we just call String.split?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, now String.split shall work, let me change this.

@SparkQA
Copy link

SparkQA commented Nov 26, 2019

Test build #114451 has finished for PR 26662 at commit 4d3750d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 26, 2019

Test build #114453 has finished for PR 26662 at commit c6fd813.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 26, 2019

Test build #114465 has finished for PR 26662 at commit 08aabe6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in ed0c33f Nov 26, 2019
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@yaooqinn yaooqinn deleted the SPARK-30026 branch November 27, 2019 00:40
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
…erval string

### What changes were proposed in this pull request?

We are now able to handle whitespaces for integral and fractional types, and the leading or trailing whitespaces for interval, date, and timestamps. But the current interval parser is not able to identify whitespaces as separates as PostgreSQL can do.

This PR makes the whitespaces handling be consistent for nterval values.
Typed interval literal, multi-unit representation, and casting from strings are all supported.

```sql
postgres=# select interval E'1 \t day';
 interval
----------
 1 day
(1 row)

postgres=# select interval E'1\t' day;
 interval
----------
 1 day
(1 row)
```

### Why are the changes needed?

Whitespace handling should be consistent for interval value, and across different types in Spark.
PostgreSQL feature parity.

### Does this PR introduce any user-facing change?
Yes, the interval string of multi-units values which separated by whitespaces can be valid now.

### How was this patch tested?
add ut.

Closes apache#26662 from yaooqinn/SPARK-30026.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants