Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-29870][SQL] Unify the logic of multi-units interval string to CalendarInterval #26491

Closed
wants to merge 15 commits into from
Closed

Conversation

yaooqinn
Copy link
Member

What changes were proposed in this pull request?

We now have two different implementation for multi-units interval strings to CalendarInterval type values.

One is used to covert interval string literals to CalendarInterval. This approach will re-delegate the interval string to spark parser which handles the string as a singleInterval -> multiUnitsInterval -> eventually call IntervalUtils.fromUnitStrings

The other is used in Cast, which eventually calls IntervalUtils.stringToInterval. This approach is ~10 times faster than the other.

We should unify these two for better performance and simple logic. this pr uses the 2nd approach.

Why are the changes needed?

We should unify these two for better performance and simple logic.

Does this PR introduce any user-facing change?

no

How was this patch tested?

we shall not fail on existing uts

@yaooqinn
Copy link
Member Author

cc @cloud-fan @MaxGekk @maropu @HyukjinKwon thanks very much in advance.

v + " " + u
}
val str = kvs.mkString(" ")
IntervalUtils.fromString(str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we at here, the parsing is already done by antlr, so this may make things slower. But it's better to unify the code as the perf doesn't matter that much for interval literals.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, the performance improvement is actually for type constructor to parse interval string literals, which can't be seen via code change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a particular modified IntervalBenchmark test, which mocks the type constructor logic, which is directly different with an IntervalUtils.fromString call only.

  private def addCase(benchmark: Benchmark, cardinality: Long, units: Seq[String]): Unit = {
    Seq(true, false).foreach { withPrefix =>
      val expr = buildString(withPrefix, units).cast("interval")
      val note = if (withPrefix) "w/ interval" else "w/o interval"
      benchmark.addCase(s"${units.length + 1} units $note", numIters = 3) { _ =>
//        doBenchmark(cardinality, expr)
        (0L until cardinality).foreach(_ => IntervalUtils.fromString(units.mkString(" ")))
      }
    }
  }

we can see huge perfomance improment here. Any way, this is just used to parse typed literals, not a big deal acturally.

info]   Running case: 1 units w/ interval
[info]   Stopped after 3 iterations, 98544 ms
[info]   Running case: 1 units w/o interval
[info]   Stopped after 3 iterations, 78871 ms
[info]   Running case: 2 units w/ interval
[info]   Stopped after 3 iterations, 72469 ms
[info]   Running case: 2 units w/o interval
[info]   Stopped after 3 iterations, 78753 ms
[info]   Running case: 1 units w/ interval
[info]   Stopped after 3 iterations, 8926 ms
[info]   Running case: 1 units w/o interval
[info]   Stopped after 3 iterations, 8881 ms
[info]   Running case: 2 units w/ interval
[info]   Stopped after 3 iterations, 8773 ms
[info]   Running case: 2 units w/o interval
[info]   Stopped after 3 iterations, 8815 ms

@SparkQA
Copy link

SparkQA commented Nov 13, 2019

Test build #113668 has finished for PR 26491 at commit 49b74c3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Copy link
Member Author

benchmark test

with val N = 5000000, performace seems not to be affected that much

before

[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_65-b17 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
[info] cast strings to intervals:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval                         2198           2231          46          2.3         439.6       1.0X
[info] prepare string w/o interval                        2189           2636         388          2.3         437.8       1.0X
[info] 1 units w/ interval                                2256           2315          76          2.2         451.2       1.0X
[info] 1 units w/o interval                               2106           2182          95          2.4         421.2       1.0X
[info] 2 units w/ interval                                3137           3597         792          1.6         627.5       0.7X
[info] 2 units w/o interval                               3038           3299         241          1.6         607.6       0.7X
[info] 3 units w/ interval                                6747           7466         638          0.7        1349.3       0.3X
[info] 3 units w/o interval                               6905           7583         605          0.7        1381.0       0.3X
[info] 4 units w/ interval                                7621           8126         785          0.7        1524.2       0.3X
[info] 4 units w/o interval                               7323           8823        1305          0.7        1464.5       0.3X
[info] 5 units w/ interval                                8297           8522         378          0.6        1659.4       0.3X
[info] 5 units w/o interval                               8215           8220           6          0.6        1642.9       0.3X
[info] 6 units w/ interval                                9022           9111          78          0.6        1804.4       0.2X
[info] 6 units w/o interval                               9642          12338         NaN          0.5        1928.4       0.2X
[info] 7 units w/ interval                                9343          48099        1662          0.5        1868.6       0.2X
[info] 7 units w/o interval                               8879           8941          54          0.6        1775.7       0.2X
[info] 8 units w/ interval                               10401          10537         225          0.5        2080.1       0.2X
[info] 8 units w/o interval                              10398          10413          20          0.5        2079.6       0.2X
[info] 9 units w/ interval                               10206          10373         189          0.5        2041.2       0.2X
[info] 9 units w/o interval                              10100          10151          45          0.5        2020.0       0.2X
[info] 10 units w/ interval                              10887          11881        1596          0.5        2177.5       0.2X
[info] 10 units w/o interval                             11430          13329        1754          0.4        2286.1       0.2X
[info] 11 units w/ interval                              13046          14058         877          0.4        2609.2       0.2X
[info] 11 units w/o interval                             12059          12297         265          0.4        2411.7       0.2X
[info]

after

info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_65-b17 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i5-5287U CPU @ 2.90GHz
[info] cast strings to intervals:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] prepare string w/ interval                         2055           2230         202          2.4         411.1       1.0X
[info] prepare string w/o interval                        1894           1938          71          2.6         378.8       1.1X
[info] 1 units w/ interval                                2235           2285          68          2.2         447.1       0.9X
[info] 1 units w/o interval                               2046           2076          38          2.4         409.1       1.0X
[info] 2 units w/ interval                                3024           3624         978          1.7         604.8       0.7X
[info] 2 units w/o interval                               4622           4963         336          1.1         924.3       0.4X
[info] 3 units w/ interval                                6794           7369         499          0.7        1358.7       0.3X
[info] 3 units w/o interval                               6090           7030         815          0.8        1217.9       0.3X
[info] 4 units w/ interval                                8100          10666         NaN          0.6        1619.9       0.3X
[info] 4 units w/o interval                               8063          10349         NaN          0.6        1612.5       0.3X
[info] 5 units w/ interval                                8429           8630         318          0.6        1685.8       0.2X
[info] 5 units w/o interval                               8448           9335        1365          0.6        1689.7       0.2X
[info] 6 units w/ interval                                8671           8707          31          0.6        1734.2       0.2X
[info] 6 units w/o interval                               8613           8638          38          0.6        1722.5       0.2X
[info] 7 units w/ interval                                8924           8981          68          0.6        1784.8       0.2X
[info] 7 units w/o interval                               8858           8889          52          0.6        1771.5       0.2X
[info] 8 units w/ interval                               10640          11768         979          0.5        2127.9       0.2X
[info] 8 units w/o interval                              10990          11551         554          0.5        2197.9       0.2X
[info] 9 units w/ interval                               10783          11849        1110          0.5        2156.5       0.2X
[info] 9 units w/o interval                              10408          12267        1609          0.5        2081.6       0.2X
[info] 10 units w/ interval                              11245          13406         851          0.4        2249.1       0.2X
[info] 10 units w/o interval                             11331          11444         102          0.4        2266.2       0.2X
[info] 11 units w/ interval                              12480          13403         919          0.4        2496.0       0.2X
[info] 11 units w/o interval                             12001          12009           7          0.4        2400.2       0.2X

@SparkQA
Copy link

SparkQA commented Nov 13, 2019

Test build #113675 has finished for PR 26491 at commit 803d454.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yaooqinn
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Nov 13, 2019

Test build #113686 has finished for PR 26491 at commit 803d454.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

The behavior of select interval '.1111111111' second becomes different after this PR

@yaooqinn
Copy link
Member Author

The behavior of select interval '.1111111111' second becomes different after this PR

Add to the pr desc or need to document it?

@@ -789,7 +789,7 @@ class DDLParserSuite extends AnalysisTest with SharedSparkSession {
assertError("select interval '23:61:15' hour to second",
"minute 61 outside range [0, 59]")
assertError("select interval '.1111111111' second",
"nanosecond 1111111111 outside range")
"Invalid interval string")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually exposes a problem of stringToInterval: the error reporting becomes worse.

I think the reason is stringToInterval returns null to indicate invalid input, but then there is no chance to know what is the exact failure. Can we make stringToInterval throw exception? cc @MaxGekk

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it used to replace safeFromString, we may could wrap a try catch here and use a safe flag to control throw exception or return null

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of a safe flag, I'd like to have 2 methods like before

  1. stringToInterval which throws exception on failure
  2. safeStringToInterval which calls stringToInterval and try-catch to return null on failure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i will check on this right now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. Initially I thought about raising exceptions from stringToInterval and catch them on the next level to convert to null. For now, SGTM.

Copy link
Member Author

@yaooqinn yaooqinn Nov 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this part, I encounter another bug here, current string to interval cast logic does not support i.e. cast('.111 second' as interval) which will fail in SIGN state and return null, actually, it is 00:00:00.111. I had fixed in the current pr, I also wonder if I do it separately to fix the bug first.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the results of the master branch.

-- !query 63
select interval '.111 seconds'
-- !query 63 schema
struct<0.111 seconds:interval>
-- !query 63 output
0.111 seconds


-- !query 64
select cast('.111 seconds' as interval)
-- !query 64 schema
struct<CAST(.111 seconds AS INTERVAL):interval>
-- !query 64 output
NULL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the newly added interval parser has a bug...

@SparkQA
Copy link

SparkQA commented Nov 13, 2019

Test build #113714 has finished for PR 26491 at commit 79f5892.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 13, 2019

Test build #113715 has finished for PR 26491 at commit 0814fc4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 13, 2019

Test build #113703 has finished for PR 26491 at commit ed3b35f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 13, 2019

Test build #113718 has finished for PR 26491 at commit 3b1667e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 14, 2019

Test build #113743 has finished for PR 26491 at commit 5aa09ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -482,13 +409,17 @@ object IntervalUtils {
}
}

def nextWord: UTF8String = {
s.substring(i, s.numBytes()).subStringIndex(UTF8String.blankString(1), 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for error reporting, so perf doesn't matter. Can we make it more readable? like

s.substring(i, s.numBytes()).toString.split("\\s+").head

@@ -547,14 +478,16 @@ object IntervalUtils {
case ' ' =>
fraction /= NANOS_PER_MICROS.toInt
state = TRIM_BEFORE_UNIT
case _ => return null
case _ if '0' <= b && b <= '9' =>
throwIAE(s"invalid value fractional part '$fraction$nextWord' out of range")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about interval can only support nanosecond precision, '$nextWord' is out of range

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW can we really implement nextWord correctly? We need to know where we start to parse a number, seems we don't track it now.

Copy link
Member Author

@yaooqinn yaooqinn Nov 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about interval can only support nanosecond precision, '$nextWord' is out of range

the is not suitable, 0.9999999999 the nextword will be 9 only but '$fraction$nextWord' is the exact 9999999999

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd improve this.

}
state = UNIT_SUFFIX
case UNIT_SUFFIX =>
b match {
case 's' => state = UNIT_END
case ' ' => state = TRIM_BEFORE_SIGN
case _ => return null
case _ => throwIAE(s"invalid unit suffix '$nextWord'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invalid unit '$nextWord' is better if we can implement nextword correctly. Or we should introduce "currentWord"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the nextword represents which character the error occurs, the only special case is for nanoseconds out of range, the fraction + nextword could just exactly show the out of range number

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or we change the logic of unit parsing we can extract the whole part like case _ if b>= 'A' $$ b<= 'z' => unit = s.substring(i, s.numBytes()).subStringIndex(UTF8String.blankString(1), 1) than do unit case matching and error capture.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for error reporting, usually backtracing is necessary. For example, it's better to tell users that 123a is not valid, instead of just saying a is not valid.

@SparkQA
Copy link

SparkQA commented Nov 15, 2019

Test build #113851 has finished for PR 26491 at commit da7f9e8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -482,13 +409,19 @@ object IntervalUtils {
}
}

def currentWord: String = {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan this method should be able to extract the error word

@SparkQA
Copy link

SparkQA commented Nov 15, 2019

Test build #113864 has finished for PR 26491 at commit 61626ff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -482,13 +409,19 @@ object IntervalUtils {
}
}

def currentWord: UTF8String = {
val strings = s.split(UTF8String.blankString(1), -1)
val lenLeft = s.substring(i, s.numBytes()).split(UTF8String.blankString(1), -1).length
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still use UTF8String here, because the trim here in val s = input.trim.toLowerCase does handle '\n' '\t' ..., If we use toString here, we need to additional logic to avoid ArrayIndexOutOfBoundsException

@SparkQA
Copy link

SparkQA commented Nov 15, 2019

Test build #113883 has finished for PR 26491 at commit e12bb86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkFromInvalidString("1.5 days", "Error parsing interval string")
checkFromInvalidString("1. hour", "Error parsing interval string")
checkFromInvalidString("1.5 days", "'days' cannot have fractional part")
checkFromInvalidString("1. hour", "'hour' cannot have fractional part")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does 1. second work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it does. But I guess this should not work, we may just expose another bug in the new string to interval parser. we may need a follow-up to disable this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checked pgsql

cloud0fan=# select interval '1.' second;
ERROR:  invalid input syntax for type interval: "1."
LINE 1: select interval '1.' second;

we should fix it later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll raise a jira and an pr later

checkFromInvalidString("1 Mour", "invalid unit 'mour'")
checkFromInvalidString("1 aour", "invalid unit 'aour'")
checkFromInvalidString("1a1 hour", "invalid value '1a1'")
checkFromInvalidString("1.1a1 seconds", "invalid value '1.1a1' in fractional part")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: seems like invalid value '1.1a1' is more clearer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea

def currentWord: UTF8String = {
val strings = s.split(UTF8String.blankString(1), -1)
val lenLeft = s.substring(i, s.numBytes()).split(UTF8String.blankString(1), -1).length
strings(strings.length - lenLeft)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC lenLeft should be lenRight?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I guess left is a bit ambiguous here, right is better

@SparkQA
Copy link

SparkQA commented Nov 16, 2019

Test build #113918 has finished for PR 26491 at commit 50bfd2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class DescribeNamespaceStatement(
  • case class ShowTableStatement(
  • case class DescribeNamespace(
  • case class DescribeNamespaceExec(

@SparkQA
Copy link

SparkQA commented Nov 16, 2019

Test build #113923 has finished for PR 26491 at commit 32d5194.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants