
[SPARK-16323][SQL] Add IntegralDivide expression #22395

Closed · wants to merge 8 commits

Conversation

mgaido91 (Contributor)

What changes were proposed in this pull request?

The PR takes over #14036 and introduces a new expression, IntegralDivide, in order to avoid the several unneeded casts that were added previously.

In order to prove the performance gain, the following benchmark has been run:

  test("Benchmark IntegralDivide") {
    val r = new scala.util.Random(91)
    val nData = 1000000
    val testDataInt = (1 to nData).map(_ => (r.nextInt(), r.nextInt()))
    val testDataLong = (1 to nData).map(_ => (r.nextLong(), r.nextLong()))
    val testDataShort = (1 to nData).map(_ => (r.nextInt().toShort, r.nextInt().toShort))

    // old code
    val oldExprsInt = testDataInt.map(x =>
      Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))
    val oldExprsLong = testDataLong.map(x =>
      Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))
    val oldExprsShort = testDataShort.map(x =>
      Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))

    // new code
    val newExprsInt = testDataInt.map(x => IntegralDivide(Literal(x._1), Literal(x._2)))
    val newExprsLong = testDataLong.map(x => IntegralDivide(Literal(x._1), Literal(x._2)))
    val newExprsShort = testDataShort.map(x => IntegralDivide(Literal(x._1), Literal(x._2)))


    Seq(("Long", "old", oldExprsLong),
      ("Long", "new", newExprsLong),
      ("Int", "old", oldExprsInt),
      ("Int", "new", newExprsShort),
      ("Short", "old", oldExprsShort),
      ("Short", "new", oldExprsShort)).foreach { case (dt, t, ds) =>
      val start = System.nanoTime()
      ds.foreach(e => e.eval(EmptyRow))
      val endNoCodegen = System.nanoTime()
      println(s"Running $nData op with $t code on $dt (no-codegen): ${(endNoCodegen - start) / 1000000} ms")
    }
  }

The results on my laptop are:

Running 1000000 op with old code on Long (no-codegen): 600 ms
Running 1000000 op with new code on Long (no-codegen): 112 ms
Running 1000000 op with old code on Int (no-codegen): 560 ms
Running 1000000 op with new code on Int (no-codegen): 135 ms
Running 1000000 op with old code on Short (no-codegen): 317 ms
Running 1000000 op with new code on Short (no-codegen): 153 ms

Showing a 2-5x improvement. The benchmark doesn't include code generation, since measuring performance there is hard: for such simple operations, most of the time is spent in the code generation/compilation process.

How was this patch tested?

Added UTs.

@mgaido91 (Contributor Author)

cc @cloud-fan

@dongjoon-hyun (Member)

[SQ] -> [SQL] in the title?

@mgaido91 mgaido91 changed the title [SPARK-16323][SQ] Add IntegralDivide expression [SPARK-16323][SQL] Add IntegralDivide expression Sep 11, 2018
@SparkQA commented Sep 11, 2018

Test build #95954 has finished for PR 22395 at commit 649b458.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class IntegralDivide(left: Expression, right: Expression) extends DivModLike

@@ -72,6 +72,7 @@ package object dsl {
def - (other: Expression): Expression = Subtract(expr, other)
def * (other: Expression): Expression = Multiply(expr, other)
def / (other: Expression): Expression = Divide(expr, other)
def div (other: Expression): Expression = IntegralDivide(expr, other)
Member

The failure looks relevant.

org.scalatest.exceptions.TestFailedException: 
Expected "struct<[CAST((CAST(5 AS DOUBLE) / CAST(2 AS DOUBLE)) AS BIGINT):big]int>",
but got "struct<[(5 div 2):]int>" Schema did not match for query #19 select 5 div 2
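
Separately, for context on the dsl hunk above: the new infix div lets catalyst tests build the expression directly. A small usage sketch (assuming the usual catalyst dsl imports and attribute helpers; not code from this PR):

  import org.apache.spark.sql.catalyst.dsl.expressions._
  import org.apache.spark.sql.catalyst.expressions.IntegralDivide

  // 'a.long and 'b.long create attribute references through the dsl implicits;
  // `div` now builds an IntegralDivide instead of a cast-wrapped Divide.
  val expr = 'a.long div 'b.long
  assert(expr.isInstanceOf[IntegralDivide])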

@SparkQA commented Sep 12, 2018

Test build #95982 has finished for PR 22395 at commit a0c0849.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

LGTM, cc @viirya @gatorsmile

@viirya (Member) left a comment

LGTM

1
""",
since = "3.0.0")
case class IntegralDivide(left: Expression, right: Expression) extends DivModLike {
Member

Shall we add this to FunctionRegistry?

Contributor Author

I don't think so, please see the discussion at #14036 (comment)

Member

Ur, sorry, but why not? As @viirya suggested, without that, the description added here is meaningless.

spark-sql> describe function 'div';
Function: div not found.
Time taken: 0.016 seconds, Fetched 1 row(s)

Also, Hive accepts that like the following. (from Hive 3.1.0)

0: jdbc:hive2://ctr-e138-1518143905142-429335> describe function div;
+----------------------------------------------------+
|                      tab_name                      |
+----------------------------------------------------+
| a div b - Divide a by b rounded to the long integer |
+----------------------------------------------------+

0: jdbc:hive2://ctr-e138-1518143905142-429335> select 3 / 2, 3 div 2, `/`(3,2), `div`(3,2);
+------+------+------+------+
| _c0  | _c1  | _c2  | _c3  |
+------+------+------+------+
| 1.5  | 1    | 1.5  | 1    |
+------+------+------+------+

Contributor Author

@dongjoon-hyun because if we add it there, we can write: select div(3, 2), which is not supported by Hive.

hive> select div(3, 2);
NoViableAltException(13@[])
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:964)

Member

@mgaido91. I gave you the Hive example above. :)

`div`(3,2)

Contributor Author

Ah, sorry, I missed the backticks. I am adding it. Thanks.
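
For reference, making `div`(3, 2) resolvable as a function is typically a one-line registration in FunctionRegistry.scala; roughly (a sketch of the usual pattern, not the exact change):

  // Added to the expressions map in FunctionRegistry.scala (grouping/placement may differ):
  expression[IntegralDivide]("div")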

@@ -314,6 +314,27 @@ case class Divide(left: Expression, right: Expression) extends DivModLike {
override def evalOperation(left: Any, right: Any): Any = div(left, right)
}

@ExpressionDescription(
usage = "a _FUNC_ b - Divides a by b.",
Member

nit: explicitly say this is integral divide?

Contributor Author

yes, thanks, I am very bad at descriptions.

@SparkQA commented Sep 12, 2018

Test build #95995 has finished for PR 22395 at commit 315bb86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private lazy val div: (Any, Any) => Any = dataType match {
case i: IntegralType => i.integral.asInstanceOf[Integral[Any]].quot
}
override def evalOperation(left: Any, right: Any): Any = div(left, right)
Contributor

Sorry I may not recall it very clearly. Can you check Hive and other databases and see if the result type of div is always long?

Contributor Author

Sure, so:

  • Hive always returns long;
  • Postgres and SQL Server don't have a div operator, but they perform integral division when the operands are integral and return the data type of the operands (e.g. select 3 / 2 returns an integer);
  • Oracle doesn't support it.

So the behavior is not homogeneous across RDBMSs.
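
For reference, Scala's Integral#quot, which the new expression delegates to, performs truncating division; a standalone illustration, not code from the PR:

  // LongIsIntegral is the standard Integral[Long] instance from the Scala library.
  val longIntegral: Integral[Long] = scala.math.Numeric.LongIsIntegral

  longIntegral.quot(7L, 2L)   // 3: truncates toward zero
  longIntegral.quot(-7L, 2L)  // -3: also truncates toward zero, unlike floor division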

Contributor

Then I'd prefer always returning long, since that was the behavior before. We can consider changing the behavior in another PR.

Member

+1 for @cloud-fan 's suggestion.

Member

Yeah, I think it is reasonable, as that is what we defined: Hive Long Division ('DIV') in AstBuilder.scala.
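
For context, the parser-level effect is essentially swapping the old cast-heavy translation of DIV for the new expression; schematically (based on the benchmark in the description, not the exact AstBuilder diff):

  // left and right are placeholder child expressions.
  // Before: DIV was rewritten as a double division wrapped in casts back to long.
  val before = Cast(Divide(Cast(left, DoubleType), Cast(right, DoubleType)), LongType)
  // After: a single expression that divides the integral operands directly.
  val after = IntegralDivide(left, right)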

@SparkQA commented Sep 13, 2018

Test build #96040 has finished for PR 22395 at commit 02a2369.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please

@SparkQA commented Sep 13, 2018

Test build #96045 has finished for PR 22395 at commit 02a2369.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 14, 2018

Test build #96072 has finished for PR 22395 at commit fca5e62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -314,6 +314,32 @@ case class Divide(left: Expression, right: Expression) extends DivModLike {
override def evalOperation(left: Any, right: Any): Any = div(left, right)
}

@ExpressionDescription(
usage = "expr1 _FUNC_ expr2 - Returns `expr1`/`expr2`. It performs integral division.",
Contributor

Let's mention that it always returns long. Maybe we can take a look at how Hive documents it.

Member

"Divide a by b rounded to the long integer" is Hive's div documentation.

Contributor Author

Yes, thanks @viirya, I am updating to that sentence.
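
With that wording, the annotation might read along these lines (a sketch; the merged text may differ slightly):

  @ExpressionDescription(
    usage = "expr1 _FUNC_ expr2 - Divide `expr1` by `expr2` rounded to the long integer.",
    examples = """
      Examples:
        > SELECT 3 _FUNC_ 2;
         1
    """,
    since = "3.0.0")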

@cloud-fan (Contributor)

LGTM except one comment

@viirya (Member) commented Sep 14, 2018

LGTM

@gatorsmile (Member)

Could we check the definition of div in MySQL? Is it the same as the one implemented in this PR?

https://dev.mysql.com/doc/refman/8.0/en/arithmetic-functions.html#operator_div

@mgaido91 (Contributor Author)

@gatorsmile I checked on MySQL 5.6 and there are 2 differences between MySQL's div and the current implementation:

@SparkQA commented Sep 14, 2018

Test build #96079 has finished for PR 22395 at commit 71255a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please

checkEvaluation(IntegralDivide(Literal(1.toLong), Literal(2.toLong)), 0L)
checkEvaluation(IntegralDivide(positiveShortLit, negativeShortLit), 0L)
checkEvaluation(IntegralDivide(positiveIntLit, negativeIntLit), 0L)
checkEvaluation(IntegralDivide(positiveLongLit, negativeLongLit), 0L)
Member

Could you add a test case for divide by zero like test("/ (Divide) basic")?

For now, this PR seems to follow the behavior of Spark / instead of Hive div. We had better be clear on our decision and prevent future unintended behavior changes.

scala> sql("select 2 / 0, 2 div 0").show()
+---------------------------------------+---------+
|(CAST(2 AS DOUBLE) / CAST(0 AS DOUBLE))|(2 div 0)|
+---------------------------------------+---------+
|                                   null|     null|
+---------------------------------------+---------+
0: jdbc:hive2://ctr-e138-1518143905142-477481> select 2 / 0;
+-------+
|  _c0  |
+-------+
| NULL  |
+-------+

0: jdbc:hive2://ctr-e138-1518143905142-477481> select 2 div 0;
Error: Error while compiling statement: FAILED:
SemanticException [Error 10014]: Line 1:7 Wrong arguments '0':
org.apache.hadoop.hive.ql.metadata.HiveException:
Unable to execute method public org.apache.hadoop.io.LongWritable org.apache.hadoop.hive.ql.udf.UDFOPLongDivide.evaluate(org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.LongWritable)
with arguments {2,0}:/ by zero (state=42000,code=10014)

Contributor

good catch! We should clearly define the behavior in the doc string too.

Contributor Author

The test for this case is present in operators.sql (anyway, if you prefer me to add a case here too, just let me know and I'll add it). Since this function already exists in our code - it is just translated to a normal divide plus a cast - we currently return null, and throwing an exception for it would be a behavior change (and quite a disruptive one, too). Do we really want to follow Hive's behavior on this?

Member

I think we don't really need to change current behavior, but it is worth describing this in the doc string.

Contributor Author

I agree with you @viirya. I updated the doc string with the current behavior. Thanks.
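
For illustration, a divide-by-zero check along the lines requested above could look like this (a sketch using the suite's checkEvaluation helper; the PR's actual coverage lives in operators.sql):

  // Integral division by zero keeps the existing Spark behavior and yields null.
  checkEvaluation(IntegralDivide(Literal(2L), Literal(0L)), null)
  checkEvaluation(IntegralDivide(Literal(-2L), Literal(0L)), null)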

@SparkQA commented Sep 17, 2018

Test build #96114 has finished for PR 22395 at commit 71255a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 17, 2018

Test build #96128 has finished for PR 22395 at commit c471bef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

> SELECT 3 _FUNC_ 2;
1
""",
since = "3.0.0")
Contributor

the next version will be 2.5.0 AFAIK.

@cloud-fan (Contributor)

LGTM

@SparkQA commented Sep 17, 2018

Test build #96141 has finished for PR 22395 at commit 3550d29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

+1, LGTM.

@dongjoon-hyun (Member)

Merged to master.

@dongjoon-hyun (Member)

Thank you, @mgaido91 !

@asfgit asfgit closed this in 553af22 Sep 17, 2018
@mgaido91 (Contributor Author)

thank you all for the reviews

@rxin (Contributor) commented Sep 17, 2018

Why are we always returning long type here? Shouldn't it be the same as the left expr's type? See MySQL:

Query OK, 1 row affected (0.02 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> describe rxin_temp;
+--------------------+---------------+------+-----+---------+-------+
| Field              | Type          | Null | Key | Default | Extra |
+--------------------+---------------+------+-----+---------+-------+
| 4 div 2            | int(1)        | YES  |     | NULL    |       |
| 123456789124 div 2 | bigint(12)    | YES  |     | NULL    |       |
| 4 / 2              | decimal(5,4)  | YES  |     | NULL    |       |
| 123456789124 / 2   | decimal(16,4) | YES  |     | NULL    |       |
+--------------------+---------------+------+-----+---------+-------+
4 rows in set (0.01 sec)

@dongjoon-hyun (Member)

@rxin. We made a decision to follow Hive behavior here.

@cloud-fan (Contributor)

To clarify, it's not following Hive, but following the behavior of previous Spark versions, which is the same as Hive's.

I also think returning the left operand's type is more reasonable, but we should do it in another PR since it's a behavior change, and we should also add a migration guide for it.

@mgaido91 do you have time to do this change? Thanks!

@rxin (Contributor) commented Sep 18, 2018 via email

@mgaido91 (Contributor Author)

Sure @cloud-fan, I'll create a JIRA and submit a PR for it.

> Looks like a use case for a legacy config.

Yes, thanks for the suggestion @rxin, I agree.
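
A follow-up along those lines could gate the result type behind a legacy flag; a rough sketch (the SQLConf entry name here is hypothetical, not from this PR):

  // Hypothetical follow-up, not part of this PR: keep returning LongType only when the
  // legacy flag is enabled, otherwise return the type of the left operand.
  override def dataType: DataType =
    if (SQLConf.get.getConf(SQLConf.LEGACY_INTEGRAL_DIVIDE_RETURN_LONG)) LongType else left.dataType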

override def inputType: AbstractDataType = IntegralType
override def dataType: DataType = LongType

override def symbol: String = "/"
Member

What is the reason we are using / here? Any benefit?

Member

used in doGenCode?

Contributor Author

Yes, exactly, it is used there.
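
For context, DivModLike's code generation splices the operator symbol between the two operands in the emitted Java, so "/" produces plain integral division there. A simplified, self-contained sketch of the idea (not the actual doGenCode source):

  // Builds the Java expression for the division step; with symbol "/" and long operands
  // this yields e.g. "(long)(value_0 / value_1)".
  def divideSnippet(javaType: String, leftCode: String, rightCode: String, symbol: String): String =
    s"($javaType)($leftCode $symbol $rightCode)"

  divideSnippet("long", "value_0", "value_1", "/")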
