
[SPARK-16323][SQL] Add IntegralDivide expression #22395

Closed · wants to merge 8 commits

Conversation

mgaido91 (Contributor)

What changes were proposed in this pull request?

The PR takes over #14036 and introduces a new expression, IntegralDivide, in order to avoid the several unneeded casts that were added previously.

In order to prove the performance gain, the following benchmark has been run:

  test("Benchmark IntegralDivide") {
    val r = new scala.util.Random(91)
    val nData = 1000000
    val testDataInt = (1 to nData).map(_ => (r.nextInt(), r.nextInt()))
    val testDataLong = (1 to nData).map(_ => (r.nextLong(), r.nextLong()))
    val testDataShort = (1 to nData).map(_ => (r.nextInt().toShort, r.nextInt().toShort))

    // old code
    val oldExprsInt = testDataInt.map(x =>
      Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))
    val oldExprsLong = testDataLong.map(x =>
      Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))
    val oldExprsShort = testDataShort.map(x =>
      Cast(Divide(Cast(Literal(x._1), DoubleType), Cast(Literal(x._2), DoubleType)), LongType))

    // new code
    val newExprsInt = testDataInt.map(x => IntegralDivide(Literal(x._1), Literal(x._2)))
    val newExprsLong = testDataLong.map(x => IntegralDivide(Literal(x._1), Literal(x._2)))
    val newExprsShort = testDataShort.map(x => IntegralDivide(Literal(x._1), Literal(x._2)))


    Seq(("Long", "old", oldExprsLong),
      ("Long", "new", newExprsLong),
      ("Int", "old", oldExprsInt),
      ("Int", "new", newExprsShort),
      ("Short", "old", oldExprsShort),
      ("Short", "new", oldExprsShort)).foreach { case (dt, t, ds) =>
      val start = System.nanoTime()
      ds.foreach(e => e.eval(EmptyRow))
      val endNoCodegen = System.nanoTime()
      println(s"Running $nData op with $t code on $dt (no-codegen): ${(endNoCodegen - start) / 1000000} ms")
    }
  }

The results on my laptop are:

Running 1000000 op with old code on Long (no-codegen): 600 ms
Running 1000000 op with new code on Long (no-codegen): 112 ms
Running 1000000 op with old code on Int (no-codegen): 560 ms
Running 1000000 op with new code on Int (no-codegen): 135 ms
Running 1000000 op with old code on Short (no-codegen): 317 ms
Running 1000000 op with new code on Short (no-codegen): 153 ms

Showing a 2-5x improvement. The benchmark doesn't include code generation, since measuring performance there is hard: for such simple operations, most of the time is spent in the code generation/compilation process.

How was this patch tested?

Added UTs.

@mgaido91 (Contributor Author)

cc @cloud-fan

@dongjoon-hyun (Member)

[SQ] -> [SQL] in the title?

@mgaido91 mgaido91 changed the title [SPARK-16323][SQ] Add IntegralDivide expression [SPARK-16323][SQL] Add IntegralDivide expression Sep 11, 2018
@SparkQA commented Sep 11, 2018

Test build #95954 has finished for PR 22395 at commit 649b458.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class IntegralDivide(left: Expression, right: Expression) extends DivModLike

@@ -72,6 +72,7 @@ package object dsl {
def - (other: Expression): Expression = Subtract(expr, other)
def * (other: Expression): Expression = Multiply(expr, other)
def / (other: Expression): Expression = Divide(expr, other)
def div (other: Expression): Expression = IntegralDivide(expr, other)
Member

The failure looks relevant.

org.scalatest.exceptions.TestFailedException: 
Expected "struct<[CAST((CAST(5 AS DOUBLE) / CAST(2 AS DOUBLE)) AS BIGINT):big]int>",
but got "struct<[(5 div 2):]int>" Schema did not match for query #19 select 5 div 2
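
Separately, for context on the dsl hunk above: the new infix div lets catalyst tests build the expression directly. A small usage sketch (assuming the usual catalyst dsl imports and attribute helpers; not code from this PR):

  import org.apache.spark.sql.catalyst.dsl.expressions._
  import org.apache.spark.sql.catalyst.expressions.IntegralDivide

  // 'a.long and 'b.long create attribute references through the dsl implicits;
  // `div` now builds an IntegralDivide instead of a cast-wrapped Divide.
  val expr = 'a.long div 'b.long
  assert(expr.isInstanceOf[IntegralDivide])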

@SparkQA commented Sep 12, 2018

Test build #95982 has finished for PR 22395 at commit a0c0849.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

LGTM, cc @viirya @gatorsmile

@viirya (Member) left a comment

LGTM

1
""",
since = "3.0.0")
case class IntegralDivide(left: Expression, right: Expression) extends DivModLike {
Member

Shall we add this to FunctionRegistry?

Contributor Author

I don't think so, please see the discussion at #14036 (comment)

Member

Ur, sorry, but why not? As @viirya suggested, without that, the description added here is meaningless.

spark-sql> describe function 'div';
Function: div not found.
Time taken: 0.016 seconds, Fetched 1 row(s)

Also, Hive accepts that like the following. (from Hive 3.1.0)

0: jdbc:hive2://ctr-e138-1518143905142-429335> describe function div;
+----------------------------------------------------+
|                      tab_name                      |
+----------------------------------------------------+
| a div b - Divide a by b rounded to the long integer |
+----------------------------------------------------+

0: jdbc:hive2://ctr-e138-1518143905142-429335> select 3 / 2, 3 div 2, `/`(3,2), `div`(3,2);
+------+------+------+------+
| _c0  | _c1  | _c2  | _c3  |
+------+------+------+------+
| 1.5  | 1    | 1.5  | 1    |
+------+------+------+------+

Contributor Author

@dongjoon-hyun because if we add it there, we can write: select div(3, 2), which is not supported by Hive.

hive> select div(3, 2);
NoViableAltException(13@[])
	at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:964)

Member

@mgaido91. I gave you the Hive example above. :)

`div`(3,2)

Contributor Author

Ah, sorry, I missed the backticks. I am adding it. Thanks.
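
For reference, making `div`(3, 2) resolvable as a function is typically a one-line registration in FunctionRegistry.scala; roughly (a sketch of the usual pattern, not the exact change):

  // Added to the expressions map in FunctionRegistry.scala (grouping/placement may differ):
  expression[IntegralDivide]("div")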

@@ -314,6 +314,27 @@ case class Divide(left: Expression, right: Expression) extends DivModLike {
override def evalOperation(left: Any, right: Any): Any = div(left, right)
}

@ExpressionDescription(
usage = "a _FUNC_ b - Divides a by b.",
Member

nit: explicitly say this is integral divide?

Contributor Author

yes, thanks, I am very bad at descriptions.

@SparkQA commented Sep 12, 2018

Test build #95995 has finished for PR 22395 at commit 315bb86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private lazy val div: (Any, Any) => Any = dataType match {
case i: IntegralType => i.integral.asInstanceOf[Integral[Any]].quot
}
override def evalOperation(left: Any, right: Any): Any = div(left, right)
Contributor

Sorry I may not recall it very clearly. Can you check Hive and other databases and see if the result type of div is always long?

Contributor Author

Sure, so:

  • Hive always returns long;
  • Postgres and SQL Server don't have a div operator, but they perform integral division when the operands are integral and return the data type of the operands (e.g. select 3 / 2 returns an integer);
  • Oracle doesn't support it.

So the behavior is not homogeneous across RDBMSs.
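
For reference, Scala's Integral#quot, which the new expression delegates to, performs truncating division; a standalone illustration, not code from the PR:

  // LongIsIntegral is the standard Integral[Long] instance from the Scala library.
  val longIntegral: Integral[Long] = scala.math.Numeric.LongIsIntegral

  longIntegral.quot(7L, 2L)   // 3: truncates toward zero
  longIntegral.quot(-7L, 2L)  // -3: also truncates toward zero, unlike floor division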

Contributor

Then I'd prefer always returning long, since that was the behavior before. We can consider changing the behavior in another PR.

Member

+1 for @cloud-fan 's suggestion.

Member

Yeah, I think it is reasonable, as that is what we defined: Hive Long Division ('DIV') in AstBuilder.scala.
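
For context, the parser-level effect is essentially swapping the old cast-heavy translation of DIV for the new expression; schematically (based on the benchmark in the description, not the exact AstBuilder diff):

  // left and right are placeholder child expressions.
  // Before: DIV was rewritten as a double division wrapped in casts back to long.
  val before = Cast(Divide(Cast(left, DoubleType), Cast(right, DoubleType)), LongType)
  // After: a single expression that divides the integral operands directly.
  val after = IntegralDivide(left, right)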

@SparkQA commented Sep 13, 2018

Test build #96040 has finished for PR 22395 at commit 02a2369.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please

@SparkQA commented Sep 13, 2018

Test build #96045 has finished for PR 22395 at commit 02a2369.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 14, 2018

Test build #96072 has finished for PR 22395 at commit fca5e62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -314,6 +314,32 @@ case class Divide(left: Expression, right: Expression) extends DivModLike {
override def evalOperation(left: Any, right: Any): Any = div(left, right)
}

@ExpressionDescription(
usage = "expr1 _FUNC_ expr2 - Returns `expr1`/`expr2`. It performs integral division.",
Contributor

Let's mention that it always returns long. Maybe we can take a look at how Hive documents it.

Member

"Divide a by b rounded to the long integer" is Hive's div documentation.

Contributor Author

Yes, thanks @viirya, I am updating to that sentence.
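
With that wording, the annotation might read along these lines (a sketch; the merged text may differ slightly):

  @ExpressionDescription(
    usage = "expr1 _FUNC_ expr2 - Divide `expr1` by `expr2` rounded to the long integer.",
    examples = """
      Examples:
        > SELECT 3 _FUNC_ 2;
         1
    """,
    since = "3.0.0")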

@cloud-fan (Contributor)

LGTM except one comment

@viirya (Member) commented Sep 14, 2018

LGTM

@gatorsmile (Member)

Could we check the definition of div in MySQL? Is it the same as the one implemented in this PR?

https://dev.mysql.com/doc/refman/8.0/en/arithmetic-functions.html#operator_div

@mgaido91 (Contributor Author)

@gatorsmile I checked on MySQL 5.6 and there are 2 differences between MySQL's div and the current implementation:

@SparkQA commented Sep 14, 2018

Test build #96079 has finished for PR 22395 at commit 71255a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member)

Retest this please

checkEvaluation(IntegralDivide(Literal(1.toLong), Literal(2.toLong)), 0L)
checkEvaluation(IntegralDivide(positiveShortLit, negativeShortLit), 0L)
checkEvaluation(IntegralDivide(positiveIntLit, negativeIntLit), 0L)
checkEvaluation(IntegralDivide(positiveLongLit, negativeLongLit), 0L)
Member

Could you add a test case for divide by zero like test("/ (Divide) basic")?

For now, this PR seems to follow the behavior of Spark / instead of Hive div. We had better be clear on our decision and prevent future unintended behavior changes.

scala> sql("select 2 / 0, 2 div 0").show()
+---------------------------------------+---------+
|(CAST(2 AS DOUBLE) / CAST(0 AS DOUBLE))|(2 div 0)|
+---------------------------------------+---------+
|                                   null|     null|
+---------------------------------------+---------+
0: jdbc:hive2://ctr-e138-1518143905142-477481> select 2 / 0;
+-------+
|  _c0  |
+-------+
| NULL  |
+-------+

0: jdbc:hive2://ctr-e138-1518143905142-477481> select 2 div 0;
Error: Error while compiling statement: FAILED:
SemanticException [Error 10014]: Line 1:7 Wrong arguments '0':
org.apache.hadoop.hive.ql.metadata.HiveException:
Unable to execute method public org.apache.hadoop.io.LongWritable org.apache.hadoop.hive.ql.udf.UDFOPLongDivide.evaluate(org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.LongWritable)
with arguments {2,0}:/ by zero (state=42000,code=10014)

Contributor

good catch! We should clearly define the behavior in the doc string too.

Contributor Author

The test for this case is present in operators.sql (anyway, if you prefer me to add a case here too, just let me know and I'll add it). Since this function already exists in our code - it is just translated to a normal divide plus a cast - we currently return null, and throwing an exception for it would be a behavior change (and quite a disruptive one, too). Do we really want to follow Hive's behavior on this?

Member

I think we don't really need to change current behavior, but it is worth describing this in the doc string.

Contributor Author

I agree with you @viirya. I updated the doc string with the current behavior. Thanks.
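
For illustration, a divide-by-zero check along the lines requested above could look like this (a sketch using the suite's checkEvaluation helper; the PR's actual coverage lives in operators.sql):

  // Integral division by zero keeps the existing Spark behavior and yields null.
  checkEvaluation(IntegralDivide(Literal(2L), Literal(0L)), null)
  checkEvaluation(IntegralDivide(Literal(-2L), Literal(0L)), null)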

@SparkQA commented Sep 17, 2018

Test build #96114 has finished for PR 22395 at commit 71255a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 17, 2018

Test build #96128 has finished for PR 22395 at commit c471bef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

> SELECT 3 _FUNC_ 2;
1
""",
since = "3.0.0")
Contributor

the next version will be 2.5.0 AFAIK.

@cloud-fan (Contributor)

LGTM

@SparkQA commented Sep 17, 2018

Test build #96141 has finished for PR 22395 at commit 3550d29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

+1, LGTM.

@dongjoon-hyun (Member)

Merged to master.

@dongjoon-hyun (Member)

Thank you, @mgaido91 !

@asfgit asfgit closed this in 553af22 Sep 17, 2018
@mgaido91 (Contributor Author)

thank you all for the reviews

@rxin (Contributor) commented Sep 17, 2018

Why are we always returning long type here? Shouldn't it be the same as the left expr's type? See MySQL:

Query OK, 1 row affected (0.02 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> describe rxin_temp;
+--------------------+---------------+------+-----+---------+-------+
| Field              | Type          | Null | Key | Default | Extra |
+--------------------+---------------+------+-----+---------+-------+
| 4 div 2            | int(1)        | YES  |     | NULL    |       |
| 123456789124 div 2 | bigint(12)    | YES  |     | NULL    |       |
| 4 / 2              | decimal(5,4)  | YES  |     | NULL    |       |
| 123456789124 / 2   | decimal(16,4) | YES  |     | NULL    |       |
+--------------------+---------------+------+-----+---------+-------+
4 rows in set (0.01 sec)

@dongjoon-hyun (Member)

@rxin. We made a decision to follow Hive behavior here.

@cloud-fan (Contributor)

To clarify, it's not following Hive, but following the behavior of previous Spark versions, which is the same as Hive's.

I also think returning the left operand's type is more reasonable, but we should do it in another PR since it's a behavior change, and we should also add a migration guide for it.

@mgaido91 do you have time to do this change? Thanks!

@rxin (Contributor) commented Sep 18, 2018 via email

@mgaido91 (Contributor Author)

Sure @cloud-fan, I'll create a JIRA and submit a PR for it.

> Looks like a use case for a legacy config.

Yes, thanks for the suggestion @rxin, I agree.
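
A follow-up along those lines could gate the result type behind a legacy flag; a rough sketch (the SQLConf entry name here is hypothetical, not from this PR):

  // Hypothetical follow-up, not part of this PR: keep returning LongType only when the
  // legacy flag is enabled, otherwise return the type of the left operand.
  override def dataType: DataType =
    if (SQLConf.get.getConf(SQLConf.LEGACY_INTEGRAL_DIVIDE_RETURN_LONG)) LongType else left.dataType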

override def inputType: AbstractDataType = IntegralType
override def dataType: DataType = LongType

override def symbol: String = "/"
Member

What is the reason we are using / here? Any benefit?

Member

used in doGenCode?

Contributor Author

Yes, exactly, it is used there.
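
For context, DivModLike's code generation splices the operator symbol between the two operands in the emitted Java, so "/" produces plain integral division there. A simplified, self-contained sketch of the idea (not the actual doGenCode source):

  // Builds the Java expression for the division step; with symbol "/" and long operands
  // this yields e.g. "(long)(value_0 / value_1)".
  def divideSnippet(javaType: String, leftCode: String, rightCode: String, symbol: String): String =
    s"($javaType)($leftCode $symbol $rightCode)"

  divideSnippet("long", "value_0", "value_1", "/")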
