New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-16323][SQL] Add IntegralDivide expression #22395
Conversation
cc @cloud-fan |
|
Test build #95954 has finished for PR 22395 at commit
|
@@ -72,6 +72,7 @@ package object dsl { | |||
def - (other: Expression): Expression = Subtract(expr, other) | |||
def * (other: Expression): Expression = Multiply(expr, other) | |||
def / (other: Expression): Expression = Divide(expr, other) | |||
def div (other: Expression): Expression = IntegralDivide(expr, other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failure looks like relevant.
org.scalatest.exceptions.TestFailedException:
Expected "struct<[CAST((CAST(5 AS DOUBLE) / CAST(2 AS DOUBLE)) AS BIGINT):big]int>",
but got "struct<[(5 div 2):]int>" Schema did not match for query #19 select 5 div 2
Test build #95982 has finished for PR 22395 at commit
|
LGTM, cc @viirya @gatorsmile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
1 | ||
""", | ||
since = "3.0.0") | ||
case class IntegralDivide(left: Expression, right: Expression) extends DivModLike { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add this to FunctionRegistry
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, please see the discussion at #14036 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ur, sorry, but why not? As @viirya suggested, without that, the description added here is not meaningless.
spark-sql> describe function 'div';
Function: div not found.
Time taken: 0.016 seconds, Fetched 1 row(s)
Also, Hive accepts that like the following. (from Hive 3.1.0)
0: jdbc:hive2://ctr-e138-1518143905142-429335> describe function div;
+----------------------------------------------------+
| tab_name |
+----------------------------------------------------+
| a div b - Divide a by b rounded to the long integer |
+----------------------------------------------------+
0: jdbc:hive2://ctr-e138-1518143905142-429335> select 3 / 2, 3 div 2, `/`(3,2), `div`(3,2);
+------+------+------+------+
| _c0 | _c1 | _c2 | _c3 |
+------+------+------+------+
| 1.5 | 1 | 1.5 | 1 |
+------+------+------+------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun because if we add it there, we can write: select div(3, 2)
, which is not supported by Hive.
hive> select div(3, 2);
NoViableAltException(13@[])
at org.apache.hadoop.hive.ql.parse.HiveParser_SelectClauseParser.selectClause(HiveParser_SelectClauseParser.java:964)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 . I gave you the example of Hive in the above. :)
`div`(3,2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, sorry I missed the back-ticks. I am adding it, sorry. Thanks.
@@ -314,6 +314,27 @@ case class Divide(left: Expression, right: Expression) extends DivModLike { | |||
override def evalOperation(left: Any, right: Any): Any = div(left, right) | |||
} | |||
|
|||
@ExpressionDescription( | |||
usage = "a _FUNC_ b - Divides a by b.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: explicitly say this is integral divide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, thanks, I am very bad at descriptions.
Test build #95995 has finished for PR 22395 at commit
|
private lazy val div: (Any, Any) => Any = dataType match { | ||
case i: IntegralType => i.integral.asInstanceOf[Integral[Any]].quot | ||
} | ||
override def evalOperation(left: Any, right: Any): Any = div(left, right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I may not recall it very clearly. Can you check Hive and other databases and see if the result type of div
is always long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, so:
- Hive returns always long;
- Postgres and SQLServer don't have a
div
operator but they perform integral division when the operands are integrals and return the datatype of the operands (eg.select 3 / 2
returns an integer); - Oracle doesn't support it.
So the behavior is not homogeneous among the RDBMs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then I'd prefer always returning long, since it was the behavior before. We can consider changing the behavior in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for @cloud-fan 's suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it is reasonable as that is what we defined: Hive Long Division: 'DIV'
in AstBuilder.scala
.
Test build #96040 has finished for PR 22395 at commit
|
Retest this please |
Test build #96045 has finished for PR 22395 at commit
|
Test build #96072 has finished for PR 22395 at commit
|
@@ -314,6 +314,32 @@ case class Divide(left: Expression, right: Expression) extends DivModLike { | |||
override def evalOperation(left: Any, right: Any): Any = div(left, right) | |||
} | |||
|
|||
@ExpressionDescription( | |||
usage = "expr1 _FUNC_ expr2 - Returns `expr1`/`expr2`. It performs integral division.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's mention that it always return long. Maybe we can take a look at how Hive document it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Divide a by b rounded to the long integer
, this is Hive's div
document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, thanks @viirya, I am updating to that sentence
LGTM except one comment |
LGTM |
Could we check the definition of div in MySQL? Is it the same as the one implemented in this PR? https://dev.mysql.com/doc/refman/8.0/en/arithmetic-functions.html#operator_div |
@gatorsmile I checked on MySQL 5.6 and there are 2 differences between MySQL's
|
Test build #96079 has finished for PR 22395 at commit
|
Retest this please |
checkEvaluation(IntegralDivide(Literal(1.toLong), Literal(2.toLong)), 0L) | ||
checkEvaluation(IntegralDivide(positiveShortLit, negativeShortLit), 0L) | ||
checkEvaluation(IntegralDivide(positiveIntLit, negativeIntLit), 0L) | ||
checkEvaluation(IntegralDivide(positiveLongLit, negativeLongLit), 0L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test case for divide by zero
like test("/ (Divide) basic")
?
For now, this PR seems to follow the behavior of Spark /
instead of Hive div
. We had better be clear on our decision and prevent future unintended behavior changes.
scala> sql("select 2 / 0, 2 div 0").show()
+---------------------------------------+---------+
|(CAST(2 AS DOUBLE) / CAST(0 AS DOUBLE))|(2 div 0)|
+---------------------------------------+---------+
| null| null|
+---------------------------------------+---------+
0: jdbc:hive2://ctr-e138-1518143905142-477481> select 2 / 0;
+-------+
| _c0 |
+-------+
| NULL |
+-------+
0: jdbc:hive2://ctr-e138-1518143905142-477481> select 2 div 0;
Error: Error while compiling statement: FAILED:
SemanticException [Error 10014]: Line 1:7 Wrong arguments '0':
org.apache.hadoop.hive.ql.metadata.HiveException:
Unable to execute method public org.apache.hadoop.io.LongWritable org.apache.hadoop.hive.ql.udf.UDFOPLongDivide.evaluate(org.apache.hadoop.io.LongWritable,org.apache.hadoop.io.LongWritable)
with arguments {2,0}:/ by zero (state=42000,code=10014)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! We should clearly define the behavior in the doc string too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test for this case is present in operators.sql
(anyway, if you prefer me to add a case here too, just let me know and I'll add it). And since we already have this function in our code indeed - it is just translated to a normal divide + a cast - currently we are returning null
and throwing an exception for it would be a behavior change (and a quite disruptive too). Do we really want to follow Hive's behavior on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't really need to change current behavior, but it is worth describing this in the doc string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you @viirya. I updated the doc string with the current behavior. Thanks.
Test build #96114 has finished for PR 22395 at commit
|
Test build #96128 has finished for PR 22395 at commit
|
> SELECT 3 _FUNC_ 2; | ||
1 | ||
""", | ||
since = "3.0.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the next version will be 2.5.0 AFAIK.
LGTM |
Test build #96141 has finished for PR 22395 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
Merged to the master. |
Thank you, @mgaido91 ! |
thank you all for the reviews |
why are we always returning long type here? shouldn't they be the same as the left expr's type? see mysql
|
To clarify, it's not following hive, but following the behavior of previous Spark versions, which is same as hive. I also think returning left operand's type is more reasonable, but we should do it in another PR since it's a behavior change, and we should also add migration guide for it. @mgaido91 do you have time to do this change? Thanks! |
Looks like a use case for a legacy config.
On Mon, Sep 17, 2018 at 6:41 PM Wenchen Fan ***@***.***> wrote:
To clarify, it's not following hive, but following the behavior of
previous Spark versions, which is same as hive.
I also think returning left operand's type is more reasonable, but we
should do it in another PR since it's a behavior change, and we should also
add migration guide for it.
@mgaido91 <https://github.com/mgaido91> do you have time to do this
change? Thanks!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#22395 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AATvPDeW3F4Jsc-gS6CFrrGZY_lFXGxbks5ucE9WgaJpZM4Wjmfh>
.
--
--
excuse the brevity and lower case due to wrist injury
|
Sure @cloud-fan, I'll create a JIRA and submit a PR for it.
Yes, thanks for the suggestion @rxin, I agree. |
override def inputType: AbstractDataType = IntegralType | ||
override def dataType: DataType = LongType | ||
|
||
override def symbol: String = "/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason we are using /
here? Any benefit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used in doGenCode
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, exactly, it is used there
What changes were proposed in this pull request?
The PR takes over #14036 and it introduces a new expression
IntegralDivide
in order to avoid the several unneded cast added previously.In order to prove the performance gain, the following benchmark has been run:
The results on my laptop are:
Showing a 2-5X improvement. The benchmark doesn't include code generation as it is pretty hard to test the performance there as for such simple operations the most of the time is spent in the code generation/compilation process.
How was this patch tested?
added UTs