New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-24393][SQL] SQL builtin: isinf #21482
[SPARK-24393][SQL] SQL builtin: isinf #21482
Conversation
ok to test |
Jenkins, ok to test |
@@ -24,7 +24,7 @@ import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull | |||
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, Project} | |||
import org.apache.spark.sql.types._ | |||
|
|||
class NullExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { | |||
class NullExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert this?
checkEvaluation(IsInf(Literal(Float.NegativeInfinity)), true) | ||
checkEvaluation(IsInf(Literal.create(null, DoubleType)), false) | ||
checkEvaluation(IsInf(Literal(Float.MaxValue)), false) | ||
checkEvaluation(IsInf(Literal(5.5f)), false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check NaN as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the checks in my later commits
* Return true iff the column is Infinity. | ||
* | ||
* @group normal_funcs | ||
* @since 1.6.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to fix these versions, here and elsewhere. This change would land in Spark 2.4.0.
ev.copy(code = code""" | ||
${eval.code} | ||
${CodeGenerator.javaType(dataType)} ${ev.value} = ${CodeGenerator.defaultValue(dataType)}; | ||
${ev.value} = !${eval.isNull} && Double.isInfinite(${eval.value});""", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of interest, why use Double.isInfinite
here, but value.isInfinity
in the non-codegen version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The non-codegen version uses the isInfinity method defined for scala's Double and Float, whereas the codegen version uses java's static method "isInfinite" defined for the classes Double and Float.
* Evaluates to `true` iff it's Infinity. | ||
*/ | ||
@ExpressionDescription( | ||
usage = "_FUNC_(expr) - Returns True evaluates to infinite else returns False ", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"True evaluates" -> "True if expr evaluates"
Test build #91405 has finished for PR 21482 at commit
|
Test build #91408 has finished for PR 21482 at commit
|
* @group normal_funcs | ||
* @since 2.4.0 | ||
*/ | ||
def isinf(e: Column): Column = withExpr { IsInf(e.expr) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind if I ask to elaborate isinf
vs isInf
across the APIs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have followed what seemed to be the preexistent convention for function names in those particular files. For example in functions.scala
all function names were in lower case but in Column.scala
all function names were in camel case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess someone should elaborate if Column.isFoo vs function's isfoo is the right pattern we want to stay with...
@@ -557,6 +557,14 @@ class Column(val expr: Expression) extends Logging { | |||
(this >= lowerBound) && (this <= upperBound) | |||
} | |||
|
|||
/** | |||
* True if the current expression is NaN. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
? is this the same with isNaN
?
python/pyspark/sql/functions.py
Outdated
@@ -468,6 +468,18 @@ def input_file_name(): | |||
return Column(sc._jvm.functions.input_file_name()) | |||
|
|||
|
|||
@since(2.4) | |||
def isinf(col): | |||
"""An expression that returns true iff the column is NaN. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto. is this the same with isnan
?
R/pkg/R/functions.R
Outdated
#' @rdname column_nonaggregate_functions | ||
#' @aliases isnan isnan,Column-method | ||
#' @note isinf since 2.4.0 | ||
setMethod("isInf", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
R has is.infinite
. Can we match the behaviour and rename it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea, but we might not have a way to extend it (sort of)
> showMethods("is.finite")
Function: is.finite (package base)
> is.finite
function (x) .Primitive("is.finite")
It looks like S3 without a generic.
examples = """ | ||
Examples: | ||
> SELECT _FUNC_(1/0); | ||
True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you run the example and check the results?
* Evaluates to `true` iff it's Infinity. | ||
*/ | ||
@ExpressionDescription( | ||
usage = "_FUNC_(expr) - Returns True if expr evaluates to infinite else returns False ", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True -> true, False -> false to be consistent
R/pkg/R/functions.R
Outdated
#' @rdname column_nonaggregate_functions | ||
#' @aliases isnan isnan,Column-method | ||
#' @note isinf since 2.4.0 | ||
setMethod("isInf", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea, but we might not have a way to extend it (sort of)
> showMethods("is.finite")
Function: is.finite (package base)
> is.finite
function (x) .Primitive("is.finite")
It looks like S3 without a generic.
R/pkg/NAMESPACE
Outdated
@@ -281,6 +281,8 @@ exportMethods("%<=>%", | |||
"initcap", | |||
"input_file_name", | |||
"instr", | |||
"isInf", | |||
"isinf", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really appreciate the attempt to include R, though a question, why do we have isInf
and isinf
?
why not just isinf
like python?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we may have isInf
or isinf
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have just followed what has been done for isnan
, which also has isNan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the functions are case insensitive so i don't think we need both?
Test build #91418 has finished for PR 21482 at commit
|
Test build #91417 has finished for PR 21482 at commit
|
case DoubleType | FloatType => | ||
ev.copy(code = code""" | ||
${eval.code} | ||
${CodeGenerator.javaType(dataType)} ${ev.value} = ${CodeGenerator.defaultValue(dataType)}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can assign ev.value
directly here without a default value.
R/pkg/R/functions.R
Outdated
#' @rdname column_nonaggregate_functions | ||
#' @aliases isnan isnan,Column-method | ||
#' @note isinf since 2.4.0 | ||
setMethod("isinf", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to have duplicate method definition like this. Maybe we can follow isNaN
's approach if we really need to have both isinf
and isInf
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be alright if I omit isinf
completely and only implement isInf
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For isNaN
case, functions.R
only defines isnan
. isNaN
is defined in column.R
. If we really want to have both isInf
and isinf
like isNaN
and isnan
, maybe to follow it instead having two duplicate method definition here is better.
R/pkg/R/functions.R
Outdated
#' @details | ||
#' \code{isinf}: Returns true if the column is Infinity. | ||
#' @rdname column_nonaggregate_functions | ||
#' @aliases isnan isnan,Column-method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aliases
is incorrect here.
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { | ||
val eval = child.genCode(ctx) | ||
child.dataType match { | ||
case DoubleType | FloatType => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this match block necessary since there is only one case pattern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function can only test for infinity values for datatypes Double and Float, and hence we need to match the child datatype with these types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we will only see double and float types here because of inputTypes
.
Test build #91423 has finished for PR 21482 at commit
|
* @group normal_funcs | ||
* @since 2.4.0 | ||
*/ | ||
def isinf(e: Column): Column = withExpr { IsInf(e.expr) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess someone should elaborate if Column.isFoo vs function's isfoo is the right pattern we want to stay with...
R/pkg/R/generics.R
Outdated
setGeneric("isInf", function(x) { standardGeneric("isInf") }) | ||
|
||
#' @rdname columnfunctions | ||
setGeneric("isinf", function(x) { standardGeneric("isinf") }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isnan
lower case is not a column functions see https://github.com/NihalHarish/spark/blob/7e396f70f58ffd309e7f738751f3aa8cfe321ce7/R/pkg/R/generics.R#L1002
@rdname columnfunctions
will cause it to go to the wrong doc page
R/pkg/NAMESPACE
Outdated
@@ -281,6 +281,8 @@ exportMethods("%<=>%", | |||
"initcap", | |||
"input_file_name", | |||
"instr", | |||
"isInf", | |||
"isinf", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add tests for these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added tests in my latest commit
R/pkg/R/functions.R
Outdated
#' @details | ||
#' \code{isinf}: Returns true if the column is Infinity. | ||
#' @rdname column_nonaggregate_functions | ||
#' @note isinf since 2.4.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing @aliases
Test build #91459 has finished for PR 21482 at commit
|
Test build #91462 has finished for PR 21482 at commit
|
Test build #91518 has finished for PR 21482 at commit
|
Test build #91517 has finished for PR 21482 at commit
|
@rxin, that in itself is a bit weird, but there are ways to express inf values in Scala and thus inf values can show up flowing through Spark plans. I'm not sure MySQL has any such facility. |
Thanks, Henry. In general I'm not a huge fan of adding something because hypothetically somebody might want it. Also if you want this to be compatible with Impala, wouldn't you want to name this the same way as Impala? |
I think consistency in Spark's naming convention (and therefore increased
discoverability by users) outweighs the advantage of naming it exactly for
the Impala equivalent. I do agree that multiple aliases probably aren't
worth the trouble at this point.
FWIW, I would have used this function if it had been available recently. So
it's not just hypothetical. And to me this provides some symmetry for
support for 'special' float values, since we already have isnan().
…On 7 June 2018 at 15:36, Reynold Xin ***@***.***> wrote:
Thanks, Henry. In general I'm not a huge fan of adding something because
hypothetically somebody might want it. Also if you want this to be
compatible with Impala, wouldn't you want to name this the same way as
Impala?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#21482 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAFc1dxwI0XBXVvtW7qXKOu3BmbuXEV0ks5t6asDgaJpZM4UXVuP>
.
|
Test build #91575 has finished for PR 21482 at commit
|
ok to test |
Test build #91601 has finished for PR 21482 at commit
|
@@ -468,6 +468,18 @@ def input_file_name(): | |||
return Column(sc._jvm.functions.input_file_name()) | |||
|
|||
|
|||
@since(2.4) | |||
def isinf(col): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we expose this to column.py too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want me to add the function to column.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon could you clarify, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please because I see it's exposed in Column.scala.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added it in my latest commit
Any further comments here? |
I have no more comments except the one above. |
Test build #92224 has finished for PR 21482 at commit
|
python/pyspark/sql/column.py
Outdated
>>> from pyspark.sql import Row | ||
>>> df = spark.createDataFrame([\ | ||
Row(name=u'Tom', height=80.0),\ | ||
Row(name=u'Alice', height=float('inf'))]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
>>> df = spark.createDataFrame([
... Row(name=u'Tom', height=80.0),
... Row(name=u'Alice', height=float('inf'))
... ])
Test build #92262 has finished for PR 21482 at commit
|
Test build #92307 has finished for PR 21482 at commit
|
Test build #92315 has finished for PR 21482 at commit
|
python/pyspark/sql/column.py
Outdated
>>> from pyspark.sql import Row | ||
>>> df = spark.createDataFrame([ | ||
Row(name=u'Tom', height=80.0), | ||
Row(name=u'Alice', height=float('inf')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dots are required as written in #21482 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Test build #92330 has finished for PR 21482 at commit
|
Hey I have an additional thought on this. Will leave it in the next ten mins. |
OK I double checked. I don't think we should be adding this functionality, since different databases implemented it differently, and it is somewhat difficult to create Infinity in Spark SQL given we return null or nan. On top of that, we already support equality for infinity, e.g.
The above shows true. If you start adding inf, we'd need to soon add functions for negative infinity, and these functions provide very little value beyond what we already support using equality. And where does it end? Would you start adding is0, is1, is2? If we see a lot of usage of infinity (because users try to create it using Scala/Python literals), I'd create a SQL literal that get parsed into infinity for easier construction of inf. But I wouldn't do that now. |
@NihalHarish shall we leave this closed for now if you don't have any opinion on ^? |
ping @NihalHarish |
What changes were proposed in this pull request?
Implemented isinf to test if a float or double value is Infinity.
How was this patch tested?
Unit tests have been added to
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/NullExpressionsSuite.scala