
[SPARK-24393][SQL] SQL builtin: isinf #21482

Closed

Conversation

NihalHarish

What changes were proposed in this pull request?

Implemented isinf to test whether a float or double value is Infinity.
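
A minimal usage sketch of the proposed function (hedged: it assumes the DataFrame API name isinf in functions.scala, which is still under discussion below):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, isinf}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// One finite value and one positive infinity.
val df = Seq(1.5, Double.PositiveInfinity).toDF("x")

// isinf should be true only for the infinite row.
df.select(col("x"), isinf(col("x")).as("is_inf")).show()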

How was this patch tested?

Unit tests have been added to
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/NullExpressionsSuite.scala

@vanzin
Contributor

vanzin commented Jun 1, 2018

ok to test

@squito
Contributor

squito commented Jun 1, 2018

Jenkins, ok to test

@@ -24,7 +24,7 @@ import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, Project}
import org.apache.spark.sql.types._

class NullExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
class NullExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
Contributor

Revert this?

checkEvaluation(IsInf(Literal(Float.NegativeInfinity)), true)
checkEvaluation(IsInf(Literal.create(null, DoubleType)), false)
checkEvaluation(IsInf(Literal(Float.MaxValue)), false)
checkEvaluation(IsInf(Literal(5.5f)), false)
Contributor

check NaN as well?

Author

Added the checks in my later commits
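
A sketch of what the requested NaN coverage could look like (an assumption about the later commits, not a quote of them):

// NaN is not infinite, so IsInf should evaluate to false for it.
checkEvaluation(IsInf(Literal(Double.NaN)), false)
checkEvaluation(IsInf(Literal(Float.NaN)), false)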

* Return true iff the column is Infinity.
*
* @group normal_funcs
* @since 1.6.0
Contributor

Need to fix these versions, here and elsewhere. This change would land in Spark 2.4.0.

ev.copy(code = code"""
${eval.code}
${CodeGenerator.javaType(dataType)} ${ev.value} = ${CodeGenerator.defaultValue(dataType)};
${ev.value} = !${eval.isNull} && Double.isInfinite(${eval.value});""",
Contributor

out of interest, why use Double.isInfinite here, but value.isInfinity in the non-codegen version?

Author

The non-codegen version uses the isInfinity method defined for Scala's Double and Float, whereas the codegen version uses Java's static isInfinite method defined on java.lang.Double and java.lang.Float.
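
A minimal illustration of the two call styles (the generated-code line is a sketch of what codegen emits, not the literal output):

// Interpreted path: Scala enriches Double/Float with an isInfinity method.
val scalaSide = Double.PositiveInfinity.isInfinity            // true

// Codegen emits plain Java, which has no such enrichment, so it calls the
// static helper on java.lang.Double instead, e.g.:
//   boolean value = java.lang.Double.isInfinite(input);
val javaSide = java.lang.Double.isInfinite(Double.PositiveInfinity)  // true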

* Evaluates to `true` iff it's Infinity.
*/
@ExpressionDescription(
usage = "_FUNC_(expr) - Returns True evaluates to infinite else returns False ",
Contributor

"True evaluates" -> "True if expr evaluates"

@SparkQA

SparkQA commented Jun 2, 2018

Test build #91405 has finished for PR 21482 at commit bcdaab2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class IsInf(child: Expression) extends UnaryExpression

@SparkQA

SparkQA commented Jun 2, 2018

Test build #91408 has finished for PR 21482 at commit 9ab0eb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @group normal_funcs
* @since 2.4.0
*/
def isinf(e: Column): Column = withExpr { IsInf(e.expr) }
Member

Mind if I ask you to elaborate on isinf vs. isInf across the APIs?

Author

I have followed what seemed to be the pre-existing convention for function names in those particular files. For example, in functions.scala all function names are lower case, but in Column.scala they are camel case.

Member

I guess someone should clarify whether Column.isFoo vs. the function isfoo is the right pattern we want to stay with...

@@ -557,6 +557,14 @@ class Column(val expr: Expression) extends Logging {
(this >= lowerBound) && (this <= upperBound)
}

/**
* True if the current expression is NaN.
Member

Is this the same as isNaN?

@@ -468,6 +468,18 @@ def input_file_name():
return Column(sc._jvm.functions.input_file_name())


@since(2.4)
def isinf(col):
"""An expression that returns true iff the column is NaN.
Member

Ditto. Is this the same as isnan?

#' @rdname column_nonaggregate_functions
#' @aliases isnan isnan,Column-method
#' @note isinf since 2.4.0
setMethod("isInf",
Member

R has is.infinite. Can we match the behaviour and rename it?

Member

I like the idea, but we might not have a way to extend it (sort of)

> showMethods("is.finite")
Function: is.finite (package base)
> is.finite
function (x)  .Primitive("is.finite")

It looks like S3 without a generic.

examples = """
Examples:
> SELECT _FUNC_(1/0);
True
Member

Can you run the example and check the results?
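
Worth noting when re-running it: in Spark SQL, division by zero returns NULL rather than Infinity, so this example would not print true (a quick check, assuming a local spark session):

// 1/0 evaluates to NULL in Spark SQL, not Infinity,
// so _FUNC_(1/0) cannot be used to demonstrate a true result.
spark.sql("SELECT 1/0 AS q").show()
// +----+
// |   q|
// +----+
// |null|
// +----+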

* Evaluates to `true` iff it's Infinity.
*/
@ExpressionDescription(
usage = "_FUNC_(expr) - Returns True if expr evaluates to infinite else returns False ",
Member

True -> true, False -> false to be consistent

R/pkg/NAMESPACE Outdated
@@ -281,6 +281,8 @@ exportMethods("%<=>%",
"initcap",
"input_file_name",
"instr",
"isInf",
"isinf",
Member

I really appreciate the attempt to include R. One question, though: why do we have both isInf and isinf? Why not just isinf, like Python?

Member

@viirya viirya Jun 3, 2018

Yeah, we may have isInf or isinf here.

Author

I have just followed what was done for isnan, which also has isNaN.

Contributor

The functions are case-insensitive, so I don't think we need both?

@SparkQA

SparkQA commented Jun 2, 2018

Test build #91418 has finished for PR 21482 at commit f34bfdc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 2, 2018

Test build #91417 has finished for PR 21482 at commit 069f9d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case DoubleType | FloatType =>
ev.copy(code = code"""
${eval.code}
${CodeGenerator.javaType(dataType)} ${ev.value} = ${CodeGenerator.defaultValue(dataType)};
Member

We can assign ev.value directly here without a default value.
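
A sketch of the suggested shape (an assumption about what the reviewer means, mirroring IsNaN's doGenCode rather than quoting the merged code): declare and assign ev.value in one statement instead of initializing it to a default first.

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
  val eval = child.genCode(ctx)
  // Declare and assign the boolean result directly; no default value needed.
  ev.copy(code = code"""
    ${eval.code}
    ${CodeGenerator.javaType(dataType)} ${ev.value} =
      !${eval.isNull} && Double.isInfinite(${eval.value});""", isNull = FalseLiteral)
}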

#' @rdname column_nonaggregate_functions
#' @aliases isnan isnan,Column-method
#' @note isinf since 2.4.0
setMethod("isinf",
Member

We don't need a duplicate method definition like this. Maybe we can follow isNaN's approach if we really need to have both isinf and isInf.

Author

Would it be alright if I omit isinf completely and only implement isInf?

Member

For the isNaN case, functions.R only defines isnan; isNaN is defined in column.R. If we really want both isInf and isinf, like isNaN and isnan, it would be better to follow that split instead of having two duplicate method definitions here.

#' @details
#' \code{isinf}: Returns true if the column is Infinity.
#' @rdname column_nonaggregate_functions
#' @aliases isnan isnan,Column-method
Member

@aliases is incorrect here.

override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
val eval = child.genCode(ctx)
child.dataType match {
case DoubleType | FloatType =>
Member

@viirya viirya Jun 3, 2018

Is this match block necessary since there is only one case pattern?

Author

The function can only test for infinity on the Double and Float data types, hence the match on the child's data type.

Member

I think we will only see double and float types here because of inputTypes.
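
For reference, the constraint being referred to is presumably an inputTypes declaration like the one IsNaN uses (a sketch, assuming this expression mirrors it), which makes the analyzer reject any non-floating-point child before codegen runs:

// With this in place, doGenCode only ever sees DoubleType or FloatType children.
override def inputTypes: Seq[AbstractDataType] =
  Seq(TypeCollection(DoubleType, FloatType))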

@SparkQA

SparkQA commented Jun 3, 2018

Test build #91423 has finished for PR 21482 at commit 7e396f7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

setGeneric("isInf", function(x) { standardGeneric("isInf") })

#' @rdname columnfunctions
setGeneric("isinf", function(x) { standardGeneric("isinf") })
Member

Lower-case isnan is not one of the column functions; see https://github.com/NihalHarish/spark/blob/7e396f70f58ffd309e7f738751f3aa8cfe321ce7/R/pkg/R/generics.R#L1002.
@rdname columnfunctions will cause it to go to the wrong doc page.

R/pkg/NAMESPACE Outdated
@@ -281,6 +281,8 @@ exportMethods("%<=>%",
"initcap",
"input_file_name",
"instr",
"isInf",
"isinf",
Member

add tests for these?

Author

Added tests in my latest commit

#' @details
#' \code{isinf}: Returns true if the column is Infinity.
#' @rdname column_nonaggregate_functions
#' @note isinf since 2.4.0
Member

missing @aliases

@SparkQA

SparkQA commented Jun 4, 2018

Test build #91459 has finished for PR 21482 at commit 13b5aaa.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 4, 2018

Test build #91462 has finished for PR 21482 at commit d381f0c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 7, 2018

Test build #91518 has finished for PR 21482 at commit 6a4d46e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 7, 2018

Test build #91517 has finished for PR 21482 at commit f240fdf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@henryr
Contributor

henryr commented Jun 7, 2018

@rxin, that in itself is a bit weird, but there are ways to express inf values in Scala and thus inf values can show up flowing through Spark plans. I'm not sure MySQL has any such facility.
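
For example (an illustration, not code from this PR), infinities can enter a plan through a Scala literal or through double arithmetic that overflows:

import org.apache.spark.sql.functions.lit

// A Scala-side literal puts Infinity directly into the plan...
val withInf = spark.range(1).select(lit(Double.PositiveInfinity).as("x"))

// ...and double arithmetic can overflow to Infinity as well.
val overflowed = spark.range(1).select((lit(Double.MaxValue) * 2).as("y"))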

@rxin
Contributor

rxin commented Jun 7, 2018

Thanks, Henry. In general I'm not a huge fan of adding something because hypothetically somebody might want it. Also if you want this to be compatible with Impala, wouldn't you want to name this the same way as Impala?

@henryr
Contributor

henryr commented Jun 7, 2018 via email

@SparkQA

SparkQA commented Jun 8, 2018

Test build #91575 has finished for PR 21482 at commit 559900a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Jun 9, 2018

Test build #91601 has finished for PR 21482 at commit 559900a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -468,6 +468,18 @@ def input_file_name():
return Column(sc._jvm.functions.input_file_name())


@since(2.4)
def isinf(col):
Member

Shall we expose this to column.py too?

Author

Do you want me to add the function to column.py?

Contributor

@HyukjinKwon could you clarify, please?

Member

Yes, please because I see it's exposed in Column.scala.

Author

Added it in my latest commit

@henryr
Contributor

henryr commented Jun 21, 2018

Any further comments here?

@HyukjinKwon
Member

I have no more comments except the one above.

@SparkQA

SparkQA commented Jun 22, 2018

Test build #92224 has finished for PR 21482 at commit b727838.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([\
Row(name=u'Tom', height=80.0),\
Row(name=u'Alice', height=float('inf'))])
Member

nit:

>>> df = spark.createDataFrame([
...     Row(name=u'Tom', height=80.0),
...     Row(name=u'Alice', height=float('inf'))
... ])

@SparkQA

SparkQA commented Jun 24, 2018

Test build #92262 has finished for PR 21482 at commit 6bd6735.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 25, 2018

Test build #92307 has finished for PR 21482 at commit be23549.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 26, 2018

Test build #92315 has finished for PR 21482 at commit cb8f9d0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

>>> from pyspark.sql import Row
>>> df = spark.createDataFrame([
Row(name=u'Tom', height=80.0),
Row(name=u'Alice', height=float('inf'))
Member

The dots are required, as noted in #21482 (comment).

Author

Thanks!

@SparkQA

SparkQA commented Jun 26, 2018

Test build #92330 has finished for PR 21482 at commit d60aa21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jun 26, 2018

Hey, I have an additional thought on this. I'll leave it in the next ten minutes.

@rxin
Contributor

rxin commented Jun 26, 2018

OK, I double-checked. I don't think we should add this functionality: different databases implement it differently, and it is somewhat difficult to create Infinity in Spark SQL given that we return null or NaN.

On top of that, we already support equality for infinity, e.g.

spark.range(1).select(
  org.apache.spark.sql.functions.lit(java.lang.Double.POSITIVE_INFINITY) ===
    org.apache.spark.sql.functions.lit(java.lang.Double.POSITIVE_INFINITY)).show()

The above shows true.

If you start adding inf, we'd soon need to add functions for negative infinity, and these functions provide very little value beyond what we already support using equality. And where does it end? Would you start adding is0, is1, is2?

If we see a lot of usage of infinity (because users try to create it using Scala/Python literals), I'd create a SQL literal that gets parsed into infinity for easier construction of inf. But I wouldn't do that now.

@HyukjinKwon
Member

@NihalHarish shall we leave this closed for now if you don't have any opinion on ^?

@HyukjinKwon
Member

ping @NihalHarish
