[SPARK-24393][SQL] SQL builtin: isinf #21482

NihalHarish · 2018-06-01T21:28:18Z

What changes were proposed in this pull request?

Implemented isinf to test if a float or double value is Infinity.

How was this patch tested?

Unit tests have been added to
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/NullExpressionsSuite.scala

vanzin · 2018-06-01T21:36:51Z

ok to test

squito · 2018-06-01T21:36:54Z

Jenkins, ok to test

henryr · 2018-06-01T21:31:43Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/NullExpressionsSuite.scala

@@ -24,7 +24,7 @@ import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
 import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, Project}
 import org.apache.spark.sql.types._

-class NullExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
+  class NullExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {


Revert this?

henryr · 2018-06-01T21:32:24Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/NullExpressionsSuite.scala

+    checkEvaluation(IsInf(Literal(Float.NegativeInfinity)), true)
+    checkEvaluation(IsInf(Literal.create(null, DoubleType)), false)
+    checkEvaluation(IsInf(Literal(Float.MaxValue)), false)
+    checkEvaluation(IsInf(Literal(5.5f)), false)


check NaN as well?

Added the checks in my later commits
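
As a rough illustration of what these checks assert, including the NaN case, here is a plain-Python sketch. `is_inf` and `FLOAT32_MAX` are hypothetical names, not Spark code, and `None` is assumed to stand in for a SQL null. The NaN expectation follows from IEEE 754 semantics rather than from the later commits, which are not shown in this thread.

```python
import math
from typing import Optional

# Value of Scala's Float.MaxValue, the largest finite 32-bit float.
FLOAT32_MAX = 3.4028234663852886e38

def is_inf(value: Optional[float]) -> bool:
    """Hypothetical null-safe infinity test mirroring the reviewed checks."""
    if value is None:
        # Mirrors checkEvaluation(IsInf(Literal.create(null, DoubleType)), false).
        return False
    return math.isinf(value)

checks = {
    "negative infinity": is_inf(float("-inf")),
    "null": is_inf(None),
    "largest finite float": is_inf(FLOAT32_MAX),
    "5.5": is_inf(5.5),
    "NaN": is_inf(float("nan")),
}
for label, result in checks.items():
    print(f"{label}: {result}")
```

Only the infinity input evaluates to true; null, NaN and every finite value evaluate to false.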

henryr · 2018-06-01T21:33:02Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

+   * Return true iff the column is Infinity.
+   *
+   * @group normal_funcs
+   * @since 1.6.0


Need to fix these versions, here and elsewhere. This change would land in Spark 2.4.0.

henryr · 2018-06-01T21:38:57Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala

+        ev.copy(code = code"""
+          ${eval.code}
+          ${CodeGenerator.javaType(dataType)} ${ev.value} = ${CodeGenerator.defaultValue(dataType)};
+          ${ev.value} = !${eval.isNull} && Double.isInfinite(${eval.value});""",


out of interest, why use Double.isInfinite here, but value.isInfinity in the non-codegen version?

The non-codegen version uses the isInfinity method defined on Scala's Double and Float, whereas the codegen version emits Java code and therefore uses the static method isInfinite defined on Java's Double and Float classes.

henryr · 2018-06-01T21:39:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala

+ * Evaluates to `true` iff it's Infinity.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns True evaluates to infinite else returns False ",


"True evaluates" -> "True if expr evaluates"

SparkQA · 2018-06-02T01:29:35Z

Test build #91405 has finished for PR 21482 at commit bcdaab2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class IsInf(child: Expression) extends UnaryExpression

SparkQA · 2018-06-02T02:43:57Z

Test build #91408 has finished for PR 21482 at commit 9ab0eb2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-06-02T03:07:40Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

+   * @group normal_funcs
+   * @since 2.4.0
+   */
+  def isinf(e: Column): Column = withExpr { IsInf(e.expr) }


Mind if I ask you to elaborate on isinf vs isInf across the APIs?

I have followed what seemed to be the pre-existing convention for function names in those particular files. For example, in functions.scala all function names are in lower case, but in Column.scala all function names are in camel case.

I guess someone should elaborate on whether Column.isFoo vs the function's isfoo is the pattern we want to stay with...

HyukjinKwon · 2018-06-02T03:09:20Z

sql/core/src/main/scala/org/apache/spark/sql/Column.scala

@@ -557,6 +557,14 @@ class Column(val expr: Expression) extends Logging {
    (this >= lowerBound) && (this <= upperBound)
  }

+  /**
+   * True if the current expression is NaN.


Is this the same as isNaN?

HyukjinKwon · 2018-06-02T03:09:44Z

python/pyspark/sql/functions.py

@@ -468,6 +468,18 @@ def input_file_name():
    return Column(sc._jvm.functions.input_file_name())


+@since(2.4)
+def isinf(col):
+    """An expression that returns true iff the column is NaN.


Ditto. Is this the same as isnan?

HyukjinKwon · 2018-06-02T03:11:07Z

R/pkg/R/functions.R

+#' @rdname column_nonaggregate_functions
+#' @aliases isnan isnan,Column-method
+#' @note isinf since 2.4.0
+setMethod("isInf",


R has is.infinite. Can we match the behaviour and rename it?

I like the idea, but we might not have a way to extend it (sort of)

> showMethods("is.finite")
Function: is.finite (package base)
> is.finite
function (x) .Primitive("is.finite")

It looks like S3 without a generic.

HyukjinKwon · 2018-06-02T04:21:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala

+  examples = """
+    Examples:
+      > SELECT _FUNC_(1/0);
+       True


Can you run the example and check the results?

HyukjinKwon · 2018-06-02T04:22:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala

+ * Evaluates to `true` iff it's Infinity.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns True if expr evaluates to infinite else returns False ",


True -> true, False -> false to be consistent

felixcheung · 2018-06-02T06:29:31Z

R/pkg/R/functions.R

+#' @rdname column_nonaggregate_functions
+#' @aliases isnan isnan,Column-method
+#' @note isinf since 2.4.0
+setMethod("isInf",


I like the idea, but we might not have a way to extend it (sort of)

> showMethods("is.finite")
Function: is.finite (package base)
> is.finite
function (x) .Primitive("is.finite")

It looks like S3 without a generic.

felixcheung · 2018-06-02T06:30:47Z

R/pkg/NAMESPACE

@@ -281,6 +281,8 @@ exportMethods("%<=>%",
              "initcap",
              "input_file_name",
              "instr",
+              "isInf",
+              "isinf",


I really appreciate the attempt to include R. One question, though: why do we have both isInf and isinf?
Why not just isinf, as in Python?

Yeah, we may have isInf or isinf here.

I have just followed what has been done for isnan, which also has isNaN.

The functions are case-insensitive, so I don't think we need both?

SparkQA · 2018-06-02T12:57:17Z

Test build #91418 has finished for PR 21482 at commit f34bfdc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-06-02T13:01:05Z

Test build #91417 has finished for PR 21482 at commit 069f9d9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-06-03T01:15:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala

+      case DoubleType | FloatType =>
+        ev.copy(code = code"""
+          ${eval.code}
+          ${CodeGenerator.javaType(dataType)} ${ev.value} = ${CodeGenerator.defaultValue(dataType)};


We can assign ev.value directly here without a default value.

viirya · 2018-06-03T01:30:55Z

R/pkg/R/functions.R

+#' @rdname column_nonaggregate_functions
+#' @aliases isnan isnan,Column-method
+#' @note isinf since 2.4.0
+setMethod("isinf",


We don't need duplicate method definitions like this. Maybe we can follow isNaN's approach if we really need to have both isinf and isInf.

Would it be alright if I omit isinf completely and only implement isInf?

In the isNaN case, functions.R only defines isnan; isNaN is defined in column.R. If we really want both isInf and isinf, like isNaN and isnan, it would be better to follow that approach instead of having two duplicate method definitions here.

viirya · 2018-06-03T01:31:42Z

R/pkg/R/functions.R

+#' @details
+#' \code{isinf}: Returns true if the column is Infinity.
+#' @rdname column_nonaggregate_functions
+#' @aliases isnan isnan,Column-method


@aliases is incorrect here.

viirya · 2018-06-03T01:34:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullExpressions.scala

+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val eval = child.genCode(ctx)
+    child.dataType match {
+      case DoubleType | FloatType =>


Is this match block necessary since there is only one case pattern?

The function can only test for infinity on Double and Float values, hence we need to match the child's data type against these types.

I think we will only see double and float types here because of inputTypes.

SparkQA · 2018-06-03T05:08:43Z

Test build #91423 has finished for PR 21482 at commit 7e396f7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2018-06-03T05:07:49Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

+   * @group normal_funcs
+   * @since 2.4.0
+   */
+  def isinf(e: Column): Column = withExpr { IsInf(e.expr) }


I guess someone should elaborate on whether Column.isFoo vs the function's isfoo is the pattern we want to stay with...

felixcheung · 2018-06-03T05:08:50Z

R/pkg/R/generics.R

+setGeneric("isInf", function(x) { standardGeneric("isInf") })
+
+#' @rdname columnfunctions
+setGeneric("isinf", function(x) { standardGeneric("isinf") })


Lowercase isnan is not a column function; see https://github.com/NihalHarish/spark/blob/7e396f70f58ffd309e7f738751f3aa8cfe321ce7/R/pkg/R/generics.R#L1002
@rdname columnfunctions will cause it to go to the wrong doc page.

felixcheung · 2018-06-03T05:10:08Z

R/pkg/NAMESPACE

@@ -281,6 +281,8 @@ exportMethods("%<=>%",
              "initcap",
              "input_file_name",
              "instr",
+              "isInf",
+              "isinf",


add tests for these?

Added tests in my latest commit

felixcheung · 2018-06-03T05:10:51Z

R/pkg/R/functions.R

+#' @details
+#' \code{isinf}: Returns true if the column is Infinity.
+#' @rdname column_nonaggregate_functions
+#' @note isinf since 2.4.0


missing @aliases

SparkQA · 2018-06-04T20:17:00Z

Test build #91459 has finished for PR 21482 at commit 13b5aaa.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-06-04T20:29:42Z

Test build #91462 has finished for PR 21482 at commit d381f0c.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-06-07T07:05:02Z

Test build #91518 has finished for PR 21482 at commit 6a4d46e.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-06-07T07:05:02Z

Test build #91517 has finished for PR 21482 at commit f240fdf.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

henryr · 2018-06-07T21:40:50Z

@rxin, that in itself is a bit weird, but there are ways to express inf values in Scala and thus inf values can show up flowing through Spark plans. I'm not sure MySQL has any such facility.

rxin · 2018-06-07T22:35:49Z

Thanks, Henry. In general I'm not a huge fan of adding something because hypothetically somebody might want it. Also if you want this to be compatible with Impala, wouldn't you want to name this the same way as Impala?

henryr · 2018-06-07T23:19:40Z

I think consistency in Spark's naming convention (and therefore increased discoverability by users) outweighs the advantage of naming it exactly after the Impala equivalent. I do agree that multiple aliases probably aren't worth the trouble at this point. FWIW, I would have used this function if it had been available recently. So it's not just hypothetical. And to me this provides some symmetry for support for 'special' float values, since we already have isnan().

SparkQA · 2018-06-08T20:43:57Z

Test build #91575 has finished for PR 21482 at commit 559900a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-06-09T08:52:49Z

ok to test

SparkQA · 2018-06-09T12:51:03Z

Test build #91601 has finished for PR 21482 at commit 559900a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-06-12T04:21:30Z

python/pyspark/sql/functions.py

@@ -468,6 +468,18 @@ def input_file_name():
    return Column(sc._jvm.functions.input_file_name())


+@since(2.4)
+def isinf(col):


Shall we expose this to column.py too?

Do you want me to add the function to column.py?

@HyukjinKwon could you clarify, please?

Yes, please because I see it's exposed in Column.scala.

Added it in my latest commit

henryr · 2018-06-21T20:31:42Z

Any further comments here?

HyukjinKwon · 2018-06-22T05:02:46Z

I have no more comments except the one above.

SparkQA · 2018-06-22T22:17:40Z

Test build #92224 has finished for PR 21482 at commit b727838.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-06-24T01:32:11Z

python/pyspark/sql/column.py

+    >>> from pyspark.sql import Row
+    >>> df = spark.createDataFrame([\
+            Row(name=u'Tom', height=80.0),\
+            Row(name=u'Alice', height=float('inf'))])


nit:

>>> df = spark.createDataFrame([
...     Row(name=u'Tom', height=80.0),
...     Row(name=u'Alice', height=float('inf'))
... ])

SparkQA · 2018-06-24T03:24:15Z

Test build #92262 has finished for PR 21482 at commit 6bd6735.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-06-25T21:23:31Z

Test build #92307 has finished for PR 21482 at commit be23549.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-06-26T01:22:49Z

Test build #92315 has finished for PR 21482 at commit cb8f9d0.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-06-26T01:33:31Z

python/pyspark/sql/column.py

+    >>> from pyspark.sql import Row
+    >>> df = spark.createDataFrame([
+            Row(name=u'Tom', height=80.0),
+            Row(name=u'Alice', height=float('inf'))


dots are required as written in #21482 (comment)
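
The point about continuation dots can be checked with the standard `doctest` module alone. The sketch below uses plain tuples instead of `Row` and a Spark session, an assumption made so that it runs without PySpark; `WITH_DOTS`, `WITHOUT_DOTS` and `run_doctest` are illustrative names only.

```python
import doctest

# Continuation lines carry the "..." prompt, as in the reviewer's nit.
WITH_DOTS = """
>>> rows = [
...     ("Tom", 80.0),
...     ("Alice", float("inf")),
... ]
>>> len(rows)
2
"""

# Same statement without the "..." prompts: doctest reads the extra
# lines as expected output, so the first example is just `rows = [`.
WITHOUT_DOTS = """
>>> rows = [
        ("Tom", 80.0),
        ("Alice", float("inf")),
    ]
>>> len(rows)
2
"""

def run_doctest(source: str) -> doctest.TestResults:
    """Parse and run a doctest snippet, returning (failed, attempted)."""
    test = doctest.DocTestParser().get_doctest(source, {}, "snippet", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Swallow the failure report; only the counts matter here.
    return runner.run(test, out=lambda text: None)

print(run_doctest(WITH_DOTS))
print(run_doctest(WITHOUT_DOTS).failed > 0)
```

Without the dots, the opening line is compiled on its own and fails, which is consistent with the PySpark unit test failures reported for the intermediate commits.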

SparkQA · 2018-06-26T11:28:23Z

Test build #92330 has finished for PR 21482 at commit d60aa21.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2018-06-26T17:09:46Z

Hey I have an additional thought on this. Will leave it in the next ten mins.

rxin · 2018-06-26T17:14:55Z

OK I double checked. I don't think we should be adding this functionality, since different databases implemented it differently, and it is somewhat difficult to create Infinity in Spark SQL given we return null or NaN.

On top of that, we already support equality for infinity, e.g.

spark.range(1).select(
  org.apache.spark.sql.functions.lit(java.lang.Double.POSITIVE_INFINITY) ===
    org.apache.spark.sql.functions.lit(java.lang.Double.POSITIVE_INFINITY)).show()

The above shows true.

If you start adding inf, we'd soon need to add functions for negative infinity, and these functions provide very little value beyond what we already support using equality. And where does it end? Would you start adding is0, is1, is2?

If we see a lot of usage of infinity (because users try to create it using Scala/Python literals), I'd create a SQL literal that gets parsed into infinity for easier construction of inf. But I wouldn't do that now.
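
The equality-versus-predicate trade-off discussed here can be sketched with plain Python floats, independent of Spark. The helper names below are illustrative, and Spark SQL's own comparison semantics (for NaN in particular) may differ from Python's, so this is only an analogy and not a statement about Spark's behavior.

```python
import math

POS_INF = float("inf")
NEG_INF = float("-inf")

def is_pos_inf_by_equality(x: float) -> bool:
    """Equality against a literal detects exactly one sign of infinity."""
    return x == POS_INF

def is_any_inf_by_equality(x: float) -> bool:
    """Covering both signs with equality takes two comparisons."""
    return x == POS_INF or x == NEG_INF

# Compare the equality-based checks with the single built-in predicate.
for value in (POS_INF, NEG_INF, 1.0e308, float("nan")):
    print(
        value,
        is_pos_inf_by_equality(value),
        is_any_inf_by_equality(value),
        math.isinf(value),
    )
```

Both approaches agree on every input; the dedicated predicate only saves the second comparison, which is the convenience being weighed against adding a new builtin.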

HyukjinKwon · 2018-07-16T02:50:27Z

@NihalHarish shall we leave this closed for now if you don't have any opinion on ^?

HyukjinKwon · 2018-08-07T11:03:34Z

ping @NihalHarish

[SPARK-24393][SQL] SQL builtin: isinf

bcdaab2

henryr reviewed Jun 1, 2018

[SPARK-24393][SQL] made review changes

9ab0eb2

HyukjinKwon reviewed Jun 2, 2018

felixcheung reviewed Jun 2, 2018

Nihal Harish added 2 commits June 2, 2018 01:50

[SPARK-24393][SQL] made changes to comments

069f9d9

[SPARK-24393][SQL] made changes to comments

f34bfdc

viirya reviewed Jun 3, 2018

Nihal Harish added 2 commits June 2, 2018 18:53

[SPARK-24393][SQL] made changes to comments

a6c3903

[SPARK-24393][SQL] made changes to comments

7e396f7

felixcheung reviewed Jun 3, 2018

[SPARK-24393][SQL] made changes according to reviews

13b5aaa

[SPARK-24393][SQL] made changes according to reviews

d381f0c

[SPARK-24393][SQL] removed the isinf alias in R

559900a

HyukjinKwon reviewed Jun 12, 2018

added isInf to column.py

b727838

fixed multiline comment

6bd6735

HyukjinKwon reviewed Jun 24, 2018

removed backslashes

be23549

moved bracket

cb8f9d0

HyukjinKwon reviewed Jun 26, 2018

added dots

d60aa21

HyukjinKwon approved these changes Jun 26, 2018

NihalHarish closed this Sep 17, 2018
