[SPARK-24393][SQL] SQL builtin: isinf #21482

Closed

Changes from 2 commits
2 changes: 2 additions & 0 deletions R/pkg/NAMESPACE
@@ -281,6 +281,8 @@ exportMethods("%<=>%",
"initcap",
"input_file_name",
"instr",
"isInf",
"isinf",
Member:
I really appreciate the attempt to include R, though a question: why do we have both isInf and isinf? Why not just isinf, like Python?

Member (@viirya, Jun 3, 2018):
Yeah, we may have isInf or isinf here.

Author:
I have just followed what was done for isnan, which also has isNaN.

Contributor:
The functions are case-insensitive, so I don't think we need both?

Member:
Add tests for these?

Author:
Added tests in my latest commit.

Member:
I really don't understand why we have both. Usually we only have the name matching the Scala side, or an R-specific function. Otherwise, I don't think we should have both.

Contributor:
Do you know if there's any consistency behind the different capitalization schemes? There are format_number, isnan, and isNotNull here, for example.

If not, how about we just go with isInf for now, and if other aliases are needed in the future they can be added and discussed then?

Member:
I think we shouldn't add other variants unless there's a clear reason. It sounds like we are adding this for no reason.

> how about we just go with isInf for now, and if other aliases are needed in the future they can be added and discussed then?

Yea, please.

Member:
I commented on this here: #21482 (comment). It's because Column capitalizes it differently from functions.

Member:
Ah, I got it now. I roughly remember we keep function names in this_naming_style in functions[.py|.R|.scala], e.g. (SPARK-10621).
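
To make the distinction concrete, here is a minimal Scala sketch of the two call styles under discussion, assuming this PR's isinf (functions.scala) and isInf (Column.scala) additions are available; the session and DataFrame are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, isinf}

val spark = SparkSession.builder().master("local[1]").appName("isinf-demo").getOrCreate()
import spark.implicits._

val df = Seq(1.0, Double.PositiveInfinity).toDF("a")

// functions object: lower-case name, matching isnan, input_file_name, ...
df.select(isinf(col("a"))).show()

// Column method: camelCase name, matching isNaN, isNotNull, ...
df.select(col("a").isInf).show()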

"isNaN",
"isNotNull",
"isNull",
24 changes: 24 additions & 0 deletions R/pkg/R/functions.R
@@ -907,6 +907,30 @@ setMethod("initcap",
column(jc)
})

#' @details
#' \code{isinf}: Returns true if the column is Infinity.
#' @rdname column_nonaggregate_functions
#' @aliases isnan isnan,Column-method
Member:
@aliases is incorrect here.

#' @note isinf since 2.4.0
Member:
Missing @aliases.

setMethod("isinf",
Member:
We don't need to have duplicate method definitions like this. Maybe we can follow isNaN's approach if we really need to have both isinf and isInf.

Author:
Would it be alright if I omit isinf completely and only implement isInf?

Member:
For the isNaN case, functions.R only defines isnan; isNaN is defined in column.R. If we really want to have both isInf and isinf, like isNaN and isnan, it would be better to follow that approach instead of having two duplicate method definitions here.

signature(x = "Column"),
function(x) {
jc <- callJStatic("org.apache.spark.sql.functions", "isinf", x@jc)
column(jc)
})

#' @details
#' \code{isInf}: Returns true if the column is Infinity.
#' @rdname column_nonaggregate_functions
#' @aliases isnan isnan,Column-method
#' @note isinf since 2.4.0
setMethod("isInf",
Member:
R has is.infinite. Can we match the behaviour and rename it?

Member:
I like the idea, but we might not have a way to extend it (sort of):

> showMethods("is.finite")
Function: is.finite (package base)
> is.finite
function (x)  .Primitive("is.finite")

It looks like S3 without a generic.

signature(x = "Column"),
function(x) {
jc <- callJStatic("org.apache.spark.sql.functions", "isinf", x@jc)
column(jc)
})

#' @details
#' \code{isnan}: Returns true if the column is NaN.
#' @rdname column_nonaggregate_functions
6 changes: 6 additions & 0 deletions R/pkg/R/generics.R
@@ -695,6 +695,12 @@ setGeneric("getField", function(x, ...) { standardGeneric("getField") })
#' @rdname columnfunctions
setGeneric("getItem", function(x, ...) { standardGeneric("getItem") })

#' @rdname columnfunctions
setGeneric("isInf", function(x) { standardGeneric("isInf") })

#' @rdname columnfunctions
setGeneric("isinf", function(x) { standardGeneric("isinf") })
Member:
isnan (lower case) is not a column function; see https://github.com/NihalHarish/spark/blob/7e396f70f58ffd309e7f738751f3aa8cfe321ce7/R/pkg/R/generics.R#L1002
@rdname columnfunctions will cause it to go to the wrong doc page.


#' @rdname columnfunctions
setGeneric("isNaN", function(x) { standardGeneric("isNaN") })

12 changes: 12 additions & 0 deletions python/pyspark/sql/functions.py
@@ -468,6 +468,18 @@ def input_file_name():
return Column(sc._jvm.functions.input_file_name())


@since(2.4)
def isinf(col):
Member:
Shall we expose this to column.py too?

Author:
Do you want me to add the function to column.py?

Contributor:
@HyukjinKwon could you clarify, please?

Member:
Yes, please because I see it's exposed in Column.scala.

Author:
Added it in my latest commit.

"""An expression that returns true iff the column is NaN.
Member:
Ditto. Is this docstring the same as isnan's?


>>> df = spark.createDataFrame([(1.0, float('inf')), (float('inf'), 2.0)], ("a", "b"))
>>> df.select(isinf("a").alias("r1"), isinf(df.a).alias("r2")).collect()
[Row(r1=False, r2=False), Row(r1=True, r2=True)]
"""
sc = SparkContext._active_spark_context
return Column(sc._jvm.functions.isinf(_to_java_column(col)))


@since(1.6)
def isnan(col):
"""An expression that returns true iff the column is NaN.
@@ -198,6 +198,7 @@ object FunctionRegistry {
expression[If]("if"),
expression[Inline]("inline"),
expressionGeneratorOuter[Inline]("inline_outer"),
expression[IsInf]("isinf"),
expression[IsNaN]("isnan"),
expression[IfNull]("ifnull"),
expression[IsNull]("isnull"),
@@ -199,6 +199,50 @@ case class Nvl2(expr1: Expression, expr2: Expression, expr3: Expression, child:
override def sql: String = s"$prettyName(${expr1.sql}, ${expr2.sql}, ${expr3.sql})"
}

/**
* Evaluates to `true` iff it's Infinity.
*/
@ExpressionDescription(
usage = "_FUNC_(expr) - Returns True if expr evaluates to infinite else returns False ",
Member:
True -> true, False -> false, to be consistent.

examples = """
Examples:
> SELECT _FUNC_(1/0);
True
Member:
Can you run the example and check the results?

> SELECT _FUNC_(5);
False
""")
case class IsInf(child: Expression) extends UnaryExpression
with Predicate with ImplicitCastInputTypes {

override def inputTypes: Seq[AbstractDataType] = Seq(TypeCollection(DoubleType, FloatType))

override def nullable: Boolean = false

override def eval(input: InternalRow): Boolean = {
val value = child.eval(input)
if (value == null) {
false
} else {
child.dataType match {
case DoubleType => value.asInstanceOf[Double].isInfinity
case FloatType => value.asInstanceOf[Float].isInfinity
}
}
}


override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
val eval = child.genCode(ctx)
child.dataType match {
case DoubleType | FloatType =>
Member (@viirya, Jun 3, 2018):
Is this match block necessary since there is only one case pattern?

Author:
The function can only test for infinity on the Double and Float data types, hence we need to match the child's data type against those types.

Member:
I think we will only see double and float types here because of inputTypes.
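
For reference, a minimal standalone Scala sketch (outside Spark's Expression API; the helper name is invented for illustration) of the null-safe check that eval performs above:

// Only Double and Float can carry infinities; ImplicitCastInputTypes with
// TypeCollection(DoubleType, FloatType) ensures eval only ever sees those types.
def isInfValue(value: Any): Boolean = value match {
  case null      => false
  case d: Double => d.isInfinity
  case f: Float  => f.isInfinity
  case _         => false // unreachable once inputs are coerced to double/float
}

assert(isInfValue(Double.NegativeInfinity))
assert(!isInfValue(5.5f))
assert(!isInfValue(null))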

ev.copy(code = code"""
${eval.code}
${CodeGenerator.javaType(dataType)} ${ev.value} = ${CodeGenerator.defaultValue(dataType)};
Member:
We can assign ev.value directly here without a default value.
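
For illustration, a sketch of that suggestion as a fragment of the doGenCode body above, reusing the eval/ev names from this diff (not a standalone, runnable snippet):

// Declare and assign the boolean result in one statement, with no default value.
ev.copy(
  code = code"""
    ${eval.code}
    boolean ${ev.value} = !${eval.isNull} && Double.isInfinite(${eval.value});""",
  isNull = FalseLiteral)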

${ev.value} = !${eval.isNull} && Double.isInfinite(${eval.value});""",
Contributor:
Out of interest, why use Double.isInfinite here, but value.isInfinity in the non-codegen version?

Author:
The non-codegen version uses the isInfinity method defined on Scala's Double and Float, whereas the codegen version uses Java's static isInfinite method defined on the java.lang.Double and java.lang.Float classes.
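
A quick Scala sketch, just for illustration, showing that the two checks agree:

val samples = Seq(Double.PositiveInfinity, Double.NegativeInfinity, 1.0, Double.NaN)
// Scala's RichDouble.isInfinity and java.lang.Double.isInfinite give the same answer;
// a Float argument widens to double, so the same static method covers both types.
samples.foreach { v =>
  assert(v.isInfinity == java.lang.Double.isInfinite(v))
}
assert(Float.PositiveInfinity.isInfinity && java.lang.Double.isInfinite(Float.PositiveInfinity))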

isNull = FalseLiteral)
}
}
}

/**
* Evaluates to `true` iff it's NaN.
@@ -56,6 +56,18 @@ class NullExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
assert(ex.contains("Null value appeared in non-nullable field"))
}

test("IsInf") {
checkEvaluation(IsInf(Literal(Double.PositiveInfinity)), true)
checkEvaluation(IsInf(Literal(Double.NegativeInfinity)), true)
checkEvaluation(IsInf(Literal(Float.PositiveInfinity)), true)
checkEvaluation(IsInf(Literal(Float.NegativeInfinity)), true)
checkEvaluation(IsInf(Literal.create(null, DoubleType)), false)
checkEvaluation(IsInf(Literal(Float.MaxValue)), false)
checkEvaluation(IsInf(Literal(5.5f)), false)
Contributor:
Check NaN as well?

Author:
Added the checks in my later commits.

checkEvaluation(IsInf(Literal(Float.NaN)), expected = false)
checkEvaluation(IsInf(Literal(Double.NaN)), expected = false)
}

test("IsNaN") {
checkEvaluation(IsNaN(Literal(Double.NaN)), true)
checkEvaluation(IsNaN(Literal(Float.NaN)), true)
8 changes: 8 additions & 0 deletions sql/core/src/main/scala/org/apache/spark/sql/Column.scala
@@ -557,6 +557,14 @@ class Column(val expr: Expression) extends Logging {
(this >= lowerBound) && (this <= upperBound)
}

/**
* True if the current expression is NaN.
Member:
Is this doc comment the same as isNaN's?

*
* @group expr_ops
* @since 2.4.0
*/
def isInf: Column = withExpr { IsInf(expr) }

/**
* True if the current expression is NaN.
*
8 changes: 8 additions & 0 deletions sql/core/src/main/scala/org/apache/spark/sql/functions.scala
@@ -1107,6 +1107,14 @@ object functions {
*/
def input_file_name(): Column = withExpr { InputFileName() }

/**
* Return true iff the column is Infinity.
*
* @group normal_funcs
* @since 2.4.0
*/
def isinf(e: Column): Column = withExpr { IsInf(e.expr) }
Member:
Mind if I ask you to elaborate on isinf vs isInf across the APIs?

Author:
I have followed what seemed to be the pre-existing convention for function names in those particular files. For example, in functions.scala all function names are lowercase, but in Column.scala they are camelCase.

Member:
I guess someone should clarify whether Column.isFoo vs. the functions object's isfoo is the right pattern we want to stay with...


/**
* Return true iff the column is NaN.
*