Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8218][SQL] Add binary log math function #6725

Closed
wants to merge 19 commits into from
Closed
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion python/pyspark/sql/functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -143,7 +143,8 @@ def _():
'atan2': 'Returns the angle theta from the conversion of rectangular coordinates (x, y) to' +
'polar coordinates (r, theta).',
'hypot': 'Computes `sqrt(a^2^ + b^2^)` without intermediate overflow or underflow.',
'pow': 'Returns the value of the first argument raised to the power of the second argument.'
'pow': 'Returns the value of the first argument raised to the power of the second argument.',
'log': 'Returns the first argument-based logarithm of the second argument',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we define this function directly instead of using this magic?

}

_window_functions = {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -107,7 +107,7 @@ object FunctionRegistry {
expression[Expm1]("expm1"),
expression[Floor]("floor"),
expression[Hypot]("hypot"),
expression[Log]("log"),
expression[Logarithm]("log"),
expression[Log10]("log10"),
expression[Log1p]("log1p"),
expression[Pi]("pi"),
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -254,3 +254,54 @@ case class Pow(left: Expression, right: Expression)
"""
}
}

object Logarithm {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this? We should assume people will use the ln instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we want to support the usage of df.selectExpr("log(a)", "log(2.0, a)", "log(b)").

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take a look at #6806

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. it solves this problem.

def apply(child: Expression): Expression = new Log(child)
}

case class Logarithm(left: Expression, right: Expression)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We seem to be doing this throughout the file, but it seems pretty confusing to me to be using left and right instead of something like value and base. I'm not sure this is worth whatever code reuse we are getting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh - one thing is that left/right is coming from BinaryExpression

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is what I meant saying it wasn't worth whatever code reuse we are getting. The other option would be to name the arguments and have def left = base, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inheriting from BinaryMathExpression seems like a bad idea to me. It is forcing the arguments to have weird names and is resulting in dead code.

extends BinaryMathExpression((c1, c2) => math.log(c2) / math.log(c1), "LOG") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need to override the eval(), probably it's better to inherit from BinaryArithmetic, not BinaryMathExpression, then we needn't to pass the additional function object for its parent class.
Or can we remove the eval() from this class?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think BinaryArithmetic is more limited in its semantic meaning. As I can tell, it is more suitable for the binary expression between two expressions of same type. BinaryMathExpression is more like to represent math function with two expressions.

Due to support the case when the left is null in that 10 base logarithm is applied, to override eval() is needed here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, but this is confusing. Is it this lambda that is being used for evaluation or the eval function below? We should not have both if only one is used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be ok because there is other binary math expression Atan2 that overrides eval because its eval behavior is different than the default one in BinaryMathExpression.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that that function is also implemented in a confusing way. We should not shoehorn things into the class hierarchy if its going to result hard to follow code. I'd rather we have small amounts of code duplication.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Sounds reasonable. I think I can refactor this part of codes a little.

override def eval(input: Row): Any = {
val evalE2 = right.eval(input)
if (evalE2 == null) {
null
} else {
val evalE1 = left.eval(input)
var result: Double = 0.0
if (evalE1 == null) {
result = math.log(evalE2.asInstanceOf[Double])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it standard to assume base 10 if the exponent is null?

} else {
result = math.log(evalE2.asInstanceOf[Double]) / math.log(evalE1.asInstanceOf[Double])
}
if (result.isNaN) null else result
}
}

override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
if (left.dataType != right.dataType) {
// log.warn(s"${left.dataType} != ${right.dataType}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

}

val eval1 = left.gen(ctx)
val eval2 = right.gen(ctx)
val resultCode =
s"java.lang.Math.log(${eval2.primitive}) / java.lang.Math.log(${eval1.primitive})"

s"""
${eval2.code}
boolean ${ev.isNull} = ${eval2.isNull};
${ctx.javaType(dataType)} ${ev.primitive} = ${ctx.defaultValue(dataType)};
if (!${ev.isNull}) {
${eval1.code}
if (!${eval1.isNull}) {
${ev.primitive} = ${resultCode};
} else {
${ev.primitive} = java.lang.Math.log(${eval2.primitive});
}
}
if (Double.valueOf(${ev.primitive}).isNaN()) {
${ev.isNull} = true;
}
"""
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -204,4 +204,18 @@ class MathFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper {
testBinary(Atan2, math.atan2)
}

test("binary log") {
val f = (c1: Double, c2: Double) => math.log(c2) / math.log(c1)
val domain = (1 to 20).map(v => (v * 0.1, v * 0.2))

domain.foreach { case (v1, v2) =>
checkEvaluation(Logarithm(Literal(v1), Literal(v2)), f(v1 + 0.0, v2 + 0.0), EmptyRow)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent is wrong.

checkEvaluation(Logarithm(Literal(v2), Literal(v1)), f(v2 + 0.0, v1 + 0.0), EmptyRow)
}
// When base is null, Logarithm is as same as Log
checkEvaluation(Logarithm(Literal.create(null, DoubleType), Literal(1.0)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about wrapping.

math.log(1.0), create_row(null))
checkEvaluation(Logarithm(Literal(1.0), Literal.create(null, DoubleType)),
null, create_row(null))
}
}
16 changes: 16 additions & 0 deletions sql/core/src/main/scala/org/apache/spark/sql/functions.scala
Original file line number Diff line number Diff line change
Expand Up @@ -1083,6 +1083,22 @@ object functions {
*/
def log(columnName: String): Column = log(Column(columnName))

/**
* Returns the first argument-base logarithm of the second argument.
*
* @group math_funcs
* @since 1.4.0
*/
def log(base: Double, a: Column): Column = Logarithm(lit(base).expr, a.expr)

/**
* Returns the first argument-base logarithm of the second argument.
*
* @group math_funcs
* @since 1.4.0
*/
def log(base: Double, a: String): Column = log(base, Column(a))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we rename a to columnName here


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we support specify the base from a column?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is suggested by @rxin. I think it is reasonable because it is hard to have a use case to returns the logarithm of one column with another column as base. Usually you want to compute the logarithm values for a column with the same base.

/**
* Computes the logarithm of the given value in base 10.
*
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -110,6 +110,19 @@ class DataFrameFunctionsSuite extends QueryTest {
testData2.collect().toSeq.map(r => Row(~r.getInt(0))))
}

test("log") {
val df = Seq[(Integer, Integer)]((123, null)).toDF("a", "b")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this test into the math expression suite?

checkAnswer(
df.select(org.apache.spark.sql.functions.log("a"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: When wrapping, wrap all arguments. Also the indent below is wrong.

org.apache.spark.sql.functions.log(2.0, "a"),
org.apache.spark.sql.functions.log("b")),
Row(math.log(123), math.log(123) / math.log(2), null))

checkAnswer(
df.selectExpr("log(a)", "log(2.0, a)", "log(b)"),
Row(math.log(123), math.log(123) / math.log(2), null))
}

test("length") {
checkAnswer(
nullStrings.select(strlen($"s"), strlen("s")),
Expand Down