Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8248][SQL] string function: length #6724

Closed
wants to merge 11 commits into from
Expand Up @@ -89,14 +89,10 @@ object FunctionRegistry {
expression[CreateArray]("array"),
expression[Coalesce]("coalesce"),
expression[Explode]("explode"),
expression[Lower]("lower"),
expression[Substring]("substr"),
expression[Substring]("substring"),
expression[Rand]("rand"),
expression[Randn]("randn"),
expression[CreateStruct]("struct"),
expression[Sqrt]("sqrt"),
expression[Upper]("upper"),

// Math functions
expression[Acos]("acos"),
Expand Down Expand Up @@ -130,7 +126,16 @@ object FunctionRegistry {
expression[Last]("last"),
expression[Max]("max"),
expression[Min]("min"),
expression[Sum]("sum")
expression[Sum]("sum"),

// string functions
expression[Upper]("lcase"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo here? should be expression[Lower]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's a good catch.

expression[Lower]("lower"),
expression[StringLength]("strlen"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't rename this one since we need hive compatibility here... only rename the data frame function.

expression[Substring]("substr"),
expression[Substring]("substring"),
expression[Upper]("upper"),
expression[Upper]("ucase")
)

/** See usage above. */
Expand Down
Expand Up @@ -294,3 +294,35 @@ object Substring {
apply(str, pos, Literal(Integer.MAX_VALUE))
}
}

/**
* A function that return the length of the given string expression.
*/
case class StringLength(child: Expression) extends UnaryExpression with ExpectsInputTypes {

override def foldable: Boolean = child.foldable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can make them denser, i.e. remove the blank lines between the trivial functions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually can we push the definition of foldable & nullable into UnaryExpression?


override def nullable: Boolean = child.nullable

override def dataType: DataType = IntegerType

override def expectedChildTypes: Seq[DataType] = Seq(StringType)

override def eval(input: Row): Any = {
val string = child.eval(input)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this blank line

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and condense the following into one line?

if (string.eq(null)) null else string.asInstanceOf[UTF8String].length

if (string == null) {
null
} else {
string.asInstanceOf[UTF8String].length
}
}

override def toString: String = s"strlen($child)"

override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
defineCodeGen(ctx, ev, c => s"($c).length()")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no () in Java

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, it has. never mind.

}
}


Expand Up @@ -215,4 +215,15 @@ class StringFunctionsSuite extends SparkFunSuite with ExpressionEvalHelper {
evaluate("abbbbc" rlike regEx, create_row("**"))
}
}

test("length for string") {
val regEx = 'a.string.at(0)
checkEvaluation(StringLength(Literal("abc")), 3, create_row("abdef"))
checkEvaluation(StringLength(regEx), 5, create_row("abdef"))
checkEvaluation(StringLength(regEx), 0, create_row(""))
checkEvaluation(StringLength(regEx), null, create_row(null))
checkEvaluation(StringLength(Literal.create(null, StringType)), null, create_row("abdef"))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @davies pointed out, this probably failed in codegen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you pull in his fix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I thought @davies will fix this. I will take look at this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea but his fix won't be merged for a while because it's part of a much broader change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can port some of fix from that big PR as a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do that. Take your big PR into smaller ones.

}


}
14 changes: 14 additions & 0 deletions sql/core/src/main/scala/org/apache/spark/sql/functions.scala
Expand Up @@ -37,6 +37,7 @@ import org.apache.spark.util.Utils
* @groupname normal_funcs Non-aggregate functions
* @groupname math_funcs Math functions
* @groupname window_funcs Window functions
* @groupname string_funcs functions for DataFrames.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"functions for DataFrames " -> "String functions"

* @groupname Ungrouped Support functions for DataFrames.
* @since 1.3.0
*/
Expand Down Expand Up @@ -1299,6 +1300,19 @@ object functions {
*/
def toRadians(columnName: String): Column = toRadians(Column(columnName))

/**
* Length of a given string value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Computes the length of a string column.

* @group string_funcs
* @since 1.5.0
*/
def strlen(e: Column): Column = StringLength(e.expr)

/**
* Length of a given string column
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Computes the length of a string column.

* @group string_funcs
* @since 1.5.0
*/
def strlen(columnName: String): Column = strlen(Column(columnName))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add functions for other string expressions like substring?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those are coming as separate pull requests


//////////////////////////////////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////////////////////////////////
Expand Down
Expand Up @@ -90,4 +90,28 @@ class DataFrameFunctionsSuite extends QueryTest {
testData2.select(bitwiseNOT($"a")),
testData2.collect().toSeq.map(r => Row(~r.getInt(0))))
}

test("length") {
checkAnswer(
nullStrings.select(strlen($"s"), strlen("s")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for future ones, i prefer just having the data defined inline here. and just have one or two items in the test data set. and we should explicitly put in the test result.

nullStrings.collect().toSeq.map { r =>
val v = r.getString(1)
val l = if (v == null) null else v.length
Row(l, l)
})
}

test("length in SQL") {
nullStrings.registerTempTable("null_strings")

checkAnswer(
ctx.sql("SELECT strlen(s) FROM null_strings"),
nullStrings.collect().toSeq.map { r =>
val v = r.getString(1)
val l = if (v == null) null else v.length
Row(l)
})
}


}