@gczsjdy gczsjdy commented Jan 5, 2017

What changes were proposed in this pull request?

This PR implements the expression field, which Hive and MySQL both provide as a built-in function.

field(expr, expr1, expr2, ...) is a variadic function (at least 2 arguments) that returns the 1-based index of expr in the list (expr1, expr2, ...), or 0 if expr is not found.

  • It takes at least 2 parameters, and all parameters can be of any type.
  • Implicit cast will be done when at least 2 parameters have different types, and it will be based on the first parameter's type.
  • If the first parameter is of NumericType, all parameters will be implicitly cast to DoubleType, and those that can't be cast to DoubleType will be regarded as NULL.
  • If the first parameter is of any other type, all parameters will be implicitly cast to StringType and the comparison will follow String's comparing rules.
  • If the search expression is NULL, the return value is 0 because NULL fails equality comparison with any value.
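
As a rough illustration of the semantics listed above (not the PR's Catalyst implementation), here is a minimal Scala sketch over plain, already-evaluated values. The FieldSketch object and its field helper are hypothetical names, None stands in for SQL NULL, and implicit casting is not modelled.

```scala
object FieldSketch {
  // Hypothetical stand-in for the proposed expression: values are already
  // evaluated, and None models SQL NULL. No implicit casting is modelled.
  def field[A](target: Option[A], candidates: Seq[Option[A]]): Int =
    target match {
      case None => 0 // a NULL search value never equals anything
      case Some(t) =>
        // indexWhere is 0-based and returns -1 on a miss, so adding 1
        // yields a 1-based position, or 0 when nothing matches.
        candidates.indexWhere(_.contains(t)) + 1
    }
}
```

Under these assumptions, FieldSketch.field(Some(10), Seq(Some(9), Some(3), Some(10), Some(4))) evaluates to 3, and a NULL entry in the candidate list is simply skipped.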

How was this patch tested?

Unit tests are in ConditionalExpressionSuite & ColumnExpressionSuite.

@gczsjdy gczsjdy changed the title from "Implement expression field" to "[SPARK-19084][SQL] Implement expression field" Jan 5, 2017
Author

gczsjdy commented Jan 5, 2017

cc @chenghao-intel @adrian-wang

}

/**
* A function that returns the index of str in (str1, str2, ...) list or 0 if not found.
Contributor

Nit: delete a space before *

if(target == null)
0
else
findEqual(target, children.tail, 1)
Contributor

Nit: one more space before findEqual

"""
}

def dataTypeEqualsTarget(evalWithIndex: Tuple2[Tuple2[ExprCode, DataType], Int]): Boolean = {
Contributor

def dataTypeEqualsTarget(evalWithIndex: ((ExprCode, DataType), Int)): Boolean



@since(2.2)
def field(*cols):
Contributor

can we remove this?

* @group normal_funcs
* @since 2.2.0
*/
@scala.annotation.varargs
Contributor

can we remove this?

Author

Sure, I have removed this.
Still curious: is it inappropriate for this to be in the Dataset API?

Contributor

@rxin can you explain a little bit why we remove this?

checkEvaluation(CaseKeyWhen(literalNull, Seq(c2, c5, c1, c6)), null, row)
}

test("case field") {
Contributor

what's "case"?

Contributor

I feel like you have comprehensive testing, but at the same time it feels like the tests overlap in terms of coverage. There is room to reduce the number of tests while keeping the same coverage, e.g. you don't need 5 strings and could use fewer.

Author

@rxin My bad, will take it out

Author

@gczsjdy gczsjdy Jan 7, 2017

@tejasapatil The first 3 strings are base test strings, and the 5th is a null of string type. I can probably remove the 4th one, which is not useful. What do you think?
About the tests' role, could you please check the other thread where I @-mentioned you?

usage = "_FUNC_(str, str1, str2, ...) - Returns the index of str in the str1,str2,... or 0 if not found.",
extended = """
Examples:
> SELECT _FUNC_(10, 9, 3, 10, 4);
Contributor

can we use strings as examples rather than integer literals?

Contributor

This UDF accepts any mix of atomic types, so one can even get fancy with the inputs. I would recommend mentioning that in the doc (given that you have tests for that below).

Author

@gczsjdy gczsjdy Jan 7, 2017

I have given 3 examples, one for Integer, one for String, one for mixed types.

* It takes at least 2 parameters, and all parameters' types should be subtypes of AtomicType.
*/
@ExpressionDescription(
usage = "_FUNC_(str, str1, str2, ...) - Returns the index of str in the str1,str2,... or 0 if not found.",
Member

@gatorsmile gatorsmile Jan 7, 2017

Can we use expr1, expr2, expr3 here? The type can be any atomic type. right?

Author

Yeah, it's more reasonable to use expr(n), thx.
It should probably be AtomicType or NullType, to support a user writing a literal null.

extended = """
Examples:
> SELECT _FUNC_(10, 9, 3, 10, 4);
3
Member

More examples please?

if(target == null)
0
else
findEqual(target, children.tail, 1)
Member

Could you fix the style, based on https://github.com/databricks/scala-style-guide#curly?

Member

@gatorsmile gatorsmile Jan 7, 2017

findEqual(target, children.tail, index = 1)
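
For reference, the two suggestions in this thread (braces around a multi-line if/else, and a named argument for the literal 1) might combine as in the sketch below. CurlyStyleSketch and this simplified findEqual are hypothetical stand-ins that work on plain values rather than Catalyst expressions.

```scala
object CurlyStyleSketch {
  // Hypothetical stand-in for the PR's findEqual: it works on plain values,
  // and `index` is the 1-based position of params.head in the argument list.
  def findEqual(target: Any, params: Seq[Any], index: Int): Int = {
    val offset = params.indexOf(target)
    if (offset < 0) 0 else index + offset
  }

  def eval(children: Seq[Any]): Int = {
    val target = children.head
    // Multi-line if/else uses braces, with a space after `if`.
    if (target == null) {
      0
    } else {
      // The named argument documents what the bare literal 1 means.
      findEqual(target, children.tail, index = 1)
    }
  }
}
```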

checkEvaluation(Field(Seq(str5, str1, str2, str4)), 0)
checkEvaluation(Field(Seq(int4, double3, str5, bool1, date1, timeStamp2, int3)), 0)
checkEvaluation(Field(Seq(int1, strNull, intNull, bool1, date1, timeStamp2, int3)), 0)
checkEvaluation(Field(Seq(strNull, int1, str1, str2, str3)), 0)
Member

This is to test null. Could you add the description?

If the search string is NULL, the return value is 0 because NULL fails equality comparison with any value

Contributor

This fails with the patch.

hc.sql("""SELECT FIELD("tejas", 34, "patil", true, null, "tejas") FROM src LIMIT 1""").collect.foreach(println)

Removing null makes it work. Can you check on your side ? It worked with Hive.

Author

@gatorsmile Sure.

Author

@tejasapatil It doesn't work with my code because it only supports AtomicType, while a 'null' written by the user is of NullType. I will add NullType support to be consistent with Hive.

checkEvaluation(Field(Seq(int4, double3, str5, bool1, date1, timeStamp2, int4)), 6)
checkEvaluation(Field(Seq(str5, str1, str2, str4)), 0)
checkEvaluation(Field(Seq(int4, double3, str5, bool1, date1, timeStamp2, int3)), 0)
checkEvaluation(Field(Seq(int1, strNull, intNull, bool1, date1, timeStamp2, int3)), 0)
Member

@gatorsmile gatorsmile Jan 7, 2017

What is the purpose of these checks?

Based on MySQL's field function, the type casting rules are described as:

If all arguments to FIELD() are strings, all arguments are compared as strings. If all arguments are numbers, they are compared as numbers. Otherwise, the arguments are compared as double.
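
The three cases in the quoted rule can be read as a simple dispatch on the argument types. The sketch below is only an illustration of that reading: MySqlFieldRuleSketch, Comparison and comparisonFor are hypothetical names, and the arguments are assumed to be already-evaluated, non-null values.

```scala
object MySqlFieldRuleSketch {
  sealed trait Comparison
  case object AsString extends Comparison
  case object AsNumber extends Comparison
  case object AsDouble extends Comparison

  // Hypothetical helper: picks a comparison mode from already-evaluated,
  // non-null arguments, following the three cases in the quoted rule.
  def comparisonFor(args: Seq[Any]): Comparison = {
    if (args.forall(_.isInstanceOf[String])) {
      AsString
    } else if (args.forall(_.isInstanceOf[java.lang.Number])) {
      AsNumber
    } else {
      AsDouble // mixed argument types fall back to double comparison
    }
  }
}
```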

Author

@gczsjdy gczsjdy Jan 7, 2017

Line 180 tests parameters of multiple types;
Line 181 tests the not-found case;
Line 182 tests the not-found case when the parameters are of multiple types;
Line 183 tests a null among the parameters at index >= 1.
Maybe we should follow Hive's field instead? In Hive, when the arguments are neither all numbers nor all strings, they are not compared as double.

Also @tejasapatil, here is an explanation of some of the lines.

Author

I have removed line 181, since line 182 actually covers not found case.

val target = children.head.eval(input)
val targetDataType = children.head.dataType
def findEqual(target: Any, params: Seq[Expression], index: Int): Int = {
params.toList match {
Member

Can we avoid toList for each recursive call?

Contributor

Any reason not to do this iteratively? I would suggest avoiding recursion to get better perf.

Author

@gatorsmile Actually, the case:
checkAnswer(testData.selectExpr("field('花花世界', 'a', 1.23, true, '花花世界')"), Row(4))
will produce children: Seq[Expression] that is an ArrayBuffer, not a List, so head :: tail can't be used in the pattern match.
So there are 2 ways:

  1. remove the toList and do another pattern match for ArrayBuffer, which I think is not neat.
  2. keep the toList.

Author

@tejasapatil Actually, I think it's tail recursion, so the compiler will optimize it and it will have the same performance as the iterative version.

Contributor

You can add the @tailrec annotation to declare that explicitly.
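
To make the point concrete, here is a minimal sketch of an index-based search annotated with @tailrec. It assumes plain values in an IndexedSeq rather than Catalyst expressions, and TailrecSketch is a hypothetical name. The compiler rejects the annotation if the recursive call is not in tail position, so the loop-like performance is checked rather than assumed, and no toList conversion is needed.

```scala
import scala.annotation.tailrec

object TailrecSketch {
  // Hypothetical simplification: params holds already-evaluated values,
  // with null standing in for SQL NULL.
  def field(target: Any, params: IndexedSeq[Any]): Int = {
    @tailrec
    def findEqual(index: Int): Int = {
      if (index == params.length) {
        0 // ran past the end: not found
      } else if (params(index) != null && params(index) == target) {
        index + 1 // convert the 0-based index to a 1-based position
      } else {
        findEqual(index + 1) // tail call, compiled to a loop
      }
    }

    if (target == null) {
      0
    } else {
      findEqual(index = 0)
    }
  }
}
```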

Contributor

toList probably adds performance overhead, and I don't think we should sacrifice performance just to use the pattern match. I also still believe we don't have to check the data type at runtime. That should be done at compile time, or only once on the first call to eval.

The Field evaluation is quite confusing. As @gatorsmile suggested, we need to describe how the value is evaluated when the sub-expressions have different data types.

Author

@gczsjdy gczsjdy Jan 9, 2017

@chenghao-intel I have removed the toList, replaced the pattern match with if/else,
and also reduced the type check to once for the same table.
I have added comments at lines 353-354 about sub-expressions with multiple data types; could you please have a look?

}

/**
* A function that returns the index of str in (str1, str2, ...) list or 0 if not found.
Contributor

change str to expr here as well.

case _ => findEqual(target, params.tail, index + 1)
}
}
if(target == null)
Contributor

  • nit: space after if
  • checkstyle will fail saying if-else needs to use braces


checkEvaluation(Field(Seq(str1, str2, str3, str1)), 3)
checkEvaluation(Field(Seq(str2, str2, str2, str1)), 1)
checkEvaluation(Field(Seq(str4, str4, str4, str1)), 1)
Contributor

same as previous ?

"""
}

def genIfElseStructure(code1: String, code2: String): String = {
Contributor

From the name, it felt like this method would put code1 in the if block and code2 in the else block, but it turns out that's not the case. That floating else looks weird.

Author

Maybe I can change this function's name? But actually I can't think of a better name. : )

Member

Nit: Maybe we can use foldLeft to replace the current approach and get rid of the floating else.

Author

Sorry, I don't understand: how would the foldLeft approach work? I think we can only use foldRight or reduceRight, because the code for later children has to be nested innermost.

Member

Oh, yeah, right, if we use foldLeft, there is still a floating else. We can only use foldRight to remove it.

Author

@gczsjdy gczsjdy Feb 17, 2017

How would foldRight remove it?
My thought: if I understand what you mean by floating else correctly (could you please explain it a little?), neither foldRight nor reduceRight can avoid it, because each else block needs another if/else nested inside it, like this:
if (xxx) else { if (xxx) else { ... } }. So if we avoid the floating else in genIfElseStructure, the else has to go in updateEval, which would make the code unclear and complicated.

Member

I think it looks like:

${evalChildren.zip(dataTypes).zipWithIndex.tail.filter { x =>
  dataTypeMatchIndex.contains(x._2)
}.foldRight("") { (code: String, evalWithIndex: ((ExprCode, DataType), Int)) =>
  val ((eval, _), index) = evalWithIndex
  val condition = ctx.genEqual(targetDataType, eval.value, target.value)     
  s"""
    ${eval.code}
    if ($condition) {
      ${ev.value} = ${index};
    } else {
      $code
    }       
  """
}

You can do this with a function, like you did before. It will have an empty "else" block at the end.

However, this doesn't affect the functionality, only how the code looks. I don't have a strong opinion about this.

Contributor

I think what's already present in the code is ok. Given that there is no better option without adding more complexity, let's stick with it.

Author

@viirya Maybe the order of code and evalWithIndex parameters should be changed.
@tejasapatil I agree with your opinion.

Member

Current code is ok.
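
For readers following the foldRight idea discussed in this thread, here is a self-contained sketch of how nesting if/else text with foldRight could look. It builds plain strings from hypothetical (condition, action) pairs instead of ExprCode values, and NestedIfElseSketch and nest are illustrative names. Note that foldRight passes the element first and the accumulator second.

```scala
object NestedIfElseSketch {
  // Each pair is (condition, action) as raw source text; both are hypothetical
  // placeholders for the per-child comparison and assignment in generated Java.
  def nest(branches: Seq[(String, String)]): String =
    branches.foldRight("") { case ((condition, action), inner) =>
      // foldRight passes (element, accumulator): `inner` is the code already
      // built for the later children, so it lands inside this else block.
      s"if ($condition) { $action } else { $inner }"
    }
}
```

With two branches the result is one if/else nested inside the other's else block, and the innermost else block is empty, as noted in the thread.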

findEqual(target, children.tail, 1)
}

protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
Contributor

Do your unit tests cover generated code ?

Author

Yes, because the checkEvaluation function calls checkEvaluationWithGeneratedMutableProjection.

val ((eval, dataType), index) = evalWithIndex
s"""
${eval.code}
if (${dataType.equals(targetDataType)}
Contributor

At this point the code at lines 427-428 (i.e. .filter(dataTypeEqualsTarget)) has ensured that this will always be true. The generated code will contain this as true, so you might as well get rid of the check here.

${target.code}
boolean ${ev.isNull} = false;
int ${ev.value} = 0;
${rest.zip(restDataType).zipWithIndex.map(x => (x._1, x._2 + 1)).filter(
Contributor

.zipWithIndex.map(x => (x._1, x._2 + 1)) can be simplified as .zip(Stream from 1)

Author

Cool, thx.
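
The equivalence behind this suggestion can be checked on a plain collection. The sketch below uses a hypothetical params list, and Stream.from(1) is the method-call form of Stream from 1. Stream is deprecated in favor of LazyList since Scala 2.13, so this is written with the Scala version of the time in mind.

```scala
object OneBasedIndexSketch {
  // Hypothetical sample data standing in for the expression's children.
  val params = Seq("a", "b", "c")

  // Original form: 0-based index from zipWithIndex, shifted by one.
  val viaMap = params.zipWithIndex.map(x => (x._1, x._2 + 1))

  // Suggested form: zip against an infinite stream that starts at 1.
  // zip stops at the shorter side, so the infinite stream is safe here.
  val viaStream = params.zip(Stream.from(1))
}
```

Both values pair each element with its 1-based position.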

@gczsjdy gczsjdy force-pushed the udffield branch 2 times, most recently from 44941ab to b09f446 Compare January 9, 2017 13:04
@chenghao-intel
Contributor

@gczsjdy can you please add [WIP] to the title until you feel the code is ready for review?

@gczsjdy gczsjdy changed the title from "[SPARK-19084][SQL] Implement expression field" to "[SPARK-19084][SQL][WIP] Implement expression field" Jan 9, 2017

private lazy val ordering = TypeUtils.getInterpretedOrdering(children(0).dataType)

private val dataTypeMatchIndex: Seq[Int] = children.tail.zip(Stream from 1).filter(
Contributor

Array[Int] instead? Seq[Int] is probably a linked list in its concrete implementation.

Author

@gczsjdy gczsjdy Jan 10, 2017

Won't it be inconsistent with children? Since children's type is Seq[Expression].

private lazy val ordering = TypeUtils.getInterpretedOrdering(children(0).dataType)

private val dataTypeMatchIndex: Seq[Int] = children.tail.zip(Stream from 1).filter(
_._1.dataType == children.head.dataType).map(_._2)
Contributor

_._1.dataType.sameType(children.head.dataType)?

Author

@gczsjdy gczsjdy Jan 10, 2017

Good idea, thx.


private lazy val ordering = TypeUtils.getInterpretedOrdering(children(0).dataType)

private val dataTypeMatchIndex: Seq[Int] = children.tail.zip(Stream from 1).filter(
Contributor

zip(Stream from 1), do we really need it?

Author

Actually, it's short for zipWithIndex.map(x => (x._1, x._2 + 1)).
I realized it confuses people, so I have changed it.

override def dataType: DataType = IntegerType
override def eval(input: InternalRow): Any = {
val target = children.head.eval(input)
val targetDataType = children.head.dataType
Contributor

Unused code.

Author

Got it

val target = children.head.eval(input)
val targetDataType = children.head.dataType
@tailrec def findEqual(index: Int): Int = {
if (index == dataTypeMatchIndex.size) {
Contributor

if dataTypeMatchIndex is Array[Int], then we'd better use dataTypeMatchIndex.length instead.

0
} else {
val value = children(dataTypeMatchIndex(index)).eval(input)
if (value != null && ordering.equiv(target, value))
Contributor

@chenghao-intel
Contributor

Since children with a different data type are simply ignored, I think we should also add an optimization rule in the Optimizer.

The same goes for Python/Scala API support, but we need to confirm with @rxin why the field API is not needed.

@gczsjdy gczsjdy force-pushed the udffield branch 2 times, most recently from ff9a0be to 08e9f0c Compare January 10, 2017 07:10
@gczsjdy gczsjdy changed the title from "[SPARK-19084][SQL][WIP] Implement expression field" to "[SPARK-19084][SQL] Implement expression field" Jan 10, 2017
Author

gczsjdy commented Jul 16, 2018

@HyukjinKwon Done, thanks : )
Gentle ping @maropu. The implicit cast seems to be the part that is hard to agree on; do you have any suggestions?

@HyukjinKwon
Member

ok to test


SparkQA commented Jul 16, 2018

Test build #93100 has finished for PR 16476 at commit 4b63e94.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jul 26, 2018

Test build #93594 has finished for PR 16476 at commit c75d786.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 19, 2020
@github-actions github-actions bot closed this Jan 20, 2020