-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-19084][SQL] Implement expression field #16476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| /** | ||
| * A function that returns the index of str in (str1, str2, ...) list or 0 if not found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: delete a space before *
| if(target == null) | ||
| 0 | ||
| else | ||
| findEqual(target, children.tail, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: one more space before findEqual
| """ | ||
| } | ||
|
|
||
| def dataTypeEqualsTarget(evalWithIndex: Tuple2[Tuple2[ExprCode, DataType], Int]): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def dataTypeEqualsTarget(evalWithIndex: ((ExprCode, DataType), Int)): Boolean
python/pyspark/sql/functions.py
Outdated
|
|
||
|
|
||
| @since(2.2) | ||
| def field(*cols): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove this?
| * @group normal_funcs | ||
| * @since 2.2.0 | ||
| */ | ||
| @scala.annotation.varargs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I have removed this.
Still curious, is it inappropriate to be in Dataset API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rxin can you explain a little bit why we remove this?
| checkEvaluation(CaseKeyWhen(literalNull, Seq(c2, c5, c1, c6)), null, row) | ||
| } | ||
|
|
||
| test("case field") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's "case"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like you have comprehensive testing but at the same time feels like there is overlap amongst the tests in terms of coverage. There is room of reducing the tests while still having same coverage. eg you don't need 5 strings and use lesser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rxin My bad, will take it out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tejasapatil The first 3 strings are base test strings, the 5th is for null of string type. Probably I can remove the 4th one which is not useful. What do you think?
About the tests' role, could you please check another thread where I @ you?
| usage = "_FUNC_(str, str1, str2, ...) - Returns the index of str in the str1,str2,... or 0 if not found.", | ||
| extended = """ | ||
| Examples: | ||
| > SELECT _FUNC_(10, 9, 3, 10, 4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use strings as examples rather than integer literals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This UDF accepts any mix atomic types so one can even get fancy with the inputs. Would recommend mentioning that in the doc (given that you have tests for that below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have given 3 examples, one for Integer, one for String, one for mixed types.
| * It takes at least 2 parameters, and all parameters' types should be subtypes of AtomicType. | ||
| */ | ||
| @ExpressionDescription( | ||
| usage = "_FUNC_(str, str1, str2, ...) - Returns the index of str in the str1,str2,... or 0 if not found.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use expr1, expr2, expr3 here? The type can be any atomic type. right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's more reasonable to use expr(n), thx.
Probably it should be AtomicType or NullType to support user's writing of null.
| extended = """ | ||
| Examples: | ||
| > SELECT _FUNC_(10, 9, 3, 10, 4); | ||
| 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More examples please?
| if(target == null) | ||
| 0 | ||
| else | ||
| findEqual(target, children.tail, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you fix the style, based on https://github.com/databricks/scala-style-guide#curly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
findEqual(target, children.tail, index = 1)
| checkEvaluation(Field(Seq(str5, str1, str2, str4)), 0) | ||
| checkEvaluation(Field(Seq(int4, double3, str5, bool1, date1, timeStamp2, int3)), 0) | ||
| checkEvaluation(Field(Seq(int1, strNull, intNull, bool1, date1, timeStamp2, int3)), 0) | ||
| checkEvaluation(Field(Seq(strNull, int1, str1, str2, str3)), 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to test null. Could you add the description?
If the search string is NULL, the return value is 0 because NULL fails equality comparison with any value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fails with the patch.
hc.sql("""SELECT FIELD("tejas", 34, "patil", true, null, "tejas") FROM src LIMIT 1""").collect.foreach(println)
Removing null makes it work. Can you check on your side ? It worked with Hive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tejasapatil It doesn't work with my code because it only support AtomicType. While user's writing of 'null' is NullType, I will add NullType support to be consistent with Hive.
| checkEvaluation(Field(Seq(int4, double3, str5, bool1, date1, timeStamp2, int4)), 6) | ||
| checkEvaluation(Field(Seq(str5, str1, str2, str4)), 0) | ||
| checkEvaluation(Field(Seq(int4, double3, str5, bool1, date1, timeStamp2, int3)), 0) | ||
| checkEvaluation(Field(Seq(int1, strNull, intNull, bool1, date1, timeStamp2, int3)), 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of these checks?
Based on MySQL's field function, the type casting rules is described as
If all arguments to FIELD() are strings, all arguments are compared as strings. If all arguments are numbers, they are compared as numbers. Otherwise, the arguments are compared as double.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 180 is to test multi types of parameters;
Line 181 is to test not found case;
Line 182 is to test not found case when parameters are of multi types;
Line 183 is to test null in parameter which has >=1 index
I think maybe we should refer to Hive's field? In Hive, when not all arguments are numbers && not all arguments are strings, they are not compared as double.
Also @tejasapatil , here's some lines' explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed line 181, since line 182 actually covers not found case.
| val target = children.head.eval(input) | ||
| val targetDataType = children.head.dataType | ||
| def findEqual(target: Any, params: Seq[Expression], index: Int): Int = { | ||
| params.toList match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid toList for each recursive call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to not do this iteratively ? I would suggest avoiding recursion to get better perf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile Actually, the case:
checkAnswer(testData.selectExpr("field('花花世界', 'a', 1.23, true, '花花世界')"), Row(4))
will produce a child: Seq[Expression] an ArrayBuffer, it's not a list, so can't use head::tail.
So there are 2 ways:
- remove the
toListand do another pattern match for ArrayBuffer, which I think is not neat. - keep the
toList.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tejasapatil Actually, I think it's tail recursion, so the compiler will do the optimization, then it has the same performance with iteration edition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can add the annotation @tailrec for explicitly declare that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toList probably causes performance overhead, I don't think we have to sacrifice the performance for using the pattern match. In the meantime, I still believe we don't have to check the data type during the runtime. It's supposed to be done during the compile time or only done once for the first time in eval.
The Field evaluation is quite confusing, as @gatorsmile suggested, we need to describe how to evaluate the value when sub expressions' data type are different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chenghao-intel I have removed the toList, replaced the pattern match by if & else,
and also reduced the type check time to 1 for the same table.
I have added line 353-354 as comments for sub expressions' multiple data types, could you please have a look?
| } | ||
|
|
||
| /** | ||
| * A function that returns the index of str in (str1, str2, ...) list or 0 if not found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change str to expr here as well.
| usage = "_FUNC_(str, str1, str2, ...) - Returns the index of str in the str1,str2,... or 0 if not found.", | ||
| extended = """ | ||
| Examples: | ||
| > SELECT _FUNC_(10, 9, 3, 10, 4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This UDF accepts any mix atomic types so one can even get fancy with the inputs. Would recommend mentioning that in the doc (given that you have tests for that below)
| val target = children.head.eval(input) | ||
| val targetDataType = children.head.dataType | ||
| def findEqual(target: Any, params: Seq[Expression], index: Int): Int = { | ||
| params.toList match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to not do this iteratively ? I would suggest avoiding recursion to get better perf.
| case _ => findEqual(target, params.tail, index + 1) | ||
| } | ||
| } | ||
| if(target == null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- nit: space after
if - checkstyle will fail saying if-else needs to use braces
|
|
||
| checkEvaluation(Field(Seq(str1, str2, str3, str1)), 3) | ||
| checkEvaluation(Field(Seq(str2, str2, str2, str1)), 1) | ||
| checkEvaluation(Field(Seq(str4, str4, str4, str1)), 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as previous ?
| """ | ||
| } | ||
|
|
||
| def genIfElseStructure(code1: String, code2: String): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the name it felt like this method would put code1 in if block and code2 in else block but turns out thats not the case. That floating else looks weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I can change this function's name? But actually I can't think of a better name. : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Maybe we can use foldLeft to replace current approach to get rid of the floating else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I don't understand, how to use foldLeft approach? I think we can only use foldRight or reduceRight, because the code for latter children should be nested inner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yeah, right, if use foldLeft, there is still a floating else. We can only use foldRight to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to use foldRight to remove it?
My thought: If I understand your meaning of floating else right(could you please explain it a little bit?), foldRight and reduceRight both can't avoid floating else, because we need nested else in else block, like this:
if (xxx) else { if (xxx) else { ... } } , so if we avoid floating else in genIfElseStructure, else should be in updateEval, which will make the code unclear and complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it looks like:
${evalChildren.zip(dataTypes).zipWithIndex.tail.filter { x =>
dataTypeMatchIndex.contains(x._2)
}.foldRight("") { (code: String, evalWithIndex: ((ExprCode, DataType), Int)) =>
val ((eval, _), index) = evalWithIndex
val condition = ctx.genEqual(targetDataType, eval.value, target.value)
s"""
${eval.code}
if ($condition) {
${ev.value} = ${index};
} else {
$code
}
"""
}
You can do this with a function like you did before. It will have a empty "else" block at the end.
However this doesn't affect the functionality, just dealing with how the code looks. I don't have strong option about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think whats already present in the code is ok. Given that there is no better option without adding more complexity, lets stick with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya Maybe the order of code and evalWithIndex parameters should be changed.
@tejasapatil I agree with your opinion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current code is ok.
| findEqual(target, children.tail, 1) | ||
| } | ||
|
|
||
| protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do your unit tests cover generated code ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because in checkEvaluation function there is : checkEvaluationWithGeneratedMutableProjection
| checkEvaluation(Field(Seq(str5, str1, str2, str4)), 0) | ||
| checkEvaluation(Field(Seq(int4, double3, str5, bool1, date1, timeStamp2, int3)), 0) | ||
| checkEvaluation(Field(Seq(int1, strNull, intNull, bool1, date1, timeStamp2, int3)), 0) | ||
| checkEvaluation(Field(Seq(strNull, int1, str1, str2, str3)), 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fails with the patch.
hc.sql("""SELECT FIELD("tejas", 34, "patil", true, null, "tejas") FROM src LIMIT 1""").collect.foreach(println)
Removing null makes it work. Can you check on your side ? It worked with Hive.
| val ((eval, dataType), index) = evalWithIndex | ||
| s""" | ||
| ${eval.code} | ||
| if (${dataType.equals(targetDataType)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point the code at lines 427-428 (ie. .filter(dataTypeEqualsTarget)) have ensured that this will always be true. The generated code will have this as true and you might as well get rid of the check here.
| ${target.code} | ||
| boolean ${ev.isNull} = false; | ||
| int ${ev.value} = 0; | ||
| ${rest.zip(restDataType).zipWithIndex.map(x => (x._1, x._2 + 1)).filter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.zipWithIndex.map(x => (x._1, x._2 + 1)) can be simplified as .zip(Stream from 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thx.
44941ab to
b09f446
Compare
|
@gczsjdy can you please add [WIP] in the title, until you feel the code is ready for review. |
|
|
||
| private lazy val ordering = TypeUtils.getInterpretedOrdering(children(0).dataType) | ||
|
|
||
| private val dataTypeMatchIndex: Seq[Int] = children.tail.zip(Stream from 1).filter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Array[Int] instead? Seq[Int] probably a LinkedList in its concrete implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't it be inconsistent with children? Since children's type is Seq[Expression].
| private lazy val ordering = TypeUtils.getInterpretedOrdering(children(0).dataType) | ||
|
|
||
| private val dataTypeMatchIndex: Seq[Int] = children.tail.zip(Stream from 1).filter( | ||
| _._1.dataType == children.head.dataType).map(_._2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_._1.dataType.sameType(children.head.dataType)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, thx.
|
|
||
| private lazy val ordering = TypeUtils.getInterpretedOrdering(children(0).dataType) | ||
|
|
||
| private val dataTypeMatchIndex: Seq[Int] = children.tail.zip(Stream from 1).filter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
zip(Stream from 1), do we really need it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it's short for zipWithIndex.map(x => (x._1, x._2 + 1)).
I realized it makes people confused, and have changed it.
| override def dataType: DataType = IntegerType | ||
| override def eval(input: InternalRow): Any = { | ||
| val target = children.head.eval(input) | ||
| val targetDataType = children.head.dataType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it
| val target = children.head.eval(input) | ||
| val targetDataType = children.head.dataType | ||
| @tailrec def findEqual(index: Int): Int = { | ||
| if (index == dataTypeMatchIndex.size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if dataTypeMatchIndex is Array[Int], then we'd better use dataTypeMatchIndex.length instead.
| 0 | ||
| } else { | ||
| val value = children(dataTypeMatchIndex(index)).eval(input) | ||
| if (value != null && ordering.equiv(target, value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use braces. see https://github.com/databricks/scala-style-guide#curly
|
Since the different data type will be simply ignored, I think we'd better also add the optimization rule in As well as the python/scala API support, but need to confirm with @rxin, why we don't need the API |
ff9a0be to
08e9f0c
Compare
|
@HyukjinKwon Done, thanks : ) |
|
ok to test |
|
Test build #93100 has finished for PR 16476 at commit
|
|
Test build #93594 has finished for PR 16476 at commit
|
|
Can one of the admins verify this patch? |
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
This is an implementation of expression
fieldwhich is implemented as built-in function by Hive and MySQL.field(expr, expr1, expr2, ... ) is a variable-length(>=2) function that returns the index of expr in (expr1, expr2, ...) list or 0 if not found.
How was this patch tested?
Unit tests are in ConditionalExpressionSuite & ColumnExpressionSuite.