[SPARK-8407][SQL]complex type constructors: struct and named_struct #6874
Conversation
It's ready to be reviewed now.

```scala
override def foldable: Boolean = children.forall(_.foldable)

override lazy val resolved: Boolean = childrenResolved
```
We'd better remove this, as it's covered by its parent class.
Got it.
@chenghao-intel, I've fixed it.
I find it hard to make a column-names version of the API:

```scala
def namedStruct(fieldName: String, col: String, fieldAndCols: String*): Column = ???
```

It would prevent the creation of `Literal` fields. However, if we change the API to this:

```scala
def namedStruct(fieldName: String, col: Any, fieldAndCols: Any*): Column = ???
```

then when a `String` appears in a value position, it's impossible to tell whether the user wants to create a string `Literal` or refer to a column.
Unlike normal functions, in the DataFrame API string arguments actually represent the associated columns, not literal values. @rxin, I think that's a common problem whenever we want to pass a string literal to DataFrame functions; do you have any suggestions for that?
We can document that string literals should be set using `lit("...")`.
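The `lit(...)` convention could look like the sketch below (a hypothetical snippet; the DataFrame `df` and its column names are assumptions, not taken from this PR):

```scala
import org.apache.spark.sql.functions._

// In DataFrame functions, bare strings are read as column names,
// so string *values* must be wrapped explicitly with lit(...):
df.select(struct(col("a"), lit("a constant string"), col("b")))
```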
@rxin, what do you think of the column-names version of the API?
I don't think we need `named_struct` in DataFrame, since `struct` itself is powerful enough already. Just have it for SQL.
OK, I will remove it from the DataFrame API.
```scala
 * @param children Seq(name1, val1, name2, val2, ...)
 */
case class CreateNamedStruct(children: Seq[Expression]) extends Expression {
  assert(children.size % 2 == 0, "NamedStruct expects an even number of arguments.")
```
We shouldn't use `assert` here.

`assert` is for internal errors. Maybe it's best to use `checkInputTypes` to do this: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L169
Yes, please use `checkInputTypes` here to check that `children.size % 2 == 0` and that all name expressions are non-null literal strings.
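A rough sketch of what that check might look like using Catalyst's `checkInputDataTypes` hook and `TypeCheckResult` (the exact merged implementation and error messages may differ):

```scala
// Sketch only: assumes nameExprs holds the expressions at even positions.
override def checkInputDataTypes(): TypeCheckResult = {
  if (children.size % 2 != 0) {
    TypeCheckResult.TypeCheckFailure(
      "CreateNamedStruct expects an even number of arguments.")
  } else {
    // Each name expression must be a non-null string literal.
    val invalidNames = nameExprs.filterNot {
      case Literal(name, StringType) if name != null => true
      case _ => false
    }
    if (invalidNames.isEmpty) {
      TypeCheckResult.TypeCheckSuccess
    } else {
      TypeCheckResult.TypeCheckFailure(
        s"Field names must be non-null string literals, got: ${invalidNames.mkString(", ")}")
    }
  }
}
```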
@cloud-fan, can you help review this one?
```scala
assert(children.size % 2 == 0, "NamedStruct expects an even number of arguments.")

private val nameExprs = children.zipWithIndex.filter(_._2 % 2 == 0).map(_._1)
private val valExprs = children.zipWithIndex.filter(_._2 % 2 == 1).map(_._1)
```
What about

```scala
private val (nameExprs, valExprs) =
  children.sliding(2, 2).collect { case Seq(a, b) => a -> b }.toList.unzip
```

or

```scala
private val (nameExprs, valExprs) =
  children.zipWithIndex.partition(_._2 % 2 == 0).map(_.map(_._1))
```
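A small self-contained check that both rewrites split the children into the same alternating name/value positions; note that the second one-liner needs the partitioned tuple destructured explicitly in plain Scala (`Tuple2` has no `.map`), and the value names here are illustrative:

```scala
val children = Seq("n1", "v1", "n2", "v2")

// First suggestion: pair up adjacent elements, then unzip.
val (names1, vals1) =
  children.sliding(2, 2).collect { case Seq(a, b) => a -> b }.toList.unzip

// Second suggestion, with the tuple destructured.
val (evens, odds) = children.zipWithIndex.partition(_._2 % 2 == 0)
val (names2, vals2) = (evens.map(_._1), odds.map(_._1))

assert(names1 == names2.toList && vals1 == vals2.toList)
```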
Closed by mistake.
@rxin @cloud-fan, thanks for the detailed reviews.
```scala
}

override def eval(input: InternalRow): Any = {
  require(resolved, resolveFailureMessage)
```
Move the `require` out of `eval`; a better place is probably within `checkInputDataTypes`.
In ComplexTypeSuite, when I call `CreateNamedStruct` directly in `checkEvaluation`, `checkInputType` is not executed, so I call `resolved` here to utilize its default implementation to run `checkInputType`.
Is there a better way to enforce the check?
`checkEvaluation` just evaluates the expression; it does not go through the whole analysis process. So you can write normal tests in ComplexTypeSuite and error tests in ExpressionTypeCheckingSuite.
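The suggested split might look roughly like this (the helper names follow Spark's test utilities, but the exact signatures and expected values here are illustrative assumptions):

```scala
// ComplexTypeSuite: exercise only the evaluation path.
checkEvaluation(
  CreateNamedStruct(Seq(Literal("a"), Literal(1), Literal("b"), Literal("str"))),
  create_row(1, "str"))

// ExpressionTypeCheckingSuite: malformed input should fail type checking,
// not blow up during evaluation.
assertError(
  CreateNamedStruct(Seq(Literal("a"), Literal(1), Literal("b"))),
  "even number of arguments")
```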
Got it.
@cloud-fan @chenghao-intel, thanks for reviewing this. I've moved the incorrect-input test as suggested.
Jenkins, retest this please.

@rxin, could you please review this and also trigger the test?
Jenkins, ok to test. |
Test build #35569 has finished for PR 6874 at commit
```diff
@@ -1 +1 @@
-{"aa":"10","aaaaaa":"11","aaaaaa":"12","bb12":"13","s14s14":"14"}
+{"aa":"10","aaaaaa":"11","aaaaaa":"12","Bb12":"13","s14s14":"14"}
```
The query is:

```scala
createQueryTest("constant object inspector for generic udf",
  """SELECT named_struct(
    lower("AA"), "10",
    repeat(lower("AA"), 3), "11",
    lower(repeat("AA", 3)), "12",
    printf("Bb%d", 12), "13",
    repeat(printf("s%d", 14), 2), "14") FROM src LIMIT 1""")
```
Since `printf` in Hive doesn't change the word case in `Bb%d`, `Bb12` is the right answer.
We shouldn't change machine-generated golden answers, though. If we are going to differ from Hive, use `checkAnswer` instead.
Test build #35672 has finished for PR 6874 at commit
```scala
  StructField("b", StringType)
))
assert(row.schema(0).dataType === expectedType)
assert(row.getAs[Row](0) === Row(2, "str"))
```
Use `checkAnswer` instead of `assert`; it gives better error messages when there is a failure.
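A hypothetical rewrite of the assertion above using `checkAnswer` (the DataFrame `df` and the expected row are illustrative assumptions, not taken from the PR):

```scala
// checkAnswer compares the full result set and prints a readable diff
// on mismatch, unlike a bare assert.
checkAnswer(
  df.select(struct(col("a"), col("b"))),
  Row(Row(2, "str")) :: Nil)
```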
@marmbrus, I removed the previous wrong golden answer and generated a new one during testing.
Test build #35851 has finished for PR 6874 at commit
```
@@ -747,9 +747,7 @@ object functions {
   */
  @scala.annotation.varargs
  def struct(cols: Column*): Column = {
```
The documentation above needs to be updated and should specify what happens when the columns are unnamed.
Do we also need to add this to …?
Test build #36399 has finished for PR 6874 at commit
Test build #36403 has finished for PR 6874 at commit
I agree that `struct` is enough in Scala/Python. Thanks! Merging to master.
This is a follow-up of SPARK-8283 (PR #6828), to support both `struct` and `named_struct` in Spark SQL.

After #6725, the semantics of `CreateStruct` changed a little: it is no longer limited to columns of `NamedExpression` type, and it names non-`NamedExpression` fields following the Hive convention (col1, col2, ...).

This PR both loosens `struct` to take children of `Expression` type and adds `named_struct` support.
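Illustrative usage of the two constructors this PR touches (the session, table, and column names here are assumptions, not taken from the PR):

```scala
// SQL path: both constructors are available.
sqlContext.sql("""
  SELECT
    struct(1, 'a'),                 -- fields auto-named col1, col2 (Hive convention)
    named_struct('x', 1, 'y', 'a')  -- explicit field names
  FROM src LIMIT 1
""")

// DataFrame path: only struct; wrap string literals with lit(...).
df.select(struct(col("a"), lit("b")))
```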