[SPARK-8407][SQL]complex type constructors: struct and named_struct #6874
Conversation
It's ready to be reviewed now.

```scala
override def foldable: Boolean = children.forall(_.foldable)

override lazy val resolved: Boolean = childrenResolved
```
We'd better remove this, as it's covered by its parent class.
Got it.
@chenghao-intel, I've fixed it.
I find it hard to make a column-names version of the API:

```scala
def namedStruct(fieldName: String, col: String, fieldAndCols: String*): Column = ???
```

It would prevent the creation of `Literal` fields. However, if we change the API to this:

```scala
def namedStruct(fieldName: String, col: Any, fieldAndCols: Any*): Column = ???
```

then when a `String` appears in a value position, it's impossible to tell whether the user wants to create a string `Literal` or refer to a column.
Unlike normal functions, in the DataFrame API string arguments actually represent the associated columns, not literal values. @rxin, I think that's a common problem whenever we want to pass a string literal to DataFrame functions; do you have any suggestions for that?
We can document that string literals should be set using `lit("...")`.
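The `lit(...)` convention could look like the sketch below (a hypothetical snippet; the DataFrame `df` and its column names are assumptions, not taken from this PR):

```scala
import org.apache.spark.sql.functions._

// In DataFrame functions, bare strings are read as column names,
// so string *values* must be wrapped explicitly with lit(...):
df.select(struct(col("a"), lit("a constant string"), col("b")))
```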
@rxin, what do you think of the column-names version of the API?
I don't think we need `named_struct` in DataFrame, since `struct` itself is powerful enough already. Just have it for SQL.
OK, I will remove it from the DataFrame API.
```scala
 * @param children Seq(name1, val1, name2, val2, ...)
 */
case class CreateNamedStruct(children: Seq[Expression]) extends Expression {
  assert(children.size % 2 == 0, "NamedStruct expects an even number of arguments.")
```
We shouldn't use `assert` here.

`assert` is for internal errors. Maybe it's best to use `checkInputTypes` to do this: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L169
Yes, please use `checkInputTypes` here to check that `children.size % 2 == 0` and that all name expressions are non-null literal strings.
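A rough sketch of what that check might look like using Catalyst's `checkInputDataTypes` hook and `TypeCheckResult` (the exact merged implementation and error messages may differ):

```scala
// Sketch only: assumes nameExprs holds the expressions at even positions.
override def checkInputDataTypes(): TypeCheckResult = {
  if (children.size % 2 != 0) {
    TypeCheckResult.TypeCheckFailure(
      "CreateNamedStruct expects an even number of arguments.")
  } else {
    // Each name expression must be a non-null string literal.
    val invalidNames = nameExprs.filterNot {
      case Literal(name, StringType) if name != null => true
      case _ => false
    }
    if (invalidNames.isEmpty) {
      TypeCheckResult.TypeCheckSuccess
    } else {
      TypeCheckResult.TypeCheckFailure(
        s"Field names must be non-null string literals, got: ${invalidNames.mkString(", ")}")
    }
  }
}
```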
@cloud-fan, can you help review this one?
```scala
assert(children.size % 2 == 0, "NamedStruct expects an even number of arguments.")

private val nameExprs = children.zipWithIndex.filter(_._2 % 2 == 0).map(_._1)
private val valExprs = children.zipWithIndex.filter(_._2 % 2 == 1).map(_._1)
```
What about

```scala
private val (nameExprs, valExprs) =
  children.sliding(2, 2).collect { case Seq(a, b) => a -> b }.toList.unzip
```

or

```scala
private val (nameExprs, valExprs) =
  children.zipWithIndex.partition(_._2 % 2 == 0).map(_.map(_._1))
```
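A small self-contained check that both rewrites split the children into the same alternating name/value positions; note that the second one-liner needs the partitioned tuple destructured explicitly in plain Scala (`Tuple2` has no `.map`), and the value names here are illustrative:

```scala
val children = Seq("n1", "v1", "n2", "v2")

// First suggestion: pair up adjacent elements, then unzip.
val (names1, vals1) =
  children.sliding(2, 2).collect { case Seq(a, b) => a -> b }.toList.unzip

// Second suggestion, with the tuple destructured.
val (evens, odds) = children.zipWithIndex.partition(_._2 % 2 == 0)
val (names2, vals2) = (evens.map(_._1), odds.map(_._1))

assert(names1 == names2.toList && vals1 == vals2.toList)
```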
Closed by mistake.
@rxin @cloud-fan, thanks for the detailed reviews.
```scala
}

override def eval(input: InternalRow): Any = {
  require(resolved, resolveFailureMessage)
```
Move the `require` out of `eval`; a better place is probably within `checkInputDataTypes`.
In ComplexTypeSuite, when I call `CreateNamedStruct` directly in `checkEvaluation`, `checkInputType` is not executed, so I call `resolved` here to utilize its default implementation to run `checkInputType`.
Is there a better way to enforce the check?
`checkEvaluation` just evaluates the expression; it does not go through the whole analysis process. So you can write normal tests in ComplexTypeSuite and error tests in ExpressionTypeCheckingSuite.
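The suggested split might look roughly like this (the helper names follow Spark's test utilities, but the exact signatures and expected values here are illustrative assumptions):

```scala
// ComplexTypeSuite: exercise only the evaluation path.
checkEvaluation(
  CreateNamedStruct(Seq(Literal("a"), Literal(1), Literal("b"), Literal("str"))),
  create_row(1, "str"))

// ExpressionTypeCheckingSuite: malformed input should fail type checking,
// not blow up during evaluation.
assertError(
  CreateNamedStruct(Seq(Literal("a"), Literal(1), Literal("b"))),
  "even number of arguments")
```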
Got it.
@cloud-fan @chenghao-intel, thanks for reviewing this. I've moved the incorrect-input test as suggested.
Jenkins, retest this please.

@rxin, could you please review this and also trigger the test?
Jenkins, ok to test. |
Test build #35569 has finished for PR 6874 at commit
```diff
@@ -1 +1 @@
-{"aa":"10","aaaaaa":"11","aaaaaa":"12","bb12":"13","s14s14":"14"}
+{"aa":"10","aaaaaa":"11","aaaaaa":"12","Bb12":"13","s14s14":"14"}
```
The query is:

```scala
createQueryTest("constant object inspector for generic udf",
  """SELECT named_struct(
    lower("AA"), "10",
    repeat(lower("AA"), 3), "11",
    lower(repeat("AA", 3)), "12",
    printf("Bb%d", 12), "13",
    repeat(printf("s%d", 14), 2), "14") FROM src LIMIT 1""")
```
Since `printf` in Hive doesn't change the word case in `Bb%d`, `Bb12` is the right answer.
We shouldn't change machine-generated golden answers, though. If we are going to differ from Hive, use `checkAnswer` instead.
Test build #35672 has finished for PR 6874 at commit
```scala
  StructField("b", StringType)
))
assert(row.schema(0).dataType === expectedType)
assert(row.getAs[Row](0) === Row(2, "str"))
```
Use `checkAnswer` instead of `assert`; it gives better error messages when there is a failure.
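A hypothetical rewrite of the assertion above using `checkAnswer` (the DataFrame `df` and the expected row are illustrative assumptions, not taken from the PR):

```scala
// checkAnswer compares the full result set and prints a readable diff
// on mismatch, unlike a bare assert.
checkAnswer(
  df.select(struct(col("a"), col("b"))),
  Row(Row(2, "str")) :: Nil)
```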
@marmbrus, I removed the previous wrong golden answer and generated a new one during testing.
Test build #35851 has finished for PR 6874 at commit
```
@@ -747,9 +747,7 @@ object functions {
   */
  @scala.annotation.varargs
  def struct(cols: Column*): Column = {
```
The documentation above needs to be updated and should specify what happens when the columns are unnamed.
Do we also need to add this to …?
Test build #36399 has finished for PR 6874 at commit
Test build #36403 has finished for PR 6874 at commit
I agree that `struct` is enough in Scala/Python. Thanks! Merging to master.
This is a follow-up of SPARK-8283 (PR #6828), to support both `struct` and `named_struct` in Spark SQL.

After #6725, the semantics of `CreateStruct` changed a little: it is no longer limited to columns of `NamedExpression` type, and it names non-`NamedExpression` fields following the Hive convention (col1, col2, ...).

This PR both loosens `struct` to take children of `Expression` type and adds `named_struct` support.
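Illustrative usage of the two constructors this PR touches (the session, table, and column names here are assumptions, not taken from the PR):

```scala
// SQL path: both constructors are available.
sqlContext.sql("""
  SELECT
    struct(1, 'a'),                 -- fields auto-named col1, col2 (Hive convention)
    named_struct('x', 1, 'y', 'a')  -- explicit field names
  FROM src LIMIT 1
""")

// DataFrame path: only struct; wrap string literals with lit(...).
df.select(struct(col("a"), lit("b")))
```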