Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8407][SQL]complex type constructors: struct and named_struct #6874

Closed
wants to merge 16 commits into from

Conversation

yjshen
Copy link
Member

@yjshen yjshen commented Jun 18, 2015

This is a follow up of SPARK-8283 (PR-6828), to support both struct and named_struct in Spark SQL.

After #6725, the semantic of CreateStruct methods have changed a little and do not limited to cols of NamedExpressions, it will name non-NamedExpression fields following the hive convention, col1, col2 ...

This PR would both loosen struct to take children of Expression type and add named_struct support.

@yjshen yjshen changed the title [SPARK-8407]complex type constructors: struct and named_struct [SPARK-8407][SQL]complex type constructors: struct and named_struct Jun 18, 2015
@yjshen
Copy link
Member Author

yjshen commented Jun 18, 2015

It's ready to be reviewed now.


override def foldable: Boolean = children.forall(_.foldable)

override lazy val resolved: Boolean = childrenResolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'd better remove this, as it's covered by its parent class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get it.

@yjshen
Copy link
Member Author

yjshen commented Jun 19, 2015

@chenghao-intel, I've fixed named_struct to be camel case and also remove the unnecessary override method in CreateStruct and CreateNamedStruct

@yjshen
Copy link
Member Author

yjshen commented Jun 19, 2015

I find it hard to make a column names version of API:

  def namedStruct(fieldName: String, col: String, fieldAndCols: String*): Column = ???

It would limit creation of Literal fields. However, when we change the API to this one:

  def namedStruct(fieldName: String, col: Any, fieldAndCols: Any*): Column = ???

When we have String in even positions, it's impossible to tell if the user want to create a String Literal or refer to a col

@chenghao-intel
Copy link
Contributor

The Dataframe API does not like the normal function, the string arguments actually represent the associated columns, not the value it's represented.

@rxin I think that's a common problem if we want to passed a string literal for DataFrame functions, do you have any suggestion for that?

@rxin
Copy link
Contributor

rxin commented Jun 19, 2015

we can document that string literals should be set using lit("...")

@yjshen
Copy link
Member Author

yjshen commented Jun 19, 2015

@rxin, what do you think of the column names version API?

@rxin
Copy link
Contributor

rxin commented Jun 19, 2015

I don't think we need named_struct in DataFrame, since struct itself is powerful enough already. Just have it for SQL.

@yjshen
Copy link
Member Author

yjshen commented Jun 19, 2015

OK, I would remove it from DataFrame.

* @param children Seq(name1, val1, name2, val2, ...)
*/
case class CreateNamedStruct(children: Seq[Expression]) extends Expression {
assert(children.size % 2 == 0, "NamedStruct expects an even number of arguments.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't use assert here

assert is for internal errors. maybe it's best to use checkInputTypes to do this: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L169

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, please use checkInputTypes here to check children.size % 2 == 0 and all name expressions are non-null literal string.

@rxin
Copy link
Contributor

rxin commented Jun 19, 2015

@cloud-fan can you help review this one?

assert(children.size % 2 == 0, "NamedStruct expects an even number of arguments.")

private val nameExprs = children.zipWithIndex.filter(_._2 % 2 == 0).map(_._1)
private val valExprs = children.zipWithIndex.filter(_._2 % 2 == 1).map(_._1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about

private val (nameExprs, valExprs) = children.sliding(2, 2).collect { case Seq(a, b) => a -> b }.toList.unzip

or

private val (nameExprs, valExprs) = children.zipWithIndex.partition(_._2 % 2 == 0).map(_.map(_._1))

@yjshen yjshen closed this Jun 19, 2015
@yjshen yjshen reopened this Jun 19, 2015
@yjshen
Copy link
Member Author

yjshen commented Jun 19, 2015

Close by mistake.

@yjshen
Copy link
Member Author

yjshen commented Jun 19, 2015

@rxin @cloud-fan , thanks for the detailed reviews.
I've change the implementation in the latest commit, mind reviewing it again?

}

override def eval(input: InternalRow): Any = {
require(resolved, resolveFailureMessage)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the require out of the eval, a better place probably within the def checkInputDataTypes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In complexTypeSuite, when I call CreateNamedStruct directly in checkEvaluation, checkInputType are not executed, so I call resolved here to utilize its default implementation to do checkInputType.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A better way to enforce the check?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkEvaluation just evaluate the expression, not go through the whole analyze process. So you can write normal test at complexTypeSuite and write error test at ExpressionTypeCheckingSuite.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get it.

@yjshen
Copy link
Member Author

yjshen commented Jun 20, 2015

@cloud-fan @chenghao-intel, thanks for reviewing this. I've moved the incorrect input test into ExpressionTypeCheckingSuite and also remove unnecessary assertion from dataType.

@yjshen
Copy link
Member Author

yjshen commented Jun 23, 2015

Jenkins, retest this please

@yjshen
Copy link
Member Author

yjshen commented Jun 23, 2015

@rxin, could you please review this and also trigger the test?

@rxin
Copy link
Contributor

rxin commented Jun 23, 2015

Jenkins, ok to test.

@SparkQA
Copy link

SparkQA commented Jun 23, 2015

Test build #35569 has finished for PR 6874 at commit 156c2a5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateNamedStruct(children: Seq[Expression]) extends Expression

@@ -1 +1 @@
{"aa":"10","aaaaaa":"11","aaaaaa":"12","bb12":"13","s14s14":"14"}
{"aa":"10","aaaaaa":"11","aaaaaa":"12","Bb12":"13","s14s14":"14"}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query is:

createQueryTest("constant object inspector for generic udf",
    """SELECT named_struct(
      lower("AA"), "10",
      repeat(lower("AA"), 3), "11",
      lower(repeat("AA", 3)), "12",
      printf("Bb%d", 12), "13",
      repeat(printf("s%d", 14), 2), "14") FROM src LIMIT 1""")

Since printf in Hive didn't change word case in Bb%d, therefore, Bb12 is the right answer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't change machine generated golden answers though. If we are going to differ from hive use checkAnswer instead.

@SparkQA
Copy link

SparkQA commented Jun 24, 2015

Test build #35672 has finished for PR 6874 at commit 0ae010a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateNamedStruct(children: Seq[Expression]) extends Expression

StructField("b", StringType)
))
assert(row.schema(0).dataType === expectedType)
assert(row.getAs[Row](0) === Row(2, "str"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use checkAnswer instead of assert, it gives better error messages when there is a failure.

@yjshen
Copy link
Member Author

yjshen commented Jun 26, 2015

@marmbrus , I remove the previous wrong golden answer and generate a new one during test.

@SparkQA
Copy link

SparkQA commented Jun 26, 2015

Test build #35851 has finished for PR 6874 at commit 385e490.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StreamingLinearAlgorithm(object):
    • class StreamingLogisticRegressionWithSGD(StreamingLinearAlgorithm):
    • case class CountFunction(expr: Expression, base: AggregateExpression) extends AggregateFunction
    • case class CountDistinctFunction(
    • case class ApproxCountDistinctPartitionFunction(
    • case class ApproxCountDistinctMergeFunction(
    • case class Sum(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression]
    • case class CombineSum(child: Expression) extends AggregateExpression
    • case class SumDistinct(child: Expression)
    • case class CombineSetsAndSum(inputSet: Expression, base: Expression) extends AggregateExpression
    • case class CombineSetsAndSumFunction(
    • case class First(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression]
    • case class Last(child: Expression) extends PartialAggregate with trees.UnaryNode[Expression]
    • case class CreateNamedStruct(children: Seq[Expression]) extends Expression
    • case class Sha2(left: Expression, right: Expression)
    • case class PrecisionInfo(precision: Int, scale: Int)
    • case class TakeOrderedAndProject(

@@ -747,9 +747,7 @@ object functions {
*/
@scala.annotation.varargs
def struct(cols: Column*): Column = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation above needs to be updated and should specify what happens when the columns are unnamed.

@marmbrus
Copy link
Contributor

Do we also need to add this to functions.py?

@yjshen
Copy link
Member Author

yjshen commented Jul 2, 2015

@marmbrus , I have no idea whether we should put namedStruct in functions.py as well as functions.scala, since @rxin thought struct is powerful enough itself.

@SparkQA
Copy link

SparkQA commented Jul 2, 2015

Test build #36399 has finished for PR 6874 at commit d599d0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateNamedStruct(children: Seq[Expression]) extends Expression

@SparkQA
Copy link

SparkQA commented Jul 2, 2015

Test build #36403 has finished for PR 6874 at commit 4cd3375.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateNamedStruct(children: Seq[Expression]) extends Expression

@marmbrus
Copy link
Contributor

marmbrus commented Jul 2, 2015

I agree that struct is enough in scala/python. Thanks! Merging to master.

@asfgit asfgit closed this in 52302a8 Jul 2, 2015
@yjshen yjshen deleted the SPARK-8283 branch July 20, 2015 02:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants