
[SPARK-16947][SQL] Improve type coercion for inline tables. #14539

Closed
wants to merge 5 commits

Conversation

hvanhovell
Contributor

@hvanhovell hvanhovell commented Aug 8, 2016

What changes were proposed in this pull request?

Inline tables were added to Spark SQL in 2.0, e.g.: select * from values (1, 'A'), (2, 'B') as tbl(a, b)

This is currently implemented using a LocalRelation, and this relation is created during parsing. This has a weakness: type coercion is based on the first row in the relation, and all subsequent values are cast to the types of that first row. The latter violates the principle of least surprise.

This PR fixes this by creating a dedicated InlineTable node. Type coercion now follows the rules for Union, which is similar to other systems such as PostgreSQL. In order to retain optimal speed, I have extended the ConvertToLocalRelation rule, which makes sure the table gets rewritten into a LocalRelation during optimization.

The following SQL statement:

select * from values (1, exp(1)), (2.00, 'b'), (3.0, 'tt'), (40.0001, 4) x(a, b)

... now yields the following plan:

== Parsed Logical Plan ==
'Project [*]
+- 'SubqueryAlias x
   +- 'InlineTable [ArrayBuffer(1 AS a#0, 'exp(1) AS b#1), ArrayBuffer(2.00 AS a#2, b AS b#3), ArrayBuffer(3.0 AS a#4, tt AS b#5), ArrayBuffer(40.0001 AS a#6, 4 AS b#7)]

== Analyzed Logical Plan ==
1 AS a#0: decimal(14,4), EXP(cast(1 as double)) AS b#1: string
Project [1 AS a#0#18, EXP(cast(1 as double)) AS b#1#19]
+- SubqueryAlias x
   +- InlineTable [ArrayBuffer(cast(1 as decimal(14,4)) AS 1 AS a#0#18, cast(EXP(cast(1 as double)) as string) AS EXP(cast(1 as double)) AS b#1#19), ArrayBuffer(cast(2.00 as decimal(14,4)) AS 2.00 AS a#2#20, b AS b#3), ArrayBuffer(cast(3.0 as decimal(14,4)) AS 3.0 AS a#4#21, tt AS b#5), ArrayBuffer(cast(40.0001 as decimal(14,4)) AS 40.0001 AS a#6#22, cast(4 as string) AS 4 AS b#7#23)]

== Optimized Logical Plan ==
LocalRelation [1 AS a#0#18, EXP(cast(1 as double)) AS b#1#19]

== Physical Plan ==
LocalTableScan [1 AS a#0#18, EXP(cast(1 as double)) AS b#1#19]

... and the following result:

+-------+-----------------+
|      a|                b|
+-------+-----------------+
| 1.0000|2.718281828459045|
| 2.0000|                b|
| 3.0000|               tt|
|40.0001|                4|
+-------+-----------------+
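As an illustrative sketch (not taken from the PR itself): since type coercion for inline tables now follows the rules for Union, the query above should resolve its column types the same way an equivalent UNION ALL chain would, with column b's mix of double, string, and int types resolving to string:

```sql
-- Illustrative only: per this PR, both queries should coerce b to string,
-- because inline-table coercion now follows the rules for Union.
SELECT * FROM VALUES (1, exp(1)), (2.00, 'b'), (3.0, 'tt'), (40.0001, 4) x(a, b);

SELECT 1 AS a, exp(1) AS b
UNION ALL SELECT 2.00, 'b'
UNION ALL SELECT 3.0, 'tt'
UNION ALL SELECT 40.0001, 4;
```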

How was this patch tested?

I have updated the PlanParseSuite to test the parser's new output, and I have added tests to the ConvertToLocalRelationSuite to check that inline-table-like structures are converted into LocalRelations. I still need to add tests for the TypeCoercion and CheckAnalysis rules.

@hvanhovell
Contributor Author

cc @cloud-fan @eyalfa

@eyalfa

eyalfa commented Aug 8, 2016

@hvanhovell in case of a validation error, what will the error message look like? Will it mention inline tables in any way, or is it just going to complain about Union's requirements?

@SparkQA

SparkQA commented Aug 8, 2016

Test build #63358 has finished for PR 14539 at commit 3b0a28b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor Author

@eyalfa It is currently going to complain about Unions (if type coercion fails).

@eyalfa

eyalfa commented Aug 8, 2016

@hvanhovell don't you think that, since you're already taking a stab at it, it would be better to introduce something like an UnresolvedInlineTable with its own resolution logic and type coercion? Once it's resolved, it can be transformed into a Union of simple projects.

@hvanhovell
Contributor Author

@eyalfa I am a bit hesitant to add yet another almost pointless LogicalPlan node to Catalyst, and certainly not one that is functionally exactly the same as a Union. This would require us to add another rule to the analyzer/optimizer and we would have to add a separate case to type coercion.

I do think your point has merit and that errors should be as concise as possible, but I would (if we were to change this) rather add some sort of an alias which encodes this information or just add a flag to Union.

@eyalfa

eyalfa commented Aug 8, 2016

Fair enough. I think it's worth adding a negative test just to see what we're dealing with.
By the way, will the error message/location somehow point back to the inline table?

@SparkQA

SparkQA commented Aug 8, 2016

Test build #63360 has finished for PR 14539 at commit 3f3aa93.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 8, 2016

Test build #63366 has finished for PR 14539 at commit 9a827ce.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Aug 8, 2016

Can we create a common function used by both Union and this? It seems like a pretty complicated plan to do this via Union.

@SparkQA

SparkQA commented Aug 8, 2016

Test build #63378 has finished for PR 14539 at commit 392eb0a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class AbstractOneRowRelation extends LeafNode

    val numExpectedColumns = rows.head.size
    val aliases = if (ctx.identifierList != null) {
      val names = visitIdentifierList(ctx.identifierList)
      assert(names.size == numExpectedColumns,
Contributor


This is an error case users can hit, should we throw ParserException instead of assert?

Contributor Author


It uses a parser only version of assert that throws a ParseException: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParserUtils.scala#L81

Come to think of it, we might need to rename it because people expect that assert calls can be elided. That is for a different PR though.

Contributor


+1

 * Converts local operations (i.e. ones that don't require data exchange) on LocalRelation or
 * another LocalRelation.
 *
 * This is relatively simple as it currently handles only a single case: Project.
Contributor Author


Update comment.

@cloud-fan
Contributor

@hvanhovell, I may be missing something: why do we create this new InlineTable instead of using Union? I think we can create a special OneRowRelation (e.g. UnfoldableOneRowRelation) in test scope and use it in ExpressionEvalHelper.

@SparkQA

SparkQA commented Aug 9, 2016

Test build #63442 has finished for PR 14539 at commit daa01a2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InlineTable(rows: Seq[Seq[NamedExpression]]) extends LeafNode

@hvanhovell
Contributor Author

@cloud-fan I had an offline discussion with @rxin about this. His main point was that a larger inline table would create an extremely unreadable plan. So I came up with this.

@cloud-fan
Contributor

@hvanhovell how about we make InlineTable an unresolved plan and resolve it to LocalRelation in the analyzer? It's weird that we have no planner rule for InlineTable and depend on the optimizer to convert it to LocalRelation. Besides, we can put the alias names as a field in InlineTable instead of aliasing every expression in every row, because we don't need to define output for InlineTable if it's an unresolved plan.

@petermaxlee
Contributor

@hvanhovell do you mind me taking a look at this? I am running into an issue where I cannot use the array() function to construct an array in inline tables (only literals are allowed). I can try to fix the type coercion issue there too.
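For reference, a minimal reproduction of the array() limitation described in the comment above might look like this (a sketch, not taken from the PR):

```sql
-- Reportedly rejected today: inline tables accept only literals,
-- so function calls like array(...) cannot be used to build a row value.
SELECT * FROM VALUES (1, array(1, 2)), (2, array(3, 4)) AS t(key, arr);
```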

@hvanhovell
Contributor Author

@petermaxlee sure go ahead!

    // Create expressions.
    val rows = ctx.expression.asScala.map { e =>
      expression(e) match {
        case CreateStruct(children) => children
Contributor


@hvanhovell what's this about? Why do we need to expand struct?

Contributor


Actually I think I understand what's happening here now.

Contributor Author


The parser creates rows by issuing CreateStruct calls. An inline table takes a Seq[Expression] per row, so we need to extract the children from the CreateStruct.

@gatorsmile
Member

gatorsmile commented Aug 17, 2016

How do we specify null when creating an inline table?

    sql(
      """
        |create temporary view src as select * from values
        |(201, null),
        |(86, "val_86")
        |as data(key, value)
      """.stripMargin)

Is that supported?

@hvanhovell
Contributor Author

@gatorsmile we should support this, but you might have to add an explicit cast.
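A sketch of what that explicit cast might look like (illustrative only; the cast gives the NULL a concrete type so coercion has something to work with):

```sql
-- Hypothetical workaround for the example above: type the NULL explicitly.
CREATE TEMPORARY VIEW src AS SELECT * FROM VALUES
  (201, CAST(NULL AS STRING)),
  (86, 'val_86')
AS data(key, value);
```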

@hvanhovell
Contributor Author

closing in favor of #14676

@hvanhovell hvanhovell closed this Aug 17, 2016