SPARK-13827[SQL] Can't add subquery to an operator with same-name outputs while generate SQL string #11658

cloud-fan · 2016-03-11T14:46:00Z

What changes were proposed in this pull request?

This PR tries to solve a fundamental issue in the SQLBuilder. When we want to turn a logical plan into a SQL string and put it after a FROM clause, we need to wrap it in a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers (e.g. the output of the Join operator), and this kind of plan can't be put under a sub-query: wrapping erases the original qualifiers and assigns a new one to all outputs, which makes it impossible to distinguish same-name outputs.

To solve this problem, this PR renames all attributes to globally unique names (using exprId), so that we no longer need qualifiers to resolve ambiguity.

For example, for SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y, we parse the SQL into a Window operator and a Project operator, and add a sub-query between them. Before this PR, the generated SQL looks like:

SELECT sq_1.key, sq_1.max
FROM (
    SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max
    FROM (
        SELECT x.key, y.key FROM t AS x JOIN t AS y
    ) AS sq_0
) AS sq_1

As you can see, the key columns become ambiguous after sq_0.

After this PR, it will generate something like:

SELECT attr_30 AS key, attr_37 AS max
FROM (
    SELECT attr_30, attr_37
    FROM (
        SELECT attr_30, attr_35, MAX(attr_35) OVER () AS attr_37
        FROM (
            SELECT attr_30, attr_35 FROM
                (SELECT key AS attr_30 FROM t) AS sq_0
            INNER JOIN
                (SELECT key AS attr_35 FROM t) AS sq_1
        ) AS sq_2
    ) AS sq_3
) AS sq_4

The outermost SELECT turns the generated names back into the real names, and the innermost SELECTs alias the real columns to our generated names. Between them, there is no name ambiguity anymore.
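The naming idea can be sketched in a few lines of standalone Scala. This is only an illustration under stated assumptions: `Attr`, `UniqueNames`, and all of its helpers are hypothetical stand-ins invented for this example, not the classes or methods used in SQLBuilder, and the `attr_` prefix simply follows the example above.

```scala
// Hypothetical, simplified stand-in for a Catalyst attribute; NOT the real Spark class.
final case class Attr(name: String, exprId: Long, qualifier: Option[String])

object UniqueNames {
  // Name an attribute by its expression ID only, so no qualifier is needed.
  def generatedName(a: Attr): String = s"attr_${a.exprId}"

  // Innermost SELECT: alias each real column to its generated name.
  def aliasToGenerated(attrs: Seq[Attr], table: String): String =
    attrs
      .map(a => s"${a.name} AS ${generatedName(a)}")
      .mkString("SELECT ", ", ", s" FROM $table")

  // Outermost SELECT: alias each generated name back to the real column name.
  def aliasToReal(attrs: Seq[Attr], from: String): String =
    attrs
      .map(a => s"${generatedName(a)} AS ${a.name}")
      .mkString("SELECT ", ", ", s" FROM $from")

  // Names that occur more than once in an output list, i.e. the ambiguous ones.
  def ambiguous(names: Seq[String]): Set[String] =
    names.groupBy(identity).collect {
      case (n, occurrences) if occurrences.size > 1 => n
    }.toSet
}
```

With `x.key` (exprId 30) and `y.key` (exprId 35), the plain column names collide once the qualifiers are erased, while the generated names stay distinct because expression IDs are unique within a plan.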

How was this patch tested?

Existing tests and new tests in LogicalPlanToSQLSuite.

cloud-fan · 2016-03-11T14:46:41Z

cc @liancheng @gatorsmile @yhuai

SparkQA · 2016-03-11T16:56:42Z

Test build #52926 has finished for PR 11658 at commit 198b406.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-03-11T17:50:43Z

You are so fast! Will do the review tonight or tomorrow. I have another test case for this issue. Maybe you can take it. This is Project -- Subquery -- Filter -- Aggregate --...

SELECT Count(a.value), 
       b.KEY, 
       a.KEY 
FROM   parquet_t1 a, 
       parquet_t1 b 
GROUP  BY a.KEY, 
          b.KEY 
HAVING Max(a.KEY) > 0

SparkQA · 2016-03-12T03:42:02Z

Test build #52985 has finished for PR 11658 at commit 21a142d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-12T04:51:34Z

Test build #52987 has finished for PR 11658 at commit ade17d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2016-03-12T05:34:18Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala

-      )
+    case SQLTable(database, table, _, sample) =>
+      val qualifiedName = s"${quoteIdentifier(database)}.${quoteIdentifier(table)}"
+      sample.map { case (lowerBound, upperBound) =>


After this change, you can remove case Sample

gatorsmile · 2016-03-12T06:42:47Z

SELECT x.key, x.value, y.a, y.b, y.c, y.d FROM parquet_t1 x JOIN parquet_t2 y ON x.key = y.a

The generated SQL is

SELECT `gen_attr_30` AS `key`, `gen_attr_31` AS `value`, `gen_attr_32` AS `a`, `gen_attr_33` AS `b`, `gen_attr_34` AS `c`, `gen_attr_35` AS `d` FROM (SELECT `gen_attr_30`, `gen_attr_31`, `gen_attr_32`, `gen_attr_33`, `gen_attr_34`, `gen_attr_35` FROM (SELECT `key` AS `gen_attr_30`, `value` AS `gen_attr_31` FROM `default`.`parquet_t1`) AS gen_subquery_0 INNER JOIN (SELECT `a` AS `gen_attr_32`, `b` AS `gen_attr_33`, `c` AS `gen_attr_34`, `d` AS `gen_attr_35` FROM `default`.`parquet_t2`) AS gen_subquery_1 ON (`gen_attr_30` = `gen_attr_32`)) AS gen_subquery_2

I compared the Optimized Logical Plan of these two queries:

Join Inner, Some((key#30L = a#32L))
:- Filter isnotnull(key#30L)
:  +- Relation[key#30L,value#31] ParquetFormat part: struct<>, data: struct<key:bigint,value:string>
+- Filter isnotnull(a#32L)
   +- Relation[a#32L,b#33L,c#34L,d#35L] ParquetFormat part: struct<>, data: struct<a:bigint,b:bigint,c:bigint,d:bigint>

Project [gen_attr_30#71L AS key#77L,gen_attr_31#72 AS value#78,gen_attr_32#73L AS a#79L,gen_attr_33#74L AS b#80L,gen_attr_34#75L AS c#81L,gen_attr_35#76L AS d#82L]
+- Join Inner, Some((gen_attr_30#71L = gen_attr_32#73L))
   :- Project [key#30L AS gen_attr_30#71L,value#31 AS gen_attr_31#72]
   :  +- Filter isnotnull(key#30L)
   :     +- Relation[key#30L,value#31] ParquetFormat part: struct<>, data: struct<key:bigint,value:string>
   +- Project [a#32L AS gen_attr_32#73L,b#33L AS gen_attr_33#74L,c#34L AS gen_attr_34#75L,d#35L AS gen_attr_35#76L]
      +- Filter isnotnull(a#32L)
         +- Relation[a#32L,b#33L,c#34L,d#35L] ParquetFormat part: struct<>, data: struct<a:bigint,b:bigint,c:bigint,d:bigint>

Here, we always add extra Projects in SQL generation. I am wondering whether we need to do this even when no name ambiguity exists.

cloud-fan · 2016-03-12T07:02:28Z

I think it's because our optimizer is not smart enough: these alias-only Projects should be removed, since name ambiguity is no longer a problem after the analysis phase.

gatorsmile · 2016-03-12T07:04:59Z

True. : )

SparkQA · 2016-03-12T08:38:12Z

Test build #52993 has finished for PR 11658 at commit 5b12aa0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-03-14T07:58:29Z

retest this please

SparkQA · 2016-03-14T11:18:58Z

Test build #53058 has finished for PR 11658 at commit 5b12aa0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-03-15T07:09:07Z

retest this please.

SparkQA · 2016-03-15T09:10:51Z

Test build #53172 has finished for PR 11658 at commit 5b12aa0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-15T11:57:19Z

Test build #53182 has finished for PR 11658 at commit 8de6365.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SQLTable(

SparkQA · 2016-03-15T17:35:12Z

Test build #53203 has finished for PR 11658 at commit 5ef9fd4.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SQLTable(

yhuai · 2016-03-16T02:04:31Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala

+      case _: LocalLimit => plan
+      case _: GlobalLimit => plan
+      case _: SQLTable => plan
+      case OneRowRelation => plan


Why do we not need to add a subquery for these kinds of nodes?

cloud-fan · 2016-03-16

As the comment says, we don't need to add a sub-query if the operator can be put directly after FROM. So obviously SubqueryAlias, Join, SQLTable, and OneRowRelation don't need an extra sub-query. Currently we only support converting a logical plan that was parsed from a SQL string back into a SQL string. This implies that Filter and Limit always appear after a table relation, and they generate a SQL string like tbl WHERE ... LIMIT ..., which can be put after FROM.

Anyway, this logic is just copied from the original code.
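The rule described in this reply can be sketched as a single pattern match. This is a minimal standalone illustration, not the SQLBuilder code: the `Plan` hierarchy below is a hypothetical simplification that only borrows operator names from the discussion (`Table` stands in for SQLTable, `Limit` for LocalLimit/GlobalLimit), and `addSubqueryIfNeeded` with its `alias` parameter is an assumed name and signature.

```scala
// Simplified stand-ins for Catalyst logical operators; NOT the real Spark classes.
sealed trait Plan
final case class Table(name: String) extends Plan
final case class SubqueryAlias(alias: String, child: Plan) extends Plan
final case class Join(left: Plan, right: Plan) extends Plan
final case class Filter(condition: String, child: Plan) extends Plan
final case class Limit(n: Int, child: Plan) extends Plan
final case class Project(columns: Seq[String], child: Plan) extends Plan
case object OneRowRelation extends Plan

object SubqueryWrapping {
  // Wrap the plan in a sub-query only if its SQL cannot be placed directly after FROM.
  def addSubqueryIfNeeded(plan: Plan, alias: String): Plan = plan match {
    // These operators already render as something that can follow FROM,
    // e.g. "tbl WHERE ... LIMIT ..." or "a JOIN b".
    case _: SubqueryAlias | _: Join | _: Table | _: Filter | _: Limit => plan
    case OneRowRelation => plan
    // Everything else (Project in this sketch) renders as a full SELECT
    // and therefore needs the wrapper.
    case other => SubqueryAlias(alias, other)
  }
}
```

In this sketch a `Filter` over a table is returned unchanged, while a `Project` comes back wrapped in a `SubqueryAlias` carrying the supplied alias.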

yhuai · 2016-03-16T02:05:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala

@@ -185,8 +185,7 @@ case class Alias(child: Expression, name: String)(
  override def sql: String = {
    val qualifiersString =
      if (qualifiers.isEmpty) "" else qualifiers.map(quoteIdentifier).mkString("", ".", ".")
-    val aliasName = if (isGenerated) s"$name#${exprId.id}" else s"$name"


Is isGenerated still needed? (If not, we do not need to remove it in this PR.) Also, what is the reason for this change?

cloud-fan · 2016-03-16

isGenerated is still needed to avoid resolving column names against these generated internal attributes. However, it should no longer affect the sql output, as this PR needs to control the format of attribute names and alias names.

yhuai · 2016-03-16T02:09:37Z

@cloud-fan It would be super helpful if we could have an example in the description as well as in the code. I feel this kind of change is hard to fully understand without good examples. Thanks!

yhuai · 2016-03-16T18:56:26Z

Thank you for the example. That's super helpful. We should also put that in the code. Changes look good. @liancheng It will be good if you can take a look at it later.

I am merging this to master. We can put this example in the comment while working on another PR related to view support.

## What changes were proposed in this pull request? This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into `LATERAL VIEW` format as there are many limitations to put UDTF in project list. This PR is based on #11658, please see the last commit to review the real changes. Thanks dilipbiswal for his initial work! Takes over #11596 ## How was this patch tested? new tests in `LogicalPlanToSQLSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #11696 from cloud-fan/generate.


cloud-fan added 2 commits March 11, 2016 20:18

tmp

f4b1ae8

assign globally unique names to all attributes to avoid ambiguity

198b406

cloud-fan changed the title ~~[SPARK-XXXX][SQL] Can't add subquery to an operator with same-name outputs while generate SQL string~~ SPARK-13827[SQL] Can't add subquery to an operator with same-name outputs while generate SQL string Mar 11, 2016

one more test and mionr cleanup

21a142d

cloud-fan force-pushed the gensql branch from d4c3c32 to 21a142d Compare March 12, 2016 02:16

cleanup

ade17d8

gatorsmile reviewed Mar 12, 2016

remove case Sample

5b12aa0

cloud-fan mentioned this pull request Mar 14, 2016

[SPARK-12719][SQL] SQL generation support for Generate #11696

Closed

cloud-fan added 2 commits March 15, 2016 15:29

Merge remote-tracking branch 'origin/master' into gensql

6320e39

Merge remote-tracking branch 'origin/master' into gensql

0010af9

address comments

5ef9fd4

cloud-fan force-pushed the gensql branch from 8de6365 to 5ef9fd4 Compare March 15, 2016 15:51

yhuai reviewed Mar 16, 2016

asfgit closed this in 1d1de28 Mar 16, 2016

yhuai mentioned this pull request Mar 16, 2016

[SPARK-12719][SQL] SQL generation support for Generate #11768

Closed
