Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-13827[SQL] Can't add subquery to an operator with same-name outputs while generate SQL string #11658

Closed
wants to merge 8 commits into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

This PR tries to solve a fundamental issue in the SQLBuilder. When we want to turn a logical plan into SQL string and put it after FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers(e.g. the Join operator), and this kind of plan can't be put under a subquery as we will erase and assign a new qualifier to all outputs and make it impossible to distinguish same-name outputs.

To solve this problem, this PR renames all attributes with globally unique names(using exprId), so that we don't need qualifiers to resolve ambiguity anymore.

For example, SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like:

SELECT sq_1.key, sq_1.max
FROM (
    SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max
    FROM (
        SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y
    ) AS sq_0
) AS sq_1

You can see, the key columns become ambiguous after sq_0.

After this PR, it will generate something like:

SELECT attr_30 AS key, attr_37 AS max
FROM (
    SELECT attr_30, attr_37
    FROM (
        SELECT attr_30, attr_35, MAX(attr_35) AS attr_37
        FROM (
            SELECT attr_30, attr_35 FROM
                (SELECT key AS attr_30 FROM t1) AS sq_0
            INNER JOIN
                (SELECT key AS attr_35 FROM t1) AS sq_1
        ) AS sq_2
    ) AS sq_3
) AS sq_4

The outermost SELECT is used to turn the generated named to real names back, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore.

How was this patch tested?

existing tests and new tests in LogicalPlanToSQLSuite.

@cloud-fan cloud-fan changed the title [SPARK-XXXX][SQL] Can't add subquery to an operator with same-name outputs while generate SQL string SPARK-13827[SQL] Can't add subquery to an operator with same-name outputs while generate SQL string Mar 11, 2016
@cloud-fan
Copy link
Contributor Author

cc @liancheng @gatorsmile @yhuai

@SparkQA
Copy link

SparkQA commented Mar 11, 2016

Test build #52926 has finished for PR 11658 at commit 198b406.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

You are so fast! Will do the review tonight or tomorrow. I have another test case for this issue. Maybe you can take it. This is Project -- Subquery -- Filter -- Aggregate --...

SELECT Count(a.value), 
       b.KEY, 
       a.KEY 
FROM   parquet_t1 a, 
       parquet_t1 b 
GROUP  BY a.KEY, 
          b.KEY 
HAVING Max(a.KEY> 0 

@SparkQA
Copy link

SparkQA commented Mar 12, 2016

Test build #52985 has finished for PR 11658 at commit 21a142d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 12, 2016

Test build #52987 has finished for PR 11658 at commit ade17d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

)
case SQLTable(database, table, _, sample) =>
val qualifiedName = s"${quoteIdentifier(database)}.${quoteIdentifier(table)}"
sample.map { case (lowerBound, upperBound) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this change, you can remove case Sample

@gatorsmile
Copy link
Member

SELECT x.key, x.value, y.a, y.b, y.c, y.d FROM parquet_t1 x JOIN parquet_t2 y ON x.key = y.a

The generated SQL is

SELECT `gen_attr_30` AS `key`, `gen_attr_31` AS `value`, `gen_attr_32` AS `a`, `gen_attr_33` AS `b`, `gen_attr_34` AS `c`, `gen_attr_35` AS `d` FROM (SELECT `gen_attr_30`, `gen_attr_31`, `gen_attr_32`, `gen_attr_33`, `gen_attr_34`, `gen_attr_35` FROM (SELECT `key` AS `gen_attr_30`, `value` AS `gen_attr_31` FROM `default`.`parquet_t1`) AS gen_subquery_0 INNER JOIN (SELECT `a` AS `gen_attr_32`, `b` AS `gen_attr_33`, `c` AS `gen_attr_34`, `d` AS `gen_attr_35` FROM `default`.`parquet_t2`) AS gen_subquery_1 ON (`gen_attr_30` = `gen_attr_32`)) AS gen_subquery_2

I compared the Optimized Logical Plan of these two queries:

Join Inner, Some((key#30L = a#32L))
:- Filter isnotnull(key#30L)
:  +- Relation[key#30L,value#31] ParquetFormat part: struct<>, data: struct<key:bigint,value:string>
+- Filter isnotnull(a#32L)
   +- Relation[a#32L,b#33L,c#34L,d#35L] ParquetFormat part: struct<>, data: struct<a:bigint,b:bigint,c:bigint,d:bigint>
Project [gen_attr_30#71L AS key#77L,gen_attr_31#72 AS value#78,gen_attr_32#73L AS a#79L,gen_attr_33#74L AS b#80L,gen_attr_34#75L AS c#81L,gen_attr_35#76L AS d#82L]
+- Join Inner, Some((gen_attr_30#71L = gen_attr_32#73L))
   :- Project [key#30L AS gen_attr_30#71L,value#31 AS gen_attr_31#72]
   :  +- Filter isnotnull(key#30L)
   :     +- Relation[key#30L,value#31] ParquetFormat part: struct<>, data: struct<key:bigint,value:string>
   +- Project [a#32L AS gen_attr_32#73L,b#33L AS gen_attr_33#74L,c#34L AS gen_attr_34#75L,d#35L AS gen_attr_35#76L]
      +- Filter isnotnull(a#32L)
         +- Relation[a#32L,b#33L,c#34L,d#35L] ParquetFormat part: struct<>, data: struct<a:bigint,b:bigint,c:bigint,d:bigint>

Here, we always add extra Projects in SQL generation. I am just thinking if we need to do it even if no name ambiguity exists?

@cloud-fan
Copy link
Contributor Author

I think it's because our optimizer is not smart enough, these alias-only Project should be removed, name ambiguity is not a problem anymore after analysis phase.

@gatorsmile
Copy link
Member

True. : )

@SparkQA
Copy link

SparkQA commented Mar 12, 2016

Test build #52993 has finished for PR 11658 at commit 5b12aa0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 14, 2016

Test build #53058 has finished for PR 11658 at commit 5b12aa0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please.

@SparkQA
Copy link

SparkQA commented Mar 15, 2016

Test build #53172 has finished for PR 11658 at commit 5b12aa0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 15, 2016

Test build #53182 has finished for PR 11658 at commit 8de6365.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SQLTable(

@SparkQA
Copy link

SparkQA commented Mar 15, 2016

Test build #53203 has finished for PR 11658 at commit 5ef9fd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SQLTable(

case _: LocalLimit => plan
case _: GlobalLimit => plan
case _: SQLTable => plan
case OneRowRelation => plan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we do not need to add a subquery for these kinds of nodes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the comments says, we don't need to add sub-query if this operator can be put after FROM. So obviously, SubqueryAlias, Join, SQLTable, OneRowRelation don't need extra sub-query. Currently we only support convert logical plan that is parsed from SQL string to SQL string, this implies, Filter, Limit will always appear after table relation and they will generate SQL string like tbl WHERE ... LIMIT ... which can be put after FROM.

Anyway this logical is just copied from original code.

@@ -185,8 +185,7 @@ case class Alias(child: Expression, name: String)(
override def sql: String = {
val qualifiersString =
if (qualifiers.isEmpty) "" else qualifiers.map(quoteIdentifier).mkString("", ".", ".")
val aliasName = if (isGenerated) s"$name#${exprId.id}" else s"$name"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is isGenerated still needed? (if not, we do not need to remove it in this PR). Also, what is the reason of this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isGenerated is still needed to avoid resolving column names on these generated internal attributes. However, it should not affect the sql anymore, as this PR need to control the format of attribute names and alias names.

@yhuai
Copy link
Contributor

yhuai commented Mar 16, 2016

@cloud-fan It will be super helpful if we can have an example in the description as well as in the code. I feel this kind of changes is hard to fully understand without good examples. Thanks!

@yhuai
Copy link
Contributor

yhuai commented Mar 16, 2016

Thank you for the example. That's super helpful. We should also put that in the code. Changes look good. @liancheng It will be good if you can take a look at it later.

I am merging this to master. We can put this example in the comment while working on another PR related to view support.

@asfgit asfgit closed this in 1d1de28 Mar 16, 2016
asfgit pushed a commit that referenced this pull request Mar 17, 2016
## What changes were proposed in this pull request?

This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into `LATERAL VIEW` format as there are many limitations to put UDTF in project list.

This PR is based on #11658, please see the last commit to review the real changes.

Thanks dilipbiswal for his initial work! Takes over #11596

## How was this patch tested?

new tests in `LogicalPlanToSQLSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11696 from cloud-fan/generate.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
…utputs while generate SQL string

## What changes were proposed in this pull request?

This PR tries to solve a fundamental issue in the `SQLBuilder`. When we want to turn a logical plan into SQL string and put it after FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers(e.g. the `Join` operator), and this kind of plan can't be put under a subquery as we will erase and assign a new qualifier to all outputs and make it impossible to distinguish same-name outputs.

To solve this problem, this PR renames all attributes with globally unique names(using exprId), so that we don't need qualifiers to resolve ambiguity anymore.

For example, `SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y`, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like:
```
SELECT sq_1.key, sq_1.max
FROM (
    SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max
    FROM (
        SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y
    ) AS sq_0
) AS sq_1
```
You can see, the `key` columns become ambiguous after `sq_0`.

After this PR, it will generate something like:
```
SELECT attr_30 AS key, attr_37 AS max
FROM (
    SELECT attr_30, attr_37
    FROM (
        SELECT attr_30, attr_35, MAX(attr_35) AS attr_37
        FROM (
            SELECT attr_30, attr_35 FROM
                (SELECT key AS attr_30 FROM t1) AS sq_0
            INNER JOIN
                (SELECT key AS attr_35 FROM t1) AS sq_1
        ) AS sq_2
    ) AS sq_3
) AS sq_4
```
The outermost SELECT is used to turn the generated named to real names back, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore.

## How was this patch tested?

existing tests and new tests in LogicalPlanToSQLSuite.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#11658 from cloud-fan/gensql.
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
## What changes were proposed in this pull request?

This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into `LATERAL VIEW` format as there are many limitations to put UDTF in project list.

This PR is based on apache#11658, please see the last commit to review the real changes.

Thanks dilipbiswal for his initial work! Takes over apache#11596

## How was this patch tested?

new tests in `LogicalPlanToSQLSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#11696 from cloud-fan/generate.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants