-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SPARK-13827[SQL] Can't add subquery to an operator with same-name outputs while generate SQL string #11658
Conversation
Test build #52926 has finished for PR 11658 at commit
|
You are so fast! Will do the review tonight or tomorrow. I have another test case for this issue. Maybe you can take it. This is Project -- Subquery -- Filter -- Aggregate --... SELECT Count(a.value),
b.KEY,
a.KEY
FROM parquet_t1 a,
parquet_t1 b
GROUP BY a.KEY,
b.KEY
HAVING Max(a.KEY) > 0 |
Test build #52985 has finished for PR 11658 at commit
|
Test build #52987 has finished for PR 11658 at commit
|
) | ||
case SQLTable(database, table, _, sample) => | ||
val qualifiedName = s"${quoteIdentifier(database)}.${quoteIdentifier(table)}" | ||
sample.map { case (lowerBound, upperBound) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this change, you can remove case Sample
SELECT x.key, x.value, y.a, y.b, y.c, y.d FROM parquet_t1 x JOIN parquet_t2 y ON x.key = y.a The generated SQL is SELECT `gen_attr_30` AS `key`, `gen_attr_31` AS `value`, `gen_attr_32` AS `a`, `gen_attr_33` AS `b`, `gen_attr_34` AS `c`, `gen_attr_35` AS `d` FROM (SELECT `gen_attr_30`, `gen_attr_31`, `gen_attr_32`, `gen_attr_33`, `gen_attr_34`, `gen_attr_35` FROM (SELECT `key` AS `gen_attr_30`, `value` AS `gen_attr_31` FROM `default`.`parquet_t1`) AS gen_subquery_0 INNER JOIN (SELECT `a` AS `gen_attr_32`, `b` AS `gen_attr_33`, `c` AS `gen_attr_34`, `d` AS `gen_attr_35` FROM `default`.`parquet_t2`) AS gen_subquery_1 ON (`gen_attr_30` = `gen_attr_32`)) AS gen_subquery_2 I compared the Optimized Logical Plan of these two queries:
Here, we always add extra |
I think it's because our optimizer is not smart enough, these alias-only |
True. : ) |
Test build #52993 has finished for PR 11658 at commit
|
retest this please |
Test build #53058 has finished for PR 11658 at commit
|
retest this please. |
Test build #53172 has finished for PR 11658 at commit
|
Test build #53182 has finished for PR 11658 at commit
|
Test build #53203 has finished for PR 11658 at commit
|
case _: LocalLimit => plan | ||
case _: GlobalLimit => plan | ||
case _: SQLTable => plan | ||
case OneRowRelation => plan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we do not need to add a subquery for these kinds of nodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the comments says, we don't need to add sub-query if this operator can be put after FROM. So obviously, SubqueryAlias
, Join
, SQLTable
, OneRowRelation
don't need extra sub-query. Currently we only support convert logical plan that is parsed from SQL string to SQL string, this implies, Filter
, Limit
will always appear after table relation and they will generate SQL string like tbl WHERE ... LIMIT ...
which can be put after FROM.
Anyway this logical is just copied from original code.
@@ -185,8 +185,7 @@ case class Alias(child: Expression, name: String)( | |||
override def sql: String = { | |||
val qualifiersString = | |||
if (qualifiers.isEmpty) "" else qualifiers.map(quoteIdentifier).mkString("", ".", ".") | |||
val aliasName = if (isGenerated) s"$name#${exprId.id}" else s"$name" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is isGenerated still needed? (if not, we do not need to remove it in this PR). Also, what is the reason of this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isGenerated
is still needed to avoid resolving column names on these generated internal attributes. However, it should not affect the sql
anymore, as this PR need to control the format of attribute names and alias names.
@cloud-fan It will be super helpful if we can have an example in the description as well as in the code. I feel this kind of changes is hard to fully understand without good examples. Thanks! |
Thank you for the example. That's super helpful. We should also put that in the code. Changes look good. @liancheng It will be good if you can take a look at it later. I am merging this to master. We can put this example in the comment while working on another PR related to view support. |
## What changes were proposed in this pull request? This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into `LATERAL VIEW` format as there are many limitations to put UDTF in project list. This PR is based on #11658, please see the last commit to review the real changes. Thanks dilipbiswal for his initial work! Takes over #11596 ## How was this patch tested? new tests in `LogicalPlanToSQLSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #11696 from cloud-fan/generate.
…utputs while generate SQL string ## What changes were proposed in this pull request? This PR tries to solve a fundamental issue in the `SQLBuilder`. When we want to turn a logical plan into SQL string and put it after FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers(e.g. the `Join` operator), and this kind of plan can't be put under a subquery as we will erase and assign a new qualifier to all outputs and make it impossible to distinguish same-name outputs. To solve this problem, this PR renames all attributes with globally unique names(using exprId), so that we don't need qualifiers to resolve ambiguity anymore. For example, `SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y`, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like: ``` SELECT sq_1.key, sq_1.max FROM ( SELECT sq_0.key, sq_0.key, MAX(sq_0.key) OVER () AS max FROM ( SELECT x.key, y.key FROM t1 AS x JOIN t2 AS y ) AS sq_0 ) AS sq_1 ``` You can see, the `key` columns become ambiguous after `sq_0`. After this PR, it will generate something like: ``` SELECT attr_30 AS key, attr_37 AS max FROM ( SELECT attr_30, attr_37 FROM ( SELECT attr_30, attr_35, MAX(attr_35) AS attr_37 FROM ( SELECT attr_30, attr_35 FROM (SELECT key AS attr_30 FROM t1) AS sq_0 INNER JOIN (SELECT key AS attr_35 FROM t1) AS sq_1 ) AS sq_2 ) AS sq_3 ) AS sq_4 ``` The outermost SELECT is used to turn the generated named to real names back, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore. ## How was this patch tested? existing tests and new tests in LogicalPlanToSQLSuite. Author: Wenchen Fan <wenchen@databricks.com> Closes apache#11658 from cloud-fan/gensql.
## What changes were proposed in this pull request? This PR adds SQL generation support for `Generate` operator. It always converts `Generate` operator into `LATERAL VIEW` format as there are many limitations to put UDTF in project list. This PR is based on apache#11658, please see the last commit to review the real changes. Thanks dilipbiswal for his initial work! Takes over apache#11596 ## How was this patch tested? new tests in `LogicalPlanToSQLSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes apache#11696 from cloud-fan/generate.
What changes were proposed in this pull request?
This PR tries to solve a fundamental issue in the
SQLBuilder
. When we want to turn a logical plan into SQL string and put it after FROM clause, we need to wrap it with a sub-query. However, a logical plan is allowed to have same-name outputs with different qualifiers(e.g. theJoin
operator), and this kind of plan can't be put under a subquery as we will erase and assign a new qualifier to all outputs and make it impossible to distinguish same-name outputs.To solve this problem, this PR renames all attributes with globally unique names(using exprId), so that we don't need qualifiers to resolve ambiguity anymore.
For example,
SELECT x.key, MAX(y.key) OVER () FROM t x JOIN t y
, we will parse this SQL to a Window operator and a Project operator, and add a sub-query between them. The generated SQL looks like:You can see, the
key
columns become ambiguous aftersq_0
.After this PR, it will generate something like:
The outermost SELECT is used to turn the generated named to real names back, and the innermost SELECT is used to alias real columns to our generated names. Between them, there is no name ambiguity anymore.
How was this patch tested?
existing tests and new tests in LogicalPlanToSQLSuite.