-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12593][SQL] Converts resolved logical plan back to SQL #10541
Conversation
When running any test suite that extends
Otherwise, we may see something like this:
In this way we can figure out the percentage of convertible query plans. Ideally the percentage should be calculated automatically. |
I found SQL generation in Slick can be a good reference for attacking limitations mentioned in the PR description. But the current approach should be enough for native view. |
Test build #48554 has finished for PR 10541 at commit
|
e0c61b7
to
af865a9
Compare
Test build #48558 has finished for PR 10541 at commit
|
The jira ticket is linked incorrectly I think. |
@liancheng this looks cool! I was wondering why we are bound to SQL? Is this because of Hive? I was thinking of the following, we could also store the logical plan's json representation. This should alot easier to (de)serialize. Could we store that in the Hive metadata store? Another idea I was having. If a view is defined in HQL, we could also store that in some way with the query execution. This saves us a serialization/deserialization trip, and allows the user to recognize his own query. |
@hvanhovell the problem with the json representation is stability. The json one is pretty tied to our internal implementation, and as a result would be hard to stabilize. Of course, we can also design our own stable json representation, but at that point we are really just re-inventing the SQL wheel. |
a7c35f0
to
ef5dac2
Compare
@rxin Thanks for helping explaining this. (JIRA ID in the PR title fixed.) @hvanhovell Would also like to add that, once fully implemented, SQL statement generation itself can be quite useful, and not limited to native view support. One example is random query generation in integration tests. |
@@ -637,7 +637,7 @@ import org.apache.hadoop.hive.conf.HiveConf; | |||
// counter to generate unique union aliases | |||
private int aliasCounter; | |||
private String generateUnionAlias() { | |||
return "_u" + (++aliasCounter); | |||
return "u_" + (++aliasCounter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is because Hive lexer doesn't allow identifiers starting with underscore.
(All other changes in this file are caused by removing training spaces.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I correct to say that this only happens in the following (test) scenario?
HQL Statement -> Logical Plan -> HQL Statement (with generated names) -> Logical Plan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. _u
appears as an alias of a subquery. I hit this issue while trying to fix HiveQuerySuite.CTE feature #2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, perfect!
Test build #48657 has finished for PR 10541 at commit
|
Test build #48660 has finished for PR 10541 at commit
|
import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor} | ||
import org.apache.spark.sql.catalyst.util.sequenceOption | ||
|
||
class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Logging { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems sqlContext
is un-used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we need it later when dealing with more complex scenarios. For example, we may want to add SELECT *
over a raw MetastoreRelation
. Then we need sqlContext
to resolve the *
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we add it back when we need it later?
Hm, seems that my last fixes introduced bug related to UDF handling. Looking into it. |
The following two test cases always fail when executed with other test cases, but always pass when executed separately:
Both test cases complain table |
According to local testing result, now 75% query plans in |
Test build #48665 has finished for PR 10541 at commit
|
Test build #48667 has finished for PR 10541 at commit
|
One thing about testing infrastructure: It is a good idea to use the existing Hive compatibility tests to bootstrap your test coverage. However, for every test failure that you find, we should create unit tests specifically built for the SQL conversion and increase the coverage of that. In the long run, we should not depend on the Hive compatibility tests. |
predicateSQL <- predicate.sql | ||
trueSQL <- trueValue.sql | ||
falseSQL <- falseValue.sql | ||
} yield s"(IF($predicateSQL, $trueSQL, $falseSQL))" | ||
} | ||
|
||
trait CaseWhenLike extends Expression { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we support case when
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet, support for more expressions and operators is still ongoing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added support for case when expressions.
@@ -137,7 +133,8 @@ private[hive] class HiveFunctionRegistry( | |||
} | |||
} | |||
|
|||
private[hive] case class HiveSimpleUDF(funcWrapper: HiveFunctionWrapper, children: Seq[Expression]) | |||
private[hive] case class HiveSimpleUDF( | |||
name: String, funcWrapper: HiveFunctionWrapper, children: Seq[Expression]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we get the function name from funcWrapper
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we can't. funcWrapper
only contains class name.
Test build #48940 has finished for PR 10541 at commit
|
Test build #48949 has finished for PR 10541 at commit
|
test this please |
1 similar comment
test this please |
Test build #48996 has finished for PR 10541 at commit
|
// that, the metastore database name and table name are not always propagated to converted | ||
// `ParquetRelation` instances via data source options. Here we use subquery alias as a | ||
// workaround. | ||
Some(s"`$alias`") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's create a jira for this.
Let's also create a jira for supporting persisted data source tables. |
SimplifyCaseConversionExpressions) :: | ||
SimplifyCaseConversionExpressions, | ||
// Nondeterministic | ||
ComputeCurrentTime) :: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be the first batch after the batch of Remove SubQueries
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I think we need to make it evaluated before this batch. Otherwise, constant folding rules will fire first, which can potentially introduce problem (multiple CurrentTimestamps returns different answers in a query).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks!
LGTM pending jenkins. |
@@ -37,6 +37,8 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] { | |||
// SubQueries are only needed for analysis and can be removed before execution. | |||
Batch("Remove SubQueries", FixedPoint(100), | |||
EliminateSubQueries) :: | |||
Batch("Compute Current Time", Once, | |||
ComputeCurrentTime) :: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a comment to explain it in the follow-up pr.
Test build #49026 has finished for PR 10541 at commit
|
Merging to master. |
Thanks to all for the review! |
This PR is a follow-up of PR #10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical. In this PR, a new SQL option `spark.sql.nativeView.canonical` is added. When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from view definition logical plans. If we failed to map the plan to SQL, we fallback to the original native view approach. One important issue this PR fixes is that, now we can use CTE when defining a view. Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`. However, HiveQL parser doesn't allow CTE appearing as a subquery. Namely, something like this is disallowed: ```sql SELECT n FROM ( WITH w AS (SELECT 1 AS n) SELECT * FROM w ) v ``` This PR fixes this issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during analysis phase, thus there won't be CTE expressions in the generated SQL query string). Author: Cheng Lian <lian@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.
#### What changes were proposed in this pull request? The PR #10541 changed the rule `CollapseProject` by enabling collapsing `Project` into `Aggregate`. It leaves a to-do item to remove the duplicate code. This PR is to finish this to-do item. Also added a test case for covering this change. #### How was this patch tested? Added a new test case. liancheng Could you check if the code refactoring is fine? Thanks! Author: gatorsmile <gatorsmile@gmail.com> Closes #11427 from gatorsmile/collapseProjectRefactor.
This PR tries to enable Spark SQL to convert resolved logical plans back to SQL query strings. For now, the major use case is to canonicalize Spark SQL native view support. The major entry point is
SQLBuilder.toSQL
, which returns anOption[String]
if the logical plan is recognized.The current version is still in WIP status, and is quite limited. Known limitations include:
The logical plan must be analyzed but not optimized
The optimizer erases
Subquery
operators, which contain necessary scope information for SQL generation. Future versions should be able to recover erased scope information by inserting subqueries when necessary.The logical plan must be created using HiveQL query string
Query plans generated by composing arbitrary DataFrame API combinations are not supported yet. Operators within these query plans need to be rearranged into a canonical form that is more suitable for direct SQL generation. For example, the following query plan
need to be canonicalized into the following form before SQL generation:
Otherwise, the SQL generation process will have to handle a large number of special cases.
Only a fraction of expressions and basic logical plan operators are supported in this PR
Currently, 95.7% (1720 out of 1798) query plans in
HiveCompatibilitySuite
can be successfully converted to SQL query strings.Known unsupported components are:
NOT
expressions, e.g.NOT LIKE
andNOT IN
DistinctAggregationRewriter
analysis ruleSupport for window functions, generators, and cubes etc. will be added in follow-up PRs.
This PR leverages
HiveCompatibilitySuite
for testing SQL generation in a "round-trip" manner:If the query plan is inconvertible, the test case simply falls back to the original logic.
TODO