
[SPARK-12593][SQL] Converts resolved logical plan back to SQL #10541

Closed
wants to merge 15 commits

Conversation

liancheng
Contributor

@liancheng liancheng commented Dec 31, 2015

This PR enables Spark SQL to convert resolved logical plans back to SQL query strings. For now, the major use case is canonicalizing Spark SQL's native view support. The main entry point is SQLBuilder.toSQL, which returns the generated SQL string wrapped in Some if the logical plan is recognized, and None otherwise.

The current version is still a work in progress and is quite limited. Known limitations include:

  1. The logical plan must be analyzed but not optimized

    The optimizer erases Subquery operators, which contain necessary scope information for SQL generation. Future versions should be able to recover erased scope information by inserting subqueries when necessary.

  2. The logical plan must be created using HiveQL query string

    Query plans generated by composing arbitrary DataFrame API combinations are not supported yet. Operators within these query plans need to be rearranged into a canonical form that is more suitable for direct SQL generation. For example, the following query plan

    Filter (a#1 < 10)
     +- MetastoreRelation default, src, None
    

    needs to be canonicalized into the following form before SQL generation:

    Project [a#1, b#2, c#3]
     +- Filter (a#1 < 10)
         +- MetastoreRelation default, src, None
    

    Otherwise, the SQL generation process will have to handle a large number of special cases.

  3. Only a fraction of expressions and basic logical plan operators are supported in this PR
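The canonicalization step described in limitation 2 can be sketched with a toy plan model. This is a minimal, self-contained illustration, not the PR's actual Catalyst code; the `Plan`, `Relation`, `Filter`, and `Project` types here are simplified stand-ins:

```scala
// Toy plan model (not Catalyst): a bare Filter over a relation gets a
// Project of the relation's full output placed on top, so every
// convertible plan has the same SELECT ... FROM ... WHERE ... shape.
sealed trait Plan { def output: Seq[String] }
case class Relation(name: String, output: Seq[String]) extends Plan
case class Filter(condition: String, child: Plan) extends Plan {
  def output: Seq[String] = child.output
}
case class Project(columns: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = columns
}

// Insert a Project over any top-level Filter that lacks one.
def canonicalize(plan: Plan): Plan = plan match {
  case f: Filter => Project(f.output, f)
  case other     => other
}

val raw = Filter("a < 10", Relation("src", Seq("a", "b", "c")))
val canonical = canonicalize(raw)
// canonical: Project(Seq("a", "b", "c"), Filter("a < 10", ...))
```

In the real implementation, rewrites like this would be expressed as Catalyst rules running over `LogicalPlan` nodes rather than ad-hoc pattern matches.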

Currently, 95.7% (1,720 out of 1,798) of the query plans in HiveCompatibilitySuite can be successfully converted to SQL query strings.

Known unsupported components are:

  • Expressions
    • Part of math expressions
    • Part of string expressions (buggy?)
    • Null expressions
    • Calendar interval literal
    • Part of date time expressions
    • Complex type creators
    • Special NOT expressions, e.g. NOT LIKE and NOT IN
  • Logical plan operators/patterns
    • Cube, rollup, and grouping set
    • Script transformation
    • Generator
    • Distinct aggregation patterns that fit DistinctAggregationRewriter analysis rule
    • Window functions

Support for window functions, generators, cubes, etc. will be added in follow-up PRs.

This PR leverages HiveCompatibilitySuite for testing SQL generation in a "round-trip" manner:

  • For each SELECT query, we try to convert its plan back to SQL
  • If the query plan is convertible, we parse the generated SQL into a new logical plan
  • Run the new logical plan instead of the original one

If the query plan is inconvertible, the test case simply falls back to the original logic.
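The round-trip procedure above can be sketched as follows. This is a hedged, self-contained sketch: `toSQL`, `parse`, and `execute` are hypothetical stubs standing in for the real SQL builder, parser, and execution engine:

```scala
// Stubbed round-trip check: convert a plan to SQL if possible, re-parse
// the generated SQL, and run the re-parsed plan; otherwise fall back to
// running the original plan.
case class Plan(sql: String)

def toSQL(plan: Plan): Option[String] = Some(plan.sql.toUpperCase)  // stub
def parse(sql: String): Plan = Plan(sql)                            // stub
def execute(plan: Plan): String = s"rows for: ${plan.sql}"          // stub

def roundTripRun(original: Plan): String =
  toSQL(original) match {
    case Some(generated) => execute(parse(generated)) // run the re-parsed plan
    case None            => execute(original)         // inconvertible: fall back
  }

println(roundTripRun(Plan("select * from src")))
```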

TODO

  • Fix failed test cases
  • Support for more basic expressions and logical plan operators (e.g. distinct aggregation etc.)
  • Comments and documentation

@liancheng
Contributor Author

When running any test suite that extends HiveComparisonTest, detailed logs can be found in sql/hive/target/unit-tests.log. If a query string is convertible, we may see something like this (the triple braces are added so that Vim recognizes them as fold markers):

### Running SQL generation round-trip test {{{
Project [key#357,value#358,ds#356]
+- MetastoreRelation default, add_part_test, None

Original SQL:
select * from add_part_test

Generated SQL:
SELECT `add_part_test`.`key`, `add_part_test`.`value`, `add_part_test`.`ds` FROM `default`.`add_part_test`
}}}

Otherwise, we may see something like this:

### Cannot convert the following logical plan back to SQL {{{
Aggregate [(sum(cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(key#2853,value#2854) as bigint)),mode=Complete,isDistinct=false) AS _c0#2855L]
+- MetastoreRelation default, dest_j1, None

Original SQL:
SELECT sum(hash(dest_j1.key,dest_j1.value)) FROM dest_j1
}}}

In this way we can figure out the percentage of convertible query plans. Ideally the percentage should be calculated automatically.

@liancheng
Contributor Author

I found that the SQL generation in Slick can be a good reference for attacking the limitations mentioned in the PR description. But the current approach should be enough for native view support.

@SparkQA

SparkQA commented Dec 31, 2015

Test build #48554 has finished for PR 10541 at commit 17e8fba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract sealed class SortDirection
    • sealed abstract class JoinType
    • case class Subquery(alias: String, child: LogicalPlan)
    • case class NamedRelation(databaseName: String, tableName: String, output: Seq[Attribute])
    • class QueryNormalizer(sqlContext: SQLContext) extends RuleExecutor[LogicalPlan]

@liancheng liancheng force-pushed the sql-generation branch 3 times, most recently from e0c61b7 to af865a9 Compare December 31, 2015 15:15
@SparkQA

SparkQA commented Dec 31, 2015

Test build #48558 has finished for PR 10541 at commit af865a9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract sealed class SortDirection
    • sealed abstract class JoinType
    • case class Subquery(alias: String, child: LogicalPlan)
    • case class NamedRelation(databaseName: String, tableName: String, output: Seq[Attribute])
    • class QueryNormalizer(sqlContext: SQLContext) extends RuleExecutor[LogicalPlan]
    • class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Logging

@rxin
Contributor

rxin commented Jan 1, 2016

The JIRA ticket is linked incorrectly, I think.

@hvanhovell
Contributor

@liancheng this looks cool!

I was wondering why we are bound to SQL. Is this because of Hive? I was thinking of the following: we could also store the logical plan's JSON representation, which should be a lot easier to (de)serialize. Could we store that in the Hive metastore?

Another idea: if a view is defined in HQL, we could also store the original query text in some way with the query execution. This saves us a serialization/deserialization round trip, and allows the user to recognize his own query.

@rxin
Contributor

rxin commented Jan 3, 2016

@hvanhovell the problem with the JSON representation is stability. The JSON format is tightly tied to our internal implementation, and as a result it would be hard to stabilize. Of course, we could design our own stable JSON representation, but at that point we would really just be reinventing the SQL wheel.

@liancheng liancheng changed the title [SPARK-12592][SQL][WIP] Converts resolved logical plan back to SQL [SPARK-12593][SQL][WIP] Converts resolved logical plan back to SQL Jan 4, 2016
@liancheng liancheng force-pushed the sql-generation branch 2 times, most recently from a7c35f0 to ef5dac2 Compare January 4, 2016 11:51
@liancheng
Contributor Author

@rxin Thanks for helping explain this. (JIRA ID in the PR title fixed.)

@hvanhovell Would also like to add that, once fully implemented, SQL statement generation itself can be quite useful, and is not limited to native view support. One example is random query generation for integration tests.

@@ -637,7 +637,7 @@ import org.apache.hadoop.hive.conf.HiveConf;
// counter to generate unique union aliases
private int aliasCounter;
private String generateUnionAlias() {
return "_u" + (++aliasCounter);
return "u_" + (++aliasCounter);
Contributor Author

This change is because the Hive lexer doesn't allow identifiers starting with an underscore.

(All other changes in this file are caused by removing trailing spaces.)

Contributor

Am I correct to say that this only happens in the following (test) scenario?
HQL Statement -> Logical Plan -> HQL Statement (with generated names) -> Logical Plan

Contributor Author

Yes. _u appears as an alias of a subquery. I hit this issue while trying to fix HiveQuerySuite.CTE feature #2.

Contributor

Ok, perfect!

@SparkQA

SparkQA commented Jan 4, 2016

Test build #48657 has finished for PR 10541 at commit ef5dac2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 4, 2016

Test build #48660 has finished for PR 10541 at commit 4963676.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}
import org.apache.spark.sql.catalyst.util.sequenceOption

class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Logging {
Contributor

Seems sqlContext is unused?

Contributor Author

I believe we need it later when dealing with more complex scenarios. For example, we may want to add SELECT * over a raw MetastoreRelation. Then we need sqlContext to resolve the *.

Contributor

How about we add it back when we need it later?

@liancheng
Contributor Author

Hm, seems that my last fixes introduced a bug related to UDF handling. Looking into it.

@liancheng
Contributor Author

The following two test cases always fail when executed with other test cases, but always pass when executed separately:

  • HiveCompatibilitySuite.select_as_omitted
  • HiveCompatibilitySuite.router_join_ppr

Both test cases complain that table src is not found when failing, probably because of side effects from test cases executed earlier. Still investigating.

@liancheng
Contributor Author

According to local testing results, 75% of the query plans in HiveCompatibilitySuite can now be successfully converted to SQL query strings.

@SparkQA

SparkQA commented Jan 4, 2016

Test build #48665 has finished for PR 10541 at commit 70af178.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 4, 2016

Test build #48667 has finished for PR 10541 at commit 1d5dd3b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 4, 2016

@liancheng

One thing about testing infrastructure: It is a good idea to use the existing Hive compatibility tests to bootstrap your test coverage. However, for every test failure that you find, we should create unit tests specifically built for the SQL conversion and increase the coverage of that. In the long run, we should not depend on the Hive compatibility tests.

predicateSQL <- predicate.sql
trueSQL <- trueValue.sql
falseSQL <- falseValue.sql
} yield s"(IF($predicateSQL, $trueSQL, $falseSQL))"
}

trait CaseWhenLike extends Expression {
Contributor

Do we support CASE WHEN?

Contributor Author

Not yet; support for more expressions and operators is still ongoing.

Contributor Author

Added support for CASE WHEN expressions.
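For illustration only (this is not the PR's actual code), SQL generation for a CASE WHEN expression can be sketched as rendering each branch as `WHEN cond THEN value`, plus an optional `ELSE`:

```scala
// Toy CASE WHEN renderer: branches are (condition, value) pairs already
// rendered as SQL fragments; elseValue is the optional ELSE expression.
case class CaseWhen(branches: Seq[(String, String)], elseValue: Option[String])

def caseWhenSQL(cw: CaseWhen): String = {
  val whens = cw.branches.map { case (cond, value) => s"WHEN $cond THEN $value" }
  val elsePart = cw.elseValue.map(v => s" ELSE $v").getOrElse("")
  s"CASE ${whens.mkString(" ")}$elsePart END"
}

val sql = caseWhenSQL(CaseWhen(Seq(("a > 0", "1"), ("a < 0", "-1")), Some("0")))
// "CASE WHEN a > 0 THEN 1 WHEN a < 0 THEN -1 ELSE 0 END"
```

A real builder would recurse into child expressions to produce the condition and value fragments instead of taking pre-rendered strings.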

@@ -137,7 +133,8 @@ private[hive] class HiveFunctionRegistry(
}
}

private[hive] case class HiveSimpleUDF(funcWrapper: HiveFunctionWrapper, children: Seq[Expression])
private[hive] case class HiveSimpleUDF(
name: String, funcWrapper: HiveFunctionWrapper, children: Seq[Expression])
Contributor

can't we get the function name from funcWrapper?

Contributor Author

No, we can't. funcWrapper only contains the class name.

@SparkQA

SparkQA commented Jan 7, 2016

Test build #48940 has finished for PR 10541 at commit a304392.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 7, 2016

Test build #48949 has finished for PR 10541 at commit 2073e30.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ComputeCurrentTimeSuite extends PlanTest

@yhuai
Contributor

yhuai commented Jan 7, 2016

test this please

@yhuai
Contributor

yhuai commented Jan 8, 2016

test this please

@SparkQA

SparkQA commented Jan 8, 2016

Test build #48996 has finished for PR 10541 at commit 2073e30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ComputeCurrentTimeSuite extends PlanTest

// that, the metastore database name and table name are not always propagated to converted
// `ParquetRelation` instances via data source options. Here we use subquery alias as a
// workaround.
Some(s"`$alias`")
Contributor

Let's create a jira for this.

@yhuai
Contributor

yhuai commented Jan 8, 2016

Let's also create a jira for supporting persisted data source tables.

SimplifyCaseConversionExpressions) ::
SimplifyCaseConversionExpressions,
// Nondeterministic
ComputeCurrentTime) ::
Contributor

Should it be the first batch after the Remove SubQueries batch?

Contributor

Yea, I think we need to make it evaluate before this batch. Otherwise, constant folding rules will fire first, which can potentially introduce a problem (multiple CurrentTimestamp expressions returning different answers within a single query).

Contributor Author

Good catch, thanks!
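The ordering concern discussed above can be illustrated with a toy version of ComputeCurrentTime (not the actual Catalyst rule): the current time is captured exactly once and substituted for every CurrentTimestamp occurrence, so all occurrences in one query agree.

```scala
// Toy rule: replace every CurrentTimestamp with one shared captured value.
// If each occurrence were folded lazily (e.g. by constant folding running
// first), two occurrences in one query could observe different clocks.
sealed trait Expr
case object CurrentTimestamp extends Expr
case class Literal(value: Long) extends Expr

def computeCurrentTime(exprs: Seq[Expr]): Seq[Expr] = {
  val now = Literal(System.currentTimeMillis())  // evaluated exactly once
  exprs.map {
    case CurrentTimestamp => now
    case other            => other
  }
}

val fixed = computeCurrentTime(Seq(CurrentTimestamp, Literal(42L), CurrentTimestamp))
assert(fixed(0) == fixed(2))  // both occurrences share one timestamp
```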

@yhuai
Contributor

yhuai commented Jan 8, 2016

LGTM pending jenkins.

@@ -37,6 +37,8 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] {
// SubQueries are only needed for analysis and can be removed before execution.
Batch("Remove SubQueries", FixedPoint(100),
EliminateSubQueries) ::
Batch("Compute Current Time", Once,
ComputeCurrentTime) ::
Contributor

Let's add a comment to explain it in the follow-up PR.

@SparkQA

SparkQA commented Jan 8, 2016

Test build #49026 has finished for PR 10541 at commit 97cd39e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jan 8, 2016

Merging to master.

@asfgit asfgit closed this in d9447ca Jan 8, 2016
@liancheng liancheng deleted the sql-generation branch January 8, 2016 22:18
@liancheng
Contributor Author

Thanks to all for the review!

asfgit pushed a commit that referenced this pull request Jan 27, 2016
This PR is a follow-up of PR #10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical.

In this PR, a new SQL option `spark.sql.nativeView.canonical` is added. When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from view definition logical plans. If we fail to map the plan to SQL, we fall back to the original native view approach.

One important issue this PR fixes is that we can now use CTEs when defining a view. Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`. However, the HiveQL parser doesn't allow a CTE to appear as a subquery. Namely, something like this is disallowed:

```sql
SELECT n
FROM (
  WITH w AS (SELECT 1 AS n)
  SELECT * FROM w
) v
```

This PR fixes this issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during analysis phase, thus there won't be CTE expressions in the generated SQL query string).

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.
asfgit pushed a commit that referenced this pull request Mar 23, 2016
#### What changes were proposed in this pull request?

The PR #10541 changed the rule `CollapseProject` by enabling collapsing `Project` into `Aggregate`, leaving a to-do item to remove the duplicate code. This PR finishes that to-do item and adds a test case covering the change.

#### How was this patch tested?

Added a new test case.

liancheng Could you check if the code refactoring is fine? Thanks!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11427 from gatorsmile/collapseProjectRefactor.