
[SPARK-12593][SQL] Converts resolved logical plan back to SQL #10541

Closed
wants to merge 15 commits

Conversation

liancheng
Contributor

@liancheng liancheng commented Dec 31, 2015

This PR enables Spark SQL to convert resolved logical plans back to SQL query strings. For now, the major use case is canonicalizing Spark SQL's native view support. The main entry point is SQLBuilder.toSQL, which returns the generated SQL string wrapped in Some if the logical plan is recognized, and None otherwise.

The current version is still a work in progress and is quite limited. Known limitations include:

  1. The logical plan must be analyzed but not optimized

    The optimizer erases Subquery operators, which contain necessary scope information for SQL generation. Future versions should be able to recover erased scope information by inserting subqueries when necessary.

  2. The logical plan must be created using HiveQL query string

    Query plans generated by composing arbitrary DataFrame API combinations are not supported yet. Operators within these query plans need to be rearranged into a canonical form that is more suitable for direct SQL generation. For example, the following query plan

    Filter (a#1 < 10)
     +- MetastoreRelation default, src, None
    

    needs to be canonicalized into the following form before SQL generation:

    Project [a#1, b#2, c#3]
     +- Filter (a#1 < 10)
         +- MetastoreRelation default, src, None
    

    Otherwise, the SQL generation process will have to handle a large number of special cases.

  3. Only a fraction of expressions and basic logical plan operators are supported in this PR
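The canonicalization step described in limitation 2 can be sketched with a toy plan model. This is a minimal, self-contained illustration, not the PR's actual Catalyst code; the `Plan`, `Relation`, `Filter`, and `Project` types here are simplified stand-ins:

```scala
// Toy plan model (not Catalyst): a bare Filter over a relation gets a
// Project of the relation's full output placed on top, so every
// convertible plan has the same SELECT ... FROM ... WHERE ... shape.
sealed trait Plan { def output: Seq[String] }
case class Relation(name: String, output: Seq[String]) extends Plan
case class Filter(condition: String, child: Plan) extends Plan {
  def output: Seq[String] = child.output
}
case class Project(columns: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = columns
}

// Insert a Project over any top-level Filter that lacks one.
def canonicalize(plan: Plan): Plan = plan match {
  case f: Filter => Project(f.output, f)
  case other     => other
}

val raw = Filter("a < 10", Relation("src", Seq("a", "b", "c")))
val canonical = canonicalize(raw)
// canonical: Project(Seq("a", "b", "c"), Filter("a < 10", ...))
```

In the real implementation, rewrites like this would be expressed as Catalyst rules running over `LogicalPlan` nodes rather than ad-hoc pattern matches.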

Currently, 95.7% (1,720 out of 1,798) of the query plans in HiveCompatibilitySuite can be successfully converted to SQL query strings.

Known unsupported components are:

  • Expressions
    • Part of math expressions
    • Part of string expressions (buggy?)
    • Null expressions
    • Calendar interval literal
    • Part of date time expressions
    • Complex type creators
    • Special NOT expressions, e.g. NOT LIKE and NOT IN
  • Logical plan operators/patterns
    • Cube, rollup, and grouping set
    • Script transformation
    • Generator
    • Distinct aggregation patterns that fit DistinctAggregationRewriter analysis rule
    • Window functions

Support for window functions, generators, cubes, etc. will be added in follow-up PRs.

This PR leverages HiveCompatibilitySuite for testing SQL generation in a "round-trip" manner:

  • For each SELECT query, we try to convert its plan back to SQL
  • If the query plan is convertible, we parse the generated SQL into a new logical plan
  • Run the new logical plan instead of the original one

If the query plan is inconvertible, the test case simply falls back to the original logic.
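The round-trip procedure above can be sketched as follows. This is a hedged, self-contained sketch: `toSQL`, `parse`, and `execute` are hypothetical stubs standing in for the real SQL builder, parser, and execution engine:

```scala
// Stubbed round-trip check: convert a plan to SQL if possible, re-parse
// the generated SQL, and run the re-parsed plan; otherwise fall back to
// running the original plan.
case class Plan(sql: String)

def toSQL(plan: Plan): Option[String] = Some(plan.sql.toUpperCase)  // stub
def parse(sql: String): Plan = Plan(sql)                            // stub
def execute(plan: Plan): String = s"rows for: ${plan.sql}"          // stub

def roundTripRun(original: Plan): String =
  toSQL(original) match {
    case Some(generated) => execute(parse(generated)) // run the re-parsed plan
    case None            => execute(original)         // inconvertible: fall back
  }

println(roundTripRun(Plan("select * from src")))
```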

TODO

  • Fix failed test cases
  • Support for more basic expressions and logical plan operators (e.g. distinct aggregation etc.)
  • Comments and documentation

@liancheng
Contributor Author

When running any test suite that extends HiveComparisonTest, detailed logs can be found in sql/hive/target/unit-tests.log. If a query string is convertible, we may see something like this (the triple braces are added so that Vim recognizes them as fold markers):

### Running SQL generation round-trip test {{{
Project [key#357,value#358,ds#356]
+- MetastoreRelation default, add_part_test, None

Original SQL:
select * from add_part_test

Generated SQL:
SELECT `add_part_test`.`key`, `add_part_test`.`value`, `add_part_test`.`ds` FROM `default`.`add_part_test`
}}}

Otherwise, we may see something like this:

### Cannot convert the following logical plan back to SQL {{{
Aggregate [(sum(cast(HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFHash(key#2853,value#2854) as bigint)),mode=Complete,isDistinct=false) AS _c0#2855L]
+- MetastoreRelation default, dest_j1, None

Original SQL:
SELECT sum(hash(dest_j1.key,dest_j1.value)) FROM dest_j1
}}}

In this way we can figure out the percentage of convertible query plans. Ideally the percentage should be calculated automatically.

@liancheng
Contributor Author

I found that the SQL generation in Slick can be a good reference for attacking the limitations mentioned in the PR description. But the current approach should be enough for native view support.

@SparkQA

SparkQA commented Dec 31, 2015

Test build #48554 has finished for PR 10541 at commit 17e8fba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract sealed class SortDirection
    • sealed abstract class JoinType
    • case class Subquery(alias: String, child: LogicalPlan)
    • case class NamedRelation(databaseName: String, tableName: String, output: Seq[Attribute])
    • class QueryNormalizer(sqlContext: SQLContext) extends RuleExecutor[LogicalPlan]

@liancheng liancheng force-pushed the sql-generation branch 3 times, most recently from e0c61b7 to af865a9 Compare December 31, 2015 15:15
@SparkQA

SparkQA commented Dec 31, 2015

Test build #48558 has finished for PR 10541 at commit af865a9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract sealed class SortDirection
    • sealed abstract class JoinType
    • case class Subquery(alias: String, child: LogicalPlan)
    • case class NamedRelation(databaseName: String, tableName: String, output: Seq[Attribute])
    • class QueryNormalizer(sqlContext: SQLContext) extends RuleExecutor[LogicalPlan]
    • class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Logging

@rxin
Contributor

rxin commented Jan 1, 2016

The JIRA ticket is linked incorrectly, I think.

@hvanhovell
Contributor

@liancheng this looks cool!

I was wondering why we are bound to SQL. Is this because of Hive? I was thinking of the following: we could also store the logical plan's JSON representation, which should be a lot easier to (de)serialize. Could we store that in the Hive metastore?

Another idea: if a view is defined in HQL, we could also store the original query text in some way with the query execution. This saves us a serialization/deserialization round trip, and allows the user to recognize his own query.

@rxin
Contributor

rxin commented Jan 3, 2016

@hvanhovell the problem with the JSON representation is stability. The JSON format is tightly tied to our internal implementation, and as a result it would be hard to stabilize. Of course, we could design our own stable JSON representation, but at that point we would really just be reinventing the SQL wheel.

@liancheng liancheng changed the title [SPARK-12592][SQL][WIP] Converts resolved logical plan back to SQL [SPARK-12593][SQL][WIP] Converts resolved logical plan back to SQL Jan 4, 2016
@liancheng liancheng force-pushed the sql-generation branch 2 times, most recently from a7c35f0 to ef5dac2 Compare January 4, 2016 11:51
@liancheng
Contributor Author

@rxin Thanks for helping explain this. (JIRA ID in the PR title fixed.)

@hvanhovell Would also like to add that, once fully implemented, SQL statement generation itself can be quite useful, and is not limited to native view support. One example is random query generation for integration tests.

@@ -637,7 +637,7 @@ import org.apache.hadoop.hive.conf.HiveConf;
// counter to generate unique union aliases
private int aliasCounter;
private String generateUnionAlias() {
return "_u" + (++aliasCounter);
return "u_" + (++aliasCounter);
Contributor Author

This change is because the Hive lexer doesn't allow identifiers starting with an underscore.

(All other changes in this file are caused by removing trailing spaces.)

Contributor

Am I correct to say that this only happens in the following (test) scenario?
HQL Statement -> Logical Plan -> HQL Statement (with generated names) -> Logical Plan

Contributor Author

Yes. _u appears as an alias of a subquery. I hit this issue while trying to fix HiveQuerySuite.CTE feature #2.

Contributor

Ok, perfect!

@SparkQA

SparkQA commented Jan 4, 2016

Test build #48657 has finished for PR 10541 at commit ef5dac2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 4, 2016

Test build #48660 has finished for PR 10541 at commit 4963676.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}
import org.apache.spark.sql.catalyst.util.sequenceOption

class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Logging {
Contributor

Seems sqlContext is unused?

Contributor Author

I believe we need it later when dealing with more complex scenarios. For example, we may want to add SELECT * over a raw MetastoreRelation. Then we need sqlContext to resolve the *.

Contributor

How about we add it back when we need it later?

@liancheng
Contributor Author

Hm, seems that my last fixes introduced a bug related to UDF handling. Looking into it.

@liancheng
Contributor Author

The following two test cases always fail when executed with other test cases, but always pass when executed separately:

  • HiveCompatibilitySuite.select_as_omitted
  • HiveCompatibilitySuite.router_join_ppr

Both test cases complain that table src is not found when failing, probably because of side effects from test cases executed earlier. Still investigating.

@liancheng
Contributor Author

According to local testing results, 75% of the query plans in HiveCompatibilitySuite can now be successfully converted to SQL query strings.

@SparkQA

SparkQA commented Jan 4, 2016

Test build #48665 has finished for PR 10541 at commit 70af178.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 4, 2016

Test build #48667 has finished for PR 10541 at commit 1d5dd3b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jan 4, 2016

@liancheng

One thing about testing infrastructure: It is a good idea to use the existing Hive compatibility tests to bootstrap your test coverage. However, for every test failure that you find, we should create unit tests specifically built for the SQL conversion and increase the coverage of that. In the long run, we should not depend on the Hive compatibility tests.

predicateSQL <- predicate.sql
trueSQL <- trueValue.sql
falseSQL <- falseValue.sql
} yield s"(IF($predicateSQL, $trueSQL, $falseSQL))"
}

trait CaseWhenLike extends Expression {
Contributor

Do we support CASE WHEN?

Contributor Author

Not yet; support for more expressions and operators is still ongoing.

Contributor Author

Added support for CASE WHEN expressions.
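For illustration only (this is not the PR's actual code), SQL generation for a CASE WHEN expression can be sketched as rendering each branch as `WHEN cond THEN value`, plus an optional `ELSE`:

```scala
// Toy CASE WHEN renderer: branches are (condition, value) pairs already
// rendered as SQL fragments; elseValue is the optional ELSE expression.
case class CaseWhen(branches: Seq[(String, String)], elseValue: Option[String])

def caseWhenSQL(cw: CaseWhen): String = {
  val whens = cw.branches.map { case (cond, value) => s"WHEN $cond THEN $value" }
  val elsePart = cw.elseValue.map(v => s" ELSE $v").getOrElse("")
  s"CASE ${whens.mkString(" ")}$elsePart END"
}

val sql = caseWhenSQL(CaseWhen(Seq(("a > 0", "1"), ("a < 0", "-1")), Some("0")))
// "CASE WHEN a > 0 THEN 1 WHEN a < 0 THEN -1 ELSE 0 END"
```

A real builder would recurse into child expressions to produce the condition and value fragments instead of taking pre-rendered strings.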

@@ -137,7 +133,8 @@ private[hive] class HiveFunctionRegistry(
}
}

private[hive] case class HiveSimpleUDF(funcWrapper: HiveFunctionWrapper, children: Seq[Expression])
private[hive] case class HiveSimpleUDF(
name: String, funcWrapper: HiveFunctionWrapper, children: Seq[Expression])
Contributor

can't we get the function name from funcWrapper?

Contributor Author

No, we can't. funcWrapper only contains the class name.

@SparkQA

SparkQA commented Jan 7, 2016

Test build #48940 has finished for PR 10541 at commit a304392.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 7, 2016

Test build #48949 has finished for PR 10541 at commit 2073e30.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ComputeCurrentTimeSuite extends PlanTest

@yhuai
Contributor

yhuai commented Jan 7, 2016

test this please

@yhuai
Contributor

yhuai commented Jan 8, 2016

test this please

@SparkQA

SparkQA commented Jan 8, 2016

Test build #48996 has finished for PR 10541 at commit 2073e30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ComputeCurrentTimeSuite extends PlanTest

// that, the metastore database name and table name are not always propagated to converted
// `ParquetRelation` instances via data source options. Here we use subquery alias as a
// workaround.
Some(s"`$alias`")
Contributor

Let's create a jira for this.

@yhuai
Contributor

yhuai commented Jan 8, 2016

Let's also create a jira for supporting persisted data source tables.

SimplifyCaseConversionExpressions) ::
SimplifyCaseConversionExpressions,
// Nondeterministic
ComputeCurrentTime) ::
Contributor

Should it be the first batch after the Remove SubQueries batch?

Contributor

Yea, I think we need to make it evaluate before this batch. Otherwise, constant folding rules will fire first, which can potentially introduce a problem (multiple CurrentTimestamp expressions returning different answers within a single query).

Contributor Author

Good catch, thanks!
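The ordering concern discussed above can be illustrated with a toy version of ComputeCurrentTime (not the actual Catalyst rule): the current time is captured exactly once and substituted for every CurrentTimestamp occurrence, so all occurrences in one query agree.

```scala
// Toy rule: replace every CurrentTimestamp with one shared captured value.
// If each occurrence were folded lazily (e.g. by constant folding running
// first), two occurrences in one query could observe different clocks.
sealed trait Expr
case object CurrentTimestamp extends Expr
case class Literal(value: Long) extends Expr

def computeCurrentTime(exprs: Seq[Expr]): Seq[Expr] = {
  val now = Literal(System.currentTimeMillis())  // evaluated exactly once
  exprs.map {
    case CurrentTimestamp => now
    case other            => other
  }
}

val fixed = computeCurrentTime(Seq(CurrentTimestamp, Literal(42L), CurrentTimestamp))
assert(fixed(0) == fixed(2))  // both occurrences share one timestamp
```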

@yhuai
Contributor

yhuai commented Jan 8, 2016

LGTM pending jenkins.

@@ -37,6 +37,8 @@ abstract class Optimizer extends RuleExecutor[LogicalPlan] {
// SubQueries are only needed for analysis and can be removed before execution.
Batch("Remove SubQueries", FixedPoint(100),
EliminateSubQueries) ::
Batch("Compute Current Time", Once,
ComputeCurrentTime) ::
Contributor

Let's add a comment to explain it in the follow-up PR.

@SparkQA

SparkQA commented Jan 8, 2016

Test build #49026 has finished for PR 10541 at commit 97cd39e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Contributor

yhuai commented Jan 8, 2016

Merging to master.

@asfgit asfgit closed this in d9447ca Jan 8, 2016
@liancheng liancheng deleted the sql-generation branch January 8, 2016 22:18
@liancheng
Contributor Author

Thanks to all for the review!

asfgit pushed a commit that referenced this pull request Jan 27, 2016
This PR is a follow-up of PR #10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical.

In this PR, a new SQL option `spark.sql.nativeView.canonical` is added. When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from view definition logical plans. If we fail to map the plan to SQL, we fall back to the original native view approach.

One important issue this PR fixes is that we can now use CTEs when defining a view. Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`. However, the HiveQL parser doesn't allow a CTE to appear as a subquery. Namely, something like this is disallowed:

```sql
SELECT n
FROM (
  WITH w AS (SELECT 1 AS n)
  SELECT * FROM w
) v
```

This PR fixes this issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during analysis phase, thus there won't be CTE expressions in the generated SQL query string).

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.
asfgit pushed a commit that referenced this pull request Mar 23, 2016
#### What changes were proposed in this pull request?

The PR #10541 changed the rule `CollapseProject` by enabling collapsing `Project` into `Aggregate`, leaving a to-do item to remove the duplicate code. This PR finishes that to-do item and adds a test case covering the change.

#### How was this patch tested?

Added a new test case.

liancheng Could you check if the code refactoring is fine? Thanks!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #11427 from gatorsmile/collapseProjectRefactor.