[SPARK-19290][SQL] add a new extending interface in Analyzer for post-hoc resolution #16645

cloud-fan · 2017-01-19T14:27:09Z

What changes were proposed in this pull request?

To implement DDL commands, we added several analyzer rules in the sql/hive module to analyze DDL-related plans. However, our `Analyzer` currently has only one extension interface: `extendedResolutionRules`, which defines extra rules that run together with the other rules in the resolution batch. It doesn't fit DDL rules well, because:

1. DDL rules may do some checking and normalization, but this work may be repeated many times, because the resolution batch runs its rules again and again until a fixed point is reached, and it's hard to tell whether a DDL rule has already done its checking and normalization. This is safe because DDL rules are idempotent, but it's bad for analysis performance.
2. Some DDL rules may depend on others, and it's pretty hard to write `if` conditions that guarantee the dependencies. It would be good to have a batch that runs its rules in one pass, so that we can guarantee the dependencies by rule order.

This PR adds a new extension interface to `Analyzer`: `postHocResolutionRules`, which defines rules that are run only once, in a batch that runs right after the resolution batch.
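As a rough illustration of the difference between the two kinds of batch, here is a minimal, Spark-free sketch. `Plan`, `Rule`, `Batch`, `execute`, and `countCheckRuns` are toy stand-ins invented for this example, not Catalyst's actual classes; only the idea (a fixed-point resolution batch followed by a run-once batch) comes from the description above.

```scala
object PostHocSketch {
  // Toy stand-ins (assumptions for illustration, not Spark's real types).
  type Plan = List[String] // a node starting with "?" is unresolved
  final case class Rule(name: String, transform: Plan => Plan)

  sealed trait Strategy
  case object Once extends Strategy
  final case class FixedPoint(maxIterations: Int) extends Strategy

  final case class Batch(name: String, strategy: Strategy, rules: Seq[Rule])

  // One pass: apply every rule of the batch in order.
  private def runOnce(rules: Seq[Rule], plan: Plan): Plan =
    rules.foldLeft(plan)((p, r) => r.transform(p))

  // Runs the batches in order. A FixedPoint batch repeats its pass until the
  // plan stops changing (or the iteration limit is hit); a Once batch runs a
  // single pass.
  def execute(batches: Seq[Batch], plan: Plan): Plan =
    batches.foldLeft(plan) { (current, batch) =>
      val limit = batch.strategy match {
        case Once          => 1
        case FixedPoint(n) => n
      }
      var before = current
      var after = runOnce(batch.rules, before)
      var iteration = 1
      while (iteration < limit && after != before) {
        before = after
        after = runOnce(batch.rules, before)
        iteration += 1
      }
      after
    }

  // Counts how often a hypothetical check-and-normalize rule is invoked,
  // depending on which batch it is placed in.
  def countCheckRuns(checkInResolutionBatch: Boolean): Int = {
    var runs = 0
    // Resolves at most one unresolved node per pass, so several passes are needed.
    val resolveOne = Rule("ResolveOne", plan => {
      val i = plan.indexWhere(_.startsWith("?"))
      if (i < 0) plan else plan.updated(i, plan(i).drop(1))
    })
    // Idempotent: leaves the plan unchanged, but still costs one invocation per pass.
    val check = Rule("CheckAndNormalize", plan => { runs += 1; plan })
    val batches =
      if (checkInResolutionBatch)
        Seq(Batch("Resolution", FixedPoint(100), Seq(resolveOne, check)))
      else
        Seq(
          Batch("Resolution", FixedPoint(100), Seq(resolveOne)),
          Batch("Post-Hoc Resolution", Once, Seq(check)))
    execute(batches, List("?a", "?b", "?c"))
    runs
  }
}
```

With this toy plan the fixed-point batch makes four passes (three that each resolve one node, plus a final pass that detects no change), so `PostHocSketch.countCheckRuns(checkInResolutionBatch = true)` returns 4, while placing the check in the run-once batch gives 1.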

How was this patch tested?

existing tests

cloud-fan · 2017-01-19T14:29:52Z

cc @yhuai @gatorsmile

SparkQA · 2017-01-19T17:01:55Z

Test build #71663 has finished for PR 16645 at commit 7a71372.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-01-19T19:46:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -106,6 +106,13 @@ class Analyzer(
   */
  val extendedResolutionRules: Seq[Rule[LogicalPlan]] = Nil

+  /**
+   * Override to provide rules to do post-hoc resolution. Note that these rules will be executed
+   * in an individual bach. This batch is run right after the normal resolution batch and execute


bach -> batch

is run -> is to run

yhuai · 2017-01-19T21:45:11Z

My main concern with this PR is that people may think it is recommended to add new batches in order to force rules to run in a certain order. For these resolution rules, we can also use conditions to control when they fire, right? If we always replace one logical plan with another in the analysis phase, it seems we should use `resolved` to control whether a rule fires.
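The condition-based alternative raised in this comment might look roughly like the following sketch. The node types, the `normalized` marker, and `applyRule` are hypothetical and exist only for illustration; they are not Spark's `LogicalPlan` API.

```scala
object GuardedRuleSketch {
  // Toy plan nodes; `resolved` imitates the flag mentioned in the comment.
  sealed trait Node { def resolved: Boolean }
  final case class UnresolvedTable(name: String) extends Node { val resolved = false }
  final case class Table(name: String) extends Node { val resolved = true }
  // `normalized` is a hypothetical marker recording that the rule already ran.
  final case class CreateTable(name: String, query: Node, normalized: Boolean = false)
      extends Node {
    def resolved: Boolean = query.resolved && normalized
  }

  // Fires only when the child query is resolved and the node is not yet normalized.
  val normalizeCreateTable: PartialFunction[Node, Node] = {
    case c @ CreateTable(_, query, false) if query.resolved =>
      c.copy(name = c.name.toLowerCase, normalized = true)
  }

  // Applies the rule if its guard matches; otherwise returns the node unchanged.
  def applyRule(node: Node): Node =
    normalizeCreateTable.applyOrElse(node, (n: Node) => n)
}
```

In this sketch the guard works only because the node carries an explicit marker saying the normalization has already happened; without such a marker the rule could not tell a handled node from an unhandled one.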

gatorsmile · 2017-01-19T22:47:15Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala

@@ -62,15 +62,17 @@ private[hive] class HiveSessionState(sparkSession: SparkSession)
      override val extendedResolutionRules =
        catalog.ParquetConversions ::
        catalog.OrcConversions ::


How about moving the rules catalog.ParquetConversions and catalog.OrcConversions to the beginning of the postHocResolutionRules batch?

Do they need to move? Eventually they will be optimizer rules.

These two rules need MetastoreRelation. Ideally, they should run after the rule FindHiveSerdeTable.

I am fine with keeping them here if we plan to turn them into optimizer rules.

cloud-fan · 2017-01-20T01:24:32Z

@yhuai Yes, we can use conditions and fold them into `resolved` to control when the rules fire. But another problem is checking and normalization: it's hard to detect whether it has already been done, so we end up doing it again and again. Later we may also have rules that need the checking and normalization to be finished first, and then we have to depend on the order of rules within a batch.

gatorsmile · 2017-01-20T02:37:09Z

I also understand @yhuai's concern. But as the number of rules in a single batch keeps growing, it becomes hard to maintain the order of rules that depend on each other using only the single condition `resolved`. Eventually, I assume we will need to split the huge batch into multiple reasonably sized batches.
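The ordering argument in this thread can be illustrated with a small sketch of a one-pass batch in which one rule relies on another having run first. Everything here (`Plan`, `normalize`, `checkDuplicates`, `runOncePass`) is a hypothetical toy, not code from this PR.

```scala
object RuleOrderSketch {
  // Toy plan: a list of column names plus a flag set by the check rule.
  final case class Plan(columns: List[String], checked: Boolean = false)

  // Normalization rule: lower-case every column name.
  val normalize: Plan => Plan =
    p => p.copy(columns = p.columns.map(_.toLowerCase))

  // Dependent rule: assumes names are already normalized when it looks for duplicates.
  val checkDuplicates: Plan => Plan = p => {
    require(p.columns.distinct.size == p.columns.size, s"duplicate columns: ${p.columns}")
    p.copy(checked = true)
  }

  // A one-pass batch: apply the rules left to right, exactly once.
  def runOncePass(rules: Seq[Plan => Plan], plan: Plan): Plan =
    rules.foldLeft(plan)((p, rule) => rule(p))
}
```

Running `Seq(normalize, checkDuplicates)` on `Plan(List("id", "ID"))` fails the duplicate check as intended, whereas the reverse order lets the duplicate slip through. In a one-pass batch the dependency is expressed simply by the position of the rules in the sequence.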

cloud-fan · 2017-01-20T02:37:42Z

also ping @hvanhovell

SparkQA · 2017-01-20T04:36:28Z

Test build #71693 has finished for PR 16645 at commit b1028ad.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-01-20T04:45:34Z

retest this please

SparkQA · 2017-01-20T07:13:45Z

Test build #71696 has finished for PR 16645 at commit b1028ad.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-23T06:38:40Z

Test build #71825 has started for PR 16645 at commit c55a1f9.

cloud-fan · 2017-01-23T08:24:19Z

retest this please

SparkQA · 2017-01-23T11:06:18Z

Test build #71831 has finished for PR 16645 at commit c55a1f9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-01-24T03:55:12Z

LGTM

gatorsmile · 2017-01-24T04:02:14Z

Thanks! Merging to master.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16645 from cloud-fan/analyzer.

cloud-fan changed the title ~~[SPARK-19290][SQL] add post-hoc resolution~~ [SPARK-19290][SQL] add a new extending interface in Analyzer for post-hoc resolution Jan 19, 2017

gatorsmile reviewed Jan 19, 2017

cloud-fan force-pushed the analyzer branch from 7a71372 to b1028ad Compare January 20, 2017 02:44

add post-hoc resolution

c55a1f9

cloud-fan force-pushed the analyzer branch from b1028ad to c55a1f9 Compare January 23, 2017 06:36

asfgit closed this in fcfd5d0 Jan 24, 2017
