Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19290][SQL] add a new extending interface in Analyzer for post-hoc resolution #16645

Closed
wants to merge 1 commit into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

To implement DDL commands, we added several analyzer rules in sql/hive module to analyze DDL related plans. However, our Analyzer currently only have one extending interface: extendedResolutionRules, which defines extra rules that will be run together with other rules in the resolution batch, and doesn't fit DDL rules well, because:

  1. DDL rules may do some checking and normalization, but we may do it many times as the resolution batch will run rules again and again, until fixed point, and it's hard to tell if a DDL rule has already done its checking and normalization. It's fine because DDL rules are idempotent, but it's bad for analysis performance
  2. some DDL rules may depend on others, and it's pretty hard to write if conditions to guarantee the dependencies. It will be good if we have a batch which run rules in one pass, so that we can guarantee the dependencies by rules order.

This PR adds a new extending interface in Analyzer: postHocResolutionRules, which defines rules that will be run only once in a batch runs right after the resolution batch.

How was this patch tested?

existing tests

@cloud-fan cloud-fan changed the title [SPARK-19290][SQL] add post-hoc resolution [SPARK-19290][SQL] add a new extending interface in Analyzer for post-hoc resolution Jan 19, 2017
@cloud-fan
Copy link
Contributor Author

cc @yhuai @gatorsmile

@SparkQA
Copy link

SparkQA commented Jan 19, 2017

Test build #71663 has finished for PR 16645 at commit 7a71372.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -106,6 +106,13 @@ class Analyzer(
*/
val extendedResolutionRules: Seq[Rule[LogicalPlan]] = Nil

/**
* Override to provide rules to do post-hoc resolution. Note that these rules will be executed
* in an individual bach. This batch is run right after the normal resolution batch and execute
Copy link
Member

@gatorsmile gatorsmile Jan 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bach -> batch

is run -> is to run

@yhuai
Copy link
Contributor

yhuai commented Jan 19, 2017

My main concern of this pr is that if people will think it is recommended to add new batches to force those rules running in a certain ordering. For these resolution rules, we can also use conditions to control when they will fire, right? If we will always replace a logical plan to another one in the analysis phase, seems we should use resolved to control if a rule will fired.

@@ -62,15 +62,17 @@ private[hive] class HiveSessionState(sparkSession: SparkSession)
override val extendedResolutionRules =
catalog.ParquetConversions ::
catalog.OrcConversions ::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving the rule catalog.ParquetConversions and catalog.OrcConversions at the beginning of the batch postHocResolutionRules ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do they need to? Eventually they will be optimizer rules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two rules need MetastoreRelation. Ideally, they should be after the rule FindHiveSerdeTable.

I am fine to keep it if we plan to move it into optimizer rules.

@cloud-fan
Copy link
Contributor Author

@yhuai yes we can use conditions and put them in resolved to control when the rules will fire, but another problem is checking and normalization, it's hard to detect if it's done and we will do it again and again. Later we may also have rules that need the checking and normalization done, then we have to depend on rules order in a batch.

@gatorsmile
Copy link
Member

I also understand the concern of @yhuai . But, when the number of rules in a single batch keeps growing, using a single condition resolved is a little bit hard to maintain the order of rules when they depend on each other. Eventually, I assume we need to split the huge batch to multiple reasonable batches.

@cloud-fan
Copy link
Contributor Author

also ping @hvanhovell

@SparkQA
Copy link

SparkQA commented Jan 20, 2017

Test build #71693 has finished for PR 16645 at commit b1028ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jan 20, 2017

Test build #71696 has finished for PR 16645 at commit b1028ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 23, 2017

Test build #71825 has started for PR 16645 at commit c55a1f9.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jan 23, 2017

Test build #71831 has finished for PR 16645 at commit c55a1f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

LGTM

@gatorsmile
Copy link
Member

Thanks! Merging to master.

@asfgit asfgit closed this in fcfd5d0 Jan 24, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…-hoc resolution

## What changes were proposed in this pull request?

To implement DDL commands, we added several analyzer rules in sql/hive module to analyze DDL related plans. However, our `Analyzer` currently only have one extending interface: `extendedResolutionRules`, which defines extra rules that will be run together with other rules in the resolution batch, and doesn't fit DDL rules well, because:

1. DDL rules may do some checking and normalization, but we may do it many times as the resolution batch will run rules again and again, until fixed point, and it's hard to tell if a DDL rule has already done its checking and normalization. It's fine because DDL rules are idempotent, but it's bad for analysis performance
2. some DDL rules may depend on others, and it's pretty hard to write `if` conditions to guarantee the dependencies. It will be good if we have a batch which run rules in one pass, so that we can guarantee the dependencies by rules order.

This PR adds a new extending interface in `Analyzer`: `postHocResolutionRules`, which defines rules that will be run only once in a batch runs right after the resolution batch.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16645 from cloud-fan/analyzer.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…-hoc resolution

## What changes were proposed in this pull request?

To implement DDL commands, we added several analyzer rules in sql/hive module to analyze DDL related plans. However, our `Analyzer` currently only have one extending interface: `extendedResolutionRules`, which defines extra rules that will be run together with other rules in the resolution batch, and doesn't fit DDL rules well, because:

1. DDL rules may do some checking and normalization, but we may do it many times as the resolution batch will run rules again and again, until fixed point, and it's hard to tell if a DDL rule has already done its checking and normalization. It's fine because DDL rules are idempotent, but it's bad for analysis performance
2. some DDL rules may depend on others, and it's pretty hard to write `if` conditions to guarantee the dependencies. It will be good if we have a batch which run rules in one pass, so that we can guarantee the dependencies by rules order.

This PR adds a new extending interface in `Analyzer`: `postHocResolutionRules`, which defines rules that will be run only once in a batch runs right after the resolution batch.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#16645 from cloud-fan/analyzer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants