Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21222] Move elimination of Distinct clause from analyzer to optimizer #18429

Closed
wants to merge 10 commits into from

Conversation

gengliangwang
Copy link
Member

@gengliangwang gengliangwang commented Jun 27, 2017

What changes were proposed in this pull request?

Move elimination of Distinct clause from analyzer to optimizer

Distinct clause is useless after MAX/MIN clause. For example,
"Select MAX(distinct a) FROM src from"
is equivalent of
"Select MAX(a) FROM src from"
However, this optimization is implemented in analyzer. It should be in optimizer.

How was this patch tested?

Unit test

@gatorsmile @cloud-fan

Please review http://spark.apache.org/contributing.html before opening a pull request.

@@ -160,6 +160,8 @@ package object dsl {
def last(e: Expression): Expression = new Last(e).toAggregateExpression()
def min(e: Expression): Expression = Min(e).toAggregateExpression()
def max(e: Expression): Expression = Max(e).toAggregateExpression()
def maxDistinct(e: Expression): Expression = Max(e).toAggregateExpression(isDistinct = true)
def minDistinct(e: Expression): Expression = Min(e).toAggregateExpression(isDistinct = true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to line 162

@@ -40,6 +40,8 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: SQLConf)
protected val fixedPoint = FixedPoint(conf.optimizerMaxIterations)

def batches: Seq[Batch] = {
// DISTINCT is not meaningful for a Max or a Min.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: remove this line.

object EliminateDistinct extends Rule[LogicalPlan] {
override def apply(plan: LogicalPlan): LogicalPlan = plan transformExpressions {
case AggregateExpression(af @ Max(_), _, true, _) => AggregateExpression(af, Complete, false)
case AggregateExpression(af @ Min(_), _, true, _) => AggregateExpression(af, Complete, false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: -> af: Max and af: Min

@@ -152,6 +154,16 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: SQLConf)
}

/**
* Remove useless DISTINCT for MAX and MIN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also need to emphasize EliminateDistinct should be before ReplaceDeduplicateWithAggregate

@gengliangwang
Copy link
Member Author

Hi @gatorsmile ,
Thanks for the comments. I have just pushed code changes.

import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Expand, LocalRelation, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.RuleExecutor

class EliminateDistinceSuite extends PlanTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Distinct. not Distince.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo corrected. Thanks!

@SparkQA
Copy link

SparkQA commented Jun 27, 2017

Test build #78676 has finished for PR 18429 at commit 7604811.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 27, 2017

Test build #78680 has finished for PR 18429 at commit 892f50a.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 27, 2017

Test build #78682 has finished for PR 18429 at commit 2f89499.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class EliminateDistinctSuite extends PlanTest

@@ -40,6 +40,7 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: SQLConf)
protected val fixedPoint = FixedPoint(conf.optimizerMaxIterations)

def batches: Seq[Batch] = {
Batch("Eliminate Distinct", Once, EliminateDistinct) ::
Copy link
Contributor

@cloud-fan cloud-fan Jun 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, does it have to be executed before the "Finish Analysis" batch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can move this into the next batch, but it has to be before RewriteDistinctAggregates

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RewriteDistinctAggregates is inside the "Finish Analysis" batch, so the new rule has to be placed before it.

*/
object EliminateDistinct extends Rule[LogicalPlan] {
override def apply(plan: LogicalPlan): LogicalPlan = plan transformExpressions {
case AggregateExpression(max: Max, _, true, _) => AggregateExpression(max, Complete, false)
Copy link
Contributor

@cloud-fan cloud-fan Jun 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it safe to always choose the Complete mode?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ResolveFunctions rule already set it to Complete. @gengliangwang we should just keep the AggregateMode unchanged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, this could be a terrible mistake! I Already corrected, thanks!

@@ -152,6 +153,19 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: SQLConf)
}

/**
* Remove useless DISTINCT for MAX and MIN.
* This rule should be applied before ReplaceDeduplicateWithAggregate.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"before RewriteDistinctAggregates"?


test("Eliminate Distinct in Max") {
val query = testRelation
.select(maxDistinct('a) as('result))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please use java style, i.e. maxDistinct('a).as('result)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, I have revised it.

comparePlans(Optimize.execute(query), answer)
}

test("Eliminate Distinct in Min") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually you can put them in one test:

val query = testRelation
  .select(maxDistinct('a).as('max), minDistinct('a).as('min))
  .analyze

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, I prefer to make it simple

@cloud-fan
Copy link
Contributor

LGTM except some minor comments

@SparkQA
Copy link

SparkQA commented Jun 28, 2017

Test build #78734 has finished for PR 18429 at commit 19163d4.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 28, 2017

Test build #78737 has finished for PR 18429 at commit fd3c849.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 28, 2017

Test build #78738 has finished for PR 18429 at commit 9fb9779.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
object EliminateDistinct extends Rule[LogicalPlan] {
override def apply(plan: LogicalPlan): LogicalPlan = plan transformExpressions {
case AggregateExpression(max: Max, mode: AggregateMode, true, _) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's unclear what the "true" is. I'd either use named argument, or rewrite it to something like

case ae: AggregateExpression if ae.isDistinct =>
  ae.aggregateFunction match {
    case _: Max | _: Min => ae.copy(isDistinct = false)
  }
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. I will remember that and revise the current patch.
Also, how about this:

    case ae @ AggregateExpression(_: Max | _: Min, _, isDistinct, _) if isDistinct =>
      ae.copy(isDistinct = false)

Is it too long?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your version is more readable, the first param "aggregateFunction" is specified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general i'm not a big fan of using extractors unless we need almost all arguments ... it makes refactoring a lot more complicated. Extractors are over/abused in Catalyst (there are some code that use extractors when they are simply doing type matching).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, appreciate your help!

@SparkQA
Copy link

SparkQA commented Jun 28, 2017

Test build #78747 has finished for PR 18429 at commit 08f61f4.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

override def apply(plan: LogicalPlan): LogicalPlan = plan transformExpressions {
case ae: AggregateExpression if ae.isDistinct =>
ae.aggregateFunction match {
case _: Max | _: Min => ae.copy(isDistinct = false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually i made a mistake earlier. you'd need to do a match on other cases and return the ae itself too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My fault...

@SparkQA
Copy link

SparkQA commented Jun 28, 2017

Test build #78758 has finished for PR 18429 at commit 536aae2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 28, 2017

Test build #78762 has finished for PR 18429 at commit 5a3df30.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@cloud-fan
Copy link
Contributor

LGTM

1 similar comment
@gatorsmile
Copy link
Member

LGTM

@SparkQA
Copy link

SparkQA commented Jun 28, 2017

Test build #78811 has finished for PR 18429 at commit 5a3df30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@asfgit asfgit closed this in b72b852 Jun 29, 2017
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
…timizer

## What changes were proposed in this pull request?

Move elimination of Distinct clause from analyzer to optimizer

Distinct clause is useless after MAX/MIN clause. For example,
"Select MAX(distinct a) FROM src from"
is equivalent of
"Select MAX(a) FROM src from"
However, this optimization is implemented in analyzer. It should be in optimizer.

## How was this patch tested?

Unit test

gatorsmile cloud-fan

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Wang Gengliang <ltnwgl@gmail.com>

Closes apache#18429 from gengliangwang/distinct_opt.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants