[SPARK-21222] Move elimination of Distinct clause from analyzer to optimizer #18429

gengliangwang · 2017-06-27T04:13:00Z

What changes were proposed in this pull request?

Move elimination of Distinct clause from analyzer to optimizer

The DISTINCT clause is redundant inside MAX/MIN. For example,
"SELECT MAX(DISTINCT a) FROM src"
is equivalent to
"SELECT MAX(a) FROM src"
However, this optimization is currently implemented in the analyzer. It belongs in the optimizer.
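
The equivalence claimed above can be checked on plain Scala collections. This is only a sketch: the object name and the sample values are made up for illustration and stand in for column a of src. It also shows why the same shortcut would not be valid for SUM.

```scala
object DistinctMaxMinSketch {
  // Hypothetical sample values standing in for column `a` of table `src`.
  val a: Seq[Int] = Seq(3, 1, 3, 7, 7, 2)

  // MAX and MIN only care about which values occur, not how often.
  val maxAll: Int = a.max
  val maxDistinct: Int = a.distinct.max
  val minAll: Int = a.min
  val minDistinct: Int = a.distinct.min

  // SUM depends on multiplicity, so DISTINCT cannot be dropped there.
  val sumAll: Int = a.sum
  val sumDistinct: Int = a.distinct.sum
}
```

Removing duplicates never changes the largest or smallest element, which is why the rule is restricted to Max and Min.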

How was this patch tested?

Unit test

@gatorsmile @cloud-fan

Please review http://spark.apache.org/contributing.html before opening a pull request.

gatorsmile · 2017-06-27T04:21:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala

@@ -160,6 +160,8 @@ package object dsl {
    def last(e: Expression): Expression = new Last(e).toAggregateExpression()
    def min(e: Expression): Expression = Min(e).toAggregateExpression()
    def max(e: Expression): Expression = Max(e).toAggregateExpression()
+    def maxDistinct(e: Expression): Expression = Max(e).toAggregateExpression(isDistinct = true)
+    def minDistinct(e: Expression): Expression = Min(e).toAggregateExpression(isDistinct = true)


Move this to line 162

gatorsmile · 2017-06-27T04:21:40Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -40,6 +40,8 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: SQLConf)
  protected val fixedPoint = FixedPoint(conf.optimizerMaxIterations)

  def batches: Seq[Batch] = {
+    // DISTINCT is not meaningful for a Max or a Min.


Nit: remove this line.

gatorsmile · 2017-06-27T04:23:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+object EliminateDistinct extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformExpressions  {
+    case AggregateExpression(af @ Max(_), _, true, _) => AggregateExpression(af, Complete, false)
+    case AggregateExpression(af @ Min(_), _, true, _) => AggregateExpression(af, Complete, false)


Nit: -> af: Max and af: Min

gatorsmile · 2017-06-27T04:25:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -152,6 +154,16 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: SQLConf)
 }

 /**
+ * Remove useless DISTINCT for MAX and MIN


Also need to emphasize EliminateDistinct should be before ReplaceDeduplicateWithAggregate

gengliangwang · 2017-06-27T04:43:39Z

Hi @gatorsmile ,
Thanks for the comments. I have just pushed code changes.

rxin · 2017-06-27T04:52:20Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateDistinceSuite.scala

+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, Expand, LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+
+class EliminateDistinceSuite extends PlanTest {


Distinct, not Distince.

Typo corrected. Thanks!

SparkQA · 2017-06-27T05:52:08Z

Test build #78676 has finished for PR 18429 at commit 7604811.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-06-27T06:43:36Z

Test build #78680 has finished for PR 18429 at commit 892f50a.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-06-27T06:55:32Z

Test build #78682 has finished for PR 18429 at commit 2f89499.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class EliminateDistinctSuite extends PlanTest

cloud-fan · 2017-06-27T08:27:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -40,6 +40,7 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: SQLConf)
  protected val fixedPoint = FixedPoint(conf.optimizerMaxIterations)

  def batches: Seq[Batch] = {
+    Batch("Eliminate Distinct", Once, EliminateDistinct) ::


hmm, does it have to be executed before the "Finish Analysis" batch?

We can move this into the next batch, but it has to be before RewriteDistinctAggregates

RewriteDistinctAggregates is inside the "Finish Analysis" batch, so the new rule has to be placed before it.
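
A toy sketch of why the ordering discussed above matters. Nothing here is Catalyst code: Agg, the expanded flag, and both rule functions are hypothetical stand-ins, and rewriteDistinct only marks an aggregate as having taken the more expensive distinct path rather than modelling what RewriteDistinctAggregates really does.

```scala
object RuleOrderSketch {
  // Hypothetical, heavily simplified aggregate: `expanded` marks that the
  // aggregate went through the (costlier) distinct rewrite.
  case class Agg(function: String, isDistinct: Boolean, expanded: Boolean = false)

  type Plan = Seq[Agg]
  type Rule = Plan => Plan

  // Stand-in for EliminateDistinct: drop DISTINCT for max/min only.
  val eliminateDistinct: Rule = (plan: Plan) => plan.map {
    case a if a.isDistinct && (a.function == "max" || a.function == "min") =>
      a.copy(isDistinct = false)
    case a => a
  }

  // Stand-in for a later rule that rewrites every remaining distinct aggregate.
  val rewriteDistinct: Rule = (plan: Plan) => plan.map {
    case a if a.isDistinct => a.copy(isDistinct = false, expanded = true)
    case a => a
  }

  // Apply rules strictly in the given order, like batches run one after another.
  def run(rules: Seq[Rule], plan: Plan): Plan =
    rules.foldLeft(plan)((p, rule) => rule(p))
}
```

With eliminateDistinct first, the max aggregate never reaches the distinct rewrite. With the order reversed, it has already been rewritten by the time eliminateDistinct runs, so the rule has nothing left to simplify.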

cloud-fan · 2017-06-27T08:28:40Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+ */
+object EliminateDistinct extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformExpressions  {
+    case AggregateExpression(max: Max, _, true, _) => AggregateExpression(max, Complete, false)


is it safe to always choose the Complete mode?

The ResolveFunctions rule already sets it to Complete. @gengliangwang we should just keep the AggregateMode unchanged.

Got it, this could have been a terrible mistake! I have already corrected it, thanks!
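
On the AggregateMode point above, a minimal sketch of the difference between rebuilding the expression with a hard-coded mode and copying it. The types are simplified stand-ins defined here for illustration, not the real Catalyst classes, and the Partial input is purely hypothetical.

```scala
object ModeSketch {
  // Simplified stand-ins; the real Catalyst types carry more fields and modes.
  sealed trait AggregateMode
  case object Partial extends AggregateMode
  case object Complete extends AggregateMode

  trait AggregateFunction
  case class Max(child: String) extends AggregateFunction

  case class AggregateExpression(
      aggregateFunction: AggregateFunction,
      mode: AggregateMode,
      isDistinct: Boolean)

  // Rebuilds the expression and silently overwrites whatever mode it had.
  def hardCodedMode(ae: AggregateExpression): AggregateExpression =
    AggregateExpression(ae.aggregateFunction, Complete, isDistinct = false)

  // Changes only the flag the rule cares about; every other field is kept.
  def preservingMode(ae: AggregateExpression): AggregateExpression =
    ae.copy(isDistinct = false)
}
```

Using copy means the rule stays correct even if it is ever applied to an expression whose mode is not Complete.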

cloud-fan · 2017-06-28T01:21:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -152,6 +153,19 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: SQLConf)
 }

 /**
+ * Remove useless DISTINCT for MAX and MIN.
+ * This rule should be applied before ReplaceDeduplicateWithAggregate.


"before RewriteDistinctAggregates"?

cloud-fan · 2017-06-28T01:21:59Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateDistinctSuite.scala

+
+  test("Eliminate Distinct in Max") {
+    val query = testRelation
+      .select(maxDistinct('a) as('result))


nit: please use java style, i.e. maxDistinct('a).as('result)

got it, I have revised it.

cloud-fan · 2017-06-28T01:23:04Z

...catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateDistinctSuite.scala

+    comparePlans(Optimize.execute(query), answer)
+  }
+
+  test("Eliminate Distinct in Min") {


actually you can put them in one test:

    val query = testRelation
      .select(maxDistinct('a).as('max), minDistinct('a).as('min))
      .analyze

well, I prefer to make it simple

cloud-fan · 2017-06-28T01:23:33Z

LGTM except some minor comments

SparkQA · 2017-06-28T03:01:38Z

Test build #78734 has finished for PR 18429 at commit 19163d4.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-06-28T03:05:28Z

Test build #78737 has finished for PR 18429 at commit fd3c849.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-06-28T03:39:35Z

Test build #78738 has finished for PR 18429 at commit 9fb9779.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2017-06-28T04:59:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+ */
+object EliminateDistinct extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformExpressions  {
+    case AggregateExpression(max: Max, mode: AggregateMode, true, _) =>


it's unclear what the "true" is. I'd either use named argument, or rewrite it to something like

    case ae: AggregateExpression if ae.isDistinct =>
      ae.aggregateFunction match {
        case _: Max | _: Min => ae.copy(isDistinct = false)
      }

Makes sense. I will remember that and revise the current patch.
Also, how about this:

case ae @ AggregateExpression(_: Max | _: Min, _, isDistinct, _) if isDistinct => ae.copy(isDistinct = false)

Is it too long?

I think your version is more readable, since the first parameter, "aggregateFunction", is matched explicitly.

In general I'm not a big fan of using extractors unless we need almost all arguments ... it makes refactoring a lot more complicated. Extractors are overused/abused in Catalyst (there is some code that uses extractors when it is simply doing type matching).

See https://github.com/databricks/scala-style-guide#pattern-matching

Got it, appreciate your help!
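
To make the two styles discussed above concrete, here is a sketch on simplified stand-in types (defined here for illustration only, not the Catalyst classes; resultId is a hypothetical fourth field). Both functions behave the same; they differ in how tightly the pattern is coupled to the constructor's argument list.

```scala
object PatternStyleSketch {
  // Simplified stand-ins for the Catalyst expression hierarchy.
  trait Expression
  case class Literal(value: Int) extends Expression

  trait AggregateFunction
  case class Max(child: String) extends AggregateFunction
  case class Min(child: String) extends AggregateFunction
  case class Count(child: String) extends AggregateFunction

  sealed trait AggregateMode
  case object Complete extends AggregateMode

  case class AggregateExpression(
      aggregateFunction: AggregateFunction,
      mode: AggregateMode,
      isDistinct: Boolean,
      resultId: Int) extends Expression

  // Extractor style: positional, so adding or reordering a constructor
  // field forces every such pattern to be edited.
  def viaExtractor(e: Expression): Expression = e match {
    case ae @ AggregateExpression(_: Max | _: Min, _, true, _) =>
      ae.copy(isDistinct = false)
    case other => other
  }

  // Type match plus field access: only names the fields actually used.
  def viaTypeMatch(e: Expression): Expression = e match {
    case ae: AggregateExpression if ae.isDistinct =>
      ae.aggregateFunction match {
        case _: Max | _: Min => ae.copy(isDistinct = false)
        case _ => ae
      }
    case other => other
  }
}
```

If a fifth field were added to AggregateExpression, viaExtractor would stop compiling until its pattern was updated, while viaTypeMatch would be unaffected.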

SparkQA · 2017-06-28T05:44:03Z

Test build #78747 has finished for PR 18429 at commit 08f61f4.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2017-06-28T05:58:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformExpressions  {
+    case ae: AggregateExpression if ae.isDistinct =>
+      ae.aggregateFunction match {
+        case _: Max | _: Min => ae.copy(isDistinct = false)


actually i made a mistake earlier. you'd need to do a match on other cases and return the ae itself too

My fault...
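
For reference, a sketch of the failure mode behind the missing default case, again on simplified stand-in types defined here for illustration rather than the Catalyst classes. The outer guard accepts any distinct aggregate, so an inner match without a fallback throws scala.MatchError for functions other than Max and Min.

```scala
import scala.util.Try

object DefaultCaseSketch {
  // Simplified stand-ins; not the real Catalyst classes.
  trait AggregateFunction
  case class Max(child: String) extends AggregateFunction
  case class Min(child: String) extends AggregateFunction
  case class Count(child: String) extends AggregateFunction

  case class AggregateExpression(aggregateFunction: AggregateFunction, isDistinct: Boolean)

  type Rewrite = PartialFunction[AggregateExpression, AggregateExpression]

  // Inner match has no fallback: COUNT(DISTINCT a) reaches it and fails.
  val withoutDefault: Rewrite = {
    case ae if ae.isDistinct =>
      ae.aggregateFunction match {
        case _: Max | _: Min => ae.copy(isDistinct = false)
      }
  }

  // Fallback returns the expression untouched for every other function.
  val withDefault: Rewrite = {
    case ae if ae.isDistinct =>
      ae.aggregateFunction match {
        case _: Max | _: Min => ae.copy(isDistinct = false)
        case _ => ae
      }
  }

  // True when applying `rewrite` to `ae` fails with a MatchError.
  def throwsMatchError(rewrite: Rewrite, ae: AggregateExpression): Boolean =
    Try(rewrite(ae)).failed.toOption.exists(_.isInstanceOf[MatchError])
}
```

The fallback case is what keeps distinct aggregates such as COUNT(DISTINCT a) intact instead of failing while the rule is applied.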

SparkQA · 2017-06-28T06:59:24Z

Test build #78758 has finished for PR 18429 at commit 536aae2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-06-28T07:04:58Z

Test build #78762 has finished for PR 18429 at commit 5a3df30.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-06-28T15:01:17Z

retest this please

cloud-fan · 2017-06-28T15:01:39Z

LGTM

gatorsmile · 2017-06-28T16:19:14Z

LGTM

SparkQA · 2017-06-28T17:22:17Z

Test build #78811 has finished for PR 18429 at commit 5a3df30.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-06-29T00:47:51Z

thanks, merging to master!

…timizer

## What changes were proposed in this pull request?

Move elimination of Distinct clause from analyzer to optimizer

The DISTINCT clause is redundant inside MAX/MIN. For example,
"SELECT MAX(DISTINCT a) FROM src"
is equivalent to
"SELECT MAX(a) FROM src"
However, this optimization is currently implemented in the analyzer. It belongs in the optimizer.

## How was this patch tested?

Unit test

gatorsmile cloud-fan

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Wang Gengliang <ltnwgl@gmail.com>

Closes apache#18429 from gengliangwang/distinct_opt.

gengliangwang added 2 commits June 23, 2017 19:54

save for now

f60c184

finish implementation and test cases

7604811

gatorsmile reviewed Jun 27, 2017

revise code style and comments

892f50a

rxin reviewed Jun 27, 2017

fix typo: distince=>distinct

2f89499

cloud-fan reviewed Jun 27, 2017

remain AggregateMode

19163d4

cloud-fan reviewed Jun 28, 2017

gengliangwang added 2 commits June 27, 2017 18:28

revise code style

fd3c849

revise comment

9fb9779

revise code style

08f61f4

rxin reviewed Jun 28, 2017

stop abusing extractors

536aae2

rxin reviewed Jun 28, 2017

handle default case

5a3df30

asfgit closed this in b72b852 Jun 29, 2017

ulysses-you mentioned this pull request Apr 12, 2022

[SPARK-38832][SQL] Remove unnecessary distinct in aggregate expression by distinctKeys #36117

Closed
