[SPARK-23973][SQL] Remove consecutive Sorts #21072

mgaido91 · 2018-04-14T11:11:25Z

What changes were proposed in this pull request?

In SPARK-23375 we introduced the ability to remove a Sort operation during query optimization if the data is already sorted. This follow-up also removes a Sort that is followed by another Sort: in this case the first sort is not needed and can be safely removed.

The PR starts from @henryr's comment: #20560 (comment). So credit should be given to him.

How was this patch tested?

Added unit tests.

mgaido91 · 2018-04-14T11:12:03Z

cc @cloud-fan @henryr

SparkQA · 2018-04-14T14:49:41Z

Test build #89367 has finished for PR 21072 at commit 6ba4186.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

henryr · 2018-04-14T21:19:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

 */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Sort(orders, true, child) if SortOrder.orderingSatisfies(child.outputOrdering, orders) =>
      child
+    case s @ Sort(_, _, Sort(_, _, child)) => s.copy(child = child)


Thanks for doing this! It might be useful to generalise this to any pair of sorts separated by 0 or more projections or filters. I did this for my SPARK-23975 PR, see: henryr@bb992c2#diff-a636a87d8843eeccca90140be91d4fafR322

What do you think?

Yes, it makes sense. I will do that, thanks.

SparkQA · 2018-04-16T11:46:09Z

Test build #89388 has finished for PR 21072 at commit ff7d412.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-04-16T16:14:17Z

Test build #89395 has finished for PR 21072 at commit ac03bed.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-04-16T18:54:59Z

retest this please

henryr · 2018-04-16T18:50:32Z

...alyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantSortsSuite.scala

+    val optimized = Optimize.execute(orderedTwice.analyze)
+    val correctAnswer = testRelation.orderBy('b.desc).analyze
+    comparePlans(optimized, correctAnswer)
+  }


Can you add a test for three consecutive sorts? Two is the base case, three will help us show the inductive case :)
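The base case and the inductive case can be sketched on a toy plan model. This is only an illustration: `Relation`, `Sort`, `Project`, `Filter` and the helpers below are hypothetical stand-ins defined here, not Spark's Catalyst classes, and the real rule may differ in detail.

```scala
object ConsecutiveSortsSketch {
  // Toy plan nodes (hypothetical, not Catalyst's LogicalPlan hierarchy).
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Sort(order: String, child: Plan) extends Plan
  case class Project(cols: Seq[String], child: Plan) extends Plan
  case class Filter(cond: String, child: Plan) extends Plan

  // Drop every Sort that is reachable through Project/Filter nodes only.
  def removeSortsBelow(plan: Plan): Plan = plan match {
    case Sort(_, child)       => removeSortsBelow(child)
    case Project(cols, child) => Project(cols, removeSortsBelow(child))
    case Filter(cond, child)  => Filter(cond, removeSortsBelow(child))
    case other                => other
  }

  // Keep the outermost Sort and strip the sorts it makes redundant.
  def optimize(plan: Plan): Plan = plan match {
    case Sort(order, child)   => Sort(order, optimize(removeSortsBelow(child)))
    case Project(cols, child) => Project(cols, optimize(child))
    case Filter(cond, child)  => Filter(cond, optimize(child))
    case r: Relation          => r
  }

  val t: Plan = Relation("t")
  // Base case: two consecutive sorts.
  val twoSorts: Plan = Sort("b desc", Sort("a asc", t))
  // Inductive case: three sorts, one of them behind a Filter.
  val threeSorts: Plan = Sort("c asc", Filter("a > 1", Sort("b desc", Sort("a asc", t))))
}
```

In this sketch both plans keep only their outermost sort; the intermediate `Filter` is preserved.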

henryr · 2018-04-16T18:58:46Z

...alyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RemoveRedundantSortsSuite.scala

@@ -98,4 +98,31 @@ class RemoveRedundantSortsSuite extends PlanTest {
    val correctAnswer = groupedAndResorted.analyze
    comparePlans(optimized, correctAnswer)
  }
+


Could you add a test which explicitly confirms that sort.limit.sort is not simplified? I know the above two tests cover that case, but it's good to have one dedicated to testing this important property.
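Why sort.limit.sort must not be simplified can be seen on plain Scala collections. The rows below are made-up sample data, and `take` stands in for a limit operator; this is a sketch of the semantics, not of the optimizer test itself.

```scala
object SortLimitSortSketch {
  // Hypothetical sample rows as (a, b) pairs, in arrival order.
  val rows: Seq[(Int, Int)] = Seq((3, 30), (1, 10), (2, 20), (4, 40))

  // sort by a, limit 2, then sort by b descending
  val withInnerSort: Seq[(Int, Int)] =
    rows.sortBy(_._1).take(2).sortBy(r => -r._2)

  // the same pipeline with the inner sort (incorrectly) removed
  val withoutInnerSort: Seq[(Int, Int)] =
    rows.take(2).sortBy(r => -r._2)
}
```

The two pipelines keep different rows, because the limit picks whichever rows come first in its input order. The inner sort is therefore not redundant.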

henryr · 2018-04-16T19:04:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if the there is another Sort operator separated by 0...n Project/Filter operators


nit: 'the there'

SparkQA · 2018-04-16T22:45:29Z

Test build #89413 has finished for PR 21072 at commit ac03bed.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-04-17T09:38:12Z

@henryr thanks, I added the test cases you suggested :)

SparkQA · 2018-04-17T11:13:06Z

Test build #89445 has finished for PR 21072 at commit 1d6ca1e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-04-17T11:15:01Z

retest this please

SparkQA · 2018-04-17T14:56:49Z

Test build #89449 has finished for PR 21072 at commit 1d6ca1e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mgaido91 · 2018-04-20T16:12:05Z

Any more comments, @henryr? Comments, @cloud-fan?

cloud-fan · 2018-04-23T04:01:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+
+  def recursiveRemoveSort(plan: LogicalPlan): LogicalPlan = plan match {
+    case Project(fields, child) => Project(fields, recursiveRemoveSort(child))
+    case Filter(condition, child) => Filter(condition, recursiveRemoveSort(child))


we should at least add ResolvedHint. To easily expand the white list in the future, I'd like to change the code style to

def recursiveRemoveSort(plan: LogicalPlan): LogicalPlan = plan match {
  case s: Sort => recursiveRemoveSort(s.child)
  case other if canEliminateSort(other) =>
    other.withNewChildren(other.children.map(recursiveRemoveSort))
  case _ => plan
}

def canEliminateSort(plan: LogicalPlan): Boolean = plan match {
  case p: Project => p.projectList.forall(_.deterministic)
  case f: Filter => f.condition.deterministic
  case _: ResolvedHint => true
  ...
  case _ => false
}

why do you think we should check for the filter condition and the projected items to be deterministic?

By the definition of deterministic, the entire input is the state of the expression. It's very likely we will get a different result if we remove a sort before a filter: e.g. rowId() < 10 keeps the first 10 rows, and if you sort the input, the first 10 rows change.

I think we should be conservative about non-deterministic expressions.
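A minimal sketch of the whitelist idea on a toy plan model follows. The node classes and the `deterministic` flag are assumptions made for illustration; in Catalyst the check is made on the expressions themselves, as in the snippet proposed above.

```scala
object DeterministicWhitelistSketch {
  // Toy plan nodes (hypothetical). `deterministic` is a stand-in flag for
  // the expression-level determinism check.
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Sort(order: String, child: Plan) extends Plan
  case class Filter(cond: String, deterministic: Boolean, child: Plan) extends Plan
  case class Project(cols: Seq[String], deterministic: Boolean, child: Plan) extends Plan

  // Whitelist: only these operators let a sort below them be dropped.
  def canEliminateSort(plan: Plan): Boolean = plan match {
    case p: Project => p.deterministic
    case f: Filter  => f.deterministic
    case _          => false
  }

  // Remove sorts below whitelisted operators; stop at anything else.
  def recursiveRemoveSort(plan: Plan): Plan = plan match {
    case Sort(_, child) => recursiveRemoveSort(child)
    case p: Project if canEliminateSort(p) => p.copy(child = recursiveRemoveSort(p.child))
    case f: Filter if canEliminateSort(f)  => f.copy(child = recursiveRemoveSort(f.child))
    case other => other
  }

  val sorted: Plan = Sort("a asc", Relation("t"))
  // Deterministic filter: the sort underneath can go.
  val safe: Plan = Filter("a > 1", deterministic = true, sorted)
  // Position-dependent filter: the sort underneath must stay.
  val unsafe: Plan = Filter("rowId() < 10", deterministic = false, sorted)
}
```

Extending the whitelist in this shape only requires one more case in `canEliminateSort` and one in `recursiveRemoveSort`; the original proposal avoids the second by using `withNewChildren`.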

cloud-fan · 2018-04-23T12:51:51Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

- * Removes Sort operation if the child is already sorted
+ * Removes redundant Sort operation. This can happen:
+ * 1) if the child is already sorted
+ * 2) if there is another Sort operator separated by 0...n Project/Filter operators
 */
 object RemoveRedundantSorts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {


nit: now it's more efficient to do transformDown

isn't it the same?

assume the plan is

Sort
  Filter
    Sort
      Filter
        Sort
          OtherPlan

If we do transformUp, we hit the rule 3 times, which causes some unnecessary transformation (OtherPlan is transformed 3 times). With transformDown, it's one-shot.

Yes, but I saw that transform actually does transformDown. Anyway, I see that this might change, so it is best to use transformDown explicitly here.
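The difference in how often the rule fires can be reproduced on a toy tree. Everything below (the node classes, `Counter`, and the two traversal helpers) is a hypothetical model written for this illustration; it mimics the traversal order only and is not Catalyst's TreeNode implementation.

```scala
object TraversalOrderSketch {
  // Toy single-child plan nodes (hypothetical).
  sealed trait Plan
  case class Leaf(name: String) extends Plan
  case class Sort(order: String, child: Plan) extends Plan
  case class Filter(cond: String, child: Plan) extends Plan

  final class Counter { var fired: Int = 0 }

  // Strip all sorts reachable through Filter nodes.
  def removeSortsBelow(plan: Plan): Plan = plan match {
    case Sort(_, child)      => removeSortsBelow(child)
    case Filter(cond, child) => Filter(cond, removeSortsBelow(child))
    case leaf                => leaf
  }

  // The rule fires on every Sort it is applied to and counts each firing.
  def rule(c: Counter)(plan: Plan): Plan = plan match {
    case Sort(order, child) =>
      c.fired += 1
      Sort(order, removeSortsBelow(child))
    case other => other
  }

  def mapChild(plan: Plan)(f: Plan => Plan): Plan = plan match {
    case Sort(order, child)  => Sort(order, f(child))
    case Filter(cond, child) => Filter(cond, f(child))
    case leaf                => leaf
  }

  // Pre-order: apply the rule to a node, then recurse into the result.
  def transformDown(plan: Plan)(r: Plan => Plan): Plan =
    mapChild(r(plan))(child => transformDown(child)(r))

  // Post-order: rewrite the children first, then apply the rule.
  def transformUp(plan: Plan)(r: Plan => Plan): Plan =
    r(mapChild(plan)(child => transformUp(child)(r)))

  val plan: Plan =
    Sort("c", Filter("f1", Sort("b", Filter("f2", Sort("a", Leaf("OtherPlan"))))))

  val downCounter = new Counter
  val down: Plan = transformDown(plan)(rule(downCounter))
  val upCounter = new Counter
  val up: Plan = transformUp(plan)(rule(upCounter))
}
```

In this model both traversals produce the same plan, but the bottom-up one fires the rule once per original Sort, while the top-down one removes the inner sorts in its first firing.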

cloud-fan · 2018-04-23T12:54:26Z

LGTM

SparkQA · 2018-04-23T16:27:43Z

Test build #89721 has finished for PR 21072 at commit e7391f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-04-23T17:31:16Z

Test build #89726 has finished for PR 21072 at commit e2f4d4d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-04-24T02:11:24Z

thanks, merging to master!

[SPARK-23973][SQL] Remove consecutive Sorts

6ba4186

henryr reviewed Apr 14, 2018

remove Sort operator separated by 0...n Project/Filter operators

ff7d412

fix test

ac03bed

henryr reviewed Apr 16, 2018

add test cases + fix typo

1d6ca1e

cloud-fan reviewed Apr 23, 2018

address comments

e7391f3

cloud-fan reviewed Apr 23, 2018

address comment

e2f4d4d

asfgit closed this in 281c1ca Apr 24, 2018

cloud-fan mentioned this pull request Apr 26, 2018

[SPARK-23375][SQL] Eliminate unneeded Sort in Optimizer #20560

Closed

sarutak mentioned this pull request Oct 6, 2020

[SPARK-32820][SQL] Remove redundant shuffle exchanges inserted by EnsureRequirements #29677

Closed

tanelk mentioned this pull request Oct 9, 2020

[SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs #29950

Closed

maropu mentioned this pull request Oct 20, 2020

[SPARK-33183][SQL] Fix Optimizer rule EliminateSorts and add a physical rule to remove redundant sorts #30093

Closed
