[SPARK-23957][SQL] Remove redundant sort operators from subqueries #21049
Conversation
## What changes were proposed in this pull request?

Subqueries (at least in SQL) have 'bag of tuples' semantics. Ordering them is therefore redundant (unless combined with a limit). This patch adds a new optimizer rule that removes sort operators that are directly below subqueries (or some combination of projection and filtering below a subquery).

## How was this patch tested?

New unit tests. All SQL unit tests pass.
```scala
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
  case Subquery(child) => Subquery(removeTopLevelSorts(child))
  case SubqueryAlias(name, child) => SubqueryAlias(name, removeTopLevelSorts(child))
}
```
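For readers following along, the recursion the rule delegates to can be sketched with a toy plan ADT. The node names below are hypothetical stand-ins, not Spark's actual Catalyst classes: the idea is simply that `removeTopLevelSorts` drops any `Sort` it finds while looking through operators, such as projections and filters, whose output does not depend on their child's ordering.

```scala
// Hypothetical toy model of a logical plan, for illustration only.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Sort(child: Plan) extends Plan
case class Project(child: Plan) extends Plan
case class Filter(child: Plan) extends Plan

// Drop Sorts found while recursing through order-agnostic operators;
// stop at anything else (a leaf, a limit, an aggregate, ...).
def removeTopLevelSorts(plan: Plan): Plan = plan match {
  case Sort(c)    => removeTopLevelSorts(c)
  case Project(c) => Project(removeTopLevelSorts(c))
  case Filter(c)  => Filter(removeTopLevelSorts(c))
  case other      => other
}
```

In the real rule the recursion runs over Catalyst's `LogicalPlan`, but the shape of the traversal is the same.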
`SubqueryAlias` is not the subquery you want. This is just an alias of a query/table/view. For example:

```scala
Seq((1, 2, "1"), (3, 4, "3")).toDF("int", "int2", "str_sort").orderBy('int.asc).as('df1)
```
Thanks! I've been trying to understand the role of `Subquery` and `SubqueryAlias`. My confusion is that subqueries do seem to get planned as `SubqueryAlias` operators, e.g.:

```
scala> spark.sql("SELECT count(*) from (SELECT id FROM dft ORDER BY id)").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias('count(1), None)]
+- 'SubqueryAlias __auto_generated_subquery_name
   +- 'Sort ['id ASC NULLS FIRST], true
      +- 'Project ['id]
         +- 'UnresolvedRelation `dft`
```

In the example you give I (personally) think it's still reasonable to drop the ordering, but I understand that might surprise some users. It wouldn't be hard to skip the root if it's a subquery - but what do you propose for detecting subqueries if my method isn't right?
```scala
Seq((1, 2, "1"), (3, 4, "3")).toDF("int", "int2", "str_sort").orderBy('int.asc).as('df1)
```

Before entering the optimizer, we get rid of `SubqueryAlias` by the rule `EliminateSubqueryAliases`. Basically, it is a no-op after query analysis. The name is a little bit confusing, I have to admit.
Yep, that's why I added the new rule just before `EliminateSubqueryAliases` (which runs in the optimizer, as part of the 'finish analysis' batch). After `EliminateSubqueryAliases` there doesn't seem to be any way to detect subqueries.

Another approach I suppose would be to handle this like `SparkPlan`'s `requiredChildOrdering`: if a parent doesn't require any ordering of the child (and the child is a `Sort` node), the child `Sort` should be dropped. That seems like a more fundamental change though.
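The `requiredChildOrdering`-style alternative could look roughly like the toy sketch below. All names here are hypothetical, and Spark's real `SparkPlan` API is considerably more involved; this only illustrates the shape of the idea: a parent declares whether it needs its input ordered, and a pass removes a `Sort` whose parent does not.

```scala
// Hypothetical toy model; not Spark's SparkPlan API.
sealed trait Node
case class Scan(name: String) extends Node
case class SortNode(child: Node) extends Node
// A parent operator that records whether it needs its input ordered
// (a merge join would; a hash aggregate would not).
case class Consumer(child: Node, requiresOrdering: Boolean) extends Node

def pruneUnneededSorts(node: Node): Node = node match {
  // Parent does not care about order: drop the Sort directly below it.
  case Consumer(SortNode(c), false) => Consumer(pruneUnneededSorts(c), false)
  case Consumer(c, req)             => Consumer(pruneUnneededSorts(c), req)
  case SortNode(c)                  => SortNode(pruneUnneededSorts(c))
  case leaf                         => leaf
}
```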
Test build #89234 has finished for PR 21049 at commit
I am not sure we can actually do this. If we have for instance:

IIUC the plan would be transformed so that the sort is removed... but actually this will produce a change in the output, since keeping the sort would make the output of the query ordered by
In SQL, the sort in a subquery doesn't make sense because of the relational model - the output of a subquery is an unordered bag of tuples. Some engines still allow the sort, some silently drop it and some throw an error. For example:
Oracle and Postgres allow the sort.

One issue might be that the underlying dataframe model might not be 100% relational - maybe dataframes are sorted lists of rows, and then this optimization would only be valid when using the SQL interface. If so, it's probably not worth the effort to maintain. But if dataframes and SQL relations are supposed to be equivalent, we can drop the sort.

We also may want to decide not to do this because it would surprise users who had been relying on the existing behavior.
@mgaido91 This is a common rule. Last year, @dilipbiswal, @ioana-delaney, @nsyca and I had a discussion about this. We forgot to continue the efforts after finishing the first stage of correlated subquery support: https://issues.apache.org/jira/browse/SPARK-18455. @henryr Thank you for contributing this! @dilipbiswal Do you have the bandwidth to review this PR?
@henryr @gatorsmile I agree with Sean. To the best of my knowledge, Spark does not treat "select ... from (query)" as a subquery. It treats it as an aliased query. Please see the grammar under the `relationPrimary` rule. The subqueries supported in Spark mostly originate from the projection (non-correlated scalar) or predicate of the main query. So basically, we see these as expressions under the Project or Filter operators of the main query block. We can look at SubquerySuite to find usage examples.
@dilipbiswal Thanks! Although Spark doesn't necessarily parse the query in the
@henryr Since SubqueryAlias is used as a correlation name and is used mostly for resolving attributes, in my understanding it's not safe to apply this optimization. I will borrow @gatorsmile's example here. Please note that the alias is specified after the sort. Below is the plan after this optimization removes sorts under SubqueryAlias->child.

```
scala> Seq((1, 2, "1"), (3, 4, "3")).toDF("int", "int2", "str_sort").orderBy('int.asc).as('df1)
res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [int: int, int2: int ... 1 more field]

scala> res0.explain(true)
== Parsed Logical Plan ==
SubqueryAlias df1
+- AnalysisBarrier
   +- Sort [int#7 ASC NULLS FIRST], true
      +- Project [_1#3 AS int#7, _2#4 AS int2#8, _3#5 AS str_sort#9]
         +- LocalRelation [_1#3, _2#4, _3#5]

== Analyzed Logical Plan ==
int: int, int2: int, str_sort: string
SubqueryAlias df1
+- Sort [int#7 ASC NULLS FIRST], true
   +- Project [_1#3 AS int#7, _2#4 AS int2#8, _3#5 AS str_sort#9]
      +- LocalRelation [_1#3, _2#4, _3#5]

== Optimized Logical Plan ==
LocalRelation [int#7, int2#8, str_sort#9]

== Physical Plan ==
LocalTableScan [int#7, int2#8, str_sort#9]
```

In this case we should not be removing the top-level sort from the user's query, right? cc @gatorsmile for his opinion.
@dilipbiswal thanks for the clarification. I agree that this particular case - where the alias is the root of a logical plan - might need special handling. Is there any reason to actually use an alias at the root of a plan like this (outside of composing with other plans, where this optimization would apply)? My suggestion would be, since there are no references to the name the alias introduces, to consider just dropping the alias node during optimization (and then the sort would not get dropped). It does seem to be an edge case, though - no matter how we handle unreferred-to aliases, the optimization seems to be appropriate for the general case where aliases do correspond to subqueries. What do you think?
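The suggestion about unreferenced root aliases could be sketched like this (a toy model with hypothetical names, not Spark's actual classes): if the alias sits at the root of a plan, nothing can reference it, so it could be unwrapped first; the `Sort` beneath it then becomes the query's own top-level sort and would be kept by a rule that only fires below aliases.

```scala
// Hypothetical toy model, for illustration only.
sealed trait LP
case class Rel(name: String) extends LP
case class SortLP(child: LP) extends LP
case class Alias(name: String, child: LP) extends LP

// Unwrap an alias at the root of a plan: nothing refers to its name,
// so dropping the node is safe, and the Sort below it is preserved.
def dropRootAlias(plan: LP): LP = plan match {
  case Alias(_, child) => child
  case other           => other
}
```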
@henryr I can't think of a reason :-). Just that the API allows users to do that. How about this query?

```
scala> spark.sql("with abcd as (select * from t1 order by t1.c1) select * from abcd").explain(true)
18/04/29 23:28:45 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
== Parsed Logical Plan ==
CTE [abcd]
:  +- 'SubqueryAlias abcd
:     +- 'Sort ['t1.c1 ASC NULLS FIRST], true
:        +- 'Project [*]
:           +- 'UnresolvedRelation `t1`
+- 'Project [*]
   +- 'UnresolvedRelation `abcd`

== Analyzed Logical Plan ==
c1: int, c2: int, c3: int
Project [c1#7, c2#8, c3#9]
+- SubqueryAlias abcd
   +- Sort [c1#7 ASC NULLS FIRST], true
      +- Project [c1#7, c2#8, c3#9]
         +- SubqueryAlias t1
            +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8, c3#9]

== Optimized Logical Plan ==
HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8, c3#9]

== Physical Plan ==
HiveTableScan [c1#7, c2#8, c3#9], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8, c3#9]
```

IMHO, it's probably better to correctly detect the real subqueries and apply this optimization only then, in order to be fully sure about it. cc @gatorsmile
I might be a bit of a hardliner on this, but I think it's correct to eliminate the `ORDER BY` from common table expressions (e.g. MSSQL agrees with me, see this link). However, given the principle of least surprise, I agree it might be a good idea to at least start with scalar and nested subqueries, and leave inline views for another day. That might be a bit harder to do (I think the rule will need a whitelist of operators it's ok to eliminate sorts below), and in general I think there'll be some missed opportunities, but it's a start :) Alternatively, we could extend the analyzed logical plan to explicitly mark the different subquery types (i.e. have a
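The whitelist variant mentioned above might look like this toy sketch (hypothetical names, not Spark's Catalyst classes): recursion proceeds only through operators known to ignore their child's ordering, so anything else - a `Limit`, for example - stops the traversal and the `Sort` beneath it survives, preserving ORDER BY + LIMIT semantics.

```scala
// Hypothetical toy model, for illustration only.
sealed trait Op
case class Table(name: String) extends Op
case class SortOp(child: Op) extends Op
case class Proj(child: Op) extends Op
case class Limit(n: Int, child: Op) extends Op

// Recurse only through whitelisted, order-agnostic operators; a Limit
// is not on the whitelist, so a Sort+Limit pair is left intact.
def dropRedundantSorts(op: Op): Op = op match {
  case SortOp(c) => dropRedundantSorts(c)
  case Proj(c)   => Proj(dropRedundantSorts(c))
  case other     => other // Limit, Table, ...: stop here
}
```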
@henryr So perhaps the SQL standard does not explicitly clarify this?
@dilipbiswal I tried hard to find something in the SQL standard that clarified the situation, but couldn't (of course, that could be because the standard is pretty hard to parse... :)). So let's go with the idea of not dropping
@henryr Could you update the PR based on the review? We can safely drop them in scalar subqueries and nested subqueries.
@dilipbiswal Can you take this over?
@gatorsmile Sure.