
[SPARK-23957][SQL] Remove redundant sort operators from subqueries #21049

Closed
wants to merge 1 commit

Conversation

henryr
Contributor

@henryr henryr commented Apr 12, 2018

What changes were proposed in this pull request?

Subqueries (at least in SQL) have 'bag of tuples' semantics, so ordering
them is redundant (unless combined with a limit). This patch adds a new
optimizer rule that removes sort operators that are directly below
subqueries (or below some combination of projection and filtering under
a subquery).

How was this patch tested?

New unit tests. All SQL unit tests pass.


def apply(plan: LogicalPlan): LogicalPlan = plan transform {
  case Subquery(child) => Subquery(removeTopLevelSorts(child))
  case SubqueryAlias(name, child) => SubqueryAlias(name, removeTopLevelSorts(child))
}
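For readers outside the diff context, the intent of removeTopLevelSorts can be sketched with self-contained toy plan nodes (hypothetical stand-ins for illustration only, not Spark's actual LogicalPlan classes):

```scala
// Toy plan nodes standing in for Spark's LogicalPlan (hypothetical sketch).
sealed trait Plan
case class Relation(name: String) extends Plan
case class Sort(child: Plan) extends Plan
case class Project(child: Plan) extends Plan
case class Filter(child: Plan) extends Plan
case class Subquery(child: Plan) extends Plan

// Strip a Sort sitting directly under a subquery boundary, looking through
// any combination of projection and filtering, as the PR description says.
def removeTopLevelSorts(p: Plan): Plan = p match {
  case Sort(child)    => removeTopLevelSorts(child)
  case Project(child) => Project(removeTopLevelSorts(child))
  case Filter(child)  => Filter(removeTopLevelSorts(child))
  case other          => other
}

def removeRedundantSorts(plan: Plan): Plan = plan match {
  case Subquery(child) => Subquery(removeTopLevelSorts(child))
  case other           => other
}
```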
Member

SubqueryAlias is not the subquery you want. This is just an alias of a query/table/view. For example,

Seq((1, 2, "1"), (3, 4, "3")).toDF("int", "int2", "str_sort").orderBy('int.asc).as('df1)

Contributor Author

Thanks! I've been trying to understand the role of Subquery and SubqueryAlias. My confusion is that subqueries do seem to get planned as SubqueryAlias operators, e.g.:

scala> spark.sql("SELECT count(*) from (SELECT id FROM dft ORDER BY id)").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias('count(1), None)]
+- 'SubqueryAlias __auto_generated_subquery_name
   +- 'Sort ['id ASC NULLS FIRST], true
      +- 'Project ['id]
         +- 'UnresolvedRelation `dft`

In the example you give I (personally) think it's still reasonable to drop the ordering, but understand that might surprise some users. It wouldn't be hard to skip the root if it's a subquery - but what do you propose for detecting subqueries if my method isn't right?

Member

Seq((1, 2, "1"), (3, 4, "3")).toDF("int", "int2", "str_sort").orderBy('int.asc).as('df1)

Before entering the optimizer, we get rid of SubqueryAlias with the rule EliminateSubqueryAliases. Basically, it is a no-op after query analysis. The name is a little bit confusing, I have to admit.

Contributor Author

Yep, that's why I added the new rule just before EliminateSubqueryAliases (which runs in the optimizer, as part of the 'finish analysis' batch). After EliminateSubqueryAliases there doesn't seem to be any way to detect subqueries.

Another approach I suppose would be to handle this like SparkPlan's requiredChildOrdering - if a parent doesn't require any ordering of the child, (and the child is a Sort node), the child Sort should be dropped. That seems like a more fundamental change though.
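That alternative could be sketched roughly as a parent-driven pass (toy nodes again, hypothetical and loosely modeled on the requiredChildOrdering idea, not Spark's actual API):

```scala
// Hypothetical sketch: drop a child Sort whenever the parent declares it
// does not require ordered input.
sealed trait Node
case class Scan(name: String) extends Node
case class SortNode(child: Node) extends Node
case class HashAggregate(child: Node) extends Node // order-insensitive parent

// Whether this parent requires its child's output to be ordered.
def requiresOrderedChild(parent: Node): Boolean = parent match {
  case _: HashAggregate => false // hash aggregation ignores input order
  case _                => true
}

def pruneSorts(node: Node): Node = node match {
  case agg @ HashAggregate(SortNode(grandchild)) if !requiresOrderedChild(agg) =>
    HashAggregate(pruneSorts(grandchild)) // parent needs no order: drop the Sort
  case HashAggregate(child) => HashAggregate(pruneSorts(child))
  case SortNode(child)      => SortNode(pruneSorts(child))
  case other                => other
}
```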

@SparkQA

SparkQA commented Apr 12, 2018

Test build #89234 has finished for PR 21049 at commit bb992c2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor

I am not sure we can actually do this. If we have for instance:

select a + b from (select a, b from tb1 order by a) t1

IIUC the plan would be transformed so that the sort is removed... but this actually changes the output: keeping the sort makes the query's output ordered by a, while removing it leaves the order undefined.
What do you think?

@henryr
Contributor Author

henryr commented Apr 16, 2018

In SQL, the sort in a subquery doesn't make sense because of the relational model - the output of a subquery is an unordered bag of tuples. Some engines still allow the sort, some silently drop it and some throw an error.

For example:

Oracle and Postgres allow the ORDER BY.

One issue might be that the underlying dataframe model might not be 100% relational - maybe dataframes are sorted lists of rows and then this optimization would only be valid if using the SQL interface. If so, it's probably not worth the effort to maintain. But if dataframes and SQL relations are supposed to be equivalent, we can drop the ORDER BY.

We also may want to decide not to do this because it would surprise users who had been relying on the existing behavior.

@gatorsmile
Member

@mgaido91 This is a common rule. Last year, @dilipbiswal @ioana-delaney @nsyca and I had a discussion about this. We forgot to continue the efforts after finishing the first stage of the correlated subquery. https://issues.apache.org/jira/browse/SPARK-18455.

@henryr Thank you for contributing to this!

@dilipbiswal Do you have the bandwidth to review this PR?

@dilipbiswal
Contributor

dilipbiswal commented Apr 16, 2018

@henryr @gatorsmile I agree with Sean. To the best of my knowledge, Spark does not treat "select ... from (query)" as a subquery. It treats it as an aliased query. Please see the grammar under the "relationPrimary" rule. The subqueries supported in Spark mostly originate from the projection (non-correlated scalar) or the predicate of the main query. So basically, we see these as expressions under the Project or Filter operators of the main query block. We can look at SubquerySuite for usage examples.

@henryr
Contributor Author

henryr commented Apr 23, 2018

@dilipbiswal Thanks! Although Spark doesn't necessarily parse the query in the from clause as a subquery, is it fair to say it plans it as one? (Since the planner puts the alias under a SubqueryAlias node). The optimization of removing the sorts seems valid to me in any case, since neither an alias nor a subquery should be sorting except in the presence of a limit. Or do you think that this optimization should only be applied to subqueries that aren't in the from clause?

@dilipbiswal
Contributor

@henryr Since SubqueryAlias is used as a correlation name, mostly for resolving attributes, in my understanding it's not safe to apply this optimization. I will borrow @gatorsmile's example here. Please note that the alias is specified after the sort. Below is the plan after this optimization removes sorts under SubqueryAlias's child.

scala> Seq((1, 2, "1"), (3, 4, "3")).toDF("int", "int2", "str_sort").orderBy('int.asc).as('df1)
res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [int: int, int2: int ... 1 more field]

scala> res0.explain(true)
== Parsed Logical Plan ==
SubqueryAlias df1
+- AnalysisBarrier
      +- Sort [int#7 ASC NULLS FIRST], true
         +- Project [_1#3 AS int#7, _2#4 AS int2#8, _3#5 AS str_sort#9]
            +- LocalRelation [_1#3, _2#4, _3#5]

== Analyzed Logical Plan ==
int: int, int2: int, str_sort: string
SubqueryAlias df1
+- Sort [int#7 ASC NULLS FIRST], true
   +- Project [_1#3 AS int#7, _2#4 AS int2#8, _3#5 AS str_sort#9]
      +- LocalRelation [_1#3, _2#4, _3#5]

== Optimized Logical Plan ==
LocalRelation [int#7, int2#8, str_sort#9]

== Physical Plan ==
LocalTableScan [int#7, int2#8, str_sort#9]

In this case we should not be removing the top-level sort from the user's query, right?

cc @gatorsmile for his opinion.

@henryr
Contributor Author

henryr commented Apr 27, 2018

@dilipbiswal thanks for the clarification.

I agree that this particular case - where the alias is the root of a logical plan - might need special handling. Is there any reason to actually use an alias at the root of a plan like this (outside of composing with other plans, where this optimization would apply)? My suggestion would be, since there are no references to the name the alias introduces, to consider just dropping the alias node during optimization (and then the sort would not get dropped).

It does seem to be an edge case though - no matter how we handle unreferred-to aliases, the optimization seems to be appropriate for the general case where aliases do correspond to subqueries. What do you think?

@dilipbiswal
Contributor

@henryr
Is there any reason to actually use an alias at the root of a plan like this (outside of composing with other plans, where this optimization would apply)?

I can't think of a reason :-). Just that the API allows users to do that.

How about this query ?

 scala> spark.sql("with abcd as (select * from t1 order by t1.c1) select * from abcd").explain(true)
18/04/29 23:28:45 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
== Parsed Logical Plan ==
CTE [abcd]
:  +- 'SubqueryAlias abcd
:     +- 'Sort ['t1.c1 ASC NULLS FIRST], true
:        +- 'Project [*]
:           +- 'UnresolvedRelation `t1`
+- 'Project [*]
   +- 'UnresolvedRelation `abcd`

== Analyzed Logical Plan ==
c1: int, c2: int, c3: int
Project [c1#7, c2#8, c3#9]
+- SubqueryAlias abcd
   +- Sort [c1#7 ASC NULLS FIRST], true
      +- Project [c1#7, c2#8, c3#9]
         +- SubqueryAlias t1
            +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8, c3#9]

== Optimized Logical Plan ==
HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8, c3#9]

== Physical Plan ==
HiveTableScan [c1#7, c2#8, c3#9], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8, c3#9]

IMHO, its probably better to correctly detect the real subqueries and apply this optimization in order to be fully sure about it.

cc @gatorsmile

@henryr
Contributor Author

henryr commented May 4, 2018

I might be a bit of a hardliner on this, but I think it's correct to eliminate the ORDER BY from common table expressions (e.g. MSSQL agrees with me, see this link).

However, given the principle of least surprise, I agree it might be a good idea to at least start with scalar and nested subqueries, and leave inline views for another day. That might be a bit harder to do (I think the rule will need a whitelist of operators it's ok to eliminate sorts below), and in general I think there'll be some missed opportunities, but it's a start :)

Alternatively we could extend the analyzed logical plan to explicitly mark the different subquery types (i.e. have a InlineView node, a NestedSubquery node and so on). That would make these optimizations easier to express, but I have some reservations about the semantics of introducing those nodes. What do you think @dilipbiswal / @gatorsmile ?
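The whitelist variant mentioned above could look roughly like this (again with hypothetical toy nodes, not Spark's classes): only operators known to be order-agnostic are traversed, so for example a Limit stops the removal, since a sort under a limit is meaningful.

```scala
// Toy plan nodes for illustration (hypothetical sketch, not Spark's API).
sealed trait Plan
case class Relation(name: String) extends Plan
case class Sort(child: Plan) extends Plan
case class Project(child: Plan) extends Plan
case class Filter(child: Plan) extends Plan
case class Limit(n: Int, child: Plan) extends Plan
case class ScalarSubquery(child: Plan) extends Plan

// Only recurse below whitelisted, order-agnostic operators; anything else
// (e.g. Limit, where the sort matters) stops the traversal.
def stripSorts(p: Plan): Plan = p match {
  case Sort(child)    => stripSorts(child)
  case Project(child) => Project(stripSorts(child))
  case Filter(child)  => Filter(stripSorts(child))
  case other          => other
}

// Apply only at subquery boundaries we are sure about (e.g. scalar subqueries).
def rule(plan: Plan): Plan = plan match {
  case ScalarSubquery(child) => ScalarSubquery(stripSorts(child))
  case other                 => other
}
```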

@dilipbiswal
Contributor

dilipbiswal commented May 4, 2018

@henryr
I might be a bit of a hardliner on this, but I think it's correct to eliminate the ORDER BY from common table expressions (e.g. MSSQL agrees with me, see this link).
DB>> Yeah.. I had seen this. I checked DB2's behaviour on CTEs and it does not seem to remove the sort or give any error.

db2 => with x as (select * from t1 order by 1) select * from x 

C1          C2         
----------- -----------
          0           0
          1           1
          1           1
          2           2
          2           2

  5 record(s) selected.

So perhaps the SQL standard does not explicitly clarify this?

@henryr
Contributor Author

henryr commented May 9, 2018

@dilipbiswal I tried hard to find something in the SQL standard that clarified the situation, but couldn't (of course that could be because the standard is pretty hard to parse... :)).

So let's go with the idea of not dropping ORDER BY in inline views, but we can safely drop them in scalar subqueries and nested subqueries. What do you think?

@gatorsmile
Member

@henryr Could you update the PR based on the review? We can safely drop them in scalar subqueries and nested subqueries

@gatorsmile
Member

@dilipbiswal Can you take this over?

@dilipbiswal
Contributor

@gatorsmile Sure.
