[SPARK-23957][SQL] Remove redundant sort operators from subqueries #21049
Conversation
## What changes were proposed in this pull request?

Subqueries (at least in SQL) have 'bag of tuples' semantics. Ordering them is therefore redundant (unless combined with a limit). This patch adds a new optimizer rule that removes sort operators that are directly below subqueries (or some combination of projection and filtering below a subquery).

## How was this patch tested?

New unit tests. All SQL unit tests pass.
```scala
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
  case Subquery(child) => Subquery(removeTopLevelSorts(child))
  case SubqueryAlias(name, child) => SubqueryAlias(name, removeTopLevelSorts(child))
}
```
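For readers following along, the recursion the rule delegates to can be sketched with a toy plan ADT. The node names below are hypothetical stand-ins, not Spark's actual Catalyst classes: the idea is simply that `removeTopLevelSorts` drops any `Sort` it finds while looking through operators, such as projections and filters, whose output does not depend on their child's ordering.

```scala
// Hypothetical toy model of a logical plan, for illustration only.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Sort(child: Plan) extends Plan
case class Project(child: Plan) extends Plan
case class Filter(child: Plan) extends Plan

// Drop Sorts found while recursing through order-agnostic operators;
// stop at anything else (a leaf, a limit, an aggregate, ...).
def removeTopLevelSorts(plan: Plan): Plan = plan match {
  case Sort(c)    => removeTopLevelSorts(c)
  case Project(c) => Project(removeTopLevelSorts(c))
  case Filter(c)  => Filter(removeTopLevelSorts(c))
  case other      => other
}
```

In the real rule the recursion runs over Catalyst's `LogicalPlan`, but the shape of the traversal is the same.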
`SubqueryAlias` is not the subquery you want. This is just an alias of a query/table/view. For example:

```scala
Seq((1, 2, "1"), (3, 4, "3")).toDF("int", "int2", "str_sort").orderBy('int.asc).as('df1)
```
Thanks! I've been trying to understand the role of `Subquery` and `SubqueryAlias`. My confusion is that subqueries do seem to get planned as `SubqueryAlias` operators, e.g.:

```
scala> spark.sql("SELECT count(*) from (SELECT id FROM dft ORDER BY id)").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias('count(1), None)]
+- 'SubqueryAlias __auto_generated_subquery_name
   +- 'Sort ['id ASC NULLS FIRST], true
      +- 'Project ['id]
         +- 'UnresolvedRelation `dft`
```

In the example you give I (personally) think it's still reasonable to drop the ordering, but I understand that might surprise some users. It wouldn't be hard to skip the root if it's a subquery - but what do you propose for detecting subqueries if my method isn't right?
```scala
Seq((1, 2, "1"), (3, 4, "3")).toDF("int", "int2", "str_sort").orderBy('int.asc).as('df1)
```

Before entering the optimizer, we get rid of `SubqueryAlias` by the rule `EliminateSubqueryAliases`. Basically, it is a no-op after query analysis. The name is a little bit confusing, I have to admit.
Yep, that's why I added the new rule just before `EliminateSubqueryAliases` (which runs in the optimizer, as part of the 'finish analysis' batch). After `EliminateSubqueryAliases` there doesn't seem to be any way to detect subqueries.

Another approach I suppose would be to handle this like `SparkPlan`'s `requiredChildOrdering`: if a parent doesn't require any ordering of the child (and the child is a `Sort` node), the child `Sort` should be dropped. That seems like a more fundamental change though.
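The `requiredChildOrdering`-style alternative could look roughly like the toy sketch below. All names here are hypothetical, and Spark's real `SparkPlan` API is considerably more involved; this only illustrates the shape of the idea: a parent declares whether it needs its input ordered, and a pass removes a `Sort` whose parent does not.

```scala
// Hypothetical toy model; not Spark's SparkPlan API.
sealed trait Node
case class Scan(name: String) extends Node
case class SortNode(child: Node) extends Node
// A parent operator that records whether it needs its input ordered
// (a merge join would; a hash aggregate would not).
case class Consumer(child: Node, requiresOrdering: Boolean) extends Node

def pruneUnneededSorts(node: Node): Node = node match {
  // Parent does not care about order: drop the Sort directly below it.
  case Consumer(SortNode(c), false) => Consumer(pruneUnneededSorts(c), false)
  case Consumer(c, req)             => Consumer(pruneUnneededSorts(c), req)
  case SortNode(c)                  => SortNode(pruneUnneededSorts(c))
  case leaf                         => leaf
}
```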
Test build #89234 has finished for PR 21049 at commit
I am not sure we can actually do this. If we have for instance:

IIUC the plan would be transformed so that the sort is removed... but actually this will produce a change in the output, since keeping the sort would make the output of the query ordered by
In SQL, the sort in a subquery doesn't make sense because of the relational model - the output of a subquery is an unordered bag of tuples. Some engines still allow the sort, some silently drop it and some throw an error. For example:
Oracle and Postgres allow the sort.

One issue might be that the underlying dataframe model might not be 100% relational - maybe dataframes are sorted lists of rows, and then this optimization would only be valid when using the SQL interface. If so, it's probably not worth the effort to maintain. But if dataframes and SQL relations are supposed to be equivalent, we can drop the sort.

We also may want to decide not to do this because it would surprise users who had been relying on the existing behavior.
@mgaido91 This is a common rule. Last year, @dilipbiswal, @ioana-delaney, @nsyca and I had a discussion about this. We forgot to continue the efforts after finishing the first stage of correlated subquery support: https://issues.apache.org/jira/browse/SPARK-18455. @henryr Thank you for contributing this! @dilipbiswal Do you have the bandwidth to review this PR?
@henryr @gatorsmile I agree with Sean. To the best of my knowledge, Spark does not treat "select ... from (query)" as a subquery. It treats it as an aliased query. Please see the grammar under the `relationPrimary` rule. The subqueries supported in Spark mostly originate from the projection (non-correlated scalar) or predicate of the main query. So basically, we see these as expressions under the Project or Filter operators of the main query block. We can look at SubquerySuite to find usage examples.
@dilipbiswal Thanks! Although Spark doesn't necessarily parse the query in the
@henryr Since SubqueryAlias is used as a correlation name and is used mostly for resolving attributes, in my understanding it's not safe to apply this optimization. I will borrow @gatorsmile's example here. Please note that the alias is specified after the sort. Below is the plan after this optimization removes sorts under SubqueryAlias->child.

```
scala> Seq((1, 2, "1"), (3, 4, "3")).toDF("int", "int2", "str_sort").orderBy('int.asc).as('df1)
res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [int: int, int2: int ... 1 more field]

scala> res0.explain(true)
== Parsed Logical Plan ==
SubqueryAlias df1
+- AnalysisBarrier
   +- Sort [int#7 ASC NULLS FIRST], true
      +- Project [_1#3 AS int#7, _2#4 AS int2#8, _3#5 AS str_sort#9]
         +- LocalRelation [_1#3, _2#4, _3#5]

== Analyzed Logical Plan ==
int: int, int2: int, str_sort: string
SubqueryAlias df1
+- Sort [int#7 ASC NULLS FIRST], true
   +- Project [_1#3 AS int#7, _2#4 AS int2#8, _3#5 AS str_sort#9]
      +- LocalRelation [_1#3, _2#4, _3#5]

== Optimized Logical Plan ==
LocalRelation [int#7, int2#8, str_sort#9]

== Physical Plan ==
LocalTableScan [int#7, int2#8, str_sort#9]
```

In this case we should not be removing the top-level sort from the user's query, right? cc @gatorsmile for his opinion.
@dilipbiswal thanks for the clarification. I agree that this particular case - where the alias is the root of a logical plan - might need special handling. Is there any reason to actually use an alias at the root of a plan like this (outside of composing with other plans, where this optimization would apply)? My suggestion would be, since there are no references to the name the alias introduces, to consider just dropping the alias node during optimization (and then the sort would not get dropped). It does seem to be an edge case, though - no matter how we handle unreferred-to aliases, the optimization seems to be appropriate for the general case where aliases do correspond to subqueries. What do you think?
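The suggestion about unreferenced root aliases could be sketched like this (a toy model with hypothetical names, not Spark's actual classes): if the alias sits at the root of a plan, nothing can reference it, so it could be unwrapped first; the `Sort` beneath it then becomes the query's own top-level sort and would be kept by a rule that only fires below aliases.

```scala
// Hypothetical toy model, for illustration only.
sealed trait LP
case class Rel(name: String) extends LP
case class SortLP(child: LP) extends LP
case class Alias(name: String, child: LP) extends LP

// Unwrap an alias at the root of a plan: nothing refers to its name,
// so dropping the node is safe, and the Sort below it is preserved.
def dropRootAlias(plan: LP): LP = plan match {
  case Alias(_, child) => child
  case other           => other
}
```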
@henryr I can't think of a reason :-). Just that the API allows users to do that. How about this query?

```
scala> spark.sql("with abcd as (select * from t1 order by t1.c1) select * from abcd").explain(true)
18/04/29 23:28:45 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
== Parsed Logical Plan ==
CTE [abcd]
:  +- 'SubqueryAlias abcd
:     +- 'Sort ['t1.c1 ASC NULLS FIRST], true
:        +- 'Project [*]
:           +- 'UnresolvedRelation `t1`
+- 'Project [*]
   +- 'UnresolvedRelation `abcd`

== Analyzed Logical Plan ==
c1: int, c2: int, c3: int
Project [c1#7, c2#8, c3#9]
+- SubqueryAlias abcd
   +- Sort [c1#7 ASC NULLS FIRST], true
      +- Project [c1#7, c2#8, c3#9]
         +- SubqueryAlias t1
            +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8, c3#9]

== Optimized Logical Plan ==
HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8, c3#9]

== Physical Plan ==
HiveTableScan [c1#7, c2#8, c3#9], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#7, c2#8, c3#9]
```

IMHO, it's probably better to correctly detect the real subqueries and apply this optimization only then, in order to be fully sure about it. cc @gatorsmile
I might be a bit of a hardliner on this, but I think it's correct to eliminate the `ORDER BY` from common table expressions (e.g. MSSQL agrees with me, see this link). However, given the principle of least surprise, I agree it might be a good idea to at least start with scalar and nested subqueries, and leave inline views for another day. That might be a bit harder to do (I think the rule will need a whitelist of operators it's ok to eliminate sorts below), and in general I think there'll be some missed opportunities, but it's a start :) Alternatively, we could extend the analyzed logical plan to explicitly mark the different subquery types (i.e. have a
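The whitelist variant mentioned above might look like this toy sketch (hypothetical names, not Spark's Catalyst classes): recursion proceeds only through operators known to ignore their child's ordering, so anything else - a `Limit`, for example - stops the traversal and the `Sort` beneath it survives, preserving ORDER BY + LIMIT semantics.

```scala
// Hypothetical toy model, for illustration only.
sealed trait Op
case class Table(name: String) extends Op
case class SortOp(child: Op) extends Op
case class Proj(child: Op) extends Op
case class Limit(n: Int, child: Op) extends Op

// Recurse only through whitelisted, order-agnostic operators; a Limit
// is not on the whitelist, so a Sort+Limit pair is left intact.
def dropRedundantSorts(op: Op): Op = op match {
  case SortOp(c) => dropRedundantSorts(c)
  case Proj(c)   => Proj(dropRedundantSorts(c))
  case other     => other // Limit, Table, ...: stop here
}
```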
@henryr So perhaps the SQL standard does not explicitly clarify this?
@dilipbiswal I tried hard to find something in the SQL standard that clarified the situation, but couldn't (of course, that could be because the standard is pretty hard to parse... :)). So let's go with the idea of not dropping
@henryr Could you update the PR based on the review? We can safely drop them in scalar subqueries and nested subqueries.
@dilipbiswal Can you take this over?
@gatorsmile Sure.