[SPARK-25482][SQL] Avoid pushdown of subqueries to data source filters #22518
Conversation
Test build #96428 has finished for PR 22518 at commit
Test build #96427 has finished for PR 22518 at commit
```
@@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
     assert(getNumSortsInQuery(query5) == 1)
   }
 }

+  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
+    withTempView("t1", "t2", "t3") {
```
There is no need for "t3".
```
@@ -166,7 +168,7 @@ case class ReuseSubquery(conf: SQLConf) extends Rule[SparkPlan] {
     val sameSchema = subqueries.getOrElseUpdate(sub.plan.schema, ArrayBuffer[SubqueryExec]())
     val sameResult = sameSchema.find(_.sameResult(sub.plan))
     if (sameResult.isDefined) {
-      sub.withNewPlan(sameResult.get)
+      sub.withNewPlan(sameResult.get).withNewExprId()
```
Can we avoid double copy()? Or is it cleaner this way?
I think it is cleaner this way. I don't expect this to happen very often (how many subqueries can you have in a plan?), so I don't think it is an issue. But if there are cleaner options/solutions, I am open to suggestions, thanks.
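The reuse bookkeeping under discussion (`getOrElseUpdate` keyed by schema, then `find(_.sameResult(...))` within the bucket) can be sketched very loosely outside Spark. This is a toy Python model with illustrative names, not Spark's actual code:

```python
# Toy model (not Spark code) of the reuse lookup in ReuseSubquery:
# subqueries are grouped by schema, and within a group the first plan with
# the same result is shared, so each distinct subquery is planned only once.
from collections import defaultdict

subqueries = defaultdict(list)   # schema -> already-seen subquery plans

def reuse(schema, plan):
    bucket = subqueries[schema]
    for existing in bucket:
        if existing == plan:     # stands in for SubqueryExec.sameResult
            return existing      # share the existing plan instead of `plan`
    bucket.append(plan)
    return plan

first  = reuse("struct<max(a):int>", ["max(a)", "t1"])
second = reuse("struct<max(a):int>", ["max(a)", "t1"])  # equal, but a new object
assert second is first           # the duplicate is replaced by the original
```

The schema bucket keeps the linear `sameResult` scan short, which is why the double `copy()` discussed above is unlikely to matter in practice.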
Test build #96472 has finished for PR 22518 at commit
hmm, how can this happen? I don't think a data source can handle a filter with a subquery...
you can check the UT which reproduces the issue. The scalar subquery is pushed down as part of the filter.
@mgaido91 Sorry, but can you add a more detailed explanation to the PR description?
@gengliangwang no, let me cite and explain the PR description. I am not sure how to improve it, but if you have suggestions I am happy to. The main point of the PR is to address an issue which arises when:
Now the point is, can this condition happen? The answer is yes, and one situation in which this happens (as reported in the JIRA) is
So in the plan we have two
So this results in the subquery being executed twice (as the two
@mgaido91 I see, thanks for the explanation!
Test build #98681 has finished for PR 22518 at commit
```
withTempView("t1", "t2") {
  sql("create temporary view t1(a int) using parquet")
  sql("create temporary view t2(b int) using parquet")
  val plan = sql("select * from t2 where b > (select max(a) from t1)")
```
sorry it has been a long time and I don't quite remember the context.
What was the problem we were trying to fix? This test looks unrelated to subquery reuse.
Sure, please can you check the PR description? I think the context is quite well explained there.
Anyway, as a quick summary: in this case `b > (select max(a) from t1)` is pushed down as a datasource filter. So we have 2 instances of `b > (select max(a) from t1)` and the result is not reused. It is not reused because the copied plan satisfies `==`, so even if `ReuseSubquery` replaces it, the change is ignored.
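This failure mode (the replacement being silently discarded because the copy compares equal to the original) can be mimicked without Spark. A toy Python model, with illustrative names rather than Spark's actual classes:

```python
# Toy model (not Spark code) of why the ReuseSubquery replacement was dropped:
# Catalyst's transform keeps the ORIGINAL node when the rewritten node compares
# equal to it, so swapping in an equal-but-shared subquery is silently ignored.
from dataclasses import dataclass, field

@dataclass
class Subquery:
    sql: str                                         # participates in ==
    plan_id: int = field(compare=False, default=0)   # plan identity, ignored by ==

def transform(node, rule):
    new = rule(node)
    # mimics TreeNode.transform: keep the old node when the result is ==
    return node if new == node else new

original = Subquery("select max(a) from t1", plan_id=1)
shared   = Subquery("select max(a) from t1", plan_id=2)  # the plan we want reused

result = transform(original, lambda n: shared)
assert result is original      # replacement ignored: plan_id is still 1,
assert result.plan_id == 1     # so the subquery would be executed twice
```

The only difference between the two nodes is the plan they reference, which does not participate in equality, so the rewrite is a no-op.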
Do we only have a problem when we have a subquery in a data source filter? If that's the case, I would suggest not pushing down subquery filters into the data source.
we have this problem if we `copy` the same subquery. I can't think of any other case than filter push-down, but I may be wrong.
Forbidding the push down of these filters may cause a perf regression; I am not sure it is the right solution.
is there any data source that can support subquery filters? For data source v1/v2, the public `Filter` API does not support subqueries. File sources don't support subquery filters either.
it also means the data source scan must wait until the subquery is finished
The subquery should be executed anyway sooner or later, right? So I don't see the problem here: am I missing something?
Ok, thanks, I'll follow your suggestion and forbid it here and create a new ticket about pushing it down to data sources. Thanks.
The subquery should be executed anyway sooner or later, right?
Yes, but we could execute the scan and the subquery at the same time (2 Spark jobs running together), instead of executing them serially.
we could execute scan and subquery at the same time
is this really possible? My understanding is that subqueries are executed before the plan they belong to (in `SparkPlan.executeQuery`). So when a subquery is running, the rest of the query is not.
ah sorry, I misread the code. Unless the subquery is rewritten into a join, we must wait for all subqueries to finish before executing the plan.
We can rewrite scalar subqueries in data source filters into literals, to make them work with the filter pushdown API.
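The literal-rewrite idea can be sketched roughly as follows. This is a hypothetical illustration with made-up class names, not Spark's actual expression API:

```python
# Hypothetical sketch of the literal-rewrite idea: once a scalar subquery has
# been executed, its occurrence inside a pushed-down filter is replaced by the
# computed value, so the data source only ever sees a plain comparison.
# All class names here are illustrative, not Spark's.
from dataclasses import dataclass

@dataclass(frozen=True)
class ScalarSubquery:
    result: int               # assume the subquery was already executed

@dataclass(frozen=True)
class Literal:
    value: int

@dataclass(frozen=True)
class GreaterThan:
    attr: str
    right: object

def rewrite_for_pushdown(pred):
    # Replace an executed scalar subquery with its literal value.
    if isinstance(pred, GreaterThan) and isinstance(pred.right, ScalarSubquery):
        return GreaterThan(pred.attr, Literal(pred.right.result))
    return pred

pushed = rewrite_for_pushdown(GreaterThan("b", ScalarSubquery(result=42)))
assert pushed == GreaterThan("b", Literal(42))
```

This only works because, as noted above, the subquery is guaranteed to have finished before the enclosing plan executes.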
yes, exactly. That's what I meant. Shall I revert the changes to the previous PR? And in the scope of the new JIRA do the rewriting to literals? Thanks.
Test build #98734 has finished for PR 22518 at commit
```
@@ -155,15 +155,14 @@ object FileSourceStrategy extends Strategy with Logging {
       case a: AttributeReference =>
         a.withName(l.output.find(_.semanticEquals(a)).get.name)
     }
-  }
+  }.filterNot(SubqueryExpression.hasSubquery)
```
shall we do the filter before the `map`?
```
@@ -47,7 +47,8 @@ private[sql] object PruneFileSourcePartitions extends Rule[LogicalPlan] {
       case a: AttributeReference =>
         a.withName(logicalRelation.output.find(_.semanticEquals(a)).get.name)
     }
-  }
+  }.filterNot(SubqueryExpression.hasSubquery)
```
ditto
```
@@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
     assert(getNumSortsInQuery(query5) == 1)
   }
 }

+  test("SPARK-25482: Reuse same Subquery in order to execute it only once") {
```
let's update the test
I'd like to merge this simple PR first, to address the performance problem (unnecessary subquery execution). Let's create a new ticket for subquery filter pushdown to data sources, and get more people to join the discussion.
BTW can you include a simple benchmark to show this problem? e.g. just run a query in spark-shell, and post the result before and after this PR.
@cloud-fan this is the benchmark:
the result is:
The difference is very small because all the subqueries run in parallel. The execution time would be much more affected if there were several subqueries (our thread pool has 16 threads, so a query like that but with 9 filters with subqueries would see a big performance gain after this PR).
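The pool-saturation effect described here can be sketched with a plain thread pool. This is illustrative only, not the author's actual benchmark, and the sleep duration is an arbitrary stand-in for subquery cost:

```python
# Illustrative only: with a fixed-size pool, duplicated subqueries overlap
# until they exceed the pool size, after which extra duplicates start costing
# wall-clock time in addition to wasted cluster work.
import time
from concurrent.futures import ThreadPoolExecutor

def fake_subquery(_):
    time.sleep(0.05)          # stand-in for executing one subquery

def wall_time(n_subqueries, pool_size=16):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        list(pool.map(fake_subquery, range(n_subqueries)))
    return time.monotonic() - start

# 16 subqueries fit in one wave; 32 need two waves, roughly doubling wall time.
assert wall_time(32) > wall_time(16)
```

This matches the observation above: duplicated subqueries barely change wall-clock time while the pool has spare threads, but the waste grows once it saturates.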
```
@@ -22,7 +22,7 @@ import scala.collection.mutable.ArrayBuffer

 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.catalyst.{expressions, InternalRow}
-import org.apache.spark.sql.catalyst.expressions.{Expression, ExprId, InSet, Literal, PlanExpression}
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExprId, InSet, Literal, NamedExpression, PlanExpression}
```
unnecessary change
thanks, sorry, I missed it.
```
@@ -1268,4 +1269,16 @@ class SubquerySuite extends QueryTest with SharedSQLContext {
     assert(getNumSortsInQuery(query5) == 1)
   }
 }

+  test("SPARK-25482: Forbid pushdown to dattasources of filters containing subqueries") {
```
typo: `dattasources`
Test build #98775 has finished for PR 22518 at commit
Test build #98777 has finished for PR 22518 at commit
Test build #98778 has finished for PR 22518 at commit
Test build #98779 has finished for PR 22518 at commit
thanks, merging to master!
## What changes were proposed in this pull request?

An expression with a subquery can be pushed down as a data source filter. Even though the pushed filter is not actively used, it still causes a re-execution of the subquery, because the `ReuseSubquery` optimization rule is ineffective in this case. The PR avoids this problem by forbidding the push down of filters containing a subquery.

## How was this patch tested?

added UT

Closes apache#22518 from mgaido91/SPARK-25482.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…with scalar subquery

### What changes were proposed in this pull request?

Scalar subqueries can be pushed down as data filters at runtime, since we always execute subqueries first. Ideally, we can rewrite `ScalarSubquery` to `Literal` before pushing down the filter. The main reason we did not support that before is that `ReuseSubquery` was ineffective, see #22518. It is not an issue now. For example:

```sql
SELECT * FROM t1 WHERE c1 > (SELECT min(c2) FROM t2)
```

### Why are the changes needed?

Improve performance when data filters have a scalar subquery.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

add test

Closes #41088 from ulysses-you/SPARK-43402.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Xiduo You <ulyssesyou@apache.org>