New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-28375][SQL] Prevent the PullupCorrelatedPredicates optimizer rule from removing predicates if run multiple times #25164
Conversation
Test build #107693 has finished for PR 25164 at commit
|
retest this please |
@@ -275,13 +275,16 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper | |||
plan transformExpressions { | |||
case ScalarSubquery(sub, children, exprId) if children.nonEmpty => | |||
val (newPlan, newCond) = pullOutCorrelatedPredicates(sub, outerPlans) | |||
ScalarSubquery(newPlan, newCond, exprId) | |||
val conds = newCond ++ children.filter(_.isInstanceOf[Predicate]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this. What if we use children if newCond
is empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm... That might not be correct. Since in the analysis phase, children
contains all outer references. Even though newCond
is empty, we can't leave out children
as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, if there are outer references there, newCond
should not be empty, right? I am a bit worried about using Predicate
here. You might have an outer reference in a Coalesce
or CaseWhen
, which are not Predicate
s for instance...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense... Basically we want to distinguish analyzed plans and optimized plans. But in the current implementation, both the analyzer and optimizer are stripping outer references...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, a side note... The PullupCorrelatedPredicates
rule is tightly coupled with RewriteSubqueryPredicates
. For ListQuery
, it seems that PullupCorrelatedPredicates
is compulsory for a correct physical plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree with your suggestion @dilipbiswal, basically it is what I suggested earlier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 Thanks a lot. Lets see if it works or if there is something we may be missing :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt it... Because the logic for checking OuterReferences
and the logic for actually pulling up predicates are slightly different. With that being said, even though l.children
is non-empty, it does not necessarily mean that newCond
is non-empty.
The most natural way I can think of is that we combine these two rule PullupCorrelatedPredicates
and RewriteSubqueryPredicates
, and RewriteSubqueryPredicates
completely removes those hacky list subqueries. I don't think the plan can change if we apply these two rules in a single Once
batch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH subquery resolution and optimzation are super tricky and can be error-prone. The current code is a bit complex and fragile, because one piece of code might have some pre-conditions on some other parts of the codebase, which might change over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yeshengm Thanks .. let me make a pr to get it tested to see what is broken. I think it makes sense to keep this logic (even though its merged with rewriteSubquery) idempotent. That is because, pullupCorrelatedPredicates works for scalar subquery as well and that is handled by a different rule.
Test build #107697 has finished for PR 25164 at commit
|
Test build #107700 has finished for PR 25164 at commit
|
0efae65
to
c898430
Compare
Test build #107730 has finished for PR 25164 at commit
|
retest this please |
Test build #107750 has finished for PR 25164 at commit
|
c898430
to
6b79f88
Compare
Test build #107754 has finished for PR 25164 at commit
|
6b79f88
to
f8f4ec8
Compare
Test build #107755 has finished for PR 25164 at commit
|
Test build #107767 has finished for PR 25164 at commit
|
Yep. It does work!
On Wed, Jul 17, 2019 at 11:57 AM Dilip Biswal ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala
<#25164 (comment)>:
> @@ -275,13 +275,16 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
plan transformExpressions {
case ScalarSubquery(sub, children, exprId) if children.nonEmpty =>
val (newPlan, newCond) = pullOutCorrelatedPredicates(sub, outerPlans)
- ScalarSubquery(newPlan, newCond, exprId)
+ val conds = newCond ++ children.filter(_.isInstanceOf[Predicate])
@yeshengm <https://github.com/yeshengm> Ok.. sounds good.
One question, can we not make this PullupCorrelatedPredicates idempotent
now the way it is (i.e when these two rules are separate) ? If we did
something like this :
case l @ ListQuery(sub, _, exprId, childOutputs) =>
val (newPlan, newCond) = pullOutCorrelatedPredicates(sub, outerPlans)
if (newCond.isEmpty) {
// Perhaps just returning `l` may work as well. But in case we r relying on
// the de-dup processing somehow..
ListQuery(newPlan, l.children, exprId, childOutputs)
} else {
ListQuery(newPlan, newCond, exprId, childOutputs)
}
will it work ? Or you tried it already ?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#25164?email_source=notifications&email_token=AC5TTEM4YUV5AB4Y2GUQ4DDP75TSFA5CNFSM4IDZAVE2YY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOB6YOUFY#discussion_r304591303>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AC5TTEPCLRIBBWFABPQSURLP75TSFANCNFSM4IDZAVEQ>
.
--
Yesheng
|
What changes were proposed in this pull request?
The original implementation of
PullupCorrelatedPredicates
can remove predicates in subqueries if the rule is run multiple times. This fix resolves this issue by appending new predicates to existing predicates from the last run.How was this patch tested?
A new UT.