-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12505] [SQL] Pushdown a Limit on top of an Outer-Join #10454
Changes from 3 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -47,6 +47,7 @@ object DefaultOptimizer extends Optimizer { | |
PushPredicateThroughProject, | ||
PushPredicateThroughGenerate, | ||
PushPredicateThroughAggregate, | ||
PushLimitThroughOuterJoin, | ||
ColumnPruning, | ||
// Operator combine | ||
ProjectCollapsing, | ||
|
@@ -857,6 +858,30 @@ object PushPredicateThroughJoin extends Rule[LogicalPlan] with PredicateHelper { | |
} | ||
} | ||
|
||
/** | ||
* Push [[Limit]] operators through [[Join]] operators, iff the join type is outer joins. | ||
* Adding extra [[Limit]] operators on top of the outer-side child/children. | ||
*/ | ||
object PushLimitThroughOuterJoin extends Rule[LogicalPlan] with PredicateHelper { | ||
|
||
def apply(plan: LogicalPlan): LogicalPlan = plan transform { | ||
case f @ Limit(expr, Join(left, right, joinType, joinCondition)) => | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We also can push it down through the |
||
joinType match { | ||
case RightOuter => | ||
Limit(expr, Join(left, CombineLimits(Limit(expr, right)), joinType, joinCondition)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. need a stop condition to stop pushing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for your review! Since we call There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ah, it's right, nvm There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That is right. However I think checking if it is already pushed would reduce unnecessary multiple applying this rule and |
||
case LeftOuter => | ||
Limit(expr, Join(CombineLimits(Limit(expr, left)), right, joinType, joinCondition)) | ||
case FullOuter => | ||
Limit(expr, | ||
Join( | ||
CombineLimits(Limit(expr, left)), | ||
CombineLimits(Limit(expr, right)), | ||
joinType, joinCondition)) | ||
case _ => f // DO Nothing for the other join types | ||
} | ||
} | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it possible that we add an extra limit in every iteration ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You are right, but, after this rule, there exists another rule |
||
|
||
/** | ||
* Removes [[Cast Casts]] that are unnecessary because the input is already the correct type. | ||
*/ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel we will generate wrong results. It is not safe to push down the limit if we are not joining on the foreign key with the primary key, right? For example, for a left outer join, we push down the limit to the right table. It is possible all rows returned by the right side are having the same join column value.
am I missing anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For left outer joins, we only push down the limit to the left table. Thus, I think it should be safe?
Basically, the rule is to add additional Limit node(s) on top of the outer-side child/children
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, sorry. I was looking at the wrong line. For full outer join, is it safe? Also, with the limit, the result of left/right outer may not be deterministic, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm. Actually, for left/right outer join, what will happen if we have
A left outer join B on (A.key = B.key) sort by A.key limit 10
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a sort, the plan is different. We will have a
Sort
belowLimit
. Thus, it is not applicable to this rule.For the full-outer join, I think the result is still correct, but the result might not be the same as the original plan.
In the new plan, we add extra
limit
. If the nodes belowlimit
is deterministic, the result should be deterministic. If not deterministic, the result will be not deterministic. Right? Thus, this rule will not change the results' deterministic property. Sorry, this part is unclear to me. I am not sure my answer is correct.Based on my understanding, when we use
limit
, the users will not expect a deterministic results, unless they useorder by
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I found a hole.
full outer = union (left outer, right outer)
. We are unable to add extralimit
belowUnion
, which removes the duplicates.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, we have the a table A having 1, 2, 3, 4, 5 (say k is the column name). If we do
A x FULL OUTER JOIN A y ON (x.k = y.k) limit 2
and we push limit to both side, it is possible that we get1,2
from the left side and3, 4
from the right side, right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me update the code and fix the bug. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the
full outer
join, my idea is to add extralimit
to one side which has a higherstatistics
. Does that sound good?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to refine the idea.
For the
full outer
join,limit
. Do nothing.limit
, add extralimit
to that side.limit
, add extralimit
to one side which has a higherstatistics
Is it better?