[SPARK-13383][SQL] Keep broadcast hint after column pruning #11260
@@ -432,7 +432,8 @@ class Analyzer(
case r if r == oldRelation => newRelation
} transformUp {
case other => other transformExpressions {
case a: Attribute => attributeRewrites.get(a).getOrElse(a)
case a: Attribute =>
attributeRewrites.get(a).getOrElse(a).withQualifiers(a.qualifiers)
A fix for this issue is proposed in #11261. I included the change here because the following unit test cannot pass without it.
Test build #51527 has finished for PR 11260 at commit
ping @marmbrus @liancheng @rxin @davies
Project(allReferences.filter(c.outputSet.contains).toSeq, c)
c match {
case BroadcastHint(p) =>
BroadcastHint(Project(allReferences.filter(c.outputSet.contains).toSeq, p))
Could any other operator be inserted between Join and BroadcastHint?
Maybe we could add a rule that pulls BroadcastHint up until it reaches the Join.
One way to do it is to define the hint in the join operator itself; then it is safe, because it can never be pushed anywhere else.
Test build #51625 has finished for PR 11260 at commit
@@ -260,6 +260,20 @@ case class Join(
condition: Option[Expression])
extends BinaryNode with PredicateHelper {

private def isBrocastHint(plan: LogicalPlan): Boolean = {
What I meant was making the broadcast hint just a property on the logical plan, so it can never be pushed away from it.
Yeah, I see. But we would need to add two broadcast hints, one for the left plan and one for the right. Since Join is pattern-matched in many places, I am afraid this change would touch too many of them.
E.g. each case Join(left, right, joinType, condition) => would need to change to case Join(left, right, joinType, condition, leftBroadcastHint, rightBroadcastHint).
Is that OK with you? If so, I will make the change that way.
Well, you don't need to make the hint a parameter of the case class, just some field you can set ...
OK. I've made the hints variables that can be set and copied from another Join operator. Please take a look and see whether this update is good. Thanks.
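For readers following the thread, here is a minimal sketch of what "a field you can set" could look like. It is a toy model, not the code in this PR: `Plan`, `Relation`, the two-argument `Join`, and the helper `withHintsFrom` are hypothetical names chosen for illustration. It shows why existing `case Join(...)` patterns keep compiling, and also why the flags must be copied by hand.

```scala
// Toy model only: these classes are NOT Spark's Catalyst classes.
object JoinFieldSketch {
  sealed trait Plan
  case class Relation(name: String) extends Plan

  case class Join(left: Plan, right: Plan) extends Plan {
    // Declared in the body rather than the constructor, so the generated
    // unapply is unchanged and `case Join(l, r)` patterns still compile.
    var leftBroadcastHint: Boolean = false
    var rightBroadcastHint: Boolean = false

    // Hypothetical helper: the compiler-generated copy() calls the
    // constructor, so body fields fall back to their defaults and have
    // to be carried over explicitly.
    def withHintsFrom(other: Join): Join = {
      leftBroadcastHint = other.leftBroadcastHint
      rightBroadcastHint = other.rightBroadcastHint
      this
    }
  }
}
```

In this sketch, any rule that rebuilds a `Join` through `copy()` silently loses the flags unless it also calls `withHintsFrom`, which is one way to read the "hacky" objection raised later in the thread.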
retest this please.
Test build #51659 has finished for PR 11260 at commit
retest this please.
Why can't Jenkins retest?
It could be caused by a build failure: the shaded JAR is too large now. After merging the latest build, I hit an issue when using mvn.
This feels kind of hacky to me. @rxin, why doesn't the hint just change the statistics again?
Test build #51672 has finished for PR 11260 at commit
I still don't like this approach and find it too hacky. I talked to @marmbrus more offline, and maybe it's easiest to just rely on stats overriding: when a broadcast hint is declared, we set the size of the relation (just in the hint operator) to the smallest possible number (1?). Then it should be robust to pushdowns.
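A minimal sketch of the stats-overriding idea, under stated assumptions: the classes below are a toy model rather than Spark's real operators, `canBroadcast` and its threshold parameter are hypothetical, and `Project` is assumed to report its child's size unchanged. The point is only that a size of 1 set in the hint operator propagates through operators placed above it.

```scala
// Toy model only: NOT Spark's real LogicalPlan or Statistics API.
object StatsSketch {
  sealed trait Plan { def sizeInBytes: BigInt }

  case class Relation(sizeInBytes: BigInt) extends Plan

  // Assumption: a unary operator derives its size from its child.
  case class Project(child: Plan) extends Plan {
    def sizeInBytes: BigInt = child.sizeInBytes
  }

  // The hint operator overrides the size with the smallest value (1),
  // regardless of how large the wrapped plan is.
  case class BroadcastHint(child: Plan) extends Plan {
    def sizeInBytes: BigInt = BigInt(1)
  }

  // Hypothetical size-based check a planner could apply to a join child.
  def canBroadcast(plan: Plan, thresholdBytes: BigInt): Boolean =
    plan.sizeInBytes <= thresholdBytes
}
```

With this model, `Project(BroadcastHint(bigRelation))` still passes the size check, so a Project added on top of the hint no longer matters to the decision.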
retest this please.
We can remove the old broadcast hint matching in the strategy now, can't we?
Indeed, we can now. I will update it.
@@ -82,7 +82,6 @@ private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
*/
object CanBroadcast {
def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
We should probably turn this pattern matching into just an if/else statement now.
Test build #51731 has finished for PR 11260 at commit
@@ -137,6 +139,22 @@ class JoinSuite extends QueryTest with SharedSQLContext {
assert(planned.size === 1)
}

test("broadcasthint sets relation statistics to smallest value") {
Sorry, it would be great to put this in the optimizer suites rather than in this file, which is an end-to-end suite.
I just took a look at the suites available. I'd rename JoinOrderSuite to JoinOptimizationSuite, and then put this case there.
Yeah, I just couldn't find an appropriate suite for it. I'll use JoinOptimizationSuite. Thanks.
Force-pushed from 1e35aa3 to 9998d95.
Test build #51738 has finished for PR 11260 at commit
Test build #51743 has finished for PR 11260 at commit
retest this please.
Test build #51763 has finished for PR 11260 at commit
@rxin I've addressed your comments. Please see if this is appropriate. Thanks.
comparePlans(optimized, expected)

assert(optimized.collect {
case b @ BroadcastHint(_) if b.statistics.sizeInBytes == 1 => 1
This is a bit of a nit, but I think what you really want to test here is something like:
val broadcastChildren = optimized.collect {
  case Join(_, CanBroadcast(r), _, _) => r
}
assert(broadcastChildren.size == 1)
With the current test, something could break in Project (for example) that would prevent the broadcast from actually happening.
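To make the suggestion concrete, here is a self-contained sketch of that style of check. Everything in it is a stand-in: the plan classes, the simplified `collect`, the two-argument `Join`, the `threshold` value, and this `CanBroadcast` extractor (written as a plain if/else on size) are assumptions for illustration, not Spark's implementations.

```scala
// Toy model only: NOT Spark's real classes or its CanBroadcast extractor.
object BroadcastTestSketch {
  sealed trait Plan {
    def sizeInBytes: BigInt
    def children: Seq[Plan]
    // Simplified stand-in for a tree collect: pre-order traversal that
    // gathers the results of a partial function wherever it is defined.
    def collect[B](pf: PartialFunction[Plan, B]): Seq[B] =
      pf.lift(this).toList ++ children.flatMap(_.collect(pf))
  }
  case class Relation(sizeInBytes: BigInt) extends Plan {
    def children: Seq[Plan] = Nil
  }
  case class Project(child: Plan) extends Plan {
    def sizeInBytes: BigInt = child.sizeInBytes
    def children: Seq[Plan] = Seq(child)
  }
  case class BroadcastHint(child: Plan) extends Plan {
    def sizeInBytes: BigInt = BigInt(1) // the hint overrides the size
    def children: Seq[Plan] = Seq(child)
  }
  case class Join(left: Plan, right: Plan) extends Plan {
    def sizeInBytes: BigInt = left.sizeInBytes * right.sizeInBytes
    def children: Seq[Plan] = Seq(left, right)
  }

  // Hypothetical broadcast threshold (10 MB) for this sketch.
  val threshold: BigInt = BigInt(10L * 1024 * 1024)

  // Size-based extractor expressed as an if/else.
  object CanBroadcast {
    def unapply(plan: Plan): Option[Plan] =
      if (plan.sizeInBytes <= threshold) Some(plan) else None
  }

  // Collects the right-hand join children that would be broadcast.
  def broadcastChildren(optimized: Plan): Seq[Plan] = optimized.collect {
    case Join(_, CanBroadcast(r)) => r
  }
}
```

Checking the join's child rather than the hint node means the assertion would also fail if an intermediate operator such as Project stopped propagating the overridden size.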
Because it seems we can't import CanBroadcast into this test, I updated it according to your comment with a small change: it just checks the child's statistics.sizeInBytes. Please see if it is appropriate now. Thanks.
Implementation LGTM overall, minor comment on tests.
test this please
Test build #51842 has finished for PR 11260 at commit
I think the failed test is caused by the updated column pruning rule.
Test build #51858 has finished for PR 11260 at commit
retest this please.
Failure (at test_mllib.R#133): kmeans ...
Test build #51868 has finished for PR 11260 at commit
That is weird. Some other PRs, such as #11344 (a documentation-only change), also failed in these SparkR unit tests.
retest this please.
Test build #51882 has finished for PR 11260 at commit
Looks good, merging to master.
JIRA: https://issues.apache.org/jira/browse/SPARK-13383
What changes were proposed in this pull request?
When the Optimizer performs column pruning, it puts an additional Project on top of a logical plan. However, if that logical plan is already wrapped in a BroadcastHint, the added Project hides the BroadcastHint from later stages.
Column pruning should therefore take BroadcastHint into account.
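As an illustration of the problem described above, the following toy sketch contrasts naive pruning with a hint-aware variant. The classes and the functions `pruneNaive`, `pruneKeepingHint`, and `isHinted` are hypothetical; they mimic the shape of the plans involved but are not the Optimizer's actual code.

```scala
// Toy model only: NOT Spark's Catalyst classes or optimizer rules.
object PruningSketch {
  sealed trait Plan { def output: Seq[String] }
  case class Relation(output: Seq[String]) extends Plan
  case class Project(columns: Seq[String], child: Plan) extends Plan {
    def output: Seq[String] = columns
  }
  case class BroadcastHint(child: Plan) extends Plan {
    def output: Seq[String] = child.output
  }

  // Naive pruning: always wraps the plan, so a hint ends up under the Project.
  def pruneNaive(needed: Set[String], c: Plan): Plan =
    Project(c.output.filter(needed.contains), c)

  // Hint-aware pruning: inserts the Project beneath the hint instead.
  def pruneKeepingHint(needed: Set[String], c: Plan): Plan = c match {
    case BroadcastHint(p) =>
      BroadcastHint(Project(p.output.filter(needed.contains), p))
    case other =>
      Project(other.output.filter(needed.contains), other)
  }

  // A planner that inspects only the top node of a join child would see this.
  def isHinted(plan: Plan): Boolean = plan.isInstanceOf[BroadcastHint]
}
```

In this model, the naive variant returns a Project at the top, so a check on the top node no longer sees the hint, while the hint-aware variant keeps the hint outermost and still prunes the columns.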
How was this patch tested?
A unit test is added.