-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-13136][SQL] Create a dedicated Broadcast exchange operator #11083
Conversation
Test build #50775 has finished for PR 11083 at commit
|
# Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala # sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashOuterJoin.scala
Retest this please |
Test build #50879 has finished for PR 11083 at commit
|
Test build #50881 has finished for PR 11083 at commit
|
Retest this please |
Test build #50883 has finished for PR 11083 at commit
|
This one is ready for review. |
Test build #50897 has finished for PR 11083 at commit
|
retest this please |
Test build #50898 has finished for PR 11083 at commit
|
retest this please |
Test build #50900 has finished for PR 11083 at commit
|
retest this please |
case class Broadcast( | ||
f: Iterable[InternalRow] => Any, | ||
child: SparkPlan) | ||
extends UnaryNode with CodegenSupport { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we do include this in generated code of BroadcastHashJoin, I think it's better to not implement CodegenSupport, then we don't need the special case in CollapseCodegenStages
Test build #50928 has finished for PR 11083 at commit
|
Test build #50942 has finished for PR 11083 at commit
|
# Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala
Test build #51039 has finished for PR 11083 at commit
|
@yhuai if you have some time this wk, can you review this? |
* Represents data where tuples are broadcasted to every node. It is quite common that the | ||
* entire set of tuples is transformed into different data structure. | ||
*/ | ||
case class BroadcastDistribution(f: Iterable[InternalRow] => Any = identity) extends Distribution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm thinking maybe it's better to just declare that we want a hashed broadcast distribution, and then don't take a closure. The reason it is bad to take a closure is that this won't work if we want to whole-stage codegen the building of the hash table, or if we want to change the internal engine to a push-based model.
Retest this please |
Test build #51418 has finished for PR 11083 at commit
|
# Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoin.scala # sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashOuterJoin.scala # sql/core/src/test/scala/org/apache/spark/sql/execution/joins/InnerJoinSuite.scala
Test build #51596 has finished for PR 11083 at commit
|
@@ -29,22 +29,20 @@ import org.apache.spark.sql.execution.metric.SQLMetrics | |||
* for hash join. | |||
*/ | |||
case class LeftSemiJoinBNL( | |||
streamed: SparkPlan, broadcast: SparkPlan, condition: Option[Expression]) | |||
left: SparkPlan, right: SparkPlan, condition: Option[Expression]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you do this change (streamed -> left, broadcast -> right)? this makes the variable name more confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'll revert that.
I'm going to review this more carefully tonight. |
@hvanhovell when you get a chance, please update the description if it merits any change. |
Test build #51605 has finished for PR 11083 at commit
|
* Marker trait to identify the shape in which tuples are broadcasted. Typical examples of this are | ||
* identity (tuples remain unchanged) or hashed (tuples are converted into some hash index). | ||
*/ | ||
trait BroadcastMode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd move this and IdentityBroadcastMode into a new file.
This looks pretty good actually. |
@rxin I agree that this is stretching the definitions of both |
Test build #51635 has finished for PR 11083 at commit
|
Test build #51637 has finished for PR 11083 at commit
|
Thanks. I'm going to merge this. |
Quite a few Spark SQL join operators broadcast one side of the join to all nodes. The are a few problems with this:
This PR defines both a
BroadcastDistribution
andBroadcastPartitioning
, these contain aBroadcastMode
. TheBroadcastMode
defines the way in which we transform the Array ofInternalRow
's into an index. We currently support the followingBroadcastMode
's:Set
.HashedRelation
, and broadcasts this index.To match this distribution we implement a
BroadcastExchange
operator which will perform the broadcast for us, and haveEnsureRequirements
plan this operator. The old Exchange operator has been renamed into ShuffleExchange in order to clearly separate between Shuffled and Broadcasted exchanges. Finally the classes in Exchange.scala have been moved to a dedicated package.cc @rxin @davies