-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-10887] [SQL] Build HashedRelation outside of HashJoinNode. #8953
Conversation
Test build #43140 has finished for PR 8953 at commit
|
I think we still need to have a 2-way hash join node that takes two LocalNodes as children. It will internally use the refactored hash join node. |
Test build #43187 has finished for PR 8953 at commit
|
test this please |
Test build #43189 has started for PR 8953 at commit |
test this please |
Test build #43201 has finished for PR 8953 at commit
|
Test build #43223 has finished for PR 8953 at commit
|
Test build #43282 has finished for PR 8953 at commit
|
left: LocalNode, | ||
right: LocalNode) extends BinaryLocalNode(conf) { | ||
|
||
private[this] lazy val (buildNode, buildKeys, streamedNode, streamedKeys) = buildSide match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a lazy val?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, doesn't need to be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed lazy
.
Test build #43283 has finished for PR 8953 at commit
|
* input [[InternalRow]] for a fixed set of [[Expression Expressions]]. It exposes a `target` | ||
* method. This method is used to set the row that will be updated. So, when `target` is used, the | ||
* [[MutableRow]] object created internally will not be used. If `target` is not used, the | ||
* [[MutableRow]] object created internally will be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment. Where is target
? I can't find it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See public ${classOf[BaseMutableProjection].getName} target($mutableRowType row) {
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh got it
As discussed offline, we should explore alternatives where we don't have to wrap local nodes to avoid the wrapping logic. Right now we need to worry about the initialization / prepare / close order and it's easy to miss something there. |
val resolvedBuildNode = resolveExpressions(buildNode) | ||
val resolvedBuildKeys = resolveExpressions(buildKeys, resolvedBuildNode) | ||
val hashedRelation = buildHashedRelation(conf, resolvedBuildKeys, resolvedBuildNode) | ||
val broadcastHashedRelation = sqlContext.sparkContext.broadcast(hashedRelation) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewor14 This is the place where I need sqlcontext.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't you just create a broadcast variable directly? You might have to do some mocking for that, e.g.
import org.mockito.Mockito.{mock, when}
val hashedRelation = ...
val broadcastHashedRelation = mock(classOf[TorrentBroadcast[HashedRelation]])
when(broadcastHashedRelation.value).thenReturn(hashedRelation)
would that work? Then we won't have to rely on SparkContext
.
@andrewor14 I have made |
Test build #43340 has finished for PR 8953 at commit
|
* input [[InternalRow]] for a fixed set of [[Expression Expressions]]. | ||
* input [[InternalRow]] for a fixed set of [[Expression Expressions]]. It exposes a `target` | ||
* method. This method is used to set the row that will be updated. So, when `target` is used, the | ||
* [[MutableRow]] object created internally will not be used. If `target` is not used, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is quite verbose. I think you can just say
It exposes a `target` method, which is used to set the row that will be updated.
The internal [[MutableRow]] object created internally is used only when `target` is not used.
|
||
/** | ||
* A node for inner hash equi-join. [[BinaryHashJoinNode]] and [[BroadcastHashJoinNode]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An abstract node for sharing common functionality among different implementations of
inner hash equi-join, notably [[BinaryHashJoinNode]] and [[BroadcastHashJoinNode]].
@yhuai thanks, this looks much simpler. My remaining comments are mainly concerned with comments and tests. |
LGTM will merge pending tasks. Thanks for addressing the comments so quickly. |
Test build #43352 has finished for PR 8953 at commit
|
retest this please |
Test build #43366 has finished for PR 8953 at commit
|
This PR refactors
HashJoinNode
to take a existingHashedRelation
. So, we can reuse this node for bothShuffledHashJoin
andBroadcastHashJoin
.https://issues.apache.org/jira/browse/SPARK-10887