-
Notifications
You must be signed in to change notification settings - Fork 28k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SQL] SPARK-1800 Add broadcast hash join operator & associated hints. #1163
[SQL] SPARK-1800 Add broadcast hash join operator & associated hints. #1163
Conversation
…al operators: BroadcastHashJoin and ShuffledHashJoin.
…a configuration hint.
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala
Build triggered. |
Build started. |
Build finished. |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15970/ |
Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala
Merged build triggered. |
Merged build started. |
Merged build finished. |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15971/ |
@@ -243,16 +242,25 @@ object HiveMetastoreTypes extends RegexParsers { | |||
} | |||
} | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra spaces.
* each time an input row is added. This significatly reduces the cost of calcuating the | ||
* projection, but means that it is not safe | ||
* each time an input row is added. This significantly reduces the cost of calculating the | ||
* projection, but means that it is not safe ...? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... to hold on to a reference to a Row
after next()
has been called on the Iterator
that produced it. Instead, the user must call Row.copy()
and hold on to the returned Row
before calling next()
.
Regarding testing we will probably want to pull all of our various join tests out into a separate test suite that can be run with various options turned on an off so we exercise all of the edge cases for each of the join operators. This is going to become more important as we add more and more join types so I think its worth putting some time into it. Towards that we might consider breaking this PR into a few pieces. Get the new join type / testing in soon. Add the auto selection / cost estimation in a follow up. |
self: Product => | ||
|
||
def estimatedSize(context: SQLContext): Long = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we could probably estimate the size more accurately if we also had some semantic information, like which columns we wanted, as I believe Parquet stores stats for each column. Perhaps worthy of a TODO, this seems perfectly reasonable for now.
- Make SparkLogicalPlan a BaseRelation. Moreover, in SQLContext#registerRDDAsTable, propagate the new table name to any SparkLogicalPlan with an ExistingRdd child. Essentially we are treating such a plan as a relation. - Move all current join related tests into JoinSuite, to prepare for a better test framework for join algorithms.
Merged build triggered. |
Merged build started. |
Merged build finished. |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16071/ |
Merged build finished. |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16074/ |
Merged build triggered. |
Merged build started. |
Merged build finished. |
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16075/ |
Jenkins, retest this please. |
Merged build triggered. |
Merged build started. |
Merged build finished. All automated tests passed. |
All automated tests passed. |
Merged build finished. All automated tests passed. |
All automated tests passed. |
Merged build triggered. |
Merged build started. |
Merged build finished. All automated tests passed. |
All automated tests passed. |
Thanks, I've merged this into master. I did not merge this into 1.0 as it has this hint that we are not sure we want to support long term, so I'd like to avoid having it in a released version of spark. |
Sounds good. On Wednesday, June 25, 2014, Michael Armbrust notifications@github.com
|
This PR is based off Michael's [PR 734](apache#734) and includes a bunch of cleanups. Moreover, this PR also - makes `SparkLogicalPlan` take a `tableName: String`, which facilitates testing. - moves join-related tests to a single file. Author: Zongheng Yang <zongheng.y@gmail.com> Author: Michael Armbrust <michael@databricks.com> Closes apache#1163 from concretevitamin/auto-broadcast-hash-join and squashes the following commits: d0f4991 [Zongheng Yang] Fix bug in broadcast hash join & add test to cover it. af080d7 [Zongheng Yang] Fix in joinIterators()'s next(). 440d277 [Zongheng Yang] Fixes to imports; add back requiredChildDistribution (lost when merging) 208d5f6 [Zongheng Yang] Make LeftSemiJoinHash mix in HashJoin. ad6c7cc [Zongheng Yang] Minor cleanups. 814b3bf [Zongheng Yang] Merge branch 'master' into auto-broadcast-hash-join a8a093e [Zongheng Yang] Minor cleanups. 6fd8443 [Zongheng Yang] Cut down size estimation related stuff. a4267be [Zongheng Yang] Add test for broadcast hash join and related necessary refactorings: 0e64b08 [Zongheng Yang] Scalastyle fix. 91461c2 [Zongheng Yang] Merge branch 'master' into auto-broadcast-hash-join 7c7158b [Zongheng Yang] Prototype of auto conversion to broadcast hash join. 0ad122f [Zongheng Yang] Merge branch 'master' into auto-broadcast-hash-join 3e5d77c [Zongheng Yang] WIP: giant and messy WIP. a92ed0c [Michael Armbrust] Formatting. 76ca434 [Michael Armbrust] A simple strategy that broadcasts tables only when they are found in a configuration hint. cf6b381 [Michael Armbrust] Split out generic logic for hash joins and create two concrete physical operators: BroadcastHashJoin and ShuffledHashJoin. a8420ca [Michael Armbrust] Copy records in executeCollect to avoid issues with mutable rows.
This PR is based off Michael's PR 734 and includes a bunch of cleanups.
Moreover, this PR also
SparkLogicalPlan
take atableName: String
, which facilitates testing.