Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SQL] SPARK-1800 Add broadcast hash join operator & associated hints. #1163

Closed

Conversation

concretevitamin
Copy link
Contributor

This PR is based off Michael's PR 734 and includes a bunch of cleanups.

Moreover, this PR also

  • makes SparkLogicalPlan take a tableName: String, which facilitates testing.
  • moves join-related tests to a single file.

@AmplabJenkins
Copy link

Build triggered.

@AmplabJenkins
Copy link

Build started.

@AmplabJenkins
Copy link

Build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15970/

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala
@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15971/

@@ -243,16 +242,25 @@ object HiveMetastoreTypes extends RegexParsers {
}
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra spaces.

* each time an input row is added. This significatly reduces the cost of calcuating the
* projection, but means that it is not safe
* each time an input row is added. This significantly reduces the cost of calculating the
* projection, but means that it is not safe ...?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... to hold on to a reference to a Row after next() has been called on the Iterator that produced it. Instead, the user must call Row.copy() and hold on to the returned Row before calling next().

@marmbrus
Copy link
Contributor

Regarding testing we will probably want to pull all of our various join tests out into a separate test suite that can be run with various options turned on an off so we exercise all of the edge cases for each of the join operators. This is going to become more important as we add more and more join types so I think its worth putting some time into it.

Towards that we might consider breaking this PR into a few pieces. Get the new join type / testing in soon. Add the auto selection / cost estimation in a follow up.

self: Product =>

def estimatedSize(context: SQLContext): Long = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we could probably estimate the size more accurately if we also had some semantic information, like which columns we wanted, as I believe Parquet stores stats for each column. Perhaps worthy of a TODO, this seems perfectly reasonable for now.

- Make SparkLogicalPlan a BaseRelation. Moreover, in
  SQLContext#registerRDDAsTable, propagate the new table name to any
  SparkLogicalPlan with an ExistingRdd child. Essentially we are
  treating such a plan as a relation.
- Move all current join related tests into JoinSuite, to prepare for a
  better test framework for join algorithms.
@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16071/

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16074/

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished.

@AmplabJenkins
Copy link

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16075/

@concretevitamin
Copy link
Contributor Author

Jenkins, retest this please.

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16082/

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16083/

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16125/

@asfgit asfgit closed this in 9d824fe Jun 26, 2014
@marmbrus
Copy link
Contributor

Thanks, I've merged this into master. I did not merge this into 1.0 as it has this hint that we are not sure we want to support long term, so I'd like to avoid having it in a released version of spark.

@concretevitamin
Copy link
Contributor Author

Sounds good.

On Wednesday, June 25, 2014, Michael Armbrust notifications@github.com
wrote:

Thanks, I've merged this into master. I did not merge this into 1.0 as it
has this hint that we are not sure we want to support long term, so I'd
like to avoid having it in a released version of spark.


Reply to this email directly or view it on GitHub
#1163 (comment).

xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
This PR is based off Michael's [PR 734](apache#734) and includes a bunch of cleanups.

Moreover, this PR also
- makes `SparkLogicalPlan` take a `tableName: String`, which facilitates testing.
- moves join-related tests to a single file.

Author: Zongheng Yang <zongheng.y@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes apache#1163 from concretevitamin/auto-broadcast-hash-join and squashes the following commits:

d0f4991 [Zongheng Yang] Fix bug in broadcast hash join & add test to cover it.
af080d7 [Zongheng Yang] Fix in joinIterators()'s next().
440d277 [Zongheng Yang] Fixes to imports; add back requiredChildDistribution (lost when merging)
208d5f6 [Zongheng Yang] Make LeftSemiJoinHash mix in HashJoin.
ad6c7cc [Zongheng Yang] Minor cleanups.
814b3bf [Zongheng Yang] Merge branch 'master' into auto-broadcast-hash-join
a8a093e [Zongheng Yang] Minor cleanups.
6fd8443 [Zongheng Yang] Cut down size estimation related stuff.
a4267be [Zongheng Yang] Add test for broadcast hash join and related necessary refactorings:
0e64b08 [Zongheng Yang] Scalastyle fix.
91461c2 [Zongheng Yang] Merge branch 'master' into auto-broadcast-hash-join
7c7158b [Zongheng Yang] Prototype of auto conversion to broadcast hash join.
0ad122f [Zongheng Yang] Merge branch 'master' into auto-broadcast-hash-join
3e5d77c [Zongheng Yang] WIP: giant and messy WIP.
a92ed0c [Michael Armbrust] Formatting.
76ca434 [Michael Armbrust] A simple strategy that broadcasts tables only when they are found in a configuration hint.
cf6b381 [Michael Armbrust] Split out generic logic for hash joins and create two concrete physical operators: BroadcastHashJoin and ShuffledHashJoin.
a8420ca [Michael Armbrust] Copy records in executeCollect to avoid issues with mutable rows.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants