[SQL] SPARK-1800 Add broadcast hash join operator & associated hints. #1163

concretevitamin · 2014-06-20T22:48:11Z

This PR is based off Michael's PR 734 and includes a bunch of cleanups.

Moreover, this PR also

- makes SparkLogicalPlan take a tableName: String, which facilitates testing.
- moves join-related tests to a single file.

AmplabJenkins · 2014-06-20T22:50:03Z

Build triggered.

AmplabJenkins · 2014-06-20T22:50:11Z

Build started.

AmplabJenkins · 2014-06-20T22:51:40Z

Build finished.

AmplabJenkins · 2014-06-20T22:51:40Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15970/

AmplabJenkins · 2014-06-20T23:00:01Z

Merged build triggered.

AmplabJenkins · 2014-06-20T23:00:06Z

Merged build started.

AmplabJenkins · 2014-06-21T00:04:29Z

Merged build finished.

AmplabJenkins · 2014-06-21T00:04:29Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15971/

marmbrus · 2014-06-21T20:38:37Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

@@ -243,16 +242,25 @@ object HiveMetastoreTypes extends RegexParsers {
  }
 }

+


Extra spaces.

marmbrus · 2014-06-21T20:42:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala

- * each time an input row is added.  This significatly reduces the cost of calcuating the
- * projection, but means that it is not safe
+ * each time an input row is added.  This significantly reduces the cost of calculating the
+ * projection, but means that it is not safe ...?


... to hold on to a reference to a Row after next() has been called on the Iterator that produced it. Instead, the user must call Row.copy() and hold on to the returned Row before calling next().
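To illustrate the hazard described in that comment, here is a minimal sketch in plain Scala. `MutableRow` and `reusingIterator` are hypothetical stand-ins invented for this example (they are not Spark's `Row` or projection classes); the only assumption is that the iterator hands back the same mutable object on every call to next().

```scala
object MutableRowDemo {
  // Hypothetical stand-in for a mutable row; not Spark's Row class.
  final class MutableRow(var value: Int) {
    def copy(): MutableRow = new MutableRow(value)
  }

  // An iterator that reuses one row object for every element, as a
  // mutable projection might do to avoid allocating a row per input.
  def reusingIterator(values: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    values.iterator.map { v =>
      shared.value = v
      shared
    }
  }

  // Unsafe: every collected element is a reference to the same object,
  // so all of them show the last value written.
  def collectWithoutCopy(values: Seq[Int]): Seq[Int] =
    reusingIterator(values).toList.map(_.value)

  // Safe: each row is copied before the iterator advances.
  def collectWithCopy(values: Seq[Int]): Seq[Int] =
    reusingIterator(values).map(_.copy()).toList.map(_.value)
}
```

With the input `Seq(1, 2, 3)`, `collectWithoutCopy` yields `List(3, 3, 3)` while `collectWithCopy` yields `List(1, 2, 3)`.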

marmbrus · 2014-06-21T21:03:01Z

Regarding testing, we will probably want to pull all of our various join tests out into a separate test suite that can be run with various options turned on and off, so we exercise all of the edge cases for each of the join operators. This is going to become more important as we add more and more join types, so I think it's worth putting some time into it.

Towards that, we might consider breaking this PR into a few pieces: get the new join type and testing in soon, then add the auto selection and cost estimation in a follow-up.
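One way to picture such a suite is a harness that runs the same inputs and expected answer against every join implementation. The sketch below uses plain Scala collections; `JoinImpl`, `nestedLoopJoin`, `hashJoin`, and `failures` are hypothetical names for illustration and do not correspond to the operators or test code in this PR.

```scala
object JoinSuiteSketch {
  type Row = (Int, String)
  type Joined = (Int, String, String)
  type JoinImpl = (Seq[Row], Seq[Row]) => Seq[Joined]

  // Reference implementation: compare every pair of rows.
  val nestedLoopJoin: JoinImpl = (left, right) =>
    for ((lk, lv) <- left; (rk, rv) <- right if lk == rk) yield (lk, lv, rv)

  // Hash-based implementation: build a table on the right side,
  // then probe it with each row from the left side.
  val hashJoin: JoinImpl = (left, right) => {
    val table = right.groupBy(_._1)
    left.flatMap { case (k, lv) =>
      table.getOrElse(k, Seq.empty).map { case (_, rv) => (k, lv, rv) }
    }
  }

  val implementations: Map[String, JoinImpl] =
    Map("nestedLoop" -> nestedLoopJoin, "hash" -> hashJoin)

  // Runs one test case against every implementation and returns the
  // names of those whose (order-insensitive) result differs.
  def failures(left: Seq[Row], right: Seq[Row], expected: Seq[Joined]): Seq[String] =
    implementations.toSeq.collect {
      case (name, join) if join(left, right).sorted != expected.sorted => name
    }.sorted
}
```

Adding a new join type then means adding one entry to `implementations`; edge cases such as duplicate keys, unmatched keys, and an empty side are written once and exercised everywhere.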

aarondav · 2014-06-23T02:51:32Z

sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala

  self: Product =>

+  def estimatedSize(context: SQLContext): Long = {


Here we could probably estimate the size more accurately if we also had some semantic information, like which columns we wanted, as I believe Parquet stores stats for each column. Perhaps that is worthy of a TODO; this seems perfectly reasonable for now.
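As a rough sketch of that suggestion, the estimate could sum only the columns a query actually needs. Everything here is an assumption for illustration: `ColumnStats` is a hypothetical record, not a Parquet or Spark API, and the byte counts would have to come from whatever per-column metadata is really available.

```scala
object SizeEstimateSketch {
  // Hypothetical per-column statistic; not a Parquet or Spark API.
  final case class ColumnStats(name: String, totalBytes: Long)

  // Whole-relation estimate: counts every column.
  def estimatedSize(stats: Seq[ColumnStats]): Long =
    stats.map(_.totalBytes).sum

  // Column-aware estimate: counts only the columns the query needs.
  def estimatedSize(stats: Seq[ColumnStats], required: Set[String]): Long =
    stats.filter(s => required.contains(s.name)).map(_.totalBytes).sum
}
```

For a wide table where a query touches one narrow column, the second estimate can be much smaller than the first, which matters if the estimate feeds a decision about whether a table is small enough to broadcast.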

- Make SparkLogicalPlan a BaseRelation. Moreover, in SQLContext#registerRDDAsTable, propagate the new table name to any SparkLogicalPlan with an ExistingRdd child. Essentially we are treating such a plan as a relation.
- Move all current join related tests into JoinSuite, to prepare for a better test framework for join algorithms.

AmplabJenkins · 2014-06-24T20:20:18Z

Merged build triggered.

AmplabJenkins · 2014-06-24T20:20:25Z

Merged build started.

AmplabJenkins · 2014-06-24T22:25:28Z

Merged build finished.

AmplabJenkins · 2014-06-24T22:25:28Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16071/

AmplabJenkins · 2014-06-24T23:05:32Z

Merged build finished.

AmplabJenkins · 2014-06-24T23:05:32Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16074/

AmplabJenkins · 2014-06-24T23:50:16Z

Merged build triggered.

AmplabJenkins · 2014-06-24T23:50:26Z

Merged build started.

AmplabJenkins · 2014-06-24T23:53:46Z

Merged build finished.

AmplabJenkins · 2014-06-24T23:53:46Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16075/

concretevitamin · 2014-06-24T23:56:46Z

Jenkins, retest this please.

AmplabJenkins · 2014-06-25T00:00:17Z

Merged build triggered.

AmplabJenkins · 2014-06-25T00:00:26Z

Merged build started.

AmplabJenkins · 2014-06-25T01:28:11Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-06-25T01:28:11Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16082/

AmplabJenkins · 2014-06-25T01:39:30Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-06-25T01:39:31Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16083/

AmplabJenkins · 2014-06-25T22:20:21Z

Merged build triggered.

AmplabJenkins · 2014-06-25T22:20:29Z

Merged build started.

AmplabJenkins · 2014-06-25T23:59:44Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-06-25T23:59:44Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16125/

marmbrus · 2014-06-26T01:39:10Z

Thanks, I've merged this into master. I did not merge this into 1.0 because it includes a hint that we are not sure we want to support long term, so I'd like to avoid having it in a released version of Spark.

concretevitamin · 2014-06-26T02:34:59Z

Sounds good.

This PR is based off Michael's [PR 734](apache#734) and includes a bunch of cleanups.

Moreover, this PR also
- makes `SparkLogicalPlan` take a `tableName: String`, which facilitates testing.
- moves join-related tests to a single file.

Author: Zongheng Yang <zongheng.y@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes apache#1163 from concretevitamin/auto-broadcast-hash-join and squashes the following commits:

d0f4991 [Zongheng Yang] Fix bug in broadcast hash join & add test to cover it.
af080d7 [Zongheng Yang] Fix in joinIterators()'s next().
440d277 [Zongheng Yang] Fixes to imports; add back requiredChildDistribution (lost when merging)
208d5f6 [Zongheng Yang] Make LeftSemiJoinHash mix in HashJoin.
ad6c7cc [Zongheng Yang] Minor cleanups.
814b3bf [Zongheng Yang] Merge branch 'master' into auto-broadcast-hash-join
a8a093e [Zongheng Yang] Minor cleanups.
6fd8443 [Zongheng Yang] Cut down size estimation related stuff.
a4267be [Zongheng Yang] Add test for broadcast hash join and related necessary refactorings:
0e64b08 [Zongheng Yang] Scalastyle fix.
91461c2 [Zongheng Yang] Merge branch 'master' into auto-broadcast-hash-join
7c7158b [Zongheng Yang] Prototype of auto conversion to broadcast hash join.
0ad122f [Zongheng Yang] Merge branch 'master' into auto-broadcast-hash-join
3e5d77c [Zongheng Yang] WIP: giant and messy WIP.
a92ed0c [Michael Armbrust] Formatting.
76ca434 [Michael Armbrust] A simple strategy that broadcasts tables only when they are found in a configuration hint.
cf6b381 [Michael Armbrust] Split out generic logic for hash joins and create two concrete physical operators: BroadcastHashJoin and ShuffledHashJoin.
a8420ca [Michael Armbrust] Copy records in executeCollect to avoid issues with mutable rows.
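The hint-driven strategy named in commit 76ca434 can be pictured roughly as follows. This is a sketch under stated assumptions, not the code from this PR: the comma-separated hint format, `parseHint`, and `chooseJoin` are hypothetical, and the two case objects only borrow the operator names as labels.

```scala
object JoinSelectionSketch {
  // Labels only; these are not the physical operators from the PR.
  sealed trait JoinOperator
  case object BroadcastHashJoin extends JoinOperator
  case object ShuffledHashJoin extends JoinOperator

  // Parses a hypothetical comma-separated hint such as "dim_date, dim_store".
  def parseHint(hint: String): Set[String] =
    hint.split(",").map(_.trim).filter(_.nonEmpty).toSet

  // Picks the broadcast operator only when one side is named in the hint;
  // otherwise falls back to the shuffled operator.
  def chooseJoin(leftTable: String, rightTable: String, hint: String): JoinOperator = {
    val broadcastTables = parseHint(hint)
    if (broadcastTables.contains(leftTable) || broadcastTables.contains(rightTable))
      BroadcastHashJoin
    else
      ShuffledHashJoin
  }
}
```

The point of the design is that nothing is broadcast unless the user opts in by naming a table, which keeps the default behavior unchanged.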

marmbrus and others added 7 commits May 11, 2014 11:23

Copy records in executeCollect to avoid issues with mutable rows.

a8420ca

Split out generic logic for hash joins and create two concrete physic…

cf6b381

…al operators: BroadcastHashJoin and ShuffledHashJoin.

A simple strategy that broadcasts tables only when they are found in …

76ca434

…a configuration hint.

Formatting.

a92ed0c

WIP: giant and messy WIP.

3e5d77c

Merge branch 'master' into auto-broadcast-hash-join

0ad122f

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala

Prototype of auto conversion to broadcast hash join.

7c7158b

concretevitamin added 2 commits June 20, 2014 15:53

Merge branch 'master' into auto-broadcast-hash-join

91461c2

Conflicts: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala

Scalastyle fix.

0e64b08

marmbrus reviewed Jun 21, 2014

marmbrus mentioned this pull request Jun 21, 2014

[SQL] SPARK-1800 Add broadcast hash join operator #734

Closed

2 tasks

marmbrus reviewed Jun 21, 2014

aarondav reviewed Jun 23, 2014

concretevitamin added 2 commits June 24, 2014 13:11

Cut down size estimation related stuff.

6fd8443

Minor cleanups.

a8a093e

Fixes to imports; add back requiredChildDistribution (lost when merging)

440d277

chenghao-intel mentioned this pull request Jun 25, 2014

[SQL][SPARK-2212]Hash Outer Join #1147

Closed

concretevitamin added 2 commits June 25, 2014 11:17

Fix in joinIterators()'s next().

af080d7

Fix bug in broadcast hash join & add test to cover it.

d0f4991

asfgit closed this in 9d824fe Jun 26, 2014
