[SPARK-9024] Unsafe HashJoin/HashOuterJoin/HashSemiJoin #7480

davies · 2015-07-17T20:32:23Z

This PR introduces unsafe versions (using UnsafeRow) of HashJoin, HashOuterJoin and HashSemiJoin, including both the broadcast and shuffle variants (except FullOuterJoin, which is better implemented using SortMergeJoin).

It uses a HashMap to store UnsafeRow for now; a follow-up PR will switch to BytesToBytesMap for better performance.

davies · 2015-07-17T20:32:42Z

@JoshRosen @rxin Please take an early look.

SparkQA · 2015-07-17T20:42:45Z

Test build #37665 has finished for PR 7480 at commit bea4a50.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class UnsafeColumnWriter

SparkQA · 2015-07-17T22:44:35Z

Test build #37666 has finished for PR 7480 at commit 95d0762.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class UnsafeColumnWriter

rxin · 2015-07-17T22:58:22Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala

+    sizeEstimate: Int = 64): HashedRelation = {
+
+    // TODO: Use BytesToBytesMap.
+    val hashTable = new JavaHashMap[UnsafeRow, CompactBuffer[UnsafeRow]](sizeEstimate)


It might be OK to just build this using a Java HashMap first, and then build a giant byte array from it.

I'm thinking we can re-order the values in BytesToBytesMap during serialization.

SparkQA · 2015-07-20T19:32:08Z

Test build #37855 has finished for PR 7480 at commit 6acbb11.

This patch fails Scala style tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- abstract class UnsafeColumnWriter

JoshRosen · 2015-07-20T21:29:01Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java

+  public boolean equals(Object other) {
+    if (other instanceof UnsafeRow) {
+      UnsafeRow o = (UnsafeRow) other;
+      return ByteArrayMethods.arrayEquals(baseObject, baseOffset, o.baseObject, o.baseOffset,


I think that we should check whether the rows' sizeInBytes are equal before attempting to compare their contents.
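To illustrate the suggestion, here is a minimal sketch of a size-checked equality test. `ByteRegion` is a hypothetical stand-in for a row backed by a slice of a byte array; it is not the actual UnsafeRow class, and the real implementation compares raw memory through ByteArrayMethods rather than a Java loop.

```java
// Hypothetical stand-in for a row stored as a region of a byte array.
final class ByteRegion {
    final byte[] base;
    final int offset;
    final int sizeInBytes;

    ByteRegion(byte[] base, int offset, int sizeInBytes) {
        this.base = base;
        this.offset = offset;
        this.sizeInBytes = sizeInBytes;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof ByteRegion)) {
            return false;
        }
        ByteRegion o = (ByteRegion) other;
        // Compare sizes first: regions of different lengths can never be equal,
        // and comparing past the end of the shorter one would read unrelated bytes.
        if (sizeInBytes != o.sizeInBytes) {
            return false;
        }
        for (int i = 0; i < sizeInBytes; i++) {
            if (base[offset + i] != o.base[o.offset + i]) {
                return false;
            }
        }
        return true;
    }

    @Override
    public int hashCode() {
        // Hash only the bytes inside the region so equal regions hash alike.
        int h = 1;
        for (int i = 0; i < sizeInBytes; i++) {
            h = 31 * h + base[offset + i];
        }
        return h;
    }
}
```

Without the size check, a 3-byte region could compare equal to the first 3 bytes of a longer region, which would be wrong for a hash map key.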

SparkQA · 2015-07-20T21:42:56Z

Test build #37857 has finished for PR 7480 at commit 184b852.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- abstract class UnsafeColumnWriter

JoshRosen · 2015-07-20T21:43:09Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala

+
+    val values = hashTable.get(unsafeKey)
+    // Return GenericInternalRow to work with other JoinRow, which
+    // TODO(davies): return UnsafeRow once we have UnsafeJoinRow.


If we're not going to implement this as part of this patch, then let's make sure to file a followup JIRA under the Tungsten umbrella.

It's already on last week's TODO list.

JoshRosen · 2015-07-20T22:07:46Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala

@@ -83,12 +83,23 @@ abstract class UnsafeProjection extends Projection {
 }

 object UnsafeProjection {
+  def canSupport(schema: StructType): Boolean = canSupport(schema.fields.map(_.dataType))
+  def canSupport(types: Seq[DataType]): Boolean = types.forall(UnsafeColumnWriter.canEmbed(_))


You could even add a canSupport(exprs: Seq[Expression]) to be able to save some characters elsewhere.

JoshRosen · 2015-07-20T22:26:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala

+
+  def this() = this(null, null, null)  // Needed for serialization
+
+  // UnsafeProjection is not thread safe


Can you elaborate on why instances of UnsafeProjection are not thread-safe? I don't see any mention of this in the Scaladoc for UnsafeProjection, so we should probably update it to make any thread-safety concerns clearer.

Does UnsafeHashedRelation have to be thread-safe? I thought it was only used in the context of a single task.
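One common reason a projection is not thread-safe is that it reuses a single mutable output buffer across calls. The sketch below shows that pattern with a hypothetical `ReusingProjection` class; it is an assumption about the kind of hazard involved, not a description of how UnsafeProjection is actually implemented.

```java
// Hypothetical projection that reuses one output buffer for every call.
final class ReusingProjection {
    // Shared mutable state: every call to apply() writes into this array.
    private final long[] buffer = new long[1];

    long[] apply(long input) {
        buffer[0] = input * 2;
        // The caller receives the shared buffer, not a fresh copy.
        return buffer;
    }
}
```

Because every call returns the same array, a second call overwrites the result of the first. A single task that consumes each result before requesting the next is unaffected, but two threads sharing one instance would see each other's writes unless each result is copied or each thread owns its own projection.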

davies · 2015-07-21T21:45:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/rowFormatConverters.scala

-            c => if (!c.outputsUnsafeRows) ConvertToUnsafe(c) else c
+        // If this operator's children produce both unsafe and safe rows,
+        // convert everything unsafe rows if all the schema of them are support by UnsafeRow
+        if (operator.children.forall(c => UnsafeProjection.canSupport(c.schema))) {


ping @JoshRosen

This change looks good to me.

davies · 2015-07-21T21:48:34Z

@JoshRosen Thanks, I will merge this once Jenkins is happy.

SparkQA · 2015-07-21T22:59:00Z

Test build #37979 has finished for PR 7480 at commit a05b4f6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-21T23:30:11Z

Test build #37985 has finished for PR 7480 at commit 84c9807.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-22T02:06:18Z

Test build #37999 has finished for PR 7480 at commit dede020.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-22T06:40:18Z

Test build #38028 has finished for PR 7480 at commit 6294b1e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-22T07:16:31Z

Test build #1154 has finished for PR 7480 at commit 6294b1e.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-22T07:18:44Z

Test build #1155 has finished for PR 7480 at commit 6294b1e.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-22T15:50:01Z

Test build #1156 has finished for PR 7480 at commit 6294b1e.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-07-22T16:36:02Z

Test build #1160 has finished for PR 7480 at commit 6294b1e.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2015-07-22T18:05:45Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java

  public boolean anyNull() {
-    return BitSetMethods.anySet(baseObject, baseOffset, bitSetWidthInBytes);
+    return BitSetMethods.anySet(baseObject, baseOffset, bitSetWidthInBytes / 8);


Can you add a unit test for this? I'd imagine it affects correctness.
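A small sketch of why the unit matters, assuming (as the division by 8 in the diff suggests) that anySet expects the bitmap width in 8-byte words rather than in bytes. `BitSetSketch` and its `long[]` row layout are hypothetical simplifications, not the real BitSetMethods or UnsafeRow code.

```java
// Hypothetical simplification: a row is a long[] whose leading words hold the
// null bitmap and whose remaining words hold the field values.
final class BitSetSketch {
    // Returns true if any bit is set in the first widthInWords 64-bit words.
    static boolean anySet(long[] words, int widthInWords) {
        for (int i = 0; i < widthInWords; i++) {
            if (words[i] != 0) {
                return true;
            }
        }
        return false;
    }

    // Bytes needed for a null bitmap covering numFields fields,
    // rounded up to a whole number of 8-byte words.
    static int bitSetWidthInBytes(int numFields) {
        return ((numFields + 63) / 64) * 8;
    }
}
```

For a 3-field row the bitmap is 8 bytes, which is one word. Passing 8 as the word count scans seven words of field data after the bitmap, so a row with no nulls but non-zero field values is reported as containing a null. Dividing by 8 restricts the scan to the bitmap itself.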

SparkQA · 2015-07-22T19:25:41Z

Test build #1171 has finished for PR 7480 at commit 6294b1e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2015-07-22T20:03:28Z

Merged this into master; I will address these comments in a follow-up PR.

rxin · 2015-07-22T20:08:51Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashOuterJoin.scala

+    if (supportUnsafe) {
+      UnsafeProjection.create(self.schema)
+    } else {
+      new Projection {


I think you just want Scala's "identity" here.

This PR introduces BytesToBytesMap to UnsafeHashedRelation and uses it in the executor for better performance. It serializes all the keys and values from the Java HashMap and puts them into a BytesToBytesMap while deserializing. All the values for the same key are stored contiguously for better memory locality.

This PR also addresses the comments on #7480 and does some cleanup.

Author: Davies Liu <davies@databricks.com>

Closes #7592 from davies/unsafe_map2 and squashes the following commits:

42c578a [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_map2
fd09528 [Davies Liu] remove thread local cache and update docs
1c5ad8d [Davies Liu] fix test
5eb1b5a [Davies Liu] address comments in #7480
46f1f22 [Davies Liu] fix style
fc221e0 [Davies Liu] use BytesToBytesMap for broadcast join
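The idea of flattening a hash map so that all values for one key sit next to each other can be sketched as follows. This is an illustration under stated assumptions, not the BytesToBytesMap implementation: `FlatRelationSketch`, its string keys, and its length-prefixed layout are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical layout: [numKeys] then, per key,
// [keyLen][key bytes][numValues]([valueLen][value bytes])*
final class FlatRelationSketch {
    // Write every key followed immediately by all of its values.
    static byte[] serialize(Map<String, List<byte[]>> table) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(table.size());
            for (Map.Entry<String, List<byte[]>> e : table.entrySet()) {
                byte[] key = e.getKey().getBytes(StandardCharsets.UTF_8);
                out.writeInt(key.length);
                out.write(key);
                out.writeInt(e.getValue().size());
                for (byte[] v : e.getValue()) {
                    out.writeInt(v.length);
                    out.write(v);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    // One pass over the flat array, recording where each key's value run starts.
    static Map<String, Integer> index(byte[] flat) {
        ByteBuffer buf = ByteBuffer.wrap(flat);
        Map<String, Integer> offsets = new HashMap<>();
        int numKeys = buf.getInt();
        for (int k = 0; k < numKeys; k++) {
            byte[] key = new byte[buf.getInt()];
            buf.get(key);
            offsets.put(new String(key, StandardCharsets.UTF_8), buf.position());
            int numValues = buf.getInt();
            for (int i = 0; i < numValues; i++) {
                int len = buf.getInt();
                buf.position(buf.position() + len); // skip over the value bytes
            }
        }
        return offsets;
    }

    // Read the contiguous run of values that starts at the given offset.
    static List<byte[]> values(byte[] flat, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(flat);
        buf.position(offset);
        int numValues = buf.getInt();
        List<byte[]> result = new ArrayList<>(numValues);
        for (int i = 0; i < numValues; i++) {
            byte[] v = new byte[buf.getInt()];
            buf.get(v);
            result.add(v);
        }
        return result;
    }
}
```

A lookup then touches one contiguous range of the array instead of chasing a pointer per value, which is the memory-locality benefit the commit message refers to.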

Unsafe HashJoin

bea4a50

remove println

95d0762

rxin reviewed Jul 17, 2015

fix tests

6acbb11

Davies Liu added 2 commits July 20, 2015 12:49

fix style

184b852

Merge branch 'master' of github.com:apache/spark into unsafe_join

a6c0b7d

JoshRosen reviewed Jul 20, 2015

use UnsafeRow in SemiJoin

60371f2

JoshRosen reviewed Jul 20, 2015

address comments

ab1690f

JoshRosen reviewed Jul 20, 2015

davies reviewed Jul 21, 2015

address comments

84c9807

Davies Liu added 2 commits July 21, 2015 17:12

fix test

dede020

Merge branch 'master' of github.com:apache/spark into unsafe_join

10583f1

fix projection

6294b1e

davies mentioned this pull request Jul 22, 2015

[SPARK-9247] [SQL] Use BytesToBytesMap for broadcast join #7592

Closed

rxin reviewed Jul 22, 2015

asfgit closed this in e0b7ba5 Jul 22, 2015

rxin reviewed Jul 22, 2015

davies pushed a commit to davies/spark that referenced this pull request Jul 23, 2015

address comments in apache#7480

5eb1b5a
