[SPARK-9024] Unsafe HashJoin/HashOuterJoin/HashSemiJoin #7480

Closed
wants to merge 19 commits into from

Contributor

davies commented Jul 17, 2015

This PR introduces unsafe versions (using UnsafeRow) of HashJoin, HashOuterJoin and HashSemiJoin, covering both the broadcast and shuffle variants (except FullOuterJoin, which is better implemented using SortMergeJoin).

It uses a HashMap to store UnsafeRow for now; this will change to BytesToBytesMap for better performance (in another PR).

Contributor Author

davies commented Jul 17, 2015

@JoshRosen @rxin Please take an early look.

SparkQA commented Jul 17, 2015

Test build #37665 has finished for PR 7480 at commit bea4a50.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class UnsafeColumnWriter

SparkQA commented Jul 17, 2015

Test build #37666 has finished for PR 7480 at commit 95d0762.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class UnsafeColumnWriter

sizeEstimate: Int = 64): HashedRelation = {

// TODO: Use BytesToBytesMap.
val hashTable = new JavaHashMap[UnsafeRow, CompactBuffer[UnsafeRow]](sizeEstimate)
Contributor

it might be ok to just build this using a java hashmap first, and then build a giant byte array from this.

Contributor Author

I'm thinking we can re-order the values in BytesToBytesMap during serialization.
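
The idea discussed in this thread (build the table with a Java HashMap first, then lay it out in one flat byte array with all values for a key stored next to each other) could be sketched roughly as follows. This is only an illustration: `FlatRelation`, its byte layout, and the linear-scan `lookup` are hypothetical and are not Spark's BytesToBytesMap or UnsafeHashedRelation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Spark code: flattens a hash table into one byte array.
final class FlatRelation {
    // Assumed layout per key: [keyLen][key bytes][numValues] then [valueLen][value bytes]...
    // so every value of a key sits contiguously after that key.
    static byte[] flatten(Map<String, List<byte[]>> table) {
        int size = 0;
        for (Map.Entry<String, List<byte[]>> e : table.entrySet()) {
            size += 4 + e.getKey().getBytes(StandardCharsets.UTF_8).length + 4;
            for (byte[] v : e.getValue()) {
                size += 4 + v.length;
            }
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        for (Map.Entry<String, List<byte[]>> e : table.entrySet()) {
            byte[] key = e.getKey().getBytes(StandardCharsets.UTF_8);
            out.putInt(key.length).put(key).putInt(e.getValue().size());
            for (byte[] v : e.getValue()) {
                out.putInt(v.length).put(v);
            }
        }
        return out.array();
    }

    // Linear scan for brevity; a real map would hash the key to find its offset.
    static List<byte[]> lookup(byte[] flat, String key) {
        byte[] wanted = key.getBytes(StandardCharsets.UTF_8);
        ByteBuffer in = ByteBuffer.wrap(flat);
        List<byte[]> result = new ArrayList<>();
        while (in.hasRemaining()) {
            byte[] k = new byte[in.getInt()];
            in.get(k);
            int numValues = in.getInt();
            boolean match = Arrays.equals(k, wanted);
            for (int i = 0; i < numValues; i++) {
                byte[] v = new byte[in.getInt()];
                in.get(v);
                if (match) {
                    result.add(v);
                }
            }
            if (match) {
                return result;
            }
        }
        return result;
    }
}
```

Because the values for one key are written back to back, a probe reads them in a single sequential pass, which is the memory-locality benefit the thread is after.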

SparkQA commented Jul 20, 2015

Test build #37855 has finished for PR 7480 at commit 6acbb11.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class UnsafeColumnWriter

public boolean equals(Object other) {
if (other instanceof UnsafeRow) {
UnsafeRow o = (UnsafeRow) other;
return ByteArrayMethods.arrayEquals(baseObject, baseOffset, o.baseObject, o.baseOffset,
Contributor

I think that we should check whether the rows' sizeInBytes are equal before attempting to compare their contents.
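
A minimal sketch of the suggested check, using a hypothetical `RowView` class as a stand-in for UnsafeRow (it wraps a plain `byte[]` rather than Spark's base object and offset, and does not use ByteArrayMethods):

```java
// Hypothetical stand-in for UnsafeRow: a view over a region of a byte array.
final class RowView {
    final byte[] base;
    final int offset;
    final int sizeInBytes;

    RowView(byte[] base, int offset, int sizeInBytes) {
        this.base = base;
        this.offset = offset;
        this.sizeInBytes = sizeInBytes;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof RowView)) {
            return false;
        }
        RowView o = (RowView) other;
        // Compare sizes first: rows of different lengths can never be equal, and
        // comparing sizeInBytes bytes against a shorter row could read past its end.
        if (sizeInBytes != o.sizeInBytes) {
            return false;
        }
        for (int i = 0; i < sizeInBytes; i++) {
            if (base[offset + i] != o.base[o.offset + i]) {
                return false;
            }
        }
        return true;
    }

    @Override
    public int hashCode() {
        // Kept consistent with equals: depends only on the size and the bytes in the view.
        int h = sizeInBytes;
        for (int i = 0; i < sizeInBytes; i++) {
            h = 31 * h + base[offset + i];
        }
        return h;
    }
}
```

The size comparison is cheap and also guards the byte-wise comparison, which otherwise trusts that both regions are at least `sizeInBytes` long.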

SparkQA commented Jul 20, 2015

Test build #37857 has finished for PR 7480 at commit 184b852.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class UnsafeColumnWriter


val values = hashTable.get(unsafeKey)
// Return GenericInternalRow to work with other JoinRow, which
// TODO(davies): return UnsafeRow once we have UnsafeJoinRow.
Contributor

If we're not going to implement this as part of this patch, then let's make sure to file a followup JIRA under the Tungsten umbrella.

Contributor Author

It's already on last week's TODO list.

@@ -83,12 +83,23 @@ abstract class UnsafeProjection extends Projection {
}

object UnsafeProjection {
def canSupport(schema: StructType): Boolean = canSupport(schema.fields.map(_.dataType))
def canSupport(types: Seq[DataType]): Boolean = types.forall(UnsafeColumnWriter.canEmbed(_))
Contributor

You could even add a canSupport(exprs: Seq[Expression]) to be able to save some characters elsewhere.


def this() = this(null, null, null) // Needed for serialization

// UnsafeProjection is not thread safe
Contributor

Can you elaborate on why instances of UnsafeProjection are not thread-safe? I don't see any mention of this in the Scaladoc for UnsafeProjection, so we should probably update it to make any thread-safety concerns clearer.

Does UnsafeHashedRelation have to be thread-safe? I thought it was only used in the context of a single task.
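
One common reason a projection-like object is not thread-safe is that it reuses a single mutable output buffer across calls. Whether that is the exact concern behind the code comment is an assumption here, since the PR does not spell it out. The hypothetical `ReusingProjection` below (not Spark's UnsafeProjection) shows the hazard:

```java
// Hypothetical sketch, not Spark code: a projection that reuses its output buffer.
final class ReusingProjection {
    // Shared mutable state: every call writes into, and returns, this same array.
    private final int[] buffer = new int[1];

    int[] apply(int input) {
        buffer[0] = input * 2;
        return buffer;
    }
}
```

Two calls return the same array, so a second caller (for example another thread sharing the instance) overwrites the first caller's result. A caller that needs to keep a result has to copy it, e.g. `p.apply(3).clone()`, or each thread needs its own instance.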

c => if (!c.outputsUnsafeRows) ConvertToUnsafe(c) else c
// If this operator's children produce both unsafe and safe rows,
// convert everything to unsafe rows if all of their schemas are supported by UnsafeRow
if (operator.children.forall(c => UnsafeProjection.canSupport(c.schema))) {
Contributor Author

ping @JoshRosen

Contributor

This change looks good to me.

Contributor Author

davies commented Jul 21, 2015

@JoshRosen Thanks, I will merge this once Jenkins is happy.

SparkQA commented Jul 21, 2015

Test build #37979 has finished for PR 7480 at commit a05b4f6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 21, 2015

Test build #37985 has finished for PR 7480 at commit 84c9807.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 22, 2015

Test build #37999 has finished for PR 7480 at commit dede020.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 22, 2015

Test build #38028 has finished for PR 7480 at commit 6294b1e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 22, 2015

Test build #1154 has finished for PR 7480 at commit 6294b1e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 22, 2015

Test build #1155 has finished for PR 7480 at commit 6294b1e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 22, 2015

Test build #1156 has finished for PR 7480 at commit 6294b1e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 22, 2015

Test build #1160 has finished for PR 7480 at commit 6294b1e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

public boolean anyNull() {
- return BitSetMethods.anySet(baseObject, baseOffset, bitSetWidthInBytes);
+ return BitSetMethods.anySet(baseObject, baseOffset, bitSetWidthInBytes / 8);
Contributor

can you add a unit test for this? i'd imagine it affects correctness
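
A sketch of the kind of test being asked for, under the assumption that the helper takes the bit set width in 8-byte words (which is what the `/ 8` in the change above suggests). `WordBitSet` and its row layout are hypothetical stand-ins, not Spark's BitSetMethods or UnsafeRow:

```java
// Hypothetical stand-in for a word-based bit set helper; not Spark's BitSetMethods.
final class WordBitSet {
    // Returns true if any bit is set in the first widthInWords 64-bit words.
    static boolean anySet(long[] words, int widthInWords) {
        for (int i = 0; i < widthInWords; i++) {
            if (words[i] != 0) {
                return true;
            }
        }
        return false;
    }

    // Assumed sizing rule: one null bit per field, rounded up to whole 8-byte words.
    static int bitSetWidthInBytes(int numFields) {
        return ((numFields + 63) / 64) * 8;
    }

    // Builds a toy row: the null bit set (all zero, so no nulls) followed by
    // one non-zero field value, padded so the buggy over-long scan stays in bounds.
    static long[] rowWithoutNulls() {
        long[] row = new long[9];
        row[1] = 42L; // field data directly after the one-word bit set
        return row;
    }
}
```

With one field the bit set is 8 bytes, i.e. one word. Passing the byte count as the word count makes the scan run into the field data and report a null that is not there, so a test along these lines would compare `anySet(row, bytes / 8)` (expected false) against the row's actual null bits.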

SparkQA commented Jul 22, 2015

Test build #1171 has finished for PR 7480 at commit 6294b1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor Author

davies commented Jul 22, 2015

Merged this into master; I will address these comments in a follow-up PR.

asfgit closed this in e0b7ba5 Jul 22, 2015

if (supportUnsafe) {
UnsafeProjection.create(self.schema)
} else {
new Projection {
Contributor

i think you just want scala's "identity" here

davies pushed a commit to davies/spark that referenced this pull request Jul 23, 2015
asfgit pushed a commit that referenced this pull request Jul 28, 2015
This PR introduces BytesToBytesMap to UnsafeHashedRelation and uses it in executors for better performance.

It serializes all the keys and values from the Java HashMap and puts them into a BytesToBytesMap while deserializing. All the values for the same key are stored contiguously for better memory locality.

This PR also addresses the comments on #7480 and does some cleanup.

Author: Davies Liu <davies@databricks.com>

Closes #7592 from davies/unsafe_map2 and squashes the following commits:

42c578a [Davies Liu] Merge branch 'master' of github.com:apache/spark into unsafe_map2
fd09528 [Davies Liu] remove thread local cache and update docs
1c5ad8d [Davies Liu] fix test
5eb1b5a [Davies Liu] address comments in #7480
46f1f22 [Davies Liu] fix style
fc221e0 [Davies Liu] use BytesToBytesMap for broadcast join
4 participants