[SPARK-24999][SQL]Reduce unnecessary 'new' memory operations #21968
Conversation
val loc = new map.Location // this could be allocated in stack
binaryMap.safeLookup(unsafeKey.getBaseObject, unsafeKey.getBaseOffset,
  unsafeKey.getSizeInBytes, loc, unsafeKey.hashCode())
val loc = map.lookup(unsafeKey.getBaseObject, unsafeKey.getBaseOffset,
IIUC, this change makes this part thread-unsafe. Is it OK?
This lookup is safe, and it is different from get(key: InternalRow).
Before this PR, loc is allocated at each call of getValue(). After this PR, loc will be shared within each binaryMap that is passed to a constructor of UnsafeHashedRelation. Is this behavior change safe?
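To make the tradeoff being reviewed concrete, here is a minimal, hypothetical Java sketch (IntMap and Location are illustrative toy classes, not Spark's API): reusing one mutable location object per map saves an allocation on every lookup, but each lookup overwrites the previous result, so concurrent lookups on the same map instance become unsafe.

```java
import java.util.HashMap;
import java.util.Map;

class Location {
    boolean found;
    int value;
}

class IntMap {
    private final Map<Integer, Integer> data = new HashMap<>();
    // Shared across all lookups on this map: no per-call allocation,
    // but concurrent lookups on the same instance would race on it.
    private final Location loc = new Location();

    void put(int k, int v) { data.put(k, v); }

    Location lookup(int k) {
        Integer v = data.get(k);
        loc.found = (v != null);
        loc.value = (v != null) ? v : -1;
        return loc; // caller must consume the result before the next lookup
    }
}

public class Main {
    public static void main(String[] args) {
        IntMap m = new IntMap();
        m.put(1, 10);
        Location a = m.lookup(1);
        Location b = m.lookup(2);
        // a and b are the SAME object: the first result was overwritten.
        System.out.println((a == b) + " " + a.found);
    }
}
```

This is why the question above matters: sharing loc is fine if each binaryMap is only probed by one thread at a time.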
@@ -44,6 +44,12 @@ class RowBasedHashMapGenerator(
groupingKeySchema, bufferSchema) {

override protected def initializeAggregateHashMap(): String = {
val numVarLenFields = groupingKeys.map(_.dataType).count {
Nit: can't this just be .count(!UnsafeRow.isFixedLength(_))?
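For readers outside Scala, the nit is only about how to phrase the count of grouping-key types that are not fixed-length. A hedged Java sketch of the same computation (isFixedLength here is a toy stand-in for UnsafeRow.isFixedLength, with made-up type names):

```java
import java.util.List;
import java.util.Set;

public class Main {
    // Toy stand-in for UnsafeRow.isFixedLength: primitive-width types are
    // fixed-length; string/binary payloads are variable-length.
    static boolean isFixedLength(String dataType) {
        return Set.of("int", "long", "double", "boolean").contains(dataType);
    }

    public static void main(String[] args) {
        List<String> keyTypes = List.of("int", "string", "long", "binary");
        // Same shape as the Scala snippet: count the NON-fixed-length types.
        long numVarLenFields =
            keyTypes.stream().filter(t -> !isFixedLength(t)).count();
        System.out.println(numVarLenFields);
    }
}
```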
Force-pushed e223446 to e0748e1
@@ -48,6 +48,8 @@ class RowBasedHashMapGenerator(
val keySchema = ctx.addReferenceObj("keySchemaTerm", groupingKeySchema)
val valueSchema = ctx.addReferenceObj("valueSchemaTerm", bufferSchema)

val numVarLenFields = groupingKeys.map(_.dataType).count(!UnsafeRow.isFixedLength(_))
Do not remove the TODO comment below.
@@ -141,9 +141,6 @@ class RowBasedHashMapGenerator(
| if (buckets[idx] == -1) {
| if (numRows < capacity && !isBatchFull) {
| // creating the unsafe for new entry
| org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter
| = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(
| ${groupingKeySchema.length}, ${numVarLenFields * 32});
| agg_rowWriter.reset(); //TODO: investigate if reset or zeroout are actually needed
| agg_rowWriter.zeroOutNullBytes();
btw, if groupingKeySchema has no nullable field, can we drop agg_rowWriter.zeroOutNullBytes()?
The change looks reasonable to me, so can you trigger tests? @gatorsmile @cloud-fan @hvanhovell
Test build #4294 has finished for PR 21968 at commit
Force-pushed e0748e1 to e4cec60
cc @maropu, @kiszk, @cloud-fan
can we do the same thing for the columnar one?
cc @cloud-fan I'm sorry, I look
@@ -141,11 +151,8 @@ class RowBasedHashMapGenerator(
| if (buckets[idx] == -1) {
| if (numRows < capacity && !isBatchFull) {
| // creating the unsafe for new entry
Remove or update this comment?
ok, updated. thanks.
@@ -141,11 +151,8 @@ class RowBasedHashMapGenerator(
| if (buckets[idx] == -1) {
| if (numRows < capacity && !isBatchFull) {
| // creating the unsafe for new entry
| org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter agg_rowWriter
| = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(
| ${groupingKeySchema.length}, ${numVarLenFields * 32});
| agg_rowWriter.reset(); //TODO: investigate if reset or zeroout are actually needed
I think the reset and zero-out are now needed, so maybe remove this TODO?
ok.
@@ -48,6 +48,12 @@ class RowBasedHashMapGenerator(
val keySchema = ctx.addReferenceObj("keySchemaTerm", groupingKeySchema)
val valueSchema = ctx.addReferenceObj("valueSchemaTerm", bufferSchema)

val numVarLenFields = groupingKeys.map(_.dataType).count {
groupingKeys.map(_.dataType).count(dt => !UnsafeRow.isFixedLength(dt))
ok, thanks.
@@ -130,6 +134,12 @@ class RowBasedHashMapGenerator(
}
}.mkString(";\n")

val nullByteWriter = if (groupingKeySchema.map(_.nullable).forall(_ == false)) {
maybe name it resetNullBits?
ok, thanks.
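The idea behind the nullByteWriter/resetNullBits snippet above is conditional code generation: only emit the zeroOutNullBytes() call into the generated code when some grouping key is nullable. A minimal Java sketch of that decision (the helper name and flag list are illustrative, not Spark's API):

```java
import java.util.List;

public class Main {
    // Emit the null-bit reset line only when at least one field is nullable;
    // otherwise emit nothing, saving work in the generated hot loop.
    static String resetNullBits(List<Boolean> nullableFlags) {
        boolean anyNullable = nullableFlags.stream().anyMatch(n -> n);
        return anyNullable ? "agg_rowWriter.zeroOutNullBytes();" : "";
    }

    public static void main(String[] args) {
        System.out.println(resetNullBits(List.of(false, true)));
        System.out.println(resetNullBits(List.of(false, false)).isEmpty());
    }
}
```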
Force-pushed e4cec60 to 49703e8
@@ -48,6 +48,8 @@ class RowBasedHashMapGenerator(
val keySchema = ctx.addReferenceObj("keySchemaTerm", groupingKeySchema)
val valueSchema = ctx.addReferenceObj("valueSchemaTerm", bufferSchema)

val numVarLenFields = groupingKeys.map(_.dataType).count(dt => !UnsafeRow.isFixedLength(dt))
super nit: .count(!UnsafeRow.isFixedLength(_))?
Please keep the comment // TODO: consider large decimal and interval type below.
@cloud-fan We'd like to discuss: how should we modify this?
The code style doesn't matter; both are fine. But let's keep the comment.
Force-pushed 49703e8 to 3e4f2e4
LGTM cc @cloud-fan @hvanhovell @maropu
thanks, merging to master!
@cloud-fan thanks.
What changes were proposed in this pull request?
This PR improves the code generated for the fast hash map: there is no need to allocate a new block of memory for every new entry, because the UnsafeRow's memory can be reused.
How was this patch tested?
Existing test cases.
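The core optimization can be sketched as follows: a hypothetical RowWriter (loosely modeled on UnsafeRowWriter, but not Spark's class) whose buffer is allocated once and rewound with reset() for each new entry, instead of constructing a fresh writer per entry as the old generated code did.

```java
import java.util.Arrays;

class RowWriter {
    static final int NULL_BITS = 8;   // fixed-width region tracking null fields
    private final byte[] buffer;
    private int cursor;

    RowWriter(int capacity) { buffer = new byte[capacity]; }

    // Rewind past the null-bit region instead of allocating a new buffer.
    void reset() { cursor = NULL_BITS; }

    // Clear stale null bits left over from the previous entry.
    void zeroOutNullBytes() { Arrays.fill(buffer, 0, NULL_BITS, (byte) 0); }

    void writeLong(long v) {
        for (int i = 0; i < 8; i++) buffer[cursor + i] = (byte) (v >>> (8 * i));
        cursor += 8;
    }

    int size() { return cursor; }
}

public class Main {
    public static void main(String[] args) {
        RowWriter writer = new RowWriter(64); // created once, outside the loop
        for (long key = 0; key < 3; key++) {
            writer.reset();             // per-entry: rewind, no new allocation
            writer.zeroOutNullBytes();
            writer.writeLong(key);
        }
        System.out.println(writer.size());
    }
}
```

The reset()/zeroOutNullBytes() pair is exactly why the reviewers above asked to keep the TODO and to skip the null-bit reset when no grouping key is nullable.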