[SPARK-12950] [SQL] Improve lookup of BytesToBytesMap in aggregate #11010

davies · 2016-02-02T00:50:24Z

This PR improves the lookup of BytesToBytesMap by:

1. Generating code to calculate the hash code of the grouping keys.
2. Not using MemoryLocation: the baseObject and offset for the key and value are fetched directly (removing the indirection).
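
To make the second point concrete, here is a minimal sketch of the two access styles. All class and method names are hypothetical and chosen for illustration; this is not the BytesToBytesMap API, and a plain `byte[]` stands in for the base object.

```java
import java.nio.ByteBuffer;

// Hypothetical names throughout; this is not the BytesToBytesMap API.
class DirectAccessSketch {
    // Wrapper in the spirit of a "memory location": an extra hop for the caller.
    static final class Address {
        final byte[] base;
        final int offset;
        Address(byte[] base, int offset) { this.base = base; this.offset = offset; }
    }

    // A looked-up slot that knows where its key lives.
    static final class Slot {
        private final byte[] keyBase;
        private final int keyOffset;
        Slot(byte[] keyBase, int keyOffset) { this.keyBase = keyBase; this.keyOffset = keyOffset; }

        // Indirect style: callers receive a wrapper and unpack it.
        Address keyAddress() { return new Address(keyBase, keyOffset); }

        // Direct style: callers read base and offset without a wrapper.
        byte[] keyBase() { return keyBase; }
        int keyOffset() { return keyOffset; }
    }

    // Reads eight bytes at the given absolute offset.
    static long readLong(byte[] base, int offset) {
        return ByteBuffer.wrap(base).getLong(offset);
    }

    static long readKeyIndirect(Slot slot) {
        Address a = slot.keyAddress();
        return readLong(a.base, a.offset);
    }

    static long readKeyDirect(Slot slot) {
        return readLong(slot.keyBase(), slot.keyOffset());
    }

    public static void main(String[] args) {
        byte[] page = new byte[32];
        ByteBuffer.wrap(page).putLong(8, 42L);
        Slot slot = new Slot(page, 8);
        System.out.println(readKeyIndirect(slot) + " " + readKeyDirect(slot));
    }
}
```

Both paths read the same bytes; the direct style only skips the intermediate object on the per-row path.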

rxin · 2016-02-02T04:55:35Z

sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala

-      Without codegen             7775.53            26.97         1.00 X
-      With codegen                 342.15           612.94        22.73 X
+      Without codegen                         5488.16            38.21         1.00 X
+      With codegen                             531.08           394.88        10.33 X


what's causing the big drop?

The benchmark is not that stable; this number varies between 10 and 20. Maybe 200M is still not enough, so I will increase it to 500M.

Probably want to increase it by 10x to amortize the fixed overhead. The runtime was 342 ms, which is too small.

Then we would wait more than 2 minutes for this benchmark to finish. I will send another PR to update these benchmarks.
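
The arithmetic behind this exchange can be sketched as follows. The four times are the ones quoted in the diff above; the fixed-overhead model and its numbers (100 ms fixed cost, 1 ns per row) are assumptions made up for illustration, not measurements from this PR.

```java
class BenchmarkRatioSketch {
    // Speedup as the ratio of the baseline time to the candidate time.
    static double speedup(double baselineMs, double candidateMs) {
        return baselineMs / candidateMs;
    }

    // Hypothetical model: measured time = fixed overhead + per-row cost * rows.
    static double measuredMs(double fixedMs, double nsPerRow, long rows) {
        return fixedMs + nsPerRow * rows / 1e6;
    }

    public static void main(String[] args) {
        // Times quoted in the benchmark diff above.
        System.out.printf("before: %.2f X%n", speedup(7775.53, 342.15));
        System.out.printf("after:  %.2f X%n", speedup(5488.16, 531.08));

        // Assumed numbers, for illustration only.
        double fixedMs = 100.0;
        double nsPerRow = 1.0;
        for (long rows : new long[] {200_000_000L, 2_000_000_000L}) {
            double total = measuredMs(fixedMs, nsPerRow, rows);
            System.out.printf("rows=%d total=%.0f ms, fixed share=%.1f%%%n",
                rows, total, 100.0 * fixedMs / total);
        }
    }
}
```

Under this model, a tenfold larger input shrinks the share of any fixed cost in the total, which is why a run of a few hundred milliseconds gives a noisy ratio.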

SparkQA · 2016-02-02T05:03:20Z

Test build #2487 has finished for PR 11010 at commit 4cefbc5.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-02-02T05:10:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala

+  * A function that calculates hash value for a group of expressions, which basically XOR all the
+  * hash code of children expressions together.
+  *
+  * Note: This is used for hash map for aggreagte, designed for performance (has worse


Is UnsafeRow.hashCode slower than this?

UnsafeRow.hashCode calls Murmur3, which is still slow.
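
As a rough sketch of the trade-off being discussed: the XOR-based hash described in the doc comment needs one operation per key, while a Murmur3-style hash adds multiply and shift steps. The `mixedHash` function below is hypothetical (it reuses Murmur3's 32-bit finalizer as a mixing step) and is not Spark's implementation.

```java
class XorHashSketch {
    // Weak hash: XOR the per-key hash codes together.
    static int xorHash(long... keys) {
        int h = 0;
        for (long k : keys) {
            h ^= Long.hashCode(k);  // same as (int) (k ^ (k >>> 32))
        }
        return h;
    }

    // Murmur3's 32-bit finalizer, used here only as an example of a mixing step.
    static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Hypothetical mixed hash: fold each key in, then scramble the state.
    static int mixedHash(long... keys) {
        int h = 42;  // arbitrary seed chosen for this sketch
        for (long k : keys) {
            h = fmix32(31 * h + Long.hashCode(k));
        }
        return h;
    }

    public static void main(String[] args) {
        // XOR ignores key order, so these two hashes are equal.
        System.out.println(xorHash(1L, 2L) == xorHash(2L, 1L));
        // Equal keys cancel each other out under XOR.
        System.out.println(xorHash(7L, 7L));
        System.out.println(mixedHash(1L, 2L) + " vs " + mixedHash(2L, 1L));
    }
}
```

The XOR version is cheaper per row, but it maps many different key groups to the same value, which is the quality cost the doc comment alludes to.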

Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/TungstenAggregate.scala sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala

Conflicts: sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala

SparkQA · 2016-02-03T23:51:58Z

Test build #50705 has finished for PR 11010 at commit 5eff34b.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

Conflicts: sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala

SparkQA · 2016-02-04T01:53:55Z

Test build #50723 has finished for PR 11010 at commit 53a2dd4.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-04T05:42:51Z

Test build #50734 has finished for PR 11010 at commit 6c9ce88.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-04T07:58:44Z

Test build #50736 has finished for PR 11010 at commit 85f8d0e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public class TaskMemoryManager

nongli · 2016-02-04T19:05:21Z

Do you know how much of this is from the general cleanup and how much is from switching to a simpler hash? In my experience, using a very weak hash function can make things really bad if you don't account for it in other ways.

davies · 2016-02-05T00:10:28Z

@nongli As the benchmark shows, the weak hash function could save 10 ns per row, and the other changes may save 20 ns per row. I'm also not sure the weak hash function is good enough in this case. BTW, the hashCode of int/long in Java also uses this weak hash function, so it may not be that bad.
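
A small experiment illustrates the concern raised above. `Integer.hashCode` returns the value itself, so keys that share their low bits all land in one bucket of a power-of-two table unless something else scrambles them. The table size, key pattern, and use of Murmur3's finalizer as the scrambler are assumptions chosen for this sketch, not details of BytesToBytesMap.

```java
import java.util.HashSet;
import java.util.Set;

class WeakHashBucketsSketch {
    // Murmur3's 32-bit finalizer, used as an example of a mixing step.
    static int fmix32(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Hashes for the keys 0, stride, 2 * stride, ... with or without mixing.
    static int[] stridedHashes(int n, int stride, boolean mix) {
        int[] hashes = new int[n];
        for (int i = 0; i < n; i++) {
            int h = Integer.hashCode(i * stride);  // identity for int
            hashes[i] = mix ? fmix32(h) : h;
        }
        return hashes;
    }

    // Number of distinct buckets hit in a power-of-two table of the given size.
    static int bucketsUsed(int[] hashes, int tableSize) {
        Set<Integer> used = new HashSet<>();
        for (int h : hashes) {
            used.add(h & (tableSize - 1));
        }
        return used.size();
    }

    public static void main(String[] args) {
        int n = 1000;
        int tableSize = 1024;  // keys are multiples of 1024, so their low 10 bits are zero
        System.out.println("identity hash buckets: "
            + bucketsUsed(stridedHashes(n, tableSize, false), tableSize));
        System.out.println("mixed hash buckets:    "
            + bucketsUsed(stridedHashes(n, tableSize, true), tableSize));
    }
}
```

Whether real grouping keys look like this depends on the data; the point is only that an identity-style hash passes any pattern in the keys straight through to the bucket index.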

davies · 2016-02-09T22:32:33Z

@nongli I have reverted it to Murmur3 (we could figure out a faster hash function later); the improvement is now minor.

nongli · 2016-02-10T00:26:14Z

LGTM

SparkQA · 2016-02-10T00:31:25Z

Test build #50999 has finished for PR 11010 at commit 7f5852a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2016-02-10T00:41:05Z

Merging this into master, thanks!

Davies Liu added 2 commits February 1, 2016 16:43

improve BytesToBytesMap

64df01e

Merge branch 'master' of github.com:apache/spark into gen_map

4cefbc5

rxin reviewed Feb 2, 2016

cloud-fan reviewed Feb 2, 2016

Davies Liu added 3 commits February 3, 2016 09:46

Merge branch 'master' of github.com:apache/spark into gen_map

ddacfa6

Conflicts: sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala

fix conflict

5eff34b

Davies Liu added 4 commits February 3, 2016 17:25

Merge branch 'master' of github.com:apache/spark into gen_map

3b42154

Conflicts: sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala

fix mima

13b73bf

enable codegen for single operator

8f074ab

fix a bug in avgRate

53a2dd4

fix mima

85f8d0e

davies force-pushed the gen_map branch from 6c9ce88 to 85f8d0e Compare February 4, 2016 05:45

address comments

6ba0cf2

update benchmark

7f5852a

asfgit closed this in 0e5ebac Feb 10, 2016
