[SPARK-12888][SQL][follow-up] benchmark the new hash expression #10917
Conversation
@nongli maybe we should just use the simpler multiplication and addition?
@cloud-fan Simple is just a single int, right? It's not even doing anything in the previous case?
@nongli It's not doing anything to get the hash code of the int field, but it does a simple multiplication and addition to get the hash code of the row.
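The multiply-and-add row hash being discussed can be sketched as follows. This is a hypothetical illustration in the style of `java.util.Arrays.hashCode`, not Spark's actual implementation; the seed 17 and the class name are assumptions for the example:

```java
// Hypothetical sketch of a multiply-and-add row hash over int fields,
// in the style of java.util.Arrays.hashCode. Not Spark's actual code.
public class RowHashSketch {
    static int simpleRowHash(int[] fields) {
        int h = 17;  // arbitrary non-zero seed (assumption)
        for (int v : fields) {
            // An int field's hash code is the value itself, so there is
            // no per-field work beyond this single combine step.
            h = 31 * h + v;
        }
        return h;
    }
}
```

For a single-int row this reduces to one multiply and one add, which is why the interpreted path is hard to beat on the `simple` case.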
Test build #50072 has finished for PR 10917 at commit
LGTM. We can have different hash functions with different entropy later, but this seems okay to me.
No, let's re-run them when the results are easier to explain. Can you also tune the iteration count to a higher value? The harness does some rounding with the less significant digits.
Test build #50634 has started for PR 10917 at commit
test this please
Test build #50640 has finished for PR 10917 at commit
retest this please
Test build #50642 has finished for PR 10917 at commit
So many flaky tests...
retest this please
Test build #50653 has finished for PR 10917 at commit
```
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
Hash For map:                  Avg Time(ms)    Avg Rate(M/s)    Relative Rate
-------------------------------------------------------------------------------
interpreted version                64709.73             0.00           1.00 X
```
How long does this benchmark take to run? This looks really long. I think we should keep benchmarks running in a low number of seconds total if possible.
cc @cloud-fan on follow-up
Force-pushed from e443f32 to 315af8c.
```diff
@@ -124,7 +124,7 @@ private[spark] object Benchmark {
   }
   val best = runTimes.min
   val avg = runTimes.sum / iters
-  Result(avg / 1000000, num / (best / 1000), best / 1000000)
+  Result(avg / 1000000, num.toDouble / (best / 1000), best / 1000000)
```
We need to keep the precision here. The `bestMs`/`avgMs` can be well controlled within an appropriate range, but the `rate` can't. And we use `rate` as a divisor later, so if `rate` is small (assume we are benchmarking some slow operations), we will get a large deviation. BTW, we use `%10.1f` to print `rate`, but previously `rate` was always integral.
cc @davies
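A small illustration of the precision issue with hypothetical numbers (not taken from the benchmark): with all-integer arithmetic, a slow operation's rate truncates to zero, while widening to double (as `num.toDouble` does in the patch) keeps the fraction:

```java
// Illustration of the precision fix in the rate computation.
// "bestNs" plays the role of the best run time in nanoseconds; the
// concrete numbers in the test are made up for demonstration.
public class RatePrecision {
    // Old behavior: integer division truncates toward zero.
    static long truncatedRate(long num, long bestNs) {
        return num / (bestNs / 1000);
    }
    // New behavior: widen the dividend to double first, as in num.toDouble.
    static double preciseRate(long num, long bestNs) {
        return (double) num / (bestNs / 1000);
    }
}
```

Anything later divided by the truncated rate (for example the relative-speed column) inherits the full error, which is the "large deviation" mentioned above.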
Hot fixed in master.
Test build #50858 has finished for PR 10917 at commit
```
Hash For normal:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
interpreted version             2209 / 2271          0.9        1053.4       1.0X
codegen version                 1887 / 2018          1.1         899.9       1.2X
```
Why is the generated version slower?
The codegen version is 20% faster, because it doesn't have runtime reflection.
Sorry, I read it wrong (maybe I commented on the wrong line; I meant the previous result).
Test build #50899 has finished for PR 10917 at commit
LGTM, merging this into master, thanks!
Adds the benchmark results as comments.

The codegen version is slower than the interpreted version for the `simple` case because of 3 reasons:

1. `Murmur3_x86_32.hashInt` vs simple multiplication and addition.
2. A `GenerateHasher` that can generate code to return the hash value directly got about a 60% speed up for the `simple` case; is it worth it?
3. The `simple` case only has one int field, so the runtime reflection may be removed by branch prediction, which makes the interpreted version faster.

The `array` case is also slow for similar reasons, e.g. array elements are of the same type, so the interpreted version can probably get rid of runtime reflection via branch prediction.
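To make the cost gap in reason 1 concrete, here is a standalone reimplementation of the MurmurHash3 x86_32 single-int path (the same published algorithm that `Murmur3_x86_32.hashInt` implements; the constants are the standard MurmurHash3 ones), next to the interpreted version's one-multiply-one-add combine. The class and method names are for illustration only:

```java
// Standalone sketch contrasting the two hashing costs discussed above.
public class HashCost {
    // MurmurHash3 x86_32 for a single 4-byte int, using the standard
    // published constants: mix the word, mix into the running hash,
    // then run the finalization avalanche.
    static int murmur3HashInt(int input, int seed) {
        int k1 = input * 0xcc9e2d51;
        k1 = Integer.rotateLeft(k1, 15);
        k1 *= 0x1b873593;
        int h1 = seed ^ k1;
        h1 = Integer.rotateLeft(h1, 13);
        h1 = h1 * 5 + 0xe6546b64;
        h1 ^= 4;           // total length in bytes
        h1 ^= h1 >>> 16;   // finalization (avalanche) mix
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // The interpreted version's per-field combine: one multiply, one add.
    static int simpleCombine(int h, int v) {
        return 31 * h + v;
    }
}
```

The Murmur path costs several multiplies, rotates, and xors per int, so for a row with a single int field the extra mixing alone can outweigh what codegen saves by avoiding reflection.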