[SPARK-13131] [SQL] Use best and average time in benchmark #11018

Closed
wants to merge 6 commits

Conversation

@davies (Contributor) commented Feb 2, 2016

Best time is more stable than average time. This also adds a column for nanoseconds per row, which can be used to estimate the contribution of each component in a query.

Having best time and average time together gives more information (we can see a rough sense of the variance).

Rate, time per row, and relative speedup are all calculated using the best time.

The result looks like this (new format first, then the old format for comparison):

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
rang/filter/sum:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
rang/filter/sum codegen=false          14332 / 16646         36.0          27.8       1.0X
rang/filter/sum codegen=true              845 /  940        620.0           1.6      17.0X

rang/filter/aggregate:                Avg Time(ms)    Avg Rate(M/s)   Relative Rate
-----------------------------------------------------------------------------------
rang/filter/aggregate codegen=false       12509.22            41.91          1.00 X
rang/filter/aggregate codegen=true          846.38           619.45         14.78 X
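
(For context, a rough sketch of how a benchmark like the one above is typically defined with Spark's Benchmark utility; the row count and the case bodies here are illustrative assumptions, not the actual benchmark code in this PR.)

import org.apache.spark.util.Benchmark

// Illustrative only: N and the case bodies are placeholders.
val N = 500 << 20
val benchmark = new Benchmark("rang/filter/sum", N)
benchmark.addCase("rang/filter/sum codegen=false") { _ =>
  // run the query with whole-stage codegen disabled
}
benchmark.addCase("rang/filter/sum codegen=true") { _ =>
  // run the query with whole-stage codegen enabled
}
benchmark.run()  // prints a table like the one above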
@davies (Author) commented:
The query was changed from .count() to .groupBy().sum().collect(); with the previous query the speedup was 20X.
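
(For illustration, a hedged sketch of the query change being described; sqlContext, N, and the filter condition are placeholders rather than the exact benchmark code.)

// Assuming a SQLContext `sqlContext` and a row count `N` are in scope.
// Previous benchmark query:
sqlContext.range(N).filter("id > 100").count()
// New benchmark query (aggregation instead of count):
sqlContext.range(N).filter("id > 100").groupBy().sum().collect()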

@SparkQA commented Feb 2, 2016

Test build #50541 has finished for PR 11018 at commit a534e0e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented Feb 2, 2016

@davies I think this depends on the error distribution.

Let's assume that the measured running time of algorithm A is described by an additive model: a + W, where a is a constant indicating the ideal running time and W is a positive random variable describing system noise/overhead. We assume the same error distribution for algorithm B: b + W. Basically, we want to test which one is smaller (faster), a or b. One common way is to compare the sample means of a + W and b + W, whereas you want to compare the sample minimums of a + W and b + W.

If we agree on this model, which method is better mainly depends on the variance of the sample mean and the sample min (first order statistic). We know that the variance of the sample mean is of order O(1/n) (CLT), while the variance of the first order statistic is very sensitive to the distribution. If W follows a uniform distribution, the variance of the first order statistic is of order O(1/n^2), which is indeed better than that of the sample mean. However, if the error distribution has little mass near 0, the variance of the first order statistic could be very large. And this is very easy to verify numerically.

Certainly we can do many runs and draw the empirical error distribution out and then tell which one is better for this case. But without good knowledge of the error distribution, using sample mean is definitely a safe bet because we know the variance is of order O(1/n). If we want to avoid outliers, a common solution is to use the median, following similar arguments.
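
(To make the "easy to verify numerically" point concrete, here is a minimal Scala sketch, not from this PR, that compares the variance of the sample mean and the sample minimum under two noise distributions; the distributions, n, and the number of trials are arbitrary choices.)

import scala.util.Random

def variance(xs: Seq[Double]): Double = {
  val m = xs.sum / xs.size
  xs.map(x => (x - m) * (x - m)).sum / xs.size
}

// Variance of the sample mean vs. the sample min of W over many repeated experiments.
def simulate(noise: () => Double, n: Int = 10, trials: Int = 10000): (Double, Double) = {
  val means = Seq.fill(trials) { Seq.fill(n)(noise()).sum / n }
  val mins  = Seq.fill(trials) { Seq.fill(n)(noise()).min }
  (variance(means), variance(mins))
}

val rng = new Random(42)
// Uniform noise (plenty of mass near 0): Var(min) ~ O(1/n^2) beats Var(mean) ~ O(1/n).
println(simulate(() => rng.nextDouble()))
// Noise with little mass near 0 (W = U^(1/3), density ~ 3w^2 near 0): Var(min) is now the larger one.
println(simulate(() => math.cbrt(rng.nextDouble())))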

@davies (Author) commented Feb 2, 2016

@mengxr Thanks for the details, that makes sense.

I ran a few tests; here is the distribution of W (outliers removed, in microseconds):

[image: histogram of W (run time − best time) per run, in microseconds]

Because the number we care about most is (a + W) / (b + W), the result becomes more sensitive to W, especially when b is small.

I ran a few tests on this particular case; the relative rates of the first benchmark (rang/filter/sum) are listed here (this is the number we care about most):

[image: relative rates of the first benchmark across runs]

It seems that best time or median time is much better than mean time: the variance using best time (0.21) is a little better than using median time (0.33), while the variance using mean time is 2.4.

I think we should go with best time or median time. cc @rxin @nongli

@nongli (Contributor) commented Feb 2, 2016

@davies what is the x axis? runs of the benchmark?

Do we know what the variance is per run? (error bars on your graph)

@davies (Author) commented Feb 2, 2016

@nongli The x axis is per run (runs of the benchmark). The first chart is a histogram of (run time − best time), in microseconds, for each run within a benchmark.

@mengxr (Contributor) commented Feb 3, 2016

@davies Thanks for plotting the histogram! The mean estimator is apparently not good here due to outliers. Median seems more stable, and I don't think we have enough numbers to confidently tell whether best time or median is better. Given the current data scale, it seems that a (or b) is large enough that we can ignore W in estimating the ratio a/b. So I would go with median and increase the data size if necessary in order to reduce the effect of W. Does that sound good?

@davies (Author) commented Feb 3, 2016

Sounds good, will go with median.

@SparkQA commented Feb 3, 2016

Test build #50637 has finished for PR 11018 at commit a244e20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

davies changed the title from "[SPARK-13131] [SQL] Use best time in benchmark" to "[SPARK-13131] [SQL] Use median time in benchmark" on Feb 3, 2016
@davies (Author) commented Feb 3, 2016

ping @nongli @rxin

@@ -62,13 +63,15 @@ private[spark] class Benchmark(
val firstRate = results.head.avgRate
// The results are going to be processor specific so it is useful to include that.
println(Benchmark.getProcessorName())
printf("%-30s %16s %16s %14s\n", name + ":", "Avg Time(ms)", "Avg Rate(M/s)", "Relative Rate")
println("-------------------------------------------------------------------------------")
printf("%-30s %16s %12s %13s %10s\n", name + ":", "median Time(ms)", "Rate(M/s)", "Per Row(ns)",
Review comment (Contributor):

Can you capitalize the M -> "Median Time"

@davies (Author) replied:

Will update this when merging.

@nongli (Contributor) commented Feb 3, 2016

LGTM

@davies (Author) commented Feb 3, 2016

@mengxr Here are four runs for BroadcastHashJoin:

Running benchmark: BroadcastHashJoin
  Running case: BroadcastHashJoin codegen=false
ArrayBuffer(   3951.07,    4019.88,    4526.42,    6504.16,    9585.88)
  Running case: BroadcastHashJoin codegen=true
ArrayBuffer(   1857.93,    1944.03,    1961.13,    2142.73,    2223.96)

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
BroadcastHashJoin:                   Median Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------
BroadcastHashJoin codegen=false              4526.42         4.63        215.84     1.00 X
BroadcastHashJoin codegen=true               1961.13        10.69         93.51     2.31 X


Running benchmark: BroadcastHashJoin
  Running case: BroadcastHashJoin codegen=false
ArrayBuffer(   3670.16,    3766.32,    6600.27,    6629.90,    6976.39)
  Running case: BroadcastHashJoin codegen=true
ArrayBuffer(   1866.42,    1899.43,    1973.72,    2012.05,    2026.64)

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
BroadcastHashJoin:                   Median Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------
BroadcastHashJoin codegen=false              6600.27         3.18        314.73     1.00 X
BroadcastHashJoin codegen=true               1973.72        10.63         94.11     3.34 X

Running benchmark: BroadcastHashJoin
  Running case: BroadcastHashJoin codegen=false
ArrayBuffer(   3790.12,    4326.11,    6543.06,    6890.16,    7029.33)
  Running case: BroadcastHashJoin codegen=true
ArrayBuffer(   1869.67,    1921.94,    1938.57,    1939.92,    2099.17)

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
BroadcastHashJoin:                   Median Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------
BroadcastHashJoin codegen=false              6543.06         3.21        312.00     1.00 X
BroadcastHashJoin codegen=true               1938.57        10.82         92.44     3.38 X

Running benchmark: BroadcastHashJoin
  Running case: BroadcastHashJoin codegen=false
ArrayBuffer(   3783.44,    3826.02,    4032.82,    6582.35,    6972.17)
  Running case: BroadcastHashJoin codegen=true
ArrayBuffer(   1868.17,    1907.98,    2004.32,    2013.73,    2027.36)

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
BroadcastHashJoin:                   Median Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------
BroadcastHashJoin codegen=false              4032.82         5.20        192.30     1.00 X
BroadcastHashJoin codegen=true               2004.32        10.46         95.57     2.01 X

With median time, the improvements are 2.31X, 3.34X, 3.38X, 2.01X.

With best time, they would be 2.12X, 1.97X, 2.0X, 2.0X, which is much more stable than using median time.

@mengxr @nongli So I still think we should use best time here. Also, keep only one digit after the decimal point.
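
(For illustration, a minimal sketch, not part of this PR, that reproduces both sets of relative rates from the per-run times logged above.)

// Per-run times (ms) from the four BroadcastHashJoin runs above: (codegen=false, codegen=true).
val runs = Seq(
  (Seq(3951.07, 4019.88, 4526.42, 6504.16, 9585.88), Seq(1857.93, 1944.03, 1961.13, 2142.73, 2223.96)),
  (Seq(3670.16, 3766.32, 6600.27, 6629.90, 6976.39), Seq(1866.42, 1899.43, 1973.72, 2012.05, 2026.64)),
  (Seq(3790.12, 4326.11, 6543.06, 6890.16, 7029.33), Seq(1869.67, 1921.94, 1938.57, 1939.92, 2099.17)),
  (Seq(3783.44, 3826.02, 4032.82, 6582.35, 6972.17), Seq(1868.17, 1907.98, 2004.32, 2013.73, 2027.36)))

def median(xs: Seq[Double]): Double = xs.sorted.apply(xs.size / 2)  // fine for odd-length input

for ((off, on) <- runs) {
  println(f"median: ${median(off) / median(on)}%.2fX   best: ${off.min / on.min}%.2fX")
}
// Prints the median-based ratios 2.31X, 3.34X, 3.38X, 2.01X and best-time ratios close to 2X.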

@davies (Author) commented Feb 3, 2016

After offline discussion with @rxin and @nongli, we agreed to report best time, average time, and standard deviation together; will update this PR shortly.
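
(A minimal sketch, not the PR's code, of computing those three statistics from one run's timings; the sample data is the first codegen=false log above.)

// Per-run times in ms from the first BroadcastHashJoin codegen=false log above.
val runsMs = Seq(3951.07, 4019.88, 4526.42, 6504.16, 9585.88)

val best  = runsMs.min
val avg   = runsMs.sum / runsMs.size
val stdev = math.sqrt(runsMs.map(t => (t - avg) * (t - avg)).sum / runsMs.size)

println(f"best: $best%.2f ms, avg: $avg%.2f ms, stdev: $stdev%.2f ms")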

davies changed the title from "[SPARK-13131] [SQL] Use median time in benchmark" to "[SPARK-13131] [SQL] Use best and average time in benchmark" on Feb 3, 2016
@nongli (Contributor) commented Feb 3, 2016

LGTM

@SparkQA commented Feb 3, 2016

Test build #50680 has finished for PR 11018 at commit fda444a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Result(avgMs: Double, bestRate: Double, bestMs: Double)
    • case class Filter(condition: Expression, child: LogicalPlan)
    • abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode

@SparkQA commented Feb 4, 2016

Test build #2511 has finished for PR 11018 at commit fda444a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Result(avgMs: Double, bestRate: Double, bestMs: Double)
    • case class Filter(condition: Expression, child: LogicalPlan)
    • abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode

@davies (Author) commented Feb 4, 2016

Merging this into master.

asfgit closed this in de09145 on Feb 4, 2016
@SparkQA commented Feb 4, 2016

Test build #50697 has finished for PR 11018 at commit 2f88960.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

benchmark.name,
"%5.0f / %4.0f" format (result.bestMs, result.avgMs),
"%10.1f" format result.bestRate,
"%6.1f" format (1000 / result.bestRate),
Review comment (Contributor):

What does this "Per Row" mean?

@davies (Author) replied:

Nanoseconds per row.
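
(For clarity, a small sketch of the unit conversion behind that column, assuming the rate is reported in millions of rows per second as the header suggests.)

// ns per row = 1e9 ns/s ÷ (rate * 1e6 rows/s) = 1000 / rate
val bestRate = 620.0               // M rows/s, from the rang/filter/sum example above
val nsPerRow = 1000.0 / bestRate   // ≈ 1.6 ns per row, matching the Per Row(ns) column
println(f"$nsPerRow%.1f ns/row")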

Review comment (Contributor):

Should it be per iteration? We may execute multiple rows in one iteration.

@davies (Author) replied:

Could be, but once we use batch mode, per iteration will be worse.
