[SPARK-13131] [SQL] Use best and average time in benchmark #11018

Closed
wants to merge 6 commits

Conversation

@davies (Contributor) commented Feb 2, 2016

Best time is more stable than average time. This also adds a column for nanoseconds per row, which can be used to estimate the contribution of each component in a query.

Having best time and average time together gives more information (we can see a rough sense of the variance).

Rate, time per row, and relative speedup are all calculated using the best time.

The result looks like this (new format first, then the old format for comparison):

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
rang/filter/sum:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
rang/filter/sum codegen=false          14332 / 16646         36.0          27.8       1.0X
rang/filter/sum codegen=true              845 /  940        620.0           1.6      17.0X

rang/filter/aggregate:                Avg Time(ms)    Avg Rate(M/s)   Relative Rate
-----------------------------------------------------------------------------------
rang/filter/aggregate codegen=false       12509.22            41.91          1.00 X
rang/filter/aggregate codegen=true          846.38           619.45         14.78 X
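
(For context, a rough sketch of how a benchmark like the one above is typically defined with Spark's Benchmark utility; the row count and the case bodies here are illustrative assumptions, not the actual benchmark code in this PR.)

import org.apache.spark.util.Benchmark

// Illustrative only: N and the case bodies are placeholders.
val N = 500 << 20
val benchmark = new Benchmark("rang/filter/sum", N)
benchmark.addCase("rang/filter/sum codegen=false") { _ =>
  // run the query with whole-stage codegen disabled
}
benchmark.addCase("rang/filter/sum codegen=true") { _ =>
  // run the query with whole-stage codegen enabled
}
benchmark.run()  // prints a table like the one above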
@davies (Author) commented:
The query was changed from .count() to .groupBy().sum().collect(); with the previous query the speedup was 20X.
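
(For illustration, a hedged sketch of the query change being described; sqlContext, N, and the filter condition are placeholders rather than the exact benchmark code.)

// Assuming a SQLContext `sqlContext` and a row count `N` are in scope.
// Previous benchmark query:
sqlContext.range(N).filter("id > 100").count()
// New benchmark query (aggregation instead of count):
sqlContext.range(N).filter("id > 100").groupBy().sum().collect()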

@SparkQA commented Feb 2, 2016

Test build #50541 has finished for PR 11018 at commit a534e0e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented Feb 2, 2016

@davies I think this depends on the error distribution.

Let's assume that the measured running time of algorithm A is described by an additive model: a + W, where a is a constant indicating the ideal running time and W is a positive random variable describing system noise/overhead. We assume the same error distribution for algorithm B: b + W. Basically, we want to test which one is smaller (faster), a or b. One common way is to compare the sample means of a + W and b + W, whereas you want to compare the sample minimums of a + W and b + W.

If we agree on this model, which method is better mainly depends on the variance of the sample mean and the sample min (first order statistic). We know that the variance of the sample mean is of order O(1/n) (CLT), while the variance of the first order statistic is very sensitive to the distribution. If W follows a uniform distribution, the variance of the first order statistic is of order O(1/n^2), which is indeed better than that of the sample mean. However, if the error distribution has little mass near 0, the variance of the first order statistic could be very large. And this is very easy to verify numerically.

Certainly we can do many runs and draw the empirical error distribution out and then tell which one is better for this case. But without good knowledge of the error distribution, using sample mean is definitely a safe bet because we know the variance is of order O(1/n). If we want to avoid outliers, a common solution is to use the median, following similar arguments.
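
(To make the "easy to verify numerically" point concrete, here is a minimal Scala sketch, not from this PR, that compares the variance of the sample mean and the sample minimum under two noise distributions; the distributions, n, and the number of trials are arbitrary choices.)

import scala.util.Random

def variance(xs: Seq[Double]): Double = {
  val m = xs.sum / xs.size
  xs.map(x => (x - m) * (x - m)).sum / xs.size
}

// Variance of the sample mean vs. the sample min of W over many repeated experiments.
def simulate(noise: () => Double, n: Int = 10, trials: Int = 10000): (Double, Double) = {
  val means = Seq.fill(trials) { Seq.fill(n)(noise()).sum / n }
  val mins  = Seq.fill(trials) { Seq.fill(n)(noise()).min }
  (variance(means), variance(mins))
}

val rng = new Random(42)
// Uniform noise (plenty of mass near 0): Var(min) ~ O(1/n^2) beats Var(mean) ~ O(1/n).
println(simulate(() => rng.nextDouble()))
// Noise with little mass near 0 (W = U^(1/3), density ~ 3w^2 near 0): Var(min) is now the larger one.
println(simulate(() => math.cbrt(rng.nextDouble())))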

@davies (Author) commented Feb 2, 2016

@mengxr Thanks for the details, that makes sense.

I ran a few tests; here is the distribution of W (outliers removed, in microseconds):

[image: histogram of W (run time − best time) per run, in microseconds]

Because the number we care about most is (a + W) / (b + W), the result becomes more sensitive to W, especially when b is small.

I ran a few tests on this particular case; the relative rates of the first benchmark (rang/filter/sum) are listed here (this is the number we care about most):

[image: relative rates of the first benchmark across runs]

It seems that best time or median time is much better than mean time: the variance using best time (0.21) is a little better than using median time (0.33), while the variance using mean time is 2.4.

I think we should go with best time or median time. cc @rxin @nongli

@nongli (Contributor) commented Feb 2, 2016

@davies what is the x axis? runs of the benchmark?

Do we know what the variance is per run? (error bars on your graph)

@davies (Author) commented Feb 2, 2016

@nongli The x axis is per run (runs of the benchmark). The first chart is a histogram of (run time − best time), in microseconds, for each run within a benchmark.

@mengxr (Contributor) commented Feb 3, 2016

@davies Thanks for plotting the histogram! The mean estimator is apparently not good here due to outliers. Median seems more stable, and I don't think we have enough numbers to confidently tell whether best time or median is better. Given the current data scale, it seems that a (or b) is large enough that we can ignore W in estimating the ratio a/b. So I would go with median and increase the data size if necessary in order to reduce the effect of W. Does that sound good?

@davies (Author) commented Feb 3, 2016

Sounds good, will go with median.

@SparkQA commented Feb 3, 2016

Test build #50637 has finished for PR 11018 at commit a244e20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

davies changed the title from "[SPARK-13131] [SQL] Use best time in benchmark" to "[SPARK-13131] [SQL] Use median time in benchmark" on Feb 3, 2016
@davies (Author) commented Feb 3, 2016

ping @nongli @rxin

@@ -62,13 +63,15 @@ private[spark] class Benchmark(
val firstRate = results.head.avgRate
// The results are going to be processor specific so it is useful to include that.
println(Benchmark.getProcessorName())
printf("%-30s %16s %16s %14s\n", name + ":", "Avg Time(ms)", "Avg Rate(M/s)", "Relative Rate")
println("-------------------------------------------------------------------------------")
printf("%-30s %16s %12s %13s %10s\n", name + ":", "median Time(ms)", "Rate(M/s)", "Per Row(ns)",
Review comment (Contributor):

Can you capitalize the M -> "Median Time"

@davies (Author) replied:

Will update this when merging.

@nongli (Contributor) commented Feb 3, 2016

LGTM

@davies (Author) commented Feb 3, 2016

@mengxr Here are four runs for BroadcastHashJoin:

Running benchmark: BroadcastHashJoin
  Running case: BroadcastHashJoin codegen=false
ArrayBuffer(   3951.07,    4019.88,    4526.42,    6504.16,    9585.88)
  Running case: BroadcastHashJoin codegen=true
ArrayBuffer(   1857.93,    1944.03,    1961.13,    2142.73,    2223.96)

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
BroadcastHashJoin:                   Median Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------
BroadcastHashJoin codegen=false              4526.42         4.63        215.84     1.00 X
BroadcastHashJoin codegen=true               1961.13        10.69         93.51     2.31 X


Running benchmark: BroadcastHashJoin
  Running case: BroadcastHashJoin codegen=false
ArrayBuffer(   3670.16,    3766.32,    6600.27,    6629.90,    6976.39)
  Running case: BroadcastHashJoin codegen=true
ArrayBuffer(   1866.42,    1899.43,    1973.72,    2012.05,    2026.64)

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
BroadcastHashJoin:                   Median Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------
BroadcastHashJoin codegen=false              6600.27         3.18        314.73     1.00 X
BroadcastHashJoin codegen=true               1973.72        10.63         94.11     3.34 X

Running benchmark: BroadcastHashJoin
  Running case: BroadcastHashJoin codegen=false
ArrayBuffer(   3790.12,    4326.11,    6543.06,    6890.16,    7029.33)
  Running case: BroadcastHashJoin codegen=true
ArrayBuffer(   1869.67,    1921.94,    1938.57,    1939.92,    2099.17)

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
BroadcastHashJoin:                   Median Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------
BroadcastHashJoin codegen=false              6543.06         3.21        312.00     1.00 X
BroadcastHashJoin codegen=true               1938.57        10.82         92.44     3.38 X

Running benchmark: BroadcastHashJoin
  Running case: BroadcastHashJoin codegen=false
ArrayBuffer(   3783.44,    3826.02,    4032.82,    6582.35,    6972.17)
  Running case: BroadcastHashJoin codegen=true
ArrayBuffer(   1868.17,    1907.98,    2004.32,    2013.73,    2027.36)

Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
BroadcastHashJoin:                   Median Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------
BroadcastHashJoin codegen=false              4032.82         5.20        192.30     1.00 X
BroadcastHashJoin codegen=true               2004.32        10.46         95.57     2.01 X

With median time, the improvements are 2.31X, 3.34X, 3.38X, 2.01X.

With best time, they would be 2.12X, 1.97X, 2.0X, 2.0X, which is much more stable than using median time.

@mengxr @nongli So I still think we should use best time here. Also, keep only one digit after the decimal point.
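
(For illustration, a minimal sketch, not part of this PR, that reproduces both sets of relative rates from the per-run times logged above.)

// Per-run times (ms) from the four BroadcastHashJoin runs above: (codegen=false, codegen=true).
val runs = Seq(
  (Seq(3951.07, 4019.88, 4526.42, 6504.16, 9585.88), Seq(1857.93, 1944.03, 1961.13, 2142.73, 2223.96)),
  (Seq(3670.16, 3766.32, 6600.27, 6629.90, 6976.39), Seq(1866.42, 1899.43, 1973.72, 2012.05, 2026.64)),
  (Seq(3790.12, 4326.11, 6543.06, 6890.16, 7029.33), Seq(1869.67, 1921.94, 1938.57, 1939.92, 2099.17)),
  (Seq(3783.44, 3826.02, 4032.82, 6582.35, 6972.17), Seq(1868.17, 1907.98, 2004.32, 2013.73, 2027.36)))

def median(xs: Seq[Double]): Double = xs.sorted.apply(xs.size / 2)  // fine for odd-length input

for ((off, on) <- runs) {
  println(f"median: ${median(off) / median(on)}%.2fX   best: ${off.min / on.min}%.2fX")
}
// Prints the median-based ratios 2.31X, 3.34X, 3.38X, 2.01X and best-time ratios close to 2X.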

@davies (Author) commented Feb 3, 2016

After offline discussion with @rxin and @nongli, we agreed to report best time, average time, and standard deviation together; will update this PR shortly.
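
(A minimal sketch, not the PR's code, of computing those three statistics from one run's timings; the sample data is the first codegen=false log above.)

// Per-run times in ms from the first BroadcastHashJoin codegen=false log above.
val runsMs = Seq(3951.07, 4019.88, 4526.42, 6504.16, 9585.88)

val best  = runsMs.min
val avg   = runsMs.sum / runsMs.size
val stdev = math.sqrt(runsMs.map(t => (t - avg) * (t - avg)).sum / runsMs.size)

println(f"best: $best%.2f ms, avg: $avg%.2f ms, stdev: $stdev%.2f ms")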

davies changed the title from "[SPARK-13131] [SQL] Use median time in benchmark" to "[SPARK-13131] [SQL] Use best and average time in benchmark" on Feb 3, 2016
@nongli (Contributor) commented Feb 3, 2016

LGTM

@SparkQA commented Feb 3, 2016

Test build #50680 has finished for PR 11018 at commit fda444a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Result(avgMs: Double, bestRate: Double, bestMs: Double)
    • case class Filter(condition: Expression, child: LogicalPlan)
    • abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode

@SparkQA commented Feb 4, 2016

Test build #2511 has finished for PR 11018 at commit fda444a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Result(avgMs: Double, bestRate: Double, bestMs: Double)
    • case class Filter(condition: Expression, child: LogicalPlan)
    • abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode

@davies (Author) commented Feb 4, 2016

Merging this into master.

asfgit closed this in de09145 on Feb 4, 2016
@SparkQA commented Feb 4, 2016

Test build #50697 has finished for PR 11018 at commit 2f88960.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

benchmark.name,
"%5.0f / %4.0f" format (result.bestMs, result.avgMs),
"%10.1f" format result.bestRate,
"%6.1f" format (1000 / result.bestRate),
Review comment (Contributor):

What does this "Per Row" mean?

@davies (Author) replied:

Nanoseconds per row.
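
(For clarity, a small sketch of the unit conversion behind that column, assuming the rate is reported in millions of rows per second as the header suggests.)

// ns per row = 1e9 ns/s ÷ (rate * 1e6 rows/s) = 1000 / rate
val bestRate = 620.0               // M rows/s, from the rang/filter/sum example above
val nsPerRow = 1000.0 / bestRate   // ≈ 1.6 ns per row, matching the Per Row(ns) column
println(f"$nsPerRow%.1f ns/row")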

Review comment (Contributor):

Should it be per iteration? We may execute multiple rows in one iteration.

@davies (Author) replied:

Could be, but once we use batch mode, per iteration will be worse.
