
Bogus outlier count on clock resolution samples #11

Closed
jmillikin opened this Issue Apr 2, 2012 · 10 comments

Contributor

jmillikin commented Apr 2, 2012

I'm not sure what would cause this -- how could it be getting more outliers than there are samples?

warming up
estimating clock resolution...
mean is 5.257008 us (160001 iterations)
found 243504 outliers among 159999 samples (152.2%)
  120537 (75.3%) low severe
  122967 (76.9%) high severe
estimating cost of a clock call...
mean is 31.80684 ns (52 iterations)
found 6 outliers among 52 samples (11.5%)
  4 (7.7%) high mild
  2 (3.8%) high severe
Owner

bos commented Apr 3, 2012

Well that's entertaining. Could you tell me more about the environment you're running in? OS, GHC version, etc?

@ghost ghost assigned bos Apr 3, 2012

Contributor

jmillikin commented Apr 3, 2012

OS: Ubuntu 11.10
Arch: x86_64
GHC: 7.0.3 (standard version in the Ubuntu repository)

I'm not able to reproduce it on GHC 7.4.1, but it reliably reproduces on GHC 7.0.3.

Which library is responsible for measuring clock resolution (statistics, criterion, etc.)? I'd be happy to put in some traces or something to help debug.

Owner

bos commented Apr 3, 2012

It's criterion itself that measures the clock resolution: https://github.com/bos/criterion/blob/master/Criterion/Environment.hs#L37

Contributor

jmillikin commented Apr 3, 2012

It looks like the outlier calculation assumes there will be outliers, and produces bad results if all the samples are nearly identical.

In this case, the measurement of clock resolution on my system is coming up with the same value almost every time. I added some traces to Criterion.Analysis.classifyOutliers so it would use only the first ten values, and print out its internal state. Here's the output:

outliers = Outliers {samplesSeen = 10, lowSevere = 10, lowMild = 0, highMild = 0, highSevere = 10}
ssa = fromList [3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6,3.814697265625e-6]
loS = 3.814697265625e-6
loM = 3.814697265625e-6
hiM = 3.814697265625e-6
hiS = 3.814697265625e-6
q1 = 3.814697265625e-6
q3 = 3.814697265625e-6
iqr = 0.0
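For context, this classification is the standard boxplot/IQR method, and the trace above shows why it misbehaves: with every sample identical, q1 = q3, so the IQR is zero and all four fences collapse onto the sample value itself. A minimal sketch (not criterion's actual code; the crude quantile is an assumption, and the mild buckets are omitted for brevity) reproduces the double-counting:

```haskell
-- Sketch of IQR-based outlier classification with non-exclusive
-- comparisons. When all samples are identical, iqr = 0 and the
-- low-severe and high-severe fences coincide, so every sample is
-- counted in BOTH buckets.
import Data.List (sort)

data Outliers = Outliers
  { samplesSeen :: Int
  , lowSevere   :: Int
  , highSevere  :: Int
  } deriving Show

classify :: [Double] -> Outliers
classify xs = Outliers n (count (<= loS)) (count (>= hiS))
  where
    n       = length xs
    sorted  = sort xs
    -- crude quantile: index into the sorted samples
    quant p = sorted !! min (n - 1) (floor (p * fromIntegral n))
    q1      = quant 0.25
    q3      = quant 0.75
    iqr     = q3 - q1
    loS     = q1 - 3 * iqr   -- low-severe fence
    hiS     = q3 + 3 * iqr   -- high-severe fence
    count f = length (filter f xs)

main :: IO ()
main = print (classify (replicate 10 3.814697265625e-6))
-- prints Outliers {samplesSeen = 10, lowSevere = 10, highSevere = 10}
```

With iqr = 0, the two `count` predicates overlap completely, so ten samples yield twenty "outliers", matching the 152.2% figure in the original report.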
Owner

bos commented Apr 3, 2012

Cool! Thanks for looking into this. That should make it easier to figure out a fix.

dreixel commented May 21, 2012

Any progress on this? I'm seeing the same thing on Windows 7 64-bit, with criterion-0.6.0.1 and ghc-7.4.1, in this very simple example:

module Main where

import Control.DeepSeq
import Criterion.Main
import Data.List ( sort )


main = defaultMain [ bgroup "rands" . benchSorts $ [1..1000::Int] ]

benchSorts xs = seq (sum xs) [ bench "sort" (nf (sum . sort) xs) ]

Typical output is:

warming up
estimating clock resolution...
mean is 1.637714 us (320001 iterations)
found 638950 outliers among 319999 samples (199.7%)
  318951 (99.7%) low severe
  319999 (100.0%) high severe
estimating cost of a clock call...
mean is 123.0625 ns (16 iterations)

benchmarking rands/sort
Warning: Couldn't open /dev/urandom
Warning: using system clock for seed instead (quality will be lower)
mean: 15.09599 us, lb 14.81007 us, ub 15.52451 us, ci 0.950
std dev: 1.798179 us, lb 1.497669 us, ub 2.072980 us, ci 0.950
found 17 outliers among 100 samples (17.0%)
  17 (17.0%) high severe
variance introduced by outliers: 84.204%
variance is severely inflated by outliers

It doesn't seem to be affected by optimization level. Same happens on a virtual machine running linux:

warming up
estimating clock resolution...
mean is 5.303620 us (160001 iterations)
found 287114 outliers among 159999 samples (179.4%)
  145205 (90.8%) low severe
  141909 (88.7%) high severe
estimating cost of a clock call...
mean is 2.490415 us (21 iterations)
found 3 outliers among 21 samples (14.3%)
  3 (14.3%) high mild

benchmarking rands/insertionSort
mean: 281.0402 us, lb 277.6374 us, ub 284.8547 us, ci 0.950
std dev: 18.47338 us, lb 16.33518 us, ub 20.99275 us, ci 0.950
variance introduced by outliers: 61.565%
variance is severely inflated by outliers

letmaik commented Aug 27, 2012

I also got the same with Windows 7 64-bit, criterion-0.6.1.1, ghc-7.4.1:

Warning: Couldn't open /dev/urandom
Warning: using system clock for seed instead (quality will be lower)
warming up
estimating clock resolution...
mean is 2.190758 us (320001 iterations)
found 639297 outliers among 319999 samples (199.8%)
  319298 (99.8%) low severe
  319999 (100.0%) high severe
estimating cost of a clock call...
mean is 179.1769 ns (15 iterations)

benchmarking tiny/fib 10
mean: 4.652202 us, lb 4.358087 us, ub 4.973060 us, ci 0.950
std dev: 1.549386 us, lb 1.293269 us, ub 2.013035 us, ci 0.950
found 2 outliers among 100 samples (2.0%)
  2 (2.0%) high mild
variance introduced by outliers: 97.865%
variance is severely inflated by outliers

benchmarking tiny/fib 15
mean: 49.72007 us, lb 47.22011 us, ub 51.66477 us, ci 0.950
std dev: 11.37166 us, lb 9.687343 us, ub 12.79400 us, ci 0.950
found 173 outliers among 100 samples (173.0%)
  94 (94.0%) low severe
  79 (79.0%) high severe
variance introduced by outliers: 95.725%
variance is severely inflated by outliers

With these tests:

{-# LANGUAGE ScopedTypeVariables #-}

import Criterion.Main

main = defaultMain [
        bgroup "tiny" [ bench "fib 10" $ whnf fib 10
                      , bench "fib 15" $ whnf fib 15
                      ]
       ]

fib :: Int -> Int
fib n | n < 0 = error "negative!"
      | otherwise = go (fromIntegral n)
  where
    go 0 = 0
    go 1 = 1
    go n = go (n-1) + go (n-2)
Contributor

Gabriel439 commented Mar 25, 2013

classifyOutliers triggers this bug whenever the input contains many identical values:

>>> classifyOutliers (U.replicate 10 0)
Outliers {samplesSeen = 10, lowSevere = 10, lowMild = 0, highMild = 0, highSevere = 10}

This results in duplicate counts in the lowSevere and highSevere ranges, just like in the other reports here. The bug also triggers even if not all the data is identical:

>>> classifyOutliers (U.fromList [0, 0, 0, 0, 0, 0, 1])
Outliers {samplesSeen = 7, lowSevere = 6, lowMild = 0, highMild = 0, highSevere = 7}

However, it was not clear what would cause it to receive large stretches of identical data, especially with 100 samples; you would think that would be very improbable. So I added a trace to see what the offending data was, and this is what it was getting:

benchmarking parse
collecting 100 samples, 1 iterations each, in estimated 10.58121 s
mean: 123.6324 ns, lb 47.33842 ns, ub 238.0733 ns, ci 0.950
std dev: 471.0889 ns, lb 368.2355 ns, ub 707.7345 ns, ci 0.950
fromList [2.7629260312427172e-6,8.555773984302174e-7,-9.80969179760326e-8,-9.809
69179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,8.
555773984302174e-7,-9.80969179760326e-8,8.555773984302174e-7,8.555773984302174e-
7,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.809691797603
26e-8,-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179
760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.8096
9179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,8.555773984302174e-7,-9.
80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8
,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.8096917976032
6e-8,8.555773984302174e-7,-9.80969179760326e-8,-9.80969179760326e-8,-9.809691797
60326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969
179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.8
0969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,
-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326
e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,-9.8096917976
0326e-8,-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.809691
79760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,-9.80
969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,1.09399597753178e-6,-9
.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-
8,8.555773984302174e-7,-9.80969179760326e-8,8.555773984302174e-7,-9.809691797603
26e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.80969179760326e-8,-9.80969179
760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.8096
9179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,8.555773984302174e-7,-9.
80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8
,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.8096917976032
6e-8,-9.80969179760326e-8,-9.80969179760326e-8,-9.80969179760326e-8,8.5557739843
02174e-7,-9.80969179760326e-8]
found 179 outliers among 100 samples (179.0%)
  79 (79.0%) low severe
  100 (100.0%) high severe

The data has several repeating values, and the value -9.80969179760326e-8 dominates the rest of the data set enough to trigger the bug. I still don't have an answer for why that exact timing repeats so often. My initial guess was that it was getting near the limit of the timer's minimum step size, but that wasn't the case: benchmarks on trivial code like nf id () give smaller values in the e-9 range that are much more diverse. Here's a sample of some of them:

-9,5.519538864781468e-9,5.444293899712435e-9,5.59069442957501e-9,5.468012421310282e-9,5.529353425442646e-9,5.553071947040493e-9,5.443476019657337e-9,5.621773871668741e-9,5.605416270566778e-9,5.51872098472637e-9,5.646310273321687e-9,5.549800426820101e-9,5.643038753101294e-9,5.516267344561075e-9,5.60459839051168e-9,5.443476019657337e-9,5.571065308252654e-9,5.783714122578182e-9,6.146852867041778e-9,5.763267121200728e-9,5.49173094290813e-9,5.488459422687737e-9,5.557161347315985e-9,5.440204499436944e-9,5.60459839051168e-9,5.5121779442855845e-9,5.502363383624406e-9,5.6185023514483485e-9,5.443476019657337e-9,5.632406312385018e-9,5.5391679861038246e-9,5.6086877907871706e-9,5.519538864781468e-9,5.605416270566778e-9,5.6185023514483485e-9,5.446747539877729e-9,5.632406312385018e-9,5.611959311007563e-9,5.512995824340683e-9,5.574336828473046e-9,5.45083694015322e-9,5.6185023514483485e-9,5.508906424065192e-9,5.45083694015322e-9,5.616048711283054e-9,5.45083694015322e-9,5.56370438775677e-9,5.621773871668741e-9,5.502363383624406e-9,5.584969269189323e-9,5.6185023514483485e-9,5.6078699107320724e-9,5.643038753101294e-9,6.047071500319799e-9,5.567793788032261e-9,5.638949352825803e-9,5.601326870291287e-9,5.471283941530675e-9,5.587422909354618e-9,5.526081905222253e-9,5.454108460373613e-9,5.638949352825803e-9,5.6561248339828655e-9,5.69456519657248e-9,5.532624945663039e-9,5.546528906599708e-9]

So the remaining question is what causes the timings to snap abnormally to certain values.

@bos bos closed this Sep 11, 2013

@bos bos reopened this Sep 11, 2013

Owner

bos commented Sep 11, 2013

Oops, closed this by accident.

@bos bos added a commit that referenced this issue Sep 11, 2013

@bos bos Correctly handle samples with heavily clustered values (gh-11)
If a sample contains many values that are clustered around a single
value, this throws off classifyOutliers.

Before this change, we could easily end up considering a value as
both high and low, thereby counting it in more than one bucket at
a time (which should not happen). As a result, we would sometimes
report more outliers in a data set than sample values.

With this change, every outlier should be classified into a single
bucket. Our estimate-based weighted average method can still lead
to the wrong bucket being chosen, but at least there should be only
one bucket now!
ea5cd9e
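The idea in the commit message can be sketched as follows. This is a hedged illustration of "every outlier should be classified into a single bucket", not the actual ea5cd9e patch; the fence names and strict comparisons are assumptions:

```haskell
-- Sketch of single-bucket classification: test the fences in
-- mutually exclusive order, so no sample can ever be counted as
-- both a low and a high outlier. With strict comparisons, a
-- degenerate sample where all fences coincide (iqr = 0) is
-- classified as Normal instead of as both extremes.
data Bucket = LowSevere | LowMild | Normal | HighMild | HighSevere
  deriving (Eq, Show)

-- loS <= loM <= hiM <= hiS are the four boxplot fences.
bucket :: Double -> Double -> Double -> Double -> Double -> Bucket
bucket loS loM hiM hiS x
  | x < loS   = LowSevere
  | x < loM   = LowMild
  | x > hiS   = HighSevere
  | x > hiM   = HighMild
  | otherwise = Normal

main :: IO ()
main = print (map (bucket 1 1 1 1) [1, 1, 1])
-- prints [Normal,Normal,Normal]
```

Because the guards are exhaustive and mutually exclusive, the per-bucket counts can never sum to more than samplesSeen, which rules out the 152% and 199% outlier rates reported above.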
Owner

bos commented Sep 11, 2013

Fix validated in f028c02.

@bos bos closed this Sep 11, 2013
