/
sum_benchmark.py
82 lines (67 loc) · 3.28 KB
/
sum_benchmark.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
"""Test the speed and precision of the sum of many values.
When fitting data with many events or bins (say 1e6 or 1e9 or more), the fit statistic is first
computed per event or bin and then summed.
The sum (and thus the fit statistic and fit results) can become incorrect due to rounding errors
(see e.g. http://en.wikipedia.org/wiki/Kahan_summation_algorithm)
There are clever and slower methods to avoid this problem (like the Kahan summation algorithm),
but they are not readily available in stdlib C / C++ or numpy
(there's an open feature request for numpy though: https://github.com/numpy/numpy/issues/2448)
The simplest solution is to use 128 bit precision for the accumulator in numpy:
In [7]: np.sum([1e10, -1e10, 1e-6] * int(1e6))
Out[7]: 1.9073477254638671
In [8]: np.sum([1e10, -1e10, 1e-6] * int(1e6), dtype='float128')
Out[8]: 1.0002404448965787888
Sherpa uses Kahan summation:
$ASCDS_INSTALL/src/pkg/sherpa/sherpa/sherpa/include/sherpa/stats.hh
This is a quick benchmarking of the precision, speed of sum as a function of these parameters:
* Accumulator bit size: 32, 64 or 128
* Number of elements: 1e6, 1e9
* TODO: how to choose values in a meaningful way? Precision results will completely depend on this.
Should be chosen similar to typical / extreme fitting cases with CASH, CSTAT, CHI2 fits
For now we only check the speed.
* TODO: Check against C and Cython implementation
"""
from timeit import Timer
dtypes = ['float32', 'float64', 'float128']
sizes = [int(1e6), int(1e9)]
def setup(size, dtype):
return """
import numpy as np
data = np.zeros({size}, dtype='{dtype}')
"""[1:-1].format(**locals())
def statement(dtype):
return """data.sum(dtype='{dtype}')""".format(**locals())
for data_dtype in dtypes:
for accumulator_dtype in dtypes:
for size in sizes:
timer = Timer(statement(accumulator_dtype), setup(size, data_dtype))
time = min(timer.repeat(repeat=3, number=1))
# Let's use the frequency in GHz of summed elements as our measure of speed
speed = 1e-9 * (size / time)
print('%10s %10s %10d %10.5f' %
(data_dtype, accumulator_dtype, size, speed))
"""
On my 2.6 GHz Intel Core I7 Macbook the speed doesn't depend on data or accumulator dtype at all.
This is weird, because it's a 64 bit machine, so 128 bit addition should be slower.
Also for such a simple computation as sum the limiting factor should be memory loading speed,
so 128 bit data should be slower to process than 64 bit data?
In [53]: run sum_benchmark.py
f32 f32 1000000 0.82793
f32 f32 1000000000 1.12276
f32 f64 1000000 1.12207
f32 f64 1000000000 1.10964
f32 f128 1000000 1.04155
f32 f128 1000000000 1.12900
f64 f32 1000000 1.10609
f64 f32 1000000000 1.12823
f64 f64 1000000 1.10493
f64 f64 1000000000 1.11920
f64 f128 1000000 1.15450
f64 f128 1000000000 1.11794
f128 f32 1000000 1.12087
f128 f32 1000000000 1.12223
f128 f64 1000000 1.09885
f128 f64 1000000000 1.11911
f128 f128 1000000 1.06943
f128 f128 1000000000 1.12578
"""