In [1]:
using BenchmarkTools
using PyCall

Let's benchmark the sum function across several languages and implementations:

$$\mathrm{sum}(x) = \sum_{i=1}^n x_i$$

In [2]:
x = rand(10^7);

In [3]:
sum(x)

4.999556292214221e6

In [4]:
d = Dict() # to store the measurement results (minimum time in milliseconds)

Dict{Any,Any}()

## Hand-written C

In [5]:
c_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
""";

In [6]:
# compile to a shared library by piping c_code to gcc
# (only works if you have gcc installed):
const Clib = tempname()
using Libdl

In [7]:
open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, c_code)
end

In [8]:
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)

c_sum (generic function with 1 method)

In [9]:
c_sum(x) ≈ sum(x)

true

In [10]:
b = @benchmark c_sum($x)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.222 ms (0.00% GC)
  median time:      12.351 ms (0.00% GC)
  mean time:        13.038 ms (0.00% GC)
  maximum time:     42.979 ms (0.00% GC)
  --------------
  samples:          384
  evals/sample:     1

In [11]:
d["C"] = minimum(b.times) / 1e6

10.221516

## Hand-written C with `-ffast-math`

In [12]:
const Clib_fastmath = tempname()   # generate a temporary file name

# The same as above but with a -ffast-math flag added
open(`gcc -fPIC -O3 -msse3 -xc -shared -ffast-math -o $(Clib_fastmath * "." * Libdl.dlext) -`, "w") do f
    print(f, c_code) 
end

# define a Julia function that calls the C function:
c_sum_fastmath(X::Array{Float64}) = ccall(("c_sum", Clib_fastmath), Float64, (Csize_t, Ptr{Float64}), length(X), X)

c_sum_fastmath (generic function with 1 method)

In [13]:
b = @benchmark c_sum_fastmath($x)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.308 ms (0.00% GC)
  median time:      4.841 ms (0.00% GC)
  mean time:        5.743 ms (0.00% GC)
  maximum time:     10.746 ms (0.00% GC)
  --------------
  samples:          871
  evals/sample:     1

In [14]:
d["C (fastmath)"] = minimum(b.times) / 1e6

4.308035
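
A likely reason for the speedup is that `-ffast-math` lets the compiler reorder the additions, which in turn allows it to vectorize the loop. Strict IEEE arithmetic forbids that reordering because floating-point addition is not associative. A minimal sketch of the effect (the variable names here are illustrative only):

```julia
# Floating-point addition is not associative: grouping changes the rounding.
left  = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

println(left == right)   # false: the two groupings round differently
println(left ≈ right)    # true: they agree to within rounding error
```

This is also why the results above are compared with `≈` rather than `==`.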

## Built-in Python `sum`

In [15]:
# get the Python built-in "sum" function:
pysum = pybuiltin("sum")

PyObject <built-in function sum>

In [16]:
# call a low-level PyCall function to get a Python list, because
# by default PyCall will convert to a NumPy array instead (we benchmark NumPy below):
xpy_list = PyCall.array2py(x);

In [17]:
b = @benchmark $pysum($xpy_list)

BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     55.168 ms (0.00% GC)
  median time:      57.266 ms (0.00% GC)
  mean time:        59.170 ms (0.00% GC)
  maximum time:     74.765 ms (0.00% GC)
  --------------
  samples:          85
  evals/sample:     1

In [18]:
d["Python (built-in)"] = minimum(b.times) / 1e6

55.168154

## NumPy `sum`

In [19]:
numpy_sum = pyimport("numpy").sum
b = @benchmark $numpy_sum($x)

BenchmarkTools.Trial: 
  memory estimate:  336 bytes
  allocs estimate:  6
  --------------
  minimum time:     3.583 ms (0.00% GC)
  median time:      3.753 ms (0.00% GC)
  mean time:        3.895 ms (0.00% GC)
  maximum time:     8.030 ms (0.00% GC)
  --------------
  samples:          1284
  evals/sample:     1

In [20]:
d["Python (numpy)"] = minimum(b.times) / 1e6

3.582819

## Hand-written Python

In [21]:
py"""
def mysum(a):
    s = 0.0
    for x in a:
        s = s + x
    return s
"""
mysum_py = py"mysum"

PyObject <function mysum at 0x7febbcbcbd40>

In [22]:
b = @benchmark $mysum_py($xpy_list)

BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     212.160 ms (0.00% GC)
  median time:      217.512 ms (0.00% GC)
  mean time:        219.573 ms (0.00% GC)
  maximum time:     240.518 ms (0.00% GC)
  --------------
  samples:          23
  evals/sample:     1

In [23]:
d["Python (hand-written)"] = minimum(b.times) / 1e6

212.159834

## Built-in Julia `sum`

In [24]:
b = @benchmark sum($x)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.089 ms (0.00% GC)
  median time:      5.020 ms (0.00% GC)
  mean time:        5.592 ms (0.00% GC)
  maximum time:     11.051 ms (0.00% GC)
  --------------
  samples:          894
  evals/sample:     1

In [25]:
d["Julia (built-in)"] = minimum(b.times) / 1e6

4.089137

## Built-in Julia `sum` with `Vector{Any}`

In [26]:
x_any = Vector{Any}(x)
b = @benchmark sum($x_any)

BenchmarkTools.Trial: 
  memory estimate:  152.59 MiB
  allocs estimate:  9999999
  --------------
  minimum time:     184.982 ms (8.30% GC)
  median time:      197.987 ms (8.05% GC)
  mean time:        201.038 ms (7.64% GC)
  maximum time:     251.713 ms (8.63% GC)
  --------------
  samples:          25
  evals/sample:     1

In [27]:
d["Julia (built-in, Any)"] = minimum(b.times) / 1e6

184.982358
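
The roughly one allocation per element reported above is consistent with the elements of a `Vector{Any}` being stored as boxed values whose type must be checked at run time. The sketch below, using a small hypothetical array `x_small`, inspects the element types of both containers:

```julia
x_small = rand(5)
x_small_any = Vector{Any}(x_small)   # same values, abstract element type

println(eltype(x_small))                   # Float64
println(eltype(x_small_any))               # Any
println(isbitstype(eltype(x_small)))       # true: stored inline as raw bits
println(isbitstype(eltype(x_small_any)))   # false: elements are references
println(sum(x_small) ≈ sum(x_small_any))   # true: same values, only storage differs
```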

## Hand-written Julia

In [28]:
function mysum(A)
    s = zero(eltype(A)) # the correct type of zero for A
    for a in A
        s += a
    end
    return s
end

mysum (generic function with 1 method)

In [29]:
b = @benchmark mysum($x)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.616 ms (0.00% GC)
  median time:      11.778 ms (0.00% GC)
  mean time:        11.927 ms (0.00% GC)
  maximum time:     14.069 ms (0.00% GC)
  --------------
  samples:          420
  evals/sample:     1

In [30]:
d["Julia (hand-written)"] = minimum(b.times) / 1e6

11.616404

## Hand-written Julia with `@simd`

In [31]:
function mysum_simd(A)
    s = zero(eltype(A)) # the correct type of zero for A
    @simd for a in A
        s += a
    end
    return s
end

mysum_simd (generic function with 1 method)

In [32]:
b = @benchmark mysum_simd($x)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.185 ms (0.00% GC)
  median time:      5.002 ms (0.00% GC)
  mean time:        5.007 ms (0.00% GC)
  maximum time:     6.648 ms (0.00% GC)
  --------------
  samples:          999
  evals/sample:     1

In [33]:
d["Julia (hand-written, simd)"] = minimum(b.times) / 1e6

4.185407
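
`@simd` permits the compiler to reorder the accumulation, so the result may differ from the plain loop in the last few bits. A small self-contained check, assuming two hypothetical copies of the loops above named `sum_plain` and `sum_simd`:

```julia
function sum_plain(A)
    s = zero(eltype(A))
    for a in A
        s += a
    end
    return s
end

function sum_simd(A)
    s = zero(eltype(A))
    @simd for a in A   # allows reordering of the floating-point additions
        s += a
    end
    return s
end

y = rand(10^5)
println(sum_plain(y) ≈ sum_simd(y))   # true: equal to within rounding
println(sum_plain(y) - sum_simd(y))   # may be zero or a tiny nonzero value
```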

## Summary

In [34]:
for (key, value) in sort(collect(d), by=x->x[2])
    println(rpad(key, 30, "."), lpad(round(value, digits=2), 10, "."))
end

Python (numpy)......................3.58
Julia (built-in)....................4.09
Julia (hand-written, simd)..........4.19
C (fastmath)........................4.31
C..................................10.22
Julia (hand-written)...............11.62
Python (built-in)..................55.17
Julia (built-in, Any).............184.98
Python (hand-written).............212.16
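
The times above are in milliseconds. To read them as slowdown factors, one option is to divide each by the fastest entry. The sketch below copies the rounded values from the table into a hypothetical `times_ms` dictionary, so the ratios are only as precise as those rounded figures:

```julia
# Minimum times in milliseconds, copied from the summary table above.
times_ms = Dict(
    "Python (numpy)"             => 3.58,
    "Julia (built-in)"           => 4.09,
    "Julia (hand-written, simd)" => 4.19,
    "C (fastmath)"               => 4.31,
    "C"                          => 10.22,
    "Julia (hand-written)"       => 11.62,
    "Python (built-in)"          => 55.17,
    "Julia (built-in, Any)"      => 184.98,
    "Python (hand-written)"      => 212.16,
)

fastest = minimum(values(times_ms))

# Print each entry as a multiple of the fastest time, fastest first.
for (name, t) in sort(collect(times_ms), by = last)
    println(rpad(name, 30, "."), lpad(round(t / fastest, digits = 1), 8, "."), "x")
end
```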


And of course, our hand-written Julia implementation is type-generic!

In [35]:
z = rand(Complex{Float64}, length(x));
@btime mysum_simd($z);

  8.682 ms (0 allocations: 0 bytes)
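
The same loop works for any element type that defines `zero` and `+`. As an illustration, here is a self-contained copy of the hand-written loop under the hypothetical name `generic_sum`, applied to integers, rationals, single-precision floats, and complex integers:

```julia
function generic_sum(A)
    s = zero(eltype(A))   # additive identity of the element type
    for a in A
        s += a
    end
    return s
end

println(generic_sum([1, 2, 3]))             # 6
println(generic_sum([1//2, 1//3, 1//6]))    # 1//1 (exact rational arithmetic)
println(generic_sum(Float32[1.5, 2.5]))     # 4.0 (result stays Float32)
println(generic_sum([1 + 2im, 3 - 1im]))    # 4 + 1im
```

Julia compiles a specialized version of the function for each concrete argument type it is called with.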


Resources/more info:

https://github.com/mitmath/18S096/blob/master/lectures/lecture1/Performance-variation.ipynb

https://github.com/mitmath/18S096/blob/master/lectures/lecture1/Boxes-and-registers.ipynb
