# Julia is fast

Benchmarks are often used to compare languages. They can lead to long discussions: first about exactly what is being benchmarked, and second about what explains the differences. These seemingly simple questions can turn out to be more complicated than you might first imagine.

The purpose of this notebook is to let you see a simple benchmark for yourself. You can read the notebook and see what happened on the author's MacBook Pro with a 4-core Intel Core i7, or run the notebook yourself.

(This material began life as a wonderful lecture by Steven Johnson at MIT: https://github.com/stevengj/18S096-iap17/blob/master/lecture1/Boxes-and-registers.ipynb.)

# `sum`: An easy enough function to understand

Consider the **sum** function `sum(a)`, which computes
$$
\mathrm{sum}(a) = \sum_{i=1}^n a_i,
$$
where $n$ is the length of `a`.

In [3]:
a = rand(10^7) # 1D vector of random numbers, uniform on [0,1)

10000000-element Array{Float64,1}:
 0.53397 
 0.852647
 0.761401
 0.655317
 0.608951
 0.626262
 0.208563
 0.720231
 0.297856
 0.137934
 0.565968
 0.606141
 0.206431
 ⋮       
 0.980561
 0.470557
 0.770711
 0.772431
 0.506627
 0.652219
 0.836984
 0.597263
 0.532471
 0.247946
 0.861616
 0.019944

In [4]:
sum(a)

5.000249856187823e6

The expected result is 0.5 * 10^7, since the mean of each entry is 0.5.
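
As a quick sanity check on that expectation, the sketch below also estimates how far from 0.5 * 10^7 a typical run should land. It assumes the entries are independent and uniform on [0,1), so each has variance 1/12; the names `n`, `expected`, `sigma`, and `deviation` are illustrative and not part of the original notebook.

```julia
n = 10^7
expected = 0.5 * n        # the mean of each entry is 0.5
sigma = sqrt(n / 12)      # std of a sum of n independent uniform [0,1) values

observed = 5.000249856187823e6              # the value of sum(a) printed above
deviation = (observed - expected) / sigma   # distance from the mean in units of sigma

println("expected = ", expected, ", sigma ≈ ", sigma, ", deviation ≈ ", deviation)
```

The run shown above lands roughly 250 above the expected value, well within one standard deviation (about 913).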

# Benchmarking a few ways in a few languages

Julia has a `BenchmarkTools.jl` package for easy and accurate benchmarking:

In [None]:
# Pkg.add("BenchmarkTools")

In [5]:
using BenchmarkTools  

# 1. The C language

C is often considered the gold standard: difficult for the human, nice for the machine. Getting within a factor of 2 of C is often satisfying. Nonetheless, even within C there are many possible optimizations that a naive C programmer may or may not take advantage of.

The current author does not speak C, so he does not read the cell below, but is happy to know that you can put C code in a Julia session, compile it, and run it. Note that the `"""` wrap a multi-line string.

In [30]:
C_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
"""

const Clib = tempname()   # make a temporary file


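# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):

# variant without -ffast-math (kept for comparison, not run):
# open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
#     print(f, C_code)
# end

# variant actually used: -ffast-math lets gcc reorder the floating-point
# additions, which allows it to vectorize the loop with SIMD instructions
open(`gcc -fPIC -ffast-math -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code)
end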

# define a Julia function that calls the C function:
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)



c_sum (generic function with 1 method)
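
The `ccall` pattern used for `c_sum` takes the function (optionally paired with a library), the return type, a tuple of argument types, and then the arguments themselves. As a smaller illustration that needs no compiler, here is a sketch calling `strlen` from the C standard library. It assumes a platform where the C library is already loaded into the Julia process, and `c_strlen` is a hypothetical helper name.

```julia
# Call the C library's strlen: it returns the number of bytes before the NUL terminator.
c_strlen(s::String) = ccall(:strlen, Csize_t, (Cstring,), s)

c_strlen("hello")   # 5, returned as a Csize_t
```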

In [27]:
c_sum(a)

5.000249856187658e6

In [28]:
c_sum(a) ≈ sum(a) # type \approx and then <TAB> to get the ≈ symbol

true

In [11]:
c_sum(a) - sum(a)  

-1.648440957069397e-7
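
The small difference is most likely due to the order in which the additions are performed: floating-point addition is not associative, so a different grouping of the same terms can round differently. A minimal illustration of that effect, independent of the specific algorithms used by `sum` or by gcc:

```julia
left  = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

left == right           # false: the two groupings round differently
isapprox(left, right)   # true: they agree to within the default tolerance
```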

In [29]:
≈  # alias for the `isapprox` function

isapprox (generic function with 9 methods)

In [13]:
?isapprox

search: isapprox



```
isapprox(x, y; rtol::Real=sqrt(eps), atol::Real=0, nans::Bool=false, norm::Function)
```

Inexact equality comparison: `true` if `norm(x-y) <= atol + rtol*max(norm(x), norm(y))`. The default `atol` is zero and the default `rtol` depends on the types of `x` and `y`. The keyword argument `nans` determines whether or not NaN values are considered equal (defaults to false).

For real or complex floating-point values, `rtol` defaults to `sqrt(eps(typeof(real(x-y))))`. This corresponds to requiring equality of about half of the significand digits. For other types, `rtol` defaults to zero.

`x` and `y` may also be arrays of numbers, in which case `norm` defaults to `vecnorm` but may be changed by passing a `norm::Function` keyword argument. (For numbers, `norm` is the same thing as `abs`.) When `x` and `y` are arrays, if `norm(x-y)` is not finite (i.e. `±Inf` or `NaN`), the comparison falls back to checking whether all elements of `x` and `y` are approximately equal component-wise.

The binary operator `≈` is equivalent to `isapprox` with the default arguments, and `x ≉ y` is equivalent to `!isapprox(x,y)`.

```jldoctest
julia> 0.1 ≈ (0.1 - 1e-10)
true

julia> isapprox(10, 11; atol = 2)
true

julia> isapprox([10.0^9, 1.0], [10.0^9, 2.0])
true
```
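
To see the documented rule at work on the two sums computed earlier, the sketch below applies the formula by hand with the default `rtol` for `Float64` and `atol = 0`. The literal values are copied from the outputs above; the variable names are illustrative.

```julia
x = 5.000249856187658e6     # c_sum(a) from the run above
y = 5.000249856187823e6     # sum(a) from the run above

rtol = sqrt(eps(Float64))   # default relative tolerance, about 1.5e-8
manual = abs(x - y) <= rtol * max(abs(x), abs(y))

manual == isapprox(x, y)    # true: the hand-written check agrees with isapprox
```

The absolute difference is about 1.6e-7, while the allowed tolerance `rtol * max(abs(x), abs(y))` is about 0.07, so the two sums count as approximately equal even though `x == y` is false.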


We can now benchmark the C code directly from Julia:

In [31]:
c_bench = @benchmark c_sum($a) 

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.672 ms (0.00% GC)
  median time:      3.784 ms (0.00% GC)
  mean time:        3.930 ms (0.00% GC)
  maximum time:     6.345 ms (0.00% GC)
  --------------
  samples:          1270
  evals/sample:     1

In [32]:
println("C: Fastest time was $(minimum(c_bench.times) / 1e6) msec")

C: Fastest time was 3.671642 msec
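
`@benchmark` collects many samples, and the cell above reports the fastest one. The same idea can be sketched with only Base Julia's `@elapsed`: call the function once so that compilation is excluded, then keep the minimum over several runs. This is a rough stand-in, not a description of how BenchmarkTools works internally; `min_time_ms` and its `samples` keyword are hypothetical names.

```julia
function min_time_ms(f, x; samples = 20)
    f(x)                      # warm-up call so compilation is not timed
    best = Inf
    for _ in 1:samples
        t = @elapsed f(x)     # wall-clock seconds for one call
        best = min(best, t)
    end
    return best * 1e3         # convert seconds to milliseconds
end

v = rand(10^6)
println("fastest of 20 runs: ", min_time_ms(sum, v), " ms")
```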


In [33]:
d = Dict()  # a "dictionary", i.e. an associative array
d["C"] = minimum(c_bench.times) / 1e6  # in milliseconds
d

Dict{Any,Any} with 1 entry:
  "C" => 3.67164

In [34]:
using Plots
plotly()

Plots.PlotlyBackend()

In [35]:
t = c_bench.times / 1e6 # times in milliseconds
m, σ = minimum(t), std(t)

histogram(t, bins=500,
    xlim=(m - 0.01, m + σ),
    xlabel="milliseconds", ylabel="count", label="")

# 2. Python's built-in `sum`

The `PyCall` package provides a Julia interface to Python:

In [36]:
#Pkg.add("PyCall")

In [37]:
using PyCall

In [38]:
# Call a low-level PyCall function to get a Python list, because
# by default PyCall will convert to a NumPy array instead (we benchmark NumPy below):

apy_list = PyCall.array2py(a, 1, 1)

# get the Python built-in "sum" function:
pysum = pybuiltin("sum")

PyObject <built-in function sum>

In [39]:
pysum(a)

5.000249856187658e6

In [23]:
pysum(a) ≈ sum(a)

true

In [40]:
py_list_bench = @benchmark $pysum($apy_list)

BenchmarkTools.Trial: 
  memory estimate:  512 bytes
  allocs estimate:  17
  --------------
  minimum time:     68.490 ms (0.00% GC)
  median time:      79.995 ms (0.00% GC)
  mean time:        80.252 ms (0.00% GC)
  maximum time:     93.686 ms (0.00% GC)
  --------------
  samples:          63
  evals/sample:     1

In [41]:
d["Python built-in"] = minimum(py_list_bench.times) / 1e6
d

Dict{Any,Any} with 2 entries:
  "C"               => 3.67164
  "Python built-in" => 68.4895

# 3. Python: `numpy` 

## Takes advantage of hardware "SIMD", but only works when it works.

`numpy` is an optimized C library, callable from Python.
It may be installed within Julia as follows:

In [42]:
using Conda 
#Conda.add("numpy")

In [43]:
numpy_sum = pyimport("numpy")["sum"]
apy_numpy = PyObject(a) # converts to a numpy array by default

py_numpy_bench = @benchmark $numpy_sum($apy_numpy)

BenchmarkTools.Trial: 
  memory estimate:  720 bytes
  allocs estimate:  22
  --------------
  minimum time:     3.922 ms (0.00% GC)
  median time:      4.018 ms (0.00% GC)
  mean time:        4.257 ms (0.00% GC)
  maximum time:     8.438 ms (0.00% GC)
  --------------
  samples:          1172
  evals/sample:     1

In [44]:
numpy_sum(apy_list) # NumPy's sum also accepts a plain Python list

5.000249856187817e6

In [45]:
numpy_sum(apy_list) ≈ sum(a)

true

In [46]:
d["Python numpy"] = minimum(py_numpy_bench.times) / 1e6
d

Dict{Any,Any} with 3 entries:
  "C"               => 3.67164
  "Python numpy"    => 3.9221
  "Python built-in" => 68.4895

# 4. Python, hand-written 

In [67]:
py"""
def py_sum(a):
    s = 0.0
    for x in a:
        s = s + x
    return s
"""

sum_py = py"py_sum"

PyObject <function py_sum at 0x14c9f7cf8>

In [48]:
py_hand = @benchmark $sum_py($apy_list)

BenchmarkTools.Trial: 
  memory estimate:  512 bytes
  allocs estimate:  17
  --------------
  minimum time:     1.335 s (0.00% GC)
  median time:      1.381 s (0.00% GC)
  mean time:        1.374 s (0.00% GC)
  maximum time:     1.397 s (0.00% GC)
  --------------
  samples:          4
  evals/sample:     1

In [49]:
@which sum([1.5])

In [50]:
ccall(:sleep, Void, (Cint,), 3)

In [51]:
Cint, Csize_t

(Int32, UInt64)

In [52]:
sum_py(apy_list)

5.000249856187658e6

In [53]:
sum_py(apy_list) ≈ sum(a)

true

In [54]:
d["Python hand-written"] = minimum(py_hand.times) / 1e6
d

Dict{Any,Any} with 4 entries:
  "C"                   => 3.67164
  "Python numpy"        => 3.9221
  "Python hand-written" => 1335.48
  "Python built-in"     => 68.4895

# 5. Julia (built-in) 

## Written directly in Julia, not in C!

In [55]:
@which sum(a)

In [56]:
j_bench = @benchmark sum($a)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.727 ms (0.00% GC)
  median time:      3.782 ms (0.00% GC)
  mean time:        3.880 ms (0.00% GC)
  maximum time:     5.985 ms (0.00% GC)
  --------------
  samples:          1287
  evals/sample:     1

In [57]:
d["Julia built-in"] = minimum(j_bench.times) / 1e6
d

Dict{Any,Any} with 5 entries:
  "C"                   => 3.67164
  "Python numpy"        => 3.9221
  "Python hand-written" => 1335.48
  "Python built-in"     => 68.4895
  "Julia built-in"      => 3.72721

# 6. Julia (hand-written) 

In [70]:
function mysum(A)   
    s = 0.0  # s = zero(eltype(A))
    @simd for a ∈ A   
        s += a   
    end
    s
end

mysum (generic function with 1 method)
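
The comment in `mysum` hints at a type-generic alternative. Starting the accumulator at `0.0` makes the result a `Float64` even for integer input, whereas `zero(eltype(A))` keeps the accumulator in the array's element type. A sketch under that assumption (`mysum_generic` is a hypothetical name, and it loops over indices rather than elements):

```julia
function mysum_generic(A)
    s = zero(eltype(A))          # accumulator has the element type of A
    @simd for i in eachindex(A)
        @inbounds s += A[i]
    end
    return s
end

mysum_generic([1, 2, 3])         # 6, an Int rather than 6.0
mysum_generic(Float32[1, 2])     # 3.0f0, stays Float32
```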

In [71]:
j_bench_hand = @benchmark mysum($a)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.669 ms (0.00% GC)
  median time:      3.785 ms (0.00% GC)
  mean time:        3.925 ms (0.00% GC)
  maximum time:     6.299 ms (0.00% GC)
  --------------
  samples:          1272
  evals/sample:     1

In [72]:
d["Julia hand-written"] = minimum(j_bench_hand.times) / 1e6
d

Dict{Any,Any} with 6 entries:
  "C"                   => 3.67164
  "Python numpy"        => 3.9221
  "Julia hand-written"  => 3.66924
  "Python hand-written" => 1335.48
  "Python built-in"     => 68.4895
  "Julia built-in"      => 3.72721

# Summary

In [73]:
for (key, value) in sort(collect(d))
    println(rpad(key, 20, "."), lpad(round(value, 1), 8, "."))
end




C........................3.7
Julia built-in...........3.7
Julia hand-written.......3.7
Python built-in.........68.5
Python hand-written...1335.5
Python numpy.............3.9


In [74]:
for (key, value) in sort(collect(d), by=x->x[2])
    println(rpad(key, 20, "."), lpad(round(value, 2), 10, "."))
end

Julia hand-written........3.67
C.........................3.67
Julia built-in............3.73
Python numpy..............3.92
Python built-in..........68.49
Python hand-written....1335.48
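
To relate these numbers to the "within a factor of 2 of C" rule of thumb, one can divide each timing by the C timing. The values below are copied from the dictionary printed above (in milliseconds); `timings` and `ratios` are illustrative names.

```julia
timings = Dict(
    "C"                   => 3.67164,
    "Python numpy"        => 3.9221,
    "Julia hand-written"  => 3.66924,
    "Python hand-written" => 1335.48,
    "Python built-in"     => 68.4895,
    "Julia built-in"      => 3.72721
)

# slowdown of each entry relative to the C timing
ratios = Dict(k => v / timings["C"] for (k, v) in timings)

for (name, r) in sort(collect(ratios), by = x -> x[2])
    println(rpad(name, 22, "."), " ", r)
end
```

On this run both Julia versions and NumPy come out within about 7% of C, the built-in Python `sum` is roughly 19 times slower, and the hand-written Python loop is more than 300 times slower.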


In [63]:
α π

LoadError: syntax: extra token "π" after end of expression

In [77]:
প = π

π = 3.1415926535897...

In [78]:
প * 1000

3141.592653589793

In [79]:
s = "জুলিয়া "

"জুলিয়া "

In [80]:
s^100

"জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া জুলিয়া "

In [75]:
জুলিয়া = 1

1

In [76]:
জুলিয়া + জুলিয়া

2