# How does Julia's performance compare to C and Python?

Let's benchmark the `sum` function, which adds up all elements of a vector:

$$\mathrm{sum}(x) = \sum_{i=1}^n x_i$$

In [3]:
x = rand(10^7);

In [4]:
sum(x)

5.000064512763912e6

In [5]:
d = Dict() # to store the measurement results

Dict{Any,Any}()

# Python

In [62]:
using BenchmarkTools
using PyCall

## numpy

In [9]:
np = pyimport("numpy")

PyObject <module 'numpy' from '/Users/crstnbr/opt/anaconda3/lib/python3.7/site-packages/numpy/__init__.py'>

In [10]:
numpy_sum = np.sum

PyObject <function sum at 0x7fa2294db4d0>

In [11]:
b = @benchmark $numpy_sum($x)

BenchmarkTools.Trial: 
  memory estimate:  336 bytes
  allocs estimate:  6
  --------------
  minimum time:     3.791 ms (0.00% GC)
  median time:      3.994 ms (0.00% GC)
  mean time:        4.099 ms (0.00% GC)
  maximum time:     6.100 ms (0.00% GC)
  --------------
  samples:          1220
  evals/sample:     1

In [12]:
d["Python (numpy)"] = minimum(b.times) / 1e6

3.791396

## hand-written

In [14]:
py"""
def mysum(a):
    s = 0.0
    for x in a:
        s = s + x
    return s
"""

In [15]:
mysum_py = py"mysum"

PyObject <function mysum at 0x7fa22a035b90>

In [20]:
# call a low-level PyCall function to get a Python list, because
# by default PyCall will convert to a NumPy array instead (we benchmarked NumPy above):
xpy_list = PyCall.array2py(x);

In [21]:
b = @benchmark $mysum_py($xpy_list)

BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     210.815 ms (0.00% GC)
  median time:      214.834 ms (0.00% GC)
  mean time:        218.635 ms (0.00% GC)
  maximum time:     237.063 ms (0.00% GC)
  --------------
  samples:          23
  evals/sample:     1

In [22]:
d["Python (hand-written)"] = minimum(b.times) / 1e6

210.815487

## built-in

In [23]:
# get the Python built-in "sum" function:
pysum = pybuiltin("sum")

PyObject <built-in function sum>

In [24]:
b = @benchmark $pysum($xpy_list)

BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     54.422 ms (0.00% GC)
  median time:      62.803 ms (0.00% GC)
  mean time:        62.836 ms (0.00% GC)
  maximum time:     71.827 ms (0.00% GC)
  --------------
  samples:          80
  evals/sample:     1

In [25]:
d["Python (built-in)"] = minimum(b.times) / 1e6

54.421817

# C

## hand-written

In [26]:
c_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
""";

In [27]:
# compile to a shared library by piping c_code to gcc:
# (only works if you have gcc installed)
const Clib = tempname()
using Libdl

In [28]:
open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, c_code)
end

In [29]:
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)

c_sum (generic function with 1 method)

In [30]:
c_sum(x) ≈ sum(x)

true

In [31]:
b = @benchmark c_sum($x)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.578 ms (0.00% GC)
  median time:      11.950 ms (0.00% GC)
  mean time:        11.982 ms (0.00% GC)
  maximum time:     14.829 ms (0.00% GC)
  --------------
  samples:          418
  evals/sample:     1

In [32]:
d["C"] = minimum(b.times) / 1e6

10.578369

## hand-written (with `-ffast-math`)

In [33]:
const Clib_fastmath = tempname()   # make a temporary file

# The same as above but with a -ffast-math flag added
open(`gcc -fPIC -O3 -msse3 -xc -shared -ffast-math -o $(Clib_fastmath * "." * Libdl.dlext) -`, "w") do f
    print(f, c_code) 
end

# define a Julia function that calls the C function:
c_sum_fastmath(X::Array{Float64}) = ccall(("c_sum", Clib_fastmath), Float64, (Csize_t, Ptr{Float64}), length(X), X)

c_sum_fastmath (generic function with 1 method)

In [34]:
b = @benchmark c_sum_fastmath($x)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.717 ms (0.00% GC)
  median time:      4.507 ms (0.00% GC)
  mean time:        4.551 ms (0.00% GC)
  maximum time:     8.822 ms (0.00% GC)
  --------------
  samples:          1098
  evals/sample:     1

In [35]:
d["C (fastmath)"] = minimum(b.times) / 1e6

3.717225

# Julia

## built-in

In [51]:
b = @benchmark sum($x)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.453 ms (0.00% GC)
  median time:      4.646 ms (0.00% GC)
  mean time:        4.647 ms (0.00% GC)
  maximum time:     9.032 ms (0.00% GC)
  --------------
  samples:          1076
  evals/sample:     1

In [52]:
d["Julia (built-in)"] = minimum(b.times) / 1e6

3.452827

## built-in (with `Vector{Any}`)

In [53]:
x_any = Vector{Any}(x)
b = @benchmark sum($x_any)

BenchmarkTools.Trial: 
  memory estimate:  152.59 MiB
  allocs estimate:  9999999
  --------------
  minimum time:     188.855 ms (0.00% GC)
  median time:      228.223 ms (7.31% GC)
  mean time:        227.978 ms (8.67% GC)
  maximum time:     287.137 ms (16.20% GC)
  --------------
  samples:          22
  evals/sample:     1

In [54]:
d["Julia (built-in, Any)"] = minimum(b.times) / 1e6

188.8551
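
The memory estimate above gives a hint about where the time goes: 152.59 MiB over 9999999 allocations is about 16 bytes per allocation, which is consistent with every intermediate sum being heap-allocated ("boxed") because the compiler cannot know the element type in advance. The following is a minimal sketch of how to check what the compiler sees; `xs` and `xs_any` are hypothetical small stand-ins for `x` and `x_any`.

```julia
xs = rand(1000)             # small stand-in for x (hypothetical name)
xs_any = Vector{Any}(xs)    # same values, but an abstract element type

# The values are identical; only the declared element type differs.
@show eltype(xs) isconcretetype(eltype(xs))
@show eltype(xs_any) isconcretetype(eltype(xs_any))

# Both containers give the same sum up to floating-point rounding.
@show sum(xs) ≈ sum(xs_any)
```

Running `@benchmark sum($xs_any)` on the small array should show the same pattern of roughly one allocation per element.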

## hand-written

In [55]:
function mysum(A)
    s = zero(eltype(A)) # the correct type of zero for A
    for a in A
        s += a
    end
    return s
end

mysum (generic function with 1 method)

In [56]:
b = @benchmark mysum($x)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.009 ms (0.00% GC)
  median time:      12.948 ms (0.00% GC)
  mean time:        13.018 ms (0.00% GC)
  maximum time:     18.651 ms (0.00% GC)
  --------------
  samples:          384
  evals/sample:     1

In [57]:
d["Julia (hand-written)"] = minimum(b.times) / 1e6

11.008869

## hand-written (with `@simd`)

In [58]:
function mysum_simd(A)
    s = zero(eltype(A)) # the correct type of zero for A
    @simd for a in A
        s += a
    end
    return s
end

mysum_simd (generic function with 1 method)

In [59]:
b = @benchmark mysum_simd($x)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.468 ms (0.00% GC)
  median time:      4.810 ms (0.00% GC)
  mean time:        4.851 ms (0.00% GC)
  maximum time:     16.269 ms (0.00% GC)
  --------------
  samples:          1030
  evals/sample:     1

In [60]:
d["Julia (hand-written, simd)"] = minimum(b.times) / 1e6

3.468321

## Summary

In [61]:
for (key, value) in sort(collect(d), by=x->x[2])
    println(rpad(key, 30, "."), lpad(round(value, digits=2), 10, "."))
end

Julia (built-in)....................3.45
Julia (hand-written, simd)..........3.47
C (fastmath)........................3.72
Python (numpy)......................3.79
C..................................10.58
Julia (hand-written)...............11.01
Python (built-in)..................54.42
Julia (built-in, Any).............188.86
Python (hand-written).............210.82
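
To read the table as slowdown factors rather than absolute times, one can divide every entry by the fastest one. The sketch below copies the rounded minimum times (in ms) from the summary above into a hypothetical `timings` dictionary instead of reusing `d`, so that it runs on its own.

```julia
# Minimum times in ms, copied from the summary table above.
timings = Dict(
    "Julia (built-in)"           => 3.45,
    "Julia (hand-written, simd)" => 3.47,
    "C (fastmath)"               => 3.72,
    "Python (numpy)"             => 3.79,
    "C"                          => 10.58,
    "Julia (hand-written)"       => 11.01,
    "Python (built-in)"          => 54.42,
    "Julia (built-in, Any)"      => 188.86,
    "Python (hand-written)"      => 210.82,
)

fastest = minimum(values(timings))

# Print each entry as a multiple of the fastest time, sorted from fast to slow.
for (key, value) in sort(collect(timings), by = last)
    factor = string(round(value / fastest, digits = 1), "x")
    println(rpad(key, 30, "."), lpad(factor, 10, "."))
end
```

With these numbers the hand-written Python loop comes out roughly 61 times slower than Julia's built-in `sum`.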


And of course, our hand-written Julia implementation is type-generic!
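
As a small illustration of what type-generic means here, the same `mysum` can be called on arrays with different element types. The definition is repeated from above so that the snippet is self-contained; the input arrays are made up for this example.

```julia
# Same definition as in the notebook above.
function mysum(A)
    s = zero(eltype(A)) # the correct type of zero for A
    for a in A
        s += a
    end
    return s
end

@show mysum([1, 2, 3])                    # Vector{Int}
@show mysum([1//2, 1//3, 1//6])           # Vector{Rational{Int}}, exact arithmetic
@show mysum([1.0 + 2.0im, 3.0 - 1.0im])   # Vector{ComplexF64}
@show mysum(Float32[0.5, 0.25])           # result stays a Float32
```

Julia compiles a specialized version of `mysum` for each element type, which is why `zero(eltype(A))` is used instead of a hard-coded `0.0`.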

# What about other functions?

### Log

In [40]:
using BenchmarkTools

# uses the system C library
clog(x) = ccall(:log, Float64, (Float64,), x)
# uses LLVM's log
llvmlog(x) =  ccall(Symbol("llvm.log.f64"), llvmcall, Float64, (Float64,), x)

@btime log($(Ref(1.2))[])    
@btime clog($(Ref(1.2))[])    
@btime llvmlog($(Ref(1.2))[]);

  5.277 ns (0 allocations: 0 bytes)
  4.664 ns (0 allocations: 0 bytes)
  4.356 ns (0 allocations: 0 bytes)


In [41]:
@which log(1.2)

### Exp

In [42]:
using BenchmarkTools

# uses the system C library
cexp(x) = ccall(:exp, Float64, (Float64,), x)
# uses LLVM's exp
llvmexp(x) =  ccall(Symbol("llvm.exp.f64"), llvmcall, Float64, (Float64,), x)

@btime exp($(Ref(1.2))[])    
@btime cexp($(Ref(1.2))[])    
@btime llvmexp($(Ref(1.2))[]);

  7.115 ns (0 allocations: 0 bytes)
  4.747 ns (0 allocations: 0 bytes)
  4.346 ns (0 allocations: 0 bytes)


(`6.037 ns` on Julia 1.6)

In [43]:
@which exp(1.2)

### Matrix multiplication

In [44]:
using BenchmarkTools

function A_mul_B!(𝐂, 𝐀, 𝐁)
   @inbounds for m ∈ axes(𝐀,1), n ∈ axes(𝐁,2)
       𝐂mn = zero(eltype(𝐂))
       for k ∈ axes(𝐀,2)
           𝐂mn += 𝐀[m,k] * 𝐁[k,n]
       end
       𝐂[m,n] = 𝐂mn
   end
end

A_mul_B! (generic function with 1 method)

In [45]:
N = 10
C = zeros(N,N);
A = rand(N,N);
B = rand(N,N);

In [46]:
@btime A_mul_B!($C,$A,$B);

  631.552 ns (0 allocations: 0 bytes)


The same loop nest can be vectorized with the `@avx` macro from [LoopVectorization.jl](https://github.com/chriselrod/LoopVectorization.jl):

In [47]:
using LoopVectorization

function A_mul_B_avx!(𝐂, 𝐀, 𝐁)
   @avx for m ∈ axes(𝐀,1), n ∈ axes(𝐁,2)
       𝐂mn = zero(eltype(𝐂))
       for k ∈ axes(𝐀,2)
           𝐂mn += 𝐀[m,k] * 𝐁[k,n]
       end
       𝐂[m,n] = 𝐂mn
   end
end

A_mul_B_avx! (generic function with 1 method)

In [48]:
@btime A_mul_B_avx!($C,$A,$B);

  81.066 ns (0 allocations: 0 bytes)


In [49]:
using LinearAlgebra

@btime mul!($C, $A, $B); # calls underlying BLAS

  258.555 ns (0 allocations: 0 bytes)


In [12]:
c_code = """
#include <stddef.h>
#include <math.h>

void gemm_mnk(double* restrict C, double* restrict A, double* restrict B, long M, long K, long N){
  for (long i = 0; i < M*N; i++){
    C[i] = 0.0;
  }
  for (long m = 0; m < M; m++){
    for (long n = 0; n < N; n++){
      for (long k = 0; k < K; k++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }
  return;
}
void gemm_mkn(double* restrict C, double* restrict A, double* restrict B, long M, long K, long N){
  for (long i = 0; i < M*N; i++){
    C[i] = 0.0;
  }
  for (long m = 0; m < M; m++){
    for (long k = 0; k < K; k++){
      for (long n = 0; n < N; n++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }
  return;
}
void gemm_nmk(double* restrict C, double* restrict A, double* restrict B, long M, long K, long N){
  for (long i = 0; i < M*N; i++){
    C[i] = 0.0;
  }
  for (long n = 0; n < N; n++){
    for (long m = 0; m < M; m++){
      for (long k = 0; k < K; k++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }
  return;
}
void gemm_nkm(double* restrict C, double* restrict A, double* restrict B, long M, long K, long N){
  for (long i = 0; i < M*N; i++){
    C[i] = 0.0;
  }
  for (long n = 0; n < N; n++){
    for (long k = 0; k < K; k++){
      for (long m = 0; m < M; m++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }
  return;
}
void gemm_kmn(double* restrict C, double* restrict A, double* restrict B, long M, long K, long N){
  for (long i = 0; i < M*N; i++){
    C[i] = 0.0;
  }
  for (long k = 0; k < K; k++){
    for (long m = 0; m < M; m++){
      for (long n = 0; n < N; n++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }
  return;
}
void gemm_knm(double* restrict C, double* restrict A, double* restrict B, long M, long K, long N){
  for (long i = 0; i < M*N; i++){
    C[i] = 0.0;
  }
  for (long k = 0; k < K; k++){
    for (long n = 0; n < N; n++){
      for (long m = 0; m < M; m++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }
  return;
}
""";

In [13]:
# compile to a shared library by piping c_code to gcc:
# (only works if you have gcc installed)
const Clib = tempname()
using Libdl

open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, c_code)
end

c_gemm_mnk(C::Array{Float64},A::Array{Float64},B::Array{Float64}) = ccall(("gemm_mnk", Clib), Cvoid, (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Clong, Clong, Clong), C, A, B, size(A,1), size(A,2), size(B, 2))

c_gemm_mkn(C::Array{Float64},A::Array{Float64},B::Array{Float64}) = ccall(("gemm_mkn", Clib), Cvoid, (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Clong, Clong, Clong), C, A, B, size(A,1), size(A,2), size(B, 2))

c_gemm_nmk(C::Array{Float64},A::Array{Float64},B::Array{Float64}) = ccall(("gemm_nmk", Clib), Cvoid, (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Clong, Clong, Clong), C, A, B, size(A,1), size(A,2), size(B, 2))

c_gemm_nkm(C::Array{Float64},A::Array{Float64},B::Array{Float64}) = ccall(("gemm_nkm", Clib), Cvoid, (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Clong, Clong, Clong), C, A, B, size(A,1), size(A,2), size(B, 2))

c_gemm_kmn(C::Array{Float64},A::Array{Float64},B::Array{Float64}) = ccall(("gemm_kmn", Clib), Cvoid, (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Clong, Clong, Clong), C, A, B, size(A,1), size(A,2), size(B, 2))

c_gemm_knm(C::Array{Float64},A::Array{Float64},B::Array{Float64}) = ccall(("gemm_knm", Clib), Cvoid, (Ptr{Float64}, Ptr{Float64}, Ptr{Float64}, Clong, Clong, Clong), C, A, B, size(A,1), size(A,2), size(B, 2))

c_gemm_knm (generic function with 1 method)

In [14]:
C2 = zeros(N,N);
c_gemm_mnk(C2, A, B)
@assert C ≈ C2
c_gemm_mkn(C2, A, B)
@assert C ≈ C2
c_gemm_nmk(C2, A, B)
@assert C ≈ C2
c_gemm_nkm(C2, A, B)
@assert C ≈ C2
c_gemm_kmn(C2, A, B)
@assert C ≈ C2
c_gemm_knm(C2, A, B)
@assert C ≈ C2

In [15]:
@btime c_gemm_mnk($C2, $A, $B)
@btime c_gemm_mkn($C2, $A, $B)
@btime c_gemm_nmk($C2, $A, $B)
@btime c_gemm_nkm($C2, $A, $B)
@btime c_gemm_kmn($C2, $A, $B)
@btime c_gemm_knm($C2, $A, $B)

  473.518 ns (0 allocations: 0 bytes)
  742.070 ns (0 allocations: 0 bytes)
  413.734 ns (0 allocations: 0 bytes)
  403.945 ns (0 allocations: 0 bytes)
  716.914 ns (0 allocations: 0 bytes)
  432.347 ns (0 allocations: 0 bytes)


**Note for larger `N`:** For larger matrices BLAS runs multithreaded, whereas our `A_mul_B_avx!` stays on a single thread. In this case `A_mul_B_avx!` is slower than `mul!`.

In [16]:
N = 100
C = zeros(N,N);
A = rand(N,N);
B = rand(N,N);

In [17]:
@btime A_mul_B_avx!($C, $A, $B);

  43.625 μs (0 allocations: 0 bytes)


In [18]:
@btime mul!($C, $A, $B);

  19.485 μs (0 allocations: 0 bytes)
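
To see how much of this gap is due to threading, one can limit BLAS to a single thread and time `mul!` again. The following is a sketch assuming Julia 1.6 or newer, where `BLAS.get_num_threads` is available; `nthreads_before` is a hypothetical helper variable.

```julia
using LinearAlgebra

# Remember the current BLAS thread count so it can be restored afterwards.
nthreads_before = BLAS.get_num_threads()

N = 100
C = zeros(N, N); A = rand(N, N); B = rand(N, N)

BLAS.set_num_threads(1)                 # force single-threaded BLAS
mul!(C, A, B)
@assert C ≈ A * B                       # the result does not depend on the thread count

BLAS.set_num_threads(nthreads_before)   # restore the previous setting
```

Placing `@btime mul!($C, $A, $B);` between the two `set_num_threads` calls gives a single-threaded BLAS timing that can be compared with `A_mul_B_avx!` on equal terms.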


However, [Octavian.jl](https://github.com/JuliaLinearAlgebra/Octavian.jl) is an (experimental) package that adds multithreading on top of LoopVectorization.jl. With it, pure Julia code again beats OpenBLAS, here by roughly a factor of two for `N = 100`. (Restart the kernel before executing the following.)

In [9]:
Threads.nthreads()

6

In [1]:
using Pkg
pkg"activate --temp"
pkg"add Octavian"
pkg"add BenchmarkTools"

In [2]:
using Octavian, LinearAlgebra, BenchmarkTools

In [5]:
N = 100; C = zeros(N,N); A = rand(N,N); B = rand(N,N);

In [6]:
@btime Octavian.matmul!($C, $A, $B);

  8.810 μs (0 allocations: 0 bytes)


In [7]:
@btime mul!($C, $A, $B);

  18.590 μs (0 allocations: 0 bytes)


**Resources/more info:**

https://github.com/mitmath/18S096/blob/master/lectures/lecture1/Performance-variation.ipynb

https://github.com/mitmath/18S096/blob/master/lectures/lecture1/Boxes-and-registers.ipynb


**More comprehensive benchmarks:** https://chriselrod.github.io/LoopVectorization.jl/stable/examples/matrix_multiplication/#Matrix-Multiplication