# Is Julia fast?

Julia isn't fast *per se*.

One can write terribly slow code in any language, including Julia.

So let's ask a different question.

# *Can* Julia be fast?

### Microbenchmarks
<img src="imgs/benchmarks.svg" alt="drawing" width="800"/>

### More realistic case: Vandermonde matrix
(modified from [Steve's Julia intro](https://web.mit.edu/18.06/www/Fall17/1806/julia/Julia-intro.pdf))

[Vandermonde matrix:](https://en.wikipedia.org/wiki/Vandermonde_matrix)
\begin{align}V=\begin{bmatrix}1&\alpha _{1}&\alpha _{1}^{2}&\dots &\alpha _{1}^{n-1}\\1&\alpha _{2}&\alpha _{2}^{2}&\dots &\alpha _{2}^{n-1}\\1&\alpha _{3}&\alpha _{3}^{2}&\dots &\alpha _{3}^{n-1}\\\vdots &\vdots &\vdots &\ddots &\vdots \\1&\alpha _{m}&\alpha _{m}^{2}&\dots &\alpha _{m}^{n-1}\end{bmatrix}\end{align}

In [1]:
using PyCall

In [2]:
np = pyimport("numpy")

PyObject <module 'numpy' from '/Users/crstnbr/opt/anaconda3/lib/python3.7/site-packages/numpy/__init__.py'>

In [3]:
np.vander(1:5, increasing=true)

5×5 Array{Int64,2}:
 1  1   1    1    1
 1  2   4    8   16
 1  3   9   27   81
 1  4  16   64  256
 1  5  25  125  625

The source code for this function is [here](https://github.com/numpy/numpy/blob/v1.16.1/numpy/lib/twodim_base.py#L475-L563). It calls `np.multiply.accumulate`, which is implemented in C [here](https://github.com/numpy/numpy/blob/deea4983aedfa96905bbaee64e3d1de84144303f/numpy/core/src/umath/ufunc_object.c#L3678). However, this code doesn't perform the computation itself; it mostly checks types and dispatches. The actual kernel that gets called is [here](https://github.com/numpy/numpy/blob/deea4983aedfa96905bbaee64e3d1de84144303f/numpy/core/src/umath/loops.c.src#L1742). This isn't even C code but a template for C code, which is used to generate type-specific kernels.

Overall, this setup only supports a limited set of types, like `Float64`, `Float32`, and so forth.

Here is a simple Julia implementation:

In [4]:
function vander(x::AbstractVector{T}, n=length(x)) where T
    m = length(x)
    V = Matrix{T}(undef, m, n)
    for j = 1:m
        V[j,1] = one(x[j])
    end
    for i = 2:n
        for j = 1:m
            V[j,i] = x[j] * V[j,i-1]
        end
    end
    return V
end

vander (generic function with 2 methods)

In [5]:
vander(1:5)

5×5 Array{Int64,2}:
 1  1   1    1    1
 1  2   4    8   16
 1  3   9   27   81
 1  4  16   64  256
 1  5  25  125  625

#### A quick speed comparison

<details>
  <summary>Show Code</summary>
<br>
    
```julia
using BenchmarkTools, Plots
ns = exp10.(range(1, 4, length=30));

tnp = Float64[]
tjl = Float64[]
for n in ns
    x = 1:n |> collect
    push!(tnp, @belapsed np.vander($x) samples=3 evals=1)
    push!(tjl, @belapsed vander($x) samples=3 evals=1)
end
plot(ns, tnp./tjl, m=:circle, xscale=:log10, xlab="matrix size", ylab="NumPy time / Julia time", legend=:false)
```
</details>

 <img src="imgs/vandermonde.svg" alt="drawing" width="600"/>

Note that the clean and concise Julia implementation is **beating NumPy's C implementation for small matrices** and is **on par for large matrix sizes**.

At the same time, the Julia code is *generic* and works for arbitrary types!

In [6]:
vander(Int32[4, 8, 16, 32])

4×4 Array{Int32,2}:
 1   4    16     64
 1   8    64    512
 1  16   256   4096
 1  32  1024  32768

It even works for non-numerical types. The only requirement is that the type has a *one* (identity element) and a multiplication operation defined.

In [7]:
vander(["this", "is", "a", "test"])

4×4 Array{String,2}:
 ""  "this"  "thisthis"  "thisthisthis"
 ""  "is"    "isis"      "isisis"
 ""  "a"     "aa"        "aaa"
 ""  "test"  "testtest"  "testtesttest"

Here, `one(String) == ""` since the empty string is the identity under multiplication (string concatenation).
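To make this requirement concrete, here is a minimal sketch with a user-defined type. `Mod7` (integers modulo 7) is a hypothetical type invented for this illustration; `vander` is repeated from above so that the block runs on its own.

```julia
# Same implementation as above, repeated so that this block is self-contained.
function vander(x::AbstractVector{T}, n=length(x)) where T
    m = length(x)
    V = Matrix{T}(undef, m, n)
    for j = 1:m
        V[j,1] = one(x[j])
    end
    for i = 2:n
        for j = 1:m
            V[j,i] = x[j] * V[j,i-1]
        end
    end
    return V
end

# Hypothetical example type: integers modulo 7.
struct Mod7
    v::Int
    Mod7(v::Integer) = new(mod(v, 7))
end

# The only two things vander needs: a multiplication and a `one`.
Base.:*(a::Mod7, b::Mod7) = Mod7(a.v * b.v)
Base.one(::Type{Mod7}) = Mod7(1)
Base.one(::Mod7) = Mod7(1)

V = vander([Mod7(3), Mod7(5)], 4)
[e.v for e in V]   # 2×4 matrix of residues: [1 3 2 6; 1 5 4 6]
```

No method of `vander` had to be changed or added for `Mod7`; the generic code is simply specialized for the new element type.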

# How can Julia be fast?

<p><img src="imgs/from_source_to_native.png" alt="drawing" width="800"/></p>
 
**AST = Abstract Syntax Tree**

**SSA = Static Single Assignment**

**[LLVM](https://de.wikipedia.org/wiki/LLVM) = Low Level Virtual Machine**

### Specialization and code inspection

**Julia specializes on the types of function arguments.**

When a function is called for the first time, Julia compiles efficient machine code for the given input types.

If it is called again, the already existing machine code is reused, until we call the function with different input types.

In [8]:
func(x,y) = x^2 + y

func (generic function with 1 method)

In [9]:
@time func(1,2)
@time func(1,2)

  0.000000 seconds
  0.000000 seconds


3

In [12]:
f(x,y) = x^2 + y

f (generic function with 1 method)

In [13]:
@timev f(1,2)

  0.000000 seconds
elapsed time (ns): 45


3

In [14]:
@timev f(1,2)

  0.000000 seconds
elapsed time (ns): 47


3

In [10]:
?@time

```
@time
```

A macro to execute an expression, printing the time it took to execute, the number of allocations, and the total number of bytes its execution caused to be allocated, before returning the value of the expression.

See also [`@timev`](@ref), [`@timed`](@ref), [`@elapsed`](@ref), and [`@allocated`](@ref).

!!! note
    For more serious benchmarking, consider the `@btime` macro from the BenchmarkTools.jl package which among other things evaluates the function multiple times in order to reduce noise.


```julia-repl
julia> @time rand(10^6);
  0.001525 seconds (7 allocations: 7.630 MiB)

julia> @time begin
           sleep(0.3)
           1+1
       end
  0.301395 seconds (8 allocations: 336 bytes)
2
```


**First call:** compilation + running the code

**Second call:** running the code
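The difference is easiest to see with a function that has not been called before. The following is a minimal sketch; `fresh` is a hypothetical function defined only for this illustration, and the measured times depend on the machine and the Julia version, so no numbers are shown.

```julia
# Hypothetical function, freshly defined so that no compiled code exists yet.
fresh(x, y) = x^2 + y

t_first  = @elapsed fresh(1, 2)      # first call: compilation for (Int, Int) + run
t_second = @elapsed fresh(1, 2)      # second call: reuses the cached machine code
t_float  = @elapsed fresh(1.0, 2.0)  # new argument types: a new specialization is compiled

println("first call:         ", t_first, " s")
println("second call:        ", t_second, " s")
println("first Float64 call: ", t_float, " s")
```

One would typically expect the first and third timings to be noticeably larger than the second, since only those include compilation.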

In [15]:
@time func(1,2)

  0.000000 seconds


3

If the input types change, Julia compiles a new specialization of the function.

In [16]:
@time func(1.3,4.8)
@time func(1.3,4.8)

  0.000000 seconds
  0.000000 seconds


6.49

In [17]:
myfunc(x,y) = sum(rand(4,4)) + x^2 + y

myfunc (generic function with 1 method)

In [18]:
@time myfunc(1,3)
@time myfunc(1,3)

  0.000003 seconds (1 allocation: 208 bytes)
  0.000010 seconds (1 allocation: 208 bytes)


11.877776113671842

In [19]:
@compiletime myfunc(1,3)

LoadError: LoadError: UndefVarError: @compiletime not defined
in expression starting at In[19]:1

We now have two specialized versions of `func` in the cache: one for `Int64` arguments and another one for `Float64` arguments.

### *But I really want to see what happens!*

We can inspect the code at all transformation stages with a bunch of macros:

* The AST after parsing (**`@macroexpand`**)
* The AST after lowering (**`@code_lowered`**)
* The AST after type inference and optimization (**`@code_typed`**, **`@code_warntype`**)
* The LLVM IR (**`@code_llvm`**)
* The assembly machine code (**`@code_native`**)
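The first stage in this list can also be looked at directly. As a small sketch, `Meta.parse` turns source text into an expression object (the AST), and `@macroexpand` returns the expression a macro call expands to, without evaluating it. The variable names `ex` and `expanded` are just illustrative.

```julia
# Parse source text into an expression (the AST).
ex = Meta.parse("x^2 + y")

println(ex.head)   # call
println(ex.args)   # Any[:+, :(x ^ 2), :y]

# Expand a macro call without running it; x does not need to be defined.
expanded = @macroexpand @assert x > 0
println(expanded)
```

Outside of the REPL or a notebook, the `@code_*` macros additionally require `using InteractiveUtils`.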

In [21]:
func(1,2)

3

In [20]:
@code_typed func(1,2)

CodeInfo(
1 ─ %1 = Base.mul_int(x, x)::Int64
│   %2 = Base.add_int(%1, y)::Int64
└──      return %2
) => Int64

In [22]:
@code_lowered func(1,2)

CodeInfo(
1 ─ %1 = Core.apply_type(Base.Val, 2)
│   %2 = (%1)()
│   %3 = Base.literal_pow(Main.:^, x, %2)
│   %4 = %3 + y
└──      return %4
)

In [23]:
@code_llvm func(1,2)


;  @ In[8]:1 within `func'
define i64 @julia_func_2838(i64, i64) {
top:
; ┌ @ intfuncs.jl:296 within `literal_pow'
; │┌ @ int.jl:87 within `*'
    %2 = mul i64 %0, %0
; └└
; ┌ @ int.jl:86 within `+'
   %3 = add i64 %2, %1
; └
  ret i64 %3
}


We can remove the comments (lines starting with `;`) using `debuginfo=:none`.

In [24]:
@code_llvm debuginfo=:none func(1,2)


define i64 @julia_func_2845(i64, i64) {
top:
  %2 = mul i64 %0, %0
  %3 = add i64 %2, %1
  ret i64 %3
}


In [25]:
@code_native debuginfo=:none func(1,2)

	.section	__TEXT,__text,regular,pure_instructions
	imulq	%rdi, %rdi
	leaq	(%rdi,%rsi), %rax
	retq
	nopl	(%rax)


Let's compare this to `Float64` input.

In [26]:
@code_native debuginfo=:none func(1.2,2.9)

	.section	__TEXT,__text,regular,pure_instructions
	vmulsd	%xmm0, %xmm0, %xmm0
	vaddsd	%xmm1, %xmm0, %xmm0
	retq
	nopl	(%rax)


In [27]:
@code_llvm debuginfo=:none func(1.2,2.9)


define double @julia_func_2852(double, double) {
top:
  %2 = fmul double %0, %0
  %3 = fadd double %2, %1
  ret double %3
}


## How important is code specialization?

Let's try to estimate the performance gain by specialization.

We wrap our numbers into a custom type which internally stores them as `Any` to prevent specialization.

(This is qualitatively comparable to what Python does.)

In [28]:
struct Anything
    value::Any
end

operation(x::Number) = x^2 + sqrt(x)
operation(x::Anything) = x.value^2 + sqrt(x.value)

operation (generic function with 2 methods)

In [29]:
using BenchmarkTools

@btime operation(2);
@btime operation(2.0);

x = Anything(2.0)
@btime operation($x);

  1.278 ns (0 allocations: 0 bytes)
  1.278 ns (0 allocations: 0 bytes)
  55.708 ns (3 allocations: 48 bytes)


**That's a slowdown by a factor of about 40!**
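For comparison, a common way to keep a wrapper type without losing specialization is to make the field type a type parameter. The following is a minimal sketch under that assumption; `Typed` and `operation_typed` are hypothetical names chosen for this illustration, and no timings are claimed here.

```julia
# Hypothetical parametric wrapper: the field type is part of the wrapper's type,
# so the compiler knows the type of `value` when it specializes a method.
struct Typed{T}
    value::T
end

operation_typed(x::Typed) = x.value^2 + sqrt(x.value)

x_typed = Typed(4.0)                      # has type Typed{Float64}
println(typeof(x_typed))
println(fieldtype(typeof(x_typed), :value))
println(operation_typed(x_typed))         # 4.0^2 + sqrt(4.0) = 18.0
```

Benchmarking `operation_typed($x_typed)` with `@btime` next to `operation($x)` from above is a natural way to check how much of the gap this closes on a given machine.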

In [30]:
@code_native debuginfo=:none operation(2.0)

	.section	__TEXT,__text,regular,pure_instructions
	pushq	%rax
	vxorps	%xmm1, %xmm1, %xmm1
	vucomisd	%xmm0, %xmm1
	ja	L25
	vmulsd	%xmm0, %xmm0, %xmm1
	vsqrtsd	%xmm0, %xmm0, %xmm0
	vaddsd	%xmm0, %xmm1, %xmm0
	popq	%rax
	retq
L25:
	movabsq	$throw_complex_domainerror, %rax
	movabsq	$4514851472, %rdi       ## imm = 0x10D1B2A90
	callq	*%rax
	ud2
	nopw	%cs:(%rax,%rax)
	nopl	(%rax,%rax)


In [31]:
@code_native debuginfo=:none operation(x)

	.section	__TEXT,__text,regular,pure_instructions
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	andq	$-32, %rsp
	subq	$96, %rsp
	movq	%rdi, %r14
	vxorps	%xmm0, %xmm0, %xmm0
	vmovaps	%ymm0, 32(%rsp)
	movabsq	$jl_get_ptls_states_fast, %rax
	vzeroupper
	callq	*%rax
	movq	%rax, %r15
	movq	$8, 32(%rsp)
	movq	(%rax), %rax
	movq	%rax, 40(%rsp)
	leaq	32(%rsp), %rax
	movq	%rax, (%r15)
	movq	(%r14), %r12
	movabsq	$jl_system_image_data, %rax
	movq	%rax, 8(%rsp)
	movq	%r12, 16(%rsp)
	movabsq	$jl_system_image_data, %rax
	movq	%rax, 24(%rsp)
	movabsq	$jl_apply_generic, %r13
	movabsq	$jl_system_image_data, %rdi
	leaq	8(%rsp), %r14
	movq	%r14, %rsi
	movl	$3, %edx
	callq	*%r13
	movq	%rax, %rbx
	movq	%rax, 56(%rsp)
	movq	%r12, 8(%rsp)
	movabsq	$jl_system_image_data, %rdi
	movq	%r14, %rsi
	movl	$1, %edx
	callq	*%r13
	movq	%rax, 48(%rsp)
	movq	%rbx, 8(%rsp)
	movq	%rax, 16(%rsp)
	movabsq	$jl_system_image_data, %rdi
	movq	%r14, %rsi
	movl	$2, %edx
	callq	*%r13
	mo

# Make run-time the fun time.

In scientific computations, we typically run a piece of code many times over and over again. Think of a Monte Carlo simulation, for example, where we perform the update and the Metropolis check millions of times.

**Therefore, we want our run-time to be as short as possible.**

On the other hand, for a given set of input arguments, Julia compiles the piece of code only once, as we have seen above. The time it takes to compile our code is almost always negligible compared to the duration of the full computation.

A general strategy is therefore to move parts of the computation to compile-time.

Since Julia specializes on types, at compile-time **only type information is available to the compiler.**

In [None]:
f1(x::Int) = x + 1
f2(x::Int) = x + 2

function f_slow(x::Int, p::Bool)
    if p                                # check depends on the value of p
        return f1(x)
    else
        return f2(x)
    end
end

In [None]:
@code_llvm debuginfo=:none f_slow(1, true)

We can eliminate the if branch by moving the condition check to the type domain. This way, it **will only be evaluated once at compile-time.**

In [None]:
abstract type Boolean end
struct True <: Boolean end # type domain true
struct False <: Boolean end # type domain false

function f_fast(x::Int, p::Boolean)
    if typeof(p) == True                # check solely based on the type of p
        return f1(x)
    else
        return f2(x)
    end
end

In [None]:
@code_llvm debuginfo=:none f_fast(1, True())

In [None]:
abstract type Boolean end
struct True <: Boolean end # type domain true
struct False <: Boolean end # type domain false

f_fast(x::Int, p::True) = f1(x)
f_fast(x::Int, p::False) = f2(x)

In [None]:
@which f_fast(1, True())

In [None]:
@code_llvm debuginfo=:none f_fast(1, True())

(Multiple) dispatch allows us to avoid explicit if branches operating on types in an elegant way.
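Julia's standard library already provides a type for lifting values into the type domain: `Val`. As a minimal sketch, the same pattern can be written without defining `True` and `False` ourselves. The names `g1`, `g2`, and `g_val` are hypothetical and only mirror `f1`, `f2`, and `f_fast` from above.

```julia
g1(x::Int) = x + 1
g2(x::Int) = x + 2

# Val(true) has type Val{true}, Val(false) has type Val{false},
# so the choice between g1 and g2 is made by dispatch on the type.
g_val(x::Int, ::Val{true})  = g1(x)
g_val(x::Int, ::Val{false}) = g2(x)

println(g_val(1, Val(true)))    # 2
println(g_val(1, Val(false)))   # 3
```

Note that this only pays off if the `Val` is known to the compiler. Constructing `Val(p)` from a `p` that is only known at run-time merely moves the branch into a dynamic dispatch.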

### More realistic example: StaticArrays.jl

In [32]:
x = rand(3)

3-element Array{Float64,1}:
 0.7866199337364581
 0.10521574900575303
 0.7165338105616172

In [None]:
using StaticArrays

m = SMatrix{2,2}(1, 2, 3, 4)

In [None]:
size(m)

In [None]:
size(typeof(m))

In [None]:
# compare to
M = Matrix(m)
size(typeof(M))
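The point of the comparison above is that a static array carries its size in its type, whereas for a regular `Matrix` the size is only a run-time value. The idea can be sketched without the package; `TinyVec` and `total` below are hypothetical names for illustration and are not how StaticArrays.jl is actually implemented.

```julia
# Hypothetical fixed-size vector: the length N is a type parameter.
struct TinyVec{N,T}
    data::NTuple{N,T}
end

# The length can be answered from the type alone, no instance needed.
Base.length(::Type{TinyVec{N,T}}) where {N,T} = N
Base.length(v::TinyVec) = length(typeof(v))

# N is known when this method is specialized, so the loop bounds are
# compile-time constants for each concrete TinyVec{N,T}.
function total(v::TinyVec{N,T}) where {N,T}
    s = zero(T)
    for i in 1:N
        s += v.data[i]
    end
    return s
end

v = TinyVec((1.0, 2.0, 3.0))
println(typeof(v))           # TinyVec{3, Float64} (exact formatting depends on the Julia version)
println(length(typeof(v)))   # 3
println(total(v))            # 6.0
```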

In [None]:
m2 = @SMatrix [ 1  3 ;
                2  4 ]

In [None]:
m3 = @SMatrix rand(4,4)

In [None]:
a = @SArray randn(2, 2, 2)

**Static arrays are fast ...**

In [None]:
using BenchmarkTools, LinearAlgebra

In [None]:
println("Inversion")
@btime inv($m);
@btime inv($M);

In [None]:
@code_native debuginfo=:none inv(m)

In [None]:
@code_native debuginfo=:none inv(M)

In [None]:
@edit inv(m)

```
============================================
    Benchmarks for 3×3 Float64 matrices
============================================
Matrix multiplication               -> 5.9x speedup
Matrix multiplication (mutating)    -> 1.8x speedup
Matrix addition                     -> 33.1x speedup
Matrix addition (mutating)          -> 2.5x speedup
Matrix determinant                  -> 112.9x speedup
Matrix inverse                      -> 67.8x speedup
Matrix symmetric eigendecomposition -> 25.0x speedup
Matrix Cholesky decomposition       -> 8.8x speedup
Matrix LU decomposition             -> 6.1x speedup
Matrix QR decomposition             -> 65.0x speedup
```

**... as long as they are small** because they put a lot of stress on the compiler!

In [None]:
# might take longer to compile and the speedup is gone
N = 20
M = rand(N,N);
m = SMatrix{N,N}(M);

println("Inversion")
@btime inv($m);
@btime inv($M);

# Are explicit type annotations necessary? (like in C or Fortran)

Note that Julia's type inference is powerful. Specifying types **is not** necessary for best performance!

In [33]:
function my_function(x)
    y = rand()
    z = rand()
    x+y+z
end

function my_function_typed(x::Int)::Float64
    y::Float64 = rand()
    z::Float64 = rand()
    x+y+z
end

my_function_typed (generic function with 1 method)

In [35]:
@btime my_function(10);
@btime my_function_typed(10);

  10.794 ns (0 allocations: 0 bytes)
  10.687 ns (0 allocations: 0 bytes)


However, annotating types explicitly can serve a purpose:

* **Define a user interface/type filter** (will throw error if incompatible type is given)
* Enforce conversions
* Rarely, help the compiler infer types in tricky situations
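The first two points can be illustrated with a short sketch. `double_int` and `as_float` are hypothetical functions made up for this example.

```julia
# Type filter: only Integer arguments are accepted.
double_int(x::Integer) = 2x

rejected = try
    double_int(1.5)      # no method matches a Float64 argument
    false
catch err
    err isa MethodError  # the call is rejected with a MethodError
end
println(rejected)        # true

# Enforced conversion: the return value is converted to Float64.
function as_float(x)::Float64
    return x
end
println(as_float(1))     # 1.0
```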

# Core messages of this Notebook

* Julia **can be fast.**
* **A function is compiled when called for the first time** with a given set of argument types.
* There are **multiple compilation steps**, all of which can be inspected through macros like `@code_warntype`.
* **Code specialization** based on the types of all of the input arguments is important for speed.
* Calculations can be moved to compile-time to make run-time faster.
* In virtually all cases, **explicit type annotations are irrelevant for performance**.
* Type annotations in function signatures define a **type filter/user interface**.