# Code Specialization

To be fast, Julia needs to **specialize** code, that is, to compile native versions of the code that are tailored to specific argument types. **The better the specialization, the faster the code!** In the following, we will investigate how Julia achieves good code specialization while retaining the power of generic programming.

## Just Ahead of Time (JAOT) Compilation

<p><img src="../imgs/from_source_to_native.png" alt="drawing" width="800"/></p>
 

**AST = Abstract Syntax Tree**

**IR = Intermediate Representation**

**SSA = Static Single Assignment**

**[LLVM](https://de.wikipedia.org/wiki/LLVM) = originally "Low Level Virtual Machine" (the name is no longer treated as an acronym)**

## Specialization

**Julia specializes on the types of function arguments**, i.e. Julia compiles efficient machine code for the given input types, **when a function is called for the first time**.

If it is called again, the already existing machine code is reused, until we call the function with different input types.


In [1]:
func(x,y) = 2x + y

func (generic function with 1 method)

In [2]:
x = [1.2, 3.4, 5.6] # Vector{Float64}
y = [0.4, 0.7, 0.9] # Vector{Float64}

@time func(x,y);
@time func(x,y);

  0.185945 seconds (496.05 k allocations: 25.883 MiB, 5.20% gc time, 99.99% compilation time)
  0.000006 seconds (2 allocations: 160 bytes)


**First call:** compilation + running the code

**Second call:** running the code


In [3]:
@time func(x,y);

  0.000007 seconds (2 allocations: 160 bytes)


If one of the input types changes, Julia compiles a new specialization of the function!


In [4]:
typeof(x)

Vector{Float64} (alias for Array{Float64, 1})

In [5]:
x = [1, 3, 5]

3-element Vector{Int64}:
 1
 3
 5

In [6]:
typeof(x)

Vector{Int64} (alias for Array{Int64, 1})

In [7]:
@time func(x,y); # Vector{Int64}, Vector{Float64}
@time func(x,y);

  0.166157 seconds (415.98 k allocations: 21.285 MiB, 99.98% compilation time)
  0.000008 seconds (2 allocations: 160 bytes)


We now have two specialized native-code versions in the cache: one for two `Vector{Float64}` arguments, and another one for a `Vector{Int64}` as the first and a `Vector{Float64}` as the second argument.

In [8]:
using MethodAnalysis

In [9]:
methods(func)

In [10]:
methodinstances(func)

2-element Vector{Core.MethodInstance}:
 MethodInstance for func(::Vector{Float64}, ::Vector{Float64})
 MethodInstance for func(::Vector{Int64}, ::Vector{Float64})
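
The difference between a single method and its per-type specializations can also be sketched with Base tools only. The name `twice_plus` below is a hypothetical stand-in for `func`, and the measured times depend on the machine and Julia version, so no expected numbers are given.

```julia
# Hypothetical stand-in for `func`; one generic method definition.
twice_plus(x, y) = 2x + y

xf = [1.2, 3.4, 5.6]   # Vector{Float64}
xi = [1, 3, 5]         # Vector{Int64}
y  = [0.4, 0.7, 0.9]   # Vector{Float64}

# The first call for each combination of argument types includes compilation;
# repeated calls reuse the cached native code.
t_ff_first  = @elapsed twice_plus(xf, y)
t_ff_second = @elapsed twice_plus(xf, y)
t_if_first  = @elapsed twice_plus(xi, y)   # new type combination: compiled again
t_if_second = @elapsed twice_plus(xi, y)

println("Float64/Float64: first = ", t_ff_first, " s, second = ", t_ff_second, " s")
println("Int64/Float64:   first = ", t_if_first, " s, second = ", t_if_second, " s")

# There is still only one method, however many specializations were compiled.
println("number of methods: ", length(methods(twice_plus)))
```

The first call for each type combination should typically be much slower than the second one, mirroring the `@time` results above.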

## Introspection
#### (*But I really want to see what happens!*)

We can inspect the code at all transformation stages with a bunch of macros:

<img src="../imgs/julia_introspection_macros.png" width=350px>

In [11]:
@macroexpand @time 3+3

quote
    #= timing.jl:253 =#
    begin
        #= timing.jl:258 =#
        $(Expr(:meta, :force_compile))
        #= timing.jl:259 =#
        local var"#101#stats" = Base.gc_num()
        #= timing.jl:260 =#
        local var"#103#elapsedtime" = Base.time_ns()
        #= timing.jl:261 =#
        Base.cumulative_compile_timing(true)
        #= timing.jl:262 =#
        local var"#104#compile_elapsedtimes" = Base.cumulative_compile_time_ns()
        #= timing.jl:263 =#
        local var"#102#val" = $(Expr(:tryfinally, :(3 + 3), quote
    var"#103#elapsedtime" = Base.time_ns() - var"#103#elapsedtime"
    #= timing.jl:265 =#
    Base.cumulative_compile_timing(false)
    #= timing.jl:266 =#
    var"#104#compile_elapsedtimes" = Base.cumulative_compile_time_ns() .- var"#104#compile_elapsedtimes"
end))
        #= timing.jl:268 =#
        local var"#105#diff" = Base.GC_Diff(Base.gc_num(), var"#10

In [46]:
@code_lowered func(1.0,2.0)

CodeInfo(
1 ─ %1 = 2 * x
│   %2 = %1 + y
└──      return %2
)

In [47]:
@code_typed func(1.0,2.0)

CodeInfo(
1 ─ %1 = Base.mul_float(2.0, x)::Float64
│   %2 = Base.add_float(%1, y)::Float64
└──      return %2
) => Float64

From the types of the input arguments, Julia has figured out all the intermediate types and replaced the generic functions `*` and `+` by specific implementations. This crucial process is known as **type inference** and its success is the basis for a good specialization (i.e. performant native code as a result). It will concern us in much more detail tomorrow.
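
Whether inference succeeds can also be queried programmatically. The following is a minimal sketch that assumes the reflection helper `Base.return_types` (not part of the exported API, but available in Base); the functions `stable` and `unstable` are hypothetical examples, not part of this notebook.

```julia
# Hypothetical example functions.
stable(x, y) = 2x + y            # result type follows from the argument types
unstable(x) = x > 0 ? 1 : 1.0    # result type depends on the *value* of x

# Ask inference for the return type, given only the argument types.
rt_float = only(Base.return_types(stable, (Float64, Float64)))
rt_int   = only(Base.return_types(stable, (Int, Int)))
rt_mixed = only(Base.return_types(unstable, (Int,)))

println(rt_float, " is concrete: ", isconcretetype(rt_float))
println(rt_int,   " is concrete: ", isconcretetype(rt_int))
println(rt_mixed, " is concrete: ", isconcretetype(rt_mixed))
```

A concrete inferred type is what allows the compiler to pick specific implementations such as `mul_float`. A non-concrete result such as a `Union` signals that some information is only available at runtime.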

In [48]:
@code_llvm func(1.0,2.0)

;  @ In[1]:1 within `func`
define double @julia_func_4161(double %0, double %1) #0 {
top:
; ┌ @ promotion.jl:389 within `*` @ float.jl:385
   %2 = fmul double %0, 2.000000e+00
; └
; ┌ @ float.jl:383 within `+`
   %3 = fadd double %2, %1
; └
  ret double %3
}


We can remove the comments (lines starting with `;`) using `debuginfo=:none`.


In [49]:
@code_llvm debuginfo=:none func(1.0,2.0)

define double @julia_func_4163(double %0, double %1) #0 {
top:
  %2 = fmul double %0, 2.000000e+00
  %3 = fadd double %2, %1
  ret double %3
}


In [50]:
@code_native debuginfo=:none syntax=:intel func(1.0,2.0)

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	_julia_func_4165                ## -- Begin function julia_func_4165
	.p2align	4, 0x90
_julia_func_4165:                       ## @julia_func_4165
	.cfi_startproc
## %bb.0:                               ## %top
	vaddsd	xmm0, xmm0, xmm0
	vaddsd	xmm0, xmm0, xmm1
	ret
	.cfi_endproc
                                        ## -- End function
.subsections_via_symbols


Let's compare this to integer input.


In [51]:
@code_native debuginfo=:none syntax=:intel func(1,2)

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 12, 0
	.globl	_julia_func_4167                ## -- Begin function julia_func_4167
	.p2align	4, 0x90
_julia_func_4167:                       ## @julia_func_4167
	.cfi_startproc
## %bb.0:                               ## %top
	lea	rax, [rsi + 2*rdi]
	ret
	.cfi_endproc
                                        ## -- End function
.subsections_via_symbols

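The macros have function counterparts that take the argument types explicitly, which makes it possible to compare specializations in code. The sketch below uses `code_llvm` from the `InteractiveUtils` standard library; `add2` and `llvm_ir` are hypothetical helper names, and the exact IR text depends on the Julia and LLVM versions.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

add2(x, y) = 2x + y     # hypothetical stand-in for `func`

# Capture the LLVM IR for a given tuple of argument types as a string.
llvm_ir(f, types) = sprint(io -> code_llvm(io, f, types; debuginfo=:none))

ir_float = llvm_ir(add2, (Float64, Float64))
ir_int   = llvm_ir(add2, (Int64, Int64))

println(ir_float)
println(ir_int)

# The two specializations operate on different machine-level types.
println("Float64 version mentions double: ", occursin("double", ir_float))
println("Int64 version mentions i64:      ", occursin("i64", ir_int))
```
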

## How important is specialization?

Let's try to estimate the performance gain by specialization.

To prevent specialization, we deliberately throw away any useful type information and operate on a `Vector{Any}` that can literally store anything!

(This is qualitatively comparable to what Python does.)


In [18]:
func(v) = 2*v[1] + v[2] # version of func that takes in a vector

func (generic function with 2 methods)

In [19]:
rand(2)

2-element Vector{Float64}:
 0.5426636707323305
 0.009070281415085368

In [20]:
Any[rand(), rand()]

2-element Vector{Any}:
 0.6131532350844942
 0.5634086668836824

In [21]:
using BenchmarkTools

@btime func(v) setup=(v=rand(2));
@btime func(v) setup=(v=Any[rand(), rand()]);

  3.916 ns (0 allocations: 0 bytes)
  60.928 ns (2 allocations: 32 bytes)


**That's a huge slowdown (roughly 15x in this measurement)!**


In [22]:
@code_typed func(rand(2))

CodeInfo(
1 ─ %1 = Base.arrayref(true, v, 1)::Float64
│   %2 = Base.mul_float(2.0, %1)::Float64
│   %3 = Base.arrayref(true, v, 2)::Float64
│   %4 = Base.add_float(%2, %3)::Float64
└──      return %4
) => Float64

In [23]:
@code_typed func(Any[rand(), rand()])

CodeInfo(
1 ─ %1 = Base.arrayref(true, v, 1)::Any
│   %2 = (2 * %1)::Any
│   %3 = Base.arrayref(true, v, 2)::Any
│   %4 = (%2 + %3)::Any
└──      return %4
) => Any
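
One possible remedy when data arrives in an abstractly typed container is to convert it to a container with a concrete element type before the hot code runs. This is a minimal sketch: `combine` is a hypothetical stand-in for the vector method of `func`, and `Base.return_types` is a reflection helper that is not part of the exported API.

```julia
# Hypothetical stand-in for the vector method of `func`.
combine(v) = 2 * v[1] + v[2]

v_any = Any[0.5, 0.25]                    # element type Any: nothing to infer from
v_f64 = convert(Vector{Float64}, v_any)   # element type Float64: fully inferable

println(eltype(v_any), " concrete? ", isconcretetype(eltype(v_any)))
println(eltype(v_f64), " concrete? ", isconcretetype(eltype(v_f64)))

# Inferred return types for the two container types.
println(only(Base.return_types(combine, (Vector{Any},))))
println(only(Base.return_types(combine, (Vector{Float64},))))

# Same numerical result; only the Vector{Float64} call runs fully specialized code.
println(combine(v_any) == combine(v_f64))
```

The conversion has a cost of its own, so it pays off mainly when the converted data is used repeatedly.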

In [24]:
# @code_native debuginfo=:none syntax=:intel func(rand(2))
# @code_native debuginfo=:none syntax=:intel func(Any[rand(), rand()])

## Types vs values

In high-performance computing, compilation time (on the order of seconds or minutes) is typically negligible compared to the time it takes to perform the actual computation (easily on the order of hours, days, or weeks). Therefore, we generally want to optimize for runtime efficiency, even if this means that compilation time goes up by a reasonable amount.

**Julia specializes on input types and not values!**

Primarily, it is **type information** that the compiler uses to specialize code. (There are additional techniques, such as constant propagation, that we are neglecting here.)

(Very) roughly speaking, the more information there is in *type space* (e.g. in type parameters) the higher the likelihood that the compiler produces fast and efficient code.
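
A small Base-only illustration of information living in *type space* versus *value space* is the contrast between a `Vector` and a `Tuple`. This sketch only uses built-in types and functions.

```julia
v = [1.0, 2.0, 3.0]    # Vector: the length is a runtime value
t = (1.0, 2.0, 3.0)    # Tuple: the length is part of the type

println(typeof(v))     # the type says nothing about the length
println(typeof(t))     # the type lists all three element types

# The length of the tuple can be read off the type alone.
println(fieldcount(typeof(t)))

# Vectors of different lengths share one type; tuples of different lengths do not.
println(typeof([1.0, 2.0]) == typeof(v))
println(typeof((1.0, 2.0)) == typeof(t))
```

Because the tuple's length is known from its type, the compiler can, for example, unroll loops over it. For the vector, the length is only known at runtime.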

In [25]:
A = rand(10,10);
B = rand(10,10);
@btime $A + $B;

  193.079 ns (1 allocation: 896 bytes)


In [26]:
typeof(A)

Matrix{Float64} (alias for Array{Float64, 2})

In [27]:
size(A)

(10, 10)

In [28]:
size(typeof(A)) # the size of A isn't type information

LoadError: MethodError: no method matching size(::Type{Matrix{Float64}})
Closest candidates are:
  size(::Union{LinearAlgebra.Adjoint{T, var"#s884"}, LinearAlgebra.Transpose{T, var"#s884"}} where {T, var"#s884"<:(AbstractVector)}) at ~/.julia/juliaup/julia-1.8.0+0.x64/share/julia/stdlib/v1.8/LinearAlgebra/src/adjtrans.jl:173
  size(::Union{LinearAlgebra.Adjoint{T, var"#s884"}, LinearAlgebra.Transpose{T, var"#s884"}} where {T, var"#s884"<:(AbstractMatrix)}) at ~/.julia/juliaup/julia-1.8.0+0.x64/share/julia/stdlib/v1.8/LinearAlgebra/src/adjtrans.jl:174
  size(::Union{LinearAlgebra.QR, LinearAlgebra.QRCompactWY, LinearAlgebra.QRPivoted}) at ~/.julia/juliaup/julia-1.8.0+0.x64/share/julia/stdlib/v1.8/LinearAlgebra/src/qr.jl:581
  ...

In [29]:
using StaticArrays

In [30]:
A = @SMatrix rand(10,10);
B = @SMatrix rand(10,10);

In [31]:
typeof(A)

SMatrix{10, 10, Float64, 100} (alias for SArray{Tuple{10, 10}, Float64, 2, 100})

In [32]:
size(typeof(A)) # the size of A is type information!

(10, 10)

In [33]:
@btime $A + $B;

  33.012 ns (0 allocations: 0 bytes)


**StaticArrays.jl**

```
============================================
    Benchmarks for 3×3 Float64 matrices
============================================
Matrix multiplication               -> 5.9x speedup
Matrix multiplication (mutating)    -> 1.8x speedup
Matrix addition                     -> 33.1x speedup
Matrix addition (mutating)          -> 2.5x speedup
Matrix determinant                  -> 112.9x speedup
Matrix inverse                      -> 67.8x speedup
Matrix symmetric eigendecomposition -> 25.0x speedup
Matrix Cholesky decomposition       -> 8.8x speedup
Matrix LU decomposition             -> 6.1x speedup
Matrix QR decomposition             -> 65.0x speedup
```

### Why not always use static arrays then?!

By putting more information in the type you are putting more stress on the compiler to optimize things.

Specifically, if static arrays are too big, compile time can explode, or the compiler might just give up and fall back to an inefficient default version.

Generally speaking, static arrays are only useful as small fixed-size arrays.

In [34]:
# # should take (much) longer to compile and the speedup should be gone as well
# # if it isn't, increase N a little bit
# N = 50
# M = rand(N,N);
# Mstatic = SMatrix{N,N}(M);

# @btime $Mstatic + $Mstatic;
# @btime $M + $M;

### Dispatch and specialization

Having a reasonable amount of information encoded in the type domain isn't only useful to help the compiler (specialization) but also for dispatching to the most specific (and therefore hopefully most performant) method of a function.

**Types drive both specialization and multiple dispatch!**

In this sense, multiple dispatch is essentially the first step of the specialization process where Julia chooses between different implementations.
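
Which implementation dispatch picks for a given set of argument types can be inspected with `which`. The `describe` methods below are hypothetical and serve only to illustrate the selection of the most specific method.

```julia
# Hypothetical methods, from most general to most specific.
describe(x::Number)  = "generic Number method"
describe(x::Integer) = "Integer method"
describe(x::Int64)   = "Int64 method"

# `which` reports the method that dispatch selects for the given argument types.
println(which(describe, (Int64,)).sig)
println(which(describe, (UInt8,)).sig)
println(which(describe, (Float64,)).sig)

println(describe(Int64(1)))   # most specific match: Int64
println(describe(0x01))       # UInt8 <: Integer
println(describe(1.0))        # Float64 <: Number
```

Once a method has been selected this way, it is that method's body which gets specialized for the concrete argument types.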

#### Example: Determinant of a 2x2 matrix

Suppose your task is to write a function that computes the determinant of a 2x2 matrix. How would you implement it?

You'd probably say: well, I know the formula for the determinant of a 2x2 matrix, so let's just implement it.


In [35]:
det_2x2(X) = X[1,1] * X[2,2] - X[1,2] * X[2,1]

det_2x2 (generic function with 1 method)

In [36]:
M = [1 2; 3 4]

2×2 Matrix{Int64}:
 1  2
 3  4

In [37]:
det_2x2(M)

-2

In [38]:
@btime det_2x2(M);

  23.857 ns (0 allocations: 0 bytes)


Let's see how Julia's built-in `det` function compares to our algorithm:


In [39]:
using LinearAlgebra

det(M)

-2.0

In [40]:
@btime det(M);

  337.776 ns (3 allocations: 192 bytes)


It's much slower!!

The reason isn't just that the compiler doesn't know the size of the matrix from its type, but also that [the code it considers](https://github.com/JuliaLang/julia/blob/release-1.8/stdlib/LinearAlgebra/src/generic.jl#L1544-L1550) (selected by the dispatch mechanism) is too general to compete with our implementation in `det_2x2`.

Let's now move the size information to the type domain and see how things change.

In [41]:
using StaticArrays
S = @SMatrix [1 2; 3 4]

2×2 SMatrix{2, 2, Int64, 4} with indices SOneTo(2)×SOneTo(2):
 1  2
 3  4

In [42]:
@btime det($S);

  4.666 ns (0 allocations: 0 bytes)


Note that it is much faster because StaticArrays.jl provides [a hand-coded version](https://github.com/JuliaArrays/StaticArrays.jl/blob/master/src/det.jl#L10-L12), similar to our `det_2x2` above, which gets selected because of the size information in the type.

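The (tiny) remaining speed difference compared to our own `det_2x2` is due to bounds checking and matrix vs. linear indexing. (Note that the `det_2x2(M)` and `det(M)` benchmarks above did not interpolate the global variable `M` with `$`, so those timings also include the overhead of accessing a non-constant global.)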

In [43]:
det_2x2_optimized(X) = X[1] * X[4] - X[3] * X[2]
@btime det_2x2_optimized($M);

  4.292 ns (0 allocations: 0 bytes)
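
The mechanism of selecting a hand-coded method through size information in the type can be sketched without StaticArrays.jl. The wrapper type `SizedMat` and the function `mydet` are hypothetical and far simpler than the real `SMatrix`. They only illustrate the dispatch step, not the performance characteristics.

```julia
using LinearAlgebra

# Hypothetical wrapper that copies the runtime size into the type parameters R and C.
struct SizedMat{R,C,T}
    data::Matrix{T}
end
# Note: this constructor is itself not type-stable, since the size is a runtime value.
SizedMat(A::Matrix{T}) where {T} = SizedMat{size(A, 1),size(A, 2),T}(A)

# Generic fallback: defer to the general-purpose LinearAlgebra.det.
mydet(A::SizedMat) = det(A.data)

# More specific method, selected by dispatch only for 2x2 matrices.
mydet(A::SizedMat{2,2}) = A.data[1, 1] * A.data[2, 2] - A.data[1, 2] * A.data[2, 1]

S2 = SizedMat([1 2; 3 4])
S3 = SizedMat([1.0 0.0 0.0; 0.0 2.0 0.0; 0.0 0.0 3.0])

println(typeof(S2))
println(mydet(S2))   # uses the closed-form 2x2 method
println(mydet(S3))   # falls back to the generic method
```

Because the size is a type parameter, the choice between the two methods is made by dispatch rather than by an `if` on `size(A)` at runtime.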


## Are explicit type annotations necessary? (think C or Fortran)

Note that Julia's type inference is powerful. Specifying types **is not** necessary for best performance!


In [44]:
function my_function(x)
    y = rand()
    z = rand()
    x+y+z
end

function my_function_typed(x::Int)::Float64
    y::Float64 = rand()
    z::Float64 = rand()
    x+y+z
end

my_function_typed (generic function with 1 method)

In [45]:
@btime my_function(10);
@btime my_function_typed(10);

  9.046 ns (0 allocations: 0 bytes)
  8.903 ns (0 allocations: 0 bytes)


Annotating types explicitly can serve a purpose.

* Enforce conversions
* Very rarely: help the compiler infer types in tricky situations

However, more often than not it is an indication of suboptimal code design. (It also makes functions much less generic and reusable!)
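
The points above can be illustrated with a minimal sketch. All function names are hypothetical, and `Base.return_types` is a reflection helper that is not part of the exported API.

```julia
# Hypothetical functions for illustration.
add_one(x) = x + 1            # generic: works for any type that supports `+ 1`
add_one_int(x::Int) = x + 1   # annotated: restricted to Int, but not faster for Int

# Inference arrives at the same result type for Int input in both cases.
println(only(Base.return_types(add_one, (Int,))))
println(only(Base.return_types(add_one_int, (Int,))))

# The generic version is reusable for other types; the annotated one is not.
println(add_one(1.5))
println(add_one(2//3))
println(hasmethod(add_one_int, Tuple{Float64}))   # no applicable method for Float64

# A return-type annotation enforces a conversion.
function as_float(x)::Float64
    return x
end
println(as_float(3))   # the Int 3 is converted to a Float64 on return
```
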

# Core messages of this Notebook

* **A function is compiled when called for the first time** with a given set of argument types.
* There are **multiple compilation steps**, which can be inspected through macros like `@code_warntype`.
* **Code specialization** based on the types of all of the input arguments is important for speed.
* Critical information can be moved to the **type domain** for better dispatch and specialization.
* In virtually all cases, **explicit type annotations are irrelevant for performance**.