# Specialization

To be fast, Julia needs to **specialize** code, that is, compile specific native versions of the code that exploit the available type information. **The better the specialization, the faster the code!**

## "Just ahead of time" compilation

* Julia **specializes on the types of function arguments** and 
* compiles efficient machine code **when a function is called for the first time** (with these input argument types).

If the same function is called again with the same input argument types, the already existing machine code is reused.


In [None]:
func(x,y) = 2x + y

In [None]:
x = [1.2, 3.4, 5.6] # Vector{Float64}
y = [0.4, 0.7, 0.9] # Vector{Float64}

@time func(x,y);
@time func(x,y);

**First call:** compilation + running the code

**Second call:** running the code


In [None]:
@time func(x,y);

If one of the input types changes, Julia compiles a new specialization of the function!


In [None]:
typeof(x)

In [None]:
x = [1, 3, 5]

In [None]:
typeof(x)

In [None]:
@time func(x,y); # Vector{Int64}, Vector{Float64}
@time func(x,y);

We now have two specializations (pieces of efficient native code) in the cache: one for `Vector{Float64}` in both arguments and another one for `Vector{Int64}` as the first and `Vector{Float64}` as the second argument type.
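The compile-once-per-signature behaviour can also be seen without `@time`. The following is a minimal sketch using `@elapsed`; the function name `scale_add` and the variable names are stand-ins chosen for illustration, and the measured times will vary from machine to machine.

```julia
# Hypothetical stand-in for `func`, so that this sketch is self-contained.
scale_add(x, y) = 2x + y

xf = [1.2, 3.4, 5.6]   # Vector{Float64}
xi = [1, 3, 5]         # Vector{Int64}
yf = [0.4, 0.7, 0.9]   # Vector{Float64}

t_first  = @elapsed scale_add(xf, yf)  # new signature: compile, then run
t_second = @elapsed scale_add(xf, yf)  # same signature: cached code is reused
t_new    = @elapsed scale_add(xi, yf)  # different first argument type: compile again
t_again  = @elapsed scale_add(xi, yf)  # cached code for the second signature

println("Float64/Float64: ", t_first, " s, then ", t_second, " s")
println("Int64/Float64:   ", t_new, " s, then ", t_again, " s")
```

Typically, the first call for each signature is slower by orders of magnitude than the repeated call.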

In [None]:
methods(func)

In [None]:
using MethodAnalysis
methodinstances(func)

### Compilation pipeline

<p><img src="./imgs/from_source_to_native.png" alt="drawing" width="800"/></p>


### What makes Julia fast?

(Successful) **Type inference** -> **Specialization** -> **Compilation**

## Introspection tools
#### (*But I really want to see what happens!*)

We can inspect the code at all transformation stages with a bunch of macros:

<img src="./imgs/julia_introspection_macros.png" width=350px>

In [None]:
@macroexpand @time 3+3

In [None]:
@code_typed func(1.0,2.0)

From the types of the input arguments, Julia has figured out all the intermediate types and replaced the generic functions `*` and `+` by specific implementations (**static dispatch**). This crucial process is known as **type inference** and its success is the basis for a good specialization (i.e. performant native code as a result). It will concern us in much more detail tomorrow.
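To look only at the result of type inference, one option is to ask the compiler for the inferred return type. This is a minimal sketch; `infer_demo` is a hypothetical copy of `func`, and `Base.return_types` is an unexported helper that is used here purely for illustration.

```julia
# Hypothetical copy of `func`, so that this sketch is self-contained.
infer_demo(x, y) = 2x + y

rt_float = Base.return_types(infer_demo, (Float64, Float64))  # inferred: Float64
rt_int   = Base.return_types(infer_demo, (Int, Int))          # inferred: Int
rt_mixed = Base.return_types(infer_demo, (Int, Float64))      # inferred: Float64

println(rt_float, " ", rt_int, " ", rt_mixed)
```

A concrete inferred return type for each signature is a good sign that inference succeeded for that specialization.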

In [None]:
@code_llvm debuginfo=:none func(1.0,2.0)

In [None]:
@code_native debuginfo=:none func(1.0,2.0)

Let's compare this to integer input.


In [None]:
@code_native debuginfo=:none func(1,2)
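The same comparison can be done programmatically by capturing the LLVM IR as a string. This is a sketch under the assumption that the function form `code_llvm(io, f, types)` from the `InteractiveUtils` standard library is used; `ir_demo` is a hypothetical copy of `func`.

```julia
using InteractiveUtils  # provides code_llvm outside of the REPL / Jupyter

# Hypothetical copy of `func`, so that this sketch is self-contained.
ir_demo(x, y) = 2x + y

# Capture the printed IR of each specialization in a string.
ir_float = sprint(io -> code_llvm(io, ir_demo, (Float64, Float64); debuginfo=:none))
ir_int   = sprint(io -> code_llvm(io, ir_demo, (Int64, Int64); debuginfo=:none))

println(occursin("double", ir_float))  # floating-point specialization works on `double`
println(occursin("i64", ir_int))       # integer specialization works on `i64`
```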

### Recommendation: [Cthulhu.jl](https://github.com/JuliaDebug/Cthulhu.jl)
While these introspection macros are great, I recommend using `@descend` from the package [Cthulhu.jl](https://github.com/JuliaDebug/Cthulhu.jl) for real-world code analysis.

Essentially, Cthulhu is an **interactive**, more powerful generalization of the macros above.

* Allows easy switching between code representations (syntax, typed, native, ...).
* **Recursive application possible**(!) (i.e. introspecting a function that is called within a function within a function ...).

However, due to its interactivity, it doesn't work in Jupyter but **only works in the REPL** (→ exercise).

<img src="imgs/cthulhu.png" width=1000>

## How important is specialization?

Let's try to estimate the performance gain by specialization.

To prevent specialization, we deliberately throw away any useful type information and operate on a `Vector{Any}` that can literally store anything!

(This is qualitatively comparable to what Python does.)


In [None]:
func(v) = 2*v[1] + v[2] # version of func that takes in a vector

In [None]:
rand(2)

In [None]:
Any[rand(), rand()]
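The difference between the two containers is visible in their element types. A small sketch (the variable names `packed` and `boxed` are made up for this example):

```julia
packed = rand(2)              # Vector{Float64}: element type known to the compiler
boxed  = Any[rand(), rand()]  # Vector{Any}: element type hidden from the compiler

println(eltype(packed), " is concrete: ", isconcretetype(eltype(packed)))
println(eltype(boxed), " is concrete: ", isconcretetype(eltype(boxed)))

# Each element of `boxed` still carries its own type, but only at run time.
println(typeof(boxed[1]))
```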

For benchmarking we will use `@btime` (or `@benchmark`) from [BenchmarkTools.jl](https://github.com/JuliaCI/BenchmarkTools.jl). This will take care of a couple of things for us:
* Exclude first run.
* Run the code multiple times (→ statistics).
* Benchmark in a function (local scope).

**General rule:** For proper benchmarking don't use `@time` but `@btime` and interpolate (`$`) global input arguments.
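One reason behind the interpolation rule is that a non-constant global variable has no fixed type from the compiler's point of view, so code that reads it directly cannot be specialized well. A minimal sketch of this effect, assuming the code runs at top level; all names are hypothetical, and `Base.return_types` is an unexported helper used only for illustration:

```julia
scale = 1.5                   # non-const global: its type could change at any time

double_global() = 2 * scale   # reads the global directly
double_arg(x)   = 2 * x       # receives the value as an argument

rt_global = Base.return_types(double_global, ())        # inferred: Any
rt_arg    = Base.return_types(double_arg, (Float64,))   # inferred: Float64

println(rt_global, " vs. ", rt_arg)
```

Interpolating with `$` hands the value to the benchmarked expression much like a function argument, so the measurement reflects the function itself rather than the cost of accessing an untyped global.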

In [None]:
using BenchmarkTools

v_typed = rand(2)
v_any = Any[rand(), rand()]

@btime func($v_typed);
@btime func($v_any);

In [None]:
@benchmark func($v_any)

In [None]:
@code_typed func(rand(2))

In [None]:
@code_typed func(Any[rand(), rand()])

Note that in the latter case the generic functions `*` and `+` cannot be replaced by specific variants because the element types are unknown at compile time. This leads to inefficient **runtime dispatch**.
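The loss of type information also shows up in the inferred return type. The sketch below uses a hypothetical copy of the vector version of `func` (named `first_two`) and the unexported helper `Base.return_types`:

```julia
# Hypothetical copy of the vector version of `func`.
first_two(v) = 2 * v[1] + v[2]

rt_typed = Base.return_types(first_two, (Vector{Float64},))  # inferred: Float64
rt_any   = Base.return_types(first_two, (Vector{Any},))      # inferred: Any

println(rt_typed, " vs. ", rt_any)

# The computed value is the same; only the amount of compile-time knowledge differs.
println(first_two([1.0, 2.0]) == first_two(Any[1.0, 2.0]))
```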

## Dispatch and specialization

**Types drive both dispatch and specialization.**

First, the most specific method is selected (dispatch), then it gets compiled to efficient native code (specialization).

In [None]:
myabs(x::Real) = sign(x) * x
myabs(z::Complex) = sqrt(real(z * conj(z)))

In [None]:
@code_native myabs(3.2 + 4.5im)

In [None]:
@code_native myabs(3 + 4im)
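The dispatch step can be inspected on its own with `which`, which reports the method selected for a given tuple of argument types. A minimal sketch, with `absdemo` as a hypothetical copy of `myabs`:

```julia
# Hypothetical copy of `myabs`, so that this sketch is self-contained.
absdemo(x::Real)    = sign(x) * x
absdemo(z::Complex) = sqrt(real(z * conj(z)))

m_float   = which(absdemo, (Float64,))       # selects the `x::Real` method
m_int     = which(absdemo, (Int,))           # same method, different specialization
m_complex = which(absdemo, (Complex{Int},))  # selects the `z::Complex` method

println(m_float == m_int)      # one method serves both Float64 and Int
println(m_float == m_complex)  # complex input dispatches to a different method
println(absdemo(-2.5), " ", absdemo(3 + 4im))
```

Two methods exist, but each of them can be compiled into several specializations, one per concrete argument type it is called with.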

## Are explicit type annotations necessary? (think C or Fortran)

Note that Julia's type inference is powerful. Specifying types **is not** necessary for best performance!


In [None]:
function my_function(x)
    y = rand()
    z = rand()
    x+y+z
end

function my_function_typed(x::Int)::Float64
    y::Float64 = rand()
    z::Float64 = rand()
    x+y+z
end

In [None]:
@btime my_function(10);
@btime my_function_typed(10);

Annotating types explicitly can serve a purpose.

* Enforce conversions
* Very rarely: help the compiler infer types in tricky situations

However, more often than not it is an indication of suboptimal code design. (It also makes functions much less generic and reusable!)
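Both points can be illustrated with a short sketch. All function names below are hypothetical, and `Base.return_types` is an unexported helper used only for illustration:

```julia
add_generic(x, y) = x + y            # no annotations
add_int(x::Int, y::Int) = x + y      # annotated arguments

# Same inferred return type for integer input: the annotations add no information.
println(Base.return_types(add_generic, (Int, Int)) == Base.return_types(add_int, (Int, Int)))

# The annotated version is less generic: there is no method for a Float64 argument.
println(add_generic(1.5, 2))          # works
println(applicable(add_int, 1.5, 2))  # false

# A return type annotation enforces a conversion.
function as_float(x)::Float64
    return x                          # converted to Float64 on return
end
println(as_float(3))
```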

## Note for heterogeneous HPC clusters

By default, Julia produces native code for the CPU type it is running on. This means that it uses the [Instruction Set Architecture (ISA)](https://en.wikipedia.org/wiki/Instruction_set_architecture) of this CPU.

This can lead to issues on heterogeneous clusters where different nodes have different CPU types. E.g. you precompile Julia packages on a login node with an Intel CPU but want to run the code on a compute node with AMD CPUs.

**Solution: Multiversioning**

```bash
export JULIA_CPU_TARGET="generic;zen2,clone_all;skylake,clone_all"
```

This will compile a generic (but slow) variant as well as efficient variants for AMD Zen2 and Intel Skylake CPUs.
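To find out which CPU type a given node reports, and whether the environment variable is visible to Julia, something like the following sketch can be run on both the login node and the compute node. The fallback string `"(not set)"` is just a placeholder chosen for this example.

```julia
# CPU name as detected by Julia's LLVM backend, and the architecture.
println("CPU name:         ", Sys.CPU_NAME)
println("Architecture:     ", Sys.ARCH)

# Value of JULIA_CPU_TARGET as seen by this Julia process (placeholder if unset).
println("JULIA_CPU_TARGET: ", get(ENV, "JULIA_CPU_TARGET", "(not set)"))
```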

# Core messages of this Notebook

* **A function is compiled when called for the first time** with a given set of argument types.
* There are **multiple code transformation steps**, which can be inspected with macros like `@code_warntype` or with `@descend` from Cthulhu.jl.
* What makes Julia fast? Successful **Type inference** → **Specialization** → **Compilation**.
* Functions should almost always be benchmarked with **BenchmarkTools.jl's `@btime` and `@benchmark`** instead of `@time`.
* In virtually all cases, **explicit type annotations are irrelevant for performance**.