# Compilation

To be fast, Julia needs to specialize code, that is, **compile specific native versions of the code**. The better the specialization, the faster the code!

## "Just ahead of time" compilation

* Julia specializes on the **types of function arguments** and 
* compiles efficient machine code **when a function is called for the first time** (with these input argument types).

If the same function is called again with the same input argument types, the already existing machine code is reused.


In [None]:
func(x,y) = 2x + y

In [None]:
x = [1.2, 3.4, 5.6] # Vector{Float64}
y = [0.4, 0.7, 0.9] # Vector{Float64}

@time func(x,y);
@time func(x,y);

**First call:** compilation + running the code

**Second call:** running the code


In [None]:
@time func(x,y);

If one of the input types changes, Julia compiles a new specialization of the function!


In [None]:
typeof(x)

In [None]:
x = [1, 3, 5]

In [None]:
typeof(x)

In [None]:
@time func(x,y); # Vector{Int64}, Vector{Float64}
@time func(x,y);

We now have two specializations (efficient native versions) of `func` in the cache: one for the case where both arguments are `Vector{Float64}`, and another for a `Vector{Int64}` first argument with a `Vector{Float64}` second argument.

In [None]:
methods(func)

In [None]:
using MethodAnalysis
methodinstances(func)
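
The difference between a method and its specializations can also be checked with Base tools alone. The following is a minimal sketch; `scale_add` is a hypothetical stand-in for `func`, defined here only so that the block runs on its own.

```julia
# Hypothetical stand-in for `func`, so the block is self-contained.
scale_add(x, y) = 2x + y

xf = [1.2, 3.4, 5.6]   # Vector{Float64}
xi = [1, 3, 5]         # Vector{Int}
y  = [0.4, 0.7, 0.9]   # Vector{Float64}

# Two calls with different argument types: each triggers its own specialization.
scale_add(xf, y)
scale_add(xi, y)

# There is still only one method definition ...
n_methods = length(methods(scale_add))

# ... and both calls dispatch to that same method.
m_float = which(scale_add, (typeof(xf), typeof(y)))
m_int   = which(scale_add, (typeof(xi), typeof(y)))
same_method = m_float === m_int

println("methods: ", n_methods, ", same method for both calls: ", same_method)
```

In other words, `methods` lists what was written in the source, while the method instances are the compiled variants of it.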

### Compilation pipeline

<p><br><img src="imgs/Julia_compilation_pipeline.svg" width="512"/></p>

* **AST**: abstract syntax tree
* **IR**: intermediate representation
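
The first two stages of the pipeline can be triggered by hand. This is a small sketch using `Meta.parse` and `Meta.lower` from Base; the expression string is just an illustrative choice.

```julia
# Stage 1: parsing turns source text into an AST (an `Expr`).
ast = Meta.parse("x^3 + y/2")
dump(ast)                     # prints the tree structure of the expression

# The outermost node is a call to `+` with two sub-expressions as arguments.
head     = ast.head           # :call
callee   = ast.args[1]        # :+
left_arg = ast.args[2]        # the sub-expression for x^3

# Stage 2: lowering rewrites the AST into a simpler, linear IR.
# Nothing is evaluated here, so `x` and `y` do not need to be defined.
lowered = Meta.lower(@__MODULE__, ast)
```

The later stages (type inference, LLVM IR, native code) depend on concrete argument types, which is why they are only reachable through a function call.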

## Introspection tools
#### (*But I really want to see what happens!*)

We can inspect the code at all transformation stages with a bunch of macros:

<img src="./imgs/julia_introspection_macros.svg" width=300px>

In [None]:
@macroexpand @show 3+3

In [None]:
f(x, y) = x^3 + y/2

In [None]:
@code_lowered f(1.0,2.0)

In [None]:
@code_typed f(1.0,2.0)

From the types of the input arguments, Julia has figured out all the intermediate types. This crucial process is known as **type inference**, and its success is the basis for good specialization (i.e. performant native code as a result). Moreover, the generic power function computing the cube of `x` is replaced by specific floating-point multiplications (**static dispatch**).
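
The result of type inference can also be queried programmatically. The sketch below uses `Base.return_types`; `cube_half` is a hypothetical stand-in for `f` so that the block is self-contained.

```julia
# Hypothetical stand-in for `f`.
cube_half(x, y) = x^3 + y/2

# Ask the compiler which return type it infers for given argument types.
ret_float = only(Base.return_types(cube_half, (Float64, Float64)))
ret_int   = only(Base.return_types(cube_half, (Int, Int)))

println("Float64 inputs -> ", ret_float)
println("Int inputs     -> ", ret_int)
```

For integer inputs the inferred return type is still `Float64`, because `/` on integers produces a floating-point number.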

In [None]:
@code_llvm debuginfo=:none f(1.0,2.0)

The expensive divide operation (`y/2`) is replaced by a multiplication by 0.5. In the end, given two `Float64` arguments, this function boils down to 4 floating-point operations (3 multiplications and 1 addition) instead of a generic power function call and a division.
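
A quick numerical check suggests why these rewrites are safe for `Float64`. This is only a sketch on a handful of arbitrarily chosen sample values, not a proof.

```julia
# Arbitrary sample values (an assumption made for illustration).
xs = [0.1, 1.5, -2.75, 3.0e10, 7.0e-5]

# `x^3` with a literal exponent gives the same bits as explicit multiplications.
pow_matches = all(x^3 === x * x * x for x in xs)

# Dividing by 2 and multiplying by 0.5 agree exactly, since 0.5 is a power of two.
half_matches = all(x / 2 === x * 0.5 for x in xs)

println("x^3 === x*x*x: ", pow_matches, ", x/2 === x*0.5: ", half_matches)
```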

In [None]:
@code_native debuginfo=:none f(1.0,2.0)

Let's compare this to integer inputs.

In [None]:
@code_native debuginfo=:none f(1,2)

## How important is specialization?

Let's try to estimate the performance gain by specialization. To prevent specialization, we deliberately throw away any useful type information and operate on a `Vector{Any}` that can literally store anything!

(This is qualitatively comparable to what Python does.)


In [None]:
func(v) = 2*v[1] + v[2] # version of func that takes in a vector

In [None]:
rand(2)

In [None]:
Any[rand(), rand()]

In [None]:
using BenchmarkTools

v_typed = rand(2)
v_any = Any[rand(), rand()]

@btime func($v_typed);
@btime func($v_any);

For benchmarking we generally use `@btime` (or `@benchmark`) from [BenchmarkTools.jl](https://github.com/JuliaCI/BenchmarkTools.jl). This will take care of a couple of things for us:
* Exclude first run.
* Run the code multiple times (→ statistics).
* Benchmark in a function (local scope).

**General rule:** For good benchmarking use `@btime` and interpolate (`$`) global input arguments.

(Prefixing a variable with `$` always means interpolation in Julia, e.g. string interpolation.)
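
As a small illustration of `$` outside of benchmarking, here is a sketch of string and expression interpolation; the variable `n` is chosen arbitrarily.

```julia
n = 3

# String interpolation: the value of `n` is spliced into the string.
msg = "n = $n"

# Expression interpolation: the value of `n` is spliced into the quoted code.
ex = :(1 + $n)
val = eval(ex)

println(msg, ", ", ex, " evaluates to ", val)
```

With `@btime func($v)` the idea is the same: the value of the global `v` is spliced into the benchmarked expression, so the benchmark does not measure the overhead of accessing an untyped global.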

In [None]:
@benchmark func($v_any)

In [None]:
@code_typed func(rand(2))

**static dispatch**: the generic functions `*` and `+` are replaced by specific implementations.

In [None]:
@code_typed func(Any[rand(), rand()])

Note that here the generic functions `*` and `+` cannot be replaced by specific variants due to the lack of type information. This leads to inefficient **runtime dispatch**.
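
The loss of type information is also visible in the inferred return type. The following sketch uses `Base.return_types`; `first_two` is a hypothetical stand-in for `func(v)`.

```julia
# Hypothetical stand-in for the vector version of `func`.
first_two(v) = 2 * v[1] + v[2]

# With a concretely typed vector, inference knows the result type.
ret_typed = only(Base.return_types(first_two, (Vector{Float64},)))

# With a Vector{Any}, nothing is known about the elements, so the result is `Any`.
ret_any = only(Base.return_types(first_two, (Vector{Any},)))

println("Vector{Float64} -> ", ret_typed)
println("Vector{Any}     -> ", ret_any)
```

An inferred type of `Any` means that the concrete methods of `*` and `+` have to be looked up while the code runs.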

## Dispatch and specialization

**Types drive both dispatch and specialization.**

First, the most specific method is selected (dispatch), then it gets compiled to efficient native code (specialization). Let's reconsider our earlier example:

In [None]:
myabs(x::Real) = sign(x) * x
myabs(z::Complex) = sqrt(real(z * conj(z)))

In [None]:
@code_native myabs(3.2 + 4.5im) # complex input

In [None]:
@code_native myabs(3 + 4im) # also complex input but different native code (due to specialization)!
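
The two steps can be separated with `which`, which reports the method selected by dispatch. This is a sketch; `myabs_demo` is a renamed copy of `myabs`, introduced only to keep the block self-contained.

```julia
# Renamed copy of `myabs` (hypothetical name).
myabs_demo(x::Real) = sign(x) * x
myabs_demo(z::Complex) = sqrt(real(z * conj(z)))

# Dispatch: both complex inputs select the same method, a real input selects the other.
m_cf = which(myabs_demo, (Complex{Float64},))
m_ci = which(myabs_demo, (Complex{Int},))
m_r  = which(myabs_demo, (Float64,))

same_complex_method = m_cf === m_ci
different_real_method = m_r !== m_cf

# Specialization: the same method is compiled separately for each concrete input type.
r_int   = myabs_demo(3 + 4im)      # Complex{Int} input
r_float = myabs_demo(3.2 + 4.5im)  # Complex{Float64} input
r_real  = myabs_demo(-2)           # Real input
```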

## Precompilation

Besides the "just ahead of time" compilation discussed above, **Julia precompiles packages and stores the resulting binary code (among other things).**

In [None]:
Base.compilecache_dir(Base.PkgId(BenchmarkTools))

In [None]:
readdir(Base.compilecache_dir(Base.PkgId(BenchmarkTools)))

### A note for heterogeneous HPC clusters

By default, precompilation produces native code for the CPU type it is running on. This means that it uses the [Instruction Set Architecture (ISA)](https://en.wikipedia.org/wiki/Instruction_set_architecture) of this CPU.

This can lead to issues on heterogeneous clusters where different nodes have different CPU types. E.g. you precompile Julia packages on a login node with an Intel CPU but want to run the code on a compute node with AMD CPUs.

**Solution: Multiversioning**

```bash
# HLRS Training Cluster
export JULIA_CPU_TARGET="generic;sandybridge,clone_all;cascadelake,clone_all;skylake-avx512,clone_all"
```

This will compile a generic (but slow) version as well as efficient variants for Intel Sandy Bridge, Cascade Lake, and Skylake (AVX-512) CPUs.
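
To find out which CPU name Julia detects on a given node, and whether a target has been set, one can query `Sys.CPU_NAME` and the environment. This is a small sketch; the fallback string `"(not set)"` is an arbitrary choice.

```julia
# CPU name as detected by Julia on the current machine.
cpu_name = Sys.CPU_NAME

# Value of the environment variable, or a placeholder if it is not set.
cpu_target = get(ENV, "JULIA_CPU_TARGET", "(not set)")

println("Detected CPU:     ", cpu_name)
println("JULIA_CPU_TARGET: ", cpu_target)
```

Running this on both the login node and a compute node shows whether the two actually differ.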

# Core messages of this Notebook

* **A function is compiled when called for the first time** with a given set of argument types.
* There are **multiple code transformation steps** which can be inspected through macros like `@code_typed`, `@code_llvm`, and `@code_native`, as well as `@code_warntype` or `@descend` (the latter from Cthulhu.jl).
* What makes Julia fast? Successful **Type inference** → **Specialization** → **Compilation**.
* Functions should almost always be benchmarked with **BenchmarkTools.jl's `@btime` and `@benchmark`** instead of `@time`.
* In virtually all cases, **explicit type annotations on function arguments are irrelevant for performance**.