# Specialization

To be fast, Julia needs to **specialize** code, that is, compile dedicated native versions of the code that exploit the available type information. **The better the specialization, the faster the code!**

## "Just ahead of time" compilation

* Julia **specializes on the types of function arguments** and 
* compiles efficient machine code **when a function is called for the first time** (with these input argument types).

If the same function is called again with the same input argument types, the already existing machine code is reused.


In [27]:
func(x,y) = 2x + y

func (generic function with 1 method)

In [28]:
x = [1.2, 3.4, 5.6] # Vector{Float64}
y = [0.4, 0.7, 0.9] # Vector{Float64}

@time func(x,y);
@time func(x,y);

  0.005169 seconds (441 allocations: 30.921 KiB, 99.60% compilation time)
  0.000007 seconds (2 allocations: 160 bytes)


**First call:** compilation + running the code

**Second call:** running the code


In [29]:
@time func(x,y);

  0.000008 seconds (2 allocations: 160 bytes)


If one of the input types changes, Julia compiles a new specialization of the function!


In [30]:
typeof(x)

Vector{Float64} (alias for Array{Float64, 1})

In [31]:
x = [1, 3, 5]

3-element Vector{Int64}:
 1
 3
 5

In [32]:
typeof(x)

Vector{Int64} (alias for Array{Int64, 1})

In [33]:
@time func(x,y); # Vector{Int64}, Vector{Float64}
@time func(x,y);

  0.132855 seconds (166.58 k allocations: 11.230 MiB, 99.98% compilation time)
  0.000010 seconds (2 allocations: 160 bytes)


We now have two specialized native versions of `func` in the cache: one for the case where both arguments are `Vector{Float64}` and another one for `Vector{Int64}` as the first and `Vector{Float64}` as the second argument type.
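The same effect should show up for any further combination of argument types. The following is a minimal sketch, assuming a fresh session in which `func` has not yet been called with `Vector{Float32}` arguments; the variable names `a`, `b`, `t_first` and `t_second` are made up for this illustration, and the measured times will differ from machine to machine.

```julia
func(x, y) = 2x + y

a = Float32[1, 2, 3]   # Vector{Float32}
b = Float32[4, 5, 6]   # Vector{Float32}

# First call with this type combination: compilation + running the code
t_first = @elapsed func(a, b)

# Second call: only running the already compiled code
t_second = @elapsed func(a, b)

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

Typically the first number is orders of magnitude larger than the second, mirroring the `@time` output above.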

In [34]:
methods(func)

In [35]:
using MethodAnalysis
methodinstances(func)

2-element Vector{Core.MethodInstance}:
 MethodInstance for func(::Vector{Float64}, ::Vector{Float64})
 MethodInstance for func(::Vector{Int64}, ::Vector{Float64})
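Note the difference between a *method* and a *method instance*: there is still only one method of `func`, but it has been specialized twice. A small sketch using only `Base` (no extra package assumed) that checks the method count after calling `func` with two different type combinations:

```julia
func(x, y) = 2x + y

# Trigger two specializations of the same method
func([1.2, 3.4, 5.6], [0.4, 0.7, 0.9])   # (Vector{Float64}, Vector{Float64})
func([1, 3, 5],       [0.4, 0.7, 0.9])   # (Vector{Int64},   Vector{Float64})

# Still a single method definition, regardless of how many specializations exist
n_methods = length(methods(func))
println("number of methods: ", n_methods)
```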

### Compilation pipeline

<p><img src="./imgs/from_source_to_native.png" alt="drawing" width="800"/></p>


### What makes Julia fast?

(Successful) **Type inference** -> **Specialization** -> **Compilation**

## Introspection tools
#### (*But I really want to see what happens!*)

We can inspect the code at all transformation stages with a bunch of macros:

<img src="./imgs/julia_introspection_macros.png" width=350px>

In [36]:
@macroexpand @time 3+3

quote
    #= timing.jl:263 =#
    begin
        #= timing.jl:268 =#
        $(Expr(:meta, :force_compile))
        #= timing.jl:269 =#
        local var"#202#stats" = Base.gc_num()
        #= timing.jl:270 =#
        local var"#204#elapsedtime" = Base.time_ns()
        #= timing.jl:271 =#
        Base.cumulative_compile_timing(true)
        #= timing.jl:272 =#
        local var"#205#compile_elapsedtimes" = Base.cumulative_compile_time_ns()
        #= timing.jl:273 =#
        local var"#203#val" = $(Expr(:tryfinally, :(3 + 3), quote
    var"#204#elapsedtime" = Base.time_ns() - var"#204#elapsedtime"
    #= timing.jl:275 =#
    Base.cumulative_compile_timing(false)
    #= timing.jl:276 =#
    var"#205#compile_elapsedtimes" = Base.cumulative_compile_time_ns() .- var"#205#compile_elapsedtimes"
end))
        #= timing.jl:278 =#
        local var"#206#diff" = Base.GC_Diff(Base.gc_num(), var"#20

In [37]:
@code_typed func(1.0,2.0)

CodeInfo(
1 ─ %1 = Base.mul_float(2.0, x)::Float64
│   %2 = Base.add_float(%1, y)::Float64
└──      return %2
) => Float64

From the types of the input arguments, Julia has figured out all the intermediate types and replaced the generic functions `*` and `+` by specific implementations (**static dispatch**). This crucial process is known as **type inference** and its success is the basis for a good specialization (i.e. performant native code as a result). It will concern us in much more detail tomorrow.
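If you only want to know the inferred return type without reading the full typed code, `Base.return_types` can be queried directly. A minimal sketch (the chosen argument type combinations are just examples):

```julia
func(x, y) = 2x + y

# Ask the compiler which return type it infers for given argument types
rt_float = Base.return_types(func, (Float64, Float64))
rt_int   = Base.return_types(func, (Int, Int))
rt_mixed = Base.return_types(func, (Int, Float64))

println(rt_float)   # inferred for (Float64, Float64)
println(rt_int)     # inferred for (Int, Int)
println(rt_mixed)   # inferred for (Int, Float64)
```

A concrete return type (such as `Float64`) indicates that type inference succeeded for that combination of input types.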

In [38]:
@code_llvm debuginfo=:none func(1.0,2.0)

define double @julia_func_2409(double %0, double %1) #0 {
top:
  %2 = fmul double %0, 2.000000e+00
  %3 = fadd double %2, %1
  ret double %3
}


In [39]:
@code_native debuginfo=:none func(1.0,2.0)

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 14, 0
	.globl	_julia_func_2425                ## -- Begin function julia_func_2425
	.p2align	4, 0x90
_julia_func_2425:                       ## @julia_func_2425
	.cfi_startproc
## %bb.0:                               ## %top
	vaddsd	%xmm0, %xmm0, %xmm0
	vaddsd	%xmm1, %xmm0, %xmm0
	retq
	.cfi_endproc
                                        ## -- End function
.subsections_via_symbols


Let's compare this to integer input.


In [40]:
@code_native debuginfo=:none func(1,2)

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 14, 0
	.globl	_julia_func_2427                ## -- Begin function julia_func_2427
	.p2align	4, 0x90
_julia_func_2427:                       ## @julia_func_2427
	.cfi_startproc
## %bb.0:                               ## %top
	leaq	(%rsi,%rdi,2), %rax
	retq
	.cfi_endproc
                                        ## -- End function
.subsections_via_symbols


### Recommendation: [Cthulhu.jl](https://github.com/JuliaDebug/Cthulhu.jl)
While these introspection macros are great, I recommend using `@descend` from the package [Cthulhu.jl](https://github.com/JuliaDebug/Cthulhu.jl) for real-world code analysis.

Essentially, Cthulhu is an **interactive**, more powerful generalization of the macros above.

* Allows easy switching between code representations (syntax, typed, native, ...).
* **Recursive application possible**(!) (i.e. introspecting a function that is called within a function within a function ...).

However, due to its interactivity, it doesn't work in Jupyter but **only works in the REPL** (→ exercise).

<img src="imgs/cthulhu.png" width=1000>

## How important is specialization?

Let's try to estimate the performance gain from specialization.

To prevent specialization, we deliberately throw away any useful type information and operate on a `Vector{Any}` that can literally store anything!

(This is qualitatively comparable to what Python does.)


In [2]:
func(v) = 2*v[1] + v[2] # version of func that takes in a vector

func (generic function with 1 method)

In [53]:
rand(2)

2-element Vector{Float64}:
 0.7794739907620628
 0.20580422926731945

In [54]:
Any[rand(), rand()]

2-element Vector{Any}:
 0.42689498894071154
 0.5196898151534661

For benchmarking we will use `@btime` (or `@benchmark`) from [BenchmarkTools.jl](https://github.com/JuliaCI/BenchmarkTools.jl). This will take care of a couple of things for us:
* Exclude first run.
* Run the code multiple times (→ statistics).
* Benchmark in a function (local scope).

**General rule:** For proper benchmarking don't use `@time` but `@btime` and interpolate (`$`) global input arguments.

In [71]:
using BenchmarkTools

v_typed = rand(2)
v_any = Any[rand(), rand()]

@btime func($v_typed);
@btime func($v_any);

  3.911 ns (0 allocations: 0 bytes)
  59.210 ns (2 allocations: 32 bytes)


In [75]:
@benchmark func($v_any)

BenchmarkTools.Trial: 10000 samples with 982 evaluations.
 Range (min … max):  59.281 ns …  1.368 μs  ┊ GC (min … max): 0.00% … 93.44%
 Time  (median):     60.903 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   62.696 ns ± 31.323 ns  ┊ GC (mean ± σ):  1.18% ±  2.28%

In [7]:
@code_typed func(rand(2))

CodeInfo(
1 ─ %1 = Base.arrayref(true, v, 1)::Float64
│   %2 = Base.mul_float(2.0, %1)::Float64
│   %3 = Base.arrayref(true, v, 2)::Float64
│   %4 = Base.add_float(%2, %3)::Float64
└──      return %4
) => Float64

In [8]:
@code_typed func(Any[rand(), rand()])

CodeInfo(
1 ─ %1 = Base.arrayref(true, v, 1)::Any
│   %2 = (2 * %1)::Any
│   %3 = Base.arrayref(true, v, 2)::Any
│   %4 = (%2 + %3)::Any
└──      return %4
) => Any

Note that in the latter case the generic functions `*` and `+` cannot be replaced by specific variants due to the lack of type information. This leads to inefficient **runtime dispatch**.
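A quick way to spot this kind of problem without reading the typed code is to check whether the element type of a container is concrete and what return type gets inferred. The following sketch uses only `Base` functions; the variable names are chosen for illustration.

```julia
func(v) = 2 * v[1] + v[2]   # version of func that takes in a vector

v_typed = rand(2)               # Vector{Float64}
v_any   = Any[rand(), rand()]   # Vector{Any}

# A concrete element type lets the compiler know what v[1] and v[2] are
concrete_typed = isconcretetype(eltype(v_typed))
concrete_any   = isconcretetype(eltype(v_any))

# Inferred return types for both container types
rt_typed = Base.return_types(func, (Vector{Float64},))
rt_any   = Base.return_types(func, (Vector{Any},))

println("Vector{Float64}: concrete eltype = ", concrete_typed, ", inferred = ", rt_typed)
println("Vector{Any}:     concrete eltype = ", concrete_any, ", inferred = ", rt_any)
```

An inferred return type of `Any` is a hint that the operations inside the function have to be resolved at runtime.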

## Dispatch and specialization

**Types drive both dispatch and specialization.**

First, the most specific method is selected (dispatch), then it gets compiled to efficient native code (specialization).

In [7]:
myabs(x::Real) = sign(x) * x
myabs(z::Complex) = sqrt(real(z * conj(z)))

myabs (generic function with 2 methods)
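The dispatch step can be made visible with `which`, which returns the method that would be selected for a given tuple of argument types. A minimal sketch (the names `m_int`, `m_float` and `m_cplx` are only used for this illustration):

```julia
myabs(x::Real) = sign(x) * x
myabs(z::Complex) = sqrt(real(z * conj(z)))

# Dispatch: which method is selected for which argument type?
m_int   = which(myabs, (Int,))
m_float = which(myabs, (Float64,))
m_cplx  = which(myabs, (ComplexF64,))

println(m_int === m_float)   # Int and Float64 both hit the `Real` method
println(m_cplx === m_float)  # Complex input hits a different method

# Specialization then happens per concrete argument type on first call
println(myabs(-3))        # specialized for Int
println(myabs(-3.5))      # specialized for Float64
println(myabs(3 + 4im))   # specialized for Complex{Int}
```

So `Int` and `Float64` inputs share one method but still get separate specializations, which is why the native code for different input types can look quite different.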

In [22]:
@code_native myabs(3.2 + 4.5im)

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 14, 0
	.globl	_julia_myabs_1486               ## -- Begin function julia_myabs_1486
	.p2align	4, 0x90
_julia_myabs_1486:                      ## @julia_myabs_1486
; ┌ @ /Users/crstnbr/repos/JuliaWorkshops/JuliaHLRS23/notebooks/1_2_specialization.ipynb:2 within `myabs`
	.cfi_startproc
## %bb.0:                               ## %top
; │┌ @ complex.jl:293 within `*` @ float.jl:410
	vmovupd	(%rdi), %xmm0
	vmulpd	%xmm0, %xmm0, %xmm0
; ││ @ complex.jl:293 within `*`
; ││┌ @ float.jl:409 within `-`
	vpermilpd	$1, %xmm0, %xmm1        ## xmm1 = xmm0[1

In [23]:
@code_native myabs(3 + 4im)

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 14, 0
	.globl	_julia_myabs_1489               ## -- Begin function julia_myabs_1489
	.p2align	4, 0x90
_julia_myabs_1489:                      ## @julia_myabs_1489
; ┌ @ /Users/crstnbr/repos/JuliaWorkshops/JuliaHLRS23/notebooks/1_2_specialization.ipynb:2 within `myabs`
	.cfi_startproc
## %bb.0:                               ## %top
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
; │┌ @ complex.jl:293 within `*` @ int.jl:88
	movq	(%rdi), %rax
; │└
; │┌ @ complex.jl:279 within `conj`
; ││┌ @ int.jl:85 within `-`
	movq	8(%rdi), %rcx
; │└└
;

## Are explicit type annotations necessary? (think C or Fortran)

Note that Julia's type inference is powerful. Specifying types **is not** necessary for best performance!


In [None]:
function my_function(x)
    y = rand()
    z = rand()
    x+y+z
end

function my_function_typed(x::Int)::Float64
    y::Float64 = rand()
    z::Float64 = rand()
    x+y+z
end

In [None]:
@btime my_function(10);
@btime my_function_typed(10);

Annotating types explicitly can serve a purpose.

* Enforce conversions
* Very rarely: help the compiler infer types in tricky situations

However, more often than not it is an indication of suboptimal code design. (It also makes functions much less generic and reusable!)
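The points above can be sketched with the two functions from this section. The helper `to_float` is a hypothetical function introduced only to illustrate an enforced conversion; it is not part of the original notebook.

```julia
function my_function(x)
    y = rand()
    z = rand()
    x + y + z
end

function my_function_typed(x::Int)::Float64
    y::Float64 = rand()
    z::Float64 = rand()
    x + y + z
end

# Type inference reaches the same conclusion with and without annotations
rt_plain = Base.return_types(my_function, (Int,))
rt_typed = Base.return_types(my_function_typed, (Int,))
println(rt_plain, " vs. ", rt_typed)

# The unannotated version stays generic ...
generic_result = my_function(1.5f0)   # Float32 input works

# ... while the annotated version rejects anything that is not an Int
rejected = try
    my_function_typed(1.5f0)
    false
catch err
    err isa MethodError
end
println("typed version rejects Float32 input: ", rejected)

# Hypothetical helper: a return type annotation enforces a conversion
function to_float(x)::Float64
    return x
end
println(to_float(1))   # the Int 1 is converted to a Float64
```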

## Note for heterogeneous HPC clusters

By default, Julia produces native code for the CPU type it is running on. This means that it uses the [Instruction Set Architecture (ISA)](https://en.wikipedia.org/wiki/Instruction_set_architecture) of this CPU.

This can lead to issues on heterogeneous clusters where different nodes have different CPU types. E.g. you precompile Julia packages on a login node with an Intel CPU but want to run the code on a compute node with AMD CPUs.

**Solution: Multiversioning**

```bash
export JULIA_CPU_TARGET="generic;znver2,clone_all;skylake,clone_all"
```

This will compile a generic (but slow) variant as well as efficient variants for AMD Zen2 and Intel Skylake CPUs.
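To find out which CPU type Julia detects on a given node (for example on the login node versus a compute node), you can query `Sys.CPU_NAME` from within Julia. This is a small sketch; the printed names depend on your hardware, and the fallback string `"(not set)"` is just a placeholder chosen here.

```julia
# CPU name as detected by Julia on the current machine
cpu = Sys.CPU_NAME
println("Detected CPU: ", cpu)

# Current value of the multiversioning setting, if any
target = get(ENV, "JULIA_CPU_TARGET", "(not set)")
println("JULIA_CPU_TARGET: ", target)
```

Running this on both node types shows which CPU names are relevant for your cluster.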

# Core messages of this Notebook

* **A function is compiled when called for the first time** with a given set of argument types.
* There are **multiple code transformation steps** which can be inspected through macros like `@code_warntype` or `@descend` from Cthulhu.jl.
* What makes Julia fast? Successful **Type inference** → **Specialization** → **Compilation**.
* Functions should almost always be benchmarked with **BenchmarkTools.jl's `@btime` and `@benchmark`** instead of `@time`.
* In virtually all cases, **explicit type annotations are irrelevant for performance**.