# Specialization

While abstraction in the form of duck typing and generic programming is great for us as programmers, the computer needs specific machine instructions operating on specific data structures. Hence, Julia needs to **specialize** generic code, that is, compile specific native versions of the code for specific input data types. **The better the specialization, the faster the code!** In the following, we will investigate how Julia achieves good code specialization while retaining the power of generic programming.

## Is Julia fast?

Julia isn't fast *per se*.

One can write terribly slow code in any language, including Julia.

So let's ask a different question.

## *Can* Julia be fast?

### Microbenchmarks
<img src="imgs/benchmarks.svg" alt="drawing" width="800"/>

### Vandermonde matrix (once again)
(modified from [Steven's Julia intro](https://web.mit.edu/18.06/www/Fall17/1806/julia/Julia-intro.pdf))

\begin{align}V=\begin{bmatrix}1&\alpha _{1}&\alpha _{1}^{2}&\dots &\alpha _{1}^{n-1}\\1&\alpha _{2}&\alpha _{2}^{2}&\dots &\alpha _{2}^{n-1}\\1&\alpha _{3}&\alpha _{3}^{2}&\dots &\alpha _{3}^{n-1}\\\vdots &\vdots &\vdots &\ddots &\vdots \\1&\alpha _{m}&\alpha _{m}^{2}&\dots &\alpha _{m}^{n-1}\end{bmatrix}\end{align}

In [5]:
using PyCall

In [6]:
np = pyimport("numpy")

PyObject <module 'numpy' from '/home/eric/anaconda3/lib/python3.7/site-packages/numpy/__init__.py'>

In [7]:
np.vander(1:5, increasing=true)

5×5 Matrix{Int64}:
 1  1   1    1    1
 1  2   4    8   16
 1  3   9   27   81
 1  4  16   64  256
 1  5  25  125  625

The source code for this function is [here](https://github.com/numpy/numpy/blob/v1.16.1/numpy/lib/twodim_base.py#L475-L563). It calls `np.multiply.accumulate`, which is implemented in C [here](https://github.com/numpy/numpy/blob/deea4983aedfa96905bbaee64e3d1de84144303f/numpy/core/src/umath/ufunc_object.c#L3678). However, this code doesn't actually perform the computation; it mostly performs type checks and other bookkeeping. The actual kernel that gets called is [here](https://github.com/numpy/numpy/blob/deea4983aedfa96905bbaee64e3d1de84144303f/numpy/core/src/umath/loops.c.src#L1742). This isn't even C code but a template for C code, which is used to generate type-specific kernels.

Overall, this setup only supports a limited set of types, like `Float64`, `Float32`, and so forth.

Here is our simple generic Julia implementation:

In [8]:
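function vander(x::AbstractVector{T}) where T
    m = length(x)
    V = Matrix{T}(undef, m, m)
    # First column: the multiplicative identity of the element type
    for j = 1:m
        V[j,1] = one(x[j])
    end
    # Each further column is the previous column times x (elementwise)
    for i = 2:m
        for j = 1:m
            V[j,i] = x[j] * V[j,i-1]
        end
    end
    return V
end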

vander (generic function with 1 method)

In [9]:
vander(1:5)

5×5 Matrix{Int64}:
 1  1   1    1    1
 1  2   4    8   16
 1  3   9   27   81
 1  4  16   64  256
 1  5  25  125  625

#### A quick speed comparison

<details>
  <summary>Show Code</summary>
<br>
    
```julia
using BenchmarkTools, Plots
ns = exp10.(range(1, 4, length=30));

tnp = Float64[]
tjl = Float64[]
for n in ns
    x = 1:n |> collect
    push!(tnp, @belapsed np.vander($x) samples=3 evals=1)
    push!(tjl, @belapsed vander($x) samples=3 evals=1)
end
plot(ns, tnp./tjl, m=:circle, xscale=:log10, xlab="matrix size", ylab="NumPy time / Julia time", legend=:false)
```
</details>

 <img src="imgs/vandermonde.svg" alt="drawing" width="600"/>

Note that the clean and concise Julia implementation is **beating NumPy's C implementation for small matrices** and is **on par for large matrix sizes**.

At the same time, the Julia code is *generic* and works for arbitrary types!

In [10]:
vander(Int32[4, 8, 16, 32])

4×4 Matrix{Int32}:
 1   4    16     64
 1   8    64    512
 1  16   256   4096
 1  32  1024  32768

It even works for non-numerical types. The only requirement is that the type has a *one* (identity element) and a multiplication operation defined.

In [11]:
vander(["this", "is", "a", "test"])

4×4 Matrix{String}:
 ""  "this"  "thisthis"  "thisthisthis"
 ""  "is"    "isis"      "isisis"
 ""  "a"     "aa"        "aaa"
 ""  "test"  "testtest"  "testtesttest"

Here, `one(String) == ""` since the empty string is the identity under multiplication (string concatenation).
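To illustrate this requirement with a user-defined type, here is a minimal sketch. `Mod7` (integers modulo 7) is a hypothetical type invented for this example, and `vander` is repeated from above only so that the snippet runs on its own.

```julia
# Repeated from above so that the snippet is self-contained.
function vander(x::AbstractVector{T}) where T
    m = length(x)
    V = Matrix{T}(undef, m, m)
    for j in 1:m
        V[j, 1] = one(x[j])
    end
    for i in 2:m, j in 1:m
        V[j, i] = x[j] * V[j, i-1]
    end
    return V
end

# Hypothetical user-defined type: integers modulo 7.
struct Mod7
    value::Int
    Mod7(v::Integer) = new(mod(v, 7))
end

# The two ingredients `vander` relies on: a "one" and a multiplication.
Base.one(::Type{Mod7}) = Mod7(1)
Base.one(::Mod7) = Mod7(1)
Base.:*(a::Mod7, b::Mod7) = Mod7(a.value * b.value)

V = vander([Mod7(2), Mod7(3), Mod7(5)])

# The last column holds the squares modulo 7, e.g. 3^2 = 9 ≡ 2 (mod 7).
V[2, 3]
```

Nothing in `vander` had to change: the compiler simply produces another specialization for `Vector{Mod7}`.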

# How can Julia be fast?

<p><img src="imgs/from_source_to_native.png" alt="drawing" width="800"/></p>
 
**AST = Abstract Syntax Tree**

**SSA = Static Single Assignment**

**[LLVM](https://de.wikipedia.org/wiki/LLVM) = Low Level Virtual Machine**

### Specialization and code inspection

**Julia specializes on the types of function arguments**, i.e. Julia compiles efficient machine code for the given input types, **when a function is called for the first time**.

If it is called again, the already existing machine code is reused, until we call the function with different input types.

In [12]:
func(x,y) = 2x + y

func (generic function with 1 method)

In [13]:
x = [1.2, 3.4, 5.6]
y = [0.4, 0.7, 0.9]

@time func(x,y);
@time func(x,y);

  0.118397 seconds (411.57 k allocations: 21.702 MiB, 99.98% compilation time)
  0.000007 seconds (2 allocations: 160 bytes)


**First call:** compilation + running the code

**Second call:** running the code

In [14]:
@time func(x,y);

  0.000007 seconds (2 allocations: 160 bytes)


If one of the input types changes, Julia compiles a new specialization of the function!

In [15]:
typeof(x)

Vector{Float64} (alias for Array{Float64, 1})

In [16]:
x = [1, 3, 5]

3-element Vector{Int64}:
 1
 3
 5

In [17]:
typeof(x)

Vector{Int64} (alias for Array{Int64, 1})

In [18]:
@time func(x,y);
@time func(x,y);

  0.110060 seconds (350.41 k allocations: 18.118 MiB, 99.98% compilation time)
  0.000006 seconds (2 allocations: 160 bytes)


We now have two compiled specializations in the cache: one for `Vector{Float64}` as both argument types, and another one for `Vector{Int64}` as the first and `Vector{Float64}` as the second argument type.

### *But I really want to see what happens!*

We can inspect the code at all transformation stages with a bunch of macros:

* The AST after parsing (**`@macroexpand`**)
* The AST after lowering (**`@code_lowered`**)
* The AST after type inference and optimization (**`@code_typed`**, **`@code_warntype`**)
* The LLVM IR (**`@code_llvm`**)
* The assembly machine code (**`@code_native`**)
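The same stages can also be reached programmatically through the function counterparts of these macros, which take the function and a tuple of argument types. The following is a small sketch of that route; the name `f` is just a stand-in with the same body as `func`, so that the snippet runs on its own.

```julia
using InteractiveUtils   # provides code_llvm and code_native outside the REPL/notebook

f(x, y) = 2x + y   # stand-in for `func`, redefined to keep the snippet standalone

ex      = Meta.parse("2x + y")          # surface AST right after parsing
lowered = code_lowered(f, (Int, Int))   # lowered form, before type inference
typed   = code_typed(f, (Int, Int))     # after type inference and optimization

# `code_typed` returns one `CodeInfo => return type` pair per matching method.
inferred_return_type = last(only(typed))

# LLVM IR and native assembly are printed to an IO stream.
code_llvm(stdout, f, (Int, Int); debuginfo=:none)
code_native(stdout, f, (Int, Int); debuginfo=:none)
```

Because the argument types are passed explicitly, no example values are needed to inspect a particular specialization.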

In [19]:
@code_typed func(1,2)

CodeInfo(
1 ─ %1 = Base.mul_int(2, x)::Int64
│   %2 = Base.add_int(%1, y)::Int64
└──      return %2
) => Int64

In [20]:
@code_lowered func(1,2)

CodeInfo(
1 ─ %1 = 2 * x
│   %2 = %1 + y
└──      return %2
)

In [21]:
@code_llvm func(1,2)

;  @ In[12]:1 within `func`
define i64 @julia_func_2921(i64 signext %0, i64 signext %1) #0 {
top:
; ┌ @ int.jl:88 within `*`
   %2 = shl i64 %0, 1
; └
; ┌ @ int.jl:87 within `+`
   %3 = add i64 %2, %1
; └
  ret i64 %3
}


We can remove the comments (lines starting with `;`) using `debuginfo=:none`.

In [22]:
@code_llvm debuginfo=:none func(1,2)

define i64 @julia_func_2952(i64 signext %0, i64 signext %1) #0 {
top:
  %2 = shl i64 %0, 1
  %3 = add i64 %2, %1
  ret i64 %3
}


In [23]:
@code_native debuginfo=:none func(1,2)

	.text
	leaq	(%rsi,%rdi,2), %rax
	retq
	nopw	%cs:(%rax,%rax)


Let's compare this to `Float64` input.

In [24]:
@code_native debuginfo=:none func(1.2,2.9)

	.text
	vaddsd	%xmm0, %xmm0, %xmm0
	vaddsd	%xmm1, %xmm0, %xmm0
	retq
	nopl	(%rax)


## How important is code specialization?

Let's try to estimate the performance gain by specialization.

We wrap our numbers into a custom type which internally stores them as `Any` to prevent specialization.

(This is qualitatively comparable to what Python does.)

In [25]:
struct Anything
    value::Any
end

operation(x::Number) = x^2 + sqrt(x)
operation(x::Anything) = x.value^2 + sqrt(x.value)

operation (generic function with 2 methods)

In [26]:
using BenchmarkTools

@btime operation(2.0);

x = Anything(2.0)
@btime operation($x);

  1.334 ns (0 allocations: 0 bytes)
  47.773 ns (3 allocations: 48 bytes)


**That's roughly a 36-fold slowdown!**
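One way to see where the slowdown comes from, without any timing, is to ask the compiler what it can infer. The sketch below contrasts the `Any`-typed wrapper with a parametric one; `Wrapped` is a hypothetical type introduced only for this comparison, and `Base.return_types` is an unexported reflection helper.

```julia
struct Anything
    value::Any          # the field type hides the concrete type from the compiler
end

struct Wrapped{T}       # hypothetical parametric counterpart
    value::T            # the field type is part of the wrapper's type
end

operation(x) = x.value^2 + sqrt(x.value)

fieldtype(Anything, :value)           # Any
fieldtype(Wrapped{Float64}, :value)   # Float64

# Inferred return type for each argument type
rt_any   = only(Base.return_types(operation, (Anything,)))
rt_typed = only(Base.return_types(operation, (Wrapped{Float64},)))
```

For `Wrapped{Float64}` the compiler knows that it operates on a `Float64` and can emit the same tight code as for a plain number. For `Anything` it has to look up the concrete type and the matching methods at run time.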

In [27]:
@code_native debuginfo=:none operation(2.0)

	.text
	vxorpd	%xmm1, %xmm1, %xmm1
	vucomisd	%xmm0, %xmm1
	ja	L23
	vmulsd	%xmm0, %xmm0, %xmm1
	vsqrtsd	%xmm0, %xmm0, %xmm0
	vaddsd	%xmm0, %xmm1, %xmm0
	retq
L23:
	subq	$8, %rsp
	movabsq	$throw_complex_domainerror, %rax
	movabsq	$139632615684672, %rdi          # imm = 0x7EFEC074FA40
	callq	*%rax
	ud2
	nopw	%cs:(%rax,%rax)


In [28]:
@code_native debuginfo=:none operation(x)

	.text
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	andq	$-32, %rsp
	subq	$96, %rsp
	vxorps	%xmm0, %xmm0, %xmm0
	vmovaps	%ymm0, 32(%rsp)
	movq	%fs:0, %rax
	movq	-8(%rax), %r15
	movq	$8, 32(%rsp)
	movq	(%r15), %rax
	movq	%rax, 40(%rsp)
	leaq	32(%rsp), %rax

# Make runtime the fun time.

Julia specializes on input types (not values). As a rule of thumb: **only type information is available to the compiler when specializing code.**
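A small sketch of what "types, not values" means in practice: two integers share a type and therefore a specialization, whereas wrapping a value in `Val` makes it part of the type. The function `static_power` is a hypothetical example, not something defined elsewhere in this notebook.

```julia
# Two different values of the same type: one specialization serves both.
same_type = typeof(1) == typeof(2)                # true

# `Val` lifts a value into the type domain: Val{1} and Val{2} are distinct types.
same_val_type = typeof(Val(1)) == typeof(Val(2))  # false

# Hypothetical example: the exponent N is part of the argument type,
# so it is known when the method is specialized.
static_power(x, ::Val{N}) where {N} = x^N

static_power(3, Val(2))   # 9
```

Each distinct `Val{N}` triggers its own compilation, so this only pays off when few different values occur.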

In scientific computations, we typically run a piece of code many times over and over again. Think of a Monte Carlo simulation, for example, where we perform the update and the Metropolis check millions of times.

**Therefore, we want our runtime to be as short as possible.**

On the other hand, for a given combination of input argument types, Julia compiles the piece of code only once, as we have seen above. The time it takes to compile our code is almost always negligible compared to the duration of the full computation.

### Example: Determinant of a 2x2 matrix

Let's say your task would be to write a function computing the determinant of a 2x2 matrix. How would you implement it?

Probably you'd say, well I know the formula for computing the determinant of a 2x2 matrix! Let's just implement it.

In [29]:
det_2x2(X::AbstractMatrix) = X[1,1] * X[2,2] - X[1,2] * X[2,1]

det_2x2 (generic function with 1 method)

In [30]:
M = [1 2; 3 4]

2×2 Matrix{Int64}:
 1  2
 3  4

In [31]:
det_2x2(M)

-2

In [32]:
@btime det_2x2(M);

  14.251 ns (0 allocations: 0 bytes)


Let's see how Julia's built-in `det` function compares to our algorithm:

In [33]:
using LinearAlgebra

det(M)

-2.0

In [34]:
@btime det(M);

  239.230 ns (3 allocations: 192 bytes)


It's much slower!! But why?

The reason is that, as we've discussed above, the compiler only has the type information available for producing a specialization of the `det` function, i.e. in this case:

In [35]:
typeof(M)

Matrix{Int64} (alias for Array{Int64, 2})

Note that, in particular, the **size of the matrix is not encoded in the type and therefore not available to the compiler!**

In [36]:
size(typeof(M))

LoadError: MethodError: no method matching size(::Type{Matrix{Int64}})
Closest candidates are:
  size(::Union{Adjoint{T, var"#s861"}, Transpose{T, var"#s861"}} where {T, var"#s861"<:(AbstractVector)}) at /opt/julia-1.7.3/share/julia/stdlib/v1.7/LinearAlgebra/src/adjtrans.jl:172
  size(::Union{Adjoint{T, var"#s861"}, Transpose{T, var"#s861"}} where {T, var"#s861"<:(AbstractMatrix)}) at /opt/julia-1.7.3/share/julia/stdlib/v1.7/LinearAlgebra/src/adjtrans.jl:173
  size(::Union{QR, LinearAlgebra.QRCompactWY, QRPivoted}) at /opt/julia-1.7.3/share/julia/stdlib/v1.7/LinearAlgebra/src/qr.jl:567
  ...

So, in spirit, when calling `det(M)` for the first time, we ask the compiler to create a specialization which can handle matrices of all sizes! That's obviously much harder than the 2x2 case. Hence, the produced code is more general - it will also work for 3x3 matrices etc. - but much less efficient.
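The point can be checked directly with types alone. The sketch below uses only built-in arrays and tuples; the variable names are arbitrary.

```julia
A = [1 2; 3 4]
B = [1 2 3; 4 5 6; 7 8 9]

# Both are Matrix{Int64}: the same specialization of `det` must serve both sizes.
same_matrix_type = typeof(A) == typeof(B)   # true

# Tuples, in contrast, carry their length in the type.
t2 = (1, 2)
t3 = (1, 2, 3)
same_tuple_type = typeof(t2) == typeof(t3)  # false
fieldcount(typeof(t3))                      # 3, obtained from the type alone
```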

In [37]:
@code_typed debuginfo=:none det(M)

CodeInfo(
1 ── %1  = LinearAlgebra.istriu::typeof(istriu)
│    %2  = invoke %1(A::Matrix{Int64}, 0::Int64)::Bool
└───       goto #3 if not %2
2 ──       goto #6
3 ── %5  = LinearAlgebra.istril::typeof(istril)
│    %6  = invoke %5(A::Matrix{Int64}, 0::Int64)::Bool
└───       goto #5 if not %6
4 ──       goto #6
5 ── %9  = LinearAlgebra.lu::typeof(lu)
│    %10 = invoke LinearAlgebra.var"#lu##kw"()($(QuoteNode((check = false,)))::NamedTuple{(:check,), Tuple{Bool}}, %9::typeof(lu), A::Matrix{Int64}, $(QuoteNode(RowMaximum()))::RowMaximum)::LU{Float64, Matrix{Float64}}
│    %11 = invoke LinearAlgebra.det(%10::LU{Float64, Matrix{Float64}})::Float64
└───       return %11
6 ┄─       Base.arraysize(A, 1)::Int64
│          Base.arraysize(A, 2)::Int64
│    %15

Let's now move the size information to the type domain and see how things change.

In [40]:
using StaticArrays

In [41]:
S = SMatrix{2,2}(1, 2, 3, 4)

2×2 SMatrix{2, 2, Int64, 4} with indices SOneTo(2)×SOneTo(2):
 1  3
 2  4

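or, more readably, with the `@SMatrix` macro. Note that the constructor above fills the matrix column by column, so the two matrices are transposes of each other: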

In [42]:
S = @SMatrix [1 2; 3 4]

2×2 SMatrix{2, 2, Int64, 4} with indices SOneTo(2)×SOneTo(2):
 1  2
 3  4

For static arrays, we can extract the matrix size solely from the type:

In [43]:
size(typeof(S)) # doesn't error

(2, 2)

Hence, the compiler can utilize the fact that `S` is a 2x2 matrix - it can create different specializations for different matrix sizes.
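To make the idea concrete without any package, here is a deliberately minimal sketch of a fixed-size matrix whose dimensions are type parameters. `TinyMatrix` and `tiny_det` are hypothetical names made up for this illustration; this is not how StaticArrays.jl is actually implemented, only the same basic idea.

```julia
# Hypothetical fixed-size matrix: R rows, C columns, L = R*C entries of type T.
struct TinyMatrix{R,C,T,L}
    data::NTuple{L,T}   # column-major storage
end

# Convenience constructor that checks the number of entries against R*C.
function TinyMatrix{R,C}(data::NTuple{L,T}) where {R,C,L,T}
    R * C == L || throw(DimensionMismatch("expected $(R * C) entries, got $L"))
    return TinyMatrix{R,C,T,L}(data)
end

# The size is available from the type alone.
Base.size(::Type{<:TinyMatrix{R,C}}) where {R,C} = (R, C)
Base.size(A::TinyMatrix) = size(typeof(A))
Base.getindex(A::TinyMatrix{R}, i::Int, j::Int) where {R} = A.data[i + (j - 1) * R]

# Dispatch selects the size-specific method from the argument type.
tiny_det(A::TinyMatrix{1,1}) = A[1, 1]
tiny_det(A::TinyMatrix{2,2}) = A[1, 1] * A[2, 2] - A[1, 2] * A[2, 1]

A = TinyMatrix{2,2}((1, 3, 2, 4))   # column-major, i.e. the matrix [1 2; 3 4]
size(typeof(A))   # (2, 2)
tiny_det(A)       # -2
```

Since the size is part of the type, a 2x2 and a 3x3 `TinyMatrix` are different types and get different specializations.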

In [44]:
@btime det(S);

  17.042 ns (1 allocation: 16 bytes)


In fact, let us check what `det(S)` is actually doing under the hood.

In [45]:
@code_typed debuginfo=:none det(S)

CodeInfo(
1 ─ %1  = StaticArrays.getfield(A, :data)::NTuple{4, Int64}
│   %2  = Base.getfield(%1, 1, true)::Int64
│   %3  = Base.sitofp(Float64, %2)::Float64
│   %4  = Base.getfield(%1, 2, true)::Int64
│   %5  = Base.sitofp(Float64, %4)::Float64
│   %6  = Base.getfield(%1, 3, true)::Int64
│   %7  = Base.sitofp(Float64, %6)::Float64
│   %8  = Base.getfield(%1, 4, true)::Int64
│   %9  = Base.sitofp(Float64, %8)::Float64
│   %10 = Base.mul_float(%3, %9)::Float64
│   %11 = Base.mul_float(%7, %5)::Float64
│   %12 = Base.sub_float(%10, %11)::Float64
└──       return %12
) => Float64

Let's translate this into something more readable:
```julia
%2 = %3 = S[1,1] = 1
%4 = %5 = S[2,1] = 3
%6 = %7 = S[1,2] = 2
%8 = %9 = S[2,2] = 4

%10 = %3 * %9 = S[1,1] * S[2,2]
%11 = %7 * %5 = S[1,2] * S[2,1]
%12 = %10 - %11 = S[1,1] * S[2,2] - S[1,2] * S[2,1]
```

**Overall it just corresponds to the explicit formula that we hand-coded in `det_2x2`!**
```julia
det(S) = S[1,1] * S[2,2] - S[1,2] * S[2,1]
```

In [46]:
@code_native debuginfo=:none det(S)

(Side remark: of course, here it is not really the compiler that does the job but it gets help from multiple dispatch: StaticArrays.jl just implements a specific method of the function `det` for the 2x2 case, see `@edit det(S)`. But you hopefully get the point.)

In [48]:
@edit det(S)

/home/eric/.julia/packages/StaticArrays/5bAMi/src/det.jl

@inline function det(A::StaticMatrix)
    T = eltype(A)
    S = arithmetic_closure(T)
    A_S = convert(similar_type(A,S),A)
    _det(Size(A_S),A_S)
end

@inline _det(::Size{(1,1)}, A::StaticMatrix) = @inbounds return A[1]

@inline function _det(::Size{(2,2)}, A::StaticMatrix)


**Other operations sped up by using StaticArrays.jl**

```
============================================
    Benchmarks for 3×3 Float64 matrices
============================================
Matrix multiplication               -> 5.9x speedup
Matrix multiplication (mutating)    -> 1.8x speedup
Matrix addition                     -> 33.1x speedup
Matrix addition (mutating)          -> 2.5x speedup
Matrix determinant                  -> 112.9x speedup
Matrix inverse                      -> 67.8x speedup
Matrix symmetric eigendecomposition -> 25.0x speedup
Matrix Cholesky decomposition       -> 8.8x speedup
Matrix LU decomposition             -> 6.1x speedup
Matrix QR decomposition             -> 65.0x speedup
```

Of course, by putting more information in the type, you are putting more stress on the compiler to optimize things. If the static arrays are too big, compile time might explode, or the compiler might just give up and fall back to a slow default version.

Hence, static arrays are only useful as small fixed-size arrays.

In [None]:
# # might take longer to compile and the speedup is gone
# N = 20
# M = rand(N,N);
# m = SMatrix{N,N}(M);

# println("Inversion")
# @btime inv($m);
# @btime inv($M);

# Are explicit type annotations necessary? (like in C or Fortran)

Note that Julia's type inference is powerful. Specifying types **is not** necessary for best performance!

In [49]:
function my_function(x)
    y = rand()
    z = rand()
    x+y+z
end

function my_function_typed(x::Int)::Float64
    y::Float64 = rand()
    z::Float64 = rand()
    x+y+z
end

my_function_typed (generic function with 1 method)

In [50]:
@btime my_function(10);
@btime my_function_typed(10);

  6.526 ns (0 allocations: 0 bytes)
  6.533 ns (0 allocations: 0 bytes)
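Besides timing, one can ask the compiler directly what it inferred. The sketch below repeats the two functions under the shorter, hypothetical names `unannotated` and `annotated` so that it runs on its own, and uses the unexported reflection helper `Base.return_types`.

```julia
function unannotated(x)
    y = rand()
    z = rand()
    return x + y + z
end

function annotated(x::Int)::Float64
    y::Float64 = rand()
    z::Float64 = rand()
    return x + y + z
end

# Inferred return type when called with an Int argument
rt_unannotated = only(Base.return_types(unannotated, (Int,)))
rt_annotated   = only(Base.return_types(annotated, (Int,)))
```

Both are inferred as `Float64`: the annotations add no information that type inference did not already have.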


However, annotating types explicitly can serve a purpose.

* **Define a user interface/type filter** (will throw error if incompatible type is given)
* Enforce conversions
* Rarely, help the compiler infer types in tricky situations
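The first two points can be sketched as follows. The functions `half` and `as_float` are hypothetical examples written for this illustration.

```julia
# 1. Type filter: only integers are accepted.
half(x::Integer) = x ÷ 2

rejected_float = try
    half(3.0)            # there is no method for Float64
    false
catch err
    err isa MethodError  # the call is rejected before any code runs
end

# 2. Enforced conversion: the return-type annotation converts the result.
as_float(x)::Float64 = x

as_float(1)   # 1.0
```

In neither case does the annotation make the code faster. It only changes which calls are accepted and what type comes out.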

# Core messages of this Notebook

* Julia **can be fast.**
* **A function is compiled when called for the first time** with a given set of argument types.
* There are **multiple compilation steps**, all of which can be inspected through macros like `@code_warntype`.
* **Code specialization** based on the types of all of the input arguments is important for speed.
* Calculations can be moved to compile-time to make run-time faster.
* In virtually all cases, **explicit type annotations are irrelevant for performance**.
* Type annotations in function signatures define a **type filter/user interface**.