# Optimizing Performance (Single-Core)

## SIMD

SIMD stands for **"Single Instruction Multiple Data"** and falls into the category of instruction-level **parallelism** (vector instructions). Since raw clock speeds have largely stopped increasing, one way processors have continued to improve performance is through instructions that operate on a "vector" (basically, a short sequence of values stored contiguously in memory).

Consider this simple vector addition example:

In [None]:
function vector_add(A, B, C)    
    for i in eachindex(A, B, C)
        @inbounds A[i] = B[i] + C[i]
    end
end

<br>

*Packed* vector addition: **vaddpd**

<img src="./imgs/SIMD.svg" width=450px>
<br>


The idea behind SIMD is to perform the add instruction on multiple elements at the same time (instead of separately performing them one after another). The process of splitting up the simple loop addition into multiple vector additions is often denoted as "loop vectorization". Since each vectorized addition happens at instruction level, i.e. within a CPU core, the feature set of the CPU determines how many elements we can process in one go.
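
To make "loop vectorization" a bit more concrete, here is a minimal conceptual sketch of what the compiler does to the loop above. The function name `vector_add_chunked!` and the chunk width `W` are made up for illustration; the inner loop over `j` stands in for what would be a single packed instruction, and the arrays are assumed to use 1-based indexing.

```julia
# Conceptual sketch only: W is a hypothetical chunk width
# (e.g. 4 Float64 values fit into one 256-bit register).
function vector_add_chunked!(A, B, C, W = 4)
    @assert length(A) == length(B) == length(C)
    n = length(A)
    nfull = n - n % W                 # largest multiple of W that is <= n
    @inbounds for i in 1:W:nfull
        for j in 0:W-1                # conceptually ONE packed add for the whole chunk
            A[i+j] = B[i+j] + C[i+j]
        end
    end
    @inbounds for i in nfull+1:n      # scalar remainder loop for the leftover elements
        A[i] = B[i] + C[i]
    end
    return A
end

out = zeros(10)
vector_add_chunked!(out, collect(1.0:10.0), ones(10))
```

The remainder loop is the reason why array lengths that are a multiple of the SIMD width are slightly more convenient for the compiler.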

#### SIMD register width

<br>
<img src="./imgs/SIMD_vectorwidth.svg" width=580px>
<br>
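
As a rough rule of thumb, the number of elements processed per instruction is the register width divided by the element size. The helper `simd_lanes` below is a hypothetical name introduced only for this illustration; the register widths used are those of SSE (128 bit), AVX/AVX2 (256 bit), and AVX-512 (512 bit).

```julia
# Number of elements of type T that fit into one SIMD register of the given width (in bits).
simd_lanes(register_bits, ::Type{T}) where {T} = register_bits ÷ (8 * sizeof(T))

simd_lanes(128, Float64)  # SSE:      2
simd_lanes(256, Float64)  # AVX/AVX2: 4
simd_lanes(512, Float64)  # AVX-512:  8
simd_lanes(256, Float32)  # 8 (half the element size, twice the lanes)
```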

### Is SIMD important?

**Peak performance** (single-core): $P_\textrm{core} = f \cdot n_\textrm{super} \cdot n_\textrm{FMA} \cdot n_\textrm{SIMD}$
- $f$: clock frequency
- $n_\textrm{super}$: superscalarity (multiple arithmetic units)
- $n_\textrm{FMA}$: FMA factor (two FLOPs in one instruction)
- $n_\textrm{SIMD}$: SIMD factor

|microarchitecture|processor|launch date|$f$ [GHz]|$n_\textrm{super}$|$n_\textrm{FMA}$|$n_\textrm{SIMD}$|$P_\textrm{core}$ [GFLOPS]|
|:----|:----|:----|:----|:----|:----|:----|:----|
|Haswell|Xeon E5-2695 v3|Q3/2014|2.30|2|2|4|36.8|
|Skylake SP|Xeon Gold 6148|Q3/2017|2.40|2|2|8|76.8|
|Zen 2|EPYC 7642|Q3/2019|2.30|2|2|4|36.8|
|Zen 3|EPYC 7763|Q1/2021|2.45|2|2|4|39.2|
|A64FX|FX1000|Q1/2020|2.20|2|2|8|70.4|
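
The last column of the table is just the product of the other four numbers. A small sketch (the function name `peak_gflops` is made up here) that reproduces two of the rows:

```julia
# Single-core peak performance in GFLOPS: clock frequency (GHz) times the three factors.
peak_gflops(f_GHz, n_super, n_FMA, n_SIMD) = f_GHz * n_super * n_FMA * n_SIMD

peak_gflops(2.30, 2, 2, 4)  # Haswell row:    36.8
peak_gflops(2.40, 2, 2, 8)  # Skylake SP row: 76.8
```

Note that doubling $n_\textrm{SIMD}$ from 4 to 8 accounts for essentially the entire difference between these two rows.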


### What does my system support?

Let's check which "advanced vector extensions" (AVX) the system supports.

In [None]:
using CpuId
cpuinfo()

In [None]:
filter(x -> contains(string(x), "AVX"), cpufeatures())
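
If CpuId.jl is not available, Base alone gives a coarser hint. The sketch below only reports the CPU name and architecture as seen by Julia/LLVM; it does not list individual features, and the example values in the comments are merely typical outputs, not what your machine will necessarily print.

```julia
# Base-only alternative (no extra package needed)
Sys.CPU_NAME   # LLVM's name for the host CPU, e.g. "skylake-avx512" or "znver2"
Sys.ARCH       # architecture as a Symbol, e.g. :x86_64 or :aarch64
```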

In [None]:
SIZE = 512^2
A = rand(Float64, SIZE);
B = rand(Float64, SIZE);
C = rand(Float64, SIZE);

In [None]:
@code_native debuginfo=:none syntax=:intel vector_add(A,B,C)

### It's not always so simple: SIMD can be hard...

Autovectorization is a hard problem (it needs to prove a lot of things about the code!). After all, it is a form of parallelism and efficient parallelism can be hard as well...

Not every loop is (readily) vectorizable. **Keep your loops as simple as possible!**

* avoid conditionals and function calls etc.
* ideally, loop length is static (countable up front).
* access **contiguous data** (spatial locality).
  * (align data structures to SIMD width boundary)
* avoid data dependencies (e.g. between loop iterations)
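
To illustrate the last point of the list, here is a minimal sketch contrasting a loop with independent iterations and one with a loop-carried dependency. The function names `scale!` and `running_sum!` are hypothetical and 1-based arrays are assumed.

```julia
# Independent iterations: each B[i] depends only on A[i] -> readily vectorizable.
function scale!(B, A, s)
    @inbounds for i in eachindex(A, B)
        B[i] = s * A[i]
    end
    return B
end

# Loop-carried dependency: iteration i needs the result of iteration i-1,
# so the iterations cannot simply be executed side by side.
function running_sum!(B, A)
    @assert length(A) == length(B)
    isempty(A) && return B
    B[1] = A[1]
    @inbounds for i in 2:length(A)
        B[i] = B[i-1] + A[i]
    end
    return B
end

scale!(zeros(3), [1.0, 2.0, 3.0], 2.0)      # [2.0, 4.0, 6.0]
running_sum!(zeros(4), [1.0, 2.0, 3.0, 4.0]) # [1.0, 3.0, 6.0, 10.0]
```

Inspecting both functions with `@code_native` should show the difference, although the exact instructions depend on the CPU and Julia version.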

#### Example: Reduction

In [None]:
function vector_dot(B, C)
    a = zero(eltype(B))
    for i in eachindex(B,C)
        @inbounds a += B[i] * C[i]
    end
    return a
end

In [None]:
@code_native debuginfo=:none syntax=:intel vector_dot(B, C)

Note the `vaddsd` instruction and usage of `xmmi` registers (128 bit).

#### How could this loop reduction be vectorized manually?

In [None]:
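function vector_dot_unrolled4(B, C)
    a1 = zero(eltype(B))
    a2 = zero(eltype(B))
    a3 = zero(eltype(B))
    a4 = zero(eltype(B))
    n = length(B)
    # Four independent accumulators remove the dependency on a single running sum.
    @inbounds for i in 1:4:n-3
        a1 += B[i] * C[i]
        a2 += B[i+1] * C[i+1]
        a3 += B[i+2] * C[i+2]
        a4 += B[i+3] * C[i+3]
    end
    # Handle the remaining (at most three) elements if n is not a multiple of 4.
    @inbounds for i in (n - n % 4 + 1):n
        a1 += B[i] * C[i]
    end
    return a1+a2+a3+a4
end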

In [None]:
@code_native debuginfo=:none syntax=:intel vector_dot_unrolled4(B, C)

In [None]:
using BenchmarkTools
@btime vector_dot($B, $C) samples=10 evals=3;
@btime vector_dot_unrolled4($B, $C) samples=10 evals=3;

#### The "automatic" way: `@simd`

To (try to) "force" automatic SIMD vectorization in Julia, you can use the `@simd` macro.

In [None]:
function vector_dot_simd(B, C)
    a = zero(eltype(B))
    @simd for i in eachindex(B,C)
        @inbounds a += B[i] * C[i]
    end
    return a
end

By using the `@simd` macro, we are **asserting several properties** of the loop:

* It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
* Floating-point operations on reduction variables can be reordered, possibly causing different results than without `@simd`.

In [None]:
@btime vector_dot_simd($B, $C) samples=10 evals=3;

In [None]:
@code_native debuginfo=:none syntax=:intel vector_dot_simd(B, C)

Note the `vfmadd231pd` instruction and usage of `ymmi` AVX registers (256 bit).

#### Data types matter
Floating-point addition is **non-associative**, so the order of operations matters.

In [None]:
v = rand(10^6)
@show vector_dot(v,v);
@show vector_dot_simd(v,v);
@show abs(vector_dot(v,v) - vector_dot_simd(v,v));

How bad can this get? In principle, [arbitrarily bad](https://discourse.julialang.org/t/when-shouldnt-we-use-simd/18276/11?u=carstenbauer)!! Quite often you can get away with it though.
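
Two tiny examples, independent of any loop, that show the effect of regrouping floating-point additions (the variable name `huge` is arbitrary):

```julia
(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)   # false: the two groupings round differently

huge = 1e100
(huge + 1.0) - huge                      # 0.0: the 1.0 is absorbed by the huge term
(huge - huge) + 1.0                      # 1.0: same terms, different order
```

The second pair is the "arbitrarily bad" case in miniature: the same three numbers sum to either 0.0 or 1.0 depending only on the order.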


Compare this to integer addition, which is **associative**, so the order of operations has no impact on the result.
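
This holds even when an intermediate result overflows, because Julia's native integers wrap around (modular arithmetic). A minimal check, with `imax` as an arbitrary variable name:

```julia
imax = typemax(Int64)
imax + 1 == typemin(Int64)         # true: the addition overflows and wraps around
(imax + 1) - 1 == imax + (1 - 1)   # true: regrouping does not change the result
```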

In [None]:
B_int = rand(Int64, SIZE);
C_int = rand(Int64, SIZE);

In [None]:
@show vector_dot(B_int, C_int);
@show vector_dot_simd(B_int, C_int);
@show abs(vector_dot(B_int, C_int) - vector_dot_simd(B_int, C_int));

In [None]:
@btime vector_dot($B_int, $C_int) samples=10 evals=3;
@btime vector_dot_simd($B_int, $C_int) samples=10 evals=3;

#### Data layout matters (Structure of Arrays vs. Array of Structures)

Contiguous memory access facilitates SIMD.

In [None]:
complex_numbers_aos = [rand() + im * rand() for i in 1:1024] # array of structs (Complex{Float64})

In [None]:
import Base: sum

struct ComplexNumbers
    x::Vector{Float64}
    y::Vector{Float64}
end

sum(cn::ComplexNumbers) = sum(cn.x) + im * sum(cn.y)

In [None]:
complex_numbers_soa = ComplexNumbers(rand(1024), rand(1024)) # struct of arrays

In [None]:
@btime sum($complex_numbers_aos);
@btime sum($complex_numbers_soa);

Sidenote: [StructArrays.jl](https://github.com/JuliaArrays/StructArrays.jl)
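
Without any package, an existing array of structs can be converted to a struct-of-arrays layout by hand. The sketch below uses only Base functions; the names `SoAComplex`, `to_soa`, and `soa_sum` are hypothetical and chosen so they do not clash with the definitions above.

```julia
# Struct of arrays: real and imaginary parts each live in their own contiguous vector.
struct SoAComplex
    x::Vector{Float64}   # real parts
    y::Vector{Float64}   # imaginary parts
end

# Split an array of Complex{Float64} into two contiguous Float64 vectors.
to_soa(aos::Vector{ComplexF64}) = SoAComplex(real.(aos), imag.(aos))

# Sum both parts separately and recombine.
soa_sum(s::SoAComplex) = complex(sum(s.x), sum(s.y))

aos = [complex(1.0, 2.0), complex(3.0, 4.0), complex(5.0, 6.0)]
soa = to_soa(aos)
soa_sum(soa) == sum(aos)   # true: both give 9.0 + 12.0im
```

The conversion itself costs a pass over the data, so it pays off mainly when the SoA form is reused across many kernels.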

## Other tricks: `@fastmath` (if time permits)

Enables lots of floating-point optimizations that are potentially *unsafe*! It trades accuracy for speed, so [beware of fast-math](https://simonbyrne.github.io/notes/fastmath/). (See the [LLVM Language Reference Manual](https://llvm.org/docs/LangRef.html#fast-math-flags) for more information on which fast-math flags it sets.)

### SIMD
Among other things, it **facilitates SIMD vectorization** because it:
* Allows re-association of operands in series of floating-point operations.

In [None]:
function vector_dot_fastmath(B, C)
    a = zero(eltype(B))
    @fastmath for i in eachindex(B,C)
        @inbounds a += B[i] * C[i]
    end
    return a
end

In [None]:
@btime vector_dot_fastmath($B, $C) samples=10 evals=3;

In [None]:
@code_native debuginfo=:none syntax=:intel vector_dot_fastmath(B,C)

### FMA - Fused Multiply Add

In [None]:
f(a,b,c) = a*b+c

In [None]:
@code_native debuginfo=:none f(1.0,2.0,3.0)

In [None]:
f_fastmath(a,b,c) = @fastmath a*b+c

In [None]:
@code_native debuginfo=:none f_fastmath(1.0,2.0,3.0)

(In this specific case, the explicit `fma` function or [MuladdMacro.jl](https://github.com/SciML/MuladdMacro.jl) are *safer* alternatives.)
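
A small sketch of how the Base functions `fma` and `muladd` differ from the plain expression. The input values are chosen only because the rounding effect is easy to see with them.

```julia
0.1 * 10.0 - 1.0          # 0.0: the product is rounded to 1.0 before the subtraction
fma(0.1, 10.0, -1.0)      # 2.0^-54 ≈ 5.55e-17: the exact product is used, rounded only once
muladd(0.1, 10.0, -1.0)   # one of the two results above, whichever is faster on the host CPU
```

`fma` always gives the single-rounding result (falling back to a slower software path if the hardware lacks FMA), whereas `muladd` leaves the choice to the compiler.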

<img src="./imgs/skylake_server_microarch.png" width=900px>

**Source:** [Intel® 64 and IA-32 Architectures Optimization Reference Manual](https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf)

#### Sidenote: Why doesn't Julia use FMA automatically?

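Answer: because it can break math in weird ways. An FMA skips the intermediate rounding of `a*b`, so its result can contradict a check that was made on the separately rounded product.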

In [None]:
function f(a,b,c)
    @assert a*b ≥ c
    return sqrt(a*b-c)
end

function f_fma(a,b,c)
    @assert a*b ≥ c
    return sqrt(fma(a,b,-c))
end

a = 1.0 + 0.5^27;
b = 1.0 - 0.5^27;
c = 1.0;

In [None]:
f(a,b,c)

In [None]:
f_fma(a,b,c)
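
Why the two versions disagree can be checked with extended precision. The sketch below recomputes the expression with `BigFloat`; the variable names `exact` and `outcome` are arbitrary.

```julia
a = 1.0 + 0.5^27
b = 1.0 - 0.5^27
c = 1.0

exact = big(a) * big(b) - big(c)   # exact in BigFloat: -2^-54, a tiny negative number
a * b == 1.0                       # true: in Float64 the product rounds to exactly 1.0
a * b - c == 0.0                   # true: so the unfused expression sees sqrt(0.0)
fma(a, b, -c) == exact             # true: the fused version keeps the negative remainder

# The assertion a*b ≥ c passes, yet the fused result is negative, so sqrt throws.
outcome = try
    sqrt(fma(a, b, -c))
catch err
    err
end
outcome isa DomainError            # true
```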

# Core messages of this Notebook

* **SIMD is important for your innermost computational kernel** and, ideally, can give you a factor of 4 or 8 speedup (for `Float64`).
* **Keep your hot loop as simple as possible** to facilitate SIMD (avoid branches, data dependencies, etc., if possible).
* (Carefully) think about using `@simd`, `@fastmath`, etc. to **opt into potentially unsafe optimizations**.