# SIMD

SIMD stands for **"Single Instruction Multiple Data"** and is a form of data **parallelism** at the instruction level (vector instructions). Since raw clock speeds haven't been getting much faster, one way processors have increased performance is through instructions that operate on a "vector" (basically, a short sequence of values contiguous in memory).

Consider this simple example where `A`, `B`, and `C` are vectors:

In [1]:
function vector_add(A, B, C)
    for i in eachindex(A, B, C)
        @inbounds A[i] = B[i] + C[i]
    end
end

vector_add (generic function with 1 method)


The idea behind SIMD is to perform the add instruction on multiple elements at the same time (instead of separately performing them one after another). The process of splitting up the simple loop addition into multiple vector additions is often denoted as "loop vectorization". Since each vectorized addition happens at instruction level, i.e. within a CPU core, the feature set of the CPU determines how many elements we can process in one go.

<img src="./imgs/simd_vaddpd.png" width=300px>
<img src="./imgs/simd_register_width.png" width=400px>

(**Source:** Node-level performance engineering course by [NHR@FAU](https://hpc.fau.de/))
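To make the register-width picture concrete, here is a minimal sketch that computes how many elements fit into one SIMD register. The helper name `simd_lanes` is hypothetical (it is not part of Julia or CpuId), and the widths 128, 256, and 512 bits are the nominal SSE, AVX/AVX2, and AVX-512 register sizes.

```julia
# Hypothetical helper (not from any package): number of elements of type T
# that fit into a SIMD register of the given width in bits.
simd_lanes(::Type{T}, register_bits::Integer) where {T} = register_bits ÷ (8 * sizeof(T))

for (name, bits) in (("SSE (xmm)", 128), ("AVX/AVX2 (ymm)", 256), ("AVX-512 (zmm)", 512))
    println(rpad(name, 16), " Float64: ", simd_lanes(Float64, bits),
            "  Float32: ", simd_lanes(Float32, bits))
end
```

For `Float64`, a 256-bit register holds 4 values and a 512-bit register holds 8, which is the upper bound on how many additions a single vector instruction can perform.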

Let's check which "advanced vector extensions" (AVX) the system supports.

In [2]:
using CpuId
cpuinfo()

| Cpu Property       | Value                                                      |
|:------------------ |:---------------------------------------------------------- |
| Brand              | Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz                   |
| Vendor             | :Intel                                                     |
| Architecture       | :Skylake                                                   |
| Model              | Family: 0x06, Model: 0x55, Stepping: 0x04, Type: 0x00      |
| Cores              | 20 physical cores, 40 logical cores (on executing CPU)     |
|                    | Hyperthreading hardware capability detected                |
| Clock Frequencies  | 2400 / 3700 MHz (base/max), 100 MHz bus                    |
| Data Cache         | Level 1:3 : (32, 1024, 28160) kbytes                       |
|                    | 64 byte cache line size                                    |
| Address Size       | 48 bits virtual, 46 bits physical                          |
| SIMD               | 512 bit = 64 byte max. SIMD vector size                    |
| Time Stamp Counter | TSC is accessible via `rdtsc`                              |
|                    | TSC runs at constant rate (invariant from clock frequency) |
| Perf. Monitoring   | Performance Monitoring Counters (PMC) revision 4           |
|                    | Available hardware counters per logical core:              |
|                    | 3 fixed-function counters of 48 bit width                  |
|                    | 8 general-purpose counters of 48 bit width                 |
| Hypervisor         | No                                                         |


In [3]:
filter(x -> contains(string(x), "AVX"), cpufeatures())

7-element Vector{Symbol}:
 :AVX
 :AVX2
 :AVX512BW
 :AVX512CD
 :AVX512DQ
 :AVX512F
 :AVX512VL

**Hawk nodes do not have AVX512.**

In [4]:
SIZE = 2^16
SIZE = 512^2

262144

In [5]:
A = rand(Float64, SIZE)
B = rand(Float64, SIZE)
C = rand(Float64, SIZE);

In [6]:
@code_native debuginfo=:none syntax=:intel vector_add(A,B,C)

	.text
	.file	"vector_add"
	.globl	japi1_vector_add_1723           # -- Begin function japi1_vector_add_1723
	.p2align	4, 0x90
	.type	japi1_vector_add_1723,@function
japi1_vector_add_1723:                  # @japi1_vector_add_1723
	.cfi_startproc
# %bb.0:                                # %top
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	and	rsp, -32
	sub	rsp, 160
	.cfi_offset rbx, -56
	.cfi_offset r12, -48
	.cfi_offset r13, -40

## It's not always so simple: SIMD can be hard...

Autovectorization is a hard problem (the compiler needs to prove a lot of things about the code!). After all, it is a form of parallelism, and efficient parallelism can be hard as well...

Not every loop is (readily) vectorizable. **Keep your loops as simple as possible!**

* avoid conditionals and function calls etc.
* ideally, loop length is static (countable up front).
* access **contiguous data** (spatial locality).
  * (align data structures to SIMD width boundary)
* avoid data dependencies (e.g. between loop iterations)
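The last point can be illustrated with a minimal sketch that contrasts a loop with independent iterations and a loop with a loop-carried dependency. The function names `add_independent!` and `running_sum!` are made up for this example, and both functions assume that their arguments share the same indices.

```julia
# Independent iterations: each A[i] depends only on B[i] and C[i],
# so several iterations can be computed by one vector instruction.
function add_independent!(A, B, C)
    @inbounds for i in eachindex(A, B, C)
        A[i] = B[i] + C[i]
    end
    return A
end

# Loop-carried dependency: A[i] needs the freshly written A[i-1],
# so iteration i cannot start before iteration i-1 has finished.
function running_sum!(A, B)
    isempty(A) && return A
    i0 = firstindex(A)
    A[i0] = B[i0]
    for i in i0+1:lastindex(A)
        A[i] = A[i-1] + B[i]
    end
    return A
end

@show add_independent!(zeros(3), [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
@show running_sum!(zeros(4), [1.0, 2.0, 3.0, 4.0])
```

The first loop is a typical candidate for autovectorization; the second one generally is not, at least not without restructuring the algorithm.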

### Example: Reduction

In [7]:
function vector_dot(B, C)
    a = zero(eltype(B))
    for i in eachindex(B,C)
        @inbounds a += B[i] * C[i]
    end
    return a
end

vector_dot (generic function with 1 method)

In [8]:
@code_native debuginfo=:none syntax=:intel vector_dot(B, C)

	.text
	.file	"vector_dot"
	.globl	julia_vector_dot_1774           # -- Begin function julia_vector_dot_1774
	.p2align	4, 0x90
	.type	julia_vector_dot_1774,@function
julia_vector_dot_1774:                  # @julia_vector_dot_1774
	.cfi_startproc
# %bb.0:                                # %top
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	and	rsp, -32
	sub	rsp, 96
	.cfi_offset rbx, -56
	.cfi_offset r12, -48
	.cfi_offset r13, -40

Note the `vaddsd` instruction and usage of `xmmi` registers (128 bit).
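Since the full assembly listing is long, it can help to capture it as a string and search for register names instead of reading it line by line. The following is a rough sketch under a few assumptions: `dot_plain` is a made-up name for a copy of the reduction loop (repeated so the snippet is self-contained), and the register-name search is an x86-specific heuristic, not a rigorous vectorization check.

```julia
using InteractiveUtils  # provides code_native (loaded automatically in the REPL and notebooks)

# Same reduction loop as vector_dot, repeated so the snippet is self-contained.
function dot_plain(B, C)
    a = zero(eltype(B))
    for i in eachindex(B, C)
        @inbounds a += B[i] * C[i]
    end
    return a
end

# Capture the assembly as a String instead of printing it.
asm = sprint() do io
    code_native(io, dot_plain, (Vector{Float64}, Vector{Float64});
                debuginfo=:none, syntax=:intel)
end

# x86-specific heuristic: which register families appear in the generated code?
for reg in ("xmm", "ymm", "zmm")
    println(reg, ": ", occursin(reg, asm))
end
```

The result depends on the CPU and the Julia version, so no particular output is implied here.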

#### How could this loop reduction be vectorized manually?

In [9]:
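function vector_dot_unrolled4(B, C)
    a1 = zero(eltype(B))
    a2 = zero(eltype(B))
    a3 = zero(eltype(B))
    a4 = zero(eltype(B))
    n = length(B)
    # main loop: four independent accumulators, up to the last full group of 4
    @inbounds for i in 1:4:n-3
        a1 += B[i] * C[i]
        a2 += B[i+1] * C[i+1]
        a3 += B[i+2] * C[i+2]
        a4 += B[i+3] * C[i+3]
    end
    # remainder loop: the up to 3 trailing elements when n is not a multiple of 4
    @inbounds for i in (n - n % 4 + 1):n
        a1 += B[i] * C[i]
    end
    return a1+a2+a3+a4
end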

vector_dot_unrolled4 (generic function with 1 method)

In [10]:
@code_native debuginfo=:none syntax=:intel vector_dot_unrolled4(B, C)

	.text
	.file	"vector_dot_unrolled4"
	.globl	julia_vector_dot_unrolled4_1799 # -- Begin function julia_vector_dot_unrolled4_1799
	.p2align	4, 0x90
	.type	julia_vector_dot_unrolled4_1799,@function
julia_vector_dot_unrolled4_1799:        # @julia_vector_dot_unrolled4_1799
	.cfi_startproc
# %bb.0:                                # %top
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
	push	r14
	push	rbx
	.cfi_offset rbx, -32
	.cfi_offset r14, -24
	mov	r14, rsi
	mov	rbx, rdi
	mov	rdx, qword ptr [rdi + 8]

In [11]:
using BenchmarkTools
@btime vector_dot($B, $C);
@btime vector_dot_unrolled4($B, $C);

  284.177 μs (0 allocations: 0 bytes)
  153.287 μs (0 allocations: 0 bytes)


#### The "automatic" way: `@simd`

To (try to) "force" automatic SIMD vectorization in Julia, you can use the `@simd` macro.

In [12]:
function vector_dot_simd(B, C)
    a = zero(eltype(B))
    @simd for i in eachindex(B,C)
        @inbounds a += B[i] * C[i]
    end
    return a
end

vector_dot_simd (generic function with 1 method)

By using the `@simd` macro, we are **asserting several properties** of the loop:

* It is safe to execute iterations in arbitrary or overlapping order, with special consideration for reduction variables.
* Floating-point operations on reduction variables can be reordered, possibly causing different results than without `@simd`.

In [13]:
@btime vector_dot_simd($B, $C);

  149.821 μs (0 allocations: 0 bytes)


In [14]:
@code_native debuginfo=:none syntax=:intel vector_dot_simd(B, C)

	.text
	.file	"vector_dot_simd"
	.globl	julia_vector_dot_simd_2041      # -- Begin function julia_vector_dot_simd_2041
	.p2align	4, 0x90
	.type	julia_vector_dot_simd_2041,@function
julia_vector_dot_simd_2041:             # @julia_vector_dot_simd_2041
	.cfi_startproc
# %bb.0:                                # %top
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	and	rsp, -32
	sub	rsp, 96
	.cfi_offset rbx, -56
	.cfi_offset r12, -48
	.cfi_offset r13,

Note the `vfmadd231pd` instruction and usage of `ymmi` AVX registers (256 bit).

#### Data types matter
Floating-point addition is **non-associative** and the order of operations is important.

In [15]:
v = rand(10^6)
@show vector_dot(v,v);
@show vector_dot_simd(v,v);
@show abs(vector_dot(v,v) - vector_dot_simd(v,v));

vector_dot(v, v) = 333183.770499329
vector_dot_simd(v, v) = 333183.77049932454
abs(vector_dot(v, v) - vector_dot_simd(v, v)) = 4.48198989033699e-9


How bad can this get? In principle, [arbitrarily bad](https://discourse.julialang.org/t/when-shouldnt-we-use-simd/18276/11?u=carstenbauer)! Quite often you can get away with it, though.
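A small, self-contained illustration of both cases (harmless and severe), using only plain `Float64` arithmetic. The specific numbers are chosen for this example and do not come from the linked discussion.

```julia
# Reassociating three terms already changes the last bit of the result.
left  = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
@show left
@show right
@show left == right

# With large cancellation, the reordered sum is not just slightly off but wrong.
big = 1e100
@show (big + 1.0) - big   # the 1.0 is absorbed by big before the subtraction
@show (big - big) + 1.0   # cancelling first preserves the 1.0
```

The first pair differs only in the last digit, whereas the second pair evaluates to `0.0` and `1.0`, respectively. A reordered reduction over data with this kind of cancellation can therefore lose all significant digits.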


Compare this to integer addition, which is **associative**, so the order of operations has no impact.

In [16]:
B_int = rand(Int64, SIZE);
C_int = rand(Int64, SIZE);

In [17]:
@show vector_dot(B_int, C_int);
@show vector_dot_simd(B_int, C_int);
@show abs(vector_dot(B_int, C_int) - vector_dot_simd(B_int, C_int));

vector_dot(B_int, C_int) = 1495217605974072290
vector_dot_simd(B_int, C_int) = 1495217605974072290
abs(vector_dot(B_int, C_int) - vector_dot_simd(B_int, C_int)) = 0
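Associativity holds even when intermediate results overflow, because Julia's native integer types use wraparound (modular) arithmetic. A minimal sketch with values chosen for illustration:

```julia
x = typemax(Int64)

# The intermediate result overflows and wraps around to typemin(Int64) ...
@show x + 1 == typemin(Int64)

# ... but because wraparound arithmetic is still associative,
# both groupings produce the same final value.
@show (x + 1) - 1 == x + (1 - 1)
```

This is why the random `Int64` products above, which overflow almost certainly, still give identical sums with and without `@simd`.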


In [18]:
@btime vector_dot($B_int, $C_int);
@btime vector_dot_simd($B_int, $C_int);

  154.575 μs (0 allocations: 0 bytes)
  153.743 μs (0 allocations: 0 bytes)


### [LoopVectorization.jl](https://github.com/JuliaSIMD/LoopVectorization.jl)

Loosely speaking, you can think of `@turbo` as a much(!) more sophisticated version of `@simd`. Hopefully, these features will at some point just be part of Julia's compiler.

(Sidenote: There is [LLVMLoopInfo.jl](https://github.com/JuliaSIMD/LLVMLoopInfo.jl) to talk to LLVM directly if necessary.)

In [19]:
using LoopVectorization

function vector_dot_turbo(B, C)
    a = zero(eltype(B))
    @turbo for i in eachindex(B,C)
        @inbounds a += B[i] * C[i]
    end
    return a
end

@btime vector_dot_simd($B, $C);
@btime vector_dot_turbo($B, $C);

  154.395 μs (0 allocations: 0 bytes)
  154.678 μs (0 allocations: 0 bytes)


In [20]:
@code_native debuginfo=:none syntax=:intel vector_dot_turbo(B, C)

	.text
	.file	"vector_dot_turbo"
	.globl	julia_vector_dot_turbo_2371     # -- Begin function julia_vector_dot_turbo_2371
	.p2align	4, 0x90
	.type	julia_vector_dot_turbo_2371,@function
julia_vector_dot_turbo_2371:            # @julia_vector_dot_turbo_2371
	.cfi_startproc
# %bb.0:                                # %top
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	and	rsp, -32
	sub	rsp, 96
	.cfi_offset rbx, -56
	.cfi_offset r12, -48
	.cfi_offset r13

On systems with AVX512, `@turbo` will lead to usage of `zmmi` AVX512 512-bit registers! (**Not the case on Hawk which only has AVX2.**)

Demo: With `@turbo`, a basic matrix-matrix multiplication implementation can be **competitive with highly optimized BLAS**, at least for small matrices where memory effects aren't important (see the dMMM exercise).

In [21]:
N = 64
X = zeros(N,N);
Y = rand(N,N);
Z = rand(N,N);

using LoopVectorization
using BenchmarkTools
using LinearAlgebra
BLAS.set_num_threads(1)

function mul_naive!(C, A, B)
    for m in axes(A,1)
        for n in axes(B,2)
            Cmn = zero(eltype(C))
            for k in axes(A,2)
                @inbounds Cmn += A[m,k] * B[k,n]
            end
            C[m,n] = Cmn
        end
    end
end

function mul_turbo!(C, A, B)
    @turbo for m in axes(A,1)
        for n in axes(B,2)
            Cmn = zero(eltype(C))
            for k in axes(A,2)
                @inbounds Cmn += A[m,k] * B[k,n]
            end
            C[m,n] = Cmn
        end
    end
end

@btime mul_naive!($X,$Y,$Z);
@btime mul_turbo!($X,$Y,$Z);
@btime mul!($X, $Y, $Z);

  203.175 μs (0 allocations: 0 bytes)
  5.359 μs (0 allocations: 0 bytes)
  5.188 μs (0 allocations: 0 bytes)


## Structure of Arrays vs Array of Structures

Data layout matters: contiguous memory access facilitates SIMD.
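To see what the two layouts look like in memory, here is a minimal sketch with three hand-picked complex numbers. It uses `reinterpret` to view the array-of-structs memory as plain `Float64` values; the variable names are made up for this example.

```julia
aos = [1.0 + 2.0im, 3.0 + 4.0im, 5.0 + 6.0im]   # array of structs

# View the same memory as plain Float64 values: real and imaginary parts alternate.
interleaved = reinterpret(Float64, aos)
@show collect(interleaved)

# Struct-of-arrays layout: all real parts are adjacent, as are all imaginary parts.
soa_re = real.(aos)
soa_im = imag.(aos)
@show soa_re
@show soa_im
```

In the interleaved layout, loading consecutive real parts into one vector register requires strided access or shuffles, whereas in the struct-of-arrays layout they can be loaded contiguously.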

In [22]:
complex_numbers_aos = [rand() + im * rand() for i in 1:1024] # array of structs (Complex{Float64})

1024-element Vector{ComplexF64}:
  0.4248032407022607 + 0.2074977769334484im
  0.2708788298095226 + 0.9636367428559456im
  0.8568111068532775 + 0.5778339209265116im
 0.13283542670636317 + 0.959886975842917im
 0.20525511058671342 + 0.6080479972797184im
   0.717257652806436 + 0.517080491269898im
  0.4610088871394108 + 0.824704046826356im
 0.16043317778389632 + 0.20261627620319844im
  0.9836829513483717 + 0.6953102754119255im
   0.449811959469115 + 0.37973849890363376im
                     ⋮
  0.3036630551257512 + 0.8642433576476658im
 0.25934671847669255 + 0.9415844808026036im
  0.6612220289259743 + 0.6843400059847646im
  0.3130432632522835 + 0.9775200175657431im
 0.04792672679127463 + 0.6533344607723329im
  0.9586454099547651 + 0.1598949292562052im
  0.3344518365842343 + 0.7036310712202178im
 0.42458605458400334 + 0.37240208976941713im
 0.11525810170134687 + 0.05304679358769526im

In [23]:
import Base: sum

struct ComplexNumbers
    x::Vector{Float64}
    y::Vector{Float64}
end

sum(cn::ComplexNumbers) = sum(cn.x) + im * sum(cn.y)

sum (generic function with 12 methods)

In [24]:
complex_numbers_soa = ComplexNumbers(rand(1024), rand(1024)) # struct of arrays

ComplexNumbers([0.6193147136959426, 0.7459879183835579, 0.9603511277842748, 0.047427286945926905, 0.6816802877340027, 0.028456199096732027, 0.7473851462871121, 0.3106263507193098, 0.034836457854273695, 0.5792693903931728  …  0.9021169198744317, 0.5777553808829311, 0.7759986835771416, 0.4498238289408256, 0.1613960981227155, 0.5311090855056004, 0.8857997349055956, 0.7944041720495205, 0.1923827875557672, 0.06908958790563946], [0.05855492121455874, 0.41963758819640173, 0.3148308118600841, 0.8493810264760896, 0.05564751993888295, 0.3361616710047636, 0.42226374810303446, 0.9886969135967675, 0.3409480759816155, 0.5055877520645108  …  0.6405126858894584, 0.680571162659709, 0.3543775516162031, 0.12898806711223365, 0.2516209371278817, 0.3993378831128842, 0.7073634674200572, 0.7589819161806708, 0.43614973545519187, 0.4745827687016867])

In [25]:
@btime sum($complex_numbers_aos);
@btime sum($complex_numbers_soa);

  291.658 ns (0 allocations: 0 bytes)
  182.893 ns (0 allocations: 0 bytes)


### [StructArrays.jl](https://github.com/JuliaArrays/StructArrays.jl)

In [26]:
using StructArrays

In [27]:
SoA = StructArray{Complex}((rand(1024), rand(1024)))

1024-element StructArray(::Vector{Float64}, ::Vector{Float64}) with eltype Complex:
   0.6231233162147308 + 0.5221752362078299im
   0.8461461113889266 + 0.2726221291192137im
    0.692530582858677 + 0.2967911706388219im
   0.6752175221862602 + 0.3679254837931377im
  0.13254359707246943 + 0.5151700983609677im
   0.6752885404447888 + 0.03757319641684442im
   0.9360958823341042 + 0.8867003711994643im
   0.7034623459671063 + 0.0316605595622198im
    0.830280130308197 + 0.09700492313178988im
    0.096210990946758 + 0.55136042043357im
                      ⋮
   0.9600656450603273 + 0.271979993002636im
  0.11808387776141394 + 0.38910352023717965im
   0.6952666793546648 + 0.6383123278596894im
   0.9539764977547674 + 0.2416287788447532im
   0.7330043462411541 + 0.9645433842900925im
   0.2626620399732905 + 0.403752505205907im
   0.9124789427431633 + 0.5827481724165073im
 0.022262533405200458 + 0.5952798614747566im
  0.35917171971685624 + 0.27142682386186634im

In [28]:
@btime sum($SoA);

  162.986 ns (0 allocations: 0 bytes)


## `@fastmath`

Enables many floating-point optimizations that are potentially *unsafe*! It trades accuracy for speed, so [beware of fast-math](https://simonbyrne.github.io/notes/fastmath/). (See the [LLVM Language Reference Manual](https://llvm.org/docs/LangRef.html#fast-math-flags) for more information on the fast-math flags it sets.)

### SIMD
Among other things, it **facilitates SIMD vectorization** because it:
* Allows re-association of operands in series of floating-point operations.

In [29]:
function vector_dot_fastmath(B, C)
    a = zero(eltype(B))
    @fastmath for i in eachindex(B,C)
        @inbounds a += B[i] * C[i]
    end
    return a
end

vector_dot_fastmath (generic function with 1 method)

In [30]:
@btime vector_dot_fastmath($B, $C)

  155.066 μs (0 allocations: 0 bytes)


65594.13734335195

In [31]:
@code_native debuginfo=:none syntax=:intel vector_dot_fastmath(B,C)

	.text
	.file	"vector_dot_fastmath"
	.globl	julia_vector_dot_fastmath_4296  # -- Begin function julia_vector_dot_fastmath_4296
	.p2align	4, 0x90
	.type	julia_vector_dot_fastmath_4296,@function
julia_vector_dot_fastmath_4296:         # @julia_vector_dot_fastmath_4296
	.cfi_startproc
# %bb.0:                                # %top
	push	rbp
	.cfi_def_cfa_offset 16
	.cfi_offset rbp, -16
	mov	rbp, rsp
	.cfi_def_cfa_register rbp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	and	rsp, -32
	sub	rsp, 96
	.cfi_offset rbx, -56
	.cfi_offset r12, -48
	.cfi_off

### FMA - Fused Multiply Add

In [32]:
f(a,b,c) = a*b+c

f (generic function with 1 method)

In [33]:
@code_native debuginfo=:none f(1.0,2.0,3.0)

	.text
	.file	"f"
	.globl	julia_f_4329                    # -- Begin function julia_f_4329
	.p2align	4, 0x90
	.type	julia_f_4329,@function
julia_f_4329:                           # @julia_f_4329
	.cfi_startproc
# %bb.0:                                # %top
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	vmulsd	%xmm1, %xmm0, %xmm0
	vaddsd	%xmm2, %xmm0, %xmm0
	popq	%rbp
	.cfi_def_cfa %rsp, 8
	retq
.Lfunc_end0:
	.size	julia_f_4329, .Lfunc_end0-julia_f_4329
	.cfi_endproc
                                        # -- End function
	.section	".note

In [34]:
f_fastmath(a,b,c) = @fastmath a*b+c

f_fastmath (generic function with 1 method)

In [35]:
@code_native debuginfo=:none f_fastmath(1.0,2.0,3.0)

	.text
	.file	"f_fastmath"
	.globl	julia_f_fastmath_4356           # -- Begin function julia_f_fastmath_4356
	.p2align	4, 0x90
	.type	julia_f_fastmath_4356,@function
julia_f_fastmath_4356:                  # @julia_f_fastmath_4356
	.cfi_startproc
# %bb.0:                                # %top
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	vfmadd213sd	%xmm2, %xmm1, %xmm0     # xmm0 = (xmm1 * xmm0) + xmm2
	popq	%rbp
	.cfi_def_cfa %rsp, 8
	retq
.Lfunc_end0:
	.size	julia_f_fastmath_4356, .Lfunc_end0-julia_f_fastmath_4356
	.cfi_endproc
                                        # -- End fu

(In this specific case, the explicit `fma` function or [MuladdMacro.jl](https://github.com/SciML/MuladdMacro.jl) are *safer* alternatives.)
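As a minimal sketch of these alternatives, the snippet below compares the plain expression with Base's `fma` and `muladd` on a value chosen for illustration (`0.1`, which is not exactly representable as a `Float64`). MuladdMacro.jl is not used here, to keep the snippet dependency-free.

```julia
x = 0.1   # stored as the nearest Float64, which is slightly larger than 1/10

# Two roundings: x * 10.0 is rounded to exactly 1.0, so the difference vanishes.
@show x * 10.0 - 1.0

# One rounding: fma evaluates x * 10.0 - 1.0 exactly and rounds once,
# which exposes the representation error of 0.1.
@show fma(x, 10.0, -1.0)

# muladd lets the compiler pick whichever variant is faster on the target CPU,
# so its result may match either of the two values above.
@show muladd(x, 10.0, -1.0)
```

Unlike `@fastmath`, both functions affect only the single multiply-add they are applied to, so no other floating-point semantics change.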

<img src="./imgs/skylake_microarchitecture.png" width=700px>

**Source:** [Intel® 64 and IA-32 Architectures Optimization Reference Manual](https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf)

#### Sidenote: Why doesn't Julia use FMA automatically?

Answer: because it can break math in weird ways.

In [9]:
function f(a,b,c)
    @assert a*b ≥ c
    return sqrt(a*b-c)
end

function f_fma(a,b,c)
    @assert a*b ≥ c
    return sqrt(fma(a,b,-c))
end

a = 1.0 + 0.5^27;
b = 1.0 - 0.5^27;
c = 1.0;

In [10]:
f(a,b,c)

0.0

In [11]:
f_fma(a,b,c)

LoadError: DomainError with -5.551115123125783e-17:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).