## SIMD Vectorization
Does my code vectorize? Let's take a look.

In [1]:
function add(out, x, y)
    for i in 1:length(out)
        out[i] = x[i] + y[i]
    end
    return out
end

add (generic function with 1 method)

In [2]:
@code_llvm add(Vector{Float64}, Vector{Float64}, Vector{Float64})


; Function add
; Location: In[1]:2
; Function Attrs: noreturn
define nonnull %jl_value_t addrspace(10)* @japi1_add_35688(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) #0 {
top:
  %3 = alloca %jl_value_t addrspace(10)*, i32 2
  %4 = alloca %jl_value_t addrspace(10)**, align 8
  store volatile %jl_value_t addrspace(10)** %1, %jl_value_t addrspace(10)*** %4, align 8
  %5 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %3, i32 0
  store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4625236400 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %5
  %6 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %3, i32 1
  store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4587922512 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %6
  %7 = call nonnull %jl_value_t addrspace(10)* @jl_apply_generic(%jl_value_t addrspace(10)** %3, i32 2)
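  ; (remaining output truncated)

Note that this IR is not the loop. `@code_llvm` was given types instead of values, so it shows `add` compiled for arguments of type `Type{Vector{Float64}}`. The `noreturn` attribute and the lone `jl_apply_generic` call indicate a dynamic call that is expected to throw, not an element-wise addition. Pass actual arrays to see the IR for the loop.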
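To inspect the IR of the method that actually runs on arrays, `@code_llvm` needs a call with real values, while the function form `code_llvm` takes a tuple of argument types. The following is a minimal sketch, assuming Julia 1.x; the array length 4 is arbitrary and the printed IR depends on the Julia version and CPU, so no output is shown.

```julia
using InteractiveUtils  # provides @code_llvm and code_llvm outside the REPL/notebook

function add(out, x, y)
    for i in 1:length(out)
        out[i] = x[i] + y[i]
    end
    return out
end

# Macro form: pass values; the argument types are taken from them.
@code_llvm add(zeros(4), zeros(4), zeros(4))

# Function form: pass the argument types as a tuple.
code_llvm(add, (Vector{Float64}, Vector{Float64}, Vector{Float64}))
```

Both forms print the IR for the same `Vector{Float64}` method, so either can be used to check for bounds-check branches in the loop body.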

## @inbounds
Adding `@inbounds` removes the bounds checks on the array accesses, which gives LLVM the opportunity to auto-vectorize the loop. It is only safe when `x` and `y` are at least as long as `out`.

In [6]:
function add2(out, x, y)
    @inbounds for i in 1:length(out)
        out[i] = x[i] + y[i]
    end
    return out
end

add2 (generic function with 1 method)

In [7]:
@code_llvm add2(Vector{Float64}, Vector{Float64}, Vector{Float64})


; Function add2
; Location: In[6]:2
; Function Attrs: noreturn
define nonnull %jl_value_t addrspace(10)* @japi1_add2_35743(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) #0 {
top:
  %3 = alloca %jl_value_t addrspace(10)*, i32 2
  %4 = alloca %jl_value_t addrspace(10)**, align 8
  store volatile %jl_value_t addrspace(10)** %1, %jl_value_t addrspace(10)*** %4, align 8
  %5 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %3, i32 0
  store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4625236400 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %5
  %6 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %3, i32 1
  store %jl_value_t addrspace(10)* addrspacecast (%jl_value_t* inttoptr (i64 4587922512 to %jl_value_t*) to %jl_value_t addrspace(10)*), %jl_value_t addrspace(10)** %6
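  %7 = call nonnull %jl_value_t addrspace(10)* @jl_apply_generic(%jl_value_t addrspace(10)** %3, i32 2)
  ; (remaining output truncated)

As in the first listing, this is the IR for type arguments rather than arrays, so it does not show the loop. With array arguments, a vectorized loop appears in the IR as operations on LLVM vector types such as `<4 x double>` (the width depends on the CPU).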
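One way to check for vectorization without reading the whole listing is to capture the IR as a string and search it for LLVM vector types. This is a rough sketch, assuming Julia 1.x; the regular expression and the `vectorized` variable are illustrative choices, and the result depends on the Julia version and the CPU.

```julia
using InteractiveUtils

function add2(out, x, y)
    @inbounds for i in 1:length(out)
        out[i] = x[i] + y[i]
    end
    return out
end

# Capture the IR as a string instead of printing it.
ir = sprint(code_llvm, add2, (Vector{Float64}, Vector{Float64}, Vector{Float64}))

# Vectorized code operates on LLVM vector types such as <2 x double> or <4 x double>.
vectorized = occursin(r"<\d+ x double>", ir)
println("vector instructions found: ", vectorized)
```

Running the same check on the version without `@inbounds` gives a quick comparison of the two loops.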
 

## SIMD.jl
Another option is to write explicit SIMD instructions. The [SIMD.jl](https://github.com/eschnett/SIMD.jl) library provides the vector data types for this, such as `Vec{N,T}`.
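As an illustration, the element-wise addition from above could be written with explicit vector loads and stores. This is a sketch only, assuming the SIMD.jl package is installed; `add_simd!` is a hypothetical helper name, and the array length is required to be a multiple of the vector width `N` to avoid a scalar remainder loop.

```julia
using SIMD  # external package: https://github.com/eschnett/SIMD.jl

# Hypothetical helper: out = x + y, processing N elements per iteration.
function add_simd!(out::Vector{T}, x::Vector{T}, y::Vector{T}, ::Type{Vec{N,T}}) where {N,T}
    @assert length(out) == length(x) == length(y)
    @assert length(out) % N == 0   # sketch only: no scalar remainder loop
    @inbounds for i in 1:N:length(out)
        xv = vload(Vec{N,T}, x, i)   # load N consecutive elements starting at index i
        yv = vload(Vec{N,T}, y, i)
        vstore(xv + yv, out, i)      # store the N sums back into out
    end
    return out
end

x = rand(8); y = rand(8); out = similar(x)
add_simd!(out, x, y, Vec{4,Float64})
```

The vector width is passed as a type argument so that `N` is a compile-time constant; `Vec{4,Float64}` here is an arbitrary choice and not necessarily the best width for a given CPU.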

## Additionally:
Look [here](https://slides.com/valentinchuravy/julia-parallelism) for a lecture about levels of parallelism in Julia.

Syntactic loop fusion (dot-call broadcasting) is discussed [here](https://julialang.org/blog/2017/01/moredots).
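For comparison with the hand-written loops, here is a small sketch of what fused dot syntax looks like for the same addition; the variable names and the array length are arbitrary.

```julia
x = rand(8); y = rand(8); out = similar(x)

# Fused, in-place broadcast: one loop over the elements, no temporary array.
out .= x .+ y

# Longer dotted expressions still fuse into a single loop.
out .= 2 .* x .+ y
```

The `.=` assignment writes into the existing `out` array, which mirrors the preallocated `out` argument of `add`.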