<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#The-Vec-type" data-toc-modified-id="The-Vec-type-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>The <code>Vec</code> type</a></span></li><li><span><a href="#Operations-on-Vec-types" data-toc-modified-id="Operations-on-Vec-types-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Operations on <code>Vec</code> types</a></span><ul class="toc-item"><li><span><a href="#Operations-between-Vec-elements" data-toc-modified-id="Operations-between-Vec-elements-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Operations between <code>Vec</code> elements</a></span></li><li><span><a href="#Elementwise-operation-in-Vec-elements" data-toc-modified-id="Elementwise-operation-in-Vec-elements-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Elementwise operation in <code>Vec</code> elements</a></span></li><li><span><a href="#Reduction-operations" data-toc-modified-id="Reduction-operations-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Reduction operations</a></span></li><li><span><a href="#Performance-gain-by-lowering-precission" data-toc-modified-id="Performance-gain-by-lowering-precission-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Performance gain by lowering precission</a></span></li><li><span><a href="#Example-automatic-vectorization" data-toc-modified-id="Example-automatic-vectorization-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Example automatic vectorization</a></span></li></ul></li><li><span><a href="#Reading-and-writting-SIMD-vectors-from-arrays-using-vload-and-vstore" data-toc-modified-id="Reading-and-writting-SIMD-vectors-from-arrays-using-vload-and-vstore-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Reading and writting SIMD vectors from arrays using <code>vload</code> and <code>vstore</code></a></span><ul class="toc-item"><li><span><a href="#Translating-&quot;Array&quot;-code-to-&quot;SIMDVector&quot;-code" data-toc-modified-id="Translating-&quot;Array&quot;-code-to-&quot;SIMDVector&quot;-code-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Translating "Array" code to "SIMDVector" code</a></span></li><li><span><a href="#Arrays-with-number-of-elements-not-divisible-by-simd-witdh" data-toc-modified-id="Arrays-with-number-of-elements-not-divisible-by-simd-witdh-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Arrays with number of elements not divisible by simd witdh</a></span><ul class="toc-item"><li><span><a href="#Another-example" data-toc-modified-id="Another-example-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Another example</a></span></li></ul></li><li><span><a href="#vload-and-vstore-using-indexing-notation" data-toc-modified-id="vload-and-vstore-using-indexing-notation-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>vload and vstore using indexing notation</a></span></li></ul></li><li><span><a href="#Using-VecRange-objects" data-toc-modified-id="Using-VecRange-objects-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Using <code>VecRange</code> objects</a></span></li><li><span><a href="#If-else-statements-in-SIMD" data-toc-modified-id="If-else-statements-in-SIMD-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>If else statements in SIMD</a></span></li></ul></div>

In [1]:
using SIMD
using BenchmarkTools

# The `Vec` type


SIMD vectors are similar to small fixed-size arrays of "simple" types.  These element types are supportedin SIMD.jl are:

```
- Bool 
- Int{8,16,32,64,128} 
- UInt{8,16,32,64,128} 
- Float{16,32,64}
```


We can create vector types (or `SIMD.Vec` types) as follows:

```
my_vec = Vec(a_tuple)
```

Notice that `eltype(a_tuple)` has to be in 

`[Bool, Int8, Int16, Int32, Int64, Int128, UInt8, UInt16, UInt32, UInt128, Float16, Float32, Float64]`

##### Examples:

```
a1_v = Vec(1,2,3,4,5,6,7,8)
a2_v = Vec(9,10,11,12,13,14,15,16,17,18)
```

##### Breaking examples:

```
a = Vec(("the","house")) # strings are not in the set of possible element types for a Vec
a = Vec((1,23.231))      # All elements in the tuple constructing the Vec need to be of the same type
```

In [2]:
a1_v = Vec((1,2,3,4,5,6,7,8))
a2_v = Vec((9,10,11,12,13,14,15,16))

<8 x Int64>[9, 10, 11, 12, 13, 14, 15, 16]

We can operate with the given vectors as if they were arrays

In [3]:
res_v = a1_v + a2_v

<8 x Int64>[10, 12, 14, 16, 18, 20, 22, 24]

In [4]:
a1_a = [1,2,3,4,5,6,7,8]
a2_a = [9,10,11,12,13,14,15,16];

In [5]:
res_a = a1_a + a2_a

8-element Array{Int64,1}:
 10
 12
 14
 16
 18
 20
 22
 24

Using operations between Vec types will be faster than the coutnerpart Array operations

In [6]:
@btime res_a = a1_a + a2_a

  53.623 ns (1 allocation: 144 bytes)


8-element Array{Int64,1}:
 10
 12
 14
 16
 18
 20
 22
 24

In [7]:
@btime res_v = a1_v + a2_v

  23.833 ns (1 allocation: 80 bytes)


<8 x Int64>[10, 12, 14, 16, 18, 20, 22, 24]

The main difference of using standard vectors and Vec types is the fact that Vec types can be easily mapped to instructions that operate on the whole Vec. 


In [8]:
@code_native a1_v + a2_v

	.text
; ┌ @ SIMD.jl:996 within `+'
; │┌ @ SIMD.jl:583 within `llvmwrap' @ SIMD.jl:583
; ││┌ @ SIMD.jl:996 within `macro expansion'
	vmovdqu	(%rdx), %ymm0
	vmovdqu	32(%rdx), %ymm1
	vpaddq	(%rsi), %ymm0, %ymm0
	vpaddq	32(%rsi), %ymm1, %ymm1
; │└└
	vmovdqa	%ymm1, 32(%rdi)
	vmovdqa	%ymm0, (%rdi)
	movq	%rdi, %rax
	vzeroupper
	retq
	nopw	%cs:(%rax,%rax)
; └


This does not happen in standard arrays (at least not as easily, the compiler could be smart enough to do it).

In [9]:
@code_native a1_a + a2_a

	.text
; ┌ @ arraymath.jl:44 within `+'
	pushq	%r14
	pushq	%rbx
	subq	$56, %rsp
	movq	%rsi, %r8
	vxorps	%xmm0, %xmm0, %xmm0
	vmovaps	%xmm0, (%rsp)
	movq	$0, 16(%rsp)
	movq	%r8, 48(%rsp)
	movq	%fs:0, %rcx
	movq	$2, (%rsp)
	movq	-15552(%rcx), %rsi
	movq	%rsi, 8(%rsp)
	movq	%rsp, %rsi
	movq	%rsi, -15552(%rcx)
	leaq	-15552(%rcx), %r14
	movq	(%r8), %rcx
; │┌ @ tuple.jl:43 within `iterate' @ tuple.jl:43
; ││┌ @ tuple.jl:24 within `getindex'
	movq	8(%r8), %rsi
; │└└
; │ @ arraymath.jl:45 within `+'
; │┌ @ indices.jl:145 within `promote_shape'
; ││┌ @ abstractarray.jl:75 within `axes'
; │││┌ @ array.jl:155 within `size'
	movq	24(%rcx), %rbx
; │││└
; │││┌ @ tuple.jl:165 within `map'
; ││││┌ @ range.jl:317 within `Type' @ range.jl:308
; │││││┌ @ promotion.jl:414 within `max'
	movq	%rbx, %rdi
	sarq	$63, %rdi
	andnq	%rbx, %rdi, %rdi
; ││└└└└
; ││┌ @ array.jl:155 within `axes'
	movq	24(%rsi), %rax
; ││└
; ││┌ @ abstractarray.jl:75 within `axes'
; │││┌ @ tuple.jl:165 within `map'
; ││││┌ @ range.jl:


# Operations on `Vec` types


## Operations between `Vec` elements


Let $Vec\{N,T\}$ be a set where a `Vec` can be instanciated. For example `Vec{8,Int32}`.

Then a vector operation is a function $Vec\{N,T\} \times Vec\{N,T\}  \longrightarrow Vec\{N,T\}$. We have the following methods in SIMD.jl :


```
+ - * / % ^ ! ~ & | $ << >> >>> == != < <= > >=
```


In [10]:
x = Vec((1.,2.,3.,4.))
y = Vec((1.,2.,3.,9.))

<4 x Float64>[1.0, 2.0, 3.0, 9.0]

In [11]:
operations = Symbol.([ +, -, *,  /,  %, ^, !, ~, &, |])

10-element Array{Symbol,1}:
 :+  
 :-  
 :*  
 :/  
 :rem
 :^  
 :!  
 :~  
 :&  
 :|  

In [12]:
for op in operations:
    println(op, "x $op y", eval(exp(x op y)))

LoadError: syntax: line break in ":" expression

In [13]:
operations = Symbol.([ +, -, *,  /,  %, ^, !, ~, &, |, $, <<, >>, >>>, ==, !=, <, <=, >, >=])

UndefVarError: UndefVarError: $ not defined


## Elementwise operation in `Vec` elements

An elementwise operation in a vec element is a function $Vec\{N,T\} \longrightarrow Vec\{N,T\}$

The following operations are available:

```
abs cbrt ceil copysign cos div exp exp10 exp2 flipsign floor fma inv isfinite isinf isnan issubnormal log log10 log2 muladd rem round sign signbit sin sqrt trunc vifelse
```

In [14]:
x = Vec((1.,2.,3.,4.))

<4 x Float64>[1.0, 2.0, 3.0, 4.0]

In [15]:
exp(x)

<4 x Float64>[2.718281828459045, 7.38905609893065, 20.085536923187668, 54.598150033144236]

In [16]:
ceil(x)

<4 x Float64>[1.0, 2.0, 3.0, 4.0]



## Reduction operations

A reduction operation is a function of the form $Vec\{N,T\}  \longrightarrow T$.

Therefore reductions "reduce" a SIMD vector to a scalar. The following reduction operations are provided:

```
all any maximum minimum sum prod
```

In [17]:
x = Vec((1.,2.,3.,4.))
sum(x)

10.0

In [18]:
x = Vec((1,2,3,4))
sum(x)

10

## Performance gain by lowering precission

The speed difference can be magnified if we do more operations than a single sum:

In [19]:
function sum_vec(a1,a2,b1,b2)
    res1 = a1 + a2
    res2 = b1 + b2
    return res1, res2
end

sum_vec (generic function with 1 method)

In [20]:
@btime sum_vec(a1_a,a2_a,a1_a,a2_a)

  99.499 ns (3 allocations: 320 bytes)


([10, 12, 14, 16, 18, 20, 22, 24], [10, 12, 14, 16, 18, 20, 22, 24])

In [21]:
@btime sum_vec(a1_v,a2_v,a1_v,a2_v)

  29.503 ns (1 allocation: 144 bytes)


(<8 x Int64>[10, 12, 14, 16, 18, 20, 22, 24], <8 x Int64>[10, 12, 14, 16, 18, 20, 22, 24])

In [22]:
function sum_vec_4(a1,a2,b1,b2)
    res1 = a1 + a2
    res2 = b1 + b2
    res3 = a1 + b1
    res4 = a2 + b2

    return res1, res2,res3, res4
end

sum_vec_4 (generic function with 1 method)

In [23]:
@btime sum_vec_4(a1_v,a2_v,a1_v,a2_v)

  40.785 ns (1 allocation: 272 bytes)


(<8 x Int64>[10, 12, 14, 16, 18, 20, 22, 24], <8 x Int64>[10, 12, 14, 16, 18, 20, 22, 24], <8 x Int64>[2, 4, 6, 8, 10, 12, 14, 16], <8 x Int64>[18, 20, 22, 24, 26, 28, 30, 32])

In [24]:
@btime sum_vec_4(a1_a,a2_a,a1_a,a2_a)

  172.576 ns (5 allocations: 624 bytes)


([10, 12, 14, 16, 18, 20, 22, 24], [10, 12, 14, 16, 18, 20, 22, 24], [2, 4, 6, 8, 10, 12, 14, 16], [18, 20, 22, 24, 26, 28, 30, 32])

Notice that the speed gain between the prevoius two calls is, at most, 4X.

We can increase the difference in speed using an integer type with less bits. For example Int32.

In [25]:
a1_i8_v = Vec(tuple(Int32.(a1_a)...))
a2_i8_v = Vec(tuple(Int32.(a2_a)...))

<8 x Int32>[9, 10, 11, 12, 13, 14, 15, 16]

In [26]:
@btime sum_vec_4(a1_i8_v,a2_i8_v,a1_i8_v,a2_i8_v)

  33.669 ns (1 allocation: 144 bytes)


(<8 x Int32>[10, 12, 14, 16, 18, 20, 22, 24], <8 x Int32>[10, 12, 14, 16, 18, 20, 22, 24], <8 x Int32>[2, 4, 6, 8, 10, 12, 14, 16], <8 x Int32>[18, 20, 22, 24, 26, 28, 30, 32])

## Example automatic vectorization

In [27]:
function c_a_times_b!(c::Array, a::Array, b::Array)
    @assert length(a) == length(b) == length(c)
    @inbounds for i in 1:length(a)
        c[i] = a[i] * b[i]
    end
end

c_a_times_b! (generic function with 1 method)

In [28]:
 V64 = Vector{Float64}

Array{Float64,1}

Inspecting `code_llvm` we can see

```
 ┌ @ float.jl:399 within `*'
   %55 = fmul <4 x double> %wide.load, %wide.load24
   %56 = fmul <4 x double> %wide.load21, %wide.load25
   %57 = fmul <4 x double> %wide.load22, %wide.load26
   %58 = fmul <4 x double> %wide.load23, %wide.load27
```

In [29]:
code_llvm(c_a_times_b!, Tuple{V64, V64, V64})


;  @ In[27]:2 within `c_a_times_b!'
define nonnull %jl_value_t addrspace(10)* @"japi1_c_a_times_b!_13563"(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) #0 {
top:
  %gcframe = alloca %jl_value_t addrspace(10)*, i32 3
  %3 = bitcast %jl_value_t addrspace(10)** %gcframe to i8*
  call void @llvm.memset.p0i8.i32(i8* %3, i8 0, i32 24, i32 0, i1 false)
  %4 = alloca %jl_value_t addrspace(10)**, align 8
  store volatile %jl_value_t addrspace(10)** %1, %jl_value_t addrspace(10)*** %4, align 8
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"()
  %ptls_i8 = getelementptr i8, i8* %thread_ptr, i64 -15552
  %ptls = bitcast i8* %ptls_i8 to %jl_value_t***
  %5 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 0
  %6 = bitcast %jl_value_t addrspace(10)** %5 to i64*
  store i64 2, i64* %6
  %7 = getelementptr %jl_value_t**, %jl_value_t*** %ptls, i32 0
  %8 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 1
  %9 

Inspecting `code_native`

```
	vmovupd	(%r10,%rcx,8), %ymm0
	vmovupd	32(%r10,%rcx,8), %ymm1
	vmovupd	64(%r10,%rcx,8), %ymm2
	vmovupd	96(%r10,%rcx,8), %ymm3
```

In [30]:
code_native(c_a_times_b!, Tuple{V64, V64, V64})

	.text
; ┌ @ In[27]:2 within `c_a_times_b!'
	pushq	%rbx
	subq	$32, %rsp
	vxorpd	%xmm0, %xmm0, %xmm0
	vmovapd	%xmm0, (%rsp)
	movq	$0, 16(%rsp)
	movq	%rsi, 24(%rsp)
	movq	%fs:0, %rax
	movq	$2, (%rsp)
	movq	-15552(%rax), %rcx
	movq	%rcx, 8(%rsp)
	movq	%rsp, %rcx
	movq	%rcx, -15552(%rax)
	leaq	-15552(%rax), %rdi
	movq	8(%rsi), %rax
	movq	16(%rsi), %rcx
; │┌ @ array.jl:199 within `length'
	movq	8(%rcx), %r8
; │└
; │┌ @ promotion.jl:403 within `=='
	cmpq	%r8, 8(%rax)
; │└
	jne	L368
	movq	(%rsi), %rsi
; │ @ In[27]:2 within `c_a_times_b!'
; │┌ @ promotion.jl:403 within `=='
	cmpq	8(%rsi), %r8
; │└
	jne	L368
; │ @ In[27]:3 within `c_a_times_b!'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:274 within `Type'
; │││┌ @ range.jl:279 within `unitrange_last'
; ││││┌ @ operators.jl:333 within `>='
; │││││┌ @ int.jl:428 within `<='
	testq	%r8, %r8
; │└└└└└
	jle	L169
	movq	(%rax), %r10
	movq	(%rcx), %rdx
	movq	(%rsi), %rsi
; │ @ In[27]:3 within `c_a_times_b!'
	cmpq	$16, %r8
	jae	L196
	movl	$1, %eax


# Reading and writting SIMD vectors from arrays using `vload` and `vstore`


- `vload`  reads to a SIMD vector from an array (reading from several contiguous array elements).
- `vstore` writes to an array from a SIMD vector (writting to consecutive positions of an array) .

Given an array `a` we can get a slice of size `N` as follows

In [31]:
T = Float64
N = 4
a = rand(T, 10_000);
v_type = Vec{N, T}

Vec{4,Float64}

Notice that `Vec{N,T}` is not a vector, is the type of a vector (that is why we named it `v_type`).

In [32]:
typeof(v_type)

DataType

When we use the `vload(v_type,a,i)` function we actually generate a vector of type `v_type`

In [33]:
i = 1
v = vload(v_type, a, i)

<4 x Float64>[0.9357647892861083, 0.8605442359943398, 0.6671121286441688, 0.4117393532614202]

In [34]:
typeof(v)

Vec{4,Float64}

In this case `v` contains the information from `a[1:4]` because `i=1`.

More generally, **`vload(v_type, a, i)` will contain the same information as `a[i:(i+N-1)]`**

In [35]:
for i in 1:N
    println( "v[$i] == a[$i] is ", v[i] == a[i])
end

v[1] == a[1] is true
v[2] == a[2] is true
v[3] == a[3] is true
v[4] == a[4] is true


## Translating "Array" code to "SIMDVector" code 

What would happen if we want to translate

```
T = Float64
a = rand(T, 1000)
b = rand(T, 1000)
c = zeros(T, 1000)

c_a_times_b!(c,a,b)
```

Using SIMD instructions ?



In [36]:
T = Float64
a = rand(T, 10_000)
b = rand(T, 100_00)
c = zeros(T, 100_00)

@btime c_a_times_b!($c,$a,$b)

  5.616 μs (0 allocations: 0 bytes)


Now let us convert the method `c_a_times_b!` manually to use SIMD instructions, to do so we need to decompose our problem into subproblems of size `SIMD_WIDTH`. 

Now we can assume, without any loss in generality, that `SIMD_WIDTH=4`. We will cover more details later on.

##### Example of `c_a_times_b!(c[1:8], a[1:8], b[1:8])`

Now what can we do to, for example, sum the first 8 positions of `a` to the first 8 positions of `b` and then save the results into `c` ?

In [37]:
c = zeros(8)
i = 1
c_chunk = vload(v_type, a, i) * vload(v_type, b, i)
vstore(c_chunk,c,i)
i = 1 + N
c_chunk = vload(v_type, a, i) * vload(v_type, b, i)
vstore(c_chunk,c,i)
c

8-element Array{Float64,1}:
 0.5016238777443504  
 0.18793853104134334 
 0.009747989525473767
 0.14699073291223802 
 0.3365845204622229  
 0.20695896603027628 
 0.18520153561321254 
 0.19679010405495975 

This is equivalent to...

In [38]:
c = zeros(8)
c_a_times_b!(c, a[1:8], b[1:8])
c

8-element Array{Float64,1}:
 0.5016238777443504  
 0.18793853104134334 
 0.009747989525473767
 0.14699073291223802 
 0.3365845204622229  
 0.20695896603027628 
 0.18520153561321254 
 0.19679010405495975 

Now, instead of doing this process twice, we can do it multiple times.
This means that we can instead of iterating over all positions in the array we can do the operation as follows

In [39]:
function c_a_times_b_SIMD!(c::Array{T}, a::Array{T}, b::Array, v_type::Type{Vec{N,T}}) where {N, T}
    #@assert length(a) == length(b) == length(c)
    
    @inbounds for i in 1:N:length(a)
        a_chunk = vload(v_type, a, i) 
        b_chunk = vload(v_type, b, i) 
        a_chunk *=  b_chunk
        vstore(a_chunk,c,i)
    end
end

c_a_times_b_SIMD! (generic function with 1 method)

In [40]:
@btime c_a_times_b_SIMD!($c,$a,$b,Vec{4,Float64})

BoundsError: BoundsError

We could aso have the information of the SIMD width inside the function

In [41]:

function c_a_times_b_SIMD_2!(c::Array{T}, a::Array{T}, b::Array{T})
    #@assert length(a) == length(b) == length(c)
    N = 4
    @inbounds for i in 1:N:length(a)
        a_chunk = vload(Vec{N,T}, a, i) 
        b_chunk = vload(Vec{N,T}, b, i) 
        a_chunk *=  b_chunk
        vstore(a_chunk,c,i)
    end
end

c_a_times_b_SIMD_2! (generic function with 1 method)

In [42]:
@btime c_a_times_b_SIMD_2!($c,$a,$b)

BoundsError: BoundsError

In fact we could have the element type of the arrays and the SIMD width inside the function with the 
        same performance

In [43]:
function c_a_times_b_SIMD_3!(c, a, b)
    N = 4
    T = eltype(c)
    @inbounds for i in 1:N:length(a)
        a_chunk = vload(Vec{N,T}, a, i) 
        b_chunk = vload(Vec{N,T}, b, i) 
        a_chunk *=  b_chunk
        vstore(a_chunk,c,i)
    end
end

c_a_times_b_SIMD_3! (generic function with 1 method)

In [44]:
@btime c_a_times_b_SIMD_3!($c,$a,$b)

BoundsError: BoundsError

## Arrays with number of elements not divisible by simd witdh

What happens now if we use `c_a_times_b_SIMD_3!` with an array of **1010** elements?

Our code assumed `length(a)%N==0` because if this is not true then the code crashes.



In [45]:
T = Float64
n_elements = 1010
SIMD_WIDTH = 4
a = rand(T, n_elements)
b = rand(T, n_elements)
c = zeros(T, n_elements);

In [46]:
c_a_times_b_SIMD_3!(c,a,b)

BoundsError: BoundsError

To solve this problem we can do the following: 

Split the problem into two subproblems.
The first subproblem has a number of elements that is divisible by `N`, we can proceed as we did before.
THe second subproblem has a number of elements that is not divisible by `N`, we can do those sequentially.

Since `mod(1010,4) = 2` that means we can do the first `1008` elements using SIMD vectors and the remaining 2 using scalar operations.

In general:

```
    n_remaining = mod(n_elements, N)
    n_first = n_elements - n_remaining
```


In [47]:
n_remaining = mod(1010,4)
n_first = n_elements -n_remaining
n_remaining, n_first

(2, 1008)

In [75]:
function c_a_times_b_SIMD_4!(c, a, b)
    N = 4
    T = eltype(c)
    n_elements = length(a)
    n_remaining = mod(n_elements, N)
    n_first = n_elements - n_remaining
    
    for i in 1:N:length(n_first)
        a_chunk = vload(Vec{N,T}, a, i) 
        b_chunk = vload(Vec{N,T}, b, i) 
        a_chunk *=  b_chunk
        vstore(a_chunk,c,i)
    end
    
    for i in n_first:n_elements
        c[i] = a[i]*b[i]
    end
end

c_a_times_b_SIMD_4! (generic function with 1 method)

In [76]:
c_a_times_b_SIMD_4!(c,a,b)

In [78]:
aux = zero(a)
c_a_times_b!(aux,a,b)
isapprox(aux,c)

true

In [80]:
@btime c_a_times_b_SIMD_4!($c, $a, $b)

  18.109 ns (0 allocations: 0 bytes)


In [81]:
@btime c_a_times_b!($c, $a, $b)

  4.864 μs (0 allocations: 0 bytes)


In [82]:
@btime c_a_times_b_SIMD_3!(c,a,b)

  5.340 μs (0 allocations: 0 bytes)


In [63]:
T = Float64
n_elements = 10_000
SIMD_WIDTH = 4
a = rand(T, n_elements)
b = rand(T, n_elements)
c = zeros(T, n_elements);

In [64]:
@btime c_a_times_b_SIMD_4!($c, $a, $b)

  20.807 ns (0 allocations: 0 bytes)


In [65]:
@btime c_a_times_b!($c, $a, $b)

  5.050 μs (0 allocations: 0 bytes)


### Another example

In [None]:
using SIMD
using BenchmarkTools

x1 = rand(Float64, 64)
x2 = rand(Float64, 64)
y = similar(x1)

function add!(y, x1,x2)
    @inbounds for i=1:length(x1)
        y[i] = x1[i] + x2[i] 
    end
end

function vadd!(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, vec::Type{Vec{N,T}}=Vec{8,T}) 
    @inbounds for i in 1:N:length(xs)
        xv = vload(Vec{N,T}, xs, i)
        yv = vload(Vec{N,T}, ys, i)
        xv += yv 
        vstore(xv, y, i)
    end
end


## vload and vstore using indexing notation

In [None]:
function c_a_times_b_SIMD_2!(c::Array, a::Array, b::Array, N::Int)
    @assert length(a) == length(b) == length(c)
    
    T      = eltype(c)
    v_type = Vec{N, T}
    a_chunk = zero(Vec{4,Float64})
    b_chunk = zero(Vec{4,Float64})
    
    for i in 1:N:length(a)
        c_chunk = vload(v_type, a, i) * vload(v_type, b, i)
        vstore(c_chunk,c,i)
    end
end

In [None]:
@btime c_a_times_b_SIMD_2!($c,$a,$b,4)

# Using `VecRange` objects

In [None]:
using SIMD
function vadd!(xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}}) where {N, T}
    @assert length(ys) == length(xs)
    @assert length(xs) % N == 0
    lane = VecRange{N}(0)
    @inbounds for i in 1:N:length(xs)
        xs[lane + i] += ys[lane + i]
    end
end

In [None]:
x = rand(Float32, 1_000_000);
y = rand(Float32, 1_000_000);

In [None]:
@btime vadd!($x,$y,Vec{32,Float32})

In [None]:
@btime $x .+= $y;

# If else statements in SIMD 

In [None]:
function myfunc(a, b)
    if a > b
        return a - b
    else
        return a + b
    end
end
x = rand(1_000_000);
# do myfunc.(x, 2.) with explicit simd calls
myfunc.(x, 2.);

In [None]:
function myfunc_simd(x::Vector{T}, value::T, ::Type{Vec{N,T}}) where {N, T}
           @assert length(x) % N == 0
           result = Array{T}(undef, length(x))
           lane   = VecRange{N}(0)
           @inbounds for i in 1:N:length(x)        
               x_vslice    = vload(Vec{N, T}, x, i) # i = 2*k+1 where k=1,2,3,4,...
               result[lane + i] = vifelse(x_vslice > 2, x_vslice - value, x_vslice + value)
           end
           return result
       end

In [None]:
x = rand(Float32,1_000_000);

In [None]:
result_1 = myfunc.(x,1);
result_2 = myfunc_simd(x, Float32(1), Vec{8,Float32});
result_1 == result_2

In [None]:
@btime myfunc.(x,1);

In [None]:
@btime myfunc_simd(x, Float32(1), Vec{8,Float32});