<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Reading-and-writting-SIMD-vectors-from-arrays-using-vload-and-vstore" data-toc-modified-id="Reading-and-writting-SIMD-vectors-from-arrays-using-vload-and-vstore-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Reading and writting SIMD vectors from arrays using <code>vload</code> and <code>vstore</code></a></span><ul class="toc-item"><li><span><a href="#Translating-&quot;Array&quot;-code-to-&quot;SIMDVector&quot;-code" data-toc-modified-id="Translating-&quot;Array&quot;-code-to-&quot;SIMDVector&quot;-code-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Translating "Array" code to "SIMDVector" code</a></span></li><li><span><a href="#Arrays-with-number-of-elements-not-divisible-by-simd-witdh" data-toc-modified-id="Arrays-with-number-of-elements-not-divisible-by-simd-witdh-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Arrays with number of elements not divisible by simd witdh</a></span><ul class="toc-item"><li><span><a href="#Another-example" data-toc-modified-id="Another-example-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Another example</a></span></li></ul></li><li><span><a href="#vload-and-vstore-using-indexing-notation" data-toc-modified-id="vload-and-vstore-using-indexing-notation-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>vload and vstore using indexing notation</a></span></li></ul></li><li><span><a href="#Using-VecRange-objects" data-toc-modified-id="Using-VecRange-objects-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Using <code>VecRange</code> objects</a></span></li><li><span><a href="#If-else-statements-in-SIMD" data-toc-modified-id="If-else-statements-in-SIMD-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>If else statements in SIMD</a></span></li></ul></div>

In [2]:
using SIMD
using BenchmarkTools

# Reading and writting SIMD vectors from arrays using `vload` and `vstore`


- `vload`  reads to a SIMD vector from an array (reading from several contiguous array elements).
- `vstore` writes to an array from a SIMD vector (writting to consecutive positions of an array) .

Given an array `a` we can get a slice of size `N` as follows

In [31]:
T = Float64
N = 4
a = rand(T, 10_000);
v_type = Vec{N, T}

Vec{4,Float64}

Notice that `Vec{N,T}` is not a vector, is the type of a vector (that is why we named it `v_type`).

In [32]:
typeof(v_type)

DataType

When we use the `vload(v_type,a,i)` function we actually generate a vector of type `v_type`

In [33]:
i = 1
v = vload(v_type, a, i)

<4 x Float64>[0.9357647892861083, 0.8605442359943398, 0.6671121286441688, 0.4117393532614202]

In [34]:
typeof(v)

Vec{4,Float64}

In this case `v` contains the information from `a[1:4]` because `i=1`.

More generally, **`vload(v_type, a, i)` will contain the same information as `a[i:(i+N-1)]`**

In [35]:
for i in 1:N
    println( "v[$i] == a[$i] is ", v[i] == a[i])
end

v[1] == a[1] is true
v[2] == a[2] is true
v[3] == a[3] is true
v[4] == a[4] is true


## Translating "Array" code to "SIMDVector" code 

What would happen if we want to translate

```
T = Float64
a = rand(T, 1000)
b = rand(T, 1000)
c = zeros(T, 1000)

c_a_times_b!(c,a,b)
```

Using SIMD instructions ?



In [36]:
T = Float64
a = rand(T, 10_000)
b = rand(T, 100_00)
c = zeros(T, 100_00)

@btime c_a_times_b!($c,$a,$b)

  5.616 μs (0 allocations: 0 bytes)


Now let us convert the method `c_a_times_b!` manually to use SIMD instructions, to do so we need to decompose our problem into subproblems of size `SIMD_WIDTH`. 

Now we can assume, without any loss in generality, that `SIMD_WIDTH=4`. We will cover more details later on.

##### Example of `c_a_times_b!(c[1:8], a[1:8], b[1:8])`

Now what can we do to, for example, sum the first 8 positions of `a` to the first 8 positions of `b` and then save the results into `c` ?

In [37]:
c = zeros(8)
i = 1
c_chunk = vload(v_type, a, i) * vload(v_type, b, i)
vstore(c_chunk,c,i)
i = 1 + N
c_chunk = vload(v_type, a, i) * vload(v_type, b, i)
vstore(c_chunk,c,i)
c

8-element Array{Float64,1}:
 0.5016238777443504  
 0.18793853104134334 
 0.009747989525473767
 0.14699073291223802 
 0.3365845204622229  
 0.20695896603027628 
 0.18520153561321254 
 0.19679010405495975 

This is equivalent to...

In [38]:
c = zeros(8)
c_a_times_b!(c, a[1:8], b[1:8])
c

8-element Array{Float64,1}:
 0.5016238777443504  
 0.18793853104134334 
 0.009747989525473767
 0.14699073291223802 
 0.3365845204622229  
 0.20695896603027628 
 0.18520153561321254 
 0.19679010405495975 

Now, instead of doing this process twice, we can do it multiple times.
This means that we can instead of iterating over all positions in the array we can do the operation as follows

In [39]:
function c_a_times_b_SIMD!(c::Array{T}, a::Array{T}, b::Array, v_type::Type{Vec{N,T}}) where {N, T}
    #@assert length(a) == length(b) == length(c)
    
    @inbounds for i in 1:N:length(a)
        a_chunk = vload(v_type, a, i) 
        b_chunk = vload(v_type, b, i) 
        a_chunk *=  b_chunk
        vstore(a_chunk,c,i)
    end
end

c_a_times_b_SIMD! (generic function with 1 method)

In [40]:
@btime c_a_times_b_SIMD!($c,$a,$b,Vec{4,Float64})

BoundsError: BoundsError

We could aso have the information of the SIMD width inside the function

In [41]:

function c_a_times_b_SIMD_2!(c::Array{T}, a::Array{T}, b::Array{T})
    #@assert length(a) == length(b) == length(c)
    N = 4
    @inbounds for i in 1:N:length(a)
        a_chunk = vload(Vec{N,T}, a, i) 
        b_chunk = vload(Vec{N,T}, b, i) 
        a_chunk *=  b_chunk
        vstore(a_chunk,c,i)
    end
end

c_a_times_b_SIMD_2! (generic function with 1 method)

In [42]:
@btime c_a_times_b_SIMD_2!($c,$a,$b)

BoundsError: BoundsError

In fact we could have the element type of the arrays and the SIMD width inside the function with the 
        same performance

In [43]:
function c_a_times_b_SIMD_3!(c, a, b)
    N = 4
    T = eltype(c)
    @inbounds for i in 1:N:length(a)
        a_chunk = vload(Vec{N,T}, a, i) 
        b_chunk = vload(Vec{N,T}, b, i) 
        a_chunk *=  b_chunk
        vstore(a_chunk,c,i)
    end
end

c_a_times_b_SIMD_3! (generic function with 1 method)

In [44]:
@btime c_a_times_b_SIMD_3!($c,$a,$b)

BoundsError: BoundsError

## Arrays with number of elements not divisible by simd witdh

What happens now if we use `c_a_times_b_SIMD_3!` with an array of **1010** elements?

Our code assumed `length(a)%N==0` because if this is not true then the code crashes.



In [45]:
T = Float64
n_elements = 1010
SIMD_WIDTH = 4
a = rand(T, n_elements)
b = rand(T, n_elements)
c = zeros(T, n_elements);

In [46]:
c_a_times_b_SIMD_3!(c,a,b)

BoundsError: BoundsError

To solve this problem we can do the following: 

Split the problem into two subproblems.
The first subproblem has a number of elements that is divisible by `N`, we can proceed as we did before.
THe second subproblem has a number of elements that is not divisible by `N`, we can do those sequentially.

Since `mod(1010,4) = 2` that means we can do the first `1008` elements using SIMD vectors and the remaining 2 using scalar operations.

In general:

```
    n_remaining = mod(n_elements, N)
    n_first = n_elements - n_remaining
```


In [47]:
n_remaining = mod(1010,4)
n_first = n_elements -n_remaining
n_remaining, n_first

(2, 1008)

In [75]:
function c_a_times_b_SIMD_4!(c, a, b)
    N = 4
    T = eltype(c)
    n_elements = length(a)
    n_remaining = mod(n_elements, N)
    n_first = n_elements - n_remaining
    
    for i in 1:N:length(n_first)
        a_chunk = vload(Vec{N,T}, a, i) 
        b_chunk = vload(Vec{N,T}, b, i) 
        a_chunk *=  b_chunk
        vstore(a_chunk,c,i)
    end
    
    for i in n_first:n_elements
        c[i] = a[i]*b[i]
    end
end

c_a_times_b_SIMD_4! (generic function with 1 method)

In [76]:
c_a_times_b_SIMD_4!(c,a,b)

In [78]:
aux = zero(a)
c_a_times_b!(aux,a,b)
isapprox(aux,c)

true

In [80]:
@btime c_a_times_b_SIMD_4!($c, $a, $b)

  18.109 ns (0 allocations: 0 bytes)


In [81]:
@btime c_a_times_b!($c, $a, $b)

  4.864 μs (0 allocations: 0 bytes)


In [82]:
@btime c_a_times_b_SIMD_3!(c,a,b)

  5.340 μs (0 allocations: 0 bytes)


In [63]:
T = Float64
n_elements = 10_000
SIMD_WIDTH = 4
a = rand(T, n_elements)
b = rand(T, n_elements)
c = zeros(T, n_elements);

In [64]:
@btime c_a_times_b_SIMD_4!($c, $a, $b)

  20.807 ns (0 allocations: 0 bytes)


In [65]:
@btime c_a_times_b!($c, $a, $b)

  5.050 μs (0 allocations: 0 bytes)


### Another example

In [None]:
using SIMD
using BenchmarkTools

x1 = rand(Float64, 64)
x2 = rand(Float64, 64)
y = similar(x1)

function add!(y, x1,x2)
    @inbounds for i=1:length(x1)
        y[i] = x1[i] + x2[i] 
    end
end

function vadd!(y::Vector{T}, xs::Vector{T}, ys::Vector{T}, vec::Type{Vec{N,T}}=Vec{8,T}) 
    @inbounds for i in 1:N:length(xs)
        xv = vload(Vec{N,T}, xs, i)
        yv = vload(Vec{N,T}, ys, i)
        xv += yv 
        vstore(xv, y, i)
    end
end


## vload and vstore using indexing notation

In [None]:
function c_a_times_b_SIMD_2!(c::Array, a::Array, b::Array, N::Int)
    @assert length(a) == length(b) == length(c)
    
    T      = eltype(c)
    v_type = Vec{N, T}
    a_chunk = zero(Vec{4,Float64})
    b_chunk = zero(Vec{4,Float64})
    
    for i in 1:N:length(a)
        c_chunk = vload(v_type, a, i) * vload(v_type, b, i)
        vstore(c_chunk,c,i)
    end
end

In [None]:
@btime c_a_times_b_SIMD_2!($c,$a,$b,4)

In [None]:
A[i,j:j+N]

# Using `VecRange` objects

In [None]:
using SIMD
function vadd!(xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}}) where {N, T}
    @assert length(ys) == length(xs)
    @assert length(xs) % N == 0
    lane = VecRange{N}(0)
    @inbounds for i in 1:N:length(xs)
        xs[lane + i] += ys[lane + i]
    end
end

In [None]:
x = rand(Float32, 1_000_000);
y = rand(Float32, 1_000_000);

In [None]:
@btime vadd!($x,$y,Vec{32,Float32})

In [None]:
@btime $x .+= $y;

# If else statements in SIMD 

In [None]:
function myfunc(a, b)
    if a > b
        return a - b
    else
        return a + b
    end
end
x = rand(1_000_000);
# do myfunc.(x, 2.) with explicit simd calls
myfunc.(x, 2.);

In [None]:
function myfunc_simd(x::Vector{T}, value::T, ::Type{Vec{N,T}}) where {N, T}
           @assert length(x) % N == 0
           result = Array{T}(undef, length(x))
           lane   = VecRange{N}(0)
           @inbounds for i in 1:N:length(x)        
               x_vslice    = vload(Vec{N, T}, x, i) # i = 2*k+1 where k=1,2,3,4,...
               result[lane + i] = vifelse(x_vslice > 2, x_vslice - value, x_vslice + value)
           end
           return result
       end

In [None]:
x = rand(Float32,1_000_000);

In [None]:
result_1 = myfunc.(x,1);
result_2 = myfunc_simd(x, Float32(1), Vec{8,Float32});
result_1 == result_2

In [None]:
@btime myfunc.(x,1);

In [None]:
@btime myfunc_simd(x, Float32(1), Vec{8,Float32});