<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#The-Vec-type" data-toc-modified-id="The-Vec-type-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>The <code>Vec</code> type</a></span></li><li><span><a href="#Operations-on-Vec-types" data-toc-modified-id="Operations-on-Vec-types-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Operations on <code>Vec</code> types</a></span><ul class="toc-item"><li><span><a href="#Operations-between-Vec-elements" data-toc-modified-id="Operations-between-Vec-elements-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Operations between <code>Vec</code> elements</a></span></li><li><span><a href="#Elementwise-operation-in-Vec-elements" data-toc-modified-id="Elementwise-operation-in-Vec-elements-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Elementwise operation in <code>Vec</code> elements</a></span></li><li><span><a href="#Reduction-operations" data-toc-modified-id="Reduction-operations-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Reduction operations</a></span></li><li><span><a href="#Performance-gain-by-lowering-precission" data-toc-modified-id="Performance-gain-by-lowering-precission-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Performance gain by lowering precision</a></span></li><li><span><a href="#Example-automatic-vectorization" data-toc-modified-id="Example-automatic-vectorization-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Example automatic vectorization</a></span></li></ul></li></ul></div>

In [34]:
using SIMD
using BenchmarkTools

# The `Vec` type


SIMD vectors are similar to small fixed-size arrays of "simple" types. The element types supported in SIMD.jl are:

```
- Bool 
- Int{8,16,32,64,128} 
- UInt{8,16,32,64,128} 
- Float{16,32,64}
```


We can create SIMD vectors (values of type `SIMD.Vec`) from a tuple as follows:

```
my_vec = Vec(a_tuple)
```

Notice that `eltype(a_tuple)` has to be one of

`[Bool, Int8, Int16, Int32, Int64, Int128, UInt8, UInt16, UInt32, UInt64, UInt128, Float16, Float32, Float64]`

##### Examples:

```
a1_v = Vec((1,2,3,4,5,6,7,8))
a2_v = Vec((9,10,11,12,13,14,15,16))
```

##### Breaking examples:

```
a = Vec(("the","house")) # strings are not in the set of possible element types for a Vec
a = Vec((1,23.231))      # All elements in the tuple constructing the Vec need to be of the same type
```
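
One way to repair the second breaking example is to promote the elements to a common type before building the tuple. The sketch below assumes SIMD.jl is installed; `mixed` and `v` are illustrative names that do not appear in the notebook.

```julia
using SIMD

# Mixed Int/Float64 elements cannot form a Vec directly,
# so promote them to a common element type first.
mixed = promote(1, 23.231)     # a homogeneous tuple of Float64 values
v = Vec(mixed)                 # 2-element Float64 SIMD vector

# The element type and the length are part of the Vec type itself.
println(eltype(v))
println(length(v))
println(v[1], " ", v[2])       # individual lanes can be read by index
```

Promotion changes the element type of every lane, so it is only appropriate when a common type such as `Float64` is acceptable for all elements.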

In [11]:
a1_v = Vec((1,2,3,4,5,6,7,8))
a2_v = Vec((9,10,11,12,13,14,15,16))

<8 x Int64>[9, 10, 11, 12, 13, 14, 15, 16]

We can operate on these vectors as if they were arrays:

In [31]:
res_v = a1_v + a2_v

<8 x Int64>[10, 12, 14, 16, 18, 20, 22, 24]

In [32]:
a1_a = [1,2,3,4,5,6,7,8]
a2_a = [9,10,11,12,13,14,15,16];

In [33]:
res_a = a1_a + a2_a

8-element Array{Int64,1}:
 10
 12
 14
 16
 18
 20
 22
 24

In this benchmark, the operation on `Vec` values is faster than its `Array` counterpart:

In [36]:
@btime res_a = a1_a + a2_a

  53.730 ns (1 allocation: 144 bytes)


8-element Array{Int64,1}:
 10
 12
 14
 16
 18
 20
 22
 24

In [37]:
@btime res_v = a1_v + a2_v

  22.162 ns (1 allocation: 80 bytes)


<8 x Int64>[10, 12, 14, 16, 18, 20, 22, 24]
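
Both timings above benchmark non-constant global variables, which adds overhead unrelated to the addition itself. The BenchmarkTools manual recommends interpolating such variables with `$`. The sketch below repeats the comparison that way, assuming SIMD.jl and BenchmarkTools are installed. The numbers depend on the machine, so none are shown.

```julia
using SIMD
using BenchmarkTools

a1_a = [1, 2, 3, 4, 5, 6, 7, 8]
a2_a = [9, 10, 11, 12, 13, 14, 15, 16]
a1_v = Vec((1, 2, 3, 4, 5, 6, 7, 8))
a2_v = Vec((9, 10, 11, 12, 13, 14, 15, 16))

# `$` splices the current value into the benchmark expression, so the
# timing excludes the cost of looking up an untyped global variable.
@btime $a1_a + $a2_a    # Array addition: allocates a new result array
@btime $a1_v + $a2_v    # Vec addition: the result is an immutable value
```

Very small timings for the `Vec` case may partly reflect the compiler folding the interpolated constants, so they should be read as a lower bound rather than a precise cost.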

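The main difference between standard arrays and `Vec` types is that operations on a `Vec` map directly to instructions that operate on the whole vector at once. In the listing below, `vpaddq` adds packed 64-bit integers held in 256-bit `ymm` registers, so two such instructions cover all eight elements.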


In [43]:
@code_native a1_v + a2_v

	.text
; ┌ @ SIMD.jl:996 within `+'
; │┌ @ SIMD.jl:583 within `llvmwrap' @ SIMD.jl:583
; ││┌ @ SIMD.jl:996 within `macro expansion'
	vmovdqu	(%rdx), %ymm0
	vmovdqu	32(%rdx), %ymm1
	vpaddq	(%rsi), %ymm0, %ymm0
	vpaddq	32(%rsi), %ymm1, %ymm1
; │└└
	vmovdqa	%ymm1, 32(%rdi)
	vmovdqa	%ymm0, (%rdi)
	movq	%rdi, %rax
	vzeroupper
	retq
	nopw	%cs:(%rax,%rax)
; └


This does not happen as readily with standard arrays, although the compiler can sometimes vectorize array code on its own.

In [45]:
@code_native a1_a + a2_a

	.text
; ┌ @ arraymath.jl:44 within `+'
	pushq	%r14
	pushq	%rbx
	subq	$56, %rsp
	movq	%rsi, %r8
	vxorps	%xmm0, %xmm0, %xmm0
	vmovaps	%xmm0, (%rsp)
	movq	$0, 16(%rsp)
	movq	%r8, 48(%rsp)
	movq	%fs:0, %rcx
	movq	$2, (%rsp)
	movq	-15552(%rcx), %rsi
	movq	%rsi, 8(%rsp)
	movq	%rsp, %rsi
	movq	%rsi, -15552(%rcx)
	leaq	-15552(%rcx), %r14
	movq	(%r8), %rcx
; │┌ @ tuple.jl:43 within `iterate' @ tuple.jl:43
; ││┌ @ tuple.jl:24 within `getindex'
	movq	8(%r8), %rsi
; │└└
; │ @ arraymath.jl:45 within `+'
; │┌ @ indices.jl:145 within `promote_shape'
; ││┌ @ abstractarray.jl:75 within `axes'
; │││┌ @ array.jl:155 within `size'
	movq	24(%rcx), %rbx
; │││└
; │││┌ @ tuple.jl:165 within `map'
; ││││┌ @ range.jl:317 within `Type' @ range.jl:308
; │││││┌ @ promotion.jl:414 within `max'
	movq	%rbx, %rdi
	sarq	$63, %rdi
	andnq	%rbx, %rdi, %rdi
; ││└└└└
; ││┌ @ array.jl:155 within `axes'
	movq	24(%rsi), %rax
; ││└
; ││┌ @ abstractarray.jl:75 within `axes'
; │││┌ @ tuple.jl:165 within `map'
; ││││┌ @ range.jl:


# Operations on `Vec` types


## Operations between `Vec` elements


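Let $Vec\{N,T\}$ denote the type of SIMD vectors with $N$ elements of type $T$, for example `Vec{8,Int32}`.

Then a vector operation is a function $Vec\{N,T\} \times Vec\{N,T\}  \longrightarrow Vec\{N,T\}$. The operators `!` and `~` are unary, and the comparison operators return a `Vec{N,Bool}`. In current Julia the bitwise xor operator is written `⊻` (or `xor`); `$` was its spelling in older Julia versions. SIMD.jl provides the following methods: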


```
+ - * / % ^ ! ~ & | ⊻ << >> >>> == != < <= > >=
```
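
As a small illustration of the operators above, the sketch below applies arithmetic, bitwise, and comparison operators to two integer vectors. It assumes SIMD.jl is installed; all variable names are illustrative. Comparisons produce one `Bool` per lane rather than a single `Bool`.

```julia
using SIMD

x = Vec((1, 2, 3, 4))
y = Vec((4, 3, 2, 1))

sum_xy  = x + y      # lane-wise sum
prod_xy = x * y      # lane-wise product
and_xy  = x & y      # lane-wise bitwise AND
mask    = x < y      # Vec of Bool, one result per lane

# vifelse picks each lane from its second or third argument according
# to the mask, which here yields the lane-wise minimum of x and y.
smaller = vifelse(mask, x, y)

println(sum_xy)
println(prod_xy)
println(and_xy)
println(mask)
println(smaller)
```

A `Bool` mask combined with `vifelse` is the usual replacement for a per-element `if` inside vector code, since a branch cannot depend on individual lanes.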


In [207]:
x = Vec((1.,2.,3.,4.))
y = Vec((1.,2.,3.,9.))

<4 x Float64>[1.0, 2.0, 3.0, 9.0]

In [213]:
operations = Symbol.([ +, -, *,  /,  %, ^, !, ~, &, |])

10-element Array{Symbol,1}:
 :+  
 :-  
 :*  
 :/  
 :rem
 :^  
 :!  
 :~  
 :&  
 :|  

In [212]:
for op in (+, -, *, /, %, ^)   # binary arithmetic operators that accept Float64 vectors
    println("x $op y = ", op(x, y))
end

In [193]:
operations = Symbol.([ +, -, *,  /,  %, ^, !, ~, &, |, ⊻, <<, >>, >>>, ==, !=, <, <=, >, >=])  # ⊻ replaces $, the xor operator of older Julia versions


## Elementwise operation in `Vec` elements

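An elementwise operation applies a function to each element of a `Vec` independently, so in the unary case it is a function $Vec\{N,T\} \longrightarrow Vec\{N,T\}$. Some of the functions listed below, such as `muladd` and `vifelse`, take more than one argument.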

The following operations are available:

```
abs cbrt ceil copysign cos div exp exp10 exp2 flipsign floor fma inv isfinite isinf isnan issubnormal log log10 log2 muladd rem round sign signbit sin sqrt trunc vifelse
```
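
The sketch below exercises a few of the functions listed above, including one that takes several arguments. It assumes SIMD.jl is installed; the variable names are illustrative.

```julia
using SIMD

x = Vec((1.0, 4.0, 9.0, 16.0))
y = Vec((2.0, 2.0, 2.0, 2.0))
z = Vec((1.0, 1.0, 1.0, 1.0))

roots = sqrt(x)            # lane-wise square root
fused = muladd(x, y, z)    # lane-wise x * y + z
mags  = abs(-x)            # lane-wise absolute value of the negated vector

println(roots)
println(fused)
println(mags)
```

Every result has the same length and element type as the inputs, which is what distinguishes these functions from the reductions described later.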

In [151]:
x = Vec((1.,2.,3.,4.))

<4 x Float64>[1.0, 2.0, 3.0, 4.0]

In [152]:
exp(x)

<4 x Float64>[2.718281828459045, 7.38905609893065, 20.085536923187668, 54.598150033144236]

In [153]:
ceil(x)

<4 x Float64>[1.0, 2.0, 3.0, 4.0]



## Reduction operations

A reduction operation is a function of the form $Vec\{N,T\}  \longrightarrow T$.

Therefore reductions "reduce" a SIMD vector to a scalar. The following reduction operations are provided:

```
all any maximum minimum sum prod
```
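
Reductions combine naturally with elementwise operations. As a sketch, assuming SIMD.jl is installed and using illustrative names, a dot product is an elementwise product followed by `sum`, and `any` and `all` reduce a vector of `Bool` lanes:

```julia
using SIMD

x = Vec((1.0, 2.0, 3.0, 4.0))
y = Vec((4.0, 3.0, 2.0, 1.0))

dot_xy   = sum(x * y)      # elementwise product, then reduction to a scalar
largest  = maximum(x)
smallest = minimum(x)
prod_x   = prod(x)
any_less = any(x < y)      # true if at least one lane satisfies x < y
all_less = all(x < y)      # true only if every lane satisfies x < y

println((dot_xy, largest, smallest, prod_x, any_less, all_less))
```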

In [145]:
x = Vec((1.,2.,3.,4.))
sum(x)

10.0

In [146]:
x = Vec((1,2,3,4))
sum(x)

10

## Performance gain by lowering precision

The speed difference can be magnified if we do more operations than a single sum:

In [154]:
function sum_vec(a1,a2,b1,b2)
    res1 = a1 + a2
    res2 = b1 + b2
    return res1, res2
end

sum_vec (generic function with 1 method)

In [155]:
@btime sum_vec(a1_a,a2_a,a1_a,a2_a)

  101.254 ns (3 allocations: 320 bytes)


([10, 12, 14, 16, 18, 20, 22, 24], [10, 12, 14, 16, 18, 20, 22, 24])

In [156]:
@btime sum_vec(a1_v,a2_v,a1_v,a2_v)

  28.685 ns (1 allocation: 144 bytes)


(<8 x Int64>[10, 12, 14, 16, 18, 20, 22, 24], <8 x Int64>[10, 12, 14, 16, 18, 20, 22, 24])

In [157]:
function sum_vec_4(a1,a2,b1,b2)
    res1 = a1 + a2
    res2 = b1 + b2
    res3 = a1 + b1
    res4 = a2 + b2

    return res1, res2,res3, res4
end

sum_vec_4 (generic function with 1 method)

In [158]:
@btime sum_vec_4(a1_v,a2_v,a1_v,a2_v)

  41.625 ns (1 allocation: 272 bytes)


(<8 x Int64>[10, 12, 14, 16, 18, 20, 22, 24], <8 x Int64>[10, 12, 14, 16, 18, 20, 22, 24], <8 x Int64>[2, 4, 6, 8, 10, 12, 14, 16], <8 x Int64>[18, 20, 22, 24, 26, 28, 30, 32])

In [159]:
@btime sum_vec_4(a1_a,a2_a,a1_a,a2_a)

  172.804 ns (5 allocations: 624 bytes)


([10, 12, 14, 16, 18, 20, 22, 24], [10, 12, 14, 16, 18, 20, 22, 24], [2, 4, 6, 8, 10, 12, 14, 16], [18, 20, 22, 24, 26, 28, 30, 32])

Notice that the speed gain between the previous two calls is roughly 4x (41.6 ns versus 172.8 ns).

We can increase the difference in speed by using an integer type with fewer bits, for example `Int32`.

In [160]:
a1_i32_v = Vec(tuple(Int32.(a1_a)...))
a2_i32_v = Vec(tuple(Int32.(a2_a)...))

<8 x Int32>[9, 10, 11, 12, 13, 14, 15, 16]

In [161]:
@btime sum_vec_4(a1_i32_v,a2_i32_v,a1_i32_v,a2_i32_v)

  28.317 ns (1 allocation: 144 bytes)


(<8 x Int32>[10, 12, 14, 16, 18, 20, 22, 24], <8 x Int32>[10, 12, 14, 16, 18, 20, 22, 24], <8 x Int32>[2, 4, 6, 8, 10, 12, 14, 16], <8 x Int32>[18, 20, 22, 24, 26, 28, 30, 32])
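
A plausible reason for this gain is that narrower elements let more lanes fit in one hardware register. The sketch below only compares the storage size of the two vector types and does not measure speed. It assumes SIMD.jl is installed; `v64` and `v32` are illustrative names.

```julia
using SIMD

v64 = Vec(ntuple(i -> Int64(i), 8))   # 8 lanes of Int64
v32 = Vec(ntuple(i -> Int32(i), 8))   # 8 lanes of Int32

# sizeof reports bytes; multiply by 8 to get the payload in bits.
bits64 = 8 * sizeof(v64)
bits32 = 8 * sizeof(v32)

println("Vec{8,Int64}: ", bits64, " bits")
println("Vec{8,Int32}: ", bits32, " bits")
```

With 256-bit registers such as the `ymm` registers seen in the assembly listing earlier, the eight `Int64` lanes need two registers, while the eight `Int32` lanes fit in one. Narrower integers also overflow sooner, so this trade is only safe when the values are known to stay in range.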

## Example automatic vectorization

In [226]:
function c_a_times_b!(c::Array, a::Array, b::Array)
    @assert length(a) == length(b) == length(c)
    @inbounds for i in 1:length(a)
        c[i] = a[i] * b[i]
    end
end

c_a_times_b! (generic function with 1 method)
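
Before reading the generated code, it can help to confirm that the loop computes the right values and to check programmatically whether vector types appear in the IR. The sketch below uses only Julia's standard library and repeats the function definition so that it is self-contained. Whether the search finds a vector type depends on the CPU and the Julia version, so no result is shown.

```julia
using InteractiveUtils   # provides code_llvm outside the REPL

function c_a_times_b!(c::Array, a::Array, b::Array)
    @assert length(a) == length(b) == length(c)
    @inbounds for i in 1:length(a)
        c[i] = a[i] * b[i]
    end
end

a = rand(1000)
b = rand(1000)
c = similar(a)
c_a_times_b!(c, a, b)
println(c == a .* b)     # compares the loop against broadcasted multiplication

# Capture the optimized LLVM IR as a string and search it for vector
# types such as `<4 x double>`.
V64 = Vector{Float64}
ir = sprint(code_llvm, c_a_times_b!, Tuple{V64, V64, V64})
println(occursin(r"<\d+ x double>", ir))
```

Searching the IR string is a coarse check: it shows that some vector instruction was emitted, not that the whole loop was vectorized.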

In [241]:
 V64 = Vector{Float64}

Array{Float64,1}

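Inspecting the output of `code_llvm`, we can see that the multiplication operates on `<4 x double>` vectors, four `Float64` values at a time: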

```
 ┌ @ float.jl:399 within `*'
   %55 = fmul <4 x double> %wide.load, %wide.load24
   %56 = fmul <4 x double> %wide.load21, %wide.load25
   %57 = fmul <4 x double> %wide.load22, %wide.load26
   %58 = fmul <4 x double> %wide.load23, %wide.load27
```

In [245]:
code_llvm(c_a_times_b!, Tuple{V64, V64, V64})


;  @ In[226]:2 within `c_a_times_b!'
define nonnull %jl_value_t addrspace(10)* @"japi1_c_a_times_b!_14660"(%jl_value_t addrspace(10)*, %jl_value_t addrspace(10)**, i32) #0 {
top:
  %gcframe = alloca %jl_value_t addrspace(10)*, i32 3
  %3 = bitcast %jl_value_t addrspace(10)** %gcframe to i8*
  call void @llvm.memset.p0i8.i32(i8* %3, i8 0, i32 24, i32 0, i1 false)
  %4 = alloca %jl_value_t addrspace(10)**, align 8
  store volatile %jl_value_t addrspace(10)** %1, %jl_value_t addrspace(10)*** %4, align 8
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"()
  %ptls_i8 = getelementptr i8, i8* %thread_ptr, i64 -15552
  %ptls = bitcast i8* %ptls_i8 to %jl_value_t***
  %5 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 0
  %6 = bitcast %jl_value_t addrspace(10)** %5 to i64*
  store i64 2, i64* %6
  %7 = getelementptr %jl_value_t**, %jl_value_t*** %ptls, i32 0
  %8 = getelementptr %jl_value_t addrspace(10)*, %jl_value_t addrspace(10)** %gcframe, i32 1
  %9

Inspecting the output of `code_native`, we can see packed loads into the 256-bit `ymm` registers:

```
	vmovupd	(%r10,%rcx,8), %ymm0
	vmovupd	32(%r10,%rcx,8), %ymm1
	vmovupd	64(%r10,%rcx,8), %ymm2
	vmovupd	96(%r10,%rcx,8), %ymm3
```

In [244]:
code_native(c_a_times_b!, Tuple{V64, V64, V64})

	.text
; ┌ @ In[226]:2 within `c_a_times_b!'
	pushq	%rbx
	subq	$32, %rsp
	vxorpd	%xmm0, %xmm0, %xmm0
	vmovapd	%xmm0, (%rsp)
	movq	$0, 16(%rsp)
	movq	%rsi, 24(%rsp)
	movq	%fs:0, %rax
	movq	$2, (%rsp)
	movq	-15552(%rax), %rcx
	movq	%rcx, 8(%rsp)
	movq	%rsp, %rcx
	movq	%rcx, -15552(%rax)
	leaq	-15552(%rax), %rdi
	movq	8(%rsi), %rax
	movq	16(%rsi), %rcx
; │┌ @ array.jl:199 within `length'
	movq	8(%rcx), %r8
; │└
; │┌ @ promotion.jl:403 within `=='
	cmpq	%r8, 8(%rax)
; │└
	jne	L368
	movq	(%rsi), %rsi
; │ @ In[226]:2 within `c_a_times_b!'
; │┌ @ promotion.jl:403 within `=='
	cmpq	8(%rsi), %r8
; │└
	jne	L368
; │ @ In[226]:3 within `c_a_times_b!'
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:274 within `Type'
; │││┌ @ range.jl:279 within `unitrange_last'
; ││││┌ @ operators.jl:333 within `>='
; │││││┌ @ int.jl:428 within `<='
	testq	%r8, %r8
; │└└└└└
	jle	L169
	movq	(%rax), %r10
	movq	(%rcx), %rdx
	movq	(%rsi), %rsi
; │ @ In[226]:3 within `c_a_times_b!'
	cmpq	$16, %r8
	jae	L196
	movl	$1, %