<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#The-Vec-type" data-toc-modified-id="The-Vec-type-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>The <code>Vec</code> type</a></span><ul class="toc-item"><li><span><a href="#Speedup-of-SIMD-vector-operations-vs-standard-arrays" data-toc-modified-id="Speedup-of-SIMD-vector-operations-vs-standard-arrays-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Speedup of SIMD vector operations vs standard arrays</a></span></li></ul></li><li><span><a href="#Operations-from-Vec-types-to-Vec-types" data-toc-modified-id="Operations-from-Vec-types-to-Vec-types-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Operations from <code>Vec</code> types to <code>Vec</code> types</a></span><ul class="toc-item"><li><span><a href="#Vector-operations" data-toc-modified-id="Vector-operations-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Vector operations</a></span><ul class="toc-item"><li><span><a href="#Integer-operators:-[-+,--,-*,--&amp;,-|,-&lt;&lt;,-&gt;&gt;,-&gt;&gt;&gt;,-==,-!=,-&lt;,-&lt;=,-&gt;,-&gt;=]" data-toc-modified-id="Integer-operators:-[-+,--,-*,--&amp;,-|,-<<,->>,->>>,-==,-!=,-<,-<=,->,->=]-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Integer operators: <code>[ +, -, *,  &amp;, |, &lt;&lt;, &gt;&gt;, &gt;&gt;&gt;, ==, !=, &lt;, &lt;=, &gt;, &gt;=]</code></a></span></li><li><span><a href="#Float-operators:-[-+,--,-*,--&amp;,-|,-&lt;&lt;,-&gt;&gt;,-&gt;&gt;&gt;,-==,-!=,-&lt;,-&lt;=,-&gt;,-&gt;=]" data-toc-modified-id="Float-operators:-[-+,--,-*,--&amp;,-|,-<<,->>,->>>,-==,-!=,-<,-<=,->,->=]-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Float operators: <code>[ +, -, *,  &amp;, |, &lt;&lt;, &gt;&gt;, &gt;&gt;&gt;, ==, !=, &lt;, &lt;=, &gt;, &gt;=]</code></a></span></li></ul></li></ul></li><li><span><a href="#Elementwise-operation-in-Vec-elements" data-toc-modified-id="Elementwise-operation-in-Vec-elements-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Elementwise operation in 
<code>Vec</code> elements</a></span></li><li><span><a href="#Reduction-operations" data-toc-modified-id="Reduction-operations-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Reduction operations</a></span><ul class="toc-item"><li><span><a href="#Shuffle" data-toc-modified-id="Shuffle-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Shuffle</a></span></li></ul></li><li><span><a href="#Performance-gain-by-lowering-precission" data-toc-modified-id="Performance-gain-by-lowering-precission-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Performance gain by lowering precission</a></span><ul class="toc-item"><li><span><a href="#Example-automatic-vectorization" data-toc-modified-id="Example-automatic-vectorization-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Example automatic vectorization</a></span></li><li><span><a href="#Exercise:-Algorithm-to-check-if-an-element-is-within-a-vector." data-toc-modified-id="Exercise:-Algorithm-to-check-if-an-element-is-within-a-vector.-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Exercise: Algorithm to check if an element is within a vector.</a></span></li><li><span><a href="#Exercise:-Algorithm-to-sum-over-indices-in-an-array" data-toc-modified-id="Exercise:-Algorithm-to-sum-over-indices-in-an-array-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Exercise: Algorithm to sum over indices in an array</a></span></li></ul></li></ul></div>

In [2]:
using SIMD
using BenchmarkTools

# The `Vec` type


SIMD vectors are similar to small fixed-size arrays of "simple" types. The element types supported in SIMD.jl are:

```
- Bool 
- Int{8,16,32,64,128} 
- UInt{8,16,32,64,128} 
- Float{16,32,64}
```


We can create vector types (or `SIMD.Vec` types) as follows:

```
my_vec = Vec(a_tuple)
```

Notice that `eltype(a_tuple)` has to be one of

`[Bool, Int8, Int16, Int32, Int64, Int128, UInt8, UInt16, UInt32, UInt64, UInt128, Float16, Float32, Float64]`

##### Examples:

```
a1_v = Vec(1,2,3,4,5,6,7,8)
a2_v = Vec(9,10,11,12,13,14,15,16,17,18)
```

##### Breaking examples:

```
a = Vec(("the","house")) # strings are not in the set of possible element types for a Vec
a = Vec((1,23.231))      # All elements in the tuple constructing the Vec need to be of the same type
```
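The second failing case can be worked around by promoting the tuple elements to a common type before building the `Vec`. The following is a minimal sketch, assuming SIMD.jl is installed; `to_vec` is a hypothetical helper introduced here for illustration and is not part of SIMD.jl.

```julia
using SIMD

# Hypothetical helper: promote all elements to a common type before
# constructing the Vec, so mixed tuples such as (1, 23.231) are accepted.
to_vec(t::Tuple) = Vec(promote(t...))

v = to_vec((1, 23.231))      # both elements are promoted to Float64
w = Vec((1, 2, 3, 4))        # already homogeneous, no promotion needed

println(typeof(v))
println(length(w), " lanes of ", eltype(w))
```

Promotion only helps when a common numeric type exists; strings remain unsupported.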

We can operate on these vectors as if they were arrays:

In [3]:
a1_v = Vec((1,2,3,4,5,6,7,8))
a2_v = Vec((9,10,11,12,13,14,15,16))
res_v = a1_v + a2_v

<8 x Int64>[10, 12, 14, 16, 18, 20, 22, 24]

In [4]:
a1_a = [1,2,3,4,5,6,7,8]
a2_a = [9,10,11,12,13,14,15,16];
res_a = a1_a + a2_a

8-element Vector{Int64}:
 10
 12
 14
 16
 18
 20
 22
 24

Operations between `Vec` values are typically faster than their `Array` counterparts:

In [4]:
print("Vector operation time: ")
println(@benchmark res_v = a1_v + a2_v)

print("Array operation time:  ")
println(@benchmark res_a = a1_a + a2_a)

Vector operation time: Trial(20.792 ns)
Array operation time:  Trial(52.057 ns)
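The timings above are taken on global variables, so they may include some overhead from accessing untyped globals. The sketch below shows the same comparison with the operands interpolated using `$`, as the BenchmarkTools manual recommends; the variable names `t_vec` and `t_arr` are introduced here only for illustration, and the numbers will differ from machine to machine.

```julia
using SIMD
using BenchmarkTools

a1_v = Vec((1, 2, 3, 4, 5, 6, 7, 8))
a2_v = Vec((9, 10, 11, 12, 13, 14, 15, 16))
a1_a = [1, 2, 3, 4, 5, 6, 7, 8]
a2_a = [9, 10, 11, 12, 13, 14, 15, 16]

# `$` splices the current value into the benchmark, so the timing does not
# include the cost of looking up an untyped global variable.
t_vec = @belapsed $a1_v + $a2_v
t_arr = @belapsed $a1_a + $a2_a

# @belapsed returns the minimum time in seconds.
println("Vec:   ", t_vec * 1e9, " ns")
println("Array: ", t_arr * 1e9, " ns")
```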


## Speedup of SIMD vector operations vs standard arrays

In [10]:
a1_v = Vec(Int8.((1,2,3,4)))
a2_v = Vec(Int8.((1,2,3,4)))
print("Vector operation time: ")
@btime res = a1_v + a2_v

a1_a = [1,2,3,4]
a2_a = [1,2,3,4]
print("Array operation time:  ")
@btime res = a1_a + a2_a;

Vector operation time:   20.229 ns (1 allocation: 16 bytes)
Array operation time:    42.905 ns (1 allocation: 96 bytes)


In [11]:
a1_v = Vec(Int8.((1,2,3,4,5,6,7,8)))
a2_v = Vec(Int8.((1,2,3,4,5,6,7,8)))
print("Vector operation time: ")
@btime res = a1_v + a2_v;

a1_a = [1,2,3,4,5,6,7,8]
a2_a = [1,2,3,4,5,6,7,8]
print("Array operation time:  ")
@btime res = a1_a + a2_a;

Vector operation time:   19.690 ns (1 allocation: 16 bytes)
Array operation time:    46.040 ns (1 allocation: 128 bytes)


In [12]:
a1_v = Vec(Int64.(Tuple(x for x in 1:20)))
a2_v = Vec(Int64.(Tuple(x for x in 1:20)))
    
print("Vector operation time: ")
@btime res = a1_v + a2_v;

a1_a = [x for x in 1:20]
a2_a = [x for x in 1:20]
    
print("Array operation time: ")
@btime res = a1_a + a2_a;

Vector operation time:   30.034 ns (1 allocation: 272 bytes)
Array operation time:   53.946 ns (1 allocation: 224 bytes)


In [13]:
# error to be solved: If len is not even there can be some errors
# Discussion about this in https://github.com/eschnett/SIMD.jl/issues/61

#a1_v = Vec(Int64.((1,2,3,4,5,6,7)))
#a2_v = Vec(Int64.((1,2,3,4,5,6,7)))
#@btime res = a1_v + a2_v

The main difference between standard arrays and `Vec` types is that a `Vec` has a fixed length known at compile time, so operations on it map easily onto instructions that operate on the whole `Vec` at once.


In [14]:
@code_native a1_v + a2_v

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 11, 0
	.globl	"_julia_+_2159"                 ## -- Begin function julia_+_2159
	.p2align	4, 0x90
"_julia_+_2159":                        ## @"julia_+_2159"
; ┌ @ /Users/davidbuchaca/.julia/packages/SIMD/myoU9/src/simdvec.jl:252 within `+`
	.cfi_startproc
## %bb.0:                               ## %top
	movq	%rdi, %rax
; │ @ /Users/davidbuchaca/.julia/packages/SIMD/myoU9/src/simdvec.jl:253 within `+`
; │┌ @ /Users/davidbuchaca/.julia/packages/SIMD/myoU9/src/LLVM_intrinsics.jl:227 within `add`
; ││┌ @ /Users/davidbuchaca/.julia/packages/SIMD/myoU9/src/LLVM_intrinsics.jl:235 within `macro expansion`
	vmovdqu	128(%rdx), %ymm

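For standard arrays the mapping is less direct. The compiler can still vectorize the loop automatically (note the `vector.memcheck` labels in the listing below), but the generated code is much longer because it also performs run-time checks before the vectorized path can be used.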

In [15]:
@code_native a1_a + a2_a

	.section	__TEXT,__text,regular,pure_instructions
	.build_version macos, 11, 0
	.globl	"_julia_+_2195"                 ## -- Begin function julia_+_2195
	.p2align	4, 0x90
"_julia_+_2195":                        ## @"julia_+_2195"
; ┌ @ arraymath.jl:12 within `+`
	.cfi_startproc
## %bb.0:                               ## %top
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	pushq	%r15
	pushq	%r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	andq	$-32, %rsp
	subq	$160, %rsp
	.cfi_of

	jmp	LBB0_39
LBB0_15:                                ## %L174.preheader
	cmpq	$16, %r13
	jae	LBB0_17
## %bb.16:
	xorl	%esi, %esi
	jmp	LBB0_22
LBB0_42:                                ## %vector.memcheck172
	leaq	(%rdx,%r13,8), %rsi
	leaq	8(%rcx), %rdi
	leaq	8(%rax), %r8
	cmpq	%rdi, %rdx
	setb	%bl
	cmpq	%rsi, %rcx
	setb	%r10b
	cmpq	%r8, %rdx
	setb	%dil
	cmpq	%rsi, %rax
	setb	%r8b

	addq	(%rcx), %rdi
; ││││││└└└└
; ││││││┌ @ array.jl:966 within `setindex!`
	movq	%rdi, (%rdx,%rsi,8)
; ││││││└
; ││││││ @ simdloop.jl:78 within `macro expansion`
; ││││││┌ @ int.jl:87 within `+`
	incq	%rsi
; ││││││└
; ││││││ @ simdloop.jl:75 within `macro expansion`
; ││││││┌ @ int.jl:83 within `<`
	cmpq	%rsi, %r13
; ││││││└
	jne	LBB0_39
	jmp	LBB0_48
LBB0_17:                                ## %vector.memcheck
	leaq	(%rdx,%r13,8), %rsi
	leaq	(%rcx,%r13,8), %rdi
	leaq	(%rax

; │││││││┌ @ broadcast.jl:516 within `_bcs1`
; ││││││││┌ @ strings/io.jl:185 within `string`
	movabsq	$_ijl_box_int64, %r15
	vzeroupper
	callq	*%r15
	movq	%rax, %r14
	movq	%rax, 56(%rsp)
	movq	%rbx, %rdi
	callq	*%r15
	movq	%rax, 48(%rsp)
	movabsq	$4770284320, %rcx               ## imm = 0x11C54C320
	movq	%rcx, 64(%rsp)
	movq	%r14, 72(%rsp)
	movabsq	$4770284288, %rcx               ## imm = 0x11C54C300
	movq	%rcx,


# Operations from `Vec` types to `Vec` types


###### Operations between `Vec` elements


Let $Vec\{N,T\}$ denote the type of SIMD vectors with $N$ elements of type $T$, for example `Vec{8,Int32}`.
Here $N$ is the length of the vector and $T$ is its element type.


## Vector operations

There are essentially two types of vector operations.

###### Same data-type operators, where the input vectors have type $T$ and the output has type $T$

$$
Vec\{N,T\} \times Vec\{N,T\}  \longrightarrow Vec\{N,T\}
$$
These are `[+, -, *,  &, |, <<, >>, >>>,  / , ^]`


###### Comparison operators, where the output has a boolean datatype

$$
Vec\{N,T\} \times Vec\{N,T\}  \longrightarrow Vec\{N,\text{Bool}\}
$$

These are `[==, !=, <, <=, >, >=]`, and the output is a vector of booleans.



##### Float operations


Most operations between SIMD vector elements are functions of the form $Vec\{N,T \text{<:}\text{Float}\} \times Vec\{N,T\text{<:}\text{Float}\}  \longrightarrow Vec\{N,T \text{<:}\text{Float} \}$. We have the following methods in SIMD.jl:

```
+, -, *,  /,  ^
```


##### Integer operations

Integer operations take two integer vectors and return an integer vector: $Vec\{N,T\text{<:}\text{Int}\} \times Vec\{N,T\text{<:}\text{Int}\}  \longrightarrow Vec\{N,T\text{<:}\text{Int}\}$. We have the following methods in SIMD.jl:


```
+ - * / % ^ !  & | ⊻ << >> >>> 
```

##### Comparison operations

Comparison operations return a boolean vector: $Vec\{N,T\} \times Vec\{N,T\}  \longrightarrow Vec\{N,\text{Bool}\}$. They apply to both integer and float element types. We have the following methods in SIMD.jl:

```
== != < <= > >=
```
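The two kinds of operators can be combined: the `Bool` vector produced by a comparison can act as a per-lane mask. The sketch below assumes SIMD.jl is installed and uses `vifelse` (listed later among the elementwise operations) to build an elementwise maximum; the variable names are illustrative only.

```julia
using SIMD

x = Vec((1, 7, 3, 4))
y = Vec((5, 2, 3, 9))

s = x + y                 # same-type operator: result is a Vec{4, Int64}
m = x < y                 # comparison operator: result is a Vec{4, Bool}

# Use the Bool vector as a lane mask: pick y where x < y, otherwise x,
# which yields the elementwise maximum of x and y.
hi = vifelse(m, y, x)

println(s)
println(m)
println(hi)
```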


###### The "~" operator

Not every operator takes two `Vec` arguments. The unary "~" operator (bitwise not) is $Vec\{N,T \text{<:}\text{Int}\}  \longrightarrow Vec\{N,T \text{<:} \text{Int}\}$.




###### Note: Vector operations cannot operate with vectors containing different data types

```julia
x = Vec(Int.((1,2,3,4)))
y = Vec(Float64.((1,2,3,8)))
x + y

MethodError: no method matching +(::Vec{4,Int64}, ::Vec{4,Float64})
```

This is true even if both datatypes are subtypes of the same abstract type

```julia
x = Vec(Float32.((1,2,3,4)))
y = Vec(Float64.((1,2,3,8)))
x + y

MethodError: no method matching +(::Vec{4,Float32}, ::Vec{4,Float64})
```
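When two vectors with different element types have to be combined, one option is to convert one of them explicitly first. The following is a minimal sketch, assuming SIMD.jl is installed; it converts lane by lane through a tuple, which relies only on indexing a `Vec`, and it is not necessarily the fastest way to do the conversion.

```julia
using SIMD

x = Vec((1, 2, 3, 4))              # Vec{4, Int64}
y = Vec((1.0, 2.0, 3.0, 8.0))      # Vec{4, Float64}

# Convert lane by lane through a tuple so both operands share Float64.
x_f64 = Vec(ntuple(i -> Float64(x[i]), Val(4)))

z = x_f64 + y
println(z)
```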


In [52]:
x = Vec((1,2,6,4))
y = Vec((1,2,4,8))
x % y

<4 x Int64>[0, 0, 2, 4]

In [17]:
# Notice this does not work for Floats
# x = Vec((1.,2.,3.,4.))
# ~x

### Integer operators: `[ +, -, *,  &, |, <<, >>, >>>, ==, !=, <, <=, >, >=]`

In [53]:
x = Vec((1,2,3,4))
y = Vec((1,2,3,9))

# I should add to this list % operator
operations_integers = Symbol.([ +, -, *,  &, |, <<, >>, >>>, ==, !=, <, <=, >, >=])

for op in operations_integers
    println(op, "\tx $op y\t\t", eval(Meta.parse("x $(op) y"))) #, #eval(exp(x op y)))
end

+	x + y		<4 x Int64>[2, 4, 6, 13]
-	x - y		<4 x Int64>[0, 0, 0, -5]
*	x * y		<4 x Int64>[1, 4, 9, 36]
&	x & y		<4 x Int64>[1, 2, 3, 0]
|	x | y		<4 x Int64>[1, 2, 3, 13]
<<	x << y		<4 x Int64>[2, 8, 24, 2048]
>>	x >> y		<4 x Int64>[0, 0, 0, 0]
>>>	x >>> y		<4 x Int64>[0, 0, 0, 0]
==	x == y		<4 x Bool>[1, 1, 1, 0]
!=	x != y		<4 x Bool>[0, 0, 0, 1]
<	x < y		<4 x Bool>[0, 0, 0, 1]
<=	x <= y		<4 x Bool>[1, 1, 1, 1]
>	x > y		<4 x Bool>[0, 0, 0, 0]
>=	x >= y		<4 x Bool>[1, 1, 1, 0]


###### Be careful with operations because they can behave in a wrap-around (or modulus) fashion


In the following cell we can see that 255+1 gives 0 as result and 255+2 gives 1 as result.

In [69]:
x = Vec(UInt8.((255,255,255,255)))
y = Vec(UInt8.((1,2,0,0)))
x + y

<4 x UInt8>[0x00, 0x01, 0xff, 0xff]

We can proactively avoid this using the `add_saturate` operation.
This operation returns the `typemax` of the element type if the result exceeds it (instead of wrapping around).

In [71]:
# Put Here a ceil version
x = Vec(UInt8.((255,255,255,255)))
y = Vec(UInt8.((1,0,0,0)))
SIMD.add_saturate(x, y)

<4 x UInt8>[0xff, 0xff, 0xff, 0xff]
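The difference between wrapping and saturating can also be reproduced on scalars in plain Julia. The sketch below uses no SIMD package; `saturating_add` is a hypothetical helper written here only to illustrate the idea, not the implementation used by SIMD.jl.

```julia
# Wrap-around: UInt8 arithmetic is performed modulo 2^8 = 256.
a = UInt8(255)
wrapped = a + UInt8(2)                  # (255 + 2) mod 256

# Hypothetical scalar helper mimicking a saturating add: compute in a wider
# type, then clamp to the largest value representable in T.
function saturating_add(x::T, y::T) where {T<:Unsigned}
    wide = widen(x) + widen(y)
    return wide > typemax(T) ? typemax(T) : T(wide)
end

saturated = saturating_add(a, UInt8(2))

println(wrapped)
println(saturated)
```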

### Float operators: `[ +, -, *, /, ^, ==, !=, <, <=, >, >=]`

In [21]:
x = Vec((1.,2.,3.,4.))
y = Vec((1.,2.,3.,9.))

operations_floats = Symbol.([ +, -, *,  /,  ^,  ==, !=, <, <=, >, >=])

for op in operations_floats
    println(op, "\tx $op y\t\t", eval(Meta.parse("x $op y"))) #, #eval(exp(x op y)))
end

+	x + y		<4 x Float64>[2.0, 4.0, 6.0, 13.0]
-	x - y		<4 x Float64>[0.0, 0.0, 0.0, -5.0]
*	x * y		<4 x Float64>[1.0, 4.0, 9.0, 36.0]
/	x / y		<4 x Float64>[1.0, 1.0, 1.0, 0.4444444444444444]
^	x ^ y		<4 x Float64>[1.0, 4.0, 27.0, 262144.0]
==	x == y		<4 x Bool>[1, 1, 1, 0]
!=	x != y		<4 x Bool>[0, 0, 0, 1]
<	x < y		<4 x Bool>[0, 0, 0, 1]
<=	x <= y		<4 x Bool>[1, 1, 1, 1]
>	x > y		<4 x Bool>[0, 0, 0, 0]
>=	x >= y		<4 x Bool>[1, 1, 1, 0]



# Elementwise operation in `Vec` elements

An elementwise operation applies a function to each element of a `Vec` independently. In the unary case it is a function $Vec\{N,T\} \longrightarrow Vec\{N,T\}$.

The following operations are available:

```
abs cbrt ceil copysign cos div exp exp10 exp2 flipsign floor fma inv isfinite isinf isnan issubnormal log log10 log2 muladd rem round sign signbit sin sqrt trunc vifelse
```

In [22]:
x = Vec((1.,2.,3.,4.))

<4 x Float64>[1.0, 2.0, 3.0, 4.0]

In [23]:
exp(x)

<4 x Float64>[2.718281828459045, 7.38905609893065, 20.085536923187668, 54.598150033144236]

In [24]:
ceil(x)

<4 x Float64>[1.0, 2.0, 3.0, 4.0]
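A few more of the listed functions, including one that takes several `Vec` arguments, are sketched below. This assumes SIMD.jl is installed; the input values are chosen so that the results are exact and easy to check by hand.

```julia
using SIMD

x = Vec((1.0, 4.0, 9.0, 16.0))
y = Vec((-1.0, 2.0, -3.0, 4.0))

r = sqrt(x)          # square root of each lane
a = abs(y)           # absolute value of each lane
f = muladd(x, y, a)  # x * y + a, computed lane by lane

println(r)
println(a)
println(f)
```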



# Reduction operations

A reduction operation is a function of the form $Vec\{N,T\}  \longrightarrow T$.

Therefore reductions "reduce" a SIMD vector to a scalar. The following reduction operations are provided:

```
all any maximum minimum sum prod
```

In [25]:
x = Vec((1.,2.,3.,4.))
sum(x)

10.0

In [26]:
x = Vec((1,2,3,4))
sum(x)

10
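The remaining reductions follow the same pattern. In the sketch below (assuming SIMD.jl is installed), `all` and `any` are applied to the `Bool` vector produced by a comparison.

```julia
using SIMD

x = Vec((1, 2, 3, 4))
y = Vec((2, 3, 4, 5))

println(prod(x))
println(maximum(x), " ", minimum(x))
println(all(x < y))    # true only if every lane satisfies x < y
println(any(x == y))   # true if at least one lane satisfies x == y
```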

## Shuffle

`shufflevector` rearranges the elements of a `Vec` according to a mask of 0-based indices, passed as `Val(mask)`:

In [5]:
a = Vec{4, Int32}((1,2,3,4))
mask = (0,3,1,2)
shufflevector(a, Val(mask))

<4 x Int32>[1, 4, 2, 3]

Note that the mask can contain the same index more than once:

In [8]:
a = Vec{4, Int32}((1,2,3,4))
mask = (0,0,1,2)
shufflevector(a, Val(mask))

<4 x Int32>[1, 1, 2, 3]

In [22]:
#a[ Vec{4, Int32}((2,2,1,3))]

In [27]:
x = [1,2,3,4]
id = [2,2,1,3]
@benchmark x[id]

BenchmarkTools.Trial: 10000 samples with 986 evaluations.
 Range (min … max):  53.165 ns …  1.889 μs  ┊ GC (min … max): 0.00% … 92.86%
 Time  (median):     56.673 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   70.859 ns ± 65.505 ns  ┊ GC (mean ± σ):  3.59% ±  4.11%

In [58]:
arr = Vector{Float64}(undef, 100)
xs = vload(Vec{4,Float64}, arr, 1)

<4 x Float64>[5.0e-324, 1.0e-323, 5.0e-324, 2.5e-323]

In [34]:
@benchmark  shufflevector($a, $Val(mask))

BenchmarkTools.Trial: 10000 samples with 204 evaluations.
 Range (min … max):  377.373 ns …  15.624 μs  ┊ GC (min … max): 0.00% … 96.22%
 Time  (median):     386.458 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   410.087 ns ± 352.750 ns  ┊ GC (mean ± σ):  2.04% ±  2.34%

In [66]:
arr = [i for i in 1:20]
i = 1
lane = VecRange{4}(0)
v = arr[lane + i]             # vload

<4 x Int64>[1, 2, 3, 4]

In [70]:
idx = Vec((1, 3, 4, 7))
@benchmark aux = $arr[$idx]  

BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.682 ns … 41.142 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.998 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.192 ns ±  1.674 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [7]:
@benchmark sum(shufflevector(a, Val(mask)))

BenchmarkTools.Trial: 10000 samples with 218 evaluations.
 Range (min … max):  341.514 ns …  13.274 μs  ┊ GC (min … max): 0.00% … 96.10%
 Time  (median):     354.904 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   422.250 ns ± 257.726 ns  ┊ GC (mean ± σ):  0.86% ±  1.65%

The same function can be used with two input vectors, and the mask is applied to positions of both vectors

In [226]:
                  #0,1,2,3
a = Vec{4, Int32}((1,2,3,4))
                  #4,5,6,7
b = Vec{4, Int32}((5,6,7,8))
mask = (2,3,4,5)
shufflevector(a, b, Val(mask))

<4 x Int32>[3, 4, 5, 6]
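Because the mask is passed as `Val(mask)`, it is meant to be known at compile time. One way to make sure of that is to write the mask as a literal inside a small function, as in the sketch below. This assumes SIMD.jl is installed; `reverse_lanes` and `interleave_low` are hypothetical helper names chosen for this illustration.

```julia
using SIMD

# The mask is a literal inside the function, so Val(...) is a compile-time
# constant rather than a value built from a global variable at run time.
reverse_lanes(v::Vec{4}) = shufflevector(v, Val((3, 2, 1, 0)))

# Two-input form: indices 0:3 select from `a`, indices 4:7 select from `b`.
interleave_low(a::Vec{4, T}, b::Vec{4, T}) where {T} =
    shufflevector(a, b, Val((0, 4, 1, 5)))

a = Vec{4, Int32}((1, 2, 3, 4))
b = Vec{4, Int32}((5, 6, 7, 8))

println(reverse_lanes(a))
println(interleave_low(a, b))
```

When benchmarking such calls, interpolating the whole `Val` object (for example `$(Val(mask))`) keeps the construction of the mask out of the measured time.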

# Performance gain by lowering precision

The speed difference can be magnified if we do more operations than a single sum:

In [27]:
function sum_vec(a1, a2, b1, b2)
    res1 = a1 + a2
    res2 = b1 + b2
    return res1, res2
end

sum_vec (generic function with 1 method)

In [28]:
@btime sum_vec(a1_a, a2_a, a1_a, a2_a)

  107.177 ns (3 allocations: 480 bytes)


([2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40], [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40])

In [29]:
@btime sum_vec(a1_v, a2_v, a1_v, a2_v)

  46.208 ns (1 allocation: 544 bytes)


(<20 x Int64>[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40], <20 x Int64>[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40])

In [31]:
function sum_vec_4(a1,a2,b1,b2)
    res1 = a1 + a2
    res2 = b1 + b2
    res3 = a1 + b1
    res4 = a2 + b2

    return res1, res2, res3, res4
end

sum_vec_4 (generic function with 1 method)

In [32]:
@btime sum_vec_4(a1_v,a2_v,a1_v,a2_v)

  78.090 ns (1 allocation: 1.06 KiB)


(<20 x Int64>[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40], <20 x Int64>[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40], <20 x Int64>[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40], <20 x Int64>[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40])

In [34]:
@btime sum_vec_4(a1_a, a2_a, a1_a, a2_a)

  170.673 ns (5 allocations: 944 bytes)


([2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40], [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40], [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40], [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40])

Notice that the speed gain between the previous two calls is about 2X here (78.090 ns versus 170.673 ns) and, at most, 4X.

We can increase the difference in speed by using an integer type with fewer bits, for example `Int32`.

In [35]:
a1_i8_v = Vec(tuple(Int32.(a1_a)...))
a2_i8_v = Vec(tuple(Int32.(a2_a)...))

<20 x Int32>[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

In [36]:
@btime sum_vec_4(a1_i8_v,a2_i8_v,a1_i8_v,a2_i8_v)

  47.582 ns (1 allocation: 544 bytes)


(<20 x Int32>[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40], <20 x Int32>[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40], <20 x Int32>[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40], <20 x Int32>[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40])
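The reason narrower types help is that more elements fit into one SIMD register. The arithmetic is sketched below in plain Julia; the register width of 256 bits is an assumption (it matches the `%ymm` registers seen in the native code earlier), and `lanes_per_register` is a hypothetical helper, not a SIMD.jl function.

```julia
# Number of elements of type T that fit in one SIMD register.
# `register_bits = 256` is an assumption (AVX2-style `ymm` registers).
lanes_per_register(::Type{T}; register_bits = 256) where {T} =
    register_bits ÷ (8 * sizeof(T))

for T in (Int64, Int32, Int16, Int8)
    println(T, ": ", lanes_per_register(T), " lanes")
end
```

Under this assumption a 20-element `Int64` vector needs five registers, while a 20-element `Int32` vector needs three.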

## Example automatic vectorization

In [37]:
function c_a_times_b!(c::Array, a::Array, b::Array)
    @assert length(a) == length(b) == length(c)
    @inbounds for i in 1:length(a)
        c[i] = a[i] * b[i]
    end
end

c_a_times_b! (generic function with 1 method)

In [38]:
 V64 = Vector{Float64}

Vector{Float64} (alias for Array{Float64, 1})

Inspecting `code_llvm` we can see

```
 ┌ @ float.jl:399 within `*'
   %55 = fmul <4 x double> %wide.load, %wide.load24
   %56 = fmul <4 x double> %wide.load21, %wide.load25
   %57 = fmul <4 x double> %wide.load22, %wide.load26
   %58 = fmul <4 x double> %wide.load23, %wide.load27
```

In [47]:
# Uncomment the following line and run
#code_llvm(c_a_times_b!, Tuple{V64, V64, V64})

Inspecting `code_native`

```
	vmovupd	(%r10,%rcx,8), %ymm0
	vmovupd	32(%r10,%rcx,8), %ymm1
	vmovupd	64(%r10,%rcx,8), %ymm2
	vmovupd	96(%r10,%rcx,8), %ymm3
```

In [46]:
# Uncomment the following line and run
#code_native(c_a_times_b!, Tuple{V64, V64, V64})
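Instead of reading the listing by eye, the generated LLVM IR can be captured as a string and searched for vector instructions. The following is a sketch under the assumption that a vectorized multiply shows up as `fmul <N x double>`, as in the excerpt above; whether it is found depends on the CPU and on the Julia and LLVM versions.

```julia
using InteractiveUtils   # provides code_llvm outside the REPL

function c_a_times_b!(c::Array, a::Array, b::Array)
    @assert length(a) == length(b) == length(c)
    @inbounds for i in 1:length(a)
        c[i] = a[i] * b[i]
    end
end

V64 = Vector{Float64}

# Capture the LLVM IR as a string instead of printing it.
ir = sprint(io -> code_llvm(io, c_a_times_b!, Tuple{V64, V64, V64}; debuginfo = :none))

# A vector multiply appears as `fmul <N x double>` in the IR.
println("vectorized fmul found: ", occursin(r"fmul <\d+ x double>", ir))
```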

## Exercise: Algorithm to check if an element is within a vector.

In [77]:
function find_val_in_array(x, val)
        
    @inbounds for i in 1:length(x)
        if x[i] == val
            return true
        end
    end

    return false
end

find_val_in_array (generic function with 1 method)

In [78]:
x = Array{Int32}(1:1_000_000);
va = Int32(500_000);
@btime find_val_in_array(x, va)

  151.972 μs (0 allocations: 0 bytes)


true

In [75]:

function find_val_in_array_simd(x, val)
    n_simd = 32
    last_pos_simd_chunk = length(x)-n_simd
    @inbounds for i in 1:n_simd:last_pos_simd_chunk
        vec_i = vload(Vec{n_simd, Int32}, x, i)
        sum_equality = sum(vec_i == val)
        if sum_equality >0
            return true
        end
    end

    # Clamp the start to 1 so arrays with length <= n_simd are not indexed out of bounds
    @inbounds for i in max(1, last_pos_simd_chunk):length(x)
        if x[i] == val
            return true
        end
    end

    return false
end

find_val_in_array_simd (generic function with 1 method)

In [76]:
x = Array{Int32}(1:1_000_000);
va = Int32(500_000);
@btime find_val_in_array_simd(x, va)

  30.508 μs (0 allocations: 0 bytes)


true

Note that we implemented this function hardcoding the element type `Int32` as well as the vector length `n_simd`.

We can avoid hardcoding the element type by making the function parametric in `T`.

In [79]:

function find_val_in_array_simd(x::Array{T}, val::T) where {T}
    n_simd = 64
    last_pos_simd_chunk = length(x)-n_simd
    @inbounds for i in 1:n_simd:last_pos_simd_chunk
        vec_i = vload(Vec{n_simd, T}, x, i)
        sum_equality = sum(vec_i == val)
        if sum_equality > 0
            return true
        end
    end

    # Clamp the start to 1 so arrays with length <= n_simd are not indexed out of bounds
    @inbounds for i in max(1, last_pos_simd_chunk):length(x)
        if x[i] == val
            return true
        end
    end

    return false
end

find_val_in_array_simd (generic function with 2 methods)

In [80]:
x = Array{Int32}(1:1_000_001);
va = Int32(500_000);
@btime  find_val_in_array_simd(x, va)

  30.447 μs (0 allocations: 0 bytes)


true
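The vector length can be made a parameter as well. The sketch below is one possible variant, not the notebook's reference solution: it assumes SIMD.jl is installed, passes the width as `Val(N)`, uses `any` instead of `sum` on the comparison result, and lets the scalar loop continue exactly where the SIMD loop stopped. The name `contains_simd` and the default width of 8 are choices made for this illustration.

```julia
using SIMD

function contains_simd(x::Vector{T}, val::T, ::Val{N}) where {T, N}
    n = length(x)
    i = 1
    # SIMD part: process full chunks of N elements.
    @inbounds while i + N - 1 <= n
        v = vload(Vec{N, T}, x, i)
        any(v == val) && return true
        i += N
    end
    # Scalar tail: fewer than N elements remain, starting at index i.
    @inbounds while i <= n
        x[i] == val && return true
        i += 1
    end
    return false
end

# Convenience method with an assumed default width of 8 lanes.
contains_simd(x::Vector{T}, val::T) where {T} = contains_simd(x, val, Val(8))

x = Array{Int32}(1:1_000_001)
println(contains_simd(x, Int32(500_000)))
println(contains_simd(x, Int32(1_000_001), Val(16)))
```

Since the tail loop starts at `i`, the function also handles arrays shorter than `N`, including empty ones.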

## Exercise: Algorithm to sum over indices in an array

In [352]:
using BenchmarkTools
using LoopVectorization

n_clusters = 32
n_examples = 1_000_000
n_features = 128

T = Float32.(rand(n_clusters, n_features));                   # ADC table
PQ = UInt8.(rand(1:n_clusters, n_features, n_examples))    # PQcodes
y = Float32.(rand(n_features));

function lsh(PQ, T)
    
    n_features, n_examples = size(PQ)
    d = zeros(eltype(T), n_examples)
    
    @turbo for n in 1:n_examples
        res = zero(eltype(T))
        for j in 1:n_features
            res += T[PQ[j,n],j]    
        end
        d[n] = res
    end
    return d
end


lsh (generic function with 1 method)
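When optimizing a kernel like this one, it helps to have a plain reference implementation to compare results against. The sketch below uses no extra packages; `adc_sum_reference` and the tiny inputs `T_small` and `PQ_small` are hypothetical names and data introduced only so the result can be checked by hand.

```julia
# Plain-Julia reference (no SIMD packages): for every example n, add up the
# table entries selected by its codes, one per feature j.
function adc_sum_reference(PQ::AbstractMatrix{<:Integer}, T::AbstractMatrix)
    n_features, n_examples = size(PQ)
    d = zeros(eltype(T), n_examples)
    for n in 1:n_examples, j in 1:n_features
        d[n] += T[PQ[j, n], j]
    end
    return d
end

# Tiny hand-checkable input: 2 clusters, 3 features, 2 examples.
T_small  = Float32[1 10 100; 2 20 200]     # T_small[cluster, feature]
PQ_small = UInt8[1 2; 2 2; 1 1]            # PQ_small[feature, example]

# Example 1 uses codes (1, 2, 1): 1 + 20 + 100.
# Example 2 uses codes (2, 2, 1): 2 + 20 + 100.
println(adc_sum_reference(PQ_small, T_small))
```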

In [353]:
@benchmark lsh($PQ, $T)

BenchmarkTools.Trial: 132 samples with 1 evaluation.
 Range (min … max):  32.239 ms … 40.976 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.705 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.102 ms ±  1.890 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [73]:
idx = Vec((1, 3, 4, 7))
@benchmark aux = $arr[$idx] 

BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.678 ns … 18.731 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.696 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.718 ns ±  0.219 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [6]:
idx = Vec((1, 3, 4, 7))
T[idx]

<4 x Float32>[0.7432008, 0.17963971, 0.0061912057, 0.50137603]

In [7]:
PQt = Matrix(PQ');

In [74]:
idx = Vec(Tuple([i for i in 1:10]))

<10 x Int64>[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [9]:
PQt[idx]

<10 x UInt8>[0x16, 0x03, 0x06, 0x04, 0x17, 0x20, 0x06, 0x15, 0x20, 0x0c]

In [94]:
PQ[idx]

<10 x UInt8>[0x16, 0x15, 0x16, 0x1b, 0x03, 0x06, 0x0e, 0x14, 0x18, 0x20]

In [33]:
PQ[1:5, 1:5]

5×5 Matrix{UInt8}:
 0x16  0x03  0x06  0x04  0x17
 0x15  0x10  0x19  0x0f  0x0e
 0x16  0x01  0x1b  0x15  0x02
 0x1b  0x0b  0x0d  0x0b  0x01
 0x03  0x11  0x19  0x10  0x1f

In [34]:
size(PQ)

(128, 1000000)

We can use values in `PQ` to index `T`, but to do this we first need to load a `Vec` from the values in `PQ`.

In [49]:
PQ_per_coord = [Vector(row) for row in eachrow(PQ)];  # one vector of codes per feature (row of PQ)
eltype(PQ_per_coord)

Vector{UInt8} (alias for Array{UInt8, 1})

In [55]:
T_cols = [Vector(col) for col in eachcol(T)];
length(T_cols)

128

In [186]:
j = 1
pq_j_idx_vec = vload(Vec{16,UInt8}, PQ_per_coord[j], 1)
#pq_j_idx = PQ_per_coord[j][1:16];

<16 x UInt8>[0x16, 0x03, 0x06, 0x04, 0x17, 0x20, 0x06, 0x15, 0x20, 0x0c, 0x07, 0x04, 0x17, 0x08, 0x0d, 0x0b]

In [187]:
T_cols_j = T_cols[j];
T_cols_j_vec = vload(Vec{32,Float32}, T_cols_j, 1);

In [188]:
T_cols_j_vec

<32 x Float32>[0.7432008, 0.58594626, 0.17963971, 0.0061912057, 0.845229, 0.0151418755, 0.50137603, 0.74285394, 0.16654709, 0.31553423, 0.75937486, 0.5785777, 0.7701414, 0.9977113, 0.35985947, 0.6442237, 0.24016194, 0.65202904, 0.22425988, 0.014705921, 0.38223788, 0.17157936, 0.4026203, 0.39046642, 0.8976286, 0.9663715, 0.4591865, 0.3941384, 0.61383337, 0.22689447, 0.9064843, 0.65431994]

In [183]:
#@btime shufflevector(T_cols_j_vec, Val(Tuple(pq_j_idx_vec)))

In [191]:
#@btime shufflevector(T_cols_j_vec, Val(Tuple(pq_j_idx)))

In [213]:
# vgather appears to accept only Int64 indices, so the UInt8 codes are widened first.
pq_j_idx_vec = vload(Vec{16,Int64}, Int64.(PQ_per_coord[j]), 1)
vgather(T_cols_j, pq_j_idx_vec)

<16 x Float32>[0.17157936, 0.17963971, 0.0151418755, 0.0061912057, 0.4026203, 0.65431994, 0.0151418755, 0.38223788, 0.65431994, 0.5785777, 0.50137603, 0.0061912057, 0.4026203, 0.74285394, 0.7701414, 0.75937486]
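
Judging by the outputs above, `vgather(table, idx)` reads `table[idx[k]]` for every lane `k`, with one-based indices. A scalar sketch of that lookup, using the first six entries of `T_cols_j_vec` and lanes 2 to 4 of the index vector (`gather_scalar` is a hypothetical helper, not part of SIMD.jl):

```julia
# Scalar stand-in for a gather: pick table[i] for each index i.
gather_scalar(table::AbstractVector, idx) = [table[Int(i)] for i in idx]

# First six entries of T_cols_j_vec, copied from the output above.
table = Float32[0.7432008, 0.58594626, 0.17963971, 0.0061912057, 0.845229, 0.0151418755]
codes = UInt8[0x03, 0x06, 0x04]    # lanes 2 to 4 of pq_j_idx_vec

gathered = gather_scalar(table, codes)
```

The three values in `gathered` are the same as lanes 2 to 4 of the `vgather` result printed above.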

In [217]:
@btime  vgather(T_cols_j, pq_j_idx_vec)

  28.057 ns (1 allocation: 80 bytes)


<16 x Float32>[0.17157936, 0.17963971, 0.0151418755, 0.0061912057, 0.4026203, 0.65431994, 0.0151418755, 0.38223788, 0.65431994, 0.5785777, 0.50137603, 0.0061912057, 0.4026203, 0.74285394, 0.7701414, 0.75937486]

In [219]:
# Open question: why does vgather only work with Int64 indices? See
# https://github.com/eschnett/SIMD.jl/issues/98
#pq_j_idx_vec = vload(Vec{16,Int32}, Int32.(PQ_per_coord[j]), 1)
#vgather(T_cols_j, pq_j_idx_vec)

In [221]:
PQ_cols64 = [Vector(Int64.(col)) for col in eachcol(PQ)];

In [225]:
j=1
pq_j_idx_vec = vload(Vec{16,Int64}, PQ_cols64[j], 1)
vgather(T_cols_j, pq_j_idx_vec)

<16 x Float32>[0.17157936, 0.38223788, 0.17157936, 0.4591865, 0.17963971, 0.0151418755, 0.9977113, 0.014705921, 0.39046642, 0.65431994, 0.61383337, 0.9663715, 0.31553423, 0.61383337, 0.0061912057, 0.8976286]

In [346]:
function process_N_pqcodes(j, PQ_cols64, T_cols)
    # NOTE: the argument `j` is unused; the loop variable below shadows it.
    acc = Vec{8, Float32}((0,0,0,0,0,0,0,0))
    @inbounds for j in 1:128
        # Load the first 8 codes of PQ_cols64[j], then gather the matching entries of T_cols[j].
        pq_j_idx_vec = vload(Vec{8,Int64}, PQ_cols64[j], 1)
        acc += vgather(T_cols[j], pq_j_idx_vec)
    end
    return acc
end

process_N_pqcodes (generic function with 2 methods)

In [348]:
PQ_16 = Int64.(PQ[:,1:8])

println(process_N_pqcodes(1, PQ_cols64, T_cols) )
println(lsh(PQ_16, T))

<8 x Float32>[66.6573, 68.29665, 69.38001, 67.32883, 65.33147, 63.233086, 60.921135, 59.93974]
Float32[66.21081, 60.81546, 67.543236, 59.01086, 64.59701, 64.47578, 66.44614, 67.383545]
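
The two printed vectors disagree. That is what one would expect if `lsh` sums `T[PQ[j, n], j]` over the coordinates `j` (the expression used in `lsh_vec` in this notebook; `lsh` itself is defined elsewhere, so this is an assumption) while `PQ_cols64` holds columns of `PQ` rather than rows. A plain-Julia sketch on small synthetic data, with hypothetical names `lookup_sum` and `lookup_sum_per_coord`, suggests that building the per-coordinate vectors with `eachrow` reproduces the reference sum:

```julia
# Scalar reference: d[n] = sum over coordinates j of T[PQ[j, n], j].
function lookup_sum(PQ::AbstractMatrix{<:Integer}, T::AbstractMatrix)
    n_features, n_examples = size(PQ)
    d = zeros(eltype(T), n_examples)
    for n in 1:n_examples, j in 1:n_features
        d[n] += T[PQ[j, n], j]
    end
    return d
end

# Same sum, organised per coordinate: PQ_rows[j] holds the codes of
# coordinate j for every example, T_cols[j] the table of coordinate j.
function lookup_sum_per_coord(PQ_rows::Vector, T_cols::Vector, n::Int)
    acc = zeros(eltype(T_cols[1]), n)
    for j in eachindex(PQ_rows)
        codes = PQ_rows[j]
        for k in 1:n
            acc[k] += T_cols[j][codes[k]]
        end
    end
    return acc
end

# Small synthetic stand-ins for the notebook's T and PQ.
T_demo  = Float32[10j + c for c in 1:4, j in 1:3]   # 4 codes × 3 coordinates
PQ_demo = UInt8[1 2 3 4 1;
                4 3 2 1 2;
                2 2 1 3 4]                          # 3 coordinates × 5 examples

PQ_rows_demo = [Int64.(row) for row in eachrow(PQ_demo)]   # rows, not columns
T_cols_demo  = [Vector(col) for col in eachcol(T_demo)]

reference = lookup_sum(PQ_demo, T_demo)
per_coord = lookup_sum_per_coord(PQ_rows_demo, T_cols_demo, 5)
```

Here `reference == per_coord` holds. With the notebook's data, the analogous change would be to build the `Int64` index vectors from `eachrow(PQ)` before calling `process_N_pqcodes`.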


In [349]:
a = Vec{4, Int32}((1,2,3,4))
mask = (0,3,1,2)
shufflevector(a, Val(mask))

<4 x Int32>[1, 4, 2, 3]
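
The mask passed to `shufflevector` is zero-based, which is why the mask `(0,3,1,2)` selects elements 1, 4, 2 and 3 of `a`. A plain-tuple sketch of the same selection (`shuffle_tuple` is a hypothetical helper that only mimics the indexing; it is not part of SIMD.jl):

```julia
# Pick a[mask[i] + 1] for every position i: the +1 converts the
# zero-based mask entries to Julia's one-based tuple indices.
shuffle_tuple(a::NTuple{N,T}, mask::NTuple{M,Int}) where {N,T,M} =
    ntuple(i -> a[mask[i] + 1], Val(M))

a_demo = Int32.((1, 2, 3, 4))
shuffled = shuffle_tuple(a_demo, (0, 3, 1, 2))
```

The result has the same element order as the `Vec` printed above.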

In [340]:
@benchmark process_N_pqcodes(1, PQ_cols64, T_cols)

BenchmarkTools.Trial: 10000 samples with 200 evaluations.
 Range (min … max):  397.385 ns …  2.170 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     406.373 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   440.229 ns ± 79.869 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [337]:
@benchmark lsh(PQ_16, T)

BenchmarkTools.Trial: 10000 samples with 482 evaluations.
 Range (min … max):  225.683 ns …  1.976 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     283.249 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   301.109 ns ± 87.334 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

We can use `shufflevector` instead of `vgather`.

Accessing with `T_cols[j][pq_j_idx_vec]` is really slow.

In [321]:
function process_N_pqcodes_v2(j, PQ_cols64, T_cols)
    # NOTE: the argument `j` is unused; the loop variable below shadows it.
    acc = Vec{16, Float32}((0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0))
    @fastmath for j in 1:128
        pq_j_idx_vec = vload(Vec{16,Int64}, PQ_cols64[j], 1)
        # Index the table directly with a Vec instead of calling vgather.
        acc += T_cols[j][pq_j_idx_vec]
    end
    return acc
end

process_N_pqcodes_v2 (generic function with 1 method)

In [322]:
@benchmark process_N_pqcodes_shuff(1, PQ_cols64, T_cols)

BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):   7.058 μs …  49.442 ms  ┊ GC (min … max):  0.00% … 99.94%
 Time  (median):      9.838 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   15.892 μs ± 494.320 μs  ┊ GC (mean ± σ):  31.09% ±  1.00%

In [None]:
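function lsh_vec(PQ_cols::Vector, T_cols::Vector, n_batch = 1024)
    # PQ_cols[j]: codes of coordinate j for every example.
    # T_cols[j]:  lookup table of coordinate j.
    n_features = length(PQ_cols)
    n_examples = length(PQ_cols[1])
    d = zeros(eltype(T_cols[1]), n_examples)

    # Process the examples in blocks of n_batch.
    for start in 1:n_batch:n_examples
        stop = min(start + n_batch - 1, n_examples)
        for j in 1:n_features
            codes, table = PQ_cols[j], T_cols[j]
            @inbounds for n in start:stop
                d[n] += table[codes[n]]   # same as T[PQ[j,n], j]
            end
        end
    end
    return d
end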

In [133]:
Int.(PQt[idx])

LoadError: MethodError: no method matching iterate(::Vec{10, UInt8})
Closest candidates are:
  iterate(::Union{LinRange, StepRangeLen}) at range.jl:872
  iterate(::Union{LinRange, StepRangeLen}, ::Integer) at range.jl:872
  iterate(::T) where T<:Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}} at dict.jl:712
  ...

In [318]:
v = PQ[:,1]
pq_1 = PQ[:,1]
pq_1_vec = vload(Vec{128,UInt8}, v, 1)

<128 x UInt8>[0x01, 0x14, 0x0e, 0x09, 0x11, 0x1c, 0x09, 0x17, 0x02, 0x0f, 0x0e, 0x19, 0x17, 0x07, 0x1c, 0x04, 0x08, 0x03, 0x19, 0x04, 0x04, 0x06, 0x18, 0x01, 0x10, 0x16, 0x0b, 0x0a, 0x0f, 0x17, 0x19, 0x1a, 0x04, 0x0a, 0x0a, 0x0f, 0x1a, 0x0b, 0x07, 0x03, 0x1d, 0x01, 0x08, 0x14, 0x1c, 0x14, 0x02, 0x18, 0x0c, 0x18, 0x12, 0x18, 0x12, 0x13, 0x0b, 0x02, 0x0c, 0x01, 0x13, 0x0d, 0x14, 0x03, 0x15, 0x06, 0x04, 0x1e, 0x0f, 0x03, 0x05, 0x1e, 0x07, 0x1a, 0x02, 0x1f, 0x1e, 0x09, 0x13, 0x06, 0x01, 0x0f, 0x13, 0x10, 0x0f, 0x17, 0x1e, 0x13, 0x13, 0x1f, 0x20, 0x11, 0x11, 0x0b, 0x0c, 0x20, 0x12, 0x10, 0x13, 0x15, 0x1c, 0x07, 0x0b, 0x07, 0x11, 0x05, 0x12, 0x16, 0x11, 0x17, 0x0a, 0x12, 0x02, 0x20, 0x06, 0x1f, 0x1f, 0x0b, 0x11, 0x09, 0x09, 0x0b, 0x1b, 0x0d, 0x0b, 0x08, 0x09, 0x11, 0x08, 0x05]

In [328]:
T_v = Vec((1.,2.,3.,4.,5.,6.,7.,8.))
pq_idx = Vec(Int.((3, 3, 3, 3, 3, 3, 3, 2)))
x_idx = vgather(x, idx)

<8 x Float32>[0.5049787, 0.5049787, 0.5049787, 0.5049787, 0.5049787, 0.5049787, 0.5049787, 0.673387]

In [371]:
T_v = vload(Vec{32,Float32}, T[:,1], 1)
pq_idx = vload(Vec{128,UInt8}, PQ[:,1], 1);
pq_decoded = vgather(T_1, Vec(Int.(PQ[:,1])...));

In [370]:
@benchmark sum(pq_decoded)

BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  27.642 ns … 206.458 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.426 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.122 ns ±   8.975 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [377]:
Int.(round(T[:,1]*1000))

LoadError: MethodError: no method matching round(::Vector{Float32})
Closest candidates are:
  round(::Union{Float16, Float32, Float64}, ::RoundingMode{:ToZero}) at float.jl:367
  round(::Union{Float16, Float32, Float64}, ::RoundingMode{:Down}) at float.jl:368
  round(::Union{Float16, Float32, Float64}, ::RoundingMode{:Up}) at float.jl:369
  ...
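
The `MethodError` arises because `round` is defined for scalars, not for a `Vector`; broadcasting it with a dot applies it element by element. A minimal sketch, using three values copied from the `T[:,1]` output shown elsewhere in this notebook (the names `t_demo` and `scaled` are illustrative):

```julia
# Three entries of T[:,1], copied from the notebook output.
t_demo = Float32[0.95423836, 0.673387, 0.5049787]

# Scale, then round each element directly to Int via broadcasting.
scaled = round.(Int, t_demo .* 1000)
```

Passing `Int` as the first argument of `round` avoids a separate `Int.(...)` conversion.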

In [256]:
@benchmark sum(pq_1_vec)

BenchmarkTools.Trial: 10000 samples with 998 evaluations.
 Range (min … max):  12.668 ns … 135.888 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.988 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.474 ns ±   4.001 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [257]:
@benchmark sum(pq_1)

BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  33.901 ns … 246.870 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     35.841 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   41.170 ns ±  12.795 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

We could

In [267]:
table_vec = vload(Vec{32,Float32}, T[:,1], 1)

<32 x Float32>[0.95423836, 0.673387, 0.5049787, 0.48568603, 0.3124652, 0.65885574, 0.5401918, 0.623749, 0.40774092, 0.94546694, 0.47636735, 0.89755404, 0.5313211, 0.04663205, 0.61256987, 0.4160553, 0.9608643, 0.1554406, 0.9803495, 0.9772245, 0.35994515, 0.22976878, 0.5784012, 0.12941517, 0.33469325, 0.92864996, 0.6486976, 0.41019225, 0.90692693, 0.8520791, 0.87212634, 0.012455661]

In [279]:
#vload(Vec{32,Float32}, T[:,1], 1)
#pq_1_vec

In [284]:
#vload(Vec{32,Float32}, T[:,1], 1)

In [314]:
idx = PQ[:,1];

In [302]:
Int(maximum(idx))

32

In [310]:
#vgather(T[:,1], idx)

In [313]:
# T[:,1]

In [176]:
x = T[:,1] ;

In [214]:
a1_v = Vec((1.,2.,3.,4.,5.,6.,7.,8.))
idx = Vec(Int.((3, 3, 3, 3, 3, 3, 3, 2)))
x_idx = vgather(x, idx)

<8 x Float32>[0.5049787, 0.5049787, 0.5049787, 0.5049787, 0.5049787, 0.5049787, 0.5049787, 0.673387]

In [211]:
#vgather(x, idx)

In [193]:
dd

<8 x Int64>[3, 3, 3, 3, 3, 3, 3, 2]

In [147]:
i = 1
arr = zeros(Float64, 100)  # zero-filled so the loaded values are well defined
xs = vload(Vec{4,Float64}, arr, i)

<4 x Float64>[0.0, 0.0, 0.0, 0.0]

In [146]:
xs + xs

<4 x Float64>[0.0, 0.0, 0.0, 0.0]