# High-performance Julia code


#### Devectorize and NumericExtensions packages

- https://github.com/lindahua/NumericExtensions.jl
- https://github.com/lindahua/Devectorize.jl


#### Write non-vectorized code

- http://www.juliabloggers.com/fast-numeric-computation-in-julia/



#### Using SIMD instructions in Julia 
- http://ucidatascienceinitiative.github.io/IntroToJulia/Slides/HPCJulia#/

- http://www.juliabloggers.com/optimizing-julia-for-performance-a-practical-example/

- https://github.com/eschnett/SIMD.jl

#### Let us test the vectorized NumPy/MATLAB way

In [132]:
dot(x,x')>0.

true

In [None]:
x*x'>0

In [115]:
srand(1234)
len = 100000;

x = randn(len);
y = randn(len);

In [116]:
# optimized version
# 0.000081 seconds (5 allocations: 176 bytes)

In [122]:
@time begin a=x-y; dot(a,a)/length(a) end

  0.000412 seconds (9 allocations: 781.531 KB)


2.002124027318451

In [120]:
@time begin sum((x - y).^2)./length(x) end

  0.035292 seconds (6.07 k allocations: 1.816 MB)


2.0021240273184535

In [63]:
0.03/0.000082 

365.8536585365853

In [514]:
print(@benchmark sum((x - y).^2)/length(x))

BenchmarkTools.Trial: 
  memory estimate:  1.53 mb
  allocs estimate:  30
  --------------
  minimum time:     308.366 μs (0.00% GC)
  median time:      731.433 μs (0.00% GC)
  mean time:        838.502 μs (17.75% GC)
  maximum time:     4.225 ms (66.62% GC)
  --------------
  samples:          5945
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
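
The 1.53 MB reported above is roughly two temporary arrays of 100000 Float64 values (about 781 KB each): one for x - y and one for the squared result. A minimal sketch of the same quantity computed without those temporaries, using a generator. The name mse_generator is hypothetical, and no timing is claimed for it here:

```julia
# Hypothetical helper: the generator yields one squared difference at a
# time, so no temporary array of length(x) is materialized.
mse_generator(x, y) = sum(abs2(a - b) for (a, b) in zip(x, y)) / length(x)

x_demo = [1.0, 2.0, 3.0]
y_demo = [0.0, 0.0, 0.0]
println(mse_generator(x_demo, y_demo))   # (1 + 4 + 9) / 3
```

Running it under @benchmark next to the vectorized expression would be the way to check whether it actually helps on a given machine.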

#### For loop 

In [32]:
function l2_squared(x::Array{Float64},y::Array{Float64})
    norm = 0.
    for i in 1:length(x)
        norm = norm + (x[i] - y[i])^2
    end
    return norm/length(x)
end

l2_squared (generic function with 1 method)

In [84]:
@time l2_squared(x,y)

  0.000211 seconds (5 allocations: 176 bytes)


2.0021240273184464

In [46]:
print(@benchmark l2_squared(x,y))

LoadError: UndefVarError: @benchmark not defined

#### @inbounds alone does not bring any improvement

In [76]:
function l2_squared_inbounds(x::Array{Float64},y::Array{Float64})
    norm = 0.
    @inbounds begin
        for i in 1:length(x)
            norm += (x[i] - y[i])^2
        end
    end
    return norm/length(x)
end

l2_squared_inbounds (generic function with 1 method)

In [83]:
@time l2_squared_inbounds(x,y)

  0.000200 seconds (5 allocations: 176 bytes)


2.0021240273184464

In [511]:
print(@benchmark l2_squared_inbounds(x,y))

BenchmarkTools.Trial: 
  memory estimate:  16.00 bytes
  allocs estimate:  1
  --------------
  minimum time:     91.080 μs (0.00% GC)
  median time:      91.523 μs (0.00% GC)
  mean time:        102.975 μs (0.00% GC)
  maximum time:     720.031 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

#### Improve the speed of l2_squared with @simd

We will now use the @simd macro in a for loop. Note that this does not make every loop faster. In particular, using SIMD implies that the order of operations within and across loop iterations might change. The macro tells the compiler that such reordering is safe before it attempts to vectorize the loop. Therefore, before adding the @simd annotation to your code, you need to ensure that the loop has the following properties:

- All iterations of the loop are independent of each other. No iteration uses a value produced by a previous iteration or waits for its completion. The one exception is a reduction variable, such as the accumulator norm in the function below, which @simd is explicitly allowed to reorder.

- The arrays being operated upon within the loop do not overlap in memory.

- The loop body is straight-line code without branches or function calls.

- The number of iterations of the loop is obvious. In practical terms, this means that the loop should typically be expressed in terms of the length of the arrays within it.

- The subscript (or index variable) within the loop changes by one for each iteration. In other words, the subscript is unit stride.

- Bounds checking must be disabled inside SIMD loops, which is what @inbounds does. (Bounds checking can cause branches due to exceptional conditions.)
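
To make the first property concrete, here is a small sketch contrasting a loop that satisfies it with one that does not. Both function names (scaled_add! and running_sum!) are hypothetical and only serve as illustration:

```julia
# Independent iterations: y[i] depends only on x[i] and the old y[i],
# so the iterations may run in any order and @simd is appropriate.
function scaled_add!(a, x, y)
    length(x) == length(y) || throw(DimensionMismatch("x and y differ in length"))
    @inbounds @simd for i in 1:length(x)
        y[i] += a * x[i]
    end
    return y
end

# Loop-carried dependency: iteration i reads x[i-1], which iteration i-1
# has just written. Reordering would change the result, so no @simd here.
function running_sum!(x)
    for i in 2:length(x)
        x[i] += x[i-1]
    end
    return x
end

println(scaled_add!(2.0, [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
println(running_sum!([1.0, 2.0, 3.0]))
```

The explicit length check in scaled_add! is there because @inbounds removes the safety net that would otherwise catch a mismatch.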


In [364]:
typeof(x)

Array{Float64,1}

In [471]:
function l2_squared_inbounds_simd(x::Array{Float64},y::Array{Float64})
    norm = 0.
    n = length(x)
    @inbounds @simd for i in 1:n
        norm += (x[i] - y[i])^2
    end

    return norm/length(x)
end

l2_squared_inbounds_simd (generic function with 4 methods)

In [473]:
@time l2_squared_inbounds_simd(x,y)

  0.000135 seconds (5 allocations: 176 bytes)


2.0021240273184526

In [474]:
print(@benchmark l2_squared_inbounds_simd(x,y))

BenchmarkTools.Trial: 
  memory estimate:  16.00 bytes
  allocs estimate:  1
  --------------
  minimum time:     44.055 μs (0.00% GC)
  median time:      44.524 μs (0.00% GC)
  mean time:        48.039 μs (0.00% GC)
  maximum time:     177.770 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
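
The @simd result (2.0021240273184526) differs from the plain loop (2.0021240273184464) in its last digits. That is consistent with the reordering described above, because floating-point addition is not associative. A tiny illustration with plain Float64 values, independent of the arrays used in this notebook:

```julia
# The same three numbers summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

println(left_to_right)
println(right_to_left)
println(left_to_right == right_to_left)   # false: the grouping matters
```

A SIMD reduction keeps several partial sums and combines them at the end, which is one such regrouping.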

#### SIMD instructions might benefit from lower-precision floats

In [109]:
len = 100000
srand(1234)
x32 = Array{Float32}(randn(len));
y32 = Array{Float32}(randn(len));

function l2_squared_inbounds_simd(x::Array{Float32},y::Array{Float32})
    norm = 0.
    n = length(x)
    @inbounds @simd for i in 1:n
        norm += (x[i] - y[i])^2
    end

    return norm/length(x)
end

l2_squared_inbounds_simd (generic function with 1 method)

In [114]:
@time l2_squared_inbounds_simd(x32,y32)

  0.000081 seconds (5 allocations: 176 bytes)


2.002124028294853

In [464]:
using BenchmarkTools

In [518]:
print(@benchmark l2_squared_inbounds_simd(x32,y32))

BenchmarkTools.Trial: 
  memory estimate:  16.00 bytes
  allocs estimate:  1
  --------------
  minimum time:     39.090 μs (0.00% GC)
  median time:      39.354 μs (0.00% GC)
  mean time:        44.881 μs (0.00% GC)
  maximum time:     195.329 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
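
One thing to keep in mind when reading the Float32 numbers: the method above initializes norm = 0., a Float64, so every squared Float32 difference is widened before it is accumulated. A possible variant, sketched here under the assumption that a Float32 accumulator is acceptable for the accuracy needed, keeps the accumulator in the element type. The name l2_squared_generic is hypothetical, and this version was not benchmarked in this notebook:

```julia
# Sketch: accumulator has the same type T as the array elements, so a
# Float32 input is reduced entirely in Float32.
function l2_squared_generic(x::Vector{T}, y::Vector{T}) where {T<:AbstractFloat}
    length(x) == length(y) || throw(DimensionMismatch("x and y differ in length"))
    acc = zero(T)
    @inbounds @simd for i in 1:length(x)
        acc += (x[i] - y[i])^2
    end
    return acc / length(x)
end

r32 = l2_squared_generic(Float32[1, 2, 3], Float32[0, 0, 0])
println(typeof(r32))   # Float32
```

The trade-off is precision: summing 100000 terms in Float32 loses more digits than summing them in Float64.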

#### Going to Float16: no improvement!

In [487]:
srand(1234)
len = 100000

x16 = Array{Float16}(randn(len));
y16 = Array{Float16}(randn(len));

function l2_squared_inbounds_simd(x::Array{Float16},y::Array{Float16})
    norm = 0.
    l = Float16(length(x))
    @inbounds @simd for i in 1:length(x)
        norm += (x[i] - y[i])^2
    end

    return norm/l
end

l2_squared_inbounds_simd (generic function with 4 methods)

In [491]:
@time l2_squared_inbounds_simd(x16,y16)

  0.005248 seconds (5 allocations: 176 bytes)


0.0

In [493]:
print(@benchmark l2_squared_inbounds_simd(x16,y16))

BenchmarkTools.Trial: 
  memory estimate:  16.00 bytes
  allocs estimate:  1
  --------------
  minimum time:     3.616 ms (0.00% GC)
  median time:      3.899 ms (0.00% GC)
  mean time:        4.141 ms (0.00% GC)
  maximum time:     16.481 ms (0.00% GC)
  --------------
  samples:          1207
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
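
The 0.0 returned by the Float16 method is not a real mean: the largest finite Float16 is 65504, so Float16(length(x)) with length 100000 overflows to infinity, and dividing the accumulated sum by infinity gives zero. A short check of that reasoning:

```julia
n = 100000
l = Float16(n)          # 100000 exceeds the largest finite Float16 (65504)

println(isinf(l))       # true
println(1.0 / l)        # 0.0: any finite sum divided by Inf is zero
```

Dividing by length(x) directly, as the Float64 and Float32 methods do, avoids the overflow.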

# Using Yeppp for math operations 

I found this particularly ugly (having to write Yeppp before every operation is not pretty).


It would be nice to know how to create an alias and use all implementations from Yeppp without
writing Yeppp every time.

- http://www.yeppp.info/#arguments
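
One way to shorten the prefix is to bind the module to a short constant name. The sketch below shows the pattern on LinearAlgebra, which ships with Julia 1.x (an assumption about the Julia version; this notebook itself uses older syntax such as srand). The alias LA is a hypothetical name. The same binding, const Y = Yeppp followed by Y.sin(x), is presumed to work for Yeppp but was not tested here:

```julia
import LinearAlgebra

# A module is an ordinary value, so it can be bound to a shorter name.
const LA = LinearAlgebra

v = [3.0, 4.0]
println(LA.norm(v))     # same function as LinearAlgebra.norm
```

Keeping a prefix, even a short one, avoids clashes with the Base functions of the same name such as sin and exp.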

In [521]:
using Yeppp 

In [556]:
@time Yeppp.sin(x);

  0.000454 seconds (6 allocations: 781.484 KB)


In [557]:
@time [sin(xi) for xi in x];

  0.002067 seconds (7 allocations: 781.500 KB)


In [569]:
@time Yeppp.exp(x)/Yeppp.sum(x);

  0.000557 seconds (10 allocations: 1.526 MB)


In [577]:
@time exp(x)/sum(x);

  0.003806 seconds (79 allocations: 1.537 MB)


# Parallel Accelerator

- https://github.com/IntelLabs/ParallelAccelerator.jl