# High-performance Julia code


#### Devectorize and NumericExtensions packages

- https://github.com/lindahua/NumericExtensions.jl
- https://github.com/lindahua/Devectorize.jl


#### Write non-vectorized code

- http://www.juliabloggers.com/fast-numeric-computation-in-julia/



#### Using SIMD instructions in Julia 
- http://ucidatascienceinitiative.github.io/IntroToJulia/Slides/HPCJulia#/

- http://www.juliabloggers.com/optimizing-julia-for-performance-a-practical-example/

- https://github.com/eschnett/SIMD.jl

# Part 1

## Example: Monte Carlo estimate of pi

Let us play with an example from 



In [8]:
n_cores = 4

4

In [1]:
workers()

1-element Array{Int64,1}:
 1

In [2]:
addprocs(4) 

4-element Array{Int64,1}:
 2
 3
 4
 5

In [3]:
workers()

4-element Array{Int64,1}:
 2
 3
 4
 5

In [4]:
@everywhere function compute_pi(N::Int)
    # Compute pi with a Monte Carlo simulation of N darts thrown in [-1,1]^2.
    # Returns the estimate of pi.
    # counts number of points that have radial coordinate < 1, i.e. in circle
    n_landed_in_circle = 0  
    for i = 1:N
        x = rand() * 2 - 1  # uniformly distributed number on x-axis
        y = rand() * 2 - 1  # uniformly distributed number on y-axis

        r2 = x*x + y*y  # radius squared, in radial coordinates
        if r2 < 1.0
            n_landed_in_circle += 1
        end
    end

    return n_landed_in_circle / N * 4.0    
end

In [5]:
compute_pi(10)

@time compute_pi(1000_000_000)

  9.270990 seconds (131 allocations: 7.734 KB)


3.14155188
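
For reference, the factor 4 in compute_pi comes from a ratio of areas. A short derivation follows; the standard-deviation term assumes the darts are independent uniform samples:

```latex
P(x^2 + y^2 < 1)
  = \frac{\text{area of unit disk}}{\text{area of } [-1,1]^2}
  = \frac{\pi}{4}
\quad\Longrightarrow\quad
\hat{\pi} = 4 \cdot \frac{n_{\text{landed in circle}}}{N},
\qquad
\operatorname{sd}(\hat{\pi}) = \sqrt{\frac{\pi\,(4 - \pi)}{N}}
```

For N = 10^9 this gives a standard deviation of roughly 5e-5, which is consistent with the estimate 3.14155188 printed above.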

#### Let us go parallel

In [6]:
N = Int(1000_000_000)


1000000000

In [9]:
result = pmap(compute_pi,[Int(N/n_cores) for core in 1:n_cores])

4-element Array{Any,1}:
 3.14162
 3.14143
 3.14159
 3.14154

In [11]:
@time mean(pmap(compute_pi,[Int(N/n_cores) for core in 1:n_cores]))

  3.892370 seconds (14.25 k allocations: 671.349 KB)


3.141607924
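
Note that Int(N/n_cores) only works when N is divisible by n_cores; otherwise the conversion throws an InexactError. A minimal sketch of one way around this, using a hypothetical helper split_work (not part of the notebook) that hands the leftover darts to the first chunks:

```julia
# Hypothetical helper: split N darts into k chunk sizes that sum exactly to N.
function split_work(N::Int, k::Int)
    base, extra = divrem(N, k)   # base chunk size and number of leftover darts
    # the first `extra` chunks get one additional dart each
    return [base + (i <= extra ? 1 : 0) for i in 1:k]
end

chunks = split_work(1_000_000_003, 4)

# With unequal chunks the per-chunk estimates should be weighted by chunk size:
#   estimates = pmap(compute_pi, chunks)
#   pi_hat = sum(estimates .* chunks) / sum(chunks)
```

When all chunks have the same size, the weighted mean reduces to the plain mean used above.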

In [None]:
function par_pi_computation(N::Int64; ncores::Int64=4)
    # Compute pi in parallel, over ncores cores, with a Monte Carlo simulation
    # throwing N total darts.

    # compute sum of pi's estimated among all cores in parallel
    sum_of_pis = @parallel (+) for i=1:ncores
        compute_pi(Int(N / ncores))
    end

    return sum_of_pis / ncores  # average value
end

In [None]:
@time par_pi_computation(1000_000_000)
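
The cells above use the Julia 0.x macro @parallel. As a hedged sketch, assuming Julia 1.x, the same reduction could be written with @distributed from the Distributed standard library. The names mc_pi and par_pi_distributed are hypothetical stand-ins for compute_pi and par_pi_computation. If no workers have been added, the loop should simply run on the local process:

```julia
using Distributed   # stdlib in Julia 1.x: @distributed, @everywhere, addprocs

# Hypothetical stand-in for compute_pi, defined on every available process.
@everywhere function mc_pi(N::Int)
    hits = 0
    for _ in 1:N
        x = 2rand() - 1          # uniform on [-1, 1)
        y = 2rand() - 1
        hits += (x*x + y*y < 1.0)
    end
    return 4hits / N
end

function par_pi_distributed(N::Int; nchunks::Int=4)
    # sum the per-chunk estimates across processes with a (+) reduction
    total = @distributed (+) for i in 1:nchunks
        mc_pi(div(N, nchunks))
    end
    return total / nchunks       # average of the chunk estimates
end

estimate = par_pi_distributed(4_000_000)
```

Calling addprocs(4) before the @everywhere definition would spread the chunks over four workers, as in the notebook.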

# Part 2
### Let us test the NumPy/MATLAB way

In [None]:
dot(x,x')>0.

In [None]:
x*x'>0

In [None]:
srand(1234)
len = 100000;

x = randn(len);
y = randn(len);

In [None]:
# optimized version
# 0.000081 seconds (5 allocations: 176 bytes)

In [None]:
@time begin a=x-y; dot(a,a)/length(a) end

In [None]:
@time begin sum((x - y).^2)./length(x) end

In [None]:
0.03/0.000082 

In [None]:
using BenchmarkTools  # provides @benchmark
print(@benchmark sum((x - y).^2)/length(x))
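
Part of the cost of the vectorized expression is the temporary array it allocates. A minimal sketch, assuming Julia 1.x (where srand is spelled Random.seed!), comparing it with a generator that allocates no intermediate array. The names mse_vectorized and mse_generator are hypothetical:

```julia
using Random

Random.seed!(1234)      # Julia 1.x spelling of srand(1234)
xs = randn(100_000)
ys = randn(100_000)

# Vectorized form: the broadcast materializes a temporary array before summing.
mse_vectorized(a, b) = sum((a .- b) .^ 2) / length(a)

# Generator form: elements are consumed one at a time, no temporary array.
mse_generator(a, b) = sum((ai - bi)^2 for (ai, bi) in zip(a, b)) / length(a)

println(mse_vectorized(xs, ys))
println(mse_generator(xs, ys))
```

The two results may differ in the last digits because the summation order is not guaranteed to be the same.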

#### For loop 

In [None]:
function l2_squared(x::Array{Float64},y::Array{Float64})
    norm = 0.
    for i in 1:length(x)
        norm = norm + (x[i] - y[i])^2
    end
    return norm/length(x)
end

In [None]:
@time l2_squared(x,y)

In [None]:
print(@benchmark l2_squared(x,y))

#### @inbounds alone does not bring any improvement

In [None]:
function l2_squared_inbounds(x::Array{Float64},y::Array{Float64})
    norm = 0.
    @inbounds begin
        for i in 1:length(x)
            norm += (x[i] - y[i])^2
        end
    end
    return norm/length(x)
end

In [None]:
@time l2_squared_inbounds(x,y)

In [None]:
print(@benchmark l2_squared_inbounds(x,y))

#### Improve the speed of l2_squared with @simd

We now use the @simd macro on a for loop. Note that it does not make every loop faster. Using SIMD means that the order of operations within and across iterations may change; the macro asserts to the compiler that such reordering is safe, so that it can vectorize the loop. Before adding the @simd annotation to your code, make sure the loop has the following properties:

- All iterations of the loop are independent of each other: no iteration uses a value from a previous iteration or waits for its completion. (A reduction variable, such as the accumulator norm in l2_squared, is the exception: @simd is allowed to reorder the accumulation.)
- The arrays being operated upon within the loop do not overlap in memory.
- The loop body is straight-line code without branches or function calls.
- The number of iterations of the loop is obvious. In practical terms, the loop should typically run over the length of the arrays within it.
- The subscript (or index variable) within the loop changes by one for each iteration. In other words, the subscript is unit stride.
- Bounds checking is disabled (with @inbounds) in SIMD loops, since bounds checks introduce branches for the error case.
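
The first property in the list above can be illustrated with two small loops. This is a sketch assuming Julia 1.x, and both function names are hypothetical. In the first loop the iterations only share an accumulator, so reordering is acceptable. In the second, each iteration reads the result of the previous one, so the loop is left without @simd:

```julia
# Reduction: iterations only share the accumulator `acc`, so @simd may reorder them.
function sumsq_simd(v::Vector{Float64})
    acc = 0.0
    @inbounds @simd for i in 1:length(v)
        acc += v[i]^2
    end
    return acc
end

# Loop-carried dependence: out[i] needs out[i-1], so this stays a plain loop.
function running_sum(v::Vector{Float64})
    out = similar(v)
    isempty(v) && return out
    out[1] = v[1]
    for i in 2:length(v)
        out[i] = out[i-1] + v[i]
    end
    return out
end

v = collect(1.0:8.0)
println(sumsq_simd(v))
println(running_sum(v))
```

For general floating-point data the reordered reduction can differ from a strictly sequential sum in the last bits, which is the price of letting the compiler vectorize it.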


In [None]:
typeof(x)

In [None]:
function l2_squared_inbounds_simd(x::Array{Float64},y::Array{Float64})
    norm = 0.
    n = length(x)
    @inbounds @simd for i in 1:n
        norm += (x[i] - y[i])^2
    end

    return norm/length(x)
end

In [None]:
@time l2_squared_inbounds_simd(x,y)

In [None]:
print(@benchmark l2_squared_inbounds_simd(x,y))
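
Since @simd does not speed up every loop, it can help to check whether the compiler actually emitted vector instructions. One possible check, sketched here under the assumption of Julia 1.x, is to search the LLVM IR for vector types such as <4 x double>. The function l2_simd_check is a hypothetical copy of the loop above, and whether vector types appear depends on the CPU and the Julia version:

```julia
using InteractiveUtils   # stdlib; provides code_llvm

function l2_simd_check(x::Vector{Float64}, y::Vector{Float64})
    acc = 0.0
    @inbounds @simd for i in 1:length(x)
        acc += (x[i] - y[i])^2
    end
    return acc / length(x)
end

# Capture the LLVM IR for this method as a string.
ir = sprint(code_llvm, l2_simd_check, Tuple{Vector{Float64}, Vector{Float64}})

# Vectorized code uses LLVM vector types such as <2 x double> or <4 x double>.
vectorized = occursin(r"<\d+ x double>", ir)
println("vector types found in IR: ", vectorized)
```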

#### SIMD instructions might benefit from lower-precision floats

In [None]:
len = 100000
srand(1234)
x32 = Array{Float32}(randn(len));
y32 = Array{Float32}(randn(len));

function l2_squared_inbounds_simd(x::Array{Float32},y::Array{Float32})
    # note: 0. is a Float64 literal, so each Float32 term is widened before the add
    norm = 0.
    n = length(x)
    @inbounds @simd for i in 1:n
        norm += (x[i] - y[i])^2
    end

    return norm/length(x)
end

In [None]:
@time l2_squared_inbounds_simd(x32,y32)

In [None]:
using BenchmarkTools

In [None]:
print(@benchmark l2_squared_inbounds_simd(x32,y32))

#### Going to Float16: no improvement!

In [None]:
srand(1234)
len = 100000

x16 = Array{Float16}(randn(len));
y16 = Array{Float16}(randn(len));

function l2_squared_inbounds_simd(x::Array{Float16},y::Array{Float16})
    norm = 0.
    n = length(x)
    @inbounds @simd for i in 1:n
        norm += (x[i] - y[i])^2
    end

    # divide by the integer length: Float16(100000) overflows to Inf16
    return norm/n
end

In [None]:
@time l2_squared_inbounds_simd(x16,y16)

In [None]:
print(@benchmark l2_squared_inbounds_simd(x16,y16))
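
The three methods above differ only in the element type. As a sketch (assuming Julia 1.x syntax, with the hypothetical name l2_squared_generic), a single parametric method could cover them, with the accumulator taking the element type so that no widening happens inside the loop:

```julia
function l2_squared_generic(x::Vector{T}, y::Vector{T}) where {T<:AbstractFloat}
    # @inbounds is only safe if both arrays have the same length
    length(x) == length(y) || throw(DimensionMismatch("x and y must have equal length"))
    acc = zero(T)                 # accumulator in the element type
    @inbounds @simd for i in 1:length(x)
        acc += (x[i] - y[i])^2
    end
    return acc / length(x)
end

a32 = Float32[1, 2, 3]
b32 = Float32[0, 0, 0]
r32 = l2_squared_generic(a32, b32)                        # stays in Float32
r64 = l2_squared_generic([1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

The trade-off is accuracy: a Float32 (and even more so a Float16) accumulator loses precision and can overflow on long arrays, which the Float64 accumulator in the notebook's methods avoids.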

# Using Yeppp for math operations 

I found this particularly ugly (having a Yeppp prefix before every operation is not pretty).


It would be nice to know how to create an alias and use all implementations from Yeppp without
writing Yeppp every time.

- http://www.yeppp.info/#arguments
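
On the alias question: a Julia module is an ordinary value, so it can be bound to a short constant name. The sketch below uses the standard library LinearAlgebra as a stand-in, since Yeppp may not be installed. The Yeppp lines in the comments are an untested assumption that the same pattern carries over:

```julia
import LinearAlgebra          # stand-in module; `import` brings in only the module name
const LA = LinearAlgebra      # short alias for the module

println(LA.dot([1.0, 2.0], [3.0, 4.0]))

# Assumed equivalent for Yeppp (not tested here):
#   import Yeppp
#   const Y = Yeppp
#   Y.sin(x)
```

Julia 1.6 and later also accept the form import LinearAlgebra as LA.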

In [None]:
using Yeppp 

In [None]:
@time Yeppp.sin(x);

In [None]:
@time [sin(xi) for xi in x];

In [None]:
@time Yeppp.exp(x)/Yeppp.sum(x);

In [None]:
@time exp(x)/sum(x);

# Parallel Accelerator

- https://github.com/IntelLabs/ParallelAccelerator.jl