# Parallel computing in Julia
This notebook shows how parallel computation can speed up expensive computations.

This example requires `QuadGK`, `Distributed`, `BenchmarkTools` and `SharedArrays`. `Distributed` and `SharedArrays` ship with Julia as standard libraries. If any of the packages is not already installed, please run the following code:

```julia
using Pkg
Pkg.add("QuadGK")
Pkg.add("Distributed")
Pkg.add("BenchmarkTools")
Pkg.add("SharedArrays")
``` 



In [1]:
using Distributed
using BenchmarkTools

CPUcores=4

addprocs(CPUcores) # One should add n workers, where n is the number of available CPU cores
print(workers())

[2, 3, 4, 5]
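
Rather than hard-coding the core count, the number of workers can be derived from the machine. The following is a minimal sketch, assuming that `Sys.CPU_THREADS` (logical threads, which may exceed the number of physical cores) is an acceptable proxy; `target` and `existing` are illustrative names that do not appear elsewhere in this notebook.

```julia
using Distributed

# Logical CPU threads reported by Julia; this may be larger than the
# number of physical cores on hyper-threaded machines.
target = min(Sys.CPU_THREADS, 4)   # capped at 4, as in the cell above

# Only add the workers that are still missing, so that re-running the
# cell does not keep spawning new processes.
existing = nprocs() - 1            # processes other than the master
if existing < target
    addprocs(target - existing)
end

println("workers: ", workers())
```

Guarding `addprocs` this way is a design choice for notebooks, where cells are often executed more than once.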

I now need to load the library required for computing the integral (`QuadGK`) and the library that provides `SharedArray`s.
Since every process needs to be able to calculate integrals and operate on the arrays, I load the libraries with `@everywhere`.

In [2]:
@everywhere using QuadGK
@everywhere using SharedArrays
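
To illustrate what `@everywhere` does, here is a small sketch that defines a helper on all processes and calls it remotely. The function `triple` is a hypothetical example, not part of this notebook; if no workers have been added, `workers()` is `[1]` and the call simply runs on the master process.

```julia
using Distributed

# Define a helper on the master and on every worker currently attached.
@everywhere triple(x) = 3x

# Ask each worker to evaluate the helper and send the result back.
results = [remotecall_fetch(triple, p, 7) for p in workers()]
println(results)
```

Without `@everywhere`, the definition would exist only on the master, and a remote call from a worker would fail because the function is unknown there.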

Define the Euler $\Gamma$ function:
$$\Gamma(z)=\int_{0}^{\infty} x^{z-1}e^{-x}\,dx$$

In [3]:
@everywhere Γ(z)=quadgk(x->x^(z-1)*exp(-x),0,Inf32)
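
As a sanity check of the quadrature, the identity $\Gamma(n)=(n-1)!$ for positive integers can be used. This is a sketch only: `gamma_quad` is an illustrative name, and it uses `Inf` instead of `Inf32` as the upper limit, which is an assumption made here and not the definition used in the rest of the notebook.

```julia
using QuadGK

# Same integrand as above; quadgk returns a tuple (estimate, error bound).
gamma_quad(z) = quadgk(x -> x^(z - 1) * exp(-x), 0, Inf)

value, err = gamma_quad(5)
println("Γ(5) ≈ $value (error bound $err), expected $(factorial(4))")
```

Because `quadgk` returns both the estimate and an error bound, the functions further down index the result with `[1]` to keep only the estimate.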

We shall now create two `SharedArray`s. Such arrays can be accessed and modified efficiently by multiple processes on the same machine at the same time. They behave like regular arrays if no multiprocessing is required.

In [4]:
npoints=10000
z = SharedArray(rand(range(1,stop=30, length=10000), npoints));
a = SharedArray(zeros(npoints)); # results array
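
The claim that a `SharedArray` behaves like a regular array can be checked with a short sketch. The variable names `s` and `total` are illustrative only.

```julia
using Distributed, SharedArrays

# Wrap an ordinary array; the data is copied into shared memory.
s = SharedArray(collect(1.0:5.0))

s[2] = 20.0              # indexing works as for a normal Array
total = sum(s)           # and so do generic array functions
println("sum = $total, mapped on processes $(procs(s))")
```

`procs(s)` lists the processes that have the array mapped, which is a quick way to confirm that the workers can see it.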

Define a function used to fill `a` with zeros:

In [5]:
function reset_a!()
    # @sync makes the function wait until every worker has zeroed its chunk
    @sync @distributed for i=1:length(a)
        global a[i]=0
    end
end

reset_a! (generic function with 1 method)

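I define three functions that compute $\Gamma(z)$ at the first `n` points of `z`, whose values lie between 1 and 30 (up to 10000 points). The first one is parallelized and synchronous, the second one is parallelized and asynchronous, and the third one runs serially in a single process.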

<span style="color:red">Caution:</span> `@sync` is needed for the benchmark, but usually it is not. 
Tasks can be run asynchronously or synchronously. <br>
A synchronous routine waits for all the tasks to finish before returning a result, while an asynchronous computation returns immediately with a handle to the pending work (a `Future`, or a `Task` in recent Julia versions), which can be waited on until the computation is done. <br>
Thus, in order to know the total compute time, the task has to be run synchronously.
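
The difference between the two modes can be seen with plain local tasks. The sketch below uses `@async` and `@sync` on the master process only, as a stand-in for the distributed loops; `sync_vs_async_demo` and its `delay` argument are hypothetical names introduced for illustration.

```julia
function sync_vs_async_demo(delay = 0.3)
    # Asynchronous: @async hands back a Task immediately.
    t0 = time()
    task = @async (sleep(delay); 42)
    launch_time = time() - t0              # tiny: nothing has been waited for
    finished_at_launch = istaskdone(task)  # the task has not run yet
    value = fetch(task)                    # blocks until the task is done

    # Synchronous: @sync waits for every enclosed @async task.
    t1 = time()
    @sync begin
        @async sleep(delay)
        @async sleep(delay)
    end
    sync_time = time() - t1                # roughly `delay`: the sleeps overlap

    return (launch_time = launch_time, finished_at_launch = finished_at_launch,
            value = value, sync_time = sync_time)
end

r = sync_vs_async_demo()
println(r)
```

Timing only the launch of an asynchronous task measures scheduling overhead, not the work itself, which is why the benchmark functions below wrap the loop in `@sync`.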

In [6]:
function test_me_distributed(n)
    @sync @distributed for i=1:n
        global a[i]=Γ(z[i])[1]
    end
end

function test_me_distributed_async(n) # an asynchronous task, to show what an asynchronous computation returns
    @distributed for i=1:n
        global a[i]=Γ(z[i])[1]
    end
end

function test_me(n)
    for i=1:n
        global a[i]=Γ(z[i])[1]
    end
end

test_me (generic function with 1 method)

In [7]:
@time test_me_distributed_async(10000)
@time test_me_distributed(10000)
@time test_me(10000)

  0.033074 seconds (15.39 k allocations: 831.626 KiB)
  3.739367 seconds (197.61 k allocations: 9.708 MiB, 0.16% gc time)
  2.994496 seconds (33.94 M allocations: 712.996 MiB, 3.60% gc time)


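As we can see, the async call requires virtually no time, but the computation is still running when the call returns. Note also that these first-call timings include Julia's compilation overhead, so they are not yet a fair comparison between the distributed and the serial version.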

In [8]:
res1=0.
n=100
reset_a!()
for i=1:n
    tmp=@timed test_me_distributed(10000)
    global res1+=tmp[2]
end
print("elapsed time with parallelism: $(res1/n) seconds")

elapsed time with parallelism: 0.28937014896999996 seconds
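
The manual averaging loop above can also be delegated to `BenchmarkTools`, which is loaded at the top of the notebook. The following is a sketch under assumptions: `workload` is a hypothetical stand-in function and `n_terms` an illustrative parameter, neither of which is part of this notebook.

```julia
using BenchmarkTools

# Hypothetical stand-in workload; replace it with the function to be timed.
workload(n) = sum(sqrt(i) for i in 1:n)

n_terms = 10_000
# `$` interpolates the value so that global-variable access is not timed.
# @belapsed runs the expression many times and returns the minimum time in seconds.
t_min = @belapsed workload($n_terms)
println("minimum time: $t_min seconds")
```

Since `@belapsed` runs the expression repeatedly, compilation is not part of the reported time. It reports the minimum rather than the mean, so the numbers are not directly comparable to the averages computed with `@timed`.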

In [9]:
res2=0.
reset_a!()
n=100
for i=1:n
    tmp=@timed test_me(10000)
    global res2+=tmp[2]
end
print("elapsed time single process: $(res2/n) seconds")


elapsed time single process: 0.8324613159400001 seconds

In [10]:
print("speed-up factor: $(res2/res1)")

speed-up factor: 2.8768043936221783
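
The speed-up can be put in relation to the number of workers by computing the parallel efficiency, that is, the speed-up divided by the number of workers. The sketch below reuses the two mean times printed above; the variable names are illustrative.

```julia
t_serial   = 0.8324613159400001    # mean time, single process (printed above)
t_parallel = 0.28937014896999996   # mean time with 4 workers (printed above)
workers_used = 4

speedup    = t_serial / t_parallel      # same ratio as res2/res1
efficiency = speedup / workers_used     # fraction of the ideal 4x speed-up

println("speed-up ≈ $(round(speedup, digits=2)), efficiency ≈ $(round(efficiency, digits=2))")
```

With these numbers the efficiency is about 0.72, so roughly 72% of the ideal four-fold speed-up is reached; the remainder is spent on overhead such as distributing the work and synchronizing the workers.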