# Parallel computing in Julia
This notebook shows how parallel computation can speed up expensive computations.

This example requires `QuadGK`, `Distributed`, `BenchmarkTools`, `SharedArrays`, `PyCall` and `SpecialFunctions`. If they are not already installed, please run the following code:

```julia
using Pkg
Pkg.add("QuadGK")
Pkg.add("Distributed")
Pkg.add("BenchmarkTools")
Pkg.add("SharedArrays")
Pkg.add("PyCall")
Pkg.add("SpecialFunctions") # used later for the built-in gamma function
Pkg.add("Conda")            # needed for `using Conda` below

using Conda
Conda.install("scipy")
```



In [1]:
using Distributed
using BenchmarkTools

CPUcores=4

addprocs(CPUcores) # One should add n workers, where n is the number of available CPU cores
print(workers())

[2, 3, 4, 5]
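
Hard-coding `CPUcores=4` only suits a four-core machine. A minimal sketch of a helper that tops the worker pool up to a target size is given below. `ensure_workers` is a hypothetical name, and using `Sys.CPU_THREADS` (logical threads, which can exceed the number of physical cores) as the default target is an assumption.

```julia
using Distributed

# Hypothetical helper: make sure at least `target` worker processes exist.
# `nprocs() - 1` counts real workers; `nworkers()` would report 1 even when
# only the master process is running.
function ensure_workers(target::Integer = Sys.CPU_THREADS)
    deficit = max(target - (nprocs() - 1), 0)
    deficit > 0 && addprocs(deficit)
    return workers()
end
```

Calling `ensure_workers(4)` twice adds four workers the first time and none the second, so re-running the cell does not keep growing the pool.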

I now need to load the library required for computing the integral (`QuadGK`) and the library providing `SharedArray`s.
Since every process needs to be able to calculate integrals and operate on arrays, I load the libraries `@everywhere`.

In [2]:
@everywhere using QuadGK
@everywhere using SharedArrays
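
To see what `@everywhere` does, the sketch below defines a function on every process and then calls it on each one. `square_plus_id` is a hypothetical name used only for this illustration. The snippet also runs when no workers have been added, in which case `procs()` contains only process 1.

```julia
using Distributed

# Define the function on the master and on every worker.
@everywhere square_plus_id(x) = x^2 + myid()

# Run it once on each process; `myid()` is evaluated where the call executes.
results = [remotecall_fetch(square_plus_id, p, 3) for p in procs()]
println(results)
```

Without `@everywhere`, the function would exist only on the master process, and the remote calls on the workers would fail because the name is not defined there.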

Define the Euler $\Gamma$ function:
$$\Gamma(z)=\int_{0}^{\infty} x^{z-1}e^{-x}\,dx$$
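
As a quick check of the definition, two standard consequences follow directly from the integral for $z>0$: the value at $z=1$ and, by integration by parts, the recurrence that makes $\Gamma$ extend the factorial.

$$\Gamma(1)=\int_{0}^{\infty} e^{-x}\,dx=1,\qquad \Gamma(z+1)=\int_{0}^{\infty} x^{z}e^{-x}\,dx=\left[-x^{z}e^{-x}\right]_{0}^{\infty}+z\int_{0}^{\infty} x^{z-1}e^{-x}\,dx=z\,\Gamma(z)$$

Hence $\Gamma(n+1)=n!$ for non-negative integers $n$, for example $\Gamma(5)=4!=24$, which gives a convenient reference value for testing a numerical implementation.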

In [3]:
@everywhere Γ(z)=quadgk(x->x^(z-1)*exp(-x),0,Inf32)
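
`quadgk` returns a tuple `(value, error_estimate)`, which is why the cells below index the result with `[1]`. The sketch below checks the quadrature against $\Gamma(5)=4!=24$. `Γ_value` is a hypothetical helper name, and the upper limit is written as `Inf` rather than `Inf32` so that both endpoints stay in `Float64`; that substitution is my assumption, not something this notebook states.

```julia
using QuadGK

# Hypothetical helper: Γ(z) for real z > 0, keeping only the integral estimate.
function Γ_value(z)
    value, err = quadgk(x -> x^(z - 1) * exp(-x), 0, Inf)
    return value
end

println(Γ_value(5))   # should be close to 4! = 24
```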

We shall now create two `SharedArray`s. Such arrays can be accessed and modified by multiple processes simultaneously and efficiently. They behave like regular arrays if no multiprocessing is required. 

In [4]:
npoints=10000
z = SharedArray(rand(range(1,stop=30, length=10000), npoints));
a = SharedArray(zeros(npoints)); # results array
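
The shared-memory behaviour can be seen on a small array. In this sketch each process writes its own id into the part of the array it was assigned. `owner` is a hypothetical array, separate from `z` and `a`. With no workers added, everything runs on process 1.

```julia
using Distributed
@everywhere using SharedArrays

owner = SharedArray{Float64}(8)   # 8 elements backed by shared memory
fill!(owner, 0.0)

# Each process fills its chunk of indices with its own process id.
@sync @distributed for i in 1:length(owner)
    owner[i] = myid()
end

println(owner)   # the master sees the values written by the other processes
```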

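Define a function used to fill `a` with zeros

In [5]:
function reset_a!()
    # @sync waits for all workers, so `a` is fully reset before a benchmark starts
    @sync @distributed for i=1:length(a)
        a[i]=0 # indexed assignment mutates `a`, so no `global` is needed
    end
end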

reset_a! (generic function with 1 method)

I define three functions that compute $\Gamma(z)$ at the first n points of `z`, whose values lie between 1 and 30 (up to 10000 points). The first is parallelized and synchronous, the second is parallelized and asynchronous, and the third is not parallelized.

<span style="color:red">Caution:</span> `@sync` is needed for the benchmark, but usually it is not. 
Tasks can be run asynchronously or synchronously. <br>
A synchronous routine waits for all the tasks to finish before returning a result, while an asynchronous computation returns immediately with a handle (a `Task` in Julia 1.x) that completes once the computation is done. <br>
Thus, in order to know the total compute time, it is preferable to run a synchronous task.
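
The difference can be checked on a toy loop. In the sketch below, `out` and `handle` are hypothetical names, and the code only relies on being able to `wait` on whatever a bare `@distributed` loop returns.

```julia
using Distributed
@everywhere using SharedArrays

out = SharedArray{Float64}(4)
fill!(out, 0.0)

# Asynchronous: the loop is scheduled and the call returns a handle at once.
handle = @distributed for i in 1:length(out)
    sleep(0.1)          # stand-in for an expensive computation
    out[i] = i^2
end
wait(handle)            # block until all chunks are done

# Synchronous: `@sync` performs the same wait implicitly.
@sync @distributed for i in 1:length(out)
    out[i] += 1
end
println(out)
```

Timing the first loop without the `wait` would measure only the scheduling, not the work itself.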

In [6]:
function test_me_distributed(n)
    @sync @distributed for i=1:n
        global a[i]=Γ(z[i])[1]
    end
end
test_me_distributed(3) #run it once so that it is already compiled

function test_me_distributed_async(n) # an asynchronous task, to show how an asynchronous computation is timed
    @distributed for i=1:n
        global a[i]=Γ(z[i])[1]
    end
end
test_me_distributed_async(3)

function test_me(n)
    for i=1:n
        global a[i]=Γ(z[i])[1]
    end
end
test_me(3)

In [7]:
@time test_me_distributed_async(10000)
@time test_me_distributed(10000)
@time test_me(10000)

  0.000004 seconds (6 allocations: 928 bytes)
  3.282285 seconds (1.34 k allocations: 67.719 KiB)
  0.835946 seconds (28.41 M allocations: 443.341 MiB, 2.59% gc time)


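As we can see, the asynchronous call returns in virtually no time, but the computation is still running in the background when it returns. Single `@time` measurements like these are noisy, so the following cells average over 100 runs.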
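
The serial run above reports 28.41 M allocations. Part of that is likely due to `a` and `z` being non-constant globals, which Julia cannot specialize on; this is my reading, not something measured in this notebook. A common remedy is to pass the arrays as arguments, as in the sketch below, where `fill_serial!` and `fill_distributed!` are hypothetical names and `f` stands for any function of one point.

```julia
using Distributed
@everywhere using SharedArrays

# Serial version: arrays and function arrive as arguments, not as globals.
function fill_serial!(out, f, xs)
    for i in 1:length(xs)
        out[i] = f(xs[i])
    end
    return out
end

# Distributed version: the loop body captures the local arguments,
# so no `global` annotation is needed.
function fill_distributed!(out::SharedArray, f, xs)
    @sync @distributed for i in 1:length(xs)
        out[i] = f(xs[i])
    end
    return out
end

xs = SharedArray(collect(1.0:6.0))
out = SharedArray(zeros(6))
fill_distributed!(out, x -> 2x, xs)
println(out)
```

Note that `f` must be usable on the workers, for instance an anonymous function as here or a function defined with `@everywhere`.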

In [8]:
res1=0.
n=100
reset_a!()
for i=1:n
    tmp=@timed test_me_distributed(10000)
    global res1+=tmp[2]
end
print("elapsed time with parallelism: $(res1/n) seconds")

elapsed time with parallelism: 0.28786496399999983 seconds

In [9]:
res2=0.
reset_a!()
n=100
for i=1:n
    tmp=@timed test_me(10000)
    global res2+=tmp[2]
end
print("elapsed time single process: $(res2/n) seconds")


elapsed time single process: 0.8849689530099997 seconds
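
The two averaging loops above follow the same pattern, which can be wrapped in a small helper so that the divisor always matches the number of runs performed. This is only a sketch: `mean_time` is a hypothetical name, and the warm-up call is an assumption about how one would want to exclude compilation time.

```julia
# Hypothetical helper: mean wall-clock time of `f()` over `reps` runs.
function mean_time(f, reps::Integer)
    f()                        # warm-up run, so compilation is not timed
    total = 0.0
    for _ in 1:reps
        total += @elapsed f()  # seconds taken by one call
    end
    return total / reps        # divide by the number of runs actually made
end

t = mean_time(() -> sum(sqrt, 1:10_000), 20)
println("mean time: $t seconds")
```

With the functions of this notebook, the call would read `mean_time(() -> test_me(10000), 100)`.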

In [10]:
print("speed-up factor: $(res2/res1)")

speed-up factor: 3.074250303729217
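
A speed-up of about 3.07 on 4 workers can also be expressed as a parallel efficiency, i.e. the speed-up divided by the number of workers. The short calculation below uses the two mean times printed above; the variable names are mine.

```julia
t_parallel = 0.28786496399999983   # mean time with 4 workers (s), from In [8]
t_serial   = 0.8849689530099997    # mean time on one process (s), from In [9]
n_workers  = 4

speedup    = t_serial / t_parallel
efficiency = speedup / n_workers   # 1.0 would mean perfect scaling

println("speed-up: $(round(speedup, digits=2)), efficiency: $(round(efficiency, digits=2))")
```

An efficiency of roughly 0.77 means that about a quarter of the ideal gain is lost to overheads such as scheduling and communication.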

# Comparison with Python

In [11]:
using PyCall

In [12]:
Γ_py=pyimport("scipy.special").gamma

PyObject <ufunc 'gamma'>

In [62]:
@time Γ_py(z)

  0.000561 seconds (45 allocations: 80.172 KiB)


10000-element Array{Float64,1}:
      5.036909328561965e26 
   5648.053076193347       
      3.028500953928122e27 
      3.194235778900446e29 
  25327.751423357135       
      5.6907983535647095e10
      3.4658359830564453e12
      4.409596721333813e17 
     25.05496951827898     
      1.5916887852926785e10
      2.5094712119866572e16
      2.2121236443213978e18
      0.9318220733059418   
      ⋮                    
 480487.93710176414        
      6.0623769827002734e13
      3.739530876724871e17 
     55.22061589162701     
      1.056564564481503    
      4.0872478261627494e18
      3.0427608342061473e29
      7.3428861832682675e22
      2.655054630559022e6  
    189.4591822683282      
      1.0007894228409088e10
      5.389019146047299e7  

In [74]:
function test_me_py()
    a=SharedArray(Γ_py(z))
end
test_me_py();

In [75]:
@time test_me_py()

  0.002760 seconds (498 allocations: 95.938 KiB)


10000-element SharedArray{Float64,1}:
      5.036909328561965e26 
   5648.053076193347       
      3.028500953928122e27 
      3.194235778900446e29 
  25327.751423357135       
      5.6907983535647095e10
      3.4658359830564453e12
      4.409596721333813e17 
     25.05496951827898     
      1.5916887852926785e10
      2.5094712119866572e16
      2.2121236443213978e18
      0.9318220733059418   
      ⋮                    
 480487.93710176414        
      6.0623769827002734e13
      3.739530876724871e17 
     55.22061589162701     
      1.056564564481503    
      4.0872478261627494e18
      3.0427608342061473e29
      7.3428861832682675e22
      2.655054630559022e6  
    189.4591822683282      
      1.0007894228409088e10
      5.389019146047299e7  

In [66]:
res3=0.
reset_a!()
n3=1000
for i=1:n
    tmp=@timed test_me_py()
    global res3+=tmp[2]
end
print("elapsed time python: $(res3/n3) seconds")

elapsed time python: 0.000197734604 seconds

In [67]:
print("speed-up factor python-julia handwritten: $(res1/res3*(n3/n))")

speed-up factor python-julia handwritten: 1455.8148051819994

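The hand-written quadrature is far slower than SciPy. Note that the timing loop above runs `n = 100` times while the average divides by `n3 = 1000`, so the printed mean time and factor are ten times too favourable to Python: the mean is about 2 ms per call and the factor about 146. Let's try the `gamma` function from the Julia package `SpecialFunctions` (which relies on GNU MPFR only for `BigFloat` arguments).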

# Native gamma function parallelized

In [68]:
@everywhere using SpecialFunctions

In [69]:
function test_me_distributed_builtin(n)
    @sync @distributed for i=1:n
        global a[i]=gamma(z[i])
    end
end

function test_me_builtin(n)
    for i=1:n
        global a[i]=gamma(z[i])
    end
end

test_me_builtin (generic function with 1 method)

In [70]:
@time test_me_distributed_builtin(10)
@time test_me_builtin(10)

  0.095133 seconds (198.49 k allocations: 9.756 MiB)
  0.007734 seconds (12.06 k allocations: 638.119 KiB)


In [71]:
res4=0.
reset_a!()
n4=1000
for i=1:n
    tmp=@timed test_me_distributed_builtin(n)
    global res4+=tmp[2]
end
print("elapsed time builtin gamma: $(res4/n4) seconds\n")

res5=0.
reset_a!()
n5=1000
for i=1:n
    tmp=@timed test_me_builtin(n)
    global res5+=tmp[2]
end
print("elapsed time builtin gamma: $(res5/n5) seconds")

elapsed time builtin gamma: 0.00020210450200000002 seconds
elapsed time builtin gamma: 1.798797e-6 seconds

In [72]:
print("speed-up factor native-hand written: $(res1/res5*(n5/n)) \n")
print("speed-up factor 4-5: $(res4/res5) \n")
print("speed-up factor native-python: $(res3/res5*(n5/n3))")

speed-up factor native-hand written: 160031.93467634195 
speed-up factor 4-5: 112.35536972765688 
speed-up factor native-python: 109.92602500448912

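As we observe, in this case there is no gain in parallelizing the gamma function: the distributed version is about 112 times slower than the serial loop (factor `4-5`). Since the function is already very fast, the overhead generated by the parallelization technique is higher than the speed gain. 
The other two factors need care. The loops in In [71] run `n = 100` times on `n = 100` points each, while the averages divide by 1000 and the quadrature and Python benchmarks process 10000 points per run, so the printed values are not like-for-like. Normalized per point, the built-in `gamma` takes about 0.18 µs, against about 29 µs for the parallel quadrature and about 88 µs for the serial one. An efficient algorithm is therefore still much better than parallelization, with a gain of order $10^2$ rather than $10^5$. <br> 
As a side note, the same normalization puts SciPy at about 0.2 µs per point, so these measurements show Julia's built-in `gamma` and Python at roughly the same speed rather than a 100x difference.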
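
The overhead argument can be made quantitative with a simple model: a serial loop over $n$ points costs about $c\,n$, a distributed one about $T_o + c\,n/p$ with $p$ workers, so distributing pays off only for $n > T_o / (c\,(1-1/p))$. The sketch below plugs in rough values read from the In [71] outputs, rescaled to the 100 runs on 100 points that the loops actually execute. The model, the constant-overhead assumption and the names `c`, `T_o` and `n_star` are mine.

```julia
p   = 4                    # number of workers
c   = 1.8e-5 / 100         # serial cost per point (s): about 18 µs for 100 points
T_o = 2.0e-3               # cost of one distributed call on 100 points (s),
                           # taken as a rough estimate of the fixed overhead

# Break-even size: c*n = T_o + c*n/p  =>  n = T_o / (c * (1 - 1/p))
n_star = T_o / (c * (1 - 1/p))
println("break-even at roughly $(round(Int, n_star)) points")
```

Under these assumptions the built-in `gamma` would need on the order of $10^4$ points before four workers break even, which is consistent with the loss observed on 100 points.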