# More power method (continuation of lecture 2)

In this continuation of lecture 2, we will see that having a good abstraction of hardware resources allows us to run the **same code** in parallel.

"Parallel computing will have made it when we never have to know any of the internal details." Alan Edelman

## Using parallel hardware

In [17]:
using LinearAlgebra

In [13]:
function power_method(M, v)
    T = eltype(v)
    for i in 1:100
        v = M*v        # repeatedly creates a new vector and destroys the old v
        v /= T(norm(v))
    end
    
    return v, T(norm(M*v)) / T(norm(v))  # or  (M*v) ./ v
end

power_method (generic function with 1 method)
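
As a quick sanity check on the CPU, the sketch below compares the result of `power_method` with the dense symmetric eigensolver from `LinearAlgebra` on the small matrix used later in this notebook. The names `vec_est`, `val_est`, `val_ref` and `vec_ref` are introduced here only for illustration.

```julia
using LinearAlgebra

# Same definition as in the cell above, repeated so that the block is self-contained.
function power_method(M, v)
    T = eltype(v)
    for i in 1:100
        v = M*v
        v /= T(norm(v))
    end
    return v, T(norm(M*v)) / T(norm(v))
end

M = [2 1; 1 1.]
vec_est, val_est = power_method(M, [1., 1])

# Reference values from the dense eigensolver for symmetric matrices.
F = eigen(Symmetric(M))
val_ref = F.values[end]        # eigenvalues are returned in ascending order
vec_ref = F.vectors[:, end]    # eigenvector belonging to the largest eigenvalue

@show val_est val_ref
@show abs(dot(vec_est, vec_ref))   # close to 1 when the two directions agree
```

For this matrix the dominant eigenvalue is $(3 + \sqrt{5})/2$, so both approaches should agree to roughly machine precision.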

### ArrayFire for calculations on the GPU

[ArrayFire.jl](https://github.com/JuliaComputing/ArrayFire.jl) is a Julia package that wraps the [`arrayfire`](http://arrayfire.com/) library for doing efficient array calculations on the GPU (graphics card).

Note that **you must separately install the `arrayfire` library on your computer before using this package**.

In [2]:
# First install the arrayfire library from the arrayfire homepage.
# On Mac you can use [Homebrew](http://brew.sh/), which you must first install and then execute
# brew install arrayfire

In [3]:
using Pkg   # needed on Julia ≥ 0.7, where Pkg is a standard library rather than part of Base
Pkg.add("ArrayFire")

INFO: Nothing to be done
INFO: METADATA is out-of-date — you may not have the latest version of ArrayFire
INFO: Use `Pkg.update()` to get the latest versions of your packages


In [6]:
using ArrayFire

`ArrayFire.jl` provides an easy way to create and manipulate arrays on the GPU.
It is easy (although it may be expensive!) to pass arrays back and forth between the CPU and the GPU.

First we create a standard Julia matrix (on the CPU):

In [6]:
M = [2 1; 1 1.]

2×2 Array{Float64,2}:
 2.0  1.0
 1.0  1.0

We can copy this into an array on the GPU with 

In [7]:
MM = AFArray(M)   # printing these objects requires the master branch of ArrayFire

2×2 ArrayFire.AFArray{Float64,2}:
 2.0  1.0
 1.0  1.0

[Note that, as of 15 September 2016, printing these objects on Julia v0.5 requires the master (i.e. development) branch of the package, which may be obtained by executing

    Pkg.checkout("ArrayFire")
]

This calls a **constructor** of the `AFArray` type, which constructs an array on the GPU by copying the data provided.

We do the same for a vector:

In [10]:
v = [1., 1]
vv = AFArray(v)

2-element ArrayFire.AFArray{Float64,1}:
 1.0
 1.0

and then multiply the matrix and vector **on the GPU**:

In [12]:
MM * vv

2-element ArrayFire.AFArray{Float64,2}:
 3.0
 2.0

We see that the `*` operation indeed has a method defined to perform the matrix-vector multiplication and create the result as a new object (in fact, a $2 \times 1$ matrix) on the GPU.

We are thus now able to call `power_method` directly:

In [14]:
vec, val = power_method(MM, vv)

([0.850651,0.525731],2.618033988749895)

In [15]:
vec

2-element ArrayFire.AFArray{Float64,1}:
 0.850651
 0.525731

We see that the result of the calculation is indeed still an object on the GPU.

**Exercise**: Compare the time on GPU and CPU. Since execution on the GPU is asynchronous, it is necessary to synchronise:

In [7]:
function runGPU(MM, vv)
    power_method(MM, vv)
    sync()   # wait for the GPU to finish
end

runGPU (generic function with 1 method)

In [8]:
@time runGPU(MM, vv)


In [21]:
@time power_method(M, v)

  0.000026 seconds (307 allocations: 20.609 KB)


([0.850651,0.525731],2.618033988749895)

In [12]:

n = 10000
M = rand(Float32, n, n)  # GPUs are much more efficient with Float32s
M = (M + M')/2
v = rand(Float32, n, 1);

In [17]:
typeof(MM),typeof(vv)

(ArrayFire.AFArray{Float32,2},ArrayFire.AFArray{Float32,2})

In [13]:
@time power_method(M, v)

  8.195866 seconds (508 allocations: 7.686 MB)


(
Float32[0.00996414; 0.00998084; … ; 0.0100507; 0.0100459],

4999.5728f0)

In [14]:
MM = AFArray(M)
vv = AFArray(v)

#@time power_method(M, v);
@time runGPU(MM, vv);


  0.803818 seconds (3.02 k allocations: 62.891 KB)


In [23]:
typeof(norm(MM*vv))

Float64

On my machine, the CPU version is much faster for small arrays, while for matrices of linear size $n=10000$ the GPU version is substantially faster: roughly 10 times in the run shown above (0.80 s versus 8.2 s).
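
The `typeof(norm(MM*vv))` cell above shows that `norm` returned a `Float64` even though the GPU arrays hold `Float32`s, which is presumably why `power_method` converts every norm with `T = eltype(v)`. The following CPU-only sketch illustrates the effect of that conversion; the size `n = 200` and the names `vec32` and `val32` are chosen here purely for illustration.

```julia
using LinearAlgebra

# Same definition as earlier in the notebook, repeated so that the block is self-contained.
function power_method(M, v)
    T = eltype(v)
    for i in 1:100
        v = M*v
        v /= T(norm(v))
    end
    return v, T(norm(M*v)) / T(norm(v))
end

n = 200                      # small size, for illustration only
M = rand(Float32, n, n)
M = (M + M') / 2             # symmetrise, as in the text
v = rand(Float32, n)

vec32, val32 = power_method(M, v)
@show eltype(vec32) typeof(val32)      # both stay in Float32

# Without the conversion, dividing by a Float64 would promote the whole vector:
@show eltype(v / Float64(norm(v)))     # Float64
```

Keeping everything in the element type of `v` avoids silently switching to double precision halfway through the iteration.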

### DistributedArrays for large arrays spread across different processors

The Julia package [DistributedArrays.jl](https://github.com/JuliaParallel/DistributedArrays.jl) defines a `DArray` ("distributed array") type, which provides an abstraction that looks like a standard Julia array, but is spread across several different processors.

Since modern desktops and laptops often have multiple cores, we can try this out on a single machine, using one worker process per core.

First we allow Julia access to more processes:

In [1]:
using Distributed

In [2]:
addprocs(4)   # launch 4 worker processes (e.g. one per core)

4-element Array{Int64,1}:
 2
 3
 4
 5

In [4]:
]add DistributedArrays

[32m[1m  Updating[22m[39m registry at `~/.julia/registries/General`
[32m[1m  Updating[22m[39m git-repo `https://github.com/JuliaRegistries/General.git`
[32m[1m Installed[22m[39m SoftGlobalScope ─── v1.0.5
[32m[1m Installed[22m[39m Graphics ────────── v0.4.0
[32m[1m Installed[22m[39m Rotations ───────── v0.9.0
[32m[1m Installed[22m[39m AssetRegistry ───── v0.1.0
[32m[1m Installed[22m[39m DistributedArrays ─ v0.5.1
[32m[1m Installed[22m[39m Primes ──────────── v0.4.0
[32m[1m Installed[22m[39m WebSockets ──────── v1.0.1
[32m[1m Installed[22m[39m WebIO ───────────── v0.3.4
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.0/Project.toml`
 [90m [aaf54ef3][39m[92m + DistributedArrays v0.5.1[39m
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.0/Manifest.toml`
 [90m [bf4720bc][39m[93m ↑ AssetRegistry v0.0.2 ⇒ v0.1.0[39m
 [90m [aaf54ef3][39m[92m + DistributedArrays v0.5.1[39m
 [90m [a2bd30eb][39m[93m ↑ Graphics v0.3.0 ⇒ v0.4.0

In [5]:
using DistributedArrays

┌ Info: Precompiling DistributedArrays [aaf54ef3-cdf8-58ed-94cc-d582ad619b94]
└ @ Base loading.jl:1186


There are several ways to create `DArray`s; for example, `drand` creates one filled with uniform random numbers:

In [6]:
M = drand(10, 10)

10×10 DArray{Float64,2,Array{Float64,2}}:
 0.665071   0.33031    0.442901   0.103348  …  0.962567  0.0746497  0.2958   
 0.538561   0.905336   0.826823   0.650489     0.281129  0.415344   0.375255 
 0.913234   0.33151    0.575058   0.655278     0.739308  0.902407   0.0533642
 0.421379   0.377522   0.165261   0.88224      0.663905  0.182826   0.230881 
 0.0160093  0.141049   0.0986772  0.965648     0.911866  0.855509   0.562088 
 0.468438   0.376021   0.754844   0.244142  …  0.736987  0.63443    0.249177 
 0.774035   0.57182    0.834012   0.364425     0.132242  0.0314969  0.440438 
 0.346113   0.0878297  0.29829    0.548972     0.210167  0.833964   0.624782 
 0.935727   0.0324994  0.657721   0.901339     0.58102   0.155088   0.707661 
 0.931774   0.862633   0.0322514  0.696279     0.119139  0.274396   0.217467 

If we really need to, we can find out where Julia is storing each piece of the array:

In [8]:
M.indices

2×2 Array{Tuple{UnitRange{Int64},UnitRange{Int64}},2}:
 (1:5, 1:5)   (1:5, 6:10) 
 (6:10, 1:5)  (6:10, 6:10)

This shows that the array was divided into four equal $5 \times 5$ blocks, one stored on each of the four worker processes.

In [9]:
v = drand(10)

10-element DArray{Float64,1,Array{Float64,1}}:
 0.025682159261005033
 0.7686611037300377  
 0.8894532510985373  
 0.3065697719362501  
 0.04550492919470028 
 0.9466530063361316  
 0.11004720012249503 
 0.5970217409255159  
 0.7103997866198997  
 0.7199507617253142  

In [10]:
M * v

10-element DArray{Float64,1,Array{Float64,1}}:
 2.0727443778849715
 2.703368083586137 
 2.8773031107195854
 2.369305815646867 
 2.1351360020555115
 3.1624500852934174
 2.661526526411242 
 2.0303272566296924
 2.0357766100872974
 1.5931089202446835
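
To see why such a block layout lends itself to parallel work, here is a sketch of a blockwise matrix-vector product using ordinary (non-distributed) arrays and the $2 \times 2$ layout reported by `M.indices`. It is only an illustration of the idea, not a description of how DistributedArrays.jl implements `*` internally; the names `row_blocks`, `col_blocks` and `y` are introduced here.

```julia
# Plain Julia arrays stand in for the DArray in this sketch.
M = rand(10, 10)
v = rand(10)

# The 2×2 block layout shown by M.indices above.
row_blocks = [1:5, 6:10]
col_blocks = [1:5, 6:10]

# Each partial product needs only one block of M and the matching piece of v,
# so in a distributed setting it could be computed where that block is stored.
y = zeros(10)
for rows in row_blocks, cols in col_blocks
    y[rows] += M[rows, cols] * v[cols]
end

@show y ≈ M * v
```

Only the partial results for each block row have to be combined at the end, which is what makes the computation easy to spread across processes.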

Again, we see that `*` has been defined for these objects, so once again we can just run

In [18]:
power_method(M, v)

([0.261998, 0.334175, 0.341432, 0.306695, 0.273889, 0.401734, 0.365948, 0.274706, 0.290701, 0.280631], 4.642175345010186)

## Operators

Consider the following operator, which we could call a random-walk or averaging operator:

In [19]:
averaging(n) = 0.5 * SymTridiagonal(zeros(n), ones(n-1))

averaging (generic function with 1 method)

In [48]:
averaging(7)

7×7 SymTridiagonal{Float64}:
 0.0  0.5   ⋅    ⋅    ⋅    ⋅    ⋅ 
 0.5  0.0  0.5   ⋅    ⋅    ⋅    ⋅ 
  ⋅   0.5  0.0  0.5   ⋅    ⋅    ⋅ 
  ⋅    ⋅   0.5  0.0  0.5   ⋅    ⋅ 
  ⋅    ⋅    ⋅   0.5  0.0  0.5   ⋅ 
  ⋅    ⋅    ⋅    ⋅   0.5  0.0  0.5
  ⋅    ⋅    ⋅    ⋅    ⋅   0.5  0.0

In [20]:
v = 1:2:13
averaging(7) * v

7-element Array{Float64,1}:
  1.5
  3.0
  5.0
  7.0
  9.0
 11.0
  5.5

In [25]:
v

1:2:13

In [26]:
power_method(averaging(7), collect(v))

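InexactError: InexactError: Int64(Int64, 17.81852968120546)

This fails because `collect(v)` is a vector of `Int64`, so inside `power_method` we have `T = Int64`, and `T(norm(v))` then tries to convert the non-integer norm `17.81852968120546` to an integer.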
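
A minimal sketch of the same call with a floating-point starting vector, so that `T = eltype(v)` is `Float64` and the conversion `T(norm(v))` succeeds. The comparison value $\cos(\pi/8)$ uses the known eigenvalues $\cos(k\pi/(n+1))$ of this tridiagonal matrix for $n = 7$; the names `vec_est` and `val_est` are introduced here for illustration.

```julia
using LinearAlgebra

# Same definitions as earlier in the notebook, repeated so that the block is self-contained.
function power_method(M, v)
    T = eltype(v)
    for i in 1:100
        v = M*v
        v /= T(norm(v))
    end
    return v, T(norm(M*v)) / T(norm(v))
end

averaging(n) = 0.5 * SymTridiagonal(zeros(n), ones(n-1))

v = 1:2:13
vec_est, val_est = power_method(averaging(7), float(collect(v)))  # Float64 starting vector

@show val_est
@show cos(pi/8)   # largest eigenvalue magnitude of averaging(7)
```

Note that the spectrum of this matrix is symmetric about zero, so `vec_est` itself need not settle on a single eigenvector; only the estimated eigenvalue magnitude is compared here.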

Although we have saved some memory by using a `SymTridiagonal` structure, we clearly are still storing far more information than we need to, since this is just "0 on the diagonal and 0.5 on the super- and sub-diagonal".

We can define a new type in Julia to reflect this. We realise that we do **not actually need to store any information inside the "matrix"**. In fact, we will rather define a **linear operator**, just as we would really like to do in mathematics:

In [53]:
struct AveragingOp
    # contains *no* information
end

We have a "dummy type" that contains no information. It will be interesting because of "what it can do", i.e. the operations that we define that involve objects of this type.

We create an object of this type, called `A`, with

In [55]:
A = AveragingOp()  # default constructor

AveragingOp()

In [56]:
A

AveragingOp()
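
To make the storage argument concrete, the sketch below compares the memory footprint of a dense matrix, the `SymTridiagonal` version and the empty type. The size `n = 1000` and the names `S` and `D` are chosen here for illustration, and the empty type is written with the Julia 1.x `struct` keyword.

```julia
using LinearAlgebra

struct AveragingOp end     # the same empty type as in the text

n = 1000
S = 0.5 * SymTridiagonal(zeros(n), ones(n-1))
D = Matrix(S)              # dense copy that stores all n^2 entries

@show Base.summarysize(D)  # roughly 8n^2 bytes
@show Base.summarysize(S)  # roughly 8(2n - 1) bytes, plus a small overhead
@show sizeof(AveragingOp)  # 0: there is nothing to store
```

The operator type carries no data at all; everything it "knows" lives in the methods defined for it.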

We will define what it means to multiply an object of this type by a vector. The simplest case would be

In [57]:
import Base: *  # necessary to overload *

function *(A::AveragingOp, v::AbstractVector)
    v  # just the identity operator
end

* (generic function with 161 methods)

which gives an identity operator:

In [60]:
v = [1, 2, 43]
A*v

3-element Array{Int64,1}:
  1
  2
 43

In [15]:
power_method(A, float(v))   # floating-point starting vector, since power_method converts norms to eltype(v)

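We now define the actual averaging operation. It takes a vector and returns a new vector in which each interior entry is replaced by the mean of its two neighbours, while the two end entries are left unchanged: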

In [62]:
function *(A::AveragingOp, v::AbstractVector)
    [ v[1];    # ; concatenates
      [(v[i-1] + v[i+1])/2  for i in 2:length(v)-1];    # array comprehension
      v[end] 
    ]
end



* (generic function with 161 methods)

In [66]:
v = (1:7).^2
@show v
A*v

v = [1,4,9,16,25,36,49]


7-element Array{Float64,1}:
  1.0
  5.0
 10.0
 17.0
 26.0
 37.0
 49.0
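
One way to convince ourselves that this really is a linear operator is to compare it with an explicit matrix for the same map. In the sketch below, `averaging_matrix` is a hypothetical helper written only for this check; note that its first and last rows keep the end entries fixed, which differs from the boundary rows of `averaging(n)` above.

```julia
struct AveragingOp end     # same empty type as in the text (Julia 1.x syntax)

import Base: *
function *(A::AveragingOp, v::AbstractVector)
    [ v[1];
      [(v[i-1] + v[i+1])/2  for i in 2:length(v)-1];
      v[end] ]
end

# Hypothetical helper (not part of the original notebook):
# the dense matrix of the same linear map.
function averaging_matrix(n)
    B = zeros(n, n)
    B[1, 1] = 1            # end entries are left unchanged
    B[n, n] = 1
    for i in 2:n-1
        B[i, i-1] = 0.5    # interior rows average the two neighbours
        B[i, i+1] = 0.5
    end
    return B
end

v = (1:7).^2
@show AveragingOp() * v ≈ averaging_matrix(7) * v
```

The operator version produces the same result without ever allocating the $n \times n$ matrix.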

Since `*` now works, we can again just reuse our same generic `power_method` implementation:

In [14]:
power_method(A, float(v))   # floating-point starting vector, since power_method converts norms to eltype(v)

You could worry that `*` is not the correct notation. Mathematically, for an operator $\mathcal{L}$ operating on a vector $\mathbf{v}$, we might write $\mathcal{L} \mathbf{v}$, just using juxtaposition. Unfortunately, we are unable to use this notation in Julia.

We could instead use a `⋅` for juxtaposition. Now that we have defined `*`, we can just do

In [70]:
import LinearAlgebra: ⋅   # in Julia 1.x, ⋅ (dot) is defined in LinearAlgebra rather than Base
A::AveragingOp ⋅ v = A * v

dot (generic function with 16 methods)

In [71]:
A ⋅ v

7-element Array{Float64,1}:
  1.0
  5.0
 10.0
 17.0
 26.0
 37.0
 49.0

We can even define $\mathcal{L}(\mathbf{v})$:

In [72]:
(A::AveragingOp)(v) = A*v

In [73]:
A(v)

7-element Array{Float64,1}:
  1.0
  5.0
 10.0
 17.0
 26.0
 37.0
 49.0
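
For reference, here is a consolidated sketch of the three notations in Julia 1.x syntax, where `⋅` is imported from `LinearAlgebra`. It assumes nothing beyond the definitions already given in this section and simply checks that all three spellings give the same result.

```julia
import Base: *
import LinearAlgebra: ⋅    # in Julia 1.x, ⋅ (dot) lives in LinearAlgebra

struct AveragingOp end

function *(A::AveragingOp, v::AbstractVector)
    [v[1]; [(v[i-1] + v[i+1])/2 for i in 2:length(v)-1]; v[end]]
end

function ⋅(A::AveragingOp, v::AbstractVector)   # infix form: A ⋅ v
    A * v
end

(A::AveragingOp)(v::AbstractVector) = A * v     # call form: A(v)

A = AveragingOp()
v = (1:7).^2
@show A * v == A ⋅ v == A(v)
```

Which notation to prefer is a matter of taste; generic code such as `power_method` only relies on `*`.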

In [27]:
@which norm(vv)

In [28]:
?ArrayFire.af_norm

No documentation found.

`ArrayFire.af_norm` is a `Function`.

```
# 1 method for generic function "af_norm":
af_norm(out::Ref, _in::ArrayFire.AFArray, _type::Int64, p::Real, q::Real) at /Users/dpsanders/.julia/v0.5/ArrayFire/src/wrap.jl:1758
```
