# GPUArrays in Julia

#### QUESTIONS

- How do we list all available OpenCL devices, both CPUs and GPUs?
- How do we select a particular device, send an array to it, and run an operation there?
- How do we check, at any time, how much memory is available on a device?
    - In this example, if `A_mul_B!(X_result, X, X)` is run on bigger matrices, the OSX graphical user interface becomes completely unusable. We should use the GPU that is not being used for the graphical user interface.

In [23]:
using GPUArrays
using CLArrays
using BenchmarkTools

In [8]:
s = 500
X = rand(Float32,s,s);
X_result = zeros(X);

#### Mapping arrays to devices

Let `X` be an array. Calling `X_gpu = GPUArray(X)` wraps it in a GPU array type and transfers its contents to the device.

In [18]:
X_gpu = GPUArrays.JLArray(X);
X_result_gpu = GPUArrays.JLArray(zeros(Float32,500,500));

In [19]:
X_gpu.size

(500, 500)

## Matrix multiplication demo

In [31]:
sizes = [x for x in 100:500:4000];
cpu_times = Dict()
gpu_times = Dict()

println("\nCPU times")
for s in sizes
    X = rand(Float32,s,s);
    X_result = zeros(X);
    res_cpu = @elapsed A_mul_B!(X_result, X,X)
    println("size: ", s, " x ", s, " seconds: ", res_cpu, " seconds")
    cpu_times[s] = res_cpu  # @elapsed already returns seconds
end

println("\nGPU times")
for s in sizes
    X = rand(Float32,s,s);
    X_result = zeros(X);
    X_gpu = CLArray(X);
    X_result_gpu =  CLArray(zeros(Float32,s,s));

    res_gpu = @elapsed A_mul_B!(X_result_gpu, X_gpu, X_gpu)
    println("size: ", s, " x ", s, " seconds: ", res_gpu, " seconds")
    gpu_times[s] = res_gpu  # @elapsed already returns seconds
end


CPU times
size: 100 x 100 seconds: 8.138e-5 seconds
size: 600 x 600 seconds: 0.005215154 seconds
size: 1100 x 1100 seconds: 0.025084941 seconds
size: 1600 x 1600 seconds: 0.071604935 seconds
size: 2100 x 2100 seconds: 0.165592458 seconds
size: 2600 x 2600 seconds: 0.313047113 seconds
size: 3100 x 3100 seconds: 0.529375186 seconds
size: 3600 x 3600 seconds: 0.832789605 seconds

GPU times
size: 100 x 100 seconds: 3.9666e-5 seconds
size: 600 x 600 seconds: 3.5773e-5 seconds
size: 1100 x 1100 seconds: 5.2693e-5 seconds
size: 1600 x 1600 seconds: 7.8384e-5 seconds
size: 2100 x 2100 seconds: 8.7341e-5 seconds
size: 2600 x 2600 seconds: 8.2634e-5 seconds
size: 3100 x 3100 seconds: 6.0159e-5 seconds
size: 3600 x 3600 seconds: 0.000168922 seconds


#### Choosing a device

In [39]:
#CLBackend.init(device_type=:gpu,device_idx=1)  ### How do we select a particular GPU?
#CLBackend.init()


## Using CLBLAS

In [40]:
?CLBLAS.gemm!

```
gemm!(tA, tB, alpha, A, B, beta, C)
```

Update `C` as `alpha*A*B + beta*C` or the other three variants according to [`tA`](@ref stdlib-blas-trans) and `tB`. Returns the updated `C`.
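
To make the roles of `alpha` and `beta` concrete, here is a minimal CPU-only sketch. It uses `BLAS.gemm!` from Julia's `LinearAlgebra` standard library (Julia 1.x) rather than CLBLAS, on the assumption that both follow the argument order shown in the docstring above. The small matrices are illustrative values, not data from this notebook.

```julia
using LinearAlgebra  # exports the BLAS submodule on Julia 1.x

# Small Float32 matrices so the result is easy to check by hand.
A = Float32[1 2; 3 4]
B = Float32[5 6; 7 8]
C = ones(Float32, 2, 2)

# alpha = 1, beta = 0: the old contents of C are discarded, C becomes A * B.
BLAS.gemm!('N', 'N', 1f0, A, B, 0f0, C)
product_only = copy(C)

# alpha = 2, beta = 1: the current C is kept and 2 * A * B is added to it.
BLAS.gemm!('N', 'N', 2f0, A, B, 1f0, C)
accumulated = copy(C)
```

After the second call `accumulated` holds `3 * A * B`, because `C` already contained `A * B` when `2 * A * B` was added.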


In [41]:
# Since alpha=1., beta=0 this is doing C = A * B  
@benchmark CLBLAS.gemm!('N', 'N', Float32(1.0), X_gpu, X_gpu, Float32(0.0), X_result_gpu)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     868.350 ms (0.00% GC)
  median time:      875.972 ms (0.00% GC)
  mean time:        878.634 ms (0.00% GC)
  maximum time:     890.103 ms (0.00% GC)
  --------------
  samples:          6
  evals/sample:     1

In [42]:
A_mul_B!(X_result, X, X);

In [43]:
isapprox(Array(X_result_gpu), X_result)

true

In [45]:
free(X_gpu), free(X_result_gpu)

LoadError: UndefVarError: free not defined

## Benchmarking array operations

We can use functions such as `A_mul_B!` with `GPUArray` objects. Multiple dispatch selects the method that matches the array type, so the same call runs on the targeted GPU.

- A_mul_B!
- A_mul_Bc!
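
The dispatch mechanism can be sketched without any GPU. The example below targets Julia 1.x, where `A_mul_B!` is called `mul!`. `MockDeviceMatrix` and `calls` are hypothetical names introduced only for this illustration; they are not part of GPUArrays or CLArrays, and the "device" method simply falls back to the CPU.

```julia
using LinearAlgebra  # mul! is the Julia 1.x name for A_mul_B!

# Hypothetical wrapper type, used only to illustrate dispatch.
# It keeps its data on the CPU; a real GPU array type would hold a device buffer.
struct MockDeviceMatrix{T} <: AbstractMatrix{T}
    data::Matrix{T}
end
Base.size(m::MockDeviceMatrix) = size(m.data)
Base.getindex(m::MockDeviceMatrix, i::Int, j::Int) = m.data[i, j]

# Records which specialised method ran, so the dispatch decision is visible.
const calls = String[]

function LinearAlgebra.mul!(C::MockDeviceMatrix, A::MockDeviceMatrix, B::MockDeviceMatrix)
    push!(calls, "device method")
    mul!(C.data, A.data, B.data)  # stand-in for a device kernel launch
    return C
end

X = rand(Float32, 4, 4)
X_result = zeros(Float32, 4, 4)
mul!(X_result, X, X)              # plain Matrix arguments: built-in CPU method

X_dev = MockDeviceMatrix(X)
X_result_dev = MockDeviceMatrix(zeros(Float32, 4, 4))
mul!(X_result_dev, X_dev, X_dev)  # same call, now the specialised method runs
```

The call site is identical in both cases; only the argument types differ, which is how an array package can route the operation to its own backend.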


In [53]:
X_gpu = CLArray(X);
X_result_gpu = similar(X_gpu);


In [46]:
@benchmark A_mul_B!(X_result, X, X)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     884.468 ms (0.00% GC)
  median time:      903.473 ms (0.00% GC)
  mean time:        906.247 ms (0.00% GC)
  maximum time:     938.804 ms (0.00% GC)
  --------------
  samples:          6
  evals/sample:     1

In [54]:
@benchmark A_mul_B!(X_result_gpu, X_gpu, X_gpu)

BenchmarkTools.Trial: 
  memory estimate:  2.30 KiB
  allocs estimate:  96
  --------------
  minimum time:     13.674 μs (0.00% GC)
  median time:      14.674 μs (0.00% GC)
  mean time:        17.124 μs (4.41% GC)
  maximum time:     16.526 ms (45.69% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [None]:
isapprox(Array(X_result_gpu), X_result)

### Benchmarking A_mul_B! for different sizes

Note that a microsecond (μs) is one millionth of a second.
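
BenchmarkTools stores the sample times of a trial in nanoseconds, so unit conversions matter when reporting means. A minimal sketch of the conversions follows, written for Julia 1.x, where `mean` lives in `Statistics`. The sample values are hypothetical stand-ins for `res.times`.

```julia
using Statistics  # provides mean on Julia 1.x

# Hypothetical sample times in nanoseconds, standing in for `res.times`.
sample_times_ns = [13_674.0, 14_674.0, 17_124.0]

mean_ns = mean(sample_times_ns)
mean_us = mean_ns / 1e3  # microseconds
mean_ms = mean_ns / 1e6  # milliseconds
mean_s  = mean_ns / 1e9  # seconds

println("mean: ", mean_us, " μs = ", mean_ms, " ms = ", mean_s, " s")
```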

In [None]:
sizes = [x for x in 100:100:400];
cpu_times = Dict()
gpu_times = Dict()

In [None]:
for s in sizes
    X = rand(Float32,s,s);
    X_result = zeros(X);

    X_gpu = GPUArray(X);
    X_result_gpu = GPUArray(zeros(Float32,s,s));
    
    res_cpu = @benchmark A_mul_B!(X_result, X,X)
    res_gpu = @benchmark A_mul_B!(X_result_gpu, X_gpu, X_gpu)
    
    println("\nsize: ", s, " x ", s)
    # BenchmarkTools reports times in nanoseconds, so divide by 10^9 to get seconds
    println("\t cpu mean time taken: ", mean(res_cpu.times)/10^9, " seconds")
    println("\t gpu mean time taken: ", mean(res_gpu.times)/10^9, " seconds")
    cpu_times[s] = mean(res_cpu.times)/10^9
    gpu_times[s] = mean(res_gpu.times)/10^9

end

## TODO: Explain and test the following

- Check at any time how much memory is available on the GPU
- Check which device a GPUArray is being sent to, and decide how to choose it

In [None]:
res = @benchmark A_mul_B!(X_result_gpu, X_gpu, X_gpu)

In [None]:
println("mean time taken: ", mean(res.times)/10^9, " seconds")  # times are in nanoseconds