# GPUArrays in Julia

#### QUESTIONS

- How do we list all available OpenCL devices, both CPUs and GPUs?
- How do we select a particular device, send an array there, and run an operation on it?
- How do we check, at any time, how much memory is available on a device?
    - In this example, if `A_mul_B!(X_result, X, X)` is run on bigger matrices, the OSX graphical user interface becomes completely unusable. We should use the GPU that is not driving the graphical user interface.
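
A possible starting point for the first and third questions is to query OpenCL directly. The sketch below assumes the OpenCL.jl release that was current for Julia 0.6, where device properties are read with `device[:property]`; newer releases may spell these calls differently. It prints total global memory only. How to hand the selected device to `CLBackend` is still the open question above, so the context created here is not yet connected to `GPUArray`.

```julia
using OpenCL   # assumption: the Julia 0.6-era OpenCL.jl API

# List every OpenCL device (CPUs and GPUs) with its total global memory.
for (i, d) in enumerate(cl.devices())
    println(i, ": ", d[:name],
            "  type: ", d[:device_type],
            "  global memory: ", d[:global_mem_size] / 2^20, " MiB")
end

# Build a context and a command queue on one specific GPU.
gpus = cl.devices(:gpu)
if !isempty(gpus)
    device = gpus[end]            # for example, the last GPU instead of the first
    ctx    = cl.Context(device)
    queue  = cl.CmdQueue(ctx)
    println("selected: ", device[:name])
end
```

On a machine with two GPUs, picking the one that is not attached to the display would be the natural way to avoid freezing the user interface, although which index that is has to be checked on the machine itself.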

In [1]:
using GPUArrays
using BenchmarkTools

In [2]:
X = rand(Float32,200,200);
X_result = zeros(X);

In [3]:
GPUArrays.supported_backends()

(:julia, :opencl)

In [4]:
GPUArrays.supported_blas_libs()

(:BLAS, :CLBLAS)

In [5]:
GPUArrays

GPUArrays

#### Mapping arrays to devices

Let `X` be an array. Calling `X_gpu = GPUArray(X)` creates a copy of `X` on the device of the current context.

In [6]:
X_gpu = GPUArray(X);
X_result_gpu = GPUArray(zeros(Float32,200,200));

In [7]:
GPUArrays.CLBackend.CLContext()

CLContext: AMD Radeon HD - FirePro D300 Compute Engine

In [9]:
CLBackend.current_context()

CLContext: AMD Radeon HD - FirePro D300 Compute Engine

## Matrix multiplication demo

In [None]:
sizes = [x for x in 100:100:1000];
cpu_times = Dict()
gpu_times = Dict()

println("\nCPU times")
for s in sizes
    X = rand(Float32,s,s);
    X_result = zeros(X);
    res_cpu = @elapsed A_mul_B!(X_result, X,X)
    println("size: ", s, " x ", s, " seconds: ", res_cpu, " seconds")
    #cpu_times[s] = mean(res_cpu.times)/10^6
end

println("\nGPU times")
for s in sizes
    X = rand(Float32,s,s);
    X_result = zeros(X);
    X_gpu = GPUArray(X);
    X_result_gpu = GPUArray(zeros(Float32,s,s));

    res_gpu = @elapsed A_mul_B!(X_result_gpu, X_gpu, X_gpu)
    println("size: ", s, " x ", s, " seconds: ", res_gpu, " seconds")
    #gpu_times[s] = mean(res_gpu.times)/10^6
end



CPU times
size: 100 x 100 seconds: 0.492055662 seconds
size: 200 x 200 seconds: 0.000188722 seconds
size: 300 x 300 seconds: 0.00093443 seconds
size: 400 x 400 seconds: 0.001689195 seconds
size: 500 x 500 seconds: 0.00334348 seconds
size: 600 x 600 seconds: 0.003039236 seconds
size: 700 x 700 seconds: 0.004727789 seconds
size: 800 x 800 seconds: 0.00778064 seconds
size: 900 x 900 seconds: 0.010136177 seconds
size: 1000 x 1000 seconds: 0.014474952 seconds

GPU times
size: 100
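
The 100 x 100 CPU entry above (0.49 s) is far slower than the 200 x 200 entry (0.00019 s), which is consistent with the first call paying for one-time compilation rather than arithmetic. A minimal sketch of how to see this effect, using only Base Julia and plain `X * X` instead of `A_mul_B!`:

```julia
X = rand(Float32, 100, 100)

# The first call may include one-time compilation and library start-up.
first_call = @elapsed X * X

# The second call runs the same code again and is closer to the real cost.
second_call = @elapsed X * X

println("first call:  ", first_call, " seconds")
println("second call: ", second_call, " seconds")
```

Running the operation once before the timing loop, or using `@benchmark` as in the later cells, keeps this start-up cost out of the reported numbers.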

#### Choosing a device

In [9]:
#CLBackend.init(device_type=:gpu,device_idx=1)  ### How do we select a particular GPU?
CLBackend.init()

CLContext: AMD Radeon HD - FirePro D300 Compute Engine


## Using CLBLAS

In [10]:
?CLBLAS.gemm!

```
gemm!(tA, tB, alpha, A, B, beta, C)
```

Update `C` as `alpha*A*B + beta*C` or the other three variants according to [`tA`](@ref stdlib-blas-trans) and `tB`. Returns the updated `C`.
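
To make the formula concrete, here is a small CPU-only sketch of what the untransposed (`'N'`, `'N'`) case computes. `gemm_reference` is a hypothetical helper written for this note, not part of CLBLAS, and unlike `gemm!` it returns a new matrix instead of updating `C` in place.

```julia
# Hypothetical reference helper: alpha*A*B + beta*C for the 'N', 'N' case.
gemm_reference(alpha, A, B, beta, C) = alpha * (A * B) + beta * C

A = Float32[1 2; 3 4]
B = Float32[5 6; 7 8]
C = ones(Float32, 2, 2)

# With alpha = 1 and beta = 0 the result is the plain product A * B.
plain = gemm_reference(1f0, A, B, 0f0, C)
println(plain == A * B)

# With alpha = 2 and beta = 3 the product is scaled and the old C is added in.
scaled = gemm_reference(2f0, A, B, 3f0, C)
println(scaled)
```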


In [11]:
# Since alpha = 1 and beta = 0, this computes C = A * B
@benchmark CLBLAS.gemm!('N', 'N', Float32(1.0), X_gpu, X_gpu, Float32(0.0), X_result_gpu)

ERROR (unhandled task failure): OpenCL Error: OpenCL.Context error: (unreadable bytes)
Stacktrace:
 [1] raise_context_error(::String, ::String) at /Users/macpro/.julia/v0.6/OpenCL/src/context.jl:109
 [2] macro expansion at /Users/macpro/.julia/v0.6/OpenCL/src/context.jl:148 [inlined]
 [3] (::OpenCL.cl.##43#44)() at ./task.jl:335


BenchmarkTools.Trial: 
  memory estimate:  1.67 KiB
  allocs estimate:  62
  --------------
  minimum time:     20.038 μs (0.00% GC)
  median time:      24.783 μs (0.00% GC)
  mean time:        26.176 μs (0.00% GC)
  maximum time:     132.573 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [12]:
A_mul_B!(X_result, X, X);

In [13]:
isapprox(Array(X_result_gpu), X_result)

true

In [18]:
free(X_gpu), free(X_result_gpu)

LoadError: UndefVarError: free not defined

## Benchmarking array operations

We can call functions such as `A_mul_B!` on `GPUArray` objects. Multiple dispatch selects the GPU implementation when the arguments are `GPUArray`s.

- `A_mul_B!`
- `A_mul_Bc!`
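
The dispatch mechanism itself can be illustrated without a GPU. In the sketch below, `FakeDeviceArray` and `multiply_where` are hypothetical names invented for this illustration; they are not part of GPUArrays and only show how one function name can resolve to different methods depending on the argument types.

```julia
# Hypothetical stand-in for a device-resident array; NOT the GPUArrays type.
struct FakeDeviceArray
    data::Matrix{Float32}
end

# One function name, two methods. Julia picks the method from the
# types of the arguments at the call site.
multiply_where(a::Matrix{Float32}, b::Matrix{Float32}) = "CPU method"
multiply_where(a::FakeDeviceArray, b::FakeDeviceArray) = "device method"

X = rand(Float32, 4, 4)
X_dev = FakeDeviceArray(X)

println(multiply_where(X, X))          # prints "CPU method"
println(multiply_where(X_dev, X_dev))  # prints "device method"
```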


In [19]:
@benchmark A_mul_B!(X_result, X, X)

BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     131.119 μs (0.00% GC)
  median time:      134.124 μs (0.00% GC)
  mean time:        146.956 μs (0.00% GC)
  maximum time:     333.347 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [20]:
@benchmark A_mul_B!(X_result_gpu, X_gpu, X_gpu)

BenchmarkTools.Trial: 
  memory estimate:  1.67 KiB
  allocs estimate:  62
  --------------
  minimum time:     20.481 μs (0.00% GC)
  median time:      24.641 μs (0.00% GC)
  mean time:        25.318 μs (0.00% GC)
  maximum time:     91.758 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

In [21]:
isapprox(Array(X_result_gpu), X_result)

true

### Benchmarking A_mul_B! for different sizes

Note that a microsecond (μs) is one millionth of a second.
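
BenchmarkTools stores the raw samples of a trial (its `times` field) in nanoseconds, so converting to seconds means dividing by 10^9, not 10^6. A short sketch of the conversion, using the median GPU time reported above (24.641 μs); the two helper names are hypothetical and exist only for this example:

```julia
# Hypothetical helpers for converting nanosecond sample times.
ns_to_microseconds(t_ns) = t_ns / 1e3
ns_to_seconds(t_ns) = t_ns / 1e9

# Median GPU time from the benchmark above: 24.641 μs, i.e. 24641 ns.
t_ns = 24641.0

println(ns_to_microseconds(t_ns), " μs")
println(ns_to_seconds(t_ns), " seconds")
```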

In [None]:
sizes = [x for x in 100:100:400];
cpu_times = Dict()
gpu_times = Dict()

In [None]:
for s in sizes
    X = rand(Float32,s,s);
    X_result = zeros(X);

    X_gpu = GPUArray(X);
    X_result_gpu = GPUArray(zeros(Float32,s,s));
    
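    # Interpolate with `$` so BenchmarkTools does not time global-variable lookups
    res_cpu = @benchmark A_mul_B!($X_result, $X, $X)
    res_gpu = @benchmark A_mul_B!($X_result_gpu, $X_gpu, $X_gpu)
    
    # BenchmarkTools reports times in nanoseconds; divide by 10^9 for seconds
    cpu_seconds = mean(res_cpu.times)/10^9
    gpu_seconds = mean(res_gpu.times)/10^9
    println("\nsize: ", s, " x ", s)
    println("\t cpu mean time taken: ", cpu_seconds, " seconds")
    println("\t gpu mean time taken: ", gpu_seconds, " seconds")
    cpu_times[s] = cpu_seconds
    gpu_times[s] = gpu_seconds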

end

## TODO: explain and test the following

- Check at any time how much memory is available on the GPU
- Check which device a `GPUArray` is being sent to, and decide how to choose it
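
Until a device-side memory query is worked out, one rough check is to estimate how many bytes the arrays themselves occupy and compare that with the device's total memory. As far as we know, OpenCL's standard device queries report total global memory but not the amount currently free. The helper name `matrix_bytes` below is hypothetical, and the estimate ignores any extra buffers the backend may allocate.

```julia
# Hypothetical helper: bytes needed for `count` dense n x n matrices of type T.
matrix_bytes(T, n, count) = sizeof(T) * n * n * count

# Each benchmark run keeps two Float32 matrices on the device:
# the input (X_gpu) and the result (X_result_gpu).
for s in (200, 1000, 10000)
    mib = matrix_bytes(Float32, s, 2) / 2^20
    println("size: ", s, " x ", s, " needs about ", mib, " MiB")
end
```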

In [None]:
res = @benchmark A_mul_B!(X_result_gpu, X_gpu, X_gpu)

In [None]:
println("mean time taken: ", mean(res.times)/10^9, " seconds")  # times are in nanoseconds