# Computing on GPUs

## GPU (local)

First of all, let's check if there is a GPU available.

### Matrix multiplication

In [None]:
A, B = rand(1000,1000), rand(1000,1000);

Let's move these arrays to the GPU. First we load `CuArrays` and assert that it can actually use the GPU.

In [None]:
using CuArrays
@assert CuArrays.functional() # if this fails your GPU isn't recognized correctly

In [None]:
Agpu, Bgpu = CuArray(A), CuArray(B);

That's it!

In [None]:
typeof(Agpu)

How much faster is a simple matmul on the GPU? Let's find out.

In [None]:
using BenchmarkTools

In [None]:
println("A*B (cpu)")
@btime $A * $B;

In [None]:
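println("A*B (gpu)")
@btime CuArrays.@sync $Agpu * $Bgpu; # wait for the GPU to finish before stopping the clock

GPU operations are launched asynchronously. Without `CuArrays.@sync` the call returns as soon as the work is queued, and the measured time can look at least 3 orders of magnitude faster than the CPU. With synchronization, the timing reflects the actual multiplication and is the number to compare against the CPU result.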

In [None]:
# Free GPU memory
Agpu, Bgpu = nothing, nothing
GC.gc()

Note that the result of the multiplication lives on the GPU as well and needs to be pulled back to main memory.

In [None]:
Agpu, Bgpu = CuArray(A), CuArray(B);

In [None]:
Cgpu = Agpu * Bgpu;

In [None]:
typeof(Cgpu)

In [None]:
C = Matrix(Cgpu); # move to cpu

How long does it take to move the `CuArray` back to main memory?

In [None]:
@btime Matrix($Cgpu);

In [None]:
# Free GPU memory
Agpu, Bgpu, Cgpu = nothing, nothing, nothing
GC.gc()

### Machine learning

In [None]:
using Flux

In [None]:
m = Chain(
    Dense(1000, 100),
    Dense(100, 10),
    Dense(10, 5),
    Dense(5, 2),
    softmax # normalize output neurons
    )

data = rand(1000, 1000); # fake data
labels = fill(0.5, 2, 1000); # fake data

loss(x, y) = Flux.mse(m(x), y) # mean squared error (already a scalar)
opt = Descent(0.01)

In [None]:
@time Flux.train!(loss, Flux.params(m), [(data,labels)], opt)

Let's train the network on the GPU instead! It's as simple as `|> gpu`:

In [None]:
# move the model to the gpu
m = Chain(
    Dense(1000, 100),
    Dense(100, 10),
    Dense(10, 5),
    Dense(5, 2),
    softmax
    ) |> gpu

# move data to the gpu
data = rand(1000, 1000) |> gpu;
labels = fill(0.5, 2, 1000) |> gpu;

loss(x, y) = Flux.mse(m(x), y) # mean squared error (already a scalar)
opt = Descent(0.01)

In [None]:
typeof(m)

In [None]:
@time Flux.train!(loss, Flux.params(m), [(data,labels)], opt)

The training is about **two orders of magnitude faster** on the GPU in this case!
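A caveat when comparing `@time` results like the ones above: the first call of a Julia function includes compilation time, so a single timed call can overstate the real cost. The following is a minimal sketch of that effect, using a hypothetical stand-in function `f` rather than the actual training step.

```julia
# Hypothetical stand-in for an expensive step such as Flux.train!
f(x) = sum(abs2, x)
x = rand(1000)

t_first  = @elapsed f(x)   # includes compiling f for Vector{Float64}
t_second = @elapsed f(x)   # runs the already compiled code

println("first call:  ", t_first, " s")
println("second call: ", t_second, " s")
```

Timing the second call (or using `@btime`) gives a fairer comparison between CPU and GPU runs.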

Now that our model is trained, let's feed it some data.

In [None]:
m(rand(1000))

Oops. Our model lives on the GPU, so we can't feed it data that lives in main memory. Either the input has to be moved to the GPU or the model has to be moved back to the CPU. Here we move the model back to the CPU.

In [None]:
m_cpu = m |> cpu

In [None]:
typeof(m_cpu)

In [None]:
m_cpu(rand(1000))
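Instead of moving the model back, one can also keep it on the GPU and move the input there. The sketch below assumes a small hypothetical stand-in model `m_demo` and a `Float32` input. As far as I know, `gpu` falls back to leaving data on the CPU when no working GPU is found, so the snippet should also run on a machine without one.

```julia
using Flux

# Hypothetical small stand-in for the model trained above.
m_demo = Chain(Dense(1000, 10), Dense(10, 2), softmax) |> gpu

x = rand(Float32, 1000) |> gpu   # put the input on the same device as the model
y = m_demo(x) |> cpu             # evaluate, then bring the result back to main memory

println(y)
```

Which direction to move depends on the use case: for many predictions it is usually preferable to keep the model on the GPU and transfer only the inputs and outputs.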

## GPU (remote)

Let's start a worker on a GPU node of a compute cluster.

In [9]:
using Distributed
addprocs([("cbauer17@gpu2", 1)];
    exename=`/projects/ag-trebst/julia/1.5.3/bin/julia`,
    exeflags=`--project=/projects/ag-trebst/bauer/JuliaOulu20/backup/gpu`,
    dir="/projects/ag-trebst/bauer/JuliaOulu20/backup/gpu",
    tunnel=true)
@fetch gethostname()

"cheops51801"
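The same pattern can be tried without cluster access. The sketch below assumes a single local worker as a stand-in for the remote GPU node. `@fetch` runs its expression on a worker chosen by Julia, while `@fetchfrom` targets a specific worker id, which matters once several workers exist.

```julia
using Distributed

# Start one local worker (stand-in for the remote GPU node).
w = first(addprocs(1))

# Run code on that specific worker and fetch the result.
remote_pid = @fetchfrom w myid()
host = @fetchfrom w gethostname()

println("worker ", remote_pid, " runs on ", host)

# Shut the worker down again.
rmprocs(w)
```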

Extract the GPU name:

In [10]:
@fetch @eval using CUDAdrv
@fetch CUDAdrv.name(CuDevice(0))

"Tesla V100-SXM2-16GB"

### Matrix multiplication

In [11]:
using BenchmarkTools
@fetch @eval using CuArrays, BenchmarkTools

In [None]:
@fetch begin
    A, B = rand(1000,1000), rand(1000,1000);
    Agpu, Bgpu = CuArray(A), CuArray(B);
    
    println("Move array CPU -> GPU")
    @btime CuArray($A);
    
    println("A*B (cpu)")
    @btime $A * $B;

    println("A*B (gpu)")
    @btime $Agpu * $Bgpu;
    
    
    println("Move array GPU -> CPU")
    Cgpu = Agpu * Bgpu
    @btime Array($Cgpu);
    
    nothing
end

      From worker 3:	Move array CPU -> GPU
      From worker 3:	  723.740 μs (3 allocations: 96 bytes)
      From worker 3:	A*B (cpu)
      From worker 3:	  53.355 ms (2 allocations: 7.63 MiB)
      From worker 3:	A*B (gpu)
      From worker 3:	Downloading artifact: CUDA10.2
      From worker 3:	curl: (7) couldn't connect to host
      From worker 3:	Downloading artifact: CUDA10.2
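The two timings printed before the artifact download started can be turned into rough throughput figures. This is a back-of-the-envelope sketch: it assumes the conventional count of `2n^3` floating-point operations for an `n×n` matrix product, and the variable names are chosen for illustration.

```julia
# Timings copied from the worker output above.
t_copy   = 723.740e-6   # seconds, CuArray(A): CPU -> GPU
t_matmul = 53.355e-3    # seconds, A * B on the CPU

n = 1000
bytes = n * n * sizeof(Float64)   # 8_000_000 bytes, i.e. 7.63 MiB
flops = 2 * n^3                   # conventional operation count (assumption)

bandwidth_GBs = bytes / t_copy / 1e9
cpu_GFLOPs    = flops / t_matmul / 1e9

println("CPU -> GPU copy: ", round(bandwidth_GBs, digits=1), " GB/s")
println("CPU matmul:      ", round(cpu_GFLOPs, digits=1), " GFLOP/s")
```

This works out to roughly 11 GB/s for the host-to-device copy and about 37 GFLOP/s for the CPU multiplication.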
