# Computing on GPUs

## GPU (local)

First of all, let's check if there is a GPU available.

In [1]:
using CUDAdrv

In [2]:
CuDevice(0)

CuDevice(0): GeForce GT 1030

### Matrix multiplication

In [3]:
A, B = rand(1000,1000), rand(1000,1000);

Let's move these arrays to the GPU.

In [4]:
using CuArrays
@assert CuArrays.functional() # if this fails your GPU isn't recognized correctly

In [5]:
Agpu, Bgpu = CuArray(A), CuArray(B);

That's it!

In [6]:
typeof(Agpu)

CuArray{Float64,2,Nothing}

How much faster is a simple matmul on the GPU? Let's find out.

In [7]:
using BenchmarkTools

In [8]:
println("A*B (cpu)")
@btime $A * $B;

A*B (cpu)
  12.731 ms (2 allocations: 7.63 MiB)


In [9]:
println("A*B (gpu)")
@btime $Agpu * $Bgpu;

A*B (gpu)
  2.933 μs (10 allocations: 416 bytes)


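Careful, though: this number is misleading. GPU operations run asynchronously, so `@btime` only measured how long it takes to launch the multiplication, not how long it takes to finish it.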
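
To time the full computation, the benchmark has to wait for the GPU, because GPU calls return as soon as the work is queued. A minimal sketch, assuming the installed CuArrays version provides the `CuArrays.@sync` macro. It also times a `Float32` variant, since consumer GPUs typically have much higher single-precision than double-precision throughput. No timings are shown because they depend on the hardware.

```julia
using CuArrays, BenchmarkTools

A, B = rand(1000, 1000), rand(1000, 1000)
Agpu, Bgpu = CuArray(A), CuArray(B)

# Wait for the GPU to finish, so the whole multiplication is timed
@btime CuArrays.@sync $Agpu * $Bgpu;

# Same benchmark in single precision
Agpu32, Bgpu32 = CuArray(Float32.(A)), CuArray(Float32.(B))
@btime CuArrays.@sync $Agpu32 * $Bgpu32;
```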

In [None]:
# Free GPU memory
Agpu, Bgpu = nothing, nothing
GC.gc()

Note that the result of the multiplication lives on the GPU as well and needs to be pulled back to main memory.

In [None]:
Agpu, Bgpu = CuArray(A), CuArray(B);

In [None]:
Cgpu = Agpu * Bgpu;

In [None]:
typeof(Cgpu)

In [None]:
C = Matrix(Cgpu); # move to cpu

How long does it take to move the `CuArray` back to main memory?

In [None]:
@btime Matrix($Cgpu);
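
As a rough guide to what this transfer involves, the amount of data can be worked out by hand. The sketch below computes the size of the matrix and a back-of-the-envelope transfer time; the bandwidth value is a hypothetical placeholder, not a measured property of any GPU.

```julia
# Size of a 1000×1000 Float64 matrix, as moved between GPU and CPU
n = 1000
nbytes = n * n * sizeof(Float64)   # 8 bytes per element
mib = nbytes / 2^20                # same size as the 7.63 MiB allocation reported for A * B

# Hypothetical effective transfer bandwidth (assumption, not a measurement)
bandwidth = 10e9                   # bytes per second
t_estimate = nbytes / bandwidth    # estimated seconds per transfer

println("size: ", round(mib, digits = 2), " MiB")
println("estimated transfer time: ", round(t_estimate * 1e3, digits = 2), " ms")
```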

In [None]:
# Free GPU memory
Agpu, Bgpu, Cgpu = nothing, nothing, nothing
GC.gc()

### Machine learning

In [1]:
using Flux

In [2]:
m = Chain(
    Dense(1000, 100),
    Dense(100, 10),
    Dense(10, 5),
    Dense(5, 2),
    softmax # normalize output neurons
    )

data = rand(1000, 1000); # fake data
labels = fill(0.5, 2, 1000); # fake data

loss(x, y) = Flux.mse(m(x), y) # mean squared error (already a scalar, so no sum is needed)
opt = Descent(0.01)

Descent(0.01)
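
To make the loss concrete, here is the mean squared error written out in plain Julia. This is a sketch that assumes `Flux.mse` averages the squared differences over all elements, as in the Flux version used here; `mse_by_hand` and the small `ŷ` matrix are made up for illustration.

```julia
# Mean squared error written out by hand (plain Julia, no Flux needed)
mse_by_hand(ŷ, y) = sum((ŷ .- y) .^ 2) / length(y)

ŷ = [0.7 0.2; 0.3 0.8]   # hypothetical model outputs, one column per sample
y = fill(0.5, 2, 2)      # targets, in the style of the fake labels above

l = mse_by_hand(ŷ, y)    # a single number
@assert sum(l) == l      # summing a scalar changes nothing
```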

In [8]:
@time Flux.train!(loss, Flux.params(m), [(data,labels)], opt)

  0.314975 seconds (5.02 M allocations: 110.066 MiB, 1.12% gc time)


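Let's train the network on the GPU instead! It's as simple as `|> gpu`. Here `|>` is Julia's pipe operator, which applies the function on its right to the value on its left, so `x |> f` is the same as `f(x)`: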

In [9]:
sin(3)

0.1411200080598672

In [10]:
3 |> sin

0.1411200080598672
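
Pipes can also be chained, which is how a model definition followed by `|> gpu` reads: build the value first, then hand it to the next function. A small sketch in plain Julia:

```julia
x = -3

# Pipes chain from left to right: first `sin`, then `abs`
a = x |> sin |> abs
b = abs(sin(x))
@assert a == b

# Any function works on the right-hand side, including anonymous ones
c = [1, 2, 3] |> sum |> (s -> s / 3)
```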

In [11]:
# move the model to the gpu
m = Chain(
    Dense(1000, 100),
    Dense(100, 10),
    Dense(10, 5),
    Dense(5, 2),
    softmax
    ) |> gpu

# move data to the gpu
data = rand(1000, 1000) |> gpu;
labels = fill(0.5, 2, 1000) |> gpu;

loss(x, y) = Flux.mse(m(x), y) # mean squared error (already a scalar, so no sum is needed)
opt = Descent(0.01)

Descent(0.01)

In [12]:
typeof(m)

Chain{Tuple{Dense{typeof(identity),CuArrays.CuArray{Float32,2,Nothing},CuArrays.CuArray{Float32,1,Nothing}},Dense{typeof(identity),CuArrays.CuArray{Float32,2,Nothing},CuArrays.CuArray{Float32,1,Nothing}},Dense{typeof(identity),CuArrays.CuArray{Float32,2,Nothing},CuArrays.CuArray{Float32,1,Nothing}},Dense{typeof(identity),CuArrays.CuArray{Float32,2,Nothing},CuArrays.CuArray{Float32,1,Nothing}},typeof(softmax)}}

In [14]:
@time Flux.train!(loss, Flux.params(m), [(data,labels)], opt)

  0.006236 seconds (3.80 k allocations: 167.000 KiB)


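The training step is about **50 times faster** on the GPU in this case (0.315 s vs. 0.006 s). Treat this as a rough figure: each number comes from a single `@time` call, and `|> gpu` also converted the data to `Float32`, while the CPU run used `Float64` data.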

Now that our model is trained, let's feed it some data.

In [15]:
m(rand(1000))

ArgumentError: ArgumentError: cannot take the CPU address of a CuArrays.CuArray{Float32,2,Nothing}

Oops. Since our model lives on the GPU, we can't feed it data that lives in main memory. We either have to move the input to the GPU as well (`rand(1000) |> gpu`) or move the model back to the CPU. Let's do the latter.

In [16]:
m_cpu = m |> cpu

Chain(Dense(1000, 100), Dense(100, 10), Dense(10, 5), Dense(5, 2), softmax)

In [17]:
typeof(m_cpu)

Chain{Tuple{Dense{typeof(identity),Array{Float32,2},Array{Float32,1}},Dense{typeof(identity),Array{Float32,2},Array{Float32,1}},Dense{typeof(identity),Array{Float32,2},Array{Float32,1}},Dense{typeof(identity),Array{Float32,2},Array{Float32,1}},typeof(softmax)}}

In [18]:
m_cpu(rand(1000))

2-element Array{Float32,1}:
 0.5463516 
 0.45364842
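
Alternatively, the input can be moved to the GPU instead of moving the model to the CPU. A minimal sketch, written in the Flux API used in this notebook and assuming `gpu` and `cpu` behave as in the cells above. The small `model` is a stand-in, not the trained network.

```julia
using Flux

# Small stand-in model (not the trained one above), moved to the GPU
model = Chain(Dense(1000, 100), Dense(100, 2), softmax) |> gpu

# Move the input to the GPU too, so model and data live on the same device
x = rand(1000) |> gpu

# Evaluate on the GPU and copy the prediction back to main memory
y = model(x) |> cpu
```

Keeping both model and data on the GPU avoids copying the whole model back and forth when only the small output vector is needed on the CPU.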

## GPU (remote)

Let's start a worker on a GPU node of a supercomputing cluster.

In [None]:
using Distributed
addprocs([("cbauer17@gpu2", 1)]; exename=`/projects/ag-trebst/bauer/bin/julia-1.3.1/bin/julia`, exeflags=`--project=/projects/ag-trebst/bauer/JuliaOulu20/backup/gpu`, dir="/projects/ag-trebst/bauer/JuliaOulu20/backup/gpu", tunnel=true)
@fetch gethostname()

In [None]:
params = (exename=`nice -19 /home/bauer/bin/julia-1.3.1/bin/julia --project=/home/bauer/JuliaOulu20`, dir="/home/bauer")
addprocs([("l94", :auto)]; params...)

In [None]:
using Distributed
addprocs([("cbauer17@gpu2", 1)]; exename=`/projects/ag-trebst/bauer/bin/julia-1.3.1/bin/julia`, dir="/projects/ag-trebst/bauer", tunnel=true)
@fetch gethostname()

Extract the GPU name:

In [None]:
@fetch @eval using CUDAdrv
@fetch CUDAdrv.name(CuDevice(0))
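
`@fetch` runs the expression on a worker that Julia picks, which is fine as long as there is only one worker. With several workers, `@fetchfrom` targets a specific one. The sketch below uses a local worker process as a stand-in for the remote GPU node, and the standard library `LinearAlgebra` is only a placeholder for a GPU package.

```julia
using Distributed

# Stand-in for the remote GPU node: one additional local worker process
w = first(addprocs(1))

# Load a package on that worker only (placeholder for `using CUDAdrv`)
@fetchfrom w @eval using LinearAlgebra

# Run code on that specific worker and fetch the result
host = @fetchfrom w gethostname()
wid  = @fetchfrom w myid()
println("worker ", wid, " runs on ", host)

rmprocs(w)  # shut the worker down again
```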

### Matrix multiplication

In [None]:
@fetch @eval using CuArrays, BenchmarkTools

In [None]:
@fetch begin
    A, B = rand(1000,1000), rand(1000,1000);
    Agpu, Bgpu = CuArray(A), CuArray(B);
    
    println("Move array CPU -> GPU")
    @btime CuArray($A);
    
    println("A*B (cpu)")
    @btime $A * $B;

    println("A*B (gpu)")
    @btime CuArrays.@sync $Agpu * $Bgpu; # wait for the GPU so the full multiplication is timed
    
    
    println("Move array GPU -> CPU")
    Cgpu = Agpu * Bgpu
    @btime Array($Cgpu);
    
    nothing
end