# Matrix multiplication on GPUs

Here is another case for *generic programming*: if you want to move a calculation to a GPU, chances are you only have to change the type of your matrix!
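To make the idea concrete, here is a minimal sketch of a function written against `AbstractMatrix` rather than a concrete array type. The name `scaled_product` and the small test matrices are hypothetical and only serve as illustration. Because the function uses nothing but `*` and broadcasting, the same definition should also accept GPU arrays such as `CuArray` without modification.

```julia
# Hypothetical helper (not part of any library): it only assumes that its
# arguments behave like matrices, not that they are stored in main memory.
function scaled_product(A::AbstractMatrix, B::AbstractMatrix, α::Number)
    return α .* (A * B)   # generic matrix product followed by broadcast scaling
end

A_small = [1.0 2.0; 3.0 4.0]
B_small = [0.0 1.0; 1.0 0.0]

# Works for a plain Matrix{Float64}; no GPU is needed for this call.
result = scaled_product(A_small, B_small, 2.0)
```

Julia dispatches on the actual array type at call time, so the choice between a CPU and a GPU implementation of `*` is made by the arguments rather than by the function.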

First of all, let's create two random matrices in main memory.

In [None]:
A, B = rand(1000,1000), rand(1000,1000);

Before moving these arrays to the GPU, let's load CUDA.jl and check that a GPU is available.

In [None]:
using CUDA
@assert CUDA.functional() # if this fails your GPU isn't recognized correctly

In [None]:
CUDA.versioninfo()

In [None]:
Agpu, Bgpu = CuArray(A), CuArray(B);

That's it!

In [None]:
typeof(Agpu)

In [None]:
Cgpu = Agpu * Bgpu;

Note that the result of the multiplication lives on the GPU as well and must be copied back to main memory before CPU code can use it.

In [None]:
CUDA.@allowscalar Cgpu[1] # scalar indexing on the GPU is slow; allow it explicitly for this one-off check

In [None]:
C = Matrix(Cgpu); # copy the result back to main memory (CPU)

In [None]:
typeof(C)

In [None]:
C[1]

How much faster is a simple matmul on the GPU? Let's find out.

In [None]:
using BenchmarkTools

In [None]:
println("A*B (cpu)")
@btime $A * $B;

In [None]:
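println("A*B (gpu)")
@btime CUDA.@sync $Agpu * $Bgpu; # GPU kernels launch asynchronously, so wait for the result

The GPU is usually considerably faster for a product of this size. Note the `CUDA.@sync`: without it, `@btime` would only measure the time needed to launch the kernel, which can make the GPU look orders of magnitude faster than it really is. The actual speedup depends on your hardware and on the element type (`Float64` here).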
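To put such timings into perspective, it can help to convert them into a throughput and a speedup factor. The sketch below assumes the usual estimate of about 2n³ floating-point operations for the product of two n×n matrices. The helper names `matmul_gflops` and `speedup` are hypothetical. The example times a CPU product only; on the GPU the timed expression would additionally need `CUDA.@sync`.

```julia
# Hypothetical helpers for interpreting benchmark results.
# A product of two n×n matrices needs roughly 2n^3 floating-point operations.
matmul_gflops(n, seconds) = 2 * n^3 / seconds / 1e9

# Ratio of a reference time to a new time; 1000 means three orders of magnitude.
speedup(t_ref, t_new) = t_ref / t_new

n = 1000
X, Y = rand(n, n), rand(n, n)
X * Y                    # warm-up call so that compilation is not timed
t = @elapsed X * Y       # wall-clock seconds for one product on the CPU
println("CPU throughput: ", round(matmul_gflops(n, t), digits=1), " GFLOP/s")
```

Comparing the resulting throughput with the peak performance listed for your hardware is a quick sanity check: a measured figure far above the peak usually means the timing did not wait for the computation to finish.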

# Machine learning on GPUs

### CPU

In [None]:
using Flux

In [None]:
m = Chain(
    Dense(1000, 100),
    Dense(100, 10),
    Dense(10, 5),
    Dense(5, 2),
    softmax # normalize output neurons
    )

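data = rand(Float32, 1000, 1000); # fake inputs: 1000 features × 1000 samples (Float32 matches the model's parameters)
labels = fill(0.5f0, 2, 1000); # fake targets

loss(x, y) = Flux.mse(m(x), y) # mean squared error (already a scalar)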
opt = Descent(0.01)

In [None]:
@time Flux.train!(loss, Flux.params(m), [(data,labels)], opt)
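For readers wondering what the `softmax` layer and the `Flux.mse` loss in the model above compute, here is a plain-Julia sketch of both. The names `softmax_sketch` and `mse_sketch` are hypothetical and only illustrate the arithmetic; Flux's own implementations should be preferred in real code.

```julia
# Column-wise softmax: every column is mapped to positive numbers that sum to 1.
function softmax_sketch(x::AbstractMatrix)
    e = exp.(x .- maximum(x, dims=1))   # subtract the column maximum for numerical stability
    return e ./ sum(e, dims=1)
end

# Mean squared error: the average of the squared element-wise differences.
mse_sketch(yhat, y) = sum(abs2, yhat .- y) / length(y)

logits = [1.0 0.0; 1.0 0.0]    # two samples (columns) with two equal scores each
p = softmax_sketch(logits)     # equal scores give equal probabilities per column
err = mse_sketch(p, fill(0.5, 2, 2))
```

Since the fake labels in this notebook are all `0.5`, a network whose two outputs are equal for every sample would reach a loss of zero.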

### GPU

Let's train the network on the GPU instead! It's as simple as `|> gpu`:

In [None]:
# move the model to the gpu
m = Chain(
    Dense(1000, 100),
    Dense(100, 10),
    Dense(10, 5),
    Dense(5, 2),
    softmax
    ) |> gpu

# move data to the gpu
data = rand(Float32, 1000, 1000) |> gpu;
labels = fill(0.5f0, 2, 1000) |> gpu;

loss(x, y) = Flux.mse(m(x), y) # mean squared error (already a scalar)
opt = Descent(0.01)

In [None]:
typeof(m)

In [None]:
@time Flux.train!(loss, Flux.params(m), [(data,labels)], opt)

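The training is about **two orders of magnitude faster** on the GPU in this case! Keep in mind that the first call to `Flux.train!` also includes Julia's compilation time, so run each timing cell twice and compare the second runs; the actual speedup depends on your hardware.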

Now that our model is trained, let's feed it some data.

In [None]:
m(rand(1000))

Oops. Since our model lives on the GPU, we can't feed it data that lives in main memory. We must either move the input to the GPU or, as we do here, move the model back to the CPU first.

In [None]:
m_cpu = m |> cpu

In [None]:
typeof(m_cpu)

In [None]:
m_cpu(rand(1000))