In [12]:
] activate .

[32m[1m  Activating[22m[39m environment at `~/tutorial_julia_interactive/04_cuda_highlevel/Project.toml`


In [13]:
] instantiate

# Using the GPU without writing all the kernels by hand

Being able to write generic kernels in Julia is nice, but in many cases array programming and calls into libraries can do the heavy lifting.

Julia can fuse elementwise scalar operations into a single loop; this is called broadcasting. Quite often all you need to do is add some dots to have the same code work on the CPU & GPU...
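
As a minimal sketch of what "adding some dots" means, the snippet below applies a scalar function elementwise and then does the same update in place. The function name `damp` and the small test vectors are made up for illustration; they are not part of the tutorial code.

```julia
# Scalar function, written without any arrays in mind (hypothetical example)
damp(u, ξ, p) = u / (1 + ξ^(2p))

u = ComplexF64[1, 2, 3, 4]
ξ = [0.0, 1.0, 2.0, 1.0]

# The dot applies the scalar function to every element
v = damp.(u, ξ, 0.75)

# The in-place form: all dotted operations are fused and written back into `u`
u ./= 1 .+ ξ .^ (2 * 0.75)

# On a GPU the same two lines should work on `CuArray(u)` and `CuArray(ξ)` (not run here)
```

After both updates `u` and `v` hold the same values, which is a quick way to convince yourself that the dotted expression and the scalar function agree.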

In this example we'll look at a partial differential equation that comes up in a time-stepping method, and we solve it using FFTs + broadcasting operations.

In [14]:
# a separate package defines some interfaces / types
using AbstractFFTs: Plan, plan_fft!, plan_ifft!

# And FFTW and CUDA implement them
using FFTW, CUDA
using CUDA.CUFFT
using CUDA: @sync as @cusync
using LinearAlgebra

FFTW.set_num_threads(12) # <- FFTW disables threading by default -- make sure to enable it. Note: you can also get the MKL version through FFTW!

In [15]:
"""
    time_step!(uₜ, 𝓕, 𝓕⁻¹, p)

Explanations:

1. Generate a 3D array with nice visualization properties.

2. Obtain approximately fractal Brownian noise, appropriately damping
   the high frequencies of Fourier transformed spatial white noise,
   and (inverse) Fourier transforming the result back into the spatial domain.
 
3. Do a backward Euler time step on the fractional PDE

       du(x, t)/dt = Δᵖu(x, t).

   Discretize in time with time step = 1:

         uₜ₊₁ - uₜ = Δᵖuₜ₊₁
     (-Δᵖ + 1)uₜ₊₁ = uₜ
   (|ξ|²ᵖ + 1)ũₜ₊₁ = ũₜ                  (Forward Fourier transform: \\tilde<tab> gives you the wiggle)
              ũₜ₊₁ = ũₜ / (|ξ|²ᵖ + 1)    (Scalar updates)
              uₜ₊₁ = (ũₜ / (|ξ|²ᵖ + 1))̃  (Inverse Fourier transform, unwiggle!)
"""
function time_step!(
        uₜ::AbstractArray{T,3} = randn(ComplexF64, 128, 128, 128),
        𝓕::Plan{T} = plan_fft!(uₜ),
        𝓕⁻¹::Plan{T} = plan_ifft!(uₜ),
        p = real(T)(0.75)) where {T<:Complex}
    
    @assert size(uₜ, 1) == size(uₜ, 2) == size(uₜ, 3)
    
    n = size(uₜ, 1)

    # Note: uₜ is modified in-place if you provide an in-place plan -- no copy is made
    @cusync ũₜ = 𝓕 * uₜ
    
    # Discrete Fourier transform indices are confusing more often than not.
    # In a perfect world where I have time on my hands I would do this without the
    # temporary, and rather use something like n .- abs.(-n÷2:n÷2) lazily -- but then again,
    # it's only O(n) memory in an O(n^3) problem.
    @cusync ξ = maybe_to_device(uₜ)(Float32[0:n÷2; n÷2-1:-1:(iseven(n) ? 1 : 0)])

    ξ₁ = reshape(ξ, :, 1, 1)
    ξ₂ = reshape(ξ, 1, :, 1) 
    ξ₃ = reshape(ξ, 1, 1, :)
    
    # The broadcasting bit -- fuses operations.
    @cusync ũₜ ./= 1 .+ (ξ₁.^2 .+ ξ₂.^2 .+ ξ₃.^2) .^ p
    
    # Also in-place
    @cusync uₜ₊₁ = 𝓕⁻¹ * ũₜ
    
    return uₜ₊₁
end

# If I work with CuArrays I want to get a CuVector constructor
maybe_to_device(::CuArray) = CuVector
maybe_to_device(::AbstractArray) = identity;
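
Since the index bookkeeping is easy to get wrong, here is a small sanity check of the wave number vector used in `time_step!`. It is only a sketch: it assumes an even `n` (all sizes used in this notebook are even) and compares against `fftfreq`, which FFTW re-exports from AbstractFFTs.

```julia
using FFTW  # re-exports `fftfreq` from AbstractFFTs

n = 8                            # small even size, just for the check
ξ = [0:n÷2; n÷2-1:-1:1]          # same construction as in `time_step!` for even n

# `fftfreq(n, n)` gives the signed integer frequencies 0, 1, ..., -2, -1
ξ == abs.(fftfreq(n, n))         # → true
```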

## 3D volume plots in Julia using Makie.jl

There is one thing I didn't figure out in time: remote rendering. But it is implemented in Makie.jl and we should set it up!

In [16]:
# WGLMakie is a webgl plotting library
using JSServe, WGLMakie

Page(exportable=true, offline=true);
set_theme!(resolution=(1024, 1024))

In [17]:
# Compute a solution on the GPU
u = Complex.(CUDA.randn(Float32, 150, 150, 150))

# Do a step
time_step!(u)

# Get the real part and download to CPU
u_real = Array(real(u))

# Note to self: fix this -- currently plotting serializes data and sends it to the browser.
# It takes forever! Remote rendering would be better.
# volume(u_real, colorrange=extrema(u_real))


150×150×150 Array{Float32, 3}:
[:, :, 1] =
  0.0057804     0.00445605    0.00100542   …   0.00444631    0.00770759
  0.00253026    0.000527489  -7.5042f-5        0.00218634    0.000308043
 -0.000399728  -0.00163142   -0.00197742      -3.11043f-5    0.00132125
 -0.00230765   -0.00678148   -0.00637932      -0.00216577    0.00241848
 -0.00281224   -0.00550273   -0.00468624      -0.00285926   -0.000255659
 -0.00285854   -0.00529108   -0.00365866   …  -0.00261306   -0.000906653
 -0.00292633   -0.00485817   -0.00437061      -0.0040111    -0.00355011
 -0.00574465   -0.00684693   -0.00321379      -0.00210126   -0.00763699
 -0.00423064   -0.00641217   -0.00391258      -0.0036463    -0.0051799
  0.00342602   -0.00262055   -0.00690729      -0.00596435   -0.00371351
  0.000521163  -0.0021999    -0.00217574   …  -0.0051417    -0.00117805
 -0.00293995   -0.00312833   -0.00133278      -0.000923452  -0.00307118
 -0.00464156   -0.0051453     0.000361606     -0.000415342  -0.00317865
  ⋮                

## Performance comparison between CPU & GPU

Let's run our time stepper on the CPU and GPU, add a validation test, and compare the GPU speedup.

In [18]:
# Some parameters for our problem

n = 256
T = ComplexF64
p = 0.75

@show n T p

# Initial random value
uₜ = randn(T, n, n, n)
uₜ_d = CuArray(uₜ)

# Compute the FFT plans ahead of time -- not part of our benchmark
# CPU
𝓕     = plan_fft!(uₜ);
𝓕⁻¹   = plan_ifft!(uₜ)

# GPU
𝓕_d   = plan_fft!(uₜ_d);
𝓕⁻¹_d = plan_ifft!(uₜ_d)

# Validate CPU & GPU results
using LinearAlgebra, Test
println(@test norm(time_step!(uₜ, 𝓕, 𝓕⁻¹) - Array(time_step!(uₜ_d, 𝓕_d, 𝓕⁻¹_d))) < 100eps(real(T)))

# Run on CPU
cpu_time = @elapsed time_step!(uₜ, 𝓕, 𝓕⁻¹, 0.75)

# Run on GPU
gpu_time = @elapsed time_step!(uₜ_d, 𝓕_d, 𝓕⁻¹_d, 0.75)

println("CPU took: ", round(cpu_time, digits=4), "s")
println("GPU took: ", round(gpu_time, digits=4), "s")
println("That's a ", round(cpu_time / gpu_time, digits=2), "× speedup")

nothing

n = 256
T = ComplexF64
p = 0.75
[32m[1mTest Passed[22m[39m
CPU took: 1.8916s
GPU took: 0.0105s
That's a 179.64× speedup


Okay, tests are passing -- but the speedup is an order of magnitude larger than what is theoretically possible. That can't be correct, can it?
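
One thing worth ruling out first is timing noise, since a single `@elapsed` call measures just one run. A minimal sketch of a more robust measurement is shown below; the helper `best_time` is hypothetical and not part of the tutorial, and the workload is a stand-in.

```julia
# Hypothetical helper: run `f` several times and keep the fastest wall-clock time
function best_time(f; runs = 5)
    f()  # warm-up call so that compilation is not timed
    return minimum(@elapsed(f()) for _ in 1:runs)
end

# Stand-in workload; in this notebook one would pass something like
# () -> time_step!(uₜ, 𝓕, 𝓕⁻¹, 0.75)
x = rand(10^6)
t = best_time(() -> sum(abs2, x))
println("best of 5 runs: ", round(t, digits=6), "s")
```

Because `time_step!` already synchronizes the GPU after each stage via `@cusync`, wrapping it like this should time the completed GPU work rather than just the kernel launches.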

And now we hit some quirks in Julia on the CPU.

Although broadcasting and loop fusion look pretty, they can be slow! In particular, the `(ξ₁.^2 .+ ξ₂.^2 .+ ξ₃.^2)` bit appears to allocate a temporary array of size O(n^3). Also, it's not threaded on the CPU... it's 10x slower than the fft/ifft computation.

So basically:
- create a pull request to Julia to enable threading by default in broadcasting
- create a pull request to Julia to handle nested broadcasting lazily to avoid big temporaries
- use LoopVectorization.jl -- but it does not handle interleaved loads yet (you'd have to change from CRCRCR... to CCCRRR storage for vectorized loads/stores)
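
To check how much memory the broadcast really allocates on a given Julia version, one option is Base's `@allocated` macro. The sketch below wraps the update from `time_step!` in a hypothetical helper `damp!` and uses a small array so it runs quickly; no particular result is claimed here.

```julia
# Hypothetical helper wrapping the fused update from `time_step!`
function damp!(ũ, ξ₁, ξ₂, ξ₃, p)
    ũ ./= 1 .+ (ξ₁.^2 .+ ξ₂.^2 .+ ξ₃.^2) .^ p
    return ũ
end

n = 32
ũ = randn(ComplexF64, n, n, n)
ξ = Float64[0:n÷2; n÷2-1:-1:1]
ξ₁, ξ₂, ξ₃ = reshape(ξ, :, 1, 1), reshape(ξ, 1, :, 1), reshape(ξ, 1, 1, :)

damp!(ũ, ξ₁, ξ₂, ξ₃, 0.75)                     # warm-up, so compilation is not counted
bytes = @allocated damp!(ũ, ξ₁, ξ₂, ξ₃, 0.75)  # bytes allocated by a single call
println("allocated ", bytes, " bytes; the array itself holds ", sizeof(ũ), " bytes")
```

Comparing the two numbers shows whether a temporary of the size of the full array is created.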

### From broadcasting to handwritten loops on the CPU
Let's see if the performance hit is actually fixable.

It also turns out that 65536 (the stride of a $256^3$ array along its third axis) is a multiple of the "critical stride", meaning that loads along the third axis will only ever use a tiny fraction of the memory cache. So let's try again with $n \neq 2^k$.
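
To see why such a stride hurts, here is a toy model of a set-associative cache. The geometry (64-byte lines, 64 sets) is an assumption for illustration only and will differ between CPUs; the helper names are made up. It counts how many distinct cache sets are touched when walking 256 elements along the third axis of a `ComplexF64` array.

```julia
# Assumed cache geometry (hypothetical, check your own CPU)
const LINE_BYTES = 64
const NUM_SETS   = 64

# Cache set that a given byte offset maps to in this toy model
cache_set(byte_offset) = (byte_offset ÷ LINE_BYTES) % NUM_SETS

# Number of distinct sets hit by `count` loads that are `stride_elems` elements apart
function distinct_sets(stride_elems; elem_bytes = 16, count = 256)
    sets = Set(cache_set(k * stride_elems * elem_bytes) for k in 0:count-1)
    return length(sets)
end

distinct_sets(256^2)   # → 1   (stride 65536: every load lands in the same set)
distinct_sets(258^2)   # → 64  (stride 66564: loads spread over all sets)
```

In this model a power-of-two stride maps every load to a single set, while a stride such as $258^2$ (what two elements of padding per dimension would give) uses all of them.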

In [19]:
using OffsetArrays

"""
Fallback version for the CPU
"""
function solve_in_fourier_space!(ũₜ::AbstractArray{T,3}, p) where {T}
    n = size(ũₜ, 1)
    n_half = n÷2
    from, to = -n_half, n_half - iseven(n)    
    ũₜ_offset = OffsetArray(ũₜ, from:to, from:to, from:to)
    
    Threads.@threads for k = axes(ũₜ_offset, 3)
        ξ₃ = n_half - abs(k)
        for j = axes(ũₜ_offset, 2)
            ξ₂ = n_half - abs(j)
            @simd for i = axes(ũₜ_offset, 1)
                ξ₁ = n_half - abs(i)
                @inbounds ũₜ_offset[i, j, k] /= one(real(T)) + (ξ₁*ξ₁ + ξ₂*ξ₂ + ξ₃*ξ₃)^p
            end
        end
    end
    
    return ũₜ
end

"""
Same old GPU version
"""
function solve_in_fourier_space!(ũₜ::CuArray{T,3}, p) where {T}
    n = size(ũₜ, 1)
    ξ = CuVector(real(T)[0:n÷2; n÷2-1:-1:(iseven(n) ? 1 : 0)])
    ξ₁ = reshape(ξ, :, 1, 1)
    ξ₂ = reshape(ξ, 1, :, 1) 
    ξ₃ = reshape(ξ, 1, 1, :)
    CUDA.@sync ũₜ ./= one(real(T)) .+ (ξ₁.^2 .+ ξ₂.^2 .+ ξ₃.^2) .^ p
    return ũₜ
end

function time_step_attempt_2!(uₜ::AbstractArray{T,3} = randn(ComplexF64, 128, 128, 128), p = real(T)(0.75)) where {T<:Complex}
    @assert size(uₜ, 1) == size(uₜ, 2) == size(uₜ, 3)
    n = size(uₜ, 1)
    fft_plan = plan_fft!(uₜ)
    CUDA.@sync ũₜ = fft_plan * uₜ
    solve_in_fourier_space!(ũₜ, p)
    CUDA.@sync uₜ₊₁ = inv(fft_plan) * ũₜ
    CUDA.@sync real_data = real.(uₜ₊₁)
    
    return real_data
end

function time_step_attempt_2!(
        uₜ::AbstractArray{T,3} = randn(ComplexF64, 128, 128, 128),
        𝓕::Plan{T} = plan_fft!(uₜ),
        𝓕⁻¹::Plan{T} = plan_ifft!(uₜ),
        p = real(T)(0.75)) where {T<:Complex}
    
    @assert size(uₜ, 1) == size(uₜ, 2) == size(uₜ, 3)
    
    n = size(uₜ, 1)

    # Note: uₜ is modified in-place if you provide an in-place plan -- no copy is made
    @cusync ũₜ = 𝓕 * uₜ
    
    # Scalar update in Fourier space: dispatches to the threaded CPU loop
    # or to the broadcasting GPU version, depending on the array type
    @cusync solve_in_fourier_space!(ũₜ, p)
    
    # Also in-place
    @cusync uₜ₊₁ = 𝓕⁻¹ * ũₜ
    
    return uₜ₊₁
end

time_step_attempt_2! (generic function with 6 methods)

In [23]:
n = 256

# Initial random value
uₜ = randn(T, n, n, n)
uₜ_d = CuArray(uₜ)

# Compute the FFT plans ahead of time -- not part of our benchmark
# CPU
𝓕     = plan_fft!(uₜ);
𝓕⁻¹   = plan_ifft!(uₜ)

# GPU
𝓕_d   = plan_fft!(uₜ_d);
𝓕⁻¹_d = plan_ifft!(uₜ_d)

# Validate CPU & GPU results
using LinearAlgebra, Test
println(@test norm(time_step_attempt_2!(uₜ, 𝓕, 𝓕⁻¹) - Array(time_step_attempt_2!(uₜ_d, 𝓕_d, 𝓕⁻¹_d))) < 100eps(real(T)))

# Run on CPU
cpu_time = @elapsed time_step_attempt_2!(uₜ, 𝓕, 𝓕⁻¹, 0.75)

# Run on GPU
gpu_time = @elapsed time_step_attempt_2!(uₜ_d, 𝓕_d, 𝓕⁻¹_d, 0.75)

println("CPU took: ", round(cpu_time, digits=4), "s")
println("GPU took: ", round(gpu_time, digits=4), "s")
println("That's a ", round(cpu_time / gpu_time, digits=2), "× speedup")

nothing

[32m[1mTest Passed[22m[39m
CPU took: 0.3868s
GPU took: 0.0118s
That's a 32.69× speedup


That's slightly more reasonable -- but still... 33x is far from the theoretical ~10x gap!

It turns out that although $n = 256$ is the perfect size for an FFT, it's a horrible size for memory access. The strides of this array look like this:

In [24]:
strides(uₜ)

(1, 256, 65536)

The solution is to add a bit of padding to the arrays so that the strides are no longer powers of two, maybe in each dimension to be sure.
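
The effect of padding on the strides can be checked directly. This is a small sketch using `UInt8` arrays to keep memory use low (strides are counted in elements, so the element type does not matter here); the variable names are illustrative.

```julia
n, pad = 256, 2

plain  = zeros(UInt8, n, n, n)
padded = view(zeros(UInt8, n + pad, n + pad, n + pad), 1:n, 1:n, 1:n)

strides(plain)                # → (1, 256, 65536)
strides(padded)               # → (1, 258, 66564)
size(padded) == size(plain)   # → true: same logical size, different memory layout
```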

In [26]:
n = 256

# Try pad = 0 vs. 1, 2, 3
pad = 2

# Initial random value
uₜ = view(randn(T, n + pad, n + pad, n + pad), 1:n, 1:n, 1:n)

# Compute the FFT plans ahead of time -- not part of our benchmark
𝓕   = plan_fft!(uₜ)
𝓕⁻¹ = plan_ifft!(uₜ)

cpu_time = @elapsed time_step_attempt_2!(uₜ, 𝓕, 𝓕⁻¹, 0.75)
println("CPU took: ", round(cpu_time, digits=4), "s")
println("That's a ", round(cpu_time / gpu_time, digits=2), "× speedup")

CPU took: 0.2423s
That's a 20.48× speedup
