In [12]:
] activate .

[32m[1m  Activating[22m[39m environment at `~/tutorial_julia_interactive/04_cuda_highlevel/Project.toml`


In [13]:
] instantiate

# Using the GPU without writing all the kernels by hand

Being able to write generic kernels in Julia is nice, but in many cases array programming plus calls into libraries can do the heavy lifting.

Julia fuses scalar operations that are applied elementwise to arrays; this is called broadcasting. Quite often all you need is to add some dots, and the same code works on both the CPU and the GPU.

In this example we look at a partial differential equation that comes up in a time-stepping method, and solve it using FFTs and broadcasting operations.
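As a minimal sketch of what "adding some dots" means, the snippet below writes a fused, in-place elementwise update for plain CPU arrays. The helper name `axpb!` is hypothetical and only used for this illustration; with CUDA.jl loaded, the same function is expected to accept `CuArray` arguments as well, since it only relies on broadcasting.

```julia
# `axpb!` is a hypothetical helper for illustration, not part of any package.
# The dotted expression is fused into a single loop that writes into `y`.
function axpb!(y::AbstractArray, a, x::AbstractArray, b)
    y .= a .* x .+ b
    return y
end

x = Float32[1, 2, 3, 4]
y = similar(x)
axpb!(y, 2f0, x, 1f0)   # y now holds 2x + 1 elementwise
```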

In [14]:
# a separate package defines some interfaces / types
using AbstractFFTs: Plan, plan_fft!

# And FFTW and CUDA implement them
using FFTW, CUDA
using CUDA.CUFFT
using CUDA: @sync as @cusync
using LinearAlgebra

FFTW.set_num_threads(12) # <- FFTW disables threading by default -- make sure to enable it. Note: you can also get the MKL version through FFTW!

In [15]:
"""
    time_step!(uₜ, 𝓕, 𝓕⁻¹, p)

Explanations:

1. Generate a 3D array with nice visualization properties.

2. Obtain approximately fractal Brownian noise, appropriately damping
   the high frequencies of Fourier transformed spatial white noise,
   and (inverse) Fourier transforming the result back into the spatial domain.
 
3. Do a backward Euler time step on the fractional PDE

       du(x, t)/dt = Δᵖu(x, t).

   Discretize in time with time step = 1:

         uₜ₊₁ - uₜ = Δᵖuₜ₊₁
     (-Δᵖ + 1)uₜ₊₁ = uₜ
   (|ξ|²ᵖ + 1)ũₜ₊₁ = ũₜ                  (Forward Fourier transform: \\tilde<tab> gives you the wiggle)
              ũₜ₊₁ = ũₜ / (|ξ|²ᵖ + 1)    (Scalar updates)
              uₜ₊₁ = (ũₜ / (|ξ|²ᵖ + 1))̃  (Inverse Fourier transform, unwiggle!)
"""
function time_step!(
        uₜ::AbstractArray{T,3} = randn(ComplexF64, 128, 128, 128),
        𝓕::Plan{T} = plan_fft!(uₜ),
        𝓕⁻¹::Plan{T} = plan_ifft!(uₜ),
        p = real(T)(0.75)) where {T<:Complex}
    
    @assert size(uₜ, 1) == size(uₜ, 2) == size(uₜ, 3)
    
    n = size(uₜ, 1)

    # Note: uₜ is modified in-place if you provide an in-place plan -- no copy is made
    @cusync ũₜ = 𝓕 * uₜ
    
    # Discrete Fourier transform indices are confusing more often than not.
    # In a perfect world where I have time on my hands I would do this without the
    # temporary, and rather use something like n .- abs.(-n÷2:n÷2) lazily -- but then again,
    # it's only O(n) memory in an O(n^3) problem.
    @cusync ξ = maybe_to_device(uₜ)(Float32[0:n÷2; n÷2-1:-1:(iseven(n) ? 1 : 0)])

    ξ₁ = reshape(ξ, :, 1, 1)
    ξ₂ = reshape(ξ, 1, :, 1)
    ξ₃ = reshape(ξ, 1, 1, :)
    
    # The broadcasting bit -- fuses operations.
    @cusync ũₜ ./= 1 .+ (ξ₁.^2 .+ ξ₂.^2 .+ ξ₃.^2) .^ p
    
    # Also in-place
    @cusync uₜ₊₁ = 𝓕⁻¹ * ũₜ
    
    return uₜ₊₁
end

# If I work with CuArrays I want to get a CuVector constructor
maybe_to_device(::CuArray) = CuVector
maybe_to_device(::AbstractArray) = identity;
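To make the frequency bookkeeping in `time_step!` easier to follow, here is a small CPU-only sketch, assuming a tiny even grid size `n = 8` chosen purely for illustration. It builds the same wave-number vector and shows how the three reshaped copies broadcast to a full 3D array of |ξ|² values. The name `ξ_sq` is introduced here and does not appear in the notebook.

```julia
n = 8  # small illustrative size, not the size used in the notebook

# Absolute wave numbers in FFT ordering: 0, 1, ..., n÷2, n÷2-1, ..., 1
ξ = Float32[0:n÷2; n÷2-1:-1:(iseven(n) ? 1 : 0)]

# One copy per axis; singleton dimensions are expanded by broadcasting
ξ₁ = reshape(ξ, :, 1, 1)
ξ₂ = reshape(ξ, 1, :, 1)
ξ₃ = reshape(ξ, 1, 1, :)

# |ξ|² on the full grid (materialized here only to inspect it)
ξ_sq = ξ₁.^2 .+ ξ₂.^2 .+ ξ₃.^2
```

For example, `ξ_sq[2, 3, 4]` combines the wave numbers 1, 2 and 3, so it equals 1 + 4 + 9.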

## 3D volume plots in Julia using Makie.jl

There is one thing I didn't figure out in time: remote rendering. But it is implemented in Makie.jl and we should set it up!

In [16]:
# WGLMakie is a webgl plotting library
using JSServe, WGLMakie

Page(exportable=true, offline=true);
set_theme!(resolution=(1024, 1024))

In [17]:
# Compute a solution on the GPU
u = Complex.(CUDA.randn(Float32, 150, 150, 150))

# Do a step
time_step!(u)

# Get the real part and download to CPU
u_real = Array(real(u))

# Note to self: fix this -- currently plotting serializes the data and sends it to the browser.
# It takes forever! Remote rendering would be better.
# volume(u_real, colorrange=extrema(u_real))


150×150×150 Array{Float32, 3}:
[:, :, 1] =
  0.0057804     0.00445605    0.00100542   …   0.00444631    0.00770759
  0.00253026    0.000527489  -7.5042f-5        0.00218634    0.000308043
 -0.000399728  -0.00163142   -0.00197742      -3.11043f-5    0.00132125
 -0.00230765   -0.00678148   -0.00637932      -0.00216577    0.00241848
 -0.00281224   -0.00550273   -0.00468624      -0.00285926   -0.000255659
 -0.00285854   -0.00529108   -0.00365866   …  -0.00261306   -0.000906653
 -0.00292633   -0.00485817   -0.00437061      -0.0040111    -0.00355011
 -0.00574465   -0.00684693   -0.00321379      -0.00210126   -0.00763699
 -0.00423064   -0.00641217   -0.00391258      -0.0036463    -0.0051799
  0.00342602   -0.00262055   -0.00690729      -0.00596435   -0.00371351
  0.000521163  -0.0021999    -0.00217574   …  -0.0051417    -0.00117805
 -0.00293995   -0.00312833   -0.00133278      -0.000923452  -0.00307118
 -0.00464156   -0.0051453     0.000361606     -0.000415342  -0.00317865
  ⋮

## Performance comparison between CPU & GPU

Let's run our time stepper on the CPU and the GPU, add a validation test, and compare the GPU speedup.

In [18]:
# Some parameters for our problem

n = 256
T = ComplexF64
p = 0.75

@show n T p

# Initial random value
uₜ = randn(T, n, n, n)
uₜ_d = CuArray(uₜ)

# Compute the FFT plans ahead of time -- not part of our benchmark
# CPU
𝓕     = plan_fft!(uₜ);
𝓕⁻¹   = plan_ifft!(uₜ)

# GPU
𝓕_d   = plan_fft!(uₜ_d);
𝓕⁻¹_d = plan_ifft!(uₜ_d)

# Validate CPU & GPU results
using LinearAlgebra, Test
println(@test norm(time_step!(uₜ, 𝓕, 𝓕⁻¹) - Array(time_step!(uₜ_d, 𝓕_d, 𝓕⁻¹_d))) < 100eps(real(T)))

# Run on CPU
cpu_time = @elapsed time_step!(uₜ, 𝓕, 𝓕⁻¹, 0.75)

# Run on GPU
gpu_time = @elapsed time_step!(uₜ_d, 𝓕_d, 𝓕⁻¹_d, 0.75)

println("CPU took: ", round(cpu_time, digits=4), "s")
println("GPU took: ", round(gpu_time, digits=4), "s")
println("That's a ", round(cpu_time / gpu_time, digits=2), "× speedup")

nothing

n = 256
T = ComplexF64
p = 0.75
[32m[1mTest Passed[22m[39m
CPU took: 1.8916s
GPU took: 0.0105s
That's a 179.64√ó speedup


Okay, the tests are passing, but the speedup is an order of magnitude larger than theoretically possible. That can't be right, can it?

Here we hit some quirks of Julia on the CPU.

Although broadcasting and loop fusion look pretty, they can be slow! In particular, the `(ξ₁.^2 .+ ξ₂.^2 .+ ξ₃.^2)` bit actually allocates a temporary array of size O(n^3). It is also not threaded on the CPU, and it is 10x slower than the fft/ifft computation.

So the options are basically:
- create a pull request to Julia to enable threading by default in broadcasting
- create a pull request to Julia to handle nested broadcasting lazily to avoid big temporaries
- use LoopVectorization.jl -- but it does not handle interleaved loads yet (you'd have to change from CRCRCR... to CCCRRR storage for vectorized loads/stores)
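The CRCRCR... versus CCCRRR remark in the last bullet can be made concrete with a tiny sketch. Julia stores each complex number as an adjacent (real, imaginary) pair, so a `Float64` reinterpretation of a complex vector shows the interleaved layout. The variable names below are made up for illustration, and the split layout shown is just one possible way to obtain separate real and imaginary arrays.

```julia
z = ComplexF64[1 + 2im, 3 + 4im, 5 + 6im]

# Interleaved layout: the same memory read as re, im, re, im, ...
interleaved = collect(reinterpret(Float64, z))

# Split layout: separate contiguous arrays for the real and imaginary
# parts (these are copies, not views of `z`)
re_part = real.(z)
im_part = imag.(z)
```

With the split layout, a SIMD load of consecutive elements only touches real parts (or only imaginary parts), which is the access pattern the bullet refers to.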

### From broadcasting to handwritten loops on the CPU
Let's see if the performance hit is actually fixable.

It also turns out that 65536 (= 256², the stride along the third axis for $n = 256$) is a multiple of the "critical stride", meaning that loads along the third axis will always use only a tiny fraction of the memory cache. So let's try again with $n \neq 2^k$.

In [19]:
using OffsetArrays

"""
Fallback version for the CPU
"""
function solve_in_fourier_space!(ũₜ::AbstractArray{T,3}, p) where {T}
    n = size(ũₜ, 1)
    n_half = n÷2
    from, to = -n_half, n_half - iseven(n)
    ũₜ_offset = OffsetArray(ũₜ, from:to, from:to, from:to)
    
    Threads.@threads for k = axes(ũₜ_offset, 3)
        ξ₃ = n_half - abs(k)
        for j = axes(ũₜ_offset, 2)
            ξ₂ = n_half - abs(j)
            @simd for i = axes(ũₜ_offset, 1)
                ξ₁ = n_half - abs(i)
                @inbounds ũₜ_offset[i, j, k] /= one(real(T)) + (ξ₁*ξ₁ + ξ₂*ξ₂ + ξ₃*ξ₃)^p
            end
        end
    end
    
    return ũₜ
end

"""
Same old GPU version
"""
function solve_in_fourier_space!(ũₜ::CuArray{T,3}, p) where {T}
    n = size(ũₜ, 1)
    ξ = CuVector(real(T)[0:n÷2; n÷2-1:-1:(iseven(n) ? 1 : 0)])
    ξ₁ = reshape(ξ, :, 1, 1)
    ξ₂ = reshape(ξ, 1, :, 1)
    ξ₃ = reshape(ξ, 1, 1, :)
    CUDA.@sync ũₜ ./= one(real(T)) .+ (ξ₁.^2 .+ ξ₂.^2 .+ ξ₃.^2) .^ p
    # Updated in place; return the same array, as the CPU method does
    return ũₜ
end

function time_step_attempt_2!(uₜ::AbstractArray{T,3} = randn(ComplexF64, 128, 128, 128), p = real(T)(0.75)) where {T<:Complex}
    @assert size(uₜ, 1) == size(uₜ, 2) == size(uₜ, 3)
    n = size(uₜ, 1)
    fft_plan = plan_fft!(uₜ)
    CUDA.@sync ũₜ = fft_plan * uₜ
    solve_in_fourier_space!(ũₜ, p)
    CUDA.@sync uₜ₊₁ = inv(fft_plan) * ũₜ
    CUDA.@sync real_data = real.(uₜ₊₁)
    
    return real_data
end

function time_step_attempt_2!(
        uₜ::AbstractArray{T,3} = randn(ComplexF64, 128, 128, 128),
        𝓕::Plan{T} = plan_fft!(uₜ),
        𝓕⁻¹::Plan{T} = plan_ifft!(uₜ),
        p = real(T)(0.75)) where {T<:Complex}
    
    @assert size(uₜ, 1) == size(uₜ, 2) == size(uₜ, 3)
    
    n = size(uₜ, 1)

    # Note: uₜ is modified in-place if you provide an in-place plan -- no copy is made
    @cusync ũₜ = 𝓕 * uₜ
    
    # Divide by (|ξ|²ᵖ + 1) in Fourier space: threaded loops on the CPU,
    # a fused broadcast on the GPU. This updates ũₜ in place.
    @cusync solve_in_fourier_space!(ũₜ, p)
    
    # Also in-place
    @cusync uₜ₊₁ = 𝓕⁻¹ * ũₜ
    
    return uₜ₊₁
end

time_step_attempt_2! (generic function with 6 methods)

In [23]:
n = 256

# Initial random value
uₜ = randn(T, n, n, n)
uₜ_d = CuArray(uₜ)

# Compute the FFT plans ahead of time -- not part of our benchmark
# CPU
𝓕     = plan_fft!(uₜ);
𝓕⁻¹   = plan_ifft!(uₜ)

# GPU
𝓕_d   = plan_fft!(uₜ_d);
𝓕⁻¹_d = plan_ifft!(uₜ_d)

# Validate CPU & GPU results
using LinearAlgebra, Test
println(@test norm(time_step_attempt_2!(uₜ, 𝓕, 𝓕⁻¹) - Array(time_step_attempt_2!(uₜ_d, 𝓕_d, 𝓕⁻¹_d))) < 100eps(real(T)))

# Run on CPU
cpu_time = @elapsed time_step_attempt_2!(uₜ, 𝓕, 𝓕⁻¹, 0.75)

# Run on GPU
gpu_time = @elapsed time_step_attempt_2!(uₜ_d, 𝓕_d, 𝓕⁻¹_d, 0.75)

println("CPU took: ", round(cpu_time, digits=4), "s")
println("GPU took: ", round(gpu_time, digits=4), "s")
println("That's a ", round(cpu_time / gpu_time, digits=2), "× speedup")

nothing

[32m[1mTest Passed[22m[39m
CPU took: 0.3868s
GPU took: 0.0118s
That's a 32.69√ó speedup


That's slightly more reasonable -- but still... 33x is far from the theoretical ~10x gap!

Turns out:

Although $n = 256$ is the perfect size for the FFT, it's a horrible size for memory access. The strides of this array look like this:

In [24]:
strides(uₜ)

(1, 256, 65536)

The solution is to add a bit of padding to the arrays, maybe in each dimension to be sure.
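The following sketch illustrates the padding idea on a deliberately tiny array so it runs instantly; the size `m = 4` and the names `A` and `stride_bytes` are assumptions made for this example. It first computes the byte distance between neighbours along the third axis for the unpadded $n = 256$ case, then shows that a view into a padded parent keeps the logical size while its strides are no longer powers of two.

```julia
n, pad = 256, 2
T = ComplexF64

# Unpadded case: byte distance between neighbours along the third axis.
# n^2 elements of 16 bytes each, which is an exact power of two.
stride_bytes = n * n * sizeof(T)

# Padded case, on a tiny array for illustration: the view has size m^3,
# but its strides are inherited from the (m + pad)^3 parent.
m = 4
A = view(zeros(T, m + pad, m + pad, m + pad), 1:m, 1:m, 1:m)

size(A)      # logical size of the view
strides(A)   # element strides taken from the padded parent
```

Power-of-two byte strides tend to map many addresses to the same cache sets, which is the likely mechanism behind the slowdown described above; padding breaks that regularity.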

In [26]:
n = 256

# Try pad = 0 vs 1 2 3
pad = 2

# Initial random value
uₜ = view(randn(T, n + pad, n + pad, n + pad), 1:n, 1:n, 1:n)

# Compute the FFT plans ahead of time -- not part of our benchmark
𝓕   = plan_fft!(uₜ)
𝓕⁻¹ = plan_ifft!(uₜ)

cpu_time = @elapsed time_step_attempt_2!(uₜ, 𝓕, 𝓕⁻¹, 0.75)
println("CPU took: ", round(cpu_time, digits=4), "s")
println("That's a ", round(cpu_time / gpu_time, digits=2), "× speedup")

CPU took: 0.2423s
That's a 20.48× speedup
