In [1]:
using CUDA

In [2]:
CUDA.versioninfo()

CUDA runtime 12.6, artifact installation
CUDA driver 12.6
NVIDIA driver 560.94.0

CUDA libraries: 
- CUBLAS: 12.6.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+560.94

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.4+0
- CUDA_Runtime_jll: 0.15.5+0

Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6

1 device:
  0: NVIDIA GeForce GTX 1660 (sm_75, 5.241 GiB / 6.000 GiB available)


In [3]:
x = randn(Float32, 60, 60)
y = randn(Float32, 60, 60)
x * y
cx = CuArray(x)
cy = CuArray(y)
cx * cy

x * y ≈ Matrix(cx * cy)


true

This may not seem remarkable, since many other languages offer the same functionality, albeit usually with a less mathematical notation such as `x.dot(y)`. With Julia's multiple dispatch, the multiplication operator `*` simply dispatches to a method specialized for the `CuArray` type. Inspecting the call with `@code_typed` shows that it resolves to `LinearAlgebra.generic_matmatmul!` on `CuArray` arguments, which CUDA.jl backs with the CUBLAS library under the hood.
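To make the dispatch mechanism concrete, here is a minimal CPU-only sketch. `TaggedMatrix` is a hypothetical wrapper type invented for illustration; it is not part of CUDA.jl and only mimics the way a package can attach its own method to `*` for its own array type.

```julia
# Hypothetical wrapper type standing in for a device array such as CuArray.
struct TaggedMatrix{T}
    data::Matrix{T}
end

# Add a method to the existing `*` function for the wrapper type.
# A GPU package would call its vendor library here; this sketch just
# forwards to the ordinary CPU matrix product.
Base.:*(a::TaggedMatrix, b::TaggedMatrix) = TaggedMatrix(a.data * b.data)

a = TaggedMatrix(Float32[1 2; 3 4])
b = TaggedMatrix(Float32[0 1; 1 0])

c = a * b   # same notation as for plain matrices; c.data == Float32[2 1; 4 3]

# Ask Julia which method was selected for these argument types.
selected = which(*, (TaggedMatrix{Float32}, TaggedMatrix{Float32}))
```

The calling code never changes: `a * b` reads the same whether the operands live on the CPU or, in the real `CuArray` case, on the GPU.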

In [4]:
@code_typed cx * cy

CodeInfo(
1 ─ %1 = Base.getfield(A, :dims)::Tuple{Int64, Int64}
│   %2 = $(Expr(:boundscheck, true))::Bool
│   %3 = Base.getfield(%1, 1, %2)::Int64
│   %4 = Base.getfield(B, :dims)::Tuple{Int64, Int64}
│   %5 = $(Expr(:boundscheck, true))::Bool
│   %6 = Base.getfield(%4, 2, %5)::Int64
│   %7 = Core.tuple(%3, %6)::Tuple{Int64, Int64}
│   %8 = invoke CuArray{Float32, 2, CUDA.DeviceMemory}(CUDA.undef::UndefInitializer, %7::Tuple{Int64, Int64})::CuArray{Float32, 2, CUDA.DeviceMemory}
│   %9 = invoke LinearAlgebra.generic_matmatmul!(%8::CuArray{Float32, 2, CUDA.DeviceMemory}, 'N'::Char, 'N'::Char, A::CuArray{Float32, 2, CUDA.DeviceMemory}, B::CuArray{Float32, 2, CUDA.DeviceMemory}, true::Bool, false::Bool)::CuArray{Float32, 2, CUDA.DeviceMemory}
└──      return %9
) => CuArray{Float32, 2, CUDA.DeviceMemo

In [5]:
using BenchmarkTools

In [6]:
# using Pkg;
# Pkg.add(["FileIO", "ImageMagick", "ImageShow", "ColorTypes", "FFTW"])

In [7]:
using FileIO, ImageShow, ColorTypes, ImageMagick

rgb_img = FileIO.load("../../Downloads/2019JulyLunarEclipse-gt1Mpxjpg.jpg");
gray_img = Float32.(Gray.(rgb_img));
cgray_img = CuArray(gray_img);

# Hints: Use Float32 everywhere for better performance
#        Use CUDA.@sync during benchmarking in order to ensure that the computation has completed.

# Remove high frequency signal by means of modifying Fourier image.


6416×11406 CuArray{Float32, 2, CUDA.DeviceMemory}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0         0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0         0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0         0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0         0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0         0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0         0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0         0.00392157
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.00392157  0.00392157
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0         0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0         0.0
 ⋮                        ⋮              ⋱                        ⋮
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0         0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0         0.00392157
 0.0  0.0  0.0  0.0 

In [8]:
negative(i) = 1.0f0 .- i
darken(i) = i .* 0.5f0

using CUDA.CUFFT
using FFTW

fourier(i) = fft(i)
brightest(i) = findmax(i)

brightest (generic function with 1 method)
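The hint in the earlier cell about removing high-frequency signal by modifying the Fourier image could be sketched as follows. This is a CPU-only illustration built on FFTW; the helper name `lowpass` and the `cutoff` parameter (in cycles per pixel, between 0 and 0.5) are assumptions made for this example, not part of the notebook. To apply the same idea to `cgray_img`, the mask would presumably need to be moved to the GPU first, for example with `CuArray(mask)`.

```julia
using FFTW

# Hypothetical helper: keep only Fourier coefficients whose radial
# frequency is at most `cutoff`, then transform back to image space.
function lowpass(img::AbstractMatrix{Float32}, cutoff::Float32)
    F = fft(img)                            # complex Fourier image
    fy = Float32.(fftfreq(size(img, 1)))    # frequencies along dimension 1
    fx = Float32.(fftfreq(size(img, 2)))    # frequencies along dimension 2
    # 1 inside the circle of radius `cutoff`, 0 outside.
    mask = Float32.(hypot.(fy, fx') .<= cutoff)
    return real.(ifft(F .* mask))           # drop the negligible imaginary part
end

# A constant image plus a checkerboard (the highest representable frequency).
img = Float32[1 + (-1)^(i + j) for i in 1:8, j in 1:8]
smooth = lowpass(img, 0.1f0)                # checkerboard removed, constant kept
```

A hard circular mask like this is the simplest choice; it tends to produce ringing around sharp edges, so a smoother roll-off may be preferable for real photographs.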

In [9]:
# Now for the benchmarking

@btime CUDA.@sync negative($cgray_img);

  3.944 ms (48 allocations: 960 bytes)


In [10]:
@btime negative($gray_img);

  49.718 ms (3 allocations: 279.16 MiB)


In [11]:
@btime CUDA.@sync darken($cgray_img);

  3.945 ms (48 allocations: 960 bytes)


In [12]:
@btime darken($gray_img);