<center> <H1> Programming in CUDA with Julia </H1> 
<img src="logo.png" width="200"/>
  Marc Fuentes: SED, INRIA at UPPA  
</center>

In [4]:
# in the REPL, check that CUDA works; if it does not, see the Installation section
using CUDA
CUDA.versioninfo()

CUDA toolkit 11.4.1, artifact installation
CUDA driver 11.6.0
NVIDIA driver 510.47.3

Libraries: 
- CUBLAS: 11.5.4
- CURAND: 10.2.5
- CUFFT: 10.5.1
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+510.47.3
- CUDNN: 8.20.2 (for CUDA 11.4.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:
- Julia: 1.7.0-beta3
- LLVM: 12.0.0
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: Quadro T2000 with Max-Q Design (sm_75, 3.815 GiB / 4.000 GiB available)


# Installation
 - on a laptop, the package manager `Pkg` downloads the required artifacts
```julia
using Pkg
Pkg.add("CUDA")
```
 - on plafrim (for this lab session), use Pascal or Volta GPUs (`salloc -C "sirocco&p100"`)
```bash
> module load language/julia/1.7.2
> julia
```
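 - some environment variables can help Julia find CUDA: `JULIA_CUDA_VERSION` (to request a specific toolkit version) and `JULIA_CUDA_USE_BINARYBUILDER=false` (to use a locally installed toolkit instead of downloaded artifacts)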

# GPU: architecture overview
- the GPU is an accelerator with its own memory (DRAM) and a large number of execution threads
<img src="archi_gpu.svg" width="600px" > 
- 2 principles to remember
 - limit host/device transfers (or overlap them with computation)
 - give the GPU enough work to chew on (vector-style computation)

# GPU programming paradigm
 - replace a loop index with a thread index
```julia
for i=...
    a[i] = ...
end
``` 
thus becomes
```julia
i = threadIdx().x + (blockIdx().x - 1) * blockDim().x  
a[i] = ...
```
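
To see how this index formula covers an array, here is a minimal sketch that emulates a 1D launch on the CPU. The name `emulate_launch!` and the loop variables are hypothetical and chosen for illustration; on a real GPU the (block, thread) pairs run concurrently instead of in a loop.

```julia
# CPU emulation of a 1D kernel launch: visit every (block, thread) pair.
function emulate_launch!(a, nthreads)
    N = length(a)
    nblocks = cld(N, nthreads)              # enough blocks to cover all N elements
    for blockIdx in 1:nblocks, threadIdx in 1:nthreads
        # same formula as in the kernel (indices are 1-based in Julia)
        i = threadIdx + (blockIdx - 1) * nthreads
        if i <= N                           # guard: the last block may overshoot N
            a[i] = 2i
        end
    end
    return a
end

a = emulate_launch!(zeros(Int, 1000), 256)  # 4 blocks of 256 threads, 24 of them idle
```

The bounds check matters as soon as `N` is not a multiple of the block size, which is why the Jacobi kernel later in this notebook also tests its indices before writing.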

# Transparent use
- it is enough to rely on parallel abstractions that operate on the `CuArray` container
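
Broadcasting (`.` or `@.`) and reductions (`sum`, `reduce`) are examples of such abstractions. The sketch below runs on plain CPU arrays; the function name `sum_sq_diff` is hypothetical, and the assumption is that the same generic function would accept `CuArray` arguments once CUDA.jl is loaded on a machine with a working GPU.

```julia
# Generic code: no explicit loop and no explicit indexing.
# `@.` fuses the whole expression into a single element-wise pass.
sum_sq_diff(A, B) = sum(@. A^2 + B^2 - 2 * A * B)

N = 2^10 * 32
A = collect(1:N)
B = collect(0:N-1)
z = sum_sq_diff(A, B)     # each term equals (A[i] - B[i])^2 = 1

# Assumed GPU usage (requires CUDA.jl and a device):
# z_gpu = sum_sq_diff(CuArray(A), CuArray(B))
```

Writing the computation this way keeps the code independent of where the data lives, which is the point of the array abstraction.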

In [2]:
# GPU version
using BenchmarkTools
N = 2^10*32
A = CuArray([1:N;])
B = CuArray([0:N-1;])
@benchmark z = reduce(+, A.^2+B.^2-2 * A .* B)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  66.041 μs … 36.856 ms  ┊ GC (min … max): 0.00% … 20.44%
 Time  (median):     70.660 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   81.799 μs ± 612.030 μs  ┊ GC (mean ± σ):  2.60% ±  0.35%

In [3]:
# CPU version
A = [1:N;]
B = [0:N-1;]
@benchmark z = reduce(+, A.^2+B.^2-2 * A .* B)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  130.140 μs …   1.546 ms  ┊ GC (min … max):  0.00% … 86.67%
 Time  (median):     144.081 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   187.668 μs ± 222.271 μs  ┊ GC (mean ± σ):  22.01% ± 16.10%

Caution: with this paradigm, avoid accessing array elements one index at a time!

In [4]:
A = CuArray([1:1000;])
s = 0
# CUDA.allowscalar(false)   # uncomment to turn scalar indexing into an error
for i =1:1000
   s += A[i]
end
s

│ Invocation of getindex resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore are only permitted from the REPL for prototyping purposes.
│ If you did intend to index this array, annotate the caller with @allowscalar.
└ @ GPUArrays /home/fux/.julia/packages/GPUArrays/UBzTm/src/host/indexing.jl:56


500500
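
The loop above pulls the elements back one at a time. A minimal sketch of the alternative, assuming CUDA.jl and a working GPU: let a reduction run on the device, and opt in to scalar access explicitly only where a single element is really needed. The variable names `s` and `first_value` are illustrative.

```julia
using CUDA

A = CuArray([1:1000;])
CUDA.allowscalar(false)                  # scalar indexing now throws instead of warning

s = sum(A)                               # reduction executed on the GPU
first_value = CUDA.@allowscalar A[1]     # explicit, local opt-in for one element
```

With scalar indexing disabled, an accidental element-wise loop fails immediately instead of running slowly.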


# Solving the 2D Laplace equation with the Jacobi method
- We want to solve the equation 
$ \Delta u  = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 $
on the square $[0,1]^2$
- To do so, we discretize the square $[0,1]^2$ with a step size $h=1/(n+1)$
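
For reference, the update applied at each interior grid point can be written out explicitly. The formula below is read off the coefficients `0.2f0` and `0.05f0` in the kernels of this notebook (a 9-point stencil); the iteration superscript $(k)$ is notation introduced here.

$$
u^{(k+1)}_{i,j} = \frac{1}{5}\left(u^{(k)}_{i-1,j} + u^{(k)}_{i+1,j} + u^{(k)}_{i,j-1} + u^{(k)}_{i,j+1}\right)
 + \frac{1}{20}\left(u^{(k)}_{i-1,j-1} + u^{(k)}_{i+1,j-1} + u^{(k)}_{i-1,j+1} + u^{(k)}_{i+1,j+1}\right)
$$

The eight weights sum to $4 \cdot \frac{1}{5} + 4 \cdot \frac{1}{20} = 1$, so each new value is a weighted average of its neighbours at the previous iteration.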

In [5]:
function jacobi_gpu!(ap, a)
    # global (i, j) indices of this thread in the 2D grid of blocks
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    j = threadIdx().y + (blockIdx().y - 1) * blockDim().y
    # update interior points only; boundary values are left untouched
    if ((i >= 2) && (i <= (size(a,1)-1)) && (j >= 2) && (j <= (size(a,2)-1)))
      ap[i,j] = 0.2f0 * (a[i-1,j]  + a[i+1,j]   + a[i,j-1]  + a[i,j+1])  +
                0.05f0 * (a[i-1,j-1]+ a[i+1,j-1] + a[i-1,j+1] + a[i+1,j+1])  
    end
    return
end

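function init_sol!(a)
    a .= 0.0f0                      # zero everywhere, including the two other edges
    m = size(a,1)
    y₀ = sin.(π*[0:m-1;] ./ (m))    # sine profile sampled along the first dimension
    a[:,1] = y₀                     # boundary values on the first column
    a[:,end]= y₀ * exp(-π)          # boundary values on the last column, damped by exp(-π)
end    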

init_sol! (generic function with 1 method)

In [9]:
N = 4096
a = CuArray{Float32}(undef, N, N);
ap = similar(a)
init_sol!(a);
init_sol!(ap);

In [10]:
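nThreads = 32                  # 32 × 32 = 1024 threads per block
nBlocks = cld(N, nThreads)     # enough blocks to cover the N × N grid
for i = 1:300
   @cuda threads=(nThreads,nThreads) blocks=(nBlocks,nBlocks) jacobi_gpu!(ap,a)
   err = maximum(abs.(ap .- a))   # largest change during this sweep
   if (i % 20) == 0
        println("i =", i, " error = ",err)
   end 
   if (err<=1e-3) 
     break
   end 
   copyto!(a, ap)                 # device-to-device copy, no temporary array
end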

i =20 error = 0.011931241
i =40 error = 0.0060647726
i =60 error = 0.0040402412
i =80 error = 0.003028661
i =100 error = 0.0024201274
i =120 error = 0.0020114481
i =140 error = 0.0017290413
i =160 error = 0.0015127957
i =180 error = 0.0013430417
i =200 error = 0.0012105107
i =220 error = 0.0010983944
i =240 error = 0.0010086596


In [11]:
function jacobi_cpu!(ap, a)
    m,n = size(a)
    # Julia arrays are column-major: keep the row index i in the inner loop
    for j=2:n-1
        for i=2:m-1
            ap[i,j] = 0.2f0 * (a[i-1,j]  + a[i+1,j]   + a[i,j-1]  + a[i,j+1])  +
                      0.05f0 * (a[i-1,j-1]+ a[i+1,j-1] + a[i-1,j+1] + a[i+1,j+1])  
        end 
    end
    return
end

jacobi_cpu! (generic function with 1 method)

In [12]:
b = Array{Float32}(undef, N,N)
c = similar(b)
init_sol!(b)
init_sol!(c);

for i = 1:300
   jacobi_cpu!(c,b)
   err = maximum(abs.(c .- b))   # largest change during this sweep
   if (i % 20) == 0
        println("i =", i, " error = ",err)
   end 
   if (err<=1e-3) 
     break
   end 
   copyto!(b, c)                 # in-place copy, no temporary array
end

i =20 error = 0.011931226
i =40 error = 0.0060647726
i =60 error = 0.0040402412
i =80 error = 0.003028661
i =100 error = 0.0024201572
i =120 error = 0.0020114481
i =140 error = 0.0017290711
i =160 error = 0.0015127957
i =180 error = 0.0013430417
i =200 error = 0.0012105107
i =220 error = 0.0010983646
i =240 error = 0.0010086894
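
Both loops above copy the new iterate back into the old one after every sweep. A possible variation, sketched here on the CPU with a small grid, is to swap the two buffers instead, so that no data is moved. The helpers `sweep!`, `solve_copy` and `solve_swap` are hypothetical names introduced for this sketch; the assumption is that the same swap applies unchanged to `CuArray` buffers, since only the bindings are exchanged.

```julia
# One Jacobi sweep with the 9-point stencil used in this notebook (CPU version).
function sweep!(ap, a)
    m, n = size(a)
    for j in 2:n-1, i in 2:m-1
        ap[i, j] = 0.2f0 * (a[i-1, j] + a[i+1, j] + a[i, j-1] + a[i, j+1]) +
                   0.05f0 * (a[i-1, j-1] + a[i+1, j-1] + a[i-1, j+1] + a[i+1, j+1])
    end
    return ap
end

# Reference: copy the new iterate back after each sweep.
function solve_copy(a0; maxiter = 300, tol = 1f-3)
    a, ap = copy(a0), copy(a0)
    for it in 1:maxiter
        sweep!(ap, a)
        err = maximum(abs, ap .- a)
        copyto!(a, ap)
        err <= tol && return a, it
    end
    return a, maxiter
end

# Variation: swap the two bindings, no element is copied.
function solve_swap(a0; maxiter = 300, tol = 1f-3)
    a, ap = copy(a0), copy(a0)     # both buffers must hold the boundary values
    for it in 1:maxiter
        sweep!(ap, a)
        err = maximum(abs, ap .- a)
        a, ap = ap, a              # `a` now names the newest iterate
        err <= tol && return a, it
    end
    return a, maxiter
end

# Small test grid with the same boundary values as init_sol!
m = 64
a0 = zeros(Float32, m, m)
y0 = sin.(π .* (0:m-1) ./ m)
a0[:, 1] = y0
a0[:, end] = y0 .* exp(-π)

u_copy, it_copy = solve_copy(a0)
u_swap, it_swap = solve_swap(a0)
```

The swap is safe here because each sweep overwrites every interior point and the boundary values are identical in both buffers, so the stale contents of the older buffer are never read.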
