<center> <H1> Programming in CUDA with Julia </H1> 
<img src="logo.png" width="200"/>
  Marc Fuentes: INRIA SED at UPPA  
</center>

In [1]:
# in the REPL, check that CUDA works; if it does not, see the Installation section
using CUDA
CUDA.versioninfo()

CUDA toolkit 11.4.1, artifact installation
CUDA driver 11.6.0
NVIDIA driver 510.47.3

Libraries: 
- CUBLAS: 11.5.4
- CURAND: 10.2.5
- CUFFT: 10.5.1
- CUSOLVER: 11.2.0
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+510.47.3
- CUDNN: 8.20.2 (for CUDA 11.4.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:
- Julia: 1.7.0-beta3
- LLVM: 12.0.0
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: Quadro T2000 with Max-Q Design (sm_75, 3.815 GiB / 4.000 GiB available)


# Installation
 - on a laptop, the package manager `Pkg` downloads the required artifacts
```julia
using Pkg
Pkg.add("CUDA")
```
 - on plafrim (for this lab session), use Pascal or Volta GPUs (`salloc -C "sirocco&p100"`)
```bash
> module load language/julia/1.7.2
> julia
```
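 - some environment variables can help Julia select the CUDA toolkit: `JULIA_CUDA_VERSION` (toolkit version) and `JULIA_CUDA_USE_BINARYBUILDER=false` (use a local installation instead of downloaded artifacts)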

# GPU: architecture overview
- the GPU is an accelerator with its own memory (DRAM) and a large number of execution threads
<img src="archi_gpu.svg" width="600px" > 
- two principles to remember
 - limit host/device transfers (or overlap them with computation)
 - give the GPU enough work to chew on (vectorized computation)
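
The first principle can be illustrated with a minimal sketch, assuming CUDA.jl and a working GPU. The array size and the variable names (`x`, `d_x`, `d_y`, `d_z`, `z`) are chosen here for illustration only. The data is uploaded once, every intermediate result stays on the device, and only the final result is copied back.

```julia
using CUDA

x = rand(Float32, 10^6)      # data on the host (CPU)

d_x = CuArray(x)             # one host -> device transfer
d_y = 2f0 .* d_x .+ 1f0      # computed on the GPU, result stays on the GPU
d_z = sqrt.(d_y)             # still on the GPU, no transfer in between

z = Array(d_z)               # one device -> host transfer at the very end
```

Copying `d_y` back to the host between the two steps would give the same result, but at the cost of two extra transfers.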

# GPU programming paradigm
 - replace a loop index with a thread index
```julia
for i=...
    a[i] = ...
end
``` 
thus becomes
```julia
i = threadIdx().x + (blockIdx().x - 1) * blockDim().x  
a[i] = ...
```
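
As a complete, hedged illustration of this pattern, here is a minimal sketch of a kernel and its launch. The kernel name `add_one!`, the array size, and the block size of 256 are assumptions made for this example. The guard `i <= length(a)` is needed because the number of launched threads is rounded up to a whole number of blocks.

```julia
using CUDA

# Each thread handles one element of `a`.
function add_one!(a)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(a)            # threads past the end of the array do nothing
        @inbounds a[i] += 1.0f0
    end
    return nothing
end

n = 1000
a = CUDA.zeros(Float32, n)
threads = 256
blocks = cld(n, threads)         # enough blocks to cover all n elements
@cuda threads=threads blocks=blocks add_one!(a)

result = Array(a)                # copy back to the host (waits for the kernel)
```

Here `cld(1000, 256)` gives 4 blocks, that is 1024 threads, so the last 24 threads are filtered out by the guard.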

# Transparent use
- it is enough to rely on parallel abstractions acting on the `CuArray` container
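
A minimal sketch of what such abstractions look like: broadcasting (dot syntax) and `mapreduce`. The helper names `saxpy` and `sqdist` are hypothetical. The code is run here on plain `Array`s, and since it never mentions the GPU it is expected to work unchanged on `CuArray`s once CUDA.jl is loaded.

```julia
# Generic code: nothing below refers to the GPU explicitly.
saxpy(α, x, y) = α .* x .+ y                              # fused broadcast
sqdist(x, y) = mapreduce((a, b) -> (a - b)^2, +, x, y)    # map + reduction, no temporaries

x = Float32.(1:8)
y = Float32.(0:7)

r = saxpy(2f0, x, y)
d = sqdist(x, y)

# With CUDA.jl available, the same functions should accept device arrays:
# using CUDA
# d_gpu = sqdist(CuArray(x), CuArray(y))
```

Writing the reduction with `mapreduce` avoids materializing intermediate arrays, which matters on the GPU since each temporary costs an allocation and a kernel launch.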

In [20]:
# GPU version
using BenchmarkTools
N = 2^10*32
A = CuArray([1:N;])
B = CuArray([0:N-1;])
@benchmark z = reduce(+, A.^2+B.^2-2 * A .* B)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  66.821 μs … 38.057 ms  ┊ GC (min … max): 0.00% … 20.92%
 Time  (median):     73.970 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   85.829 μs ± 652.539 μs  ┊ GC (mean ± σ):  2.67% ±  0.35%


In [18]:
# CPU version
A = [1:N;]
B = [0:N-1;]
@benchmark z = reduce(+, A.^2+B.^2-2 * A .* B)

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  133.615 μs …   1.765 ms  ┊ GC (min … max):  0.00% … 90.94%
 Time  (median):     147.202 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   193.403 μs ± 237.598 μs  ┊ GC (mean ± σ):  22.85% ± 16.19%


Warning: with this paradigm, avoid accessing array elements one index at a time!

In [32]:
A = CuArray([1:1000;])
s = 0
#CUDA.allowscalar(false) tweak that!
for i =1:1000
   s += A[i]
end
s

LoadError: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.
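
The usual remedy is to express the loop as a whole-array operation. The following is a minimal sketch, assuming CUDA.jl and a working GPU; the variable name `first_value` is introduced here for illustration.

```julia
using CUDA

CUDA.allowscalar(false)                 # make scalar indexing an error everywhere

A = CuArray([1:1000;])
s = sum(A)                              # the reduction runs on the GPU

# A deliberate single-element access must be requested explicitly:
first_value = CUDA.@allowscalar A[1]
```

The sum replaces the 1000 individual reads of the loop with one reduction on the device.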


# Solving the 2D Laplace equation with the Jacobi method
- We want to solve the equation 
$ \Delta u  = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 $
on the square $[0,1]^2$
- To do so, we discretize the square $[0,1]^2$ with step size $h=1/(n+1)$
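
The source does not write out the discrete scheme, so the following is a sketch under stated assumptions: the grid points are $x_i = ih$, $y_j = jh$, the unknowns are $u_{i,j} \approx u(x_i, y_j)$, and $k$ is the iteration counter (all of these symbols are introduced here for illustration). Replacing each second derivative with a centered difference gives the classical five-point scheme, and solving it for the central value gives the Jacobi update:

$$
\frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{h^2} + \frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{h^2} = 0
\quad\Longrightarrow\quad
u_{i,j}^{(k+1)} = \frac{1}{4}\left(u_{i-1,j}^{(k)} + u_{i+1,j}^{(k)} + u_{i,j-1}^{(k)} + u_{i,j+1}^{(k)}\right)
$$

A nine-point variant also uses the four diagonal neighbours, with weight $1/5$ on each edge neighbour and $1/20$ on each diagonal neighbour:

$$
u_{i,j}^{(k+1)} = \frac{1}{5}\left(u_{i-1,j}^{(k)} + u_{i+1,j}^{(k)} + u_{i,j-1}^{(k)} + u_{i,j+1}^{(k)}\right)
+ \frac{1}{20}\left(u_{i-1,j-1}^{(k)} + u_{i+1,j-1}^{(k)} + u_{i-1,j+1}^{(k)} + u_{i+1,j+1}^{(k)}\right)
$$

In both cases the weights sum to one, so a constant field is left unchanged by the update.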

In [14]:
function jacobi!(ap, a)
    # global 2D indices of this thread (1-based)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    j = threadIdx().y + (blockIdx().y - 1) * blockDim().y
    # update interior points only; boundary values stay fixed
    if (2 <= i <= size(a,1)-1) && (2 <= j <= size(a,2)-1)
        # nine-point stencil: the weights 4*0.2 + 4*0.05 sum to 1
        ap[i,j] = 0.2f0  * (a[i-1,j]   + a[i+1,j]   + a[i,j-1]   + a[i,j+1]) +
                  0.05f0 * (a[i-1,j-1] + a[i+1,j-1] + a[i-1,j+1] + a[i+1,j+1])
    end
    return
end

function init_sol!(a)
    a .= 0.0f0
    m = size(a,1)
    # boundary profile sin(πx) sampled at the m grid points x = 0, ..., 1
    y₀ = sin.(π*[0:m-1;] ./ (m-1))
    a[:,1] = y₀
    a[:,end]= y₀ * exp(-π)
end    

init_sol! (generic function with 1 method)
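
To check a GPU kernel such as `jacobi!`, it can help to compare it with a plain CPU sweep on a small grid. The following is a minimal sketch, not the author's code: `jacobi_cpu!` is a hypothetical helper, and it assumes the nine-point weights 0.2 (edge neighbours) and 0.05 (diagonal neighbours), which sum to one.

```julia
# Hypothetical CPU reference for one Jacobi sweep (nine-point stencil).
# Assumption: weight 0.2 on edge neighbours and 0.05 on diagonal neighbours.
function jacobi_cpu!(ap, a)
    # outer loop on columns: matches Julia's column-major memory layout
    for j in 2:size(a, 2)-1, i in 2:size(a, 1)-1
        ap[i, j] = 0.2f0  * (a[i-1, j]   + a[i+1, j]   + a[i, j-1]   + a[i, j+1]) +
                   0.05f0 * (a[i-1, j-1] + a[i+1, j-1] + a[i-1, j+1] + a[i+1, j+1])
    end
    return ap
end

# Sanity check: a constant field is a fixed point because the weights sum to one.
a_const = ones(Float32, 8, 8)
ap_const = zeros(Float32, 8, 8)
jacobi_cpu!(ap_const, a_const)
interior_ok = ap_const[2:end-1, 2:end-1] ≈ a_const[2:end-1, 2:end-1]
```

On a small grid, the result of one `@cuda` launch of the kernel, copied back with `Array`, can then be compared with the output of `jacobi_cpu!` using `≈`.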

In [16]:
const N = 4096
a = CuArray{Float32}(undef, N, N);
a₊ = similar(a)
init_sol!(a);
init_sol!(a₊);

In [17]:
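# 2D launch: the kernel uses both the x and y thread indices
nThreads = (16, 16)                      # 256 threads per block
nBlocks = (cld(N, nThreads[1]), cld(N, nThreads[2]))
for i = 1:100
   @cuda threads=nThreads blocks=nBlocks jacobi!(a₊,a)
   err = maximum(abs.(a₊-a))   
   println("i =", i, " error = ",err)
   if (err<=1e-3) 
     break
   end 
   a, a₊ = a₊, a    # swap the two buffers instead of aliasing them
end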
