<a href="https://colab.research.google.com/github/amontoison/Workshop-GERAD/blob/main/gpu_programming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parallel computing and GPU programming with Julia
## Part III: GPU programming
Alexis Montoison

In [1]:
import Pkg
Pkg.activate("colab5")
Pkg.add(["BenchmarkTools", "CUDA"])

  Activating new project at `/content/colab5`
    Updating registry at `~/.julia/registries/General.toml`
   Resolving package versions...
   Installed BenchmarkTools ─ v1.6.0
    Updating `/content/colab5/Project.toml`
  [6e4b80f9] + BenchmarkTools v1.6.0
  [052768ef] + CUDA v5.8.3
    Updating `/content/colab5/Manifest.toml`
  [621f4979] + AbstractFFTs v1.5.0
  [79e6a3ab] + Adapt v4.3.0
  [a9b6321e] + Atomix v1.1.2
  [ab4f0b2a] + BFloat16s v0.5.1
  [6e4b80f9] + BenchmarkTools v1.6.0
  [fa961155] + CEnum v0.5.0
  [052768ef] + CUDA v5.8.3
  [1af6417a] + CUDA_Runtime_Discovery v1.0.0
  [3da002f7] + ColorTypes v0.12.1
  [5ae59095] + Colors v0.13

In [2]:
using BenchmarkTools
using CUDA

Julia has first-class support for GPU programming through the following packages:

#### Core
- [GPUCompiler.jl](https://github.com/JuliaGPU/GPUCompiler.jl): Takes native Julia code and compiles it directly to GPUs
- [GPUArrays.jl](https://github.com/JuliaGPU/GPUArrays.jl): High-level array based common functionality
- [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl): Vendor-agnostic kernel programming language
- [Adapt.jl](https://github.com/JuliaGPU/Adapt.jl): Translate complex structs across the host-device boundary

#### Vendor specific

- [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl) for NVIDIA GPUs
- [AMDGPU.jl](https://github.com/JuliaGPU/AMDGPU.jl) for AMD GPUs
- [oneAPI.jl](https://github.com/JuliaGPU/oneAPI.jl) for Intel GPUs
- [Metal.jl](https://github.com/JuliaGPU/Metal.jl) for Apple M-series GPUs

CUDA.jl is the most mature and we will use it for the workshop.
AMDGPU.jl is somewhat behind but still ready for general use, while oneAPI.jl and Metal.jl are functional but may contain bugs, lack some features, and deliver suboptimal performance.

What is the difference between a CPU and a GPU?

<img src='https://github.com/amontoison/Workshop-GERAD/blob/main/Graphics/cpu_vs_gpu.png?raw=1' width='700'>

<img src='https://github.com/amontoison/Workshop-GERAD/blob/main/Graphics/meme_gpu.jpg?raw=1' width='300'>

Some key aspects of GPUs that need to be kept in mind:
- The large number of compute elements on a GPU (in the thousands) can enable extreme scaling for data parallel tasks.
- GPUs have their own memory. This means that data needs to be transferred to and from the GPU during the execution of a program.
- Cores in a GPU are arranged into a particular structure. At the highest level they are divided into “streaming multiprocessors” (SMs). Some of these details are important when writing your own GPU kernels.

<img src="https://github.com/amontoison/Workshop-GERAD/blob/main/Graphics/gpu.png?raw=1" width=500px>

<img src="https://github.com/amontoison/Workshop-GERAD/blob/main/Graphics/gpu_topology.svg?raw=1" width=500px>

* **host**: CPU + system memory (host memory)
* **device**: GPU with its memory (device memory)
* **SM**: Streaming Multiprocessor

Communication:
* Host-device bandwidth: **31.5 GB/s**
* GPU global memory bandwidth: **1555 GB/s**

GPU programming with Julia can be as simple as using a different array type instead of regular `Base.Array` arrays:
- `CuArray` from CUDA.jl for NVIDIA GPUs
- `ROCArray` from AMDGPU.jl for AMD GPUs
- `oneArray` from oneAPI.jl for Intel GPUs
- `MtlArray` from Metal.jl for Apple GPUs

These array types are subtypes of `AbstractGPUArray` from [GPUArrays.jl](https://github.com/JuliaGPU/GPUArrays.jl) and closely resemble `Base.Array`, which enables us to write generic code that works on both CPU and GPU arrays.

In [3]:
if CUDA.functional()
    A_d = CuArray([1,2,3,4])
    A_d .+= 1
end

4-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 2
 3
 4
 5

We can do the same operation with the other `AbstractGPUArray` subtypes:
```julia
if AMDGPU.functional()
    A_d = ROCArray([1,2,3,4])
    A_d .+= 1
end

if oneAPI.functional()
    A_d = oneArray([1,2,3,4])
    A_d .+= 1
end

A_d = MtlArray([1,2,3,4])
A_d .+= 1
```

Moving an array back from the GPU to the CPU is simple:

In [4]:
if CUDA.functional()
    A = Array(A_d)
end

4-element Vector{Int64}:
 2
 3
 4
 5

 <img src="https://github.com/amontoison/Workshop-GERAD/blob/main/Graphics/cpu_gpu_transfer.svg?raw=1" width=180px>

However, the overhead of copying data to the GPU makes such simple calculations very slow.
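
The cost of the transfers can be made visible with a rough timing sketch. The array size and the variable names (`t_cpu`, `t_roundtrip`, `t_device`) are illustrative assumptions, and the actual numbers depend heavily on the hardware:

```julia
using BenchmarkTools, CUDA

if CUDA.functional()
    x = rand(Float32, 2^20)

    # CPU only: no transfer involved
    t_cpu = @belapsed $x .+ 1

    # GPU, including the host -> device and device -> host copies
    t_roundtrip = @belapsed Array(CuArray($x) .+ 1)

    # GPU only: the data already lives on the device
    x_d = CuArray(x)
    t_device = @belapsed CUDA.@sync $x_d .+ 1

    println("CPU:                 ", t_cpu, " s")
    println("GPU with transfers:  ", t_roundtrip, " s")
    println("GPU without transfer:", t_device, " s")
end
```

`CUDA.@sync` waits for the GPU to finish, so the last measurement covers the whole kernel and not only its launch.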

Let’s have a look at a more realistic example: matrix multiplication.
We create a random matrix on the CPU, copy it to the GPU, and compare the performance:

In [5]:
if CUDA.functional()
    A = rand(2^12, 2^12)
    A_d = CuArray(A)

    @btime $A * $A;
    CUDA.@time A_d * A_d;
end

  1.986 s (3 allocations: 128.00 MiB)
  2.879376 seconds (4.69 M CPU allocations: 241.683 MiB) (3 GPU allocations: 128.000 MiB, 0.01% memmgmt time)


4096×4096 CuArray{Float64, 2, CUDA.DeviceMemory}:
 1026.99  1015.3   1025.14  1031.25  …  1016.87  1008.68  1034.22  1011.5
 1047.13  1026.29  1044.63  1049.7      1032.05  1042.32  1055.84  1039.99
 1040.51  1024.55  1037.85  1024.51     1035.03  1028.18  1048.08  1038.94
 1034.67  1013.85  1028.39  1032.16     1022.21  1033.61  1037.04  1036.53
 1044.02  1025.37  1028.57  1034.6      1015.15  1036.7   1052.18  1031.16
 1024.72  1017.55  1028.31  1032.17  …  1016.58  1021.62  1029.85  1027.92
 1028.99  1020.21  1020.69  1017.87     1019.39  1016.73  1023.25  1025.08
 1041.94  1024.67  1042.36  1037.09     1029.69  1027.95  1049.76  1041.12
 1032.85  1020.61  1031.86  1031.84     1029.17  1025.58  1038.33  1034.66
 1033.51  1004.56  1037.56  1013.91     1013.63  1021.88  1036.51  1017.24
 1028.97  1017.46  1033.71  1028.78  …  1019.38  1015.61  1034.45  1032.91
 1024.38  1006.09  1021.21  1020.32     1013.48  1016.88  1024.25  1023.06
 1054.25  1035.79  1047.1   1035.38     1050.19  10

In [6]:
if CUDA.functional()
    A = rand(Float32, 2^12, 2^12)
    A_d = CuArray(A)
    @btime $A * $A
    CUDA.@time A_d * A_d
end

  1.004 s (3 allocations: 64.00 MiB)
  0.775521 seconds (902.11 k CPU allocations: 45.847 MiB) (3 GPU allocations: 64.000 MiB, 0.00% memmgmt time)


4096×4096 CuArray{Float32, 2, CUDA.DeviceMemory}:
  996.087   999.592   998.09   …   988.316  1007.7   1022.01  1007.2
 1006.53   1011.06   1023.48      1001.09   1040.52  1041.84  1031.08
 1017.01   1008.32   1018.55      1009.75   1027.09  1043.75  1022.13
 1019.15   1011.22   1024.76      1021.3    1037.39  1038.91  1025.44
 1014.54   1011.69   1031.85      1014.21   1036.06  1039.57  1025.65
 1013.87    992.487   999.248  …   999.858  1012.78  1028.95  1012.23
 1022.98   1018.57   1022.47      1018.9    1022.3   1039.56  1029.28
 1002.36    995.587  1007.21       989.008  1002.11  1011.41  1007.44
 1012.87   1011.56   1022.31      1007.16   1020.63  1028.17  1024.57
 1005.17    994.774  1005.77       986.332  1010.36  1018.36  1005.56
 1012.01   1003.77   1005.64   …  1005.09   1015.53  1041.75  1026.64
 1001.91   1006.91   1002.48      1005.54   1021.41  1025.02  1019.69
  991.453   996.826  1000.81       992.184  1012.54  1023.66  1000.38
    ⋮                          ⋱         

GPUs normally perform significantly better with 32-bit floats. Some GPUs don't support 64-bit floats at all!
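
A single `CUDA.@time` call on a freshly defined operation likely also measures compilation, so the timings above should be read with care. A minimal sketch of a steadier comparison, assuming a matrix size of `2^11` and the hypothetical names `t32` and `t64`, uses `@belapsed` (which runs the expression several times) together with `CUDA.@sync`:

```julia
using BenchmarkTools, CUDA

if CUDA.functional()
    n = 2^11
    A32 = CUDA.rand(Float32, n, n)
    A64 = CUDA.rand(Float64, n, n)

    # CUDA.@sync blocks until the GPU has finished the multiplication
    t32 = @belapsed CUDA.@sync $A32 * $A32
    t64 = @belapsed CUDA.@sync $A64 * $A64

    println("Float32: ", t32, " s")
    println("Float64: ", t64, " s")
end
```

How large the gap is depends on the GPU model, so no particular ratio should be expected.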

Many array operations in Julia are implemented using loops, processing one element at a time. Doing so with GPU arrays is very inefficient, as the loop won't actually execute on the GPU; instead, each element is transferred to the CPU and processed there, one at a time. As this wrecks performance, you will be warned when performing this kind of iteration:

In [7]:
if CUDA.functional()
    A_d[1] = 3.0
end

│ Invocation of setindex! resulted in scalar indexing of a GPU array.
│ This is typically caused by calling an iterating implementation of a method.
│ Such implementations *do not* execute on the GPU, but very slowly on the CPU,
│ and therefore should be avoided.
│ 
│ If you want to allow scalar iteration, use `allowscalar` or `@allowscalar`
│ to enable scalar iteration globally or for the operations in question.
└ @ GPUArraysCore ~/.julia/packages/GPUArraysCore/aNaXo/src/GPUArraysCore.jl:145


3.0

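Scalar indexing is only allowed (with a warning) in an interactive session, e.g. the REPL, because it is convenient when porting CPU code to the GPU. If you want to disallow scalar indexing, e.g. to verify that your application executes correctly on the GPU, call `CUDA.allowscalar(false)`. Calling `CUDA.allowscalar()` without an argument does the opposite: it allows scalar indexing globally, which is discouraged and triggers a warning of its own: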

In [8]:
if CUDA.functional()
    CUDA.allowscalar()
    A_d[1] = 3.0
end

│ Instead, use `allowscalar() do end` or `@allowscalar` to denote exactly which operations can use scalar operations.
└ @ GPUArraysCore ~/.julia/packages/GPUArraysCore/aNaXo/src/GPUArraysCore.jl:184


3.0

In a non-interactive session, e.g. when running code from a script or application, scalar indexing is disallowed by default. If you really need it, mark only the expressions that require it, using `allowscalar` with do-block syntax or the `@allowscalar` macro:

In [9]:
if CUDA.functional()
    CUDA.allowscalar(false)

    CUDA.allowscalar() do
        A_d[1] += 1
    end

    CUDA.@allowscalar A_d[1] += 1
end

5.0f0
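
The following sketch puts the pieces together: with scalar indexing disallowed, a plain element access throws, `@allowscalar` permits it locally, and whole-array operations avoid the issue entirely. The variable names (`threw`, `first_value`, `total`) are illustrative only:

```julia
using CUDA

if CUDA.functional()
    CUDA.allowscalar(false)
    x_d = CUDA.ones(Float32, 4)

    # a plain scalar read is now an error
    threw = try
        x_d[1]
        false
    catch
        true
    end

    # explicitly permitted for this one expression
    first_value = CUDA.@allowscalar x_d[1]

    # alternatives that stay on the GPU: assign through a range, reduce the whole array
    x_d[1:1] .= 3f0
    total = sum(x_d)
end
```

Assigning through the range `1:1` is a broadcast over a one-element slice, so it runs as a GPU operation and does not count as scalar indexing.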

NVIDIA provides the CUDA Toolkit, which includes a collection of libraries containing precompiled kernels for common operations like matrix multiplication ([cuBLAS](https://docs.nvidia.com/cuda/cublas/)), fast Fourier transforms ([cuFFT](https://docs.nvidia.com/cuda/cufft/)), linear solvers ([cuSOLVER](https://docs.nvidia.com/cuda/cusolver/)), sparse linear algebra ([cuSPARSE](https://docs.nvidia.com/cuda/cusparse/)), etc.
These kernels are wrapped in CUDA.jl and can be used directly with `CuArray`s.

The recommended way to use CUDA.jl is to let it automatically download an appropriate CUDA toolkit. CUDA.jl will check your driver's capabilities, which versions of CUDA are available for your platform, and automatically download an appropriate artifact containing all the libraries that CUDA.jl supports.

If you need a specific version of the toolkit, you can request it explicitly:
```julia
CUDA.set_runtime_version!(v"11.8")
```
To use a local installation, you can invoke the same API with the `local_toolkit=true` keyword:
```julia
CUDA.set_runtime_version!(local_toolkit=true)
```

In [10]:
if CUDA.functional()
    CUDA.versioninfo()
end

CUDA toolchain: 
- runtime 12.5, local installation
- driver 550.54.15 for 13.0
- compiler 12.9

CUDA libraries: 
- CUBLAS: 12.5.3
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.3
- CUSPARSE: 12.5.1
- CUPTI: 2024.2.1 (API 23.0.0)
- NVML: 12.0.0+550.54.15

Julia packages: 
- CUDA: 5.8.3
- CUDA_Driver_jll: 13.0.1+0
- CUDA_Compiler_jll: 0.2.1+0
- CUDA_Runtime_jll: 0.19.1+0
- CUDA_Runtime_Discovery: 1.0.0

Toolchain:
- Julia: 1.11.5
- LLVM: 16.0.6

Preferences:
- CUDA_Runtime_jll.version: 12.5.1
- CUDA_Runtime_jll.local: true

1 device:
  0: Tesla T4 (sm_75, 14.283 GiB / 15.000 GiB available)


Let's do a guided tour of what is inside CUDA.jl!

In [11]:
if CUDA.functional()
    using CUDA.CUBLAS
    using CUDA.CUFFT
    using CUDA.CUSOLVER
    using CUDA.CUSPARSE
end
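
As a small taste of these wrappers, the sketch below calls into the dense, sparse and FFT libraries through ordinary Julia syntax and checks the results against the CPU. It assumes that `*` on `CuArray`s and on `CuSparseMatrixCSR` dispatches to the vendor libraries and that `CUFFT.fft` is reachable as written; sizes and names are arbitrary choices for illustration:

```julia
using CUDA, LinearAlgebra, SparseArrays
using CUDA.CUSPARSE, CUDA.CUFFT

if CUDA.functional()
    M = rand(Float32, 64, 64)
    v = rand(Float32, 64)
    M_d, v_d = CuArray(M), CuArray(v)

    # dense matrix product on the GPU, compared with the CPU result
    dense_ok = Array(M_d * M_d) ≈ M * M

    # sparse matrix-vector product with a CSR matrix stored on the GPU
    S = sprand(Float32, 64, 64, 0.1)
    S_d = CuSparseMatrixCSR(S)
    sparse_ok = Array(S_d * v_d) ≈ S * v

    # FFT of a constant signal: the first coefficient equals the signal length
    f = Array(CUFFT.fft(CUDA.ones(ComplexF32, 8)))
    fft_ok = f[1] ≈ 8
end
```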

A powerful way to program GPUs with arrays is through Julia’s higher-order array abstractions.
The simple element-wise addition we saw above, `A_d .+= 1`, is an example of this, but more general constructs can be created with `broadcast`, `map`, `reduce`, `accumulate`, etc.:

In [12]:
if CUDA.functional()
    broadcast(-, A_d, 1)
end

4096×4096 CuArray{Float32, 2, CUDA.DeviceMemory}:
  4.0        -0.508248   -0.335083   -0.912087    …  -0.44785    -0.858303
 -0.852563   -0.893101   -0.321687   -0.23571        -0.604702   -0.834698
 -0.221156   -0.532528   -0.346107   -0.741136       -0.792318   -0.944039
 -0.727079   -0.943091   -0.1827     -0.382383       -0.433809   -0.455255
 -0.88376    -0.353073   -0.78484    -0.403264       -0.645132   -0.550728
 -0.538295   -0.692474   -0.48064    -0.435903    …  -0.345621   -0.331552
 -0.0550714  -0.517408   -0.440841   -0.648895       -0.0305568  -0.957892
 -0.983735   -0.460856   -0.125004   -0.411053       -0.121264   -0.412667
 -0.766548   -0.546135   -0.762496   -0.181115       -0.0594926  -0.567672
 -0.461056   -0.312635   -0.778755   -0.434568       -0.802518   -0.861647
 -0.10338    -0.363496   -0.360205   -0.64806     …  -0.0638976  -0.0796155
 -0.520788   -0.152382   -0.205976   -0.00538105     -0.597848   -0.860872
 -0.842006   -0.692116   -0.208375   -0.960882   

In [13]:
if CUDA.functional()
    map(x -> x+1, A_d)
end

4096×4096 CuArray{Float32, 2, CUDA.DeviceMemory}:
 6.0      1.49175  1.66492  1.08791  …  1.6675   1.72558  1.55215  1.1417
 1.14744  1.1069   1.67831  1.76429     1.02618  1.21002  1.3953   1.1653
 1.77884  1.46747  1.65389  1.25886     1.72794  1.0586   1.20768  1.05596
 1.27292  1.05691  1.8173   1.61762     1.42323  1.43254  1.56619  1.54474
 1.11624  1.64693  1.21516  1.59674     1.35074  1.20594  1.35487  1.44927
 1.46171  1.30753  1.51936  1.5641   …  1.1884   1.91192  1.65438  1.66845
 1.94493  1.48259  1.55916  1.35111     1.00937  1.89166  1.96944  1.04211
 1.01626  1.53914  1.875    1.58895     1.28885  1.62338  1.87874  1.58733
 1.23345  1.45386  1.2375   1.81889     1.36867  1.73993  1.94051  1.43233
 1.53894  1.68737  1.22124  1.56543     1.94618  1.99532  1.19748  1.13835
 1.89662  1.6365   1.63979  1.35194  …  1.52033  1.96453  1.9361   1.92038
 1.47921  1.84762  1.79402  1.99462     1.26965  1.07163  1.40215  1.13913
 1.15799  1.30788  1.79162  1.03912     1.29896  1.8

In [14]:
if CUDA.functional()
    reduce(+, A_d)
end

8.386694f6

In [15]:
if CUDA.functional()
    accumulate(+, A_d)
end

4096×4096 CuArray{Float32, 2, CUDA.DeviceMemory}:
    5.0      2027.4   4040.62  6077.33  …  8.38053f6  8.38258f6  8.38465f6
    5.14744  2027.51  4041.29  6078.09     8.38053f6  8.38258f6  8.38465f6
    5.92628  2027.97  4041.95  6078.35     8.38053f6  8.38258f6  8.38465f6
    6.1992   2028.03  4042.76  6078.97     8.38053f6  8.38258f6  8.38465f6
    6.31544  2028.68  4042.98  6079.57     8.38053f6  8.38258f6  8.38465f6
    6.77715  2028.99  4043.5   6080.13  …  8.38053f6  8.38258f6  8.38465f6
    7.72208  2029.47  4044.06  6080.48     8.38053f6  8.38258f6  8.38465f6
    7.73834  2030.01  4044.93  6081.07     8.38053f6  8.38258f6  8.38465f6
    7.97179  2030.46  4045.17  6081.89     8.38053f6  8.38258f6  8.38465f6
    8.51074  2031.15  4045.39  6082.46     8.38054f6  8.38258f6  8.38465f6
    9.40736  2031.79  4046.03  6082.81  …  8.38054f6  8.38258f6  8.38465f6
    9.88657  2032.63  4046.83  6083.8      8.38054f6  8.38258f6  8.38465f6
   10.0446   2032.94  4047.62  6083.84     8.38054
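
These abstractions compose. The sketch below, with illustrative names and sizes, shows a fused broadcast, a `mapreduce`, and a reduction along one dimension, each checked against the equivalent CPU computation:

```julia
using CUDA

if CUDA.functional()
    x = rand(Float32, 1000)
    x_d = CuArray(x)

    # dotted operations fuse into a single element-wise GPU operation
    y_d = 2 .* x_d .^ 2 .+ 1
    broadcast_ok = Array(y_d) ≈ 2 .* x .^ 2 .+ 1

    # map and reduce in one pass: sum of squares
    sumsq = mapreduce(abs2, +, x_d)
    mapreduce_ok = sumsq ≈ sum(abs2, x)

    # reduction along the first dimension gives one sum per column
    X_d = CUDA.ones(Float32, 4, 3)
    colsums = Array(sum(X_d; dims=1))
end
```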

Using the high-level GPU array functionality made it easy to perform these computations on the GPU. However, we didn't learn what's going on under the hood, and that's the main goal of this tutorial. It's time to write our own kernels!

In [16]:
function vadd!(C, A, B)
    for i in 1:length(A)
        @inbounds C[i] = A[i] + B[i]
    end
    return nothing
end

vadd! (generic function with 1 method)

In [17]:
A = ones(10)
B = ones(10)
C = similar(B)
vadd!(C, A, B)
C

10-element Vector{Float64}:
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0

In [18]:
if CUDA.functional()
    # We can already run this on the GPU with the @cuda macro,
    # which will compile vadd!() into a GPU kernel and launch it
    A_d = CuArray(A)
    B_d = CuArray(B)
    C_d = similar(B_d)
    @cuda vadd!(C_d, A_d, B_d)
    C_d
end

10-element CuArray{Float64, 1, CUDA.DeviceMemory}:
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0

The macros for the other GPU backends are `@roc`, `@oneapi` and `@metal`.

The performance is terrible: without a launch configuration, `@cuda` runs the kernel on a single GPU thread, which processes all elements sequentially, and launching more threads would only make each of them repeat the same loop. So we have to remove the loop over all elements and instead use the special `threadIdx` and `blockDim` functions, analogous respectively to `threadid` and `nthreads` for multithreading.

We can split work between the GPU threads by using a special function which returns the index of the GPU thread which executes it.

**GPU kernel**: a function that will be executed by all *GPU threads* in parallel.
    
Based on the index of a thread, we can make the threads operate on different pieces of the given data.

(It might be helpful to think of the GPU kernel as being the body of a loop.)

In [19]:
function vadd2!(C, A, B)
    index = threadIdx().x   # linear indexing, so only use `x`
    @inbounds C[index] = A[index] + B[index]
    return nothing
end

vadd2! (generic function with 1 method)

In [20]:
if CUDA.functional()
    N = 2^8
    A = 2 * CUDA.ones(N)
    B = 3 * CUDA.ones(N)
    C = similar(B)

    nthreads = N
    @cuda threads=nthreads vadd2!(C, A, B)
end

CUDA.HostKernel for vadd2!(CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1})

In [21]:
if CUDA.functional()
    all(Array(C) .== 5.0)
end

true

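The launch syntax is similar for the other GPU backends, although each backend provides its own indexing functions to use inside the kernel in place of `threadIdx`: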
```julia
groupsize = length(A)
@roc groupsize=groupsize vadd!(C, A, B)

items = length(A)
@oneapi items=items vadd!(C, A, B)

nthreads = length(A)
@metal threads=nthreads vadd!(C, A, B)
```

To do even better, we need to parallelize more. GPUs have a limited number of threads they can run on a single streaming multiprocessor (SM), but they also have multiple SMs. To take advantage of them all, we need to run a kernel with multiple blocks. We'll divide up the work like this:

![gpu_threads_block](https://github.com/amontoison/Workshop-GERAD/blob/main/Graphics/gpu_threads_block.png?raw=1)

Conceptual mapping:

* **Grid** of blocks → entire GPU
* **Blocks** of threads → SMs
* **Threads** → CUDA cores

**Note**: up to three dimensions, $(x, y, z)$, can be used to organize the thread blocks and threads in each block.

This diagram was borrowed from a description of the NVIDIA C/C++ library; in Julia, threads and blocks begin numbering with 1 instead of 0. In this diagram, the 4096 blocks of 256 threads (making 1048576 = 2^20 threads) ensure that each thread processes just a single entry; however, to ensure that arrays of arbitrary size can be handled, let's still use a loop:

In [22]:
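function vadd3!(C, A, B)
    # global (1-based) index of this thread across all blocks
    index = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    # total number of threads in the grid
    stride = gridDim().x * blockDim().x
    # grid-stride loop: this thread handles index, index + stride, ...
    for i = index:stride:length(B)
        @inbounds C[i] = A[i] + B[i]
    end
    return nothing
end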

vadd3! (generic function with 1 method)
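
The index arithmetic of such a grid-stride loop can be checked on the CPU without any GPU. The helper below, `simulate_grid_stride`, is a hypothetical name introduced only for this illustration: it replays the loop for every (block, thread) pair and counts how often each element is visited.

```julia
# CPU-only simulation of the grid-stride indexing pattern
function simulate_grid_stride(n, nthreads, nblocks)
    visits = zeros(Int, n)
    stride = nblocks * nthreads              # gridDim().x * blockDim().x
    for block in 1:nblocks, thread in 1:nthreads
        index = thread + (block - 1) * nthreads
        for i in index:stride:n
            visits[i] += 1
        end
    end
    return visits
end

# 1000 elements, but only 4 blocks of 64 threads = 256 threads
visits = simulate_grid_stride(1000, 64, 4)
all(==(1), visits)   # true: every element is handled exactly once
```

Because the starting indices cover `1:stride` without gaps and each thread advances by `stride`, no element is skipped or processed twice, whatever the array length.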

In [23]:
if CUDA.functional()
    nthreads = CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
end

1024

The maximum number of threads per block depends on your GPU!

In [24]:
if CUDA.functional()
    N = 2^14
    A = 2 * CUDA.ones(N)
    B = 3 * CUDA.ones(N)
    C = similar(B)

    # smallest integer larger than or equal to N / nthreads
    numblocks = ceil(Int, N/nthreads)
end

16

In [25]:
if CUDA.functional()
    @cuda threads=nthreads blocks=numblocks vadd3!(C, A, B)
end

CUDA.HostKernel for vadd3!(CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1}, CuDeviceVector{Float32, 1})

In [26]:
 all(Array(C) .== 5.0)

true
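
In the launch above there are exactly as many threads as elements, so the loop body runs once per thread. The sketch below exercises the loop for real: the array length is not a multiple of the block size and the grid is deliberately too small. The kernel name `vadd_stride!` and the sizes are assumptions made for this illustration:

```julia
using CUDA

function vadd_stride!(C, A, B)
    index = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    stride = gridDim().x * blockDim().x
    # each thread walks through the array in steps of `stride`
    for i in index:stride:length(C)
        @inbounds C[i] = A[i] + B[i]
    end
    return nothing
end

if CUDA.functional()
    n = 10_007                       # not a multiple of 256
    a_d = 2 * CUDA.ones(n)
    b_d = 3 * CUDA.ones(n)
    c_d = CUDA.zeros(n)

    # only 4 * 256 = 1024 threads for 10_007 elements
    @cuda threads=256 blocks=4 vadd_stride!(c_d, a_d, b_d)
    ok = all(Array(c_d) .== 5)
end
```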

CUDA.jl supports indexing in up to 3 dimensions (x, y and z, e.g. `threadIdx().z`). This is convenient for multidimensional data where thread blocks can be organised into 1D, 2D or 3D arrays of threads.

To automatically select an appropriate number of threads, it is recommended to use the launch configuration API. This API takes a compiled (but not launched) kernel and returns a named tuple with an upper bound on the number of threads per block and the minimum number of blocks required to fully saturate the GPU. We therefore first compile the kernel without launching it, query the suggested configuration, and then launch the compiled kernel:

In [27]:
# compile kernel
kernel = @cuda launch=false vadd3!(C, A, B)

# extract configuration via occupancy API
config = launch_configuration(kernel.fun)

# number of threads should not exceed size of array
threads = min(length(A), config.threads)

# smallest integer larger than or equal to length(A)/threads
blocks = cld(length(A), threads)

# launch kernel with specific configuration
kernel(C, A, B; threads, blocks)

**Debugging**: Many things can go wrong with GPU kernel programming and unfortunately error messages are sometimes not very useful because of how the GPU compiler works.

Conventional print-debugging is often a reasonably effective way to debug GPU code. CUDA.jl provides macros that facilitate this:
- `@cushow` (like `@show`): display an expression and its result, and return that value.
- `@cuprintln` (like `println`): print text and values.
- `@cuassert` (like `@assert`): check a condition and abort execution if it fails, which can also help to find issues.

GPU code introspection macros also exist, like `@device_code_warntype`, to track down type instabilities.

In [28]:
function gpu_add_print!(y, x)
    index = threadIdx().x    # this example only requires linear indexing, so just use `x`
    stride = blockDim().x
    @cuprintln("thread $index, block $stride")
    for i = index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

if CUDA.functional()
    x_d = CUDA.rand(10)
    y_d = CUDA.rand(10)
    @cuda threads=10 gpu_add_print!(y_d, x_d)
    synchronize()
end

thread 1, block 10
thread 2, block 10
thread 3, block 10
thread 4, block 10
thread 5, block 10
thread 6, block 10
thread 7, block 10
thread 8, block 10
thread 9, block 10
thread 10, block 10


**Conclusion**: Keep in mind that the high-level functionality of CUDA.jl often means that you don't need to worry about writing kernels at such a low level. However, there are many cases where computations can be optimized using clever low-level manipulations. The kernels implemented in Julia give you all the flexibility and performance a GPU has to offer, within a familiar language.

A typical approach for porting or developing an application for the GPU is as follows:
- develop an application using generic array functionality, and test it on the CPU with the `Array` type;
- port your application to the GPU by switching to the `CuArray` type;
- disallow the CPU fallback ("scalar indexing") to find operations that are not implemented for or incompatible with GPU execution;
- (optional) use lower-level, CUDA-specific interfaces to implement missing functionality or optimize performance.   
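
The first three steps can be sketched on a toy function. `normalize_columns!` is a hypothetical example written only with generic array operations; it is not part of CUDA.jl or of this workshop's code:

```julia
using CUDA

# Step 1: generic array code, developed and tested on the CPU
function normalize_columns!(X)
    norms = sqrt.(sum(abs2, X; dims=1))   # 1 × ncols row of column norms
    X ./= norms                           # broadcast division, column by column
    return X
end

X = rand(Float32, 100, 8)
X_cpu = normalize_columns!(copy(X))

if CUDA.functional()
    # Step 3: forbid the CPU fallback so that any scalar indexing would throw
    CUDA.allowscalar(false)
    # Step 2: the same function, now called with a CuArray
    X_gpu = normalize_columns!(CuArray(X))
    same = Array(X_gpu) ≈ X_cpu
end
```

Since the function uses only reductions and broadcasts, it runs unchanged on both array types.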

**Exercise**: GPU-port the `sqrt_sum` function we saw in the first notebook:

In [29]:
function sqrt_sum(A)
    T = eltype(A)
    s = zero(T)
    for i in eachindex(A)
        @inbounds s += sqrt(A[i])
    end
    return s
end

sqrt_sum (generic function with 1 method)

# References:
- https://cuda.juliagpu.org/stable/
- https://www.youtube.com/watch?v=Fz-ogmASMAE
- https://www.cherryservers.com/blog/gpu-vs-cpu-what-are-the-key-differences
- https://developer.nvidia.com/blog/tag/cuda-refresher/
- https://docs.nvidia.com/cuda/
- https://www.youtube.com/watch?v=Hz9IMJuW5hU
- https://julialang.org/learning/