---
### Welcome to the GPU exercises of the topical CERN School of Computing Spring 2021 edition!

Please follow the steps in this notebook. All underlying code can be found in the directories of the project. The files can be opened with an editor by clicking on the respective files. You can obtain syntax highlighting for `.cu` files by choosing `LANGUAGE` -> `C++` from the menu. We highly recommend that you inspect the source code of every exercise. The source files are linked where they are used in the examples below.

You can execute the code in the code execution cells below by pressing `CTRL` + `ENTER`. The output will appear just below the cell. Sometimes the execution can take a little while, so please be patient. Note that compilation commands do not produce output when they run successfully.

When CUDA programs are executed in the exercise setup, we always call a small script, `run-exclusive.sh`, to ensure that no other user is using the GPU assigned to you at the same time. This is necessary since we have fewer GPUs available for the session than students attending. If the program has not started after about 30 seconds, just try again. If it still does not work, you can inspect the processes running on your GPU with `nvidia-smi` (see instructions below). If in doubt, please get in touch with the organizers.

### Exploring the GPU characteristics

Let us first explore the GPU available to you in this lab environment. It is an accelerated system, containing a GPU assigned to you (and a fellow student). `nvidia-smi` (Systems Management Interface) is a utility shipped with CUDA that monitors the processes running on the GPU and provides some information. It is often useful for checking whether another user is using the same GPU.
Find out which GPU type you are provided with, whether a process is running and what the memory consumption is. Note that `nvidia-smi` also tells you the exact driver and CUDA version. This can be useful information if there is a mismatch between the CUDA and driver versions after a new installation or an update (nothing to be worried about for this lab).

Note down the GPU name and the memory available.

In [None]:
!nvidia-smi

### Compile a CUDA program

Let us now compile and run a small program, [`device_properties.cu`](edit/device_properties/device_properties.cu) (*<---- click on the link of the source file to open it in another tab for editing*), which will give more detailed information about the GPU(s) available. This program does not do any calculations on the GPU; it simply queries information about the GPU and shows you which function calls are available for that purpose.
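As a rough idea of what such a query program can look like, the sketch below uses the CUDA runtime calls `cudaGetDeviceCount` and `cudaGetDeviceProperties`. It is an illustration written for this text and not the contents of `device_properties.cu`; the helper name `print_device` is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not taken from device_properties.cu):
// prints a few properties of device `dev`, returns false if the query fails.
bool print_device(int dev) {
  cudaDeviceProp prop;
  cudaError_t err = cudaGetDeviceProperties(&prop, dev);
  if (err != cudaSuccess) {
    std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
    return false;
  }
  std::printf("Device %d: %s\n", dev, prop.name);
  std::printf("  compute capability:        %d.%d\n", prop.major, prop.minor);
  std::printf("  global memory:             %zu bytes\n", prop.totalGlobalMem);
  std::printf("  shared memory per block:   %zu bytes\n", prop.sharedMemPerBlock);
  std::printf("  warp size:                 %d\n", prop.warpSize);
  std::printf("  streaming multiprocessors: %d\n", prop.multiProcessorCount);
  std::printf("  max threads per block:     %d\n", prop.maxThreadsPerBlock);
  std::printf("  max block dimensions:      %d x %d x %d\n",
              prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
  std::printf("  max grid dimensions:       %d x %d x %d\n",
              prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
  return true;
}

int main() {
  int device_count = 0;
  cudaError_t err = cudaGetDeviceCount(&device_count);
  if (err != cudaSuccess) {
    std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("Found %d CUDA device(s)\n", device_count);
  for (int dev = 0; dev < device_count; ++dev) {
    if (!print_device(dev)) return 1;
  }
  return 0;
}
```

The `major` and `minor` fields together give the compute capability, which is what the `sm_XY` architecture names refer to.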

`.cu` is the extension for CUDA accelerated files. To compile a CUDA file, we use the `nvcc` compiler, which compiles both the host and the device sections of the code. Its usage is very similar to `gcc`. Let's take a closer look at the following command:
- `nvcc` invokes the compiler from the command line
- `-arch` indicates the GPU architecture for which the file is compiled. For our GPU this is `sm_70`. For more information on the architecture, please refer to the [CUDA documentation](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-steering-gpu-code-generation)
- `-o` specifies the output file (i.e. the compiled program) 
- `device_properties/device_properties.cu` is the file to compile

In [None]:
!nvcc -arch=sm_70 -o device_properties/device_properties device_properties/device_properties.cu 

After successfully compiling your first CUDA program, we can now execute it. This is done by calling the output file produced above, i.e. `./device_properties/device_properties`. In our case we call it with the wrapper script [`run-exclusive.sh`](edit/run-exclusive.sh) described above which ensures that only a single process runs on the GPU at once. 

In [None]:
!./run-exclusive.sh ./device_properties/device_properties

Note down the following: What is the maximum grid size on this GPU? How much global and shared memory is available? What are the restrictions on the block size? How large is the warp size? How many Streaming Multiprocessors are there on the GPU? Can you infer from the information that the compute architecture of this GPU is indeed `sm_70`?

*Side note for when you work on any CUDA server: A similar program is available in the CUDA samples which are installed with every CUDA installation. You can find them in the directory `cuda-samples` of your CUDA installation directory (typically `/usr/local/cuda`). The programs in `1_Utilities` can be useful. In particular, `1_Utilities/deviceQuery` gives similar (and more) information than the `device_properties` program we provide here.*

### Hello World from the GPU

We are now ready to write our own first function for GPUs, starting from the following code: [`hello_world/hello_world.cu`](../edit/gpulab-tcsc-june-211/hello_world/hello_world.cu). This code comes with a CPU function and already compiles, but only calls the CPU version of the function. Your task is to modify the code to also provide a working GPU function and to invoke it.

The idea is to print a message to the terminal from every invoked instance of the kernel. For this, modify the GPU function such that it is actually executed on the GPU and invoke the GPU function. Open the file in a separate tab in an editor, modify it according to the instructions given in the file (marked with *to do*) and remember to save your changes. Then you can compile and run it with the commands given below.

Remember that a GPU kernel is launched with the following syntax: `someKernel<<<number_of_blocks, number_of_threads>>>()`. `someKernel` is then executed for every thread in every block, so `number_of_blocks` * `number_of_threads` times. `someKernel<<<1, 1>>>()` launches only one instance: one thread in one block. `someKernel<<<1, 32>>>()` launches 32 instances: 32 threads in one block. `someKernel<<<2, 32>>>()` launches 64 instances: two blocks with 32 threads each.

Each thread is identified by an index, starting from 0, and each block is identified by an index starting from 0. 
To identify within the CUDA kernel code which instance of the kernel is processed, the pre-defined variables `blockIdx.x` and `threadIdx.x` are available to identify the index of the block and the index of the thread within the block. Note their usage in our `hello_world_gpu` function. 
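To make the indexing concrete, here is a minimal sketch of a kernel that reports its own block and thread index. The names `report_index` and `launch_report` are hypothetical and chosen for illustration; this is not the provided `hello_world_gpu` function or its solution. The sketch also uses the built-in `blockDim.x` (the number of threads per block) to combine both indices into one index that is unique across the grid.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: every launched instance prints
// which block it belongs to and which thread it is within that block.
__global__ void report_index() {
  // Index that is unique across the whole one-dimensional grid.
  int global_idx = blockIdx.x * blockDim.x + threadIdx.x;
  printf("block %d, thread %d, global index %d\n",
         (int)blockIdx.x, (int)threadIdx.x, global_idx);
}

// Launches the kernel and waits for it; returns false on any CUDA error.
bool launch_report(int n_blocks, int n_threads) {
  report_index<<<n_blocks, n_threads>>>();
  cudaError_t err = cudaGetLastError();              // errors from the launch itself
  if (err == cudaSuccess) err = cudaDeviceSynchronize();  // wait so the device output is flushed
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return false;
  }
  return true;
}

int main() {
  // Two blocks with four threads each: eight kernel instances in total.
  return launch_report(2, 4) ? 0 : 1;
}
```

Launched as `<<<2, 4>>>`, this prints eight lines. The order in which the lines appear is not guaranteed in general.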

If you are stuck or would like to have some inspiration, you can take a look at the [solution](../edit/gpulab-tcsc-june-211/hello_world/hello_world_solution.cu).

In [None]:
!nvcc -arch=sm_70 -o hello_world/hello_world hello_world/hello_world.cu

In [None]:
!./run-exclusive.sh ./hello_world/hello_world 1 1

Congratulations! You just processed your first function on a GPU! 

Let's explore a bit deeper. The program takes the following input parameters: the number of threads per block and the number of blocks in the grid (both set to one in the above program call). Try experimenting with different settings! In particular, try at least 64 or 96 threads per block. What pattern do you observe in the printout? What could it be due to?

In [None]:
!./run-exclusive.sh ./hello_world/hello_world 3 64

### Vector addition
We are now ready to move on to an exercise where data is copied to and from the GPU and calculations are executed on the GPU: a vector addition. Start from the code provided in [vector_addition.cu](../edit/gpulab-tcsc-june-211/vector_addition/vector_addition.cu) and follow the instructions below and in the code. 
The provided code allocates memory on the host, but the vector addition does not yet run in parallel on the GPU.
Our goal now is to allocate the required memory on the GPU, copy the input from the host to the GPU, call the vector addition in parallel on the GPU and copy the result back to the host to check that it is correct.

The initial version of the code compiles and runs. It takes three input parameters: the size of the vectors to be added, the number of blocks in the grid and the number of threads per block. Note that the last two parameters will only be relevant once we parallelize the vector addition on the GPU in Step 2.

After each of the steps below, check that your code compiles and runs!

As before, if you are stuck or need inspiration you can take a look at the [solution](../edit/gpulab-tcsc-june-211/vector_addition/vector_addition_solution.cu).

#### Step 1: Allocating memory
To do a vector addition on the GPU, we have to allocate GPU memory (global memory) for the input vectors and also for the vector in which the result is stored. The input vectors have to be copied from host memory to GPU global memory, and at the end the result vector has to be copied from the GPU to the host.

Note that in the code we label host variables with the suffix `_h` and device variables with the suffix `_d`. This is common practice in CUDA programs to distinguish between pointers to host and device memory.

Follow the instructions in the code labelled with *Step 1 to do*. There are three places labelled this way, where you should:
- Allocate GPU global memory for the three device vectors `a_d`, `b_d` and `c_d`
- Copy the data in the host vectors `a_h`, `b_h`, `c_h` to the device vectors `a_d`, `b_d`, `c_d`
- Free the global memory used for the device vectors `a_d`, `b_d`, `c_d`
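The three memory operations listed above can be sketched as follows, using the CUDA runtime calls `cudaMalloc`, `cudaMemcpy` and `cudaFree`. This is a stand-alone illustration with a single vector and not the course solution; the helper names `check` and `round_trip` and the variable `back_h` are hypothetical. To have something to verify without a kernel, the data is simply copied to the device and straight back.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: reports a failed CUDA call and returns false.
static bool check(cudaError_t err, const char* what) {
  if (err != cudaSuccess) {
    std::printf("%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
  }
  return true;
}

// Copies n floats from the host to the device and back again.
// Returns true if the data survives the round trip unchanged.
bool round_trip(int n) {
  std::vector<float> a_h(n), back_h(n, -1.0f);
  for (int i = 0; i < n; ++i) a_h[i] = static_cast<float>(i);

  size_t bytes = n * sizeof(float);
  float* a_d = nullptr;

  bool ok =
      // Allocate global memory on the device.
      check(cudaMalloc(&a_d, bytes), "cudaMalloc") &&
      // Host -> device: destination first, then source, size in bytes, direction.
      check(cudaMemcpy(a_d, a_h.data(), bytes, cudaMemcpyHostToDevice), "copy to device") &&
      // Device -> host, into a second host buffer so the copy can be checked.
      check(cudaMemcpy(back_h.data(), a_d, bytes, cudaMemcpyDeviceToHost), "copy to host");

  // Free the device memory (freeing a null pointer is a no-op).
  cudaFree(a_d);

  for (int i = 0; ok && i < n; ++i) {
    if (back_h[i] != a_h[i]) ok = false;
  }
  return ok;
}

int main() {
  bool ok = round_trip(36);
  std::printf("round trip %s\n", ok ? "succeeded" : "failed");
  return ok ? 0 : 1;
}
```

Note that `cudaMemcpy` takes the size in bytes, not the number of elements, which is an easy mistake to make.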

Test that your code compiles and runs!

In [32]:
!nvcc -arch=sm_70 -o vector_addition/vector_addition vector_addition/vector_addition.cu

[01m[Kgcc:[m[K [01;31m[Kerror: [m[Kvector_addition/vector_addition.cu: No such file or directory
[01m[Kgcc:[m[K [01;31m[Kfatal error: [m[Kno input files
compilation terminated.


In [33]:
!./run-exclusive.sh ./vector_addition/vector_addition 36 6 6

./run-exclusive.sh: line 30: ./vector_addition/vector_addition_solution: No such file or directory


#### Step 2: Vector addition in parallel on the GPU

It is now time to call `vector_addition_gpu` on the GPU and to ensure that the addition is carried out in parallel. For this, follow the instructions labelled with *Step 2 to do* to do the following:
- Label `vector_addition_gpu` with the `__global__` specifier
- Modify the for loop inside `vector_addition_gpu` to be executed in parallel (see explanations below)
- Uncomment the grid dimension variable definitions
- Launch the kernel 

`for` loops are ideal candidates to be processed in parallel if the iterations do not depend on one another, as is the case in vector addition. The idea is that, instead of running the iterations of the loop sequentially, the iterations are processed in parallel by all available threads. For this, two things must happen: 1) the kernel is written to execute one iteration based on its thread and block index, and 2) we must ensure that all iterations of the `for` loop are processed, irrespective of how many threads and blocks the kernel was launched with. Note that for this to work you can use the known `threadIdx.x`, `blockIdx.x` and `blockDim.x` variables.
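One common way to satisfy both conditions is a so-called grid-stride loop, sketched below. It is only one possible approach and not necessarily the one used in the provided solution. Besides the variables named above, it relies on the built-in `gridDim.x` (the number of blocks in the grid). The kernel `count_visits` and the helper `every_element_visited_once` are hypothetical names: instead of adding vectors, the kernel increments a counter per element, so that the host can check that every iteration was processed exactly once.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: each thread starts at its global
// index and then jumps ahead by the total number of launched threads.
__global__ void count_visits(int* visits, int n) {
  int first = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  int stride = gridDim.x * blockDim.x;                // total number of threads
  for (int i = first; i < n; i += stride) {
    visits[i] += 1;  // each i is handled by exactly one thread
  }
}

// Returns true if every one of the n elements was visited exactly once.
// Assumes n, n_blocks and n_threads are all positive.
bool every_element_visited_once(int n, int n_blocks, int n_threads) {
  size_t bytes = n * sizeof(int);
  int* visits_d = nullptr;
  if (cudaMalloc(&visits_d, bytes) != cudaSuccess) return false;
  cudaMemset(visits_d, 0, bytes);  // start all counters at zero

  count_visits<<<n_blocks, n_threads>>>(visits_d, n);

  // The copy waits for the kernel and also reports errors from it.
  std::vector<int> visits_h(n, -1);
  cudaError_t err =
      cudaMemcpy(visits_h.data(), visits_d, bytes, cudaMemcpyDeviceToHost);
  cudaFree(visits_d);
  if (err != cudaSuccess) return false;

  for (int v : visits_h) {
    if (v != 1) return false;
  }
  return true;
}

int main() {
  // 36 elements with fewer threads (24), as many (36) and more (64).
  bool ok = every_element_visited_once(36, 4, 6) &&
            every_element_visited_once(36, 6, 6) &&
            every_element_visited_once(36, 2, 32);
  std::printf("all elements visited exactly once: %s\n", ok ? "yes" : "no");
  return ok ? 0 : 1;
}
```

With 4 blocks of 6 threads there are 24 threads for 36 elements, so the threads with global index 0 to 11 each process two elements and the remaining threads process one.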

Now test again that your code compiles and runs!

In [36]:
!nvcc -arch=sm_70 -o vector_addition/vector_addition vector_addition/vector_addition.cu

[01m[Kgcc:[m[K [01;31m[Kerror: [m[Kvector_addition/vector_addition.cu: No such file or directory
[01m[Kgcc:[m[K [01;31m[Kfatal error: [m[Kno input files
compilation terminated.


In [35]:
!nvcc -arch=sm_70 -o vector_addition/vector_addition_solution vector_addition/vector_addition_solution.cu

In [37]:
!./run-exclusive.sh ./vector_addition/vector_addition 36 6 6

Need two arguments: size of vector and number of threads / block


#### Step 3: Copy and verify result

As the last step, we have to copy back the vector containing the result and verify that the computations executed on the GPU were correct. Follow the instructions labelled with *Step 3 to do* for this.

- Copy content of `c_d` to `c_h`
- Synchronize to ensure that the GPU work is finished
- Verify the result obtained from the GPU
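The three items above can be sketched as follows, again as an illustration under assumptions rather than as the course solution. The kernel `add_gpu` and the helpers `ok` and `add_and_verify` are hypothetical names, the kernel uses one thread per element with a bounds guard, and the block size of 32 is an arbitrary choice. The result is compared with the same addition carried out on the host.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per element, surplus threads do nothing.
__global__ void add_gpu(const float* a, const float* b, float* c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: reports a failed CUDA call and returns false.
static bool ok(cudaError_t err, const char* what) {
  if (err != cudaSuccess) {
    std::printf("%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
  }
  return true;
}

// Adds two vectors of length n on the GPU and checks the result on the host.
bool add_and_verify(int n) {
  std::vector<float> a_h(n), b_h(n), c_h(n, 0.0f);
  for (int i = 0; i < n; ++i) { a_h[i] = 0.5f * i; b_h[i] = 2.0f * i; }

  size_t bytes = n * sizeof(float);
  float *a_d = nullptr, *b_d = nullptr, *c_d = nullptr;
  bool good = ok(cudaMalloc(&a_d, bytes), "cudaMalloc a_d") &&
              ok(cudaMalloc(&b_d, bytes), "cudaMalloc b_d") &&
              ok(cudaMalloc(&c_d, bytes), "cudaMalloc c_d") &&
              ok(cudaMemcpy(a_d, a_h.data(), bytes, cudaMemcpyHostToDevice), "copy a_h") &&
              ok(cudaMemcpy(b_d, b_h.data(), bytes, cudaMemcpyHostToDevice), "copy b_h");
  if (good) {
    int n_threads = 32;                              // arbitrary block size
    int n_blocks = (n + n_threads - 1) / n_threads;  // round up to cover all n
    add_gpu<<<n_blocks, n_threads>>>(a_d, b_d, c_d, n);
    good = ok(cudaGetLastError(), "kernel launch") &&
           // Copy the content of c_d to c_h.
           ok(cudaMemcpy(c_h.data(), c_d, bytes, cudaMemcpyDeviceToHost), "copy c_d") &&
           // Block the host until all previously issued GPU work is complete.
           ok(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
  }
  cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);

  // Verify against the same addition done on the host.
  for (int i = 0; good && i < n; ++i) {
    if (std::fabs(c_h[i] - (a_h[i] + b_h[i])) > 1e-5f) {
      std::printf("mismatch at index %d\n", i);
      good = false;
    }
  }
  return good;
}

int main() {
  bool good = add_and_verify(36);
  std::printf("verification %s\n", good ? "passed" : "failed");
  return good ? 0 : 1;
}
```

Checking the return value of every CUDA call, as done here through `ok`, makes it much easier to see which step failed when the verification does not pass.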

Now compile and run again to check that your first calculations on a GPU are working!

Play with the number of blocks and threads and test scenarios where the product `n_threads` * `n_block` is not equal to the vector size. If the result is still correct in these cases, you have correctly parallelized your `for` loop.

In [38]:
!nvcc -arch=sm_70 -o vector_addition/vector_addition vector_addition/vector_addition.cu

[01m[Kgcc:[m[K [01;31m[Kerror: [m[Kvector_addition/vector_addition.cu: No such file or directory
[01m[Kgcc:[m[K [01;31m[Kfatal error: [m[Kno input files
compilation terminated.


In [39]:
!./run-exclusive.sh ./vector_addition/vector_addition 36 6 6

./run-exclusive.sh: line 30: ./vector_addition/vector_addition: No such file or directory
