# Welcome to the GPU exercises of the topical CERN School of Computing!

Please follow the steps in this notebook. All underlying code can be found in the directories of the project. Click on the links in the readme to open the files for the respective exercises in an editor in a new browser window. You might have to scroll with the mouse or click once for all code to be visible. You can obtain syntax highlighting for .cu files by choosing `LANGUAGE` -> `C++` from the menu. 

You can execute the code in the code execution cells below by pressing `SHIFT` + `ENTER` or by clicking on the triangle ("run") symbol in the tool bar at the top. The output will appear just below the cell. Sometimes the execution can take a little while, so please be patient. The process is still working if the prompt looks like this: `[*]`. Note that compilation commands do not produce output when running successfully. 

When CUDA programs are executed in the exercise setup, we always call a small script called `run-exclusive.sh` to ensure that no other user is using the GPU assigned to you at the same time. This is necessary since we have fewer GPUs available for the session than students attending. If the program has not launched after about 30 seconds, just try again. If it still does not work, you can inspect the processes running on your GPU with `nvidia-smi` (see instructions below). If in doubt, please get in touch with the organizers. 

## Exploring the GPU status

Let us first explore the GPU available for you in this lab environment. It is an accelerated system, containing a GPU assigned to you (and a fellow student). `nvidia-smi` (NVIDIA System Management Interface) is a utility installed with the NVIDIA driver that monitors the processes running on the GPU and provides some information. It is often useful to check whether another user is using the same GPU.

*Sidenote: `nvidia-smi` also tells you the exact driver and CUDA version. This can be useful information if after a new installation / an update there is a mismatch between the CUDA and driver versions (nothing to be worried about for this lab).*

Note down the GPU name, memory available and whether a process is currently running on the GPU! 

In [None]:
!nvidia-smi

## Compile a CUDA program

Let us now compile and run a small program, `device_properties.cu` (*<---- open this file for editing from the link given in the readme*), which will give more detailed information about the GPU(s) available. This program does not do any calculations on the GPU; it simply queries information about it and shows you which function calls are available for that. 

`.cu` is the extension for CUDA accelerated files. To compile a CUDA file, we use the `nvcc` compiler, which compiles both the host and the device sections of the code. Its usage is very similar to `gcc`. Let's take a closer look at the following command:
- `nvcc` invokes the compiler from the command line
- `-arch` indicates the GPU architecture for which the file is compiled, given as a major number followed by a minor number. Depending on the GPU you were assigned, this is either `sm_70` or `sm_75`. Code compiled for `sm_70` also runs on `sm_75`, so `sm_70` is guaranteed to work. For more information on the architecture, please refer to the [CUDA documentation](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-steering-gpu-code-generation).
- `-o` specifies the output file (i.e. the compiled program) 
- `device_properties/device_properties.cu` is the file to compile

In [None]:
!nvcc -arch=sm_70 -o device_properties/device_properties device_properties/device_properties.cu 

## Exploring some GPU characteristics

After successfully compiling your first CUDA program, we can now execute it. This is done by calling the output file produced above, i.e. `./device_properties/device_properties`. In our case we call it with the wrapper script [`run-exclusive.sh`](edit/run-exclusive.sh) described above which ensures that only a single process runs on the GPU at once. 

In [None]:
!./run-exclusive.sh ./device_properties/device_properties

Note down the following: How large can a grid be maximally on this GPU? How much global and shared memory is available? What are the restrictions on the block size? How large is the warp size? How many Streaming Multiprocessors are there on the GPU? Can you infer from the information that the compute architecture of this GPU is indeed `sm_70`?

*Side note for when you work on any CUDA server: A similar program is available in the CUDA samples which are installed with every CUDA installation. You can find them in the directory `cuda-samples` of your CUDA installation directory (typically `/usr/local/cuda`). The scripts in `1_Utilities` can be useful. In particular, `1_Utilities/deviceQuery` gives similar (and more) information compared to the `device_properties` program we provide here.*

## Hello World from the GPU

We are now ready to write our first own function for GPUs, starting off from the `hello_world/hello_world.cu` code (click on link in readme). This code comes with a CPU function and already compiles, but only calls the CPU version of the function. Your task is to modify the code to also provide a working GPU function and to invoke it.

The idea is to print a message from every invoked kernel instance to the terminal. For this, modify the GPU function such that it is actually executed on the GPU and invoke the GPU function. Open the file in a separate tab in an editor, modify it according to the instructions given in the file (marked with *to do*) and remember to save your changes. Then you can compile and run it with the commands given below.

Remember that a GPU kernel is launched with the following syntax: `someKernel<<<number_of_blocks, number_of_threads>>>()`. `someKernel` is then executed for every thread in every block, so `number_of_blocks` * `number_of_threads` times. `someKernel<<<1, 1>>>()` launches only one instance: one thread in one block. `someKernel<<<1, 32>>>()` launches 32 instances: 32 threads in one block. `someKernel<<<2, 32>>>()` launches 64 instances: two blocks with 32 threads each.

Each thread is identified by an index, starting from 0, and each block is identified by an index starting from 0. 
To identify within the CUDA kernel code which instance of the kernel is processed, the pre-defined variables `blockIdx.x` and `threadIdx.x` are available to identify the index of the block and the index of the thread within the block. Note their usage in our `hello_world_gpu` function. 
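
To make the index space of a launch more concrete, here is a minimal plain-Python sketch that enumerates the kernel instances of a `<<<number_of_blocks, number_of_threads>>>` launch. It is only an emulation for illustration: the function name `launch_indices` is made up for this sketch, and on a real GPU the instances run in parallel rather than in this loop order. The global index computed here, `blockIdx.x * blockDim.x + threadIdx.x`, is the usual way to give every instance a unique number.

```python
def launch_indices(number_of_blocks, number_of_threads):
    """Return one (blockIdx.x, threadIdx.x, global index) tuple per kernel instance."""
    instances = []
    for block_idx in range(number_of_blocks):        # plays the role of blockIdx.x
        for thread_idx in range(number_of_threads):  # plays the role of threadIdx.x
            # number_of_threads plays the role of blockDim.x
            global_idx = block_idx * number_of_threads + thread_idx
            instances.append((block_idx, thread_idx, global_idx))
    return instances


instances = launch_indices(2, 32)  # emulates someKernel<<<2, 32>>>()
print(len(instances))              # 64
print(instances[33])               # (1, 1, 33)
```

You can paste this into a notebook cell and vary the two arguments to see how block and thread indices combine.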

If you are stuck or would like to have some inspiration, you can take a look at the [solution](../../../edit/SWAN_projects/gpulab-tcsc-2021-spring/hello_world/hello_world_solution.cu).

In [None]:
!nvcc -arch=sm_70 -o hello_world/hello_world hello_world/hello_world.cu

In [None]:
!./run-exclusive.sh ./hello_world/hello_world 1 1

Congratulations! You just processed your first function on a GPU! 

Let's explore a bit deeper. The program takes the following input parameters: the number of threads per block and the number of blocks in the grid (both set to one in the above program call). Try experimenting with different settings! In particular, try at least 64 or 96 threads per block. What pattern do you observe in the printout? What could it be due to?

In [None]:
!./run-exclusive.sh ./hello_world/hello_world 3 64

## Vector addition
We are now ready to move on to an exercise where data is copied to and from the GPU and calculations are executed on the GPU: a vector addition. Start from the code provided in `vector_addition/vector_addition.cu` and follow the instructions below and in the code. 
The provided code allocates memory on the host and runs the vector addition on the host. 
Our goal now is to allocate the required memory on the GPU, copy the input from the host to the GPU, call the vector addition in parallel on the GPU and copy the result back to the host to check that it is correct.

The initial version of the code compiles and runs. It takes three input parameters: the size of the vectors to be added, the number of blocks in the grid and the number of threads per block. Note that the last two parameters will only be relevant once we parallelize the vector addition on the GPU in Step 2.

After each of the steps below, check that your code compiles and runs!

As before, if you are stuck or need inspiration, you can take a look at the [solution](../../../edit/SWAN_projects/gpulab-tcsc-2021-spring/vector_addition/vector_addition_solution.cu).

### Step 1: Allocating memory
To do a vector addition on the GPU, we have to allocate GPU memory (global memory) for the input vectors (called `a` and `b`) and also for the vector in which the result is stored (called `c`). The input vectors have to be copied from host memory to GPU global memory and in the end the result vector has to be copied from the GPU to the host. 

Note that in the code we label host variables with the suffix `_h` and device variables with the suffix `_d`. This is common practice in CUDA programs to distinguish between pointers to host and device memory. 

Follow the instructions in the code labelled with *Step 1 to do*. There are three places labelled like that to
- Allocate GPU global memory for the three device vectors `a_d`, `b_d` and `c_d`
- Copy the data in the host vectors `a_h`, `b_h`, `c_h` to the device vectors `a_d`, `b_d`, `c_d`
- Free the global memory used for the device vectors `a_d`, `b_d`, `c_d`

Test that your code compiles and runs!

In [None]:
!nvcc -arch=sm_70 -o vector_addition/vector_addition vector_addition/vector_addition.cu

In [None]:
!./run-exclusive.sh ./vector_addition/vector_addition 36 6 6

### Step 2: Vector addition in parallel on the GPU

It is now time to call `vector_addition_gpu` on the GPU and to ensure that the addition is carried out in parallel. For this, follow the instructions labelled with *Step 2 to do* to do the following:
- Label `vector_addition_gpu` with the `__global__` specifier
- Modify the for loop inside `vector_addition_gpu` to be executed in parallel (see explanations below)
- Uncomment the grid dimension variable definitions
- Launch the kernel 

For loops are ideal candidates to be processed in parallel if the iterations do not depend on one another, as is the case in vector addition. Instead of running the iterations of the loop sequentially, the idea is to process them in parallel with all available threads. For this, two things must happen: 1) the kernel must be written to execute one iteration based on its thread and block index, and 2) we must ensure that all iterations of the for loop are processed, irrespective of how many threads and blocks the kernel was launched with. For this to work, you should use the known `threadIdx.x`, `blockIdx.x`, `blockDim.x` and `gridDim.x` variables. 
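
One common way to satisfy both conditions is a so-called grid-stride loop. The following plain-Python sketch emulates it on the CPU so that you can check the index arithmetic without a GPU. It is an illustration under assumptions, not necessarily what the provided solution does: the function name `vector_addition_emulated` is made up, and the two outer loops merely stand in for the blocks and threads that a GPU would run in parallel.

```python
def vector_addition_emulated(a, b, n_blocks, n_threads):
    """Emulate a grid-stride vector addition for an <<<n_blocks, n_threads>>> launch."""
    n = len(a)
    c = [None] * n
    stride = n_blocks * n_threads                     # gridDim.x * blockDim.x
    for block_idx in range(n_blocks):                 # blockIdx.x
        for thread_idx in range(n_threads):           # threadIdx.x
            # First element handled by this kernel instance
            i = block_idx * n_threads + thread_idx    # blockIdx.x * blockDim.x + threadIdx.x
            # Keep jumping by the total number of threads until the vector ends
            while i < n:
                c[i] = a[i] + b[i]
                i += stride
    return c


a = list(range(39))
b = [2 * x for x in a]
# 39 elements with 6 blocks of 6 threads: 36 threads, so three of them do a second iteration
print(vector_addition_emulated(a, b, 6, 6) == [3 * x for x in a])  # True
```

Because every instance starts at its own global index and advances by the total number of launched threads, each element is processed exactly once, whether there are more threads than elements or fewer.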

Now test again that your code compiles and runs!

In [None]:
!nvcc -arch=sm_70 -o vector_addition/vector_addition vector_addition/vector_addition.cu

In [None]:
!./run-exclusive.sh ./vector_addition/vector_addition 36 6 6

### Step 3: Copy and verify result

As the last step, we have to copy back the vector containing the result and verify that the computations executed on the GPU were correct. Follow the instructions labelled with *Step 3 to do* for this. 

- Copy content of `c_d` to `c_h`
- Synchronize to ensure that the GPU work is finished
- Verify the result obtained from the GPU

Now compile and run again to check that your first calculations on a GPU are working!

Play with the number of blocks and threads and test scenarios where the `n_threads` * `n_block` is not equal to the vector size. If this works, you have correctly parallelized your for loop. 

Caveat: The result vector can be correct even though the parallelization is not the intended one. This happens if every block of your grid in fact does the same work, which is the case if you did not use the variables `blockDim.x` and `gridDim.x`. Take another look at your parallelized for loop and modify it such that every block in the grid processes different vector elements from the other blocks. Check again that you can process vector lengths that do not match the number of blocks and threads!
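
To see why a correct result does not prove a correct parallelization, the following plain-Python sketch counts how often each vector element is processed under two indexing schemes. It is a CPU emulation with a made-up helper name (`count_visits`); the loops stand in for the blocks and threads of a launch.

```python
def count_visits(n, n_blocks, n_threads, use_block_index):
    """Count how many kernel instances process each of the n vector elements."""
    visits = [0] * n
    for block_idx in range(n_blocks):
        for thread_idx in range(n_threads):
            if use_block_index:
                # Intended scheme: start and stride depend on the whole grid
                start = block_idx * n_threads + thread_idx
                stride = n_blocks * n_threads
            else:
                # Faulty scheme: only the thread index within the block is used
                start = thread_idx
                stride = n_threads
            for i in range(start, n, stride):
                visits[i] += 1
    return visits


print(set(count_visits(39, 6, 6, use_block_index=True)))   # {1}
print(set(count_visits(39, 6, 6, use_block_index=False)))  # {6}
```

In the faulty scheme every element is computed once per block, here six times. The sums are still right, but five of the six blocks only repeat work that another block has already done.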

In [None]:
!nvcc -arch=sm_70 -o vector_addition/vector_addition vector_addition/vector_addition.cu

In [None]:
!./run-exclusive.sh ./vector_addition/vector_addition 39 6 6

## Profiling a CUDA application

The only way to be assured that attempts at optimizing accelerated code bases are actually successful is to profile the application for quantitative information about the application's performance. `nsys` is the Nsight Systems command line tool. It is a powerful tool for profiling accelerated applications.

Its most basic usage is to simply pass it the path to an executable compiled with `nvcc`. `nsys` will proceed to execute the application, after which it will print a summary output of the application's GPU activities, CUDA API calls and so on.

When accelerating applications, or optimizing already-accelerated applications, take a scientific and iterative approach. Profile your application after making changes, take note, and record the implications of any refactoring on performance. Make these observations early and often: frequently, enough of a performance boost can be gained with little effort that you can ship your accelerated application. Additionally, frequent profiling will teach you how specific changes to your CUDA codebase impact its actual performance: knowledge that is hard to acquire when only profiling after many kinds of changes in your codebase.

### Exercise: Profile an Application with nsys

`nsys profile` will generate a `qdrep` report file which can be used in a variety of manners. We use the `--stats=true` flag here to indicate we would like summary statistics printed. There is quite a lot of information printed:

- Profile configuration details
- Report file(s) generation details
- **CUDA API Statistics**
- **CUDA Kernel Statistics**
- **CUDA Memory Operation Statistics (time and size)**
- OS Runtime API Statistics

We will be inspecting the 3 sections in **bold** above throughout the following exercises.

After profiling the application, answer the following questions using information displayed in the profiling output:

- What was the name of the only CUDA kernel called in this application?
- How many times did this kernel run?
- How long did it take this kernel to run?

As a first example, you can profile the vector addition application you worked on so far:

In [None]:
!PATH=/cvmfs/sft.cern.ch/lcg/releases/cuda/11.4/x86_64-centos7-gcc10-opt/bin:$PATH ./run-exclusive.sh nsys profile --stats=true ./vector_addition/vector_addition 39 6 6

It is worth mentioning that by default, `nsys profile` will not overwrite an existing report file. This is done to prevent accidental loss of work when profiling. If for any reason you would rather overwrite an existing report file, say during rapid iterations, you can pass `--force-overwrite=true` (short form: `-f true`) to `nsys profile`.

### Exercise: Profile an Application with ncu

Nsight Compute is another profiling tool that provides detailed performance metrics and API debugging. Its command line tool version `ncu` can also be used during this lab.

We will use `ncu` with the `--target-processes=all` option, which tells it to profile the launched command and all of its child processes, so that the application started through the `run-exclusive.sh` wrapper is profiled as well. `ncu` launches each kernel a number of times in order to analyze the various metrics requested by the user. Even though we will use the default metrics, you may have a look [at other metrics if you want to dig deeper](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvprof-metric-comparison).

By default, the following metric sections are reported:

* GPU Speed of Light (SOL): Reports throughput as the achieved percentage of utilization with respect to the theoretical maximum.
* Launch Statistics: Statistics of the launch of the kernel. The number of threads, shared memory size requested and registers per thread are three very useful indicators that can affect the performance of your application.
* Occupancy: Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps. Another way to view occupancy is the percentage of the hardware's ability to process warps that is actively in use. Higher occupancy does not always result in higher performance. However, low occupancy always reduces the ability to hide latencies, resulting in overall performance degradation. Large discrepancies between the theoretical and the achieved occupancy during execution typically indicate highly imbalanced workloads.
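
As a rough feeling for the occupancy ratio, the following Python sketch estimates the theoretical occupancy from the block size alone. It is a simplification under stated assumptions: it ignores the limits imposed by registers and shared memory, the function name is made up, and the hardware limits passed in below (64 warps and 32 blocks per multiprocessor) are illustrative values that you should replace with the numbers documented for your GPU.

```python
def theoretical_occupancy(threads_per_block, max_warps_per_sm, max_blocks_per_sm, warp_size=32):
    """Estimate occupancy from the block size only (registers and shared memory ignored)."""
    # A partially filled warp still occupies a full warp slot
    warps_per_block = (threads_per_block + warp_size - 1) // warp_size
    # Resident blocks are limited by the block limit and by the available warp slots
    resident_blocks = min(max_blocks_per_sm, max_warps_per_sm // warps_per_block)
    active_warps = resident_blocks * warps_per_block
    return active_warps / max_warps_per_sm


# Assumed limits for illustration: 64 warps and 32 blocks per multiprocessor
print(theoretical_occupancy(6, 64, 32))    # 0.5
print(theoretical_occupancy(256, 64, 32))  # 1.0
```

With only 6 threads per block, each block fills a single warp slot and the block limit is reached before the warp slots are used up, which caps the estimate at 50% under these assumed limits.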

Have a look at the reported metrics with your vector addition kernel:

In [None]:
!./run-exclusive.sh ncu --target-processes=all ./vector_addition/vector_addition 39 6 6

## Matrix multiplication

In this exercise, we will use a canonical example where GPUs can make a difference. We will look at **matrix multiplication**, concretely at the case of two square matrices to make life easier. For matrices A and B, every element of the result matrix C can be calculated as follows:

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/ee372c649dea0a05bf1ace77c9d6faf051d9cc8d)

This is an inherently parallel problem, since all elements in the matrix C can be calculated independently at the same time. In a first parallel implementation, the idea would be to assign each thread in a block to process a different element of the result matrix C.

![Matrix multiplication](https://upload.wikimedia.org/wikipedia/commons/e/eb/Matrix_multiplication_diagram_2.svg)

Since we are going to focus on performance, it is better to define the matrices A, B and C as one-dimensional (linear) arrays representing 2D matrices. There are two main methods of laying out a matrix in such an array:

![Matrix representations](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Row_and_column_major_order.svg/340px-Row_and_column_major_order.svg.png)

The format used for this exercise will be **row-major order**.
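
In row-major order, the element in row `row` and column `col` of a matrix with `width` columns sits at position `row * width + col` of the linear array. The following Python sketch uses this index in a sequential square matrix multiplication. The helper names are made up for illustration and the code is not taken from the exercise files.

```python
def row_major_index(row, col, width):
    """Position of element (row, col) in a row-major linear array."""
    return row * width + col


def matrix_multiply(a, b, size):
    """Multiply two size x size matrices stored as row-major linear lists."""
    c = [0.0] * (size * size)
    for row in range(size):
        for col in range(size):
            total = 0.0
            for k in range(size):
                total += a[row_major_index(row, k, size)] * b[row_major_index(k, col, size)]
            c[row_major_index(row, col, size)] = total
    return c


a = [1, 2,
     3, 4]
b = [5, 6,
     7, 8]
print(matrix_multiply(a, b, 2))  # [19.0, 22.0, 43.0, 50.0]
```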

### Matrix multiplication with threads in a block

In file `matrix_multiplication/matrix_multiply.cu` you will find a first implementation of square matrix multiplication. This version of matrix multiplication runs on the GPU, but it runs sequentially with a single block and a single thread.

When compiled, the generated program accepts a single argument: the size of the matrices, which is both the width and the height of the matrices involved in the multiplication. The time it took to run is also recorded and shown at the end of the run. In addition, after performing the multiplication, a submatrix of size 64x64 is also computed on the CPU and compared with the GPU result to check for errors.

Test that the program compiles and runs. The command below executes the matrix multiply of 512x512 matrices and checks the results.

In [None]:
!nvcc -arch=sm_70 -o matrix_multiplication/matrix_multiply matrix_multiplication/matrix_multiply.cu

In [None]:
!PATH=/cvmfs/sft.cern.ch/lcg/releases/cuda/11.4/x86_64-centos7-gcc10-opt/bin:$PATH ./run-exclusive.sh ./matrix_multiplication/matrix_multiply 512

It is time to parallelize using threads in a block. Since the work is conceptually done over 2D, you may use the capability of defining the block dimension in 2D by setting `dim3 block (n_threads, n_threads);`. Bear in mind that there is a limit on the number of threads you may use per block, calculated by multiplying the block dimensions, which on most GPUs is 1024 threads per block in total. 

* Parallelize the work by using `threadIdx.x` and `threadIdx.y` to iterate over the first two for-loops in the kernel.
* Keep the grid dimension to be 1 for now, and optimize the block dimension used to invoke your function.

Use the file `matrix_multiplication/matrix_multiply_threads.cu` to write your code. You may look into the `matrix_multiplication/matrix_multiply_threads_solution.cu` if you get stuck.

In [None]:
!nvcc -arch=sm_70 -o matrix_multiplication/matrix_multiply_threads matrix_multiplication/matrix_multiply_threads.cu

In [None]:
!PATH=/cvmfs/sft.cern.ch/lcg/releases/cuda/11.4/x86_64-centos7-gcc10-opt/bin:$PATH ./run-exclusive.sh ./matrix_multiplication/matrix_multiply_threads 512

It is also relevant to look at the profile information as we are optimizing the application. Look at the various statistics reported by the `nsys` and `ncu` applications.

You may notice that `ncu` is configured with an argument `128` (instead of `512`). Given that the application is run several times with this tool, higher values would take considerably longer.

Record also the kernel time somewhere: you will be optimizing this application and will want to know how much faster you can make it.

In [None]:
!PATH=/cvmfs/sft.cern.ch/lcg/releases/cuda/11.4/x86_64-centos7-gcc10-opt/bin:$PATH ./run-exclusive.sh nsys profile --stats=true ./matrix_multiplication/matrix_multiply_threads 512

In [None]:
!./run-exclusive.sh ncu --target-processes=all ./matrix_multiplication/matrix_multiply_threads 128

### Multiple threads and blocks

Now add more parallelism by using many blocks in a grid. You should use as many blocks as needed so that every thread is tasked with calculating a single element of the result matrix.

* The number of blocks should be defined as a 2D grid.
* The number of blocks should depend on the matrix `size` and the `number of threads`.
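
When the matrix size is not a multiple of the number of threads per block dimension, the number of blocks has to be rounded up and the surplus threads must skip their work. The following Python sketch checks this arithmetic on the CPU. The function names are made up for illustration, and the nested loops only emulate a 2D grid of 2D blocks.

```python
def grid_dimension(size, n_threads):
    """Number of blocks per grid dimension, rounded up so that all elements are covered."""
    return (size + n_threads - 1) // n_threads


def covered_elements(size, n_threads):
    """Emulate a 2D launch and collect the matrix elements that pass the bounds check."""
    n_blocks = grid_dimension(size, n_threads)
    covered = set()
    for block_y in range(n_blocks):
        for block_x in range(n_blocks):
            for thread_y in range(n_threads):
                for thread_x in range(n_threads):
                    row = block_y * n_threads + thread_y
                    col = block_x * n_threads + thread_x
                    # Threads beyond the matrix edge must not touch memory
                    if row < size and col < size:
                        covered.add((row, col))
    return n_blocks, covered


print(grid_dimension(512, 32))          # 16
n_blocks, covered = covered_elements(10, 4)
print(n_blocks, len(covered))           # 3 100
```

For a 10x10 matrix and 4x4 threads per block, 3x3 blocks launch 144 threads in total, of which 100 pass the bounds check and the remaining 44 do nothing.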

Write your solution in file `matrix_multiplication/matrix_multiply_grid.cu`. Here is the `matrix_multiplication/matrix_multiply_grid_solution.cu` in case you want to have a look.

In [None]:
!nvcc -arch=sm_70 -o matrix_multiplication/matrix_multiply_grid matrix_multiplication/matrix_multiply_grid.cu

In [None]:
!PATH=/cvmfs/sft.cern.ch/lcg/releases/cuda/11.4/x86_64-centos7-gcc10-opt/bin:$PATH ./run-exclusive.sh ./matrix_multiplication/matrix_multiply_grid 512

In [None]:
!PATH=/cvmfs/sft.cern.ch/lcg/releases/cuda/11.4/x86_64-centos7-gcc10-opt/bin:$PATH ./run-exclusive.sh nsys profile --stats=true ./matrix_multiplication/matrix_multiply_grid 512

In [None]:
!./run-exclusive.sh ncu --target-processes=all ./matrix_multiplication/matrix_multiply_grid 128

### Shared memory

The final step consists in using **shared memory** as an intermediate buffer where data will be available by using a **tiling** method. As you may recall from the lectures, you should follow these steps:

* Load the tile from global into shared memory in a coalesced manner.
* Synchronize.
* Have multiple threads access the data from the shared buffer.
* Synchronize.
* Move on to the next tile.

You will have to define a `tile size` at compile time in order to be able to define the size of the shared memory array. You may use the expression `constexpr int tile_size = 32;` as a starting point at the top of your program.

All threads should participate in loading each tile into memory, calculate a partial result in a register, and then move on to the next tile. Visually:

![pr](https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/matrix-multiplication-with-shared-memory.png)
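
The tiling steps listed above can be emulated on the CPU to check the index arithmetic before writing the kernel. The following Python sketch is such an emulation under assumptions: the function name is made up, the lists `tile_a` and `tile_b` stand in for the shared memory arrays, `acc` stands in for the per-thread registers, and the comments mark where a kernel would synchronize. It is not the provided solution.

```python
def tiled_matrix_multiply(a, b, size, tile_size):
    """Emulate tiled multiplication of size x size row-major matrices a and b."""
    c = [0.0] * (size * size)
    n_tiles = (size + tile_size - 1) // tile_size
    for block_row in range(n_tiles):
        for block_col in range(n_tiles):
            # One accumulator per thread of the block (a register on the GPU)
            acc = [[0.0] * tile_size for _ in range(tile_size)]
            for t in range(n_tiles):
                # Load one tile of a and one tile of b, padding with zeros at the edges
                tile_a = [[0.0] * tile_size for _ in range(tile_size)]
                tile_b = [[0.0] * tile_size for _ in range(tile_size)]
                for ty in range(tile_size):
                    for tx in range(tile_size):
                        row_a, col_a = block_row * tile_size + ty, t * tile_size + tx
                        if row_a < size and col_a < size:
                            tile_a[ty][tx] = a[row_a * size + col_a]
                        row_b, col_b = t * tile_size + ty, block_col * tile_size + tx
                        if row_b < size and col_b < size:
                            tile_b[ty][tx] = b[row_b * size + col_b]
                # (a kernel would synchronize here: the tiles must be fully loaded)
                for ty in range(tile_size):
                    for tx in range(tile_size):
                        for k in range(tile_size):
                            acc[ty][tx] += tile_a[ty][k] * tile_b[k][tx]
                # (a kernel would synchronize here before overwriting the tiles)
            # Write the finished partial sums to the result matrix
            for ty in range(tile_size):
                for tx in range(tile_size):
                    row, col = block_row * tile_size + ty, block_col * tile_size + tx
                    if row < size and col < size:
                        c[row * size + col] = acc[ty][tx]
    return c


print(tiled_matrix_multiply([1, 2, 3, 4], [5, 6, 7, 8], 2, 2))  # [19.0, 22.0, 43.0, 50.0]
```

The zero padding lets the sketch handle matrix sizes that are not a multiple of the tile size. Each loaded value is reused by `tile_size` threads, which is where the saving in global memory traffic comes from.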

Use file `matrix_multiplication/matrix_multiply_shared.cu` to write your answer. If you get stuck, you may have a look at `matrix_multiplication/matrix_multiply_shared_solution.cu`.

In [None]:
!nvcc -arch=sm_70 -o matrix_multiplication/matrix_multiply_shared matrix_multiplication/matrix_multiply_shared.cu

In [None]:
!PATH=/cvmfs/sft.cern.ch/lcg/releases/cuda/11.4/x86_64-centos7-gcc10-opt/bin:$PATH ./run-exclusive.sh ./matrix_multiplication/matrix_multiply_shared 512

In [None]:
!PATH=/cvmfs/sft.cern.ch/lcg/releases/cuda/11.4/x86_64-centos7-gcc10-opt/bin:$PATH ./run-exclusive.sh nsys profile --stats=true ./matrix_multiplication/matrix_multiply_shared 512

In [None]:
!./run-exclusive.sh ncu --target-processes=all ./matrix_multiplication/matrix_multiply_shared 128

## Optional advanced exercise: Kalman filter for track fitting of LHCb data

We provide here an additional exercise that you are welcome to try in case you finish the other ones very quickly or if you already had basic knowledge of CUDA programming and wanted to skip the first very basic exercises.

It consists of fitting tracks in the VELO tracking detector of the LHCb experiment with a Kalman filter - so it is a more practical HEP example. There is no magnetic field in the VELO detector, so particle trajectories are straight lines. The VELO is a silicon pixel detector providing 3D measurement points in the global coordinate system of LHCb.

In this exercise we use CMake for compilation and build libraries from CUDA code, so if you are interested in some of the implementation specifics, feel free to look at the source code for inspiration. A Kalman filter is one method for track fitting, i.e. it is applied after the pattern recognition step. This means that the hits originating from the same particle have already been identified and we now want to describe the trajectory such that we can extrapolate, for example, to other parts of the detector. The Kalman filter iterates over the hits of a track one after the other. For every hit, it estimates the state (i.e. direction and uncertainty) of the track at that point. With the addition of every hit the estimate becomes more precise, until the best linear estimator is achieved at the last hit. 
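
To illustrate the predict and update cycle, here is a minimal Python sketch of a Kalman filter for a straight line in one projection, with state (x, tx = dx/dz). It is a textbook-style simplification under assumptions and not the implementation provided in the exercise: there is no noise term for scattering, the measurement variance and the initial slope variance are made-up values, and all names are hypothetical.

```python
def fit_straight_line(zs, xs, measurement_variance=0.01, initial_slope_variance=1.0e6):
    """Filter hits (z, x) one by one; return (x, tx, covariance) at the last hit."""
    # Seed the state with the first hit; the slope is unknown, hence the large variance
    x, tx = xs[0], 0.0
    c00, c01, c11 = measurement_variance, 0.0, initial_slope_variance
    z = zs[0]
    for z_hit, x_hit in zip(zs[1:], xs[1:]):
        dz = z_hit - z
        # Predict: extrapolate state and covariance along the straight line
        x_pred = x + tx * dz
        p00 = c00 + 2.0 * dz * c01 + dz * dz * c11
        p01 = c01 + dz * c11
        p11 = c11
        # Update: weigh the prediction against the measured x position
        residual = x_hit - x_pred
        s = p00 + measurement_variance
        k0, k1 = p00 / s, p01 / s          # Kalman gain
        x = x_pred + k0 * residual
        tx = tx + k1 * residual
        c00 = (1.0 - k0) * p00
        c01 = (1.0 - k0) * p01
        c11 = p11 - k1 * p01
        z = z_hit
    return x, tx, (c00, c01, c11)


# Hits lying exactly on the line x = 1 + 0.5 * z
zs = [0.0, 1.0, 2.0, 3.0]
xs = [1.0 + 0.5 * z for z in zs]
x, tx, covariance = fit_straight_line(zs, xs)
print(round(x, 3), round(tx, 3))  # 2.5 0.5
```

Each hit only needs the state and covariance left by the previous hit, so the hits of one track are processed in sequence, while different tracks are independent of each other.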

For this exercise, we provide you with a collection of hits making up the tracks. They are stored in two containers: `hits` contains all space points present on all tracks. `tracks` contains indices to the hits of particular tracks. We also provide the code that runs the Kalman filter on both the CPU and the GPU. The GPU code is however not yet parallelized and no optimizations have been performed. 

Your task is to parallelize the GPU function and then to optimize step by step the performance.

Open the files referenced in every part of the exercise in an editor tab by clicking on them in your clone of the exercise repository!

### 1. Compile code
Execute the following cells to import some tools and set a few environment variables.

In [None]:
import requests
import shutil
import os
import sys
import importlib

In [None]:
pwd = os.path.realpath('.')
install_prefix = os.path.join(pwd, 'kalman_filter')
data_directory = os.path.join(install_prefix, 'data')
binary_prefix = os.path.join(install_prefix, 'kalman_binary')

In [None]:
def build_binary():
    pwd = os.path.realpath('.')
    build_dir = os.path.join(install_prefix, 'build')
    !mkdir -p $build_dir
    os.chdir(build_dir)
    !cmake -Wno-dev -DCMAKE_INSTALL_PREFIX=$binary_prefix ..
    !make install
    os.chdir(pwd)

In [None]:
build_binary()

### 2. Download input data

Execute the following cells to download the input data (5000 simulated events), i.e. the `hits` and `tracks` directories. They each contain one file per proton-proton collision event with the information about the hits (i.e. 3D space points) and the tracks (i.e. which hits make up the track). Note that unpacking the hits and tracks data will take a little while.

The hits and tracks in the files will be read by the program and stored in the following way:
The hits are stored in AoS containers as shown in the picture. The offsets to the first hit of every event are also stored.
<img src="kalman_filter/figures/hits_SoA.png">
The hit structure is defined like this. For our purpose, only the x, y and z position are of interest.
<img src="kalman_filter/figures/hit_struct.png">
The tracks are stored in structs specifying the number of hits on the track and the indices to the hits of this track within the hits array.
<img src="kalman_filter/figures/track_struct.png">
The output of the Kalman filter will be the track state closest to the beam line. It is stored in a structure specifying the x, y, z position closest to the beam line as well as the slopes tx = dx/dz and ty = dy/dz of the straight line and the covariance elements computed with the Kalman filter (c00, c11, etc.).
<img src="kalman_filter/figures/state_struct.png">
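
The following Python sketch illustrates how a flat hits array, per-event offsets and per-track hit indices fit together, and how the same hits look in a structure-of-arrays (SoA) layout. It is only a toy model under assumptions: the field names, the extra final entry in `event_offsets` and the convention that track indices refer to the flat hits array are all made up for this sketch and may differ from the structs shown in the figures.

```python
# Array of structures (AoS): one (x, y, z) record per hit, all events back to back
hits = [
    (0.0, 0.0, 10.0), (0.1, 0.2, 20.0), (0.2, 0.4, 30.0),  # event 0
    (1.0, 1.0, 10.0), (1.5, 1.5, 20.0),                    # event 1
]
# Offset of the first hit of every event, plus the total number of hits (assumed)
event_offsets = [0, 3, 5]
# Each track knows how many hits it has and where they sit in the hits array
tracks = [
    {"n_hits": 3, "hit_indices": [0, 1, 2]},
    {"n_hits": 2, "hit_indices": [3, 4]},
]


def hits_of_event(event):
    """Slice the flat hits array with the event offsets."""
    return hits[event_offsets[event]:event_offsets[event + 1]]


def hits_of_track(track):
    """Look up the hits of one track through its index list."""
    return [hits[i] for i in track["hit_indices"][:track["n_hits"]]]


def to_soa(aos_hits):
    """Convert to a structure of arrays: one contiguous list per coordinate."""
    return {
        "x": [h[0] for h in aos_hits],
        "y": [h[1] for h in aos_hits],
        "z": [h[2] for h in aos_hits],
    }


print(hits_of_event(1))     # [(1.0, 1.0, 10.0), (1.5, 1.5, 20.0)]
print(to_soa(hits)["z"])    # [10.0, 20.0, 30.0, 10.0, 20.0]
```

In the SoA form, neighbouring threads that read the same coordinate of neighbouring hits access neighbouring memory locations, which is the motivation for the `SoA` part of the exercise.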



In [None]:
def download_file(url, local_filename=None):
    if local_filename is None:
        local_filename = url.split('/')[-1]
    with requests.get(url, stream=True) as r:
        with open(local_filename, 'wb') as f:
            shutil.copyfileobj(r.raw, f)

    return local_filename

In [None]:
def download_data():
    pwd = os.path.realpath('.')
    !mkdir -p $data_directory
    os.chdir(data_directory)
    # Skip the download if the hits archive is already present
    if not os.path.exists(os.path.join(data_directory, 'hits.tar.gz')):
        !wget "https://cernbox.cern.ch/index.php/s/2Rn3m7ESbKaZFAa/download" -O hits.tar.gz
        !wget "https://cernbox.cern.ch/index.php/s/lXm6eKC5Un8PgUz/download" -O tracks.tar.gz
        !tar zxf hits.tar.gz
        !tar zxf tracks.tar.gz
    os.chdir(pwd)

In [None]:
download_data()

### 3. Run and understand sequential implementation



Start by taking a look at the initial implementation located in the `start` directory. The main executable is defined in `src/kalman_filter.cu`. It takes care of memory (de-)allocations, memory transfers and calls the CPU and GPU functions. The time of the GPU execution is measured using the `std::chrono` library and the results of the computations on the CPU and the GPU are compared in the end. 
The CPU function to call the Kalman filter for all events is also defined in `src/kalman_filter.cu`. The GPU function (`__global__`) and the combined `__host__ __device__` function to perform the actual fit are defined in `src/kalman_filter_impl.cu`. The definitions of `Hit`, `Track` etc. are in `definitions.cuh` and the functions needed to read in the hits and tracks are in `utils.cu`.

The binary we compiled before is configured to take six input arguments: 
    
    part (string): one of ["start", "SoA", "pinned_host_memory", "streams"]
    number of events (int)
    data location (string): 'kalman_filter/data/'
    number of repetitions
    device ID (int): leave at 0
    number of CUDA streams: only has an effect in the "streams" part of the exercise

Choose one of `start`, `SoA`, `pinned_host_memory` or `streams` depending on which part of the exercise you are working on. Note that whenever you make changes in the source code you have to re-run the `build_binary()` command. 

Let's start by executing the initial version of the code located in the `start` directory:

In [None]:
!./run-exclusive.sh $binary_prefix/bin/KalmanFilter start 10 $data_directory 10 0 0

This executed the Kalman filter application defined in `start` over 10 events, using as input the data in the `$data_directory`, looping over the same 10 events 10 times (number of repetitions), using device 0 and using one CUDA stream. Note that the number of CUDA streams will only be relevant for a later part of the exercise. 

### 4. Run the Kalman filter on the GPU in parallel

The current code in the `start` directory executes the loop over all events and over every track in one event sequentially on the GPU. Think about how to parallelize the problem and run the Kalman filter on the GPU in parallel. For this, you should modify `start/src/kalman_filter.cu` and `start/src/kalman_filter_impl.cu`.

Test different numbers of threads per block and blocks per grid and study the impact on the speed of your code. Run many repetitions (hundreds); this makes the time measurement more reliable. 

In [None]:
build_binary()

In [None]:
!./run-exclusive.sh $binary_prefix/bin/KalmanFilter start 10 $data_directory 10 0 0

### 5. Double versus single precision

In `start/include/definitions.cuh`, a switch between single and double precision is implemented. It works by using a global typedef: `typedef float my_float_t` or `typedef double my_float_t`. All floating point variables necessary for the Kalman filter are implemented as `my_float_t`. By changing the definition from double to float you can test the difference. Do you observe a difference in execution speed and accuracy? Think about which types of calculations can require double precision rather than single precision.
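
To get a feeling for the difference in precision without recompiling, the following Python sketch emulates single precision by rounding every intermediate result to a 32-bit float with the standard `struct` module. It is a CPU-side illustration only and says nothing about the speed difference on the GPU; the helper name `to_single` is made up.

```python
import struct


def to_single(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Single precision has a 24-bit significand: 2**24 + 1 is not representable
print(to_single(16777217.0))  # 16777216.0

# Accumulating many small numbers: rounding errors add up much faster in single precision
total_single = 0.0
total_double = 0.0
for _ in range(100000):
    total_single = to_single(total_single + to_single(0.1))
    total_double = total_double + 0.1
print(abs(total_single - 10000.0), abs(total_double - 10000.0))
```

Long accumulations and differences of nearly equal numbers are typical places where the reduced precision becomes visible.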

### 6. Multiple streams

For this exercise, start from the code located in the directory `streams`. The code structure is similar to that in the `start` directory.

Now use several CUDA streams to hide the latency of copying data to the device. For simplicity, we copy the same input data to the device for every stream, so the same calculation is repeated several times. In an actual setup, one would run over different data sets. There is already one device array per stream and one output states array per stream defined in the code. Include all the code that is relevant for stream processing, marked by `to do`!

Remember that a CUDA stream is created with 
```
cudaStream_t stream;
cudaStreamCreate(&stream);
```
and passed to the kernel as the fourth launch parameter, after the grid dimensions, the block dimensions and the size of dynamically allocated shared memory (0 if none is used), like so:
```
my_kernel<<<10, 10, 0, stream>>>(arguments);
```
Memory copies on specific streams use a syntax similar to `cudaMemcpy`, where `a_d` and `a_h` are pointers to device and host memory:
```
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream);
```

A stream is destroyed with 
```
cudaStreamDestroy(stream);
```
Note that there is a specific synchronization function that blocks until all work on one specific stream has finished. This means one does not have to use `cudaDeviceSynchronize()`, which would wait for all streams to finish (all GPU work to be done), but rather:
```
cudaStreamSynchronize(stream[i_stream]);
```

In [None]:
build_binary()

In [None]:
!./run-exclusive.sh $binary_prefix/bin/KalmanFilter streams 10 $data_directory 10 0 0

What is your observation with respect to speed? 

### 7. Pinned host memory

Memory copies to the GPU are executed from page-locked memory via direct memory access (DMA). Memory allocated with `malloc` or `new` is not by default page-locked memory, but instead could be swapped out to an external disk for example. When copying data from the host to the GPU, it is therefore first copied to page-locked memory and only then to the GPU. This means that two copies are actually taking place. 

In CUDA, we can directly allocate page-locked (pinned) host memory to store host variables. This avoids the additional copy. However, only a limited amount of memory should be allocated this way, since page-locked memory is no longer available to the operating system for paging. The syntax is as follows:
```
cudaMallocHost((void**) &h_a, size_in_bytes);
cudaFreeHost(h_a);
```

Think about which memory arrays would make sense to store in pinned host memory rather than host memory allocated with `new`. Start from the code in the directory `pinned_host_memory` and implement the memory allocation of the host arrays with pinned memory instead of `new`. 

In [None]:
build_binary()

In [None]:
!./run-exclusive.sh $binary_prefix/bin/KalmanFilter pinned_host_memory 10 $data_directory 10 0 0

What do you infer from the speed difference between this last implementation and the previous ones? What does this mean for executing this particular code on the GPU? Does it make sense on its own?