# Welcome to the GPU exercises of the thematic CERN School of Computing!

Please follow the steps in this notebook. All underlying code can be found in the directories of the project. Click on the links in the readme to open the files for the respective exercises in an editor in a new browser window. You might have to scroll with the mouse or click once for all code to be visible. You can obtain syntax highlighting for .cu files by choosing `LANGUAGE` -> `C++` from the menu. 

You can execute the code in the code execution cells below by pressing `SHIFT` + `ENTER` or by clicking on the triangle ("run") symbol in the tool bar at the top. The output will appear just below the cell. Sometimes the execution can take a little while, so be a bit patient. The process is still working if the prompt looks like this: `[*]`. Note that compilation commands do not produce output when running successfully. 

## Exploring the GPU status

Let us first explore the GPU available for you in this lab environment. It is an accelerated system, containing a GPU assigned to you (and a fellow student). `nvidia-smi` (Systems Management Interface) is a utility shipped with CUDA that monitors the processes running on the GPU and provides some information. It is often useful to check whether another user is using the same GPU.

*Sidenote: `nvidia-smi` also tells you the exact driver and CUDA version. This can be useful information if after a new installation / an update there is a mismatch between the CUDA and driver versions (nothing to be worried about for this lab).*

Note down the GPU name, memory available and whether a process is currently running on the GPU! 

In [None]:
!nvidia-smi

## Compile a CUDA program

Let us now compile and run a small program, `device_properties.cu` (*<---- open this file for editing from the link given in the readme*), which will give more detailed information about the GPU(s) available. This program does not do any calculations on the GPU; it simply queries information about it and shows you which function calls are available for that. 

`.cu` is the extension for CUDA accelerated files. To compile a CUDA file, we use the `nvcc` compiler, which compiles both the host and the device sections of the code. Its usage is very similar to `gcc`. Let's take a closer look at the following command:
- `nvcc` invokes the compiler from the command line
- `-arch` indicates the GPU architecture for which the file is compiled, consisting of a major number followed by a minor number. The lab assumes A100 hardware, so this is `sm_80`. If you are running on older Tesla T4 GPUs, use `sm_75` instead. For more information on the architecture, please refer to the [CUDA documentation](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-steering-gpu-code-generation).
- `-o` specifies the output file (i.e. the compiled program) 
- `device_properties/device_properties.cu` is the file to compile

In [None]:
!nvcc -arch=sm_80 -o device_properties/device_properties device_properties/device_properties.cu 

## Exploring some GPU characteristics

After successfully compiling your first CUDA program, we can now execute it. This is done by calling the output file produced above, i.e. `./device_properties/device_properties`.

In [None]:
!./device_properties/device_properties

Note down the following: How large can a grid be maximally on this GPU? How much global and shared memory is available? What are the restrictions on the block size? How large is the warp size? How many Streaming Multiprocessors are there on the GPU? Can you infer from the information that the compute architecture of this GPU is indeed `sm_80`?

*Side note for when you work on any CUDA server: A similar program is available in the CUDA samples. You can find them in the directory `cuda-samples` of your CUDA installation directory (typically `/usr/local/cuda`); recent CUDA versions no longer ship them with the toolkit and provide them in the `NVIDIA/cuda-samples` repository on GitHub instead. The programs in `1_Utilities` can be useful. In particular, `1_Utilities/deviceQuery` gives similar (and more) information compared to the `device_properties` program we provide here.*
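The questions above map onto fields of the `cudaDeviceProp` structure of the CUDA runtime API. The following is a minimal sketch of such a query, not the lab's `device_properties.cu`; the helper name `print_device_summary` is made up for this illustration.

```c++
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print a few fields of cudaDeviceProp for one device.
// Returns true if the query succeeded.
bool print_device_summary(int device) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
  printf("Device %d: %s\n", device, prop.name);
  printf("  compute capability:      %d.%d\n", prop.major, prop.minor);
  printf("  global memory:           %zu bytes\n", prop.totalGlobalMem);
  printf("  shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
  printf("  warp size:               %d\n", prop.warpSize);
  printf("  max threads per block:   %d\n", prop.maxThreadsPerBlock);
  printf("  max block dimensions:    %d x %d x %d\n", prop.maxThreadsDim[0],
         prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
  printf("  max grid dimensions:     %d x %d x %d\n", prop.maxGridSize[0],
         prop.maxGridSize[1], prop.maxGridSize[2]);
  printf("  multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
  return true;
}

int main() {
  int n_devices = 0;
  if (cudaGetDeviceCount(&n_devices) != cudaSuccess || n_devices == 0) {
    printf("No CUDA device found\n");
    return 1;
  }
  for (int d = 0; d < n_devices; ++d) print_device_summary(d);
  return 0;
}
```

The compute capability printed as `major.minor` corresponds to the `sm_<major><minor>` value passed to `-arch`.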

## Hello World from the GPU

We are now ready to write our first own function for GPUs, starting off from the `hello_world/hello_world.cu` code (click on link in readme). This code comes with a CPU function and already compiles, but only calls the CPU version of the function. Your task is to modify the code to also provide a working GPU function and to invoke it.

The idea is to print a message from every invoked kernel instance to the terminal. For this, modify the GPU function such that it is actually executed on the GPU and invoke the GPU function. Open the file in a separate tab in an editor, modify it according to the instructions given in the file (marked with *to do*) and remember to save your changes. Then you can compile and run it with the commands given below.

Remember that a GPU kernel is launched with the following syntax: `someKernel<<<number_of_blocks, number_of_threads>>>()`. `someKernel` is then executed for every thread in every block, so `number_of_blocks` * `number_of_threads` times. `someKernel<<<1, 1>>>()` launches only one instance: one thread in one block. `someKernel<<<1, 32>>>()` launches 32 instances: 32 threads in one block. `someKernel<<<2, 32>>>()` launches 64 instances: two blocks with 32 threads each.

Each thread is identified by an index, starting from 0, and each block is identified by an index starting from 0. 
To identify within the CUDA kernel code which instance of the kernel is processed, the pre-defined variables `blockIdx.x` and `threadIdx.x` are available to identify the index of the block and the index of the thread within the block. Note their usage in our `hello_world_gpu` function. 
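As an illustration of the launch syntax and the index variables, here is a small sketch that combines `blockIdx.x`, `blockDim.x` and `threadIdx.x` into a unique global index. The kernel `print_ids` and the helper `launch_and_wait` are hypothetical names and not part of the lab code.

```c++
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not the lab's hello_world_gpu): each instance prints
// its block index, its thread index and the resulting global index.
__global__ void print_ids() {
  int global_id = blockIdx.x * blockDim.x + threadIdx.x;
  printf("block %u, thread %u -> global id %d\n", blockIdx.x, threadIdx.x,
         global_id);
}

// Launches n_blocks * n_threads instances and waits for them to finish.
cudaError_t launch_and_wait(int n_blocks, int n_threads) {
  print_ids<<<n_blocks, n_threads>>>();
  // Kernel launches are asynchronous: wait for the GPU before returning.
  // This also flushes the device-side printf buffer.
  return cudaDeviceSynchronize();
}

int main() {
  // 2 blocks * 4 threads = 8 instances, so 8 lines are printed.
  cudaError_t err = launch_and_wait(2, 4);
  if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }
  return 0;
}
```

The order of the printed lines is not guaranteed, since blocks and warps are scheduled independently.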

If you are stuck or would like to have some inspiration, you can take a look at the [solution](../../../edit/SWAN_projects/gpulab-tcsc/hello_world/hello_world_solution.cu).

In [None]:
!nvcc -arch=sm_80 -o hello_world/hello_world hello_world/hello_world.cu

In [None]:
!./hello_world/hello_world 1 1

Congratulations! You just processed your first function on a GPU! 

Let's explore a bit deeper. The program takes the following input parameters: the number of threads per block and the number of blocks in the grid (both set to one in the above program call). Try experimenting with different settings! In particular, try at least 64 or 96 threads per block. What pattern do you observe in the printout? What could it be due to?

In [None]:
!./hello_world/hello_world 3 64

## Vector addition
We are now ready to move on to an exercise where data is copied to and from the GPU and calculations are executed on the GPU: a vector addition. Start from the code provided in `vector_addition/vector_addition.cu` and follow the instructions below and in the code. 
The provided code allocates memory on the host and runs the vector addition on the host. 
Our goal now is to allocate the required memory on the GPU, copy the input from the host to the GPU, call the vector addition in parallel on the GPU and copy the result back to the host to check that it is correct.

The initial version of the code compiles and runs. It takes three input parameters: the size of the vectors to be added, the number of blocks in the grid and the number of threads per block. Note that the last two parameters will only be relevant once we parallelize the vector addition on the GPU in Step 2.

After each of the steps below, check that your code compiles and runs!

As before, if you are stuck or need inspiration you can take a look at the [solution](../../../edit/SWAN_projects/gpulab-tcsc/vector_addition/vector_addition_solution.cu).

### Step 1: Allocating memory
To do a vector addition on the GPU, we have to allocate GPU memory (global memory) for the input vectors (called `a` and `b`) and also for the vector in which the result is stored (called `c`). The input vectors have to be copied from host memory to GPU global memory, and in the end the result vector has to be copied from the GPU to the host. 

Note that in the code we label host variables with the suffix `_h` and device variables with the suffix `_d`. This is common practice in CUDA programs to distinguish between pointers to host and device memory. 

Follow the instructions in the code labelled with *Step 1 to do*. There are three places labelled like that to
- Allocate GPU global memory for the three device vectors `a_d`, `b_d` and `c_d`
- Copy the data in the host vectors `a_h`, `b_h`, `c_h` to the device vectors `a_d`, `b_d`, `c_d`
- Free the global memory used for the device vectors `a_d`, `b_d`, `c_d`
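The three operations in the list above correspond to the runtime calls `cudaMalloc`, `cudaMemcpy` and `cudaFree`. The sketch below sends a single vector to the GPU and back without launching any kernel; the names `check`, `round_trip`, `x_h`, `y_h` and `x_d` are assumptions chosen for this illustration and do not appear in the lab code.

```c++
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed CUDA runtime call.
bool check(cudaError_t err, const char *what) {
  if (err != cudaSuccess) {
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
  }
  return true;
}

// Copies n floats host -> device -> host.
// Returns true if all calls succeeded and the data survived the round trip.
bool round_trip(int n) {
  std::vector<float> x_h(n), y_h(n, 0.0f);
  for (int i = 0; i < n; ++i) x_h[i] = static_cast<float>(i);
  const size_t bytes = n * sizeof(float);

  float *x_d = nullptr;  // pointer to device (global) memory
  if (!check(cudaMalloc(&x_d, bytes), "cudaMalloc")) return false;

  // Argument order is (destination, source, size in bytes, direction).
  bool ok = check(cudaMemcpy(x_d, x_h.data(), bytes, cudaMemcpyHostToDevice),
                  "copy host to device") &&
            check(cudaMemcpy(y_h.data(), x_d, bytes, cudaMemcpyDeviceToHost),
                  "copy device to host");
  ok = check(cudaFree(x_d), "cudaFree") && ok;

  for (int i = 0; i < n; ++i) ok = ok && (y_h[i] == x_h[i]);
  return ok;
}

int main() {
  printf("round trip %s\n", round_trip(36) ? "succeeded" : "failed");
  return 0;
}
```

Checking the `cudaError_t` return values is optional for the exercise, but it makes mistakes such as swapped copy directions much easier to spot.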

Test that your code compiles and runs!

In [None]:
!nvcc -arch=sm_80 -o vector_addition/vector_addition vector_addition/vector_addition.cu

In [None]:
!./vector_addition/vector_addition 36 6 6

### Step 2: Vector addition in parallel on the GPU

It is now time to call `vector_addition_gpu` on the GPU and to ensure that the addition is carried out in parallel. For this, follow the instructions labelled with *Step 2 to do* to do the following:
- Label `vector_addition_gpu` with the `__global__` identifier
- Modify the for loop inside `vector_addition_gpu` to be executed in parallel (see explanations below)
- Uncomment the grid dimension variable definitions
- Launch the kernel 

For loops are ideal candidates to be processed in parallel if the iterations do not depend on one another, as is the case in vector addition. The idea is that, instead of running the iterations of the loop sequentially, they are processed in parallel by all available threads. For this, two things must happen: 1) the kernel is written to execute one iteration based on its thread and block index, and 2) we must ensure that all iterations of the for loop are processed, irrespective of how many threads and blocks the kernel was launched with. Note that for this to work you should use the known `threadIdx.x`, `blockIdx.x`, `blockDim.x` and `gridDim.x` variables. 
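One common way to satisfy both requirements is a so-called grid-stride loop. The sketch below applies it to a different operation (scaling a vector by a constant) so that the vector addition itself remains an exercise; `scale_gpu` and `scale_and_check` are hypothetical names.

```c++
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: out[i] = factor * in[i], written as a grid-stride loop.
__global__ void scale_gpu(const float *in, float *out, float factor, int n) {
  int first = blockIdx.x * blockDim.x + threadIdx.x;  // unique per instance
  int stride = blockDim.x * gridDim.x;  // total number of instances launched
  for (int i = first; i < n; i += stride) out[i] = factor * in[i];
}

// Returns the number of wrong elements (0 is expected), or -1 on a CUDA error.
int scale_and_check(int n, int n_blocks, int n_threads) {
  std::vector<float> in_h(n), out_h(n, -1.0f);
  for (int i = 0; i < n; ++i) in_h[i] = static_cast<float>(i);
  const size_t bytes = n * sizeof(float);

  float *in_d = nullptr, *out_d = nullptr;
  cudaMalloc(&in_d, bytes);
  cudaMalloc(&out_d, bytes);
  cudaMemcpy(in_d, in_h.data(), bytes, cudaMemcpyHostToDevice);

  scale_gpu<<<n_blocks, n_threads>>>(in_d, out_d, 2.0f, n);

  // cudaMemcpy waits for the kernel to finish before copying.
  cudaError_t err =
      cudaMemcpy(out_h.data(), out_d, bytes, cudaMemcpyDeviceToHost);
  cudaFree(in_d);
  cudaFree(out_d);
  if (err != cudaSuccess) return -1;

  int wrong = 0;
  for (int i = 0; i < n; ++i)
    if (out_h[i] != 2.0f * in_h[i]) ++wrong;
  return wrong;
}

int main() {
  // 6 * 6 = 36 instances for 39 elements: three instances do a second pass.
  printf("wrong elements: %d\n", scale_and_check(39, 6, 6));
  return 0;
}
```

Because the stride is the total number of launched instances, the loop covers every element exactly once, whether there are more elements than instances or fewer.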

Now test again that your code compiles and runs!

In [None]:
!nvcc -arch=sm_80 -o vector_addition/vector_addition vector_addition/vector_addition.cu

In [None]:
!./vector_addition/vector_addition 36 6 6

### Step 3: Copy and verify result

As a last step, we have to copy back the vector containing the result and verify that the computations executed on the GPU were correct. Follow the instructions labelled with *Step 3 to do* for this. 

- Copy content of `c_d` to `c_h`
- Synchronize to ensure that the GPU work is finished
- Verify the result obtained from the GPU

Now compile and run again to check that your first calculations on a GPU are working!

Play with the number of blocks and threads and test scenarios where the `n_threads` * `n_block` is not equal to the vector size. If this works, you have correctly parallelized your for loop. 

Caveat: The result vector can be correct even though the parallelization is not the intended one. This can happen if you are in fact doing the same work in every block of your grid, which is the case if you did not use the variables `blockDim.x` and `gridDim.x`. Take a look again at your parallelized for loop and modify it such that every block in the grid processes different vector elements from the other blocks. Check again that you can process vector lengths that do not match the number of blocks and threads!

In [None]:
!nvcc -arch=sm_80 -o vector_addition/vector_addition vector_addition/vector_addition.cu

In [None]:
!./vector_addition/vector_addition 39 6 6

## Profiling a CUDA application

The only way to be assured that attempts at optimizing accelerated code bases are actually successful is to profile the application for quantitative information about the application's performance. `nsys` is the Nsight Systems command line tool. It is a powerful tool for profiling accelerated applications.

Its most basic usage is to simply pass it the path to an executable compiled with `nvcc`. `nsys` will proceed to execute the application, after which it will print a summary output of the application's GPU activities, CUDA API calls and so on.

When accelerating applications, or optimizing already-accelerated applications, take a scientific and iterative approach. Profile your application after making changes, take note, and record the implications of any refactoring on performance. Make these observations early and often: frequently, enough performance boost can be gained with little effort such that you can ship your accelerated application. Additionally, frequent profiling will teach you how specific changes to your CUDA codebases impact its actual performance: knowledge that is hard to acquire when only profiling after many kinds of changes in your codebase.

### Exercise: Profile an Application with nsys

`nsys profile` will generate a report file (`.qdrep`, or `.nsys-rep` in newer versions of Nsight Systems) which can be used in a variety of manners. We use the `--stats=true` flag here to indicate we would like summary statistics printed. There is quite a lot of information printed:

- Profile configuration details
- Report file(s) generation details
- **CUDA API Statistics**
- **CUDA Kernel Statistics**
- **CUDA Memory Operation Statistics (time and size)**
- OS Runtime API Statistics

We will be inspecting the 3 sections in **bold** above throughout the following exercises.

After profiling the application, answer the following questions using information displayed in the profiling output:

- What was the name of the only CUDA kernel called in this application?
- How many times did this kernel run?
- How long did it take this kernel to run?

As a first example, you can profile the vector addition application you worked on so far:

In [None]:
!nsys profile --stats=true ./vector_addition/vector_addition 39 6 6

Note that, by default, `nsys profile` will not overwrite an existing report file. This is done to prevent accidental loss of work when profiling. If for any reason you would rather overwrite an existing report file, say during rapid iterations, you can pass `-f true` (`--force-overwrite=true`) to `nsys profile`.

### Exercise: Profile an Application with ncu

Nsight Compute is another profiling tool that provides detailed performance metrics and API debugging. Its command line tool version `ncu` can also be used during this lab.

We will use `ncu` with the `--target-processes=all` option, which profiles the launched application together with all of its child processes. `ncu` replays each kernel a number of times in order to collect the various metrics requested by the user. Even though we will use the default metrics, you may have a look [at other metrics if you want to dig deeper](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvprof-metric-comparison).

By default, the following metric sections are reported:

* GPU Speed of Light (SOL): Reports throughput as the achieved percentage of utilization with respect to the theoretical maximum.
* Launch Statistics: Statistics of the launch of the kernel. The number of threads, shared memory size requested and registers per thread are three very useful indicators that can affect the performance of your application.
* Occupancy: Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps. Another way to view occupancy is the percentage of the hardware's ability to process warps that is actively in use. Higher occupancy does not always result in higher performance; however, low occupancy always reduces the ability to hide latencies, resulting in overall performance degradation. Large discrepancies between the theoretical and the achieved occupancy during execution typically indicate highly imbalanced workloads.

Have a look at the reported metrics with your vector addition kernel:

In [None]:
!ncu --target-processes=all ./vector_addition/vector_addition 39 6 6

## Matrix multiplication

In this exercise, we will use a canonical example where GPUs can make a difference. We will look at **matrix multiplication**, concretely at the case of two square matrices to make life easier. For matrices A and B, every element of the result matrix C can be calculated as follows:

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/ee372c649dea0a05bf1ace77c9d6faf051d9cc8d)

This is an inherently parallel problem, since all elements in the matrix C can be calculated independently at the same time. In a first parallel implementation, the idea would be to assign each thread in a block to process a different element of the result matrix C.

![Matrix multiplication](https://upload.wikimedia.org/wikipedia/commons/e/eb/Matrix_multiplication_diagram_2.svg)

Since we are going to focus on performance, it is better to store matrices A, B and C as linear (1D) arrays that represent 2D matrices. There are two main methods of laying out a matrix in a linear array:

![Matrix representations](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Row_and_column_major_order.svg/340px-Row_and_column_major_order.svg.png)

The format used for this exercise will be **row-major order**.
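In row-major order, element (`row`, `col`) of a matrix with `width` columns is stored at linear position `row * width + col`. A minimal sketch follows; the helper name `row_major_index` is an assumption for this illustration and may differ from what the lab code uses.

```c++
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper, usable from both host and device code:
// linear position of element (row, col) in a row-major matrix.
__host__ __device__ inline int row_major_index(int row, int col, int width) {
  return row * width + col;
}

int main() {
  const int height = 2, width = 3;
  float m[height * width];

  // Fill the element at (row, col) with the value 10 * row + col.
  for (int row = 0; row < height; ++row)
    for (int col = 0; col < width; ++col)
      m[row_major_index(row, col, width)] = 10.0f * row + col;

  // Memory order: all of row 0 first, then all of row 1.
  for (int i = 0; i < height * width; ++i) printf("%g ", m[i]);
  printf("\n");  // prints: 0 1 2 10 11 12
  return 0;
}
```

Neighbouring elements of one row are adjacent in memory, which matters later when threads with consecutive `threadIdx.x` read consecutive columns.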

### Matrix multiplication with threads in a block

In file `matrix_multiplication/matrix_multiply.cu` you will find a first implementation of square matrix multiplication. This version of matrix multiplication runs on the GPU, but it runs sequentially with a single block and a single thread.

When compiled, the generated program accepts a single argument, the size of the matrix, which is the width and height of the matrices involved in the multiplication. The time it took to run is also recorded and printed at the end of the run. In addition, after performing the multiplication, a submatrix of size 64x64 is also computed on the CPU and compared with the GPU result to check for errors.

Test that the program compiles and runs. The command below executes the matrix multiply of 512x512 matrices and checks the results.

In [None]:
!nvcc -arch=sm_80 -o matrix_multiplication/matrix_multiply matrix_multiplication/matrix_multiply.cu

In [None]:
!./matrix_multiplication/matrix_multiply 512

It is time to parallelize using threads in a block. Since the work is conceptually done over 2D, you may use the capability of defining the block dimension in 2D by setting `dim3 block (n_threads, n_threads);`. Bear in mind that there is a limit on the number of threads you may use per block, calculated by multiplying the block dimensions, which on most GPUs is 1024 threads per block in total. 

* Parallelize the work by using `threadIdx.x` and `threadIdx.y` to iterate over the first two for-loops in the kernel.
* Keep the grid dimension to be 1 for now, and optimize the block dimension used to invoke your function.
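To illustrate how a single 2D block can cover a matrix larger than the block itself, here is a sketch that uses a matrix transpose as a stand-in operation, so the multiplication stays an exercise. The names `transpose_one_block` and `transpose_mismatches` are hypothetical.

```c++
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: transpose a size x size row-major matrix with the
// threads of one 2D block. Each thread starts at its own (row, col) and
// strides by the block dimensions until the whole matrix is covered.
__global__ void transpose_one_block(const float *in, float *out, int size) {
  for (int row = threadIdx.y; row < size; row += blockDim.y)
    for (int col = threadIdx.x; col < size; col += blockDim.x)
      out[col * size + row] = in[row * size + col];
}

// Returns the number of mismatching elements (0 is expected), -1 on error.
int transpose_mismatches(int size, int n_threads) {
  const int n = size * size;
  const size_t bytes = n * sizeof(float);
  std::vector<float> in_h(n), out_h(n, -1.0f);
  for (int i = 0; i < n; ++i) in_h[i] = static_cast<float>(i);

  float *in_d = nullptr, *out_d = nullptr;
  cudaMalloc(&in_d, bytes);
  cudaMalloc(&out_d, bytes);
  cudaMemcpy(in_d, in_h.data(), bytes, cudaMemcpyHostToDevice);

  dim3 block(n_threads, n_threads);  // n_threads * n_threads must be <= 1024
  transpose_one_block<<<1, block>>>(in_d, out_d, size);

  cudaError_t err =
      cudaMemcpy(out_h.data(), out_d, bytes, cudaMemcpyDeviceToHost);
  cudaFree(in_d);
  cudaFree(out_d);
  if (err != cudaSuccess) return -1;

  int wrong = 0;
  for (int row = 0; row < size; ++row)
    for (int col = 0; col < size; ++col)
      if (out_h[col * size + row] != in_h[row * size + col]) ++wrong;
  return wrong;
}

int main() {
  // A 50x50 matrix processed by one block of 16x16 threads.
  printf("mismatches: %d\n", transpose_mismatches(50, 16));
  return 0;
}
```

Using `threadIdx.x` for the column index means that threads of the same warp read consecutive addresses of `in`.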

Use the file `matrix_multiplication/matrix_multiply_threads.cu` to write your code. You may look into the `matrix_multiplication/matrix_multiply_threads_solution.cu` if you get stuck.

In [None]:
!nvcc -arch=sm_80 -o matrix_multiplication/matrix_multiply_threads matrix_multiplication/matrix_multiply_threads.cu

In [None]:
!./matrix_multiplication/matrix_multiply_threads 512

It is also relevant to look at the profile information as we are optimizing the application. Look at the various statistics reported by the `nsys` and `ncu` applications.

Record also the kernel time somewhere: you will be optimizing this application and will want to know how much faster you can make it.

In [None]:
!nsys profile --stats=true ./matrix_multiplication/matrix_multiply_threads 512

In [None]:
!ncu --target-processes=all ./matrix_multiplication/matrix_multiply_threads 512

### Multiple threads and blocks

Now add more parallelization by using many blocks in a grid. You should use as many blocks as needed so that every thread is tasked with calculating a single element of the result matrix.

* The number of blocks should be defined as a 2D grid.
* The number of blocks should depend on the matrix `size` and the `number of threads`.
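When the matrix size is not a multiple of the block dimension, the number of blocks has to be rounded up and the surplus threads need a bounds check. The sketch below shows this with a stand-in kernel that only counts how often each element is visited; `count_visits` and `elements_visited_once` are hypothetical names.

```c++
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread handles at most one matrix element and
// records the visit. Threads that fall outside the matrix do nothing.
__global__ void count_visits(int *visits, int size) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < size && col < size) atomicAdd(&visits[row * size + col], 1);
}

// Returns how many elements were visited exactly once (size * size expected),
// or -1 on a CUDA error.
int elements_visited_once(int size, int n_threads) {
  const int n = size * size;
  const size_t bytes = n * sizeof(int);
  int *visits_d = nullptr;
  cudaMalloc(&visits_d, bytes);
  cudaMemset(visits_d, 0, bytes);

  dim3 block(n_threads, n_threads);
  // Integer ceiling division: enough blocks to cover the matrix in x and y.
  dim3 grid((size + block.x - 1) / block.x, (size + block.y - 1) / block.y);
  count_visits<<<grid, block>>>(visits_d, size);

  std::vector<int> visits_h(n, 0);
  cudaError_t err =
      cudaMemcpy(visits_h.data(), visits_d, bytes, cudaMemcpyDeviceToHost);
  cudaFree(visits_d);
  if (err != cudaSuccess) return -1;

  int once = 0;
  for (int i = 0; i < n; ++i)
    if (visits_h[i] == 1) ++once;
  return once;
}

int main() {
  // 100 is not a multiple of 16, so the grid is 7x7 blocks of 16x16 threads.
  printf("visited once: %d of %d\n", elements_visited_once(100, 16), 100 * 100);
  return 0;
}
```

Without the bounds check, the threads of the last row and column of blocks would write outside the matrix.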

Write your solution in file `matrix_multiplication/matrix_multiply_grid.cu`. Here is the `matrix_multiplication/matrix_multiply_grid_solution.cu` in case you want to have a look.

In [None]:
!nvcc -arch=sm_80 -o matrix_multiplication/matrix_multiply_grid matrix_multiplication/matrix_multiply_grid.cu

In [None]:
!./matrix_multiplication/matrix_multiply_grid 512

In [None]:
!nsys profile --stats=true ./matrix_multiplication/matrix_multiply_grid 512

In [None]:
!ncu --target-processes=all ./matrix_multiplication/matrix_multiply_grid 512

### Shared memory

We will now move to using **shared memory** as an intermediate buffer where data will be available by using a **tiling** method. As you may recall from the lectures, you should follow these steps:

1. Load the tile from global into shared memory in a coalesced manner.
2. Synchronize.
3. Have multiple threads access the data from the shared buffer.
4. Synchronize.
5. Move on to the next tile.

You will have to define a `tile size` at compile time in order to be able to define the size of the shared memory array. You may use the expression `constexpr int tile_size = 32;` as a starting point at the top of your program.

All threads of a block should participate in loading each tile into shared memory, calculate a partial result in a register, and then move on to the next tile. Visually:

![pr](figures/matrix-multiplication-with-shared-memory.png)
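The five steps listed above can be tried out on a simpler, one-dimensional problem before applying them to the matrix multiplication. The sketch below computes, for every element `x[i]`, the sum of `|x[i] - x[j]|` over all `j`, loading `x` tile by tile into shared memory. The kernel `pairwise_abs_sum` and the host function `pairwise_mismatches` are hypothetical, and the kernel assumes it is launched with exactly `tile_size` threads per block.

```c++
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int tile_size = 32;

// Hypothetical kernel: out[i] = sum over j of |x[i] - x[j]|.
// Assumes blockDim.x == tile_size.
__global__ void pairwise_abs_sum(const float *x, float *out, int n) {
  __shared__ float tile[tile_size];
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  float xi = (i < n) ? x[i] : 0.0f;
  float acc = 0.0f;  // partial result, kept in a register

  for (int start = 0; start < n; start += tile_size) {
    // 1. Consecutive threads load consecutive elements (coalesced).
    int j = start + threadIdx.x;
    tile[threadIdx.x] = (j < n) ? x[j] : 0.0f;
    // 2. Wait until the whole tile is in shared memory.
    __syncthreads();
    // 3. Every thread reads all valid elements of the tile.
    int valid = (n - start < tile_size) ? (n - start) : tile_size;
    for (int k = 0; k < valid; ++k) acc += fabsf(xi - tile[k]);
    // 4. Wait until all threads are done before the tile is overwritten.
    __syncthreads();
    // 5. The loop moves on to the next tile.
  }
  if (i < n) out[i] = acc;
}

// Returns the number of wrong results for x[i] = i, or -1 on a CUDA error.
int pairwise_mismatches(int n) {
  const size_t bytes = n * sizeof(float);
  std::vector<float> x_h(n), out_h(n, -1.0f);
  for (int i = 0; i < n; ++i) x_h[i] = static_cast<float>(i);

  float *x_d = nullptr, *out_d = nullptr;
  cudaMalloc(&x_d, bytes);
  cudaMalloc(&out_d, bytes);
  cudaMemcpy(x_d, x_h.data(), bytes, cudaMemcpyHostToDevice);
  int n_blocks = (n + tile_size - 1) / tile_size;
  pairwise_abs_sum<<<n_blocks, tile_size>>>(x_d, out_d, n);
  cudaError_t err =
      cudaMemcpy(out_h.data(), out_d, bytes, cudaMemcpyDeviceToHost);
  cudaFree(x_d);
  cudaFree(out_d);
  if (err != cudaSuccess) return -1;

  int wrong = 0;
  for (int i = 0; i < n; ++i) {
    // Closed form for x[i] = i: distances to the left plus to the right.
    float expected = 0.5f * i * (i + 1) + 0.5f * (n - 1 - i) * (n - i);
    if (out_h[i] != expected) ++wrong;
  }
  return wrong;
}

int main() {
  printf("mismatches: %d\n", pairwise_mismatches(100));
  return 0;
}
```

Note that no thread returns early: threads with `i >= n` still help loading tiles and still reach both `__syncthreads()` calls, which is required for the barriers to be well defined.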

Use the file `matrix_multiplication/matrix_multiply_shared.cu` to write your answer. If you get stuck, you may have a look at `matrix_multiplication/matrix_multiply_shared_solution.cu`.

In [None]:
!nvcc -arch=sm_80 -o matrix_multiplication/matrix_multiply_shared matrix_multiplication/matrix_multiply_shared.cu

In [None]:
!./matrix_multiplication/matrix_multiply_shared 512

In [None]:
!nsys profile --stats=true ./matrix_multiplication/matrix_multiply_shared 512

In [None]:
!ncu --target-processes=all ./matrix_multiplication/matrix_multiply_shared 512

### Precision

It is time to address a very significant decision in our program: precision. We have used single precision until now. As you may recall from the lectures, we have a choice to make both for **storage and arithmetic**:

<img src="figures/precision.png" alt="precision" width="800"/>

First we should set our expectations right: *how precise do we want our results to be?* Comparing floating point numbers bit by bit is not a good idea, so the verification in the above programs is done by checking that the difference between numbers is under a certain threshold. [Comparing floating point numbers can get more complex](https://bitbashing.io/comparing-floats.html), but in our case we know the order of magnitude we expect, so this method is acceptable. The threshold is configurable and defaults to a value of `0.01`.

In fact, the above GPU implementations already differ from what the validation code calculates, even though both calculations adhere to the IEEE 754 standard. The difference arises because fused multiply-add (FMA) instructions are enabled by default on the GPU: an FMA rounds only once, after the multiplication and the addition, instead of once after each operation. We can observe the difference by making the threshold more strict: `matrix_multiply_shared_strict_check.cu` does exactly that by setting the threshold to `0.000001`. Run it below to see the differences.

In [None]:
!nvcc -arch=sm_80 -o matrix_multiplication/matrix_multiply_strict_check matrix_multiplication/matrix_multiply_strict_check.cu

In [None]:
!./matrix_multiplication/matrix_multiply_strict_check 512
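The effect of FMA contraction can also be reproduced in isolation. The sketch below evaluates `a * b + c` once with the fused intrinsic `__fmaf_rn` and once with the separately rounded intrinsics `__fmul_rn` and `__fadd_rn`, which the compiler never merges into an FMA. The input values are chosen for this illustration so that the exact product falls halfway between two representable floats; the names `fma_vs_separate` and `compare_roundings` are hypothetical.

```c++
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the same expression with one and with two roundings.
__global__ void fma_vs_separate(float a, float b, float c, float *out) {
  out[0] = __fmaf_rn(a, b, c);             // fused: a * b + c rounded once
  out[1] = __fadd_rn(__fmul_rn(a, b), c);  // product rounded, then the sum
}

// Writes the fused result to results[0] and the separate one to results[1].
// Returns false on a CUDA error.
bool compare_roundings(float *results) {
  // a * a = 1 + 2^-11 + 2^-24 exactly; the last term does not fit in a float
  // and is lost when the product is rounded on its own.
  const float a = 1.0f + ldexpf(1.0f, -12);
  const float c = -(1.0f + ldexpf(1.0f, -11));

  float *out_d = nullptr;
  if (cudaMalloc(&out_d, 2 * sizeof(float)) != cudaSuccess) return false;
  fma_vs_separate<<<1, 1>>>(a, a, c, out_d);
  cudaError_t err =
      cudaMemcpy(results, out_d, 2 * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(out_d);
  return err == cudaSuccess;
}

int main() {
  float results[2] = {0.0f, 0.0f};
  if (!compare_roundings(results)) return 1;
  // The fused result is 2^-24 (about 5.96e-08), the separate result is 0.
  printf("fused = %g, separate = %g\n", results[0], results[1]);
  return 0;
}
```

A single discrepancy of this size is far below either threshold, but a matrix multiplication accumulates many such products per element. If you want to experiment, `nvcc` also accepts `-fmad=false` to disable the contraction for a whole compilation.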

With that in mind, let's explore various arithmetic and storage precision configurations.

Open file `matrix_multiplication/matrix_multiply_precision.cu`. This file contains the solution of `matrix_multiply_shared_solution.cu`, however that currently runs in single precision (`float`), and is not configurable. At the top of the file you will find these two lines:

```c++
using storage_T = float; // Other types to test: half, float, double
using arithmetic_T = float; // Other types to test: half, float, double
```

We want to use these types throughout the program, instead of the hardcoded `float`. Replace `float` with these types where relevant: mallocs and memcpys should use the type `storage_T`, the kernel expects pointers of type `storage_T` both for inputs and for shared memory, and the arithmetic of the matrix element multiplications should use `arithmetic_T`.
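To see why storage and arithmetic precision are worth separating, consider the following sketch, which sums the same `float` data once with a `float` accumulator and once with a `double` accumulator. It is a simplified stand-in for the matrix kernel: `sum_kernel` and `sum_on_gpu` are hypothetical names, the types are passed as template parameters rather than through the `using` declarations, and only `float` and `double` are covered (`half` typically needs explicit conversion functions from `cuda_fp16.h`).

```c++
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: data is read and written as storage_T, but the
// accumulation happens in arithmetic_T. A single thread does all the work.
template <typename storage_T, typename arithmetic_T>
__global__ void sum_kernel(const storage_T *in, storage_T *out, int n) {
  arithmetic_T acc = 0;
  for (int i = 0; i < n; ++i) acc += static_cast<arithmetic_T>(in[i]);
  *out = static_cast<storage_T>(acc);
}

// Copies the input to the GPU, sums it there and returns the result.
template <typename storage_T, typename arithmetic_T>
storage_T sum_on_gpu(const std::vector<storage_T> &in_h) {
  const int n = static_cast<int>(in_h.size());
  storage_T *in_d = nullptr, *out_d = nullptr;
  storage_T result = 0;
  cudaMalloc(&in_d, n * sizeof(storage_T));
  cudaMalloc(&out_d, sizeof(storage_T));
  cudaMemcpy(in_d, in_h.data(), n * sizeof(storage_T),
             cudaMemcpyHostToDevice);
  sum_kernel<storage_T, arithmetic_T><<<1, 1>>>(in_d, out_d, n);
  cudaMemcpy(&result, out_d, sizeof(storage_T), cudaMemcpyDeviceToHost);
  cudaFree(in_d);
  cudaFree(out_d);
  return result;
}

int main() {
  // 2^20 copies of 0.1f; the exact sum of the stored values is 2^20 * 0.1f.
  std::vector<float> values(1 << 20, 0.1f);
  float narrow = sum_on_gpu<float, float>(values);
  float wide = sum_on_gpu<float, double>(values);
  printf("float accumulator:  %.4f\n", narrow);
  printf("double accumulator: %.4f\n", wide);
  return 0;
}
```

With the `double` accumulator every addition in this example is exact, whereas the `float` accumulator rounds at each step once the running sum is large, so the two printed values differ even though the stored data is identical.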

The validation has been fixed to always use the same matrices with the highest precision possible (`double`) regardless of `storage_T` to make our comparisons relevant. Locate the line in the program where the threshold is configured. You can now use the `threshold` variable to adjust the validation as you see fit:

```c++
  double threshold = 0.01;
```

When you are done, test your program with all five combinations from the table above. What threshold do you need to make each combination pass? Note down the kernel duration of each combination, as well as the threshold needed to make the verification pass.

If you get stuck, feel free to resort to the solution `matrix_multiplication/matrix_multiply_precision_solution.cu`

In [None]:
!nvcc -arch=sm_80 -o matrix_multiplication/matrix_multiply_precision matrix_multiplication/matrix_multiply_precision.cu

In [None]:
!./matrix_multiplication/matrix_multiply_precision 512

In [None]:
!nsys profile --stats=true ./matrix_multiplication/matrix_multiply_precision 512

In [None]:
!ncu --target-processes=all ./matrix_multiplication/matrix_multiply_precision 512

### Tensor Cores

As a last optimisation of our matrix-matrix multiplication we will exploit Tensor Cores.

Tensor Cores are specialized processing units in NVIDIA GPUs that perform matrix arithmetic efficiently and are programmed via a dedicated set of functions. These functions allow you to load or initialize values into the special format required by Tensor Cores, perform matrix multiply-accumulate steps and store the results back to memory. While newer iterations of Tensor Cores have incorporated various ways to deal with data, we will focus on the following specific operation:

<img src="figures/WMMA16x16x16.png" alt="tensorcores" width="800"/>

The operation is a matrix multiply-accumulate of three matrices A, B and C of size 16x16 each. It performs A * B + C and stores the result in D. The matrices must be in a specific floating point format. A and B must be fp16 (`half`) matrices, whereas C and D can either be fp16 or fp32 (`float`) matrices. Tensor Cores operate in mixed precision: input data is stored in fp16, but arithmetic operations are performed in fp32.

Probably the best way to learn how Tensor Cores work is with an example. Here is a kernel that declares A, B and C fragments, initializes C with zeros, loads two input arrays into the A and B fragments, performs the multiplication and stores the output into an output array. This example is [taken from the NVIDIA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma-example):

```c++
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_ker(half *a, half *b, float *c) {
   // Declare the fragments
   wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
   wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
   wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

   // Initialize the output to zero
   wmma::fill_fragment(c_frag, 0.0f);

   // Load the inputs
   wmma::load_matrix_sync(a_frag, a, 16);
   wmma::load_matrix_sync(b_frag, b, 16);

   // Perform the matrix multiplication
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

   // Store the output
   wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

Several things to point out from this example:

* A Tensor Core operation is performed by an entire warp (32 contiguous threads). For instance, if the kernel `wmma_ker` was invoked with 256 threads per block, each of the `wmma::` instructions would be executed 8 times (once per warp). A suggested block dimension configuration is `dim3(32, dim_y, dim_z)`, so that `threadIdx.y` and `threadIdx.z` identify separate warps.
* The format of the matrices (`row_major`, `col_major`) should match the representation of the matrix.
* `load_matrix_sync` and `store_matrix_sync` can specify the stride of the array (16 in the example above).
* The dimensions of the matrices you multiply must be multiples of 16.
* The fragments only need to be defined once and can be reused.
* For more technical details, [refer to the following informative post on Tensor Cores](https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/).
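To make the points above concrete, here is a sketch of a complete program that multiplies one pair of 16x16 matrices on Tensor Cores with a single warp and checks the result on the CPU. The kernel follows the example above with an `fp32` output array; the names `wmma_16x16` and `wmma_mismatches` and the test values are assumptions made for this illustration. It needs to be compiled for `sm_70` or newer and assumes a CUDA version in which `__float2half` can be called from host code.

```c++
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B for 16x16 matrices.
// A is stored column-major, B row-major, C is written row-major.
__global__ void wmma_16x16(const half *a, const half *b, float *c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
  wmma::fill_fragment(c_frag, 0.0f);
  wmma::load_matrix_sync(a_frag, a, 16);
  wmma::load_matrix_sync(b_frag, b, 16);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

// Returns the number of elements that differ from a CPU reference,
// or -1 on a CUDA error.
int wmma_mismatches() {
  const int n = 16;
  float a_f[n * n], b_f[n * n], c_h[n * n];
  half a_h[n * n], b_h[n * n];
  for (int i = 0; i < n * n; ++i) {
    // Small integers are exactly representable in fp16 and fp32.
    a_f[i] = static_cast<float>(i % 7);
    b_f[i] = static_cast<float>(i % 5);
    a_h[i] = __float2half(a_f[i]);
    b_h[i] = __float2half(b_f[i]);
  }

  half *a_d = nullptr, *b_d = nullptr;
  float *c_d = nullptr;
  cudaMalloc(&a_d, n * n * sizeof(half));
  cudaMalloc(&b_d, n * n * sizeof(half));
  cudaMalloc(&c_d, n * n * sizeof(float));
  cudaMemcpy(a_d, a_h, n * n * sizeof(half), cudaMemcpyHostToDevice);
  cudaMemcpy(b_d, b_h, n * n * sizeof(half), cudaMemcpyHostToDevice);

  wmma_16x16<<<1, 32>>>(a_d, b_d, c_d);  // exactly one warp

  cudaError_t err =
      cudaMemcpy(c_h, c_d, n * n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(a_d);
  cudaFree(b_d);
  cudaFree(c_d);
  if (err != cudaSuccess) return -1;

  int wrong = 0;
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
      float ref = 0.0f;
      // A(i, k) is at a[k * n + i] (column-major),
      // B(k, j) is at b[k * n + j] (row-major).
      for (int k = 0; k < n; ++k) ref += a_f[k * n + i] * b_f[k * n + j];
      if (c_h[i * n + j] != ref) ++wrong;
    }
  }
  return wrong;
}

int main() {
  printf("mismatches: %d\n", wmma_mismatches());
  return 0;
}
```

For larger matrices, each warp would repeat the load and `mma_sync` steps over the tiles along the shared dimension while accumulating into the same `c_frag`, and use the full matrix width as the stride instead of 16.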

Open the file `matrix_multiplication/matrix_multiply_tensor_cores.cu`, which contains the optimisations we have done until now. You will note that we start without shared memory: we won't need it as Tensor Cores use `wmma::fragment`s instead. As usual, if you feel like it, here's the solution for you to peek into: `matrix_multiplication/matrix_multiply_tensor_cores_solution.cu`

In [None]:
!nvcc -arch=sm_70 -o matrix_multiplication/matrix_multiply_tensor_cores matrix_multiplication/matrix_multiply_tensor_cores.cu

In [None]:
!./matrix_multiplication/matrix_multiply_tensor_cores 512

In [None]:
!nsys profile --stats=true ./matrix_multiplication/matrix_multiply_tensor_cores 512

In [None]:
!ncu --target-processes=all ./matrix_multiplication/matrix_multiply_tensor_cores 512

### Wrap-up and going beyond

If you have made it this far, congratulations! You are by now well-versed in matrix-matrix multiplications. Hopefully you can see how relevant this problem is and why it gets so much attention these days.

As a last task, document the speedup you got in each step and think about the design decisions you needed to make to get this far. There are still many factors you could consider. Here are some:

* Other non-standard floating point precision modes could be considered depending on the task at hand, such as BF16, FP8, FP4, or even integer precision such as INT8.
* If the dimensions of the matrices you multiply are not multiples of the Tensor Core tile size, you'll be left with tiles that are mostly empty and that need another iteration, leading to an ineffective use of the hardware. This effect is known as tile quantization:

<img src="figures/tiling-ex.svg" alt="tilequant" width="800"/>

* Matrix sizes that are multiples of the number of SMs in your hardware will be a perfect match for load balancing the work on the GPU, but this is an unlikely scenario. In general some SMs will end up doing more work than others, and in unlucky scenarios a few SMs could be stalling the whole multiplication having to do one more iteration. This effect is known as wave quantization.

Refer to [this resource](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#mat-mat-multi) for more considerations on GEMMs.

We hope you enjoyed these exercises!