## Matrix multiplication

In this exercise, we will work through a canonical example where GPUs can make a difference: **matrix multiplication**. To keep things simple, we restrict ourselves to the case of two square matrices. For matrices A and B, every element of the result matrix C can be calculated as follows:

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/ee372c649dea0a05bf1ace77c9d6faf051d9cc8d)

This is an inherently parallel problem, since all elements of the matrix C can be calculated independently and at the same time. In a first parallel implementation, the idea is to assign each thread in a block a different element of the result matrix C.

![Matrix multiplication](https://upload.wikimedia.org/wikipedia/commons/e/eb/Matrix_multiplication_diagram_2.svg)

Since we are going to focus on performance, it is better to store matrices A, B and C as linear (1D) arrays that represent 2D matrices. There are two main ways of laying out a matrix in linear memory:

![Matrix representations](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Row_and_column_major_order.svg/340px-Row_and_column_major_order.svg.png)

The format used for this exercise will be **row-major order**.
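As a small illustration of row-major indexing, the sketch below stores a 2x2 matrix in a linear array and multiplies it on the CPU. The helper names `row_major_index`, `col_major_index` and `matmul_cpu` are hypothetical and are not taken from the exercise files.

```cuda
#include <cstdio>
#include <vector>

// Row-major: element (row, col) of an n x n matrix is stored at row * n + col.
__host__ __device__ inline int row_major_index(int row, int col, int n) {
  return row * n + col;
}

// Column-major, shown only for comparison: columns are contiguous instead.
__host__ __device__ inline int col_major_index(int row, int col, int n) {
  return col * n + row;
}

// CPU reference: C = A * B for n x n matrices stored in row-major order.
void matmul_cpu(const float* A, const float* B, float* C, int n) {
  for (int row = 0; row < n; ++row) {
    for (int col = 0; col < n; ++col) {
      float sum = 0.0f;
      for (int k = 0; k < n; ++k) {
        sum += A[row_major_index(row, k, n)] * B[row_major_index(k, col, n)];
      }
      C[row_major_index(row, col, n)] = sum;
    }
  }
}

int main() {
  const int n = 2;
  std::vector<float> A = {1, 2, 3, 4};  // [[1, 2], [3, 4]]
  std::vector<float> B = {5, 6, 7, 8};  // [[5, 6], [7, 8]]
  std::vector<float> C(n * n);
  matmul_cpu(A.data(), B.data(), C.data(), n);
  // Prints "19 22" and then "43 50".
  for (int row = 0; row < n; ++row) {
    printf("%g %g\n", C[row_major_index(row, 0, n)], C[row_major_index(row, 1, n)]);
  }
  return 0;
}
```

In row-major order, elements of the same row are adjacent in memory, so walking along a row of A touches consecutive addresses while walking down a column of B jumps by `n` elements.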

### Matrix multiplication with threads in a block

In file [matrix_multiply.cu](/edit/matrix_multiply.cu) you will find a first implementation of square matrix multiplication. This version of matrix multiplication runs on the GPU, but it runs sequentially with a single block and a single thread.

When compiled, the generated program accepts a single argument: the size of the matrices, which is used as both the width and the height of the matrices involved in the multiplication. The time the multiplication took is also measured and printed at the end of the run. In addition, after performing the multiplication, a submatrix of size 64x64 is recomputed on the CPU and compared against the GPU result to check for errors.
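To make the timing and verification steps concrete, here is a minimal sketch of how such a program could be organized. It is not the content of `matrix_multiply.cu`: the names `matmul_sequential`, `count_errors` and `time_sequential_ms` are hypothetical, the use of CUDA events for timing is an assumption, and checking the top-left 64x64 corner is an assumption as well, since the text does not say which submatrix is verified.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Sequential baseline: a single GPU thread computes every element of C.
__global__ void matmul_sequential(const float* A, const float* B, float* C, int n) {
  for (int row = 0; row < n; ++row) {
    for (int col = 0; col < n; ++col) {
      float sum = 0.0f;
      for (int k = 0; k < n; ++k) sum += A[row * n + k] * B[k * n + col];
      C[row * n + col] = sum;
    }
  }
}

// Recompute the top-left sub x sub corner on the CPU and count mismatches.
int count_errors(const float* A, const float* B, const float* C, int n, int sub) {
  int errors = 0;
  for (int row = 0; row < sub; ++row) {
    for (int col = 0; col < sub; ++col) {
      float sum = 0.0f;
      for (int k = 0; k < n; ++k) sum += A[row * n + k] * B[k * n + col];
      if (std::fabs(sum - C[row * n + col]) > 1e-4f * std::fabs(sum) + 1e-5f) ++errors;
    }
  }
  return errors;
}

// Time one kernel launch with CUDA events; returns milliseconds.
float time_sequential_ms(const float* d_A, const float* d_B, float* d_C, int n) {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  matmul_sequential<<<1, 1>>>(d_A, d_B, d_C, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);  // wait until the kernel and the stop event are done
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main(int argc, char** argv) {
  const int n = (argc > 1) ? std::atoi(argv[1]) : 128;
  if (n <= 0) return 1;
  const size_t count = size_t(n) * n;
  const size_t bytes = count * sizeof(float);
  std::vector<float> A(count), B(count), C(count);
  for (size_t i = 0; i < count; ++i) {
    A[i] = (i % 7) * 0.5f;   // small values keep float sums exact
    B[i] = (i % 5) * 0.25f;
  }
  float *d_A, *d_B, *d_C;
  cudaMalloc(&d_A, bytes);
  cudaMalloc(&d_B, bytes);
  cudaMalloc(&d_C, bytes);
  cudaMemcpy(d_A, A.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, B.data(), bytes, cudaMemcpyHostToDevice);
  const float ms = time_sequential_ms(d_A, d_B, d_C, n);
  cudaMemcpy(C.data(), d_C, bytes, cudaMemcpyDeviceToHost);
  const int sub = (n < 64) ? n : 64;
  const int errors = count_errors(A.data(), B.data(), C.data(), n, sub);
  printf("n = %d, kernel time = %.3f ms, errors in %dx%d check = %d\n", n, ms, sub, sub, errors);
  cudaFree(d_A);
  cudaFree(d_B);
  cudaFree(d_C);
  return errors == 0 ? 0 : 1;
}
```

Checking only a submatrix keeps the CPU verification cheap even for large matrices, at the cost of not covering every element of the result.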

Test that the program compiles and runs. The commands below compile the program, multiply two 512x512 matrices, and check the result.

In [None]:
!nvcc -arch=sm_70 -o matrix_multiplication/matrix_multiply matrix_multiplication/matrix_multiply.cu

In [None]:
!./run-exclusive.sh ./matrix_multiplication/matrix_multiply 512

It is time to parallelize using threads in a block. Since the work is conceptually done over 2D, you may use the capability of defining the block dimension in 2D by setting `dim3 block (n_threads, n_threads);`. Bear in mind that there is a limit on the number of threads you may use per block, calculated by multiplying the block dimensions, which on most GPUs is 1024 threads per block in total.
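If you want to confirm the limit on the GPU you are running on, the CUDA runtime can report it. The sketch below queries device 0 and checks whether a given 2D block fits; the helper name `block_fits` is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// True if the total number of threads in the block does not exceed the limit.
bool block_fits(dim3 block, int max_threads_per_block) {
  const unsigned int total = block.x * block.y * block.z;
  return total <= static_cast<unsigned int>(max_threads_per_block);
}

int main() {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    printf("Could not query CUDA device 0\n");
    return 1;
  }
  printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
  printf("maxThreadsDim: %d x %d x %d\n",
         prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);

  const int n_threads = 32;
  dim3 block(n_threads, n_threads);  // 32 * 32 = 1024 threads in total
  printf("block %u x %u fits: %s\n", block.x, block.y,
         block_fits(block, prop.maxThreadsPerBlock) ? "yes" : "no");
  return 0;
}
```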

* Parallelize the work by using `threadIdx.x` and `threadIdx.y` to distribute the iterations of the first two for-loops in the kernel among the threads.
* Keep the grid dimension at 1 for now, and optimize the block dimension used to invoke your function.
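One possible shape for such a kernel is sketched below; it is not the provided solution. Each thread starts at its own (row, column) pair and advances by the block dimensions until the whole matrix is covered. The names `matmul_threads` and `run_and_count_errors` are hypothetical, and managed memory (`cudaMallocManaged`) is used here only to keep the host code short.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Single-block version: threads stride over rows (y) and columns (x).
__global__ void matmul_threads(const float* A, const float* B, float* C, int n) {
  for (int row = threadIdx.y; row < n; row += blockDim.y) {
    for (int col = threadIdx.x; col < n; col += blockDim.x) {
      float sum = 0.0f;
      for (int k = 0; k < n; ++k) sum += A[row * n + k] * B[k * n + col];
      C[row * n + col] = sum;
    }
  }
}

// Runs the kernel with one block of n_threads x n_threads threads and
// compares every element against a CPU result. Returns -1 on a CUDA error.
int run_and_count_errors(int n, int n_threads) {
  const size_t count = size_t(n) * n;
  float *A, *B, *C;
  cudaMallocManaged(&A, count * sizeof(float));
  cudaMallocManaged(&B, count * sizeof(float));
  cudaMallocManaged(&C, count * sizeof(float));
  for (size_t i = 0; i < count; ++i) {
    A[i] = (i % 7) * 0.5f;
    B[i] = (i % 5) * 0.25f;
    C[i] = -1.0f;
  }

  dim3 block(n_threads, n_threads);
  matmul_threads<<<1, block>>>(A, B, C, n);  // grid dimension kept at 1
  const bool ok = (cudaGetLastError() == cudaSuccess) &&
                  (cudaDeviceSynchronize() == cudaSuccess);

  int errors = 0;
  for (int row = 0; ok && row < n; ++row) {
    for (int col = 0; col < n; ++col) {
      float sum = 0.0f;
      for (int k = 0; k < n; ++k) sum += A[row * n + k] * B[k * n + col];
      if (std::fabs(sum - C[row * n + col]) > 1e-4f * std::fabs(sum) + 1e-5f) ++errors;
    }
  }
  cudaFree(A);
  cudaFree(B);
  cudaFree(C);
  return ok ? errors : -1;
}

int main() {
  printf("errors with 32x32 threads: %d\n", run_and_count_errors(128, 32));
  return 0;
}
```

Mapping `threadIdx.x` to the column index is a design choice: neighbouring threads then read neighbouring elements of B and write neighbouring elements of C, which suits the row-major layout.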

Use the file [matrix_multiply_threads.cu](/edit/matrix_multiply_threads.cu) to write your code. You may look into the [solution](/edit/matrix_multiply_threads_solution.cu) if you get stuck.

In [None]:
!nvcc -arch=sm_70 -o matrix_multiplication/matrix_multiply_threads matrix_multiplication/matrix_multiply_threads.cu

In [None]:
!./run-exclusive.sh ./matrix_multiplication/matrix_multiply_threads 512

### Multiple threads and blocks

Now add more parallelism by using many blocks in a grid. You should use as many blocks as needed so that every thread is tasked with calculating a single element of the result matrix.

* The number of blocks should be defined as a 2D grid.
* The number of blocks should depend on the matrix `size` and the `number of threads`.
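As a hedged sketch of the two points above (not the provided solution), the number of blocks per dimension can be obtained by rounding `size / n_threads` up, and each thread derives its global row and column from `blockIdx`, `blockDim` and `threadIdx`. The names `blocks_needed`, `matmul_grid` and `run_and_count_errors` are hypothetical.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Blocks per dimension so that blocks * n_threads >= size (rounded up).
int blocks_needed(int size, int n_threads) {
  return (size + n_threads - 1) / n_threads;
}

// One thread per element of C; the guard handles sizes that are not a
// multiple of the block dimension.
__global__ void matmul_grid(const float* A, const float* B, float* C, int n) {
  const int row = blockIdx.y * blockDim.y + threadIdx.y;
  const int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < n && col < n) {
    float sum = 0.0f;
    for (int k = 0; k < n; ++k) sum += A[row * n + k] * B[k * n + col];
    C[row * n + col] = sum;
  }
}

// Launches the kernel on a 2D grid and compares against a CPU result.
// Returns -1 on a CUDA error. Managed memory keeps the host code short.
int run_and_count_errors(int n, int n_threads) {
  const size_t count = size_t(n) * n;
  float *A, *B, *C;
  cudaMallocManaged(&A, count * sizeof(float));
  cudaMallocManaged(&B, count * sizeof(float));
  cudaMallocManaged(&C, count * sizeof(float));
  for (size_t i = 0; i < count; ++i) {
    A[i] = (i % 7) * 0.5f;
    B[i] = (i % 5) * 0.25f;
    C[i] = -1.0f;
  }

  dim3 block(n_threads, n_threads);
  dim3 grid(blocks_needed(n, n_threads), blocks_needed(n, n_threads));
  matmul_grid<<<grid, block>>>(A, B, C, n);
  const bool ok = (cudaGetLastError() == cudaSuccess) &&
                  (cudaDeviceSynchronize() == cudaSuccess);

  int errors = 0;
  for (int row = 0; ok && row < n; ++row) {
    for (int col = 0; col < n; ++col) {
      float sum = 0.0f;
      for (int k = 0; k < n; ++k) sum += A[row * n + k] * B[k * n + col];
      if (std::fabs(sum - C[row * n + col]) > 1e-4f * std::fabs(sum) + 1e-5f) ++errors;
    }
  }
  cudaFree(A);
  cudaFree(B);
  cudaFree(C);
  return ok ? errors : -1;
}

int main() {
  printf("blocks per dimension for 512 with 32 threads: %d\n", blocks_needed(512, 32));
  printf("errors: %d\n", run_and_count_errors(200, 32));
  return 0;
}
```

For a 512x512 matrix and 32x32 threads per block this gives a 16x16 grid; for a size such as 200, the last row and column of blocks are only partly used, which is why the bounds check in the kernel is needed.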

Write your solution in file [matrix_multiply_grid.cu](/edit/matrix_multiply_grid.cu). Here is the [solution](/edit/matrix_multiply_grid_solution.cu) in case you want to have a look.

In [None]:
!nvcc -arch=sm_70 -o matrix_multiplication/matrix_multiply_grid matrix_multiplication/matrix_multiply_grid.cu

In [None]:
!./run-exclusive.sh ./matrix_multiplication/matrix_multiply_grid 512

### Shared memory

The final step consists of using **shared memory** as an intermediate buffer that is filled one tile at a time, following a **tiling** method. As you may recall from the lectures, you should follow these steps:

* Load the tile from global into shared memory in a coalesced manner.
* Synchronize.
* Have multiple threads access the data from the shared buffer.
* Synchronize.
* Move on to the next tile.

You will have to define a `tile size` at compile time in order to be able to define the size of the shared memory array. You may use the expression `constexpr int tile_size = 32;` as a starting point at the top of your program.

All threads should participate in loading each tile into memory, calculate a partial result in a register, and then move on to the next tile. Visually:

![pr](https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/matrix-multiplication-with-shared-memory.png)
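The tiling steps listed above can be sketched as follows. This is one possible arrangement under stated assumptions, not the provided solution: the block is assumed to be `tile_size` x `tile_size` threads, out-of-range tile entries are padded with zeros so that sizes that are not a multiple of the tile still work, and the names `matmul_shared`, `tile_A`, `tile_B` and `run_and_count_errors` are hypothetical.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

constexpr int tile_size = 32;

// Tiled multiplication; must be launched with tile_size x tile_size threads.
__global__ void matmul_shared(const float* A, const float* B, float* C, int n) {
  __shared__ float tile_A[tile_size][tile_size];
  __shared__ float tile_B[tile_size][tile_size];

  const int row = blockIdx.y * tile_size + threadIdx.y;
  const int col = blockIdx.x * tile_size + threadIdx.x;
  float sum = 0.0f;  // partial result kept in a register

  const int n_tiles = (n + tile_size - 1) / tile_size;
  for (int t = 0; t < n_tiles; ++t) {
    // Load one tile of A and one of B; consecutive threadIdx.x values read
    // consecutive addresses, so the global loads are coalesced.
    const int a_col = t * tile_size + threadIdx.x;
    const int b_row = t * tile_size + threadIdx.y;
    tile_A[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
    tile_B[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    for (int k = 0; k < tile_size; ++k) {
      sum += tile_A[threadIdx.y][k] * tile_B[k][threadIdx.x];
    }
    __syncthreads();  // everyone must finish reading before the next load
  }
  if (row < n && col < n) C[row * n + col] = sum;
}

// Launches the tiled kernel and compares against a CPU result.
// Returns -1 on a CUDA error. Managed memory keeps the host code short.
int run_and_count_errors(int n) {
  const size_t count = size_t(n) * n;
  float *A, *B, *C;
  cudaMallocManaged(&A, count * sizeof(float));
  cudaMallocManaged(&B, count * sizeof(float));
  cudaMallocManaged(&C, count * sizeof(float));
  for (size_t i = 0; i < count; ++i) {
    A[i] = (i % 7) * 0.5f;
    B[i] = (i % 5) * 0.25f;
    C[i] = -1.0f;
  }

  dim3 block(tile_size, tile_size);
  const int n_blocks = (n + tile_size - 1) / tile_size;
  dim3 grid(n_blocks, n_blocks);
  matmul_shared<<<grid, block>>>(A, B, C, n);
  const bool ok = (cudaGetLastError() == cudaSuccess) &&
                  (cudaDeviceSynchronize() == cudaSuccess);

  int errors = 0;
  for (int row = 0; ok && row < n; ++row) {
    for (int col = 0; col < n; ++col) {
      float sum = 0.0f;
      for (int k = 0; k < n; ++k) sum += A[row * n + k] * B[k * n + col];
      if (std::fabs(sum - C[row * n + col]) > 1e-4f * std::fabs(sum) + 1e-5f) ++errors;
    }
  }
  cudaFree(A);
  cudaFree(B);
  cudaFree(C);
  return ok ? errors : -1;
}

int main() {
  printf("errors: %d\n", run_and_count_errors(100));
  return 0;
}
```

With this scheme each element of A and B is read from global memory once per tile instead of once per multiply-add, and the inner loop works only on shared memory and a register.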

Use the file [matrix_multiply_shared.cu](/edit/matrix_multiply_shared.cu) to write your answer. If you get stuck, you may have a look at the [solution](/edit/matrix_multiply_grid_solution.cu).

In [None]:
!nvcc -arch=sm_70 -o matrix_multiplication/matrix_multiply_shared matrix_multiplication/matrix_multiply_shared.cu

In [None]:
!./run-exclusive.sh ./matrix_multiplication/matrix_multiply_shared 512
