# Matrix Multiplication with PyCUDA

This notebook demonstrates how to perform matrix multiplication using PyCUDA. The CUDA kernel implementation is left empty for you to complete as an exercise.

In [None]:
# Import Required Libraries
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

from cuda_helpers import profile_gpu

## Define Matrices

Let's define two matrices to multiply. You can change their size and values as needed.

In [None]:
# Define two matrices A and B
N = 1024
A = np.random.randn(N, N).astype(np.float32)
B = np.random.randn(N, N).astype(np.float32)
print("Matrix A:\n", A)
print("Matrix B:\n", B)

## Perform Matrix Multiplication on the GPU

We will multiply matrices A and B using a CUDA kernel. Implement the kernel so that it performs standard matrix multiplication as defined in linear algebra. For two matrices A (of size m×n) and B (of size n×p), their product C = AB is an m×p matrix where each element C[i, j] is the dot product of the i-th row of A and the j-th column of B. In this notebook both matrices are square, so m = n = p = N:

$$
C_{i,j} = \sum_{k=1}^{N} A_{i,k} \cdot B_{k,j}
$$

This operation is also known as "GEMM" (General Matrix Multiply) in numerical computing libraries.
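
To make the formula concrete, here is a minimal CPU sketch that evaluates it literally with three nested loops. The function name `matmul_reference` and the small test shapes are illustrative assumptions, not part of the notebook; note that the code uses 0-based indices while the formula is 1-based.

```python
import numpy as np

def matmul_reference(A, B):
    """Triple-loop product: C[i, j] = sum over k of A[i, k] * B[k, j]."""
    m, n = A.shape
    n_b, p = B.shape
    assert n == n_b, "inner dimensions must match"
    C = np.zeros((m, p), dtype=np.float64)
    for i in range(m):          # row of A / row of C
        for j in range(p):      # column of B / column of C
            for k in range(n):  # shared (inner) dimension
                C[i, j] += A[i, k] * B[k, j]
    return C

rng = np.random.default_rng(0)
A_small = rng.standard_normal((4, 3))
B_small = rng.standard_normal((3, 5))
C_small = matmul_reference(A_small, B_small)

# The loop version should agree with NumPy's built-in product.
print(np.allclose(C_small, A_small @ B_small))
```

This version is far too slow for N = 1024, but it is a handy oracle for checking a kernel on small inputs.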

We will build up the optimized solution in stages. Start simple: implement the multiplication using only global memory, with each thread computing one element of the output matrix.
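
Before writing the kernel, it can help to see the thread-to-element mapping emulated on the CPU. The sketch below is an assumption-laden illustration rather than the notebook's solution: `emulate_global_kernel` is a hypothetical helper that loops over every (block, thread) pair the GPU would run in parallel, and it assumes square, row-major matrices.

```python
import numpy as np

def emulate_global_kernel(A, B, block=(4, 4)):
    """CPU emulation of a launch where each thread computes one element of C."""
    N = A.shape[0]
    # Ceiling division: enough blocks to cover N even if N is not a multiple of the block size.
    grid = ((N + block[0] - 1) // block[0], (N + block[1] - 1) // block[1])
    a, b = A.ravel(), B.ravel()            # flat row-major buffers, like device memory
    c = np.zeros(N * N, dtype=np.float32)
    for block_y in range(grid[1]):
        for block_x in range(grid[0]):
            for thread_y in range(block[1]):
                for thread_x in range(block[0]):
                    col = block_x * block[0] + thread_x   # global x id
                    row = block_y * block[1] + thread_y   # global y id
                    if col >= N or row >= N:
                        continue                          # surplus thread outside the matrix
                    acc = np.float32(0)
                    for k in range(N):
                        acc += a[row * N + k] * b[k * N + col]
                    c[row * N + col] = acc
    return c.reshape(N, N)

rng = np.random.default_rng(1)
A_demo = rng.standard_normal((6, 6)).astype(np.float32)
B_demo = rng.standard_normal((6, 6)).astype(np.float32)
C_demo = emulate_global_kernel(A_demo, B_demo)
print(np.allclose(C_demo, A_demo @ B_demo, atol=1e-4))
```

With N = 6 and a 4×4 block, the grid is 2×2 and some threads fall outside the matrix, which is why the bounds check matters.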

In [None]:
# Allocate GPU memory and transfer matrices
d_a = cuda.mem_alloc(A.nbytes)
d_b = cuda.mem_alloc(B.nbytes)
d_c = cuda.mem_alloc(A.nbytes)
cuda.memcpy_htod(d_a, A)
cuda.memcpy_htod(d_b, B)

# CUDA kernel for matrix multiplication (to be completed)
kernel_code = '''
__global__ void matmul(float *A, float *B, float *C, int N) {
    // TODO: Implement matrix multiplication kernel
    // blockIdx, blockDim, threadIdx, gridDim
    float sum = 0;
    int2 global_id = make_int2(blockIdx.x * blockDim.x + threadIdx.x,
                               blockIdx.y * blockDim.y + threadIdx.y);

    if (global_id.x >= N || global_id.y >= N) {
        return;
    }
    
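    // Walk along row global_id.y of A and column global_id.x of B.
    for (int i = 0; i < N; i++) {
        int aij = global_id.y * N + i;   // flat row-major index of A[row][i]
        int bij = i * N + global_id.x;   // flat row-major index of B[i][col]
        sum += A[aij] * B[bij];
    }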

    int cij = global_id.y * N + global_id.x;
    C[cij] = sum;
}
'''

mod = SourceModule(kernel_code)
matmul = mod.get_function("matmul")

In [None]:
block_size = (8, 8, 1)
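# Round up so the grid covers the whole matrix even when N is not a multiple of the
# block size; the kernel's bounds check discards the surplus threads.
grid_size = ((N + block_size[0] - 1) // block_size[0],
             (N + block_size[1] - 1) // block_size[1], 1)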

print(f'Launching with grid_size={grid_size}, block_size={block_size}')

n_warmup = 2
n_iters = 50

launch = lambda: matmul(d_a, d_b, d_c, np.int32(N), block=block_size, grid=grid_size)
_ = profile_gpu(launch, n_warmup=n_warmup, n_iters=n_iters)

## Display Results

After running the kernel, copy the result back to the host and display it.

Refer to the [solution](matrix_multiplication_solution_global.cu) if you get stuck.
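
Because the kernel accumulates in float32, its output will generally not match NumPy bit for bit, so the comparison needs a tolerance. The following is a minimal sketch of such a check against a float64 reference. The helper name `check_against_numpy` and the tolerance values are illustrative assumptions, and a NumPy product stands in for the array copied back from the GPU.

```python
import numpy as np

def check_against_numpy(C_gpu, A, B, rtol=1e-3, atol=1e-3):
    """Compare a float32 result against a float64 reference and return the largest error."""
    reference = A.astype(np.float64) @ B.astype(np.float64)
    # Raises AssertionError if any element differs by more than atol + rtol * |reference|.
    np.testing.assert_allclose(C_gpu, reference, rtol=rtol, atol=atol)
    return float(np.max(np.abs(C_gpu - reference)))

rng = np.random.default_rng(0)
A_check = rng.standard_normal((64, 64)).astype(np.float32)
B_check = rng.standard_normal((64, 64)).astype(np.float32)
C_check = A_check @ B_check   # stand-in for the result copied back with memcpy_dtoh
max_err = check_against_numpy(C_check, A_check, B_check)
print(f"max abs error: {max_err:.2e}")
```

Returning the largest error makes it easy to see how close a kernel is to the tolerance instead of only learning whether it passed.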

In [None]:
# Copy result from GPU and display
C = np.empty_like(A)
cuda.memcpy_dtoh(C, d_c)
c_numpy = np.matmul(A, B)
print("Result matrix C (A x B):\n", C)

np.testing.assert_almost_equal(C, c_numpy, decimal=3)
# Note: You need to implement the kernel for correct results!

# Shared memory caching

Your next task is to reduce the kernel execution time. A common optimization for matrix multiplication is to reduce global memory traffic. In the kernel above, each element of the input matrices A and B is read N times from global memory. While we can expect the hardware caches to absorb some of these reads, we can also explicitly cache parts of A and B in Shared Local Memory (SLM), which CUDA calls shared memory.
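
A rough count shows what caching can save. The sketch below counts the global memory loads requested by the threads. It deliberately ignores hardware caches and coalescing, so treat the numbers as an upper-level estimate rather than a measurement; the function names are hypothetical.

```python
def global_loads_naive(N):
    """N*N threads, each reading N elements of A and N elements of B."""
    return 2 * N**3

def global_loads_tiled(N, tile):
    """N*N threads, each reading one element of A and one of B per tile step."""
    assert N % tile == 0, "this estimate assumes N is a multiple of the tile size"
    return 2 * N**2 * (N // tile)

N_demo, TILE = 1024, 8
naive = global_loads_naive(N_demo)
tiled = global_loads_tiled(N_demo, TILE)
print(naive, tiled, naive // tiled)  # → 2147483648 268435456 8
```

Under these assumptions an 8×8 tile cuts the requested global loads by a factor of 8, because every value placed in shared memory is reused by 8 threads.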

## Optimization task

Write an optimized kernel that caches blocks of the A and B matrices in SLM:
- for simplicity, assume a constant block size of (8, 8)
- make sure there are no data races: synchronize accesses to SLM

Implement the following algorithm:
- for each cache block
    - load 8x8 blocks from global memory into declared SLMs
        - the main difficulty lies in computing the global memory indices while iterating over the cached blocks
    - compute the partial matrix multiplication sum from the data cached in SLM and accumulate it in a local variable
        - here you just need local ids
- write the accumulated sum to C in global memory
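
The steps above can be sketched on the CPU with NumPy before writing any CUDA. This is an illustration under stated assumptions, not the reference solution: `matmul_tiled` is a hypothetical helper, the matrices are square with N a multiple of the tile size, and each `tile × tile` accumulator array stands in for the per-thread local variables of one thread block.

```python
import numpy as np

TILE = 8  # matches the (8, 8) block size assumed in the task

def matmul_tiled(A, B, tile=TILE):
    """Blocked matrix product that mirrors the shared-memory algorithm."""
    N = A.shape[0]
    assert A.shape == B.shape == (N, N) and N % tile == 0
    C = np.zeros((N, N), dtype=np.float32)
    n_tiles = N // tile
    for by in range(n_tiles):        # plays the role of blockIdx.y
        for bx in range(n_tiles):    # plays the role of blockIdx.x
            acc = np.zeros((tile, tile), dtype=np.float32)  # one accumulator per thread
            for b in range(n_tiles):                        # for each cache block
                # Load phase: copy one tile of A and one tile of B into "shared memory".
                slm_a = A[by * tile:(by + 1) * tile, b * tile:(b + 1) * tile].copy()
                slm_b = B[b * tile:(b + 1) * tile, bx * tile:(bx + 1) * tile].copy()
                # Compute phase: partial products using only the cached tiles.
                acc += slm_a @ slm_b
            # Write the accumulated sums for this block to C.
            C[by * tile:(by + 1) * tile, bx * tile:(bx + 1) * tile] = acc
    return C

rng = np.random.default_rng(0)
A_tiled = rng.standard_normal((32, 32)).astype(np.float32)
B_tiled = rng.standard_normal((32, 32)).astype(np.float32)
print(np.allclose(matmul_tiled(A_tiled, B_tiled), A_tiled @ B_tiled, atol=1e-3))
```

On the GPU the load and compute phases run concurrently across threads, so a barrier is needed between them; the sequential emulation hides that requirement.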

In [None]:
def gpu_matrix_multiply(A, B, kernel_code):
    N = A.shape[0]
    assert A.shape[1] == N and B.shape[0] == N and B.shape[1] == N, "Matrices must be square and same size"

    d_a = cuda.mem_alloc(A.nbytes)
    d_b = cuda.mem_alloc(B.nbytes)
    d_c = cuda.mem_alloc(A.nbytes)

    cuda.memcpy_htod(d_a, A)
    cuda.memcpy_htod(d_b, B)

    mod = SourceModule(kernel_code)
    matmul = mod.get_function("matmul")

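    block_size = (8, 8, 1)
    # The tiled kernel has no edge handling, so N must be a multiple of the block size.
    assert N % block_size[0] == 0 and N % block_size[1] == 0, "N must be a multiple of the block size"
    nr_threads = int(np.prod(block_size))
    grid_size = (N // block_size[0], N // block_size[1], 1)
    shared_mem_bytes = int(nr_threads * 4)  # one 4-byte value per thread in a block
    # Debug buffer the kernel may write to; zero it so unwritten entries read back as 0.
    d_debug = cuda.mem_alloc(shared_mem_bytes)
    cuda.memcpy_htod(d_debug, np.zeros(nr_threads, dtype=np.int32))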
    
    launch = lambda: matmul(d_a, d_b, d_c, d_debug,
                            np.int32(N), block=block_size, grid=grid_size) #shared=shared_mem_bytes
    launch()

    C = np.empty_like(A)
    cuda.memcpy_dtoh(C, d_c)
    debug = np.zeros(nr_threads, dtype=np.int32)
    cuda.memcpy_dtoh(debug, d_debug)
    
    c_numpy = np.matmul(A, B)
    #print("Result matrix C (A x B):\n", C[:8, :8])
    print("debug = ", debug[:8])

    np.testing.assert_almost_equal(C, c_numpy, decimal=3)
    
    n_warmup = 2
    n_iters = 100
    
    _ = profile_gpu(launch, n_warmup=n_warmup, n_iters=n_iters)

In [None]:
shared_mem_kernel = '''
    #define THREAD_INDEX (threadIdx.y * blockDim.x + threadIdx.x)
    // Parenthesized so the macro expands safely inside larger expressions.
    #define BLOCK_LENGTH (8 * 8)

    __global__ void matmul(float *A, float *B, float *C, int *debug, int N) {
        float sum = 0;
        int2 global_id = make_int2(blockIdx.x * blockDim.x + threadIdx.x,
                                   blockIdx.y * blockDim.y + threadIdx.y);
                                   
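        // With N a multiple of the block size no thread takes this branch.
        // For other sizes it would be unsafe: returning threads would skip the
        // tile loads and the __syncthreads() calls the rest of the block relies on.
        if (global_id.x >= N || global_id.y >= N) {
            return;
        }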
                
        //extern __shared__ float slm[];
        __shared__ float slm_A[BLOCK_LENGTH];
        __shared__ float slm_B[BLOCK_LENGTH];
        
        for (int b = 0; b < gridDim.x; b++) {
            int2 gidA = make_int2(b * blockDim.x + threadIdx.x,  blockIdx.y * blockDim.y + threadIdx.y);
            int2 gidB = make_int2(blockIdx.x * blockDim.x + threadIdx.x,  b * blockDim.y + threadIdx.y);
                        
            slm_A[THREAD_INDEX] = A[gidA.y * N + gidA.x];
            slm_B[THREAD_INDEX] = B[gidB.y * N + gidB.x];
                
            __syncthreads();
            
            for (int i = 0; i < blockDim.x; i++) {
                sum += slm_A[blockDim.x * threadIdx.y + i] * slm_B[blockDim.x * i + threadIdx.x];
            }            
            
            __syncthreads();
        }
        
        C[global_id.y * N + global_id.x] = sum;
    }
'''
    
gpu_matrix_multiply(A, B, shared_mem_kernel)

Refer to the [solution](matrix_multiplication_solution_slm.cu) if you get stuck.