# Matrix Multiplication with PyOpenCL

This notebook demonstrates how to perform matrix multiplication using PyOpenCL. The OpenCL kernel implementation is left empty for you to complete as an exercise.

In [None]:
import pyopencl as cl
import numpy as np
import sys
sys.path.append("..")

from helpers import profile_gpu

%load_ext pyopencl.ipython_ext

Create context and queue.


In [3]:
platform = cl.get_platforms()[0]

ctx = cl.Context(
    dev_type=cl.device_type.ALL, 
    properties=[(cl.context_properties.PLATFORM, platform)])    

queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    
devices = ctx.get_info(cl.context_info.DEVICES)
for d in devices:
    print(f"device={d}")

device=<pyopencl.Device 'NVIDIA GeForce RTX 2060 with Max-Q Design' on 'NVIDIA CUDA' at 0x19d3cdb2bf0>


## Define Matrices

Let's define two matrices to multiply. You can change their size and values as needed.

In [5]:
# Define two matrices A and B
N = 1024
A = np.random.randn(N, N).astype(np.float32)
B = np.random.randn(N, N).astype(np.float32)


## Perform Matrix Multiplication on the GPU

We will multiply matrices A and B using an OpenCL kernel. Implement the kernel code to perform standard matrix multiplication as done in algebra. For two matrices A (of size m×n) and B (of size n×p), their product C = AB is an m×p matrix where each element C[i, j] is computed as the sum of products of the i-th row of A and the j-th column of B:

$$
C_{i,j} = \sum_{k=1}^{n} A_{i,k} \cdot B_{k,j}
$$


This operation is also known as "GEMM" (General Matrix Multiply) in numerical computing libraries.

We will build up to an optimized solution in stages. Start simple: implement the multiplication using only global memory, with each work item computing the value of one cell of the output matrix.

You can assume that all matrices are square for now, if you find it more convenient.
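
As a CPU reference for what each work item has to compute, the sketch below evaluates one output cell at a time from flat row-major arrays, mirroring the index arithmetic a kernel would use. The function name `matmul_reference` and the tiny 4×4 input are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def matmul_reference(a_flat, b_flat, n):
    """Multiply two n x n matrices stored as flat row-major arrays."""
    c_flat = np.zeros(n * n, dtype=np.float32)
    for row in range(n):          # plays the role of get_global_id(1)
        for col in range(n):      # plays the role of get_global_id(0)
            acc = np.float32(0.0)
            for k in range(n):
                # A[row, k] lives at row * n + k; B[k, col] lives at k * n + col
                acc += a_flat[row * n + k] * b_flat[k * n + col]
            c_flat[row * n + col] = acc
    return c_flat

n = 4
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n)).astype(np.float32)
b = rng.standard_normal((n, n)).astype(np.float32)
c = matmul_reference(a.ravel(), b.ravel(), n).reshape(n, n)
print(np.allclose(c, a @ b, atol=1e-5))
```

The two outer loops are what the GPU parallelizes: each (row, col) pair becomes one work item, and only the inner loop over `k` remains inside the kernel.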

In [8]:
# Create device buffers (the context and queue were created above)
mf = cl.mem_flags
d_a = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=A)
d_b = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=B)
# Output buffer: C has the same size as A because the matrices are square
d_c = cl.Buffer(ctx, mf.WRITE_ONLY, A.nbytes)

In [12]:
%%cl_kernel

__kernel void matmul(__global float* A, __global float* B, __global float* C, int N) {
    float sum = 0;
    int2 global_id = (int2)(get_global_id(0), get_global_id(1));

    if (global_id.x >= N || global_id.y >= N) {
        return;
    }
    
    // your code goes here
    
    for (int i = 0; i < N; i++) {
        int aij = global_id.y * N + i;
        int bij = i * N + global_id.x;
        sum  += A[aij] * B[bij];
    }

    int cij = global_id.y * N + global_id.x;
    C[cij] = sum;    
}



In [13]:
# Launch kernel
block_size = (8, 8)
global_size = (A.shape[0], A.shape[1])
print(f'Launching with global_size={global_size}, block_size={block_size}')
n_warmup = 2
n_iters = 50

_ = profile_gpu(matmul, 20,
            queue, 
            global_size, 
            block_size,
            d_a,
            d_b,
            d_c,
            np.int32(N)
            )

Launching with global_size=(1024, 1024), block_size=(8, 8)
matmul took minimum = 11.5420 ms, on average 18.6876 ms, with median 11.5600 ms, variance 155.7592 ms, standard deviation 12.4804 ms.
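
To put the timing in context, a dense N×N multiplication performs roughly 2·N³ floating-point operations (one multiply and one add per inner-loop step). The sketch below converts a measured time into GFLOP/s. The helper name `gflops` is an assumption for illustration, and the 11.542 ms figure is the minimum reported in the output above.

```python
def gflops(n, elapsed_ms):
    """Approximate throughput of an n x n x n matrix multiplication."""
    flops = 2 * n ** 3            # n^3 multiply-add pairs
    return flops / (elapsed_ms * 1e-3) / 1e9

throughput = gflops(1024, 11.542)
print(f"{throughput:.1f} GFLOP/s")
```

Comparing this number before and after each optimization gives a size-independent way to judge progress.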


## Display Results

After running the kernel, copy the result back to the host and check it against NumPy's result.
Refer to the solution if you get stuck.

In [15]:
# Copy result from GPU and display
C = np.empty_like(A)
cl.enqueue_copy(queue, C, d_c)
c_numpy = np.matmul(A, B)
#print("Result matrix C (A x B):", C)
np.testing.assert_almost_equal(C, c_numpy, decimal=3)
# Note: You need to implement the kernel for correct results!
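
The comparison uses a tolerance rather than exact equality because float32 accumulation is sensitive to the order of additions, which generally differs between a GPU kernel and NumPy. A minimal CPU-only sketch of the effect, with a float64 product standing in as the reference (the matrix size 256 and the tolerance are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256)).astype(np.float32)
b = rng.standard_normal((256, 256)).astype(np.float32)

c32 = a @ b                                         # float32 result
c64 = a.astype(np.float64) @ b.astype(np.float64)   # higher-precision reference

max_abs_err = np.max(np.abs(c32 - c64))
print(f"max abs error: {max_abs_err:.2e}")

# Exact equality would be too strict; a small absolute tolerance is enough here.
within_tolerance = np.allclose(c32, c64, rtol=0.0, atol=1e-3)
print(within_tolerance)
```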

# Shared memory caching

Your next task is to reduce the kernel's execution time. A common optimization in matrix multiplication is to reduce global memory traffic. In the simple kernel, each element of the input matrices A and B is read N times from global memory. While we can expect some level of hardware caching under the hood, we can also explicitly cache parts of the A and B matrices in local memory (OpenCL's __local).
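
A rough count shows why caching helps. Without local memory, every work item reads a full row of A and a full column of B, i.e. 2·N global loads per output element. With T×T tiles staged in local memory, each work item loads only one element of A and one of B per tile, so the number of global loads drops by a factor of T. The numbers below are back-of-the-envelope estimates that ignore hardware caches; `global_loads` is a hypothetical helper.

```python
def global_loads(n, tile=None):
    """Estimated global-memory loads for an n x n matmul (hardware caches ignored)."""
    per_item = 2 * n if tile is None else 2 * (n // tile)
    return n * n * per_item       # n^2 work items in total

n = 1024
naive = global_loads(n)
tiled = global_loads(n, tile=16)
print(naive, tiled, naive // tiled)
```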

Below is a helper function that will launch your new kernel. It will:
* validate the sizes of your matrices
* automatically allocate dynamic local memory for matrices A and B
* transfer matrices from host to device and back
* launch your kernel
* verify the correctness of the calculations against the NumPy implementation
* if the results are correct, launch the kernel multiple times to measure execution times
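
The first two bullets boil down to simple shape arithmetic. The sketch below is one hedged way to express them in plain Python. `plan_launch` is a hypothetical name, and it assumes float32 data and one block-sized tile each for A and B.

```python
def plan_launch(a_shape, b_shape, block_size=(16, 16), itemsize=4):
    """Validate shapes and derive the launch geometry for C = A @ B."""
    m, n_a = a_shape
    n_b, p = b_shape
    if n_a != n_b:
        raise ValueError("Inner matrix dimensions must match")
    # Assumption: the global size must be a multiple of the work-group size
    # (required by OpenCL 1.x; newer versions may relax this).
    if p % block_size[0] or m % block_size[1]:
        raise ValueError("Output shape must be a multiple of the block size")
    global_size = (p, m)                       # dimension 0 walks columns, 1 walks rows
    tile_bytes = block_size[0] * block_size[1] * itemsize
    return global_size, 2 * tile_bytes         # one tile for A plus one for B

global_size, local_bytes = plan_launch((512, 1024), (1024, 768))
print(global_size, local_bytes)
```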

In [None]:
def gpu_matrix_multiply(A, B, kernel_code, block_size=(16, 16), warmup=2, iters=100, *args):
    import numpy as np
    import pyopencl as cl
    from OpenCL.helpers import profile_gpu
    M, N_A = A.shape
    N_B, P = B.shape
    print(f"A: MxN ({M}, {N_A}), B: NxP ({N_B}, {P}), C: MxP ({M}, {P})")
    assert N_A == N_B, 'Inner matrix dimensions must match'
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    d_a = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=A)
    d_b = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=B)
    d_c = cl.Buffer(ctx, mf.WRITE_ONLY, M * P * 4)
    prg = cl.Program(ctx, kernel_code).build()
    global_size = (P, M)
    local_size = block_size
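    use_local = '__local' in kernel_code
    shared_mem_bytes = block_size[0] * block_size[1] * 4 * 2 if use_local else 0
    # One block-sized float32 tile each for A and B, passed as the last two kernel arguments
    local_args = (cl.LocalMemory(shared_mem_bytes // 2),) * 2 if use_local else ()
    def launch():
        prg.matmul(queue, global_size, local_size, d_a, d_b, d_c, *args, *local_args)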
    print(f'Launching with global_size={global_size}, local_size={local_size}, shared_mem_bytes={shared_mem_bytes}')
    launch()
    C = np.empty((M, P), dtype=np.float32)
    cl.enqueue_copy(queue, C, d_c)
    ref = np.matmul(A, B)
    np.testing.assert_almost_equal(C, ref, decimal=3)
    _ = profile_gpu(launch, n_warmup=warmup, n_iters=iters)

## Optimization task

Write an optimized kernel that caches blocks of the A and B matrices in local memory:
- for simplicity, assume that the shapes of all matrices are multiples of the block size
- make sure there are no data races: synchronize accesses to local memory with `barrier(CLK_LOCAL_MEM_FENCE)`

Implement the following algorithm:
- for each cache block
    - load one block of A and one block of B from global memory into the declared local memory (each block has the work-group size, 16x16 for the launch below)
        - the main difficulty lies in computing the global memory indices while iterating over the cached blocks
    - compute the partial matrix multiplication sum from the data cached in local memory and accumulate it in a private (per-work-item) variable
        - here you only need the local ids
- write the accumulated sum to global memory C
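
The algorithm above can be emulated on the CPU with NumPy slices standing in for local-memory tiles. This is only a sketch of the blocking scheme, not the kernel itself: `tiled_matmul` is an illustrative name, and the shapes are assumed to be multiples of the tile size.

```python
import numpy as np

def tiled_matmul(a, b, tile=16):
    """Blocked C = A @ B; each (i, j) pair plays the role of one work group."""
    m, n = a.shape
    _, p = b.shape
    c = np.zeros((m, p), dtype=np.float32)
    for i in range(0, m, tile):              # tile row of C
        for j in range(0, p, tile):          # tile column of C
            acc = np.zeros((tile, tile), dtype=np.float32)
            for k in range(0, n, tile):      # walk tiles along the shared dimension
                a_tile = a[i:i + tile, k:k + tile]   # would be staged in local memory
                b_tile = b[k:k + tile, j:j + tile]   # would be staged in local memory
                acc += a_tile @ b_tile               # partial sum from cached data
            c[i:i + tile, j:j + tile] = acc          # write the accumulated tile to C
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)
print(np.allclose(tiled_matmul(a, b), a @ b, atol=1e-4))
```

In an actual kernel the two tile loads and the partial product are separated by barriers, so that no work item reads a tile before it is fully loaded or overwrites it while others are still using it.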

In [None]:
shared_mem_kernel = """

__kernel void matmul(__global float* A, __global float* B, __global float* C, int N, __local float* slm_A, __local float* slm_B) {
    // Dimension 0 indexes columns and dimension 1 indexes rows (global_size = (P, M))
    int col = get_global_id(0);
    int row = get_global_id(1);
    int local_col = get_local_id(0);
    int local_row = get_local_id(1);
    // your code goes here
}
"""

args = (np.int32(N),)
gpu_matrix_multiply(A, B, shared_mem_kernel, (16, 16), 2, 100, *args)

Refer to the solution if you get stuck.

# Non-square matrices

Adapt your solution to work with non-square matrices. You will need to:
* pass the sizes of the matrices to your kernel, since non-square matrices can no longer be described by a single N.
* use these shapes in your kernel to compute element indices.
* think about how the for-loop iterates over consecutive cached blocks. Make sure to iterate from the perspective of the output matrix.
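
For non-square inputs the row stride differs per matrix: A (M×N) has stride N, while B (N×P) and C (M×P) have stride P. The sketch below emulates a single work item on flat row-major arrays to show that index arithmetic. The function name `work_item`, the tile size, and the small shapes are assumptions for illustration.

```python
import numpy as np

def work_item(a_flat, b_flat, row, col, n, p, tile=4):
    """Compute C[row, col] the way one work item would, tile by tile."""
    acc = np.float32(0.0)
    for t in range(n // tile):               # number of tiles along the inner dimension N
        for k in range(t * tile, (t + 1) * tile):
            # Strides: N for A, P for B
            acc += a_flat[row * n + k] * b_flat[k * p + col]
    return acc

m, n, p = 8, 12, 4
rng = np.random.default_rng(0)
a = rng.standard_normal((m, n)).astype(np.float32)
b = rng.standard_normal((n, p)).astype(np.float32)

c = np.empty((m, p), dtype=np.float32)
for row in range(m):
    for col in range(p):
        # In a flat C buffer this element would sit at row * p + col
        c[row, col] = work_item(a.ravel(), b.ravel(), row, col, n, p)

print(np.allclose(c, a @ b, atol=1e-5))
```

Note that the number of tiles depends only on the inner dimension N, while the output position (row, col) is bounded by M and P.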

In [None]:
non_square_kernel = """

__kernel void matmul(__global float* A, __global float* B, __global float* C, int M, int N, int P, __local float* slm_A, __local float* slm_B) {
    // Dimension 0 indexes columns (0..P-1) and dimension 1 indexes rows (0..M-1), matching global_size = (P, M)
    int col = get_global_id(0);
    int row = get_global_id(1);
    // your code goes here
}
"""

A_ = np.random.randn(512, 1024).astype(np.float32)
B_ = np.random.randn(1024, 768).astype(np.float32)
M, N, P = A_.shape[0], A_.shape[1], B_.shape[1]
args = (np.int32(M), np.int32(N), np.int32(P))
gpu_matrix_multiply(A_, B_, non_square_kernel, (16, 16), 2, 100, *args)

Refer to the solution if you get stuck.