This example contains an implementation of single-precision general matrix multiplication (SGEMM). The implementation is based on the one in MAGMA.
How to demo
The demo contains a script that computes the matrix product of A (m x k) and B (k x n). The demo can be run with the following command.
python sgemm.py [--gpu GPU_ID] [--m m] [--n n] [--k k]
What this demo contains
In this example, we work on an SGEMM kernel that requires a complete interface to
cuLaunchKernel (e.g. grid size and size of shared memory), which is not provided by CuPy's high-level kernel abstractions such as cupy.ElementwiseKernel.
CuPy arrays work regardless of the underlying memory layouts thanks to the ndarray abstraction. As is the case in this example, the abstraction does not need to be used when the underlying memory layouts of the arrays already match the ones expected by a kernel.
The SGEMM kernel expects input and output arrays to be in Fortran-contiguous memory layout, and this layout is enforced by cupy.asfortranarray.
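A quick sketch of what the layout conversion does. NumPy is used here because it shares the same layout semantics as CuPy; cupy.asfortranarray behaves the same way on GPU arrays.

```python
import numpy as np

# NumPy arrays are C-contiguous (row-major) by default.
a = np.arange(6, dtype=np.float32).reshape(2, 3)
assert a.flags['C_CONTIGUOUS']

# asfortranarray produces a column-major copy, which is the layout
# the SGEMM kernel expects for its input and output matrices.
fa = np.asfortranarray(a)
assert fa.flags['F_CONTIGUOUS']
assert np.array_equal(a, fa)  # values unchanged; only the memory layout differs
```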
How to dynamically compile and launch a kernel function written in CUDA C
The cupy.RawKernel class is used to compile CUDA code written in CUDA C/C++.
The class takes the code as a string and the name of the kernel as constructor arguments.
The instance is callable; the CUDA code is compiled and then invoked when the instance is called.
The compiled code is cached, so the compilation step is skipped after the first call.
Also, the CUDA code can be modified at the Python level because it is simply text.
In this example, C macros that determine the distribution of data to threads are specified at runtime.
"extern C" needs to be put on top of the kernel that is called.
How to supply grid size, block size and shared memory size on launching a kernel function
A cupy.RawKernel object allows you to call the kernel with CUDA's cuLaunchKernel interface.
In other words, you have control over grid size, block size, shared memory size, and stream.
At this level of interface, it becomes straightforward to replace host programs (.cu files) that call CUDA kernels with Python code.
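As a sketch, the host-side launch arithmetic can live entirely in Python. The tile size, shared-memory estimate, and helper below are illustrative assumptions, not the actual parameters used by the SGEMM demo.

```python
import math

def launch_params(m, n, tile=16):
    """Grid/block sizes for an m x n output tiled by tile x tile threads."""
    grid = (math.ceil(n / tile), math.ceil(m / tile), 1)
    block = (tile, tile, 1)
    # Hypothetical budget: two float32 tiles staged in shared memory (bytes).
    shared_mem = 2 * tile * tile * 4
    return grid, block, shared_mem

grid, block, shared_mem = launch_params(1000, 1000)

# With CuPy available, a RawKernel instance `kernel` would then be launched as
#   kernel(grid, block, (a, b, c, m, n, k), shared_mem=shared_mem)
# mirroring the grid/block/shared-memory arguments of cuLaunchKernel.
```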