# Introduction to GPU Programming with Python
## Alternatives to Numba: CuPy and PyCUDA


### CuPy

CuPy is a GPU array backend that implements a subset of the NumPy interface.

First, install CuPy and import the NumPy and CuPy modules:

In [None]:
!pip install cupy --no-index

In [None]:
import numpy as np
import cupy as cp

Now we create the matrices A, B, and C in CPU (host) memory:

In [None]:
A = np.random.rand(512, 512).astype(np.float32)
B = np.random.rand(512, 512).astype(np.float32)
C = np.zeros((512, 512), dtype=np.float32)

Then we copy the data to GPU (device) memory:

In [None]:
d_A = cp.asarray(A)
d_B = cp.asarray(B)
d_C = cp.asarray(C)

Then we use NumPy's built-in matrix multiplication function and run it on the CPU:

In [None]:
%timeit C = np.matmul(A,B)

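Then we use CuPy's built-in matrix multiplication function and run it on the GPU. CuPy launches GPU work asynchronously, so we synchronize inside the timed statement; otherwise %timeit would mostly measure the kernel launch overhead:

In [None]:
%timeit d_C = cp.matmul(d_A, d_B); cp.cuda.Device().synchronize()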
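
Timing alone does not show that the two results agree. The following is a minimal sketch, not part of the original notebook, that compares the CuPy product with the NumPy one. It uses `cp.asnumpy` for the device-to-host copy; the `to_host` helper, the fixed random seed, and the NumPy fallback (taken when CuPy or a GPU is unavailable) are illustrative assumptions.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable GPU is present
    xp = cp
except Exception:
    cp = None
    xp = np  # CPU fallback so the sketch still runs without a GPU


def to_host(x):
    """Return x as a NumPy array, copying from the GPU if necessary."""
    return cp.asnumpy(x) if cp is not None else np.asarray(x)


rng = np.random.default_rng(0)
A = rng.random((512, 512), dtype=np.float32)
B = rng.random((512, 512), dtype=np.float32)

d_A = xp.asarray(A)        # host -> device copy when xp is CuPy
d_B = xp.asarray(B)
d_C = xp.matmul(d_A, d_B)  # runs on the GPU when xp is CuPy

C_ref = np.matmul(A, B)    # CPU reference result
max_err = float(np.abs(to_host(d_C) - C_ref).max())
print("backend:", xp.__name__, "max abs error:", max_err)
```

Small float32 differences between the CPU and GPU results are expected, because the two libraries may add the products in a different order.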

### PyCUDA

PyCUDA gives you easy, Pythonic access to NVIDIA's CUDA parallel computation API.
The idea is that you write CUDA kernels in C/C++ and wrap them as Python objects, while variables and kernel execution are managed from Python.

First, import the PyCUDA modules:

In [None]:
import numpy as np
from pycuda import compiler
from pycuda import driver as cuda
import pycuda.autoinit  # initializes CUDA and creates a context on import

Write CUDA C code and feed it into the constructor of a pycuda.compiler.SourceModule:

In [None]:
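mod = compiler.SourceModule("""
    __global__ void MatrixMultKernel(float *A, float *B, float *C, int Width)
    {
        // Global row and column of the C element computed by this thread
        int row = blockIdx.y*blockDim.y + threadIdx.y;
        int col = blockIdx.x*blockDim.x + threadIdx.x;
        if (row >= Width || col >= Width) return;  // guard partially filled blocks

        float tmp=0;
        for(int k=0; k<Width; k++){
            tmp += A[row*Width + k] * B[k*Width + col];
        }
        C[row*Width + col] = tmp;
    }
""")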

If there aren’t any errors, the code is now compiled and loaded onto the device.
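
In CUDA, a thread's global position along each axis is `blockIdx * blockDim + threadIdx`. A kernel that indexes with `threadIdx` alone makes every block write the same first tile. The following CPU-only sketch emulates that index arithmetic in plain Python to make the difference visible; `covered_elements` and its `use_block_index` flag are hypothetical names used only for illustration, and the matrix is kept small so the loops stay fast.

```python
import numpy as np


def covered_elements(width, block, grid, use_block_index=True):
    """Mark which elements of a width x width matrix some thread would write.

    Emulates the CUDA built-ins threadIdx, blockIdx and blockDim on the CPU.
    block and grid are (x, y) tuples.
    """
    covered = np.zeros((width, width), dtype=bool)
    for by in range(grid[1]):
        for bx in range(grid[0]):
            for ty in range(block[1]):
                for tx in range(block[0]):
                    row = (by * block[1] if use_block_index else 0) + ty
                    col = (bx * block[0] if use_block_index else 0) + tx
                    if row < width and col < width:  # bounds guard, as in a kernel
                        covered[row, col] = True
    return covered


full = covered_elements(8, block=(4, 4), grid=(2, 2))
tile_only = covered_elements(8, block=(4, 4), grid=(2, 2), use_block_index=False)
print(full.sum(), "of 64 elements written with block and thread indices")
print(tile_only.sum(), "of 64 elements written with thread indices only")
```

With thread indices only, the four blocks all write the same 4x4 tile, so just 16 of the 64 elements are ever computed.
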
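Now let's define the grid. Each thread computes one element of C, so we use 32x32 threads per block and enough blocks to cover the whole matrix: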

In [None]:
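NumThreads = 32  # threads per block along each dimension
NumBlocks = (C.shape[0] + (NumThreads - 1)) // NumThreads  # ceiling division
blockdim = (NumThreads, NumThreads, 1)  # PyCUDA documents block as an (x, y, z) tuple
griddim = (NumBlocks, NumBlocks)
print(griddim, blockdim)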
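
The expression used for the number of blocks is an integer ceiling division: it rounds up so that the last, possibly partial, block is still launched. A short sketch of the arithmetic follows; `blocks_needed` is a hypothetical helper name, and the sizes 513 and 1000 are only illustrative.

```python
def blocks_needed(n, threads_per_block):
    """Smallest number of blocks whose threads cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


for n in (512, 513, 1000):
    nb = blocks_needed(n, 32)
    idle = nb * 32 - n  # threads past the end of the data along this dimension
    print(n, "->", nb, "blocks,", idle, "idle threads")
```

For 512 the division is exact (16 blocks, no idle threads). For sizes that are not a multiple of 32, the extra threads fall outside the matrix, which is why a kernel needs a bounds check before it reads or writes.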

Now we create the matrices A, B, and C in CPU memory:

In [None]:
A = np.random.rand(512, 512).astype(np.float32)
B = np.random.rand(512, 512).astype(np.float32)
C = np.zeros((512, 512), dtype=np.float32)

Now we allocate GPU memory for the same matrices:

In [None]:
d_A = cuda.mem_alloc(A.nbytes)
d_B = cuda.mem_alloc(B.nbytes)
d_C = cuda.mem_alloc(C.nbytes)
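
`cuda.mem_alloc` takes a size in bytes, which is why the cell above passes `nbytes` rather than the number of elements. The sketch below shows how that size depends on the dtype; the array names `host32` and `host64` are illustrative only.

```python
import numpy as np

host32 = np.zeros((512, 512), dtype=np.float32)
host64 = np.zeros((512, 512))  # NumPy's default dtype is float64

# nbytes is the element count times the size of one element
print(host32.nbytes, "bytes for float32")  # 512 * 512 * 4
print(host64.nbytes, "bytes for float64")  # 512 * 512 * 8
```

The kernel declares its arguments as `float *`, so the host arrays are converted to `np.float32` before they are copied; a float64 array would be twice as large and would be misread by the kernel.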

Now we copy data from CPU to GPU:

In [None]:
cuda.memcpy_htod(d_A, A)
cuda.memcpy_htod(d_B, B)

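We get a reference to the compiled kernel, which is a pycuda.driver.Function, and call it. The kernel takes four arguments, so Width is passed as np.int32, and the launch configuration goes in the block and grid keyword arguments. Kernel launches are asynchronous, so we synchronize inside the timed statement:

In [None]:
matmul = mod.get_function("MatrixMultKernel")

In [None]:
%timeit matmul(d_A, d_B, d_C, np.int32(A.shape[0]), block=blockdim, grid=griddim); cuda.Context.synchronize()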
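
The kernel leaves its result in `d_C` in GPU memory, so it has to be copied back to the host before it can be inspected; in PyCUDA that copy is `cuda.memcpy_dtoh(dest, src)`. The sketch below is a host-side check that could follow the copy. `matches_reference` is a hypothetical helper, the tolerance is an assumption suited to float32 sums of this length, and the GPU calls appear only as comments so that the block also runs on a machine without a GPU.

```python
import numpy as np


def matches_reference(C, A, B, rtol=1e-4):
    """Return True if C equals the NumPy product A @ B within rtol."""
    return bool(np.allclose(C, A @ B, rtol=rtol))


# Notebook usage after the kernel has run (requires a GPU, hence commented out):
#   cuda.memcpy_dtoh(C, d_C)            # device -> host copy into the NumPy array C
#   print(matches_reference(C, A, B))

# CPU-only demonstration of the check itself
rng = np.random.default_rng(0)
A_demo = rng.random((64, 64), dtype=np.float32)
B_demo = rng.random((64, 64), dtype=np.float32)
good = A_demo @ B_demo
bad = good.copy()
bad[0, 0] += 1.0  # corrupt one element
print(matches_reference(good, A_demo, B_demo), matches_reference(bad, A_demo, B_demo))
```

A relative tolerance is used instead of exact equality because the GPU kernel and NumPy may accumulate the float32 products in a different order.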