# Vector Add

The vector addition kernel is one of the simplest GPU kernels, which makes it a good vehicle for explaining basic GPU programming concepts.

In this exercise you will read through the code and look for the part that still needs to be implemented. We start by importing the modules we need to compile and run GPU code.

In [None]:
import numpy as np
import pycuda.driver as drv
from pycuda.compiler import SourceModule

In [None]:
# Initialize pycuda and create a device context
drv.init()
context = drv.Device(0).make_context()

#get compute capability for compiling CUDA kernels
devprops = { str(k): v for (k, v) in context.get_device().get_attributes().items() }
cc = str(devprops['COMPUTE_CAPABILITY_MAJOR']) + str(devprops['COMPUTE_CAPABILITY_MINOR'])
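
To make the role of `cc` concrete, here is a minimal sketch of how the two-digit compute capability string turns into the `arch` and `code` values passed to the compiler later in this notebook. The helper `nvcc_targets` and the example device (major 8, minor 6) are assumptions made for illustration; they are not part of the notebook or of PyCUDA.

```python
def nvcc_targets(major, minor):
    """Build the arch/code strings from a compute capability (hypothetical helper)."""
    cc = str(major) + str(minor)          # same concatenation as in the cell above
    return "compute_" + cc, "sm_" + cc    # virtual architecture, real architecture

# Hypothetical device with compute capability 8.6
arch, code = nvcc_targets(8, 6)
print(arch, code)
```

On a real device the two numbers come from the `COMPUTE_CAPABILITY_MAJOR` and `COMPUTE_CAPABILITY_MINOR` attributes queried above.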

Now we need to implement our GPU kernel, which is written in CUDA. The following cell writes its contents to a file named vector_add.cu, which we will later compile into a GPU kernel.

In [None]:
%%writefile vector_add.cu

__global__ void vec_add_kernel(float *c, float *a, float *b, int n) {
    int i = 0;   // Oops! Something is not right here, please fix it!
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

Before we continue with our GPU kernel, we set up its input and output data.

In [None]:
n = np.int32(5e7)
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros_like(b)

We can also measure how long it takes to compute the element-wise addition of a and b on the CPU with NumPy.

In [None]:
%%time

d = a+b
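
A single `%%time` measurement can be noisy. As a rough sketch, the addition can be repeated a few times and the best time kept. The helper `time_cpu_add`, the repeat count, and the smaller array size are assumptions chosen to keep the example quick; they do not appear in the notebook.

```python
import time
import numpy as np

def time_cpu_add(a, b, repeats=5):
    """Return the best wall-clock time in milliseconds over several runs of a + b."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        d = a + b                                    # the operation being timed
        elapsed_ms = (time.perf_counter() - t0) * 1e3
        best = min(best, elapsed_ms)
    return best

# Smaller arrays than in the notebook, to keep this sketch fast
n_small = 1_000_000
x = np.random.randn(n_small).astype(np.float32)
y = np.random.randn(n_small).astype(np.float32)
print("best of 5 CPU additions:", time_cpu_add(x, y), "ms")
```

Reporting milliseconds makes the number directly comparable with the GPU timing printed later, which is also given in milliseconds.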

Now let's compile our CUDA kernel and see how long it takes to perform the same computation on the GPU.

In [None]:
#first we allocate GPU memory and copy the data to the GPU
args = [c, a, b]
gpu_args = []
for arg in args:
    gpu_args.append(drv.mem_alloc(arg.nbytes))
    drv.memcpy_htod(gpu_args[-1], arg)
gpu_args.append(n)

Before compiling our kernel, we set up the kernel launch parameters.

In [None]:
#setup the thread block dimensions (x, y, z)
threads = (1024, 1, 1)
#setup the number of thread blocks in (x, y, z)
grid = (int(np.ceil(n/float(threads[0]))), 1, 1)
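
The grid size above is a ceiling division: we need enough blocks of 1024 threads to cover all n elements. The sketch below redoes that arithmetic with plain integers for the problem size used in this notebook. The helper name `blocks_needed` is an assumption made for illustration.

```python
import numpy as np

def blocks_needed(n_elements, threads_per_block):
    """Ceiling division: the smallest number of blocks whose threads cover n_elements."""
    return (int(n_elements) + threads_per_block - 1) // threads_per_block

n_elements = 50_000_000     # same problem size as in the notebook
threads_per_block = 1024    # same block size as in the notebook

num_blocks = blocks_needed(n_elements, threads_per_block)
total_threads = num_blocks * threads_per_block
idle_threads = total_threads - n_elements   # threads with no element to process

print(num_blocks, total_threads, idle_threads)

# The integer formula agrees with the np.ceil expression used in the cell above
assert num_blocks == int(np.ceil(n_elements / float(threads_per_block)))
```

Because n is not a multiple of 1024, the last block contains threads that have no element to work on. This is why the kernel guards its body with `if (i < n)`.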

Now we compile and run the kernel, measure the execution time, copy the data back from GPU memory to our NumPy array c, and check whether the result is correct.

This is all in one cell because you will have to modify the CUDA source code and run this cell again to check if you've completed the assignment.

In [None]:
#we have to pass the source code as a string, so we first read it from disk
with open('vector_add.cu', 'r') as f:
    kernel_string = f.read()

#compile the kernel
vector_add = SourceModule(kernel_string, arch='compute_' + cc, code='sm_' + cc,
                          cache_dir=False).get_function("vec_add_kernel")

#make sure the output data is clean
c = np.zeros_like(b)
drv.memcpy_htod(gpu_args[0], c)

#Make sure all previous operations on the GPU have completed
context.synchronize()
#Create events for measuring time
start = drv.Event()
end = drv.Event()

#Run the kernel
start.record()
vector_add(*gpu_args, block=threads, grid=grid, stream=None, shared=0)
end.record()

#Wait for the kernel to finish
context.synchronize()

#Print how long it took
print("vec_add_kernel took", end.time_since(start), "ms.")

#copy output data back from GPU
drv.memcpy_dtoh(c, gpu_args[0])

#check for correctness
print("PASSED" if np.allclose(c, a+b, atol=1e-6) else "FAILED")
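
Vector addition does very little arithmetic per byte of data, so a useful way to interpret the measured kernel time is to convert it into an effective memory bandwidth. The sketch below assumes the kernel reads a and b once and writes c once, with 4 bytes per float32 element. The helper `effective_bandwidth_gb_s` and the 2.0 ms example time are hypothetical and not a measured result.

```python
def effective_bandwidth_gb_s(n_elements, elapsed_ms, bytes_per_element=4, arrays_touched=3):
    """Estimate the achieved memory bandwidth in GB/s (hypothetical helper).

    Assumes each of the three arrays (a, b, c) is read or written exactly once.
    """
    bytes_moved = n_elements * bytes_per_element * arrays_touched
    seconds = elapsed_ms * 1e-3
    return bytes_moved / seconds / 1e9

example_ms = 2.0  # hypothetical kernel time; substitute end.time_since(start)
print(effective_bandwidth_gb_s(50_000_000, example_ms), "GB/s")
```

Comparing this number with the peak memory bandwidth listed for your GPU gives a rough idea of how efficiently the kernel uses the hardware.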