# Local averages

In this hands-on exercise, your task is to optimize the performance of a kernel that computes local averages.
The input is a one-dimensional array of size **N**, and the output is a different one-dimensional array of size **N/4**, where each element **i** is the average of the 4 consecutive input elements starting at index **4i**.

Do not worry if the definition is still a bit vague at this stage: the code will be presented soon, and you will see that it is self-explanatory.
But first, let us import the necessary Python modules, initialize the GPU, and create the necessary arrays.

In [None]:
import numpy as np
import pycuda.driver as drv
from pycuda.compiler import SourceModule

In [None]:
# Initialize pycuda and create a device context
drv.init()
context = drv.Device(0).make_context()

#get compute capability for compiling CUDA kernels
devprops = { str(k): v for (k, v) in context.get_device().get_attributes().items() }
cc = str(devprops['COMPUTE_CAPABILITY_MAJOR']) + str(devprops['COMPUTE_CAPABILITY_MINOR'])

In [None]:
# number of input elements; must be a multiple of 4 and fit in a 32-bit integer
N = np.int32(2e6)
A = np.random.randn(N).astype(np.float32)
B1 = np.zeros(N // 4).astype(np.float32)
B2 = np.zeros_like(B1)

Now that we have the right data structures, we can write a function to compute our local averages.

In [None]:
def local_averages(A, B, N):
    for i in range(0, N // 4):
        temp = 0.0
        for j in range(0, 4):
            temp = temp + A[(i * 4) + j]
        B[i] = temp / 4.0
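
As a cross-check on the loop above, the same computation can be sketched with a NumPy reshape, assuming the input length is a multiple of 4. The name `local_averages_vectorized` and the tiny input array are illustrative only and are not part of the exercise.

```python
import numpy as np

def local_averages_vectorized(A):
    """Average each group of 4 consecutive elements (len(A) must be a multiple of 4)."""
    # View the 1-D input as rows of 4 elements, then average along each row
    return A.reshape(-1, 4).mean(axis=1)

# Tiny illustrative input: 8 values give 2 averages
A_small = np.arange(8, dtype=np.float32)
B_small = local_averages_vectorized(A_small)
print(B_small)  # [1.5 5.5]
```

Element 0 of the output averages inputs 0 to 3, and element 1 averages inputs 4 to 7, which matches the definition given at the start.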

We can now execute and time our code. This gives us a reference output (for testing purposes) and a first impression of the execution time on the CPU.

In [None]:
%%time

local_averages(A, B1, N)

It is now time to introduce the naive CUDA code and save it to a local file, as done in the previous exercise. The main difference this time is that the code is already correct.

In [None]:
%%writefile local_averages.cu

__global__ void local_averages_kernel(float * A, float * B, int size_B)
{
    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
    
    if ( index < size_B )
    {
        float temp = 0.0;
        
        for ( int j = 0; j < 4; j++ )
        {
            temp = temp + A[(index * 4) + j];
        }
        B[index] = temp / 4.0;
    }
}
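
To help read the kernel, its indexing can be traced on the CPU. The sketch below is plain Python, not GPU code; the helper `elements_read` and the block size of 1024 are assumptions chosen for illustration. It lists which elements of `A` a given thread reads and which element of `B` it writes.

```python
def elements_read(index, group=4):
    """Indices of A read by the thread with this global index, in loop order."""
    return [index * group + j for j in range(group)]

# Global thread index = blockIdx.x * blockDim.x + threadIdx.x
block_dim = 1024  # assumed threads per block
for block_idx, thread_idx in [(0, 0), (0, 1), (0, 2), (1, 0)]:
    index = block_idx * block_dim + thread_idx
    print(f"block {block_idx}, thread {thread_idx}: "
          f"reads A{elements_read(index)}, writes B[{index}]")
```

Note that for a fixed loop iteration `j`, neighbouring threads read elements of `A` that are 4 floats apart, while they write neighbouring elements of `B`. This may be worth keeping in mind when looking for ways to improve the kernel.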

Your goal at this point is to understand how this kernel works, and improve its performance. But before doing that, let us allocate memory on the GPU, and prepare the execution environment.

In [None]:
#first we allocate GPU memory and copy the data to the GPU
args = [A, B2]
gpu_args = []
for arg in args:
    gpu_args.append(drv.mem_alloc(arg.nbytes))
    drv.memcpy_htod(gpu_args[-1], arg)
#the kernel expects size_B as a 32-bit integer
gpu_args.append(np.int32(N // 4))

In [None]:
#setup the thread block dimensions (x, y, z)
threads = (1024, 1, 1)
#setup the number of thread blocks in (x, y, z)
grid = (int(np.ceil((N/4)/float(threads[0]))), 1, 1)
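
The grid size above is a rounded-up division: there must be enough blocks for every output element to get a thread. A minimal sketch of the same calculation with integer arithmetic follows; the helper name `blocks_needed` and the sample sizes are hypothetical.

```python
def blocks_needed(n_outputs, threads_per_block):
    """Smallest number of blocks whose threads cover n_outputs elements."""
    # Integer ceiling division: round up without going through floats
    return (n_outputs + threads_per_block - 1) // threads_per_block

print(blocks_needed(2048, 1024))    # 2: exact fit, no spare threads
print(blocks_needed(2049, 1024))    # 3: one extra block for the last element
print(blocks_needed(500000, 1024))  # 489
```

When the division is not exact, the last block contains threads with no element to process, which is why the kernel guards its work with `if ( index < size_B )`.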

It is time to execute the naive kernel, and measure its performance.

In [None]:
#we have to pass the source code as a string, so we first read it from disk
with open('local_averages.cu', 'r') as f:
    kernel_string = f.read()

#compile the kernel
vector_add = SourceModule(kernel_string, arch='compute_' + cc, code='sm_' + cc,
                          cache_dir=False).get_function("local_averages_kernel")

#Make sure all previous operations on the GPU have completed
context.synchronize()
#Create events for measuring time
start = drv.Event()
end = drv.Event()

#Run the kernel
start.record()
vector_add(*gpu_args, block=threads, grid=grid, stream=None, shared=0)
end.record()

#Wait for the kernel to finish
context.synchronize()

#Print how long it took
print("local_averages_kernel took", end.time_since(start), "ms.")

#copy output data back from GPU (gpu_args[1] holds the output array B)
drv.memcpy_dtoh(B2, gpu_args[1])

#check for correctness
print("PASSED" if np.allclose(B2, B1, atol=1e-6) else "FAILED")
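
An elapsed time on its own is hard to compare across array sizes. One common way to put it in context is effective memory bandwidth, sketched here under the assumption that the kernel reads every input element once and writes every output element once. The helper `effective_bandwidth_gbs` is hypothetical, and the numbers in the usage line are made up for illustration; they are not measurements.

```python
def effective_bandwidth_gbs(n_inputs, elapsed_ms, bytes_per_element=4):
    """Effective bandwidth in GB/s for one pass of the local averages kernel."""
    n_outputs = n_inputs // 4
    # Assumption: each input is read once and each output is written once
    bytes_moved = (n_inputs + n_outputs) * bytes_per_element
    seconds = elapsed_ms * 1e-3
    return bytes_moved / seconds / 1e9

# Illustrative values only: 4 million floats processed in 0.1 ms
print(effective_bandwidth_gbs(n_inputs=4_000_000, elapsed_ms=0.1), "GB/s")
```

With these made-up values the result is roughly 200 GB/s. Plugging in the time printed by `end.time_since(start)` gives a figure that can be compared against the memory bandwidth of your GPU.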

It is now your turn to change the CUDA code and improve the performance of the kernel.

So that you do not lose track of the naive kernel's execution time, the previous cell is replicated below this one. Go back to the cell containing the CUDA code, modify the code, run that cell, and then run the one below. The cell below also resets the output array before launching the kernel.

In [None]:
#we have to pass the source code as a string, so we first read it from disk
with open('local_averages.cu', 'r') as f:
    kernel_string = f.read()

#compile the kernel
vector_add = SourceModule(kernel_string, arch='compute_' + cc, code='sm_' + cc,
                          cache_dir=False).get_function("local_averages_kernel")

#make sure the output data is clean
B2 = np.zeros_like(B1)
drv.memcpy_htod(gpu_args[1], B2)

#Make sure all previous operations on the GPU have completed
context.synchronize()
#Create events for measuring time
start = drv.Event()
end = drv.Event()

#Run the kernel
start.record()
vector_add(*gpu_args, block=threads, grid=grid, stream=None, shared=0)
end.record()

#Wait for the kernel to finish
context.synchronize()

#Print how long it took
print("local_averages_kernel took", end.time_since(start), "ms.")

#copy output data back from GPU (gpu_args[1] holds the output array B)
drv.memcpy_dtoh(B2, gpu_args[1])

#check for correctness
print("PASSED" if np.allclose(B2, B1, atol=1e-6) else "FAILED")