Showing the uselessness of trying to make your kernel fast through the use of local memory. <br>
Modern GPUs have effective on-chip caches which can provide much of the benefit of Local Memory <br>
but without programmer intervention (pg.8 of https://comp.anu.edu.au/courses/acceleratorsHPC/slides/OpenCLMemory.pdf). <br><br>
The kernel here does matmul. 

In [1]:
import pyopencl as cl
import numpy as np
import time

In [2]:
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

The classic "naive" matmul kernel. 

In [3]:
krnl_1 = """
__kernel void mat_mul_1(__global float *res, __global float *first, __global float *second){
    
    int idx_row = get_global_id(0);
    int idx_col = get_global_id(1);
    int size = get_global_size(0);
    float val = 0; 
    for(int i=0; i<size; ++i){
        val += first[size*idx_row+i] * second[size*i+idx_col];
    }
    res[size*idx_row+idx_col] = val;
}
"""
krnl_1_prog = cl.Program(ctx, krnl_1).build()

Trying to optimize matmul via the use of local memory. The basic idea is to copy the data in local memory once, and then <br>
do the actual computations by accessing the data from the local memory. If there was no caching, it makes sense that this<br>
would be faster as accessing from local memory is faster than from global. But if the global memory data are cached, this <br>
probably won't add performance benefits. 

About "barrier" and memory fence flags: https://justpaste.it/cymn1 <br>
The code is based on "matmul2.cl" in this: https://public.websites.umich.edu/~smeyer/cuda/Preso07-OpenCL.pdf

In [4]:
krnl_2 = """
__kernel void mat_mul_2(__global float *res, __global float *first, __global float *second/*, 
__local float *first_local, __local float *second_local*/){
    
    int g_id_0 = get_global_id(0);
    int g_id_1 = get_global_id(1);
    
    int l_id_0 = get_local_id(0);
    int l_id_1 = get_local_id(1);
    
    int g_size = get_global_size(0);
    int l_size = get_local_size(0);
   
   // should make these dynamic for generic kernels.
   // but there's some problem doing that for some reason. 
    __local float first_local[8*1024];
    __local float second_local[8*1024];
   
   float val = 0; 
   for(int i=0; i<g_size; i+=l_size){
       first_local[l_id_0*g_size+l_id_1+i] = first[g_id_0*g_size+l_id_1+i];
       second_local[(l_id_0+i)*l_size+l_id_1] = second[(l_id_0+i)*g_size+g_id_1];
      
       // I don't think you need CLK_GLOBAL_MEM_FENCE here. 
       // The global memory access is there to serve the local memory.
       // The local memory write is only finished once the global memory
       // read is done. And when the local memory write is done, the
       // memory operations are all over. 
       
       barrier(CLK_LOCAL_MEM_FENCE);
       
       for(int j=0; j<l_size; ++j){
           val += first_local[l_id_0*g_size+j+i] * second_local[(j+i)*l_size+l_id_1];
       }
   }
   res[g_id_0*g_size+g_id_1] = val;
}
"""
krnl_2_prog = cl.Program(ctx, krnl_2).build()

Defining a function for memory allocation stuff to simulate a more or less practical scenario. <br><Br>
If I use the same buffers for multiple runs of the kernel, then the runs after the first run are signficantly faster. <br>
This is probably because of further cache-based optimization. It doesn't seem realistic to presume <br>
that the same kernel with the same buffers will be executed in very short span again and again; this is <br>
why I am allocating new buffers in every run. In reality, it is probably a mix of allocating new buffers vs <br>
reusing buffers, but I guess I am going for the worst case scenario. Actually, an even worst-case scenario would <br>
be to wait a bit after allocation and before execution such that the cache is completely cleared of the buffers. But <br>
this doesn't seem realistic. One can play with these different scenarios. My experiments, till now, suggest that there are more or less <br>
3 scenarios: <br><br>

1.) When the buffers are allocated for the first time and you execute the kernel after clearing caches. - Slowest <br>
2.) When the buffers are allocated for the first time and you execute the kernel before the caches get cleared. - Second Fastest <br>
3.) You execute a kernel that has been executed before using the same buffers as the previous execution before caches get cleared. - Fastest <br><br>

I am simulating simulating 2.) here. 

In [5]:
def memory_stuff():
    
    mf = cl.mem_flags

    sz = 1024  # the size of the matrix is sz x sz.

    first_host = np.random.uniform(0, 1, size=(sz, sz)).astype(np.float32)
    first_device = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=first_host)

    second_host = np.random.uniform(0, 1, size=(sz, sz)).astype(np.float32)
    second_device = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=second_host)

    # first_local = cl.LocalMemory(sz*sz*4)
    # second_local = cl.LocalMemory(sz*sz*4)

    result = cl.Buffer(ctx, mf.WRITE_ONLY, first_host.nbytes)
    
    return sz, first_device, second_device, result

Running the kernels multiple times and averaging the times.

In [6]:
krnl_1_obj = krnl_1_prog.mat_mul_1
krnl_2_obj = krnl_2_prog.mat_mul_2

start = end = None
krnl_1_exec_time = krnl_2_exec_time = 0

for i in range(50):
    
    sz, first_device, second_device, result = memory_stuff()

    start = time.perf_counter()
    krnl_1_obj(queue, (sz, sz), (8, 8), result, first_device, second_device)
    end = time.perf_counter()
    
    krnl_1_exec_time += (end-start)
    
for i in range(50):
    
    sz, first_device, second_device, result = memory_stuff()

    start = time.perf_counter()
    krnl_2_obj(queue, (sz, sz), (8, 8), result, first_device, second_device)
    end = time.perf_counter()
    
    krnl_2_exec_time += (end-start)
    
krnl_1_exec_time /= 100
krnl_2_exec_time /= 100

f"Kernel 1 execution time: {krnl_1_exec_time} s and Kernel 2 execeution time: {krnl_2_exec_time} s."

'Kernel 1 execution time: 0.0014242268312955274 s and Kernel 2 execeution time: 0.0013630261606886053 s.'

You can run the whole notebook multiple times and you'll see that they will essentially always be "roughly the same speed." <br>
This is because the buffers are in cache, and access from cache is going to be fast too -- evidently as fast as local memory access.

In [7]:
if krnl_1_exec_time < krnl_2_exec_time:
    
    speedup = round(krnl_2_exec_time / krnl_1_exec_time)
    
    if speedup == 1:
        
        print(f"They are roughly the same speed. Kernel 1 is a bit faster.")
        
    else:
        
        print(f"Kernel 1 is roughly {speedup}x as fast as Kernel 2.")
else:
    
    speedup = round(krnl_1_exec_time / krnl_2_exec_time)
    
    if speedup == 1:
        
        print(f"They are roughly the same speed. Kernel 2 is a bit faster.")
        
    else:
        
        print(f"Kernel 2 is roughly {speedup}x fast as Kernel 1.")

They are roughly the same speed. Kernel 2 is a bit faster.
