# Shared Local Memory Bank Conflicts

This notebook describes an advanced technique to optimize SLM access patterns.

## Setup

Load libraries and extensions.

In [None]:
import os
import sys

import pyopencl as cl
import numpy as np
import pandas as pd

sys.path.append("..")

from helpers import profile_gpu

os.environ["PYOPENCL_COMPILER_OUTPUT"] = "1" # set to 1 to see compiler warnings

%load_ext pyopencl.ipython_ext

Create context and queue.

In [None]:
platform = cl.get_platforms()[0]

ctx = cl.Context(
    dev_type=cl.device_type.ALL, 
    properties=[(cl.context_properties.PLATFORM, platform)])    

queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    
devices = ctx.get_info(cl.context_info.DEVICES)
for d in devices:
    print(f"device={d}")

## Memory Banks

The slides below describe what memory banks are and how to access SLM without bank conflicts.

In [None]:
%%HTML
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSlVEew--oxKhSXYcCJP3vRHD2EQ-gPYH1g7lt0FotTBAV2LzbRF0koXVXRXIpJkv920L0rcqVSrbzz/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

For more information, see this good description of [bank conflicts on NVIDIA](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-5-x).
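The bank mapping can be sketched in a few lines of plain Python. This is only an illustration, assuming 32 banks with successive 32-bit words mapped to successive banks (as the NVIDIA guide linked above describes) and 32 threads accessing SLM together; `conflict_degree` is a hypothetical helper, not part of any library.

```python
from collections import Counter

NUM_BANKS = 32   # assumption: 32 banks, successive 32-bit words in successive banks
WARP_SIZE = 32   # assumption: 32 threads access SLM together


def conflict_degree(stride_words):
    """Largest number of threads that hit the same bank when thread t
    accesses the 32-bit word at index t * stride_words.

    With a non-zero stride every thread touches a different word, so
    several threads in one bank means the accesses are serialized.
    """
    banks = Counter((t * stride_words) % NUM_BANKS for t in range(WARP_SIZE))
    return max(banks.values())


for stride in (1, 2, 32, 33):
    print(f"stride={stride:2d} words -> {conflict_degree(stride)} thread(s) per bank")
```

A degree of 1 means the access is conflict free; a degree of k means the hardware needs roughly k passes to serve the request.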

## Exercise - Remove memory bank conflicts

The code below executes some fictional calculations, which suffer from Memory Bank Conflicts. 

First, define the input data. We'll just use an array filled with the same number and increment every element.

In [None]:
N = np.int32(2**25)
h_a = np.full(N, 1, dtype=np.int32)  # allocate as int32 directly, no temporary array

print(f"Working with {len(h_a):,} elements.")

Define the execution configuration and buffers. We're going to add extra buffers to:
* store the number of GPU clock cycles spent by a work group - so we can see the cost of accessing SLM
* store the time spent on the GPU
* store Streaming Multiprocessor IDs - not related to bank conflicts, but as a curiosity so that we can see how work groups are scheduled.

In [None]:
flags = cl.mem_flags

local_work_size = (32,)
global_work_size = (N,)
num_groups = global_work_size[0]//local_work_size[0]

d_a = cl.Buffer(ctx, flags.READ_ONLY | flags.COPY_HOST_PTR, hostbuf=h_a)
d_result = cl.Buffer(ctx, flags.WRITE_ONLY, h_a.nbytes)
d_clock_cycles = cl.Buffer(ctx, flags.WRITE_ONLY, num_groups * 4)
d_durations_ns = cl.Buffer(ctx, flags.WRITE_ONLY, num_groups * 4)
d_smids = cl.Buffer(ctx, flags.WRITE_ONLY, num_groups * 4)
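As a quick sanity check of the sizes involved, the arithmetic behind the configuration above can be written out. This is a minimal sketch that repeats the values from the cell above (`N = 2**25`, work groups of 32, 4-byte values); the variable names `data_bytes` and `per_group_bytes` are introduced here for illustration only.

```python
N = 2**25             # number of int32 elements, as in the cell above
local_size = 32       # work items per work group
bytes_per_value = 4   # size of one int32 / uint32 value

# The kernel has no bounds check, so N must be a multiple of the group size.
assert N % local_size == 0

num_groups = N // local_size
data_bytes = N * bytes_per_value
per_group_bytes = num_groups * bytes_per_value

print(f"work groups:            {num_groups:,}")
print(f"input/result buffer:    {data_bytes / 2**20:.0f} MiB each")
print(f"per-group stats buffer: {per_group_bytes / 2**20:.0f} MiB each")
```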

Below is a kernel with some profiling function calls. There is a trick in NVIDIA's OpenCL implementation that gives access to CUDA functionality directly, using inline assembler (PTX) instructions.

* [clock function](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#time-function) - returns the current clock counter
* [globaltimer_lo](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-globaltimer) - returns the lower 32 bits of the GPU timer in nanoseconds
* [smid](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers-smid) - returns the ID of the Streaming Multiprocessor on which a thread is executing
* [all special registers](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#special-registers) - list of all available registers

We've also defined a 'SomeInfo' structure which is used to store some information. In this example it's declared for demonstration purposes only, but it's easy to imagine other useful data being stored in it in a real application.

In the kernel below we fetch data from global memory, put it into SLM and increment it 100 times, so that SLM is accessed many times and the cost of those accesses becomes visible.

At the end, the first thread in each work group saves the profiling information, together with the ID of the SM on which the group executed.

In [None]:
%%cl_kernel -o "-cl-fast-relaxed-math"

uint clock()
{
    uint clock_time;
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(clock_time));
    return clock_time;
}

uint globaltimer_ns()
{
    uint timer;
    asm volatile ("mov.u32 %0, %%globaltimer_lo;" : "=r"(timer));
    return timer;
}

uint get_smid(void) 
{
     uint ret;
     asm("mov.u32 %0, %%smid;" : "=r"(ret) );
     return ret;
}

typedef struct
{
    int a;
    int buffer_data[31];
} SomeInfo;

__kernel void add_vectors(__global const int *a, 
                          __global int *result,  
                          __global uint* clock_cycles,
                          __global uint* durations,
                          __global uint* sms)
{
    __local SomeInfo slm[32];
    
    uint start_timer = globaltimer_ns();
   
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int group_id = get_group_id(0);
    int local_size = get_local_size(0);
    
    slm[lid].a = a[gid];
    
    barrier(CLK_LOCAL_MEM_FENCE);
    
    uint start_cycles = clock();
        
    for (int i=0; i < 100; i++) {
        slm[lid].a++;
    }

    uint end_cycles = clock();
    
    result[gid] = slm[lid].a;
    
    uint end_timer = globaltimer_ns();
    
    if (lid == 0) {
        clock_cycles[group_id] = end_cycles - start_cycles;
        durations[group_id] = end_timer - start_timer;
        sms[group_id] = get_smid();
    }
}
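One detail worth spelling out: `globaltimer_lo` only holds the lower 32 bits of a nanosecond counter, so it wraps around roughly every 4.3 seconds (2^32 ns). The unsigned subtraction in the kernel still gives the right duration as long as the measured interval is shorter than one wrap. The sketch below imitates that arithmetic in Python; `elapsed_u32` is a hypothetical helper and the readings are made-up values chosen to force a wrap.

```python
MASK32 = 0xFFFFFFFF


def elapsed_u32(start, end):
    """Difference of two 32-bit counter readings, computed modulo 2**32,
    which is what unsigned 32-bit subtraction in the kernel does."""
    return (end - start) & MASK32


# Made-up readings: the counter wrapped between the two samples.
start = 0xFFFFFF00
end = 0x00000010

print(elapsed_u32(start, end))        # 272
print(elapsed_u32(1_000, 1_500))      # 500, the ordinary case without a wrap
```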

Schedule work to GPU.

In [None]:
_ = profile_gpu(add_vectors, 20, 
            queue, 
            global_work_size, 
            local_work_size,
            d_a,
            d_result,
            d_clock_cycles,
            d_durations_ns,
            d_smids)

Fetch the data back to the CPU and display it as a pandas DataFrame. pandas is a useful library for exploring data and statistics. It's also popular in Machine Learning applications, but here we'll just use it to display data in an Excel-like table.

In [None]:
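# Allocate host arrays with the final dtype directly (no float64 temporaries).
h_result = np.zeros(N, dtype=np.int32)
h_clock_cycles = np.zeros(num_groups, dtype=np.uint32)
h_durations_ns = np.zeros(num_groups, dtype=np.uint32)
h_smids = np.zeros(num_groups, dtype=np.uint32)  # the kernel writes uint values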

cl.enqueue_copy(queue, h_result, d_result)
cl.enqueue_copy(queue, h_clock_cycles, d_clock_cycles)
cl.enqueue_copy(queue, h_durations_ns, d_durations_ns)
cl.enqueue_copy(queue, h_smids, d_smids)

df = pd.DataFrame({'clock cycles' : h_clock_cycles, 
                   'Duration [ns]': h_durations_ns,
                   'Streaming Multiprocessor IDs' : h_smids,
                   'Result' : h_result[::32],})

df.index.name = 'Work Group ID'
pd.set_option('display.max_rows', 500)
df[:64] # display only some work groups
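To see how work groups were spread across Streaming Multiprocessors, the per-group values can be grouped by SM ID. Below is a minimal sketch using only the standard library; `summarize_by_sm` is a hypothetical helper, and the sample lists are made-up placeholder values, not measurements. In the notebook it could be fed with `h_smids.tolist()` and `h_clock_cycles.tolist()`.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_sm(smids, clock_cycles):
    """Return {sm_id: (number of work groups, mean clock cycles)}."""
    per_sm = defaultdict(list)
    for sm, cycles in zip(smids, clock_cycles):
        per_sm[sm].append(cycles)
    return {sm: (len(values), mean(values)) for sm, values in sorted(per_sm.items())}


# Placeholder values for illustration only, not real profiling output.
sample_smids = [0, 2, 4, 0, 2, 4]
sample_cycles = [100, 110, 120, 300, 310, 320]

for sm, (groups, avg) in summarize_by_sm(sample_smids, sample_cycles).items():
    print(f"SM {sm}: {groups} work groups, mean {avg} cycles")
```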

## Task

Your task is to bring down the execution time and the number of clock cycles spent accessing SLM, using the knowledge gained from the presentation above.

Refer to the [solution](./slm_bank_conflicts_solution.c) if you get stuck.

## Bank conflicts with global and private memory

So why are there Bank conflicts in SLM but not in registers or global memory?

### Global memory
Accesses to global memory are much slower and are grouped into requests, so bank conflicts are not visible there. Global memory is accessed in requests of 64 or 128 bytes. If multiple threads access data within the same request, the accesses are coalesced and you get good performance. This has been discussed in detail in previous notebooks.
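The effect of coalescing can be estimated by counting how many requests a group of threads needs. The sketch below is a simplified model, assuming 32 threads, 128-byte requests and a buffer that starts on a request boundary; `segments_touched` is a hypothetical helper, and real hardware has more rules than this.

```python
def segments_touched(stride_bytes, element_bytes=4, threads=32, segment_bytes=128):
    """Count the distinct aligned segments touched when thread t reads
    element_bytes starting at byte offset t * stride_bytes."""
    segments = set()
    for t in range(threads):
        first = t * stride_bytes
        last = first + element_bytes - 1
        # An element may straddle a segment boundary, so add every segment it covers.
        segments.update(range(first // segment_bytes, last // segment_bytes + 1))
    return len(segments)


for stride in (4, 8, 128):
    print(f"stride={stride:3d} bytes -> {segments_touched(stride)} request(s)")
```

With a 4-byte stride all 32 threads share one request, while a 128-byte stride needs a separate request per thread.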

### Private memory - registers

Registers are private to each thread so there are no bank conflicts. It's just one thread accessing this type of memory.

# Investigating GPU assembly 

When digging deeper in search of performance, you may want to look at the assembly instructions generated by the compiler. You will be able to see exactly what your code does. For details, see the [Nvidia PTX assembly](https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html) documentation.

In [None]:
kernel_string = """
uint clock_time()
{
    // %globaltimer is a 64-bit register, so read it into a 64-bit variable
    ulong timer;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(timer));
    return (uint)timer; // keep only the lower 32 bits
}

uint get_smid(void) 
{
     uint ret;
     asm("mov.u32 %0, %%smid;" : "=r"(ret) );
     return ret;
}

__kernel void add_vectors(__global const int *a, 
                          __global const int *b, 
                          __global int *c,  
                          __global uint* times,
                          __global uint* sms)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int group_id = get_group_id(0);
    barrier(CLK_GLOBAL_MEM_FENCE);
    uint start = clock_time();
    barrier(CLK_LOCAL_MEM_FENCE);
    
    c[gid] = 0;
    
    for (int i=0; i < 100; i++) {
        c[gid] += i * a[gid] + b[gid];        
    }
    
    uint end = clock_time();
    
    times[group_id] = end - start;
    
    if (lid == 0) {    
        sms[group_id] = get_smid();
    }
}

"""

prg = cl.Program(ctx, kernel_string).build()

print(prg.binaries[0].decode())

On NVIDIA, the code above produces PTX that contains the following instructions:
* ld.param.u64 - loads a kernel parameter
* bar.sync - barrier
* mov.u32 - assignment, e.g.
    * mov.u32 %r11, %ntid.x; - copies the content of the '%ntid.x' special register into register 'r11'. '%ntid' holds the number of threads per block (the work group size); the thread ID itself is in '%tid'
* st.global.u32 - store to global memory
* add - addition
* mul - multiplication
* mad - multiply and add, two operations in a single instruction
* BB0_1 - a label marking a branch target; in this case it belongs to the for-loop, but labels are also generated for if-statements
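For longer kernels it helps to count the opcodes instead of reading the listing by eye. The following is a rough sketch: `count_opcodes` is a hypothetical helper, and `sample_ptx` is a small hand-written fragment for illustration, not actual compiler output. On NVIDIA, where the program binary is PTX text, it could be called as `count_opcodes(prg.binaries[0].decode())`.

```python
import re
from collections import Counter


def count_opcodes(ptx_text):
    """Count instruction opcodes in a PTX listing (best effort)."""
    counts = Counter()
    for line in ptx_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, directives (.reg, .visible, ...), braces and labels.
        if not line or line.startswith(("//", ".", "{", "}", "(", ")")) or line.endswith(":"):
            continue
        # Drop an optional predicate such as "@%p1" before the opcode.
        match = re.match(r"(?:@!?%\w+\s+)?([a-z][\w.]*)", line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hand-written illustrative fragment, not real compiler output.
sample_ptx = """
    ld.param.u64 %rd1, [add_vectors_param_0];
    mov.u32 %r1, %tid.x;
    bar.sync 0;
BB0_1:
    mad.lo.s32 %r5, %r4, %r3, %r2;
    add.s32 %r4, %r4, 1;
    setp.ne.s32 %p1, %r4, 100;
    @%p1 bra BB0_1;
    st.global.u32 [%rd4], %r5;
"""

for opcode, n in count_opcodes(sample_ptx).most_common():
    print(f"{opcode:15s} {n}")
```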