# Occupancy calculator

Maximizing the performance of a compute kernel is a complex task that depends on many factors. It's not uncommon for individual code changes to yield no improvement on their own, yet make the kernel faster when combined. Blind trial and error will get you somewhere, but it's wiser to optimize your code with some knowledge of the underlying GPU architecture.

## Terminology

Each GPU vendor builds its devices differently: the components differ, and memory or registers can sit at different levels of the hierarchy.

Frameworks like OpenCL name some of the building blocks and leave others up to the vendors, so those remain unnamed. We will follow the Khronos OpenCL naming. Below is a high-level list, from the highest level (work groups) to the lowest (close to the thread):
* Compute Device - usually the GPU - distributes work groups to Compute Units (CU)
* Compute Unit (CU) - consists of multiple PBs, processes entire work groups, divides a work group into subgroups and sends them to the lower level. On major platforms it has SLM (shared local memory). On some it has register files as well.
* Processing Block (PB) - works at the subgroup level, has ALUs (Arithmetic Logic Units) and FPUs (Floating Point Units) and usually has registers to keep the state of kernels.

In [None]:
%%HTML
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSbabHdA1t-2f9uM6TsUtn4eiUITPP_lZljBa8cCGgEDz-XLzQUtleg5b-_O0AK9jiN_s0ecx-oqiGw/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

Below are the vendor-specific building block names.

Nvidia (Turing):
* CUDA Core or Streaming Processor - processes one kernel instance, has an ALU and FPU
* Processing Block (PB) - works with subgroups, has a 64 kB register file (16384 32-bit registers), a warp scheduler and an L0 cache
* Streaming Multiprocessor (SM) - wraps 4 PBs, works with entire work groups - has SLM and L1 cache
* Texture Processing Cluster (TPC) - wraps two SMs
* Graphics Processing Cluster (GPC) - wraps several TPCs
* GPU - consists of one or more GPCs, has L2 cache

Intel (Gen 11):
* 2 ALUs - together processing 8 threads
* Execution Unit (EU) - the foundational building block - has a register file, processes several subgroups, contains the 2 ALUs where the actual calculations happen
* Subslice - works on entire work groups - has SLM, can distribute a work group to EUs
* Slice - can dispatch entire work groups to different subslices, has L3 cache. If there are no barriers in the code and work items are independent within work groups, work groups can be scheduled to multiple subslices
* GPU - consists of one or more Slices

The mapping of those building blocks to the generalized architecture is documented in the [GPU Terminology Dictionary](https://confluence.synaptics.com/display/IOTVI/GPU+Terminology+Dictionary).

In [None]:
%%HTML
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vSKOdrKeIO8oPMlm1f1ypEJcdWx3BvY4zntgfj0fABYdllJaXGpjJEq1reFeo4-tfEyAKD_pQjogLGG/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

## Compute Unit Occupancy

One of the things we can calculate statically - without running the kernel - is subgroup occupancy. It tells us how well our compiled kernel theoretically utilizes a CU.

To calculate it we need some information about the CU:
* SLM size in bytes
* register file size per subgroup in a PB
* register file size per CU
* number of subgroup states - how many subgroups can be kept in the register file. While the GPU is processing one subgroup, the state of the other subgroups is kept in register space; those subgroups may be waiting, e.g. for memory transactions or for other subgroups to finish their work.
* number of processing blocks
* maximum number of work groups that can be active at any given moment

And some information about the kernel binary compiled for that particular platform:
* number of threads per work group - part of your execution configuration (the local work group size)
* SLM usage per work group - easy to calculate from the sizes of your local memory arrays; it can also be queried through the OpenCL API on the kernel object (`CL_KERNEL_LOCAL_MEM_SIZE`).
* subgroup size - can be deduced from the "preferred work group size multiple" (`CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE`) returned by an OpenCL API call.
* register usage per kernel instance - how to get it depends on the platform's compiler:
    - Nvidia - by specifying a compiler option that reports the value
    - Adreno - by querying private bytes (`CL_KERNEL_PRIVATE_MEM_SIZE`) through the OpenCL API.
    - Mali - a query for private bytes only returns the amount of spilled registers.
    - Intel - we don't have a reliable way to get that information. A hint of how many registers are used could be the DirectCompute HLSL bytecode value `dcl_temps`. Multiply this value by 16, because each temporary register holds four 32-bit components (see the [dcl_temps documentation](https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/dcl-temps)).
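
The register figures above arrive in different units, while the calculator below expects bytes per kernel instance. A minimal sketch of the conversion, assuming the Nvidia compiler output is a count of 32-bit registers per thread and that a `dcl_temps` value counts 16-byte temporaries; the helper names are hypothetical and not part of any vendor API:

```python
# Hypothetical helpers (not part of any vendor API): convert reported
# register figures into bytes per kernel instance (thread).

BYTES_PER_32BIT_REGISTER = 4
BYTES_PER_DCL_TEMP = 16  # one temporary = 4 components * 4 bytes


def register_bytes_from_count(register_count):
    """Bytes per thread for a count of 32-bit registers."""
    return register_count * BYTES_PER_32BIT_REGISTER


def register_bytes_from_dcl_temps(dcl_temps):
    """Rough bytes-per-thread estimate from an HLSL bytecode dcl_temps value."""
    return dcl_temps * BYTES_PER_DCL_TEMP


print(register_bytes_from_count(64))       # 256
print(register_bytes_from_dcl_temps(24))   # 384
```

The results can be used as the `register_usage_in_bytes` value of a kernel description. The `dcl_temps` figure is only a hint, since the final register allocation is done by the driver compiler.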
 
 
### Calculating CU occupancy
 
Below are the values filled in for some platforms.

In [None]:
# ------ Nvidia RTX 2080 Ti ------

# Processing block
processing_block = {
    'register_file_in_bytes': 8192 * 4, # register limit per subgroup: 32 threads * 255 max registers * 4 bytes, rounded up - in the docs the whole PB register file is 16384 * 4
    'nr_subgroups_states': 8, # this number is calculated,
    # Turing has a limit of 1024 threads per SM. Divided by subgroup width of 32 and by 4 PBs gives - 1024/(32 * 4) = 8
}

# Streaming Multiprocessor
compute_unit = {
    'slm_size_in_bytes': 65536,
    'register_file_in_bytes': 4 * 65536, # from documentation: 65536 32-bit registers per SM
    'max_active_work_groups': 16,
    'nr_processing_blocks': 4
}

nvidia_rtx_2080_ti_platform = {
    'name' : "NVidia RTX 2080 Ti",
    'compute_unit' : compute_unit,
    'processing_block': processing_block
}

# ------ Intel Gen 11 ------

# Execution Unit
processing_block = {
    'register_file_in_bytes': 4096, # this is a register limit per subgroup
    'nr_subgroups_states': 7 # this actually means how many subgroup states can be kept in hardware
}

# Subslice
compute_unit = {
    'slm_size_in_bytes': 65536,
    'register_file_in_bytes': 229376, # this is (subgroup register file) x (nr subgroup states) x (nr of processing blocks)
    'max_active_work_groups': 16,
    'nr_processing_blocks': 8
}

intel_gen_11_platform = {
    'name' : "Intel Gen 11",    
    'compute_unit' : compute_unit,
    'processing_block': processing_block
}

# ------ ARM Mali G52 ------

# Execution Engine
processing_block = {
    'register_file_in_bytes': 8 * 64 * 4, # each thread can store 64 registers, 32 bits wide each - 8 threads per subgroup
    'nr_subgroups_states': 16 # deduced from the work group size limit of 384: 384 / (8 threads * 3 PBs) = 16 - can be wrong
}

# Core
compute_unit = {
    'slm_size_in_bytes': 32 * 1024, # returned by OpenCL - SLM on Mali is part of global memory
    'register_file_in_bytes': 384 * 64 * 4, # 96kB
    'max_active_work_groups': 16, # this is a guess based on nothing - taken from intel - may be wrong
    'nr_processing_blocks': 3
}

mali_g52_platform = {
    'name' : "Mali G52 MP2",
    'compute_unit' : compute_unit,
    'processing_block': processing_block
}

Below are three main functions, each calculating the work group limit with respect to a particular resource:
* register usage
    - check if registers are spilled by calculating whether a subgroup fits into its register limit
    - calculate how many work group states fit into the total CU register space
* SLM
    - calculate how many work groups fit into the total CU SLM space
* subgroups
    - calculate how many subgroup states (from entire work groups) can be stored on a CU

A fourth limit comes from the hardware design and is defined by the vendor in the documentation:
* the maximum number of work groups which can be active - determined by other factors

From these limits we take the minimum, because it's the bottleneck. Based on that, we can calculate how many subgroups can be scheduled. Then we compare it against the maximum number of subgroups the CU could hold. This ratio gives the occupancy percentage.
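
As a worked example of this arithmetic, here is a minimal sketch with the numbers hard-coded. It assumes the Intel Gen 11 values listed earlier and a kernel with 48 threads per work group, a subgroup size of 8, 768 bytes of SLM and 384 bytes of registers per thread; the variable names are only illustrative:

```python
from math import ceil

# Per-resource work group limits on one Compute Unit.
reg_limit = 229376 // (384 * 48)    # CU register file / register bytes per work group
slm_limit = 65536 // 768            # CU SLM / SLM bytes per work group
subgroups_per_wg = ceil(48 / 8)     # subgroups needed by one work group
total_subgroups = 8 * 7             # processing blocks * subgroup states per block
subgroup_limit = total_subgroups // subgroups_per_wg
hardware_limit = 16                 # maximum active work groups per CU

# The smallest limit is the bottleneck.
limit = min(reg_limit, slm_limit, subgroup_limit, hardware_limit)

# Occupancy: scheduled subgroups versus subgroups the CU could hold.
active_subgroups = limit * subgroups_per_wg
occupancy_percent = 100 * active_subgroups / total_subgroups

print(reg_limit, slm_limit, subgroup_limit, hardware_limit)  # 12 85 9 16
print(f"{active_subgroups} of {total_subgroups} subgroups -> {occupancy_percent:.2f}%")
```

Here the subgroup limit of 9 work groups is the bottleneck, so 54 of the 56 subgroup states are used.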

In [None]:
#occupancy calculator
from math import ceil


def calculate_register_limits(compute_unit, processing_block, kernel):
    # Calculate maximum register space (in bytes) available per thread
    available_reg_bytes_per_thread = processing_block['register_file_in_bytes'] // kernel['subgroup_size']

    # Check if registers are spilled
    if kernel['register_usage_in_bytes'] > available_reg_bytes_per_thread:
        return 0

    reg_per_compute_unit = compute_unit['register_file_in_bytes']
    # Calculate register usage per work group
    reg_per_wk = kernel['register_usage_in_bytes'] * kernel['work_group_size']

    # A kernel reporting no register usage is not limited by registers
    if reg_per_wk == 0:
        return compute_unit['max_active_work_groups']

    # Calculate how many work group states fit into total CU register space
    return reg_per_compute_unit // reg_per_wk


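def calculate_slm_limits(compute_unit, kernel):
    slm_per_compute_unit = compute_unit['slm_size_in_bytes']
    slm_per_wk = kernel['slm_usage_per_work_group_in_bytes']

    # A kernel that uses no SLM is not limited by it (and must not divide by zero)
    if slm_per_wk == 0:
        return compute_unit['max_active_work_groups']

    # Calculate how many work groups fit into total CU SLM space
    return slm_per_compute_unit // slm_per_wk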


def calculate_subgroup_limits(compute_unit, processing_block, kernel):
    # Calculate how many subgroup states can be stored on a CU, given work groups cannot be split across CUs
    
    # Total number of subgroups that can be stored on a CU
    total_subgroups = compute_unit['nr_processing_blocks'] * processing_block['nr_subgroups_states']
    # Number of subgroups per work group
    subgroups_per_work_group = ceil(kernel['work_group_size'] / kernel['subgroup_size'])
    
    # Calculate how many work groups fit into total CU subgroup states
    return total_subgroups//subgroups_per_work_group


def describe_compute_unit(compute_unit):
    print("Compute Unit")
    print(f"\tShared Memory Size (bytes): {compute_unit['slm_size_in_bytes']} ")
    print(f"\tRegister File (bytes): {compute_unit['register_file_in_bytes']} ")
    print(f"\tMaximum number of Active Work Groups: {compute_unit['max_active_work_groups']} ")
    print(f"\tNumber of Processing Blocks: {compute_unit['nr_processing_blocks']} ")

    
def describe_processing_block(processing_block):
    print("Processing Block")
    print(f"\tNumber of subgroup states: {processing_block['nr_subgroups_states']} ")
    print(f"\tRegister File per subgroup (bytes): {processing_block['register_file_in_bytes']} ")


def describe_kernel(kernel):
    print(f"Kernel - {kernel['name']}")
    print(f"\tWork Group Size (number of threads in Work Group): {kernel['work_group_size']} ")       
    print(f"\tSubgroup size: {kernel['subgroup_size']} ")          
    print(f"\tLocal Shared Memory usage per Work Group (bytes): {kernel['slm_usage_per_work_group_in_bytes']} ")
    print(f"\tRegister usage per kernel (bytes): {kernel['register_usage_in_bytes']} ")

          
def occupancy(platform, kernel):
    compute_unit = platform['compute_unit']
    processing_block = platform['processing_block']
    
    reg_limit = calculate_register_limits(compute_unit, processing_block, kernel)
    slm_limit = calculate_slm_limits(compute_unit, kernel)
    subgroup_limit = calculate_subgroup_limits(compute_unit, processing_block, kernel)

    max_wks = compute_unit['max_active_work_groups']
    
    limit = min(reg_limit, slm_limit, subgroup_limit, max_wks)
    
    print(f"-------------------------------------------------------------------------------------\n")
    print(f"Platform: {platform['name']}")
    describe_compute_unit(compute_unit)
    describe_processing_block(processing_block)
    describe_kernel(kernel)

    print("")
    print(f"{limit} work groups is the final limit.")
    
    print(f"\t{max_wks} work groups limit imposed by hardware design.")
    
    if reg_limit > 0:
        print(f"\t{reg_limit} work groups limit imposed by register usage of your kernel.")
    else:
        print("\tYour kernel uses too many registers and it's not possible to schedule any work.")

    if subgroup_limit > 0:
        print(f"\t{subgroup_limit} work groups limit imposed by your work group size and compiled subgroup width.")
    else:
        print("\tYour work group has more subgroups than a Compute Unit can hold and it's not possible to schedule any work.")

    if slm_limit > 0:
        print(f"\t{slm_limit} work groups limit imposed by SLM usage of your kernel.")
    else:
        print("\tYour kernel uses too much SLM and it's not possible to schedule any work.")
    
    threads_per_wk = kernel['work_group_size']
    subgroup_size = kernel['subgroup_size']
    
    # Number of subgroups per work group
    subgroups_per_wk = ceil(threads_per_wk / subgroup_size)
    # Number of active subgroups per compute unit
    subgroups_per_compute_unit = subgroups_per_wk * limit
    
    # Total number of subgroups that could be stored on a CU
    total_subgroups = compute_unit['nr_processing_blocks'] * processing_block['nr_subgroups_states']
    
    # ------------- Compute Unit OCCUPANCY -------------------------
    # Tells how many of the subgroups can be stored and processed on a CU
    # this can get below 100% if your kernel uses too many registers or SLM
    subgroup_occupancy = 100 * subgroups_per_compute_unit / total_subgroups
    
    print(f"{subgroup_occupancy:.2f}% subgroup occupancy of Compute Unit.")
    print(f"\t{subgroups_per_wk} subgroups per work group.")
    print(f"\t{subgroups_per_compute_unit} of {total_subgroups} available subgroups are utilized by your kernel.")    
    
    # Thread occupancy - tells how many lanes (or cores) in your subgroup are utilized
    # this can get below 100% only if your work group size is not multiple of subgroup width
    # this is independent of register or SLM usage
    # Number of threads per work group ceiled to the next multiple of subgroup size
    threads_per_wk_ceiled = subgroups_per_wk * subgroup_size
    thread_occupancy = 100 * threads_per_wk / threads_per_wk_ceiled
    print(f"{thread_occupancy:.2f}% thread occupancy of your subgroups.")
    print(f"\t{subgroup_size} threads per subgroup.")
    print(f"\t{threads_per_wk} of {threads_per_wk_ceiled} running threads are active in your kernel.")        
    print(f"-------------------------------------------------------------------------------------\n")

Then define kernel specific information and feed it to the occupancy function.

In [None]:
kernel_a = {
    'name' : 'Kernel A on Intel Gen 11',
    'work_group_size' : 48, 
    'slm_usage_per_work_group_in_bytes' : 768,
    'subgroup_size' : 8,
    'register_usage_in_bytes': 384
}

kernel_b = {
    'name' : 'Kernel B on Intel Gen 11',
    'work_group_size' : 64,
    'slm_usage_per_work_group_in_bytes' : 384,
    'subgroup_size' : 16,
    'register_usage_in_bytes': 256
}

occupancy(intel_gen_11_platform, kernel_a)
occupancy(intel_gen_11_platform, kernel_b)

kernel_nvidia = {
    'name' : 'Kernel B compiled on NVidia',
    'work_group_size' : 128,
    'slm_usage_per_work_group_in_bytes' : 384,
    'subgroup_size' : 32,
    'register_usage_in_bytes': 300
}

occupancy(nvidia_rtx_2080_ti_platform, kernel_nvidia)

kernel_mali = {
    'name' : 'Soft kernel on Mali',
    'work_group_size' : 32, 
    'slm_usage_per_work_group_in_bytes' : 768,
    'subgroup_size' : 8,
    'register_usage_in_bytes': 64 * 4
}

occupancy(mali_g52_platform, kernel_mali)

These theoretical occupancy results are a guideline for designing your kernels. High occupancy does not always mean better performance, since performance depends on other factors as well: your kernel may perform computations inefficiently, access memory in a non-optimal fashion or have a lot of divergent paths. However, low occupancy may be a hint to change something when your kernels are memory bound, because higher occupancy lets you hide more memory latency.

The calculations here are a simplification. For example, on Nvidia platforms warps are allocated in groups, which is neglected here.
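
To get a feeling for how much such a simplification can matter, here is a minimal sketch of allocation in groups. The granularity of 4 and the 64 subgroup states are made-up illustration values, not documented hardware figures, and the function names are hypothetical:

```python
def round_up(value, granularity):
    """Round value up to the next multiple of granularity."""
    return -(-value // granularity) * granularity


def work_group_limit_with_granularity(total_subgroups, subgroups_per_wg, granularity):
    """Work groups per CU when subgroup states are handed out in fixed-size groups."""
    allocated_per_wg = round_up(subgroups_per_wg, granularity)
    return total_subgroups // allocated_per_wg


# Hypothetical CU with 64 subgroup states, work groups of 5 subgroups each.
print(work_group_limit_with_granularity(64, 5, 1))  # 12 (no grouping)
print(work_group_limit_with_granularity(64, 5, 4))  # 8  (5 is rounded up to 8)
```

With grouping, each work group reserves more subgroup states than it uses, so fewer work groups fit than the simple calculation predicts.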

## Exercise

Play around with the Nvidia kernel B configuration to get 100% subgroup occupancy.
* you are on the Nvidia platform, so you cannot change the subgroup size - it has to remain 32
* assume you make imaginary code changes, so you can manipulate all other kernel parameters
* change the values in the code examples above

Refer to the [solution](./occupancy_solution.py) if you get stuck.

## Compute Device Occupancy

Keeping your entire GPU busy is more straightforward. We should have at least as many work groups as Compute Units. If you create fewer work groups, some Compute Units will remain idle.

For example, assume we have 5 Compute Units. You have optimized your Compute Unit occupancy so that 6 work groups fit on each CU. Then it makes sense to create a number of work groups that is a multiple of the number of work groups that fit on a CU times the number of CUs.
In this example that is 5 CUs * 6 work groups on each CU, which gives 30 work groups.
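
This rule of thumb can be sketched as a small helper that rounds a desired work group count up to the next full multiple. The function name and the idea of rounding up (rather than down) are assumptions made for illustration:

```python
def full_wave_work_groups(num_compute_units, work_groups_per_cu, needed_work_groups):
    """Round a work group count up to a multiple of what the whole GPU holds at once."""
    wave = num_compute_units * work_groups_per_cu
    waves = -(-needed_work_groups // wave)  # ceiling division
    return max(1, waves) * wave


print(full_wave_work_groups(5, 6, 30))   # 30
print(full_wave_work_groups(5, 6, 31))   # 60
print(full_wave_work_groups(5, 6, 100))  # 120
```

Whether you can actually change the number of work groups this way depends on how your problem can be split. The helper only shows which counts keep every CU equally loaded.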

## Other performance hints

### Static arrays in kernel code

On various platforms, statically declared arrays (like `int my_array[64]`) turned out to cause performance issues.

On Adreno they seem to be treated as global memory accesses, making the kernel **a few times slower**.

Often registers can be reused. For example, you can have 50 variables that require only 25 registers, because some of the variables go out of scope or are no longer needed further on in your kernel. But if you put all those values in an array instead of separate variables, the compiler will not shrink the array; it keeps it at the declared size.
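
The reuse argument can be illustrated with a toy model in which each variable occupies a register only between its first and last use. This is a simplified sketch, not how any particular compiler allocates registers, and the instruction indices are made up:

```python
def registers_needed(live_ranges):
    """Peak number of simultaneously live variables.

    live_ranges: list of (first_use, last_use) instruction indices, inclusive.
    """
    events = []
    for first, last in live_ranges:
        events.append((first, 1))       # variable becomes live
        events.append((last + 1, -1))   # register is free after the last use
    live = peak = 0
    # At equal positions the -1 events sort first, so freed registers are reused.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak


# 50 separate variables: 25 used early, 25 used late.
scalars = [(0, 9)] * 25 + [(10, 19)] * 25
# The same 50 values as one array that stays allocated for the whole kernel.
array_elements = [(0, 19)] * 50

print(registers_needed(scalars))         # 25
print(registers_needed(array_elements))  # 50
```

In this model the separate variables need half the registers, because the second batch reuses the registers freed by the first.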

### Number of memory barriers

There is dedicated hardware for handling barriers, and it has limited capacity. If this capacity is exceeded, performance drops.
This topic should be studied in more depth, but it's good to be aware of this limitation.

Using barriers in your code can cause your subgroup size to drop - observed on Adreno.