# L_004: Compute and Memory Basics

## At a glance: what do we need to know to keep the GPU busy?

## S1 - Compute bound:

Key Terms:

- SM (Streaming Multiprocessor)
    - **max_threads_block_SM % block_dim == 0**
    > A thread block is assigned to exactly one SM. When choosing block dimensions, consider the maximum number of threads that can be resident on an SM: the block size should divide this number evenly, so that whole blocks can fill the SM without leaving thread slots unused.

    - We have no control over which block is scheduled on which SM. On modern GPUs, the threads of a block are grouped into warps of 32 threads. Threads within a warp can diverge and follow different instruction paths; the hardware then executes those paths one after another, so part of the warp sits idle while the rest runs.

- Threads, Warps, Blocks 
  
    When the threads of a block are linearized into warps, `threadIdx.x` is the fastest-varying dimension, followed by `threadIdx.y` and then `threadIdx.z`.
  ![thread_linearization.png](./ax-images/thread_linearization.png)


- **Avoid Divergence Within a Warp:**  
  Keep conditionals simple (e.g., `cond ? x[i] : 0.0f`) so that the compiler can often turn them into a select instead of a branch, and arrange the work so that threads within the same warp take the same path.

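- Avoid FP64/INT64 where 32-bit types suffice: on most GPUs, and on consumer cards in particular, 64-bit arithmetic has much lower throughput than 32-bit arithmetic.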

- **Inspect CUDA Device Properties:**  
   - Use `torch.cuda.get_device_properties()` to retrieve detailed information about the GPU's capabilities.
   - More properties are described in the [CUDA docs](https://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/html/group__CUDART__DEVICE_g5aa4f47938af8276f08074d09b7d520c.html).
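
As a minimal sketch of the device-property query mentioned above, assuming PyTorch is installed and device index 0 is the GPU of interest: `describe_gpu` is a hypothetical helper name, and `max_threads_per_multi_processor` may be missing in older PyTorch builds, so it is read with a fallback.

```python
def describe_gpu(device_index=0):
    """Return selected device properties as a dict, or None without PyTorch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    props = torch.cuda.get_device_properties(device_index)
    return {
        "name": props.name,
        "compute_capability": (props.major, props.minor),
        "sm_count": props.multi_processor_count,
        "total_memory_gib": props.total_memory / 2**30,
        # Assumption: only newer PyTorch versions expose this field.
        "max_threads_per_sm": getattr(props, "max_threads_per_multi_processor", None),
    }


info = describe_gpu()
print(info if info is not None else "No CUDA device (or PyTorch) available")
```

The SM count and the maximum number of threads per SM are the two values that matter most for the block-size considerations in this section.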

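The block-size rule from the SM bullet (`max_threads_block_SM % block_dim == 0`) can be checked with a few lines of arithmetic. This is a simplified sketch: `sm_thread_usage` is a hypothetical helper, the limit of 1536 threads per SM is an assumed example value (query the real one from the device properties), and other limits such as registers, shared memory, and the maximum number of blocks per SM are ignored.

```python
def sm_thread_usage(max_threads_per_sm, block_size):
    """Return (resident_blocks, unused_thread_slots) for one SM.

    Only the thread limit is modelled; registers, shared memory and the
    per-SM block limit are deliberately left out of this sketch.
    """
    if not 0 < block_size <= max_threads_per_sm:
        raise ValueError("block_size must be in 1..max_threads_per_sm")
    resident_blocks = max_threads_per_sm // block_size
    unused = max_threads_per_sm - resident_blocks * block_size
    return resident_blocks, unused


MAX_THREADS_PER_SM = 1536  # assumed example value, not read from a device

for block_size in (1024, 512, 256):
    blocks, unused = sm_thread_usage(MAX_THREADS_PER_SM, block_size)
    print(f"block_size={block_size}: {blocks} block(s), {unused} unused thread slots")
# block_size=1024: 1 block(s), 512 unused thread slots
# block_size=512: 3 block(s), 0 unused thread slots
# block_size=256: 6 block(s), 0 unused thread slots
```

With the assumed limit, a block size of 1024 leaves a third of the SM's thread slots empty, while 512 and 256 divide the limit evenly.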

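To make the thread linearization from the "Threads, Warps, Blocks" bullet concrete, the following sketch computes which warp and lane a thread ends up in. The helper names are hypothetical, and a warp size of 32 is assumed, as stated above.

```python
WARP_SIZE = 32  # warp size assumed throughout these notes


def linear_thread_id(tx, ty, tz, block_dim):
    """Linearize (threadIdx.x, threadIdx.y, threadIdx.z); x varies fastest."""
    bx, by, _bz = block_dim
    return tx + ty * bx + tz * bx * by


def warp_and_lane(tx, ty, tz, block_dim):
    """Return (warp index within the block, lane index within the warp)."""
    return divmod(linear_thread_id(tx, ty, tz, block_dim), WARP_SIZE)


block_dim = (16, 4, 1)  # 64 threads, i.e. two warps
print(warp_and_lane(15, 1, 0, block_dim))  # (0, 31): last lane of warp 0
print(warp_and_lane(0, 2, 0, block_dim))   # (1, 0): first lane of warp 1
```

With a block width of 16, each warp covers two consecutive rows of `threadIdx.y`, so threads with neighbouring `threadIdx.x` values always share a warp.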

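Whether a condition causes divergence depends on how it lines up with warp boundaries. The sketch below is a plain Python model, not GPU code: it counts the warps in which threads disagree on a predicate. `divergent_warps` is a hypothetical helper and a warp size of 32 is assumed.

```python
WARP_SIZE = 32  # assumed warp size


def divergent_warps(num_threads, predicate):
    """Count warps whose threads do not all agree on predicate(tid)."""
    count = 0
    for start in range(0, num_threads, WARP_SIZE):
        end = min(start + WARP_SIZE, num_threads)
        outcomes = {bool(predicate(tid)) for tid in range(start, end)}
        if len(outcomes) > 1:  # both branches are taken inside this warp
            count += 1
    return count


# Branching on even/odd thread index splits every warp.
print(divergent_warps(128, lambda tid: tid % 2 == 0))                 # 4
# Branching on even/odd warp index keeps each warp on a single path.
print(divergent_warps(128, lambda tid: (tid // WARP_SIZE) % 2 == 0))  # 0
```

Both predicates send half of the threads down each path, but only the first one forces every warp to execute both paths.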

## S2 - Memory bound: Memory architecture and data locality

So far, we have seen how threads work and how they are scheduled on the GPU. The second important factor is how memory accesses limit the execution speed of our kernels.