# 1. The architecture of a modern GPU
* GPUs are organized into an array of highly threaded Streaming Multiprocessors (SMs).
* Each SM has many cores, known as CUDA cores.
* Each SM has an on-chip memory structure. 
* GPUs also come with a large amount of off-chip memory called "Global Memory."

# 2. Block scheduling
* Threads are assigned to SMs on a block-by-block basis. All the threads in a block are assigned to the same SM simultaneously.
* Blocks must reserve hardware resources before being assigned to an SM, so only a limited number of blocks can be assigned simultaneously. 
* Each GPU has a limited number of SMs, each of which can be assigned a limited number of blocks simultaneously. Hence, there is a limit to how many blocks can simultaneously run on a CUDA device. 
* The runtime maintains a list of blocks to be executed and assigns new blocks to SMs when previously assigned blocks have finished. 
* Block-by-block thread assignment allows for special interactions among threads in the same block that are impossible among threads of different blocks. These include barrier synchronization and low-latency access to shared memory. 
* Threads in different blocks can also synchronize if they follow certain patterns, but that is outside the scope of this chapter. 
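
The limit on simultaneously resident blocks can be estimated from the device properties. The following is a minimal sketch, not a full occupancy calculation: it only considers the per-SM block and thread limits and ignores registers and shared memory, which can lower the bound further. The helper names `maxResidentBlocks` and `printResidentBlockLimit` are hypothetical, and the `maxBlocksPerMultiProcessor` field assumes a CUDA 11 or newer toolkit.

```cuda
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Upper bound on the number of blocks resident on the whole device at once,
// using only two per-SM limits: block slots and thread slots.
// Register and shared-memory usage are ignored in this sketch.
inline int maxResidentBlocks(int numSMs, int maxBlocksPerSM,
                             int maxThreadsPerSM, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    int byThreads = maxThreadsPerSM / threadsPerBlock;  // blocks that fit by thread count
    return numSMs * std::min(maxBlocksPerSM, byThreads);
}

// Hypothetical helper: reads the limits of device `dev` and prints the bound.
// Returns false if the CUDA runtime reports an error (for example, no GPU).
inline bool printResidentBlockLimit(int dev, int threadsPerBlock) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return false;
    int bound = maxResidentBlocks(prop.multiProcessorCount,
                                  prop.maxBlocksPerMultiProcessor,
                                  prop.maxThreadsPerMultiProcessor,
                                  threadsPerBlock);
    std::printf("%s: %d SMs, at most %d resident blocks of %d threads\n",
                prop.name, prop.multiProcessorCount, bound, threadsPerBlock);
    return true;
}
```

With illustrative inputs of 108 SMs, 32 block slots and 2048 thread slots per SM, a block size of 256 gives 8 blocks per SM and 864 blocks in total. Any blocks beyond that bound wait in the runtime's list until a resident block finishes.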

# 3. Synchronization and transparent scalability
* `__syncthreads()` allows threads in the same block to coordinate. When a thread calls `__syncthreads()`, it is held at that point until every thread in the same block has reached it. This kind of synchronization is called **Barrier Synchronization**.
* If a kernel contains `__syncthreads()`, either all threads in a block must execute it or none of them should. Incorrect usage, such as placing the call inside a branch that only some threads of the block take, results in undefined behavior (for example, a deadlock or incorrect results).
* Barrier Synchronization imposes strict constraints on threads within a block:
    * All threads in a block should execute close together in time to avoid excessively long waits at the barrier.
    * The system must ensure all threads participating in Barrier Synchronization have the resources they need to eventually reach the barrier.
* CUDA satisfies these constraints by scheduling threads block by block.
* By not allowing Barrier Synchronization across blocks, CUDA makes a significant trade-off: blocks can be executed in any order, independently of each other. Independent block execution lets the same code run on hardware with different power/performance/cost profiles, albeit at different speeds. This is called **Transparent Scalability**.
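
To make the barrier rules concrete, here is a minimal sketch of correct `__syncthreads()` usage. Each block reverses its own tile of an array through shared memory, and every thread of the block reaches the barrier unconditionally. The kernel name `reverseTiles`, the host wrapper `reverseTilesOnDevice`, and the block size of 256 are assumptions made for this example, and CUDA error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // assumed block size for this sketch

// Each block reverses its own BLOCK-element tile of `data` in place.
// Assumes the array length is a multiple of BLOCK, so no boundary check is
// needed and every thread of every block reaches the barrier.
__global__ void reverseTiles(int *data) {
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK;
    tile[t] = data[base + t];
    __syncthreads();  // wait until the whole tile has been written
    data[base + t] = tile[BLOCK - 1 - t];
}

// Hypothetical host wrapper: copies the input to the device, runs the kernel
// with one block per tile, and returns the result.
inline std::vector<int> reverseTilesOnDevice(std::vector<int> h) {
    assert(!h.empty() && h.size() % BLOCK == 0);
    int *d = nullptr;
    size_t bytes = h.size() * sizeof(int);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    reverseTiles<<<static_cast<unsigned>(h.size() / BLOCK), BLOCK>>>(d);
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d);
    return h;
}
```

Without the barrier, a thread could read `tile[BLOCK - 1 - t]` before the thread responsible for that element has written it. Since no block depends on another, the blocks can run in any order and on any number of SMs, which is the property transparent scalability relies on.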


# 4. Warps and SIMD hardware
* Conceptually, threads in a block can execute in any order relative to each other. The only ordering guarantees come from Barrier Synchronization.
* Thread scheduling in CUDA is hardware-specific.
* In most implementations, once a block is assigned to an SM, its threads are divided into 32-thread units called warps. Knowledge of warps can be crucial in optimizing performance on a CUDA device.
* Blocks are partitioned into warps based on thread indices. For multi-dimensional blocks, the threads are first laid out in row-major order and then split into warps of 32. After the threads have been ***linearized***, the n-th warp (counting from 0) covers the following inclusive range of linear thread indices:
$$
\text{warp}_n = [32n,\ 32(n + 1) - 1]
$$
The last warp is padded with inactive threads if required.
* In the row-major **linearization**, higher dimensions are placed before lower ones: z, then y, and then x, so x varies fastest. Refer to Figure 4.1.
* Execution units (cores) in an SM are grouped into **processing blocks**. All cores in the same processing block share the same Instruction Fetch/Dispatch units. For example, each SM in the A100 has 64 cores divided into four processing blocks of 16 cores.
* All threads in the same warp are assigned to the same processing block. They apply the same instruction to different parts of the data at the same time. This model is called Single Instruction Multiple Data (SIMD). The advantage of SIMD is that the same control structure is shared across many compute units, so a larger share of the hardware can be devoted to compute cores. 
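
The linearization and warp partitioning described in this section can be sketched as a few small functions. The names `linearThreadId`, `warpOf`, and `warpsPerBlock` are hypothetical, and the warp size of 32 is hard-coded as an assumption that matches the text above.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Row-major linear index of a thread inside its block:
// x varies fastest, then y, then z.
__host__ __device__ inline unsigned linearThreadId(dim3 idx, dim3 dim) {
    assert(idx.x < dim.x && idx.y < dim.y && idx.z < dim.z);
    return idx.z * dim.y * dim.x + idx.y * dim.x + idx.x;
}

// Warp that holds a given linear thread index, assuming 32-thread warps.
__host__ __device__ inline unsigned warpOf(unsigned linearId) {
    return linearId / 32;
}

// Number of warps needed for a block; the last warp is padded with
// inactive threads when the block size is not a multiple of 32.
__host__ __device__ inline unsigned warpsPerBlock(dim3 dim) {
    unsigned threads = dim.x * dim.y * dim.z;
    return (threads + 31) / 32;
}
```

For a 16x16 block, the threads with y = 0 and y = 1 have linear indices 0 to 31 and form warp 0, while the thread at (x = 0, y = 2) has linear index 32 and starts warp 1. A 10x10 block has 100 threads and therefore needs 4 warps, with 28 inactive padding threads in the last one.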

# 5. Control divergence
* When different threads in the same warp take different control flow paths, the SIMD hardware will take multiple passes through these paths, one for each path.
* When different threads in the same warp execute different paths, these threads exhibit **control divergence**.
* The multipass approach to divergent warp execution extends the SIMD hardware capability to implement the full semantics of CUDA threads.
* While the SIMD hardware executes the same instruction for all threads in the warp, during each pass it lets only the threads that took that path take effect. The other threads are inactive for that pass.
* The multipass approach preserves thread independence while still taking advantage of low-cost SIMD hardware. The trade-off is the cost of the extra passes.
* Starting with the Volta architecture (the successor to Pascal), the passes can be executed concurrently, meaning that the execution of one pass may be interleaved with the execution of another. This feature is called **Independent Thread Scheduling**.
* Divergence can also arise from loops.
* A prevalent reason for control divergence is handling boundary conditions. For this kind of divergence, the larger the data size, the smaller the performance impact: as the grid grows, a smaller percentage of warps is involved in the boundary condition path.
* Another impact of control divergence is that one cannot assume the same execution time for all threads. Therefore, if all threads in a warp must complete a phase before moving on, we must use a barrier synchronization mechanism, like `__syncwarp()`.
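
The boundary-condition case can be illustrated with a vector addition kernel and a small host-side estimate of how many warps actually diverge. This is a sketch under stated assumptions: a 1D grid of `ceil(n / blockSize)` blocks, a block size that is a multiple of 32, and 32-thread warps. The names `vecAdd`, `DivergenceStats`, and `boundaryDivergence` are hypothetical.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Vector addition with the usual boundary check. A warp diverges on the
// check only if some of its threads satisfy `i < n` and others do not.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

struct DivergenceStats {
    int totalWarps;      // warps launched for the whole grid
    int divergentWarps;  // warps whose threads disagree on `i < n`
};

// Counts launched and divergent warps for the boundary check in vecAdd.
inline DivergenceStats boundaryDivergence(int n, int blockSize) {
    assert(n > 0 && blockSize > 0 && blockSize % 32 == 0);
    int blocks = (n + blockSize - 1) / blockSize;
    int totalWarps = blocks * (blockSize / 32);
    // Only the warp that straddles index n diverges. Warps entirely below n
    // or entirely at or above n take a single path.
    int divergent = (n % 32 != 0) ? 1 : 0;
    return {totalWarps, divergent};
}
```

With a block size of 32, a vector of length 100 launches 4 warps and 1 of them diverges (25%), while a vector of length 1000 launches 32 warps and still only 1 diverges (about 3%). This matches the observation that the relative impact of boundary divergence shrinks as the data size grows.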