# **CUDA BASICS**

Add the CUDA `bin` directory to `PATH` inside the notebook. Even when `nvcc` is found in a terminal, the ipykernel process may not inherit the same `PATH`, so the compiler is not detected from notebook cells.

In [None]:
import os
os.environ["PATH"] += ":/usr/local/cuda/bin"

# Verify nvcc is now accessible
!nvcc --version

---
## **01 - CUDA Device Properties**

Open the file [01_device_details.cu](./src/01_device_details.cu) to see the code.

In [None]:
!make SRC=./src/01_device_details.cu run


### 1. Basic Properties
- Device Name: Name of the GPU device.
  - Example: `NVIDIA A6000`, `GeForce GTX 1080 Ti`.
- Compute Capability: Indicates the architecture and feature set supported by the GPU.
  - Format: `major.minor` (e.g., `7.5` for Turing, `8.0` for Ampere).
  - Determines compatibility with CUDA features.

### 2. Hardware Specifications
- Number of Multiprocessors (SMs): Number of Streaming Multiprocessors.
  - Higher SM count generally means higher parallelism.
- Max Threads Per Block: Maximum number of threads allowed per block.
  - Typical value: `1024` (on all GPUs with compute capability 2.0 and above).
- Max Threads Per Multiprocessor: Maximum threads an SM can handle concurrently.
  - Dependent on the architecture (e.g., `2048` for Pascal and Volta, `1024` for Turing, `1536` for Ampere with compute capability 8.6).
- Max Blocks Per SM: Maximum number of thread blocks an SM can run simultaneously.

### 3. Memory Properties
- Global Memory: Total memory available on the GPU device.
  - Example: `48GB` for A6000, `11GB` for GTX 1080 Ti.
  - Host-device transfers read from and write to this memory.
- Shared Memory Per Block: Memory shared among threads in a block.
  - Example: `48KB` by default; larger sizes (e.g., up to `99KB` on compute capability 8.6) require opting in to dynamic shared memory.
- Total Shared Memory Per SM: Total shared memory available to an SM.
- L1 Cache/Shared Memory Configurable: Ability to partition shared memory and L1 cache.
  - Example: 16KB L1, 48KB shared or vice versa.
- Registers Per Block: Maximum number of registers available per block.
- Constant Memory: Read-only memory optimized for frequently used constants.
  - Typically `64KB`.

### 4. Execution Capabilities
- Warp Size: Number of threads in a warp.
  - `32` on all NVIDIA GPUs to date.
- Max Grid Dimensions: Maximum dimensions of a grid.
  - Example: `(2^31 - 1, 65535, 65535)` in the X, Y, Z dimensions.
- Max Block Dimensions: Maximum dimensions of a block.
  - Example: `(1024, 1024, 64)` in X, Y, Z dimensions. These are per-axis limits; the product of the three must still not exceed Max Threads Per Block.

### 5. Performance Metrics
- Clock Rate: GPU core clock speed, reported by the runtime in kHz.
  - Example: `1410000` kHz, i.e. `1410 MHz`.
  - Affects computation speed.
- Memory Clock Rate: Speed of the GPU memory, reported by the runtime in kHz.
  - Example: `6000000` kHz, i.e. `6 GHz` for GDDR6.
- Memory Bus Width: Width of the memory bus in bits.
  - Example: `384-bit`.
- Peak Memory Bandwidth: Maximum memory transfer rate.
  - Example: `936 GB/s`.
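
The memory clock and bus width above combine into a theoretical peak bandwidth. A commonly used estimate for double-data-rate memory is `2 × memory clock × (bus width / 8)`. The sketch below applies that estimate on the host; the function name `peakBandwidthGBs` and the input values are illustrative assumptions, not values read from a device.

```cuda
#include <cstdio>

// Theoretical peak bandwidth in GB/s for double-data-rate memory.
// memClockKHz: memory clock in kHz; busWidthBits: memory bus width in bits.
// (Hypothetical helper name, used only for this illustration.)
double peakBandwidthGBs(double memClockKHz, int busWidthBits) {
    double bytesPerTransfer = busWidthBits / 8.0;
    double transfersPerSecond = 2.0 * memClockKHz * 1e3;  // two transfers per clock
    return transfersPerSecond * bytesPerTransfer / 1e9;
}

int main() {
    // Illustrative inputs: a 9,750,000 kHz memory clock on a 384-bit bus.
    printf("Peak bandwidth: %.1f GB/s\n", peakBandwidthGBs(9750000.0, 384));
    return 0;
}
```

With these inputs the estimate works out to 936 GB/s. Measured bandwidth is normally lower than this theoretical figure.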

### 6. Concurrency Features
- Concurrent Kernels: Indicates if multiple kernels can execute simultaneously.
- Async Engine Count: Number of asynchronous engines for concurrent copy and execution.
- Overlap: Ability to overlap data transfer and kernel execution.

### 7. Unified Addressing
- Unified Addressing: Indicates that the host and device share a single virtual address space, so a pointer value alone identifies where the memory lives.
- Managed Memory: Support for memory that the CUDA runtime migrates automatically between host and device.

### 8. Special Capabilities
- Tensor Cores: Introduced with compute capability `7.0` (Volta) and present in most later GPUs (e.g., Turing RTX, Ampere).
  - Accelerates deep learning matrix operations.
- Ray Tracing Cores: Present in RTX GPUs for real-time ray tracing applications.
- FP16 and FP64 Performance: Indicates support for 16-bit and 64-bit floating-point operations.
  - Double precision (`FP64`) is much slower than `FP32` on consumer and workstation GPUs compared to data-center GPUs (e.g., A100).

### 9. Others
- ECC Support: Indicates whether Error Correcting Code (ECC) memory is available.
  - Critical for scientific and financial computations.
- Device Overlap: Whether the device can overlap computation and data transfer (a deprecated property; Async Engine Count supersedes it).
- CUDA Version: Supported CUDA runtime version.
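
As a rough illustration of how the properties listed above can be read at run time, here is a minimal sketch built on `cudaGetDeviceProperties`. It is not the contents of `01_device_details.cu`; the function name `printDeviceSummary` is an assumption made for this example, and only a subset of the fields is printed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints a short summary for every visible device.
// Returns the number of devices, or -1 if a query fails.
// (Hypothetical helper name, used only for this illustration.)
int printDeviceSummary() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            return -1;
        }
        printf("Device %d: %s\n", dev, prop.name);
        printf("  Compute capability:      %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
        printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("  Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
        printf("  Global memory:           %.1f GiB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  Shared memory per block: %zu KiB\n", prop.sharedMemPerBlock / 1024);
        printf("  Constant memory:         %zu KiB\n", prop.totalConstMem / 1024);
        printf("  Registers per block:     %d\n", prop.regsPerBlock);
        printf("  Warp size:               %d\n", prop.warpSize);
        printf("  Max grid dimensions:     (%d, %d, %d)\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("  Max block dimensions:    (%d, %d, %d)\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("  Memory bus width:        %d bits\n", prop.memoryBusWidth);
        printf("  Concurrent kernels:      %d\n", prop.concurrentKernels);
        printf("  Async engine count:      %d\n", prop.asyncEngineCount);
        printf("  ECC enabled:             %d\n", prop.ECCEnabled);
    }
    return deviceCount;
}

int main() {
    return printDeviceSummary() < 0 ? 1 : 0;
}
```

The printed values depend entirely on the installed GPU and driver, so no sample output is given here.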

### How to Use This Information
- Optimize Kernel Performance:
  - Design kernels to utilize shared memory efficiently.
  - Use appropriate thread/block configurations within the device limits.
- Memory Bandwidth:
  - Use coalesced memory access patterns to improve bandwidth utilization.
- Concurrency:
  - Use streams for overlapping data transfer and computation.
- Deep Learning:
  - Leverage Tensor Cores for matrix multiplication if available.



In [None]:
!make SRC=./src/01_device_details.cu clean

---
## **02 - Kernels, Thread and Block Configuration** 

<div style="text-align: center;">
  <img src="./images/CUDA-GridBlockThread-Structure.png" alt="Threads and blocks configuration">
</div>

- Defining grid and block structure:
    - dim3 is a built-in CUDA type that represents 3D vectors.
    - threadsPerBlock specifies the dimensions of each block in terms of threads.
    - ```dim3 threadsPerBlock(2, 2, 2);``` : Here, each block contains 2 threads along the x-axis, 2 along the y-axis, and 2 along the z-axis, totaling 2×2×2 = 8 threads per block.
    - numberOfBlocks specifies the dimensions of the grid in terms of blocks.
    - ```dim3 numberOfBlocks(2, 2, 2);``` : The grid consists of 2 blocks along each of the three axes, resulting in 2×2×2 = 8 blocks in total.
    - 8 threads/block * 8 blocks = 64 threads.

- Indices and dimensions of Grid, Blocks and Threads:
    - gridDim:
        - Size of the grid (number of blocks) in x, y, and z dimensions.
        - Example: gridDim = dim3(4, 2, 1) for 4x2x1 blocks.

    - blockDim:
        - Size of a block (number of threads) in x, y, and z dimensions.
        - Example: blockDim = dim3(8, 4, 1) for 8x4x1 threads.

    - blockIdx:
        - Index of the current block in the grid.
        - Example: blockIdx = dim3(2, 1, 0) for the third block in x and second in y.

    - threadIdx:
        - Index of the current thread in the block.
        - Example: threadIdx = dim3(3, 1, 0) for the fourth thread in x and second in y.

- Kernel launch:
    - The ```kernel_name<<<numberOfBlocks, threadsPerBlock>>>(args);``` syntax is used to launch a kernel in CUDA, where `args` stands for the kernel's arguments.

- In CUDA, the keywords `__global__` and `__device__` are function qualifiers that define how and where functions are executed and called. Here's a concise explanation:

- `__global__`:
    - Purpose: Marks a function as a kernel, which can be called from the host (CPU) and executed on the device (GPU).
    - Execution:
        - Called from: Host code.
        - Executed on: GPU.
    - Special Notes:
        - Must return void.
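        - Cannot be called like a regular function; it must be launched with `<<<...>>>`. Launching it from device code requires dynamic parallelism (compute capability 3.5 and above).
    - Syntax: Uses triple angle brackets `<<<...>>>` to specify the execution configuration (grid and block dimensions) when invoking the kernel.
    - Example: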

In [None]:
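# __global__ void add(const int *a, const int *b, int *c) {
#     int idx = threadIdx.x;  // single block, so threadIdx.x is the element index
#     c[idx] = a[idx] + b[idx];
# }

# int main() {
#     // a, b and c must be device pointers (e.g. from cudaMalloc), 256 ints each
#     // Call kernel
#     add<<<1, 256>>>(a, b, c);  // 1 block, 256 threads
# }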

- `__device__`:
    - Purpose: Marks a function as a device function, which is executed on the GPU and can only be called from another GPU function (e.g., another __global__ or __device__ function).
    - Execution:
        - Called from: GPU code.
        - Executed on: GPU.
    - Special Notes:
        - Can return values.
        - Cannot be called directly from the host.
    - Syntax: Called like a regular function (no <<<...>>> required).
    - Example:

In [None]:
# __device__ int square(int x) {
#     return x * x;
# }

# __global__ void calculateSquares(int *a) {
#     int idx = threadIdx.x;
#     a[idx] = square(idx);  // Call __device__ function
# }

- Block ID, Block Offset, Thread ID and Global Thread ID
    - blockId
        - Purpose: Calculates the unique 1D ID of a block within the entire 3D grid.
        - Explanation:
            - blockIdx.x: Block's position in the x-dimension.
            - blockIdx.y * gridDim.x: Adds the blocks from rows above in the grid.
            - blockIdx.z * gridDim.x * gridDim.y: Accounts for the blocks in the z-dimension (layers).
        - Result: A unique block ID (blockId) in the entire grid.

    - blockOffset 
        - Purpose: Computes the starting global thread index of the block.
        - Explanation:
            - blockId: Unique block ID computed above.
            - blockDim.x * blockDim.y * blockDim.z: Total number of threads in a block.
            - Multiplying the two gives the starting position of the block in the global thread index space.
    
    - threadId
        - Purpose: Computes the local thread ID (0-based index) within a block.
        - Explanation:
            - threadIdx.x: Thread’s position in the x-dimension within the block.
            - threadIdx.y * blockDim.x: Adds threads from rows in the y-dimension.
            - threadIdx.z * blockDim.x * blockDim.y: Adds threads from layers in the z-dimension.
        - Result: A unique thread ID (threadId) within the block.
    
    - globalThreadId
        - Purpose: Combines the block’s starting position (blockOffset) and the thread’s position within the block (threadId) to compute the global thread ID.
        - Explanation:
            - Threads in each block are indexed starting from the blockOffset.
            - Adding threadId to blockOffset gives a unique global ID for the thread in the entire grid.
        - Result: globalThreadId uniquely identifies a thread in the entire grid.
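
The four quantities described above can be sketched as a small kernel. This is a minimal illustration rather than the contents of `02_kernel_thread_and_block.cu`; the helper `flatten3D` and the kernel name `writeGlobalIds` are assumptions made for this example. It uses the 2×2×2 grid of 2×2×2 blocks from earlier, so 64 threads in total.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Flattens a 3D index (x, y, z) into a 1D index for a box that is
// dimX wide and dimY high. (Hypothetical helper for this illustration.)
__host__ __device__ int flatten3D(int x, int y, int z, int dimX, int dimY) {
    return x + y * dimX + z * dimX * dimY;
}

__global__ void writeGlobalIds(int *out) {
    // Unique 1D ID of this block within the 3D grid.
    int blockId = flatten3D(blockIdx.x, blockIdx.y, blockIdx.z, gridDim.x, gridDim.y);
    // Index of the first thread of this block in the global numbering.
    int blockOffset = blockId * (blockDim.x * blockDim.y * blockDim.z);
    // Unique 1D ID of this thread within its block.
    int threadId = flatten3D(threadIdx.x, threadIdx.y, threadIdx.z, blockDim.x, blockDim.y);
    int globalThreadId = blockOffset + threadId;
    out[globalThreadId] = globalThreadId;
}

int main() {
    dim3 numberOfBlocks(2, 2, 2);
    dim3 threadsPerBlock(2, 2, 2);
    const int total = 64;  // 8 blocks * 8 threads per block

    int hOut[total];
    int *dOut = nullptr;
    if (cudaMalloc(&dOut, total * sizeof(int)) != cudaSuccess) return 1;

    writeGlobalIds<<<numberOfBlocks, threadsPerBlock>>>(dOut);
    cudaError_t err = cudaMemcpy(hOut, dOut, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Every slot should hold its own index if the IDs are unique and dense.
    bool ok = true;
    for (int i = 0; i < total; ++i) ok = ok && (hOut[i] == i);
    printf("Global thread IDs cover 0..%d exactly once: %s\n", total - 1, ok ? "yes" : "no");
    return ok ? 0 : 1;
}
```

Because each thread writes its own ID to the slot with that index, the host-side check confirms that the formula assigns every thread in the grid a distinct ID with no gaps.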

In [None]:
!make SRC=./src/02_kernel_thread_and_block.cu run

In [None]:
!make SRC=./src/02_kernel_thread_and_block.cu clean

---
## **03 - Memory Allocation and Deallocation**

- `cudaMalloc`:
    - Explicitly allocates memory in the GPU's global memory.
    - `cudaError_t cudaMalloc(void **devPtr, size_t size);`
        - devPtr: Pointer to the memory location on the device. A pointer-to-pointer (void **) is required because the function modifies the pointer value to point to the allocated device memory.
        - size: Number of bytes to allocate.
    - Necessary because the CPU and GPU have separate memory spaces.

- `cudaMemcpy`:
    - Handles data transfer between host and device.
    - `cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind);`
        - dst: Destination memory address.
        - src: Source memory address.
        - count: Number of bytes to copy.
        - kind: Direction of data transfer:
            - cudaMemcpyHostToDevice: Copy data from host (CPU) to device (GPU).
            - cudaMemcpyDeviceToHost: Copy data from device (GPU) to host (CPU).
            - cudaMemcpyDeviceToDevice: Copy data between two device memory locations.
    - Essential for initializing GPU computations with host data and retrieving results.


- `cudaFree`:
    - Frees GPU memory to avoid memory leaks.
    - `cudaError_t cudaFree(void *devPtr);`
        - devPtr: Pointer to the memory on the device to be freed.
    - Crucial for efficient memory management in GPU applications.

Together, these functions form the foundation of memory management in CUDA programming.
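
A typical round trip with these three calls might look like the sketch below. It is not the contents of `03_memory_allocation.cu`; the kernel `add`, the wrapper `runVectorAdd`, and the array size are assumptions chosen for illustration, and only a combined error check is done to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(const int *a, const int *b, int *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];  // guard against extra threads
}

// Adds two small host arrays on the GPU and verifies the result on the host.
// (Hypothetical wrapper name, used only for this illustration.)
bool runVectorAdd() {
    const int n = 256;
    const size_t bytes = n * sizeof(int);
    int hA[n], hB[n], hC[n];
    for (int i = 0; i < n; ++i) { hA[i] = i; hB[i] = 2 * i; }

    // 1. Allocate device memory (separate from host memory).
    int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess;

    if (ok) {
        // 2. Copy the inputs from host to device.
        ok = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
             cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    }
    if (ok) {
        // 3. Compute on the device: 1 block of 256 threads.
        add<<<1, n>>>(dA, dB, dC, n);
        // 4. Copy the result back; this call waits for the kernel to finish.
        ok = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // 5. Free device memory (cudaFree accepts a null pointer).
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    for (int i = 0; ok && i < n; ++i) ok = (hC[i] == 3 * i);
    return ok;
}

int main() {
    bool ok = runVectorAdd();
    printf("Vector add %s\n", ok ? "succeeded" : "failed");
    return ok ? 0 : 1;
}
```

Passing `&dA` works because `cudaMalloc` writes the device address into the pointer, which is why the API takes a pointer-to-pointer.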

In [None]:
!make SRC=./src/03_memory_allocation.cu run

In [None]:
!make SRC=./src/03_memory_allocation.cu clean

---
## **04 - Memory Hierarchy**

Explanation of Memory Hierarchy in Code

1. Global Memory:
    - `globalData` and output arrays are stored in global memory.
    - Threads load values from global memory into registers for computation.
    - Global memory access is slow and requires coalesced access for efficiency.

2. Register Memory:
    - `regValue` is a register variable, stored in the fastest memory on the GPU.
    - Registers are private to each thread and provide the fastest access time.

3. Shared Memory:
    - `sharedMem` is shared among threads in a block.
    - Faster than global memory but limited in size (e.g., 48 KB per block by default on modern GPUs).
    - Example: Compute the sum of all thread values in a block using shared memory.

4. Local Memory:
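    - `localMem` is allocated dynamically inside the kernel, and each thread uses only its own allocation.
    - In-kernel `malloc` draws from the device heap, which resides in global memory; compiler-managed local memory (register spills, large per-thread arrays) is likewise backed by device memory.
    - Used for per-thread data not shared with others.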

Notes

- Performance Tips:
    - Use shared memory to minimize global memory accesses.
    - Avoid excessive use of local memory, as it is stored in global memory and can be slow.
- Synchronization:
    - `__syncthreads()` is essential to ensure all threads in a block have finished accessing shared memory before proceeding.

This code demonstrates how memory is accessed and utilized across the GPU memory hierarchy in CUDA.
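
The interplay of global memory, registers, and shared memory can be sketched with a per-block sum. This is an illustration under stated assumptions, not the contents of `04_memory_hierarchy.cu`: the kernel name `blockSum`, the tree reduction, the launch shape (4 blocks of 32 threads), and the input values `i + 10` are all choices made for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 32

__global__ void blockSum(const int *globalData, int *blockSums) {
    __shared__ int sharedMem[THREADS_PER_BLOCK];  // one copy per block

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    int regValue = globalData[gid];  // global memory -> register
    sharedMem[tid] = regValue;       // register -> shared memory
    __syncthreads();                 // all writes visible before reading

    // Tree reduction in shared memory; blockDim.x must be a power of two.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) sharedMem[tid] += sharedMem[tid + stride];
        __syncthreads();             // finish this level before the next
    }
    if (tid == 0) blockSums[blockIdx.x] = sharedMem[0];
}

// Reference sum computed on the host for one block's slice of the data.
int hostBlockSum(const int *data, int block) {
    int sum = 0;
    for (int i = 0; i < THREADS_PER_BLOCK; ++i) sum += data[block * THREADS_PER_BLOCK + i];
    return sum;
}

int main() {
    const int numBlocks = 4;
    const int n = numBlocks * THREADS_PER_BLOCK;
    int hData[n], hSums[numBlocks];
    for (int i = 0; i < n; ++i) hData[i] = i + 10;  // assumed input pattern

    int *dData = nullptr, *dSums = nullptr;
    if (cudaMalloc(&dData, n * sizeof(int)) != cudaSuccess) return 1;
    if (cudaMalloc(&dSums, numBlocks * sizeof(int)) != cudaSuccess) return 1;
    cudaMemcpy(dData, hData, n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<numBlocks, THREADS_PER_BLOCK>>>(dData, dSums);
    cudaError_t err = cudaMemcpy(hSums, dSums, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dSums);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int b = 0; b < numBlocks; ++b) {
        printf("Block %d: shared memory sum = %d (host reference %d)\n",
               b, hSums[b], hostBlockSum(hData, b));
    }
    return 0;
}
```

The second `__syncthreads()` sits outside the `if` so that every thread in the block reaches it; placing a barrier inside a divergent branch is undefined behavior.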

Key Differences Between malloc/free and cudaMalloc/cudaFree

| **Aspect**               | **`malloc` / `free` in device code** | **`cudaMalloc` / `cudaFree`**           |
|--------------------------|----------------------------------------|-----------------------------------------|
| **Scope**                | Each calling thread gets its own allocation. | One allocation for the whole device. |
| **Location**             | Device heap, which resides in global memory. | Global memory.      |
| **Usage**                | Called inside a kernel (`__global__` or `__device__`). | Called from the host.                   |
| **Speed**                | Relatively slow, because every calling thread allocates from the device heap. | Faster but requires explicit host calls.|
| **Purpose**              | Per-thread scratch space whose size is known only at run time. | Buffers accessed by all threads and blocks, and used for host-device transfers. |
| **Memory Lifetime**      | Until released with in-kernel `free` or until the CUDA context is destroyed; not released automatically when the kernel ends. | Persistent across kernel launches (until freed). |
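
To make the in-kernel side of this comparison concrete, here is a minimal sketch in which every thread allocates, uses, and frees its own scratch buffer, while the output buffer comes from `cudaMalloc`. The kernel name `perThreadScratch`, the wrapper `runPerThreadScratch`, and the sizes are assumptions made for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void perThreadScratch(int *out, int itemsPerThread) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread requests its own buffer from the device heap.
    int *localMem = (int *)malloc(itemsPerThread * sizeof(int));
    if (localMem == nullptr) {  // the device heap can run out
        out[gid] = -1;
        return;
    }

    int sum = 0;
    for (int i = 0; i < itemsPerThread; ++i) {
        localMem[i] = gid + i;
        sum += localMem[i];
    }
    out[gid] = sum;

    free(localMem);  // must be released in device code
}

// Launches the kernel and checks every thread's result on the host.
// (Hypothetical wrapper name, used only for this illustration.)
bool runPerThreadScratch() {
    const int blocks = 2, threads = 32, items = 4;
    const int n = blocks * threads;
    int hOut[n];

    int *dOut = nullptr;  // host-side allocation, visible to all threads
    if (cudaMalloc(&dOut, n * sizeof(int)) != cudaSuccess) return false;

    perThreadScratch<<<blocks, threads>>>(dOut, items);
    cudaError_t err = cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    // Thread gid sums gid, gid+1, ..., gid+items-1.
    bool ok = true;
    for (int gid = 0; gid < n; ++gid) {
        ok = ok && (hOut[gid] == items * gid + items * (items - 1) / 2);
    }
    return ok;
}

int main() {
    bool ok = runPerThreadScratch();
    printf("Per-thread malloc/free %s\n", ok ? "succeeded" : "failed");
    return ok ? 0 : 1;
}
```

For a fixed, small per-thread size like this one, a plain local array would usually be the better choice; in-kernel `malloc` is mainly useful when the size is not known until run time.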


In [73]:
!make SRC=./src/04_memory_hierarchy.cu run

nvcc -o ./src/04_memory_hierarchy ./src/04_memory_hierarchy.cu
././src/04_memory_hierarchy
Block 2: Shared memory sum = 2864
Block 1: Shared memory sum = 1840
Block 0: Shared memory sum = 816
Block 3: Shared memory sum = 3888
Output from GPU:
20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 130 132 134 136 138 140 142 144 146 148 150 152 154 156 158 160 162 164 166 168 170 172 174 176 178 180 182 184 186 188 190 192 194 196 198 200 202 204 206 208 210 212 214 216 218 220 222 224 226 228 230 232 234 236 238 240 242 244 246 248 250 252 254 256 258 260 262 264 266 268 270 272 274 


In [74]:
!make SRC=./src/04_memory_hierarchy.cu clean

rm -f ./src/04_memory_hierarchy


---
## **05 - Synchronization**

1. Warp-Level Synchronization

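    - Threads in a warp (32 threads) are scheduled together (SIMT).
    - On GPUs before Volta, the threads of a warp execute in lock-step, so `__syncthreads()` is not needed within a warp. Since Volta (compute capability 7.0), independent thread scheduling means lock-step is not guaranteed; use `__syncwarp()` or the `*_sync` warp primitives when threads in a warp exchange data.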
    - Limited to threads in the same warp, with no inter-warp communication.

2. Block-Level Synchronization

    - Synchronization within a block using `__syncthreads()` ensures all threads reach a barrier before proceeding.
    - Threads in a block can communicate via shared memory.
    - Cannot synchronize threads across different blocks.

3. Grid-Level Synchronization

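    - Achieved through host intervention: splitting the work into separate kernel launches and using `cudaDeviceSynchronize()` ensures all blocks complete before the host continues or launches the next kernel.
    - Involves storing intermediate results in global memory.
    - No direct GPU-level synchronization across blocks in an ordinary kernel launch; cooperative groups offer a grid-wide barrier, but only for kernels started with a cooperative launch.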
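
The three levels can be combined in one small reduction, sketched below. It is not the contents of `05_synchronization.cu`; the names `warpSum`, `blockSumKernel`, and `runSyncDemo`, the all-ones input, and the launch shape (2 blocks of 256 threads) are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warp level: sums a value across the 32 lanes of a warp with shuffles.
// The *_sync primitive synchronizes the participating lanes itself.
__device__ int warpSum(int value) {
    for (int offset = 16; offset > 0; offset /= 2) {
        value += __shfl_down_sync(0xffffffffu, value, offset);
    }
    return value;  // lane 0 holds the warp total
}

// Block level: combines the warp totals through shared memory.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__global__ void blockSumKernel(const int *in, int *blockSums) {
    __shared__ int warpTotals[32];
    int tid = threadIdx.x;
    int lane = tid % 32;
    int warp = tid / 32;

    int value = warpSum(in[blockIdx.x * blockDim.x + tid]);
    if (lane == 0) warpTotals[warp] = value;
    __syncthreads();  // all warp totals written before warp 0 reads them

    if (warp == 0) {
        int numWarps = blockDim.x / 32;
        value = (lane < numWarps) ? warpTotals[lane] : 0;
        value = warpSum(value);
        if (lane == 0) blockSums[blockIdx.x] = value;
    }
}

// Grid level: the host waits for every block before reading the results.
// (Hypothetical wrapper name, used only for this illustration.)
bool runSyncDemo() {
    const int numBlocks = 2, threads = 256;
    const int n = numBlocks * threads;
    int hIn[n], hSums[numBlocks];
    for (int i = 0; i < n; ++i) hIn[i] = 1;

    int *dIn = nullptr, *dSums = nullptr;
    if (cudaMalloc(&dIn, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&dSums, numBlocks * sizeof(int)) != cudaSuccess) return false;
    cudaMemcpy(dIn, hIn, n * sizeof(int), cudaMemcpyHostToDevice);

    blockSumKernel<<<numBlocks, threads>>>(dIn, dSums);
    cudaError_t err = cudaDeviceSynchronize();  // all blocks have finished here
    if (err == cudaSuccess) {
        err = cudaMemcpy(hSums, dSums, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(dIn);
    cudaFree(dSums);
    if (err != cudaSuccess) return false;

    // Each block sums 256 ones.
    return hSums[0] == threads && hSums[1] == threads;
}

int main() {
    bool ok = runSyncDemo();
    printf("Synchronization demo %s\n", ok ? "succeeded" : "failed");
    return ok ? 0 : 1;
}
```

The explicit `cudaDeviceSynchronize()` is shown for clarity; the blocking `cudaMemcpy` that follows would also wait for the kernel to finish.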

In [72]:
!make SRC=./src/05_synchronization.cu run

nvcc -o ./src/05_synchronization ./src/05_synchronization.cu
././src/05_synchronization
Warp-level synchronization:
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 

Block-level synchronization:
Block sums:
Block 0 sum: 512
Block 1 sum: 512

Grid-level synchronization:
Block 1: Updated sum after iteration 1 = 1
Block 0: Updated sum after iteration 1 = 256
Block 1: Updated sum after iteration 2 = 3
Block 0: Updated sum after iteration 2 = 256
Block 1: Updated sum after iteration 3 = 6
Block 0: Updated sum after iteration 3 = 256
Block 1: Updated sum after iteration 4 = 10
Block 0: Updated sum after iteration 4 = 256
Block 1: Updated sum after iteration 5 = 15
Block 0: Updated sum after iteration 5 = 256
Final block sums:
Block 0 sum: 256
Block 1 sum: 256


In [75]:
!make SRC=./src/05_synchronization.cu clean

rm -f ./src/05_synchronization


---
## **06 - Error Handling**

- Error Handling Mechanism:

    - `cudaGetLastError()`:
        - Retrieves the last error that occurred.
        - Resets the error status to cudaSuccess for subsequent checks.
    - `cudaGetErrorString()`:
        - Converts a CUDA error code into a human-readable string.

- Kernel Launch Error:

    - The `faultyKernel` deliberately attempts an out-of-bounds memory access, which will cause an illegal memory access error.

- Error Propagation:

    - Each CUDA API call and kernel launch is followed by an error check using `checkCudaError("Error Message");`.

- Device Synchronization:

    - `cudaDeviceSynchronize()` ensures that all kernel executions and memory operations are complete, making runtime errors visible to the host.

- Error Messages:

    - If an error occurs, the program prints the error message and exits with EXIT_FAILURE.
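
A helper along these lines could look like the sketch below. The name `checkCudaError` comes from the text above, but its body here is an assumption, not the contents of `06_error_handling.cu`. The kernel `fill` and the function `launchWithTooManyThreads` are also hypothetical. Instead of an out-of-bounds access, the example provokes an invalid launch configuration, which is reported at launch time and leaves the CUDA context usable.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Prints the pending CUDA error, if any, and exits. (Assumed implementation.)
void checkCudaError(const char *message) {
    cudaError_t err = cudaGetLastError();  // also resets the error state
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", message, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

__global__ void fill(int *data, int n, int value) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = value;
}

// Requests 4096 threads per block, above the 1024 limit, and returns
// the error reported for the launch. (Hypothetical demonstration function.)
cudaError_t launchWithTooManyThreads(int *dData, int n) {
    fill<<<1, 4096>>>(dData, n, 7);
    return cudaGetLastError();
}

int main() {
    const int n = 1024;
    int *dData = nullptr;

    cudaMalloc(&dData, n * sizeof(int));
    checkCudaError("cudaMalloc failed");

    fill<<<(n + 255) / 256, 256>>>(dData, n, 7);
    checkCudaError("Kernel launch failed");      // configuration errors

    cudaDeviceSynchronize();
    checkCudaError("Kernel execution failed");   // errors raised while running

    // A deliberately invalid launch: the error is reported, not fatal.
    cudaError_t err = launchWithTooManyThreads(dData, n);
    printf("Launch with 4096 threads per block: %s\n", cudaGetErrorString(err));

    cudaFree(dData);
    checkCudaError("cudaFree failed");
    return 0;
}
```

Checking once right after the launch and again after `cudaDeviceSynchronize()` separates the two failure classes: kernel launches are asynchronous, so errors that occur while the kernel runs only become visible to the host after synchronization.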

In [76]:
!make SRC=./src/06_error_handling.cu run

nvcc -o ./src/06_error_handling ./src/06_error_handling.cu
/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/13/../../../x86_64-linux-gnu/Scrt1.o: in function `_start':
(.text+0x1b): undefined reference to `main'
collect2: error: ld returned 1 exit status
make: *** [Makefile:11: src/06_error_handling] Error 1


In [77]:
!make SRC=./src/06_error_handling.cu clean

rm -f ./src/06_error_handling
