# **Advanced CUDA**

The Jupyter kernel (ipykernel) may not see the same `PATH` as your terminal, so `nvcc` can be missing inside the notebook even though it works in a shell. Append the CUDA `bin` directory to `PATH` before compiling.

In [29]:
import os
os.environ["PATH"] += ":/usr/local/cuda/bin"

# Verify nvcc is now accessible
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0


---
## **01 - Atomic Operations**

An atomic operation in CUDA is a read-modify-write operation that is indivisible: once it starts, no other thread can observe or interfere with its intermediate state. When multiple threads attempt to modify the same memory location (in global or shared memory), atomic operations ensure that these modifications are applied one at a time, avoiding race conditions.

For example, when multiple threads try to increment a shared counter, an atomic operation ensures that each increment happens sequentially, even if threads are running concurrently.

**Why Are Atomic Operations Necessary?**

In parallel programming, multiple threads often need to access or update shared data. Without synchronization mechanisms like atomic operations, the following issues can arise:
- Race Conditions: Multiple threads attempt to update the same variable simultaneously, leading to inconsistent results.
- Data Corruption: Intermediate results of one thread's operation can be overwritten by another thread.
- Incorrect Computation: Operations that depend on shared data (e.g., summation, counting) may produce wrong results due to simultaneous accesses.

Atomic operations prevent these problems by serializing access to the shared resource, ensuring that only one thread modifies the variable at a time.

**Common Atomic Operations in CUDA**

CUDA provides several atomic functions that operate on different data types and perform common operations:

- Arithmetic and Exchange Operations:
    - atomicAdd: Adds a value to a shared variable.
    - atomicSub: Subtracts a value from a shared variable.
    - atomicExch: Replaces the stored value with a new one and returns the old value.

- Comparison and Logical Operations:
    - atomicMin: Updates the variable with the minimum of the current and provided value.
    - atomicMax: Updates the variable with the maximum of the current and provided value.
    - atomicCAS (Compare and Swap): Writes a new value only if the variable currently equals a specified expected value, and returns the value it found.

- Bitwise Operations:
    - atomicAnd: Performs a bitwise AND.
    - atomicOr: Performs a bitwise OR.
    - atomicXor: Performs a bitwise XOR.
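A few of the functions listed above can be tried in one small kernel. The following is a minimal sketch, not code from this repository: the kernel name `atomicDemoKernel`, the `DemoResult` struct, and the launch shape of 4 blocks of 64 threads are all assumptions chosen for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical container for the three values the demo collects.
struct DemoResult {
    int maxVal;         // largest global thread index seen
    unsigned int mask;  // one bit per position within a group of 32 threads
    int firstClaim;     // index of whichever thread won the compare-and-swap
};

// Every thread updates the same three locations, so all updates must be atomic.
__global__ void atomicDemoKernel(int *maxVal, unsigned int *mask, int *firstClaim) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    atomicMax(maxVal, tid);            // keeps the maximum of old value and tid
    atomicOr(mask, 1u << (tid % 32));  // sets one bit without losing other bits
    atomicCAS(firstClaim, -1, tid);    // succeeds only while the value is still -1
}

DemoResult runAtomicDemo() {
    DemoResult h = {0, 0u, -1};
    int *d_max, *d_claim;
    unsigned int *d_mask;
    cudaMalloc((void **)&d_max, sizeof(int));
    cudaMalloc((void **)&d_mask, sizeof(unsigned int));
    cudaMalloc((void **)&d_claim, sizeof(int));
    cudaMemcpy(d_max, &h.maxVal, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_mask, &h.mask, sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_claim, &h.firstClaim, sizeof(int), cudaMemcpyHostToDevice);

    atomicDemoKernel<<<4, 64>>>(d_max, d_mask, d_claim);  // 256 threads in total

    // cudaMemcpy on the default stream waits for the kernel to finish first.
    cudaMemcpy(&h.maxVal, d_max, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h.mask, d_mask, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h.firstClaim, d_claim, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_max);
    cudaFree(d_mask);
    cudaFree(d_claim);
    return h;
}

int main() {
    DemoResult r = runAtomicDemo();
    printf("max = %d, mask = 0x%08X, first claim = %d\n", r.maxVal, r.mask, r.firstClaim);
    return 0;
}
```

With this launch shape the maximum is 255 and every bit of the mask ends up set. The value of `firstClaim` depends on which thread reaches `atomicCAS` first, so it can differ between runs.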

In [30]:
!make SRC=./src/01_atomic_operations.cu run

nvcc -o ./src/01_atomic_operations ./src/01_atomic_operations.cu
././src/01_atomic_operations
Sum of array elements (normalSumKernel): 1
Sum of array elements (atomicSumKernel): 1024


**Code Explanation**

- Normal Sum (`normalSumKernel`):
    - Each thread reads from the input array and adds its value to the shared variable `result` without any synchronization.
    - Issue: Multiple threads may read and write `result` at the same time, leading to race conditions and an incorrect sum.

- Atomic Sum (`atomicSumKernel`):
    - Uses `atomicAdd` to safely add each thread’s contribution to `result`.
    - Solution: Only one thread updates `result` at a time, which prevents race conditions and produces the correct sum.

- Result Comparison:
    - `normalSumKernel`: The result is incorrect because threads overwrite each other's updates. In the run above it prints 1 instead of 1024.
    - `atomicSumKernel`: Produces the correct sum (1024) by using atomic operations to serialize updates.

**Why Are Atomic Operations Crucial in This Code?**

- The shared variable `result` is updated concurrently by multiple threads.
- Without `atomicAdd`, the concurrent updates are not safe and some of them are lost.
- `atomicAdd` ensures correctness but can slow performance due to thread serialization.
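The source file `01_atomic_operations.cu` is not reproduced in this notebook, so the following is only a sketch of what the two kernels described above might look like. The kernel names match the program output, but the bodies, the helper `sumOnDevice`, the input of 1024 ones, and the block size of 256 are assumptions. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Racy version: "*result += in[i]" is a separate load, add and store,
// so concurrent threads can overwrite each other's updates.
__global__ void normalSumKernel(const int *in, int *result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) *result += in[i];
}

// Safe version: atomicAdd performs the load, add and store as one indivisible step.
__global__ void atomicSumKernel(const int *in, int *result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(result, in[i]);
}

// Hypothetical helper: sums n ones on the GPU with the chosen kernel.
int sumOnDevice(bool useAtomic, int n) {
    std::vector<int> h_in(n, 1);  // assumed input: every element is 1
    int *d_in, *d_result, h_result = 0;
    cudaMalloc((void **)&d_in, n * sizeof(int));
    cudaMalloc((void **)&d_result, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (useAtomic) {
        atomicSumKernel<<<blocks, threads>>>(d_in, d_result, n);
    } else {
        normalSumKernel<<<blocks, threads>>>(d_in, d_result, n);
    }

    // Blocking copy: waits for the kernel before reading the result.
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_result);
    return h_result;
}

int main() {
    printf("Sum of array elements (normalSumKernel): %d\n", sumOnDevice(false, 1024));
    printf("Sum of array elements (atomicSumKernel): %d\n", sumOnDevice(true, 1024));
    return 0;
}
```

The atomic version returns `n` for an input of ones. The value returned by the racy version is not defined and may vary between GPUs and runs.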

**Counter using Normal Addition and Atomic Addition**

In [31]:
!make SRC=./src/01_atomic_operations.cu clean

rm -f ./src/01_atomic_operations


---
## **02 - Events**

CUDA events are a mechanism in the CUDA API used to measure the time taken by operations on the GPU or to synchronize operations between different streams. CUDA events are lightweight and designed specifically for timing and synchronization tasks in GPU programming.

**Necessity of CUDA Events**

- Performance Measurement: CUDA events allow you to measure the execution time of GPU operations accurately.
- Synchronization: Events can synchronize streams or host-device operations without blocking the entire application.
- Granular Timing: They provide more precise control and insight compared to cudaDeviceSynchronize or host-based timers.
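The synchronization use case in the list above can be sketched with `cudaStreamWaitEvent`, which makes one stream wait for an event recorded in another without blocking the host. The kernels `produce` and `consume`, the helper `runProducerConsumer`, and the stream names are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// First stage: fill the buffer.
__global__ void produce(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Second stage: must not start before the first stage has finished.
__global__ void consume(const int *data, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2 * data[i];
}

// Hypothetical helper: returns the last element written by the consumer.
int runProducerConsumer(int n) {
    int *d_data, *d_out, h_last = -1;
    cudaMalloc((void **)&d_data, n * sizeof(int));
    cudaMalloc((void **)&d_out, n * sizeof(int));

    cudaStream_t producer, consumer;
    cudaStreamCreate(&producer);
    cudaStreamCreate(&consumer);

    // Timing is not needed here, so it is disabled to keep the event cheap.
    cudaEvent_t dataReady;
    cudaEventCreateWithFlags(&dataReady, cudaEventDisableTiming);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    produce<<<blocks, threads, 0, producer>>>(d_data, n);
    cudaEventRecord(dataReady, producer);         // marks the end of the producer's work
    cudaStreamWaitEvent(consumer, dataReady, 0);  // consumer stream waits on the GPU side
    consume<<<blocks, threads, 0, consumer>>>(d_data, d_out, n);

    cudaStreamSynchronize(consumer);              // host waits only for the consumer stream
    cudaMemcpy(&h_last, d_out + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventDestroy(dataReady);
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
    cudaFree(d_data);
    cudaFree(d_out);
    return h_last;
}

int main() {
    printf("last element: %d\n", runProducerConsumer(1024));
    return 0;
}
```

Only the consumer stream waits for the event. The host thread continues issuing work until it reaches `cudaStreamSynchronize`.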

In [32]:
!make SRC=./src/02_events.cu run

nvcc -o ./src/02_events ./src/02_events.cu
././src/02_events
Kernel execution time: 474.403 ms


**Create Events:**

`cudaEvent_t start, stop;`

`cudaEventCreate(&start);`

`cudaEventCreate(&stop);`

- `start` and `stop` are event handles.

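**Record Events:**

`cudaEventRecord(start);`

`cudaEventRecord(stop);`

- Record `start` immediately before the kernel launch and `stop` immediately after it. Both calls only enqueue the event in the stream, so they return without waiting for the GPU.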

**Synchronize Events:**

`cudaEventSynchronize(stop);`

- Blocks the host until the `stop` event has been reached, which means all work enqueued before `stop` has completed.

**Calculate Elapsed Time:**

`cudaEventElapsedTime(&milliseconds, start, stop);`

- Computes time in milliseconds between the start and stop events.

**Destroy Events:**

`cudaEventDestroy(start);`

`cudaEventDestroy(stop);`

- Cleans up event resources.
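Put together, the steps above form the usual timing pattern. This is a minimal sketch rather than the contents of `02_events.cu`: the kernel `busyKernel`, the helper `timeKernelMs`, and the problem size are assumptions, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload: a short arithmetic loop per element.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = data[i];
        for (int k = 0; k < 1000; ++k) x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

// Returns the GPU time of one kernel launch in milliseconds.
float timeKernelMs(int n) {
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);                  // enqueued before the kernel
    busyKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);                   // enqueued after the kernel

    cudaEventSynchronize(stop);              // wait until the GPU reaches `stop`
    float milliseconds = 0.0f;
    cudaEventElapsedTime(&milliseconds, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return milliseconds;
}

int main() {
    printf("Kernel execution time: %f ms\n", timeKernelMs(1 << 20));
    return 0;
}
```

The first launch in a process often includes one-time setup cost, so a warm-up launch before the timed one tends to give more representative numbers.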

In [33]:
!make SRC=./src/02_events.cu clean

rm -f ./src/02_events


---
## **03 - Streams**

**CUDA Streams: Enhancing GPU Parallelism**

CUDA streams are a feature of NVIDIA's CUDA programming model that allows operations to execute concurrently on the GPU. They provide an additional layer of parallelism beyond the thread and block model, enabling more efficient utilization of GPU resources.

- Key Concepts

    - Definition: A CUDA stream is a sequence of operations that execute on the GPU in the order they were issued.
    - Purpose: Streams enable concurrent execution of kernels and memory transfers, improving overall performance.
    - Default Stream: All CUDA operations occur in the default stream if not specified otherwise.

- Stream Behavior

    - Ordering: Operations within a single stream are executed sequentially.
    - Concurrency: Different non-default streams can execute operations concurrently.
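    - Default Stream Behavior: The legacy default stream is blocking: it implicitly synchronizes with all other blocking streams, so work placed in it does not overlap with them. Streams created with the `cudaStreamNonBlocking` flag are exempt from this implicit synchronization.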

<div style="text-align: center;">
  <img src="./images/cuda_streams.bmp" alt="CUDA Streams" width="800">
</div>

This image illustrates the performance difference between serial and concurrent CUDA stream execution.

- The top portion shows a Serial execution where operations happen sequentially:

    1. Memory copy from Host to Device (H2D)
    2. Kernel execution
    3. Memory copy from Device to Host (D2H)

- The bottom portion shows Concurrent execution using three streams:

    - Stream 1, 2, and 3 execute their operations (H2D, Kernel, D2H) in parallel
    - Operations within each stream remain sequential
    - Streams are staggered in time, allowing overlap of different operations

- The red dotted lines highlight the Performance improvement achieved through concurrent execution, showing how parallel streams complete the same workload in less time compared to serial execution. The green boxes represent memory transfers (H2D and D2H), while the blue boxes represent kernel executions. 

This visualization effectively demonstrates how CUDA streams can improve GPU utilization by overlapping computation and data transfer operations.

**Benefits of Using Streams**

- Improved GPU Utilization: Overlapping kernel execution with data transfers.
- Reduced Idle Time: Keeping the GPU busy with multiple concurrent operations.
- Enhanced Performance: Achieving higher throughput for certain workloads.

**Syntax**

`myKernel<<<gridSize, blockSize, sharedMem, stream>>>(parameters);`


- `gridSize`: Specifies the number of thread blocks in the grid.
- `blockSize`: Defines the number of threads in each block.
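- `sharedMem`: Number of bytes of dynamically allocated shared memory per block, in addition to any statically declared shared memory (optional, defaults to 0).
- `stream`: Specifies which CUDA stream will execute this kernel (optional, defaults to the default stream).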

**CUDA Stream Synchronization**

`cudaStreamSynchronize(cudaStream_t stream);` 

`cudaStreamSynchronize` blocks the host thread until all operations previously queued in the specified stream have completed. Work in other streams is not waited for.

Usage Scenarios:
- Ensuring data consistency before host access.
- Coordinating multiple stream operations.
- Managing dependencies between CPU and GPU tasks.
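The first scenario, ensuring data consistency before host access, can be sketched as follows. The kernel `squareKernel` and the helper `squareLastOnStream` are hypothetical names, pinned host memory from `cudaMallocHost` is an assumption made so that the copies can run asynchronously, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void squareKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

// Hypothetical helper: squares 0..n-1 on one stream and returns the last result.
float squareLastOnStream(int n) {
    size_t bytes = n * sizeof(float);
    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, bytes);  // pinned memory, needed for truly async copies
    cudaMalloc((void **)&d_data, bytes);
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // All three operations are queued in order and return immediately on the host.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    squareKernel<<<blocks, threads, 0, stream>>>(d_data, n);
    cudaMemcpyAsync(h_data, d_data, bytes, cudaMemcpyDeviceToHost, stream);

    // h_data may still be in flight here; reading it is only safe after this call.
    cudaStreamSynchronize(stream);
    float last = h_data[n - 1];

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return last;
}

int main() {
    printf("1023 squared = %.0f\n", squareLastOnStream(1024));
    return 0;
}
```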

**Squaring number - Without CUDA Stream**

In [34]:
!make SRC=./src/03a_no_streams.cu run

nvcc -o ./src/03a_no_streams ./src/03a_no_streams.cu
././src/03a_no_streams
Execution time without streams: 12.28 ms


**Squaring number - With CUDA Stream (Individually created streams)**

In [35]:
!make SRC=./src/03b_with_streams.cu run

nvcc -o ./src/03b_with_streams ./src/03b_with_streams.cu
././src/03b_with_streams
Execution time with streams: 2.63987 ms


**Squaring number - With CUDA Stream (Streams created in `for` loop)**

In [36]:
!make SRC=./src/03c_with_streams_for.cu run

nvcc -o ./src/03c_with_streams_for ./src/03c_with_streams_for.cu
././src/03c_with_streams_for
Execution time with streams: 3.52768 ms


The speedup from 12.28 ms to 2.64 ms and 3.53 ms (roughly 4.7x and 3.5x, respectively, in these single runs) occurs because:

- The workload is divided into 4 independent streams
- Memory transfers and kernel executions overlap across streams
- While one stream is executing its kernel, another stream can be performing memory transfers
- The GPU's hardware resources are utilized more efficiently through concurrent execution

This example demonstrates how CUDA streams can significantly improve performance by enabling parallel execution of operations that would otherwise need to wait for previous operations to complete in a serial implementation.
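The source files for the stream examples are not shown in this notebook, so the following is a sketch of the chunked pattern described above, not the actual contents of `03c_with_streams_for.cu`. The stream count of 4 comes from the text. The names `squareKernel` and `squareWithStreams`, the input values, and the assumption that `n` is divisible by 4 are illustrative. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int kNumStreams = 4;  // matches the 4 streams mentioned in the text

__global__ void squareKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

// Hypothetical helper: squares n values in 4 chunks, one chunk per stream.
// Assumes n is divisible by kNumStreams. Returns true if every result is correct.
bool squareWithStreams(int n) {
    size_t bytes = n * sizeof(float);
    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, bytes);  // pinned memory so copies can overlap kernels
    cudaMalloc((void **)&d_data, bytes);
    for (int i = 0; i < n; ++i) h_data[i] = (float)(i % 1000);

    cudaStream_t streams[kNumStreams];
    for (int s = 0; s < kNumStreams; ++s) cudaStreamCreate(&streams[s]);

    int chunk = n / kNumStreams;
    size_t chunkBytes = chunk * sizeof(float);
    int threads = 256;
    int blocks = (chunk + threads - 1) / threads;

    // Each stream handles its own chunk: copy in, compute, copy out.
    // Work queued in different streams is allowed to overlap.
    for (int s = 0; s < kNumStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        squareKernel<<<blocks, threads, 0, streams[s]>>>(d_data + offset, chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }

    // Inputs are below 1000, so their squares are exactly representable as floats.
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        float expected = (float)(i % 1000) * (float)(i % 1000);
        if (h_data[i] != expected) ok = false;
    }

    cudaFree(d_data);
    cudaFreeHost(h_data);
    return ok;
}

int main() {
    printf("streams result correct: %s\n", squareWithStreams(1 << 20) ? "yes" : "no");
    return 0;
}
```

How much overlap is actually achieved depends on the GPU's copy engines and on the size of each chunk, so the speedup will vary between devices.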

In [37]:
!make SRC=./src/03a_no_streams.cu clean
!make SRC=./src/03b_with_streams.cu clean
!make SRC=./src/03c_with_streams_for.cu clean

rm -f ./src/03a_no_streams
rm -f ./src/03b_with_streams
rm -f ./src/03c_with_streams_for


---
## **04 - Memory Coalescing**

---
## **05 - Shared Memory Bank Conflict**

---
## **06 - Warp Divergence**