# **CUDA BASICS**

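Add the CUDA `bin` directory to `PATH` inside the notebook. Even when `nvcc` is found in a terminal, the ipykernel process may not inherit the same `PATH`, so the compiler is not visible to notebook cells until the directory is added explicitly.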

In [10]:
import os
os.environ["PATH"] += ":/usr/local/cuda/bin"

# Verify nvcc is now accessible
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:18:05_PDT_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0
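
The cell above appends the directory unconditionally, so re-running it adds a duplicate entry each time. A minimal sketch of an idempotent variant follows. The helper name `add_to_path` is hypothetical, and `/usr/local/cuda/bin` is simply the install location used above.

```python
import os
import shutil

def add_to_path(directory, env=os.environ):
    """Append `directory` to env["PATH"] unless it is already listed."""
    entries = [e for e in env.get("PATH", "").split(os.pathsep) if e]
    if directory not in entries:
        entries.append(directory)
    env["PATH"] = os.pathsep.join(entries)
    return env["PATH"]

# Assumed install location, taken from the cell above.
add_to_path("/usr/local/cuda/bin")

# shutil.which returns the full path to nvcc, or None if it is still not found.
print(shutil.which("nvcc"))
```

Passing a plain dictionary as `env` lets the helper be tried without touching the real environment.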


---
## **01 - CUDA Device Properties**

Open the file [01_device_details.cu](./01_device_details.cu) to see the code.

In [11]:
!make SRC=01_device_details.cu run

nvcc -o 01_device_details 01_device_details.cu
./01_device_details

Number of CUDA devices: 1
Device #0: NVIDIA GeForce RTX 3060 Laptop GPU
  Compute Capability: 8.6
  Total Global Memory: 6.44193 GB
  Shared Memory per Block: 48 KB
  Registers per Block: 65536
  Warp Size: 32
  Max Threads per Block: 1024
  Number of SMs: 30
  Clock Rate: 1.425 GHz
  Max Threads Dimension: [1024, 1024, 64]
  Max Grid Size: [2147483647, 65535, 65535]
  L2 Cache Size: 3072 KB
  Memory Clock Rate: 7001 MHz
  Memory Bus Width: 192 bits




### 1. Basic Properties
- Device Name: Name of the GPU device.
  - Example: `NVIDIA RTX A6000`, `GeForce GTX 1080 Ti`.
- Compute Capability: Indicates the architecture and feature set supported by the GPU.
  - Format: `major.minor` (e.g., `7.5` for Turing, `8.0` for Ampere).
  - Determines compatibility with CUDA features.
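
As a rough illustration of how the `major.minor` pair is used, the sketch below maps a compute capability to an architecture name and to the `sm_XY` string accepted by `nvcc -arch`. The table is an illustrative, non-exhaustive assumption, and `describe` is a hypothetical helper name.

```python
# Illustrative, non-exhaustive mapping from compute capability to architecture.
ARCHITECTURES = {
    (5, 0): "Maxwell", (5, 2): "Maxwell",
    (6, 0): "Pascal", (6, 1): "Pascal",
    (7, 0): "Volta", (7, 5): "Turing",
    (8, 0): "Ampere", (8, 6): "Ampere",
    (8, 9): "Ada Lovelace", (9, 0): "Hopper",
}

def describe(major, minor):
    """Return (architecture name, nvcc -arch value) for a compute capability."""
    name = ARCHITECTURES.get((major, minor), "unknown")
    return name, f"sm_{major}{minor}"

# The device in the output above reports compute capability 8.6.
print(describe(8, 6))  # ('Ampere', 'sm_86')
```

The second value can be passed to the compiler, for example `nvcc -arch=sm_86`.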

### 2. Hardware Specifications
- Number of Multiprocessors (SMs): Number of Streaming Multiprocessors.
  - Higher SM count generally means higher parallelism.
- Max Threads Per Block: Maximum number of threads allowed per block.
  - `1024` on all GPUs with compute capability 2.0 or higher.
- Max Threads Per Multiprocessor: Maximum threads an SM can handle concurrently.
  - Dependent on the architecture (e.g., `2048` for Pascal and Volta, `1024` for Turing, `1536` for compute capability 8.6 Ampere).
- Max Blocks Per SM: Maximum number of thread blocks an SM can run simultaneously.
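
These limits interact: the number of blocks resident on one SM is bounded both by the block limit and by the thread limit. The sketch below uses the 30 SMs from the output above. The per-SM limits of 1536 threads and 16 blocks are assumptions for compute capability 8.6 (the program above does not print them), and the model ignores register and shared memory pressure.

```python
def resident_blocks_per_sm(threads_per_block,
                           max_threads_per_sm=1536,  # assumed for CC 8.6
                           max_blocks_per_sm=16):    # assumed for CC 8.6
    """Blocks that fit on one SM when only thread and block limits apply."""
    return min(max_blocks_per_sm, max_threads_per_sm // threads_per_block)

NUM_SMS = 30  # "Number of SMs" from the output above

for threads_per_block in (64, 256, 1024):
    blocks = resident_blocks_per_sm(threads_per_block)
    threads = blocks * threads_per_block
    print(threads_per_block, blocks, threads, NUM_SMS * threads)
# 64 16 1024 30720
# 256 6 1536 46080
# 1024 1 1024 30720
```

Under these assumptions, 256 threads per block fills the SM completely, while very small or very large blocks leave a third of the thread slots unused.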

### 3. Memory Properties
- Global Memory: Total memory available on the GPU device.
  - Example: `48GB` for RTX A6000, `11GB` for GTX 1080 Ti.
  - Accessible to all threads, and the destination or source of host-device data transfers.
- Shared Memory Per Block: Memory shared among threads in a block.
  - Example: `48KB` by default, with larger opt-in amounts on some architectures (e.g., up to `99KB` on compute capability 8.6).
- Total Shared Memory Per SM: Total shared memory available to an SM.
- L1 Cache/Shared Memory Configurable: Ability to partition shared memory and L1 cache.
  - Example: 16KB L1 with 48KB shared memory, or vice versa, on Fermi and Kepler GPUs.
- Registers Per Block: Maximum number of registers available per block.
- Constant Memory: Read-only memory optimized for frequently used constants.
  - Typically `64KB`.
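
The `6.44193 GB` in the output above may look large for a GPU sold with 6 GB. A likely explanation is that a byte count close to 6 GiB was divided by 10^9 rather than 2^30. The sketch below shows the difference, assuming an illustrative byte count of exactly 6 GiB (not the value the device actually reported).

```python
def bytes_to_gb(num_bytes):
    """Decimal gigabytes (10^9 bytes)."""
    return num_bytes / 1e9

def bytes_to_gib(num_bytes):
    """Binary gibibytes (2^30 bytes)."""
    return num_bytes / 2**30

total_bytes = 6 * 2**30  # illustrative assumption: exactly 6 GiB

print(f"{bytes_to_gb(total_bytes):.5f} GB")    # 6.44245 GB
print(f"{bytes_to_gib(total_bytes):.5f} GiB")  # 6.00000 GiB
```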

### 4. Execution Capabilities
- Warp Size: Number of threads in a warp.
  - `32` on all current NVIDIA GPUs.
- Max Grid Dimensions: Maximum dimensions of a grid.
  - Example: `(2^31 - 1, 65535, 65535)` in the X, Y, Z dimensions.
- Max Block Dimensions: Maximum dimensions of a block.
  - Example: `(1024, 1024, 64)` in X, Y, Z dimensions.
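
The per-dimension block limits do not apply independently: the product of the three block dimensions must also stay within the maximum threads per block. A minimal sketch of both checks, plus the usual ceiling division for sizing a 1D grid, is shown below. The constants are copied from the output above, and the helper names are hypothetical.

```python
MAX_THREADS_PER_BLOCK = 1024
MAX_BLOCK_DIM = (1024, 1024, 64)
MAX_GRID_DIM = (2**31 - 1, 65535, 65535)

def block_is_valid(block):
    """Check per-dimension limits and the total thread count of a block."""
    x, y, z = block
    within_dims = all(1 <= d <= limit for d, limit in zip(block, MAX_BLOCK_DIM))
    return within_dims and x * y * z <= MAX_THREADS_PER_BLOCK

def blocks_needed(n, threads_per_block):
    """Number of blocks in a 1D grid that covers n elements."""
    blocks = (n + threads_per_block - 1) // threads_per_block
    if blocks > MAX_GRID_DIM[0]:
        raise ValueError("grid too large for the X dimension")
    return blocks

print(block_is_valid((32, 32, 1)))    # True: 1024 threads
print(block_is_valid((1024, 2, 1)))   # False: 2048 threads
print(blocks_needed(1_000_000, 256))  # 3907
```

Because the grid is rounded up, the last block usually contains threads with no element to process, so a kernel launched this way needs a bounds check on its global index.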

### 5. Performance Metrics
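- Clock Rate: GPU core clock speed, reported by CUDA in kHz.
  - Example: `1410000` kHz, i.e. `1410 MHz`.
  - Affects computation speed.
- Memory Clock Rate: Speed of the GPU memory, also reported in kHz.
  - Example: `7001000` kHz, i.e. about `7 GHz`, for the GDDR6 memory in the output above.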
- Memory Bus Width: Width of the memory bus in bits.
  - Example: `384-bit`.
- Peak Memory Bandwidth: Maximum memory transfer rate.
  - Example: `936 GB/s`.
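
Peak memory bandwidth is not reported directly, but it can be estimated from the two values above. The sketch below applies the commonly used formula of memory clock times bus width in bytes times a double-data-rate factor of 2. That factor is an assumption, and the effective multiplier can differ between memory types.

```python
def peak_bandwidth_gb_s(memory_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Estimate theoretical peak memory bandwidth in GB/s (10^9 bytes/s)."""
    bytes_per_transfer = bus_width_bits / 8
    bytes_per_second = memory_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer
    return bytes_per_second / 1e9

# Values from the device output above: 7001 MHz memory clock, 192-bit bus.
print(round(peak_bandwidth_gb_s(7001, 192), 1))  # 336.0
```

Real kernels reach only a fraction of this figure, so it is best read as an upper bound when judging measured throughput.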

### 6. Concurrency Features
- Concurrent Kernels: Indicates if multiple kernels can execute simultaneously.
- Async Engine Count: Number of asynchronous engines for concurrent copy and execution.
- Overlap: Ability to overlap data transfer and kernel execution.

### 7. Unified Addressing
- Unified Memory: Indicates support for unified memory, which gives host and device a single address space for the same data.
- Managed Memory: Support for memory managed automatically by CUDA.

### 8. Special Capabilities
- Tensor Cores: Introduced with compute capability `7.0` (Volta) and present in most later GPUs (e.g., Turing RTX, Ampere).
  - Accelerates deep learning matrix operations.
- Ray Tracing Cores: Present in RTX GPUs for real-time ray tracing applications.
- FP16 and FP64 Performance: Indicates support for 16-bit and 64-bit floating-point operations.
  - Double precision (`FP64`) is much slower on consumer and workstation GPUs than on data-center GPUs (e.g., A100).

### 9. Others
- ECC Support: Indicates whether Error Correcting Code (ECC) memory is available.
  - Critical for scientific and financial computations.
- Device Overlap: Whether the device can overlap computation and data transfer.
- CUDA Version: Supported CUDA runtime version.

### How to Use This Information
- Optimize Kernel Performance:
  - Design kernels to utilize shared memory efficiently.
  - Use appropriate thread/block configurations within the device limits.
- Memory Bandwidth:
  - Use coalesced memory access patterns to improve bandwidth utilization.
- Concurrency:
  - Use streams for overlapping data transfer and computation.
- Deep Learning:
  - Leverage Tensor Cores for matrix multiplication if available.
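
To make the coalescing advice concrete, the sketch below counts how many memory sectors one warp touches when each thread reads a 4-byte element at a given stride. The 32-byte sector size is an assumption about the memory transaction granularity, and the model ignores caching and alignment of the base address.

```python
def sectors_touched(stride, element_bytes=4, warp_size=32, sector_bytes=32):
    """Count distinct memory sectors accessed by one warp at a given stride."""
    addresses = (lane * stride * element_bytes for lane in range(warp_size))
    return len({address // sector_bytes for address in addresses})

for stride in (1, 2, 8):
    print(stride, sectors_touched(stride))
# 1 4
# 2 8
# 8 32
```

With stride 1 the warp's 128 bytes arrive in 4 sectors. With stride 8 every thread lands in its own sector, so the same useful data costs 8 times as much memory traffic.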



In [12]:
!make SRC=01_device_details.cu clean

rm -f 01_device_details


---
## **02 - Generating threads and blocks**

---
## **03 - Data movement**