# CUDA C/C++ - A Beginner's Guide

Learn GPU programming from the ground up. By the end of this guide, you'll understand how to write parallel code that runs on NVIDIA GPUs.

**What you'll learn:**
1. Why GPUs are fast (and when they're not)
2. The CPU-GPU programming model
3. Writing and launching GPU kernels
4. Thread organization: threads, blocks, and grids
5. Memory management between CPU and GPU
6. Profiling and debugging CUDA code

**Prerequisites:** Basic C programming knowledge. Setup instructions are in [Appendix A](#appendix-a-setup).

*Inspired by Mark Harris's [An Even Easier Introduction to CUDA](https://developer.nvidia.com/blog/even-easier-introduction-cuda/).*

---
## 1. Why GPU Programming?

Consider adding two arrays of 1 billion numbers:

```
x = [1, 1, 1, ...] (1 billion elements)
y = [2, 2, 2, ...] (1 billion elements)
result: y = [3, 3, 3, ...]
```

On a CPU, you'd write a loop that processes elements one by one:

```c
for (int i = 0; i < 1000000000; i++)
    y[i] = x[i] + y[i];
```

On a single core this can take seconds (the exact figure depends on the CPU, compiler optimizations, and memory bandwidth). The loop visits the elements in order, one after another.

**The insight:** Each addition is independent. Element 0 doesn't need element 1's result. What if we could do all 1 billion additions *at the same time*?

That is exactly what GPUs are designed to do.

---
## 2. CPU vs GPU: The Mental Model

| | CPU | GPU |
|---|---|---|
| **Design philosophy** | Few fast cores | Many slower cores |
| **Core count** | 4-64 cores | 1,000-16,000 cores |
| **Optimized for** | Complex sequential tasks | Simple parallel tasks |
| **Memory** | System RAM ("host memory") | VRAM ("device memory") |
| **Code terminology** | Host code | Device code / Kernels |

**Key insight:** GPUs are fast because they do the *same operation* on *many data points* simultaneously. This is called **data parallelism**.

### When GPUs Help (and When They Don't)

**Good for GPUs:**
- Array/matrix operations (same operation on millions of elements)
- Image processing (same filter applied to millions of pixels)
- Neural network inference (matrix multiplications)
- Physics simulations (same equations for many particles)

**Bad for GPUs:**
- Sequential algorithms where step N depends on step N-1
- Workloads with heavy branching (if/else) that differs per element
- Small datasets (overhead exceeds benefit)
- Tasks requiring lots of CPU-GPU communication

### Your GPU

Let's see what GPU you're working with:

In [3]:
%%bash
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv

name, memory.total [MiB], compute_cap
Tesla T4, 15360 MiB, 7.5


**Understanding the output:**

| Field | Example | Meaning |
|-------|---------|--------|
| name | Tesla T4 | GPU model |
| memory.total | 15360 MiB | Total VRAM (15 GiB) |
| compute_cap | 7.5 | Architecture version (for compiler flags) |

The **compute capability** tells you which features your GPU supports and which compiler flag to use:

| Compute Capability | Architecture | Compiler Flag |
|-------------------|--------------|---------------|
| 7.5 | Turing (T4, RTX 20xx) | `-arch=sm_75` |
| 8.0 | Ampere (A100) | `-arch=sm_80` |
| 8.6 | Ampere (RTX 30xx) | `-arch=sm_86` |
| 8.9 | Ada (RTX 40xx) | `-arch=sm_89` |
| 9.0 | Hopper (H100) | `-arch=sm_90` |

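Always match your compile flag to your GPU. A binary built for a newer architecture than the GPU it runs on (for example `sm_80` on a T4) still compiles, but every kernel launch fails with `cudaErrorNoKernelImageForDevice`, and that failure goes unnoticed unless you check for errors (see Section 6).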
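If you would rather read the compute capability from inside a program than from `nvidia-smi`, the runtime API exposes it through `cudaGetDeviceProperties`. The following is a minimal sketch; the helper name `archFlagForDevice` is made up for this example, and it assumes the GPU of interest is device 0.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: writes the -arch flag that matches device `dev`
// into `flag`. Returns 0 on success, -1 if the device cannot be queried.
int archFlagForDevice(int dev, char *flag, size_t len) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
        return -1;
    snprintf(flag, len, "-arch=sm_%d%d", prop.major, prop.minor);
    printf("%s: compute capability %d.%d, %d SMs\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}

int main() {
    char flag[32];
    if (archFlagForDevice(0, flag, sizeof flag) != 0) {  // assumption: device 0
        fprintf(stderr, "No usable CUDA device found\n");
        return 1;
    }
    printf("Suggested compiler flag: %s\n", flag);
    return 0;
}
```

This program only calls the runtime API and launches no kernel, so it runs correctly whichever `-arch` value it was compiled with.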

---
## 3. Your First CUDA Program

Let's start with a CPU program that adds two arrays, then convert it to CUDA.

### The CPU Version

In [None]:
%%writefile add_cpu.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void add(int n, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

int main() {
    int N = 1 << 20;  // 1 million elements (1<<20 = 2^20)
    
    float *x = malloc(N * sizeof(float));
    float *y = malloc(N * sizeof(float));
    
    // Initialize arrays
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    
    add(N, x, y);  // Add arrays
    
    // Verify result (all elements should be 3.0)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    printf("Max error: %f\n", maxError);
    
    free(x);
    free(y);
    return 0;
}

In [None]:
%%bash
gcc add_cpu.c -o add_cpu -lm && ./add_cpu

### Converting to CUDA: Three Changes

To run this on a GPU, we need exactly **three changes**:

#### Change 1: Mark the function with `__global__`

```c
// CPU version
void add(int n, float *x, float *y) { ... }

// GPU version
__global__ void add(int n, float *x, float *y) { ... }
```

The `__global__` keyword tells the compiler: "This function runs on the GPU but is called from the CPU."

Functions marked `__global__` are called **kernels**.

#### Change 2: Use CUDA memory allocation

```c
// CPU version
float *x = malloc(N * sizeof(float));
free(x);

// GPU version (Unified Memory)
float *x;
cudaMallocManaged(&x, N * sizeof(float));
cudaFree(x);
```

`cudaMallocManaged` allocates **Unified Memory** - memory accessible from both CPU and GPU. The CUDA runtime automatically handles data movement.

#### Change 3: Launch with execution configuration

```c
// CPU version
add(N, x, y);

// GPU version
add<<<1, 1>>>(N, x, y);   // Launch kernel
cudaDeviceSynchronize();   // Wait for GPU to finish
```

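The `<<<blocks, threads>>>` syntax sets the execution configuration: the first value is the number of blocks and the second is the number of threads per block, so `<<<1, 1>>>` launches a single thread. We'll explore this in detail soon.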

`cudaDeviceSynchronize()` makes the CPU wait for the GPU to finish - kernel launches are *asynchronous* (the CPU continues immediately).

### The CUDA Version

In [None]:
%%writefile add_gpu_v1.cu
#include <stdio.h>
#include <math.h>

__global__ void add(int n, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

int main() {
    int N = 1 << 20;  // 1 million elements
    float *x, *y;
    
    // Allocate Unified Memory
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    
    // Initialize arrays (on CPU)
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    
    // Launch kernel with 1 block, 1 thread
    add<<<1, 1>>>(N, x, y);
    cudaDeviceSynchronize();
    
    // Verify result
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    printf("Max error: %f\n", maxError);
    
    cudaFree(x);
    cudaFree(y);
    return 0;
}

In [None]:
%%bash
/usr/local/cuda/bin/nvcc -arch=sm_75 add_gpu_v1.cu -o add_gpu_v1 && ./add_gpu_v1

**It works!** But it's actually *slower* than the CPU version. Why? We're only using 1 GPU thread - like buying a supercomputer and only using one key on the keyboard.

To make it fast, we need to understand GPU threads.

---
## 4. GPU Thread Organization

GPUs organize threads into a hierarchy:

```
Grid (all threads for one kernel launch)
├── Block 0
│   ├── Thread 0
│   ├── Thread 1
│   └── ... (up to 1024 threads)
├── Block 1
│   ├── Thread 0
│   ├── Thread 1
│   └── ...
└── ... (thousands of blocks)
```

### Why Two Levels?

**Threads within a block** can:
- Share fast on-chip memory (shared memory)
- Synchronize with each other
- Cooperate on a task

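**Threads in different blocks** cannot:
- Use each other's shared memory (they can still exchange data through global memory)
- Synchronize with `__syncthreads()` (blocks may run at different times and in any order)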

This design allows the GPU to schedule blocks independently across its processors.

### Built-in Thread Variables

Every thread can identify itself using built-in variables:

| Variable | Meaning | Example |
|----------|---------|--------|
| `threadIdx.x` | Thread index within block | "I'm thread 5 in my block" |
| `blockIdx.x` | Block index within grid | "I'm in block 2" |
| `blockDim.x` | Threads per block | "My block has 256 threads" |
| `gridDim.x` | Blocks in grid | "The grid has 4096 blocks" |

### The Global Index Formula

To get a unique index for each thread across the entire grid:

```c
int i = blockIdx.x * blockDim.x + threadIdx.x;
```

**Example:** Block 2, Thread 5, with 256 threads per block:
```
i = 2 * 256 + 5 = 517
```

This thread processes `array[517]`.

### Visualizing Thread Assignment

In [None]:
%%writefile show_threads.cu
#include <stdio.h>

__global__ void showThreadInfo() {
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Block %d, Thread %d -> Global index: %d\n",
           blockIdx.x, threadIdx.x, globalIdx);
}

int main() {
    printf("Launching 2 blocks x 4 threads = 8 threads:\n\n");
    showThreadInfo<<<2, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}

In [None]:
%%bash
/usr/local/cuda/bin/nvcc -arch=sm_75 show_threads.cu -o show_threads && ./show_threads

**Notice:** The output order is unpredictable! Threads run in parallel, not sequentially. Never assume execution order.

### Why `.x`?

CUDA supports 1D, 2D, and 3D thread layouts. For arrays, 1D (`.x` only) is sufficient. For images, you might use 2D (`.x` and `.y`). For volumes, 3D.

```c
// 2D example for image processing
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
```
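To show how the 2D indices fit into a complete launch, here is a small sketch that fills a row-major "image" with each pixel's flattened index. The kernel name `fillIndex`, the helper `runFillIndex`, and the 16 x 16 block shape are choices made for this example, not part of the guide's running code.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Each thread writes its own flattened (row-major) index into the image.
__global__ void fillIndex(int width, int height, int *img) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height)          // the grid may overhang the image
        img[row * width + col] = row * width + col;
}

// Returns the number of pixels with a wrong value (0 means success),
// or -1 if a CUDA call fails.
int runFillIndex(int width, int height) {
    int *img;
    if (cudaMallocManaged(&img, width * height * sizeof(int)) != cudaSuccess)
        return -1;

    dim3 block(16, 16);                            // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,     // round up in x
              (height + block.y - 1) / block.y);   // round up in y
    fillIndex<<<grid, block>>>(width, height, img);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(img);
        return -1;
    }

    int mismatches = 0;
    for (int i = 0; i < width * height; i++)
        if (img[i] != i) mismatches++;
    cudaFree(img);
    return mismatches;
}

int main() {
    // 100 x 60 is deliberately not a multiple of the 16 x 16 block.
    printf("Mismatched pixels: %d\n", runFillIndex(100, 60));
    return 0;
}
```

The `dim3` type carries the `.x`, `.y`, and `.z` extents; unspecified components default to 1, which is why the 1D launches elsewhere in this guide can pass plain integers.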

---
## 5. Making It Parallel

Our first CUDA program used `<<<1, 1>>>` - one thread doing all the work. Let's fix that.

### Version 2: One Block, Many Threads

With 256 threads, each thread handles every 256th element:

```
Thread 0: elements 0, 256, 512, 768, ...
Thread 1: elements 1, 257, 513, 769, ...
Thread 2: elements 2, 258, 514, 770, ...
```

This is called a **stride loop**:

In [None]:
%%writefile add_gpu_v2.cu
#include <stdio.h>
#include <math.h>

__global__ void add(int n, float *x, float *y) {
    int index = threadIdx.x;          // Starting position (0-255)
    int stride = blockDim.x;          // Step size (256)
    
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main() {
    int N = 1 << 20;
    float *x, *y;
    
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    
    // 1 block, 256 threads
    add<<<1, 256>>>(N, x, y);
    cudaDeviceSynchronize();
    
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    printf("Max error: %f\n", maxError);
    
    cudaFree(x);
    cudaFree(y);
    return 0;
}

In [None]:
%%bash
/usr/local/cuda/bin/nvcc -arch=sm_75 add_gpu_v2.cu -o add_gpu_v2 && ./add_gpu_v2

Better! But GPUs have thousands of cores organized into multiple **Streaming Multiprocessors (SMs)**. One block only runs on one SM. We need more blocks.

### Version 3: Many Blocks, Many Threads (Full Parallelization)

In [None]:
%%writefile add_gpu_v3.cu
#include <stdio.h>
#include <math.h>

__global__ void add(int n, float *x, float *y) {
    // Global thread index
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // Total threads in entire grid
    int stride = blockDim.x * gridDim.x;
    
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main() {
    int N = 1 << 20;
    float *x, *y;
    
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    
    // Calculate grid size
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;  // Round up
    
    printf("Launching %d blocks x %d threads = %d total threads\n",
           numBlocks, blockSize, numBlocks * blockSize);
    
    add<<<numBlocks, blockSize>>>(N, x, y);
    cudaDeviceSynchronize();
    
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    printf("Max error: %f\n", maxError);
    
    cudaFree(x);
    cudaFree(y);
    return 0;
}

In [None]:
%%bash
/usr/local/cuda/bin/nvcc -arch=sm_75 add_gpu_v3.cu -o add_gpu_v3 && ./add_gpu_v3

### Understanding the Grid-Stride Loop

The pattern `for (int i = index; i < n; i += stride)` is called a **grid-stride loop**. It handles arrays of any size:

- If `n <= total_threads`: Each thread processes at most 1 element
- If `n > total_threads`: Each thread processes multiple elements

This is the recommended pattern for CUDA kernels because it's flexible and efficient.
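To see the second case in action, the sketch below launches far fewer threads than elements and has every thread count how many iterations its grid-stride loop performs. The names `countVisits` and `totalVisits` are invented for this illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Each thread counts how many elements its grid-stride loop would process.
__global__ void countVisits(int n, int *visits) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    int count = 0;
    for (int i = index; i < n; i += stride)
        count++;
    visits[index] = count;
}

// Launches numBlocks x blockSize threads over n elements and returns the
// total number of visits (which should equal n), or -1 if a CUDA call fails.
int totalVisits(int n, int numBlocks, int blockSize) {
    int totalThreads = numBlocks * blockSize;
    int *visits;
    if (cudaMallocManaged(&visits, totalThreads * sizeof(int)) != cudaSuccess)
        return -1;

    countVisits<<<numBlocks, blockSize>>>(n, visits);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(visits);
        return -1;
    }

    int total = 0;
    for (int t = 0; t < totalThreads; t++)
        total += visits[t];
    printf("First thread: %d elements, last thread: %d elements\n",
           visits[0], visits[totalThreads - 1]);
    cudaFree(visits);
    return total;
}

int main() {
    // 1000 elements shared among only 2 x 128 = 256 threads.
    printf("Total elements visited: %d\n", totalVisits(1000, 2, 128));
    return 0;
}
```

With 1000 elements and 256 threads, thread 0 handles indices 0, 256, 512, and 768 (four elements), while thread 255 handles 255, 511, and 767 (three elements). Every element is visited exactly once, so the total is 1000.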

### Choosing Optimal Block Size

We used 256 threads per block, but is that optimal? CUDA provides an occupancy-based heuristic that suggests a block size for your kernel.

**Key factors:**
- Block size should be a multiple of 32 (the warp size); other sizes are legal but leave part of the last warp idle
- Maximum is 1024 threads per block
- Optimal size depends on kernel's register and shared memory usage

Use `cudaOccupancyMaxPotentialBlockSize` to let CUDA calculate it:

In [4]:
%%writefile optimal_blocksize.cu
#include <stdio.h>
#include <math.h>

__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main() {
    int N = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    
    // Let CUDA calculate optimal block size
    int blockSize, minGridSize;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, add, 0, 0);
    int numBlocks = (N + blockSize - 1) / blockSize;
    
    printf("Optimal block size: %d\n", blockSize);
    printf("Minimum grid size for full occupancy: %d\n", minGridSize);
    printf("Actual grid size: %d\n", numBlocks);
    
    add<<<numBlocks, blockSize>>>(N, x, y);
    cudaDeviceSynchronize();
    
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i] - 3.0f));
    printf("Max error: %f\n", maxError);
    
    cudaFree(x); cudaFree(y);
    return 0;
}

Writing optimal_blocksize.cu


In [5]:
%%bash
/usr/local/cuda/bin/nvcc -arch=sm_75 optimal_blocksize.cu -o optimal_blocksize && ./optimal_blocksize

Optimal block size: 1024
Minimum grid size for full occupancy: 40
Actual grid size: 1024
Max error: 0.000000


For simple kernels like ours, 256 or 1024 are typically optimal. Complex kernels using more registers or shared memory may need smaller block sizes.

---
## 6. Error Handling

CUDA errors are silent by default. Your program may appear to work while producing garbage results. Always check for errors.

### Checking CUDA API Calls

CUDA functions return `cudaError_t`. Check it:

In [None]:
%%writefile error_handling.cu
#include <stdio.h>
#include <stdlib.h>  // exit

// Error checking macro
#define CUDA_CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(1); \
    } \
} while(0)

__global__ void myKernel(float *data) {
    data[threadIdx.x] = threadIdx.x;
}

int main() {
    float *d_data;
    
    // Good: Check allocation
    CUDA_CHECK(cudaMalloc(&d_data, 256 * sizeof(float)));
    
    // Launch kernel
    myKernel<<<1, 256>>>(d_data);
    
    // Check for kernel launch errors
    CUDA_CHECK(cudaGetLastError());
    
    // Check for kernel execution errors
    CUDA_CHECK(cudaDeviceSynchronize());
    
    printf("Kernel executed successfully!\n");
    
    CUDA_CHECK(cudaFree(d_data));
    return 0;
}

In [None]:
%%bash
/usr/local/cuda/bin/nvcc -arch=sm_75 error_handling.cu -o error_handling && ./error_handling

### Common Errors and Their Causes

| Error | Typical Cause |
|-------|---------------|
| `cudaErrorInvalidConfiguration` | Too many threads per block (max 1024) |
| `cudaErrorMemoryAllocation` | Requested more memory than available |
| `cudaErrorIllegalAddress` | Kernel accessed invalid memory |
| `cudaErrorInvalidDevice` | Trying to use a GPU that doesn't exist |
| `cudaErrorNoKernelImageForDevice` | Compiled for wrong architecture |
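To watch the first row of this table happen, the sketch below launches an empty kernel once with a legal block size and once with 2048 threads per block. The kernel `noop` and the helper `tryLaunch` are names made up for this example.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void noop() {}

// Launches `noop` with the given block size and returns the resulting status.
cudaError_t tryLaunch(int threadsPerBlock) {
    noop<<<1, threadsPerBlock>>>();
    cudaError_t err = cudaGetLastError();   // launch-time errors show up here
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // execution errors show up here
    return err;
}

int main() {
    int sizes[] = {256, 2048};   // 2048 exceeds the 1024-thread limit
    for (int i = 0; i < 2; i++) {
        cudaError_t err = tryLaunch(sizes[i]);
        printf("%4d threads per block: %s\n", sizes[i], cudaGetErrorName(err));
    }
    return 0;
}
```

Given the 1024-thread limit described above, the second launch should report `cudaErrorInvalidConfiguration`. Without the `cudaGetLastError()` call the program would simply carry on as if the kernel had run.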

---
## 7. Memory Management

Memory is typically the bottleneck in GPU programs. Understanding memory types is essential.

### Unified Memory vs Explicit Memory

So far we've used **Unified Memory** (`cudaMallocManaged`) for simplicity. In performance-critical code, **explicit memory management** often performs better, because you decide exactly when data moves between host and device.

| Approach | Pros | Cons |
|----------|------|------|
| Unified Memory | Simple, automatic | Hidden overhead, less control |
| Explicit | Maximum performance | More code, manual management |

### Explicit Memory Management

In [None]:
%%writefile explicit_memory.cu
#include <stdio.h>
#include <math.h>

__global__ void add(int n, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] + y[i];
}

int main() {
    int N = 1 << 20;
    size_t size = N * sizeof(float);
    
    // Step 1: Allocate host (CPU) memory
    float *h_x = (float*)malloc(size);
    float *h_y = (float*)malloc(size);
    
    // Step 2: Initialize on host
    for (int i = 0; i < N; i++) {
        h_x[i] = 1.0f;
        h_y[i] = 2.0f;
    }
    
    // Step 3: Allocate device (GPU) memory
    float *d_x, *d_y;
    cudaMalloc(&d_x, size);
    cudaMalloc(&d_y, size);
    
    // Step 4: Copy data from host to device
    cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);
    
    // Step 5: Launch kernel
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, d_x, d_y);
    
    // Step 6: Copy results back to host
    cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost);
    
    // Verify
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(h_y[i] - 3.0f));
    printf("Max error: %f\n", maxError);
    
    // Step 7: Free memory
    cudaFree(d_x);
    cudaFree(d_y);
    free(h_x);
    free(h_y);
    
    return 0;
}

In [None]:
%%bash
/usr/local/cuda/bin/nvcc -arch=sm_75 explicit_memory.cu -o explicit_memory && ./explicit_memory

### GPU Memory Hierarchy

GPUs have several memory types with different speeds and scopes:

| Memory | Speed | Scope | Size | Use Case |
|--------|-------|-------|------|----------|
| Registers | Fastest | Per thread | ~256 KB per SM | Local variables |
| Shared Memory | Very fast | Per block | 48-164 KB per SM | Thread cooperation |
| L1/L2 Cache | Fast | Automatic | KB (L1) to MB (L2) | Hardware-managed |
| Global Memory (VRAM) | Slow | All threads | GB range | Main data storage |

For beginners, focus on global memory. Shared memory optimization is an intermediate topic.
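If you are curious what the "Shared Memory" row looks like in code, here is an optional peek: a single block reverses a small array by staging it in shared memory. It is a sketch under simple assumptions (one block, array length equal to the block size), and the names `reverseInBlock`, `runReverse`, and `BLOCK` are invented for this example.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define BLOCK 64   // array length and block size for this sketch

// Reverses BLOCK elements of `data` using one block of BLOCK threads.
__global__ void reverseInBlock(int *data) {
    __shared__ int tile[BLOCK];       // visible to every thread in this block
    int t = threadIdx.x;
    tile[t] = data[t];                // copy global -> shared
    __syncthreads();                  // wait until the whole tile is loaded
    data[t] = tile[BLOCK - 1 - t];    // read a value another thread wrote
}

// Returns the number of wrong elements (0 means success),
// or -1 if a CUDA call fails.
int runReverse(void) {
    int *data;
    if (cudaMallocManaged(&data, BLOCK * sizeof(int)) != cudaSuccess)
        return -1;
    for (int i = 0; i < BLOCK; i++)
        data[i] = i;

    reverseInBlock<<<1, BLOCK>>>(data);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(data);
        return -1;
    }

    int wrong = 0;
    for (int i = 0; i < BLOCK; i++)
        if (data[i] != BLOCK - 1 - i) wrong++;
    cudaFree(data);
    return wrong;
}

int main() {
    printf("Wrong elements after reversal: %d\n", runReverse());
    return 0;
}
```

The `__syncthreads()` barrier is what makes this safe: without it, a thread could read a `tile` slot before the thread responsible for it has written the value.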

---
## 8. Profiling Your Code

**Nsight Systems** (`nsys`) profiles CPU/GPU activity and shows where time is spent.

### Basic Profiling

In [None]:
%%bash
nsys profile --stats=true ./add_gpu_v3 2>&1 | grep -A 10 'cuda_gpu_kern_sum'

### Understanding the Output

| Column | Meaning |
|--------|--------|
| Time (%) | Percentage of total GPU time |
| Total Time (ns) | Kernel execution time in nanoseconds |
| Instances | Number of kernel launches |
| Name | Kernel function name |

To convert nanoseconds to seconds: divide by 1,000,000,000 (10^9).
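When `nsys` is not available, kernel time can also be measured from inside the program with CUDA events. The sketch below is one way to do that, not the method the profiler uses. It times the grid-stride `add` kernel under three launch configurations; the helper `timeAdd` is a hypothetical name, and it uses `cudaMalloc` so that Unified Memory page migration does not end up in the measurement.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Times one launch of `add` with CUDA events.
// Returns elapsed milliseconds, or -1.0f if a CUDA call fails.
float timeAdd(int n, int numBlocks, int blockSize) {
    size_t size = n * sizeof(float);
    float *x = NULL, *y = NULL;
    float ms = -1.0f;

    if (cudaMalloc(&x, size) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&y, size) != cudaSuccess) { cudaFree(x); return -1.0f; }
    cudaMemset(x, 0, size);   // the values do not matter for timing
    cudaMemset(y, 0, size);

    // Warm-up launch so one-time startup costs are not measured.
    add<<<numBlocks, blockSize>>>(n, x, y);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                    // marker before the kernel
    add<<<numBlocks, blockSize>>>(n, x, y);
    cudaEventRecord(stop);                     // marker after the kernel
    cudaEventSynchronize(stop);                // wait until `stop` is reached

    if (cudaGetLastError() != cudaSuccess ||
        cudaEventElapsedTime(&ms, start, stop) != cudaSuccess)
        ms = -1.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return ms;
}

int main() {
    int N = 1 << 20;
    printf("<<<1, 1>>>      : %.3f ms\n", timeAdd(N, 1, 1));
    printf("<<<1, 256>>>    : %.3f ms\n", timeAdd(N, 1, 256));
    printf("<<<4096, 256>>> : %.3f ms\n", timeAdd(N, 4096, 256));
    return 0;
}
```

`cudaEventElapsedTime` reports milliseconds, so multiply by 1,000,000 to compare against the nanosecond figures from `nsys`.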

### Comparing Versions

Let's profile all three versions to see the speedup:

In [None]:
%%bash
echo "=== Version 1: 1 thread ==="
nsys profile --stats=true ./add_gpu_v1 2>&1 | grep -A 5 'cuda_gpu_kern_sum'

echo ""
echo "=== Version 2: 256 threads (1 block) ==="
nsys profile --stats=true ./add_gpu_v2 2>&1 | grep -A 5 'cuda_gpu_kern_sum'

echo ""
echo "=== Version 3: Many blocks x 256 threads ==="
nsys profile --stats=true ./add_gpu_v3 2>&1 | grep -A 5 'cuda_gpu_kern_sum'

---
## 9. Common Pitfalls

### 1. Forgetting to Synchronize

```c
// WRONG: Results may not be ready
add<<<blocks, threads>>>(N, x, y);
printf("%f\n", y[0]);  // Race condition!

// CORRECT
add<<<blocks, threads>>>(N, x, y);
cudaDeviceSynchronize();  // Wait for GPU
printf("%f\n", y[0]);    // Safe
```

### 2. Out-of-Bounds Access

When total threads exceed array size, add bounds checking:

```c
__global__ void add(int n, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // Bounds check!
        y[i] = x[i] + y[i];
}
```

### 3. Integer Overflow in Index Calculation

For very large arrays, use `size_t` or `long long`:

```c
__global__ void process(size_t n, float *data) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    // ...
}
```
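The overflow is easy to reproduce on the host, because `blockIdx.x`, `blockDim.x`, and `threadIdx.x` are 32-bit unsigned values and their product wraps around at 2^32. The sketch below mirrors the index formula in plain host code; `index32` and `index64` are illustrative names, and the printed values assume a 64-bit `size_t`.

```cuda
#include <stdio.h>
#include <stddef.h>

// Mirrors blockIdx.x * blockDim.x + threadIdx.x in 32-bit unsigned arithmetic.
unsigned int index32(unsigned int block, unsigned int dim, unsigned int thread) {
    return block * dim + thread;            // wraps around at 2^32
}

// Same formula, but widened to size_t before multiplying.
size_t index64(unsigned int block, unsigned int dim, unsigned int thread) {
    return (size_t)block * dim + thread;
}

int main() {
    // Block 20,000,000 with 256 threads per block: the true index is ~5.1 billion.
    unsigned int block = 20000000u, dim = 256u, thread = 5u;
    printf("32-bit index: %u\n", index32(block, dim, thread));   // 825032709
    printf("64-bit index: %zu\n", index64(block, dim, thread));  // 5120000005
    return 0;
}
```

The 32-bit result silently points at a completely different element, which is why the cast has to happen before the multiplication rather than after it.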

### 4. Not Checking Errors

Always use error checking (see Section 6). Silent failures are common.

### 5. Wrong Architecture Flag

```bash
# If your GPU is compute capability 7.5 (T4, RTX 2080)
nvcc -arch=sm_75 program.cu -o program  # CORRECT
nvcc -arch=sm_80 program.cu -o program  # Compiles, but kernels fail to launch on a 7.5 GPU
```

---
## 10. Summary

### CPU to CUDA Cheat Sheet

| Concept | CPU (C) | GPU (CUDA) |
|---------|---------|------------|
| Function declaration | `void func()` | `__global__ void func()` |
| Memory allocation | `malloc(size)` | `cudaMallocManaged(&ptr, size)` |
| Memory free | `free(ptr)` | `cudaFree(ptr)` |
| Function call | `func(args)` | `func<<<blocks, threads>>>(args)` |
| Wait for completion | (automatic) | `cudaDeviceSynchronize()` |
| Thread ID | N/A | `blockIdx.x * blockDim.x + threadIdx.x` |
| File extension | `.c` | `.cu` |
| Compiler | `gcc` | `nvcc` |

### Key Concepts

1. **GPUs excel at data parallelism** - same operation on many elements
2. **Threads are organized hierarchically** - threads → blocks → grid
3. **Each thread computes its global index** - `blockIdx.x * blockDim.x + threadIdx.x`
4. **Use grid-stride loops** for flexible, efficient kernels
5. **Always check for errors** - CUDA fails silently
6. **Profile before optimizing** - measure, don't guess

---
## What's Next?

This guide covered the fundamentals. To continue learning:

**Intermediate topics:**
- Shared memory for thread cooperation
- Memory coalescing for better bandwidth
- Streams for overlapping computation and data transfer
- Atomic operations

**Advanced topics:**
- Warp-level programming
- Tensor Cores (for deep learning)
- Multi-GPU programming
- CUDA graphs

**Libraries to explore:**
- cuBLAS (linear algebra)
- cuDNN (deep learning primitives)
- Thrust (C++ STL-like parallel algorithms)
- CUB (block and warp primitives)

---
## Resources

**Documentation:**
- [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)
- [CUDA Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html)
- [CUDA Toolkit Documentation](https://docs.nvidia.com/cuda/index.html)

**Free Courses:**
- [Fundamentals of Accelerated Computing with CUDA C/C++](https://courses.nvidia.com/courses/course-v1:DLI+C-AC-01+V1/about) - NVIDIA DLI
- [Fundamentals of Accelerated Computing with CUDA Python](https://courses.nvidia.com/courses/course-v1:DLI+C-AC-02+V1/about) - NVIDIA DLI

**Tools:**
- `nsys` - Nsight Systems profiler (used in this guide)
- [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems) - Visual profiler
- [NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute) - Kernel profiler

---
<a id="appendix-a-setup"></a>
## Appendix A: Setup

This appendix covers installing the CUDA development environment. Skip if already set up.

### Requirements

- NVIDIA GPU (any CUDA-capable GPU)
- Linux (Ubuntu 22.04/24.04 recommended)
- C++ compiler (g++)
- Python + Jupyter (for this notebook)

### A.1 Install Miniconda

Miniconda provides Python and the conda package manager:

In [None]:
%%bash
if [ ! -d "$HOME/miniconda3" ]; then
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
    $HOME/miniconda3/bin/conda init bash
    echo "Miniconda installed. Run: source ~/.bashrc"
else
    echo "Miniconda already installed"
fi

### A.2 Install Jupyter Kernel

In [None]:
%%bash
if ! conda list -n base ipykernel 2>/dev/null | grep -q ipykernel; then
    conda install -n base ipykernel --update-deps --force-reinstall -y
else
    echo "ipykernel already installed"
fi

### A.3 Install CUDA Toolkit (Ubuntu)

The CUDA Toolkit provides:
- `nvcc` compiler
- CUDA runtime libraries
- Header files
- Profiling tools (Nsight Systems, Nsight Compute)

In [None]:
%%bash
# For Ubuntu 24.04 with CUDA 13.1 (the current version when this guide was written)
if ! command -v /usr/local/cuda/bin/nvcc &> /dev/null; then
    wget -nc https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt-get update
    sudo apt-get -y install cuda-toolkit-13-1
else
    echo "CUDA toolkit already installed"
fi

### A.4 Add CUDA to PATH

In [None]:
%%bash
if ! grep -q 'cuda' ~/.bashrc; then
    echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
    echo "Added CUDA to PATH. Run: source ~/.bashrc"
else
    echo "CUDA PATH already configured"
fi

### A.5 Install NVIDIA Driver

Choose ONE option based on your GPU:

**Option 1: Open kernel modules** (`nvidia-open`; supports Turing and newer GPUs such as T4, A100, H100, and the RTX series)

In [None]:
%%bash
if ! command -v nvidia-smi &> /dev/null; then
    sudo apt-get install -y nvidia-open
else
    echo "NVIDIA driver already installed"
fi

**Option 2: Proprietary driver** (required for GPUs older than Turing, such as V100 or the GTX 10xx series)

In [None]:
%%bash
# Uncomment to use proprietary driver instead
# sudo apt-get install -y cuda-drivers

### A.6 Verify Installation

In [None]:
%%bash
echo "Checking installation:"
command -v g++ >/dev/null && echo "  g++: installed" || echo "  g++: NOT FOUND"
/usr/local/cuda/bin/nvcc --version >/dev/null 2>&1 && echo "  nvcc: installed" || echo "  nvcc: NOT FOUND"
command -v nvidia-smi >/dev/null && echo "  nvidia-smi: installed" || echo "  nvidia-smi: NOT FOUND"
echo ""
nvidia-smi --query-gpu=name,driver_version --format=csv 2>/dev/null || echo "GPU not accessible"

---
*Based on Mark Harris's [An Even Easier Introduction to CUDA](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)*