# CUDA Environment Test

This notebook mirrors the original PyOpenCL environment test but uses **PyCUDA** to verify that your CUDA toolkit, NVIDIA driver, and Python stack are correctly installed and work together.

Run all cells from top to bottom. The final message should confirm success.

## 1. Imports and setup
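We import PyCUDA and NumPy. Importing `pycuda.autoinit` initializes the CUDA driver and creates a context on a default device (normally the first usable one), then cleans it up when the interpreter exits.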

In [None]:
import numpy as np
import pycuda.autoinit  # initializes CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule
print(f'CUDA Driver Version: {drv.get_driver_version()}')
device = drv.Context.get_device()
print(f'Using device: {device.name()} with compute capability {device.compute_capability()}')
print(f'Total Memory (MB): {device.total_memory() / 1024**2:.2f}')
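`drv.get_driver_version()` returns a single integer rather than a dotted version string. As a minimal sketch, assuming the CUDA convention of encoding versions as `1000 * major + 10 * minor`, the helper below converts that integer to a readable form. `format_cuda_version` is a hypothetical name introduced here for illustration; it is not part of PyCUDA.

```python
def format_cuda_version(version: int) -> str:
    """Convert a CUDA integer version (1000*major + 10*minor) to 'major.minor'."""
    major, remainder = divmod(version, 1000)
    minor = remainder // 10
    return f"{major}.{minor}"


# Example value only; in the notebook you would pass drv.get_driver_version().
print(format_cuda_version(12020))
```

In the cell above, this could replace the raw integer in the first `print` call.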

## 2. Helper: GPU profiling wrapper
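This mirrors the timing helper from the PyOpenCL version. CUDA events are recorded on the GPU before and after each launch, and `start.time_till(end)` returns the elapsed time between them in milliseconds.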

In [None]:
def profile_gpu(func, n_warmup, n_iters, *kernel_launch_args):
    """Time func(*kernel_launch_args) with CUDA events; return per-run times in ms."""
    # Warm-up launches (not timed)
    for _ in range(n_warmup):
        func(*kernel_launch_args)
    times = np.zeros(n_iters, dtype=np.float64)
    for i in range(n_iters):
        start = drv.Event()
        end = drv.Event()
        start.record()
        func(*kernel_launch_args)
        end.record()
        end.synchronize()  # wait until the launch and the end event have completed
        times[i] = start.time_till(end)  # elapsed time in ms
    print(f'Kernel took on average {times.mean():.4f} ms, median {np.median(times):.4f} ms, std {times.std():.4f} ms over {n_iters} runs.')
    return times

## 3. Host data
We create two large NumPy `int32` arrays, as in the PyOpenCL version: `h_a` is filled with 1 and `h_b` with 2.

In [None]:
N = 2**25  # 33,554,432 elements
h_a = np.full(N, 1, dtype=np.int32)
h_b = np.full(N, 2, dtype=np.int32)
print(f'Working with {h_a.size:,} elements consuming {h_a.nbytes/1024**2:.2f} MB per array.')
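Before allocating on the device, it can help to check how much memory the test needs. The sketch below works through the arithmetic for the three buffers used in the next section (two inputs and one output). The variable names are illustrative and not part of the original notebook.

```python
import numpy as np

N = 2**25  # same element count as the notebook
itemsize = np.dtype(np.int32).itemsize  # 4 bytes per element

bytes_per_array = N * itemsize
total_device_bytes = 3 * bytes_per_array  # a, b, and the output c

print(f"Per array: {bytes_per_array / 1024**2:.0f} MB")
print(f"Device total for a, b, c: {total_device_bytes / 1024**2:.0f} MB")
```

If the total exceeds the device memory reported in section 1, a smaller `N` should still exercise the same code path.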

## 4. Device allocations and transfers
Allocate three device buffers (two inputs and one output) and copy the host arrays into the two input buffers.

In [None]:
d_a = drv.mem_alloc(h_a.nbytes)
d_b = drv.mem_alloc(h_b.nbytes)
d_c = drv.mem_alloc(h_a.nbytes)
drv.memcpy_htod(d_a, h_a)
drv.memcpy_htod(d_b, h_b)
print('Device buffers allocated & data transferred.')

## 5. CUDA kernel
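The kernel reproduces the operation `c[i] = 2*a[i] + b[i]`, with one thread per element. The `gid < N` guard keeps surplus threads in the last block from accessing memory out of bounds.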

In [None]:
kernel_code = r'''
        extern "C" __global__ void add_vectors(const int *a, const int *b, int *c, int N) {
            int gid = blockIdx.x * blockDim.x + threadIdx.x;
            if (gid < N) {
                c[gid] = 2 * a[gid] + b[gid];
            }
        }
        '''
mod = SourceModule(kernel_code, options=['-use_fast_math'])
add_vectors = mod.get_function('add_vectors')
print('Kernel compiled.')

## 6. Execution configuration
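Choose a block size (the CUDA counterpart of an OpenCL work-group size) and derive the grid size (number of blocks) by ceiling division, so that every element is covered even when `N` is not a multiple of the block size.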

In [None]:
block_size = 256  # threads per block
grid_size = (N + block_size - 1) // block_size
print(f'Launching with grid_size={grid_size}, block_size={block_size}')
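To see why ceiling division and the kernel's bounds guard belong together, the following is a small CPU-only emulation of a 1D launch. It is a sketch for illustration, not how PyCUDA executes kernels; `emulate_launch` is a hypothetical helper, and the tiny array size is chosen so that it is not a multiple of the block size.

```python
import numpy as np


def emulate_launch(a, b, block_size):
    """CPU emulation of a 1D launch computing c[i] = 2*a[i] + b[i]."""
    n = a.size
    grid_size = (n + block_size - 1) // block_size  # ceiling division
    c = np.zeros_like(a)
    for block_idx in range(grid_size):
        for thread_idx in range(block_size):
            gid = block_idx * block_size + thread_idx
            if gid < n:  # same bounds guard as the kernel
                c[gid] = 2 * a[gid] + b[gid]
    return c, grid_size


a = np.full(10, 1, dtype=np.int32)
b = np.full(10, 2, dtype=np.int32)
c, grid_size = emulate_launch(a, b, block_size=4)
print(grid_size, c)
```

With 10 elements and 4 threads per block, three blocks are launched and the last two threads of the final block fall outside the array, which is exactly the case the guard handles.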

## 7. Run & profile kernel
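We wrap the kernel launch in a zero-argument lambda so the profiler can call it repeatedly. The scalar `N` is passed as `np.int32` to match the kernel's `int` parameter.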

In [None]:
launch = lambda: add_vectors(d_a, d_b, d_c, np.int32(N), block=(block_size,1,1), grid=(grid_size,1))
_ = profile_gpu(launch, n_warmup=2, n_iters=20)
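Raw milliseconds are easier to interpret as effective memory bandwidth, since this kernel is dominated by memory traffic. The sketch below assumes each element causes two 4-byte reads and one 4-byte write and ignores caching effects. `effective_bandwidth_gbs` is a hypothetical helper, and the timings are placeholders rather than measurements.

```python
import numpy as np


def effective_bandwidth_gbs(n_elements, itemsize, time_ms):
    """Effective bandwidth in GB/s for a kernel that reads two arrays and writes one."""
    bytes_moved = 3 * n_elements * itemsize
    return bytes_moved / (time_ms * 1e-3) / 1e9


# Placeholder timings in ms, for illustration only.
# In the notebook, use the array returned by profile_gpu instead.
example_times_ms = np.array([1.0, 1.0, 2.0])
bandwidth = effective_bandwidth_gbs(2**25, 4, np.median(example_times_ms))
print(f"{bandwidth:.1f} GB/s")
```

Comparing this figure against the peak bandwidth in your GPU's specification gives a rough sanity check on the measured time.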

## 8. Copy back & validate
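Copy the result back to the host and compare it with a NumPy reference. With `h_a` filled with 1 and `h_b` filled with 2, every element of the result should equal 4.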

In [None]:
h_c = np.empty_like(h_a)
drv.memcpy_dtoh(h_c, d_c)
expected = 2 * h_a + h_b
np.testing.assert_array_equal(expected, h_c)
print('Validation passed. If this message appears, everything worked correctly.')

## 9. Notes
If any cell failed, check that the NVIDIA driver and CUDA toolkit are installed (`SourceModule` invokes `nvcc`, so it must be on your `PATH`), and that `pycuda` is available in your environment (see requirements or install via `pip install pycuda`).