# CUDA Python (Numba) Environment Test

This notebook replicates the original PyOpenCL environment test using **CUDA Python via Numba**. It checks that the NVIDIA driver, the CUDA toolkit, and Numba's CUDA target work together.

Run all cells in order. The final message should confirm success.

## 1. Imports and device info
We query available CUDA devices through Numba.

In [1]:
from numba import cuda
import numpy as np

cuda.detect()  # prints detected devices
device = cuda.get_current_device()
print(f'Active device: {device.name.decode()} (CC {device.compute_capability})')
print(f'Max threads per block: {device.MAX_THREADS_PER_BLOCK}')
# Note: Device has no total_memory attribute; cuda.current_context().get_memory_info() returns (free, total) in bytes.

Found 1 CUDA devices
id 0    b'NVIDIA GeForce RTX 2060 with Max-Q Design'                              [SUPPORTED]
                      Compute Capability: 7.5
                           PCI Device ID: 0
                              PCI Bus ID: 1
                                    UUID: GPU-42a52008-a233-7f47-7c29-d01344a0b937
                                Watchdog: Enabled
                            Compute Mode: WDDM
             FP32/FP64 Performance Ratio: 32
Summary:
	1/1 devices are supported
Active device: NVIDIA GeForce RTX 2060 with Max-Q Design (CC (7, 5))
Max threads per block: 1024


## 2. Profiling helper
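We implement a timing helper using CUDA events, which Numba exposes through `cuda.event()` and `cuda.event_elapsed_time()`.

In [2]:
def profile_gpu(fn, n_warmup, n_iters, *launch_args):
    """Time fn(*launch_args) with CUDA events and return the per-run times in ms."""
    # Warm-up runs trigger JIT compilation so it does not distort the timings.
    for _ in range(n_warmup):
        fn(*launch_args)
    cuda.synchronize()
    # Events come from the public API; timing=True is needed to measure elapsed time.
    start = cuda.event(timing=True)
    end = cuda.event(timing=True)
    times = np.zeros(n_iters, dtype=np.float64)
    for i in range(n_iters):
        start.record()
        fn(*launch_args)
        end.record()
        end.synchronize()  # wait until the launch and the end event have completed
        times[i] = cuda.event_elapsed_time(start, end)  # ms
    print(f'Kernel average {times.mean():.4f} ms, median {np.median(times):.4f} ms, std {times.std():.4f} ms over {n_iters} runs.')
    return times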

## 3. Host data
Allocate two large int32 host arrays, as in the OpenCL example.

In [3]:
N = 2**25  # 33,554,432
h_a = np.full(N, 1, dtype=np.int32)
h_b = np.full(N, 2, dtype=np.int32)
print(f'Working with {h_a.size:,} elements; each array uses {h_a.nbytes/1024**2:.2f} MB.')

Working with 33,554,432 elements; each array uses 128.00 MB.
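
The reported size follows from the element count and the item size. As a rough sketch (the names `bytes_per_array` and `n_device_arrays` are illustrative and not part of the original notebook), the arithmetic below also estimates the device memory needed once the two inputs and the output array are resident on the GPU. Note that dividing by `1024**2` yields mebibytes (MiB).

```python
import numpy as np

N = 2**25
itemsize = np.dtype(np.int32).itemsize    # 4 bytes per int32 element
bytes_per_array = N * itemsize            # 134,217,728 bytes
mib_per_array = bytes_per_array / 1024**2

# Assumption: three equally sized arrays live on the device (a, b and the output c).
n_device_arrays = 3
total_mib = n_device_arrays * mib_per_array

print(f"{mib_per_array:.2f} MiB per array, {total_mib:.2f} MiB for all three")
```

If the GPU has less free memory than this total, reducing `N` is the simplest way to keep the test runnable.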


## 4. Device arrays
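Transfer the host arrays to the device with Numba's `to_device`, and allocate an uninitialized output array of the same shape and dtype with `device_array_like`.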

In [4]:
d_a = cuda.to_device(h_a)
d_b = cuda.to_device(h_b)
d_c = cuda.device_array_like(h_a)
print('Device arrays created.')

IndexError: list index out of range

## 5. CUDA kernel (Numba)
Define a kernel that computes `c[i] = 2*a[i] + b[i]`, with one thread per element.

In [None]:
@cuda.jit
def add_vectors(a, b, c):
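    gid = cuda.grid(1)  # absolute index of this thread in the 1D grid
    if gid < a.size:  # guard: the last block may extend past the end of the array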
        c[gid] = 2 * a[gid] + b[gid]
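
For a one-dimensional launch, `cuda.grid(1)` is equivalent to `cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x`. The following CPU-only sketch emulates that index mapping with plain loops, to show why the bounds check is needed. The function `emulate_add_vectors` is a hypothetical illustration, not part of Numba, and it runs sequentially rather than in parallel.

```python
import numpy as np

def emulate_add_vectors(a, b, threads_per_block):
    """CPU emulation of the 1D launch: one inner-loop iteration per GPU thread."""
    n = a.size
    c = np.zeros_like(a)
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(threads_per_block):
            # Same value cuda.grid(1) would return for this thread.
            gid = block_idx * threads_per_block + thread_idx
            if gid < n:  # threads past the end of the array do nothing
                c[gid] = 2 * a[gid] + b[gid]
    return c

a = np.full(10, 1, dtype=np.int32)
b = np.full(10, 2, dtype=np.int32)
# 10 elements with 4 threads per block: 3 blocks, 12 threads, 2 of them idle.
c = emulate_add_vectors(a, b, threads_per_block=4)
print(c)
```

Without the `gid < n` check, the two surplus threads in the last block would index past the end of the arrays.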

## 6. Execution configuration
Choose the number of threads per block, then derive the number of blocks so that the grid covers all `N` elements.

In [None]:
threads_per_block = 256
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block
print(f'blocks={blocks_per_grid}, threads_per_block={threads_per_block}')
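
The expression `(N + threads_per_block - 1) // threads_per_block` is an integer ceiling division. A small sketch, using a hypothetical helper named `blocks_for`, shows how it behaves around block boundaries and how many threads end up idle in the last block:

```python
def blocks_for(n, threads_per_block):
    """Smallest number of blocks whose threads cover n elements (ceiling division)."""
    return (n + threads_per_block - 1) // threads_per_block

threads_per_block = 256
for n in (1, 255, 256, 257, 2**25):
    blocks = blocks_for(n, threads_per_block)
    idle = blocks * threads_per_block - n  # threads filtered out by the bounds check
    print(f"n={n:>10,}  blocks={blocks:>7,}  idle threads={idle}")
```

For `N = 2**25` and 256 threads per block the division is exact, so no thread is idle; for sizes that are not a multiple of the block size the kernel's bounds check handles the remainder.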

## 7. Run & profile kernel
Wrap the launch in a zero-argument callable so that `profile_gpu` can invoke it repeatedly.

In [None]:
launch = lambda: add_vectors[blocks_per_grid, threads_per_block](d_a, d_b, d_c)
_ = profile_gpu(launch, n_warmup=2, n_iters=20)
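
Because this kernel is memory-bound, a time in milliseconds is easier to interpret as effective bandwidth. The sketch below assumes each element is read from `a` and `b` once and written to `c` once (three arrays' worth of traffic). The helper `effective_bandwidth_gb_s` and the 2.0 ms input are hypothetical placeholders, not measurements from this notebook.

```python
def effective_bandwidth_gb_s(n_elements, itemsize, time_ms, n_arrays=3):
    """Bytes moved per second in GB/s, assuming each array is touched exactly once."""
    bytes_moved = n_arrays * n_elements * itemsize
    seconds = time_ms * 1e-3
    return bytes_moved / seconds / 1e9

example_ms = 2.0  # placeholder value; substitute e.g. the median returned by profile_gpu
bw = effective_bandwidth_gb_s(2**25, 4, example_ms)
print(f"{bw:.1f} GB/s at {example_ms} ms per launch")
```

Comparing the result against the GPU's nominal memory bandwidth gives a rough sense of whether the measured time is plausible.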

## 8. Copy back & validate
Copy the result back to the host and compare it with the same computation done in NumPy.

In [None]:
h_c = d_c.copy_to_host()
expected = 2 * h_a + h_b
np.testing.assert_array_equal(expected, h_c)
print('Validation passed. If this message appears everything worked correctly.')

## 9. Notes
If device detection fails, check that an NVIDIA GPU and its driver are present, that the CUDA toolkit is installed, and that Numba can locate the toolkit. Numba itself can be installed with `pip install numba`. PyCUDA can optionally be removed if it is no longer needed.
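
As a first diagnostic step, a short check along the following lines can distinguish a missing Numba installation from a missing or unusable CUDA device. The function name `cuda_status` is illustrative; `cuda.is_available()` is Numba's own availability check.

```python
def cuda_status():
    """Return a short, human-readable description of the Numba/CUDA setup."""
    try:
        from numba import cuda
    except ImportError:
        return "numba is not installed"
    # is_available() initializes the driver and returns False instead of raising.
    if not cuda.is_available():
        return "numba is installed, but no usable CUDA driver or device was found"
    return "numba is installed and a CUDA device is available"

status = cuda_status()
print(status)
```

Running `numba -s` from a shell prints a more detailed system report, including the CUDA libraries Numba was able to find.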