# The CUDA Programming Model 

Not every problem can be expressed in terms of arrays and ufuncs.

For some problems, you simply need an explicit for loop.

In [2]:
import numpy as np
import cupy  as cp
import math

from numba import vectorize, cuda

## A first example

In [8]:
@cuda.jit
def add_kernel(x, y, out):
    tx = cuda.threadIdx.x  # index of this thread within its 1D block
    bx = cuda.blockIdx.x   # index of this block within the 1D grid

    block_size = cuda.blockDim.x  # number of threads per block
    grid_size = cuda.gridDim.x    # number of blocks in the grid

    start = tx + bx * block_size     # global index of this thread
    stride = block_size * grid_size  # total number of threads in the grid

    # Assumes x, y, and out all have the same length. Stepping by the
    # stride ensures that every element is taken care of.
    for i in range(start, x.shape[0], stride):
        out[i] = x[i] + y[i]
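
To see why the stride guarantees that every element is handled exactly once, the indexing can be replayed on the CPU. The following is a minimal pure-Python sketch, not Numba code: `indices_for_thread` and `simulate_launch` are hypothetical helpers that mirror the `start`/`stride` arithmetic of the kernel, and the small launch sizes are chosen only for illustration.

```python
def indices_for_thread(thread_idx, block_idx, block_dim, grid_dim, n):
    """Indices a single thread would visit in the grid-stride loop."""
    start = thread_idx + block_idx * block_dim
    stride = block_dim * grid_dim
    return list(range(start, n, stride))


def simulate_launch(block_dim, grid_dim, n):
    """Count how many times each of the n elements is visited."""
    visits = [0] * n
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            for i in indices_for_thread(thread_idx, block_idx,
                                        block_dim, grid_dim, n):
                visits[i] += 1
    return visits


# Illustrative sizes (assumed): 3 blocks of 4 threads covering 30 elements.
visits = simulate_launch(block_dim=4, grid_dim=3, n=30)
print(all(count == 1 for count in visits))  # True

# Thread 1 of block 2 starts at 1 + 2 * 4 = 9 and steps by 4 * 3 = 12.
print(indices_for_thread(1, 2, block_dim=4, grid_dim=3, n=30))  # [9, 21]
```

On the GPU the threads run in parallel rather than in nested loops, but each one visits the same set of indices as in this simulation.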

A kernel is launched with the syntax `add_kernel[blocks_per_grid, threads_per_block](x, y, out)`. This unusual syntax is designed to mimic the CUDA runtime API in C, where the same call would look like:

```
add_kernel<<<blocks_per_grid, threads_per_block>>>(x, y, out)
```

Note:
- Unlike a ufunc, the kernel receives its arguments as full NumPy arrays.
    - The kernel can access any element of the arrays.
    - This is why a CUDA kernel is more powerful than a ufunc.

In [9]:
n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x
out = np.empty_like(x)

threads_per_block = 128
blocks_per_grid = 30

add_kernel[blocks_per_grid, threads_per_block](x, y, out)
print(out[:10])

[ 0.  3.  6.  9. 12. 15. 18. 21. 24. 27.]
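
Because the kernel uses a stride loop, the result is correct for any launch configuration; the configuration only changes how much work each thread does. The arithmetic below is a rough sketch of that trade-off for the values used here. The variable names `max_iterations` and `blocks_for_one_each` are introduced only for illustration, and the "one element per thread" sizing is an assumed alternative, not something the kernel requires.

```python
import math

n = 100000
threads_per_block = 128
blocks_per_grid = 30

# Total number of threads launched; this equals the loop stride.
total_threads = threads_per_block * blocks_per_grid

# Upper bound on the number of loop iterations any single thread performs.
max_iterations = math.ceil(n / total_threads)

# Assumed alternative: enough blocks so that each thread handles one element.
blocks_for_one_each = math.ceil(n / threads_per_block)

print(total_threads)        # 3840
print(max_iterations)       # 27
print(blocks_for_one_each)  # 782
```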


**Alternative: raw kernels in CuPy (written in CUDA C)**

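### Numba helper functions for thread offsets

Numba includes several helper functions that simplify the thread offset calculations above: `cuda.grid(1)` returns the global index of the current thread in a 1D grid, and `cuda.gridsize(1)` returns the total number of threads in that grid.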

In [10]:
@cuda.jit
def add_kernel(x, y, out):
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    
    for i in range(start, x.shape[0], stride):
        out[i] = x[i] + y[i]

In [11]:
n = 100000
x = np.arange(n).astype(np.float32)
y = 2 * x
out = np.empty_like(x)

threads_per_block = 128
blocks_per_grid = 30

add_kernel[blocks_per_grid, threads_per_block](x, y, out)
print(out[:10])

[ 0.  3.  6.  9. 12. 15. 18. 21. 24. 27.]


## Memory management

In [12]:
x_device   = cuda.to_device(x)
y_device   = cuda.to_device(y)
out_device = cuda.device_array_like(x)

In [13]:
%timeit add_kernel[blocks_per_grid, threads_per_block](x, y, out)

2.08 ms ± 7.27 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [14]:
%timeit add_kernel[blocks_per_grid, threads_per_block](x_device, y_device, out_device)

184 µs ± 3.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
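
The gap between the two timings is consistent with data transfer dominating the first measurement: when a kernel is given host NumPy arrays, Numba copies them to the device before the launch and back afterwards, whereas device arrays stay on the GPU. The arithmetic below is a back-of-the-envelope sketch using the timings above. It assumes that all three arrays are copied in both directions on every launch, and it makes no claim about the actual transfer bandwidth of this machine.

```python
n = 100000
bytes_per_element = 4  # float32
num_arrays = 3         # x, y, out

# Assumption: each host array is copied to the device and back per launch.
bytes_per_launch = 2 * num_arrays * n * bytes_per_element

host_time_s = 2.08e-3   # measured with NumPy (host) arrays
device_time_s = 184e-6  # measured with device arrays

speedup = host_time_s / device_time_s

print(bytes_per_launch)   # 2400000
print(round(speedup, 1))  # 11.3
```

When the result is needed on the host again, a device array can be copied back explicitly with `out_device.copy_to_host()`.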


## Kernel synchronization