In [8]:
from numba import cuda
import numpy as np
from rich import print
print(cuda.current_context().device)

# Declare kernel function
https://numba.readthedocs.io/en/stable/cuda/kernels.html#kernel-declaration

When running a kernel, the kernel function’s code is executed by every thread once. It therefore has to know which thread it is in, in order to know which array element(s) it is responsible for (complex algorithms may define more complex responsibilities, but the underlying principle is the same).

![](https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/grid-of-thread-blocks.png)
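
To make the indexing concrete, here is a minimal pure-Python sketch that imitates on the CPU what each GPU thread would compute. It is not Numba code: `simulate_launch` is a hypothetical helper introduced only for illustration, and the two nested loops merely stand in for threads that run in parallel on a real device.

```python
def simulate_launch(n_elements, blocks_per_grid, threads_per_block):
    """Count how often each array index is visited by the simulated threads."""
    visits = [0] * n_elements
    for block_idx in range(blocks_per_grid):         # plays the role of cuda.blockIdx.x
        for thread_idx in range(threads_per_block):  # plays the role of cuda.threadIdx.x
            # Flattened position of this (block, thread) pair
            pos = thread_idx + block_idx * threads_per_block
            if pos < n_elements:                     # boundary check: surplus threads do nothing
                visits[pos] += 1
    return visits


# 10 elements, 3 blocks of 4 threads: 12 threads in total, 2 of them idle
visits = simulate_launch(10, 3, 4)
print(visits)
```

Every element is visited exactly once, and the two surplus threads in the last block fall outside the array, which is why a kernel needs the boundary check.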

In [2]:
@cuda.jit
def increment_by_one(an_array):
    # Thread id within its 1D block
    tx = cuda.threadIdx.x
    # Block id within the 1D grid
    bx = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute the flattened index inside the array
    pos = tx + bx * bw
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] += 1

# Invoke kernel function
https://numba.readthedocs.io/en/stable/cuda/kernels.html#kernel-invocation

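The function has to be compiled first, which happens automatically the first time it is invoked, so we call it once with a small valid input. Note that Numba emits a performance warning on this call, because a host (NumPy) array is passed directly to the kernel and has to be copied to the device and back: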
```
/home/axean/anaconda3/envs/az1/lib/python3.9/site-packages/numba/cuda/cudadrv/devicearray.py:790: NumbaPerformanceWarning: Host array used in CUDA kernel will incur copy overhead to/from device.
  warn(NumbaPerformanceWarning(msg))
```
There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#thread-hierarchy

![](https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/memory-hierarchy.png)
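
The block-size limit and the need to cover every array element with whole blocks can be combined in a small helper. This is only a sketch: `launch_config` and `MAX_THREADS_PER_BLOCK` are hypothetical names, the value 1024 is the limit quoted above, and the default of 256 threads per block is an assumption rather than a recommendation from the source.

```python
MAX_THREADS_PER_BLOCK = 1024  # limit quoted above for current GPUs (assumed constant here)


def launch_config(n_elements, threads_per_block=256):
    """Return (blocks_per_grid, threads_per_block) covering n_elements."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError(
            f"threads_per_block must be in 1..{MAX_THREADS_PER_BLOCK}, "
            f"got {threads_per_block}"
        )
    # Ceiling division: round up so the last, partially filled block is included
    blocks_per_grid = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks_per_grid, threads_per_block


print(launch_config(40, 2))        # (20, 2)
print(launch_config(300000, 256))  # (1172, 256)
```

The returned pair could then be used as the two values in the square brackets of a kernel launch.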

In [3]:
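# Deliberately tiny block size to keep the example easy to follow;
# real kernels usually use a multiple of 32 (the warp size)
threadsperblock = 2
an_array = np.arange(40)
# Ceiling division so that every element is covered by a thread
blockspergrid = (an_array.size + (threadsperblock - 1)) // threadsperblock
print('blockspergrid ', blockspergrid)
# First call: triggers compilation of the kernel
increment_by_one[blockspergrid, threadsperblock](an_array)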



In [4]:
array_size = 300000

def python_naive_add(an_array):
    for i in range(len(an_array)):
        an_array[i] += 1

big_array = np.arange(array_size)
print(big_array[:5])
%timeit python_naive_add(big_array)
print(big_array[:5])

59.8 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [5]:
# NumPy is very fast for simple operations, but slower
# when intermediate arrays are needed
def python_numpy_add(an_array):
    an_array += 1

big_array = np.arange(array_size)
print(big_array[:5])
%timeit python_numpy_add(big_array)
print(big_array[:5])

58.5 µs ± 323 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [6]:
big_array = np.arange(array_size)
blockspergrid = (big_array.size + (threadsperblock - 1)) // threadsperblock
print('blockspergrid ', blockspergrid)
print(big_array[:5])
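# big_array is a host array, so each timed call also includes the
# copy to the device and back (see the performance warning above)
%timeit increment_by_one[blockspergrid, threadsperblock](big_array)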
print(big_array[:5])

1.67 ms ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
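
To put the three measurements side by side, the following sketch computes the speedups from the mean timings reported above. The numbers come from this single run and should be read with care: the CUDA figure includes the host/device copy overhead named in the warning and was measured with only 2 threads per block, so it is not representative of what the GPU can do.

```python
# Mean timings reported by %timeit above, converted to seconds
timings = {
    "python loop": 59.8e-3,
    "numpy": 58.5e-6,
    "cuda kernel (host array)": 1.67e-3,
}

baseline = timings["python loop"]
# Speedup of each variant relative to the plain Python loop
speedup_vs_loop = {name: baseline / seconds for name, seconds in timings.items()}

for name, seconds in timings.items():
    print(f"{name:26s} {seconds * 1e3:9.4f} ms  {speedup_vs_loop[name]:8.1f}x vs. loop")
```

In this run the kernel is roughly 36 times faster than the Python loop but still about 29 times slower than the in-place NumPy addition.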
