# 1. Introduction

CUDA has an execution model unlike the traditional sequential model used for programming CPUs. In CUDA, the code you write will be executed by multiple threads at once (often hundreds or thousands). Your solution will be modeled by defining a thread **hierarchy of grid, blocks and threads**.
- CPU: sequential model
- CUDA: multiple threads
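
To make the contrast concrete, here is a minimal CPU-only sketch in plain Python (no GPU or Numba required). The names `increment_sequential`, `increment_kernel` and `emulate_launch` are hypothetical, chosen for illustration only; the loop inside `emulate_launch` merely stands in for threads that a GPU would run concurrently.

```python
def increment_sequential(values):
    """CPU style: one flow of control visits every element in turn."""
    for i in range(len(values)):
        values[i] += 1


def increment_kernel(thread_id, values):
    """CUDA style: the body describes the work of ONE thread."""
    if thread_id < len(values):  # surplus threads do nothing
        values[thread_id] += 1


def emulate_launch(kernel, n_threads, *args):
    """Hypothetical stand-in for a GPU launch: run the body once per thread.

    A real GPU would execute these calls concurrently, not in a loop.
    """
    for thread_id in range(n_threads):
        kernel(thread_id, *args)


a = list(range(10))
b = list(range(10))
increment_sequential(a)
emulate_launch(increment_kernel, 32, b)  # 32 threads for 10 elements
print(a == b)  # True
```

Both versions produce the same result; the difference is that the kernel body never loops over the data itself.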

In [48]:
import numpy as np
import math
# -------------------
import numba
import numba.cuda as cuda
print(numba.__version__)

0.34.0


# 2. Kernel declaration
**A kernel function is a GPU function that is meant to be called from CPU code** (*). This gives kernels two fundamental characteristics:
- kernels cannot explicitly return a value
- kernels explicitly declare their thread hierarchy when called

(*) Note: newer CUDA devices support device-side kernel launching. This feature is called dynamic parallelism, but Numba does not currently support it.
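
Because a kernel cannot return a value, results are normally written into an array that the caller passes in. The CPU-only sketch below illustrates that pattern in plain Python; `add_kernel` and `run_threads` are hypothetical names, and the loop stands in for a real launch.

```python
def add_kernel(thread_id, x, y, out):
    """Body of one thread: write the result into `out` instead of returning it."""
    if thread_id < len(out):
        out[thread_id] = x[thread_id] + y[thread_id]


def run_threads(kernel, n_threads, *args):
    """Hypothetical launcher: run the body once per thread id; nothing is returned."""
    for thread_id in range(n_threads):
        kernel(thread_id, *args)


x = [1, 2, 3]
y = [10, 20, 30]
out = [0] * len(x)          # the caller allocates the output up front
result = run_threads(add_kernel, 4, x, y, out)
print(result)               # None: the launch itself yields no value
print(out)                  # [11, 22, 33]
```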

In [5]:
@cuda.jit
def increment_by_one(an_array):
    """
    Increment all array elements by one.
    """
    # code elided here; read further for different implementations
    pass

# 3. Kernel invocation

A kernel is typically launched in the following way:

In [11]:
an_array = np.random.random(63)

In [12]:
threadsperblock = 32
blockspergrid = (an_array.size + (threadsperblock - 1)) // threadsperblock

`kernel[number of blocks per grid, number of threads per block]` does not run anything yet; it returns a configured kernel object that can then be called like a function:

In [13]:
increment_by_one[blockspergrid, threadsperblock]

<numba.cuda.compiler.AutoJitCUDAKernel at 0x7f2026952a20>

Launching a kernel therefore takes **two steps**:
- Instantiate the kernel proper by specifying a number of blocks (or “blocks per grid”) and a number of threads per block.
    - total threads launched = threadsperblock x blockspergrid
- Run the kernel by passing it the input array (and any separate output arrays if necessary). When host (NumPy) arrays are passed, as in this notebook, **the call behaves synchronously**: it returns only after the kernel has finished executing and the data has been copied back to the host. (More recent Numba documentation describes the launch itself as asynchronous; the wait comes from the implicit copy back.)
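
The arithmetic behind the launch configuration can be checked without a GPU. The helper below is a small sketch (the name `launch_config` is hypothetical) that applies the same ceiling division used for `blockspergrid` and reports how many threads end up with no element to process.

```python
def launch_config(n_elements, threads_per_block):
    """Return (blocks_per_grid, total_threads, idle_threads) for a 1D launch."""
    # Ceiling division: round up so that every element gets a thread.
    blocks_per_grid = (n_elements + threads_per_block - 1) // threads_per_block
    total_threads = blocks_per_grid * threads_per_block
    idle_threads = total_threads - n_elements
    return blocks_per_grid, total_threads, idle_threads


# The 63-element array from above with 32 threads per block.
print(launch_config(63, 32))   # (2, 64, 1)
```

The one idle thread is the reason kernels usually guard their array accesses with a bounds check.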

## Choosing the block size

It may seem curious to use a two-level hierarchy when declaring the number of threads needed by a kernel. The block size (the number of threads per block) matters for two reasons:

- On the software side: the block size determines **how many threads share a given area of shared memory**
- On the hardware side: **the block size must be large enough for full occupation of execution units**

## Multi-dimensional blocks and grids
To help deal with multi-dimensional arrays, CUDA allows you to specify multi-dimensional blocks and grids.
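
As a rough illustration, the grid shape for a multi-dimensional launch can be computed one axis at a time with ceiling division. The helper name `grid_shape` and the `(100, 50)` array shape are assumptions made for this sketch only.

```python
import math


def grid_shape(array_shape, block_shape):
    """Blocks needed along each axis so the grid covers the whole array."""
    return tuple(
        math.ceil(extent / block)
        for extent, block in zip(array_shape, block_shape)
    )


print(grid_shape((100, 50), (16, 16)))   # (7, 4)
```

The resulting tuple and the block shape would then be used as the two values inside the square brackets of a kernel launch.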

# 4. Thread positioning

When running a kernel, the kernel function’s code is executed by every thread once. It therefore has to know which thread it is in, in order to know which array element(s) it is responsible for (complex algorithms may define more complex responsibilities, but the underlying principle is the same).

One way is for the thread to determine its position in the grid and block, and then manually compute the corresponding array position:

In [14]:
@cuda.jit
def increment_by_one(an_array):
    
    tx = cuda.threadIdx.x    # Thread id in a 1D block
    ty = cuda.blockIdx.x     # Block id in a 1D grid
    bw = cuda.blockDim.x     # Block width, i.e. number of threads per block
    
    pos = tx + ty * bw       # Compute flattened index inside the array
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] += 1

In [17]:
an_array = np.arange(10)

# ============================
threadsperblock = 32
blockspergrid = (an_array.size + (threadsperblock - 1)) // threadsperblock

# ============================
print(an_array)
increment_by_one[blockspergrid, threadsperblock](an_array)
print(an_array)

[0 1 2 3 4 5 6 7 8 9]
[ 1  2  3  4  5  6  7  8  9 10]


### Demonstration: the block/thread layout resembles a 2-dimensional array in C/C++

In [34]:
@cuda.jit
def get_blockDim(an_array):
    
    tx = cuda.threadIdx.x    # Thread id in a 1D block
    ty = cuda.blockIdx.x     # Block id in a 1D grid
    bw = cuda.blockDim.x     # Block width, i.e. number of threads per block
    
    pos = tx + ty * bw       # Compute flattened index inside the array
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] = bw
        
@cuda.jit
def get_pos(an_array):
    
    tx = cuda.threadIdx.x    # Thread id in a 1D block
    ty = cuda.blockIdx.x     # Block id in a 1D grid
    bw = cuda.blockDim.x     # Block width, i.e. number of threads per block
    
    pos = tx + ty * bw       # Compute flattened index inside the array
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] = pos
        
@cuda.jit
def get_threadIdx(an_array):
    
    tx = cuda.threadIdx.x    # Thread id in a 1D block
    ty = cuda.blockIdx.x     # Block id in a 1D grid
    bw = cuda.blockDim.x     # Block width, i.e. number of threads per block
    
    pos = tx + ty * bw       # Compute flattened index inside the array
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] = tx
        
@cuda.jit
def get_blockIdx(an_array):
    
    tx = cuda.threadIdx.x    # Thread id in a 1D block
    ty = cuda.blockIdx.x     # Block id in a 1D grid
    bw = cuda.blockDim.x     # Block width, i.e. number of threads per block
    
    pos = tx + ty * bw       # Compute flattened index inside the array
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] = ty

In [39]:
an_array = np.zeros(13)

# ============================
threadsperblock = 4
blockspergrid = (an_array.size + (threadsperblock - 1)) // threadsperblock

# ============================

print("\n===== Block Dimension =====")
get_blockDim[blockspergrid, threadsperblock](an_array)
print(an_array)

print("\n===== position =====")
get_pos[blockspergrid, threadsperblock](an_array)
print(an_array)

print("\n===== Thread Indices =====")
get_threadIdx[blockspergrid, threadsperblock](an_array)
print(an_array)

print("\n===== Block Indices =====")
get_blockIdx[blockspergrid, threadsperblock](an_array)
print(an_array)


===== Block Dimension =====
[4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4. 4.]

===== position =====
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12.]

===== Thread Indices =====
[0. 1. 2. 3. 0. 1. 2. 3. 0. 1. 2. 3. 0.]

===== Block Indices =====
[0. 0. 0. 0. 1. 1. 1. 1. 2. 2. 2. 2. 3.]
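
The printed rows can be reproduced on the CPU with integer division and remainder, which is the same arithmetic that maps a flat offset to a (row, column) pair in a row-major C array. This is a plain-Python sketch with illustrative variable names, not GPU code.

```python
n_elements = 13
threads_per_block = 4

positions = list(range(n_elements))
# Each flattened position splits into (block index, thread index),
# just as a flat offset splits into (row, column) in a row-major C array.
block_idx = [pos // threads_per_block for pos in positions]
thread_idx = [pos % threads_per_block for pos in positions]

print(thread_idx)   # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0]
print(block_idx)    # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3]

# Recombining gives back the flattened index used in the kernels.
recombined = [t + b * threads_per_block for t, b in zip(thread_idx, block_idx)]
print(recombined == positions)   # True
```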


### Absolute positions

Simple algorithms will tend to always use thread indices in the same way as shown in the example above. Numba provides additional facilities to automate such calculations:
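- `numba.cuda.grid(ndim)`: returns the absolute position of the current thread in the entire grid of blocks
- `numba.cuda.gridsize(ndim)`: returns the absolute size (or shape), in threads, of the entire grid of blocks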
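
As a rough CPU-only illustration of what the two helpers report in the 1D case, the sketch below computes the same quantities by hand: the absolute position `threadIdx.x + blockIdx.x * blockDim.x` and the total thread count `blockDim.x * gridDim.x`. All function names are hypothetical. It also shows one common use of the grid size, a strided loop that lets a grid with fewer threads than elements still cover the whole array; that pattern is added here for illustration and is not something the text above prescribes.

```python
def grid_1d(thread_idx, block_idx, block_dim):
    """Absolute 1D thread position, the value cuda.grid(1) reports."""
    return thread_idx + block_idx * block_dim


def gridsize_1d(block_dim, grid_dim):
    """Total threads in a 1D grid, the value cuda.gridsize(1) reports."""
    return block_dim * grid_dim


def emulate_strided_increment(values, grid_dim, block_dim):
    """Run a strided-increment body once per emulated thread (sequentially)."""
    stride = gridsize_1d(block_dim, grid_dim)
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            start = grid_1d(thread_idx, block_idx, block_dim)
            # Each thread handles start, start + stride, start + 2*stride, ...
            for i in range(start, len(values), stride):
                values[i] += 1


data = list(range(10))
emulate_strided_increment(data, grid_dim=2, block_dim=2)  # only 4 threads
print(data)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```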

Using `cuda.grid(1)`, the 1D example can be rewritten as:

In [40]:
@cuda.jit
def increment_by_one(an_array):
    pos = cuda.grid(1)
    if pos < an_array.size:
        an_array[pos] += 1

In [41]:
an_array = np.arange(10)

# ============================
threadsperblock = 32
blockspergrid = (an_array.size + (threadsperblock - 1)) // threadsperblock

# ============================
print(an_array)
increment_by_one[blockspergrid, threadsperblock](an_array)
print(an_array)

[0 1 2 3 4 5 6 7 8 9]
[ 1  2  3  4  5  6  7  8  9 10]


The same example for a 2D array and grid of threads would be:

In [50]:
@cuda.jit
def increment_a_2D_array(an_array):
    x, y = cuda.grid(2)
    if x < an_array.shape[0] and y < an_array.shape[1]:
        an_array[x, y] += 1

In [57]:
an_array = np.arange(20).reshape(2, -1)
print(an_array)

[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]]


In [59]:
# Now we need to define threadsperblock and blockspergrid for two dimensions
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(an_array.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(an_array.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

print(an_array)
increment_a_2D_array[blockspergrid, threadsperblock](an_array)
print()
print(an_array)

[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]]

[[ 1  2  3  4  5  6  7  8  9 10]
 [11 12 13 14 15 16 17 18 19 20]]
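
With a `(16, 16)` block, a single block already launches far more threads than the `(2, 10)` array has elements, which is why the bounds check in the kernel matters. The CPU-only sketch below enumerates the `(x, y)` pairs that the threads of such a launch would see and counts how many pass the check. The generator `emulate_grid_2d` is a hypothetical helper, and nested lists stand in for the NumPy array.

```python
def emulate_grid_2d(blocks_per_grid, threads_per_block):
    """Yield the (x, y) pair that each thread of a 2D launch would see."""
    for bx in range(blocks_per_grid[0]):
        for by in range(blocks_per_grid[1]):
            for tx in range(threads_per_block[0]):
                for ty in range(threads_per_block[1]):
                    yield (tx + bx * threads_per_block[0],
                           ty + by * threads_per_block[1])


shape = (2, 10)                      # same shape as the array above
array = [[10 * r + c for c in range(shape[1])] for r in range(shape[0])]

launched = 0
updated = 0
for x, y in emulate_grid_2d((1, 1), (16, 16)):
    launched += 1
    if x < shape[0] and y < shape[1]:    # same bounds check as the kernel
        array[x][y] += 1
        updated += 1

print(launched, updated)   # 256 20
```

Only 20 of the 256 emulated threads touch the array; the rest fall outside its bounds and do nothing.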
