# Basic Array Operations
This notebook explains how to execute CUDA kernels using CuPy's `RawModule`.

The fundamental process involves reading CUDA kernel source code from a text file, compiling it, and then executing it as a function.
It also covers embedding compile-time constants by string replacement (similar to a preprocessor), using CUDA's constant memory, and measuring execution time with CuPy's `Event` objects.
The CUDA kernels discussed here are well-suited for implementing the map pattern in parallel computing.

In [None]:
import os
import math
import numpy as np
import cupy as cp

dn = os.path.join(os.getcwd(), 'kernels')
fpfn = os.path.join(dn, '01_basic_array_operations_1.cu')
with open(fpfn, 'r') as f:
  cuda_source = f.read()
module = cp.RawModule(code=cuda_source)
module.compile()

`01_basic_array_operations_1.cu` implements constant scaling of arrays as an example of basic array operations.
The CUDA kernel, read as a string, is displayed below.

In [None]:
print(cuda_source)

extern "C" __global__ void mult(float *x, float a, int length)
{
  const int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= length)
  {
    return;
  }
  x[index] *= a;
}



In CUDA, the smallest unit of processing is a thread. These threads are grouped into thread blocks. A single thread block can contain a maximum of 1024 threads.
This limit exists because all threads within a block are expected to reside on the same streaming multiprocessor core and must share that core's limited memory resources.

Grids are formed when multiple thread blocks are combined. Essentially, a grid is a collection of thread blocks, enabling highly scalable parallel computation.
Thread blocks within a grid execute independently.

Let's clarify some key identifiers and built-in variables:
The `__global__` execution space specifier is used in CUDA C++ to declare a function as a kernel.
Unlike regular C++ functions, kernels are executed in parallel by multiple CUDA threads when called.
`blockIdx` is a built-in variable that provides a unique index for each thread block within the grid.
It can be accessed as a 1D, 2D, or 3D index, helping to identify which block the current thread belongs to within the kernel.
`blockDim` is a built-in variable indicating the dimensions of a thread block.
Specifically, it provides the number of threads (sizes in x, y, and z directions) contained within each thread block.
This is useful for calculating a thread's global index.
`threadIdx` is another built-in variable accessible within the kernel. It gives the index of the current thread within its thread block, so it is unique only inside that block.
`threadIdx` is a 3-component vector, allowing threads to be identified using 1D, 2D, or 3D thread indices.
Combined with `blockIdx` and `blockDim`, it yields a unique global index, as in `blockIdx.x * blockDim.x + threadIdx.x`.
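
To make the index arithmetic concrete, here is a minimal host-side sketch in plain Python that walks through blocks and threads one after another and applies the same bounds guard as the `mult` kernel. The helper `launch_1d` and the tiny sizes are illustrative assumptions, not part of CuPy or CUDA; on a GPU the threads run in parallel rather than in a loop.

```python
import math


def launch_1d(kernel, grid_dim, block_dim, *args):
    """Sequentially emulate a 1D kernel launch (host-side illustration only)."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def mult(block_idx, block_dim, thread_idx, x, a, length):
    # Same global index and bounds guard as the CUDA kernel.
    index = block_idx * block_dim + thread_idx
    if index >= length:
        return
    x[index] *= a


length = 10
block_dim = 4
grid_dim = math.ceil(length / block_dim)  # enough blocks to cover every element
x = [float(i) for i in range(length)]
launch_1d(mult, grid_dim, block_dim, x, 2.0, length)
print(grid_dim, x)
```

The last block contains threads whose index falls past the end of the array; the guard makes them return without touching memory.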

CuPy allows NumPy `numpy.ndarray` objects to be uploaded directly to the GPU as `cupy.ndarray` objects.
For demonstration, a random array of length 65536 is generated and uploaded to the GPU.
Additionally, a random scalar for constant multiplication and a threshold for verification are defined.

In [None]:
length = 65536
err_eps = 1E-6
x = np.random.rand((length)).astype(np.float32)
x_gpu = cp.array(x, dtype=cp.float32)
a = np.random.rand(1,).astype(np.float32)
a_gpu = cp.float32(a[0])  # scalar kernel arguments are passed by value

To execute the compiled CUDA kernel, the `mult` function is first retrieved as a function object using `get_function()`.
The function object is called with the grid and thread block dimensions given as tuples, followed by a tuple of the kernel arguments.
Because kernel launches are asynchronous with respect to the host, the notebook synchronizes explicitly before reading back the results.

In [None]:
gpu_func = module.get_function('mult')
sz_block = (1024,)
sz_grid = (math.ceil(length / sz_block[0]),)
gpu_func(
  block=sz_block, grid=sz_grid,
  args=(x_gpu, a_gpu, length)
)
cp.cuda.runtime.deviceSynchronize()
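
The grid size above is a ceiling division of the array length by the block size. As a small sketch, the same value can be computed with integer arithmetic only; `grid_size` is a hypothetical helper name, and the lengths other than 65536 are chosen purely for illustration.

```python
def grid_size(length, block_size):
    """Smallest number of blocks whose threads cover `length` elements."""
    return (length + block_size - 1) // block_size


for n in (65536, 65537, 1000):
    blocks = grid_size(n, 1024)
    idle = blocks * 1024 - n  # threads that exit early at the bounds check
    print(n, blocks, idle)
```

When the length is not a multiple of the block size, the final block is only partly used, which is why the kernel needs its `index >= length` check.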

After execution, the results are copied back from the GPU with the `get()` method and checked by dividing them by `a` and comparing with the original array `x`.
An assertion ensures that the largest deviation stays below the threshold.

In [None]:
x2 = x_gpu.get()
err = np.abs((x2 / a) - x)
assert np.max(err) < err_eps


Next, compile-time constant embedding using string replacement in the source code is demonstrated.
The CUDA kernel is read as a string.

In [None]:
fpfn = os.path.join(dn, '01_basic_array_operations_2.cu')
with open(fpfn, 'r') as f:
  cuda_source = f.read()
print(cuda_source)

const float g_a(BASIC_ARRAY_OPTRATIONS_2_A);
extern "C" __global__ void multConstant(float *x)
{
  const int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= BASIC_ARRAY_OPTRATIONS_2_LENGTH)
  {
    return;
  }
  x[index] *= g_a;
}



Note that, as read, it cannot be compiled directly due to placeholder strings for constants (`BASIC_ARRAY_OPTRATIONS_2_A`) and array length (`BASIC_ARRAY_OPTRATIONS_2_LENGTH`).
After reading, Python's standard library is used to perform string replacement.
Note that the values of the Python variables are embedded as constants in the CUDA kernel at compile time, so changing them requires recompiling the module.

In [None]:
cuda_source = cuda_source.replace('BASIC_ARRAY_OPTRATIONS_2_LENGTH', str(length))
cuda_source = cuda_source.replace('BASIC_ARRAY_OPTRATIONS_2_A', str(float(a[0])))
module = cp.RawModule(code=cuda_source)
module.compile()
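
Chained `str.replace` calls work here, but they can misfire when one placeholder name is a prefix of another. The following is a minimal sketch of a stricter substitution that matches whole tokens and fails if a placeholder has no value. The `KERNEL_` naming convention, the helper `embed_constants`, and the template are assumptions made for this example, not the notebook's actual files.

```python
import re

# Hypothetical convention: every placeholder is an upper-case token starting with KERNEL_.
PLACEHOLDER = re.compile(r'\bKERNEL_[A-Z0-9_]+\b')


def embed_constants(source, constants):
    """Substitute each placeholder token with the text of its value."""
    def substitute(match):
        name = match.group(0)
        if name not in constants:
            raise KeyError('no value for placeholder ' + name)
        return str(constants[name])
    return PLACEHOLDER.sub(substitute, source)


template = '''
const float g_a(KERNEL_A);
extern "C" __global__ void multConstant(float *x)
{
  const int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= KERNEL_A_LENGTH) { return; }
  x[index] *= g_a;
}
'''

kernel_source = embed_constants(template, {'KERNEL_A': 0.5, 'KERNEL_A_LENGTH': 65536})
print(kernel_source)
```

As a possible alternative, `cp.RawModule` accepts an `options` tuple that is forwarded to the compiler, so preprocessor definitions such as `-DNAME=VALUE` may serve the same purpose without editing the source string.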

The GPU function is then executed, with an assertion to verify calculation success.

In [None]:
x_gpu = cp.array(x, dtype=cp.float32)

gpu_func = module.get_function('multConstant')
sz_block = (1024,)
sz_grid = (math.ceil(length / sz_block[0]),)
gpu_func(
  block=sz_block, grid=sz_grid,
  args=(x_gpu,)
)
cp.cuda.runtime.deviceSynchronize()

x2 = x_gpu.get()
err = np.abs((x2 / a) - x)
assert np.max(err) < err_eps

Now, we implement a CUDA kernel that utilizes CUDA's constant memory.

This memory resides in the GPU's device memory and is cached by the constant cache.
If the data is found in the cache, it's processed at the constant cache's throughput; otherwise, it's processed at the device memory's throughput.

In [None]:
fpfn = os.path.join(dn, '01_basic_array_operations_3.cu')
with open(fpfn, 'r') as f:
  cuda_source = f.read()
print(cuda_source)

__constant__ float g_a; // constant memory
extern "C" __global__ void multConstantMemory(float *x)
{
  const int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= BASIC_ARRAY_OPTRATIONS_3_LENGTH)
  {
    return;
  }
  x[index] *= g_a;
}



CUDA's constant memory can be used by declaring variables with the `__constant__` memory space specifier.
Here, the constant for multiplication is defined in constant memory, and the array length is defined as a regular constant.

In [None]:
cuda_source = cuda_source.replace('BASIC_ARRAY_OPTRATIONS_3_LENGTH', str(length))
module = cp.RawModule(code=cuda_source)
module.compile()

After compilation, a pointer to the constant memory variable is obtained with the `get_global()` method, and the value is written through it.

In [None]:

ptr = module.get_global('g_a')
a_gpu = cp.ndarray((1,), dtype=cp.float32, memptr=ptr)
a_gpu[:] = a[0]

The GPU function is then executed, with an assertion to verify calculation success.

In [None]:
x_gpu = cp.array(x, dtype=cp.float32)

gpu_func = module.get_function('multConstantMemory')
sz_block = (1024,)
sz_grid = (math.ceil(length / sz_block[0]),)
gpu_func(
  block=sz_block, grid=sz_grid,
  args=(x_gpu,)
)
cp.cuda.runtime.deviceSynchronize()

x2 = x_gpu.get()
err = np.abs((x2 / a) - x)
assert np.max(err) < err_eps

Next, we extend the `multConstantMemory` kernel to demonstrate how to use multiple coefficients stored as a structure in constant memory.
This kernel performs the in-place multiply-add $x \leftarrow ax + b$ with scalar coefficients $a$ and $b$, which resembles SAXPY ($y \leftarrow ax + y$) in form.

In [None]:
fpfn = os.path.join(dn, '01_basic_array_operations_4.cu')
with open(fpfn, 'r') as f:
  cuda_source = f.read()
print(cuda_source)

#pragma pack(push, 4)
struct multAddCoefficient
{
  float a, b;
};
#pragma pack(pop)

__constant__ multAddCoefficient g_maCoef = {}; // constant memory

extern "C" __global__ void getMACoefSize(int *sz)
{
  const int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index > 0)
  {
    return;
  }
  *sz = sizeof(multAddCoefficient);
}

extern "C" __global__ void multAddConstantMemory(float *x)
{
  const int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index >= BASIC_ARRAY_OPTRATIONS_4_LENGTH)
  {
    return;
  }
  x[index] = g_maCoef.a * x[index] + g_maCoef.b;
  // use FMA for faster operation
  // x[index] = fmaf(g_maCoef.a, x[index], g_maCoef.b);
}



In this CUDA kernel, a POD (Plain Old Data) structure named `multAddCoefficient` is defined and declared as constant memory `g_maCoef`.
The `#pragma pack(push, 4)` and `#pragma pack(pop)` directives cap member alignment at 4 bytes so that the struct contains no padding and matches the memory layout of the Python-side data type. For two `float` members the natural layout is already free of padding, so here the directives mainly make the intended layout explicit.

Also, notice the commented-out line utilizing the `fmaf` (Fused Multiply-Add) function.
This is included for optional experimentation, and its numerical implications will be discussed in more detail after the kernel execution.
While `fmaf` generally offers performance and precision benefits by combining multiplication and addition into a single instruction, it's worth noting its potential for subtle differences in numerical outcomes compared to separate multiplication and addition.

In [None]:
cuda_source = cuda_source.replace('BASIC_ARRAY_OPTRATIONS_4_LENGTH', str(length))
module = cp.RawModule(code=cuda_source)
module.compile()

The `multAddCoefficient` structure defined on the GPU side adheres to CUDA's data packing rules. Typically, `float` types are 4-byte aligned, so this structure, with `a` and `b` each being 4 bytes, totals 8 bytes and is 4-byte aligned.

On the Python side, `numpy.dtype` is used to define a data type `dtype_ma_coef` that strictly matches the alignment and memory layout of the CUDA-side structure.
`'a': (np.float32, 0)` means a 4-byte float starting at offset 0, and `'b': (np.float32, 4)` means a 4-byte float starting at offset 4, ensuring proper mapping between the CUDA structure and Python data.

In [None]:
dtype_ma_coef = np.dtype({'a': (np.float32, 0), 'b': (np.float32, 4)})

sz_gpu = cp.empty(1, dtype=cp.int32)
gpu_func_get_size = module.get_function('getMACoefSize')
gpu_func_get_size(
    block=(1,),
    grid=(1,),
    args=(sz_gpu,)
)
cp.cuda.runtime.deviceSynchronize()
sz = int(sz_gpu[0].get())
assert sz == dtype_ma_coef.itemsize, \
    f'Expected POD size {dtype_ma_coef.itemsize}, but got {sz}'
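
As a host-only cross-check of the expected layout, the same struct can be described with the standard `ctypes` module. This is a sketch under the assumption that the host compiler's rules and the device compiler's rules agree for this simple struct; the class name `MultAddCoefficient` is chosen for this example, and the device-side size query above remains the authoritative check.

```python
import ctypes


class MultAddCoefficient(ctypes.Structure):
    # Same member order and types as the CUDA struct.
    _fields_ = [('a', ctypes.c_float), ('b', ctypes.c_float)]


size = ctypes.sizeof(MultAddCoefficient)
offsets = {
    name: getattr(MultAddCoefficient, name).offset
    for name, _ in MultAddCoefficient._fields_
}
print(size, offsets)
```

No explicit packing is requested here, because two 4-byte floats already sit back to back without padding.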

This structure is directly uploaded to the GPU's constant memory via a pointer obtained with `get_global()`, and then utilized within the kernel.
Specifically, the `ma_coef` NumPy array, whose dtype matches the layout of the CUDA `multAddCoefficient` structure, is converted into a byte string using `tobytes()`.
This byte array is then uploaded to the GPU's constant memory region that `ma_coef_gpu_constant` points to.
This direct byte-level transfer ensures that the exact memory layout and values of the host-side POD structure are faithfully replicated in the device's constant memory, making them accessible to the CUDA kernel.

In [None]:
b = np.random.rand(1,).astype(np.float32)

ma_coef = np.empty((1,), dtype=dtype_ma_coef)
ma_coef[0]['a'] = a[0]
ma_coef[0]['b'] = b[0]

ma_coef_ptr = module.get_global('g_maCoef')
ma_coef_gpu_constant = cp.ndarray((sz,), dtype=cp.byte, memptr=ma_coef_ptr)
ma_coef_gpu_constant[:] = cp.frombuffer(ma_coef.tobytes(), dtype=np.byte)
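
To show what such a byte payload looks like, the sketch below packs two coefficients with the standard `struct` module and unpacks them again. The values and the names `coef_a` and `coef_b` are illustrative, and little-endian byte order is assumed, as on typical x86 hosts.

```python
import struct

coef_a, coef_b = 0.5, 0.25  # illustrative values that float32 represents exactly

# Two consecutive little-endian float32 values, with no padding in between.
payload = struct.pack('<ff', coef_a, coef_b)
print(len(payload), payload.hex())

# Reading the bytes back with the same layout recovers both members.
restored = struct.unpack('<ff', payload)
print(restored)
```

The payload is 8 bytes long, with `a` in the first four bytes and `b` in the last four, which is the layout the kernel expects for `g_maCoef`.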

The GPU function is then executed, followed by an assertion to verify calculation success.

In [None]:
x_gpu = cp.array(x, dtype=cp.float32)

gpu_func = module.get_function('multAddConstantMemory')
sz_block = (1024,)
sz_grid = (math.ceil(length / sz_block[0]),)
gpu_func(
  block=sz_block, grid=sz_grid,
  args=(x_gpu,)
)
cp.cuda.runtime.deviceSynchronize()

x2 = x_gpu.get()
err = np.abs(((x2 - b)/ a) - x)
assert np.max(err) < err_eps

The kernel executes the multiply-add operation: `x[index] = g_maCoef.a * x[index] + g_maCoef.b;`.

As noted above, the kernel contains a commented-out line that uses `fmaf` (fused multiply-add).
A fused multiply-add rounds once, after the addition, whereas a separate multiplication and addition round twice, so the two forms can differ in the last bits of the result.
If the `fmaf` line is uncommented and the code is run with `err_eps = 1E-6`, the assertion might therefore fail.
Note that the compiler may already contract `a * x + b` into a fused instruction (NVRTC enables `--fmad` by default), in which case both variants give identical results.
Those interested can uncomment the line to compare performance and precision.
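
The rounding difference can be reproduced on the host without a GPU. The sketch below emulates float32 rounding with the `struct` module; it relies on the fact that a double holds the product of two float32 values exactly. The inputs are contrived so that the subtraction cancels most of the product, which makes the effect visible; the helper `f32` and the values are assumptions for illustration, not a measurement of the kernel above.

```python
import struct


def f32(value):
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack('f', struct.pack('f', value))[0]


a = 1.0 + 2.0 ** -12
x = 1.0 + 2.0 ** -13
b = -1.0

separate = f32(f32(a * x) + b)  # product rounded to float32, then the sum
fused = f32(a * x + b)          # exact product and sum, rounded once at the end

print(separate, fused, fused - separate)
```

The separately rounded product loses its lowest bit, `2 ** -25`, before the addition, while the fused form keeps it, so the two results differ by exactly that amount.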

Finally, this notebook explains how to measure execution time using CuPy's `Event` objects.
Event objects allow for the recording of timestamps at specific points on the GPU.
The `get_elapsed_time()` function is then used to calculate the elapsed time from these timestamps.

This involves creating start and end event objects.
`start.record()` enqueues a marker for the beginning of the timed section on the GPU.
Immediately after, `start.synchronize()` makes the CPU wait until this marker has been recorded.
The GPU function (the target computation) is then called.
`cp.cuda.runtime.deviceSynchronize()` then blocks the CPU until the GPU has finished all pending work, and `end.record()` marks the completion of the timed section.
Because events are recorded in stream order, recording `end` directly after the launch and calling `end.synchronize()` would also capture the kernel's completion; with the explicit device synchronization, the measured interval additionally includes the host-side wait.
Finally, `end.synchronize()` makes the CPU wait for the end marker, and the elapsed time in milliseconds is retrieved with `get_elapsed_time()`.

In [None]:
start = cp.cuda.Event()
end = cp.cuda.Event()

start.record()
start.synchronize()
gpu_func(
  block=sz_block, grid=sz_grid,
  args=(x_gpu,)  # multAddConstantMemory takes only the array pointer
)
cp.cuda.runtime.deviceSynchronize()
end.record()
end.synchronize()
msec = cp.cuda.get_elapsed_time(start, end)
print('Elapsed Time: {} [msec]'.format(msec))

Elapsed Time: 0.517952024936676 [msec]
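
A single measurement can be noisy, and the first launch often carries one-time overhead. The following is a minimal sketch of a warm-up plus repeated-measurement pattern. The helper `time_launches` is hypothetical, and `HostEvent` is a CPU stand-in defined here only so that the sketch runs without a GPU; with CuPy, `cp.cuda.Event` and `cp.cuda.get_elapsed_time` would be passed instead.

```python
import statistics
import time


def time_launches(launch, make_event, elapsed_ms, warmup=1, repeats=5):
    """Return one timing in milliseconds per repeat, using start/end events."""
    for _ in range(warmup):
        launch()
    timings = []
    for _ in range(repeats):
        start, end = make_event(), make_event()
        start.record()
        launch()
        end.record()
        end.synchronize()  # wait until the end marker has been reached
        timings.append(elapsed_ms(start, end))
    return timings


class HostEvent:
    """Hypothetical CPU stand-in for an event object (records wall-clock time)."""

    def record(self):
        self.t = time.perf_counter()

    def synchronize(self):
        pass  # nothing to wait for on the host


def host_elapsed_ms(start, end):
    return (end.t - start.t) * 1e3


timings = time_launches(lambda: sum(range(10_000)), HostEvent, host_elapsed_ms)
print('median: {:.3f} [msec]'.format(statistics.median(timings)))
```

With the notebook's objects, the call might look like `time_launches(lambda: gpu_func(block=sz_block, grid=sz_grid, args=(x_gpu,)), cp.cuda.Event, cp.cuda.get_elapsed_time)`. Reporting the median makes the figure less sensitive to occasional outliers.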
