# Ch. 4 Evaluating Functions on Multi-Dimensional Grids

In the notebook for Chapter 2 "Function Evaluations and the Map Pattern" we created
serial and parallel implementations of the _map_ app to evaluate
a trigonometric function on a one-dimensional grid of input data.
(The notebook for Chapter 3 "Tools for Timing, Profiling, and Debugging" is in development and will come out later.) In this notebook, we create the _map2D_ and _map3D_ apps
that extend the map pattern to perform function evaluation on two-dimensional (2D)
and three-dimensional (3D) grids of input data.

> CUDA natively supports indexing up to three dimensions, so we will typically not continue on to higher-dimensional grids. If you need to handle a problem that would naturally involve a geometric grid of dimension $D_g>3$, you can include a loop over additional dimensions within the kernel, or you can map the index values for the $D_g$-dimensional geometric grid to the index values of a $D_c$-dimensional computational grid with $D_c \leq 3$.
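
As a rough illustration of the second option, the sketch below flattens the indices of a 4D geometric grid into a single linear index (so $D_c = 1$) and recovers them again. The grid shape and the helper names `to_linear()` and `to_geometric()` are hypothetical; the mapping uses NumPy's row-major `ravel_multi_index()` and `unravel_index()`.

```python
import numpy as np

# Hypothetical 4D geometric grid: one more dimension than CUDA indexes directly.
GEOM_SHAPE = (4, 3, 5, 2)

def to_linear(idx, shape=GEOM_SHAPE):
    """Map a 4-tuple of geometric indices to one linear (1D) index."""
    return int(np.ravel_multi_index(idx, shape))

def to_geometric(n, shape=GEOM_SHAPE):
    """Recover the 4-tuple of geometric indices from a linear index."""
    return tuple(int(v) for v in np.unravel_index(n, shape))

n = to_linear((2, 1, 3, 1))
print(n, to_geometric(n))  # 77 (2, 1, 3, 1)
```

Inside a kernel, the same mapping could presumably be written by hand with integer division and remainder operations on the thread index.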


## 4.1 Evaluating a Two-Dimensional Array of Function Values: The _map2D_ App

We start by using the _map_ app as a template to expand upon and create
the _map2D_ app that computes values of a function on a 2D grid of input points. For the sample code, we choose a new function of 2 variables:

$$f_{2D}(x,y) = \frac{\sin(\pi x) \sinh(\pi y)}{\sinh(\pi)} \tag{4.1}$$

which involves both trigonometric and hyperbolic functions.

As alluded to in the comment above, there are really 2 grids to deal with: a geometric grid of points $\{x_p,y_q\}=\{x_0+p \Delta x, y_0 + q \Delta y \}$ where we want to evaluate the function $f(x_p,y_q)$ and a computational grid of threads with indices `i,j`. 

While there is considerable freedom in choosing the relation between the geometric and computational grids, here we present a simple, intuitive, and very commonly used choice. We establish a 1:1 correspondence between the geometric and computational grids, by identifying $p$ with `i` and $q$ with `j` so that each 2-tuple of index values specifies a corresponding point on the geometric grid. 
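
The correspondence can be sketched in a few lines. The grid sizes, the interval endpoints, and the helper `point()` below are illustrative assumptions, not part of the sample apps; the check at the end confirms that the index formula reproduces the coordinates generated by `np.linspace()`.

```python
import numpy as np

# Hypothetical grid parameters: NX points spanning [x0, x1], NY points spanning [y0, y1].
NX, NY = 5, 3
x0, x1 = 0.0, 1.0
y0, y1 = 0.0, 1.0
dx = (x1 - x0) / (NX - 1)
dy = (y1 - y0) / (NY - 1)

def point(i, j):
    """Geometric coordinates (x_p, y_q) for computational indices p = i, q = j."""
    return x0 + i * dx, y0 + j * dy

x = np.linspace(x0, x1, NX)
y = np.linspace(y0, y1, NY)
print(point(2, 1))  # (0.5, 0.5)
print(np.allclose(x, [point(i, 0)[0] for i in range(NX)]))
```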

Having established the relationship of the computational grid indices to the coordinate values for points on the geometric grid, we are now ready to look at implementations that map the function of 2 variables $f(x,y)$ on the 2D geometric grid. 

### 4.1.1 Serial Implementation

Just like the one-dimensional _map_ app, our serial implementation of _map2D_ consists of 2 files:

1. _apps/map2D/main.py_ which contains the `main()` function that initializes a 2D array, calls a function to evaluate the function on the 2D geometric grid,
and plots the computed 2D array of function values.

2. _apps/map2D/serial.py_ which computes the function values, stores them in the appropriate positions in the array, and returns the array of computed function values.

Listings 4.1 and 4.3 together form the serial implementation of _map2D_. In both cases, only a few changes are needed to transform the previously created _map_ app into the _map2D_ app.

```
import numpy as np
import matplotlib.pyplot as plt

NX, NY = 128, 64

def main():
	x = np.linspace(0, 1, NX)
	y = np.linspace(0, 1, NY)

	from serial import fArray2D
	f = fArray2D(x, y)

	X, Y = np.meshgrid(x, y)
	plt.contourf(X, Y, f.T) #`T` is shorthand for `transpose()`
	plt.xlabel("X")
	plt.ylabel("Y")
	plt.show()

if __name__ == '__main__':
	main()
```

 $$ \text{ Listing 4.1: }apps/map2D/main.py$$
 
> For applications like creating multidimensional plots, the numpy function `meshgrid()` is often used to create arrays of coordinate values corresponding to points on a Cartesian grid from 1D arrays of coordinate values along each coordinate direction. Here `x` and `y` are the 1D arrays of coordinate values, so `np.meshgrid(x,y)` returns a pair of 2D arrays, and `X,Y = np.meshgrid(x,y)` assigns them to `X` and `Y`. With the default indexing, both arrays have shape `(NY, NX)`; `X` stores the $x$-coordinate and `Y` stores the $y$-coordinate of each grid point. Together with the 2D array of function values, these are the positional arguments expected by standard 2D plotting functions like `contourf()`.
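
A quick way to see what `meshgrid()` produces is to inspect the shapes for a small grid. The sizes below (4 points along $x$, 3 along $y$) and the placeholder array `f` are chosen only for illustration.

```python
import numpy as np

x = np.linspace(0, 1, 4)   # 4 points along x (small hypothetical sizes)
y = np.linspace(0, 1, 3)   # 3 points along y

X, Y = np.meshgrid(x, y)   # default indexing='xy'
print(X.shape, Y.shape)    # both (3, 4): rows follow y, columns follow x

# Entry [q, p] of X and Y holds the coordinates of grid point (x_p, y_q).
assert X[2, 1] == x[1] and Y[2, 1] == y[2]

# An array indexed as f[i, j] (i along x) has shape (4, 3), so its transpose
# matches the shape of X and Y.
f = np.zeros((x.size, y.size))
print(f.T.shape == X.shape)  # True
```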

Most of the differences between _apps/map/main.py_ and _apps/map2D/main.py_ arise in defining additional variables for the added dimension. Listing 4.2 provides a comparison of the files, with code from _apps/map/main.py_ appearing as comments after `#`. `#SAME` indicates that the same line appears in both files.

```
import numpy as np
import matplotlib.pyplot as plt

NX, NY = 128, 64 # N = 128

def main():
	x = np.linspace(0, 1, NX) #SAME
	y = np.linspace(0, 1, NY)

	from serial import fArray2D #from serial import sArray
	f = fArray2D(x, y) #f = sArray(x)

	X, Y = np.meshgrid(x, y) #plt.plot(x,f,'bo')
	plt.contourf(X, Y, f.T) #`T` is shorthand for `transpose()`
	plt.xlabel("X")
	plt.ylabel("Y")
	plt.show()

if __name__ == '__main__':
	main()
```

$$ \text{Listing 4.2: Comparison of } apps/map2D/main.py \text{ and } apps/map/main.py $$

Instead of a single value, `N`, to define the array size, we use two separate values, `NX` and `NY`, to specify the number of points in each grid direction.
`np.linspace()` is called a second time to create an array `y` of coordinate values along the $y$-axis. The name of the imported function is changed to `fArray2D()`. Lastly, the graphical output is a contour plot, so the commands
to create the plot require an additional line calling `np.meshgrid()` to create the grid of $x,y$ coordinate pairs, and `plt.plot()` is replaced with `plt.contourf(X, Y, f.T)`.

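> The output array is transposed using `f.T` as shorthand for `f.transpose()` to provide data in the format required by `contourf()`: `f` is indexed as `f[i,j]` with `i` along $x$, while `contourf(X, Y, Z)` expects `Z` to have the same shape as `X` and `Y`, which here is `(NY, NX)`.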

The file _apps/map/serial.py_ defining the imported function is changed in much the same way to produce _apps/map2D/serial.py_, as shown in Listing 4.3.

```
1 import math
2 import numpy as np
3 PI = np.pi
4 def f2D(x, y):
5   return math.sin( PI *x)* math.sinh( PI *y)/ math.sinh( PI )
6
7 def fArray2D(x, y):
8   nx = x.size
9   ny = y.size
10  f = np.empty((nx ,ny), dtype = np.float32)
11
12  for i in range(nx):
13      for j in range(ny):
14          f[i,j] = f2D(x[i], y[j])
15  return f
```
$$\text{Listing 4.3: } apps/map2D/serial.py$$

All of the alterations are to accommodate the new dimension. On Lines 4-5, the function to be evaluated at each point is named `f2D()` and corresponds to Equation 4.1, $f_{2D}(x, y)$. Lines 8-9 assign to the variables `nx` and `ny` the sizes of the input coordinate arrays `x` and `y`. On line 10, an empty numpy array `f` of size `nx` $\times$ `ny` is created to provide storage for the array of 32-bit function values to be computed. Lines 12-13 specify nested `for` loops that iterate over the range of index values corresponding to the size of the input arrays. On line 14, the function `f2D()` is evaluated with arguments corresponding to the coordinates of the corresponding point on the geometric grid (stored in `x[i],y[j]`), and the function value is stored as the corresponding array element `f[i,j]`.
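
One way to sanity-check the nested loops is to compare them against a vectorized NumPy evaluation. The sketch below restates the loop version without line numbers and adds a hypothetical helper, `fArray2D_vectorized()`, that is not part of the app; it relies on broadcasting a column of $x$-dependent terms against a row of $y$-dependent terms.

```python
import math
import numpy as np

def f2D(x, y):
    return math.sin(np.pi * x) * math.sinh(np.pi * y) / math.sinh(np.pi)

def fArray2D_loops(x, y):
    """Same nested-loop structure as Listing 4.3."""
    f = np.empty((x.size, y.size), dtype=np.float32)
    for i in range(x.size):
        for j in range(y.size):
            f[i, j] = f2D(x[i], y[j])
    return f

def fArray2D_vectorized(x, y):
    """Hypothetical cross-check using NumPy broadcasting instead of loops."""
    fx = np.sin(np.pi * x)[:, np.newaxis]                      # shape (nx, 1)
    fy = (np.sinh(np.pi * y) / np.sinh(np.pi))[np.newaxis, :]  # shape (1, ny)
    return (fx * fy).astype(np.float32)                        # shape (nx, ny)

x = np.linspace(0, 1, 16)
y = np.linspace(0, 1, 8)
print(np.allclose(fArray2D_loops(x, y), fArray2D_vectorized(x, y), atol=1e-6))
```

The same comparison could later be applied to the output of the parallel version, with a tolerance appropriate for 32-bit floats.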

That concludes the serial implementation, and we are ready to parallelize.

### 4.1.2 Parallel Implementation

As with the 1D case, the plan for parallelization is straightforward. Modify the imported function `fArray2D()` so that the `for` loops in the serial implementation are replaced with the launch of a kernel function `fKernel2D[]()` that performs the computation previously contained in the loops.

We store the modified code in _apps/map2D/parallel.py_, which is shown in Listing 4.4. The changes performed to get from _apps/map2D/serial.py_ to _apps/map2D/parallel.py_ are described below, and you should compare Listing 4.4 with Listing 4.3 as you read that description.

```
1 import math
2 import numpy as np
3 from numba import cuda
4 PI = np.pi
5 TPBX = 16
6 TPBY = 16
7
8 @cuda.jit( device = True )
9 def f2D(x, y):
10  return math.sin( PI *x)* math.sinh( PI *y)/ math.sinh( PI )
11
12 @cuda.jit('void (f4[:] , f4[:] , f4[: ,:])')
13 def fKernel2D(d_x , d_y , d_f):
14  i , j = cuda.grid(2)
15  nx , ny = d_f.shape
16  if i < nx and j < ny:
17      d_f[i,j] = f2D(d_x[i], d_y[j])
18
19 def fArray2D(x, y):
20  nx = x.size
21  ny = y.size
22
23  d_x = cuda.to_device(x)
24  d_y = cuda.to_device(y)
25  d_f = cuda.device_array((nx , ny), dtype = np.float32)
26
27  gridDims = ((nx + TPBX - 1)//TPBX ,
28              (ny + TPBY - 1)// TPBY )
29  blockDims = (TPBX , TPBY )
30
31  fKernel2D[ gridDims , blockDims ](d_x , d_y , d_f)
32
33  return d_f.copy_to_host()
```

$$ \text{Listing 4.4: } apps/map2D/parallel.py$$

Like the serial version, the file begins by importing relevant packages, now including numba's `cuda` package to provide support for parallelism. Lines 5-6 assign values for the variables `TPBX` and `TPBY` that are used to define the size (number of threads along each index direction) of the blocks to be established when the kernel is launched.

> __Execution parameter limitations:__ In the implementation of CUDA, a number of design decisions had to be made that lead to practical limitations on how it can be used. We have mentioned one already: Grids of dimension higher than 3 are not supported. Here we run into another important limit: There is a maximum number of threads that are allowed in a block, and the limit is 1024 for almost all currently available GPUs. Since $64 \times 64 = 4096$, a $64 \times 64$ block would violate the restriction and generate an error, so the sample code launches a grid with $16 \times 16$ blocks (each with 256 threads). See the CUDA documentation (or Wikipedia's CUDA page) for the full set of technical specifications and limitations.
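
The arithmetic behind this limit is easy to check before launching a kernel. The helper functions below are hypothetical and test only the total thread count against the 1024 figure quoted above; the actual limits depend on the GPU, and additional per-dimension limits may apply.

```python
MAX_THREADS_PER_BLOCK = 1024  # limit quoted above; the actual value depends on the GPU

def threads_per_block(block_dims):
    """Total number of threads in a block with the given dimensions."""
    total = 1
    for d in block_dims:
        total *= d
    return total

def block_is_valid(block_dims, limit=MAX_THREADS_PER_BLOCK):
    """True if the block does not exceed the total thread limit."""
    return threads_per_block(block_dims) <= limit

for dims in [(16, 16), (32, 32), (64, 64), (8, 8, 8)]:
    print(dims, threads_per_block(dims), block_is_valid(dims))
```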

Lines 9-10 define the function `f2D()` that is to be evaluated at each grid point. As in the 1D case, this function will be called from the kernel, so the decorator `@cuda.jit( device = True )` is included on line 8, immediately before the function definition, to identify it as a device function.

The definition of the kernel function `fKernel2D()` appears on lines 12-17 including the decorator `@cuda.jit('void(f4[:] , f4[:] , f4[: ,:])')` that accomplishes several goals:

1. The decorator identifies `fKernel2D` as a kernel function to be launched from the host and executed on the device.
2. It satisfies the requirement that a kernel cannot return a value by specifying the return type `void`.
3. The code in the parentheses following `void` indicates that there will be 3 arguments: two 1D arrays (with a single colon in the square brackets) followed by a 2D array (with `[:,:]`). `f4` indicates a floating-point type with 4 bytes of memory allocated for each entry in the array. Thus, `f4` provides an abbreviated alternative to `np.float32`.

The definition statement on line 13 indicates that `fKernel2D` takes 3 arguments: the device array `d_x` that should store a copy of the $x$-coordinates, the device array `d_y` that should store a copy of the $y$-coordinates, and the device array `d_f` to store the computed function values. On line 14, `i,j = cuda.grid(2)` defines the 2-tuple of index values.

> Note that `i,j = cuda.grid(2)` is numba's handy abbreviation for the equivalent code:
<br>`i = cuda.threadIdx.x + cuda.blockDim.x * cuda.blockIdx.x`
<br>`j = cuda.threadIdx.y + cuda.blockDim.y * cuda.blockIdx.y`
<br>the first line of which defines `i` according to the 1D index formula, while the second line defines `j` simply by changing the "suffix" from `.x` to `.y`.
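
To see which index pairs this formula generates, the following CPU-only sketch enumerates them serially. The function `emulate_grid2()` and the small launch dimensions are hypothetical; no GPU or numba is involved, and the loop order says nothing about the order in which real threads execute.

```python
def emulate_grid2(grid_dims, block_dims):
    """Enumerate the (i, j) pairs given by the 2D index formula, one per thread.

    This is a serial, CPU-only emulation for illustration; it does not use numba.
    """
    pairs = []
    for bx in range(grid_dims[0]):                # plays the role of blockIdx.x
        for by in range(grid_dims[1]):            # blockIdx.y
            for tx in range(block_dims[0]):       # threadIdx.x
                for ty in range(block_dims[1]):   # threadIdx.y
                    i = tx + block_dims[0] * bx
                    j = ty + block_dims[1] * by
                    pairs.append((i, j))
    return pairs

pairs = emulate_grid2(grid_dims=(2, 3), block_dims=(4, 2))
print(len(pairs), max(pairs))  # 48 (7, 5)
```

Each of the $8 \times 6 = 48$ index pairs appears exactly once, which is why every thread can safely write its own entry of the output array.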

Line 15 assigns to `nx,ny` the dimensions of the array `d_f` to be computed. Line 16 tests whether the indices `i` and `j` are within the index bounds of `d_f` and, if so, `d_f[i,j] = f2D(d_x[i], d_y[j])` evaluates the function at the corresponding point on the geometric grid and stores the result as the indexed entry in `d_f`.

> __Bounds checking:__ If an array dimension is not an exact multiple of the corresponding block size, then the "last" block (with the largest `blockIdx` value) will include index values that lie beyond the extent of the array. To avoid reading or writing into unintended portions of memory, the kernel should routinely include a bounds check similar to the code on line 16. _Later we will consider whether this produces a performance penalty due to_ ___thread divergence___.

The remainder of the file, lines 19-33, defines the wrapper function `fArray2D()`. Lines 20-21 assign to `nx` and `ny` the lengths of the input arrays of coordinate values `x` and `y`. Lines 23-24 call `cuda.to_device()` to create `d_x` and `d_y` (as "mirror" copies of the inputs `x` and `y`) that give the kernel access to copies of the input data. On line 25, `d_f` is created to provide an `nx` $\times$ `ny` device array to store the function values. Lines 27-29 set the values for the execution parameters. Here, both `gridDims` and `blockDims` are 2-tuples (for a 2D grid). The components of `blockDims` are assigned to match the specified values, `TPBX` and `TPBY`, and each dimension of `gridDims` is computed from the array size and the corresponding component of `blockDims` according to the same formula used in the 1D case. The call to execute the kernel launch appears on line 31: `fKernel2D[ gridDims , blockDims ](d_x , d_y , d_f)`. This call matches the format of the 1D version: kernel function name, followed by `[ gridDims , blockDims ]`, followed by parentheses containing the comma-separated list of arguments. Finally, on line 33, the array of computed values is copied back to the host by `return d_f.copy_to_host()` so the output is available for plotting on the CPU side.
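
The grid-size formula and its connection to the bounds check can be explored without a GPU. In the sketch below, `grid_dims()` and `idle_threads()` are hypothetical helpers, and the $100 \times 60$ array is an assumed example of sizes that are not multiples of the block size.

```python
def grid_dims(shape, block_dims):
    """Blocks needed per direction so that blocks * threads >= array extent."""
    return tuple((n + tpb - 1) // tpb for n, tpb in zip(shape, block_dims))

def idle_threads(shape, block_dims):
    """Threads launched beyond the array bounds (these fail the bounds check)."""
    launched = 1
    needed = 1
    for n, tpb, nb in zip(shape, block_dims, grid_dims(shape, block_dims)):
        launched *= nb * tpb
        needed *= n
    return launched - needed

print(grid_dims((128, 64), (16, 16)))     # sizes from the sample code: (8, 4)
print(grid_dims((100, 60), (16, 16)))     # sizes that do not divide evenly: (7, 4)
print(idle_threads((100, 60), (16, 16)))  # 1168
```

For the sample sizes every launched thread has work to do; for the uneven case, 1168 of the 7168 launched threads fall outside the array and are filtered out by the `if` statement in the kernel.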

> __Device array methods:__ `d_f` was created as a device array object, so it comes with a variety of methods including `copy_to_host()` for copying data from device to host.

> __Return value:__ Note carefully how the results get back to the host. A device array is defined in the wrapper function and included as a kernel argument. The kernel stores the output in the device array, but does ___NOT___ return anything to the host. It is the wrapper function, not the kernel itself, that returns the output to the host using `copy_to_host()`.

That completes the description of _apps/map2D/parallel.py_, and all that remains is a minor change to _apps/map2D/main.py_. To run the parallelized version of the _map2D_ app, we want to use the parallel version of `fArray2D()` instead of the serial version used previously, so in `main()` of _apps/map2D/main.py_, `from serial import fArray2D` should be replaced by `from parallel import fArray2D`.

> __File locations:__ For the `import` statement to work as desired, the files to be imported (_apps/map2D/serial.py_ and _apps/map2D/parallel.py_) should be located in the same directory as _apps/map2D/main.py_. To import from other locations, you must tell Python which directory contains the files being imported.
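
One common way to provide that directory information is to add the directory to `sys.path` before importing. The sketch below is self-contained: it creates a temporary directory and a stand-in module named `demo_serial` (both hypothetical, used only for the demonstration) rather than touching the actual app files.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for a separate directory holding a module; the directory and the
# module name demo_serial are hypothetical and created here only for the demo.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "demo_serial.py").write_text(
    "def fArray2D(x, y):\n"
    "    return [[xi + yj for yj in y] for xi in x]\n"
)

sys.path.insert(0, str(module_dir))   # tell Python where to look
importlib.invalidate_caches()         # make sure the new file is noticed
demo_serial = importlib.import_module("demo_serial")

print(demo_serial.fArray2D([0, 1], [10, 20]))  # [[10, 20], [11, 21]]
```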



## 4.2 Evaluating a Three-Dimensional Array of Function Values: The _map3D_ App

Having seen the implementation of _map2D_, which basically involved adding a second coordinate direction to _map_, the extension to compute function values on a 3D grid should seem relatively straightforward; it involves similarly adding a third coordinate direction. Let’s start by choosing a function of three variables:

$$f_{3D}(x,y,z) = \frac{\sinh(\pi y)}{\sinh(\pi)} \sin(\pi x) \cos(\pi z) $$

Listings 4.5, 4.6, and 4.7 show the code for _apps/map3D/main.py_, _apps/map3D/serial.py_, and _apps/map3D/parallel.py_. For now, we settle for printing results to the terminal; we will return later to consider visualization of 3D grids of data.

```
import numpy as np

NX, NY, NZ = 8, 8, 16  # number of grid points along x, y, and z

def main():
	# 1D arrays of coordinate values along each direction
	x = np.linspace(0, 1, NX)
	y = np.linspace(0, 1, NY)
	z = np.linspace(0, 1, NZ)

	from serial import fArray3D
	f = fArray3D(x, y, z)  # 3D array of function values
	print(f)

if __name__ == '__main__':
	main()
```

$$ \text{Listing 4.5: } apps/map3D/main.py$$

Note that _apps/map3D/main.py_ is a bit simpler (without the plotting code) and that, in _apps/map3D/parallel.py_, the number of threads per block in each direction is reduced so that the total number of threads in each block, $8 \times 8 \times 8 = 512$, is under the 1024 limit.

At this point, the code will hopefully seem pretty readable, so you should read through it and then experiment using it to see how it works.

```
File: serial.py
01: import math
02: import numpy as np
03: PI = np.pi
04: def f3D(x0, y0, z0):
05: 	return math.sin(PI * x0) * math.cos(PI * z0) * math.sinh(PI * y0) / math.sinh(PI)
06: 
07: def fArray3D(x, y, z):
08: 	nx = x.shape[0]
09: 	ny = y.shape[0]
10: 	nz = z.shape[0]
11: 	f = np.empty(shape=[nx,ny,nz], dtype = np.float32)
12: 	for i in range(nx):
13: 		for j in range(ny):
14: 			for k in range(nz):
15: 				f[i,j,k] = f3D(x[i], y[j], z[k])
16: 	return f
```
$$ \text{Listing 4.6: } apps/map3D/serial.py$$


```
File: parallel.py
01: import math
02: import numpy as np
03: from numba import cuda
04: 
05: TPBX, TPBY, TPBZ = 8, 8, 8
06: 
07: @cuda.jit(device = True)
08: def f3D(x0, y0, z0):
09: 	return math.sin(np.pi * x0) * math.cos(np.pi * z0) * math.sinh(np.pi * y0) / math.sinh(np.pi)
10: 
11: @cuda.jit
12: def fKernel3D(d_f, d_x, d_y, d_z):
13: 	i,j,k = cuda.grid(3)
14: 	nx,ny,nz = d_f.shape	
15: 	if i < nx and j < ny and k < nz:
16: 		d_f[i,j,k] = f3D(d_x[i], d_y[j], d_z[k])
17: 
18: def fArray3D(x, y, z):
19: 	nx = x.shape[0]
20: 	ny = y.shape[0]
21: 	nz = z.shape[0]
22: 	d_x = cuda.to_device(x)
23: 	d_y = cuda.to_device(y)
24: 	d_z = cuda.to_device(z)
25: 	d_f = cuda.device_array(shape = [nx,ny,nz], dtype = np.float32)
26: 	gridDims = (nx+TPBX-1)//TPBX, (ny+TPBY-1)//TPBY, (nz+TPBZ-1)//TPBZ
27: 	blockDims = TPBX, TPBY, TPBZ
28: 	fKernel3D[gridDims, blockDims](d_f, d_x, d_y, d_z)
29: 
30: 	return d_f.copy_to_host()
```
$$ \text{Listing 4.7: } apps/map3D/parallel.py$$


The changes to get from _apps/map2D/serial.py_ to _apps/map3D/serial.py_ include renaming and updating the functions to `f3D()` and `fArray3D()`. Each has three inputs for the $x$, $y$, and $z$ directions. `fArray3D()` includes a triply nested loop on Lines 12-15 with indices `i`, `j`, and `k` whose bounds are determined from the shapes of the input arrays `x`, `y`, and `z`. Each output is stored as a 32-bit float in the three-dimensional array `f`.

The modifications to _apps/map3D/parallel.py_ are similar. The suffix on the function names is changed from 2D to 3D and each takes 3 arguments. The indices in the kernel are assigned using `i, j, k = cuda.grid(3)`, and the bounds are determined from the shape of the 3D output device array `d_f`. Bounds are checked in all three directions, and the computed function value is stored in the appropriate
location in `d_f`.

The wrapper function uses the newly introduced `TPBZ` to set the `.z`-component size of each block. The input arrays are copied to device arrays on lines 22-24, and a three-dimensional output device array
is created on Line 25. The execution parameters, which are now both tuples of length 3, are specified on Lines 26-27, and the kernel call occurs on line 28. The wrapper once again ends by copying the results
from the output device array to the host on Line 30.
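
To experiment with the role of the bounds check when no GPU is at hand, the kernel launch can be imitated on the CPU by visiting every thread index that would be launched. The function `fArray3D_emulated()` and the array sizes below are assumptions for illustration; the sizes are deliberately not multiples of the block size.

```python
import math
import numpy as np

TPB = (8, 8, 8)  # threads per block in each direction, as in Listing 4.7

def f3D(x0, y0, z0):
    return math.sin(np.pi * x0) * math.cos(np.pi * z0) * math.sinh(np.pi * y0) / math.sinh(np.pi)

def fArray3D_emulated(x, y, z):
    """CPU-only stand-in for the kernel launch: visit every launched thread index."""
    shape = (x.size, y.size, z.size)
    f = np.zeros(shape, dtype=np.float32)
    grid = tuple((n + t - 1) // t for n, t in zip(shape, TPB))
    launched = tuple(g * t for g, t in zip(grid, TPB))
    for i, j, k in np.ndindex(*launched):
        if i < shape[0] and j < shape[1] and k < shape[2]:   # bounds check
            f[i, j, k] = f3D(x[i], y[j], z[k])
    return f, grid

x = np.linspace(0, 1, 5)
y = np.linspace(0, 1, 8)
z = np.linspace(0, 1, 10)
f, grid = fArray3D_emulated(x, y, z)
print(grid, f.shape)  # (1, 1, 2) (5, 8, 10)
```

Removing the `if` statement in this emulation raises an `IndexError`, since 1024 thread indices are generated for only 400 array entries; on the GPU, the corresponding out-of-bounds accesses would touch unintended memory instead.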

This finishes our discussion on setting up multi-dimensional grids. After reading carefully through this notebook, you should be ready to do Homework 3 and then move on to further notebooks.

## Suggested Projects

1. Experiment with removing the data type specifications in the _map2D_ code. Do the serial and parallel results agree exactly without explicit dtype specification? Try removing the data type specifications from the signature as well.

2. Time the execution of the serial and parallel implementations of `fArray2D()`. Characterize the acceleration due to parallelization for a range of array sizes.

3. What is the largest square block size for which you can execute `fArray2D()`? What CUDA limit do you run into? What error message is generated when the requested block is too large?

4. Add a signature to the `fKernel3D()` function decorator.

5. Time the execution of the serial and parallel implementations of `fArray3D()`. Characterize the acceleration due to parallelization for a range of array sizes.

6. What is the largest cubic block size for which you can execute `fArray3D()`? What CUDA limit do you run into? What error message is generated when the requested block is too large?

7. Experiment with execution parameter specifications that change the "aspect ratio" of your blocks (i.e. square vs. rectangular blocks). Can you detect any patterns about how aspect ratio changes affect kernel execution times?