# Ch. 2 - Function Evaluation and the Map Pattern

For our first software application, or "app", we pick something that
is straightforward in terms of both explanation and implementation.
The basic task is to evaluate a function over an array of evenly spaced
points. Performing a particular operation on every element of an array
is a common computing task referred to as the "map" pattern. Given
that we are working with a map pattern, we use "map" as the name
for this first example app. To be concrete, we choose a simple trigonometric
polynomial function

$$f(x) = 1 - 2 \sin^2(\pi x) $$ (2.1)

and evaluate it at N points equally spaced on the interval $0 \leq x \leq 1$.

## 2.1 - Serial Implementation: The Map App

We start with a serial implementation using some Python essentials
and, in the following section "Parallelizing the Map App", we show
how to efficiently convert to a parallel implementation.

As projects multiply and grow, it helps to organize code consistently. We offer
a few suggestions that should make it easier for you to keep track of
your code and easier for us to refer to specific files and line numbers:

- We find it convenient to create a directory where Python projects
can be stored in sub-directories. We refer to this directory by the
name _apps_, although the full path on your system will be longer
than that.

- For each app, we create a sub-directory with the name of the app.
We refer to the map app as _map_, so we refer to the directory
where the code files are stored as _apps/map/_.

- Within the sub-directory for each app, we create a _main.py_ file. The
particular version for the map app can be referred to uniquely as
_apps/map/main.py_.

- Each app typically includes other files besides _main.py_. In this case,
we create a _serial.py_ file to hold definitions of the functions called within `main()` in this serial implementation of the app. That file
can be uniquely referred to as _apps/map/serial.py_.

Below is the listing of the file _apps/map/serial.py_.

```
File: serial.py
01: import math
02: import numpy as np
03: 
04: PI = np.pi
05: 
06: def s(x0):
07: 	return (1.-2*math.sin(PI*x0)**2)
08: 
09: def sArray(x):
10: 	n = x.shape[0]
11: 	f = np.zeros(n)
12: 	for i in range(n):
13: 		f[i] = s(x[i])
14: 	return f
```

$$ \text{Listing 2.1: }  apps/map/serial.py $$


The first step in the serial implementation involves creating a Python
function equivalent to Equation 2.1, which we call `s()`. As we
head toward parallelism, we have a particular interest in operating
on arrays of data, so our second function, `sArray()`, takes in the N
points on the interval $0 \leq x \leq 1$ and passes each of them to `s()`
to generate the corresponding array of function values. The complete
code for both these functions can be seen in Listing 2.1.

The function `s()` is defined on Lines 6-7. It takes in a single value and evaluates the given function, for
our purposes Equation 2.1. Lines 9-14 contain the function `sArray()`, which takes in a numpy array containing the N interval
points and outputs an array containing the N computed function values.
On Line 10 we start by determining the size of the input array using
the `shape` attribute of the numpy array. (See the numpy documentation for more details about array attributes and methods.) The input array size determines both the range of indices to loop over and how large the output array
needs to be. On Line 11 an output array of zeros is created. Lines
12-13 contain the `for` loop, which executes over the index values
$[0, 1, 2, \ldots, N-1]$. (If `range` is given only an end value, the default starting value is 0.) On each pass through the loop, `s()` is called with the $i^{th}$ input value and the function value is stored as the $i^{th}$ element of the output array, `f[i]`. Finally, Line 14 returns the `f` array containing the evaluated function values.
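
The explicit loop is one way to express the map pattern. As a minimal sketch using only the standard library (no numpy), Python's built-in `map()` applies the same function to every element. The helper name `evenly_spaced()` and the small value `N = 5` are illustrative assumptions, not part of the app's code.

```python
import math

def f(x):
    """Equation 2.1: f(x) = 1 - 2 sin^2(pi x)."""
    return 1.0 - 2.0 * math.sin(math.pi * x) ** 2

def evenly_spaced(n, a=0.0, b=1.0):
    """Return n points from a to b inclusive (hypothetical helper; assumes n >= 2)."""
    step = (b - a) / (n - 1)
    return [a + i * step for i in range(n)]

N = 5                      # small illustrative size, not the app's N = 64
xs = evenly_spaced(N)
fs = list(map(f, xs))      # the "map" pattern: the same operation on every element

for x, fx in zip(xs, fs):
    print(f"x = {x:.2f}  f(x) = {fx:.3f}")
```

Each output element depends only on the matching input element, which is the property that makes this pattern a natural candidate for parallel execution.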

To complete the `map` app, we need to create _apps/map/main.py_, which contains the definition of a `main()` function that controls execution.
An implementation of _apps/map/main.py_ is shown in Listing
2.2.

```
File: main.py
01: import math
02: import numpy as np
03: import matplotlib.pyplot as plt
04: 
05: N = 64
06: 
07: def main():
08: 	x = np.linspace(0, 1, N, dtype=np.float32)
09: 	
10: 	from serial import sArray
11: 	f = sArray(x)
12: 	plt.plot(x, f, 'o', label='1-2*sin(PI*x)**2')
13: 	plt.legend()
14: 	plt.show()
15: 
16: if __name__ == '__main__':
17: 	main()

```
$$ \text{Listing 2.2: }  apps/map/main.py $$

Lines 1-3 import the necessary libraries (`numpy` is imported for array functionality and
`matplotlib.pyplot` is imported for plotting), and Line 5 assigns a value to `N`, which sets the array size. Lines 7-14 define the `main()` function that controls execution. To create an array
of N input values, on Line 8 we use numpy’s `linspace()` function to produce an array of N values uniformly spaced on $[0,1]$. Line 10
imports our definition of `sArray()` from _apps/map/serial.py_, and
Line 11 calls `sArray()` with the array `x` and stores the results in `f`.
Lines 12-14 plot and display `f`, the evaluated function array, and
Lines 16-17 call for execution of `main()`. From `main()` we can call
`sArray()` to execute `s()` on the desired array of values. After
execution, the script should produce a pop-up plot of the computed
values of `s()`.

> `__name__` is referred to as a __dunder__ (double underscore) variable.
Whenever you run a Python
script, it is assigned a name stored in
the `__name__` variable. Whichever
file is being explicitly executed by
the interpreter is given the name
`__main__`. The conditional `if __name__ == '__main__':` prevents
`main()` from running if the file
were to be imported by a different
module. This is not required, but it
is a good habit to always include it.

## 2.2 - Parallelizing the Map App

The basic plan for parallelization is to make minimal changes to _apps/map/main.py_, and focus on an alternative CUDA-powered, parallel implementation
of the `sArray()` function in _apps/map/parallel.py_.

Listing 2.3 shows the code for the parallelized version _apps/map/parallel.py_, and Listing 2.4 shows a side-by-side comparison with _apps/map/parallel.py_ on the left and _apps/map/serial.py_ on the right. From this side-by-side comparison it is apparent that not too many changes are needed to convert the serial implementation to the parallel implementation.
Let’s walk through the changes line by line to see what is needed to take advantage of GPU-based parallelism.

> At this point, it might be helpful to open an extra copy of the notebook to keep Listing 2.4 visible for reference while reading through the line-by-line description of the code.

```
File: parallel.py
01: import math
02: import numpy as np
03: from numba import jit, cuda, float32
04: 
05: PI = np.pi
06: TPB = 32
07: 
08: @cuda.jit(device = True)
09: def s(x0):
10: 	return (1.-2.*math.sin(PI*x0)**2)
11: 
12: @cuda.jit #Lazy compilation
13: #@cuda.jit('void(float32[:], float32[:])') #Eager compilation
14: def sKernel(d_f, d_x):
15: 	i = cuda.grid(1)
16: 	n = d_x.shape[0]	
17: 	if i < n:
18: 		d_f[i] = s(d_x[i])
19: 
20: def sArray(x):
21: 	n = x.shape[0]
22: 	d_x = cuda.to_device(x)
23: 	d_f = cuda.device_array(n, dtype = np.float32) #need dtype spec for eager compilation
24: 	blockDims = TPB
25: 	gridDims = (n+TPB-1)//TPB
26: 	sKernel[gridDims, blockDims](d_f, d_x)
27:
28: 	return d_f.copy_to_host()
```
$$ \text{Listing 2.3: }  apps/map/parallel.py $$


```
File: parallel.py									File: serial.py
01: import math										import math
02: import numpy as np								import numpy as np
03: from numba import jit, cuda, float32
04: 
05: PI = np.pi										PI = np.pi
06: TPB = 32
07: 
08: @cuda.jit(device = True)
09: def s(x0):										def s(x0):
10: 	return (1.-2.*math.sin(PI*x0)**2)				return (1.-2.*math.sin(PI*x0)**2)
11: 
12: @cuda.jit #Lazy compilation
13: #@cuda.jit('void(float32[:], float32[:])') 
14: def sKernel(d_f, d_x):
15: 	i = cuda.grid(1)
16: 	n = d_x.shape[0]	
17: 	if i < n:
18: 		d_f[i] = s(d_x[i])
19: 
20: def sArray(x):									def sArray(x):
21: 	n = x.shape[0]									n = x.shape[0]
22: 	d_x = cuda.to_device(x)						
23: 	d_f = cuda.device_array(n,dtype=np.float32) 	f = np.zeros(n)
24: 	blockDims = TPB
25: 	gridDims = (n+TPB-1)//TPB
26: 	sKernel[gridDims, blockDims](d_f, d_x)			for i in range(n):
27:															f[i] = s(x[i])
28: 	return d_f.copy_to_host()						return f

```

$$ \text{Listing 2.4: Side by side comparison of } \\
apps/map/parallel.py \text{ and }
apps/map/serial.py $$


In Listing 2.3, Line 3 contains the first change between the serial and parallel versions. An additional import statement is added where the
cuda module is imported from the numba library. This provides the essential support for parallelization.
Line 6 defines `TPB`, a global variable that will be used when the kernel is launched to specify the number of threads in each block.

> `TPB` stands for __Threads Per Block__.
The value of `TPB` can have an effect
on performance, but choosing an
optimal value is hardware dependent.
A reliable rule of thumb is to make `TPB`
a multiple of 32, which corresponds
to the size of a warp. While we are
working with small arrays, we choose
`TPB = 32`. This variable will be used in the parallel implementation of `sArray()`.

The most significant changes made in the conversion from serial
to parallel involve the introduction of the `sKernel()` function and the modification of `sArray()`. While the serial implementation of `sArray()` consists of a loop that iterates over the entries in the array `x`, calling
`s()` for each element, the parallel version replaces the serial loop structure with a new version that we implement in two parts:

- A __kernel__ function, or global function, that executes the code inside
the loop in parallel on the GPU. Here the kernel function is called `sKernel()` and is defined on Lines 12-18. Kernel functions are called from the host (CPU) and execute on the device (GPU). In CUDA terminology, such a function (which is neither exclusively a host function nor a device function) is classified as global.
- A __wrapper__, or launcher function, that does a bit of necessary bookkeeping and calls for execution of the kernel. Here the wrapper function is the new version of `sArray()` which is defined on Lines 20-28.

Let’s start by looking at the details of the wrapper function. The function arguments of `sArray()` remain unchanged: both the serial and parallel versions take in an array of values. The parallel version of `sArray()` is derived from the serial version by replacing the loop with some array definitions and a kernel call.

On Line 21, we use the `shape` attribute of the numpy array `x` to obtain the number of entries in the input array. The value `n = x.shape[0]` is used both to determine the size of the output array and to calculate the number of blocks required to cover the array of data. This block count is often abbreviated `BPG` (short for __Blocks Per Grid__); in the listing it is stored in `gridDims`.

Since a kernel function cannot directly read its input from, or write its output to, a host array (on the CPU side of the PCIe bus), we create "mirror" arrays on the device, called __device arrays__. On Lines 22-23 the device arrays `d_x` and `d_f` are created.

> We employ a common notation and
use the prefix `d_` to distinguish
a device array from a host array.

On Line 22 we copy the `x` array, created on the host side, to `d_x` on the device side using `d_x = cuda.to_device(x)`. Line 23 creates an empty device array `d_f` that can store `n` 32-bit floating point entries, using the built-in numba function `cuda.device_array()`.

> `d_f` can also be created by calling
`cuda.to_device()` on an empty or
initialized array created on the host side
but this requires more steps and is
slower. Directly creating a device array of zeros is not yet supported in numba, so to be really sure that your device array does not include any random bits, it may be worthwhile to create a device array by copying an initialized host array.

The final step before launching the kernel is to establish the kernel’s __execution configuration parameters__ to specify the number of threads and blocks in the computational grid. We take a typical approach by directly specifying the number of threads in each block (by assigning a constant value to `TPB`) and then determining the number of blocks necessary to cover every array element. A common trick to make sure that the grid is fully covered involves computing `(n + TPB - 1)//TPB` which ensures that the integer division always rounds up. 
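
To see why the rounding trick works, here is a small pure-Python sketch (no GPU required) that compares `(n + TPB - 1)//TPB` with `math.ceil(n / TPB)` for a few array sizes. The helper name `blocks_needed()` and the chosen sizes are illustrative assumptions rather than part of the app.

```python
import math

TPB = 32  # threads per block, matching the value used in the listing

def blocks_needed(n, tpb=TPB):
    """Smallest number of blocks of size tpb that covers n elements (hypothetical helper)."""
    return (n + tpb - 1) // tpb

for n in (1, 31, 32, 33, 64, 65):
    bpg = blocks_needed(n)
    # The rounded-up integer quotient agrees with the ceiling of the true quotient
    assert bpg == math.ceil(n / TPB)
    print(f"n = {n:3d}: {bpg} block(s), {bpg * TPB - n} surplus thread(s)")
```

Whenever `n` is not a multiple of `TPB`, the last block contains surplus threads with no array element to process, which is why the kernel needs a bounds check.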

> `//` is the floor division operator in
Python 3; applied to positive integers, it returns the integer quotient and discards the remainder.

On Lines 24-25, the variables `blockDims` and `gridDims`, which appear in the kernel call, are assigned the desired execution parameter values: `TPB` threads per block and the computed number of blocks (`BPG`).

> This is an optional notational convenience that gives kernel calls a more uniform appearance. It is perfectly acceptable for `TPB` and `BPG` to appear directly in the execution parameters of your kernel calls.

Line 26 calls for execution of the kernel, which introduces a bit of new syntax. As with all function calls, it starts with the function name and ends with parentheses containing a comma-separated list of arguments.
The new element is in the middle: square brackets containing two entries separated by a comma:

```
kernel[gridDim, blockDim](args)
```

> The values of `gridDim` and
`blockDim` will be tuples if the kernel
is launched on a grid with more than one
dimension. There can also be optional
third and fourth parameters to
specify a computational stream and an
allocation of dynamic shared memory. More on that later as needed...


The entries in the square brackets specify the execution configuration as follows:

- The first entry specifies the number of blocks in the computational
grid.

- The second entry specifies the number of threads in each block.

After the kernel has computed the output, the data needs to be copied back to host memory. Thus the `return` statement on Line 28 calls `copy_to_host()` so that the computed values stored in the device array `d_f` get copied back to the host side for further use (e.g. plotting).

Let’s now look at the details of `sKernel()`, which specifies the computation to be carried out by a single thread. The code for a kernel function looks much like the code for any function in that it has a definition line followed by an indented code block, but with a few
notable changes. On Line 12 `sKernel()` is preceded by a decorator, signified by the `@` symbol. This decorator, `@cuda.jit`, indicates that the function `sKernel()` is a global function or kernel. An alternative version of the decorator with an optional argument specifying a __signature__ is included as a comment on Line 13.

The simpler version (just `@cuda.jit()`) without the optional signature specification leads to __lazy or just-in-time (JIT) compilation__. Since the compiler does not know the data types _a priori_, it waits until the kernel is actually called, infers the data types, and compiles "just in time" to execute. When the decorator includes the signature specifying the input and output data types, then __eager or ahead-of-time (AOT) compilation__ occurs, and the kernel code is compiled when the app is started rather than waiting for a kernel call to occur.

> __Timing note:__ The first execution of code with lazy compilation will typically have a much longer execution time because it includes the kernel compilation time, so be sure to run lazy compilation code multiple times to get timings that reflect actual execution time.

In this case, the signature `void(float32[:], float32[:])` specifies that there are two arguments, each a 1-dimensional array of 32-bit floats, and the return type is `void`. Note that all kernels have return type `void` because kernel functions, while launched from the host, run on the device (GPU) and cannot return a value directly to the host. The argument list needs to include an array where the results can be stored and then copied back to the host as needed. Note that the __wrapper__ function `sArray()` creates device arrays to store the input and output values that appear as arguments on Line 26 where the kernel is called for execution.

> Reminder: ___Kernel functions cannot return values, so provide an argument for storing results.___

With the loop replaced by the kernel call, we still need some sort of index to uniquely identify each iteration. As seen on Line 15, CUDA provides built-in index and dimension variables, accessible within a global function, that replace the traditional loop index. Each index value is associated with a computational block and
thread within the computational grid:

- `blockDim` gives the number of threads in a block.

- `blockIdx` gives the index of a block in the grid.

- `threadIdx` gives the index of a thread in the block.

> CUDA supports grids up to dimension 3, so the built-in index variables are equivalent to Python tuples with 3 entries designated by appending `.x`, `.y`, and `.z`. Here our arrays are 1D, we launch a 1D grid, and we are concerned solely with the `.x` components. By default, numba treats a single integer execution parameter as a 3-tuple whose `.y` and `.z` extents are equal to one, so integers suffice to specify execution parameters for 1D grids.

The index for a thread is the number of threads in all preceding blocks, which is the block index `blockIdx.x` multiplied by the size of a block,
`blockDim.x`, plus the index of the thread within its own block, `threadIdx.x`.
This leads to the canonical formula for computing the index for a given
thread within a one-dimensional computational grid:

```
i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
```

This expression occurs so routinely that Numba offers a convenient
shorthand:

```
i = cuda.grid(1)
```

> The argument of `grid()` indicates the dimension of the computational
grid. We have a one-dimensional
grid so the argument is 1.

The index `i`, computed from the formula or from using `cuda.grid(1)`, acts as the replacement for the index provided by the `for` loop in the serial version of this code.

Line 17 performs an essential bounds check to ensure that no thread tries to access memory that is "out of bounds"; i.e. with an index beyond the dimensions of the array. If the data array size is not an integer multiple of `TPB`, the "last" block (with the largest `blockIdx.x` value) will include threads whose index values lie beyond the bounds of the array. The conditional `if i < n:` ensures that only threads with index values corresponding to input and output array elements proceed to access an array element.
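
The index formula and the bounds check can be explored without a GPU. The sketch below emulates a 1D launch sequentially in plain Python; the function name `emulate_launch()` and the sizes `n=70`, `tpb=32` are illustrative assumptions. On a real device the threads run in parallel rather than in this nested-loop order.

```python
def emulate_launch(n, tpb):
    """Sequentially emulate a 1D launch (hypothetical helper).

    Returns the indices that pass the bounds check and the number of
    threads that fail it.
    """
    grid_dim = (n + tpb - 1) // tpb           # blocks in the grid, rounded up
    active, idle = [], 0
    for block_idx in range(grid_dim):          # plays the role of blockIdx.x
        for thread_idx in range(tpb):          # plays the role of threadIdx.x
            i = block_idx * tpb + thread_idx   # blockIdx.x * blockDim.x + threadIdx.x
            if i < n:                          # bounds check, as on Line 17
                active.append(i)
            else:
                idle += 1
    return active, idle

active, idle = emulate_launch(n=70, tpb=32)
print(len(active), "threads did work;", idle, "threads were masked off")
```

Every array index is visited exactly once, and only threads in the last block are masked off by the bounds check.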

The heart of each thread’s computing task appears on Line 18 with the call to `s()`. This function
call is exactly the same as the function call within the loop of the serial `sArray()` function, but with the index `i` (which previously arose as the iterator in the `for` loop) determined by the built-in index variables by<br>`i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x`<br> abbreviated in numba as `cuda.grid(1)`.

The final change occurs on Line 8, with the addition of a decorator above `s()`. The function definition itself is unchanged but the decorator, `@cuda.jit(device = True)`, identifies `s()` as a device function so it gets compiled to run on the GPU. 

> The definition of any function, like `s()`, that is to be called from the device (i.e. by other functions executing on the GPU) must be preceded by the decorator `@cuda.jit(device=True)` that identifies __device functions__ for compilation to execute on the GPU. Device functions cannot be called from the CPU, but functions can be identified as "device/host" functions so that separate versions are compiled to run on CPU and GPU.

You should have already executed the serial version of the _map_ app, and
you can now execute the parallel version by changing `serial` to `parallel` on Line 10 of _apps/map/main.py_ (so the parallel version of `sArray()` is imported instead of the serial version imported previously).

The parallel results should coincide with the serial results. You can check this visually by running both the serial and parallel code and
looking at the plotted output. To validate further you can look to the suggested projects and create a comparison file to better analyze the differences, if any, between the two versions.

> It is always good practice to verify
that your parallel code produces the
same result as the serial code it is
meant to replace and "accelerate". In
fact, it is good practice to perform such
verification tests for various problem
sizes and on the various types of
hardware where your app is expected
to run. ___Reducing the computation time but producing the wrong answer does not count as accelerating your app!___
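
When comparing results numerically, exact equality is usually too strict, because the parallel version stores its output as 32-bit floats. The pure-Python sketch below illustrates a tolerance-based comparison. It does not call the app's code: the `stored32` list is a stand-in that assumes the only difference is rounding to float32, and the helper names and the tolerance `TOL` are illustrative assumptions.

```python
import math
import struct

def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack('f', struct.pack('f', value))[0]

def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two equal-length sequences."""
    return max(abs(p - q) for p, q in zip(a, b))

N = 64
xs = [i / (N - 1) for i in range(N)]
reference = [1.0 - 2.0 * math.sin(math.pi * x) ** 2 for x in xs]  # 64-bit values
stored32 = [to_float32(v) for v in reference]  # stand-in for float32 output

TOL = 1e-6  # assumed tolerance, a little above float32 rounding error for values in [-1, 1]
diff = max_abs_diff(reference, stored32)
print("max difference:", diff, "-> agree within tolerance:", diff <= TOL)
```

A suitable tolerance depends on the data type and the magnitude of the values, so treat `TOL` as a starting point to adjust rather than a fixed rule.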

At this point you should have a working version of both serial and parallel versions of a Python _map_ app, and you are ready to move on to Homework 2 and future notebooks.

## 2.3 - Suggested projects

1. Create a file _apps/map/compare.py_ that calls both the serial and parallel versions of `sArray()` and plots both
results and the difference between them. Does the parallel implementation
reproduce the results of the serial implementation?

2. Code up `cKernel()` and `cArray()` (analogous to `sKernel()` and
`sArray()`) for the function $g(x) = \cos(2 \pi x)$. Evaluate your code
and describe the relation between the values of $f(x)$ and $g(x)$.
