# Parallel Patterns - Map

The simplest parallel pattern is probably map: the same operation (or function) is applied independently to every element of an input array. Examples of map include:
* vector scaling 
* vector addition

Map is an [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) problem: it is straightforward to parallelize, with no communication required between threads. In computer graphics, pixel shading and ray tracing are embarrassingly parallel because each pixel can be processed separately.
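
To make the pattern concrete before moving to the GPU, here is a minimal CPU-only sketch in plain Python. The helper names `scale` and `add` are hypothetical and exist only for this illustration. Each output element depends on the input elements at the same index and nothing else, which is the property that makes map easy to parallelize.

```python
# Map applies one function to each element independently, so the
# iterations could run in any order (or all at once) with the same result.

def scale(x, factor=2.0):
    """Per-element operation for vector scaling (hypothetical helper)."""
    return x * factor


def add(x, y):
    """Per-element operation for vector addition (hypothetical helper)."""
    return x + y


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

scaled = list(map(scale, a))   # vector scaling
summed = list(map(add, a, b))  # vector addition

print(scaled)
print(summed)
```

On a GPU, each call to `scale` or `add` would correspond to the work done by one thread.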

In [None]:
%%HTML
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vQbNSmPHSnKybcXyErXr9Tz2nqZf80bXCWvaL0x1GEq0MvsWgNJobxgw8Qp62hl_INBwdo2cnaAhVkn/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

## Kernel compilation in notebooks

PyOpenCL supports compiling and executing kernels from Jupyter notebooks, so a kernel can be edited and compiled directly in a notebook cell.

First, load the PyOpenCL IPython extension:

In [None]:
%load_ext pyopencl.ipython_ext

import os
os.environ["GPU_DEVICE_ORDINAL"] = "0"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

To use the notebook support, the context must be named `ctx` or `cl_ctx`.

In [None]:
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

devices = ctx.get_info(cl.context_info.DEVICES)
for d in devices:
    print(f"device={d}")

Compile the kernel directly from the notebook: put the `%%cl_kernel` magic on the first line of the cell and pass any required [compilation options](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clBuildProgram.html) with `-o`.

It is sufficient to declare the kernel in a Jupyter notebook cell. Note that executing such a cell takes slightly longer than just declaring a string.

In [None]:
%%cl_kernel -o "-cl-fast-relaxed-math"

__kernel void multiply_vector(__global float *buffer_gpu, int multiplier)
{
  int gid = get_global_id(0);
  buffer_gpu[gid] = buffer_gpu[gid] * multiplier;
}

## Handling arbitrary dataset sizes

The work group size applies to all work groups: every work group must contain the same number of threads. As a result, the total number of threads does not always match the size of the dataset being processed.

In [None]:
%%HTML
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRpVhN7h5X-1EFA1wH7ipQJcbBw4iWp4Q65mklz_ZZWuaDrJJo8C0Y_4L4QmPzTsQ4TBPcDOGVkwX3x/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

### Dataset smaller than number of threads

To handle a dataset with fewer elements than threads, we can:
* pass the number of elements in the dataset to the kernel
* add a bounds check to the kernel.

For the vector scaling kernel, the code is:

In [None]:
%%cl_kernel -o "-cl-fast-relaxed-math"

__kernel void multiply_vector(__global float *buffer_gpu, int multiplier, int number_of_elements)
{
    const int gid = get_global_id(0);
    if (gid < number_of_elements) {
        buffer_gpu[gid] = buffer_gpu[gid] * multiplier;
    }
}
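
On the host side, the bounds check is typically paired with a global size rounded up to the next multiple of the work group size. The sketch below is plain Python and does not touch the GPU. The helper name `round_up_global_size` and the example sizes are assumptions made for illustration only.

```python
def round_up_global_size(number_of_elements, local_size):
    """Return the smallest multiple of local_size that is >= number_of_elements.

    Hypothetical helper: not part of PyOpenCL.
    """
    # Ceiling division gives the number of work groups needed.
    num_groups = (number_of_elements + local_size - 1) // local_size
    return num_groups * local_size


n = 1000          # dataset size (assumed for the example)
local_size = 64   # work group size (assumed for the example)

global_size = round_up_global_size(n, local_size)
idle_threads = global_size - n  # threads that fail the `gid < number_of_elements` check

print(global_size, idle_threads)
```

The surplus threads in the last work group are the ones the `if (gid < number_of_elements)` check keeps from touching memory outside the buffer.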

### Dataset bigger than the number of threads

With larger datasets, handling one data cell per thread requires creating a huge number of work groups. This is not always the most efficient way to solve a problem.

A common way to reduce the number of threads is to keep a thread alive after it has processed a cell. When the thread finishes one cell, it moves on to process another element.

There are various ways to divide the dataset among the threads. An efficient solution from the memory-access point of view is the so-called **grid-stride loop**: a for-loop in which each iteration advances the index by the size of the grid.

The grid size can be queried or calculated in the following ways:
* [get_global_size](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/get_global_size.html)
* [get_local_size](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/get_local_size.html) x [get_num_groups](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/get_num_groups.html) - multiply the work group size by the number of work groups. Some frameworks do not expose the global size directly, so the number of threads in the grid has to be calculated this way.

In [None]:
%%cl_kernel -o "-cl-fast-relaxed-math"

__kernel void multiply_vector(__global float *buffer_gpu, int multiplier, int number_of_elements)
{
    const int gid = get_global_id(0);
    const int grid_stride = get_global_size(0);

    for (int i = gid; i < number_of_elements; i += grid_stride) 
    {
        buffer_gpu[i] = buffer_gpu[i] * multiplier;
    }
}
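
To see which elements each thread visits, the loop above can be simulated on the CPU. This is a plain-Python sketch. The function name `grid_stride_indices` and the tiny sizes are assumptions chosen to keep the output readable.

```python
def grid_stride_indices(gid, global_size, number_of_elements):
    """Indices visited by the thread with global id `gid` in a grid-stride loop.

    Mirrors `for (int i = gid; i < number_of_elements; i += grid_stride)`.
    Hypothetical helper used only for this illustration.
    """
    return list(range(gid, number_of_elements, global_size))


number_of_elements = 10  # dataset size (assumed)
global_size = 4          # total number of threads in the grid (assumed)

assignment = {
    gid: grid_stride_indices(gid, global_size, number_of_elements)
    for gid in range(global_size)
}

for gid, indices in assignment.items():
    print(f"thread {gid} -> elements {indices}")

# Every element is processed exactly once.
covered = sorted(i for indices in assignment.values() for i in indices)
```

In each iteration, neighbouring threads touch neighbouring elements (0, 1, 2, 3, then 4, 5, 6, 7, and so on), which is the access pattern that makes this loop friendly to memory coalescing.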

In frameworks where the `get_global_id` or `get_global_size` intrinsic functions are not available, the global thread index can be calculated as follows (the first four variables may be built-in ones, as in CUDA):

In [None]:
%%cl_kernel -o "-cl-fast-relaxed-math"

__kernel void multiply_vector(__global float *buffer_gpu, int multiplier, int number_of_elements)
{
    const uint3 threadId = (uint3)(get_local_id(0), get_local_id(1), get_local_id(2));
    const uint3 groupId = (uint3)(get_group_id(0), get_group_id(1), get_group_id(2));
    const uint3 groupDim = (uint3)(get_local_size(0), get_local_size(1), get_local_size(2));
    const uint3 gridDim = (uint3)(get_num_groups(0), get_num_groups(1), get_num_groups(2));
    
    const int gid = threadId.x + groupDim.x * groupId.x;
    const int grid_stride = gridDim.x * groupDim.x;
    
    for (int i = gid; i < number_of_elements; i += grid_stride)
    {
        buffer_gpu[i] = buffer_gpu[i] * multiplier;
    }
}
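
The index arithmetic used in the kernel can be checked on the CPU. The following plain-Python sketch enumerates every (group, local) pair for a one-dimensional grid and confirms that `local_id + local_size * group_id` yields each global index exactly once. The sizes are arbitrary assumptions.

```python
local_size = 4   # threads per work group (assumed)
num_groups = 3   # number of work groups (assumed)

# Same formula as `threadId.x + groupDim.x * groupId.x` in the kernel.
global_ids = [
    local_id + local_size * group_id
    for group_id in range(num_groups)
    for local_id in range(local_size)
]

# Same formula as `gridDim.x * groupDim.x` in the kernel.
grid_stride = num_groups * local_size

print(global_ids)
print(grid_stride)
```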

## Exercise - vector addition

Now that you know how to handle datasets of varying sizes on the GPU, you are ready to write a proper GPU application.

Your task is to accelerate vector addition with PyOpenCL.

The code below only multiplies a single vector by a constant (5). Extend the code to:
1. Fix the code to calculate the following formula:

$$
res = 2 a + b
$$

    - accept two input arrays 'a' and 'b'
    - write the results so that the input arrays are not overwritten

2. Improve the code to:

    - handle arbitrary vector sizes
    - reuse the threads in a work group so that you don't have to create as many threads as input items
    - avoid reading data out of bounds
    - set the buffers' access flags correctly - read only, write only etc.

3. Improve performance

    - experiment with the work group size and the number of work groups
    - take care of memory coalescing
    - measure the GPU execution time
    - measure the CPU execution time

To create an uninitialized output buffer, pass its size in bytes, which can be obtained as follows:

In [None]:
flags = cl.mem_flags
some_array = np.zeros(8).astype(np.float32)
result_gpu = cl.Buffer(ctx, flags.WRITE_ONLY, some_array.nbytes)

CPU profiling helper function.

In [None]:
from time import time

def profile_cpu(function, n, *args):
    """
    This function profiles other functions:
        function - any function to be profiled
        n - number of times a function will be rerun - the more times you run it the more stable results you get,
            but the more time it will take to profile
        args - variable list of arguments that will be passed to a profiled function
    """
    times = np.zeros(n)
    total_ms = 0
    value = 0
    for i in range(n):
        start = time()
        value = function(*args)
        end = time()
        elapsed = (end - start) * 1e3
        times[i] = elapsed

    avg_ms = np.mean(times)
    median_ms = np.median(times)
    variance = np.var(times)
    std = np.std(times)

    print(f"{function.__name__} took on average {avg_ms:.4f} ms, with median {median_ms:.4f} ms, variance {variance:.4f} ms^2, standard deviation {std:.4f} ms.")
    return value, avg_ms

GPU profiling helper function.

In [None]:
def profile_gpu(function, n, queue, global_size, local_size, *args):
    times = np.zeros(n)
    # Two warm-up runs that are not included in the measurements.
    function(queue, global_size, local_size, *args).wait()
    function(queue, global_size, local_size, *args).wait()
    
    for i in range(n):
        e = function(queue, global_size, local_size, *args)
        e.wait()
        elapsed = (e.profile.end - e.profile.start) * 1e-6
        times[i] = elapsed

    avg_ms = np.mean(times)
    median_ms = np.median(times)
    variance = np.var(times)
    std = np.std(times)

    print(f"{function.function_name} took on average {avg_ms:.4f} ms, with median {median_ms:.4f} ms, variance {variance:.4f} ms^2, standard deviation {std:.4f} ms.")
    
    return avg_ms

Kernel code

In [None]:
%%cl_kernel -o "-cl-fast-relaxed-math"

__kernel void compute_linear_equations_gpu(__global int *a)
{
    const uint gid = get_global_id(0);

    a[gid] = 5 * a[gid];
}

The kernel name is injected into the user namespace, so the kernel can be called simply by its name.

In [None]:
N=np.int32(10000000)

h_a = np.arange(0, N).astype(np.int32)
h_b = np.arange(1, N + 1).astype(np.int32)

d_a = cl.Buffer(ctx, flags.READ_ONLY | flags.COPY_HOST_PTR, hostbuf=h_a)

local_work_size = (32, )
global_work_size = (32, )

gpu_time_ms = profile_gpu(compute_linear_equations_gpu, 
                          20, 
                          queue, 
                          global_work_size, 
                          local_work_size,
                          d_a
                          )

h_res = np.zeros(N).astype(np.int32)

_ = cl.enqueue_copy(queue, h_res, d_a)

def compute_linear_equations_cpu(a, b):
    return 2 * a + b

numpy_res, numpy_avg_ms = profile_cpu(compute_linear_equations_cpu, 20, h_a, h_b)
np.testing.assert_equal(h_res, numpy_res)

Refer to the [solution](./map_solution.txt) if you get stuck.