# Introduction to PyOpenCL

### Who makes GPUs?

Many of us are aware that the following vendors produce graphics cards for desktop PCs and laptops:
* Intel
* Nvidia
* AMD

There is a growing and interesting market in mobile GPUs in our phones represented by the following brands:
* Adreno - Qualcomm
* Mali - ARM
* PowerVR - Imagination

### What tools are there to program GPUs?

There are many frameworks which enable developers to program GPUs. Some of the most popular are listed below:

* OpenCL by Khronos organization (cross platform)
* Vulkan by Khronos - getting increasingly popular (cross platform)
* DirectCompute - part of DirectX by Microsoft (Windows only)
* CUDA by NVidia (NVidia only)
* Metal by Apple (Mac only)

Some are available only on specific hardware (like CUDA) or operating systems (like Metal or DirectX).

### Why OpenCL?

OpenCL is available on most platforms: it runs on all three big desktop platforms as well as on Android devices. It is well established in the market, widely used and well documented.

Some useful vendor extensions are available only through OpenCL. 

### Why Python for OpenCL?
Python is a great language for learning because of its simplicity. The goal of this workshop is to focus your attention on GPU concepts and algorithms rather than spending hours setting up complicated low-level code.

PyOpenCL is a complete Python wrapper of the OpenCL API that makes it easier to use. Underneath it uses C++, so there is no speed compromise. It is released under the MIT License.

## Host code and device code

We will dive straight into the code skipping the litany of reasons why GPUs should be used for General Purpose computing.

When working with GPUs we write two types of code:
* host code - it can be thought of as boilerplate code and is executed on the CPU. In this workshop it is the Python code; with other GPU frameworks like CUDA, OpenCL or DirectX it will most likely be C++. Host-side CPU code sets up the necessary buffers, copies data to GPU memory and back, and tells the GPU what to do and in which order.
* device code - GPU code - things that are executed on the GPU. These are often relatively small pieces of code which are executed many times but with different data. The functions executed on the GPU are called kernels or shaders.

Below are slides with a very high-level view of the relation between CPU and GPU work.

In [None]:
%%HTML
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vS2_z7au4f9L0BmsYorl_CEdFP-ec6pTCqREdGiRqCvpeNuTG8bm7VHMEfU2tYtiA2cNEkIedixJIqk/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

## BeeHive specific setup

If you are working on a BeeHive cluster and share a GPU with other people, the lines below are needed to make the notebooks work.

In [None]:
import os
os.environ["GPU_DEVICE_ORDINAL"] = "0"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

## Let's start coding

The first thing to do is to import the PyOpenCL library. If the import succeeds, it confirms that PyOpenCL is installed on your machine.

In [None]:
import pyopencl as cl
import numpy as np

%load_ext pyopencl.ipython_ext

Let's learn a bit more about the platform we are going to work on. We will use [get_platforms](https://documen.tician.de/pyopencl/runtime_platform.html?highlight=get_platforms#pyopencl.get_platforms) function to get the list of available OpenCL platforms. Examples of platforms include:
* Nvidia CUDA
* AMD Accelerated Parallel Processing
* Intel

In [None]:
for p in cl.get_platforms():
    print(f"profile = {p.profile}, name = {p.name}, version = {p.version}, vendor = {p.vendor}")
    print('\n\tPlatform Extensions: ')
    for e in p.extensions.split():
        print(f"\t\t{e}")
        
    devices = p.get_devices(cl.device_type.ALL)

    print('\n\tAvailable devices: ')
    if not devices:
        print('\t\tNone')

    for dev in devices:
        indent = '\t\t'
        print(indent + '{} ({})'.format(dev.name, dev.vendor))

        indent = '\t\t\t'
        flags = [('Version', dev.version),
                 ('Type', cl.device_type.to_string(dev.type)),
                 #('Extensions', str(dev.extensions.strip().split(' '))),
                 ('Memory (global)', str(dev.global_mem_size)),
                 ('Memory (local)', str(dev.local_mem_size)),
                 ('Address bits', str(dev.address_bits)),
                 ('Max work item dims', str(dev.max_work_item_dimensions)),
                 ('Max work group size', str(dev.max_work_group_size)),
                 ('Max compute units', str(dev.max_compute_units)),
                 ('Driver version', dev.driver_version),
                 ('Image support', str(bool(dev.image_support))),
                 ('Little endian', str(bool(dev.endian_little))),
                 ('Device available', str(bool(dev.available))),
                 ('Compiler available', str(bool(dev.compiler_available)))]        
        for name, flag in flags:
            print(indent + '{0:<25}{1:<10}'.format(name + ':', flag))

You can also use 'clinfo' application to get detailed information about your platform.

In [None]:
!clinfo

### Create OpenCL context and command queue
A context is an abstraction in which we execute kernels; it isolates the resources you are using. It can contain many devices present in a system. The resources assigned to a context are released when the context is destroyed, for example when you exit the application.

We will use a somewhat strange function [create_some_context](https://documen.tician.de/pyopencl/runtime_platform.html?highlight=get_platforms#pyopencl.create_some_context) which, as the name suggests, "somehow" creates a context. How it is done and which devices it includes depends on your setup: PyOpenCL may ask you to choose interactively, or it may take the choice from the `PYOPENCL_CTX` environment variable. In the case of NVidia it creates a context with only one GPU, even if there are multiple GPUs present.

A command queue is a queue to which you schedule different tasks. In the simple case there will be one queue. To overlap some work we can create multiple queues. The GPU picks up the work from these queues.

In [None]:
ctx = cl.create_some_context()    
queue = cl.CommandQueue(ctx)

After creating context we can query how many devices are present.

In [None]:
devices = ctx.get_info(cl.context_info.DEVICES)
for d in devices:
    print(f"device={d}")

To create a context in a predictable way we can use the Context constructor, which takes these arguments:
* [device type](https://documen.tician.de/pyopencl/runtime_const.html#device_type) - CPU, GPU, ALL - OpenCL can also be run on FPGAs
* properties - where we can pass a platform

How many devices are available now?

In [None]:
platform = cl.get_platforms()[0]

ctx = cl.Context(
    dev_type=cl.device_type.ALL, 
    properties=[(cl.context_properties.PLATFORM, platform)])    

queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
    
devices = ctx.get_info(cl.context_info.DEVICES)
for d in devices:
    print(f"device={d}")

## Hello GPU World

A kernel is a function that will run on the GPU. It will be executed many times with different data in each thread.

The kernel below is named 'hello_gpu'. It takes no arguments.

The [__kernel](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/functionQualifiers.html) qualifier indicates that a function can be executed on an OpenCL device.

The return type of OpenCL kernels has to be void. In practical applications the kernel is executed many times, so there is no single value to return from the function itself. Data can instead be returned through buffers, which we will cover later.

In [None]:
%%cl_kernel -o "-cl-fast-relaxed-math"

__kernel void hello_gpu()
{
    
}

[Calling a kernel function](https://documen.tician.de/pyopencl/runtime_program.html#pyopencl.Program.kernel_name) on a program object schedules the execution of the kernel on the GPU. Work is scheduled on a queue that is passed as an argument. Processing on a GPU happens asynchronously with respect to the CPU code.

The call takes the following arguments:
* queue - on which the work will be scheduled
* global work size - for now we pass a tuple of 1
* local work size - for now we pass a tuple of 1
* arguments that will be passed to kernel function - in this case there are no extra arguments

The return value is an event associated with the work. After scheduling the work on the GPU we wait for the kernel to finish, since it runs asynchronously with respect to the CPU Python code.

In [None]:
event = hello_gpu(queue, (1,), (1,))

event.wait()
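
The schedule-then-wait pattern above can be compared to submitting work to a background executor. The sketch below is only an analogy in plain Python, not PyOpenCL code: the thread pool stands in for the command queue, the returned future for the event, and `fake_kernel` is a hypothetical placeholder for device work.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_kernel():
    # Hypothetical stand-in for device work; a real kernel would run on the GPU.
    return "done"

# The pool plays the role of the command queue.
with ThreadPoolExecutor(max_workers=1) as pool:
    # Like hello_gpu(queue, ...): returns a handle without waiting for the work.
    future = pool.submit(fake_kernel)
    # The host is free to do other work here while the task runs.
    # Like event.wait(): block until the scheduled work has finished.
    result = future.result()

print(result)
```

As with `event.wait()`, the result is only guaranteed to be ready after the blocking call returns.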

Now we have code that compiles and executes but does nothing yet.

## Work division

Before we pass some data for processing it's necessary to understand how to create multiple threads.

In [None]:
%%HTML
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRcAADTeaEmIr2nO_XGEnelxWP3OQjtRDEWxrZr08sayWBmx7nJyiGfSQcOKEIPcawr6ptd5LrNGXEn/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

## Accessing data in kernels

In [None]:
%%HTML
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRxua8PCOImOODpVPZRFTtUvQKnTHn_kcRE44M7aXjzoZPUJLZtlozxHk7Of002F4fmUu3yUpWs2eRR/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

## Vector scaling

In this section you will see an example of a kernel that accesses data from a buffer. It will multiply every value in the input buffer by a constant and store the result at the same index of an output buffer.

First we create a regular, CPU-visible numpy array. By convention, handles to objects are prefixed with:
* h_ - on the host - CPU buffers
* d_ - on device side - GPU buffers

In [None]:
N = 8
h_buffer = np.arange(0, N).astype(np.int32)

h_buffer

GPUs often have their own memory on the graphics card. To perform operations on data on the GPU we have to copy it from CPU-visible RAM to GPU memory. This can be done when creating a [Buffer](https://documen.tician.de/pyopencl/runtime_memory.html?highlight=buffer#pyopencl.Buffer) object with the arguments:
* context - the Context that we have previously created
* [flags](https://documen.tician.de/pyopencl/runtime_const.html?highlight=mem_flags#mem_flags) - the flags we pass state that:
 - READ_ONLY - kernels will only read from the input buffer
 - WRITE_ONLY - kernels will only write to the output buffer
 - COPY_HOST_PTR - the memory on the GPU should be filled with the data provided in the 'hostbuf' argument
* hostbuf - the CPU-visible memory we have created with numpy
* size - for the output buffer there is no host data to copy, so we pass its size in bytes instead

In [None]:
flags = cl.mem_flags

d_input_buffer = cl.Buffer(ctx, flags.READ_ONLY | flags.COPY_HOST_PTR, hostbuf=h_buffer)
d_output_buffer = cl.Buffer(ctx, flags.WRITE_ONLY, h_buffer.nbytes)

### Vector Scale kernel
The kernel below is named 'scale_vector'. It takes three arguments:
* read only pointer to an input buffer
* writable pointer to output buffer
* multiplier - read-only constant value

In [None]:
%%cl_kernel -o "-cl-fast-relaxed-math"

__kernel void scale_vector(__global const int *input_buffer, 
                           __global int *output_buffer, 
                           const int multiplier)
{
  int gid = get_global_id(0);
  output_buffer[gid] = input_buffer[gid] * multiplier;
}

On a GPU there are a few kinds of memory address spaces where data can reside. GPU RAM is global memory and we use the __global qualifier to state that.
Available address space qualifiers are:
* [global](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/global.html) - RAM memory - slowest but a lot is available
* [local](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/local.html) - shared memory - close to processors - faster but not much available, high usage may decrease performance.
* [constant](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/constant.html) - special GPU read only memory, filled from CPU - used for constants
* [private](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/private.html) - registers - fastest but little of it is available, low usage is usually good for performance.

__kernel qualifier indicates that a function can be executed on an OpenCL device.

The scale_vector kernel function takes pointers to the input and output arrays as arguments. Each thread reads one element from the input array, multiplies it by a value and writes the result to the output array.
To get the index from which a thread will read the value, we use an OpenCL built-in function:

[get_global_id](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/get_global_id.html) - returns the global index of the current thread. Argument with value 0 indicates that we get index on x-axis. Similarly value 1 will return index on y-axis, and value 2 index on z-axis.

Data is returned from kernels through arguments - in this case by writing the value to the output buffer.

We create a simple execution configuration with one thread per work group. This is specified by the local work size. The global work size gives the total number of threads, so the number of work groups is the global work size divided by the local work size.

There will be as many work groups as elements in the arrays.
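
To make the relation between work groups and threads concrete, here is a small plain-Python sketch (no GPU involved) that enumerates the work items of a one-dimensional launch. The helper name `enumerate_work_items` is made up for illustration; it assumes a global offset of zero and a global size that is a multiple of the local size. The three values it yields correspond to what the OpenCL built-ins get_group_id, get_local_id and get_global_id would return in a kernel.

```python
def enumerate_work_items(global_size, local_size):
    """Yield (group_id, local_id, global_id) for a 1-D launch with zero offset."""
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of local size")
    num_groups = global_size // local_size
    for group_id in range(num_groups):
        for local_id in range(local_size):
            # The global index is the group's starting offset plus the
            # position of the thread inside its group.
            yield group_id, local_id, group_id * local_size + local_id

# One thread per work group, as in the configuration used here: 8 groups.
one_per_group = list(enumerate_work_items(8, 1))

# Four threads per work group: the same 8 threads form only 2 groups.
layout = list(enumerate_work_items(8, 4))
for group_id, local_id, global_id in layout:
    print(f"group={group_id} local={local_id} global={global_id}")
```

In both configurations the global ids run from 0 to 7, so the kernel touches the same elements; only the grouping of threads changes.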

We must also define a constant that will be passed to the kernel as a parameter. It has to be a numpy scalar with an explicit size (here np.int32, matching the kernel's int), otherwise PyOpenCL cannot tell how to pass the value.

In [None]:
local_work_size = (1,)
global_work_size = (N,)

multiplier = np.int32(4)
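
To see why an explicitly sized value matters, the sketch below uses only Python's standard `struct` module (it is not part of PyOpenCL) to show that the same number occupies a different number of bytes depending on the integer type chosen. The kernel's `const int multiplier` is a 32-bit integer in OpenCL C, so the host has to hand over exactly 4 bytes, which is what np.int32 makes explicit.

```python
import struct

value = 4

# "=i" packs a 32-bit signed integer, "=q" a 64-bit one (standard sizes).
as_int32 = struct.pack("=i", value)
as_int64 = struct.pack("=q", value)

print(len(as_int32))  # size in bytes of a 32-bit integer, like OpenCL C `int`
print(len(as_int64))  # size in bytes of a 64-bit integer, like OpenCL C `long`
```

A plain Python integer carries no such width, so the size has to come from the numpy type.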

The arguments passed to this kernel invocation are:
* queue - on which the work will be scheduled.
* global work size - stating the total number of threads in a grid. It's the number of independent work groups multiplied by the number of threads in a work group (the local work size).
* local work size - stating how many threads you will have in a work group.
* a variable number of arguments that will be passed to the kernel function. In the case of this kernel there are three arguments:
 - input buffer
 - output buffer
 - multiplier - a numpy constant by which vector will be multiplied

In [None]:
event = scale_vector(queue, 
                     global_work_size, 
                     local_work_size,
                     d_input_buffer,
                     d_output_buffer,
                     multiplier)

event.wait()

The result of vector multiplication is available in GPU buffer. To access it here in Python code we need to copy it back from GPU memory to CPU. 

So we need a CPU visible buffer. We create an empty array for that. Then we need to fetch the data back.

We can use [enqueue_copy](https://documen.tician.de/pyopencl/runtime_memory.html#pyopencl.enqueue_copy) OpenCL function which schedules a copy from GPU to CPU on the same queue. The function will block by default. If you do not want it to block set is_blocking argument to False. Note that in this case you will have to manually synchronize GPU with CPU at some point. Otherwise your buffer may be in invalid state.

To verify that the multiplication was done correctly on the GPU we also multiply the input CPU buffer here. If the CPU-multiplied buffer equals the GPU-multiplied one, the processing was done correctly.

In [None]:
h_result_from_gpu = np.zeros(N).astype(np.int32)
cl.enqueue_copy(queue, h_result_from_gpu, d_output_buffer)

result_from_cpu = h_buffer * multiplier
print(f"computed in cpu = {result_from_cpu}")
print(f"computed in gpu = {h_result_from_gpu}")

np.testing.assert_array_equal(result_from_cpu, h_result_from_gpu)

## Exercise: Efficient hardware utilization with optimal subgroup size

To utilize the underlying hardware efficiently we need to understand the basics of that hardware.

In the example above 'local_work_size' was set to one. This is far from optimal, as explained in the slides below.

In [None]:
%%HTML
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vQYzwdOhK1JejtcOPk7ua972EVKQ3OluGZ-m9LPl4Uhp0bOK1D_wWWzATBRHa8wc1NR6_P__DFJBmFv/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

To get a hint about the best work group size you can query the compiled kernel for performance hints.
Call [get_work_group_info](https://documen.tician.de/pyopencl/runtime_program.html#pyopencl.Kernel.get_work_group_info) on a kernel with kernel_work_group_info parameter value of [PREFERRED_WORK_GROUP_SIZE_MULTIPLE](https://registry.khronos.org/OpenCL/sdk/1.1/docs/man/xhtml/clGetKernelWorkGroupInfo.html).

In [None]:
device = ctx.get_info(cl.context_info.DEVICES)[0]

work_group_multiple = scale_vector.get_work_group_info(cl.kernel_work_group_info.PREFERRED_WORK_GROUP_SIZE_MULTIPLE, device)
max_work_size = scale_vector.get_work_group_info(cl.kernel_work_group_info.WORK_GROUP_SIZE, device)
local_memory_size = scale_vector.get_work_group_info(cl.kernel_work_group_info.LOCAL_MEM_SIZE, device)
private_bytes = scale_vector.get_work_group_info(cl.kernel_work_group_info.PRIVATE_MEM_SIZE, device)

print(f"Preferred work group size multiple = {work_group_multiple}")
print(f"Maximum work group size = {max_work_size}")
print(f"Local memory size = {local_memory_size}")
print(f"Spilled or private memory in bytes = {private_bytes}")

The first printed value is actually subgroup width, the width of your SIMD. Your work group size should be a multiple of it.

Later we will profile the kernel defined in the example above. For convenience let's now define a helper function which will measure kernel execution time.

In [None]:
def profile_gpu(function, n, queue, global_size, local_size, *args):
    """
    Profile a kernel with OpenCL events.
    The queue must have been created with PROFILING_ENABLE.
    """
    times = np.zeros(n)
    # Two warm-up runs that are not measured.
    function(queue, global_size, local_size, *args).wait()
    function(queue, global_size, local_size, *args).wait()
    
    for i in range(n):
        e = function(queue, global_size, local_size, *args)
        e.wait()
        # Event timestamps are in nanoseconds; convert to milliseconds.
        elapsed = (e.profile.end - e.profile.start) * 1e-6
        times[i] = elapsed

    avg_ms = np.mean(times)
    median_ms = np.median(times)
    variance = np.var(times)
    std = np.std(times)
    print(f"{function.function_name} took on average {avg_ms:.4f} ms, with median {median_ms:.4f} ms, variance {variance:.4f} ms^2, standard deviation {std:.4f} ms.")
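
The `* 1e-6` factor in profile_gpu is there because OpenCL reports profiling timestamps in nanoseconds. The sketch below replays the same conversion in plain Python on made-up timestamps; `summarize_ns` and the numbers are hypothetical and only illustrate the arithmetic, using the standard `statistics` module instead of numpy.

```python
import statistics

def summarize_ns(intervals_ns):
    """Convert (start, end) nanosecond pairs to milliseconds and summarize them."""
    times_ms = [(end - start) * 1e-6 for start, end in intervals_ns]
    return statistics.mean(times_ms), statistics.median(times_ms)

# Hypothetical timestamps for three kernel runs taking 2 ms, 1 ms and 3 ms.
fake_intervals = [
    (1_000_000, 3_000_000),
    (5_000_000, 6_000_000),
    (7_000_000, 10_000_000),
]

mean_ms, median_ms = summarize_ns(fake_intervals)
print(f"mean = {mean_ms:.4f} ms, median = {median_ms:.4f} ms")
```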

The input in the previous example has just 8 elements, and there is no point in doing so few operations on the GPU. To get some meaningful results we will increase the size of the input buffer and redefine the variables affected.

In [None]:
N = 2**23
h_buffer = np.arange(0, N).astype(np.int32)

print(f"Computing {N:,} elements - total of {h_buffer.nbytes:,} bytes.")

flags = cl.mem_flags
d_input_buffer = cl.Buffer(ctx, flags.READ_ONLY | flags.COPY_HOST_PTR, hostbuf=h_buffer)
d_output_buffer = cl.Buffer(ctx, flags.WRITE_ONLY, h_buffer.nbytes)

Execution configuration also needs to be updated since we've changed N.

In [None]:
local_work_size = (1,)
global_work_size = (N,)

So let's profile the kernel on the new bigger buffer.

In [None]:
profile_gpu(scale_vector, 20, 
            queue, 
            global_work_size, 
            local_work_size,
            d_input_buffer,
            d_output_buffer,
            multiplier)

Your task is to modify the execution configuration defined above to significantly reduce the time taken by this kernel. Always check that the results are still correct after optimizations.

In [None]:
h_result_from_gpu = np.zeros(N).astype(np.int32)
cl.enqueue_copy(queue, h_result_from_gpu, d_output_buffer)

result_from_cpu = h_buffer * multiplier
print(f"computed in cpu = {result_from_cpu}")
print(f"computed in gpu = {h_result_from_gpu}")

np.testing.assert_array_equal(result_from_cpu, h_result_from_gpu)

Refer to the [solution](./waves_solution.txt) if you get stuck.

# Parallel Patterns - Map

The simplest parallel pattern is probably map. The idea is that the same operation (or function) is applied to every element of an input array. Examples of map can be:
* vector scaling 
* vector addition

Map is an [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) problem - it is straightforward to parallelize this algorithm, with no communication required between the threads. In computer graphics, pixel shading or ray tracing is embarrassingly parallel because each pixel can be easily processed separately.
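
As a point of reference, this is what the map pattern looks like in plain, sequential Python, using the two examples named above. The function name `scale` and the sample values are made up for illustration. On a GPU each call would be handled by a separate thread; since no call depends on the result of another, the order of execution does not matter.

```python
def scale(x, multiplier=4):
    # The operation applied independently to every element.
    return x * multiplier

# Vector scaling: one input array, one output element per input element.
data = list(range(8))
scaled = list(map(scale, data))

# Vector addition: map over pairs of elements taken from two arrays.
a = [1, 2, 3]
b = [10, 20, 30]
summed = [x + y for x, y in zip(a, b)]

print(scaled)
print(summed)
```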

In [None]:
%%HTML
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vQbNSmPHSnKybcXyErXr9Tz2nqZf80bXCWvaL0x1GEq0MvsWgNJobxgw8Qp62hl_INBwdo2cnaAhVkn/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

Let's define a CPU profiling helper function for later use.

In [None]:
from time import perf_counter

def profile_cpu(function, n, *args):
    """
    This function profiles other functions:
        function - any function to be profiled
        n - number of times the function will be rerun - the more times you run it the more stable the results,
            but the longer the profiling takes
        args - variable list of arguments that will be passed to the profiled function
    Returns the value from the last run and the average time in milliseconds.
    """
    times = np.zeros(n)
    value = 0
    for i in range(n):
        start = perf_counter()
        value = function(*args)
        end = perf_counter()
        # perf_counter returns seconds; convert to milliseconds.
        elapsed = (end - start) * 1e3
        times[i] = elapsed

    avg_ms = np.mean(times)
    median_ms = np.median(times)
    variance = np.var(times)
    std = np.std(times)

    print(f"{function.__name__} took on average {avg_ms:.4f} ms, with median {median_ms:.4f} ms, variance {variance:.4f} ms^2, standard deviation {std:.4f} ms.")
    return value, avg_ms

## Exercise: Linear Equations

It's time to write a kernel yourself. Your task is to modify the code so that multiple simple linear equations are calculated.

The code below multiplies the data points from buffer 'a' by 2 and adds bias 'b'. 

$$
res = 2 a + b
$$

Extend the code below to:
* accept two input buffers 'a' and 'b'
* accept an output buffer 'c'
* write the results so that the input arrays are not overwritten
* fix the execution configuration
* pass the arguments to the host function that schedules execution on the GPU

In [None]:
import numpy as np

N = np.int32(2**25)
h_a = np.full(N, 1).astype(np.int32)
h_b = np.full(N, 2).astype(np.int32)

print(f"Working with {len(h_a):,} elements.")

Create required GPU buffers.

In [None]:
flags = cl.mem_flags

d_a = None
# ...
d_c = None

Write the kernel below to compute 2 * a + b from the elements of the two input arrays and write the result to a third array.

In [None]:
%%cl_kernel -o "-cl-fast-relaxed-math"

__kernel void compute_linear_equations_gpu(__global int *a)
{
    // your kernel code goes here
}    

Create appropriate execution configuration.

In [None]:
local_work_size = (2,)
global_work_size = (128,)

Execute and profile the kernel.

In [None]:
profile_gpu(compute_linear_equations_gpu, 
            20, 
            queue, 
            global_work_size, 
            local_work_size,
            # other arguments
            )

Check if the resulting array is correct.

In [None]:
h_c = np.zeros(N).astype(np.int32)
cl.enqueue_copy(queue, h_c, d_c)

def compute_linear_equations_cpu(a, b):
    return 2 * a + b

numpy_res, numpy_avg_ms = profile_cpu(compute_linear_equations_cpu, 20, h_a, h_b)
np.testing.assert_array_equal(numpy_res, h_c)

Refer to the [solution](./vector_addition_solution.py) if you get stuck.

## Handling arbitrary dataset sizes

The number of threads in a group applies to all work groups - all work groups must be of the same size. This does not always align with the dataset size being processed. 

In [None]:
%%HTML
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vRpVhN7h5X-1EFA1wH7ipQJcbBw4iWp4Q65mklz_ZZWuaDrJJo8C0Y_4L4QmPzTsQ4TBPcDOGVkwX3x/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="569" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>

### Dataset smaller than number of threads

To handle situations with less data than threads we can:
* pass number of elements in the dataset
* add a check to the kernel.

In the case of vector scaling kernel the code is:

In [None]:
%%cl_kernel -o "-cl-fast-relaxed-math"

__kernel void multiply_vector(__global float *buffer_gpu, int multiplier, int number_of_elements)
{
    const int gid = get_global_id(0);
    if (gid < number_of_elements) {
        buffer_gpu[gid] = buffer_gpu[gid] * multiplier;
    }
}

NOTE: If you read outside the buffer, things can:
* go very wrong - your GPU driver can enter an invalid state and you will need to restart your computer (or reset the GPU in some way)
* go unnoticed - the driver may have allocated more memory than you requested and things will seem to work. But there may be situations where your input data has a different size, and this can cause crashes in production.

So keep your data access guarded!
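
The guard can be tried out without a GPU. The plain-Python sketch below emulates a launch in which the global size has been rounded up to a multiple of the local size, so that some threads have no data to process. The helpers `round_up` and `run_guarded` are hypothetical names used only for this illustration; each loop iteration stands in for one thread.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return ((n + multiple - 1) // multiple) * multiple

def run_guarded(data, multiplier, local_size):
    n = len(data)
    # All work groups have the same size, so we launch more threads than elements.
    global_size = round_up(n, local_size)
    out = list(data)
    skipped = 0
    for gid in range(global_size):  # each iteration emulates one thread
        if gid < n:                 # the same guard as in the kernel
            out[gid] = data[gid] * multiplier
        else:
            skipped += 1            # this thread has nothing to do
    return out, global_size, skipped

out, launched, skipped = run_guarded([1, 2, 3, 4, 5], 2, 4)
print(out, launched, skipped)
```

With 5 elements and a work group size of 4, two work groups (8 threads) are launched and the last 3 threads fall through the guard. Without the check, Python would raise an IndexError here, whereas a GPU would silently access memory outside the buffer.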

### Dataset bigger than the number of threads

With larger datasets, handling one data cell per thread requires creating a huge number of work groups. This may not always be the most efficient way to solve a problem.

A common way to reduce the number of threads is to keep a thread alive after it has processed a cell. When a thread has finished processing a cell it jumps to the next element assigned to it.

There are various ways to divide the entire dataset. An efficient solution from the memory-access point of view is the so-called **grid-stride loop**. In this solution we use a for-loop in which each iteration jumps ahead by the size of the grid.

Size of the grid can be queried or calculated in the following ways:
* [get_global_size](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/get_global_size.html)
* [get_local_size](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/get_local_size.html) x [get_num_groups](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/get_num_groups.html) - multiply the work group size by the number of work groups. In some frameworks the global size is not available directly, which is the reason to calculate the number of threads in a grid this way.

In [None]:
%%cl_kernel -o "-cl-fast-relaxed-math"

__kernel void multiply_vector(__global float *buffer_gpu, int multiplier, int number_of_elements)
{
    const int gid = get_global_id(0);
    const int grid_stride = get_global_size(0);

    for (int i = gid; i < number_of_elements; i += grid_stride) 
    {
        buffer_gpu[i] = buffer_gpu[i] * multiplier;
    }
}
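
To see which elements each thread ends up processing, the loop header of the kernel can be replayed in plain Python. The helper name `grid_stride_indices` and the sizes (10 elements, a grid of 4 threads) are made up for illustration.

```python
def grid_stride_indices(gid, global_size, number_of_elements):
    """Indices visited by thread `gid`, mirroring the kernel's for-loop."""
    return list(range(gid, number_of_elements, global_size))

number_of_elements = 10
grid_size = 4

per_thread = {
    gid: grid_stride_indices(gid, grid_size, number_of_elements)
    for gid in range(grid_size)
}
for gid, indices in per_thread.items():
    print(f"thread {gid} processes {indices}")

# Every element is processed exactly once.
covered = sorted(i for indices in per_thread.values() for i in indices)
```

In each iteration, neighbouring threads access neighbouring elements (0, 1, 2, 3, then 4, 5, 6, 7 and so on), which is the access pattern that makes this loop friendly to the memory system.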

## Exercise - safe vector addition

Modify the code from the previous exercise to:

* handle arbitrary vector sizes - take care not to read data out of bounds
* add information about the number of elements to the kernel
* reuse the threads in a work group so that you don't have to create as many threads as input items
* take care of the buffers' access parameters - read only, write only etc.
* take care of memory coalescing
* experiment with work group size and the amount of work groups
* watch GPU execution time
    
Refer to the [solution](./map_solution.txt) if you get stuck.