# Introduction to OpenCL

<figure style="float:right; width:30%;">
    <img src="images/OpenCL_RGB_Apr20.svg" alt="OpenCL logo"/>
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.</figcaption>
</figure>

OpenCL (short for Open Computing Language) is an open standard for running compute workloads on many different kinds of compute hardware (e.g. CPUs and GPUs). The OpenCL trademark is held by Apple, and the standard is developed and released by the [Khronos](https://www.khronos.org) Group, a not-for-profit organisation that provides a focal point for the development of royalty-free standards such as OpenGL. The OpenCL specification itself is just a document, and can be downloaded from the [Khronos website](https://www.khronos.org/registry/OpenCL/specs/). It is then the task of compute hardware vendors to produce software implementations of OpenCL that make the best use of their compute devices.

## How does OpenCL work?

To understand how an OpenCL implementation works, we need to start by thinking about hardware. In every compute device such as a CPU or GPU there are a number of cores on which software can be run. In OpenCL terminology these cores are called **Compute Units**. Each Compute Unit makes available to the operating system a number of hardware threads that can run software. In OpenCL terminology we call these hardware threads **Processing Elements**. For example, an NVIDIA GP102 die is shown below. The die contains 30 compute units, outlined by the orange squares. Each compute unit provides 128 processing elements (CUDA cores), so in this example there are $30\times128 = 3840$ processing elements available for use in compute applications.

<figure style="margin: 1em; margin-left:auto; margin-right:auto; width:70%;">
    <img src="images/compute_units.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">NVIDIA GP102 die with compute units highlighted in orange. Image credit: <a href="https://www.flickr.com/photos/130561288@N04/46079430302/">Fritzchens Fritz</a></figcaption>
</figure>

During execution of an OpenCL program, processing elements each run an instance of a user-specified piece of compiled code called a **kernel**. Below is an example OpenCL C kernel that takes the absolute value of a single element of an array.

```C
__kernel void vec_fabs(
        // Memory allocations that are on the compute device
        __global float *src, 
        __global float *dst,
        // Number of elements in the memory allocations
        int length) {

    // Get our position in the array
    size_t gid0 = get_global_id(0);

    // Take the absolute value, guarding against running off the end of the array
    if (gid0 < length) {
        dst[gid0] = fabs(src[gid0]);
    }
}
```

We want to run a kernel instance for every element of the array. An OpenCL implementation provides a way to run kernel instances on processing elements as they become available, along with the means to copy memory to and from compute devices. We specify how many kernel instances we want at runtime by defining a 3D execution space called a **Grid** and specifying its size at kernel launch. Every point in the Grid is called a **work-item** and represents a unique invocation of the kernel. This is much like defining an execution space using nested loops; however, with OpenCL there are no guarantees on the order in which work-items are completed.
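To make the nested-loop analogy concrete, here is a minimal host-only sketch in plain C. It is not OpenCL code: the names `fabs_item` and `run_grid` and the grid size of (10, 8, 2) are assumptions made for illustration, and the loops visit points in a fixed order, whereas an OpenCL implementation makes no such promise.

```c
#include <stddef.h>

/* Illustrative grid size (an assumption for this sketch) */
enum { N0 = 10, N1 = 8, N2 = 2 };

/* Hypothetical stand-in for a kernel body: one call is one work-item.
   In OpenCL the position would come from get_global_id(). */
static void fabs_item(float src[N2][N1][N0], float dst[N2][N1][N0],
                      size_t i0, size_t i1, size_t i2) {
    float x = src[i2][i1][i0];
    dst[i2][i1][i0] = (x < 0.0f) ? -x : x;
}

/* Visit every point in the 3D grid, calling the stand-in kernel once
   per point. Returns the number of work-items that were run. */
size_t run_grid(float src[N2][N1][N0], float dst[N2][N1][N0]) {
    size_t count = 0;
    for (size_t i2 = 0; i2 < N2; i2++) {
        for (size_t i1 = 0; i1 < N1; i1++) {
            for (size_t i0 = 0; i0 < N0; i0++) {
                fabs_item(src, dst, i0, i1, i2);
                count++;
            }
        }
    }
    return count;
}
```

Calling `run_grid` on two arrays of shape `[N2][N1][N0]` runs one stand-in kernel invocation per grid point, which is the job an OpenCL implementation performs in parallel across processing elements.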

<figure style="margin-left:auto; margin-right:auto; width:70%;">
    <img style="vertical-align:middle" src="images/grid.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Three-dimensional Grid with work-items and work-groups.</figcaption>
</figure>

Work-items are executed in teams called **work-groups**. In the figure above the Grid has a global size of (10, 8, 2) and each work-group has a size of (5, 4, 1). The number of work-groups in each dimension is then (2, 2, 2). Every work-item has access to device memory that it can use exclusively (**private memory**), access to memory the team can use (**local memory**), and access to memory that other teams use (**global** and **constant** memory). Every kernel invocation or work-item can query its location within the **Grid** and use that position as a reference to access allocated memory on the compute device at an appropriately calculated offset.
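The work-group arithmetic can be checked with a short sketch in plain C. The helper names `groups_along`, `group_of`, and `local_of` are hypothetical; within a kernel, the OpenCL C built-ins `get_num_groups`, `get_group_id`, and `get_local_id` report the corresponding values. The sketch assumes the global size is an exact multiple of the work-group size and that no global offset is used.

```c
#include <stddef.h>

/* Number of work-groups along one dimension.
   Assumes global_size is an exact multiple of local_size. */
size_t groups_along(size_t global_size, size_t local_size) {
    return global_size / local_size;
}

/* Index of the work-group that a work-item belongs to, along one dimension */
size_t group_of(size_t global_id, size_t local_size) {
    return global_id / local_size;
}

/* Position of a work-item inside its work-group, along one dimension */
size_t local_of(size_t global_id, size_t local_size) {
    return global_id % local_size;
}
```

With a global size of (10, 8, 2) and a work-group size of (5, 4, 1), `groups_along` gives 2 in every dimension. A work-item at global position 7 along dimension 0 sits in work-group 1 at local position 2, since $1\times5 + 2 = 7$.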

<figure style="margin-left:auto; margin-right:auto; width:70%;">
    <img style="vertical-align:middle" src="images/mem_access.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Using the location within the Grid to access memory within a memory allocation on a GPU compute device.</figcaption>
</figure>
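One common way to turn a position in the Grid into an offset within a flat memory allocation is sketched below in plain C. The function name `linear_offset` is hypothetical, and the layout is an assumption: dimension 0 is taken to be contiguous in memory. Other layouts are possible, and the choice belongs to the programmer rather than to OpenCL.

```c
#include <stddef.h>

/* Offset (in elements) of grid position (i0, i1, i2) within a flat
   allocation of n0 * n1 * n2 elements, with dimension 0 varying fastest. */
size_t linear_offset(size_t i0, size_t i1, size_t i2,
                     size_t n0, size_t n1) {
    return i0 + n0 * (i1 + n1 * i2);
}
```

For a Grid of size (10, 8, 2), the first position (0, 0, 0) maps to offset 0 and the last position (9, 7, 1) maps to offset 159, so every work-item lands on a distinct element of a 160-element allocation.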

The above concepts form the core ideas surrounding OpenCL. Everything that follows in this course is supporting information on how to prepare compute devices, memory allocations, kernel invocations, and how best to use these concepts together to get the best performance out of your compute devices. 

### Elements of an accelerated application

In every accelerated application there is the concept of a host computer with one or more attached compute devices. The host usually has the largest memory space available and the compute device usually has the most compute power and memory bandwidth. This is why we say the application is "accelerated" by the compute device.

At runtime, the host executes the application and compiles kernels for execution on the compute device. The host manages memory allocations and submits kernels to the compute device for execution. In cases where the compute device is a CPU, the host CPU and the compute device are the same.

Every accelerated application follows the same logical progression of steps: 

1. Compute devices discovered
1. Kernels prepared for compute devices
1. Memory allocated on the compute device
1. Memory copied to the compute device
1. Kernels run on the compute device
1. Wait for kernels to finish
1. Memory copied back from the compute device to the host
1. Repeat steps 3 - 7 as many times as necessary
1. Clean up resources and exit
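
The data flow in steps 3 to 7 and the clean-up in step 9 can be sketched as a host-only emulation in plain C. This is not OpenCL code: `malloc`, `memcpy`, and a loop stand in for device allocation, transfers, and the kernel launch, and the function names `vec_fabs_item` and `run_vec_fabs` are hypothetical. The comments name the OpenCL API call that would typically fill each role in a real application. Steps 1 and 2 (device discovery and kernel preparation) have no host-only analogue and are left out.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kernel: one call is one work-item */
static void vec_fabs_item(const float *src, float *dst,
                          size_t length, size_t gid0) {
    if (gid0 < length) {
        float x = src[gid0];
        dst[gid0] = (x < 0.0f) ? -x : x;
    }
}

/* Emulates steps 3-7 and 9 on the host.
   Returns 0 on success and -1 if an allocation fails. */
int run_vec_fabs(const float *host_src, float *host_dst, size_t length) {
    if (length == 0) {
        return 0;
    }
    size_t nbytes = length * sizeof(float);

    /* Step 3: allocate "device" memory (role of clCreateBuffer) */
    float *dev_src = malloc(nbytes);
    float *dev_dst = malloc(nbytes);
    if (dev_src == NULL || dev_dst == NULL) {
        free(dev_src);
        free(dev_dst);
        return -1;
    }

    /* Step 4: copy input to the "device" (role of clEnqueueWriteBuffer) */
    memcpy(dev_src, host_src, nbytes);

    /* Step 5: run one kernel instance per work-item
       (role of clEnqueueNDRangeKernel) */
    for (size_t gid0 = 0; gid0 < length; gid0++) {
        vec_fabs_item(dev_src, dev_dst, length, gid0);
    }

    /* Step 6: wait for kernels to finish (role of clFinish).
       Nothing to do here because the loop above is synchronous. */

    /* Step 7: copy results back to the host (role of clEnqueueReadBuffer) */
    memcpy(host_dst, dev_dst, nbytes);

    /* Step 9: clean up (role of clReleaseMemObject) */
    free(dev_src);
    free(dev_dst);
    return 0;
}
```

Keeping separate host and "device" arrays, even in an emulation, mirrors the fact that a kernel only sees memory that has been explicitly allocated on and copied to the compute device.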

### Taxonomy of an OpenCL application

There may be a number of OpenCL implementations available on the host computer. Thankfully, when compiling an OpenCL application we don't need to link against every implementation; we just need to link against a single library called the **Installable Client Driver (ICD)** loader. The ICD loader has the name **opencl.dll** on Windows and **libOpenCL.so** on Linux. Accompanying it are header files (**opencl.h** for C and **cl.hpp** for C++) that must be included from the C/C++ source code. The ICD loader takes care of intercepting all library calls and routing them to the appropriate vendor implementation. This happens transparently to the user.

Below is a representation of the core software components that are available to an OpenCL application.

<figure style="margin-left:auto; margin-right:auto; width:50%;">
    <img style="vertical-align:middle" src="images/opencl_components.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Components of an OpenCL application.</figcaption>
</figure>

The first is the **platform**. This is a software representation of the vendor and provides access to all **devices** that the vendor supports. During device discovery, the available platforms must be discovered before anything else. A platform provides access to one or more compute devices, and possibly even a mixture of accelerator devices from the same vendor, as shown in the diagram. Surrounding the devices is a **Context**. A Context is like a registry that keeps track of everything that is happening on the compute device(s). It is constructed with a platform and one or more devices that are accessible from that platform. There are some benefits to encapsulating more than one device in the same context; however, this assumes that the devices all belong to the same platform, which may not always be valid. I usually prefer to create a unique context for every compute device.

Outside the devices and within the context are **Buffers**. Buffers are memory allocations that are created using a context and migrated between the host and devices as needed. The source code for kernels is compiled into **programs**, and there must be a program for every context.

## Historical foundation 

According to [Wikipedia](https://en.wikipedia.org/wiki/OpenCL), OpenCL was originally designed by Apple, which developed a proposal to submit to the Khronos Group. The first specification, OpenCL 1.0, was ratified on November 18, 2008, and the first public release of the standard followed in December 2008. Since then, a number of different versions of the standard have been released.

## Specification roadmap

An implementation of the OpenCL standard works like CUDA in that it provides a framework for the creation of specialised lightweight functions called **kernels**. Each kernel performs a specific function and is compiled at runtime for the accelerator device it will run on. Kernels operate in parallel, on memory from both host and device, using any number of hardware threads that the device provides. In this way, OpenCL can readily scale up to the large numbers of hardware threads found on modern GPU and many-core CPU architectures.

## Vendor implementations

Device hardware vendors each provide a way to compile OpenCL kernels and to manage device memory according to the OpenCL standard. Due to support from a wide variety of hardware vendors, OpenCL applications are not limited to running on NVIDIA GPUs; they can run on CPUs, GPUs, FPGAs, and other embedded devices.

Given the significant time it takes to develop applications, writing your compute applications in OpenCL can make them portable across many different compute platforms.

## Getting help for OpenCL


## Is OpenCL right for my application?


## Exercise: compiling your first OpenCL application

## References

<address>
&copy; 2021 by Dr. Toby Potter<br>
email: <a href="mailto:tobympotter@gmail.com">tobympotter@gmail.com</a><br>
Visit us at: <a href="https://www.pelagos-consulting.com">www.pelagos-consulting.com</a><br>
</address>