# Introduction to OpenCL

<figure style="float:right; width:30%;">
    <img src="images/OpenCL_RGB_Apr20.svg" alt="OpenCL logo"/>
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.</figcaption>
</figure>

OpenCL (short for Open Computing Language) is an open standard for running compute workloads on many different kinds of compute hardware (e.g. CPUs and GPUs). The OpenCL trademark is held by Apple, and the standard is developed and released by the [Khronos](https://www.khronos.org) Group, a not-for-profit organisation that provides a focal point for the development of royalty-free standards such as OpenGL. The OpenCL specification itself is just a document, and can be downloaded from the [Khronos registry](https://www.khronos.org/registry/OpenCL/specs/). It is then the task of compute hardware vendors to produce software implementations of OpenCL that make the best use of their compute devices.

## How does OpenCL work?

To understand how an OpenCL implementation works, we need to start with the hardware. Every compute device, such as a CPU or GPU, has a number of cores on which software can run. In OpenCL terminology these cores are called **Compute Units**. Each Compute Unit makes available to the operating system a number of hardware threads that can run software. In OpenCL terminology these hardware threads are called **Processing Elements**. For example, an NVIDIA GP102 die is shown below. The die contains 30 compute units, outlined by the orange squares. Each compute unit provides 128 processing elements (CUDA cores), so up to 30 × 128 = 3840 processing elements are available for use in compute applications.

<!-- <figure style="margin-bottom:auto; margin-top:auto; margin-left:auto; margin-right:auto; width:30%;"> -->
<figure style="float:left; width:70%; margin:1em;">
    <img style="vertical-align:middle" src="images/compute_units.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">NVIDIA GP102 die with compute units highlighted in orange. Image credit: <a href="https://www.flickr.com/photos/130561288@N04/46079430302/">Fritzchens Fritz</a></figcaption>
</figure>
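The processing-element count quoted for the GP102 can be reproduced with a few lines of plain C. This is only a sketch of the arithmetic: `device_shape` and `total_processing_elements` are hypothetical names for illustration and are not part of the OpenCL API.

```C
#include <assert.h>
#include <stddef.h>

// Hypothetical description of a compute device; not an OpenCL API type
typedef struct {
    size_t compute_units;      // e.g. 30 on the GP102 die above
    size_t elements_per_unit;  // e.g. 128 processing elements per compute unit
} device_shape;

// Total processing elements under the simple model used in the text:
// compute units multiplied by processing elements per compute unit
static size_t total_processing_elements(device_shape d) {
    assert(d.compute_units > 0 && d.elements_per_unit > 0);
    return d.compute_units * d.elements_per_unit;
}
```

For the GP102 figures given above, `total_processing_elements((device_shape){30, 128})` evaluates to 3840.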

During execution of an OpenCL program, each processing element runs an instance of a user-specified piece of compiled code called a **kernel**. The following example OpenCL C kernel takes the absolute value of a single element of an array.

```C
__kernel void vec_fabs(
        // Memory allocations that are on the compute device
        __global float *src, 
        __global float *dst,
        // Number of elements in the memory allocations
        int length) {

    // Get our position in the array
    size_t gid0 = get_global_id(0);

    // Store the absolute value of this element, skipping out-of-range work-items
    if (gid0 < length) {
        dst[gid0] = fabs(src[gid0]);
    }
}
```
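The bounds check in the kernel matters when more kernel instances are launched than there are array elements. One common case, assumed here because the text does not say how the kernel is launched, is that the launch size is rounded up to a whole number of work-groups. The plain C sketch below emulates that situation serially; `padded_global_size`, `vec_fabs_work_item`, `emulate_vec_fabs` and `local_size` are hypothetical names, and the index that OpenCL would supply through `get_global_id(0)` is passed in as an argument.

```C
#include <assert.h>
#include <stddef.h>

// Round a work-item count up to a whole number of work-groups.
// local_size is a hypothetical work-group size chosen by the caller.
static size_t padded_global_size(size_t length, size_t local_size) {
    assert(local_size > 0);
    return ((length + local_size - 1) / local_size) * local_size;
}

// Plain C stand-in for one kernel instance of vec_fabs.
// In OpenCL the index would come from get_global_id(0).
static void vec_fabs_work_item(size_t gid0, const float *src, float *dst, int length) {
    // Same guard as the kernel: padded work-items past the end do nothing
    if (gid0 < (size_t)length) {
        float x = src[gid0];
        dst[gid0] = (x < 0.0f) ? -x : x;  // stand-in for OpenCL's fabs
    }
}

// Serial emulation of a one-dimensional launch over the padded range.
// Returns the number of kernel instances that were run.
static size_t emulate_vec_fabs(const float *src, float *dst, int length, size_t local_size) {
    assert(length >= 0);
    size_t global_size = padded_global_size((size_t)length, local_size);
    for (size_t gid0 = 0; gid0 < global_size; gid0++) {
        vec_fabs_work_item(gid0, src, dst, length);
    }
    return global_size;
}
```

With `length = 5` and `local_size = 4`, eight instances run; the last three fail the guard and leave memory beyond the end of the array untouched.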

We run a kernel instance for every element of the array. An OpenCL implementation is then a way to run kernel instances on processing elements as they become available. We specify how many kernel instances we want at runtime by defining an execution space of up to three dimensions called a **Grid** (an NDRange in the OpenCL specification) and specifying its size at kernel launch. Every point in the Grid is called a **work-item** and represents a unique kernel instance. This is much like defining an execution space using nested loops; with OpenCL, however, there are no guarantees on the order in which work-items are completed.

<figure style="float:left; width:70%; margin:1em;">
    <img style="vertical-align:middle" src="images/grid.svg">
    <figcaption style= "text-align:lower; margin:1em; float:bottom; vertical-align:bottom;">Three-dimensional Grid with work-items and work-groups.</figcaption>
</figure>
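The nested-loop analogy can be made concrete with a small serial emulation in plain C. This is a sketch only: `emulate_grid`, `work_item_fn`, `visit_args` and `mark_visit` are hypothetical names, and the loops visit the points in one fixed order, whereas an OpenCL implementation is free to run work-items in any order and in parallel.

```C
#include <assert.h>
#include <stddef.h>

// Hypothetical signature for the work done by one work-item in this emulation
typedef void (*work_item_fn)(size_t gid0, size_t gid1, size_t gid2, void *args);

// Visit every point of an n0 x n1 x n2 Grid exactly once, in a fixed serial order.
// Returns the number of work-items that were run.
static size_t emulate_grid(size_t n0, size_t n1, size_t n2, work_item_fn fn, void *args) {
    assert(fn != NULL);
    size_t launched = 0;
    for (size_t gid2 = 0; gid2 < n2; gid2++) {
        for (size_t gid1 = 0; gid1 < n1; gid1++) {
            for (size_t gid0 = 0; gid0 < n0; gid0++) {
                fn(gid0, gid1, gid2, args);
                launched++;
            }
        }
    }
    return launched;
}

// Arguments for the example work-item below
typedef struct {
    size_t n0, n1;  // extents of the first two Grid dimensions
    int *visits;    // one counter per Grid point, n0 * n1 * n2 entries
} visit_args;

// Example work-item: count a visit at this point's flattened position
static void mark_visit(size_t gid0, size_t gid1, size_t gid2, void *args) {
    visit_args *a = (visit_args *)args;
    a->visits[(gid2 * a->n1 + gid1) * a->n0 + gid0] += 1;
}
```

Calling `emulate_grid(4, 3, 2, mark_visit, &args)` runs 24 work-items, one per Grid point, so every counter in `visits` ends at 1.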

### Compute device





### Taxonomy of an accelerated application
### Taxonomy of OpenCL components

## Historical foundation 

According to [Wikipedia](https://en.wikipedia.org/wiki/OpenCL), OpenCL was originally designed by Apple, which developed a proposal to submit to the Khronos Group. The first specification, OpenCL 1.0, was ratified on November 18, 2008, and the first public release of the standard followed in December 2008. Since then, a number of different versions of the standard have been released.

## Specification roadmap

An implementation of the OpenCL standard works like CUDA in that it provides a framework for creating specialised lightweight functions called **kernels**. Each kernel performs a specific task and is typically compiled at runtime for the accelerator device it will run on. Kernels operate in parallel, on memory from both host and device, using as many hardware threads as the device provides. In this way, OpenCL scales readily to the large numbers of hardware threads found on modern GPU and manycore CPU architectures.

## Vendor implementations

Each device hardware vendor provides a way to compile OpenCL kernels and to manage device memory according to the OpenCL standard. Thanks to support from a wide variety of hardware vendors, OpenCL applications are not limited to running on NVIDIA GPUs; they can run on CPUs, GPUs, FPGAs, and other embedded devices.

Because applications take significant time to develop, writing your compute applications in OpenCL lets you carry that investment across many different compute platforms.

## Exercise: compiling your first OpenCL application

## References

<address>
&copy; 2021 by Dr. Toby Potter<br>
email: <a href="mailto:tobympotter@gmail.com">tobympotter@gmail.com</a><br>
Visit us at: <a href="https://www.pelagos-consulting.com">www.pelagos-consulting.com</a><br>
</address>