OpenCL.org Initiative

ToDo: What is OpenCL.org initiative?

Acknowledgements

Table Of Contents

  1. OpenCL Introduction
  2. Platforms and tools
  3. Libraries
  4. Specification
  5. Getting Started
  6. Profiling
  7. Bibliography

OpenCL Introduction

What is OpenCL?

OpenCL is an open, royalty-free industry standard that enables much faster computation through parallel programming. The standard is maintained by the non-profit standards organisation Khronos. For example, by using this technology with graphics cards and modern multi-core processors it is possible to convert a video in 20 minutes instead of 2 hours, or to analyze the spectra of hundreds of thousands of stars in minutes instead of several hours.

How does it work?

OpenCL is an extension to existing languages. It makes it possible to specify a piece of code that is executed multiple times, independently of the other executions. This code can run on various processors – not only the main CPU. OpenCL also has built-in support for vector types (float2, short4, int8, long16, etc.), and operations on those types can be mapped to the vector extensions of modern CPUs (SSE, AVX).
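For instance, here is a hedged snippet of OpenCL C (illustrative only, to be placed inside a kernel body) showing the built-in vector types; the arithmetic applies component-wise and can map onto SIMD units:

// Component-wise arithmetic on built-in OpenCL C vector types.
float4 a = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
float4 b = (float4)(0.5f);       // broadcast 0.5f to all four components
float4 c = a * b + 1.0f;         // may compile to a single SSE/AVX operation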

For example, suppose you need to calculate sin(x) for a large array A of one million numbers. Based on the information provided by the OpenCL framework you can pick the best device, or even several devices, and send the data to the device(s). Normally you would loop over the million numbers, one number after another, but with OpenCL you can just say: "For each x in array A give me sin(x)", and each x is processed in parallel. When finished, you can take the data back from the device(s) or proceed with other computations.
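As a minimal sketch of what such a data-parallel piece of code looks like (the names compute_sin, A and result are illustrative, not from any particular codebase), the "for each x" part could be written as the following OpenCL C kernel:

__kernel void compute_sin(__global const float *A, __global float *result)
{
    // Each work-item processes exactly one element of the array.
    size_t i = get_global_id(0);
    result[i] = sin(A[i]);
}

The host then enqueues this kernel with a global work size of one million, and the runtime distributes the work-items over the device's compute units.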

The OpenCL framework is great at exposing the parallel nature of various compute devices: x86 CPUs, GPUs, FPGAs, DSPs. This can significantly lower the total execution time compared to conventional sequential methods.

5 questions on OpenCL

Q: Why is it so fast?

The OpenCL framework gives programmers tools and features which enable them to implement efficient parallel algorithms. It does this by providing direct access to the cores of multi-core CPUs and to the hundreds of little processors on graphics cards.

Q: Does it work on any type of hardware?

As it is an open standard, it can work on any type of hardware designed for parallel execution. This can be a CPU, GPU, DSP or FPGA. You can read more in the Platforms and tools chapter.

Q: How does it compare to OpenMP/MPI?

Where OpenMP and MPI try to split loops over threads/servers and are CPU-oriented, OpenCL focuses on making threads aware of the position of their data and on exploiting the capabilities of the processor.

Q: Does it replace C or C++?

No, it is a framework which integrates well with C, C++, Python, Java and more.

Q: How stable/mature is OpenCL?

The first version of OpenCL was released in December 2008. The latest version is OpenCL 2.1 from November 2015. The standard is being actively developed. The next version (2.2), which currently has provisional status, will include the OpenCL C++ kernel language based on the C++14 standard.

Platforms and tools

SDKs And OpenCL Implementations

AMD APP SDK

AMD OpenCL™ Accelerated Parallel Processing (APP) technology is a set of advanced hardware and software technologies that enable AMD graphics processing cores (GPU), working in concert with the system's x86 cores (CPU), to execute heterogeneously and accelerate many applications beyond just graphics.

The SDK provides samples, documentation, and other materials. GPU drivers must be installed in order to run OpenCL programs on AMD GPUs.

Note

The AMD Linux Catalyst driver has not been updated since 18 December 2015 (Crimson Edition 15.12). We recommend installing and testing the AMDGPU-PRO driver (OpenCL 1.2 only), or trying AMD ROCm.

Supported hardware:

  • AMD GPU
  • AMD APU
  • x86 CPU

Supported OS:

  • Windows
  • Linux

Standards:

  • OpenCL 2.0 (AMD Catalyst/Crimson)
  • OpenCL 1.2 (AMDGPU-PRO)

See also

AMD ROCm (Partially Open Source)

ROCm is an open-source platform for GPU computing. ROCm 1.4 includes a developer preview of OpenCL support (not yet open source in this release).

Supported hardware:

  • AMD GPU (limited list, GFX7 and GFX8 only - Hawaii, Fiji, Polaris)

Supported OS:

  • Linux (with a special ROCm kernel)

Standards:

  • OpenCL 2.0 compatible kernel language and OpenCL 1.2 compatible runtime

See also

Intel® SDK for OpenCL™ Applications

The Intel® SDK for OpenCL™ Applications is a comprehensive development environment for developing and optimizing OpenCL™ applications on Intel® platforms.

The SDK supports offloading compute-intensive parallel workloads to Intel® Graphics Technology using an advanced OpenCL™ kernel compiler, runtime debugger and code performance analyzer.

The Intel® SDK for OpenCL™ Applications includes the Intel OpenCL™ Code Builder, which allows you to build (offline), debug, and analyze OpenCL programs. You must install the Intel OpenCL drivers for CPUs and GPUs separately, since the SDK itself only includes an experimental OpenCL 2.1 platform with a driver for a CPU device.

Note

Since the Intel SDK supports OpenCL 2.1 (and therefore SPIR-V 1.0), it is currently the only SDK that lets you run OpenCL C++ kernels (not all features work).

Supported hardware:

  • Intel® Graphics (GPU), Windows only
  • Intel® Processors (CPU)
  • Intel® Xeon Phi™ Coprocessors

Supported OS:

  • Windows
  • Linux
  • Android (as a target only)

Standards:

  • OpenCL 2.0 & 1.2
  • OpenCL 2.1 (CPU only) with SPIR and SPIR-V support

See also

Beignet for Intel GPUs on Linux (Open Source)

Beignet contains the code to run OpenCL programs on Intel GPUs. It defines and implements the host functions required to initialize the device, create the command queues, kernels, and programs, and run them on the GPU. It also contains the compiler part of the stack.

Supported hardware:

  • Intel® Graphics (GPU)

Supported OS:

  • Linux
  • Android

Standards:

  • OpenCL 2.0 & 1.2

See also

NVIDIA

NVIDIA does not have a separate OpenCL SDK, but the CUDA Toolkit contains OpenCL headers and shared libraries. OpenCL support is included in NVIDIA GPU drivers.

OpenCL samples with source code for Windows, Linux and macOS are available at https://developer.nvidia.com/opencl

Supported hardware:

  • NVIDIA GPUs

Supported OS:

  • Windows
  • Linux
  • macOS

Standards:

  • OpenCL 1.2
  • Beta support for OpenCL 2.0

See also

Portable Computing Language (pocl, Open Source)

Portable Computing Language (pocl) aims to become an MIT-licensed open source implementation of the OpenCL standard which can be easily adapted for new targets and devices, both homogeneous CPUs and heterogeneous GPUs/accelerators.

pocl uses Clang as an OpenCL C frontend and LLVM for the kernel compiler implementation, and as a portability layer. Thus, if your desired target has an LLVM backend, it should be able to get OpenCL support easily by using pocl.

Supported hardware:

  • x86 CPU
  • HSA targets

Supported OS:

  • Windows
  • Linux

Standards:

See also

Intel FPGA SDK for OpenCL (formerly Altera)

The Intel FPGA SDK for OpenCL allows the easy implementation of applications on FPGAs by abstracting away the complexities of FPGA design, allowing software programmers to write hardware-accelerated kernel functions in OpenCL C, an ANSI C-based language with additional OpenCL constructs. The SDK also provides a suite of tools that resembles the fast development flow software programmers are used to.

Supported hardware:

  • Altera FPGA

Supported OS:

  • Windows
  • Linux

Standards:

  • OpenCL 1.0 with support of SVM (OpenCL 2.0 feature) and image arrays (OpenCL 1.2 feature)

Xilinx SDAccel

The SDAccel™ development environment for OpenCL™, C, and C++ enables up to 25X better performance/watt for data center application acceleration leveraging FPGAs. SDAccel, a member of the SDx™ family, combines the industry's first architecturally optimizing compiler supporting any combination of OpenCL, C, and C++ kernels with libraries, development boards, and the first complete CPU/GPU-like development and run-time experience for FPGAs.

Supported hardware:

  • Xilinx FPGA

Qualcomm Adreno SDK

The Qualcomm Adreno SDK supports OpenCL development for the Adreno GPUs found in Snapdragon processors.

Supported hardware:

  • Adreno GPU (Snapdragon processor)

Supported OS:

  • Windows
  • Linux
  • macOS
  • Android (as a target only)

Standards:

  • OpenCL 2.0

ARM Mali OpenCL SDK

The Mali OpenCL SDK provides developers with a framework and a series of samples for developing OpenCL 1.1 applications on ARM Mali based platforms, such as the Mali-T600 and later families of GPUs. The samples cover a wide range of use cases that utilize the Mali GPU to achieve a significant improvement in performance when compared to running on the CPU alone.

Supported hardware:

  • ARM Mali based platforms such as the Mali-T600 family of GPUs

Supported OS:

  • Windows
  • Linux

Standards:

  • OpenCL 1.1

Texas Instruments

Texas Instruments supports OpenCL 1.1 on a few selected processors. A full list of supported systems is available here.

Supported hardware:

  • Selected processors (list)

Standards:

  • OpenCL 1.1

See also

Debuggers and profilers

CodeXL (Debugger, Profiler)

CodeXL is a comprehensive tool suite that enables developers to harness the benefits of CPUs, GPUs and APUs. CodeXL is available both as a Visual Studio extension and a standalone user interface application for Windows and Linux.

CodeXL also works on the ROCm platform.

Features:

  • Combined Host and GPU Debugging
    • Real-time OpenCL kernel debugging
  • AMD GPU Profiling
    • Application Timeline Trace (multiple contexts and queues, tips about redundant synchronizations, OpenCL objects leaks...)
    • GPU Performance Counters (memory transfers, cache hits, occupancy, registers usage...)
  • CPU Profiling
  • Static Kernel Analysis
  • APU/CPU/GPU power profiling
  • Remote machine profiling and debugging

See also

Intel® VTune™ Amplifier (Profiler)

Intel® VTune™ Amplifier provides advanced profiling capabilities through a single, friendly analysis interface. For media applications, it also provides powerful tools to tune OpenCL* code and the GPU.

Note

Intel® SDK for OpenCL™ Application includes Intel OpenCL™ Code Builder which allows you to build (offline), debug, and analyze OpenCL programs.

See also

Intel® Graphics Performance Analyzers (Profiler)

The Intel® SDK for OpenCL™ Applications provides integration with the Intel® Graphics Performance Analyzers (Intel® GPA), which enable you to optimize and analyze your OpenCL code in visual computing applications.

Intel GPA supports various metrics for Intel CPU and HD Graphics devices. Some metrics are specific to the rendering (Microsoft DirectX* API) pipeline only, while others are more general and can be associated with OpenCL execution.

With Intel GPA you can also inspect various important hardware counters for Intel CPU and HD Graphics devices in real time, for example:

  • Utilization of CPU cores and the execution units in Intel HD Graphics devices
  • Memory traffic for Intel HD Graphics devices
  • Power consumption, and so on

See also

Snapdragon Profiler

Snapdragon Profiler is profiling software that runs on the Windows, Mac, and Linux platforms. It connects with Android devices powered by Snapdragon processors over USB. Snapdragon Profiler allows developers to analyze CPU, GPU, DSP, memory, power, thermal, and network data, so they can find and fix performance bottlenecks.

See also

Oclgrind (Simulator, Debugger)

Oclgrind is an open source SPIR interpreter and OpenCL device simulator. The core of this project is an interpreter for SPIR, which can simulate the execution of an OpenCL kernel running on a virtual OpenCL device. Oclgrind is designed to be extensible and aims to provide a platform with which a variety of useful tools can be created to aid OpenCL application development.

At present, Oclgrind includes support for the following features:

  • Detecting memory access errors (such as reading/writing outside the bounds of an array)
  • Data-race detection
  • Interactive kernel debugging
  • Detecting work-group divergence (barriers and async copies)
  • Collecting histograms of instructions executed
  • Logging OpenCL runtime API errors

See also

Libraries

A crucial part of every technology and programming language is its libraries. They extend the language, simplify it, add new features, and help you build your applications faster. The same goes for OpenCL. This chapter presents a few selected libraries which, at the time of writing, seemed popular, useful and reliable.

Wrappers

OpenCL Host API C++ bindings.

For many large applications C++ is the language of choice and so it seems reasonable to define C++ bindings for OpenCL.

The interface is contained within a single C++ header file, cl2.hpp, and all definitions are contained within the namespace cl. There is no additional requirement to include cl.h, and no need to choose between the C++ and the original C bindings; it is enough to simply include cl2.hpp.

The bindings themselves are lightweight and correspond closely to the underlying C API. Using the C++ bindings introduces no additional execution overhead.
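As a minimal, hedged sketch of what host code looks like with the C++ bindings (the kernel source and the plus_one name are illustrative), incrementing every element of a buffer could be written as:

#define CL_HPP_TARGET_OPENCL_VERSION 200
#include <CL/cl2.hpp>
#include <iostream>
#include <vector>

int main()
{
    // Create a context and a queue on the default device.
    cl::Context context(CL_DEVICE_TYPE_DEFAULT);
    cl::CommandQueue queue(context);

    // Build a trivial kernel directly from a source string.
    cl::Program program(context,
        "__kernel void plus_one(__global int *v)"
        "{ v[get_global_id(0)] += 1; }",
        true /* build immediately */);

    // Create a buffer initialized from host data.
    std::vector<int> host(16, 41);
    cl::Buffer buffer(context, host.begin(), host.end(), false /* read-write */);

    // Enqueue one work-item per element, then copy the results back.
    cl::KernelFunctor<cl::Buffer> plus_one(program, "plus_one");
    plus_one(cl::EnqueueArgs(queue, cl::NDRange(host.size())), buffer);
    cl::copy(queue, buffer, host.begin(), host.end());

    std::cout << host[0] << std::endl; // prints 42
}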

Boost.Compute is an official Boost library, included in the Boost C++ Libraries package since version 1.61. It is a comprehensive wrapper for OpenCL 1.0 - 2.0. It provides STL-like algorithms and various helper features (functions, classes) which make developing in a C++/OpenCL environment much faster.

Boost.Compute is a GPU/parallel-computing library for C++ based on OpenCL.

The core library is a thin C++ wrapper over the OpenCL API and provides access to compute devices, contexts, command queues and memory buffers.

On top of the core library is a generic, STL-like interface providing common algorithms (e.g. transform(), accumulate(), sort()) along with common containers (e.g. vector<T>, flat_set<T>). It also features a number of extensions including parallel-computing algorithms (e.g. exclusive_scan(), scatter(), reduce()) and a number of fancy iterators (e.g. transform_iterator<>, permutation_iterator<>, zip_iterator<>).
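A brief, hedged sketch of this STL-like interface (essentially the library's canonical introductory example): computing the square root of a vector of floats on the default device might look like this:

#include <algorithm>
#include <cstdlib>
#include <vector>
#include <boost/compute.hpp>

namespace compute = boost::compute;

int main()
{
    // Get the default device and set up a context and a queue.
    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    // Generate random data on the host.
    std::vector<float> host_vector(10000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    // Create a vector on the device and copy the data to it.
    compute::vector<float> device_vector(host_vector.size(), context);
    compute::copy(host_vector.begin(), host_vector.end(),
                  device_vector.begin(), queue);

    // Calculate the square root of each element in place.
    compute::transform(device_vector.begin(), device_vector.end(),
                       device_vector.begin(), compute::sqrt<float>(), queue);

    // Copy the results back to the host.
    compute::copy(device_vector.begin(), device_vector.end(),
                  host_vector.begin(), queue);
    return 0;
}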

Aparapi enables Java developers to define OpenCL kernels and execute them on GPU and APU devices.

Aparapi allows Java developers to take advantage of the compute power of GPU and APU devices by executing data parallel code fragments on the GPU rather than being confined to the local CPU. It does this by converting Java bytecode to OpenCL at runtime and executing it on the GPU. If for any reason Aparapi can't execute on the GPU, it will execute in a Java thread pool instead.

We like to think that for the appropriate workload this extends Java's 'Write Once Run Anywhere' to include GPU devices.

Note

You can find more OpenCL wrappers/bindings in the StreamComputing blog post OpenCL Wrappers.

Specialized libraries

VexCL was developed with scientific computing in mind. It provides a custom DSL to simplify the use of various accelerators for vector arithmetic, reductions, sparse matrix-vector products, etc. It has some, but not all, features of a wrapper library. However, it is possible to define generic C++ algorithms with its DSL and to write custom kernels using VexCL. It supports both OpenCL and CUDA.

VexCL is a vector expression template library for OpenCL/CUDA. It has been created for ease of GPGPU development with C++. VexCL strives to reduce the amount of boilerplate code needed to develop GPGPU applications. The library provides convenient and intuitive notation for vector arithmetic, reductions, sparse matrix-vector products, etc. Multi-device and even multi-platform computations are supported. The source code of the library is distributed under the very permissive MIT license.
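As a small, hedged sketch of the DSL (based on VexCL's introductory examples; the vector names are illustrative), a whole vector expression is compiled into a single compute kernel behind the scenes:

#include <vexcl/vexcl.hpp>

int main()
{
    const size_t n = 1024 * 1024;

    // Initialize a context on all devices that support double precision.
    vex::Context ctx(vex::Filter::DoublePrecision);

    // Allocate device vectors, possibly spanning several devices.
    vex::vector<double> A(ctx, n), B(ctx, n), C(ctx, n);
    A = 1.0;
    B = 2.0;

    // The expression below is translated into a single OpenCL kernel.
    C = A + 42 * sin(B);
    return 0;
}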

ArrayFire is a general-purpose library that simplifies the process of developing software that targets parallel and massively-parallel architectures including CPUs, GPUs, and other hardware acceleration devices.

ViennaCL is a free open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL, and OpenMP (including switches at runtime).

clMathLibraries is the name of a group of math libraries with an OpenCL backend. Also see StreamComputing: OpenCL alternatives for CUDA Linear Algebra Libraries.

  • clFFT - a software library containing FFT functions written in OpenCL
  • clBLAS - a software library containing BLAS functions written in OpenCL
  • clSPARSE - a software library containing sparse functions written in OpenCL
  • clRNG - an OpenCL based software library containing random number generation functions
Note

More complete lists of OpenCL libraries, bindings and toolkits can be found at:

Specification

The OpenCL specification documents can be found on the Khronos OpenCL Registry. They are a must-read for everyone who wants to dive into OpenCL programming and a great reference for the experienced OpenCL developer.

The OpenCL Architecture

A crucial part of The OpenCL Specification is the chapter about the OpenCL architecture. It is very important to understand at least its basic principles. Without this knowledge it is possible neither to program efficiently using OpenCL nor to use it to its full capabilities.

The architecture is described by four models:

  • Platform Model
  • Memory Model
  • Programming Model
  • Execution Model

The platform model describes in general terms what the OpenCL framework looks like and gives the definitions of a host and an OpenCL device. The memory model explains the different memory regions and memory objects introduced in OpenCL. More advanced sections about the memory model present the details of how the shared virtual memory mechanism works, along with the memory consistency model and ordering rules. Since OpenCL 2.0 the programming model does not have its own section; the specification states that the OpenCL execution model supports both data parallel and task parallel programming models.
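As a hedged illustration of those memory regions (the kernel and variable names are illustrative), OpenCL C exposes them directly through address space qualifiers:

// OpenCL C address space qualifiers map onto the memory model's regions.
__kernel void regions(__global float *g,    // global memory
                      __constant float *c,  // constant memory
                      __local float *l)     // local memory, shared per work-group
{
    float p = g[get_global_id(0)];          // 'p' lives in private memory
    l[get_local_id(0)] = p + c[0];
    barrier(CLK_LOCAL_MEM_FENCE);           // synchronize the work-group
}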

The execution model is the model that explains how the OpenCL framework really works. It defines the responsibilities of a host program, it defines what a command queue and a kernel are, and it describes how the execution of an OpenCL kernel on a device works. In order to understand OpenCL you must carefully read and understand this part of the specification in particular, and know how workloads are mapped onto devices, when synchronization happens, and what an index space is.
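To make the index space idea concrete, here is a small, hedged OpenCL C sketch (the kernel name and arguments are illustrative): when the host enqueues this kernel over a two-dimensional NDRange of width × height work-items, each work-item obtains its unique coordinates from the index space:

// Each work-item processes one pixel of a width x height image.
__kernel void scale_pixels(__global float *img, int width, float factor)
{
    size_t x = get_global_id(0); // column index within the NDRange
    size_t y = get_global_id(1); // row index within the NDRange
    img[y * width + x] *= factor;
}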

Getting Started

Tutorials

ToDo

CL-basic is a C prototype to help you get started with creating your first simple OpenCL application. It offers simplified host-code OpenCL API functions and a sample OpenCL kernel that you can reference to get started quickly! In addition, this prototype can be compiled under both Windows and Linux-based systems thanks to the use of CMake.

Host-code functions can be found in the files cl_util.c and cl_util.h.

For getting platform and/or device information, the following functions are available:

void PrintPlatformName(cl_platform_id platform);
void PrintDeviceName(cl_device_id device);
int PrintOpenCLInfo();
void SelectOpenCLPlatformAndDevice(cl_platform_id* pPlatform, cl_device_id* pDevice);

For creating and releasing an OpenCL context, the following functions are available:

cl_context CreateOpenCLContext(cl_platform_id platform, cl_device_id device);
void ReleaseOpenCLContext(cl_context *pContext);

For creating and releasing an OpenCL command queue, the following functions are available:

cl_command_queue CreateOpenCLQueue(cl_device_id device, cl_context context);
void ReleaseOpenCLQueue(cl_command_queue *pQueue);

For creating and releasing an OpenCL buffer allocated on an OpenCL device, the following functions are available:

cl_mem CreateDeviceBuffer(cl_context context, size_t sizeInBytes);
void ReleaseDeviceBuffer(cl_mem *pDeviceBuffer);

For copying data from the host to the device, or from the device to the host, the following functions are available:

void CopyHostToDevice(void* hostBuffer, cl_mem deviceBuffer, size_t sizeInBytes, cl_command_queue queue, cl_bool blocking);
void CopyDeviceToHost(cl_mem deviceBuffer, void* hostBuffer, size_t sizeInBytes, cl_command_queue queue, cl_bool blocking);

For loading OpenCL source code from a file, and for creating and releasing the OpenCL program, the following functions are available:

char* LoadOpenCLSourceFromFile(char* filePath, size_t *pSourceLength);
cl_program CreateAndBuildProgram(cl_context context, char* sourceCode, size_t sourceCodeLength);
void ReleaseProgram(cl_program *pProgram);

For creating and releasing OpenCL kernels, the following functions are available:

cl_kernel CreateKernel(cl_program program, char* kernelName);
void ReleaseKernel(cl_kernel *pKernel);

A macro for checking error values has also been prepared to make it easier to translate an error code returned by an OpenCL host function. It can be used as below:

clError = clEnqueueNDRangeKernel(queue, simpleFunctionKernel, workDim, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
CHECK_OCL_ERR("clEnqueueNDRangeKernel", clError);

Refer to main.cpp for a reference of how these host-code functions are used, and OpenCLKernels.cl for how the OpenCL kernel is written.
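Putting the helpers together, a minimal host program might look like the hedged sketch below. It assumes OpenCLKernels.cl contains a kernel named simpleFunction that takes a single buffer argument (both assumptions, not guaranteed by the repository); only the CL-basic helpers listed above and standard OpenCL calls are used:

#include <stdio.h>
#include <stdlib.h>
#include "cl_util.h"   /* the CL-basic helpers listed above */

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int clError;
    float hostData[1024] = { 0.0f };
    size_t count = 1024;
    size_t sizeInBytes = count * sizeof(float);

    /* 1. Pick a platform and a device, then create a context and a queue. */
    SelectOpenCLPlatformAndDevice(&platform, &device);
    cl_context context = CreateOpenCLContext(platform, device);
    cl_command_queue queue = CreateOpenCLQueue(device, context);

    /* 2. Create a device buffer and copy the input data into it. */
    cl_mem deviceBuffer = CreateDeviceBuffer(context, sizeInBytes);
    CopyHostToDevice(hostData, deviceBuffer, sizeInBytes, queue, CL_TRUE);

    /* 3. Load the source, build the program, and create the kernel. */
    size_t sourceLength;
    char *source = LoadOpenCLSourceFromFile("OpenCLKernels.cl", &sourceLength);
    cl_program program = CreateAndBuildProgram(context, source, sourceLength);
    cl_kernel kernel = CreateKernel(program, "simpleFunction");
    free(source);

    /* 4. Set the kernel arguments and run one work-item per element. */
    clError = clSetKernelArg(kernel, 0, sizeof(cl_mem), &deviceBuffer);
    CHECK_OCL_ERR("clSetKernelArg", clError);

    size_t globalWorkSize = count;
    clError = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                     &globalWorkSize, NULL, 0, NULL, NULL);
    CHECK_OCL_ERR("clEnqueueNDRangeKernel", clError);

    /* 5. Read the results back (blocking copy) and clean up. */
    CopyDeviceToHost(deviceBuffer, hostData, sizeInBytes, queue, CL_TRUE);

    ReleaseKernel(&kernel);
    ReleaseProgram(&program);
    ReleaseDeviceBuffer(&deviceBuffer);
    ReleaseOpenCLQueue(&queue);
    ReleaseOpenCLContext(&context);
    return 0;
}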

ToDo

Profiling

Using OpenCL events

The OpenCL framework has a built-in feature that can provide information about the execution times of enqueued commands. By linking a cl_event object to an OpenCL command, like clEnqueueNDRangeKernel, it is possible to use the clGetEventProfilingInfo function to get the start and end times of that command's execution and calculate the performance. Profiling of OpenCL commands can be enabled by using a command queue created with the CL_QUEUE_PROFILING_ENABLE flag set in the properties argument.

Supported queries

cl_int clGetEventProfilingInfo(cl_event event,
                               cl_profiling_info param_name,
                               size_t param_value_size,
                               void *param_value,
                               size_t *param_value_size_ret)

CL_PROFILING_COMMAND_QUEUED

A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event is enqueued in a command-queue by the host.

CL_PROFILING_COMMAND_SUBMIT

A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event that has been enqueued is submitted by the host to the device associated with the command-queue.

CL_PROFILING_COMMAND_START

A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event starts execution on the device.

CL_PROFILING_COMMAND_END

A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event has finished execution on the device.

CL_PROFILING_COMMAND_COMPLETE (since OpenCL 2.1)

A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event and any child commands enqueued by this command on the device have finished execution.

Example using OpenCL C API and CL-Basic:

First, the command queue used to execute kernels must be created with the CL_QUEUE_PROFILING_ENABLE flag:

#ifdef CL_VERSION_2_0
    cl_queue_properties prop[] = { CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0 };

    queue = clCreateCommandQueueWithProperties(context, device, prop, &clError);
    CHECK_OCL_ERR("clCreateCommandQueueWithProperties", clError);
#else
    cl_command_queue_properties prop = 0;
    prop |= CL_QUEUE_PROFILING_ENABLE;

    queue = clCreateCommandQueue(context, device, prop, &clError);
    CHECK_OCL_ERR("clCreateCommandQueue", clError);
#endif

Then, you can enqueue a kernel, wait for its completion, and use the associated event object and the clGetEventProfilingInfo function to calculate the execution time:

cl_event event;

err = clEnqueueNDRangeKernel(
    queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, &event
);
CHECK_OCL_ERR("clEnqueueNDRangeKernel", err);

err = clWaitForEvents(1, &event);
CHECK_OCL_ERR("clWaitForEvents", err);

cl_ulong start_time, end_time;
err = clGetEventProfilingInfo(
    event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start_time, NULL
);
CHECK_OCL_ERR("clGetEventProfilingInfo", err);
err = clGetEventProfilingInfo(
    event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end_time, NULL
);
CHECK_OCL_ERR("clGetEventProfilingInfo", err);

printf("%lu\n", end_time - start_time);

err = clReleaseEvent(event);
CHECK_OCL_ERR("clReleaseEvent", err);

Benchmarking on host

If you can't or don't want to use the CL_QUEUE_PROFILING_ENABLE flag, it is also possible to benchmark kernels, or in general any OpenCL function, using the facilities of the host language you are using OpenCL with. However, in that case it is important to always run the kernel multiple times in order to get a reliable average execution time.

C++11 Example (with CL-basic)

#include <chrono>
#include <iostream>

(...)

using ms_duration_type = std::chrono::duration<double, std::milli>;
ms_duration_type sum_durations(0);

// Remember to compile the kernel before benchmarking.

const size_t iterations_count = 100;
for (size_t iteration = 0; iteration < iterations_count; iteration++)
{
    cl_event event;

    // Preparation: set kernel arguments, fill input buffers etc.

    auto start_time =
        std::chrono::high_resolution_clock::now();

    err = clEnqueueNDRangeKernel(
        queue, kernel, 1,
        NULL, &global_work_size, &local_work_size,
        0, NULL, &event
    );
    CHECK_OCL_ERR("clEnqueueNDRangeKernel", err);

    err = clWaitForEvents(1, &event);
    CHECK_OCL_ERR("clWaitForEvents", err);

    ms_duration_type duration = 
        std::chrono::high_resolution_clock::now() - start_time;
    sum_durations += duration;

    err = clReleaseEvent(event);
    CHECK_OCL_ERR("clReleaseEvent", err);
}

auto avg_duration = sum_durations / iterations_count;
std::cout << avg_duration.count() << " ms\n";

(...)

C++11 Example (with Boost.Compute)

#include <chrono>
#include <iostream>
#include <boost/compute.hpp>

(...)

using ms_duration_type = std::chrono::duration<double, std::milli>;
ms_duration_type sum_durations(0);

// Remember to compile the kernel before benchmarking.

const size_t iterations_count = 100;
for (size_t iteration = 0; iteration < iterations_count; iteration++)
{
    // Preparation: set kernel arguments, fill input buffers etc.

    auto start_time =
        std::chrono::high_resolution_clock::now();

    queue.enqueue_1d_range_kernel(
        kernel, 0 /* offset */, global_work_size, local_work_size
    ).wait();

    ms_duration_type duration = 
        std::chrono::high_resolution_clock::now() - start_time;
    sum_durations += duration;
}

auto avg_duration = sum_durations / iterations_count;
std::cout << avg_duration.count() << " ms\n";

(...)

Bibliography