OpenCL.org Initiative

ToDo: What is OpenCL.org initiative?

Acknowledgements

Table Of Contents

  1. OpenCL Introduction
  2. Platforms and tools
  3. Libraries
  4. Specification
  5. Getting Started
  6. Profiling
  7. Bibliography

OpenCL Introduction

What is OpenCL?

OpenCL is an open, royalty-free industry standard that enables much faster computation through parallel programming. The standard is maintained by the non-profit standards organisation Khronos. For example, by using this technology with graphics cards and modern multi-core processors it is possible to convert a video in 20 minutes instead of 2 hours, or to analyze the spectra of hundreds of thousands of stars in minutes instead of several hours.

How does it work?

OpenCL is an extension to existing languages. It makes it possible to specify a piece of code that is executed multiple times, independently of the other executions. This code can run on various processors – not only the main CPU. OpenCL also has built-in support for vector types (float2, short4, int8, long16, etc.), and operations on those types can be mapped to the vector extensions of modern CPUs (SSE, AVX).
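For instance, here is a hedged snippet of OpenCL C (illustrative only, to be placed inside a kernel body) showing the built-in vector types; the arithmetic applies component-wise and can map onto SIMD units:

// Component-wise arithmetic on built-in OpenCL C vector types.
float4 a = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
float4 b = (float4)(0.5f);       // broadcast 0.5f to all four components
float4 c = a * b + 1.0f;         // may compile to a single SSE/AVX operation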

For example, suppose you need to calculate sin(x) for a large array A of one million numbers. Based on the information provided by the OpenCL framework you can pick the best device, or even several devices, and send the data to the device(s). Normally you would loop over the million numbers, one number after another, but with OpenCL you can just say: "For each x in array A give me sin(x)", and each x is processed in parallel. When finished, you can take the data back from the device(s) or proceed with other computations.
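As a minimal sketch of what such a data-parallel piece of code looks like (the names compute_sin, A and result are illustrative, not from any particular codebase), the "for each x" part could be written as the following OpenCL C kernel:

__kernel void compute_sin(__global const float *A, __global float *result)
{
    // Each work-item processes exactly one element of the array.
    size_t i = get_global_id(0);
    result[i] = sin(A[i]);
}

The host then enqueues this kernel with a global work size of one million, and the runtime distributes the work-items over the device's compute units.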

The OpenCL framework is great at exposing the parallel nature of various compute devices: x86 CPUs, GPUs, FPGAs, DSPs. This can significantly lower the total execution time compared to conventional sequential methods.

5 questions on OpenCL

Q: Why is it so fast?

The OpenCL framework gives programmers tools and features which enable them to implement efficient parallel algorithms. It does this by providing direct access to the cores of multi-core CPUs and to the hundreds of little processors on graphics cards.

Q: Does it work on any type of hardware?

As it is an open standard, it can work on any type of hardware designed for parallel execution. This can be a CPU, GPU, DSP or FPGA. You can read more in the Platforms and tools chapter.

Q: How does it compare to OpenMP/MPI?

Where OpenMP and MPI try to split loops over threads/servers and are CPU-oriented, OpenCL focuses on making threads aware of the position of their data and on exploiting the capabilities of the processor.

Q: Does it replace C or C++?

No, it is a framework which integrates well with C, C++, Python, Java and more.

Q: How stable/mature is OpenCL?

The first version of OpenCL was released in December 2008. The latest version is OpenCL 2.1 from November 2015. The standard is being actively developed. The next version (2.2), which currently has provisional status, will include the OpenCL C++ kernel language based on the C++14 standard.

Platforms and tools

SDKs And OpenCL Implementations

AMD APP SDK

AMD OpenCL™ Accelerated Parallel Processing (APP) technology is a set of advanced hardware and software technologies that enable AMD graphics processing cores (GPU), working in concert with the system's x86 cores (CPU), to execute heterogeneously and accelerate many applications beyond just graphics.

The SDK provides samples, documentation, and other materials. GPU drivers must be installed in order to run OpenCL programs on AMD GPUs.

Note

The AMD Linux Catalyst driver has not been updated since 18 December 2015 (Crimson Edition 15.12). We recommend installing and testing the AMDGPU-PRO driver (OpenCL 1.2 only), or trying AMD ROCm.

Supported hardware:

  • AMD GPU
  • AMD APU
  • x86 CPU

Supported OS:

  • Windows
  • Linux

Standards:

  • OpenCL 2.0 (AMD Catalyst/Crimson)
  • OpenCL 1.2 (AMDGPU-PRO)

See also

AMD ROCm (Partially Open Source)

ROCm is an open-source platform for GPU computing. ROCm 1.4 includes a developer preview of OpenCL support (not yet open source in this release).

Supported hardware:

  • AMD GPU (limited list, GFX7 and GFX8 only - Hawaii, Fiji, Polaris)

Supported OS:

  • Linux (with a special ROCm kernel)

Standards:

  • OpenCL 2.0 compatible kernel language and OpenCL 1.2 compatible runtime

See also

Intel® SDK for OpenCL™ Applications

The Intel® SDK for OpenCL™ Applications is a comprehensive development environment for developing and optimizing OpenCL™ applications on Intel® platforms.

The SDK supports offloading compute-intensive parallel workloads to Intel® Graphics Technology using an advanced OpenCL™ kernel compiler, runtime debugger and code performance analyzer.

The Intel® SDK for OpenCL™ Applications includes the Intel OpenCL™ Code Builder, which allows you to build (offline), debug, and analyze OpenCL programs. You must install the Intel OpenCL drivers for CPUs and GPUs separately, since the SDK itself only includes an experimental OpenCL 2.1 platform with a driver for a CPU device.

Note

Since the Intel SDK supports OpenCL 2.1 (and therefore SPIR-V 1.0), it is currently the only SDK that lets you run OpenCL C++ kernels (not all features work).

Supported hardware:

  • Intel® Graphics (GPU), Windows only
  • Intel® Processors (CPU)
  • Intel® Xeon Phi™ Coprocessors

Supported OS:

  • Windows
  • Linux
  • Android (as a target only)

Standards:

  • OpenCL 2.0 & 1.2
  • OpenCL 2.1 (CPU only) with SPIR and SPIR-V support

See also

Beignet for Intel GPUs on Linux (Open Source)

Beignet contains the code to run OpenCL programs on Intel GPUs. It defines and implements the host functions required to initialize the device, create the command queues, kernels, and programs, and run them on the GPU. It also contains the compiler part of the stack.

Supported hardware:

  • Intel® Graphics (GPU)

Supported OS:

  • Linux
  • Android

Standards:

  • OpenCL 2.0 & 1.2

See also

NVIDIA

NVIDIA does not have a separate OpenCL SDK, but the CUDA Toolkit contains OpenCL headers and shared libraries. OpenCL support is included in NVIDIA GPU drivers.

OpenCL samples with source code for Windows, Linux and macOS are available at https://developer.nvidia.com/opencl

Supported hardware:

  • NVIDIA GPUs

Supported OS:

  • Windows
  • Linux
  • macOS

Standards:

  • OpenCL 1.2
  • Beta support for OpenCL 2.0

See also

Portable Computing Language (pocl, Open Source)

Portable Computing Language (pocl) aims to become an MIT-licensed open source implementation of the OpenCL standard which can be easily adapted for new targets and devices, both homogeneous CPUs and heterogeneous GPUs/accelerators.

pocl uses Clang as an OpenCL C frontend and LLVM for the kernel compiler implementation, and as a portability layer. Thus, if your desired target has an LLVM backend, it should be able to get OpenCL support easily by using pocl.

Supported hardware:

  • x86 CPU
  • HSA targets

Supported OS:

  • Windows
  • Linux

Standards:

See also

Intel FPGA SDK for OpenCL (formerly Altera)

The Intel FPGA SDK for OpenCL allows the easy implementation of applications on FPGAs by abstracting away the complexities of FPGA design, allowing software programmers to write hardware-accelerated kernel functions in OpenCL C, an ANSI C-based language with additional OpenCL constructs. The SDK also provides a suite of tools that resembles the fast development flow software programmers are used to.

Supported hardware:

  • Altera FPGA

Supported OS:

  • Windows
  • Linux

Standards:

  • OpenCL 1.0 with support of SVM (OpenCL 2.0 feature) and image arrays (OpenCL 1.2 feature)

Xilinx SDAccel

The SDAccel™ development environment for OpenCL™, C, and C++ enables up to 25X better performance/watt for data center application acceleration leveraging FPGAs. SDAccel, a member of the SDx™ family, combines the industry's first architecturally optimizing compiler supporting any combination of OpenCL, C, and C++ kernels with libraries, development boards, and the first complete CPU/GPU-like development and run-time experience for FPGAs.

Supported hardware:

  • Xilinx FPGA

Qualcomm Adreno SDK

The Qualcomm Adreno SDK supports OpenCL development for the Adreno GPUs found in Snapdragon processors.

Supported hardware:

  • Adreno GPU (Snapdragon processor)

Supported OS:

  • Windows
  • Linux
  • macOS
  • Android (as a target only)

Standards:

  • OpenCL 2.0

ARM Mali OpenCL SDK

The Mali OpenCL SDK provides developers with a framework and a series of samples for developing OpenCL 1.1 applications on ARM Mali based platforms, such as the Mali-T600 and later families of GPUs. The samples cover a wide range of use cases that utilize the Mali GPU to achieve a significant improvement in performance when compared to running on the CPU alone.

Supported hardware:

  • ARM Mali based platforms such as the Mali-T600 family of GPUs

Supported OS:

  • Windows
  • Linux

Standards:

  • OpenCL 1.1

Texas Instruments

Texas Instruments supports OpenCL 1.1 on a few selected processors. A full list of supported systems is available here.

Supported hardware:

  • Selected processors (list)

Standards:

  • OpenCL 1.1

See also

Debuggers and profilers

CodeXL (Debugger, Profiler)

CodeXL is a comprehensive tool suite that enables developers to harness the benefits of CPUs, GPUs and APUs. CodeXL is available both as a Visual Studio extension and a standalone user interface application for Windows and Linux.

CodeXL also works on the ROCm platform.

Features:

  • Combined Host and GPU Debugging
    • Real-time OpenCL kernel debugging
  • AMD GPU Profiling
    • Application Timeline Trace (multiple contexts and queues, tips about redundant synchronizations, OpenCL objects leaks...)
    • GPU Performance Counters (memory transfers, cache hits, occupancy, registers usage...)
  • CPU Profiling
  • Static Kernel Analysis
  • APU/CPU/GPU power profiling
  • Remote machine profiling and debugging

See also

Intel® VTune™ Amplifier (Profiler)

Intel® VTune™ Amplifier provides advanced profiling capabilities through a single, friendly analysis interface. For media applications, it also provides powerful tools to tune OpenCL* code and the GPU.

Note

Intel® SDK for OpenCL™ Application includes Intel OpenCL™ Code Builder which allows you to build (offline), debug, and analyze OpenCL programs.

See also

Intel® Graphics Performance Analyzers (Profiler)

The Intel® SDK for OpenCL™ Applications provides integration with the Intel® Graphics Performance Analyzers (Intel® GPA), which enable you to optimize and analyze your OpenCL code in visual computing applications.

Intel GPA supports various metrics for Intel CPU and HD Graphics devices. Some metrics are specific to the rendering (Microsoft DirectX* API) pipeline only, while others are more general and can be associated with OpenCL execution.

With Intel GPA you can also inspect various important hardware counters for Intel CPU and HD Graphics devices in real time, for example:

  • Utilization of CPU cores and the execution units in Intel HD Graphics devices
  • Memory traffic for Intel HD Graphics devices
  • Power consumption, and so on

See also

Snapdragon Profiler

Snapdragon Profiler is profiling software that runs on the Windows, Mac, and Linux platforms. It connects with Android devices powered by Snapdragon processors over USB. Snapdragon Profiler allows developers to analyze CPU, GPU, DSP, memory, power, thermal, and network data, so they can find and fix performance bottlenecks.

See also

Oclgrind (Simulator, Debugger)

Oclgrind is an open source SPIR interpreter and OpenCL device simulator. The core of this project is an interpreter for SPIR, which can simulate the execution of an OpenCL kernel running on a virtual OpenCL device. Oclgrind is designed to be extensible and aims to provide a platform with which a variety of useful tools can be created to aid OpenCL application development.

At present, Oclgrind includes support for the following features:

  • Detecting memory access errors (such as reading/writing outside the bounds of an array)
  • Data-race detection
  • Interactive kernel debugging
  • Detecting work-group divergence (barriers and async copies)
  • Collecting histograms of instructions executed
  • Logging OpenCL runtime API errors

See also

Libraries

A crucial part of every technology and programming language is its libraries. They extend the language, simplify it, add new features, and help you build your applications faster. The same goes for OpenCL. This chapter presents a few selected libraries which, at the time of writing, seemed popular, useful and reliable.

Wrappers

OpenCL Host API C++ bindings.

For many large applications C++ is the language of choice and so it seems reasonable to define C++ bindings for OpenCL.

The interface is contained within a single C++ header file, cl2.hpp, and all definitions are contained within the namespace cl. There is no additional requirement to include cl.h, and no need to choose between the C++ and the original C bindings; it is enough to simply include cl2.hpp.

The bindings themselves are lightweight and correspond closely to the underlying C API. Using the C++ bindings introduces no additional execution overhead.
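As a minimal, hedged sketch of what host code looks like with the C++ bindings (the kernel source and the plus_one name are illustrative), incrementing every element of a buffer could be written as:

#define CL_HPP_TARGET_OPENCL_VERSION 200
#include <CL/cl2.hpp>
#include <iostream>
#include <vector>

int main()
{
    // Create a context and a queue on the default device.
    cl::Context context(CL_DEVICE_TYPE_DEFAULT);
    cl::CommandQueue queue(context);

    // Build a trivial kernel directly from a source string.
    cl::Program program(context,
        "__kernel void plus_one(__global int *v)"
        "{ v[get_global_id(0)] += 1; }",
        true /* build immediately */);

    // Create a buffer initialized from host data.
    std::vector<int> host(16, 41);
    cl::Buffer buffer(context, host.begin(), host.end(), false /* read-write */);

    // Enqueue one work-item per element, then copy the results back.
    cl::KernelFunctor<cl::Buffer> plus_one(program, "plus_one");
    plus_one(cl::EnqueueArgs(queue, cl::NDRange(host.size())), buffer);
    cl::copy(queue, buffer, host.begin(), host.end());

    std::cout << host[0] << std::endl; // prints 42
}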

Boost.Compute is an official Boost library, included in the Boost C++ Libraries package since version 1.61. It is a comprehensive wrapper for OpenCL 1.0 - 2.0. It provides STL-like algorithms and various helper features (functions, classes) which make developing in a C++/OpenCL environment much faster.

Boost.Compute is a GPU/parallel-computing library for C++ based on OpenCL.

The core library is a thin C++ wrapper over the OpenCL API and provides access to compute devices, contexts, command queues and memory buffers.

On top of the core library is a generic, STL-like interface providing common algorithms (e.g. transform(), accumulate(), sort()) along with common containers (e.g. vector<T>, flat_set<T>). It also features a number of extensions including parallel-computing algorithms (e.g. exclusive_scan(), scatter(), reduce()) and a number of fancy iterators (e.g. transform_iterator<>, permutation_iterator<>, zip_iterator<>).
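A brief, hedged sketch of this STL-like interface (essentially the library's canonical introductory example): computing the square root of a vector of floats on the default device might look like this:

#include <algorithm>
#include <cstdlib>
#include <vector>
#include <boost/compute.hpp>

namespace compute = boost::compute;

int main()
{
    // Get the default device and set up a context and a queue.
    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    // Generate random data on the host.
    std::vector<float> host_vector(10000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    // Create a vector on the device and copy the data to it.
    compute::vector<float> device_vector(host_vector.size(), context);
    compute::copy(host_vector.begin(), host_vector.end(),
                  device_vector.begin(), queue);

    // Calculate the square root of each element in place.
    compute::transform(device_vector.begin(), device_vector.end(),
                       device_vector.begin(), compute::sqrt<float>(), queue);

    // Copy the results back to the host.
    compute::copy(device_vector.begin(), device_vector.end(),
                  host_vector.begin(), queue);
    return 0;
}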

Aparapi enables Java developers to define OpenCL kernels and execute them on GPU and APU devices.

Aparapi allows Java developers to take advantage of the compute power of GPU and APU devices by executing data parallel code fragments on the GPU rather than being confined to the local CPU. It does this by converting Java bytecode to OpenCL at runtime and executing it on the GPU. If for any reason Aparapi can't execute on the GPU, it will execute in a Java thread pool instead.

We like to think that for the appropriate workload this extends Java's 'Write Once Run Anywhere' to include GPU devices.

Note

You can find more OpenCL wrappers/bindings in the StreamComputing blog post OpenCL Wrappers.

Specialized libraries

VexCL was developed with scientific computing in mind. It provides a custom DSL to simplify the use of various accelerators for vector arithmetic, reductions, sparse matrix-vector products, etc. It has some, but not all, features of a wrapper library. However, it is possible to define generic C++ algorithms with its DSL and to write custom kernels using VexCL. It supports both OpenCL and CUDA.

VexCL is a vector expression template library for OpenCL/CUDA. It has been created for ease of GPGPU development with C++. VexCL strives to reduce the amount of boilerplate code needed to develop GPGPU applications. The library provides convenient and intuitive notation for vector arithmetic, reductions, sparse matrix-vector products, etc. Multi-device and even multi-platform computations are supported. The source code of the library is distributed under the very permissive MIT license.
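As a small, hedged sketch of the DSL (based on VexCL's introductory examples; the vector names are illustrative), a whole vector expression is compiled into a single compute kernel behind the scenes:

#include <vexcl/vexcl.hpp>

int main()
{
    const size_t n = 1024 * 1024;

    // Initialize a context on all devices that support double precision.
    vex::Context ctx(vex::Filter::DoublePrecision);

    // Allocate device vectors, possibly spanning several devices.
    vex::vector<double> A(ctx, n), B(ctx, n), C(ctx, n);
    A = 1.0;
    B = 2.0;

    // The expression below is translated into a single OpenCL kernel.
    C = A + 42 * sin(B);
    return 0;
}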

ArrayFire is a general-purpose library that simplifies the process of developing software that targets parallel and massively-parallel architectures including CPUs, GPUs, and other hardware acceleration devices.

ViennaCL is a free open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL, and OpenMP (including switches at runtime).

clMathLibraries is the name of a group of math libraries with an OpenCL backend. Also see StreamComputing: OpenCL alternatives for CUDA Linear Algebra Libraries.

  • clFFT - a software library containing FFT functions written in OpenCL
  • clBLAS - a software library containing BLAS functions written in OpenCL
  • clSPARSE - a software library containing sparse functions written in OpenCL
  • clRNG - an OpenCL based software library containing random number generation functions
Note

More complete lists of OpenCL libraries, bindings and toolkits can be found at:

Specification

The OpenCL specification documents can be found on the Khronos OpenCL Registry. They are a must-read for everyone who wants to dive into OpenCL programming and a great reference for the experienced OpenCL developer.

The OpenCL Architecture

A crucial part of The OpenCL Specification is the chapter about the OpenCL architecture. It is very important to understand at least its basic principles. Without this knowledge it is possible neither to program efficiently using OpenCL nor to use it to its full capabilities.

The architecture is described by four models:

  • Platform Model
  • Memory Model
  • Programming Model
  • Execution Model

The platform model describes in general terms what the OpenCL framework looks like and gives the definitions of a host and an OpenCL device. The memory model explains the different memory regions and memory objects introduced in OpenCL. More advanced sections about the memory model present the details of how the shared virtual memory mechanism works, along with the memory consistency model and ordering rules. Since OpenCL 2.0 the programming model does not have its own section; the specification states that the OpenCL execution model supports both data parallel and task parallel programming models.
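As a hedged illustration of those memory regions (the kernel and variable names are illustrative), OpenCL C exposes them directly through address space qualifiers:

// OpenCL C address space qualifiers map onto the memory model's regions.
__kernel void regions(__global float *g,    // global memory
                      __constant float *c,  // constant memory
                      __local float *l)     // local memory, shared per work-group
{
    float p = g[get_global_id(0)];          // 'p' lives in private memory
    l[get_local_id(0)] = p + c[0];
    barrier(CLK_LOCAL_MEM_FENCE);           // synchronize the work-group
}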

The execution model is the model that explains how the OpenCL framework really works. It defines the responsibilities of a host program, it defines what a command queue and a kernel are, and it describes how the execution of an OpenCL kernel on a device works. In order to understand OpenCL you must carefully read and understand this part of the specification in particular, and know how workloads are mapped onto devices, when synchronization happens, and what an index space is.
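To make the index space idea concrete, here is a small, hedged OpenCL C sketch (the kernel name and arguments are illustrative): when the host enqueues this kernel over a two-dimensional NDRange of width × height work-items, each work-item obtains its unique coordinates from the index space:

// Each work-item processes one pixel of a width x height image.
__kernel void scale_pixels(__global float *img, int width, float factor)
{
    size_t x = get_global_id(0); // column index within the NDRange
    size_t y = get_global_id(1); // row index within the NDRange
    img[y * width + x] *= factor;
}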

Getting Started

Tutorials

ToDo

CL-basic is a C prototype to help you get started with creating your first simple OpenCL application. It offers simplified host-code OpenCL API functions and a sample OpenCL kernel that you can reference to get started quickly! In addition, this prototype can be compiled under both Windows and Linux-based systems thanks to the use of CMake.

Host-code functions can be found in the files cl_util.c and cl_util.h.

For getting platform and/or device information, the following functions are available:

void PrintPlatformName(cl_platform_id platform);
void PrintDeviceName(cl_device_id device);
int PrintOpenCLInfo();
void SelectOpenCLPlatformAndDevice(cl_platform_id* pPlatform, cl_device_id* pDevice);

For creating and releasing an OpenCL context, the following functions are available:

cl_context CreateOpenCLContext(cl_platform_id platform, cl_device_id device);
void ReleaseOpenCLContext(cl_context *pContext);

For creating and releasing an OpenCL command queue, the following functions are available:

cl_command_queue CreateOpenCLQueue(cl_device_id device, cl_context context);
void ReleaseOpenCLQueue(cl_command_queue *pQueue);

For creating and releasing an OpenCL buffer allocated on an OpenCL device, the following functions are available:

cl_mem CreateDeviceBuffer(cl_context context, size_t sizeInBytes);
void ReleaseDeviceBuffer(cl_mem *pDeviceBuffer);

For copying data from the host to the device, or from the device to the host, the following functions are available:

void CopyHostToDevice(void* hostBuffer, cl_mem deviceBuffer, size_t sizeInBytes, cl_command_queue queue, cl_bool blocking);
void CopyDeviceToHost(cl_mem deviceBuffer, void* hostBuffer, size_t sizeInBytes, cl_command_queue queue, cl_bool blocking);

For loading OpenCL source code from a file, and for creating and releasing the OpenCL program, the following functions are available:

char* LoadOpenCLSourceFromFile(char* filePath, size_t *pSourceLength);
cl_program CreateAndBuildProgram(cl_context context, char* sourceCode, size_t sourceCodeLength);
void ReleaseProgram(cl_program *pProgram);

For creating and releasing OpenCL kernels, the following functions are available:

cl_kernel CreateKernel(cl_program program, char* kernelName);
void ReleaseKernel(cl_kernel *pKernel);

A macro for checking error values has also been prepared to make it easier to translate an error code returned by an OpenCL host function. It can be used as below:

clError = clEnqueueNDRangeKernel(queue, simpleFunctionKernel, workDim, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
CHECK_OCL_ERR("clEnqueueNDRangeKernel", clError);

Refer to main.cpp for a reference of how these host-code functions are used, and OpenCLKernels.cl for how the OpenCL kernel is written.
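Putting the helpers together, a minimal host program might look like the hedged sketch below. It assumes OpenCLKernels.cl contains a kernel named simpleFunction that takes a single buffer argument (both assumptions, not guaranteed by the repository); only the CL-basic helpers listed above and standard OpenCL calls are used:

#include <stdio.h>
#include <stdlib.h>
#include "cl_util.h"   /* the CL-basic helpers listed above */

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int clError;
    float hostData[1024] = { 0.0f };
    size_t count = 1024;
    size_t sizeInBytes = count * sizeof(float);

    /* 1. Pick a platform and a device, then create a context and a queue. */
    SelectOpenCLPlatformAndDevice(&platform, &device);
    cl_context context = CreateOpenCLContext(platform, device);
    cl_command_queue queue = CreateOpenCLQueue(device, context);

    /* 2. Create a device buffer and copy the input data into it. */
    cl_mem deviceBuffer = CreateDeviceBuffer(context, sizeInBytes);
    CopyHostToDevice(hostData, deviceBuffer, sizeInBytes, queue, CL_TRUE);

    /* 3. Load the source, build the program, and create the kernel. */
    size_t sourceLength;
    char *source = LoadOpenCLSourceFromFile("OpenCLKernels.cl", &sourceLength);
    cl_program program = CreateAndBuildProgram(context, source, sourceLength);
    cl_kernel kernel = CreateKernel(program, "simpleFunction");
    free(source);

    /* 4. Set the kernel arguments and run one work-item per element. */
    clError = clSetKernelArg(kernel, 0, sizeof(cl_mem), &deviceBuffer);
    CHECK_OCL_ERR("clSetKernelArg", clError);

    size_t globalWorkSize = count;
    clError = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                     &globalWorkSize, NULL, 0, NULL, NULL);
    CHECK_OCL_ERR("clEnqueueNDRangeKernel", clError);

    /* 5. Read the results back (blocking copy) and clean up. */
    CopyDeviceToHost(deviceBuffer, hostData, sizeInBytes, queue, CL_TRUE);

    ReleaseKernel(&kernel);
    ReleaseProgram(&program);
    ReleaseDeviceBuffer(&deviceBuffer);
    ReleaseOpenCLQueue(&queue);
    ReleaseOpenCLContext(&context);
    return 0;
}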

ToDo

Profiling

Using OpenCL events

The OpenCL framework has a built-in feature that can provide information about the execution times of enqueued commands. By linking a cl_event object to an OpenCL command, like clEnqueueNDRangeKernel, it is possible to use the clGetEventProfilingInfo function to get the start and end times of that command's execution and calculate the performance. Profiling of OpenCL commands can be enabled by using a command queue created with the CL_QUEUE_PROFILING_ENABLE flag set in the properties argument.

Supported queries

cl_int clGetEventProfilingInfo(cl_event event,
                               cl_profiling_info param_name,
                               size_t param_value_size,
                               void *param_value,
                               size_t *param_value_size_ret)

CL_PROFILING_COMMAND_QUEUED

A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event is enqueued in a command-queue by the host.

CL_PROFILING_COMMAND_SUBMIT

A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event that has been enqueued is submitted by the host to the device associated with the command-queue.

CL_PROFILING_COMMAND_START

A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event starts execution on the device.

CL_PROFILING_COMMAND_END

A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event has finished execution on the device.

CL_PROFILING_COMMAND_COMPLETE (since OpenCL 2.1)

A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event and any child commands enqueued by this command on the device have finished execution.

Example using OpenCL C API and CL-Basic:

First, the command queue used to execute kernels must be created with the CL_QUEUE_PROFILING_ENABLE flag:

#ifdef CL_VERSION_2_0
    cl_queue_properties prop[] = { CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0 };

    queue = clCreateCommandQueueWithProperties(context, device, prop, &clError);
    CHECK_OCL_ERR("clCreateCommandQueueWithProperties", clError);
#else
    cl_command_queue_properties prop = 0;
    prop |= CL_QUEUE_PROFILING_ENABLE;

    queue = clCreateCommandQueue(context, device, prop, &clError);
    CHECK_OCL_ERR("clCreateCommandQueue", clError);
#endif

Then, you can enqueue a kernel, wait for its completion, and use the associated event object and the clGetEventProfilingInfo function to calculate the execution time:

cl_event event;

err = clEnqueueNDRangeKernel(
    queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, &event
);
CHECK_OCL_ERR("clEnqueueNDRangeKernel", err);

err = clWaitForEvents(1, &event);
CHECK_OCL_ERR("clWaitForEvents", err);

cl_ulong start_time, end_time;
err = clGetEventProfilingInfo(
    event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start_time, NULL
);
CHECK_OCL_ERR("clGetEventProfilingInfo", err);
err = clGetEventProfilingInfo(
    event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end_time, NULL
);
CHECK_OCL_ERR("clGetEventProfilingInfo", err);

printf("%lu\n", end_time - start_time);

err = clReleaseEvent(event);
CHECK_OCL_ERR("clReleaseEvent", err);

Benchmarking on host

If you can't or don't want to use the CL_QUEUE_PROFILING_ENABLE flag, it is also possible to benchmark kernels, or in general any OpenCL function, using the facilities of the host language you are using OpenCL with. However, in that case it is important to always run the kernel multiple times in order to get a reliable average execution time.

C++11 Example (with CL-basic)

#include <chrono>
#include <iostream>

(...)

using ms_duration_type = std::chrono::duration<double, std::milli>;
ms_duration_type sum_durations(0);

// Remember to compile the kernel before benchmarking.

const size_t iterations_count = 100;
for (size_t iteration = 0; iteration < iterations_count; iteration++)
{
    cl_event event;

    // Preparation: set kernel arguments, fill input buffers etc.

    auto start_time =
        std::chrono::high_resolution_clock::now();

    err = clEnqueueNDRangeKernel(
        queue, kernel, 1,
        NULL, &global_work_size, &local_work_size,
        0, NULL, &event
    );
    CHECK_OCL_ERR("clEnqueueNDRangeKernel", err);

    err = clWaitForEvents(1, &event);
    CHECK_OCL_ERR("clWaitForEvents", err);

    ms_duration_type duration = 
        std::chrono::high_resolution_clock::now() - start_time;
    sum_durations += duration;

    err = clReleaseEvent(event);
    CHECK_OCL_ERR("clReleaseEvent", err);
}

auto avg_duration = sum_durations / iterations_count;
std::cout << avg_duration.count() << " ms\n";

(...)

C++11 Example (with Boost.Compute)

#include <chrono>
#include <iostream>
#include <boost/compute.hpp>

(...)

using ms_duration_type = std::chrono::duration<double, std::milli>;
ms_duration_type sum_durations(0);

// Remember to compile the kernel before benchmarking.

const size_t iterations_count = 100;
for (size_t iteration = 0; iteration < iterations_count; iteration++)
{
    // Preparation: set kernel arguments, fill input buffers etc.

    auto start_time =
        std::chrono::high_resolution_clock::now();

    queue.enqueue_1d_range_kernel(
        kernel, 0 /* offset */, global_work_size, local_work_size
    ).wait();

    ms_duration_type duration = 
        std::chrono::high_resolution_clock::now() - start_time;
    sum_durations += duration;
}

auto avg_duration = sum_durations / iterations_count;
std::cout << avg_duration.count() << " ms\n";

(...)

Bibliography