ToDo: What is OpenCL.org initiative?
About StreamComputing
- OpenCL Introduction
- Platforms and tools
- Libraries
- Specification
- Getting Started
- Profiling
- Bibliography
OpenCL is an open, royalty-free industry standard that makes much faster computation possible through parallel computing. The standard is maintained by the non-profit standards organisation Khronos. For example, by using this technology with graphics cards and modern multi-core processors it is possible to convert a video in 20 minutes instead of 2 hours, or to analyze the spectra of hundreds of thousands of stars in minutes instead of several hours.
OpenCL is an extension to existing languages. It makes it possible to specify a piece of code that is executed multiple times independently from each other. This code can run on various processors – not only the main CPU. Also, OpenCL has built-in support for vector types (float2, short4, int8, long16, etc), and operations on those types can be mapped to vector extensions of modern CPUs (SSE, AVX).
For example, suppose you need to calculate `sin(x)` of a large array `A` of one million numbers. Based on the information provided by the OpenCL framework you can pick the best device, or even several devices, and send the data to the device(s). Normally you would loop over the million numbers, one after another, but with OpenCL you can just say: "For each `x` in array `A`, give me `sin(x)`", and each `x` is processed in parallel. When finished, you can take the data back from the device(s) or proceed with other computations.
The OpenCL framework is great at exposing the parallel nature of various compute devices: x86 CPUs, GPUs, FPGAs, and DSPs. This can significantly lower the total execution time compared to conventional sequential methods.
Q: Why is it so fast?
The OpenCL framework gives programmers tools and features that enable them to implement efficient parallel algorithms. It does so by providing direct access to the cores of multi-core CPUs and to the hundreds of small processors on graphics cards.
Q: Does it work on any type of hardware?
As it is an open standard, it can work on any type of hardware designed for parallel execution. This can be a CPU, GPU, DSP or FPGA. You can read more in the Platforms and tools chapter.
Q: How does it compare to OpenMP/MPI?
Where OpenMP and MPI try to split loops over threads/servers and are CPU-oriented, OpenCL focuses on making threads data-position aware and on exploiting processor capabilities.
Q: Does it replace C or C++?
No, it is a framework that integrates well with C, C++, Python, Java and more.
Q: How stable/mature is OpenCL?
The first version of OpenCL was released in 2009. The latest version is OpenCL 2.1, from November 2015. The standard is being actively developed. The next version (2.2), which currently has provisional status, will include the OpenCL C++ kernel language, based on the C++14 standard.
AMD OpenCL™ Accelerated Parallel Processing (APP) technology is a set of advanced hardware and software technologies that enable AMD graphics processing cores (GPU), working in concert with the system’s x86 cores (CPU), to execute heterogeneously to accelerate many applications beyond just graphics.
The SDK provides samples, documentation, and other materials. GPU drivers must be installed in order to run OpenCL programs on AMD GPUs.
The AMD Linux Catalyst driver has not been updated since 18.12.2015 (Crimson Edition 15.12). We recommend installing and testing the AMDGPU-PRO driver (OpenCL 1.2 only), or trying AMD ROCm.
Supported hardware:
- AMD GPU
- AMD APU
- x86 CPU
Supported OS:
- Windows
- Linux
Standards:
- OpenCL 2.0 (AMD Catalyst/Crimson)
- OpenCL 1.2 (AMDGPU-PRO)
AMD ROCm (Partially Open Source)
ROCm is an open-source platform for GPU computing. ROCm 1.4 includes a developer preview of OpenCL support (not yet open source in this release).
Supported hardware:
- AMD GPU (limited list, GFX7 and GFX8 only - Hawaii, Fiji, Polaris)
Supported OS:
- Linux (with a special ROCm kernel)
Standards:
- OpenCL 2.0 compatible kernel language and OpenCL 1.2 compatible runtime
The Intel® SDK for OpenCL™ Applications is a comprehensive development environment for developing and optimizing OpenCL™ applications on Intel® platforms.
The SDK supports offloading compute-intensive parallel workloads to Intel® Graphics Technology using an advanced OpenCL™ kernel compiler, runtime debugger and code performance analyzer.
Intel® SDK for OpenCL™ Applications includes the Intel OpenCL™ Code Builder, which allows you to build (offline), debug, and analyze OpenCL programs. The Intel OpenCL drivers for CPUs and GPUs must be installed separately, since the SDK itself only includes an experimental OpenCL 2.1 platform with a driver for a CPU device.
Since the Intel SDK supports OpenCL 2.1 (and therefore SPIR-V 1.0), it is currently the only SDK that lets you run OpenCL C++ kernels (not all features work).
Supported hardware:
- Intel® Graphics (GPU), only Windows
- Intel® Processors (CPU)
- Intel® Xeon Phi™ Coprocessors
Supported OS:
- Windows
- Linux
- Android (as a target only)
Standards:
- OpenCL 2.0 & 1.2
- OpenCL 2.1 (CPU only) with SPIR and SPIR-V support
- Intel OpenCL drivers
- Free OpenCL training materials provided by Intel
- OpenCL code samples provided by Intel
Beignet for Intel GPUs on Linux (Open Source)
Beignet contains the code to run OpenCL programs on Intel GPUs: it defines and implements the host functions required to initialize the device, create the command queues, the kernels, and the programs, and run them on the GPU. It also contains the compiler part of the stack.
Supported hardware:
- Intel® Graphics (GPU)
Supported OS:
- Linux
- Android
Standards:
- OpenCL 2.0 & 1.2
NVIDIA does not have a separate OpenCL SDK, but CUDA Toolkit contains OpenCL headers and shared libraries. OpenCL support is included in NVIDIA GPU drivers.
OpenCL samples with source code for Windows, Linux and macOS are available at https://developer.nvidia.com/opencl
Supported hardware:
- NVIDIA GPUs
Supported OS:
- Windows
- Linux
- macOS
Standards:
- OpenCL 1.2
- beta-support of OpenCL 2.0
Portable Computing Language (pocl) (Open Source)
Portable Computing Language (pocl) aims to become an MIT-licensed open source implementation of the OpenCL standard which can be easily adapted for new targets and devices, both for homogeneous CPUs and heterogeneous GPUs/accelerators.
pocl uses Clang as an OpenCL C frontend and LLVM for the kernel compiler implementation, and as a portability layer. Thus, if your desired target has an LLVM backend, it should be able to get OpenCL support easily by using pocl.
Supported hardware:
- x86 CPU
- HSA targets
Supported OS:
- Windows
- Linux
Standards:
- OpenCL 1.2 with some features missing
The Intel FPGA SDK for OpenCL allows the easy implementation of applications on FPGAs by abstracting away the complexities of FPGA design, allowing software programmers to write hardware-accelerated kernel functions in OpenCL C, an ANSI C-based language with additional OpenCL constructs. The SDK also provides a suite of tools that bring FPGA development closer to the fast development flow software programmers are used to.
Supported hardware:
- Altera FPGA
Supported OS:
- Windows
- Linux
Standards:
- OpenCL 1.0 with support of SVM (OpenCL 2.0 feature) and image arrays (OpenCL 1.2 feature)
The SDAccel™ development environment for OpenCL™, C, and C++ enables up to 25X better performance/watt for data center application acceleration leveraging FPGAs. SDAccel, a member of the SDx™ family, combines the industry's first architecturally optimizing compiler supporting any combination of OpenCL, C, and C++ kernels with libraries, development boards, and the first complete CPU/GPU-like development and run-time experience for FPGAs.
Supported hardware:
- Adreno GPU (Snapdragon processor)
Supported OS:
- Windows
- Linux
- macOS
- Android (as a target only)
Standards:
- OpenCL 2.0
The Mali OpenCL SDK provides developers with a framework and series of samples for developing OpenCL 1.1 applications on ARM Mali based platforms such as the Mali-T600 and above family of GPUs. The samples cover a wide range of use cases that utilize the Mali GPU to achieve a significant improvement in performance when compared to running on the CPU alone.
Supported hardware:
- ARM Mali based platforms such as the Mali-T600 family of GPUs
Supported OS:
- Windows
- Linux
Standards:
- OpenCL 1.1
Texas Instruments supports OpenCL 1.1 on a few selected processors. The full list of supported systems is available here.
Supported hardware:
- Selected processors (list)
Standards:
- OpenCL 1.1
CodeXL (Debugger, Profiler)
CodeXL is a comprehensive tool suite that enables developers to harness the benefits of CPUs, GPUs and APUs. CodeXL is available both as a Visual Studio extension and a standalone user interface application for Windows and Linux.
CodeXL also works on the ROCm platform.
Features:
- Combined Host and GPU Debugging
- Real-time OpenCL kernel debugging
- AMD GPU Profiling
- Application Timeline Trace (multiple contexts and queues, tips about redundant synchronizations, OpenCL objects leaks...)
- GPU Performance Counters (memory transfers, cache hits, occupancy, registers usage...)
- CPU Profiling
- Static Kernel Analysis
- APU/CPU/GPU power profiling
- Remote machine profiling and debugging
- GPUOpen technical blogs about CodeXL by AMD
- Detailed list of all features
- https://github.com/GPUOpen-Tools/CodeXL
Intel® VTune™ Amplifier XE (Profiler)
With Intel® VTune™ Amplifier, you get all these advanced profiling capabilities with a single, friendly analysis interface. And for media applications, you also get powerful tools to tune OpenCL* and the GPU.
- Detailed list of features
- Intel VTune Amplifier XE: Getting started with OpenCL performance analysis on Intel HD Graphics
Intel® Graphics Performance Analyzers (GPA) (Debugger, Profiler)
Intel® SDK for OpenCL™ Applications provides integration with the Intel® Graphics Performance Analyzers (Intel® GPA), which enables you to optimize and analyze your OpenCL code in visual computing applications.
Intel GPA supports various metrics for Intel CPU and HD Graphics devices. Some metrics are specific to the rendering (Microsoft DirectX* API) pipeline only, while others are more general and can be associated with OpenCL execution.
With Intel GPA you can also inspect various important hardware counters for Intel CPU and HD Graphics devices in real time, for example:
- Utilization of CPU cores and the execution units in Intel HD Graphics devices
- Memory traffic for Intel HD Graphics devices
- Power consumption, and so on
- Profiling OpenCL Applications with System Analyzer and Platform Analyzer
- Collecting OpenCL-related Metrics with Intel Graphics Performance Analyzers
Snapdragon Profiler is profiling software that runs on the Windows, Mac, and Linux platforms. It connects with Android devices powered by Snapdragon processors over USB. Snapdragon Profiler allows developers to analyze CPU, GPU, DSP, memory, power, thermal, and network data, so they can find and fix performance bottlenecks.
Oclgrind is an open source SPIR interpreter and OpenCL device simulator. The core of this project is an interpreter for SPIR, which can simulate the execution of an OpenCL kernel running on a virtual OpenCL device. Oclgrind is designed to be extensible and aims to provide a platform with which a variety of useful tools can be created to aid OpenCL application development.
At present, Oclgrind includes support for the following features:
- Detecting memory access errors (such as reading/writing outside the bounds of an array)
- Data-race detection
- Interactive kernel debugging
- Detecting work-group divergence (barriers and async copies)
- Collecting histograms of instructions executed
- Logging OpenCL runtime API errors
A crucial part of every technology and programming language is its libraries. They extend the language, simplify it, add new features, and help you build your applications faster. The same goes for OpenCL. This chapter presents a few selected libraries which, at the time of writing, seemed popular, useful and reliable.
OpenCL Host API C++ bindings.
For many large applications C++ is the language of choice and so it seems reasonable to define C++ bindings for OpenCL.
The interface is contained within a single C++ header file, `cl2.hpp`, and all definitions are contained within the namespace `cl`. There is no additional requirement to include `cl.h`, nor to choose between the C++ and original C bindings; it is enough to simply include `cl2.hpp`.
The bindings themselves are lightweight and correspond closely to the underlying C API. Using the C++ bindings introduces no additional execution overhead.
- Homepage: https://github.com/KhronosGroup/OpenCL-CLHPP
- Documentation: http://github.khronos.org/OpenCL-CLHPP
Boost.Compute is an official Boost library, part of the Boost C++ Libraries package since version 1.61. It is a comprehensive wrapper for OpenCL 1.0 - 2.0. It provides STL-like algorithms and various helper features (functions, classes) which make developing in a C++/OpenCL environment much faster.
Boost.Compute is a GPU/parallel-computing library for C++ based on OpenCL.
The core library is a thin C++ wrapper over the OpenCL API and provides access to compute devices, contexts, command queues and memory buffers.
On top of the core library is a generic, STL-like interface providing common algorithms (e.g. `transform()`, `accumulate()`, `sort()`) along with common containers (e.g. `vector<T>`, `flat_set<T>`). It also features a number of extensions including parallel-computing algorithms (e.g. `exclusive_scan()`, `scatter()`, `reduce()`) and a number of fancy iterators (e.g. `transform_iterator<>`, `permutation_iterator<>`, `zip_iterator<>`).
- Homepage: https://github.com/boostorg/compute/
- Documentation: http://www.boost.org/doc/libs/1_63_0/libs/compute/doc/html/index.html
- Talks:
- Talk about Boost.Compute @ IWOCL 2016 by Jakub Szuppe
- Boost.Compute @ C++Now 2015 by Kyle Lutz (Video)
- Other resources:
Aparapi enables Java developers to define OpenCL kernels and execute them on GPU and APU devices.
Aparapi allows Java developers to take advantage of the compute power of GPU and APU devices by executing data parallel code fragments on the GPU rather than being confined to the local CPU. It does this by converting Java bytecode to OpenCL at runtime and executing it on the GPU; if for any reason Aparapi can't execute on the GPU, it will execute in a Java thread pool.
We like to think that for the appropriate workload this extends Java's 'Write Once Run Anywhere' to include GPU devices.
- Homepage: https://github.com/aparapi/aparapi
- Documentation:
You can find more OpenCL wrappers/bindings in the blog post StreamComputing: OpenCL Wrappers.
VexCL was developed with scientific computing in mind. It provides custom DSL to simplify using various accelerators for vector arithmetic, reduction, sparse matrix-vector product etc. It has some features of a wrapper library, but not all of them. However, it is possible to define generic C++ algorithms with its DSL and write custom kernels using VexCL. It supports both OpenCL and CUDA.
VexCL is a vector expression template library for OpenCL/CUDA. It has been created for ease of GPGPU development with C++. VexCL strives to reduce the amount of boilerplate code needed to develop GPGPU applications. The library provides convenient and intuitive notation for vector arithmetic, reduction, sparse matrix-vector products, etc. Multi-device and even multi-platform computations are supported. The source code of the library is distributed under the very permissive MIT license.
- Homepage: https://github.com/ddemidov/vexcl
- Documentation: http://vexcl.readthedocs.io/en/latest/
- Talks: http://vexcl.readthedocs.io/en/latest/talks.html
- Other resources:
ArrayFire is a general-purpose library that simplifies the process of developing software that targets parallel and massively-parallel architectures including CPUs, GPUs, and other hardware acceleration devices.
- Homepages:
- Documentation: http://arrayfire.org/docs/
ViennaCL is a free open-source linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs. The library is written in C++ and supports CUDA, OpenCL, and OpenMP (including switches at runtime).
- Homepage: http://viennacl.sourceforge.net/
- Documentation: http://viennacl.sourceforge.net/viennacl-documentation.html
clMathLibraries is the name of a group of math libraries with an OpenCL backend. Also see StreamComputing: OpenCL alternatives for CUDA Linear Algebra Libraries.
- clFFT - a software library containing FFT functions written in OpenCL
- clBLAS - a software library containing BLAS functions written in OpenCL
- clSPARSE - a software library containing Sparse functions written in OpenCL.
- clRNG - an OpenCL based software library containing random number generation functions
More complete lists of OpenCL libraries, bindings and toolkits can be found at:
The OpenCL specification documents can be found on the Khronos OpenCL Registry. They are a must-read for everyone who wants to dive into OpenCL programming and a great reference for the experienced OpenCL developer.
- OpenCL 2.2 (Provisional)
- OpenCL 2.1
- The OpenCL 2.1 Specification (API)
- The OpenCL 2.0 C Language Specification
- Note: OpenCL 2.1 uses OpenCL 2.0 C Language for OpenCL programs
- Online Reference Pages
- OpenCL 2.0
- OpenCL 1.2
- All specifications and other documents regarding the OpenCL standard, including environment specifications, extension specifications, and specifications for OpenCL 1.0 and 1.1, are available at https://www.khronos.org/registry/OpenCL.
A crucial part of The OpenCL Specification is the chapter about the OpenCL architecture. It is very important to understand at least its basic principles. Without this knowledge it is possible neither to program efficiently using OpenCL nor to use it to its full capabilities.
The architecture is described by four models:
- Platform Model
- Memory Model
- Programming Model
- Execution Model
The platform model describes in general terms what the OpenCL framework looks like and gives the definitions of a host and an OpenCL device. The memory model explains the different memory regions and memory objects introduced in OpenCL. More advanced sections on the memory model present the details of how the shared virtual memory mechanism works, along with the memory consistency model and ordering rules. Since OpenCL 2.0 the programming model does not have its own section; the specification simply states that the OpenCL execution model supports both data parallel and task parallel programming models.
The execution model explains how the OpenCL framework really works. It defines the responsibilities of a host program, says what a command queue and a kernel are, and describes how the execution of an OpenCL kernel on a device works. In order to understand OpenCL you must carefully read and understand this part of the specification in particular, and know how workloads are mapped onto devices, when synchronization happens, and what an index space is.
ToDo
CL-basic is a C prototype to help you get started with creating your first simple OpenCL application. It offers simplified host-code OpenCL API functions and a sample OpenCL kernel that you can reference to get started quickly. In addition, this prototype can be compiled on both Windows and Linux-based systems thanks to the use of CMake.
Host-code functions can be found in the files `cl_util.c` and `cl_util.h`.
For getting platform and/or device information, the following functions are provided:

```c
void PrintPlatformName(cl_platform_id platform);
void PrintDeviceName(cl_device_id device);
int PrintOpenCLInfo();
void SelectOpenCLPlatformAndDevice(cl_platform_id* pPlatform, cl_device_id* pDevice);
```
For creating and releasing an OpenCL context, the following functions are provided:

```c
cl_context CreateOpenCLContext(cl_platform_id platform, cl_device_id device);
void ReleaseOpenCLContext(cl_context *pContext);
```
For creating and releasing an OpenCL command queue, the following functions are provided:

```c
cl_command_queue CreateOpenCLQueue(cl_device_id device, cl_context context);
void ReleaseOpenCLQueue(cl_command_queue *pQueue);
```
For creating and releasing an OpenCL buffer allocated on an OpenCL device, the following functions are provided:

```c
cl_mem CreateDeviceBuffer(cl_context context, size_t sizeInBytes);
void ReleaseDeviceBuffer(cl_mem *pDeviceBuffer);
```
For copying data from the host to the device, or from the device to the host, the following functions are provided:

```c
void CopyHostToDevice(void* hostBuffer, cl_mem deviceBuffer, size_t sizeInBytes, cl_command_queue queue, cl_bool blocking);
void CopyDeviceToHost(cl_mem deviceBuffer, void* hostBuffer, size_t sizeInBytes, cl_command_queue queue, cl_bool blocking);
```
For loading OpenCL source from a file, and creating and releasing the OpenCL program, the following functions are provided:

```c
char* LoadOpenCLSourceFromFile(char* filePath, size_t *pSourceLength);
cl_program CreateAndBuildProgram(cl_context context, char* sourceCode, size_t sourceCodeLength);
void ReleaseProgram(cl_program *pProgram);
```
For creating and releasing OpenCL kernels, the following functions are provided:

```c
cl_kernel CreateKernel(cl_program program, char* kernelName);
void ReleaseKernel(cl_kernel *pKernel);
```
A macro for checking error values has also been prepared to make it easier to translate an error code returned by an OpenCL host function. It can be used as follows:

```c
clError = clEnqueueNDRangeKernel(queue, simpleFunctionKernel, workDim, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
CHECK_OCL_ERR("clEnqueueNDRangeKernel", clError);
```
Refer to `main.cpp` for a reference of how these host-code functions are used, and to `OpenCLKernels.cl` for how the OpenCL kernel is written.
ToDo
The OpenCL framework has a built-in feature that can provide information about the execution times of enqueued commands.
By attaching a `cl_event` object to an OpenCL command, such as `clEnqueueNDRangeKernel`, it is possible to use the `clGetEventProfilingInfo` function to get the start and end times of the execution of that command and calculate the performance. Profiling of OpenCL commands can be enabled by using a command queue created with the `CL_QUEUE_PROFILING_ENABLE` flag set in its properties argument.
```c
cl_int clGetEventProfilingInfo(cl_event event,
                               cl_profiling_info param_name,
                               size_t param_value_size,
                               void *param_value,
                               size_t *param_value_size_ret)
```
- `CL_PROFILING_COMMAND_QUEUED`: A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event is enqueued in a command-queue by the host.
- `CL_PROFILING_COMMAND_SUBMIT`: A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event that has been enqueued is submitted by the host to the device associated with the command-queue.
- `CL_PROFILING_COMMAND_START`: A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event starts execution on the device.
- `CL_PROFILING_COMMAND_END`: A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event has finished execution on the device.
- `CL_PROFILING_COMMAND_COMPLETE` (since OpenCL 2.1): A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event and any child commands enqueued by this command on the device have finished execution.
First, the command queue used to execute kernels must be created with the `CL_QUEUE_PROFILING_ENABLE` flag:
```c
#ifdef CL_VERSION_2_0
cl_queue_properties prop[] = { CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0 };
queue = clCreateCommandQueueWithProperties(context, device, prop, &clError);
CHECK_OCL_ERR("clCreateCommandQueueWithProperties", clError);
#else
cl_command_queue_properties prop = 0;
prop |= CL_QUEUE_PROFILING_ENABLE;
queue = clCreateCommandQueue(context, device, prop, &clError);
CHECK_OCL_ERR("clCreateCommandQueue", clError);
#endif
```
Then, you can enqueue a kernel, wait for its completion, and use the associated event object and the `clGetEventProfilingInfo` function to calculate the execution time:
```c
cl_event event;
err = clEnqueueNDRangeKernel(
    queue, kernel, 1, NULL, &global_work_size, NULL, 0, NULL, &event
);
CHECK_OCL_ERR("clEnqueueNDRangeKernel", err);

err = clWaitForEvents(1, &event);
CHECK_OCL_ERR("clWaitForEvents", err);

cl_ulong start_time, end_time;
err = clGetEventProfilingInfo(
    event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start_time, NULL
);
CHECK_OCL_ERR("clGetEventProfilingInfo", err);
err = clGetEventProfilingInfo(
    event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end_time, NULL
);
CHECK_OCL_ERR("clGetEventProfilingInfo", err);

/* cl_ulong is a 64-bit type, so cast for a portable printf format */
printf("%llu\n", (unsigned long long)(end_time - start_time)); /* nanoseconds */

err = clReleaseEvent(event);
CHECK_OCL_ERR("clReleaseEvent", err);
```
If you can't or don't want to use the `CL_QUEUE_PROFILING_ENABLE` flag, it is also possible to benchmark kernels, or in general any OpenCL function, using the timing features of the host language that you use OpenCL with. However, in that case it is important to always run the kernel multiple times in order to get a reliable average execution time.
```cpp
#include <chrono>
#include <iostream>

// (...)

using ms_duration_type = std::chrono::duration<double, std::milli>;
ms_duration_type sum_durations(0);

// Remember to compile the kernel before benchmarking.
const size_t iterations_count = 100;
for (size_t iteration = 0; iteration < iterations_count; iteration++)
{
    cl_event event;

    // Preparation: set kernel arguments, fill input buffers etc.

    auto start_time = std::chrono::high_resolution_clock::now();
    err = clEnqueueNDRangeKernel(
        queue, kernel, 1,
        NULL, &global_work_size, &local_work_size,
        0, NULL, &event
    );
    CHECK_OCL_ERR("clEnqueueNDRangeKernel", err);

    err = clWaitForEvents(1, &event);
    CHECK_OCL_ERR("clWaitForEvents", err);

    ms_duration_type duration =
        std::chrono::high_resolution_clock::now() - start_time;
    sum_durations += duration;

    err = clReleaseEvent(event);
    CHECK_OCL_ERR("clReleaseEvent", err);
}
auto avg_duration = sum_durations / iterations_count;
std::cout << avg_duration.count() << " ms\n";

// (...)
```
```cpp
#include <chrono>
#include <iostream>

#include <boost/compute.hpp>

// (...)

using ms_duration_type = std::chrono::duration<double, std::milli>;
ms_duration_type sum_durations(0);

// Remember to compile the kernel before benchmarking.
const size_t iterations_count = 100;
for (size_t iteration = 0; iteration < iterations_count; iteration++)
{
    // Preparation: set kernel arguments, fill input buffers etc.

    auto start_time = std::chrono::high_resolution_clock::now();
    queue.enqueue_1d_range_kernel(
        kernel, 0 /* offset */, global_work_size, local_work_size
    ).wait();

    ms_duration_type duration =
        std::chrono::high_resolution_clock::now() - start_time;
    sum_durations += duration;
}
auto avg_duration = sum_durations / iterations_count;
std::cout << avg_duration.count() << " ms\n";

// (...)
```