Code appendix to an OpenCL matrix-multiplication tutorial

Exploring the performance of SGEMM in OpenCL on NVIDIA GPUs

Date: 31-Oct-2014 - 07-Nov-2014

Author: Cedric Nugteren, SURFsara

This repository contains multiple OpenCL implementations of single-precision generalised matrix multiplication (SGEMM), tuned for an NVIDIA Tesla K40m GPU. The different versions (named myGEMM) are part of a step-by-step tutorial, in which each step adds a new optimisation. The steps and the details of the OpenCL kernel code are all explained in depth on the accompanying tutorial website.

The OpenCL kernels can be used directly through the OpenCL framework. However, a header file is also included which converts the OpenCL kernels into CUDA syntax, allowing the same code to be tested through the CUDA toolchain.
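The idea behind such a header is macro-based translation of OpenCL keywords and work-item built-ins into their CUDA counterparts. A rough sketch of what it might look like follows; the macros below are an assumption for illustration, and the actual header in this repository may define them differently:

```c
/* Hypothetical sketch of an OpenCL-to-CUDA translation header. */
#define __kernel   extern "C" __global__
#define __global   /* CUDA global-memory pointers need no qualifier */
#define __local    __shared__
#define barrier(x) __syncthreads()

/* OpenCL work-item built-ins mapped to CUDA thread indexing
   (only the first two dimensions are handled in this sketch). */
#define get_local_id(i)   ((i) == 0 ? threadIdx.x : threadIdx.y)
#define get_group_id(i)   ((i) == 0 ? blockIdx.x  : blockIdx.y)
#define get_local_size(i) ((i) == 0 ? blockDim.x  : blockDim.y)
#define get_global_id(i)  ((i) == 0 ? blockIdx.x * blockDim.x + threadIdx.x \
                                    : blockIdx.y * blockDim.y + threadIdx.y)
```

With such definitions in place, an unmodified OpenCL kernel body can be compiled by NVCC as a CUDA `__global__` function.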

Apart from the OpenCL kernel code, this repository contains fully working host code, including a loop over different matrix sizes and different BLAS libraries. It contains code to run NVIDIA's cuBLAS and the open-source clBlas library as references.


Requirements to compile the code:

  • A C++ compiler (tested with GCC and ICC)
  • The CUDA toolkit and NVCC compiler (tested with version 6.5)
  • OpenCL headers and libraries (part of the CUDA toolkit)

Requirements to run the performance and correctness comparisons:

  • The cuBLAS library (part of the CUDA toolkit, tested version 6.5)
  • The open-source clBlas library (tested 2.2.0)


Usage

  • Compile the code:

    make build

    Compiles the benchmarking infrastructure and the myGEMM kernels. Make sure the "bin" and "obj" directories exist. Note that you might have to edit the Makefile to set the proper locations of the CUDA and OpenCL installations on your system.

  • Run the code:

    make run

    This runs the code for matrices ranging from MINSIZE to MAXSIZE (defined in src/common.h). It will run cuBLAS, clBlas, and the CUDA and OpenCL versions of the myGEMM kernels. The particular kernel to be executed is defined using the KERNEL keyword in src/settings.h. This file also contains other settings you might want to modify for your particular GPU.

  • Inspect the code:

    make inspect

    This generates assembly-like versions of the CUDA kernels in the "bin" subdirectory. It also prints statistics of the kernels, such as register usage.

Minimal working example

Additionally, we supply the minimal.cpp file in the 'extra' directory. This file is a self-contained minimal working example (MWE) of the most basic SGEMM kernel (myGEMM1). This can be useful if you don't want to deal with Makefiles or don't have CUDA, cuBLAS, or clBlas installed. Note that minimal.cpp lacks some features compared to the main code, but we believe that it can nevertheless be a good starting point if you want to integrate myGEMM into your own code.

The code can be compiled using a regular C++ compiler and only requires an OpenCL installation. Example compilation from the root folder:

g++ -O3 -Wall -I/path/to/opencl/include extra/minimal.cpp -o bin/minimal -lOpenCL

Be aware that the minimal working example does not:

  • Iterate over multiple matrix sizes
  • Compare performance with cuBLAS or clBlas
  • Check for correctness of the results
  • Check for OpenCL errors
  • Load the kernel from a file on disk; instead, the kernel is embedded as a string