Benchmark corresponding to the eigen-magma project implementation


Eigen Magma benchmark

This project provides a simple benchmarking facility for Eigen. It was developed mainly for benchmarking the Eigen MAGMA backend implementation. It also serves as a sample CMake project for using Eigen in combination with MAGMA and MKL.


You first need to install Intel MKL, Eigen, CUDA and MAGMA. The project has been tested with Intel Parallel Studio 2013, Eigen 3.2.0, CUDA 5.5 and MAGMA 1.4.0.

Modus Operandi

  • Create a Release build with the following command (using Intel compiler):
    rm -rf build; mkdir build; cd build; cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc -DCMAKE_Fortran_COMPILER=ifort ../src
  • Edit the CMakeLists.txt file and enable Eigen only, MKL or MAGMA by commenting or uncommenting the following definitions:
  • Build the project by executing make. For faster compilation with more threads, e.g. 5, use make -j5
  • Run the benchmark with ./benchmark, or use ./benchmark --help for help.
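To give an idea of what a benchmark of this kind measures, here is a minimal sketch of a timing harness. It is not the project's code: the names naive_gemm, gemm_flops and time_gemm_seconds are hypothetical, and a naive triple loop stands in for the Eigen, MKL or MAGMA call so that the example compiles with the standard library alone.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Naive column-major product C = A * B for n x n matrices.
// Hypothetical stand-in for the operation under test.
std::vector<double> naive_gemm(const std::vector<double>& a,
                               const std::vector<double>& b,
                               std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    std::vector<double> c(n * n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t k = 0; k < n; ++k) {
            const double bkj = b[k + j * n];
            for (std::size_t i = 0; i < n; ++i) {
                c[i + j * n] += a[i + k * n] * bkj;
            }
        }
    }
    return c;
}

// Operation count of an n x n product: n^3 multiplications plus n^3 additions.
double gemm_flops(std::size_t n) {
    const double d = static_cast<double>(n);
    return 2.0 * d * d * d;
}

// Runs the product `repeats` times and returns the best wall-clock time
// in seconds. Taking the minimum reduces noise from other processes.
double time_gemm_seconds(std::size_t n, int repeats) {
    assert(n > 0 && repeats > 0);
    const std::vector<double> a(n * n, 1.0);
    const std::vector<double> b(n * n, 2.0);
    double best = -1.0;
    double checksum = 0.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        const std::vector<double> c = naive_gemm(a, b, n);
        const auto t1 = std::chrono::steady_clock::now();
        checksum += c[0];  // read the result so the work is not optimized away
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || s < best) best = s;
    }
    assert(checksum > 0.0);
    return best;
}
```

Dividing gemm_flops(n) by the measured time and by 1e9 gives a GFLOP/s figure. Guard against a zero elapsed time for very small n, where the clock resolution may be too coarse.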

Benchmark environment

Out of the box, the nVidia Titan GTX card defaults to a Double-Precision (DP) performance that is only 1/24th of its Single-Precision (SP) performance. The card is capable of reaching a DP performance of up to 1/3 of the SP performance, but this has to be enabled by changing the default driver settings with the “nvidia-settings” tool, which is installed as part of the nVidia drivers. The CUDA-Double precision box must be checked as shown in the figure below.


The following plots were obtained by executing the currently ported Eigen MAGMA backends:

The benchmarks above were obtained using export MKL_NUM_THREADS=1 and export OMP_NUM_THREADS=1. Increasing MKL_NUM_THREADS may improve the results for both the MKL and the MAGMA versions. Furthermore, unlike the benchmarks shown in the MAGMA testing implementations, the results above account for the memory transfer times between Host and Device. This is the reason why dgemv and dtrsm do not seem to perform better than the CPU versions.
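The effect of the transfer times can be illustrated with a rough back-of-the-envelope model. The sketch below is an assumption-laden illustration, not a measurement: the DeviceModel struct, its field names and the figures passed to it are hypothetical. The dgemm case is included only as a contrast to dgemv, and dtrsm is not modelled.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical hardware figures, supplied by the caller. Illustration only.
struct DeviceModel {
    double gflops;         // assumed sustained DP rate of the device, GFLOP/s
    double transfer_gb_s;  // assumed Host <-> Device bandwidth, GB/s
};

// dgemv, y = A * x with an n x n matrix: 2 n^2 operations.
double dgemv_flops(std::size_t n) {
    const double d = static_cast<double>(n);
    return 2.0 * d * d;
}

// Bytes moved for dgemv: A and x to the device, y back (8 bytes per double).
double dgemv_bytes(std::size_t n) {
    const double d = static_cast<double>(n);
    return 8.0 * (d * d + 2.0 * d);
}

// dgemm, C = A * B with n x n matrices: 2 n^3 operations.
double dgemm_flops(std::size_t n) {
    const double d = static_cast<double>(n);
    return 2.0 * d * d * d;
}

// Bytes moved for dgemm: A and B to the device, C back.
double dgemm_bytes(std::size_t n) {
    const double d = static_cast<double>(n);
    return 8.0 * 3.0 * d * d;
}

// Ratio of estimated transfer time to estimated compute time.
// A value above 1 means the transfer dominates the total time.
double transfer_to_compute_ratio(double bytes, double flops,
                                 const DeviceModel& m) {
    assert(m.gflops > 0.0 && m.transfer_gb_s > 0.0 && flops > 0.0);
    const double transfer_s = bytes / (m.transfer_gb_s * 1e9);
    const double compute_s = flops / (m.gflops * 1e9);
    return transfer_s / compute_s;
}
```

In this model dgemv moves 4(1 + 2/n) bytes per floating-point operation regardless of how large n gets, whereas dgemm moves 12/n bytes per operation. For a level-2 routine such as dgemv the transfer cost therefore never amortizes, which is consistent with the GPU version not outperforming the CPU once transfers are counted.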