Benchmarking matrix multiplication implementations
Latest commit d225edf Aug 29, 2016 @lh3 lh3 added Eigen on Mac

This repo evaluates different matrix multiplication implementations given two large square matrices (2000-by-2000 in the following example):

| Implementation | Long description |
|:---------------|:-----------------|
| Naive | Most obvious implementation |
| Transposed | Transposing the second matrix for cache efficiency |
| sdot w/o hints | Replacing the inner loop with BLAS sdot() |
| sdot with hints | sdot() with a partially unrolled loop |
| SSE sdot | sdot() vectorized with explicit SSE instructions |
| SSE+tiling sdot | SSE sdot() with loop tiling |
| OpenBLAS sdot | sdot() provided by OpenBLAS |
| OpenBLAS sgemm | sgemm() provided by OpenBLAS |
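To make the first few rows concrete, here is a minimal sketch of the naive, transposed and unrolled-sdot ideas. It is not the code from this repo: the function names (`mat_mul_naive`, `mat_mul_transposed`, `sdot_unrolled`) and the flat row-major layout are assumptions made for illustration, and the SSE and tiling variants are not shown.

```c
#include <stdlib.h>

/* Naive: c[i][j] = sum_k a[i][k] * b[k][j]. The inner loop walks down a
 * column of b, so successive reads are n floats apart in memory. */
static void mat_mul_naive(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float t = 0.0f;
            for (int k = 0; k < n; ++k)
                t += a[i * n + k] * b[k * n + j];
            c[i * n + j] = t;
        }
}

/* Dot product with four independent accumulators (a partially unrolled
 * loop); the tail loop handles lengths that are not a multiple of 4. */
static float sdot_unrolled(int n, const float *x, const float *y)
{
    float t0 = 0.0f, t1 = 0.0f, t2 = 0.0f, t3 = 0.0f;
    int i, n4 = n & ~3;
    for (i = 0; i < n4; i += 4) {
        t0 += x[i]     * y[i];
        t1 += x[i + 1] * y[i + 1];
        t2 += x[i + 2] * y[i + 2];
        t3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i)
        t0 += x[i] * y[i];
    return (t0 + t1) + (t2 + t3);
}

/* Transposed: copy b into bt = b^T once, so that every output element is
 * a dot product of two contiguous rows. Returns -1 if allocation fails. */
static int mat_mul_transposed(int n, const float *a, const float *b, float *c)
{
    float *bt = (float *)malloc((size_t)n * n * sizeof(float));
    if (bt == NULL) return -1;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            bt[j * n + i] = b[i * n + j];
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            c[i * n + j] = sdot_unrolled(n, a + i * n, bt + j * n);
    free(bt);
    return 0;
}
```

Both functions take `n` and three `n*n` float arrays. Because the unrolled version sums in a different order, its results can differ from the naive ones in the last bits for general floating-point inputs.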

To compile the evaluation program:

make CBLAS=/path/to/cblas/prefix

or omit the CBLAS setting if you don't have it. After compilation, use

./matmul -h

to see the available options. Here are the results on my machines:

| Implementation | -a | Linux,-n2000 | Linux,-n4000 | Linux/icc,-n4000 | Mac,-n2000 |
|:---------------|:---|-------------:|-------------:|-----------------:|-----------:|
| Naive | 0 | 7.53 sec | 188.85 sec | 173.76 sec | 77.45 sec |
| Transposed | 1 | 6.66 sec | 55.48 sec | 21.04 sec | 9.73 sec |
| sdot w/o hints | 4 | 6.66 sec | 55.04 sec | 21.35 sec | 9.70 sec |
| sdot with hints | 3 | 2.41 sec | 29.47 sec | 21.69 sec | 2.92 sec |
| SSE sdot | 2 | 1.36 sec | 21.79 sec | 22.18 sec | 2.92 sec |
| SSE+tiling sdot | 7 | 1.11 sec | 10.84 sec | 10.97 sec | 1.90 sec |
| OpenBLAS sdot | 5 | 2.69 sec | 28.87 sec | | 5.61 sec |
| OpenBLAS sgemm | 6 | 0.63 sec | 4.91 sec | | 0.86 sec |
| uBLAS | | 7.43 sec | 165.74 sec | | |
| Eigen | | 0.61 sec | 4.76 sec | 5.01 sec | 0.85 sec |
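One way to compare rows across matrix sizes is to convert each timing into a throughput. The sketch below assumes the usual count of 2n³ floating-point operations for a dense n-by-n product (n³ multiplications plus n³ additions). The helper name `gflops` is hypothetical and not part of this repo.

```c
/* Throughput in GFLOPS for an n-by-n single-precision matrix product,
 * assuming 2*n^3 floating-point operations in total. */
static double gflops(int n, double seconds)
{
    return 2.0 * n * n * n / seconds / 1e9;
}
```

Under that assumption, `gflops(2000, 0.63)` for OpenBLAS sgemm on Linux comes to roughly 25 GFLOPS, and `gflops(2000, 7.53)` for the naive loop to roughly 2 GFLOPS. Doubling n multiplies the work by 8, so a row whose time grows much more than 8-fold between -n2000 and -n4000 (the naive one grows about 25-fold) is losing throughput at the larger size.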

The machine configurations are as follows:

| Machine | CPU | OS | Compiler |
|:--------|:----|:---|:---------|
| Linux | 2.6 GHz Xeon E5-2697 | CentOS 6 | gcc-4.4.7/icc-15.0.3 |
| Mac | 1.7 GHz Intel Core i5-2557M | OS X 10.9.5 | clang-600.0.57/LLVM-3.5svn |

On both machines, OpenBLAS-0.2.18 is compiled with the following options (no AVX or multithreading):