Benchmarking matrix multiplication implementations

This repo evaluates different matrix multiplication implementations given two large square matrices (2000-by-2000 in the following example):

| Implementation  | Long description                                       |
|-----------------|--------------------------------------------------------|
| Naive           | Most obvious implementation                            |
| Transposed      | Transposing the second matrix for cache efficiency     |
| sdot w/o hints  | Replacing the inner loop with BLAS sdot()              |
| sdot with hints | sdot() with a slightly unrolled loop                   |
| SSE sdot        | sdot() vectorized with explicit SSE instructions       |
| SSE+tiling sdot | SSE sdot() with loop tiling                            |
| OpenBLAS sdot   | sdot() provided by OpenBLAS                            |
| OpenBLAS sgemm  | sgemm() provided by OpenBLAS                           |
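
To make the first few rows concrete, here is a minimal sketch of the naive, transposed, and unrolled-dot-product ideas. It is an illustration written for this README, not the code in matmul.c: the function names (`mat_mul_naive`, `sdot_unrolled`, `mat_mul_transposed`), the flat row-major layout, and the unroll factor of 4 are all assumptions.

```c
#include <stdlib.h>

/* Naive: c[i][j] = sum over k of a[i][k] * b[k][j].
 * The inner loop walks down a column of b, so consecutive k values
 * touch memory that is n floats apart. Matrices are row-major. */
void mat_mul_naive(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float t = 0.0f;
            for (int k = 0; k < n; ++k)
                t += a[i * n + k] * b[k * n + j];
            c[i * n + j] = t;
        }
}

/* Dot product unrolled by 4 with independent accumulators
 * (one possible reading of "sdot with hints"; hypothetical helper). */
float sdot_unrolled(int n, const float *x, const float *y)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i, n4 = n - n % 4;
    for (i = 0; i < n4; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i)          /* leftover elements when n % 4 != 0 */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}

/* Transposed: copy b into bt = b^T first, so that each output element
 * is a dot product of two contiguous rows.
 * Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int mat_mul_transposed(int n, const float *a, const float *b, float *c)
{
    float *bt = malloc((size_t)n * n * sizeof *bt);
    if (bt == NULL)
        return -1;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            bt[j * n + i] = b[i * n + j];
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            c[i * n + j] = sdot_unrolled(n, a + i * n, bt + j * n);
    free(bt);
    return 0;
}
```

Both multiplication routines perform the same arithmetic; the transposed variant only changes the memory access pattern of the inner loop. Because the unrolled dot product sums in a different order, results may differ from the naive version in the last bits for general floating-point inputs.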

To compile the evaluation program:

make CBLAS=/path/to/cblas/prefix

or omit the CBLAS setting if you don't have it. After compilation, use

./matmul -h

to see the available options. Here are the results on my machines:

| Implementation  | -a | Linux,-n2000 | Linux,-n4000 | Linux/icc,-n4000 | Mac,-n2000 |
|-----------------|----|--------------|--------------|------------------|------------|
| Naive           | 0  | 7.53 sec     | 188.85 sec   | 173.76 sec       | 77.45 sec  |
| Transposed      | 1  | 6.66 sec     | 55.48 sec    | 21.04 sec        | 9.73 sec   |
| sdot w/o hints  | 4  | 6.66 sec     | 55.04 sec    | 21.35 sec        | 9.70 sec   |
| sdot with hints | 3  | 2.41 sec     | 29.47 sec    | 21.69 sec        | 2.92 sec   |
| SSE sdot        | 2  | 1.36 sec     | 21.79 sec    | 22.18 sec        | 2.92 sec   |
| SSE+tiling sdot | 7  | 1.11 sec     | 10.84 sec    | 10.97 sec        | 1.90 sec   |
| OpenBLAS sdot   | 5  | 2.69 sec     | 28.87 sec    |                  | 5.61 sec   |
| OpenBLAS sgemm  | 6  | 0.63 sec     | 4.91 sec     |                  | 0.86 sec   |
| uBLAS           |    | 7.43 sec     | 165.74 sec   |                  |            |
| Eigen           |    | 0.61 sec     | 4.76 sec     | 5.01 sec         | 0.85 sec   |
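
The timings above can be turned into an approximate throughput figure. The sketch below assumes the conventional count of about 2n^3 floating-point operations for an n-by-n product (n^3 multiplications and roughly n^3 additions); the helper name `matmul_gflops` is hypothetical and is not part of the benchmark program.

```c
/* Approximate throughput in GFLOP/s for an n-by-n matrix product,
 * assuming 2*n^3 floating-point operations in total. */
double matmul_gflops(int n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
```

For example, under this assumption `matmul_gflops(2000, 0.63)` (OpenBLAS sgemm on Linux) comes to roughly 25.4 GFLOP/s, while `matmul_gflops(2000, 7.53)` (Naive on Linux) comes to roughly 2.1 GFLOP/s.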

The machine configurations are as follows:

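| Machine | CPU                          | OS           | Compiler                     |
|---------|------------------------------|--------------|------------------------------|
| Linux   | 2.6 GHz Xeon E5-2697         | CentOS 6     | gcc-4.4.7/icc-15.0.3         |
| Mac     | 1.7 GHz Intel Core i5-2557M  | OS X 10.9.5  | clang-600.0.57/LLVM-3.5svn   |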

On both machines, OpenBLAS-0.2.18 is compiled with the following options (no AVX or multithreading):