SAXPY benchmark for CPU and GP-GPU

SAXPY CPU and GPGPU Benchmarks

Benchmarks

The following benchmarks have been implemented:

C++ Bulk [gpu] Bulk is yet another parallel algorithms library built on top of CUDA. It claims to have better scalability than Thrust.
C++ CUDA [gpu] The NVidia CUDA toolkit is the base library for accessing GPUs.
C++ OCL [cpu] OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators.
C++ OCL [gpu] OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators.
C++ OMP [cpu] OpenMP is an API specification for parallel programming.
C++ TensorFlow [gpu] TensorFlow is a deep learning library from Google.
C++ Thrust [gpu] NVidia Thrust is a parallel algorithms library which resembles the C++ Standard Template Library (STL). Thrust is included with CUDA toolkit.
C++ cuBLAS [gpu] NVidia cuBLAS is a fast GPU-accelerated implementation of the standard basic linear algebra subroutines (BLAS).
C++ loop [cpu] Plain C++ for loop
Java loop [cpu] Plain Java loop
Julia (loop) [cpu] SIMD-optimized Julia loop.
Julia (vec) [cpu] With Julia array operations.
Octave [cpu] GNU Octave is a high-level language primarily intended for numerical computations.
Py CNTK [cpu] CNTK is a deep learning library.
Py CNTK [gpu] CNTK is a deep learning library.
Py MXNet [cpu] MXNet is a deep learning library.
Py MXNet [gpu] MXNet is a deep learning library.
Py Numpy [cpu] With Python Numpy array.
Py Pandas [cpu] With Python Pandas dataframe.
Py TensorFlow [cpu] TensorFlow is a deep learning library.
Py TensorFlow [gpu] TensorFlow is a deep learning library.
PyCUDA [gpu] PyCUDA is a Python wrapper for CUDA.
PyOCL [cpu] PyOpenCL is a Python wrapper for OpenCL.
PyOCL [gpu] PyOpenCL is a Python wrapper for OpenCL.
Python loop [cpu] Simple Python for loop.
R (array) [cpu] With array in R, a free software environment for statistical computing and graphics.
R (data.frame) [cpu] With data.frame in R, a free software environment for statistical computing and graphics.
R (data.table) [cpu] With data.table in R, a free software environment for statistical computing and graphics.
R (loop) [cpu] Simple loop in R, a free software environment for statistical computing and graphics.
R (matrix) [cpu] With matrix in R, a free software environment for statistical computing and graphics.
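
For reference, SAXPY ("single-precision a times x plus y") updates every element of a vector as y[i] = a * x[i] + y[i]. A minimal sketch of what the plain Python loop variant could look like, with a simple timer around it, is shown below. The function names, the vector size, and the constants are illustrative assumptions, not the code under src.

```python
import time
from array import array


def saxpy_loop(a, x, y):
    """In-place SAXPY: y[i] = a * x[i] + y[i] for every element."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


def time_saxpy(n=100000, a=2.0):
    """Time one SAXPY pass over n elements; returns (seconds, y)."""
    # Type code 'f' stores 32-bit floats, matching the "S" (single precision) in SAXPY.
    x = array("f", [1.0] * n)
    y = array("f", [3.0] * n)
    start = time.perf_counter()
    saxpy_loop(a, x, y)
    elapsed = time.perf_counter() - start
    return elapsed, y


if __name__ == "__main__":
    elapsed, y = time_saxpy()
    print("n={} elapsed={:.2f} ms".format(len(y), elapsed * 1e3))
```

Every other entry in the list computes the same update; only the language, library, or device changes.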

Results

Python: Loop vs Numpy (CPU)

Comparison between simple Python loop and Numpy

results/charts-en/python-loop-vs-numpy-linux-cpu.png
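
To illustrate what this chart compares, here is a rough sketch that runs the same SAXPY once element by element and once as a single Numpy array expression. The helper names, the vector size, and the use of time.perf_counter are assumptions made for illustration; the actual benchmark scripts under src may set up and measure things differently.

```python
import time

import numpy as np


def saxpy_python_loop(a, x, y):
    """Element-by-element SAXPY over Numpy arrays, updating y in place."""
    for i in range(x.shape[0]):
        y[i] = a * x[i] + y[i]


def saxpy_numpy(a, x, y):
    """Vectorized SAXPY: a single array expression, updating y in place."""
    y += a * x


def compare(n=200000, a=np.float32(2.0)):
    """Return (loop_seconds, numpy_seconds, results_match)."""
    x = np.ones(n, dtype=np.float32)
    y_loop = np.full(n, 3.0, dtype=np.float32)
    y_vec = y_loop.copy()

    t0 = time.perf_counter()
    saxpy_python_loop(a, x, y_loop)
    t1 = time.perf_counter()
    saxpy_numpy(a, x, y_vec)
    t2 = time.perf_counter()

    return t1 - t0, t2 - t1, bool(np.array_equal(y_loop, y_vec))


if __name__ == "__main__":
    loop_s, vec_s, same = compare()
    print("loop: {:.4f} s, numpy: {:.4f} s, match: {}".format(loop_s, vec_s, same))
```

A single timed pass like this is noisy; repeating the measurement and keeping the best or median time gives a steadier number.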

Python: Loop vs Numpy 2 (CPU)

Same as above, on both Linux and Windows

results/charts-en/python-loop-vs-numpy-cpu.png

R: Loop vs Vectorized (CPU)

Benchmarking various vectorization methods in R (array, matrix, data.frame, data.table) vs plain loop

results/charts-en/r-loop-vs-vec.png

Python: Loop vs Numpy vs Pandas (CPU)

Benchmarking the performance of Numpy vs Pandas (vs plain Python loop)

results/charts-en/python-loop-vs-numpy-vs-pandas-cpu.png

Julia: Loop vs Vector (CPU)

Comparing the performance of Julia loop vs Julia vector/array (vs C++)

results/charts-en/julia-loop-vs-vector.png

Numpy vs Octave vs R vs Java vs Julia vs C++ (CPU)

Comparing the performance of SAXPY in different programming languages

results/charts-en/script-vs-script-vs-java-vs-c++-cpu.png

Python Vectorization: Numpy vs Deep Learning Frameworks (CPU)

SAXPY array operation in Numpy vs machine learning frameworks such as TensorFlow, MXNet, and CNTK. Only tested on Linux.

Note: the CNTK result is far off from the others, and I am not sure why. Please have a look at the source code; I may have done something wrong.

results/charts-en/vectorized-numpy-vs-frameworks-cpu.png

Numpy vs Deep Learning Frameworks (GPU and CPU)

Same as above, but on GPU as well

results/charts-en/vectorized-numpy-vs-frameworks-gpu.png

Deep Learning Frameworks GPU vs Loop CPU

Comparing frameworks running on GPU with naive C++ loop running on CPU.

results/charts-en/frameworks-gpu-vs-c++-cpu.png

C++ Parallel APIs (CPU)

Comparing naive C++ loop with several parallel programming APIs (OpenCL and OpenMP) on CPU.

results/charts-en/parallel-c++-cpu.png

C++ GPU (vs CPU)

Comparing various C++ GPU libraries (CUDA, OpenCL, Thrust, Bulk, cuBLAS)

results/charts-en/c++-cpu-vs-gpu.png

OpenCL vs PyOpenCL (CPU & GPU)

Comparing C++ OpenCL with PyOpenCL, the OpenCL Python wrapper.

results/charts-en/pyopencl-vs-opencl.png

PyCUDA vs C++ (GPU)

Comparing PyCUDA (Python CUDA wrapper) with native C++ CUDA on GPU

results/charts-en/pycuda-vs-c++.png

TensorFlow: Python vs C++ (GPU)

Comparing TensorFlow C++ and Python performance

results/charts-en/tensorflow-python-vs-c++.png

GPU Conclusion

Benchmarking various GPU APIs (only on Linux since it has the most APIs)

Excluded from this chart:

results/charts-en/conclusion-gpus.png

Linux Conclusion

Excluded from this chart:

results/charts-en/conclusion-linux.png

Windows Conclusion

Excluded from this chart:

results/charts-en/conclusion-windows.png

Conclusion

Excluded from this chart:

results/charts-en/conclusion.png

Machine Specifications

Ubuntu 16.04, NVidia GTX 1080

Note: same machine as Windows below (dual-boot)

| Component | Details |
|---|---|
| System | Intel i7-6700 CPU @ 3.40GHz, 16GB RAM, 4x2 cores (HT) |
| OS | Ubuntu Linux 16.04 64bit |
| GPU | NVidia GeForce GTX 1080 8GB |
| C++ Compiler | g++ 5.4.0 |
| Python3 | 3.5.2 64bit |
| TensorFlow | TensorFlow 1.4 (GPU) |
| CUDA | CUDA 9.0.61, CudNN7 |
| OpenCL | Khronos OpenCL header 1.2; Intel OpenCL driver 16.1.1; NVidia OpenCL 1.2 driver |
| PyOpenCL | version 2015.1 |
| Octave | version 4.0.0 64bit |
| R | version 3.2.3 64bit |
| MXNet | mxnet-cu90 (0.12.1) |
| CNTK | CNTK 2.3.1 (CUDA-8, CudNN6) |

Windows 10, NVidia GTX 1080

Note: same machine as Linux above (dual-boot)

| Component | Details |
|---|---|
| System | Intel i7-6700 CPU @ 3.40GHz, 16GB RAM, 4x2 cores (HT) |
| OS | Windows 10 64bit |
| GPU | NVidia GeForce GTX 1080 8GB |
| C++ Compiler | Visual Studio 2015 C++ compiler 64bit version |
| Python | 2.7.12 64bit |
| Python3 | 3.5.3 64bit |
| TensorFlow | TensorFlow 1.4 (GPU) |
| CUDA | Version 8.0.61 |
| OpenCL | Intel OpenCL SDK Version 7.0.0.2519; OpenCL from CUDA SDK |
| PyOpenCL | version 2017.2 |
| Octave | version 4.2.1 64bit |
| R | version 3.4.2 64bit |