STREAM, for lots of devices written in many programming models

BabelStream

Measure memory transfer rates to/from global device memory on GPUs. This benchmark is similar in spirit to, and based on, the STREAM benchmark [1] for CPUs.

Unlike other GPU memory bandwidth benchmarks, this does not include the PCIe transfer time.
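To make the measured quantity concrete, here is a minimal host-only sketch of a STREAM-style triad kernel and the bandwidth figure derived from it. It is an illustration under stated assumptions, not BabelStream's own code: the names triad, triad_bytes and triad_bandwidth_gbs are hypothetical, the loop runs on the CPU, and the byte count follows the usual STREAM convention of counting two arrays read and one written. On a GPU, the timed region would wrap only the device kernel, with the arrays already resident in device memory, which is why PCIe transfer time does not enter the result.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// STREAM-style triad: a[i] = b[i] + scalar * c[i].
// Hypothetical helper for illustration; not taken from the BabelStream sources.
template <typename T>
void triad(std::vector<T>& a, const std::vector<T>& b,
           const std::vector<T>& c, T scalar)
{
  assert(a.size() == b.size() && a.size() == c.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] = b[i] + scalar * c[i];
}

// Bytes moved by one triad pass over n elements:
// two arrays are read (b, c) and one is written (a).
template <typename T>
double triad_bytes(std::size_t n)
{
  return 3.0 * sizeof(T) * static_cast<double>(n);
}

// Time a single triad pass and convert it to GB/s (10^9 bytes per second).
// Assumes the arrays are large enough for the pass to take measurable time.
template <typename T>
double triad_bandwidth_gbs(std::vector<T>& a, const std::vector<T>& b,
                           const std::vector<T>& c, T scalar)
{
  const auto t0 = std::chrono::steady_clock::now();
  triad(a, b, c, scalar);
  const auto t1 = std::chrono::steady_clock::now();
  const double seconds = std::chrono::duration<double>(t1 - t0).count();
  return triad_bytes<T>(a.size()) / seconds * 1.0e-9;
}
```

In practice the kernel would be run many times over arrays much larger than any cache, and the best time would be reported, so that the figure reflects sustained memory bandwidth rather than cache effects or start-up cost.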

There are multiple implementations of this benchmark in a variety of programming models. Currently implemented are:

  • OpenCL
  • CUDA
  • OpenACC
  • OpenMP 3 and 4.5
  • Kokkos
  • RAJA
  • SYCL

This code was previously called GPU-STREAM.

Website

uob-hpc.github.io/BabelStream/

Usage

The drivers, compiler and software applicable to whichever implementation you would like to build against are required.

We have supplied a series of Makefiles, one for each programming model, to assist with building. The Makefiles contain common build options, and should be simple to customise for your needs too.

General usage is make -f <Model>.make. Common compiler flags and names can be set by passing a COMPILER option to Make, e.g. make COMPILER=GNU. Some models allow specifying a CPU or GPU style target, which can be set by passing a TARGET option to Make, e.g. make TARGET=GPU.

Pass in extra flags via the EXTRA_FLAGS option.

The binaries are named in the form <model>-stream.

Building Kokkos

We use the following command to build Kokkos using the Intel Compiler, specifying the arch appropriately, e.g. KNL.

../generate_makefile.bash --prefix=<prefix> --with-openmp --with-pthread --arch=<arch> --compiler=icpc --cxxflags=-DKOKKOS_MEMORY_ALIGNMENT=2097152

For building with CUDA support, we use the following command, specifying the arch appropriately, e.g. Kepler35.

../generate_makefile.bash --prefix=<prefix> --with-cuda --with-openmp --with-pthread --arch=<arch> --with-cuda-options=enable_lambda --compiler=<path_to_kokkos_src>/bin/nvcc_wrapper

Building RAJA

We use the following command to build RAJA using the Intel Compiler.

cmake .. -DCMAKE_INSTALL_PREFIX=<prefix> -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc -DRAJA_PTR="RAJA_USE_RESTRICT_ALIGNED_PTR" -DCMAKE_BUILD_TYPE=ICCBuild -DRAJA_ENABLE_TESTS=Off

For building with CUDA support, we use the following command.

cmake .. -DCMAKE_INSTALL_PREFIX=<prefix> -DRAJA_PTR="RAJA_USE_RESTRICT_ALIGNED_PTR" -DRAJA_ENABLE_CUDA=1 -DRAJA_ENABLE_TESTS=Off

Results

Sample results can be found in the results subdirectory. If you would like to submit updated results, please submit a Pull Request.

Citing

Please cite BabelStream via this reference:

Deakin T, Price J, Martineau M, McIntosh-Smith S. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. 2016. Paper presented at P^3MA Workshop at ISC High Performance, Frankfurt, Germany.

Other BabelStream publications:

Deakin T, McIntosh-Smith S. GPU-STREAM: Benchmarking the achievable memory bandwidth of Graphics Processing Units. 2015. Poster session presented at IEEE/ACM SuperComputing, Austin, United States. You can view the Poster and Extended Abstract.

Deakin T, Price J, Martineau M, McIntosh-Smith S. GPU-STREAM: Now in 2D!. 2016. Poster session presented at IEEE/ACM SuperComputing, Salt Lake City, United States. You can view the Poster and Extended Abstract.

Raman K, Deakin T, Price J, McIntosh-Smith S. Improving achieved memory bandwidth from C++ codes on Intel Xeon Phi Processor (Knights Landing). IXPUG Spring Meeting, Cambridge, UK, 2017.

Deakin T, Price J, Martineau M, McIntosh-Smith S. Evaluating attainable memory bandwidth of parallel programming models via BabelStream. International Journal of Computational Science and Engineering. Special issue (in press). 2017.

Deakin T, Price J, McIntosh-Smith S. Portable methods for measuring cache hierarchy performance. 2017. Poster session presented at IEEE/ACM SuperComputing, Denver, United States. You can view the Poster and Extended Abstract.

[1]: McCalpin, John D., 1995: "Memory Bandwidth and Machine Balance in Current High Performance Computers", IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, December 1995.