
avagordon01/accelerator-benchmarks


FLOPS & Latency Benchmarks

This repo tries to answer the question: "for a specific input size and (simple) computation, which framework/accelerator should I choose?"

By implementing the same computation on all available frameworks/accelerators (CPU FPU baseline, SIMD, threads, GPU, FPGA, multi-machine, ...) and running with input sizes from 1 float to 1 billion (or more) floats, we'll see which framework is optimal for which input size.

Frameworks:

TODO:

  • estimate flops/byte (arithmetic intensity) for the two computations
  • estimate flops/byte (maximum FLOPS throughput / maximum memory bandwidth) for all the hardware targets (CPU, SIMD, threaded, GPU, multi-machine)
  • investigate whether it is possible to create a flops/byte measurement tool, plus a set of stress tests to measure it

Similar projects

Parallel Programming Ecosystem Comparison

A parallel programming ecosystem needs:

  • a language, language extensions, or compiler for describing parallelism, tasks, async, dependencies, etc.
  • backends for SIMD, multi-core, GPU, and multi-machine execution
  • standard algorithms (BLAS, sort, reduce, etc.)
  • tools for debugging and profiling
| ecosystem | compiler | SIMD | multi-core | GPU | multi-machine | sort | reduce | BLAS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C++ STL | any (plain C++) | std::ex::simd | std::thread | std::executors (future) | ❔ asio? (future) | std::sort par_unseq | std::ex::parallel::reduce | stdblas (future) |
| OpenMP | gcc, clang, icc (pragma-extended C++) | ✔️ | ✔️ | ✔️ | | ✔️ | ✔️ | ❔ OpenBLAS? Eigen? |
| sandia Kokkos | any (plain C++) | ✔️ | ✔️ | ✔️ | ✔️ MPI | ✔️ | ✔️ | ✔️ stdblas |
| intel oneAPI | intel dpc++ (SYCL) | ✔️ | ✔️ | ✔️ | ✔️ MPI | ✔️ TBB | ✔️ TBB | ✔️ MKL |
| nvidia CUDA | clang, nvc++ (extended C++) | | | ✔️ | ✔️ NCCL | ✔️ thrust / libcu++ | ✔️ thrust / libcu++ | ✔️ cutlass / cuBLAS |
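To make the comparison concrete, here is the same reduction expressed in two of the ecosystems above: C++ STL (`std::reduce`) and OpenMP (a `reduction` clause). This is an illustrative sketch, not code from this repo; if compiled without `-fopenmp`, the pragma is simply ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// C++ STL: the algorithm expresses the reduction; an execution policy
// (e.g. std::execution::par_unseq) would select the parallel backend.
float stl_sum(const std::vector<float>& v) {
    return std::reduce(v.begin(), v.end(), 0.0f);
}

// OpenMP: the pragma annotates an ordinary loop; each thread keeps a
// private partial sum which the runtime combines at the end.
float omp_sum(const std::vector<float>& v) {
    float s = 0.0f;
    #pragma omp parallel for reduction(+ : s)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(v.size()); i++)
        s += v[static_cast<std::size_t>(i)];
    return s;
}
```

The design difference is visible even in this tiny example: the STL version names the algorithm and delegates the loop, while the OpenMP version keeps the loop explicit and annotates it.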

Dependencies

```sh
sudo pacman -S \
    cmake clang \
    benchmark \
    python-matplotlib python-pandas \
    openmp \
    vulkan-tools vulkan-driver vulkan-headers glslang \
    eigen
```

Running

```sh
./test.sh
```

Example

For O(N) vector addition on specific hardware, the graph answers the question as follows:

  • for input size <= 2^8 floats, use CPU SIMD
  • for input size between 2^12 and 2^20 floats, use OpenMP
  • for input size >= 2^24 floats (and negligible device/host memory transfer cost), use the GPU
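Those crossover points could be baked into a size-based dispatcher. The sketch below is hypothetical (not part of this repo), the thresholds are specific to the hardware that produced the graph, and the regions between the measured bands are interpolated:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical dispatcher built from the crossover points above.
// Thresholds come from the benchmark run on one specific machine;
// the gaps between measured bands (e.g. 2^8..2^12) are interpolated.
std::string pick_backend(std::size_t n_floats) {
    if (n_floats <= (1u << 8))       // up to 2^8 floats
        return "cpu-simd";
    if (n_floats < (1u << 24))       // mid-sized inputs
        return "openmp";
    return "gpu";                    // assumes negligible transfer cost
}
```

A production version would also account for whether the data already lives in device memory, which can move the GPU threshold by orders of magnitude.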

A corresponding graph covers O(N^3) matrix multiplication.
