
avagordon01/accelerator-benchmarks


FLOPS & Latency Benchmarks

This repo tries to answer the question: "for a specific input size and (simple) computation, which framework/accelerator should I choose?"

By implementing the same computation on all available frameworks/accelerators (CPU FPU baseline, SIMD, threads, GPU, FPGA, multi-machine, ...) and running with input sizes from 1 float to 1 billion (or more) floats, we'll see which framework is optimal for which input size.

Frameworks:

TODO:

  • estimate flops/byte (arithmetic intensity) for the two computations
  • estimate flops/byte (maximum FLOPS throughput / maximum memory bandwidth) for all the hardware targets (CPU, SIMD, threaded, GPU, multi-machine)
  • investigate whether it is possible to create a flops/byte measurement tool, plus a set of stress tests to measure it

Similar projects

Parallel Programming Ecosystem Comparison

A parallel programming ecosystem needs:

  • a language, language extensions, or compiler for describing parallelism, tasks, async, dependencies, etc.
  • backends for SIMD, multi-core, GPU, and multi-machine execution
  • standard algorithms (BLAS, sort, reduce, etc.)
  • tools for debugging and profiling
| ecosystem | compiler | SIMD | multi-core | GPU | multi-machine | sort | reduce | BLAS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C++ STL | any (plain C++) | std::ex::simd | std::thread | std::executors (future) | ❔ asio? (future) | std::sort par_unseq | std::ex::parallel::reduce | stdblas (future) |
| OpenMP | gcc, clang, icc (pragma-extended C++) | ✔️ | ✔️ | ✔️ | | ✔️ | ✔️ | ❔ OpenBLAS? Eigen? |
| sandia Kokkos | any (plain C++) | ✔️ | ✔️ | ✔️ | ✔️ MPI | ✔️ | ✔️ | ✔️ stdblas |
| intel oneAPI | intel dpc++ (SYCL) | ✔️ | ✔️ | ✔️ | ✔️ MPI | ✔️ TBB | ✔️ TBB | ✔️ MKL |
| nvidia CUDA | clang, nvc++ (extended C++) | | | ✔️ | ✔️ NCCL | ✔️ thrust / libcu++ | ✔️ thrust / libcu++ | ✔️ cutlass / cuBLAS |
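To make the comparison concrete, here is the same reduction expressed in two of the ecosystems above: C++ STL (`std::reduce`) and OpenMP (a `reduction` clause). This is an illustrative sketch, not code from this repo; if compiled without `-fopenmp`, the pragma is simply ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// C++ STL: the algorithm expresses the reduction; an execution policy
// (e.g. std::execution::par_unseq) would select the parallel backend.
float stl_sum(const std::vector<float>& v) {
    return std::reduce(v.begin(), v.end(), 0.0f);
}

// OpenMP: the pragma annotates an ordinary loop; each thread keeps a
// private partial sum which the runtime combines at the end.
float omp_sum(const std::vector<float>& v) {
    float s = 0.0f;
    #pragma omp parallel for reduction(+ : s)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(v.size()); i++)
        s += v[static_cast<std::size_t>(i)];
    return s;
}
```

The design difference is visible even in this tiny example: the STL version names the algorithm and delegates the loop, while the OpenMP version keeps the loop explicit and annotates it.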

Dependencies

```sh
sudo pacman -S \
    cmake clang \
    benchmark \
    python-matplotlib python-pandas \
    openmp \
    vulkan-tools vulkan-driver vulkan-headers glslang \
    eigen
```

Running

```sh
./test.sh
```

Example

For O(N) vector addition on specific hardware, the graph answers the question as follows:

  • for input size <= 2^8 floats, use CPU SIMD
  • for input size between 2^12 and 2^20 floats, use OpenMP
  • for input size >= 2^24 floats (and negligible device/host memory transfer cost), use the GPU
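Those crossover points could be baked into a size-based dispatcher. The sketch below is hypothetical (not part of this repo), the thresholds are specific to the hardware that produced the graph, and the regions between the measured bands are interpolated:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical dispatcher built from the crossover points above.
// Thresholds come from the benchmark run on one specific machine;
// the gaps between measured bands (e.g. 2^8..2^12) are interpolated.
std::string pick_backend(std::size_t n_floats) {
    if (n_floats <= (1u << 8))       // up to 2^8 floats
        return "cpu-simd";
    if (n_floats < (1u << 24))       // mid-sized inputs
        return "openmp";
    return "gpu";                    // assumes negligible transfer cost
}
```

A production version would also account for whether the data already lives in device memory, which can move the GPU threshold by orders of magnitude.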

A corresponding graph covers O(N^3) matrix multiplication.
