Fast CUTLASS GEMM from scratch

Step-by-step optimization of matrix multiplication, implemented with the Nvidia CUTLASS C++ template library.

In the style of https://siboehm.com/articles/22/CUDA-MMM.

Building

git submodule update --init --recursive
make

Running

Benchmark

make bench

Test

make test

Notes

Matrix A is MxK and matrix B is NxK; the result C is MxN
All matrices are stored in column major layout
The matrix B is stored as transposed (NT) (for vectorized memory access)
The GEMM computes C = α * A × Bᵗ + β * C
The implementations use Tensor Cores
Precision used is BF16 with FP32 accumulate (Ampere+ required)
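
To make the layout conventions above concrete, here is a minimal CPU reference sketch. It is not taken from the repository: the function name `gemm_nt_reference` and the use of `std::vector<float>` are assumptions made for illustration, and plain FP32 stands in for BF16 inputs. It shows how the column-major index `row + col * ld` applies to A (MxK), B (NxK), and C (MxN), and why the NT form reads both operands contiguously: for a fixed k, the column of A runs contiguously over m and the column of B runs contiguously over n.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the repository): CPU reference for
// C = alpha * A * B^T + beta * C with every matrix in column-major layout.
// A is M x K (leading dimension M), B is N x K (leading dimension N),
// C is M x N (leading dimension M).
void gemm_nt_reference(std::size_t M, std::size_t N, std::size_t K,
                       float alpha, const std::vector<float>& A,
                       const std::vector<float>& B, float beta,
                       std::vector<float>& C) {
    assert(A.size() == M * K && B.size() == N * K && C.size() == M * N);
    for (std::size_t n = 0; n < N; ++n) {
        for (std::size_t m = 0; m < M; ++m) {
            float acc = 0.0f;  // FP32 accumulator
            for (std::size_t k = 0; k < K; ++k) {
                // Column-major: element (row, col) lives at row + col * ld.
                acc += A[m + k * M] * B[n + k * N];
            }
            C[m + n * M] = alpha * acc + beta * C[m + n * M];
        }
    }
}
```

A reference like this is useful mainly for checking the output of an optimized kernel on small problem sizes; it makes no attempt to be fast.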

Other

The cuBLAS library is needed only to compile the cuBLAS benchmark implementation. If it is not present, the rest of the code still compiles and runs.

Changing NN to NT in cuBLAS gives ~27% speedup in the original CUDA-MMM code.

Using half precision (16-bit instead of 32-bit elements) gives up to 2x the effective memory bandwidth and compute throughput.
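
The bandwidth part of that claim follows from element size: a BF16 value occupies 2 bytes where an FP32 value occupies 4, so the same number of bytes moved carries twice as many elements. The sketch below emulates BF16 storage with FP32 accumulation on the CPU. The helper names (`float_to_bf16`, `bf16_to_float`, `dot_bf16_fp32`) are hypothetical and not part of the repository or of any CUDA header; a GPU kernel would use the hardware BF16 type instead.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helpers (not from the repository) emulating BF16 on the CPU.
// BF16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits
// of an FP32 value, i.e. the upper 16 bits of its bit pattern.
std::uint16_t float_to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    // Round to nearest even on the 16 dropped bits (NaN handling omitted).
    std::uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>((bits + rounding_bias) >> 16);
}

float bf16_to_float(std::uint16_t h) {
    std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}

// Dot product with BF16 inputs and an FP32 accumulator.
float dot_bf16_fp32(const std::uint16_t* a, const std::uint16_t* b,
                    std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        acc += bf16_to_float(a[i]) * bf16_to_float(b[i]);
    }
    return acc;
}
```

Keeping the accumulator in FP32 limits the rounding error that would otherwise build up when many 8-bit-precision products are summed.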
