simpleGEMM

(Logo image generated by DALL·E 3)

This is an extremely minimalistic but fast implementation of matrix multiplication in CUDA. The entire source is a single ~200-line file, gemm.cuh, which implements half-precision tensor-core matrix multiplication, optimised for the Turing (SM75) architecture.

The implementation builds on top of CuTe from CUTLASS, a low-level interface for tensor manipulation in CUDA. The code is well-commented and is meant to be easily readable (minimal CUDA/C++ background knowledge required) and hackable.
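
CuTe's core abstraction is the layout: a mapping from logical tensor coordinates to linear memory offsets, and the kernel is largely expressed in those terms. A minimal host-side sketch of the idea (my own illustration, not code from gemm.cuh):

#include <cstdio>
#include <cute/tensor.hpp>
using namespace cute;

int main() {
    // An 8x4 row-major layout: shape (8, 4), strides (4, 1).
    auto layout = make_layout(make_shape(Int<8>{}, Int<4>{}),
                              make_stride(Int<4>{}, Int<1>{}));
    // A layout maps a logical coordinate to a linear offset:
    // (row 1, col 2) -> 1 * 4 + 2 = 6.
    printf("%d\n", int(layout(1, 2)));
    return 0;
}

This compiles with the same nvcc flags as in the Quick start below.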

Benchmarks against standard implementations (see main.cu and reference.cu):

$ ./main
Usage: ./main M N K iters

$ ./main 4096 4096 4096 1000
Time elapse: 6043.59ms
TFLOPS: 22.7413

$ ./main 8192 8192 8192 100
Time elapse: 4819.51ms
TFLOPS: 22.8138

$ ./reference 4096 4096 4096 1000
Time elapse: 6040.42ms
TFLOPS: 22.7532

$ ./reference 8192 8192 8192 100
Time elapse: 4657.08ms
TFLOPS: 23.6095

The theoretical maximum for the hardware I used (RTX 2060) is 26 TFLOPS.
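
For reference, the TFLOPS figure is the usual 2·M·N·K FLOPs per GEMM, times the iteration count, divided by wall time. A quick sanity check of the first run above (my own arithmetic, not code from main.cu):

#include <cstdio>

int main() {
    double M = 4096, N = 4096, K = 4096, iters = 1000, ms = 6043.59;
    // 2*M*N*K FLOPs per GEMM, iters GEMMs, over the total elapsed time.
    double tflops = 2 * M * N * K * iters / (ms * 1e-3) / 1e12;
    printf("%.4f TFLOPS\n", tflops); // prints 22.7413
    return 0;
}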

Quick start

Requires CUDA installed. See https://docs.nvidia.com/cuda/cuda-installation-guide-linux/ for instructions. If you don't have a compatible GPU, you can run this in Colab.

Compile the main.cu file (the build/ output directory must exist first):

mkdir -p build
nvcc \
    --include-path ./ \
    --include-path cutlass/include \
    --generate-code=arch=compute_75,code=[compute_75,sm_75] \
    --expt-relaxed-constexpr \
    -forward-unknown-to-host-compiler \
    -std=c++17 \
    -O3 \
    -o build/main \
    main.cu

And run!

$ ./build/main
Usage: ./main M N K iters

$ ./build/main 4096 4096 4096 1000
Time elapse: 6043.59ms
TFLOPS: 22.7413

You can also build with CMake (a better option for development):

$ mkdir build
$ cd build/
$ cmake ..
-- Configuring done
-- Generating done
-- Build files have been written to: /workspaces/simpleGEMM/build
$ make main 
Consolidate compiler generated dependencies of target main
[ 50%] Building CUDA object CMakeFiles/main.dir/main.cu.o
[100%] Linking CUDA executable main
[100%] Built target main
$ ./main
Usage: ./main M N K iters
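
The repository ships its own CMakeLists.txt; for orientation, the essential ingredients are roughly the following (a sketch under my assumptions, not the actual file):

# Sketch of a minimal CMakeLists.txt for this project; illustrative only.
cmake_minimum_required(VERSION 3.18)
project(simpleGEMM LANGUAGES CXX CUDA)

add_executable(main main.cu)
# Headers come from the repo root and the cutlass checkout.
target_include_directories(main PRIVATE
    ${CMAKE_SOURCE_DIR}
    ${CMAKE_SOURCE_DIR}/cutlass/include)
target_compile_features(main PRIVATE cuda_std_17)
set_target_properties(main PROPERTIES CUDA_ARCHITECTURES 75)
target_compile_options(main PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>)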

What's missing

The code trades off generality for simplicity:

  • Only supports fp16 matmul out of the box. It should be quite easy to move to bf16, though.
  • Optimised for SM75 w/ tensor cores. This is probably sub-optimal for SM80+ (e.g. A100), but probably not terrible either.
  • Assumes (asserts) that the input dimensions are divisible by the block tile sizes (see the sketch after this list).
  • Assumes the inputs are in row-major layout. (You probably want a row-major layout anyway; other combinations are 10-30% slower.)
  • Doesn't do software pipelining (i.e. interleaving the global-memory load for the next tile with computation on the current one).
  • Is only optimal for "normal" problem sizes. For more exotic shapes, such as small M/N with large K, a specialised implementation like a split-K kernel is likely to perform better.
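
To make the divisibility assumption concrete, a host-side guard might look like the following (the tile sizes here are placeholders; the real values are defined in gemm.cuh):

#include <cassert>

// Hypothetical tile sizes, for illustration only; the actual values
// live in gemm.cuh.
constexpr int kBlockM = 128, kBlockN = 128, kBlockK = 64;

void check_shapes(int M, int N, int K) {
    // The kernel iterates over whole tiles only, so a partial tile would
    // read/write out of bounds. Reject (or pre-pad) such shapes up front.
    assert(M % kBlockM == 0);
    assert(N % kBlockN == 0);
    assert(K % kBlockK == 0);
}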
