A PyTorch extension implementing fused softmax in CUDA; it performs on par with the Triton fused softmax. The goal of this project is to get experience writing CUDA reduction kernels and linking them against PyTorch.
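A softmax over a row requires two reductions (the row max and the sum of exponentials). The pairwise tree-reduction pattern that CUDA reduction kernels use can be sketched in plain Python (a sequential illustration of the access pattern, not the actual kernel):

```python
def tree_reduce(vals, op):
    """Pairwise tree reduction, mirroring how a CUDA block
    combines partial results in O(log2(n)) strided steps."""
    vals = list(vals)
    stride = 1
    while stride < len(vals):
        # each "thread" at index i combines with its partner at i + stride
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
    return vals[0]

# the two reductions softmax needs: a max and a sum
assert tree_reduce([3, 1, 4, 1, 5, 9, 2, 6], max) == 9
assert tree_reduce(range(8), lambda a, b: a + b) == 28
```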
Install the extension with:

```shell
python3 setup.py install
```
Usage:

```python
import torch
from softmax_cuda import fusedSoftmax

# create a random fp32 tensor -> only single precision and 2D tensors are supported!
x = torch.randn((6144, 8192), device='cuda:0', dtype=torch.float32)
out = fusedSoftmax(x)
```
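For reference, the row-wise softmax that the kernel computes (with the usual max-subtraction trick for numerical stability) can be sketched in NumPy; this is a plain-Python reference, not the CUDA implementation:

```python
import numpy as np

def softmax_rows(x: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise softmax (reference sketch)."""
    # subtract each row's max so exp() never overflows
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

x = np.random.randn(4, 8).astype(np.float32)
out = softmax_rows(x)
# every row of a softmax sums to 1
assert np.allclose(out.sum(axis=1), 1.0, atol=1e-6)
```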
You can also build the standalone CUDA kernel with:

```shell
make gpu
```
The resulting binary benchmarks the kernel and tests it for correctness against an (unoptimized) CPU implementation. Once compiled, run it with:

```shell
./softmax_cuda.bin
```
Expected output should be similar to:

```
Warmup started
Benchmark started
Total elapsed time: (1.185808) s, performance: ( 452.7) GB/s, memory reads & writes (GB): ( 536.9)
Error checking:
Relative Error (0.00000190)
```
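As a sanity check on these numbers, the reported bandwidth is simply total bytes moved divided by elapsed time; plugging in the (rounded) figures above reproduces it:

```python
elapsed_s = 1.185808   # "Total elapsed time" from the benchmark output
moved_gb = 536.9       # "memory reads & writes (GB)"
bandwidth_gb_s = moved_gb / elapsed_s
# close to the reported 452.7 GB/s (both printed figures are rounded)
print(f"{bandwidth_gb_s:.1f} GB/s")
```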
To compare against torch.softmax and Triton's fused softmax, run:

```shell
pytest src/test_softmax.py
```