A PyTorch extension implementing fused softmax in CUDA; it performs on par with the Triton fused softmax. The goal of this project is to get experience writing CUDA reduction kernels and linking them against PyTorch.
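A softmax over a row requires two reductions (the row max and the sum of exponentials). The pairwise tree-reduction pattern that CUDA reduction kernels use can be sketched in plain Python (a sequential illustration of the access pattern, not the actual kernel):

```python
def tree_reduce(vals, op):
    """Pairwise tree reduction, mirroring how a CUDA block
    combines partial results in O(log2(n)) strided steps."""
    vals = list(vals)
    stride = 1
    while stride < len(vals):
        # each "thread" at index i combines with its partner at i + stride
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
    return vals[0]

# the two reductions softmax needs: a max and a sum
assert tree_reduce([3, 1, 4, 1, 5, 9, 2, 6], max) == 9
assert tree_reduce(range(8), lambda a, b: a + b) == 28
```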
Install the extension with:

```shell
python3 setup.py install
```
Usage:

```python
import torch
from softmax_cuda import fusedSoftmax

# create a random fp32 tensor -> only single precision and 2D tensors are supported!
x = torch.randn((6144, 8192), device='cuda:0', dtype=torch.float32)
out = fusedSoftmax(x)
```
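For reference, the row-wise softmax that the kernel computes (with the usual max-subtraction trick for numerical stability) can be sketched in NumPy; this is a plain-Python reference, not the CUDA implementation:

```python
import numpy as np

def softmax_rows(x: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise softmax (reference sketch)."""
    # subtract each row's max so exp() never overflows
    shifted = x - x.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

x = np.random.randn(4, 8).astype(np.float32)
out = softmax_rows(x)
# every row of a softmax sums to 1
assert np.allclose(out.sum(axis=1), 1.0, atol=1e-6)
```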
You can also build the standalone CUDA kernel with:

```shell
make gpu
```
The resulting binary benchmarks the kernel and tests it for correctness against an (unoptimized) CPU implementation. Once compiled, run it with:

```shell
./softmax_cuda.bin
```
Expected output should be similar to:

```
Warmup started
Benchmark started
Total elapsed time: (1.185808) s, performance: ( 452.7) GB/s, memory reads & writes (GB): ( 536.9)
Error checking:
Relative Error (0.00000190)
```
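As a sanity check on these numbers, the reported bandwidth is simply total bytes moved divided by elapsed time; plugging in the (rounded) figures above reproduces it:

```python
elapsed_s = 1.185808   # "Total elapsed time" from the benchmark output
moved_gb = 536.9       # "memory reads & writes (GB)"
bandwidth_gb_s = moved_gb / elapsed_s
# close to the reported 452.7 GB/s (both printed figures are rounded)
print(f"{bandwidth_gb_s:.1f} GB/s")
```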
To compare against torch.softmax and Triton's fused softmax, run:

```shell
pytest src/test_softmax.py
```