Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
While the memory throughput of the matrix-vector (MMV) CUDA kernels is already a high percentage of peak on GDDR GPUs, there is significant room for improvement on High Bandwidth Memory (HBM) GPUs.
Motivation
Optimizations that improve memory access patterns could substantially improve inference performance on NVIDIA HBM GPUs, and modestly improve it on GDDR GPUs.
Possible Implementation
F16 MMV Kernel
PR #9816 already implements a large optimization for F16 models by vectorizing load instructions.
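To illustrate the idea behind vectorized loads, here is a minimal sketch of an F16 matrix-vector product where each thread issues one 32-bit `half2` load instead of two 16-bit `half` loads. This is not the actual kernel from the PR; the name `mmv_f16_vec`, the one-block-per-row layout, and the assumptions (even `ncols`, power-of-two block size up to 256) are illustrative only.

```cuda
#include <cuda_fp16.h>

// Sketch only, not the llama.cpp kernel. Assumes ncols is even (so half2
// loads stay in bounds and 4-byte aligned) and blockDim.x is a power of two <= 256.
__global__ void mmv_f16_vec(const half * __restrict__ A,   // nrows x ncols, row-major
                            const half * __restrict__ x,   // ncols
                            float      * __restrict__ y,   // nrows
                            const int ncols) {
    const int row = blockIdx.x;                     // one block per output row
    const half * row_ptr = A + (size_t) row * ncols;

    float sum = 0.0f;

    // Each thread strides over the row in steps of 2*blockDim.x, issuing one
    // 32-bit (half2) load per matrix/vector access instead of two 16-bit loads.
    for (int col = 2 * threadIdx.x; col < ncols; col += 2 * blockDim.x) {
        const half2 a = *reinterpret_cast<const half2 *>(row_ptr + col);
        const half2 b = *reinterpret_cast<const half2 *>(x + col);
        const float2 af = __half22float2(a);
        const float2 bf = __half22float2(b);
        sum += af.x * bf.x + af.y * bf.y;
    }

    // Simple shared-memory reduction across the block (warp shuffles omitted for brevity).
    __shared__ float tmp[256];
    tmp[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tmp[threadIdx.x] += tmp[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) y[row] = tmp[0];
}
```

A launch such as `mmv_f16_vec<<<nrows, 256>>>(A, x, y, ncols)` would then compute one output row per block; the point of the sketch is only that the wider loads halve the number of load instructions per element, which matters most when memory throughput is the bottleneck.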
Here is the speedup achieved for Meta-Llama-3-8B-Instruct-F16 BS1 inference token evaluation:
| GPU | Overall speedup | MMV kernel speedup |
|---|---|---|
| H100-SXM-80GB-HBM3 | 1.27X | 1.33X |
| H100-PCIe-80GB-HBM2e | 1.20X | 1.24X |
| A100-SXM-80GB-HBM2e | 1.04X | 1.05X |
| L40S-40GB-GDDR6 | 1.01X | 1.01X |
| RTX-4090-24GB-GDDR6 | 1.01X | 1.01X |
Using Nsight Compute to inspect memory throughput, we see that the RTX 4090 is already at 94% of peak before this optimization (rising only slightly after it). However, the H100-SXM-80GB-HBM3 is only at 59%, which rises to 74% with this optimization. This also suggests there is room for further improvement.
Quantized model MMV kernels
These also show significant room for improvement in memory throughput, although some more substantial refactoring may be needed to unlock it.