Optimization of matrix-vector kernel memory accesses for NVIDIA CUDA High Bandwidth GPUs  #9817

@agray3

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

The matrix-vector (MMV) CUDA kernels already reach a high percentage of peak memory throughput on GDDR GPUs, but there is considerable room for improvement on High Bandwidth Memory (HBM) GPUs.

Motivation

Optimizing memory access patterns could substantially improve inference performance on NVIDIA HBM GPUs and modestly improve it on GDDR GPUs.

Possible Implementation

F16 MMV Kernel
PR #9816 already implements a large optimization for F16 models by vectorizing the load instructions in the kernel.
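To illustrate what vectorizing loads means here, below is a minimal sketch of an F16 matrix-vector kernel that reads 128 bits (8 half values) per load instruction instead of one 16-bit value at a time. This is not the code from PR #9816. The kernel name `mmv_f16_vec8`, the helper `mmv_f16_demo_max_error`, the one-warp-per-row layout, and the requirements that `ncols` is a multiple of 8 and that the pointers are 16-byte aligned are all assumptions made for illustration.

```cuda
#include <cuda_fp16.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical kernel (NOT the code from PR #9816): y = A * x for a row-major
// F16 matrix A, one warp (32 threads) per row. Each thread issues 128-bit
// loads (uint4 = 8 half values) rather than 16-bit scalar loads.
// Assumes ncols % 8 == 0 and 16-byte-aligned A and x.
__global__ void mmv_f16_vec8(const half *A, const half *x, float *y,
                             int nrows, int ncols) {
    const int row = blockIdx.x;
    if (row >= nrows) return;

    const uint4 *a4 = reinterpret_cast<const uint4 *>(A + (size_t)row * ncols);
    const uint4 *x4 = reinterpret_cast<const uint4 *>(x);

    float sum = 0.0f;
    for (int i = threadIdx.x; i < ncols / 8; i += 32) {
        const uint4 av = a4[i];  // one 128-bit load from the matrix row
        const uint4 xv = x4[i];  // one 128-bit load from the vector
        const half2 *ah = reinterpret_cast<const half2 *>(&av);
        const half2 *xh = reinterpret_cast<const half2 *>(&xv);
#pragma unroll
        for (int k = 0; k < 4; ++k) {
            const float2 af = __half22float2(ah[k]);
            const float2 xf = __half22float2(xh[k]);
            sum += af.x * xf.x + af.y * xf.y;  // accumulate in F32
        }
    }

    // Warp-level reduction of the 32 partial sums.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);
    if (threadIdx.x == 0) y[row] = sum;
}

// Hypothetical host helper: runs the kernel on small test data and returns
// the maximum absolute difference from a CPU reference (-1 on a CUDA error).
float mmv_f16_demo_max_error() {
    const int nrows = 64, ncols = 1024;
    std::vector<half> hA((size_t)nrows * ncols), hx(ncols);
    std::vector<float> ref(nrows, 0.0f), hy(nrows, 0.0f);
    for (size_t i = 0; i < hA.size(); ++i)
        hA[i] = __float2half((float)(i % 7) * 0.125f);
    for (int c = 0; c < ncols; ++c)
        hx[c] = __float2half((float)(c % 5) * 0.25f);
    for (int r = 0; r < nrows; ++r)
        for (int c = 0; c < ncols; ++c)
            ref[r] += __half2float(hA[(size_t)r * ncols + c]) * __half2float(hx[c]);

    half *dA = nullptr, *dx = nullptr;
    float *dy = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(half));
    cudaMalloc(&dx, hx.size() * sizeof(half));
    cudaMalloc(&dy, hy.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), hx.size() * sizeof(half), cudaMemcpyHostToDevice);

    mmv_f16_vec8<<<nrows, 32>>>(dA, dx, dy, nrows, ncols);
    const cudaError_t err =
        cudaMemcpy(hy.data(), dy, hy.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dx);
    cudaFree(dy);
    if (err != cudaSuccess) return -1.0f;

    float max_err = 0.0f;
    for (int r = 0; r < nrows; ++r)
        max_err = fmaxf(max_err, fabsf(hy[r] - ref[r]));
    return max_err;
}

int main() {
    printf("max abs error vs CPU reference: %g\n", mmv_f16_demo_max_error());
    return 0;
}
```

Wider loads mean fewer load instructions for the same number of bytes, which generally makes it easier to keep enough memory requests in flight to saturate a high-bandwidth memory system. How much this helps on a given GPU has to be measured, for example with Nsight Compute.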

Here is the speedup achieved for Meta-Llama-3-8B-Instruct-F16 token evaluation during inference at batch size 1 (BS1):

| GPU | Overall speedup | MMV kernel speedup |
|---|---|---|
| H100-SXM-80GB-HBM3 | 1.27X | 1.33X |
| H100-PCIe-80GB-HBM2e | 1.20X | 1.24X |
| A100-SXM-80GB-HBM2e | 1.04X | 1.05X |
| L40S-40GB-GDDR6 | 1.01X | 1.01X |
| RTX-4090-24GB-GDDR6 | 1.01X | 1.01X |

Inspecting memory throughput with Nsight Compute shows that the RTX 4090 is already at 94% of peak before this optimization, and rises only slightly with it. The H100-SXM-80GB-HBM3, however, is at only 59% of peak before the optimization and reaches 74% with it. The remaining gap to peak suggests there is room for further improvement.
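As a rough consistency check on the speedup figures above, Amdahl's law can be used to estimate the fraction $f$ of token-evaluation time that was spent in the MMV kernels before the optimization. This assumes the MMV kernels are the only part that got faster, and it uses the rounded speedups reported here, so treat it as back-of-the-envelope arithmetic rather than a profiler measurement. For the H100-SXM row:

$$
S_{\text{overall}} = \frac{1}{(1 - f) + f / S_{\text{kernel}}}
\quad\Longrightarrow\quad
f = \frac{1 - 1/S_{\text{overall}}}{1 - 1/S_{\text{kernel}}}
  = \frac{1 - 1/1.27}{1 - 1/1.33} \approx 0.86
$$

The H100-PCIe row (1.20X overall, 1.24X kernel) gives roughly the same value. Under these assumptions, the MMV kernels account for the large majority of BS1 token-evaluation time on these GPUs.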

Quantized model MMV kernels
These kernels also show significant room for improvement in memory throughput, although more substantial refactoring may be needed to unlock it.
