Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
While the memory throughput of the matrix-vector (MMV) CUDA kernels is already a high percentage of peak on GDDR GPUs, there is significant room for improvement on High Bandwidth Memory (HBM) GPUs.
Motivation
Optimizations that improve memory access patterns could substantially improve inference performance on NVIDIA HBM GPUs, and modestly improve it on GDDR GPUs.
Possible Implementation
F16 MMV Kernel
PR #9816 already implements a large optimization for F16 models by vectorizing load instructions.
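To illustrate the idea behind vectorized loads, here is a minimal sketch of an F16 matrix-vector product where each thread issues one 32-bit `half2` load instead of two 16-bit `half` loads. This is not the actual kernel from the PR; the name `mmv_f16_vec`, the one-block-per-row layout, and the assumptions (even `ncols`, power-of-two block size up to 256) are illustrative only.

```cuda
#include <cuda_fp16.h>

// Sketch only, not the llama.cpp kernel. Assumes ncols is even (so half2
// loads stay in bounds and 4-byte aligned) and blockDim.x is a power of two <= 256.
__global__ void mmv_f16_vec(const half * __restrict__ A,   // nrows x ncols, row-major
                            const half * __restrict__ x,   // ncols
                            float      * __restrict__ y,   // nrows
                            const int ncols) {
    const int row = blockIdx.x;                     // one block per output row
    const half * row_ptr = A + (size_t) row * ncols;

    float sum = 0.0f;

    // Each thread strides over the row in steps of 2*blockDim.x, issuing one
    // 32-bit (half2) load per matrix/vector access instead of two 16-bit loads.
    for (int col = 2 * threadIdx.x; col < ncols; col += 2 * blockDim.x) {
        const half2 a = *reinterpret_cast<const half2 *>(row_ptr + col);
        const half2 b = *reinterpret_cast<const half2 *>(x + col);
        const float2 af = __half22float2(a);
        const float2 bf = __half22float2(b);
        sum += af.x * bf.x + af.y * bf.y;
    }

    // Simple shared-memory reduction across the block (warp shuffles omitted for brevity).
    __shared__ float tmp[256];
    tmp[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tmp[threadIdx.x] += tmp[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) y[row] = tmp[0];
}
```

A launch such as `mmv_f16_vec<<<nrows, 256>>>(A, x, y, ncols)` would then compute one output row per block; the point of the sketch is only that the wider loads halve the number of load instructions per element, which matters most when memory throughput is the bottleneck.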
Here is the speedup achieved for Meta-Llama-3-8B-Instruct-F16 BS1 inference token evaluation:
| GPU | Overall speedup | MMV kernel speedup |
|---|---|---|
| H100-SXM-80GB-HBM3 | 1.27X | 1.33X |
| H100-PCIe-80GB-HBM2e | 1.20X | 1.24X |
| A100-SXM-80GB-HBM2e | 1.04X | 1.05X |
| L40S-40GB-GDDR6 | 1.01X | 1.01X |
| RTX-4090-24GB-GDDR6 | 1.01X | 1.01X |
Using Nsight Compute to inspect memory throughput, we see that the RTX 4090 is already at 94% of peak before this optimization (rising only slightly after it). However, the H100-SXM-80GB-HBM3 is only at 59%, which rises to 74% with this optimization. This also suggests there is room for further improvement.
Quantized model MMV kernels
These also show significant room for improvement in memory throughput, although some more substantial refactoring may be needed to unlock it.