Add PagedAttention support (experimental, CUDA only) #17579
base: master
Conversation
Force-pushed from 2a33486 to 14ad291
Force-pushed from 14ad291 to 06254d1
Force-pushed from 06254d1 to 1745418
```cpp
const int token_idx = block_idx * BLOCK_SIZE + i;
if (token_idx >= seq_len) break;

// TODO: Vectorized K loading and Q·K computation
```
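For context, a minimal sketch of what that TODO might become: a float4-based Q·K dot product. This is illustrative only, assuming head_dim is a multiple of 4 and the Q/K rows are 16-byte aligned; the function name and parameters are placeholders, not part of this PR.

```cuda
// Hypothetical sketch of vectorized Q·K: load 4 floats at a time instead of
// scalar loads. Assumes head_dim % 4 == 0 and 16-byte aligned pointers.
__device__ inline float qk_dot_vec4(const float * __restrict__ q,
                                    const float * __restrict__ k,
                                    int head_dim) {
    const float4 * q4 = reinterpret_cast<const float4 *>(q);
    const float4 * k4 = reinterpret_cast<const float4 *>(k);
    float acc = 0.0f;
    for (int i = 0; i < head_dim / 4; ++i) {
        const float4 a = q4[i];
        const float4 b = k4[i];
        acc += a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
    }
    return acc;
}
```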
Some of the TODOs look quite sus; I'm wondering if the code is AI-generated and/or whether this function actually works.
Besides, you should probably give some credit to the original kernel: https://github.com/vllm-project/vllm/blob/main/csrc/attention/attention_kernels.cuh
I marked it experimental for a good reason 🙂
I think it's important to explicitly state whether you're using AI to generate this PR or not. The numerous TODOs throughout the PR do make it look sus. There will be a human who spends real time and effort reviewing this PR, after all.
> I marked it experimental for a good reason 🙂

I think this PR should be marked as a draft until it is no longer experimental.
IMO this shouldn't be turned into a PR until it's reasonably complete. I subscribed because I'm interested in trying it if it works, but I've gotten 10 notifications today and it doesn't even pass CI, so why is it being pushed at all?
Besides, I don't think adding paged attention makes any difference in llama.cpp beyond having an additional feature with a cool-looking name.
Since your AI is not capable enough to explain this to you @ericcurtin, I will:
> This feature reduces memory fragmentation by storing KV cache in fixed-size blocks (similar to virtual memory paging)
No, it does not. The order of blocks can also be fragmented. The actual idea is captured by this sentence in the original documentation: "the KV cache does not need to be stored in contiguous memory". That crucial detail is left out, which makes the description technically wrong.
And we can definitely implement the notion of a "block" with the existing llama-kv-cache infrastructure. We just need to align the placement of KV vectors to a fixed size and voilà, you have "fixed-size blocks".
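To make the "non-contiguous" point concrete, here is a minimal, self-contained sketch of a per-sequence block table that maps logical token positions to fixed-size physical blocks scattered anywhere in a pool. All names (BlockTable, locate, BLOCK_SIZE) are illustrative and not taken from this PR or from vLLM.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Illustrative only: a block table maps logical token positions to physical
// blocks of BLOCK_SIZE tokens. The physical blocks need not be contiguous or
// in order; that is the actual point of paging.
constexpr int BLOCK_SIZE = 16;

struct BlockTable {
    std::vector<int> physical_blocks; // index i = logical block i -> physical block id
};

// Translate a logical token position into (physical block id, offset within block).
static std::pair<int, int> locate(const BlockTable & bt, int token_pos) {
    const int logical_block = token_pos / BLOCK_SIZE;
    const int offset        = token_pos % BLOCK_SIZE;
    return { bt.physical_blocks[logical_block], offset };
}

int main() {
    // Physical blocks 7, 2, 41 are deliberately out of order and non-adjacent.
    BlockTable bt { { 7, 2, 41 } };
    auto [block, off] = locate(bt, 20); // token 20 -> logical block 1, offset 4
    std::printf("token 20 lives in physical block %d at offset %d\n", block, off);
    return 0;
}
```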
> enables efficient memory sharing between sequences through copy-on-write semantics.
llama.cpp does have copy-on-write, just not automatic. llama_memory_seq_cp is there to allow sharing memory among multiple sequences. Of course, we could implement an automatic mechanism, just not right now.
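For illustration, here is a rough, standalone sketch of what an automatic copy-on-write scheme over fixed-size KV blocks could look like, using per-block reference counts. This is an assumption-laden toy, not how llama.cpp, llama_memory_seq_cp, or this PR actually implements sharing.

```cpp
#include <cstdio>
#include <unordered_map>

// Toy copy-on-write over KV blocks: sequences share physical blocks until one
// of them writes, at which point the writer gets a private copy.
// Purely illustrative; not the llama.cpp implementation.
struct BlockPool {
    std::unordered_map<int, int> refcount; // physical block id -> number of owners
    int next_id = 0;

    int  alloc()       { int id = next_id++; refcount[id] = 1; return id; }
    void share(int id) { refcount[id]++; }

    // Called before a sequence writes into block `id`. If the block is shared,
    // detach and return a new private block (a real implementation would also
    // copy the KV data); otherwise the caller may write in place.
    int make_writable(int id) {
        if (refcount[id] == 1) return id; // sole owner, write in place
        refcount[id]--;                   // detach from the shared block
        return alloc();                   // private copy for the writer
    }
};

int main() {
    BlockPool pool;
    int prompt_block = pool.alloc(); // KV block of a shared prompt prefix
    pool.share(prompt_block);        // a second sequence reuses the same prefix

    // Sequence 2 generates a new token that lands in the shared block:
    int seq2_block = pool.make_writable(prompt_block);
    std::printf("seq1 keeps block %d, seq2 writes into block %d\n",
                prompt_block, seq2_block);
    return 0;
}
```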
Force-pushed from 31d8188 to 08abefa
Force-pushed from 08abefa to b6edf80
Force-pushed from 19466fb to efeba44
Implement PagedAttention algorithm for memory-efficient KV cache management. This feature reduces memory fragmentation by storing KV cache in fixed-size blocks (similar to virtual memory paging) and enables efficient memory sharing between sequences through copy-on-write semantics. The implementation is experimental and disabled by default. Enable with the --pagedattention flag.

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
Force-pushed from efeba44 to de93b99