Conversation

@ericcurtin (Collaborator) commented Nov 28, 2025

Implement PagedAttention algorithm for memory-efficient KV cache management. This feature reduces memory fragmentation by storing KV cache in fixed-size blocks (similar to virtual memory paging) and enables efficient memory sharing between sequences through copy-on-write semantics.

The implementation is experimental and disabled by default. Enable with the --pagedattention flag.
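
For illustration, here is a minimal sketch of the block-table idea; the names (paged_kv_pool, seq_block_table), the block size, and the layout below are made up for the example and are not the code in this PR:

```cpp
#include <vector>

// Illustrative sketch: a pool of fixed-size physical blocks with copy-on-write
// sharing via reference counts.
struct paged_kv_pool {
    static constexpr int BLOCK_SIZE = 16;   // tokens per physical block
    std::vector<int> ref_count;             // one counter per physical block
    std::vector<int> free_blocks;           // recycled physical block indices

    int alloc_block() {
        if (!free_blocks.empty()) {
            int b = free_blocks.back();
            free_blocks.pop_back();
            ref_count[b] = 1;
            return b;
        }
        ref_count.push_back(1);
        return (int) ref_count.size() - 1;
    }

    void release_block(int b) {
        if (--ref_count[b] == 0) {
            free_blocks.push_back(b);       // block becomes reusable, no compaction needed
        }
    }
};

// Per-sequence block table: logical block index -> physical block index.
struct seq_block_table {
    std::vector<int> blocks;

    // Fork a sequence: the child shares every block with the parent until
    // one of them writes into a shared block (copy-on-write).
    static seq_block_table fork(const seq_block_table & parent, paged_kv_pool & pool) {
        seq_block_table child = parent;
        for (int b : child.blocks) {
            pool.ref_count[b]++;
        }
        return child;
    }
};
```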

@ericcurtin force-pushed the add-pagedattention branch 3 times, most recently from 2a33486 to 14ad291 on November 28, 2025 at 19:58
@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Nov 28, 2025
const int token_idx = block_idx * BLOCK_SIZE + i;
if (token_idx >= seq_len) break;

// TODO: Vectorized K loading and Q·K computation
Collaborator

Some of the TODOs look quite sus; I'm wondering if the code is AI-generated and/or whether this function actually works.

Besides, you should probably give some credit to the original kernel: https://github.com/vllm-project/vllm/blob/main/csrc/attention/attention_kernels.cuh

Collaborator (Author)

I mark it experimental for good reason 🙂

@ngxson (Collaborator) commented Nov 28, 2025

I think it's important to explicitly state whether or not you're using AI to generate this PR. The numerous TODOs throughout the PR do make it look sus. There will be a human who spends real time and effort reviewing this PR, after all.

Contributor

> I mark it experimental for good reason 🙂

I think this PR should be marked as a draft until it is no longer experimental.

Collaborator

IMO this shouldn't be turned into a PR until it's reasonably complete. I subscribed because I'm interested in trying it if it works, but I've gotten 10 notifications today and it doesn't even pass CI, so why is it being pushed at all?

@ngxson (Collaborator) commented Dec 1, 2025

Besides, I don't think adding paged attention makes any difference in llama.cpp beyond having an additional feature with a cool-looking name.

Since your AI is not capable enough to explain this to you, @ericcurtin, I will:

> This feature reduces memory fragmentation by storing KV cache in fixed-size blocks (similar to virtual memory paging)

No, it does not. The order of blocks can also be fragmented. This notion is explained by this sentence in the documentation: "the KV cache does not need to be stored in contiguous memory". That crucial detail is left out, which makes the description technically wrong.

And we can definitely implement the notion of a "block" with the existing llama-kv-cache infrastructure. We just need to align the placement of the KV vectors to a fixed size and voilà, you get "fixed-size blocks".
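
For illustration, the only thing the "paged" indirection buys you is a lookup like this (a rough sketch; the names and block size are illustrative, not llama.cpp code):

```cpp
#include <vector>

constexpr int BLOCK_SIZE = 16; // tokens per block (illustrative value)

// block_table maps a logical block index to a physical block index; the
// physical blocks can sit anywhere in the KV buffer, in any order.
int physical_slot(const std::vector<int> & block_table, int token_pos) {
    const int logical_block = token_pos / BLOCK_SIZE;
    const int offset        = token_pos % BLOCK_SIZE;
    return block_table[logical_block] * BLOCK_SIZE + offset;
}
```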

> enables efficient memory sharing between sequences through copy-on-write semantics.

llama.cpp does have copy-on-write, just not automatic. llama_memory_seq_cp is there to allow sharing memory among multiple sequences. Of course, we can implement the automatic mechanism, just not right now.
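
For example, a rough sketch of that manual mechanism (share_prompt_cache and n_parallel are made-up names for the example; check llama.h for the exact declarations):

```cpp
#include "llama.h"

// After decoding a shared prompt into sequence 0, let sequences
// 1..n_parallel-1 reuse its KV cells instead of recomputing the prompt.
// Passing -1 for p0/p1 means the whole position range.
static void share_prompt_cache(llama_context * ctx, int n_parallel) {
    llama_memory_t mem = llama_get_memory(ctx);
    for (llama_seq_id dst = 1; dst < n_parallel; ++dst) {
        llama_memory_seq_cp(mem, /*seq_id_src=*/0, /*seq_id_dst=*/dst, /*p0=*/-1, /*p1=*/-1);
    }
}
```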

@ericcurtin force-pushed the add-pagedattention branch 3 times, most recently from 31d8188 to 08abefa on November 30, 2025 at 15:51
@ericcurtin marked this pull request as draft on November 30, 2025 at 16:19
@github-actions bot added the model (Model specific) label Nov 30, 2025
@ericcurtin force-pushed the add-pagedattention branch 8 times, most recently from 19466fb to efeba44 on December 1, 2025 at 11:05
Implement PagedAttention algorithm for memory-efficient KV cache
management. This feature reduces memory fragmentation by storing KV cache
in fixed-size blocks (similar to virtual memory paging) and enables
efficient memory sharing between sequences through copy-on-write semantics.

The implementation is experimental and disabled by default. Enable with
the --pagedattention flag.

Signed-off-by: Eric Curtin <eric.curtin@docker.com>