llama: avoid copying logits during prompt decode in MTP #23198

Merged: am17an merged 3 commits into ggml-org:master from am17an:mtp-pp-fix on May 17, 2026

am17an (Contributor) commented May 17, 2026

Overview

Avoid copying the logits for every token in the batch during prompt processing with MTP, since MTP only requires the pre-norm hidden state. This reduces memory traffic considerably, which in turn increases prompt-processing (PP) speed with MTP.
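
The idea described above can be sketched as follows. This is a minimal illustration and not the code from this PR: `HostOutputs`, `collect_outputs`, and the `want_logits` mask are hypothetical names, and plain `std::vector` buffers stand in for device tensors. It assumes the pre-norm hidden state is copied for every token while a logits row is copied only where one was requested.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types and names for illustration only; these are not llama.cpp APIs.
struct HostOutputs {
    std::vector<float> hidden;      // pre-norm hidden state, one row per token
    std::vector<float> logits;      // logits, only for the rows that were requested
    std::size_t floats_copied = 0;  // rough proxy for memory traffic
};

// Copy the outputs of one prompt batch into host buffers.
// dev_hidden:  n_tokens * n_embd floats
// dev_logits:  n_tokens * n_vocab floats
// want_logits: want_logits[i] is true if token i needs its logits row
HostOutputs collect_outputs(const std::vector<float>& dev_hidden,
                            const std::vector<float>& dev_logits,
                            const std::vector<bool>&  want_logits,
                            std::size_t n_embd,
                            std::size_t n_vocab) {
    const std::size_t n_tokens = want_logits.size();
    assert(dev_hidden.size() == n_tokens * n_embd);
    assert(dev_logits.size() == n_tokens * n_vocab);

    HostOutputs out;

    // The hidden state is copied for every token in the batch.
    out.hidden = dev_hidden;
    out.floats_copied += dev_hidden.size();

    // Logits rows are copied only where requested, not for the whole batch.
    for (std::size_t i = 0; i < n_tokens; ++i) {
        if (!want_logits[i]) {
            continue;
        }
        const auto first = dev_logits.begin() + static_cast<std::ptrdiff_t>(i * n_vocab);
        out.logits.insert(out.logits.end(), first, first + static_cast<std::ptrdiff_t>(n_vocab));
        out.floats_copied += n_vocab;
    }
    return out;
}
```

Because a vocabulary is typically far larger than the embedding width, the logits rows dominate the copy volume in a sketch like this, so skipping them for most prompt tokens removes most of the traffic.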

@am17an am17an requested review from a team, CISC and ggerganov as code owners May 17, 2026 10:22
Comment thread src/llama-context.cpp Outdated
Comment thread src/models/qwen35moe.cpp
ggerganov (Member) left a comment

A quick bench on RTX 5090 with Qwen3.6 27B Q4_K

Image

@am17an am17an merged commit 3e12fbd into ggml-org:master May 17, 2026
75 of 81 checks passed
@am17an am17an deleted the mtp-pp-fix branch May 17, 2026 15:30
d-r-e commented May 17, 2026

> A quick bench on RTX 5090 with Qwen3.6 27B Q4_K
>
> Image

Are the legend colors swapped?

pwilkin (Member) commented May 17, 2026

@d-r-e no, MTP does negatively impact prompt processing, but under this PR the negative impact is halved.

cb88 commented May 17, 2026

2x MI50 with Qwen 27B Q4_1 does see some improvement with this PR:
MI50 without MTP = 500 t/s
with MTP = 250 t/s
with MTP and this PR = 300 t/s
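
For reference, the arithmetic behind these three figures can be sketched as below. The helper names `slowdown` and `speedup` are hypothetical and only illustrate the calculation. On this setup the numbers work out to a 50% PP penalty with MTP before the PR, about 40% with it, and roughly a 1.2x speedup of the MTP path.

```cpp
#include <cassert>

// Fraction of baseline throughput that is lost, e.g. 0.5 means half the speed is gone.
double slowdown(double baseline_tps, double measured_tps) {
    assert(baseline_tps > 0.0);
    return 1.0 - measured_tps / baseline_tps;
}

// Ratio of the new throughput to the old one, e.g. 1.2 means 20% faster.
double speedup(double before_tps, double after_tps) {
    assert(before_tps > 0.0);
    return after_tps / before_tps;
}
```

With the values reported above, `slowdown(500, 250)` is the penalty before the PR, `slowdown(500, 300)` the penalty with it, and `speedup(250, 300)` the gain on the MTP path.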

@tha80 tha80 mentioned this pull request May 17, 2026
11 tasks
0cc4m (Contributor) commented May 17, 2026

> @d-r-e no, MTP does negatively impact prompt processing, but under this PR the negative impact is halved.

Why does it affect prompt processing?
