llama: avoid copying logits during prompt decode in MTP by am17an · Pull Request #23198 · ggml-org/llama.cpp

am17an · 2026-05-17T10:22:31Z

Overview

Avoid copying the logits for every token in the batch during prompt processing (PP) with MTP, since MTP only requires the pre-norm hidden state. This reduces memory traffic and, in turn, increases PP speed with MTP.
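
As a rough back-of-the-envelope illustration of why skipping the per-token logits copy can matter, the sketch below compares the bytes moved when copying one float32 logits row per token against one float32 pre-norm row per token. The struct `BatchShape`, the helper names, and every size used with them are hypothetical and are not taken from this PR or from llama.cpp; the float32 output type is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical shape of one prompt-processing batch (illustrative only,
// not a llama.cpp type and not values from the PR).
struct BatchShape {
    std::size_t n_tokens; // tokens in the batch
    std::size_t n_vocab;  // vocabulary size: width of one logits row
    std::size_t n_embd;   // hidden size: width of one pre-norm row
};

// Bytes needed to copy `n_rows` float32 rows of `width` values each.
constexpr std::uint64_t copy_bytes(std::size_t n_rows, std::size_t width) {
    return static_cast<std::uint64_t>(n_rows) * width * sizeof(float);
}

// Bytes moved if logits are copied out for every token in the batch.
constexpr std::uint64_t logits_bytes(const BatchShape & s) {
    return copy_bytes(s.n_tokens, s.n_vocab);
}

// Bytes moved if only the pre-norm hidden state is copied out for every token.
constexpr std::uint64_t pre_norm_bytes(const BatchShape & s) {
    return copy_bytes(s.n_tokens, s.n_embd);
}

// How many times more data the logits copy moves than the pre-norm copy.
inline double traffic_ratio(const BatchShape & s) {
    assert(s.n_tokens > 0 && s.n_embd > 0);
    return static_cast<double>(logits_bytes(s)) /
           static_cast<double>(pre_norm_bytes(s));
}
```

For a hypothetical `BatchShape{2048, 150000, 5000}`, the logits copy would be about 1.2 GB per batch against about 41 MB for the pre-norm rows. The real saving depends on the actual model dimensions and on how the outputs are stored.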

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, for debugging and reviewing

ggerganov

A quick bench on RTX 5090 with Qwen3.6 27B Q4_K

d-r-e · 2026-05-17T16:04:56Z

> A quick bench on RTX 5090 with Qwen3.6 27B Q4_K

Are the legend colors swapped?

pwilkin · 2026-05-17T16:06:30Z

@d-r-e no, MTP does negatively impact prompt processing, but under this PR the negative impact is halved.

cb88 · 2026-05-17T17:03:01Z

2x MI50 with Qwen 27B Q4_1 does see some improvement with this PR:
MI50 without MTP = 500 t/s
with MTP = 250 t/s
with MTP and this PR = 300 t/s
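
Reading the three figures above as prompt-processing throughput in tokens per second, the relative cost of MTP before and after this PR can be worked out with a small sketch. The helper names `slowdown` and `speedup` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Fraction of baseline throughput that is lost (0.5 means half the speed is gone).
inline double slowdown(double baseline_tps, double measured_tps) {
    assert(baseline_tps > 0.0);
    return (baseline_tps - measured_tps) / baseline_tps;
}

// Relative throughput gain of `after_tps` over `before_tps`.
inline double speedup(double before_tps, double after_tps) {
    assert(before_tps > 0.0);
    return (after_tps - before_tps) / before_tps;
}
```

With these numbers, `slowdown(500, 250)` is 0.5 and `slowdown(500, 300)` is 0.4, so on this setup the MTP penalty drops from 50% to 40% of baseline, and `speedup(250, 300)` is 0.2, a 20% gain over MTP without the PR.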

0cc4m · 2026-05-17T18:27:48Z

> @d-r-e no, MTP does negatively impact prompt processing, but under this PR the negative impact is halved.

Why does it affect prompt processing?

llama: avoid copying logits during prompt decode in MTP

0abcf8f

am17an requested review from a team, CISC and ggerganov as code owners May 17, 2026 10:22

ggerganov reviewed May 17, 2026

Comment thread src/llama-context.cpp Outdated

review: update comment

e964f98

ggerganov reviewed May 17, 2026

Comment thread src/models/qwen35moe.cpp

llama-graph: call set_output for t_h_pre_norm

70a7d0e

CISC approved these changes May 17, 2026

github-actions bot added the model (Model specific), examples, and server labels May 17, 2026

ggerganov approved these changes May 17, 2026

am17an merged commit 3e12fbd into ggml-org:master May 17, 2026
75 of 81 checks passed

am17an deleted the mtp-pp-fix branch May 17, 2026 15:30

tha80 mentioned this pull request May 17, 2026

llama + spec: MTP Support #22673 (merged)

am17an merged 3 commits into ggml-org:master from am17an:mtp-pp-fix

7 participants
