
Conversation

@ngxson (Collaborator) commented Dec 5, 2025

Fix #12968

I'm testing with:
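(The exact command isn't captured above. For illustration only, a llama-server invocation with speculative decoding typically looks roughly like the following; the model paths here are hypothetical placeholders, not the ones actually used.)

```sh
# Illustrative example, not the command used in this PR:
# target model via -m, draft model via -md
./llama-server -m models/target-model.gguf -md models/draft-model.gguf --port 8080
```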

So far the results are coherent.

How it works:

[image: diagram of how it works]
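The diagram itself isn't reproduced here. As a rough, self-contained sketch of the general speculative-decoding flow (the draft model proposes a short run of tokens, the target model verifies them, and the longest agreeing prefix is kept plus one corrected token) — using toy stand-in "models", not the actual llama.cpp implementation or API:

```cpp
// Toy simulation of the speculative-decoding flow (NOT the llama.cpp code):
// a cheap "draft" generator proposes a run of tokens, the "target" generator
// verifies them, and only the agreeing prefix is accepted.
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the expensive target model: next token is a
// deterministic function of the last token.
static int target_next(const std::vector<int> & ctx) {
    return (ctx.back() * 7 + 3) % 100;
}

// Hypothetical stand-in for the cheap draft model: mostly agrees with the
// target, but diverges occasionally to exercise the rejection path.
static int draft_next(const std::vector<int> & ctx) {
    int t = target_next(ctx);
    return (ctx.size() % 5 == 0) ? (t + 1) % 100 : t;
}

int main() {
    std::vector<int> ctx = {42};   // "prompt"
    const int n_draft   = 4;       // tokens proposed per draft batch
    const int n_predict = 24;      // total tokens to generate
    int n_accepted = 0, n_drafted = 0;

    while ((int) ctx.size() - 1 < n_predict) {
        // 1) draft model proposes a short continuation
        std::vector<int> draft;
        std::vector<int> tmp = ctx;
        for (int i = 0; i < n_draft; i++) {
            int t = draft_next(tmp);
            draft.push_back(t);
            tmp.push_back(t);
        }
        n_drafted += (int) draft.size();

        // 2) target model verifies the proposal token by token
        //    (in the real scheme this verification is one batched decode)
        for (int t : draft) {
            int expected = target_next(ctx);
            if (t == expected) {
                ctx.push_back(t);         // draft token accepted
                n_accepted++;
            } else {
                ctx.push_back(expected);  // first mismatch: keep the target's
                break;                    // token, discard the rest of the draft
            }
            if ((int) ctx.size() - 1 >= n_predict) break;
        }
    }

    printf("generated %zu tokens, draft acceptance: %d/%d\n",
           ctx.size() - 1, n_accepted, n_drafted);
    return 0;
}
```

The point of the scheme is that the expensive target model checks an entire drafted run in a single batched decode, so end-to-end throughput improves whenever the draft acceptance rate is high.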

ngxson marked this pull request as ready for review on December 6, 2025 at 14:53
ngxson requested a review from ggerganov as a code owner on December 6, 2025 at 14:53
@ngxson (Collaborator, Author) commented Dec 6, 2025

server tests passed locally, this should be ready for review @ggerganov

@theo77186 (Contributor) commented

Just a cosmetic bug: in the llama-server logs, the eval time is 0.00ms, thus the total time only accounts for the prompt processing time. It also causes the eval tokens per second to be meaningless. The model outputs seem to be correct, though.

@ngxson (Collaborator, Author) commented Dec 8, 2025

@theo77186 you mean just the stdout/stderr log, right? (which is not the stats returned by the API)

Edit: I think I need more details on the bug, as well as a step-by-step reproduction. Feel free to open a dedicated issue.

@ggerganov (Member) commented

> @theo77186 you mean just the stdout/stderr log, right? (which is not the stats returned by the API)

Both of them (logs and UI) were broken in cases where the draft batch was always accepted (e.g. "count from 1 to 100"). I fixed it with f74d1ee.
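For context on the symptom reported above: rates like "eval tokens per second" are derived from a token count divided by a measured duration, so if the eval time ends up recorded as 0.00 ms (presumably because, with every token coming from an accepted draft batch, the usual eval-timing path is never exercised), the derived rate is meaningless and the total time reflects only prompt processing. A minimal illustration of that arithmetic with a guard — not the actual change in f74d1ee:

```cpp
#include <cstdio>

// Illustrative only: how a rate such as "eval tokens per second" is typically
// derived, and why it degenerates when the measured duration was never updated.
static double tokens_per_second(int n_tokens, double t_ms) {
    if (t_ms <= 0.0) {
        return 0.0; // guard: no meaningful rate without a measured duration
    }
    return 1e3 * n_tokens / t_ms;
}

int main() {
    printf("%.2f t/s\n", tokens_per_second(100, 1250.0)); // 80.00 t/s
    printf("%.2f t/s\n", tokens_per_second(100, 0.0));    // guarded: 0.00 t/s
    return 0;
}
```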

ngxson merged commit f896d2c into ggml-org:master on Dec 8, 2025
68 of 69 checks passed


Development

Successfully merging this pull request may close these issues.

Misc. bug: llama-server speculative decoding not as performant as llama-speculative-simple
