
Conversation

@ngxson (Collaborator) commented Dec 5, 2025

Fix #12968

I'm testing with:
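(The exact command isn't captured above. For illustration only, a llama-server invocation with speculative decoding typically looks roughly like the following; the model paths here are hypothetical placeholders, not the ones actually used.)

```sh
# Illustrative example, not the command used in this PR:
# target model via -m, draft model via -md
./llama-server -m models/target-model.gguf -md models/draft-model.gguf --port 8080
```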

So far the results are coherent.

How it works:

[image: diagram of how it works]
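The diagram itself isn't reproduced here. As a rough, self-contained sketch of the general speculative-decoding flow (the draft model proposes a short run of tokens, the target model verifies them, and the longest agreeing prefix is kept plus one corrected token) — using toy stand-in "models", not the actual llama.cpp implementation or API:

```cpp
// Toy simulation of the speculative-decoding flow (NOT the llama.cpp code):
// a cheap "draft" generator proposes a run of tokens, the "target" generator
// verifies them, and only the agreeing prefix is accepted.
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the expensive target model: next token is a
// deterministic function of the last token.
static int target_next(const std::vector<int> & ctx) {
    return (ctx.back() * 7 + 3) % 100;
}

// Hypothetical stand-in for the cheap draft model: mostly agrees with the
// target, but diverges occasionally to exercise the rejection path.
static int draft_next(const std::vector<int> & ctx) {
    int t = target_next(ctx);
    return (ctx.size() % 5 == 0) ? (t + 1) % 100 : t;
}

int main() {
    std::vector<int> ctx = {42};   // "prompt"
    const int n_draft   = 4;       // tokens proposed per draft batch
    const int n_predict = 24;      // total tokens to generate
    int n_accepted = 0, n_drafted = 0;

    while ((int) ctx.size() - 1 < n_predict) {
        // 1) draft model proposes a short continuation
        std::vector<int> draft;
        std::vector<int> tmp = ctx;
        for (int i = 0; i < n_draft; i++) {
            int t = draft_next(tmp);
            draft.push_back(t);
            tmp.push_back(t);
        }
        n_drafted += (int) draft.size();

        // 2) target model verifies the proposal token by token
        //    (in the real scheme this verification is one batched decode)
        for (int t : draft) {
            int expected = target_next(ctx);
            if (t == expected) {
                ctx.push_back(t);         // draft token accepted
                n_accepted++;
            } else {
                ctx.push_back(expected);  // first mismatch: keep the target's
                break;                    // token, discard the rest of the draft
            }
            if ((int) ctx.size() - 1 >= n_predict) break;
        }
    }

    printf("generated %zu tokens, draft acceptance: %d/%d\n",
           ctx.size() - 1, n_accepted, n_drafted);
    return 0;
}
```

The point of the scheme is that the expensive target model checks an entire drafted run in a single batched decode, so end-to-end throughput improves whenever the draft acceptance rate is high.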

ngxson marked this pull request as ready for review on December 6, 2025 at 14:53
ngxson requested a review from ggerganov as a code owner on December 6, 2025 at 14:53
@ngxson (Collaborator, Author) commented Dec 6, 2025

server tests passed locally, this should be ready for review @ggerganov

@theo77186 (Contributor) commented

Just a cosmetic bug: in the llama-server logs, the eval time is 0.00ms, thus the total time only accounts for the prompt processing time. It also causes the eval tokens per second to be meaningless. The model outputs seem to be correct, though.

@ngxson (Collaborator, Author) commented Dec 8, 2025

@theo77186 you mean just the stdout/stderr log, right? (which is not the stats returned by the API)

Edit: I think I need more details on the bug, as well as a step-by-step reproduction. Feel free to open a dedicated issue.

@ggerganov (Member) commented

> @theo77186 you mean just the stdout/stderr log, right? (which is not the stats returned by the API)

Both of them (logs and UI) were broken in cases where the draft batch was always accepted (e.g. "count from 1 to 100"). I fixed it with f74d1ee.
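For context on the symptom reported above: rates like "eval tokens per second" are derived from a token count divided by a measured duration, so if the eval time ends up recorded as 0.00 ms (presumably because, with every token coming from an accepted draft batch, the usual eval-timing path is never exercised), the derived rate is meaningless and the total time reflects only prompt processing. A minimal illustration of that arithmetic with a guard — not the actual change in f74d1ee:

```cpp
#include <cstdio>

// Illustrative only: how a rate such as "eval tokens per second" is typically
// derived, and why it degenerates when the measured duration was never updated.
static double tokens_per_second(int n_tokens, double t_ms) {
    if (t_ms <= 0.0) {
        return 0.0; // guard: no meaningful rate without a measured duration
    }
    return 1e3 * n_tokens / t_ms;
}

int main() {
    printf("%.2f t/s\n", tokens_per_second(100, 1250.0)); // 80.00 t/s
    printf("%.2f t/s\n", tokens_per_second(100, 0.0));    // guarded: 0.00 t/s
    return 0;
}
```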

ngxson merged commit f896d2c into ggml-org:master on Dec 8, 2025
68 of 69 checks passed


Development

Successfully merging this pull request may close these issues.

Misc. bug: llama-server speculative decoding not as performant as llama-speculative-simple
